Figuring out a statistically reliable baseline for out of order processors.

dedndave · May 03, 2012, 05:27:30 AM

ok - i'll wait
no hurry - but i've got a boring movie and a warm wifee - lol
perfect recipe for sleepage :bg

nixeagle · May 03, 2012, 05:51:35 AM

Quote from: dedndave on May 03, 2012, 05:27:30 AM
ok - i'll wait
no hurry - but i've got a boring movie and a warm wifee - lol
perfect recipe for sleepage :bg

Alrighty, I think this is bug free ::).

It is setup to ensure that there is enough time for the cpu to spin up. So you know something is happening it'll print out when it adjusts the number of ticks required to output a number.

It'll look something like this:

Code Select


$ cat 4.dat
Now requiring only 10 entries.
Now requiring only 9 entries.
91
91
91
91
91
91
91
91
...

Thanks for trying :bg

Edit: Oh darned off by one mistake :red, when it says "Now requiring only 10 entries.", it really means to say "Now requiring only 11 entries." Nothing wrong with the actual code at least. :wink

dedndave · May 03, 2012, 06:07:02 AM

it still spins the hell out of my hard drive - i don't like that
i modified it to CSV instead of CR/LF - and i terminated it early

Code Select

Now requiring only 10 entries.
Now requiring only 9 entries.
Now requiring only 8 entries.
Now requiring only 7 entries.
Now requiring only 6 entries.
Now requiring only 5 entries.
Now requiring only 4 entries.
473, 472, 473, 472, 472, 473, 472, 473, 473, 473, 472, 473, 473, 532, 472, 473,
472, 472, 472, 473, 472, 472, 472, 472, 473, 472, 473, 473, 473, 472, 472, 472,
473, 472, 472, 472, 472, 472, 473, 473, 472, 473, 473, 472, 472, 473, 473, 472,
472, 472, 473, 472, 472, 472, 473, 473, 472, 473, 472, 473, 472, 517, 473, 473,
472, 473, 548, 472, 473, 472, 473, 473, 472, 473, 473, 473, 472, 473, 472, 472,
473, 473, 472, 473, 472, 473, 540, 472, 473, 473, 472, 472, 472, 472, 473, 473,
473, 473, 472, 472, 525, 472, 473, 472, 510, 472, 473, 472, 472, 472, 472, 472,
473, 472, 473, 472, 472, 472, 473, 473, 472, 472, 472, 473, 472, 473, 510, 472,
472, 473, 540, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 473, 473,
473, 472, 472, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 472, 472,
472, 472, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 473,
473, 472, 473, 472, 473, 472, 473, 472, 473, 473, 472, 525, 473, 472, 473, 473,
473, 472, 472, 472, 472, 473, 473, 473, 510, 473, 472, 473, 473, 472, 473, 473,
472, 472, 472, 473, 473, 472, 472, 495, 472, 472, 472, 472, 472, 473, 472, 510,
473, 473, 495, 495, 495, 473, 472, 473, 472, 495, 473, 472, 473, 472, 472, 473,
473, 472, 540, 510, 473, 473

nixeagle · May 03, 2012, 06:08:21 AM

Quote from: dedndave on May 03, 2012, 06:07:02 AM
it still spins the hell out of my hard drive - i don't like that
i modified it to CSV instead of CR/LF - and i terminated it early
Code Select Expand
Now requiring only 10 entries. Now requiring only 9 entries. Now requiring only 8 entries. Now requiring only 7 entries. Now requiring only 6 entries. Now requiring only 5 entries. Now requiring only 4 entries. 473, 472, 473, 472, 472, 473, 472, 473, 473, 473, 472, 473, 473, 532, 472, 473, 472, 472, 472, 473, 472, 472, 472, 472, 473, 472, 473, 473, 473, 472, 472, 472, 473, 472, 472, 472, 472, 472, 473, 473, 472, 473, 473, 472, 472, 473, 473, 472, 472, 472, 473, 472, 472, 472, 473, 473, 472, 473, 472, 473, 472, 517, 473, 473, 472, 473, 548, 472, 473, 472, 473, 473, 472, 473, 473, 473, 472, 473, 472, 472, 473, 473, 472, 473, 472, 473, 540, 472, 473, 473, 472, 472, 472, 472, 473, 473, 473, 473, 472, 472, 525, 472, 473, 472, 510, 472, 473, 472, 472, 472, 472, 472, 473, 472, 473, 472, 472, 472, 473, 473, 472, 472, 472, 473, 472, 473, 510, 472, 472, 473, 540, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 473, 473, 473, 472, 472, 473, 473, 473, 472, 472, 472, 472, 472, 473, 473, 473, 472, 472, 472, 472, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 472, 473, 473, 473, 473, 472, 473, 472, 473, 472, 473, 472, 473, 473, 472, 525, 473, 472, 473, 473, 473, 472, 472, 472, 472, 473, 473, 473, 510, 473, 472, 473, 473, 472, 473, 473, 472, 472, 472, 473, 473, 472, 472, 495, 472, 472, 472, 472, 472, 473, 472, 510, 473, 473, 495, 495, 495, 473, 472, 473, 472, 495, 473, 472, 473, 472, 472, 473, 473, 472, 540, 510, 473, 473

Hmm, why would it spin your harddrive? I don't even touch that!

Edit: After getting over that shock :P... Your outputs look much more consistant than what I remember you showing earlier on in this discussion. Give me a few minutes and I'll have MMA produce some pretty graphs!

Second Edit: Alright, here is the plot and stat output:

Standard Deviation -> 12.3756
Mean -> 475.687
GeometricMean -> 475.537
HarmonicMean -> 475.397
Median -> 473
Mode -> {472}
MeanDeviation -> 5.93198
Min -> 472
Max -> 548
Length -> 246
TrimmedMean -> 473.189
MeanDeviation -> 5.93198
MedianDeviation -> 1.
QuartileDeviation -> 0.5
InterquartileRange -> 1.

dedndave · May 03, 2012, 11:04:04 AM

well - i think it's the hard drive - maybe it's the CPU fan - lol
either way - it's worrysome
when i have a program that does that, i use Sleep to let the OS have some core time

nixeagle · May 03, 2012, 05:34:34 PM

Quote from: dedndave on May 03, 2012, 11:04:04 AM
well - i think it's the hard drive - maybe it's the CPU fan - lol
either way - it's worrysome
when i have a program that does that, i use Sleep to let the OS have some core time

Yea, sounds like the CPU fan to me. I mean the program is so small that there is no chance of swapping going on and the program does not write or read from the disk during runtime.

Given a night to sleep on the results, I'm starting to suspect that different cpu generations might be best served by their own set of timing code selected at runtime based on CPUID. It really looks to me like your CPU is safe to simply take the minimum for algorithm comparisons. That puts your baseline at 472 ticks, and I suspect no matter how many times you run the program you will always get 472 ticks. Any additional work (in the timeit function) will result in a higher minimum. Most importantly, your results should be consistent over program runs.

In summary: I believe for your processor generation, you should set gu_steps to 4 when running tests using this timing program. You can test my theory out if you like by

Doing the suggested change to gu_steps, running the program and watching it terminate. With gu_steps set to 4, the program ought to do its full run of 1000 samples quickly.¹ This will establish your baseline.²
Change the program again, adding some code to the testit function. I suggest adding mov eax, 100; imul eax, eax, 10. Run the test again and note the results.

Both sets would be really useful to me.

Now I suspect the reason I have been having so much trouble with this is due to the sandy bridge architecture. The i7 is able to "overclock" itself automatically for very short durations of time. Thus taking the absolute minimum in my case is not really the best idea as it is more often then not an outlier. So I have to resort to other methods to analyze results.

Later today I'll update with a new program that computes the statistics of a test run along with some other modifications. :dance:

1: Remember the prior program had a bug in output. When it said Now requiring only 4 entries, it really meant to say 5 entries.
2: Assuming this works out as I suspect it will for you, a "finished" program will do this automatically and in so doing only take 10 samples instead of 1000 for the baseline measurement.

nixeagle · May 04, 2012, 12:46:46 AM

For my core i7, this gives extremely stable values. There are still some "interesting" quirks, but we are getting there! The baseline case gives 107 ticks every time you run the program!

If you look at the source code, I've commented some options that you can try tweaking and checking the results. These are mostly directed for dedndave :bdg.

Also note that the output is now in csv format and it all comes at once at the end of the program. So for about 30 seconds to a minute the program will look like it is not doing anything! By default the program will output 100 values. Additionally please note that my program does not "pause" and prompt for keyboard input after running, just start your own shell and run it from there. I build and test my code from inside Emacs and have a single keyboard shortcut I use for invoking it.

Now that the output is stable, I played a bit with the test function and found that if you put a multiplication on a core i7 processor, it'll take much longer to reach stabilization. I'm still mulling ideas how to automatically detect when the CPU is stable and results are worth taking.

Please take some time to try this out and tell me your findings! I'm curious how other processors work. If someone has AMD, I'd love to see this run on that. Does it stabilize? Does it have any other issues? Once I have a stable baseline, I plan to step up to harder problems as I find this whole measuring thing very interesting as far as the mathematics of it goes! :dance:

P.S. Statistics output will be completed after the ballgame. :bg

MichaelW · May 04, 2012, 03:11:00 AM

Running on my P3 Windows 2000 system, with testit stripped down to just a RET, I get a uniform 117 cycles. If in the do_sample_run procedure I comment out the "jne retry" then I get an occasional higher value, usually 120, but through perhaps 20 trials I saw values as high as 155.

Running on my P4 Windows XP SP3 system I get a uniform 380 cycles, and with no retries, except for the first few runs where I did see some higher values, and once where the results looked like they came from an RNG, I get a uniform 380 cycles.

How do you intend to remove the overhead from the results?

nixeagle · May 04, 2012, 03:22:56 AM

Quote from: MichaelW on May 04, 2012, 03:11:00 AM
With the retries how do you propose to remove the overhead from the results?

You don't. This is meant to serve as a baseline. Both the baseline and the algorithm under test will have the same overhead :wink.

Awesome test results :U.

Yea, the first few runs tend to be goofy, if you look at the gu_warmup_samples option and set it to something other than 1, those should go away. I'd try like 20.
Where you got output looking like it came from a RNG, how did that happen? What modifications, if any did you make? Did you have retries disabled for the gibberish? You would do better to set gu_default_sample_batch_size to 1 instead of commenting out just the jne retry bit. Reason is otherwise you collect 16 (the default setting) results, just to store only the first one. Edit: re-reading, I think I misread what you said. I think you said the gibberish happened when you turned off retries. Could you confirm that I've understood you? :red

Thanks for testing! :bg

MichaelW · May 04, 2012, 03:47:34 AM

When I disabled the retries my results were somewhat non-uniform. Cycle counts on my P3 tend to be very repeatable, but what surprised me here was how repeatable the P4 counts were, without retries. Given my experiences with other P4s I was expecting a lot of scatter.

nixeagle · May 04, 2012, 04:03:43 AM

Does your P3 or P4 support XMM registers? Or do I need to use the floating point stack to be compatible with those? I ask because I am in the middle of implementing the statistics calculation stuff.

And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.

jj2007 · May 04, 2012, 05:51:17 AM

Quote from: nixeagle on May 04, 2012, 04:03:43 AM
And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.

There is a lot of choice, see e.g. Sorting a Real8 array, or check \masm32\help\masmlib.chm for "Sorting Functions".

nixeagle · May 04, 2012, 06:06:06 PM

Quote from: jj2007 on May 04, 2012, 05:51:17 AM
Quote from: nixeagle on May 04, 2012, 04:03:43 AM
And just thought of this, I'll need to either implement my own sort in assembler ::) or find one I can use.

There is a lot of choice, see e.g. Sorting a Real8 array, or check \masm32\help\masmlib.chm for "Sorting Functions".

Thanks! :U

In related news, follows is the output of the i7 on battery power. Much higher than when plugged into the wall, but stable!

Code Select


264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 272, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264, 264,
Min:   264
Max:   272
Mean:  264
Range: 8

nixeagle · May 04, 2012, 07:20:51 PM

For dedndave's CPU fan thingie, I've worked in a invoke Sleep,1000 call that gets run before every sample. The downside is sampling takes much longer, the upside is... maybe your CPU fan won't go nuts. ::)

The problem I see with doing Sleep for a long period of time is you essentially allow all the caches and such to get polluted with other stuff. Thus when the program gets control back, it has to spend some time doing excess retries to get everything back to a repeatable state. You can only get deterministic behavior if you are able to discard all the confounding elements.

Essentially I'm treating this like a scienctific experiment. I wish to identify and control all possible external sources not related to what is under test. What I cannot eliminate, I ensure will be present during both the baseline run and any algorithm tests afterwards.

Also adjusted are some alignment issues, the critical loops are now aligned on 16 byte boundaries.

The attached version will take 25 stable samples and exit emitting a min, max, mean, and range. It discards only one sample during startup. If you want to adjust this behavior, edit the options prefixed with gu_. The important ones are (IMHO) well commented. Again, emphasize that this version takes a while, as in 2 to 5 minutes for me.

dedndave · May 04, 2012, 08:16:52 PM

that's 1 second :eek
probably no need to go that far

if you can get it down to where the sleep period is at least as long as the test period, that works pretty well
if you let it run free, it consumes 50 % of CPU time
get it below ~25 %, and you'll be ok for short periods
if you want to get thousands of samples, which takes a while, try to get below ~17 %
that would be sleep period = 2 x test period

you can use the results of the last pass to determine the Sleep parameter :U
(provided you know the CPU frequency)

QueryPerformanceFrequency should work on most systems

News:

Figuring out a statistically reliable baseline for out of order processors.