Figuring out a statistically reliable baseline for out of order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM


dedndave

that is the main use for timing code, after all
some of us also use it a slightly different way
we want to make a quick change in code to see which sequence is better
for example...
        push    edx
        cmp     eax,ecx
        jnz     SomeLabel

        cmp     eax,ecx
        push    edx
        jnz     SomeLabel

it may not be readily obvious that this is quite different from benchmarking an algorithm

nixeagle

Alright, I need to go to bed. I wanted to post statistical info from Mathematica using the output from dedndave's 3rd program. But MMA is taking forever to do the computations and I'm tired. In the morning I'll post those results.

Aside from that, I'd love any guesses to the questions I've posed in the original post. The major one at the moment is figuring out how to force the CPU into its highest throttle mode before we take a sample point. Doing so should reduce the deviation from the mean (average).

Finally I'm working on a program that does timing, but keeps a count of how many times we encounter each value. From that I can compute additional information about the reliability of the data.

P.S. dedndave, I feel your program ought to omit values greater than 4 billion from the output. Those tend to result from the counter wrapping and thus are not really sample points.

MichaelW

I fail to see how a statistical analysis of the run to run variations (AKA noise) serves any useful purpose here.
eschew obfuscation

jj2007

Quote from: nixeagle on May 01, 2012, 06:32:20 AM
Finally I'm working on a program that does timing, but keeps a count of how many times we encounter each value. From that I can compute additional information about the reliability of the data.

If you take that job seriously, buy a Prescott CPU - it's a real pig :boohoo:

Jokes apart, you have done some very interesting things here. A few remarks:

> If we time the code by simply taking the average of all runs, we get 74.0773 ticks. Alternatively we can count only the fastest run, which is 42 ticks.
Common sense tells us the low number is the right one - nothing can be faster than the speed of light, and everything above suffers from some other activity of the CPU. However, my practical experience with the cyct_xx macros (attached, they build on MichaelW's stuff) says there are a few abnormally low values - and many high outliers. So what I do there to obtain stable values even on a Prescott P4 is to eliminate 5% or so at the low end and 20% or so at the high end - and there are indeed very few values that deviate from the new mean.
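
In plain C, that cutoff idea might look roughly like the sketch below (this is not the cyct_xx code itself; the function and parameter names are invented for illustration):

    #include <stdlib.h>

    static int cmp_u32(const void *a, const void *b)
    {
        unsigned int x = *(const unsigned int *)a, y = *(const unsigned int *)b;
        return (x > y) - (x < y);
    }

    /* Sort the samples, drop the lowest loCut and highest hiCut fractions,
       and average what remains - e.g. loCut = 0.05, hiCut = 0.20. */
    double trimmed_mean(unsigned int *ticks, size_t n, double loCut, double hiCut)
    {
        qsort(ticks, n, sizeof *ticks, cmp_u32);
        size_t lo = (size_t)(n * loCut);
        size_t hi = n - (size_t)(n * hiCut);
        double sum = 0.0;
        for (size_t i = lo; i < hi; i++)
            sum += ticks[i];
        return sum / (double)(hi - lo);
    }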

> No jiggling of the mouse at all will result in a test that looks something like:
If jiggling has an effect, then your design is flawed. Algo testing should be done in Win32 console apps, so there are no mouse messages from the process itself, and it should be short enough to avoid task switches. Therefore the mouse should never influence the results.

By the way, the results for the Sorting a Real8 array thread were obtained with yet another mechanism: behind the NanoTimer() macro is QPC. Microsoft claims it is better than rdtsc; I am not sure about that, but the results are fairly consistent, and it seems convenient for slightly longer tasks, such as sorting 5 million array members.
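
For readers who have not met it, QPC is QueryPerformanceCounter. A bare-bones sketch of the idea (this is not the NanoTimer implementation, just the underlying Win32 calls):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);     /* counts per second */
        QueryPerformanceCounter(&t0);

        /* ... code under test, e.g. sorting the array ... */

        QueryPerformanceCounter(&t1);
        double ms = (double)(t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
        printf("elapsed: %.3f ms\n", ms);
        return 0;
    }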

Again, welcome to the Forum, what you are doing looks very promising :thumbu

hutch--

Dave,

There was a trick to get reliable timings on a Prescott core: go into the BIOS and turn OFF hyperthreading. This was a mod specific to Win2000, which did not properly support hyperthreading, but it certainly improved timing consistency. XP is supposed to support hyperthreading, but it may be worth a quick play to see if the timings improve.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

i get pretty stable timing as long as:
1) the timing code binds to a single core
2) PerfEnablePackageIdle is enabled (registry)
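
a minimal sketch of (1) in C - just the plain Win32 calls; the priority bump is optional and only my guess at what helps

    #include <windows.h>

    /* pin the current thread to core 0 so rdtsc always reads the same core's
       counter and the scheduler cannot migrate the timing loop */
    void bind_to_one_core(void)
    {
        SetThreadAffinityMask(GetCurrentThread(), 1);   /* bit 0 = first core */
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    }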

nixeagle

Oh wow! So many replies. I'll try to reply to them all here, shout out if I left you out!

First off, I've finally gotten my Mathematica program to handle the test input from dedndave's 3rd program. To say that the timing data is scattered about is an understatement. But this is to be expected, as this is only the initial startup period. Later I'll modify the program to take 100,000,000 data points. Long story short, the plot that follows is a neat illustration of why we need to properly handle "spinup"/"startup". [1]



Standard Deviation->918.774
Mean->255.389
GeometricMean->21.3827
HarmonicMean->8.23219
Median->12
Mode->{9}
MeanDeviation->404.787
Min->1
Max->528044
TrimmedMean->180.868


Taking a look at a few numbers, we can see a few things jump out.
  • First, check out that standard deviation. That is telling us our data points are everywhere. [2]
  • Second, take a look at the mode. The most common result is 9 ticks. This is far from our Mean, but is very close to our Harmonic Mean. Not sure how or what to make of this yet, but food for thought.
  • Third, our minimum is 1 tick. :eek I'm not sure if that even makes sense.
  • Finally, it is clear from the maximum value that context switches and external factors are playing a role here but the test does not (yet) account for those. Our maximum would have been even higher had I not filtered out instances where the RDTSC counter wrapped around by dropping any values over 4 billion.

As part of calculating an average value, I feel that confounding factors should be omitted to the best of our ability. For short tests, we can easily remove all context switches and RDTSC timer wrap-around problems. That still leaves us the problem of getting the CPU to spin up to the highest power state. [3]

To MichaelW: Maybe I have failed to communicate my intent. We don't want to analyze noise, as that would be fairly useless :wink. What we want to do is measure when our tests have excess noise in them. Noisy tests are less meaningful and harder to reliably repeat than clean tests.

To jj2007:
  • I absolutely agree that mouse jitter should not affect the results. Jittering the mouse seems to spin the CPU into a higher power state.
  • I'm not convinced taking the minimum in all cases is the best option; the minimum can easily be an outlier. At the very least, we ought to be able to identify how much of the time a given algorithm is able to run at its fastest possible speed. Things like cache misses and register read/store stalls matter. Maybe I'm wrong, I need to think on this a bit more. :red
To Hutch--: Hyperthreading is another issue. Since that means the two hardware "threads" alternate on the core, when would it be safe to assume an algorithm runs twice as fast on the same processor with hyperthreading disabled? I'll have to test this sometime this summer as well! :D

Finally, a little food for thought. Instead of printing every test value out, I'm thinking it might be better to keep a tally of how many times we see each value. Such a task can easily be done using a 1 KiB array with elements that look something like:

struct tally {
  unsigned int ticks;
  unsigned int count;
};

From that we can compute the mean, harmonic mean, min, max, standard deviation, mode, and median of the sample. That would allow the timing program to emit not only the averages, as is usually done, but also information on how good a measurement it really is. Thoughts? If you guys like the idea, I'll break down and hack up something in assembler :wink.
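
To make the idea concrete, here is a rough C sketch of the reporting side (names are invented; it assumes every recorded tick value is at least 1, and the median, which needs a walk over the cumulative counts, is left out):

    #include <math.h>
    #include <stdio.h>

    struct tally { unsigned int ticks; unsigned int count; };

    /* Compute summary statistics straight from the tally table; n is the
       number of distinct tick values recorded. */
    void report(const struct tally *t, size_t n)
    {
        double total = 0, sum = 0, sumsq = 0, harm = 0;
        unsigned int min = ~0u, max = 0, mode = 0, modeCount = 0;

        for (size_t i = 0; i < n; i++) {
            double x = t[i].ticks, c = t[i].count;
            total += c;
            sum   += x * c;
            sumsq += x * x * c;
            harm  += c / x;                     /* assumes ticks >= 1 */
            if (t[i].ticks < min) min = t[i].ticks;
            if (t[i].ticks > max) max = t[i].ticks;
            if (t[i].count > modeCount) { modeCount = t[i].count; mode = t[i].ticks; }
        }
        double mean  = sum / total;
        double sdev  = sqrt(sumsq / total - mean * mean);
        double hmean = total / harm;
        printf("mean %.3f  sdev %.3f  hmean %.3f  min %u  max %u  mode %u\n",
               mean, sdev, hmean, min, max, mode);
    }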



  • [1] This goes back to one of my big questions. How can we reliably ensure the CPU is in the highest power state possible?
  • [2] Which is to be expected, as this is the first 1,000,000 runs through the loop. There is going to be a delay before the CPU spins up and so on. All the standard deviation is doing here is providing a measure of that.
  • [3] I suspect this is what jj2007 is getting at as far as mouse jitter is concerned.

jj2007

Quote from: nixeagle on May 01, 2012, 07:35:50 PM
I'm not convinced taking the minimum in all cases is the best option; the minimum can easily be an outlier.

That is what I wrote earlier, actually. Let's take a practical example: timings for finding a substring at index 94 of a 100-byte string:
Loop 2
3       GetTickCount
384     Instr case-sensitive
372     Instr case-insensitive
426     CRT StrStr

Loop 1
3       GetTickCount
384     Instr case-sensitive
372     Instr case-insensitive
426     CRT StrStr

This is on a Celeron, and it is fairly stable. On a P4 it looks less stable, but still much better than "raw" averages. What we do here is build a 1k table, as you suggested, and see how often each value appears. Then the extremes are cut off on both ends, and the average is taken. There are flags to show the process (see cyctShowSlow in the attached source), here for GetTickCount:
Loop 2

0:-4 1:-4 2:-4 3:-4 4:3 5:3 6:3 7:3 8:3 9:3 10:3

...

999:14
GetTickCount

average with cutoff     3       cutoff at n=992 for >4 cycles
average w/o cutoff      3
difference raw-cutoff:  0
Loop 1

0:-267 1:-7 2:-5 3:-5 4:-4 5:3 6:3 7:3 8:3 9:3 10:3

...

999:85
GetTickCount

average with cutoff     3       cutoff at n=993 for >4 cycles
average w/o cutoff      2
difference raw-cutoff:  -1


As you can see, there are a few hefty outliers, such as negative cycle counts (0:-287 ... 10:xx - and this is a one core processor). But 900+ of 1000 values are exactly at 3 cycles for GetTickCount...

This is crude maths, but it works. If you get greater deviations, it is usually because your code is hanging around for too long, allowing the OS to interfere with it. If mouse interrupts or messages affect the timings, then something is wrong in the design. In most cases, with this cutoff table you can arrive at exactly 370 cycles, not 369 and not 371. Except on the P4, of course :bg

nixeagle

jj2007, your negative cycle count is due to your RDTSC timer wrapping. Since everyone only works with the lower 32 bits of that 64 bit counter, every now and then you wind up with negative numbers. I was using unsigned values and spoke of numbers greater than 4,000,000,000. Either way we are talking about the same thing. Tests should either
  • Drop, that is do not even record, outliers greater than 4,000,000,000.
  • Use all 64 bits of the timestamp counter, removing the issue completely (see the sketch below).
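
A minimal sketch of the second option using the MSVC __rdtsc intrinsic (the wrapper function is just for illustration):

    #include <intrin.h>   /* MSVC: __rdtsc() returns the full 64-bit counter */

    unsigned long long time_one_run(void (*candidate)(void))
    {
        unsigned long long t0 = __rdtsc();
        candidate();
        unsigned long long t1 = __rdtsc();
        return t1 - t0;    /* 64-bit difference: no 32-bit wrap-around to filter */
    }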

Either way, work should be done to remove as much confounding as possible. The same applies for operating system context switches. There is no real reason to record them other than maybe to note as a "stat" how many context switches occurred during the test.

On more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have. Again trying to remove as many environmental factors as possible from the test environment to ensure higher quality data. We can measure the resulting data quality by using some of the methods I described in my original post.

Long story short, if you have a set of samples with a small standard deviation, you can be very confident that the test is repeatable. Thus I'm really starting to think that having the tests tell you not just the average number of ticks per sample, but also some extra statistical information would be really helpful. Then we can tell how significant the test data really is.

jj2007

Quote from: nixeagle on May 01, 2012, 10:15:53 PM
On more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have.

Why don't you just use a loop?

        mov ecx, 777777777   ; half a second
@@:
        dec ecx
        jns @B

nixeagle

Quote from: jj2007 on May 01, 2012, 11:02:50 PM
Quote from: nixeagle on May 01, 2012, 10:15:53 PM
On more modern processors, a way to ensure that the CPU is fully spun up before initiating testing would be nice to have.

Why don't you just use a loop?

        mov ecx, 777777777   ; half a second
@@:
        dec ecx
        jns @B


Tried that, and managed to get fairly unpredictable results. [1] Maybe for the Core i7 (Sandy Bridge) you have to do something more than just dec ecx; jns @B. If I recall correctly, dec engages one of p0, p1, p5 [2] and the jump engages p5 only. Maybe in order to cause a spinup I need to insert some more useless operations that are not NOP. :eek
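
Something along these lines is what I have in mind by "more useless operations" - just a sketch, and whether it actually forces the top power state is exactly the open question:

    #include <windows.h>

    /* Keep a few independent integer chains busy for ~300 ms before the first
       sample; volatile stops the compiler from optimising the loop away. */
    void warm_up(void)
    {
        volatile unsigned int a = 1, b = 2, c = 3;
        DWORD start = GetTickCount();
        while (GetTickCount() - start < 300) {
            a = a * 2654435761u + 1;
            b = b ^ (b << 5);
            c = c + a + b;
        }
    }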

As a side note I've played a bit with recording each time run, tallying the number of times a run takes a given number of ticks. For the program corresponding to Plot 3 in the original post, I have the following "dump". Still need to write the mathy stuff for computing standard deviation and whatnot, but I gotta watch a ballgame with friends first.


154:  299191
165:  514
462:  14
176:  132
187:  55
198:  16
330:  4
220:  4
484:  3
319:  2
308:  2
209:  5
363:  3
253:  2
242:  3
352:  3
297:  2
231:  2




  • [1] By unpredictable I mean results with a very high standard deviation, to the point that they are statistically insignificant. Plus I know I was not in the highest throttle state, as mouse jitter was easily able to make it speed up.
  • [2] Shorthand for execution port 0, execution port 1, and execution port 5. A quick (incomplete) overview: the i7 has 6 execution ports, with ports 0, 1, and 5 capable of doing integer operations. Port 4 is used for memory writes and ports 2 and 3 are used for memory reads.

dedndave

i was giving some thought to the "mouse jiggle" thing
it seems to me that a specially designed message loop would take care of that
maybe one with a semaphore that yields to the timing code

when no messages are in the queue - block the message loop and make the next timing run
when the timing run has completed - unblock the message loop and wait for a "no messages in queue" state to do the next run
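
in plain C that might look roughly like this (do_one_timing_run is just a placeholder for the actual timing pass)

    #include <windows.h>

    extern void do_one_timing_run(void);   /* placeholder for the timing pass */

    /* only start a timing run when the queue is empty; drain any messages
       (mouse moves etc.) that arrived before the next run */
    void pump_and_time(int runs)
    {
        MSG msg;
        while (runs > 0) {
            if (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            } else {
                do_one_timing_run();
                runs--;
            }
        }
    }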

MichaelW

Quote from: nixeagle on May 01, 2012, 10:15:53 PM
Since everyone only works with the lower 32 bits of that 64 bit counter, every now and then you wind up with negative numbers.

Not everyone. Why put so much effort and complexity into this without first taking care of the simple stuff? And at least in my code, which uses the full 64-bit count, the most common reason for a negative cycle count is that something caused the reference loop to take more cycles than the test loop.

Also, for a desktop system the only reason I can see for a core that is executing a loop to run at less than full speed, and assuming that the OS is not controlling the speed, is temperature management. And if this is happening under anything resembling normal conditions, then IMO the maximum clock speed rating is just marketing BS, not to be taken seriously.
eschew obfuscation

Neo

Quote from: dedndave on May 02, 2012, 12:46:57 AM
i was giving some thought to the "mouse jiggle" thing
it seems to me that a specially designed message loop would take care of that
maybe one with a semaphore that yields to the timing code

when no messages are in the queue - block the message loop and make the next timing run
when the timing run has completed - unblock the message loop and wait for a "no messages in queue" state to do the next run
He probably doesn't even have a message loop, so introducing one would probably cause worse problems. The operating system's interrupt handler kicking in would be enough to mess up the timings not just of the interrupted loop iteration (probably way off the top of the chart), but also of the next iteration, as the code cache (etc.) is updated.

I made a similar sort of time analysis/plotting tool a while back also designed to identify the statistically relevant samples, albeit primarily for scaling as opposed to constant time operations: http://www.masm32.com/board/index.php?topic=14058.msg111612#msg111612

(wow, it's been forever since I've posted here...)

dedndave

hey Neo - good to see you
you should come by more often   :bg

well - they had mentioned that timing code "must" be done in a console app because of mouse movement
and i think it is quite possible to do it in a GUI app - if you take the proper precautions

i think it might be best to run things in a GUI app
at the end of the day, that's where the code is likely to be used - closer to the real world