Figuring out a statistically reliable baseline for out-of-order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM


nixeagle

First off, I'm new so please be gentle. This post will be long.

Getting a statistically reliable baseline is not easy, because we have multiple confounding inputs: operating system interrupts/context switches, cache misses, hyperthreading, CPU power state/throttling, and so on. In the face of this, we want a method of measuring how good our data is; just taking the average, or searching for the minimum number of ticks recorded, is not enough without characterizing the data quality. Would you consider the following data useful? (Context switches were omitted as they take 1000s of ticks.)



My answer is emphatically no! For those wondering, the parts of the plot around 100,000 and 250,000 match up to the portions of the test's runtime where I jiggled the mouse around. Yes, jiggling the mouse or attempting to do other work while a test is running can confound the results. Most likely this is obvious and/or already known to you guys. But this leads to the question of:
Quote: How can we identify when our timing results are screwed up?

If we time the code by simply taking the average of all runs, we get 74.0773 ticks. Alternatively, we can count only the fastest run, which is 42 ticks. Both numbers tell only a very small part of the picture, though. A program that simply outputs one or both hides the glaring problem in our test data from us. Running the test again can very easily give completely different results. No jiggling of the mouse at all will result in a test that looks something like:



Which, while not perfect, is a step in the right direction as far as repeatability and reliability go. We can still do better, and we will at the end of this post. Before we do so, let us ask:
Quote: How do we tell the difference between graph 1 and graph 2?

  • One option is to measure the standard deviation of the samples.

    • Our first plot has a standard deviation of 19.5089 ticks! This tells us that most of our data lies between 54.5684 ticks and 93.5862 ticks (mean ± one standard deviation). A very wide spread, and a useful indicator that something is fishy about our timing data!
    • Our second plot has a standard deviation of 4.91369 ticks. While nothing to call home about, at least we are on the right track here.
  • Another option that is useful alongside the standard deviation is to look at just the percentiles of the sample.1
    • The first plot is scattered all about: 42, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 64, 64, 64, 64, 64, 64, 64, 64, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 88, 110, 132, 132, 143, 154
    • The second plot has a much nicer set of percentiles: 66, 66, 66, 66, 66, 66, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88

    Perhaps later, if you guys like what I've done so far, I'll cover more on analyzing the quality of timing results. I hope it is clear why the second is preferred to the first. Assuming it is, onwards to the final2 test!

With the following timing code, placed inside of a loop:
00000011 <timeit()>:
  11:   0f 31                   rdtsc
  13:   0f 31                   rdtsc
  15:   0f 31                   rdtsc
  17:   0f ae f0                mfence
  1a:   0f 31                   rdtsc
  1c:   89 c1                   mov    ecx,eax
  1e:   0f ae f0                mfence
  21:   0f 31                   rdtsc
  23:   29 c8                   sub    eax,ecx
  25:   c3                      ret


we manage to get stable samples. Each run through the loop takes 154 ticks. Slow, but consistently slow! First a plot, then we will take a quick look at the numerical measurements of sample quality.



While still scattered about a bit, we are much better in terms of standard deviation and percentile. The numbers:
  • min=154
  • average=154.025
  • standard deviation=0.757224
  • percentiles=154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154, 154

Look at the consistency! Yes, we still have unexplained data points that are not 154 ticks, but the vast, vast majority come in at 154 ticks! Ignoring potential problems, this timing code is "ideal" in that the loop takes a consistent 154 ticks every time through. This helps us fairly measure competing algorithms' speeds.

However, this does not imply that the method is flawless!
  • Presumably the CPU remained in a lower power state through this test. I have not tested this much yet. Is there a way to determine what state the CPU is in? Can we find a way to ensure the CPU is in the lowest power state? How about the highest power state?
  • An empty loop has no cache effects. I intend to analyze this problem in the future.
  • Stable timings were achieved on an Intel Core i7 CPU. I have no clue whether other out-of-order CPUs will reach stability.
  • There are still confounding factors at play in the above plot. Remember, the ideal plot would have every test run at exactly 154 ticks. What possible reasons can we think of for the outliers? Context switches were already filtered out, as those tend to take 1000s of ticks.

At the very least, I hope my post starts some interesting conversation on measuring the quality of timing results. I hope it is clear why a simple average or just taking the minimum value is not sufficient.




  • 1: We compute the list of percentiles by sorting all the data points in the sample, then grabbing the (1/100)*sample_size element, and repeating until we reach the (99/100)*sample_size element. This helps us eliminate outliers.
  • 2: Well, final test for this post; over time I hope to come up with improvements that cover more complex cases with more confounding factors than just the empty loop.

dedndave

i am interested to see what code you are timing
as well as the timing code, itself

have you tried timing the same code with Michael's macros ?
perhaps not perfect - i doubt any timing code is - but we are all very "comfortable" with the results
it would be interesting to see how it compares with your timing code
Michael uses CPUID to serialize instructions

also - if you notice the example code i posted in the other thread...
on multicore machines, it may be easier to get stable readings by using SetProcessAffinityMask to bind to a single core
i find that works best in here - some of us have older P4's that jump around, otherwise
to compare apples with apples, it is best to bind for all processors

dedndave

...i might add

the absolute times aren't what we are after, so much as comparing one algorithm to another on different CPU's
our idea of "good timing code" is that which provides repeatable results on a variety of cores, while timing a variety of code

nixeagle

Thanks for taking the time to read through my admittedly long post, dedndave! The code for the last plot (the stable one) is as follows. (Please note that this is the result of objdump after compiling some C++ code using GCC; for the purposes of my investigation it is good enough.) The code loops through and prints each run's result to the standard output, which is then piped to a file. From there I use Mathematica to generate pretty plots and compute various statistics, two of which are mentioned above.


00000000 <__tcf_0>:
   0:   83 ec 0c                sub    esp,0xc
   3:   b9 00 00 00 00          mov    ecx,0x0
                        4: dir32        .bss
   8:   e8 00 00 00 00          call   d <__tcf_0+0xd>
                        9: DISP32       std::ios_base::Init::~Init()
   d:   83 c4 0c                add    esp,0xc
  10:   c3                      ret

00000011 <timeit()>:
  11:   0f 31                   rdtsc
  13:   0f 31                   rdtsc
  15:   0f 31                   rdtsc
  17:   0f ae f0                mfence
  1a:   0f 31                   rdtsc
  1c:   89 c1                   mov    ecx,eax
  1e:   0f ae f0                mfence
  21:   0f 31                   rdtsc
  23:   29 c8                   sub    eax,ecx
  25:   c3                      ret

00000026 <main>:
  26:   8d 4c 24 04             lea    ecx,[esp+0x4]
  2a:   83 e4 f0                and    esp,0xfffffff0
  2d:   ff 71 fc                push   DWORD PTR [ecx-0x4]
  30:   55                      push   ebp
  31:   89 e5                   mov    ebp,esp
  33:   57                      push   edi
  34:   56                      push   esi
  35:   53                      push   ebx
  36:   51                      push   ecx
  37:   83 ec 18                sub    esp,0x18
  3a:   e8 00 00 00 00          call   3f <main+0x19>
                        3b: DISP32      __main
  3f:   bf e0 93 04 00          mov    edi,0x493e0
  44:   e8 c8 ff ff ff          call   11 <timeit()>
  49:   89 04 24                mov    DWORD PTR [esp],eax
  4c:   b9 00 00 00 00          mov    ecx,0x0
                        4d: dir32       std::cout
  51:   e8 00 00 00 00          call   56 <main+0x30>
                        52: DISP32      std::ostream& std::ostream::_M_insert<unsigned long>(unsigned long)
  56:   83 ec 04                sub    esp,0x4
  59:   89 c6                   mov    esi,eax
  5b:   8b 00                   mov    eax,DWORD PTR [eax]
  5d:   8b 40 f4                mov    eax,DWORD PTR [eax-0xc]
  60:   8b 5c 06 7c             mov    ebx,DWORD PTR [esi+eax*1+0x7c]
  64:   85 db                   test   ebx,ebx
  66:   75 05                   jne    6d <main+0x47>
  68:   e8 00 00 00 00          call   6d <main+0x47>
                        69: DISP32      std::__throw_bad_cast()
  6d:   80 7b 1c 00             cmp    BYTE PTR [ebx+0x1c],0x0
  71:   74 06                   je     79 <main+0x53>
  73:   0f b6 43 27             movzx  eax,BYTE PTR [ebx+0x27]
  77:   eb 18                   jmp    91 <main+0x6b>
  79:   89 d9                   mov    ecx,ebx
  7b:   e8 00 00 00 00          call   80 <main+0x5a>
                        7c: DISP32      std::ctype<char>::_M_widen_init() const
  80:   8b 03                   mov    eax,DWORD PTR [ebx]
  82:   c7 04 24 0a 00 00 00    mov    DWORD PTR [esp],0xa
  89:   89 d9                   mov    ecx,ebx
  8b:   ff 50 18                call   DWORD PTR [eax+0x18]
  8e:   83 ec 04                sub    esp,0x4
  91:   0f be c0                movsx  eax,al
  94:   89 04 24                mov    DWORD PTR [esp],eax
  97:   89 f1                   mov    ecx,esi
  99:   e8 00 00 00 00          call   9e <main+0x78>
                        9a: DISP32      std::ostream::put(char)
  9e:   83 ec 04                sub    esp,0x4
  a1:   89 c1                   mov    ecx,eax
  a3:   e8 00 00 00 00          call   a8 <main+0x82>
                        a4: DISP32      std::ostream::flush()
  a8:   83 ef 01                sub    edi,0x1
  ab:   75 97                   jne    44 <main+0x1e>
  ad:   b8 00 00 00 00          mov    eax,0x0
  b2:   8d 65 f0                lea    esp,[ebp-0x10]
  b5:   59                      pop    ecx
  b6:   5b                      pop    ebx
  b7:   5e                      pop    esi
  b8:   5f                      pop    edi
  b9:   5d                      pop    ebp
  ba:   8d 61 fc                lea    esp,[ecx-0x4]
  bd:   c3                      ret

000000be <_GLOBAL__sub_I__Z6timeitv>:
  be:   83 ec 1c                sub    esp,0x1c
  c1:   b9 00 00 00 00          mov    ecx,0x0
                        c2: dir32       .bss
  c6:   e8 00 00 00 00          call   cb <_GLOBAL__sub_I__Z6timeitv+0xd>
                        c7: DISP32      std::ios_base::Init::Init()
  cb:   c7 04 24 00 00 00 00    mov    DWORD PTR [esp],0x0
                        ce: dir32       .text
  d2:   e8 00 00 00 00          call   d7 <_GLOBAL__sub_I__Z6timeitv+0x19>
                        d3: DISP32      atexit
  d7:   83 c4 1c                add    esp,0x1c
  da:   c3                      ret
  db:   90                      nop


Again, I used C++ for this testing, as the only critical code (as far as I'm aware...) is the timeit() function. It is possible that factors like flushing the output after every run through the loop are confounding the results, though I'd be surprised, as I'd expect that to show up in the math.

I plan to test the timing macros you linked to tonight or tomorrow and see how those fare. I suspect they will blow me out of the water!  :wink

The main focus of my post is assessing the quality of timing results. A little history: I found this forum through Stack Overflow a few weeks ago and enjoyed reading past posts in this subforum. During that, I came across some epic arguments over measurement quality and test harnesses. As a mathematics major, I became interested in how to establish some form of confidence in measurements on an out-of-order CPU. To be very clear, my interest is purely academic here. I simply believe a better job can be done of qualifying how good a set of measurements is.

I am very interested in answers to the various questions posed in my original post. Setting affinity is a great idea. It'll ensure that we are using the same L1 cache when I get around to analyzing that.

P.S. In reply to your second post: I understand, and my goal atm is simply to qualify how good a set of results is with respect to those parameters. We want the timing loop to be as consistent as possible across all the CPUs. How fast or slow the loop is does not matter. My problem atm is figuring out whether, for example, the code above is running at full throttle or not. If it is not, then code inside the loop could potentially cause the test to no longer be consistent.

dedndave

i am not so sure that MFENCE is doing what you want it to
it serializes memory load and store instructions (or more correctly, operations)
from what i can see, RDTSC does not fall into that category

you might test that by removing the MFENCE instructions for comparison

instruction serialization has been the uphill battle that Michael spent hours upon hours fighting with - lol
i played with it a little bit - trying to use IRET

i think there are other ways, however, to force in-order execution - without using one of the serialization instructions

nixeagle

It does have an effect: removing MFENCE causes a huge speedup, but a loss of consistency compared to with MFENCE.



Standard Deviation -> 4.20628
Mean -> 75.8523
Min -> 60
Percentiles -> 60, 60, 60, 60, 64, 64, 64, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77, 77

So there is definitely an effect.

My main problem right now seems to be ensuring the CPU remains in the same power/throttle state throughout the test. Around run 270,000 the CPU changed state for a good stretch.  :eek Well, that and it is clear that MFENCE is doing something. I'll try CPUID soonish; right now I'm messing with the timing code you linked to.

dedndave

if i understand correctly, you are timing the RDTSC instruction
here are the results i get using Michael's macros on my old P4 prescott
96, 96, 96, 96, 96, 96, 100, 98, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 98, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 95, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96,
96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 96, 95, 95, 96, 96,


the program is attached, so you can compare on an i7

dedndave

as for the throttle thing...
i use a little reg file when i set up my OS
REGEDIT4

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001

prescotts are notorious for "wobbly" timing - lol
this helps a lot

nixeagle

I am actually testing the blank loop; that is, nothing in the loop at all.

The problem I see with the code you posted is that it essentially gives only an average.


= 00989680 LOOP_COUNT = 10000000 ;try to choose a value so each run takes about 0.5 seconds


This obscures some of the information: are all of the values clustered around the same average, or are they more like the first plot in the original post?

I did run your program and got this output:  53, 45, 38, 38, 35, 35, 32, 30, 30, 25, 26, 26, 24, 24, 24, 19, 21, 21, 20, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 15, 15, 15, 15, 15, 16, 15, 15, 15, 16, 15, 15, 15, 15, 15, 15, 15, 16, 16, 15, 15, 15, 16, 15, 15, 16, 15, 15, 15, 15, 15, 15.

Now here I can clearly see that the longer the program runs, the higher the clock state the CPU ramps into (the tick counts fall from 53 to a steady 15). Ideally we would get the CPU into its highest clock state before taking any measurements.

I have been working on using the macros to do as I did with the C++ code: spam out a list of numbers, each number matching a single test run, not an average of 10 million runs.

dedndave

i forget now whether Michael takes the average or the fastest time
but - getting stable results depends on running the test enough times
we generally get good results if each run takes about 0.5 seconds
you can do alright with shorter runs - depending on the code you are timing

at any rate - i can change the LOOP_COUNT to 1 and run it   :bg

dedndave

prescott w/htt - LOOP_COUNT = 1
75, 30, 75, 82, 75, 75, 82, 82, 83, 75, 83, 82, 83, 82, 75, 75, 75, 82, 83, 82,
83, 75, 82, 82, 53, 82, 83, 83, 75, 82, 82, 82, 82, 75, 75, 75, 98, 75, 75, 75,
75, 83, 75, 83, 83, 75, 75, 82, 75, 82, 83, 82, 83, 75, 75, 83, 75, 75, 82, 75,
75, 75, 75, 83, 83, 83, 82, 83, 75, 83, 82, 83, 83, 83, 83, 83, 75, 82, 75, 82,
75, 82, 75, 82, 83, 83, 82, 82, 75, 82, 82, 75, 75, 82, 82, 82, 83, 75, 75, 75


you will probably get more stable results on the i7

nixeagle

Quote from: dedndave on May 01, 2012, 05:23:17 AM
i forget now whether Michael takes the average or the fastest time
but - getting stable results depends on running the test enough times
we generally get good results if each run takes about 0.5 seconds
you can do alright with shorter runs - depending on the code you are timing
Well, yes, but that obscures information about what the sample actually looks like. Looking back at my original post, it is very useful to know if we are dealing with a very spread-out sample like plot number 1. Taking the average (mean) of plot number 1 yields ~74 ticks, but that number is nearly meaningless given such a large deviation. Something like plot number 3 is much more desirable as far as statistical significance goes.

Quote
at any rate - i can change the LOOP_COUNT to 1 and run it   :bg

Go for it, and adjust it to spam something like 1000000 numbers :wink.


nixeagle

Quote from: dedndave on May 01, 2012, 05:26:21 AM
prescott w/htt - LOOP_COUNT = 1
75, 30, 75, 82, 75, 75, 82, 82, 83, 75, 83, 82, 83, 82, 75, 75, 75, 82, 83, 82,
83, 75, 82, 82, 53, 82, 83, 83, 75, 82, 82, 82, 82, 75, 75, 75, 98, 75, 75, 75,
75, 83, 75, 83, 83, 75, 75, 82, 75, 82, 83, 82, 83, 75, 75, 83, 75, 75, 82, 75,
75, 75, 75, 83, 83, 83, 82, 83, 75, 83, 82, 83, 83, 83, 83, 83, 75, 82, 75, 82,
75, 82, 75, 82, 83, 83, 82, 82, 75, 82, 82, 75, 75, 82, 82, 82, 83, 75, 75, 75


you will probably get more stable results on the i7

I tossed your data into my Mathematica program :) The plot is:


StandardDeviation -> 6.94087
Mean -> 78.81
GeometricMean -> 78.3778
HarmonicMean -> 77.6976
Median -> 82
Mode -> {75}
MeanDeviation -> 4.5404
Min -> 30
Max -> 98
TrimmedMean -> 79.3667
InterquartileRange -> 8


The important bit here is that we still get an average, but we also get information on how good an approximation that average actually is. Note that your standard deviation is better than that of Plot/Graph/Sample number 1 in the original post!

hutch--

The technique I still use is to design a benchmark that comes close to how the algo will be used, then run it for at least a second or two to reduce the OS interference range to well under 1%. Just for example, if you want an algo that attacks fast on small data, then you make a short data test piece and loop it many times to get a duration of a second or more. If alternatively it's an algo that hits very large data, then I usually make a test data piece that is very large, > 100 MB up to a gigabyte, and see how the times run. It's more work, but it can be tuned very accurately to the required task the algo will perform.