Figuring out a statistically reliable baseline for out of order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM

Previous topic - Next topic

nixeagle

Quote from: dedndave on May 07, 2012, 03:50:45 AM
DOH !

prescott w/htt XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 555506 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 49938 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3682 calls


Whoa! very nice! You can take those numbers to the bank! Tomorrow I'll go ahead and finish off the calculation part. That way the program computes the number of ticks taken instead of mental arithmetic! Since you have 0 variance, computing the number of ticks taken is trivial.

Starting off with ticks taken for timings_cdiv we have

(720 - 570)/32 = 4.6875 ticks.


And time taken for timings_div we have

(1575 - 570)/32 = 31.4063 ticks.


So clearly qWord's constant division macro is faster by a factor of 6.7 times!

Thank you for testing! I'm very happy with the results on your processor! :dance: I'll be thrilled if others take the time to post some results too.

dedndave

well - we all knew it was faster
but, i seriously doubt it is 5 clock cycles
i didn't look at the code - but it probably takes something like twice that, depending on the variation

nixeagle

Quote from: dedndave on May 07, 2012, 05:11:10 AM
well - we all knew it was faster
but, i seriously doubt it is 5 clock cycles
i didn't look at the code - but it probably takes something like twice that, depending on the variation
Depends on how many mul can be in flight at a time. Take a look at the code and tell me what you think. I'll look up agner fog's CPU timings for the prescott (if they exist) and we can sort this out. I am very confident in the test results though.

Plus I wonder if hyperthreading messes with this? Only way to figure that out is to disable it for a run and see what it says.

Edit: here is the disassembly of testit_cdiv

00000480 <testit_cdiv@0>:
480:   53                      push   ebx
481:   b9 1f 00 00 00          mov    ecx,0x1f
486:   b8 7f 96 98 00          mov    eax,0x98967f
48b:   ba 47 ac c5 a7          mov    edx,0xa7c5ac47
490:   83 c0 01                add    eax,0x1
493:   72 02                   jb     497 <testit_cdiv@0+0x17>
495:   f7 e2                   mul    edx
497:   c1 ea 10                shr    edx,0x10
49a:   49                      dec    ecx
49b:   79 e9                   jns    486 <testit_cdiv@0+0x6>
49d:   5b                      pop    ebx
49e:   c3                      ret


Remember the cost of the loop and pushing/popping ebx is factored out of the measurement time due to testit_baseline. Follows is the disassembly for that:

00000470 <testit_baseline@0>:
470:   53                      push   ebx
471:   b9 1f 00 00 00          mov    ecx,0x1f
476:   49                      dec    ecx
477:   79 fd                   jns    476 <testit_baseline@0+0x6>
479:   5b                      pop    ebx
47a:   c3                      ret


Your timings are reasonable assuming the cost of a multiply is less than 5 clocks. Remember the non multiply stuff can be done out of order during the multiply.

dedndave

well - i generally bind to a single core for all timing tests
i think many forum members do, also

the number of MUL's that can happen at once is 1   :bg
they are all dependent on EAX and EDX
not to mention - they probably use some specialized ciruitry to perform MUL

nixeagle

Quote from: dedndave on May 07, 2012, 05:17:26 AM
well - i generally bind to a single core for all timing tests
i think many forum members do, also

the number of MUL's that can happen at once is 1   :bg
they are all dependent on EAX and EDX
not to mention - they probably use some specialized ciruitry to perform MUL

See my edit: and yes the test binds itself to one core. But that does not mean htt is disabled. In a way the test code is still only getting half of your CPU's time.

Edit: ugh, looking at the disassembly for timeit_baseline it looks like I forgot to make that do mov eax, 9999999 like the other two functions under test. So that means in a way the timings should be shorter than what you have!

Edit 2: This is the code actually being timed:

486:   b8 7f 96 98 00          mov    eax,0x98967f
48b:   ba 47 ac c5 a7          mov    edx,0xa7c5ac47
490:   83 c0 01                add    eax,0x1
493:   72 02                   jb     497 <testit_cdiv@0+0x17>
495:   f7 e2                   mul    edx
497:   c1 ea 10                shr    edx,0x10

I can easily see this executing in less than 5 clocks due to out of order execution. While the CPU is busy with that mul instruction, it can start work on the next iteration of the inner loop.

Edit 3: Ugh, according to Agner Fog the latency for MUL is 11 clocks on a prescott. The pipeline would allow everything else to execute as they use different execution ports, but no way around the MUL latency. Now to figure out how in the heck we are undercounting. Maybe we are skipping the mul and taking the jb 497 branch?

Edit 4: The DIV number looks in the ballpark according to Agner Fog's number as DIV has a 34 tick reciprocal throughput. We are at 31.4. A little under, but not nearly as bad as the MUL case. Have you ever overclocked your processor? If yes, I'll have to check into how RDTSC works on the prescott. I know it is different from Core i7 as here it is a monotonic 2Ghz clock, regardless of what speed the actual CPU is at.

Edit 5: This really is weird. The thing is your CPU actually ran at this speed. The function got called, the CPU churned through it and was measured. There is no denying that. The question is why and right now I'm at a loss. I can't see agner forgetting to note a reciprocal throughput for MUL in his timing data. I mean he has IMUL r32,r32 with a reciprocal throughput of 2.5 on an operation that has a latency of 10 clocks.

Edit 6: I just confirmed that the MUL instruction is actually taken by running the program in gdb. Now I'm really stumped! Can someone else please take the program for a spin and post the results along with what processor you have?

The difference in timing between 4.86 ticks and 11 ticks is statistically significant here, so we can't just dismiss this as measurement error.

jj2007

Attachment from reply #103, on Celeron M, Win XP SP3:

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5600 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5663 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5063205 calls

nixeagle

Quote from: jj2007 on May 07, 2012, 09:12:31 PM
Attachment from reply #103, on Celeron M, Win XP SP3:

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5600 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5663 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5063205 calls


Thanks, from what I understand, the Celeron M is essentially a Pentinum M with less Cache and no speedstep. The lack of speedstep actually makes measuring things easier on your processor. Agner Fog says DIV has a 12-20 clock reciprocal throughput with a 15-39 latency and  MUL has a 1 clock reciprocal throughput with a 5 clock latency. Your numbers are close enough to those timings as you have:
(696 - 348)/32 = 10.875 clocks
for DIV and
(480 - 348)/32 = 4.125 clocks
for MUL. The division on yours is slightly under, just as dedndave's was. The real difference is your multiplcation times. Those are higher than the 1 cycle reciprocal throughput. However the extra time can easily be accounted to pipelining effects and out of order execution. That is executing the instructions around the MUL.

Looking at your results and comparing to dedndave's my two possible conclusions are:
  • The CPU did not fully spin up when taking the timeit_baseline measurement. I'll attach a version that runs the timeit_baseline function again after the two functions under test. That'll allow us to test that theory.
  • Agner Fog has a mistake in his instruction timings somehow and there is either a pipelining of the MUL instruction or it executes faster than he measured. I recall dedndave said he overclocked his CPU, if that is right, that may have an effect depending on how RDTSC works on it.

I find agner fog being wrong very unlikely, so there has to be some other effect going on here. Especially as jj2007's measurements are pretty close to being spot on with expected results.

Attached program is just like the prior program in #103, just adds an extra testit_baseline call after all tests are run. The assumption is timeit_baseline will take the same amount of time; both when it is run at the start and when it is run at the end.

jj2007

Here it is...
Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5649 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5593 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 4206034 calls

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5670 calls

nixeagle

Quote from: jj2007 on May 08, 2012, 12:01:53 AM
Here it is...
Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5649 calls

Running testit_div.
Min:      696 ticks
Max:      696 ticks
Range:    0 ticks
Mean:     696 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5593 calls

Running testit_cdiv.
Min:      480 ticks
Max:      480 ticks
Range:    0 ticks
Mean:     480 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 4206034 calls

Running testit_baseline.
Min:      348 ticks
Max:      348 ticks
Range:    0 ticks
Mean:     348 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 5670 calls


:clap: Looks very good to me. You even have the exact same timings as before, which is a large part of what I'm trying to achieve here. Timings that are consistent no matter how many times you run the program and no matter what order the algorithms are in. For your processor we have that down to a 't'. :bg.

Major thanks for giving us a third opinion. Now just to figure out the oddities with dedndave's processor. If you or anyone else had ideas on the strange measurement results he has... speak up please! Once I'm happy with his processor and we have stable results for everyone, I'll polish the program up some and try testing some different algorithms for fun :D.

dedndave

well - it's not just my processor - lol
XP Media Center Edition does some funny stuff with timers, etc - for fast video rendering
i have an old-style TV card in this machine, as well

nixeagle

Quote from: dedndave on May 08, 2012, 02:20:24 AM
well - it's not just my processor - lol
XP Media Center Edition does some funny stuff with timers, etc - for fast video rendering
i have an old-style TV card in this machine, as well

Can you please at least run the program in post #111? What that'll do is confirm that the measurements are stable. If the second call to timeit_baseline completes faster, we will know the problem is spinning up your processor. If it is not, I'm down to either:
  • Agner Fog messed up,
  • You overclocked or did something non-standard to the CPU. This might cause RDTSC to act in unknown (to me) ways.
  • There is a hidden bug-a-boo in the timing program that causes MUL to be timed twice as fast as it should. Problem is, I'm out of bug ideas ::). Especially with it returning feasible results for jj2007 and I.

If anyone else wants to test, please don't hesitate to post some times!

P.S. I know of no way for the operating system to modify the value returned by RDTSC. I simply don't believe it is possible. The simplest solution to our dilemma will be if your test with the program in #111 shows up a discrepancy between testit_baseline runs.

dedndave

from post 111
prescott w/htt - XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 583233 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 66367 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3584 calls

Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 563234 calls

nixeagle

Amazing how those values are exactly the same as the last time you ran this. Plus notice that testit_baseline takes the same amount of time regardless of when you run it. So this disqualifies my best theory. :(

I'll sleep on this and see if I can't think of some reason for this behavior. I might write up another test for you to try tomorrow. On the bright side, at least we have consistent timings on your "jittery" CPU :bg.

FORTRANS

Hi,

   Results from two older processors: P-III and P-MMX.

Regards,

Steve N.


Spinning up the processor.

Running testit_baseline.
Min:      200 ticks
Max:      200 ticks
Range:    0 ticks
Mean:     200 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3535 calls

Running testit_div.
Min:      1420 ticks
Max:      1420 ticks
Range:    0 ticks
Mean:     1420 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3542 calls

Running testit_cdiv.
Min:      333 ticks
Max:      333 ticks
Range:    0 ticks
Mean:     333 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3542 calls

Running testit_baseline.
Min:      200 ticks
Max:      200 ticks
Range:    0 ticks
Mean:     200 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3528 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_baseline.
Min:      89 ticks
Max:      89 ticks
Range:    0 ticks
Mean:     89 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3528 calls

Running testit_div.
Min:      1450 ticks
Max:      1450 ticks
Range:    0 ticks
Mean:     1450 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3647 calls

Running testit_cdiv.
Min:      470 ticks
Max:      470 ticks
Range:    0 ticks
Mean:     470 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3570 calls

Running testit_baseline.
Min:      89 ticks
Max:      89 ticks
Range:    0 ticks
Mean:     89 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3521 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...

lingo

CPU - i7 2600K; Win7 Ult 64 SP1
Spinning up the processor.

Running testit_baseline.
Min:      145 ticks
Max:      158 ticks
Range:    13 ticks
Mean:     145 ticks
Variance: 1 ticks^2
Batch Size: 8
Call Count: 434693 calls

Running testit_div.
Min:      497 ticks
Max:      544 ticks
Range:    47 ticks
Mean:     520 ticks
Variance: 121 ticks^2
Batch Size: 8
Call Count: 780528 calls

Running testit_cdiv.
Min:      198 ticks
Max:      207 ticks
Range:    9 ticks
Mean:     199 ticks
Variance: 6 ticks^2
Batch Size: 8
Call Count: 609322 calls

Running testit_baseline.
Min:      145 ticks
Max:      158 ticks
Range:    13 ticks
Mean:     145 ticks
Variance: 5 ticks^2
Batch Size: 8
Call Count: 394219 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...