Figuring out a statistically reliable baseline for out-of-order processors.

nixeagle · May 08, 2012, 06:53:24 PM

To FORTRANS: Your results are reasonable and match what I'd expect. I have for the P3 the following timings:


(1420 - 200)/32 = 38.125 ticks  // For div
(333 - 200)/32  = 4.15625 ticks // For cdiv

Based on Agner Fog's tables, IMUL ought to take 4 ticks of latency with a reciprocal throughput of 1 tick. So your getting about 4 ticks for the whole measurement seems within reason, considering that operations other than the multiplication are also occurring. Agner Fog does not list timings for MUL on this processor, only IMUL, so I'll just have to assume they are close enough. Either way, you are certainly within expectations here.

The P-MMX takes:

(1450 - 200)/32 = 39.0625 ticks // div
(470 - 200)/32  = 8.4375 ticks  // cdiv

To be honest, I'm not real sure which CPU this is. I'm guessing this is a P1 with MMX. Here MUL takes 9 ticks. I'm going to assume the other operations occur while MUL is occurring and thus the time for the other operations does not contribute to the total time. Still we are half a clock too fast with our measurement. My only guess is testit_baseline is a tad too pessimistic. The same story goes for DIV. Agner's measurement has it at 41 ticks, we have it at about 39 ticks. I'll be doing another release of the timing program today that you might want to try out.

For measuring comparative algorithm speeds, all the results I'm seeing seem spot on. The only possible error I'm seeing is testit_baseline is too pessimistic and should complete sooner than it actually does. But as this is a constant overhead applied to all algorithms under test, no one algorithm is going to gain an unfair advantage with respect to any other.
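The subtract-then-divide step used for both machines above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the timing program; the function name ticks_per_iteration is made up for this example, and the baseline is rounded to 200 ticks exactly as in the figures quoted above.

```python
def ticks_per_iteration(total_ticks, baseline_ticks, iterations):
    """Average cost of one loop iteration once the fixed overhead is removed."""
    return (total_ticks - baseline_ticks) / iterations


# Figures quoted above: baseline rounded to 200 ticks, 32 iterations per sample.
p3_div = ticks_per_iteration(1420, 200, 32)     # 38.125
p3_cdiv = ticks_per_iteration(333, 200, 32)     # 4.15625
pmmx_div = ticks_per_iteration(1450, 200, 32)   # 39.0625
pmmx_cdiv = ticks_per_iteration(470, 200, 32)   # 8.4375

print(f"P3:    div {p3_div} ticks, cdiv {p3_cdiv} ticks")
print(f"P-MMX: div {pmmx_div} ticks, cdiv {pmmx_cdiv} ticks")
```

Because the same baseline is subtracted from every function under test, an overly pessimistic baseline shifts all results by the same amount and leaves the comparison between algorithms intact.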

To lingo: Your results practically mirror mine as far as stability goes, which makes sense since we are both running an i7. What makes me really happy with your results is that testit_baseline takes the same amount of time at both program start and program end. This means both algorithms in between the testit_baseline runs should have optimal and consistent timings. That is, you are always going to get measurements within the standard deviation of the sample.

Now computing the amount of time taken by the algorithms on your processor is going to involve a little more work due to the variance (and thus standard deviation) of the measurements. In the following, I use the notation {lower_bound_of_mean, mean, upper_bound_of_mean}. What I mean by this is that the mean should always appear between the lower and upper bounds. In a way, this expresses the quality of the measurement and how much confidence we can place in it. Off we go!

({509, 520, 531} - 145)/32                     = {11.375, 11.7188, 12.0625} ticks   // div
({199 - Sqrt[6], 199, 199 + Sqrt[6]} - 145)/32 = {1.61095, 1.6875, 1.76405} ticks   // cdiv

For simplicity I've assumed the baseline is 145 ticks with no standard deviation. Including the deviation of the baseline measurement just makes the math more complicated. I'll have the final timing program do it, but for this post it is simple enough to disregard it, if only to avoid explaining it! The impact on the final measurements is on the order of ~0.05 to ~0.1 ticks.
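For anyone who wants to reproduce the {lower, mean, upper} figures, here is a minimal Python sketch of the same calculation. It assumes, as above, a fixed 145-tick baseline with no deviation of its own; the helper name mean_bounds is hypothetical, and the spread of 11 ticks for div is simply read off the {509, 520, 531} bounds.

```python
import math


def mean_bounds(mean, spread, baseline, iterations):
    """Return (lower, mean, upper) per-iteration ticks for mean +/- spread."""
    return tuple((value - baseline) / iterations
                 for value in (mean - spread, mean, mean + spread))


# div: bounds {509, 520, 531}, i.e. a spread of 11 ticks around the mean.
div_bounds = mean_bounds(520, 11, 145, 32)
# cdiv: mean 199 with a spread of Sqrt[6] ticks (variance of 6 ticks^2).
cdiv_bounds = mean_bounds(199, math.sqrt(6), 145, 32)

print("div :", div_bounds)
print("cdiv:", tuple(round(x, 5) for x in cdiv_bounds))
```

The baseline's own deviation is deliberately left out here, just as in the hand calculation.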

As we can see the difference between cdiv and div on limbo's i7 is quite significant. Most importantly, no matter how many times we run this test, the value of Mean will always appear between the bounds specified¹.

Major thanks to both FORTRANS and limbo for giving us more test data to analyze! :bg I'll update with a new program later today that will hopefully fix a few potential flaws² and if I have time, add floating point calculations. For that I'll have to figure out how to print "floats" or "doubles" using masm. ::).

1: This assumes that the test environment is similar enough. Tests on a laptop while plugged in will not be the same as tests on a laptop while unplugged.
2: Looking at you dedndave. :lol Still baffled on how your processor is defying Agner Fog's measurements by running twice as fast as it is supposed to go. No Fair! :8)

FORTRANS · May 08, 2012, 09:26:44 PM

Quote from: nixeagle on May 08, 2012, 06:53:24 PM
To FORTRANS: Your results are reasonable and match what I'd expect.

Hi nixeagle,

Good.

Quote
To be honest, I'm not real sure which CPU this is. I'm guessing this is a P1 with MMX.

Yes, that is correct.

Cheers,

Steve N.

nixeagle · May 08, 2012, 10:29:02 PM

Alright, this update is mostly aimed at dedndave; however, I'd be very interested to see testing results from others. What I have done is create some extra copies of the test functions and name them short_testit_div, long_testit_div and so forth. These mirror the original functions, just with a shorter or longer inner timing loop.

Functions with the short_ prefix loop only 4 times per timing sample.
Functions with the long_ prefix loop 128 times per timing sample.

This will uncover any pipelining effects that occur as well as allow me to correlate the results to further establish how trustworthy the results really are.

Additionally the newest update includes an output revamp. Instead of spamming 8 lines per function tested, we spam only 4.

Edit: Removed battery power test results to shorten the length of this post and draw attention to the more interesting parts. If anyone knows how to collapse or otherwise put scrollbars on code sections... I'd greatly appreciate it.

My test results when my i7 is plugged in:

Spinning up the processor.

Running testit_long_baseline.
Min:      220 ticks, Max:      223 ticks, Range:    3 ticks
Mean:     220 ticks, Variance: 1 ticks^2
Batch Size: 8      , Call Count: 34230 calls

Running long_testit_div.
Min:      1496 ticks, Max:      1496 ticks, Range:    0 ticks
Mean:     1496 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 18319 calls

Running long_testit_cdiv.
Min:      434 ticks, Max:      478 ticks, Range:    44 ticks
Mean:     476 ticks, Variance: 53 ticks^2
Batch Size: 8      , Call Count: 19483198 calls

Running testit_long_baseline.
Min:      220 ticks, Max:      223 ticks, Range:    3 ticks
Mean:     220 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 26089 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      129 ticks, Max:      129 ticks, Range:    0 ticks
Mean:     129 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 172053 calls

Running testit_div.
Min:      440 ticks, Max:      440 ticks, Range:    0 ticks
Mean:     440 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 22743 calls

Running testit_cdiv.
Min:      173 ticks, Max:      179 ticks, Range:    6 ticks
Mean:     175 ticks, Variance: 1 ticks^2
Batch Size: 8      , Call Count: 127876 calls

Running testit_baseline.
Min:      129 ticks, Max:      132 ticks, Range:    3 ticks
Mean:     129 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 168182 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      86 ticks, Max:      94 ticks, Range:    8 ticks
Mean:     93 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 4516969 calls

Running short_testit_div.
Min:      132 ticks, Max:      132 ticks, Range:    0 ticks
Mean:     132 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 16716 calls

Running short_testit_cdiv.
Min:      97 ticks, Max:      101 ticks, Range:    4 ticks
Mean:     100 ticks, Variance: 1 ticks^2
Batch Size: 6      , Call Count: 1798979 calls

Running short_testit_baseline.
Min:      88 ticks, Max:      94 ticks, Range:    6 ticks
Mean:     93 ticks, Variance: 1 ticks^2
Batch Size: 6      , Call Count: 898280 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!

Which gives the following results for cdiv:

(100-93)/4= 1.75 ticks  // short_testit_cdiv
(175-129)/32 = 1.4375 ticks  // testit_cdiv
({476-Sqrt[53],476,476+Sqrt[53]}-220)/128 = {1.94312, 2., 2.05688} ticks // long_testit_cdiv

As the first two have insignificant variance, I chose to disregard it in the calculation. Notice how looping 32 times in the inner loop results in a timing of ~1.4 ticks per iteration. The interesting result on my i7 is how looping 128 times in the inner loop results in 2 ticks per iteration, plus or minus ~0.06 ticks. This is very consistent across program runs. I do not have any idea why a longer inner loop would cause this.

And the results for div.

(132-93)/4= 9.75 ticks  // short_testit_div
(440-129)/32= 9.71875 ticks // testit_div
(1496-220)/128 = 9.96875 ticks // long_testit_div

Notice how consistent the timings are for division, no matter how many times you run the program or how many times the inner loop is run. To me this information is interesting :bg. I'm going to have to figure out a way to work in varying the inner loop length into the final timing program :8).
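As a rough sketch of what varying the inner loop length could look like once automated, the following Python snippet tabulates the per-iteration cost for each loop length from the means posted above and reports how far the results drift between loop lengths. The names I7_MEANS, per_iteration_table and spread are made up for this illustration; the real program is written in MASM and may do this differently.

```python
# (baseline, div, cdiv) mean ticks from the plugged-in i7 run above,
# keyed by inner-loop length.
I7_MEANS = {4: (93, 132, 100), 32: (129, 440, 175), 128: (220, 1496, 476)}


def per_iteration_table(means):
    """Map loop length -> (div, cdiv) ticks per iteration."""
    table = {}
    for iterations, (baseline, div, cdiv) in means.items():
        table[iterations] = ((div - baseline) / iterations,
                             (cdiv - baseline) / iterations)
    return table


def spread(values):
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


table = per_iteration_table(I7_MEANS)
div_spread = spread(div for div, _ in table.values())
cdiv_spread = spread(cdiv for _, cdiv in table.values())

for iterations, (div, cdiv) in table.items():
    print(f"{iterations:>3} iterations: div {div} ticks, cdiv {cdiv} ticks")
print(f"spread across loop lengths: div {div_spread}, cdiv {cdiv_spread}")
```

On these numbers div moves by only 0.25 ticks across the three loop lengths, while cdiv moves by about 0.56 ticks, which is a large fraction of its 1.4 to 2 tick cost.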

I really would appreciate posts pasting testing results as it helps me validate the approach taken. I believe we are nearly there! Later tonight or tomorrow, depending on the ballgame, I'll post an updated program that does the math I've done manually here automatically. But to do that I need to figure out how to do floating point calculations and print floating point numbers out in masm ::).

jj2007 · May 08, 2012, 10:52:39 PM

Here you are :bg

Spinning up the processor.

Running testit_long_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 81466 calls

Running long_testit_div.
Min:      1920 ticks, Max:      1920 ticks, Range:    0 ticks
Mean:     1920 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 72261 calls

Running long_testit_cdiv.
Min:      984 ticks, Max:      996 ticks, Range:    12 ticks
Mean:     991 ticks, Variance: 34 ticks^2
Batch Size: 8      , Call Count: 19328246 calls

Running testit_long_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 77287 calls
Please subtract testit_long_baseline time from both, then divide by 128

Running testit_baseline.
Min:      348 ticks, Max:      348 ticks, Range:    0 ticks
Mean:     348 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 32535902 calls

Running testit_div.
Min:      696 ticks, Max:      696 ticks, Range:    0 ticks
Mean:     696 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 4380 calls

Running testit_cdiv.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 796815 calls

Running testit_baseline.
Min:      348 ticks, Max:      348 ticks, Range:    0 ticks
Mean:     348 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 23628705 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      264 ticks, Max:      264 ticks, Range:    0 ticks
Mean:     264 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 53740 calls

Running short_testit_div.
Min:      312 ticks, Max:      312 ticks, Range:    0 ticks
Mean:     312 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 3865 calls

Running short_testit_cdiv.
Min:      288 ticks, Max:      288 ticks, Range:    0 ticks
Mean:     288 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 18250 calls

Running short_testit_baseline.
Min:      264 ticks, Max:      264 ticks, Range:    0 ticks
Mean:     264 ticks, Variance: 0 ticks^2
Batch Size: 6      , Call Count: 55020 calls
Please subtract short_testit_baseline time from both, then divide by 4.

dedndave · May 09, 2012, 01:12:24 AM

prescott w/htt - xp mce2005 sp3

Running testit_long_baseline.
Min:      735 ticks, Max:      1125 ticks, Range:    390 ticks
Mean:     794 ticks, Variance: 7547 ticks^2
Batch Size: 5      , Call Count: 29589995 calls

Running long_testit_div.
Min:      4837 ticks, Max:      4845 ticks, Range:    8 ticks
Mean:     4844 ticks, Variance: 6 ticks^2
Batch Size: 5      , Call Count: 2212 calls

Running long_testit_cdiv.
Min:      1290 ticks, Max:      2235 ticks, Range:    945 ticks
Mean:     1365 ticks, Variance: 53492 ticks^2
Batch Size: 3      , Call Count: 1756543 calls

Running testit_long_baseline.
Min:      765 ticks, Max:      765 ticks, Range:    0 ticks
Mean:     765 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 9548 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      570 ticks, Max:      705 ticks, Range:    135 ticks
Mean:     574 ticks, Variance: 197 ticks^2
Batch Size: 3      , Call Count: 85822 calls

Running testit_div.
Min:      1575 ticks, Max:      2325 ticks, Range:    750 ticks
Mean:     1577 ticks, Variance: 2131 ticks^2
Batch Size: 3      , Call Count: 19086 calls

Running testit_cdiv.
Min:      720 ticks, Max:      720 ticks, Range:    0 ticks
Mean:     720 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 1006 calls

Running testit_baseline.
Min:      570 ticks, Max:      720 ticks, Range:    150 ticks
Mean:     578 ticks, Variance: 542 ticks^2
Batch Size: 3      , Call Count: 110724 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      480 ticks, Max:      480 ticks, Range:    0 ticks
Mean:     480 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 1030 calls

Running short_testit_div.
Min:      615 ticks, Max:      615 ticks, Range:    0 ticks
Mean:     615 ticks, Variance: 0 ticks^2
Batch Size: 3      , Call Count: 20838 calls

Running short_testit_cdiv.
Min:      487 ticks, Max:      615 ticks, Range:    128 ticks
Mean:     491 ticks, Variance: 69 ticks^2
Batch Size: 3      , Call Count: 55542 calls

Running short_testit_baseline.
Min:      480 ticks, Max:      570 ticks, Range:    90 ticks
Mean:     480 ticks, Variance: 78 ticks^2
Batch Size: 3      , Call Count: 3310 calls

FORTRANS · May 09, 2012, 01:42:23 PM

Hi,

Results from P-III, P-MMX, and a Mobile Intel(R) Celeron(R).

Regards,

Steve N.

Spinning up the processor.

Running testit_long_baseline.
Min: 394 ticks, Max: 394 ticks, Range: 0 ticks
Mean: 394 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3542 calls

Running long_testit_div.
Min: 5260 ticks, Max: 5260 ticks, Range: 0 ticks
Mean: 5260 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3682 calls

Running long_testit_cdiv.
Min: 860 ticks, Max: 860 ticks, Range: 0 ticks
Mean: 860 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3549 calls

Running testit_long_baseline.
Min: 394 ticks, Max: 394 ticks, Range: 0 ticks
Mean: 394 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3535 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min: 202 ticks, Max: 202 ticks, Range: 0 ticks
Mean: 202 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3549 calls

Running testit_div.
Min: 1420 ticks, Max: 1420 ticks, Range: 0 ticks
Mean: 1420 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3556 calls

Running testit_cdiv.
Min: 333 ticks, Max: 333 ticks, Range: 0 ticks
Mean: 333 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3549 calls

Running testit_baseline.
Min: 202 ticks, Max: 202 ticks, Range: 0 ticks
Mean: 202 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3528 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min: 137 ticks, Max: 137 ticks, Range: 0 ticks
Mean: 137 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3598 calls

Running short_testit_div.
Min: 309 ticks, Max: 309 ticks, Range: 0 ticks
Mean: 309 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3535 calls

Running short_testit_cdiv.
Min: 175 ticks, Max: 175 ticks, Range: 0 ticks
Mean: 175 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3535 calls

Running short_testit_baseline.
Min: 137 ticks, Max: 137 ticks, Range: 0 ticks
Mean: 137 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3591 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_long_baseline.
Min: 310 ticks, Max: 310 ticks, Range: 0 ticks
Mean: 310 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 10320282 calls

Running long_testit_div.
Min: 5674 ticks, Max: 5674 ticks, Range: 0 ticks
Mean: 5674 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3710 calls

Running long_testit_cdiv.
Min: 1718 ticks, Max: 1718 ticks, Range: 0 ticks
Mean: 1718 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3598 calls

Running testit_long_baseline.
Min: 310 ticks, Max: 310 ticks, Range: 0 ticks
Mean: 310 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 10638530 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min: 118 ticks, Max: 118 ticks, Range: 0 ticks
Mean: 118 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3507 calls

Running testit_div.
Min: 1450 ticks, Max: 1450 ticks, Range: 0 ticks
Mean: 1450 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3577 calls

Running testit_cdiv.
Min: 470 ticks, Max: 470 ticks, Range: 0 ticks
Mean: 470 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3528 calls

Running testit_baseline.
Min: 118 ticks, Max: 118 ticks, Range: 0 ticks
Mean: 118 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3521 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min: 62 ticks, Max: 62 ticks, Range: 0 ticks
Mean: 62 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3521 calls

Running short_testit_div.
Min: 213 ticks, Max: 213 ticks, Range: 0 ticks
Mean: 213 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3542 calls

Running short_testit_cdiv.
Min: 106 ticks, Max: 106 ticks, Range: 0 ticks
Mean: 106 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3528 calls

Running short_testit_baseline.
Min: 62 ticks, Max: 62 ticks, Range: 0 ticks
Mean: 62 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3521 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...
Spinning up the processor.

Running testit_long_baseline.
Min: 443 ticks, Max: 443 ticks, Range: 0 ticks
Mean: 443 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3542 calls

Running long_testit_div.
Min: 5306 ticks, Max: 5306 ticks, Range: 0 ticks
Mean: 5306 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3542 calls

Running long_testit_cdiv.
Min: 924 ticks, Max: 924 ticks, Range: 0 ticks
Mean: 924 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3542 calls

Running testit_long_baseline.
Min: 443 ticks, Max: 443 ticks, Range: 0 ticks
Mean: 443 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3535 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min: 273 ticks, Max: 273 ticks, Range: 0 ticks
Mean: 273 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3563 calls

Running testit_div.
Min: 1473 ticks, Max: 1473 ticks, Range: 0 ticks
Mean: 1473 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3577 calls

Running testit_cdiv.
Min: 384 ticks, Max: 384 ticks, Range: 0 ticks
Mean: 384 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 91364 calls

Running testit_baseline.
Min: 273 ticks, Max: 273 ticks, Range: 0 ticks
Mean: 273 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3556 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min: 183 ticks, Max: 183 ticks, Range: 0 ticks
Mean: 183 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3514 calls

Running short_testit_div.
Min: 335 ticks, Max: 335 ticks, Range: 0 ticks
Mean: 335 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3556 calls

Running short_testit_cdiv.
Min: 206 ticks, Max: 206 ticks, Range: 0 ticks
Mean: 206 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3612 calls

Running short_testit_baseline.
Min: 183 ticks, Max: 183 ticks, Range: 0 ticks
Mean: 183 ticks, Variance: 0 ticks^2
Batch Size: 8 , Call Count: 3507 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...

mineiro · May 09, 2012, 04:56:01 PM

Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE2)

Spinning up the processor.

Running testit_long_baseline.
Min:      396 ticks, Max:      603 ticks, Range:    207 ticks
Mean:     399 ticks, Variance: 592 ticks^2
Batch Size: 7      , Call Count: 9204369 calls

Running long_testit_div.
Min:      1818 ticks, Max:      2727 ticks, Range:    909 ticks
Mean:     1827 ticks, Variance: 8180 ticks^2
Batch Size: 7      , Call Count: 3006 calls

Running long_testit_cdiv.
Min:      774 ticks, Max:      774 ticks, Range:    0 ticks
Mean:     774 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 9714 calls

Running testit_long_baseline.
Min:      396 ticks, Max:      405 ticks, Range:    9 ticks
Mean:     396 ticks, Variance: 1 ticks^2
Batch Size: 7      , Call Count: 3220512 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      333 ticks, Max:      333 ticks, Range:    0 ticks
Mean:     333 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 24516 calls

Running testit_div.
Min:      675 ticks, Max:      675 ticks, Range:    0 ticks
Mean:     675 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 3048 calls

Running testit_cdiv.
Min:      414 ticks, Max:      414 ticks, Range:    0 ticks
Mean:     414 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 436272 calls

Running testit_baseline.
Min:      333 ticks, Max:      333 ticks, Range:    0 ticks
Mean:     333 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 16746 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      252 ticks, Max:      252 ticks, Range:    0 ticks
Mean:     252 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 13290 calls

Running short_testit_div.
Min:      297 ticks, Max:      450 ticks, Range:    153 ticks
Mean:     307 ticks, Variance: 247 ticks^2
Batch Size: 7      , Call Count: 18758514 calls

Running short_testit_cdiv.
Min:      270 ticks, Max:      270 ticks, Range:    0 ticks
Mean:     270 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 6894 calls

Running short_testit_baseline.
Min:      252 ticks, Max:      252 ticks, Range:    0 ticks
Mean:     252 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 11292 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!

nixeagle · May 09, 2012, 07:51:18 PM

Wow! :clap: Thanks to everyone who took the time to run tests for me. We have so many processors being tested now that I had to write a small Mathematica function to do the math and show the actual results! Later today I'll upload the latest version of the timing code, which will output the "real" numbers. That way you guys don't have to wait on me to translate the results, and I don't have to spend 30 minutes translating them ::).

The new program also strives to minimize the error bounds. If a full sample run comes back with a large error, the program will re-run that test up to 5 more times in the hopes of getting a smaller error for the total measurement. This is especially important for baseline measurements as any error in the baseline measurement propagates to all of the functions compared against it.
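A minimal Python sketch of that retry idea follows. The details are assumptions made for illustration only: the threshold max_error, the policy of keeping the run with the smallest error, and the name measure_with_retries are not taken from the actual program. The stand-in sample_run just replays dedndave's two testit_long_baseline results (mean and square root of the variance) instead of timing anything.

```python
import math


def measure_with_retries(sample_run, max_error, extra_runs=5):
    """Run sample_run() once, then up to extra_runs more times while the
    error stays above max_error. Returns the (mean, error) pair with the
    smallest error seen."""
    best = sample_run()
    for _ in range(extra_runs):
        if best[1] <= max_error:
            break
        candidate = sample_run()
        if candidate[1] < best[1]:
            best = candidate
    return best


# Stand-in for a real sample run: replays two posted baseline results.
replayed = iter([(794, math.sqrt(7547)), (765, 0.0)])
baseline = measure_with_retries(lambda: next(replayed), max_error=1.0)
print(baseline)
```

Here the first baseline run is noisy, so one extra run is taken and the quieter 765-tick result is kept. If every run stays noisy, the loop stops after the fifth extra run and returns the least noisy result it saw.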

Please let me know what you all think of the timings below. Also, to anyone who has time to look at the code: please suggest improvements to the style or approach. This has become the largest project I've ever done in raw assembly. :lol

Again, thanks to everyone!

jj2007's Unknown processor

No really, I forgot which one this was :lol.

Test    128 iterations          32 iterations           4 iterations
div     11.25 +/- 0 ticks       10.88 +/- 0 ticks       12.00 +/- 0 ticks
cdiv    3.99 +/- 0.52 ticks     4.13 +/- 0 ticks        6.00 +/- 0 ticks

Note the really nice pipelining effect for cdiv here. This is something div is unable to do.

dedndave's prescott

Test    128 iterations          32 iterations           4 iterations
div     31.87 +/- 0.22 ticks    31.34 +/- 8.53 ticks    33.75 +/- 0 ticks
cdiv    4.69 +/- 20.44 ticks    4.56 +/- 2.48 ticks     2.75 +/- 4.15 ticks

Note how large the errors for cdiv are compared to the measured values. Two of the three have an error that is greater than the mean itself. What this tells us is that we can't really trust those measurements. The updated program I'm working on attempts to check whether results are meaningful and, if not, resamples or re-runs the test in order to get significant results.

The important thing here is: we can tell when our timings are suspect!
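One simple way to flag such results, sketched here in Python, is to treat a measurement as suspect whenever its error exceeds the mean itself. That criterion is taken from the observation above; whether the updated program uses exactly this test is an assumption, and the names is_suspect and PRESCOTT_CDIV are made up for the example.

```python
def is_suspect(mean, error):
    """True when the error bound is larger than the measured mean."""
    return error > abs(mean)


# dedndave's Prescott, cdiv: loop length -> (mean ticks, error ticks).
PRESCOTT_CDIV = {128: (4.69, 20.44), 32: (4.56, 2.48), 4: (2.75, 4.15)}

suspect_lengths = [iterations
                   for iterations, (mean, error) in PRESCOTT_CDIV.items()
                   if is_suspect(mean, error)]
print(suspect_lengths)  # [128, 4]
```

Only the 32-iteration cdiv measurement survives this check, and even there the error is more than half of the measured value.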

FORTRANS' P-III

Test    128 iterations          32 iterations           4 iterations
div     38.02 +/- 0 ticks       38.06 +/- 0 ticks       43.00 +/- 0 ticks
cdiv    3.64 +/- 0 ticks        4.09 +/- 0 ticks        9.50 +/- 0 ticks

Again note the nice pipelining effect for cdiv that div does not get.

FORTRANS' P-MMX

Test    128 iterations          32 iterations           4 iterations
div     41.91 +/- 0 ticks       41.63 +/- 0 ticks       37.75 +/- 0 ticks
cdiv    11.00 +/- 0 ticks       11.00 +/- 0 ticks       11.00 +/- 0 ticks

Here no matter how many times you loop, the throughput is the same.

FORTRANS' Mobile Intel Celeron

Test    128 iterations          32 iterations           4 iterations
div     37.99 +/- 0 ticks       37.50 +/- 0 ticks       38.00 +/- 0 ticks
cdiv    3.76 +/- 0 ticks        3.47 +/- 0 ticks        5.75 +/- 0 ticks

Here we have a nice speedup again when we stay in the inner loop longer.

mineiro's Pentium

Test    128 iterations          32 iterations           4 iterations
div     11.18 +/- 7.99 ticks    10.69 +/- 0 ticks       13.75 +/- 7.86 ticks
cdiv    2.95 +/- 0.09 ticks     2.53 +/- 0 ticks        4.50 +/- 0 ticks

Here we have a nice warmup effect for cdiv. One might also say div has a warmup effect, but we really can't say anything for it due to the amount of error in the measurements.

hutch-- · May 10, 2012, 12:01:29 AM

Here is the result on my Core2 quad.

Spinning up the processor.

Running testit_long_baseline.
Min:      369 ticks, Max:      369 ticks, Range:    0 ticks
Mean:     369 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 3017630 calls

Running long_testit_div.
Min:      1530 ticks, Max:      1530 ticks, Range:    0 ticks
Mean:     1530 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 370356 calls

Running long_testit_cdiv.
Min:      738 ticks, Max:      747 ticks, Range:    9 ticks
Mean:     745 ticks, Variance: 13 ticks^2
Batch Size: 8      , Call Count: 3584 calls

Running testit_long_baseline.
Min:      369 ticks, Max:      369 ticks, Range:    0 ticks
Mean:     369 ticks, Variance: 0 ticks^2
Batch Size: 8      , Call Count: 2946244 calls
Please subtract testit_long_baseline time from both, then divide by 128.

Running testit_baseline.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     269 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 24366835 calls

Running testit_div.
Min:      540 ticks, Max:      549 ticks, Range:    9 ticks
Mean:     548 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 22754298 calls

Running testit_cdiv.
Min:      342 ticks, Max:      342 ticks, Range:    0 ticks
Mean:     342 ticks, Variance: 0 ticks^2
Batch Size: 7      , Call Count: 3030 calls

Running testit_baseline.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     269 ticks, Variance: 2 ticks^2
Batch Size: 7      , Call Count: 17146668 calls
Please subtract testit_baseline time from both, then divide by 32.

Running short_testit_baseline.
Min:      234 ticks, Max:      243 ticks, Range:    9 ticks
Mean:     238 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 9344100 calls

Running short_testit_div.
Min:      261 ticks, Max:      270 ticks, Range:    9 ticks
Mean:     265 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 6207672 calls

Running short_testit_cdiv.
Min:      243 ticks, Max:      243 ticks, Range:    0 ticks
Mean:     243 ticks, Variance: 0 ticks^2
Batch Size: 4      , Call Count: 12390 calls

Running short_testit_baseline.
Min:      234 ticks, Max:      243 ticks, Range:    9 ticks
Mean:     238 ticks, Variance: 20 ticks^2
Batch Size: 4      , Call Count: 5224032 calls
Please subtract short_testit_baseline time from both, then divide by 4.
... yes this program will do that computation automatically soon!
Press any key to continue ...

nixeagle · May 10, 2012, 12:39:48 AM

Since hutch-- was nice enough to post timing results, I thought I'd run them through my MMA program. We have the following:

Test    128 iterations          32 iterations           4 iterations
div     9.07 +/- 0 ticks        8.72 +/- 0.35 ticks     6.75 +/- 3.16 ticks
cdiv    2.94 +/- 0.32 ticks     2.28 +/- 0.25 ticks     1.25 +/- 2.24 ticks

I see some curious results for div when iterating 4 times only, but the error in that sample is more than enough to render the whole measurement meaningless. My next program update will address this case as I'll be having it strive to get the lowest possible error. Not that we are always able to do so; but for those times we can't, the important thing is that we know there is an error. Just taking the mean is not enough here.

The same deal goes with the measurements for cdiv when looping 4 times. But here notice that the error in the measurement is larger than the value with a relative deviation of 174%!

Finally after the ballgame I'll try to finish up what I'll call v0.5 of the timing/benchmark program. Major thanks to everyone for taking the time to test this and answer my questions. :U

P.S. Hutch--, does that have speedstep or something else causing clock variation? Just curious :bdg.

Relative deviation is defined as standard_deviation/mean. I converted it to a percentage by multiplying the result by 100.

hutch-- · May 10, 2012, 01:47:53 AM

The dev box I use is a Core2 quad running at 3 gig and it has been far more reliable in timings than the last generation of PIVs that I used. My old 2.8 gig Northwood was very consistent in terms of timing but the later 3.8 gig Prescott PIV which had the extra 1 meg cache wandered a lot more with its timings. On the PIVs I ran Win2000 with the hyperthreading turned off in the BIOS as Win2000 did not properly support it. The Core2 does not support it either but the later i7 I have that I rarely ever turn on does. It is also very consistent in terms of timings.

When I time an algo I write a test piece that is as close as I can get to its real world task then bash it to death in ring3 to get its averages. You occasionally get a quirky low reading but what I am after is the average running in ring3 while the OS is running and performing the normal task switching.

nixeagle · May 10, 2012, 02:41:39 AM

Quote from: hutch-- on May 10, 2012, 01:47:53 AM
The dev box I use is a Core2 quad running at 3 gig and it has been far more reliable in timings than the last generation of PIVs that I used. My old 2.8 gig Northwood was very consistent in terms of timing but the later 3.8 gig Prescott PIV which had the extra 1 meg cache wandered a lot more with its timings. On the PIVs I ran Win2000 with the hyperthreading turned off in the BIOS as Win2000 did not properly support it. The Core2 does not support it either but the later i7 I have that I rarely ever turn on does. It is also very consistent in terms of timings.

When I time an algo I write a test piece that is as close as I can get to its real world task then bash it to death in ring3 to get its averages. You occasionally get a quirky low reading but what I am after is the average running in ring3 while the OS is running and performing the normal task switching.

These timings are basically micro-benchmarks. We are measuring in terms of ticks here. Once I finish up this program that handles the "optimal" situation, I'd like to move on to measuring algorithm performance when the cache is not dedicated to the algorithm under test. From there I'll work on adding more complex things in. Right now I'm working on getting the most consistent results possible. To do that, I have to identify confounding sources of error in the sample and attempt to remove them. I'm aiming to do computer science here :wink:. With what I have now, I think I've done a decent job.

I'm a little tired, and the ballgame is not even over, so I think I'll finish this up tomorrow. However, I have made quite a bit of progress since my last update. The program output follows. Please suggest possible improvements to how the output looks. How would you want it to look?

Spinning up the processor.
long_testit_div: 10.0798 +/- 0.0000 ticks
long_testit_cdiv: 2.0876 +/- 0.1004 ticks
testit_div: 10.7298 +/- 0.0000 ticks
testit_cdiv: 1.4485 +/- 0.0442 ticks
short_testit_div: 10.8500 +/- 0.4546 ticks
short_testit_cdiv: 2.5000 +/- 1.0782 ticks

I'd release tonight, but I believe there is a small mathematical error in how sampling is averaged and I'd like to correct that before releasing what I'll call v0.5. :dance:

Edit: Here are the timings when my laptop is on battery power. Added just for variety. :U

long_testit_div: 25.4694 +/- 0.0371 ticks
long_testit_cdiv: 5.1557 +/- 0.1759 ticks
testit_div: 25.8500 +/- 0.0000 ticks
testit_cdiv: 4.6625 +/- 0.2624 ticks
short_testit_div: 24.0000 +/- 0.2500 ticks
short_testit_cdiv: 4.0000 +/- 0.2500 ticks

nixeagle · May 11, 2012, 12:30:28 AM

Alright, finally got this to the point I'm happy with calling this "0.5". Everything mentioned in earlier posts of this thread is in here except for the error minimization code. I need to re-do that and have it work off of the relative deviation rather than just a "range". As is, it causes too much slowdown for what it is worth.
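
To make the idea concrete, here is a rough Python sketch of what stopping on relative deviation rather than a range could look like. This is not the code from the timing program; `sample_until_stable`, `measure`, and all of the thresholds are hypothetical names and values chosen only for illustration.

```python
import itertools
import statistics

def sample_until_stable(measure, max_rel_dev=0.05, min_samples=8, max_samples=1024):
    """Collect samples until stdev/mean drops to max_rel_dev or the cap is hit.

    measure is any zero-argument callable returning one timing in ticks.
    Returns (mean, standard deviation, number of samples taken).
    """
    samples = []
    while len(samples) < max_samples:
        samples.append(measure())
        if len(samples) >= min_samples:
            mean = statistics.mean(samples)
            dev = statistics.pstdev(samples)
            # Stop early once the spread is small relative to the mean.
            if mean != 0 and dev / abs(mean) <= max_rel_dev:
                break
    return statistics.mean(samples), statistics.pstdev(samples), len(samples)

# Stand-in for a real measurement: alternates between two tick counts.
fake_ticks = itertools.cycle([10.0, 10.2])
mean, dev, count = sample_until_stable(lambda: next(fake_ticks))
print(f"{mean:.4f} +/- {dev:.4f} ticks after {count} samples")
```

The cap on `max_samples` matters: a measurement that never settles would otherwise loop forever, which is the same slowdown problem in a worse form.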

My results on my i7:

Spinning up the processor.
long_testit_div: 10.0148 +/- 1.3194 ticks
long_testit_cdiv: 1.9891 +/- 0.2029 ticks
testit_div: 9.8562 +/- 2.1695 ticks
testit_cdiv: 1.4321 +/- 0.0765 ticks
short_testit_div: 9.5420 +/- 0.2500 ticks
short_testit_cdiv: 1.7339 +/- 0.4330 ticks
Press any key to continue ...

Yay for not having to manually calculate this! :dance:

Again, I'd really appreciate some posts showing your results and suggestions on how to improve the program output. :bg

dedndave · May 11, 2012, 12:35:00 AM

prescott w/htt - xp mce2005 sp3

long_testit_div: 32.0050 +/- 0.3598 ticks
long_testit_cdiv: 4.2274 +/- 0.3646 ticks
testit_div: 31.0509 +/- 1.3254 ticks
testit_cdiv: 4.2843 +/- 0.7849 ticks
short_testit_div: 33.0279 +/- 0.9013 ticks
short_testit_cdiv: 3.5929 +/- 0.5000 ticks

i like the way you versioned it as "0.5" - showing off your new floating point skills, i see - lol

FORTRANS · May 11, 2012, 12:39:03 PM

Quote from: nixeagle on May 09, 2012, 07:51:18 PM
FORTRANS' P-MMX
        128 iterations       32 iterations        4 iterations
div     41.91 +/- 0 ticks    42.56 +/- 0 ticks    37.75 +/- 0 ticks
cdiv    11.00 +/- 0 ticks    11.00 +/- 0 ticks    11.00 +/- 0 ticks

Here no matter how many times you loop, the throughput is the same.

Hi,

Makes sense probably. Pentium 1 was not out of order, does
not do branch prediction, and has (mostly?) hard-wired execution,
as opposed to converting to micro-ops or using micro coding.

And here are some new results.

P-III

Spinning up the processor.
long_testit_div: 38.0156 +/- 0.0000 ticks
long_testit_cdiv: 3.6406 +/- 0.0000 ticks
testit_div: 38.0625 +/- 0.0000 ticks
testit_cdiv: 4.0937 +/- 0.0000 ticks
short_testit_div: 43.0000 +/- 0.0000 ticks
short_testit_cdiv: 9.5000 +/- 0.0000 ticks
Press any key to continue ...

P-MMX

Spinning up the processor.
long_testit_div: 41.9062 +/- 0.0000 ticks
long_testit_cdiv: 11.0000 +/- 0.0000 ticks
testit_div: 41.6250 +/- 0.0000 ticks
testit_cdiv: 11.0000 +/- 0.0000 ticks
short_testit_div: 37.7500 +/- 0.0000 ticks
short_testit_cdiv: 11.0000 +/- 0.0000 ticks
Press any key to continue ...

Mobile Celeron

Spinning up the processor.
long_testit_div: 37.9921 +/- 0.0000 ticks
long_testit_cdiv: 3.7578 +/- 0.0000 ticks
testit_div: 37.5000 +/- 0.0000 ticks
testit_cdiv: 3.4687 +/- 0.0000 ticks
short_testit_div: 38.0000 +/- 0.0000 ticks
short_testit_cdiv: 5.7500 +/- 0.0000 ticks
Press any key to continue ...

Regards,

Steve N.
