Figuring out a statistically reliable baseline for out-of-order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM


nixeagle

Output for me on Windows 7 with an Intel Core i7 (Sandy Bridge).

1681, 1375, 1295, 1111, 1094, 1030, 996, 949, 866, 877, 825, 752, 749, 778, 671,
679,
Press any key to continue ...


What you are measuring is how long it takes to execute the Sleep call.

       counter_begin LOOP_COUNT,HIGH_PRIORITY_CLASS
;---------
;code to be timed goes here
       INVOKE  Sleep,0
;---------
       counter_end

Remember, coming out of Sleep 0 you are handed a new timeslice. So I suspect you are tossing timeslices away without getting anywhere near halfway through them.

What we want to measure is how long you have before the operating system takes control from you. The best way to measure that is to do something similar to what I described a few posts ago :bg.

Am I making sense? Basically I think you are measuring the wrong thing! :wink

dedndave

i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat synchronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop

nixeagle

Quote from: dedndave on May 05, 2012, 08:43:58 PM
i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat synchronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop

:lol, ok after I eat I'll do up a test program that shoves numbers into a buffer and, after running, spams them out. Just pipe the result to a file, attach it in a zip file, and I'll plot them in Mathematica. The context switches will be obvious, as they will be the portions where the linear plot jumps from say 200000000 to 200005000 instead of 200000000 to 200000100. It'll look like a "step" on an upward-sloping line. :bdg
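Roughly what I have in mind (an untested sketch; TickBuffer and SAMPLE_COUNT are placeholder names):

        mov     edi, offset TickBuffer  ; buffer of SAMPLE_COUNT dwords
        mov     ecx, SAMPLE_COUNT
fill_loop:
        rdtsc                           ; read the time stamp counter
        mov     [edi], eax              ; low dword is plenty for back-to-back reads
        add     edi, 4
        dec     ecx
        jnz     fill_loop

Then dump the buffer afterwards; anywhere consecutive entries jump by thousands of ticks instead of a hundred or so, a context switch landed there.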

MichaelW

The Sleep 0 is used to synchronize the test code with the start of a time slice, maximizing the time the code can run before the end of the slice. The point is to avoid having the system interrupt the execution of the test code.
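In outline it does something like this (a minimal sketch of the idea only, not the actual counter_begin/counter_end macro bodies):

        INVOKE  Sleep, 0        ; yield the rest of the current slice so the
                                ; timed code starts at the top of a fresh one
        rdtsc                   ; read the time stamp counter
        mov     esi, eax        ; save the low dword as the start count
        ; ... code to be timed ...
        rdtsc
        sub     eax, esi        ; elapsed ticks (low dword is enough for short tests)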
eschew obfuscation

nixeagle

Quote from: nixeagle on May 05, 2012, 08:50:08 PM
Quote from: dedndave on May 05, 2012, 08:43:58 PM
i don't think so
we need thoughts and ideas from other members   :bg

it could be that it is somewhat synchronous, as Michael uses the Sleep function
i don't recall exactly how - but i think it's only at the beginning of the loop

:lol, ok after I eat I'll do up a test program that shoves numbers into a buffer and, after running, spams them out. Just pipe the result to a file, attach it in a zip file, and I'll plot them in Mathematica. The context switches will be obvious, as they will be the portions where the linear plot jumps from say 200000000 to 200005000 instead of 200000000 to 200000100. It'll look like a "step" on an upward-sloping line. :bdg

Alright, this was easier than I thought :bg. I changed timeit.asm such that

gu_default_sample_batch_size equ 1
gu_total_samples equ 25000


Which basically turns the program into a "normal" timing loop that attempts to take 25000 samples. Here is the result:

Min:      186 ticks
Max:      8697 ticks
Range:    8511 ticks
Mean:     238 ticks
Variance: 9877 ticks^2   // This got so large that it wrapped around a 32-bit integer. Needs 64-bit support soon! Or do this with doubles...


I piped the result into a CSV file and fed it to MMA. Mathematica found 5 points where the computer took longer than 500 ticks, which means this program ran across 6 time slices. Even if we (wrongly) assume that all runs other than the context switches take 186 ticks, we have the following:
186 * 25000 / 6 = 775,000
So each slice is roughly 775,000 ticks on my machine. At about 2 ticks per nanosecond that is roughly 750,000/2000 ≈ 375 microseconds. To put this in human time: you get about 0.000375 seconds. Granted my margin of error here is huge, but I can say with certainty that it is larger than my lower-bound calculation of 10 microseconds earlier.

Am I making sense now? If not, where am I wrong? :wink

nixeagle

Quote from: MichaelW on May 05, 2012, 09:23:12 PM
The Sleep 0 is used to synchronize the test code with the start of a time slice, maximizing the time the code can run before the end of the slice. The point is to avoid having the system interrupt the execution of the test code.


Alright, my method instead detects the context switch and discards the result. Well, it only does that when gu_default_sample_batch_size is greater than 1.
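Roughly the idea (a sketch only, not the harness's exact code; THRESHOLD is a placeholder cutoff):

        rdtsc
        mov     esi, eax                ; start of this batch
        ; ... run the batch of gu_default_sample_batch_size calls ...
        rdtsc
        sub     eax, esi                ; ticks taken by the whole batch
        cmp     eax, THRESHOLD          ; THRESHOLD: e.g. a few times the best batch seen so far
        ja      discard_batch           ; a context switch landed in the batch - toss it and retime
        ; ... otherwise keep the sample ...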

MichaelW

Under conditions where the system and the test code are running on the same core, the system must interrupt the test code at the end of a time slice, if only to determine that the current thread is going to get the next time slice.
eschew obfuscation

nixeagle

Quote from: MichaelW on May 05, 2012, 09:58:06 PM
Under conditions where the system and the test code are running on the same core, the system must interrupt the test code at the end of a time slice, if only to determine that the current thread is going to get the next time slice.
Right. The question was how often that occurs. :wink I believe I've shown it must be substantially longer than the figure dedndave got by measuring how long Sleep 0 took. That measurement just tells you how much time the operating system takes to complete the Sleep 0 call; it does not tell you how long the operating system will let you run before pre-emptively taking control back from you.

I did an interesting graph with Mathematica that ought to help make it intuitively clear why the OS can't context switch every 2 microseconds: if it did, context-switch overhead alone would peg the CPU at around 50% utilization all of the time!



Here the X axis represents the number of nanoseconds a program is permitted to run before having control taken away by the operating system. The Y axis represents what percentage of the CPU is left idle, i.e. free for the user's program to make use of. This is done under the assumption that a context switch takes 1433 nanoseconds (as measured on dedndave's processor by his Sleep 0 call).
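To make the relationship explicit, here is the back-of-the-envelope formula behind the plot (assuming a fixed 1433 ns switch cost and nothing else eating CPU):

    overhead ≈ t_switch / (t_slice + t_switch)
             = 1433 / (1433 + 1433)       = 50%     for a slice as short as the switch itself
             = 1433 / (20,000,000 + 1433) ≈ 0.007%  for, say, a 20 ms slice

So the overhead only becomes negligible when the slice is several orders of magnitude longer than the switch.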

nixeagle

I think you guys will find this interesting. I have a stable baseline that we can compare test functions against. That is, we can use this to figure out fairly precisely how many clocks something takes. At least on my i7.

Below is the program output; it ought to speak for itself. Real "production" versions of this will do only one run of the baseline and of each algorithm, and the baseline won't be shown, just the difference between each algorithm and the baseline. :)

Spinning up the processor.
Running 4 full runs of testit_baseline.
Batch Size: 8

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 104, 107, 107, 107, 107, 110, 107, 107, 107, 107, }

Min:      104 ticks
Max:      110 ticks
Range:    6 ticks
Mean:     107 ticks
Variance: 0 ticks^2
Batch Size: 7
Call Count: 2379447 calls

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 92, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 104, }

Min:      92 ticks
Max:      107 ticks
Range:    15 ticks
Mean:     106 ticks
Variance: 8 ticks^2
Batch Size: 7
Call Count: 663870 calls

{107, 107, 104, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, }

Min:      104 ticks
Max:      107 ticks
Range:    3 ticks
Mean:     106 ticks
Variance: 1 ticks^2
Batch Size: 7
Call Count: 625968 calls

{107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 110, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, }

Min:      107 ticks
Max:      110 ticks
Range:    3 ticks
Mean:     107 ticks
Variance: 0 ticks^2
Batch Size: 7
Call Count: 1125312 calls

Running 4 full runs of testit, which contains two imul.

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, 123, 119, 123, 123, 123, 123, 123, 123, }

Min:      104 ticks
Max:      123 ticks
Range:    19 ticks
Mean:     122 ticks
Variance: 14 ticks^2
Batch Size: 6
Call Count: 2835779 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, }

Min:      123 ticks
Max:      123 ticks
Range:    0 ticks
Mean:     123 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 2246256 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 104, 123, }

Min:      104 ticks
Max:      123 ticks
Range:    19 ticks
Mean:     121 ticks
Variance: 26 ticks^2
Batch Size: 5
Call Count: 398668 calls

{123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, }

Min:      123 ticks
Max:      123 ticks
Range:    0 ticks
Mean:     123 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 318908 calls
Press any key to continue ...


Please take the time to test this and paste the outputs!

P.S. dedndave and others: the program now requests HIGH priority instead of realtime, as we don't need realtime here. Before running this, you may want to adjust gu_default_sample_batch_size down to 4 or 5 from 8. Your call on that :wink

jj2007

Looks promising :U

{336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336,
336, 336, 336, 336, 336, 336, 336, 336, 336, }

Min:      336 ticks
Max:      336 ticks
Range:    0 ticks
Mean:     336 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 13224 calls

nixeagle

Quote from: jj2007 on May 06, 2012, 03:03:21 AM
Looks promising :U

{336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336, 336,
336, 336, 336, 336, 336, 336, 336, 336, 336, }

Min:      336 ticks
Max:      336 ticks
Range:    0 ticks
Mean:     336 ticks
Variance: 0 ticks^2
Batch Size: 5
Call Count: 13224 calls


Is that from testit_baseline or testit? I'd like to see both if possible :bg. But awesome that it works for you :dance:. Btw folks, about how long does it take before the program prompts you to press a key?

hutch--

Something to note with the SleepEx() API (Sleep is just a wrapper around it): unless you set some duration you do not get a guaranteed yield to another thread; set it to 1 or higher and you do.


invoke SleepEx,1,0    ; 1 ms duration, non-alertable - guarantees the yield
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

nixeagle

Quote from: hutch-- on May 06, 2012, 06:14:08 AM
Something to note with the SleepEx() API (Sleep is just a wrapper around it): unless you set some duration you do not get a guaranteed yield to another thread; set it to 1 or higher and you do.


invoke SleepEx,1,0


Wow, did not know that. Thanks! :clap:

In other news, to keep everyone up to date: I'm working on a "practical" test example. Pitting qWord's cdiv macro against the naive div, we get timings that look something like this:


Spinning up the processor.
Running 4 full runs of testit_baseline.
Batch Size: 8

{129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, }

Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 62580 calls

{132, 132, 132, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 132, }

Min:      129 ticks
Max:      132 ticks
Range:    3 ticks
Mean:     129 ticks
Variance: 1 ticks^2
Batch Size: 8
Call Count: 72996 calls

{132, 132, 132, 132, 132, 132, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 132, 129, 129, 129, }

Min:      129 ticks
Max:      132 ticks
Range:    3 ticks
Mean:     129 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 24983 calls

{129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, }

Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 65051 calls

Running 2 full runs of testit_div which contains integer division.

{440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, }

Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1953 calls

{440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, 440, }

Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1925 calls

Running 2 full runs of testit_cdiv, a constant division macro by qWord.

{173, 176, 173, 176, 176, 176, 179, 173, 173, 173, 176, 179, 176, 176, 176, 176, 176, 176, 176, 173, 176, 176, 176, 176, 176, }

Min:      173 ticks
Max:      179 ticks
Range:    6 ticks
Mean:     175 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 36911 calls

{173, 173, 176, 176, 173, 173, 176, 173, 173, 176, 176, 176, 176, 176, 176, 176, 176, 173, 173, 173, 176, 176, 173, 173, 176, }

Min:      173 ticks
Max:      176 ticks
Range:    3 ticks
Mean:     174 ticks
Variance: 2 ticks^2
Batch Size: 8
Call Count: 19656 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...


Hopefully by the end of today I'll have the code reasonably cleaned up and workable so it prints just two lines instead of all this spam: one line for the time taken by testit_cdiv and one for the time taken by testit_div. Even as is, though, the timings are consistent!
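Doing the subtraction the output asks for by hand on the means above (the divide-by-32 presumably reflects how many times each test function repeats the operation; the program will do this computation itself soon):

    testit_div:  (440 - 129) / 32 ≈ 9.7 ticks
    testit_cdiv: (175 - 129) / 32 ≈ 1.4 ticks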

A question for the MASM pros out there:

I have code that looks like:

runner proc
    ...
   
    print " ",13,10
    print "Running 2 full runs of testit_div which contains integer division.",13,10
    mov [gu_testfunction], testit_div
    call run_driver
    print " ", 13,10

    ...
runner endp

where the ... stands for repeated instances of the 5 lines shown, just with different strings and a different function name moved into gu_testfunction. Is there any way I can do this better? Code it as a macro, or otherwise make it simpler for users of the test harness? Basically the idea is: you write a function and tell the harness to test it.

P.S. I am planning on releasing this code either under a BSD license (basically do whatever you want with it, just keep the comment saying who wrote the code and don't claim it as your own) or simply releasing it to the public domain. I just want to improve the quality of micro-benchmarking.

dedndave

you could write a little proc with parameters for the test routine and string
then...
        INVOKE  RunTest,Proc1,offset szString1
        INVOKE  RunTest,Proc2,offset szString2
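
the proc itself might look something like this - untested, just to show the shape of it (borrowing your gu_testfunction and run_driver names, and the masm32 print macro)

RunTest PROC pTest:DWORD, pszName:DWORD

        print   chr$(13,10)
        print   pszName                 ;description string passed by the caller
        print   chr$(13,10)
        mov     eax,pTest
        mov     gu_testfunction,eax     ;hand the routine to the harness
        call    run_driver
        print   chr$(13,10)
        ret

RunTest ENDP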

nixeagle

Quote from: dedndave on May 06, 2012, 05:07:42 PM
you could write a little proc with parameters for the test routine and string
then...
        INVOKE  RunTest,Proc1,offset szString1
        INVOKE  RunTest,Proc2,offset szString2

Would that be better done as a macro at this point? Sorry, I am not really that great at MASM :red. Most of my code is a more or less direct translation of the way I'd write it using NASM.
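
Something like this is roughly what I have in mind, untested (reusing my gu_testfunction/run_driver names and the masm32 print macro; I'm not sure how MASM handles commas inside the quoted string, so I'd keep the descriptions comma-free):

run_test MACRO testfn:REQ, description:REQ
    print " ",13,10
    print description,13,10
    mov [gu_testfunction], testfn
    call run_driver
    print " ",13,10
ENDM

    ; usage:
    run_test testit_div, "Running 2 full runs of testit_div which contains integer division."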