Figuring out a statistically reliable baseline for out of order processors.

Started by nixeagle, May 01, 2012, 02:02:34 AM

dedndave

i would do it as a proc
the speed isn't important - so go for smaller size
if you expand a macro every time, the code will be larger

nixeagle

Quote from: dedndave on May 06, 2012, 05:11:53 PM
i would do it as a proc
the speed isn't important - so go for smaller size
if you expand a macro every time, the code will be larger

Alright, then I'll have to figure out how invoke works :red.

P.S. dedndave, did you test out the latest version of this? It ought to be fairly fast and I'm really concerned with ensuring stability of the test results on your jittery processor. You don't have to paste the results, I'd just appreciate knowing if the outputs for the 8 full sample runs are stable and have a low variance/range.

dedndave

;prototypes should be near the beginning of the source file

RunTest PROTO   :LPVOID,:LPSTR

;
;
;

RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        print   chr$(20,13,10)
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   chr$(20,13,10)
        ret

RunTest ENDP

dedndave

prescott w/htt - XP MCE2005 SP3
{555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555,
555, 555, 555, 555, 555, 555, 555, 555, 555, }

Min:      555 ticks
Max:      555 ticks
Range:    0 ticks
Mean:     555 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1491 calls


nixeagle

Quote from: dedndave on May 06, 2012, 05:28:01 PM
;prototypes should be near the beginning of the source file

RunTest PROTO   :LPVOID,:LPSTR

;
;
;

RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        print   chr$(20,13,10)
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   chr$(20,13,10)
        ret

RunTest ENDP

Cool code! I learned about chr$, so I can now change it to chr$(13,10) :U. However, the invoke stuff still confuses me :(. With the code you gave, I attempted to invoke it with:

invoke RunTest,testit_baseline,"Running testit_baseline."

but got these two errors:

timeit.asm(519) : error A2084: constant value too large
timeit.asm(519) : error A2114: INVOKE argument type mismatch : argument : 2


Does this mean I have to do something fancy to pass a constant string?

Quote from: dedndave on May 06, 2012, 05:31:07 PM
prescott w/htt - XP MCE2005 SP3
{555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555,
555, 555, 555, 555, 555, 555, 555, 555, 555, }

Min:      555 ticks
Max:      555 ticks
Range:    0 ticks
Mean:     555 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 1491 calls


Wow, very awesome! Was that for the baseline or for the "test" function with imul?

Thanks for the awesome help :8)

dedndave

Hutch has a fancy macro for doing that - fn or fnx, i think
you'll have to look in the hlhelp.chm file

you could probably also use chr$
        INVOKE  RunTest,testit_baseline,chr$("Running testit_baseline.")

otherwise, define the string in the .DATA section, then pass a pointer
        .DATA

szStr0 db "Running testit_baseline.",0

        .CODE

        INVOKE  RunTest,testit_baseline,offset szStr0


Wow, very awesome! Was that for the baseline or for the "test" function with imul?
i guess it's the IMUL

nixeagle

Quote from: dedndave on May 06, 2012, 05:48:12 PM
Hutch has a fancy macro for doing that - fn or fnx, i think
you'll have to look in the hlhelp.chm file

you could probably also use chr$
        INVOKE  RunTest,testit_baseline,chr$("Running testit_baseline.")

otherwise, define the string in the .DATA section, then pass a pointer
        .DATA

szStr0 db "Running testit_baseline.",0

        .CODE

        INVOKE  RunTest,testit_baseline,offset szStr0


Wow, very awesome! Was that for the baseline or for the "test" function with imul?
i guess it's the IMUL

Awesome, chr$(...) will work just fine for now. If we want prettier syntax later, after this works, we can do it then. Great to hear that it is IMUL that is so stable :U. Mind double checking that the baseline was stable for me? Of the functions under test, the baseline is the most important one to have stable, as it serves as the yardstick that everything else is compared against. If the baseline has a high variance or range, a baseline sample can come out larger than the sample for the function under test, and subtracting it then gives a negative timing result.

P.S. the invoke now segfaults, but that I can figure out myself :bg. Probably something with the stack :eek.

dedndave

well - you have been doing some strange stuff with stack parameters - lol
(like popping the return address etc)

anyways...
i made a little mistake in 2 places
        print   chr$(20,13,10)
should have been
        print   chr$(32,13,10)

you could also do it this way...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        print   esp
        pop     eax
        ret

RunTest ENDP

dedndave

or, if you return a value in EAX when you call the test...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     [gu_testfunction],eax
        call    run_driver
        push    eax     ;save return value
        lea     edx,[esp+4] ;ESP now points at the saved EAX; the string is 4 bytes up
        print   edx
        pop     eax     ;EAX = return value
        pop     edx     ;discard the string
        ret

RunTest ENDP

dedndave

i bet i know why you are having problems - lol

your run_driver proc probably trashes EBP
quick fix...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     gu_testfunction,eax  ;braces not required
        push    ebp
        call    run_driver
        pop     ebp
        push    eax     ;save return value
        lea     edx,[esp+4] ;ESP now points at the saved EAX; the string is 4 bytes up
        print   edx
        pop     eax     ;EAX = return value
        pop     edx     ;discard the string
        ret

RunTest ENDP

nixeagle

Quote from: dedndave on May 06, 2012, 06:08:44 PM
i bet i know why you are having problems - lol

your run_driver proc probably trashes EBP
quick fix...
RunTest PROC    lpfnFunction:LPVOID,lpszString:LPSTR

        push    0A0D20h
        print   esp
        print   lpszString
        mov     eax,lpfnFunction
        mov     gu_testfunction,eax  ;braces not required
        push    ebp
        call    run_driver
        pop     ebp
        push    eax     ;save return value
        lea     edx,[esp+4] ;ESP now points at the saved EAX; the string is 4 bytes up
        print   edx
        pop     eax     ;EAX = return value
        pop     edx     ;discard the string
        ret

RunTest ENDP


Yea, you and I came to the same solution. Trashing ebp has been the cause of most of my segfaults, so it is the first thing I look for ::). Thanks though, I've copied and pasted your code into the program if that is ok with you :bdg.

I think I'm going to do a quick cleanup of the program (remove commented-out junk and so on) and then do a new release that runs testit_baseline only once, then runs testit_div followed by testit_cdiv. That should help people see some of the uses for this program while I work on actually subtracting the baseline from the other test results. To do so, I'll need to implement a way to detect the most common result from a sample run. That is, if testit_baseline returns '172' twenty-four times and '169' once, we will choose to use 172, treating 169 as an outlier [1].

P.S: If I'm doing strange stuff, I'd be thrilled to have that pointed out to me. I don't mind critiques on how I'm doing things.



  • 1: Lower outliers are usually caused by the CPU overclocking itself for a short period of time. But such overclocking is not consistent, thus worthless for benchmarking short algorithms.
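
The "most common result" idea above can be sketched outside the harness. This is only an illustration in Python, not the code in timeit.asm; the function name `most_common_ticks` is made up, and breaking ties toward the larger value is an assumption (the footnote treats low readings as the suspect ones).

```python
from collections import Counter

def most_common_ticks(samples):
    """Return the tick count that occurs most often in one sample run.

    Assumption: on a tie, prefer the larger value, because low
    readings are the ones treated as outliers in the footnote.
    """
    counts = Counter(samples)
    highest_count = max(counts.values())
    return max(value for value, count in counts.items()
               if count == highest_count)

# The example from the post: 172 seen twenty-four times, 169 once.
samples = [172] * 24 + [169]
print(most_common_ticks(samples))  # 172
```

The harness itself would need the same counting done over its array of samples before the baseline is subtracted.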

nixeagle

Alright, I've done some cleanup on the assembly file and have the runtime of the test harness down to about 15 seconds on this computer. That includes the time required to spin up the processor!

The part of the program that is meant to be changed by users is found at the very bottom of timeit.asm and looks like this:

align 16
runner proc
    invoke RunTest,testit_baseline,chr$("Running testit_baseline.")
    invoke RunTest,testit_div, chr$("Running testit_div.")
    invoke RunTest,testit_cdiv, chr$("Running testit_cdiv.")
   

    print "Please subtract testit_baseline time from both, then divide by 32.",13,10
    print "... yes this program will do that computation automatically soon!",13,10

    ret
runner endp

Here testit_baseline is a function that does the minimal work shared by all the other functions under test. For example, all the functions here push ebx, so the baseline function does as well. This way, anything that can be considered "overhead" and unrelated to the test is factored out.

My output looks like this:

Spinning up the processor.

Running testit_baseline.
Min:      129 ticks
Max:      129 ticks
Range:    0 ticks
Mean:     129 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 800800 calls

Running testit_div.
Min:      440 ticks
Max:      440 ticks
Range:    0 ticks
Mean:     440 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 14434 calls

Running testit_cdiv.
Min:      173 ticks
Max:      198 ticks
Range:    25 ticks
Mean:     173 ticks
Variance: 16 ticks^2
Batch Size: 7
Call Count: 3070267 calls
Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!


As we can see, qWord's cdiv is much faster than the naive division! To compute how much time each function takes, subtract the baseline (129 ticks) and divide by 32:
(440-129)/32
gives 9.71875 ticks per time through the inner loop for testit_div [1]. Compare that to
(173-129)/32
which gives 1.375 ticks per time through the inner loop for testit_cdiv. Clearly qWord's implementation is the faster of the two!
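
For anyone who wants to redo the arithmetic with their own numbers, here is a small Python sketch of the same computation. It uses the baseline of 129 ticks from the output above; the function name `ticks_per_iteration` is made up for this example and is not part of the harness.

```python
def ticks_per_iteration(test_ticks, baseline_ticks, iterations=32):
    """Subtract the shared overhead, then divide by the inner-loop count."""
    return (test_ticks - baseline_ticks) / iterations

BASELINE = 129  # testit_baseline from the run above

print(ticks_per_iteration(440, BASELINE))  # testit_div  -> 9.71875
print(ticks_per_iteration(173, BASELINE))  # testit_cdiv -> 1.375
```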

Attached is the latest program. Please give it a try and tell us about your results! Is qWord's implementation ever slower? Please note that this version of timeit.zip includes qWord's ConstDiv.inc file to make it easier for you guys to recompile timeit.asm with different settings or whatnot. Enjoy! :dance:



  • 1: Note that there is some variance, and thus a standard deviation, in the results for testit_cdiv. The variance is 16 ticks^2, which means the standard deviation is 4 ticks, or 4/32 = 0.125 ticks per time through the inner loop. If we treat the standard deviation as error bounds on our measurement, the time for testit_cdiv varies between {1.25, 1.375, 1.5} ticks. What is this useful for? Those times when multiple algorithms take about the same amount of time! We can tell whether the difference in times is significant or not. If two timings overlap within these error bounds, we cannot tell with certainty which is fastest!
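
The error-bound check in the footnote can be sketched as follows. This is a rough Python illustration only: it applies one standard deviation to the figures printed in the output above, assumes the baseline itself has zero variance (true for this run), and the helper names `bounds` and `overlaps` are made up.

```python
import math

def bounds(mean_ticks, variance, baseline_ticks, iterations=32):
    """Per-iteration (low, high) bounds at one standard deviation.

    Assumption: the baseline has no variance of its own.
    """
    sd = math.sqrt(variance)
    low = (mean_ticks - sd - baseline_ticks) / iterations
    high = (mean_ticks + sd - baseline_ticks) / iterations
    return low, high

def overlaps(a, b):
    """True if the two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

div_bounds = bounds(440, 0, 129)    # variance 0  -> (9.71875, 9.71875)
cdiv_bounds = bounds(173, 16, 129)  # variance 16 -> (1.25, 1.5)

print(div_bounds, cdiv_bounds)
print(overlaps(div_bounds, cdiv_bounds))  # False: the difference is significant
```

If `overlaps` returned True for two algorithms, the measurement would not be good enough to say which one is faster.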

jj2007

Hi,
This is the complete output. I have not understood, though, where to find testit_baseline ::)
Celeron M:
Min:      456 ticks
Max:      480 ticks
Range:    24 ticks
Mean:     479 ticks
Variance: 2 ticks^2
Batch Size: 7
Call Count: 6920857 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...

nixeagle

Quote from: jj2007 on May 06, 2012, 10:33:50 PM
Hi,
This is the complete output. I have not understood, though, where to find testit_baseline ::)
Celeron M:
Min:      456 ticks
Max:      480 ticks
Range:    24 ticks
Mean:     479 ticks
Variance: 2 ticks^2
Batch Size: 7
Call Count: 6920857 calls
☻Please subtract testit_baseline time from both, then divide by 32.
... yes this program will do that computation automatically soon!
Press any key to continue ...


OH! I'm so dumb. No wonder everyone has only been giving one test result. :red I test my programs in Cygwin's bash shell, where cls does not actually do anything, so I never saw the earlier results being wiped. I get the same single result as you when I run it by double clicking on the icon. Attached is a new version that clears the screen only once when it starts up and then leaves it unmodified all the way to the end. You should see all 3 test results now. :bg

dedndave

DOH !

prescott w/htt XP MCE2005 SP3
Running testit_baseline.
Min:      570 ticks
Max:      570 ticks
Range:    0 ticks
Mean:     570 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 555506 calls

Running testit_div.
Min:      1575 ticks
Max:      1575 ticks
Range:    0 ticks
Mean:     1575 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 49938 calls

Running testit_cdiv.
Min:      720 ticks
Max:      720 ticks
Range:    0 ticks
Mean:     720 ticks
Variance: 0 ticks^2
Batch Size: 8
Call Count: 3682 calls