
Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM


MichaelW

I updated and renamed the clock cycle counter macros that I had previously posted to several other threads here, and added a set of timing macros that utilize the high-resolution performance counter. I initially included code to utilize the system timer as a backup for the performance counter, but I eventually decided to remove it because all of the systems I have tested supported the performance counter.

IMO, as a code performance metric, clock cycle counts are better than loop times for two reasons:

1. Clock cycle counts are more comparable across systems, because for a given block of code the clock cycle count varies (mostly) with the processor family, whereas loop times vary with both the processor family and, to a much greater extent, with the clock speed.
2. A clock cycle count can be used as a rough gauge of the level of code optimization. For example, if a code sequence contained 10 simple instructions and consumed 20 clock cycles, the code would likely be far from optimal. But if the same sequence consumed 5 clock cycles, the code would likely be near optimal.
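
To make the mechanism concrete, below is a minimal sketch of the raw idea the cycle-count macros wrap: read the time-stamp counter with RDTSC before and after a loop, do a 64-bit subtraction, and divide by the loop count. This is not the code in timers.asm; the real counter_begin/counter_end macros also set the priority class passed to them and, judging by results further down in this thread, subtract the empty-loop overhead, which this stripped-down version does not.

    .586
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    LOOP_COUNT equ 10000000

    .data
      str1      db "my other brother darryl",0
      startlow  dd 0
      starthigh dd 0
      loopcount dd 0
    .code
start:
    rdtsc                        ; EDX:EAX = time-stamp counter at start
    mov   startlow, eax
    mov   starthigh, edx

    mov   loopcount, LOOP_COUNT
  @@:
    invoke StrLen, ADDR str1     ; code under test
    dec   loopcount
    jnz   @B

    rdtsc                        ; time-stamp counter at end
    sub   eax, startlow
    sbb   edx, starthigh         ; EDX:EAX = total elapsed cycles

    mov   ecx, LOOP_COUNT
    div   ecx                    ; EAX = approximate cycles per iteration

    print ustr$(eax)
    print chr$(" cycles (raw, loop overhead included)",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit

end start

The timer_begin/timer_end macros follow the same pattern, but use the high-resolution performance counter (QueryPerformanceCounter/QueryPerformanceFrequency) and report elapsed milliseconds rather than cycles.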

EDIT: Added a new version of the clock cycle count macros.
Updated the new version to correct problems that surfaced when I tested REALTIME_PRIORITY_CLASS.

http://www.masm32.com/board/index.php?topic=770.msg38831#msg38831


NOTE: The original file was deleted due to removal of old files from the forum; this file is taken directly from the MASM32 distribution.
eschew obfuscation

Vortex

Tested on Windows Server 2003 Standard Edition - Intel Xeon / 2.66 GHz

26 cycles
27 cycles
110 ms
93 ms


Back home, I will test the code on my PIV.

MichaelW

Hi Vortex,

Thanks for testing. I think much of the variation in the loop times is due to a loop count that is too low for a fast system. On my 500 MHz P3 the loop time was ~980 ms with little variation. And I'm surprised at how much lower the cycle count is than the 49 on my P3. But considering that the StrLen procedure is heavily optimized and running on a double-pumped ALU, I guess I shouldn't be.
eschew obfuscation

Vortex

Hi MichaelW,

Thanks for the new version :U

Here are the results on my PIV 2.66 GHz, OS: Win XP Home SP2:

24 cycles
23 cycles
87 ms
86 ms

Vortex

Here is an example comparing the Win32 API function lstrcmp with the MASM32 function szCmp:


    .586
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    .data
      str1 db "my other brother darryl",0
      str2 db "test",0
    .code
start:

    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke lstrcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke szCmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit

end start

Result:

774 cycles
5 cycles

MichaelW

Interesting. I added a test of the CRT strcmp function, and a string where the first mismatch is at the end rather than the beginning.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\msvcrt.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\msvcrt.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      str1 db "my other brother darryl",0
      str2 db "test",0
      str3 db "my other brother darryx",0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke lstrcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke szCmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke strcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke lstrcmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke szCmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke strcmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Result on my P3:

753 cycles
10 cycles
10 cycles
777 cycles
91 cycles
137 cycles


Perhaps this explains the large difference:
Quote
The lstrcmp function uses a word sort, rather than a string sort. A word sort treats hyphens and apostrophes differently than it treats other symbols that are not alphanumeric, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list. For a detailed discussion of word sorts and string sorts, see the Remarks section for the CompareString function.
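
One way to probe that explanation (a sketch, not something posted in the thread) would be to add a CompareString call with the string-sort flag to the test above, so the word-sort handling is bypassed, and compare its cycle count with lstrcmp's. This assumes the MASM32 kernel32 include declares CompareString (mapped to CompareStringA); the two equates are defined locally with the winnls.h values in case windows.inc does not provide them under the standard names.

    ; local equates for illustration (values from winnls.h)
    LOCALE_USER_DEF equ 400h     ; LOCALE_USER_DEFAULT
    STRING_SORT     equ 1000h    ; SORT_STRINGSORT

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke CompareString, LOCALE_USER_DEF, STRING_SORT, ADDR str1, -1, ADDR str3, -1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)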

BTW, the msvcrt include file and import library are the ones that Hutch posted recently; they are not part of the MASM32 distribution.

eschew obfuscation

rags

MichaelW,
Here is the result of Test.exe running on my AMD XP 1600, XP Pro SP1.


36 cycles
35 cycles
254 ms
254 ms

Press enter to exit...
God made Man, but the monkey applied the glue -DEVO

Farabi

Intel P4 2.8 GHz, Win XP.

21 cycles
21 cycles

96 ms
79 ms
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

liquidsilver

Celeron 2.66 GHz, 244 MB RAM, WinXP SP2 (not my PC  ::, mine's better :bg):

37 cycles
36 cycles
135 ms
138 ms

Ghirai

On my P4, 2 GHz, XP Pro SP2:

29 cycles
29 cycles
142 ms
143 ms
MASM32 Project/RadASM mirror - http://ghirai.com/hutch/mmi.html

MichaelW

Interesting:

Xeon 2.66 GHz - 26 cycles
PIV 2.66 GHz - 24 cycles
Celeron 2.66 GHz - 37 cycles

I'm assuming that these were all for the following test program, and that the Celeron is a P4. Anyone care to explain why the Celeron would take ~50% more cycles on a 3.5 KB executable running relatively short loops?

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      str1 db "my other brother darryl",0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ;repeat 3          ; Used to check sensitivity to alignment
    ;  nop
    ;endm
   
    counter_begin 10000000, REALTIME_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    timer_begin 10000000, REALTIME_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    timer_end
    print ustr$(eax)
    print chr$(" ms",13,10)

    timer_begin 10000000, HIGH_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    timer_end
    print ustr$(eax)
    print chr$(" ms",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

eschew obfuscation

Frank

Quote
I'm assuming that these were all for the following test program, and that the Celeron is a P4. Anyone care to explain why the Celeron would take ~50% more cycles on a 3.5 KB executable running relatively short loops?

It may be an artifact of the measurement method. With 10 million iterations, the loop takes 100 milliseconds or more, so several context switches will occur during the measurement (even at high or realtime priority). Because of its smaller L2 cache compared to the other processors [1], the Celeron is likely to require a) more cycles per context switch and, consequently, b) more context switches overall to complete the 10 million iterations. If so, then the additional cycles are actually burnt during context switches, not inside the loop; they do show up as an increase in the average loop cycle count, though.

It might help to confine the code under test to a single 10 ms time slice, and to run an "invoke Sleep,0" immediately before it (just a guess -- you may have been there already).


[1] According to Agner Fog's manual, the built-in level-2 cache is 256 KB for the P4, 512 KB for the Xeon, and 0-128 KB for Celeron processors.

mariø

so quickly...



    counter_begin 10000000, REALTIME_PRIORITY_CLASS
    ;  invoke StrLen, ADDR str1
      push eax
      pop  eax
      ; should be 2 cycles??
    counter_end

-10 cycles


:8)

MichaelW

Frank,

The numbers were for REALTIME_PRIORITY_CLASS, and AFAIK that will cause the program to preempt all other processes. If this is correct (?), there should be no context switches within the loop. Hopefully someone with a P4 Celeron will try to implement your suggestions and post the results.

Was this a misprint "0-128 KB"? Perhaps some Celerons have the entire L2 cache disabled.

Mario,

Maybe systems have now gotten so fast that they do things (other than hang up) in negative time :bg

I think the instruction sequence is too short. On a P3 your code returns 0 cycles, but if I place the two instructions in a REPEAT 100...ENDM block, I get 200 cycles.
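
For reference, the lengthened test would look something like this with the same macros (a sketch; the exact code is not shown in the thread):

    counter_begin 10000000, REALTIME_PRIORITY_CLASS
      REPEAT 100
        push eax
        pop  eax
      ENDM
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)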
eschew obfuscation

Frank

Hi Michael,

Quote
The numbers were for REALTIME_PRIORITY_CLASS, and AFAIK that will cause the program to preempt all other processes. If this is correct (?), there should be no context switches within the loop.

Our "mental models" of how the Windows scheduler works seem to differ.

Model A (cooperative multitasking):
1) Find the highest-priority process that is ready to run;
2) Run it until completion;
3) Goto 1).

Model B (preemptive multitasking):
1) Find the highest-priority process that is ready to run;
2) Run it until completion, or until the 10ms time-slice is over, whichever comes first;
3) Goto 1).

My understanding is that Model A describes Windows 3.1, whereas Model B describes Windows 95 and above. If so, then even a realtime-priority process will be interrupted briefly every 10ms. The realtime priority level just makes sure that the scheduler will return control to that very process immediately. But it does not prevent the scheduler from interrupting the process every 10ms in the first place.

Quote
Hopefully someone with a P4 Celeron will try to implement your suggestions and post the results.

I don't have 2.66 GHz machines like those referred to in your post above, but I could run a modified version of your program on a P4 1.7 GHz, and on a P4 Celeron 1.7 GHz as well. With a loop count of 1000 (as opposed to the original 10 million), and an "invoke Sleep,0" immediately before the "counter_begin" macro, I get a result of 26 cycles on both of these machines. Moreover, I get these 26 cycles on every run; there is absolutely no variability (it's never 25 or 27 cycles).
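
In terms of the test program posted earlier, that modification would look roughly like this (a sketch; Frank's exact code is not shown):

    invoke Sleep, 0              ; give up the rest of the current time slice
    counter_begin 1000, REALTIME_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)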

Quote
Was this a misprint "0-128 KB"? Perhaps some Celerons have the entire L2 cache disabled.

No misprint. The figures come from Agner Fog's Pentium optimization PDF, version 2004-04-16; see the table on the very last page (and the footnote there).

Regards,

    Frank