The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: MichaelW on February 16, 2005, 03:21:52 AM

Title: Code timing macros
Post by: MichaelW on February 16, 2005, 03:21:52 AM
I updated and renamed the clock cycle counter macros that I had previously posted to several other threads here, and added a set of timing macros that utilize the high-resolution performance counter. I initially included code to utilize the system timer as a backup for the performance counter, but I eventually decided to remove it because of the systems I have tested all of them supported the performance counter.

IMO, as a code performance metric clock cycle counts are better than loop times for two reasons:

1. Clock cycle counts are more comparable across systems because for a given block of code the clock cycle count varies (mostly) with the processor family, where loop times vary with both the processor family and, to a much greater extent, with the clock speed.
2. A clock cycle count can be used as a rough gauge of the level of code optimization. For example, if a code sequence contained 10 simple instructions and it consumed 20 clock cycles, the code would likely be far from optimum. But if the same sequence consumed 5 clock cycles, the code would likely be near optimum.

EDIT: Added a new version of the clock cycle count macros.
Updated new version to correct problems that surfaced when I tested REALTIME_PRIORITY_CLASS.

http://www.masm32.com/board/index.php?topic=770.msg38831#msg38831


NOTE: The original file was deleted due to removal of old files from the forum, this file is directly from the MASM32 distribution.
Title: Re: Code timing macros
Post by: Vortex on February 16, 2005, 10:47:21 AM
Tested on Windows Server 2003 Standard Edition - Intel Xeon / 2.66 Ghz

26 cycles
27 cycles
110 ms
93 ms


Back home, It will test the code on my PIV
Title: Re: Code timing macros
Post by: MichaelW on February 16, 2005, 04:15:25 PM
Hi Vortex,

Thank for testing. I think much of the variation on the loop times is due to a loop count that is too low for a fast system. On my 500MHz P3 the loop time was ~980ms with little variation. And I'm surprised at how much lower the cycle count is than the 49 for my P3. But considering that the StrLen procedure is heavily optimized and running on a double-pumped ALU, I guess I shouldn't be.
Title: Re: Code timing macros
Post by: Vortex on February 16, 2005, 06:34:19 PM
Hi MichaelW,

Thanks for the new version :U

Here are the results on my PIV 2.66 GHz  O.S : Win Xp Home Sp2

24 cycles
23 cycles
87 ms
86 ms
Title: Re: Code timing macros
Post by: Vortex on February 20, 2005, 07:19:08 PM
Here is an example comparing Win32 API function lstrcmp with the masm32 function szCmp:


    .586
    .model flat, stdcall
    option casemap :none

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm

    .data
      str1 db "my other brother darryl",0
      str2 db "test",0
    .code
start:

    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke lstrcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
    invoke szCmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit

end start

Result:

774 cycles
5 cycles
Title: Re: Code timing macros
Post by: MichaelW on February 20, 2005, 08:14:18 PM
Interesting. I added in a test of the CRT strcmp function and a string where the first mismatch is at the end rather than the beginning.

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc
    include \masm32\include\msvcrt.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib
    includelib \masm32\lib\msvcrt.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      str1 db "my other brother darryl",0
      str2 db "test",0
      str3 db "my other brother darryx",0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke lstrcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke szCmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke strcmp,ADDR str1,ADDR str2
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke lstrcmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke szCmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke strcmp,ADDR str1,ADDR str3
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Result on my P3:

753 cycles
10 cycles
10 cycles
777 cycles
91 cycles
137 cycles


Perhaps this explains the large difference:
Quote
The lstrcmp function uses a word sort, rather than a string sort. A word sort treats hyphens and apostrophes differently than it treats other symbols that are not alphanumeric, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list. For a detailed discussion of word sorts and string sorts, see the Remarks section for the CompareString function.

BTW, the msvcrt include file and import library are the ones that Hutch posted recently, they are not part of the MASM32 distribution.

Title: Re: Code timing macros
Post by: rags on February 21, 2005, 01:54:17 AM
MichaelW,
Here is the result of Test.exe running on my AMD XP 1600, XP Pro SP1.


36 cycles
35 cycles
254 ms
254 ms

Press enter to exit...
Title: Re: Code timing macros
Post by: Farabi on February 21, 2005, 12:34:05 PM
Intel p4 2.8 ghz Win XP.

21 cycles
21 cycles

96 ms
79 ms
Title: Re: Code timing macros
Post by: liquidsilver on March 15, 2005, 09:48:33 AM
Celeron 2.66 Ghz 244 RAM WinXP SP2 (not my PC  ::), mine's better :bg):

37 cycles
36 cycles
135 ms
138 ms
Title: Re: Code timing macros
Post by: Ghirai on March 17, 2005, 12:17:32 PM
On my P4, 2 GHz, XP Pro SP2:

29 cycles
29 cycles
142 ms
143 ms
Title: Re: Code timing macros
Post by: MichaelW on March 17, 2005, 02:37:01 PM
Interesting:

XEON 2.66 GHz - 26 cycles
PIV 2.66 GHz - 24 cycles
Celeron 2.66 GHz - 37 cycles

I'm assuming that these were all for the following test program, and that the Celeron is a P4. Anyone care to explain why the Celeron would take ~50% more cycles on a 3.5 KB executable running relatively short loops?

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm

    include timers.asm
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      str1 db "my other brother darryl",0
    .code
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    ;repeat 3          ; Used to check sensitivity to alignment
    ;  nop
    ;endm
   
    counter_begin 10000000, REALTIME_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    counter_begin 10000000, HIGH_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    counter_end
    print ustr$(eax)
    print chr$(" cycles",13,10)

    timer_begin 10000000, REALTIME_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    timer_end
    print ustr$(eax)
    print chr$(" ms",13,10)

    timer_begin 10000000, HIGH_PRIORITY_CLASS
      invoke StrLen, ADDR str1
    timer_end
    print ustr$(eax)
    print chr$(" ms",13,10)

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Title: Re: Code timing macros
Post by: Frank on March 17, 2005, 05:32:13 PM
Quote
I'm assuming that these were all for the following test program, and that the Celeron is a P4. Anyone care to explain why the Celeron would take ~50% more cycles on a 3.5 KB executable running relatively short loops?

It may be an artifact of the measurement method. With 10 million iterations, the loop takes 100 milliseconds and more. Thus, several context switches will occur during measurement (even at high or realtime priority). Due to its smaller L2-cache as compared to the other processors[1], the Celeron is likely to require a) more cycles per context switch, and, consequently, b) more context switches overall to complete the 10 million iterations. If so, then the additional cycles are actually burnt during context switches, not inside the loop. They do show up as an increase in the average loop cycle count, though.

It might help to confine the to-be-tested code within a time-window of 10ms, and to run an "invoke Sleep,0" immediately before (just a guess -- you may have been there already).


[1] According to Agner Fog's manual, the built-in level-2 cache is 256kb for P4, 512kb for Xeon, and 0-128kb for Celeron processors.
Title: Re: Code timing macros
Post by: mariø on March 17, 2005, 06:25:16 PM
so quickly...



counter_begin 10000000, REALTIME_PRIORITY_CLASS
;      invoke StrLen, ADDR str1
push eax
pop eax
;should be 2 cyclces??
    counter_end

-10 cycles


:8)
Title: Re: Code timing macros
Post by: MichaelW on March 17, 2005, 08:46:48 PM
Frank,

The numbers were for REALTIME_PRIORITY_CLASS, and AFAIK that will cause the program to preempt all other processes. If this is correct (?), there should be no context switches within the loop. Hopefully someone with a P4 Celeron will try to implement your suggestions and post the results.

Was this a misprint "0-128kb"? Perhaps some Celerons have the entire L2 cache disabled.

Mario,

Maybe systems have now gotten so fast that they do things (other than hang up) in negative time :bg

I think the instruction sequence is too short. On a P3 your coding returns 0 cycles, but if I place the two instructions in a REPEAT 100...ENDM, I get 200 cycles.
Title: Re: Code timing macros
Post by: Frank on March 18, 2005, 11:56:31 AM
Hi Michael,

Quote
The numbers were for REALTIME_PRIORITY_CLASS, and AFAIK that will cause the program to preempt all other processes. If this is correct (?), there should be no context switches within the loop.

Our "mental models" of how the Windows scheduler works seem to differ.

Model A (cooperative multitasking):
1) Find the highest-priority process that is ready to run;
2) Run it until completion;
3) Goto 1).

Model B (preemptive multitasking):
1) Find the highest-priority process that is ready to run;
2) Run it until completion, or until the 10ms time-slice is over, whatever comes first;
3) Goto 1).

My understanding is that Model A describes Windows 3.1, whereas Model B describes Windows 95 and above. If so, then even a realtime-priority process will be interrupted briefly every 10ms. The realtime priority level just makes sure that the scheduler will return control to that very process immediately. But it does not prevent the scheduler from interrupting the process every 10ms in the first place.

Quote
Hopefully someone with a P4 Celeron will try to implement your suggestions and post the results.

I don't have 2.66 GHz machines (as were referred to in your original post), but I could run a modified version of your program on a P4 1.7GHz, and on a P4 Celeron 1.7 GHz as well. With a loop count of 1000 (as opposed to the original 10 million), and an "invoke Sleep,0" immediately before the "counter_begin" macro, I get a result of 26 cycles on both of these machines. Moreover, I get these 26 cycles on each run, there is absolutely no variability (it's never 25 or 27 cycles).

Quote
Was this a misprint "0-128kb"? Perhaps some Celerons have the entire L2 cache disabled.

No misprint. The figures stem from Agner Fog's Pentium optimization PDF, version 2004-04-16, see the table on the very last page (and there, the footnote).

Regards,

    Frank
Title: Re: Code timing macros
Post by: MichaelW on March 18, 2005, 04:13:21 PM
Thanks Frank.

Just to make sure I understand, the Sleep, 0 is so the counter will start near the beginning of a new time-slice?

And did you run the test with the original loop count?
Title: Re: Code timing macros
Post by: Frank on March 18, 2005, 06:41:04 PM
Quote
Just to make sure I understand, the Sleep, 0 is so the counter will start near the beginning of a new time-slice?
Yes, exactly.

Quote
And did you run the test with the original loop count?
Just did it. Here are the results of 10 consecutive runs of your TEST.exe from the TimingMacros.zip. The Celeron runs WinXP SP2, and the P4 runs Win2k SP4 (sorry, I can't hold the OS constant with reasonable effort).


Celeron    P4
RP HP    RP HP
-----    -----
25 26    26 27
25 32    26 27
25 27    25 28
30 25    27 27
25 27    26 27
27 25    26 27
28 25    25 27
29 27    25 28
26 25    25 27
25 27    25 27
--------------
RP: realtime priority
HP: high priority


Interestingly, the means are almost identical for Celeron (26.55) and P4 (26.40). The difference between priority levels is somewhat larger (realtime priority: 26.05, high priority: 26.90). According to my statistics software, there is significantly more variability around the mean for the Celeron/XP machine than for the P4/W2k machine. The constant cycle count of 26 that I reported above for the 1000-iterations-only version of the program was also from close to 10 runs on each of the machines, but I did not keep a precise record. Together it looks more like differences in the cycle count are due to the operating system than to the hardware.

Regards

    Frank
Title: Re: Code timing macros
Post by: liquidsilver on March 19, 2005, 01:25:31 PM
Please note that those results from the Celeron 2.66 were from a school PC running a few background processes. These results did vary by a few cycles or ms, but those were the most common. I'm surprised that they varied at all, but I guess it was cause of the OS.
Interesting code though, very easy to implement :U, but those varied results must go :tdown Why did you make it run 10 million times, is that needed?
Title: Re: Code timing macros
Post by: Jimg on March 19, 2005, 03:25:59 PM
Reading through this thread, I notice that generally, it seems like intel chips take about 26 cycles and amd chips about 36 cycles.  Is there a reason for this?
Title: Re: Code timing macros
Post by: x86asm on March 25, 2005, 07:01:17 PM
AMD AthlonXP 1900+:

36 cycles
36 cycles
222 ms
222 ms

Title: Re: Code timing macros
Post by: rags on March 27, 2005, 04:05:10 AM
On My new AMD Sempron 2500+, win xp pro sp1:
36 cycles
35 cycles
204 ms
203 ms

It does seem to me that it takes the same number of cycles, regardless of amd processor to complete the test.
EDIT: Time wise I reduced the time by 20%, over my older athlonxp 1600
Title: Re: Code timing macros
Post by: MichaelW on March 27, 2005, 07:57:53 AM
My K5-112 is an AMD processor
33 cycles
33 cycles
2962 ms
2969 ms
:bg


Title: Re: Code timing macros
Post by: Tedd on March 31, 2005, 09:31:07 AM
Just to join in the fun...

AMD K6-III 400MHz
41 cycles
41 cycles
1051 ms
1051 ms


There goes the theory about number of cycles? :bdg
Title: Re: Code timing macros
Post by: Rolan on April 02, 2005, 10:50:01 PM
joining the fun with pride  :lol
AMD A64 3500+
28 cycles
29 cycles
130 ms
131 ms
Title: Re: Code timing macros
Post by: Jeaton on April 08, 2005, 07:30:44 PM
36 cycles
36 cycles
124 ms
124 ms

Celeron D 2.93 Ghz; XP Home SP2
Title: Re: Code timing macros
Post by: Jimg on April 09, 2005, 03:24:47 PM
I'm using an Athlon 3000+ at 2.16 GHz

I was a little concerned at how slow my computer was compared to the intel chips, so I took the program and copied the StrLen function at the end of the code and changed it's name rather than calling it from the Masm library to see what was going on.
The test immediately ran over 10 percent faster.  My curiosity now took over, and I tried inserting an Align 4 before the routine.  Better still.  I then tried inserting an increasing number of bytes before the routine, and found the best was inserting 15 extra bytes after the align 4.
Here are the results:

               StrLen
       Using   Moved  align  insert
        lib    from     4      1      2       7       15
                lib           byte   bytes   bytes   bytes
cycles  36      32      31     31     32      30      29
  ms   166     150     142    139    147     138     134



[attachment deleted by admin]
Title: Re: Code timing macros
Post by: MichaelW on April 09, 2005, 10:36:33 PM
I tested the macros to ensure that the timing would not be (significantly) affected by the alignment of the macro calls and the calls to a library procedure. I think I also tested with a block of code rather than an invoke between the macro calls, but at this point I'm not sure. In any event, I see no way to control the alignment of a called procedure from within the macros.

The source for the library procedure, surprisingly, does not contain an alignment directive, so this could explain the difference in cycle counts. When I modify your code to invoke first StrLen and then StrLen2, both with REALTIME_PRIORITY_CLASS, on a P3 under Windows 2000 I get:

49 cycles
44 cycles

Replacing your alignment code with just a db 14 dup(0), I get:

45 cycles
49 cycles

So it appears to be just an alignment effect. I think the (align 4)+15 works because it is placing the most important part of the loop code at the best alignment. I would be interested in seeing how a version of the procedure optimized for the Athlon would perform.
Title: Re: Code timing macros
Post by: Jimg on April 10, 2005, 02:42:26 PM
After a lot more testing, it seems there can be up to a 20 percent difference dependiing on the placement.  It all cycles around 16 bytes so one would have to do an align 16 followed by inserting the best number of bytes for optimum timing.  This could be up to 30 extra bytes wasted, but not too bad for a time critical routine.  Of course, judicious placement of fillers within the routine could help also.
Title: Re: Code timing macros
Post by: Mark Jones on April 18, 2005, 06:58:50 PM
Quote from: Shakain on March 17, 2005, 09:14:38 PM
The results for a AMD XP 1800, XP pro (default)

36
36
307
307 *

* this number changes between runs, I saw it under 300 at some time. Strange ?


Interesting.... AMD XP 1800, XP Pro SP1:

36 cycles
36 cycles
196 ms
196 ms

(Also interesting that my AMD 1800 is faster than Rags' AMD Semperon 2500+ at these tests, and almost keeps up with the P4 2600???) :dazzled:

Lets not forget about overclocking, hardware, and driver contributions to this test. I'm using a highly-optimized (but not overclocked) SolTek nVIDIA mainboard. I also update all drivers every once-in-a-while. This has been the best PC ever (knock on wood) so if you're ever in the market for a new system, check out SolTek for your next mainboard: http://www.soltek.de/soltek/news/index.php (Innocent plug.) :U

Edit: Soltek is long-gone. :(
Title: Re: Code timing macros
Post by: Citric on May 17, 2005, 10:22:58 AM
Quote from: Citric on May 17, 2005, 10:19:02 AM
Hi All
I have ported MichaelW code timing macros to GoAsm but they dont seem to give the same results.

test.exe displays results for the masm32.lib StrLen procedure, at HIGH_PRIORITY_CLASS

Please test them and post your results, maybe with a comparsion to MichaelW macros.

Cheers Adam
ps Any Guru, please have a look to see if i have gone wrong somewhere to get different result to the masm macros.
ps I am building them with "GoAsm.Exe Version 0.52.4 beta" and "GoLink.Exe Version 0.25.4"



Please have a look over at the GoAsm forum.
Title: Re: Code timing macros
Post by: Webring on June 21, 2005, 04:34:29 PM
Great macros MichaelW, they've helped me to optimize what I thought was already a fast byte conversion alg to make it truely fast. :dance: and Citric i'll take a look at your code when I get home.
Title: Re: Code timing macros
Post by: Ratch on June 24, 2005, 04:17:45 AM
MichaelW,
     I have two reservations about your timing macros.

1)  To quote INTEL, the people who should know:
QuoteWhen using the CPUID instruction, however, the programmer must also take into account the cycles it takes for the
instruction to complete, and subtract this from the recorded number of cycles. A strange quirk of the CPUID
instruction is that it can take longer to complete the first couple of times it is called. Thus, the best policy is to call the
instruction three times, measure the elapsed time on the third call, then subtract this measurement from all future
measurements.
http://www.math.uwaterloo.ca/~jamuir/rdtscpm1.pdf

     I do not see your macro calling the CPUID/RDTSC three times.

2)  I believe you loop through the test code X number of times, accumulate the time/counts  and divide by the loop count for a average.  Sounds logical.  My method is to loop the test code X number of times, select the shortest time, and discard the longer times.  I believe this method gives a lower bound of what the machine and algo are capable of doing.  This method automatically discards "unlucky" combinations of time slices, interrupts, or other time/count stealers over which the user has little or only indirect control.  Since I switched to this method, my results have been remarkably consistent compared to the average method.  Try it, you will probably like it.  Ratch
Title: Re: Code timing macros
Post by: MichaelW on June 24, 2005, 06:09:35 AM
Ratch,

Thanks for the information on the CPUID instruction. As I read it I would simply need to call CPUID a few times before I set up the empty loop. The example code calls CPUID and RDTSC three times, but I see nothing in the text about needing to "warm-up" RDTSC. Also, the code contains an obvious error in that it fails to control the CPUID function number, which on my P3 has a relatively large effect on the timing.
EAX = 0, 79 cycles
EAX = 1, 81 cycles
EAX = 2, 105 cycles
EAX >= 3, 129 cycles

I originally considered using only the lowest count, but I decided against it because I could not, and do not, see any benefit in doing so. I understand the principle well enough, but what could be the benefit of knowing how the code performs under optimum execution conditions when it will not typically be used under optimum execution conditions. A code sequence that is optimal for optimum execution conditions may not be optimal for average execution conditions. But I could be wrong, and if I am I hope that someone will enlighten me.

In any event, consistency seems to me to be the most important aspect here, and with high loop counts, I get very consistent results for most code, even when using HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, and even with a bunch of other crap running on the system. For example, for the timings above, running the code below on my P3-500 with 38 processes showing in Task Manager, including a firewall and a live Internet connection, I ran the tests 10+ times and the results were identical each time.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .586
    include timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT EQU 10000000

    FOR func,<0,1,2,3,4,5>
      counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        mov  eax, func
        cpuid
      counter_end
      print ustr$(eax)," cycles",13,10
    ENDM

    mov   eax, input(13,10,"Press enter to exit...")
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


EDIT:

As a test I added 3 calls to CPUID as described above, and there was no significant change in the timings (the EAX = 2 cycles was 104 instead of 105 on most runs).

Title: Re: Code timing macros
Post by: Petroizki on June 28, 2005, 05:11:27 PM
Note that the cpuid instruction changes the ebx register, when it is called with eax = 0.

The counter_begin and counter_end changes ebx register, without preserving it.
Title: Re: Code timing macros
Post by: MichaelW on June 30, 2005, 07:42:12 AM
Hi petroizki,

Yes, because of the CPUID calls EBX is being changed, and this could be a problem for user code that depends on EBX being preserved. The change in EBX is not specific to CPUID function 0 -- all of the standard and extended functions (that I have seen documented) change EBX. The only reasonable solution that I can see would be to preserve EBX at the start of counter_begin and restore it at the end of counter_end. This is one of the changes that I am planning on making just as soon as I get a chance to focus on something other than work.

Title: Re: Code timing macros
Post by: SoKaR on July 06, 2005, 11:06:28 AM
29 cycles
29 cycles
192 ms
192 ms

Centrino 1,5 ghz Dothan
XP Home, SP2
Title: Re: Code timing macros
Post by: Codewarp on July 19, 2005, 01:13:40 AM
Quote
Posted on: June 23, 2005, 11:09:35 pmPosted by: MichaelW 

I originally considered using only the lowest count, but I decided against it because I could not, and do not, see any benefit in doing so. I understand the principle well enough, but what could be the benefit of knowing how the code performs under optimum execution conditions when it will not typically be used under optimum execution conditions. A code sequence that is optimal for optimum execution conditions may not be optimal for average execution conditions. But I could be wrong, and if I am I hope that someone will enlighten me.

In any event, consistency seems to me to be the most important aspect here, and with high loop counts, I get very consistent results for most code, even when using HIGH_PRIORITY_CLASS instead of REALTIME_PRIORITY_CLASS, and even with a bunch of other crap running on the system. For example, for the timings above, running the code below on my P3-500 with 38 processes showing in Task Manager, including a firewall and a live Internet connection, I ran the tests 10+ times and the results were identical each time.

I have been disturbed by these comments, ever since I read them a while back, but it took me a while to become clear about it.  Let me take a stab at it and see where it goes...

(1)  Taking the smallest clock count

There is always a model to these things, whether explicily stated or not.  The model here is peak-performance--the smallest clock count obtainable, usually found and reported after many reruns.  It's simple and easy to reproduce, to evaluate and to know and understand what it means.  But now, you have arbitrarily decided that you are not going to report the lowest clock count, based on some real-world computing notion you have.

Now, on the surface of it, that sounds really great!  The trouble is, you have not offered any basis for a different model.  I would love to have another model to evaluate cycle counts against--then we could show both metrics together, getting ever closer to the true we seek...  However, I suggest that until you can clearly and unambiguously define this "real world computing" notion and connect it to what you are actually doing, you must admit that your method is to insert noise into the results of a peak-performance model, making fuzzy results that are harder to interpret and compare.  The fact that your results are "very consistent" is good, but not sufficient to overcome this defficiency.

(2)  Messing with thread priority for "benchmarking" purposes

I have been successfully benchmarking at the clock cycle level for many years and I have never found it necessary to raise thread priority in order to measure performance.  Many applications will run differently than "normal", often skewing the timing results--so much for real-world benchmarking...  Once a benchmark is rerun many times and the outliers peeled off (as your methods do quite nicely), you can pretty much run at any priority--so why not use the level it will ultimately deployed with?  Actually, I routinely run my software at slightly below :eek normal priority, unless I really need different specific level.  Everthing runs just as fast, but not even hardcore computation and infinite loops at this priority will ever bog down the machine for anything else running.  It leaves other programmers scratching their heads... :lol and my benchmarks are stable as well. :dance:  Messing with thread priority for "benchmarking" purposes is a bad idea rooted in lack of experience and superstition. :tdown

Title: Re: Code timing macros
Post by: l_d_allan on February 21, 2006, 03:38:34 PM
<alert comment="masm32 newbie">

Would it be appropriate to have timers.asm (or something equivalent) be part of the standard distribution ... rather than something you have to track down to get test code to work?

Quote  ; These macros require a .586 or higher processor directive.

Is there code that works adequately with a .486 processor? .386 processor? This could perhaps be something like the Windows hi-res QueryPerformanceCounter.

Could a "production release version" detect the processor and use the applicable code?

</alert>
Title: Re: Code timing macros
Post by: MichaelW on February 21, 2006, 04:36:11 PM
Actually, although I had not previously noticed, the statement regarding the .586 or higher processor directive is not correct. The counter macros do require .586 or higher because of the CPUID and RDTSC instructions, but the timer macros, assuming _EMMS is not defined, should work OK with .486, or even .386 combined with .387.

Detecting the processor/processor features is no problem, but I know of no method of counting clock cycles that does not depend on RDTSC.


Title: Re: Code timing macros
Post by: jdoe on April 09, 2006, 08:48:36 PM

Hi,

I have downloaded the timing macros to do some test and it seems that it don't work anymore with MASM32 v9.

The error come from using crt__ultoa instead of dwtoa in the ustr$ (macros.asm)
Simply comments crt__ultoa and uncomments dwtoa make it work again.

Title: Re: Code timing macros
Post by: MichaelW on April 09, 2006, 09:10:15 PM
Hi jdoe,

Thanks for pointing this out. I knew about the change in macros.asm, but I had not thought to update the test app. The problem is in the includes, and adding these statements will correct it:

include \masm32\include\msvcrt.inc
includelib \masm32\lib\msvcrt.lib


Title: Re: Code timing macros
Post by: johnsa on July 05, 2006, 11:05:16 AM
Hey all,

Was just reading an article on Intel's website regarding the innacuracies of using RDTSC due to speed step and other energy saving mechanisms which alter the CPU frequency. Apparently they recommend ETimer which uses Windows QueryPerformanceCounter and Frequency functions, which apparently won't use RDTSC if Energy Saving is present in the machine, they'll use another timer such as the power-saving or bios etc.

Any thoughts on this? (Personally I'd much prefer to stick to rdtsc)  :wink
Title: Re: Code timing macros
Post by: Ossa on July 05, 2006, 01:05:00 PM
As far as I understand it, RDTSC counts CPU cycles, which is what we're interested in - we don't care about the accurate "real" time. Therefore, when you have speed step change the only errors from RDTSC that occur are when the task is switched mid-way through a timing. That is why taking a series of identical measurements is important.

The actual timings from RDTSC when a processor is at low or high speed is identical (I have a laptop that does 800 at "low speed" and 2200 at "full speed" - they come out with identical timings - this was very hard to find out empirically though).

However, to avoid measuring during the change, I often add "spin-up" code at the start (like a loop doing many adds/mults). But as I said before, the best idea is to use multiple timings and either use the mean (or better yet, the mode).

Ossa
Title: Re: Code timing macros
Post by: johnsa on July 05, 2006, 01:47:57 PM
I agree that the key information we want is the cycle-count, but having the accurate ms value is useful too. I'm sure there are scenarios where you might have a higher cycle count but fewer ms depending on parallel execution etc etc.. so some simple code may use more cycles but run in fewer ms... or am I smoking my socks?  :bg
Title: Re: Code timing macros
Post by: Ratch on July 05, 2006, 01:49:33 PM
Ossa,

     I have found it extremely difficult to get timing results which I am confident are correct.  So much so that I have stopped doing timings.  It doesn't seem to make much difference anyway.  I just try to do basic techniques like avoiding extensive use LOOP, REP MOVSB, etc.  And arranging instructions so that a register has time to "settle" before using it again.  Other "tricks" are found in optimization manuals.  Although sometimes I can get consistency, oftentimes just by doing a minor thing like changing the position of a block of code, rebooting, or even using one different instruction makes a large timing change that is unexplainable.  I believe that is because of cache effects and the "mix" of other background programs.  I think that the best timing result is the fastest time you can get.  Anything slower is caused by factors other than a slow instruction sequence or algorithm.  So to iterate, if you want to see what your best instruction sequence is, use the fastest single time.  If you want to see how well your code gets along with the cache and its neighbor programs, use the mean, median, or mode averages.  Ratch
Title: Re: Code timing macros
Post by: stanhebben on July 06, 2006, 08:54:27 PM
I've been doing some experiments with the macros, and I had strange results at times. Some very short tests ended up with negative (!) times - but an empty loop always timed at zero cycles.
Title: Re: Code timing macros
Post by: Mark Jones on July 06, 2006, 09:15:26 PM
Could it be an artifact from prefecting/caching? If using an AMD, run the app through CodeAnalyst. It can tell you exactly what is going on. :bg

We must realize that even if a spot-on measurement were possible, it is only good for a particular CPU + Mainboard + OS. The numerous combinations of these elements make the test results very specific by nature, so accuracy is partially invalidated anyways. I suspect that a Real-Mode timer might be only marginally better.

Also note, if the timer is not adaptive to a missing stack frame or an unbalanced stack, (say it is a PROC timer but you run it on a stack-frame-less-PROC) it could leave you with negative result. With Michael's code timer I would get a zero or -1 result timing 4 NOPs, even though each should take 0.25 cycle on my CPU, so perhaps something was a little amiss there. Petriozki's routines seem to be rock-stable and quite accurate, have you tried those? :8)
Title: Re: Code timing macros
Post by: MichaelW on July 07, 2006, 09:28:59 AM
I finally got around to updating the cycle count macros. This time, in an effort to minimize the run to run variation in the cycle counts, instead of returning the average cycle count across the entire loop, I'm returning the lowest count that occurs in a single loop, as Ratch and others here have suggested. Assuming a suitable sample size, for code that executes in less than about 10000 cycles the run to run variation is now near zero. I also corrected the problem with CPUID altering the value in EBX. And because there is no need to calculate an average, there is no FPU code, and no need for EMMS. Everything seems to be working correctly, but I have done only a very limited amount of testing. I added the attachment to the first post in this thread.

Title: Re: Code timing macros
Post by: Chip on August 02, 2006, 01:31:41 PM

I read with interest this thread concerning timing.   

Seeing that you recently updated your macros,
I ran them on my AMD FX53.

Chip


Test.exe

     26 cycles
     26 cycles
   113 ms
   113 ms


C2Test2.exe

     HIGH_PRIORITY_CLASS
     -168 cycles, empty
     0 cycles, mov eax,1
     2 cycles, mov eax,1 mov eax,2
     1 cycles, nops 4
     -166 cycles, mul ecx
     0 cycles, rol ecx,32
     4 cycles, rcr ecx,31

However, this second program crashed.  I tried to get the debug
information to no avail.  However, it did display  the following info:

AppName: c2test2.exe    AppVer: 0.0.0.0    ModName: c2test2.exe
ModVer: 0.0.0.0    Offset: 00001a20
Title: Re: Code timing macros
Post by: MichaelW on August 02, 2006, 06:57:21 PM
Thanks for testing. The test app was intended to test the worst-case functionality. In my view it is not reasonable to expect meaningful clock cycle counts for an instruction or instruction sequence that executes in less than a few clock cycles because:

The test is running in a multitasking environment where context switches cannot be avoided.

The timed instructions cannot, in general, be completely isolated from the timing instructions.

The counter has a resolution of one clock cycle.

On my system I do see an occasional negative count, but generally only for instructions or instruction sequences that otherwise execute in a small number cycles, and I decided that eliminating them was not worth the effort.

I'm not sure why the REALTIME_PRIORITY_CLASS tests failed on your system, as they work OK on my systems (all of which are slow by today's standards). I do not normally use REALTIME_PRIORITY_CLASS for reasons that I explained in the comments.

What version of Windows are you running, and what were the cycle counts for div ecx and StrLen?


Title: Re: Code timing macros
Post by: Chip on August 03, 2006, 09:37:54 AM
Michael,

I ran  the tests on Windows XP with service pack 2.

I only ran only your executables, and did not attempt to integrate your macro's
into any of my code.  (I am new to assembler, and somewhat leary of
experimenting on my own disasterous code.  (Smile)  Yet, there will come a time
when I will need to analyse my results.  I look  forward to using your macro's then.

As an aside, I found this whole thread interesting and educational.  Your dialogue
with others demonstrated a sincere interest in, and knowledge of, the system
internals.  Your inclusion of thier thoughts in your final version showed a sense
of maturity that I often find lacking in online forums.

Chip
Title: Re: Code timing macros
Post by: hutch-- on August 03, 2006, 12:34:11 PM
I am pretty much of the view that absolute times are a fantasy as there are too many other factors. The closest I have seen is Agner Fog's real mode timing methods but they do not translate directly to ring3 protected mode programming.

I have long ago settled for relative timings as they tend to be consistent across different machines under different conditions. Start with an algo you want to make fast, make a copy of it slightly renamed and run both in the same test piece. Run the test piece for a half a second or longer and the timings start to become very consistent. Then you can modify the copy to see if you can get it faster relative to the first one.

The test piece design depends on your target, if the algo has to do long single runs on data, a very big test piece of data works best but if the algo gets looped many times on short data you need to design a test for that target.

Cycle counts are a waste of time as post 486 machines use multiple pipelines and the action is scheduling, not cycle counting ALA 386 and earlier technology. This is finally why a real time test piece is more use in that what it does equates to the task at hand.
Title: Re: Code timing macros
Post by: M4D45M on August 14, 2006, 04:22:41 PM
good to see the macros becoming better and better.
as there exist some problems with timing, i just wanted to ask
if you considered these issues:

1. 'affinity' mask to ensure reading from same cpu (multiprocessor machines/multi-core machines)
2. RDTSC not reliable using mobile processors coz it assumes fixed clock-cycles
     http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/Game_Timing_and_Multicore_Processors.asp

3. QueryPerformanceCounter may unexpectedly leap forward several seconds on some chipsets
     http://support.microsoft.com/default.aspx?scid=kb;en-us;274323
Title: Re: Code timing macros
Post by: wizzra on November 15, 2006, 08:31:04 PM
AMD Athlon 64 X2 3800+


24 cycles
-29 cycles
125 ms
116 ms


:eek
Title: Re: Code timing macros
Post by: FairLight on January 04, 2007, 04:14:26 PM
Pentium M 750 Dothan 1,86 Ghz @ 2,5 Ghz (on Asus P4C800-E Dlx.)

29 Cycles
29 Cycles
119 ms
120 ms

Title: Re: Code timing macros
Post by: ramguru on January 06, 2007, 11:46:49 AM
It would be nice if someone good with macros could make them work in POASM...
Title: Re: Code timing macros
Post by: MichaelW on January 06, 2007, 03:22:13 PM

24 cycles
-29 cycles
125 ms
116 ms

When you see large negative counts like this it probably indicates that the system interfered with the reference loop, inflating the overhead count. Delaying for several seconds after the app is started, to allow system disk cache activity, etc, to subside, should eliminate most of these problems.
Title: Re: Code timing macros
Post by: dsouza123 on January 12, 2007, 11:23:38 PM
Issue with the QueryPerformanceCounter function in both 32 bit and 64 bit

http://support.microsoft.com/kb/895980
Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2003 and in Windows XP

Add /usepmtimer to the end of a specific line in boot.ini
Title: Re: Code timing macros
Post by: Abel on February 06, 2007, 08:25:37 PM
Michael,
In checking out alternative code timers on a P4 I noted:

Running c2test2.exe (7/7/2006, MD5 B48DC522C3D33AD1673B97D0BA19F6A8):


HIGH_PRIORITY_CLASS                   REALTIME_PRIORITY_CLASS
-304 cycles, empty                     -48 cycles, empty
-44 cycles, mov eax,1                 -44 cycles, mov eax,1
-44 cycles, mov eax,1 mov eax,2       -44 cycles, mov eax,1 mov eax,2
-48 cycles, nops 4                    -40 cycles, nops 4
-28 cycles, mul ecx                   -36 cycles, mul ecx
-44 cycles, rol ecx,32               -260 cycles, rol ecx,32
-20 cycles, rcr ecx,31                -20 cycles, rcr ecx,31
   8 cycles, div ecx                    16 cycles, div ecx
   0 cycles, StrLen                     -8 cycles, StrLen


When line 101 in counter2.asm was changed:
from:    jnz   @B                  ;; End of reference loop
to:      jnz   $-40h               ;; End of reference loop

and c2test2.exe reassembled:


HIGH_PRIORITY_CLASS                    REALTIME_PRIORITY_CLASS
 0 cycles, empty                        0 cycles, empty
-4 cycles, mov eax,1                   -4 cycles, mov eax,1
-4 cycles, mov eax,1 mov eax,2         -4 cycles, mov eax,1 mov eax,2
 0 cycles, nops 4                       0 cycles, nops 4
12 cycles, mul ecx                     12 cycles, mul ecx
-4 cycles, rol ecx,32                  -4 cycles, rol ecx,32
20 cycles, rcr ecx,31                  20 cycles, rcr ecx,31
56 cycles, div ecx                     56 cycles, div ecx
40 cycles, StrLen                      40 cycles, StrLen


Cycles are nearly double those given by timers.asm but make more sense to me.

Abel
Title: Re: Code timing macros
Post by: MichaelW on February 06, 2007, 08:52:01 PM
For several reasons, you cannot get good cycle counts for instruction sequences that execute in less than a few cycles. Usually, a negative count indicates that the system interrupted the reference loop. Unless your system is busy doing something intensive (in which case you should not be trying to time code), you can eliminate most of these problems by delaying for a few seconds before starting the first timing loop, to allow disk cache, etc activity to subside.

The second set of macros use a very different method than the first set. They seem to work better for timing code that executes in only a few clock cycles, and they generally provide more repeatable results, but I have reservations about using them after seeing them reverse the relative positions of your rand32 code and Lingo's, compared to the first set of timing macros.
Title: Re: Code timing macros
Post by: Abel on February 07, 2007, 03:45:49 PM
Michael has provided us with two code clocking programs.  The first, timers.asm, takes timestamp counter readings before and after executing
an empty reference loop N times and before and after executing test code N times, and then finds a cycle count for the test code.

The second, counter2.asm, takes timestamp counter readings before and after each of N executions of a reference loop, saving only the least
difference.  It then does the same with the test code and calculates a cycle count from the two least values.

Those who have tried the latter report negative results for short code, a sign that the empty loop count is too large (vide supra).  Browsing
the posted asm file shows a jnz @B instruction at the end of the reference loop that doesn't go back to the intended aa: label so that the
empty loop count stays fixed at the value set by the first pass.

When that's remedied, counts are more realistic but, at least on a P4, all are multiples of four, e.g. -4, 0, 4, 8,...  P4 tests show the values
returned by rdtsc are always multiples of 4.  (On an old Pentium(1), multiples of 2.)  This limits clock resolution to 4 cycles.

One improvement is to find the lowest cycle count for successive loops of 4 repetitions, thereby increasing resolution to 1 cycle/iteration. 
With good fortune, it might be possible to whip through 4 iterations without interruption.  As a further check on multitasking sensitivity,
one may compare results with the extreme priority settings, IDLE_PRIORITY_CLASS and REALTIME_PRIORITY_CLASS.  With the former,
code can appear faster as it doesn't get to play until the A-codes have reached a hiatus.

The changes take but a few tweaks of Michael's work and attached is a suggested modification for the file counters.asm and some demo
executables.  While most results look promising, at times one hits a very reproducible but higher than expected result that can be shortened
by inserting NOPs or irrelevant code elsewhere in the source.  A cache or piping penalty??  So far, nothing yet found faster than -1 cycle.

FWIW, running c2test2x.exe on a 120MHz, Pentium(1), W95 box gave:
0 0 0 1 9 1 8 42 52 cycle counts for both priorities, which, apart from a div and StrLen call, are nearly identical with results on a 2GHz P4.

Abel


[attachment deleted by admin]
Title: Re: Code timing macros
Post by: MichaelW on February 24, 2007, 12:42:28 PM
Abel,

Thanks for finding the reference loop problem, and for testing. Shortly after I first noticed your post I became very busy at work, and by the time I got over that hump I had forgotten about your post.

Until very recently virtually all of my testing was done on a P3, and assuming I included a several-second delay at the start of my code, the original version (of the counter2 macros), even with the problem, would return repeatable results. Without the delay there would be significant variations, mostly in the first set of results. I see now why the delay had an effect on the repeatability, and in hindsight, considering that the major point of these macros was to improve repeatability, I should have investigated this effect further.

Correcting the reference loop eliminated the run to run variation, even without the delay. On a P3 the cycle counts for single instructions or pairs of instructions looked reasonable, for example:

0 empty
1 nop
3 nops 4
1 bswap eax
1 mov eax, 1
1 movzx eax, cl
2 mov eax, 1 : mov eax, 2
1 add eax, 1
1 adc eax, 1
1 ror eax, 8
4 rcr eax, 8
5 mul ebx
38 xor edx, edx : div ebx


But on a P4 some of the counts appeared to be ~4 times higher than they should have been:

0 empty
4 nop
0 nops 4
4 bswap eax
4 mov eax, 1
4 movzx eax, cl
4 mov eax, 1 : mov eax, 2
4 add eax, 1
8 adc eax, 1
4 ror eax, 8
20 rcr eax, 8
16 mul ebx
68 xor edx, edx : div ebx


After I eliminated the instructions that preserve EBX around the CPUID instructions the counts for a P3 were little changed:

0        empty
0        nop
2        nops 4
0        bswap eax
1        mov eax, 1
1        movzx eax, cl
1        add eax, 1
1        adc eax, 1
1        ror eax, 8
3        rcr eax, 8
3        mul ebx
37       xor edx, edx : div ebx


And most of the counts for a P4 were more or less reasonable:

0        empty
0        nop
8        nops 4
8        bswap eax
0        mov eax, 1
0        movzx eax, cl
0        add eax, 1
12       adc eax, 1
8        ror eax, 8
28       rcr eax, 8
16       mul ebx
68       xor edx, edx : div ebx


As I have stated previously, I can see no way to get good cycle counts for instructions or sequences of instructions that execute in less than a few clock cycles. With the reference loop working correctly the second set of macros can effectively filter out the bad counts caused by context switches in the loops, but this still leaves a significant problem with the timed instructions not being completely isolated from the timing instructions. AFAIK this is what accounts for the difference in the P4 counts after eliminating the POP EBX following the first CPUID instruction. The only way I can see to correct this problem would be to insert a stand-alone serializing instruction after the timed instructions. CPUID will not work in this position, because for consistent results a fixed CPUID function number must be loaded into EAX, and the instruction that does this will not be isolated from the timed instructions. AFAICT there are no suitable instructions that can be used from a normal Windows app.

And there is still the question of why the second set of macros return significantly different counts than the first set and, more importantly, why the difference varies widely depending on the code being timed. For example:

(timers.asm)
86 cycles, nrandom
38 cycles, nrand
42 cycles, rand32
39 cycles, rand32_lingo


(counter2.asm)
98 cycles, nrandom
56 cycles, nrand
53 cycles, rand32
62 cycles, rand32_lingo

I have spent hours on this, and I still cannot see the reason.

Title: Re: Code timing macros
Post by: Abel on February 26, 2007, 10:40:23 AM
Michael,
FWIY, on the home P4 and macros:





   MEAN                                 MINIMUM                             ADJ MEAN

1,000,000 Loops, IDLE_CLASS_PRIORITY
0 cycles, empty                     1 cycles, empty                     1 cycles, empty
0 cycles, nop                       -1 cycles, nop                      0 cycles, nop
0 cycles, nops 4                    1 cycles, nops 4                    1 cycles, nops 4
1 cycles, bswap eax                 1 cycles, bswap eax                 -1 cycles, bswap eax
0 cycles, mov eax, 1                0 cycles, mov eax, 1                0 cycles, mov eax, 1
0 cycles, movzx eax,cl              -1 cycles, movzx eax,cl             -1 cycles, movzx eax,cl
1 cycles, add eax, 1                -1 cycles, add eax, 1               1 cycles, add eax, 1
0 cycles, adc eax, 1                1 cycles, adc eax, 1                1 cycles, adc eax, 1
7 cycles, ror eax, 8                0 cycles, ror eax, 8                0 cycles, ror eax, 8
14 cycles, rcr eax, 8               11 cycles, rcr eax, 8               11 cycles, rcr eax, 8
10 cycles, mul ebx                  7 cycles, mul ebx                   6 cycles, mul ebx
46 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx   38 cycles, xor edx, edx : div ebx

100 Loops, IDLE_CLASS_PRIORITY
1 cycles, empty                     0 cycles, empty                     0 cycles, empty
0 cycles, nop                       0 cycles, nop                       0 cycles, nop
0 cycles, nops 4                    1 cycles, nops 4                    1 cycles, nops 4
2 cycles, bswap eax                 1 cycles, bswap eax                 1 cycles, bswap eax
0 cycles, mov eax, 1                0 cycles, mov eax, 1                0 cycles, mov eax, 1
0 cycles, movzx eax,cl              0 cycles, movzx eax,cl              0 cycles, movzx eax,cl
0 cycles, add eax, 1                2 cycles, add eax, 1                2 cycles, add eax, 1
0 cycles, adc eax, 1                2 cycles, adc eax, 1                2 cycles, adc eax, 1
0 cycles, ror eax, 8                0 cycles, ror eax, 8                0 cycles, ror eax, 8
14 cycles, rcr eax, 8               11 cycles, rcr eax, 8               11 cycles, rcr eax, 8
10 cycles, mul ebx                  6 cycles, mul ebx                   6 cycles, mul ebx
41 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx   40 cycles, xor edx, edx : div ebx


The first table may be pushing into the OS temporal regime but I didn't checkout REALTIME effects.

Abel























Title: Re: Code timing macros
Post by: Mark Jones on August 15, 2007, 02:51:14 AM
Just curious, has anyone considered running these timings in real-mode? Maybe disable the cache if possible? Seems like the best possible speed approximation would be to just include say 10000 repetitions of each instruction under test, surrounded by the serialization routines. Or am I missing something?
Title: Re: Code timing macros
Post by: MichaelW on August 15, 2007, 04:09:20 AM
Hi Mark,

Testing in real mode is a good idea that I had not considered. It should be fairly easy to implement, and it would eliminate the variables that Windows introduces into the tests. Real mode might allow the use of a free standing serializing instruction to completely isolate the timed instruction(s) from the timing instructions. I want to further investigate the 4-clock-multiple behavior of the P4, and real mode might provide a better way to do this. The Intel documents seem to indicate that for a P4 the time-stamp counter is incremented in step with the bus clock, instead of with the processor internal clock, and I'm thinking that this might explain the multiples of 4.
Title: Re: Code timing macros
Post by: Adamanteus on December 29, 2007, 08:27:16 PM
I have such program, it's now ported to Win32 but functionality for DOS is stayed. Possible mark that in Win32 with REALTIME_PRIORITY_CLASS results absolutely the same. Comments on Russian, but I'm to lazy to translate such "gefilta fish" :boohoo:
P. S. Sorry, I here's mistaken on QueryPerformanceFrequency divisor experimentally 10000, that on 10 more then by MSDN documentation. So result of program is : Command - 100 repeat DIV EBX : 32 ns

[attachment deleted by admin]
Title: Re: Code timing macros
Post by: KeepingRealBusy on November 09, 2008, 01:05:41 AM
Hutch,

As a "Johnny come lately", am I allowed to ask a question? What assembler was used to build the executables in this thread? This first entry in this topic is dated Feb 2005, so many things may have changed from then to now (last post was dec 29 2007). Is JWASM a good choiceto use, or just use MASM 8.0?

Dave,
Title: Re: Code timing macros
Post by: hutch-- on November 09, 2008, 02:27:42 AM
Dave,

I would be pretty sure that Michael used MASM as the macros were designed to work in MASM. For JWASM I would test it to see if you get the same results. Japheth has done a great job in terms of compatibility so it will probably work the same but he is still in beta so it is worth testing against a MASM built version.
Title: Re: Code timing macros
Post by: KeepingRealBusy on November 09, 2008, 02:32:49 AM
Hutch,

Thank you.

Dave.
Title: Re: Code timing macros
Post by: KeepingRealBusy on November 09, 2008, 07:22:40 PM
Well, the MASM32 install was successful. Read all of the macros. Then assembled test.asm successfully.

Here is my timing for my HP Pavilion, AMD 4200+ 64x2, still running 32 bit mode XP.



24 cycles
26 cycles
117 ms
110 ms

Press enter to exit...



Dave.
Title: Re: Code timing macros
Post by: KeepingRealBusy on November 09, 2008, 07:56:06 PM
Now the fun begins. I assembled c2test2.asm and executed it. This is what I got:



HIGH_PRIORITY_CLASS
-138 cycles, empty
0 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
-136 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31



And then it blew up with the following VS error.


VS Error Message: Unhandled exception at 0x00401a20 in c2test2.exe: 0xC0000095: Integer overflow.

Call stack: c2test2.exe!00401a20()

Instruction in error: 00401A20  div         eax,ecx

Regs:       EAX     00000001
ECX 444D4163
EBX 68747541
        EDX 69746e65



As nearly as I can determine, "div         eax,ecx " is invalid, at least on the AMD Athlon. The div instruction does not support two parameters (not like the imul which supports one, two, or three operands). div only supports a single parameter of reg/mem, and if the divisor is 32 bits (take your pick, not knowing the Intel format, but guessing from imul I would say ecx), then edx:eax is the dividend and ecx is the divisor. The quotient is returned in eax, the remainder in edx. Since edx was not cleared before the div, the quotient would overflow.

What is the Intel format for div? All the references I have agree with AMD. Where did the two parameter call come from?

Dave.
Title: Re: Code timing macros
Post by: KeepingRealBusy on November 15, 2008, 03:08:25 AM
Please ignore my post above. The problem was not a two parameter DIV, it was just VS displaying that this was a 64 bit divide by 32 bits. The real problem was that the EDX register was not cleared before the DIV instruction (at least this must be done on AMD). I made the following changes to counter2.asm:

--c2test2asm.new----------------------------------------------- 53
        xor edx,edx
--c2test2.asm.old----------------------------------------------53

--c2test2.asm.new---------------------------------------------- 101
        xor edx,edx
--c2test2.asm.old----------------------------------------------100


The following are my timings:


HIGH_PRIORITY_CLASS
0 cycles, empty
0 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
-135 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31
40 cycles, div ecx
-297 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-136 cycles, mov eax,1
2 cycles, mov eax,1 mov eax,2
1 cycles, nops 4
1 cycles, mul ecx
0 cycles, rol ecx,32
4 cycles, rcr ecx,31
40 cycles, div ecx
31 cycles, StrLen



I also found a problem trying to assemble test.asm in timing macros using the latest MASM32.zip. This was a clean install on an otherwise unused drive. The problem was that the crt__xxx symbols were undefined. I changed the header to match the c2test2.asm header as follows and this correctly assembles:


--test.asm.new-------------------------------------------------- 2
    include \masm32\include\masm32rt.inc
    .586
--test.asm.old--------------------------------------------------2
    .586                       ; create 32 bit code
    .model flat, stdcall       ; 32 bit memory model
    option casemap :none       ; case sensitive

    include \masm32\include\windows.inc
    include \masm32\include\masm32.inc
    include \masm32\include\kernel32.inc

    includelib \masm32\lib\masm32.lib
    includelib \masm32\lib\kernel32.lib

    include \masm32\macros\macros.asm


My timings for test.asm are as follows:


19 cycles
19 cycles
201 ms
191 ms


Dave.
Title: Re: Code timing macros
Post by: jj2007 on March 16, 2009, 09:12:13 AM
I stumbled over an odd problem - probably I am just blind to see what's wrong here. I tried to improve the accuracy of timings with a REPEAT 10, but instead of getting 10*the cycle count, I seem getting 100*cycles...

Testing (on a P4) whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

4       cycles jmp directly
5       cycles mov+jmp
474     cycles jmp directly, divide by 10
488     cycles mov+jmp, divide by 10


Expected value for the last two lines would be 47 and 48.

Here is the code:
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm

p1 PROTO: DWORD
p2 PROTO: DWORD

.code
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

p1 proc a1
  pop eax
  pop eax
  mov ecx, dword ptr [esp-8]
  jmp ecx
p1 endp

p2 proc a1
  pop eax
  pop eax
  jmp dword ptr [esp-8]
p2 endp

start:
LOOP_COUNT = 10000000
print chr$("Testing whether",13,10,9,"mov ecx, dword ptr [esp-8]",13,10,9,"jmp ecx",13,10,"is faster than",13,10,9,"jmp dword ptr [esp-8]",13,10,10)

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123
counter_end
print str$(eax), 9, "cycles jmp directly", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p1, 123
counter_end
print str$(eax), 9, "cycles mov+jmp", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123 ; manual REPEAT 10 to exclude that REPEAT is causing the problem
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
invoke p2, 123
counter_end
print str$(eax), 9, "cycles jmp directly, divide by 10", 13, 10

invoke Sleep, 200
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 10
invoke p1, 123
  ENDM
counter_end
print str$(eax), 9, "cycles mov+jmp, divide by 10", 13, 10

print chr$(10,"hit any key")

exit

end start


Any explanation?? ::)
Title: Re: Code timing macros
Post by: sinsi on March 16, 2009, 09:46:30 AM

C:\Users\me\Desktop>jj
Testing whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

0       cycles jmp directly
1       cycles mov+jmp
46      cycles jmp directly, divide by 10
44      cycles mov+jmp, divide by 10

hit any key
C:\Users\me\Desktop>

If I add

  repeat 10
    invoke p2, 123
  endm

I get 46 again (all copy+paste, none of my code except the 'repeat' bits)
Title: Re: Code timing macros
Post by: jj2007 on March 16, 2009, 10:38:50 AM
Thanks, Sinsi. Mysterious. I remember that it worked correctly on my other puter, with a Core CPU, but can't test it right now.
A P4-specific problem?
Title: Re: Code timing macros
Post by: sinsi on March 16, 2009, 11:05:08 AM
Quote from: jj2007 on March 16, 2009, 10:38:50 AMA P4-specific problem?
I get the same numbers from 98SE/NT4/2000 (via virtualpc, which is a P3 I think).

I would think more of an inc/lib/ml version problem...
Title: Re: Code timing macros
Post by: MichaelW on March 16, 2009, 11:10:26 AM
Considering the code that is being executed, I suspect that your first two results are not valid. I think the biggest part of the problem is the P4 in combination with your attempt to time code that executes too fast for reliable results on any processor. On my P3 I get a consistent:

0012FFC0
Testing whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

1       cycles jmp directly
1       cycles mov+jmp
170     cycles jmp directly, divide by 10
169     cycles mov+jmp, divide by 10
0012FFC0


The starting and ending hex number are the value of the stack pointer at entry and exit. If I understand what you are trying to do, perhaps something like this would work better:

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push @f
        jmp DWORD PTR [esp]
      @@:
        add esp, 4
      ENDM
    counter_end
    print str$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push @f
        mov ecx, [esp]
        mov ecx, @f
        add esp, 4
        jmp ecx
      @@:
      ENDM
    counter_end
    print str$(eax),13,10

Running on my P3, I get a consistent:

297
335





Title: Re: Code timing macros
Post by: sinsi on March 16, 2009, 11:20:11 AM
98 under virtualpc gives me -1, -1 for the first two (watch out marty!)
Title: Re: Code timing macros
Post by: jj2007 on March 16, 2009, 12:37:19 PM
Quote from: MichaelW on March 16, 2009, 11:10:26 AM
Considering the code that is being executed, I suspect that your first two results are not valid. I think the biggest part of the problem is the P4 in combination with your attempt to time code that executes too fast for reliable results on any processor. On my P3 I get a consistent:
..
If I understand what you are trying to do

Michael,
Even a simple doubling produces odd results:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123
invoke p2, 123
counter_end

That should give 8 cycles, but I get 65:
4       cycles jmp directly
65      cycles jmp directly, divide by 2


On a Celeron M, I get consistently 4 cycles for a single test, and with REP 10 I get 40/10=4 cycles. So it might be the P4... ::)

P.S.: Background is an argument with Lingo (http://www.masm32.com/board/index.php?topic=1807.msg81296#msg81296) over this technique (admittedly not a very clear style, but it's short and fast):

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32s proc src:DWORD
pop eax ; trash the return address
pop eax ; the src pointer
... do stuff ...
jmp dword ptr [esp-4-4] ; ret address, one arg - the Lingo style equivalent to ret 4 ;-)
strlen32s endp
Title: Re: Code timing macros
Post by: sinsi on March 16, 2009, 12:55:25 PM
You can't tell what windows is doing - forget about the priority thing, windows has its own agenda...
Interrupts (timer/kbd/mouse, hell even usb/vga) muck things up, different cores, etc. - multi core/cpu can't rely on rdtsc, and there's the speedstep/choke thing too.

A load from memory, then a jump should  be slower than a jump from memory, but it depends on what the intel guys decided (or what the compiler guys convinced them to do eh?)

I think that once we get down to single-figure counts we're hoping (pissing in the wind  :bg).
Title: Re: Code timing macros
Post by: jj2007 on March 16, 2009, 03:36:51 PM
Quote from: sinsi on March 16, 2009, 12:55:25 PM
I think that once we get down to single-figure counts we're hoping (pissing in the wind  :bg).

Of course, relative precision is low. But the timer macros still yield surprisingly consistent and plausible results. Some time ago I played with push+pop; timings for a P4:

1       cycles for n*(push eax+pop ecx)
4       cycles for n*(push eax+pop eax)
1       cycles for n*push eax, n*pop ecx
9       cycles for n*(push eax+pop m32a)
10      cycles for n*push eax, n*pop m32a
9       cycles for n*(push m32a+pop m32b)
10      cycles for n*push m32a, n*pop m32b
26      cycles for n*(pushad+popad)
26      cycles for n*pushad, n*popad

The practical relevance is admittedly low, but it tells something about the CPU when a push/pop for the same register is four times slower. IMHO these timer macros are a fantastic tool - thanx, Michael :U

[attachment deleted by admin]
Title: Re: Code timing macros
Post by: BeeOnRope on March 31, 2009, 08:13:48 PM
Quote from: jj2007 on March 16, 2009, 03:36:51 PM
The practical relevance is admittedly low, but it tells something about the CPU when a push/pop for the same register is four times slower. IMHO these timer macros are a fantastic tool - thanx, Michael :U

The issue is dependence chains in the code.  A push/pop sequence on the same register is serially dependent - the prior pop must complete before the push can occur, since the push needs the value from the pop.  Also, the prior push must occur before the pop, since the pop is writing to the same memory location as the push, so the CPU must "complete" the push in order to process the pop.  So you measure the latency of the operation in this case, which turns out to be about 4.

When you push/pop different registers, the chain is broken because the pop->push sequence is no longer dependent as the push value doesn't depend on the pop (since the pushed value comes from a different register).  Even though push->pop still is still dependent, the chain is broken and the operations can be parallelized - so you are measuring the (reciprocal) throughput.

In general, these kinds of complexities are why your tests of whether "mov+jmp reg" is faster than "mov+jmp mem" are invalid.  No code consists of a tight loop of these instructions - it has to do some useful work as well.  In performing that useful work, it may be that the cost of the mov ecx,[esp-8] can be completely hidden since it can be done earlier, long before the jmp.  In fact this is very likely since it is completely independent of the preceding code (except that it needs ecx) so that there is a good chance this comes for free.  Then if you compare jmp reg vs jmp mem, you may find that jmp reg is faster (probably free, in fact) than jmp [mem], which has to do the read.

This is why micro-benchmarking small sequences of instructions is quite useless (except perhaps to build tables of latencies and throughput) - since the actual behavior in real code may be very different.
Title: Re: Code timing macros
Post by: jj2007 on March 31, 2009, 09:19:22 PM
Quote from: BeeOnRope on March 31, 2009, 08:13:48 PM
The issue is dependence chains in the code.  A push/pop sequence on the same register is serially dependent -

Welcome, BeeOnRope :thumbu

What you write is not entirely new, but we tend to forget, and I have rarely seen it so nicely explained. Thanks :U (and apart from that, I wanted to tease Lingo a little bit :bg)

You might like the linecounter thread (http://www.masm32.com/board/index.php?topic=11061.msg81715#msg81715) - we needed kilocycles to time that one. I would also like to your views on the practical relevance of timings and cache (http://www.masm32.com/board/index.php?topic=11036.msg81254#msg81254).
Title: Re: Code timing macros
Post by: m0nk3yi3unz on July 06, 2009, 07:12:13 PM
Hey MichaelW! Great stuff here - I've never fathomed using the clock as a timing mechanism (that sounds weird, but you know what I mean). I didn't even know you -could- monitor the CPU clock :X

I'll definitely be checking out the source.

And, to keep the trend going:

Quote
21 cycles
22 cycles
86 ms
86 ms
Title: Re: Code timing macros
Post by: dedndave on July 06, 2009, 07:44:11 PM
go to the first post in this thread and d/l the zip file named counter2.zip
place counter2.asm in your masm32\macros folder
use the test.asm file as an example of how to use it
Michael did a great job and spent considerable time getting it to work correctly - lol
Title: Re: Code timing macros
Post by: Magnum on September 26, 2009, 02:29:08 PM
When I ran test.exe, I got 50 cycles and and 678 ms.

That is on a P-3, 733 Mhz system.

I have to agree with Hutch.

The timing macros work well when comparing code on a single system,
but the granularity isn't possible when comparing P-3 to P-4 etc.

But I plan on using them to see where I can speed things up.

I think it's like comparing apples to oranges.

Andy


Title: Re: Code timing macros
Post by: dedndave on September 26, 2009, 04:34:59 PM
the way we really use it, it works ok
we usually want to take 2 proc's (sometimes more) and compare how they perform on a variety of processors
as an example:
on processor type A, routine #1 is faster than routine #2 by x %
on processor type B, routine #2 is faster than routine #1 by y %
while "absolutes" are interesting, we just want to know which routine is best, and by what percentage
for that purpose, the timing macros work great
Title: Re: Code timing macros
Post by: PBrennick on October 02, 2009, 11:29:17 PM
Yes, it is a relative thing needing no absolutes, just a comparison that can be based on anything as long as it is the same in all cases. The macros allow this to be done quite nicely.

Paul
Title: Re: Code timing macros
Post by: sysfce2 on November 22, 2009, 01:07:14 PM
I'm trying to convert this code timing macro to NASM and I have a question about how it works:

In the first section, under the reference loop, there is:
@@:
sub __ctr__loop__counter__, 1
jnz @B

This appears to repeatedly subtract 1 from the loop counter.
Should it loop to the "@@:" above where it says ";; Reference loop"?
Title: Re: Code timing macros
Post by: MichaelW on November 22, 2009, 02:14:32 PM
I'm not sure what you mean by "above where it says ;; Reference loop". The code below implements an empty reference loop that is used to determine the cycle count for the loop overhead, which includes not only the two instructions in the loop but everything from the preceding RDTSC to the next RDTSC.

@@:                     ;; Start an empty reference loop
        sub   __counter__loop__counter__, 1
        jnz   @B


Note that the code from the preceding RDTSC to the next RDTSC is the same for both loops, so the count for the test code can be calculated by subtracting the count for the reference loop from the count for the test loop.
Title: Re: Code timing macros
Post by: sysfce2 on November 22, 2009, 02:31:37 PM
Sorry, I didn't word that very well.  I am talking about counter2.asm.

I guess what I'm asking is if the jump on line 101 is intended to jump to the local label on line 98 (as is the current behaviour) or the local label on line 60 (which is unused).

I believe counters2 is for the lowest time taken while timers is an average.  Am I correct and what is the preferred choice for benchmarking?

Thanks,
sysfce2
Title: Re: Code timing macros
Post by: MichaelW on November 22, 2009, 02:52:33 PM
The jump on line 101 is supposed to jump to the local label on line 98. The local label on line 60 is not supposed to be there.
Title: Re: Code timing macros
Post by: dedndave on February 02, 2010, 10:11:51 PM
hate to wake up an old thread   :P

as many of you know, i have always had a bit of trouble getting reliable timing numbers on my p4 prescott
while researching CPUID, i came across a little piece of info on some gamer forum
their thread had nothing to do with timing code, but it rang a bell in my head
i decided to try it out for the fun of it, and what do ya know - it helped my timing issue
it is a simple change in the registry

of course, unless you are having trouble getting reliable readings, i wouldn't suggest altering the registry

REGEDIT4

[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet002\Control\Session Manager\Throttle]
"PerfEnablePackageIdle"=dword:00000001


any non-zero value turns it on - zero turns it off
reg file attached

another item i saw in the same thread added this to the boot.ini line
this one is more likely to cause trouble than to fix it, but i thought i would mention it for the sake of completness

/usepmtimer
Title: Re: Code timing macros
Post by: MichaelW on February 03, 2010, 12:53:32 AM
Interesting, but does it actually improve the repeatability on your system? Some additional information:

http://support.microsoft.com/kb/896256

Title: Re: Code timing macros
Post by: dedndave on February 03, 2010, 12:55:48 AM
yah - i already have that KB installed   :P
it makes a big difference, Michael
i wish i had known about it when i was working on ling long kai fang - would have saved me a lot of time
Title: Re: Code timing macros
Post by: KeepingRealBusy on June 21, 2010, 12:49:06 AM
MichaelW

I have been using the timing macros in the Compare Two Strings Topic here in the Laboratory. I added a .list at the start of the .code segment and added a /Fl to create a .lst file to look at the created code, mainly in the test cases, but noticed that the timing and counting macros themselves were hard to follow with jumps to labels that were not present:


000000D0      2       ??001F:                 
000000D0  68 00000001 R     1 push offset Src1
000000D5  68 0000D702 R     1 push offset Src2
000000DA  E8 00003ADC      1 call StrCmpSSE
000000DF  83 2D 0001D5FC R  2         sub   __counter__loop__counter__, 1
   01
000000E6  75 E8      2         jnz   __counter__loop__label__
000000E8  33 C0      2         xor   eax, eax


Adding the jnz offset of E8 to the next instruction offset of E8 gives the correct target of D0 (??001f). I found that if you modified these macros to add a .listmacroall and .listmacro to the counter_begin and timer_begin macros as follows you will get a clearer listing (at least you will get a label that matches the jnz target):


.listmacroall
          label:                ;; Start test loop
            __timer__loop__label__ equ <label>
.listmacro


This is the resulting listing fragment:


     2 .listmacroall
000000D0      2       ??001F:                 
= ??001F      2         __counter__loop__label__ equ <??001F>
     2 .listmacro
000000D0  68 00000001 R     1 push offset Src1
000000D5  68 0000D702 R     1 push offset Src2
000000DA  E8 00003ADC      1 call StrCmpSSE
000000DF  83 2D 0001D5FC R  2         sub   __counter__loop__counter__, 1
   01
000000E6  75 E8      2         jnz   __counter__loop__label__
000000E8  33 C0      2         xor   eax, eax


Of course, turning off .list eliminates all code, leaving only the symbols and deleting the ML option /Fl deletes the .lst file.

Dave.
Title: Re: Code timing macros
Post by: ecube on July 10, 2010, 01:33:45 AM
MichaelW I been using a lot of Nt functions lately and I came across NtDelayExecution, the significance of this over regular winapi Sleep, is the resolution is 'Delay in 100-ns units', and they're are 1000000 nanoseconds in 1 millisecond so this much better than sleeps 10 ms delay, which could make your macros that much more accurate.
Title: Re: Code timing macros
Post by: oex on July 10, 2010, 01:52:02 AM
.... There is some interesting info on timing posted here:

http://www.masm32.com/board/index.php?topic=14031.0

NtDelayExecution wasnt suggested but other 100ns timing was, it may be of interest for comparison
Title: Re: Code timing macros
Post by: dedndave on July 10, 2010, 01:56:34 AM
i don't think Michael's code relies on the resolution of Sleep   :red
he uses Sleep simply to relinquish execution until the next time-slice
it may still provide a slight improvement
Title: Re: Code timing macros
Post by: milomir on July 07, 2011, 06:10:48 AM
pentium dual-core e5500 @3.2ghz
Quote
20 cycles
24 cycles
60 ms
75 ms

Quote
HIGH_PRIORITY_CLASS
-14 cycles, empty
-14 cycles, mov eax,1
-14 cycles, mov eax,1 mov eax,2
-14 cycles, nops 4
-14 cycles, mul ecx
-14 cycles, rol ecx,32
0 cycles, rcr ecx,31
14 cycles, div ecx
14 cycles, StrLen

REALTIME_PRIORITY_CLASS
0 cycles, empty
-210 cycles, mov eax,1
-14 cycles, mov eax,1 mov eax,2
-182 cycles, nops 4
-14 cycles, mul ecx
-28 cycles, rol ecx,32
0 cycles, rcr ecx,31
14 cycles, div ecx
14 cycles, StrLen
Title: Re: Code timing macros
Post by: dedndave on July 07, 2011, 09:09:07 AM
 :bg
what code are you running ?
there may be some things we can do to improve the results

ok - i think i found it
it is either c2test2x.exe or c2test2x-idle.exe

try placing this code at the beginning of the program, then re-assemble
        INVOKE  GetCurrentProcess
        INVOKE  SetProcessAffinityMask,eax,1
        INVOKE  Sleep,750


my machine is known for turning out unusualy results
Michael suggested using 500 mS
but, on my machine, if i wait 750 mS for it to bind, i get much nicer numbers