Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM

MichaelW

Thanks Frank.

Just to make sure I understand, the Sleep, 0 is so the counter will start near the beginning of a new time-slice?

And did you run the test with the original loop count?
eschew obfuscation

Frank

Quote
Just to make sure I understand, the Sleep, 0 is so the counter will start near the beginning of a new time-slice?
Yes, exactly.
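
The idea of yielding before starting the counter can be sketched outside of MASM as well. The following is only an illustration in Python, not the code from TimingMacros.zip: `time.sleep(0)` stands in for `invoke Sleep, 0`, and the helper name `timed_after_yield` is hypothetical.

```python
import time


def timed_after_yield(func, repeats=10):
    """Give up the rest of the time slice, then time func().

    Returns the smallest elapsed time in nanoseconds over `repeats` runs.
    """
    best = None
    for _ in range(repeats):
        # Stand-in for `invoke Sleep, 0`: yield so that the measurement
        # is likely to begin near the start of a fresh time slice.
        time.sleep(0)
        start = time.perf_counter_ns()
        func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


if __name__ == "__main__":
    print(timed_after_yield(lambda: len("x" * 1000)), "ns")
```

Taking the minimum over several runs is an assumption of this sketch; it simply discards runs that were interrupted anyway.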

Quote
And did you run the test with the original loop count?
Just did it. Here are the results of 10 consecutive runs of your TEST.exe from the TimingMacros.zip. The Celeron runs WinXP SP2, and the P4 runs Win2k SP4 (sorry, I can't hold the OS constant with reasonable effort).


Celeron    P4
RP HP    RP HP
-----    -----
25 26    26 27
25 32    26 27
25 27    25 28
30 25    27 27
25 27    26 27
27 25    26 27
28 25    25 27
29 27    25 28
26 25    25 27
25 27    25 27
--------------
RP: realtime priority
HP: high priority


Interestingly, the means are almost identical for the Celeron (26.55) and the P4 (26.40). The difference between priority levels is somewhat larger (realtime priority: 26.05, high priority: 26.90). According to my statistics software, there is significantly more variability around the mean for the Celeron/XP machine than for the P4/W2k machine. The constant cycle count of 26 that I reported above for the 1000-iterations-only version of the program also came from close to 10 runs on each machine, but I did not keep a precise record. Taken together, it looks as though the differences in cycle count are due more to the operating system than to the hardware.

Regards

    Frank
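
For anyone who wants to re-check the means quoted in Frank's post, the table can be typed in and summarized as below. This is a minimal sketch: the list names are made up for illustration, and the sample variance is only a rough stand-in for whatever test his statistics software ran.

```python
from statistics import mean, variance

# Cycle counts from the table: 10 runs per machine and priority class.
celeron_rp = [25, 25, 25, 30, 25, 27, 28, 29, 26, 25]
celeron_hp = [26, 32, 27, 25, 27, 25, 25, 27, 25, 27]
p4_rp = [26, 26, 25, 27, 26, 26, 25, 25, 25, 25]
p4_hp = [27, 27, 28, 27, 27, 27, 27, 28, 27, 27]

celeron = celeron_rp + celeron_hp
p4 = p4_rp + p4_hp
realtime = celeron_rp + p4_rp
high = celeron_hp + p4_hp

if __name__ == "__main__":
    for label, data in [("Celeron", celeron), ("P4", p4),
                        ("realtime priority", realtime),
                        ("high priority", high)]:
        print(f"{label}: mean {mean(data):.2f}, "
              f"sample variance {variance(data):.2f}")
```

The four means come out as stated in the post, and the Celeron/XP runs show the larger spread.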

liquidsilver

Please note that those results from the Celeron 2.66 were from a school PC running a few background processes. The results did vary by a few cycles or ms, but those were the most common. I'm surprised that they varied at all, but I guess it was because of the OS.
Interesting code though, very easy to implement :U, but those varied results must go :tdown Why did you make it run 10 million times? Is that needed?

Jimg

Reading through this thread, I notice that, generally, Intel chips seem to take about 26 cycles and AMD chips about 36 cycles. Is there a reason for this?

x86asm

AMD AthlonXP 1900+:

36 cycles
36 cycles
222 ms
222 ms


rags

On my new AMD Sempron 2500+, Win XP Pro SP1:
36 cycles
35 cycles
204 ms
203 ms

It does seem to me that the test takes the same number of cycles to complete, regardless of which AMD processor runs it.
EDIT: Time-wise, I reduced the time by 20% compared with my older AthlonXP 1600.
God made Man, but the monkey applied the glue -DEVO

MichaelW

My K5-112 is an AMD processor
33 cycles
33 cycles
2962 ms
2969 ms
:bg



Tedd

Just to join in the fun...

AMD K6-III 400MHz
41 cycles
41 cycles
1051 ms
1051 ms


There goes the theory about number of cycles? :bdg
No snowflake in an avalanche feels responsible.

Rolan

joining the fun with pride  :lol
AMD A64 3500+
28 cycles
29 cycles
130 ms
131 ms

Jeaton

36 cycles
36 cycles
124 ms
124 ms

Celeron D 2.93 GHz; XP Home SP2

Jimg

I'm using an Athlon 3000+ at 2.16 GHz

I was a little concerned at how slow my computer was compared to the Intel chips, so I copied the StrLen function to the end of the test code and changed its name, rather than calling it from the MASM32 library, to see what was going on.
The test immediately ran over 10 percent faster. My curiosity now took over, and I tried inserting an align 4 before the routine. Better still. I then tried inserting an increasing number of bytes before the routine, and found that the best result came from inserting 15 extra bytes after the align 4.
Here are the results:

          StrLen   StrLen moved   align 4   align 4 plus inserted bytes
          in lib   from lib       only      1       2       7       15
cycles    36       32             31        31      32      30      29
ms        166      150            142       139     147     138     134




MichaelW

I tested the macros to ensure that the timing would not be (significantly) affected by the alignment of the macro calls and the calls to a library procedure. I think I also tested with a block of code rather than an invoke between the macro calls, but at this point I'm not sure. In any event, I see no way to control the alignment of a called procedure from within the macros.

The source for the library procedure, surprisingly, does not contain an alignment directive, so this could explain the difference in cycle counts. When I modify your code to invoke first StrLen and then StrLen2, both with REALTIME_PRIORITY_CLASS, on a P3 under Windows 2000 I get:

49 cycles
44 cycles

Replacing your alignment code with just a db 14 dup(0), I get:

45 cycles
49 cycles

So it appears to be just an alignment effect. I think the (align 4)+15 works because it is placing the most important part of the loop code at the best alignment. I would be interested in seeing how a version of the procedure optimized for the Athlon would perform.

Jimg

After a lot more testing, it seems there can be up to a 20 percent difference depending on the placement. The effect repeats every 16 bytes, so one would have to do an align 16 followed by inserting the best number of bytes for optimum timing. This could waste up to 30 extra bytes, which is not too bad for a time-critical routine. Of course, judicious placement of fillers within the routine could also help.
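
The "up to 30 extra bytes" figure follows from simple arithmetic: an align 16 can cost up to 15 bytes of padding, and the filler needed to reach the best offset within the 16-byte block can cost up to 15 more. A small sketch of that calculation, with hypothetical function names and a made-up example address:

```python
def padding_to_align(address, alignment=16):
    """Bytes an assembler would insert to reach the next aligned address."""
    return (-address) % alignment


def placed_address(address, filler, alignment=16):
    """Address of the routine after `align` plus `filler` extra bytes."""
    return address + padding_to_align(address, alignment) + filler


def worst_case_waste(alignment=16):
    """Worst case: alignment-1 bytes of padding plus alignment-1 filler."""
    return 2 * (alignment - 1)


if __name__ == "__main__":
    addr = 0x401003  # made-up address, purely for illustration
    print(padding_to_align(addr))         # padding inserted by align 16
    print(hex(placed_address(addr, 15)))  # routine start with 15 filler bytes
    print(worst_case_waste())             # upper bound on wasted bytes
```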

Mark Jones

Quote from: Shakain on March 17, 2005, 09:14:38 PM
The results for a AMD XP 1800, XP pro (default)

36
36
307
307 *

* this number changes between runs, I saw it under 300 at some time. Strange ?


Interesting.... AMD XP 1800, XP Pro SP1:

36 cycles
36 cycles
196 ms
196 ms

(Also interesting that my AMD 1800 is faster than Rags' AMD Sempron 2500+ at these tests, and almost keeps up with the P4 2600???) :dazzled:

Let's not forget about overclocking, hardware, and driver contributions to this test. I'm using a highly optimized (but not overclocked) SolTek nVIDIA mainboard. I also update all drivers every once in a while. This has been the best PC ever (knock on wood), so if you're ever in the market for a new system, check out SolTek for your next mainboard: http://www.soltek.de/soltek/news/index.php (Innocent plug.) :U

Edit: Soltek is long-gone. :(
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Citric

Quote from: Citric on May 17, 2005, 10:19:02 AM
Hi All
I have ported MichaelW's code timing macros to GoAsm, but they don't seem to give the same results.

test.exe displays results for the masm32.lib StrLen procedure, at HIGH_PRIORITY_CLASS

Please test them and post your results, maybe with a comparison to MichaelW's macros.

Cheers Adam
PS: Any guru, please have a look to see whether I have gone wrong somewhere to get different results from the MASM macros.
PS: I am building them with "GoAsm.Exe Version 0.52.4 beta" and "GoLink.Exe Version 0.25.4"



Please have a look over at the GoAsm forum.