News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM

Previous topic - Next topic

stanhebben

I've been doing some experiments with the macros, and I had strange results at times. Some very short tests ended up with negative (!) times - but an empty loop always timed at zero cycles.

Mark Jones

Could it be an artifact from prefecting/caching? If using an AMD, run the app through CodeAnalyst. It can tell you exactly what is going on. :bg

We must realize that even if a spot-on measurement were possible, it is only good for a particular CPU + Mainboard + OS. The numerous combinations of these elements make the test results very specific by nature, so accuracy is partially invalidated anyways. I suspect that a Real-Mode timer might be only marginally better.

Also note, if the timer is not adaptive to a missing stack frame or an unbalanced stack, (say it is a PROC timer but you run it on a stack-frame-less-PROC) it could leave you with negative result. With Michael's code timer I would get a zero or -1 result timing 4 NOPs, even though each should take 0.25 cycle on my CPU, so perhaps something was a little amiss there. Petriozki's routines seem to be rock-stable and quite accurate, have you tried those? :8)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MichaelW

I finally got around to updating the cycle count macros. This time, in an effort to minimize the run to run variation in the cycle counts, instead of returning the average cycle count across the entire loop, I'm returning the lowest count that occurs in a single loop, as Ratch and others here have suggested. Assuming a suitable sample size, for code that executes in less than about 10000 cycles the run to run variation is now near zero. I also corrected the problem with CPUID altering the value in EBX. And because there is no need to calculate an average, there is no FPU code, and no need for EMMS. Everything seems to be working correctly, but I have done only a very limited amount of testing. I added the attachment to the first post in this thread.

eschew obfuscation

Chip


I read with interest this thread concerning timing.   

Seeing that you recently updated your macros,
I ran them on my AMD FX53.

Chip


Test.exe

     26 cycles
     26 cycles
   113 ms
   113 ms


C2Test2.exe

     HIGH_PRIORITY_CLASS
     -168 cycles, empty
     0 cycles, mov eax,1
     2 cycles, mov eax,1 mov eax,2
     1 cycles, nops 4
     -166 cycles, mul ecx
     0 cycles, rol ecx,32
     4 cycles, rcr ecx,31

However, this second program crashed.  I tried to get the debug
information to no avail.  However, it did display  the following info:

AppName: c2test2.exe    AppVer: 0.0.0.0    ModName: c2test2.exe
ModVer: 0.0.0.0    Offset: 00001a20
I am on Skype under my email address!

MichaelW

Thanks for testing. The test app was intended to test the worst-case functionality. In my view it is not reasonable to expect meaningful clock cycle counts for an instruction or instruction sequence that executes in less than a few clock cycles because:

The test is running in a multitasking environment where context switches cannot be avoided.

The timed instructions cannot, in general, be completely isolated from the timing instructions.

The counter has a resolution of one clock cycle.

On my system I do see an occasional negative count, but generally only for instructions or instruction sequences that otherwise execute in a small number cycles, and I decided that eliminating them was not worth the effort.

I'm not sure why the REALTIME_PRIORITY_CLASS tests failed on your system, as they work OK on my systems (all of which are slow by today's standards). I do not normally use REALTIME_PRIORITY_CLASS for reasons that I explained in the comments.

What version of Windows are you running, and what were the cycle counts for div ecx and StrLen?


eschew obfuscation

Chip

Michael,

I ran  the tests on Windows XP with service pack 2.

I only ran only your executables, and did not attempt to integrate your macro's
into any of my code.  (I am new to assembler, and somewhat leary of
experimenting on my own disasterous code.  (Smile)  Yet, there will come a time
when I will need to analyse my results.  I look  forward to using your macro's then.

As an aside, I found this whole thread interesting and educational.  Your dialogue
with others demonstrated a sincere interest in, and knowledge of, the system
internals.  Your inclusion of thier thoughts in your final version showed a sense
of maturity that I often find lacking in online forums.

Chip
I am on Skype under my email address!

hutch--

I am pretty much of the view that absolute times are a fantasy as there are too many other factors. The closest I have seen is Agner Fog's real mode timing methods but they do not translate directly to ring3 protected mode programming.

I have long ago settled for relative timings as they tend to be consistent across different machines under different conditions. Start with an algo you want to make fast, make a copy of it slightly renamed and run both in the same test piece. Run the test piece for a half a second or longer and the timings start to become very consistent. Then you can modify the copy to see if you can get it faster relative to the first one.

The test piece design depends on your target, if the algo has to do long single runs on data, a very big test piece of data works best but if the algo gets looped many times on short data you need to design a test for that target.

Cycle counts are a waste of time as post 486 machines use multiple pipelines and the action is scheduling, not cycle counting ALA 386 and earlier technology. This is finally why a real time test piece is more use in that what it does equates to the task at hand.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

M4D45M

good to see the macros becoming better and better.
as there exist some problems with timing, i just wanted to ask
if you considered these issues:

1. 'affinity' mask to ensure reading from same cpu (multiprocessor machines/multi-core machines)
2. RDTSC not reliable using mobile processors coz it assumes fixed clock-cycles
     http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/Game_Timing_and_Multicore_Processors.asp

3. QueryPerformanceCounter may unexpectedly leap forward several seconds on some chipsets
     http://support.microsoft.com/default.aspx?scid=kb;en-us;274323

wizzra

AMD Athlon 64 X2 3800+


24 cycles
-29 cycles
125 ms
116 ms


:eek

FairLight

Pentium M 750 Dothan 1,86 Ghz @ 2,5 Ghz (on Asus P4C800-E Dlx.)

29 Cycles
29 Cycles
119 ms
120 ms


ramguru

It would be nice if someone good with macros could make them work in POASM...

MichaelW


24 cycles
-29 cycles
125 ms
116 ms

When you see large negative counts like this it probably indicates that the system interfered with the reference loop, inflating the overhead count. Delaying for several seconds after the app is started, to allow system disk cache activity, etc, to subside, should eliminate most of these problems.
eschew obfuscation

dsouza123

Issue with the QueryPerformanceCounter function in both 32 bit and 64 bit

http://support.microsoft.com/kb/895980
Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2003 and in Windows XP

Add /usepmtimer to the end of a specific line in boot.ini

Abel

Michael,
In checking out alternative code timers on a P4 I noted:

Running c2test2.exe (7/7/2006, MD5 B48DC522C3D33AD1673B97D0BA19F6A8):


HIGH_PRIORITY_CLASS                   REALTIME_PRIORITY_CLASS
-304 cycles, empty                     -48 cycles, empty
-44 cycles, mov eax,1                 -44 cycles, mov eax,1
-44 cycles, mov eax,1 mov eax,2       -44 cycles, mov eax,1 mov eax,2
-48 cycles, nops 4                    -40 cycles, nops 4
-28 cycles, mul ecx                   -36 cycles, mul ecx
-44 cycles, rol ecx,32               -260 cycles, rol ecx,32
-20 cycles, rcr ecx,31                -20 cycles, rcr ecx,31
   8 cycles, div ecx                    16 cycles, div ecx
   0 cycles, StrLen                     -8 cycles, StrLen


When line 101 in counter2.asm was changed:
from:    jnz   @B                  ;; End of reference loop
to:      jnz   $-40h               ;; End of reference loop

and c2test2.exe reassembled:


HIGH_PRIORITY_CLASS                    REALTIME_PRIORITY_CLASS
 0 cycles, empty                        0 cycles, empty
-4 cycles, mov eax,1                   -4 cycles, mov eax,1
-4 cycles, mov eax,1 mov eax,2         -4 cycles, mov eax,1 mov eax,2
 0 cycles, nops 4                       0 cycles, nops 4
12 cycles, mul ecx                     12 cycles, mul ecx
-4 cycles, rol ecx,32                  -4 cycles, rol ecx,32
20 cycles, rcr ecx,31                  20 cycles, rcr ecx,31
56 cycles, div ecx                     56 cycles, div ecx
40 cycles, StrLen                      40 cycles, StrLen


Cycles are nearly double those given by timers.asm but make more sense to me.

Abel

MichaelW

For several reasons, you cannot get good cycle counts for instruction sequences that execute in less than a few cycles. Usually, a negative count indicates that the system interrupted the reference loop. Unless your system is busy doing something intensive (in which case you should not be trying to time code), you can eliminate most of these problems by delaying for a few seconds before starting the first timing loop, to allow disk cache, etc activity to subside.

The second set of macros use a very different method than the first set. They seem to work better for timing code that executes in only a few clock cycles, and they generally provide more repeatable results, but I have reservations about using them after seeing them reverse the relative positions of your rand32 code and Lingo's, compared to the first set of timing macros.
eschew obfuscation