Code timing macros

Started by MichaelW, February 16, 2005, 03:21:52 AM

sinsi

Quote from: jj2007 on March 16, 2009, 10:38:50 AMA P4-specific problem?
I get the same numbers from 98SE/NT4/2000 (via virtualpc, which is a P3 I think).

I would think more of an inc/lib/ml version problem...
Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

Considering the code that is being executed, I suspect that your first two results are not valid. I think the biggest part of the problem is the P4 in combination with your attempt to time code that executes too fast for reliable results on any processor. On my P3 I get a consistent:

0012FFC0
Testing whether
        mov ecx, dword ptr [esp-8]
        jmp ecx
is faster than
        jmp dword ptr [esp-8]

1       cycles jmp directly
1       cycles mov+jmp
170     cycles jmp directly, divide by 10
169     cycles mov+jmp, divide by 10
0012FFC0


The starting and ending hex number are the value of the stack pointer at entry and exit. If I understand what you are trying to do, perhaps something like this would work better:

    invoke Sleep, 3000

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push @f
        jmp DWORD PTR [esp]
      @@:
        add esp, 4
      ENDM
    counter_end
    print str$(eax),13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push @f
        mov ecx, [esp]
        add esp, 4
        jmp ecx
      @@:
      ENDM
    counter_end
    print str$(eax),13,10

Running on my P3, I get a consistent:

297
335





eschew obfuscation

sinsi

98 under virtualpc gives me -1, -1 for the first two (watch out marty!)
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: MichaelW on March 16, 2009, 11:10:26 AM
Considering the code that is being executed, I suspect that your first two results are not valid. I think the biggest part of the problem is the P4 in combination with your attempt to time code that executes too fast for reliable results on any processor. On my P3 I get a consistent:
..
If I understand what you are trying to do

Michael,
Even a simple doubling produces odd results:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
invoke p2, 123
invoke p2, 123
counter_end

That should give 8 cycles, but I get 65:
4       cycles jmp directly
65      cycles jmp directly, divide by 2


On a Celeron M, I get consistently 4 cycles for a single test, and with REPEAT 10 I get 40/10 = 4 cycles. So it might be the P4... ::)

P.S.: Background is an argument with Lingo over this technique (admittedly not a very clear style, but it's short and fast):

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32s proc src:DWORD
    pop eax                     ; trash the return address
    pop eax                     ; the src pointer
    ; ... do stuff ...
    jmp dword ptr [esp-4-4]     ; ret address, one arg - the Lingo style equivalent to ret 4 ;-)
strlen32s endp
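
For comparison, a conventional version with a normal epilogue would look something like the sketch below (the body elided just as above). The jmp dword ptr [esp-4-4] reaches the same return address that ret 4 would use, after both pops have already moved esp past it and the argument:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
strlen32s proc src:DWORD
    mov eax, [esp+4]            ; the src pointer (return address still at [esp])
    ; ... do stuff ...
    ret 4                       ; return and discard the one DWORD argument
strlen32s endp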

sinsi

You can't tell what Windows is doing - forget about the priority thing, Windows has its own agenda...
Interrupts (timer/kbd/mouse, hell even usb/vga) muck things up, different cores, etc. - with multiple cores/CPUs you can't rely on rdtsc, and there's SpeedStep/throttling too.

A load from memory followed by a jump should be slower than a jump through a memory operand, but it depends on what the Intel guys decided (or what the compiler guys convinced them to do, eh?)

I think that once we get down to single-figure counts we're just hoping (pissing in the wind  :bg).
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: sinsi on March 16, 2009, 12:55:25 PM
I think that once we get down to single-figure counts we're hoping (pissing in the wind  :bg).

Of course, relative precision is low. But the timer macros still yield surprisingly consistent and plausible results. Some time ago I played with push+pop; timings for a P4:

1       cycles for n*(push eax+pop ecx)
4       cycles for n*(push eax+pop eax)
1       cycles for n*push eax, n*pop ecx
9       cycles for n*(push eax+pop m32a)
10      cycles for n*push eax, n*pop m32a
9       cycles for n*(push m32a+pop m32b)
10      cycles for n*push m32a, n*pop m32b
26      cycles for n*(pushad+popad)
26      cycles for n*pushad, n*popad

The practical relevance is admittedly low, but it tells something about the CPU when a push/pop for the same register is four times slower. IMHO these timer macros are a fantastic tool - thanx, Michael :U

[attachment deleted by admin]

BeeOnRope

Quote from: jj2007 on March 16, 2009, 03:36:51 PM
The practical relevance is admittedly low, but it tells something about the CPU when a push/pop for the same register is four times slower. IMHO these timer macros are a fantastic tool - thanx, Michael :U

The issue is dependence chains in the code.  A push/pop sequence on the same register is serially dependent - the prior pop must complete before the next push can occur, since that push needs the value the pop produced.  Also, the prior push must complete before the pop, since the pop reads the same memory location the push just wrote, so the CPU must "complete" the push in order to process the pop.  So in this case you are measuring the latency of the operation, which turns out to be about 4 cycles.

When you push/pop different registers, the chain is broken: the pop->push sequence is no longer dependent, because the pushed value comes from a different register than the one the pop writes.  Even though push->pop is still dependent, the chain between iterations is broken and the operations can be overlapped - so you are measuring the (reciprocal) throughput instead.
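
To make the latency/throughput distinction concrete, here is a minimal sketch using the counter macros from this thread; the loop and repeat counts are only illustrative:

    ; dependent pairs: each push reads the eax value the previous pop wrote,
    ; so the pairs execute one after another and the count reflects latency
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push eax
        pop  eax
      ENDM
    counter_end
    print str$(eax),13,10

    ; independent pairs: the pushed value (eax) does not depend on the
    ; previous pop (into ecx), so successive pairs can overlap and the
    ; count reflects reciprocal throughput
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push eax
        pop  ecx
      ENDM
    counter_end
    print str$(eax),13,10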

In general, these kinds of complexities are why your tests of whether "mov reg + jmp reg" is faster than "jmp mem" are invalid.  No real code consists of a tight loop of these instructions - it has to do some useful work as well.  In performing that useful work, the cost of the mov ecx, [esp-8] may be completely hidden, since it can be issued earlier, long before the jmp.  In fact this is very likely, because the load is completely independent of the preceding code (it only needs ecx to be free), so there is a good chance it comes for free.  Then if you compare jmp reg vs jmp mem, you may find that jmp reg is faster (probably free, in fact) than jmp [mem], which still has to do the read.

This is why micro-benchmarking small sequences of instructions is quite useless (except perhaps to build tables of latencies and throughput) - since the actual behavior in real code may be very different.

jj2007

Quote from: BeeOnRope on March 31, 2009, 08:13:48 PM
The issue is dependence chains in the code.  A push/pop sequence on the same register is serially dependent -

Welcome, BeeOnRope :thumbu

What you write is not entirely new, but we tend to forget, and I have rarely seen it so nicely explained. Thanks :U (and apart from that, I wanted to tease Lingo a little bit :bg)

You might like the linecounter thread - we needed kilocycles to time that one. I would also like to hear your views on the practical relevance of timings and cache.

m0nk3yi3unz

Hey MichaelW! Great stuff here - I've never fathomed using the clock as a timing mechanism (that sounds weird, but you know what I mean). I didn't even know you -could- monitor the CPU clock :X

I'll definitely be checking out the source.

And, to keep the trend going:

Quote
21 cycles
22 cycles
86 ms
86 ms

dedndave

go to the first post in this thread and d/l the zip file named counter2.zip
place counter2.asm in your masm32\macros folder
use the test.asm file as an example of how to use it
Michael did a great job and spent considerable time getting it to work correctly - lol

Magnum

When I ran test.exe, I got 50 cycles and 678 ms.

That is on a P-3, 733 Mhz system.

I have to agree with Hutch.

The timing macros work well when comparing code on a single system,
but that kind of granularity isn't achievable when comparing a P-3 to a P-4, etc.

But I plan on using them to see where I can speed things up.

I think it's like comparing apples to oranges.

Andy


Have a great day,
                         Andy

dedndave

the way we really use it, it works ok
we usually want to take 2 proc's (sometimes more) and compare how they perform on a variety of processors
as an example:
on processor type A, routine #1 is faster than routine #2 by x %
on processor type B, routine #2 is faster than routine #1 by y %
while "absolutes" are interesting, we just want to know which routine is best, and by what percentage
for that purpose, the timing macros work great
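
as a minimal sketch of that workflow (routine1 and routine2 are hypothetical procedures standing in for the two candidates; the loop counts are only illustrative):

    counter_begin 1000, HIGH_PRIORITY_CLASS
      call routine1                 ; candidate #1
    counter_end
    print str$(eax)," cycles, routine #1",13,10

    counter_begin 1000, HIGH_PRIORITY_CLASS
      call routine2                 ; candidate #2
    counter_end
    print str$(eax)," cycles, routine #2",13,10

run the same test program on each processor type and compare the two counts - the absolute numbers differ from cpu to cpu, but the ranking and rough percentage are what we're after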

PBrennick

Yes, it is a relative thing needing no absolutes, just a comparison that can be based on anything as long as it is the same in all cases. The macros allow this to be done quite nicely.

Paul
The GeneSys Project is available from:
The Repository or My crappy website

sysfce2

I'm trying to convert this code timing macro to NASM and I have a question about how it works:

In the first section, under the reference loop, there is:
@@:
sub __ctr__loop__counter__, 1
jnz @B

This appears to repeatedly subtract 1 from the loop counter.
Should it loop to the "@@:" above where it says ";; Reference loop"?

MichaelW

I'm not sure what you mean by "above where it says ;; Reference loop". The code below implements an empty reference loop that is used to determine the cycle count for the loop overhead, which includes not only the two instructions in the loop but everything from the preceding RDTSC to the next RDTSC.

@@:                     ;; Start an empty reference loop
        sub   __counter__loop__counter__, 1
        jnz   @B


Note that the code from the preceding RDTSC to the next RDTSC is the same for both loops, so the count for the test code can be calculated by subtracting the count for the reference loop from the count for the test loop.
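
As a rough sketch of the structure described above (simplified; the register choices and count handling here are only illustrative, and the real macros in counter2.zip do more housekeeping):

    ; reference pass: time the empty loop to get the overhead
    rdtsc
    mov   esi, eax                          ; low dword of the starting count
    mov   __counter__loop__counter__, 1000  ; the requested loop count
  @@:                                       ; empty reference loop
    sub   __counter__loop__counter__, 1
    jnz   @B
    rdtsc
    sub   eax, esi                          ; cycles for the empty reference loop
    mov   edi, eax                          ; save the reference count

    ; test pass: the same frame again, with the test code inside the loop,
    ; and the reported result is (test loop cycles) - (reference loop cycles)
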
eschew obfuscation