Code location sensitivity of timings

jj2007 · May 22, 2009, 10:09:37 AM

Quote from: MichaelW on May 22, 2009, 09:12:12 AM
QuoteThe TIMER macros take account of that by waiting until a new time slice is available (@MichaelW: please correct me if that is imprecise).

Only the second set of macros (counter2.asm in counter2.zip) do this. These macros also capture the lowest cycle count that occurs in any loop.

Oops, I stand corrected - counter2.asm, lines 57 and 106:

Code Select

invoke Sleep, 0 ;; Start a new time slice

If that is sufficient to guarantee a new slice, it can be done with a

QuotecyctCutFast = 10
cyctSleepMs = 0

See in cyct_begin:

Code Select

  ifdef cyctSleepMs
	invoke Sleep, cyctSleepMs
  endif

And unfortunately, it has no influence on the observed speed difference, so that does not seem to be the culprit...

Re lowest cycle count, my wrapper uses CombSortA to eliminate the outliers. On some occasions, I saw negative cycle counts, too.

Mark Jones · May 22, 2009, 01:29:40 PM

AMD x64 4000+ / WinXP Pro x64

Code Select


AddStrBSlow:
average with cutoff     215     cutoff at n=995 for ct>=269 cycles
average w/o cutoff      222
difference raw-cutoff:  7

Code Select


AddStrBFast:
average with cutoff     230     cutoff at n=987 for ct>=288 cycles
average w/o cutoff      239
difference raw-cutoff:  9

jj2007 · June 09, 2009, 01:56:05 PM

Another cute example (d7s8=destination aligned 16+7, source 16+8):

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d0s0           729       603       602      1113      2389
2048, d7s8          1585      1220       831      1615      5430
2048, d7s9          1586      1226       831      4438      4948

Same code with MakeSlow = 1:

Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d7s8          1562      7280      7257      1622      5423
2048, d7s9          1562      7284      7271      5237      4923

Code Select

if MakeSlow	; used exactly once, before the table
	print chr$(13, 13)
endif

Good to know that others have the same problem:
I'm a developer for the x264 video encoder. When Loren Merritt, another developer, was testing SAD assembly operation performance, he noticed that the number of clocks required varied wildly. In particular, when the unaligned data to be loaded crossed a cache line boundary, which occurred 25% of the time for blocks of width 16 bytes, up to eight times as many clock cycles were required for the entire operation, depending on the processor. This issue was found to exist not only for SSE2 load operations like MOVDQU, but also for MMX, and on processors going all the way back to the Pentium 3 (though obviously the cache line intervals varied). No older processors were tested than the Pentium 3. Page lines were also encountered every 64 cache lines, i.e. 1/256th of the time. Data crossing a page line resulted in up to 100 times slower performance; thousands of clock cycles for a single 16x16 SAD operation that normally took 48. However, when testing the same operations on AMD processors, we noticed there was no measurable penalty at all for unaligned loads across cache lines, and furthermore, the page line penalty was a mere 5-10%.

[attachment deleted by admin]

hutch-- · June 09, 2009, 02:01:20 PM

JJ,

To force SleepEx to actually take a time slice set it to over 20 ms and you are reasonably safe, set to 0 it can immediately return without reference to a timeslice.

I usually use,

Code Select


    invoke SleepEx,100,0

By doing it this wy you get the fresh boost directly after the task switch and its very uniform from one instance of SleepEx() to another.

jj2007 · June 09, 2009, 02:23:10 PM

Quote from: hutch-- on June 09, 2009, 02:01:20 PM
Code Select Expand
invoke SleepEx,100,0

By doing it this wy you get the fresh boost directly after the task switch and its very uniform from one instance of SleepEx() to another.

Thanks, Hutch. I'll implement it next time, although it seems to have no influence on the timings. In the attachment above, I use the cutoff method, i.e. I take a thousand timings and use the average from n=10 to the first one that is 25% slower than cycles(10). The cutoff usually takes place at >990 for non-P4 and between 500...950 for P4. The resulting timings are very stable.

The problem here is not the measurement. That code is really slow, under specific conditions, and some call it "Cacheline splits aka Intel hell" :toothy

P.S.: The code posted above has a new set of macros, wrapping the well-known TIMER.ASM macros by MichaelW. The table above was produced with this code (apologies to the macro haters :wink):

Code Select

.code
MyString	db "A 16 bytes text.", 0

start:
	invoke ShowCpu, 1	; prints the brand string and returns SSE level in eax
	mov ebx, ByteCount/16+1
	.Repeat
		invoke lstrcat, addr src16, offset MyString
		dec ebx
	.Until Zero?

	cyctProcs	equ MemCo1, MemCo2, MemCo5, MemCopy, crt_strcpy

	cyctTable "2048, d0s0", offset dest16, offset src16, 2048
	cyctTable "2048, d8s8", offset dest8, offset src8, 2048
	cyctTable "2048, d7s7", offset dest7, offset src7, 2048
	cyctTable "2048, d7s8", offset dest7, offset src8, 2048
	cyctTable "2048, d7s9", offset dest7, offset src9, 2048
	cyctTable "2048, d9s7", offset dest9, offset src7, 2048
	cyctTable "2048, d9s8", offset dest9, offset src8, 2048
	cyctTable "2048, d9s9", offset dest9, offset src9, 2048
	inkey chr$(13, 10, 10, "--- ok ---", 13)
	exit

UtillMasm · June 09, 2009, 02:27:39 PM

:U

Code Select

@echo off
\masm32\bin\ml.exe /FoMemCopySSE2.obj /nologo MemCopySSE2.asm
pause

Code Select

 Assembling: MemCopySSE2.asm
MemCopySSE2.asm(3) : fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2
Masm.inc
Press any key to continue . . .

jj2007 · June 09, 2009, 03:53:58 PM

Quote from: UtillMasm on June 09, 2009, 02:27:39 PM
fatal error A1000: cannot open file : \masm32\Gfa2Masm\Gfa2Masm.inc

Oops... download new version above, or change lines 2+3 as follows:

include \masm32\include\masm32rt.inc
; include \masm32\Gfa2Masm\Gfa2Masm.inc ; needed only for deb macro

dedndave · June 09, 2009, 04:00:09 PM

hey Jochen,
I was looking at this post you refered to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it

jj2007 · June 09, 2009, 08:23:52 PM

Quote from: dedndave on June 09, 2009, 04:00:09 PM
hey Jochen,
I was looking at this post you refered to...
http://software.intel.com/en-us/forums/watercooler-catchall/topic/56271/
they are using "ffmpegs bench.h" to time algos
do you know where to find that?
i looked high and low for it

Apparently ffmpegs is a suite for Linux that includes also a -bench option...

dedndave · June 09, 2009, 11:06:56 PM

well - i found ffmpeg.org - they make an open source mpeg player/editor/whatever
i was interested in their timing code
the file i was trying to find was named "bench.h"

Mark Jones · June 10, 2009, 02:29:39 AM

Code Select

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Algo              MemCo1    MemCo2    MemCo5   MemCopy _imp__str
Description    rep movsd    movdqa    movdqa    Masm32       CRT
                 dest-al  src-al A  src-al B   library    strcpy
Code size             63       201       270         ?         ?
----------------------------------------------------------------
2048, d0s0           549       424       361       546      2094
2048, d8s8           585       465       403       555      2100
2048, d7s7           593       478       411      1390      2088
2048, d7s8           834       562       564      1055      2172
2048, d7s9           834       564       564      1187      2175
2048, d9s7           833       564       565      1561      2173
2048, d9s8           834       562       566      1216      2172
2048, d9s9           575       472       410      1399      2091

jj2007 · June 11, 2009, 08:05:17 PM

Got some weird timings again:

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4   MemCo5   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     2591      566      363      363      373      363      560
2048, d1s1-0     2598      619      421      423      444      423     1726
2048, d7s7-0     2584      619      419      421      446      421     1735
2048, d7s8-1     2782     1525     1090      441      956     1033     1527
2048, d7s9-2     2788     1526     1090      441      957     1033     1755
2048, d8s7+1     2786     1309     1090      683      809      815     1467
2048, d8s8-0     2581      619      421      423      448      423      560
2048, d8s9-1     2789     1526     1083      441      956     1033     1465
2048, d9s7+2     2790     1309     1081      684      809      815     1802
2048, d9s8+1     2782     1309     1081      684      809      815     1511
2048, d9s9-0     2589      619      421      423      448      423     1737
2048, d15s15     2582      619      423      425      446      424     1723

MemCoP4 for the P4. MemCo2 should be insensitive to the cacheline split trap, as it does no unaligned memory accesses.
MemCoC3 is optimised for Celeron, and pretty fast, but very sensitive to code location.

Grateful for tests on other CPUs, jj - [EDIT] with the new MemCopySSE2c.zip, see two posts further down.

ramguru · June 11, 2009, 08:22:40 PM

Code Select


Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (SSE4)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4   MemCo5   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     1985      191      204      205      203      175      196
2048, d1s1-0     1992      223      251      248      251      222      244
2048, d7s7-0     1982      227      253      250      248      223      243
2048, d7s8-1     1976      246      592      501      193      211      243
2048, d7s9-2     1986      247      594      501      194      212      243
2048, d8s7+1     1982      249      587      501      193      179      244
2048, d8s8-0     1976      229      255      253      251      226      243
2048, d8s9-1     1983      249      588      502      194      211      243
2048, d9s7+2     1982      248      583      501      193      179      243
2048, d9s8+1     1977      249      583      501      193      179      243
2048, d9s9-0     1984      229      255      253      251      226      244
2048, d15s15     1974      228      256      256      258      226      243


--- ok ---

jj2007 · June 11, 2009, 10:02:07 PM

Core 2 shows dramatic improvements...!

I have renamed MemCo5 to MemCoC2, for Core 2. And I moved the somewhat faster loop for cases d7s8-1 etc. from MemCoP4 to the new MemCoC2.

[attachment deleted by admin]

Mark Jones · June 11, 2009, 10:07:00 PM

Code Select

AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)

Algo         _imp__st   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32
               strcpy  dest-al    psllq src+dest  dest-al   src-al  library
Code size           ?       70      291      222      200      269       33
---------------------------------------------------------------------------
2048, d0s0-0     2079      551      359      424      424      359      547
2048, d1s1-0     2084      613      410      473      473      410     1060
2048, d7s7-0     2080      598      412      474      474      411     1059
2048, d7s8-1     2156      853     1016      564      567      569      802
2048, d7s9-2     2162      859     1016      564      567      568     1058
2048, d8s7+1     2172      849      868      564      565      566      804
2048, d8s8-0     2082      603      404      465      465      402      547
2048, d8s9-1     2177      848      995      564      565      568      803
2048, d9s7+2     2156      855      862      581      565      580     1060
2048, d9s8+1     2167      869      878      564      567      566      803
2048, d9s9-0     2090      595      412      472      472      409     1060
2048, d15s15     2064      592      410      470      486      408     1060

News:

Code location sensitivity of timings

ramguru