StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

jj2007

Here is one more, with a modification of Alex' excellent algo (shorter, same cycle count on my CPU).
I tried to include Lingo's new algo, but - not surprisingly - it raised an exception. If you are masochistic enough, you can "heal" it as follows:

Quote
ExA:
   bsf            eax,   ecx
   mov edx, [esp-8]      ; added by JJ (still unsafe but for testing it's ok)
   jmp            edx
StrLenLingo   endp
Lingo's algo will still raise an exception for CrashIt = 1.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
------- timings, misaligned -------
131     cycles for szLen
49      cycles for MbStrLen4a
35      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
35      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j


EDIT:
Quote
   sub ecx, edx
   pxor xmm7, xmm7         ; thanks Alex!!
   jz @F
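
For readers without the archive: the sources of the timed procedures are not reproduced in this thread, so the following is only a minimal C sketch of the general SSE2 technique that the snippets above hint at (pxor to get a zero register, pcmpeqb against 16 bytes, then a bit scan). It is not the code of AxStrLenSSE1, AxStrLenSSE1j or StrLenLingo. The names sketch_strlen_sse2 and sketch_selfcheck are hypothetical, and __builtin_ctz assumes GCC or Clang.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not one of the procedures timed in this thread. */
static size_t sketch_strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();   /* same effect as pxor xmm, xmm */

    /* Round the pointer down to a 16-byte boundary. An aligned 16-byte load
       stays inside one memory page, so it cannot fault on the page that
       follows the string. Reading the few bytes before s is outside what
       ISO C guarantees; it is the usual trick at the machine level. */
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15u);
    const __m128i *p = (const __m128i *)(addr - misalign);

    /* First block: one mask bit per byte that equals zero (pcmpeqb +
       pmovmskb), then drop the bits of the bytes located before s. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128(p), zero));
    mask >>= misalign;
    if (mask != 0)
        return (size_t)__builtin_ctz(mask);     /* lowest set bit, like bsf */

    /* Remaining blocks are all 16-byte aligned. */
    for (;;) {
        ++p;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128(p), zero));
        if (mask != 0)
            return (size_t)((const char *)p - s) + (size_t)__builtin_ctz(mask);
    }
}

/* Compare against the expected length for every misalignment 0..15.
   The buffer is padded so that all 16-byte loads stay inside it. */
static void sketch_selfcheck(void)
{
    char buf[96];
    for (unsigned off = 0; off < 16; ++off) {
        for (size_t len = 0; len <= 40; ++len) {
            memset(buf, 0, sizeof buf);          /* zero bytes before the string too */
            memset(buf + 16 + off, 'x', len);    /* string body, NUL follows */
            assert(sketch_strlen_sse2(buf + 16 + off) == len);
            assert(strlen(buf + 16 + off) == len);
        }
    }
}
```

The self-check puts zero bytes in front of the string on purpose: if the bits for the bytes before the start were not shifted out of the first mask, the routine would report a length that is too short.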

Antariy

Please test the attached archive. I added Lingo's and Edgar's procs.

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
263     cycles for szLen
97      cycles for MbStrLen4a
100     cycles for MbStrLen4aP4
121     cycles for MbStrLen4aP42
108     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
119     cycles for lszLenMMX

251     cycles for szLen
96      cycles for MbStrLen4a
122     cycles for MbStrLen4aP4
96      cycles for MbStrLen4aP42
108     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
124     cycles for lszLenMMX

251     cycles for szLen
91      cycles for MbStrLen4a
146     cycles for MbStrLen4aP4
105     cycles for MbStrLen4aP42
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
119     cycles for lszLenMMX

251     cycles for szLen
71      cycles for MbStrLen4a
96      cycles for MbStrLen4aP4
120     cycles for MbStrLen4aP42
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
122     cycles for lszLenMMX


--- ok ---


About the fastest proc: this is like comparing a lame horse with a bulldozer. The horse sullies some regs that must be preserved for a fair comparison, and the horse stumbles and falls on some strings :P

And the naughty horse messes up the timings of the good bulldozers :)


Alex

Antariy

For Jochen's latest archive (80StrLen.zip):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
269     cycles for szLen
85      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j
58      cycles for StrLenLingo

268     cycles for szLen
83      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
57      cycles for StrLenLingo

269     cycles for szLen
85      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
57      cycles for StrLenLingo

269     cycles for szLen
105     cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---




Alex

jj2007

Quote from: Antariy on August 22, 2010, 09:54:29 PM
For Jochen's latest archive (80StrLen.zip):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j


Strange - these errors are not present in my runs. Can you check what happened there??

Antariy

Jochen,

movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7       <--- move this above jz @F
pcmpeqb xmm7, [edx]




Alex

jj2007

Quote from: Antariy on August 22, 2010, 10:08:00 PM

movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7       <--- move this above jz @F
pcmpeqb xmm7, [edx]


Of course :red
It's fixed, see new attachment above.
Thanks Alex :U

Antariy

I moved pxor between sub ecx, edx and jz @F; here are the timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
262     cycles for szLen
82      cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

263     cycles for szLen
87      cycles for MbStrLen4a
64      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

262     cycles for szLen
85      cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

262     cycles for szLen
84      cycles for MbStrLen4a
64      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---



Edited: my fix is the same as your fix, Jochen, so the results are the same.



Alex

jj2007

Thanks. So my 1j is three cycles slower on your CPU - on mine it's about half a cycle faster. For aligned strings, by the way, it looks like this:
------- timings, 16-byte aligned -------
131     cycles for szLen
32      cycles for MbStrLen4a
32      cycles for AxStrLenSSE1
32      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (UNSAFE)

Antariy

For aligned:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, 16-byte aligned -------
263     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
70      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

261     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

262     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

261     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---


So a difference of a few clocks is not meaningful. As you can see, Lingo's proc does not work very well  :green2 Considering that it crashes and does not preserve regs, no comment...  :green2



Alex

KeepingRealBusy

Alex,

91Test_StrLenSaveXmm.exe crashes on my P4

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
260     cycles for szLen
53      cycles for MbStrLen4a


91Test_StrLenSaveXmm.exe has encountered a problem and needs to close.  We are sorry for the inconvenience.

Dave.

Antariy

Dave, this has something to do with Jochen's old MbStrLen4aP4 proc :(



Alex

Antariy

Dave, please test "80StrLen.zip"; maybe Jochen has fixed this problem. I have his old proc for P4.



Alex

donkey

Here's my test from my laptop:

AMD Athlon(tm) X2 Dual-Core QL-62 (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
147     cycles for szLen
56      cycles for MbStrLen4a
53      cycles for MbStrLen4aP4
52      cycles for MbStrLen4aP42
72      cycles for AxStrLenMMX
69      cycles for AxStrLenSSE1
43      cycles for StrLenLingo
75      cycles for lszLenMMX

138     cycles for szLen
58      cycles for MbStrLen4a
56      cycles for MbStrLen4aP4
63      cycles for MbStrLen4aP42
76      cycles for AxStrLenMMX
64      cycles for AxStrLenSSE1
43      cycles for StrLenLingo
76      cycles for lszLenMMX

137     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
53      cycles for MbStrLen4aP42
76      cycles for AxStrLenMMX
68      cycles for AxStrLenSSE1
42      cycles for StrLenLingo
77      cycles for lszLenMMX

138     cycles for szLen
50      cycles for MbStrLen4a
56      cycles for MbStrLen4aP4
55      cycles for MbStrLen4aP42
70      cycles for AxStrLenMMX
65      cycles for AxStrLenSSE1
48      cycles for StrLenLingo
77      cycles for lszLenMMX


--- ok ---


Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Antariy

Dave, when you run your development computer (AMD), could you find out where the app crashes?



Alex

Antariy

Thanks, Edgar!

It is always nice to have timings from different hardware.



Alex