Pages: [1] 2 3 4
Topic: lstrcpy vs szCopy (Read 36494 times)

jj2007
Well, if the subject sounds familiar: I am reviving an old thread. The algos there are fast, but they are MMX and therefore trash the FPU (I like the FPU). So I thought of adapting one of them, actually an algo by Lingo, to produce an XMM version. And to make it more realistic, I introduced spoilers:

    .data
    align 16
    spoil1 db 1, 2, 3 ; badly aligned source
    String1 DB "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMMNnOoP",\

On my Core 2 Celeron M, differences in timings are there but not dramatic. What was more dramatic was the silent bye-bye when I removed the spoilers... With the spoilers, the XMM version works just fine and leaves the FPU in peace. When I remove them, then both Lingo's and my adapted algo crash miserably with exception #5 at

    movq xmm0, qword ptr [ecx+eax]

Anybody interested to have a look into this? I also suspect that my version could be a lot improved...
[attachment deleted by admin]

sinsi
It seems to overwrite the counter used by counter_end - this is always 39383736h, so the counter never gets to 0. I don't get any sort of exception.

jj2007
Thanks Sinsi - that makes sense. In the meantime, I made up another xmm version:

    comment * based on MMX Fast by Mark Larson *
    align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
    nop
    szCopyXMM proc dest:DWORD, src:DWORD
        mov eax,[esp+8]
        mov esi,[esp+4]
        align 16
    qword_copy1b:
        pxor xmm1, xmm1
        movups xmm0, oword ptr [eax]
        pcmpeqb xmm1, xmm0
        add eax, 8+8
        pmovmskb ecx, xmm1
        or ecx,ecx
        jnz finish_rest1
        movups oword ptr [esi], xmm0
        add esi, 8+8
        jmp qword_copy1b
    finish_rest1:
        ret 8
    szCopyXMM endp

512-byte string copy timing results:

    szCopyXMM -> jj -> xmm:                  278 clocks
    szCopyMMX -> Mark Larson -> MMX:         312 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       284 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  309 clocks
    szCopy                                  1076 clocks
    lstrcpy                                 1202 clocks
    SzCpy10 - > Lingo -> MMX:                283 clocks
    MbCopy -> jj -> xmm:                     384 clocks

Now one problem is that, aligned or not, these algos work in chunks of 128 bits (16 bytes). So there are problems with small strings...

qWord
hi,
after a quick look, I think the problem is caused by "test bl,bl" (in your and Lingo's routine): there are 4 packed bytes after "packsswb", so you have to test all 4 of them with "test ebx,ebx".
regards, qWord
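
For readers following qWord's point, here is a small C model of why the byte-sized test can miss a terminator. Lingo's routine is not reproduced in this thread, so the sequence modelled here (pcmpeqb against zero over 8 bytes, packsswb of the result with itself, low dword moved to ebx) is an assumption, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation of one 16-bit word to a byte, as packsswb does. */
uint8_t saturate_word_to_byte(int w)
{
    assert(w >= -32768 && w <= 32767);
    if (w > 127)  return 0x7F;
    if (w < -128) return 0x80;
    return (uint8_t)w;
}

/* Assumed sequence: pcmpeqb against zero, packsswb with itself, then the
   low dword of the result (the value that would end up in ebx). */
uint32_t packed_zero_flags(const unsigned char chunk[8])
{
    uint32_t low_dword = 0;
    for (int word = 0; word < 4; word++) {
        /* pcmpeqb yields 0xFF for a zero byte, 0x00 otherwise. */
        int lo = (chunk[2 * word] == 0) ? 0xFF : 0x00;
        int hi = (chunk[2 * word + 1] == 0) ? 0xFF : 0x00;
        int w = lo | (hi << 8);
        if (w >= 0x8000)          /* reinterpret as a signed 16-bit word */
            w -= 0x10000;
        low_dword |= (uint32_t)saturate_word_to_byte(w) << (8 * word);
    }
    return low_dword;
}
```

Under these assumptions each packed byte covers two source bytes. With the terminator at byte 5 of the chunk, the low byte (what bl would hold) is 0 while the full dword is 00800000h, so only the dword test notices it.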

jj2007
Thanks, qword - I am afraid it keeps choking. But the other one seems to work just fine, also for small strings and bad alignment. However, it needs a zero delimiter at the end - see mov byte ptr [esi], 0 below

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE
    comment * based on MMX Fast by Mark Larson *
    align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
    nop
    szCopyXMM proc dest:DWORD, src:DWORD
        mov eax,[esp+8]
        mov esi,[esp+4]
        align 16
    qword_copy1b:
        pxor xmm1, xmm1
        movups xmm0, oword ptr [eax]
        pcmpeqb xmm1, xmm0
        add eax, 8+8
        pmovmskb ecx, xmm1
        or ecx,ecx
        jnz finish_rest1
        movups oword ptr [esi], xmm0
        add esi, 8+8
        jmp qword_copy1b
    finish_rest1:
        mov byte ptr [esi], 0
        ret 8
    szCopyXMM endp

jj2007
OK, I made a bit of cleanup and am satisfied with this version:

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE
    comment * inspired by "MMX Fast" by Mark Larson *
    ; align 16 ; seems to have NO influence
    szCopyXMM proc dest:DWORD, src:DWORD
        push esi
        push edi
        mov edi, [esp+4+8]
        mov esi, [esp+8+8]
        push ecx ; preserve another valuable register
    @@:
        pxor xmm1, xmm1
        movups xmm0, oword ptr [esi]
        pcmpeqb xmm1, xmm0
        pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
        test ecx,ecx
        jnz @F
        movups oword ptr [edi], xmm0
        add esi, 16
        add edi, 16
        jmp @B
    @@:
        .Repeat
            lodsb ; relatively slow
            stosb ; tail cleanup
        .Until al==0
        mov eax, edi ; a stringcat routine might need this one
        pop ecx ; restore ecx
        pop edi
        pop esi
        ret 8 ; cleanup
    szCopyXMM endp

Testing the 16-byte boundary looks fine:

    Source=B23456789012345    Dest=B23456789012345
    Source=C234567890123456   Dest=C234567890123456
    Source=D2345678901234567  Dest=D2345678901234567

512-byte string copy timing results (aligned):

    len of source string = 512
    len of szCopyXMM: 55
    szCopyXMM -> jj -> xmm:                  298 clocks
    szCopyMMX -> Mark Larson -> MMX:         312 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       285 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  308 clocks
    szCopy                                  1053 clocks
    lstrcpy                                 1184 clocks
    SzCpy10 - > Lingo -> MMX:                283 clocks
    MbCopy -> jj -> xmm:                     380 clocks

Three times as fast as szCopy, 55 bytes short, and does not trash the FPU. The only caveat is that your puter should be less than seven years old.
[attachment deleted by admin]
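
The structure of the routine above can be mirrored in portable C for anyone who wants to step through it in a debugger. This is only a sketch: `zero_byte_mask16` and `sz_copy_chunked` are made-up names, the mask helper is a scalar stand-in for pcmpeqb/pmovmskb rather than real SSE2 code, and it says nothing about timings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for pcmpeqb + pmovmskb against a zeroed register:
   bit i of the result is set when chunk[i] == 0. */
unsigned zero_byte_mask16(const unsigned char *chunk)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++) {
        if (chunk[i] == 0)
            mask |= 1u << i;
    }
    return mask;
}

/* Hypothetical C model of the szCopyXMM structure. Returns a pointer one
   past the copied terminator, mirroring "mov eax, edi" after the tail. */
char *sz_copy_chunked(char *dest, const char *src)
{
    assert(dest != NULL && src != NULL);

    /* Main loop: copy a chunk only if it contains no zero byte. */
    while (zero_byte_mask16((const unsigned char *)src) == 0) {
        memcpy(dest, src, 16);   /* stands in for movups [edi], xmm0 */
        src += 16;
        dest += 16;
    }

    /* Tail: byte-by-byte copy including the terminator (lodsb/stosb). */
    char c;
    do {
        c = *src++;
        *dest++ = c;
    } while (c != 0);

    return dest;
}
```

Like the assembly, the model reads the source in whole 16-byte chunks, so the source buffer must be readable up to the end of the chunk that contains the terminator. The destination only needs room for the string plus its terminator.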

NightWare
jj, glad to see you play with SIMD stuff, but:
1. Can you explain to me why pxor xmm1,xmm1 is IN the loop?
2. For unaligned data, look at the lddqu instruction.

askm
Who can explain these results?

512-byte string copy timing results:

    len of source string = 512
    len of szCopyXMM: 52
    szCopyXMM -> jj -> xmm:                 2085 clocks
    szCopyMMX -> Mark Larson -> MMX:         323 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       285 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  324 clocks
    szCopy                                  1556 clocks
    lstrcpy                                 1573 clocks
    SzCpy10 - > Lingo -> MMX:                285 clocks
    MbCopy -> jj -> xmm:                     284 clocks

MichaelW
I think the problem might be the processor it's running on. This is what I get on my P3:

    len of source string = 512
    len of szCopyXMM: 52
    szCopyXMM -> jj -> xmm:                 2090 clocks
    szCopyMMX -> Mark Larson -> MMX:         319 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       281 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  308 clocks
    szCopy                                  2078 clocks
    lstrcpy                                 2384 clocks
    SzCpy10 - > Lingo -> MMX:                285 clocks
    MbCopy -> jj -> xmm:                     282 clocks

Or not. If I comment out the tail cleanup code then szCopyXMM runs in 11 cycles and the procedure fails the function tests, implying that most or all of the work is being done by the tail cleanup code.

After more tests I think the problem is my processor. On a P3 I think pmovmskb and pcmpeqb are limited to the MMX registers. I don't see any errors when I assemble, but on the first iteration of the loop ECX is always 0FFh, when it should be 0 up to the last loop.

Or not exactly. Assembling the code with ML 6.14, 6.15, and 7.00 I get:

    004019AF 660FEFC9  pxor mm1,mm1
    004019B3 0F1006    movups xmm0,[esi]
    004019B6 660F74C8  pcmpeqb mm1,mm0
    004019BA 660FD7C9  pmovmskb cx,mm1

And with 6.15 and 7.00 the code generates an illegal instruction exception somewhere further down (in MbCopy). So there is a problem with the version of ML, but if that were fixed then there would be a problem with the processor not supporting some of the instructions.
« Last Edit: February 08, 2009, 09:56:53 AM by MichaelW »

jj2007
Quote from NightWare: "jj, glad to see you play with simd stuff but, 1. can you explain me why pxor xmm1,xmm1 is IN the loop ?"

Because I shamelessly copied that from Mark's code.

Quote from NightWare: "2. for unaligned data, look at lddqu instruction"

Yields the same timings, is 2 bytes longer (55->57 bytes), and decreases the maximum age of your puter. movdqu and movups produce exactly the same timings. I chose movups below (2 bytes shorter than movdqu), but maybe there are differences by processor type. Anyway, thanks a lot for the hint to lddqu, it made me find movups/movdqu, which both drastically improve the timings for the non-aligned strings:

512-byte string copy timing results:

    len of source string = 512
    alignment: offset src=4202611, dest=4203173
    len of szCopyXMM: 55
    szCopyXMM -> jj -> xmm:                  484 clocks
    szCopyMMX -> Mark Larson -> MMX:         474 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       474 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  439 clocks
    szCopy                                  1053 clocks
    lstrcpy                                 1214 clocks
    SzCpy10 - > Lingo -> MMX:                476 clocks
    MbCopy -> jj -> xmm:                     560 clocks
There are some that are a few clocks faster, but remember they trash the FPU.

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE
    comment * inspired by "MMX Fast" by Mark Larson *
    ; align 16 ; seems to have NO influence
    szCopyXMM proc dest:DWORD, src:DWORD
        push esi
        push edi
        mov edi, [esp+4+8]
        mov esi, [esp+8+8]
        push ecx ; preserve another valuable register
        pxor xmm1, xmm1
    @@:
        movups xmm0, [esi]
        pcmpeqb xmm1, xmm0
        pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
        test ecx,ecx
        jnz @F
        movups [edi], xmm0
        add esi, 16
        add edi, 16
        jmp @B
    @@:
        .Repeat
            lodsb ; relatively slow
            stosb ; tail cleanup
        .Until al==0
        mov eax, edi ; a stringcat routine might need this one
        pop ecx ; restore ecx
        pop edi
        pop esi
        ret 8 ; cleanup
    szCopyXMM endp
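
A note on moving pxor out of the loop: pcmpeqb overwrites xmm1, so it is fair to ask whether the register is still zero on the next pass. The scalar C model below (all names are made up for illustration, and it is not SSE2 code) suggests why it is: the loop only continues when the mask is 0, and a mask of 0 means every result byte is 00h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for "pcmpeqb xmm1, xmm0": each byte of xmm1 becomes
   0xFF where it equals the matching byte of xmm0, else 0x00. */
void pcmpeqb_model(unsigned char xmm1[16], const unsigned char xmm0[16])
{
    for (int i = 0; i < 16; i++)
        xmm1[i] = (xmm1[i] == xmm0[i]) ? 0xFF : 0x00;
}

/* Scalar stand-in for "pmovmskb ecx, xmm1": collect each byte's top bit. */
unsigned pmovmskb_model(const unsigned char xmm1[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(xmm1[i] >> 7) << i;
    return mask;
}

/* Walks src in 16-byte chunks with xmm1 zeroed only once, before the
   loop. Returns the number of whole chunks that precede the chunk
   holding the terminator. src must be readable in whole chunks. */
size_t count_clean_chunks_hoisted(const unsigned char *src)
{
    unsigned char xmm1[16] = {0};              /* pxor xmm1, xmm1 (hoisted) */
    unsigned char xmm0[16];
    size_t chunks = 0;

    assert(src != NULL);
    for (;;) {
        memcpy(xmm0, src + 16 * chunks, 16);   /* movups xmm0, [esi] */
        pcmpeqb_model(xmm1, xmm0);
        if (pmovmskb_model(xmm1) != 0)
            return chunks;                     /* terminator is in this chunk */
        /* Mask 0 means every result byte is 0x00: xmm1 is zero again. */
        for (int i = 0; i < 16; i++)
            assert(xmm1[i] == 0);
        chunks++;
    }
}
```

The inner assert is the invariant in question. It can never fire in this model, which is consistent with the hoisted version passing the same tests as the one that clears xmm1 on every pass.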
Finally, as to the "strange" timings: Try to assemble the code with ML 9.0 or with JWasm.

EDIT: Here are the tiny differences between the codes generated by masm 6.14 and the others. You might google for "size override" optimization 66h.

ml v614

    004019C0  0FEFC9    pxor mm1, mm1
    004019C3  0F1006    movups xmm0, dqword ptr [esi]
    004019C6  0F74C8    pcmpeqb mm1, mm0
    004019C9  0FD7C9    pmovmskb ecx, mm1
    004019CC  85C9      test ecx, ecx
    004019CE  75 0B     jne short SzCpy.004019DB
    004019D0  0F1107    movups dqword ptr [edi], xmm0

ml v9

    004019C8  660FEFC9  pxor xmm1, xmm1
    004019CC  0F1006    movups xmm0, dqword ptr [esi]
    004019CF  660F74C8  pcmpeqb xmm1, xmm0
    004019D3  660FD7C9  pmovmskb ecx, xmm1
    004019D7  85C9      test ecx, ecx
    004019D9  75 0B     jne short SzCpy.004019E6
    004019DB  0F1107    movups dqword ptr [edi], xmm0

JWasm

    004019C8  660FEFC9  pxor xmm1, xmm1
    004019CC  0F1006    movups xmm0, dqword ptr [esi]
    004019CF  660F74C8  pcmpeqb xmm1, xmm0
    004019D3  660FD7C9  pmovmskb ecx, xmm1
    004019D7  85C9      test ecx, ecx
    004019D9  75 0B     jne short SzCpy.004019E6
    004019DB  0F1107    movups dqword ptr [edi], xmm0

[attachment deleted by admin]
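
To make the role of the 66h byte concrete, here is a tiny C classifier for the three opcodes that differ between the listings above. It is an illustration only: the enum and function names are invented, and it deliberately knows nothing beyond 0F EF (pxor), 0F 74 (pcmpeqb) and 0F D7 (pmovmskb).

```c
#include <assert.h>
#include <stddef.h>

/* Register file an encoded instruction would use. */
enum simd_regs { REGS_UNKNOWN, REGS_MMX, REGS_XMM };

/* With a leading 66h prefix these three opcodes are the SSE2 forms that
   work on XMM registers; without it they are the MMX-register forms. */
enum simd_regs classify_encoding(const unsigned char *code, size_t len)
{
    assert(code != NULL);

    int has_66 = (len > 0 && code[0] == 0x66);
    size_t i = has_66 ? 1 : 0;

    if (len < i + 2 || code[i] != 0x0F)
        return REGS_UNKNOWN;

    switch (code[i + 1]) {
    case 0xEF:   /* pxor     */
    case 0x74:   /* pcmpeqb  */
    case 0xD7:   /* pmovmskb */
        return has_66 ? REGS_XMM : REGS_MMX;
    default:
        return REGS_UNKNOWN;   /* e.g. movups (0F 10), not covered here */
    }
}
```

Fed the bytes from the ml v614 listing, the classifier reports MMX registers for all three instructions, which fits the observation that the main loop was not testing the data that movups had loaded into xmm0.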

askm
Must be... I used the new 'ml' and then the old 'link' and the differences are...

512-byte string copy timing results:

    len of source string = 512
    len of szCopyXMM: 55
    szCopyXMM -> jj -> xmm:                  358 clocks
    szCopyMMX -> Mark Larson -> MMX:         323 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       285 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  325 clocks
    szCopy                                  1564 clocks
    lstrcpy                                 1571 clocks
    SzCpy10 - > Lingo -> MMX:                284 clocks
    MbCopy -> jj -> xmm:                     476 clocks

jj2007
Quote from askm: "Must be... I used the new 'ml' and then the old 'link' and the differences are..."

Yes, it's the three missing size override 66h bytes. JWasm works fine, too.

MichaelW
Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.

donkey
I was looking through the original thread and noticed a few references to my strings functions, but for some reason I never bothered to post any code, probably because I just assumed I had already posted them in other threads. Here are the functions from strings.lib that Mark Larson referred to. They have mostly been dissected and rewritten over the years by people like Mark, who took them and vastly improved them, but for what it's worth...

lszLenMMX/lszLenMMXW
NOTE: These functions require a Pentium 3 or better with SSE instructions
Calculates the length of a string, the string should be aligned. lszLenMMXW is a Unicode variant.
Parameters: pString = Pointer to a null terminated string
Returns the length of the supplied string not including the NULL terminator

lszCopyMMX
NOTE: This function requires a Pentium 3 or better with SSE instructions
Copies a zero terminated string using the MMX registers (not preserved)
Parameters: Dest = Pointer to destination buffer
Source = Pointer to source string
Returns the address of the destination buffer

    lszCopyMMX FRAME lpDest,lpSource
        uses esi,edi

        mov esi,[lpSource]
        mov edi,[lpDest]

        mov ecx,esi
        and ecx,15
        rep movsb

        nop
        pxor mm0,mm0
        nop
        pxor mm1,mm1
        nop

        :
        movq mm0,[esi]
        movq mm2,[esi]
        pcmpeqb mm2,mm1
        pmovmskb ecx,mm2
        or ecx,ecx
        jnz >
        movq [edi],mm0
        add edi, 8
        add esi, 8
        jmp <
        :

        emms
        ; Do the remainder
        bsf ecx,ecx
        rep movsb
        mov [edi],cl
        mov eax,edi
        sub eax,[lpDest]
        ret
    ENDF

    lszLenMMX FRAME pString

        mov eax,[pString]
        nop
        nop ; fill in stack frame+mov to 8 bytes

        pxor mm0,mm0
        nop ; fill pxor to 4 bytes
        pxor mm1,mm1
        nop ; fill pxor to 4 bytes

        : ; this is aligned to 16 bytes
        movq mm0,[eax]
        pcmpeqb mm0,mm1
        add eax,8
        pmovmskb ecx,mm0
        or ecx,ecx
        jz <

        sub eax,[pString]

        bsf ecx,ecx
        sub eax,8
        add eax,ecx

        emms

        RET
    ENDF

    lszLenMMXW FRAME pString

        mov eax,[pString]
        nop
        nop ; fill in stack frame+mov to 8 bytes

        pxor mm0,mm0
        nop ; fill pxor to 4 bytes
        pxor mm1,mm1
        nop ; fill pxor to 4 bytes

        : ; this is aligned to 16 bytes
        movq mm0,[eax]
        pcmpeqw mm0,mm1
        add eax,8
        pmovmskb ecx,mm0
        or ecx,ecx
        jz <

        sub eax,[pString]

        bsf ecx,ecx
        sub eax,8
        add eax,ecx
        shr eax,1
        emms

        RET
    ENDF
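
The length arithmetic at the end of the two lszLen routines (subtract the start pointer, back up 8, add the bsf result, and halve for the wide variant) can be checked with a scalar C model. The names below are invented for illustration and the helpers only imitate pcmpeqb/pcmpeqw plus pmovmskb on an 8-byte chunk; this is a sketch of the arithmetic, not a port of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the lowest set bit, like bsf; mask must be non-zero. */
unsigned bit_scan_forward(unsigned mask)
{
    assert(mask != 0);
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* pcmpeqb + pmovmskb model: bit i set when byte i of the chunk is zero. */
unsigned zero_byte_mask8(const unsigned char *chunk)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (chunk[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* pcmpeqw + pmovmskb model: both bits of a pair set when the word is zero. */
unsigned zero_word_mask8(const unsigned char *chunk)
{
    unsigned mask = 0;
    for (int w = 0; w < 4; w++)
        if (chunk[2 * w] == 0 && chunk[2 * w + 1] == 0)
            mask |= 3u << (2 * w);
    return mask;
}

/* Length in bytes; s must be readable in whole 8-byte chunks. */
size_t len_bytes_model(const unsigned char *s)
{
    size_t offset = 0;
    unsigned mask;
    do {
        mask = zero_byte_mask8(s + offset);
        offset += 8;                      /* add eax,8 happens before the test */
    } while (mask == 0);
    return offset - 8 + bit_scan_forward(mask);
}

/* Length in 16-bit characters of a zero-word-terminated string. */
size_t len_wide_model(const unsigned char *s)
{
    size_t offset = 0;
    unsigned mask;
    do {
        mask = zero_word_mask8(s + offset);
        offset += 8;
    } while (mask == 0);
    return (offset - 8 + bit_scan_forward(mask)) / 2;   /* shr eax,1 */
}
```

In the wide model the bsf result is always even, because a zero word sets two adjacent mask bits starting at an even position, so the final halving loses nothing.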

jj2007
Quote from MichaelW: "Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090."

That is what I get with the ml614 version, see below. With JWasm and ML 9.0, this drops to 471 cycles. I attach the latest version with the two executables.

    szCopyXMM -> jj -> xmm:                 2084 clocks
    szCopyMMX -> Mark Larson -> MMX:         474 clocks
    SzCpy11 -- > Lingo -> MMX -> Fast:       477 clocks
    szCopyMMX1-> Mark Larson -> MMX-> Fast:  438 clocks
    szCopy                                  1053 clocks
    lstrcpy                                 1216 clocks
    SzCpy10 - > Lingo -> MMX:                480 clocks
    MbCopy -> jj -> xmm:                     478 clocks

[attachment deleted by admin]