lstrcpy vs szCopy

Started by jj2007, February 07, 2009, 11:02:33 PM

jj2007

Well, if the subject sounds familiar: I am reviving an old thread.

The algos there are fast, but they use MMX and therefore trash the FPU (I like the FPU). So I thought of adapting one of them, actually an algo by Lingo, to produce an XMM version. And to make it more realistic, I introduced spoilers (a few bytes placed before the string so that the source is badly aligned):

.data
align 16

spoil1 db 1, 2, 3 ; badly aligned source

String1     DB  "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMMNnOoP",\


On my Core 2 Celeron M, differences in timings are there but not dramatic. What was more dramatic was the silent bye-bye when I removed the spoilers...

With the spoilers, the XMM version works just fine and leaves the FPU in peace. When I remove them, then both Lingo's and my adapted algo crash miserably with exception #5 at movq xmm0, qword ptr [ecx+eax]

Anybody interested to have a look into this? I also suspect that my version could be a lot improved...

[attachment deleted by admin]

sinsi

It seems to overwrite the counter used by counter_end - this is always 39383736h, so the counter never gets to 0. I don't get any sort of exception.
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Thanks Sinsi - that makes sense. In the meantime, I made up another xmm version:

comment * based on MMX  Fast by Mark Larson *
align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
   mov eax,[esp+8]
   mov esi,[esp+4]
align 16
qword_copy1b:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [eax]
   pcmpeqb xmm1, xmm0
   add eax, 8+8
   pmovmskb ecx, xmm1
   or ecx,ecx
   jnz finish_rest1
   movups oword ptr [esi], xmm0
   add esi, 8+8
   jmp qword_copy1b
finish_rest1:
ret 8
szCopyXMM endp


512-byte string copy timing results:

szCopyXMM -> jj   ->               xmm: 278 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 312 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 284 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 309 clocks
                                szCopy  1076 clocks
                                lstrcpy 1202 clocks
SzCpy10        - > Lingo ->        MMX: 283 clocks
MbCopy     -> jj   ->              xmm: 384 clocks


Now one problem is that, aligned or not, these algos work in fixed-size chunks (128 bits, i.e. 16 bytes, in the XMM version above). So there are problems with small strings...

qWord

hi,

after a quick look, I think the problem is caused by "test bl,bl" (in your and Lingo's routine) -> There are 4 packed bytes after "packsswb", so you have to test all 4 bytes with "test ebx,ebx".

regards, qWord
FPU in a trice: SmplMath
It's that simple!

jj2007

Thanks, qword - I am afraid it keeps choking. But the other one seems to work just fine, also for small strings and bad alignment. However, it needs a zero delimiter at the end - see mov byte ptr [esi], 0 below

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * based on MMX  Fast by Mark Larson *
align 16 ; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
   mov eax,[esp+8]
   mov esi,[esp+4]
align 16
qword_copy1b:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [eax]
   pcmpeqb xmm1, xmm0
   add eax, 8+8
   pmovmskb ecx, xmm1
   or ecx,ecx
   jnz finish_rest1
   movups oword ptr [esi], xmm0
   add esi, 8+8
   jmp qword_copy1b
finish_rest1:
  mov byte ptr [esi], 0
ret 8
szCopyXMM endp

jj2007

OK, I made a bit of cleanup and am satisfied with this version:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX  Fast" by Mark Larson *
; align 16 ; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
  push esi
  push edi
  mov edi, [esp+4+8]
  mov esi, [esp+8+8]
  push ecx ; preserve another valuable register
@@:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [esi]
   pcmpeqb xmm1, xmm0
   pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
   test ecx,ecx
   jnz @F
   movups oword ptr [edi], xmm0
   add esi, 16
   add edi, 16
   jmp @B
@@:
  .Repeat
    lodsb ; relatively slow
    stosb ; tail cleanup
  .Until al==0
  mov eax, edi ; a stringcat routine might need this one
  pop ecx ; restore ecx
  pop edi
  pop esi
ret 8 ; cleanup
szCopyXMM endp


Testing the 16-byte boundary looks fine:

Source=B23456789012345
  Dest=B23456789012345
Source=C234567890123456
  Dest=C234567890123456
Source=D2345678901234567
  Dest=D2345678901234567
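
For readers who prefer C, here is a minimal sketch of what the loop plus tail cleanup does around the 16-byte boundary. It is only a scalar model, not the routine itself: `zero_mask16` and `sz_copy_model` are hypothetical names, and the mask helper merely imitates the pcmpeqb/pmovmskb pair. Like the assembly, it reads whole 16-byte chunks, so it assumes the source buffer is readable up to the next multiple of 16.

```c
#include <string.h>

/* Scalar stand-in for pcmpeqb + pmovmskb:
   bit i of the result is set when chunk[i] is a zero byte. */
static unsigned zero_mask16(const unsigned char *chunk)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (chunk[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* Hypothetical C model of szCopyXMM. Returns a pointer just past the
   copied terminator, mirroring "mov eax, edi" after the stosb loop. */
static char *sz_copy_model(char *dest, const char *src)
{
    /* main loop: copy full 16-byte chunks that contain no terminator */
    while (zero_mask16((const unsigned char *)src) == 0) {
        memcpy(dest, src, 16);
        src += 16;
        dest += 16;
    }
    /* tail cleanup: byte by byte, terminator included (lodsb/stosb) */
    char c;
    do {
        c = *src++;
        *dest++ = c;
    } while (c != 0);
    return dest;
}
```

With a zero-padded source buffer, the 15-, 16- and 17-character test strings exercise the three cases: terminator in the first chunk, terminator as the first byte of the second chunk, and one character plus terminator left for the tail.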


512-byte string copy timing results (aligned):

len of source string = 512
len of szCopyXMM: 55
szCopyXMM -> jj   ->               xmm: 298 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 312 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
                                szCopy  1053 clocks
                                lstrcpy 1184 clocks
SzCpy10        - > Lingo ->        MMX: 283 clocks
MbCopy     -> jj   ->              xmm: 380 clocks


Three times as fast as szCopy, 55 bytes short, and does not trash the FPU. The only caveat is that your puter should be less than seven years old :green

[attachment deleted by admin]

NightWare

jj, glad to see you play with simd stuff  :bg

but,
1. can you explain to me why pxor xmm1,xmm1 is IN the loop ?
2. for unaligned data, look at lddqu instruction

askm

Who can explain these results ?

512-byte string copy timing results:

len of source string = 512
len of szCopyXMM: 52
szCopyXMM -> jj   ->          xmm: 2085 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 323 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 324 clocks
            szCopy   1556 clocks
            lstrcpy   1573 clocks
SzCpy10        - > Lingo ->        MMX: 285 clocks
MbCopy     -> jj   ->          xmm: 284 clocks

MichaelW

I think the problem might be the processor it's running on. This is what I get on my P3:

len of source string = 512
len of szCopyXMM: 52
szCopyXMM -> jj   ->               xmm: 2090 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 319 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 281 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
                                szCopy  2078 clocks
                                lstrcpy 2384 clocks
SzCpy10        - > Lingo ->        MMX: 285 clocks
MbCopy     -> jj   ->              xmm: 282 clocks


Or not. If I comment out the tail cleanup code then szCopyXMM runs in 11 cycles and the procedure fails the function tests, implying that most or all of the work is being done by the tail cleanup code.

After more tests I think the problem is my processor. On a P3 I think pmovmskb and pcmpeqb are limited to the MMX registers. I don't see any errors when I assemble, but on the first iteration of the loop ECX is always 0FFh, when it should be 0 up to the last loop.

Or not exactly. Assembling the code with ML 6.14, 6.15, and 7.00 I get:

004019AF 660FEFC9               pxor    mm1,mm1
004019B3 0F1006                 movups  xmm0,[esi]
004019B6 660F74C8               pcmpeqb mm1,mm0
004019BA 660FD7C9               pmovmskb cx,mm1


And with 6.15 and 7.00 the code generates an illegal instruction exception somewhere further down (in MbCopy). So there is a problem with the version of ML, but if that were fixed then there would be a problem with the processor not supporting some of the instructions.
eschew obfuscation

jj2007

Quote from: NightWare on February 08, 2009, 03:29:17 AM
jj, glad to see you play with simd stuff  :bg

but,
1. can you explain to me why pxor xmm1,xmm1 is IN the loop ?

Because I shamelessly copied that from Mark's code  :bg

Quote
2. for unaligned data, look at lddqu instruction

Yields the same timings, is 2 bytes longer (55->57 bytes), and decreases the maximum age of your puter.
movdqu and movups produce exactly the same timings. I chose movups below (2 bytes shorter than movdqu), but maybe there are differences by processor type. Anyway, thanks a lot for the hint to lddqu, it made me find movups/movdqu, which both drastically improve the timings for the non-aligned strings:

512-byte string copy timing results:

len of source string = 512
alignment: offset src=4202611, dest=4203173
len of szCopyXMM: 55

szCopyXMM -> jj   ->               xmm: 484 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 474 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 474 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 439 clocks
                                szCopy  1053 clocks
                                lstrcpy 1214 clocks
SzCpy10        - > Lingo ->        MMX: 476 clocks
MbCopy     -> jj   ->              xmm: 560 clocks


There are some that are a few clocks faster, but remember they trash the FPU.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX  Fast" by Mark Larson *
; align 16 ; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
  push esi
  push edi
  mov edi, [esp+4+8]
  mov esi, [esp+8+8]
  push ecx ; preserve another valuable register
  pxor xmm1, xmm1
@@:
  movups xmm0, [esi]
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1 ; Move Byte Mask To Integer - a fantastic instruction!
  test ecx,ecx
  jnz @F
  movups [edi], xmm0
  add esi, 16
  add edi, 16
  jmp @B
@@:
  .Repeat
    lodsb ; relatively slow
    stosb ; tail cleanup
  .Until al==0
  mov eax, edi ; a stringcat routine might need this one
  pop ecx ; restore ecx
  pop edi
  pop esi
ret 8 ; cleanup
szCopyXMM endp
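
Why is it safe to clear xmm1 only once, before the loop? pcmpeqb overwrites xmm1 with the compare result, and the loop only continues when that result has no bit set, which means xmm1 is all zeroes again at the top of the next iteration. The following is a small scalar C model of that invariant, not real SIMD code; `xmm_t`, `pcmpeqb_model`, `pmovmskb_model` and `first_zero_chunk` are names made up for this sketch.

```c
#include <string.h>

/* Hypothetical stand-in for one XMM register: 16 bytes. */
typedef struct { unsigned char b[16]; } xmm_t;

/* pcmpeqb dst, src: each dst byte becomes 0xFF if it equals
   the matching src byte, otherwise 0x00. */
static void pcmpeqb_model(xmm_t *dst, const xmm_t *src)
{
    for (int i = 0; i < 16; i++)
        dst->b[i] = (dst->b[i] == src->b[i]) ? 0xFF : 0x00;
}

/* pmovmskb: gather the top bit of every byte into a 16-bit mask. */
static unsigned pmovmskb_model(const xmm_t *r)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(r->b[i] >> 7) << i;
    return mask;
}

/* Index of the first 16-byte chunk that contains a zero byte.
   xmm1 is cleared once, outside the loop. */
static int first_zero_chunk(const unsigned char *src)
{
    xmm_t xmm0, xmm1;
    memset(&xmm1, 0, sizeof xmm1);            /* pxor xmm1, xmm1 (hoisted) */
    for (int chunk = 0; ; chunk++) {
        memcpy(&xmm0, src + 16 * chunk, 16);  /* movups xmm0, [esi] */
        pcmpeqb_model(&xmm1, &xmm0);          /* pcmpeqb xmm1, xmm0 */
        if (pmovmskb_model(&xmm1) != 0)       /* pmovmskb + test + jnz */
            return chunk;
        /* mask == 0: every byte of xmm1 is 0x00, so it is already
           cleared for the next compare */
    }
}
```

The same reasoning would not hold if the loop could continue with a non-zero mask; here it cannot, so the pxor inside the loop was redundant.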


Finally, as to the "strange" timings: Try to assemble the code with ML 9.0 or with JWasm.

EDIT: Here are the tiny differences between the codes generated by masm 6.14 and the others.
You might google for "size override" optimization 66h

ml v614
004019C0   0FEFC9       pxor mm1, mm1
004019C3   0F1006       movups xmm0, dqword ptr [esi]
004019C6   0F74C8       pcmpeqb mm1, mm0
004019C9   0FD7C9       pmovmskb ecx, mm1
004019CC   85C9         test ecx, ecx
004019CE   75 0B        jne short SzCpy.004019DB
004019D0   0F1107       movups dqword ptr [edi], xmm0

ml v9
004019C8   660FEFC9     pxor xmm1, xmm1
004019CC   0F1006       movups xmm0, dqword ptr [esi]
004019CF   660F74C8     pcmpeqb xmm1, xmm0
004019D3   660FD7C9     pmovmskb ecx, xmm1
004019D7   85C9         test ecx, ecx
004019D9   75 0B        jne short SzCpy.004019E6
004019DB   0F1107       movups dqword ptr [edi], xmm0

JWasm
004019C8   660FEFC9     pxor xmm1, xmm1
004019CC   0F1006       movups xmm0, dqword ptr [esi]
004019CF   660F74C8     pcmpeqb xmm1, xmm0
004019D3   660FD7C9     pmovmskb ecx, xmm1
004019D7   85C9         test ecx, ecx
004019D9   75 0B        jne short SzCpy.004019E6
004019DB   0F1107       movups dqword ptr [edi], xmm0

[attachment deleted by admin]

askm

Must be...
I used the new 'ml' and then the old 'link' and the differences are...

512-byte string copy timing results:

len of source string = 512
len of szCopyXMM: 55
szCopyXMM -> jj   ->               xmm: 358 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 323 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 325 clocks
                                szCopy  1564 clocks
                                lstrcpy 1571 clocks
SzCpy10        - > Lingo ->        MMX: 284 clocks
MbCopy     -> jj   ->              xmm: 476 clocks

jj2007

Quote from: askm on February 08, 2009, 10:39:21 AM
Must be...
I used the new 'ml' and then the old 'link' and the differences are...


Yes, it's the three missing size override 66h bytes. Jwasm works fine, too.
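
To make the "three missing bytes" concrete, here is a small C sketch that compares the opcode bytes from the two listings. The byte values are copied from the ml v614 and ml v9 disassemblies in this thread; the helper name `differs_only_by_66h` and the array names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when `sse2` is exactly `mmx` preceded by one 66h prefix byte. */
static int differs_only_by_66h(const unsigned char *sse2, size_t sse2_len,
                               const unsigned char *mmx, size_t mmx_len)
{
    return sse2_len == mmx_len + 1
        && sse2[0] == 0x66
        && memcmp(sse2 + 1, mmx, mmx_len) == 0;
}

/* Encodings as shown in the listings (ml v9 / JWasm vs. ml v614). */
static const unsigned char pxor_ml9[]       = {0x66, 0x0F, 0xEF, 0xC9};
static const unsigned char pxor_ml614[]     = {0x0F, 0xEF, 0xC9};
static const unsigned char pcmpeqb_ml9[]    = {0x66, 0x0F, 0x74, 0xC8};
static const unsigned char pcmpeqb_ml614[]  = {0x0F, 0x74, 0xC8};
static const unsigned char pmovmskb_ml9[]   = {0x66, 0x0F, 0xD7, 0xC9};
static const unsigned char pmovmskb_ml614[] = {0x0F, 0xD7, 0xC9};

/* movups is encoded identically by both assemblers. */
static const unsigned char movups_load[]    = {0x0F, 0x10, 0x06};
```

Without the prefix the same opcodes operate on the MMX registers, which matches the ml v614 listing: movups loads xmm0, but the compare then works on mm0/mm1, so the main loop never does useful work and the slow tail cleanup copies the whole string.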

MichaelW

Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.
eschew obfuscation

donkey

I was looking through the original thread and noticed a few references to my string functions, but for some reason I never bothered to post any code, probably because I just assumed I had already posted them in other threads. Here are the functions from strings.lib that Mark Larson referred to. They have mostly been dissected and rewritten over the years by people like Mark, who took them and vastly improved them, but for what it's worth...

lszLenMMX/lszLenMMXW
NOTE: These functions require a Pentium 3 or better with SSE instructions
Calculates the length of a string, the string should be aligned.
lszLenMMXW is a Unicode variant.
Parameters:
pString = Pointer to a null terminated string
Returns the length of the supplied string not including the NULL terminator

lszCopyMMX
NOTE: This function requires a Pentium 3 or better with SSE instructions
Copies a zero terminated string using the MMX registers (not preserved)
Parameters:
Dest = Pointer to destination buffer
Source = Pointer to source string
Returns the address of the destination buffer


lszCopyMMX FRAME lpDest,lpSource
uses esi,edi

mov esi,[lpSource]
mov edi,[lpDest]

mov ecx,esi
and ecx,15
rep movsb

nop
pxor mm0,mm0
nop
pxor mm1,mm1
nop

:
movq mm0,[esi]
movq mm2,[esi]
pcmpeqb mm2,mm1
pmovmskb ecx,mm2
or ecx,ecx
jnz >
movq [edi],mm0
add edi, 8
add esi, 8
jmp <
:

emms
; Do the remainder
bsf ecx,ecx
rep movsb
mov [edi],cl

mov eax,edi
sub eax,[lpDest]
   ret
ENDF


lszLenMMX FRAME pString

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes

: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <

sub eax,[pString]

bsf ecx,ecx
sub eax,8
add eax,ecx

emms

   RET

ENDF


lszLenMMXW FRAME pString

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes

: ; this is aligned to 16 bytes
movq mm0,[eax]
pcmpeqw mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <

sub eax,[pString]

bsf ecx,ecx
sub eax,8
add eax,ecx
shr eax,1
emms

   RET

ENDF
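
As a rough illustration of how lszLenMMX arrives at the length, here is a scalar C model of its loop and of the final bsf arithmetic. This is only a sketch under the same assumption as the assembly (the buffer is readable in whole 8-byte chunks); `zero_mask8`, `bsf_model` and `len_model` are hypothetical names and not part of strings.lib.

```c
#include <stddef.h>

/* Scalar stand-in for pcmpeqb + pmovmskb on an 8-byte MMX register:
   bit i is set when p[i] is a zero byte. */
static unsigned zero_mask8(const unsigned char *p)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        if (p[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* bsf: index of the lowest set bit; mask must be non-zero. */
static unsigned bsf_model(unsigned mask)
{
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* Model of lszLenMMX: scan 8 bytes at a time, then correct the
   over-advanced pointer with the bit index of the first zero byte. */
static size_t len_model(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned mask;
    do {
        mask = zero_mask8(p);   /* movq / pcmpeqb / pmovmskb */
        p += 8;                 /* add eax, 8 */
    } while (mask == 0);        /* or ecx, ecx / jz < */
    /* sub eax, [pString] / sub eax, 8 / add eax, ecx (after bsf) */
    return (size_t)(p - (const unsigned char *)s) - 8 + bsf_model(mask);
}
```

For a 13-character string the second chunk yields a mask whose lowest set bit is at index 5, so the result is 16 - 8 + 5 = 13.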
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

jj2007

Quote from: MichaelW on February 08, 2009, 12:21:48 PM
Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.

That is what I get with the ml614 version, see below. With JWasm and ML 9.0, this drops to 471 cycles.

I attach the latest version with the two executables.

szCopyXMM -> jj   ->               xmm: 2084 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 474 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 477 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 438 clocks
                                szCopy  1053 clocks
                                lstrcpy 1216 clocks
SzCpy10        - > Lingo ->        MMX: 480 clocks
MbCopy     -> jj   ->              xmm: 478 clocks

[attachment deleted by admin]