lstrcpy vs szCopy

Started by jj2007, February 07, 2009, 11:02:33 PM

jj2007

Member
Posts: 5,393
Location: Italy
Logged

lstrcpy vs szCopy

February 07, 2009, 11:02:33 PM

Well, if the subject sounds familiar: I am reviving an old thread.

The algos there are fast, but they are mmx and therefore trash the FPU (I like the FPU). So I thought of adapting one of them, actually an algo by Lingo, to produce an XMM version. And to make it more realistic, I introduced spoilers:

Code Select

.data
align 16

spoil1	db 1, 2, 3		; badly aligned source

String1     DB  "Sample String 01234 56789 ABCDEF AaBbCcDdEeFfGgHhIiJjKkLlMMNnOoP",\

On my Core 2 Celeron M, differences in timings are there but not dramatic. What was more dramatic was the silent bye-bye when I removed the spoilers...

With the spoilers, the XMM version works just fine and leaves the FPU in peace. When I remove them, then both Lingo's and my adapted algo crash miserably with exception #5 at movq xmm0, qword ptr [ecx+eax]

Anybody interested to have a look into this? I also suspect that my version could be a lot improved...

[attachment deleted by admin]

Masm32 Tips, Tricks and Traps

sinsi

Member
Posts: 1,372
RIP Bodie 1999-2011
Location: Adelaide
Logged

Re: lstrcpy vs szCopy

#1

February 07, 2009, 11:54:20 PM

It seems to overwrite the counter used by counter_end - this is always 39383736h, so the counter never gets to 0. I don't get any sort of exception.

Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#2

February 08, 2009, 12:02:56 AM

Thanks Sinsi - that makes sense. In the meantime, I made up another xmm version:

Code Select

comment * based on MMX  Fast by Mark Larson *
align 16	; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
   mov eax,[esp+8]
   mov esi,[esp+4]
align 16
qword_copy1b:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [eax]
   pcmpeqb xmm1, xmm0
   add eax, 8+8
   pmovmskb ecx, xmm1
   or ecx,ecx 
   jnz finish_rest1
   movups oword ptr [esi], xmm0
   add esi, 8+8
   jmp qword_copy1b
finish_rest1:
ret 8
szCopyXMM endp

512-byte string copy timing results:

Code Select

 
szCopyXMM -> jj   ->               xmm: 278 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 312 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 284 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 309 clocks
                                szCopy  1076 clocks
                                lstrcpy 1202 clocks
SzCpy10        - > Lingo ->        MMX: 283 clocks
MbCopy     -> jj   ->              xmm: 384 clocks

Now one problem is that, aligned or not, these algos work in chunks of 128 bits (16 bytes). So there are problems with small strings...

Masm32 Tips, Tricks and Traps

qWord

Member
Posts: 1,401
Logged

Re: lstrcpy vs szCopy

#3

February 08, 2009, 12:24:38 AM

hi,

after a quick look, I think the problem is caused by "test bl,bl" (in your routine and in Lingo's). There are 4 packed bytes after "packsswb", so you have to test all 4 of them with "test ebx,ebx".

regards, qWord

FPU in a trice: SmplMath
It's that simple!

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#4

February 08, 2009, 12:44:06 AM

Thanks, qword - I am afraid it keeps choking. But the other one seems to work just fine, also for small strings and bad alignment. However, it needs a zero delimiter at the end - see mov byte ptr [esi], 0 below

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * based on MMX  Fast by Mark Larson *
align 16	; seems to have little influence, the nop makes it two cycles faster ;-)
nop
szCopyXMM proc dest:DWORD, src:DWORD
   mov eax,[esp+8]
   mov esi,[esp+4]
align 16
qword_copy1b:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [eax] 
   pcmpeqb xmm1, xmm0
   add eax, 8+8
   pmovmskb ecx, xmm1
   or ecx,ecx 
   jnz finish_rest1
   movups oword ptr [esi], xmm0
   add esi, 8+8
   jmp qword_copy1b
finish_rest1:
  mov byte ptr [esi], 0
ret 8
szCopyXMM endp

Masm32 Tips, Tricks and Traps

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#5

February 08, 2009, 01:55:29 AM

OK, I made a bit of cleanup and am satisfied with this version:

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX  Fast" by Mark Larson *
; align 16	; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
  push esi
  push edi
  mov edi, [esp+4+8]
  mov esi, [esp+8+8]
  push ecx	; preserve another valuable register
@@:
   pxor xmm1, xmm1
   movups xmm0, oword ptr [esi] 
   pcmpeqb xmm1, xmm0
   pmovmskb ecx, xmm1	; Move Byte Mask To Integer - a fantastic instruction!
   test ecx,ecx 
   jnz @F
   movups oword ptr [edi], xmm0
   add esi, 16
   add edi, 16
   jmp @B
@@:
  .Repeat
	lodsb	; relatively slow
	stosb	; tail cleanup
  .Until al==0
  mov eax, edi		; a stringcat routine might need this one
  pop ecx		; restore ecx
  pop edi
  pop esi
ret 8		; cleanup
szCopyXMM endp

Testing the 16-byte boundary looks fine:

Code Select

Source=B23456789012345
  Dest=B23456789012345
Source=C234567890123456
  Dest=C234567890123456
Source=D2345678901234567
  Dest=D2345678901234567

512-byte string copy timing results (aligned):

Code Select

 len of source string = 512
 len of szCopyXMM: 55
szCopyXMM -> jj   ->               xmm: 298 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 312 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
                                szCopy  1053 clocks
                                lstrcpy 1184 clocks
SzCpy10        - > Lingo ->        MMX: 283 clocks
MbCopy     -> jj   ->              xmm: 380 clocks

Three times as fast as szCopy, 55 bytes short, and does not trash the FPU. The only caveat is that your puter should be less than seven years old :green
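For readers who think in C rather than MASM, the loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name sz_copy_xmm_sketch is made up, `_mm_loadu_si128` usually compiles to movdqu rather than movups, and the caller is assumed to guarantee that 16 bytes are readable at every chunk start (the assembly version has the same over-read past the terminator). On targets without SSE2 the sketch falls back to the plain byte loop.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical C counterpart of szCopyXMM; name and signature are
   illustrative only. Like the assembly, it reads 16 bytes at a time,
   so the source buffer is assumed to have at least 15 spare readable
   bytes after the terminator. */
char *sz_copy_xmm_sketch(char *dst, const char *src)
{
    char c;
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* unaligned 16-byte load (movups/movdqu in the thread) */
        __m128i chunk = _mm_loadu_si128((const __m128i *)src);
        /* pcmpeqb + pmovmskb: one mask bit per zero byte */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            break;              /* the terminator is inside this chunk */
        _mm_storeu_si128((__m128i *)dst, chunk);
        src += 16;
        dst += 16;
    }
#endif
    /* Tail cleanup, byte by byte, terminator included (lodsb/stosb). */
    do {
        c = *src++;
        *dst++ = c;
    } while (c != 0);
    return dst;                 /* one past the copied terminator */
}
```

As in the assembly, the return value points just past the copied terminator, which is what the "mov eax, edi" line hands back.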

[attachment deleted by admin]

Masm32 Tips, Tricks and Traps

NightWare

Member
Posts: 321
when dream comes true
Logged

Re: lstrcpy vs szCopy

#6

February 08, 2009, 03:29:17 AM

jj, glad to see you play with simd stuff :bg

but,
1. can you explain to me why pxor xmm1,xmm1 is IN the loop ?
2. for unaligned data, look at lddqu instruction

askm

Member
Posts: 78
Logged

Re: lstrcpy vs szCopy

#7

February 08, 2009, 05:09:49 AM

Who can explain these results ?

512-byte string copy timing results:

len of source string = 512
len of szCopyXMM: 52
szCopyXMM -> jj ->         xmm: 2085 clocks
szCopyMMX -> Mark Larson -> MMX: 323 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 324 clocks
            szCopy   1556 clocks
            lstrcpy   1573 clocks
SzCpy10 - > Lingo -> MMX: 285 clocks
MbCopy -> jj ->         xmm: 284 clocks

MichaelW

Global Moderator
Member
Posts: 4,553
Logged

Re: lstrcpy vs szCopy

#8

February 08, 2009, 06:52:19 AM Last Edit: February 08, 2009, 09:56:53 AM by MichaelW

I think the problem might be the processor it's running on. This is what I get on my P3:

Code Select


 len of source string = 512
 len of szCopyXMM: 52
szCopyXMM -> jj   ->               xmm: 2090 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 319 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 281 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 308 clocks
                                szCopy  2078 clocks
                                lstrcpy 2384 clocks
SzCpy10        - > Lingo ->        MMX: 285 clocks
MbCopy     -> jj   ->              xmm: 282 clocks

Or not. If I comment out the tail cleanup code then szCopyXMM runs in 11 cycles and the procedure fails the function tests, implying that most or all of the work is being done by the tail cleanup code.

After more tests I think the problem is my processor. On a P3 I think pmovmskb and pcmpeqb are limited to the MMX registers. I don't see any errors when I assemble, but on the first iteration of the loop ECX is always 0FFh, when it should be 0 up to the last loop.

Or not exactly. Assembling the code with ML 6.14, 6.15, and 7.00 I get:

Code Select


004019AF 660FEFC9               pxor    mm1,mm1
004019B3 0F1006                 movups  xmm0,[esi]
004019B6 660F74C8               pcmpeqb mm1,mm0
004019BA 660FD7C9               pmovmskb cx,mm1

And with 6.15 and 7.00 the code generates an illegal instruction exception somewhere further down (in MbCopy). So there is a problem with the version of ML, but if that were fixed then there would be a problem with the processor not supporting some of the instructions.

eschew obfuscation

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#9

February 08, 2009, 09:02:40 AM

Quote from: NightWare on February 08, 2009, 03:29:17 AM
jj, glad to see you play with simd stuff :bg

but,
1. can you explain to me why pxor xmm1,xmm1 is IN the loop ?

Because I shamelessly copied that from Mark's code :bg

Quote
2. for unaligned data, look at lddqu instruction

Yields the same timings, is 2 bytes longer (55->57 bytes), and decreases the maximum age of your puter.
movdqu and movups produce exactly the same timings. I chose movups below (2 bytes shorter than movdqu), but maybe there are differences by processor type. Anyway, thanks a lot for the hint about lddqu: it made me find movups/movdqu, both of which drastically improve the timings for the non-aligned strings:

512-byte string copy timing results:

Code Select

 len of source string = 512
 alignment: offset src=4202611, dest=4203173
 len of szCopyXMM: 55

szCopyXMM -> jj   ->               xmm: 484 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 474 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 474 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 439 clocks
                                szCopy  1053 clocks
                                lstrcpy 1214 clocks
SzCpy10        - > Lingo ->        MMX: 476 clocks
MbCopy     -> jj   ->              xmm: 560 clocks

There are some that are a few clocks faster, but remember they trash the FPU.

Code Select

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
comment * inspired by "MMX  Fast" by Mark Larson *
; align 16	; seems to have NO influence
szCopyXMM proc dest:DWORD, src:DWORD
  push esi
  push edi
  mov edi, [esp+4+8]
  mov esi, [esp+8+8]
  push ecx	; preserve another valuable register
  pxor xmm1, xmm1
@@:
  movups xmm0, [esi] 
  pcmpeqb xmm1, xmm0
  pmovmskb ecx, xmm1	; Move Byte Mask To Integer - a fantastic instruction!
  test ecx,ecx 
  jnz @F
  movups [edi], xmm0
  add esi, 16
  add edi, 16
  jmp @B
@@:
  .Repeat
	lodsb	; relatively slow
	stosb	; tail cleanup
  .Until al==0
  mov eax, edi		; a stringcat routine might need this one
  pop ecx		; restore ecx
  pop edi
  pop esi
ret 8		; cleanup
szCopyXMM endp
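Why is it safe to move pxor out of the loop? pcmpeqb overwrites xmm1 with the compare result, but as long as the chunk contains no zero byte that result is itself all zeros, and the loop only continues in exactly that case. A small scalar model in C may make this easier to see; the helper names below are invented for the illustration and are not part of any library.

```c
#include <string.h>

/* Scalar model of "pcmpeqb dst, src" on a 16-byte register:
   each byte of dst becomes 0xFF where dst and src match, else 0x00. */
void pcmpeqb_model(unsigned char dst[16], const unsigned char src[16])
{
    int i;
    for (i = 0; i < 16; i++)
        dst[i] = (dst[i] == src[i]) ? 0xFF : 0x00;
}

/* Scalar model of "pmovmskb": collect the top bit of every byte. */
unsigned pmovmskb_model(const unsigned char reg[16])
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 16; i++)
        mask |= (unsigned)(reg[i] >> 7) << i;
    return mask;
}

/* Returns 1 if reg is still all zero, i.e. usable as the zero operand
   of the next compare without another pxor. */
int is_all_zero(const unsigned char reg[16])
{
    static const unsigned char zero[16] = {0};
    return memcmp(reg, zero, 16) == 0;
}
```

With a zeroed register and a chunk without a terminator, the model leaves the register zeroed and the mask at 0. Once a chunk contains a zero byte the register is no longer clean, but by then the loop has been left.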

Finally, as to the "strange" timings: Try to assemble the code with ML 9.0 or with JWasm.

EDIT: Here are the tiny differences between the codes generated by masm 6.14 and the others.
You might google for "size override" optimization 66h

ml v614
004019C0  0FEFC9    pxor mm1, mm1
004019C3  0F1006    movups xmm0, dqword ptr [esi]
004019C6  0F74C8    pcmpeqb mm1, mm0
004019C9  0FD7C9    pmovmskb ecx, mm1
004019CC  85C9      test ecx, ecx
004019CE  750B      jne short SzCpy.004019DB
004019D0  0F1107    movups dqword ptr [edi], xmm0

ml v9
004019C8  660FEFC9  pxor xmm1, xmm1
004019CC  0F1006    movups xmm0, dqword ptr [esi]
004019CF  660F74C8  pcmpeqb xmm1, xmm0
004019D3  660FD7C9  pmovmskb ecx, xmm1
004019D7  85C9      test ecx, ecx
004019D9  750B      jne short SzCpy.004019E6
004019DB  0F1107    movups dqword ptr [edi], xmm0

JWasm
004019C8  660FEFC9  pxor xmm1, xmm1
004019CC  0F1006    movups xmm0, dqword ptr [esi]
004019CF  660F74C8  pcmpeqb xmm1, xmm0
004019D3  660FD7C9  pmovmskb ecx, xmm1
004019D7  85C9      test ecx, ecx
004019D9  750B      jne short SzCpy.004019E6
004019DB  0F1107    movups dqword ptr [edi], xmm0

[attachment deleted by admin]

Masm32 Tips, Tricks and Traps

askm

Member
Posts: 78
Logged

Re: lstrcpy vs szCopy

#10

February 08, 2009, 10:39:21 AM

Must be...
I used the new 'ml' and then the old 'link' and the differences are...

512-byte string copy timing results:

len of source string = 512
len of szCopyXMM: 55
szCopyXMM -> jj -> xmm: 358 clocks
szCopyMMX -> Mark Larson -> MMX: 323 clocks
SzCpy11 -- > Lingo -> MMX -> Fast: 285 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 325 clocks
szCopy 1564 clocks
lstrcpy 1571 clocks
SzCpy10 - > Lingo -> MMX: 284 clocks
MbCopy -> jj -> xmm: 476 clocks

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#11

February 08, 2009, 11:29:24 AM

Quote from: askm on February 08, 2009, 10:39:21 AM
Must be...
I used the new 'ml' and then the old 'link' and the differences are...

Yes, it's the three missing size override 66h bytes. Jwasm works fine, too.
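To make the "three missing bytes" concrete, here is a small C sketch that compares the encodings taken from the listings earlier in this thread. The helper names are invented for the illustration. It only checks that the ml 9.0 encoding equals the ml 6.14 encoding with a leading 66h, which is the prefix that selects the XMM form of these opcodes.

```c
#include <stddef.h>

/* Encodings copied from the disassembly listings in this thread. */
static const unsigned char pxor_ml614[]     = { 0x0F, 0xEF, 0xC9 };       /* pxor mm1, mm1      */
static const unsigned char pxor_ml9[]       = { 0x66, 0x0F, 0xEF, 0xC9 }; /* pxor xmm1, xmm1    */
static const unsigned char pcmpeqb_ml614[]  = { 0x0F, 0x74, 0xC8 };       /* pcmpeqb mm1, mm0   */
static const unsigned char pcmpeqb_ml9[]    = { 0x66, 0x0F, 0x74, 0xC8 }; /* pcmpeqb xmm1, xmm0 */
static const unsigned char pmovmskb_ml614[] = { 0x0F, 0xD7, 0xC9 };       /* pmovmskb ecx, mm1  */
static const unsigned char pmovmskb_ml9[]   = { 0x66, 0x0F, 0xD7, 0xC9 }; /* pmovmskb ecx, xmm1 */
static const unsigned char movups_ml614[]   = { 0x0F, 0x10, 0x06 };       /* movups xmm0, [esi] */
static const unsigned char movups_ml9[]     = { 0x0F, 0x10, 0x06 };       /* same in both       */

/* Returns 1 if `with` is exactly a 66h byte followed by `without`. */
int differs_only_by_66h(const unsigned char *without, size_t n_without,
                        const unsigned char *with, size_t n_with)
{
    size_t i;
    if (n_with != n_without + 1 || with[0] != 0x66)
        return 0;
    for (i = 0; i < n_without; i++)
        if (with[i + 1] != without[i])
            return 0;
    return 1;
}

/* Counts how many of the four listed instructions lack the prefix in ml 6.14. */
int count_missing_prefixes(void)
{
    return differs_only_by_66h(pxor_ml614, sizeof pxor_ml614, pxor_ml9, sizeof pxor_ml9)
         + differs_only_by_66h(pcmpeqb_ml614, sizeof pcmpeqb_ml614, pcmpeqb_ml9, sizeof pcmpeqb_ml9)
         + differs_only_by_66h(pmovmskb_ml614, sizeof pmovmskb_ml614, pmovmskb_ml9, sizeof pmovmskb_ml9)
         + differs_only_by_66h(movups_ml614, sizeof movups_ml614, movups_ml9, sizeof movups_ml9);
}
```

movups is encoded identically by both assemblers, so only pxor, pcmpeqb and pmovmskb are affected.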

Masm32 Tips, Tricks and Traps

MichaelW

Global Moderator
Member
Posts: 4,553
Logged

Re: lstrcpy vs szCopy

#12

February 08, 2009, 12:21:48 PM

Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.

eschew obfuscation

donkey

Member
Posts: 2,866
ASS-embler
Location: Calgary, Alberta, Canada
Logged

Re: lstrcpy vs szCopy

#13

February 08, 2009, 12:34:56 PM

I was looking through the original thread and noticed a few references to my strings functions. For some reason I never bothered to post any code, probably because I assumed I had already posted them in other threads. Here are the functions from strings.lib that Mark Larson referred to. They have mostly been dissected and rewritten over the years by people like Mark, who took them and vastly improved them, but for what it's worth...

Code Select

lszLenMMX/lszLenMMXW
	NOTE: These functions require a Pentium 3 or better with SSE instructions
	Calculates the length of a string, the string should be aligned.
	lszLenMMXW is a Unicode variant.
	Parameters:
		pString = Pointer to a null terminated string
	Returns the length of the supplied string not including the NULL terminator

lszCopyMMX
	NOTE: This function requires a Pentium 3 or better with SSE instructions
	Copies a zero terminated string using the MMX registers (not preserved)
	Parameters:
		Dest = Pointer to destination buffer
		Source = Pointer to source string
	Returns the address of the destination buffer

Code Select

lszCopyMMX FRAME lpDest,lpSource
	uses esi,edi

	mov esi,[lpSource]
	mov edi,[lpDest]

	mov ecx,esi
	and ecx,15
	rep movsb

	nop
	pxor mm0,mm0
	nop
	pxor mm1,mm1
	nop

	:
		movq mm0,[esi]
		movq mm2,[esi]
		pcmpeqb mm2,mm1
		pmovmskb ecx,mm2
		or ecx,ecx
		jnz >
		movq [edi],mm0
		add edi, 8
		add esi, 8
	jmp <
	:

	emms
	; Do the remainder
	bsf ecx,ecx
	rep movsb
	mov [edi],cl
	
	mov eax,edi
	sub eax,[lpDest]
   ret
ENDF
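The "Do the remainder" step in lszCopyMMX avoids a byte-by-byte test: bsf turns the compare mask into the number of bytes in front of the terminator, rep movsb copies exactly that many, and the terminator is written separately. A rough scalar sketch in C, with invented helper names and a portable loop standing in for bsf:

```c
#include <stddef.h>
#include <string.h>

/* Scalar stand-in for "pcmpeqb mm2, mm1" + "pmovmskb ecx, mm2" with mm1 = 0:
   bit i of the result is set when byte i of the 8-byte chunk is zero. */
unsigned zero_mask8(const unsigned char *chunk)
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 8; i++)
        if (chunk[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* Portable stand-in for "bsf": index of the lowest set bit.
   The caller must pass a nonzero mask. */
unsigned lowest_set_bit(unsigned mask)
{
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Copies the bytes in front of the terminator in the final chunk, then
   writes the terminator. Returns a pointer to the terminator in dst. */
unsigned char *copy_remainder(unsigned char *dst, const unsigned char *src,
                              unsigned mask)
{
    unsigned n = lowest_set_bit(mask);  /* bsf ecx, ecx  */
    memcpy(dst, src, n);                /* rep movsb     */
    dst[n] = 0;                         /* mov [edi], cl */
    return dst + n;
}
```

Only the lowest set bit matters; any zero bytes further along in the chunk are ignored.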

Code Select

lszLenMMX FRAME pString

	mov eax,[pString]
	nop
	nop ; fill in stack frame+mov to 8 bytes

	pxor mm0,mm0
	nop ; fill pxor to 4 bytes
	pxor mm1,mm1
	nop ; fill pxor to 4 bytes

	: ; this is aligned to 16 bytes
	movq mm0,[eax]
	pcmpeqb mm0,mm1
	add eax,8
	pmovmskb ecx,mm0
	or ecx,ecx
	jz <

	sub eax,[pString]

	bsf ecx,ecx
	sub eax,8
	add eax,ecx

	emms

   RET

ENDF

Code Select

lszLenMMXW FRAME pString

	mov eax,[pString]
	nop
	nop ; fill in stack frame+mov to 8 bytes

	pxor mm0,mm0
	nop ; fill pxor to 4 bytes
	pxor mm1,mm1
	nop ; fill pxor to 4 bytes

	: ; this is aligned to 16 bytes
	movq mm0,[eax]
	pcmpeqw mm0,mm1
	add eax,8
	pmovmskb ecx,mm0
	or ecx,ecx
	jz <

	sub eax,[pString]

	bsf ecx,ecx
	sub eax,8
	add eax,ecx
	shr eax,1
	emms

   RET

ENDF
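The two length routines differ only in the compare width and the final shift. A scalar C sketch of both follows, with invented names. It assumes the buffer is readable in whole 8-byte chunks up to and including the chunk that holds the terminator, which mirrors the "string should be aligned" note above.

```c
#include <stddef.h>

/* pcmpeqb + pmovmskb on 8 bytes: bit i set when byte i is zero. */
unsigned zero_byte_mask8(const unsigned char *p)
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 8; i++)
        if (p[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* pcmpeqw + pmovmskb on 8 bytes: both bits of a 16-bit word are set
   when the whole word is zero. */
unsigned zero_word_mask8(const unsigned char *p)
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 8; i += 2)
        if (p[i] == 0 && p[i + 1] == 0)
            mask |= 3u << i;
    return mask;
}

/* Portable stand-in for bsf; mask must be nonzero. */
unsigned lowest_set_bit(unsigned mask)
{
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Byte string length, in the spirit of lszLenMMX. */
size_t len_bytes_sketch(const unsigned char *s)
{
    const unsigned char *p = s;
    unsigned mask;
    while ((mask = zero_byte_mask8(p)) == 0)
        p += 8;
    return (size_t)(p - s) + lowest_set_bit(mask);
}

/* Length in 16-bit characters, in the spirit of lszLenMMXW. */
size_t len_words_sketch(const unsigned char *s)
{
    const unsigned char *p = s;
    unsigned mask;
    while ((mask = zero_word_mask8(p)) == 0)
        p += 8;
    /* byte offset of the first zero word, halved (shr eax, 1) */
    return ((size_t)(p - s) + lowest_set_bit(mask)) / 2;
}
```

Because the word compare sets mask bits in pairs, the lowest set bit is always at an even byte offset, so the division by two is exact.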

"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: lstrcpy vs szCopy

#14

February 08, 2009, 01:42:11 PM

Quote from: MichaelW on February 08, 2009, 12:21:48 PM
Manually encoding the three operand size prefixes does not improve the cycle count on my P3, it stays at about 2090.

That is what I get with the ml614 version, see below. With JWasm and ML 9.0, this drops to 471 cycles.

I attach the latest version with the two executables.

Code Select

szCopyXMM -> jj   ->               xmm: 2084 clocks
szCopyMMX   -> Mark Larson   ->    MMX: 474 clocks
SzCpy11  -- > Lingo ->  MMX   ->  Fast: 477 clocks
szCopyMMX1-> Mark Larson -> MMX-> Fast: 438 clocks
                                szCopy  1053 clocks
                                lstrcpy 1216 clocks
SzCpy10        - > Lingo ->        MMX: 480 clocks
MbCopy     -> jj   ->              xmm: 478 clocks

[attachment deleted by admin]

Masm32 Tips, Tricks and Traps
