SzCpy vs. lstrcpy

Mark Jones · May 09, 2005, 01:23:45 AM

Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1, AMD XP 1800+

128-byte string copy timing results:

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks

hutch-- · May 09, 2005, 02:22:51 AM

These are the results I got, Mark.

128-byte string copy timing results:

lstrcpy              : 755 clocks

SzCpy1 (movsb/cmp)   : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks

I am a bit busy at the moment, but if you have the time, will you plug this masm32 library module into your test piece to see how it works? It uses an index in the addressing mode and ends up with a lower loop instruction count.

align 4

OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 

comment * -----------------------------------------------
        copied length minus terminator is returned in EAX
        ----------------------------------------------- *
align 16

szCopy proc src:DWORD,dst:DWORD

    push ebp

    mov edx, [esp+8]
    mov ebp, [esp+12]
    xor ecx, ecx        ; ensure no previous content in ECX
    mov eax, -1

  @@:
    add eax, 1
    mov cl, [edx+eax]
    mov [ebp+eax], cl
    test cl, cl
    jnz @B

    pop ebp

    ret 8

szCopy endp

OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef

Box is a 2.8 gig Prescott PIV.

Jimg · May 09, 2005, 03:24:39 AM

Hi Mark-

The Athlon is very sensitive to alignment. I tried some alignment tests. Before the changes, my times were virtually identical to yours. Here are my results with alignment changes.

(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source; I changed them for my tests.)

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks

mnemonic · May 09, 2005, 03:40:59 AM

Hi Mark,

Here are the results from my machine (AMD Athlon 500 MHz):

128-byte string copy timing results:

lstrcpy              : 378 clocks

SzCpy1 (movsb/cmp)   : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks

hutch-- · May 09, 2005, 10:34:45 AM

Here are the times on my Sempron 2.4

128-byte string copy timing results:

lstrcpy              : 398 clocks

SzCpy1 (movsb/cmp)   : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks

roticv · May 09, 2005, 12:37:54 PM

My Celeron:

128-byte string copy timing results:

lstrcpy              : 608 clocks

SzCpy1 (movsb/cmp)   : 1366 clocks
SzCpy2 (mov/cmp/inc) : 644 clocks
SzCpy3 (mov/cmp/add) : 511 clocks
SzCpy4 (mov/test/inc): 575 clocks
SzCpy5 (mov/test/add): 456 clocks

MichaelW · May 09, 2005, 01:45:11 PM

P3-500, including the procedure that Hutch posted:

lstrcpy              : 672 clocks

SzCpy1 (movsb/cmp)   : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy               : 289 clocks

And some strange and not very useful timings for an AMD K5:

lstrcpy              : 531 clocks

SzCpy1 (movsb/cmp)   : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks

lingo · May 09, 2005, 10:23:35 PM

:lol
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure
and my procedure SzCpy7.
It is first because I want to use the zeroes from the
destination buffer:

OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
align 16
db 90h,90h,90h
SzCpy7   proc SzDest:DWORD, SzSource:DWORD
         mov  ecx, [esp+8]                 ; ecx = source
         mov  edx, [esp+4]                 ; edx= destination
         push ebx
         xor  eax, eax
         mov  bl, [ecx]                    ; read byte
 @@:     
         add  [edx+eax], bl                ; I use the fact ->the destination
         lea  eax, [eax+1]                 ;   buffer is zero filled!!! 	
         mov  bl, [ecx+eax]                ; read byte
         jnz  @B                           ; keep going
         pop  ebx
         ret  2*4
SzCpy7   endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef



Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Press any key to continue . . .
 SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
 Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
 Please terminate any high-priority tasks and press ENTER to begin.


 128-byte string copy timing results:

 SzCpy7 (mov/add/lea->Lingo): 172 clocks
 lstrcpy              : 646 clocks

 SzCpy1 (movsb/cmp)   : 1236 clocks
 SzCpy2 (mov/cmp/inc) : 667 clocks
 SzCpy3 (mov/cmp/add) : 532 clocks
 SzCpy4 (mov/test/inc): 602 clocks
 SzCpy5 (mov/test/add): 453 clocks
 SzCpy6 (mov/test/add->Hutch): 545 clocks

 Press ENTER to exit...

Regards,
Lingo

deroko · May 09, 2005, 11:25:10 PM

Athlon64 2800+

128-byte string copy timing results:

SzCpy7 (mov/add/lea->Lingo): 157 clocks
lstrcpy              : 327 clocks

SzCpy1 (movsb/cmp)   : 714 clocks
SzCpy2 (mov/cmp/inc) : 574 clocks
SzCpy3 (mov/cmp/add) : 712 clocks
SzCpy4 (mov/test/inc): 301 clocks
SzCpy5 (mov/test/add): 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks

hutch-- · May 10, 2005, 03:23:07 AM

Here is a quick play. This one is faster on my PIV: 200 ms against 270 ms for the library version of szCopy. The idea was register alternation with the source and destination addresses and a 4 times unroll. The first dropped the times a little and, in conjunction with the unroll, they dropped some more.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

szCopyx proc src:DWORD,dst:DWORD

    push ebx
    push esi
    push edi

    xor ecx, ecx
    mov esi, src
    mov edi, dst
    mov ebx, src
    mov edx, dst

    mov eax, -4

  align 4
  @@:
    add eax, 4
    mov cl, [esi+eax]
    mov [edi+eax], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+1]
    mov [edx+eax+1], cl
    test cl, cl
    jz @F

    mov cl, [esi+eax+2]
    mov [edi+eax+2], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+3]
    mov [edx+eax+3], cl
    test cl, cl
    jnz @B

  @@:

    pop edi
    pop esi
    pop ebx

    ret

szCopyx endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

Mark Jones · May 10, 2005, 02:30:18 PM

Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down at all when the destination string was not all zeroes. Hutch's 4x unroll has been added:

XP SP1/AMD XP 1800+

 128-byte string copy timing results:

 lstrcpy                       : 433 clocks

 SzCpy1 (movsb/cmp)    MJ      : 540 clocks
 SzCpy2 (mov/cmp/inc)  MJ      : 541 clocks
 SzCpy3 (mov/cmp/add)  MJ      : 666 clocks
 SzCpy4 (mov/test/inc) MJ      : 344 clocks
 SzCpy5 (mov/test/add) MJ      : 477 clocks
 SzCopy (mov/test/add) Hutch   : 367 clocks
 SzCpy7 (mov/add/lea)  Lingo   : 148 clocks
 szCopyx (4x unroll)   Hutch   : 300 clocks

 Press ENTER to exit...

AeroASM · May 10, 2005, 04:25:03 PM

The latest one gives me

Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals

Jimg · May 10, 2005, 04:25:55 PM

How about moving a word at a time?

align 16
SzCpy12 proc  uses ebx SzDest:DWORD, SzSource:DWORD

    xor ecx, ecx
    mov eax, -4
    mov ebx, SzSource
    mov edx, SzDest

  align 4
@@:
    add eax, 4
    mov cx, [ebx+eax]       ; reads 2 bytes, so it can read 1 byte past the terminator
    test cl, cl
    jz move1                ; done, go move zero byte
    mov [edx+eax], cx       ; save both
    test ch, ch
    jz @f                   ; all done

    mov cx, [ebx+eax+2]
    test cl, cl
    jz move2
    mov [edx+eax+2], cx
    test ch, ch
    jnz @b
@@:
    ret

move1:
    mov [edx+eax], cl       ; write the terminator to the destination
    ret
move2:
    mov [edx+eax+2], cl     ; write the terminator to the destination
    ret
SzCpy12 endp

Runs in 255 clocks on my Athlon.

P.S.
Did anyone try Lingo's routine? It doesn't give me the correct results; I must be doing something wrong.

Mark_Larson · May 10, 2005, 04:30:28 PM

Mark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
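
To make that concrete, here is a small worked sketch. The buffer names and contents are made up for illustration and are not from the test program. It traces what the ADD in SzCpy7 stores for a two-character string, once into a zero-filled buffer and once into a buffer that already holds spaces. It also shows a second consequence: the JNZ in SzCpy7 tests the flags left by the ADD, so over a non-zero destination the terminator is not detected where you would expect.

```asm
; Hypothetical buffers, for illustration only (not part of the test piece).
.386
.model flat, stdcall

.data
src      db "AB", 0
dstZero  db 3 dup(0)        ; zero-filled, as SzCpy7 assumes
dstUsed  db 3 dup(20h)      ; previously used buffer, holds spaces

; What "add [edx+eax], bl" leaves in each destination, byte by byte:
;
;   index  src   dstZero before -> after    dstUsed before -> after
;     0    41h   00h -> 41h ('A')           20h -> 61h ('a')
;     1    42h   00h -> 42h ('B')           20h -> 62h ('b')
;     2    00h   00h -> 00h, ZF=1, stop     20h -> 20h, ZF=0, loop goes on
;
; With dstUsed the sum at the terminator is 20h, not 0, so JNZ is taken
; and the loop keeps reading and writing past the end of both buffers.

end
```

If the destination happens to hold an old string of the same length, the sums are still wrong but the loop stops at the same place, which may be why the timing looked unchanged.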

Lingo, a couple of points. I haven't tried any of this; it's just stuff that could probably make it faster.

1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles). So timing-wise, if you change your ADD below to a MOV, it should theoretically perform the same. Both MOV and ADD go through the same execution unit (ALU), so it should run at the same speed. If it doesn't, that would be really surprising. That will also get rid of your dependency on having an all-0 destination buffer.

2) "mov bl, [ecx]" is going to cause a partial register stall. Even on the P4 you can still get "false partial register stalls". Change it to a MOVZX or add an "xor ebx,ebx" before the move.

3) LEA is the slowest way to increment a number on a P4. ADD/SUB is the fastest. Also try INC/DEC.

         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
         mov  bl, [ecx+eax]                ; read byte

4) The above code snippet has a stall (read-after-write dependency) on the previous line. Instead, try switching the two lines and using "ecx+eax+1" for the memory address. That should free up a stall.

5) comment below.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h     ; I am clueless why you added 3 NOPs before the procedure. This makes it not aligned. For the P4 that is not a big deal, but for non-P4 that is going to cause a loss in performance.
SzCpy7   proc SzDest:DWORD, SzSource:DWORD

6) Advanced trick: try interleaving two instances of the code that use different register sets to avoid stalls. You just need to free up enough registers. If you'd like me to explain this in more detail, just ask.

I'll post a bit of it. I have not verified this runs faster, but it should help break up some stalls. I also used your original code, so I didn't make any of the changes I suggested above.

;ecx is source
;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
;esi is the same as eax, just a different offset
;all new code left justified
         xor  eax, eax
mov esi,1
         mov  bl, [ecx]                    ; read byte
mov dl,[ecx+1]
@@:     
         add  [edi+eax], bl                ; I use the fact ->the destination
add [edi+esi],dl
         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
lea esi,[esi+1]
         mov  bl, [ecx+eax]                ; read byte
mov dl,[ecx+esi]
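
For comparison, here is a minimal sketch of suggestions 1 to 3 applied to SzCpy7. It has not been timed, and the name SzCpy7m is made up for this example. One detail matters when making the swap: MOV does not modify the flags, so the JNZ can no longer rely on the store and needs an explicit TEST. LEA does not modify the flags either, which is presumably why the original could place it between the ADD and the JNZ. Replacing it with ADD is only safe when the flags are set again afterwards, as they are here.

```asm
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
SzCpy7m  proc SzDest:DWORD, SzSource:DWORD    ; hypothetical name, untimed sketch
         mov   ecx, [esp+8]             ; ecx = source
         mov   edx, [esp+4]             ; edx = destination
         push  ebx
         xor   eax, eax                 ; eax = byte index
 @@:
         movzx ebx, byte ptr [ecx+eax]  ; whole-register load, no partial write to EBX
         mov   [edx+eax], bl            ; plain store, old destination contents do not matter
         add   eax, 1                   ; changes the flags, which TEST sets again below
         test  bl, bl                   ; MOV sets no flags, so test the byte explicitly
         jnz   @B                       ; loop until the terminator has been copied
         pop   ebx
         ret   2*4
SzCpy7m  endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
```

This ends up very close to the mov/test/add variants already timed in this thread, so it gives up whatever SzCpy7 gained from the shorter loop in exchange for working on any destination buffer.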

Petroizki · May 10, 2005, 05:46:19 PM

Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.

The three NOPs make the loop itself aligned to 16.
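
The byte counts behind that can be sketched as follows. This is Lingo's procedure with the instruction sizes added as comments. The encodings shown are the ones MASM would normally be expected to emit for these instructions; that is an assumption, so check a listing file (ml /Fl) to confirm on your build.

```asm
OPTION PROLOGUE:NONE                    ; no generated prologue, so no hidden bytes
OPTION EPILOGUE:NONE
align 16                                ; offset 0 (mod 16)
db 90h,90h,90h                          ; 3 bytes               -> offset 3
SzCpy7   proc SzDest:DWORD, SzSource:DWORD
         mov  ecx, [esp+8]              ; 4 bytes (8B 4C 24 08) -> offset 7
         mov  edx, [esp+4]              ; 4 bytes (8B 54 24 04) -> offset 11
         push ebx                       ; 1 byte  (53)          -> offset 12
         xor  eax, eax                  ; 2 bytes (33 C0)       -> offset 14
         mov  bl, [ecx]                 ; 2 bytes (8A 19)       -> offset 16
 @@:                                    ; loop top lands on a 16-byte boundary
         add  [edx+eax], bl
         lea  eax, [eax+1]
         mov  bl, [ecx+eax]
         jnz  @B
         pop  ebx
         ret  2*4
SzCpy7   endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
```

The 13 bytes of setup code plus the 3 NOPs add up to 16, so the procedure entry is deliberately misaligned in order to align the loop.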
