SzCpy vs. lstrcpy

Started by Mark Jones, May 09, 2005, 01:23:45 AM


Mark Jones

Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1, AMD XP 1800+


128-byte string copy timing results:

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks

[attachment deleted by admin]
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

hutch--

These are the results I got, Mark.


128-byte string copy timing results:

lstrcpy              : 755 clocks

SzCpy1 (movsb/cmp)   : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks


I am a bit busy at the moment, but if you have the time, would you plug this masm32 library module into your test piece to see how it performs? It uses an index in the addressing mode and ends up with a lower loop instruction count.


align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

comment * -----------------------------------------------
        copied length minus terminator is returned in EAX
        ----------------------------------------------- *
align 16

szCopy proc src:DWORD,dst:DWORD

    push ebp

    mov edx, [esp+8]
    mov ebp, [esp+12]
    xor ecx, ecx        ; ensure no previous content in ECX
    mov eax, -1

  @@:
    add eax, 1
    mov cl, [edx+eax]
    mov [ebp+eax], cl
    test cl, cl
    jnz @B

    pop ebp

    ret 8

szCopy endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef


Box is a 2.8 gig Prescott PIV.
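
For readers who want to check the logic without an assembler, here is a minimal C sketch of the same indexed loop. The name `sz_copy_model` is made up for illustration and is not part of the masm32 library. It models only the control flow and the return value, not the timing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of the indexed copy loop: one index addresses both buffers.
   Returns the copied length, not counting the terminator. */
static size_t sz_copy_model(const char *src, char *dst)
{
    size_t i = (size_t)-1;      /* mirrors "mov eax, -1" */
    char c;
    do {
        i += 1;                 /* add eax, 1 (wraps to 0 on the first pass) */
        c = src[i];             /* mov cl, [edx+eax] */
        dst[i] = c;             /* mov [ebp+eax], cl */
    } while (c != 0);           /* test cl, cl / jnz @B */
    return i;
}
```

Copying "hello" with this sketch should return 5 and leave a terminated copy in the destination; an empty source returns 0.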
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

Jimg

Hi Mark-

The Athlon is very sensitive to alignment.  I tried some alignment tests.  Before the changes, my times were virtually identical to yours.  Here are my results with alignment changes.

(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source. I changed them for my tests.)

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks



[attachment deleted by admin]

mnemonic

Hi Mark,

Here are the results from my machine (AMD Athlon 500 MHz):

128-byte string copy timing results:

lstrcpy              : 378 clocks

SzCpy1 (movsb/cmp)   : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks
Be kind. Everyone you meet is fighting a hard battle.--Plato
-------
How To Ask Questions The Smart Way

hutch--

Here are the times on my Sempron 2.4


128-byte string copy timing results:

lstrcpy              : 398 clocks

SzCpy1 (movsb/cmp)   : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks

roticv

My Celeron:



128-byte string copy timing results:

lstrcpy              : 608 clocks

SzCpy1 (movsb/cmp)   : 1366 clocks
SzCpy2 (mov/cmp/inc) : 644 clocks
SzCpy3 (mov/cmp/add) : 511 clocks
SzCpy4 (mov/test/inc): 575 clocks
SzCpy5 (mov/test/add): 456 clocks

MichaelW

P3-500, including the procedure that Hutch posted:

lstrcpy              : 672 clocks

SzCpy1 (movsb/cmp)   : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy               : 289 clocks


And some strange and not very useful timings for an AMD K5:

lstrcpy              : 531 clocks

SzCpy1 (movsb/cmp)   : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks



eschew obfuscation

lingo

 :lol
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure
and my procedure SzCpy7.
It runs first because I want to use the zeroes in the
destination buffer:


OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy7   proc SzDest:DWORD, SzSource:DWORD
         mov  ecx, [esp+8]                 ; ecx = source
         mov  edx, [esp+4]                 ; edx= destination
         push ebx
         xor  eax, eax
         mov  bl, [ecx]                    ; read byte
@@:     
         add  [edx+eax], bl                ; I use the fact ->the destination
         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
         mov  bl, [ecx+eax]                ; read byte
         jnz  @B                           ; keep going
         pop  ebx
         ret  2*4
SzCpy7   endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef



Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Press any key to continue . . .
SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.


128-byte string copy timing results:

SzCpy7 (mov/add/lea->Lingo): 172 clocks
lstrcpy              : 646 clocks

SzCpy1 (movsb/cmp)   : 1236 clocks
SzCpy2 (mov/cmp/inc) : 667 clocks
SzCpy3 (mov/cmp/add) : 532 clocks
SzCpy4 (mov/test/inc): 602 clocks
SzCpy5 (mov/test/add): 453 clocks
SzCpy6 (mov/test/add->Hutch): 545 clocks

Press ENTER to exit...


Regards,
Lingo

[attachment deleted by admin]

deroko

Athlon64 2800+

128-byte string copy timing results:

SzCpy7 (mov/add/lea->Lingo): 157 clocks
lstrcpy              : 327 clocks

SzCpy1 (movsb/cmp)   : 714 clocks
SzCpy2 (mov/cmp/inc) : 574 clocks
SzCpy3 (mov/cmp/add) : 712 clocks
SzCpy4 (mov/test/inc): 301 clocks
SzCpy5 (mov/test/add): 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks


hutch--

Here is a quick play. This one is faster on my PIV than the library version of szCopy, by a ratio of 270 to 200 ms. The idea was register alternation with the source and destination addresses and a 4 times unroll. The first dropped the times a little and, in conjunction with the unroll, they dropped some more.


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

szCopyx proc src:DWORD,dst:DWORD

    push ebx
    push esi
    push edi

    xor ecx, ecx
    mov esi, src
    mov edi, dst
    mov ebx, src
    mov edx, dst

    mov eax, -4

  align 4
  @@:
    add eax, 4
    mov cl, [esi+eax]
    mov [edi+eax], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+1]
    mov [edx+eax+1], cl
    test cl, cl
    jz @F

    mov cl, [esi+eax+2]
    mov [edi+eax+2], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+3]
    mov [edx+eax+3], cl
    test cl, cl
    jnz @B

  @@:

    pop edi
    pop esi
    pop ebx

    ret

szCopyx endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
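
A rough C sketch of the 4x unroll, for anyone who wants to step through it in a debugger. The name `sz_copy_unrolled4` is hypothetical, and returning the length is an addition for testing; the assembly version above does not promise a meaningful value in EAX. Register alternation has no C equivalent, so only the unroll is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of a 4x unrolled byte copy: four copy-and-test steps
   per pass, with the index advanced once per pass. */
static size_t sz_copy_unrolled4(const char *src, char *dst)
{
    size_t i = 0;
    for (;;) {
        if ((dst[i]     = src[i])     == 0) return i;
        if ((dst[i + 1] = src[i + 1]) == 0) return i + 1;
        if ((dst[i + 2] = src[i + 2]) == 0) return i + 2;
        if ((dst[i + 3] = src[i + 3]) == 0) return i + 3;
        i += 4;                 /* one index update per four bytes */
    }
}
```

Each byte is tested before the next one is read, so the sketch never reads past the terminator.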

Mark Jones

Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down any when the destination string was not all zeroes. Added is Hutch's 4x unroll:

XP SP1/AMD XP 1800+

128-byte string copy timing results:

lstrcpy                       : 433 clocks

SzCpy1 (movsb/cmp)    MJ      : 540 clocks
SzCpy2 (mov/cmp/inc)  MJ      : 541 clocks
SzCpy3 (mov/cmp/add)  MJ      : 666 clocks
SzCpy4 (mov/test/inc) MJ      : 344 clocks
SzCpy5 (mov/test/add) MJ      : 477 clocks
SzCopy (mov/test/add) Hutch   : 367 clocks
SzCpy7 (mov/add/lea)  Lingo   : 148 clocks
szCopyx (4x unroll)   Hutch   : 300 clocks

Press ENTER to exit...

[attachment deleted by admin]

AeroASM

The latest one gives me

Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals

Jimg

How about moving a word at a time?

align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD

    xor ecx, ecx
    mov eax, -4
    mov ebx, SzSource
    mov edx, SzDest

  align 4
@@:
    add eax, 4
    mov cx, [ebx+eax]       ; read two source bytes
    test cl, cl
    jz move1                ; done, go move zero byte
    mov [edx+eax], cx       ; save both
    test ch, ch
    jz @F                   ; all done

    mov cx, [ebx+eax+2]
    test cl, cl
    jz move2
    mov [edx+eax+2], cx
    test ch, ch
    jnz @B
@@:
    ret

move1:
    mov [edx+eax], cl       ; write the terminator to the destination
    ret
move2:
    mov [edx+eax+2], cl     ; write the terminator to the destination
    ret
SzCpy12 endp


runs in 255 on my Athlon
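
Here is a small C sketch of the word-at-a-time idea, not a line-by-line translation. The name `sz_copy_words` is hypothetical. One caveat it makes visible: the two-byte read can touch one byte past the terminator, so the sketch assumes the source buffer has at least one readable byte after the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of copying two bytes per step.
   Assumption: src has one readable byte after its terminator. */
static void sz_copy_words(const char *src, char *dst)
{
    size_t i = 0;
    for (;;) {
        unsigned char pair[2];
        memcpy(pair, src + i, 2);      /* may read one byte past the terminator */
        if (pair[0] == 0) {            /* string ended on an even offset */
            dst[i] = 0;                /* store only the terminator */
            return;
        }
        memcpy(dst + i, pair, 2);      /* store both bytes */
        if (pair[1] == 0)              /* second byte was the terminator */
            return;
        i += 2;
    }
}
```

When the first byte of a pair is the terminator, only one byte is stored, so nothing is written past the end of the copied string.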


p.s.
Did anyone try Lingo's routine? It doesn't give me the correct results; I must be doing something wrong.

Mark_Larson

  Mark, you missed the trick that Lingo was trying to do.  If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results.  He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
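
To make the point concrete, here is a minimal C model of the add-in-place loop. The name `sz_copy_add` is made up for illustration. The sketch assumes ASCII and a source buffer with a readable byte after the terminator, because the loop (like the assembly) loads the next byte before it tests for the end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C model of the ADD-based copy: each source byte is added to the
   destination byte, and the loop stops when the sum is zero. */
static size_t sz_copy_add(const char *src, char *dst)
{
    size_t i = 0;
    unsigned char b = (unsigned char)src[0];           /* mov bl, [ecx] */
    unsigned char sum;
    do {
        sum = (unsigned char)((unsigned char)dst[i] + b);
        dst[i] = (char)sum;                            /* add [edx+eax], bl */
        i += 1;                                        /* lea eax, [eax+1] */
        b = (unsigned char)src[i];                     /* mov bl, [ecx+eax] */
    } while (sum != 0);                                /* jnz @B, ZF from the ADD */
    return i;                                          /* length + 1 */
}
```

With a zero-filled destination, copying "AB" gives "AB". With a destination that starts with the bytes 1, 1, 0, the same call leaves "BC", because each stored byte is the sum of the old destination byte and the source byte. If the sums never reach zero at the terminator, the loop keeps running past the end of the string.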


  Lingo, a couple of points.  I haven't tried any of this; it's just stuff that could probably make it faster.

1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles).  So timing-wise, if you change your ADD below to a MOV, it should theoretically perform the same.  Both MOV and ADD go through the same execution unit (ALU), so they should run at the same speed.  If they don't, that would be really surprising.  That will also get rid of your dependency on having an all-zero destination buffer.  (MOV does not set the flags, so the loop would then need a TEST before the JNZ.)

2) "mov  bl, [ecx]" is going to cause a partial register stall.  Even on the P4 you can still get "false partial register stalls".  Change it to a MOVZX or add an "xor ebx,ebx" before the move.

3) LEA is the slowest way to increment a number on a P4.  ADD/SUB is the fastest.  Also try INC/DEC.

         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
         mov  bl, [ecx+eax]                ; read byte


4) The above code snippet has a stall (a read-after-write dependency) on the previous line.  Instead, try switching the two lines and using "ecx+eax+1" for the memory address.  That should remove the stall.


5) comment below.

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h     ; <- I am clueless why you added 3 NOPs before the procedure.  This makes it not aligned.  For the P4 that is not a big deal, but on a non-P4 that is going to cause a loss in performance.
SzCpy7   proc SzDest:DWORD, SzSource:DWORD


6) Advanced trick: try interleaving two instances of the code that use different register sets to avoid stalls.  You just need to free up enough registers.  If you'd like me to explain this in more detail, just ask.

I'll post a bit of it.  I have not verified this runs faster, but it should help break up some stalls.  I also used your original code, so I didn't make any of the changes I suggested above.


;ecx is source
;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
;esi is the same as eax, just a different offset
;all new code left justified
         xor  eax, eax
mov esi,1
         mov  bl, [ecx]                    ; read byte
mov dl,[ecx+1]
@@:     
         add  [edi+eax], bl                ; I use the fact ->the destination
add [edi+esi],dl
         lea  eax, [eax+2]                 ;   buffer is zero filled!!!
lea esi,[esi+2]                            ; each index steps by 2 so the two streams cover alternate bytes
         mov  bl, [ecx+eax]                ; read byte
mov dl,[ecx+esi]

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm

Petroizki

Quote from: Mark_Larson on May 10, 2005, 04:30:28 PM
5) comment below.
The three NOPs make the loop itself aligned to 16.
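
The arithmetic can be checked by adding up the instruction sizes between the `align 16` and the loop label. This is a small C sketch of that sum, assuming the usual 32-bit encodings shown in the comments; the constant names and `loop_label_offset` are made up for illustration.

```c
#include <assert.h>

/* Encoded sizes in bytes of everything between "align 16" and the
   "@@:" loop label in SzCpy7, assuming standard 32-bit encodings. */
enum {
    PAD_NOPS     = 3,  /* db 90h,90h,90h      */
    MOV_ECX_ESP8 = 4,  /* 8B 4C 24 08         */
    MOV_EDX_ESP4 = 4,  /* 8B 54 24 04         */
    PUSH_EBX     = 1,  /* 53                  */
    XOR_EAX_EAX  = 2,  /* 33 C0               */
    MOV_BL_ECX   = 2   /* 8A 19               */
};

/* Offset of the loop label from the 16-byte aligned address. */
static unsigned loop_label_offset(void)
{
    return PAD_NOPS + MOV_ECX_ESP8 + MOV_EDX_ESP4
         + PUSH_EBX + XOR_EAX_EAX + MOV_BL_ECX;
}
```

The setup instructions come to 13 bytes, so the 3 padding bytes put the loop label at offset 16, which is the next 16-byte boundary. The procedure entry itself then sits at offset 3.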