The MASM Forum Archive 2004 to 2012
Author Topic: SzCpy vs. lstrcpy  (Read 68349 times)
Mark Jones
Drifting in the Abstract
Member
*****
Posts: 2302


=- Stargate Atlantis -=


SzCpy vs. lstrcpy
« on: May 09, 2005, 01:23:45 AM »

Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1, AMD XP 1800+

Code:
128-byte string copy timing results:

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks

[attachment deleted by admin]

"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08
hutch--
Administrator
Member
*****
Posts: 12013


Mnemonic Driven API Grinder


Re: SzCpy vs. lstrcpy
« Reply #1 on: May 09, 2005, 02:22:51 AM »

These are the results I got, Mark.


128-byte string copy timing results:

lstrcpy              : 755 clocks

SzCpy1 (movsb/cmp)   : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks


I am a bit busy at the moment, but if you have the time, will you plug this masm32 library module into your test piece to see how it works? It uses an index in the addressing mode and ends up with a lower loop instruction count.

Code:
align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

comment * -----------------------------------------------
        copied length minus terminator is returned in EAX
        ----------------------------------------------- *
align 16

szCopy proc src:DWORD,dst:DWORD

    push ebp

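    mov edx, [esp+8]    ; src (offsets are 4 higher because EBP was just pushed)
    mov ebp, [esp+12]   ; dst
    xor ecx, ecx        ; ensure no previous content in ECX
    mov eax, -1         ; index, becomes 0 on the first pass

  @@:
    add eax, 1
    mov cl, [edx+eax]   ; read byte from source
    mov [ebp+eax], cl   ; write byte to destination
    test cl, cl         ; was it the terminator?
    jnz @B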

    pop ebp

    ret 8

szCopy endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

Box is a 2.8 gig Prescott PIV.
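
For anyone who wants to check the logic of the szCopy listing outside the assembler, the loop can be modelled in C. This is only a sketch under assumptions: the name sz_copy_indexed is made up for illustration and is not part of the masm32 library, and the C version says nothing about timing. It shows the single shared index and the fact that the terminator is copied before the loop exits.

```c
#include <stddef.h>

/* Hypothetical C model of the indexed loop: one index serves both
   pointers, and the terminator is copied before the loop exits. */
size_t sz_copy_indexed(const char *src, char *dst)
{
    size_t i = (size_t)-1;      /* mirrors "mov eax, -1" */
    char c;

    do {
        i += 1;                 /* add eax, 1 (wraps to 0 on the first pass) */
        c = src[i];             /* mov cl, [edx+eax] */
        dst[i] = c;             /* mov [ebp+eax], cl */
    } while (c != 0);           /* test cl, cl / jnz @B */

    return i;                   /* copied length minus terminator */
}
```

Because the index starts at -1 and is incremented at the top of the loop, the value left in it at exit is the string length, which is what the comment block in the listing says EAX returns.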

Regards,



Download site for MASM32
http://www.masm32.com
Jimg
Member
*****
Posts: 1280


Re: SzCpy vs. lstrcpy
« Reply #2 on: May 09, 2005, 03:24:39 AM »

Hi Mark-

The Athlon is very sensitive to alignment.  I tried some alignment tests.  Before the changes, my times were virtually identical to yours.  Here are my results with alignment changes.

(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source; I changed them for my tests.)

lstrcpy              : 432 clocks

SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks



[attachment deleted by admin]
mnemonic
Chaos Engine
Member
*****
Gender: Male
Posts: 199


Peace!


Re: SzCpy vs. lstrcpy
« Reply #3 on: May 09, 2005, 03:40:59 AM »

Hi Mark,

Here are the results from my machine (AMD Athlon 500 MHz):

Code:
128-byte string copy timing results:

lstrcpy              : 378 clocks

SzCpy1 (movsb/cmp)   : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks

Be kind. Everyone you meet is fighting a hard battle.--Plato
-------
How To Ask Questions The Smart Way
hutch--
Administrator
Member
*****
Posts: 12013


Mnemonic Driven API Grinder


Re: SzCpy vs. lstrcpy
« Reply #4 on: May 09, 2005, 10:34:45 AM »

Here are the times on my Sempron 2.4


128-byte string copy timing results:

lstrcpy              : 398 clocks

SzCpy1 (movsb/cmp)   : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks

roticv
Guest


Email
Re: SzCpy vs. lstrcpy
« Reply #5 on: May 09, 2005, 12:37:54 PM »

My Celeron:



 128-byte string copy timing results:

 lstrcpy              : 608 clocks

 SzCpy1 (movsb/cmp)   : 1366 clocks
 SzCpy2 (mov/cmp/inc) : 644 clocks
 SzCpy3 (mov/cmp/add) : 511 clocks
 SzCpy4 (mov/test/inc): 575 clocks
 SzCpy5 (mov/test/add): 456 clocks
MichaelW
Global Moderator
Member
*****
Gender: Male
Posts: 5161


Re: SzCpy vs. lstrcpy
« Reply #6 on: May 09, 2005, 01:45:11 PM »

P3-500, including the procedure that Hutch posted:

lstrcpy              : 672 clocks

SzCpy1 (movsb/cmp)   : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy               : 289 clocks


And some strange and not very useful timings for an AMD K5:

lstrcpy              : 531 clocks

SzCpy1 (movsb/cmp)   : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks




eschew obfuscation
lingo
Member
*****
Posts: 625



Re: SzCpy vs. lstrcpy
« Reply #7 on: May 09, 2005, 10:23:35 PM »

lol
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure and my procedure, SzCpy7.
It runs first because I want to use the zeroes in the destination buffer:

Code:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h
SzCpy7   proc SzDest:DWORD, SzSource:DWORD
         mov  ecx, [esp+8]                 ; ecx = source
         mov  edx, [esp+4]                 ; edx= destination
         push ebx
         xor  eax, eax
         mov  bl, [ecx]                    ; read byte
@@:     
         add  [edx+eax], bl                ; I use the fact ->the destination
         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
         mov  bl, [ecx+eax]                ; read byte
         jnz  @B                           ; keep going
         pop  ebx
         ret  2*4
SzCpy7   endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef



Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

Press any key to continue . . .
SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.


128-byte string copy timing results:

SzCpy7 (mov/add/lea->Lingo): 172 clocks
lstrcpy              : 646 clocks

SzCpy1 (movsb/cmp)   : 1236 clocks
SzCpy2 (mov/cmp/inc) : 667 clocks
SzCpy3 (mov/cmp/add) : 532 clocks
SzCpy4 (mov/test/inc): 602 clocks
SzCpy5 (mov/test/add): 453 clocks
SzCpy6 (mov/test/add->Hutch): 545 clocks

Press ENTER to exit...

Regards,
Lingo

[attachment deleted by admin]
deroko
Guest


Email
Re: SzCpy vs. lstrcpy
« Reply #8 on: May 09, 2005, 11:25:10 PM »

Athlon64 2800+
Code:
128-byte string copy timing results:

SzCpy7 (mov/add/lea->Lingo): 157 clocks
lstrcpy              : 327 clocks

SzCpy1 (movsb/cmp)   : 714 clocks
SzCpy2 (mov/cmp/inc) : 574 clocks
SzCpy3 (mov/cmp/add) : 712 clocks
SzCpy4 (mov/test/inc): 301 clocks
SzCpy5 (mov/test/add): 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks
hutch--
Administrator
Member
*****
Posts: 12013


Mnemonic Driven API Grinder


Re: SzCpy vs. lstrcpy
« Reply #9 on: May 10, 2005, 03:23:07 AM »

Here is a quick play. This one is faster on my PIV by a ratio of 270 to 200 ms against the library version of szCopy. The idea was register alternation with the source and destination addresses and a 4 times unroll. The alternation dropped the times a little, and in conjunction with the unroll they dropped some more.

Code:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

align 4

szCopyx proc src:DWORD,dst:DWORD

    push ebx
    push esi
    push edi

    xor ecx, ecx
    mov esi, src
    mov edi, dst
    mov ebx, src
    mov edx, dst

    mov eax, -4

  align 4
  @@:
    add eax, 4
    mov cl, [esi+eax]
    mov [edi+eax], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+1]
    mov [edx+eax+1], cl
    test cl, cl
    jz @F

    mov cl, [esi+eax+2]
    mov [edi+eax+2], cl
    test cl, cl
    jz @F

    mov cl, [ebx+eax+3]
    mov [edx+eax+3], cl
    test cl, cl
    jnz @B

  @@:

    pop edi
    pop esi
    pop ebx

    ret

szCopyx endp

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
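
The control flow of the 4 times unroll may be easier to follow in C. This is a rough model, assuming a made-up name (sz_copy_unrolled4) and ignoring the register alternation, which has no C equivalent. Each of the four steps stores its byte and then tests it, so the copy stops at the terminator wherever it falls inside a group of four.

```c
#include <stddef.h>

/* Hypothetical C model of a 4x unrolled byte copy. Every step stores
   the byte first and then checks it, as the listing does. */
void sz_copy_unrolled4(const char *src, char *dst)
{
    size_t i = 0;

    for (;;) {
        if ((dst[i]     = src[i])     == 0) break;
        if ((dst[i + 1] = src[i + 1]) == 0) break;
        if ((dst[i + 2] = src[i + 2]) == 0) break;
        if ((dst[i + 3] = src[i + 3]) == 0) break;
        i += 4;                 /* only one index update per four bytes */
    }
}
```

The gain the unroll aims for is that the index update and the loop branch are paid once per four bytes instead of once per byte; whether that shows up in the clock counts depends on the processor, as the results in this thread suggest.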

Mark Jones
Drifting in the Abstract
Member
*****
Posts: 2302


=- Stargate Atlantis -=


Re: SzCpy vs. lstrcpy
« Reply #10 on: May 10, 2005, 02:30:18 PM »

Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down any when the destination string was not all zeroes. Added is Hutch's 4x unroll:

XP SP1/AMD XP 1800+
Code:
128-byte string copy timing results:

lstrcpy                       : 433 clocks

SzCpy1 (movsb/cmp)    MJ      : 540 clocks
SzCpy2 (mov/cmp/inc)  MJ      : 541 clocks
SzCpy3 (mov/cmp/add)  MJ      : 666 clocks
SzCpy4 (mov/test/inc) MJ      : 344 clocks
SzCpy5 (mov/test/add) MJ      : 477 clocks
SzCopy (mov/test/add) Hutch   : 367 clocks
SzCpy7 (mov/add/lea)  Lingo   : 148 clocks
szCopyx (4x unroll)   Hutch   : 300 clocks

Press ENTER to exit...

[attachment deleted by admin]
AeroASM
Guest


Email
Re: SzCpy vs. lstrcpy
« Reply #11 on: May 10, 2005, 04:25:03 PM »

The latest one gives me

Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals
Jimg
Member
*****
Posts: 1280


Re: SzCpy vs. lstrcpy
« Reply #12 on: May 10, 2005, 04:25:55 PM »

How about moving a word at a time?

Code:
align 16
SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD

    xor ecx, ecx
    mov eax, -4
    mov ebx, SzSource       ; ebx = source
    mov edx, SzDest         ; edx = destination

  align 4
@@:
    add eax, 4
    mov cx, [ebx+eax]       ; read two bytes
    test cl, cl
    jz move1                ; done, go move zero byte
    mov [edx+eax], cx       ; save both
    test ch, ch
    jz @F                   ; all done

    mov cx, [ebx+eax+2]
    test cl, cl
    jz move2
    mov [edx+eax+2], cx
    test ch, ch
    jnz @B
@@:
    ret

move1:
    mov [edx+eax], cl       ; store the terminator in the destination
    ret
move2:
    mov [edx+eax+2], cl     ; store the terminator in the destination
    ret
SzCpy12 endp

runs in 255 on my Athlon
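
The two exit paths of a two-bytes-per-step copy are easy to get wrong, so here is a small C model for checking them. It is a sketch only: sz_copy_pairs is a hypothetical name, and unlike the 16-bit load in the assembly, this version reads the second byte only after the first is known to be non-zero, so it never touches memory past the terminator.

```c
#include <stddef.h>

/* Hypothetical C model of copying two bytes per step. The terminator
   must reach the destination whether it is the first or the second
   byte of a pair. */
void sz_copy_pairs(const char *src, char *dst)
{
    size_t i = 0;

    for (;;) {
        char lo = src[i];
        if (lo == 0) {          /* terminator is the first byte of the pair */
            dst[i] = 0;
            return;
        }
        char hi = src[i + 1];
        dst[i] = lo;            /* store both bytes */
        dst[i + 1] = hi;
        if (hi == 0)            /* terminator was the second byte */
            return;
        i += 2;
    }
}
```

Testing with both odd and even string lengths exercises both exits, which is a quick way to catch a terminator that never gets written to the destination.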


p.s.
Did anyone try Lingo's routine?  It doesn't give me the correct results, I must be doing something wrong.
Mark_Larson
Moderator
Member
*****
Posts: 556


Re: SzCpy vs. lstrcpy
« Reply #13 on: May 10, 2005, 04:30:28 PM »

  Mark, you missed the trick that Lingo was trying to do.  If the destination buffer is not all 0's, the code will run ok, but will produce incorrect results.  He adds the destination and source together in place of a move, and that only works if the destination is all 0's.
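
The effect is easy to reproduce in a few lines of C. This is a hedged model rather than Lingo's routine: sz_copy_add is a made-up name, and the loop simply mirrors the posted listing, where each destination byte becomes destination plus source and the exit test looks at the result of that addition.

```c
#include <stddef.h>

/* Hypothetical model of the "add instead of mov" loop: every
   destination byte becomes (destination + source) modulo 256, and the
   loop exits when that sum is zero. */
void sz_copy_add(const char *src, unsigned char *dst)
{
    size_t i = 0;
    unsigned char sum;

    do {
        sum = (unsigned char)(dst[i] + (unsigned char)src[i]);  /* add [edx+eax], bl */
        dst[i] = sum;
        i++;
    } while (sum != 0);         /* jnz tests the result of the add */
}
```

With a zero-filled destination, copying "abc" gives "abc". With a destination that starts with three bytes of value 1 followed by zeroes, the same call gives "bcd". Going by the listing, the exit test depends on the sum, so if the destination byte at the terminator position is not zero the loop does not stop there; a plain store in place of the add would need its own test for the terminator.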


  Lingo, a couple of points.  I haven't tried any of this; it is just stuff that could probably make it faster.

1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles).  So timing-wise, if you change your ADD below to a MOV it should theoretically perform the same.  Both MOV and ADD go through the same execution unit (ALU), so they should run at the same speed.  If they don't, that would be really surprising.  That will also get rid of your dependency on having an all-zero destination buffer.

2) "mov  bl, [ecx] " is going to cause a partial register stall.  Even on the P4 you can still get "false partial register stalls".  Change it to a MOVZX or add an "xor ebx,ebx" before the move.

3) LEA is the slowest way to increment a number on a P4.  ADD/SUB is the fastest.  Also try INC/DEC.
Code:
         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
         mov  bl, [ecx+eax]                ; read byte

4) The above code snippet has a stall (read-after-write dependency) on the previous line.  Instead, try switching the two lines and using "ecx+eax+1" for the memory address.  That should free up a stall.


5) comment below.
Code:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 90h,90h,90h     ; I am clueless why you added 3 NOPs before the procedure.  This makes it not aligned.  For the P4 that is not a big deal, but for non-P4 that is going to cause a loss in performance.
SzCpy7   proc SzDest:DWORD, SzSource:DWORD

6) Advanced trick - try interleaving two instances of the code that use different register sets to avoid stalls.  You just need to free up enough registers.  If you'd like me to explain this in more detail, just ask.

I'll post a bit of it.  I have not verified this runs faster, but it should help break up some stalls.  I also used your original code, so I didn't make any of the changes I suggested above.

Code:
;ecx is source
;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
;esi is the same as eax, just a different offset
;all new code left justified
         xor  eax, eax
mov esi,1
         mov  bl, [ecx]                    ; read byte
mov dl,[ecx+1]
@@:     
         add  [edi+eax], bl                ; I use the fact ->the destination
add [edi+esi],dl
         lea  eax, [eax+1]                 ;   buffer is zero filled!!!
lea esi,[esi+1]
         mov  bl, [ecx+eax]                ; read byte
mov dl,[ecx+esi]

BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
http://www.website.masmforum.com/mark/index.htm
Petroizki
Member
*****
Gender: Male
Posts: 135



Re: SzCpy vs. lstrcpy
« Reply #14 on: May 10, 2005, 05:46:19 PM »

5) comment below.
The three NOPs make the loop itself aligned to 16.
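
The arithmetic behind that can be checked by adding up the bytes that sit between the "align 16" and the loop label in SzCpy7. The sizes below are the standard 32-bit encodings of those instructions, assumed to be what the assembler emits; the names prefix_sizes and loop_offset are made up for this sketch.

```c
#include <stddef.h>

/* Encoded size in bytes of everything between "align 16" and the
   loop label in SzCpy7 (standard 32-bit encodings, assumed). */
static const unsigned char prefix_sizes[] = {
    3,  /* db 90h,90h,90h    : three one-byte NOPs */
    4,  /* mov ecx, [esp+8]  : 8B 4C 24 08 */
    4,  /* mov edx, [esp+4]  : 8B 54 24 04 */
    1,  /* push ebx          : 53 */
    2,  /* xor eax, eax      : two-byte register form */
    2,  /* mov bl, [ecx]     : 8A 19 */
};

/* Offset of the loop label from the 16-byte aligned start. */
unsigned loop_offset(void)
{
    unsigned total = 0;
    for (size_t k = 0; k < sizeof prefix_sizes / sizeof prefix_sizes[0]; k++)
        total += prefix_sizes[k];
    return total;
}
```

Under those assumptions the instructions before the loop take 13 bytes, so the 3 padding bytes bring the loop label to offset 16, which is the next 16-byte boundary.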