Pages: [1] 2 3 ... 8
Topic: SzCpy vs. lstrcpy (Read 68349 times)

Mark Jones
Hello, I tried to improve on the lstrcpy function but only had marginal results. It appears that INC/DEC is faster than ADD/SUB on AMD processors. Can anyone suggest any further refinements? These results are with Windows XP Pro SP1 on an AMD XP 1800+.
128-byte string copy timing results:
lstrcpy              : 432 clocks
SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 540 clocks
SzCpy3 (mov/cmp/add) : 666 clocks
SzCpy4 (mov/test/inc): 344 clocks
SzCpy5 (mov/test/add): 479 clocks
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

hutch--
These are the results I got, Mark.
128-byte string copy timing results:
lstrcpy              : 755 clocks
SzCpy1 (movsb/cmp)   : 1185 clocks
SzCpy2 (mov/cmp/inc) : 582 clocks
SzCpy3 (mov/cmp/add) : 454 clocks
SzCpy4 (mov/test/inc): 593 clocks
SzCpy5 (mov/test/add): 445 clocks
I am a bit busy at the moment, but if you have the time, will you plug this masm32 library module into your test piece to see how it works? It uses an index in the addressing mode and ends up with a lower loop instruction count.

    align 4

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE

    comment * -----------------------------------------------
            copied length minus terminator is returned in EAX
            ----------------------------------------------- *

    align 16

    szCopy proc src:DWORD,dst:DWORD

        push ebp

        mov edx, [esp+8]
        mov ebp, [esp+12]
        xor ecx, ecx            ; ensure no previous content in ECX
        mov eax, -1

      @@:
        add eax, 1
        mov cl, [edx+eax]
        mov [ebp+eax], cl
        test cl, cl
        jnz @B

        pop ebp

        ret 8

    szCopy endp

    OPTION PROLOGUE:PrologueDef
    OPTION EPILOGUE:EpilogueDef

Box is a 2.8 gig Prescott PIV.
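
For readers who find C easier to follow, the shared-index idea in szCopy can be sketched roughly as below. This is only a model of the loop structure, not a replacement for the library routine; the name sz_copy_indexed is made up for this illustration, and a C compiler is free to generate quite different instructions from it.

```c
#include <stddef.h>

/* Hypothetical C model of szCopy: one index serves both the source and
   the destination, so the loop updates a single counter per byte.
   Returns the number of bytes copied, not counting the terminator. */
size_t sz_copy_indexed(const char *src, char *dst)
{
    size_t i = (size_t)-1;      /* mov eax, -1: the first add wraps to 0 */
    char c;

    do {
        i += 1;                 /* add eax, 1 */
        c = src[i];             /* mov cl, [edx+eax] */
        dst[i] = c;             /* mov [ebp+eax], cl */
    } while (c != 0);           /* test cl, cl / jnz @B */

    return i;                   /* index of the terminator = copied length */
}
```

Because the terminator is stored before the loop test, the destination always ends up zero-terminated, and the final index doubles as the length.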

Jimg
Hi Mark-
The Athlon is very sensitive to alignment. I tried some alignment tests. Before the changes, my times were virtually identical to yours. Here are my results with alignment changes.
(By the way, the last two, SzCpy4 and SzCpy5, copied from destination to source; I changed them for my tests.)
lstrcpy              : 432 clocks
SzCpy1 (movsb/cmp)   : 540 clocks
SzCpy2 (mov/cmp/inc) : 526 clocks
SzCpy3 (mov/cmp/add) : 408 clocks
SzCpy4 (mov/test/inc): 325 clocks
SzCpy5 (mov/test/add): 342 clocks

mnemonic
Hi Mark, here are the results from my machine (AMD Athlon 500 MHz):
128-byte string copy timing results:
lstrcpy              : 378 clocks
SzCpy1 (movsb/cmp)   : 536 clocks
SzCpy2 (mov/cmp/inc) : 535 clocks
SzCpy3 (mov/cmp/add) : 660 clocks
SzCpy4 (mov/test/inc): 341 clocks
SzCpy5 (mov/test/add): 421 clocks

hutch--
Here are the times on my Sempron 2.4:
128-byte string copy timing results:
lstrcpy              : 398 clocks
SzCpy1 (movsb/cmp)   : 532 clocks
SzCpy2 (mov/cmp/inc) : 534 clocks
SzCpy3 (mov/cmp/add) : 655 clocks
SzCpy4 (mov/test/inc): 339 clocks
SzCpy5 (mov/test/add): 424 clocks

roticv
My Celeron:
128-byte string copy timing results:
lstrcpy              : 608 clocks
SzCpy1 (movsb/cmp)   : 1366 clocks
SzCpy2 (mov/cmp/inc) : 644 clocks
SzCpy3 (mov/cmp/add) : 511 clocks
SzCpy4 (mov/test/inc): 575 clocks
SzCpy5 (mov/test/add): 456 clocks

MichaelW
P3-500, including the procedure that Hutch posted:
lstrcpy              : 672 clocks
SzCpy1 (movsb/cmp)   : 654 clocks
SzCpy2 (mov/cmp/inc) : 406 clocks
SzCpy3 (mov/cmp/add) : 529 clocks
SzCpy4 (mov/test/inc): 402 clocks
SzCpy5 (mov/test/add): 402 clocks
szCopy               : 289 clocks
And some strange and not very useful timings for an AMD K5:
lstrcpy              : 531 clocks
SzCpy1 (movsb/cmp)   : 984 clocks
SzCpy2 (mov/cmp/inc) : 984 clocks
SzCpy3 (mov/cmp/add) : 984 clocks
SzCpy4 (mov/test/inc): 984 clocks
SzCpy5 (mov/test/add): 984 clocks
eschew obfuscation

lingo
WinXP Pro SP2 / Pentium 4 560J, 3.6 GHz (Prescott), including Hutch's procedure and my procedure SzCpy7. It is first because I want to use the zeroes from the destination buffer:

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE
    align 16
    db 90h,90h,90h
    SzCpy7 proc SzDest:DWORD, SzSource:DWORD
        mov  ecx, [esp+8]       ; ecx = source
        mov  edx, [esp+4]       ; edx = destination
        push ebx
        xor  eax, eax
        mov  bl, [ecx]          ; read byte
    @@:
        add  [edx+eax], bl      ; I use the fact -> the destination
        lea  eax, [eax+1]       ; buffer is zero filled!!!
        mov  bl, [ecx+eax]      ; read byte
        jnz  @B                 ; keep going
        pop  ebx
        ret  2*4
    SzCpy7 endp
    OPTION PROLOGUE:PrologueDef
    OPTION EPILOGUE:EpilogueDef

Microsoft (R) Macro Assembler Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: SzCpy.asm
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.
Press any key to continue . . .
SzCpy/lstrcpy speed comparison by Mark Jones (gzscuqn02ATsneakemailD0Tcom)
Timing routines by MichaelW from MASM32 forums, http://www.masmforum.com/
Please terminate any high-priority tasks and press ENTER to begin.
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo) : 172 clocks
lstrcpy                     : 646 clocks
SzCpy1 (movsb/cmp)          : 1236 clocks
SzCpy2 (mov/cmp/inc)        : 667 clocks
SzCpy3 (mov/cmp/add)        : 532 clocks
SzCpy4 (mov/test/inc)       : 602 clocks
SzCpy5 (mov/test/add)       : 453 clocks
SzCpy6 (mov/test/add->Hutch): 545 clocks
Press ENTER to exit...
Regards, Lingo

deroko
Athlon64 2800+
128-byte string copy timing results:
SzCpy7 (mov/add/lea->Lingo) : 157 clocks
lstrcpy                     : 327 clocks
SzCpy1 (movsb/cmp)          : 714 clocks
SzCpy2 (mov/cmp/inc)        : 574 clocks
SzCpy3 (mov/cmp/add)        : 712 clocks
SzCpy4 (mov/test/inc)       : 301 clocks
SzCpy5 (mov/test/add)       : 432 clocks
SzCpy6 (mov/test/add->Hutch): 303 clocks

hutch--
Here is a quick play. This one is faster on my PIV by a ratio of 270 to 200 ms against the library version of szCopy. The idea was register alternation with the source and destination addresses and a 4 times unroll. The first dropped the times a little and, in conjunction with the unroll, dropped them some more.

    ; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

    align 4

    szCopyx proc src:DWORD,dst:DWORD

        push ebx
        push esi
        push edi

        xor ecx, ecx
        mov esi, src
        mov edi, dst
        mov ebx, src
        mov edx, dst

        mov eax, -4

      align 4
      @@:
        add eax, 4
        mov cl, [esi+eax]
        mov [edi+eax], cl
        test cl, cl
        jz @F

        mov cl, [ebx+eax+1]
        mov [edx+eax+1], cl
        test cl, cl
        jz @F

        mov cl, [esi+eax+2]
        mov [edi+eax+2], cl
        test cl, cl
        jz @F

        mov cl, [ebx+eax+3]
        mov [edx+eax+3], cl
        test cl, cl
        jnz @B

      @@:
        pop edi
        pop esi
        pop ebx

        ret

    szCopyx endp

    ; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
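
The 4x unroll can be pictured in C roughly as follows. This is a sketch under assumptions: the name sz_copy_unroll4 is invented here, the register alternation has no C equivalent and is left out, and returning the exact copied length is a choice made for this example rather than something szCopyx does (it leaves the base index of the last group in EAX).

```c
#include <stddef.h>

/* Hypothetical C counterpart of a 4x unrolled byte copy: the index is
   advanced once per group of four bytes, and each byte is tested for
   the terminator right after it is stored. */
size_t sz_copy_unroll4(const char *src, char *dst)
{
    size_t i = 0;                       /* base index of the current group */

    for (;;) {
        if ((dst[i]     = src[i])     == 0) return i;
        if ((dst[i + 1] = src[i + 1]) == 0) return i + 1;
        if ((dst[i + 2] = src[i + 2]) == 0) return i + 2;
        if ((dst[i + 3] = src[i + 3]) == 0) return i + 3;
        i += 4;                         /* one index update per four bytes */
    }
}
```

The saving comes from amortising the index update and the backward branch over four bytes; whether that shows up in the clock counts depends on the processor, as the timings in this thread suggest.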

Mark Jones
Wow, we have some interesting results so far! Lingo's algorithm didn't seem to slow down any when the destination string was not all zeroes. Added is Hutch's 4x unroll:
XP SP1/AMD XP 1800+
128-byte string copy timing results:
lstrcpy                      : 433 clocks
SzCpy1 (movsb/cmp) MJ        : 540 clocks
SzCpy2 (mov/cmp/inc) MJ      : 541 clocks
SzCpy3 (mov/cmp/add) MJ      : 666 clocks
SzCpy4 (mov/test/inc) MJ     : 344 clocks
SzCpy5 (mov/test/add) MJ     : 477 clocks
SzCopy (mov/test/add) Hutch  : 367 clocks
SzCpy7 (mov/add/lea) Lingo   : 148 clocks
szCopyx (4x unroll) Hutch    : 300 clocks
Press ENTER to exit...

AeroASM
The latest one gives me:
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
masm32.lib(DW2A.obj) : error LNK2001: unresolved external symbol _wsprintfA
SzCpy.exe : fatal error LNK1120: 1 unresolved externals

Jimg
How about moving a word at a time?

    align 16
    SzCpy12 proc uses ebx SzDest:DWORD, SzSource:DWORD

        xor ecx, ecx
        mov eax, -4
        mov ebx, SzSource
        mov edx, SzDest

      align 4
      @@:
        add eax,4
        mov cx, [ebx+eax]
        test cl, cl
        jz move1                ; done, go move zero byte
        mov [edx+eax], cx       ; save both
        test ch,ch
        jz @f                   ; all done

        mov cx, [ebx+eax+2]
        test cl, cl
        jz move2
        mov [edx+eax+2], cx
        test ch,ch
        jnz @b
      @@:
        ret
      move1:
        mov [edx+eax],cl        ; store the terminator in the destination (edx), not the source
        ret
      move2:
        mov [edx+eax+2],cl      ; same, for the second word of the group
        ret
    SzCpy12 endp

It runs in 255 clocks on my Athlon.
P.S. Did anyone try Lingo's routine? It doesn't give me the correct results. I must be doing something wrong.
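
A rough C picture of the word-at-a-time idea, for anyone who wants to check the edge cases (even length, odd length, empty string) outside the timing harness. The name sz_copy_words is invented for this sketch, and it leans on one assumption that the assembly version shares: the source must have at least one readable byte after its terminator, because bytes are always fetched in pairs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C model of a two-bytes-per-step string copy.
   ASSUMPTION: src has one readable byte past its terminator. */
void sz_copy_words(char *dst, const char *src)
{
    size_t i = 0;

    for (;;) {
        unsigned char pair[2];
        memcpy(pair, src + i, 2);       /* fetch two bytes at once */

        if (pair[0] == 0) {             /* terminator is the low byte */
            dst[i] = 0;                 /* store only the terminator */
            return;
        }
        memcpy(dst + i, pair, 2);       /* store both bytes */
        if (pair[1] == 0)               /* terminator was the high byte */
            return;
        i += 2;
    }
}
```

In both exit paths the terminator has to land in the destination buffer; that is the detail most worth checking when porting the idea back to assembly.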

Mark_Larson
Mark, you missed the trick that Lingo was trying to do. If the destination buffer is not all 0's, the code will run OK but will produce incorrect results. He adds the destination and source together in place of a move, and that only works if the destination is all 0's.

Lingo, a couple of points. I haven't tried any of this; it is just stuff that probably could make it faster.

1) On the P4 that you ran this code on, both MOV and ADD run in the same time (0.5 cycles). So timing-wise, if you change your ADD below to a MOV, it should theoretically perform the same. Both MOV and ADD go through the same execution unit (ALU), so they should run at the same speed. If they don't, that would be really surprising. That will also get rid of your dependency on having an all-0 destination buffer.

2) "mov bl, [ecx]" is going to cause a partial register stall. Even on the P4 you can still get "false partial register stalls". Change it to a MOVZX or add an "xor ebx,ebx" before the move.

3) LEA is the slowest way to increment a number on a P4. ADD/SUB is the fastest. Also try INC/DEC.

    lea eax, [eax+1] ; buffer is zero filled!!!
    mov bl, [ecx+eax] ; read byte

4) The above code snippet has a stall (read-after-write dependency) on the previous line. Instead, try switching the two lines and using "ecx+eax+1" for the memory address. That should free up a stall.

5) Comment below.

    OPTION PROLOGUE:NONE
    OPTION EPILOGUE:NONE
    align 16
    db 90h,90h,90h
    - I am clueless why you added 3 NOPs before the procedure. This makes it not aligned. For the P4 that is not a big deal, but for non-P4 that is going to cause a loss in performance.
    SzCpy7 proc SzDest:DWORD, SzSource:DWORD

6) Advanced trick: try interweaving two instances of the code that use different register sets to avoid stalls. You just need to free up enough registers. If you'd like me to explain this in more detail, just ask. I'll post a bit of it. I have not verified this runs faster, but it should help break up some stalls. I also used your original code, so I didn't make any of the changes I suggested above.

    ;ecx is source
    ;edi is destination - changed this from EDX to EDI, to free up a byte accessible register.
    ;esi is the same as eax, just a different offset
    ;all new code left justified
            xor eax, eax
    mov esi,1
            mov bl, [ecx] ; read byte
    mov dl,[ecx+1]
    @@:
            add [edi+eax], bl ; I use the fact ->the destination
    add [edi+esi],dl
            lea eax, [eax+1] ; buffer is zero filled!!!
    lea esi,[esi+1]
            mov bl, [ecx+eax] ; read byte
    mov dl,[ecx+esi]
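
The ADD-in-place-of-MOV behaviour described at the top of this post can be modelled in a few lines of C. This is a sketch, not Lingo's routine: the name sz_copy_add is made up, and the model only mirrors the two facts that matter here, namely that each source byte is added into the destination byte and that the loop exit is decided by the result of that addition.

```c
#include <stddef.h>

/* Hypothetical C model of an ADD-based string copy. The loop stops when
   the SUM is zero, which coincides with the source terminator only if
   the destination byte at that position was already zero. */
void sz_copy_add(unsigned char *dst, const unsigned char *src)
{
    size_t i;

    for (i = 0; ; i++) {
        unsigned char sum = (unsigned char)(dst[i] + src[i]);  /* add [dst+i], byte */
        dst[i] = sum;
        if (sum == 0)       /* the branch tests the ADD result, not the source byte */
            break;
    }
}
```

With a zero-filled destination, copying "ab" yields "ab". With a destination that starts as {1, 1, 0, ...}, the same call yields "bc": the routine runs to completion but the contents are wrong. If the destination byte at the terminator position were non-zero as well, the sum there would not be zero and the loop would carry on past the end of the string.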
BIOS programmers do it fastest, hehe. ;)
My Optimization webpage htttp://www.website.masmforum.com/mark/index.htm

Petroizki
5) comment below.
The three NOPs make the loop itself aligned to 16.