
StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM


Antariy

Hi, folks!

I've found one nice rule: stupid lingo doesn't like it when any code beats his code without having as much functionality as his code. But when his own lamer's code lacks some functionality, he thinks that is great, that it is part of the algo. And when somebody points this out to him, he gets very dissatisfied and insults other people.

Lingo is an inadequate man, a madman.
Stupid GrandLamer lingo, where do you live? Not in Toronto, is it?
Don't worry, we won't tell anybody why you are ugly and wretched. Confess to us  :green2



Alex

dedndave

i think he's in Germany


jj2007

Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany

Right - if he lived in Toronto, his English would be a lot better.

dedndave

i remember hearing somewhere in another thread that his wife doesn't like him, either   :bdg

Antariy

Quote from: jj2007 on August 26, 2010, 10:28:54 PM
Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany

Right - if he lived in Toronto, his English would be a lot better.

I see European phrase construction in his text. It is easy to recognize.



Alex

hutch--

 :bg

Come on guys, Lingo is OK, he just has a charming turn of phrase.  :P

Antariy

Quote from: hutch-- on August 30, 2010, 04:20:03 AM
:bg

Come on guys, Lingo is OK, he just has a charming turn of phrase.  :P

If being inadequate and an upstart counts as OK, then it is true: this someone is OK  :P



Alex

oex

AMD Sempron(tm) Processor 3100+ (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

139     cycles for szLen
66      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
45      cycles for StrLenLingo

139     cycles for szLen
62      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
45      cycles for StrLenLingo

139     cycles for szLen
67      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
44      cycles for StrLenLingo

139     cycles for szLen
62      cycles for AxStrLenSSE1
67      cycles for AxJJStrLen4
66      cycles for AxJJStrLen5
45      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
8       cycles for szLen
29      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
23      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
29      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
23      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

lingo

New CPU with SSE 4.2 and new results... :lol
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
82      code size for pcmpistriLingo    82      total bytes for pcmpistriLingo
79      code size for StrLenLingo       79      total bytes for StrLenLingo

77      cycles for szLen
14      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
14      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

78      cycles for szLen
13      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
13      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
9       cycles for StrLenLingo

77      cycles for szLen
14      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
15      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

78      cycles for szLen
13      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
13      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
4       cycles for szLen
3       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
4       cycles for AxJJStrLen5
-1      cycles for pcmpistriLingo
-1      cycles for StrLenLingo

5       cycles for szLen
3       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
0       cycles for pcmpistriLingo
-1      cycles for StrLenLingo

4       cycles for szLen
2       cycles for AxStrLenSSE1
4       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
0       cycles for pcmpistriLingo
-1      cycles for StrLenLingo

4       cycles for szLen
2       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
-1      cycles for pcmpistriLingo
-1      cycles for StrLenLingo


--- ok ---


brethren

this string length algorithm is the fastest one i've found that uses no SSE instructions. i've timed it against masm32's fast StrLen and it is slightly faster, plus there is still room for optimization :wink
btw i found this algo in one of Randy Hyde's books, originally written in HLA. the HLA source code is public domain

StrLength PROC USES esi, buf:DWORD
    mov esi, buf
    ; check up to three leading bytes one at a time until esi is DWORD-aligned
    test esi, 3
    jz IsAligned
    cmp BYTE PTR [esi], NULL
    je done
    add esi, 1
    test esi, 3
    jz IsAligned
    cmp BYTE PTR [esi], NULL
    je done
    add esi, 1
    test esi, 3
    jz IsAligned
    cmp BYTE PTR [esi], NULL
    je done
    add esi, 1
IsAligned:
    sub esi, 32                 ; pre-bias, lbl1 adds the 32 back
lbl1:
    add esi, 32                 ; advance to the next 32-byte block
lbl2:
    ; per DWORD: mask off bit 7, subtract 1 from each byte, test bit 7;
    ; a nonzero result means a byte might be zero (00h and 80h both trigger)
    mov eax, [esi]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero0

    mov eax, [esi+4]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero4

    mov eax, [esi+8]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero8

    mov eax, [esi+12]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero12

    mov eax, [esi+16]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero16

    mov eax, [esi+20]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero20

    mov eax, [esi+24]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jnz MightBeZero24

    mov eax, [esi+28]
    and eax, 7F7F7F7Fh
    sub eax, 01010101h
    and eax, 80808080h
    jz lbl1                     ; nothing flagged in all 32 bytes

    add esi, 28
    jmp MightBeZero0
MightBeZero4:
    add esi, 4
    jmp MightBeZero0
MightBeZero8:
    add esi, 8
    jmp MightBeZero0
MightBeZero12:
    add esi, 12
    jmp MightBeZero0
MightBeZero16:
    add esi, 16
    jmp MightBeZero0
MightBeZero20:
    add esi, 20
    jmp MightBeZero0
MightBeZero24:
    add esi, 24
MightBeZero0:
    ; exact check: esi points at the flagged DWORD, find the real zero byte
    mov eax, [esi]
    cmp al, 0
    je done
    cmp ah, 0
    je done1
    test eax, 0FF0000h
    je done2
    test eax, 0FF000000h
    je done3

    add esi, 4                  ; false positive, resume at the next DWORD
    jmp lbl2
done3:
    sub esi, buf
    lea eax, [esi+3]
    jmp @F
done2:
    sub esi, buf
    lea eax, [esi+2]
    jmp @F
done1:
    sub esi, buf
    lea eax, [esi+1]
    jmp @F
done:
    mov eax, esi
    sub eax, buf
@@:
    ret
StrLength ENDP
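
For anyone who would rather read the idea in C than in asm, here is a rough sketch of the same filter-then-verify scheme. It is not a translation of the routine above: the name `str_length_sketch` and the helper `might_have_zero` are made up for illustration, the 8x unrolling is left out, and it assumes the buffer stays readable up to the end of the aligned 4-byte word holding the terminator (the same assumption the asm makes).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if some byte of w MIGHT be zero.
   Bytes 00h and 80h both trigger it, and a borrow out of a flagged
   byte can also flag a 01h/81h byte above it, so a hit must be verified. */
static uint32_t might_have_zero(uint32_t w)
{
    return ((w & 0x7F7F7F7Fu) - 0x01010101u) & 0x80808080u;
}

/* Hypothetical name; simplified sketch of the align / filter / verify steps. */
size_t str_length_sketch(const char *s)
{
    const char *p = s;

    /* Step 1: byte-wise until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Step 2: one aligned 4-byte word at a time. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);            /* aligned 4-byte load */
        if (might_have_zero(w)) {
            /* Step 3: exact check, lowest address first. */
            for (int i = 0; i < 4; i++) {
                if (p[i] == '\0')
                    return (size_t)(p - s) + (size_t)i;
            }
            /* False positive (e.g. an 80h byte): keep scanning. */
        }
        p += 4;
    }
}
```

The filter never misses a real zero byte, it only over-reports, which is why the cheap test can run on whole words and the exact byte checks only run on the rare hit.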

lingo

"this string length algorithm is the fastest one i've found that uses no sse instructions."

My algo (without SSE) is faster:  :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
StrLenLingo proc lpszStr:DWORD
    mov edx, esp
    mov esp, [esp+1*4]
    pop eax
@@Loop:
    sub eax, 1010101h
    pop ecx
    sub ecx, 1010101h
    test eax, 80808080h
    pop eax
    jne @f
@@LoopCont:
    test ecx, 80808080h
    je @@Loop
    test byte ptr [esp-8], 0FFh
    je mi8
    test byte ptr [esp-7], 0FFh
    je mi7
    test byte ptr [esp-6], 0FFh
    je mi6
    test byte ptr [esp-5], 0FFh
    jne @@Loop
    lea eax, [esp-5]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
align 8
@@:
    test byte ptr [esp-12], 0FFh
    jz mi12
    test byte ptr [esp-11], 0FFh
    jz mi11
    test byte ptr [esp-10], 0FFh
    jz mi10
    test byte ptr [esp-9], 0FFh
    jnz @@LoopCont
    lea eax, [esp-9]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi8:
    lea eax, [esp-8]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi7:
    lea eax, [esp-7]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi6:
    lea eax, [esp-6]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi12:
    lea eax, [esp-12]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi11:
    lea eax, [esp-11]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
mi10:
    lea eax, [esp-10]
    mov esp, edx
    sub eax, [edx+1*4]
    ret 4
StrLenLingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

The results on my CPU i7-2600K are:

28 cycles for agner fog StrLen-masmlib
25 cycles for StrLength
17 cycles for StrLenLingo
--test finished--
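
A side note on the `sub eax, 1010101h` / `test eax, 80808080h` pair in StrLenLingo: it drops the `and ..., 7F7F7F7Fh` step that StrLength uses, so the two filters flag different bytes. The C sketch below only illustrates that difference; the function names `filter_masked` and `filter_unmasked` are invented here, and it says nothing about which routine is faster on a given CPU.

```c
#include <stdint.h>

/* StrLength-style filter (with the AND mask): a byte is flagged
   when it is 00h or 80h. */
uint32_t filter_masked(uint32_t w)
{
    return ((w & 0x7F7F7F7Fu) - 0x01010101u) & 0x80808080u;
}

/* StrLenLingo-style filter (no AND mask): a byte is flagged
   when it is 00h or in the range 81h..FFh. */
uint32_t filter_unmasked(uint32_t w)
{
    return (w - 0x01010101u) & 0x80808080u;
}
```

Both variants always flag a real zero byte and both need the exact byte tests afterwards. On plain 7-bit ASCII text neither gives false hits; on text with many high-bit characters the unmasked form will drop into the byte tests more often, in exchange for one instruction less per DWORD.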


jj2007

Prescott P4:
110 cycles for agner fog StrLen-masmlib
63 cycles for StrLength
83 cycles for StrLenLingo

FORTRANS

Hi,

   PIII, Win2k.

Steve


Assembling: strlen_a.asm

G:\WORK\TEMP>strlen_a
65 cycles for agner fog StrLen-masmlib
54 cycles for StrLength
65 cycles for StrLenLingo
--test finished--

jj2007


lingo

The old idiot can't keep himself under control again.
It seems he has forgotten his pills again... :lol
Intel Core 2 Duo E8500, 3.16 GHz:  :lol
49 cycles for agner fog StrLen-masmlib
41 cycles for StrLength
26 cycles for StrLenLingo
--test finished--