StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM


frktons

Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

Do you mean: "Pick one, get two"? It sounds like a commercial ad :lol
Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Quote from: lingo on August 25, 2010, 03:50:59 PM
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."

Well, not ALL CPUs. There is this rare exception of the so-called "P4":

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 5 byte string -------
100     cycles for szLen
30      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen3
28      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5
32      cycles for StrLenLingo (unsafe)

:green2

lingo

"I ladri sono sempre bugiardi"
is equal to:
"The thieves are always liars"
is equal to:
"Воры всегда лжецы"
is equal to:
"Les voleurs sont toujours menteurs" :lol

jj2007

I can't decide, they are so close. Any diverging results on other CPUs?
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
132     cycles for szLen
34      cycles for AxStrLenSSE1
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
31      cycles for szLen
19      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen4
17      cycles for AxJJStrLen5


Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, 16-byte aligned, 100 byte string -------
252     szLen
61      AxStrLenSSE1
69      AxJJStrLen4
69      AxJJStrLen5
------- timings, misaligned, 100 byte string -------
91      szLen
29      AxStrLenSSE1
28      AxJJStrLen4
28      AxJJStrLen5

Antariy

Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."



Wow, lamer Lingo is making noise again:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

66      cycles for AxStrLenSSE1
69      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

64      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
67      cycles for AxJJStrLen3
74      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
21      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
19      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo


--- ok ---


56 vs 66 is 45%??? Lingo, you must have bought all your certificates.
Maybe you don't really live in Toronto? I doubt that Toronto gives citizenship to lamers like you.

56 is faster than 66 by 15%, relative to 66. So you can't calculate such a simple thing? This is great, really.
I have my doubts about what speed it will have on a PIII and older.
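
The percentage can be checked with a few lines of C. This is only a sketch of the arithmetic, using the 66 and 56 cycle counts from the Celeron run above; the function names are made up for illustration. Which base is used changes the figure (about 15% relative to the slower count, about 18% relative to the faster one), but neither comes near 45%.

```c
#include <assert.h>

/* Percentage of cycles saved by the faster routine, relative to the slower count. */
double percent_fewer_cycles(double slow, double fast)
{
    assert(slow > 0.0);
    return (slow - fast) / slow * 100.0;
}

/* The same difference expressed relative to the faster count. */
double percent_relative_to_fast(double slow, double fast)
{
    assert(fast > 0.0);
    return (slow - fast) / fast * 100.0;
}
```

With slow = 66 and fast = 56 the first function gives roughly 15.2 and the second roughly 17.9.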



Alex
P.S. All your translations could be done by a machine; this is simple text. Try to translate this: Линго, ты ЧМО!!! Когда ты переехал в Торонто, чмошник-недоучка? [Russian: "Lingo, you are a jerk!!! When did you move to Toronto, you half-educated jerk?"]

Antariy

Hi, Jochen!

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5

------- timings, misaligned, 100 byte string -------
267     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
73      cycles for AxJJStrLen5

266     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

264     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5

261     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

------- timings, misaligned, 5 byte string -------
99      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5

93      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

--- ok ---




Alex

Antariy

GrandLamer Lingo's:
Quote
"The thieves are always liars"="Воры всегда лжецы"

Until you get a patent for your lamer's algos, you cannot say this, because all your techniques are stolen from Intel's examples. And you turn Intel's not very good tutorials into awfully bad and ugly lamer's code :toothy

I doubt that you can get a patent from any respectable office; maybe a patent for "The Napoleon of Toronto's Central Mental Hospital", or something like that.



Alex

jj2007

Quote from: Antariy on August 25, 2010, 09:15:51 PM
Hi, Jochen!

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
73      cycles for AxJJStrLen5


Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:
        pcmpeqb xmm0, [edx]     ; there may be nullbytes before the real string
        pmovmskb eax, xmm0
        shr eax, cl             ; shift their bits out
        ; bsf eax, eax          ############# not necessary
        jnz @ret1
        ..
@ret:   mov ecx, [esp]
        movaps xmm0, [ecx]
        mov ecx, [esp+4]
        add esp, Extra
        ret 4
@ret1:  bsf eax, eax            ; good on P4 but disastrous on Celeron M
        mov ecx, [esp]
        movaps xmm0, [ecx]
        mov ecx, [esp+4]
        add esp, Extra
        ret 4
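
For readers who want to follow the mask logic without an assembler, here is a rough C emulation of the aligned-block scan that the fragment relies on. It is a sketch, not the AxJJStrLen code: zero_mask16 stands in for pcmpeqb plus pmovmskb, lowest_set_bit stands in for bsf, and all three function names are invented for this example. It assumes, like the original, that the whole 16-byte aligned block around every byte of the string is readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates pcmpeqb against zero followed by pmovmskb:
   bit i of the result is set when block[i] is a null byte. */
unsigned zero_mask16(const unsigned char *block)
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (block[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* Emulates bsf: index of the lowest set bit; the mask must be non-zero. */
unsigned lowest_set_bit(unsigned mask)
{
    unsigned index = 0;
    assert(mask != 0);
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* String length computed from 16-byte aligned blocks only. */
size_t strlen_aligned16(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15u);
    const unsigned char *block = (const unsigned char *)(addr - misalign);

    /* First block: bytes before the string may be null, so their bits
       are shifted out (the "shr eax, cl" step). */
    unsigned mask = zero_mask16(block) >> misalign;
    if (mask != 0)
        return lowest_set_bit(mask);

    /* Every later block lies entirely at or after the string start. */
    for (;;) {
        block += 16;
        mask = zero_mask16(block);
        if (mask != 0)
            return (size_t)(block - (const unsigned char *)s) + lowest_set_bit(mask);
    }
}
```

The shift makes the bit index equal to the string length in the first block, which is why the bsf can only be skipped when the shifted mask is zero.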

Antariy

Quote from: jj2007 on August 25, 2010, 09:32:30 PM
Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:

This does not matter much: 6 clocks. Without preserving ecx and the xmm register, this proc would be 66 bytes long and take 56 clocks (the end-of-buffer check would still be made).

BSF is a (relatively) very slow instruction, but on my CPU the overhead of the check, the branch and the exit gives the same speed, just with bigger code.



Alex

dedndave

BSF is a little slow - and clumsy to use, too
but - i have tried the alternatives without much luck
i am using a P4 - it's a slow instruction
i think it is faster on newer CPUs, Alex
even though that doesn't help us much, it makes the code look good in the forum tests   :bg

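lingo

"BSF is a little slow - and clumsy to use"

With new technologies we don't need BSF :lol...just take a look:
align 16
Newstrlen_sse4_2:
        movdqa  xmm7, notnul            ; range constant, defined elsewhere
        pop     ecx                     ; return address
        pop     eax                     ; pointer to the string
        mov     edx, eax                ; remember the start of the string
@@:
        pcmpistri xmm7, [eax], 14h      ; ecx = index of the first byte outside the ranges in xmm7
        lea     eax, [eax+16]           ; advance without changing the flags
        jnz     @b                      ; ZF is clear while the block holds no null byte
        sub     eax, edx
        lea     eax, [eax+ecx-16]       ; length = block offset + index inside the block
        jmp     dword ptr [esp-2*4]     ; return via the already popped return address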


Hi, unhappy losers...slow.slow..again and again... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
40      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
12      cycles for StrLenLingo

111     cycles for szLen
49      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
8       cycles for AxStrLenSSE1
5       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
40      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
2       cycles for StrLenLingo

7       cycles for szLen
8       cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
42      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo


--- ok ---


jj2007

Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown
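
Why a 16-byte read from an unaligned string start can go bang is easy to see with a little address arithmetic. The sketch below assumes 4 KiB pages and that the page after the buffer is not mapped; the constant and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES  4096u   /* assumption: 4 KiB pages */
#define BLOCK_BYTES 16u     /* size of one SSE read */

/* Non-zero when a 16-byte read starting at addr reaches into the next page. */
int read16_crosses_page(uintptr_t addr)
{
    return (addr & (PAGE_BYTES - 1u)) > PAGE_BYTES - BLOCK_BYTES;
}

/* Start of the 16-byte aligned block that contains addr. Such a block never
   straddles two pages, because the page size is a multiple of the block size. */
uintptr_t aligned_block_start(uintptr_t addr)
{
    assert(PAGE_BYTES % BLOCK_BYTES == 0);
    return addr & ~(uintptr_t)(BLOCK_BYTES - 1u);
}
```

A 5-byte string plus its terminator stored in the last 6 bytes of a page starts at page offset 4090. An unaligned 16-byte read from that address touches the following page, while the aligned block at offset 4080 stays inside the page.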

Antariy

Hi, stupid lingo!

It is no wonder that your algos are fast: you test them on best cases.
For example, part of Paul Dixon's algo, stolen and "optimized" by you:
Quote

.data
align 16
Src   dd 100 Dup(0)
Num dd 0FFFFFFFFh,0          <-------------------------- THIS!!!
vaPtr   dd 0
align 16
chartabL dw "00","10","20","30","40","50","60","70","80","90"
                dw "01","11","21","31","41","51","61","71","81","91"
                dw "02","12","22","32","42","52","62","72","82","92"
                dw "03","13","23","33","43","53","63","73","83","93"
                dw "04","14","24","34","44","54","64","74","84","94"
                dw "05","15","25","35","45","55","65","75","85","95"
                dw "06","16","26","36","46","56","66","76","86","96"
                dw "07","17","27","37","47","57","67","77","87","97"
                dw "08","18","28","38","48","58","68","78","88","98"
                dw "09","19","29","39","49","59","69","79","89","99"



Very nice, you test with the best case: only the sign "-" and "1". The result will be "-1", the fastest possible conversion. Your stupid proc is not faster than Hutch's with 2^31 :bdg on a CPU that is NOT yours, of course :toothy
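
To see how much the test value matters for a dword-to-decimal routine, one can count the characters each input produces. This is only a sketch using the C library, not any of the procs discussed here; decimal_length is a made-up helper, and it assumes the dword is treated as signed, as the "-1" result implies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Number of characters (sign included) in the signed decimal form of a dword. */
int decimal_length(uint32_t raw)
{
    int32_t value;
    char text[16];
    int n;

    memcpy(&value, &raw, sizeof value);     /* reinterpret the bits as signed */
    n = snprintf(text, sizeof text, "%ld", (long)value);
    assert(n > 0 && n < (int)sizeof text);
    return n;
}
```

The quoted Num value 0FFFFFFFFh gives 2 characters, while 80000000h (2^31) gives 11, so the second input makes a conversion routine produce far more digits per call.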

If you are such a lamer that you cannot make something good without "new technologies", then no comment. The posted code cannot be compared with the other code, because it DOESN'T do all the things needed for a fair comparison. If you are not able to write a proc with the SAME characteristics, shut up. You were very dissatisfied when drizz's algo "without functionality" beat your stupid simple unrolled algo. But you are satisfied with YOUR OWN stupid algo without functionality. This is strange; no, this is funny, because you are a lamer.



Alex

lingo

Ugly spaghetti + lamer tubeteikin's code = slow...slow..slow... I don't give a damn. :lol

Antariy

Quote from: jj2007 on August 26, 2010, 09:22:49 PM
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown

No, lingo is a ЧМО!!!
This is a nice Russian term for people like lingo :bdg



Alex
P.S. Lingo, show us any normal app of yours, or can you only improve and optimize other people's existing algos?
By a normal app I mean any app of yours that works for 5 minutes without crashing, and that can do something other than just print your clocks :green2