StrLen timings needed

frktons · August 25, 2010, 03:56:44 PM

Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

Do you mean : "Pick one, get two"? It sounds like a commercial ad :lol

jj2007 · August 25, 2010, 04:01:20 PM

Quote from: lingo on August 25, 2010, 03:50:59 PM
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."

Well, not ALL CPUs. There is this rare exception of the so-called "P4":

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 5 byte string -------
100     cycles for szLen
30      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen3
28      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5
32      cycles for StrLenLingo (unsafe)

:green2

lingo · August 25, 2010, 04:09:07 PM

"I ladri sono sempre bugiardi"
is equal to:
"The thieves are always liars"
is equal to:
"Воры всегда лжецы"
is equal to:
"Les voleurs sont toujours menteurs" :lol

jj2007 · August 25, 2010, 09:08:32 PM

I can't decide, they are so close. Any diverging results on other CPUs?

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
132     cycles for szLen
34      cycles for AxStrLenSSE1
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
31      cycles for szLen
19      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen4
17      cycles for AxJJStrLen5

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, 16-byte aligned, 100 byte string -------
252     szLen
61      AxStrLenSSE1
69      AxJJStrLen4
69      AxJJStrLen5
------- timings, misaligned, 100 byte string -------
91      szLen
29      AxStrLenSSE1
28      AxJJStrLen4
28      AxJJStrLen5

Antariy · August 25, 2010, 09:13:43 PM

Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."

Wow, lamer Lingo is making noise again:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

66      cycles for AxStrLenSSE1
69      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

64      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
67      cycles for AxJJStrLen3
74      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
21      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
19      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo


--- ok ---

56 vs 66 is 45%??? Lingo, you must have bought all your certificates.
Maybe because you don't really live in Toronto? I doubt that Toronto gives citizenship to such lamers as you.

56 is faster than 66 by 15%, relative to 66. So you aren't able to calculate such a simple thing? This is great, really.
I wonder what speed it will have on a PIII and older.
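
The arithmetic behind the 15% figure, as a small C sketch. The function names are made up for this illustration and are not part of any test bench in this thread. The same two cycle counts give about 15% when expressed as cycles saved relative to the slower proc, and about 18% when expressed as speed gained relative to the faster one; neither way of counting gives 45%.

```c
/* Two common ways to express "A is faster than B" from cycle counts.
   Both helpers are hypothetical names used only for this example. */

/* Fraction of cycles saved, relative to the slower proc: (slow - fast) / slow */
static double cycles_saved(double slow, double fast)
{
    return (slow - fast) / slow;
}

/* Fraction of speed gained, relative to the faster proc: slow / fast - 1 */
static double speed_gain(double slow, double fast)
{
    return slow / fast - 1.0;
}
```

With the 100 byte timings above, `cycles_saved(66, 56)` is roughly 0.15 and `speed_gain(66, 56)` is roughly 0.18.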

Alex
P.S. All your translations could have been done by a machine; this is simple text. Try to translate this: Линго, ты ЧМО !!! Когда ты переехал в Торонто, чмошник-недоучка?

Antariy · August 25, 2010, 09:15:51 PM

Hi, Jochen!

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5

------- timings, misaligned, 100 byte string -------
267     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
73      cycles for AxJJStrLen5

266     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

264     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5

261     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

------- timings, misaligned, 5 byte string -------
99      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5

93      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

--- ok ---

Alex

Antariy · August 25, 2010, 09:21:41 PM

GrandLamer Lingo's:

Quote
"The thieves are always liars"="Воры всегда лжецы"

Until you get a patent for your lamer's algos, you cannot say this, because all your techniques are stolen from Intel's examples. And you turn Intel's not very good tutorials into awesomely bad and ugly lamer's code :toothy

I doubt that you can get a patent from any respectable office, except maybe a patent for "The Napoleon of Toronto's Central Mental Hospital", or something like that.

Alex

jj2007 · August 25, 2010, 09:32:30 PM

Quote from: Antariy on August 25, 2010, 09:15:51 PM
Hi, Jochen!

Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
65 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen4
73 cycles for AxJJStrLen5

Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:

	pcmpeqb xmm0, [edx]	; there may be nullbytes before the real string
	pmovmskb eax, xmm0
	shr eax, cl	; shift their bits out
;	bsf eax, eax ############# not necessary
	jnz @ret1
..
@ret:	mov ecx, [esp]
	movaps xmm0, [ecx]
	mov ecx, [esp+4]
	add esp, Extra
	ret 4
@ret1:	bsf eax, eax	; good on P4 but disastrous on Celeron M
	mov ecx, [esp]
	movaps xmm0, [ecx]
	mov ecx, [esp+4]
	add esp, Extra
	ret 4
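
The first steps of the snippet above (pcmpeqb, pmovmskb, shr by the misalignment, then bsf) can be sketched in C with SSE2 intrinsics. This is only an illustration under assumptions, not the code of any proc timed in this thread: the names `zero_mask16` and `strlen_aligned_sketch` are made up here, `__builtin_ctz` is a GCC/Clang builtin standing in for bsf, and the scalar fallback exists only so the sketch still builds where SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Bit i of the result is set when block[i] == 0 (pcmpeqb + pmovmskb).
   block must be 16-byte aligned. Hypothetical helper for this sketch. */
static unsigned zero_mask16(const unsigned char *block)
{
#if defined(__SSE2__)
    __m128i v = _mm_load_si128((const __m128i *)block);   /* aligned load */
    return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
#else
    unsigned m = 0;
    for (int i = 0; i < 16; i++)
        m |= (unsigned)(block[i] == 0) << i;
    return m;
#endif
}

/* strlen that only ever reads whole aligned 16-byte blocks. */
static size_t strlen_aligned_sketch(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    unsigned mis = (unsigned)(addr & 15);            /* misalignment, 0..15 */
    const unsigned char *block = (const unsigned char *)(addr - mis);

    /* Null bytes in front of the string set low mask bits: shift them out. */
    unsigned mask = zero_mask16(block) >> mis;       /* shr eax, cl */
    if (mask)
        return (size_t)__builtin_ctz(mask);          /* bsf: index from s */

    for (;;) {
        block += 16;
        mask = zero_mask16(block);
        if (mask)
            return (size_t)(block - (const unsigned char *)s)
                 + (size_t)__builtin_ctz(mask);
    }
}
```

As in the asm, the first load may cover bytes that lie before the start of the string. Strict C does not allow that, so treat this as a picture of the technique rather than portable library code.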

Antariy · August 25, 2010, 09:40:34 PM

Quote from: jj2007 on August 25, 2010, 09:32:30 PM
Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:

This doesn't mean much: 6 clocks. Without preserving ecx and the xmm register, this proc would be 66 bytes long and run in 56 clocks (with the end-of-buffer check still in place).

BSF is a (relatively) very slow instruction, but on my CPU the overhead of the check, the branch and the exit gives the same speed, only with bigger code.

Alex

dedndave · August 25, 2010, 11:42:15 PM

BSF is a little slow - and clumsy to use, too
but - i have tried the alternatives without much luck
i am using a P4 - it's a slow instruction
i think it is faster on newer CPUs, Alex
even though that doesn't help us much, it makes the code look good in the forum tests :bg
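
The post does not say which alternatives were tried. One widely published software replacement for bsf is the de Bruijn multiply-and-lookup trick, sketched here in C. The name `bsf_debruijn` is made up for this example, and no claim is made that it beats bsf on any CPU in this thread; it only shows what "an alternative" can look like.

```c
#include <stdint.h>

/* Lookup table that belongs to the de Bruijn constant 0x077CB531. */
static const unsigned char debruijn_pos[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Index of the lowest set bit of v, like bsf. v must be non-zero. */
static unsigned bsf_debruijn(uint32_t v)
{
    uint32_t lowest = v & (0u - v);     /* isolate the lowest set bit */
    /* Multiplying by the constant puts a unique 5-bit pattern in the top bits. */
    return debruijn_pos[(uint32_t)(lowest * 0x077CB531u) >> 27];
}
```

It costs a negate, an and, a multiply, a shift and a table load, so whether it wins depends entirely on how slow bsf is on the target CPU.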

lingo · August 26, 2010, 08:03:48 PM

"BSF is a little slow - and clumsy to use"

With new technologies we don't need BSF :lol...just take a look:
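align 16
Newstrlen_sse4_2:
movdqa xmm7, notnul            ; byte ranges for the compare (notnul is not shown; presumably 01h..0FFh)
pop ecx                        ; return address (ecx is overwritten by pcmpistri below)
pop eax                        ; pointer to the string
mov edx, eax                   ; keep the start for the length calculation
@@:
pcmpistri xmm7, [eax], 14h     ; ecx = index of first byte outside the ranges, ZF = 1 if these 16 bytes hold a null
lea   eax, [eax+16]            ; advance without touching the flags
jnz   @b                       ; no null in this block: next 16 bytes
sub    eax, edx
lea   eax, [eax+ecx-16]        ; length = block offset + index
jmp    dword ptr [esp-2*4]     ; return through the address left below esp by the first pop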

Hi, unhappy losers...slow.slow..again and again... :lol

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
40      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
12      cycles for StrLenLingo

111     cycles for szLen
49      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
8       cycles for AxStrLenSSE1
5       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
40      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
2       cycles for StrLenLingo

7       cycles for szLen
8       cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
42      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo


--- ok ---

jj2007 · August 26, 2010, 09:22:49 PM

Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown
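
The post does not spell out the mechanism, but the usual reason is that a 16-byte read starting at an arbitrary address can run past the end of the last mapped page, while a 16-byte read from a 16-byte aligned address never crosses a page boundary. A minimal C sketch of that check, assuming 4096-byte pages; `PAGE_SIZE_ASSUMED` and `read_crosses_page` are hypothetical names used only here.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KB pages */

/* Returns 1 when reading nbytes (>= 1) starting at addr touches two pages. */
static int read_crosses_page(uintptr_t addr, size_t nbytes)
{
    uintptr_t first_page = addr / PAGE_SIZE_ASSUMED;
    uintptr_t last_page  = (addr + nbytes - 1) / PAGE_SIZE_ASSUMED;
    return first_page != last_page;
}
```

A 5 byte string that starts 5 bytes before a page boundary fits in its page, yet an unaligned 16-byte read from its start also touches the next page, which may not be mapped.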

Antariy · August 26, 2010, 09:53:11 PM

Hi, stupid lingo!

It is no wonder that your algos are fast: you test them in the best cases.
For example, part of Paul Dixon's algo, stolen and "optimized" by you:

Quote

.data
align 16
Src dd 100 Dup(0)
Num dd 0FFFFFFFFh,0 <-------------------------- THIS!!!
vaPtr dd 0
align 16
chartabL dw "00","10","20","30","40","50","60","70","80","90"
dw "01","11","21","31","41","51","61","71","81","91"
dw "02","12","22","32","42","52","62","72","82","92"
dw "03","13","23","33","43","53","63","73","83","93"
dw "04","14","24","34","44","54","64","74","84","94"
dw "05","15","25","35","45","55","65","75","85","95"
dw "06","16","26","36","46","56","66","76","86","96"
dw "07","17","27","37","47","57","67","77","87","97"
dw "08","18","28","38","48","58","68","78","88","98"
dw "09","19","29","39","49","59","69","79","89","99"

Very nice, you test with the best case: only the sign "-" and a "1". The result will be "-1", the fastest conversion there is. Your stupid proc is not faster than Hutch's with 2^31 :bdg on a CPU that is NOT yours, of course :toothy

If you are such a lamer that you cannot make something good without "new technologies", then no comment. The posted code cannot be compared with the other code, because it DOESN'T do all the things that are needed for a fair comparison. IF you are not able to write a proc with the SAME characteristics, shut up. You were very dissatisfied when drizz's algo "without functionality" beat your stupid simple unrolled algo. But you are satisfied with YOUR OWN stupid algo without functionality. This is strange; no, this is funny, because you are a lamer.

Alex

lingo · August 26, 2010, 10:00:00 PM

Ugly spaghetti+lamer tubeteikin's code = slow...slow..slow...I don't give a damn. :lol

Antariy · August 26, 2010, 10:01:44 PM

Quote from: jj2007 on August 26, 2010, 09:22:49 PM
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown

No, lingo is a ЧМО !!!
This is a nice Russian term for people like lingo :bdg

Alex
P.S. Lingo, show us any normal app of yours, or can you only ~~improve~~ and ~~optimize~~ existing algos of other people?
By a normal app I mean any app of yours that runs for 5 minutes without crashing, and that can do something other than just print your clocks :green2
