StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

Antariy

For Jochen's 80d:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3

------- timings, misaligned, 100 byte string -------
67      cycles for AxStrLenSSE1
75      cycles for AxJJStrLen1
81      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

66      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

67      cycles for AxStrLenSSE1
73      cycles for AxJJStrLen1
73      cycles for AxJJStrLen2
70      cycles for AxJJStrLen3

66      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

------- timings, misaligned, 5 byte string -------
20      cycles for szLen
25      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3

19      cycles for szLen
27      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3

19      cycles for szLen
25      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3

19      cycles for szLen
26      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3


--- ok ---




Alex

Antariy

With Lingo's late fix for short-string support (Microsoft's recommended practice: don't make the code good and reliable in the first release, but ship patches and SPs a few days after releasing the initial version).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
66      cycles for AxStrLenSSE1
73      cycles for AxJJStrLen1
84      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
57      cycles for StrLenLingo

67      cycles for AxStrLenSSE1
77      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
74      cycles for StrLenLingo

69      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen1
70      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
56      cycles for StrLenLingo

66      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen1
70      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
55      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
19      cycles for szLen
25      cycles for AxStrLenSSE1
26      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3
14      cycles for StrLenLingo

20      cycles for szLen
27      cycles for AxStrLenSSE1
26      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
15      cycles for StrLenLingo

19      cycles for szLen
25      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
14      cycles for StrLenLingo

19      cycles for szLen
26      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
28      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
14      cycles for StrLenLingo


--- ok ---



Still big timings on CPUs that are not HIS...  :green2
Considering that the proc should be twice as fast (because it uses two regs)...  :green2



Alex

jj2007

Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:

Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?

Antariy

Quote from: jj2007 on August 23, 2010, 11:28:45 PM
Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:

Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?


This is a Celeron D, Prescott (310)



Alex
P.S. What timings does my latest code (93xxx.zip) get?

Antariy

This is a long post, and it is written in poor English (my English is about as good as Lingo's Russian), but PLEASE read it through, think about it, and tell me whether I am right or not.

Lingo's
Quote
I can't thieve nothing from you and from the asian lamer just because you have no ideas in assembly , hence your code will be very ugly and slow always...

Really? How about this:

Lingo's "abs" function:

        pop ecx
        pop eax
        cdq
        xor eax,edx <---
        sub eax,edx <---
        jmp ecx



My Axa2l function, which uses a unified approach to converting a signed/unsigned string to long/dword:


    @done:       


    add eax,edi <---
    xor eax,edi <---

    pop edi

    ret


This is the same code, so Lingo stole it. I admit that this is not a hard algo. But he writes very simple algos too, and he HAS NO RIGHT to say that somebody stole something from him.
But Lingo, like a stupid new-fangled programmer, doesn't know that SUB has been slower for technical (electrical) reasons since the first computers (BIG banks of registers built from valves or relays). It seems Lingo stole his Electrical Engineer degree certificate somewhere.
Nowadays, and for a long time already, SUB runs at almost the same speed as ADD, and the difference is not measurable. Still, it is good practice to use ADD instead of SUB where you can. Lingo doesn't know this practice, so he is not a good programmer; he is a lamer, because only lamers think they are Grandmasters. Sensible people know their own drawbacks and don't claim to be "the best".
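
For readers comparing the two marked instruction pairs: both are a conditional negate driven by a mask that is either 0 or all ones. The C sketch below models that, assuming (this is not stated in the snippets) that edx after cdq and edi in Axa2l each hold such a mask. The function names are made up for illustration.

```c
#include <stdint.h>

/* mask must be 0 (keep v) or 0xFFFFFFFF (negate v). */

/* xor then sub: the order used in the "abs" snippet. */
uint32_t cond_negate_xor_sub(uint32_t v, uint32_t mask) {
    return (v ^ mask) - mask;
}

/* add then xor: the order used at the end of the Axa2l snippet. */
uint32_t cond_negate_add_xor(uint32_t v, uint32_t mask) {
    return (v + mask) ^ mask;
}

/* abs: the mask is the sign of x itself, which is what cdq puts in edx. */
uint32_t abs_via_mask(int32_t x) {
    uint32_t mask = (x < 0) ? 0xFFFFFFFFu : 0u;
    return cond_negate_xor_sub((uint32_t)x, mask);
}
```

With mask = 0 both orders return v unchanged; with mask = 0xFFFFFFFF both return the two's complement negation of v, so the results are identical and only the instruction order differs.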

Another argument: many people have talked to him about returning from a proc via "jmp ...". This is the stupid practice of a lamer programmer.
For example, I am not the only one who has told him this; dioxin did too:
Quote
You aren't really advocating that, for speed, you should pop a return address off the stack and jump to it, are you? That messes up the branch prediction mechanism which has short cuts for a paired CALL-RETURN.
This is good advice, and perfectly right.
But Lingo doesn't listen to good advice; he thinks he is the Grandmaster, without compromise.
So this is HIS PROBLEM ONLY. But I don't understand why an Internet connection is available in the Toronto mental hospital where Lingo is located :P



Hutch, why do you allow such a stupid lamer as Lingo on your forum? He only discredits this nice forum. The forum has people (I don't mean myself, I mean many, many others) whose experience is incomparably greater than his, yet he calls them "stupid", "thieves", "tolerated", etc.


None of his techniques contain ANY incredible tricks or ideas. All his techniques have been KNOWN for a long time.
For example, maybe somebody doesn't know why Lingo uses this code in his SSE hex2dword:

pshufw mm0,[eax], 01Bh <--- this...
pxor mm2, mm2
pshufw mm1,mm0, 0E4h <--- and this
paddb mm0, maskD0h


First, pshufw may be faster than movq, and in this case the instruction also reverses the word order (1Bh).
Second, copying one MMx reg to another is faster using pshufw with the direct-copy encoding (0E4h). On the P6 family, for example, "pshufw MMx,MMx" is 3(!) times faster than "movq MMx,MMx".
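
To see why 1Bh reverses the words and 0E4h is a plain copy, here is a small software model of the pshufw word selection in C. It is only a sketch: pshufw_model is a made-up name, and it models the data movement only, not the speed.

```c
#include <stdint.h>

/* Model of PSHUFW: destination word i is source word number
   ((imm8 >> (2*i)) & 3). Word 0 is the least significant 16 bits. */
uint64_t pshufw_model(uint64_t src, uint8_t imm8) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;           /* which source word */
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;   /* extract it */
        dst |= word << (16 * i);                         /* place it in slot i */
    }
    return dst;
}
```

0E4h is 11 10 01 00 in binary, so slot i takes word i (a copy). 1Bh is 00 01 10 11, so slot 0 takes word 3, slot 1 takes word 2, and so on (a reversal).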


So if anybody takes a close look at Lingo's code, it will NOT appear to be anything great, unknown or incredible.
His coding style is stale (the smell of a depraved youth) and contains nothing to steal, because Lingo did not invent these techniques.



Lingo, can you file an application for your *unique algo* at any respectable patent office? Or would you be ridiculed there? The latter is more likely, sorry. So don't tell us how great you are and what "unique" algos you produce.


He swaggers about his fast "great" code only because he has a fast CPU, since Core and later CPUs are tolerant of some of his lamer techniques (like returning via jump, and many others).
As everybody saw, his code does not show outstanding results on non-high-end CPUs.

Throughout human history, what has been great and respectable is making something good out of something bad or cheap (like Ford for the US, or Diesel for Europe and the world).
There is nothing great in making something good out of something nice.
So, can ANYBODY say that he is a great programmer? If yes, there is some similarity to Lingo in that.


So Lingo is an ORDINARY man (who can read Intel's manuals), not the God of assembly. Why does he take liberties (insults etc.) that even Administrators and Moderators don't take? Who is he to do this?


Further. In view of responses from other people, perhaps with a "thinking engine" similar to Lingo's, I will say why my code is the way it is. The answer is short: it works NO worse than Lingo's "great, nice, fast, unreliable" code, so what is there to talk about? Nowadays I write only for myself and my work, I write for CPUs of one type, and I have no need to write other code.
But this does not mean that I am an "asian lamer" and Lingo is greater. Lingo also writes only for his own CPU, but his code works badly on other CPUs and mine does not. If I don't use some optimization techniques, it is because they have NOTHING useful for me at this time.
I welcome reports about poor performance on other CPU architectures, and I try to make it better.
Lingo doesn't listen to any reports (if they are not good for him) or to any comments. If his code works badly, he says it is because the test computer is old and slow and its owner is a lamer. Maybe Lingo is a madman?



Insults and the like are NOT interesting to me and have NO meaning to me. I wrote this post only to open the eyes of people who think that Lingo is an unbeatable champion.

And Lingo's insults mean nothing to me, because I treat them as the talk of a madman.
If somebody thinks I wrote this post to show how nice and great I am, they are mistaken.
I wrote this post because Lingo's behaviour towards other members is inexcusable.
I don't like offensive speech and never use it with people, but I make one big exception for Lingo.
My apologies to everyone else!



Alex

dedndave

Alex - don't let lingo get on your nerves
most of us just ignore him   :lol

Antariy

Quote from: dedndave on August 24, 2010, 08:54:21 PM
Alex - don't let lingo get on your nerves
most of us just ignore him   :lol

I'm a balanced man. Neither Lingo nor 1,000,000 Lingo clones can wind me up. I worry about the forum and its members,
because it is not right to call members offensive names without any cause on their side.



Alex

dedndave

it's fun to pick on lingo
he gets upset so easily
i can visualize the blood vessels popping out on his neck - lol

jj2007

Hi Alex,
Dave is right - don't let Rumpelstilzchen spoil your peaceful coding nights. He is good at writing highly specialised (and highly unrolled) algos, although it's always a pain to find out under which conditions they raise exceptions :green2

Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
85      code size for AxStrLenSSE1a     88      total bytes for AxStrLenSSE1a
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
------- timings -------
132     cycles for szLen
33      cycles for AxStrLenSSE1a
33      cycles for AxStrLenSSE1J
31      cycles for AxJJStrLen3

Antariy

Quote from: jj2007 on August 24, 2010, 09:36:11 PM

Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:


Jochen, I think you should not use a global variable, because it makes the proc non-reentrant => it doesn't support multi-threaded applications. 1-2 clocks don't mean anything, I think.
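
The reentrancy point can be shown with a toy C model. This is not the AxJJStrLen3 code, only an illustration under the assumption that the proc parks working state in one fixed memory location. Both function names and g_scratch are invented for the example.

```c
#include <stddef.h>

/* One slot shared by every caller and every thread. */
static size_t g_scratch;

/* Not reentrant: if a second thread enters while the first is still
   looping, both update the same g_scratch and the results are garbage. */
size_t len_with_global_scratch(const char *s) {
    g_scratch = 0;
    while (s[g_scratch] != '\0')
        g_scratch++;
    return g_scratch;
}

/* Reentrant: the working state lives on the caller's own stack,
   so each thread (and each nested call) has a private copy. */
size_t len_with_local_scratch(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Single-threaded, both return the same length; the difference only shows up when two threads call the first version at the same time. The price of the second version is the stack access, which is the 1-2 clocks mentioned above.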



Alex

Antariy

I am still confident that Rumpelstilzchen cannot make anything really great and unique. Except, maybe, a GREAT speech about his "GREAT" code :green2



Alex

jj2007

Quote from: Antariy on August 24, 2010, 09:42:40 PM
Jochen, I think you should not use a global variable, because it makes the proc non-reentrant => it doesn't support multi-threaded applications.

That is correct, thanks for reminding me. Here are multi-thread and SSE2-safe variants 4+5:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5

------- timings, misaligned, 100 byte string -------
34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
34      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

------- timings, misaligned, 5 byte string -------
12      cycles for szLen
15      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5

12      cycles for szLen
12      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5

12      cycles for szLen
16      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5

lingo

"The thieves are always liars"="Воры всегда лжецы"
Ugly spaghetti+lamer tubeteikin's code = slow...slow...slow... I don't give a damn... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

21      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
8       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen3
5       cycles for AxJJStrLen4
5       cycles for AxJJStrLen5
1       cycles for StrLenLingo

15      cycles for AxStrLenSSE1
4       cycles for AxJJStrLen3
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

8       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen3
5       cycles for AxJJStrLen4
5       cycles for AxJJStrLen5
1       cycles for StrLenLingo

15      cycles for AxStrLenSSE1
4       cycles for AxJJStrLen3
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
2       cycles for StrLenLingo


--- ok ---

jj2007

WARNING: The algo "StrLenLingo" posted above by Lingo raises an exception when used with short strings that happen to start near the end of a buffer allocated with VirtualAlloc (it will also happen with HeapAlloc buffers, but only in rare cases - the typical "impossible to chase bug").
Unlike all other algos, it does not preserve ecx and xmm0.
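
A rough way to see when such a fault can occur, assuming (the thread does not show the code) that it comes from a 16-byte read starting at the string address and that pages are 4096 bytes. The helper names and the PAGE_BYTES constant are invented for this sketch.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_BYTES 4096u   /* assumption: 4 KB pages */

/* Returns 1 if reading `width` bytes starting at `addr` touches the
   following page, which may be uncommitted after a VirtualAlloc buffer. */
int read_crosses_page(uintptr_t addr, size_t width) {
    uintptr_t offset = addr & (PAGE_BYTES - 1u);
    return offset + width > PAGE_BYTES;
}

/* Rounding the address down to a multiple of 16 gives a 16-byte block
   that always lies inside a single page. */
uintptr_t align_down_16(uintptr_t addr) {
    return addr & ~(uintptr_t)15;
}
```

A 5-byte string at offset 0FFBh of the last page fits in the buffer, but a 16-byte read from that address runs 11 bytes into the next page. A read from the rounded-down address stays inside the page; the bytes in front of the string then have to be masked out of the compare result.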

On the positive side: It is a tick faster than the others on certain CPUs. Bravo, Lingo :cheekygreen:

lingo

"Thieves are always liars"

"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."