Hi folks,
Could I please have some timings on non-Celerons?
Thanks, jj
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
29 cycles for MbStrLen1
34 cycles for MbStrLen2
34 cycles for MbStrLen3
31 cycles for MbStrLen4a
35 cycles for MbStrLen4b
38 cycles for MbStrLen5
AMD Athlon(tm) 64 Processor 3000+ (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
47 cycles for MbStrLen1
48 cycles for MbStrLen2
51 cycles for MbStrLen3
47 cycles for MbStrLen4a
51 cycles for MbStrLen4b
55 cycles for MbStrLen5
47 cycles for MbStrLen1
54 cycles for MbStrLen2
54 cycles for MbStrLen3
52 cycles for MbStrLen4a
58 cycles for MbStrLen4b
53 cycles for MbStrLen5
47 cycles for MbStrLen1
48 cycles for MbStrLen2
52 cycles for MbStrLen3
47 cycles for MbStrLen4a
50 cycles for MbStrLen4b
54 cycles for MbStrLen5
48 cycles for MbStrLen1
54 cycles for MbStrLen2
54 cycles for MbStrLen3
52 cycles for MbStrLen4a
57 cycles for MbStrLen4b
53 cycles for MbStrLen5
48 cycles for MbStrLen1
48 cycles for MbStrLen2
50 cycles for MbStrLen3
47 cycles for MbStrLen4a
52 cycles for MbStrLen4b
55 cycles for MbStrLen5
--- ok ---
P3:
pre-P4 (SSE1)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
45 cycles for MbStrLen1
51 cycles for MbStrLen2
46 cycles for MbStrLen3
51 cycles for MbStrLen4a
47 cycles for MbStrLen4b
59 cycles for MbStrLen5
45 cycles for MbStrLen1
62 cycles for MbStrLen2
46 cycles for MbStrLen3
50 cycles for MbStrLen4a
47 cycles for MbStrLen4b
56 cycles for MbStrLen5
46 cycles for MbStrLen1
51 cycles for MbStrLen2
46 cycles for MbStrLen3
50 cycles for MbStrLen4a
47 cycles for MbStrLen4b
56 cycles for MbStrLen5
45 cycles for MbStrLen1
51 cycles for MbStrLen2
46 cycles for MbStrLen3
51 cycles for MbStrLen4a
47 cycles for MbStrLen4b
55 cycles for MbStrLen5
45 cycles for MbStrLen1
51 cycles for MbStrLen2
46 cycles for MbStrLen3
50 cycles for MbStrLen4a
47 cycles for MbStrLen4b
55 cycles for MbStrLen5
Thanks. For the curious: I am testing the Intel recommendation for movxxx xmm, mem:
Quote from: Intel (http://software.intel.com/en-us/articles/memcpy-performance/), generic optimization of memcpy():
movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them. The Barcelona architecture prefers movaps for stores. movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding.
if 1 ; 4a
movlps qword ptr [esp], xmm0
movhps qword ptr [esp+8], xmm0
else ; 4b
movdqu [esp], xmm0
endif
...
if 1
movlps xmm0, qword ptr [esp]
movhps xmm0, qword ptr [esp+8]
else
movups xmm0, [esp]
endif
At least for the Celeron and E^cube's AMD, this does not seem to hold: the partial movlps/movhps moves (4a) are faster than the single movdqu/movups (4b).
(obviously the code does other things, too - the purpose is to efficiently preserve the xmm0 register in a bread-and-butter stringlen algo)
Off topic, but MichaelW, my CPU is 10+ years old now, I believe, so yours must be ancient. I'm just curious: is that your main one? Also, jj2007, I'm not sure what your plans are, but feel free to take notes on the optimization techniques you discover :U While a lot of stuff is floating around this board, I know people enjoy a single place to read up on such things.
I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...
Quote from: MichaelW on August 15, 2010, 10:04:47 PM
I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...
Heh, wow, what OS? I can't imagine that thing being able to handle Vista; it's a resource pig. I'd be surprised if you said Windows 2k. I myself wanted to stick with it but was forced to upgrade because so much software is XP+ only.
JJ,
Here is my P4:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
38 cycles for MbStrLen1
41 cycles for MbStrLen2
46 cycles for MbStrLen3
37 cycles for MbStrLen4a
45 cycles for MbStrLen4b
41 cycles for MbStrLen5
35 cycles for MbStrLen1
40 cycles for MbStrLen2
41 cycles for MbStrLen3
37 cycles for MbStrLen4a
51 cycles for MbStrLen4b
40 cycles for MbStrLen5
34 cycles for MbStrLen1
45 cycles for MbStrLen2
49 cycles for MbStrLen3
36 cycles for MbStrLen4a
51 cycles for MbStrLen4b
40 cycles for MbStrLen5
33 cycles for MbStrLen1
39 cycles for MbStrLen2
40 cycles for MbStrLen3
39 cycles for MbStrLen4a
43 cycles for MbStrLen4b
47 cycles for MbStrLen5
34 cycles for MbStrLen1
40 cycles for MbStrLen2
65 cycles for MbStrLen3
37 cycles for MbStrLen4a
40 cycles for MbStrLen4b
40 cycles for MbStrLen5
--- ok ---
JJ,
Here are my AMD timings:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
68 cycles for MbStrLen1
47 cycles for MbStrLen2
66 cycles for MbStrLen3
47 cycles for MbStrLen4a
58 cycles for MbStrLen4b
56 cycles for MbStrLen5
47 cycles for MbStrLen1
37 cycles for MbStrLen2
56 cycles for MbStrLen3
57 cycles for MbStrLen4a
43 cycles for MbStrLen4b
84 cycles for MbStrLen5
47 cycles for MbStrLen1
53 cycles for MbStrLen2
73 cycles for MbStrLen3
47 cycles for MbStrLen4a
52 cycles for MbStrLen4b
70 cycles for MbStrLen5
36 cycles for MbStrLen1
57 cycles for MbStrLen2
54 cycles for MbStrLen3
51 cycles for MbStrLen4a
58 cycles for MbStrLen4b
58 cycles for MbStrLen5
63 cycles for MbStrLen1
47 cycles for MbStrLen2
51 cycles for MbStrLen3
51 cycles for MbStrLen4a
56 cycles for MbStrLen4b
55 cycles for MbStrLen5
--- ok ---
AMD Phenom(tm) II X6 1055T Processo
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
31 cycles for MbStrLen1
34 cycles for MbStrLen2
33 cycles for MbStrLen3
35 cycles for MbStrLen4a
35 cycles for MbStrLen4b
40 cycles for MbStrLen5
31 cycles for MbStrLen1
34 cycles for MbStrLen2
36 cycles for MbStrLen3
35 cycles for MbStrLen4a
35 cycles for MbStrLen4b
40 cycles for MbStrLen5
31 cycles for MbStrLen1
34 cycles for MbStrLen2
33 cycles for MbStrLen3
35 cycles for MbStrLen4a
35 cycles for MbStrLen4b
39 cycles for MbStrLen5
31 cycles for MbStrLen1
37 cycles for MbStrLen2
33 cycles for MbStrLen3
35 cycles for MbStrLen4a
35 cycles for MbStrLen4b
39 cycles for MbStrLen5
31 cycles for MbStrLen1
34 cycles for MbStrLen2
33 cycles for MbStrLen3
35 cycles for MbStrLen4a
35 cycles for MbStrLen4b
40 cycles for MbStrLen5
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
16 cycles for MbStrLen1
21 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
17 cycles for MbStrLen1
26 cycles for MbStrLen2
29 cycles for MbStrLen3
23 cycles for MbStrLen4a
32 cycles for MbStrLen4b
24 cycles for MbStrLen5
16 cycles for MbStrLen1
23 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
17 cycles for MbStrLen1
26 cycles for MbStrLen2
29 cycles for MbStrLen3
23 cycles for MbStrLen4a
32 cycles for MbStrLen4b
24 cycles for MbStrLen5
16 cycles for MbStrLen1
21 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
--- ok ---
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
25 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
25 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
--- ok ---
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
25 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
26 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
--- ok ---
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
65 cycles for MbStrLen1
66 cycles for MbStrLen2
83 cycles for MbStrLen3
67 cycles for MbStrLen4a
66 cycles for MbStrLen4b
77 cycles for MbStrLen5
64 cycles for MbStrLen1
68 cycles for MbStrLen2
71 cycles for MbStrLen3
66 cycles for MbStrLen4a
66 cycles for MbStrLen4b
73 cycles for MbStrLen5
66 cycles for MbStrLen1
66 cycles for MbStrLen2
72 cycles for MbStrLen3
67 cycles for MbStrLen4a
66 cycles for MbStrLen4b
74 cycles for MbStrLen5
72 cycles for MbStrLen1
66 cycles for MbStrLen2
81 cycles for MbStrLen3
66 cycles for MbStrLen4a
74 cycles for MbStrLen4b
86 cycles for MbStrLen5
64 cycles for MbStrLen1
66 cycles for MbStrLen2
74 cycles for MbStrLen3
66 cycles for MbStrLen4a
66 cycles for MbStrLen4b
79 cycles for MbStrLen5
Thanks to all of you, that should be enough info :U
Note: the results show an "ERROR" marker, but this is not a real error. I set the string to repeat 20 times, so 100 bytes * 20 = 2000 bytes total length. That length is what appears on the line with the word "ERROR", so the value is correct.
I also tested shorter strings, but I was too lazy to change the CodeSize macro each time. So I will state the test string size for each run (for comparison with the "error" line :)
Unaligned StrLen, 2000 bytes:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
2000 - ERROR
84 bytes for MbStrLen2
2000 - ERROR
73 bytes for MbStrLen3
2000 - ERROR
80 bytes for MbStrLen4a
2000 - ERROR
71 bytes for MbStrLen4b
2000 - ERROR
78 bytes for MbStrLen5
2000 - ERROR
147 bytes for AxStrLen
2000 - ERROR
920 cycles for MbStrLen1
915 cycles for MbStrLen2
914 cycles for MbStrLen3
912 cycles for MbStrLen4a
916 cycles for MbStrLen4b
918 cycles for MbStrLen5
908 cycles for AxStrLen
909 cycles for MbStrLen1
912 cycles for MbStrLen2
913 cycles for MbStrLen3
912 cycles for MbStrLen4a
916 cycles for MbStrLen4b
918 cycles for MbStrLen5
908 cycles for AxStrLen
905 cycles for MbStrLen1
920 cycles for MbStrLen2
913 cycles for MbStrLen3
913 cycles for MbStrLen4a
915 cycles for MbStrLen4b
918 cycles for MbStrLen5
908 cycles for AxStrLen
905 cycles for MbStrLen1
912 cycles for MbStrLen2
913 cycles for MbStrLen3
913 cycles for MbStrLen4a
916 cycles for MbStrLen4b
926 cycles for MbStrLen5
907 cycles for AxStrLen
920 cycles for MbStrLen1
916 cycles for MbStrLen2
915 cycles for MbStrLen3
913 cycles for MbStrLen4a
916 cycles for MbStrLen4b
919 cycles for MbStrLen5
907 cycles for AxStrLen
--- ok ---
Aligned (16 bytes) StrLen, 2000 bytes:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
2000 - ERROR
84 bytes for MbStrLen2
2000 - ERROR
73 bytes for MbStrLen3
2000 - ERROR
80 bytes for MbStrLen4a
2000 - ERROR
71 bytes for MbStrLen4b
2000 - ERROR
78 bytes for MbStrLen5
2000 - ERROR
147 bytes for AxStrLen
2000 - ERROR
903 cycles for MbStrLen1
909 cycles for MbStrLen2
909 cycles for MbStrLen3
906 cycles for MbStrLen4a
915 cycles for MbStrLen4b
915 cycles for MbStrLen5
903 cycles for AxStrLen
903 cycles for MbStrLen1
907 cycles for MbStrLen2
911 cycles for MbStrLen3
905 cycles for MbStrLen4a
914 cycles for MbStrLen4b
913 cycles for MbStrLen5
904 cycles for AxStrLen
902 cycles for MbStrLen1
907 cycles for MbStrLen2
910 cycles for MbStrLen3
905 cycles for MbStrLen4a
914 cycles for MbStrLen4b
913 cycles for MbStrLen5
905 cycles for AxStrLen
902 cycles for MbStrLen1
908 cycles for MbStrLen2
909 cycles for MbStrLen3
906 cycles for MbStrLen4a
913 cycles for MbStrLen4b
914 cycles for MbStrLen5
904 cycles for AxStrLen
903 cycles for MbStrLen1
907 cycles for MbStrLen2
909 cycles for MbStrLen3
905 cycles for MbStrLen4a
913 cycles for MbStrLen4b
914 cycles for MbStrLen5
903 cycles for AxStrLen
--- ok ---
Unaligned, 100 bytes:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
100 - ERROR
84 bytes for MbStrLen2
100 - ERROR
73 bytes for MbStrLen3
100 - ERROR
80 bytes for MbStrLen4a
100 - ERROR
71 bytes for MbStrLen4b
100 - ERROR
78 bytes for MbStrLen5
100 - ERROR
147 bytes for AxStrLen
100 - ERROR
87 cycles for MbStrLen1
97 cycles for MbStrLen2
90 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
67 cycles for AxStrLen
92 cycles for MbStrLen1
97 cycles for MbStrLen2
90 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
67 cycles for AxStrLen
92 cycles for MbStrLen1
105 cycles for MbStrLen2
81 cycles for MbStrLen3
88 cycles for MbStrLen4a
73 cycles for MbStrLen4b
93 cycles for MbStrLen5
67 cycles for AxStrLen
92 cycles for MbStrLen1
102 cycles for MbStrLen2
90 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
67 cycles for AxStrLen
92 cycles for MbStrLen1
97 cycles for MbStrLen2
89 cycles for MbStrLen3
98 cycles for MbStrLen4a
71 cycles for MbStrLen4b
101 cycles for MbStrLen5
68 cycles for AxStrLen
--- ok ---
Jochen, I really tried to get the best times for all procs. But the results are NOT the same from run to run; they are very messy: on one run one proc has big timings, on the next run another proc does, and so on. I think this mess arises because your procs have 2 loops (one of them internal): when the string is unaligned, the internal loop runs and the timings get bigger. When the string is aligned, it does not run, so the timings are much better.
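For readers following the two-loop argument, here is a minimal C sketch of that structure. It is not the actual MbStrLen code: a byte-wise head loop runs only until the pointer is 16-byte aligned, then a main loop checks one aligned 16-byte block per step. The names two_phase_strlen and head_bytes are made up for illustration, and memchr merely stands in for the SSE compare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes the head loop may have to walk before the pointer
   reaches a 16-byte boundary (0 for an already aligned string). */
size_t head_bytes(const char *s)
{
    return (size_t)((16u - ((uintptr_t)s & 15u)) & 15u);
}

/* Hypothetical two-loop string length, for illustration only. */
size_t two_phase_strlen(const char *s)
{
    const char *p = s;
    assert(s != NULL);

    /* Head loop: byte by byte until p is 16-byte aligned. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: one aligned 16-byte block per iteration.
       memchr is a stand-in for pcmpeqb/pmovmskb here. */
    for (;;) {
        const char *z = memchr(p, '\0', 16);
        if (z != NULL)
            return (size_t)(z - s);
        p += 16;
    }
}
```

For a 16-byte-aligned string head_bytes returns 0 and the head loop is skipped entirely; for a pointer 3 bytes past an aligned address, the head loop has to walk up to 13 bytes before the main loop starts.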
Aligned, 100 bytes:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
100 - ERROR
84 bytes for MbStrLen2
100 - ERROR
73 bytes for MbStrLen3
100 - ERROR
80 bytes for MbStrLen4a
100 - ERROR
71 bytes for MbStrLen4b
100 - ERROR
78 bytes for MbStrLen5
100 - ERROR
147 bytes for AxStrLen
100 - ERROR
73 cycles for MbStrLen1
66 cycles for MbStrLen2
86 cycles for MbStrLen3
65 cycles for MbStrLen4a
65 cycles for MbStrLen4b
96 cycles for MbStrLen5
64 cycles for AxStrLen
64 cycles for MbStrLen1
65 cycles for MbStrLen2
68 cycles for MbStrLen3
65 cycles for MbStrLen4a
65 cycles for MbStrLen4b
96 cycles for MbStrLen5
64 cycles for AxStrLen
84 cycles for MbStrLen1
65 cycles for MbStrLen2
86 cycles for MbStrLen3
65 cycles for MbStrLen4a
65 cycles for MbStrLen4b
96 cycles for MbStrLen5
64 cycles for AxStrLen
65 cycles for MbStrLen1
66 cycles for MbStrLen2
68 cycles for MbStrLen3
65 cycles for MbStrLen4a
65 cycles for MbStrLen4b
96 cycles for MbStrLen5
64 cycles for AxStrLen
83 cycles for MbStrLen1
65 cycles for MbStrLen2
86 cycles for MbStrLen3
65 cycles for MbStrLen4a
65 cycles for MbStrLen4b
96 cycles for MbStrLen5
64 cycles for AxStrLen
--- ok ---
As you see, if the string is aligned, your shortest proc sometimes has timings close to those of my proc (which is 2.53 times as long).
My proc has more stable timings, but that is the "price" of its size.
Next, timings for a 15-byte unaligned string:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
15 - ERROR
84 bytes for MbStrLen2
15 - ERROR
73 bytes for MbStrLen3
15 - ERROR
80 bytes for MbStrLen4a
15 - ERROR
71 bytes for MbStrLen4b
15 - ERROR
78 bytes for MbStrLen5
15 - ERROR
147 bytes for AxStrLen
15 - ERROR
37 cycles for MbStrLen1
36 cycles for MbStrLen2
37 cycles for MbStrLen3
38 cycles for MbStrLen4a
38 cycles for MbStrLen4b
43 cycles for MbStrLen5
21 cycles for AxStrLen
37 cycles for MbStrLen1
36 cycles for MbStrLen2
37 cycles for MbStrLen3
38 cycles for MbStrLen4a
38 cycles for MbStrLen4b
43 cycles for MbStrLen5
21 cycles for AxStrLen
37 cycles for MbStrLen1
36 cycles for MbStrLen2
37 cycles for MbStrLen3
38 cycles for MbStrLen4a
38 cycles for MbStrLen4b
43 cycles for MbStrLen5
20 cycles for AxStrLen
37 cycles for MbStrLen1
36 cycles for MbStrLen2
37 cycles for MbStrLen3
38 cycles for MbStrLen4a
37 cycles for MbStrLen4b
43 cycles for MbStrLen5
21 cycles for AxStrLen
37 cycles for MbStrLen1
36 cycles for MbStrLen2
37 cycles for MbStrLen3
38 cycles for MbStrLen4a
38 cycles for MbStrLen4b
43 cycles for MbStrLen5
21 cycles for AxStrLen
--- ok ---
On short unaligned strings, the drawbacks of the two loops show up more.
But here are the timings for an aligned 15-byte string:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
15 - ERROR
84 bytes for MbStrLen2
15 - ERROR
73 bytes for MbStrLen3
15 - ERROR
80 bytes for MbStrLen4a
15 - ERROR
71 bytes for MbStrLen4b
15 - ERROR
78 bytes for MbStrLen5
15 - ERROR
147 bytes for AxStrLen
15 - ERROR
21 cycles for MbStrLen1
27 cycles for MbStrLen2
29 cycles for MbStrLen3
28 cycles for MbStrLen4a
29 cycles for MbStrLen4b
32 cycles for MbStrLen5
22 cycles for AxStrLen
21 cycles for MbStrLen1
28 cycles for MbStrLen2
28 cycles for MbStrLen3
28 cycles for MbStrLen4a
29 cycles for MbStrLen4b
32 cycles for MbStrLen5
22 cycles for AxStrLen
21 cycles for MbStrLen1
27 cycles for MbStrLen2
29 cycles for MbStrLen3
27 cycles for MbStrLen4a
29 cycles for MbStrLen4b
32 cycles for MbStrLen5
22 cycles for AxStrLen
20 cycles for MbStrLen1
27 cycles for MbStrLen2
29 cycles for MbStrLen3
28 cycles for MbStrLen4a
29 cycles for MbStrLen4b
32 cycles for MbStrLen5
22 cycles for AxStrLen
21 cycles for MbStrLen1
27 cycles for MbStrLen2
29 cycles for MbStrLen3
28 cycles for MbStrLen4a
29 cycles for MbStrLen4b
32 cycles for MbStrLen5
22 cycles for AxStrLen
--- ok ---
All the advantages of short code show up on short aligned strings. The short code is fastest in this test.
So, what does the analysis of the tests give? Jochen, I think you need another solution for the case of unaligned strings, because for short strings the drawbacks of the loops are very big (as you see, the advantage of shorter code shows up only on very large strings (2000 bytes in the test)).
My variant is also not very good, but I wrote it only for testing short versus long algos.
A big request to everybody: please test this too. It is interesting to see how different CPUs handle the inner loops.
If you have time, run different tests: for different string lengths and aligned/unaligned variants. If you have no time, the archive contains the sources and a compiled exe that tests an unaligned 100-byte string.
Alex
P.S. The sources cannot be assembled with ML 6.15 and earlier. I assemble them with ML8 (SSE2 movsd mnemonic).
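Since the numbers move around between runs, one common way to compare such dumps is to keep the lowest cycle count seen for each routine. A minimal C sketch, assuming the line format "<n> cycles for <name>" used in the results here; record_line, struct best and the size limits are invented for this example and are not part of the test-bed:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ROUTINES 16   /* assumed upper bound on distinct routines */
#define NAME_LEN     32   /* assumed maximum routine name length + 1 */

struct best {
    char name[NAME_LEN];
    int  min_cycles;
};

/* Feed one line of a results dump. Returns 1 if the line was a
   "<n> cycles for <name>" line, 0 for anything else (size lines,
   "ERROR" lines, "--- ok ---", CPU names). */
int record_line(struct best *tab, int *count, const char *line)
{
    int  cycles;
    char name[NAME_LEN];

    assert(tab != NULL && count != NULL && line != NULL);
    if (sscanf(line, "%d cycles for %31s", &cycles, name) != 2)
        return 0;

    /* Known routine: keep the smaller value. */
    for (int i = 0; i < *count; i++) {
        if (strcmp(tab[i].name, name) == 0) {
            if (cycles < tab[i].min_cycles)
                tab[i].min_cycles = cycles;
            return 1;
        }
    }

    /* New routine: append if there is room. */
    if (*count < MAX_ROUTINES) {
        strcpy(tab[*count].name, name);
        tab[*count].min_cycles = cycles;
        (*count)++;
    }
    return 1;
}
```

Taking the minimum is only one convention; a median would be less sensitive to a single lucky run.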
Hi Alex,
If you want to get rid of the ERROR, use
movups xmm0, oword ptr Src
.if eax!=sizeof Src-1
in the CodeSize macro.
It is true that all versions are slower for unaligned strings, but still a factor three faster than the Masm32 len() function:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
132 cycles for szLen
46 cycles for MbStrLen1
47 cycles for MbStrLen2
50 cycles for MbStrLen3
48 cycles for MbStrLen4a
53 cycles for MbStrLen4b
51 cycles for MbStrLen5
And why I preserve xmm0? Precisely to encourage people to use SSE... MasmBasic is intended to be fast and noob-friendly. The next version will be "SSE2-safe", and "SSE2-friendly", too. Try doing a Print Str$(xmm1) in ordinary assembler :wink
Hi, Jochen!
What timings do you get?
I am fairly sure that on CPUs with a big cache, unaligned access may be faster.
Alex
Jochen, did you see the results of your hex2dw on Dave's CPU?
This is incredible! What does ONE architecture (NetBurst) mean if EVERY CPU in the forum has different timings? Intel should write an "appendix" for each CPU :)
For info:
14 cycles for Lingo's SSE version
13 cycles for Lingo's BIG integer version
5 cycles for Jochen's WORD-Indexed version
15 cycles for Dave's version (with minor changes)
This is fantastic!
Alex
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.
Alex
Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.
noob = newbie, beginner.
Timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
58 bytes for MbStrLen1
80 bytes for MbStrLen4a
131 bytes for AxStrLen
Src: 100 bytes
46 cycles for MbStrLen1 (xmm0 not preserved)
48 cycles for MbStrLen4a
36 cycles for AxStrLen
Your algo is really fast, compliments :U
Quote from: jj2007 on August 16, 2010, 09:09:52 PM
Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.
noob = newbie, beginner.
Timings:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
58 bytes for MbStrLen1
80 bytes for MbStrLen4a
131 bytes for AxStrLen
Src: 100 bytes
46 cycles for MbStrLen1 (xmm0 not preserved)
48 cycles for MbStrLen4a
36 cycles for AxStrLen
Your algo is really fast, compliments :U
No, it is really BIG, not fast.
If you remove "mov eax,[esp+4]" from the sources, it becomes slower; try it with "mov eax...".
And for its size, it is really slow.
Alex
Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 4
StrLen proc item:DWORD
mov eax, [esp+4] ; get pointer to string
lea edx, [eax+3] ; pointer+3 used in the end
push ebp
push edi
mov ebp, 80808080h
@@:
REPEAT 3
mov edi, [eax] ; read next 4 bytes
add eax, 4 ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebp
jnz nxt
ENDM
mov edi, [eax] ; read next 4 bytes
add eax, 4 ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebp
jz @B ; no zero bytes, continue loop
nxt:
test ecx, 00008080h ; test first two bytes
jnz @F
shr ecx, 16 ; not in the first 2 bytes
add eax, 2
@@:
shl cl, 1 ; use carry flag to avoid branch
sbb eax, edx ; compute length
pop edi
pop ebp
ret 4
StrLen endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
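The core of the routine above is the test for a zero byte inside a 32-bit word (the lea / not / and / and sequence). Here is a small C sketch of that test, assuming little-endian byte order as on x86; the function names are hypothetical and this is not a translation of the whole proc:

```c
#include <stdint.h>

/* Nonzero if at least one of the four bytes of v is zero.
   Mirrors: lea ecx,[edi-01010101h] / not edi / and ecx,edi / and ecx,80808080h */
uint32_t zero_byte_mask(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Index (0..3) of the first zero byte in memory order, or -1 if none.
   Only the lowest flagged byte is trusted: a borrow out of a zero byte
   can also flag higher bytes that hold 01h. */
int first_zero_byte(uint32_t v)
{
    uint32_t m = zero_byte_mask(v);
    int i = 0;

    if (m == 0)
        return -1;
    while ((m & 0x80u) == 0) {   /* walk up from the lowest byte */
        m >>= 8;
        i++;
    }
    return i;
}
```

For example, 01010100h yields the mask 80808080h although only byte 0 is zero, which is why the assembly checks the low bytes first (test ecx, 00008080h) before looking at the upper half.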
Here it is, Hutch:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
80 bytes for MbStrLen4a
128 bytes for AxStrLen
113 bytes for StrLenAF
Src: 100 bytes
46 cycles for MbStrLen1 (xmm0 not preserved)
48 cycles for MbStrLen4a
94 cycles for StrLenAF
36 cycles for AxStrLen
@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table (http://wikis.sun.com/display/BluePrints/Instruction+Selection) might interest you.
Quote from: hutch-- on August 16, 2010, 11:27:21 PM
Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo.
I made a test-bed.
Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
100 - ERROR
84 bytes for MbStrLen2
100 - ERROR
73 bytes for MbStrLen3
100 - ERROR
80 bytes for MbStrLen4a
100 - ERROR
71 bytes for MbStrLen4b
100 - ERROR
78 bytes for MbStrLen5
100 - ERROR
147 bytes for AxStrLen
100 - ERROR
86 cycles for MbStrLen1
97 cycles for MbStrLen2
78 cycles for MbStrLen3
102 cycles for MbStrLen4a
75 cycles for MbStrLen4b
90 cycles for MbStrLen5
71 cycles for AxStrLen
247 cycles for StrLen
80 cycles for MbStrLen1
88 cycles for MbStrLen2
79 cycles for MbStrLen3
92 cycles for MbStrLen4a
87 cycles for MbStrLen4b
95 cycles for MbStrLen5
70 cycles for AxStrLen
230 cycles for StrLen
77 cycles for MbStrLen1
94 cycles for MbStrLen2
82 cycles for MbStrLen3
93 cycles for MbStrLen4a
74 cycles for MbStrLen4b
98 cycles for MbStrLen5
69 cycles for AxStrLen
230 cycles for StrLen
85 cycles for MbStrLen1
90 cycles for MbStrLen2
79 cycles for MbStrLen3
91 cycles for MbStrLen4a
88 cycles for MbStrLen4b
97 cycles for MbStrLen5
77 cycles for AxStrLen
233 cycles for StrLen
82 cycles for MbStrLen1
98 cycles for MbStrLen2
79 cycles for MbStrLen3
94 cycles for MbStrLen4a
75 cycles for MbStrLen4b
94 cycles for MbStrLen5
70 cycles for AxStrLen
244 cycles for StrLen
--- ok ---
Gratsie,
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
128 bytes for AxStrLen
113 bytes for StrLenAF
Src: 100 bytes
34 cycles for MbStrLen1 (xmm0 not preserved)
42 cycles for MbStrLen4a
68 cycles for StrLenAF
24 cycles for AxStrLen
34 cycles for MbStrLen1 (xmm0 not preserved)
42 cycles for MbStrLen4a
68 cycles for StrLenAF
24 cycles for AxStrLen
34 cycles for MbStrLen1 (xmm0 not preserved)
42 cycles for MbStrLen4a
68 cycles for StrLenAF
24 cycles for AxStrLen
34 cycles for MbStrLen1 (xmm0 not preserved)
42 cycles for MbStrLen4a
67 cycles for StrLenAF
24 cycles for AxStrLen
34 cycles for MbStrLen1 (xmm0 not preserved)
42 cycles for MbStrLen4a
68 cycles for StrLenAF
24 cycles for AxStrLen
--- ok ---
Quote from: jj2007 on August 16, 2010, 11:39:08 PM
@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table (http://wikis.sun.com/display/BluePrints/Instruction+Selection) might interest you.
No, you didn't understand my nice English :)))
I mean, if you DON'T load the value of [esp+4] into eax, the code will almost always run slower, because eax is part of the check for whether the string is unaligned or not. So the proc will almost always take the unaligned branch. Please post your timings for my original archive; it is too hard to explain in English, sorry.
Alex
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them. (I have seen the changes made by Jochen.)
Test mine, please.
Alex
Alex,
This one ?
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
58 bytes for MbStrLen1
100 - ERROR
84 bytes for MbStrLen2
100 - ERROR
73 bytes for MbStrLen3
100 - ERROR
80 bytes for MbStrLen4a
100 - ERROR
71 bytes for MbStrLen4b
100 - ERROR
78 bytes for MbStrLen5
100 - ERROR
147 bytes for AxStrLen
100 - ERROR
34 cycles for MbStrLen1
42 cycles for MbStrLen2
37 cycles for MbStrLen3
42 cycles for MbStrLen4a
42 cycles for MbStrLen4b
41 cycles for MbStrLen5
25 cycles for AxStrLen
67 cycles for StrLen
34 cycles for MbStrLen1
46 cycles for MbStrLen2
45 cycles for MbStrLen3
42 cycles for MbStrLen4a
48 cycles for MbStrLen4b
41 cycles for MbStrLen5
24 cycles for AxStrLen
67 cycles for StrLen
34 cycles for MbStrLen1
42 cycles for MbStrLen2
37 cycles for MbStrLen3
41 cycles for MbStrLen4a
42 cycles for MbStrLen4b
41 cycles for MbStrLen5
24 cycles for AxStrLen
67 cycles for StrLen
34 cycles for MbStrLen1
46 cycles for MbStrLen2
46 cycles for MbStrLen3
42 cycles for MbStrLen4a
48 cycles for MbStrLen4b
41 cycles for MbStrLen5
24 cycles for AxStrLen
67 cycles for StrLen
34 cycles for MbStrLen1
42 cycles for MbStrLen2
37 cycles for MbStrLen3
42 cycles for MbStrLen4a
42 cycles for MbStrLen4b
40 cycles for MbStrLen5
24 cycles for AxStrLen
67 cycles for StrLen
--- ok ---
Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them. (I have seen the changes made by Jochen.)
Test mine, please.
Alex,
Sorry, I should have split it but was too tired last night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 10 16-byte instruction cache slots (counting from a 16-byte-aligned start).
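To check the slot arithmetic, here is a rough C sketch that counts how many 16-byte blocks a routine touches, given its start offset and size. blocks_spanned is a hypothetical helper, and the 16-byte block size is an assumption about the fetch granularity, not a statement about any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Number of block-sized, block-aligned chunks touched by a routine
   that starts at start_offset and is code_bytes long. */
size_t blocks_spanned(size_t start_offset, size_t code_bytes, size_t block)
{
    size_t first, last;

    assert(block > 0);
    if (code_bytes == 0)
        return 0;
    first = start_offset / block;                     /* block holding the first byte */
    last  = (start_offset + code_bytes - 1) / block;  /* block holding the last byte */
    return last - first + 1;
}
```

With an aligned start, 128 bytes span 8 blocks and 147 bytes span 10 (144 bytes would be the most that fit into 9); a misaligned start can add one more block.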
Quote from: hutch-- on August 17, 2010, 12:02:21 AM
Alex,
This one ?
Yes,
Hutch!
The word "ERROR" is not accurate - I just didn't change the CodeSize macro.
Thanks!
Alex
Hi!
Here is a 34-byte MMX StrLen, and a 90-byte SSE1 version (57 bytes shorter) that is 2 clocks faster.
Everybody, please test this!
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
34 bytes for AxStrLenMMX
90 bytes for AxStrLenSSE1
113 bytes for StrLen
92 cycles for MbStrLen1
95 cycles for MbStrLen2
89 cycles for MbStrLen3
96 cycles for MbStrLen4a
71 cycles for MbStrLen4b
97 cycles for MbStrLen5
109 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
165 cycles for StrLen
100 cycles for MbStrLen1
97 cycles for MbStrLen2
89 cycles for MbStrLen3
90 cycles for MbStrLen4a
100 cycles for MbStrLen4b
95 cycles for MbStrLen5
110 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
164 cycles for StrLen
87 cycles for MbStrLen1
90 cycles for MbStrLen2
91 cycles for MbStrLen3
80 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
109 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
164 cycles for StrLen
92 cycles for MbStrLen1
97 cycles for MbStrLen2
90 cycles for MbStrLen3
80 cycles for MbStrLen4a
72 cycles for MbStrLen4b
94 cycles for MbStrLen5
110 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
164 cycles for StrLen
92 cycles for MbStrLen1
90 cycles for MbStrLen2
87 cycles for MbStrLen3
80 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
111 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
164 cycles for StrLen
--- ok ---
Note: the test was made with unaligned (not 16-byte aligned) strings.
Alex
Quote from: jj2007 on August 17, 2010, 06:34:22 AM
Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them. (I have seen the changes made by Jochen.)
Test mine, please.
Alex,
Sorry, I should have split it but was too tired last night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 10 16-byte instruction cache slots (counting from a 16-byte-aligned start).
:bg
Other people's code is never entirely clear, I understand.
Alex
Alex,
Here is my P4:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
34 bytes for AxStrLenMMX
90 bytes for AxStrLenSSE1
113 bytes for StrLen
48 cycles for MbStrLen1
51 cycles for MbStrLen2
46 cycles for MbStrLen3
56 cycles for MbStrLen4a
51 cycles for MbStrLen4b
50 cycles for MbStrLen5
70 cycles for AxStrLenMMX
43 cycles for AxStrLenSSE1
135 cycles for StrLen
79 cycles for MbStrLen1
54 cycles for MbStrLen2
53 cycles for MbStrLen3
47 cycles for MbStrLen4a
48 cycles for MbStrLen4b
51 cycles for MbStrLen5
72 cycles for AxStrLenMMX
45 cycles for AxStrLenSSE1
124 cycles for StrLen
40 cycles for MbStrLen1
49 cycles for MbStrLen2
45 cycles for MbStrLen3
55 cycles for MbStrLen4a
43 cycles for MbStrLen4b
51 cycles for MbStrLen5
69 cycles for AxStrLenMMX
43 cycles for AxStrLenSSE1
126 cycles for StrLen
40 cycles for MbStrLen1
47 cycles for MbStrLen2
45 cycles for MbStrLen3
44 cycles for MbStrLen4a
45 cycles for MbStrLen4b
56 cycles for MbStrLen5
70 cycles for AxStrLenMMX
45 cycles for AxStrLenSSE1
125 cycles for StrLen
59 cycles for MbStrLen1
47 cycles for MbStrLen2
48 cycles for MbStrLen3
42 cycles for MbStrLen4a
44 cycles for MbStrLen4b
53 cycles for MbStrLen5
73 cycles for AxStrLenMMX
50 cycles for AxStrLenSSE1
145 cycles for StrLen
--- ok ---
Dave.
Intel(R) Pentium(R) Dual CPU T2390 @ 1.86GHz (SSE4)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
24 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
25 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
17 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
23 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
24 cycles for MbStrLen2
26 cycles for MbStrLen3
23 cycles for MbStrLen4a
26 cycles for MbStrLen4b
23 cycles for MbStrLen5
16 cycles for MbStrLen1
20 cycles for MbStrLen2
23 cycles for MbStrLen3
23 cycles for MbStrLen4a
24 cycles for MbStrLen4b
23 cycles for MbStrLen5
--- ok ---
5Test:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
90 bytes for AxStrLenSSE1
113 bytes for StrLen
46 cycles for MbStrLen1
47 cycles for MbStrLen2
50 cycles for MbStrLen3
48 cycles for MbStrLen4a
53 cycles for MbStrLen4b
51 cycles for MbStrLen5
60 cycles for AxStrLenMMX
38 cycles for AxStrLenSSE1
Quote
mov eax,[esp+4] ; Jochen, if you remove this line again :), then
add esp,-10h ; the algo will treat almost any string
movups [esp],xmm7 ; as unaligned, because I do the
and eax,0fh ; alignment check in THIS line! ;-)
jz @F
OOPS :red
Quote from: jj2007 on August 18, 2010, 11:11:31 PM
OOPS :red
No, this is because I don't write comments (to save time).
Alex
Tried this?
Quote
mov edx,[esp+4]
; mov eax,[esp+4] ; Jochen, if you remove this line again :), then
add esp, -10h ; the algo will treat almost any string
movups [esp],xmm7 ; as unaligned, because I do the
test dl, 15
; and eax, 0fh ; alignment check in THIS line! ;-)
jz @F
another dual test.
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
58 bytes for MbStrLen1 33 33 33 33 33 cycles
84 bytes for MbStrLen2 34 42 34 42 34
73 bytes for MbStrLen3 36 44 36 44 36
80 bytes for MbStrLen4a 41 41 41 41 41
71 bytes for MbStrLen4b 36 45 36 45 36
78 bytes for MbStrLen5 46 46 46 46 48
34 bytes for AxStrLenMMX 58 62 62 68 64
90 bytes for AxStrLenSSE1 21 27 21 27 21
113 bytes for StrLen 67 67 67 67 67
--- ok ---
Hi!
This is a slightly changed version of my SSE1 algo. Fixed the stub load (I wrote it by analogy with the MMX version, but that is not the same thing).
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
34 bytes for AxStrLenMMX
86 bytes for AxStrLenSSE1
113 bytes for StrLen
83 cycles for MbStrLen1
92 cycles for MbStrLen2
80 cycles for MbStrLen3
78 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
111 cycles for AxStrLenMMX
60 cycles for AxStrLenSSE1
183 cycles for StrLen
110 cycles for MbStrLen1
90 cycles for MbStrLen2
90 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
109 cycles for AxStrLenMMX
61 cycles for AxStrLenSSE1
199 cycles for StrLen
92 cycles for MbStrLen1
102 cycles for MbStrLen2
90 cycles for MbStrLen3
80 cycles for MbStrLen4a
71 cycles for MbStrLen4b
149 cycles for MbStrLen5
109 cycles for AxStrLenMMX
61 cycles for AxStrLenSSE1
235 cycles for StrLen
100 cycles for MbStrLen1
113 cycles for MbStrLen2
74 cycles for MbStrLen3
103 cycles for MbStrLen4a
71 cycles for MbStrLen4b
77 cycles for MbStrLen5
109 cycles for AxStrLenMMX
61 cycles for AxStrLenSSE1
200 cycles for StrLen
92 cycles for MbStrLen1
99 cycles for MbStrLen2
91 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
115 cycles for AxStrLenMMX
61 cycles for AxStrLenSSE1
197 cycles for StrLen
--- ok ---
A big request to all: please test this!
Alex
Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
mov edx,[esp+4]
; mov eax,[esp+4] ; Jochen, if you remove this line again :), then
add esp, -10h ; the algo will treat almost any string
movups [esp],xmm7 ; as unaligned, because I do the
test dl, 15
; and eax, 0fh ; alignment check in THIS line! ;-)
jz @F
Jochen, on my CPU the proc is 2 clocks slower if I use edx for the check. Using a partial register gains nothing (I know about this; the reasons are very hardware-dependent, and on modern CPUs partial-register access is very slow).
Alex
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
58 bytes for MbStrLen1 33 33 33 33 33
84 bytes for MbStrLen2 34 42 34 42 34
73 bytes for MbStrLen3 36 44 36 44 36
80 bytes for MbStrLen4a 41 41 41 41 37
71 bytes for MbStrLen4b 36 45 36 45 36
78 bytes for MbStrLen5 46 46 46 48 46
34 bytes for AxStrLenMMX 59 65 58 68 65
86 bytes for AxStrLenSSE1 19 24 19 25 19
113 bytes for StrLen 67 67 67 67 67
--- ok ---
== bytes for MbStrLen1 == == == == ==
== bytes for MbStrLen2 == == == == ==
== bytes for MbStrLen3 == == == == ==
== bytes for MbStrLen4a == 37 == == ==
== bytes for MbStrLen4b == == == == ==
== bytes for MbStrLen5 == == == 46 ==
== bytes for AxStrLenMMX 62 == 61 == 62
== bytes for AxStrLenSSE1 == == == == ==
=== bytes for StrLen == == == == ==
--- ok ---
Quote from: Antariy on August 19, 2010, 10:51:20 PM
Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
mov edx,[esp+4]
; mov eax,[esp+4] ; Jochen, if you remove this line again :), then
add esp, -10h ; the algo will treat almost any string
movups [esp],xmm7 ; as unaligned, because I do the
test dl, 15
; and eax, 0fh ; alignment check in THIS line! ;-)
jz @F
Jochen, on my CPU the proc is 2 clocks slower if I use edx for the check. Using a partial register gains nothing (I know about this; the reasons are very hardware-dependent, and on modern CPUs partial-register access is very slow).
Alex
Hi Alex,
The difference is very small on my trusty old P4 and nonexistent on my Celeron. Here is one more for testing... I am tempted to use the MbStrLen4aP4 for the MasmBasic library, because it's short, reasonably fast, and safe for strings that end precisely at the end of the legal area (you remember VirtualAlloc can be a bit rude with attempts to use movups xmm7, [edx] when edx is near the next page :wink)
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
100 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
------- timings -------
333 cycles for szLen
78 cycles for MbStrLen4a
72 cycles for MbStrLen4aP4
62 cycles for AxStrLenSSE1
63 cycles for AxStrLenSSE1j
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
35 cycles for AxStrLenSSE1
35 cycles for AxStrLenSSE1j
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
72 bytes for MbStrLen4aP42
100 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
------- timings -------
104 cycles for szLen
42 cycles for MbStrLen4a
42 cycles for MbStrLen4aP4
42 cycles for MbStrLen4aP42
19 cycles for AxStrLenSSE1
23 cycles for AxStrLenSSE1j
104 cycles for szLen
42 cycles for MbStrLen4a
48 cycles for MbStrLen4aP4
48 cycles for MbStrLen4aP42
27 cycles for AxStrLenSSE1
28 cycles for AxStrLenSSE1j
104 cycles for szLen
42 cycles for MbStrLen4a
42 cycles for MbStrLen4aP4
42 cycles for MbStrLen4aP42
20 cycles for AxStrLenSSE1
24 cycles for AxStrLenSSE1j
104 cycles for szLen
42 cycles for MbStrLen4a
48 cycles for MbStrLen4aP4
48 cycles for MbStrLen4aP42
24 cycles for AxStrLenSSE1
26 cycles for AxStrLenSSE1j
--- ok ---
Quote from: hutch-- on August 20, 2010, 09:31:58 AM
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
42 cycles for MbStrLen4a
48 cycles for MbStrLen4aP4
48 cycles for MbStrLen4aP42
19 cycles for AxStrLenSSE1
23 cycles for AxStrLenSSE1j
Thanks. Interesting that the 4a is faster, on my P4 it's definitely slower. I have changed the 72Test posted above and added an option CrashIt, which will test a string near the VirtualAlloc page boundary.
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
72 bytes for MbStrLen4aP42
100 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
------- timings -------
129 cycles for szLen
72 cycles for MbStrLen4a
94 cycles for MbStrLen4aP4
91 cycles for MbStrLen4aP42
45 cycles for AxStrLenSSE1
40 cycles for AxStrLenSSE1j
157 cycles for szLen
72 cycles for MbStrLen4a
93 cycles for MbStrLen4aP4
93 cycles for MbStrLen4aP42
45 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
157 cycles for szLen
72 cycles for MbStrLen4a
71 cycles for MbStrLen4aP4
70 cycles for MbStrLen4aP42
45 cycles for AxStrLenSSE1
40 cycles for AxStrLenSSE1j
158 cycles for szLen
72 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
94 cycles for MbStrLen4aP42
45 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
--- ok ---
Thanxalot. The current version of MasmBasic Len() is MbStrLen4a, and it will stay that way. At 80 bytes it is short, reasonably fast and SSE-safe near page boundaries.
JJ,
On both the Core2 and i7 boxes, SSE is a lot faster than on my P4 boxes relative to normal integer instruction code so if you are pointing the procedures at SSE capable processors I would go for the ones that are faster on the Core2 and i3/i5/i7 architecture. Rockoon has been posting results from a recent 6 core AMD so it will also be worthwhile seeing what they work like on a late AMD box.
Hutch,
Yes, some more tests would be fine - but it seems the MbStrLen4a is overall quite ok. The SSE1 versions are not "boundary safe", the others seem a tick slower.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
58 cycles for MbStrLen4a
62 cycles for MbStrLen4aP4
62 cycles for MbStrLen4aP42
56 cycles for MbStrLen4a
71 cycles for MbStrLen4aP4
71 cycles for MbStrLen4aP42
The increase from 62 to 71 may have to do with the testbed setting: For each timing loop, there is a push for changing the stack alignment. So sometimes the movdqu [esp], xmm0 is 16-byte aligned and thus faster (on some CPUs the movdqu becomes as fast as movdqa if it hits an aligned address).
Jochen, is your proc safe at the end of a VirtualAlloc block? (I haven't seen the sources yet.)
Alex
Jochen, your procs are not safe either; they crash with short strings.
You are forcing me to make a version with SEH :), really.
Alex
Hutch, what do you need an SSE1 StrLen algo for? Do you need it at all?
Alex
Jochen, I have seen the sources; your proc does not crash :)
Alex
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.
Alex
Hutch,
I'm wondering why you tested and tolerated such idiotic, slow algos.
Idiotic because they preserved registers without any need.
Only Microsoft can require which registers we must preserve :(
Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.
Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.
Hi!
A big request to all: please test this. This is a fixed version of my SSE1 StrLen proc, which does not crash at the end of the buffer in the normal case of a zero-terminated string.
And this proc is still fast with unaligned strings; it has the same speed as my previous proc, or is slightly slower.
For the first test-bed (Jochen's old procs):
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58 bytes for MbStrLen1
84 bytes for MbStrLen2
73 bytes for MbStrLen3
80 bytes for MbStrLen4a
71 bytes for MbStrLen4b
78 bytes for MbStrLen5
34 bytes for AxStrLenMMX
93 bytes for AxStrLenSSE1
113 bytes for StrLen
73 cycles for MbStrLen1
90 cycles for MbStrLen2
91 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
110 cycles for AxStrLenMMX
63 cycles for AxStrLenSSE1
164 cycles for StrLen
100 cycles for MbStrLen1
113 cycles for MbStrLen2
104 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
100 cycles for MbStrLen5
108 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
164 cycles for StrLen
100 cycles for MbStrLen1
90 cycles for MbStrLen2
90 cycles for MbStrLen3
80 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
114 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
164 cycles for StrLen
81 cycles for MbStrLen1
130 cycles for MbStrLen2
104 cycles for MbStrLen3
90 cycles for MbStrLen4a
71 cycles for MbStrLen4b
80 cycles for MbStrLen5
110 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
164 cycles for StrLen
92 cycles for MbStrLen1
89 cycles for MbStrLen2
90 cycles for MbStrLen3
104 cycles for MbStrLen4a
71 cycles for MbStrLen4b
95 cycles for MbStrLen5
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
164 cycles for StrLen
For Jochen's new test-bed with the CrashIt macro set to 1:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
72 bytes for MbStrLen4aP42
93 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
------- timings -------
27 cycles for szLen
33 cycles for MbStrLen4a
62 cycles for MbStrLen4aP4
37 cycles for MbStrLen4aP42
24 cycles for AxStrLenSSE1
Jochen's tweak runs after my proc, and it crashes. My proc works properly now (see: its test succeeds, 24 clocks), i.e. my proc does not crash at the end of the buffer.
For the new test-bed, without the CrashIt macro:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
72 bytes for MbStrLen4aP42
93 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
------- timings -------
264 cycles for szLen
88 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
72 cycles for MbStrLen4aP42
62 cycles for AxStrLenSSE1
68 cycles for AxStrLenSSE1j
270 cycles for szLen
105 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
93 cycles for MbStrLen4aP42
62 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
270 cycles for szLen
105 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
71 cycles for MbStrLen4aP42
62 cycles for AxStrLenSSE1
68 cycles for AxStrLenSSE1j
264 cycles for szLen
91 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
71 cycles for MbStrLen4aP42
62 cycles for AxStrLenSSE1
68 cycles for AxStrLenSSE1j
Jochen, read the comments at the start of the AxStrLenSSE1 proc. I hope you understand what I wrote.
AxStrLenSSE1j (Jochen's remake) still crashes; I did not work on it.
And the integer version also crashes on strings at the end of the buffer.
Hutch, SSE runs nearly 3 times faster on Core and later CPUs than on the P4.
Hutch, please test my new version.
I preserve XMM7 and ECX only for a fair comparison with Jochen's procs. What he does with HIS MasmBasic is his right.
Alex
Quote from: jj2007 on August 21, 2010, 07:07:28 AM
Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.
Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.
Try the new version. I did it the normal way, NOT covered by SEH, only out of respect for you. And I preserve ECX and XMM7 only out of respect for you.
That made it 1 clock slower, such a lot :)
But I disagree with you adding the alignment stuff to "codesize". This is NOT code; it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.
I am not imposing my proc on you, really. It is 16% bigger and at least 30% faster on new CPUs, so I think this is satisfactory.
Alex
Hutch, the page has turned again, so I repeat my request.
Please test this: "http://www.masm32.com/board/index.php?action=dlattach;topic=14626.0;id=8001".
This is my new safe version of the SSE1 proc.
Alex
Nobody except God can tell me what I must do and which regs I must preserve.
So if somebody doesn't have a halo and wings, shut up, please.
Alex
P.S. Somebody, you have not proved your second assertion: write a faster proc first, then talk.
I know you will write a BLOATED algo again and think that you are great; that is your problem alone.
Creating a VERY bloated unrolled algo (using many XMM regs) is NOT hard. What is hard is creating the smallest algo among the fastest ones.
Quote from: Antariy on August 21, 2010, 04:16:43 PM
But I disagree with you adding the alignment stuff to "codesize". This is NOT code; it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.
Dear Alex,
The assumption here is that you build a library, and each algo starts on a 16-byte aligned boundary. So if somebody (Lingo does that very often) inserts 7 bytes of strange db xyz before the algo start, this will a) add to the size of the executable and b) may waste some bytes of instruction cache when the CPU pulls code in with 16-byte alignment. Therefore "codesize" starts at the 16-byte boundary in all our tests. Call it a convention. Since it's being applied to everybody, I would call it fair :wink
The a16 macro should not count imho because in real life you would not insert 16 int 3's between all the algos of your library. We use it here - for everybody - in the hope that it might eliminate execution cache influences on the timings. Whether that really works, I don't know...
i dunno how fair it is, really - lol
but, it appears to be as "real-world" as it can be
it depends on how the code is placed in the test program (luck of the draw)
maybe a more judicious method would be to count actual routine bytes, then add 8 for a 16-alignment
that would represent the "average" byte-count for all possible placements
and - those who do want to not use align 16, can suffer a few clock-cycles penalty to save 8 bytes in the count
that would be fair
Jochen, this is a piece of text from Intel's optimization guide:
Quote
Assembly/Compiler Coding Rule 56. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.
This text cannot be copied with any standard apps (because Intel locks its PDFs against copying), but since dedndave and I have been talking about encryption and researching it... :)
Returning to the quote: as you see, Intel recommends placing pieces of data where they will not be executed, and the CPU can "detect" this. So CPUs are NOT so silly that they don't know there is no need to pre-decode such "code".
The code location problem is really a problem of the interaction between code and data, not of the location or type of the alignment instructions (int3, or a long lea esp,[esp], etc.; that doesn't matter). This is too hard for me to express; I hope you understand me well enough.
Location has some small influence, but it cannot be critical. Algos that work with a lot of data are more sensitive to "code placement".
Alex
Has anyone tried my modest entry in the race? I could not get the AxStrLenMMX or AxStrLenSSE1 examples I downloaded to yield either a consistent or correct result using a string length of 11 chars ("Hello There"), either in MASM or GoAsm, so my tests were pretty much shot. After all, however fast it is at yielding the wrong answer, it's still wrong. I was using vkim's debug to display the results (both gave me a string length of 7). I have included the RadAsm project; if someone can tell me what is required to get correct results, I would appreciate it.
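; From my rather old strings library
lszLenMMX proc pString:DWORD
    mov eax,[pString]
    nop
    nop ; fill in stack frame+mov to 8 bytes
    pxor mm0,mm0
    nop ; fill pxor to 4 bytes
    pxor mm1,mm1 ; mm1 = eight zero bytes to compare against
    nop ; fill pxor to 4 bytes
@@: ; this is aligned to 16 bytes
    movq mm0,[eax] ; load the next 8 bytes
    pcmpeqb mm0,mm1 ; 0FFh in every byte that is zero
    add eax,8
    pmovmskb ecx,mm0 ; one mask bit per byte
    or ecx,ecx
    jz @B ; no zero byte yet, keep scanning
    sub eax,[pString]
    bsf ecx,ecx ; index of the first zero byte in the block
    sub eax,8
    add eax,ecx
    emms
    RET
lszLenMMX ENDP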
Edgar
Hi Edgar,
Your attachment won't assemble, some includes are missing, and paths rely on environment variables. But anyway, I got the algo to work:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
80 bytes for MbStrLen4a
49 bytes for lszLenMMX
------- timings -------
131 cycles for szLen
49 cycles for MbStrLen4a
90 cycles for lszLenMMX
Faster than the Masm32 library algo, but problematic because a) it trashes the FPU and b) it throws an exception for strings near a VirtualAlloc boundary.
Thanks jj2007,
I finally got the Ax... ones to work, hadn't noticed the lack of prologue and epilogue so ESP+4 wasn't pointing to the right place in my tests.
Edgar
It is not a big deal to beat the stupid losers: :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
80 bytes for MbStrLen4a
72 bytes for MbStrLen4aP4
72 bytes for MbStrLen4aP42
93 bytes for AxStrLenSSE1
80 bytes for AxStrLenSSE1j
84 bytes for StrLenLingo
------- timings -------
112 cycles for szLen
37 cycles for MbStrLen4a
49 cycles for MbStrLen4aP4
49 cycles for MbStrLen4aP42
20 cycles for AxStrLenSSE1
45 cycles for AxStrLenSSE1j
12 cycles for StrLenLingo
111 cycles for szLen
37 cycles for MbStrLen4a
42 cycles for MbStrLen4aP4
42 cycles for MbStrLen4aP42
21 cycles for AxStrLenSSE1
22 cycles for AxStrLenSSE1j
12 cycles for StrLenLingo
110 cycles for szLen
63 cycles for MbStrLen4a
73 cycles for MbStrLen4aP4
73 cycles for MbStrLen4aP42
20 cycles for AxStrLenSSE1
26 cycles for AxStrLenSSE1j
12 cycles for StrLenLingo
110 cycles for szLen
37 cycles for MbStrLen4a
42 cycles for MbStrLen4aP4
43 cycles for MbStrLen4aP42
40 cycles for AxStrLenSSE1
22 cycles for AxStrLenSSE1j
12 cycles for StrLenLingo
--- ok ---
Later:
Corrected a bug in my algo. Please reload it, sorry...
lingo wasn't bad-lookin, when he was little....
(http://www.babble.com/CS/blogs/strollerderby/2008/08/23-End/Tantrum-1.jpg)
Hi!
This is the old proc with some changes. I have not seen the other new procs, so I did not add them to the tests.
Please test this.
Alex
Alex.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
72 code size for MbStrLen4aP4 72 total bytes for MbStrLen4aP4
72 code size for MbStrLen4aP42 72 total bytes for MbStrLen4aP42
104 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
80 code size for AxStrLenSSE1j 80 total bytes for AxStrLenSSE1j
------- timings -------
110 cycles for szLen
52 cycles for MbStrLen4a
54 cycles for MbStrLen4aP4
54 cycles for MbStrLen4aP42
34 cycles for AxStrLenSSE1
37 cycles for AxStrLenSSE1j
110 cycles for szLen
52 cycles for MbStrLen4a
64 cycles for MbStrLen4aP4
64 cycles for MbStrLen4aP42
33 cycles for AxStrLenSSE1
45 cycles for AxStrLenSSE1j
110 cycles for szLen
52 cycles for MbStrLen4a
54 cycles for MbStrLen4aP4
54 cycles for MbStrLen4aP42
33 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
110 cycles for szLen
52 cycles for MbStrLen4a
64 cycles for MbStrLen4aP4
64 cycles for MbStrLen4aP42
32 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
What about setting the CrashIt macro to 1 and testing Lingo's proc? :P
Alex
P.S. Lingo's proc is so bad! It doesn't preserve ECX and the XMM regs! :P
Thanks, Hutch!
Alex
Here is one more, with a modification of Alex' excellent algo (shorter, same cycle count on my CPU).
I tried to include Lingo's new algo, but - not surprisingly - it raised an exception. If you are masochist enough, you can "heal" it as follows:
Quote
ExA:
bsf eax, ecx
mov edx, [esp-8] ; added by JJ (still unsafe but for testing it's ok)
jmp edx
StrLenLingo endp
Lingo's algo will still raise an exception for CrashIt = 1.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
------- timings, misaligned -------
131 cycles for szLen
49 cycles for MbStrLen4a
35 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
131 cycles for szLen
49 cycles for MbStrLen4a
35 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
131 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
131 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
EDIT:
Quote
sub ecx, edx
pxor xmm7, xmm7 ; thanks Alex!!
jz @F
Please test the attached archive. I added Lingo's and Edgar's procs.
Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
72 code size for MbStrLen4aP4 72 total bytes for MbStrLen4aP4
72 code size for MbStrLen4aP42 72 total bytes for MbStrLen4aP42
34 code size for AxStrLenMMX 34 total bytes for AxStrLenMMX
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
79 code size for StrLenLingo 84 total bytes for StrLenLingo
49 code size for lszLenMMX 49 total bytes for lszLenMMX
------- timings -------
263 cycles for szLen
97 cycles for MbStrLen4a
100 cycles for MbStrLen4aP4
121 cycles for MbStrLen4aP42
108 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
53 cycles for StrLenLingo
119 cycles for lszLenMMX
251 cycles for szLen
96 cycles for MbStrLen4a
122 cycles for MbStrLen4aP4
96 cycles for MbStrLen4aP42
108 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
53 cycles for StrLenLingo
124 cycles for lszLenMMX
251 cycles for szLen
91 cycles for MbStrLen4a
146 cycles for MbStrLen4aP4
105 cycles for MbStrLen4aP42
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
53 cycles for StrLenLingo
119 cycles for lszLenMMX
251 cycles for szLen
71 cycles for MbStrLen4a
96 cycles for MbStrLen4aP4
120 cycles for MbStrLen4aP42
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1
53 cycles for StrLenLingo
122 cycles for lszLenMMX
--- ok ---
About the fastest proc: this is like comparing a lame horse with a bulldozer. The horse dirties some regs that must be preserved for a fair comparison, and the horse stumbles and falls on some strings :P
And the naughty horse messes up the timings of the good bulldozers :)
Alex
For Jochen's latest archive (80StrLen.zip):
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j
83 code size for StrLenLingo 92 total bytes for StrLenLingo
------- timings, misaligned -------
269 cycles for szLen
85 cycles for MbStrLen4a
66 cycles for AxStrLenSSE1
68 cycles for AxStrLenSSE1j
58 cycles for StrLenLingo
268 cycles for szLen
83 cycles for MbStrLen4a
66 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
57 cycles for StrLenLingo
269 cycles for szLen
85 cycles for MbStrLen4a
66 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
57 cycles for StrLenLingo
269 cycles for szLen
105 cycles for MbStrLen4a
65 cycles for AxStrLenSSE1
68 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
--- ok ---
Alex
Quote from: Antariy on August 22, 2010, 09:54:29 PM
For Jochen's latest archive (80StrLen.zip):
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j
Strange - these errors are not present in my runs. Can you check what happened there??
Jochen,
movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7 <--- move this above jz @F
pcmpeqb xmm7, [edx]
Alex
Quote from: Antariy on August 22, 2010, 10:08:00 PM
movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7 <--- move this above jz @F
pcmpeqb xmm7, [edx]
Of course :red
It's fixed, see new attachment above.
Thanks Alex :U
I moved the pxor between sub ecx,edx and jz @F; here are the timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
83 code size for StrLenLingo 92 total bytes for StrLenLingo
------- timings, misaligned -------
262 cycles for szLen
82 cycles for MbStrLen4a
65 cycles for AxStrLenSSE1
67 cycles for AxStrLenSSE1j
55 cycles for StrLenLingo
263 cycles for szLen
87 cycles for MbStrLen4a
64 cycles for AxStrLenSSE1
67 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
262 cycles for szLen
85 cycles for MbStrLen4a
65 cycles for AxStrLenSSE1
67 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
262 cycles for szLen
84 cycles for MbStrLen4a
64 cycles for AxStrLenSSE1
67 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
--- ok ---
Edited: my fix is the same as yours, Jochen, so the results are the same.
Alex
Thanks. So my 1j is three cycles slower on your CPU - on mine it's about half a cycle faster. For aligned strings, by the way, it looks like this:
------- timings, 16-byte aligned -------
131 cycles for szLen
32 cycles for MbStrLen4a
32 cycles for AxStrLenSSE1
32 cycles for AxStrLenSSE1j
24 cycles for StrLenLingo (UNSAFE)
For aligned:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
83 code size for StrLenLingo 92 total bytes for StrLenLingo
------- timings, 16-byte aligned -------
263 cycles for szLen
69 cycles for MbStrLen4a
68 cycles for AxStrLenSSE1
70 cycles for AxStrLenSSE1j
55 cycles for StrLenLingo
261 cycles for szLen
69 cycles for MbStrLen4a
68 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
55 cycles for StrLenLingo
262 cycles for szLen
69 cycles for MbStrLen4a
68 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
261 cycles for szLen
69 cycles for MbStrLen4a
68 cycles for AxStrLenSSE1
69 cycles for AxStrLenSSE1j
56 cycles for StrLenLingo
--- ok ---
So a few clocks don't mean anything. As you can see, lingo's proc does not work very well :green2 Considering that it crashes and doesn't preserve regs - no comment... :green2
Alex
Alex,
91Test_StrLenSaveXmm.exe crashes on my P4
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
72 code size for MbStrLen4aP4 72 total bytes for MbStrLen4aP4
72 code size for MbStrLen4aP42 72 total bytes for MbStrLen4aP42
34 code size for AxStrLenMMX 34 total bytes for AxStrLenMMX
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
79 code size for StrLenLingo 84 total bytes for StrLenLingo
49 code size for lszLenMMX 49 total bytes for lszLenMMX
------- timings -------
260 cycles for szLen
53 cycles for MbStrLen4a
91Test_StrLenSaveXmm.exe has encountered a problem and needs to close. We are sorry for the inconvenience.
Dave.
Dave, this is something to do with Jochen's old MbStrLen4aP4 proc :(
Alex
Dave, please test "80StrLen.zip"; maybe Jochen has fixed this problem. I only have his old proc for P4.
Alex
Here's my test from my laptop:
AMD Athlon(tm) X2 Dual-Core QL-62 (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
72 code size for MbStrLen4aP4 72 total bytes for MbStrLen4aP4
72 code size for MbStrLen4aP42 72 total bytes for MbStrLen4aP42
34 code size for AxStrLenMMX 34 total bytes for AxStrLenMMX
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
79 code size for StrLenLingo 84 total bytes for StrLenLingo
49 code size for lszLenMMX 49 total bytes for lszLenMMX
------- timings -------
147 cycles for szLen
56 cycles for MbStrLen4a
53 cycles for MbStrLen4aP4
52 cycles for MbStrLen4aP42
72 cycles for AxStrLenMMX
69 cycles for AxStrLenSSE1
43 cycles for StrLenLingo
75 cycles for lszLenMMX
138 cycles for szLen
58 cycles for MbStrLen4a
56 cycles for MbStrLen4aP4
63 cycles for MbStrLen4aP42
76 cycles for AxStrLenMMX
64 cycles for AxStrLenSSE1
43 cycles for StrLenLingo
76 cycles for lszLenMMX
137 cycles for szLen
52 cycles for MbStrLen4a
54 cycles for MbStrLen4aP4
53 cycles for MbStrLen4aP42
76 cycles for AxStrLenMMX
68 cycles for AxStrLenSSE1
42 cycles for StrLenLingo
77 cycles for lszLenMMX
138 cycles for szLen
50 cycles for MbStrLen4a
56 cycles for MbStrLen4aP4
55 cycles for MbStrLen4aP42
70 cycles for AxStrLenMMX
65 cycles for AxStrLenSSE1
48 cycles for StrLenLingo
77 cycles for lszLenMMX
--- ok ---
Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...
Dave, when you run your development computer (AMD), could you find out where the app crashes?
Alex
Thanks, Edgar!
It is always nice to have timings from different hardware.
Alex
Alex,
Here is my P4 for 80StrLen.zip.
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
83 code size for StrLenLingo 92 total bytes for StrLenLingo
------- timings, misaligned -------
236 cycles for szLen
52 cycles for MbStrLen4a
44 cycles for AxStrLenSSE1
38 cycles for AxStrLenSSE1j
32 cycles for StrLenLingo
227 cycles for szLen
46 cycles for MbStrLen4a
37 cycles for AxStrLenSSE1
38 cycles for AxStrLenSSE1j
31 cycles for StrLenLingo
230 cycles for szLen
45 cycles for MbStrLen4a
37 cycles for AxStrLenSSE1
38 cycles for AxStrLenSSE1j
32 cycles for StrLenLingo
227 cycles for szLen
46 cycles for MbStrLen4a
37 cycles for AxStrLenSSE1
39 cycles for AxStrLenSSE1j
30 cycles for StrLenLingo
--- ok ---
Dave.
Quote from: donkey on August 22, 2010, 10:59:39 PM
Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...
No, Edgar. Don't forget that my MMX version is only very SLIGHTLY faster, because I omit the prologue/epilogue code. All SSE versions are faster because they "eat" twice as much data per loop (lingo's: four times as much data per loop). These are normal results, not a pig.
Alex
One more for the night - I shaved off a cycle and six bytes of codesize:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxStrLenSSE1j 78 total bytes for AxStrLenSSE1j
83 code size for StrLenLingo 90 total bytes for StrLenLingo
------- timings, misaligned -------
132 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
33 cycles for AxStrLenSSE1j
24 cycles for StrLenLingo (unsafe)
132 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
33 cycles for AxStrLenSSE1j
24 cycles for StrLenLingo (unsafe)
131 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
33 cycles for AxStrLenSSE1j
24 cycles for StrLenLingo (unsafe)
132 cycles for szLen
49 cycles for MbStrLen4a
34 cycles for AxStrLenSSE1
34 cycles for AxStrLenSSE1j
24 cycles for StrLenLingo (unsafe)
JJ,
Here is my P4 for 80bStrLen.zip.
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxStrLenSSE1j 78 total bytes for AxStrLenSSE1j
83 code size for StrLenLingo 90 total bytes for StrLenLingo
------- timings, misaligned -------
242 cycles for szLen
56 cycles for MbStrLen4a
40 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
30 cycles for StrLenLingo (unsafe)
228 cycles for szLen
52 cycles for MbStrLen4a
45 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
30 cycles for StrLenLingo (unsafe)
228 cycles for szLen
43 cycles for MbStrLen4a
38 cycles for AxStrLenSSE1
42 cycles for AxStrLenSSE1j
31 cycles for StrLenLingo (unsafe)
230 cycles for szLen
54 cycles for MbStrLen4a
38 cycles for AxStrLenSSE1
52 cycles for AxStrLenSSE1j
33 cycles for StrLenLingo (unsafe)
--- ok ---
Dave
Quote from: KeepingRealBusy on August 22, 2010, 10:35:58 PM
Alex,
91Test_StrLenSaveXmm.exe crashes on my P4
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
Hi Dave & Alex,
I found the "bug": It's lddqu - the instruction requires SSE3.
Attached a new testbed with two AxJJ variants that behave similarly on a P4 but very differently on my Celeron. Timings?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 100 byte string -------
272 cycles for szLen
89 cycles for MbStrLen4a
63 cycles for AxStrLenSSE1
70 cycles for AxJJStrLen1
68 cycles for AxJJStrLen2
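On the lddqu crash mentioned above: a testbed can avoid this kind of fault by checking the CPUID feature bits before picking a code path. The following is only a sketch in C, assuming the caller has already executed CPUID with EAX = 1 and passes in the returned EDX and ECX values; the helper names sse_level and can_use_lddqu are made up for illustration and are not part of the testbed.

```c
#include <stdint.h>

/* Feature bits reported by CPUID leaf 1 (EAX = 1). */
#define CPUID1_EDX_SSE   (1u << 25)
#define CPUID1_EDX_SSE2  (1u << 26)
#define CPUID1_ECX_SSE3  (1u << 0)

/* Hypothetical helper: returns 0 (no SSE), 1 (SSE), 2 (SSE2) or 3 (SSE3)
   from the EDX/ECX values that CPUID leaf 1 returned. */
static int sse_level(uint32_t cpuid1_edx, uint32_t cpuid1_ecx)
{
    if (!(cpuid1_edx & CPUID1_EDX_SSE))  return 0;
    if (!(cpuid1_edx & CPUID1_EDX_SSE2)) return 1;
    if (!(cpuid1_ecx & CPUID1_ECX_SSE3)) return 2;
    return 3;
}

/* lddqu is an SSE3 instruction; on an SSE2-only CPU the unaligned
   load would have to be done with movdqu instead. */
static int can_use_lddqu(uint32_t cpuid1_edx, uint32_t cpuid1_ecx)
{
    return sse_level(cpuid1_edx, cpuid1_ecx) >= 3;
}
```

With the bit patterns of an SSE2-only CPU, can_use_lddqu returns 0, so the SSE3-only path would be skipped instead of raising an invalid opcode exception.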
Again idiotic vain efforts... :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxJJStrLen1 78 total bytes for AxJJStrLen1
77 code size for AxJJStrLen2 78 total bytes for AxJJStrLen2
79 code size for StrLenLingo 79 total bytes for StrLenLingo
------- timings, misaligned, 100 byte string -------
112 cycles for szLen
37 cycles for MbStrLen4a
20 cycles for AxStrLenSSE1
24 cycles for AxJJStrLen1
26 cycles for AxJJStrLen2
12 cycles for StrLenLingo
110 cycles for szLen
37 cycles for MbStrLen4a
21 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen1
29 cycles for AxJJStrLen2
12 cycles for StrLenLingo
110 cycles for szLen
37 cycles for MbStrLen4a
20 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
12 cycles for StrLenLingo
110 cycles for szLen
37 cycles for MbStrLen4a
20 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen1
29 cycles for AxJJStrLen2
12 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
7 cycles for szLen
32 cycles for MbStrLen4a
8 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen1
8 cycles for AxJJStrLen2
2 cycles for StrLenLingo
7 cycles for szLen
32 cycles for MbStrLen4a
15 cycles for AxStrLenSSE1
8 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
1 cycles for StrLenLingo
7 cycles for szLen
30 cycles for MbStrLen4a
8 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen1
8 cycles for AxJJStrLen2
2 cycles for StrLenLingo
7 cycles for szLen
32 cycles for MbStrLen4a
15 cycles for AxStrLenSSE1
8 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
1 cycles for StrLenLingo
--- ok ---
Quote from: lingo on August 23, 2010, 05:30:46 PM
Again idiotic vain efforts... :lol
Lingo,
While you have a fast CPU, and stolen a lot from Alex and my code, your algo still crashes. Give up.
Version d, Celeron M timings:
------- timings, misaligned, 5 byte string -------
12 cycles for szLen
15 cycles for AxStrLenSSE1
11 cycles for AxJJStrLen1
22 cycles for AxJJStrLen2
7 cycles for AxJJStrLen3
12 cycles for szLen
12 cycles for AxStrLenSSE1
17 cycles for AxJJStrLen1
11 cycles for AxJJStrLen2
7 cycles for AxJJStrLen3
12 cycles for szLen
16 cycles for AxStrLenSSE1
11 cycles for AxJJStrLen1
22 cycles for AxJJStrLen2
7 cycles for AxJJStrLen3
12 cycles for szLen
12 cycles for AxStrLenSSE1
17 cycles for AxJJStrLen1
11 cycles for AxJJStrLen2
7 cycles for AxJJStrLen3
The "jumping" is most probably caused by the movups [esp+xxx], xmm0 - in the REPEAT loop, the stack pointer is gradually decreased (push eax), so every 4 loops one of the algos is lucky enough to get 16-byte alignment.
To eliminate this effect, AxJJStrLen3 uses a global aligned variable and movdqa. Results look convincing.
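The alignment effect described above can be modelled with plain arithmetic. This is a sketch in C rather than the testbed's own code, and it assumes one 4-byte push per REPEAT iteration and a constant offset into the stack frame; the starting esp value used in the note below is an arbitrary example.

```c
#include <stdint.h>

/* Hypothetical model of the REPEAT loop: each iteration does one
   "push eax", so esp drops by 4 bytes per iteration. */
static uint32_t esp_after_pushes(uint32_t esp_start, unsigned pushes)
{
    return esp_start - 4u * pushes;
}

/* Nonzero if a 16-byte store at addr would be 16-byte aligned. */
static int is_aligned16(uint32_t addr)
{
    return (addr & 15u) == 0;
}

/* Count how many of n consecutive iterations see an aligned save slot. */
static unsigned count_aligned_iterations(uint32_t esp_start, unsigned n)
{
    unsigned hits = 0;
    for (unsigned i = 0; i < n; i++)
        hits += is_aligned16(esp_after_pushes(esp_start, i)) ? 1u : 0u;
    return hits;
}
```

Starting from, say, esp = 0012FF84h, iterations 1 and 5 out of the first 8 land on a 16-byte boundary, i.e. one in four, which would match the pattern of one "lucky" algo per group of four.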
"and stolen a lot from Alex and my code"
Wow, the thief crying "catch the thief" see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
I can't steal anything from you or from the asian lamer, simply because you have no ideas in assembly, hence your code will always be very ugly and slow... So, take your pills and take it easy.. :lol
"your algo still crashes"
For some lamers in programming, maybe, but for advanced programmers who test the end of their buffer after every call to VirtualAlloc there is just no way to crash. :lol
"Give up."
Because of some sick idiotic lamers in the forum :lol....Never!
Quote from: lingo on August 23, 2010, 06:56:17 PM
For some lamers in programming, maybe, but for advanced programmers who test the end of their buffer after every call to VirtualAlloc there is just no way to crash. :lol
Line 3:
CrashIt = 1 ; overrides MisAlign - the "SSE1" algos will bang their head against the VirtualAlloc boundary
Line 153:
if 0 ; CrashIt
print "No result for Lingo's algo, it crashes", 13, 10
else
cycles Src, StrLenLingo ; ok, so let it crash
endif
:bg
> downloaded 8 times
Nice trick, Lingo. The original is in reply #98, though
"The thief is always a liar" -> just see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
Just put your lame macro in... you know where... :lol
I can explain VirtualAlloc to any normal man, but it seems you forgot your pills again... :lol
Take care or next step will be the "electroconvulsive therapy"... :lol
I see the inferior are trying to take on the champ again, with little success :U
Hi, this is the new old version, for JJ and some other explainers of VirtualAlloc.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80 code size for MbStrLen4a 80 total bytes for MbStrLen4a
72 code size for MbStrLen4aP4 72 total bytes for MbStrLen4aP4
72 code size for MbStrLen4aP42 72 total bytes for MbStrLen4aP42
34 code size for AxStrLenMMX 34 total bytes for AxStrLenMMX
85 code size for AxStrLenSSE1a 88 total bytes for AxStrLenSSE1a
83 code size for StrLenLingo 88 total bytes for StrLenLingo
49 code size for lszLenMMX 49 total bytes for lszLenMMX
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
------- timings -------
251 cycles for szLen
105 cycles for MbStrLen4a
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1a
53 cycles for StrLenLingo
119 cycles for lszLenMMX
65 cycles for AxStrLenSSE1J
251 cycles for szLen
77 cycles for MbStrLen4a
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1a
53 cycles for StrLenLingo
118 cycles for lszLenMMX
65 cycles for AxStrLenSSE1J
258 cycles for szLen
105 cycles for MbStrLen4a
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1a
53 cycles for StrLenLingo
115 cycles for lszLenMMX
65 cycles for AxStrLenSSE1J
251 cycles for szLen
78 cycles for MbStrLen4a
109 cycles for AxStrLenMMX
62 cycles for AxStrLenSSE1a
53 cycles for StrLenLingo
119 cycles for lszLenMMX
65 cycles for AxStrLenSSE1J
--- ok ---
Alex
Quote from: E^cube on August 23, 2010, 09:30:19 PM
I see the inferior are trying to take on the champ again, with little success :U
Why does anybody think that "Beat Lingo!" is a big need and a big deal? Wow! There is no such need.
E^cube, you underestimate yourself if you think that everyone has only one target: beating Lingo.
This is funny :)
His proc eats twice as much data per loop, and it has half the functionality (it crashes and does not preserve regs, which is needed for a fair comparison with Jochen's procs).
And his proc has only ~45% of performance gain, on HIS CPU only. This is your "champ"? This is a bad programmer who relies on other software to make his procs reliable.
That he made the "fastest" proc because of this or that is an excuse for being unable to write a proc with the same functionality and higher speed.
And Jochen fixed his proc; otherwise it crashes on short strings.
So your "champ" deserves no respect: he produces bad, unreliable code (maybe fast, but NOT working).
Alex
For Jochen's 80d:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxJJStrLen1 78 total bytes for AxJJStrLen1
77 code size for AxJJStrLen2 78 total bytes for AxJJStrLen2
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
------- timings, misaligned, 100 byte string -------
67 cycles for AxStrLenSSE1
75 cycles for AxJJStrLen1
81 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
66 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen1
72 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
67 cycles for AxStrLenSSE1
73 cycles for AxJJStrLen1
73 cycles for AxJJStrLen2
70 cycles for AxJJStrLen3
66 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen1
72 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
------- timings, misaligned, 5 byte string -------
20 cycles for szLen
25 cycles for AxStrLenSSE1
27 cycles for AxJJStrLen1
23 cycles for AxJJStrLen2
19 cycles for AxJJStrLen3
19 cycles for szLen
27 cycles for AxStrLenSSE1
27 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
19 cycles for AxJJStrLen3
19 cycles for szLen
25 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
18 cycles for AxJJStrLen3
19 cycles for szLen
26 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen1
24 cycles for AxJJStrLen2
19 cycles for AxJJStrLen3
--- ok ---
Alex
For lingo's late fix for short string support (Microsoft's recommendation: don't make code good and reliable in the first release, but ship patches and SPs a few days after the initial version).
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxJJStrLen1 78 total bytes for AxJJStrLen1
77 code size for AxJJStrLen2 78 total bytes for AxJJStrLen2
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
79 code size for StrLenLingo 79 total bytes for StrLenLingo
------- timings, misaligned, 100 byte string -------
66 cycles for AxStrLenSSE1
73 cycles for AxJJStrLen1
84 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
57 cycles for StrLenLingo
67 cycles for AxStrLenSSE1
77 cycles for AxJJStrLen1
72 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
74 cycles for StrLenLingo
69 cycles for AxStrLenSSE1
72 cycles for AxJJStrLen1
70 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
56 cycles for StrLenLingo
66 cycles for AxStrLenSSE1
72 cycles for AxJJStrLen1
70 cycles for AxJJStrLen2
69 cycles for AxJJStrLen3
55 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
19 cycles for szLen
25 cycles for AxStrLenSSE1
26 cycles for AxJJStrLen1
23 cycles for AxJJStrLen2
19 cycles for AxJJStrLen3
14 cycles for StrLenLingo
20 cycles for szLen
27 cycles for AxStrLenSSE1
26 cycles for AxJJStrLen1
23 cycles for AxJJStrLen2
18 cycles for AxJJStrLen3
15 cycles for StrLenLingo
19 cycles for szLen
25 cycles for AxStrLenSSE1
29 cycles for AxJJStrLen1
23 cycles for AxJJStrLen2
18 cycles for AxJJStrLen3
14 cycles for StrLenLingo
19 cycles for szLen
26 cycles for AxStrLenSSE1
27 cycles for AxJJStrLen1
28 cycles for AxJJStrLen2
18 cycles for AxJJStrLen3
14 cycles for StrLenLingo
--- ok ---
Still big timings on CPUs other than HIS... :green2
Considering that the proc ought to be twice as fast (because it uses two regs)... :green2
Alex
Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:
Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?
Quote from: jj2007 on August 23, 2010, 11:28:45 PM
Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:
Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?
This is Celeron D, Prescott (310)
Alex
P.S. What timings does my latest code (93xxx.zip) get?
This is a big post, written in poor English (my English is about as good as Lingo's Russian), but PLEASE read it in full, think about it, and tell me whether I am right or not.
Lingo's
Quote
I can't steal anything from you or from the asian lamer, simply because you have no ideas in assembly, hence your code will always be very ugly and slow...
Really? How about this:
Lingo's "abs" function:
pop ecx
pop eax
cdq
xor eax,edx <---
sub eax,edx <---
jmp ecx
My Axa2l function, which uses a unified approach to converting a signed/unsigned string to long/dword:
@done:
add eax,edi <---
xor eax,edi <---
pop edi
ret
This is the same code, so Lingo stole it. I admit that this is not a hard algo. But he also writes very simple algos, and he HAS NO RIGHT to say that somebody steals something from him.
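For readers who want to see why the two instruction pairs marked above are equivalent, here is a small sketch in C. It assumes that the second register (edx after cdq, or edi in the Axa2l tail) holds a sign mask of 0 for non-negative and -1 for negative values; that is certain for cdq, but only an assumption for edi, since the rest of Axa2l is not shown. Unsigned arithmetic is used so that the wrap-around is well defined in C.

```c
#include <stdint.h>

/* 0 for x >= 0, 0xFFFFFFFF for x < 0: the value cdq leaves in edx. */
static uint32_t sign_mask(int32_t x)
{
    return (x < 0) ? 0xFFFFFFFFu : 0u;
}

/* Models: cdq / xor eax,edx / sub eax,edx */
static uint32_t abs_xor_sub(int32_t x)
{
    uint32_t m = sign_mask(x);
    return ((uint32_t)x ^ m) - m;
}

/* Models: add eax,edi / xor eax,edi, with edi assumed to be 0 or -1 */
static uint32_t abs_add_xor(int32_t x)
{
    uint32_t m = sign_mask(x);
    return ((uint32_t)x + m) ^ m;
}
```

With a mask of 0 both forms leave the value unchanged; with a mask of -1 the first computes NOT(x) + 1 and the second NOT(x - 1), and both are the two's complement negation of x.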
But Lingo, as a stupid new-fangled programmer, doesn't know that SUB has been slower for technical (electrical) reasons since the first computers (BIG assemblies of registers built from valves or relays). It seems that Lingo stole his Electrical Engineer degree certificate somewhere.
Nowadays, and for a long time already, sub is almost as fast as add, and the difference is not measurable. Still, it is good practice to use ADD instead of SUB if you can. Lingo doesn't know this practice, so he is not a good programmer; he is a lamer, because only lamers think that they are Grandmasters. Thoughtful people know their own drawbacks and don't claim to be "the best".
Another argument: many people have talked to him about returning from a proc via "jmp ...". This is the stupid practice of a lamer programmer.
For example, I am not the only one who has told him this;
dioxin did too:
Quote
You aren't really advocating that, for speed, you should pop a return address off the stack and jump to it, are you? That messes up the branch prediction mechanism which has short cuts for a paired CALL-RETURN.
This is good advice, and perfectly right.
But Lingo doesn't listen to good advice - he thinks that he is the Grandmaster, without compromise.
So this is HIS PROBLEM ONLY. But I don't understand why the mental hospital in Toronto, where Lingo is located, has an Internet connection :P
Hutch, why do you allow such a stupid lamer as Lingo on your forum? He only discredits this nice forum. The forum has people (I don't mean myself, I mean many, many others) whose experience is beyond comparison with his, yet he calls them "stupid", "thieves", "tolerated" etc.?
His techniques DO NOT CONTAIN anything unbelievable. All of his techniques have been KNOWN for a long time.
For example, maybe somebody doesn't know why Lingo uses this code in his SSE hex2dword:
pshufw mm0,[eax], 01Bh <--- this...
pxor mm2, mm2
pshufw mm1,mm0, 0E4h <--- and this
paddb mm0, maskD0h
Firstly, pshufw may be faster than movq, and in this case the instruction also reverses the word order (1Bh).
Secondly, copying one MMX reg to another is faster using pshufw with the direct-copy encoding (0E4h). On the P6 family, for example, "pshufw MMx,MMx" is 3(!) times faster than "movq MMx,MMx".
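To make the two immediates above concrete, here is a small model of pshufw in C. It is only a sketch: it treats the 64-bit MMX register as an array of four 16-bit words (index 0 = lowest word), and the function name pshufw_model is invented for this example. Each 2-bit field of the immediate selects which source word goes into the corresponding destination word.

```c
#include <stdint.h>

/* Model of pshufw on four 16-bit words: destination word i receives
   source word number ((imm8 >> (2*i)) & 3). A temporary is used so
   that dst and src may be the same array. */
static void pshufw_model(uint16_t dst[4], const uint16_t src[4], uint8_t imm8)
{
    uint16_t tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

With 1Bh (binary 00 01 10 11) the words come out in reverse order; with 0E4h (binary 11 10 01 00) every word selects itself, so the instruction acts as a plain copy.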
So if anybody takes a close look at Lingo's code, it will NOT appear to be anything great, unknown or unbelievable.
His coding style is stale (it smells of a depraved youth) and contains nothing to steal, because Lingo did not invent these techniques.
Lingo, can you file an application for your *unique algo* at any respectable patent office? Or will you be ridiculed there? The latter is the most likely, sorry. So don't tell us how great you are and what "unique" algos you produce.
He swaggers about his fast "great" code only because he has a fast CPU, and because Core+ is tolerant of some of his lamer techniques (like return via jump and many others).
As everyone saw, his code doesn't get outstanding results on non-high-end CPUs.
Throughout human history, what is great and respectable is making something good from something bad or cheap (like Ford for the US, or Diesel for Europe and the world).
It is no great thing to make something good from something nice.
So can ANYBODY say that he is a great programmer? If yes, there is some similarity to Lingo in that.
So Lingo is an ORDINARY man (who can read Intel's manuals), not the God of assembly. Why does he take liberties (insults etc.) that even Administrators and Moderators don't take? Who is he to do this?
Further: anticipating responses from other people, perhaps with a "thinking engine" similar to Lingo's, I will say why my code is the way it is. The answer is short: it works NO worse than Lingo's "great, nice, fast, unreliable" code, so what is there to talk about? Nowadays I write for myself and my work only, I write for CPUs of one type, and I have no need to write other code.
But this does not mean that I am an "asian lamer" and Lingo is greater. Lingo also writes for his CPU only, but his code works badly on other CPUs, and mine does not. If I don't use some optimization techniques, it is because they DON'T offer anything useful to me at this time.
I welcome reports about poor performance on other CPU architectures, and I try to make it better.
Lingo doesn't listen to any reports (if they are not good for him) and doesn't listen to any comments. If his code works badly, he says that it is because the test computer is old and slow, and because the owner of that computer is a lamer. Maybe Lingo is a madman?
Insults and the like are NOT interesting to me and have NO meaning to me. I wrote this post only to open the eyes of people who think that Lingo is an unbeatable champion.
And Lingo's insults mean nothing to me, because I treat them as the talk of a madman.
If somebody thinks that I wrote this post to show how nice and great I am, they are mistaken.
I wrote this post because Lingo's behaviour towards other members is inexcusable.
I don't like offensive speech and never use it with people, but I am making one big exception, for Lingo only.
My apologies to everyone else!
Alex
Alex - don't let lingo get on your nerves
most of us just ignore him :lol
Quote from: dedndave on August 24, 2010, 08:54:21 PM
Alex - don't let lingo get on your nerves
most of us just ignore him :lol
I'm a balanced man. No Lingo, nor 1,000,000 Lingo clones, can wind me up. I worry about the forum and its members,
because it is not right to call members offensive names without any provocation from their side.
Alex
it's fun to pick on lingo
he gets upset so easily
i can visualize the blood vessels popping out on his neck - lol
Hi Alex,
Dave is right - don't let Rumpelstilzchen spoil your peaceful coding nights. He is good at writing highly specialised (and highly unrolled) algos, although it's always a pain to find out under which conditions they raise exceptions :green2
Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
85 code size for AxStrLenSSE1a 88 total bytes for AxStrLenSSE1a
84 code size for AxStrLenSSE1j 84 total bytes for AxStrLenSSE1j
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
------- timings -------
132 cycles for szLen
33 cycles for AxStrLenSSE1a
33 cycles for AxStrLenSSE1J
31 cycles for AxJJStrLen3
Quote from: jj2007 on August 24, 2010, 09:36:11 PM
Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:
Jochen, I think you should not use a global variable, because this makes the proc non-reentrant => it doesn't support multi-threaded applications. 1-2 clocks don't matter, I think.
Alex
I am still confident that Rumpelstilzchen cannot make anything really great and unique. Except, maybe, a GREAT speech about his "GREAT" code :green2
Alex
Quote from: Antariy on August 24, 2010, 09:42:40 PM
Jochen, I think you should not use a global variable, because this makes the proc non-reentrant => it doesn't support multi-threaded applications.
That is correct, thanks for reminding me. Here are multi-thread and SSE2-safe variants 4+5:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
77 code size for AxJJStrLen1 78 total bytes for AxJJStrLen1
77 code size for AxJJStrLen2 78 total bytes for AxJJStrLen2
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
34 cycles for AxStrLenSSE1
31 cycles for AxJJStrLen3
34 cycles for AxJJStrLen4
33 cycles for AxJJStrLen5
34 cycles for AxStrLenSSE1
31 cycles for AxJJStrLen3
33 cycles for AxJJStrLen4
33 cycles for AxJJStrLen5
34 cycles for AxStrLenSSE1
31 cycles for AxJJStrLen3
33 cycles for AxJJStrLen4
33 cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
12 cycles for szLen
15 cycles for AxStrLenSSE1
7 cycles for AxJJStrLen3
10 cycles for AxJJStrLen4
10 cycles for AxJJStrLen5
12 cycles for szLen
12 cycles for AxStrLenSSE1
7 cycles for AxJJStrLen3
10 cycles for AxJJStrLen4
10 cycles for AxJJStrLen5
12 cycles for szLen
16 cycles for AxStrLenSSE1
7 cycles for AxJJStrLen3
10 cycles for AxJJStrLen4
10 cycles for AxJJStrLen5
"The thieves are always liars"="Воры всегда лжецы"
Ugly spaghetti+lamer tubeteikin's code = slow.,..slow..slow...non me ne frega un cazzo... :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
79 code size for StrLenLingo 79 total bytes for StrLenLingo
20 cycles for AxStrLenSSE1
20 cycles for AxJJStrLen3
23 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
20 cycles for AxStrLenSSE1
20 cycles for AxJJStrLen3
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
20 cycles for AxStrLenSSE1
20 cycles for AxJJStrLen3
23 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
21 cycles for AxStrLenSSE1
20 cycles for AxJJStrLen3
23 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
8 cycles for AxStrLenSSE1
3 cycles for AxJJStrLen3
5 cycles for AxJJStrLen4
5 cycles for AxJJStrLen5
1 cycles for StrLenLingo
15 cycles for AxStrLenSSE1
4 cycles for AxJJStrLen3
6 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
1 cycles for StrLenLingo
8 cycles for AxStrLenSSE1
3 cycles for AxJJStrLen3
5 cycles for AxJJStrLen4
5 cycles for AxJJStrLen5
1 cycles for StrLenLingo
15 cycles for AxStrLenSSE1
4 cycles for AxJJStrLen3
6 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
2 cycles for StrLenLingo
--- ok ---
WARNING: The algo "StrLenLingo" posted above by Lingo raises an exception when used with short strings that happen to start near the end of a buffer allocated with VirtualAlloc (it will also happen with HeapAlloc buffers, but only in rare cases - the typical "impossible to chase bug").
Unlike all other algos, it does not preserve ecx and xmm0.
On the positive side: It is a tick faster than the others on certain CPUs. Bravo, Lingo :cheekygreen:
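To illustrate the kind of failure the warning above is about: a wide read that starts close to the end of the last committed page can touch the following, uncommitted page. The sketch below, in C, is only a model of that address arithmetic and not an analysis of StrLenLingo itself; the 4 KB page size and the example addresses are assumptions, and both function names are invented here.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KB pages, as on 32-bit x86 Windows */

/* Nonzero if reading len bytes (len >= 1) starting at addr
   touches more than one page. */
static int read_crosses_page(uintptr_t addr, size_t len)
{
    uintptr_t first = addr / PAGE_BYTES;
    uintptr_t last  = (addr + len - 1) / PAGE_BYTES;
    return first != last;
}

/* Round an address down to a 16-byte boundary, as algos that
   use aligned 16-byte reads do before their first load. */
static uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

For a 5-byte string placed 6 bytes before a page end, an unaligned 16-byte read from the string start crosses into the next page, while a 16-byte read from the rounded-down address stays inside the page. A 32-byte read from a 16-byte aligned address can still cross, which is one plausible reason why wider unrolled loops need extra care near the end of a buffer.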
"I ladri sono sempre bugiardi"
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."
Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"
Do you mean : "Pick one, get two"? It sounds like a commercial ad :lol
Quote from: lingo on August 25, 2010, 03:50:59 PM
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."
Well, not ALL CPUs. There is this rare exception of the so-called "P4":
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 5 byte string -------
100 cycles for szLen
30 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen3
28 cycles for AxJJStrLen4
28 cycles for AxJJStrLen5
32 cycles for StrLenLingo (unsafe)
:green2
"I ladri sono sempre bugiardi"
is equal to:
"The thieves are always liars"
is equal to:
"Воры всегда лжецы"
is equal to:
"Les voleurs sont toujours menteurs" :lol
I can't decide, they are so close. Any diverging results on other CPUs?
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
132 cycles for szLen
34 cycles for AxStrLenSSE1
33 cycles for AxJJStrLen4
33 cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
31 cycles for szLen
19 cycles for AxStrLenSSE1
17 cycles for AxJJStrLen4
17 cycles for AxJJStrLen5
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, 16-byte aligned, 100 byte string -------
252 szLen
61 AxStrLenSSE1
69 AxJJStrLen4
69 AxJJStrLen5
------- timings, misaligned, 100 byte string -------
91 szLen
29 AxStrLenSSE1
28 AxJJStrLen4
28 AxJJStrLen5
Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."
Wow, lamer Lingo is making noise again:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
79 code size for AxJJStrLen3 80 total bytes for AxJJStrLen3
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
79 code size for StrLenLingo 79 total bytes for StrLenLingo
66 cycles for AxStrLenSSE1
69 cycles for AxJJStrLen3
72 cycles for AxJJStrLen4
71 cycles for AxJJStrLen5
56 cycles for StrLenLingo
65 cycles for AxStrLenSSE1
68 cycles for AxJJStrLen3
72 cycles for AxJJStrLen4
71 cycles for AxJJStrLen5
56 cycles for StrLenLingo
64 cycles for AxStrLenSSE1
68 cycles for AxJJStrLen3
72 cycles for AxJJStrLen4
71 cycles for AxJJStrLen5
55 cycles for StrLenLingo
65 cycles for AxStrLenSSE1
67 cycles for AxJJStrLen3
74 cycles for AxJJStrLen4
71 cycles for AxJJStrLen5
55 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
26 cycles for AxStrLenSSE1
18 cycles for AxJJStrLen3
21 cycles for AxJJStrLen4
21 cycles for AxJJStrLen5
14 cycles for StrLenLingo
26 cycles for AxStrLenSSE1
18 cycles for AxJJStrLen3
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
14 cycles for StrLenLingo
26 cycles for AxStrLenSSE1
19 cycles for AxJJStrLen3
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
14 cycles for StrLenLingo
26 cycles for AxStrLenSSE1
18 cycles for AxJJStrLen3
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
14 cycles for StrLenLingo
--- ok ---
56 vs 66 is 45%??? Lingo, you must have bought all your certificates.
Maybe because you don't really live in Toronto? I doubt that Toronto gives citizenship to lamers like you.
56 is faster than 66 by 15%, relative to 66. So you aren't able to calculate such a simple thing? This is great, really.
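To make the arithmetic explicit, here is a small sketch in C of the two usual ways to express such a difference. The function names are made up for illustration; only the 56 and 66 cycle counts come from the timings above.

```c
/* Cycles saved, as a percentage of the slower routine's cycle count. */
static double percent_fewer_cycles(double slow_cycles, double fast_cycles)
{
    return 100.0 * (slow_cycles - fast_cycles) / slow_cycles;
}

/* Throughput gain: how much more work the faster routine does per cycle. */
static double percent_more_throughput(double slow_cycles, double fast_cycles)
{
    return 100.0 * (slow_cycles / fast_cycles - 1.0);
}
```

With 66 and 56 cycles the first gives about 15% and the second about 18%; neither convention gets anywhere near 45% for this pair of numbers.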
I wonder what speed it will have on a PIII and older.
Alex
P.S. All your translations could have been done by machine, it is simple text. Try to translate this: Линго, ты ЧМО !!! Когда ты переехал в Торонто, чмошник-недоучка?
Hi, Jochen!
Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
267 cycles for szLen
65 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen4
73 cycles for AxJJStrLen5
266 cycles for szLen
65 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen4
70 cycles for AxJJStrLen5
264 cycles for szLen
64 cycles for AxStrLenSSE1
72 cycles for AxJJStrLen4
71 cycles for AxJJStrLen5
261 cycles for szLen
64 cycles for AxStrLenSSE1
72 cycles for AxJJStrLen4
70 cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
99 cycles for szLen
30 cycles for AxStrLenSSE1
29 cycles for AxJJStrLen4
29 cycles for AxJJStrLen5
95 cycles for szLen
31 cycles for AxStrLenSSE1
29 cycles for AxJJStrLen4
28 cycles for AxJJStrLen5
93 cycles for szLen
31 cycles for AxStrLenSSE1
29 cycles for AxJJStrLen4
29 cycles for AxJJStrLen5
95 cycles for szLen
30 cycles for AxStrLenSSE1
29 cycles for AxJJStrLen4
29 cycles for AxJJStrLen5
--- ok ---
Alex
GrandLamer Lingo's:
Quote
"The thieves are always liars"="Воры всегда лжецы"
Until you get a patent for your lamer's algos, you cannot say this, because all your techniques are stolen from Intel's examples. And you turn Intel's not very good tutorials into awfully bad and ugly lamer's code :toothy
I doubt that any respectable office would grant you a patent, except maybe the patent "The Napoleon of Toronto's Central Mental Hospital", or something like that.
Alex
Quote from: Antariy on August 25, 2010, 09:15:51 PM
Hi, Jochen!
Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
65 cycles for AxStrLenSSE1
74 cycles for AxJJStrLen4
73 cycles for AxJJStrLen5
Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:
pcmpeqb xmm0, [edx] ; there may be nullbytes before the real string
pmovmskb eax, xmm0
shr eax, cl ; shift their bits out
; bsf eax, eax ############# not necessary
jnz @ret1
..
@ret: mov ecx, [esp]
movaps xmm0, [ecx]
mov ecx, [esp+4]
add esp, Extra
ret 4
@ret1: bsf eax, eax ; good on P4 but disastrous on Celeron M
mov ecx, [esp]
movaps xmm0, [ecx]
mov ecx, [esp+4]
add esp, Extra
ret 4
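For readers who do not think in SSE mnemonics, the pcmpeqb / pmovmskb / shr / bsf sequence can be sketched in C as follows. This is an illustration only, not jj's routine: it handles just the first aligned 16-byte block, the function names are invented, and it falls back to a plain loop when SSE2 intrinsics are not available.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* 16-bit mask with bit i set when block[i] == 0 (pcmpeqb + pmovmskb).
   block must be 16-byte aligned. */
static unsigned zero_mask16(const unsigned char *block)
{
#if defined(__SSE2__)
    __m128i data = _mm_load_si128((const __m128i *)block);
    __m128i eq   = _mm_cmpeq_epi8(data, _mm_setzero_si128());
    return (unsigned)_mm_movemask_epi8(eq);
#else
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (block[i] == 0)
            mask |= 1u << i;
    return mask;
#endif
}

/* Index of the lowest set bit; mask must be non-zero (what bsf computes). */
static unsigned lowest_set_bit(unsigned mask)
{
    unsigned index = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* Length of the string at s if its terminator lies in the first aligned
   16-byte block, otherwise -1 (a real routine would go on to the next
   block). */
static int strlen_first_block(const char *s)
{
    unsigned misalign = (unsigned)((uintptr_t)s & 15u);
    const unsigned char *block = (const unsigned char *)s - misalign;

    /* Zero bytes in front of the string set low mask bits; the shift
       drops them, like "shr eax, cl" in the listing. */
    unsigned mask = zero_mask16(block) >> misalign;
    if (mask == 0)
        return -1;
    return (int)lowest_set_bit(mask);   /* bsf eax, eax */
}
```

After the shift, bit k of the mask corresponds to the byte at s+k, so the index of the lowest set bit is already the string length. That is why the bsf can be postponed to the exit path but not dropped.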
Quote from: jj2007 on August 25, 2010, 09:32:30 PM
Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:
This does not matter much: 6 clocks. Without preserving ecx and xmm0, this proc would be 66 bytes long and run in 56 clocks (with the end-of-buffer check still in place).
BSF is a (relatively) very slow instruction, but on my CPU the overhead of the check, the branch and the exit gives the same speed and only makes the code bigger.
Alex
BSF is a little slow - and clumsy to use, too
but - i have tried the alternatives without much luck
i am using a P4 - it's a slow instruction
i think it is faster on newer CPUs, Alex
even though that doesn't help us much, it makes the code look good in the forum tests :bg
"BSF is a little slow - and clumsy to use"
With new technologies we don't need BSF :lol...just take a look:
align 16
Newstrlen_sse4_2:
movdqa xmm7, notnul
pop ecx
pop eax
mov edx, eax
@@:
pcmpistri xmm7, [eax], 14h
lea eax, [eax+16]
jnz @b
sub eax, edx
lea eax, [eax+ecx-16]
jmp dword ptr [esp-2*4]
Hi, unhappy losers...slow.slow..again and again... :lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
79 code size for StrLenLingo 79 total bytes for StrLenLingo
110 cycles for szLen
20 cycles for AxStrLenSSE1
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
110 cycles for szLen
40 cycles for AxStrLenSSE1
23 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
110 cycles for szLen
20 cycles for AxStrLenSSE1
23 cycles for AxJJStrLen4
21 cycles for AxJJStrLen5
12 cycles for StrLenLingo
111 cycles for szLen
49 cycles for AxStrLenSSE1
22 cycles for AxJJStrLen4
22 cycles for AxJJStrLen5
12 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
7 cycles for szLen
8 cycles for AxStrLenSSE1
5 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
1 cycles for StrLenLingo
7 cycles for szLen
40 cycles for AxStrLenSSE1
6 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
2 cycles for StrLenLingo
7 cycles for szLen
8 cycles for AxStrLenSSE1
6 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
1 cycles for StrLenLingo
7 cycles for szLen
42 cycles for AxStrLenSSE1
6 cycles for AxJJStrLen4
6 cycles for AxJJStrLen5
1 cycles for StrLenLingo
--- ok ---
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol
Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown
Hi, stupid lingo!
It is no wonder that your algos are fast: you test them only in the best cases.
For example, part of Paul Dixon's algo, stolen and "optimized" by you:
Quote
.data
align 16
Src dd 100 Dup(0)
Num dd 0FFFFFFFFh,0 <-------------------------- THIS!!!
vaPtr dd 0
align 16
chartabL dw "00","10","20","30","40","50","60","70","80","90"
dw "01","11","21","31","41","51","61","71","81","91"
dw "02","12","22","32","42","52","62","72","82","92"
dw "03","13","23","33","43","53","63","73","83","93"
dw "04","14","24","34","44","54","64","74","84","94"
dw "05","15","25","35","45","55","65","75","85","95"
dw "06","16","26","36","46","56","66","76","86","96"
dw "07","17","27","37","47","57","67","77","87","97"
dw "08","18","28","38","48","58","68","78","88","98"
dw "09","19","29","39","49","59","69","79","89","99"
Very nice, you test with the best case: only the sign "-" and a "1". The result is "-1", the fastest conversion. Your stupid proc is not faster than Hutch's with 2^31 :bdg on a CPU that is NOT yours, of course :toothy
If you are such a lamer that you cannot make anything good without "new technologies", then no comment. The posted code cannot be compared with the other code, because it DOESN'T do all the things that are needed for a fair comparison. If you are not able to write a proc with the SAME characteristics, shut up. You were very unsatisfied when drizz's algo "without functionality" beat your stupid simple unrolled algo. But you are satisfied with YOUR OWN stupid algo without functionality. This is not strange, this is funny, because you are a lamer.
Alex
Ugly spaghetti+lamer tubeteikin's code = slow...slow..slow...I don't give a damn. :lol
Quote from: jj2007 on August 26, 2010, 09:22:49 PM
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol
Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown
No, lingo is a ЧМО !!!
This is a nice Russian term for people like lingo :bdg
Alex
P.S. Lingo, show us any normal app of yours, or can you only improve and optimize existing algos of other people?
By normal app I mean any app of yours that works for 5 minutes without crashing, and that can do something other than just printing your clocks :green2
Hi, folks!
I found one nice rule: stupid lingo doesn't like it when some code beats his code without having as much functionality as his code. But when his own lamer's code has no functionality, he thinks that this is great, that it is part of the algo. And when somebody tells him about this, he gets very unsatisfied and insults other people.
Lingo is an inadequate man, a madman.
Stupid GrandLamer lingo, where do you live? Not in Toronto, is it?
Don't worry, we won't tell anybody why you are ugly and wretched. Confess to us :green2
Alex
i think he's in Germany
Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany
Right - if he lived in Toronto, his English would be a lot better.
i remember hearing somewhere in another thread that his wife doesn't like him, either :bdg
Quote from: jj2007 on August 26, 2010, 10:28:54 PM
Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany
Right - if he lived in Toronto, his English would be a lot better.
I see European phrase construction in his text. It is easy to recognize.
Alex
:bg
Come on guys, Lingo is OK, he just has a charming turn of phrase. :P
Quote from: hutch-- on August 30, 2010, 04:20:03 AM
:bg
Come on guys, Lingo is OK, he just has a charming turn of phrase. :P
If being inadequate and an upstart counts as OK, then it is true: this someone is OK :P
Alex
AMD Sempron(tm) Processor 3100+ (SSE3)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
79 code size for StrLenLingo 79 total bytes for StrLenLingo
139 cycles for szLen
66 cycles for AxStrLenSSE1
66 cycles for AxJJStrLen4
65 cycles for AxJJStrLen5
45 cycles for StrLenLingo
139 cycles for szLen
62 cycles for AxStrLenSSE1
66 cycles for AxJJStrLen4
65 cycles for AxJJStrLen5
45 cycles for StrLenLingo
139 cycles for szLen
67 cycles for AxStrLenSSE1
66 cycles for AxJJStrLen4
65 cycles for AxJJStrLen5
44 cycles for StrLenLingo
139 cycles for szLen
62 cycles for AxStrLenSSE1
67 cycles for AxJJStrLen4
66 cycles for AxJJStrLen5
45 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
8 cycles for szLen
29 cycles for AxStrLenSSE1
24 cycles for AxJJStrLen4
23 cycles for AxJJStrLen5
23 cycles for StrLenLingo
8 cycles for szLen
23 cycles for AxStrLenSSE1
24 cycles for AxJJStrLen4
23 cycles for AxJJStrLen5
23 cycles for StrLenLingo
8 cycles for szLen
29 cycles for AxStrLenSSE1
25 cycles for AxJJStrLen4
23 cycles for AxJJStrLen5
23 cycles for StrLenLingo
8 cycles for szLen
23 cycles for AxStrLenSSE1
24 cycles for AxJJStrLen4
23 cycles for AxJJStrLen5
23 cycles for StrLenLingo
New CPU with SSE 4.2 and new results... :lol
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (SSE4)
97 code size for AxStrLenSSE1 104 total bytes for AxStrLenSSE1
88 code size for AxJJStrLen4 88 total bytes for AxJJStrLen4
96 code size for AxJJStrLen5 96 total bytes for AxJJStrLen5
82 code size for pcmpistriLingo 82 total bytes for pcmpistriLingo
79 code size for StrLenLingo 79 total bytes for StrLenLingo
77 cycles for szLen
14 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen4
14 cycles for AxJJStrLen5
11 cycles for pcmpistriLingo
8 cycles for StrLenLingo
78 cycles for szLen
13 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen4
13 cycles for AxJJStrLen5
11 cycles for pcmpistriLingo
9 cycles for StrLenLingo
77 cycles for szLen
14 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen4
15 cycles for AxJJStrLen5
11 cycles for pcmpistriLingo
8 cycles for StrLenLingo
78 cycles for szLen
13 cycles for AxStrLenSSE1
15 cycles for AxJJStrLen4
13 cycles for AxJJStrLen5
11 cycles for pcmpistriLingo
8 cycles for StrLenLingo
------- timings, misaligned, 5 byte string -------
4 cycles for szLen
3 cycles for AxStrLenSSE1
3 cycles for AxJJStrLen4
4 cycles for AxJJStrLen5
-1 cycles for pcmpistriLingo
-1 cycles for StrLenLingo
5 cycles for szLen
3 cycles for AxStrLenSSE1
3 cycles for AxJJStrLen4
3 cycles for AxJJStrLen5
0 cycles for pcmpistriLingo
-1 cycles for StrLenLingo
4 cycles for szLen
2 cycles for AxStrLenSSE1
4 cycles for AxJJStrLen4
3 cycles for AxJJStrLen5
0 cycles for pcmpistriLingo
-1 cycles for StrLenLingo
4 cycles for szLen
2 cycles for AxStrLenSSE1
3 cycles for AxJJStrLen4
3 cycles for AxJJStrLen5
-1 cycles for pcmpistriLingo
-1 cycles for StrLenLingo
--- ok ---
this string length algorithm is the fastest one i've found that uses no sse instructions. i've timed it against masm32's fast StrLen and it is slightly faster, plus there is still room for optimization :wink
btw i found this algo in one of randy hyde's books, originally written in hla. the hla source code is public domain
StrLength PROC USES esi, buf:DWORD
mov esi, buf
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
IsAligned:
sub esi, 32
lbl1:
add esi, 32
lbl2:
mov eax, [esi]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero0
mov eax, [esi+4]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero4
mov eax, [esi+8]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero8
mov eax, [esi+12]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero12
mov eax, [esi+16]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero16
mov eax, [esi+20]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero20
mov eax, [esi+24]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero24
mov eax, [esi+28]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jz lbl1
add esi, 28
jmp MightBeZero0
MightBeZero4:
add esi, 4
jmp MightBeZero0
MightBeZero8:
add esi, 8
jmp MightBeZero0
MightBeZero12:
add esi, 12
jmp MightBeZero0
MightBeZero16:
add esi, 16
jmp MightBeZero0
MightBeZero20:
add esi, 20
jmp MightBeZero0
MightBeZero24:
add esi, 24
MightBeZero0:
mov eax, [esi]
cmp al, 0
je done
cmp ah, 0
je done1
test eax, 0FF0000h
je done2
test eax, 0FF000000h
je done3
add esi, 4
jmp lbl2
done3:
sub esi, buf
lea eax, [esi+3]
jmp @F
done2:
sub esi, buf
lea eax, [esi+2]
jmp @F
done1:
sub esi, buf
lea eax, [esi+1]
jmp @F
done:
mov eax, esi
sub eax, buf
@@:
ret
StrLength ENDP
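The idea behind StrLength, stripped of the 8x unrolling, can be sketched in C roughly like this. It is a simplified port for illustration, not a translation of the listing line by line, and the function names are invented. Like the assembly, it may read up to 3 bytes past the terminator inside the same aligned dword, which strict C does not allow for arbitrary objects, so the sketch assumes the buffer is padded to a multiple of 4 bytes.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Non-zero if some byte of x MIGHT be zero. Never misses a real zero
   byte, but also fires for 0x80 bytes and, through the borrow, for
   some bytes just above a zero or 0x80 byte. */
static uint32_t might_have_zero(uint32_t x)
{
    return ((x & 0x7F7F7F7Fu) - 0x01010101u) & 0x80808080u;
}

static size_t strlen_words(const char *s)
{
    const char *p = s;

    /* Byte steps until p is 4-byte aligned (the prologue of the listing). */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t word;
        memcpy(&word, p, sizeof word);   /* aligned 4-byte load */
        if (might_have_zero(word)) {
            /* Confirm byte by byte (MightBeZero0 in the listing). */
            for (int i = 0; i < 4; i++)
                if (p[i] == '\0')
                    return (size_t)(p - s) + (size_t)i;
        }
        p += 4;
    }
}
```

The filter is cheap but not exact, which is why the labels in the listing are called MightBeZero: a hit only means that the four bytes have to be checked one by one, and the loop continues if none of them is really zero.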
"this string length algorithm is the fastest one i've found that uses no sse instructions."
My algo (without SSE) is faster: :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
StrLenLingo proc lpszStr:DWORD
mov edx, esp
mov esp, [esp+1*4]
pop eax
@@Loop:
sub eax, 1010101h
pop ecx
sub ecx, 1010101h
test eax, 80808080h
pop eax
jne @f
@@LoopCont:
test ecx, 80808080h
je @@Loop
test byte ptr [esp-8], 0FFh
je mi8
test byte ptr [esp-7], 0FFh
je mi7
test byte ptr [esp-6], 0FFh
je mi6
test byte ptr [esp-5], 0FFh
jne @@Loop
lea eax, [esp-5]
mov esp, edx
sub eax, [edx+1*4]
ret 4
align 8
@@:
test byte ptr [esp-12], 0FFh
jz mi12
test byte ptr [esp-11], 0FFh
jz mi11
test byte ptr [esp-10], 0FFh
jz mi10
test byte ptr [esp-9], 0FFh
jnz @@LoopCont
lea eax, [esp-9]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi8:
lea eax, [esp-8]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi7:
lea eax, [esp-7]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi6:
lea eax, [esp-6]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi12:
lea eax, [esp-12]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi11:
lea eax, [esp-11]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi10:
lea eax, [esp-10]
mov esp, edx
sub eax,[edx+1*4]
ret 4
StrLenLingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
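The two non-SSE routines use slightly different zero-byte filters: StrLength masks the high bits before subtracting, StrLenLingo only subtracts and tests. A small sketch in C, with invented function names, makes the difference easy to try out:

```c
#include <stdint.h>

/* Filter used in StrLength: high bits are masked off first. */
static uint32_t filter_masked(uint32_t x)
{
    return ((x & 0x7F7F7F7Fu) - 0x01010101u) & 0x80808080u;
}

/* Filter used in StrLenLingo: subtract and test only. */
static uint32_t filter_unmasked(uint32_t x)
{
    return (x - 0x01010101u) & 0x80808080u;
}

/* Exact test, byte by byte. */
static int has_zero_byte(uint32_t x)
{
    return (x & 0x000000FFu) == 0 || (x & 0x0000FF00u) == 0 ||
           (x & 0x00FF0000u) == 0 || (x & 0xFF000000u) == 0;
}
```

Neither filter misses a real zero byte. The masked one gives a false alarm for a 0x80 byte, the unmasked one for any byte from 0x81 to 0xFF, so on text with many characters above 0x80 the unmasked version should drop into its byte-by-byte checks far more often. On plain ASCII test strings, like the ones presumably used for the timings, that cost does not show up.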
The results on my CPU i7-2600K are:
28 cycles for agner fog StrLen-masmlib
25 cycles for StrLength
17 cycles for StrLenLingo
--test finished--
Prescott P4:
110 cycles for agner fog StrLen-masmlib
63 cycles for StrLength
83 cycles for StrLenLingo
Hi,
PIII, Win2k.
Steve
Assembling: strlen_a.asm
G:\WORK\TEMP>strlen_a
65 cycles for agner fog StrLen-masmlib
54 cycles for StrLength
65 cycles for StrLenLingo
--test finished--
Compliments to Randy :U
The old idiot can't keep himself under control again.
It seems he has forgotten his pills again... :lol
Intel Core 2 Duo E8500, 3,16 GHz: :lol
49 cycles for agner fog StrLen-masmlib
41 cycles for StrLength
26 cycles for StrLenLingo
--test finished--