The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: jj2007 on August 15, 2010, 09:32:10 PM

Title: StrLen timings needed
Post by: jj2007 on August 15, 2010, 09:32:10 PM
Hi folks,
Could I please have some timings on non-Celerons?
Thanks, jj

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

29      cycles for MbStrLen1
34      cycles for MbStrLen2
34      cycles for MbStrLen3
31      cycles for MbStrLen4a
35      cycles for MbStrLen4b
38      cycles for MbStrLen5
Title: Re: StrLen timings needed
Post by: ecube on August 15, 2010, 09:37:20 PM
AMD Athlon(tm) 64 Processor 3000+ (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

47      cycles for MbStrLen1
48      cycles for MbStrLen2
51      cycles for MbStrLen3
47      cycles for MbStrLen4a
51      cycles for MbStrLen4b
55      cycles for MbStrLen5

47      cycles for MbStrLen1
54      cycles for MbStrLen2
54      cycles for MbStrLen3
52      cycles for MbStrLen4a
58      cycles for MbStrLen4b
53      cycles for MbStrLen5

47      cycles for MbStrLen1
48      cycles for MbStrLen2
52      cycles for MbStrLen3
47      cycles for MbStrLen4a
50      cycles for MbStrLen4b
54      cycles for MbStrLen5

48      cycles for MbStrLen1
54      cycles for MbStrLen2
54      cycles for MbStrLen3
52      cycles for MbStrLen4a
57      cycles for MbStrLen4b
53      cycles for MbStrLen5

48      cycles for MbStrLen1
48      cycles for MbStrLen2
50      cycles for MbStrLen3
47      cycles for MbStrLen4a
52      cycles for MbStrLen4b
55      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: MichaelW on August 15, 2010, 09:50:42 PM
P3:

pre-P4 (SSE1)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
51      cycles for MbStrLen4a
47      cycles for MbStrLen4b
59      cycles for MbStrLen5

45      cycles for MbStrLen1
62      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
56      cycles for MbStrLen5

46      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
56      cycles for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
51      cycles for MbStrLen4a
47      cycles for MbStrLen4b
55      cycles for MbStrLen5

45      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
50      cycles for MbStrLen4a
47      cycles for MbStrLen4b
55      cycles for MbStrLen5
Title: Re: StrLen timings needed
Post by: jj2007 on August 15, 2010, 09:51:29 PM
Thanks. For the curious: I am testing the Intel recommendation for movxxx xmm, mem:
Quote from Intel (http://software.intel.com/en-us/articles/memcpy-performance/), generic optimization of memcpy(): movdqu is suitable for fetching byte-aligned groups of 16 bytes from memory, but not useful for storing them. The Barcelona architecture prefers movaps for stores. movaps, movdqa, and movapd are functionally equivalent, with movaps having shorter encoding.

if 1  ; 4a
movlps qword ptr [esp], xmm0
movhps qword ptr [esp+8], xmm0
else  ; 4b
movdqu [esp], xmm0
endif
...
if 1
movlps xmm0, qword ptr [esp]
movhps xmm0, qword ptr [esp+8]
else
movups xmm0, [esp]
endif


At least for the Celeron and E^cube's AMD, this seems not to be true: The partial lps/hps moves are faster.

(obviously the code does other things, too - the purpose is to efficiently preserve the xmm0 register in a bread-and-butter stringlen algo)
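
For readers who think in C, the two save/restore variants can be sketched with SSE intrinsics. This is only an illustration of what the instructions do, not jj2007's code: the function names are made up, and a compiler is free to emit different instructions than the ones named in the comments, so it says nothing about the timings.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 ones */

/* Variant 4a: store the register as two 8-byte halves (movlps + movhps). */
static void save_halves(unsigned char out[16], __m128i v) {
    __m128 f = _mm_castsi128_ps(v);
    _mm_storel_pi((__m64 *)(void *)out, f);        /* low 8 bytes  */
    _mm_storeh_pi((__m64 *)(void *)(out + 8), f);  /* high 8 bytes */
}

/* Variant 4b: one unaligned 16-byte store (movdqu). */
static void save_whole(unsigned char out[16], __m128i v) {
    _mm_storeu_si128((__m128i *)(void *)out, v);
}

/* Restore from two 8-byte halves (movlps + movhps in load direction). */
static __m128i load_halves(const unsigned char in[16]) {
    __m128 f = _mm_setzero_ps();
    f = _mm_loadl_pi(f, (const __m64 *)(const void *)in);
    f = _mm_loadh_pi(f, (const __m64 *)(const void *)(in + 8));
    return _mm_castps_si128(f);
}

/* Returns 1 if every variant reproduces the 16 source bytes exactly. */
static int round_trip_ok(const unsigned char src[16]) {
    unsigned char a[16], b[16], c[16];
    assert(src != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)(const void *)src);
    save_halves(a, v);
    save_whole(b, v);
    save_whole(c, load_halves(src));
    return memcmp(a, src, 16) == 0 &&
           memcmp(b, src, 16) == 0 &&
           memcmp(c, src, 16) == 0;
}
```

Both variants move the same 16 bytes; the question in this thread is only which one is cheaper on a given CPU.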
Title: Re: StrLen timings needed
Post by: ecube on August 15, 2010, 09:54:53 PM
Off topic, but MichaelW, my CPU is 10+ years old now, I believe, so yours must be ancient. I'm just curious: is that your main one? Also, jj2007, I'm not sure what your plans are, but feel free to take notes on the optimization techniques you discover :U While a lot of stuff is floating around this board, I know people enjoy a single place to read up on such things.
Title: Re: StrLen timings needed
Post by: MichaelW on August 15, 2010, 10:04:47 PM
I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...
Title: Re: StrLen timings needed
Post by: ecube on August 15, 2010, 10:07:30 PM
Quote from: MichaelW on August 15, 2010, 10:04:47 PM
I built my P3 system in 98 or 99, and it's currently my primary system at home. It's still very reliable, but sooner or later...


Heh, wow, what OS? I can't imagine that thing being able to handle Vista; it's a resource pig. I'd be surprised if you said Windows 2k. I myself wanted to stick with it but was forced to upgrade because so much software is XP+ only.
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 15, 2010, 10:13:05 PM
JJ,

Here is my P4:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

38      cycles for MbStrLen1
41      cycles for MbStrLen2
46      cycles for MbStrLen3
37      cycles for MbStrLen4a
45      cycles for MbStrLen4b
41      cycles for MbStrLen5

35      cycles for MbStrLen1
40      cycles for MbStrLen2
41      cycles for MbStrLen3
37      cycles for MbStrLen4a
51      cycles for MbStrLen4b
40      cycles for MbStrLen5

34      cycles for MbStrLen1
45      cycles for MbStrLen2
49      cycles for MbStrLen3
36      cycles for MbStrLen4a
51      cycles for MbStrLen4b
40      cycles for MbStrLen5

33      cycles for MbStrLen1
39      cycles for MbStrLen2
40      cycles for MbStrLen3
39      cycles for MbStrLen4a
43      cycles for MbStrLen4b
47      cycles for MbStrLen5

34      cycles for MbStrLen1
40      cycles for MbStrLen2
65      cycles for MbStrLen3
37      cycles for MbStrLen4a
40      cycles for MbStrLen4b
40      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 15, 2010, 10:20:00 PM
JJ,

Here are my AMD timings:


AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

68      cycles for MbStrLen1
47      cycles for MbStrLen2
66      cycles for MbStrLen3
47      cycles for MbStrLen4a
58      cycles for MbStrLen4b
56      cycles for MbStrLen5

47      cycles for MbStrLen1
37      cycles for MbStrLen2
56      cycles for MbStrLen3
57      cycles for MbStrLen4a
43      cycles for MbStrLen4b
84      cycles for MbStrLen5

47      cycles for MbStrLen1
53      cycles for MbStrLen2
73      cycles for MbStrLen3
47      cycles for MbStrLen4a
52      cycles for MbStrLen4b
70      cycles for MbStrLen5

36      cycles for MbStrLen1
57      cycles for MbStrLen2
54      cycles for MbStrLen3
51      cycles for MbStrLen4a
58      cycles for MbStrLen4b
58      cycles for MbStrLen5

63      cycles for MbStrLen1
47      cycles for MbStrLen2
51      cycles for MbStrLen3
51      cycles for MbStrLen4a
56      cycles for MbStrLen4b
55      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: Rockoon on August 15, 2010, 10:28:06 PM
AMD Phenom(tm) II X6 1055T Processo
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
36      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
39      cycles for MbStrLen5

31      cycles for MbStrLen1
37      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
39      cycles for MbStrLen5

31      cycles for MbStrLen1
34      cycles for MbStrLen2
33      cycles for MbStrLen3
35      cycles for MbStrLen4a
35      cycles for MbStrLen4b
40      cycles for MbStrLen5
Title: Re: StrLen timings needed
Post by: hutch-- on August 16, 2010, 12:15:13 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
21      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
26      cycles for MbStrLen2
29      cycles for MbStrLen3
23      cycles for MbStrLen4a
32      cycles for MbStrLen4b
24      cycles for MbStrLen5

16      cycles for MbStrLen1
23      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
26      cycles for MbStrLen2
29      cycles for MbStrLen3
23      cycles for MbStrLen4a
32      cycles for MbStrLen4b
24      cycles for MbStrLen5

16      cycles for MbStrLen1
21      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: mineiro on August 16, 2010, 01:30:07 AM
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: dancho on August 16, 2010, 07:51:21 AM

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
26      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: Vortex on August 16, 2010, 07:58:43 AM
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

65      cycles for MbStrLen1
66      cycles for MbStrLen2
83      cycles for MbStrLen3
67      cycles for MbStrLen4a
66      cycles for MbStrLen4b
77      cycles for MbStrLen5

64      cycles for MbStrLen1
68      cycles for MbStrLen2
71      cycles for MbStrLen3
66      cycles for MbStrLen4a
66      cycles for MbStrLen4b
73      cycles for MbStrLen5

66      cycles for MbStrLen1
66      cycles for MbStrLen2
72      cycles for MbStrLen3
67      cycles for MbStrLen4a
66      cycles for MbStrLen4b
74      cycles for MbStrLen5

72      cycles for MbStrLen1
66      cycles for MbStrLen2
81      cycles for MbStrLen3
66      cycles for MbStrLen4a
74      cycles for MbStrLen4b
86      cycles for MbStrLen5

64      cycles for MbStrLen1
66      cycles for MbStrLen2
74      cycles for MbStrLen3
66      cycles for MbStrLen4a
66      cycles for MbStrLen4b
79      cycles for MbStrLen5
Title: Re: StrLen timings needed
Post by: jj2007 on August 16, 2010, 09:59:20 AM
Thanks to all of you, that should be enough info :U
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 08:24:52 PM
Note: the results show an "ERROR" line, but it is not a real error. I set the string to repeat 20 times, so 100 bytes * 20 = 2000 bytes total length. The line with the word "ERROR" shows this length, so the value is correct.
I also tested shorter strings, but I was too lazy to change the CodeSize macro each time, so I will state the test string size for comparison with the "ERROR" value :)

Unaligned StrLen, 2000 bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
2000 - ERROR
84       bytes for MbStrLen2
2000 - ERROR
73       bytes for MbStrLen3
2000 - ERROR
80       bytes for MbStrLen4a
2000 - ERROR
71       bytes for MbStrLen4b
2000 - ERROR
78       bytes for MbStrLen5
2000 - ERROR
147      bytes for AxStrLen
2000 - ERROR

920     cycles for MbStrLen1
915     cycles for MbStrLen2
914     cycles for MbStrLen3
912     cycles for MbStrLen4a
916     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

909     cycles for MbStrLen1
912     cycles for MbStrLen2
913     cycles for MbStrLen3
912     cycles for MbStrLen4a
916     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

905     cycles for MbStrLen1
920     cycles for MbStrLen2
913     cycles for MbStrLen3
913     cycles for MbStrLen4a
915     cycles for MbStrLen4b
918     cycles for MbStrLen5

908     cycles for AxStrLen

905     cycles for MbStrLen1
912     cycles for MbStrLen2
913     cycles for MbStrLen3
913     cycles for MbStrLen4a
916     cycles for MbStrLen4b
926     cycles for MbStrLen5

907     cycles for AxStrLen

920     cycles for MbStrLen1
916     cycles for MbStrLen2
915     cycles for MbStrLen3
913     cycles for MbStrLen4a
916     cycles for MbStrLen4b
919     cycles for MbStrLen5

907     cycles for AxStrLen


--- ok ---





Aligned (16 bytes) StrLen, 2000 bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
2000 - ERROR
84       bytes for MbStrLen2
2000 - ERROR
73       bytes for MbStrLen3
2000 - ERROR
80       bytes for MbStrLen4a
2000 - ERROR
71       bytes for MbStrLen4b
2000 - ERROR
78       bytes for MbStrLen5
2000 - ERROR
147      bytes for AxStrLen
2000 - ERROR

903     cycles for MbStrLen1
909     cycles for MbStrLen2
909     cycles for MbStrLen3
906     cycles for MbStrLen4a
915     cycles for MbStrLen4b
915     cycles for MbStrLen5

903     cycles for AxStrLen

903     cycles for MbStrLen1
907     cycles for MbStrLen2
911     cycles for MbStrLen3
905     cycles for MbStrLen4a
914     cycles for MbStrLen4b
913     cycles for MbStrLen5

904     cycles for AxStrLen

902     cycles for MbStrLen1
907     cycles for MbStrLen2
910     cycles for MbStrLen3
905     cycles for MbStrLen4a
914     cycles for MbStrLen4b
913     cycles for MbStrLen5

905     cycles for AxStrLen

902     cycles for MbStrLen1
908     cycles for MbStrLen2
909     cycles for MbStrLen3
906     cycles for MbStrLen4a
913     cycles for MbStrLen4b
914     cycles for MbStrLen5

904     cycles for AxStrLen

903     cycles for MbStrLen1
907     cycles for MbStrLen2
909     cycles for MbStrLen3
905     cycles for MbStrLen4a
913     cycles for MbStrLen4b
914     cycles for MbStrLen5

903     cycles for AxStrLen


--- ok ---




Unaligned, 100 bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

87      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
105     cycles for MbStrLen2
81      cycles for MbStrLen3
88      cycles for MbStrLen4a
73      cycles for MbStrLen4b
93      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
102     cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

67      cycles for AxStrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
89      cycles for MbStrLen3
98      cycles for MbStrLen4a
71      cycles for MbStrLen4b
101     cycles for MbStrLen5

68      cycles for AxStrLen


--- ok ---


Jochen, I really tried to get the best times for all procs. But the results are NOT the same from run to run; they are quite messy: in one run one proc has big timings, in the next run another proc does, then another, and so on. I think this is because your procs have 2 loops (one of them internal): when the string is unaligned, they run and give the biggest timings. When the string is aligned, they do not run, so the timings are much better.
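
One plausible reading of the two-loop structure described above, sketched in C. The MbStrLen sources are not shown in this thread, so the names and layout here are assumptions: a byte-wise head loop runs only until the pointer is 16-byte aligned, then a main loop works on aligned 16-byte blocks (the inner for loop merely stands in for a single SSE2 compare).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes the byte-wise head loop has to step over
   before p reaches a 16-byte boundary (0 if already aligned). */
static size_t head_bytes(const void *p) {
    return (size_t)(-(uintptr_t)p & 15u);
}

/* Hypothetical two-loop string length, for illustration only. */
static size_t two_loop_strlen(const char *s) {
    assert(s != NULL);
    const char *p = s;

    /* Loop 1: executes only for unaligned input, one byte at a time. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == '\0') return (size_t)(p - s);
        ++p;
    }

    /* Loop 2: p is now 16-byte aligned; scan block by block. */
    for (;;) {
        for (int i = 0; i < 16; ++i) {   /* stand-in for one SIMD compare */
            if (p[i] == '\0') return (size_t)(p - s) + (size_t)i;
        }
        p += 16;
    }
}
```

With this layout an aligned string skips loop 1 entirely, while an unaligned one pays for up to 15 single-byte iterations first, which weighs most on short strings.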

Aligned, 100 bytes:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

73      cycles for MbStrLen1
66      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

64      cycles for MbStrLen1
65      cycles for MbStrLen2
68      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

84      cycles for MbStrLen1
65      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

65      cycles for MbStrLen1
66      cycles for MbStrLen2
68      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen

83      cycles for MbStrLen1
65      cycles for MbStrLen2
86      cycles for MbStrLen3
65      cycles for MbStrLen4a
65      cycles for MbStrLen4b
96      cycles for MbStrLen5

64      cycles for AxStrLen


--- ok ---


As you can see, if the string is aligned, your shortest proc sometimes has timings close to my proc (which is 2.53 times longer).
My proc has more stable timings, but its size is the "price" of that.


Next, timings for a 15-byte unaligned string:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
15 - ERROR
84       bytes for MbStrLen2
15 - ERROR
73       bytes for MbStrLen3
15 - ERROR
80       bytes for MbStrLen4a
15 - ERROR
71       bytes for MbStrLen4b
15 - ERROR
78       bytes for MbStrLen5
15 - ERROR
147      bytes for AxStrLen
15 - ERROR

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

20      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
37      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen

37      cycles for MbStrLen1
36      cycles for MbStrLen2
37      cycles for MbStrLen3
38      cycles for MbStrLen4a
38      cycles for MbStrLen4b
43      cycles for MbStrLen5

21      cycles for AxStrLen


--- ok ---



On short unaligned strings, the drawbacks of the two loops show up more.
But here are the timings for an aligned 15-byte string:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
15 - ERROR
84       bytes for MbStrLen2
15 - ERROR
73       bytes for MbStrLen3
15 - ERROR
80       bytes for MbStrLen4a
15 - ERROR
71       bytes for MbStrLen4b
15 - ERROR
78       bytes for MbStrLen5
15 - ERROR
147      bytes for AxStrLen
15 - ERROR

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
28      cycles for MbStrLen2
28      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
27      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

20      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen

21      cycles for MbStrLen1
27      cycles for MbStrLen2
29      cycles for MbStrLen3
28      cycles for MbStrLen4a
29      cycles for MbStrLen4b
32      cycles for MbStrLen5

22      cycles for AxStrLen


--- ok ---



All the advantages of short code show on short aligned strings. The short code is the fastest in this test.


So, what does the analysis of the tests give? Jochen, I think you need a different solution for the case of unaligned strings, because for short strings the drawbacks of the loops are very big (as you can see, the advantage of shorter code shows only on very large strings, 2000 bytes in the test).


My variant is also not very good, but I wrote it only for testing short versus long algos.



A big request to everyone: please test this too. It is interesting how different CPUs handle the inner loops.
If you have time, run different tests: different string lengths and aligned/unaligned variants. If you have no time, the archive contains the sources and a compiled exe that tests an unaligned 100-byte string.



Alex
P.S. The sources cannot be assembled with ml6.15 or earlier. I assembled them with ML8 (SSE2 movsd mnemonic).
Title: Re: StrLen timings needed
Post by: jj2007 on August 16, 2010, 08:45:35 PM
Hi Alex,
If you want to get rid of the ERROR, use
  movups xmm0, oword ptr Src
  .if eax!=sizeof Src-1

in the CodeSize macro.
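
The idea behind that check, restated as a small C sketch: the expected length is derived from the test string itself (sizeof Src - 1, the array size minus the terminator), so changing the string never triggers a false ERROR. The CodeSize macro is not shown in this thread; the string content and the function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test string; any literal works because the
   expected length is computed from the array, not hard-coded. */
static const char Src[] = "This is a string for testing StrLen";

/* Plain byte loop standing in for the routine under test. */
static size_t byte_strlen(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

/* Returns 1 when the reported length equals sizeof Src - 1. */
static int length_ok(size_t reported) {
    return reported == sizeof Src - 1;
}
```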

It is true that all versions are slower for unaligned strings, but still a factor three faster than the Masm32 len() function:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
132     cycles for szLen
46      cycles for MbStrLen1
47      cycles for MbStrLen2
50      cycles for MbStrLen3
48      cycles for MbStrLen4a
53      cycles for MbStrLen4b
51      cycles for MbStrLen5


And why do I preserve xmm0? Precisely to encourage people to use SSE... MasmBasic is intended to be fast and noob-friendly. The next version will be "SSE2-safe" and "SSE2-friendly", too. Try doing a Print Str$(xmm1) in ordinary assembler :wink
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 08:48:01 PM
Hi, Jochen!

What timings do you get?
I am sure that on CPUs with a big cache, unaligned access may be faster.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 08:50:30 PM
Jochen, have you seen the results of your hex2dw on Dave's CPU?
This is incredible! What does ONE architecture (NetBurst) mean if EVERY CPU in the forum has different timings? Intel should write some "appendix" for each CPU :)
For info:

14      cycles for Lingo's SSE version
13      cycles for Lingo's BIG integer version
5       cycles for Jochen's WORD-Indexed version
15      cycles for Dave's version (with minor changes)


This is fantastic!



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 16, 2010, 09:09:52 PM
Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.

noob = newbie, beginner.

Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
80       bytes for MbStrLen4a
131      bytes for AxStrLen

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
36      cycles for AxStrLen


Your algo is really fast, compliments :U
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 09:27:09 PM
Quote from: jj2007 on August 16, 2010, 09:09:52 PM
Quote from: Antariy on August 16, 2010, 08:53:17 PM
Jochen, I don't understand the word "noob". What does it mean? Don't forget - I'm not a good English speaker.

noob = newbie, beginner.

Timings:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58       bytes for MbStrLen1
80       bytes for MbStrLen4a
131      bytes for AxStrLen

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
36      cycles for AxStrLen


Your algo is really fast, compliments :U

No, it is really BIG, not fast.
If you remove "mov eax,[esp+4]" from the sources, it becomes slower; try it with "mov eax...".
And for this size, it is really slow.



Alex
Title: Re: StrLen timings needed
Post by: hutch-- on August 16, 2010, 11:27:21 PM
Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo?


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

StrLen proc item:DWORD

    mov     eax, [esp+4]            ; get pointer to string
    lea     edx, [eax+3]            ; pointer+3 used in the end
    push    ebp
    push    edi
    mov     ebp, 80808080h

  @@:     
  REPEAT 3
    mov     edi, [eax]              ; read next 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte
    not     edi                     ; invert all bits
    and     ecx, edi                ; and these two
    and     ecx, ebp                ; keep only the high bit of each byte
    jnz     nxt                     ; zero byte found, leave the loop
  ENDM

    mov     edi, [eax]              ; read next 4 bytes
    add     eax, 4                  ; increment pointer
    lea     ecx, [edi-01010101h]    ; subtract 1 from each byte
    not     edi                     ; invert all bits
    and     ecx, edi                ; and these two
    and     ecx, ebp                ; keep only the high bit of each byte
    jz      @B                      ; no zero bytes, continue loop

  nxt:
    test    ecx, 00008080h          ; test first two bytes
    jnz     @F
    shr     ecx, 16                 ; not in the first 2 bytes
    add     eax, 2
  @@:
    shl     cl, 1                   ; use carry flag to avoid branch
    sbb     eax, edx                ; compute length
    pop     edi
    pop     ebp

    ret     4

StrLen endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
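
The core of the routine above is the test for a zero byte inside a 32-bit word. A minimal C sketch of that test, assuming little-endian byte order as on x86; the function names are hypothetical, and the locating step is a plain loop rather than the branch-free shl/sbb tail of the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if and only if at least one of the four bytes of v is zero.
   Same arithmetic as the lea / not / and / and sequence above. */
static uint32_t zero_byte_mask(uint32_t v) {
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Index (0..3, lowest address first on little-endian) of the first
   zero byte in v, or 4 if v contains no zero byte. */
static int first_zero_byte(uint32_t v) {
    uint32_t m = zero_byte_mask(v);
    int i = 0;
    if (m == 0) return 4;
    while ((m & 0x80u) == 0) {   /* walk up to the lowest set flag */
        m >>= 8;
        ++i;
    }
    return i;
}
```

Flags above the first zero byte can be false positives caused by the borrow of the subtraction, which is why only the lowest set flag is used to locate the terminator.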
Title: Re: StrLen timings needed
Post by: jj2007 on August 16, 2010, 11:39:08 PM
Here it is, Hutch:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80       bytes for MbStrLen4a
128      bytes for AxStrLen
113      bytes for StrLenAF

Src: 100 bytes
46      cycles for MbStrLen1 (xmm0 not preserved)
48      cycles for MbStrLen4a
94      cycles for StrLenAF
36      cycles for AxStrLen


@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table (http://wikis.sun.com/display/BluePrints/Instruction+Selection) might interest you.
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 11:42:48 PM
Quote from: hutch-- on August 16, 2010, 11:27:21 PM
Can I impose on someone to include this unrolled version of Agner Fog's StrLen algo?

I made a test bed.
Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

86      cycles for MbStrLen1
97      cycles for MbStrLen2
78      cycles for MbStrLen3
102     cycles for MbStrLen4a
75      cycles for MbStrLen4b
90      cycles for MbStrLen5

71      cycles for AxStrLen

247     cycles for StrLen

80      cycles for MbStrLen1
88      cycles for MbStrLen2
79      cycles for MbStrLen3
92      cycles for MbStrLen4a
87      cycles for MbStrLen4b
95      cycles for MbStrLen5

70      cycles for AxStrLen

230     cycles for StrLen

77      cycles for MbStrLen1
94      cycles for MbStrLen2
82      cycles for MbStrLen3
93      cycles for MbStrLen4a
74      cycles for MbStrLen4b
98      cycles for MbStrLen5

69      cycles for AxStrLen

230     cycles for StrLen

85      cycles for MbStrLen1
90      cycles for MbStrLen2
79      cycles for MbStrLen3
91      cycles for MbStrLen4a
88      cycles for MbStrLen4b
97      cycles for MbStrLen5

77      cycles for AxStrLen

233     cycles for StrLen

82      cycles for MbStrLen1
98      cycles for MbStrLen2
79      cycles for MbStrLen3
94      cycles for MbStrLen4a
75      cycles for MbStrLen4b
94      cycles for MbStrLen5

70      cycles for AxStrLen

244     cycles for StrLen


--- ok ---


Title: Re: StrLen timings needed
Post by: hutch-- on August 16, 2010, 11:43:54 PM
Gratsie,


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
128      bytes for AxStrLen
113      bytes for StrLenAF

Src: 100 bytes
34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
67      cycles for StrLenAF
24      cycles for AxStrLen

34      cycles for MbStrLen1 (xmm0 not preserved)
42      cycles for MbStrLen4a
68      cycles for StrLenAF
24      cycles for AxStrLen


--- ok ---
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 11:46:23 PM
Quote from: jj2007 on August 16, 2010, 11:39:08 PM

@Alex: Sorry, this is a modified version, without the fancy stuff. Much shorter and 2 cycles faster on the Celeron M, but it might be slower on other CPUs, of course. The table (http://wikis.sun.com/display/BluePrints/Instruction+Selection) might interest you.

No, you don't understand my nice English :)))

I mean, if you DON'T load the value of [esp+4] into eax, the code will almost always run slower, because eax is part of the check for whether the string is unaligned or not. The proc will then almost always take the unaligned branch. Please post your timings for my original archive; it is too hard to explain in English, sorry.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 16, 2010, 11:48:37 PM
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them (I have seen his changes).
Please test mine.



Alex
Title: Re: StrLen timings needed
Post by: hutch-- on August 17, 2010, 12:02:21 AM
Alex,

This one ?


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
58       bytes for MbStrLen1
100 - ERROR
84       bytes for MbStrLen2
100 - ERROR
73       bytes for MbStrLen3
100 - ERROR
80       bytes for MbStrLen4a
100 - ERROR
71       bytes for MbStrLen4b
100 - ERROR
78       bytes for MbStrLen5
100 - ERROR
147      bytes for AxStrLen
100 - ERROR

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
42      cycles for MbStrLen4a
42      cycles for MbStrLen4b
41      cycles for MbStrLen5

25      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
46      cycles for MbStrLen2
45      cycles for MbStrLen3
42      cycles for MbStrLen4a
48      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
41      cycles for MbStrLen4a
42      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
46      cycles for MbStrLen2
46      cycles for MbStrLen3
42      cycles for MbStrLen4a
48      cycles for MbStrLen4b
41      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen

34      cycles for MbStrLen1
42      cycles for MbStrLen2
37      cycles for MbStrLen3
42      cycles for MbStrLen4a
42      cycles for MbStrLen4b
40      cycles for MbStrLen5

24      cycles for AxStrLen

67      cycles for StrLen


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 17, 2010, 06:34:22 AM
Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them (I have seen his changes).
Please test mine.

Alex,

Sorry, I should have split it but was too tired yesterday night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 9 16-byte instruction cache slots.
Title: Re: StrLen timings needed
Post by: Antariy on August 17, 2010, 09:35:38 PM
Quote from: hutch-- on August 17, 2010, 12:02:21 AM
Alex,

This one ?

Yes, Hutch!

The word "ERROR" is not true - I just didn't change the CodeSize macro.

Thanks!



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 17, 2010, 09:37:12 PM
Hi!

Here is a 34-byte MMX StrLen, and a 90-byte SSE1 version (57 bytes shorter) that is 2 clocks faster.

Folks, please test this!


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
90       bytes for AxStrLenSSE1
113      bytes for StrLen

92      cycles for MbStrLen1
95      cycles for MbStrLen2
89      cycles for MbStrLen3
96      cycles for MbStrLen4a
71      cycles for MbStrLen4b
97      cycles for MbStrLen5

109     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

165     cycles for StrLen

100     cycles for MbStrLen1
97      cycles for MbStrLen2
89      cycles for MbStrLen3
90      cycles for MbStrLen4a
100     cycles for MbStrLen4b
95      cycles for MbStrLen5

110     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

87      cycles for MbStrLen1
90      cycles for MbStrLen2
91      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
97      cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
72      cycles for MbStrLen4b
94      cycles for MbStrLen5

110     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
90      cycles for MbStrLen2
87      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

111     cycles for AxStrLenMMX

65      cycles for AxStrLenSSE1

164     cycles for StrLen


--- ok ---



Note: the test was made with unaligned (not 16-byte aligned) strings.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 17, 2010, 09:39:55 PM
Quote from: jj2007 on August 17, 2010, 06:34:22 AM
Quote from: Antariy on August 16, 2010, 11:48:37 PM
Hutch, please test my version. Jochen made some changes, but the algo does NOT work optimally after them (I have seen his changes).
Please test mine.

Alex,

Sorry, I should have split it but was too tired yesterday night. On the other hand, look at the timings: for Hutch, it's 24 cycles for both versions, for me it's 2 cycles faster without the "extras". And 128 instead of 147 bytes means 8 instead of 9 16-byte instruction cache slots.

:bg

Other people's code is never entirely clear, I understand.



Alex
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 17, 2010, 10:45:58 PM
Alex,

Here is my P4:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
90       bytes for AxStrLenSSE1
113      bytes for StrLen

48      cycles for MbStrLen1
51      cycles for MbStrLen2
46      cycles for MbStrLen3
56      cycles for MbStrLen4a
51      cycles for MbStrLen4b
50      cycles for MbStrLen5

70      cycles for AxStrLenMMX

43      cycles for AxStrLenSSE1

135     cycles for StrLen

79      cycles for MbStrLen1
54      cycles for MbStrLen2
53      cycles for MbStrLen3
47      cycles for MbStrLen4a
48      cycles for MbStrLen4b
51      cycles for MbStrLen5

72      cycles for AxStrLenMMX

45      cycles for AxStrLenSSE1

124     cycles for StrLen

40      cycles for MbStrLen1
49      cycles for MbStrLen2
45      cycles for MbStrLen3
55      cycles for MbStrLen4a
43      cycles for MbStrLen4b
51      cycles for MbStrLen5

69      cycles for AxStrLenMMX

43      cycles for AxStrLenSSE1

126     cycles for StrLen

40      cycles for MbStrLen1
47      cycles for MbStrLen2
45      cycles for MbStrLen3
44      cycles for MbStrLen4a
45      cycles for MbStrLen4b
56      cycles for MbStrLen5

70      cycles for AxStrLenMMX

45      cycles for AxStrLenSSE1

125     cycles for StrLen

59      cycles for MbStrLen1
47      cycles for MbStrLen2
48      cycles for MbStrLen3
42      cycles for MbStrLen4a
44      cycles for MbStrLen4b
53      cycles for MbStrLen5

73      cycles for AxStrLenMMX

50      cycles for AxStrLenSSE1

145     cycles for StrLen


--- ok ---


Dave.
Title: Re: StrLen timings needed
Post by: Antariy on August 17, 2010, 11:03:00 PM
Quote from: KeepingRealBusy on August 17, 2010, 10:45:58 PM
Alex,

Here is my P4:



Thanks, Dave!



Alex
Title: Re: StrLen timings needed
Post by: Farabi on August 18, 2010, 08:39:23 PM
Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
24      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
24      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
24      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: Farabi on August 18, 2010, 09:03:23 PM
Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
24      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
24      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
24      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 18, 2010, 11:11:31 PM
5Test:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
90       bytes for AxStrLenSSE1
113      bytes for StrLen

46      cycles for MbStrLen1
47      cycles for MbStrLen2
50      cycles for MbStrLen3
48      cycles for MbStrLen4a
53      cycles for MbStrLen4b
51      cycles for MbStrLen5

60      cycles for AxStrLenMMX

38      cycles for AxStrLenSSE1


Quote
   mov eax,[esp+4]      ; Jochen, if you remove this string again :), then
   add esp,-10h      ; algo would almost always work with any string
   movups [esp],xmm7   ; as with unaligned string, because I made
   and eax,0fh      ; checking for alignment in THIS line! ;-)
   jz @F
OOPS :red
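
For readers following along in a higher-level language: the alignment check in the quoted lines (and eax,0fh followed by jz) tests whether the low four bits of the string address are zero. A minimal C sketch of the same test, assuming nothing about the attached sources (the function name is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the thread's attachments.
   Returns 1 if p lies on a 16-byte boundary, 0 otherwise.
   The low four bits of an address are zero exactly when the
   address is a multiple of 16, which is what "and eax, 0fh"
   followed by "jz" tests in the asm above. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0x0F) == 0;
}
```

If the address is never loaded into the register being tested, the check sees an unrelated value, so which branch is taken no longer depends on the string's alignment.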
Title: Re: StrLen timings needed
Post by: Antariy on August 18, 2010, 11:18:59 PM
Quote from: jj2007 on August 18, 2010, 11:11:31 PM
OOPS :red

No, this is because I don't write comments (to save time).



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this string again :), then
   add esp, -10h   ; algo would almost always work with any string
   movups [esp],xmm7   ; as with unaligned string, because I made
   test dl, 15
;   and eax, 0fh   ; checking for alignment in THIS line!      ;-)
   jz @F
Title: Re: StrLen timings needed
Post by: Farabi on August 19, 2010, 03:01:20 AM
Intel(R) Pentium(R) Dual  CPU  T2390  @ 1.86GHz (SSE4)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
24      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
25      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

17      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
23      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
24      cycles for MbStrLen2
26      cycles for MbStrLen3
23      cycles for MbStrLen4a
26      cycles for MbStrLen4b
23      cycles for MbStrLen5

16      cycles for MbStrLen1
20      cycles for MbStrLen2
23      cycles for MbStrLen3
23      cycles for MbStrLen4a
24      cycles for MbStrLen4b
23      cycles for MbStrLen5


--- ok ---
Title: Re: StrLen timings needed
Post by: mineiro on August 19, 2010, 08:16:50 AM
another dual test.

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
58       bytes for MbStrLen1 33 33 33 33 33 cycles
84       bytes for MbStrLen2 34 42 34 42 34
73       bytes for MbStrLen3 36 44 36 44 36
80       bytes for MbStrLen4a 41 41 41 41 41
71       bytes for MbStrLen4b 36 45 36 45 36
78       bytes for MbStrLen5 46 46 46 46 48
34       bytes for AxStrLenMMX 58 62 62 68 64
90       bytes for AxStrLenSSE1 21 27 21 27 21
113      bytes for StrLen 67 67 67 67 67
--- ok ---
Title: Re: StrLen timings needed
Post by: Antariy on August 19, 2010, 10:45:13 PM

Hi!

This is a slightly changed version of my SSE1 algo. Fixed the stumb load (I did it by analogy with the MMX version, but that is not the same).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
86       bytes for AxStrLenSSE1
113      bytes for StrLen

83      cycles for MbStrLen1
92      cycles for MbStrLen2
80      cycles for MbStrLen3
78      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

111     cycles for AxStrLenMMX

60      cycles for AxStrLenSSE1

183     cycles for StrLen

110     cycles for MbStrLen1
90      cycles for MbStrLen2
90      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

199     cycles for StrLen

92      cycles for MbStrLen1
102     cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
149     cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

235     cycles for StrLen

100     cycles for MbStrLen1
113     cycles for MbStrLen2
74      cycles for MbStrLen3
103     cycles for MbStrLen4a
71      cycles for MbStrLen4b
77      cycles for MbStrLen5

109     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

200     cycles for StrLen

92      cycles for MbStrLen1
99      cycles for MbStrLen2
91      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

115     cycles for AxStrLenMMX

61      cycles for AxStrLenSSE1

197     cycles for StrLen


--- ok ---



A big request to everyone: please test this!



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 19, 2010, 10:51:20 PM
Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this string again :), then
   add esp, -10h   ; algo would almost always work with any string
   movups [esp],xmm7   ; as with unaligned string, because I made
   test dl, 15
;   and eax, 0fh   ; checking for alignment in THIS line!      ;-)
   jz @F


Jochen, on my CPU, if I use edx for the check, the proc is 2 clocks slower. Using a partial register gains nothing (I know about this; the reasons are very hardware-dependent. On modern CPUs it is very slow).



Alex
Title: Re: StrLen timings needed
Post by: mineiro on August 19, 2010, 11:45:40 PM
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)


58       bytes for MbStrLen1 33 33 33 33 33
84       bytes for MbStrLen2 34 42 34 42 34
73       bytes for MbStrLen3 36 44 36 44 36
80       bytes for MbStrLen4a 41 41 41 41 37
71       bytes for MbStrLen4b 36 45 36 45 36
78       bytes for MbStrLen5 46 46 46 48 46
34       bytes for AxStrLenMMX 59 65 58 68 65
86       bytes for AxStrLenSSE1 19 24 19 25 19
113     bytes for StrLen 67 67 67 67 67
--- ok ---


==       bytes for MbStrLen1 == == == == ==
==       bytes for MbStrLen2 == == == == ==
==       bytes for MbStrLen3 == == == == ==
==       bytes for MbStrLen4a == 37 == == ==
==       bytes for MbStrLen4b == == == == ==
==       bytes for MbStrLen5 == == == 46 ==
==       bytes for AxStrLenMMX 62 == 61 == 62
==       bytes for AxStrLenSSE1 == == == == ==
===     bytes for StrLen == == == == ==
--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 20, 2010, 08:52:51 AM
Quote from: Antariy on August 19, 2010, 10:51:20 PM
Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote
   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this string again :), then
   add esp, -10h   ; algo would almost always work with any string
   movups [esp],xmm7   ; as with unaligned string, because I made
   test dl, 15
;   and eax, 0fh   ; checking for alignment in THIS line!      ;-)
   jz @F


Jochen, on my CPU, if I use edx for the check, the proc is 2 clocks slower. Using a partial register gains nothing (I know about this; the reasons are very hardware-dependent. On modern CPUs it is very slow).

Alex


Hi Alex,
The difference is very small on my trusty old P4 and nonexistent on my Celeron. Here is one more for testing... I am tempted to use the MbStrLen4aP4 for the MasmBasic library, because it's short, reasonably fast, and safe for strings that end precisely at the legal area (you remember VirtualAlloc can be a bit rude with attempts to use movups xmm7, [edx] when edx is near the next page :wink)
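
The page-boundary problem can be stated as a simple address test: an unaligned 16-byte load faults when it starts in a committed page and runs into an uncommitted one. A minimal C sketch of that test, assuming 4 KB pages (the helper name and the PAGE_BYTES constant are made up for illustration, not part of any test-bed here):

```c
#include <stdint.h>

/* Assumption: 4 KB pages, as on 32-bit x86 Windows. */
#define PAGE_BYTES 4096u

/* Hypothetical helper: returns 1 if a 16-byte load starting at p
   would touch the following page, 0 if it stays inside p's page. */
int load16_crosses_page(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (PAGE_BYTES - 1);  /* offset inside the page */
    return offset > PAGE_BYTES - 16;
}
```

A load from a 16-byte aligned address never crosses, because the highest aligned offset inside a page is 4080 and 4080 + 16 is exactly the page end.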

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
333     cycles for szLen
78      cycles for MbStrLen4a
72      cycles for MbStrLen4aP4
62      cycles for AxStrLenSSE1
63      cycles for AxStrLenSSE1j


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
35      cycles for AxStrLenSSE1
35      cycles for AxStrLenSSE1j
Title: Re: StrLen timings needed
Post by: hutch-- on August 20, 2010, 09:31:58 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
104     cycles for szLen
42      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
19      cycles for AxStrLenSSE1
23      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
27      cycles for AxStrLenSSE1
28      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
24      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
24      cycles for AxStrLenSSE1
26      cycles for AxStrLenSSE1j


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 20, 2010, 09:51:59 AM
Quote from: hutch-- on August 20, 2010, 09:31:58 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
19      cycles for AxStrLenSSE1
23      cycles for AxStrLenSSE1j


Thanks. Interesting that the 4a is faster, on my P4 it's definitely slower. I have changed the 72Test posted above and added an option CrashIt, which will test a string near the VirtualAlloc page boundary.
Title: Re: StrLen timings needed
Post by: jcfuller on August 20, 2010, 12:57:37 PM
Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
129     cycles for szLen
72      cycles for MbStrLen4a
94      cycles for MbStrLen4aP4
91      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
40      cycles for AxStrLenSSE1j

157     cycles for szLen
72      cycles for MbStrLen4a
93      cycles for MbStrLen4aP4
93      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j

157     cycles for szLen
72      cycles for MbStrLen4a
71      cycles for MbStrLen4aP4
70      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
40      cycles for AxStrLenSSE1j

158     cycles for szLen
72      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
94      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 20, 2010, 01:11:05 PM
Thanxalot. The current version of MasmBasic Len() is MbStrLen4a, and it will stay that way. 80 bytes short, reasonably fast and SSE-safe near page boundaries.
Title: Re: StrLen timings needed
Post by: hutch-- on August 20, 2010, 09:47:33 PM
JJ,

On both the Core2 and i7 boxes, SSE is a lot faster than on my P4 boxes relative to normal integer instruction code so if you are pointing the procedures at SSE capable processors I would go for the ones that are faster on the Core2 and i3/i5/i7 architecture. Rockoon has been posting results from a recent 6 core AMD so it will also be worthwhile seeing what they work like on a late AMD box.
Title: Re: StrLen timings needed
Post by: jj2007 on August 20, 2010, 10:16:07 PM
Hutch,
Yes, some more tests would be fine - but it seems the MbStrLen4a is overall quite ok. The SSE1 versions are not "boundary safe", the others seem a tick slower.
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58      cycles for MbStrLen4a
62      cycles for MbStrLen4aP4
62      cycles for MbStrLen4aP42

56      cycles for MbStrLen4a
71      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42

The increase from 62 to 71 may have to do with the testbed setting: For each timing loop, there is a push for changing the stack alignment. So sometimes the movdqu [esp], xmm0 is 16-byte aligned and thus faster (on some CPUs the movdqu becomes as fast as movdqa if it hits an aligned address).
Title: Re: StrLen timings needed
Post by: Antariy on August 20, 2010, 10:34:37 PM
Jochen, is your proc safe at the end of a VirtualAlloc block? (I haven't looked at the sources yet.)



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 20, 2010, 10:55:57 PM
Jochen, your procs are not safe either, they crash with short strings.

You are forcing me to make a version with SEH :), really.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 20, 2010, 10:57:55 PM
Hutch, what do you need the SSE1 StrLen algo for? Do you need it?



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 20, 2010, 11:04:54 PM
Jochen, I have seen the sources, your proc does not crash :)



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.



Alex
Title: Re: StrLen timings needed
Post by: lingo on August 21, 2010, 01:03:16 AM
Hutch,
I'm wondering why you tested and tolerated such idiotic, slow algos.
Idiotic because they preserved registers without any need.
Only Microsoft can require which  registers we must preserve :(
Title: Re: StrLen timings needed
Post by: jj2007 on August 21, 2010, 07:07:28 AM
Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.

Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.
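
The arithmetic behind that estimate is just clock frequency divided by cycles per call. A small C sketch (the function name is made up for illustration):

```c
/* Hypothetical helper: how many calls fit into one second if each
   call costs cycles_per_call cycles on a CPU running at clock_hz.
   This ignores call overhead and cache effects; it is only the
   back-of-the-envelope figure used in the post above. */
double calls_per_second(double clock_hz, double cycles_per_call)
{
    return clock_hz / cycles_per_call;
}
```

With 1.6e9 Hz and 80 cycles this gives 2.0e7, i.e. the 20 million calls per second quoted above; at 40 cycles it doubles to 40 million.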
Title: Re: StrLen timings needed
Post by: Antariy on August 21, 2010, 04:08:37 PM
Hi!

A big request to everyone: please test this. This is a fixed version of my SSE1 StrLen proc, which does not crash at the end of a buffer in the normal case of a zero-terminated string.
And this proc is still fast with unaligned strings; it has the same speed as my previous proc, or is slightly slower.
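
One common way to get this kind of end-of-buffer safety is to read only from 16-byte aligned addresses, since an aligned 16-byte block never straddles a page. The C sketch below illustrates that general idea only; it is not Alex's code, the function name is made up, and the inner byte loop merely stands in for what pcmpeqb/pmovmskb would do on a whole block at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of an "aligned blocks only" string length.
   Assumption: every 16-byte aligned block that overlaps the string
   is readable. That holds at page level, because page sizes are
   multiples of 16, but strict C only guarantees it when the string
   sits inside a suitably aligned and padded buffer. */
size_t strlen_aligned_blocks(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    const char *block = s - (addr & 15);  /* round down to a 16-byte boundary */
    size_t i = addr & 15;                 /* skip the bytes before the string start */

    for (;;) {
        /* Stand-in for a 16-byte compare against zero. */
        for (; i < 16; i++) {
            if (block[i] == '\0')
                return (size_t)(block + i - s);
        }
        block += 16;  /* next aligned block */
        i = 0;
    }
}
```

The bytes in front of the string inside the first block are skipped rather than compared, so a stray zero there cannot produce a wrong length.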

For the first test-bed (Jochen's old procs):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
93       bytes for AxStrLenSSE1
113      bytes for StrLen

73      cycles for MbStrLen1
90      cycles for MbStrLen2
91      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

110     cycles for AxStrLenMMX

63      cycles for AxStrLenSSE1

164     cycles for StrLen

100     cycles for MbStrLen1
113     cycles for MbStrLen2
104     cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
100     cycles for MbStrLen5

108     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

100     cycles for MbStrLen1
90      cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

114     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

81      cycles for MbStrLen1
130     cycles for MbStrLen2
104     cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
80      cycles for MbStrLen5

110     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
89      cycles for MbStrLen2
90      cycles for MbStrLen3
104     cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen



For Jochen's new test-bed with the CrashIt macro set to "1":


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
27      cycles for szLen
33      cycles for MbStrLen4a
62      cycles for MbStrLen4aP4
37      cycles for MbStrLen4aP42
24      cycles for AxStrLenSSE1



After my proc, Jochen's tweak runs, and it crashes. My proc works properly now (see: its test is successful - 24 clocks). I.e. my proc does not crash at the end of the buffer.

For the new test-bed, without the CrashIt macro:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
264     cycles for szLen
88      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
72      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j

270     cycles for szLen
105     cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
93      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j

270     cycles for szLen
105     cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j

264     cycles for szLen
91      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j



Jochen, read the comments at the start of the AxStrLenSSE1 proc. I hope you will understand that I am right.

AxStrLenSSE1j (Jochen's remake) still crashes; I did not work on it.
And the integer version also crashes on strings at the end of the buffer.


Hutch, SSE runs nearly 3 times faster on Core+ than on the P4.
Hutch, test my new version please.


I preserve XMM7 and ECX only for a fair comparison with Jochen's procs. It is his right to decide what he does with HIS MasmBasic.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 21, 2010, 04:16:43 PM
Quote from: jj2007 on August 21, 2010, 07:07:28 AM
Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it will not make the proc much slower :) Only in the end-of-buffer case.

Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.

Try the new version. I made it the normal way, NOT covered by SEH, only out of respect for you. And I preserve ECX and XMM7 only out of respect for you.
It became 1 clock slower, what a lot :)
But I disagree with you adding the alignment stuff to "codesize". It is NOT code; it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.

I am not imposing my proc on you, really. It is 16% bigger and at least 30% faster on new CPUs, so I think that is satisfactory.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 21, 2010, 04:18:21 PM
Hutch, the page has changed again, so I repeat my request.

Test this please: "http://www.masm32.com/board/index.php?action=dlattach;topic=14626.0;id=8001".
This is my new safe version of the SSE1 proc.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 21, 2010, 04:44:15 PM
Nobody except God can tell me what I must do and which regs I must preserve.
So if somebody doesn't have a halo and wings, shut up, please.



Alex
P.S. Somebody, you have not proved your second assertion: write a faster proc, then talk.
I know you will write a BLOATED algo again and think you are great; that is your problem alone.
Creating a VERY bloated unrolled algo (using many XMM regs) is NOT hard. What is hard is creating the smallest of the fastest algos.
Title: Re: StrLen timings needed
Post by: jj2007 on August 21, 2010, 07:02:16 PM
Quote from: Antariy on August 21, 2010, 04:16:43 PM
But I disagree with you adding the alignment stuff to "codesize". It is NOT code; it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.

Dear Alex,
The assumption here is that you build a library, and each algo starts on a 16-byte aligned boundary. So if somebody (Lingo does that very often) inserts 7 bytes of strange db xyz before the algo start, this a) will add to the size of the executable and b) may waste some bytes of instruction cache when the CPU pulls code in with 16-byte alignment. Therefore "codesize" starts at the 16-byte boundary in all our tests. Call it a convention. Since it's being applied to everybody, I would call it fair :wink
The a16 macro should not count imho because in real life you would not insert 16 int 3's between all the algos of your library. We use it here - for everybody - in the hope that it might eliminate execution cache influences on the timings. Whether that really works, I don't know...
Title: Re: StrLen timings needed
Post by: dedndave on August 21, 2010, 07:49:40 PM
i dunno how fair it is, really - lol
but, it appears to be as "real-world" as it can be
it depends on how the code is placed in the test program (luck of the draw)
maybe a more judicious method would be to count actual routine bytes, then add 8 for a 16-alignment
that would represent the "average" byte-count for all possible placements
and - those who do not want to use align 16, can suffer a few clock-cycles penalty to save 8 bytes in the count
that would be fair
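
The "add 8 for a 16-alignment" idea can be sketched in C. The two helper names below are invented for illustration, and the average assumes that every start offset modulo 16 is equally likely, which the thread does not establish.

```c
#include <assert.h>
#include <stddef.h>

/* Padding bytes needed at `offset` to reach the next multiple of
   `align`. `align` must be a power of two (16 for "align 16"). */
size_t padding_to_align(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}

/* Mean padding over all `align` possible residues, assuming each
   residue is equally likely (an assumption, not a measurement). */
double average_padding(size_t align)
{
    size_t total = 0;
    for (size_t r = 0; r < align; ++r)
        total += padding_to_align(r, align);
    return (double)total / (double)align;
}
```

Under that assumption `average_padding(16)` comes out at 7.5, so "add 8" is a rounded-up version of the same estimate. A routine whose predecessor ends one byte past a boundary pays the worst case of 15 bytes.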
Title: Re: StrLen timings needed
Post by: Antariy on August 21, 2010, 10:15:53 PM
Jochen, this is a piece of text from Intel's optimization guide:
Quote
Assembly/Compiler Coding Rule 56. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.

This text cannot be copied with any standard app (because Intel locks its PDFs against copying), but since dedndave and I were talking about encryption and researching it... :)

Back to the quote. As you see, Intel recommends placing pieces of data in places that will not be executed, and the CPU can "find" this out. So CPUs are NOT so silly that they don't know there is no need to pre-decode some "code".

The code location problem is really a problem of how code and data interact, not of the location or type of alignment instructions (int3, or a long lea esp,[esp], etc. does not matter). This is hard for me to express; I hope you understand me well enough.

Location has some small influence, but it cannot be critical. Algos that work with a lot of data are more sensitive to "code placement".



Alex
Title: Re: StrLen timings needed
Post by: donkey on August 22, 2010, 01:04:31 AM
Has anyone tried my modest entry in the race? I could not get the AxStrLenMMX or AxStrLenSSE1 examples I downloaded to yield either a consistent or correct result using a string length of 11 chars ("Hello There"), either in MASM or GoAsm, so my tests were pretty much shot. After all, however fast it yields the wrong answer, it's still wrong. I was using vkim's debug to display the results (both gave me a string length of 7). I have included the RadAsm project; if someone can tell me what is required to get correct results, I would appreciate it.

; From my rather old strings library
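lszLenMMX proc pString:DWORD

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1 ; mm1 = 0, compared against every 8-byte chunk
nop ; fill pxor to 4 bytes

@@: ; this is aligned to 16 bytes
movq mm0,[eax] ; load 8 bytes (may read past the terminator)
pcmpeqb mm0,mm1 ; 0FFh in every byte position that holds a zero
add eax,8
pmovmskb ecx,mm0 ; one mask bit per byte
or ecx,ecx
jz @B ; no zero byte in this chunk, keep scanning

sub eax,[pString] ; bytes scanned, including the last chunk

bsf ecx,ecx ; index of the first zero byte in the last chunk
sub eax,8
add eax,ecx ; eax = string length

emms ; leave MMX state so the FPU can be used again


   RET

lszLenMMX ENDP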


Edgar
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 07:25:07 AM
Hi Edgar,
Your attachment won't assemble, some includes are missing, and paths rely on environment variables. But anyway, I got the algo to work:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      bytes for MbStrLen4a
49      bytes for lszLenMMX
------- timings -------
131     cycles for szLen
49      cycles for MbStrLen4a
90      cycles for lszLenMMX


Faster than the Masm32 library algo, but problematic because a) it trashes the FPU and b) it throws an exception for strings near a VirtualAlloc boundary.
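
Why a wide read can fault near an allocation boundary is easy to show with address arithmetic alone. The sketch below assumes 4 KiB pages and uses made-up addresses; `read_crosses_page` and `align_down` are hypothetical helpers, not part of any algo posted here.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* 1 if a read of `width` bytes starting at `addr` touches two pages. */
int read_crosses_page(uintptr_t addr, unsigned width)
{
    assert(width > 0 && width <= DEMO_PAGE_SIZE);
    return (addr / DEMO_PAGE_SIZE) != ((addr + width - 1) / DEMO_PAGE_SIZE);
}

/* Round `addr` down to a multiple of `align` (a power of two). */
uintptr_t align_down(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(uintptr_t)(align - 1);
}
```

An unaligned 8-byte read at 0x1FFC reaches into the next page, which faults if that page is not committed. A 16-byte read from `align_down(0x1FFC, 16)` stays inside the page that holds the string, which is the usual reason the "safe" variants align the pointer before the first wide load.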
Title: Re: StrLen timings needed
Post by: donkey on August 22, 2010, 11:28:57 AM
Thanks jj2007,

I finally got the Ax... ones to work, hadn't noticed the lack of prologue and epilogue so ESP+4 wasn't pointing to the right place in my tests.

Edgar
Title: Re: StrLen timings needed
Post by: lingo on August 22, 2010, 12:23:34 PM
It is not a big deal to beat the stupid losers:  :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
84      bytes for StrLenLingo
------- timings -------
112     cycles for szLen
37      cycles for MbStrLen4a
49      cycles for MbStrLen4aP4
49      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
45      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

111     cycles for szLen
37      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
21      cycles for AxStrLenSSE1
22      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

110     cycles for szLen
63      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
73      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
26      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
43      cycles for MbStrLen4aP42
40      cycles for AxStrLenSSE1
22      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo


--- ok ---


Later:
Corrected a bug in my algo. Pls,reload it..sorry...
Title: Re: StrLen timings needed
Post by: dedndave on August 22, 2010, 03:57:34 PM
lingo wasn't bad-lookin, when he was little....

(http://www.babble.com/CS/blogs/strollerderby/2008/08/23-End/Tantrum-1.jpg)
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 05:27:43 PM
Hi!

This is the old proc with some changes. I haven't seen any other new procs, so I did not add any to the tests.

Test this, please.



Alex
Title: Re: StrLen timings needed
Post by: hutch-- on August 22, 2010, 05:32:55 PM
Alex.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
104     code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
80      code size for AxStrLenSSE1j     80      total bytes for AxStrLenSSE1j
------- timings -------
110     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
54      cycles for MbStrLen4aP42
34      cycles for AxStrLenSSE1
37      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
64      cycles for MbStrLen4aP4
64      cycles for MbStrLen4aP42
33      cycles for AxStrLenSSE1
45      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
54      cycles for MbStrLen4aP42
33      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
64      cycles for MbStrLen4aP4
64      cycles for MbStrLen4aP42
32      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 05:34:39 PM
What about setting the CrashIt macro to "1" and testing Lingo's proc?  :P



Alex
P.S. Lingo's proc is so bad! It doesn't preserve ECX and the XMM regs!  :P
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 05:36:47 PM
Thanks, Hutch!



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 09:30:32 PM
Here is one more, with a modification of Alex' excellent algo (shorter, same cycle count on my CPU).
I tried to include Lingo's new algo, but - not surprisingly - it raised an exception. If you are masochist enough, you can "heal" it as follows:

Quote
ExA:
   bsf            eax,   ecx
   mov edx, [esp-8]      ; added by JJ (still unsafe but for testing it's ok)
   jmp            edx
StrLenLingo   endp
Lingo's algo will still raise an exception for CrashIt = 1.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
------- timings, misaligned -------
131     cycles for szLen
49      cycles for MbStrLen4a
35      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
35      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j


EDIT:
Quote
   sub ecx, edx
   pxor xmm7, xmm7         ; thanks Alex!!
   jz @F
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 09:47:51 PM
Please test the attached archive. I added Lingo's and Edgar's procs.

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
263     cycles for szLen
97      cycles for MbStrLen4a
100     cycles for MbStrLen4aP4
121     cycles for MbStrLen4aP42
108     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
119     cycles for lszLenMMX

251     cycles for szLen
96      cycles for MbStrLen4a
122     cycles for MbStrLen4aP4
96      cycles for MbStrLen4aP42
108     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
124     cycles for lszLenMMX

251     cycles for szLen
91      cycles for MbStrLen4a
146     cycles for MbStrLen4aP4
105     cycles for MbStrLen4aP42
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
119     cycles for lszLenMMX

251     cycles for szLen
71      cycles for MbStrLen4a
96      cycles for MbStrLen4aP4
120     cycles for MbStrLen4aP42
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1
53      cycles for StrLenLingo
122     cycles for lszLenMMX


--- ok ---


About the fastest proc: this is like comparing a lame horse with a bulldozer. The horse soils some regs that must be preserved for a fair comparison, and the horse stumbles and falls on some strings :P

And the naughty horse messes up the timings of the good bulldozers :)


Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 09:54:29 PM
For Jochen's latest archive (80StrLen.zip):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
269     cycles for szLen
85      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j
58      cycles for StrLenLingo

268     cycles for szLen
83      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
57      cycles for StrLenLingo

269     cycles for szLen
85      cycles for MbStrLen4a
66      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
57      cycles for StrLenLingo

269     cycles for szLen
105     cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---




Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 09:59:43 PM
Quote from: Antariy on August 22, 2010, 09:54:29 PM
For Jochen's latest archive (80StrLen.zip):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
13 - ERROR in AxStrLenSSE1j
15 - ERROR in AxStrLenSSE1j


Strange - these errors are not present in my runs. Can you check what happened there??
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 10:08:00 PM
Jochen,

movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7       <--- move this above jz @F
pcmpeqb xmm7, [edx]




Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 10:14:06 PM
Quote from: Antariy on August 22, 2010, 10:08:00 PM

movups [esp], xmm7
sub ecx, edx
jz @F
pxor xmm7, xmm7       <--- move this above jz @F
pcmpeqb xmm7, [edx]


Of course :red
It's fixed, see new attachment above.
Thanks Alex :U
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 10:16:23 PM
I moved the pxor between sub ecx,edx and jz @F; these are the timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
262     cycles for szLen
82      cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

263     cycles for szLen
87      cycles for MbStrLen4a
64      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

262     cycles for szLen
85      cycles for MbStrLen4a
65      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

262     cycles for szLen
84      cycles for MbStrLen4a
64      cycles for AxStrLenSSE1
67      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---



Edited: my fix is the same as your fix, Jochen, so the results are the same.



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 10:21:52 PM
Thanks. So my 1j is three cycles slower on your CPU - on mine it's about half a cycle faster. For aligned strings, by the way, it looks like this:
------- timings, 16-byte aligned -------
131     cycles for szLen
32      cycles for MbStrLen4a
32      cycles for AxStrLenSSE1
32      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (UNSAFE)
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 10:25:27 PM
For aligned:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, 16-byte aligned -------
263     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
70      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

261     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
55      cycles for StrLenLingo

262     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo

261     cycles for szLen
69      cycles for MbStrLen4a
68      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j
56      cycles for StrLenLingo


--- ok ---


So a few clocks do not matter. As you see, lingo's proc does not work very well  :green2 Considering that it crashes and doesn't preserve regs, no comment...  :green2



Alex
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 22, 2010, 10:35:58 PM
Alex,

91Test_StrLenSaveXmm.exe crashes on my P4

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
260     cycles for szLen
53      cycles for MbStrLen4a


91Test_StrLenSaveXmm.exe has encountered a problem and needs to close.  We are sorry for the inconvenience.

Dave.
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 10:54:43 PM
Dave, this is something to do with Jochen's old MbStrLen4aP4 proc :(



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 10:58:10 PM
Dave, please test "80StrLen.zip"; maybe Jochen has fixed this problem. I have his old proc for P4.



Alex
Title: Re: StrLen timings needed
Post by: donkey on August 22, 2010, 10:59:39 PM
Here's my test from my laptop:

AMD Athlon(tm) X2 Dual-Core QL-62 (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for StrLenLingo       84      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
------- timings -------
147     cycles for szLen
56      cycles for MbStrLen4a
53      cycles for MbStrLen4aP4
52      cycles for MbStrLen4aP42
72      cycles for AxStrLenMMX
69      cycles for AxStrLenSSE1
43      cycles for StrLenLingo
75      cycles for lszLenMMX

138     cycles for szLen
58      cycles for MbStrLen4a
56      cycles for MbStrLen4aP4
63      cycles for MbStrLen4aP42
76      cycles for AxStrLenMMX
64      cycles for AxStrLenSSE1
43      cycles for StrLenLingo
76      cycles for lszLenMMX

137     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
53      cycles for MbStrLen4aP42
76      cycles for AxStrLenMMX
68      cycles for AxStrLenSSE1
42      cycles for StrLenLingo
77      cycles for lszLenMMX

138     cycles for szLen
50      cycles for MbStrLen4a
56      cycles for MbStrLen4aP4
55      cycles for MbStrLen4aP42
70      cycles for AxStrLenMMX
65      cycles for AxStrLenSSE1
48      cycles for StrLenLingo
77      cycles for lszLenMMX


--- ok ---


Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 11:01:29 PM
Dave, when you run your development computer (AMD), could you find where the app crashes?



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 11:03:13 PM
Thanks, Edgar!

This is always nice: having timings from different hardware.



Alex
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 22, 2010, 11:05:47 PM
Alex,

Here is my P4 for 80StrLen.zip.

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       92      total bytes for StrLenLingo
------- timings, misaligned -------
236     cycles for szLen
52      cycles for MbStrLen4a
44      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
32      cycles for StrLenLingo

227     cycles for szLen
46      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
31      cycles for StrLenLingo

230     cycles for szLen
45      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
38      cycles for AxStrLenSSE1j
32      cycles for StrLenLingo

227     cycles for szLen
46      cycles for MbStrLen4a
37      cycles for AxStrLenSSE1
39      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo


--- ok ---

Dave.
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 11:06:33 PM
Quote from: donkey on August 22, 2010, 10:59:39 PM
Well, 5 years or so ago when I wrote lszLenMMX it was pretty fast and rather unique, now it seems to be a bit of a pig compared with others...

No, Edgar. Don't forget that my MMX version is only very SLIGHTLY faster, because I omit the prologue/epilogue code. All the SSE versions are faster because they "eat" twice as much data per loop (lingo's eats 4 times as much data per loop). These are normal results, not a pig.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 22, 2010, 11:11:45 PM
Quote from: KeepingRealBusy on August 22, 2010, 11:05:47 PM
Alex,

Here is my P4 for 80StrLen.zip.


Thanks, Dave!



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 22, 2010, 11:16:06 PM
One more for the night - I shaved off a cycle and six bytes of codesize:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxStrLenSSE1j     78      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       90      total bytes for StrLenLingo
------- timings, misaligned -------
132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

131     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
33      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)

132     cycles for szLen
49      cycles for MbStrLen4a
34      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j
24      cycles for StrLenLingo (unsafe)
Title: Re: StrLen timings needed
Post by: KeepingRealBusy on August 22, 2010, 11:18:53 PM
JJ,

Here is my P4 for 80bStrLen.zip.

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxStrLenSSE1j     78      total bytes for AxStrLenSSE1j
83      code size for StrLenLingo       90      total bytes for StrLenLingo
------- timings, misaligned -------
242     cycles for szLen
56      cycles for MbStrLen4a
40      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo (unsafe)

228     cycles for szLen
52      cycles for MbStrLen4a
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
30      cycles for StrLenLingo (unsafe)

228     cycles for szLen
43      cycles for MbStrLen4a
38      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j
31      cycles for StrLenLingo (unsafe)

230     cycles for szLen
54      cycles for MbStrLen4a
38      cycles for AxStrLenSSE1
52      cycles for AxStrLenSSE1j
33      cycles for StrLenLingo (unsafe)


--- ok ---

Dave
Title: Re: StrLen timings needed
Post by: jj2007 on August 23, 2010, 01:33:34 PM
Quote from: KeepingRealBusy on August 22, 2010, 10:35:58 PM
Alex,

91Test_StrLenSaveXmm.exe crashes on my P4

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)


Hi Dave & Alex,

I found the "bug": It's lddqu - the instruction requires SSE3.
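
A guard against this kind of crash is to check the CPUID feature bits before picking a code path. The sketch below only decodes the bits; it assumes the caller has already executed CPUID with EAX=1 and passes in the resulting ECX and EDX values. The function name and the level numbering (1 = SSE, 2 = SSE2, 3 = SSE3) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 1 feature bits */
#define CPUID1_EDX_SSE   (1u << 25)
#define CPUID1_EDX_SSE2  (1u << 26)
#define CPUID1_ECX_SSE3  (1u << 0)

/* Returns 1 if the CPU described by the leaf-1 ECX/EDX values
   supports the requested level: 1 = SSE, 2 = SSE2, 3 = SSE3. */
int cpu_supports_sse_level(uint32_t ecx, uint32_t edx, int level)
{
    assert(level >= 1 && level <= 3);
    switch (level) {
    case 1:  return (edx & CPUID1_EDX_SSE)  != 0;
    case 2:  return (edx & CPUID1_EDX_SSE2) != 0;
    default: return (ecx & CPUID1_ECX_SSE3) != 0;
    }
}
```

A testbed could call this once at startup and skip any algo that uses lddqu when level 3 is not reported, instead of letting it fault on an SSE2-only P4.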

Attached a new testbed with two AxJJ variants that behave similarly on a P4 but very differently on my Celeron. Timings?
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 100 byte string -------
272     cycles for szLen
89      cycles for MbStrLen4a
63      cycles for AxStrLenSSE1
70      cycles for AxJJStrLen1
68      cycles for AxJJStrLen2
Title: Re: StrLen timings needed
Post by: lingo on August 23, 2010, 05:30:46 PM
Again idiotic vain efforts... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
112     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen1
26      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
21      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
32      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo

7       cycles for szLen
30      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 23, 2010, 05:48:50 PM
Quote from: lingo on August 23, 2010, 05:30:46 PM
Again idiotic vain efforts... :lol

Lingo,
While you have a fast CPU, and stolen a lot from Alex and my code, your algo still crashes. Give up.
Title: Re: StrLen timings needed
Post by: jj2007 on August 23, 2010, 06:08:38 PM
Version d, Celeron M timings:
Quote
------- timings, misaligned, 5 byte string -------
12      cycles for szLen
15      cycles for AxStrLenSSE1
11      cycles for AxJJStrLen1
22      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
12      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen1
11      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
16      cycles for AxStrLenSSE1
11      cycles for AxJJStrLen1
22      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

12      cycles for szLen
12      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen1
11      cycles for AxJJStrLen2
7       cycles for AxJJStrLen3

The "jumping" is most probably caused by the movups [esp+xxx], xmm0 - in the REPEAT loop, the stack is being gradually decreased (push eax), so every 4 loops one of the algos is lucky to have a 16-byte alignment.
To eliminate this effect, AxJJStrLen3 uses a global aligned variable and movdqa. Results look convincing.
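
The "every 4 loops" pattern follows from simple modular arithmetic, which can be simulated in C. This is a sketch under the assumption stated in the post, namely that each pass lowers ESP by exactly 4 bytes; the function name and the start address are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many of `iterations` passes see a 16-byte aligned stack
   pointer when each pass lowers esp by 4 (one push eax). */
unsigned count_aligned_passes(uint32_t start_esp, unsigned iterations)
{
    assert((start_esp & 3u) == 0);                   /* DWORD-aligned stack */
    assert((uint64_t)iterations * 4u <= start_esp);  /* no wrap-around */

    unsigned aligned = 0;
    uint32_t esp = start_esp;
    for (unsigned i = 0; i < iterations; ++i) {
        if ((esp & 15u) == 0)
            ++aligned;          /* movups [esp] hits an aligned slot */
        esp -= 4;               /* simulated push eax */
    }
    return aligned;
}
```

Starting from a hypothetical ESP of 0012FF7Ch, 16 passes give 4 aligned ones, so exactly one pass in four gets the fast store. Which algo happens to run on that pass is luck, which fits the alternating numbers above.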
Title: Re: StrLen timings needed
Post by: lingo on August 23, 2010, 06:56:17 PM
"and stolen a lot from Alex and my code"

Wow, the thief crying "catch the thief" see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
I can't thieve nothing from you and from the asian lamer just because you have no ideas in assembly , hence your code will be very ugly and slow always... So, get your peels and take it easy.. :lol

"your algo still crashes"

For some lamers in programing may be but for the advanced programmers which test the end of their buffer after every call to VirtualAlloc just no way to crash. :lol

"Give up."

Due to some sick idiotic lamers in the forum :lol....Never!
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
112     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen1
26      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
21      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
20      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
29      cycles for AxJJStrLen2
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
32      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo

7       cycles for szLen
30      cycles for MbStrLen4a
8       cycles for AxStrLenSSE1
15      cycles for AxJJStrLen1
8       cycles for AxJJStrLen2
2       cycles for StrLenLingo

7       cycles for szLen
32      cycles for MbStrLen4a
15      cycles for AxStrLenSSE1
8       cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
1       cycles for StrLenLingo


--- ok ---


Title: Re: StrLen timings needed
Post by: jj2007 on August 23, 2010, 07:05:13 PM
Quote from: lingo on August 23, 2010, 06:56:17 PM
For some lamers in programing may be but for the advanced programmers which test the end of their buffer after every call to VirtualAlloc just no way to crash. :lol

Line 3:
CrashIt =   1   ; overrides MisAlign - the "SSE1" algos will bang their head against the VirtualAlloc boundary

Line 153:
Quote
      if 0   ; CrashIt
         print "No result for Lingo's algo, it crashes", 13, 10
      else
         cycles Src, StrLenLingo  ; ok, so let it crash
      endif
:bg

> downloaded 8 times
Nice trick, Lingo. The original is in reply #98, though
Title: Re: StrLen timings needed
Post by: lingo on August 23, 2010, 07:13:33 PM
"The thief is always a liar" -> just see the link: (www.masm32.com/board/index.php?topic=11353.msg84371#msg84371)
Just put your lame macro in... you know where... :lol
I can explain VirtualAlloc to any normal man, but it seems you forgot your pills again... :lol
Take care or next step will be the "electroconvulsive therapy"... :lol
Title: Re: StrLen timings needed
Post by: ecube on August 23, 2010, 09:30:19 PM
I see the inferior are trying to take on the champ again, with little success  :U
Title: Re: StrLen timings needed
Post by: Antariy on August 23, 2010, 10:43:44 PM
Hi, this is new old version. For JJ and some other explainers of VirtualAlloc.


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
34      code size for AxStrLenMMX       34      total bytes for AxStrLenMMX
85      code size for AxStrLenSSE1a     88      total bytes for AxStrLenSSE1a
83      code size for StrLenLingo       88      total bytes for StrLenLingo
49      code size for lszLenMMX         49      total bytes for lszLenMMX
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
------- timings -------
251     cycles for szLen
105     cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
119     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

251     cycles for szLen
77      cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
118     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

258     cycles for szLen
105     cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
115     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J

251     cycles for szLen
78      cycles for MbStrLen4a
109     cycles for AxStrLenMMX
62      cycles for AxStrLenSSE1a
53      cycles for StrLenLingo
119     cycles for lszLenMMX
65      cycles for AxStrLenSSE1J


--- ok ---




Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 23, 2010, 10:54:53 PM
Quote from: E^cube on August 23, 2010, 09:30:19 PM
I see the inferior are trying to take on the champ again, with little success  :U


Why would anybody think that "Beat Lingo!" is some big need or big deal? Wow! There is no such need.

E^cube, you underestimate yourself if you think that everybody has only one goal: beating Lingo.

This is funny :)

His proc eats twice as much data per loop and has half the functionality (it crashes and does not preserve regs, which is needed for a fair comparison with Jochen's procs).
And his proc has only a ~45% performance gain, on HIS CPU only. This is your "champ"? This is a bad programmer who relies on other software to make his procs reliable.
Saying he makes the "fastest" proc because of this or that is just an excuse for being unable to write a proc with the same functionality and higher speed.

And Jochen fixed his proc; otherwise it crashes on short strings.
So your "champ" deserves no respect - he produces bad, unreliable code (maybe fast, but NOT working).



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3

------- timings, misaligned, 100 byte string -------
67      cycles for AxStrLenSSE1
75      cycles for AxJJStrLen1
81      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

66      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

67      cycles for AxStrLenSSE1
73      cycles for AxJJStrLen1
73      cycles for AxJJStrLen2
70      cycles for AxJJStrLen3

66      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3

------- timings, misaligned, 5 byte string -------
20      cycles for szLen
25      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3

19      cycles for szLen
27      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3

19      cycles for szLen
25      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3

19      cycles for szLen
26      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen1
24      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3


--- ok ---




Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 23, 2010, 11:04:27 PM
For lingo's late fix for short string support (Microsoft recommends not making code good and reliable in the first release, but shipping patches and SPs some days after the initial version).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
79      code size for StrLenLingo       79      total bytes for StrLenLingo

------- timings, misaligned, 100 byte string -------
66      cycles for AxStrLenSSE1
73      cycles for AxJJStrLen1
84      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
57      cycles for StrLenLingo

67      cycles for AxStrLenSSE1
77      cycles for AxJJStrLen1
72      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
74      cycles for StrLenLingo

69      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen1
70      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
56      cycles for StrLenLingo

66      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen1
70      cycles for AxJJStrLen2
69      cycles for AxJJStrLen3
55      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
19      cycles for szLen
25      cycles for AxStrLenSSE1
26      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
19      cycles for AxJJStrLen3
14      cycles for StrLenLingo

20      cycles for szLen
27      cycles for AxStrLenSSE1
26      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
15      cycles for StrLenLingo

19      cycles for szLen
25      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen1
23      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
14      cycles for StrLenLingo

19      cycles for szLen
26      cycles for AxStrLenSSE1
27      cycles for AxJJStrLen1
28      cycles for AxJJStrLen2
18      cycles for AxJJStrLen3
14      cycles for StrLenLingo


--- ok ---



Still big timings on CPUs that are not HIS...  :green2
Considering that the proc ought to be twice as fast (because it uses two regs)...  :green2



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 23, 2010, 11:28:45 PM
Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:

Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?
Title: Re: StrLen timings needed
Post by: Antariy on August 23, 2010, 11:33:07 PM
Quote from: jj2007 on August 23, 2010, 11:28:45 PM
Quote from: Antariy on August 23, 2010, 10:59:42 PM
For Jochen's 80d:

Thanks, Alex. So your algo is still a tick faster on your Celeron. Mine is a Yonah Celeron M - yours is Prescott or Merom?


This is Celeron D, Prescott (310)



Alex
P.S. What timings does my last code (93xxx.zip) get?
Title: Re: StrLen timings needed
Post by: Antariy on August 24, 2010, 08:38:21 PM
This is a big post, written in poor English (my English is equal to Lingo's Russian), but PLEASE read it entirely, think about it, and say whether I am right or not.

Lingo's
Quote
I can't thieve nothing from you and from the asian lamer just because you have no ideas in assembly , hence your code will be very ugly and slow always...

Really? How about this:

Lingo's "abs" function:

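        pop ecx         ; return address
        pop eax         ; argument
        cdq             ; edx = 0 if eax >= 0, -1 if eax < 0
        xor eax,edx     ; <---
        sub eax,edx     ; <---
        jmp ecx         ; return via jump instead of ret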



My Axa2l function, which uses a unified approach to converting a signed/unsigned string to long/dword:


    @done:
    add eax,edi     ; <---
    xor eax,edi     ; <---
    pop edi
    ret


This is the same code, so Lingo stole it. I confess that this is not a hard algo. But he too writes very simple algos, and he HAS NO RIGHT to say that somebody stole something from him.
But Lingo, as a stupid new-fangled programmer, doesn't know that SUB has been slower for technical (electrical) reasons since the first computers (BIG collections of registers built from valves or relays). It seems Lingo stole his Electrical Engineer degree certificate somewhere.
Nowadays, and for a long time already, SUB runs at almost the same speed as ADD, and the difference is not measurable. Still, using ADD instead of SUB where you can is good practice. Lingo doesn't know this practice, so he is not a good programmer; he is a lamer, because only lamers think they are Grandmasters. Thoughtful people know their own drawbacks and do not claim to be "the best".
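
For readers following the argument: both snippets above look like the same branchless trick, a conditional negate driven by a mask that is either 0 or all ones. Below is a minimal C sketch of the two orderings. It assumes 32-bit values and assumes that edi in the second snippet holds 0 or -1 the way edx does after cdq; the function names are made up for illustration and come from neither poster's code.

```c
#include <stdint.h>

/* Sign mask: 0 for x >= 0, 0xFFFFFFFF for x < 0 (what CDQ leaves in EDX). */
static uint32_t sign_mask(int32_t x) {
    return x < 0 ? 0xFFFFFFFFu : 0u;
}

/* xor/sub ordering: (x ^ m) - m */
static uint32_t abs_xor_sub(int32_t x) {
    uint32_t m = sign_mask(x);
    return ((uint32_t)x ^ m) - m;   /* unsigned math, so wraparound is defined */
}

/* add/xor ordering: (x + m) ^ m */
static uint32_t abs_add_xor(int32_t x) {
    uint32_t m = sign_mask(x);
    return ((uint32_t)x + m) ^ m;
}
```

With a mask of 0 both forms return x unchanged; with a mask of all ones both compute the two's complement negation, so they agree for every input, including the most negative value.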

Another argument: many people have talked to him about returning from a proc via "jmp ...". This is the stupid practice of a lamer programmer.
For example, not only I have told him this, but dioxin as well:
Quote
You aren't really advocating that, for speed, you should pop a return address off the stack and jump to it, are you? That messes up the branch prediction mechanism which has short cuts for a paired CALL-RETURN.
This is good advice, and perfectly right.
But Lingo doesn't listen to good advice - he thinks he is the Grandmaster, without compromises.
So this is HIS PROBLEM ONLY. But I don't understand why an Internet connection is available in the mental hospital of Toronto where Lingo is located :P



Hutch, why do you allow such a stupid lamer as Lingo on your forum? He only discredits this nice forum. The forum has people (I don't mean myself, I mean many, many others) whose experience is incomparably greater than his, yet he calls them "stupid", "thieves", "tolerated" etc.?


His techniques contain NOTHING unbelievable. All of them have been KNOWN for a long time.
For example, maybe somebody doesn't know why Lingo uses this code in his SSE hex2dword:

pshufw mm0,[eax], 01Bh   ; <--- this...
pxor mm2, mm2
pshufw mm1,mm0, 0E4h     ; <--- and this
paddb mm0, maskD0h


First, pshufw may be faster than movq, and in this case the instruction also reverses the word order (1Bh).
Second, copying one MMX reg to another is faster using pshufw with the direct-copy encoding (0E4h). On the P6 family, for example, "pshufw MMx,MMx" is 3(!) times faster than "movq MMx,MMx".
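
To see why 1Bh reverses the words and 0E4h is a plain copy, here is a small portable C model of how pshufw decodes its immediate: each pair of bits in imm8 selects the source word for one destination word. This is only a sketch of the instruction's documented selection rule, not MMX code, and the name pshufw_model is invented for the example.

```c
#include <stdint.h>

/* Portable model of PSHUFW dst, src, imm8:
   destination word i is the source word selected by imm8 bits [2i+1:2i]. */
static uint64_t pshufw_model(uint64_t src, uint8_t imm8) {
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;          /* which source word */
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;  /* extract it */
        dst |= word << (16 * i);                        /* place it in slot i */
    }
    return dst;
}
```

1Bh is 00 01 10 11 in binary, so slots 0..3 take source words 3, 2, 1, 0 (reversed); 0E4h is 11 10 01 00, so every slot takes its own word (a copy).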


So if anybody takes a close look at Lingo's code, it will NOT appear as something great, unknown or unbelievable.
His coding style is stale (it smells of a depraved youth) and contains nothing worth stealing, because Lingo did not invent these techniques.



Lingo, can you file an application for your *unique algo* with any respectable patent office? Or would you be ridiculed there? The latter is most likely, sorry. So don't tell us how great you are and what "unique" algos you produce.


He swaggers about his fast "great" code only because he has a fast CPU, and because Core+ is tolerant of some of his lamer techniques (like returning via jump, and many others).
As everybody saw, his code does not show outstanding results on non-high-end CPUs.

Throughout human history, what is great and respectable is making something good from something bad or cheap (like Ford for the US, or Diesel for Europe and the world).
Making something good from something already nice is no great thing.
So, can ANYBODY say that he is a great programmer? If yes, there is some similarity to Lingo in that.


So Lingo is an ORDINARY man (who can read Intel's manuals), not a God of assembly. Why does he take liberties (insults etc.) that even Administrators and Moderators don't take? Who is he to do this?


Further. In view of the responses of other people, maybe with a "thinking engine" similar to Lingo's, I will say why my code is the way it is. The answer is short: it works NO worse than Lingo's "great, nice, fast, unreliable" code, so what is there to talk about? Nowadays I write for myself and my work only, I write for CPUs of one type, and I have no need to write other code.
But this does not mean that I am an "asian lamer" and Lingo is greater. Lingo also writes for his CPU only, but his code works badly on other CPUs, and mine does not. If I don't use some optimization techniques, it is because they have NOTHING useful for me at this time.
I welcome reports about bad performance on other CPU architectures, and I try to make it better.
Lingo doesn't listen to any reports (if they are not good for him) and doesn't listen to any comments. If his code works badly, he says it is because the test computer is old and slow and because the owner of that computer is a lamer. Maybe Lingo is a madman?



Insults and the like are NOT interesting to me and have NO meaning to me. I wrote this post only to open the eyes of people who think Lingo is an unbeatable champion.

And Lingo's insults mean nothing to me, because I treat them as the talk of a madman.
If somebody thinks I wrote this post to show how nice and great I am, they are mistaken.
I wrote this post because Lingo's behaviour towards other members is not excusable.
I don't like offensive speech and never use it with people, but I make one big exception for Lingo.
I apologise to everyone else!



Alex
Title: Re: StrLen timings needed
Post by: dedndave on August 24, 2010, 08:54:21 PM
Alex - don't let lingo get on your nerves
most of us just ignore him   :lol
Title: Re: StrLen timings needed
Post by: Antariy on August 24, 2010, 09:10:01 PM
Quote from: dedndave on August 24, 2010, 08:54:21 PM
Alex - don't let lingo get on your nerves
most of us just ignore him   :lol

I'm a balanced man. Neither Lingo nor 1,000,000 Lingo clones can get to me. I worry about the forum and its members,
because it is not right to call members offensive names without any cause on their side.



Alex
Title: Re: StrLen timings needed
Post by: dedndave on August 24, 2010, 09:32:08 PM
it's fun to pick on lingo
he gets upset so easily
i can visualize the blood vessels popping out on his neck - lol
Title: Re: StrLen timings needed
Post by: jj2007 on August 24, 2010, 09:36:11 PM
Hi Alex,
Dave is right - don't let Rumpelstilzchen spoil your peaceful coding nights. He is good at writing highly specialised (and highly unrolled) algos, although it's always a pain to find out under which conditions they raise exceptions :green2

Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
85      code size for AxStrLenSSE1a     88      total bytes for AxStrLenSSE1a
84      code size for AxStrLenSSE1j     84      total bytes for AxStrLenSSE1j
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
------- timings -------
132     cycles for szLen
33      cycles for AxStrLenSSE1a
33      cycles for AxStrLenSSE1J
31      cycles for AxJJStrLen3
Title: Re: StrLen timings needed
Post by: Antariy on August 24, 2010, 09:42:40 PM
Quote from: jj2007 on August 24, 2010, 09:36:11 PM

Here are the timings for the 93 zip, with the addition of AxJJStrLen3, i.e. the algo that uses a global variable to store the xmm reg:


Jochen, I think you should not use a global variable, because it makes the proc non-reentrant => it doesn't support multi-threaded applications. 1-2 clocks don't mean much, I think.



Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 24, 2010, 09:50:22 PM
I am still confident that Rumpelstilzchen cannot make anything really great and unique. Maybe only a GREAT speech about his "GREAT" code :green2



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 24, 2010, 10:50:42 PM
Quote from: Antariy on August 24, 2010, 09:42:40 PM
Jochen, I think you should not use a global variable, because it makes the proc non-reentrant => it doesn't support multi-threaded applications.

That is correct, thanks for reminding me. Here are multi-thread and SSE2-safe variants 4+5:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
77      code size for AxJJStrLen1       78      total bytes for AxJJStrLen1
77      code size for AxJJStrLen2       78      total bytes for AxJJStrLen2
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5

------- timings, misaligned, 100 byte string -------
34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
34      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

34      cycles for AxStrLenSSE1
31      cycles for AxJJStrLen3
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5

------- timings, misaligned, 5 byte string -------
12      cycles for szLen
15      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5

12      cycles for szLen
12      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5

12      cycles for szLen
16      cycles for AxStrLenSSE1
7       cycles for AxJJStrLen3
10      cycles for AxJJStrLen4
10      cycles for AxJJStrLen5
Title: Re: StrLen timings needed
Post by: lingo on August 25, 2010, 12:14:46 AM
"The thieves are always liars"="Воры всегда лжецы"
Ugly spaghetti+lamer tubeteikin's code = slow.,..slow..slow...non me ne frega un cazzo... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

20      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

21      cycles for AxStrLenSSE1
20      cycles for AxJJStrLen3
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
8       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen3
5       cycles for AxJJStrLen4
5       cycles for AxJJStrLen5
1       cycles for StrLenLingo

15      cycles for AxStrLenSSE1
4       cycles for AxJJStrLen3
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

8       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen3
5       cycles for AxJJStrLen4
5       cycles for AxJJStrLen5
1       cycles for StrLenLingo

15      cycles for AxStrLenSSE1
4       cycles for AxJJStrLen3
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
2       cycles for StrLenLingo


--- ok ---
Title: Re: StrLen timings needed
Post by: jj2007 on August 25, 2010, 06:07:14 AM
WARNING: The algo "StrLenLingo" posted above by Lingo raises an exception when used with short strings that happen to start near the end of a buffer allocated with VirtualAlloc (it will also happen with HeapAlloc buffers, but only in rare cases - the typical "impossible to chase bug").
Unlike all other algos, it does not preserve ecx and xmm0.

On the positive side: It is a tick faster than the others on certain CPUs. Bravo, Lingo :cheekygreen:
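
The boundary problem in the warning above comes down to simple address arithmetic: an unaligned 16-byte read that starts in the last 15 bytes of a page reaches into the next page, which may not be committed, while a 16-byte read from a 16-byte-aligned address can never cross a page. Here is a rough C sketch of that check, assuming the usual 4096-byte x86 page; the names are hypothetical and only illustrate the arithmetic, not any of the posted algos.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BYTES  4096u   /* assumed page size */
#define BLOCK_BYTES 16u     /* one SSE register */

/* True if a BLOCK_BYTES-wide read starting at addr reaches into the next page. */
static bool read_crosses_page(uintptr_t addr) {
    return (addr % PAGE_BYTES) + BLOCK_BYTES > PAGE_BYTES;
}

/* Round addr down to a 16-byte boundary, so the first read stays in the page. */
static uintptr_t align_down16(uintptr_t addr) {
    return addr & ~(uintptr_t)(BLOCK_BYTES - 1);
}
```

For a 5-byte string whose terminator is the last byte of a page, the unaligned read crosses the boundary, but the aligned read of the same block does not, and it still contains the terminator.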
Title: Re: StrLen timings needed
Post by: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."

Title: Re: StrLen timings needed
Post by: frktons on August 25, 2010, 03:56:44 PM
Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

Do you mean : "Pick one, get two"? It sounds like a commercial ad   :lol
Title: Re: StrLen timings needed
Post by: jj2007 on August 25, 2010, 04:01:20 PM
Quote from: lingo on August 25, 2010, 03:50:59 PM
"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."

Well, not ALL CPUs. There is this rare exception of the so-called "P4":

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, misaligned, 5 byte string -------
100     cycles for szLen
30      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen3
28      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5
32      cycles for StrLenLingo (unsafe)

:green2
Title: Re: StrLen timings needed
Post by: lingo on August 25, 2010, 04:09:07 PM
"I ladri sono sempre bugiardi"
is equal to:
"The thieves are always liars"
is equal to:
"Воры всегда лжецы"
is equal to:
"Les voleurs sont toujours menteurs" :lol
Title: Re: StrLen timings needed
Post by: jj2007 on August 25, 2010, 09:08:32 PM
I can't decide, they are so close. Any diverging results on other CPUs?
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
------- timings, misaligned, 100 byte string -------
132     cycles for szLen
34      cycles for AxStrLenSSE1
33      cycles for AxJJStrLen4
33      cycles for AxJJStrLen5
------- timings, misaligned, 5 byte string -------
31      cycles for szLen
19      cycles for AxStrLenSSE1
17      cycles for AxJJStrLen4
17      cycles for AxJJStrLen5


Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
------- timings, 16-byte aligned, 100 byte string -------
252     szLen
61      AxStrLenSSE1
69      AxJJStrLen4
69      AxJJStrLen5
------- timings, misaligned, 100 byte string -------
91      szLen
29      AxStrLenSSE1
28      AxJJStrLen4
28      AxJJStrLen5
Title: Re: StrLen timings needed
Post by: Antariy on August 25, 2010, 09:13:43 PM
Quote from: lingo on August 25, 2010, 03:50:59 PM
"I ladri sono sempre bugiardi"

"It is a tick faster than the others on certain CPUs."
Again false...should be "It is 45% faster than the others on ALL CPUs."



Wow, lamer Lingo is making noises again:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
79      code size for AxJJStrLen3       80      total bytes for AxJJStrLen3
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

66      cycles for AxStrLenSSE1
69      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
56      cycles for StrLenLingo

64      cycles for AxStrLenSSE1
68      cycles for AxJJStrLen3
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

65      cycles for AxStrLenSSE1
67      cycles for AxJJStrLen3
74      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5
55      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
21      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
19      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo

26      cycles for AxStrLenSSE1
18      cycles for AxJJStrLen3
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
14      cycles for StrLenLingo


--- ok ---


56 vs 66 - 45%??? Lingo, you really did buy all your certificates.
Maybe because you don't actually live in Toronto? I doubt that Toronto gives citizenship to such lamers as you.

56 is faster than 66 by 15%, relative to 66. So you are not able to calculate such a simple thing? This is great, really.
I wonder what speed it will have on a PIII and older.
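
The percentage depends on which timing is used as the base, so here is the arithmetic spelled out as a tiny C sketch (helper names invented for this example). With 66 vs 56 cycles it gives about 15% fewer cycles; with the Core 2 figures posted earlier in the thread (22 vs 12 cycles) the same formula gives about 45%.

```c
/* Percent fewer cycles, relative to the slower timing. */
static double pct_fewer_cycles(double slow, double fast) {
    return 100.0 * (slow - fast) / slow;
}

/* Percent speedup, relative to the faster timing. */
static double pct_speedup(double slow, double fast) {
    return 100.0 * (slow - fast) / fast;
}
```

For 66 vs 56 the two bases give roughly 15.2% and 17.9%; neither is 45% on this Celeron.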



Alex
P.S. All your translations could be done by machine - this is simple text. Try to translate this: Линго, ты ЧМО !!! Когда ты переехал в Торонто, чмошник-недоучка?
Title: Re: StrLen timings needed
Post by: Antariy on August 25, 2010, 09:15:51 PM
Hi, Jochen!

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5

------- timings, misaligned, 100 byte string -------
267     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
73      cycles for AxJJStrLen5

266     cycles for szLen
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

264     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
71      cycles for AxJJStrLen5

261     cycles for szLen
64      cycles for AxStrLenSSE1
72      cycles for AxJJStrLen4
70      cycles for AxJJStrLen5

------- timings, misaligned, 5 byte string -------
99      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
28      cycles for AxJJStrLen5

93      cycles for szLen
31      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

95      cycles for szLen
30      cycles for AxStrLenSSE1
29      cycles for AxJJStrLen4
29      cycles for AxJJStrLen5

--- ok ---




Alex
Title: Re: StrLen timings needed
Post by: Antariy on August 25, 2010, 09:21:41 PM
GrandLamer Lingo's:
Quote
"The thieves are always liars"="Воры всегда лжецы"

Until you get a patent for your lamer algos, you cannot say this, because all your techniques are stolen from Intel's examples. And you turn Intel's not-very-good tutorials into awfully bad and ugly lamer code  :toothy

I doubt that you can get any patent from any respectable office, except maybe a patent for "The Napoleon of Toronto's Central Mental Hospital", or something like that.



Alex
Title: Re: StrLen timings needed
Post by: jj2007 on August 25, 2010, 09:32:30 PM
Quote from: Antariy on August 25, 2010, 09:15:51 PM
Hi, Jochen!

Timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
65      cycles for AxStrLenSSE1
74      cycles for AxJJStrLen4
73      cycles for AxJJStrLen5


Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:
pcmpeqb xmm0, [edx] ; there may be nullbytes before the real string
pmovmskb eax, xmm0
shr eax, cl ; shift their bits out
; bsf eax, eax ############# not necessary
jnz @ret1
..
@ret: mov ecx, [esp]
movaps xmm0, [ecx]
mov ecx, [esp+4]
add esp, Extra
ret 4
@ret1: bsf eax, eax ; good on P4 but disastrous on Celeron M
mov ecx, [esp]
movaps xmm0, [ecx]
mov ecx, [esp+4]
add esp, Extra
ret 4
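
For anyone who wants to follow the mask logic without an assembler: the snippet above reads an aligned 16-byte block, builds a bit mask of the zero bytes, shifts out the bits that belong to bytes before the string, and only then looks for the lowest set bit. Below is a scalar C model of that idea. It is a sketch, not a translation of the posted proc: positions are measured from the start of a caller-supplied buffer instead of real addresses (an assumption made to keep it portable), the buffer must be a multiple of 16 bytes and contain the terminator, and all names are invented.

```c
#include <stddef.h>

/* Model of PCMPEQB against zero + PMOVMSKB: bit i is set if block[i] == 0. */
static unsigned zero_byte_mask16(const unsigned char *block) {
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (block[i] == 0)
            mask |= 1u << i;
    return mask;
}

/* Model of BSF: index of the lowest set bit; the caller guarantees mask != 0. */
static unsigned lowest_set_bit(unsigned mask) {
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Length of the string that starts at buf[offset], scanning whole 16-byte blocks. */
static size_t strlen_blocks(const unsigned char *buf, size_t offset) {
    size_t block = offset & ~(size_t)15;        /* start of the aligned block */
    unsigned shift = (unsigned)(offset & 15);   /* misalignment (cl in the asm) */
    unsigned mask = zero_byte_mask16(buf + block) >> shift;  /* drop earlier bytes */
    if (mask != 0)
        return lowest_set_bit(mask);            /* terminator in the first block */
    size_t len = 16 - shift;                    /* bytes consumed so far */
    for (;;) {
        block += 16;
        mask = zero_byte_mask16(buf + block);
        if (mask != 0)
            return len + lowest_set_bit(mask);
        len += 16;
    }
}
```

The shift matters when zero bytes sit in front of the string inside the same block; without it the first set bit would be found too early.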
Title: Re: StrLen timings needed
Post by: Antariy on August 25, 2010, 09:40:34 PM
Quote from: jj2007 on August 25, 2010, 09:32:30 PM
Your code is better on your CPU, again :bg
I had a bright idea to skip a bsf but...:

This doesn't mean much - 6 clocks. Without preserving ecx and xmm, this proc would be 66 bytes long and run in 56 clocks (with the end-of-buffer check still done).

BSF is a (relatively) very slow instruction, but on my CPU the overhead of the check, branch and exit gives the same speed, just with bigger code.



Alex
Title: Re: StrLen timings needed
Post by: dedndave on August 25, 2010, 11:42:15 PM
BSF is a little slow - and clumsy to use, too
but - i have tried the alternatives without much luck
i am using a P4 - it's a slow instruction
i think it is faster on newer CPU's, Alex
even though that doesn't help us much, it makes the code look good in the forum tests   :bg
Title: Re: StrLen timings needed
Post by: lingo on August 26, 2010, 08:03:48 PM
"BSF is a little slow - and clumsy to use"

With new technologies we don't need  BSF :lol...just take a look:
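align 16
Newstrlen_sse4_2
movdqa  xmm7,  notnul           ; range pattern (presumably 01h..FFh, i.e. any non-null byte)
pop        ecx                  ; return address (ecx is overwritten by pcmpistri below)
pop        eax                  ; eax = string pointer
mov       edx, eax              ; keep the start for the final subtraction
@@:
pcmpistri xmm7, [eax], 14h      ; ranges + negative polarity: ecx = index of first byte outside the range
lea   eax,    [eax+16]          ; advance without touching the flags
jnz   @b                        ; ZF is set once the 16-byte block contains a null byte
sub    eax, edx
lea   eax, [eax+ecx-16]         ; length = bytes scanned before the last block + index within it
jmp    dword ptr [esp-2*4]      ; return address is read from below esp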


Hi, unhappy losers...slow.slow..again and again... :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
40      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

110     cycles for szLen
20      cycles for AxStrLenSSE1
23      cycles for AxJJStrLen4
21      cycles for AxJJStrLen5
12      cycles for StrLenLingo

111     cycles for szLen
49      cycles for AxStrLenSSE1
22      cycles for AxJJStrLen4
22      cycles for AxJJStrLen5
12      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
7       cycles for szLen
8       cycles for AxStrLenSSE1
5       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
40      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
2       cycles for StrLenLingo

7       cycles for szLen
8       cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo

7       cycles for szLen
42      cycles for AxStrLenSSE1
6       cycles for AxJJStrLen4
6       cycles for AxJJStrLen5
1       cycles for StrLenLingo


--- ok ---

Title: Re: StrLen timings needed
Post by: jj2007 on August 26, 2010, 09:22:49 PM
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown
Title: Re: StrLen timings needed
Post by: Antariy on August 26, 2010, 09:53:11 PM
Hi, stupid lingo!

It is no wonder your algos are fast - you test them on the best cases.
For example, part of Paul Dixon's algo, stolen and "optimized" by you:
Quote

.data
align 16
Src   dd 100 Dup(0)
Num dd 0FFFFFFFFh,0          <-------------------------- THIS!!!
vaPtr   dd 0
align 16
chartabL dw "00","10","20","30","40","50","60","70","80","90"
                dw "01","11","21","31","41","51","61","71","81","91"
                dw "02","12","22","32","42","52","62","72","82","92"
                dw "03","13","23","33","43","53","63","73","83","93"
                dw "04","14","24","34","44","54","64","74","84","94"
                dw "05","15","25","35","45","55","65","75","85","95"
                dw "06","16","26","36","46","56","66","76","86","96"
                dw "07","17","27","37","47","57","67","77","87","97"
                dw "08","18","28","38","48","58","68","78","88","98"
                dw "09","19","29","39","49","59","69","79","89","99"



Very nice, you test the best case - only the sign "-" and a "1". The result will be "-1" - the fastest possible conversion. Your stupid proc is no faster than Hutch's with 2^31  :bdg on any CPU that is NOT yours, of course  :toothy

If you are such a lamer that you cannot make anything good without "new technologies", then no comment. The posted code cannot be compared with other code, because it DOESN'T do everything that is needed for a fair comparison. IF you are not able to write a proc with the SAME characteristics - shut up. You were very dissatisfied when drizz's algo "without functionality" beat your stupid simple unrolled algo. But you are satisfied with YOUR OWN stupid algo without functionality. This is strange; no, this is funny, because you are a lamer.



Alex
Title: Re: StrLen timings needed
Post by: lingo on August 26, 2010, 10:00:00 PM
Ugly spaghetti+lamer tubeteikin's code = slow...slow..slow...I don't give a damn. :lol
Title: Re: StrLen timings needed
Post by: Antariy on August 26, 2010, 10:01:44 PM
Quote from: jj2007 on August 26, 2010, 09:22:49 PM
Quote from: lingo on August 26, 2010, 08:03:48 PM
Hi, unhappy losers...slow.slow..again and again... :lol

Lingo, your algo goes bang when used with short strings near the end of the buffer. You seem not to be able to learn. You are a troll :tdown

No, lingo is a ЧМО !!!
This is a nice Russian term for people like lingo  :bdg



Alex
P.S. Lingo, show us any normal app of yours, or can you only improve and optimize existing algos written by other people?
By a normal app I mean any app of yours that works for 5 minutes without crashing, and that can do something other than just print your clocks  :green2
Title: Re: StrLen timings needed
Post by: Antariy on August 26, 2010, 10:14:08 PM
Hi, folks!

I found one nice rule: stupid lingo doesn't like it when some code beats his code but does not have as much functionality as his code. But when his own lamer's code doesn't have any functionality, he thinks that this is great - that it is part of the algo. And when somebody tells him so, he is very dissatisfied and insults other people.

Lingo is an inadequate man - a madman.
Stupid GrandLamer lingo, where do you live? Not in Toronto, is it?
Don't worry, we won't tell anybody why you are ugly and wretched. Confess to us  :green2



Alex
Title: Re: StrLen timings needed
Post by: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany
Title: Re: StrLen timings needed
Post by: jj2007 on August 26, 2010, 10:28:54 PM
Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany

Right - if he lived in Toronto, his English would be a lot better.
Title: Re: StrLen timings needed
Post by: dedndave on August 27, 2010, 12:56:38 AM
i remember hearing somewhere in another thread that his wife doesn't like him, either   :bdg
Title: Re: StrLen timings needed
Post by: Antariy on August 28, 2010, 10:39:58 PM
Quote from: jj2007 on August 26, 2010, 10:28:54 PM
Quote from: dedndave on August 26, 2010, 10:20:40 PM
i think he's in Germany

Right - if he lived in Toronto, his English would be a lot better.

I see European phrase construction in his text. It is quite recognizable.



Alex
Title: Re: StrLen timings needed
Post by: hutch-- on August 30, 2010, 04:20:03 AM
 :bg

Come on guys, Lingo is OK, he just has a charming turn of phrase.  :P
Title: Re: StrLen timings needed
Post by: Antariy on August 30, 2010, 10:03:23 PM
Quote from: hutch-- on August 30, 2010, 04:20:03 AM
:bg

Come on guys, Lingo is OK, he just has a charming turn of phrase.  :P

If "OK" means someone who is inadequate and an upstart, then this is true: this someone is OK  :P



Alex
Title: Re: StrLen timings needed
Post by: oex on September 22, 2010, 12:53:51 AM
AMD Sempron(tm) Processor 3100+ (SSE3)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
79      code size for StrLenLingo       79      total bytes for StrLenLingo

139     cycles for szLen
66      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
45      cycles for StrLenLingo

139     cycles for szLen
62      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
45      cycles for StrLenLingo

139     cycles for szLen
67      cycles for AxStrLenSSE1
66      cycles for AxJJStrLen4
65      cycles for AxJJStrLen5
44      cycles for StrLenLingo

139     cycles for szLen
62      cycles for AxStrLenSSE1
67      cycles for AxJJStrLen4
66      cycles for AxJJStrLen5
45      cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
8       cycles for szLen
29      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
23      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
29      cycles for AxStrLenSSE1
25      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo

8       cycles for szLen
23      cycles for AxStrLenSSE1
24      cycles for AxJJStrLen4
23      cycles for AxJJStrLen5
23      cycles for StrLenLingo
Title: Re: StrLen timings needed
Post by: lingo on February 01, 2011, 04:22:48 PM
New CPU with SSE 4.2 and new results... :lol
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (SSE4)
97      code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
88      code size for AxJJStrLen4       88      total bytes for AxJJStrLen4
96      code size for AxJJStrLen5       96      total bytes for AxJJStrLen5
82      code size for pcmpistriLingo    82      total bytes for pcmpistriLingo
79      code size for StrLenLingo       79      total bytes for StrLenLingo

77      cycles for szLen
14      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
14      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

78      cycles for szLen
13      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
13      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
9       cycles for StrLenLingo

77      cycles for szLen
14      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
15      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

78      cycles for szLen
13      cycles for AxStrLenSSE1
15      cycles for AxJJStrLen4
13      cycles for AxJJStrLen5
11      cycles for pcmpistriLingo
8       cycles for StrLenLingo

------- timings, misaligned, 5 byte string -------
4       cycles for szLen
3       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
4       cycles for AxJJStrLen5
-1      cycles for pcmpistriLingo
-1      cycles for StrLenLingo

5       cycles for szLen
3       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
0       cycles for pcmpistriLingo
-1      cycles for StrLenLingo

4       cycles for szLen
2       cycles for AxStrLenSSE1
4       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
0       cycles for pcmpistriLingo
-1      cycles for StrLenLingo

4       cycles for szLen
2       cycles for AxStrLenSSE1
3       cycles for AxJJStrLen4
3       cycles for AxJJStrLen5
-1      cycles for pcmpistriLingo
-1      cycles for StrLenLingo


--- ok ---

Title: Re: StrLen timings needed
Post by: brethren on February 03, 2011, 03:14:23 PM
this string length algorithm is the fastest one i've found that uses no sse instructions. i've timed it against masm32's fast StrLen and it is slightly faster, plus there is still room for optimization :wink
btw i found this algo in one of randy hyde's books, originally written in hla. the hla source code is public domain

StrLength PROC USES esi, buf:DWORD
mov esi, buf
; test up to three leading bytes one at a time until esi is dword-aligned
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
test esi, 3
jz IsAligned
cmp BYTE PTR [esi], NULL
je done
add esi, 1
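IsAligned:
sub esi, 32 ; offsets the add at lbl1 on the first pass
lbl1:
add esi, 32
lbl2:
; eax ends up nonzero only if some byte of the dword is 00h or 80h
mov eax, [esi]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero0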

mov eax, [esi+4]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero4

mov eax, [esi+8]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero8

mov eax, [esi+12]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero12

mov eax, [esi+16]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero16

mov eax, [esi+20]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero20

mov eax, [esi+24]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jnz MightBeZero24

mov eax, [esi+28]
and eax, 7F7F7F7Fh
sub eax, 01010101h
and eax, 80808080h
jz lbl1

add esi, 28
jmp MightBeZero0
MightBeZero4:
add esi, 4
jmp MightBeZero0
MightBeZero8:
add esi, 8
jmp MightBeZero0
MightBeZero12:
add esi, 12
jmp MightBeZero0
MightBeZero16:
add esi, 16
jmp MightBeZero0
MightBeZero20:
add esi, 20
jmp MightBeZero0
MightBeZero24:
add esi, 24
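MightBeZero0:
; the screen also fires on 80h bytes, so check each byte of the dword
mov eax, [esi]
cmp al, 0
je done
cmp ah, 0
je done1
test eax, 0FF0000h
je done2
test eax, 0FF000000h
je done3

; false alarm: no zero byte here, resume scanning at the next dword
add esi, 4
jmp lbl2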
done3:
sub esi, buf
lea eax, [esi+3]
jmp @F
done2:
sub esi, buf
lea eax, [esi+2]
jmp @F
done1:
sub esi, buf
lea eax, [esi+1]
jmp @F
done:
mov eax, esi
sub eax, buf
@@:
ret
StrLength ENDP
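
The dword screen used at lbl2 can be written out in C to make the logic easier to follow. This is a sketch, not a translation of the full unrolled routine; `might_have_zero` and `strlen_dword` are names chosen here for illustration. Like the assembly, it assumes that an aligned 4-byte load may read a few bytes past the terminator, which strict C does not guarantee, so treat it as an illustration of the technique only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same screen as the asm: the result is nonzero exactly when some
   byte of x is 0x00 or 0x80 (the 0x80 case is a false alarm). */
static uint32_t might_have_zero(uint32_t x)
{
    return ((x & 0x7F7F7F7Fu) - 0x01010101u) & 0x80808080u;
}

static size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* Byte steps until p is 4-byte aligned, as in the asm prologue. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t x;
        memcpy(&x, p, sizeof x);          /* aligned 4-byte load */
        if (might_have_zero(x)) {
            /* Confirm byte by byte; a 0x80 byte lands here too. */
            for (size_t i = 0; i < 4; i++)
                if (p[i] == '\0')
                    return (size_t)(p - s) + i;
        }
        p += 4;                           /* false alarm or clean dword */
    }
}
```

The posted routine does the same thing eight dwords per loop pass; the unrolling changes the speed, not the logic.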
Title: Re: StrLen timings needed
Post by: lingo on February 04, 2011, 02:43:03 PM
"this string length algorithm is the fastest one i've found that uses no sse instructions."

My algo (without SSE) is faster:  :lol
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
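StrLenLingo proc lpszStr:DWORD
mov edx, esp ; keep the real stack pointer in edx
mov esp, [esp+1*4] ; esp = lpszStr, used as the read pointer from here on
pop eax ; first dword of the string, esp += 4
@@Loop:
sub eax, 1010101h
pop ecx ; next dword, read before the previous one has been checked
sub ecx, 1010101h
test eax, 80808080h
pop eax ; pop leaves the flags from the test intact
jne @f
@@LoopCont:
test ecx, 80808080h
je @@Loop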
test    byte ptr [esp-8], 0FFh     
je      mi8       
test    byte ptr [esp-7], 0FFh
je      mi7         
test    byte ptr [esp-6], 0FFh
je      mi6         
test    byte ptr [esp-5], 0FFh
jne     @@Loop
lea eax, [esp-5]
mov esp, edx
sub eax, [edx+1*4]
ret 4
align 8
@@:
test    byte ptr [esp-12], 0FFh     
jz      mi12         
test    byte ptr [esp-11], 0FFh
jz      mi11         
test    byte ptr [esp-10], 0FFh
jz      mi10         
test    byte ptr [esp-9], 0FFh
jnz     @@LoopCont
lea eax, [esp-9]
mov esp, edx
sub eax,[edx+1*4]
ret 4
mi8:
lea eax, [esp-8]
mov esp, edx
sub eax, [edx+1*4]
ret 4
mi7:
lea eax, [esp-7]
mov esp, edx
sub eax, [edx+1*4]
ret 4
mi6:
lea eax, [esp-6]
mov esp, edx
sub eax, [edx+1*4]
ret 4
mi12:
lea eax, [esp-12]
mov esp, edx
sub eax, [edx+1*4]
ret 4
mi11:
lea eax, [esp-11]
mov esp, edx
sub eax, [edx+1*4]
ret 4
mi10:
lea eax, [esp-10]
mov esp, edx
sub eax, [edx+1*4]
ret 4
StrLenLingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

The results on my CPU i7-2600K are:

28 cycles for agner fog StrLen-masmlib
25 cycles for StrLength
17 cycles for StrLenLingo
--test finished--
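
One difference between the two listings is the screen itself: StrLenLingo subtracts 1010101h and tests 80808080h without first masking the high bits. A small C sketch of what that test reacts to, next to the well-known exact variant, which is not from this thread and is shown only for comparison; both function names are assumptions made for the example:

```c
#include <stdint.h>

/* Screen as read from the StrLenLingo listing: nonzero when some byte
   of x is 0x00 or lies in 0x81..0xFF. */
static uint32_t screen_sub_only(uint32_t x)
{
    return (x - 0x01010101u) & 0x80808080u;
}

/* Classic exact test (not from the thread): nonzero only when some
   byte of x is 0x00. */
static uint32_t screen_exact(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```

The subtract-only screen fires on any byte of 81h or above as well as on zero, so text with many high-bit characters takes the byte-by-byte path more often. Plain ASCII never triggers it before the terminator.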

Title: Re: StrLen timings needed
Post by: jj2007 on February 04, 2011, 03:07:42 PM
Prescott P4:
110 cycles for agner fog StrLen-masmlib
63 cycles for StrLength
83 cycles for StrLenLingo
Title: Re: StrLen timings needed
Post by: FORTRANS on February 04, 2011, 04:48:08 PM
Hi,

   PIII, Win2k.

Steve


Assembling: strlen_a.asm

G:\WORK\TEMP>strlen_a
65 cycles for agner fog StrLen-masmlib
54 cycles for StrLength
65 cycles for StrLenLingo
--test finished--
Title: Re: StrLen timings needed
Post by: jj2007 on February 04, 2011, 05:40:34 PM
Compliments to Randy :U
Title: Re: StrLen timings needed
Post by: lingo on February 04, 2011, 11:17:39 PM
The old idiot can't control himself again.
It seems he has forgotten his pills again... :lol
Intel Core 2 Duo E8500, 3.16 GHz:  :lol
49 cycles for agner fog StrLen-masmlib
41 cycles for StrLength
26 cycles for StrLenLingo
--test finished--