News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

jj2007

Ok, misunderstanding :thumbu

qWord

Quote from: jj2007 on March 30, 2010, 11:21:51 PM
Ok, misunderstanding :thumbu
yep ..  - maybe caused by my bad english  :bg
FPU in a trice: SmplMath
It's that simple!

jj2007

For Yvan TheHolyMan: These are the correct invokes for the standard Windows and CRT functions:

   invoke lstrcmp, esi, edi
   invoke crt_strcmp, esi, edi

qWord

just for addition: an version without string instructions (as hutch suggest):
    lea esi,sz1
    lea edi,sz2
    movzx eax,BYTE ptr [esi]
    movzx ebx,BYTE ptr [edi]
    .while eax == ebx
@@:     or eax,ebx
        jz @F
        inc esi
        inc edi
        movzx eax,BYTE ptr [esi]
        movzx ebx,BYTE ptr [edi]
    .endw
    .if 0
@@:     ; equal
    .else
        ; not equal     
    .endif
FPU in a trice: SmplMath
It's that simple!

jj2007

just for fun:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4842    cycles for qWord, 1000 bytes
3746    cycles for movzx, 1000 bytes
6685    cycles for movzx check null, 1000 bytes
4192    cycles for repe cmpsb, 1000 bytes
31225   cycles for lstrcmp, 1000 bytes
3119    cycles for crt_strcmp, 1000 bytes

43      cycles for qWord, 10 bytes
32      cycles for movzx, 10 bytes
141     cycles for movzx check null, 10 bytes
123     cycles for repe cmpsb, 10 bytes
831     cycles for lstrcmp, 10 bytes
40      cycles for crt_strcmp, 10 bytes


As so often, the CRT function is damn fast, also taking into account that they check for nullbytes why "our" algos don't (with the exception of the movzx check null version).

lingo


Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
2542    cycles for qWord, 1000 bytes
2049    cycles for movzx, 1000 bytes
2032    cycles for LingoCMP, 1000 bytes
4036    cycles for movzx check null, 1000 bytes
4075    cycles for repe cmpsb, 1000 bytes
13431   cycles for lstrcmp, 1000 bytes
2807    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
25      cycles for qWord, 10 bytes
18      cycles for movzx, 10 bytes
33      cycles for movzx check null, 10 bytes
93      cycles for repe cmpsb, 10 bytes
836     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2524    cycles for qWord, 1000 bytes
2047    cycles for movzx, 1000 bytes
2026    cycles for LingoCMP, 1000 bytes
4039    cycles for movzx check null, 1000 bytes
4089    cycles for repe cmpsb, 1000 bytes
13416   cycles for lstrcmp, 1000 bytes
2789    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
25      cycles for qWord, 10 bytes
17      cycles for movzx, 10 bytes
34      cycles for movzx check null, 10 bytes
92      cycles for repe cmpsb, 10 bytes
846     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2530    cycles for qWord, 1000 bytes
2038    cycles for movzx, 1000 bytes
2032    cycles for LingoCMP, 1000 bytes
4039    cycles for movzx check null, 1000 bytes
4081    cycles for repe cmpsb, 1000 bytes
13491   cycles for lstrcmp, 1000 bytes
2798    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
25      cycles for qWord, 10 bytes
17      cycles for movzx, 10 bytes
33      cycles for movzx check null, 10 bytes
94      cycles for repe cmpsb, 10 bytes
843     cycles for lstrcmp, 10 bytes
28      cycles for crt_strcmp, 10 bytes


--- ok ---


hutch--

My only comment so far is why are these tests being done on such small samples, 10 bytes is "who cares" and 1k is very small, tests should also be run only much longer compares, a meg or larger. All you are testing with such short samples is the attack rate, not the sustain rate.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

hutch--

Here is a quick dabble with an unroll of movzx.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
2516    cycles for qWord, 1000 bytes
2023    cycles for movzx, 1000 bytes
2022    cycles for LingoCMP, 1000 bytes
2019    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059    cycles for repe cmpsb, 1000 bytes
13379   cycles for lstrcmp, 1000 bytes
2773    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
22      cycles for qWord, 10 bytes
26      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
92      cycles for repe cmpsb, 10 bytes
393     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2359    cycles for qWord, 1000 bytes
2023    cycles for movzx, 1000 bytes
2019    cycles for LingoCMP, 1000 bytes
2018    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059    cycles for repe cmpsb, 1000 bytes
13401   cycles for lstrcmp, 1000 bytes
2776    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
25      cycles for qWord, 10 bytes
17      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
92      cycles for repe cmpsb, 10 bytes
393     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2523    cycles for qWord, 1000 bytes
2026    cycles for movzx, 1000 bytes
2024    cycles for LingoCMP, 1000 bytes
2019    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4053    cycles for repe cmpsb, 1000 bytes
13295   cycles for lstrcmp, 1000 bytes
2773    cycles for crt_strcmp, 1000 bytes

15      cycles for LingoCMP, 10 bytes
22      cycles for qWord, 10 bytes
26      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
92      cycles for repe cmpsb, 10 bytes
393     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes


--- ok ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

 :wink
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
2514    cycles for qWord, 1000 bytes
2033    cycles for movzx, 1000 bytes
1265    cycles for LingoCMP, 1000 bytes
2022    cycles for LingoCMP with check null, 1000 bytes
2019    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059    cycles for repe cmpsb, 1000 bytes
13251   cycles for lstrcmp, 1000 bytes
2776    cycles for crt_strcmp, 1000 bytes

8       cycles for LingoCMP, 10 bytes
15      cycles for LingoCMP with check null, 10 bytes
22      cycles for qWord, 10 bytes
26      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
93      cycles for repe cmpsb, 10 bytes
825     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2358    cycles for qWord, 1000 bytes
2023    cycles for movzx, 1000 bytes
1266    cycles for LingoCMP, 1000 bytes
2021    cycles for LingoCMP with check null, 1000 bytes
2018    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4058    cycles for repe cmpsb, 1000 bytes
13228   cycles for lstrcmp, 1000 bytes
2775    cycles for crt_strcmp, 1000 bytes

8       cycles for LingoCMP, 10 bytes
15      cycles for LingoCMP with check null, 10 bytes
25      cycles for qWord, 10 bytes
17      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
92      cycles for repe cmpsb, 10 bytes
850     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes

2519    cycles for qWord, 1000 bytes
2026    cycles for movzx, 1000 bytes
1267    cycles for LingoCMP, 1000 bytes
2022    cycles for LingoCMP with check null, 1000 bytes
2019    cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059    cycles for repe cmpsb, 1000 bytes
13223   cycles for lstrcmp, 1000 bytes
2785    cycles for crt_strcmp, 1000 bytes

8       cycles for LingoCMP, 10 bytes
15      cycles for LingoCMP with check null, 10 bytes
22      cycles for qWord, 10 bytes
26      cycles for movzx, 10 bytes
15      cycles for movzx unrolled, 10 bytes  zzzzzzzzzzzz
92      cycles for repe cmpsb, 10 bytes
843     cycles for lstrcmp, 10 bytes
27      cycles for crt_strcmp, 10 bytes


--- ok ---

jj2007

Lingo, what about your old algo here?

jj2007

Quote from: hutch-- on March 31, 2010, 02:40:35 PM
why are these tests being done on such small samples, 10 bytes is "who cares"

Pretty relevant for a string sort, though, don't you think?

clive

Quote from: jj2007
Pretty relevant for a string sort, though, don't you think?

Well if you're doing sorts then perhaps we should move beyond a simple inequity test, and consider whether the string is actually greater,equal,smaller which is what strcmp() does.

-Clive
It could be a random act of randomness. Those happen a lot as well.

clive

Quote from: jj2007
As so often, the CRT function is damn fast, also taking into account that they check for nullbytes why "our" algos don't (with the exception of the movzx check null version).

I'll note that the CRT algorithm has code to get string1 to be DWORD aligned during the bulk of the comparison. The key branch targets are also DWORD aligned, with the primary loop fitting within a single cache line.

-Clive
It could be a random act of randomness. Those happen a lot as well.

jj2007

Quote from: clive on March 31, 2010, 07:55:09 PM
Well if you're doing sorts then perhaps we should move beyond a simple inequity test, and consider whether the string is actually greater,equal,smaller which is what strcmp() does.
Good idea. Here is a new version, the StrCmpSSE proc returns in edx the position of the first mismatch, and in eax the difference of the mismatching bytes. It still needs some polishing (still no check for end of string as yet), but the timings look promising.

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 4000
1070    cycles for SSE, long string
13702   cycles for qWord, long string
3266    cycles for Lingo, long string
12030   cycles for movzx, long string
24007   cycles for movzx check null, long string
16089   cycles for repe cmpsb, long string
72240   cycles for lstrcmp, long string
11139   cycles for crt_strcmp, long string

15      cycles for SSE, 10 bytes
43      cycles for qWord, 10 bytes
27      cycles for movzx, 10 bytes
51      cycles for movzx check null, 10 bytes
88      cycles for repe cmpsb, 10 bytes
456     cycles for lstrcmp, 10 bytes
47      cycles for crt_strcmp, 10 bytes

Codesizes:
Lingo:  365
SSE:    164

jj2007

Here is the version with check for end of string. The timings suffer a bit, of course, but it is still a bit faster than Lingo's old algo, and a factor 5 faster than the CRT version.

Note that lstrcmp is slow because it performs additional tasks (MSDN, remarks section):
QuoteThe lstrcmp function uses a word sort, rather than a string sort. A word sort treats hyphens and apostrophes differently than it treats other symbols that are not alphanumeric, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
3061    cycles for SSE with null check, long string
4977    cycles for Lingo, long string
15422   cycles for crt_strcmp, long string
34880   cycles for movzx check null, long string
20329   cycles for repe cmpsb, long string
174637  cycles for lstrcmp, long string

37      cycles for SSE with null check, 10 bytes
23      cycles for Lingo, 10 bytes
41      cycles for crt_strcmp, 10 bytes
98      cycles for movzx check null, 10 bytes
123     cycles for repe cmpsb, 10 bytes
679     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
SSE:    190