News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Compare two strings

Started by yvansoftware, March 30, 2010, 07:40:20 PM

Previous topic - Next topic

frktons

In case you wish to see how I got the results in the previous post,
here it is the new_strcmp comparing Lingo's code with mine.

Let me know if I screwed up anything.  :U

Enjoy

Frank
Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Frank,
Your algo is pretty fast but if it was to return the position and/or the value of the first mismatch, it would require some extra code. There is also no check for nullbytes; which is often not a problem, but what if you run into two short identical strings followed by some megabytes allocated with VirtualAlloc? It will crash right away...

P4 results:
4735    cycles for SSE with null check, long string (MasmBasic)
7014    cycles for Lingo, long string
6084    cycles for Frank, long string
14697   cycles for crt_strcmp, long string
158055  cycles for lstrcmp, long string


Here is your slightly modified routine, full testbed attached:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

Frank proc s1, s2 ; granularity is 4 bytes; some extra code required to get bytes, not dwords
mov ecx, [esp+4]
mov eax, [esp+8]
; xor ebx, ebx
; mov ebx, 125
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
mov edx, [ecx] ; unrolled 1
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
jmp @B
; dec ebx
; jnz @b
@@: sub eax, [esp+8] ; return the relative position in eax
ret 8
Frank endp

frktons

Quote from: jj2007 on June 17, 2010, 08:39:34 AM
Frank,
Your algo is pretty fast but if it was to return the position and/or the value of the first mismatch, it would require some extra code. There is also no check for nullbytes; which is often not a problem, but what if you run into two short identical strings followed by some megabytes allocated VirtualAlloc? It will crash right away...

Of course Jochen, some extra code is required if the count of bytes is not a multiple of 4 as well.
It could be managed quite easily I think.

This is just an example of how we can do a faster search-compare using 4 bytes at a time,
instead of 1 byte at a time.

Thanks anyway, I'll figure it out what you changed and see your point.  :U

Mind is like a parachute. You know what to do in order to use it :-)

frktons

Well it'll take quite a while to grasp your methods, so I'd like to ask just
a simple question:

Did you get the best result using xmm registers? I mean did you managed 8
bytes at a time thanks to xmm registers?

P.S.
On my pc I get these results from your routine:

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
String comparison: short string 10 bytes, long string 5050
3597    cycles for SSE with null check, long string
4344    cycles for SSE with null check, long string, case-insensitive
6163    cycles for Lingo, long string
5223    cycles for Frank, long string
14665   cycles for crt_strcmp, long string
10229   cycles for crt__stricmp, long string, case-insensitive
20480   cycles for movzx check null, long string
26398   cycles for repe cmpsb, long string
57011   cycles for lstrcmp, long string

12      cycles for SSE with null check, 10 bytes
18      cycles for SSE with null check, 10 bytes, case-insensitive
11      cycles for Lingo, 10 bytes
49      cycles for Frank, 10 bytes
27      cycles for crt_strcmp, 10 bytes
47      cycles for crt__stricmp, 10 bytes, case-insensitive
33      cycles for movzx check null, 10 bytes
100     cycles for repe cmpsb, 10 bytes
800     cycles for lstrcmp, 10 bytes

3866    cycles for SSE with null check, long string
4362    cycles for SSE with null check, long string, case-insensitive
5865    cycles for Lingo, long string
5166    cycles for Frank, long string
14179   cycles for crt_strcmp, long string
10710   cycles for crt__stricmp, long string, case-insensitive
20041   cycles for movzx check null, long string
25723   cycles for repe cmpsb, long string
58023   cycles for lstrcmp, long string

12      cycles for SSE with null check, 10 bytes
20      cycles for SSE with null check, 10 bytes, case-insensitive
11      cycles for Lingo, 10 bytes
6       cycles for Frank, 10 bytes
27      cycles for crt_strcmp, 10 bytes
45      cycles for crt__stricmp, 10 bytes, case-insensitive
33      cycles for movzx check null, 10 bytes
100     cycles for repe cmpsb, 10 bytes
850     cycles for lstrcmp, 10 bytes

Codesizes:
Lingo:  365
SSE:    277     (MasmBasic)
--- ok ---


How strange that I have :

6       cycles for Frank, 10 bytes


and


49      cycles for Frank, 10 bytes


in two different tests. What they mean? Why such a difference?

Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Quote from: frktons on June 17, 2010, 09:10:07 AM
Well it'll take quite a while to grasp your methods, so I'd like to ask just
a simple question:

Did you get the best result using xmm registers? I mean did you managed 8
bytes at a time thanks to xmm registers?
Yes, the MasmBasic StringsDiffer(str1, str2) algo is SSE2, i.e. it uses xmm registers

Quote
in two different tests. What they mean? Why such a difference?
Try the test with LOOP_COUNT = 100000 instead of 10000. Such differences are quite frequent, especially with the Pentium 4. To get a better picture, run the test several times. The lowest value should be the most valid value. For very short strings, your algo is definitely very fast.

frktons

OK  :U This is the general idea I was talking about.
In new processors it is usually faster and better manipulating
groups of bytes than single ones.
I wandered why in this topic the examples dealed with single
bytes. You can always test for remaining bytes in many ways, but
the overall process should be done with multiple bytes processing.
This is what I consider better design  :lol

By the way I'm glad I contributed a bit (I mean a multibyte  :P) to this
discussion.

Enjoy

Frank
Mind is like a parachute. You know what to do in order to use it :-)

lingo

"I have the strange idea that this method could be quite fast:"

Why do you think xor is faster than cmp instruction? :wink

"In new processors it is usually faster and better manipulating
groups of bytes than single ones"


It is true for 'old' processors too... :wink

"You can always test for remaining bytes in many way"
Just try to do it in your testbed... :wink

hutch--

Frank,

Just keep this in mind, to get the full advantage of multibyte compares, the data needs to be aligned by at least 4 otherwise the 4 byte reads take more than one access by the processor. If you bother to use this method you need to align one of the two strings being compared as the other may not be possible due to minor differences in alignment.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

frktons

Quote from: lingo on June 17, 2010, 01:27:40 PM
"I have the strange idea that this method could be quite fast:"

Why do you think xor is faster than cmp instruction? :wink

"In new processors it is usually faster and better manipulating
groups of bytes than single ones"


It is true for 'old' processors too... :wink

"You can always test for remaining bytes in many way"
Just try to do it in your testbed... :wink


:lol :lol :lol So why didn't you use that in your code?  ::)
Of course I'll try to do it, the Algo is already in my mind
and it shouldn't be difficult to code  :P even for a n00b like me.  :P

I mean if I have:

x = bytes of the string
y = result of x divided by 4 or 8 [it depends what I need]
r = remainder of the previous division

after I do the loop from y to 0
I could subtract the r from 4 or 8[...]
subtract the result from the two registers holding the addresses of the two strings
do another round of 4 or eight bytes
almost done  :lol I should check if everything is fine, but this is the overall idea.



Quote from: hutch-- on June 17, 2010, 01:41:36 PM
Frank,

Just keep this in mind, to get the full advantage of multibyte compares, the data needs to be aligned by at least 4 otherwise the 4 byte reads take more than one access by the processor. If you bother to use this method you need to align one of the two strings being compared as the other may not be possible due to minor differences in alignment.

ALIGN, ALIGN, ALIGN ...  :lol

I was going to open a new discussion on this matter.
so:

1- What is ALIGNment?
2 - Why should I ALIGN stuffs and What stuff should I ALIGN?
3 - When should I ALIGN and at what degree [4 - 8 - 16....]?
4 - Please show me how can I use the ALIGN stuff in the code I posted

Please feel free to ALIGN your answers after this post  :lol
Mind is like a parachute. You know what to do in order to use it :-)

jj2007

The SSE2/MasmBasic version is "multibyte", of course.
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2921    cycles for SSE with null check, long string
3687    cycles for SSE with null check, long string, case-insensitive
5826    cycles for Lingo, long string
4915    cycles for Frank, long string
13945   cycles for crt_strcmp, long string
16489   cycles for crt__stricmp, long string, case-insensitive
30043   cycles for movzx check null, long string
20105   cycles for repe cmpsb, long string
90040   cycles for lstrcmp, long string

18      cycles for SSE with null check, 10 bytes
25      cycles for SSE with null check, 10 bytes, case-insensitive
16      cycles for Lingo, 10 bytes
9       cycles for Frank, 10 bytes
32      cycles for crt_strcmp, 10 bytes
108     cycles for crt__stricmp, 10 bytes, case-insensitive
51      cycles for movzx check null, 10 bytes
88      cycles for repe cmpsb, 10 bytes
451     cycles for lstrcmp, 10 bytes


Same but aligned:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2133    cycles for SSE with null check, long string
3103    cycles for SSE with null check, long string, case-insensitive
4058    cycles for Lingo, long string
3170    cycles for Frank, long string
13859   cycles for crt_strcmp, long string
16480   cycles for crt__stricmp, long string, case-insensitive
30060   cycles for movzx check null, long string
20101   cycles for repe cmpsb, long string
89970   cycles for lstrcmp, long string

frktons

Hi Jochen, ALIGN your explanation please, what does your algos do?
Could you post the source as well? Thanks
If you feel like, please use a private msg in your native language.

Frank
Mind is like a parachute. You know what to do in order to use it :-)

jj2007

Quote from: frktons on June 17, 2010, 03:26:59 PM
Could you post the source as well? Thanks

The source was always included. Just open the attachment...

clive

Quote from: frktons

x = bytes of the string
y = result of x divided by 4 or 8 [it depends what I need]
r = remainder of the previous division


Problem is, with strcmp() is that you usually don't know what the length of the string is, unless you actually check it first (at additional cost) or haul the metadata around with the string. Further the real strcmp() actually provides an equal, greater than, less than answer.

Perhaps memcmp() would be a better model if you know the length, and don't want to check for a terminating NUL.
It could be a random act of randomness. Those happen a lot as well.

cman

What software are you using to the cycle analysis for each algorithm?


clive

Quote from: cman on June 17, 2010, 04:48:43 PM
What software are you using to the cycle analysis for each algorithm?

MichaelW has a timing library using RDTSC, look at some of the attachments.
It could be a random act of randomness. Those happen a lot as well.