Hi,
I'm trying to compare two strings, and I get the error "StrCmp not defined".
Here's my code:
.386
.model flat,stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\masm32.inc
include \masm32\include\gdi32.inc
include \masm32\include\user32.inc
include \masm32\include\kernel32.inc
include \masm32\include\Comctl32.inc
include \masm32\include\comdlg32.inc
include \masm32\include\shell32.inc
include \masm32\include\oleaut32.inc
include \masm32\macros\macros.asm
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\gdi32.lib
includelib \masm32\lib\user32.lib
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\Comctl32.lib
includelib \masm32\lib\comdlg32.lib
includelib \masm32\lib\shell32.lib
includelib \masm32\lib\oleaut32.lib
.data
YvanSoftware db "(c) YvanSoftware - ALL RIGHTS RESERVED", 13 ,10 ,0
EnterYourName db "Please enter your name: ", 0
CRLF db 13,10,0
TheHolyMan db "Yvan", 0
Seriously db "Seriously? You're the MAN!", 13,10,0
LoserName db "What a loser name.", 13,10
.data?
buffer db 100 dup(?)
.code
start:
invoke StdOut,addr YvanSoftware
invoke StdOut, addr EnterYourName
invoke StdIn, addr buffer, 100
invoke StdOut, addr CRLF
invoke StrCmp,addr buffer, addr TheHolyMan
je HolyMan
IfNotHolyMan:
invoke StdOut, addr LoserName
jmp EndIfHolyMan
HolyMan:
invoke StdOut, addr Seriously
jmp EndIfHolyMan
EndIfHolyMan:
invoke ExitProcess,0
END start
What do I need to do?
(yeah, I know - it's a bullsh*t app, but it's for learning ASM ;-) )
Hi,
; preserve these registers
push esi
push edi
cld ; count downwards or from the end of the string
mov ecx,40 ; length of the string to compare
mov esi,offset FirstStr
mov edi,offset SecondStr
rep cmpsb
je @F
; not equal processing here
jmp _EXIT
@@:
; equal processing here
_EXIT:
pop edi
pop esi
ret
Best regards,
Robin.
'rep cmpsb' isn't possible :wink
.data
sz1 db "hello",0
sz2 db "hello",0
.code
cld
lea edi,sz1
mov ecx,-1
xor eax,eax
mov edx,edi
repnz scasb
sub edi,edx
mov ecx,edi
mov edi,edx
lea esi,sz2
repz cmpsb
.if ZERO?
; equal
.else
; not equal
.endif
Huh.......???
From opcodes.chm:
QuoteCMPS - Compare String (Byte, Word or Doubleword)
Usage: CMPS dest,src
CMPSB
CMPSW
CMPSD (386+)
Modifies flags: AF CF OF PF SF ZF
CMPSB inc/decrements the index registers by 1
The REP prefixes can be used to process entire data items.
Best regards,
Robin.
This works, if I use repz. Maybe that is what you meant?
include \masm32\include\masm32rt.inc
.486
.model flat,stdcall
.data
string1 BYTE "Test!",0
string2 BYTE "Test!",0
.code
start:
; preserve these registers
push esi
push edi
cld ; count downwards or from the end of the string
mov ecx,6 ; length of the string to compare
lea esi,string1
lea edi,string2
repz cmpsb
je @F
; not equal processing here
print "Not equal!"
jmp _EXIT
@@:
; equal processing here
print "Equal!"
_EXIT:
pop edi
pop esi
ret
end start
Best regards,
Robin.
REP is not supported by cmpsX! -> see Intel's or AMD's developer guides.
OK - but repz definitely works. Try the code!
I see why rep doesn't work - wrong op code.
REP:
QuoteF3 6C REP INS r/m8, DX Input (E)CX bytes from port DX into ES:[(E)DI]
F3 6D REP INS r/m16,DX Input (E)CX words from port DX into ES:[(E)DI]
F3 6D REP INS r/m32,DX Input (E)CX doublewords from port DX into ES:[(E)DI]
F3 A4 REP MOVS m8,m8 Move (E)CX bytes from DS:[(E)SI] to ES:[(E)DI]
F3 A5 REP MOVS m16,m16 Move (E)CX words from DS:[(E)SI] to ES:[(E)DI]
F3 A5 REP MOVS m32,m32 Move (E)CX doublewords from DS:[(E)SI] to ES:[(E)DI]
F3 6E REP OUTS DX, r/m8 Output (E)CX bytes from DS:[(E)SI] to port DX
F3 6F REP OUTS DX, r/m16 Output (E)CX words from DS:[(E)SI] to port DX
F3 6F REP OUTS DX, r/m32 Output (E)CX doublewords from DS:[(E)SI] to port DX
F3 AC REP LODS AL Load (E)CX bytes from DS:[(E)SI] to AL
F3 AD REP LODS AX Load (E)CX words from DS:[(E)SI] to AX
F3 AD REP LODS EAX Load (E)CX doublewords from DS:[(E)SI] to EAX
F3 AA REP STOS m8 Fill (E)CX bytes at ES:[(E)DI] with AL
F3 AB REP STOS m16 Fill (E)CX words at ES:[(E)DI] with AX
F3 AB REP STOS m32 Fill (E)CX doublewords at ES:[(E)DI] with EAX
REPZ:
QuoteF3 A6 REPE CMPS m8,m8 Find nonmatching bytes in ES:[(E)DI] and DS:[(E)SI]
F3 A7 REPE CMPS m16,m16 Find nonmatching words in ES:[(E)DI] and DS:[(E)SI]
F3 A7 REPE CMPS m32,m32 Find nonmatching doublewords in ES:[(E)DI] and DS:[(E)SI]
F3 AE REPE SCAS m8 Find non-AL byte starting at ES:[(E)DI]
F3 AF REPE SCAS m16 Find non-AX word starting at ES:[(E)DI]
F3 AF REPE SCAS m32 Find non-EAX doubleword starting at ES:[(E)DI]
Best regards,
Robin.
Quote from: Astro on March 30, 2010, 08:26:37 PMI see why rep doesn't work - incompatible op code.
cmpsX with a REP-prefix wouldn't make much sense - think about it :bg
but repz definitely works. Try the code!
I've never disbelieved this - see my code
qWord
I see! :bg
I know I need to study the opcodes again. :red
Best regards,
Robin.
Quote from: Astro
I know I need to study the opcodes again.
Well REP and REPZ are the same opcode, so it frankly wouldn't matter at the machine code level. The inline assembler in Microsoft C (tested 12.00) is tolerant of either encoding as is MASM 5.1
-Clive
I confess to not having used CMPS? since the DOS days, you can do it faster and from memory with less registers by manually incrementing the count yourself.
Quote from: qWord on March 30, 2010, 08:31:04 PM
Quote from: Astro on March 30, 2010, 08:26:37 PMI see why rep doesn't work - incompatible op code.
cmpsX with a REP-prefix wouldn't make much sense - think about it :bg
Where is the problem? You need to know the length of the strings, but for example, comparing two files with identical size works just fine.
include \masm32\include\masm32rt.inc
.code
TextA db "Masm32?", 0
TextB db "Masm32!", 0
start:
mov esi, offset TextA
mov edi, offset TextB
mov ecx, SIZEOF TextA
repe cmpsb
MsgBox 0, str$(ecx), addr TextA, MB_OK
exit
end start
Quoteinclude \masm32\MasmBasic\MasmBasic.inc ; library is here (http://www.masm32.com/board/index.php?topic=12460)
.code
start: xor ebx, ebx
push Timer
Let esi=FileRead$("\masm32\include\Windows.inc")
; 1 bytes longer, Warning Duplicate instead of WARNING Duplicate
Let edi=FileRead$("\masm32\include\Copy_of_Windows.inc")
; invoke lstrcmp, esi, edi ; returns 1
; invoke crt_strcmp, esi, edi ; returns -1
mov ecx, Lof("\masm32\include\Copy_of_Windows.inc")
.if ecx>Lof("\masm32\include\Windows.inc")
xchg ecx, eax
.endif
mov ebx, eax
sub ebx, ecx
repe cmpsb
sub Timer, [esp]
pop edx
Inkey Str$("Time=%i ms", eax), Str$(", mismatch at byte EOF-%i", ecx), Str$(", diff in size=%i", ebx)
Exit
end start
Output:
Time=0 ms, mismatch at byte EOF-98, diff in size=1
jj,
rep and repz/e have the same opcode (I've not notice this before), but different termination conditions. Also in masm you can't write 'rep cmpsb'.
Quote from: qWord on March 30, 2010, 10:45:49 PM
jj,
rep and repz/e have the same opcode (I've not notice this before), but different termination conditions. Also in masm you can't write 'rep cmpsb'.
See attached exe from my (modified) example. It works.
when saying rep makes no sense with cmpsb, I was talking about the mnemonic 'rep', which terminates if ecx=0 - in context to cmpsX the mnemonic is named 'repz', which terminates if ecx=0 OR ZF=0. I've never said that you can't use it for comparing files or what ever...
Ok, misunderstanding :thumbu
Quote from: jj2007 on March 30, 2010, 11:21:51 PM
Ok, misunderstanding :thumbu
yep .. - maybe caused by my bad english :bg
For Yvan TheHolyMan: These are the correct invokes for the standard Windows and CRT functions:
invoke lstrcmp, esi, edi
invoke crt_strcmp, esi, edi
just for addition: an version without string instructions (as hutch suggest):
lea esi,sz1
lea edi,sz2
movzx eax,BYTE ptr [esi]
movzx ebx,BYTE ptr [edi]
.while eax == ebx
@@: or eax,ebx
jz @F
inc esi
inc edi
movzx eax,BYTE ptr [esi]
movzx ebx,BYTE ptr [edi]
.endw
.if 0
@@: ; equal
.else
; not equal
.endif
just for fun:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
4842 cycles for qWord, 1000 bytes
3746 cycles for movzx, 1000 bytes
6685 cycles for movzx check null, 1000 bytes
4192 cycles for repe cmpsb, 1000 bytes
31225 cycles for lstrcmp, 1000 bytes
3119 cycles for crt_strcmp, 1000 bytes
43 cycles for qWord, 10 bytes
32 cycles for movzx, 10 bytes
141 cycles for movzx check null, 10 bytes
123 cycles for repe cmpsb, 10 bytes
831 cycles for lstrcmp, 10 bytes
40 cycles for crt_strcmp, 10 bytes
As so often, the CRT function is damn fast, also taking into account that they check for nullbytes why "our" algos don't (with the exception of the movzx check null version).
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
2542 cycles for qWord, 1000 bytes
2049 cycles for movzx, 1000 bytes
2032 cycles for LingoCMP, 1000 bytes
4036 cycles for movzx check null, 1000 bytes
4075 cycles for repe cmpsb, 1000 bytes
13431 cycles for lstrcmp, 1000 bytes
2807 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
25 cycles for qWord, 10 bytes
18 cycles for movzx, 10 bytes
33 cycles for movzx check null, 10 bytes
93 cycles for repe cmpsb, 10 bytes
836 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2524 cycles for qWord, 1000 bytes
2047 cycles for movzx, 1000 bytes
2026 cycles for LingoCMP, 1000 bytes
4039 cycles for movzx check null, 1000 bytes
4089 cycles for repe cmpsb, 1000 bytes
13416 cycles for lstrcmp, 1000 bytes
2789 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
25 cycles for qWord, 10 bytes
17 cycles for movzx, 10 bytes
34 cycles for movzx check null, 10 bytes
92 cycles for repe cmpsb, 10 bytes
846 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2530 cycles for qWord, 1000 bytes
2038 cycles for movzx, 1000 bytes
2032 cycles for LingoCMP, 1000 bytes
4039 cycles for movzx check null, 1000 bytes
4081 cycles for repe cmpsb, 1000 bytes
13491 cycles for lstrcmp, 1000 bytes
2798 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
25 cycles for qWord, 10 bytes
17 cycles for movzx, 10 bytes
33 cycles for movzx check null, 10 bytes
94 cycles for repe cmpsb, 10 bytes
843 cycles for lstrcmp, 10 bytes
28 cycles for crt_strcmp, 10 bytes
--- ok ---
My only comment so far is why are these tests being done on such small samples, 10 bytes is "who cares" and 1k is very small, tests should also be run only much longer compares, a meg or larger. All you are testing with such short samples is the attack rate, not the sustain rate.
Here is a quick dabble with an unroll of movzx.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
2516 cycles for qWord, 1000 bytes
2023 cycles for movzx, 1000 bytes
2022 cycles for LingoCMP, 1000 bytes
2019 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059 cycles for repe cmpsb, 1000 bytes
13379 cycles for lstrcmp, 1000 bytes
2773 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
22 cycles for qWord, 10 bytes
26 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
92 cycles for repe cmpsb, 10 bytes
393 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2359 cycles for qWord, 1000 bytes
2023 cycles for movzx, 1000 bytes
2019 cycles for LingoCMP, 1000 bytes
2018 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059 cycles for repe cmpsb, 1000 bytes
13401 cycles for lstrcmp, 1000 bytes
2776 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
25 cycles for qWord, 10 bytes
17 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
92 cycles for repe cmpsb, 10 bytes
393 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2523 cycles for qWord, 1000 bytes
2026 cycles for movzx, 1000 bytes
2024 cycles for LingoCMP, 1000 bytes
2019 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4053 cycles for repe cmpsb, 1000 bytes
13295 cycles for lstrcmp, 1000 bytes
2773 cycles for crt_strcmp, 1000 bytes
15 cycles for LingoCMP, 10 bytes
22 cycles for qWord, 10 bytes
26 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
92 cycles for repe cmpsb, 10 bytes
393 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
--- ok ---
:wink
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
2514 cycles for qWord, 1000 bytes
2033 cycles for movzx, 1000 bytes
1265 cycles for LingoCMP, 1000 bytes
2022 cycles for LingoCMP with check null, 1000 bytes
2019 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059 cycles for repe cmpsb, 1000 bytes
13251 cycles for lstrcmp, 1000 bytes
2776 cycles for crt_strcmp, 1000 bytes
8 cycles for LingoCMP, 10 bytes
15 cycles for LingoCMP with check null, 10 bytes
22 cycles for qWord, 10 bytes
26 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
93 cycles for repe cmpsb, 10 bytes
825 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2358 cycles for qWord, 1000 bytes
2023 cycles for movzx, 1000 bytes
1266 cycles for LingoCMP, 1000 bytes
2021 cycles for LingoCMP with check null, 1000 bytes
2018 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4058 cycles for repe cmpsb, 1000 bytes
13228 cycles for lstrcmp, 1000 bytes
2775 cycles for crt_strcmp, 1000 bytes
8 cycles for LingoCMP, 10 bytes
15 cycles for LingoCMP with check null, 10 bytes
25 cycles for qWord, 10 bytes
17 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
92 cycles for repe cmpsb, 10 bytes
850 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
2519 cycles for qWord, 1000 bytes
2026 cycles for movzx, 1000 bytes
1267 cycles for LingoCMP, 1000 bytes
2022 cycles for LingoCMP with check null, 1000 bytes
2019 cycles for movzx unrolled, 1000 bytes xxxxxxxx
4059 cycles for repe cmpsb, 1000 bytes
13223 cycles for lstrcmp, 1000 bytes
2785 cycles for crt_strcmp, 1000 bytes
8 cycles for LingoCMP, 10 bytes
15 cycles for LingoCMP with check null, 10 bytes
22 cycles for qWord, 10 bytes
26 cycles for movzx, 10 bytes
15 cycles for movzx unrolled, 10 bytes zzzzzzzzzzzz
92 cycles for repe cmpsb, 10 bytes
843 cycles for lstrcmp, 10 bytes
27 cycles for crt_strcmp, 10 bytes
--- ok ---
Lingo, what about your old algo here (http://www.masm32.com/board/index.php?topic=2508.msg20153#msg20153)?
Quote from: hutch-- on March 31, 2010, 02:40:35 PM
why are these tests being done on such small samples, 10 bytes is "who cares"
Pretty relevant for a string sort, though, don't you think?
Quote from: jj2007
Pretty relevant for a string sort, though, don't you think?
Well if you're doing sorts then perhaps we should move beyond a simple inequity test, and consider whether the string is actually greater,equal,smaller which is what strcmp() does.
-Clive
Quote from: jj2007
As so often, the CRT function is damn fast, also taking into account that they check for nullbytes why "our" algos don't (with the exception of the movzx check null version).
I'll note that the CRT algorithm has code to get string1 to be DWORD aligned during the bulk of the comparison. The key branch targets are also DWORD aligned, with the primary loop fitting within a single cache line.
-Clive
Quote from: clive on March 31, 2010, 07:55:09 PM
Well if you're doing sorts then perhaps we should move beyond a simple inequity test, and consider whether the string is actually greater,equal,smaller which is what strcmp() does.
Good idea. Here is a new version, the StrCmpSSE proc returns in edx the position of the first mismatch, and in eax the difference of the mismatching bytes. It still needs some polishing (still no check for end of string as yet), but the timings look promising.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 4000
1070 cycles for SSE, long string
13702 cycles for qWord, long string
3266 cycles for Lingo, long string
12030 cycles for movzx, long string
24007 cycles for movzx check null, long string
16089 cycles for repe cmpsb, long string
72240 cycles for lstrcmp, long string
11139 cycles for crt_strcmp, long string
15 cycles for SSE, 10 bytes
43 cycles for qWord, 10 bytes
27 cycles for movzx, 10 bytes
51 cycles for movzx check null, 10 bytes
88 cycles for repe cmpsb, 10 bytes
456 cycles for lstrcmp, 10 bytes
47 cycles for crt_strcmp, 10 bytes
Codesizes:
Lingo: 365
SSE: 164
Here is the version with check for end of string. The timings suffer a bit, of course, but it is still a bit faster than Lingo's old algo (http://www.masm32.com/board/index.php?topic=2508.msg20153#msg20153), and a factor 5 faster than the CRT version.
Note that
lstrcmp is slow because it performs additional tasks (MSDN, remarks section (http://msdn.microsoft.com/en-us/library/ms647488%28VS.85%29.aspx)):
QuoteThe lstrcmp function uses a word sort, rather than a string sort. A word sort treats hyphens and apostrophes differently than it treats other symbols that are not alphanumeric, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
3061 cycles for SSE with null check, long string
4977 cycles for Lingo, long string
15422 cycles for crt_strcmp, long string
34880 cycles for movzx check null, long string
20329 cycles for repe cmpsb, long string
174637 cycles for lstrcmp, long string
37 cycles for SSE with null check, 10 bytes
23 cycles for Lingo, 10 bytes
41 cycles for crt_strcmp, 10 bytes
98 cycles for movzx check null, 10 bytes
123 cycles for repe cmpsb, 10 bytes
679 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
SSE: 190
Quote"coop" and "co-op" stay together within a sorted list
one is a place for chickens - the other is a group of guys that raise them :bg
Hi Dave,
You tested the old version. Here is a new one, including a case-insensitive algo.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
3533 cycles for SSE with null check, long string
4507 cycles for SSE with null check, long string, case-insensitive
6926 cycles for Lingo, long string
14194 cycles for crt_strcmp, long string
51312 cycles for crt__stricmp, long string, case-insensitive
33241 cycles for movzx check null, long string
20460 cycles for repe cmpsb, long string
154405 cycles for lstrcmp, long string
40 cycles for SSE with null check, 10 bytes
50 cycles for SSE with null check, 10 bytes, case-insensitive
23 cycles for Lingo, 10 bytes
41 cycles for crt_strcmp, 10 bytes
242 cycles for crt__stricmp, 10 bytes, case-insensitive
Edit: Correct version is StrCompSSE_ci3.zip - removed, it did not always return correct results. MasmBasic (http://www.masm32.com/board/index.php?topic=12460.0) StringsDiffer and FilesDiffer are more stable.
Intel Pentium 4 Prescott CPU 3.00GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
3564 cycles for SSE with null check, long string
4509 cycles for SSE with null check, long string, case-insensitive
72008 cycles for Lingo, long string
14371 cycles for crt_strcmp, long string
50830 cycles for crt__stricmp, long string, case-insensitive
32967 cycles for movzx check null, long string
-44643 cycles for repe cmpsb, long string
808120 cycles for lstrcmp, long string
40 cycles for SSE with null check, 10 bytes
52 cycles for SSE with null check, 10 bytes, case-insensitive
23 cycles for Lingo, 10 bytes
41 cycles for crt_strcmp, 10 bytes
244 cycles for crt__stricmp, 10 bytes, case-insensitive
100 cycles for movzx check null, 10 bytes
123 cycles for repe cmpsb, 10 bytes
740 cycles for lstrcmp, 10 bytes
3542 cycles for SSE with null check, long string
4568 cycles for SSE with null check, long string, case-insensitive
6953 cycles for Lingo, long string
14017 cycles for crt_strcmp, long string
115829 cycles for crt__stricmp, long string, case-insensitive
98018 cycles for movzx check null, long string
85415 cycles for repe cmpsb, long string
809285 cycles for lstrcmp, long string
43 cycles for SSE with null check, 10 bytes
55 cycles for SSE with null check, 10 bytes, case-insensitive
26 cycles for Lingo, 10 bytes
47 cycles for crt_strcmp, 10 bytes
246 cycles for crt__stricmp, 10 bytes, case-insensitive
101 cycles for movzx check null, 10 bytes
125 cycles for repe cmpsb, 10 bytes
745 cycles for lstrcmp, 10 bytes
i get crazy results, Jochen
ok - closed all other applications and ran it 10 times - something is amiss...
i am not sure i understand the Dummy1 var
doesn't that misalign the data strings ?
Yep, the dummies are for misaligning the strings.
Your results are really crazy. You are not by accident running a P4? :wink
Try changing the invoke Sleep, 100 to invoke Sleep, 200 - on my P4 it helps. And take the updated version with 285 bytes size - you were the only downloader yet, so I exchanged it silently.
[Off topic: Nice comics here (http://xkcd.com/). Say bye to your mouse :green]
i have played with that Sleep value a little bit after i saw MichaelW use it
it doesn't seem to make much difference until you get up to at least Sleep,500
Michael was using Sleep,1500 if i recall
let me try the updated virgin...
oh - running a P4 .... on purpose :P
here is my result for the new one - lol
30 diff at pos 4999 (zero-based)
looks like you fixed it :bg at least i get the same result every time
ohhhh - i get it - you got me :P
i was gonna take up a collection to buy Jochen a cuppa cappuccino
Oops, it seems I posted the one with the test for correctness activated :red
New one attached above, see StrCompSSE_ci3.zip
"i was gonna take up a collection to buy Jochen a cuppa cappuccino"
Will be better to advise him to not adjust his Chlorpromazine dose (from 50 mg three times daily) unless his doctor specifically instructs him to do so... :wink
is that like Librium ?
i still get crazy data, JJ
Quote from: dedndave on April 01, 2010, 02:53:09 PM
i still get crazy data, JJ
Typical behaviour for a P4 - many outliers. The lowest values are supposed to be the correct ones; for me they are 3600, 4500, 6900 cycles for the first three. Later I will test it on my Celeron.
i dunno - i have never seen numbers like "-69000" before
Celeron is more stable, as usual:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2902 cycles for SSE with null check, long string
3672 cycles for SSE with null check, long string, case-insensitive
5834 cycles for Lingo, long string
13933 cycles for crt_strcmp, long string
16508 cycles for crt__stricmp, long string, case-insensitive
30042 cycles for movzx check null, long string
20108 cycles for repe cmpsb, long string
89881 cycles for lstrcmp, long string
18 cycles for SSE with null check, 10 bytes
25 cycles for SSE with null check, 10 bytes, case-insensitive
16 cycles for Lingo, 10 bytes
32 cycles for crt_strcmp, 10 bytes
108 cycles for crt__stricmp, 10 bytes, case-insensitive
51 cycles for movzx check null, 10 bytes
88 cycles for repe cmpsb, 10 bytes
453 cycles for lstrcmp, 10 bytes
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1405 cycles for LingoCMP, 1000 bytes
519 cycles for frktonsCMP , 1000 bytes
--- ok ---
I have the strange idea that this method could be quite fast:
;--------------------------------------------------------------------------------------------------------------
; compare_4bytes
; -------------------------------------------------------------------------------------------------------------
; comparing 2 4 bytes strings the fast way
;--------------------------------------------------------------------------------------------------------------
;
; Method proposed by frktons
;--------------------------------------------------------------------------------------------------------------
; 16 june 2010 - masmforum
;--------------------------------------------------------------------------------------------------------------
include \masm32\include\masm32rt.inc
.686
.data
Src1 db "This", 0
Src2 db "This", 0
.code
start:
mov ecx, offset Src1
mov eax, offset Src2
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
xor edx, esi
jnz end_check
print "The strings are identical",13,10
jmp end_of_game
end_check:
print "The strings are different",13,10
end_of_game:
inkey chr$(13, 10, "--- ok ---", 13)
exit
end start
If you put this method inside a loop, it could be
nice to see how it compares with the other methods.
To be optimal, the cycle has to be 125 times, and the code should
be modified like:
...............
mov ecx, offset Src1
mov eax, offset Src2
xor ebx, ebx
mov ebx, 125
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax,4
mov edx, [ecx]
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
dec ebx
jnz @b
@@:
.........
If anyone is kind enough to try it and let me know, I'm quite curious. :P
Being an Assembly beginner, it could be that I've messed up things without
realizing it. :lol
Frank
In case you wish to see how I got the results in the previous post,
here it is the new_strcmp comparing Lingo's code with mine.
Let me know if I screwed up anything. :U
Enjoy
Frank
Frank,
Your algo is pretty fast but if it was to return the position and/or the value of the first mismatch, it would require some extra code. There is also no check for nullbytes; which is often not a problem, but what if you run into two short identical strings followed by some megabytes allocated with VirtualAlloc? It will crash right away...
P4 results:
4735 cycles for SSE with null check, long string (MasmBasic)
7014 cycles for Lingo, long string
6084 cycles for Frank, long string
14697 cycles for crt_strcmp, long string
158055 cycles for lstrcmp, long string
Here is your slightly modified routine, full testbed attached:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
Frank proc s1, s2 ; granularity is 4 bytes; some extra code required to get bytes, not dwords
mov ecx, [esp+4]
mov eax, [esp+8]
; xor ebx, ebx
; mov ebx, 125
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
mov edx, [ecx] ; unrolled 1
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
jmp @B
; dec ebx
; jnz @b
@@: sub eax, [esp+8] ; return the relative position in eax
ret 8
Frank endp
Quote from: jj2007 on June 17, 2010, 08:39:34 AM
Frank,
Your algo is pretty fast but if it was to return the position and/or the value of the first mismatch, it would require some extra code. There is also no check for nullbytes; which is often not a problem, but what if you run into two short identical strings followed by some megabytes allocated VirtualAlloc? It will crash right away...
Of course Jochen, some extra code is required if the count of bytes is not a multiple of 4 as well.
It could be managed quite easily I think.
This is just an example of how we can do a faster search-compare using 4 bytes at a time,
instead of 1 byte at a time.
Thanks anyway, I'll figure it out what you changed and see your point. :U
Well it'll take quite a while to grasp your methods, so I'd like to ask just
a simple question:
Did you get the best result using xmm registers? I mean did you managed 8
bytes at a time thanks to xmm registers?
P.S.
On my pc I get these results from your routine:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
String comparison: short string 10 bytes, long string 5050
3597 cycles for SSE with null check, long string
4344 cycles for SSE with null check, long string, case-insensitive
6163 cycles for Lingo, long string
5223 cycles for Frank, long string
14665 cycles for crt_strcmp, long string
10229 cycles for crt__stricmp, long string, case-insensitive
20480 cycles for movzx check null, long string
26398 cycles for repe cmpsb, long string
57011 cycles for lstrcmp, long string
12 cycles for SSE with null check, 10 bytes
18 cycles for SSE with null check, 10 bytes, case-insensitive
11 cycles for Lingo, 10 bytes
49 cycles for Frank, 10 bytes
27 cycles for crt_strcmp, 10 bytes
47 cycles for crt__stricmp, 10 bytes, case-insensitive
33 cycles for movzx check null, 10 bytes
100 cycles for repe cmpsb, 10 bytes
800 cycles for lstrcmp, 10 bytes
3866 cycles for SSE with null check, long string
4362 cycles for SSE with null check, long string, case-insensitive
5865 cycles for Lingo, long string
5166 cycles for Frank, long string
14179 cycles for crt_strcmp, long string
10710 cycles for crt__stricmp, long string, case-insensitive
20041 cycles for movzx check null, long string
25723 cycles for repe cmpsb, long string
58023 cycles for lstrcmp, long string
12 cycles for SSE with null check, 10 bytes
20 cycles for SSE with null check, 10 bytes, case-insensitive
11 cycles for Lingo, 10 bytes
6 cycles for Frank, 10 bytes
27 cycles for crt_strcmp, 10 bytes
45 cycles for crt__stricmp, 10 bytes, case-insensitive
33 cycles for movzx check null, 10 bytes
100 cycles for repe cmpsb, 10 bytes
850 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
SSE: 277 (MasmBasic)
--- ok ---
How strange that I have :
6 cycles for Frank, 10 bytes
and
49 cycles for Frank, 10 bytes
in two different tests. What they mean? Why such a difference?
Quote from: frktons on June 17, 2010, 09:10:07 AM
Well it'll take quite a while to grasp your methods, so I'd like to ask just
a simple question:
Did you get the best result using xmm registers? I mean did you managed 8
bytes at a time thanks to xmm registers?
Yes, the MasmBasic StringsDiffer(str1, str2) algo is SSE2, i.e. it uses xmm registers
Quote
in two different tests. What they mean? Why such a difference?
Try the test with LOOP_COUNT = 100000 instead of 10000. Such differences are quite frequent, especially with the Pentium 4. To get a better picture, run the test several times. The lowest value should be the most valid value. For very short strings, your algo is definitely very fast.
OK :U This is the general idea I was talking about.
In new processors it is usually faster and better manipulating
groups of bytes than single ones.
I wandered why in this topic the examples dealed with single
bytes. You can always test for remaining bytes in many ways, but
the overall process should be done with multiple bytes processing.
This is what I consider better design :lol
By the way I'm glad I contributed a bit (I mean a multibyte :P) to this
discussion.
Enjoy
Frank
"I have the strange idea that this method could be quite fast:"
Why do you think xor is faster than cmp instruction? :wink
"In new processors it is usually faster and better manipulating
groups of bytes than single ones"
It is true for 'old' processors too... :wink
"You can always test for remaining bytes in many way"
Just try to do it in your testbed... :wink
Frank,
Just keep this in mind, to get the full advantage of multibyte compares, the data needs to be aligned by at least 4 otherwise the 4 byte reads take more than one access by the processor. If you bother to use this method you need to align one of the two strings being compared as the other may not be possible due to minor differences in alignment.
Quote from: lingo on June 17, 2010, 01:27:40 PM
"I have the strange idea that this method could be quite fast:"
Why do you think xor is faster than cmp instruction? :wink
"In new processors it is usually faster and better manipulating
groups of bytes than single ones"
It is true for 'old' processors too... :wink
"You can always test for remaining bytes in many way"
Just try to do it in your testbed... :wink
:lol :lol :lol So why didn't you use that in your code? ::)
Of course I'll try to do it, the Algo is already in my mind
and it shouldn't be difficult to code :P even for a n00b like me. :P
I mean if I have:
x = bytes of the string
y = result of x divided by 4 or 8 [it depends what I need]
r = remainder of the previous division
after I do the loop from y to 0
I could subtract the r from 4 or 8[...]
subtract the result from the two registers holding the addresses of the two strings
do another round of 4 or eight bytes
almost done :lol I should check if everything is fine, but this is the overall idea.
Quote from: hutch-- on June 17, 2010, 01:41:36 PM
Frank,
Just keep this in mind, to get the full advantage of multibyte compares, the data needs to be aligned by at least 4 otherwise the 4 byte reads take more than one access by the processor. If you bother to use this method you need to align one of the two strings being compared as the other may not be possible due to minor differences in alignment.
ALIGN, ALIGN, ALIGN ... :lol
I was going to open a new discussion on this matter.
so:
1- What is ALIGNment?
2 - Why should I ALIGN stuffs and What stuff should I ALIGN?
3 - When should I ALIGN and at what degree [4 - 8 - 16....]?
4 - Please show me how can I use the ALIGN stuff in the code I posted
Please feel free to ALIGN your answers after this post :lol
The SSE2/MasmBasic version is "multibyte", of course.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2921 cycles for SSE with null check, long string
3687 cycles for SSE with null check, long string, case-insensitive
5826 cycles for Lingo, long string
4915 cycles for Frank, long string
13945 cycles for crt_strcmp, long string
16489 cycles for crt__stricmp, long string, case-insensitive
30043 cycles for movzx check null, long string
20105 cycles for repe cmpsb, long string
90040 cycles for lstrcmp, long string
18 cycles for SSE with null check, 10 bytes
25 cycles for SSE with null check, 10 bytes, case-insensitive
16 cycles for Lingo, 10 bytes
9 cycles for Frank, 10 bytes
32 cycles for crt_strcmp, 10 bytes
108 cycles for crt__stricmp, 10 bytes, case-insensitive
51 cycles for movzx check null, 10 bytes
88 cycles for repe cmpsb, 10 bytes
451 cycles for lstrcmp, 10 bytes
Same but aligned:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2133 cycles for SSE with null check, long string
3103 cycles for SSE with null check, long string, case-insensitive
4058 cycles for Lingo, long string
3170 cycles for Frank, long string
13859 cycles for crt_strcmp, long string
16480 cycles for crt__stricmp, long string, case-insensitive
30060 cycles for movzx check null, long string
20101 cycles for repe cmpsb, long string
89970 cycles for lstrcmp, long string
Hi Jochen, ALIGN your explanation please, what does your algos do?
Could you post the source as well? Thanks
If you feel like, please use a private msg in your native language.
Frank
Quote from: frktons on June 17, 2010, 03:26:59 PM
Could you post the source as well? Thanks
The source was always included. Just open the attachment...
Quote from: frktons
x = bytes of the string
y = result of x divided by 4 or 8 [it depends what I need]
r = remainder of the previous division
Problem is, with strcmp() is that you usually don't know what the length of the string is, unless you actually check it first (at additional cost) or haul the metadata around with the string. Further the real strcmp() actually provides an equal, greater than, less than answer.
Perhaps memcmp() would be a better model if you know the length, and don't want to check for a terminating NUL.
What software are you using to the cycle analysis for each algorithm?
Quote from: cman on June 17, 2010, 04:48:43 PM
What software are you using to the cycle analysis for each algorithm?
MichaelW has a timing library using RDTSC, look at some of the attachments.
Quote from: clive on June 17, 2010, 04:33:25 PM
Problem is, with strcmp() is that you usually don't know what the length of the string is, unless you actually check it first (at additional cost) or haul the metadata around with the string. Further the real strcmp() actually provides an equal, greater than, less than answer.
The two SSE2 variants, which have since been included in MasmBasic, do the null check; they return in eax the equal, greater than, less than info, and in edx the position where the first mismatch occurred.
Re algos used for timings:
Quoteinclude \masm32\macros\timers.asm ; get them from the Masm32 Laboratory (http://www.masm32.com/board/index.php?topic=770.0)
Quote from: jj2007
The two SSE2 variants, which have since been included in MasmBasic, do the null check; they return in eax the equal, greater than, less than info, and in edx the position where the first mismatch occurred.
Indeed, but the routines that Frank wrote do not, and in the context of the quote they further assume the length of the string, and that the strings are of equal length. In the general solution you can't make such assumptions, or you would be using memcmp().
Quote from: clive on June 17, 2010, 05:37:00 PM
Quote from: jj2007
The two SSE2 variants, which have since been included in MasmBasic, do the null check; they return in eax the equal, greater than, less than info, and in edx the position where the first mismatch occurred.
Indeed, but the routines that Frank wrote do not, and in the context of the quote they further assume the length of the string, and that the strings are of equal length. In the general solution you can't make such assumptions, or you would be using memcmp().
For the sake of fairness, I should add that Lingo's algo does the null check, although it's not mentioned in the output.
Re assuming equal length: The shorter string would trigger a mismatch, since the other one is non-zero; the real danger is that in 99% of all cases that will give the false impression that the algo works fine. Until you run into that rare case where you load two equal short files into a VirtualAlloc'ed buffer, and bang, access violation.
Very well indeed. :U
When I'll be able to understand all the stuff you are talking about
I'll be probably able to solve these problems as well, I hope at least. :P
So far I'll be happy just to understand why, what and when to ALIGN data
and how to do it, I mean how and where I declare the ALIGN directive?
Thanks
Frank
Quote from: jj2007
Re assuming equal length: The shorter string would trigger a mismatch, since the other one is non-zero; the real danger is that in 99% of all cases that will give the false impression that the algo works fine. Until you run into that rare case where you load two equal short files into a VirtualAlloc'ed buffer, and bang, access violation.
Plus the potential for byte(s) beyond the NUL to have any value, which would could potentially mean a pair of 2 character string which are otherwise identical to be miscompared if they were examined four bytes (or 8, whatever) at a time.
str1 db 'it',0,'x'
str2 db 'it',0,'y'
Quote from: clive on June 17, 2010, 07:14:54 PMPlus the potential for byte(s) beyond the NUL to have any value, which would could potentially mean a pair of 2 character string which are otherwise identical to be miscompared if they were examined four bytes (or 8, whatever) at a time.
str1 db 'it',0,'x'
str2 db 'it',0,'y'
jj is using bsf for finding the zero byte. Bytes following the termination zero are ignored (AFAIKS)
Quote from: frktons
So far I'll be happy just to understand why, what and when to ALIGN data
and how to do it, I mean how and where I declare the ALIGN directive?
You use the ALIGN directive immediately before the data/code you want to fall on a specific alignment boundary. MASM will insert padding to achieve this based on the current address ($).
If you use .486, the alignment for sections will be 16 bytes, so you can't use ALIGN 32 for example. You could explicitly define the segments and give them page alignment. The linker and the order of the objects linked, will impact where the code actually ends up, MASM passes down this alignment information, but the first object to be linked is the one that the linker uses.
The alignment relates to cache line boundaries (the width of data fetched into cache in a single operation, and other operations internal to the CPU), depends on CPU, but often 32 bytes wide.
Aligining branch targets on 16-byte boundaries helps with the branch prediction, and how the CPU tracks recent branches taken/not-taken.
Data should be aligned on at least 4-byte boundaries, or the data bus width of the processor. If a whole string fits in a single cache line, all the better. There is a penalty if they cross a cache line, Intel tends to be worse than AMD.
Other gains can be had by performing the bulk of 4-byte accesses on 4-byte boundaries, so library routines will often check for this and handle unaligned bytes first.
Quote from: qWord
he use bsf for finding the zero byte. Bytes following the termination zero are ignored (AFAIKS)
Not suggesting this is a flaw in all implementations, but should be part of a valid test suite.
Quote from: qWord on June 17, 2010, 07:32:39 PM
jj is using bsf for finding the zero byte. Bytes following the termination zero are ignored (AFAIKS)
Correct. It works as it should.
I've tried with 2600 bytes strings, and the proportion of performance
on my pc, with my original routine vs Lingo's one is still 2.5:1 more
or less.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3268 cycles for LingoCMP, 2600 bytes
1331 cycles for frktonsCMP , 2600 bytes
--- ok ---
I wander why with the modification of Jochen the proportion changes
so much. ::)
New version included, with the ALIGN suggested. :P
Frank, can i ask a question mate?
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?
Also, you are using a fixed length compare, to make it a fair comparison you would need to remove the zero check from Lingo's code and make it so it knew exactly how long the array was to compare against. Keep working on this though, its always good to see people trying new ideas.
HR,
Ghandi
Quote from: frktons
I wander why with the modification of Jochen the proportion changes
I think the word you are looking for is
wonder, or perhaps not, but I always imagine someone walking around.<G>
http://dictionary.reference.com/browse/wander
"to think or speak confusedly or incoherently."
http://dictionary.reference.com/browse/wonder
"to think or speculate curiously"
Quote from: Ghandi
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?
8 bytes per iteration
Take a look that cmp is faster than xor instruction... :wink
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
3274 cycles for LingoCMP, 2600 bytes
1318 cycles for NewfrktonsWithoutXOR , 2600 bytes
1334 cycles for frktonsCMP , 2600 bytes
--- ok ---
Quote from: clive on June 17, 2010, 09:51:11 PM
Quote from: frktons
I wander why with the modification of Jochen the proportion changes
I think the word you are looking for is wonder, or perhaps not, but I always imaging someone walking around.<G>
http://dictionary.reference.com/browse/wander
"to think or speak confusedly or incoherently."
http://dictionary.reference.com/browse/wonder
"to think or speculate curiously"
Yes Clive, I probably should use
wonder; bear with me, this is
my first semester in Assembly, C language and English language as well :P
wander is a bit more lunatic than I usually am :lol
I use
wander in a figurative way, like my mind going around
without knowing what to do or how to interpret things.
But I really don't know when or where I got that meaning. ::)
Maybe this translation:
to ramble without a definite purpose or objective; roam, rove, or stray: to wander over the earth.
is close to the meaning I intend.
Thanks for your corrections in all the fields and the explanations and
examples you give me. :U
Quote from: lingo on June 17, 2010, 10:53:23 PM
Take a look that cmp is faster than xor instruction... :wink
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
3274 cycles for LingoCMP, 2600 bytes
1318 cycles for NewfrktonsWithoutXOR , 2600 bytes
1334 cycles for frktonsCMP , 2600 bytes
--- ok ---
Not really that fast 1318 vs 1334. You have surely spared some
cycles skipping a passage, maybe one MOV or ADD less somewhere. ::)
I think that an experienced ASM programmer like you can cut down
the cycles by 30-40% wisely using xmm instructions and registers I'll
be aware of in my 4th or 5th semester. :eek
Quote from: lingo on June 17, 2010, 10:53:23 PM
Take a look that cmp is faster than xor instruction... :wink
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
3274 cycles for LingoCMP, 2600 bytes
1318 cycles for NewfrktonsWithoutXOR , 2600 bytes
1334 cycles for frktonsCMP , 2600 bytes
--- ok ---
1330 withouth xor
1331 CMP
Indeed is CMP more faster but it was needed to do such a test for this? (Not to be rude) But i think it was clear that CMP was faster. lol...
Funny second test and:
1330 withouth xor
1330 CMP
Well I tried to use
CMP instead of
XOR and there is
a very slight difference in performance:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3738 cycles for LingoCMP, 2600 bytes
1319 cycles for NewfrktonsWithout XOR , 2600 bytes
1329 cycles for frktons XOR , 2600 bytes
1325 cycles for frktons with CMP , 2600 bytes
--- ok ---
So the extra 5-6 cycles that Lingo spared depends on something else.
The new routine I tried is:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
cmp edx, esi
jne @f
add ecx, 4
add eax,4
mov edx, [ecx]
mov esi, [eax]
cmp edx, esi
jne @f
add ecx, 4
add eax, 4
dec ebx
jnz @b
@@:
counter_end
print str$(eax), 9, "cycles for frktons with CMP , 2600 bytes", 13, 10
Quote from: theunknownguy on June 18, 2010, 12:12:43 AM
Indeed is CMP more faster but it was needed to do such a test for this? (Not to be rude) But i think it was clear that CMP was faster. lol...
Probably it is clear for you, but the same I can't say about me :P
Lingo was so kind to post an example to clear this point and I am
grateful he did. :U
"So why didn't you use that in your code?"
and
"...you can cut down the cycles by 30-40% wisely using xmm instructions and registers
Yes, I have similar algo named SrchBytes with times here.. (http://www.masm32.com/board/index.php?topic=9370.90) :wink
Quote from: lingo on June 18, 2010, 12:29:50 AM
"So why didn't you use that in your code?"
and
"...you can cut down the cycles by 30-40% wisely using xmm instructions and registers
Yes, I have similar algo named SrchBytes with times here.. (http://www.masm32.com/board/index.php?topic=9370.90) :wink
My, I just need to be 30 years younger and a lot of free time and I can
have a nice time between bytes and mnemonics :lol
Great, that is really good to know that we have people still using
their brain today. :P
Quote
Probably it is clear for you, but the same I can't say about me :P
Lingo was so kind to post an example to clear this point and I am
grateful he did. :U
Quote
I think you can get it from logic itself:
"CMP subtracts the second operand from the first but, unlike the SUB instruction, does not store the result; only the flags are changed"
XOR computes the exclusive OR of the two operands. Each bit of the result is 1 if the corresponding bits of the operands are different; each bit is 0 if the corresponding bits are the same. The answer replaces the first operand.
But the example is quiet good. Still i am waiting for some PowerPC for run test like this... :dazzled:
Quote
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?
8 bytes per iteration
So i saw after re-reading the source and noticing that there are 2x INC EBX per iteration, making it 8 bytes per iteration. My bad, o0
HR,
Ghandi
Quote from: theunknownguy on June 18, 2010, 12:58:27 AM
I think you can get it from logic itself:
"CMP subtracts the second operand from the first but,
unlike the SUB instruction, does not store the result;
only the flags are changed"
XOR computes the exclusive OR of the two operands. Each bit of the result is 1
if the corresponding bits of the operands are different; each bit is 0 if the
corresponding bits are the same. The answer replaces the first operand.
:U well, I am in the path of learning Assembly, so these definitions
are studying matter. Where did you get them? I'll surely need them in many
future occasions.
Quote from: Ghandi on June 18, 2010, 01:17:06 AM
Quote
You set EBX to 325 and then decrement by one on each iteration, right? With a multiple of 4, this gives us 1300 bytes scanned, not the 2600 which you've logged, please correct me if i'm wrong?
8 bytes per iteration
So i saw after re-reading the source and noticing that there are 2x INC EBX per iteration, making it 8 bytes per iteration. My bad, o0
HR,
Ghandi
Yes, Clive already said that, so I didn't feel to add anything else. :U
Quote from: frktons on June 18, 2010, 01:31:06 AM
Quote from: theunknownguy on June 18, 2010, 12:58:27 AM
I think you can get it from logic itself:
"CMP subtracts the second operand from the first but,
unlike the SUB instruction, does not store the result;
only the flags are changed"
XOR computes the exclusive OR of the two operands. Each bit of the result is 1
if the corresponding bits of the operands are different; each bit is 0 if the
corresponding bits are the same. The answer replaces the first operand.
:U well, I am in the path of learning Assembly, so these definitions
are studying matter. Where did you get them? I'll surely need them in many
future occasions.
http://pdos.csail.mit.edu/6.828/2004/readings/i386/
Or IBM manual
Quote from: frktons on June 18, 2010, 12:04:27 AM
Not really that fast 1318 vs 1334. You have surely spared some
cycles skipping a passage, maybe one MOV or ADD less somewhere. ::)
If you are not sure, why don't you simple open his attachment and look at your code? You would have found out that butcher lingo has done an impressive job - see below, original copy & paste....
By the way: I had tested that question by simply replacing xor with cmp, and found out there is absolutely
no difference in timings.
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
mov ecx, offset Src1
mov ebx, 325
lea eax, Src2
@@:
mov edx, [ecx] ; Comparing two strings in ecx and eax
add ecx, 8
cmp edx, [eax]
mov edx, [ecx-4]
lea eax, [eax+8]
jnz @f
cmp edx, [eax-4]
jnz @f
sub ebx, 1
jnz @b
@@:
counter_end
print str$(eax), 9, "cycles for NewfrktonsWithoutXOR , 2600 bytes", 13, 10
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and esi
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax,4
mov edx, [ecx]
mov esi, [eax]
xor edx, esi
jnz @f
add ecx, 4
add eax, 4
dec ebx
jnz @b
@@:
counter_end
print str$(eax), 9, "cycles for frktonsCMP , 2600 bytes", 13, 10
Yes, Jochen, of course I looked inside the code Lingo posted, and
I noticed That he changed many things to get those 5-6 cycles, but
I am not able, so far, to understand which of those instructions got
those cycles. ::)
By the way it's good to see people getting better results with a little
bit crunching, I hoped this could be done in our complex and sometime
difficult daily life.
Enjoy
Quote from: theunknownguy on June 18, 2010, 01:45:42 AM
http://pdos.csail.mit.edu/6.828/2004/readings/i386/
Or IBM manual
Thanks unkn, it would be useful for reference. :U
An straight cmp comparation runned on the same test of 2600 bytes:
1338 clocks
mov ecx, offset Src1
mov eax, offset Src2
xor ebx, ebx
mov edx, [ecx+ebx]
cmp [eax+ebx], edx
jnz @2
add ebx, 4
jnz @1
1365 ~ 1379?:
mov edx, [ecx+ebx*4]
cmp [eax+ebx*4], edx
jnz @2
add ebx, 1
jnz @1
Against:
1370 clocks
mov edx, [ecx+ebx]
xor edx, [eax+ebx]
jnz @2
add ebx, 4
jnz @1
1378 clocks
mov edx, [ecx+ebx]
mov edi, [eax+ebx]
xor edx, edi
jnz @2
add ebx, 4
jnz @1
1369 clocks
mov edx, [ecx+ebx*4]
mov edi, [eax+ebx*4]
xor edx, edi
jnz @2
add ebx, 1
jnz @1
1365 clocks
mov edx, [ecx+ebx*4]
xor edx, [eax+ebx*4]
jnz @2
add ebx, 1
jnz @1
Surprise me the CMP SCALAR... meaby something wrong with the code i dont know... 5:14 AM here my eyes could be cheating on me.
A slighter faster version of string_compare:
I skipped a couple of MOV and I got a slight improvement in the overall
performance of the routine.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3369 cycles for LingoCMP, 2600 bytes
1330 cycles for frktons with XOR , 2600 bytes
1325 cycles for frktons with CMP , 2600 bytes
1316 cycles for frktons with CMP II , 2600 bytes
--- ok ---
Source and EXE attached.
Enjoy
Frank
Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and [eax]
cmp edx, [eax]
jne @f
; add ecx, 4
; add eax,4
mov edx, [ecx+4]
cmp edx, [eax+4]
jne @f
add ecx, 4+4
add eax, 4+4
dec ebx
jnz @b
@@:
Quote from: jj2007 on June 18, 2010, 12:54:10 PM
Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and [eax]
cmp edx, [eax]
jne @f
; add ecx, 4
; add eax,4
mov edx, [ecx+4]
cmp edx, [eax+4]
jne @f
add ecx, 4+4
add eax, 4+4
dec ebx
jnz @b
@@:
A very tiny improvement 1 or 2 cycles, I expected more to be honest.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3269 cycles for LingoCMP, 2600 bytes
1329 cycles for frktons with XOR , 2600 bytes
1326 cycles for frktons with CMP , 2600 bytes
1316 cycles for frktons with CMP II , 2600 bytes
1314 cycles for frktons with CMP III , 2600 bytes
--- ok ---
I only changed:
add ecx, 4+4
add eax, 4+4
with:
add ecx, 8
add eax, 8
But it is somewhat smaller and smarter. :U
I tried to unroll the loop, 10 times, but no misurable gain.
Probably this method has been exploited enough :P
Enjoy
Overclocked to 3.46ghz:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268 cycles for LingoCMP, 2600 bytes
1642 cycles for frktons with XOR , 2600 bytes
1642 cycles for frktons with CMP , 2600 bytes
1966 cycles for frktons with CMP II , 2600 bytes
Quote from: Rockoon on June 18, 2010, 05:00:42 PM
Overclocked to 3.46ghz:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268 cycles for LingoCMP, 2600 bytes
1642 cycles for frktons with XOR , 2600 bytes
1642 cycles for frktons with CMP , 2600 bytes
1966 cycles for frktons with CMP II , 2600 bytes
:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol
AMD has a different philsophy about optimization.
:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol
Quote from: frktons on June 18, 2010, 05:40:13 PM
Quote from: Rockoon on June 18, 2010, 05:00:42 PM
Overclocked to 3.46ghz:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
3268 cycles for LingoCMP, 2600 bytes
1642 cycles for frktons with XOR , 2600 bytes
1642 cycles for frktons with CMP , 2600 bytes
1966 cycles for frktons with CMP II , 2600 bytes
:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol
AMD has a different philsophy about optimization.
:lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol :lol
Me too. It can be faster, even on Intel :)
Added CMP III
AMD Phenom(tm) II X6 1055T Processor (SSE3)
3265 cycles for LingoCMP, 2600 bytes
1640 cycles for frktons with XOR , 2600 bytes
1640 cycles for frktons with CMP , 2600 bytes
1965 cycles for frktons with CMP II , 2600 bytes
1310 cycles for frktons with CMP III, 2600 bytes
Quote from: jj2007 on June 18, 2010, 12:54:10 PM
Try this...
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
@@:
mov edx, [ecx] ; Comparing two strings in edx and [eax]
cmp edx, [eax]
jne @f
; add ecx, 4
; add eax,4
mov edx, [ecx+4]
cmp edx, [eax+4]
jne @f
add ecx, 4+4
add eax, 4+4
dec ebx
jnz @b
@@:
Dec EBX? what happen with dec being slower than sub?.
It shouldnt been sub ebx, 1. At least for what i remember from agner fog...
Also using 1 reg could avoid 2 adds.
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
xor edi, edi
@@:
mov edx, [ecx+edi] ; Comparing two strings in edx and [eax]
cmp edx, [eax+edi]
jne @f
mov edx, [ecx+edi+4]
cmp edx, [eax+edi+4]
jne @f
add edi, 4+4
sub ebx, 1
jnz @b
@@:
Testing the CMP REG32*4 SCALAR give worst results, but on XOR give betters... odd
Anyway this should run faster some body test it?
Quote from: Rockoon on June 18, 2010, 07:03:01 PM
Added CMP III
AMD Phenom(tm) II X6 1055T Processor (SSE3)
3265 cycles for LingoCMP, 2600 bytes
1640 cycles for frktons with XOR , 2600 bytes
1640 cycles for frktons with CMP , 2600 bytes
1965 cycles for frktons with CMP II , 2600 bytes
1310 cycles for frktons with CMP III, 2600 bytes
Where is the code? Does it work on Intel as well?
Let me know.
Frank
Added method proposed by theunknownguy:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3372 cycles for LingoCMP, 2600 bytes
1331 cycles for frktons with XOR , 2600 bytes
1326 cycles for frktons with CMP , 2600 bytes
1318 cycles for frktons with CMP II , 2600 bytes
1316 cycles for frktons with CMP III , 2600 bytes
1317 cycles for frktons with CMP unrolled , 2600 bytes
1316 cycles for theunknown with CMP , 2600 bytes
--- ok ---
No gain. ::)
aligning the top of all the inner loops by 16, and introducing my own method:
AMD Phenom(tm) II X6 1055T Processor (SSE3)
2834 cycles for LingoCMP, 2600 bytes
1553 cycles for frktons with XOR , 2600 bytes
1542 cycles for frktons with CMP , 2600 bytes
1319 cycles for frktons with CMP II , 2600 bytes
1146 cycles for frktons with CMP III, 2600 bytes
989 cycles for RockoonCMP, 2600 bytes
and my method (please check for errors?)
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; LONG SOURCE
mov ecx, offset Src1
mov eax, offset Src2
mov ebx, 325
sub eax, ecx
align 16
@@:
mov edx, [ecx]
cmp edx, [ecx + eax]
jne @f
mov edx, [ecx + 4]
cmp edx, [ecx + eax + 4]
jne @f
add ecx, 8
dec ebx
jnz @b
@@:
counter_end
print str$(eax), 9, "cycles for RockoonCMP, 2600 bytes", 13, 10
i doubt that ALIGN 16 makes much difference :P
i find that, for smaller loops (where the branch at the end is SHORT), alignment can actually slow you down a bit
the added padding takes time to execute, too
if the branch at the end of the loop is NEAR, 16-alignment definately helps
Well, it does make a difference on AMD.
On Intels with a trace cache, then probably not. 1 byte instructions might be stored as 32-bits, for instance. I'm sure Intel made some good choices with it, so that all the uops in it fall on boundaries that make the most sense for the decoder.
AMD hasnt really changed its architecture in quite awhile. No trace caches or anything like that. Alignment still matters there. In a lot of ways, its like programming for the old Pentium III architecture. The same can be said for the Core2's tho, with the P4's netburst behind us I think that we are all happy.
Quote from: Rockoon on June 18, 2010, 07:23:57 PM
aligning the top of all the inner loops by 16, and introducing my own method:
...
sub eax, ecx
align 16
@@:
mov edx, [ecx]
cmp edx, [ecx + eax]
jne @f
mov edx, [ecx + 4]
cmp edx, [ecx + eax + 4]
jne @f
add ecx, 8
dec ebx
jnz @b
@@:
Cute idea :U
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2911 cycles for SSE with null check, long string
3684 cycles for SSE with null check, long string, case-insensitive
5854 cycles for Lingo, long string
4878 cycles for Frank, long string
3803 cycles for Rockoon, long string
3811 cycles for RockoonJ, long string
3770 cycles for RockoonJ2, long string
13906 cycles for crt_strcmp, long string
16493 cycles for crt__stricmp, long string, case-insensitive
30072 cycles for movzx check null, long string
20088 cycles for repe cmpsb, long string
89965 cycles for lstrcmp, long string
18 cycles for SSE with null check, 10 bytes
24 cycles for SSE with null check, 10 bytes, case-insensitive
16 cycles for Lingo, 10 bytes
9 cycles for Frank, 10 bytes
48 cycles for Rockoon, 10 bytes
7 cycles for RockoonJ, 10 bytes
7 cycles for RockoonJ2, 10 bytes
32 cycles for crt_strcmp, 10 bytes
108 cycles for crt__stricmp, 10 bytes, case-insensitive
51 cycles for movzx check null, 10 bytes
88 cycles for repe cmpsb, 10 bytes
453 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 44
RockoonJ: 42
RockoonJ2: 33
SSE: 277 (MasmBasic)
It can compete with the SSE algo, but bear in mind there is no null check, and granularity is 8 bytes.
2911 cycles for SSE with null check, long string
3803 cycles for Rockoon, long string
Well on my pc I got these results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
3299 cycles for LingoCMP, 2600 bytes
1331 cycles for frktons with XOR , 2600 bytes
1370 cycles for frktons with CMP , 2600 bytes
1317 cycles for frktons with CMP II , 2600 bytes
1317 cycles for frktons with CMP III , 2600 bytes
1314 cycles for frktons with CMP unrolled , 2600 bytes
1334 cycles for theunknown with CMP , 2600 bytes
1316 cycles for RockoonCMP, 2600 bytes
--- ok ---
Doesn't seem different from others method regarding performance. ::)
Just replaced the attachment above - I could shave off around 30 cycles and a few bytes. Here is the 33 bytes version.
align 16
RockoonJ2 proc
pop edx
pop eax
pop ecx
push edx
sub ecx, eax
align 8
@@: mov edx, [eax+ecx]
cmp edx, [eax]
jne @F
lea eax, [eax+4]
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
je @B
@@: sub eax, [esp-4] ; return the relative position in ecx
ret
RockoonJ2 endp
RockoonJ2_endp:
1 byte shorter and remove one dependency:
mov edx, [eax+ecx+4]
cmp edx, [eax+4]
lea eax, [eax+8]
instead of:
lea eax, [eax+4]
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
:bg
Quote from: qWord on June 18, 2010, 08:55:22 PM
1 byte shorter and remove one dependency:
Thanks, qWord. I had tested that one, it's a whopping 80 cycles slower on my Celeron - but of course, it might be faster on other CPU's. Did you test it?
no - can you post your current test bed?
Here it is. Search for "qWord" :bg
sorry jj,
I've found your previous test bed in the mean time ::)
Quote from: jj2007 on June 18, 2010, 08:29:38 PM
Cute idea :U
'twas nothing. Just one of those things in my bag of tricks. I've got many :)
Now, thinking about this further...
...operations on eax are usually encoded shorter, and that is true for both MOV EAX, [...] and ADD EAX, constant
so maybe make the comparator register EAX or conversely make the incrementing register EAX. I doubt it will have a performance impact (so doubtful that I wont bother) but it should shave a byte or so (??) off the loop.
as far as detecting the null terminator, a SWAR method would probably be most efficient for the general purpose regs. Might become register constrained tho.
jj,
for my core2duo it seems a bit faster for long strings, but no difference for the short ones.
Am I the only guy with AMD's?
AMD Phenom(tm) II X6 1055T Processor (SSE3)
String comparison: short string 10 bytes, long string 5050
1609 cycles for SSE with null check, long string
1921 cycles for SSE with null check, long string, case-insensitive
3776 cycles for Lingo, long string
2732 cycles for Frank, long string
2522 cycles for Rockoon, long string
2018 cycles for RockoonJ, long string
2211 cycles for RockoonJ2, long string
10046 cycles for crt_strcmp, long string
20065 cycles for crt__stricmp, long string, case-insensitive
25050 cycles for movzx check null, long string
11026 cycles for repe cmpsb, long string
78248 cycles for lstrcmp, long string
23 cycles for SSE with null check, 10 bytes
34 cycles for SSE with null check, 10 bytes, case-insensitive
17 cycles for Lingo, 10 bytes
8 cycles for Frank, 10 bytes
9 cycles for Rockoon, 10 bytes
10 cycles for RockoonJ, 10 bytes
10 cycles for RockoonJ2, 10 bytes
23 cycles for crt_strcmp, 10 bytes
60 cycles for crt__stricmp, 10 bytes, case-insensitive
41 cycles for movzx check null, 10 bytes
34 cycles for repe cmpsb, 10 bytes
777 cycles for lstrcmp, 10 bytes
1608 cycles for SSE with null check, long string
1920 cycles for SSE with null check, long string, case-insensitive
3776 cycles for Lingo, long string
2732 cycles for Frank, long string
2522 cycles for Rockoon, long string
2018 cycles for RockoonJ, long string
2210 cycles for RockoonJ2, long string
10037 cycles for crt_strcmp, long string
20072 cycles for crt__stricmp, long string, case-insensitive
25048 cycles for movzx check null, long string
11031 cycles for repe cmpsb, long string
77537 cycles for lstrcmp, long string
23 cycles for SSE with null check, 10 bytes
34 cycles for SSE with null check, 10 bytes, case-insensitive
17 cycles for Lingo, 10 bytes
8 cycles for Frank, 10 bytes
9 cycles for Rockoon, 10 bytes
9 cycles for RockoonJ, 10 bytes
10 cycles for RockoonJ2, 10 bytes
20 cycles for crt_strcmp, 10 bytes
60 cycles for crt__stricmp, 10 bytes, case-insensitive
41 cycles for movzx check null, 10 bytes
34 cycles for repe cmpsb, 10 bytes
755 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
SSE: 277 (MasmBasic)
Quote from: Rockoon on June 18, 2010, 09:22:09 PM
...operations on eax are usually encoded shorter, and that is true for both MOV EAX, [...] and ADD EAX, constant
so maybe make the comparator register EAX or conversely make the incrementing register EAX.
I'm afraid there is no difference - sizes are identical. This variant seems a tick faster, though:
align 16
RockoonJ3 proc
pop edx
pop eax
pop ecx
sub ecx, eax
push edx
align 8 ; aka mov edi, edi
@@:
REPEAT 1 ; further unrolling slows it down
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
jne @F
ENDM
mov edx, [eax+ecx]
cmp edx, [eax]
lea eax, [eax+4]
je @B
@@: sub eax, [esp-4] ; return the relative position in eax
ret
RockoonJ3 endp
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2909 cycles for SSE with null check, long string
3826 cycles for Rockoon, long string
3836 cycles for RockoonJ, long string
3788 cycles for RockoonJ2, long string
3723 cycles for RockoonJ3, long string
Quote from: Rockoon
Am I the only guy with AMD's?
I have a couple of Venice and Newcastle boxes, but unfortunately nothing newer. Will also try the code on an Atom later, because that is more like the P3.
SWAR method for detecting NULL bytes...
assuming eax contains 4 bytes to be checked
mov esi, eax
sub eax, 1010101h
not esi
and eax, 80808080h
and eax, esi
ZF is now set if EAX contained no null's
can be extended in the obvious manner for 64-bit registers
In C, this is
bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;
from http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
Quote from: Rockoon on June 18, 2010, 09:55:33 PM
SWAR method for detecting NULL bytes...
It makes a difference :bg
For these timings, I have been mean - I added 50,000 nullbytes to the long test strings.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050 (+50000)
2932 cycles for SSE with null check, long string
3665 cycles for SSE with null check, long string, case-insensitive
5545 cycles for Lingo, long string with null check
59617 cycles for Frank, long string
44534 cycles for Rockoon, long string
44263 cycles for RockoonJ, long string
44044 cycles for RockoonJ2, long string
44192 cycles for RockoonJ3, long string
7875 cycles for RockoonCNB, long string, check nullbyte
13922 cycles for crt_strcmp, long string
16129 cycles for crt__stricmp, long string, case-insensitive
90185 cycles for lstrcmp, long string
Quote from: Rockoon on June 18, 2010, 09:55:33 PM
SWAR method for detecting NULL bytes...
...
bool hasZeroByte = (v - 0x01010101UL) & ~v & 0x80808080UL;
from http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
"was written in a newsgroup post on April 27, 1987 by Alan Mycroft" - I had just abandoned my Sinclair ZX at that time ::)
So what? Are you checking if one or more bytes inside the register
contain a NULL [zero binary?].
And then you stop confronting the two strings?
What about a binary file that contains some zero binary bytes and
those bytes are part of the string we are comparing?
If all the strings are NULL terminated like in the C Language, let's
assume we are comparing just two memory areas with that content.
I hope I understood what you are meaning. ::)
What about a binary file that contains some zero binary bytes and
those bytes are part of the string we are comparing?
They are NOT strings (PASCAL, C, '$' terminated), they are simply collections of bytes.
Yes, you want to do a memcmp(), not a strcmp().
Strings are null-terminated, unless the contrary is explicitly stated. But that's not the problem.
Imagine you are comparing files. You have allocated two fat memory buffers with VirtualAlloc. You are loading two files into this buffer, and run the routine without checking the nullbytes. The following scenarios are possible:
- you know the string length, and stop comparing when the end is reached
- you do not know the string length, and stop comparing when the strings differ
- you do not know the string length, and the files are equal
In the latter case, you stop comparing with an access violation at the end of the buffer.
http://www.masm32.com/board/index.php?topic=13701.msg113195#msg113195
StrCompCheckNull.zip
Venice
AMD Athlon(tm) 64 Processor 3400+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2883 cycles for SSE with null check, long string
3520 cycles for SSE with null check, long string, case-insensitive
3964 cycles for Lingo, long string with null check
56656 cycles for Frank, long string
54710 cycles for Rockoon, long string
61931 cycles for RockoonJ, long string
54668 cycles for RockoonJ2, long string
54667 cycles for RockoonJ3, long string
5061 cycles for RockoonCNB, long string, check nullbyte
10149 cycles for crt_strcmp, long string
20218 cycles for crt__stricmp, long string, case-insensitive
106631 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
46 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
7 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
524 cycles for lstrcmp, 10 bytes
2888 cycles for SSE with null check, long string
3519 cycles for SSE with null check, long string, case-insensitive
3975 cycles for Lingo, long string with null check
56400 cycles for Frank, long string
55030 cycles for Rockoon, long string
61608 cycles for RockoonJ, long string
54774 cycles for RockoonJ2, long string
55538 cycles for RockoonJ3, long string
5169 cycles for RockoonCNB, long string, check nullbyte
10234 cycles for crt_strcmp, long string
20625 cycles for crt__stricmp, long string, case-insensitive
108163 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
43 cycles for SSE with null check, 10 bytes, case-insensitive
8 cycles for Lingo, 10 bytes with null check
-12 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
2 cycles for RockoonJ, 10 bytes
14 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
558 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
NewCastle
AMD Athlon(tm) 64 Processor 3200+ (SSE2)
String comparison: short string 10 bytes, long string 5050
2920 cycles for SSE with null check, long string
3541 cycles for SSE with null check, long string, case-insensitive
3985 cycles for Lingo, long string with null check
91701 cycles for Frank, long string
57463 cycles for Rockoon, long string
61822 cycles for RockoonJ, long string
57433 cycles for RockoonJ2, long string
57502 cycles for RockoonJ3, long string
5104 cycles for RockoonCNB, long string, check nullbyte
10184 cycles for crt_strcmp, long string
20326 cycles for crt__stricmp, long string, case-insensitive
107065 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
46 cycles for SSE with null check, 10 bytes, case-insensitive
19 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
9 cycles for RockoonJ3, 10 bytes
15 cycles for RockoonCNB, 10 bytes, check nullbyte
528 cycles for lstrcmp, 10 bytes
2925 cycles for SSE with null check, long string
3542 cycles for SSE with null check, long string, case-insensitive
4111 cycles for Lingo, long string with null check
57906 cycles for Frank, long string
57502 cycles for Rockoon, long string
61841 cycles for RockoonJ, long string
57448 cycles for RockoonJ2, long string
57577 cycles for RockoonJ3, long string
5143 cycles for RockoonCNB, long string, check nullbyte
10188 cycles for crt_strcmp, long string
20411 cycles for crt__stricmp, long string, case-insensitive
107784 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
47 cycles for SSE with null check, 10 bytes, case-insensitive
14 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
13 cycles for RockoonJ3, 10 bytes
15 cycles for RockoonCNB, 10 bytes, check nullbyte
528 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
Atom
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
String comparison: short string 10 bytes, long string 5050
6787 cycles for SSE with null check, long string
7194 cycles for SSE with null check, long string, case-insensitive
13367 cycles for Lingo, long string with null check
131992 cycles for Frank, long string
95902 cycles for Rockoon, long string
104444 cycles for RockoonJ, long string
106118 cycles for RockoonJ2, long string
70027 cycles for RockoonJ3, long string
10839 cycles for RockoonCNB, long string, check nullbyte
18223 cycles for crt_strcmp, long string
23639 cycles for crt__stricmp, long string, case-insensitive
222617 cycles for lstrcmp, long string
73 cycles for SSE with null check, 10 bytes
70 cycles for SSE with null check, 10 bytes, case-insensitive
33 cycles for Lingo, 10 bytes with null check
23 cycles for Frank, 10 bytes
22 cycles for Rockoon, 10 bytes
20 cycles for RockoonJ, 10 bytes
30 cycles for RockoonJ2, 10 bytes
26 cycles for RockoonJ3, 10 bytes
29 cycles for RockoonCNB, 10 bytes, check nullbyte
913 cycles for lstrcmp, 10 bytes
4723 cycles for SSE with null check, long string
5298 cycles for SSE with null check, long string, case-insensitive
12200 cycles for Lingo, long string with null check
130570 cycles for Frank, long string
96277 cycles for Rockoon, long string
97053 cycles for RockoonJ, long string
98298 cycles for RockoonJ2, long string
68774 cycles for RockoonJ3, long string
9803 cycles for RockoonCNB, long string, check nullbyte
17967 cycles for crt_strcmp, long string
24884 cycles for crt__stricmp, long string, case-insensitive
216194 cycles for lstrcmp, long string
57 cycles for SSE with null check, 10 bytes
71 cycles for SSE with null check, 10 bytes, case-insensitive
38 cycles for Lingo, 10 bytes with null check
27 cycles for Frank, 10 bytes
29 cycles for Rockoon, 10 bytes
24 cycles for RockoonJ, 10 bytes
34 cycles for RockoonJ2, 10 bytes
30 cycles for RockoonJ3, 10 bytes
35 cycles for RockoonCNB, 10 bytes, check nullbyte
930 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
Clive,
For this following test, which ZiP did you use?
Dave.
Quote from: clive on June 18, 2010, 11:54:59 PM
AMD Athlon(tm) 64 Processor 3400+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2883 cycles for SSE with null check, long string
3520 cycles for SSE with null check, long string, case-insensitive
3964 cycles for Lingo, long string with null check
56656 cycles for Frank, long string
54710 cycles for Rockoon, long string
61931 cycles for RockoonJ, long string
54668 cycles for RockoonJ2, long string
54667 cycles for RockoonJ3, long string
5061 cycles for RockoonCNB, long string, check nullbyte
10149 cycles for crt_strcmp, long string
20218 cycles for crt__stricmp, long string, case-insensitive
106631 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
46 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
7 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
524 cycles for lstrcmp, 10 bytes
2888 cycles for SSE with null check, long string
3519 cycles for SSE with null check, long string, case-insensitive
3975 cycles for Lingo, long string with null check
56400 cycles for Frank, long string
55030 cycles for Rockoon, long string
61608 cycles for RockoonJ, long string
54774 cycles for RockoonJ2, long string
55538 cycles for RockoonJ3, long string
5169 cycles for RockoonCNB, long string, check nullbyte
10234 cycles for crt_strcmp, long string
20625 cycles for crt__stricmp, long string, case-insensitive
108163 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
43 cycles for SSE with null check, 10 bytes, case-insensitive
8 cycles for Lingo, 10 bytes with null check
-12 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
2 cycles for RockoonJ, 10 bytes
14 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
558 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
Quote from: KeepingRealBusy on June 19, 2010, 12:31:42 AM
For this following test, which ZiP did you use?
The last one JJ posted.
http://www.masm32.com/board/index.php?topic=13701.msg113195#msg113195
Clive,
Thank you.
Quote from: jj2007 on June 18, 2010, 11:29:51 PM
Strings are null-terminated, unless the contrary is explicitly stated. But that's not the problem.
Imagine you are comparing files. You have allocated two fat memory buffers with VirtualAlloc. You are loading two files into this buffer, and run the routine without checking the nullbytes. The following scenarios are possible:
- you know the string length, and stop comparing when the end is reached
- you do not know the string length, and stop comparing when the strings differ
- you do not know the string length, and the files are equal
In the latter case, you stop comparing with an access violation at the end of the buffer.
When I read files, I have some chances to know the number of bytes I'm reading into
memory, so I can compare just that number of bytes or create by myself the sequence
of bytes that tell me I'm on the EOF.
As Clive pointed out, I'm going to do a memcmp() in that case, and it could be one
possible scenario for the use of this kind of comparing.
If they are strings, the NULL check is mandatory, being it the de facto EOF.
By the way, I assume you are getting those results with the methods tested
just because this rule is not explicit, so there are big differences due to this
simple fact, not because the algos are that much smarter. :P
Here is my time for my dekstop (development) system:
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
String comparison: short string 10 bytes, long string 5050
2863 cycles for SSE with null check, long string
6233 cycles for SSE with null check, long string, case-insensitive
3948 cycles for Lingo, long string with null check
60190 cycles for Frank, long string
56336 cycles for Rockoon, long string
63878 cycles for RockoonJ, long string
55673 cycles for RockoonJ2, long string
54345 cycles for RockoonJ3, long string
5029 cycles for RockoonCNB, long string, check nullbyte
12533 cycles for crt_strcmp, long string
20673 cycles for crt__stricmp, long string, case-insensitive
101896 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
55 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
-6 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
591 cycles for lstrcmp, 10 bytes
2864 cycles for SSE with null check, long string
3495 cycles for SSE with null check, long string, case-insensitive
6455 cycles for Lingo, long string with null check
56777 cycles for Frank, long string
60430 cycles for Rockoon, long string
65005 cycles for RockoonJ, long string
54752 cycles for RockoonJ2, long string
60138 cycles for RockoonJ3, long string
5272 cycles for RockoonCNB, long string, check nullbyte
10029 cycles for crt_strcmp, long string
20091 cycles for crt__stricmp, long string, case-insensitive
95502 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
46 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
7 cycles for Rockoon, 10 bytes
8 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
504 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
And here is my timing for my laptop (Internet):
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
String comparison: short string 10 bytes, long string 5050
5463 cycles for SSE with null check, long string
8446 cycles for SSE with null check, long string, case-insensitive
9214 cycles for Lingo, long string with null check
77152 cycles for Frank, long string
67687 cycles for Rockoon, long string
66298 cycles for RockoonJ, long string
69855 cycles for RockoonJ2, long string
66412 cycles for RockoonJ3, long string
9748 cycles for RockoonCNB, long string, check nullbyte
14225 cycles for crt_strcmp, long string
33251 cycles for crt__stricmp, long string, case-insensitive
135879 cycles for lstrcmp, long string
16 cycles for SSE with null check, 10 bytes
23 cycles for SSE with null check, 10 bytes, case-insensitive
37 cycles for Lingo, 10 bytes with null check
43 cycles for Frank, 10 bytes
6 cycles for Rockoon, 10 bytes
16 cycles for RockoonJ, 10 bytes
5 cycles for RockoonJ2, 10 bytes
16 cycles for RockoonJ3, 10 bytes
11 cycles for RockoonCNB, 10 bytes, check nullbyte
695 cycles for lstrcmp, 10 bytes
5105 cycles for SSE with null check, long string
9337 cycles for SSE with null check, long string, case-insensitive
8719 cycles for Lingo, long string with null check
80502 cycles for Frank, long string
67870 cycles for Rockoon, long string
65171 cycles for RockoonJ, long string
66735 cycles for RockoonJ2, long string
66174 cycles for RockoonJ3, long string
9802 cycles for RockoonCNB, long string, check nullbyte
14903 cycles for crt_strcmp, long string
34773 cycles for crt__stricmp, long string, case-insensitive
132146 cycles for lstrcmp, long string
22 cycles for SSE with null check, 10 bytes
32 cycles for SSE with null check, 10 bytes, case-insensitive
18 cycles for Lingo, 10 bytes with null check
16 cycles for Frank, 10 bytes
20 cycles for Rockoon, 10 bytes
13 cycles for RockoonJ, 10 bytes
10 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
18 cycles for RockoonCNB, 10 bytes, check nullbyte
648 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
Here it is a new approach to string comparing.
First difference: a parallel forward and backward scan of the strings.
This should be faster because if there is a difference toward the end
or near the beginning of the string it'll be detected without scanning
the whole.
Second difference: it checks all the spare bytes the previous version
did not.
I don't know if anyone has already used this approach, but I think it could
be useful in various occasions.
The routine doesn't return the position of the different bytes, nor anything else.
It is just a quick code, without any optimization in mind.
This is just to show the method.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1172 cycles for Parallel scan, 2600 bytes
--- ok ---
If anyone thinks it is worth the effort, optimize it, customize it, and enjoy.
Frank
to do it that way, seems like you have to know the string lengths, first :P
Quote from: dedndave on June 20, 2010, 02:49:54 PM
to do it that way, seems like you have to know the string lengths, first :P
Compliments :clap: :clap:
This is one the three assumptions the program does. :P
Guess the next, please. :lol
I have added some timing for my version of an SSE compare as KRBxxxx (4 tests)
(KRB for KeepingRealBusy). I use this procedure in a Quicksort Indirect for a
buffer (2 GB) full of variable length strings. These have been read from a 6+ GB
file and converted (2 GB block by block) to variable length structures
(WORD_WITH_BYTE_COUNT,BYTE_ARRAY) in-place in the input buffer. Then a separate
buffer is created pointing to these structures, and the pointers are
Quicksorted. Next, the pointers are written to an intermediate file followed by
the records in-order (data only, not the string length - the pointers are
adjusted to point to the moved data in the intermediate buffer, thus the string
length is defined by the difference between any pointer and it's following
pointer). I did reserve the last 16 bytes of the buffer to insure that an XMM
access of the last record in the buffer would not encounter an error. My sorting
requirements did not need the mismatch location (as has been supplied in some of
these other tests) but did require the type of the mismatch (below/equal/above,
and not just equal/not_equal). Thus - I made 4 tests with different optional
return choices: returning both values, one value, the other value, and neither.
My choice would be KRBNP (return compare result but no position return) for use
with the sort. I implemented some test code to verify the results of my first
long string test and my first short string test (returning both results), and
then removed the verification code with "if 0" "endif".
Of course, since my data needed a particular format (a leading size WORD), I
created that format with 4 new data structures (2 for long and 2 for short).
I had originally created fixed length copies of the input strings for direct
sorting, but this took too much time with data handling. The average string
length was 48 bytes but the longest strings were 160 bytes. This caused the
fixed length file to balloon to 20+ GB. The new method allows blocks of sorted
records to be created with only 2 extra bytes per string so that the merging of
the blocks is much faster (only 1/3 of the blocks). The two different formats
for the string lengths (a WORD preceeding the string during sort, and the
difference between two pointers during merging) require two different entries to
the compare, but the rest of the compare logic remains the same following the
entry.
I was a bit disappointed with some of the timing code. No attempt was made to
actually return a true string comparison indicator (a negative value, a zero
value, or a positive value in eax) (such as would be returned from lstrcmp)
except for the last three methods (crt_strcmp, crt__stricmp, and lstrcmp) and
Lingo. In fact, the returned flags for all methods other than those last three
returned the flags from subtracting the starting value from the mismatch point.
For my tests which return comparison results, the results are returned in the
flags. For my tests which return mismatch position, the result is returned in
eax without affecting the comparison flags.
Frank made no check to detect the end of a string and ran off the end of the
first string into the null and then the 50000 byte zero pad and then the string
of "xxx" and then the second string, but at the same time ran off the end of the
second string into the null and then the 50000 byte zero pad and then the string
of "xxx" and then the short strings which finally stopped the compare with a
mismatch (the second string data as part of the first string did not match the
short string data as part of the second string). Essentially, he compared 55036
bytes (5000+1+50000+34+1) before the mismatch. If he had made a null check, then
his counts would have been in the 5000 range instead of the 55000 range. Rockoon
(below) had the same problem until the last attempt where he added the null
check and his times show the huge difference. Neither Frank's method nor
Rockoon's method with DWORD compares will substitute for a lstrcmp even if it
returned the comparison result. What needs to be done to return a true lstrcmp
result (but using DWORD compares) is something like the following (but it
requires that the string be padded with zeros to a mod 4 bound and not just a
single null with trailing garbage in the last DWORD and this is not the normal
case unless the strings are edited):
@@:
mov eax[esi]
cmp eax[edi]
jnz @f
;
; null check or length check here
;
.
.
.
lea esi,[esi+4]
lea edi,[edi+4]
jmp @b
@@:
mov ebx,[edi]
bswap eax
bswap ebx
cmp eax,ebx
Rockoon depended on the two strings being adjacent such that the string length
of the first was the difference between the two pointers (and the second string
length was not checked) or maybe I am missing something. He was comparing the
strings, the null, the 50000 byte zero pad and then the string of "xxx". His
final test did make the null check and the counts show it:
56620 cycles for RockoonJ3, long string
5028 cycles for RockoonCNB, long string, check nullbyte
I have not completely analyzed the StrCmpSSE method, but at least it returns both
a mismatch position and the mismatched byte difference (it swaps the regs,
trying to get one string that is aligned, and I did not thoroughly analyze if
this difference was corrected for such swaps - probably the simplest test would
be to create test strings that differ in alignment and where the first string is
always less than the second string in the lstrcmp sense and verify that the
returned comparison result is always correct whether or not the strings might
have been swapped for alignment).
I have not analyzed Lingo's method completely, but what can I say. His code is
good, and fast, and good and fast. He even returns the result of the comparison
(I think it does, I did not spend too much time verifying all of the possible
conditions).
What can I say about crt_strcmp, crt__stricmp, and lstrcmp. The counts tell the
story of what you can expect with a general purpose solution:
9670 cycles for crt_strcmp, long string
20589 cycles for crt__stricmp, long string, case-insensitive
95656 cycles for lstrcmp, long string
I was dismayed to see that the repe cmpsb test was conditionally not assembled.
I wanted to see how bad it could really get since they didn't include it, so I
re-enabled the tests. Actually, I created 4 tests with 4 sets of data. The first
set of two tests used unaligned data, the second set of two tests used aligned
data. The first test of each set was a repe cmpsb, the second test of each set
was a repe cmpsd (by DWORDS until mismatch), then backup the indices and retry
the last DWORD with a cmpsb to get an accurate compare result. Amazingly, the
counts are NOT ALL THAT BAD, especially for aligned data!
In all of tests (except mine) which make null byte checks, the null byte check
is done for each compare. Checking for the end of the string for each compare
seems to be a case of time wasted, at least for usage with a sort. This is
necessary for a generic strcmp, but you do not want to use a generic process for
a specific task that is repeated a billion times or more. What I have done is to
do this once (to begin with when I read the data and convert it to structures)
and then use the calculated lengths for the compares. It is hard to quantify how
much extra time my tests would take (considering that I really do a sort which
compares the strings more than once, more like a billion times, and the length
calculation is only done once and saved with the strings). A 2 GB buffer of
average length strings (48 byte) gives about 44,739,242 records per buffer. This
can be converted in 6 seconds per buffer, so that is about 7,456,540 records per
second, or .134 microseconds per record, or about 134 nanoseconds per record.
Does anyone have a clue of how to convert that to cycles, and then how to
quantify the sort condition where the strings get compared 44,739,242 *
lg(44,739,242) or 44,739,242 * 26 = 1,163,220,292 times. I think you need to
divide 134 nano by 1,163,220,292 * 2 and I do not think in numbers that small. I
know that if I did the strlen for each string during the compare, it would add
134 nano * 1,163,220,292 compares * 2 (both strings) or 311,743,038,256 nano or
312 seconds to the sort time - that is 5 extra minutes, and that is per2 GB buffer,
and there are 3+ buffers to sort and merge.
Case insensitive compares are also time consuming, especially for sorts where
the same string is accessed many times. I find it highly beneficial to do a
lowercase conversion once (table lookup if ASCII - only 3 instructions per
character) and preface the string with the lowercase data, sort the
lowercase/uppercase data, then strip off the lowercase data when writing the
result strings.
Without any further ado, here are my timings.
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ (SSE3)
String comparison: short string 10 bytes, long string 5050
3835 cycles for SSE with null check, long string
2738 cycles for SSE with null check, long string, case-insensitive
4062 cycles for Lingo, long string with null check
59170 cycles for Frank, long string
3189 cycles for KRB, long string
3171 cycles for KRBNF, long string no flags returned
2855 cycles for KRBNR, long string no return position
2542 cycles for KRBNFNR, long string no flags returned no return position
11172 cycles for repe cmpsb, unaligned long string
3849 cycles for repe cmpsd, unaligned long string
12140 cycles for repe cmpsb, aligned long string
1868 cycles for repe cmpsd, aligned long string
60221 cycles for Rockoon, long string
61258 cycles for RockoonJ, long string
55305 cycles for RockoonJ2, long string
57307 cycles for RockoonJ3, long string
5294 cycles for RockoonCNB, long string, check nullbyte
9670 cycles for crt_strcmp, long string
20589 cycles for crt__stricmp, long string, case-insensitive
95656 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
46 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
56 cycles for KRB, 10 bytes
42 cycles for KRBNF, 10 bytes no flags returned
38 cycles for KRBNR, 10 bytes no return position
38 cycles for KRBNFNR, 10 bytes no flags returned no return position
36 cycles for repe cmpsb, unaligned 10 bytes
50 cycles for repe cmpsd, unaligned 10 bytes
36 cycles for repe cmpsb, aligned 10 bytes
46 cycles for repe cmpsd, aligned 10 bytes
7 cycles for Rockoon, aligned 10 bytes
9 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
13 cycles for RockoonCNB, 10 bytes, check nullbyte
504 cycles for lstrcmp, 10 bytes
2866 cycles for SSE with null check, long string
3497 cycles for SSE with null check, long string, case-insensitive
3938 cycles for Lingo, long string with null check
58561 cycles for Frank, long string
3362 cycles for KRB, long string
3340 cycles for KRBNF, long string no flags returned
2873 cycles for KRBNR, long string no return position
2694 cycles for KRBNFNR, long string no flags returned no return position
12600 cycles for repe cmpsb, unaligned long string
3835 cycles for repe cmpsd, unaligned long string
11129 cycles for repe cmpsb, aligned long string
2825 cycles for repe cmpsd, aligned long string
57165 cycles for Rockoon, long string
62959 cycles for RockoonJ, long string
56479 cycles for RockoonJ2, long string
55263 cycles for RockoonJ3, long string
4118 cycles for RockoonCNB, long string, check nullbyte
11211 cycles for crt_strcmp, long string
21257 cycles for crt__stricmp, long string, case-insensitive
91616 cycles for lstrcmp, long string
29 cycles for SSE with null check, 10 bytes
29 cycles for SSE with null check, 10 bytes, case-insensitive
13 cycles for Lingo, 10 bytes with null check
6 cycles for Frank, 10 bytes
72 cycles for KRB, 10 bytes
42 cycles for KRBNF, 10 bytes no flags returned
38 cycles for KRBNR, 10 bytes no return position
47 cycles for KRBNFNR, 10 bytes no flags returned no return position
36 cycles for repe cmpsb, unaligned 10 bytes
50 cycles for repe cmpsd, unaligned 10 bytes
36 cycles for repe cmpsb, aligned 10 bytes
46 cycles for repe cmpsd, aligned 10 bytes
7 cycles for Rockoon, aligned 10 bytes
9 cycles for RockoonJ, 10 bytes
8 cycles for RockoonJ2, 10 bytes
10 cycles for RockoonJ3, 10 bytes
26 cycles for RockoonCNB, 10 bytes, check nullbyte
541 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
KRB: 115
KRBNF: 99
KRBNR: 105
KRBNFNR: 91
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
Good job :U
QuoteChecking for the end of the string for each compare seems to be a case of time wasted, at least for usage with a sort.
The leading word is a solution for this particular case, where you compare over and over the same strings. However, in a different setting checking the len first might imply that the cache has to be reloaded, i.e. once for Len() and again for the actual comparison.
QuoteStrCmpSSE method, ... the simplest test would be to create test strings that differ in alignment
The two dummy strings are meant for doing this test.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
String comparison: short string 10 bytes, long string 5050
2930 cycles for SSE with null check, long string
3662 cycles for SSE with null check, long string, case-insensitive
5533 cycles for Lingo, long string with null check
55931 cycles for Frank, long string
14716 cycles for KRB, long string
3192 cycles for KRBNF, long string no flags returned
3210 cycles for KRBNR, long string no return position
3141 cycles for KRBNFNR, long string no flags returned no return position
20961 cycles for repe cmpsb, unaligned long string
6982 cycles for repe cmpsd, unaligned long string
26079 cycles for repe cmpsb, aligned long string
5135 cycles for repe cmpsd, aligned long string
44509 cycles for Rockoon, long string
44180 cycles for RockoonJ, long string
44074 cycles for RockoonJ2, long string
44162 cycles for RockoonJ3, long string
7864 cycles for RockoonCNB, long string, check nullbyte
13922 cycles for crt_strcmp, long string
16095 cycles for crt__stricmp, long string, case-insensitive
89770 cycles for lstrcmp, long string
18 cycles for SSE with null check, 10 bytes
25 cycles for SSE with null check, 10 bytes, case-insensitive
16 cycles for Lingo, 10 bytes with null check
9 cycles for Frank, 10 bytes
53 cycles for KRB, 10 bytes
21 cycles for KRBNF, 10 bytes no flags returned
20 cycles for KRBNR, 10 bytes no return position
20 cycles for KRBNFNR, 10 bytes no flags returned no return position
88 cycles for repe cmpsb, unaligned 10 bytes
125 cycles for repe cmpsd, unaligned 10 bytes
88 cycles for repe cmpsb, aligned 10 bytes
121 cycles for repe cmpsd, aligned 10 bytes
8 cycles for Rockoon, aligned 10 bytes
7 cycles for RockoonJ, 10 bytes
7 cycles for RockoonJ2, 10 bytes
8 cycles for RockoonJ3, 10 bytes
16 cycles for RockoonCNB, 10 bytes, check nullbyte
453 cycles for lstrcmp, 10 bytes
KeepingRealBusy i can't compile your example I get
strcompkrb.asm(757) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(758) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(759) : error A2085: instruction or register not accepted in curre
nt CPU mode
etc...
strcompkrb.asm(1143) : error A2006: undefined symbol : aE2
strcompkrb.asm(1144) : error A2006: undefined symbol : uE2
strcompkrb.asm(1148) : error A2006: undefined symbol : ciE2
strcompkrb.asm(1172) : error A2006: undefined symbol : aL1
strcompkrb.asm(1203) : error A2006: undefined symbol : aL1
but according http://www.xbitlabs.com/images/cpu/athlon64-3000/cpuz.png my cpu supports mmx :\
I hope I'll have time enough to modify the routine I proposed for a different context,
and let it suite the context you are using, so in a couple of days the new string compare
routine should be out. :P
If I have understood the rules of the game, the routine has to be built with the assumptions
that:
1 - the strings are NULL terminated
2 - we don't know the length of the strings
And the next point that I would like to understand is:
3 - the strings are of the same length or it doesn't matter?
If the latter is true what do we check?
Let me know so I can full enjoy the game for the Beginners Section :P
you may have to enable the MXX instructions set, Cube
include \masm32\include\masm32rt.inc
.686
.MXX
.XMM
that should turn everything on :P
masm32rt.inc sets the processor to .486 i think
if you want to change it, do so after the include
i think a minimum of .586 has to be on before .MXX or .XMM, too
AMD Phenom(tm) II X6 1055T Processor (SSE3)
String comparison: short string 10 bytes, long string 5050
2110 cycles for SSE with null check, long string
2400 cycles for SSE with null check, long string, case-insensitive
3852 cycles for Lingo, long string with null check
34198 cycles for Frank, long string
2028 cycles for KRB, long string
1742 cycles for KRBNF, long string no flags returned
1435 cycles for KRBNR, long string no return position
1274 cycles for KRBNFNR, long string no flags returned no return position
11137 cycles for repe cmpsb, unaligned long string
3017 cycles for repe cmpsd, unaligned long string
11136 cycles for repe cmpsb, aligned long string
2824 cycles for repe cmpsd, aligned long string
33713 cycles for Rockoon, long string
32009 cycles for RockoonJ, long string
33575 cycles for RockoonJ2, long string
33840 cycles for RockoonJ3, long string
5196 cycles for RockoonCNB, long string, check nullbyte
10196 cycles for crt_strcmp, long string
20172 cycles for crt__stricmp, long string, case-insensitive
89104 cycles for lstrcmp, long string
24 cycles for SSE with null check, 10 bytes
34 cycles for SSE with null check, 10 bytes, case-insensitive
17 cycles for Lingo, 10 bytes with null check
1 cycles for Frank, 10 bytes
37 cycles for KRB, 10 bytes
27 cycles for KRBNF, 10 bytes no flags returned
26 cycles for KRBNR, 10 bytes no return position
40 cycles for KRBNFNR, 10 bytes no flags returned no return position
34 cycles for repe cmpsb, unaligned 10 bytes
43 cycles for repe cmpsd, unaligned 10 bytes
34 cycles for repe cmpsb, aligned 10 bytes
43 cycles for repe cmpsd, aligned 10 bytes
8 cycles for Rockoon, aligned 10 bytes
8 cycles for RockoonJ, 10 bytes
10 cycles for RockoonJ2, 10 bytes
21 cycles for RockoonJ3, 10 bytes
14 cycles for RockoonCNB, 10 bytes, check nullbyte
786 cycles for lstrcmp, 10 bytes
2092 cycles for SSE with null check, long string
2340 cycles for SSE with null check, long string, case-insensitive
3894 cycles for Lingo, long string with null check
34232 cycles for Frank, long string
2018 cycles for KRB, long string
1864 cycles for KRBNF, long string no flags returned
1463 cycles for KRBNR, long string no return position
1290 cycles for KRBNFNR, long string no flags returned no return position
11240 cycles for repe cmpsb, unaligned long string
3068 cycles for repe cmpsd, unaligned long string
11272 cycles for repe cmpsb, aligned long string
2868 cycles for repe cmpsd, aligned long string
33779 cycles for Rockoon, long string
31979 cycles for RockoonJ, long string
33669 cycles for RockoonJ2, long string
33805 cycles for RockoonJ3, long string
5194 cycles for RockoonCNB, long string, check nullbyte
10047 cycles for crt_strcmp, long string
20273 cycles for crt__stricmp, long string, case-insensitive
79376 cycles for lstrcmp, long string
24 cycles for SSE with null check, 10 bytes
34 cycles for SSE with null check, 10 bytes, case-insensitive
17 cycles for Lingo, 10 bytes with null check
8 cycles for Frank, 10 bytes
37 cycles for KRB, 10 bytes
42 cycles for KRBNF, 10 bytes no flags returned
26 cycles for KRBNR, 10 bytes no return position
26 cycles for KRBNFNR, 10 bytes no flags returned no return position
34 cycles for repe cmpsb, unaligned 10 bytes
43 cycles for repe cmpsd, unaligned 10 bytes
34 cycles for repe cmpsb, aligned 10 bytes
43 cycles for repe cmpsd, aligned 10 bytes
8 cycles for Rockoon, aligned 10 bytes
8 cycles for RockoonJ, 10 bytes
10 cycles for RockoonJ2, 10 bytes
7 cycles for RockoonJ3, 10 bytes
14 cycles for RockoonCNB, 10 bytes, check nullbyte
793 cycles for lstrcmp, 10 bytes
Codesizes:
Lingo: 365
Frank: 45
KRB: 115
KRBNF: 99
KRBNR: 105
KRBNFNR: 91
Rockoon: 40
RockoonJ: 42
RockoonJ2: 33
RockoonJ3: 33
RockoonCNB: 84
SSE: 277 (MasmBasic)
--- ok ---
It looks like on AMD, the KRB* solutions are best so far. Remember tho that TANSTATFC
thanks I had that, and .MXX isn't necessary, turns out I was using too old version of masm. Also while you guys did some really interesting stuff with this, I wish more case insensitive versions were made, as a regular strcmp is kinda useless imo. Lingo what is that db stuff you put infront of your functions?
;------------------------------------------------------------------------------------------
; Comparing 2 strings - 5000 bytes long, using a parallel approach.
; One scan goes from the beginning of the string toward the end.
; A second scan goes from the end of the string backward to the start.
;------------------------------------------------------------------------------------------
; Assumptions:
; 1. Strings are of the same length and NULL terminated
; 2. We don't know the lenght of the strings
; 3. We need to know if the strings are equal or different
;------------------------------------------------------------------------------------------
This modification of the routine is a first approach to the context problem.
Not very fast, but it is a beginning for the Beginners Section :lol
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
7570 cycles for Parallel scan, 5000 bytes
--- ok ---
Please use this routine inside the testbed instead of the previous one.
If I find better solutions, I'll be working on this version.
Frank
Quote from: E^cube on June 21, 2010, 10:29:36 PM
KeepingRealBusy i can't compile your example I get
strcompkrb.asm(757) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(758) : error A2085: instruction or register not accepted in curre
nt CPU mode
strcompkrb.asm(759) : error A2085: instruction or register not accepted in curre
nt CPU mode
etc...
strcompkrb.asm(1143) : error A2006: undefined symbol : aE2
strcompkrb.asm(1144) : error A2006: undefined symbol : uE2
strcompkrb.asm(1148) : error A2006: undefined symbol : ciE2
strcompkrb.asm(1172) : error A2006: undefined symbol : aL1
strcompkrb.asm(1203) : error A2006: undefined symbol : aL1
but according http://www.xbitlabs.com/images/cpu/athlon64-3000/cpuz.png my cpu supports mmx :\
E^cube,
Sorry I haven't responded sooner. I see that you already found the problem from a later post. I was using, I think, the ML from Visual Studio 2008.
Dave.
Quote from: KeepingRealBusy on June 22, 2010, 03:14:25 AM
I was using, I think, the ML from Visual Studio 2008.
ML 6.14 will fail, but 6.15 and higher are ok. Same for JWasm.