StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM


Antariy

Quote from: jj2007 on August 21, 2010, 07:07:28 AM
Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it does not make the proc much slower :) Only in the end-of-buffer case.

Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.

Try the new version. I did it the normal way, NOT covered by SEH, purely out of respect for you. And I preserve ECX and XMM7 purely out of respect for you.
That makes it 1 clock slower, hardly a lot :)
But I disagree with you adding the alignment padding to "codesize". It is NOT code: it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.

I am not imposing my proc on you, really. It is 16% bigger and at least 30% faster on new CPUs, so I think that is satisfactory.



Alex

Antariy

Hutch, the page changed again, so I repeat my request.

Please test this: "http://www.masm32.com/board/index.php?action=dlattach;topic=14626.0;id=8001".
This is my new safe version of the SSE1 proc.



Alex

Antariy

Nobody except God can dictate to me what I must do and which regs I must preserve.
So, if somebody doesn't have a halo and wings, shut up, please.



Alex
P.S. somebody, you have not proved your second assertion: write a faster proc, and then talk.
I know you will write a BLOATED algo again and think you are great; that is your problem alone.
Creating a VERY bloated unrolled algo (using many XMM regs) is NOT hard. What is hard is creating the smallest algo among the fastest ones.

jj2007

Quote from: Antariy on August 21, 2010, 04:16:43 PM
But I disagree with you adding the alignment padding to "codesize". It is NOT code: it is never executed or pre-decoded, so no comment. Jochen, show me some respect and make a fair comparison, please.

Dear Alex,
The assumption here is that you build a library, and each algo starts on a 16-byte aligned boundary. So if somebody (Lingo does that very often) inserts 7 bytes of strange db xyz before the algo start, this will a) add to the size of the executable and b) may waste some bytes of instruction cache when the CPU pulls code in with 16-byte alignment. Therefore "codesize" starts at the 16-byte boundary in all our tests. Call it a convention. Since it's being applied to everybody, I would call it fair :wink
The a16 macro should not count imho because in real life you would not insert 16 int 3's between all the algos of your library. We use it here - for everybody - in the hope that it might eliminate execution cache influences on the timings. Whether that really works, I don't know...

dedndave

i dunno how fair it is, really - lol
but, it appears to be as "real-world" as it can be
it depends on how the code is placed in the test program (luck of the draw)
maybe a more judicious method would be to count actual routine bytes, then add 8 for a 16-alignment
that would represent the "average" byte-count for all possible placements
and - those who do not want to use align 16, can suffer a few clock-cycles penalty to save 8 bytes in the count
that would be fair

Antariy

Jochen, this is a piece of text from Intel's optimization guide:
Quote
Assembly/Compiler Coding Rule 56. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.

This text cannot be copied with any standard app (because Intel locks its PDFs against copying), but since dedndave and I have been talking about encryption and how to research it... :)

Back to the quote. As you can see, Intel recommends placing data where it will not be executed, and the CPU can "detect" this. So CPUs are not so silly that they don't know there is no need to pre-decode such "code".

The code location problem is really a problem of how code and data are interleaved, not of the location or type of the alignment instructions (int3, a long lea esp,[esp], etc. make no difference). This is hard for me to express; I hope you understand me well enough.

Location has some small influence, but it cannot be critical. Algos that use a lot of data in their work are more sensitive to "code placement".



Alex

donkey

Has anyone tried my modest entry in the race? I could not get the AxStrLenMMX or AxStrLenSSE1 examples I downloaded to yield either a consistent or a correct result with a string length of 11 chars ("Hello There"), in either MASM or GoAsm, so my tests were pretty much shot. After all, however fast it yields the wrong answer, it's still wrong. I was using vkim's debug to display the results (both gave me a string length of 7). I have included the RadAsm project; if someone can tell me what is required to get correct results, I would appreciate it.

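; From my rather old strings library
lszLenMMX proc pString:DWORD

mov eax,[pString]   ; eax = pointer to the current 8-byte chunk
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1        ; mm1 = 0, the comparand for the terminator
nop ; fill pxor to 4 bytes

@@: ; this is aligned to 16 bytes
movq mm0,[eax]      ; load 8 bytes (unaligned read)
pcmpeqb mm0,mm1     ; bytes equal to 0 become FFh
add eax,8
pmovmskb ecx,mm0    ; one mask bit per byte
or ecx,ecx
jz @B               ; no zero byte in this chunk, keep scanning

sub eax,[pString]   ; bytes scanned, including the last chunk

bsf ecx,ecx         ; index of the first zero byte in the last chunk
sub eax,8
add eax,ecx         ; eax = string length

emms                ; leave MMX state so the FPU is usable again

RET

lszLenMMX ENDP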


Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

jj2007

Hi Edgar,
Your attachment won't assemble, some includes are missing, and paths rely on environment variables. But anyway, I got the algo to work:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
80      bytes for MbStrLen4a
49      bytes for lszLenMMX
------- timings -------
131     cycles for szLen
49      cycles for MbStrLen4a
90      cycles for lszLenMMX


Faster than the Masm32 library algo, but problematic because a) it trashes the FPU and b) it throws an exception for strings near a VirtualAlloc boundary, since the unaligned 8-byte reads can run past the last committed page.

donkey

Thanks jj2007,

I finally got the Ax... ones to work, hadn't noticed the lack of prologue and epilogue so ESP+4 wasn't pointing to the right place in my tests.

Edgar

lingo

It is not a big deal to beat the stupid losers:  :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
84      bytes for StrLenLingo
------- timings -------
112     cycles for szLen
37      cycles for MbStrLen4a
49      cycles for MbStrLen4aP4
49      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
45      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

111     cycles for szLen
37      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
21      cycles for AxStrLenSSE1
22      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

110     cycles for szLen
63      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
73      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
26      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo

110     cycles for szLen
37      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
43      cycles for MbStrLen4aP42
40      cycles for AxStrLenSSE1
22      cycles for AxStrLenSSE1j
12      cycles for StrLenLingo


--- ok ---


Later:
Corrected a bug in my algo. Pls reload it... sorry...

dedndave

lingo wasn't bad-lookin, when he was little....


Antariy

Hi!

This is the old proc, with some changes. I have not seen any other new procs, so I did not add any to the tests.

Test this, please.



Alex

hutch--

Alex.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
80      code size for MbStrLen4a        80      total bytes for MbStrLen4a
72      code size for MbStrLen4aP4      72      total bytes for MbStrLen4aP4
72      code size for MbStrLen4aP42     72      total bytes for MbStrLen4aP42
104     code size for AxStrLenSSE1      104     total bytes for AxStrLenSSE1
80      code size for AxStrLenSSE1j     80      total bytes for AxStrLenSSE1j
------- timings -------
110     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
54      cycles for MbStrLen4aP42
34      cycles for AxStrLenSSE1
37      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
64      cycles for MbStrLen4aP4
64      cycles for MbStrLen4aP42
33      cycles for AxStrLenSSE1
45      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
54      cycles for MbStrLen4aP4
54      cycles for MbStrLen4aP42
33      cycles for AxStrLenSSE1
34      cycles for AxStrLenSSE1j

110     cycles for szLen
52      cycles for MbStrLen4a
64      cycles for MbStrLen4aP4
64      cycles for MbStrLen4aP42
32      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j

Download site for MASM32: https://masm32.com
New MASM Forum: https://masm32.com/board/index.php

Antariy

What about setting the CrashIt macro to "1" and testing Lingo's proc?  :P



Alex
P.S. Lingo's proc is so bad! It doesn't preserve ECX and XMM regs!  :P

Antariy