StrLen timings needed

Started by jj2007, August 15, 2010, 09:32:10 PM

jj2007

Quote from: Antariy on August 19, 2010, 10:51:20 PM
Quote from: jj2007 on August 18, 2010, 11:37:45 PM
Tried this?
Quote   mov edx,[esp+4]
;   mov eax,[esp+4]   ; Jochen, if you remove this line again :), then the
   add esp, -10h   ; algo will almost always treat any string
   movups [esp],xmm7   ; as an unaligned string, because I do the
   test dl, 15
;   and eax, 0fh   ; alignment check in THIS line!      ;-)
   jz @F


Jochen, on my CPU, if I use edx for the check, the proc is slower by 2 clocks. Using a partial register gains nothing (I know about this, and the reasons are very hardware-dependent; on modern CPUs it is very slow).

Alex


Hi Alex,
The difference is very small on my trusty old P4 and nonexistent on my Celeron. Here is one more for testing... I am tempted to use MbStrLen4aP4 for the MasmBasic library, because it's short, reasonably fast, and safe for strings that end precisely at the end of the legal area (you remember that VirtualAlloc can be a bit rude about attempts to use movups xmm7, [edx] when edx is near the next page) ;-)

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
333     cycles for szLen
78      cycles for MbStrLen4a
72      cycles for MbStrLen4aP4
62      cycles for AxStrLenSSE1
63      cycles for AxStrLenSSE1j


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
35      cycles for AxStrLenSSE1
35      cycles for AxStrLenSSE1j
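
The page-boundary concern in this post comes down to address arithmetic, which can be sketched in C. This is only an illustration: the 4096-byte page size and the function names are assumptions and do not come from the test-bed. It shows that a 16-byte read starting at an arbitrary address can reach into the next page, while a read starting at a 16-byte boundary cannot.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size (typical for x86 Windows); not taken from the thread. */
#define ASSUMED_PAGE_SIZE 4096u

/* Returns 1 if reading len bytes starting at addr touches the next page. */
int read_crosses_page(uintptr_t addr, size_t len)
{
    uintptr_t offset_in_page = addr & (ASSUMED_PAGE_SIZE - 1u);
    return offset_in_page + len > ASSUMED_PAGE_SIZE;
}

/* Rounds an address down to the previous 16-byte boundary. */
uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}
```

For example, a 16-byte read at page offset 4090 crosses into the next page, while the same read started at align_down_16 of that address (offset 4080) ends exactly at the page end.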

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
104     cycles for szLen
42      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
19      cycles for AxStrLenSSE1
23      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
27      cycles for AxStrLenSSE1
28      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
42      cycles for MbStrLen4aP4
42      cycles for MbStrLen4aP42
20      cycles for AxStrLenSSE1
24      cycles for AxStrLenSSE1j

104     cycles for szLen
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
24      cycles for AxStrLenSSE1
26      cycles for AxStrLenSSE1j


--- ok ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: hutch-- on August 20, 2010, 09:31:58 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
42      cycles for MbStrLen4a
48      cycles for MbStrLen4aP4
48      cycles for MbStrLen4aP42
19      cycles for AxStrLenSSE1
23      cycles for AxStrLenSSE1j


Thanks. Interesting that the 4a is faster; on my P4 it's definitely slower. I have changed the 72Test posted above and added an option CrashIt, which will test a string near the VirtualAlloc page boundary.
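
The thread does not show how CrashIt is implemented, so the following is only a guess at the idea, written in C: place the test string so that its terminating zero is the very last byte of a region. The helper name place_at_end is hypothetical. In a real test the region would presumably come from VirtualAlloc with the following page left uncommitted, so that any read past the terminator faults.

```c
#include <stddef.h>
#include <string.h>

/* Copies s, including its terminating zero, to the tail of region.
   Returns the start of the copy, or NULL if s does not fit. */
char *place_at_end(char *region, size_t region_size, const char *s)
{
    size_t n = strlen(s) + 1;              /* bytes needed, zero included */
    if (n > region_size)
        return NULL;
    char *dst = region + (region_size - n);
    memcpy(dst, s, n);
    return dst;
}
```

A strlen routine that reads a full 16 bytes starting at the returned pointer would run past the end of the region; that is the situation such a test is meant to provoke.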

jcfuller

Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
100     bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
129     cycles for szLen
72      cycles for MbStrLen4a
94      cycles for MbStrLen4aP4
91      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
40      cycles for AxStrLenSSE1j

157     cycles for szLen
72      cycles for MbStrLen4a
93      cycles for MbStrLen4aP4
93      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j

157     cycles for szLen
72      cycles for MbStrLen4a
71      cycles for MbStrLen4aP4
70      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
40      cycles for AxStrLenSSE1j

158     cycles for szLen
72      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
94      cycles for MbStrLen4aP42
45      cycles for AxStrLenSSE1
42      cycles for AxStrLenSSE1j


--- ok ---

jj2007

Thanxalot. The current version of MasmBasic Len() is MbStrLen4a, and it will stay that way. At 80 bytes it is short, reasonably fast and SSE-safe near page boundaries.
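
One common way to get this kind of safety near page boundaries is to read only blocks that start on a 16-byte boundary. The C sketch below shows the structure; it is a portable stand-in, not the MbStrLen4a source (which is not shown in this thread), and the function name is hypothetical. The inner loop takes the place of what an SSE version would do with one aligned 16-byte load and a compare.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; illustrates the block structure only. */
size_t strlen_aligned_blocks(const char *s)
{
    const char *p = s;

    /* Head: go byte by byte until p sits on a 16-byte boundary. */
    while (((uintptr_t)p & 15u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Body: each block starts on a 16-byte boundary, so it lies within
       one page. An SSE version would load the block in one instruction. */
    for (;;) {
        for (int i = 0; i < 16; i++) {
            if (p[i] == '\0')
                return (size_t)(p - s) + (size_t)i;
        }
        p += 16;
    }
}
```

Because this C version stops at the first zero byte, it never reads past the terminator; an SSE version reads the whole aligned block, which stays safe only because the block cannot straddle a page.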

hutch--

JJ,

On both the Core2 and i7 boxes, SSE is a lot faster than on my P4 boxes relative to normal integer instruction code, so if you are targeting SSE-capable processors I would go for the procedures that are faster on the Core2 and i3/i5/i7 architectures. Rockoon has been posting results from a recent 6-core AMD, so it will also be worthwhile to see how they perform on a late AMD box.

jj2007

Hutch,
Yes, some more tests would be good, but MbStrLen4a seems quite OK overall. The SSE1 versions are not "boundary safe"; the others seem a tick slower.
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
58      cycles for MbStrLen4a
62      cycles for MbStrLen4aP4
62      cycles for MbStrLen4aP42

56      cycles for MbStrLen4a
71      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42

The increase from 62 to 71 may have to do with the testbed setting: For each timing loop, there is a push for changing the stack alignment. So sometimes the movdqu [esp], xmm0 is 16-byte aligned and thus faster (on some CPUs the movdqu becomes as fast as movdqa if it hits an aligned address).
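
The alignment effect described here can be checked with a little arithmetic, sketched in C. The function names and the sample stack address are made up for illustration; the 4-byte push size assumes 32-bit code, as used throughout this thread.

```c
#include <stdint.h>

/* Returns 1 if addr is a multiple of 16 (the low four bits are zero),
   which is the condition that `test dl, 15` followed by `jz` checks. */
int is_aligned_16(uintptr_t addr)
{
    return (addr & 15u) == 0;
}

/* Stack pointer after a number of 4-byte pushes; the x86 stack grows down. */
uintptr_t esp_after_pushes(uintptr_t esp, unsigned pushes)
{
    return esp - (uintptr_t)4 * pushes;
}
```

Starting from a 16-byte aligned esp, one, two or three pushes leave it misaligned and the fourth restores alignment, so a movdqu [esp], xmm0 hits an aligned address in only some of the timing loops.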

Antariy

Jochen, is your proc safe at the end of a VirtualAlloc buffer? (I haven't seen the sources yet.)



Alex

Antariy

Jochen, your procs are not safe either; they crash with short strings.

You are forcing me to make a version with SEH :), really.



Alex

Antariy

Hutch, what do you need in an SSE1 StrLen algo? Is it something you need?



Alex

Antariy

Jochen, I have seen the sources; your proc does not crash :)



Alex

Antariy

... But even if I add SEH, it does not make the proc much slower :) Only in the end-of-buffer case.



Alex

lingo

Hutch,
I'm wondering why you tested and tolerated such idiotic, slow algos.
Idiotic because they preserved registers without any need.
Only Microsoft can require which registers we must preserve :(

jj2007

Quote from: Antariy on August 20, 2010, 11:07:26 PM
... But even if I add SEH, it does not make the proc much slower :) Only in the end-of-buffer case.

Maybe... but code size will increase again. 80 bytes is enough for a strlen algo in a Basic library. And remember we are talking about 40-80 cycles, depending on the CPU. 80 cycles at 1.6 GHz means 20 Million calls to strlen per second - that is rarely a bottleneck. Those who need more than that can use the fastest and unsafest MMX algos trashing the FPU, but for a general purpose library a compromise should be sought.
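
The throughput figure above is simply clock rate divided by cycles per call. A minimal C sketch of that arithmetic (the function name is made up for illustration):

```c
#include <assert.h>

/* Calls per second when each call costs cycles_per_call cycles
   on a CPU running at clock_hz cycles per second. */
double calls_per_second(double clock_hz, double cycles_per_call)
{
    assert(cycles_per_call > 0.0);
    return clock_hz / cycles_per_call;
}
```

With 1.6e9 Hz and 80 cycles this gives 20 million calls per second; at 40 cycles the figure doubles to 40 million.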

Antariy

Hi!

A big request to all: please test this. This is a fixed version of my SSE1 StrLen proc, which does not crash at the end of a buffer in the normal case of a zero-terminated string.
And this proc is still fast with unaligned strings; it has the same speed as my previous proc, or is slightly slower.

For the first test-bed (Jochen's old procs):

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
58       bytes for MbStrLen1
84       bytes for MbStrLen2
73       bytes for MbStrLen3
80       bytes for MbStrLen4a
71       bytes for MbStrLen4b
78       bytes for MbStrLen5
34       bytes for AxStrLenMMX
93       bytes for AxStrLenSSE1
113      bytes for StrLen

73      cycles for MbStrLen1
90      cycles for MbStrLen2
91      cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

110     cycles for AxStrLenMMX

63      cycles for AxStrLenSSE1

164     cycles for StrLen

100     cycles for MbStrLen1
113     cycles for MbStrLen2
104     cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
100     cycles for MbStrLen5

108     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

100     cycles for MbStrLen1
90      cycles for MbStrLen2
90      cycles for MbStrLen3
80      cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

114     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

81      cycles for MbStrLen1
130     cycles for MbStrLen2
104     cycles for MbStrLen3
90      cycles for MbStrLen4a
71      cycles for MbStrLen4b
80      cycles for MbStrLen5

110     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen

92      cycles for MbStrLen1
89      cycles for MbStrLen2
90      cycles for MbStrLen3
104     cycles for MbStrLen4a
71      cycles for MbStrLen4b
95      cycles for MbStrLen5

109     cycles for AxStrLenMMX

62      cycles for AxStrLenSSE1

164     cycles for StrLen



For Jochen's new test-bed with the CrashIt macro set to "1":


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
27      cycles for szLen
33      cycles for MbStrLen4a
62      cycles for MbStrLen4aP4
37      cycles for MbStrLen4aP42
24      cycles for AxStrLenSSE1



After my proc, Jochen's tweak runs, and it crashes. My proc works properly now (see: its test is successful, 24 clocks). I.e. my proc does not crash at the end of the buffer.

For the new test-bed, without the CrashIt macro:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
80      bytes for MbStrLen4a
72      bytes for MbStrLen4aP4
72      bytes for MbStrLen4aP42
93      bytes for AxStrLenSSE1
80      bytes for AxStrLenSSE1j
------- timings -------
264     cycles for szLen
88      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
72      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j

270     cycles for szLen
105     cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
93      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
69      cycles for AxStrLenSSE1j

270     cycles for szLen
105     cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j

264     cycles for szLen
91      cycles for MbStrLen4a
73      cycles for MbStrLen4aP4
71      cycles for MbStrLen4aP42
62      cycles for AxStrLenSSE1
68      cycles for AxStrLenSSE1j



Jochen, read the comments at the start of the AxStrLenSSE1 proc. I hope you understand that I am right.

AxStrLenSSE1j (Jochen's remake) still crashes; I have not worked on it.
And the integer version also crashes on strings at the end of the buffer.


Hutch, SSE runs nearly 3 times faster on Core+ than on the PIV.
Hutch, please test my new version.


I preserve XMM7 and ECX only for a fair comparison with Jochen's procs. What he does with HIS MasmBasic is his right.



Alex