News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Previous topic - Next topic

dioxin

Gunther,
you have 8 SSE registers and you're only using 6 of them.
Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.
You'll have to experiment to see which alternate use of the unused registers gives the better results.

I get faster results using 4 accumulators:
addps xmm4,xmm0               'add the products
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3


Paul.

jj2007

I have looked into the purpose of this code and modified the testbed accordingly:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2563    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2680    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2540    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

The result: 2867507200
The result: 2867507200


EDIT: I squeezed out a few cycles...

add ecx, edx ; slightly faster
; int 3 ; OPT_Olly 2
align 16

@@:
REPEAT 16 ; unrolling helps, but make sure that the count is divisible by the rep count!!
movaps xmm0, [eax+edx] ; xmm0 = X[i+3] X[i+2] X[i+1] X[i+0] ; xmm0: 4  3   2 1, 8765, 12 11 10 9, ... (high to low)
mulps xmm0, [edx] ; xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i] ; xmm0: 20 12 6 2, 72 56 42 30, 156 132 110 90,
lea edx, [edx+16] ; update pointers
addps xmm7, xmm0 ; sum up ; xmm7: 20 12 6 2, 92 68 48 32, 248 200 158 122, ...
ENDM
if 1
cmp edx, ecx ; faster, at least on Celeron M
jb @B
else
sub ecx, 16 ; count down
jg @B
endif

dioxin

jj2007,
if you have the CPU registers avaialable, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction and still only have one loop counter so you don't need to update (in your case) both edx and ecx.
!mov ecx,2047                  'loop counter for 2048 elements to get

!movups xmm4,z                 'zero the 4 accumulators (z predifined as 4x zero single)
!movaps xmm5,xmm4
!movaps xmm6,xmm4
!movaps xmm7,xmm4

!mov eax,aptr                 'get pointer to start of first array
!mov edx,bptr                 'get pointer to start of second array

!lea eax,[eax+4*ecx]          'offset pointers by size of array
!lea edx,[edx+4*ecx]
     
!neg ecx                      'negate counter so now pointer + 4*counter = start of array
                                'but also counter now counts up to zero


#ALIGN 16
lp:
!movaps xmm0,[eax+ecx*4]      'get 16 elements of first vector
!movaps xmm1,[eax+ecx*4+16]
!movaps xmm2,[eax+ecx*4+32]
!movaps xmm3,[eax+ecx*4+48]


!mulps xmm0,[edx+ecx*4]        'multiply by 16 elements of second vector
!mulps xmm1,[edx+ecx*4+16]
!mulps xmm2,[edx+ecx*4+32]
!mulps xmm3,[edx+ecx*4+48]


!addps xmm4,xmm0               'add the products
!addps xmm5,xmm1
!addps xmm6,xmm2
!addps xmm7,xmm3

             
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


!addps xmm4,xmm5                'add the partial sums
!addps xmm6,xmm7
!addps xmm4,xmm6



#IF %DEF (%sse3)
!haddps xmm4,xmm4   'sum pairs of results
!haddps xmm4,xmm4   'sum the sum of pairs of results
#ELSE
!movaps xmm1, xmm4
!shufps xmm4, xmm1, &hb1
!addps xmm4, xmm1
!movaps xmm1, xmm4
     

!shufps xmm4, xmm4, &h0a
!addps xmm4, xmm1
#ENDIF

'!movd sum!,xmm4     'store the dot product
!movups z1,xmm4

!emms

The above PowerBASIC ASM format version runs 5,000,000 loops in 1.07s or a single loop in 645 clks ave. on a 3GHz Phenom II x4.

Paul.

jj2007

Quote from: dioxin on August 28, 2010, 11:37:24 AM
jj2007,
if you have the CPU registers avaialable, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction and still only have one loop counter so you don't need to update (in your case) both edx and ecx.

Paul,
You know about conditional assembly? "If 1" means "do this, ignore the else part".

dioxin

jj2007,
yes I know about that but if you look at your code your have:

Either:
lea edx, [edx+16]
..
cmp edx, ecx ; faster, at least on Celeron M
jb @B

OR
lea edx, [edx+16]
..
sub ecx, 16 ; count down
jg @B


Either way it's 3 instructions to control the loop.

Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


Paul.

dedndave

i think i would use JC   :bg
or maybe JNS

and - no need to....
mov ecx,2047
.
.
.
neg ecx

why not just
mov ecx,-2047
or
LoopCount    EQU 2047
LoopCountVal EQU -LoopCount
.
.
.
mov ecx,LoopCountVal

jj2007

Quote from: dioxin on August 28, 2010, 03:38:19 PM
jj2007,
yes I know about that but if you look at your code your have:
...
Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


Paul,
Your code needs one instruction less, and it is a good and recommended way of speeding up a loop. If it really helps, needs to be demonstrated. Just add it to the testbed above.

My post was a reaction to
Quote from: dioxin on August 28, 2010, 11:37:24 AM
you don't need to update (in your case) both edx and ecx.

That is simply not correct. Only edx is being updated in the loop:
cmp edx, ecx ; faster, at least on Celeron M
jb @B


EDIT: Since these debates are always fun, here a testbed "Loop_Art":
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1036    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128

1035    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128


I guess results will differ wildly by CPU :bg

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3823    cycles for neg ecx, TheCt=128
2009    cycles for cmp edx, ecx, TheCt=128
1066    cycles for dedndave, TheCt=128

3792    cycles for neg ecx, TheCt=128
1909    cycles for cmp edx, ecx, TheCt=128
1036    cycles for dedndave, TheCt=128

jj2007

You cheat :dance:
@@: mov ebx,[eax+edx]
inc TheCt
mov [edx],ebx
add edx, 16
cmp edx, ecx
jb @B

@@: nop
add edx, 16
cmp edx, ecx
jb @B

-99 cycles for Lingo :lol

Surprisingly, your code runs exactly as fast as mine:
QuoteIntel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1036    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128
842     cycles for dedndave, TheCt=128

Which means a push to mem/pop from mem pair is as fast as two moves on my Celeron.

lingo

"-99 cycles for Lingo"

cretino,
perché non si salva SSE e ECX registri? :lol

Gunther

So guys, I'm back here. Please excuse the delay, but this weekend is school start here and I was involved a little bit. That was the simple reason.

Special thanks for all test results and the hole bunch of hints.

Quote from: dioxin August 27, 2010, 11:38:45 amyou have 8 SSE registers and you're only using 6 of them. Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.

Sounds not bad. I've tested that with the following code:


_DotXMM4Acc16E:

pxor    xmm4, xmm4 ;sums are in xmm4, xmm5, xmm6 and xmm7
mov     ecx,[esp+12] ;ecx = n
mov     eax,[esp+4] ;eax -> X
pxor    xmm5,xmm5
mov     edx,[esp+8] ;edx -> Y
shl     ecx,2 ;ecx = 4*n (float)
pxor xmm6,xmm6
sub     esp,4 ;stack space for function result
sub eax,edx ;saves 1 lea instruction inside the main loop
pxor xmm7,xmm7

ALIGN 16

.loop:

movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
addps xmm4,xmm0 ;sum up
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
lea     edx,[edx+64] ;update pointers
sub     ecx,byte 64 ;count down
jnz .loop
addps xmm4,xmm5 ;add accumulators
addps xmm6,xmm7
addps xmm4,xmm6 ;sum in xmm4
movhlps xmm0,xmm4 ;get bits 64 - 127 from xmm4
addps   xmm4,xmm0 ;sum in 2 dwords
pshufd  xmm0,xmm4,1 ;get bits 32 - 63 from xmm4
addss   xmm4,xmm0 ;sum in 1 dword
movss   [esp],xmm4 ;store sum
fld     dword [esp] ;load function result
add     esp,byte 4 ;adjust stack
ret


It's probably faster with your Phenom, but slower with my Athlon X2. That has to do with limitations by the floating point adder on older AMD chips. Intel chips must be tested. Let me know, if your code is similar; if so, I'll work that into the archive and update it for appropriate testing. Thank you. That's not a big deal; there's always CPUID and different code paths (one for older AMD, the other probably for Intel and newer AMD). Never mind.

Quote from: dioxin August 28, 2010, 12:37:24 pmf you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction

That's okay. Counting down the loop is mostly faster as counting up. But caching works better with counting up; what you're doing is counting down with a negative offset and forward caching. Good idea.

Quote from: jj2007  August 28, 2010, 06:23:05 pmI guess results will differ wildly by CPU

Yes, Jochen the all results differ wildly and are very surprising.

Quote from: hutch-- August 27, 2010, 08:41:57 amOne request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it.

It's done, but the archive isn't yet updated. I'll do that after Pauls answer.

Gunther

Forgive your enemies, but never forget their names.

lingo

Hi,Gunther,
"I've tested that with the following code:"

Now you have two algos with different results: :wink
prev:
4E2C 4AE8 4E2C 0A98  4EAB 0AB4 4F2A EAB0
last:
4E2C 47F2 4E2C  0792  4EAB 0AB8 4F2A EAB0

dioxin

Lingo,
   You mis-understand the "result".
   The result of a dot product is a scalar (one value). With SINGLE types an xmm register can hold 4 values so 3 of them will be irrelevant and only 1 will be the required answer.

   The same scalar value, 2867507200, is being returned for all code so clearly they all work and give the same answer.

   That value, 2867507200 decimal, when stored as a SINGLE and displayed in hex would be 4F2A EAB0 which is shown in the low DWORD of both of your answers. The other DWORDs of that SSE register are not relevant as they were just intermediate results used to derive the answer, only the low DWORD, 4F2A EAB0, is returned as the answer.



   
Gunther,
I can't imagine why it would slow down on your CPU. The work being done is the same but there should be less register stalls. Worst case I'd expect it to run at the same speed.
I might have access to an Athlon X2 sometime this week and if I get time I'll check it out.

Until then, you can still use the unused registers to fetch/multiply more data in each loop. It won't be a convenient power of 2 but it should be faster.




dedndave,
Quotei think i would use JC   BigGrin
or maybe JNS

Why? I think I used the correct branch.


Quotewhy not just..
Because I was writing it to be easily understood and to fit in with Gunther's original code.
It's easy enough to save a cycle or 2 afterwards once the main code (which is saving 100's clks) is understood.

   

jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times but your loop only increments by 16 bytes instead of 64. Note in my code the counter increments by 16 but is always reference as 4*ecx, not just ecx alone.

Paul.


lingo

"only the low DWORD, 4F2A EAB0, is returned as the answer."
You are right, thanks.. :U

lingo

Gunther,
I'm sure my algo will runs faster on your CPU.
So, please, try it with your testbed... :U

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 10 Dup(0)
DotXMM1Acc4ELingo proc srcX, srcY, counter
mov     eax,   [esp+1*4] ; eax->srcX
pxor    xmm5,  xmm5
mov     edx,   [esp+2*4] ; edx ->srcY
pxor    xmm4,  xmm4
mov     ecx,   [esp+3*4] ; ecx = counter
sub     eax,   edx
@@:
movaps  xmm0, [eax+edx]
movaps  xmm1, [eax+edx+16]
mulps xmm0, [edx]
movaps  xmm2, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm4, xmm0
movaps  xmm3, [eax+edx+48]
mulps xmm2, [edx+32]
addps xmm5, xmm1
movaps  xmm6, [eax+edx+64]
mulps xmm3, [edx+48]
addps xmm4, xmm2
movaps  xmm7, [eax+edx+80]
mulps xmm6, [edx+64]
addps xmm5, xmm3
movaps  xmm0, [eax+edx+96]
mulps xmm7, [edx+80]
addps xmm4, xmm6
movaps  xmm1, [eax+edx+112]
mulps xmm0, [edx+96]
addps xmm5, xmm7
mulps xmm1, [edx+112]
addps xmm4, xmm0
add edx,  128
addps xmm5, xmm1
sub    ecx,  32
ja @b
addps xmm4, xmm5
movhlps xmm1, xmm4
addps xmm4, xmm1
pshufd  xmm1, xmm4,1
addss xmm4, xmm1
movss   dword ptr [esp+2*4], xmm4
fld     dword ptr [esp+2*4]
ret     4*3
DotXMM1Acc4ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef