Suggestions and improvements for SSE2 code are welcome - Page 2

Main Menu

Home
Search

April 26, 2024, 02:02:08 PM

News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Previous topic - Next topic

Print

Go Down Pages 1 2 3 4 ... 9

dioxin

Member
Posts: 156
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#15

August 27, 2010, 10:38:45 AM

Gunther,
you have 8 SSE registers and you're only using 6 of them.
Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.
You'll have to experiment to see which alternate use of the unused registers gives the better results.

I get faster results using 4 accumulators:

Code Select

addps xmm4,xmm0               'add the products
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3

Paul.

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#16

August 27, 2010, 11:18:28 PM

I have looked into the purpose of this code and modified the testbed accordingly:

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2563    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2680    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2540    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

The result: 2867507200
The result: 2867507200

EDIT: I squeezed out a few cycles...

Code Select

	add ecx, edx	; slightly faster
;	int 3	; OPT_Olly 2
	align 16

@@:
	REPEAT 16	; unrolling helps, but make sure that the count is divisible by the rep count!!
	movaps xmm0, [eax+edx]	; xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]				; xmm0: 4  3   2 1, 8765, 12 11 10 9, ... (high to low)
	mulps xmm0, [edx]	; xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]		; xmm0: 20 12 6 2, 72 56 42 30, 156 132 110 90,
	lea edx, [edx+16]	; update pointers
	addps xmm7, xmm0	; sum up									; xmm7: 20 12 6 2, 92 68 48 32, 248 200 158 122, ...
	ENDM
	if 1
		cmp edx, ecx	; faster, at least on Celeron M
		jb @B
	else
		sub ecx, 16	; count down
		jg @B
	endif

DotProductC.zip
30.21 KB
downloaded 337 times

Masm32 Tips, Tricks and Traps

dioxin

Member
Posts: 156
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#17

August 28, 2010, 11:37:24 AM Last Edit: August 28, 2010, 01:23:06 PM by dioxin

jj2007,
if you have the CPU registers avaialable, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction and still only have one loop counter so you don't need to update (in your case) both edx and ecx.

Code Select

!mov ecx,2047                  'loop counter for 2048 elements to get

!movups xmm4,z                 'zero the 4 accumulators (z predifined as 4x zero single)
!movaps xmm5,xmm4
!movaps xmm6,xmm4
!movaps xmm7,xmm4

!mov eax,aptr                 'get pointer to start of first array
!mov edx,bptr                 'get pointer to start of second array

!lea eax,[eax+4*ecx]          'offset pointers by size of array
!lea edx,[edx+4*ecx]
      
!neg ecx                      'negate counter so now pointer + 4*counter = start of array
                                'but also counter now counts up to zero


#ALIGN 16
lp:
!movaps xmm0,[eax+ecx*4]      'get 16 elements of first vector
!movaps xmm1,[eax+ecx*4+16]
!movaps xmm2,[eax+ecx*4+32]
!movaps xmm3,[eax+ecx*4+48]


!mulps xmm0,[edx+ecx*4]        'multiply by 16 elements of second vector
!mulps xmm1,[edx+ecx*4+16]
!mulps xmm2,[edx+ecx*4+32]
!mulps xmm3,[edx+ecx*4+48]


!addps xmm4,xmm0               'add the products
!addps xmm5,xmm1
!addps xmm6,xmm2
!addps xmm7,xmm3

             
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


!addps xmm4,xmm5                'add the partial sums
!addps xmm6,xmm7
!addps xmm4,xmm6



#IF %DEF (%sse3)
!haddps xmm4,xmm4   'sum pairs of results
!haddps xmm4,xmm4   'sum the sum of pairs of results
#ELSE
!movaps xmm1, xmm4
!shufps xmm4, xmm1, &hb1
!addps xmm4, xmm1
!movaps xmm1, xmm4
      

!shufps xmm4, xmm4, &h0a
!addps xmm4, xmm1
#ENDIF

'!movd sum!,xmm4     'store the dot product
!movups z1,xmm4

!emms

The above PowerBASIC ASM format version runs 5,000,000 loops in 1.07s or a single loop in 645 clks ave. on a 3GHz Phenom II x4.

Paul.

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#18

August 28, 2010, 03:18:22 PM

Quote from: dioxin on August 28, 2010, 11:37:24 AM
jj2007,
if you have the CPU registers avaialable, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction and still only have one loop counter so you don't need to update (in your case) both edx and ecx.

Paul,
You know about conditional assembly? "If 1" means "do this, ignore the else part".

Masm32 Tips, Tricks and Traps

dioxin

Member
Posts: 156
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#19

August 28, 2010, 03:38:19 PM

jj2007,
yes I know about that but if you look at your code your have:

Either:

Code Select

lea edx, [edx+16]
..
cmp edx, ecx ; faster, at least on Celeron M
jb @B

OR

Code Select

lea edx, [edx+16]
..
sub ecx, 16 ; count down
jg @B

Either way it's 3 instructions to control the loop.

Look at the way it's done in the code I just posted and it's only 2 instructions:

Code Select

!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve

Paul.

dedndave

Member
Posts: 10,231
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#20

August 28, 2010, 04:10:59 PM

i think i would use JC :bg
or maybe JNS

and - no need to....

Code Select

mov ecx,2047
.
.
.
neg ecx

why not just

Code Select

mov ecx,-2047
or

Code Select

LoopCount    EQU 2047
LoopCountVal EQU -LoopCount
.
.
.
mov ecx,LoopCountVal

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#21

August 28, 2010, 05:23:05 PM

Quote from: dioxin on August 28, 2010, 03:38:19 PM
jj2007,
yes I know about that but if you look at your code your have:
...
Look at the way it's done in the code I just posted and it's only 2 instructions:
Code Select Expand
!add ecx,16 'next block !js short lp 'ends when sign goes +ve

Paul,
Your code needs one instruction less, and it is a good and recommended way of speeding up a loop. If it really helps, needs to be demonstrated. Just add it to the testbed above.

My post was a reaction to

Quote from: dioxin on August 28, 2010, 11:37:24 AM
you don't need to update (in your case) both edx and ecx.

That is simply not correct. Only edx is being updated in the loop:

Code Select

		cmp edx, ecx	; faster, at least on Celeron M
		jb @B

EDIT: Since these debates are always fun, here a testbed "Loop_Art":

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1036    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128

1035    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128

I guess results will differ wildly by CPU :bg

Loop_Art.zip
2.85 KB
downloaded 321 times

Masm32 Tips, Tricks and Traps

dedndave

Member
Posts: 10,231
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#22

August 28, 2010, 06:40:37 PM

prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3823    cycles for neg ecx, TheCt=128
2009    cycles for cmp edx, ecx, TheCt=128
1066    cycles for dedndave, TheCt=128

3792    cycles for neg ecx, TheCt=128
1909    cycles for cmp edx, ecx, TheCt=128
1036    cycles for dedndave, TheCt=128

Loop_Art2.zip
3.11 KB
downloaded 332 times

jj2007

Member
Posts: 5,393
Location: Italy
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#23

August 28, 2010, 07:32:01 PM

You cheat :dance:

Code Select

		@@:		mov ebx,[eax+edx]
				inc TheCt
				mov [edx],ebx
				add edx, 16
				cmp edx, ecx
				jb @B

Code Select

		@@:		nop
				add edx, 16
				cmp edx, ecx
				jb @B

-99 cycles for Lingo :lol

Surprisingly, your code runs exactly as fast as mine:

QuoteIntel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1036 cycles for neg ecx, TheCt=128
843 cycles for cmp edx, ecx, TheCt=128
842 cycles for dedndave, TheCt=128

Which means a push to mem/pop from mem pair is as fast as two moves on my Celeron.

Masm32 Tips, Tricks and Traps

lingo

Member
Posts: 487
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#24

August 28, 2010, 09:06:34 PM

"-99 cycles for Lingo"

cretino,
perché non si salva SSE e ECX registri? :lol

Gunther

Member
Posts: 157
Assembly Language Dinosaur
Location: Berlin
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#25

August 29, 2010, 12:43:04 AM

So guys, I'm back here. Please excuse the delay, but this weekend is school start here and I was involved a little bit. That was the simple reason.

Special thanks for all test results and the hole bunch of hints.

Quote from: dioxin August 27, 2010, 11:38:45 amyou have 8 SSE registers and you're only using 6 of them. Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.

Sounds not bad. I've tested that with the following code:

Code Select


_DotXMM4Acc16E:

		pxor    xmm4, xmm4				;sums are in xmm4, xmm5, xmm6 and xmm7
		mov     ecx,[esp+12]				;ecx = n
		mov     eax,[esp+4]				;eax -> X
		pxor    xmm5,xmm5
		mov     edx,[esp+8]				;edx -> Y
		shl     ecx,2					;ecx = 4*n (float)
		pxor	xmm6,xmm6
		sub     esp,4					;stack space for function result
		sub	eax,edx					;saves 1 lea instruction inside the main loop
		pxor	xmm7,xmm7
		
		ALIGN 16

.loop:

		movaps	xmm0,[eax+edx]				;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
		movaps	xmm1,[eax+edx+16]			;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
		movaps	xmm2,[eax+edx+32]			;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
		movaps	xmm3,[eax+edx+48]			;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
		mulps	xmm0,[edx]				;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
		mulps	xmm1,[edx+16]				;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
		mulps	xmm2,[edx+32]				;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
		mulps	xmm3,[edx+48]				;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
		addps	xmm4,xmm0				;sum up
		addps	xmm5,xmm1
		addps	xmm6,xmm2
		addps	xmm7,xmm3
		lea     edx,[edx+64]				;update pointers
		sub     ecx,byte 64				;count down
		jnz	.loop
		addps	xmm4,xmm5				;add accumulators
		addps	xmm6,xmm7
		addps	xmm4,xmm6				;sum in xmm4
		movhlps xmm0,xmm4				;get bits 64 - 127 from xmm4
		addps   xmm4,xmm0				;sum in 2 dwords
		pshufd  xmm0,xmm4,1				;get bits 32 - 63 from xmm4
		addss   xmm4,xmm0				;sum in 1 dword
		movss   [esp],xmm4				;store sum
		fld     dword [esp]				;load function result
		add     esp,byte 4				;adjust stack
		ret

It's probably faster with your Phenom, but slower with my Athlon X2. That has to do with limitations by the floating point adder on older AMD chips. Intel chips must be tested. Let me know, if your code is similar; if so, I'll work that into the archive and update it for appropriate testing. Thank you. That's not a big deal; there's always CPUID and different code paths (one for older AMD, the other probably for Intel and newer AMD). Never mind.

Quote from: dioxin August 28, 2010, 12:37:24 pmf you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction

That's okay. Counting down the loop is mostly faster as counting up. But caching works better with counting up; what you're doing is counting down with a negative offset and forward caching. Good idea.

Quote from: jj2007 August 28, 2010, 06:23:05 pmI guess results will differ wildly by CPU

Yes, Jochen the all results differ wildly and are very surprising.

Quote from: hutch-- August 27, 2010, 08:41:57 amOne request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it.

It's done, but the archive isn't yet updated. I'll do that after Pauls answer.

Gunther

Forgive your enemies, but never forget their names.

lingo

Member
Posts: 487
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#26

August 29, 2010, 02:19:41 AM

Hi,Gunther,
"I've tested that with the following code:"

Now you have two algos with different results: :wink
prev:
4E2C 4AE8 4E2C 0A98 4EAB 0AB4 4F2A EAB0
last:
4E2C 47F2 4E2C 0792 4EAB 0AB8 4F2A EAB0

dioxin

Member
Posts: 156
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#27

August 30, 2010, 10:42:29 PM

Lingo,
   You mis-understand the "result".
   The result of a dot product is a scalar (one value). With SINGLE types an xmm register can hold 4 values so 3 of them will be irrelevant and only 1 will be the required answer.

   The same scalar value, 2867507200, is being returned for all code so clearly they all work and give the same answer.

   That value, 2867507200 decimal, when stored as a SINGLE and displayed in hex would be 4F2A EAB0 which is shown in the low DWORD of both of your answers. The other DWORDs of that SSE register are not relevant as they were just intermediate results used to derive the answer, only the low DWORD, 4F2A EAB0, is returned as the answer.


Gunther,
I can't imagine why it would slow down on your CPU. The work being done is the same but there should be less register stalls. Worst case I'd expect it to run at the same speed.
I might have access to an Athlon X2 sometime this week and if I get time I'll check it out.

Until then, you can still use the unused registers to fetch/multiply more data in each loop. It won't be a convenient power of 2 but it should be faster.

dedndave,

Quotei think i would use JC BigGrin
or maybe JNS

Why? I think I used the correct branch.

Quotewhy not just..

Because I was writing it to be easily understood and to fit in with Gunther's original code.
It's easy enough to save a cycle or 2 afterwards once the main code (which is saving 100's clks) is understood.

jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times but your loop only increments by 16 bytes instead of 64. Note in my code the counter increments by 16 but is always reference as 4*ecx, not just ecx alone.

Paul.

lingo

Member
Posts: 487
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#28

August 30, 2010, 11:24:13 PM

"only the low DWORD, 4F2A EAB0, is returned as the answer."
You are right, thanks.. :U

lingo

Member
Posts: 487
Logged

Re: Suggestions and improvements for SSE2 code are welcome

#29

August 30, 2010, 11:46:00 PM Last Edit: August 31, 2010, 12:46:18 AM by lingo

Gunther,
I'm sure my algo will runs faster on your CPU.
So, please, try it with your testbed... :U

Code Select



OPTION PROLOGUE:NONE 
OPTION EPILOGUE:NONE 
align 16
db 10 Dup(0)
DotXMM1Acc4ELingo proc srcX, srcY, counter
		mov     eax,   [esp+1*4]	; eax->srcX
		pxor    xmm5,  xmm5
		mov     edx,   [esp+2*4]	; edx ->srcY
		pxor    xmm4,  xmm4					
		mov     ecx,   [esp+3*4]	; ecx = counter
		sub     eax,   edx
@@:
		movaps  xmm0, [eax+edx]	
		movaps  xmm1, [eax+edx+16]
		mulps	xmm0, [edx]				
		movaps  xmm2, [eax+edx+32]	
		mulps	xmm1, [edx+16]
		addps	xmm4, xmm0
		movaps  xmm3, [eax+edx+48]
		mulps	xmm2, [edx+32]
		addps	xmm5, xmm1
		movaps  xmm6, [eax+edx+64]
		mulps	xmm3, [edx+48]
		addps	xmm4, xmm2
		movaps  xmm7, [eax+edx+80]
		mulps	xmm6, [edx+64]
		addps	xmm5, xmm3
		movaps  xmm0, [eax+edx+96]
		mulps	xmm7, [edx+80]
		addps	xmm4, xmm6
		movaps  xmm1, [eax+edx+112]
		mulps	xmm0, [edx+96]
		addps	xmm5, xmm7
		mulps	xmm1, [edx+112]
		addps	xmm4, xmm0
		add	edx,  128
		addps	xmm5, xmm1
		sub    	ecx,  32
		ja 	@b
		addps	xmm4, xmm5
		movhlps xmm1, xmm4
		addps	xmm4, xmm1
		pshufd  xmm1, xmm4,1
		addss	xmm4, xmm1
		movss   dword ptr [esp+2*4], xmm4
		fld     dword ptr [esp+2*4]
		ret     4*3
DotXMM1Acc4ELingo endp
OPTION PROLOGUE:PrologueDef 
OPTION EPILOGUE:EpilogueDef

Print

Go Up Pages 1 2 3 4 ... 9

User actions

Print