SSE VS FPU

Farabi · February 27, 2012, 11:02:16 AM

I don't see any big difference; the SSE version is just simpler.


Vec_SubSSE	proc	uses ebx DestVec:dword, A:dword, B:dword 
			mov	eax, DestVec
			mov	ebx, A
			mov	ecx, B
		;	fld	dword ptr [ebx + VERTEX.x]
		;	fsub	dword ptr [ecx + VERTEX.x]
		;	fstp	dword ptr [eax + VERTEX.x]
;
;			fld	dword ptr [ebx + VERTEX.y]
;			fsub	dword ptr [ecx + VERTEX.y]
;			fstp	dword ptr [eax + VERTEX.y]
;
;			fld	dword ptr [ebx + VERTEX.z]
;			fsub	dword ptr [ecx + VERTEX.z]
;			fstp	dword ptr [eax + VERTEX.z]
			
			movups xmm0,[ebx]
			movups xmm1,[ecx]
			subps xmm0,xmm1
			movups [eax],xmm0
			
			ret
Vec_SubSSE			endp

Vec_Sub			proc	uses ebx DestVec:dword, A:dword, B:dword 
			mov	eax, DestVec
			mov	ebx, A
			mov	ecx, B
			fld	dword ptr [ebx + VERTEX.x]
			fsub	dword ptr [ecx + VERTEX.x]
			fstp	dword ptr [eax + VERTEX.x]

			fld	dword ptr [ebx + VERTEX.y]
			fsub	dword ptr [ecx + VERTEX.y]
			fstp	dword ptr [eax + VERTEX.y]

			fld	dword ptr [ebx + VERTEX.z]
			fsub	dword ptr [ecx + VERTEX.z]
			fstp	dword ptr [eax + VERTEX.z]
			
			ret
Vec_Sub			endp

dedndave · February 27, 2012, 11:50:29 AM

if you don't see a difference, then the code may not be executed as often as you think
the SSE code above should be a few times faster than the FPU code
you may need to formulate the proper test to see the difference
it's likely that most of the time is consumed elsewhere, making it hard to see a change

Farabi · February 27, 2012, 12:46:36 PM

You'll be surprised: mul took 1 ms and fmul 913 ms. If OpenGL uses a lot of FPU or SSE2 code, I think I can beat it. I'm still gathering info on what they have done.

oex · February 27, 2012, 12:50:56 PM

Simply be aware that SSE is less supported....

I think SSE2 is supported now by most computer users but good programming practice means you need to check at the start of your application for the relevant instruction support....
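
A start-up check of that kind could look roughly like the sketch below. It is only an illustration in MASM32-style syntax: HasSSE2 is a made-up helper name, the CPU is assumed to support the CPUID instruction, and operating-system support for saving the XMM registers is assumed rather than tested.

```asm
; Minimal sketch, assuming MASM32-style flat model.
; HasSSE2 is a hypothetical helper name, not part of any library.
.686
.model flat, stdcall
option casemap:none

.code
HasSSE2 proc uses ebx           ; CPUID overwrites ebx, so preserve it
        mov     eax, 1          ; CPUID leaf 1 returns the feature flags
        cpuid                   ; also overwrites ecx and edx
        xor     eax, eax        ; default result: 0 = not supported
        bt      edx, 26         ; EDX bit 26 = SSE2 (bit 25 would be SSE)
        setc    al              ; eax = 1 if the SSE2 bit is set
        ret
HasSSE2 endp
end
```

The caller would test eax after `call HasSSE2` and fall back to the FPU routine when it is zero.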

Recently I checked out OceanJeff's fireworks demo.... It worked on my AMD but not on my far newer Intel....

PS. Use MichaelW's timing script, your timings look a little shaky. Have a look at the Laboratory test pieces to understand how to better time your code....

jj2007 · February 27, 2012, 01:15:25 PM

Quote from: Farabi on February 27, 2012, 12:46:36 PM
You'll be surprised: mul took 1 ms and fmul 913 ms.

It means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.

Code:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
194     cycles for 100*fmul
179     cycles for 100*mulsd

13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
193     cycles for 100*fmul
180     cycles for 100*mulsd

oex · February 27, 2012, 02:41:23 PM

Code:


Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5       cycles for Vec_SubSSE
12      cycles for Vec_Sub
153     cycles for 100*fmul
587     cycles for 100*mulsd

9       cycles for Vec_SubSSE
15      cycles for Vec_Sub
154     cycles for 100*fmul
602     cycles for 100*mulsd

dancho · February 27, 2012, 03:42:06 PM

you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...

jj2007 · February 27, 2012, 04:27:24 PM

Quote from: dancho on February 27, 2012, 03:42:06 PM
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...

Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

dancho · February 27, 2012, 04:43:05 PM

Quote
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

Yes, of course,
the data must be aligned on a 16-byte boundary...
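
For statically declared data, the alignment could be set up roughly as follows. This is a sketch under assumptions: vecA, vecB and vecOut are hypothetical names, the project is a MASM32-style flat-model build, and each vector is padded with a fourth float so that a 16-byte load or store stays inside its own storage.

```asm
; Sketch only: vecA, vecB and vecOut are hypothetical names.
.686
.xmm
.model flat, stdcall
option casemap:none

.data
align 16                                ; next label starts on a 16-byte boundary
vecA    REAL4 4.0, 5.0, 6.0, 0.0        ; x, y, z plus one padding float
align 16
vecB    REAL4 1.0, 2.0, 3.0, 0.0
align 16
vecOut  REAL4 4 dup(0.0)                ; receives the 16-byte result
```

Memory that comes from the stack or from an allocator is not covered by this; its alignment has to be arranged separately.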

qWord · February 27, 2012, 04:55:23 PM

With proper alignment and code I get the following results:

Code:

Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
193     cycles for 100*fmul
630     cycles for 100*mulsd

4       cycles for Vec_SubSSE
8       cycles for Vec_Sub
202     cycles for 100*fmul
619     cycles for 100*mulsd

Code:

movaps xmm0,[ebx]
subps xmm0,[ecx]
movaps [eax],xmm0

SSEx was introduced for exactly such tasks; all you need to do is set up the right conditions. Also, implementing this function (6 instructions!) as a macro is highly recommended.
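
One possible shape for such a macro is sketched below. The name VecSubSSE is made up for illustration. The sketch assumes that each argument is a register holding a 16-byte-aligned address and that every vector occupies a full 16 bytes (x, y, z plus padding).

```asm
; Hypothetical macro; needs .686 and .xmm earlier in the source.
; Assumes pDest, pA and pB are registers holding 16-byte-aligned addresses.
; Clobbers xmm0.
VecSubSSE MACRO pDest:REQ, pA:REQ, pB:REQ
        movaps  xmm0, [pA]              ; load the four floats of A
        subps   xmm0, [pB]              ; xmm0 = A - B, all lanes at once
        movaps  [pDest], xmm0           ; store the result
ENDM

; usage, with the pointers already loaded into registers:
;       VecSubSSE eax, ebx, ecx
```

Expanding the three instructions inline removes the call, the stack frame and the argument loads that the proc version pays for on every invocation.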

EDIT: @jj, there is a small bug in the mulsd test: movlps should be movsd.
With that fixed I get:

Quote
3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
167 cycles for 100*fmul
174 cycles for 100*mulsd

3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
204 cycles for 100*fmul
196 cycles for 100*mulsd

3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
180 cycles for 100*fmul
176 cycles for 100*mulsd

4 cycles for Vec_SubSSE
7 cycles for Vec_Sub
173 cycles for 100*fmul
175 cycles for 100*mulsd

jj2007 · February 27, 2012, 06:51:42 PM

With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:

Code:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
195     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd

9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
196     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd

P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic

qWord · February 27, 2012, 07:06:58 PM

Quote
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
168 cycles for 100*fmul
590 cycles for 100*mulsd movlps ;(ps<>sd)
216 cycles for 100*mulsd movsd

3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
164 cycles for 100*fmul
617 cycles for 100*mulsd movlps ;(ps<>sd)
172 cycles for 100*mulsd movsd

--- ok ---

... and again, it is nice to see that the FPU is still on an equal footing with SSEx

Farabi · February 29, 2012, 07:51:56 AM

It seems SSE is only faster on Intel processors.

oex · February 29, 2012, 11:04:25 AM

Quote from: Farabi on February 29, 2012, 07:51:56 AM
It seems SSE is only faster on Intel processors.

Different instructions, different processors.... One SSE MemCopy function I used on my old AMD was more than twice as fast as movsd....

Farabi · March 03, 2012, 10:01:42 AM

Oh, I forgot: what I meant by mul is the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for the floating-point instructions?
