SSE VS FPU

Started by Farabi, February 27, 2012, 11:02:16 AM

Farabi

I don't see any big difference; the SSE version is just simpler.


Vec_SubSSE proc uses ebx DestVec:dword, A:dword, B:dword
    mov eax, DestVec
    mov ebx, A
    mov ecx, B

    ; replaces the three fld/fsub/fstp triples used in Vec_Sub below
    ; note: this moves 16 bytes (4 floats), so x, y, z and one extra dword are read and written
    movups xmm0,[ebx]
    movups xmm1,[ecx]
    subps xmm0,xmm1
    movups [eax],xmm0

    ret
Vec_SubSSE endp

Vec_Sub proc uses ebx DestVec:dword, A:dword, B:dword
    mov eax, DestVec
    mov ebx, A
    mov ecx, B

    ; DestVec.x = A.x - B.x
    fld dword ptr [ebx + VERTEX.x]
    fsub dword ptr [ecx + VERTEX.x]
    fstp dword ptr [eax + VERTEX.x]

    ; DestVec.y = A.y - B.y
    fld dword ptr [ebx + VERTEX.y]
    fsub dword ptr [ecx + VERTEX.y]
    fstp dword ptr [eax + VERTEX.y]

    ; DestVec.z = A.z - B.z
    fld dword ptr [ebx + VERTEX.z]
    fsub dword ptr [ecx + VERTEX.z]
    fstp dword ptr [eax + VERTEX.z]

    ret
Vec_Sub endp
Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"

dedndave

if you don't see a difference, then the code may not be executed as often as you think
the SSE code above should be a few times faster than the FPU code
you may need to formulate the proper test to see the difference
it's likely that most of the time is consumed elsewhere, making it hard to see a change
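
To make "the proper test" concrete, here is a minimal sketch of a cycle-count loop around Vec_SubSSE. The names TimeVecSub, LOOPS, vA, vB and vOut are assumptions for illustration and do not come from the thread. The result still includes call and loop overhead, so treat it as a rough comparison rather than an exact figure.

```asm
; Sketch only: TimeVecSub, LOOPS, vA, vB and vOut are hypothetical names.
; Assumes .586 or later (for cpuid/rdtsc) and that Vec_SubSSE is already
; defined or prototyped, with vA, vB and vOut declared in .data.
LOOPS equ 1000000

TimeVecSub proc uses ebx esi edi
    xor eax, eax
    cpuid                   ; serialize before reading the counter
    rdtsc                   ; edx:eax = time-stamp counter
    mov esi, eax            ; start value, low dword
    mov edi, edx            ; start value, high dword

    mov ebx, LOOPS          ; Vec_SubSSE preserves ebx ("uses ebx")
@@:
    invoke Vec_SubSSE, addr vOut, addr vA, addr vB
    dec ebx
    jnz @B

    xor eax, eax
    cpuid                   ; serialize again (clobbers eax, ebx, ecx, edx)
    rdtsc
    sub eax, esi
    sbb edx, edi            ; edx:eax = elapsed cycles for all iterations
    mov ecx, LOOPS
    div ecx                 ; eax = average cycles per call, overhead included
    ret
TimeVecSub endp
```

Running the same loop around Vec_Sub, and once around an empty loop body, gives a baseline to subtract.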

Farabi

You'll be surprised: mul took 1 ms and fmul 913 ms. If OpenGL relies heavily on FPU or SSE2 code, I think I can beat it. I'm still gathering information on what they did.

oex

Simply be aware that SSE is less widely supported.

I think SSE2 is now supported on most users' computers, but good programming practice means you need to check at the start of your application for the relevant instruction support.

Recently I checked out OceanJeff's fireworks demo. It worked on my AMD but not on my far newer Intel.

PS. Use MichaelW's timing script; your timings look a little shaky. Have a look at the laboratory test pieces to understand how to better time your code.
We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv
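
The startup check mentioned above can be sketched with CPUID leaf 1, where EDX bit 25 reports SSE and bit 26 reports SSE2. HasSSE2 is a hypothetical helper name, and the sketch assumes the CPU supports the CPUID instruction at all.

```asm
; Sketch only: HasSSE2 is a hypothetical helper name.
; Assumes .586 or later and a CPU that implements CPUID.
HasSSE2 proc uses ebx       ; cpuid overwrites ebx, so preserve it
    mov eax, 1              ; leaf 1: feature flags
    cpuid
    xor eax, eax
    bt edx, 26              ; EDX bit 26 = SSE2 (bit 25 = SSE)
    setc al                 ; eax = 1 if SSE2 is available, else 0
    ret
HasSSE2 endp

; usage at program start (UseFpuPath is a hypothetical label):
;   invoke HasSSE2
;   test eax, eax
;   jz UseFpuPath
```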

jj2007

Quote from: Farabi on February 27, 2012, 12:46:36 PM
You'll be surprised: mul took 1 ms and fmul 913 ms.

That means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
194     cycles for 100*fmul
179     cycles for 100*mulsd

13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
193     cycles for 100*fmul
180     cycles for 100*mulsd

oex


Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5       cycles for Vec_SubSSE
12      cycles for Vec_Sub
153     cycles for 100*fmul
587     cycles for 100*mulsd

9       cycles for Vec_SubSSE
15      cycles for Vec_Sub
154     cycles for 100*fmul
602     cycles for 100*mulsd

dancho

you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...

jj2007

Quote from: dancho on February 27, 2012, 03:42:06 PM
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...

Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

dancho

Quote
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

Yes, of course:
the data must be aligned on a 16-byte boundary...
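
A minimal sketch of both halves of this exchange: declaring the vectors on a 16-byte boundary, and checking the pointers at run time so that movaps is only used when it cannot fault. The names vA, vB, vOut and Vec_SubAuto are assumptions, and the data declaration assumes the section alignment permits `align 16`. The run-time test keeps the routine safe either way.

```asm
; Sketch only: vA, vB, vOut and Vec_SubAuto are hypothetical names.
.data
align 16                        ; next item starts on a 16-byte boundary
vA   real4 1.0, 2.0, 3.0, 0.0   ; 16 bytes each, so all three stay aligned
vB   real4 0.5, 0.5, 0.5, 0.0
vOut real4 4 dup(0.0)

.code
Vec_SubAuto proc uses ebx DestVec:dword, A:dword, B:dword
    mov eax, DestVec
    mov ebx, A
    mov ecx, B
    mov edx, eax
    or edx, ebx
    or edx, ecx
    test edx, 15            ; any low bit set: at least one pointer is misaligned
    jnz Unaligned
    movaps xmm0, [ebx]      ; aligned path
    subps xmm0, [ecx]
    movaps [eax], xmm0
    ret
Unaligned:
    movups xmm0, [ebx]      ; fallback that tolerates any address
    movups xmm1, [ecx]
    subps xmm0, xmm1
    movups [eax], xmm0
    ret
Vec_SubAuto endp
```

The extra test and branch cost a few cycles, so in a hot loop it may be better to guarantee alignment once and use movaps unconditionally.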

qWord

With proper alignment and code I get the following results:
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
193     cycles for 100*fmul
630     cycles for 100*mulsd

4       cycles for Vec_SubSSE
8       cycles for Vec_Sub
202     cycles for 100*fmul
619     cycles for 100*mulsd


movaps xmm0,[ebx]
subps xmm0,[ecx]
movaps [eax],xmm0

SSEx was introduced for exactly such tasks; all you need to do is set up the right conditions. Also, implementing this function (6 instructions!) as a macro is highly recommended.
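
One way the macro suggestion could look, as a sketch: Vec_SubM is a hypothetical name, and the arguments are assumed to be registers that already hold 16-byte-aligned addresses. Expanding inline removes the call, the stack frame and the three pointer loads.

```asm
; Sketch only: Vec_SubM is a hypothetical macro name.
; dest, srcA and srcB are registers holding 16-byte-aligned addresses.
; Clobbers xmm0.
Vec_SubM MACRO dest:REQ, srcA:REQ, srcB:REQ
    movaps xmm0, [srcA]     ; load 4 packed singles from srcA
    subps  xmm0, [srcB]     ; xmm0 = srcA - srcB, element-wise
    movaps [dest], xmm0     ; store the 4 results
ENDM

; usage, with the pointers already in registers:
;   mov eax, DestVec
;   mov ebx, A
;   mov ecx, B
;   Vec_SubM eax, ebx, ecx
```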


EDIT: @jj, there is a small bug in the mulsd test: movlps should be movsd.
With that fixed I get:

Quote
3       cycles for Vec_SubSSE
8       cycles for Vec_Sub
167     cycles for 100*fmul
174     cycles for 100*mulsd

3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
204     cycles for 100*fmul
196     cycles for 100*mulsd

3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
180     cycles for 100*fmul
176     cycles for 100*mulsd

4       cycles for Vec_SubSSE
7       cycles for Vec_Sub
173     cycles for 100*fmul
175     cycles for 100*mulsd
FPU in a trice: SmplMath
It's that simple!
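
For readers who have not met the two loads side by side, a short sketch of the difference behind the movlps/movsd fix. dVal is a hypothetical variable. Both instructions copy the same 8 bytes, but they differ in data type and in what happens to the upper half of the register.

```asm
; Sketch only: dVal is a hypothetical REAL8 variable.
.data
dVal real8 1.5

.code
    ; movlps treats the 8 bytes as two packed singles and leaves
    ; the upper 64 bits of xmm0 unchanged
    movlps xmm0, qword ptr dVal

    ; movsd loads one scalar double and, when loading from memory,
    ; clears the upper 64 bits of xmm0
    movsd xmm0, qword ptr dVal
    mulsd xmm0, xmm0        ; scalar double multiply on the loaded value
```

The timings in this thread suggest that mixing a packed-single load with a scalar-double multiply can be costly on some Intel CPUs, while the Celeron and AMD results show little or no difference.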

jj2007

With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
195     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd

9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
196     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd


P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic

qWord

Quote
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
168     cycles for 100*fmul
590     cycles for 100*mulsd movlps ;(ps<>sd)
216     cycles for 100*mulsd movsd

3       cycles for Vec_SubSSE
8       cycles for Vec_Sub
164     cycles for 100*fmul
617     cycles for 100*mulsd movlps ;(ps<>sd)
172     cycles for 100*mulsd movsd


--- ok ---
... and again, it is nice to see that the FPU is still on an equal footing with SSEx

Farabi

It seems SSE is only faster on Intel processors.

oex

Quote from: Farabi on February 29, 2012, 07:51:56 AM
It seems SSE is only faster on Intel processors.

Different instructions, different processors. One SSE MemCopy function I used on my old AMD was more than twice as fast as movsd.

Farabi

Oh, I forgot: by mul I meant the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for floating point?
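
Integer mul can stand in for floating point only if the values are kept in a fixed-point format, which trades range and precision for speed. A minimal sketch, assuming a 16.16 format (value = integer / 65536) that is not taken from the thread; FixMul, valA and valB are hypothetical names.

```asm
; Sketch only: FixMul is a hypothetical helper; 16.16 fixed point is an
; assumed format, not something from the thread.
FixMul proc valA:dword, valB:dword
    mov eax, valA
    imul valB               ; edx:eax = 64-bit signed product, scaled by 2^32
    shrd eax, edx, 16       ; drop 16 fraction bits: eax = product in 16.16
    ret
FixMul endp

; example: 1.5 * 2.0
;   invoke FixMul, 00018000h, 00020000h   ; eax = 00030000h, i.e. 3.0
```

Results that do not fit in 16 integer bits overflow silently, which is one reason general-purpose 3D code stays with floating point.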