The MASM Forum Archive 2004 to 2012
Author Topic: SSE VS FPU  (Read 34310 times)
Farabi
SSE VS FPU
« on: February 27, 2012, 11:02:16 AM »

I don't see any big difference; it is just simpler.

Code:
Vec_SubSSE proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
; fld dword ptr [ebx + VERTEX.x]
; fsub dword ptr [ecx + VERTEX.x]
; fstp dword ptr [eax + VERTEX.x]
;
; fld dword ptr [ebx + VERTEX.y]
; fsub dword ptr [ecx + VERTEX.y]
; fstp dword ptr [eax + VERTEX.y]
;
; fld dword ptr [ebx + VERTEX.z]
; fsub dword ptr [ecx + VERTEX.z]
; fstp dword ptr [eax + VERTEX.z]

movups xmm0,[ebx]
movups xmm1,[ecx]
subps xmm0,xmm1
movups [eax],xmm0

ret
Vec_SubSSE endp

Vec_Sub proc uses ebx DestVec:dword, A:dword, B:dword
mov eax, DestVec
mov ebx, A
mov ecx, B
fld dword ptr [ebx + VERTEX.x]
fsub dword ptr [ecx + VERTEX.x]
fstp dword ptr [eax + VERTEX.x]

fld dword ptr [ebx + VERTEX.y]
fsub dword ptr [ecx + VERTEX.y]
fstp dword ptr [eax + VERTEX.y]

fld dword ptr [ebx + VERTEX.z]
fsub dword ptr [ecx + VERTEX.z]
fstp dword ptr [eax + VERTEX.z]

ret
Vec_Sub endp

Those who had universe knowledges can control the world by a micro processor.
http://www.wix.com/farabio/firstpage

"Etos siperi elegi"
dedndave
Re: SSE VS FPU
« Reply #1 on: February 27, 2012, 11:50:29 AM »

if you don't see a difference, then the code may not be executed as often as you think
the SSE code above should be a few times faster than the FPU code
you may need to formulate the proper test to see the difference
it's likely that most of the time is consumed elsewhere, making it hard to see a change
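
A rough sketch of such a test, assuming the Vec_Sub procedure from the first post: run the call many times between two rdtsc reads. LOOPS and the buffers vDest, vA and vB are hypothetical names, not from the thread, and the count includes call and loop overhead, so it is only useful for comparing two variants measured the same way.

```asm
; Fragment for a program assembled with .586 or higher (cpuid, rdtsc).
; Uses esi and edi as scratch registers.
LOOPS equ 1000000           ; hypothetical iteration count

    push ebx                ; cpuid overwrites ebx
    xor eax, eax
    cpuid                   ; serialize before reading the counter
    rdtsc
    mov esi, eax            ; start count, low 32 bits
    mov edi, LOOPS
@@:
    invoke Vec_Sub, addr vDest, addr vA, addr vB
    dec edi
    jnz @B
    xor eax, eax
    cpuid                   ; serialize again
    rdtsc
    sub eax, esi            ; elapsed cycles, valid while the total stays below 2^32
    pop ebx
```

Dividing eax by LOOPS gives an approximate per-call figure. Running the same loop with Vec_SubSSE gives the number to compare against.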
Farabi
Re: SSE VS FPU
« Reply #2 on: February 27, 2012, 12:46:36 PM »

You'll be surprised: mul took 1 ms and fmul 913 ms. If OpenGL uses a lot of FPU or SSE2 code, I think I can beat it. I'm still gathering information on what they have done.
oex
Re: SSE VS FPU
« Reply #3 on: February 27, 2012, 12:50:56 PM »

Simply be aware that SSE is less supported....

I think SSE2 is now supported by most users' computers, but good programming practice means checking for the relevant instruction support at the start of your application....

Recently I checked out OceanJeff's fireworks demo.... It worked on my AMD but not on my far newer Intel....

PS. Use MichaelW's timing script; your timings look a little shaky. Have a look at the Laboratory test pieces to understand how to better time your code....
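
A minimal sketch of the start-up check mentioned above, assuming the CPU supports the CPUID instruction itself. HasSSE2 is a hypothetical helper name, not something from the thread or from a library.

```asm
.686
.model flat, stdcall

.code
; Hypothetical helper: returns eax = 1 if CPUID reports SSE2, else 0.
HasSSE2 proc uses ebx       ; cpuid overwrites ebx
    mov eax, 1              ; leaf 1: feature flags
    cpuid
    xor eax, eax
    bt edx, 26              ; EDX bit 26 = SSE2 (bit 25 = SSE)
    setc al                 ; al = 1 if the bit was set
    ret
HasSSE2 endp
```

A program could call this once at start-up and fall back to the FPU version of a routine when it returns 0.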

We are all of us insane, just to varying degrees and intelligently balanced through networking

http://www.hereford.tv
jj2007
Re: SSE VS FPU
« Reply #4 on: February 27, 2012, 01:15:25 PM »

Quote
You'll be surprised: mul took 1 ms and fmul 913 ms.

It means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.

Code:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
194     cycles for 100*fmul
179     cycles for 100*mulsd

13      cycles for Vec_SubSSE
11      cycles for Vec_Sub
193     cycles for 100*fmul
180     cycles for 100*mulsd

* VertexTimings.zip (3.29 KB - downloaded 361 times.)

oex
Re: SSE VS FPU
« Reply #5 on: February 27, 2012, 02:41:23 PM »

Code:
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5       cycles for Vec_SubSSE
12      cycles for Vec_Sub
153     cycles for 100*fmul
587     cycles for 100*mulsd

9       cycles for Vec_SubSSE
15      cycles for Vec_Sub
154     cycles for 100*fmul
602     cycles for 100*mulsd
dancho
Re: SSE VS FPU
« Reply #6 on: February 27, 2012, 03:42:06 PM »

you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...
jj2007
Re: SSE VS FPU
« Reply #7 on: February 27, 2012, 04:27:24 PM »

Quote
you are using unaligned data with movups ( slow ),
try with aligned,
movaps ( should be faster ) ...

Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

dancho
Re: SSE VS FPU
« Reply #8 on: February 27, 2012, 04:43:05 PM »

Quote
Yes, and you get a 50% chance for an access violation. Unless you set up the whole project differently...

yes, of course,
the data must be aligned on a 16-byte boundary...
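
For statically declared data, one way to get that alignment is an align directive in the data section. This is a sketch, assuming the assembler's .data segment is itself at least 16-byte aligned; the names vA, vB and vDest are hypothetical, and each vector is padded to four floats so that the following one stays aligned.

```asm
.data
align 16                         ; next item starts on a 16-byte boundary
vA    real4 1.0, 2.0, 3.0, 0.0   ; x, y, z plus one padding float
vB    real4 0.5, 0.5, 0.5, 0.0
vDest real4 4 dup(0.0)

.code
    ; fragment: movaps and subps with a memory operand fault
    ; if the address is not a multiple of 16
    mov ebx, offset vA
    mov ecx, offset vB
    mov eax, offset vDest
    movaps xmm0, [ebx]
    subps  xmm0, [ecx]
    movaps [eax], xmm0
```

Data on the stack or from a heap allocator does not automatically get this alignment, so it would need its own handling.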
qWord
Re: SSE VS FPU
« Reply #9 on: February 27, 2012, 04:55:23 PM »

With proper alignment and code I get the following results:
Code:
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
193     cycles for 100*fmul
630     cycles for 100*mulsd

4       cycles for Vec_SubSSE
8       cycles for Vec_Sub
202     cycles for 100*fmul
619     cycles for 100*mulsd

Code:
movaps xmm0,[ebx]
subps xmm0,[ecx]
movaps [eax],xmm0
SSEx was introduced for exactly such tasks; all you need to do is set up the right conditions. Also, implementing these functions (6 instructions!) as macros is highly recommended.
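
One possible shape for such a macro, as a sketch only: Vec_SubM and its parameter names are hypothetical, the three arguments are assumed to be registers holding 16-byte-aligned pointers, and xmm0 is overwritten.

```asm
; Hypothetical macro version of the aligned subtraction above.
Vec_SubM macro pDest:req, pA:req, pB:req
    movaps xmm0, [pA]       ; load four floats from A
    subps  xmm0, [pB]       ; subtract four floats of B
    movaps [pDest], xmm0    ; store four floats to the destination
endm

    ; usage, with the register assignment from the procedures in the first post:
    Vec_SubM eax, ebx, ecx
```

The macro avoids the call, the stack frame and the three pointer loads of the procedure version. Note that it always reads and writes 16 bytes, so if VERTEX holds only x, y and z, each vertex needs 4 bytes of padding after z.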


EDIT: @jj, there is a small bug in the mulsd test: movlps should be movsd.
Then I get:

Quote
3       cycles for Vec_SubSSE
8       cycles for Vec_Sub
167     cycles for 100*fmul
174     cycles for 100*mulsd

3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
204     cycles for 100*fmul
196     cycles for 100*mulsd

3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
180     cycles for 100*fmul
176     cycles for 100*mulsd

4       cycles for Vec_SubSSE
7       cycles for Vec_Sub
173     cycles for 100*fmul
175     cycles for 100*mulsd

FPU in a trice: SmplMath
It's that simple!
jj2007
Re: SSE VS FPU
« Reply #10 on: February 27, 2012, 06:51:42 PM »

With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:
Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
195     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd

9       cycles for Vec_SubSSE
11      cycles for Vec_Sub
196     cycles for 100*fmul
195     cycles for 100*mulsd movlps
195     cycles for 100*mulsd movsd

P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic

* VertexTimings2.zip (3.56 KB - downloaded 359 times.)

qWord
Re: SSE VS FPU
« Reply #11 on: February 27, 2012, 07:06:58 PM »

Quote
Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
3       cycles for Vec_SubSSE
7       cycles for Vec_Sub
168     cycles for 100*fmul
590     cycles for 100*mulsd movlps ;(ps<>sd)
216     cycles for 100*mulsd movsd

3       cycles for Vec_SubSSE
8       cycles for Vec_Sub
164     cycles for 100*fmul
617     cycles for 100*mulsd movlps ;(ps<>sd)
172     cycles for 100*mulsd movsd


--- ok ---
... and again, it is nice to see that the FPU is still on an equal footing with SSEx
Farabi
Re: SSE VS FPU
« Reply #12 on: February 29, 2012, 07:51:56 AM »

It seems SSE is only faster on Intel processors.
oex
Re: SSE VS FPU
« Reply #13 on: February 29, 2012, 11:04:25 AM »

Quote
It seems SSE is only faster on Intel processors.

Different instructions, different processors.... One SSE MemCopy function I used on my old AMD was over 2x faster than movsd....
Farabi
Re: SSE VS FPU
« Reply #14 on: March 03, 2012, 10:01:42 AM »

Oh, I forgot: what I meant by mul is the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for floating point?