Topic: SSE VS FPU (Read 34310 times)

Farabi
I don't see any big difference; it is only simpler.

Vec_SubSSE proc uses ebx DestVec:dword, A:dword, B:dword
    mov eax, DestVec
    mov ebx, A
    mov ecx, B
    ; FPU version, commented out:
    ; fld dword ptr [ebx + VERTEX.x]
    ; fsub dword ptr [ecx + VERTEX.x]
    ; fstp dword ptr [eax + VERTEX.x]
    ;
    ; fld dword ptr [ebx + VERTEX.y]
    ; fsub dword ptr [ecx + VERTEX.y]
    ; fstp dword ptr [eax + VERTEX.y]
    ;
    ; fld dword ptr [ebx + VERTEX.z]
    ; fsub dword ptr [ecx + VERTEX.z]
    ; fstp dword ptr [eax + VERTEX.z]
    movups xmm0,[ebx]       ; load 16 bytes (four floats) from A
    movups xmm1,[ecx]       ; load 16 bytes (four floats) from B
    subps xmm0,xmm1         ; subtract all four floats at once
    movups [eax],xmm0       ; store 16 bytes to DestVec
    ret
Vec_SubSSE endp

Vec_Sub proc uses ebx DestVec:dword, A:dword, B:dword
    mov eax, DestVec
    mov ebx, A
    mov ecx, B
    fld dword ptr [ebx + VERTEX.x]
    fsub dword ptr [ecx + VERTEX.x]
    fstp dword ptr [eax + VERTEX.x]

    fld dword ptr [ebx + VERTEX.y]
    fsub dword ptr [ecx + VERTEX.y]
    fstp dword ptr [eax + VERTEX.y]

    fld dword ptr [ebx + VERTEX.z]
    fsub dword ptr [ecx + VERTEX.z]
    fstp dword ptr [eax + VERTEX.z]
    ret
Vec_Sub endp

dedndave
If you don't see a difference, then the code may not be executed as often as you think. The SSE code above should be a few times faster than the FPU code. You may need to formulate a proper test to see the difference. It's likely that most of the time is consumed elsewhere, making it hard to see a change.

Farabi
You'll be surprised: mul took 1 ms and fmul 913 ms. If OpenGL relies heavily on the FPU or SSE2, I think I can beat it. I'm still gathering information on what they have done.

oex
Just be aware that SSE is less widely supported than the FPU.
I think SSE2 is supported on most users' computers now, but good programming practice means checking for the relevant instruction support at the start of your application.
Recently I checked out OceanJeff's fireworks demo. It worked on my AMD but not on my far newer Intel.
PS. Use MichaelW's timing script; your timings look a little shaky. Have a look at the Laboratory test pieces to see how to time your code better.
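
The start-up check mentioned here could look roughly like the following in MASM. This is a minimal sketch, not code from the thread: the procedure name HasSSE2 is made up for illustration, and it assumes a CPU that supports the CPUID instruction at all. CPUID function 1 reports SSE in bit 25 and SSE2 in bit 26 of EDX.

```asm
.686
.model flat, stdcall
option casemap:none

.code
; HasSSE2 (hypothetical name): returns eax = 1 if SSE2 is available, else 0
HasSSE2 proc uses ebx        ; cpuid overwrites ebx, so preserve it
    mov   eax, 1             ; CPUID function 1: feature flags
    cpuid
    xor   eax, eax           ; clear the return value
    bt    edx, 26            ; copy EDX bit 26 (SSE2) into the carry flag
    setc  al                 ; eax = 1 when the bit was set
    ret
HasSSE2 endp
end
```

A caller would presumably run this once at start-up and fall back to the FPU path when eax comes back 0.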
We are all of us insane, just to varying degrees and intelligently balanced through networking http://www.hereford.tv

jj2007
Quote from: Farabi
"You'll be surprised: mul took 1 ms and fmul 913 ms."

It means there is a bug in the FPU code, probably an exception. Fmul is only marginally slower than mulsd, see below. And your Vec_SubSSE is even a bit slower than the FPU version.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
194 cycles for 100*fmul
179 cycles for 100*mulsd

13 cycles for Vec_SubSSE
11 cycles for Vec_Sub
193 cycles for 100*fmul
180 cycles for 100*mulsd

oex
Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz (SSE4)
5 cycles for Vec_SubSSE
12 cycles for Vec_Sub
153 cycles for 100*fmul
587 cycles for 100*mulsd

9 cycles for Vec_SubSSE
15 cycles for Vec_Sub
154 cycles for 100*fmul
602 cycles for 100*mulsd

dancho
You are using unaligned data with movups (slow). Try aligned data with movaps (should be faster).

jj2007
Quote from: dancho
"You are using unaligned data with movups (slow). Try aligned data with movaps (should be faster)."

Yes, and you get a 50% chance of an access violation, unless you set up the whole project differently...

dancho
Quote from: jj2007
"Yes, and you get a 50% chance of an access violation, unless you set up the whole project differently..."

Yes, of course, the data must be aligned on a 16-byte boundary.
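
One way to avoid the access violation is to test the address before choosing the load instruction. This is only a sketch, not code from the thread: it assumes the pointer is already in eax, and the label names are made up for illustration. An address is 16-byte aligned exactly when its low four bits are zero.

```asm
    ; assumes eax holds the address of four packed floats
    test   eax, 15           ; low four bits all zero => 16-byte aligned
    jnz    load_unaligned
    movaps xmm0, [eax]       ; aligned load: faults on a misaligned address
    jmp    load_done
load_unaligned:
    movups xmm0, [eax]       ; unaligned load: works for any address
load_done:
```

In practice it is usually simpler to guarantee the alignment once, for example with an align 16 directive in front of static data, and then use movaps unconditionally.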

qWord
With proper alignment and code I get the following results:

Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
193 cycles for 100*fmul
630 cycles for 100*mulsd

4 cycles for Vec_SubSSE
8 cycles for Vec_Sub
202 cycles for 100*fmul
619 cycles for 100*mulsd

    movaps xmm0,[ebx]
    subps xmm0,[ecx]
    movaps [eax],xmm0

SSEx was introduced for exactly such tasks; all you need to do is set up the right conditions. Also, implementing this function (6 instructions!) as a macro is highly recommended.

EDIT: @jj, there is a small bug in the mulsd test: movlps should be movsd. With that change I get:

3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
167 cycles for 100*fmul
174 cycles for 100*mulsd

3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
204 cycles for 100*fmul
196 cycles for 100*mulsd

3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
180 cycles for 100*fmul
176 cycles for 100*mulsd

4 cycles for Vec_SubSSE
7 cycles for Vec_Sub
173 cycles for 100*fmul
175 cycles for 100*mulsd
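
The macro suggestion could be sketched as follows. The macro name Vec_SubM and its parameter names are invented here for illustration, and the sketch assumes each argument is a register holding a 16-byte aligned address. It also overwrites xmm0.

```asm
; Vec_SubM (hypothetical): [dest] = [srcA] - [srcB], four packed floats
; All three arguments must be registers holding 16-byte aligned addresses.
Vec_SubM MACRO dest, srcA, srcB
    movaps xmm0, [srcA]      ; load four floats from the first vector
    subps  xmm0, [srcB]      ; subtract the second vector element-wise
    movaps [dest], xmm0      ; store the four results
ENDM

    ; usage, with the addresses already in registers:
    Vec_SubM eax, ebx, ecx
```

Expanding the three instructions inline avoids the call, the stack frame, and the argument loads, which for a routine this short can cost as much as the subtraction itself.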
FPU in a trice: SmplMath. It's that simple!

jj2007
With all those "improvements" the SSE2 code gets, wow, as fast as the FPU:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
195 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd

9 cycles for Vec_SubSSE
11 cycles for Vec_Sub
196 cycles for 100*fmul
195 cycles for 100*mulsd movlps
195 cycles for 100*mulsd movsd

P.S.: I am a fan of SSE2 - there is a lot of it in MasmBasic

qWord
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
3 cycles for Vec_SubSSE
7 cycles for Vec_Sub
168 cycles for 100*fmul
590 cycles for 100*mulsd movlps ;(ps<>sd)
216 cycles for 100*mulsd movsd

3 cycles for Vec_SubSSE
8 cycles for Vec_Sub
164 cycles for 100*fmul
617 cycles for 100*mulsd movlps ;(ps<>sd)
172 cycles for 100*mulsd movsd

--- ok ---

... and again, it is nice to see that the FPU is still on an equal footing with SSEx

Farabi
It seems SSE is only faster on Intel processors.

oex
Quote from: Farabi
"It seems SSE is only faster on Intel processors."

Different instructions, different processors. One SSE MemCopy function I used on my old AMD was over twice as fast as movsd.

Farabi
Oh, I forgot: by mul I mean the integer x86 mul, not SSE or FPU. It is far faster than the floating-point version. Why not use div and mul as substitutes for the floating-point instructions?