Pages: [1] 2 3 ... 9




Topic: Suggestions and improvements for SSE2 code are welcome (Read 61633 times)

Gunther
Member
Gender:
Posts: 184
Assembly Language Dinosaur

The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277

I've finished my little test suite and attached the appropriate archive. The application runs under the following 32-bit operating systems: Windows, Linux, FreeBSD and Intel-based Mac OS X. The archive includes all source code and the compiled Win32 program, for users who haven't installed GCC at the moment. Also included is a batch file to build the application from the sources; the shell script does the same for Unix users (there shouldn't be a problem, because GCC is installed by default as the system compiler). The readme.txt contains a short description of every file. The source code is well commented and should be self-explanatory.

The software is in an experimental stage; nothing is final or the last word. I've coded a special case where the array size is divisible by 16, only to show the principle. A more generic implementation has to check that, of course. The program does not do much error handling, but it checks SSE2 support at run time, so your machine won't crash if it's not available.

Every piece of advice and help to improve the software is welcome. The same goes for testing the program with other processors and environments (please have a look into the results.pdf file).

Gunther




Forgive your enemies, but never forget their names.



jj2007

From DotAsm.asm:

ALIGN 16
.loop:
    movaps xmm0,[eax+edx]  ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
    mulps xmm0,[edx]       ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
    addps xmm7,xmm0        ;sum up
    lea edx,[edx+16]       ;update pointers
    sub ecx,byte 16        ;count down
    jnz .loop
Well done... and challenging. There is not much to improve at first sight.







Gunther

Hallo jj2007,

"Well done... and challenging." I hope it's well done. But challenging? To be honest, I've been filing away at these procedures for a few days, so I used some so-called dirty assembly language tricks, not more.

"There is not much to improve at first sight." I'm not sure. As I wrote in the cited Campus thread: that function is called over 30 million times, given a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large, either 256 or 64 or 16 elements, depending on the partition depth of the algorithm. The trickier thing is that the dot product must be calculated not once, but eight times. That has to do with rotations and mirroring of the data (not on the screen, only in memory). So every saved nanosecond counts, because these nanoseconds sum up to microseconds, the microseconds sum up to milliseconds, etc. That makes the speed difference.

I'm thinking about some fastcall, because passing parameters via registers is faster. But here comes the bad news: with the release of GCC 4.3, the GNU world changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc4.3/changes.html "Fastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers." Very nice. And why, and for what, should one use fastcall?

On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I have to test whether haddps is better; but that's an SSE3 instruction. A lot of questions ...

By the way, did you run the program? Do you have some timings? Thank you.

Gunther







jj2007

Hi Gunther,

Timings are easy to get:

Straight forward C++ implementation (FPU Code):
Dot Product 1 = 2867507200.00 Elapsed Time = 22.20 Seconds
C++ implementation (FPU Code - 2 Accumulators):
Dot Product 2 = 2867507200.00 Elapsed Time = 15.59 Seconds
C++ Code with intrinsics (Original Code by Drizz):
Dot Product 3 = 2867507200.00 Elapsed Time = 6.89 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
Dot Product 4 = 2867507200.00 Elapsed Time = 8.41 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
Dot Product 5 = 2867507200.00 Elapsed Time = 7.20 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
Dot Product 6 = 2867507200.00 Elapsed Time = 6.67 Seconds

As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.

Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function apparently returns only a REAL4 value on the FPU:

fld dword [esp] ;load function result

Is there no room for parallelisation? Apologies if that is a dumb question; as I said, I am not an SSE2 geek...







Gunther

Thank you for the result. It's surprising; the intrinsic code is very fast. What's your environment?

"I see drizz is already involved." Yes, he made the excellent intrinsic code, so I gave credit.

"Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function apparently returns only a REAL4 value on the FPU." The scalar product is by definition a real number. Given two vectors A and B of dimension n, we have:

Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n-1]*B[n-1]

Please let me think a little while about parallelisation.

Gunther







jj2007

Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?
Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 GHz, 1 GB RAM.







Gunther

The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.
Gunther







dioxin

It's usually faster for SSE to access memory in contiguous blocks, so don't access the vectors the way you do; change the order to something like this:

.loop:
    movaps xmm0,[eax+edx]    ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
    movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
    movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
    movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]

    mulps xmm0,[edx]         ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
    mulps xmm1,[edx+16]      ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
    mulps xmm2,[edx+32]      ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
    mulps xmm3,[edx+48]      ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]

    addps xmm4,xmm0          ;sum up
    addps xmm5,xmm1          ;sum up
    addps xmm4,xmm2          ;sum up
    addps xmm5,xmm3          ;sum up

    lea edx,[edx+64]         ;update pointers
    sub ecx,byte 64          ;count down
    jnz .loop

Paul.







Gunther

Thank you, Paul. I'll check this and make another function. Rearranging the loop instructions could be an improvement.
Gunther







dioxin

Phenom II x4, 3 GHz.

Straight forward C++ implementation (FPU Code):
Dot Product 1 = 2867507200.00 Elapsed Time = 13.70 Seconds
C++ implementation (FPU Code - 2 Accumulators):
Dot Product 2 = 2867507200.00 Elapsed Time = 6.89 Seconds
C++ Code with intrinsics (Original Code by Drizz):
Dot Product 3 = 2867507200.00 Elapsed Time = 3.47 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
Dot Product 4 = 2867507200.00 Elapsed Time = 3.45 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
Dot Product 5 = 2867507200.00 Elapsed Time = 1.77 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
Dot Product 6 = 2867507200.00 Elapsed Time = 1.78 Seconds
Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
Dot Product 7 = 2867507200.00 Elapsed Time = 1.14 Seconds


« Last Edit: August 27, 2010, 11:54:50 AM by dioxin »





mineiro

dotfloat.exe > result.txt

Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)

Supported by Processor and installed Operating System:
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations. That'll take a little while ...
Straight forward C++ implementation (FPU Code):
Dot Product 1 = 2867507200.00 Elapsed Time = 17.20 Seconds
C++ implementation (FPU Code - 2 Accumulators):
Dot Product 2 = 2867507200.00 Elapsed Time = 11.52 Seconds
C++ Code with intrinsics (Original Code by Drizz):
Dot Product 3 = 2867507200.00 Elapsed Time = 4.28 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
Dot Product 4 = 2867507200.00 Elapsed Time = 4.38 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
Dot Product 5 = 2867507200.00 Elapsed Time = 3.61 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
Dot Product 6 = 2867507200.00 Elapsed Time = 3.28 Seconds
regards.







jj2007

Prescott P4, Win XP SP2:

Straight forward C++ implementation (FPU Code):
Dot Product 1 = 2867507200.00 Elapsed Time = 19.77 Seconds
C++ implementation (FPU Code - 2 Accumulators):
Dot Product 2 = 2867507200.00 Elapsed Time = 9.67 Seconds
C++ Code with intrinsics (Original Code by Drizz):
Dot Product 3 = 2867507200.00 Elapsed Time = 4.11 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
Dot Product 4 = 2867507200.00 Elapsed Time = 3.96 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
Dot Product 5 = 2867507200.00 Elapsed Time = 2.53 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
Dot Product 6 = 2867507200.00 Elapsed Time = 2.46 Seconds







hutch
Administrator
Member
Posts: 12013
Mnemonic Driven API Grinder

Gunther,

One request on benchmarks: put a keyboard pause at the end so it does not have to be downloaded to run it. It ran from the browser but closed before I could save the results. These timings are on a 3 GHz Core2 Quad.

Supported by Processor and installed Operating System:
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations. That'll take a little while ...

Straight forward C++ implementation (FPU Code):
Dot Product 1 = 2867507200.00 Elapsed Time = 10.31 Seconds
C++ implementation (FPU Code - 2 Accumulators):
Dot Product 2 = 2867507200.00 Elapsed Time = 6.89 Seconds
C++ Code with intrinsics (Original Code by Drizz):
Dot Product 3 = 2867507200.00 Elapsed Time = 2.58 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
Dot Product 4 = 2867507200.00 Elapsed Time = 2.62 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
Dot Product 5 = 2867507200.00 Elapsed Time = 2.16 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
Dot Product 6 = 2867507200.00 Elapsed Time = 1.97 Seconds







jj2007

I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53 cycles for DotXMM1Acc4E  <<< this value too high, Prescott P4 behaves badly in such testbeds
31 cycles for DotXMM1Acc4E  <<< 31 seems to be the correct value
31 cycles for DotXMM1Acc4E
32 cycles for DotXMM1Acc4E
31 cycles for DotXMM1Acc4E







daydreamer

Why not make it support multiple cores by processing in multiple threads?











