Suggestions and improvements for SSE2 code are welcome

Gunther · August 26, 2010, 05:20:06 PM

The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277

I've finished my little test suite and attached the appropriate archive. The application will run under the following 32bit operating systems: Windows, Linux, FreeBSD and Intel based Mac OS X. The archive includes all source code and the running Win32 program, for users, which haven't installed GCC at the moment. Also included is a batch file, to build the application from the sources. The shell script is for Unix users and does the same (there shouldn't be a problem, because GCC is installed by default as the system compiler). A short description of every file contains the readme.txt. The source code is well commented and should be self explanatory.

The software is in experimental stage - nothing is final or the last word. I've coded a special case, where the array size is divisible by 16, only to show the principle. A more generic implementation has to check that, of course. The program does not much error handling, but it checks SSE2 support during run time. So your machine won't crash, if it's not available. Every advice and help to improve the software is welcome. It's the same with testing the program with other processors and environments (please, have a look into the results.pdf file).

Gunther

jj2007 · August 26, 2010, 06:32:46 PM

From DotAsm.asm:

Code Select

		ALIGN 16
.loop:	movaps xmm0,[eax+edx]	;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
	mulps xmm0,[edx]	;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
	addps xmm7,xmm0	;sum up
	lea edx,[edx+16]	;update pointers
	sub ecx,byte 16	;count down
	jnz .loop

Well done... and challenging. There is not much too improve at first sight ::)

Gunther · August 26, 2010, 09:59:29 PM

Hallo jj2007,

Quote from: jj2007 Today at 07:32:46 PMWell done... and challenging.

I hope it's well done. But challenging? To be honest, I file away at these procedures since a few days; so I used some so called dirty assembly language tricks, not more.

Quote from: jj2007 Today at 07:32:46 PMThere is not much too improve at first sight

I'm not sure. As I wrote in the cited Campus thread: That function is called over 30 million times, given a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large, either 256 or 64 or 16 elements, depending at the partition depth of the algorithm. The more tricky thing is, that the dot product must not be calculated once, but eight times. That has to do with rotations and mirroring of data (not on the screen, only in the memory). So, every saved nansecond counts, because these nanoseconds sum up to microseconds, the microseconds sum up to milliseconds ... etc. That makes the speed difference.

I'm thinking about some fastcall, because passing parameters via registers is faster. But, here comes the bad news: With the release of GCC 4.3., the GNU world was changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc-4.3/changes.html.

Quote from: GCC 4.3 Release Series, Changes, New Features, and FixesFastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers.

Very nice. And why, and for what should one use fastcall?

On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I've to test, if haddps is better; but that's a SSE3 instruction. A lot of questions ... By the way, did you run the program? Do you have some timings? Thank you.

Gunther

jj2007 · August 26, 2010, 10:22:05 PM

Hi Gunther,
Timings are easy to get: :bg

Code Select

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.59 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.41 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.20 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.67 Seconds

As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result
Is there no room for parallelisation? Apologies if that is a dumb question, as I said: I am not a geek for SSE2....

Gunther · August 26, 2010, 10:58:42 PM

Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Quote from: jj2007 Today at 11:22:05 PMI see drizz is already involved.

Yes, he made the excellent intrinsic code, so I gave credit.

Quote from: jj2007 Today at 11:22:05 PMQuestion: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result

The scalar product is by definition a real number. Given 2 vectors A and B of the dimension n, we have:

Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n]*B[n]

Please, let me think a little while about parallelisation.

Gunther

jj2007 · August 26, 2010, 11:01:52 PM

Quote from: Gunther on August 26, 2010, 10:58:42 PM
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 Ghz, 1G RAM.

Gunther · August 26, 2010, 11:42:03 PM

The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.

Gunther

dioxin · August 27, 2010, 01:05:42 AM

It's usually faster for SSE to access memory in contiguous blocks so don't access the vectors the way you do, change the order something like this:

Code Select

.loop:
		movaps  xmm1,[eax+edx+16]			;mm1 = X[i+7] X[i+6] X[i+5] X[i+4]
		movaps  xmm2,[eax+edx+32]			;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
		movaps  xmm3,[eax+edx+48]			;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
		movaps  xmm0,[eax+edx+64]			;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]


		mulps   xmm0,[edx]				;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
		mulps   xmm1,[edx+16]				;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
		mulps   xmm2,[edx+32]				;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
		mulps   xmm3,[edx+48]				;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]


		addps   xmm4,xmm0				;sum up
		addps   xmm5,xmm1				;sum up
		addps   xmm4,xmm2				;sum up
		addps   xmm5,xmm3				;sum up

		lea     edx,[edx+64]				;update pointers
		sub     ecx,byte 64				;count down
		jnz	.loop

Paul.

Gunther · August 27, 2010, 01:34:00 AM

Thank you Paul. I'll check this and make another function. Re-arranging the loop instructions could be an improvement.

Gunther

dioxin · August 27, 2010, 01:54:26 AM

Phenom II x4 3GHz.

Code Select

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.45 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.14 Seconds

mineiro · August 27, 2010, 02:09:36 AM

Code Select

dotfloat.exe > result.txt
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)

Code Select


Supported by Processor and installed Operating System:
------------------------------------------------------
   Pentium 4 Instruction Set,
 + FPU (floating point unit) on chip,
 + support of FXSAVE and FXRSTOR,
 + 57 MMX Instructions,
 + 70 SSE (Katmai) Instructions,
 + 144 SSE2 (Willamette) Instructions,
 + 13 SSE3 (Prescott) Instructions,
 + HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time  = 17.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time  = 11.52 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time  = 4.28 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time  = 4.38 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time  = 3.61 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 3.28 Seconds

regards.

jj2007 · August 27, 2010, 07:39:30 AM

Prescott P4, Win XP SP2:

Code Select

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 19.77 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 9.67 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.96 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.46 Seconds

hutch-- · August 27, 2010, 07:41:57 AM

Gunther,

One request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it. It ran from the browser but closed before I could save the results.

These timings are on a 3 gig Core2 Quad.

Code Select


Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
 + FPU (floating point unit) on chip,
 + support of FXSAVE and FXRSTOR,
 + 57 MMX Instructions,
 + 70 SSE (Katmai) Instructions,
 + 144 SSE2 (Willamette) Instructions,
 + 13 SSE3 (Prescott) Instructions,
 + HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.16 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds

jj2007 · August 27, 2010, 08:26:55 AM

I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.

Code Select

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53      cycles for DotXMM1Acc4E <<< this value too high, Prescott P4 behaves badly in such testbeds
31      cycles for DotXMM1Acc4E <<< 31 seems to be the correct value
31      cycles for DotXMM1Acc4E
32      cycles for DotXMM1Acc4E
31      cycles for DotXMM1Acc4E

daydreamer · August 27, 2010, 08:53:15 AM

why not make it support multicores with process in multiple threads?

News:

Suggestions and improvements for SSE2 code are welcome