News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Previous topic - Next topic

Gunther

The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277

I've finished my little test suite and attached the appropriate archive. The application will run under the following 32bit operating systems: Windows, Linux, FreeBSD and Intel based Mac OS X. The archive includes all source code and the running Win32 program, for users, which haven't installed GCC at the moment. Also included is a batch file, to build the application from the sources. The shell script is for Unix users and does the same (there shouldn't be a problem, because GCC is installed by default as the system compiler). A short description of every file contains the readme.txt. The source code is well commented and should be self explanatory.

The software is in experimental stage - nothing is final or the last word. I've coded a special case, where the array size is divisible by 16, only to show the principle. A more generic implementation has to check that, of course. The program does not much error handling, but it checks SSE2 support during run time. So your machine won't crash, if it's not available. Every advice and help to improve the software is welcome. It's the same with testing the program with other processors and environments (please, have a look into the results.pdf file).

Gunther
Forgive your enemies, but never forget their names.

jj2007

From DotAsm.asm:
ALIGN 16
.loop: movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
addps xmm7,xmm0 ;sum up
lea edx,[edx+16] ;update pointers
sub ecx,byte 16 ;count down
jnz .loop


Well done... and challenging. There is not much too improve at first sight ::)

Gunther

Hallo jj2007,

Quote from: jj2007 Today at 07:32:46 PMWell done... and challenging.

I hope it's well done. But challenging? To be honest, I file away at these procedures since a few days; so I used some so called dirty assembly language tricks, not more.

Quote from: jj2007 Today at 07:32:46 PMThere is not much too improve at first sight

I'm not sure. As I wrote in the cited Campus thread: That function is called over 30 million times, given a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large, either 256 or 64 or 16 elements, depending at the partition depth of the algorithm. The more tricky thing is, that the dot product must not be calculated once, but eight times. That has to do with rotations and mirroring of data (not on the screen, only in the memory). So, every saved nansecond counts, because these nanoseconds sum up to microseconds, the microseconds sum up to milliseconds ... etc. That makes the speed difference.

I'm thinking about some fastcall, because passing parameters via registers is faster. But, here comes the bad news: With the release of GCC 4.3., the GNU world was changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc-4.3/changes.html.

Quote from: GCC 4.3 Release Series, Changes, New Features, and FixesFastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers.

Very nice. And why, and for what should one use fastcall?

On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I've to test, if haddps is better; but that's a SSE3 instruction. A lot of questions ... By the way, did you run the program? Do you have some timings? Thank you.

Gunther 
Forgive your enemies, but never forget their names.

jj2007

Hi Gunther,
Timings are easy to get: :bg
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.59 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.41 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.20 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.67 Seconds


As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result
Is there no room for parallelisation? Apologies if that is a dumb question, as I said: I am not a geek for SSE2....

Gunther

Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Quote from: jj2007 Today at 11:22:05 PMI see drizz is already involved.

Yes, he made the excellent intrinsic code, so I gave credit.

Quote from: jj2007 Today at 11:22:05 PMQuestion: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result

The scalar product is by definition a real number. Given 2 vectors A and B of the dimension n, we have:

Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n]*B[n]

Please, let me think a little while about parallelisation.

Gunther
Forgive your enemies, but never forget their names.

jj2007

Quote from: Gunther on August 26, 2010, 10:58:42 PM
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 Ghz, 1G RAM.

Gunther

The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.

Gunther
Forgive your enemies, but never forget their names.

dioxin

It's usually faster for SSE to access memory in contiguous blocks so don't access the vectors the way you do, change the order something like this:
.loop:
movaps  xmm1,[eax+edx+16] ;mm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
movaps  xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]


mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]


addps   xmm4,xmm0 ;sum up
addps   xmm5,xmm1 ;sum up
addps   xmm4,xmm2 ;sum up
addps   xmm5,xmm3 ;sum up

lea     edx,[edx+64] ;update pointers
sub     ecx,byte 64 ;count down
jnz .loop


Paul.


Gunther

Thank you Paul. I'll check this and make another function. Re-arranging the loop instructions could be an improvement.

Gunther
Forgive your enemies, but never forget their names.

dioxin

Phenom II x4 3GHz.


Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.45 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.14 Seconds

mineiro

dotfloat.exe > result.txt
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

Supported by Processor and installed Operating System:
------------------------------------------------------
   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time  = 17.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time  = 11.52 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time  = 4.28 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time  = 4.38 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time  = 3.61 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 3.28 Seconds

regards.

jj2007

Prescott P4, Win XP SP2:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 19.77 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 9.67 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.96 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.46 Seconds

hutch--

Gunther,

One request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it. It ran from the browser but closed before I could save the results.

These timings are on a 3 gig Core2 Quad.


Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.16 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53      cycles for DotXMM1Acc4E <<< this value too high, Prescott P4 behaves badly in such testbeds
31      cycles for DotXMM1Acc4E <<< 31 seems to be the correct value
31      cycles for DotXMM1Acc4E
32      cycles for DotXMM1Acc4E
31      cycles for DotXMM1Acc4E


daydreamer

why not make it support multicores with process in multiple threads?