The MASM Forum Archive 2004 to 2012
Welcome, Guest. Please login or register.
October 23, 2021, 05:18:09 PM

Login with username, password and session length
Search:     Advanced search
128553 Posts in 15254 Topics by 684 Members
Latest Member: mottt
* Home Help Search Login Register
+  The MASM Forum Archive 2004 to 2012
|-+  General Forums
| |-+  The Laboratory (Moderator: Mark_Larson)
| | |-+  Suggestions and improvements for SSE2 code are welcome
« previous next »
Pages: [1] 2 3 ... 9 Print
Author Topic: Suggestions and improvements for SSE2 code are welcome  (Read 61633 times)
Gunther
Member
*****
Gender: Male
Posts: 184


Assembly Language Dinosaur


Suggestions and improvements for SSE2 code are welcome
« on: August 26, 2010, 05:20:06 PM »

The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277

I've finished my little test suite and attached the appropriate archive. The application will run under the following 32bit operating systems: Windows, Linux, FreeBSD and Intel based Mac OS X. The archive includes all source code and the running Win32 program, for users, which haven't installed GCC at the moment. Also included is a batch file, to build the application from the sources. The shell script is for Unix users and does the same (there shouldn't be a problem, because GCC is installed by default as the system compiler). A short description of every file contains the readme.txt. The source code is well commented and should be self explanatory.

The software is in experimental stage - nothing is final or the last word. I've coded a special case, where the array size is divisible by 16, only to show the principle. A more generic implementation has to check that, of course. The program does not much error handling, but it checks SSE2 support during run time. So your machine won't crash, if it's not available. Every advice and help to improve the software is welcome. It's the same with testing the program with other processors and environments (please, have a look into the results.pdf file).

Gunther

* dotfloat.zip (164.81 KB - downloaded 362 times.)
Logged

Forgive your enemies, but never forget their names.
jj2007
Member
*****
Gender: Male
Posts: 6011



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #1 on: August 26, 2010, 06:32:46 PM »

From DotAsm.asm:
Code:
ALIGN 16
.loop: movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
addps xmm7,xmm0 ;sum up
lea edx,[edx+16] ;update pointers
sub ecx,byte 16 ;count down
jnz .loop

Well done... and challenging. There is not much too improve at first sight Roll Eyes
Logged

Gunther
Member
*****
Gender: Male
Posts: 184


Assembly Language Dinosaur


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #2 on: August 26, 2010, 09:59:29 PM »

Hallo jj2007,

Quote from: jj2007 Today at 07:32:46 PM
Well done... and challenging.

I hope it's well done. But challenging? To be honest, I file away at these procedures since a few days; so I used some so called dirty assembly language tricks, not more.

Quote from: jj2007 Today at 07:32:46 PM
There is not much too improve at first sight

I'm not sure. As I wrote in the cited Campus thread: That function is called over 30 million times, given a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large, either 256 or 64 or 16 elements, depending at the partition depth of the algorithm. The more tricky thing is, that the dot product must not be calculated once, but eight times. That has to do with rotations and mirroring of data (not on the screen, only in the memory). So, every saved nansecond counts, because these nanoseconds sum up to microseconds, the microseconds sum up to milliseconds ... etc. That makes the speed difference.

I'm thinking about some fastcall, because passing parameters via registers is faster. But, here comes the bad news: With the release of GCC 4.3., the GNU world was changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc-4.3/changes.html.

Quote from: GCC 4.3 Release Series, Changes, New Features, and Fixes
Fastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers.

Very nice. And why, and for what should one use fastcall?

On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I've to test, if haddps is better; but that's a SSE3 instruction. A lot of questions ... By the way, did you run the program? Do you have some timings? Thank you.

Gunther 
Logged

Forgive your enemies, but never forget their names.
jj2007
Member
*****
Gender: Male
Posts: 6011



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #3 on: August 26, 2010, 10:22:05 PM »

Hi Gunther,
Timings are easy to get: BigGrin
Code:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.59 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.41 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.20 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.67 Seconds

As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result
Is there no room for parallelisation? Apologies if that is a dumb question, as I said: I am not a geek for SSE2....
Logged

Gunther
Member
*****
Gender: Male
Posts: 184


Assembly Language Dinosaur


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #4 on: August 26, 2010, 10:58:42 PM »

Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Quote from: jj2007 Today at 11:22:05 PM
I see drizz is already involved.

Yes, he made the excellent intrinsic code, so I gave credit.

Quote from: jj2007 Today at 11:22:05 PM
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result

The scalar product is by definition a real number. Given 2 vectors A and B of the dimension n, we have:

Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n]*B[n]

Please, let me think a little while about parallelisation.

Gunther
Logged

Forgive your enemies, but never forget their names.
jj2007
Member
*****
Gender: Male
Posts: 6011



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #5 on: August 26, 2010, 11:01:52 PM »

Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 Ghz, 1G RAM.
Logged

Gunther
Member
*****
Gender: Male
Posts: 184


Assembly Language Dinosaur


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #6 on: August 26, 2010, 11:42:03 PM »

The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.

Gunther
Logged

Forgive your enemies, but never forget their names.
dioxin
Member
*****
Posts: 162


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #7 on: August 27, 2010, 01:05:42 AM »

It's usually faster for SSE to access memory in contiguous blocks so don't access the vectors the way you do, change the order something like this:
Code:
.loop:
movaps  xmm1,[eax+edx+16] ;mm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
movaps  xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]


mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]


addps   xmm4,xmm0 ;sum up
addps   xmm5,xmm1 ;sum up
addps   xmm4,xmm2 ;sum up
addps   xmm5,xmm3 ;sum up

lea     edx,[edx+64] ;update pointers
sub     ecx,byte 64 ;count down
jnz .loop

Paul.

Logged
Gunther
Member
*****
Gender: Male
Posts: 184


Assembly Language Dinosaur


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #8 on: August 27, 2010, 01:34:00 AM »

Thank you Paul. I'll check this and make another function. Re-arranging the loop instructions could be an improvement.

Gunther
Logged

Forgive your enemies, but never forget their names.
dioxin
Member
*****
Posts: 162


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #9 on: August 27, 2010, 01:54:26 AM »

Phenom II x4 3GHz.


Code:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.45 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.14 Seconds
« Last Edit: August 27, 2010, 11:54:50 AM by dioxin » Logged
mineiro
Member
*****
Posts: 253



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #10 on: August 27, 2010, 02:09:36 AM »

Code:
dotfloat.exe > result.txt
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)
Code:
Supported by Processor and installed Operating System:
------------------------------------------------------
   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time  = 17.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time  = 11.52 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time  = 4.28 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time  = 4.38 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time  = 3.61 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 3.28 Seconds
regards.
Logged
jj2007
Member
*****
Gender: Male
Posts: 6011



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #11 on: August 27, 2010, 07:39:30 AM »

Prescott P4, Win XP SP2:
Code:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 19.77 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 9.67 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.96 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.46 Seconds
Logged

hutch--
Administrator
Member
*****
Posts: 12013


Mnemonic Driven API Grinder


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #12 on: August 27, 2010, 07:41:57 AM »

Gunther,

One request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it. It ran from the browser but closed before I could save the results.

These timings are on a 3 gig Core2 Quad.

Code:
Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.16 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds
Logged

Regards,



Download site for MASM32
http://www.masm32.com
jj2007
Member
*****
Gender: Male
Posts: 6011



Re: Suggestions and improvements for SSE2 code are welcome
« Reply #13 on: August 27, 2010, 08:26:55 AM »

I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.

Code:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53      cycles for DotXMM1Acc4E <<< this value too high, Prescott P4 behaves badly in such testbeds
31      cycles for DotXMM1Acc4E <<< 31 seems to be the correct value
31      cycles for DotXMM1Acc4E
32      cycles for DotXMM1Acc4E
31      cycles for DotXMM1Acc4E

* DotProduct.zip (3.2 KB - downloaded 294 times.)
Logged

daydreamer
Member
*****
Posts: 616


Re: Suggestions and improvements for SSE2 code are welcome
« Reply #14 on: August 27, 2010, 08:53:15 AM »

why not make it support multicores with process in multiple threads?
Logged
Pages: [1] 2 3 ... 9 Print 
« previous next »
Jump to:  

Powered by MySQL Powered by PHP The MASM Forum Archive 2004 to 2012 | Powered by SMF 1.0.12.
© 2001-2005, Lewis Media. All Rights Reserved.
Valid XHTML 1.0! Valid CSS!