Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

dioxin

jj2007,
Quote: A variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.
AMD Phenom(tm) II X4 945 Processor (SSE3)
2202    cycles for DotXMM1Acc4E
2153    cycles for DotXMM1Acc4EJ1
2152    cycles for DotXMM1Acc4EJ2
2140    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2147    cycles for DotXMM1Acc4EJ2
2142    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2157    cycles for DotXMM1Acc4E
2155    cycles for DotXMM1Acc4EJ1
2155    cycles for DotXMM1Acc4EJ2
2139    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
815     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
80       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
102      bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
175      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul

--- done ---

jj2007

Quote from: dioxin on September 03, 2010, 12:40:23 AM
Timings from Gunther's latest post in Reply #72:

Celeron M:
Quote:
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.92 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.42 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.52 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.77 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 7.16 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 6.78 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 6.70 Seconds

EDIT: I have added a macro for testing the correctness of results.
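IsCorrect MACRO algo
  ; run the algo on the test vectors; the result comes back in ST(0)
  invoke &algo&, offset SrcX, offset SrcY, 2048
  fstp Res4                      ; store the result and pop the FPU stack
  Fcmp Res4, Expected            ; compare with the expected value
  .if !Zero?
    Print Str$("\nIncorrect result: %i for &algo&", Res4)
  .endif
ENDM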
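
For readers following the C++ test bed from reply #72, the same check can be sketched in C. This is only an illustration, not code from the thread: `dot_fn`, `dot_reference` and `is_correct` are hypothetical names. The exact comparison is only reliable when every partial sum is exactly representable as a float; otherwise variants with a different number of accumulators may round differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical signature shared by every dot-product variant under test. */
typedef float (*dot_fn)(const float *x, const float *y, size_t n);

/* Plain scalar reference: one accumulator, one element per iteration. */
float dot_reference(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Runs one variant and reports a mismatch, in the spirit of IsCorrect.
   Returns 1 if the result equals `expected`, 0 otherwise. */
int is_correct(const char *name, dot_fn algo,
               const float *x, const float *y, size_t n, float expected)
{
    assert(algo != NULL);
    float result = algo(x, y, n);
    if (result != expected) {
        printf("Incorrect result: %.2f for %s\n", result, name);
        return 0;
    }
    return 1;
}
```

A caller would invoke `is_correct` once per variant, passing the same input arrays and the same expected value each time.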

lingo

I have 2 notes about my algos in dotfloat_update.zip (from Gunther's latest post):
- I see  ALIGN 16  just before the  .loop:  label. That is unusual, because it costs us a few extra clocks.
- My last algo is not included either... :(

Gunther

Quote from: dioxin on September 03, 2010, 01:40:23 AM
Timings from Gunther's latest post in Reply #72:

Thank you, Paul. You see: your Phenom does a good job with 4 accumulators, while my Athlon did not.
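
The idea behind several accumulators is to split the sum into independent dependency chains so the CPU can overlap the additions. How much that helps depends on the processor, as the remark about the Phenom and the Athlon suggests. The following C sketch shows the "4 accumulators, 16 elements per loop cycle" shape with intrinsics. It is not Paul's code: the name `dot_4acc`, the unaligned loads and the requirement that `n` be a multiple of 16 are simplifying assumptions, and a scalar fallback is included for non-x86 compilers.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>

/* Dot product with 4 independent accumulators, 16 floats per iteration.
   Assumption: n is a multiple of 16. Unaligned loads keep the sketch simple. */
float dot_4acc(const float *x, const float *y, size_t n)
{
    assert(n % 16 == 0);
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        /* Four multiply-add chains that do not depend on each other. */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),
                                           _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),
                                           _mm_loadu_ps(y + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(x + i + 8),
                                           _mm_loadu_ps(y + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(x + i + 12),
                                           _mm_loadu_ps(y + i + 12)));
    }

    /* Combine the accumulators, then add up the 4 lanes. */
    __m128 sum = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, sum);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
#else
/* Scalar fallback with the same 4-accumulator structure. */
float dot_4acc(const float *x, const float *y, size_t n)
{
    assert(n % 16 == 0);
    float acc[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < n; i += 4)
        for (int k = 0; k < 4; ++k)
            acc[k] += x[i + k] * y[i + k];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
#endif
```

Because the additions happen in a different order than in a single-accumulator loop, the two can differ in the last bits for general float data.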

Quote from: lingo on September 03, 2010, 02:16:02 AM
- I see  ALIGN 16  just before the  .loop:  label. That is unusual, because it costs us a few extra clocks.

What a mess. But aligning the hot spots (here, the .loop label) on a 16-byte boundary is recommended by both Intel and AMD. In any case, it's only a matter of cut & paste for you: move the ALIGN directive to the procedure's entry, build the program again, and voilà, that's it. What the heck. By the way, was it Windows or Snow Leopard?

Quote from: lingo on September 03, 2010, 02:16:02 AM
- My last algo is not included either...

Another mess. But joking aside, it is a question of time. If I have enough time, I'll try to include your last algorithm this weekend. Okay?

Gunther
Forgive your enemies, but never forget their names.

jj2007

For the record, timings on my old Intel:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3674    cycles for DotXMM1Acc4E
3215    cycles for DotXMM1Acc4EJ1
3088    cycles for DotXMM1Acc4EJ2
2795    cycles for AxDotXMM1_fastcall
1953    cycles for DotXMM2Acc16ELingo
1910    cycles for DotXMM2Acc32ELingo
1752    cycles for DotXMM2Acc16EPaul

4064    cycles for DotXMM1Acc4E
3644    cycles for DotXMM1Acc4EJ1
3832    cycles for DotXMM1Acc4EJ2
3065    cycles for AxDotXMM1_fastcall
1831    cycles for DotXMM2Acc16ELingo
1752    cycles for DotXMM2Acc32ELingo
1849    cycles for DotXMM2Acc16EPaul


It seems archaic CPUs like Lingo's code :bg

And the same CPU running the test from reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 20.66 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 10.12 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.00 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.49 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 2.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 2.53 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 2.47 Seconds

Gunther

Quote from: jj2007 on September 03, 2010, 08:41:56 AM
For the record, timings on my old Intel:

Thank you, Jochen.

Quote from: jj2007 on September 03, 2010, 08:41:56 AM
It seems archaic CPUs like Lingo's code

But Paul's too.  :bg

Gunther

FORTRANS

Hi,

   DotPro18b results.


Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
2613 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2084 cycles for DotXMM1Acc4EJ2
2417 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2099 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul

2623 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2097 cycles for DotXMM1Acc4EJ2
2418 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2101 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul

Test for correct results, expected 2867507200:

80 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
102 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
175 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul

--- done ---


Steve

Antariy

For dotfloat_update.zip:



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 9 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 34.39 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 16.62 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 7.08 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 7.05 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.62 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 4.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 4.50 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 4.55 Seconds



Gunther, my CPU doesn't support HTT. This is a common mistake in programs that do CPU detection. As far as I know, all 90 nm Prescotts report that they support HTT, but that is not always true.



Alex

Gunther

Quote from: Antariy on September 04, 2010, 12:47:53 AM
Gunther, my CPU doesn't support HTT. This is a common mistake in programs that do CPU detection. As far as I know, all 90 nm Prescotts report that they support HTT, but that is not always true.

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther

Antariy

Quote from: Gunther on September 04, 2010, 01:36:10 AM

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther


I know that the app doesn't use HTT. But HTT sometimes messes up the timings.

My CPU - Celeron D 310. In detail - Prescott core, 2.13 GHz, 256 KB L2 cache.
OS - WinXP SP2.



Alex

Antariy

Gunther, when I talk about HTT, I don't mean that the program uses it. How could the program use it? :)
I mean this:

+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.          <-------------------------- THIS

Calculating the dot product in 9 different variations.
That'll take a little while ...


Your CPU detection makes a common mistake - my CPU really doesn't have HTT.
Sorry if I wasn't clear at first - my English is bad...



Alex

Gunther

Alex,

thank you for your information.

Quote from: Antariy on September 05, 2010, 11:05:57 PM
My CPU - Celeron D 310. In detail - Prescott core, 2.13 GHz, 256 KB L2 cache.
OS - WinXP SP2.

The machine isn't as bad as you said in your PM.

Gunther

Gunther

Alex,

excuse me, we both posted at the same time, so I couldn't see your latest answer.

Quote from: Antariy on September 05, 2010, 11:38:52 PM
Your CPU detection makes a common mistake - my CPU really doesn't have HTT.

I'll check that, but I used the algorithm recommended by Intel and AMD.
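
One possible explanation for the misreport, offered as a sketch and not as a diagnosis of Gunther's detection code: in CPUID leaf 1, EDX bit 28 (the HTT flag) only says that the logical-processor count in EBX bits 23..16 is valid. It does not by itself mean that Hyper-Threading is present, so a detector that tests the flag alone will over-report. The helper names below are hypothetical, and I have not verified what a Celeron D 310 actually returns.

```c
#include <stdint.h>

/* CPUID leaf 1, EDX bit 28: the HTT feature flag. */
#define CPUID1_EDX_HTT (UINT32_C(1) << 28)

/* CPUID leaf 1, EBX bits 23..16: maximum number of addressable
   logical-processor IDs in the package. The field is only valid
   when the HTT flag is set; otherwise assume a single logical CPU. */
unsigned logical_ids_per_package(uint32_t ebx, uint32_t edx)
{
    if ((edx & CPUID1_EDX_HTT) == 0)
        return 1;
    return (ebx >> 16) & 0xFFu;
}

/* Returns 1 only if the flag is set AND more than one logical ID is
   reported. Even then this does not prove Hyper-Threading: multi-core
   packages set the flag too, and HT may be disabled in the BIOS. */
int reports_multiple_logical_cpus(uint32_t ebx, uint32_t edx)
{
    return logical_ids_per_package(ebx, edx) > 1;
}
```

With GCC or Clang, the two register values can be read with `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`; an assembly test bed would execute CPUID with EAX = 1 and pass EBX and EDX on.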

Gunther

Antariy

Quote from: Gunther on September 04, 2010, 10:41:13 PM

The machine isn't as bad as you said in your PM.


Yes, a big advantage - heavily unrolled algos run slower on it than short loops :)



Alex

hutch--

Alex,

The useful level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling, whereas many of the later ones show little change. I have algos written for P4s that are slower than a short algo on a later processor. Intel recommends unrolling up to the limit of the code cache, but I have found in practice that it is better to try different amounts and not go beyond the point where the speed stops improving.
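
The "try different amounts" advice can be sketched as a small C harness that times several unroll levels and keeps the fastest. All names here are hypothetical, the variants are plain scalar C and not the SSE2 routines from this thread, and `clock()` is far coarser than the cycle counts posted above. An optimizing compiler may also unroll or vectorize on its own, so treat the timings as indicative only.

```c
#include <stddef.h>
#include <time.h>

typedef float (*dot_fn)(const float *x, const float *y, size_t n);

/* No unrolling: one accumulator, one element per iteration. */
float dot_unroll1(const float *x, const float *y, size_t n)
{
    float a0 = 0.0f;
    for (size_t i = 0; i < n; ++i)
        a0 += x[i] * y[i];
    return a0;
}

/* Unrolled by 2 with two accumulators; a tail loop handles odd n. */
float dot_unroll2(const float *x, const float *y, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        a0 += x[i] * y[i];
        a1 += x[i + 1] * y[i + 1];
    }
    for (; i < n; ++i)
        a0 += x[i] * y[i];
    return a0 + a1;
}

/* Unrolled by 4 with four accumulators; a tail loop handles the rest. */
float dot_unroll4(const float *x, const float *y, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i] * y[i];
        a1 += x[i + 1] * y[i + 1];
        a2 += x[i + 2] * y[i + 2];
        a3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)
        a0 += x[i] * y[i];
    return (a0 + a1) + (a2 + a3);
}

/* Times `reps` calls of one variant; returns elapsed CPU seconds. */
double time_variant(dot_fn f, const float *x, const float *y,
                    size_t n, int reps)
{
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        sink = f(x, y, n);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Tries every variant and returns the index of the fastest one. */
size_t pick_fastest(const dot_fn *variants, size_t count,
                    const float *x, const float *y, size_t n, int reps)
{
    size_t best = 0;
    double best_time = 0.0;
    for (size_t v = 0; v < count; ++v) {
        double t = time_variant(variants[v], x, y, n, reps);
        if (v == 0 || t < best_time) {
            best = v;
            best_time = t;
        }
    }
    return best;
}
```

Running `pick_fastest` on each target machine, with realistic array sizes and enough repetitions, gives a per-CPU answer instead of a single unroll factor for all processors.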