Suggestions and improvements for SSE2 code are welcome

dioxin · September 03, 2010, 12:46:31 AM

jj2007,

Quote:
A variant with an option to unroll Alex's algo by 4 - still compact at 102 bytes, and a bit faster of course.

AMD Phenom(tm) II X4 945 Processor (SSE3)
2202    cycles for DotXMM1Acc4E
2153    cycles for DotXMM1Acc4EJ1
2152    cycles for DotXMM1Acc4EJ2
2140    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2147    cycles for DotXMM1Acc4EJ2
2142    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2157    cycles for DotXMM1Acc4E
2155    cycles for DotXMM1Acc4EJ1
2155    cycles for DotXMM1Acc4EJ2
2139    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
815     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
80       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
102      bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
175      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul

--- done ---

jj2007 · September 03, 2010, 12:49:34 AM

Quote from: dioxin on September 03, 2010, 12:40:23 AM
Timings from Gunther's latest post in Reply #72:

Celeron M:

Quote:
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time = 6.92 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time = 8.42 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time = 7.52 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time = 6.77 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time = 7.16 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time = 6.78 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time = 6.70 Seconds

EDIT: I have added a macro for testing the correctness of results.

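; Runs one algo on the test vectors and reports a result that differs from Expected
IsCorrect MACRO algo
	invoke &algo&, offset SrcX, offset SrcY, 2048	; the algo leaves its result in ST(0)
	fstp Res4					; pop ST(0) into Res4
	Fcmp Res4, Expected				; compare; Zero? is set when both are equal
	.if !Zero?
		Print Str$("\nIncorrect result: %i for &algo&", Res4)
	.endif
ENDM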
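
For readers following the C++ side of the thread, the same idea can be sketched without the macro. This is only an illustration under stated assumptions: `dot_reference` and `is_correct` are hypothetical names that do not come from any of the posted archives, and the relative tolerance is an assumption too, chosen because different accumulation orders can round single-precision sums differently.

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference: accumulate in double so the reference value itself
// is not affected by single-precision rounding in the running sum.
double dot_reference(const float* x, const float* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    return sum;
}

// Hypothetical helper: true when `result` agrees with `expected`
// within a relative tolerance (assumed default: 1e-6).
bool is_correct(double result, double expected, double rel_tol = 1e-6) {
    const double scale = std::fmax(std::fabs(expected), 1.0);
    return std::fabs(result - expected) <= rel_tol * scale;
}
```

Each optimized variant would then be called on the same vectors and its return value passed to `is_correct` together with the reference value.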

lingo · September 03, 2010, 01:16:02 AM

I have 2 notes about my algos in dotfloat_update.zip (from Gunther's latest post):
- I see ALIGN 16 just before the .loop: label. That is not normal, because it costs some extra clocks.
- My last algo is not included either... :(

Gunther · September 03, 2010, 01:55:16 AM

Quote from: dioxin September 03, 2010, at 01:40:23 AM
Timings from Gunther's latest post in Reply #72:

Thank you, Paul. You see: your Phenom does a good job with 4 accumulators, while my Athlon did not.

Quote from: lingo September 03, 2010, at 02:16:02 AM
- I see ALIGN 16 just before the .loop: label. That is not normal, because it costs some extra clocks.

What a mess. But aligning the hot spots (that is, the .loop label) by 16 is recommended by both Intel and AMD. In any case, it's only a matter of cut & paste for you: move the ALIGN directive to the procedure's entry, build the program again, and voilà, it's done. What the heck. By the way, was it Windows or Snow Leopard?

Quote from: lingo September 03, 2010, at 02:16:02 AM
- My last algo is not included either...

Another mess. But joking aside, it is a question of time. If there is enough time, I'll try to include your last algorithm this weekend. Okay?

Gunther
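
To make the "accumulators" and "elements per loop cycle" wording in the timing listings concrete, here is a minimal C++ intrinsics sketch of the 2-accumulator, 8-element pattern. It is not the code from dotfloat_update.zip or from any poster; the name `dot_sse_2acc` is hypothetical, the sketch assumes the element count is a multiple of 8, and it uses unaligned loads so that it works for any pointer.

```cpp
#include <xmmintrin.h>  // packed single-precision intrinsics (__m128, _mm_mul_ps, ...)
#include <cstddef>

// Dot product with two independent accumulators, 8 floats per loop iteration.
// Assumption: n is a multiple of 8 (no tail handling in this sketch).
float dot_sse_2acc(const float* x, const float* y, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8) {
        // The two multiply-add chains do not depend on each other,
        // so the CPU can overlap their latencies.
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),
                                           _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),
                                           _mm_loadu_ps(y + i + 4)));
    }
    // Combine both accumulators, then add the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(acc0, acc1));
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

A 4-accumulator version follows the same pattern with `acc2` and `acc3` added; whether that pays off depends on the CPU, as the Phenom and Athlon results quoted in this thread suggest.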

jj2007 · September 03, 2010, 07:41:56 AM

For the record, timings on my old Intel:

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3674    cycles for DotXMM1Acc4E
3215    cycles for DotXMM1Acc4EJ1
3088    cycles for DotXMM1Acc4EJ2
2795    cycles for AxDotXMM1_fastcall
1953    cycles for DotXMM2Acc16ELingo
1910    cycles for DotXMM2Acc32ELingo
1752    cycles for DotXMM2Acc16EPaul

4064    cycles for DotXMM1Acc4E
3644    cycles for DotXMM1Acc4EJ1
3832    cycles for DotXMM1Acc4EJ2
3065    cycles for AxDotXMM1_fastcall
1831    cycles for DotXMM2Acc16ELingo
1752    cycles for DotXMM2Acc32ELingo
1849    cycles for DotXMM2Acc16EPaul

It seems archaic CPUs like Lingo's code :bg

And the same CPU on reply #72:

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 20.66 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 10.12 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.00 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.49 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 2.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 2.53 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 2.47 Seconds

Gunther · September 03, 2010, 11:54:59 AM

Quote from: jj2007 September 03, 2010, at 08:41:56 AM
For the record, timings on my old Intel:

Thank you, Jochen.

Quote from: jj2007 September 03, 2010, at 08:41:56 AM
It seems archaic CPUs like Lingo's code

But Paul's too. :bg

Gunther

FORTRANS · September 03, 2010, 12:47:36 PM

Hi,

DotPro18b results.

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
2613	cycles for DotXMM1Acc4E
2227	cycles for DotXMM1Acc4EJ1
2084	cycles for DotXMM1Acc4EJ2
2417	cycles for AxDotXMM1_fastcall
2142	cycles for DotXMM2Acc16ELingo
2099	cycles for DotXMM2Acc32ELingo
2424	cycles for DotXMM2Acc16EPaul

2623	cycles for DotXMM1Acc4E
2227	cycles for DotXMM1Acc4EJ1
2097	cycles for DotXMM1Acc4EJ2
2418	cycles for AxDotXMM1_fastcall
2142	cycles for DotXMM2Acc16ELingo
2101	cycles for DotXMM2Acc32ELingo
2424	cycles for DotXMM2Acc16EPaul

Test for correct results, expected 2867507200:

80	 bytes for DotXMM1Acc4E
278	 bytes for DotXMM1Acc4EJ1
266	 bytes for DotXMM1Acc4EJ2
102	 bytes for AxDotXMM1_fastcall
120	 bytes for DotXMM2Acc16ELingo
175	 bytes for DotXMM2Acc32ELingo
129	 bytes for DotXMM2Acc16EPaul

--- done ---

Steve

Antariy · September 03, 2010, 11:47:53 PM

For dotfloat_update.zip:

Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
 + FPU (floating point unit) on chip,
 + support of FXSAVE and FXRSTOR,
 + 57 MMX Instructions,
 + 70 SSE (Katmai) Instructions,
 + 144 SSE2 (Willamette) Instructions,
 + 13 SSE3 (Prescott) Instructions,
 + HTT (hyper thread technology) support.

Calculating the dot product in 9 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 34.39 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 16.62 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 7.08 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 7.05 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.62 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 4.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 4.50 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 4.55 Seconds

Gunther, my CPU doesn't support HTT. But this is a mistake that most programs doing CPU detection make. As far as I know, all 90nm Prescotts "say" that they support HTT, but sometimes that is not true.

Alex

Gunther · September 04, 2010, 01:36:10 AM

Quote from: Antariy September 04, 2010, at 12:47:53 AM
Gunther, my CPU doesn't support HTT. But this is a mistake that most programs doing CPU detection make. As far as I know, all 90nm Prescotts "say" that they support HTT, but sometimes that is not true.

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther

Antariy · September 04, 2010, 10:05:57 PM

Quote from: Gunther on September 04, 2010, 01:36:10 AM

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther

I know that the app doesn't use HTT. But HTT sometimes messes up the timings.

My CPU is a Celeron D 310. In detail: Prescott core, 2.13 GHz, 256KB L2 cache.
OS: WinXP SP2.

Alex

Antariy · September 04, 2010, 10:38:52 PM

Gunther, when I talk about HTT, I don't mean that the program uses it. How could a program use it? :)
I mean this:

+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.          <-------------------------- THIS

Calculating the dot product in 9 different variations.
That'll take a little while ...

Your CPU detection makes a common mistake: my CPU really doesn't have HTT.
Sorry if I wasn't clear at first; my English is bad...

Alex

Gunther · September 04, 2010, 10:41:13 PM

Alex,

thank you for your information.

Quote from: Antariy September 05, 2010, at 11:05:57 PM
My CPU is a Celeron D 310. In detail: Prescott core, 2.13 GHz, 256KB L2 cache.
OS: WinXP SP2.

The machine isn't as bad as you said in your PM.

Gunther

Gunther · September 04, 2010, 10:47:59 PM

Alex,

excuse me, we both posted at the same time, so I couldn't see your latest answer.

Quote from: Antariy September 05, 2010, at 11:38:52 PM
Your CPU detection makes a common mistake: my CPU really doesn't have HTT.

I'll check that, but I used the algorithm recommended by Intel and AMD.

Gunther
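
One possible explanation for the disputed HTT line, offered as a sketch and not as a diagnosis of Gunther's detection code: in CPUID leaf 1, EDX bit 28 is the HTT feature flag, and EBX bits 23:16 give the maximum number of addressable logical-processor IDs in the package. A program that looks only at the flag can report HTT even when that count is 1. The helper names below (`HttInfo`, `decode_htt`, `reports_multiple_logical_cpus`) are hypothetical, and the function works on register values that have already been read, so it carries no compiler-specific code.

```cpp
#include <cstdint>

// Fields of CPUID leaf 1 that are relevant to the HTT question.
struct HttInfo {
    bool     flag_set;     // EDX bit 28: HTT feature flag
    unsigned logical_ids;  // EBX bits 23:16: max addressable logical-processor IDs
};

// Decode the two fields from the EBX and EDX values returned by CPUID leaf 1.
HttInfo decode_htt(std::uint32_t ebx, std::uint32_t edx) {
    HttInfo info;
    info.flag_set    = ((edx >> 28) & 1u) != 0;
    info.logical_ids = (ebx >> 16) & 0xFFu;
    return info;
}

// Hypothetical policy: report multiple logical processors only when the flag
// is set AND the package advertises more than one logical-processor ID.
bool reports_multiple_logical_cpus(const HttInfo& info) {
    return info.flag_set && info.logical_ids > 1;
}
```

With GCC or Clang the register values can be obtained through `__get_cpuid(1, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`; MSVC offers `__cpuid` in `<intrin.h>`.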

Antariy · September 04, 2010, 11:10:54 PM

Quote from: Gunther on September 04, 2010, 10:41:13 PM

The machine isn't as bad as you said in your PM.

Yes, a big advantage: heavily unrolled algos run slower than short loops :)

Alex

hutch-- · September 05, 2010, 04:12:26 AM

Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling, whereas many of the later ones show little change. I have algos written for P4s that are slower than a short algo on a later processor. Intel recommends unrolling up to the limit of the code cache, but I have found in practice that it is best to try different amounts and not unroll beyond the point where the speed stops picking up.
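
The "try different amounts" advice can be sketched in C++ by making the unroll factor a compile-time parameter, so several variants can be built from one source and timed on the target machine. This is an illustration only: `dot_unrolled` is a hypothetical name, the choice of two accumulators is an assumption borrowed from the listings in this thread, and whether the inner loop is fully unrolled is up to the compiler.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// UNROLL = number of 4-float vectors handled per outer-loop iteration.
// Two accumulators are used alternately; leftover elements go through a scalar tail.
template <int UNROLL>
float dot_unrolled(const float* x, const float* y, std::size_t n) {
    __m128 acc[2] = { _mm_setzero_ps(), _mm_setzero_ps() };
    const std::size_t step = 4 * static_cast<std::size_t>(UNROLL);
    std::size_t i = 0;
    for (; i + step <= n; i += step) {
        // Constant trip count: optimizing compilers typically unroll this loop.
        for (int k = 0; k < UNROLL; ++k) {
            const float* px = x + i + 4 * k;
            const float* py = y + i + 4 * k;
            acc[k & 1] = _mm_add_ps(acc[k & 1],
                                    _mm_mul_ps(_mm_loadu_ps(px), _mm_loadu_ps(py)));
        }
    }
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_add_ps(acc[0], acc[1]));
    float total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i)          // scalar tail for the remaining elements
        total += x[i] * y[i];
    return total;
}
```

Instantiating `dot_unrolled<1>`, `dot_unrolled<2>`, `dot_unrolled<4>` and `dot_unrolled<8>` and timing each on the same data gives a quick picture of where extra unrolling stops helping on a particular processor.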
