
Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM


dioxin

I think I've lost track of what's going on in this thread.
How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?

At some point in this thread the timings switched from seconds for 5,000,000 loops to clks per loop. Did anything else change?
If not, then I get the following timings:
                        Phenom II   Atom N270
Code from Reply #17         640        2156
Gunther's original         1056        1660
(2 acc., 16 elements)
Lingo's latest             1157        1858



On my PC (3 GHz Phenom II), Gunther's original runs in 1.76 s = 1056 clks and Lingo's latest runs in 1157 clks, but the code posted in Reply #17 of this thread runs in 640 clks. I gather from other posts in this thread that this is very CPU dependent.

I've tried them on other machines: older PCs do show Lingo's code to be a little faster (in the region of 5%) than the Reply #17 code. Other modern Athlon-type CPUs show the Reply #17 code being nearly twice as fast.
I don't have any modern Intels available to test.

Paul.

lingo

"If not, then I get the following timings:"

Without your testing files it is just bla, bla, bla... :lol

dioxin

Lingo,
I forget that some people can only handle one query at a time, so I'll simplify it.

How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?
The code looks similar and the timings are similar.
Here are the timings from the already posted testing files:
                        Phenom II   Atom N270
Gunther's original         1056        1660
(2 acc., 16 elements)
Lingo's latest             1157        1858


Paul.

dioxin

Here are the relevant extracts from Gunther's original and Lingo's latest.
Shown below are the main loops of the two routines:

Gunther's original:

@@:
movaps     xmm1, [eax+edx+16]
mulps      xmm0, [edx]
addps      xmm5, xmm6
movaps     xmm3, [eax+edx+32]
mulps      xmm1, [edx+16]
addps      xmm4, xmm0
movaps     xmm6, [eax+edx+48]
mulps      xmm3, [edx+32]
add edx, 64
addps      xmm5, xmm1
movaps     xmm0, [eax+edx]
mulps      xmm6, [edx+48-64]
addps      xmm4, xmm3
sub ecx, 16
ja @b




Lingo's latest:

.loop:

movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
addps   xmm4,xmm0 ;sum up
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
addps   xmm5,xmm1 ;sum up
movaps  xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
lea     edx,[edx+64] ;update pointers
addps   xmm4,xmm2 ;sum up
movaps  xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
addps   xmm5,xmm3 ;sum up
sub     ecx,byte 64 ;count down
jnz .loop
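For readers following along outside assembly, the trick both loops rely on is keeping two independent accumulators, so the latency of one addps chain overlaps the other. A minimal scalar sketch in C (the function name and the scalar form are mine, not from the thread):

```c
#include <stddef.h>

/* Two independent accumulators: the adds into acc0 and acc1 form
 * separate dependency chains, so they can overlap in the pipeline.
 * The SSE loops above do the same thing four lanes at a time,
 * 16 elements per iteration. */
float dot2acc(const float *x, const float *y, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                      /* odd-length tail */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```

The assembly versions additionally hoist the loads one iteration ahead, which a compiler may or may not do for you.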


lingo

Your "results" without your testing files are just bla,bla,bla... :(

It would be better to post all the algos rather than just the loops.. :lol
This is Gunther's thread, and it is his choice which algo is better for him.

dioxin

Lingo,
Quote: "without your testing files"

What testing files are you after? I used the ones in the original post of this thread by Gunther and the one in Reply #59 by jj2007.
The files are already posted. It would be pointless to repost the same ones again.

Paul.

lingo

"I used the ones in the original post of this thread by Gunther and the one in reply in Reply #59 by jj2007."

Here is a result from the original post by Gunther:
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 1.87 Seconds


and here is a result from Reply #59 by jj2007:
2173    cycles for DotXMM2Acc16ELingo

As you can see, the units of the times are different (seconds vs. cycles)...  :lol

So, your "results" without your testing files are just bla, bla, bla...

jj2007

Quote from: dioxin on September 02, 2010, 05:21:18 PM
I think I've lost track of what's going on in this thread.

Paul,
Here are results for all algos including your #17:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2529    cycles for DotXMM1Acc4E
2163    cycles for DotXMM1Acc4EJ1
2070    cycles for DotXMM1Acc4EJ2
2550    cycles for AxDotXMM1_fastcall
2138    cycles for DotXMM2Acc16ELingo
2135    cycles for DotXMM2Acc16EPaul


dioxin

Lingo,
it's simple maths. Gunther's runs 5,000,000 loops in 1.87 s.

That's 1.87 / 5,000,000 = 374 ns per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.
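Put as code, the conversion is just (a sketch, with the figures from the posts above):

```c
/* seconds for N loops  ->  cycles per loop:
 * seconds / loops gives seconds per loop; scaling to nanoseconds and
 * multiplying by the clock rate in GHz (cycles per nanosecond) gives
 * cycles per loop. */
double cycles_per_loop(double seconds, double loops, double ghz)
{
    double ns_per_loop = seconds / loops * 1e9;
    return ns_per_loop * ghz;
}
```

For example, 1.76 s over 5,000,000 loops at 3 GHz gives the 1056 cycles per loop quoted above, and 1.87 s gives 1122.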

jj2007,
thanks, on mine with your posted code I get:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2410    cycles for DotXMM1Acc4E
2115    cycles for DotXMM1Acc4EJ1
2118    cycles for DotXMM1Acc4EJ2
2116    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
723     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2117    cycles for DotXMM1Acc4EJ1
2115    cycles for DotXMM1Acc4EJ2
2114    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
725     cycles for DotXMM2Acc16EPaul

2155    cycles for DotXMM1Acc4E
2130    cycles for DotXMM1Acc4EJ1
2134    cycles for DotXMM1Acc4EJ2
2121    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
724     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---



jj2007

Quote from: dioxin on September 02, 2010, 06:46:26 PM
jj2007,
thanks, on mine with your posted code I get:
Quote: AMD Phenom(tm) II X4 945 Processor (SSE3)
2118    cycles for DotXMM1Acc4EJ2
723     cycles for DotXMM2Acc16EPaul

Quote: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2071    cycles for DotXMM1Acc4EJ2
2135    cycles for DotXMM2Acc16EPaul

Wow!

lingo

 :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1566    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1561    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2270    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2199    cycles for DotXMM1Acc4EJ2
2259    cycles for AxDotXMM1_fastcall
1400    cycles for DotXMM2Acc16ELingo
1582    cycles for DotXMM2Acc16EPaul

2242    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2232    cycles for DotXMM1Acc4EJ2
2269    cycles for AxDotXMM1_fastcall
1393    cycles for DotXMM2Acc16ELingo
1579    cycles for DotXMM2Acc16EPaul

2246    cycles for DotXMM1Acc4E
2221    cycles for DotXMM1Acc4EJ1
2193    cycles for DotXMM1Acc4EJ2
2268    cycles for AxDotXMM1_fastcall
1425    cycles for DotXMM2Acc16ELingo
1575    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

redskull

Quote from: dioxin on September 02, 2010, 06:46:26 PM

it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.

That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.


Unless Lingo's Core2 has a wider pipeline that can decode three µops per cycle, or an extra ALU to do up to six µops per cycle. In that case the math isn't quite that simple: cycles per loop depend on the microarchitecture, not just the clock speed.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

Gunther

As I announced a while ago, here is the update of my test bed. The program now calculates the dot product in 9 different ways. I've included Paul's and lingo's code via inline assembly. Furthermore, I squeezed out a bit of time with a little macro magic in DotXMM2Acc32E (please have a look into dotfloatfa.cpp).

My impression is: that's a good solution, especially for older AMD chips (Athlons or Opterons). Newer AMD processors like the Phenom will probably run better with Paul's code, but that's not yet tested. Here is a part of the current timings with my Athlon X2, 1.9 GHz (only the last 4 results are shown; for more, look into results.pdf):

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 3.88 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 3.08 Seconds
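The "macro magic" mentioned above is, in spirit, letting the preprocessor generate the unrolled loop body instead of writing each step out by hand. A rough, hypothetical C sketch (the macro and function names are mine, not from dotfloatfa.cpp):

```c
#include <stddef.h>

/* MAC expands one pair of multiply-accumulate steps; repeating it
 * generates an unrolled, two-accumulator loop body (8 elements per
 * pass here, 32 in the real routine). The macro deliberately uses
 * x, y, acc0 and acc1 from the enclosing scope. */
#define MAC(i) do { acc0 += x[(i)] * y[(i)]; \
                    acc1 += x[(i) + 1] * y[(i) + 1]; } while (0)

float dot_unrolled8(const float *x, const float *y, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t i = 0; i < n; i += 8) {   /* assumes n % 8 == 0 */
        MAC(i); MAC(i + 2); MAC(i + 4); MAC(i + 6);
    }
    return acc0 + acc1;
}
```

The win is maintainability: changing the unroll factor means repeating the macro a different number of times, not rewriting the body.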

It would be interesting to see what lingo's Snow Leopard brings. The program will compile properly under Mac OS X, so please try it. I'll also try to include Antariy's code in the application. My special thanks go to all the guys for making suggestions and for helping with the testing.

Gunther
Forgive your enemies, but never forget their names.

hutch--

Like a voice crying in the wilderness: if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Cycles may be cute and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

A small correction to my old algo, and new results: :lol
align 16
db 3 Dup(0)
DotXMM2Acc32ELingo proc srcX, srcY, counter
mov     eax,   [esp+1*4] ; eax->srcX
pxor    xmm5,  xmm5
pxor    xmm3,  xmm3
mov     edx,   [esp+2*4] ; edx ->srcY
movaps  xmm0,  [eax]
pxor    xmm4,  xmm4
mov     ecx,   [esp+3*4] ; ecx = counter
sub     eax,   edx
@@:
movaps  xmm1, [eax+edx+16]
mulps   xmm0, [edx]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+32]
mulps   xmm1, [edx+16]
addps   xmm5, xmm0
movaps  xmm3, [eax+edx+48]
mulps   xmm2, [edx+32]
addps   xmm4, xmm1
movaps  xmm6, [eax+edx+64]
mulps   xmm3, [edx+48]
addps   xmm5, xmm2
movaps  xmm7, [eax+edx+80]
mulps   xmm6, [edx+64]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+96]
mulps xmm7, [edx+80]
addps   xmm5, xmm6
movaps  xmm3, [eax+edx+112]
mulps xmm2, [edx+96]
addps   xmm4, xmm7
movaps  xmm0, [eax+edx+128]
mulps xmm3, [edx+112]
addps   xmm5, xmm2
add     edx,  128
sub     ecx,  32
ja      @b
addps    xmm4, xmm3
addps    xmm4, xmm5
movhlps  xmm0, xmm4
addps    xmm4, xmm0
pshufd   xmm0, xmm4,1
addss    xmm4, xmm0
movss    dword ptr [esp+2*4], xmm4
fld      dword ptr [esp+2*4]
ret      4*3
DotXMM2Acc32ELingo endp
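The tail after the loop (addps/movhlps/pshufd/addss) is a horizontal reduction of the four accumulator lanes down to a single float. For reference, an equivalent reduction written with SSE intrinsics (my translation, using shufps where the proc uses pshufd; the lane movement is the same):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the four lanes of v, mirroring the tail of the
 * proc above: fold the high pair onto the low pair, then fold lane 1
 * onto lane 0 and take the scalar result. */
static float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);           /* lanes 3,2 -> 1,0 */
    __m128 pair  = _mm_add_ps(v, hi);             /* two partial sums */
    __m128 lane1 = _mm_shuffle_ps(pair, pair, 1); /* lane 1 -> lane 0 */
    return _mm_cvtss_f32(_mm_add_ss(pair, lane1));
}
```

The proc instead spills the scalar to the stack and reloads it with fld so it can return the result on the x87 stack, as the calling convention expects.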

Results:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1564    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1053    cycles for DotXMM2Acc16EPaul

1563    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1563    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---


AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2291    cycles for DotXMM1Acc4E
2233    cycles for DotXMM1Acc4EJ1
2215    cycles for DotXMM1Acc4EJ2
2277    cycles for AxDotXMM1_fastcall
1454    cycles for DotXMM2Acc16ELingo
1338    cycles for DotXMM2Acc32ELingo
1611    cycles for DotXMM2Acc16EPaul

2257    cycles for DotXMM1Acc4E
2230    cycles for DotXMM1Acc4EJ1
2249    cycles for DotXMM1Acc4EJ2
2251    cycles for AxDotXMM1_fastcall
1435    cycles for DotXMM2Acc16ELingo
1362    cycles for DotXMM2Acc32ELingo
1623    cycles for DotXMM2Acc16EPaul

2268    cycles for DotXMM1Acc4E
2253    cycles for DotXMM1Acc4EJ1
2250    cycles for DotXMM1Acc4EJ2
2235    cycles for AxDotXMM1_fastcall
1402    cycles for DotXMM2Acc16ELingo
1371    cycles for DotXMM2Acc32ELingo
1605    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---