
Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM


dioxin

I think I've lost track of what's going on in this thread.
How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?

At some point in this thread the timings switched from seconds for 5,000,000 loops to clks per loop. Did anything else change?
If not, then I get the following timings:
                        Phenom II   Atom N270
Code from Reply #17         640        2156
Gunther's original         1056        1660
(2 acc., 16 elements)
Lingo's latest             1157        1858



On my PC (3 GHz Phenom II), Gunther's original runs in 1.76 s = 1056 clks and Lingo's latest runs in 1157 clks, but the code posted in Reply #17 of this thread runs in 640 clks. I gather from other posts in this thread that this is very CPU dependent.

I've tried them on other machines: older PCs do show Lingo's code to be a little faster (in the region of 5%) than the Reply #17 code. Other modern Athlon-type CPUs show the Reply #17 code being nearly twice as fast.
I don't have any modern Intels available to test.

Paul.

lingo

"If not, then I get the following timings:"

Without your testing files it is just bla, bla, bla... :lol

dioxin

Lingo,
I forget that some people can only handle one query at a time, so I'll simplify it.

How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?
The code looks similar and the timings are similar.
Here are the timings from the already posted testing files:
                        Phenom II   Atom N270
Gunther's original         1056        1660
(2 acc., 16 elements)
Lingo's latest             1157        1858


Paul.

dioxin

Here are the relevant extracts from Gunther's original and Lingo's latest.
Shown below are the main loops of the two routines:

Gunther's original:

@@:
movaps     xmm1, [eax+edx+16]
mulps      xmm0, [edx]
addps      xmm5, xmm6
movaps     xmm3, [eax+edx+32]
mulps      xmm1, [edx+16]
addps      xmm4, xmm0
movaps     xmm6, [eax+edx+48]
mulps      xmm3, [edx+32]
add edx, 64
addps      xmm5, xmm1
movaps     xmm0, [eax+edx]
mulps      xmm6, [edx+48-64]
addps      xmm4, xmm3
sub ecx, 16
ja @b




Lingo's latest:

.loop:

movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
addps   xmm4,xmm0 ;sum up
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
addps   xmm5,xmm1 ;sum up
movaps  xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
lea     edx,[edx+64] ;update pointers
addps   xmm4,xmm2 ;sum up
movaps  xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
addps   xmm5,xmm3 ;sum up
sub     ecx,byte 64 ;count down
jnz .loop
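For readers following along outside assembly, the trick both loops rely on is keeping two independent accumulators, so the latency of one addps chain overlaps the other. A minimal scalar sketch in C (the function name and the scalar form are mine, not from the thread):

```c
#include <stddef.h>

/* Two independent accumulators: the adds into acc0 and acc1 form
 * separate dependency chains, so they can overlap in the pipeline.
 * The SSE loops above do the same thing four lanes at a time,
 * 16 elements per iteration. */
float dot2acc(const float *x, const float *y, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                      /* odd-length tail */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```

The assembly versions additionally hoist the loads one iteration ahead, which a compiler may or may not do for you.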


lingo

Your "results" without your testing files are just bla,bla,bla... :(

It would be better to post all the algos rather than just the loops.. :lol
This is Gunther's thread, and it is his choice which algo is better for him.

dioxin

Lingo,
Quote: "without your testing files"

What testing files are you after? I used the ones in the original post of this thread by Gunther and the one in Reply #59 by jj2007.
The files are already posted. It would be pointless to repost the same ones again.

Paul.

lingo

"I used the ones in the original post of this thread by Gunther and the one in reply in Reply #59 by jj2007."

Here is a result from the original post by Gunther:
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 1.87 Seconds


and here is a result from Reply #59 by jj2007:
2173    cycles for DotXMM2Acc16ELingo

As you can see, the units of the times are different (seconds vs. cycles)...  :lol

So, your "results" without your testing files are just bla, bla, bla...

jj2007

Quote from: dioxin on September 02, 2010, 05:21:18 PM
I think I've lost track of what's going on in this thread.

Paul,
Here are results for all algos including your #17:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2529    cycles for DotXMM1Acc4E
2163    cycles for DotXMM1Acc4EJ1
2070    cycles for DotXMM1Acc4EJ2
2550    cycles for AxDotXMM1_fastcall
2138    cycles for DotXMM2Acc16ELingo
2135    cycles for DotXMM2Acc16EPaul


dioxin

Lingo,
it's simple maths. Gunther's runs 5,000,000 loops in 1.87 s.

That's 1.87 / 5,000,000 = 374 ns per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.
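Put as code, the conversion is just (a sketch, with the figures from the posts above):

```c
/* seconds for N loops  ->  cycles per loop:
 * seconds / loops gives seconds per loop; scaling to nanoseconds and
 * multiplying by the clock rate in GHz (cycles per nanosecond) gives
 * cycles per loop. */
double cycles_per_loop(double seconds, double loops, double ghz)
{
    double ns_per_loop = seconds / loops * 1e9;
    return ns_per_loop * ghz;
}
```

For example, 1.76 s over 5,000,000 loops at 3 GHz gives the 1056 cycles per loop quoted above, and 1.87 s gives 1122.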

jj2007,
thanks, on mine with your posted code I get:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2410    cycles for DotXMM1Acc4E
2115    cycles for DotXMM1Acc4EJ1
2118    cycles for DotXMM1Acc4EJ2
2116    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
723     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2117    cycles for DotXMM1Acc4EJ1
2115    cycles for DotXMM1Acc4EJ2
2114    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
725     cycles for DotXMM2Acc16EPaul

2155    cycles for DotXMM1Acc4E
2130    cycles for DotXMM1Acc4EJ1
2134    cycles for DotXMM1Acc4EJ2
2121    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
724     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---



jj2007

Quote from: dioxin on September 02, 2010, 06:46:26 PM
jj2007,
thanks, on mine with your posted code I get:
Quote: AMD Phenom(tm) II X4 945 Processor (SSE3)
2118    cycles for DotXMM1Acc4EJ2
723     cycles for DotXMM2Acc16EPaul

Quote: Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2071    cycles for DotXMM1Acc4EJ2
2135    cycles for DotXMM2Acc16EPaul

Wow!

lingo

 :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1566    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1561    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2270    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2199    cycles for DotXMM1Acc4EJ2
2259    cycles for AxDotXMM1_fastcall
1400    cycles for DotXMM2Acc16ELingo
1582    cycles for DotXMM2Acc16EPaul

2242    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2232    cycles for DotXMM1Acc4EJ2
2269    cycles for AxDotXMM1_fastcall
1393    cycles for DotXMM2Acc16ELingo
1579    cycles for DotXMM2Acc16EPaul

2246    cycles for DotXMM1Acc4E
2221    cycles for DotXMM1Acc4EJ1
2193    cycles for DotXMM1Acc4EJ2
2268    cycles for AxDotXMM1_fastcall
1425    cycles for DotXMM2Acc16ELingo
1575    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

redskull

Quote from: dioxin on September 02, 2010, 06:46:26 PM

it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.

That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.


Unless Lingo's Core2 has a wider pipeline that can decode three µops per cycle, or an extra ALU to do up to six µops per cycle. In that case the math isn't quite that simple: cycles per loop depend on the microarchitecture, not just the clock speed.

-r
Strange women, lying in ponds, distributing swords, is no basis for a system of government

Gunther

As I announced a while ago, here is the update of my test bed. The program now calculates the dot product in 9 different ways. I've included Paul's and lingo's code via inline assembly. Furthermore, I squeezed out a bit of time with a little macro magic in DotXMM2Acc32E (please have a look into dotfloatfa.cpp).

My impression is: that's a good solution, especially for older AMD chips (Athlons or Opterons). Newer AMD processors like the Phenom will probably run better with Paul's code, but that's not yet tested. Here is a part of the current timings with my Athlon X2, 1.9 GHz (only the last 4 results are shown; for more, look into results.pdf):

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 3.88 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 3.08 Seconds
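The "macro magic" mentioned above is, in spirit, letting the preprocessor generate the unrolled loop body instead of writing each step out by hand. A rough, hypothetical C sketch (the macro and function names are mine, not from dotfloatfa.cpp):

```c
#include <stddef.h>

/* MAC expands one pair of multiply-accumulate steps; repeating it
 * generates an unrolled, two-accumulator loop body (8 elements per
 * pass here, 32 in the real routine). The macro deliberately uses
 * x, y, acc0 and acc1 from the enclosing scope. */
#define MAC(i) do { acc0 += x[(i)] * y[(i)]; \
                    acc1 += x[(i) + 1] * y[(i) + 1]; } while (0)

float dot_unrolled8(const float *x, const float *y, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t i = 0; i < n; i += 8) {   /* assumes n % 8 == 0 */
        MAC(i); MAC(i + 2); MAC(i + 4); MAC(i + 6);
    }
    return acc0 + acc1;
}
```

The win is maintainability: changing the unroll factor means repeating the macro a different number of times, not rewriting the body.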

It would be interesting to see what lingo's Snow Leopard brings. The program will compile properly under Mac OS X, so please try it. I'll also try to include Antariy's code in the application. My special thanks go to all the guys for making suggestions and for helping with the testing.

Gunther
Forgive your enemies, but never forget their names.

hutch--

Like a voice crying in the wilderness: if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Cycles may be cute and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

lingo

A small correction to my old algo, and new results: :lol
align 16
db 3 Dup(0)
DotXMM2Acc32ELingo proc srcX, srcY, counter
mov     eax,   [esp+1*4] ; eax->srcX
pxor    xmm5,  xmm5
pxor    xmm3,  xmm3
mov     edx,   [esp+2*4] ; edx ->srcY
movaps  xmm0,  [eax]
pxor    xmm4,  xmm4
mov     ecx,   [esp+3*4] ; ecx = counter
sub     eax,   edx
@@:
movaps  xmm1, [eax+edx+16]
mulps   xmm0, [edx]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+32]
mulps   xmm1, [edx+16]
addps   xmm5, xmm0
movaps  xmm3, [eax+edx+48]
mulps   xmm2, [edx+32]
addps   xmm4, xmm1
movaps  xmm6, [eax+edx+64]
mulps   xmm3, [edx+48]
addps   xmm5, xmm2
movaps  xmm7, [eax+edx+80]
mulps   xmm6, [edx+64]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+96]
mulps xmm7, [edx+80]
addps   xmm5, xmm6
movaps  xmm3, [eax+edx+112]
mulps xmm2, [edx+96]
addps   xmm4, xmm7
movaps  xmm0, [eax+edx+128]
mulps xmm3, [edx+112]
addps   xmm5, xmm2
add     edx,  128
sub     ecx,  32
ja      @b
addps    xmm4, xmm3
addps    xmm4, xmm5
movhlps  xmm0, xmm4
addps    xmm4, xmm0
pshufd   xmm0, xmm4,1
addss    xmm4, xmm0
movss    dword ptr [esp+2*4], xmm4
fld      dword ptr [esp+2*4]
ret      4*3
DotXMM2Acc32ELingo endp
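The tail after the loop (addps/movhlps/pshufd/addss) is a horizontal reduction of the four accumulator lanes down to a single float. For reference, an equivalent reduction written with SSE intrinsics (my translation, using shufps where the proc uses pshufd; the lane movement is the same):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal sum of the four lanes of v, mirroring the tail of the
 * proc above: fold the high pair onto the low pair, then fold lane 1
 * onto lane 0 and take the scalar result. */
static float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);           /* lanes 3,2 -> 1,0 */
    __m128 pair  = _mm_add_ps(v, hi);             /* two partial sums */
    __m128 lane1 = _mm_shuffle_ps(pair, pair, 1); /* lane 1 -> lane 0 */
    return _mm_cvtss_f32(_mm_add_ss(pair, lane1));
}
```

The proc instead spills the scalar to the stack and reloads it with fld so it can return the result on the x87 stack, as the calling convention expects.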

Results:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1564    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1053    cycles for DotXMM2Acc16EPaul

1563    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1563    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---


AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2291    cycles for DotXMM1Acc4E
2233    cycles for DotXMM1Acc4EJ1
2215    cycles for DotXMM1Acc4EJ2
2277    cycles for AxDotXMM1_fastcall
1454    cycles for DotXMM2Acc16ELingo
1338    cycles for DotXMM2Acc32ELingo
1611    cycles for DotXMM2Acc16EPaul

2257    cycles for DotXMM1Acc4E
2230    cycles for DotXMM1Acc4EJ1
2249    cycles for DotXMM1Acc4EJ2
2251    cycles for AxDotXMM1_fastcall
1435    cycles for DotXMM2Acc16ELingo
1362    cycles for DotXMM2Acc32ELingo
1623    cycles for DotXMM2Acc16EPaul

2268    cycles for DotXMM1Acc4E
2253    cycles for DotXMM1Acc4EJ1
2250    cycles for DotXMM1Acc4EJ2
2235    cycles for AxDotXMM1_fastcall
1402    cycles for DotXMM2Acc16ELingo
1371    cycles for DotXMM2Acc32ELingo
1605    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---