Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Gunther

Quote from: hutch-- September 03, 2010, at 11:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Wisely spoken. There are cache misses, mispredicted jumps, stalls and a whole bunch of other difficulties that can occur in practice. Therefore, I made the test bed as practical as possible. But 100% certainty is reached only after implementing the new algorithm in the original application. That will happen next week; I hope you keep your fingers crossed for me, hutch.

Gunther
Forgive your enemies, but never forget their names.

clive

Is it working right?

Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---
It could be a random act of randomness. Those happen a lot as well.

KeepingRealBusy

Interesting that it only fails on one of Lingo's machines.

Antariy

Lingo's latest code:

2943    cycles for DotXMM1Acc4E
2816    cycles for DotXMM1Acc4EJ1
2828    cycles for DotXMM1Acc4EJ2
2816    cycles for AxDotXMM1_fastcall
2266    cycles for DotXMM2Acc16ELingo
1878    cycles for DotXMM2Acc32ELingo
1812    cycles for DotXMM2Acc16EPaul

2908    cycles for DotXMM1Acc4E
2811    cycles for DotXMM1Acc4EJ1
2815    cycles for DotXMM1Acc4EJ2
2794    cycles for AxDotXMM1_fastcall
2278    cycles for DotXMM2Acc16ELingo
1842    cycles for DotXMM2Acc32ELingo
1803    cycles for DotXMM2Acc16EPaul

2903    cycles for DotXMM1Acc4E
2819    cycles for DotXMM1Acc4EJ1
2836    cycles for DotXMM1Acc4EJ2
2822    cycles for AxDotXMM1_fastcall
2270    cycles for DotXMM2Acc16ELingo
1874    cycles for DotXMM2Acc32ELingo
1799    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656





Alex

Antariy

I don't have any desire to delve into Lingo's code, but maybe the incorrect results on Core come from this place:

movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]


As far as I know, Core and later CPUs have a changed engine that watches for reads after writes and can execute out of order in two ways: when the location that is read further down has already been written, and when it has not.

Maybe mixing SSE and FPU code does not work well with this engine? I see in the disassembly that there is no WAIT instruction before the FLD.



Alex

dioxin

redskull,

Quote
Then the math isn't quite that simple.

Yes it is.
   Whatever CPU Lingo has, the conversion from seconds to cycles is a straightforward equation.
   Seconds = Cycles / Clk Frequency.

   Nothing else in his CPU matters.

Paul.
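
The equation above can be written out as a minimal C sketch. The function names `cycles_to_seconds` and `seconds_to_cycles` are made up for illustration, and the sketch assumes the cycle counter ticks at the stated clock frequency for the whole measurement.

```c
#include <assert.h>

/* Seconds = Cycles / Clk Frequency (clock_hz in Hz; hypothetical helper name). */
double cycles_to_seconds(double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles / clock_hz;
}

/* The same equation rearranged: Cycles = Seconds * Clk Frequency. */
double seconds_to_cycles(double seconds, double clock_hz)
{
    assert(clock_hz > 0.0);
    return seconds * clock_hz;
}
```

For example, `cycles_to_seconds(1038.0, 3.0e9)` works out to about 346 nanoseconds for a 1038-cycle run on a 3.00 GHz part.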

jj2007

Quote from: hutch-- on September 02, 2010, 10:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Cycles may be cute, and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.

No need to cry, Hutch :lol
The timings with MichaelW's macros can be problematic for very small pieces of code, but for algos of this kind (180 bytes for Lingo's algo, 456 for my JE2, both over 2,000 cycles) they yield quite realistic results.

Quote
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2575    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2071    cycles for DotXMM1Acc4EJ2
2543    cycles for AxDotXMM1_fastcall
2139    cycles for DotXMM2Acc16ELingo
2079    cycles for DotXMM2Acc32ELingo
2103    cycles for DotXMM2Acc16EPaul

Congrats, Lingo. Even on my CPU you are close now :U

hutch--

Lingo's dotpro18 results.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1574    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1051    cycles for DotXMM2Acc16EPaul

1561    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1051    cycles for DotXMM2Acc16EPaul

1563    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1563    cycles for AxDotXMM1_fastcall
1052    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

jj2007

Quote from: clive on September 02, 2010, 11:22:16 PM
Is it working right?

Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---

Where did you see that one? Can't find that post...

clive

Quote from: jj2007
Where did you see that one? Can't find that post...

It was from post #74, but it's been changed.

jj2007

Quote from: clive on September 03, 2010, 12:24:59 AM
Quote from: jj2007
Where did you see that one? Can't find that post...

It was from post #74, but it's been changed.

OK. Here is DotPro18 with code sizes added.
78       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
60       bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
183      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul


Alex, have you tried unrolling a little bit?

dioxin

DotPro18:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2754    cycles for DotXMM1Acc4E
2155    cycles for DotXMM1Acc4EJ1
2155    cycles for DotXMM1Acc4EJ2
2137    cycles for AxDotXMM1_fastcall
1197    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul

2158    cycles for DotXMM1Acc4E
2154    cycles for DotXMM1Acc4EJ1
2154    cycles for DotXMM1Acc4EJ2
2133    cycles for AxDotXMM1_fastcall
1195    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul

2159    cycles for DotXMM1Acc4E
2154    cycles for DotXMM1Acc4EJ1
2130    cycles for DotXMM1Acc4EJ2
2131    cycles for AxDotXMM1_fastcall
1195    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---


hutch--

JJ,

For algos of this type, the data needs to be streamed to get viable results. What I would suggest is to set up 100 MB of data, load it into memory and stream it to get timings. The advantage of a large source is that it does not fit into cache, so you avoid cache effects that are not present when the algo gets used in practice, i.e. in real time.
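
A rough C sketch of that kind of streamed timing follows. It is not the test bed used in this thread: `dot_scalar` is a plain scalar stand-in for whichever algo is under test, `time_streamed_dot` is a hypothetical helper name, and `clock()` is used only because it is portable, so its resolution is coarse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Plain scalar dot product; a stand-in for the algo being timed. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/*
 * Allocate two arrays of n floats, fill them, run a single pass over
 * both and return the elapsed CPU time in seconds.
 * Returns -1.0 if the allocation fails.
 */
double time_streamed_dot(size_t n, float *result)
{
    assert(n > 0 && result != NULL);

    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    clock_t start = clock();
    *result = dot_scalar(a, b, n);   /* one pass, no repeats over warm data */
    clock_t stop = clock();

    free(a);
    free(b);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

With `n = 13107200` each array is 50 MB, so the two together give about 100 MB, far more than the caches of the CPUs listed in this thread can hold.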

dioxin

Timings from Gunther's latest post in Reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.91 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.16 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 1.77 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Please press enter to terminate...
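
The "accumulators" and "elements per loop cycle" in the labels above refer to unrolling the loop and keeping several independent partial sums. The idea can be sketched in scalar C as below. This is only an illustration, not Dioxin's or Lingo's SSE2 code, and `dot_4acc` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product unrolled by four, with four independent accumulators. */
float dot_4acc(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Each accumulator forms its own dependency chain, so the four
       additions in one iteration do not have to wait for each other. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because the partial sums are combined in a different order than in a single-accumulator loop, floating-point results can differ in the last bits between variants on general data.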

jj2007

A variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.