Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM

Gunther

Quote from: jj2007 on September 01, 2010, 12:25:07 AM
The latter is pretty useless, and I feel guilty about it.

Jochen,

the side track isn't useless. It's a very interesting question, but I think it deserves its own thread.

Gunther
Forgive your enemies, but never forget their names.

Antariy

Hi!

I remade Gunther's simple SSE2 code using the loop construction I suggested. It is implemented as __fastcall, with __stdcall and __cdecl supported via thunks.
The rest of the code is the same (final repacking and addition).

Gunther, this code is in MASM32 format; I have no experience with GCC, sorry!

The code is too big to put into the post, so you can get it from DotProduct4_1.zip, attached to this post. The procedure is named AxDotXMM1_fastcall; the thunks are below it.
This archive is Jochen's original DotProduct4.zip with my code added.

The other archive is your old dotfloat.exe with my procedure included (__cdecl version).

To run the test, I was forced to patch your original (old) posted version of dotfloat.exe.
I replaced your simple SSE2 version (one accumulator) with my version of the code, and just ran the test.

These are the timings of the PATCHED version (with my simple version of the proc):

Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 31.49 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.14 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.49 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 6.14 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.01 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.09 Seconds




Timings for your ORIGINAL posted version:

Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 31.50 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.28 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.52 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 6.39 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.00 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.06 Seconds



Note: the overhead of the C++ loop that calls the dot product code is 2.3 seconds on my CPU.
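To make the effect of that overhead concrete, here is a minimal sketch of the arithmetic. It assumes the 2.3 seconds is a constant per-variant cost that can simply be subtracted from each elapsed time; the function names are hypothetical and are not part of dotfloat.exe.

```cpp
#include <cassert>

// Net time spent in the dot product routine itself, ASSUMING the cost of
// the calling loop is a constant that can be subtracted from the total.
double net_seconds(double elapsed, double loop_overhead) {
    assert(elapsed >= loop_overhead);
    return elapsed - loop_overhead;
}

// Relative gain of the patched routine over the original one, in percent.
double speedup_percent(double original_net, double patched_net) {
    return (original_net - patched_net) / original_net * 100.0;
}
```

With the "Dot Product 4" timings above (6.39 s original, 6.14 s patched), the net times come out at roughly 4.09 s and 3.84 s, so the gain is about 6% on the routine itself rather than the roughly 4% the raw numbers suggest.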


I'll attach an archive with the patched version of your executable, so you can test the new version of the loop right now.



Alex
P.S. Sorry for the patching. As I said, I have no experience with GCC and cannot add my code to the test in the normal way. And the sources do not seem to be compilable with MSVC, because of different name mangling, assembler syntax, etc.

Antariy

Timings for DotProduct4_1.zip (with Jochen's procs).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
2814    cycles for DotXMM1Acc4E
2716    cycles for DotXMM1Acc4EJ1
2798    cycles for DotXMM1Acc4EJ2
2707    cycles for AxDotXMM1_fastcall

2817    cycles for DotXMM1Acc4E
2714    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
2713    cycles for AxDotXMM1_fastcall

2805    cycles for DotXMM1Acc4E
2726    cycles for DotXMM1Acc4EJ1
2800    cycles for DotXMM1Acc4EJ2
2719    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---


Note: strangely enough, on my system the call through the stdcall thunk is faster than a direct call to the fastcall version.



Alex

jj2007

Hi Alex,
It seems the Celeron M doesn't like it... sorry...
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2544    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2074    cycles for DotXMM1Acc4EJ2
2541    cycles for AxDotXMM1_fastcall

2517    cycles for DotXMM1Acc4E
2166    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2
2539    cycles for AxDotXMM1_fastcall

2511    cycles for DotXMM1Acc4E
2168    cycles for DotXMM1Acc4EJ1
2078    cycles for DotXMM1Acc4EJ2
2539    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200

Antariy

Quote from: jj2007 on September 01, 2010, 10:08:41 PM
Hi Alex,
It seems the Celeron M doesn't like it... sorry...

:P

It doesn't matter :)

What are the timings for Gunther's patched exe?
Post them, please! I made it to be run, not just to be attached to the post, so that I can get results as well :)



Alex

jj2007

Here they are ;-)

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.64 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.58 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 7.55 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.25 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.72 Seconds

Antariy

Quote from: jj2007 on September 01, 2010, 10:19:03 PM
Here they are ;-)

So, as always, nothing to say... Any code optimized for my CPU seems to be anti-optimized for the others :(
A good chance for someone to make his stupid remarks about an "archaic CPU", etc.   :toothy
But that doesn't matter either  :green2



Alex

hutch--

Dotproduct result.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1824    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall

1561    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall

1561    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---


The other.


Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.14 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

clive

From the netbook, for chuckles..

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6068    cycles for DotXMM1Acc4E
3240    cycles for DotXMM1Acc4EJ1
3542    cycles for DotXMM1Acc4EJ2
5300    cycles for AxDotXMM1_fastcall

4428    cycles for DotXMM1Acc4E
3339    cycles for DotXMM1Acc4EJ1
3537    cycles for DotXMM1Acc4EJ2
5405    cycles for AxDotXMM1_fastcall

4292    cycles for DotXMM1Acc4E
3328    cycles for DotXMM1Acc4EJ1
3520    cycles for DotXMM1Acc4EJ2
5285    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 48.08 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 45.88 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 9.44 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 16.88 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 12.69 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 5.81 Seconds
It could be a random act of randomness. Those happen a lot as well.

lingo

"In practice, it leads to the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising."

You can try my shorter variant too: :U
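OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 3 Dup(0)                         ; 3 padding bytes: the 29 bytes of setup code then put the loop label on a 16-byte boundary
DotXMM2Acc16ELingo proc srcX, srcY, counter
    pxor    xmm5, xmm5              ; second accumulator = 0
    mov     eax, [esp+1*4]          ; eax -> srcX
    pxor    xmm6, xmm6              ; pending product = 0
    mov     edx, [esp+2*4]          ; edx -> srcY
    movaps  xmm0, [eax]             ; preload X[0..3]
    pxor    xmm4, xmm4              ; first accumulator = 0
    mov     ecx, [esp+3*4]          ; ecx = counter (number of floats)
    sub     eax, edx                ; eax = srcX - srcY, so [eax+edx] addresses X
@@:
    movaps  xmm1, [eax+edx+16]      ; X[4..7]
    mulps   xmm0, [edx]             ; X[0..3] * Y[0..3]
    addps   xmm5, xmm6              ; add the product left pending by the previous pass
    movaps  xmm3, [eax+edx+32]      ; X[8..11]
    mulps   xmm1, [edx+16]          ; X[4..7] * Y[4..7]
    addps   xmm4, xmm0
    movaps  xmm6, [eax+edx+48]      ; X[12..15]
    mulps   xmm3, [edx+32]          ; X[8..11] * Y[8..11]
    add     edx, 64                 ; advance by 16 floats
    addps   xmm5, xmm1
    movaps  xmm0, [eax+edx]         ; preload X[0..3] of the next block (also executed on the last pass)
    mulps   xmm6, [edx+48-64]       ; X[12..15] * Y[12..15], left pending
    addps   xmm4, xmm3
    sub     ecx, 16
    ja      @b                      ; loop while elements remain
    addps   xmm4, xmm6              ; add the last pending product
    addps   xmm4, xmm5              ; combine both accumulators
    movhlps xmm0, xmm4              ; upper two lanes -> lower two lanes of xmm0
    addps   xmm4, xmm0              ; lanes 0 and 1 now hold the two partial sums
    pshufd  xmm0, xmm4, 1           ; lane 1 -> lane 0 of xmm0
    addss   xmm4, xmm0              ; lane 0 = final dot product
    movss   dword ptr [esp+2*4], xmm4
    fld     dword ptr [esp+2*4]     ; return the result in ST(0)
    ret     3*4                     ; stdcall: pop the three arguments
DotXMM2Acc16ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef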
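For readers who follow C++ more easily than MASM, the routine above can be approximated with SSE intrinsics. This is only a sketch of the same idea (two accumulators, 16 floats per pass, then a horizontal sum), not a translation of Lingo's instruction scheduling. The name dot_2acc_16e is hypothetical, the code targets x86/x86-64 only, and it uses unaligned loads and _mm_shuffle_ps where the assembly uses movaps and pshufd.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE/SSE2 intrinsics (x86 and x86-64 only)

// Hypothetical helper: dot product of x and y over n floats.
// ASSUMPTION: n is a multiple of 16, as in the assembly routine.
float dot_2acc_16e(const float* x, const float* y, std::size_t n) {
    assert(n % 16 == 0);
    __m128 acc0 = _mm_setzero_ps();  // plays the role of xmm4
    __m128 acc1 = _mm_setzero_ps();  // plays the role of xmm5

    for (std::size_t i = 0; i < n; i += 16) {
        // Four independent multiplies per pass, alternating accumulators
        // so consecutive additions do not wait on each other.
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),
                                           _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),
                                           _mm_loadu_ps(y + i + 4)));
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i + 8),
                                           _mm_loadu_ps(y + i + 8)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 12),
                                           _mm_loadu_ps(y + i + 12)));
    }

    // Horizontal sum: same steps as movhlps / addps / shuffle / addss.
    __m128 sum = _mm_add_ps(acc0, acc1);
    __m128 hi  = _mm_movehl_ps(sum, sum);        // lanes 2,3 -> lanes 0,1
    sum = _mm_add_ps(sum, hi);                   // lanes 0,1 hold partial sums
    __m128 l1  = _mm_shuffle_ps(sum, sum, 1);    // lane 1 -> lane 0
    sum = _mm_add_ss(sum, l1);                   // lane 0 = total
    return _mm_cvtss_f32(sum);
}
```

Unlike the assembly, this version does not preload the next block, so it never reads past the end of x. Whether a compiler schedules it as well as the hand-written loop is something only timings on the target CPU can tell.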

clive

And with Lingo's XMM2 spanking everyone

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
5739 cycles for DotXMM1Acc4E
3250 cycles for DotXMM1Acc4EJ1
3534 cycles for DotXMM1Acc4EJ2
5290 cycles for AxDotXMM1_cdecl
5253 cycles for AxDotXMM1_fastcall
1899 cycles for DotXMM2Acc16ELingo

4239 cycles for DotXMM1Acc4E
3289 cycles for DotXMM1Acc4EJ1
3488 cycles for DotXMM1Acc4EJ2
5278 cycles for AxDotXMM1_cdecl
5502 cycles for AxDotXMM1_fastcall
2903 cycles for DotXMM2Acc16ELingo

5959 cycles for DotXMM1Acc4E
4370 cycles for DotXMM1Acc4EJ1
4629 cycles for DotXMM1Acc4EJ2
6626 cycles for AxDotXMM1_cdecl
6223 cycles for AxDotXMM1_fastcall
1945 cycles for DotXMM2Acc16ELingo


The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
The result for AxDotXMM1_fastcall: 2867507200
The result for DotXMM2Acc16ELingo: 2867507200
--- done ---

dedndave

Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz (SSE4)
1621    cycles for DotXMM1Acc4E
1596    cycles for DotXMM1Acc4EJ1
1573    cycles for DotXMM1Acc4EJ2
1587    cycles for AxDotXMM1_cdecl
1699    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo

1568    cycles for DotXMM1Acc4E
1564    cycles for DotXMM1Acc4EJ1
1678    cycles for DotXMM1Acc4EJ2
1611    cycles for AxDotXMM1_cdecl
1598    cycles for AxDotXMM1_fastcall
1052    cycles for DotXMM2Acc16ELingo

1566    cycles for DotXMM1Acc4E
1559    cycles for DotXMM1Acc4EJ1
1582    cycles for DotXMM1Acc4EJ2
1601    cycles for AxDotXMM1_cdecl
1611    cycles for AxDotXMM1_fastcall
1063    cycles for DotXMM2Acc16ELingo

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1574    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1553    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo

1560    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo

1560    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1553    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo


The result for DotXMM1Acc4E:             2867507200
The result for DotXMM1Acc4EJ1:           2867507200
The result for DotXMM1Acc4EJ2:           2867507200
The result for AxDotXMM1_cdecl:          2867507200
The result for AxDotXMM1_fastcall:       2867507200
The result for DotXMM2Acc16ELingo:       2867507200
--- done ---

KeepingRealBusy

Alex,

Here are my P4 timings for DotProducts:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
2864    cycles for DotXMM1Acc4E
2618    cycles for DotXMM1Acc4EJ1
2235    cycles for DotXMM1Acc4EJ2
2527    cycles for AxDotXMM1_fastcall

2525    cycles for DotXMM1Acc4E
2488    cycles for DotXMM1Acc4EJ1
2367    cycles for DotXMM1Acc4EJ2
2528    cycles for AxDotXMM1_fastcall

2498    cycles for DotXMM1Acc4E
2483    cycles for DotXMM1Acc4EJ1
2220    cycles for DotXMM1Acc4EJ2
2536    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions.


Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 17.34 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 11.42 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.69 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.02 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 3.69 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 3.33 Seconds


Dave

jj2007

Quote from: lingo on September 02, 2010, 12:29:41 AM
You can try my shorter variant too: :U

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2548    cycles for DotXMM1Acc4E
2163    cycles for DotXMM1Acc4EJ1
2072    cycles for DotXMM1Acc4EJ2
2540    cycles for AxDotXMM1_fastcall
2133    cycles for DotXMM2Acc16ELingo

The result returned by Lingo's algo is correct! If you want to test yourself, activate UseMB in line 1.

On my archaic Prescott, Lingo's code is actually a bit faster:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3652    cycles for DotXMM1Acc4E
2778    cycles for DotXMM1Acc4EJ1
2779    cycles for DotXMM1Acc4EJ2
2754    cycles for AxDotXMM1_fastcall
2173    cycles for DotXMM2Acc16ELingo