The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: Gunther on August 26, 2010, 05:20:06 PM

Title: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 26, 2010, 05:20:06 PM
The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277

I've finished my little test suite and attached the appropriate archive. The application runs under the following 32-bit operating systems: Windows, Linux, FreeBSD and Intel-based Mac OS X. The archive includes all source code and the ready-to-run Win32 program for users who haven't installed GCC yet. Also included is a batch file to build the application from the sources. The shell script is for Unix users and does the same (there shouldn't be a problem, because GCC is installed by default as the system compiler). The readme.txt contains a short description of every file. The source code is well commented and should be self-explanatory.

The software is at an experimental stage; nothing is final or the last word. I've coded only the special case where the array size is divisible by 16, just to show the principle. A more generic implementation has to check that, of course. The program does not do much error handling, but it checks for SSE2 support at run time, so your machine won't crash if SSE2 isn't available. Any advice and help to improve the software is welcome. The same goes for testing the program with other processors and environments (please have a look at the results.pdf file).

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 26, 2010, 06:32:46 PM
From DotAsm.asm:
ALIGN 16
.loop: movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
addps xmm7,xmm0 ;sum up
lea edx,[edx+16] ;update pointers
sub ecx,byte 16 ;count down
jnz .loop


Well done... and challenging. There is not much to improve at first sight ::)
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 26, 2010, 09:59:29 PM
Hallo jj2007,

Quote from: jj2007 Today at 07:32:46 PM
Well done... and challenging.

I hope it's well done. But challenging? To be honest, I've been filing away at these procedures for a few days, so I used some so-called dirty assembly language tricks, nothing more.

Quote from: jj2007 Today at 07:32:46 PM
There is not much to improve at first sight

I'm not sure. As I wrote in the cited Campus thread: that function is called over 30 million times for a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large: either 256, 64 or 16 elements, depending on the partition depth of the algorithm. The trickier thing is that the dot product must be calculated not once, but eight times. That has to do with rotations and mirroring of data (not on the screen, only in memory). So every saved nanosecond counts, because these nanoseconds add up to microseconds, the microseconds add up to milliseconds ... etc. That makes the speed difference.

I'm thinking about using fastcall, because passing parameters via registers is faster. But here comes the bad news: with the release of GCC 4.3, the GNU world changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc-4.3/changes.html.

Quote from: GCC 4.3 Release Series, Changes, New Features, and Fixes
Fastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers.

Very nice. And why, and for what should one use fastcall?

On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I still have to test whether haddps is better, but that's an SSE3 instruction. A lot of questions ... By the way, did you run the program? Do you have some timings? Thank you.

Gunther 
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 26, 2010, 10:22:05 PM
Hi Gunther,
Timings are easy to get: :bg
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.59 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.41 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.20 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.67 Seconds


As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result
Is there no room for parallelisation? Apologies if that is a dumb question, as I said: I am not a geek for SSE2....
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 26, 2010, 10:58:42 PM
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Quote from: jj2007 Today at 11:22:05 PM
I see drizz is already involved.

Yes, he made the excellent intrinsic code, so I gave credit.

Quote from: jj2007 Today at 11:22:05 PM
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but the function returns apparently only a REAL4 value on the FPU: fld dword [esp] ;load function result

The scalar product is by definition a real number. Given 2 vectors A and B of dimension n, we have:

Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n-1]*B[n-1]

Please, let me think a little while about parallelisation.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 26, 2010, 11:01:52 PM
Quote from: Gunther on August 26, 2010, 10:58:42 PM
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?

Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 GHz, 1 GB RAM.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 26, 2010, 11:42:03 PM
The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 27, 2010, 01:05:42 AM
It's usually faster for SSE to access memory in contiguous blocks, so don't access the vectors the way you do; change the order to something like this:
.loop:
movaps  xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
movaps  xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]


mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]


addps   xmm4,xmm0 ;sum up
addps   xmm5,xmm1 ;sum up
addps   xmm4,xmm2 ;sum up
addps   xmm5,xmm3 ;sum up

lea     edx,[edx+64] ;update pointers
sub     ecx,byte 64 ;count down
jnz .loop


Paul.

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 27, 2010, 01:34:00 AM
Thank you Paul. I'll check this and make another function. Re-arranging the loop instructions could be an improvement.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 27, 2010, 01:54:26 AM
Phenom II x4 3GHz.


Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.45 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.14 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: mineiro on August 27, 2010, 02:09:36 AM
dotfloat.exe > result.txt
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

Supported by Processor and installed Operating System:
------------------------------------------------------
   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time  = 17.20 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time  = 11.52 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time  = 4.28 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time  = 4.38 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time  = 3.61 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 3.28 Seconds

regards.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 27, 2010, 07:39:30 AM
Prescott P4, Win XP SP2:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 19.77 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 9.67 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.96 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.46 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on August 27, 2010, 07:41:57 AM
Gunther,

One request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it. It ran from the browser but closed before I could save the results.

These timings are on a 3 gig Core2 Quad.


Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.16 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 27, 2010, 08:26:55 AM
I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53      cycles for DotXMM1Acc4E <<< this value too high, Prescott P4 behaves badly in such testbeds
31      cycles for DotXMM1Acc4E <<< 31 seems to be the correct value
31      cycles for DotXMM1Acc4E
32      cycles for DotXMM1Acc4E
31      cycles for DotXMM1Acc4E

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: daydreamer on August 27, 2010, 08:53:15 AM
Why not make it support multiple cores by processing in multiple threads?
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 27, 2010, 10:38:45 AM
Gunther,
you have 8 SSE registers and you're only using 6 of them.
Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.
You'll have to experiment to see which alternate use of the unused registers gives the better results.

I get faster results using 4 accumulators:
addps xmm4,xmm0               'add the products
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3


Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 27, 2010, 11:18:28 PM
I have looked into the purpose of this code and modified the testbed accordingly:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2563    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2680    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

2540    cycles for DotXMM1Acc4E
2167    cycles for DotXMM1Acc4EJ

The result: 2867507200
The result: 2867507200


EDIT: I squeezed out a few cycles...

add ecx, edx ; slightly faster
; int 3 ; OPT_Olly 2
align 16

@@:
REPEAT 16 ; unrolling helps, but make sure that the count is divisible by the rep count!!
movaps xmm0, [eax+edx] ; xmm0 = X[i+3] X[i+2] X[i+1] X[i+0] ; xmm0: 4  3   2 1, 8765, 12 11 10 9, ... (high to low)
mulps xmm0, [edx] ; xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i] ; xmm0: 20 12 6 2, 72 56 42 30, 156 132 110 90,
lea edx, [edx+16] ; update pointers
addps xmm7, xmm0 ; sum up ; xmm7: 20 12 6 2, 92 68 48 32, 248 200 158 122, ...
ENDM
if 1
cmp edx, ecx ; faster, at least on Celeron M
jb @B
else
sub ecx, 16 ; count down
jg @B
endif
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 28, 2010, 11:37:24 AM
jj2007,
if you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero, so it ends without a compare instruction and still has only one loop counter; you then don't need to update (in your case) both edx and ecx.
!mov ecx,2047                  'loop counter for 2048 elements to get

!movups xmm4,z                 'zero the 4 accumulators (z predefined as 4x zero single)
!movaps xmm5,xmm4
!movaps xmm6,xmm4
!movaps xmm7,xmm4

!mov eax,aptr                 'get pointer to start of first array
!mov edx,bptr                 'get pointer to start of second array

!lea eax,[eax+4*ecx]          'offset pointers by size of array
!lea edx,[edx+4*ecx]
     
!neg ecx                      'negate counter so now pointer + 4*counter = start of array
                                'but also counter now counts up to zero


#ALIGN 16
lp:
!movaps xmm0,[eax+ecx*4]      'get 16 elements of first vector
!movaps xmm1,[eax+ecx*4+16]
!movaps xmm2,[eax+ecx*4+32]
!movaps xmm3,[eax+ecx*4+48]


!mulps xmm0,[edx+ecx*4]        'multiply by 16 elements of second vector
!mulps xmm1,[edx+ecx*4+16]
!mulps xmm2,[edx+ecx*4+32]
!mulps xmm3,[edx+ecx*4+48]


!addps xmm4,xmm0               'add the products
!addps xmm5,xmm1
!addps xmm6,xmm2
!addps xmm7,xmm3

             
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


!addps xmm4,xmm5                'add the partial sums
!addps xmm6,xmm7
!addps xmm4,xmm6



#IF %DEF (%sse3)
!haddps xmm4,xmm4   'sum pairs of results
!haddps xmm4,xmm4   'sum the sum of pairs of results
#ELSE
!movaps xmm1, xmm4
!shufps xmm4, xmm1, &hb1
!addps xmm4, xmm1
!movaps xmm1, xmm4
     

!shufps xmm4, xmm4, &h0a
!addps xmm4, xmm1
#ENDIF

'!movd sum!,xmm4     'store the dot product
!movups z1,xmm4

!emms

The above PowerBASIC ASM format version runs 5,000,000 loops in 1.07s or a single loop in 645 clks ave. on a 3GHz Phenom II x4.

Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 28, 2010, 03:18:22 PM
Quote from: dioxin on August 28, 2010, 11:37:24 AM
jj2007,
if you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero, so it ends without a compare instruction and still has only one loop counter; you then don't need to update (in your case) both edx and ecx.

Paul,
You know about conditional assembly? "If 1" means "do this, ignore the else part".
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 28, 2010, 03:38:19 PM
jj2007,
yes, I know about that, but if you look at your code you have:

Either:
lea edx, [edx+16]
..
cmp edx, ecx ; faster, at least on Celeron M
jb @B

OR
lea edx, [edx+16]
..
sub ecx, 16 ; count down
jg @B


Either way it's 3 instructions to control the loop.

Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dedndave on August 28, 2010, 04:10:59 PM
i think i would use JC   :bg
or maybe JNS

and - no need to....
mov ecx,2047
.
.
.
neg ecx

why not just
mov ecx,-2047
or
LoopCount    EQU 2047
LoopCountVal EQU -LoopCount
.
.
.
mov ecx,LoopCountVal
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 28, 2010, 05:23:05 PM
Quote from: dioxin on August 28, 2010, 03:38:19 PM
jj2007,
yes, I know about that, but if you look at your code you have:
...
Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16                     'next block
!js short lp                    'ends when sign goes +ve


Paul,
Your code needs one instruction less, and it is a good and recommended way of speeding up a loop. Whether it really helps still needs to be demonstrated. Just add it to the testbed above.

My post was a reaction to
Quote from: dioxin on August 28, 2010, 11:37:24 AM
you don't need to update (in your case) both edx and ecx.

That is simply not correct. Only edx is being updated in the loop:
cmp edx, ecx ; faster, at least on Celeron M
jb @B


EDIT: Since these debates are always fun, here is a testbed "Loop_Art":
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1036    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128

1035    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128


I guess results will differ wildly by CPU :bg
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dedndave on August 28, 2010, 06:40:37 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3823    cycles for neg ecx, TheCt=128
2009    cycles for cmp edx, ecx, TheCt=128
1066    cycles for dedndave, TheCt=128

3792    cycles for neg ecx, TheCt=128
1909    cycles for cmp edx, ecx, TheCt=128
1036    cycles for dedndave, TheCt=128
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 28, 2010, 07:32:01 PM
You cheat :dance:
@@: mov ebx,[eax+edx]
inc TheCt
mov [edx],ebx
add edx, 16
cmp edx, ecx
jb @B

@@: nop
add edx, 16
cmp edx, ecx
jb @B

-99 cycles for Lingo :lol

Surprisingly, your code runs exactly as fast as mine:
Quote:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
1036    cycles for neg ecx, TheCt=128
843     cycles for cmp edx, ecx, TheCt=128
842     cycles for dedndave, TheCt=128

Which means a push to mem/pop from mem pair is as fast as two moves on my Celeron.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 28, 2010, 09:06:34 PM
"-99 cycles for Lingo"

Cretin,
why don't you save the SSE and ECX registers? :lol
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 29, 2010, 12:43:04 AM
So guys, I'm back here. Please excuse the delay, but this weekend is school start here and I was involved a little bit. That was the simple reason.

Special thanks for all the test results and the whole bunch of hints.

Quote from: dioxin August 27, 2010, 11:38:45 am
you have 8 SSE registers and you're only using 6 of them. Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.

That doesn't sound bad. I've tested it with the following code:


_DotXMM4Acc16E:

pxor    xmm4, xmm4 ;sums are in xmm4, xmm5, xmm6 and xmm7
mov     ecx,[esp+12] ;ecx = n
mov     eax,[esp+4] ;eax -> X
pxor    xmm5,xmm5
mov     edx,[esp+8] ;edx -> Y
shl     ecx,2 ;ecx = 4*n (float)
pxor xmm6,xmm6
sub     esp,4 ;stack space for function result
sub eax,edx ;saves 1 lea instruction inside the main loop
pxor xmm7,xmm7

ALIGN 16

.loop:

movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
addps xmm4,xmm0 ;sum up
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
lea     edx,[edx+64] ;update pointers
sub     ecx,byte 64 ;count down
jnz .loop
addps xmm4,xmm5 ;add accumulators
addps xmm6,xmm7
addps xmm4,xmm6 ;sum in xmm4
movhlps xmm0,xmm4 ;get bits 64 - 127 from xmm4
addps   xmm4,xmm0 ;sum in 2 dwords
pshufd  xmm0,xmm4,1 ;get bits 32 - 63 from xmm4
addss   xmm4,xmm0 ;sum in 1 dword
movss   [esp],xmm4 ;store sum
fld     dword [esp] ;load function result
add     esp,byte 4 ;adjust stack
ret


It's probably faster on your Phenom, but slower on my Athlon X2. That has to do with the limited throughput of the floating point adder on older AMD chips. Intel chips still have to be tested. That's not a big deal; there's always CPUID and different code paths (one for older AMD, the other probably for Intel and newer AMD). Let me know if your code is similar; if so, I'll work it into the archive and update it for appropriate testing. Thank you.

Quote from: dioxin August 28, 2010, 12:37:24 pm
if you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction

That's okay. Counting a loop down is mostly faster than counting up, but caching works better when counting up. What you're doing combines both: a counter that runs toward zero through negative offsets, with forward memory access for the cache. Good idea.

Quote from: jj2007  August 28, 2010, 06:23:05 pm
I guess results will differ wildly by CPU

Yes, Jochen, all the results differ wildly and are very surprising.

Quote from: hutch-- August 27, 2010, 08:41:57 am
One request on benchmarks, put a keyboard pause at the end of it so it does not have to be downloaded to run it.

It's done, but the archive isn't updated yet. I'll do that after Paul's answer.

Gunther

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 29, 2010, 02:19:41 AM
Hi,Gunther,
"I've tested that with the following code:"

Now you have two algos with different results: :wink
prev:
4E2C 4AE8 4E2C 0A98  4EAB 0AB4 4F2A EAB0
last:
4E2C 47F2 4E2C  0792  4EAB 0AB8 4F2A EAB0
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on August 30, 2010, 10:42:29 PM
Lingo,
   You misunderstand the "result".
   The result of a dot product is a scalar (one value). With SINGLE types an xmm register can hold 4 values so 3 of them will be irrelevant and only 1 will be the required answer.

   The same scalar value, 2867507200, is being returned for all code so clearly they all work and give the same answer.

   That value, 2867507200 decimal, when stored as a SINGLE and displayed in hex would be 4F2A EAB0 which is shown in the low DWORD of both of your answers. The other DWORDs of that SSE register are not relevant as they were just intermediate results used to derive the answer, only the low DWORD, 4F2A EAB0, is returned as the answer.



   
Gunther,
I can't imagine why it would slow down on your CPU. The work being done is the same but there should be fewer register stalls. Worst case, I'd expect it to run at the same speed.
I might have access to an Athlon X2 sometime this week and if I get time I'll check it out.

Until then, you can still use the unused registers to fetch/multiply more data in each loop. It won't be a convenient power of 2 but it should be faster.




dedndave,
Quote:
i think i would use JC   :bg
or maybe JNS

Why? I think I used the correct branch.


Quote:
why not just..
Because I was writing it to be easily understood and to fit in with Gunther's original code.
It's easy enough to save a cycle or 2 afterwards once the main code (which is saving hundreds of clocks) is understood.

   

jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times, but your loop only increments by 16 bytes instead of 64. Note that in my code the counter increments by 16 but is always referenced as 4*ecx, not just ecx alone.

Paul.

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 30, 2010, 11:24:13 PM
"only the low DWORD, 4F2A EAB0, is returned as the answer."
You are right, thanks.. :U
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 30, 2010, 11:46:00 PM
Gunther,
I'm sure my algo will run faster on your CPU.
So, please, try it with your testbed... :U

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 10 Dup(0)
DotXMM1Acc4ELingo proc srcX, srcY, counter
mov     eax,   [esp+1*4] ; eax->srcX
pxor    xmm5,  xmm5
mov     edx,   [esp+2*4] ; edx ->srcY
pxor    xmm4,  xmm4
mov     ecx,   [esp+3*4] ; ecx = counter
sub     eax,   edx
@@:
movaps  xmm0, [eax+edx]
movaps  xmm1, [eax+edx+16]
mulps xmm0, [edx]
movaps  xmm2, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm4, xmm0
movaps  xmm3, [eax+edx+48]
mulps xmm2, [edx+32]
addps xmm5, xmm1
movaps  xmm6, [eax+edx+64]
mulps xmm3, [edx+48]
addps xmm4, xmm2
movaps  xmm7, [eax+edx+80]
mulps xmm6, [edx+64]
addps xmm5, xmm3
movaps  xmm0, [eax+edx+96]
mulps xmm7, [edx+80]
addps xmm4, xmm6
movaps  xmm1, [eax+edx+112]
mulps xmm0, [edx+96]
addps xmm5, xmm7
mulps xmm1, [edx+112]
addps xmm4, xmm0
add edx,  128
addps xmm5, xmm1
sub    ecx,  32
ja @b
addps xmm4, xmm5
movhlps xmm1, xmm4
addps xmm4, xmm1
pshufd  xmm1, xmm4,1
addss xmm4, xmm1
movss   dword ptr [esp+2*4], xmm4
fld     dword ptr [esp+2*4]
ret     4*3
DotXMM1Acc4ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 31, 2010, 01:40:12 AM
Quote from: lingo August 31, 2010 12:46:00 AM
I'm sure my algo will run faster on your CPU.

Thank you for your reply. Of course, I'll test your code and let you know the results.

Quote from: dioxin August 30, 2010, 11:42:29 pm
I can't imagine why it would slow down on your CPU. The work being done is the same but there should be fewer register stalls.

Yes, of course, but there is a limitation in the throughput of the floating point adder on older AMD chips (only 64 bits wide). Your Phenom or an Intel Core2 processor has a 128-bit wide floating point adder that can handle a whole vector in one operation. This makes the difference.

But never mind. In the original program, which is the one to be optimized, I'll insert 2 code paths: one for older AMD chips, the other for Intel and the fancy new Phenom. Which way to go can be detected at run time with CPUID.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 31, 2010, 07:35:34 AM
Quote from: dioxin on August 30, 2010, 10:42:29 PM
jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times, but your loop only increments by 16 bytes instead of 64. Note that in my code the counter increments by 16 but is always referenced as 4*ecx, not just ecx alone.

Paul.

That's correct, sorry for my sloppiness :thumbu
Modified code attached. I disabled Dave's code because it uses a faster "fill code", which makes the results not comparable.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1853    cycles for neg ecx, TheCt=128
1839    cycles for cmp edx, ecx, TheCt=128

1842    cycles for neg ecx, TheCt=128
1836    cycles for cmp edx, ecx, TheCt=128

1837    cycles for neg ecx, TheCt=128
1838    cycles for cmp edx, ecx, TheCt=128
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Rockoon on August 31, 2010, 09:05:32 AM
AMD Phenom(tm) II X6 1055T Processor (SSE3)
659     cycles for neg ecx, TheCt=128
919     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
919     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on August 31, 2010, 09:11:31 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
914     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

913     cycles for neg ecx, TheCt=128
828     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128


--- ok ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 31, 2010, 09:25:30 AM
Very different indeed. Let's test with another "fill code":
Quote
            neg ecx
            and TheCt, 0
            align 16
      @@:
            mov ebx, [eax+ecx*4]
            inc TheCt
            mov [edx+ecx*4], ebx
            add ecx, 4
            js short @B
Quote
            lea ecx, [edx+ecx]
            and TheCt, 0
            sub eax, edx
            align 16
      @@:
            mov ebx, [eax+edx]
            inc TheCt
            mov [edx], ebx
            add edx, 16
            cmp edx, ecx
            jb short @B

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
997     cycles for neg ecx, TheCt=128
986     cycles for cmp edx, TheCt=128

979     cycles for neg ecx, TheCt=128
988     cycles for cmp edx, TheCt=128
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 31, 2010, 03:03:41 PM
Quote from: lingo August 31, 2010, at 12:46:00 AM
Gunther,
I'm sure my algo will run faster on your CPU.
So, please, try it with your testbed...   :U

At first glance, it should be faster, because you've unrolled the loop and your code now processes 32 elements per loop iteration. In practice, it takes the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising.

I'll update the archive. Your procedure will be renamed to DotXMM2Acc32ELingo (2 accumulators, 32 elements per loop iteration), and I'll give credit, of course. I'll post a message after updating the archive.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 31, 2010, 03:44:08 PM
With modified testbed. J1 is short loop, J2 is using more xmm registers as shown below:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3566    cycles for DotXMM1Acc4E
2739    cycles for DotXMM1Acc4EJ1
2821    cycles for DotXMM1Acc4EJ2

2782    cycles for DotXMM1Acc4E
2714    cycles for DotXMM1Acc4EJ1
2806    cycles for DotXMM1Acc4EJ2

2784    cycles for DotXMM1Acc4E
2761    cycles for DotXMM1Acc4EJ1
2844    cycles for DotXMM1Acc4EJ2


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2548    cycles for DotXMM1Acc4E
2165    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2

2546    cycles for DotXMM1Acc4E
2165    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2

@@:
REPEAT repct/4 ; unrolling helps, but make sure that the count is divisible by the rep count!!
movaps xmm0, [eax+edx]
movaps xmm1, [eax+edx+16]
movaps xmm2, [eax+edx+32]
movaps xmm3, [eax+edx+48]
mulps xmm0, [edx]
mulps xmm1, [edx+16]
mulps xmm2, [edx+32]
mulps xmm3, [edx+48]
lea edx, [edx+64]
addps xmm7, xmm0 ; sum up
addps xmm7, xmm1
addps xmm7, xmm2
addps xmm7, xmm3
ENDM

if UseCmp
cmp edx, ecx ; faster, at least on Celeron M
jb @B
else
sub ecx, 16*repct ; count down
jg @B
endif
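For readers who do not read MASM, the shape of the unrolled loop above can be sketched in C. This is a scalar illustration only, not the posted routine: `dot_unrolled` and `UNROLL` are hypothetical names, and each `float` here stands in for a whole XMM register of four floats.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4  /* elements handled per loop pass (hypothetical choice) */

static float dot_unrolled(const float *x, const float *y, size_t n)
{
    /* As the REPEAT comment warns: the count must be a multiple
       of the unroll factor, otherwise the loop overruns the arrays. */
    assert(n % UNROLL == 0);

    const float *end = y + n;   /* plays the role of ecx in the UseCmp variant */
    float sum = 0.0f;           /* the single accumulator (xmm7 in the listing) */

    while (y < end) {           /* cmp edx, ecx / jb @B */
        /* All products first, then the additions, as in the posted loop. */
        float p0 = x[0] * y[0];
        float p1 = x[1] * y[1];
        float p2 = x[2] * y[2];
        float p3 = x[3] * y[3];
        sum += p0;
        sum += p1;
        sum += p2;
        sum += p3;
        x += UNROLL;            /* lea edx, [edx+64] advances both arrays */
        y += UNROLL;
    }
    return sum;
}
```

Note that all four additions still go into the same accumulator, so they form one dependency chain; the unrolling mainly saves loop overhead.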
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 31, 2010, 03:59:51 PM
"Dave's algo does not use the same "fill code", so it's not comparable"

JJ's code is not comparable too... :(
or how to manipulate the people with stupid testing:

original JJ's stupid code:
1st algo:
push ebx
mov ecx, 2048/4                   ; 1st lame error here; Why 2048?  Why /4?
mov eax, offset Src             ; eax+2048 is 1 dword after the end of Src!
mov edx, offset Dest           ; edx+2048 is 1 dword after the end of Dest! 
lea eax, [eax+4*ecx]            ;  Why 4*ecx? Wow, coz we have /4 above!!!
lea edx, [edx+4*ecx]            ; the same stupidity again!!! 
neg ecx
and TheCt, 0
align 16
@@:
mov ebx, [eax+ecx*4]           ; multiply ecx by 4 + sum of 2registers and read
inc TheCt
mov [edx+ecx*4], ebx           ; multiply ecx by 4 + sum of 2registers and WRITE !!!
add ecx, 4
js short @B
pop ebx

and the loop of the 2nd algo:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov [edx], ebx                        ; just one register and WRITE !!!
add edx, 16
cmp edx, ecx
jb short @B

As you see the loops are not comparable!

To be comparable:
just change the 1st algo to:
push ebx
mov ecx, 2047
and TheCt, 0
lea   eax, [Src+ecx]
lea   edx, [Dest+ecx]
neg ecx
align 16
@@:
mov ebx, [eax+ecx]            ;sum of 2registers and read
inc TheCt
mov ebx, [edx+ecx]            ;sum of 2registers and read   
add ecx, 16
jle short @B
pop ebx


and change the 2nd algo too to:
push ebx
mov ecx, 2047
mov eax, offset Src
mov edx, offset Dest
lea ecx, [edx+ecx]
and TheCt, 0
sub eax, edx
align 16
@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read 
add edx, 16
cmp edx, ecx
jbe short @B
pop ebx

and voila, we have similar results ...

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

--- ok ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 31, 2010, 04:44:06 PM
Good job, Lingo :U
Unfortunately I had little time for this, so I am grateful that you took over. Now if you find some spare time, please try to speed up the DotXMM1Acc4EJ2 posted above.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 31, 2010, 05:33:16 PM
"same time as DotXMM2Acc16E on my machine (Win32 and Linux)"

On my pc i have more: Win7, Vista, XP and Mac OS-SnowLeopard
but I can't understand the relation between OS and speed optimization. :wink
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on August 31, 2010, 06:13:49 PM
Quote from: lingo August 31, 2010, at 06:33:16 PM
but I can't understand the relation between OS and speed optimization

I mentioned that only, because the application is compiled with two different compiler versions:

That makes a small difference under both operating systems.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: KeepingRealBusy on August 31, 2010, 07:26:25 PM
Quote from: lingo on August 31, 2010, 03:59:51 PM
"Dave's algo does not use the same "fill code", so it's not comparable"

JJ's code is not comparable too... :(
or how to manipulate the people with stupid testing:

original JJ's stupid code:
1st algo:
push ebx
mov ecx, 2048/4                   ; 1st lame error here; Why 2048?  Why /4?
mov eax, offset Src             ; eax+2048 is 1 dword after the end of Src!
mov edx, offset Dest           ; edx+2048 is 1 dword after the end of Dest! 
lea eax, [eax+4*ecx]            ;  Why 4*ecx? Wow, coz we have /4 above!!!
lea edx, [edx+4*ecx]            ; the same stupidity again!!! 
neg ecx
and TheCt, 0
align 16
@@:
mov ebx, [eax+ecx*4]           ; multiply ecx by 4 + sum of 2registers and read
inc TheCt
mov [edx+ecx*4], ebx           ; multiply ecx by 4 + sum of 2registers and WRITE !!!
add ecx, 4
js short @B
pop ebx

and the loop of the 2nd algo:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov [edx], ebx                        ; just one register and WRITE !!!
add edx, 16
cmp edx, ecx
jb short @B

As you see the loops are not comparable!

To be comparable:
just change the 1st algo to:
push ebx
mov ecx, 2047
and TheCt, 0
lea   eax, [Src+ecx]
lea   edx, [Dest+ecx]
neg ecx
align 16
@@:
mov ebx, [eax+ecx]            ;sum of 2registers and read
inc TheCt
mov ebx, [edx+ecx]            ;sum of 2registers and read   
add ecx, 16
jle short @B
pop ebx


and change the 2nd algo too to:
push ebx
mov ecx, 2047
mov eax, offset Src
mov edx, offset Dest
lea ecx, [edx+ecx]
and TheCt, 0
sub eax, edx
align 16
@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read 
add edx, 16
cmp edx, ecx
jbe short @B
pop ebx

and voila, we have similar results ...

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

--- ok ---


Lingo,

I'm sorry, but that code just won't work. You are not copying the data from src to dest, but just loading two values into ebx.

Also, wouldn't this be better? Instead of this:


@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read      (double load)
add edx, 16
cmp edx, ecx
jbe short @B


Use this:


@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
add edx, 16
inc TheCt
mov [edx-16], ebx                    ;just one register and write  (also fixes the double load)
cmp edx, ecx
jbe short @B


Dave.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: KeepingRealBusy on August 31, 2010, 07:33:45 PM
Lingo,

I have lost track of what is being done here (just got back from a trip), but the "add edx,16" doesn't seem right, unless you are just initializing a single dword in a 16 byte array entry.

Shouldn't that be:


add edx, 4
inc TheCt
mov [edx-4], ebx                    ;just one register and write  (also fixes the double load)
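The difference between the two strides can be made concrete with a small C sketch. It assumes a 2048-byte buffer, as the 2048/4 constant in the listings suggests; `copy_strided` is a hypothetical helper, and its return value plays the role of TheCt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy one dword every stride_bytes bytes and return how many
   dwords were copied (the role TheCt plays in the posted loops). */
static size_t copy_strided(uint32_t *dest, const uint32_t *src,
                           size_t total_bytes, size_t stride_bytes)
{
    assert(stride_bytes >= 4 && stride_bytes % 4 == 0);
    size_t count = 0;
    for (size_t off = 0; off < total_bytes; off += stride_bytes) {
        dest[off / 4] = src[off / 4];   /* one dword per pass */
        count++;
    }
    return count;
}
```

With a stride of 16 the loop runs 128 times over 2048 bytes, which matches the TheCt=128 in the timings, but only every fourth dword is copied; a stride of 4 would copy all 512 dwords.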


Dave.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on August 31, 2010, 09:26:05 PM
"You are not copying the data from src to dest"
Why to do this? :lol
"I have lost track of what is being done here"
It is jj's example so will be better to ask him.. :wink
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on August 31, 2010, 11:25:07 PM
Quote from: KeepingRealBusy on August 31, 2010, 07:33:45 PM
I have lost track of what is being done here

Hi Dave,
There are two parallel testing efforts here, the "real" one on speeding up the dot product, and a side track on loop optimisation with negative offsets. The latter is pretty useless, and I feel guilty about it. For those interested, see Agner Fog's optimizing_assembly.pdf, page 89/90:
Quote
It is possible to modify example 12.4a to make it count down rather than up, but the data
cache is optimized for accessing data forwards, not backwards. Therefore it is better to
count up through negative values from -n to zero. This is possible by making a pointer to
the end of the array and using a negative offset from the end of the array:

; Example 12.4b. For-loop with negative index from end of array
    mov  ecx, n            ; Load n
    lea  esi, Array[4*ecx] ; Point to end of array
    neg  ecx               ; i = -n
    jnl  LoopEnd           ; Skip if (-n) >= 0
LoopTop:
    ; Loop body: Add 1 to all elements in Array:
    add  dword ptr [esi+4*ecx], 1
    add  ecx, 1            ; i++
    js   LoopTop           ; Loop back if i < 0
LoopEnd:

A slightly different solution is to multiply n by 4 and count from -4*n to zero:

; Example 12.4c. For-loop with neg. index multiplied by element size
    mov  ecx, n            ; Load n
    shl  ecx, 2            ; n * 4
    jng  LoopEnd           ; Skip if (4*n) <= 0
    lea  esi, Array[ecx]   ; Point to end of array
    neg  ecx               ; i = -4*n
LoopTop:
    ; Loop body: Add 1 to all elements in Array:
    add  dword ptr [esi+ecx], 1
    add  ecx, 4            ; i += 4
    js   LoopTop           ; Loop back if i < 0
LoopEnd:

There is no difference in speed between example 12.4b and 12.4c, but the latter method is
useful if the size of the array elements is not 1, 2, 4 or 8 so that we cannot use the scaled
index addressing.
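The idea of example 12.4b can also be written in C, which may help when comparing against compiler output. This is a sketch only; `add_one_negative_index` is a hypothetical name, and a compiler is free to turn the loop into something quite different from the hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Add 1 to every element, walking forward through memory with an index
   that runs from -n up to 0 (the idea of example 12.4b). */
static void add_one_negative_index(int *array, ptrdiff_t n)
{
    assert(n >= 0);
    int *end = array + n;               /* point to the end of the array */
    for (ptrdiff_t i = -n; i < 0; i++)  /* loop back while i < 0 */
        end[i] += 1;                    /* end[i] is array[n + i] */
}
```

The loop condition is a plain test against zero, which is what lets the assembly version drop the separate compare and branch directly on the sign flag after the add.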
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 01, 2010, 01:03:05 PM
Quote from: jj2007 on September 01, 2010, at 12:25:07 AM
The latter is pretty useless, and I feel guilty about it.

Jochen,

the side track isn't useless. It's a very interesting question, but I think it deserves its own thread.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 01, 2010, 09:28:15 PM
Hi!

I remade Gunther's simple SSE2 code using my suggested loop construction. It is implemented as __fastcall, with __stdcall and __cdecl supported via thunks.
The rest of the code is the same (final repacking and addition).

Gunther, this code is in MASM32 format; I have no experience with GCC, sorry!

The code is too big to put into the post, so you can get it from DotProduct4_1.zip, attached to this post. The procedure is named AxDotXMM1_fastcall; the thunks are below it.
This archive is Jochen's original DotProduct4.zip with my code added.

The other archive is your old dotfloat.exe with my procedure included (__cdecl version).

To run the test, I was forced to patch your original (old) posted version of dotfloat.exe.
I replaced your simple one-accumulator SSE2 version with my version of the code and just ran the test.

These are the timings of the PATCHED version (with my simple version of the proc):

Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 31.49 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.14 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.49 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 6.14 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.01 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.09 Seconds




Timings for your ORIGINAL posted version:

Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 31.50 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.28 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.52 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 6.39 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.00 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.06 Seconds



Note: the overhead of the C++ loop that calls the dot product code is 2.3 seconds on my CPU.
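One way to read that note is to subtract the loop overhead from the measured times before comparing them. The helper below is a hypothetical sketch of that arithmetic, not part of the test program, and it assumes the 2.3 seconds apply equally to every variant.

```c
#include <assert.h>

/* Measured wall time minus the fixed cost of the calling loop. */
static double net_seconds(double measured, double overhead)
{
    assert(measured >= overhead);
    return measured - overhead;
}
```

Under that assumption the simple SSE2 variant comes out at about 3.84 seconds net for the patched version (6.14 - 2.3) against about 4.09 seconds for the original (6.39 - 2.3).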


I'll attach an archive with the patched version of your executable, so you can test the new version of the loop right away.



Alex
P.S. Sorry for the patching. As I said, I have no experience with GCC and cannot add my code to the test in the normal way. The sources also do not seem to compile with MSVC, because of different name mangling, assembler, etc.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 01, 2010, 09:36:37 PM
Timings for DotProduct4_1.zip (with Jochen's procs).


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
2814    cycles for DotXMM1Acc4E
2716    cycles for DotXMM1Acc4EJ1
2798    cycles for DotXMM1Acc4EJ2
2707    cycles for AxDotXMM1_fastcall

2817    cycles for DotXMM1Acc4E
2714    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
2713    cycles for AxDotXMM1_fastcall

2805    cycles for DotXMM1Acc4E
2726    cycles for DotXMM1Acc4EJ1
2800    cycles for DotXMM1Acc4EJ2
2719    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---


Note: strangely enough, going through the stdcall thunk is faster than a direct call to the fastcall version, at least on my system.



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 01, 2010, 10:08:41 PM
Hi Alex,
It seems the Celeron M doesn't like it... sorry...
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2544    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2074    cycles for DotXMM1Acc4EJ2
2541    cycles for AxDotXMM1_fastcall

2517    cycles for DotXMM1Acc4E
2166    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2
2539    cycles for AxDotXMM1_fastcall

2511    cycles for DotXMM1Acc4E
2168    cycles for DotXMM1Acc4EJ1
2078    cycles for DotXMM1Acc4EJ2
2539    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 01, 2010, 10:12:14 PM
Quote from: jj2007 on September 01, 2010, 10:08:41 PM
Hi Alex,
It seems the Celeron M doesn't like it... sorry...

:P

It doesn't matter :)

What are the timings for Gunther's patched exe?
Post them, please! I made it to get results too, not only to attach it to a post :)



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 01, 2010, 10:19:03 PM
Here they are ;-)

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 22.64 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 15.58 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.89 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 7.55 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.25 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.72 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 01, 2010, 10:26:05 PM
Quote from: jj2007 on September 01, 2010, 10:19:03 PM
Here they are ;-)

So, as always, nothing to say... All code optimized for my CPU seems to be anti-optimized for others :(
A good chance for someone to make his stupid remarks about "archaic CPU... etc"   :toothy
But that doesn't matter either  :green2



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 01, 2010, 11:51:59 PM
Dotproduct result.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1824    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall

1561    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall

1561    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---


The other.


Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 10.31 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.89 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 2.62 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.14 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.97 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: clive on September 02, 2010, 12:23:52 AM
From the netbook, for chuckles..

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
6068    cycles for DotXMM1Acc4E
3240    cycles for DotXMM1Acc4EJ1
3542    cycles for DotXMM1Acc4EJ2
5300    cycles for AxDotXMM1_fastcall

4428    cycles for DotXMM1Acc4E
3339    cycles for DotXMM1Acc4EJ1
3537    cycles for DotXMM1Acc4EJ2
5405    cycles for AxDotXMM1_fastcall

4292    cycles for DotXMM1Acc4E
3328    cycles for DotXMM1Acc4EJ1
3520    cycles for DotXMM1Acc4EJ2
5285    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 48.08 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 45.88 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 9.44 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 16.88 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 12.69 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 5.81 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 12:29:41 AM
"In practice, it leads to the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising."

You can try my shorter variant too: :U
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 3 Dup(0)
DotXMM2Acc16ELingo proc srcX, srcY, counter
pxor        xmm5,  xmm5
mov         eax,   [esp+1*4] ; eax->srcX
pxor        xmm6,  xmm6
mov         edx,   [esp+2*4] ; edx ->srcY
movaps      xmm0,  [eax]
pxor        xmm4,  xmm4
mov         ecx,   [esp+3*4] ; ecx = counter
sub         eax,   edx
@@:
movaps     xmm1, [eax+edx+16]
mulps    xmm0, [edx]
addps   xmm5, xmm6
movaps     xmm3, [eax+edx+32]
mulps    xmm1, [edx+16]
addps    xmm4, xmm0
movaps     xmm6, [eax+edx+48]
mulps    xmm3, [edx+32]
add    edx,  64
addps     xmm5, xmm1
movaps     xmm0, [eax+edx]
mulps    xmm6, [edx+48-64]
addps   xmm4, xmm3
sub      ecx,  16
ja     @b
addps   xmm4, xmm6
addps   xmm4, xmm5
movhlps   xmm0, xmm4
addps   xmm4, xmm0
pshufd    xmm0, xmm4,1
addss   xmm4, xmm0
movss     dword ptr [esp+2*4], xmm4
fld       dword ptr [esp+2*4]
ret       3*4
DotXMM2Acc16ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
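The accumulator scheme and the final horizontal sum in the routine above can be modelled in plain C. This is a simplified sketch under assumptions: `dot_two_acc` and `LANES` are hypothetical names, each four-element array stands in for one XMM register, and the sketch handles 8 floats per pass rather than the 16 of the posted code. It does not reproduce the instruction scheduling, which is where the posted routine gets its speed.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* floats per XMM register */

static float dot_two_acc(const float *x, const float *y, size_t n)
{
    assert(n % (2 * LANES) == 0);      /* 8 floats per loop pass in this sketch */

    float acc0[LANES] = {0}, acc1[LANES] = {0};

    for (size_t i = 0; i < n; i += 2 * LANES) {
        for (size_t k = 0; k < LANES; k++) {
            acc0[k] += x[i + k] * y[i + k];                  /* first accumulator */
            acc1[k] += x[i + LANES + k] * y[i + LANES + k];  /* second accumulator */
        }
    }

    /* Merge the accumulators (addps xmm4, xmm5). */
    for (size_t k = 0; k < LANES; k++)
        acc0[k] += acc1[k];

    /* Horizontal sum: fold the high half onto the low half (movhlps + addps),
       then add lane 1 to lane 0 (pshufd + addss). */
    float lo0 = acc0[0] + acc0[2];
    float lo1 = acc0[1] + acc0[3];
    return lo0 + lo1;
}
```

Because the two accumulators do not depend on each other, consecutive additions can overlap instead of waiting on a single running sum. The summation order differs from a straight loop, so results on real data may differ in the last bits.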
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: clive on September 02, 2010, 12:59:57 AM
And with Lingo's XMM2 spanking everyone

Intel(R) Atom(TM) CPU N270   @ 1.60GHz (SSE4)
5739 cycles for DotXMM1Acc4E
3250 cycles for DotXMM1Acc4EJ1
3534 cycles for DotXMM1Acc4EJ2
5290 cycles for AxDotXMM1_cdecl
5253 cycles for AxDotXMM1_fastcall
1899 cycles for DotXMM2Acc16ELingo

4239 cycles for DotXMM1Acc4E
3289 cycles for DotXMM1Acc4EJ1
3488 cycles for DotXMM1Acc4EJ2
5278 cycles for AxDotXMM1_cdecl
5502 cycles for AxDotXMM1_fastcall
2903 cycles for DotXMM2Acc16ELingo

5959 cycles for DotXMM1Acc4E
4370 cycles for DotXMM1Acc4EJ1
4629 cycles for DotXMM1Acc4EJ2
6626 cycles for AxDotXMM1_cdecl
6223 cycles for AxDotXMM1_fastcall
1945 cycles for DotXMM2Acc16ELingo


The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
The result for AxDotXMM1_fastcall: 2867507200
The result for DotXMM2Acc16ELingo: 2867507200
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dedndave on September 02, 2010, 01:57:31 AM
Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz (SSE4)
1621    cycles for DotXMM1Acc4E
1596    cycles for DotXMM1Acc4EJ1
1573    cycles for DotXMM1Acc4EJ2
1587    cycles for AxDotXMM1_cdecl
1699    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo

1568    cycles for DotXMM1Acc4E
1564    cycles for DotXMM1Acc4EJ1
1678    cycles for DotXMM1Acc4EJ2
1611    cycles for AxDotXMM1_cdecl
1598    cycles for AxDotXMM1_fastcall
1052    cycles for DotXMM2Acc16ELingo

1566    cycles for DotXMM1Acc4E
1559    cycles for DotXMM1Acc4EJ1
1582    cycles for DotXMM1Acc4EJ2
1601    cycles for AxDotXMM1_cdecl
1611    cycles for AxDotXMM1_fastcall
1063    cycles for DotXMM2Acc16ELingo
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 02, 2010, 01:58:51 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1574    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1553    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo

1560    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1554    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo

1560    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1553    cycles for DotXMM1Acc4EJ2
1565    cycles for AxDotXMM1_cdecl
1560    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo


The result for DotXMM1Acc4E:             2867507200
The result for DotXMM1Acc4EJ1:           2867507200
The result for DotXMM1Acc4EJ2:           2867507200
The result for AxDotXMM1_cdecl:          2867507200
The result for AxDotXMM1_fastcall:       2867507200
The result for DotXMM2Acc16ELingo:       2867507200
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: KeepingRealBusy on September 02, 2010, 03:51:53 AM
Alex,

Here are my P4 timings for DotProducts:


Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
2864    cycles for DotXMM1Acc4E
2618    cycles for DotXMM1Acc4EJ1
2235    cycles for DotXMM1Acc4EJ2
2527    cycles for AxDotXMM1_fastcall

2525    cycles for DotXMM1Acc4E
2488    cycles for DotXMM1Acc4EJ1
2367    cycles for DotXMM1Acc4EJ2
2528    cycles for AxDotXMM1_fastcall

2498    cycles for DotXMM1Acc4E
2483    cycles for DotXMM1Acc4EJ1
2220    cycles for DotXMM1Acc4EJ2
2536    cycles for AxDotXMM1_fastcall


The result for DotXMM1Acc4E:     2867507200
The result for DotXMM1Acc4EJ1:   2867507200
The result for DotXMM1Acc4EJ2:   2867507200
The result for AxDotXMM1_cdecl:  2867507200
--- done ---



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions.


Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 17.34 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 11.42 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.69 Seconds

Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.02 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 3.69 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 3.33 Seconds


Dave
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 02, 2010, 06:20:42 AM
Quote from: lingo on September 02, 2010, 12:29:41 AM
You can try my shorter variant too: :U

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2548    cycles for DotXMM1Acc4E
2163    cycles for DotXMM1Acc4EJ1
2072    cycles for DotXMM1Acc4EJ2
2540    cycles for AxDotXMM1_fastcall
2133    cycles for DotXMM2Acc16ELingo

The result returned by Lingo's algo is correct! If you want to test yourself, activate UseMB in line 1.

On my archaic Prescott, Lingo's code is actually a bit faster:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3652    cycles for DotXMM1Acc4E
2778    cycles for DotXMM1Acc4EJ1
2779    cycles for DotXMM1Acc4EJ2
2754    cycles for AxDotXMM1_fastcall
2173    cycles for DotXMM2Acc16ELingo
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 02, 2010, 05:21:18 PM
I think I've lost track of what's going on in this thread.
How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?

At some point in this thread the timings switched from seconds for 5,000,000 loops to clks per loop. Did anything else change?
If not, then I get the following timings:
                         Phenom II     Atom N270
Code from Reply #17        640          2156
Gunther's original        1056          1660
(2acc. 16 Element)
Lingo's latest            1157          1858



On my PC (3GHz Phenom II) Gunther's original runs in 1.76s = 1056clks, Lingo's latest runs in 1157 clks but the code posted in Reply#17 of this thread runs in 640clks. I gather from other posts in this thread that this is very CPU dependent.

I've tried them on other machines and older PCs do show Lingo's code to be a little faster (in the region of 5%) than the reply #17 code on those old machines.
Other modern Athlon type CPUs show the Reply #17 code being nearly twice as fast.
I don't have any modern Intels available to test.

Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 05:40:19 PM
"If not, then I get the following timings:"

Without your testing files it is just bla, bla, bla... :lol
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 02, 2010, 05:46:31 PM
Lingo,
I forget, sometimes people can only handle 1 query at a time so I'll simplify it.

How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?
The code looks similar and the timings are similar.
Here are the timings from the already posted testing files:
                         Phenom II     Atom N270             
Gunther's original        1056          1660
(2acc. 16 Element)
Lingo's latest            1157          1858 


Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 02, 2010, 05:52:59 PM
Here are the relevant extracts from the Gunther original and Lingo's latest.
Shown below are the main loops of the 2 routines:

@@:
movaps     xmm1, [eax+edx+16]
mulps      xmm0, [edx]
addps      xmm5, xmm6
movaps     xmm3, [eax+edx+32]
mulps      xmm1, [edx+16]
addps      xmm4, xmm0
movaps     xmm6, [eax+edx+48]
mulps      xmm3, [edx+32]
add edx, 64
addps      xmm5, xmm1
movaps     xmm0, [eax+edx]
mulps      xmm6, [edx+48-64]
addps      xmm4, xmm3
sub ecx, 16
ja @b




.loop:

movaps  xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
mulps   xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
addps   xmm4,xmm0 ;sum up
movaps  xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps   xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
addps   xmm5,xmm1 ;sum up
movaps  xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps   xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
lea     edx,[edx+64] ;update pointers
addps   xmm4,xmm2 ;sum up
movaps  xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
mulps   xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
addps   xmm5,xmm3 ;sum up
sub     ecx,byte 64 ;count down
jnz .loop

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 06:11:13 PM
Your "results" without your testing files are just bla, bla, bla... :(

It would be better to post the complete algos rather than just the loops.. :lol
This is Gunther's thread, and it is his choice which algo he considers the better one.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 02, 2010, 06:16:39 PM
Lingo,
Quote: without your testing files

What testing files are you after? I used the ones in the original post of this thread by Gunther and the one in Reply #59 by jj2007.
The files are already posted. It would be pointless to repost the same ones again.

Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 06:34:13 PM
"I used the ones in the original post of this thread by Gunther and the one in reply in Reply #59 by jj2007."

It is a result from original post by Gunther:
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time  = 1.87 Seconds


it is a result from Reply #59 by jj2007:
2173    cycles for DotXMM2Acc16ELingo

As you see the dimensions of the times are different...(seconds vs cycles)  :lol

So, your "results" without your testing files are just bla, bla, bla...
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 02, 2010, 06:39:56 PM
Quote from: dioxin on September 02, 2010, 05:21:18 PM
I think I've lost track of what's going on in this thread.

Paul,
Here are results for all algos including your #17:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2529    cycles for DotXMM1Acc4E
2163    cycles for DotXMM1Acc4EJ1
2070    cycles for DotXMM1Acc4EJ2
2550    cycles for AxDotXMM1_fastcall
2138    cycles for DotXMM2Acc16ELingo
2135    cycles for DotXMM2Acc16EPaul

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 02, 2010, 06:46:26 PM
Lingo,
it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.

That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.
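
A minimal C sketch of that conversion, for anyone who wants to redo the arithmetic. The function name `cycles_per_call` is made up for illustration, and the sketch assumes the CPU runs at its nominal clock for the whole test (no throttling or frequency scaling).

```c
#include <assert.h>

/* Convert a wall-clock total into clock cycles per call.
   seconds : elapsed time for all calls together
   calls   : number of calls timed (5,000,000 in Gunther's test bed)
   ghz     : nominal CPU clock in GHz, assumed constant during the run */
double cycles_per_call(double seconds, double calls, double ghz)
{
    assert(calls > 0.0);
    double ns_per_call = seconds / calls * 1e9;  /* e.g. 1.87 s / 5e6 = 374 ns */
    return ns_per_call * ghz;                    /* 1 GHz = 1 cycle per ns     */
}
```

With the figures quoted earlier in the thread, `cycles_per_call(1.76, 5e6, 3.0)` comes out at about 1056, which is the Phenom II number for Gunther's original.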

jj2007,
thanks, on mine with your posted code I get:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2410    cycles for DotXMM1Acc4E
2115    cycles for DotXMM1Acc4EJ1
2118    cycles for DotXMM1Acc4EJ2
2116    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
723     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2117    cycles for DotXMM1Acc4EJ1
2115    cycles for DotXMM1Acc4EJ2
2114    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
725     cycles for DotXMM2Acc16EPaul

2155    cycles for DotXMM1Acc4E
2130    cycles for DotXMM1Acc4EJ1
2134    cycles for DotXMM1Acc4EJ2
2121    cycles for AxDotXMM1_fastcall
1126    cycles for DotXMM2Acc16ELingo
724     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---


Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 02, 2010, 06:55:11 PM
Quote from: dioxin on September 02, 2010, 06:46:26 PM
jj2007,
thanks, on mine with your posted code I get:
Quote:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2118    cycles for DotXMM1Acc4EJ2
723     cycles for DotXMM2Acc16EPaul

Quote:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2071    cycles for DotXMM1Acc4EJ2
2135    cycles for DotXMM2Acc16EPaul

Wow!
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 07:01:47 PM
 :lol
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1566    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1561    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2270    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2199    cycles for DotXMM1Acc4EJ2
2259    cycles for AxDotXMM1_fastcall
1400    cycles for DotXMM2Acc16ELingo
1582    cycles for DotXMM2Acc16EPaul

2242    cycles for DotXMM1Acc4E
2226    cycles for DotXMM1Acc4EJ1
2232    cycles for DotXMM1Acc4EJ2
2269    cycles for AxDotXMM1_fastcall
1393    cycles for DotXMM2Acc16ELingo
1579    cycles for DotXMM2Acc16EPaul

2246    cycles for DotXMM1Acc4E
2221    cycles for DotXMM1Acc4EJ1
2193    cycles for DotXMM1Acc4EJ2
2268    cycles for AxDotXMM1_fastcall
1425    cycles for DotXMM2Acc16ELingo
1575    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: redskull on September 02, 2010, 07:17:05 PM
Quote from: dioxin on September 02, 2010, 06:46:26 PM

it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.

That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.


Unless Lingo's Core2 has a wider pipeline that can decode three uOps per cycle, or has an extra ALU to do up to 6 uOps per cycle. Then the math isn't quite that simple.

-r
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 02, 2010, 10:31:26 PM
As I announced a while ago, here is the update of my test bed. The program now calculates the dot product in 9 different ways. I've included Paul's and lingo's code via inline assembly. Furthermore, I squeezed out a bit of time with a little macro magic in DotXMM2Acc32E (please have a look at dotfloatfa.cpp).

My impression is that it's a good solution, especially for older AMD chips (Athlons or Opterons). Newer AMD processors like the Phenom will probably run better with Paul's code, but that's not yet tested. Here is a part of the current timings with my Athlon X2, 1.9 GHz (only the last 4 results are shown; for more, look into results.pdf):

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 3.88 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 3.20 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 3.08 Seconds

It would be interesting to see what lingo's Snow Leopard brings. The program will compile properly under Mac OS X, so please try it. I'll also try to include Antariy's code in the application. My special thanks go to everyone for making suggestions and for helping with the tests.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 02, 2010, 10:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Cycles may be cute, and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 02, 2010, 11:02:54 PM
A small correction in my old algo and new results: :lol
align 16
db 3 Dup(0)
DotXMM2Acc32ELingo proc srcX, srcY, counter
mov     eax,   [esp+1*4] ; eax->srcX
pxor    xmm5,  xmm5
pxor    xmm3,  xmm3
mov     edx,   [esp+2*4] ; edx ->srcY
movaps  xmm0,  [eax]
pxor    xmm4,  xmm4
mov     ecx,   [esp+3*4] ; ecx = counter
sub     eax,   edx
@@:
movaps  xmm1, [eax+edx+16]
mulps   xmm0, [edx]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+32]
mulps   xmm1, [edx+16]
addps   xmm5, xmm0
movaps  xmm3, [eax+edx+48]
mulps   xmm2, [edx+32]
addps   xmm4, xmm1
movaps  xmm6, [eax+edx+64]
mulps   xmm3, [edx+48]
addps   xmm5, xmm2
movaps  xmm7, [eax+edx+80]
mulps   xmm6, [edx+64]
addps   xmm4, xmm3
movaps  xmm2, [eax+edx+96]
mulps xmm7, [edx+80]
addps   xmm5, xmm6
movaps  xmm3, [eax+edx+112]
mulps xmm2, [edx+96]
addps   xmm4, xmm7
movaps  xmm0, [eax+edx+128]
mulps xmm3, [edx+112]
addps   xmm5, xmm2
add     edx,  128
sub     ecx,  32
ja      @b
addps    xmm4, xmm3
addps    xmm4, xmm5
movhlps  xmm0, xmm4
addps    xmm4, xmm0
pshufd   xmm0, xmm4,1
addss    xmm4, xmm0
movss    dword ptr [esp+2*4], xmm4
fld      dword ptr [esp+2*4]
ret      4*3
DotXMM2Acc32ELingo endp

Results:
Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
1564    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1053    cycles for DotXMM2Acc16EPaul

1563    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1557    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul

1562    cycles for DotXMM1Acc4E
1556    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1563    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---


AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2291    cycles for DotXMM1Acc4E
2233    cycles for DotXMM1Acc4EJ1
2215    cycles for DotXMM1Acc4EJ2
2277    cycles for AxDotXMM1_fastcall
1454    cycles for DotXMM2Acc16ELingo
1338    cycles for DotXMM2Acc32ELingo
1611    cycles for DotXMM2Acc16EPaul

2257    cycles for DotXMM1Acc4E
2230    cycles for DotXMM1Acc4EJ1
2249    cycles for DotXMM1Acc4EJ2
2251    cycles for AxDotXMM1_fastcall
1435    cycles for DotXMM2Acc16ELingo
1362    cycles for DotXMM2Acc32ELingo
1623    cycles for DotXMM2Acc16EPaul

2268    cycles for DotXMM1Acc4E
2253    cycles for DotXMM1Acc4EJ1
2250    cycles for DotXMM1Acc4EJ2
2235    cycles for AxDotXMM1_fastcall
1402    cycles for DotXMM2Acc16ELingo
1371    cycles for DotXMM2Acc32ELingo
1605    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---





Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 02, 2010, 11:10:00 PM
Quote from: hutch-- September 03, 2010, at 11:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Wisely spoken. There are cache misses, mispredicted jumps, stalls and a whole bunch of other difficulties which can occur in practice. Therefore, I made the test bed as practical as possible. But 100% certainty is only reached after implementing the new algorithm in the original application. That will happen next week; I hope you keep your fingers crossed for me, hutch.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: clive on September 02, 2010, 11:22:16 PM
Is it working right?

Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: KeepingRealBusy on September 02, 2010, 11:26:41 PM
Interesting that it only fails on one of Lingo's machines.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 02, 2010, 11:38:37 PM
Latest Lingo's code:

2943    cycles for DotXMM1Acc4E
2816    cycles for DotXMM1Acc4EJ1
2828    cycles for DotXMM1Acc4EJ2
2816    cycles for AxDotXMM1_fastcall
2266    cycles for DotXMM2Acc16ELingo
1878    cycles for DotXMM2Acc32ELingo
1812    cycles for DotXMM2Acc16EPaul

2908    cycles for DotXMM1Acc4E
2811    cycles for DotXMM1Acc4EJ1
2815    cycles for DotXMM1Acc4EJ2
2794    cycles for AxDotXMM1_fastcall
2278    cycles for DotXMM2Acc16ELingo
1842    cycles for DotXMM2Acc32ELingo
1803    cycles for DotXMM2Acc16EPaul

2903    cycles for DotXMM1Acc4E
2819    cycles for DotXMM1Acc4EJ1
2836    cycles for DotXMM1Acc4EJ2
2822    cycles for AxDotXMM1_fastcall
2270    cycles for DotXMM2Acc16ELingo
1874    cycles for DotXMM2Acc32ELingo
1799    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656





Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 02, 2010, 11:48:37 PM
I have no desire to delve into Lingo's code, but maybe the incorrect result on the Core comes from this place:

movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]


As far as I know, Core and later CPUs have a changed engine that watches for a read after a write and can execute out of order in two ways: one if the location that will be read further down has been written, and another if it has not.

Maybe mixing SSE and FPU code does not work well with this engine? I see in the disassembly that there is no WAIT instruction before the FLD.



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 03, 2010, 12:04:57 AM
redskull,
   
Quote: Then the math isn't quite that simple.
Yes it is.
   Whatever CPU Lingo has, the conversion from seconds to cycles is a straightforward equation.
   Seconds = Cycles / Clk Frequency.

   Nothing else in his CPU matters.

Paul.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 12:08:32 AM
Quote from: hutch-- on September 02, 2010, 10:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!

Cycles may be cute, and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.

No need to cry, Hutch :lol
The timings with MichaelW's macros can be problematic for very small pieces of code, but this kind of algos - 180 bytes for Lingo's algo, 456 for my JE2, both over 2,000 cycles - they yield quite realistic results.

Quote:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2575    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2071    cycles for DotXMM1Acc4EJ2
2543    cycles for AxDotXMM1_fastcall
2139    cycles for DotXMM2Acc16ELingo
2079    cycles for DotXMM2Acc32ELingo
2103    cycles for DotXMM2Acc16EPaul

Congrats, Lingo. Even on my CPU you are close now :U
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 03, 2010, 12:10:09 AM
Lingo's dotpro18 results.


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
1574    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1562    cycles for AxDotXMM1_fastcall
1051    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1051    cycles for DotXMM2Acc16EPaul

1561    cycles for DotXMM1Acc4E
1555    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1561    cycles for AxDotXMM1_fastcall
1050    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1051    cycles for DotXMM2Acc16EPaul

1563    cycles for DotXMM1Acc4E
1554    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1563    cycles for AxDotXMM1_fastcall
1052    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1052    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 12:12:23 AM
Quote from: clive on September 02, 2010, 11:22:16 PM
Is it working right?

Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---

Where did you see that one? Can't find that post...
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: clive on September 03, 2010, 12:24:59 AM
Quote from: jj2007
Where did you see that one? Can't find that post...

It was from post #74, but it's been changed.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 12:28:53 AM
Quote from: clive on September 03, 2010, 12:24:59 AM
Quote from: jj2007
Where did you see that one? Can't find that post...

It was from post #74, but it's been changed.

OK. Here is DotPro18 with code sizes added.
78       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
60       bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
183      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul


Alex, have you tried unrolling a little bit?
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 03, 2010, 12:30:07 AM
DotPro18:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2754    cycles for DotXMM1Acc4E
2155    cycles for DotXMM1Acc4EJ1
2155    cycles for DotXMM1Acc4EJ2
2137    cycles for AxDotXMM1_fastcall
1197    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul

2158    cycles for DotXMM1Acc4E
2154    cycles for DotXMM1Acc4EJ1
2154    cycles for DotXMM1Acc4EJ2
2133    cycles for AxDotXMM1_fastcall
1195    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul

2159    cycles for DotXMM1Acc4E
2154    cycles for DotXMM1Acc4EJ1
2130    cycles for DotXMM1Acc4EJ2
2131    cycles for AxDotXMM1_fastcall
1195    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
818     cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 03, 2010, 12:34:23 AM
JJ,

For algos of this type, the data needs to be streamed to get viable results. What I would suggest is to set up 100 meg of data, load it into memory and stream it to get the timings. The advantage of a large source is that it does not fit into cache, so you avoid cache effects that are not present when the algo gets used, i.e. in real time.
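
A rough C sketch of what such a streaming test could look like. Everything here is an assumption for illustration: the names `stream_timing`, `dot_plain` and `time_streaming_dot` are invented, the plain loop stands in for whichever routine is under test, and `clock()` is only a coarse timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

struct stream_timing {
    double seconds;  /* CPU time for one pass, or -1.0 if allocation failed  */
    float  dot;      /* result, returned so the compiler cannot drop the loop */
};

/* Plain reference dot product; swap in the routine under test here. */
float dot_plain(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Time ONE pass over two freshly filled arrays of n floats each.
   With n = 12,500,000 the two arrays together take about 100 MB,
   far more than any cache, so most loads come from main memory. */
struct stream_timing time_streaming_dot(size_t n)
{
    struct stream_timing t = { -1.0, 0.0f };
    assert(n > 0);
    float *x = malloc(n * sizeof *x);
    float *y = malloc(n * sizeof *y);
    if (x && y) {
        for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
        clock_t start = clock();
        t.dot = dot_plain(x, y, n);
        t.seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    free(x);
    free(y);
    return t;
}
```

Calling `time_streaming_dot(12500000)` gives one timing for data that cannot stay cached; compare that with the repeated 2048-element runs, where both arrays stay in cache after the first call.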
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 03, 2010, 12:40:23 AM
Timings from Gunther's latest post in Reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 13.70 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 6.91 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 3.47 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 1.77 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 1.78 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 1.16 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 1.77 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 1.78 Seconds


Please press enter to terminate...
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 12:42:40 AM
A variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 03, 2010, 12:46:31 AM
jj2007,
Quote: A variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.
AMD Phenom(tm) II X4 945 Processor (SSE3)
2202    cycles for DotXMM1Acc4E
2153    cycles for DotXMM1Acc4EJ1
2152    cycles for DotXMM1Acc4EJ2
2140    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1186    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2156    cycles for DotXMM1Acc4E
2161    cycles for DotXMM1Acc4EJ1
2147    cycles for DotXMM1Acc4EJ2
2142    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
789     cycles for DotXMM2Acc16EPaul

2157    cycles for DotXMM1Acc4E
2155    cycles for DotXMM1Acc4EJ1
2155    cycles for DotXMM1Acc4EJ2
2139    cycles for AxDotXMM1_fastcall
1196    cycles for DotXMM2Acc16ELingo
1187    cycles for DotXMM2Acc32ELingo
815     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
80       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
102      bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
175      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul

--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 12:49:34 AM
Quote from: dioxin on September 03, 2010, 12:40:23 AM
Timings from Gunther's latest post in Reply #72:

Celeron M:
Quote:
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 6.92 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 8.42 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 7.52 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 6.77 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 7.16 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 6.78 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 6.70 Seconds

EDIT: I have added a macro for testing the correctness of results.
IsCorrect MACRO algo
invoke &algo&, offset SrcX, offset SrcY, 2048
fstp Res4
Fcmp Res4, Expected
.if !Zero?
Print Str$("\nIncorrect result: %i for &algo&", Res4)
.endif
ENDM
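
For readers following along in the C/C++ test bed, the same idea can be sketched in C. This is not the code from the archive: `dot_fn`, `is_correct`, `dot_short` and `demo_checks` are hypothetical names, and the exact float comparison only makes sense when the expected value is exactly representable, as it is for the small integer data used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef float (*dot_fn)(const float *x, const float *y, size_t n);

/* Reference implementation. */
float dot_plain(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Deliberately broken variant: forgets the last element. */
float dot_short(const float *x, const float *y, size_t n)
{
    assert(n > 0);
    return dot_plain(x, y, n - 1);
}

/* Run one algo and report a mismatch, like the IsCorrect macro does.
   Returns 1 if the result equals the expected value, 0 otherwise. */
int is_correct(const char *name, dot_fn algo,
               const float *x, const float *y, size_t n, float expected)
{
    float got = algo(x, y, n);
    if (got != expected) {
        printf("Incorrect result: %.0f for %s\n", got, name);
        return 0;
    }
    return 1;
}

/* Self-test on integer-valued data, so every float sum is exact. */
int demo_checks(void)
{
    static const float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    static const float y[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    int good = is_correct("dot_plain", dot_plain, x, y, 8, 36.0f);
    int bad  = is_correct("dot_short", dot_short, x, y, 8, 36.0f);
    return good && !bad;
}
```

`demo_checks()` returns 1 when the good routine passes and the broken one is flagged, which is the behaviour wanted from a check that runs after every timing round.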
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: lingo on September 03, 2010, 01:16:02 AM
I have 2 notes about my algos in dotfloat_update.zip (from Gunther's latest post)
- I see  ALIGN 16  just before  .loop: label.  It is not normal coz we have some clocks more.
- My last algo is not included too...:(
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 03, 2010, 01:55:16 AM
Quote from: dioxin September 03, 2010, at 01:40:23 AM
Timings from Gunther's latest post in Reply #72:

Thank you, Paul. You see: your Phenom does a good job with 4 accumulators, while my Athlon did not.
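
For readers who wonder why the accumulator count matters at all: with one accumulator every add has to wait for the previous add, while several accumulators form independent chains that a CPU with enough execution resources can overlap. Below is a scalar C model of the idea, not the SSE2 code from the archive; the function names are invented, and how much (if anything) is gained depends on the CPU and the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add. */
float dot_1acc(const float *x, const float *y, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Four accumulators: four independent dependency chains.
   Like the special case in this thread, n must be a multiple of 4. */
float dot_4acc(const float *x, const float *y, size_t n)
{
    assert(n % 4 == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine the partial sums at the end */
}

/* Both versions must agree on small integer data (sums are exact). */
int accumulators_agree(void)
{
    float x[16], y[16];
    for (int i = 0; i < 16; i++) { x[i] = (float)(i + 1); y[i] = 2.0f; }
    return dot_1acc(x, y, 16) == 272.0f && dot_4acc(x, y, 16) == 272.0f;
}
```

Note that splitting the sum changes the order of the float additions, so on general data the two versions may differ in the last bits; that is why the self-check uses small integers.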

Quote from: lingo September 03, 2010, at 02:16:02 AM
- I see  ALIGN 16  just before  .loop: label.  It is not normal coz we have some clocks more.

What a mess. But aligning the hot spots (that is, the .loop label) to 16 bytes is a recommendation by Intel and AMD. All in all, it's only a question of cut & paste for you: set the align directive at the procedure's entry, compile the program again, and voilà, it's done. What the heck. By the way, was it Windows or Snow Leopard?

Quote from: lingo September 03, 2010, at 02:16:02 AM
- My last algo is not included too...

Another mess. But joking apart, it is a question of time. If there is enough time, I'll try to include your last algorithm this weekend. Okay?

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 03, 2010, 07:41:56 AM
For the record, timings on my old Intel:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3674    cycles for DotXMM1Acc4E
3215    cycles for DotXMM1Acc4EJ1
3088    cycles for DotXMM1Acc4EJ2
2795    cycles for AxDotXMM1_fastcall
1953    cycles for DotXMM2Acc16ELingo
1910    cycles for DotXMM2Acc32ELingo
1752    cycles for DotXMM2Acc16EPaul

4064    cycles for DotXMM1Acc4E
3644    cycles for DotXMM1Acc4EJ1
3832    cycles for DotXMM1Acc4EJ2
3065    cycles for AxDotXMM1_fastcall
1831    cycles for DotXMM2Acc16ELingo
1752    cycles for DotXMM2Acc32ELingo
1849    cycles for DotXMM2Acc16EPaul


It seems archaic CPUs like Lingo's code :bg

And the same CPU on reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 20.66 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 10.12 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.11 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.00 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 2.58 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.49 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 2.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 2.53 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 2.47 Seconds
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 03, 2010, 11:54:59 AM
Quote from: jj2007 September 03, 2010, at 08:41:56 AM
For the record, timings on my old Intel:

Thank you, Jochen.

Quote from: jj2007 September 03, 2010, at 08:41:56 AM
It seems archaic CPUs like Lingo's code

But Paul's too.  :bg

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: FORTRANS on September 03, 2010, 12:47:36 PM
Hi,

   DotPro18b results.


Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
2613 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2084 cycles for DotXMM1Acc4EJ2
2417 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2099 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul

2623 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2097 cycles for DotXMM1Acc4EJ2
2418 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2101 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul

Test for correct results, expected 2867507200:

80 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
102 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
175 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul

--- done ---


Steve
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 03, 2010, 11:47:53 PM
For dotfloat_update.zip:



Supported by Processor and installed Operating System:
------------------------------------------------------

   Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.

Calculating the dot product in 9 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 34.39 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 16.62 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 7.08 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 7.05 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 4.53 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 4.62 Seconds

SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------

Dot Product 7 = 2867507200.00
Elapsed Time  = 4.45 Seconds

SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------

Dot Product 8 = 2867507200.00
Elapsed Time  = 4.50 Seconds

SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------

Dot Product 9 = 2867507200.00
Elapsed Time  = 4.55 Seconds



Gunther, my CPU doesn't support HTT. This is a mistake that most CPU detection programs make. As far as I know, all 90nm Prescotts "say" that they support HTT, but this is sometimes not true.



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 04, 2010, 01:36:10 AM
Quote from: Antariy September 04, 2010, at 12:47:53 AM
Gunther, my CPU doesn't support HTT. This is a mistake that most CPU detection programs make. As far as I know, all 90nm Prescotts "say" that they support HTT, but this is sometimes not true.

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 04, 2010, 10:05:57 PM
Quote from: Gunther on September 04, 2010, 01:36:10 AM

Hi Antariy,

Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?

Gunther


I know that the app doesn't use HTT. But HTT sometimes makes a mess of the timings.

My CPU - Celeron D 310. In details - Prescott core, 2.13 GHz, 256KB L2 cache.
OS - WinXP SP2.



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 04, 2010, 10:38:52 PM
Gunther, when I talk about HTT, I don't mean that the program uses it. How could the program use it? :)
I mean this:

+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.          <-------------------------- THIS

Calculating the dot product in 9 different variations.
That'll take a little while ...


Your CPU detection makes a common mistake - my CPU really doesn't have HTT.
Sorry if I didn't write clearly at first - my English is bad...



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 04, 2010, 10:41:13 PM
Alex,

thank you for your information.

Quote from: Antariy September 05, 2010, at 11:05:57 PM
My CPU - Celeron D 310. In details - Prescott core, 2.13 GHz, 256KB L2 cache.
OS - WinXP SP2.

The machine isn't as bad as you said in your PM.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 04, 2010, 10:47:59 PM
Alex,

excuse me, we both posted at the same time, so I couldn't see your latest answer.

Quote from: Antariy September 05, 2010, at 11:38:52 PM
Your CPU detection makes a common mistake - my CPU really doesn't have HTT.

I'll check that, but I used the algorithm recommended by Intel and AMD.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 04, 2010, 11:10:54 PM
Quote from: Gunther on September 04, 2010, 10:41:13 PM

The machine isn't as bad as you said in your PM.


Yes, big advantage - heavily unrolled algos run slower than short loops :)



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: hutch-- on September 05, 2010, 04:12:26 AM
Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling where many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache but I have found in practice that you try different amounts and don't go more than where the speed picks up.
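
The trade-off described here can be tried out with a small C sketch. This is not code from the thread's archive; the function names dot_1acc and dot_2acc8 are made up for illustration. It contrasts a plain one-accumulator loop with a version unrolled to 8 elements per iteration over two accumulators, similar in spirit to the "2 Accumulators - 8 elements per loop cycle" variant in the test output, and it assumes the length is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one accumulator, one element per iteration. */
float dot_1acc(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled: 8 elements per iteration, split over two independent
   accumulators so the two addition chains do not wait on each other.
   Assumption for this sketch: n is a multiple of 8. */
float dot_2acc8(const float *x, const float *y, size_t n)
{
    assert(n % 8 == 0);
    float s0 = 0.0f, s1 = 0.0f;
    for (size_t i = 0; i < n; i += 8) {
        s0 += x[i]     * y[i]     + x[i + 1] * y[i + 1]
            + x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
        s1 += x[i + 4] * y[i + 4] + x[i + 5] * y[i + 5]
            + x[i + 6] * y[i + 6] + x[i + 7] * y[i + 7];
    }
    return s0 + s1;
}
```

Floating-point addition is not associative, so the two versions can differ in the last bits for general data. Whether deeper unrolling still pays off has to be measured on the target CPU, one unroll depth at a time.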
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 06, 2010, 09:49:52 PM
Gunther, this is code to check whether the CPU truly has HTT support:

mov eax,1          ; CPUID leaf 1: feature information
cpuid
shr ebx,16
and ebx,255        ; EBX[23:16] = logical processor count per package


If EBX contains 1, the CPU doesn't support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.
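
The same check can be sketched in C on register values that CPUID leaf 1 has already returned, which keeps the example portable. The helper names are hypothetical, and fetching the values (for example through a compiler's CPUID intrinsic) is left out. It combines the HTT flag, EDX bit 28, with the logical processor count in EBX bits 16 to 23, the field the assembly above extracts.

```c
#include <stdint.h>

/* EBX[23:16] of CPUID leaf 1: logical processor count per package
   (the same field as "shr ebx,16 / and ebx,255"). */
unsigned logical_count(uint32_t leaf1_ebx)
{
    return (leaf1_ebx >> 16) & 0xFFu;
}

/* EDX bit 28 of CPUID leaf 1: the HTT feature flag. */
int htt_flag(uint32_t leaf1_edx)
{
    return (int)((leaf1_edx >> 28) & 1u);
}

/* Treat HTT as usable only if the flag is set AND more than one
   logical processor is reported. This mirrors the heuristic in this
   post; it is not a complete topology enumeration. */
int htt_usable(uint32_t leaf1_edx, uint32_t leaf1_ebx)
{
    return htt_flag(leaf1_edx) && logical_count(leaf1_ebx) > 1;
}
```

With the flag set but a count of 1, htt_usable returns 0, which is the case described for the Celeron D in this thread.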



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 06, 2010, 09:58:25 PM
Quote from: hutch-- on September 05, 2010, 04:12:26 AM
Alex,

The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling where many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache but I have found in practice that you try different amounts and don't go more than where the speed picks up.

Yes hutch.
Most code depends heavily on the hardware it runs on.
All CPUs have different schemes, even if they are of the same type and microarchitecture.
It is no wonder that AMD and Intel show very different results in tests: AMD always makes wide solutions, which use simple but parallel schemes (this shows in all tests), while Intel makes deep solutions, which try to use strong prediction and very deep pipelining. But deep pipelining is bad for code with a lot of register renaming, and if a prediction fails, a CPU with a deep pipeline has bigger stalls.
These are only my thoughts, of course, but they are corroborated by many things: code which uses multiple registers in parallel runs very well on AMD, and AMD CPUs have high heat emission, which suggests really big schemes that can do truly parallel execution, etc.




Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 06, 2010, 11:31:11 PM
Quote from: Antariy September 06, 2010, 10:49:52 pm
If EBX contains 1, the CPU doesn't support HTT. An HTT CPU has logically separate cores, so a true HTT CPU must report more than 1 core on its chip.

Alex,

thank you for the hint. I'll inspect your CPU detection method as soon as possible. Maybe I can adopt a few ideas (I'll give credit, that's clear) and make my procedure more robust and reliable.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 06, 2010, 11:37:56 PM
Gunther, your procedure is good; I'm only suggesting more accurate testing for HTT.
You can see my thread "http://www.masm32.com/board/index.php?topic=14754.0" (CPU identification), where I posted a small app that checks whether the CPU supports HTT (cores.zip). It seems to truly work.



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dedndave on September 07, 2010, 02:32:03 AM
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Rockoon on September 07, 2010, 04:55:47 AM
AMD is rolling out its own version of HT in early 2011 with its first major architectural change since the first Athlons, the new 32nm Bulldozer architecture.

http://techreport.com/articles.x/19514

Expect Intel to try something similar when they roll down to 22nm tech (it's too late for their 32nm tech; they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead).

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 07, 2010, 10:32:23 PM
Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 07, 2010, 10:44:28 PM
Quote from: dedndave on September 07, 2010, 02:32:03 AM
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all

Dave, I have read many times that reported HTT support can be a lie. My BIOS has HTT support, but it doesn't show this option, because the BIOS knows that the CPU doesn't support HTT, even though the CPU says otherwise.

Here is the EDX result after CPUID EAX=1 for my CPU: BFEBFBFFh.
Binary form:

10111111111010111111101111111111
   ^ this bit (bit 28) says that the CPU has HTT, but this is NOT true.


The method I gave reports the logical/physical core count. If the CPUID HTT bit says that the CPU supports HTT, but the core count is 1 - this is really funny :) So this CPU doesn't have HTT.
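
To double-check the binary form above, here is a minimal C sketch that expands a 32-bit value into its bit string and reads a single bit by number. Both helper names are made up for this example; nothing here is taken from the attached programs.

```c
#include <stdint.h>

/* Write the 32 bits of v into out, most significant bit first,
   followed by a terminating NUL (out must hold 33 chars). */
void to_binary32(uint32_t v, char out[33])
{
    for (int i = 0; i < 32; i++)
        out[i] = ((v >> (31 - i)) & 1u) ? '1' : '0';
    out[32] = '\0';
}

/* Return bit number `bit` (0 = least significant) of v. */
int bit_of(uint32_t v, unsigned bit)
{
    return (int)((v >> bit) & 1u);
}
```

For BFEBFBFFh, the character at index 3 of the string (counting from the left, starting at 0) is bit 28, the position the caret marks.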

In my opinion, HTT is mostly commercial advertising: "Hey, our CPU has 2 cores!"... But they are logical (i.e. virtual) and share the execution units of one physical core.
About 4 years ago I saw a true 2-core Prescott LGA 775. It ate 120 watts of power, and it was really HOT...
Anybody can say that 2 VIRTUAL CPUs are nice, but this is funny :)


EDITED: I suggest treating the HTT bit as "This CPU architecture can support HTT...", and, with only 1 logical/physical core, "... but we economized on its implementation" :)



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Rockoon on September 07, 2010, 11:22:38 PM
Quote from: Gunther on September 07, 2010, 10:32:23 PM
Quote from: Rockoon September 07, 2010, at 05:55:47 AM
they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead

Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?

Gunther


Sandy Bridge arrives.. maybe in December or January.

Bulldozer arrives a couple months after that.

It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 08, 2010, 01:12:20 AM
Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Rockoon on September 08, 2010, 01:47:03 AM
Quote from: Gunther on September 08, 2010, 01:12:20 AM
Quote from: Rockoon
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins the per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process.

Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?

Gunther

AVX, AES, SSE4.1 and SSE4.2

Not sure about SSSE3

Even though it will support AVX, it won't be using a 256-bit FPU unit.
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU unit.

That works? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.

Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Rockoon on September 08, 2010, 03:10:59 AM
Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Not sure about SSSE3

I hope that's not the point, because SSE3 is available nowadays on AMD chips.

SSSE3 is not the same as SSE3.

Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AM
Even though it will support AVX, it won't be using a 256-bit FPU unit.

That works? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.


Yes it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.
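
As a purely conceptual illustration of that splitting, here is a C model of the idea. It does not describe any real microarchitecture, and add4 and add8_split are made-up names. An 8-float addition is carried out as two 4-float additions on the low and high halves:

```c
/* Add four packed floats: stands in for one 128-bit operation. */
void add4(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* Add eight packed floats (a 256-bit operand) as two 128-bit halves,
   low half first and high half second. */
void add8_split(const float *a, const float *b, float *out)
{
    add4(a,     b,     out);      /* elements 0..3 */
    add4(a + 4, b + 4, out + 4);  /* elements 4..7 */
}
```

The result is the same as a full-width operation would give; what changes is that the narrower unit is occupied twice for each 256-bit instruction.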
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 08, 2010, 11:15:26 AM
Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
SSSE3 is not the same as SSE3.

Yes, I've overlooked the 3rd S.

Quote from: Rockoon, September 08, 2010, at 04:10:59 AM
Yes it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPUs.

In that sense, the speed advantage for floating point operations is on Intel's side.

Gunther

Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 28, 2010, 11:24:55 PM
Quote from: jj2007 on September 03, 2010, 12:28:53 AM

OK. Here is DotPro18 with code sizes added.
78       bytes for DotXMM1Acc4E
278      bytes for DotXMM1Acc4EJ1
266      bytes for DotXMM1Acc4EJ2
60       bytes for AxDotXMM1_fastcall
120      bytes for DotXMM2Acc16ELingo
183      bytes for DotXMM2Acc32ELingo
129      bytes for DotXMM2Acc16EPaul


Alex, have you tried unrolling a little bit?

No, I hadn't tried it at that time. But after a long delay I found some time to do this.
The code size is still the smallest (~106 bytes) of the fast ones, and the speed is satisfactory.
It is simple unrolling with interleaving of the execution units used.
Probably, on much better and more modern CPUs, Paul's code is best because it accesses contiguous memory locations, but on old CPUs using many identical instructions (i.e. the same execution units) doesn't gain anything useful.

I also changed the calling convention (stdcall now).

Please test this, anybody who reads this post (archive attached).

These are my timings:

Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
3107    cycles for DotXMM1Acc4E
2864    cycles for DotXMM1Acc4EJ1
2804    cycles for DotXMM1Acc4EJ2
1860    cycles for AxDotXMM1
2246    cycles for DotXMM2Acc16ELingo
1879    cycles for DotXMM2Acc32ELingo
1907    cycles for DotXMM2Acc16EPaul

2965    cycles for DotXMM1Acc4E
2822    cycles for DotXMM1Acc4EJ1
2827    cycles for DotXMM1Acc4EJ2
1818    cycles for AxDotXMM1
2220    cycles for DotXMM2Acc16ELingo
1919    cycles for DotXMM2Acc32ELingo
1818    cycles for DotXMM2Acc16EPaul

2936    cycles for DotXMM1Acc4E
2920    cycles for DotXMM1Acc4EJ1
2818    cycles for DotXMM1Acc4EJ2
1852    cycles for AxDotXMM1
2256    cycles for DotXMM2Acc16ELingo
1898    cycles for DotXMM2Acc32ELingo
1832    cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---




Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: dioxin on September 29, 2010, 12:09:16 AM
AMD Phenom(tm) II X4 945 Processor (SSE3)
2214    cycles for DotXMM1Acc4E
2153    cycles for DotXMM1Acc4EJ1
2164    cycles for DotXMM1Acc4EJ2
913     cycles for AxDotXMM1
1211    cycles for DotXMM2Acc16ELingo
1194    cycles for DotXMM2Acc32ELingo
783     cycles for DotXMM2Acc16EPaul

2195    cycles for DotXMM1Acc4E
2108    cycles for DotXMM1Acc4EJ1
2177    cycles for DotXMM1Acc4EJ2
914     cycles for AxDotXMM1
1209    cycles for DotXMM2Acc16ELingo
1196    cycles for DotXMM2Acc32ELingo
815     cycles for DotXMM2Acc16EPaul

2200    cycles for DotXMM1Acc4E
2159    cycles for DotXMM1Acc4EJ1
2154    cycles for DotXMM1Acc4EJ2
922     cycles for AxDotXMM1
1197    cycles for DotXMM2Acc16ELingo
1189    cycles for DotXMM2Acc32ELingo
805     cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: redskull on September 29, 2010, 12:16:58 AM
Intel(R) Core(TM)2 Duo CPU     E4500  @ 2.20GHz (SSE4)
3080    cycles for DotXMM1Acc4E
2867    cycles for DotXMM1Acc4EJ1
2874    cycles for DotXMM1Acc4EJ2
1930    cycles for AxDotXMM1
1925    cycles for DotXMM2Acc16ELingo
1914    cycles for DotXMM2Acc32ELingo
1363    cycles for DotXMM2Acc16EPaul

1575    cycles for DotXMM1Acc4E
1557    cycles for DotXMM1Acc4EJ1
1569    cycles for DotXMM1Acc4EJ2
1055    cycles for AxDotXMM1
1057    cycles for DotXMM2Acc16ELingo
1049    cycles for DotXMM2Acc32ELingo
1063    cycles for DotXMM2Acc16EPaul

1583    cycles for DotXMM1Acc4E
1560    cycles for DotXMM1Acc4EJ1
1556    cycles for DotXMM1Acc4EJ2
1062    cycles for AxDotXMM1
1054    cycles for DotXMM2Acc16ELingo
1038    cycles for DotXMM2Acc32ELingo
1055    cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---


-r
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: frktons on September 29, 2010, 12:55:17 AM

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
1603    cycles for DotXMM1Acc4E
1628    cycles for DotXMM1Acc4EJ1
1588    cycles for DotXMM1Acc4EJ2
1077    cycles for AxDotXMM1
1083    cycles for DotXMM2Acc16ELingo
1059    cycles for DotXMM2Acc32ELingo
1076    cycles for DotXMM2Acc16EPaul

1599    cycles for DotXMM1Acc4E
1593    cycles for DotXMM1Acc4EJ1
1592    cycles for DotXMM1Acc4EJ2
1072    cycles for AxDotXMM1
1071    cycles for DotXMM2Acc16ELingo
1063    cycles for DotXMM2Acc32ELingo
1083    cycles for DotXMM2Acc16EPaul

1598    cycles for DotXMM1Acc4E
1589    cycles for DotXMM1Acc4EJ1
1558    cycles for DotXMM1Acc4EJ2
1077    cycles for AxDotXMM1
1071    cycles for DotXMM2Acc16ELingo
1060    cycles for DotXMM2Acc32ELingo
1054    cycles for DotXMM2Acc16EPaul


The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: redskull on September 29, 2010, 01:59:28 AM
Quote from: frktons on September 29, 2010, 12:55:17 AM

Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
1603    cycles for DotXMM1Acc4E


This one is twice as fast on a CPU that is almost exactly the same as mine; the only difference being twice as much cache and fsb speed (4 vs 2, 1066 vs 800).  Can cache really have that much of an effect on such small, isolated code?  Yikes.

-r
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 29, 2010, 10:29:27 PM
Quote from: redskull on September 29, 2010, 01:59:28 AM
This one is twice as fast on a CPU that is almost exactly the same as mine; the only difference being twice as much cache and fsb speed (4 vs 2, 1066 vs 800).  Can cache really have that much of an effect on such small, isolated code?  Yikes.

Yes. Because the code tests a very small piece of data, the cache parameters have a drastic effect: if the data is very small in comparison with the cache size, it stays in the cache. If the cache is bigger, a much bigger piece of data/code can be held in it.
If the test is run as Hutch suggests, then cache size would matter less, but system bus speed would matter more.

What does the word "Yikes" mean? I don't know it - is it slang? Can I learn its meaning? My English is not very good :)



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: oex on September 29, 2010, 11:05:29 PM
"Yikes"
"Informal an expression of surprise, fear, or alarm"
http://www.thefreedictionary.com/yikes

Used in popular culture such as Scooby Doo :bg
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Antariy on September 29, 2010, 11:45:04 PM
Quote from: oex on September 29, 2010, 11:05:29 PM
"Yikes"
"Informal an expression of surprise, fear, or alarm"
http://www.thefreedictionary.com/yikes

Used in popular culture such as Scooby Doo :bg

Thanks for the link and the explanation!  :bg



Alex
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: Gunther on September 30, 2010, 02:48:22 PM
Alex,

here are the timings from my machine.


AMD Athlon(tm) 64 X2 Dual-Core Processor TK-57 (SSE3)
2297 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2240 cycles for DotXMM1Acc4EJ2
1502 cycles for AxDotXMM1
1425 cycles for DotXMM2Acc16ELingo
1362 cycles for DotXMM2Acc32ELingo
1633 cycles for DotXMM2Acc16EPaul

2289 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2240 cycles for DotXMM1Acc4EJ2
1499 cycles for AxDotXMM1
1437 cycles for DotXMM2Acc16ELingo
1360 cycles for DotXMM2Acc32ELingo
1637 cycles for DotXMM2Acc16EPaul

2288 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2242 cycles for DotXMM1Acc4EJ2
1507 cycles for AxDotXMM1
1423 cycles for DotXMM2Acc16ELingo
1359 cycles for DotXMM2Acc32ELingo
1640 cycles for DotXMM2Acc16EPaul


The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---


Gunther
Title: Re: Suggestions and improvements for SSE2 code are welcome
Post by: jj2007 on September 30, 2010, 03:54:28 PM
Good ol' P4:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
2930    cycles for DotXMM1Acc4E
2741    cycles for DotXMM1Acc4EJ1
2902    cycles for DotXMM1Acc4EJ2
1788    cycles for AxDotXMM1
2180    cycles for DotXMM2Acc16ELingo
2035    cycles for DotXMM2Acc32ELingo
1806    cycles for DotXMM2Acc16EPaul

2861    cycles for DotXMM1Acc4E
2715    cycles for DotXMM1Acc4EJ1
2756    cycles for DotXMM1Acc4EJ2
3149    cycles for AxDotXMM1
2150    cycles for DotXMM2Acc16ELingo
2024    cycles for DotXMM2Acc32ELingo
1826    cycles for DotXMM2Acc16EPaul

3286    cycles for DotXMM1Acc4E
2934    cycles for DotXMM1Acc4EJ1
2962    cycles for DotXMM1Acc4EJ2
2020    cycles for AxDotXMM1
2156    cycles for DotXMM2Acc16ELingo
1968    cycles for DotXMM2Acc32ELingo
1762    cycles for DotXMM2Acc16EPaul