The background of this new thread is a discussion in The Campus: http://www.masm32.com/board/index.php?topic=14692.msg119277#msg119277
I've finished my little test suite and attached the appropriate archive. The application runs under the following 32-bit operating systems: Windows, Linux, FreeBSD and Intel-based Mac OS X. The archive includes all source code and the ready-built Win32 program, for users who don't have GCC installed at the moment. Also included is a batch file to build the application from the sources. The shell script does the same for Unix users (there shouldn't be a problem, because GCC is installed by default as the system compiler). The readme.txt contains a short description of every file. The source code is well commented and should be self-explanatory.
The software is at an experimental stage - nothing is final or the last word. I've coded a special case, where the array size is divisible by 16, only to show the principle. A more generic implementation has to check that, of course. The program doesn't do much error handling, but it checks SSE2 support at run time, so your machine won't crash if it's not available. Any advice and help to improve the software is welcome, as is testing the program on other processors and environments (please have a look at the results.pdf file).
Gunther
From DotAsm.asm:
ALIGN 16
.loop: movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i]
addps xmm7,xmm0 ;sum up
lea edx,[edx+16] ;update pointers
sub ecx,byte 16 ;count down
jnz .loop
Well done... and challenging. There is not much to improve at first sight ::)
Hello jj2007,
Quote from: jj2007 Today at 07:32:46 PM
Well done... and challenging.
I hope it's well done. But challenging? To be honest, I've been filing away at these procedures for a few days, so I used some so-called dirty assembly language tricks, nothing more.
Quote from: jj2007 Today at 07:32:46 PM
There is not much to improve at first sight
I'm not sure. As I wrote in the cited Campus thread: that function is called over 30 million times for a 256x256 image, which isn't a very large picture. The arrays in the original program are not so large either: 256, 64 or 16 elements, depending on the partition depth of the algorithm. The trickier thing is that the dot product must be calculated not once, but eight times. That has to do with rotations and mirroring of data (not on the screen, only in memory). So every saved nanosecond counts, because these nanoseconds sum up to microseconds, the microseconds sum up to milliseconds ... etc. That makes the speed difference.
I'm thinking about some fastcall variant, because passing parameters via registers is faster. But here comes the bad news: with the release of GCC 4.3, the GNU world changed again. Here is what I've found: http://www.gnu.org/software/gcc/gcc-4.3/changes.html.
Quote from: GCC 4.3 Release Series, Changes, New Features, and Fixes
Fastcall for i386 has been changed not to pass aggregate arguments in registers, following Microsoft compilers.
Very nice. So why, and for what, should one still use fastcall?
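Just to illustrate what I had in mind (this prototype is made up, it's not in the archive): with GCC on 32-bit x86, register passing can be requested per function, and pointer/integer arguments still go in ECX and EDX even after the 4.3 change, since they aren't aggregates.

// Hypothetical prototype - the first two integer/pointer arguments
// would be passed in ECX and EDX instead of on the stack.
float __attribute__((fastcall)) DotProduct(const float *x, const float *y, int n);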
On the other hand, I coded the horizontal addition in the loop epilogue with SSE2 instructions, so the program runs on a small Sempron, too. I have to test whether haddps is better; but that's an SSE3 instruction. A lot of questions ... By the way, did you run the program? Do you have some timings? Thank you.
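To make the haddps question concrete, here is a rough sketch of the two epilogue variants in C++ intrinsics (only an illustration, not the code from the archive):

#include <emmintrin.h>   // SSE2
#include <pmmintrin.h>   // SSE3 (haddps)

// SSE2-only horizontal sum of the four floats in acc.
static inline float hsum_sse2(__m128 acc)
{
    __m128 hi   = _mm_movehl_ps(acc, acc);        // bring the upper two floats down
    __m128 sum2 = _mm_add_ps(acc, hi);            // two partial sums in the low lanes
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, 1);  // second partial sum into lane 0
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}

// SSE3 variant with haddps.
static inline float hsum_sse3(__m128 acc)
{
    acc = _mm_hadd_ps(acc, acc);                  // pairwise sums
    acc = _mm_hadd_ps(acc, acc);                  // sum of the pairwise sums
    return _mm_cvtss_f32(acc);
}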
Gunther
Hi Gunther,
Timings are easy to get: :bg
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 22.20 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 15.59 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 6.89 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 8.41 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 7.20 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 6.67 Seconds
As I wrote in my PM to you, there are a few SSE2 geeks around here. I see drizz is already involved.
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but apparently the function returns only a REAL4 value on the FPU: fld dword [esp] ;load function result
Is there no room for parallelisation? Apologies if that is a dumb question; as I said, I am not an SSE2 geek....
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?
Quote from: jj2007 Today at 11:22:05 PM
I see drizz is already involved.
Yes, he made the excellent intrinsic code, so I gave credit.
Quote from: jj2007 Today at 11:22:05 PM
Question: The inner loop that I posted above does contain vector instructions (packed mul and add), but apparently the function returns only a REAL4 value on the FPU: fld dword [esp] ;load function result
The scalar product is by definition a real number. Given two vectors A and B of dimension n, we have:
Scalar Product = A[0]*B[0] + A[1]*B[1] + ... + A[n-1]*B[n-1]
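In plain C++ that is nothing more than the following scalar reference (just to pin down the definition, not the optimized code):

// Scalar reference: sum of x[i]*y[i] for i = 0..n-1; the result is one float.
float DotReference(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}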
Please, let me think a little while about parallelisation.
Gunther
Quote from: Gunther on August 26, 2010, 10:58:42 PM
Thank you for the result. It's surprising. The intrinsic code is very fast. What's your environment?
Win XP SP2, a Celeron M CPU from the Yonah series, 1.6 GHz, 1 GB RAM.
The good old Win XP, SP2 configuration. Your Intel Celeron did an amazing job.
Gunther
It's usually faster for SSE to access memory in contiguous blocks, so don't access the vectors the way you do; change the order to something like this:
.loop:
movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
addps xmm4,xmm0 ;sum up
addps xmm5,xmm1 ;sum up
addps xmm4,xmm2 ;sum up
addps xmm5,xmm3 ;sum up
lea edx,[edx+64] ;update pointers
sub ecx,byte 64 ;count down
jnz .loop
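Roughly the same idea expressed with C++ intrinsics, for anyone who prefers that notation (just a sketch; it assumes n is a multiple of 16, 16-byte aligned pointers, and the names are made up):

#include <emmintrin.h>

// Two accumulators, 16 floats per iteration, all loads contiguous.
// The caller still has to sum the four lanes of the returned vector.
static __m128 DotPartial(const float *x, const float *y, int n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_load_ps(x + i),      _mm_load_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_load_ps(x + i + 4),  _mm_load_ps(y + i + 4)));
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_load_ps(x + i + 8),  _mm_load_ps(y + i + 8)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_load_ps(x + i + 12), _mm_load_ps(y + i + 12)));
    }
    return _mm_add_ps(acc0, acc1);
}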
Paul.
Thank you Paul. I'll check this and make another function. Re-arranging the loop instructions could be an improvement.
Gunther
Phenom II x4 3GHz.
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 13.70 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 6.89 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 3.47 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 3.45 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 1.77 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 1.78 Seconds
Rearranged SSE2 Code (4 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 1.14 Seconds
dotfloat.exe > result.txt
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 17.20 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 11.52 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 4.28 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 4.38 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 3.61 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 3.28 Seconds
regards.
Prescott P4, Win XP SP2:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 19.77 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 9.67 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 4.11 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 3.96 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 2.53 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 2.46 Seconds
Gunther,
One request on benchmarks: put a keyboard pause at the end so it does not have to be downloaded to be run. It ran from the browser but closed before I could save the results.
These timings are on a 3 gig Core2 Quad.
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 10.31 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 6.89 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 2.58 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 2.62 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 2.16 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 1.97 Seconds
I freely admit that I know too little about the dot product, so the attached testbed might not be realistic. Please adapt.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
53 cycles for DotXMM1Acc4E <<< this value too high, Prescott P4 behaves badly in such testbeds
31 cycles for DotXMM1Acc4E <<< 31 seems to be the correct value
31 cycles for DotXMM1Acc4E
32 cycles for DotXMM1Acc4E
31 cycles for DotXMM1Acc4E
Why not make it support multicore CPUs by doing the work in multiple threads?
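A minimal sketch of that idea, added here only as an illustration (the helper names are hypothetical): split the arrays between threads and add the partial sums at the end.

#include <thread>
#include <vector>

// Any single-threaded dot product will do as the worker, e.g. a scalar loop.
float DotReference(const float *x, const float *y, int n);

float DotThreaded(const float *x, const float *y, int n, int nthreads)
{
    std::vector<float> partial(nthreads, 0.0f);
    std::vector<std::thread> pool;
    int chunk = n / nthreads;                        // assumes n divides evenly
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            partial[t] = DotReference(x + t * chunk, y + t * chunk, chunk);
        });
    for (std::thread &th : pool)
        th.join();
    float sum = 0.0f;
    for (float p : partial)
        sum += p;
    return sum;
}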
Gunther,
you have 8 SSE registers and you're only using 6 of them.
Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.
You'll have to experiment to see which alternate use of the unused registers gives the better results.
I get faster results using 4 accumulators:
addps xmm4,xmm0 'add the products
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
Paul.
I have looked into the purpose of this code and modified the testbed accordingly:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2563 cycles for DotXMM1Acc4E
2167 cycles for DotXMM1Acc4EJ
2680 cycles for DotXMM1Acc4E
2167 cycles for DotXMM1Acc4EJ
2540 cycles for DotXMM1Acc4E
2167 cycles for DotXMM1Acc4EJ
The result: 2867507200
The result: 2867507200
EDIT: I squeezed out a few cycles...
add ecx, edx ; slightly faster
; int 3 ; OPT_Olly 2
align 16
@@:
REPEAT 16 ; unrolling helps, but make sure that the count is divisible by the rep count!!
movaps xmm0, [eax+edx] ; xmm0 = X[i+3] X[i+2] X[i+1] X[i+0] ; xmm0: 4 3 2 1, 8765, 12 11 10 9, ... (high to low)
mulps xmm0, [edx] ; xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i]*Y[i] ; xmm0: 20 12 6 2, 72 56 42 30, 156 132 110 90,
lea edx, [edx+16] ; update pointers
addps xmm7, xmm0 ; sum up ; xmm7: 20 12 6 2, 92 68 48 32, 248 200 158 122, ...
ENDM
if 1
cmp edx, ecx ; faster, at least on Celeron M
jb @B
else
sub ecx, 16 ; count down
jg @B
endif
jj2007,
if you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero. It then ends without a compare instruction, and you still have only one loop counter, so you don't need to update (in your case) both edx and ecx.
!mov ecx,2047 'loop counter for 2048 elements to get
!movups xmm4,z 'zero the 4 accumulators (z predefined as 4x zero single)
!movaps xmm5,xmm4
!movaps xmm6,xmm4
!movaps xmm7,xmm4
!mov eax,aptr 'get pointer to start of first array
!mov edx,bptr 'get pointer to start of second array
!lea eax,[eax+4*ecx] 'offset pointers by size of array
!lea edx,[edx+4*ecx]
!neg ecx 'negate counter so now pointer + 4*counter = start of array
'but also counter now counts up to zero
#ALIGN 16
lp:
!movaps xmm0,[eax+ecx*4] 'get 16 elements of first vector
!movaps xmm1,[eax+ecx*4+16]
!movaps xmm2,[eax+ecx*4+32]
!movaps xmm3,[eax+ecx*4+48]
!mulps xmm0,[edx+ecx*4] 'multiply by 16 elements of second vector
!mulps xmm1,[edx+ecx*4+16]
!mulps xmm2,[edx+ecx*4+32]
!mulps xmm3,[edx+ecx*4+48]
!addps xmm4,xmm0 'add the products
!addps xmm5,xmm1
!addps xmm6,xmm2
!addps xmm7,xmm3
!add ecx,16 'next block
!js short lp 'ends when sign goes +ve
!addps xmm4,xmm5 'add the partial sums
!addps xmm6,xmm7
!addps xmm4,xmm6
#IF %DEF (%sse3)
!haddps xmm4,xmm4 'sum pairs of results
!haddps xmm4,xmm4 'sum the sum of pairs of results
#ELSE
!movaps xmm1, xmm4
!shufps xmm4, xmm1, &hb1
!addps xmm4, xmm1
!movaps xmm1, xmm4
!shufps xmm4, xmm4, &h0a
!addps xmm4, xmm1
#ENDIF
'!movd sum!,xmm4 'store the dot product
!movups z1,xmm4
!emms
The above PowerBASIC ASM format version runs 5,000,000 loops in 1.07 s, or 645 clocks per loop on average, on a 3 GHz Phenom II x4.
Paul.
Quote from: dioxin on August 28, 2010, 11:37:24 AM
jj2007,
if you have the CPU registers avaialable, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction and still only have one loop counter so you don't need to update (in your case) both edx and ecx.
Paul,
You know about conditional assembly? "If 1" means "do this, ignore the else part".
jj2007,
yes, I know about that, but if you look at your code you have:
Either:
lea edx, [edx+16]
..
cmp edx, ecx ; faster, at least on Celeron M
jb @B
OR
lea edx, [edx+16]
..
sub ecx, 16 ; count down
jg @B
Either way it's 3 instructions to control the loop.
Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16 'next block
!js short lp 'ends when sign goes +ve
Paul.
i think i would use JC :bg
or maybe JNS
and - no need to....
mov ecx,2047
.
.
.
neg ecx
why not just
mov ecx,-2047
or
LoopCount EQU 2047
LoopCountVal EQU -LoopCount
.
.
.
mov ecx,LoopCountVal
Quote from: dioxin on August 28, 2010, 03:38:19 PM
jj2007,
yes I know about that but if you look at your code your have:
...
Look at the way it's done in the code I just posted and it's only 2 instructions:
!add ecx,16 'next block
!js short lp 'ends when sign goes +ve
Paul,
Your code needs one instruction less, and it is a good and recommended way of speeding up a loop. Whether it really helps needs to be demonstrated - just add it to the testbed above.
My post was a reaction to
Quote from: dioxin on August 28, 2010, 11:37:24 AM
you don't need to update (in your case) both edx and ecx.
That is simply not correct. Only edx is being updated in the loop:
cmp edx, ecx ; faster, at least on Celeron M
jb @B
EDIT: Since these debates are always fun, here is a testbed "Loop_Art":
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1036 cycles for neg ecx, TheCt=128
843 cycles for cmp edx, ecx, TheCt=128
1035 cycles for neg ecx, TheCt=128
843 cycles for cmp edx, ecx, TheCt=128
I guess results will differ wildly by CPU :bg
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
3823 cycles for neg ecx, TheCt=128
2009 cycles for cmp edx, ecx, TheCt=128
1066 cycles for dedndave, TheCt=128
3792 cycles for neg ecx, TheCt=128
1909 cycles for cmp edx, ecx, TheCt=128
1036 cycles for dedndave, TheCt=128
You cheat :dance:
@@: mov ebx,[eax+edx]
inc TheCt
mov [edx],ebx
add edx, 16
cmp edx, ecx
jb @B
@@: nop
add edx, 16
cmp edx, ecx
jb @B
-99 cycles for Lingo :lol
Surprisingly, your code runs exactly as fast as mine:
Quote
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
1036 cycles for neg ecx, TheCt=128
843 cycles for cmp edx, ecx, TheCt=128
842 cycles for dedndave, TheCt=128
Which means a push to mem/pop from mem pair is as fast as two moves on my Celeron.
"-99 cycles for Lingo"
cretino,
perché non si salva SSE e ECX registri? :lol
So guys, I'm back here. Please excuse the delay, but this weekend was the start of school here and I was involved in that a little bit. That was the simple reason.
Special thanks for all the test results and the whole bunch of hints.
Quote from: dioxin August 27, 2010, 11:38:45 am
you have 8 SSE registers and you're only using 6 of them. Either use them all to fetch/calculate data or to act as accumulators. Don't leave them unused.
That doesn't sound bad. I've tested it with the following code:
_DotXMM4Acc16E:
pxor xmm4, xmm4 ;sums are in xmm4, xmm5, xmm6 and xmm7
mov ecx,[esp+12] ;ecx = n
mov eax,[esp+4] ;eax -> X
pxor xmm5,xmm5
mov edx,[esp+8] ;edx -> Y
shl ecx,2 ;ecx = 4*n (float)
pxor xmm6,xmm6
sub esp,4 ;stack space for function result
sub eax,edx ;saves 1 lea instruction inside the main loop
pxor xmm7,xmm7
ALIGN 16
.loop:
movaps xmm0,[eax+edx] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
mulps xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
mulps xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
mulps xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
addps xmm4,xmm0 ;sum up
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
lea edx,[edx+64] ;update pointers
sub ecx,byte 64 ;count down
jnz .loop
addps xmm4,xmm5 ;add accumulators
addps xmm6,xmm7
addps xmm4,xmm6 ;sum in xmm4
movhlps xmm0,xmm4 ;get bits 64 - 127 from xmm4
addps xmm4,xmm0 ;sum in 2 dwords
pshufd xmm0,xmm4,1 ;get bits 32 - 63 from xmm4
addss xmm4,xmm0 ;sum in 1 dword
movss [esp],xmm4 ;store sum
fld dword [esp] ;load function result
add esp,byte 4 ;adjust stack
ret
It's probably faster on your Phenom, but slower on my Athlon X2. That has to do with limitations of the floating point adder on older AMD chips. Intel chips must be tested. Let me know if your code is similar; if so, I'll work that into the archive and update it for appropriate testing. Thank you. It's not a big deal; there's always CPUID and different code paths (one for older AMD, the other probably for Intel and newer AMD). Never mind.
Quote from: dioxin August 28, 2010, 12:37:24 pm
If you have the CPU registers available, which you do in this case, then you can always arrange for the loop to count up to zero and so end without a compare instruction
That's okay. Counting a loop down is usually faster than counting up, but the cache works better when counting up; what you're doing is counting up from a negative offset, so memory is still accessed forwards. Good idea.
Quote from: jj2007 August 28, 2010, 06:23:05 pm
I guess results will differ wildly by CPU
Yes, Jochen, all the results differ wildly and are very surprising.
Quote from: hutch-- August 27, 2010, 08:41:57 am
One request on benchmarks: put a keyboard pause at the end so it does not have to be downloaded to be run.
It's done, but the archive isn't updated yet. I'll do that after Paul's answer.
Gunther
Hi, Gunther,
"I've tested that with the following code:"
Now you have two algos with different results: :wink
prev:
4E2C 4AE8 4E2C 0A98 4EAB 0AB4 4F2A EAB0
last:
4E2C 47F2 4E2C 0792 4EAB 0AB8 4F2A EAB0
Lingo,
You misunderstand the "result".
The result of a dot product is a scalar (one value). With SINGLE types an xmm register can hold 4 values so 3 of them will be irrelevant and only 1 will be the required answer.
The same scalar value, 2867507200, is being returned for all code so clearly they all work and give the same answer.
That value, 2867507200 decimal, when stored as a SINGLE and displayed in hex would be 4F2A EAB0 which is shown in the low DWORD of both of your answers. The other DWORDs of that SSE register are not relevant as they were just intermediate results used to derive the answer, only the low DWORD, 4F2A EAB0, is returned as the answer.
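A quick way to check that, outside the test suite (just a throwaway snippet):

#include <cstdio>
#include <cstring>
#include <stdint.h>

int main()
{
    float f = 2867507200.0f;                     // the dot product result
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);         // reinterpret the SINGLE as raw bits
    std::printf("%.2f -> 0x%08X\n", f, bits);    // prints 2867507200.00 -> 0x4F2AEAB0
    return 0;
}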
Gunther,
I can't imagine why it would slow down on your CPU. The work being done is the same, but there should be fewer register stalls. Worst case, I'd expect it to run at the same speed.
I might have access to an Athlon X2 sometime this week and if I get time I'll check it out.
Until then, you can still use the unused registers to fetch/multiply more data in each loop. It won't be a convenient power of 2 but it should be faster.
dedndave,
Quote
i think i would use JC :bg
or maybe JNS
Why? I think I used the correct branch.
Quote
why not just..
Because I was writing it to be easily understood and to fit in with Gunther's original code.
It's easy enough to save a cycle or two afterwards once the main code (which is saving hundreds of clocks) is understood.
jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times, but your loop only increments by 16 bytes instead of 64. Note that in my code the counter increments by 16 but is always referenced as 4*ecx, not just ecx alone.
Paul.
"only the low DWORD, 4F2A EAB0, is returned as the answer."
You are right, thanks.. :U
Gunther,
I'm sure my algo will run faster on your CPU.
So, please, try it with your testbed... :U
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 10 Dup(0)
DotXMM1Acc4ELingo proc srcX, srcY, counter
mov eax, [esp+1*4] ; eax->srcX
pxor xmm5, xmm5
mov edx, [esp+2*4] ; edx ->srcY
pxor xmm4, xmm4
mov ecx, [esp+3*4] ; ecx = counter
sub eax, edx
@@:
movaps xmm0, [eax+edx]
movaps xmm1, [eax+edx+16]
mulps xmm0, [edx]
movaps xmm2, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm4, xmm0
movaps xmm3, [eax+edx+48]
mulps xmm2, [edx+32]
addps xmm5, xmm1
movaps xmm6, [eax+edx+64]
mulps xmm3, [edx+48]
addps xmm4, xmm2
movaps xmm7, [eax+edx+80]
mulps xmm6, [edx+64]
addps xmm5, xmm3
movaps xmm0, [eax+edx+96]
mulps xmm7, [edx+80]
addps xmm4, xmm6
movaps xmm1, [eax+edx+112]
mulps xmm0, [edx+96]
addps xmm5, xmm7
mulps xmm1, [edx+112]
addps xmm4, xmm0
add edx, 128
addps xmm5, xmm1
sub ecx, 32
ja @b
addps xmm4, xmm5
movhlps xmm1, xmm4
addps xmm4, xmm1
pshufd xmm1, xmm4,1
addss xmm4, xmm1
movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]
ret 4*3
DotXMM1Acc4ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: lingo August 31, 2010 12:46:00 AM
I'm sure my algo will run faster on your CPU.
Thank you for your reply. Of course I'll test your code and let you know the results.
Quote from: dioxin August 30, 2010, 11:42:29 pm
I can't imagine why it would slow down on your CPU. The work being done is the same but there should be fewer register stalls.
Yes, of course, but there is a limitation in the throughput of the floating point adder on older AMD chips (it is only 64 bits wide). Your Phenom or an Intel Core 2 processor has a 128-bit wide floating point adder that can handle a whole vector in one operation. That makes the difference.
But never mind. In the original program, the one that is to be optimized, I'll insert two code paths: one for older AMD chips, the other for Intel and the fancy new Phenom. Which way to go can be detected at run time with CPUID.
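For what it's worth, a rough sketch of how such a runtime dispatch could look with GCC's cpuid helper (the family threshold for "older AMD" is my assumption, not something from the archive):

#include <cpuid.h>     // GCC's __get_cpuid
#include <cstring>

// Pick a code path at run time: AMD families before 10h (K8/Athlon X2) have the
// narrower floating point adder, family 10h (Phenom) and Intel Core 2 do not.
bool PreferWideSsePath()
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return false;
    std::memcpy(vendor + 0, &ebx, 4);            // vendor string is in EBX, EDX, ECX
    std::memcpy(vendor + 4, &edx, 4);
    std::memcpy(vendor + 8, &ecx, 4);
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    unsigned family = ((eax >> 8) & 0xF) + ((eax >> 20) & 0xFF);
    if (std::strcmp(vendor, "AuthenticAMD") == 0)
        return family >= 0x10;                   // Phenom (family 10h) and later
    return true;                                 // assume the Intel path here
}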
Gunther
Quote from: dioxin on August 30, 2010, 10:42:29 PM
jj2007 & dedndave,
Your version of the loop is only accessing 25% of the memory it should be.
Both loops go around 128 times, but your loop only increments by 16 bytes instead of 64. Note that in my code the counter increments by 16 but is always referenced as 4*ecx, not just ecx alone.
Paul.
That's correct, sorry for my sloppiness :thumbu
Modified code attached. I disabled Dave's code because it uses a faster "fill code", which makes the results not comparable.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1853 cycles for neg ecx, TheCt=128
1839 cycles for cmp edx, ecx, TheCt=128
1842 cycles for neg ecx, TheCt=128
1836 cycles for cmp edx, ecx, TheCt=128
1837 cycles for neg ecx, TheCt=128
1838 cycles for cmp edx, ecx, TheCt=128
AMD Phenom(tm) II X6 1055T Processor (SSE3)
659 cycles for neg ecx, TheCt=128
919 cycles for cmp edx, ecx, TheCt=128
659 cycles for neg ecx, TheCt=128
918 cycles for cmp edx, ecx, TheCt=128
659 cycles for neg ecx, TheCt=128
919 cycles for cmp edx, ecx, TheCt=128
659 cycles for neg ecx, TheCt=128
918 cycles for cmp edx, ecx, TheCt=128
659 cycles for neg ecx, TheCt=128
918 cycles for cmp edx, ecx, TheCt=128
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
914 cycles for neg ecx, TheCt=128
827 cycles for cmp edx, ecx, TheCt=128
912 cycles for neg ecx, TheCt=128
827 cycles for cmp edx, ecx, TheCt=128
912 cycles for neg ecx, TheCt=128
827 cycles for cmp edx, ecx, TheCt=128
913 cycles for neg ecx, TheCt=128
828 cycles for cmp edx, ecx, TheCt=128
912 cycles for neg ecx, TheCt=128
827 cycles for cmp edx, ecx, TheCt=128
--- ok ---
Very different indeed. Let's test with another "fill code":
Quote
neg ecx
and TheCt, 0
align 16
@@:
mov ebx, [eax+ecx*4]
inc TheCt
mov [edx+ecx*4], ebx
add ecx, 4
js short @B
Quote
lea ecx, [edx+ecx]
and TheCt, 0
sub eax, edx
align 16
@@:
mov ebx, [eax+edx]
inc TheCt
mov [edx], ebx
add edx, 16
cmp edx, ecx
jb short @B
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
997 cycles for neg ecx, TheCt=128
986 cycles for cmp edx, TheCt=128
979 cycles for neg ecx, TheCt=128
988 cycles for cmp edx, TheCt=128
Quote from: lingo August 31, 2010, at 12:46:00 AM
Gunther,
I'm sure my algo will run faster on your CPU.
So, please, try it with your testbed... :U
At first glance it should be faster, because you've unrolled the loop and your code now processes 32 elements per loop cycle. In practice, it takes the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising.
I'll update the archive. Your procedure will be renamed to DotXMM2Acc32ELingo (2 accumulators, 32 elements per loop cycle) - but I'll give credit, that's clear. I'll post a message after updating the archive.
Gunther
With the modified testbed. J1 is the short loop, J2 uses more xmm registers, as shown below:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3566 cycles for DotXMM1Acc4E
2739 cycles for DotXMM1Acc4EJ1
2821 cycles for DotXMM1Acc4EJ2
2782 cycles for DotXMM1Acc4E
2714 cycles for DotXMM1Acc4EJ1
2806 cycles for DotXMM1Acc4EJ2
2784 cycles for DotXMM1Acc4E
2761 cycles for DotXMM1Acc4EJ1
2844 cycles for DotXMM1Acc4EJ2
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2548 cycles for DotXMM1Acc4E
2165 cycles for DotXMM1Acc4EJ1
2073 cycles for DotXMM1Acc4EJ2
2546 cycles for DotXMM1Acc4E
2165 cycles for DotXMM1Acc4EJ1
2073 cycles for DotXMM1Acc4EJ2
@@:
REPEAT repct/4 ; unrolling helps, but make sure that the count is divisible by the rep count!!
movaps xmm0, [eax+edx]
movaps xmm1, [eax+edx+16]
movaps xmm2, [eax+edx+32]
movaps xmm3, [eax+edx+48]
mulps xmm0, [edx]
mulps xmm1, [edx+16]
mulps xmm2, [edx+32]
mulps xmm3, [edx+48]
lea edx, [edx+64]
addps xmm7, xmm0 ; sum up
addps xmm7, xmm1
addps xmm7, xmm2
addps xmm7, xmm3
ENDM
if UseCmp
cmp edx, ecx ; faster, at least on Celeron M
jb @B
else
sub ecx, 16*repct ; count down
jg @B
endif
"Dave's algo does not use the same "fill code", so it's not comparable"
JJ's code is not comparable either... :(
Or: how to manipulate people with stupid testing.
Original JJ's stupid code:
1st algo:
push ebx
mov ecx, 2048/4 ; 1st lame error here; Why 2048? Why /4?
mov eax, offset Src ; eax+2048 is 1 dword after the end of Src!
mov edx, offset Dest ; edx+2048 is 1 dword after the end of Dest!
lea eax, [eax+4*ecx] ; Why 4*ecx? Wow, coz we have /4 above!!!
lea edx, [edx+4*ecx] ; the same stupidity again!!!
neg ecx
and TheCt, 0
align 16
@@:
mov ebx, [eax+ecx*4] ; multiply ecx by 4 + sum of 2registers and read
inc TheCt
mov [edx+ecx*4], ebx ; multiply ecx by 4 + sum of 2registers and WRITE !!!
add ecx, 4
js short @B
pop ebx
and the loop of the 2nd algo:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov [edx], ebx ; just one register and WRITE !!!
add edx, 16
cmp edx, ecx
jb short @B
As you can see, the loops are not comparable!
To be comparable:
just change the 1st algo to:
push ebx
mov ecx, 2047
and TheCt, 0
lea eax, [Src+ecx]
lea edx, [Dest+ecx]
neg ecx
align 16
@@:
mov ebx, [eax+ecx] ;sum of 2registers and read
inc TheCt
mov ebx, [edx+ecx] ;sum of 2registers and read
add ecx, 16
jle short @B
pop ebx
and change the 2nd algo too to:
push ebx
mov ecx, 2047
mov eax, offset Src
mov edx, offset Dest
lea ecx, [edx+ecx]
and TheCt, 0
sub eax, edx
align 16
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov ebx, [edx] ;just one register and read
add edx, 16
cmp edx, ecx
jbe short @B
pop ebx
and voila, we have similar results ...
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
778 cycles for neg ecx, TheCt=128
779 cycles for cmp edx, TheCt=128
778 cycles for neg ecx, TheCt=128
779 cycles for cmp edx, TheCt=128
--- ok ---
Good job, Lingo :U
Unfortunately I had little time for this, so I am grateful that you took over. Now if you find some spare time, please try to speed up the DotXMM1Acc4EJ2 posted above.
"same time as DotXMM2Acc16E on my machine (Win32 and Linux)"
On my PC I have more: Win7, Vista, XP and Mac OS X Snow Leopard,
but I can't understand the relation between OS and speed optimization. :wink
Quote from: lingo August 31, 2010, at 06:33:16 PM
but I can't understand the relation between OS and speed optimization
I mentioned that only because the application is compiled with two different compiler versions:
- Win32: gcc 3.4.5 (MinGW)
- Linux: gcc 4.2.4 (not gcc 4.1, the worst-performing gcc for x86 machines)
That makes a small difference under both operating systems.
Gunther
Quote from: lingo on August 31, 2010, 03:59:51 PM
"Dave's algo does not use the same "fill code", so it's not comparable"
JJ's code is not comparable either... :(
Lingo,
I'm sorry, but that code just won't work. You are not copying the data from src to dest, but just loading two values into ebx.
Also, wouldn't this be better? Instead of this:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov ebx, [edx] ;just one register and read (double load)
add edx, 16
cmp edx, ecx
jbe short @B
Use this:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
add edx, 16
inc TheCt
mov [edx-16], ebx ;just one register and write (also fixed the double load)
cmp edx, ecx
jbe short @B
Dave.
Lingo,
I have lost track of what is being done here (just got back from a trip), but the "add edx, 16" doesn't seem right, unless you are just initializing a single dword in a 16 byte array entry.
Shouldn't that be:
add edx, 4
inc TheCt
mov [edx-4], ebx ;just one register and read (also fixed the double load)
Dave.
"You are not copying the data from src to dest"
Why do that? :lol
"I have lost track of what is being done here"
It is jj's example, so it would be better to ask him.. :wink
Quote from: KeepingRealBusy on August 31, 2010, 07:33:45 PM
I have lost track of what is being done here
Hi Dave,
There are two parallel testing efforts here, the "real" one on speeding up the dot product, and a side track on loop optimisation with negative offsets. The latter is pretty useless, and I feel guilty about it. For those interested, see Agner Fog's optimizing_assembly.pdf, page 89/90:
Quote
It is possible to modify example 12.4a to make it count down rather than up, but the data cache is optimized for accessing data forwards, not backwards. Therefore it is better to count up through negative values from -n to zero. This is possible by making a pointer to the end of the array and using a negative offset from the end of the array:
; Example 12.4b. For-loop with negative index from end of array
mov ecx, n ; Load n
lea esi, Array[4*ecx] ; Point to end of array
neg ecx ; i = -n
jnl LoopEnd ; Skip if (-n) >= 0
LoopTop:
; Loop body: Add 1 to all elements in Array:
add dword ptr [esi+4*ecx], 1
add ecx, 1 ; i++
js LoopTop ; Loop back if i < 0
LoopEnd:
A slightly different solution is to multiply n by 4 and count from -4*n to zero:
; Example 12.4c. For-loop with neg. index multiplied by element size
mov ecx, n ; Load n
shl ecx, 2 ; n * 4
jng LoopEnd ; Skip if (4*n) <= 0
lea esi, Array[ecx] ; Point to end of array
neg ecx ; i = -4*n
LoopTop:
; Loop body: Add 1 to all elements in Array:
add dword ptr [esi+ecx], 1
add ecx, 4 ; i += 4
js LoopTop ; Loop back if i < 0
LoopEnd:
There is no difference in speed between example 12.4b and 12.4c, but the latter method is useful if the size of the array elements is not 1, 2, 4 or 8 so that we cannot use the scaled index addressing.
Quote from: jj2007 on September 01, 2010, at 12:25:07 AM
The latter is pretty useless, and I feel guilty about it.
Jochen,
the side track isn't useless. It's a very interesting question, but I think it's worth having its own thread.
Gunther
Hi!
I've done a remake of Gunther's simple SSE2 code, with my suggested loop construction, and implemented it as __fastcall, with support of __stdcall and __cdecl via thunks.
The rest of the code is the same (final repacking and addition).
Gunther, this code is in MASM32 format; I have no experience with GCC, sorry!
The code is too big to put into the post, so you can get it from DotProduct4_1.zip, attached to this post. The procedure is named AxDotXMM1_fastcall, with the thunks below it.
That archive is Jochen's original DotProduct4.zip, but with my code added.
The other archive is your old dotfloat.exe with my procedure included (the __cdecl version).
To make the test, I was forced to patch your original (old) posted version of dotfloat.exe.
I replaced your simple SSE2 version with one accumulator with my version of the code, and just ran the test.
These are the timings of the PATCHED version (with my simple version of the proc):
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 31.49 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 15.14 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 6.49 Seconds
Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 6.14 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 4.01 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 4.09 Seconds
Timings for your ORIGINAL posted version:
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 31.50 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 15.28 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 6.52 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 6.39 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 4.00 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 4.06 Seconds
Note: the overhead of the C++ loop that calls the dot product code is 2.3 seconds on my CPU.
I'll attach an archive with the patched version of your executable, so you can test the new version of the loop right away.
Alex
P.S. Sorry for the patching; as I said, I have no experience with GCC and cannot add my code to the test the normal way. And the sources don't seem to be compilable with MSVC, because of different name mangling, assembler, etc.
Timings for DotProduct4_1.zip (with Jochen's procs).
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
2814 cycles for DotXMM1Acc4E
2716 cycles for DotXMM1Acc4EJ1
2798 cycles for DotXMM1Acc4EJ2
2707 cycles for AxDotXMM1_fastcall
2817 cycles for DotXMM1Acc4E
2714 cycles for DotXMM1Acc4EJ1
2804 cycles for DotXMM1Acc4EJ2
2713 cycles for AxDotXMM1_fastcall
2805 cycles for DotXMM1Acc4E
2726 cycles for DotXMM1Acc4EJ1
2800 cycles for DotXMM1Acc4EJ2
2719 cycles for AxDotXMM1_fastcall
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
--- done ---
Note: strangely enough, calling through the stdcall thunk is faster than a direct call to the fastcall version - on my system.
Alex
Hi Alex,
It seems the Celeron M doesn't like it... sorry...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2544 cycles for DotXMM1Acc4E
2161 cycles for DotXMM1Acc4EJ1
2074 cycles for DotXMM1Acc4EJ2
2541 cycles for AxDotXMM1_fastcall
2517 cycles for DotXMM1Acc4E
2166 cycles for DotXMM1Acc4EJ1
2073 cycles for DotXMM1Acc4EJ2
2539 cycles for AxDotXMM1_fastcall
2511 cycles for DotXMM1Acc4E
2168 cycles for DotXMM1Acc4EJ1
2078 cycles for DotXMM1Acc4EJ2
2539 cycles for AxDotXMM1_fastcall
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
Quote from: jj2007 on September 01, 2010, 10:08:41 PM
Hi Alex,
It seems the Celeron M doesn't like it... sorry...
:P
That doesn't matter :)
What are the timings for the patched Gunther exe?
Post them, please! I made it to work, not just to attach it to a post, but to get results as well :)
Alex
Here they are ;-)
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 22.64 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 15.58 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 6.89 Seconds
Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 7.55 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 7.25 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 6.72 Seconds
Quote from: jj2007 on September 01, 2010, 10:19:03 PM
Here they are ;-)
So, as always, nothing to say... All the code optimized for my CPU seems to be anti-optimized for others :(
A good chance for someone to make his stupid remarks about an "archaic CPU... etc" :toothy
But this also has no meaning :green2
Alex
Dotproduct result.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
1824 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1562 cycles for AxDotXMM1_fastcall
1561 cycles for DotXMM1Acc4E
1557 cycles for DotXMM1Acc4EJ1
1554 cycles for DotXMM1Acc4EJ2
1561 cycles for AxDotXMM1_fastcall
1561 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1554 cycles for DotXMM1Acc4EJ2
1562 cycles for AxDotXMM1_fastcall
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
--- done ---
The other.
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 10.31 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 6.89 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 2.58 Seconds
Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 2.62 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 2.14 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 1.97 Seconds
From the netbook, for chuckles..
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
6068 cycles for DotXMM1Acc4E
3240 cycles for DotXMM1Acc4EJ1
3542 cycles for DotXMM1Acc4EJ2
5300 cycles for AxDotXMM1_fastcall
4428 cycles for DotXMM1Acc4E
3339 cycles for DotXMM1Acc4EJ1
3537 cycles for DotXMM1Acc4EJ2
5405 cycles for AxDotXMM1_fastcall
4292 cycles for DotXMM1Acc4E
3328 cycles for DotXMM1Acc4EJ1
3520 cycles for DotXMM1Acc4EJ2
5285 cycles for AxDotXMM1_fastcall
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 48.08 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 45.88 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 9.44 Seconds
Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 16.88 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 12.69 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 5.81 Seconds
"In practice, it leads to the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising."
You can try my shorter variant too: :U
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
db 3 Dup(0)
DotXMM2Acc16ELingo proc srcX, srcY, counter
pxor xmm5, xmm5
mov eax, [esp+1*4] ; eax->srcX
pxor xmm6, xmm6
mov edx, [esp+2*4] ; edx ->srcY
movaps xmm0, [eax]
pxor xmm4, xmm4
mov ecx, [esp+3*4] ; ecx = counter
sub eax, edx
@@:
movaps xmm1, [eax+edx+16]
mulps xmm0, [edx]
addps xmm5, xmm6
movaps xmm3, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm4, xmm0
movaps xmm6, [eax+edx+48]
mulps xmm3, [edx+32]
add edx, 64
addps xmm5, xmm1
movaps xmm0, [eax+edx]
mulps xmm6, [edx+48-64]
addps xmm4, xmm3
sub ecx, 16
ja @b
addps xmm4, xmm6
addps xmm4, xmm5
movhlps xmm0, xmm4
addps xmm4, xmm0
pshufd xmm0, xmm4,1
addss xmm4, xmm0
movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]
ret 3*4
DotXMM2Acc16ELingo endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
And with Lingo's XMM2 spanking everyone
Intel(R) Atom(TM) CPU N270 @ 1.60GHz (SSE4)
5739 cycles for DotXMM1Acc4E
3250 cycles for DotXMM1Acc4EJ1
3534 cycles for DotXMM1Acc4EJ2
5290 cycles for AxDotXMM1_cdecl
5253 cycles for AxDotXMM1_fastcall
1899 cycles for DotXMM2Acc16ELingo
4239 cycles for DotXMM1Acc4E
3289 cycles for DotXMM1Acc4EJ1
3488 cycles for DotXMM1Acc4EJ2
5278 cycles for AxDotXMM1_cdecl
5502 cycles for AxDotXMM1_fastcall
2903 cycles for DotXMM2Acc16ELingo
5959 cycles for DotXMM1Acc4E
4370 cycles for DotXMM1Acc4EJ1
4629 cycles for DotXMM1Acc4EJ2
6626 cycles for AxDotXMM1_cdecl
6223 cycles for AxDotXMM1_fastcall
1945 cycles for DotXMM2Acc16ELingo
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
The result for AxDotXMM1_fastcall: 2867507200
The result for DotXMM2Acc16ELingo: 2867507200
--- done ---
Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz (SSE4)
1621 cycles for DotXMM1Acc4E
1596 cycles for DotXMM1Acc4EJ1
1573 cycles for DotXMM1Acc4EJ2
1587 cycles for AxDotXMM1_cdecl
1699 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1568 cycles for DotXMM1Acc4E
1564 cycles for DotXMM1Acc4EJ1
1678 cycles for DotXMM1Acc4EJ2
1611 cycles for AxDotXMM1_cdecl
1598 cycles for AxDotXMM1_fastcall
1052 cycles for DotXMM2Acc16ELingo
1566 cycles for DotXMM1Acc4E
1559 cycles for DotXMM1Acc4EJ1
1582 cycles for DotXMM1Acc4EJ2
1601 cycles for AxDotXMM1_cdecl
1611 cycles for AxDotXMM1_fastcall
1063 cycles for DotXMM2Acc16ELingo
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
1574 cycles for DotXMM1Acc4E
1554 cycles for DotXMM1Acc4EJ1
1553 cycles for DotXMM1Acc4EJ2
1565 cycles for AxDotXMM1_cdecl
1560 cycles for AxDotXMM1_fastcall
1050 cycles for DotXMM2Acc16ELingo
1560 cycles for DotXMM1Acc4E
1555 cycles for DotXMM1Acc4EJ1
1554 cycles for DotXMM1Acc4EJ2
1565 cycles for AxDotXMM1_cdecl
1560 cycles for AxDotXMM1_fastcall
1050 cycles for DotXMM2Acc16ELingo
1560 cycles for DotXMM1Acc4E
1554 cycles for DotXMM1Acc4EJ1
1553 cycles for DotXMM1Acc4EJ2
1565 cycles for AxDotXMM1_cdecl
1560 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
The result for AxDotXMM1_fastcall: 2867507200
The result for DotXMM2Acc16ELingo: 2867507200
--- done ---
Alex,
Here are my P4 timings for DotProducts:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE2)
2864 cycles for DotXMM1Acc4E
2618 cycles for DotXMM1Acc4EJ1
2235 cycles for DotXMM1Acc4EJ2
2527 cycles for AxDotXMM1_fastcall
2525 cycles for DotXMM1Acc4E
2488 cycles for DotXMM1Acc4EJ1
2367 cycles for DotXMM1Acc4EJ2
2528 cycles for AxDotXMM1_fastcall
2498 cycles for DotXMM1Acc4E
2483 cycles for DotXMM1Acc4EJ1
2220 cycles for DotXMM1Acc4EJ2
2536 cycles for AxDotXMM1_fastcall
The result for DotXMM1Acc4E: 2867507200
The result for DotXMM1Acc4EJ1: 2867507200
The result for DotXMM1Acc4EJ2: 2867507200
The result for AxDotXMM1_cdecl: 2867507200
--- done ---
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions.
Calculating the dot product in 6 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 17.34 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 11.42 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 3.69 Seconds
Simple SSE2 Code (Alex's loop variation: 16bytes loop):
--------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 4.02 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 3.69 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 3.33 Seconds
Dave
Quote from: lingo on September 02, 2010, 12:29:41 AM
You can try my shorter variant too: :U
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2548 cycles for DotXMM1Acc4E
2163 cycles for DotXMM1Acc4EJ1
2072 cycles for DotXMM1Acc4EJ2
2540 cycles for AxDotXMM1_fastcall
2133 cycles for DotXMM2Acc16ELingo
The result returned by Lingo's algo is correct! If you want to test yourself, activate UseMB in line 1.
On my archaic Prescott, Lingo's code is actually a bit faster:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3652 cycles for DotXMM1Acc4E
2778 cycles for DotXMM1Acc4EJ1
2779 cycles for DotXMM1Acc4EJ2
2754 cycles for AxDotXMM1_fastcall
2173 cycles for DotXMM2Acc16ELingo
I think I've lost track of what's going on in this thread.
How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?
At some point in this thread the timings switched from seconds for 5,000,000 loops to clks per loop. Did anything else change?
If not, then I get the following timings:
                          Phenom II   Atom N270
Code from reply #17           640        2156
Gunther's original           1056        1660
  (2 acc., 16 elements)
Lingo's latest               1157        1858
On my PC (3 GHz Phenom II) Gunther's original runs in 1.76 s = 1056 clks, Lingo's latest runs in 1157 clks, but the code posted in Reply #17 of this thread runs in 640 clks. I gather from other posts in this thread that this is very CPU dependent.
I've tried them on other machines, and older PCs do show Lingo's code to be a little faster (in the region of 5%) than the Reply #17 code on those old machines.
Other modern Athlon-type CPUs show the Reply #17 code being nearly twice as fast.
I don't have any modern Intels available to test.
Paul.
"If not, then I get the following timings:"
Without your testing files it is just bla, bla, bla... :lol
Lingo,
I forget - sometimes people can only handle one query at a time, so I'll simplify it.
How is Lingo's latest significantly different to the original DotXMM2Acc16E version posted by Gunther?
The code looks similar and the timings are similar.
Here are the timings from the already posted testing files:
                          Phenom II   Atom N270
Gunther's original           1056        1660
  (2 acc., 16 elements)
Lingo's latest               1157        1858
Paul.
Here are the relevant extracts from the Gunther original and Lingo's latest.
Shown below are the main loops of the 2 routines:
@@:
movaps xmm1, [eax+edx+16]
mulps xmm0, [edx]
addps xmm5, xmm6
movaps xmm3, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm4, xmm0
movaps xmm6, [eax+edx+48]
mulps xmm3, [edx+32]
add edx, 64
addps xmm5, xmm1
movaps xmm0, [eax+edx]
mulps xmm6, [edx+48-64]
addps xmm4, xmm3
sub ecx, 16
ja @b
.loop:
movaps xmm2,[eax+edx+32] ;xmm2 = X[i+11] X[i+10] X[i+9] X[i+8]
mulps xmm1,[edx+16] ;xmm1 = X[i+7]*Y[i+7] X[i+6]*Y[i+6] X[i+5]*Y[i+5] X[i+4]*Y[i+4]
addps xmm4,xmm0 ;sum up
movaps xmm3,[eax+edx+48] ;xmm3 = X[i+15] X[i+14] X[i+13] X[i+12]
mulps xmm2,[edx+32] ;xmm2 = X[i+11]*Y[i+11] X[i+10]*Y[i+10] X[i+9]*Y[i+9] X[i+8]*Y[i+8]
addps xmm5,xmm1 ;sum up
movaps xmm0,[eax+edx+64] ;xmm0 = X[i+3] X[i+2] X[i+1] X[i+0]
mulps xmm3,[edx+48] ;xmm3 = X[i+15]*Y[i+15] X[i+14]*Y[i+14] X[i+13]*Y[i+13] X[i+12]*Y[i+12]
lea edx,[edx+64] ;update pointers
addps xmm4,xmm2 ;sum up
movaps xmm1,[eax+edx+16] ;xmm1 = X[i+7] X[i+6] X[i+5] X[i+4]
mulps xmm0,[edx] ;xmm0 = X[i+3]*Y[i+3] X[i+2]*Y[i+2] X[i+1]*Y[i+1] X[i+0]*Y[i+0]
addps xmm5,xmm3 ;sum up
sub ecx,byte 64 ;count down
jnz .loop
Your "results" without your testing files are just bla,bla,bla... :(
It would be better to post all the algos rather than just the loops.. :lol
This is Gunther's thread, and it is his choice which algo works best for him.
Lingo,
Quotewithout your testing files
What testing files are you after? I used the ones in the original post of this thread by Gunther and the one in Reply #59 by jj2007.
The files are already posted. It would be pointless to repost the same ones again.
Paul.
"I used the ones in the original post of this thread by Gunther and the one in reply in Reply #59 by jj2007."
It is a result from original post by Gunther:
Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 1.87 Seconds
and this is a result from Reply #59 by jj2007:
2173 cycles for DotXMM2Acc16ELingo
As you can see, the units of the times are different (seconds vs. cycles)... :lol
So, your "results" without your testing files are just bla, bla, bla...
Quote from: dioxin on September 02, 2010, 05:21:18 PM
I think I've lost track of what's going on in this thread.
Paul,
Here are results for all algos including your #17:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2529 cycles for DotXMM1Acc4E
2163 cycles for DotXMM1Acc4EJ1
2070 cycles for DotXMM1Acc4EJ2
2550 cycles for AxDotXMM1_fastcall
2138 cycles for DotXMM2Acc16ELingo
2135 cycles for DotXMM2Acc16EPaul
Lingo,
it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.
That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.
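As a worked example with the figures already given above: 1.76 s for Gunther's original on a 3 GHz Phenom II is 1.76 s / 5,000,000 = 352 ns per call, and 352 ns x 3.0 GHz = 1056 cycles - the clks figure shown in the table above.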
jj2007,
thanks, on mine with your posted code I get:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2410 cycles for DotXMM1Acc4E
2115 cycles for DotXMM1Acc4EJ1
2118 cycles for DotXMM1Acc4EJ2
2116 cycles for AxDotXMM1_fastcall
1126 cycles for DotXMM2Acc16ELingo
723 cycles for DotXMM2Acc16EPaul
2156 cycles for DotXMM1Acc4E
2117 cycles for DotXMM1Acc4EJ1
2115 cycles for DotXMM1Acc4EJ2
2114 cycles for AxDotXMM1_fastcall
1126 cycles for DotXMM2Acc16ELingo
725 cycles for DotXMM2Acc16EPaul
2155 cycles for DotXMM1Acc4E
2130 cycles for DotXMM1Acc4EJ1
2134 cycles for DotXMM1Acc4EJ2
2121 cycles for AxDotXMM1_fastcall
1126 cycles for DotXMM2Acc16ELingo
724 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
Quote from: dioxin on September 02, 2010, 06:46:26 PM
jj2007,
thanks, on mine with your posted code I get:
QuoteAMD Phenom(tm) II X4 945 Processor (SSE3)
2118 cycles for DotXMM1Acc4EJ2
723 cycles for DotXMM2Acc16EPaul
QuoteIntel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2071 cycles for DotXMM1Acc4EJ2
2135 cycles for DotXMM2Acc16EPaul
Wow!
:lol
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
1566 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1561 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1052 cycles for DotXMM2Acc16EPaul
1562 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1561 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1052 cycles for DotXMM2Acc16EPaul
1562 cycles for DotXMM1Acc4E
1555 cycles for DotXMM1Acc4EJ1
1561 cycles for DotXMM1Acc4EJ2
1561 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1052 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2270 cycles for DotXMM1Acc4E
2226 cycles for DotXMM1Acc4EJ1
2199 cycles for DotXMM1Acc4EJ2
2259 cycles for AxDotXMM1_fastcall
1400 cycles for DotXMM2Acc16ELingo
1582 cycles for DotXMM2Acc16EPaul
2242 cycles for DotXMM1Acc4E
2226 cycles for DotXMM1Acc4EJ1
2232 cycles for DotXMM1Acc4EJ2
2269 cycles for AxDotXMM1_fastcall
1393 cycles for DotXMM2Acc16ELingo
1579 cycles for DotXMM2Acc16EPaul
2246 cycles for DotXMM1Acc4E
2221 cycles for DotXMM1Acc4EJ1
2193 cycles for DotXMM1Acc4EJ2
2268 cycles for AxDotXMM1_fastcall
1425 cycles for DotXMM2Acc16ELingo
1575 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Quote from: dioxin on September 02, 2010, 06:46:26 PM
it's simple maths. Gunther's runs 5,000,000 loops in 1.87sec.
That's 1.87 / 5000000 = 374nsec per loop. Multiply that 374 by the speed of your CPU in GHz and you'll see how many cycles Gunther's code takes on your PC.
Unless Lingo's Core2 has a wider pipeline that can decode three uOps per cycle, or an extra ALU to do up to 6 uOps per cycle - then the math isn't quite that simple.
-r
As announced a while ago, here is the update of my test bed. The program now calculates the dot product in 9 different ways. I've included Paul's and lingo's code via inline assembly. Furthermore, I squeezed out a bit more time with a little macro magic in DotXMM2Acc32E (please have a look into dotfloatfa.cpp).
My impression is that it's a good solution, especially for older AMD chips (Athlons or Opterons). Newer AMD processors like the Phenom will probably run better with Paul's code, but that's not tested yet. Here is a part of the current timings with my Athlon X2, 1.9 GHz (only the last 4 results are shown; for more, look into results.pdf):
Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 3.20 Seconds
SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 3.88 Seconds
SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------
Dot Product 8 = 2867507200.00
Elapsed Time = 3.20 Seconds
SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------
Dot Product 9 = 2867507200.00
Elapsed Time = 3.08 Seconds
It would be interesting to see what lingo's Snow Leopard brings. The program will compile properly under Mac OS X, so please try it. I'll also try to include Antariy's code in the application. My special thanks go to all of you for the suggestions and for helping with testing.
Gunther
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!
Cycles may be cute and easy-to-use benchmarks may be fun, but algorithms run in application code in REAL TIME; test the algo any other way and you get serious anomalies.
A small correction to my old algo, and new results: :lol
align 16
db 3 Dup(0)
DotXMM2Acc32ELingo proc srcX, srcY, counter
mov eax, [esp+1*4] ; eax->srcX
pxor xmm5, xmm5
pxor xmm3, xmm3
mov edx, [esp+2*4] ; edx ->srcY
movaps xmm0, [eax]
pxor xmm4, xmm4
mov ecx, [esp+3*4] ; ecx = counter
sub eax, edx ; eax = srcX - srcY, so [eax+edx] addresses srcX while edx walks srcY
@@:
movaps xmm1, [eax+edx+16]
mulps xmm0, [edx]
addps xmm4, xmm3
movaps xmm2, [eax+edx+32]
mulps xmm1, [edx+16]
addps xmm5, xmm0
movaps xmm3, [eax+edx+48]
mulps xmm2, [edx+32]
addps xmm4, xmm1
movaps xmm6, [eax+edx+64]
mulps xmm3, [edx+48]
addps xmm5, xmm2
movaps xmm7, [eax+edx+80]
mulps xmm6, [edx+64]
addps xmm4, xmm3
movaps xmm2, [eax+edx+96]
mulps xmm7, [edx+80]
addps xmm5, xmm6
movaps xmm3, [eax+edx+112]
mulps xmm2, [edx+96]
addps xmm4, xmm7
movaps xmm0, [eax+edx+128]
mulps xmm3, [edx+112]
addps xmm5, xmm2
add edx, 128
sub ecx, 32
ja @b
addps xmm4, xmm3              ;fold in the products of the last 4 elements
addps xmm4, xmm5              ;combine the two accumulators
movhlps xmm0, xmm4            ;copy the high pair of xmm4 to the low pair of xmm0
addps xmm4, xmm0              ;two partial sums now in the low two elements
pshufd xmm0, xmm4,1           ;move element 1 down to element 0
addss xmm4, xmm0              ;final scalar sum in the low element of xmm4
movss dword ptr [esp+2*4], xmm4 ;spill the scalar to the stack ...
fld dword ptr [esp+2*4]       ;... and return it on the FPU stack
ret 4*3
DotXMM2Acc32ELingo endp
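For comparison only: on SSE3-capable CPUs the horizontal sum at the end could also be written with haddps. This is just a sketch, not part of any posted algo, and whether it is actually faster is exactly the kind of thing the test bed has to show:
addps xmm4, xmm3              ;fold in the products of the last 4 elements
addps xmm4, xmm5              ;combine the two accumulators
haddps xmm4, xmm4             ;(a0+a1, a2+a3, a0+a1, a2+a3)
haddps xmm4, xmm4             ;total sum in every element
movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]       ;return the float on the FPU stack
ret 4*3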
Results:
Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (SSE4)
1564 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1557 cycles for DotXMM1Acc4EJ2
1562 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1053 cycles for DotXMM2Acc16EPaul
1563 cycles for DotXMM1Acc4E
1557 cycles for DotXMM1Acc4EJ1
1557 cycles for DotXMM1Acc4EJ2
1562 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1052 cycles for DotXMM2Acc16EPaul
1562 cycles for DotXMM1Acc4E
1556 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1563 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1052 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
AMD Turion(tm) 64 Mobile Technology ML-30 (SSE3)
2291 cycles for DotXMM1Acc4E
2233 cycles for DotXMM1Acc4EJ1
2215 cycles for DotXMM1Acc4EJ2
2277 cycles for AxDotXMM1_fastcall
1454 cycles for DotXMM2Acc16ELingo
1338 cycles for DotXMM2Acc32ELingo
1611 cycles for DotXMM2Acc16EPaul
2257 cycles for DotXMM1Acc4E
2230 cycles for DotXMM1Acc4EJ1
2249 cycles for DotXMM1Acc4EJ2
2251 cycles for AxDotXMM1_fastcall
1435 cycles for DotXMM2Acc16ELingo
1362 cycles for DotXMM2Acc32ELingo
1623 cycles for DotXMM2Acc16EPaul
2268 cycles for DotXMM1Acc4E
2253 cycles for DotXMM1Acc4EJ1
2250 cycles for DotXMM1Acc4EJ2
2235 cycles for AxDotXMM1_fastcall
1402 cycles for DotXMM2Acc16ELingo
1371 cycles for DotXMM2Acc32ELingo
1605 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Quote from: hutch-- September 03, 2010, at 11:49:34 PMLike a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!
Wisely spoken. There are cache misses, badly predicted jumps, stalls and a whole bunch of other difficulties that can occur in practice. Therefore, I made the test bed as practical as possible. But 100% certainty is only reached after implementing the new algorithm in the original application. That will happen next week; I hope you keep your fingers crossed for me, hutch.
Gunther
Is it working right?
Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---
Interesting that it only fails on one of Lingo's machines.
Lingo's latest code:
2943 cycles for DotXMM1Acc4E
2816 cycles for DotXMM1Acc4EJ1
2828 cycles for DotXMM1Acc4EJ2
2816 cycles for AxDotXMM1_fastcall
2266 cycles for DotXMM2Acc16ELingo
1878 cycles for DotXMM2Acc32ELingo
1812 cycles for DotXMM2Acc16EPaul
2908 cycles for DotXMM1Acc4E
2811 cycles for DotXMM1Acc4EJ1
2815 cycles for DotXMM1Acc4EJ2
2794 cycles for AxDotXMM1_fastcall
2278 cycles for DotXMM2Acc16ELingo
1842 cycles for DotXMM2Acc32ELingo
1803 cycles for DotXMM2Acc16EPaul
2903 cycles for DotXMM1Acc4E
2819 cycles for DotXMM1Acc4EJ1
2836 cycles for DotXMM1Acc4EJ2
2822 cycles for AxDotXMM1_fastcall
2270 cycles for DotXMM2Acc16ELingo
1874 cycles for DotXMM2Acc32ELingo
1799 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
Alex
I have no desire to delve into Lingo's code, but maybe the incorrect result on the Core comes from this place:
movss dword ptr [esp+2*4], xmm4
fld dword ptr [esp+2*4]
As far as I know, Core and later CPUs have a changed engine that watches for reads after writes and can reorder execution either way - whether or not the store to the location that will be read later has completed.
Maybe mixing SSE and FPU code does not work well with this engine? I see in the disassembly that there is no WAIT instruction before the FLD.
Alex
redskull,
QuoteThen the math isn't quite that simple.
Yes it is.
Whatever CPU Lingo has, the conversion from seconds to cycles is a straightforward equation.
Seconds = Cycles / Clk Frequency.
Nothing else in his CPU matters.
Paul.
Quote from: hutch-- on September 02, 2010, 10:49:34 PM
Like a voice crying in the wilderness, if you want to know how FAST an algorithm is, test it in REAL TIME !!!!
Cycles may be cute and easy to use benchmarks may be fun but algorithms run in application code in REAL TIME, test the algo any other way and you get serious anomolies.
No need to cry, Hutch :lol
The timings with MichaelW's macros can be problematic for very small pieces of code, but for algos of this size - 180 bytes for Lingo's algo, 456 for my JE2, both over 2,000 cycles - they yield quite realistic results.
QuoteIntel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
2575 cycles for DotXMM1Acc4E
2161 cycles for DotXMM1Acc4EJ1
2071 cycles for DotXMM1Acc4EJ2
2543 cycles for AxDotXMM1_fastcall
2139 cycles for DotXMM2Acc16ELingo
2079 cycles for DotXMM2Acc32ELingo
2103 cycles for DotXMM2Acc16EPaul
Congrats, Lingo. Even on my CPU you are close now :U
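For readers who don't know those macros: MichaelW's counter macros run the code under test in a loop at raised priority and leave the measured cycle count in eax. Roughly like this (assuming the usual counter_begin/counter_end names; the loop count and priority class here are assumptions, not the values used in the posted test bed):
counter_begin 1000, HIGH_PRIORITY_CLASS
invoke DotXMM2Acc16ELingo, offset SrcX, offset SrcY, 2048
fstp st(0)                    ; drop the float returned on the FPU stack
counter_end
print str$(eax), " cycles", 13, 10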
Lingo's dotpro18 results.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
1574 cycles for DotXMM1Acc4E
1555 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1562 cycles for AxDotXMM1_fastcall
1051 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1051 cycles for DotXMM2Acc16EPaul
1561 cycles for DotXMM1Acc4E
1555 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1561 cycles for AxDotXMM1_fastcall
1050 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1051 cycles for DotXMM2Acc16EPaul
1563 cycles for DotXMM1Acc4E
1554 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1563 cycles for AxDotXMM1_fastcall
1052 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1052 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Quote from: clive on September 02, 2010, 11:22:16 PM
Is it working right?
Quote from: lingo
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1326793031 <<<<<<<<<
The result: 1328212656
--- done ---
Where did you see that one? Can't find that post...
Quote from: jj2007
Where did you see that one? Can't find that post...
It was from post #74, but it's been changed.
Quote from: clive on September 03, 2010, 12:24:59 AM
Quote from: jj2007
Where did you see that one? Can't find that post...
It was from post #74, but it's been changed.
OK. Here is DotPro18 with code sizes added.
78 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
60 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
183 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul
Alex, have you tried unrolling a little bit?
DotPro18:
AMD Phenom(tm) II X4 945 Processor (SSE3)
2754 cycles for DotXMM1Acc4E
2155 cycles for DotXMM1Acc4EJ1
2155 cycles for DotXMM1Acc4EJ2
2137 cycles for AxDotXMM1_fastcall
1197 cycles for DotXMM2Acc16ELingo
1186 cycles for DotXMM2Acc32ELingo
818 cycles for DotXMM2Acc16EPaul
2158 cycles for DotXMM1Acc4E
2154 cycles for DotXMM1Acc4EJ1
2154 cycles for DotXMM1Acc4EJ2
2133 cycles for AxDotXMM1_fastcall
1195 cycles for DotXMM2Acc16ELingo
1186 cycles for DotXMM2Acc32ELingo
818 cycles for DotXMM2Acc16EPaul
2159 cycles for DotXMM1Acc4E
2154 cycles for DotXMM1Acc4EJ1
2130 cycles for DotXMM1Acc4EJ2
2131 cycles for AxDotXMM1_fastcall
1195 cycles for DotXMM2Acc16ELingo
1186 cycles for DotXMM2Acc32ELingo
818 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
JJ,
For data of the type proposed for algos of this type, it needs to be streamed to get viable results. What I would suggest is to set up 100 MB of data, load it into memory and stream through it to get timings. The advantage of a large source is that it does not fit into the cache, so you avoid cache effects that would not be present when the algo is used in real time.
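As a rough illustration of that suggestion, a minimal MASM sketch of such a streaming test might look like this (assumptions: the masm32 runtime, a DotProduct placeholder standing for whichever routine is measured, taking the same srcX/srcY/count arguments as the procs above, with the count given in elements):
include \masm32\include\masm32rt.inc
BUFBYTES equ 100*1024*1024              ; 100 MB - far larger than any cache
SLICE    equ 256*4                      ; 256 floats per call, as in the image code
.data?
pBuf dd ?
t0   dd ?
.code
start:
    mov pBuf, rv(VirtualAlloc, 0, BUFBYTES, MEM_COMMIT, PAGE_READWRITE) ; page-aligned, so movaps is safe
    mov t0, rv(GetTickCount)
    mov esi, pBuf
    mov edi, BUFBYTES/SLICE
@@: invoke DotProduct, esi, esi, 256    ; placeholder name; adjust the count if the routine counts bytes
    fstp st(0)                          ; discard the float returned on the FPU stack
    add esi, SLICE                      ; walk forward through the buffer - no cache reuse
    dec edi
    jnz @B
    invoke GetTickCount
    sub eax, t0
    print str$(eax), " ms for one pass over 100 MB", 13, 10
    invoke ExitProcess, 0
end start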
Timings from Gunther's latest post in Reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 13.70 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 6.91 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 3.47 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 3.47 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 1.77 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 1.78 Seconds
SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 1.16 Seconds
SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------
Dot Product 8 = 2867507200.00
Elapsed Time = 1.77 Seconds
SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------
Dot Product 9 = 2867507200.00
Elapsed Time = 1.78 Seconds
Please press enter to terminate...
A variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.
jj2007,
QuoteA variant with an option to unroll Alex' algo by 4 - still compact at 102 bytes, and a bit faster of course.
AMD Phenom(tm) II X4 945 Processor (SSE3)
2202 cycles for DotXMM1Acc4E
2153 cycles for DotXMM1Acc4EJ1
2152 cycles for DotXMM1Acc4EJ2
2140 cycles for AxDotXMM1_fastcall
1196 cycles for DotXMM2Acc16ELingo
1186 cycles for DotXMM2Acc32ELingo
789 cycles for DotXMM2Acc16EPaul
2156 cycles for DotXMM1Acc4E
2161 cycles for DotXMM1Acc4EJ1
2147 cycles for DotXMM1Acc4EJ2
2142 cycles for AxDotXMM1_fastcall
1196 cycles for DotXMM2Acc16ELingo
1187 cycles for DotXMM2Acc32ELingo
789 cycles for DotXMM2Acc16EPaul
2157 cycles for DotXMM1Acc4E
2155 cycles for DotXMM1Acc4EJ1
2155 cycles for DotXMM1Acc4EJ2
2139 cycles for AxDotXMM1_fastcall
1196 cycles for DotXMM2Acc16ELingo
1187 cycles for DotXMM2Acc32ELingo
815 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
80 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
102 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
175 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul
--- done ---
Quote from: dioxin on September 03, 2010, 12:40:23 AM
Timings from Gunther's latest post in Reply #72:
Celeron M:
QuoteC++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 6.92 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 8.42 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 7.52 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 6.77 Seconds
SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 7.16 Seconds
SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------
Dot Product 8 = 2867507200.00
Elapsed Time = 6.78 Seconds
SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------
Dot Product 9 = 2867507200.00
Elapsed Time = 6.70 Seconds
EDIT: I have added a macro for testing the correctness of results.
IsCorrect MACRO algo
invoke &algo&, offset SrcX, offset SrcY, 2048
fstp Res4
Fcmp Res4, Expected
.if !Zero?
Print Str$("\nIncorrect result: %i for &algo&", Res4)
.endif
ENDM
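Presumably the macro is then invoked once per routine, e.g. IsCorrect DotXMM2Acc16ELingo - an assumption about how it is wired into the test bed, not something shown in the posted source.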
I have 2 notes about my algos in dotfloat_update.zip (from Gunther's latest post)
- I see ALIGN 16 just before the .loop: label. That is not right, because it costs us some extra clocks.
- My last algo is not included either... :(
Quote from: dioxin September 03, 2010, at 01:40:23 AMTimings from Gunther's latest post in Reply #72:
Thank you Paul, you see: your Phenom does a good job with 4 accumulators, while my Athlon did not.
Quote from: lingo September 03, 2010, at 02:16:02 AM- I see ALIGN 16 just before the .loop: label. That is not right, because it costs us some extra clocks.
What a mess. But aligning the hot spots (that is, the .loop label) to 16 is a recommendation by Intel and AMD. Anyway, it's only a question of cut & paste for you: put the align directive at the procedure's entry, compile the program again - and voilà, it's done. What the heck. By the way, was it Windows or Snow Leopard?
Quote from: lingo September 03, 2010, at 02:16:02 AM- My last algo is not included either...
Another mess. But joking aside, it's a question of time. If there is enough time, I'll try to include your last algorithm this weekend. Okay?
Gunther
For the record, timings on my old Intel:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3674 cycles for DotXMM1Acc4E
3215 cycles for DotXMM1Acc4EJ1
3088 cycles for DotXMM1Acc4EJ2
2795 cycles for AxDotXMM1_fastcall
1953 cycles for DotXMM2Acc16ELingo
1910 cycles for DotXMM2Acc32ELingo
1752 cycles for DotXMM2Acc16EPaul
4064 cycles for DotXMM1Acc4E
3644 cycles for DotXMM1Acc4EJ1
3832 cycles for DotXMM1Acc4EJ2
3065 cycles for AxDotXMM1_fastcall
1831 cycles for DotXMM2Acc16ELingo
1752 cycles for DotXMM2Acc32ELingo
1849 cycles for DotXMM2Acc16EPaul
It seems archaic CPUs like Lingo's code :bg
And the same CPU on reply #72:
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 20.66 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 10.12 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 4.11 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 4.00 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 2.58 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 2.49 Seconds
SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 2.45 Seconds
SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------
Dot Product 8 = 2867507200.00
Elapsed Time = 2.53 Seconds
SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------
Dot Product 9 = 2867507200.00
Elapsed Time = 2.47 Seconds
Quote from: jj2007 September 03, 2010, at 08:41:56 AMFor the record, timings on my old Intel:
Thank you, Jochen.
Quote from: jj2007 September 03, 2010, at 08:41:56 AMIt seems archaic CPUs like Lingo's code
But Paul's too. :bg
Gunther
Hi,
DotPro18b results.
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
2613 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2084 cycles for DotXMM1Acc4EJ2
2417 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2099 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul
2623 cycles for DotXMM1Acc4E
2227 cycles for DotXMM1Acc4EJ1
2097 cycles for DotXMM1Acc4EJ2
2418 cycles for AxDotXMM1_fastcall
2142 cycles for DotXMM2Acc16ELingo
2101 cycles for DotXMM2Acc32ELingo
2424 cycles for DotXMM2Acc16EPaul
Test for correct results, expected 2867507200:
80 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
102 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
175 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul
--- done ---
Steve
For dotfloat_update.zip:
Supported by Processor and installed Operating System:
------------------------------------------------------
Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support.
Calculating the dot product in 9 different variations.
That'll take a little while ...
Straight forward C++ implementation (FPU Code):
-----------------------------------------------
Dot Product 1 = 2867507200.00
Elapsed Time = 34.39 Seconds
C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------
Dot Product 2 = 2867507200.00
Elapsed Time = 16.62 Seconds
C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------
Dot Product 3 = 2867507200.00
Elapsed Time = 7.08 Seconds
Simple SSE2 Code (1 Accumulator - 4 elements per loop cycle):
-------------------------------------------------------------
Dot Product 4 = 2867507200.00
Elapsed Time = 7.05 Seconds
Solid SSE2 Code (2 Accumulators - 8 elements per loop cycle):
-------------------------------------------------------------
Dot Product 5 = 2867507200.00
Elapsed Time = 4.53 Seconds
Better SSE2 Code (2 Accumulators - 16 elements per loop cycle):
---------------------------------------------------------------
Dot Product 6 = 2867507200.00
Elapsed Time = 4.62 Seconds
SSE2 Code by Dioxin (4 Accumulators - 16 elements per loop cycle):
------------------------------------------------------------------
Dot Product 7 = 2867507200.00
Elapsed Time = 4.45 Seconds
SSE2 Code by Lingo (2 Accumulators - 32 elements per loop cycle):
-----------------------------------------------------------------
Dot Product 8 = 2867507200.00
Elapsed Time = 4.50 Seconds
SSE2 Code with PATTERN (2 Accumulators - 32 elements per loop cycle):
---------------------------------------------------------------------
Dot Product 9 = 2867507200.00
Elapsed Time = 4.55 Seconds
Gunther, my CPU doesn't support HTT. This is a mistake most programs that do CPU detection make. As far as I know, all 90 nm Prescotts say they support HTT, but sometimes that isn't true.
Alex
Quote from: Antariy September 04, 2010, at 12:47:53 AMGunther, my CPU doesn't support HTT. This is a mistake most programs that do CPU detection make. As far as I know, all 90 nm Prescotts say they support HTT, but sometimes that isn't true.
Hi Antariy,
Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?
Gunther
Quote from: Gunther on September 04, 2010, 01:36:10 AM
Hi Antariy,
Thank you for your timings. Never mind; the application doesn't use HTT. What exactly is your CPU and OS?
Gunther
I know the app doesn't use HTT. But HTT sometimes makes a mess of the timings.
My CPU is a Celeron D 310 - Prescott core, 2.13 GHz, 256 KB L2 cache.
OS: WinXP SP2.
Alex
Gunther, when I talk about HTT, I don't mean that the prog uses it - how could it? :)
I mean this:
+ 13 SSE3 (Prescott) Instructions,
+ HTT (hyper thread technology) support. <-------------------------- THIS
Calculating the dot product in 9 different variations.
That'll take a little while ...
Your CPU detection makes a general mistake - my CPU really doesn't have HTT.
Sorry if I didn't write that clearly at first - my English is bad...
Alex
Alex,
thank you for your information.
Quote from: Antariy September 05, 2010, at 11:05:57 PMMy CPU is a Celeron D 310 - Prescott core, 2.13 GHz, 256 KB L2 cache.
OS: WinXP SP2.
The machine isn't as bad as you mentioned in your PM.
Gunther
Alex,
excuse me, we both posted at the same time, so I couldn't see your latest answer.
Quote from: Antariy September 05, 2010, at 11:38:52 PMYour CPU detection makes a general mistake - my CPU really doesn't have HTT.
I'll check that, but I used the algorithm recommended by Intel and AMD.
Gunther
Quote from: Gunther on September 04, 2010, 10:41:13 PM
The machine isn't as bad as you mentioned in your PM.
Yes, big advantage - heavily unrolled algos run slower than short looped ones :)
Alex
Alex,
The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling, where many of the later ones show little change. I have algos written for P4s that are slower than a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache, but I have found in practice that it pays to try different amounts and not go beyond the point where the speed stops improving.
Gunther, this is code to make sure the CPU truly has HTT support:
mov eax,1      ; CPUID leaf 1
cpuid
shr ebx,16     ; EBX[23:16] = logical processors per physical package
and ebx,255
If EBX contains 1, the CPU doesn't support HTT, because an HTT CPU has logically separated cores, so a true HTT CPU must report more than one logical core on its chip.
Alex
Quote from: hutch-- on September 05, 2010, 04:12:26 AM
Alex,
The level of unrolling depends on the size of the code cache of the particular processor. Some older processors respond well to unrolling where many of the later ones show little change. I have algos written for P4s that are slower relative to a short algo on a later processor. Intel recommend unrolling up to the limit of the code cache but I have found in practice that you try different amounts and don't go more than where the speed picks up.
Yes, hutch.
Most code depends heavily on the hardware it runs on.
All CPUs have different designs, even if they share a type and microarchitecture.
It's no wonder that AMD and Intel give very different results in the tests: AMD always builds wide solutions that use simple but parallel schemes (that shows in all the tests), while Intel builds deep solutions that rely on strong prediction and very deep pipelining. But deep pipelining is bad for code with a lot of register renaming, and if a prediction fails, a CPU with a deep pipeline suffers bigger stalls.
These are just my thoughts, of course, but they are corroborated by many things: code that uses many registers in parallel runs very well on AMD, and AMD CPUs have high heat emission, which suggests really big schemes capable of truly parallel execution, etc.
Alex
Quote from: Antariy September 06, 2010, 10:49:52 pmIf EBX contains 1, the CPU doesn't support HTT, because an HTT CPU has logically separated cores, so a true HTT CPU must report more than one logical core on its chip.
Alex,
thank you for the hint. I'll inspect your CPU detection method as soon as possible. Maybe I can adopt a few ideas (with credit given, that's clear) and make my procedure more robust and reliable.
Gunther
Gunther, your procedure is good; I'm only suggesting a more accurate test for HTT.
You can see my thread "http://www.masm32.com/board/index.php?topic=14754.0" (CPU identification), where I posted a small app that checks whether the CPU really supports HTT (cores.zip). It seems to work correctly.
Alex
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPUs, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all
AMD is rolling out its own version of HT in early 2011 with its first major architectural change since the first Athlons, the new 32nm Bulldozer architecture.
http://techreport.com/articles.x/19514
Expect Intel to try something similar when they roll down to 22nm tech (it's too late for their 32nm tech; they are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead)
Quote from: Rockoon September 07, 2010, at 05:55:47 AMthey are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead
Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?
Gunther
Quote from: dedndave on September 07, 2010, 02:32:03 AM
it may report 2 logical cores, but only 1 physical core
that is what HTT is - and it only applies to Intel chips
if the HTT bit is set on Intel CPU's, the number of logical cores is 2X the number of physical cores
for other manufacturers, the bit may have other meanings or no meaning at all
Dave, I have read many times that the reported HTT support can be a lie. My BIOS has HTT support, but it doesn't show the option, because the BIOS knows the CPU doesn't support HTT, even though the CPU says otherwise.
Here is the EDX result after CPUID with EAX=1 for my CPU: BFEBFBFFh.
Binary form:
1011 1111 1110 1011 1111 1011 1111 1111
Bit 28 (the HTT flag) says the CPU has HTT, but this is NOT true.
The method I gave reports the logical/physical core count. If the CPUID HTT bit says the CPU supports HTT but the core count is 1, that is really funny :) So this CPU doesn't have HTT.
In my view, HTT is largely commercial advertisement: "Hey, our CPU has 2 cores!"... But they are logical (i.e. virtual) cores and use the same execution units of one physical core.
About 4 years ago I saw a true 2-core Prescott LGA 775. It ate 120 watts and ran really HOT...
Anyone can say that 2 VIRTUAL CPUs are nice, but it's funny :)
EDIT: I suggest treating the HTT bit as "this CPU architecture can support HTT..." but, with only 1 logical/physical core, "... we economized on its implementation" :)
Alex
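Putting Alex's two observations together - the HTT feature flag and the logical-processor count - a minimal check might look like the sketch below. The bit and field positions are taken from the CPUID documentation; the label name is only illustrative, this is not code from the test bed:
mov eax, 1
cpuid
bt edx, 28            ; EDX bit 28 = HTT feature flag
jnc no_htt            ; flag clear -> certainly no HTT
shr ebx, 16
and ebx, 0FFh         ; EBX[23:16] = logical processors per physical package
cmp ebx, 1
jbe no_htt            ; flag set, but only one logical CPU -> no real HTT
; more than one logical processor per package - HTT (or multi-core) is present
no_htt: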
Quote from: Gunther on September 07, 2010, 10:32:23 PM
Quote from: Rockoon September 07, 2010, at 05:55:47 AMthey are about to roll out Sandy Bridge and have packed the silicon with 256-bit floating point execution units instead
Is that the AVX instruction set? We already have the specification, but at present no processor supports it. When will this part of Sandy Bridge arrive?
Gunther
Sandy Bridge arrives.. maybe in December or January.
Bulldozer arrives a couple months after that.
It's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process
Quote from: RockoonIt's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process
Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?
Gunther
Quote from: Gunther on September 08, 2010, 01:12:20 AM
Quote from: RockoonIt's really looking like AMD will take the lead on per-clock integer performance, while Intel wins on per-clock floating point performance and maybe destroys nVidia's integrated GPU division in the process
Interesting, but seems logical. Intel chips have u- and v-integer pipes, while AMD provides 3 pipes. On the other hand, Intel brings better performance (mostly) in floating point math. What's more important? It depends. Will Bulldozer support AVX?
Gunther
AVX, AES, SSE4.1 and SSE4.2
Not sure about SSSE3
Even though it will support AVX, it wont be using a 256-bit FPU unit.
Quote from: Rockoon September 08, 2010, at 02:47:03 AMNot sure about SSSE3
I hope that's not the point, because SSE3 is available nowadays on AMD chips.
Quote from: Rockoon September 08, 2010, at 02:47:03 AMEven though it will support AVX, it wont be using a 256-bit FPU unit.
Does that work? AVX is based on the new YMM registers, which are 256 bits wide. We'll see.
Gunther
Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AMNot sure about SSSE3
I hope that's not the point, because SSE3 is available nowadays on AMD chips.
SSSE3 is not the same as SSE3.
Quote from: Gunther on September 08, 2010, 01:56:37 AM
Quote from: Rockoon September 08, 2010, at 02:47:03 AMEven though it will support AVX, it wont be using a 256-bit FPU unit.
That works? AVX based on the new YMM registers, which are 256 bit wide. We'll see.
Yes, it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPUs (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to implement SSE with only 64-bit FP units.
Quote from: Rockoon, September 08, 2010, at 04:10:59 AMSSSE3 is not the same as SSE3.
Yes, I've overlooked the 3rd S.
Quote from: Rockoon, September 08, 2010, at 04:10:59 AMYes it works. The CPU will simply break up 256-bit ops into two 128-bit ops, just as many CPU's (from both Intel and AMD) broke up 128-bit ops into two 64-bit ops to accomplish SSE using only 64-bit FPU's
In that sense, the speed advantage for floating point operations is on Intel's side.
Gunther
Quote from: jj2007 on September 03, 2010, 12:28:53 AM
OK. Here is DotPro18 with code sizes added.
78 bytes for DotXMM1Acc4E
278 bytes for DotXMM1Acc4EJ1
266 bytes for DotXMM1Acc4EJ2
60 bytes for AxDotXMM1_fastcall
120 bytes for DotXMM2Acc16ELingo
183 bytes for DotXMM2Acc32ELingo
129 bytes for DotXMM2Acc16EPaul
Alex, have you tried unrolling a little bit?
No, I had not tried it at that time. But after a big delay I found some time to do it.
The code size is still the smallest (~106 bytes) among the fast routines, and the speed is satisfactory.
Simple unrolling, with interleaving of the execution units used.
Probably on much better, more modern CPUs Paul's code is the best, because it accesses contiguous memory locations, but on old CPUs issuing many identical instructions (i.e. hitting the same execution units) gains nothing.
I have also changed the calling convention (stdcall now).
Please test this, anybody who reads this post (attached archive).
This is my timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
3107 cycles for DotXMM1Acc4E
2864 cycles for DotXMM1Acc4EJ1
2804 cycles for DotXMM1Acc4EJ2
1860 cycles for AxDotXMM1
2246 cycles for DotXMM2Acc16ELingo
1879 cycles for DotXMM2Acc32ELingo
1907 cycles for DotXMM2Acc16EPaul
2965 cycles for DotXMM1Acc4E
2822 cycles for DotXMM1Acc4EJ1
2827 cycles for DotXMM1Acc4EJ2
1818 cycles for AxDotXMM1
2220 cycles for DotXMM2Acc16ELingo
1919 cycles for DotXMM2Acc32ELingo
1818 cycles for DotXMM2Acc16EPaul
2936 cycles for DotXMM1Acc4E
2920 cycles for DotXMM1Acc4EJ1
2818 cycles for DotXMM1Acc4EJ2
1852 cycles for AxDotXMM1
2256 cycles for DotXMM2Acc16ELingo
1898 cycles for DotXMM2Acc32ELingo
1832 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
Alex
AMD Phenom(tm) II X4 945 Processor (SSE3)
2214 cycles for DotXMM1Acc4E
2153 cycles for DotXMM1Acc4EJ1
2164 cycles for DotXMM1Acc4EJ2
913 cycles for AxDotXMM1
1211 cycles for DotXMM2Acc16ELingo
1194 cycles for DotXMM2Acc32ELingo
783 cycles for DotXMM2Acc16EPaul
2195 cycles for DotXMM1Acc4E
2108 cycles for DotXMM1Acc4EJ1
2177 cycles for DotXMM1Acc4EJ2
914 cycles for AxDotXMM1
1209 cycles for DotXMM2Acc16ELingo
1196 cycles for DotXMM2Acc32ELingo
815 cycles for DotXMM2Acc16EPaul
2200 cycles for DotXMM1Acc4E
2159 cycles for DotXMM1Acc4EJ1
2154 cycles for DotXMM1Acc4EJ2
922 cycles for AxDotXMM1
1197 cycles for DotXMM2Acc16ELingo
1189 cycles for DotXMM2Acc32ELingo
805 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz (SSE4)
3080 cycles for DotXMM1Acc4E
2867 cycles for DotXMM1Acc4EJ1
2874 cycles for DotXMM1Acc4EJ2
1930 cycles for AxDotXMM1
1925 cycles for DotXMM2Acc16ELingo
1914 cycles for DotXMM2Acc32ELingo
1363 cycles for DotXMM2Acc16EPaul
1575 cycles for DotXMM1Acc4E
1557 cycles for DotXMM1Acc4EJ1
1569 cycles for DotXMM1Acc4EJ2
1055 cycles for AxDotXMM1
1057 cycles for DotXMM2Acc16ELingo
1049 cycles for DotXMM2Acc32ELingo
1063 cycles for DotXMM2Acc16EPaul
1583 cycles for DotXMM1Acc4E
1560 cycles for DotXMM1Acc4EJ1
1556 cycles for DotXMM1Acc4EJ2
1062 cycles for AxDotXMM1
1054 cycles for DotXMM2Acc16ELingo
1038 cycles for DotXMM2Acc32ELingo
1055 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
-r
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1603 cycles for DotXMM1Acc4E
1628 cycles for DotXMM1Acc4EJ1
1588 cycles for DotXMM1Acc4EJ2
1077 cycles for AxDotXMM1
1083 cycles for DotXMM2Acc16ELingo
1059 cycles for DotXMM2Acc32ELingo
1076 cycles for DotXMM2Acc16EPaul
1599 cycles for DotXMM1Acc4E
1593 cycles for DotXMM1Acc4EJ1
1592 cycles for DotXMM1Acc4EJ2
1072 cycles for AxDotXMM1
1071 cycles for DotXMM2Acc16ELingo
1063 cycles for DotXMM2Acc32ELingo
1083 cycles for DotXMM2Acc16EPaul
1598 cycles for DotXMM1Acc4E
1589 cycles for DotXMM1Acc4EJ1
1558 cycles for DotXMM1Acc4EJ2
1077 cycles for AxDotXMM1
1071 cycles for DotXMM2Acc16ELingo
1060 cycles for DotXMM2Acc32ELingo
1054 cycles for DotXMM2Acc16EPaul
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
The result: 1328212656
--- done ---
Quote from: frktons on September 29, 2010, 12:55:17 AM
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
1603 cycles for DotXMM1Acc4E
This one is twice as fast on a CPU that is almost exactly the same as mine; the only differences being twice as much cache and a faster FSB (4 MB vs 2 MB, 1066 vs 800 MHz). Can cache really have that much of an effect on such small, isolated code? Yikes.
-r
Quote from: redskull on September 29, 2010, 01:59:28 AM
This one is twice as fast on a CPU that is almost exactly the same as mine; the only differences being twice as much cache and a faster FSB (4 MB vs 2 MB, 1066 vs 800 MHz). Can cache really have that much of an effect on such small, isolated code? Yikes.
Yes. Because the code tests a very small piece of data, the cache parameters have a drastic effect - if the data is small compared with the cache, it all stays in the cache, and a bigger cache can hold a much bigger piece of data/code.
If the test were run as Hutch suggests, the cache size would matter much less, but the system bus speed would.
What does the word "Yikes" mean? I don't know it - is it slang? Can you tell me its sense? My English is not very good :)
Alex
"Yikes"
"Informal an expression of surprise, fear, or alarm"
http://www.thefreedictionary.com/yikes
Used in popular culture such as Scooby Doo :bg
Quote from: oex on September 29, 2010, 11:05:29 PM
"Yikes"
"Informal an expression of surprise, fear, or alarm"
http://www.thefreedictionary.com/yikes
Used in popular culture such as Scooby Doo :bg
Thanks - for link and explanation! :bg
Alex
Alex,
here are the timings from my machine.
AMD Athlon(tm) 64 X2 Dual-Core Processor TK-57 (SSE3)
2297 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2240 cycles for DotXMM1Acc4EJ2
1502 cycles for AxDotXMM1
1425 cycles for DotXMM2Acc16ELingo
1362 cycles for DotXMM2Acc32ELingo
1633 cycles for DotXMM2Acc16EPaul
2289 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2240 cycles for DotXMM1Acc4EJ2
1499 cycles for AxDotXMM1
1437 cycles for DotXMM2Acc16ELingo
1360 cycles for DotXMM2Acc32ELingo
1637 cycles for DotXMM2Acc16EPaul
2288 cycles for DotXMM1Acc4E
2277 cycles for DotXMM1Acc4EJ1
2242 cycles for DotXMM1Acc4EJ2
1507 cycles for AxDotXMM1
1423 cycles for DotXMM2Acc16ELingo
1359 cycles for DotXMM2Acc32ELingo
1640 cycles for DotXMM2Acc16EPaul
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
The result: 2867507200
--- done ---
Gunther
Good ol' P4:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
2930 cycles for DotXMM1Acc4E
2741 cycles for DotXMM1Acc4EJ1
2902 cycles for DotXMM1Acc4EJ2
1788 cycles for AxDotXMM1
2180 cycles for DotXMM2Acc16ELingo
2035 cycles for DotXMM2Acc32ELingo
1806 cycles for DotXMM2Acc16EPaul
2861 cycles for DotXMM1Acc4E
2715 cycles for DotXMM1Acc4EJ1
2756 cycles for DotXMM1Acc4EJ2
3149 cycles for AxDotXMM1
2150 cycles for DotXMM2Acc16ELingo
2024 cycles for DotXMM2Acc32ELingo
1826 cycles for DotXMM2Acc16EPaul
3286 cycles for DotXMM1Acc4E
2934 cycles for DotXMM1Acc4EJ1
2962 cycles for DotXMM1Acc4EJ2
2020 cycles for AxDotXMM1
2156 cycles for DotXMM2Acc16ELingo
1968 cycles for DotXMM2Acc32ELingo
1762 cycles for DotXMM2Acc16EPaul