Suggestions and improvements for SSE2 code are welcome

Started by Gunther, August 26, 2010, 05:20:06 PM


Gunther

Quote from: lingo on August 31, 2010, 12:46:00 AM
I'm sure my algo will run faster on your CPU.

Thank you for your reply. Of course, I'll test your code and let you know the results.

Quote from: dioxin on August 30, 2010, 11:42:29 PM
I can't imagine why it would slow down on your CPU. The work being done is the same, but there should be fewer register stalls.

Yes, of course, but there is a limitation: the throughput of the floating-point adder on older AMD chips is only 64 bits. Your Phenom, like an Intel Core2 processor, has a 128-bit-wide floating-point adder that can handle a whole vector in one operation. That makes the difference.

But never mind. In the original program, which is to be optimized, I'll insert two code paths: one for older AMD chips, the other for Intel and the fancy new Phenom. Which path to take can be detected at run time with CPUID.

Gunther
Forgive your enemies, but never forget their names.

jj2007

Quote from: dioxin on August 30, 2010, 10:42:29 PM
jj2007 & dedndave,
Your version of the loop accesses only 25% of the memory it should.
Both loops run 128 times, but your loop increments by 16 bytes instead of 64. Note that in my code the counter increments by 16 but is always referenced as 4*ecx, not ecx alone.

Paul.

That's correct, sorry for my sloppiness :thumbu
Modified code attached. I disabled Dave's code because it uses a faster "fill code", which makes the results not comparable.
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
1853    cycles for neg ecx, TheCt=128
1839    cycles for cmp edx, ecx, TheCt=128

1842    cycles for neg ecx, TheCt=128
1836    cycles for cmp edx, ecx, TheCt=128

1837    cycles for neg ecx, TheCt=128
1838    cycles for cmp edx, ecx, TheCt=128

Rockoon

AMD Phenom(tm) II X6 1055T Processor (SSE3)
659     cycles for neg ecx, TheCt=128
919     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
919     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

659     cycles for neg ecx, TheCt=128
918     cycles for cmp edx, ecx, TheCt=128

When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
914     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128

913     cycles for neg ecx, TheCt=128
828     cycles for cmp edx, ecx, TheCt=128

912     cycles for neg ecx, TheCt=128
827     cycles for cmp edx, ecx, TheCt=128


--- ok ---

jj2007

Very different indeed. Let's test with another "fill code":
Quote
            neg ecx
            and TheCt, 0
            align 16
      @@:
            mov ebx, [eax+ecx*4]
            inc TheCt
            mov [edx+ecx*4], ebx
            add ecx, 4
            js short @B
Quote
            lea ecx, [edx+ecx]
            and TheCt, 0
            sub eax, edx
            align 16
      @@:
            mov ebx, [eax+edx]
            inc TheCt
            mov [edx], ebx
            add edx, 16
            cmp edx, ecx
            jb short @B

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
997     cycles for neg ecx, TheCt=128
986     cycles for cmp edx, TheCt=128

979     cycles for neg ecx, TheCt=128
988     cycles for cmp edx, TheCt=128

Gunther

Quote from: lingo on August 31, 2010, 12:46:00 AM
Gunther,
I'm sure my algo will run faster on your CPU.
So, please, try it with your testbed...   :U

At first glance, it should be faster, because you've unrolled the loop and your code now processes 32 elements per iteration. In practice, it takes the same time as DotXMM2Acc16E on my machine (Win32 and Linux). That's a bit surprising.

I'll update the archive. Your procedure will be renamed to DotXMM2Acc32ELingo (2 accumulators, 32 elements per loop cycle) - and of course I'll give credit, that's clear. I'll post a message after updating the archive.

Gunther

jj2007

With a modified testbed. J1 is the short loop; J2 uses more xmm registers, as shown below:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
3566    cycles for DotXMM1Acc4E
2739    cycles for DotXMM1Acc4EJ1
2821    cycles for DotXMM1Acc4EJ2

2782    cycles for DotXMM1Acc4E
2714    cycles for DotXMM1Acc4EJ1
2806    cycles for DotXMM1Acc4EJ2

2784    cycles for DotXMM1Acc4E
2761    cycles for DotXMM1Acc4EJ1
2844    cycles for DotXMM1Acc4EJ2


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
2548    cycles for DotXMM1Acc4E
2165    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2

2546    cycles for DotXMM1Acc4E
2165    cycles for DotXMM1Acc4EJ1
2073    cycles for DotXMM1Acc4EJ2

      @@:
      REPEAT repct/4              ; unrolling helps, but make sure that the count is divisible by the rep count!
            movaps xmm0, [eax+edx]
            movaps xmm1, [eax+edx+16]
            movaps xmm2, [eax+edx+32]
            movaps xmm3, [eax+edx+48]
            mulps xmm0, [edx]
            mulps xmm1, [edx+16]
            mulps xmm2, [edx+32]
            mulps xmm3, [edx+48]
            lea edx, [edx+64]
            addps xmm7, xmm0      ; sum up
            addps xmm7, xmm1
            addps xmm7, xmm2
            addps xmm7, xmm3
      ENDM

      if UseCmp
            cmp edx, ecx          ; faster, at least on Celeron M
            jb @B
      else
            sub ecx, 16*repct     ; count down
            jg @B
      endif

lingo

"Dave's algo does not use the same "fill code", so it's not comparable"

JJ's code is not comparable either... :(
or: how to mislead people with flawed testing:

JJ's original code:
1st algo:
push ebx
mov ecx, 2048/4                   ; 1st lame error here; Why 2048?  Why /4?
mov eax, offset Src             ; eax+2048 is 1 dword after the end of Src!
mov edx, offset Dest           ; edx+2048 is 1 dword after the end of Dest! 
lea eax, [eax+4*ecx]            ;  Why 4*ecx? Wow, coz we have /4 above!!!
lea edx, [edx+4*ecx]            ; the same stupidity again!!! 
neg ecx
and TheCt, 0
align 16
@@:
mov ebx, [eax+ecx*4]           ; multiply ecx by 4 + sum of 2registers and read
inc TheCt
mov [edx+ecx*4], ebx           ; multiply ecx by 4 + sum of 2registers and WRITE !!!
add ecx, 4
js short @B
pop ebx

and the loop of the 2nd algo:
@@:
mov ebx, [eax+edx] ;sum of 2registers and read
inc TheCt
mov [edx], ebx                        ; just one register and WRITE !!!
add edx, 16
cmp edx, ecx
jb short @B

As you see the loops are not comparable!

To be comparable:
just change the 1st algo to:
push ebx
mov ecx, 2047
and TheCt, 0
lea   eax, [Src+ecx]
lea   edx, [Dest+ecx]
neg ecx
align 16
@@:
mov ebx, [eax+ecx]            ;sum of 2registers and read
inc TheCt
mov ebx, [edx+ecx]            ;sum of 2registers and read   
add ecx, 16
jle short @B
pop ebx


and change the 2nd algo too to:
push ebx
mov ecx, 2047
mov eax, offset Src
mov edx, offset Dest
lea ecx, [edx+ecx]
and TheCt, 0
sub eax, edx
align 16
@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read 
add edx, 16
cmp edx, ecx
jbe short @B
pop ebx

and voila, we have similar results ...

Intel(R) Core(TM)2 Duo CPU     E8500  @ 3.16GHz (SSE4)
778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

778     cycles for neg ecx, TheCt=128
779     cycles for cmp edx, TheCt=128

--- ok ---

jj2007

Good job, Lingo :U
Unfortunately I had little time for this, so I am grateful that you took over. Now if you find some spare time, please try to speed up the DotXMM1Acc4EJ2 posted above.

lingo

"same time as DotXMM2Acc16E on my machine (Win32 and Linux)"

On my PC I have more: Win7, Vista, XP and Mac OS Snow Leopard,
but I can't understand the relation between the OS and speed optimization. :wink

Gunther

Quote from: lingo on August 31, 2010, 06:33:16 PM
but I can't understand the relation between OS and speed optimization

I mentioned it only because the application is compiled with two different compiler versions:

  • Win32: gcc 3.4.5 (MinGW)
  • Linux: gcc 4.2.4 (not gcc 4.1, the worst-performing gcc for x86 machines)

That makes a small difference under both operating systems.

Gunther

KeepingRealBusy

Quote from: lingo on August 31, 2010, 03:59:51 PM
[lingo's full post, quoted in its entirety above; his reworked "comparable" version of the 2nd algo reads:]
@@:
mov ebx, [eax+edx]                 ;sum of 2 registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read
add edx, 16
cmp edx, ecx
jbe short @B


Lingo,

I'm sorry, but that code just won't work. You are not copying the data from src to dest; you are just loading two values into ebx.

Also, wouldn't this be better? Instead of this:


@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
inc TheCt
mov ebx, [edx]                         ;just one register and read      (double load)
add edx, 16
cmp edx, ecx
jbe short @B


Use this:


@@:
mov ebx, [eax+edx]                 ;sum of 2registers and read
add edx, 16
inc TheCt
mov [edx-16], ebx                    ;just one register and WRITE  (also fixes the double load)
cmp edx, ecx
jbe short @B


Dave.

KeepingRealBusy

Lingo,

I have lost track of what is being done here (just got back from a trip), but the "add edx, 16" doesn't seem right, unless you are just initializing a single dword in each 16-byte array entry.

Shouldn't that be:


add edx, 4
inc TheCt
mov [edx-4], ebx                    ;just one register and WRITE  (also fixes the double load)


Dave.

lingo

"You are not copying the data from src to dest"
Why to do this? :lol
"I have lost track of what is being done here"
It is jj's example so will be better to ask him.. :wink

jj2007

Quote from: KeepingRealBusy on August 31, 2010, 07:33:45 PM
I have lost track of what is being done here

Hi Dave,
There are two parallel testing efforts here: the "real" one on speeding up the dot product, and a side track on loop optimisation with negative offsets. The latter is pretty useless, and I feel guilty about it. For those interested, see Agner Fog's optimizing_assembly.pdf, pages 89-90:
Quote
It is possible to modify example 12.4a to make it count down rather than up, but the data cache is optimized for accessing data forwards, not backwards. Therefore it is better to count up through negative values from -n to zero. This is possible by making a pointer to the end of the array and using a negative offset from the end of the array:

; Example 12.4b. For-loop with negative index from end of array
    mov  ecx, n            ; Load n
    lea  esi, Array[4*ecx] ; Point to end of array
    neg  ecx               ; i = -n
    jnl  LoopEnd           ; Skip if (-n) >= 0
LoopTop:
    ; Loop body: Add 1 to all elements in Array:
    add  dword ptr [esi+4*ecx], 1
    add  ecx, 1            ; i++
    js   LoopTop           ; Loop back if i < 0
LoopEnd:

A slightly different solution is to multiply n by 4 and count from -4*n to zero:

; Example 12.4c. For-loop with neg. index multiplied by element size
    mov  ecx, n            ; Load n
    shl  ecx, 2            ; n * 4
    jng  LoopEnd           ; Skip if (4*n) <= 0
    lea  esi, Array[ecx]   ; Point to end of array
    neg  ecx               ; i = -4*n
LoopTop:
    ; Loop body: Add 1 to all elements in Array:
    add  dword ptr [esi+ecx], 1
    add  ecx, 4            ; i += 4
    js   LoopTop           ; Loop back if i < 0
LoopEnd:

There is no difference in speed between example 12.4b and 12.4c, but the latter method is
useful if the size of the array elements is not 1, 2, 4 or 8 so that we cannot use the scaled
index addressing.