SSE2 replacement of ALU registers for 384 bit unsigned integers

Started by dsouza123, June 04, 2006, 05:09:23 AM


dsouza123

Doing unsigned integer math operations on arrays of 12 unsigned dwords (384 bits),
such as addition, subtraction, and shifts (multiplying/dividing by powers of 2),
using ALU registers EAX, EBX etc. is straightforward.

Doing the same using SSE2 instructions could in theory increase performance,
because 128 bits, equivalent to 4 dwords, can be operated on at a time instead of only 32-bit chunks.

The following is addition using ALU instructions, then an incorrect first attempt with SSE2.
With SSE2, carries aren't propagated past 64-bit chunks, so something has to be done
to compensate for that.
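The missing carry propagation can be modeled outside assembly. The following is a hypothetical Python sketch that treats the two 64-bit halves of an XMM register as independent lanes, as PADDQ does (the name `paddq_model` is invented for illustration):

```python
MASK64 = (1 << 64) - 1

def paddq_model(a_lanes, b_lanes):
    """Model PADDQ: add corresponding 64-bit lanes, discarding carries."""
    return [(a + b) & MASK64 for a, b in zip(a_lanes, b_lanes)]

# A 128-bit value split into [low, high] 64-bit lanes.
a = [0xFFFFFFFFFFFFFFFF, 0x0]   # low lane all ones, high lane zero
b = [0x1, 0x0]

lanes = paddq_model(a, b)
# The true 128-bit sum would be 1 << 64, but the lane add wraps to zero
# and the carry never reaches the high lane.
print(lanes)   # [0, 0] -- carry into the high lane was lost
```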

The logical operations AND, OR, XOR, ANDNOT are available.
NOT can be simulated by subtracting each 64-bit lane from a register with all bits set;
the byte-value equivalent of NOT x is 255 - x.
Bit set/clear/test operations can be simulated with the logical operations AND, OR.
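The subtraction trick can be checked numerically. A Python sketch (function name invented here) of NOT-via-subtraction on a 64-bit lane, relying on the fact that all-ones minus x can never borrow:

```python
MASK64 = (1 << 64) - 1

def not_via_sub(x):
    """Bitwise NOT of a 64-bit lane via subtraction from all bits set,
    as a PSUBQ from a register of all ones would compute."""
    return MASK64 - x   # never borrows, since 0 <= x <= MASK64

# Matches the true bitwise NOT for every lane value.
for x in (0, 1, 0xDEADBEEF, MASK64):
    assert not_via_sub(x) == (~x) & MASK64

# Byte analogue from the post: NOT x == 255 - x.
assert 255 - 0b10001000 == (~0b10001000) & 0xFF
print("ok")
```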


.data
ALIGN 16
   a      dd  3239879239, 1309340349, 2349087349, 4020030040
          dd   239879239, 2023987439, 1234567890,   23978239
          dd  1245409999, 3340340309,  343498392, 3020030040

   b      dd  1340983534, 3295798324, 3843298739, 1897349923
          dd  2398721309, 1347204809, 4085034850, 2409872134
          dd  2594597492, 2398792379, 1398349830, 4139823991

   c      dd  3239879239, 1309340349, 2349087349, 4020030040
          dd   239879239, 2023987439, 1234567890,   23978239
          dd  1245409999, 3340340309,  343498392, 3020030040

   d      dd  1340983534, 3295798324, 3843298739, 1897349923
          dd  2398721309, 1347204809, 4085034850, 2409872134
          dd  2594597492, 2398792379, 1398349830, 4139823991

   e      dd  12 dup (0)  ;  384 bit unsigned values

   f      dd  12 dup (0)

.code
start:
   ;-------- 384 bit unsigned add  using ALU registers in dword chunks
   clc
   mov eax, [c +  0]
   mov ebx, [c +  4]
   mov ecx, [c +  8]
   mov edx, [c + 12]

   adc [d +  0], eax
   adc [d +  4], ebx
   adc [d +  8], ecx
   adc [d + 12], edx

   mov eax, [c + 16]
   mov ebx, [c + 20]
   mov ecx, [c + 24]
   mov edx, [c + 28]

   adc [d + 16], eax
   adc [d + 20], ebx
   adc [d + 24], ecx
   adc [d + 28], edx

   mov eax, [c + 32]
   mov ebx, [c + 36]
   mov ecx, [c + 40]
   mov edx, [c + 44]

   adc [d + 32], eax
   adc [d + 36], ebx
   adc [d + 40], ecx
   adc [d + 44], edx

   ;-------- 384 bit unsigned add with 128 bit oword chunks
   ;                 first attempt gives obviously INCORRECT results

   movdqa xmm0, oword ptr [a +  0]
   movdqa xmm1, oword ptr [a + 16]
   movdqa xmm2, oword ptr [a + 32]

   paddq  xmm0, oword ptr [b +  0]
   paddq  xmm1, oword ptr [b + 16]
   paddq  xmm2, oword ptr [b + 32]

   movdqa oword ptr [a +  0], xmm0
   movdqa oword ptr [a + 16], xmm1
   movdqa oword ptr [a + 32], xmm2
end start


Adding a value with only bit 0 and/or bit 64 set (out of bits 0 to 127)
would inject the carry or carries; determining which of the 3 possible versions
to add is still an issue.

Any ideas/code for doing any of the math operations ?

daydreamer

perform several 384-bit operations in parallel, instead of trying to do one 384-bit value serially, because these instructions are only made with parallel execution in mind
for example, work with two 64-bit values at a time, maybe unroll it to perform 4 x 64 bits at a time, so all execution units work in parallel
but that requires restructuring your data to be efficient

for addition, use a packed compare-less-than, use the result to mask a 1,1,1,1,... constant, and do a second padd to add this carry,
once for each carry operation that is needed
each 1 emulates a carry bit
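daydreamer's compare-and-add scheme can be modeled in scalar Python (a sketch with invented names; note that SSE2's packed compares are signed, so a real vector version would need the usual sign-bias trick to get an unsigned compare):

```python
MASK64 = (1 << 64) - 1

def add_with_emulated_carry(a_limbs, b_limbs):
    """Add little-endian 64-bit limbs; detect each carry with an
    unsigned wrap test and feed a 1 into the next limb, the way a
    compare-generated mask plus a second padd would."""
    out, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry
        out.append(total & MASK64)
        carry = 1 if total > MASK64 else 0   # the emulated carry bit
    return out

# 128-bit example: all-ones low limb + 1 propagates into the high limb.
print(add_with_emulated_carry([MASK64, 0], [1, 0]))   # [0, 1]
```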

doubt it will be faster with so much xtra code to emulate carry operations
maybe if you mix alu and SSE2, it helps because they perform in different execution units ?



dsouza123

After trying various methods, this is the most compact and least complicated.
I believe this will add two 384-bit unsigned numbers with SSE2, but it is untested.

Is there another way to move bit 127 to bit 0 within an xmm register, other than by a 126-bit right shift?
Example: if it were a byte instead of an oword, 1000 1000 to 0000 0001.
Remember the lower qword may have its high bit set, which is dealt with by the 126-bit right shift.

I could extract the high dword or high byte, ROL it, then place it in the low dword or low byte of a zeroed 128-bit temp,
but I am trying to use only SSE2 (or SSE if appropriate) opcodes.


.data
ALIGN 16
   a      dd  3239879239, 1309340349, 2349087349, 4020030040
          dd   239879239, 2023987439, 1234567890,   23978239
          dd  1245409999, 3340340309,  343498392, 3020030040

   b      dd  1340983534, 3295798324, 3843298739, 1897349923
          dd  2398721309, 1347204809, 4085034850, 2409872134
          dd  2594597492, 2398792379, 1398349830,  139823991  ; took off leading 4 in high dword

   c      dd  3239879239, 1309340349, 2349087349, 4020030040
          dd   239879239, 2023987439, 1234567890,   23978239
          dd  1245409999, 3340340309,  343498392, 3020030040

   d      dd  1340983534, 3295798324, 3843298739, 1897349923
          dd  2398721309, 1347204809, 4085034850, 2409872134
          dd  2594597492, 2398792379, 1398349830,  139823991  ; took off leading 4 in high dword

   g      dd           0, 2147483648,          0, 2147483648  ; bit 63 of each qword set

.code
start:
; conventional ALU add of two 384 bit unsigned integers, a dword pair at a time
   clc
   mov eax, [d +  0]
   mov ebx, [d +  4]
   mov ecx, [d +  8]
   mov edx, [d + 12]

   adc [c +  0], eax
   adc [c +  4], ebx
   adc [c +  8], ecx
   adc [c + 12], edx

   mov eax, [d + 16]
   mov ebx, [d + 20]
   mov ecx, [d + 24]
   mov edx, [d + 28]

   adc [c + 16], eax
   adc [c + 20], ebx
   adc [c + 24], ecx
   adc [c + 28], edx

   mov eax, [d + 32]
   mov ebx, [d + 36]
   mov ecx, [d + 40]
   mov edx, [d + 44]

   adc [c + 32], eax
   adc [c + 36], ebx
   adc [c + 40], ecx
   adc [c + 44], edx

; add the two 384 bit values; carries between qwords aren't done
; make xmm register copies of a because it gets overwritten but
; is needed later in the ANDs to calculate the qword carries

   movdqa xmm0, oword ptr [a +  0]   ; a
   movdqa xmm1, oword ptr [a + 16]
   movdqa xmm2, oword ptr [a + 32]

   movdqa xmm4, oword ptr [b +  0]   ; b
   movdqa xmm5, oword ptr [b + 16]
   movdqa xmm6, oword ptr [b + 32]

   movdqa xmm7, xmm0                 ; a = a + b (paddq needs a register destination)
   paddq  xmm7, xmm4
   movdqa oword ptr [a +  0], xmm7

   movdqa xmm7, xmm1
   paddq  xmm7, xmm5
   movdqa oword ptr [a + 16], xmm7

   movdqa xmm7, xmm2
   paddq  xmm7, xmm6
   movdqa oword ptr [a + 32], xmm7

   movdqa xmm3, oword ptr g          ; bit 63 of each qword set, all other bits 0

   pand   xmm0, xmm4   ; a = a and b  (pand: integer-domain AND)
   pand   xmm1, xmm5
   pand   xmm2, xmm6

   pand   xmm0, xmm3   ; keep only bit 63 of each qword
   pand   xmm1, xmm3
   pand   xmm2, xmm3

; shl 384 bits, 1  will move the computed carries from all qwords to the correct spots

   movdqa xmm7, xmm1   ; copy 1
   psrldq xmm7, 126    ; get bit 127, put in bit 0

   pslldq xmm2, 1      ; shl 2 by 1
   paddq  xmm2, xmm7   ; add bit 0 from 1

   movdqa xmm7, xmm0   ; copy 0
   psrldq xmm7, 126    ; get bit 127, put in bit 0

   pslldq xmm1, 1      ; shl 1 by 1
   paddq  xmm1, xmm7   ; add bit 0

   pslldq xmm0, 1      ; shl 0 by 1

; add the shifted qword carries to previous uncarried add of the two 384 bit values

   movdqa xmm7, oword ptr [a + 32]   ; paddq needs a register destination
   paddq  xmm7, xmm2
   movdqa oword ptr [a + 32], xmm7

   movdqa xmm7, oword ptr [a + 16]
   paddq  xmm7, xmm1
   movdqa oword ptr [a + 16], xmm7

   movdqa xmm7, oword ptr [a +  0]
   paddq  xmm7, xmm0
   movdqa oword ptr [a +  0], xmm7


a should equal c, if the SSE2 code works correctly.
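Whether the AND-only carry detection suffices can be tested with a scalar model (a hypothetical Python sketch of the method over one 128-bit chunk; names invented). It reproduces the true sum when both bit-63s are set, but a carry generated lower in the qword slips through:

```python
MASK64 = (1 << 64) - 1
BIT63  = 1 << 63

def and_carry_add(a, b):
    """Model the posted method on a 128-bit value held as
    [low_qword, high_qword]: lane add, then AND-detected carries."""
    s = [(x + y) & MASK64 for x, y in zip(a, b)]        # paddq
    carries = [(x & y) & BIT63 for x, y in zip(a, b)]   # AND, mask bit 63
    # shift the detected carry left one bit overall: the low lane's
    # bit 63 becomes a 1 in the high lane (bit 64 of the whole value)
    s[1] = (s[1] + (carries[0] >> 63)) & MASK64
    return s

print(and_carry_add([BIT63, 0], [BIT63, 0]))   # [0, 1] -- caught: both bit-63s set
print(and_carry_add([MASK64, 0], [1, 0]))      # [0, 0] -- carry from below bit 63 missed
```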

EduardoS

I guess it won't work...
PADDQ won't pass the carry from the lower part to the higher part, and if you add 1 to 111...1111 it won't pass the carry for the last bit...

I tried several ways to do it; my conclusion was that a complex SSE method handling every case will be slower than adc.

savage

Quote from: dsouza123
Is there another way to move bit 127 to bit 0, within a xmm register other than by a 126 bit right shift ?

Forgive me for not knowing more about the instructions, but what's wrong with the shift, exactly?



dsouza123

Found out the SSE2 full-register shifts (psrldq/pslldq) are by byte, not bit, so the maximum shift is 15 bytes.
The workaround, since it is a single bit being shifted: shift by bytes, double (a 1-bit left shift),
then shift by byte(s) in the opposite direction.


move high bit 127 to low bit 0
10000000 00000000... before any ops,  1 == bit 127,  the high bit in xmm register
00000000 10000000... after 1 byte shift right
00000001 00000000... after doubling
...00000000 00000001 after 15 byte shift right

move bit 63 to bit 64
00000000 00000000^10000000 before any ops   ^ == qword boundary
00000000 10000000^00000000 after 1 byte shift left
00000001 00000000^00000000 after doubling
00000000 00000001^00000000 after 1 byte shift right
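The byte-shift/double/byte-shift sequence can be traced with a small Python model of the three instructions (names invented; `paddq_self` doubles each qword lane independently, which is why the low qword's stray high bit cannot leak across the lane boundary):

```python
MASK64  = (1 << 64) - 1
MASK128 = (1 << 128) - 1
BIT63, BIT127 = 1 << 63, 1 << 127

def psrldq(x, n):  return (x >> (8 * n)) & MASK128   # byte shift right
def pslldq(x, n):  return (x << (8 * n)) & MASK128   # byte shift left

def paddq_self(x):
    """PADDQ xmm, xmm with itself: double each 64-bit lane
    independently -- a 1-bit left shift that cannot cross the
    qword boundary."""
    lo, hi = x & MASK64, x >> 64
    return (((hi + hi) & MASK64) << 64) | ((lo + lo) & MASK64)

# move bit 127 to bit 0, as in the first diagram
x = BIT127 | BIT63     # the low qword's high bit may also be set
x = psrldq(x, 1)       # bit 127 -> bit 119
x = paddq_self(x)      # bit 119 -> bit 120 (lane-local doubling)
x = psrldq(x, 15)      # bit 120 -> bit 0; the low-lane bit is discarded
print(hex(x))          # 0x1
```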


The following replaces the previous code chunk for moving the computed carries.


; shl 384 bits, 1  will move the computed carries from all qwords to the correct spots

  movdqa xmm7, xmm1   ; copy 1 -- get bit 127, put in bit 0
  psrldq xmm7, 1      ; right shift 1 byte
  paddq  xmm7, xmm7   ; double, effect of shift left 1 bit
  psrldq xmm7, 15     ; right shift 15 bytes

  pslldq xmm2, 1      ; left shift 2 by 1 byte
  paddq  xmm2, xmm2   ; double, effect of shift left 1 bit
  psrldq xmm2, 1
  paddq  xmm2, xmm7   ; add bit 0 from 1

  movdqa xmm7, xmm0   ; copy 0 -- get bit 127, put in bit 0
  psrldq xmm7, 1      ; right shift 1 byte
  paddq  xmm7, xmm7   ; double, effect of shift left 1 bit
  psrldq xmm7, 15     ; right shift 15 bytes

  pslldq xmm1, 1      ; left shift 1 by 1 byte
  paddq  xmm1, xmm1   ; double, effect of shift left 1 bit
  psrldq xmm1, 1
  paddq  xmm1, xmm7   ; add bit 0

  pslldq xmm0, 1      ; shift left 0 by 1 byte
  paddq  xmm0, xmm0   ; double, effect of shift left 1 bit
  psrldq xmm0, 1

dsouza123

The technique relies on the property that AND indicates a carry only if both bits are 1.
I just realized that, since 3 bits are being added, if either (but not both) of the original bits
at bit 63 is 1 and the carry from adding bits 62 is 1, the ADD will also cause a carry
into bit 64, across the qword boundary.

I had dealt with this in an earlier method: XOR the original bits,
then XOR that with bit 63 of the ADD result;
if the first XOR == 1 and bit 63 of the ADD == 0, a carry must have been generated.
OR that result with the AND of the original bits. This always gives the correct carry state
crossing the qword boundary. It will also have to be implemented for the pairs at bit 127.


1  1  carry in from adding bits 62
0  1  original bit 63
1  0  original bit 63
=^==
0  0  both columns generate a carry, i.e. 10 was the result of adding the three bits.

AND
0 0 1 1
0 1 0 1
= = = =
0 0 0 1

XOR
0 0 1 1
0 1 0 1
= = = =
0 1 1 0
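Combining the two tables gives the classic carry-out formula, which can be verified against true 64-bit addition in a scalar sketch (hypothetical Python): carry_out = (a AND b) OR ((a XOR b) AND NOT sum), taken at bit 63.

```python
MASK64 = (1 << 64) - 1
BIT63  = 1 << 63

def qword_carry_out(a, b):
    """Carry out of a 64-bit add, recovered from bit 63 of the two
    operands and of the wrapped sum:
    (a AND b) OR ((a XOR b) AND NOT sum)."""
    s = (a + b) & MASK64
    return (((a & b) | ((a ^ b) & ~s)) >> 63) & 1

# compare against the true carry for many random operand pairs
import random
random.seed(1)
for _ in range(10000):
    a, b = random.getrandbits(64), random.getrandbits(64)
    assert qword_carry_out(a, b) == ((a + b) >> 64)
print("ok")
```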

asmfan

I don't think there is a more convenient way to do addition/subtraction of operands of any size than the one described above:

xor  ecx, ecx
lea  esi, src
lea  edi, dst
clc
@@:
mov  eax, [esi + ecx*4]
adc  [edi + ecx*4], eax
inc  ecx
cmp  ecx, 12
jne  @B
Russia is a weird place

Ossa

Quote from: asmfan on June 13, 2006, 07:15:04 AM
xor  ecx, ecx
lea  esi, src
lea  edi, dst
clc
@@:
mov  eax, [esi + ecx*4]
adc  [edi + ecx*4], eax
inc  ecx
cmp  ecx, 12
jne  @B

asmfan, have I missed something, or does that have a bug?

The carry, so nicely preserved through almost every instruction, will be destroyed by the cmp instruction, making the whole thing useless.

Ossa

[edit] I think this should fix it though:

[edit(2)] eek... stupid mistake! [/edit(2)]

[edit(3)] OK, really got it this time... it was a bit fiddly to get the byte order right, but...

NUMBITS EQU 384

...

    mov  ecx, -(NUMBITS / 32)
    mov  esi, ((offset src) + (NUMBITS / 8))
    mov  edi, ((offset dest) + (NUMBITS / 8))
    clc
@@:
    mov  eax, [esi + ecx * 4]
    adc  [edi + ecx * 4], eax
    inc  ecx
    jnz  @B


[/edit(3)]

[/edit]
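The fixed loop's data flow can be mirrored in Python with an explicit variable standing in for the carry flag (a sketch with invented names, operating on little-endian 32-bit dword limbs as the MASM code does):

```python
MASK32  = (1 << 32) - 1
NUMBITS = 384

def big_add(dst, src):
    """Model the adc loop: dst += src over little-endian dwords,
    with a variable playing the role of CF (which inc/jnz leave
    untouched in the assembly version)."""
    cf = 0                              # clc
    for i in range(NUMBITS // 32):      # 12 dwords
        total = dst[i] + src[i] + cf    # adc [edi + ecx*4], eax
        dst[i] = total & MASK32
        cf = total >> 32
    return dst

dst = [MASK32] * 12                     # 384-bit all-ones
src = [1] + [0] * 11
print(big_add(dst, src))                # carry ripples through: all twelve dwords become 0
```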
Website (very old): ossa.the-wot.co.uk

asmfan

Ossa, well done man :) I really missed it (only inc/dec don't alter CF)... nice code
Regards,
asmfan

Ossa

heh, it's your code and idea really - a nice way to do it... I wouldn't have even tried to get away without altering the carry if you hadn't suggested it. (Reminds me of the SAS motto: "Who Dares Wins")

Ossa

stanhebben

I'd recommend keeping your original (ALU) code, which is unrolled nicely.

First idea wins imo  :wink

Stan

asmfan

Stan, but unrolling cannot be infinite... and looped code can be applied to REALLY big numbers ;) up to some GB... since it's universal

stanhebben

It depends on whether you want to stay with 384 bits. If you want to go somewhat higher than that, you could write a macro that unrolls the loop for you.

For larger numbers, you can unroll partially (and handle end cases separately, if needed). An unroll of 2 or 4 will speed up the code noticeably.
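The unrolling-macro idea can be sketched with a small generator (hypothetical Python emitting MASM-style lines; it interleaves each mov/adc pair rather than reproducing the grouped scheduling of the original listing):

```python
def emit_unrolled_add(numbits=384, src="c", dst="d"):
    """Emit MASM-style lines for an n-bit add, the way an unrolling
    macro would (label names 'c'/'d' follow the thread's data)."""
    lines = ["clc"]
    for i in range(0, numbits // 8, 4):       # one dword per step
        lines.append(f"mov  eax, [{src} + {i}]")
        lines.append(f"adc  [{dst} + {i}], eax")
    return lines

code = emit_unrolled_add()
print(len(code))    # 1 + 2*12 = 25 lines for 384 bits
print(code[1])      # mov  eax, [c + 0]
```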

Unroll by 2:

xor ecx, ecx
lea esi, nc
lea edi, nd
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx
inc ecx
cmp ecx, 6
jne @B


Unroll by 4:

xor ecx, ecx
lea esi, nc
lea edi, nd
clc
@@:
mov eax, [esi + ecx*8]
mov ebx, [esi + ecx*8+4]
adc [edi + ecx*8], eax
adc [edi + ecx*8+4], ebx

mov eax, [esi + ecx*8+8]
mov ebx, [esi + ecx*8+12]
adc [edi + ecx*8+8], eax
adc [edi + ecx*8+12], ebx

inc ecx
inc ecx
cmp ecx, 6
jne @B


AMD Athlon 64 3200+: (384 bit numbers)

16 cycles, original ALU code
45 cycles, asmfan's code
20 cycles, unrolled by 2
17 cycles, unrolled by 4


Stan


Ossa

Stan, for the same reason as I stated for asmfan's code earlier, your code will not work - see fix above.

Ossa