Fast bitwise swap?

jj2007 · November 17, 2009, 04:52:55 PM

Anybody aware of an algo or an instruction, even SSEx, that allows a fast bitwise swap? BSWAP does it bytewise.
XLAT or MOV AL,[BX+AL] is of course an option, but it would require a 256*256 table for a word register...

dedndave · November 17, 2009, 05:18:48 PM

i have wanted the same thing a few times, Jochen
for a word value, you could use a byte table twice and swap bytes
the only thought i have ever had (that would actually work - lol) was to shift out of one register into another
i have never implemented such an algo because i new it would be too slow
there may be some math trick, but it eludes me - lol

qWord · November 17, 2009, 05:54:27 PM

Quote from: jj2007 on November 17, 2009, 04:52:55 PM
... even SSEx ...

that's an job for sse2 :8)

Code Select


	.data
		align 16
		msk db 1,1,2,2,4,4,8,8,16,16,32,32,64,64,128,128
	.code
	
	mov eax, what ever; e.g.: 0110010001011010011001001y ==> 1001001100101101000100110y
	
	movd xmm0,eax
	movdqa xmm4,OWORD ptr msk
	pxor xmm5,xmm5
	punpcklbw xmm0,xmm0
	punpcklwd xmm0,xmm0
	movdqa xmm1,xmm0
	movdqa xmm2,xmm0
	movdqa xmm3,xmm0	
	psrldq xmm0,12		
	psrldq xmm1,8		
	psrldq xmm2,4		
						
	pshufd xmm0,xmm0,0 	; xmm0 = byte[3] of eax
	pshufd xmm1,xmm1,0 	; xmm1 = byte[2] "  "
	pshufd xmm2,xmm2,0 	; xmm2 = byte[1] "  "
	pshufd xmm3,xmm3,0 	; xmm3 = byte[0] "  "
	
	pandn xmm0,xmm4		; mask bits 
	pandn xmm1,xmm4		;
	pandn xmm2,xmm4		;
	pandn xmm3,xmm4		;
	
	pcmpeqb xmm0,xmm5
	pcmpeqb xmm1,xmm5
	pcmpeqb xmm2,xmm5
	pcmpeqb xmm3,xmm5
	
	psrlw xmm0,8		; 
	psrlw xmm1,8		; 
	psrlw xmm2,8		;
	psrlw xmm3,8		;
	
	pshufhw xmm0,xmm0,00011011y
	pshufhw xmm1,xmm1,00011011y
	pshufhw xmm2,xmm2,00011011y
	pshufhw xmm3,xmm3,00011011y

	pshuflw xmm0,xmm0,00011011y
	pshuflw xmm1,xmm1,00011011y
	pshuflw xmm2,xmm2,00011011y
	pshuflw xmm3,xmm3,00011011y

	pshufd xmm0,xmm0,01001110y
	pshufd xmm1,xmm1,01001110y
	pshufd xmm2,xmm2,01001110y
	pshufd xmm3,xmm3,01001110y

	packuswb xmm0,xmm1
	packuswb xmm2,xmm3
	
	pmovmskb eax,xmm0
	pmovmskb edx,xmm2
	
	shl edx,16
	or eax,edx

qWord · November 17, 2009, 06:05:59 PM

Quote from: jj2007 on November 17, 2009, 04:52:55 PM
XLAT or MOV AL,[BX+AL] is of course an option, but it would require a 256*256 table for a word register...

no, you need only one table (256 byte). For an word you have to make two look ups - when storing the result, you must only swap this two bytes.

EDIT:
here is some code (for 32 bit values):

Code Select

	.data
		
		align 16
		lut LABEL BYTE
		
		cntr = 0
		REPEAT 256
			.radix 2
			txt TEXTEQU %(cntr)
			.radix 10
			txt2 TEXTEQU <>
			
			cnt2=@SizeStr(%txt)
		%	WHILE cnt2 NE 8
				txt CATSTR <0>,txt
				cnt2=@SizeStr(%txt)
			ENDM
			
		%	FORC chr,<txt>
				txt2 CATSTR <&chr>,txt2
			ENDM			
			txt2 TEXTEQU <0>,txt2,<y>
			db txt2 
			cntr  = cntr + 1
		ENDM
		
		
	.code

	mov eax,123456 ; 00000000000000011110001001000000 ==> 00000010010001111000000000000000
                  

	xor edx,edx
	movzx ecx,al
	mov dl,[lut+ecx]
	shl edx,8
	movzx ecx,ah
	shr eax,16
	mov dl,[lut+ecx]
	shl edx,8
	movzx ecx,al
	mov dl,[lut+ecx]
	shl edx,8
	movzx ecx,ah
	mov dl,[lut+ecx]
	; result in edx

jj2007 · November 17, 2009, 11:14:46 PM

Quote from: dedndave on November 17, 2009, 05:18:48 PM
i have wanted the same thing a few times, Jochen
for a word value, you could use a byte table twice and swap bytes
the only thought i have ever had (that would actually work - lol) was to shift out of one register into another
i have never implemented such an algo because i new it would be too slow
there may be some math trick, but it eludes me - lol

Dave,
Speed is not an issue, because you need to create the table only once. For a word size table, that's around 1 millisecond. Here is my first attempt, 20 bytes long:

Code Select


include \masm32\include\masm32rt.inc

.data?
bufSrc	db 36 dup(?)
bufDest	db 36 dup(?)

.code
start:
	xor ebx, ebx	; clear outer counter

L0:	m2m esi, 16	; set inner counter
	mov eax, ebx	; the source word

L1:	rol ax, 1		; get a carry flag in case the high order bit is set
	rcr dx, 1		; and pull it in from the left
	dec esi
	jne L1

	.if ebx>65515 || ebx<20	; show some nice results
		mov edi, edx
		mov esi, eax
		print str$(ebx), 9
		invoke dw2bin_ex, esi, offset bufSrc
		invoke dw2bin_ex, edi, offset bufDest
		print offset bufSrc+16, 32
		print offset bufDest+16, 13, 10
	.endif

	inc bx
	jne L0

L2:	print "Size="
	mov eax, L2
	sub eax, start
	sub eax, 110		; correct size for print routine
	print str$(eax)
	exit
end start

Output:

Code Select

0       0000000000000000 0000000000000000
1       0000000000000001 1000000000000000
2       0000000000000010 0100000000000000
3       0000000000000011 1100000000000000
4       0000000000000100 0010000000000000
5       0000000000000101 1010000000000000
6       0000000000000110 0110000000000000
7       0000000000000111 1110000000000000
8       0000000000001000 0001000000000000
9       0000000000001001 1001000000000000
10      0000000000001010 0101000000000000
11      0000000000001011 1101000000000000
12      0000000000001100 0011000000000000
13      0000000000001101 1011000000000000
14      0000000000001110 0111000000000000
15      0000000000001111 1111000000000000
16      0000000000010000 0000100000000000
17      0000000000010001 1000100000000000
18      0000000000010010 0100100000000000
19      0000000000010011 1100100000000000
65516   1111111111101100 0011011111111111
65517   1111111111101101 1011011111111111
65518   1111111111101110 0111011111111111
65519   1111111111101111 1111011111111111
65520   1111111111110000 0000111111111111
65521   1111111111110001 1000111111111111
65522   1111111111110010 0100111111111111
65523   1111111111110011 1100111111111111
65524   1111111111110100 0010111111111111
65525   1111111111110101 1010111111111111
65526   1111111111110110 0110111111111111
65527   1111111111110111 1110111111111111
65528   1111111111111000 0001111111111111
65529   1111111111111001 1001111111111111
65530   1111111111111010 0101111111111111
65531   1111111111111011 1101111111111111
65532   1111111111111100 0011111111111111
65533   1111111111111101 1011111111111111
65534   1111111111111110 0111111111111111
65535   1111111111111111 1111111111111111

dedndave · November 17, 2009, 11:56:04 PM

oh - lol
i didn't know this was for table creation - that makes it nice
you are shifting out of and into word size registers
i would go backwards, so you could use dword size registers - no size override bytes

L1: shr eax,1
rcl edx,1
dec esi
jnz L1

jj2007 · November 18, 2009, 12:17:29 AM

Quote from: dedndave on November 17, 2009, 11:56:04 PM
i would go backwards, so you could use dword size registers - no size override bytes

Hmmm... did you test it? Doesn't work for me...

dedndave · November 18, 2009, 12:46:13 AM

huh ?
it should work fine - try it again

dedndave · November 18, 2009, 12:50:52 AM

i get nothing out of your routine at all, my friend
ok - now i got it - lol
it can't be named "shift.exe" i guess i have that program someplace - lol
the pattern looks cooler on a black console window than it does in here

my code works fine, as long as you do not need the original value
try this...

push eax
L1: shr eax,1
rcl edx,1
dec esi
jnz L1
pop eax

jj2007 · November 18, 2009, 01:04:31 AM

Can you post a complete example? Mine is implemented in the bin2byte thread now.

dedndave · November 18, 2009, 01:16:20 AM

wtCreate:   m2m esi, 16   ; set inner counter
   mov edx, ebx   ; the source word

@@:
shr edx,1 ;was rol dx, 1      ; get a carry flag in case the high order bit is set
rcl eax,1 ;was rcr ax, 1      ; and pull it in from the left
dec esi
jne @B
mov edx,ebx ;added

i am not sure if you need the original copy in edx

dedndave · November 18, 2009, 01:34:32 AM

i think i see why it may not have worked for you
in the post above, you have

mov eax, ebx ; the source word

L1: rol ax, 1 ; get a carry flag in case the high order bit is set
rcr dx, 1 ; and pull it in from the left
dec esi
jne L1

so my code was

L1: shr eax,1
rcl edx,1
dec esi
jnz L1

in Bin2Byte, you swapped the AX/DX registers to

@@: rol dx, 1 ; get a carry flag in case the high order bit is set
rcr ax, 1 ; and pull it in from the left
dec esi
jne @B

MichaelW · November 18, 2009, 02:30:02 AM

In my tests of a 32-bit operation the code was fastest (65 cycles versus 101) if I unrolled the loop completely:

Code Select


REPEAT 32
    shr edx, 1
    rcl eax, 1
ENDM

And the P3 cycle counts were 65 versus 30 for an 8-bit table lookup.

sinsi · November 18, 2009, 06:13:32 AM

qWord, I get 7 clocks for your code, this is 4

Code Select


	movzx ecx,al
	mov dh,[lut2+ecx]
	movzx ecx,ah
	shr eax,16
	mov dl,[lut2+ecx]
	shl edx,16
	movzx ecx,al
	mov dh,[lut2+ecx]
	movzx ecx,ah
	mov dl,[lut2+ecx]
	; result in edx

Still your code though.

How about a 4-bit look-up? 16 bytes vs 256 - I am thinking of size here, since 4 clocks is fast enough.

jj2007 · November 18, 2009, 09:11:00 AM

Quote from: dedndave on November 18, 2009, 01:34:32 AM
i think i see why it may not have worked for you

Yeah, I know. I had tested that, and tried all kinds of combinations, but probably I was just too tired. Now it works, thanxalot :thumbu

News:

Fast bitwise swap?