I needed to see what PSHUFB can do. It looks like it will really speed up BSWAP-style byte swapping for streaming data. It appears to be a very useful instruction with many applications. NOTE that it requires SSSE3 (Supplemental SSE3), not just SSE3.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
Computer must be SSSE3 capable
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
.data
align 16
shflmask dd 03020100h,07060504h,0B0A0908h,0F0E0D0Ch
pshmsk dd shflmask
align 16
shflmask2 dd 00010203h,04050607h,08090A0Bh,0C0D0E0Fh
pshmsk2 dd shflmask2
align 16
mytest db "32107654BA98FECD",0,0,0,0 ; data to shuffle
pmtst dd mytest
align 16
mybuff db 20 dup (0) ; output buffer for result
pmbuf dd mybuff
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
mov eax, pshmsk2 ; load the shuffle mask address
movdqa xmm2, [eax] ; copy it into XMM2
mov eax, pmtst ; load the data address in EAX
movdqa xmm1, [eax] ; load the data into XMM1
pshufb xmm1, xmm2 ; shuffle bytes to order in XMM2
mov eax, pmbuf ; load output buffer address
movdqa [eax], xmm1 ; copy 16 byte result to it
print pmbuf,13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
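For anyone following along without an SSSE3 machine, here is a rough scalar model in C of what the pshufb in the listing does with shflmask2. It is only a sketch: the names pshufb_model and shflmask2_bytes are made up for illustration, and the mask is written out byte by byte the way the little-endian dwords sit in memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the MASM listing.
   Scalar model of the 128-bit PSHUFB: for each destination byte i,
   a mask byte with bit 7 set produces 0; otherwise the low 4 bits
   of the mask byte pick which source byte is copied. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];   /* temp copy so dst may alias src, as with xmm1 */
    for (int i = 0; i < 16; i++) {
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    memcpy(dst, tmp, 16);
}

/* shflmask2 from the listing, as the four dwords are laid out in memory */
static const uint8_t shflmask2_bytes[16] = {
    0x03, 0x02, 0x01, 0x00,  0x07, 0x06, 0x05, 0x04,
    0x0B, 0x0A, 0x09, 0x08,  0x0F, 0x0E, 0x0D, 0x0C
};
```

Fed the test string "32107654BA98FECD" with this mask, the model gives "0123456789ABDCEF", so the mask amounts to a BSWAP on each of the four dwords at once.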
hutch,
I had to add .XMM to get it to assemble with MASM 10.0.
That makes sense. I built this with ML version 9.0; I just typed it in and it worked.
My new you-beaut AMD processor doesn't support SSSE3, so I get an invalid opcode error for pshufb :(
What a shame. Is there an AMD-specific instruction that will do the same or a similar task?
Maybe in SSE5? It seems that Intel and AMD are going their own ways again.
This was interesting: http://abinstein.blogspot.com/2007/09/amds-latest-x86-extension-sse5-part-2.html
AMD make logical instruction encodings to allow for expansion whereas Intel squeeze them in any old how :lol
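Not an AMD-specific instruction, but on a processor without SSSE3 the particular shuffle in this test can be had the slow way. The shflmask2 mask only reverses the bytes inside each dword, which is what a plain BSWAP does one register at a time. A minimal sketch of that fallback in C follows; bswap_dwords is a hypothetical name, and it assumes the buffer is processed in 4-byte groups.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fallback, not from the original listing.
   Reverse the bytes inside each complete 4-byte group of buf,
   which is what BSWAP does to one 32-bit register. Any trailing
   1 to 3 bytes are left untouched. */
static void bswap_dwords(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint8_t b0 = buf[i];
        uint8_t b1 = buf[i + 1];
        buf[i]     = buf[i + 3];
        buf[i + 1] = buf[i + 2];
        buf[i + 2] = b1;
        buf[i + 3] = b0;
    }
}
```

This handles one dword per iteration where pshufb handles four per instruction, so it is a compatibility path rather than a replacement for streaming data.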
It works fine. :bg