MASM movlps problem?

Started by Ficko, June 02, 2009, 05:38:51 PM


Ficko

I am trying to compile:

"movlps xmm0, [eax]"

with MASM (ver 9.0xx) but getting "error A2070:invalid instruction operands".

With JWASM it compiles just fine.

Can someone confirm it's a bug, or am I missing something here?! ::)

Using:
.686p
.xmm
.model flat

Thanks

ToutEnMasm

Same thing
Quote
MOVLPS xmm, mem64   0F 12 /r   Moves two packed single-precision floating-point values from a 64-bit memory location to an XMM register.
MOVLPS mem64, xmm   0F 13 /r   Moves two packed single-precision floating-point values from an XMM register to a 64-bit memory location.

Working:
.data
truc qword ?
.code
movlps xmm0,truc ;[eax]

dedndave


Ficko

Thanks, so I guess I have 3 choices:

1.) Hard code it.
2.) Change to JWASM for good.
3.) Change my code. ::)

I think I'll go with 2.
It seems to be a fine alternative, and may even be better than MASM. :U

MichaelW

It assembles for me using ML 6.15 or 7.00.

Have you tried:

movlps xmm0, QWORD PTR [eax]
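
For completeness, a minimal test case built from the directives in the first post (the proc name is just for illustration, and the failing form is left in as a comment):

.686p
.xmm
.model flat

.code
try_movlps proc
    ; movlps xmm0, [eax]           ; ML 9.x: error A2070
    movlps xmm0, QWORD PTR [eax]   ; explicit size qualifier - assembles
    movlps QWORD PTR [eax], xmm0   ; store form, also with explicit size
    ret
try_movlps endp
end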

eschew obfuscation

Ficko

You are the man Michael!  :U

I didn't think of that one, since 'movlps' can't take anything other than a QWORD, and usually MASM figures such things out by itself.

I am downgrading the 'bug' to 'annoyance'. :P

MichaelW

I should have made it clear that I did not have to add the QWORD PTR. A plausible explanation for the change, other than it just being a mistake, might be that the instruction actually supports more than the two forms officially listed.
eschew obfuscation

jj2007

Movaps and movups do not require the size.

.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
include \masm32\macros\timers.asm      ; get them from the Masm32 Laboratory
   buffersize      = 10000               ; don't go higher than 100000, ml.exe would slow down
   LOOP_COUNT   = 10000

.data
align 16
Buffer16   db 1, 2, 3, 4, 5, 6, 7      ; try adding 8, then 9
buffer   dd buffersize dup(?)

.code
start:
   REPEAT 3
   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset Buffer16
      REPEAT 100
         movaps xmm1, [esi]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movaps", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset buffer
      REPEAT 100
         movups xmm1, [esi]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movups", 13, 10

   counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      mov esi, offset buffer
      REPEAT 100
         movlps xmm0, QWORD PTR [esi]
         movhps xmm0, QWORD PTR [esi+8]
         lea esi, [esi+16]
      ENDM
   counter_end
   print str$(eax), 9, "cycles for movlps/movhps", 13, 10, 10
   ENDM

   inkey chr$(13, 10, "--- ok ---", 13)
   exit
end start

Interesting that the pair movlps/movhps is 20% faster than the single movups (Celeron M):
197     cycles for movaps
397     cycles for movups
321     cycles for movlps/movhps

197     cycles for movaps
403     cycles for movups
322     cycles for movlps/movhps

197     cycles for movaps
397     cycles for movups
321     cycles for movlps/movhps

Ficko

Thanks JJ2007,

Nice timing. :thumbu

Just for fun can you put up a comparison with:

"fild qword ptr [eax]"
"fistp qword ptr [eax]"

just being curious how horrible (or not) the FPU is for QWORD data transfer.
Since the MMX regs are mapped onto the FPU regs, maybe it isn't that bad at all?! ::)
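
The kind of sequence I mean, as a sketch (esi/edi stand in for hypothetical source and destination pointers):

; copy one QWORD through the FPU - no SSE required
fild  qword ptr [esi]   ; load the 64-bit integer onto the FPU stack
fistp qword ptr [edi]   ; store it to the destination and pop

Since the 80-bit FPU format has a 64-bit significand, any 64-bit integer round-trips exactly, so this is a lossless qword move.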


jj2007

Quote from: Ficko on June 03, 2009, 09:38:28 AM
Thanks JJ2007,

Nice timing. :thumbu

Just for fun can you put up a comparison with:

"fild qword ptr [eax]"
"fistp qword ptr [eax]"

just being curious how horrible (or not) the FPU is for QWORD data transfer.
Since the MMX regs are mapped onto the FPU regs, maybe it isn't that bad at all?! ::)


It depends.. it depends, as usual. Here are P4 timings, 100*16 bytes mem to mem each:

1377    cycles for fild/fistp
515     cycles for movdqa (aligned 16)
625     cycles for rep movsd (aligned 16)
1206    cycles for movdqu
3391    cycles for movlps/movhps

1411    cycles for fild/fistp
490     cycles for movdqa (aligned 16)
668     cycles for rep movsd (aligned 16)
1744    cycles for movdqu
3486    cycles for movlps/movhps

1342    cycles for fild/fistp
605     cycles for movdqa (aligned 16)
956     cycles for rep movsd (aligned 16)
1489    cycles for movdqu
3572    cycles for movlps/movhps

1979    cycles for fild/fistp
776     cycles for movdqa (aligned 16)
955     cycles for rep movsd (aligned 16)
1606    cycles for movdqu
3490    cycles for movlps/movhps

1440    cycles for fild/fistp
673     cycles for movdqa (aligned 16)
971     cycles for rep movsd (aligned 16)
1741    cycles for movdqu
3583    cycles for movlps/movhps


Note this is comparing apples and oranges: movdqa and rep movsd are aligned to a 16-byte boundary, while the others are badly misaligned, i.e. +7 for src and +9 for dest (I made 5 repeats to show the variance).

Surprised?
:bg

EDIT: Fixed a bug - fild/fistp count was too low (it's 8 bytes, not 16 as for movdqa)

[attachment deleted by admin]

Ficko

Yes I am surprised indeed. :bg

Secretly I was thinking it would be much worse, because the FPU has to do an "int" to "float" conversion on load,
but as so often, things aren't always the way you think they are. :wink

Or maybe the SSE instructions are doing conversions as well, which would explain the closeness of the "movdqu" and "fild/fistp" results,
since they are about the same caliber, moving qwords with no alignment!?

Cool, I have been using "fild/fistp" quite often and always thought there was a huge penalty to it, but it seems that isn't true. :dance:

Mark Jones

Win XP Pro x64
AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ (SSE3)
409     cycles for movdqa               (src + dest aligned 16)
613     cycles for movlps/movhps        srcalign=7, destalign=9
609     cycles for movdqu               srcalign=7, destalign=9
906     cycles for rep movsd            srcalign=7, destalign=9
620     cycles for fild/fistp           srcalign=7, destalign=9

409     cycles for movdqa               (src + dest aligned 16)
612     cycles for movlps/movhps        srcalign=7, destalign=9
609     cycles for movdqu               srcalign=7, destalign=9
907     cycles for rep movsd            srcalign=7, destalign=9
617     cycles for fild/fistp           srcalign=7, destalign=9

408     cycles for movdqa               (src + dest aligned 16)
612     cycles for movlps/movhps        srcalign=7, destalign=9
611     cycles for movdqu               srcalign=7, destalign=9
906     cycles for rep movsd            srcalign=7, destalign=9
618     cycles for fild/fistp           srcalign=7, destalign=9


Edit: Revision 2 timings.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

jj2007

Quote from: Ficko on June 03, 2009, 03:51:51 PM
Yes I am surprised indeed. :bg

Secretly I was thinking it would be much worse, because the FPU has to do an "int" to "float" conversion on load,
but as so often, things aren't always the way you think they are. :wink

Or maybe the SSE instructions are doing conversions as well, which would explain the closeness of the "movdqu" and "fild/fistp" results,
since they are about the same caliber, moving qwords with no alignment!?

Cool, I have been using "fild/fistp" quite often and always thought there was a huge penalty to it, but it seems that isn't true. :dance:


It is still a good option if you don't have SSE2. It is slightly slower than rep movsd, though (but movsd binds esi and edi...).

Unfortunately I found a little bug in the last routine:
movlps qword ptr [esi], xmm0
should read: edi
... which gave the movlps/movhps pair unfair treatment.
I also replaced the assembly-time REPEATs with run-time .Repeats, and the results change:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
308     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
712     cycles for movdqu               srcalign=8, destalign=8
458     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8

310     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
712     cycles for movdqu               srcalign=8, destalign=8
459     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8

308     cycles for movdqa               (src + dest aligned 16)
408     cycles for movlps/movhps        srcalign=8, destalign=8
721     cycles for movdqu               srcalign=8, destalign=8
459     cycles for rep movsd            srcalign=8, destalign=8
1220    cycles for fild/fistp           srcalign=8, destalign=8


Now what is really surprising is that the movlps/movhps pair is consistently a lot faster than movdqu. Try fumbling with the srcalign and destalign equates on top of the source - movlps/movhps is always faster.

Try in particular to set
srcalign = 8
destalign = 8

Of course, movdqa can't be beaten, but remember that HeapAlloc guarantees only 8-byte aligned memory - no good for movdqa...
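
One common workaround, as a sketch (hHeap and bytes_needed are placeholders, and you must keep the raw pointer around for HeapFree):

; over-allocate by 15 bytes, then round the pointer up
invoke HeapAlloc, hHeap, 0, bytes_needed + 15
add  eax, 15
and  eax, -16        ; eax is now 16-byte aligned - safe for movdqa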

@Mark: Thanks for the AMD timings - and sorry for the bug in the movlps row.

[attachment deleted by admin]

dedndave

Prescott dual-core
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
480     cycles for movdqa               (src + dest aligned 16)
3303    cycles for movlps/movhps        srcalign=7, destalign=9
3247    cycles for movdqu               srcalign=7, destalign=9
3590    cycles for rep movsd            srcalign=7, destalign=9
3292    cycles for fild/fistp           srcalign=7, destalign=9

483     cycles for movdqa               (src + dest aligned 16)
3336    cycles for movlps/movhps        srcalign=7, destalign=9
3276    cycles for movdqu               srcalign=7, destalign=9
3572    cycles for rep movsd            srcalign=7, destalign=9
3287    cycles for fild/fistp           srcalign=7, destalign=9

486     cycles for movdqa               (src + dest aligned 16)
3306    cycles for movlps/movhps        srcalign=7, destalign=9
3292    cycles for movdqu               srcalign=7, destalign=9
3582    cycles for rep movsd            srcalign=7, destalign=9
3342    cycles for fild/fistp           srcalign=7, destalign=9

big numbers   :(

jj2007

Quote from: dedndave on June 03, 2009, 08:46:40 PM
Prescott dual-core
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
big numbers   :(
Small numbers when aligned to 8 bytes :bg
And again, the movlps/movhps pair beats them all...

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
586     cycles for movdqa               (src + dest aligned 16)
712     cycles for movlps/movhps        srcalign=8, destalign=8
1377    cycles for movdqu               srcalign=8, destalign=8
802     cycles for rep movsd            srcalign=8, destalign=8
1175    cycles for fild/fistp           srcalign=8, destalign=8

520     cycles for movdqa               (src + dest aligned 16)
758     cycles for movlps/movhps        srcalign=8, destalign=8
1323    cycles for movdqu               srcalign=8, destalign=8
845     cycles for rep movsd            srcalign=8, destalign=8
1160    cycles for fild/fistp           srcalign=8, destalign=8

505     cycles for movdqa               (src + dest aligned 16)
797     cycles for movlps/movhps        srcalign=8, destalign=8
1242    cycles for movdqu               srcalign=8, destalign=8
798     cycles for rep movsd            srcalign=8, destalign=8
1047    cycles for fild/fistp           srcalign=8, destalign=8