Strange problems migrating to 64bit

Started by johnsa, January 19, 2012, 08:17:45 AM


johnsa

Ok.. I think we can work around this problem... I posted in the laboratory earlier with a new typed struct to represent XMMWORD (_mm128) ..

If we combine that with a new proc/endproc and invoke macro loosely based on the following:


option prologue:none
option epilogue:none
align 16
testproc PROC a:_mm128

   LOCAL myVar:DWORD
   
   push rbp
   sub rsp,2ch ;28h 4 params + return addr
   mov rbp,rsp
   
   mov myVar,10
   
   add rsp,2ch
   pop rbp
   
   ret
testproc ENDP
option prologue:PrologueDef
option epilogue:EpilogueDef

I think we could avoid the register->shadow space bug and fix the code generation by not adjusting RSP outside of the callee... In addition we do AND RSP,-16 ONLY once, at the beginning of an application.. then the new PROC macro ensures the stack stays aligned by inserting the right padding and laying out the params + locals on the stack in aligned increments....

The above seems to work perfectly in VS, just like the C++ code..

Ok.. anyone want to help with the macros? :)

sinsi

If your proc doesn't call any windows functions there is no need to align the stack.
"sub rsp,2ch" will cause problems since the stack is usually used in qword chunks, so a dword local should be 'promoted' to 8 bytes, with the upper dword only used for alignment to 8.

dedndave

also, it is common practice to PUSH the base pointer register, then load it with the value of the stack pointer
adjustments for local variables are then made to the stack pointer
when you exit the routine, the stack pointer may be restored by using the base pointer value
i would imagine that the LEAVE instruction works under 64-bit, and is the same as MOV RSP,RBP/POP RBP
if other registers are preserved, they are pushed before the base pointer and popped after the stack and base pointers have been restored

dedndave

this seems like a logical sequence
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

TestProc PROC ParmA:_mm128

        push    rbx
        push    rsi
        push    rdi
        push    rbp
        mov     rbp,rsp
        sub     rsp,4          ;create a dword local at [rbp-4]
        and     rsp,-16        ;align the stack to create 16-aligned locals
        sub     rsp,32         ;create any simd locals
;
;
;
        leave                  ;this restores RSP and RBP, discarding any locals
        pop     rdi
        pop     rsi
        pop     rbx
        ret     16             ;return, discarding the _mm128 parm

TestProc ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

johnsa

I tried getting the _mm128 onto the stack, but jwasm always sends it by reference rather than by value... it might be nice to somehow update invoke to cater for that, so it could pass BY VAL or BY REF.. so currently it would just be a ret 8 (instead of a ret 16) I'm guessing, as it's a ptr.

johnsa

Quote from: sinsi on January 20, 2012, 11:37:01 AM
If your proc doesn't call any windows functions there is no need to align the stack.
"sub rsp,2ch" will cause problems since the stack is usually used in qword chunks, so a dword local should be 'promoted' to 8 bytes, with the upper dword only used for alignment to 8.

It makes sense, but VC++ disagrees :) It pushes R9d onto the stack instead of the full R9. Also then what would happen to a byte/word parameter?
Even under win32 stdcall you'd have stack alignment issues if your params were say dword, byte, dword... ? So not having a qword aligned as a parameter I guess wouldn't be worse than the old case. Alternatively I'm sure it would be possible for the prologue to correctly gauge the params and just do one large 16n subtract of RSP to ensure that all referenced locals are aligned natively.. even if it means wasting a few bytes of stack space here and there?
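
A minimal sketch of the "one large 16n subtract" idea, assuming the caller followed the ABI so that RSP is 16n+8 on entry. LOCAL_BYTES and FRAME_BYTES are made-up names. The dword local gets a full qword slot, as sinsi suggests, and the total is rounded up to a multiple of 16 at assembly time:

```asm
.code
        OPTION  PROLOGUE:NONE
        OPTION  EPILOGUE:NONE

LOCAL_BYTES     EQU 20h + 8                     ; shadow space for callees + one qword slot for the dword local
FRAME_BYTES     EQU (LOCAL_BYTES + 15) AND -16  ; rounded up to a multiple of 16: 28h becomes 30h

testproc PROC
        push    rbp                     ; RSP was 16n+8 on entry, so it is 16-aligned after this push
        mov     rbp,rsp
        sub     rsp,FRAME_BYTES         ; a single subtract, RSP stays 16-aligned
        mov     dword ptr [rbp-8],10    ; the dword local uses the low half of its qword slot
        leave                           ; mov rsp,rbp / pop rbp
        ret
testproc ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
end
```

The few wasted bytes buy a frame where every qword slot, and any 16-byte local placed at a multiple of 16 below RBP, is aligned without a runtime AND.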

dedndave

"ParmA:_mm128" should make space on the stack for the data, not a pointer
however, it is probably easier to pass a pointer whenever the data is larger than the machine width   :P

johnsa

Agreed :) But it would be nice to have control over it.. in any event JWasm's invoke automatically makes it a ptr... more reason for a serious update to proc/invoke :)

dedndave

well, you created a new type with your _mm128 stuff
you can reference the pointer as a pVoid (PVOID) type (i guess that's the 64-bit hungarian)

dedndave

i might add....
if you pass simd data as a parameter, you have to be concerned with stack alignment of the parm(s)
this could be tricky, maybe even very difficult for INVOKE to verify and implement

whereas, if you pass a pointer to the simd data, you don't have to worry about such things   :U
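
A minimal sketch of the pointer approach, assuming the first argument arrives in RCX as under the Win64 convention. The proc name addvec is made up. Using the unaligned move means the callee does not depend on how the caller aligned the data:

```asm
.code
; hypothetical proc: doubles four packed floats in place
; rcx = pointer to 16 bytes of SIMD data
addvec  PROC
        movups  xmm0,xmmword ptr [rcx]  ; load works for any pointer alignment
        addps   xmm0,xmm0               ; each float added to itself
        movups  xmmword ptr [rcx],xmm0  ; store the result back through the pointer
        ret
addvec  ENDP
end
```

If the caller can guarantee 16-byte alignment, movaps could be used instead, but then a misaligned pointer faults.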

johnsa

To be honest.. I have never yet needed to pass an XMM as a parameter. I've lived for years in MASM32 only passing pointers to SIMD data.. So although nice, my last few suggestions are purely *WISH LIST* things.

Right now the only problem I have is getting JWasm-built code to work in the debugger... that's an absolute show stopper for me: I need to be able to use VS2010 for debugging and see args/locals.. For this I think we need a macro patch for PROC and INVOKE... ?

johnsa

Hey,

After some discussion with Japeth we've worked out the problem, found specs around the API and how VC handles all of this correctly..

Basically what happens in VC is that it analyzes the call graph within a procedure by looking at every call that procedure makes and how much stack to allocate to accommodate this.
It appears to be something along the lines of sub RSP,32+(MAX_CALLED_PARAMS*16)+(MAX_PRIMITIVE_LOCALS*16)+ROUND_UP_TO_PARA(MAX_LOCAL_STRUCTS)... it then reserves this in the caller.
In addition VC doesn't write full QWORD registers to the stack as its way of promoting .. it simply aligns the slots to QWORDS and writes in the respective type: byte, word, dword etc.
The ABI clearly states that structs which aren't sized the same as one of these simple types should be passed by reference/pointer and not value (so that idea I've shelved).

No 64bit assembler currently does this... and it has many benefits
1) Allows one to debug 64bit applications with Visual Studio and use tools like VTune correctly with locals/arguments.
2) Follows the ABI more closely
3) Improved performance of generated code for two reasons: lower call overhead, and keeping the stack fixed for the duration of a proc, regardless of how many calls it makes, improves the likelihood of stack data being cached.

What this means for the assembler:
1) Modify INVOKE code generation - remove the add/sub RSP around the call..
2) Add the difficult part - track calls inside a proc when rolling out invokes and their locals/parameters, to plug into the prologue for the caller..
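
A minimal sketch of what the fixed-frame scheme looks like in a caller, assuming RSP is 16n+8 on entry. func5 and func1 are hypothetical external procs taking five and one integer arguments. The frame is sized once for the largest call and reused:

```asm
EXTERN  func5:PROC                      ; hypothetical callee, five integer arguments
EXTERN  func1:PROC                      ; hypothetical callee, one integer argument

.code
caller  PROC
        sub     rsp,28h                 ; 20h shadow space + 8 for the fifth argument; RSP is 16-aligned again
        mov     ecx,1
        mov     edx,2
        mov     r8d,3
        mov     r9d,4
        mov     qword ptr [rsp+20h],5   ; fifth argument goes just above the shadow space
        call    func5
        xor     ecx,ecx
        call    func1                   ; reuses the same reserved area, no add/sub around the call
        add     rsp,28h
        ret
caller  ENDP
end
```

The hard part for the assembler is only the 28h: it has to be known when the prologue is emitted, which is before the invokes in the body have been seen.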

This is quite a bit of work to change, but I think the benefits are worth it.. in terms of JWASM it would make it THE ONLY choice for 64bit Windows asm development.. as it's already near perfect and fast.

Hopefully Japeth agrees :) I've even offered to put money towards the effort if required as it really is worth it to me to be able to use my full tool-set in 64bit and Visual Studio is my debugger of choice.
It's an incredible piece of software and deserves it... I'd rather spend money on helping get it 100% than paying more money to MS for any other tool!

John

BogdanOntanu

I agree, I will try to implement this kind of stuff to my SOL_ASM also ;)

johnsa

So I thought I should bash away at replacing the prologue, invoke etc. to at least be able to use 64bit in the meantime... and I've run into more issues...



; A proper type definition for an XMMWORD that allows debugger to see sub-elements/types.
__mm128i struct
i0 DWORD ?
i1 DWORD ?
i2 DWORD ?
i3 DWORD ?
__mm128i ends

_mm128i typedef __mm128i

__mm128f struct
f0 real4 ?
f1 real4 ?
f2 real4 ?
f3 real4 ?
__mm128f ends

_mm128f typedef __mm128f

_mm128 union
i32 _mm128i <>
f32 _mm128f <>
_mm128 ends

BNT MACRO
db 2eh
ENDM

BTK MACRO
db 3eh
ENDM

option casemap:none
option win64:1
option frame:auto

    .nolist
    .nocref
WIN32_LEAN_AND_MEAN equ 1
_WIN64 EQU 1
    include c:\jwasm\WinInc\Include\windows.inc
    .list
    .cref
   
    includelib <kernel32.lib>
    includelib <user32.lib>

;myproc proto a:DWORD, b:REAL4, cc:QWORD

.const

.data?

.data

.code

NewPrologue MACRO procname, flags, argbytes, localbytes, <reglist>, userparms:VARARG

;mov [rsp+8],rcx
;mov [rsp+16],rdx
;mov [rsp+24],r8
;mov [rsp+32],r9

ECHO localbytes

IF localbytes GT 0
push rbp
mov rbp,rsp
sub rsp,(8*16)
ELSE ; If there are no locals, simply reserve space for 16 parameters.
push rbp
mov rbp,rsp
sub rsp,(8*16)
ENDIF

IFNB <reglist>
FOR reg,reglist
push reg
ENDM
ENDIF

exitm <(8*16)>
endm

NewEpilogue MACRO procname, flags, argbytes, localbytes, <reglist>, userparms:VARARG

IFNB <reglist>
FOR reg,reglist
pop reg
ENDM
ENDIF

leave

retn 0
endm

main proc

;invoke myproc , 1 , 10.2 , 4
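mov ecx,1       ; a:  argument 1, integer -> ECX
pxor xmm1,xmm1  ; b:  argument 2, float -> XMM1 (0.0 here)
mov r8,4        ; cc: argument 3, integer -> R8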
call myproc

ret
main endp

mainCRTStartup proc
invoke main
invoke ExitProcess,0
mainCRTStartup endp

option PROLOGUE:NewPrologue
option EPILOGUE:NewEpilogue
myproc proc a:DWORD, b:REAL4, cc:QWORD

LOCAL var1:DWORD
LOCAL var2:BYTE
LOCAL var3:REAL4
LOCAL var4:QWORD
LOCAL var5:_mm128

mov eax,var1

ret
myproc endp

end mainCRTStartup


If you look at that code, the localbytes that is sent into the prologue macro is reporting itself as 40 bytes (28h) upon assembly... which is completely wrong...
it seems like it's just taking 5 locals * 8.. and isn't respecting the size of the struct type _mm128.
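
One other way to arrive at 28h, offered only as a guess at the layout rule and not confirmed for JWasm or ML64: if the locals are packed downward from RBP and each is aligned to its natural boundary (the _mm128 union to 4, its largest member), the running offsets come out as below. The OFF_ names are made up for illustration:

```asm
OFF_VAR1 EQU 4                              ; var1 DWORD   -> [rbp-4]
OFF_VAR2 EQU OFF_VAR1 + 1                   ; var2 BYTE    -> [rbp-5]
OFF_VAR3 EQU (OFF_VAR2 + 4 + 3) AND -4      ; var3 REAL4   -> [rbp-12]
OFF_VAR4 EQU (OFF_VAR3 + 8 + 7) AND -8      ; var4 QWORD   -> [rbp-24]
OFF_VAR5 EQU (OFF_VAR4 + 16 + 3) AND -4     ; var5 _mm128  -> [rbp-40], 40 = 28h
```

The raw sizes add up to 4 + 1 + 4 + 8 + 16 = 33 bytes, so 40 is also consistent with the full 16 bytes of _mm128 being counted. Comparing the offsets in a listing file would show which reading is right.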

Furthermore, there seems to be a problem in the actual ABI specification, which states that 4*8 must be reserved as a minimum for shadow space, but they're neglecting that there could be 4 integer register parameters AS WELL as 4 float params in xmm0-xmm3.
So really the minimum reservation should be 8*8 (64 bytes), not 32, to ensure that a proc can copy all params from reg to stack...
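
For comparison, Microsoft's documented x64 convention assigns the four register arguments by position, not by type: argument 2 travels in RDX or in XMM1, never both, so at most four registers carry arguments and four 8-byte home slots cover them. A minimal sketch for the myproc signature above, assuming RSP has not been touched since entry (the proc name home_args is made up):

```asm
.code
; hypothetical stand-in for: myproc proc a:DWORD, b:REAL4, cc:QWORD
;   argument 1: a  arrives in ECX  (XMM0 is not used)
;   argument 2: b  arrives in XMM1 (RDX is not used)
;   argument 3: cc arrives in R8   (XMM2 is not used)
home_args PROC
        mov     dword ptr [rsp+8],ecx       ; home slot 1
        movss   dword ptr [rsp+16],xmm1     ; home slot 2
        mov     qword ptr [rsp+24],r8       ; home slot 3
        ret
home_args ENDP
end
```

The fourth home slot at [rsp+32] stays unused here because the proc takes only three arguments.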

I'm assembling with jwasm, going to see if ML64 reports the same incorrect localbytes value.

johnsa

It appears ML64 also outputs 28h for local bytes...