Strange problems migrating to 64bit

Started by johnsa, January 19, 2012, 08:17:45 AM


johnsa

Hey all,

So I've started trying to port my code and work to 64bit (felt it was finally time), especially considering I have some projects I'm working on that really could benefit from the extra RAM/registers etc.

Things started off pretty smoothly, did all the reading up and experimenting. I decided to switch to jwasm + wininc instead of ml64 because I really can't live without the high-level syntax etc.
I still use Visual Studio 2010 to debug as I've always done.

So here is where I've run into some weird issues and have a few questions around the calling convention.

According to my understanding the following is the case:
1) The caller is responsible for decrementing and incrementing the stack pre/post call.
2) fastcall calling convention will pass the first 4 integer/ptr arguments in rcx,rdx,r8,r9 and the first 4 float args in xmm0-xmm3.
3) Shadow space is reserved on the stack in accordance with the parameters passed (their sizes) + any local variables.

Here is where I start getting confused:

Why does the stack need to be 16 byte aligned? I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potentially break the stack? RBP is pushed before RSP is copied into it, and popped at the end of the call, so if RSP is changed by the AND in between, could that break?

What is the point of fastcall if the space is a) reserved on the stack anyway and then b) the assembler copies rcx,rdx etc into the shadow space automatically??? It seems to me like this is slower now than stdcall and is just wasting those registers?

Based on that calling convention.. I would traditionally use ecx as a loop counter, suppose now I have a loop with a call/invoke in the middle... if rcx was my loop counter the call generation doesn't save rcx etc.. so now my registers are trashed?
Is the intention that I should simply not ever use rcx,rdx,r8,r9,xmm0-xmm3 ??? or must I manually save these around every call? seems ridiculous to me as now the calling convention doesn't live up to its name of fast at all.. plus you can't push an XMM register, so that means a lot of effort involved in saving them around a call.

What is the actual difference between a PROC and a PROC with FRAME specified? I know that it's 64bit SEH compliant in terms of the prolog/epilog.. but how does this affect the entry and exit code in the procedure?

So now for my other major issue.. I think my code is breaking because of the calling convention issues mentioned above, but what is strange is if I take for example win64_3e from the jwasm samples.. build it in debug mode and debug it under VS2010, I can see the local variables, but their values never update IE:


WinMain proc FRAME hInst:HINSTANCE, hPrevInst:HINSTANCE, CmdLine:LPSTR, CmdShow:UINT

    LOCAL wc:WNDCLASSEXA
    LOCAL msg:MSG
    LOCAL hwnd:HWND

In VS2010 I can see all 7 locals.. however jwasm doesn't generate the automatic copy of the registers to shadow space?? The example code manually does mov hInst,rcx which doesn't update hInst in the locals view.

If I build my project/code and go into VS2010, I can see no locals AT ALL??

Any help would be hugely appreciated!
Thanks
John

johnsa

I'm just having a look at the WinInc includes.. which are supposedly for 32 and 64bit... but the typedefs and structs don't look right to me for 64bit.. they're still full of ptr's as DWORDS ???

BogdanOntanu

Quote from: johnsa on January 19, 2012, 08:17:45 AM
...

According to my understanding the following is the case:
1) The caller is responsible for decrementing and incrementing the stack pre/post call.
2) fastcall calling convention will pass the first 4 integer/ptr arguments in rcx,rdx,r8,r9 and the first 4 float args in xmm0-xmm3.
3) Shadow space is reserved on the stack in accordance with the parameters passed (their sizes) + any local variables.

Correct, local variables remain as always. Shadow space is always reserved for arguments (away/independent from local variables).
If I recall correctly USES have changed position because of stack alignment.
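
A side note on point 2, as a rough sketch in C rather than anything from the JWasm docs: in the Microsoft x64 convention the register is chosen by argument position, so there are four register slots in total, not four integer plus four float. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns the register that
   carries argument number `pos` (0-based) in the Microsoft x64 convention,
   or NULL when the argument is passed on the stack. */
const char *win64_arg_register(int pos, int is_float)
{
    static const char *int_regs[4]   = { "rcx",  "rdx",  "r8",   "r9"   };
    static const char *float_regs[4] = { "xmm0", "xmm1", "xmm2", "xmm3" };

    assert(pos >= 0);
    if (pos >= 4)
        return NULL;                 /* fifth argument onward: stack only */
    return is_float ? float_regs[pos] : int_regs[pos];
}

/* Offset from RSP, at callee entry, of the home (shadow) slot of argument
   `pos`. The return address sits at [rsp], so slot 0 is at [rsp+8]. */
size_t win64_arg_home_offset(int pos)
{
    assert(pos >= 0);
    return 8 + 8 * (size_t)pos;
}
```

So a float in second position would travel in xmm1 (rdx is simply skipped), and every argument owns one 8-byte slot whatever its declared size.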

Quote
Here is where i start getting confused:

Why does the stack need to be 16 byte aligned?

Because it is looking "cool" :P. Well, the CPU kind of needs 8 byte alignment in 64 bit long mode, but 16 bytes is required because the compiler sometimes uses SSE code to move XMM registers around. If the stack address is not 16 byte aligned the aligned SSE moves (MOVAPS etc) would fault... and the compiler is not wise enough to know when to use unaligned SSE moves.


Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potentially break the stack? RBP is pushed before RSP is copied into it, and popped at the end of the call, so if RSP is changed by the AND in between, could that break?

GoASM :D ?

Yeah it is also my "impression" that sometimes it is possible (but i might be wrong).
I would never do an AND RSP,something ... my assembler does not do it in 64 bits prologue/epilogue generation.

Quote
What is the point of fastcall if the space is a) reserved on the stack anyway and then b) the assembler copies rcx,rdx etc into the shadow space automatically??? It seems to me like this is slower now than stdcall and is just wasting those registers?

The point is that this convention was created by people that have NO CLUE about ASM programming BUT unfortunately decided about ABI and ASM stuff.

Apparently they are under the influence of the old MS DOS days when APIs transferred params in registers and "this is faster" TM :P (and copy cat from the UNIX 64 bit ELF "ways")

Yes, sometimes (in inner loops) it is faster but it is STUPID in API calling API calling API ... they just do not understand it...
Wrong decision they made... we will have to live with it ... unfortunately it is done... accept it young Jedi  :D

Quote
Based on that calling convention.. I would traditionally use ecx as a loop counter, suppose now I have a loop with a call/invoke in the middle... if rcx was my loop counter the call generation doesn't save rcx etc.. so now my registers are trashed?

Unfortunately yes. But note that ECX was also trashed by API's with STDCALL.
Be "wise" and only use FASTCALL for API and try to still use STDCALL for your own functions :D if possible ...

Quote
Is the intention that i should simply not ever use rcx,rdx,r8,r9,xmm0-xmm3 ???

Yes, we add registers ONLY in order to LOSE them and trash them and have more code saving and restoring them BECAUSE we are so "cool" about wasting registers ;)
You are correct in your sad observations.

Quote
or must I manually save these around every call? seems ridiculous to me as now the calling convention doesn't live up to its name of fast at all.. plus you can't push an XMM register, so that means a lot of effort involved in saving them around a call.

Yes, but the compiler will do this easily; it is just hard for humans, hence stop programming in ASM :D (irony)

Basically compilers will only do it once at start of PROC code leaving enough space for ALL invokes inside a PROC ;) There is no real need to do it before each API invoke.

Yes it is not "fast" at all. It just looks like it.

You cannot PUSH XMM but you can MOV them to [RSP] and now you see why RSP has to be 16 bytes aligned ;)
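
To put numbers on "only once at start of PROC", here is a small C sketch of the arithmetic a prologue generator might use. The function name and parameters are invented for illustration, and real assemblers may lay the frame out differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes to SUB from RSP once in the prologue.
     pushes      - number of 8-byte registers pushed before the SUB
     local_bytes - bytes of LOCAL storage
     max_args    - largest argument count of any call made in the body
   The result keeps RSP 16-byte aligned at every call site. */
size_t frame_bytes(size_t pushes, size_t local_bytes, size_t max_args)
{
    size_t slots = max_args < 4 ? 4 : max_args;  /* never fewer than 4 slots */
    size_t size  = slots * 8 + local_bytes;

    /* At entry RSP is 8 past a 16-byte boundary (the return address). */
    size_t used = 8 + pushes * 8 + size;
    size_t pad  = (16 - used % 16) % 16;

    assert(local_bytes % 8 == 0);
    return size + pad;
}
```

With no pushes and no locals this gives 28h, the familiar `sub rsp,28h`; with one push and 80 bytes of locals it gives 70h.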

Quote
What is the actual difference between a PROC and a PROC with FRAME specified? I know that it's 64bit SEH compliant in terms of the prolog/epilog.. but how does this affect the entry and exit code in the procedure?

JWASM specific, sorry....

Basically the prologue / epilogue follows a fixed form so that the "unwind" code can recognize it, because the "cool" guys lost the easy way to do this ;)
Info about each PROC with FRAME is stored in a PE section/directory, which makes your code easy to walk back and unwind if an exception occurs in your PROC.

Quote
So now for my other major issue.. I think my code is breaking because of the calling convention issues mentioned above, but what is strange is if I take for example win64_3e from the jwasm samples.. build it in debug mode and debug it under VS2010, I can see the local variables, but their values never update IE:


WinMain proc FRAME hInst:HINSTANCE, hPrevInst:HINSTANCE, CmdLine:LPSTR, CmdShow:UINT

    LOCAL wc:WNDCLASSEXA
    LOCAL msg:MSG
    LOCAL hwnd:HWND

In VS2010 I can see all 7 locals.. however jwasm doesn't generate the automatic copy of the registers to shadow space?? The example code manually does mov hInst,rcx which doesn't update hInst in the locals view.

If I build my project/code and go into VS2010, I can see no locals AT ALL??


I assume from my experience with my own 64 bit SOL_Asm ... (but I might be wrong) that JWASM does not generate PDB files directly (the format is undocumented).

Instead it generates the old CodeView format (documented), which then gets converted to PDB.
Unfortunately in this case you lose / do not have full LOCALS and ARGS information for debug ... sorry :D


dedndave

Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potentially break the stack? RBP is pushed before RSP is copied into it, and popped at the end of the call, so if RSP is changed by the AND in between, could that break?

if the epilogue uses a LEAVE instruction, or otherwise restores RSP from a stored value, this is ok
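
A rough C model of that case, assuming the usual `push rbp / mov rbp,rsp / and rsp,-16` prologue and a `leave` epilogue (the struct and function names are only for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Register state after the prologue. */
typedef struct {
    uint64_t rsp;        /* stack pointer after alignment         */
    uint64_t rbp;        /* frame pointer, taken BEFORE the AND    */
    uint64_t saved_rbp;  /* caller's RBP, stored at [rbp]          */
} frame_model;

/* Models: push rbp / mov rbp,rsp / and rsp,-16 */
frame_model enter_aligned(uint64_t rsp, uint64_t rbp)
{
    frame_model f;
    assert(rsp % 8 == 0);
    f.saved_rbp = rbp;          /* push rbp                             */
    f.rsp = rsp - 8;
    f.rbp = f.rsp;              /* mov rbp,rsp                          */
    f.rsp &= ~(uint64_t)15;     /* and rsp,-16: rounds down 0 or 8 here */
    return f;
}

/* Models: leave (mov rsp,rbp / pop rbp). Returns RSP just before RET. */
uint64_t leave_frame(frame_model f)
{
    return f.rbp + 8;           /* the AND no longer matters */
}
```

Whatever the AND did, RSP comes back from RBP, so the RET finds the return address. Anything addressed relative to RBP stays valid; it is only RSP-relative offsets worked out before the AND that would be off.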

Quote
I'm just having a look at the WinInc includes.. which are supposedly for 32 and 64bit....

i am reasonably certain that you use different versions for 32 or 64 bit

sinsi

I would disagree with you Bogdan about using stdcall for your own functions, it is easier to track RSP if only you change it.
Example, my stack is always para aligned so when my function starts it knows it is out by 8 bytes, a simple push rbx aligns it and makes rbx available.
I must admit that I use ml64 but it makes you think about where the stack is.
Another thing about the alignment is that you can load the registers (rcx/rdx/r8/r9) as you have to and then push them to make the shadow space, maybe
with a superfluous push for odd numbered params.
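
The bookkeeping described here can be sketched in C as plain arithmetic on the stack pointer. This is an illustration under the assumption that every push is 8 bytes and RSP is para aligned before the call; the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RSP as seen by the callee: CALL pushes an 8-byte return address,
   so a para-aligned caller leaves the callee "out by 8". */
uint64_t rsp_at_entry(uint64_t caller_rsp)
{
    assert(caller_rsp % 16 == 0);
    return caller_rsp - 8;
}

/* One 8-byte push, e.g. push rbx, which realigns the stack at entry. */
uint64_t rsp_after_push(uint64_t rsp)
{
    return rsp - 8;
}

/* Number of pushes needed to build the argument area for a call while
   keeping the stack para aligned: at least the 4 shadow slots, plus one
   superfluous push when the count would otherwise be odd. */
size_t pushes_for_call(size_t nargs)
{
    size_t n = nargs < 4 ? 4 : nargs;
    return n + (n & 1);
}
```

For example a five-parameter call needs six pushes, the extra one going on first so the real arguments still sit directly above the return address.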

You can also go up to r15, so rbx rdi rsi rbp r12-r15 are available, 8 registers that Windows APIs don't trash (r10 and r11 are volatile).

The hardest part for me is structures, not knowing C and its love of re-defining everything it is hard to keep up with pointers/dwords and alignment...

johnsa

I honestly cannot find a correct definition in WinInc for 64bit.. unless i'm being really slow today :)


Ok.. so from looking at the disassembly from jwasm for a PROC FRAME with frame:auto set.. we have

;move regs to shadow space
push rbp
mov rbp,rsp
; Here I'd do and rsp,-16
; which means further down things which access the shadow parameter space are all wrong in the disassembly assuming RSP was changed...
mov dword ptr [rsp+20h],0
it then does an add rsp,20h.. i don't see it restoring RSP from anything safe? .. even so the code in between that addresses the shadow space would be wrong after the AND..

to be honest .. This stack alignment thing makes no sense... even with an AND rsp,-16 (which will break references to shadow space).. there is no guarantee that the stack will align correctly ever... imagine:
xyz PROC a:QWORD, b:QWORD, c:QWORD, d:QWORD, e:BYTE, f:REAL4 ....
a,b,c,d will be loaded into rcx,rdx,r8,r9 ... byte E will be pushed to the stack.. and f would have to be a MOVSS [rsp+x],xmm0 ...

I can only imagine that parameters would be either REAL4 or REAL8 using MOVSS or MOVSD which don't require alignment like MOVAPS does... ?

I guess one could implement a NEW proc macro for your own code which reverts to stdcall, as there isn't a way in the assembler by default to have it use fastcall for one and not the other... and this would mean you'd need a new invoke too... :( :(

As for the PDB.. I really cannot work without proper locals/args debugging, I guess Japeth would need to confirm the status on this one...

qWord

you have declared the  _WIN64-equate before including windows.inc?

UNICODE EQU 1
WIN32_LEAN_AND_MEAN EQU 1
_WIN64 EQU 1

include windows.inc

johnsa

I have.. although I don't see that being used in the WINGUI1.ASM example in WinInc for the 64bit gui sample app.. plus I cannot find anything inside windows.inc or its children where it actually changes the definitions depending on that equate.. IE:  HWND = dword / qword.

johnsa

On a side note... knowing the PDB file format shouldn't be necessary as I still link using MS link ??
Link generates the PDB file from the OBJ file, which is all JWASM has to produce.. and that format is open?

donkey

Quote from: BogdanOntanu on January 19, 2012, 11:50:25 AM
Quote
I see code doing an AND RSP,-16 .. but some code doesn't.. does doing this inside a routine not potentially break the stack? RBP is pushed before RSP is copied into it, and popped at the end of the call, so if RSP is changed by the AND in between, could that break?

GoASM :D ?

Not GoAsm, it uses OR SPL,8 to align the stack.

Invoke
push    rsp
push    q[rsp]
or      spl,8
<params>
sub     rsp,20h
call function
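
For anyone puzzling over the first three instructions, here is a small C simulation of them. It only models what is shown above, assuming RSP starts 8-byte aligned; the names are illustrative and it says nothing about how GoAsm handles `<params>`.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t rsp;  /* stack pointer after the three instructions */
    uint64_t top;  /* qword found at [rsp] afterwards             */
} goasm_model;

/* Simulates: push rsp / push q[rsp] / or spl,8 */
goasm_model goasm_align(uint64_t rsp)
{
    goasm_model m;
    uint64_t original = rsp;
    uint64_t slot_hi, slot_lo;   /* the two qwords just written */

    assert(rsp % 8 == 0);

    slot_hi = original;          /* push rsp stores the old RSP value   */
    rsp -= 8;
    slot_lo = slot_hi;           /* push q[rsp] copies it one slot down */
    rsp -= 8;
    rsp |= 8;                    /* or spl,8 sets bit 3 of the low byte */

    m.rsp = rsp;
    m.top = (rsp == original - 8) ? slot_hi : slot_lo;
    return m;
}
```

Either way RSP ends up 8 past a para boundary with the original RSP sitting at [rsp], ready to be restored after the call. An odd number of 8-byte parameter pushes followed by the `sub rsp,20h` would then leave the stack para aligned at the call.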


Parameters are optimized, for example a parameter of zero uses XOR instead of MOV if it is passed in a register. It also takes advantage of the zero extension behaviour when moving smaller numbers into registers to reduce code size.

Stack frame:
mov     [rsp+8],rcx
mov     [rsp+10h],rdx
mov     [rsp+18h],r8
mov     [rsp+20h],r9
mov     rbp,rsp

johnsa

So I've tried using jwlink instead now.. it doesn't seem to be able to link a file created by jwasm when the -Zi switch is on.... this is degenerating quickly.. i may be forced back to 32bit code :)

johnsa

Ok.. so a small update... I stipulated /machine and /subsystem on link .. which wasn't there and it wasn't complaining.
I also installed jwasm 2.07 pre.
I then changed the prototype of my WinMain.. which didn't work at all it seems to be pre-defined, so I renamed it to WinMainX.
And now.. I can see args and locals in VS2010!!!

BUT..

They're not getting assigned the right values... lol

hInstance is null, hPrevInstance is getting what should be hInstance, and CmdLinePtr is getting SW_SHOWDEFAULT....

one step closer...

johnsa

The first problem seems to be...

WinMainX proto :QWORD, :QWORD, :QWORD, :DWORD

000000013F9B10EE  mov         qword ptr [rsp+8],rcx 
000000013F9B10F3  mov         qword ptr [rsp+10h],rdx 
000000013F9B10F8  mov         qword ptr [rsp+18h],r8 
000000013F9B10FD  mov         qword ptr [rsp+20h],r9

From VS2010 disasm...  the last parameter is still being stored as r9 (qword) .. when the parameter passed should be r9d only...
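
One way to sanity-check whether the qword store can hurt a DWORD parameter: a rough C model, assuming a little-endian machine (which x86-64 is), of writing all of r9 into the 8-byte home slot and then reading a DWORD back from the same address. The function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stores a full 64-bit register into an 8-byte home slot, then reads the
   slot back as a DWORD from the same address, the way a 32-bit load from
   [rsp+20h] would. Assumes a little-endian host. */
uint32_t dword_from_home_slot(uint64_t r9)
{
    unsigned char slot[8];
    uint32_t value;

    assert(sizeof slot == sizeof r9);
    memcpy(slot, &r9, sizeof slot);      /* mov qword ptr [rsp+20h],r9   */
    memcpy(&value, slot, sizeof value);  /* dword load from same address */
    return value;
}
```

In this model the DWORD read comes back as the low half of r9 whatever is in the upper half, so the wider store by itself would not explain a wrong nShow value.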

The parameters in order should be :
hInstance, hPrevInstance, CmdLine, nShow

but in the VS locals view the order is
hInstance, CmdLine, hPrevInstance, nShow ....
I don't know if this is ok, but the values are not going into the right locals... even tho the above code seems to be putting them onto the stack correctly.

johnsa

Narrowed it down....

in the disasm view.. the args and locals ONLY update correctly when the push rbp is executed, now they all line up.

If you assemble and the PROC FRAME is used, it seems to do the ordering in such a way that you can see the right values from disasm, but not from source view.. if you leave FRAME off.. then F10 in the proc heading brings it all into line as expected.

So i think this is definitely a bug in jwasm ?

Something about the code ordering when FRAME is specified causes VS debugger to not put the cursor on the procedure heading but rather the first instruction in the proc...

[EDIT]... it gets worse.. during execution of code inside WinMain as other procs are called RSP is adjusted, which totally buggers up the locals/args .... It would seem that VS2010 uses the current RSP to determine the values for locals which is moving around constantly...

johnsa

It seems like the code generated by JWASM doesn't conform and to me just isn't correct.

Based on looking at the 64bit disasm of Visual C++ apps I draw the following conclusions:

1) There should be NO need to AND or align the stack pointer, the prolog should deal with all of this... especially considering you cannot pass xmmN as a param, the assembler would only allow a real4/real8 which would be moved using MOVSS/SD... no need for alignment. In addition if somewhere somehow you needed to mov a full XMM onto the stack I believe there was a new opcode (can't remember what its called now) that will automatically handle between movaps/movups.

2) the generated prologue should not be modifying RBP?

3) RSP should be sub'ed/added to INSIDE the callee not the caller..

The code should look something like:

000000013F7D1040  mov         dword ptr [rsp+20h],r9d 
000000013F7D1045  mov         qword ptr [rsp+18h],r8 
000000013F7D104A  mov         qword ptr [rsp+10h],rdx 
000000013F7D104F  mov         qword ptr [rsp+8],rcx 
000000013F7D1054  push        rdi 
000000013F7D1055  sub         rsp,70h 
000000013F7D1059  mov         rdi,rsp

....

000000013F7D1152  add         rsp,70h 
000000013F7D1156  pop         rdi 
000000013F7D1157  ret 

NB> there is still that bug with jwasm prolog moving the full R9 to stack instead of just R9d

Does someone have a contact for Japeth so we could get resolution on this? Or at least a way to work around it.. perhaps a new set of macros to avoid the built-in prologue..invoke...?