Application Blowing up with _asm to MASM64 Conversion

Started by HooKooDooKu, January 24, 2012, 10:22:54 PM


HooKooDooKu

I use C++ and MFC for the bulk of my programming (Visual Studio 2010).  However, I've got some _asm blocks I'm trying to move to MASM64 as we get our code x64 ready.

I've got an _asm block that uses 'float' values to scale an image.  The replacement MASM64 PROC seems to be working, as the resulting pixels coming from the MASM64 PROC match the pixels from the _asm block.  But the image never displays, and when I terminate the application I get tons of memory leaks.  If I replace the MASM64 PROC with a subroutine that hardcodes the resulting image to a bunch of white pixels, I get a white image and no memory leaks.

I've enclosed the MASM64 PROC below.  Comments have been removed for space (and I don't expect anyone to attempt to debug the whole PROC searching for some minor flaw).  But I was hoping that someone could review the way the PROC is structured and tell me if there is something at a fundamental level that I might be doing wrong.

In the C++ code, the MASM64 PROC is declared with
extern "C" void SuperScale_asm( COLORREF* pSrc, int uSrcWidth, //uSrcHeight : not required
                                COLORREF* pDst, int uResWidth, int uResHeight,
                                LineContribType* YContrib, LineContribType* XContrib,
                                float* RGBArray );

All variables declared outside of the _asm block yet used within the _asm block have simply been passed to the MASM64 PROC.  No functions or memory allocations occur within the code (that is all done before the _asm code executes).

;Content of Fast2PassScale.inc
PDWORD TYPEDEF PTR DWORD
PCOLORREF TYPEDEF PTR DWORD
Cint TYPEDEF SDWORD
Cfloat TYPEDEF REAL4
PCfloat TYPEDEF PTR Cfloat

ContributionType   struct
Left               Cint    ?
Right              Cint    ?
Weights            PCfloat ?
ContributionType   ends
PContributionType TYPEDEF PTR ContributionType

LineContribType    struct
ContribRow         PContributionType ?
WindowSize         Cint              ?
LineLength         Cint              ?
LineContribType    ends
PLineContribType TYPEDEF PTR LineContribType


;Content of Fast2PassScale.asm
option casemap :none
include Fast2PassScale64.inc
.data
.code
public SuperScale_asm
SuperScale_asm PROC p1:PCOLORREF, p2:Cint, p3:PCOLORREF, p4:Cint, uResHeight:Cint, YContrib: PLineContribType, XContrib: PLineContribType, RGBArray: PCfloat
   LOCAL ContribPtrX      :PDWORD   
   LOCAL ContribTempPtr   :PDWORD
   LOCAL ContribPtrY      :PDWORD
   LOCAL YWeightPtr       :PCfloat   
   LOCAL RGBArrPtr        :PCfloat   
   LOCAL BVal             :DWORD   
   LOCAL GVal             :DWORD
   LOCAL RVal             :DWORD
   LOCAL YDelta           :DWORD   
   LOCAL YCounter         :DWORD
   LOCAL XCounter         :DWORD
   LOCAL ColumnCounter    :DWORD
   mov r10, rcx         
   mov r11d, edx         
   shl edx, 2      
   mov r12d, edx   
   xor eax, eax
   mov BVal, eax
   mov GVal, eax
   mov RVal, eax
   mov rax, YContrib   
   mov rax, [rax]         
   sub rax, 12            
   mov ContribPtrY, rax      
   mov YCounter, 0      
   ALIGN 16
   VerticalLoop:
      mov rbx, XContrib   
      mov rbx, [rbx]      
      sub rbx, 12            
      mov ContribPtrX, rbx
      add ContribPtrY, 12
      mov rdi, ContribPtrY
      mov ecx, Cint ptr [rdi]   
      mov esi, Cint ptr [rdi + 4]
      sub esi, ecx
      inc esi
      mov YDelta, esi
      mov eax, r12d      
      imul eax, ecx
      add rax, r10
      sub rax, 4
      mov rsi, rax   
      mov rdi, [rdi + 8]
      mov YWeightPtr, rdi      
      mov eax, r11d   
      mov rcx, RGBArray
      mov ColumnCounter, eax
      mov RGBArrPtr, rcx
      ALIGN 16
      ColumnLoop:         
         mov ecx, YDelta
         mov rdi, YWeightPtr
         add rsi, 4
         mov rdx, rsi      
         fldz            
         fldz
         fldz                           
         ALIGN 16
         YWeightingLoop:      
            fld dword ptr[rdi]            
            movzx eax, byte ptr [rdx]      
            movzx ebx, byte ptr [rdx + 1]
            mov BVal, eax
            movzx eax, byte ptr [rdx + 2]
            mov GVal, ebx
            mov RVal, eax   
            fild BVal
            fmul st(0), st(1)
            fxch
            add edx, r12d            
            fild GVal
            fmul st(0), st(1)
            fxch
            add rdi, 4                  
            fild RVal;
            fmulp st(1), st(0)
            fxch st(2)                     
            faddp st(3), st(0)
            faddp st(3), st(0)      
            faddp st(3), st(0)   
            dec rcx                         
            jnz short YWeightingLoop
         mov rcx, RGBArrPtr
         fstp dword ptr [ecx]      
         fstp dword ptr [ecx + 4]
         fstp dword ptr [ecx + 8]
         add RGBArrPtr, 12
         dec ColumnCounter
      jnz short ColumnLoop         
      mov eax, r9d               
      mov XCounter, eax
      mov rdx, r8      
      mov rax, ContribPtrX
      mov ContribTempPtr, rax
      ALIGN 16
      RowLoop:         
         add ContribTempPtr, 12
         mov rax, ContribTempPtr      
         mov rbx, RGBArray
         mov rdi, rax
         mov rcx, [rax]
         mov rsi, [rdi + 4]
         sub rsi, rcx
         mov rdi, [rdi + 8]
         lea rax, [rcx * 8 + rbx]
         lea rbx, [rax + rcx * 4]
         inc rsi
         mov rax, 4
         fldz
         fldz
         fldz
         ALIGN 16
         XWeightingLoop:
            fld dword ptr[rdi]         
            fld dword ptr [rbx]
            fmul st(0), st(1)         
            fxch
            add rdi, rax
            fld dword ptr [rbx + rax]
            fmul st(0), st(1)         
            add rbx, 12
            fxch
            fld dword ptr [rbx + 2 * rax - 12]
            fmulp st(1), st(0)         
            fxch st(2)
            dec rsi
            faddp st(3), st(0)
            faddp st(3), st(0)      
            faddp st(3), st(0)            
         jnz short XWeightingLoop   
         fistp BVal                
         fistp GVal
         fistp RVal
         mov ebx, RVal
         rol ebx, 8   
         or ebx, GVal            
         rol ebx,8
         or ebx, BVal
         mov dword ptr [rdx], ebx   
         lea rdx, [rdx + 4]
         dec XCounter
      jnz RowLoop
      mov r8, rdx
      inc YCounter
      mov eax, YCounter
      cmp eax, uResHeight
   jb VerticalLoop            
   ret
SuperScale_asm ENDP
end
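One thing that can be cross-checked from the C++ side is whether the structure layout the PROC assumes matches what the 64-bit compiler produces. The sketch below is only an illustration: the two structs are guesses that mirror Fast2PassScale.inc (the real C++ declarations are not shown in the post), and the helper names are made up. With a 4-byte int and an 8-byte pointer, a ContributionType element comes out as 16 bytes in a 64-bit build rather than the 12 bytes of a 32-bit build, so the constant 12 used as a stride in the listing is worth comparing against these values.

```cpp
#include <cassert>
#include <cstddef>

// Assumed C++ mirrors of the structures in Fast2PassScale.inc. The real
// declarations are not shown in the post, so these are guesses.
struct ContributionType {
    int    Left;
    int    Right;
    float* Weights;
};

struct LineContribType {
    ContributionType* ContribRow;
    int               WindowSize;
    int               LineLength;
};

// Size of one ContributionType element, i.e. the stride between array entries.
inline std::size_t ContributionStride() { return sizeof(ContributionType); }

// Byte offset of the Weights pointer inside one element.
inline std::size_t WeightsOffset() { return offsetof(ContributionType, Weights); }

// True when the layout matches what a 64-bit build is expected to produce.
inline bool LayoutLooksLike64Bit() {
    return sizeof(void*) == 8
        && ContributionStride() == 16
        && WeightsOffset() == 8
        && sizeof(LineContribType) == 16;
}
```

Calling LayoutLooksLike64Bit() once at startup in a debug build (for example inside an assert) would flag a mismatch between the assembly constants and the compiler's layout before any pixels are touched.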

tofu-sensei

last time i checked ml64 didn't handle parameters or local variables at all.

HooKooDooKu

Quote from: tofu-sensei on January 24, 2012, 11:19:43 PM
last time i checked ml64 didn't handle parameters or local variables at all.

???

The code will assemble without error.  So if it's not handling LOCAL variables, then it's not telling me about it...  that is, unless your point is that ml64 doesn't modify the stack properly to "handle" them.  Otherwise, when I step through the code with the disassembly window, I see things like

mov         dword ptr [rbp-2Ch],eax

(negative offsets) for LOCAL variables, and

mov         rcx,qword ptr [rbp+48h]

(positive offsets) for Parameters


HooKooDooKu

A little more information:

After executing the 1st primary loop, at some point (I haven't narrowed it down yet) the code seems to be exiting the ASM code earlier than it is supposed to, and when it does, it skips the line in the calling function that executes some cleanup (delete) code.  At least that points to why I've got memory leaks.  It also appears that something modifies the code.  When I reenter the ASM code in the disassembly window, parts of the code have obviously changed (starting to sound like a bug with a pointer not getting processed correctly within all that ASM code).

tofu-sensei

but does it actually set up a stack frame? you're also not saving any nonvolatile registers you're using.

HooKooDooKu

Quote from: tofu-sensei on January 26, 2012, 06:03:41 PM
but does it actually set up a stack frame? you're also not saving any nonvolatile registers you're using.

Well, this is only my 2nd foray into MASM x64.  The 1st time, I didn't seem to need to set up a stack frame because it looked like the compiler was doing that for me.

As I said, the function is declared extern "C" void SuperScale_asm.  Here's what the disassembly window shows about how the C++ compiler calls the function:
1. Four parameters are loaded onto the stack @ [rsp+20h], [rsp+28h], [rsp+30h], and [rsp+38h]. 
2. The other four parameters are loaded into r9d, r8, edx, and rcx.
3. call (SuperScale_asm) (which goes to a jmp SuperScale_asm command)
  Then I assume this is the setting up of the stack frame
4. push rbp
5. mov rbp,rsp
6. add rsp, 0FFFFFFFFFFFFFF98h
7. mov register parameters to [rbp-8], [rbp-0Ch], [rbp-14h] and [rbp-18h]
  Then My asm source code starts executing. 
8. Function ends with leave followed by ret.
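The offsets in the steps above are consistent with the usual Windows x64 layout, in which the caller reserves four 8-byte slots for the register parameters and places parameters five and up right after them. The arithmetic can be sketched as follows; the function names are made up for illustration, and the second formula assumes the prologue is exactly "push rbp / mov rbp, rsp" as listed.

```cpp
#include <cassert>

// Offset of parameter n (1-based) from rsp at the call site, just before the
// call instruction. Slots 1-4 are the space reserved for the register
// parameters, so the first stack parameter (n = 5) lands at rsp+20h.
inline int CallSiteOffset(int n) { return 8 * (n - 1); }

// Offset of parameter n from rbp after "push rbp / mov rbp, rsp". The return
// address and the saved rbp sit below the first slot and add 16 bytes.
inline int FrameOffset(int n) { return 16 + 8 * (n - 1); }
```

For the eighth parameter (RGBArray) this gives rsp+38h at the call site and rbp+48h inside the PROC, which matches the mov rcx,qword ptr [rbp+48h] seen earlier in the disassembly.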

What else should I be doing?


Additional update.  I found that part of my problem was where I'm using #ifdef WIN64 to determine whether SuperScale_asm or the old _asm block gets called.  I had some destructors duplicated in the WIN64 code.  Once I got the double destructor call worked out, the Debug version of the logic ran just fine.  But when I try it in Release mode, things still go wrong.  I tried to put in some debug code by inserting message boxes.  The subroutine seems to run fine, but when I try to write the results to a file, my message boxes quit appearing the first time I attempt to access the data (and C++ try/catch blocks around the subroutine don't catch anything).
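One way to keep cleanup from being duplicated across the two branches is to let a single owner object release the buffers, so the #ifdef only selects which routine runs. The following is a rough sketch, not the actual application code: ScaleBuffers, ScaleWithMasm64, ScaleWithInlineAsm and RunScale are all hypothetical stand-ins. It tests _WIN64, which the Microsoft compiler predefines for 64-bit targets, whereas WIN64 has to be defined by the project.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the scaler's temporary buffers. liveCount tracks
// how many instances exist; a double destruction would drive it negative.
struct ScaleBuffers {
    static int liveCount;
    std::vector<float> rgb;
    explicit ScaleBuffers(std::size_t n) : rgb(n) { ++liveCount; }
    ~ScaleBuffers() { --liveCount; }
};
int ScaleBuffers::liveCount = 0;

// Hypothetical stand-ins for the two code paths (MASM64 PROC vs. _asm block).
void ScaleWithMasm64(ScaleBuffers& b)    { if (!b.rgb.empty()) b.rgb[0] = 1.0f; }
void ScaleWithInlineAsm(ScaleBuffers& b) { if (!b.rgb.empty()) b.rgb[0] = 1.0f; }

// Only the call differs between builds; the unique_ptr frees the buffers
// exactly once when it goes out of scope, whichever branch was compiled.
int RunScale(std::size_t columns) {
    std::unique_ptr<ScaleBuffers> buffers(new ScaleBuffers(columns * 3));
#ifdef _WIN64
    ScaleWithMasm64(*buffers);
#else
    ScaleWithInlineAsm(*buffers);
#endif
    return ScaleBuffers::liveCount;  // number of live buffer sets right now
}
```

After RunScale returns, ScaleBuffers::liveCount is back to zero, which is an easy thing to assert on in a debug build.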


HooKooDooKu

Quote from: tofu-sensei on January 30, 2012, 05:55:48 PM
Quote from: HooKooDooKu on January 30, 2012, 04:11:35 PM
What else should I be doing?
save rbx, rsi, rdi, r12

That seems to be working.  Thx
Not sure why I hadn't run into this being an issue before.
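For reference, the Windows x64 calling convention treats rbx, rbp, rdi, rsi, rsp and r12 through r15 as nonvolatile, so a callee has to preserve any of them it changes, while rax, rcx, rdx and r8 through r11 may be clobbered freely. The small checker below illustrates that rule; the function names are made up, and it only covers the general-purpose registers (xmm6 through xmm15 are also nonvolatile).

```cpp
#include <cassert>
#include <string>
#include <vector>

// True if the Windows x64 convention requires the callee to preserve reg.
// Only 64-bit general-purpose register names are handled in this sketch.
inline bool IsNonvolatile(const std::string& reg) {
    static const char* const kNonvolatile[] = {
        "rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"};
    for (const char* name : kNonvolatile) {
        if (reg == name) return true;
    }
    return false;
}

// Filters the registers a procedure modifies down to the ones it must
// save and restore, keeping the input order.
inline std::vector<std::string> RegistersToSave(
        const std::vector<std::string>& used) {
    std::vector<std::string> result;
    for (const std::string& reg : used) {
        if (IsNonvolatile(reg)) result.push_back(reg);
    }
    return result;
}
```

Fed the registers that SuperScale_asm touches (rax, rbx, rcx, rdx, rsi, rdi, r8 through r12), this returns rbx, rsi, rdi and r12, the same four named in the reply quoted above.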

What about the floating point registers?  Do I need to do anything with them?

I've never worked with the floating point registers before.  I did some research on them and found it strange that in the 32-bit code, when the 32-bit equivalent function is called, the TAGS register shows FFFF, indicating the floating point stack is empty.  But in 64-bit, TAGS shows 0000 when the 64-bit function is called.  The floating point tutorial (referenced in the links at the top right of this MASM web page) indicates that this means the floating point registers are loaded with valid non-zero numbers (but ST0 is zero).
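The full x87 tag word holds two bits per physical register: 00 for a valid non-zero value, 01 for zero, 10 for a special value (NaN, infinity, denormal) and 11 for empty. A minimal decoder, with names invented for this sketch, shows how FFFF and 0000 read under that encoding. This assumes the debugger is displaying the full 16-bit tag word. The FXSAVE image uses an abridged tag with one bit per register, where a set bit means "not empty", and under that form a value of zero would mean an empty stack. Which form the 64-bit debugger shows has not been verified here.

```cpp
#include <cassert>
#include <cstdint>

// The four states encoded by each 2-bit field of the full x87 tag word.
enum class X87Tag { Valid = 0, Zero = 1, Special = 2, Empty = 3 };

// Tag of physical register physReg (0..7): bits 2*physReg+1 .. 2*physReg.
inline X87Tag TagOf(std::uint16_t tagWord, int physReg) {
    return static_cast<X87Tag>((tagWord >> (2 * physReg)) & 0x3);
}

// Number of physical registers marked empty in a full tag word.
inline int CountEmpty(std::uint16_t tagWord) {
    int count = 0;
    for (int i = 0; i < 8; ++i) {
        if (TagOf(tagWord, i) == X87Tag::Empty) ++count;
    }
    return count;
}
```

Under the full encoding, CountEmpty(0xFFFF) reports all eight registers empty and CountEmpty(0x0000) reports none empty.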