Stack Probing PROLOGUE macro

Started by chep, June 22, 2005, 06:58:50 AM

chep

Here is a macro for use with OPTION PROLOGUE that adds stack probing, which is needed when a procedure's LOCALs total 4 KB or more.

<EDITED>

;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Allows a procedure to safely use LOCAL variables with a total size of 4kb or more,
; using an unrolled stack probing method by default.
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Usage:
;
;   OPTION PROLOGUE:STACKPROBE
;   MyProcedure PROC ; ...
;     ; ...
;   MyProcedure ENDP
;   OPTION PROLOGUE:PROLOGUEDEF
;
; The ROLLED macro argument generates a loop rather than the default unrolled code:
;
;   MyProcedure PROC <ROLLED> ; ...
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Notes:
;   - When the total size of the LOCAL variables is less than 4kb, the code generated is
;     identical to PROLOGUEDEF, so there is no drawback using this macro
;   - See "OPTION PROLOGUE" and "PROC" topics in MASM32.HLP for the macro specifications
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
;
; Limitations compared to PROLOGUEDEF:
;   - Stack probing is relevant only for Windows, ie FLAT model, so it won't accept other models
;   - Due to the FLAT model restriction, LOADDS is not supported
;   - FORCEFRAME argument doesn't generate a correct epilogue when no LOCAL variables are defined
;     So it is not supported for now :(
;
;:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

I finally gave up on making this macro fully compatible with MASM's PROLOGUEDEF.
Stack probing is needed only on Windows (i.e. the FLAT model), so the LOADDS argument, which concerns only 16-bit code, is not supported.

Also, when the FORCEFRAME argument is used with EPILOGUEDEF, no epilogue (i.e. the leave instruction) is generated, although the pop instructions corresponding to the USES directive are. I really can't figure out where this problem comes from, so I also gave up trying to implement FORCEFRAME...


Enjoy!

[attachment deleted by admin]

chep

At last, a usable version.

Unfortunately I don't have the time to dig into the FORCEFRAME bug for the moment, nor to add the looped probing option (only unrolled probing for now...).

Anyway, here it is... (The first post has been updated.)

Petroizki

You still have to make the stack frame, when you have no locals but you do have proc arguments.

So something like:

  ;; Set up stack frame
  IF localbytes GT 0
    ...
  ELSEIF argbytes GT 0
    push ebp
    mov ebp, esp
  ENDIF


EDIT: Also, wouldn't it be better to use 'mov dword ptr [esp], eax' instead of 'mov byte ptr [esp], 0'?
It's one byte smaller, and faster.
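
The size claim can be checked from the instruction encodings. A minimal assemble-only sketch, assuming 32-bit code; the byte values in the comments are the standard IA-32 encodings, not output captured from chep's macro:

```masm
; Assemble-only: build with a listing (ml /c /Fl) to compare sizes.
; Not meant to be executed. Only the memory access matters for probing,
; not the value that gets stored.
.386
.model flat, stdcall
.code
    mov byte ptr [esp], 0       ; C6 04 24 00  (4 bytes: opcode, ModRM, SIB, imm8)
    mov dword ptr [esp], eax    ; 89 04 24     (3 bytes: opcode, ModRM, SIB)
end
```

The register form needs no immediate operand, which is where the saved byte comes from.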

chep

Quote from: Petroizki on June 22, 2005, 04:45:05 PM
You still have to make the stack frame, when you have no locals but you do have proc arguments.
You're perfectly right, I forgot that!
And your proposed fix is perfect.

Quote from: Petroizki on June 22, 2005, 04:45:05 PM
Also, wouldn't it be better to use 'mov dword ptr [esp], eax' instead of 'mov byte ptr [esp], 0'?
It's one byte smaller, and faster.
You're perfectly right also!


Source code updated...
Thanks for pointing this out!

Petroizki

I added stack probing to my own prologue macros; they generate code like this at the beginning of a procedure:

push ebp
mov ebp, esp
sub esp, 3A98h ; reserve stack space
mov dword ptr [ebp-1000h], eax ; probe first page
mov dword ptr [ebp-2000h], eax ; probe second page
mov dword ptr [ebp-3000h], eax ; probe the last page
...

It seems to work, and at least it produces less code.

chep

You're right (again).

A few thoughts however, correct me if I'm wrong:

- I would adjust esp *after* probing the pages, just in case "something else" would use the stack before the last page is probed. Well, that's how VCToolkit's probing function works anyway. (Ok, I should have looked at it before writing my macro, but well...)

- Shouldn't add esp, (-size) be slightly faster than sub ? (I suppose it is, as MASM as well as VC use it rather than sub. I didn't make any tests though)

- In the example you give, I think [ebp-3000h] is not the last page. The last probed page should be [ebp-3A98h] (or [ebp-4000h] for instance, it doesn't change anything) :
If the final esp lands on a page boundary (ie. a multiple of 1000h), it will land on the last DWORD of the guard page. But when the next push is made, it will try to access uncommitted memory, and the app will be killed. I'm not sure about this as I haven't managed to produce the "crashing case", but that's how I understand the VCToolkit probing function :


msvcrt_probe  proc ; argument : eax = localbytes

  cmp     eax, 1000h
  jnb     probe_stack
  neg     eax            ; this part is for localbytes < 4k so it's not relevant in our case
  add     eax, esp
  add     eax, 4
  test    [eax], eax
  xchg    eax, esp
  mov     eax, [eax]
  push    eax
  ret

probe_stack:             ; the interesting part...
  push    ecx
  lea     ecx, [esp+8]

probepages:
  sub     ecx, 1000h
  sub     eax, 1000h
  test    [ecx], eax
  cmp     eax, 1000h
  jnb     probepages

probelastpage:
  sub     ecx, eax
  mov     eax, esp
  test    [ecx], eax
  mov     esp, ecx
  mov     ecx, [eax]
  mov     eax, [eax+4]
  push    eax
  ret

msvcrt_probe  endp

; ... in main() :
  push    ebp
  mov     ebp, esp
  mov     eax, 2328h
  call    msvcrt_probe
  ; ...


This has been generated from the following C code (statically linked) :


int main()
{
  char test[9000];
  // ...
}



Well, anyway I have updated the code in the first post.
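
The three points above (probe first, adjust esp with add, and probe the final offset as well) can be sketched as a small macro. This is only an illustration under stated assumptions: PROBE_LOCALS is a hypothetical name, it is not the STACKPROBE macro from the first post, and it expects ebp to already hold the frame pointer.

```masm
; Hypothetical helper, for illustration only.
; Emits one probe per 1000h page, then one at the final offset,
; and only then moves esp.
PROBE_LOCALS MACRO localbytes
    LOCAL offs
    offs = 1000h
    WHILE offs LT (localbytes)
        mov dword ptr [ebp - offs], eax
        offs = offs + 1000h
    ENDM
    mov dword ptr [ebp - (localbytes)], eax   ;; the "last page" probe
    add esp, -(localbytes)
ENDM

; PROBE_LOCALS 3A98h expands to:
;   mov dword ptr [ebp-1000h], eax
;   mov dword ptr [ebp-2000h], eax
;   mov dword ptr [ebp-3000h], eax
;   mov dword ptr [ebp-3A98h], eax
;   add esp, -3A98h
```

When localbytes is an exact multiple of 1000h, the final probe lands on the page boundary itself, which is the case described above.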

Petroizki

- What would you mean by "something else", a debugger? The probing could be easily done before adjusting esp, but you would have to use instruction that would not change any values in the negative offsets of esp (test, cmp, ...), this would probably make it slightly slower.

- I don't think add and sub have any speed differences, at least according to the optimization guides I have. They are basically the same instruction, on Pentium that is.

- I guess you're right, but I couldn't get it to GPF on Windows XP. Actually you can remove the last two probes, and it still works. I will do some testing on 9x later.
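
On size, at least, the add and sub forms can be compared from their encodings. A minimal assemble-only sketch, assuming 32-bit code and an assembler that picks the shortest immediate form; the bytes in the comments are the standard IA-32 encodings and say nothing about speed:

```masm
; Assemble-only: build with a listing (ml /c /Fl) to compare sizes.
.386
.model flat, stdcall
.code
    sub esp, 3A98h      ; 81 EC 98 3A 00 00   (6 bytes)
    add esp, -3A98h     ; 81 C4 68 C5 FF FF   (6 bytes)

    ; For a positive size, the only case where the sizes differ is 128:
    ; imm8 is sign-extended and covers -128..127.
    sub esp, 128        ; 81 EC 80 00 00 00   (6 bytes, +128 does not fit imm8)
    add esp, -128       ; 83 C4 80            (3 bytes, -128 fits imm8)
end
```

For the frame sizes discussed in this thread (4 KB and up), both forms need a 32-bit immediate and are the same length.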

chep

Quote from: Petroizki on June 24, 2005, 06:34:08 AM
- What would you mean by "something else", a debugger?
Indeed. I guess a user-mode debugger is the only thing that could tamper with the program's stack.

Quote from: Petroizki on June 24, 2005, 06:34:08 AM
but you would have to use instruction that would not change any values in the negative offsets of esp (test, cmp, ...)
I don't understand why?

In fact I simply meant swapping the probing mov instructions and the esp adjustment :

mov DWORD PTR [ebp-1000h], eax
...
mov DWORD PTR [ebp-4000h], eax
sub esp, 4000h

That seems to work fine.

Petroizki

It may not be safe to touch memory outside the stack (below esp); see http://board.win32asmcommunity.net/index.php?topic=20128.0.

At least debugging with Whidbey may cause a problem..

chep

Ok, I understand now.

But in our case I guess we don't mind if the stack is overwritten by a debugger before esp is adjusted, as we are writing dummy values just to make sure each page is probed.
On the contrary it's more likely a problem could arise if we adjust esp before probing the stack, as a debugger could then hit a non probed, unguarded page, thus leading to a GPF.

Am I wrong?

Webring

You are a genius, thank you so much for this.

farrier

This was from the post I originally pointed you to:

http://board.win32asmcommunity.net/index.php?topic=19497.15

Code from KetilO:
MainDlgProc proc hWin:HWND,uMsg:UINT,wParam:WPARAM,lParam:LPARAM
  LOCAL buffer[4096]:byte
  LOCAL buffer2[256]:byte
  LOCAL buffer3[256]:byte
  LOCAL printout[4096]:byte
   
  LOCAL pos:dword
  LOCAL hdi:HD_ITEM

  ;Touching the stack frame
  mov eax,ebp
  .while eax>esp
     mov dword ptr [eax],0
     sub eax,4
  .endw
  push edx
  push esi
  push edi


I found that if you replace
sub eax, 4
with
sub eax, 4096

it works just as well, and faster! We only need to touch each page, not each DWORD.

My point is that the touching took place after all the stack adjustments were made.

The first problem in the above post was that, when the USES directive was used, pushing the "used" registers caused the guard page errors.
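
For reference, here is a minimal sketch of the page-stride loop inside a complete procedure. TouchDemo is a hypothetical name. One deliberate change from KetilO's code, which is an editorial assumption and not part of the original: the probe uses test instead of mov, because the first access is at [ebp], where the caller's saved ebp is stored, and a write there would overwrite it. Any access, read or write, is enough to trigger the guard page.

```masm
.386
.model flat, stdcall
option casemap:none

.code

; TouchDemo is a hypothetical procedure name, for illustration only.
TouchDemo PROC
    LOCAL buffer[8192]:BYTE     ; the default prologue has already moved esp down

    ; Touch one DWORD per 4096-byte page, from ebp down towards esp.
    mov eax, ebp
    .while eax > esp            ; unsigned compare on registers
        test dword ptr [eax], eax   ; read-only probe, nothing is overwritten
        sub eax, 4096
    .endw

    lea eax, buffer             ; the locals can be used from here on
    ret
TouchDemo ENDP

end
```

With USES, the register pushes are emitted by the prologue itself, before any code in the procedure body runs, so a loop placed in the body like this one comes too late for them.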


hutch--

farrier,

Compliments, that is a good technique.

Mirno

Here is a first hack at a "probelogue" macro.


probelogue MACRO szProcName, flags, cbParams, cbLocals, rgRegs, rgUserParams
  push ebp
  mov  ebp, esp
  sub  esp, cbLocals

  mov eax, ebp
  .while eax > esp
     test dword ptr [eax], eax ; read-only probe, so the saved ebp at [ebp] is not overwritten
     sub eax, 4096
  .endw

  FOR usesreg, rgRegs
    push usesreg
  ENDM

EXITM <0>
ENDM


It should be usable with the "OPTION PROLOGUE:probelogue" command.
I've not tested it though, and it will not deal with all the fiddly bits that the default prologue does (near, far, calling convention, and so on).

Mirno

Mirno

New and improved (it produces better code in some cases):

probelogue MACRO szProcName, flags, cbParams, cbLocals, rgRegs, rgUserParams
  LOCAL counter
  LOCAL alignedLocals
  LOCAL whileBias

  alignedLocals = (cbLocals + 3) AND NOT(3)

  whileBias = 2
  IFNB <rgUserParams>
    whileBias = rgUserParams
  ENDIF

  push ebp
  mov  ebp, esp

  IF alignedLocals NE 0
    sub  esp, alignedLocals
  ENDIF

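  IF alignedLocals GT (4096 * whileBias)
    ;; Rolled probe: ebp walks down one page per iteration.
    ;; test only reads, so the saved ebp at [ebp] is not overwritten.
    .while ebp > esp
       test DWORD PTR [ebp], eax
       sub ebp, 4096
    .endw
    ;; The loop runs ceil(alignedLocals / 4096) times, so round up when restoring ebp.
    add ebp, (alignedLocals + 4096 - 1) AND NOT(4096 - 1)
  ELSEIF alignedLocals GE 4096
    ;; Unrolled probe: locals live below ebp, so the offsets are negative.
    counter = 4096

    WHILE alignedLocals GE counter
      mov DWORD PTR [ebp - counter], 0
      counter = counter + 4096
    ENDM
  ENDIF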

  FOR usesreg, rgRegs
    push usesreg
  ENDM


EXITM <0>
ENDM


Note that the while bias comes from the user parameters; the default value of 2 was chosen because it gives the smallest code.
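
The rounding and the branch selection in the macro can be checked with assembly-time arithmetic alone. A minimal sketch, assuming a hypothetical LOCAL total of 4098 bytes; it emits no code, and .ERRNZ / .ERRE stop the build if a check fails:

```masm
; Assembly-time arithmetic only; emits no code.
; .ERRNZ raises an error if its expression is non-zero,
; .ERRE raises an error if its expression is zero (false).
cbLocals      = 4098                            ; hypothetical LOCAL total
alignedLocals = (cbLocals + 3) AND NOT(3)       ; round up to a DWORD multiple
.ERRNZ alignedLocals - 4100

whileBias = 2                                   ; the macro's default
; 4100 is not greater than 4096 * 2, so the rolled .while branch is skipped,
; and 4100 GE 4096, so the unrolled WHILE branch is taken.
.ERRNZ (alignedLocals GT (4096 * whileBias))
.ERRE  (alignedLocals GE 4096)
```

Raising whileBias moves the switch-over point: with a bias of 8, the unrolled form is used for up to 32768 bytes of locals.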


.code
start:
option PROLOGUE:probelogue
blah PROC <8>, a:DWORD, b:DWORD
  LOCAL zyx[4096]:BYTE
  ret
blah ENDP
end start


The "<8>" overrides the default whileBias, allowing you to generate unrolled stack probes for locals greater than 8192 bytes.
Assembling with the default prologue still works, but produces a warning about an unknown prologue user argument.

If someone has code they can test this on I'd be grateful. It would also help if you could test with the weird and wonderful combinations of near, far, public, private, uses, calling convention, and so on, as I've not had the chance (or the knowledge of how they should affect the code the default prologue generates).

This is all untested; I've only been looking at the listing generated by MASM, so there will almost certainly be errors.

Mirno