Avoiding stack frames (again!)

Started by Damos, July 03, 2009, 11:54:41 AM


Damos

We're always talking on here about avoiding stack frames as an optimization, and I wondered what you guys think of this:

Say we have a routine that needs arguments, as in:

PrintSum proc alpha:DWORD,beta:DWORD
...
PrintSum endp

alpha and beta are then addressed as offsets from ebp.

But what if we had a macro such that:

MyProc PrintSum,alpha:DWORD,beta:DWORD

would be interpreted as:

.data?
PrintSum_alpha  dd ?
PrintSum_beta   dd ?
alpha equ PrintSum_alpha
beta equ PrintSum_beta
.code
PrintSum:
...
Then alpha and beta are undefined at the end of the routine to release the names for other routines.
So now there is no need to set up a stack frame; instead we use uninitialized memory to pass our params to a routine.
We could also have a new invoke macro such that:

MyInvoke PrintSum,2,4
is interpreted as:
push 2
pop PrintSum_alpha
push 4
pop PrintSum_beta
I know this needs tweaking here and there, but what do you think in principle?
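For readers who think more easily in C, here is a minimal sketch of the proposal. It is only a model, not MASM: the static variables stand in for the .data? slots, and MY_INVOKE_PRINTSUM is a hypothetical stand-in for the MyInvoke macro.

```c
#include <stdio.h>

/* Fixed parameter slots, standing in for the .data? variables. */
static int PrintSum_alpha;   /* models "PrintSum_alpha dd ?" */
static int PrintSum_beta;    /* models "PrintSum_beta  dd ?" */

/* The routine takes no stack arguments; it reads its fixed slots. */
int PrintSum(void)
{
    int sum = PrintSum_alpha + PrintSum_beta;
    printf("%d\n", sum);
    return sum;
}

/* Hypothetical stand-in for MyInvoke: store the arguments, then call. */
#define MY_INVOKE_PRINTSUM(a, b) \
    (PrintSum_alpha = (a), PrintSum_beta = (b), PrintSum())
```

In this model MY_INVOKE_PRINTSUM(2, 4) stores 2 and 4 into the slots and then calls PrintSum, which prints and returns 6.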

dedndave

it sounds good, but i dunno how to "undefine" data labels - lol

this code assumes the labels are already declared

        push    2
        pop     PrintSum_alpha
        push    4
        pop     PrintSum_beta

for that matter, the values could just as well be permanently declared
but, i think the stack frame method turns out to be faster

i dunno if this is faster or not...

        mov dword ptr PrintSum_alpha,2
        mov dword ptr PrintSum_beta,4

on an 8088, it would be faster because there are fewer memory references, but that rule doesn't apply for pentiums, i guess
(well, for word-sized values, at least)

personally, i like to pass parms in register, provided there are only a couple (as in most cases)
but, i am a dinosaur programmer - i have dinosaur thoughts and i write dinosaur code - lol
these guys are used to procs that get INVOKEd
they all like to be C-compatible and, let's face it, windows seems to have been designed around C
as for me, i dislike C and that is why i write in assembler - lol

i was playing with another method that has some potential
you may find it interesting
http://www.masm32.com/board/index.php?topic=11671.msg87985#msg87985
as you can see, no one seems to be interested in my dinosaur ideas - lol

dedndave

on a similar note, one of the other guys in here had a good idea (i forget who it was and can't locate the thread)
instead of using the EBP register to reference locals, use ESP directly and design the
assembler so that it keeps track of the PUSHes and POPs to calculate the offsets
this frees up the EBP register and is a little faster and smaller than the regular stack frame

AProc   PROC

        sub     esp,8            ;2 local dword variables
        mov dword ptr [esp+4],1  ;first local var
        mov dword ptr [esp],2    ;second local var
.
.
.
        push    eax              ;assembler maintains PUSH count
.
.
.
        mov     edx,[esp+8]      ;first local var new offset
        mov     ecx,[esp+4]      ;second local var new offset
.
.
.
        pop     eax
.
.
.
        add     esp,8
        ret

AProc   ENDP
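The bookkeeping such an assembler would need is small. Here is a minimal C sketch of it, assuming dword-sized pushes only; the function name local_disp is made up for illustration and nothing here touches a real stack.

```c
enum { DWORD_BYTES = 4 };

/* Displacement to emit for a local variable.
   base_offset:        the local's offset from esp right after "sub esp, N"
   outstanding_pushes: PUSHes not yet matched by a POP at this point */
int local_disp(int base_offset, int outstanding_pushes)
{
    return base_offset + DWORD_BYTES * outstanding_pushes;
}
```

With the numbers from AProc: the first local starts at [esp+4] and the second at [esp]. After the single push eax, local_disp(4, 1) gives 8 and local_disp(0, 1) gives 4, which are the offsets used in the listing.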


hutch--

Damos,

Using global memory in the .DATA or .DATA? section is an old trick from the days when stack space was very limited, but there is no reason not to use it today if it does what you want. Stack based local variables have the advantage that you can call another proc from the current one and the values in the first will be the same when the called proc returns. Global variables give no such guarantee, which limits nesting of procedures. In most instances this would not matter and you could handle it with a few different sets of variables, but you could not perform recursion by this method.
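The recursion problem can be seen in a small C model. This is a sketch under the assumption that the argument lives in one fixed slot per routine, as in the macro idea; sum_n and the function names are hypothetical.

```c
/* Sum 1..n with the argument on the stack: every call has its own n. */
int sum_stack(int n)
{
    if (n == 0)
        return 0;
    return sum_stack(n - 1) + n;
}

/* The same routine with its argument in one fixed slot. */
static int sum_n;   /* models a .data? parameter slot */

int sum_static(void)
{
    int result;
    if (sum_n == 0)
        return 0;
    sum_n = sum_n - 1;        /* store the argument for the nested call */
    result = sum_static();    /* nested calls count the slot down to 0 */
    return result + sum_n;    /* meant "+ n", but the slot now holds 0 */
}

/* Caller side: store the argument, then call. */
int sum_static_invoke(int n)
{
    sum_n = n;
    return sum_static();
}
```

sum_stack(4) returns 10, while sum_static_invoke(4) returns 0, because each nested call overwrites the one slot that its caller still needs.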

ramguru

Sometimes global variables are good sometimes they're bad.
Let's say you want to create a custom control &
use global variables as temporary variables for better speed.
That would be very unwise. 'Cuz if you are to support
multiple instances of the control on the same window,
you gotta take into account concurrent reads/writes...
So stack-based variables suit better here.

jj2007

Quote from: dedndave on July 03, 2009, 01:32:44 PM
on a similar note, one of the other guys in here had a good idea (i forget who it was and can't locate the thread)
instead of using the EBP register to reference locals, use ESP directly

On a P4, using ESP directly is about 5 cycles faster per call, but the code becomes a bit longer with every local variable access:
1891    cycles for 100*call stack_frame_on
1417    cycles for 100*call stack_frame_OFF

Code sizes:
Frame on:       42
Frame off:      46


Test yourself...

.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm ; get them from the Masm32 Laboratory: http://www.masm32.com/board/index.php?topic=770.0
LOOP_COUNT = 1000000

.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_on
ENDM
counter_end
print str$(eax), 9, "cycles for 100*call stack_frame_on", 13, 10

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_OFF
ENDM
counter_end
print str$(eax), 9, "cycles for 100*stack_frame_OFF", 13, 10, 10, "Code sizes:", 13, 10, "Frame on: ", 9
mov eax, stack_frame_on_END
sub eax, stack_frame_on
print str$(eax), 13, 10, "Frame off: ", 9
mov eax, stack_frame_OFF_END
sub eax, stack_frame_OFF
print str$(eax)

inkey chr$(13, 10, "--- ok ---", 13)
exit

stack_frame_on proc
LOCAL v1, v2, v3, v4, v5, v6
  mov v1, eax
  mov v2, eax
  mov v3, 1234h
  mov v4, 5678h
  mov v5, 5555h
  mov v6, 6666h
  ret
stack_frame_on endp
stack_frame_on_END:


stack_frame_OFF proc
; LOCAL v1, v2, v3, v4, v5, v6
  add esp, -4*6 ; reserve 6 dwords (same as sub esp, 24)
  mov [esp], eax ; v1
  mov [esp+4], eax ; v2
  mov dword ptr [esp+8], 1234h ; v3
  mov dword ptr [esp+12], 5678h ; v4
  mov dword ptr [esp+16], 5555h ; v5
  mov dword ptr [esp+20], 6666h ; v6
  sub esp, -4*6 ; release them (same as add esp, 24)
  ret
stack_frame_OFF endp
stack_frame_OFF_END:

end start

dedndave

i think the size difference is because the instruction encoding favors EBP as a base

[ebp+4]  ModRM byte + byte offset
[esp+4]  ModRM byte + SIB byte + byte offset (one byte longer)

of course, the LEAVE saves a couple bytes for you

quite a big difference in speed, don't you think?
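The two reported code sizes (42 and 46 bytes) can be reproduced by counting encoding bytes. The C sketch below is a rough tally, not an assembler. It assumes all displacements fit in a signed byte, and that MASM's default prologue is push ebp / mov ebp,esp / add esp,-24 with ret expanding to leave / ret. All function names are made up for illustration.

```c
enum Base { BASE_EBP, BASE_ESP };

/* Bytes used by a [base+disp8] memory operand. */
int mem_operand_bytes(enum Base base, int disp)
{
    int bytes = 1;                      /* ModRM byte */
    if (base == BASE_ESP)
        bytes += 1;                     /* ESP as base needs a SIB byte */
    if (disp != 0 || base == BASE_EBP)
        bytes += 1;                     /* disp8; [ebp] has no disp-less form */
    return bytes;
}

/* mov [mem], eax : opcode 89 /r */
int store_reg(enum Base b, int disp)   { return 1 + mem_operand_bytes(b, disp); }

/* mov dword ptr [mem], imm32 : opcode C7 /0 + 4 immediate bytes */
int store_imm32(enum Base b, int disp) { return 1 + mem_operand_bytes(b, disp) + 4; }

int frame_on_size(void)
{
    int n = 1 + 2 + 3;                  /* push ebp; mov ebp,esp; add esp,-24 */
    n += store_reg(BASE_EBP, -4) + store_reg(BASE_EBP, -8);
    n += store_imm32(BASE_EBP, -12) + store_imm32(BASE_EBP, -16);
    n += store_imm32(BASE_EBP, -20) + store_imm32(BASE_EBP, -24);
    return n + 1 + 1;                   /* leave; ret */
}

int frame_off_size(void)
{
    int n = 3;                          /* add esp,-24 */
    n += store_reg(BASE_ESP, 0) + store_reg(BASE_ESP, 4);
    n += store_imm32(BASE_ESP, 8) + store_imm32(BASE_ESP, 12);
    n += store_imm32(BASE_ESP, 16) + store_imm32(BASE_ESP, 20);
    return n + 3 + 1;                   /* sub esp,-24; ret */
}
```

frame_on_size() comes to 42 and frame_off_size() to 46. The ESP version pays one extra byte on five of its six stores and saves one byte of prologue and epilogue.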

jj2007

Quote from: dedndave on July 03, 2009, 04:48:34 PM
quite a big difference in speed, don't you think?

About 5 cycles per call on a P4; we'll see on others. But the code becomes very difficult to read and maintain, unless you resort to a pair of macros and leave esp alone inside the proc:

MyLocal MACRO args:VARARG
LOCAL tmp$
  .if 1
MyLocEsp = 0
FOR arg, <args>
  tmp$ CATSTR <arg>, < equ !<dword ptr [esp+>, %MyLocEsp, <]!>>
tmp$
MyLocEsp = MyLocEsp + 4
ENDM
tmp$ CATSTR <add esp, ->, %MyLocEsp
tmp$
ENDM

MyRet MACRO
tmp$ CATSTR <sub esp, ->, %MyLocEsp
tmp$
ret
  .endif
ENDM


Usage (dwords only, names can be used only once because they are global):


stack_frame_OFF proc
  MyLocal LocV1, LocV2, LocV3, LocV4, LocV5, LocV6
  mov LocV1, eax
  mov LocV2, eax
  mov LocV3, 1234h
  mov LocV4, 5678h
  mov LocV5, 5555h
  mov LocV6, 6666h
  MyRet
stack_frame_OFF endp

dedndave

i see over 400 cycles diff - am i lookin in the wrong spot Jochen ? - lol
Quote
1891    cycles for 100*call stack_frame_on
1417    cycles for 100*call stack_frame_OFF

Code sizes:
Frame on:       42
Frame off:      46

jj2007

Quote from: dedndave on July 03, 2009, 05:26:21 PM
i see over 400 cycles diff - am i lookin in the wrong spot Jochen ? - lol

Divide by 100: each figure is the total for 100 calls.

Celeron M:
980     cycles for 100*call stack_frame_on
871     cycles for 100*stack_frame_OFF

i.e. about 1 (one) cycle faster per call
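The per-call figures follow from simple arithmetic on the printed totals. A small C sketch, with a hypothetical helper name:

```c
/* Cycles saved per call, given the two totals printed by the test
   and the number of calls each total covers (100 in this thread). */
double cycles_saved_per_call(int frame_on_total, int frame_off_total, int calls)
{
    return (double)(frame_on_total - frame_off_total) / calls;
}
```

For the P4 run, cycles_saved_per_call(1891, 1417, 100) gives 4.74, and for the Celeron M run, cycles_saved_per_call(980, 871, 100) gives 1.09. Both totals include the same call and ret overhead, so only the difference says anything about the frame.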

jj2007

Just for fun, here is a more complex example. On a Celeron M, the proc without a frame is about 0.7 cycles faster per call, a bit longer and definitely trickier - see the print str$(LocV2).


.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 200000

.code
start:
print "Test for correctness:", 13, 10
mov eax, 123456/2 ; magic number
call stack_frame_OFF

mov eax, 123456/2 ; magic number
call stack_frame_on

print chr$(13, 10, "Timings:", 13, 10)

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_on
ENDM
counter_end
print str$(eax), 9, "cycles for 100*call stack_frame_on", 13, 10

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, 1234578h
REPEAT 100
call stack_frame_OFF
ENDM
counter_end
print str$(eax), 9, "cycles for 100*stack_frame_OFF", 13, 10, 10, "Code sizes:", 13, 10, "Frame on: ", 9
mov eax, stack_frame_on_END
sub eax, stack_frame_on
print str$(eax), 13, 10, "Frame off: ", 9
mov eax, stack_frame_OFF_END
sub eax, stack_frame_OFF
print str$(eax)

inkey chr$(13, 10, "--- ok ---", 13)
exit

MyPush MACRO arg
  .if 1
push arg
MyBase = MyBase + 4
ENDM

MyPop MACRO arg
pop arg
MyBase = MyBase - 4
  .endif
ENDM

MyLocal MACRO args:VARARG
LOCAL tmp$
  .if 1
MyLocEsp = 0
MyBase = 0
FOR arg, <args>
  tmp$ CATSTR <arg>, < equ !<dword ptr [esp+MyBase+>, %MyLocEsp, <]!>>
tmp$
MyLocEsp = MyLocEsp + 4
ENDM
tmp$ CATSTR <add esp, ->, %MyLocEsp
tmp$
ENDM

MyRet MACRO
tmp$ CATSTR <sub esp, ->, %MyLocEsp
tmp$
ret
  .endif
ENDM

stack_frame_OFF proc
  MyLocal LocV1, LocV2, LocV3, LocV4, LocV5, LocV6
  mov LocV1, eax
  add eax, eax
  mov LocV2, eax
  mov LocV3, 1234h
  .if eax==123456
MyPush eax
MyPush eax
MyPush eax
MyPush eax
print chr$("Frame OFF: ")
mov ecx, LocV2
print str$(ecx), 9
print str$(LocV2), 13, 10 ; wrong variable: we push [eSp+X] as an argument after esp has already moved
MyPop ecx
MyPop ecx
MyPop ecx
MyPop ecx
  .endif
  mov LocV4, 5678h
  mov LocV5, 5555h
  mov LocV6, 6666h
  MyRet
stack_frame_OFF endp
stack_frame_OFF_END:

stack_frame_on proc
LOCAL v1, v2, v3, v4, v5, v6
  mov v1, eax
  add eax, eax
  mov v2, eax
  mov v3, 1234h
  .if eax==123456
Push eax
Push eax
Push eax
Push eax
print chr$("Frame  ON: ")
mov ecx, v2
print str$(ecx), 9
print str$(v2), 13, 10 ; right variable: we push [eBp+X], and ebp does not move
Pop ecx
Pop ecx
Pop ecx
Pop ecx
  .endif
  mov v4, 5678h
  mov v5, 5555h
  mov v6, 6666h
  ret
stack_frame_on endp
stack_frame_on_END:

end start
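Why the Frame OFF version prints the wrong value can be modelled in a few lines of C. This sketch assumes that a macro such as str$ pushes one argument of its own before the LocV2 operand is read; the stack is just an int array, and all names are hypothetical.

```c
enum { STACK_WORDS = 32 };

typedef struct {
    int mem[STACK_WORDS]; /* the stack, one int per dword */
    int sp;               /* index of the top-of-stack dword (grows downward) */
    int my_base;          /* dwords pushed via tracked_push, i.e. MyBase/4 */
} StackModel;

/* MyPush: push and update the assembly-time counter. */
void tracked_push(StackModel *m, int value)
{
    m->mem[--m->sp] = value;
    m->my_base++;
}

/* A push buried inside another macro: esp moves, the counter does not. */
void hidden_push(StackModel *m, int value)
{
    m->mem[--m->sp] = value;
}

/* Address of a local, as [esp + MyBase + 4*index] would compute it. */
int *local_slot(StackModel *m, int index)
{
    return &m->mem[m->sp + m->my_base + index];
}

/* Replays the correctness test and returns what a read of LocV2 sees
   after the given number of untracked pushes. */
int read_locv2(int hidden_pushes)
{
    StackModel m = { {0}, STACK_WORDS, 0 };
    int i;

    m.sp -= 6;                      /* MyLocal: reserve LocV1..LocV6 */
    *local_slot(&m, 0) = 61728;     /* LocV1 = 123456/2 */
    *local_slot(&m, 1) = 123456;    /* LocV2 = 2*LocV1 */
    for (i = 0; i < 4; i++)
        tracked_push(&m, 123456);   /* four MyPush eax */
    for (i = 0; i < hidden_pushes; i++)
        hidden_push(&m, 0);         /* e.g. an argument pushed first */
    return *local_slot(&m, 1);
}
```

read_locv2(0) returns 123456, the real LocV2. read_locv2(1) returns 61728, which is LocV1: one untracked push shifts every esp-relative offset by one dword. An ebp-based frame does not have this problem because ebp stays fixed while esp moves.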


Timers.asm in the Masm32 Laboratory