EXE Jump Tables

Started by dedndave, May 29, 2009, 05:51:54 PM

dedndave

#30
at the location of the invoke is a CALL relative
that branches to a JMP dword ptr [nnnnnnnn] indirect
the value at that nnnnnnnn address is the address of the api function

this code works

        mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        call    eax
        exit

        INVOKE  GetCurrentProcess
labelA  label   dword


but this code does not work

        jmp short test01

test00: INVOKE  GetCurrentProcess
labelA  label   dword
        exit

test01: mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        sub     eax,offset labelA
        mov     labelA-4,eax
        jmp     test00


just like microsoft - take me right up to the point of almost an orgasm,
then show me a picture of rosie o'donnell and spray cold water on me

hutch--

There were always alternatives. GetProcAddress gives you a callable DWORD address, but you can cut another corner: copy the API to local app memory set to execute and run the API within your own app. On win9x systems you got a speed increase; I don't know about NT-based versions.

The reason why I never lost much sleep over it is most API calls are so slow that a million cycles here and there don't matter much and you lose nothing like that much with address call variations.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave

i suspect that would get you a security violation with win2K or higher, Hutch
but that would be a nice technique to see how some of the functions work

hutch--

Nah, that's not the problem. If you can call an address you can also copy from it. It's more to do with how the internals of the OS work: the API calls NTDLL.DLL, and some procedures within that call even lower-level DLLs, so the best you can get from it is a one-level reduction in the call layers. Back in the win9x days the technique seemed to work best on GDI calls, but that was for a simple reason: a lot of Win9x GDI was written in MASM.

dedndave

ok - i isolated the bad guy....
Quote
but this code does not work

        jmp short test01

test00: INVOKE  GetCurrentProcess
labelA  label   dword
        exit

test01: mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        sub     eax,offset labelA
        mov     labelA-4,eax
        jmp     test00

this line assembles fine, but crashes the program

        mov     labelA-4,eax

i made a temporary work-around by placing the "labelA-4" address in esi then mov [esi],eax
that crashes also

i am working on that
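
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the linker marks the .text section read/execute only by default, so a store into it faults whichever addressing form is used. A minimal sketch of the same patch with the four operand bytes made writable first. The oldProt variable and the store through edi are additions for illustration and are not part of the original test:

```asm
include \masm32\include\masm32rt.inc

.data?
oldProt dd ?                            ; hypothetical name, receives the previous protection

.code
start:
        jmp     test01

test00: INVOKE  GetCurrentProcess
labelA  label   dword
        exit

test01: mov     edi,offset labelA-4     ;address of the rel32 operand of the CALL

        ; assumption: .text is mapped without write access, so unlock 4 bytes
        invoke  VirtualProtect, edi, 4, PAGE_EXECUTE_READWRITE, addr oldProt

        mov     esi,[edi]               ;rel32 of the CALL (distance to the JMP stub)
        mov     esi,labelA[esi+2]       ;abs32 operand of the JMP = address of the IAT slot
        mov     eax,[esi]               ;API entry point stored in the IAT slot
        sub     eax,offset labelA       ;new rel32 = target - address after the CALL
        mov     [edi],eax               ;store through a register, no label operand

        ; put the original protection back
        invoke  VirtualProtect, edi, 4, oldProt, addr oldProt
        jmp     test00

end start
```

Linking with /SECTION:.text,RWE is another way to get a writable code section, at the cost of leaving all of the code writable for the life of the process.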

now, i need somebody really sharp to tell me about re-based PE's - lol
Vortex maybe ?
here is my question....
if the PE gets re-based at load-time, do these tables become far jumps ?
and another question.....
is there a way to force an exe to be re-based for testing purposes ?

Neo

This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:

  • You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)
  • The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables.  Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called.  Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.
  • You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.

All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on not using jump tables being slightly more efficient overall.  Anyone up for a tough performance testing challenge?  :wink

P.S. There are currently other issues with importing libraries other than kernel32/user32/gdi32 in Inventor IDE (it is just an alpha after all), so it's far from perfect.  I'm just using it as an example w.r.t. jump tables versus no jump tables.

BogdanOntanu

#36
Quote from: Neo on May 31, 2009, 06:27:11 AM
This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:

It is my own preference to have jump tables. I do not claim "efficiency" at run time. I claim that it does not matter much at run time. I would add an option to disable jump table generation for my own assembler if this makes users happy.

Quote

  • You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)

Yes, this is true but not related to relocations... it is related to the code size of a relative jump versus an absolute indirect jump.
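
To make the size argument concrete, here is a small sketch that emits both call forms for the same import. GetTickCount is an arbitrary choice, and the EXTERNDEF line is modelled on the PostMessageA declaration used in the timing test later in this thread, so treat both as assumptions:

```asm
include \masm32\include\masm32rt.inc

; IAT slot for GetTickCount, declared the same way as the
; _imp__PostMessageA@16 slot in the timing test
EXTERNDEF _imp__GetTickCount@0:NEAR PTR

.code
start:
        ; no jump table: each call site is FF 15 abs32 = 6 bytes
        call    _imp__GetTickCount@0

        ; jump table: each call site is E8 rel32 = 5 bytes, plus one
        ; FF 25 abs32 stub = 6 bytes shared by every call site
        invoke  GetTickCount

        exit
end start
```

For n call sites of one import the code takes 6n bytes without a jump table and 5n + 6 bytes with one, so the two are equal at n = 6 and the jump table is smaller from the seventh call site on. In a relocatable image every abs32 operand also needs a 2-byte fixup entry (ignoring relocation block headers and padding), which gives 8n against 5n + 8 and moves the advantage to the jump table from n = 3.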

Quote

  • The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables.

But it is the default and needed for DLL's.

Besides run-time or load-time relocations there is another kind of relocation that is generated inside the OBJ. Unlike the run-time kind, these compile-time relocations are mandatory if you generate OBJ's and link multiple modules.

It will take one such relocation for each API call in an executable with no jump table. With jump tables it will only take one for each API used. Hence compilation and linking speed is helped here and this was my primary concern since I create huge ASM projects.

Quote
...
  Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called.  Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.

Yes, true but once you call an API speed is no longer of the essence.

Yes, each import only appears once in the IAT, BUT each direct call requires a run-time relocation (in a DLL).


Quote

  • You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.

Ok, this is nice advertising for your Inventor IDE... I will check it out. Is it written in full ASM?

FYI Sol_Asm does not require any kind of libs when directly creating an Executable/DLL/binary. Neither does FASM or NASM AFAIK etc... In fact neither does MASM for generating the OBJ... The libs are only needed by the linker when it links multiple OBJ's.

This feature is in no way related to the subject.

However, an assembler that can NOT produce OBJ's to be linked together by a linker is missing a major feature. "Most" professional projects out there involve generating OBJ's and then linking them together to create the final executable.

After all, the jump table method is also calling through the very same import table. It is the API calls in the code that are relative in one case and absolute indirect in the other, but both methods reach the very same IAT in the end.

Quote
All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on that not using jump tables is slightly more efficient overall.

I prefer jump tables because the run-time speed improvement is not worth it in this case: the size of the executable/DLL is potentially smaller, compilation is faster, load time is shorter and the OBJ size is smaller. If I want speed then I choose better algorithms or write my own functions to reduce API overhead, but I do not try to optimize every opcode/byte/cycle.

However this is my personal preference.


Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

BogdanOntanu

Quote from: dedndave on May 31, 2009, 03:46:08 AM
...
now, i need somebody really sharp to tell me about re-based PE's - lol
Vortex maybe ?
here is my question....
if the PE gets re-based at load-time, do these tables become far jumps ?
and another question.....
is there a way to force an exe to be re-based for testing purposes ?

By "re-based" I guess you mean relocated at run time. There is another tool named exactly "rebase" that can "cold" change the preferred load address of an executable or DLL after compile time.

Quote
if the PE gets re-based at load-time, do these tables become far jumps ?

No. Everything that is absolute must be relocated in this case but the jumps remain "near".

There is no use for "far" jumps in normal user mode win32 programming. Everything is near in flat protected mode (win32) but some addresses in code are absolute (not relative) and those addresses need to be changed IF the base address is changed.
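
As an illustration of which operands are affected, the sketch below marks each instruction as relative or absolute. The names counter and bump are made up for the example; only the abs32 operands would get an entry in the relocation table of a relocatable image:

```asm
include \masm32\include\masm32rt.inc

.data
counter dd 0                            ; hypothetical variable for the example

.code
start:
        call    bump                    ; E8 rel32    : relative, unchanged by a re-base
        mov     eax,offset counter      ; B8 abs32    : absolute, needs a fixup
        inc     dword ptr [eax]         ; FF 00       : no address in the encoding
        inc     counter                 ; FF 05 abs32 : absolute, needs a fixup
        invoke  ExitProcess,0           ; E8 rel32 to the stub: relative, but the
                                        ; stub's FF 25 abs32 operand needs a fixup

bump    proc
        inc     counter                 ; FF 05 abs32 : absolute again
        ret
bump    endp

end start
```

All the branches here stay near: a re-base only adds the load delta to the abs32 fields and never changes an opcode.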

Quote
is there a way to force an exe to be re-based for testing purposes ?

EXE's are rarely (if ever) relocated in Win32. Only if they are DLL's in disguise or plugins to be loaded by another EXE. The default load / base address of a PE EXE is normally free at the EXE's load time.

However DLL's are often relocated because you can not be sure of the load order and memory position of all DLL's needed for an EXE / process.

One way to force a run time relocation to occur is to have 2 DLL's compiled for the very same preferred base address and then load them by hand one after another. The second one must be relocated by the OS loader because its address space is already occupied by the first DLL.

Another method would be to compile an EXE for a preferred address (other than the default 0x40_0000) that is already in use by the OS.

If the PE EXE has run time relocations stored inside then you can use that "re-base" tool to change its base address even after compile time.
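
A rough sketch of the "preferred address already in use" test, under stated assumptions: the file name reloc.asm is hypothetical, the /base value is only a placeholder that has to be replaced by an address that really is occupied on the test machine, and /fixed:no is what asks the linker to keep relocations in an EXE. The program prints where it was actually loaded, so a value different from /base means the loader relocated it:

```asm
comment *
assumed build steps, written in the batch style used elsewhere in this thread
\masm32\bin\ml.exe /c /coff /nologo reloc.asm
\masm32\bin\link.exe /subsystem:console /fixed:no /base:0x10000000 /nologo reloc.obj
*
include \masm32\include\masm32rt.inc

.code
start:
        invoke  GetModuleHandle,NULL    ; actual load address of this EXE
        print   hex$(eax),13,10         ; compare with the /base value
        exit
end start
```

Without /fixed:no an EXE normally carries no relocations, so a clash at its preferred base would make the load fail instead of moving the image.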

jj2007

Quote from: dedndave on May 30, 2009, 09:25:38 PM
at the location of the invoke is a CALL relative
that branches to a JMP dword ptr [nnnnnnnn] indirect
the value at that nnnnnnnn address is the address of the api function


Here is the simplest variant for calling with a register:

include \masm32\include\masm32rt.inc

.code
start:
mov esi, MessageBox
push MB_OK
push chr$("Hello")
push chr$("Called via esi")
push 0
call esi

exit

end start


Slightly more sophisticated:
include \masm32\include\masm32rt.inc

MBox = 0
Exit = 4

.data
MyJumpTable dd MessageBox, ExitProcess

.code
start:
mov esi, offset MyJumpTable
push MB_OK
push chr$("Hello")
push chr$("Called via esi")
push 0
call dword ptr [esi+MBox]
push 0
call dword ptr [esi+Exit]

end start


But whether that is more efficient... no idea

UtillMasm

 :U
very clean, i like this more:
comment #
@echo off
\masm32\bin\ml.exe /c /coff /Focall2.obj /nologo call2.asm
\masm32\bin\link.exe /subsystem:windows /out:call2.exe call2.obj /nologo
pause
#
include\masm32\include\masm32rt.inc
MBox=0
Exit=4
.data
MyJumpTable dd MessageBox,ExitProcess
.code
start:mov esi,offset MyJumpTable
push MB_OK
push chr$("Hello")
push chr$("Called via esi")
push 0
call dword ptr[esi+MBox]
push 0
call dword ptr[esi+Exit]
end start

and like radasm msg jump table too.
:wink

Vortex

Hi dedndave,

Quote
is there a way to force an exe to be re-based for testing purposes ?

You might like to have a look at the thread "Loading and running EXEs and DLLs from memory". The EXE/DLL is loaded to a memory address allocated by VirtualAlloc.

hutch--

First, there is more to an API call than the difference between a direct call in code and an indirect call through an address table. A CALL to an address outside the running app's memory space is measurably slower than an internal CALL, whereas a direct JMP to such an address usually is not. With the indirect method you get a fast CALL and a fast JMP; with the direct method you get a slow CALL. The example I have picked is SendMessageA, which gets bashed a massive number of times in an app; that justifies placing it in an address table, where it saves space and is usually in cache.

I doubt you could successfully benchmark the difference but indirect calls never went slower than the direct call.


; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷


004011FB 6A00                   push    0
004011FD 6A02                   push    2
004011FF 6811010000             push    111h
00401204 FF3550304000           push    dword ptr [403050h]
0040120A E86D000000             call    jmp_SendMessageA

jmp_SendMessageA:               jmp     dword ptr [SendMessageA]

; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷


00401212 6A00                   push    0
00401214 6A02                   push    2
00401216 6811010000             push    111h
0040121B FF3550304000           push    dword ptr [403050h]
00401221 FF1518204000           call    dword ptr [SendMessageA]


; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷

jj2007

Since we are all in brainstorming mode now, here is one more idea to play with:

include \masm32\include\masm32rt.inc

MBox = 121
Exit = 0

.data
MyJumpTable dd ExitProcess
dd 120 dup(0) ; 120 slots for other API's
dd MessageBox

.data?
RetAdd dd ?
ChkEsp dd ?

.code
start:
mov ChkEsp, esp

mov esi, Scheduler

push MB_OK
push chr$("Hello")
push chr$("Called via esi")
push 0
push MBox ; MBox = 121
call esi ; works but is only one byte shorter

invoke MessageBox, 0, chr$("The conventional way"), chr$("Title"), MB_OK

sub ChkEsp, esp
MsgBox 0, str$(ChkEsp), "Esp diff=0?", MB_OK

push 0 ; ret 0
push 0 ; Exit
call esi

Scheduler proc
  pop RetAdd
  pop eax
  lea eax, [MyJumpTable+4*eax]
  call dword ptr [eax]
  jmp RetAdd
Scheduler endp

end start


It works, it's probably utterly slow, but for code size freaks it might be interesting :bg

dedndave

Quote
this code works

        mov     esi,labelA-4       ;get the relative address from INVOKE
        mov     esi,labelA[esi+2]  ;get the address part of the indirect JMP
        mov     eax,[esi]          ;get the API target referred to in the JMP
        call    eax
        exit

        INVOKE  GetCurrentProcess
labelA  label   dword

notice that the IAT method takes:
4 bytes in the INVOKE code
6 bytes for the indirect JMP
4 more bytes for the target
-------------------
14 bytes total

and, while it may be true that "CALL reg" direct may be faster than "CALL near rel"
let's not forget that we have to get the target address into the register to begin with

i think, overall, the fastest would be "CALL near rel" (E8 nn nn nn nn)
which is what the INVOKE currently uses
if we eliminate the IAT table, as well as the target address, we reduce the byte-count by 10
reducing bytes is nice, but let's face it, not a big issue with today's storage sizes
if i have 100 different API calls, that's only 1 KB - not an issue

the only problem i am having at the moment is that the OS
will not let me over-write the 4 bytes in the CALL instruction of the INVOKE sequence
i suspect that this is a write protection fault, for obvious security reasons

because i intend to replace the operand in only a few select places,
i can work around this by using something other than an INVOKE or CALL
if need be, i can hard code it like this:

        db 0E8h
labelB  db 4 dup(?)

and fill it in during initialization

while it is true that most API calls are slow to begin with, there are a few that are relatively fast
i would have to think that QueryPerformanceCounter is fairly fast, as an example,
because there isn't a lot of decision-making to be done - just gimme 2 dword values

as i mentioned before, i am interested in synchronizing threads with the "highest resolution possible"
i am trying to develop a technique for timing evaluation code on single/multi core machines
the idea is, to have one thread perform the timing operation, while another thread runs the code
the eval code thread needs to be ready to run, then
once the time-keeping thread has read its initial timer value, it will release the eval code thread for execution
the reason for the dual-thread method is that some machines have more than one core
on those machines, the TSC needs to be run with a process affinity mask of only one selected core
the eval thread can be run with all cores selected, or whatever the test calls for
i am trying to keep the overhead of the SetProcessAffinityMask function out of the evaluation measurement

MichaelW

I'm not sure that all of this is correct. I selected PostMessage instead of SendMessage because PostMessage returns immediately without waiting for the window procedure to process the message. If the cycle count is not more than a few hundred cycles my P3 normally returns very consistent counts. I can't get consistent results here, partly because the cycle counts are too high, and I think partly because the called function has a variable execution time. In any case, under Windows 2000 I can see no significant difference (or if there is, it's smaller than the variation).

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      hwndTarget dd 0
      itotal     dd 0
      dtotal     dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    EXTERNDEF _imp__PostMessageA@16:NEAR PTR

    invoke FindWindow, NULL, chr$("TARGET")
    mov hwndTarget, eax
    print ustr$(hwndTarget),13,10,13,10
    .IF hwndTarget

      nops 3

      push 0
      push 0
      push WM_NULL
      push hwndTarget
      call _imp__PostMessageA@16

      nops 3

      invoke PostMessage, hwndTarget, WM_NULL, 0, 0

      nops 3

      print "direct indirect",13,10
      print "------ -------- ",13,10

      invoke Sleep, 4000

      REPEAT 20

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          push 0
          push 0
          push WM_NULL
          push hwndTarget
          call _imp__PostMessageA@16
        counter_end
        add dtotal, eax
        print ustr$(eax),9

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          invoke PostMessage, hwndTarget, WM_NULL, 0, 0
        counter_end
        add itotal, eax
        print ustr$(eax),13,10

      ENDM

      print "------ -------- ",13,10
      print ustr$(dtotal), 9
      print ustr$(itotal),13,10,13,10

    .ENDIF

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start


00401048 90                     nop
00401049 90                     nop
0040104A 90                     nop
0040104B 6A00                   push    0
0040104D 6A00                   push    0
0040104F 6A00                   push    0
00401051 FF3500504000           push    dword ptr [405000h]
00401057 FF1534404000           call    dword ptr [PostMessageA]
0040105D 90                     nop
0040105E 90                     nop
0040105F 90                     nop
00401060 6A00                   push    0
00401062 6A00                   push    0
00401064 6A00                   push    0
00401066 FF3500504000           push    dword ptr [405000h]
0040106C E827270000             call    fn_00403798
00401071 90                     nop
00401072 90                     nop
00401073 90                     nop
. . .
00403798                    fn_00403798:
00403798 FF2534404000           jmp     dword ptr [PostMessageA]

Typical results on my P3:

direct indirect
------ --------
1332    1206
1202    1198
1191    1204
1189    1192
1192    1206
1192    1199
1190    1252
1189    1196
1210    1202
1190    1192
1197    1209
1201    1196
1190    1201
1188    1193
1190    1206
1190    1192
1191    1201
1187    1191
1192    1203
1190    1191
------ --------
23993   24030



eschew obfuscation