EXE Jump Tables

dedndave · May 30, 2009, 09:25:38 PM

at the location of the invoke is a CALL relative
that branches to a JMP dword ptr [nnnnnnnn] indirect
the value at that nnnnnnnn address is the address of the api function

this code works

mov esi,labelA-4
mov esi,labelA[esi+2]
mov eax,[esi]
call eax
exit

INVOKE GetCurrentProcess
labelA label dword

but this code does not work

jmp short test01

test00: INVOKE GetCurrentProcess
labelA label dword
exit

test01: mov esi,labelA-4
mov esi,labelA[esi+2]
mov eax,[esi]
sub eax,offset labelA
mov labelA-4,eax
jmp test00

just like microsoft - take me right up to the point of almost an orgasm,
then show me a picture of rosie o'donnell and spray cold water on me

hutch-- · May 31, 2009, 01:09:46 AM

There were always alternatives. GetProcAddress gives you a callable DWORD address, but you can cut another corner: copy the API to local app memory set to execute and run the API within your own app. On win9x systems you got a speed increase; I don't know about NT based versions.

The reason why I never lost much sleep over it is most API calls are so slow that a million cycles here and there don't matter much and you lose nothing like that much with address call variations.
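
The GetProcAddress route mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the MASM32 package (masm32rt.inc and its chr$ macro); the variable name pGetCurrentProcess is made up for illustration, and kernel32.dll is assumed to be already mapped into the process.

```asm
include \masm32\include\masm32rt.inc

.data?
pGetCurrentProcess dd ?                 ; hypothetical name: holds the API entry point

.code
start:
    invoke GetModuleHandle, chr$("kernel32.dll")
    invoke GetProcAddress, eax, chr$("GetCurrentProcess")
    test eax, eax
    jz done                             ; lookup failed, nothing to call
    mov pGetCurrentProcess, eax         ; the callable DWORD address
    call dword ptr [pGetCurrentProcess] ; indirect call, no jump table involved
done:
    exit
end start
```

The address is looked up once and can then be called any number of times through the variable.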

dedndave · May 31, 2009, 01:21:48 AM

i suspect that would get you a security violation with win2K or higher, Hutch
but that would be a nice technique to see how some of the functions work

hutch-- · May 31, 2009, 01:36:40 AM

Nah, that's not the problem: if you can call an address you can also copy from it. It's more to do with how the internals of the OS work. The API calls NTDLL.DLL, and some procedures within that call even lower level DLLs, so the best you can get from it is a one level reduction in the call layers. Back in the win9x days the technique seemed to work best on GDI calls, but that was for a simple reason: a lot of Win9x GDI was written in MASM.

dedndave · May 31, 2009, 03:46:08 AM

ok - i isolated the bad guy....

Quote
but this code does not work

jmp short test01

test00: INVOKE GetCurrentProcess
labelA label dword
exit

test01: mov esi,labelA-4
mov esi,labelA[esi+2]
mov eax,[esi]
sub eax,offset labelA
mov labelA-4,eax
jmp test00

this line assembles fine, but crashes the program

mov labelA-4,eax

i made a temporary work-around by placing the "labelA-4" address in esi then mov [esi],eax
that crashes also

i am working on that

now, i need somebody really sharp to tell me about re-based PE's - lol
Vortex maybe ?
here is my question....
if the PE gets re-based at load-time, do these tables become far jumps ?
and another question.....
is there a way to force an exe to be re-based for testing purposes ?

Neo · May 31, 2009, 06:27:11 AM

This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:

You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)
The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables. Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called. Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.
You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.

All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on that not using jump tables is slightly more efficient overall. Anyone up for a tough performance testing challenge? :wink

P.S. There are currently other issues with importing libraries other than kernel32/user32/gdi32 in Inventor IDE (it is just an alpha after all), so it's far from perfect. I'm just using it as an example w.r.t. jump tables versus no jump tables.

BogdanOntanu · May 31, 2009, 07:04:14 AM

Quote from: Neo on May 31, 2009, 06:27:11 AM
This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:

It is my own preference to have jump tables. I do not claim "efficiency" at run time. I claim that it does not matter much at run time. I would add an option to disable jump table generation for my own assembler if this makes users happy.

Quote

You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)

Yes, this is true but not related to relocations... it is related to the code size of a relative jump versus an absolute indirect jump.

Quote

The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables.

But it is the default and needed for DLL's.

Besides run-time or load-time relocations there is another kind of relocation that is generated inside the OBJ. Unlike the run-time kind, these compile-time relocations are mandatory if you generate OBJ's and link multiple modules.

It will take one such relocation for each API call in an executable with no jump table. With jump tables it will only take one for each API used. Hence compilation and linking speed is helped here and this was my primary concern since I create huge ASM projects.

Quote
...
Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called. Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.

Yes, true but once you call an API speed is no longer of the essence.

Yes, each import only appears once in the IAT, BUT each direct call requires a run-time relocation (in a DLL).

Quote

You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.

Ok, this is nice advertising for your Inventor IDE... I will check it out. Is it written in full ASM?

FYI Sol_Asm does not require any kind of libs when directly creating an Executable/DLL/binary. Neither does FASM or NASM AFAIK etc... In fact neither does MASM for generating the OBJ... The libs are only needed by the linker when it links multiple OBJ's.

This feature is in no way related to the subject.

However, an assembler that can NOT produce OBJ's to be linked together by a linker is missing a huge feature. "Most" professional projects out there involve generating OBJ's and then linking them together to create the final executable.

After all, the jump table method is also calling through the very same import table. It is the API calls in code that are relative in one case and absolute indirect in the other, but both methods reach the very same IAT in the end.

Quote
All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on that not using jump tables is slightly more efficient overall.

I prefer jump tables because the run time speed improvement is not worth it in this case, the size of the executable/dll is potentially smaller, the compilation speed is higher, the load time is shorter and the OBJ size is smaller. If I want speed then I choose better algorithms and write my own functions to reduce API overhead, but I do not try to optimize every opcode/byte/cycle.

However this is my personal preference.

BogdanOntanu · May 31, 2009, 07:27:59 AM

Quote from: dedndave on May 31, 2009, 03:46:08 AM
...
now, i need somebody really sharp to tell me about re-based PE's - lol
Vortex maybe ?
here is my question....
if the PE gets re-based at load-time, do these tables become far jumps ?
and another question.....
is there a way to force an exe to be re-based for testing purposes ?

By "re-based" I guess you mean relocated at run time. There is another tool named exactly "rebase" that can "cold" change the preferred load address of an executable or DLL after compile time.

Quote
if the PE gets re-based at load-time, do these tables become far jumps ?

No. Everything that is absolute must be relocated in this case but the jumps remain "near".

There is no use for "far" jumps in normal user mode win32 programming. Everything is near in flat protected mode (win32) but some addresses in code are absolute (not relative) and those addresses need to be changed IF the base address is changed.

Quote
is there a way to force an exe to be re-based for testing purposes ?

EXE's are rarely (if ever) relocated in Win32. Only if they are DLL's in disguise or plugins to be loaded by another EXE. The default load / base address of a PE EXE is normally free at the EXE's load time.

However DLL's are often relocated because you can not be sure of the load order and memory position of all DLL's needed for an EXE / process.

One way to force a run time relocation to occur is to have 2 DLL's compiled for the very same preferred base address and then load them by hand one after another. The second one must be relocated by the OS loader because its address space is already occupied by the first DLL.

Another method would be to compile an EXE for a preferred address (other than the default 0x40_0000) that is already in use by the OS.

If the PE EXE has run time relocations stored inside then you can use that "re-base" tool to change its base address even after compile time.

jj2007 · May 31, 2009, 07:40:03 AM

Quote from: dedndave on May 30, 2009, 09:25:38 PM
at the location of the invoke is a CALL relative
that branches to a JMP dword ptr [nnnnnnnn] indirect
the value at that nnnnnnnn address is the address of the api function

Here is the simplest variant for calling with a register:

Code:

include \masm32\include\masm32rt.inc

.code
start:
	mov esi, MessageBox
	push MB_OK
	push chr$("Hello")
	push chr$("Called via esi")
	push 0
	call esi

	exit

end start

Slightly more sophisticated:

Code:

include \masm32\include\masm32rt.inc

MBox	= 0
Exit	= 4

.data
MyJumpTable	dd MessageBox, ExitProcess

.code
start:
	mov esi, offset MyJumpTable
	push MB_OK
	push chr$("Hello")
	push chr$("Called via esi")
	push 0
	call dword ptr [esi+MBox]
	push 0
	call dword ptr [esi+Exit]

end start

But whether that is more efficient... no idea

UtillMasm · May 31, 2009, 08:16:25 AM

:U
very clean, i like this more:

Code:

comment #
 @echo off
\masm32\bin\ml.exe /c /coff /Focall2.obj /nologo call2.asm
\masm32\bin\link.exe /subsystem:windows /out:call2.exe call2.obj /nologo
pause
#
include \masm32\include\masm32rt.inc
MBox=0
Exit=4
.data
 MyJumpTable dd MessageBox,ExitProcess
.code
 start:mov esi,offset MyJumpTable
 push MB_OK
 push chr$("Hello")
 push chr$("Called via esi")
 push 0
 call dword ptr[esi+MBox]
 push 0
 call dword ptr[esi+Exit]
end start

and like radasm msg jump table too.
:wink

Vortex · May 31, 2009, 08:37:08 AM

Hi dedndave,

Quote
is there a way to force an exe to be re-based for testing purposes ?

You might like to have a look at the thread "Loading and running EXEs and DLLs from memory". The EXE\DLL is loaded to a memory address allocated by VirtualAlloc.

hutch-- · May 31, 2009, 08:50:36 AM

First, there is more to an API call than the difference between a direct call in code and an indirect call through an address table. A call outside the running app memory space is measurably slower than an internal call where a direct JMP to an address usually is not. For the indirect method you get a fast call and a fast JMP, with the direct method you get a slow call. The example I have picked is SendMessageA which gets bashed in an app a massive number of times which justifies it being placed in an address table to save space and usually be in cache.

I doubt you could successfully benchmark the difference but indirect calls never went slower than the direct call.

Code:


; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷


004011FB 6A00                   push    0
004011FD 6A02                   push    2
004011FF 6811010000             push    111h
00401204 FF3550304000           push    dword ptr [403050h]
0040120A E86D000000             call    jmp_SendMessageA

jmp_SendMessageA:               jmp     dword ptr [SendMessageA]

; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷


00401212 6A00                   push    0
00401214 6A02                   push    2
00401216 6811010000             push    111h
0040121B FF3550304000           push    dword ptr [403050h]
00401221 FF1518204000           call    dword ptr [SendMessageA]


; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷
; ÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷·÷

jj2007 · May 31, 2009, 09:42:06 AM

Since we are all in brainstorming mode now, here is one more idea to play with:

Code:


include \masm32\include\masm32rt.inc

MBox	= 121
Exit	= 0

.data
MyJumpTable	dd ExitProcess
		dd 120 dup(0)	; 120 slots for other API's
		dd MessageBox

.data?
RetAdd	dd ?
ChkEsp	dd ?

.code
start:
	mov ChkEsp, esp

	mov esi, Scheduler

	push MB_OK
	push chr$("Hello")
	push chr$("Called via esi")
	push 0
	push MBox		; MBox = 121
	call esi		; works but is only one byte shorter

	invoke MessageBox, 0, chr$("The conventional way"), chr$("Title"), MB_OK

	sub ChkEsp, esp
	MsgBox 0, str$(ChkEsp), "Esp diff=0?", MB_OK

	push 0	; ret 0
	push 0	; Exit
	call esi

Scheduler proc
  pop RetAdd
  pop eax
  lea eax, [MyJumpTable+4*eax]
  call dword ptr [eax]
  jmp RetAdd
Scheduler endp

end start

It works, it's probably utterly slow, but for code size freaks it might be interesting :bg

dedndave · May 31, 2009, 10:28:58 AM

Quote
this code works

mov esi,labelA-4 ;get the relative address from INVOKE
mov esi,labelA[esi+2] ;get the address part of the indirect JMP
mov eax,[esi] ;get the API target referred to in the JMP
call eax
exit

INVOKE GetCurrentProcess
labelA label dword

notice that the IAT method takes:
4 bytes in the INVOKE code
6 bytes for the indirect JMP
4 more bytes for the target
-------------------
14 bytes total

and, while it may be true that "CALL reg" direct may be faster than "CALL near rel"
let's not forget that we have to get the target address into the register to begin with

i think, overall, the fastest would be "CALL near rel" (E8 nn nn nn nn)
which is what the INVOKE currently uses
if we eliminate the IAT table, as well as the target address, we reduce the byte-count by 10
reducing bytes is nice, but let's face it, not a big issue with today's storage sizes
if i have 100 different API calls, that's only 1 KB - not an issue

the only problem i am having at the moment is that the OS
will not let me over-write the 4 bytes in the CALL instruction of the INVOKE sequence
i suspect that this is a write protection fault, for obvious security reasons

because i intend to replace the operand in only a few select places,
i can work around this by using something other than an INVOKE or CALL
need be, i can hard code it like this:

db 0E8h
labelB db 4 dup(?)

and fill it in during initialization
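
One possible way around the write fault is to ask the OS to make the operand bytes writable before patching them. The following is a minimal sketch, assuming the MASM32 package and reusing the test00/labelA layout from the earlier post; the names oldProt and patch are made up for illustration. Bytes emitted with db inside .code live in the same read-only section, so they would need the same treatment. On other processor families a FlushInstructionCache call after the patch would also be expected.

```asm
include \masm32\include\masm32rt.inc

.data?
oldProt dd ?                            ; hypothetical name: receives the old page protection

.code
start:
    jmp patch

test00: INVOKE GetCurrentProcess        ; E8 rel32, initially a call to the jmp stub
labelA  label dword                     ; address of the instruction after the CALL
    exit

patch:
    mov edi, offset labelA
    sub edi, 4                          ; edi -> the 4 operand bytes of the CALL
    invoke VirtualProtect, edi, 4, PAGE_EXECUTE_READWRITE, addr oldProt
    test eax, eax
    jz test00                           ; could not unprotect: leave the CALL as it is

    mov esi, [edi]                      ; rel32 taken from the CALL
    mov esi, labelA[esi+2]              ; address field of JMP dword ptr [nnnnnnnn]
    mov eax, [esi]                      ; API entry point read from the IAT
    sub eax, offset labelA              ; new rel32 = target - address after the CALL
    mov [edi], eax                      ; patch the operand
    jmp test00

end start
```

An alternative that needs no run-time call would be to mark the code section writable at link time (the Microsoft linker has a /SECTION option for section attributes), at the cost of leaving all code pages writable.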

while it is true that most API calls are slow to begin with, there are a few that are relatively fast
i would have to think that QueryPerformanceCounter is fairly fast, as an example,
because there isn't a lot of decision-making to be done - just gimme 2 dword values

as i mentioned before, i am interested in synchronizing threads with the "highest resolution possible"
i am trying to develop a technique for timing evaluation code on single/multi core machines
the idea is, to have one thread perform the timing operation, while another thread runs the code
the eval code thread needs to be ready to run, then
once the time-keeping thread has read its initial timer value, it will release the eval code thread for execution
the reason for the dual-thread method is that some machines have more than one core
on those machines, the TSC needs to be run with a process affinity mask of only one selected core
the eval thread can be run with all cores selected, or whatever the test calls for
i am trying to keep the overhead of the SetProcessAffinityMask function out of the evaluation measurement

MichaelW · May 31, 2009, 11:19:59 AM

I'm not sure that all of this is correct. I selected PostMessage instead of SendMessage because PostMessage returns immediately without waiting for the window procedure to process the message. If the cycle count is not more than a few hundred cycles my P3 normally returns very consistent counts. I can't get consistent results here, partly because the cycle counts are too high, and I think partly because the called function has a variable execution time. In any case, under Windows 2000 I can see no significant difference (or if there is, it's smaller than the variation).

Code:


; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      hwndTarget dd 0
      itotal     dd 0
      dtotal     dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    EXTERNDEF _imp__PostMessageA@16:NEAR PTR

    invoke FindWindow, NULL, chr$("TARGET")
    mov hwndTarget, eax
    print ustr$(hwndTarget),13,10,13,10
    .IF hwndTarget

      nops 3

      push 0
      push 0
      push WM_NULL
      push hwndTarget
      call _imp__PostMessageA@16

      nops 3

      invoke PostMessage, hwndTarget, WM_NULL, 0, 0

      nops 3

      print "direct indirect",13,10
      print "------ -------- ",13,10

      invoke Sleep, 4000

      REPEAT 20

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          push 0
          push 0
          push WM_NULL
          push hwndTarget
          call _imp__PostMessageA@16
        counter_end
        add dtotal, eax
        print ustr$(eax),9

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          invoke PostMessage, hwndTarget, WM_NULL, 0, 0
        counter_end
        add itotal, eax
        print ustr$(eax),13,10

      ENDM

      print "------ -------- ",13,10
      print ustr$(dtotal), 9
      print ustr$(itotal),13,10,13,10

    .ENDIF

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

Code:


00401048 90                     nop
00401049 90                     nop
0040104A 90                     nop
0040104B 6A00                   push    0
0040104D 6A00                   push    0
0040104F 6A00                   push    0
00401051 FF3500504000           push    dword ptr [405000h]
00401057 FF1534404000           call    dword ptr [PostMessageA]
0040105D 90                     nop
0040105E 90                     nop
0040105F 90                     nop
00401060 6A00                   push    0
00401062 6A00                   push    0
00401064 6A00                   push    0
00401066 FF3500504000           push    dword ptr [405000h]
0040106C E827270000             call    fn_00403798
00401071 90                     nop
00401072 90                     nop
00401073 90                     nop
. . .
00403798                    fn_00403798:
00403798 FF2534404000           jmp     dword ptr [PostMessageA]

Typical results on my P3:

Code:


direct indirect
------ --------
1332    1206
1202    1198
1191    1204
1189    1192
1192    1206
1192    1199
1190    1252
1189    1196
1210    1202
1190    1192
1197    1209
1201    1196
1190    1201
1188    1193
1190    1206
1190    1192
1191    1201
1187    1191
1192    1203
1190    1191
------ --------
23993   24030
