Topic: EXE Jump Tables
dedndave
at the location of the invoke is a CALL relative that branches to a JMP dword ptr [nnnnnnnn] indirect
the value at that nnnnnnnn address is the address of the api function
this code works

        mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        call    eax
        exit

        INVOKE  GetCurrentProcess
labelA  label   dword

but this code does not work

        jmp short test01

test00: INVOKE  GetCurrentProcess
labelA  label   dword
        exit

test01: mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        sub     eax,offset labelA
        mov     labelA-4,eax
        jmp     test00

just like microsoft - take me right up to the point of almost an orgasm, then show me a picture of rosie o'donnell and spray cold water on me
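For readers who want to trace the lookup chain from the post above without a debugger, here is a minimal Python sketch that models it on a fake memory image. All addresses (`CALL_ADDR`, `STUB_ADDR`, `IAT_SLOT`, `API_ADDR`) are made up for illustration; only the opcodes E8 (CALL rel32) and FF 25 (JMP dword ptr [abs32]) come from the thread.

```python
import struct

def store(mem, addr, data):
    """Write bytes into a sparse {address: byte} memory model."""
    for i, b in enumerate(data):
        mem[addr + i] = b

def read_u32(mem, addr):
    """Read a little-endian dword from the memory model."""
    return struct.unpack("<I", bytes(mem[addr + i] for i in range(4)))[0]

def resolve_api(mem, label_a):
    """Mirror the three MOVs; label_a is the address right after the CALL."""
    rel32 = read_u32(mem, label_a - 4)        # mov esi,labelA-4
    stub = (label_a + rel32) & 0xFFFFFFFF     # where the CALL lands
    if (mem[stub], mem[stub + 1]) != (0xFF, 0x25):
        raise ValueError("expected JMP dword ptr [abs32] at call target")
    iat_slot = read_u32(mem, stub + 2)        # mov esi,labelA[esi+2]
    return read_u32(mem, iat_slot)            # mov eax,[esi]

# Hypothetical layout, chosen only for the demonstration.
CALL_ADDR = 0x00401000
STUB_ADDR = 0x00401100
IAT_SLOT = 0x00402000
API_ADDR = 0x7C80DE95

mem = {}
label_a = CALL_ADDR + 5                       # CALL rel32 is 5 bytes long
store(mem, CALL_ADDR, b"\xE8" + struct.pack("<i", STUB_ADDR - label_a))
store(mem, STUB_ADDR, b"\xFF\x25" + struct.pack("<I", IAT_SLOT))
store(mem, IAT_SLOT, struct.pack("<I", API_ADDR))

resolved = resolve_api(mem, label_a)
print(hex(resolved))
```

The `+2` in the second MOV is what skips the two opcode bytes of the stub so that the read lands on its 32-bit operand.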
« Last Edit: May 30, 2009, 11:08:32 PM by dedndave »
hutch--
Administrator, Member
Posts: 12013
Mnemonic Driven API Grinder
There were always alternatives. GetProcAddress gives you a callable DWORD address, but you can cut another corner: copy the API to local app memory set to execute and run the API within your own app. On Win9x systems you got a speed increase; I don't know about NT-based versions.
The reason I never lost much sleep over it is that most API calls are so slow that a million cycles here and there don't matter much, and you lose nothing like that much with address call variations.
dedndave
i suspect that would get you a security violation with win2K or higher, Hutch
but that would be a nice technique to see how some of the functions work
hutch--
Nah, that's not the problem. If you can call an address you can also copy from it. It's more to do with how the internals of the OS work: the API calls into NTDLL.DLL, and some procedures within that call even lower-level DLLs, so the best you can get from it is a one-level reduction in the call layers. Back in the Win9x days the technique seemed to work best on GDI calls, but that was for a simple reason: a lot of Win9x GDI was written in MASM.
dedndave
ok - i isolated the bad guy....

Quote from dedndave: but this code does not work

        jmp short test01

test00: INVOKE  GetCurrentProcess
labelA  label   dword
        exit

test01: mov     esi,labelA-4
        mov     esi,labelA[esi+2]
        mov     eax,[esi]
        sub     eax,offset labelA
        mov     labelA-4,eax
        jmp     test00

this line assembles fine, but crashes the program

        mov     labelA-4,eax

i made a temporary work-around by placing the "labelA-4" address in esi, then

        mov     [esi],eax

that crashes also
i am working on that
now, i need somebody really sharp to tell me about re-based PE's - lol
Vortex maybe ?
here is my question....
if the PE gets re-based at load-time, do these tables become far jumps ?
and another question.....
is there a way to force an exe to be re-based for testing purposes ?
Neo
This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:
- You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)
- The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables. Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called. Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.
- You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.
All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on not using jump tables being slightly more efficient overall. Anyone up for a tough performance testing challenge?
P.S. There are currently other issues with importing libraries other than kernel32/user32/gdi32 in Inventor IDE (it is just an alpha after all), so it's far from perfect. I'm just using it as an example w.r.t. jump tables versus no jump tables.
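The space argument in the first bullet can be checked with a little arithmetic. The sketch below is a rough model, not a measurement: it counts code bytes only, using the instruction sizes visible in the disassembly listings in this thread (E8 rel32 is 5 bytes, FF 15 abs32 is 6 bytes, FF 25 abs32 is 6 bytes), and it ignores relocation entries and alignment padding. The function names are made up for illustration.

```python
CALL_REL32_SIZE = 5       # E8 rel32: call into the jump table
CALL_INDIRECT_SIZE = 6    # FF 15 abs32: call dword ptr [IAT slot]
JMP_INDIRECT_SIZE = 6     # FF 25 abs32: one jump-table stub per import

def size_with_jump_table(calls_per_import, imports=1):
    """Code bytes when every call site goes through a shared stub."""
    return imports * (JMP_INDIRECT_SIZE + calls_per_import * CALL_REL32_SIZE)

def size_without_jump_table(calls_per_import, imports=1):
    """Code bytes when every call site calls through the IAT directly."""
    return imports * calls_per_import * CALL_INDIRECT_SIZE

for n in range(1, 9):
    print(n, size_with_jump_table(n), size_without_jump_table(n))
```

Under these assumptions the two layouts tie at six call sites per import, and the jump table only comes out ahead from the seventh call on, which is in line with the ">5 times" figure above.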
BogdanOntanu
Global Moderator, Member
Posts: 1154
Quote from Neo: This is a bit of a tangent, but with the assembler I've built into Inventor IDE, I don't generate jump tables, since despite Bogdan's explanation of efficiency:

It is my own preference to have jump tables. I do not claim "efficiency" at run time. I claim that it does not matter much at run time. I would add an option to disable jump table generation for my own assembler if this makes users happy.

Quote from Neo: - You need to call each imported function an average of >5 times before jump tables are more space efficient in terms of the code alone (or at all if the executable isn't relocatable.)

Yes, this is true but not related to relocations... it is related to the code size of a relative jump versus an absolute indirect jump.

Quote from Neo: - The extra relocations are only there when you specify that you want the executable/library to be relocatable, which isn't the default for executables.

But it is the default and needed for DLL's. Besides run-time or load-time relocations there is another kind of relocation that is generated inside the OBJ. Unlike the run-time kind, those compile-time relocations are mandatory if you generate OBJ's and link multiple modules. It will take one such relocation for each API call in an executable with no jump table. With jump tables it will only take one for each API used. Hence compilation and linking speed is helped here, and this was my primary concern since I create huge ASM projects.

Quote from Neo: ... Plus, relocations are only resolved upon starting the application, whereas the extra jump is done every time an import is called. Each import appears only once in the Import Address Table and Import Lookup Table, regardless of whether there are any relocations or not.

Yes, true, but once you call an API speed is no longer of the essence. Yes, each import only appears once in the IAT BUT each direct call requires a run-time relocation (in a DLL).

Quote from Neo: - You don't need the standard lib files at all if the assembler knows that it can call the functions through the import table, which is why you don't need any lib files to assemble Windows apps with Inventor IDE.

Ok, this is nice advertising for your Inventor IDE... I will check it out. Is it written in full ASM? FYI Sol_Asm does not require any kind of libs when directly creating an Executable/DLL/binary. Neither does FASM or NASM AFAIK etc... In fact neither does MASM for generating the OBJ... The libs are only needed by the linker when it links multiple OBJ's. This feature is in no way related to the subject. However, an assembler that can NOT produce OBJ's to be linked together by a linker is missing a huge feature. "Most" professional projects out there involve generating OBJ's and then linking them together to create the final executable.
After all, the jump table method is also calling through the very same import table. It is the API calls in code that are relative in one case and absolute indirect in the other, but both methods reach the very same IAT in the end.

Quote from Neo: All that said, I really don't think there's a huge performance difference (since imported functions are usually pretty lengthy anyway), but if I had to choose, my money would be on that not using jump tables is slightly more efficient overall.

I prefer jump tables because the run-time speed improvement is not worth it in this case, the size of the executable/DLL is potentially smaller, the compilation speed is higher, the load time is faster and the OBJ size is smaller. If I want speed then I choose better algorithms and write my own functions to reduce API overhead, but I do not try to optimize every opcode/byte/cycle. However, this is my personal preference.
« Last Edit: May 31, 2009, 08:22:29 AM by BogdanOntanu »
Ambition is a lame excuse for the ones not brave enough to be lazy. http://www.oby.ro
BogdanOntanu
Quote from dedndave: ... now, i need somebody really sharp to tell me about re-based PE's - lol Vortex maybe ? here is my question.... if the PE gets re-based at load-time, do these tables become far jumps ? and another question..... is there a way to force an exe to be re-based for testing purposes ?

By "re-based" I guess you mean relocated at run time. There is another tool named exactly "rebase" that can "cold" change the preferred load address of an executable or DLL after compile time.

Quote from dedndave: if the PE gets re-based at load-time, do these tables become far jumps ?

No. Everything that is absolute must be relocated in this case, but the jumps remain "near". There is no use for "far" jumps in normal user-mode Win32 programming. Everything is near in flat protected mode (Win32), but some addresses in code are absolute (not relative) and those addresses need to be changed IF the base address is changed.

Quote from dedndave: is there a way to force an exe to be re-based for testing purposes ?

EXE's are rarely (if ever) relocated in Win32, only if they are DLL's in disguise or plugins to be loaded by another EXE. The default load / base address of a PE EXE is normally free at the EXE's load time. However, DLL's are often relocated because you cannot be sure of the load order and memory position of all DLL's needed for an EXE / process.
One way to force a run-time relocation to occur is to have 2 DLL's compiled for the very same preferred base address and then load them by hand one after another. The second one must be relocated by the OS loader because its address space is already occupied by the first DLL.
Another method would be to compile an EXE for a preferred address (other than the default 0x40_0000) that is already in use by the OS.
If the PE EXE has run-time relocations stored inside then you can use that "re-base" tool to change its base address even after compile time.
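To make the "absolute must be relocated, relative stays as it is" point concrete, here is a small Python sketch of what a loader does with one fixup. It is only a model under stated assumptions: the new base `NEW_BASE` is hypothetical, the fixup list is written by hand instead of being parsed from a real .reloc section, and the IAT slot value 00404034h is borrowed from the PostMessageA disassembly later in this thread.

```python
import struct

def apply_relocations(image, fixup_offsets, old_base, new_base):
    """Add the base delta to every absolute dword named in fixup_offsets."""
    delta = new_base - old_base
    patched = bytearray(image)
    for off in fixup_offsets:
        (value,) = struct.unpack_from("<I", patched, off)
        struct.pack_into("<I", patched, off, (value + delta) & 0xFFFFFFFF)
    return bytes(patched)

OLD_BASE = 0x00400000     # default preferred base of an EXE
NEW_BASE = 0x00A50000     # hypothetical address chosen by the loader

# A relative call (E8 rel32) followed by a jump-table stub (FF 25 abs32).
image = bytes.fromhex("E800000000" "FF2534404000")
fixups = [7]              # only the stub's 32-bit operand is absolute

relocated = apply_relocations(image, fixups, OLD_BASE, NEW_BASE)
print(image.hex())
print(relocated.hex())
```

The call bytes and both opcodes come out unchanged; only the stub's operand moves by the base delta, so the jump is still the same near indirect JMP after relocation.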
jj2007
Quote from dedndave: at the location of the invoke is a CALL relative that branches to a JMP dword ptr [nnnnnnnn] indirect the value at that nnnnnnnn address is the address of the api function

Here is the simplest variant for calling with a register:

include \masm32\include\masm32rt.inc

.code
start:
    mov esi, MessageBox
    push MB_OK
    push chr$("Hello")
    push chr$("Called via esi")
    push 0
    call esi

    exit

end start

Slightly more sophisticated:

include \masm32\include\masm32rt.inc

MBox = 0
Exit = 4

.data
MyJumpTable dd MessageBox, ExitProcess

.code
start:
    mov esi, offset MyJumpTable
    push MB_OK
    push chr$("Hello")
    push chr$("Called via esi")
    push 0
    call dword ptr [esi+MBox]
    push 0
    call dword ptr [esi+Exit]

end start

But whether that is more efficient... no idea
UtillMasm
very clean, i like this more:

comment #
@echo off
\masm32\bin\ml.exe /c /coff /Focall2.obj /nologo call2.asm
\masm32\bin\link.exe /subsystem:windows /out:call2.exe call2.obj /nologo
pause
#
include \masm32\include\masm32rt.inc
MBox=0
Exit=4
.data
MyJumpTable dd MessageBox,ExitProcess
.code
start:
mov esi,offset MyJumpTable
push MB_OK
push chr$("Hello")
push chr$("Called via esi")
push 0
call dword ptr[esi+MBox]
push 0
call dword ptr[esi+Exit]
end start

and like radasm msg jump table too.
Vortex
Raider of the lost code, Member
Posts: 3460
Hi dedndave,

Quote from dedndave: is there a way to force an exe to be re-based for testing purposes ?

You may want to have a look at the thread "Loading and running EXEs and DLLs from memory". The EXE\DLL is loaded to a memory address allocated by VirtualAlloc.
hutch--
First, there is more to an API call than the difference between a direct call in code and an indirect call through an address table. A call outside the running app memory space is measurably slower than an internal call where a direct JMP to an address usually is not. For the indirect method you get a fast call and a fast JMP, with the direct method you get a slow call. The example I have picked is SendMessageA which gets bashed in an app a massive number of times which justifies it being placed in an address table to save space and usually be in cache. I doubt you could successfully benchmark the difference but indirect calls never went slower than the direct call.

; ------------------------------------------------------------------------

004011FB 6A00                   push    0
004011FD 6A02                   push    2
004011FF 6811010000             push    111h
00401204 FF3550304000           push    dword ptr [403050h]
0040120A E86D000000             call    jmp_SendMessageA

jmp_SendMessageA:
                                jmp     dword ptr [SendMessageA]

; ------------------------------------------------------------------------

00401212 6A00                   push    0
00401214 6A02                   push    2
00401216 6811010000             push    111h
0040121B FF3550304000           push    dword ptr [403050h]
00401221 FF1518204000           call    dword ptr [SendMessageA]

; ------------------------------------------------------------------------
jj2007
Since we are all in brainstorming mode now, here is one more idea to play with:

include \masm32\include\masm32rt.inc

MBox = 121
Exit = 0

.data
MyJumpTable dd ExitProcess
            dd 120 dup(0)       ; 120 slots for other API's
            dd MessageBox

.data?
RetAdd dd ?
ChkEsp dd ?

.code
start:
    mov ChkEsp, esp

    mov esi, Scheduler

    push MB_OK
    push chr$("Hello")
    push chr$("Called via esi")
    push 0
    push MBox                   ; MBox = 121
    call esi                    ; works but is only one byte shorter

    invoke MessageBox, 0, chr$("The conventional way"), chr$("Title"), MB_OK

    sub ChkEsp, esp
    MsgBox 0, str$(ChkEsp), "Esp diff=0?", MB_OK

    push 0                      ; ret 0
    push 0                      ; Exit
    call esi

Scheduler proc
    pop RetAdd
    pop eax
    lea eax, [MyJumpTable+4*eax]
    call dword ptr [eax]
    jmp RetAdd
Scheduler endp

end start

It works, it's probably utterly slow, but for code size freaks it might be interesting 
dedndave
this code works

        mov     esi,labelA-4            ;get the relative address from INVOKE
        mov     esi,labelA[esi+2]       ;get the address part of the indirect JMP
        mov     eax,[esi]               ;get the API target referred to in the JMP
        call    eax
        exit

        INVOKE  GetCurrentProcess
labelA  label   dword

notice that the IAT method takes:
 4 bytes in the INVOKE code
 6 bytes for the indirect JMP
 4 more bytes for the target
-------------------
14 bytes total

and, while it may be true that "CALL reg" direct may be faster than "CALL near rel",
let's not forget that we have to get the target address into the register to begin with
i think, overall, the fastest would be "CALL near rel" (E8 nn nn nn nn),
which is what the INVOKE currently uses
if we eliminate the IAT table, as well as the target address, we reduce the byte-count by 10
reducing bytes is nice, but let's face it, not a big issue with today's storage sizes
if i have 100 different API calls, that's only 1 KB - not an issue

the only problem i am having at the moment is that the OS will not let me over-write the
4 bytes in the CALL instruction of the INVOKE sequence
i suspect that this is a write protection fault, for obvious security reasons
because i intend to replace the operand in only a few select places, i can work around this
by using something other than an INVOKE or CALL
need be, i can hard code it like this:

        db      0E8h
labelB  db      4 dup(?)

and fill it in during initialization

while it is true that most API calls are slow to begin with, there are a few that are relatively fast
i would have to think that QueryPerformanceCounter is fairly fast, as an example, because there
isn't a lot of decision-making to be done - just gimme 2 dword values

as i mentioned before, i am interested in synchronizing threads with the "highest resolution possible"
i am trying to develop a technique for timing evaluation code on single/multi core machines
the idea is, to have one thread perform the timing operation, while another thread runs the code
the eval code thread needs to be ready to run, then once the time-keeping thread has read
its initial timer value, it will release the eval code thread for execution
the reason for the dual-thread method is that some machines have more than one core
on those machines, the TSC needs to be run with a process affinity mask of only one selected core
the eval thread can be run with all cores selected, or whatever the test calls for
i am trying to keep the overhead of the SetProcessAffinityMask function out of the evaluation measurement
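As a rough illustration of what "fill it in during initialization" has to compute, the following Python sketch builds the five bytes of a CALL near rel (E8 followed by a 32-bit displacement measured from the end of the instruction) and decodes them again. The helper names are made up, the example addresses are taken from the two disassembly listings in this thread, and the sketch only does the arithmetic; it does not deal with the write-protection problem itself.

```python
import struct

def encode_call_rel32(call_addr, target):
    """Bytes of a CALL near rel placed at call_addr that reaches target."""
    next_instr = call_addr + 5                  # displacement counts from here
    rel32 = (target - next_instr) & 0xFFFFFFFF  # wrap negative displacements
    return b"\xE8" + struct.pack("<I", rel32)

def decode_call_rel32(call_addr, code):
    """Recover the target of a CALL near rel located at call_addr."""
    if code[0] != 0xE8:
        raise ValueError("not a CALL near rel")
    (rel32,) = struct.unpack("<i", code[1:5])   # signed displacement
    return (call_addr + 5 + rel32) & 0xFFFFFFFF

# Call site 0040120Ah from the SendMessageA listing; with its displacement of
# 6Dh the stub must sit at 0040127Ch.
patch = encode_call_rel32(0x0040120A, 0x0040127C)
print(patch.hex())

# Call site 0040106Ch from the PostMessageA listing.
print(hex(decode_call_rel32(0x0040106C, bytes.fromhex("E827270000"))))
```

The displacement is the same quantity that `sub eax,offset labelA` produces in the code that crashes, since labelA is the address of the instruction after the CALL.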
MichaelW
Global Moderator, Member
Posts: 5161
I’m not sure that all of this is correct. I selected PostMessage instead of SendMessage because PostMessage returns immediately without waiting for the window procedure to process the message. If the cycle count is not more than a few hundred cycles my P3 normally returns very consistent counts. I can’t get consistent results here, partly because the cycle counts are too high, and I think partly because the called function has a variable execution time. In any case, under Windows 2000 I can see no significant difference (or if there is, it's smaller than the variation).

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
        hwndTarget dd 0
        itotal     dd 0
        dtotal     dd 0
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    EXTERNDEF _imp__PostMessageA@16:NEAR PTR

    invoke FindWindow, NULL, chr$("TARGET")
    mov hwndTarget, eax
    print ustr$(hwndTarget),13,10,13,10
    .IF hwndTarget

      nops 3

      push 0
      push 0
      push WM_NULL
      push hwndTarget
      call _imp__PostMessageA@16

      nops 3

      invoke PostMessage, hwndTarget, WM_NULL, 0, 0

      nops 3

      print "direct indirect",13,10
      print "------ -------- ",13,10

      invoke Sleep, 4000

      REPEAT 20

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          push 0
          push 0
          push WM_NULL
          push hwndTarget
          call _imp__PostMessageA@16
        counter_end
        add dtotal, eax
        print ustr$(eax),9

        counter_begin 1000, REALTIME_PRIORITY_CLASS
          invoke PostMessage, hwndTarget, WM_NULL, 0, 0
        counter_end
        add itotal, eax
        print ustr$(eax),13,10

      ENDM

      print "------ -------- ",13,10
      print ustr$(dtotal), 9
      print ustr$(itotal),13,10,13,10

    .ENDIF

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

00401048 90                     nop
00401049 90                     nop
0040104A 90                     nop
0040104B 6A00                   push    0
0040104D 6A00                   push    0
0040104F 6A00                   push    0
00401051 FF3500504000           push    dword ptr [405000h]
00401057 FF1534404000           call    dword ptr [PostMessageA]
0040105D 90                     nop
0040105E 90                     nop
0040105F 90                     nop
00401060 6A00                   push    0
00401062 6A00                   push    0
00401064 6A00                   push    0
00401066 FF3500504000           push    dword ptr [405000h]
0040106C E827270000             call    fn_00403798
00401071 90                     nop
00401072 90                     nop
00401073 90                     nop
. . .
00403798                    fn_00403798:
00403798 FF2534404000           jmp     dword ptr [PostMessageA]

Typical results on my P3:

direct  indirect
------  --------
1332    1206
1202    1198
1191    1204
1189    1192
1192    1206
1192    1199
1190    1252
1189    1196
1210    1202
1190    1192
1197    1209
1201    1196
1190    1201
1188    1193
1190    1206
1190    1192
1191    1201
1187    1191
1192    1203
1190    1191
------  --------
23993   24030
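To put a number on "no significant difference", the figures above can be summarised with a few lines of Python. This is only a sketch over the twenty pairs as posted. Using the median next to the mean is my own choice, on the assumption that the first direct sample (1332) and the 1252 in the indirect column are one-off disturbances and not typical values.

```python
from statistics import mean, median

# Cycle counts copied from the results table, in run order.
direct = [1332, 1202, 1191, 1189, 1192, 1192, 1190, 1189, 1210, 1190,
          1197, 1201, 1190, 1188, 1190, 1190, 1191, 1187, 1192, 1190]
indirect = [1206, 1198, 1204, 1192, 1206, 1199, 1252, 1196, 1202, 1192,
            1209, 1196, 1201, 1193, 1206, 1192, 1201, 1191, 1203, 1191]

total_direct = sum(direct)
total_indirect = sum(indirect)
rel_diff = (total_indirect - total_direct) / total_direct

print("totals :", total_direct, total_indirect)
print("means  :", mean(direct), mean(indirect))
print("medians:", median(direct), median(indirect))
print("spread :", max(direct) - min(direct), max(indirect) - min(indirect))
print("relative difference of totals: {:.3%}".format(rel_diff))
```

The totals reproduce the last row of the table, and the gap between the two columns is a small fraction of the run-to-run spread within either column, which matches the conclusion drawn in the post.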
eschew obfuscation