
Reading raw debug symbol data

Started by donkey, March 28, 2010, 07:34:12 PM


donkey

I need to obtain the symbols from a GoAsm executable file and have been trying to figure out how to decode the raw symbol table data. Finding the data is simple enough, but I can't figure out the header information. As far as I know (and I could be wrong) the symbols are stored in IMAGE_SYMBOL structures; if a symbol name is longer than 8 bytes, the name union instead holds an offset into the symbol name table. That would mean that these offsets or names should be found every 18 bytes (SIZEOF IMAGE_SYMBOL), but the first recognizable symbol is not at an even multiple of 18, so there has to be some header data at the beginning of the raw data. Below are the definition from winnt.h for the IMAGE_SYMBOL structure and a dump of the raw symbol data from my test program.

In order to obtain a pointer to the symbol data I used the following:

// Get a pointer to IMAGE_NT_HEADERS

mov eax,[pMapFile]
mov edi,[eax+IMAGE_DOS_HEADER.e_lfanew]
add edi,[pMapFile]

// Store the offset of the DataDirectory in EBX
lea ebx,[edi+IMAGE_NT_HEADERS.OptionalHeader.DataDirectory]
// Get the debug directory entry (index 6, i.e. the 7th entry)
// SIZEOF IMAGE_DATA_DIRECTORY is always 8 bytes, so the entry is at 6 * 8
mov eax,6
shl eax,3

add ebx,eax

// EBX now points to the IMAGE_DATA_DIRECTORY entry for the debug directory
// Convert the RVA to a file position
invoke RVAToFilePos,[pMapFile],[ebx+IMAGE_DATA_DIRECTORY.VirtualAddress]
mov esi,[pMapFile]
add esi,eax

mov eax,[ebx+IMAGE_DATA_DIRECTORY.Size]
mov ecx,SIZEOF IMAGE_DEBUG_DIRECTORY
xor edx,edx
div ecx

// EAX contains the number of IMAGE_DEBUG_DIRECTORY entries
// ESI now points to the IMAGE_DEBUG_DIRECTORY entry for the PE
// get a pointer to the raw data in the file

mov ebx,[esi+IMAGE_DEBUG_DIRECTORY.SizeOfData]
lea eax,[esi+IMAGE_DEBUG_DIRECTORY.PointerToRawData]
mov esi,[eax]
add esi,[pMapFile]

// ESI contains a memory pointer to the base of the raw data
// EBX contains the size of the data
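
RVAToFilePos is a helper whose source is not shown in this post. As a rough sketch of what such a routine usually does (written in Python for brevity, not the GoAsm original), it finds the section that contains the RVA and rebases the RVA onto that section's raw file pointer. The name rva_to_file_pos and the section values are made up for illustration:

```python
# Hypothetical stand-in for the RVAToFilePos helper used above.
# Each section is (VirtualAddress, PointerToRawData, SizeOfRawData),
# three of the fields found in every IMAGE_SECTION_HEADER.

def rva_to_file_pos(rva, sections):
    """Map an RVA to a file offset, or return None if no section backs it."""
    for virt_addr, raw_ptr, raw_size in sections:
        # Only bytes that exist in the file can be mapped to a file offset.
        if virt_addr <= rva < virt_addr + raw_size:
            return rva - virt_addr + raw_ptr
    return None

# Made-up section table: code at RVA 0x1000, data at RVA 0x2000.
example_sections = [
    (0x1000, 0x0400, 0x0200),
    (0x2000, 0x0600, 0x0800),
]

print(hex(rva_to_file_pos(0x2050, example_sections)))
```

Returning None for an unmapped RVA keeps the caller from adding a bogus offset to the mapping base.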


IMAGE_SYMBOL STRUCT
UNION
ShortName DB 8 DUP
Name STRUCT
Short DD
Long DD
ENDS
LongName DD 2 DUP
ENDUNION
Value DD
SectionNumber DW
Type DW
StorageClass DB
NumberOfAuxSymbols DB
ENDS


00380E00:  0B 00 00 00-20 00 00 00-00 00 00 00-00 00 00 00   .... ...........
00380E10:  00 10 00 00-00 12 00 00-00 20 00 00-00 28 00 00   ......... ...(..
00380E20:  00 00 00 00-04 00 00 00-50 20 00 00-02 00 00 00   ........P ......
00380E30:  02 00 73 7A-54 65 73 74-00 00 54 20-00 00 02 00   ..szTest..T ....
00380E40:  00 00 02 00-53 54 41 52-54 00 00 00-00 10 00 00   ....START.......
00380E50:  01 00 00 00-02 00 00 00-00 00 0E 00-00 00 00 20   ...............
00380E60:  00 00 02 00-00 00 02 00-00 00 00 00-1D 00 00 00   ................
00380E70:  00 20 00 00-02 00 00 00-02 00 00 00-00 00 31 00   . ............1.
00380E80:  00 00 1C 20-00 00 02 00-00 00 02 00-44 6C 67 50   ... ........DlgP
00380E90:  72 6F 63 00-E2 10 00 00-01 00 00 00-02 00 00 00   roc.â...........
00380EA0:  00 00 45 00-00 00 F0 10-00 00 01 00-00 00 02 00   ..E...ð.........
00380EB0:  00 00 00 00-58 00 00 00-14 11 00 00-01 00 00 00   ....X...........
00380EC0:  02 00 00 00-00 00 65 00-00 00 FB 10-00 00 01 00   ......e...û.....
00380ED0:  00 00 02 00-00 00 00 00-76 00 00 00-0B 11 00 00   ........v.......
00380EE0:  01 00 00 00-02 00 86 00-00 00 68 49-6E 73 74 61   ......†...hInsta
00380EF0:  6E 63 65 00-45 78 63 65-70 74 69 6F-6E 41 72 67   nce.ExceptionArg
00380F00:  73 31 00 45-78 63 65 70-74 69 6F 6E-41 72 67 73   s1.ExceptionArgs
00380F10:  31 2E 4E 61-6D 65 00 45-78 63 65 70-74 69 6F 6E   1.Name.Exception
00380F20:  41 72 67 73-31 2E 74 79-70 65 00 44-6C 67 50 72   Args1.type.DlgPr
00380F30:  6F 63 2E 57-4D 5F 43 4F-4D 4D 41 4E-44 00 44 6C   oc.WM_COMMAND.Dl
00380F40:  67 50 72 6F-63 2E 45 58-49 54 00 44-6C 67 50 72   gProc.EXIT.DlgPr
00380F50:  6F 63 2E 57-4D 5F 43 4C-4F 53 45 00-44 6C 67 50   oc.WM_CLOSE.DlgP
00380F60:  72 6F 63 2E-44 45 46 50-52 4F 43 00-              roc.DEFPROC.


As you can see, the first easily recognized symbol is szTest. Using that as a jumping-off point, it appears that the 18-byte rule holds, but since szTest is at offset 50 in the data (one record past offset 32) there must be a 32-byte header, which might indicate that the 2nd DWORD in the data (0x20) points to the first symbol. To verify that the structure seems correct, I look at the START symbol and count ahead 18 bytes, where I find the value 00 00-00 00 0E 00-00 00. If I take the start of the names section to be the 0x86 (†), which seems to be right as it is the number of bytes in the names section, I arrive at ExceptionArgs1 exactly 0E (IMAGE_SYMBOL.Name.Long) into the names, which would appear to be correct. Counting backward 36 bytes (two records) from START I find the data 00 00 00 00-04 00 00 00, the IMAGE_SYMBOL.Name.Long offset of hInstance, the first symbol in the table. With multiple verifications I can be pretty confident that I have chosen the right struct to decode the entries.
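
The same check can be run mechanically. The sketch below is Python rather than GoAsm, purely so the arithmetic is easy to verify; it unpacks the three records found at offsets 0x20, 0x32 and 0x44 of the dump and resolves the one long name. The helper name symbol_name is an assumption of the sketch, and only the first two strings of the names section are reproduced:

```python
import struct

# IMAGE_SYMBOL: 8-byte name union, Value, SectionNumber, Type,
# StorageClass, NumberOfAuxSymbols = 18 bytes, little-endian, unpadded.
IMAGE_SYMBOL = struct.Struct("<8sIhHBB")

# Records copied from the dump at offsets 0x20, 0x32 and 0x44 of the raw data.
records = bytes.fromhex(
    "00000000 04000000 50200000 0200 0000 02 00"   # long name, offset 4
    "737A546573740000 54200000 0200 0000 02 00"    # "szTest"
    "5354415254000000 00100000 0100 0000 02 00"    # "START"
)

# Fragment of the names section seen at 0xE6: a DWORD size, then the strings.
names = struct.pack("<I", 0x86) + b"hInstance\x00ExceptionArgs1\x00"

def symbol_name(raw_name, names):
    """Return the name held in (or referenced by) the 8-byte name union."""
    short, long_offset = struct.unpack("<II", raw_name)
    if short == 0:  # Name.Short == 0 means Name.Long is an offset into names
        end = names.index(b"\x00", long_offset)
        return names[long_offset:end].decode("ascii")
    return raw_name.rstrip(b"\x00").decode("ascii")

decoded = []
for offset in range(0, len(records), IMAGE_SYMBOL.size):
    raw_name, value, section, sym_type, storage, aux = IMAGE_SYMBOL.unpack_from(
        records, offset)
    decoded.append((symbol_name(raw_name, names), value, section))

for name, value, section in decoded:
    print(f"{name:<10} value={value:#06x} section={section}")
```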

So, does anyone know the structure of the data in the 32-byte header? Mainly I am looking to obtain the offset to the names section (0xE6 in this case), or maybe an RVA.

Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

donkey

Mmmmm, IMAGE_COFF_SYMBOLS_HEADER might be it.

EDIT:
Yup, that's the answer to my question. IMAGE_COFF_SYMBOLS_HEADER is 32 bytes long and its second member is LvaToFirstSymbol, which is what I expected it to be. The LVA of the names section is not given, but it is easily calculated using the following:

mov eax,[esi+IMAGE_COFF_SYMBOLS_HEADER.NumberOfSymbols]
mov ecx, SIZEOF IMAGE_SYMBOL
mul ecx


EAX holds the offset to the names section from the first symbol entry. Now just to find their addresses and values :)
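
Checked against the dump in the first post, the header decodes cleanly as eight DWORDs. The following is a small Python sketch (Python only so the numbers are easy to verify) that unpacks the first 32 bytes of that dump and repeats the multiplication above; the field names are those of IMAGE_COFF_SYMBOLS_HEADER in winnt.h:

```python
import struct

# IMAGE_COFF_SYMBOLS_HEADER is eight DWORDs (32 bytes).
FIELDS = (
    "NumberOfSymbols", "LvaToFirstSymbol",
    "NumberOfLinenumbers", "LvaToFirstLinenumber",
    "RvaToFirstByteOfCode", "RvaToLastByteOfCode",
    "RvaToFirstByteOfData", "RvaToLastByteOfData",
)
SIZEOF_IMAGE_SYMBOL = 18

# First two rows of the dump (0x380E00 to 0x380E1F).
raw_header = bytes.fromhex(
    "0B000000 20000000 00000000 00000000"
    "00100000 00120000 00200000 00280000"
)

header = dict(zip(FIELDS, struct.unpack("<8I", raw_header)))

# The names section follows the symbol array directly.
names_offset = (header["LvaToFirstSymbol"]
                + header["NumberOfSymbols"] * SIZEOF_IMAGE_SYMBOL)

for field in FIELDS:
    print(f"{field:<22}{header[field]:#010x}")
print(f"names section at offset {names_offset:#x}")
```

With 11 symbols starting at 0x20, the names section lands at 0xE6, which is where the 0x86 size DWORD sits in the dump.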


donkey

Thanks Dave,

It looks like it will help a lot. I have spent a few hours reverse engineering the PE file so I could get the raw symbol data. I can already read the data and now understand much of it, but it should help (I hope) to assign relocations to the symbols so I can find them in the executable by address. All a part of my grand idea for a profiling tool :) An exceedingly interesting project.

Edgar

dedndave

it does sound interesting
i have thought some about the idea myself
although, i am a total n00b at windows gui apps - lol
but, i have been trying to do some of the low-level stuff like ID'ing processors (incl frequency) and OS's, etc
Hutch has detailed available memory and disk space for us
Michael has given us some insight to timing
but, to get into ring 0 and use the performance counters is the next big step - a little over my head   :bg

someplace, i found a nice link that might be helpful - let me see if i can find it...

...well - this is one - i thought i had a more interesting one - maybe i saved the whole url page - i will keep my eyes open for it

http://perfinsp.sourceforge.net/

donkey

Thanks Dave, not too worried about accurate timing yet, I just have to get the addresses of the code symbols figured out. BTW the above code was the long way to do it; once you have figured everything out and are sure, it boils down to this:

invoke Dbghelp.dll:ImageNtHeader,[pMapFile]
mov edi,eax

mov ebx,[edi+IMAGE_NT_HEADERS.FileHeader.PointerToSymbolTable]
add ebx,[pMapFile]

mov eax,[edi+IMAGE_NT_HEADERS.FileHeader.NumberOfSymbols]
mov esi,eax
mov ecx,SIZEOF IMAGE_SYMBOL
mul ecx
mov edi,eax
add edi,ebx


EDI contains the memory address of the names
EBX contains the base address of the array of IMAGE_SYMBOL structures
ESI contains the number of entries

Pretty easy to decode after that.
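
For comparison, here is the same shortcut as a Python sketch that reads the two IMAGE_FILE_HEADER fields straight from the file bytes instead of calling ImageNtHeader. The function name locate_coff_symbols and the synthetic header values are assumptions made for the example, not taken from the test program:

```python
import struct

SIZEOF_IMAGE_SYMBOL = 18

def locate_coff_symbols(image):
    """Return (symbol_table_offset, symbol_count, names_offset) for a PE file."""
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)  # IMAGE_DOS_HEADER.e_lfanew
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # IMAGE_FILE_HEADER follows the 4-byte signature; PointerToSymbolTable and
    # NumberOfSymbols are the DWORDs at offsets 8 and 12 within it.
    symtab, count = struct.unpack_from("<II", image, e_lfanew + 4 + 8)
    return symtab, count, symtab + count * SIZEOF_IMAGE_SYMBOL

# Synthetic headers only (made-up values): e_lfanew = 0x80, 11 symbols at 0xE20.
image = bytearray(0x100)
image[0:2] = b"MZ"
struct.pack_into("<I", image, 0x3C, 0x80)
image[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHIII", image, 0x84, 0x014C, 2, 0, 0x0E20, 11)

symtab, count, names = locate_coff_symbols(bytes(image))
print(hex(symtab), count, hex(names))
```

All three results are file offsets, so the base of the mapping still has to be added, as the ADD instructions above do with pMapFile.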

Edgar

dedndave

well - the link i remember (or don't remember) had a lot of great info about programming the performance registers
it was over my head, of course, but i found it interesting

clive

The COFF debug data isn't particularly rich with detail, you'll have to pull any section and relocation information out of the PE structure itself.

The CodeView and PDB files have significantly more information, but again you'd have to leverage data from the PE file as a whole. There is some pre-link segmentation data that comes from the object files, and this gets condensed into the PE sections, via a segment table. Another area of complication with Microsoft system files is the use of post-link optimization mapping (OMAP) where stuff is moved around, and the debug information doesn't get updated.

DumpPE should give a rough translation of the debug data within the file.

-Clive
It could be a random act of randomness. Those happen a lot as well.

donkey

CodeView and PDB are not an option, GoAsm embeds debug symbols in the file, I am currently trying to get the DbgHelp API functions working but not having much luck enumerating the symbols with SymEnumSymbols

The call from the CREATE_PROCESS_DEBUG_EVENT handler:

invoke EnumerateSymbols,[dbe.u.CreateProcessInfo.hProcess],[dbe.u.CreateProcessInfo.lpBaseOfImage]

My symbol enumerator:
EnumerateSymbols FRAME hProcess, ImageBase
LOCAL SymInfo:SYMBOL_INFO
LOCAL ProcessPath[MAX_PATH]:%CHAR

mov D[ProcessPath],0

invoke SetLastError,0
invoke GetProcessImageFileName ,[hProcess],offset ProcessPath,MAX_PATH

invoke SymInitialize,[hProcess], offset ProcessPath, FALSE

// BaseOfDll is a 64 bit number passed as 2 DWORDS ([ImageBase],0)
invoke SymEnumSymbols,[hProcess],[ImageBase],0,0,offset SymEnumSymbolsProc, NULL

invoke SymCleanup,[hProcess]
RET
ENDF

SymEnumSymbolsProc FRAME pSymInfo, SymbolSize, UserContext

mov eax,[pSymInfo]
add eax,SYMBOL_INFO.Name

PrintStringByAddr(eax)

mov eax,TRUE
RET
ENDF


SymInitialize, SymEnumSymbols and SymCleanup all return TRUE (successful), but the enumeration callback is never called, which leads me to believe that the DbgHelp API does not recognize the symbol data. I am not sure what is required but I am trying a few different things.

Edgar

donkey

Well, I have it working, sort of anyway:

The call from the CREATE_PROCESS_DEBUG_EVENT handler:

invoke EnumerateSymbols,[dbe.u.CreateProcessInfo.hFile],[dbe.u.CreateProcessInfo.hProcess],[dbe.u.CreateProcessInfo.lpBaseOfImage]

EnumerateSymbols FRAME hFile, hProcess, ImageBase
LOCAL fsh:%DWORD32
LOCAL fsl:%DWORD32
LOCAL ProcessPath[MAX_PATH]:%CHAR

mov D[ProcessPath],0

invoke SetLastError,0
invoke GetProcessImageFileName ,[hProcess],offset ProcessPath,MAX_PATH

invoke GetFileSize,[hFile],offset fsh
mov [fsl],eax

invoke SymInitialize,[hProcess], offset ProcessPath, FALSE

invoke SymLoadModuleEx,[hProcess],[hFile],offset ProcessPath,NULL,[ImageBase],0,[fsl],0,0

invoke SymEnumSymbols,[hProcess],[ImageBase],0,"*",offset SymEnumSymbolsProc, NULL

invoke SymUnloadModule64,[hProcess],[ImageBase],0

invoke SymCleanup,[hProcess]

RET
ENDF

SymEnumSymbolsProc FRAME pSymInfo, SymbolSize, UserContext
mov eax,[pSymInfo]
add eax,4

mov edx,[eax+SYMBOL_INFO.Address]
add eax,SYMBOL_INFO.Name
invoke AddSymbol,[hSymbolListview],eax,edx
mov eax,TRUE
RET
ENDF


This will enumerate the symbols, and everything is perfect as long as I add 4 to the pSymInfo address in the callback. I can't figure that one out at all, but the structure's address is actually 4 bytes above the address in pSymInfo. Once I figured that out, which was no small puzzle, everything fell into place: I can get the address of the symbol, its name and a lot of other information about the symbol. This is definitely the way to go, but I am worried that I'll be bitten in the ass because of the 4-byte offset thing.

Edgar

jj2007

Great stuff. I see getting bored is a good incentive for becoming really productive :green
So we see a variable & labels dump function at the horizon, right?
:bg

donkey

Quote from: jj2007 on March 29, 2010, 06:40:49 AM
Great stuff. I see getting bored is a good incentive for becoming really productive :green
So we see a variable & labels dump function at the horizon, right?
:bg

The tool will do that but it is going to be a profiling tool (time profiler) when it grows up, pick 2 labels and measure the time to execute the code between them. Frequency of calls to a particular procedure or references to a particular memory location etc...

clive

Quote from: donkey
CodeView and PDB are not an option, GoAsm embeds debug symbols in the file, I am currently trying to get the DbgHelp API functions working but not having much luck enumerating the symbols with SymEnumSymbols

Fair enough, the COFF stuff is pretty straightforward. Most of the relocation is done, you just need to add in the base address. The symbol records are nominally 0x12 bytes long (sizeof(IMAGE_SYMBOL)), but a symbol can spill into multiple records (NumberOfAuxSymbols). Short symbol names are stored within the symbol record (8 chars), while long ones are indexed into the appended symbol table (at sizeof(IMAGE_SYMBOL) * NumberOfSymbols + LvaToFirstSymbol). With GoAsm you probably don't have to worry about OMAPing.

From Testbug3.dll


Debug Entry

Chars    TimeDate Maj  Min  Type                   Size     AddrRaw  PtrRaw
-------- -------- ---- ---- ---------------------- -------- -------- --------
00000000 41587495 0000 0000 00000001 COFF          000000CA 00000000 00001000

COFF Debug Info Header

  NumberOfSymbols:      00000008
  LvaToFirstSymbol:     00000020
  NumberOfLinenumbers:  00000000
  LvaToFirstLinenumber: 00000000
  RvaToFirstByteOfCode: 00001000
  RvaToLastByteOfCode:  00001200
  RvaToFirstByteOfData: 00002000
  RvaToLastByteOfData:  00002A00

Val 00002000, Sec 0002, Typ 0000, Sto 02, Aux 00, DLLMESS1

Val 00002078, Sec 0002, Typ 0000, Sto 02, Aux 00, M1

Val 0000207F, Sec 0002, Typ 0000, Sto 02, Aux 00, DLLMESS2

Val 000020C0, Sec 0002, Typ 0000, Sto 02, Aux 00, DLLMESS3

Val 00002128, Sec 0002, Typ 0000, Sto 02, Aux 00, M2

Val 00001000, Sec 0001, Typ 0000, Sto 02, Aux 00, HEXROTATE4

Val 0000101F, Sec 0001, Typ 0000, Sto 02, Aux 00, START

Val 00001065, Sec 0001, Typ 0000, Sto 02, Aux 00, DLL_TEST3B
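
The numbers in this dump can be cross-checked with a few lines of Python (a sketch, not part of DumpPE). It applies the formula above to locate the long names, confirms that the total matches the Size column of the debug entry, and adds a base address to each Val. The image_base value is a made-up load address, and treating Val as an RVA is an assumption based on the values lining up with RvaToFirstByteOfCode and RvaToFirstByteOfData:

```python
SIZEOF_IMAGE_SYMBOL = 0x12

# Values from the Testbug3.dll dump above.
number_of_symbols = 0x08
lva_to_first_symbol = 0x20
debug_data_size = 0xCA
symbols = [
    ("DLLMESS1", 0x2000), ("M1", 0x2078), ("DLLMESS2", 0x207F),
    ("DLLMESS3", 0x20C0), ("M2", 0x2128), ("HEXROTATE4", 0x1000),
    ("START", 0x101F), ("DLL_TEST3B", 0x1065),
]

# Long names start right after the symbol array.
names_offset = lva_to_first_symbol + SIZEOF_IMAGE_SYMBOL * number_of_symbols

# Only names longer than 8 characters are stored in the names section, which
# is a 4-byte length followed by NUL-terminated strings.
long_names = [name for name, _ in symbols if len(name) > 8]
names_size = 4 + sum(len(name) + 1 for name in long_names)
print(f"names at {names_offset:#x}, total {names_offset + names_size:#x}")

# ASSUMPTION: hypothetical load address; use the real base of the loaded module.
image_base = 0x10000000
for name, value in symbols:
    print(f"{name:<12}{image_base + value:#010x}")
```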


-Clive

donkey

Hi clive,

Thanks. With a bit of RE work I got the original approach working, though in the end I decided to use DbgHelp to do the symbol extraction for me; that way, if I decide to support other formats it will be a minor addition. Also, the amount of information returned from SymEnumSymbols would have been a lot of work to duplicate. The actual code I used to read the raw debug symbol table is this:

GetDebugSymbolsFromFile FRAME pMapFile
uses edi,esi,ebx

// Get a pointer to IMAGE_NT_HEADERS
mov edi,[pMapFile]
add edi,[edi+IMAGE_DOS_HEADER.e_lfanew]

// Get a pointer to the COFF symbol table
mov ebx,[edi+IMAGE_NT_HEADERS.FileHeader.PointerToSymbolTable]
// EBX will hold the memory address of the symbol table
add ebx,[pMapFile]

// Get the number of symbols
mov eax,[edi+IMAGE_NT_HEADERS.FileHeader.NumberOfSymbols]
// ESI will hold the symbol count
mov esi,eax

// Calculate the size of the IMAGE_SYMBOL array
mov ecx,SIZEOF IMAGE_SYMBOL
mul ecx

// The long names are stored right after the IMAGE_SYMBOL array
// EDI will hold the memory address of the long names array
mov edi,eax
add edi,ebx

:
mov edx,ebx
mov ecx,[ebx+IMAGE_SYMBOL.Name.Long]
add ecx,edi
cmp B[ebx],0
cmovz edx,ecx
movzx eax,W[ebx+IMAGE_SYMBOL.SectionNumber]

// So how do I get the symbols address ?????
invoke AddSymbol,[hSymbolListview],edx,NULL

add ebx,SIZEOF IMAGE_SYMBOL
dec esi
jnz <

RET
ENDF


However, getting the VA for a symbol would have been quite a lot of detective work added on to a day of RE'ing the PE symbol format, so DbgHelp was the only solution that fit into the schedule I had (I have to do this around work and other responsibilities). I may still tackle the raw way to do it, but that is for another Sunday; in this project the symbols and VAs, though critical, are only a small part of the overall code. The 4-byte offset thing still has me worried though; those 4 bytes seem to always contain the DWORD 0x58.
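
One explanation that fits both observations (a hedged guess, not something confirmed in this thread): DbgHelp's SYMBOL_INFO contains 64-bit members, and with the 8-byte alignment a C compiler applies, 4 bytes of padding appear after Flags. sizeof(SYMBOL_INFO) then comes out as 0x58, which is the value SizeOfStruct, the first DWORD of the structure, would hold. A structure definition without that padding would place Address and Name 4 bytes too low, which adding 4 would happen to cancel. The Python sketch below only does the offset arithmetic; the member list follows the documented SYMBOL_INFO declaration, and the unpadded variant is a hypothetical header:

```python
# (name, size, alignment) for each SYMBOL_INFO member.
SYMBOL_INFO_FIELDS = [
    ("SizeOfStruct", 4, 4), ("TypeIndex", 4, 4), ("Reserved", 16, 8),
    ("Index", 4, 4), ("Size", 4, 4), ("ModBase", 8, 8), ("Flags", 4, 4),
    ("Value", 8, 8), ("Address", 8, 8), ("Register", 4, 4), ("Scope", 4, 4),
    ("Tag", 4, 4), ("NameLen", 4, 4), ("MaxNameLen", 4, 4), ("Name", 1, 1),
]

def layout(fields, aligned):
    """Return ({member: offset}, total_size) with or without natural alignment."""
    offsets, pos, max_align = {}, 0, 1
    for name, size, align in fields:
        if aligned:
            pos = (pos + align - 1) // align * align  # round up to the alignment
            max_align = max(max_align, align)
        offsets[name] = pos
        pos += size
    if aligned:
        pos = (pos + max_align - 1) // max_align * max_align  # tail padding
    return offsets, pos

c_offsets, c_size = layout(SYMBOL_INFO_FIELDS, aligned=True)
packed_offsets, packed_size = layout(SYMBOL_INFO_FIELDS, aligned=False)

print(f"sizeof with alignment: {c_size:#x}")
for member in ("Address", "Name"):
    print(member, c_offsets[member], packed_offsets[member])
```

If this is the cause, the callback pointer is correct as passed, and the fix belongs in the structure definition (a 4-byte pad after Flags) rather than in the ADD.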

Edgar

EDIT: tidied up the code a bit.

clive

I'd probably look at the NumberOfAuxSymbols being zero, and the SectionNumber not being 0x0000 or 0xFFFF, as a quick junk filter.

Also, when NumberOfAuxSymbols is nonzero you have to skip that many additional IMAGE_SYMBOL-sized records. Not sure if GoAsm would generate them, but regular COFF debug records from LINK do.

Most profiling type applications tend to instrument the prolog/epilog code, or use FPO records.
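
The junk filter and the aux-record skip described above could look roughly like this. It is a Python sketch over a synthetic table; the walk_symbols and record names and all of the sample records are invented for the example:

```python
import struct

# SectionNumber is read unsigned here so it can be compared against 0xFFFF.
IMAGE_SYMBOL = struct.Struct("<8sIHHBB")

def walk_symbols(table, count):
    """Yield (index, raw_name, value, section) for plausible primary records."""
    index = 0
    while index < count:
        raw_name, value, section, _type, _storage, aux = IMAGE_SYMBOL.unpack_from(
            table, index * IMAGE_SYMBOL.size)
        # Quick junk filter: no aux records, section neither 0x0000 nor 0xFFFF.
        if aux == 0 and section not in (0x0000, 0xFFFF):
            yield index, raw_name, value, section
        # Aux records occupy whole IMAGE_SYMBOL slots and are counted in
        # NumberOfSymbols, so step over them.
        index += 1 + aux

def record(name, value, section, storage, aux=0):
    """Build one 18-byte record for the synthetic table below."""
    return IMAGE_SYMBOL.pack(name, value, section, 0, storage, aux)

# Synthetic table: a .file record with one aux slot, then three symbols.
table = (
    record(b".file", 0, 0xFFFE, 0x67, aux=1)
    + b"test.asm".ljust(IMAGE_SYMBOL.size, b"\x00")   # the aux slot
    + record(b"START", 0x1000, 1, 2)
    + record(b"extern", 0, 0, 2)                      # undefined, filtered out
    + record(b"szTest", 0x2054, 2, 2)
)

kept = [(i, n.rstrip(b"\x00").decode("ascii"), v)
        for i, n, v, s in walk_symbols(table, 5)]
print(kept)
```

Without the skip, the aux slot holding "test.asm" would be decoded as if it were a symbol.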

-Clive