Building a symbol table

Started by ThoughtCriminal, December 25, 2004, 02:02:13 PM

Previous topic - Next topic

ThoughtCriminal

I understand that alot of compilers use hash tables for symbol managment.

What I was thinking was giving each symbol a 32-bit ID.  Keywords would be defined in code with a static value, perhaps starting with 7fffffffh building down.  When a new symbol is parsed I was think of just giving it an ID in the order it was created.  In other words the first symbol created would have an ID of 1.  Using IDs I think would be easier than doing a string compare any time I want to compare two symbols.  Now I have the problem of redefining a symbol:

foo      DW       10 (internal ID of 1)
.
.some code
.
foo      DQ        64000 (internal ID of 24)

foo could be considered a diffrent type, but it still has the same name.  I am begining to think there must be a point in the parsing where you must compare strings to other strings.  Perhaps I could store them in stables based on the first character to keep from searching the entire list, or length like FASM.

And I also need to consider when a already defined symbol is used:

mov eax,[foo]

If this is global scope there should only be one foo.  To the parser it is just a string "foo", so I have found what seems to be a symbol.  I must take that and see if there really is a simple named foo.  It seems the only way to check would be by compaing strings.

It seems if I want things to be reasonably fast, I must arrange my symbol table in a way that I quickly find the symbol by up string compares.
After possibly using length and/or first character to limit the searchable set.

That also means I need to differentiate between symbol creation and symbol use.

Does that sound about right, or am I completly wrong?

Thanks.







tenkey

It helps to split lexical parsing (recognizing words, keywords, and special symbols) from syntax parsing (code structure).

The only place you need to compare symbol strings is during lexical parsing. You assign a unique ID for each string, which can be stored as an integer. String comparisons occur only for uniqueness searches. The symbol table searches will compare string IDs instead of the original strings. You can use a string address, a string pool offset, or a symbol counter. Using a string pool offset allows you to store the strings in a file for later use, such as providing debug information, list of symbols and their definitions, or a cross-reference table.

If you need to keep track of different usages of the same symbol, then you must structure your symbol table. For example, local symbols will need to be kept in a separate table from the global symbols, although they can have the same format. Local symbols can be stacked with the global symbols, but that means you don't save the information for later use (e.g., listings). Another example, structures tend not to have very many symbols, and each structure is fixed in size. So a small table that is searched linearly will allow you to search that table alone if a name is found in a position that can only be the name of a structure component.
A programming language is low level when its programs require attention to the irrelevant.
Alan Perlis, Epigram #8

ThoughtCriminal


ÇódèKñïght

Hi
for scope information you can create two symbol table one for procedure and other for variables for your procedure symbol table will have a variable symbol table it's member for local varibles

like this


struct SymbolVar
{
char *Id

//attribute info

SymbolVar *next
}

struct procedure
{
char *Id

//attributes
SymbolVar local
procedure *next

}

regards
Ck

   

Randall Hyde

Quote from: ThoughtCriminal on December 25, 2004, 02:02:13 PM
I understand that alot of compilers use hash tables for symbol managment.
among other techniques. Binary search trees are also common.

Quote
What I was thinking was giving each symbol a 32-bit ID.  Keywords would be defined in code with a static value, perhaps starting with 7fffffffh building down.  When a new symbol is parsed I was think of just giving it an ID in the order it was created. 
Why not just use the memory address of the symbol table entry that holds the symbol?
What would you use this value for, anyway?

Quote
In other words the first symbol created would have an ID of 1.  Using IDs I think would be easier than doing a string compare any time I want to compare two symbols.  Now I have the problem of redefining a symbol:

foo      DW       10 (internal ID of 1)
.
.some code
.
foo      DQ        64000 (internal ID of 24)
At *some* point you have to compare strings. Typically, this is done in the lexer. The lexer returns, as an attribute, a pointer to the symbol table entry for that symbol (if the symbol is defined; or maybe it creates an *undefined symbol* table entry if it's not defined, so you always get a pointer back). With the pointer, you can then access other attributes of your symbol and you don't ever have to do a string comparison more than once.
Well, almost never more than once.

Quote
foo could be considered a diffrent type, but it still has the same name.  I am begining to think there must be a point in the parsing where you must compare strings to other strings.  Perhaps I could store them in stables based on the first character to keep from searching the entire list, or length like FASM.
I don't understand your example. Are the two FOOs above different symbols (as opposed to the same symbol having two different type attributes over time). Are we talking local versus global here?

Quote
And I also need to consider when a already defined symbol is used:

mov eax,[foo]

If this is global scope there should only be one foo.  To the parser it is just a string "foo", so I have found what seems to be a symbol.  I must take that and see if there really is a simple named foo.  It seems the only way to check would be by compaing strings.
Yes, the only way to do that is by comparing strings. Technically, if you had a perfect hash function, you wouldn't have to compare strings, but just ask Rene Tournois about perfect hashing sometime (better yet, *don't*).  Yes, you can have collisions in a hash table. Yes, you will have to do string comparisons to eliminate the possibility of collisions. And collisions occur even when the identifiers are *not* equal!  That is, "FOO" might map to the same hash table entry as "ARGH" (depending on your hash function, of course). So you have to use string comparisions to differentiate them.

The solution is easy, use a decent hashing function which will reduce the number of collisions. If you still get a lot of collisions, use a binary search rather than a linear search to resolve the conflicts (HLA v2.0, for example, does exactly this -- if there is a collision in the hash table it uses a binary search to compare the colliding strings against the input string; this is *very* fast).



Quote
It seems if I want things to be reasonably fast, I must arrange my symbol table in a way that I quickly find the symbol by up string compares.
After possibly using length and/or first character to limit the searchable set.
Length is often one of the hashing values, yes.
So are *all* of the characters in the input string, not just the first character.


Quote
That also means I need to differentiate between symbol creation and symbol use.
At the lexer level, I'm not sure why you would want to do this.
If you expect the same symbol to be reused frequently, you could create a look-up table that maps the symbol to a 32-bit value (as you've suggested) and then compare this 32-bit value throughout your symbol management code, but reusing symbols in a source file is not all that common. So you'd be paying the price of this conversion for something that just doesn't happen that often.  Not a good trade-off.

By using a decent hashing function, and perhaps a binary search to back up the possbility of collisions, you'll have sufficient speed that the symbol table lookup will *not* be where you spend most of your time.
You might want to take a look at the symbol table management code in the Assembler Developer's Kit at http://webster.cs.ucr.edu/AsmTools/RollYourOwn/index.html. Granted, the ADK supports a very sophisticated symbol table system, designed to support block structured assembly language (any number of levels of nesting of local and global objects). But it's not that hard to strip out stuff you don't need. And the symbol table stuff is (mostly) documented.
Cheers,
Randy Hyde