The MASM Forum Archive 2004 to 2012

General Forums => The Laboratory => Topic started by: NoCforMe on November 05, 2011, 06:49:05 PM

Title: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 06:49:05 PM
I just finished part of a project and wanted to share it here. It's not done, but the part that is finished is actually usable. But mostly I want to post it here because it's a good teaching example.

The project is a set of simple configuration functions, similar to the familiar ".ini file" model, that allow a program to save and retrieve configuration information in a disk file. What I had in mind was a small set of configuration items used by an application: its window position and size, font information, possibly its saved state, etc. Nothing too fancy.

The specific part I want to present is the parser, which is the heart of the "get configuration" part of the package. (The package is attached below.)

The GetOurConfig() function opens, reads and parses a configuration file (the name of which is "<progname>" + .cfg), and stuffs the resulting values into a table in memory which the calling program can access. The configuration format is very simple:


key=value


The parser is of a type that I've been using for years in assembly language. I've even coded it in C, though it actually lends itself much more to assembler than to high-level languages. It's a finite-state automaton, or FSA, to use a fancy term, which basically just means that it's a machine which deals with a fixed, known set of "states".

I think this is the best way to do most parsing tasks; I learned it from a book I read years ago on compiler construction, and it's the technique used (with variations) by most language-driven programs. It's robust, easily changeable, and above all fairly simple and easy to understand, unlike the horribly complicated parsers I've seen people come up, with dozens of string comparisons, etc.

The whole thing is driven by a parsing state table. Here's the one from my parser:


GCparseTable LABEL DWORD
;   sp    ;     =    EOL   EOF  <any>
  DD _pn0, _pn1, _pnE, _pn0, _pnF, _pnA ;row 0
  DD _pn1, _pn1, _pn1, _pn0, _pnF, _pn1 ;row 1
  DD _pn4, _pnE, _pn3, _pnE, _pnE, _pn2 ;row 2
  DD _pn3, _pnD, _pnE, _pnC, _pnB, _pn5 ;row 3
  DD _pn4, _pnE, _pn3, _pnE, _pnE, _pnE ;row 4
  DD _pn6, _pnD, _pnE, _pnC, _pnB, _pn5 ;row 5
  DD _pn6, _pnE, _pnE, _pnC, _pnB, _pnE ;row 6


Each  row is a "state",  and each column in that row is a "node". For each state, the next node to go to is determined by the current "token", which here is just the next character read from the file. See the 2nd line which shows the characters; EOL is end-of-line, which here is just a carriage return (I simply throw away line feeds, which is a somewhat brute-force but effective way to deal with CR/LF pairs), and EOF is end-of-file. "<any>" means any other character; when <any> is seen, the parser goes to the last node of the row.

So what I have here is a dispatcher that goes to the correct node depending on the current state and the current token. The question is, how does one construct this table? The answer is in the .zip file. Look at the drawing (GetOurConfig parser.gif); this somewhat messy drawing shows the whole scheme graphically. Each circle or square is a node (more on the square nodes later), and the connections between nodes  are through the tokens and their connecting lines. So, for instance, if we're at node 2 and the next character is an equal sign, we then go to node 3.

Here's a sample of some parsing nodes:


_pn1 LABEL NEAR
MOV EBX, $GCrow1
JMP parse0

_pnA LABEL NEAR
; Set up <key> storage:
MOV EDX, HeapStringPtr
MOV keyPtr, EDX
; Fall through to _pn2:

_pn2 LABEL NEAR
; Store <key> character:
INC heapCount
CMP heapCount, $keyHeapSize
JA heapSizeError
MOV EDI,  HeapStringPtr
STOSB
MOV HeapStringPtr, EDI ;*****
MOV EBX, $GCrow2
JMP parse0


Notice that I had to use code LABELs here instead of just label:-type labels, so that they'd be visible outside of the procedure. Nodes that don't actually do anything, like _pn1, simply set the row pointer (EBX here) to the appropriate value, then go back to get the next character.

Notice also the first 3 lines after _pn2, which is "guard" code to protect against overrunning the heap A better implementation might use this to expand the heap if needed, but this would be more complex.

If you want to play around with this, there's a little testbed program (libtest) that displays the contents of a configuration file after parsing. The parser also allows comments (using ";" just like in assembler) anywhere in the file. It also allows for null "values", like


key=


If you wanted to play with this further, you could modify the parser so it doesn't allow null values (which is the way I had originally designed it, but changed my mind and thought that null entries could be useful).

Limitations:

As a practical configuration package, this has definite limitations:

I'm planning on tackling the writing part next, with the following functions:

Writing a configuration file poses another gnarly problem: what do you do with comments? Since the read side is nice enough to let the user include comments, it would be a shame to just throw them away. However, preserving them would make it much more complicated.

The main problem with adding/updating keys is memory management. Here's what I've come up with so far for a possible solution:

Kind of primitive, but it should work OK for a limited size of configuration.

Anyhow, let me know what you think, and what brilliant ideas you might have for solving these problems.
Title: Re: Parser in assembly language
Post by: jj2007 on November 05, 2011, 06:57:45 PM
Unresolved external symbol '__mainCRTStartup'

You have modified your \masm32\include\masm32rt.inc, right?
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 07:08:03 PM
Yes, but ...  all I did was change the paths to the include files therein. (I actually didn't modify masm32rt.inc, but created my own copy, my_masm32rt.inc which I modified.) so I changed "include \masm32\include\msvcrt.inc" to "include \techstuff\masm32\include\msvcrt.inc". That's all. I just did a side-by-side comparison of the original file and my modified version, and they have the exact same files, including msvcrt.inc, which I ASS-U-ME is the offending one?

So what did I break, and how did I break it? Everything assembles and links OK on this end. (Windows 2000 Pro SP4 here)

Hmm; looks like the problem has to be with a library somewhere. Is the problem with my library or with my testbed app?

Wait: I did find one difference. I commented out "include \masm32\macros\macros.asm" 'cuz I don't use your macros; could this be the problem?

When I include that file I get an error because I had the nerve--the nerve! to write my own strcat() routine, which I now discover is implemented as a macro in your include file.

Oh, well.
Title: Re: Parser in assembly language
Post by: dedndave on November 05, 2011, 07:11:59 PM
there is a set of Get Profile/Write Profile functions for dealing with INI format files
for example, GetProfileString, GetPrivateProfileString, WritePrivateProfileString
Title: Re: Parser in assembly language
Post by: jj2007 on November 05, 2011, 07:14:49 PM
libtest.asm works fine with this minor change:
Quote;   include \techstuff\masm32\include\my_masm32rt.inc
   include \masm32\include\masm32rt.inc
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 07:19:56 PM
OK, maybe if I do this much more I'll switch to the Standard Canonical Regular way of doing things (MASM32 in the root, etc.).

Any comments on the substance  of my project?
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 07:28:15 PM
Quote from: dedndave on November 05, 2011, 07:11:59 PM
there is a set of Get Profile/Write Profile functions for dealing with INI format files
for example, GetProfileString, GetPrivateProfileString, WritePrivateProfileString

Well, without getting into a whole long thing about it, the name of the game here is "how to avoid using the registry", for reasons that have been discussed ad nauseam within the "programming community".

I just wanted to see if I could some up with a simple but workable scheme all on my own.

Besides, how else would I show off my parser-writing props?
Title: Re: Parser in assembly language
Post by: jj2007 on November 05, 2011, 07:43:59 PM
Quote from: NoCforMe on November 05, 2011, 07:19:56 PM
Any comments on the substance  of my project?

It seems to work fine. The PrivateProfileXX functions offer a two-fold key - one option to consider.
I have not quite understood why you need that table. It all boils down to some instr operations...

include \masm32\MasmBasic\MasmBasic.inc   ; download (http://www.masm32.com/board/index.php?topic=12460)
   Init
   Recall "GetConfig.cfg", cfg$()
   For_ ebx=0 To eax-1
      mov esi, cfg$(ebx)
      .if Instr_(esi, "=")
         mov ecx, edx
         Let edi=Left$(esi, ecx-1)
         Print Str$("\nKey #%i = [", ebx), Trim$(edi), "], value = ["
         void Instr_(ecx, esi, Chr$(59))   ; check for comments
         xchg eax, edx
         sub eax, ecx
         Let edi=Mid$(esi, ecx+1, eax-1)
         Print Trim$(edi), "]"   ; get rid of comments
      .endif
   Next
   Inkey CrLf$, "OK"
   Exit
end start
Key #0 = [xyz], value = [abc]
Key #1 = [this], value = [that]
Key #3 = [something], value = [nothing]
Key #4 = [really_long_key_name], value = [realllllly_long_key_value]
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 07:57:11 PM
Well, sure; there must be half a dozen ways to skin this particular cat. And it appears that the "private" profile functions do pretty much everything I wanted to do here without the implementation headaches (but see comment below).

I just wanted to present this method of parsing as a generalized approach. This is a pretty simple application, so perhaps not the best example. However, this parsing method really shines when the parsing task is more complex, requiring more and more string comparisons, exceptions, etc.

On second thought, I think I'll steer clear of those private profile functions.

MSDN says, in its description of WritePrivateProfileString():

Quote

The system maps most .ini file references to the registry ...


which is enough for me not to want to go there at all. (That's why I intentionally used a different file extension from .ini in my package.) It's not a huge deal; I had just wanted some simple way to preserve a few configuration items for my pathetically small li'l applications.

Note: I updated the .zip file in the original post to remove all offending pathnames (and I changed the names  of "strcpy" and "strcat" to protect the innocent).
Title: Re: Parser in assembly language
Post by: Gunner on November 05, 2011, 08:13:58 PM
Not for private ini file it doesn't.  Been using ini files for years and it never stored anything in the registry, only in the ini file I tell it to.

If you modify system ini files (control, system, winini etc..) then it will map to the registry
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 08:15:43 PM
OK, maybe I'll give it a try.
Title: Re: Parser in assembly language
Post by: Ficko on November 05, 2011, 08:46:41 PM
Just a remark to "PrivateProfileString"s it had a limitation of 32768 entries up to XP I don't know anything changed above XP.
Title: Re: Parser in assembly language
Post by: hutch-- on November 05, 2011, 09:35:42 PM
NoCforMe,

I think you are basically taking the right approach of rolling your own parser, I have never liked the Windows INI system and the advantage with doing your own is you can twiddle it to do exactly what you want. I have on and off played with different systems, left to right sequential word parsing, array word parsing (both using 256 character lookup tables) and occasionally left to right character parsing where I needed single characters that were not naturally spaced between words (this=that [the = symbol]). Line continuations are another PHUN technique depending on how you want the notation to work, I prefer a trailing "," like MASM uses for its non MACRO notation.
Title: Re: Parser in assembly language
Post by: NoCforMe on November 05, 2011, 09:41:25 PM
Well, if  I might say so, that's one of the beauty things about using FSAs for parsing. Mine was built around line-oriented statements, but there's no requirement that you do so. Line endings can just be treated as "white space" or not, depending on the context.