News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Parser in assembly language

Started by NoCforMe, November 05, 2011, 06:49:05 PM

Previous topic - Next topic

NoCforMe

I just finished part of a project and wanted to share it here. It's not done, but the part that is finished is actually usable. But mostly I want to post it here because it's a good teaching example.

The project is a set of simple configuration functions, similar to the familiar ".ini file" model, that allow a program to save and retrieve configuration information in a disk file. What I had in mind was a small set of configuration items used by an application: its window position and size, font information, possibly its saved state, etc. Nothing too fancy.

The specific part I want to present is the parser, which is the heart of the "get configuration" part of the package. (The package is attached below.)

The GetOurConfig() function opens, reads and parses a configuration file (the name of which is "<progname>" + .cfg), and stuffs the resulting values into a table in memory which the calling program can access. The configuration format is very simple:


key=value


The parser is of a type that I've been using for years in assembly language. I've even coded it in C, though it actually lends itself much more to assembler than to high-level languages. It's a finite-state automaton, or FSA, to use a fancy term, which basically just means that it's a machine which deals with a fixed, known set of "states".

I think this is the best way to do most parsing tasks; I learned it from a book I read years ago on compiler construction, and it's the technique used (with variations) by most language-driven programs. It's robust, easily changeable, and above all fairly simple and easy to understand, unlike the horribly complicated parsers I've seen people come up, with dozens of string comparisons, etc.

The whole thing is driven by a parsing state table. Here's the one from my parser:


GCparseTable LABEL DWORD
;   sp    ;     =    EOL   EOF  <any>
  DD _pn0, _pn1, _pnE, _pn0, _pnF, _pnA ;row 0
  DD _pn1, _pn1, _pn1, _pn0, _pnF, _pn1 ;row 1
  DD _pn4, _pnE, _pn3, _pnE, _pnE, _pn2 ;row 2
  DD _pn3, _pnD, _pnE, _pnC, _pnB, _pn5 ;row 3
  DD _pn4, _pnE, _pn3, _pnE, _pnE, _pnE ;row 4
  DD _pn6, _pnD, _pnE, _pnC, _pnB, _pn5 ;row 5
  DD _pn6, _pnE, _pnE, _pnC, _pnB, _pnE ;row 6


Each  row is a "state",  and each column in that row is a "node". For each state, the next node to go to is determined by the current "token", which here is just the next character read from the file. See the 2nd line which shows the characters; EOL is end-of-line, which here is just a carriage return (I simply throw away line feeds, which is a somewhat brute-force but effective way to deal with CR/LF pairs), and EOF is end-of-file. "<any>" means any other character; when <any> is seen, the parser goes to the last node of the row.

So what I have here is a dispatcher that goes to the correct node depending on the current state and the current token. The question is, how does one construct this table? The answer is in the .zip file. Look at the drawing (GetOurConfig parser.gif); this somewhat messy drawing shows the whole scheme graphically. Each circle or square is a node (more on the square nodes later), and the connections between nodes  are through the tokens and their connecting lines. So, for instance, if we're at node 2 and the next character is an equal sign, we then go to node 3.

Here's a sample of some parsing nodes:


_pn1 LABEL NEAR
MOV EBX, $GCrow1
JMP parse0

_pnA LABEL NEAR
; Set up <key> storage:
MOV EDX, HeapStringPtr
MOV keyPtr, EDX
; Fall through to _pn2:

_pn2 LABEL NEAR
; Store <key> character:
INC heapCount
CMP heapCount, $keyHeapSize
JA heapSizeError
MOV EDI,  HeapStringPtr
STOSB
MOV HeapStringPtr, EDI ;*****
MOV EBX, $GCrow2
JMP parse0


Notice that I had to use code LABELs here instead of just label:-type labels, so that they'd be visible outside of the procedure. Nodes that don't actually do anything, like _pn1, simply set the row pointer (EBX here) to the appropriate value, then go back to get the next character.

Notice also the first 3 lines after _pn2, which is "guard" code to protect against overrunning the heap A better implementation might use this to expand the heap if needed, but this would be more complex.

If you want to play around with this, there's a little testbed program (libtest) that displays the contents of a configuration file after parsing. The parser also allows comments (using ";" just like in assembler) anywhere in the file. It also allows for null "values", like


key=


If you wanted to play with this further, you could modify the parser so it doesn't allow null values (which is the way I had originally designed it, but changed my mind and thought that null entries could be useful).

Limitations:

As a practical configuration package, this has definite limitations:

  • The reader/parser is limited to a fixed number and size of configuration entries; currently I have it set at 100 entries with an average size of 25 bytes each (this is the size of the strings, including NULL terminators, not the entry table).
  • I've only done the reading part of the package; what's needed is for an application to be able to write its configuration to disk. This is a little more complicated. Adding a new key  would be fairly simple, assuming there's enough space left in the heap. But what if it wants to modify an existing key with a longer "value" string than the current one? (More on this below.)
I'm planning on tackling the writing part next, with the following functions:

  • SetConfigKey(), which will either create a new key or update an existing one;
  • CreateConfigTable(), to create a new table "from scratch";
  • WriteOurConfig(), to write the current configuration table back to a file.
Writing a configuration file poses another gnarly problem: what do you do with comments? Since the read side is nice enough to let the user include comments, it would be a shame to just throw them away. However, preserving them would make it much more complicated.

The main problem with adding/updating keys is memory management. Here's what I've come up with so far for a possible solution:

  • When the update function (SetConfigkey() ) is first called, allocate a new heap space.
  • If the "value" being updated is the same size or smaller than the original, just update it in place in the old heap.
  • If the "value" string is longer, put it in the new heap, and put the pointer to the string back into the original pointer in the table.
Kind of primitive, but it should work OK for a limited size of configuration.

Anyhow, let me know what you think, and what brilliant ideas you might have for solving these problems.

jj2007

Unresolved external symbol '__mainCRTStartup'

You have modified your \masm32\include\masm32rt.inc, right?

NoCforMe

Yes, but ...  all I did was change the paths to the include files therein. (I actually didn't modify masm32rt.inc, but created my own copy, my_masm32rt.inc which I modified.) so I changed "include \masm32\include\msvcrt.inc" to "include \techstuff\masm32\include\msvcrt.inc". That's all. I just did a side-by-side comparison of the original file and my modified version, and they have the exact same files, including msvcrt.inc, which I ASS-U-ME is the offending one?

So what did I break, and how did I break it? Everything assembles and links OK on this end. (Windows 2000 Pro SP4 here)

Hmm; looks like the problem has to be with a library somewhere. Is the problem with my library or with my testbed app?

Wait: I did find one difference. I commented out "include \masm32\macros\macros.asm" 'cuz I don't use your macros; could this be the problem?

When I include that file I get an error because I had the nerve--the nerve! to write my own strcat() routine, which I now discover is implemented as a macro in your include file.

Oh, well.

dedndave

there is a set of Get Profile/Write Profile functions for dealing with INI format files
for example, GetProfileString, GetPrivateProfileString, WritePrivateProfileString

jj2007

libtest.asm works fine with this minor change:
Quote;   include \techstuff\masm32\include\my_masm32rt.inc
   include \masm32\include\masm32rt.inc

NoCforMe

OK, maybe if I do this much more I'll switch to the Standard Canonical Regular way of doing things (MASM32 in the root, etc.).

Any comments on the substance  of my project?

NoCforMe

Quote from: dedndave on November 05, 2011, 07:11:59 PM
there is a set of Get Profile/Write Profile functions for dealing with INI format files
for example, GetProfileString, GetPrivateProfileString, WritePrivateProfileString

Well, without getting into a whole long thing about it, the name of the game here is "how to avoid using the registry", for reasons that have been discussed ad nauseam within the "programming community".

I just wanted to see if I could some up with a simple but workable scheme all on my own.

Besides, how else would I show off my parser-writing props?

jj2007

Quote from: NoCforMe on November 05, 2011, 07:19:56 PM
Any comments on the substance  of my project?

It seems to work fine. The PrivateProfileXX functions offer a two-fold key - one option to consider.
I have not quite understood why you need that table. It all boils down to some instr operations...

include \masm32\MasmBasic\MasmBasic.inc   ; download
   Init
   Recall "GetConfig.cfg", cfg$()
   For_
ebx=0 To eax-1
      mov esi, cfg$(ebx)
      .if Instr_(esi, "=")
         mov ecx, edx
         Let edi=Left$(esi, ecx-1)
         Print Str$("\nKey #%i = [", ebx), Trim$(edi), "], value = ["
         void Instr_(ecx, esi, Chr$(59))   ; check for comments
         xchg eax, edx
         sub eax, ecx
         Let edi=Mid$(esi, ecx+1, eax-1)
         Print Trim$(edi), "]"   ; get rid of comments
      .endif
   Next

   Inkey CrLf$, "OK"
   Exit
end start
Key #0 = [xyz], value = [abc]
Key #1 = [this], value = [that]
Key #3 = [something], value = [nothing]
Key #4 = [really_long_key_name], value = [realllllly_long_key_value]

NoCforMe

Well, sure; there must be half a dozen ways to skin this particular cat. And it appears that the "private" profile functions do pretty much everything I wanted to do here without the implementation headaches (but see comment below).

I just wanted to present this method of parsing as a generalized approach. This is a pretty simple application, so perhaps not the best example. However, this parsing method really shines when the parsing task is more complex, requiring more and more string comparisons, exceptions, etc.

On second thought, I think I'll steer clear of those private profile functions.

MSDN says, in its description of WritePrivateProfileString():

Quote

The system maps most .ini file references to the registry ...


which is enough for me not to want to go there at all. (That's why I intentionally used a different file extension from .ini in my package.) It's not a huge deal; I had just wanted some simple way to preserve a few configuration items for my pathetically small li'l applications.

Note: I updated the .zip file in the original post to remove all offending pathnames (and I changed the names  of "strcpy" and "strcat" to protect the innocent).

Gunner

Not for private ini file it doesn't.  Been using ini files for years and it never stored anything in the registry, only in the ini file I tell it to.

If you modify system ini files (control, system, winini etc..) then it will map to the registry
~Rob (Gunner)
- IE Zone Editor
- Gunners File Type Editor
http://www.gunnerinc.com

NoCforMe


Ficko

Just a remark to "PrivateProfileString"s it had a limitation of 32768 entries up to XP I don't know anything changed above XP.

hutch--

NoCforMe,

I think you are basically taking the right approach of rolling your own parser, I have never liked the Windows INI system and the advantage with doing your own is you can twiddle it to do exactly what you want. I have on and off played with different systems, left to right sequential word parsing, array word parsing (both using 256 character lookup tables) and occasionally left to right character parsing where I needed single characters that were not naturally spaced between words (this=that [the = symbol]). Line continuations are another PHUN technique depending on how you want the notation to work, I prefer a trailing "," like MASM uses for its non MACRO notation.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

NoCforMe

Well, if  I might say so, that's one of the beauty things about using FSAs for parsing. Mine was built around line-oriented statements, but there's no requirement that you do so. Line endings can just be treated as "white space" or not, depending on the context.