My Assembler Development Update

johnsa · May 06, 2012, 07:01:31 PM

Hey all,

Well it's been 5 days since I started this (officially) and I'm about 5000 lines of code in.
So I thought I'd give an update and provide what I've done so-far for some testing.

What's finished:
XASM Assembler skeleton application
Lexer/Scanner (I'm quite comfortable that this is 100% or at least 99.98%)
Opcode, Register, HLL, Directives DBs (All reserverd words, modr/m and sib lookups etc)
Source Management
The parser structure + a few productions (namely NOP and INCLUDE).

Design Thoughts:
1) The whole thing runs around a producer-consumer model.
2) The assembler sets up, loads the primary file and initiates the parser.
3) The parser (is re-entrant, thread safe and supports horizontal parsing as well as recursion).
4) The parser acts like a state machine progressively refining a production with input tokens.
5) The lexer handles asm (MASM/JWASM + my extensions) as well as C/C++
6) My source loader is using CRC32 opcode (to be augmented with a software version for compatability) to checksum each file. This is to ensure I can test effectively for circular references. (As filenames and paths aren't reliable).
As i'm supporting namespaces, a file can be included in different namespaces without an issue, but not in the same scope/ns.

So basically I want to take MASM syntax and add:
try, catch, new, delete, delete[], namespaces, ::, enums, class, method, support for incbin, cinclude(.h), no need for proto, invoke + direct calling of function name IE: MyFunc() .., better support for literals and constants in calls, unicode strings suffixed with L,
more data type declarations like int, char which will switch with the code when you use .32bit or .64bit directives.

Testing and Performance:
I've included the two basic lexer test scripts.
I've run every asm project I could find through it so-far without a single issue.
Since the lexer can handle C and C++ too I threw a load of headers and CPPs at it fine.
I've hand checked about 15,000 lines of output tokens from these projects and so far so good.
When you run it you'll see each module output debug info.. so with each token apart from the text itself, is the token group, sub-type and some other info. If it's a register it knowns the reg-number, bit-size and register group.
Numbers are converted to their actual types.

Performance wise it's doing about 200,000 lines per second over two full passes. The assembler runs itself in 0.4 seconds, as opposed to jwasm's 0.2 on my machine.
Most of this performance loss is in the lexers directive/opcode/reg lookup functions which are just table scanning string comparisons.
I have added in hashing and sorting, and once that's finished it should be back to about 500,000 lines per second.

I've not been able to find any other compiler benchmarks out there to compare with, but to be honest I think it will land up being faster than masm, but a bit slower than jwasm.. which I can live with.

Issues:
So far I have run into only one issue... a funny lexer rule to handle this:

slice@1 CATSTR slice@1,<">,argz,<">
if @InStr(1, Message, <!">)

Firstly, you're supposed to use ! to escape something like " in quoted literal text like that, but it's not always the case. Secondly allowing <> to terminate literals was an issue in the lexer, and the only solution I had so-far (at least temporarily) was to use other tokens
like catstr,instr etc to switch a lexer mode on/off to allow quoted literal text in <>'s.

Next things:
By the end of next week I want to upgrade the numeric generator to use a full 80bit for decimals and 64bit for other integral types.
Add some more parser productions to be able to get enough of a source generated to write it out as a bin file.
Complete symbol table code.
Optimize lookups using hash for lexer speed.
Finish generating sections/segments
Add at least 2 or so declarations for data and EQUs.

johnsa · May 06, 2012, 07:13:45 PM

As a side note, the attachment contains two exe's.
xasmd outputs every piece of info on the tokens/parser as it goes.
xasmr is identical but doesn't write anything out ( just to see speed and test whole projects to make sure all the includes load in and parse).

I followed Bogdan's advice with the lexer , so it's hand-coded without any reg-ex engines.
I decided the make the parser work in a rather strange way so that normal lines of source can be parsed sequentially in states (what i call horizontal) and certain others initiate a recursion to handle blocks .. ie: things like proc, nested macros, { }, .if etc.

BogdanOntanu · May 06, 2012, 08:12:54 PM

Great to see some progress ;)

I have moved the original thread here to Assembler/Compiler tech

You must use hash for token search (language tokens like MOV,CALL,XOR,etc) and user tokens like PROC names or labels or EQUs, etc... otherwise it will be very slow on big projects with thousands of symbols.

Things will get slower when you add a lot of instructions and directives handling and code generation (BIN and COFF32/64 are a minimum must).

For testing my assembler I've created an application that can generate 100K or more of PROC's or other ASM specific structures (labels, structures, etc) and the I used to test run my assembler on those synthetic generated files. It helps to identify bottlenecks and test speed even if it's synthetic. Then you can tweak this application in order to generate the syntax of "other" assemblers and test them also :D :D :D

Generally speaking you might need to do more than 2 passes.

Sometimes I have to do 3 to 5 passes for bigger applications with size optimizations off.
With size optimizations ON it can take 200+ passes to compile and fully size optimize a huge project (200.000+ lines) but this could be because my incremental solver is converging too slow.

Anyway I have seen other assemblers like FASM doing 44 passes on normal operations and I have also seen TASM being able to always perform exactly 2 passes no matter what :D

Ah and BTW do not assume that people can follow your line of thought in here. Compiler/Assembler creation is a very limited subject, kind of elite because not many people did such projects from scratch .... and besides the terms and concepts are not easy to follow from one human to another.

Each compiler creator has his own "standards" and conventions and names the same concepts slightly differently ... :D

Hence I suggest to use more exact and simple explanations of status and current issues if you want people to follow and give useful hints ;)

johnsa · May 06, 2012, 08:26:25 PM

Hey,

Definitely agreed on the hash, I actually started by coding the hash first and setting the data-structures up that way, but I was in a hurry to just test that it worked so I've skipped a bit of code which needs to go back in to speed that up :)
I'm going to implement the same hash scheme for the symbol table items too as you suggest.. It's a very fast hash I stumbled on once long ago .. FNV1a.. which I use often.

Since my last post I've added ECHO and MOV r32,r32 productions.. so it's working well .. it's going to be a long and laborious process to roll out all these productions.. I think it will account for 80% of the project dev time.. but I'm happy that it's gotten to this point
already.

I like that synthetic test idea. I'll definitely do that!

I've lost you on needing more than 2 passes.. what would cause you to need more than 2?
I don't think I've seen any of my projects go beyond 2 passes..

I'd only anticipated to implement size optimization in the form of replace opcode x (2 bytes) with opcode y (1 byte) when functionally identical and setup a string table and ensure all == strings are mapped to the same string entry.. am I missing something? :)

You're right.. I've sort of got myself stuck in this head-space now so I sort of blurt out what I'm thinking.. I tried explaining the way I setup the parser to my wife... that was pointless.. haha
But then again, I suspect that only yourself and Habran might be even interested in this, and with your experience my ideas shouldn't need to be simplified ( maybe just translated from brain inner monologue ;) )

BTW Did my note about the lexer funny around <"> and <!"> make sense? not sure if the way i've handled it is ideal.

BogdanOntanu · May 06, 2012, 08:54:17 PM

Quote from: johnsa on May 06, 2012, 08:26:25 PM
...
It's a very fast hash I stumbled on once long ago .. FNV1a.. which I use often.

Yes, I think I use the same hash function.

Quote
Since my last post I've added ECHO and MOV r32,r32 productions.. so it's working well .. it's going to be a long and laborious process to roll out all these productions.. I think it will account for 80% of the project dev time.. but I'm happy that it's gotten to this point
already.

Yes, it is important to keep your brain happy and give it candy like that in order to maintain focus. There is a lot of work to be done in adding ALL instructions and ALL parameters. I have generated a few instructions very fast also when I have started my assembler (I guess a few single byte instructions like NOP's/CLI/etc and register to register moves also).

I have the advantage of having a few huge projects of may own at that time and I have literally started parsing them and implemented each instruction needed in order to assemble my projects. making a bigger stop every time I have found a new concept like includes or MACRO's and PROC's or .IFs

Quote
I've lost you on needing more than 2 passes.. what would cause you to need more than 2?
I don't think I've seen any of my projects go beyond 2 passes..

First pass is for "nothing" usually ... you just note new things you encounter and accumulate symbols and stuff BUT because most things will be defined LATER in another source file you might have to postpone any code generation until the second pass... right?

The Second pass should fix all of those unsolved issues of the first passe... BUT... IF you also want to optimize the size of the instructions (jumps for example) THEN some of them might become smaller or bigger depending on how you generate the code and what exactly is after that code...

In consequence another symbol might change position (for example one label might become 0x401001 instead of being 0x40100 as you initially generated it and it might become too far from an previous generated "short" jump...

This can generate a chain of linked reactions and you might NOT be able to solve more than a few of them on each pass :D

IF you are not careful then you might NOT solve them at all ;)
You need to get sure that the solution to the problem converges ...

Of course on simple files and IF you impose the same rules as in HLL languages: define what you use BEFORE you use it (by using include files or some kind of forward declarations like PROTO ....) THEN you might not observe this problem initially...

But you can not evade it on bigger projects mainly because of the nature of ASM... labels are by design defined after they are used and jump to those labels can be encoded longer or shorter depending on this distance.

Anyway you can avoid most of the problems IF you forget about size optimizations but this will not look "cool" :P

Quote
I'd only anticipated to implement size optimization in the form of replace opcode x (2 bytes) with opcode y (1 byte) when functionally identical and setup a string table and ensure all == strings are mapped to the same string entry.. am I missing something? :)

Apparently you did ;)

Quote
...
But then again, I suspect that only yourself and Habran might be even interested in this, and with your experience my ideas shouldn't need to be simplified ( maybe just translated from brain inner monologue ;) )

They need to be simpler ... even if I could eventually understand your specific term and concepts ... I have other projects and I only have a limited time allocated for understanding what you say and if I can not do that very fast then I have a strange tendency to return to my projects ;)

Besides there are other people here that might just know a solution or two to your problems and might drop in a hint or two ;)

Quote
BTW Did my note about the lexer funny around <"> and <!"> make sense? not sure if the way i've handled it is ideal.

Nope :D

My syntax is slightly different hence I have "other" problems but of course that I am aware of multiple issues with using "<" and ">" I just did not have time to follow the exact problem... hence I have skipped it and answered about more common issues like "hash" etc ...

A prime example of why you need to simplify and find a way to explain that problem in an easy to follow and understand manner for people that do not want to dig deep into the subject's details ;)

jj2007 · May 06, 2012, 10:13:56 PM

Your progress looks impressive :U

On Win XP SP3 it chokes with exception 1d at 401295 for xasmd.exe lextest.asm -d -b -c32
Anything wrong with the commandline?

habran · May 07, 2012, 05:24:51 AM

Hi johnsa

If you succeed in your plans and I believe you will, we will have most powerful programming tool ever

I wish you success
I would also like to thank Bogdan in his help and advice
He is showing now his great personality

regards

BTW I have no problem to run it on i7 Windows7

jj2007 · May 07, 2012, 06:32:13 AM

Quote from: habran on May 07, 2012, 05:24:51 AM
BTW I have no problem to run it on i7 Windows7

Now I was able to debug the exception, on Win7-32, AMD Athlon(tm) Dual Core Processor 4450B. After

Code Select

Advanced Systems Research (R) XASM Version 1.00
Copyright (C) ASR. All rights reserved.
[Release Mode Build]
[Output Format: 32bit COFF]

it stops at the crc32 instruction:

Code Select

00A51293                   ³>Ú8B13                Úmov edx, [ebx]
00A51295                   ³.³F20F38F1C2          ³crc32 eax, edx
00A5129A                   ³.³83C3 04             ³add ebx, 4

If I circumvent that one, it continues and finishes with

Code Select

...
[LEXER] token: , 1, 0, 0, line: 78
[LEXER] token: .data?, 4, 138, 0, line: 78
[LEXER] token: , 1, 0, 0, line: 79
[LEXER] token: , 1, 0, 0, line: 80
Assembly completed in 298.255 seconds.

HTH, jj

johnsa · May 07, 2012, 07:14:31 AM

Hey,

Yeah in the abesence of implementing crc32 or md5 myself quickly I just used the crc32 opcode.. so that would be required. I'll have that fixed today or tomorrow with h/w detection so it'll use either the opcode or a software equivalent.

Thanks guys for all the feedback! I'll keep you posted with updates during the course of the week.

Bogdan, I have a theory (potential solution) in my head.. this is a bit tricky to explain but as you say let me try make it as simple as possible. I'f I'm correct this should remove the need for anything more than 2 passes.
The problem boils down to symbols which can move (labels etc) usually because of a change in encoding size.
For example a jump that is encoded as 5 bytes, and the later on you come back, realize it can be optimized to a short 2 byte form, now all your labels are in the wrong place.
So here is the plan.. on the first pass, you generate ALL these problematic opcodes as their shortest/optimal form and put the address calculations into your symbol table as per normal.
Now when we come back for the second pass, we know that all changes to encodings are going in the same direction, IE: making them larger (2 -> 5 bytes for example). What this means is that the position of symbols (or the distance to them) can only grow.
To solve this we have another table called an "Address shift" table.

So for example:

Label1:
jmp Label4

Label2:

Label3:
jmp Label2

Label4:

(Hypothetical scenario above). after the first pass we have addresses for the labels, jmp Label2 for argument sake will remain unchanged as it's within short range. However jmp Label4 needs to be extended. So we do the extension
and write a value into the shift table, namely the address of jmp Label4 and the number of bytes we increased it by.

Now any address lookup in pass 2 uses the symbol table AND the shift table to calculate it's final position.

We get to jmp Label2, we find that label2 occurs after an entry in the shift table.. so we update it's address as stored in symbol table by suming all the shift's that occur in the table up to that point (assuming it's absolute). For relative instructions you'd take
the address of the instruction itself and the target, and sum all shifts that occur between those two in the table.

That way all the moving around is solved without the need for more passes and restarts.

Make sense?

dedndave · May 07, 2012, 07:35:26 AM

that table sounds like a good idea
of course, it could suck up resources on larger projects

as for adding features above and beyond those in masm...
when you do this, you step farther away from being compatible with masm syntax
i had given this some thought, too :P
i was thinking that some options could be selected in the form of a comment

Code Select

;OPTION NewOption

if there are any spaces between ";" and "O", it is a comment
that way, they can still have comments that start with "OPTION"
masm would, of course, ignore the line

you might even use a different keyword

Code Select

;OPTIONEX NewOption

johnsa · May 07, 2012, 07:56:48 AM

On my table idea.. I realize it's a bit flawed.. hmm It would need to be 3 pass..
In which case.. you could do the same logic, but just apply the update to the symbol table directly.

IE: Your first pass was optimistic, all possible opcodes are their shortest form.

The second pass performs branch-extensions (anti-optimisation). Everytime an opcode is encountered that needs to be extended, all symbols in the symbol table are updated. Once again no code is actually generated this pass.

Third pass, now all instructions should be optimal and the symbol table address correct. Generate.

johnsa · May 07, 2012, 08:01:00 AM

Quote from: dedndave on May 07, 2012, 07:35:26 AM
that table sounds like a good idea
of course, it could suck up resources on larger projects

as for adding features above and beyond those in masm...
when you do this, you step farther away from being compatible with masm syntax
i had given this some thought, too :P
i was thinking that some options could be selected in the form of a comment

Code Select Expand
;OPTION NewOption

if there are any spaces between ";" and "O", it is a comment
that way, they can still have comments that start with "OPTION"
masm would, of course, ignore the line

you might even use a different keyword
Code Select Expand
;OPTIONEX NewOption

I like the idea of the OPTIONEX..
From what I'm doing this side I just have to be very careful not to break masm-compatibility. But I'm only interested in one way compatibility, IE: I can assemble masm source.. not the other way around. So as long as none of the lexemes and productions I add
affect existing masm functionality I should be safe.

jj2007 · May 07, 2012, 08:47:31 AM

Quote from: johnsa on May 07, 2012, 07:14:31 AM
The problem boils down to symbols which can move (labels etc) usually because of a change in encoding size.

One option might be to pass once backwards, from end to start.

dedndave · May 07, 2012, 01:37:34 PM

:bg then you don't know the backward reference distances

johnsa · May 07, 2012, 01:42:14 PM

I think 3 passes should solve it if you want an "optimizing" assembler.
As long as you know which symbol occurs before/after each other (IE: an address ordered version of your symbol table) and you start with shortest form so distances can only grow.

News:

My Assembler Development Update