The MASM Forum Archive 2004 to 2012

Specialised Projects => Assembler/Compiler Technology => Topic started by: johnsa on May 06, 2012, 07:01:31 PM

Title: My Assembler Development Update
Post by: johnsa on May 06, 2012, 07:01:31 PM
Hey all,

Well it's been 5 days since I started this (officially) and I'm about 5000 lines of code in.
So I thought I'd give an update and provide what I've done so-far for some testing.

What's finished:
   XASM Assembler skeleton application
   Lexer/Scanner (I'm quite comfortable that this is 100% or at least 99.98%)
   Opcode, Register, HLL, Directives DBs (All reserved words, modr/m and sib lookups etc)
   Source Management
   The parser structure + a few productions (namely NOP and INCLUDE).

Design Thoughts:
1) The whole thing runs around a producer-consumer model.
2) The assembler sets up, loads the primary file and initiates the parser.
3) The parser is re-entrant, thread safe and supports horizontal parsing as well as recursion.
4) The parser acts like a state machine progressively refining a production with input tokens.
5) The lexer handles asm (MASM/JWASM + my extensions) as well as C/C++
6) My source loader uses the CRC32 opcode (to be augmented with a software version for compatibility) to checksum each file. This is to ensure I can test effectively for circular references (as filenames and paths aren't reliable).
    As I'm supporting namespaces, a file can be included in different namespaces without an issue, but not in the same scope/ns.
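
A minimal sketch of the checksum-based include check from point 6, assuming the key is (namespace, checksum, length). The class name IncludeTracker is hypothetical, and zlib.crc32 (IEEE polynomial) only stands in for the x86 CRC32 opcode, which computes CRC-32C and gives different values:

```python
import zlib


class IncludeTracker:
    """Remembers which file contents were already included in which namespace."""

    def __init__(self):
        self._seen = set()  # (namespace, checksum, length) triples

    def try_include(self, namespace: str, content: bytes) -> bool:
        """Return True if this content is new in the namespace, False on a repeat."""
        key = (namespace, zlib.crc32(content), len(content))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


tracker = IncludeTracker()
text = b"include b.inc\n"
print(tracker.try_include("", text))     # first include in the global scope
print(tracker.try_include("", text))     # same content, same scope: rejected
print(tracker.try_include("gfx", text))  # same content, different namespace
```

Keying on content rather than on the path means the same file reached through two different relative paths is still recognised as a repeat.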

So basically I want to take MASM syntax and add:
try, catch, new, delete, delete[], namespaces, ::, enums, class, method, support for incbin, cinclude(.h), no need for proto, invoke + direct calling of function name IE: MyFunc() .., better support for literals and constants in calls, unicode strings suffixed with L,
more data type declarations like int, char which will switch with the code when you use .32bit or .64bit directives.

Testing and Performance:
   I've included the two basic lexer test scripts.
   I've run every asm project I could find through it so-far without a single issue.
   Since the lexer can handle C and C++ too I threw a load of headers and CPPs at it fine.
   I've hand checked about 15,000 lines of output tokens from these projects and so far so good.
   When you run it you'll see each module output debug info.. so with each token apart from the text itself, is the token group, sub-type and some other info. If it's a register it knows the reg-number, bit-size and register group.
   Numbers are converted to their actual types.

   Performance wise it's doing about 200,000 lines per second over two full passes. The assembler runs itself in 0.4 seconds, as opposed to jwasm's 0.2 on my machine.
   Most of this performance loss is in the lexer's directive/opcode/reg lookup functions which are just table scanning string comparisons.
   I have added in hashing and sorting, and once that's finished it should be back to about 500,000 lines per second.

   I've not been able to find any other compiler benchmarks out there to compare with, but to be honest I think it will land up being faster than masm, but a bit slower than jwasm.. which I can live with.

Issues:
   So far I have run into only one issue... a funny lexer rule to handle this:

slice@1 CATSTR slice@1,<">,argz,<">
   if @InStr(1, Message, <!">) 

Firstly, you're supposed to use ! to escape something like " in quoted literal text like that, but it's not always the case. Secondly allowing <> to terminate literals was an issue in the lexer, and the only solution I had so-far (at least temporarily) was to use other tokens
like catstr,instr etc to switch a lexer mode on/off to allow quoted literal text in <>'s.
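
A rough sketch of the "literal mode" described above, under the assumption that inside <...> a ! makes the next character literal, nested <> pairs are kept, and a quote character has no special meaning. The function name read_angle_literal is hypothetical and this is not the XASM lexer:

```python
def read_angle_literal(text: str, start: int):
    """Read a <...> text literal that begins at text[start] == '<'.

    Returns (literal_value, index_just_after_the_closing_bracket).
    """
    if text[start] != "<":
        raise ValueError("expected '<'")
    out = []
    depth = 1
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "!" and i + 1 < len(text):
            out.append(text[i + 1])  # escaped character is taken literally
            i += 2
            continue
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
        out.append(ch)
        i += 1
    raise ValueError("unterminated <...> literal")


print(read_angle_literal('<">,argz', 0))  # unescaped quote
print(read_angle_literal('<!">)', 0))     # escaped quote
```

Both forms from the snippet above yield a single quote character, because the quote never starts a string while the lexer is inside the brackets.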

Next things:
   By the end of next week I want to upgrade the numeric generator to use a full 80bit for decimals and 64bit for other integral types.
   Add some more parser productions to be able to get enough of a source generated to write it out as a bin file.
   Complete symbol table code.
   Optimize lookups using hash for lexer speed.
   Finish generating sections/segments
   Add at least 2 or so declarations for data and EQUs.
Title: Re: My Assembler Development Update
Post by: johnsa on May 06, 2012, 07:13:45 PM
As a side note, the attachment contains two exe's.
xasmd outputs every piece of info on the tokens/parser as it goes.
xasmr is identical but doesn't write anything out ( just to see speed and test whole projects to make sure all the includes load in and parse).

I followed Bogdan's advice with the lexer , so it's hand-coded without any reg-ex engines.
I decided to make the parser work in a rather strange way so that normal lines of source can be parsed sequentially in states (what I call horizontal) and certain others initiate a recursion to handle blocks .. ie: things like proc, nested macros, { }, .if etc.
Title: Re: My Assembler Development Update
Post by: BogdanOntanu on May 06, 2012, 08:12:54 PM
Great to see some progress ;)

I have moved the original thread here to Assembler/Compiler tech

You must use hash for token search (language tokens like MOV,CALL,XOR,etc) and user tokens like PROC names or labels or EQUs, etc... otherwise it will be very slow on big projects with thousands of symbols.

Things will get slower when you add a lot of instructions and directives handling and code generation (BIN and COFF32/64 are a minimum must).

For testing my assembler I've created an application that can generate 100K or more PROCs or other ASM specific structures (labels, structures, etc) and then I used it to test run my assembler on those synthetically generated files. It helps to identify bottlenecks and test speed even if it's synthetic. Then you can tweak this application in order to generate the syntax of "other" assemblers and test them also :D :D :D
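
A small sketch of such a generator, assuming MASM-style syntax. The function name generate_synthetic_asm and the SynthProc naming scheme are made up for illustration:

```python
def generate_synthetic_asm(proc_count: int) -> str:
    """Build MASM-style source text containing proc_count trivial PROCs."""
    lines = [".386", ".model flat, stdcall", ".code", ""]
    for i in range(proc_count):
        name = f"SynthProc{i:06d}"
        lines += [
            f"{name} PROC",
            f"    jmp {name}_done",  # forward reference to stress label handling
            "    nop",
            f"{name}_done:",
            "    ret",
            f"{name} ENDP",
            "",
        ]
    lines.append("END")
    return "\n".join(lines) + "\n"


source = generate_synthetic_asm(100_000)
print(len(source.splitlines()), "lines generated")
```

Changing the per-PROC template is enough to target the syntax of another assembler.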


Generally speaking you might need to do more than 2 passes.

Sometimes I have to do 3 to 5 passes for bigger applications with size optimizations off.
With size optimizations ON it can take 200+ passes to compile and fully size optimize a huge project (200.000+ lines) but this could be because my incremental solver is converging too slow.

Anyway I have seen other assemblers like FASM doing 44 passes on normal operations and I have also seen TASM being able to always perform exactly 2 passes no matter what :D

Ah and BTW do not assume that people can follow your line of thought in here. Compiler/Assembler creation is a very limited subject, kind of elite because not many people did such projects from scratch ....  and besides the terms and concepts are not easy to follow from one human to another.

Each compiler creator has his own "standards" and conventions and names the same concepts slightly differently ...  :D

Hence I suggest to use more exact and simple explanations of status and current issues if you want people to follow and give useful hints ;)

Title: Re: My Assembler Development Update
Post by: johnsa on May 06, 2012, 08:26:25 PM
Hey,

Definitely agreed on the hash, I actually started by coding the hash first and setting the data-structures up that way, but I was in a hurry to just test that it worked so I've skipped a bit of code which needs to go back in to speed that up :)
I'm going to implement the same hash scheme for the symbol table items too as you suggest.. It's a very fast hash I stumbled on once long ago .. FNV1a.. which I use often.
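
For reference, a sketch of 32-bit FNV-1a using the published offset basis and prime. The case-folding helper keyword_hash is an assumption added here because MASM keywords are case-insensitive:

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def keyword_hash(name: str) -> int:
    """Case-insensitive hash for keyword lookups (hypothetical helper)."""
    return fnv1a_32(name.upper().encode("ascii"))


print(hex(fnv1a_32(b"a")))
print(keyword_hash("mov") == keyword_hash("MOV"))
```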

Since my last post I've added ECHO and MOV r32,r32 productions.. so it's working well .. it's going to be a long and laborious process to roll out all these productions.. I think it will account for 80% of the project dev time.. but I'm happy that it's gotten to this point
already.

I like that synthetic test idea. I'll definitely do that!

I've lost you on needing more than 2 passes.. what would cause you to need more than 2?
I don't think I've seen any of my projects go beyond 2 passes..

I'd only anticipated to implement size optimization in the form of replace opcode x (2 bytes) with opcode y (1 byte) when functionally identical and setup a string table and ensure all == strings are mapped to the same string entry.. am I missing something? :)

You're right.. I've sort of got myself stuck in this head-space now so I sort of blurt out what I'm thinking.. I tried explaining the way I setup the parser to my wife... that was pointless.. haha
But then again, I suspect that only yourself and Habran might be even interested in this, and with your experience my ideas shouldn't need to be simplified  ( maybe just translated from brain inner monologue ;) )

BTW Did my note about the lexer funny around <"> and <!"> make sense? not sure if the way i've handled it is ideal.
Title: Re: My Assembler Development Update
Post by: BogdanOntanu on May 06, 2012, 08:54:17 PM
Quote from: johnsa on May 06, 2012, 08:26:25 PM
...
It's a very fast hash I stumbled on once long ago .. FNV1a.. which I use often.

Yes, I think I use the same hash function.

Quote
Since my last post I've added ECHO and MOV r32,r32 productions.. so it's working well .. it's going to be a long and laborious process to roll out all these productions.. I think it will account for 80% of the project dev time.. but I'm happy that it's gotten to this point
already.

Yes, it is important to keep your brain happy and give it candy like that in order to maintain focus. There is a lot of work to be done in adding ALL instructions and ALL parameters. I have generated a few instructions very fast also when I have started my assembler (I guess a few single byte instructions like NOP's/CLI/etc and register to register moves also).

I had the advantage of having a few huge projects of my own at that time, and I literally started parsing them and implemented each instruction needed in order to assemble my projects, making a bigger stop every time I found a new concept like includes or MACROs and PROCs or .IFs


Quote
I've lost you on needing more than 2 passes.. what would cause you to need more than 2?
I don't think I've seen any of my projects go beyond 2 passes..

First pass is for "nothing" usually ... you just note new things you encounter and accumulate symbols and stuff BUT because most things will be defined LATER in another source file you might have to postpone any code generation until the second pass... right?

The second pass should fix all of those unsolved issues of the first pass... BUT...  IF you also want to optimize the size of the instructions (jumps for example) THEN some of them might become smaller or bigger depending on how you generate the code and what exactly is after that code...

In consequence another symbol might change position (for example one label might become 0x401001 instead of being 0x401000 as you initially generated it) and it might become too far from a previously generated "short" jump...

This can generate  a chain of linked reactions and you might NOT be able to solve more than a few of them on each pass :D

IF you are not careful then you might NOT solve them at all ;)
You need to make sure that the solution to the problem converges ...

Of course on simple files and IF you impose the same rules as in HLL languages: define what you use BEFORE you use it (by using include files or some kind of forward declarations like PROTO ....) THEN you might not observe this problem initially...

But you can not evade it on bigger projects mainly because of the nature of ASM... labels are by design defined after they are used and jump to those labels can be encoded longer or shorter depending on this distance.

Anyway you can avoid most of the problems IF you forget about size optimizations but this will not look "cool" :P
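
The chain reaction can be shown with a toy model. This is only a sketch under simplifying assumptions: the only variable-size instruction is an unconditional jmp (2 bytes short, 5 bytes near in 32-bit code), every jump starts short, and a jump only ever grows. The function relax and the tuple layout are hypothetical:

```python
SHORT, NEAR = 2, 5  # sizes of jmp rel8 and jmp rel32 in 32-bit code


def relax(program):
    """program: list of ("label", name) | ("jmp", target) | ("bytes", n).

    Returns (label_addresses, jump_sizes, sizing_passes).
    """
    sizes = {i: SHORT for i, item in enumerate(program) if item[0] == "jmp"}
    passes = 0
    while True:
        passes += 1
        # Lay out the whole program using the current size guesses.
        addr, labels, starts = 0, {}, {}
        for i, (kind, arg) in enumerate(program):
            starts[i] = addr
            if kind == "label":
                labels[arg] = addr
            elif kind == "jmp":
                addr += sizes[i]
            else:
                addr += arg
        # Promote every short jump whose target is outside rel8 range.
        changed = False
        for i, size in sizes.items():
            if size == SHORT:
                disp = labels[program[i][1]] - (starts[i] + SHORT)
                if not -128 <= disp <= 127:
                    sizes[i] = NEAR
                    changed = True
        if not changed:
            return labels, sizes, passes


program = [
    ("jmp", "L1"),    # fits in a short jump only while the next jmp is short
    ("bytes", 120),
    ("jmp", "Lfar"),  # clearly out of short range
    ("bytes", 5),
    ("label", "L1"),
    ("bytes", 200),
    ("label", "Lfar"),
]
labels, sizes, passes = relax(program)
print(passes, labels)
```

In this example the first jump is exactly at the edge of short range, so extending the second jump pushes it out of range and one more sizing pass is needed. Because sizes only grow, the loop is guaranteed to stop.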


Quote
I'd only anticipated to implement size optimization in the form of replace opcode x (2 bytes) with opcode y (1 byte) when functionally identical and setup a string table and ensure all == strings are mapped to the same string entry.. am I missing something? :)

Apparently you did ;)


Quote
...
But then again, I suspect that only yourself and Habran might be even interested in this, and with your experience my ideas shouldn't need to be simplified  ( maybe just translated from brain inner monologue ;) )

They need to be simpler ... even if I could eventually understand your specific terms and concepts ... I have other projects and I only have a limited time allocated for understanding what you say and if I can not do that very fast then I have a strange tendency to return to my projects ;)

Besides there are other people here that might just know a solution or two to your problems and might drop in a hint or two ;)


Quote
BTW Did my note about the lexer funny around <"> and <!"> make sense? not sure if the way i've handled it is ideal.

Nope :D

My syntax is slightly different  hence I have "other" problems but of course that I am aware of multiple issues with using "<" and ">" I just did not have time to follow the exact problem... hence I have skipped it and answered about more common issues like "hash" etc ...

A prime example of why you need to simplify and find a way to explain that problem in an easy to follow and understand manner for people that do not want to dig deep into the subject's details ;)

Title: Re: My Assembler Development Update
Post by: jj2007 on May 06, 2012, 10:13:56 PM
Your progress looks impressive :U

On Win XP SP3 it chokes with exception 1d at 401295 for xasmd.exe lextest.asm -d -b -c32
Anything wrong with the commandline?
Title: Re: My Assembler Development Update
Post by: habran on May 07, 2012, 05:24:51 AM
Hi johnsa

If you succeed in your plans, and I believe you will, we will have the most powerful programming tool ever

I wish you success
I would also like to thank Bogdan for his help and advice
He is showing now his great personality

regards

BTW I have no problem to run it on i7 Windows7
Title: Re: My Assembler Development Update
Post by: jj2007 on May 07, 2012, 06:32:13 AM
Quote from: habran on May 07, 2012, 05:24:51 AM
BTW I have no problem to run it on i7 Windows7

Now I was able to debug the exception, on Win7-32, AMD Athlon(tm) Dual Core Processor 4450B. After
Advanced Systems Research (R) XASM Version 1.00
Copyright (C) ASR. All rights reserved.
[Release Mode Build]
[Output Format: 32bit COFF]
it stops at the crc32 instruction:
00A51293   > 8B13            mov edx, [ebx]
00A51295   . F20F38F1C2      crc32 eax, edx
00A5129A   . 83C3 04         add ebx, 4

If I circumvent that one, it continues and finishes with
...
[LEXER] token: , 1, 0, 0, line: 78
[LEXER] token: .data?, 4, 138, 0, line: 78
[LEXER] token: , 1, 0, 0, line: 79
[LEXER] token: , 1, 0, 0, line: 80
Assembly completed in 298.255 seconds.

HTH, jj
Title: Re: My Assembler Development Update
Post by: johnsa on May 07, 2012, 07:14:31 AM
Hey,

Yeah in the absence of implementing crc32 or md5 myself quickly I just used the crc32 opcode.. so that would be required. I'll have that fixed today or tomorrow with h/w detection so it'll use either the opcode or a software equivalent.

Thanks guys for all the feedback! I'll keep you posted with updates during the course of the week.

Bogdan, I have a theory (potential solution) in my head.. this is a bit tricky to explain but as you say let me try make it as simple as possible. If I'm correct this should remove the need for anything more than 2 passes.
The problem boils down to symbols which can move (labels etc) usually because of a change in encoding size.
For example a jump that is encoded as 5 bytes, and then later on you come back, realize it can be optimized to a short 2 byte form, now all your labels are in the wrong place.
So here is the plan.. on the first pass, you generate ALL these problematic opcodes as their shortest/optimal form and put the address calculations into your symbol table as per normal.
Now when we come back for the second pass, we know that all changes to encodings are going in the same direction, IE: making them larger (2 -> 5 bytes for example). What this means is that the position of symbols (or the distance to them) can only grow.
To solve this we have another table called an "Address shift" table.

So for example:

Label1:
    jmp Label4

Label2:

Label3:
   jmp Label2

Label4:

(Hypothetical scenario above.) After the first pass we have addresses for the labels; jmp Label2, for argument's sake, will remain unchanged as it's within short range. However jmp Label4 needs to be extended. So we do the extension
and write a value into the shift table, namely the address of jmp Label4 and the number of bytes we increased it by.

Now any address lookup in pass 2 uses the symbol table AND the shift table to calculate its final position.

We get to jmp Label2, we find that Label2 occurs after an entry in the shift table.. so we update its address as stored in the symbol table by summing all the shifts that occur in the table up to that point (assuming it's absolute). For relative instructions you'd take
the address of the instruction itself and the target, and sum all shifts that occur between those two in the table.

That way all the moving around is solved without the need for more passes and restarts.
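
The lookup side of the shift table could be sketched like this, assuming entries are (pass-1 address of the extended instruction, bytes added) and that a label sitting at the same address as an extended instruction does not move. The class ShiftTable is hypothetical, and the sketch does not re-check whether a shift pushes some other short jump out of range:

```python
import bisect
from itertools import accumulate


class ShiftTable:
    def __init__(self, shifts):
        """shifts: iterable of (pass1_address_of_instruction, bytes_added)."""
        ordered = sorted(shifts)
        self._addrs = [addr for addr, _ in ordered]
        self._prefix = [0] + list(accumulate(n for _, n in ordered))

    def shift_before(self, pass1_addr):
        """Total growth of all extended instructions located before pass1_addr."""
        return self._prefix[bisect.bisect_left(self._addrs, pass1_addr)]

    def final_address(self, pass1_addr):
        return pass1_addr + self.shift_before(pass1_addr)

    def relative_distance(self, from_addr, to_addr):
        """Final distance between two pass-1 addresses."""
        return self.final_address(to_addr) - self.final_address(from_addr)


# Pass-1 layout of the example: Label1 = 0, Label2 = Label3 = 2, Label4 = 4,
# with "jmp Label4" at address 0 extended from 2 to 5 bytes.
table = ShiftTable([(0, 3)])
for name, addr in [("Label1", 0), ("Label2", 2), ("Label4", 4)]:
    print(name, table.final_address(addr))
```

Storing prefix sums keeps each lookup at a binary search instead of a scan over every shift.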

Make sense?
Title: Re: My Assembler Development Update
Post by: dedndave on May 07, 2012, 07:35:26 AM
that table sounds like a good idea
of course, it could suck up resources on larger projects

as for adding features above and beyond those in masm...
when you do this, you step farther away from being compatible with masm syntax
i had given this some thought, too   :P
i was thinking that some options could be selected in the form of a comment

;OPTION NewOption

if there are any spaces between ";" and "O", it is a comment
that way, they can still have comments that start with "OPTION"
masm would, of course, ignore the line

you might even use a different keyword
;OPTIONEX NewOption
Title: Re: My Assembler Development Update
Post by: johnsa on May 07, 2012, 07:56:48 AM
On my table idea.. I realize it's a bit flawed.. hmm It would need to be 3 pass..
In which case.. you could do the same logic, but just apply the update to the symbol table directly.

IE: Your first pass was optimistic, all possible opcodes are their shortest form.

The second pass performs branch-extensions (anti-optimisation). Every time an opcode is encountered that needs to be extended, all symbols in the symbol table are updated. Once again no code is actually generated this pass.

Third pass, now all instructions should be optimal and the symbol table address correct. Generate.
Title: Re: My Assembler Development Update
Post by: johnsa on May 07, 2012, 08:01:00 AM
Quote from: dedndave on May 07, 2012, 07:35:26 AM
that table sounds like a good idea
of course, it could suck up resources on larger projects

as for adding features above and beyond those in masm...
when you do this, you step farther away from being compatible with masm syntax
i had given this some thought, too   :P
i was thinking that some options could be selected in the form of a comment

;OPTION NewOption

if there are any spaces between ";" and "O", it is a comment
that way, they can still have comments that start with "OPTION"
masm would, of course, ignore the line

you might even use a different keyword
;OPTIONEX NewOption

I like the idea of the OPTIONEX..
From what I'm doing this side I just have to be very careful not to break masm-compatibility. But I'm only interested in one way compatibility, IE: I can assemble masm source.. not the other way around. So as long as none of the lexemes and productions I add
affect existing masm functionality I should be safe.
Title: Re: My Assembler Development Update
Post by: jj2007 on May 07, 2012, 08:47:31 AM
Quote from: johnsa on May 07, 2012, 07:14:31 AM
The problem boils down to symbols which can move (labels etc) usually because of a change in encoding size.

One option might be to pass once backwards, from end to start.
Title: Re: My Assembler Development Update
Post by: dedndave on May 07, 2012, 01:37:34 PM
 :bg  then you don't know the backward reference distances
Title: Re: My Assembler Development Update
Post by: johnsa on May 07, 2012, 01:42:14 PM
I think 3 passes should solve it if you want an "optimizing" assembler.
As long as you know which symbol occurs before/after each other (IE: an address ordered version of your symbol table) and you start with shortest form so distances can only grow.
Title: Re: My Assembler Development Update
Post by: dedndave on May 07, 2012, 04:54:14 PM
not saying it is right or wrong or better
but, masm inserts space for the larger offset form, then reduces if it can
with older versions, you might have seen something like this in the disassembly

        jmp short SomeAddress
        nop(s)


but - that isn't what rubs me - lol
it can be frustrating when you want to do something like this
        ORG     (SomeLabel+3) AND -4
and the assembler spits out an error telling you that the operand must be a constant   :(

not as pertinent today as it was with 16-bit code
but, if you are writing boot sectors or - especially - ROMable code - it's a pain

i remember when i used to write BIOS's...
i would assemble, creating a MAP file with public symbols
getting the addresses of - say - 5 or 6 symbols
then adjusting EQUates in the program with those addresses
and assemble again - lol
Title: Re: My Assembler Development Update
Post by: jj2007 on May 07, 2012, 05:59:26 PM
Quote from: johnsa on May 06, 2012, 07:01:31 PM
   Performance wise it's doing about 200,000 lines per second over two full passes. The assembler runs itself in 0.4 seconds, as opposed to jwasm's 0.2 on my machine.
   Most of this performance loss is in the lexer's directive/opcode/reg lookup functions which are just table scanning string comparisons.
   I have added in hashing and sorting, and once that's finished it should be back to about 500,000 lines per second.

I wonder where the physical limits are. I did some tests on my Celeron M (slow...) for 20*(Windows.inc+WinExtra.inc) and got this:
Reading + tokenising one Mio lines from disk:
130 ms in the 1st round, 145 ms in the 2nd

156 ms for finding 60 occurrences of WM_PAINT
163 ms for finding 35580 occurrences of EQU


It's SSE2, and the Instr algo is not the absolutely fastest but close. Of course, a lexer is still a different animal...
Title: Re: My Assembler Development Update
Post by: johnsa on May 08, 2012, 02:19:27 PM
Next Update:

1) Added H/W detection for CRC32 support.
2) Added Fallback ADLER32 for software checksum if no CRC32.
3) Some bug-fixes, refactoring and optimizations.
4) Added full multi-pass support
5) Improved source file management for multiple passes to simply reset each file instead of unload/reload.
6) Added symbol table and lookups.
7) Added two more productions to parser.
8) Updated output to only show on first pass.
9) Added full DB's and hashing system for Lexer lookups for opcodes, registers, directives and symbols.
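
A sketch of the fallback from points 1 and 2, assuming the hardware check is just a boolean passed in. The function file_checksum is hypothetical, and zlib.crc32 is only a stand-in for the crc32 opcode (the opcode computes CRC-32C, so real values would differ):

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes) -> int:
    """Plain software Adler-32 as defined in RFC 1950."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a


def file_checksum(data: bytes, have_hw_crc32: bool) -> int:
    """Pick one checksum routine for the whole run."""
    if have_hw_crc32:
        return zlib.crc32(data)  # stand-in for the hardware crc32 path
    return adler32(data)


print(hex(adler32(b"Wikipedia")))
print(adler32(b"include windows.inc") == zlib.adler32(b"include windows.inc"))
```

The two routines give unrelated values, so the choice has to be made once at start-up and kept for every file in that run.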

Notes:
For the opcode, register and directive tables I opted to make some more work for myself: instead of generating the tables sorted and pre-hashed offline, the actual table is stored in a readable logical format in the code.. IE: grouped by type/alphabetical whatever.. and the
hash tables and lookups are built by the assembler on start. This means I can add directives, opcodes etc without having to re-generate any sort of table outside of the main code.

Updated Performance:
Now that the lexer, hashing and file sub-system is optimized the assembler runs itself in 78ms for me. This includes all 3 full passes (I'm assuming for now my 3 pass idea will hold). This equates currently to about 1,800,000 lines per second lexed, parsed with lookups, register values, numerical value calc, symbol table lookup etc.

Update attached.
Title: Re: My Assembler Development Update
Post by: habran on May 08, 2012, 03:45:24 PM
 :U you are really amazing
If you continue to work like that, in just a few weeks we will be able to use it for work

keep up excellent work

Title: Re: My Assembler Development Update
Post by: jj2007 on May 08, 2012, 04:00:00 PM
Quote from: habran on May 08, 2012, 03:45:24 PM
:U you are really amazing

I agree - really impressing :U
Title: Re: My Assembler Development Update
Post by: johnsa on May 08, 2012, 07:36:53 PM
Thanks :)

Luckily I think I spent so much time thinking about it and not coding it, that now I'm coding it's going quite quickly. I'm sure I'll be calling on you all to help test and provide some needed insight!

I think the biggest piece of the work is unfortunately still to come in the form of all 400-500 parser productions with conditions for state, current bit mode, pass no. etc

I am going to try my best to have it generating a BIN file in the next 2-3 days of a simple program



.686p
.mmx
.k3d
.xmm

.const

.data?

.data

myVariable db 10
MyVariable2 dd 20
AnotherOne REAL4 2.5

.code

start:
   mov eax,ebx
   mov ecx,0x20
   mov edx,10
   mov eax,32h
   mov al,10101010b
   nop
   ret

start ends


As soon as that works as a bin, I'll finish MOV opcode completely with memory addressing modes. That should be another 2-3 days.. Call it a week. Then labels, some basic jmps, optimization pass... another 5 days, then I'll try get a real OBJ file out of it. Not sure how long that will take, I'll give myself a week.
My personal deadline/objective is to have a working asm by end May (obviously far from complete still in terms of all the opcodes/macros/procs), but working in essence.. able to generate an OBJ from the above simple opcodes, optimized and be linkable by LINK with symbolic debug info in 64bit.

Please let me know if you find any bugs or issues along the way and I'll factor that in too.

Cheers!
John
Title: Re: My Assembler Development Update
Post by: johnsa on May 10, 2012, 09:59:26 AM
Next update:

1. Added some more functionality to my global state, including tracking warnings and errors + counts.
2. Started re-factoring the error system to allow it to accumulate all errors on pass 1 then display. (Almost finished this). At present it just terminates on error.
3. Update the output a bit to include the above.
4. Added the following parser productions (.386 - .686p, .code, .data, .const, .data?, .pdata, .xdata, mmx, k3d, xmm, .32bits, .16bits, .64bits...).
5. 50% implemented the section and segment manager.
6. Added parser validation of entry point and END directive.
7. Added nop, ret, mov r32,r32 opcode productions.
8. Added support to declare a variable with DB... will continue rolling out all the other data types this week.
9. Started implementing basic BIN file... if you do xasm test.asm -b you'll get a .BIN output now.. format is very simple DWORD(Length of section),DATA.... Suggestions here?
10. Fixed a command line arg handling bug.
11. 80% complete on the symbol table implementation (just waiting on the number converters).

Tonight I need to finish the numerical converters so that I can update the symbol table with those entries and have them write out too into the BIN file (IE: the data section).
Title: Re: My Assembler Development Update
Post by: jj2007 on May 10, 2012, 10:28:41 AM
Quote from: johnsa on May 10, 2012, 09:59:26 AM
2. Started re-factoring the error system to allow it to accumulate all errors on pass 1 then display.

Older versions of Masm display them one by one, newer versions (and JWasm) "en bloc". Personally I prefer the first variant... just a thought.

:U
Title: Re: My Assembler Development Update
Post by: johnsa on May 10, 2012, 10:45:31 AM
Ok, maybe I can make that a cmd line option then..

-e Terminate on Error. or something like that.
Title: Re: My Assembler Development Update
Post by: dedndave on May 10, 2012, 01:31:13 PM
i think masm stops after it has shown you 100 errors - lol
generally, if you have one problem that creates several errors, the first error listed is the one that will find it for you
i would say 20 or 30 errors is plenty   :P
Title: Re: My Assembler Development Update
Post by: jj2007 on May 10, 2012, 05:22:11 PM
I just realised I meant the progress messages when building a library. Recent ML and JWasm remain silent until the whole library is built, then it dumps a long list on you.
But regarding ordinary error messages, I agree with Dave that 20 are enough. If you want a really intelligent solution, suppress repeated "undefined symbol" stuff - once is enough.
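
One way to sketch both suggestions (a cap of 20 and reporting each undefined symbol once). The class ErrorLog and its method names are hypothetical:

```python
class ErrorLog:
    def __init__(self, max_errors=20):
        self.max_errors = max_errors
        self.messages = []
        self._undefined_seen = set()

    def error(self, message):
        """Record a message unless the cap has already been reached."""
        if len(self.messages) < self.max_errors:
            self.messages.append(message)

    def undefined_symbol(self, name, line):
        """Report an undefined symbol only the first time it is seen."""
        if name in self._undefined_seen:
            return
        self._undefined_seen.add(name)
        self.error(f"line {line}: undefined symbol '{name}'")

    @property
    def limit_reached(self):
        return len(self.messages) >= self.max_errors


log = ErrorLog()
for line in (10, 42, 97):
    log.undefined_symbol("MyFunc", line)
print(log.messages)
```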
Title: Re: My Assembler Development Update
Post by: mineiro on May 10, 2012, 10:49:06 PM
Nice job Sr johnsa
Quote
9. Started implementing basic BIN file... if you do xasm test.asm -b you'll get a .BIN output now.. format is very simple DWORD(Length of section),DATA.... Suggestions here?
From my point of view, BIN files are a raw output, a mix between data and code, but first comes code, after comes data. Bin files are like .com files, .rom files, I think the only difference between these extensions is the place that they are loaded in memory.
.386
.code
.16bits
start:
org 100h   ;com file, one segment to all, data or code,cs=ds=es=ss
nop
ret
Variable1 db 90
end start

The generated .bin file puts the data variable first, before the code. With this, I cannot rename .bin to .com and run it.
My suggestion is to assume that data can be inside code, and/or vice-versa. Bin files do not have a format, so you can remove the length of section, or turn this into an option.
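
To make the difference concrete, a sketch of both layouts for the example above. The function names are hypothetical, and the little-endian DWORD in the length-prefixed variant is an assumption about the format described in the quote:

```python
import struct


def write_flat_bin(chunks):
    """Concatenate emitted chunks in source order, with no header at all."""
    return b"".join(chunks)


def write_length_prefixed(sections):
    """DWORD(length of section) followed by the section data."""
    return b"".join(struct.pack("<I", len(s)) + s for s in sections)


code = bytes([0x90, 0xC3])  # nop, ret
data = bytes([90])          # Variable1 db 90

print(write_flat_bin([code, data]).hex())
print(write_length_prefixed([data, code]).hex())
```

Only the flat form can be renamed to .com, since the loader starts executing at the very first byte of the file.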
Title: Re: My Assembler Development Update
Post by: johnsa on May 13, 2012, 09:12:38 PM
Hey,

Ok done a bit more work.. been going a bit slowly at the moment. I re-factored some things in the lexer to ensure that $,$$ are not operators but identifiers. AND OR NOT XOR POW EXP SHL SHR SIN COS TAN are converted
into operators.

I updated the BIN file output to be just that.. a dump of the sections as they are in order. So in a few basic tests that worked ok and generated the same 3 byte output as jwasm for the example mineiro posted.
I did a bit more work on the symbol table and added a few things as pre-defined symbols on init, like $,$$.

I changed things around in terms of handling $ and $$ to link them to actual segment/section entries and maintain an actual set of tokens to represent these values.

The main reason for that is things like the expression evaluator expect tokens, not just simple numbers but identifiers/symbols so the expression evaluator can now grab these in the right form.
The expression system is almost right, it was a bit fiddly doing the infix->postfix and evaluation, handling symbol lookups as well as negative numbers.
I decided to update the lexer to NOT take something like -5 as a number, but rather as an operator (-) and a number (5).
This should allow simple expressions like
myVar db -5
to be handled by the expression evaluator (as it automatically inserts a zero token) when the stack doesn't balance.. so -5 is actually 0-5. If that makes sense.
As well as handling more complex ones like 2*4-5+(-5)/10*SIN(10)
I still need to put in automatic promotion so that the result will convert to the right type.. IE: above SIN(10) would force the expression to require a float or better.
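
A compact sketch of the infix->postfix approach with the implicit zero, assuming only numbers, + - * / and parentheses (no symbols, no SIN). Unlike the description above, the zero is inserted while tokenising, and only at the start of the expression or right after an opening parenthesis. All function names are hypothetical:

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
TOKEN_RE = re.compile(r"\s*(\d+\.\d+|\d+|[-+*/()])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if not m:
            raise ValueError(f"bad character at {pos}")
        tok = m.group(1)
        # A '-' with nothing to its left gets an implicit zero: -5 -> 0 - 5.
        if tok == "-" and (not tokens or tokens[-1] == "("):
            tokens.append("0")
        tokens.append(tok)
        pos = m.end()
    return tokens


def to_postfix(tokens):
    out, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # Pop operators of equal or higher precedence (left associative).
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                out.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                out.append(stack.pop())
            stack.pop()  # discard the "("
        else:
            out.append(tok)
    while stack:
        out.append(stack.pop())
    return out


def eval_postfix(postfix):
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    stack = []
    for tok in postfix:
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok) if "." in tok else int(tok))
    return stack[0]


print(eval_postfix(to_postfix(tokenize("-5"))))
print(eval_postfix(to_postfix(tokenize("2*4-5+(-5)/10"))))
```

Plain zero insertion does not cover a minus that directly follows another operator (as in 2*-5). That case would need a real unary operator with its own precedence.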

So now I've run into a small issue/question which maybe you guys can help with.
Once the symbol table starts running now and all identifiers/equ's etc start adding.. by the time I get past windows/winextra.inc I'm already sitting on about 60,000 entries and the table grows to a whopping 150Mb.
At which point VirtualAlloc refuses to allocate any more :) I'm storing the symbol entries one by one in a linked list, so I could change that around and have it allocate in blocks which might stop VirtualAlloc from bombing
and would probably speed things up .. IE: initialize storage for blocks of 1024 symbols at a time.. but it still doesn't change the fact that this is going to run up several hundred Mb of memory to assemble a project just
from what's defined in the standard includes.
I'm not sure if I should be handling EQU's a different way, but even if I stored them in a lighter weight setup, they'd still be using a lot of memory due to the sheer number of them.
My symbol entry (structure) is quite dense as I've put everything in there I think might be useful to allow it to handle struct,macro,proc,arguments,types,dup arrays etc.

Any thoughts?
John
Title: Re: My Assembler Development Update
Post by: dedndave on May 14, 2012, 01:06:00 AM
that's a good problem   :P
not that 150 Mb is really all that much, in the grander scheme of things
the bugger is - a lot of those symbols won't be used
and, it's hard to predict which will be needed and which will not
it's not as though you could load them on an as-needed basis

this is especially true with the masm32 package
nearly all of the equates are in windows.inc/winextra.inc
typically - the equates that belong to, say, advapi32 are in advapi32.inc - not windows.inc   :'(

one solution that comes to mind is to perhaps use a temporary file for the "windows" symbol tokens,
and keep the "project" symbol tokens in memory
if the source code accesses a symbol, the token is moved from the temporary file into memory

a little bit of memory manager strategy code is going to be needed
you might look at JwAsm, to get some ideas
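
The temporary-file idea above could be sketched roughly as follows, in Python for brevity. The class SpillSymbolTable and its methods are hypothetical, and the sketch assumes the caller already knows which symbols count as "windows" symbols and which belong to the project. Note that the name-to-offset index still lives in memory, so only the record bodies are spilled.

```python
import json
import tempfile

class SpillSymbolTable:
    """Hypothetical two-tier table: bulk symbols on disk, used ones in memory."""

    def __init__(self):
        self.memory = {}   # symbols the project defined or has touched
        self.offsets = {}  # name -> byte offset of the spilled record
        self.spill = tempfile.TemporaryFile()

    def add_project(self, name, value):
        self.memory[name] = value

    def add_system(self, name, value):
        # Append the record to the temp file and remember only where it is.
        self.spill.seek(0, 2)
        self.offsets[name] = self.spill.tell()
        self.spill.write(json.dumps([name, value]).encode() + b"\n")

    def lookup(self, name):
        if name not in self.memory and name in self.offsets:
            # First use: move the record from the temp file into memory.
            self.spill.seek(self.offsets.pop(name))
            _, value = json.loads(self.spill.readline())
            self.memory[name] = value
        return self.memory.get(name)
```

After add_system("WM_CLOSE", 16), the first lookup("WM_CLOSE") pulls the record into memory, while symbols that are never referenced stay in the file.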
Title: Re: My Assembler Development Update
Post by: johnsa on May 14, 2012, 07:54:25 AM
That's not a bad option.. But how would you determine what's project specific and what is system as it's all just includes.. without pulling some sort of hack, like a precompiled header of sorts specifically for windows.inc

the other option I guess would be to add things to the symbol table on use, not on declaration.
We could do this for all things, or at least just for equates. So on the first pass, we detect a reference to a symbol, we don't find it we create it as unknown and then only on the second pass would we actually update the symbol entry with the values from the
EQU. The only problem I can see with this approach is things like textequ .. if we create a dummy symbol on use, a text substitution for example could cause some havoc if we had it as an empty string..

Seeing as you can do var equ <text here> too.. makes it hard to limit as it's not just textequ.

The only alternative I can see to that, is that we create something like a GUID to fill the text portion of the symbol if it's a literal or quoted literal.

This way only equates and variables that are actually used are pulled into the symbol table.

Maybe Bogdan can shed some light on how he handled this.
Title: Re: My Assembler Development Update
Post by: jj2007 on May 14, 2012, 05:58:44 PM
Quote from: johnsa on May 13, 2012, 09:12:38 PM
Once the symbol table starts running now and all identifiers/equ's etc start adding.. by the time I get past windows/winextra.inc I'm already sitting on about 60,000 entries and the table grows to a whopping 150Mb.

150MB/60k=2500 bytes per entry? Why so much? Can you give an example of such entries?

By the way: We count on you. Jwasm is on ice (http://www.japheth.de/Download/JWasm/?C=M;O=D).
Title: Re: My Assembler Development Update
Post by: johnsa on May 14, 2012, 06:30:01 PM

; Symbol Table Entry Types.
STYPE_LABEL     equ 0
STYPE_PROC      equ 1
STYPE_MACRO     equ 2
STYPE_VARIABLE  equ 3
STYPE_STRUCT    equ 4
STYPE_CONST     equ 5
STYPE_EQUATE    equ 6
STYPE_ENUM      equ 7
STYPE_EXTRN     equ 8
STYPE_LITERAL   equ 9
STYPE_RECORD    equ 10
STYPE_UNION     equ 11
STYPE_NAMESPACE equ 12
STYPE_CLASS     equ 13
STYPE_METHOD    equ 14
STYPE_SEGMENT   equ 15
STYPE_SECTION   equ 16
STYPE_TEXTEQU   equ 17
STYPE_UNKNOWN   equ 18 ; For when we create a symbol before we know what to do with it.

ARGTYPE_BYTE  equ 0
ARGTYPE_WORD  equ 1
ARGTYPE_DWORD equ 2

SYMBOL struct
hash       dd ?
symType    dd ? ; Symbol Type.
filePtr    dd ? ; Source File or external lib/obj file containing definition ptr.
line       dd ? ; Line number symbol defined on.
address    dd ?
sectionDef dd ? ; Section id symbol defined in.
segmentDef dd ? ; Segment id symbol defined in.
sSize      db ? ; size in bits if variable or element bit size if DUP/Array.
sLen       dd ? ; Number of elements if Array/DUP or BSS count.
int8Value  db ? ; Actual integral value of symbol.
int16Value dw ? ; Actual integral value of symbol.
int32Value dd ? ; Actual integral value of symbol.
int64Value dq ? ; Actual integral value of symbol.
fValue32   REAL4 0.0 ; Floating point value.
fValue64   REAL8 0.0 ; Floating point value.
fValue80   REAL10 0.0 ; Floating point value.
symName    db 64 DUP (?)
isDeclared db ?
isDefined  db ?
isExtern   db ?
isType     db ? ; Is this symbol a type reference? (IE: struct or DWORD etc).
usage      dd ? ; Number of times this symbol has been referenced via offset,addr,lea,call,invoke etc.
argCount   db ? ; If Proc or macro, how many args does it have?
argTypes   db 64 DUP (?) ; Argument types.
scopePtr   dd ? ; Pointer to a scope entry (namespace, local etc).
prevPtr    dd ?
nextPtr    dd ? ; Pointer to next symbol entry in linked list.
SYMBOL ends
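
As a rough check on the per-entry size asked about above, the struct can be mirrored field by field with Python's struct module. This is a sketch that assumes the SYMBOL struct is byte-packed with no alignment padding; REAL10 has no native format code, so it is counted as 10 raw bytes. The names SYMBOL_LAYOUT, entry_size and table_size are made up for the example.

```python
import struct

# One format code per SYMBOL field, in declaration order.
# "<" selects standard sizes with no padding (byte-packed assumption).
SYMBOL_LAYOUT = "<" + "".join([
    "7I",    # hash, symType, filePtr, line, address, sectionDef, segmentDef
    "B",     # sSize
    "I",     # sLen
    "b", "h", "i", "q",  # the 8/16/32/64-bit integer value fields
    "f", "d", "10s",     # REAL4, REAL8, REAL10 (as 10 raw bytes)
    "64s",   # symName
    "4B",    # isDeclared, isDefined, isExtern, isType
    "I",     # usage
    "B",     # argCount
    "64s",   # argTypes
    "3I",    # scopePtr, prevPtr, nextPtr
])

entry_size = struct.calcsize(SYMBOL_LAYOUT)  # bytes per symbol entry
table_size = entry_size * 60_000             # bytes for 60,000 entries
```

Under that assumption an entry is 219 bytes, so 60,000 entries come to about 13 MB. If so, most of the 150Mb would have to come from somewhere other than the fields themselves; VirtualAlloc rounds every allocation up to whole pages, so one call per list node may be worth checking.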


That's tentative, it will change no doubt.
I think I've solved the issue by working backwards. I add symbols as "UNKNOWN" on reference, then on the next pass it'll use the declaration to fill in the symbol information. This means ONLY symbols that are actually used are created and makes it a lot faster.
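
A small sketch of the add-on-reference approach, in Python and heavily simplified: the source is reduced to a list of declarations and a list of referenced names, and the function name assemble and the dictionary fields are hypothetical.

```python
UNKNOWN, EQUATE = "UNKNOWN", "EQUATE"

def assemble(declarations, references):
    """declarations: list of (name, value); references: names the code uses."""
    table = {}
    # Pass 1: every reference creates a placeholder; declarations are ignored.
    for name in references:
        entry = table.setdefault(name, {"type": UNKNOWN, "value": None, "usage": 0})
        entry["usage"] += 1
    # Pass 2: a declaration only fills in a symbol that already exists.
    for name, value in declarations:
        if name in table:
            table[name].update(type=EQUATE, value=value)
    return table
```

Given declarations for WM_CLOSE, WM_PAINT and MB_OK but references only to WM_CLOSE, the table ends up with a single filled-in entry; a referenced name with no declaration stays UNKNOWN and can be reported as undefined.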

I've added in the first recursive stage of the parser for blocks/nestings and macros (anything that can be terminated by ENDM.. rept, repeat, irp, irpc.. etc).
Making a few minor adjustments now for the multi-pass, fixing a bug in expression and having it output some more debug info.
In theory then I can send out another update that should fully generate ORG directives including symbol references and expressions... IE: org 20+(-2)+myVariable AND -4 for example

Has there been an official statement on jwasm? I know that 2.07 was supposed to come out at the beginning of the year and that hasn't happened yet...
Title: Re: My Assembler Development Update
Post by: johnsa on May 21, 2012, 03:37:47 PM
Next update...

Lots of pain ...
Many bugs, much re-factoring, unit testing and regression testing. Changes to lexer and parser.

I've decided the smart thing to do would be to group instructions together according to shared parser rules. My thinking is that there are a number of instructions which can be handled by exactly the same rule set.
IE: All instructions that take NO parameters... A group for instructions that take one parameter being a memory address and so on.
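
The grouping could look something like this sketch (Python, not the assembler's own tables). The group names, the mnemonics chosen for each group and the function names are all illustrative, and a memory operand is simplified to anything written in square brackets.

```python
RULE_GROUPS = {
    "no_operands": {"nop", "cli", "sti", "hlt", "clc", "cpuid"},
    "one_memory_operand": {"lgdt", "lidt", "fxsave", "fxrstor"},
}
# Invert the groups so each mnemonic finds its rule set in one lookup.
MNEMONIC_TO_GROUP = {m: g for g, ms in RULE_GROUPS.items() for m in ms}

def parse_no_operands(mnemonic, operands):
    if operands:
        raise SyntaxError(f"{mnemonic} takes no operands")
    return (mnemonic,)

def parse_one_memory_operand(mnemonic, operands):
    ok = (len(operands) == 1
          and operands[0].startswith("[") and operands[0].endswith("]"))
    if not ok:
        raise SyntaxError(f"{mnemonic} takes one memory operand")
    return (mnemonic, operands[0])

RULES = {
    "no_operands": parse_no_operands,
    "one_memory_operand": parse_one_memory_operand,
}

def parse_line(line):
    """Split a line into mnemonic and operands, then apply the group's rule."""
    mnemonic, _, rest = line.strip().partition(" ")
    mnemonic = mnemonic.lower()
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    return RULES[MNEMONIC_TO_GROUP[mnemonic]](mnemonic, operands)
```

Adding a new instruction to an existing group is then a one-word change to the table, with no new parser code.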

I realized my idea with multiple passes was still flawed.. you DO need as many passes as it takes to solve the problem, but my solution to solve the jumps will still work. I've subsequently implemented FULL multi-pass support so that after each pass it knows
the state of forward references and symbol definition completeness.
This was necessary to solve things like this:


A equ B
B equ C
C equ D
D equ 2


While doing this I noticed that ML/ML64 handle a lot of things FAR better than JWASM. Like the above which works in ML but not in JWASM without funny errors if these values are all fwd. references and defined after use.
This also breaks jwasm:


A equ B
B equ A


Whereas ML and mine handle this as expected by hitting the maximum pass warning.
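
Both cases can be reproduced with a toy multi-pass resolver. This is a Python sketch only: equates are reduced to (name, value-or-name) pairs, and the pass limit of 16 and the name resolve_equates are assumptions for the example, not what ML or this assembler actually use.

```python
MAX_PASSES = 16  # assumed limit

def resolve_equates(source, max_passes=MAX_PASSES):
    """source: list of (name, rhs) where rhs is an int or another name.

    Returns (values, passes_used, warning_or_None).
    """
    values = {}
    for passes in range(1, max_passes + 1):
        unresolved = 0
        for name, rhs in source:
            if isinstance(rhs, int):
                values[name] = rhs
            elif rhs in values:
                values[name] = values[rhs]  # forward reference now known
            else:
                unresolved += 1             # still waiting on a later definition
        if unresolved == 0:
            return values, passes, None
    return values, max_passes, "warning: maximum pass count reached"
```

For the A/B/C/D chain each pass resolves one more link, so this sketch needs 4 passes before A becomes 2. The circular A equ B / B equ A pair never resolves and falls out through the warning after the last pass.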

My solution to only adding symbols to the symbol table on reference works nicely. I've fully implemented org, EQU, Expression evaluation and a bunch of built-in pre-defined symbols for things like $, $$, true, false, null. I've tested some ORGs, forward references,
offset operator and more.

I've added a dump of the symbol table to a .sym file when you build in debug mode with binary output.

The debug mode execution will also demonstrate the parser deciding whether to evaluate through recursion or linear-state matching.

I'm starting to look at building up the line number info necessary for debug mode output. As yet I'm not sure what COFF etc requires for this, I'm assuming it needs a line number+address reference for every instruction? As well as line number for symbol definitions (which is already stored in the symbol table). Any thoughts?
The one thing I do want to fix here over ML is that the source line number of the actual MACRO must be stored. This annoys me currently: when debugging you can't really step into a macro, and there's no reason why not.. it should be much like a proc.

I've also used all the cpu manuals and MASM manual to finish capturing every single instruction and directive into the lookups... that was painful.

Attached is the next update including the usual release/debug version with added info coming from the SYMBOL TABLE sub-system and EXPRESSION system. I've included a test file which has just about every possible expression I could come up with to test it.

Once I can solve the line number debug info and complete a few more opcode group rules, I should be able to start on the first simple OBJ generation with COFF that will actually link, run and debug properly.
(I will need some assistance or advice around what info needs to be captured to .xdata / .pdata etc).

It's going slower than I would've liked.. but at least it's going :) 8000 lines of code and counting...
Title: Re: My Assembler Development Update
Post by: BogdanOntanu on May 21, 2012, 08:16:37 PM
I think you should move the posts to the new forums :D
Title: Re: My Assembler Development Update
Post by: jj2007 on May 21, 2012, 10:42:43 PM
Quote from: johnsa on May 21, 2012, 03:37:47 PM
I'm starting to look at building up the line number info necessary for debug mode output. As yet I'm not sure what COFF etc requires for this, I'm assuming it needs a line number+address reference for every instruction? As well as line number for symbol definitions (which is already stored in the symbol table). Any thoughts?
The one thing I do want to fix here over ML is that the line number in source of the actual MACRO must be stored (as this annoys me currently) when debugging you can't really step into a macro and there's no reason why not.. it should be much like a proc.

You probably saw the mapinfo:lines thread (http://www.masm32.com/board/index.php?topic=18874.0).

Good luck for your project :thumbu