Building an Assembler from scratch

Started by johnsa, April 23, 2012, 03:40:06 PM

Previous topic - Next topic

zemtex

Yes and masm is for non commercial use only
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

i don't think that's quite correct
my understanding of the licensing for the free versions is:

not for writing code for use under non-ms operating systems
not for writing operating systems

johnsa

Thanks for all the hints and points!

I agree 100% on the testing side.

I think once the basic skeleton is in place, I'll need to create a test file with every possible instruction in it.. assemble it to BIN using something like jwasm and then use it as a test script to compare each assembled instruction matches.. fun :)

So if I follow your design concept, avoid the traditional HLL style parser with syntax trees, you're basically going for a huge state machine thats basically case driven..

Example:
mov eax,20+(2*3)

Token Stream:
OP(MOV) REG(EAX) COMMA NUM(20) OP_PLUS LPAREN NUM(2) OP_MUL NUM(3) RPAREN

So you'd read a single token.. OP(MOV)
That would take you to a particular code block (a state change and intended to handle MOV generation)
Then the next token is read REG(EAX)
So you go to another sub-code-block which handles cases for mov r32,? (r32/mem32/imm32)
Then you get COMMA so you continue (no errors reported yet)
Then you get to NUM(20) .. now you know its definitely not mov r32,r32
but it could still be mem32 or imm32 (IE: a simple numeric, one that needs evaluation or a memory address that starts with a numeric.. like mov eax,myLabel+20
You'd push the NUM(20) onto a stack for potential evaluation
Then you take the OP_PLUS, push onto an operator stack
handle the parentheses.. etc.. you get to the end of the line and there has been no reference to a label or symbol or use of an equate (IE: no state changes necessary)..
wrap up the evaluation stack and you now know its mov r32,imm32 and you generate the obj. code

Does that sound right?


dedndave

when you start coding, one of the first things you'll find is that many instructions may have different forms
simple example...
        xchg    ecx,edx
        xchg    edx,ecx

2 ways of coding the same thing
a little different if one of the operands is EAX
        xchg    eax,edx
        xchg    edx,eax

one is a single byte, the other is 2 bytes
most assemblers will replace XCHG EDX,EAX with XCHG EAX,EDX - it saves a byte   :P

when it comes to complex addressing modes, i think you'll find many cases
the point is - you may correctly code an instruction, but differently than another assembler

johnsa

Definitely, I've found many examples of that already in my investigation.
I would work on opting for the shortest instruction form where possible (copying ml64 or jwasm output) is a good place to start as a lot of those decisions have been made already. It's just a case of working through every single case and seing why the particular opcode was selected.

dedndave

encoding (or decoding, depending on how you call it) instructions is the least of your worries, i would think

handling macros   :dazzled:

johnsa

macros shouldn't be too bad.. if we're going for the whole state machine model..

ie:
name MACRO a:REQ
mov eax,a
ENDM

tokens:

ID(name) DIR(MACRO) ID(a) COLON DIR(REQ)
OP(MOV) REG32(EAX) COMMA ID(a)
DIR(ENDM)

So on declaration it would check/create a symbol table entry for name and link that to the token stream created from statements... the asm would have a flag like inMacroDefintion.. in which case it will expect an ENDM token before allowing other directives for example ...

on use the macro token stream would be inserted and a replaced...

for things like REPT I think I'd maintain "variables" in the assembler itself.. so

i = 0
rept 100

i = i+1
endm

the assembler would allocate i, process the statements and update it's internal variable i .. the inner part would be expanded iteratively then by the normal parser code..

Thats sort of how I have this in my head.. (which could be bollocks :)

dedndave

it's probably not so much implementing a macro, but parsing it   :P
some of them can get rather hairy
i figure, if i can't understand what it does when i read it, i can't write the code to create it - lol

johnsa

Ahh true :) But remember its usually the logic of what the macro does that's confusing not each individual component..
So it's easy to understand how to implement say CatStr or reparg.. but when you put 20 of them together.. the overall purpose / logic of the macro can be obtuse... luckily neither I nor the parser should be worried about that as long
as each component works :)


johnsa

Yeah.. its horrific :)

But when you break it down line by line and then component by component it's not so bad..  its just when it all comes together it looks like the rantings of an insane person :) hehe