Building an Assembler from scratch

Started by johnsa, April 23, 2012, 03:40:06 PM

johnsa

Hey all,

Just for curiosity's sake, having used asm for years, I wanted to look at building an assembler from scratch. I've read many online articles and tutorials, and have started working my way through the dragon book.

To get started I decided to use ANTLR and compile an EBNF / set of production rules to get through the whole lexer phase. I thought this would also help proof the grammar, even though I want the final product to be able to assemble itself and thus be written in asm.

In any event, I'm relatively happy with the grammar and constructs thus far. I started looking at writing a full DFA / Thompson etc. regex parser in asm to use for the tokenization, but I've since realized I can hand-code something quite similar.
1) First question: from experience, is it worth writing the regex parser, or should I just stick to my own hand-coded implementation, which basically works like a state machine evaluating characters as it goes? (bearing in mind the parser implementation)
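
For illustration, a hand-coded scanner of the kind described in question 1 can be quite small. This is only a sketch in Python rather than asm, assuming a tiny made-up token set (IDENT, NUM and a few punctuation tokens); the names `tokenize` and `PUNCT` are hypothetical and not part of any existing tool.

```python
# Hypothetical token names; a real assembler would need many more.
PUNCT = {',': 'COMMA', '*': 'OP_TIMES', '+': 'OP_PLUS', '-': 'OP_MINUS',
         '(': 'LPAREN', ')': 'RPAREN', '[': 'LBRACKET', ']': 'RBRACKET',
         ':': 'COLON'}

def tokenize(line):
    """Scan one source line character by character, like a small state machine."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch in ' \t':                      # state: skipping whitespace
            i += 1
        elif ch == ';':                      # state: comment runs to end of line
            break
        elif ch.isalpha() or ch == '_':      # state: inside an identifier
            start = i
            while i < n and (line[i].isalnum() or line[i] == '_'):
                i += 1
            tokens.append(('IDENT', line[start:i]))
        elif ch.isdigit():                   # state: inside a number (digits plus suffix letters)
            start = i
            while i < n and line[i].isalnum():
                i += 1
            tokens.append(('NUM', line[start:i]))
        elif ch in PUNCT:                    # single-character punctuation
            tokens.append((PUNCT[ch], ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at column {i}")
    return tokens
```

Each `elif` branch plays the role of one state, so no regex engine is involved. Reserved words and registers would be resolved afterwards by looking IDENT text up in the opcode / reserved word DB.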

2) Second question: should I deal with macros and directives as a separate pass altogether, or should I simply roll this up into the parser in general? i.e. be able to emit source lines as opposed to object code.
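
To make the "separate pass" option in question 2 concrete, here is a rough Python sketch of a pre-pass that expands macros into plain source lines before any parsing happens. It assumes a simplified MASM-like `name macro params` ... `endm` form with no nesting, no LOCAL and no default arguments; `expand_macros` is a hypothetical name.

```python
import re

def expand_macros(lines):
    """Pre-pass: record 'name macro a, b' ... 'endm' bodies, then replace
    each invocation with its body, emitting plain source lines."""
    macros = {}          # name -> (parameter names, body lines)
    out = []
    current = None       # name of the macro being recorded, if any
    for line in lines:
        words = line.split()
        if current is not None:
            if words and words[0].lower() == 'endm':
                current = None
            else:
                macros[current][1].append(line)
        elif len(words) >= 2 and words[1].lower() == 'macro':
            params = [p.strip() for p in ' '.join(words[2:]).split(',') if p.strip()]
            current = words[0]
            macros[current] = (params, [])
        elif words and words[0] in macros:
            params, body = macros[words[0]]
            args = [a.strip() for a in line.split(None, 1)[1].split(',')] if len(words) > 1 else []
            for body_line in body:
                for p, a in zip(params, args):
                    # whole-word replacement of the parameter by the argument text
                    body_line = re.sub(rf'\b{re.escape(p)}\b', lambda m, a=a: a, body_line)
                out.append(body_line)
        else:
            out.append(line)
    return out
```

The output is ordinary source text, so the later passes never need to know macros existed. The cost is that error messages have to be mapped back to the original lines.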

I've created a DB of opcodes and reserved words already.
I'm relatively comfortable with how to handle the symbol table, multiple passes, back-patching etc. But where I've come a bit unstuck is how best to handle the stream of tokens and the parsing stage.
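
As a reminder of what the back-patching mentioned above looks like in practice, here is a minimal Python sketch. It assumes a toy input language with only `label:`, `nop` (0x90) and `jmp label` (always encoded as E9 rel32); the function name `assemble` and the fixup list layout are made up for illustration.

```python
import struct

def assemble(lines):
    """One pass plus back-patching. Supports only 'label:', 'nop' and
    'jmp label' (E9 rel32), which is enough to show forward references."""
    code = bytearray()
    symbols = {}     # label -> offset in code
    fixups = []      # (offset of the rel32 field, label)
    for line in lines:
        text = line.strip()
        if text.endswith(':'):
            symbols[text[:-1]] = len(code)
        elif text == 'nop':
            code.append(0x90)
        elif text.startswith('jmp '):
            code.append(0xE9)
            fixups.append((len(code), text[4:].strip()))
            code += b'\x00\x00\x00\x00'      # placeholder, patched below
        elif text:
            raise SyntaxError(f"unsupported line: {line!r}")
    for pos, label in fixups:
        if label not in symbols:
            raise NameError(f"undefined label {label}")
        rel = symbols[label] - (pos + 4)     # rel32 is relative to the end of the jmp
        struct.pack_into('<i', code, pos, rel)
    return bytes(code)
```

A real assembler would also want to pick the short (EB rel8) form where it fits, which is what makes the extra passes necessary.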

I don't want to limit myself by trying to hard-code anything and using complex jump table setups, I'd like to keep it flexible and open-ended as it's absolutely essential that it supports higher level functionality like macros, type checking. Basically I want ML 32 + incbin + cinclude + enum + namespaces + (possibly built in new/delete/class/method directives) but supporting only 64bit. I have no need to replicate ML32 .. it's perfect for me.

I know most people will say why bother? and the project is too large for anyone to undertake, but that's really not important to me now.. I have lots of time, spare resources and it can happily take a year to build, as long as the foundation is solid.

My idea originally with the parsing was to take the token stream and match it to generic parser-rules for example:
MOV REG32 COMMA IMM32
MOV REG32 COMMA MEM32

Then in the grammar IMM32 was recursively defined by terms and factors, and a MEM32 used regex-style matching, e.g. (SEG_REG COLON)* OPEN_BRACKET REG32 TIMES* .. etc etc..
If the parser rules are built like this, a regex parser would be required to validate token structures into syntactically correct statements.
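
One possible way to picture the rule matching described above: once each operand has been reduced to a general class, checking a statement against forms like MOV REG32 COMMA IMM32 becomes a table lookup. The Python sketch below assumes a deliberately tiny, hypothetical form table and a very crude `classify` helper; it is not how any particular assembler does it.

```python
# Hypothetical form table: mnemonic -> list of accepted operand-class tuples.
FORMS = {
    'MOV': [('REG32', 'REG32'), ('REG32', 'IMM32'),
            ('REG32', 'MEM32'), ('MEM32', 'REG32')],
}
REG32 = {'eax', 'ebx', 'ecx', 'edx', 'esi', 'edi', 'esp', 'ebp'}

def classify(operand):
    """Reduce one operand's text to a general class (very simplified)."""
    op = operand.strip().lower()
    if op in REG32:
        return 'REG32'
    if op.startswith('[') and op.endswith(']'):
        return 'MEM32'
    if op.isdigit():
        return 'IMM32'
    raise SyntaxError(f"cannot classify operand {operand!r}")

def match_form(line):
    """Return (mnemonic, form) if the statement matches a known form."""
    mnemonic, _, rest = line.strip().partition(' ')
    operands = tuple(classify(o) for o in rest.split(',')) if rest.strip() else ()
    for form in FORMS.get(mnemonic.upper(), []):
        if form == operands:
            return mnemonic.upper(), form
    raise SyntaxError(f"no form of {mnemonic} takes {operands}")
```

In this arrangement the recursive part (expressions, memory operands) lives inside `classify`, and the instruction table itself stays flat data that can be extended without touching code.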

I then thought about normal immediate expressions and their simplification via postfix/prefix notation and tree structures or a stack.
Now I really can't decide the best way forward..

As an example:
mov eax,2*(3+2)

assuming the tokenizer works we land up with the following tokens:

OPCODE(MOV) REG32(EAX) COMMA NUM(2) OP_TIMES LPAREN NUM(3) OP_PLUS NUM(2) RPAREN
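
For the expression part of that token stream, the postfix route can be sketched with the classic shunting-yard algorithm. The Python below is an illustration only: it assumes the tokens have already been reduced to ints and operator strings, that all operators are binary and left-associative, and that parentheses are balanced; `to_postfix` and `eval_postfix` are hypothetical names.

```python
import operator

PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.floordiv}

def to_postfix(tokens):
    """Shunting-yard: infix token list -> postfix (RPN) list."""
    output, stack = [], []
    for tok in tokens:
        if isinstance(tok, int):
            output.append(tok)
        elif tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()                       # discard the '('
        else:                                 # binary operator
            while (stack and stack[-1] != '('
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
    while stack:
        output.append(stack.pop())
    return output

def eval_postfix(postfix):
    """Evaluate an RPN list with a value stack."""
    stack = []
    for tok in postfix:
        if isinstance(tok, int):
            stack.append(tok)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[tok](left, right))
    return stack.pop()
```

For the example, `[2, '*', '(', 3, '+', 2, ')']` becomes `[2, 3, 2, '+', '*']` and evaluates to 10, so the statement reduces to mov eax,10.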

Bearing in mind I'd like the parser to support macros, type checking and all the higher level syntax (.if PROC INVOKE etc), should I be looking at creating a tree of these tokens for evaluation?
If so, what should the tree look like? To my mind an AST isn't really necessary for an assembler, even with the high level functionality, as it's not really going to be optimizing or performing data-flow analysis or writing out blocks of code.. just simple evaluation, simplification, substitution and emitting.

My thought with the tree was that we could simplify the tree as we go and that would handle numeric expressions as well as taking each construct back to its most general form, i.e. MOV REG32,IMM32. In addition, nodes in the tree could execute custom functions to handle things like ADDR, OFFSET and SIZEOF which would then create attributes that pass back up the tree as it evaluates.

I guess after all my waffling.. the question really is.. IS the tree the right way to go, and how should the tree be structured?

INSTRUCTION
|-- MOV
|-- COMMA
|-- EAX
`-- EXPRESSION
    `-- OP(*)
        |-- (2)
        `-- EXPR
            |-- LPAREN
            |-- OP(+)
            |   |-- (3)
            |   `-- (2)
            `-- RPAREN
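
One way to read the "simplify as we go" idea for the EXPRESSION subtree: a small recursive-descent parser can build nodes and fold them into a number the moment both children are known. The Python sketch below makes several assumptions: tokens are ints, symbol names or operator strings; only + - * are handled; and the parentheses are consumed by the parser instead of being kept as LPAREN / RPAREN nodes. All function names are hypothetical.

```python
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def node(op, left, right):
    """Build an operator node, folding it to a number when both sides are known."""
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)

def parse_expr(tokens, pos=0):
    """expr := term (('+'|'-') term)*"""
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ('+', '-'):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        left = node(op, left, right)
    return left, pos

def parse_term(tokens, pos):
    """term := factor ('*' factor)*"""
    left, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] == '*':
        right, pos = parse_factor(tokens, pos + 1)
        left = node('*', left, right)
    return left, pos

def parse_factor(tokens, pos):
    """factor := NUMBER | SYMBOL | '(' expr ')'"""
    tok = tokens[pos]
    if tok == '(':
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ')':
            raise SyntaxError("expected ')'")
        return value, pos + 1
    return tok, pos + 1          # an int literal or an unresolved symbol name
```

With all-numeric input the whole expression collapses to a single value, which gives the general MOV REG32,IMM32 form. If a symbol is not yet known, a partial tree survives and can be re-evaluated on a later pass.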

And how would that relate to other functions such as directive, macro declaration etc..

I hope this didn't bore you to sleep as any guidance would be hugely appreciated!


For reference I've listed the draft of the grammar file below that I started using for proofing:



// XASM Lexer/Parser Grammar.
grammar XASM;

parse
   : (t=.
{System.out.printf("text: \%-7s type: \%s \n",
$t.text, tokenNames[$t.type]);}
     )*
     EOF
   ;

/*
-----------------------------------------------
   Reserved Words and Directives
-----------------------------------------------
*/

Dir_Align        : 'align';
Dir_Ptr          : 'ptr';
Dir_Null         : 'null';
Dir_Equ          : 'equ';
Dir_Byte         : 'byte';
Dir_Char         : 'char';
Dir_Word         : 'word';
Dir_Dword        : 'dword';
Dir_Int          : 'int';
Dir_Int16        : 'int16';
Dir_Int32        : 'int32';
Dir_Int64        : 'int64';
Dir_Uint16       : 'uint16';
Dir_Uint32       : 'uint32';
Dir_Uint64       : 'uint64';
Dir_Float        : 'float';
Dir_Double       : 'double';
Dir_QWord        : 'qword';
Dir_DQWord       : 'dqword';
Dir_OWord        : 'oword';
Dir_MMWord       : 'mmword';
Dir_XMMWord      : 'xmmword';
Dir_TByte        : 'tbyte';
Dir_Dollar       : '$';
Dir_DollarDollar : '$$';
Dir_Namespace    : 'namespace';
Dir_NamespaceRef : '::';
Dir_SizeOf       : 'sizeof';
Dir_Addr         : 'addr';
Dir_Offset       : 'offset';
Dir_TextEqu      : 'textequ';
Dir_Db           : 'db';
Dir_Dw           : 'dw';
Dir_Dd           : 'dd';
Dir_Dq           : 'dq';
Dir_Dt           : 'dt';
Dir_Dup          : 'dup';
Dir_Real4        : 'real4';
Dir_Real8        : 'real8';
Dir_Real10       : 'real10';
Dir_Struct       : 'struct';
Dir_Struc        : 'struc';
Dir_End          : 'end';
Dir_Ends         : 'ends';
Dir_Macro        : 'macro';
Dir_EndM         : 'endm';
Dir_Proc         : 'proc';
Dir_Proto        : 'proto';
Dir_Include      : 'include';
Dir_IncludeLib   : 'includelib';
Dir_CInclude     : 'cinclude';
Dir_IncBin       : 'incbin';
Dir_Segment      : 'segment';
Dir_Public       : 'public';
Dir_Private      : 'private';
Dir_Use16        : 'use16';
Dir_Use32        : 'use32';
Dir_Use64        : 'use64';
Dir_Class        : 'class';
Dir_This         : 'this';
Dir_New          : 'new';
Dir_Delete       : 'delete';
Dir_Method       : 'method';
Dir_Return       : 'return';
Dir_Void         : 'void';
Dir_Req          : 'req';
Dir_Vararg       : 'vararg';
Dir_Assume       : 'assume';
Dir_Union        : 'union';
Dir_Record       : 'record';
Dir_Enum         : 'enum';
Dir_Rept         : 'rept';
Dir_Org          : 'org';
Dir_Echo         : 'echo';
Dir_Comment      : 'comment';
Dir_Option       : 'option';
Dir_Uses         : 'uses';
Dir_High         : 'high';
Dir_Low          : 'low';
Dir_HighWord     : 'highword';
Dir_LowWord      : 'lowword';
Dir_Length       : 'length';
Dir_Size         : 'size';
Dir_Invoke       : 'invoke';
Dir_LengthOf     : 'lengthof';
Dir_Type         : 'type';
Dir_InStr        : 'instr';
Dir_SizeStr      : 'sizestr';
Dir_CatStr       : 'catstr';
Dir_SubStr       : 'substr';
Dir_Opattr       : 'opattr';
Dir_Local        : 'local';
Dir_If           : 'if';
Dir_Ife          : 'ife';
Dir_Ifb          : 'ifb';
Dir_Ifnb         : 'ifnb';
Dir_Ifdef        : 'ifdef';
Dir_Ifndef       : 'ifndef';
Dir_Ifdif        : 'ifdif';
Dir_Ifdifi       : 'ifdifi';
Dir_Ifidn        : 'ifidn';
Dir_Ifidni       : 'ifidni';
Dir_ElseIf       : 'elseif';
Dir_EndIf        : 'endif';
Dir_Label        : 'label';
Dir_Mbreak       : '.break';
Dir_Mcontinue    : '.continue';
Dir_Seccode      : '.code';
Dir_Secconst     : '.const';
Dir_Secdata      : '.data';
Dir_Secudata     : '.data?';
Dir_Secxdata     : '.xdata';
Dir_Secpdata     : '.pdata';
Dir_Extern       : 'extern';
Dir_Externdef    : 'externdef';
Dir_Typedef      : 'typedef';
Dir_For          : 'for';
Dir_Forc         : 'forc';
Dir_Goto         : 'goto';
Dir_While        : 'while';
Dir_EndP         : 'endp';
Dir_Rel          : 'rel';
Dir_Abs          : 'abs';
Dir_LocalLabel   : '@@';
Dir_ForwardJmp   : '@F';
Dir_BackJmp      : '@B';
Dir_Short        : 'short';
Dir_Far          : 'far';
Dir_Near         : 'near';

HLL_if        : '.if';
HLL_elseif    : '.elseif';
HLL_else      : '.else';
HLL_endif     : '.endif';
HLL_for       : '.for';
HLL_while     : '.while';
HLL_endw      : '.endw';
HLL_repeat    : '.repeat';
HLL_Until     : '.until';
HLL_switch    : '.switch';
HLL_case      : '.case';

/*
--------------------------------------------
   ALL MNEMONICS (INSTRUCTIONS)
   Not Broken down into groups as
   general, fpu, mmx, sse, sse2, avx etc.
   these are just for general lexer,
   parser will handle mmx vs sse cases
   of instructions and groupings etc.
--------------------------------------------
*/

Instr_adc:        'adc';
Instr_add:        'add';
Instr_and:        'and';
Instr_addpd:      'addpd';
Instr_addps:      'addps';
Instr_addsd:      'addsd';
Instr_addss:      'addss';
Instr_addsubpd:   'addsubpd';
Instr_addsubps:   'addsubps';
Instr_andnpd:     'andnpd';
Instr_andnps:     'andnps';
Instr_andpd:      'andpd';
Instr_andps:      'andps';
Instr_blendpd:    'blendpd';
Instr_blendps:    'blendps';
Instr_blendvpd:   'blendvpd';
Instr_blendvps:   'blendvps';
Instr_bsf:        'bsf';
Instr_bsr:        'bsr';
Instr_bswap:      'bswap';
Instr_bt:         'bt';
Instr_btc:        'btc';
Instr_btr:        'btr';
Instr_bts:        'bts';
Instr_call:       'call';
Instr_callf:      'callf';
Instr_cbw:        'cbw';
Instr_clc:        'clc';
Instr_cld:        'cld';
Instr_clflush:    'clflush';
Instr_cli:        'cli';
Instr_clts:       'clts';
Instr_cmc:        'cmc';
Instr_cmovb:      'cmovb';
Instr_cmovnae:    'cmovnae';
Instr_cmovc:      'cmovc';
Instr_cmovbe:     'cmovbe';
Instr_cmovna:     'cmovna';
Instr_cmovl:      'cmovl';
Instr_cmovnge:    'cmovnge';
Instr_cmovle:     'cmovle';
Instr_cmovng:     'cmovng';
Instr_cmovnb:     'cmovnb';
Instr_cmovae:     'cmovae';
Instr_cmovnc:     'cmovnc';
Instr_cmovnbe:    'cmovnbe';
Instr_cmova:      'cmova';
Instr_cmovnl:     'cmovnl';
Instr_cmovge:     'cmovge';
Instr_cmovnle:    'cmovnle';
Instr_cmovg:      'cmovg';
Instr_cmovno:     'cmovno';
Instr_cmovnp:     'cmovnp';
Instr_cmovpo:     'cmovpo';
Instr_cmovns:     'cmovns';
Instr_cmovnz:     'cmovnz';
Instr_cmovne:     'cmovne';
Instr_cmovo:      'cmovo';
Instr_cmovp:      'cmovp';
Instr_cmovpe:     'cmovpe';
Instr_cmovs:      'cmovs';
Instr_cmovz:      'cmovz';
Instr_cmove:      'cmove';
Instr_cmp:        'cmp';
Instr_cmppd:      'cmppd';
Instr_cmpps:      'cmpps';
Instr_cmps:       'cmps';
Instr_cmpsw:      'cmpsw';
Instr_cmpsd:      'cmpsd';
Instr_cmpsq:      'cmpsq';
Instr_cmpxchg:    'cmpxchg';
Instr_cmpxchg8b:  'cmpxchg8b';
Instr_cmpxchg16b: 'cmpxchg16b';
Instr_comisd:     'comisd';
Instr_comiss:     'comiss';
Instr_cpuid:      'cpuid';
Instr_crc32:      'crc32';

Instr_mov:        'mov'; /*test*/

/*Instr_cvtdq2pd:
Instr_cvtdq2ps:
Instr_cvtpd2dq:
Instr_cvtpd2pi:
Instr_cvtpd2ps:
Instr_cvtpi2pd:
Instr_cvtpi2ps:
Instr_cvtps2dq:
Instr_cvtps2pd:
Instr_cvtps2pi:
Instr_cvtsd2si:
Instr_cvtsd2ss:
Instr_cvtsi2sd:
Instr_cvtsi2ss:
Instr_cvtss2sd:
Instr_cvtss2si:
Instr_cvttpd2dq:
Instr_cvttpd2pi:
Instr_cvttps2dq:
Instr_cvttps2pi:
Instr_cvttsd2si:
Instr_cvttss2si:
Instr_cwd:
Instr_cdq:
Instr_cqo:
Instr_cwde:
Instr_cdqe:
Instr_dec:
Instr_div:
Instr_divpd:
Instr_divps:
Instr_divsd:
Instr_divss:
Instr_dppd:
Instr_dpps:
Instr_emms:
Instr_enter:
Instr_extractps:
Instr_extrq:
Instr_f2xm1:
Instr_fabs:
Instr_fadd:
Instr_faddp:
Instr_fbld:
Instr_fbstp:
Instr_fchs:
Instr_fclex:
Instr_fcmovb:
Instr_fcmovbe:
Instr_fcmove:
Instr_fcmovnb:
Instr_fcmovnbe:
Instr_fcmovne:
Instr_fcmovnu:
Instr_fcmovu:
Instr_fcom:
Instr_fcom2:
Instr_fcomi:
Instr_fcomip:
Instr_fcomp:
Instr_fcomp3:
Instr_fcomp5:
Instr_fcompp:
Instr_fcos:
Instr_fdecstp:
Instr_fdiv:
Instr_fdivp:
Instr_fdivr:
Instr_fdivrp:
Instr_ffree:
Instr_ffreep:
Instr_fiadd:
Instr_ficom:
Instr_ficomp:
Instr_fidiv:
Instr_fidivr:
Instr_fild:
Instr_fimul:
Instr_fincstp:
Instr_finit:
Instr_fist:
Instr_fistp:
Instr_fisttp:
Instr_fisub:
Instr_fisubr:
Instr_fld:
Instr_fld1:
Instr_fldcw:
Instr_fldenv:
Instr_fldl2e:
Instr_fldl2t:
Instr_fldlg2:
Instr_fldln2:
Instr_fldpi:
Instr_fldz:
Instr_fmul:
Instr_fmulp:
Instr_fnclex:
Instr_fndisi:
Instr_fneni:
Instr_fninit:
Instr_fnop:
Instr_fnsave:
Instr_fnsetpm:
Instr_fnstcw:
Instr_fnstenv:
Instr_fnstsw:
Instr_fpatan:
Instr_fprem:
Instr_fprem1:
Instr_fptan:
Instr_frndint:
Instr_frstor:
Instr_fsave:
Instr_fscale:
Instr_fsin:
Instr_fsincos:
Instr_fsqrt:
Instr_fst:
Instr_fstcw:
Instr_fstenv:
Instr_fstp:
Instr_fstp1:
Instr_fstp8:
Instr_fstp9:
Instr_fstsw:
Instr_fsub:
Instr_fsubp:
Instr_fsubr:
Instr_fsubrp:
Instr_ftst:
Instr_fucom:
Instr_fucomi:
Instr_fucomip:
Instr_fucomp:
Instr_fucompp:
Instr_fwait:
Instr_wait:
Instr_fxam:
Instr_fxch:
Instr_fxch4:
Instr_fxch7:
Instr_fxrstor:
Instr_fxsave:
Instr_fxtract:
Instr_fyl2x:
Instr_fyl2xp1:
Instr_getsec:
Instr_haddpd:
Instr_haddps:
Instr_hintnop:
Instr_hlt:
Instr_hsubpd:
Instr_hsubps:
Instr_idiv:
Instr_imul:
Instr_in:
Instr_inc:
Instr_ins:
Instr_insb:
Instr_insw:
Instr_insd:
Instr_insertps:
Instr_insertq:
Instr_int:
Instr_int1:
Instr_icebp:
Instr_into:
Instr_invd:
Instr_invept:
Instr_invlpg:
Instr_invvpid:
Instr_iret:
Instr_iretd:
Instr_iretq:
Instr_jb:
Instr_jnae:
Instr_jc:
Instr_jbe:
Instr_jna:
Instr_jecxz:
Instr_jrcxz:
Instr_jl:
Instr_jnge:
Instr_jle:
Instr_jng:
Instr_jmp:
Instr_jmpf:
Instr_jnb:
Instr_jae:
Instr_jnc:
Instr_jnbe:
Instr_ja:
Instr_jnl:
Instr_jge:
Instr_jnle:
Instr_jg:
Instr_jno:
Instr_jnp:
Instr_jpo:
Instr_jns:
Instr_jnz:
Instr_jne:
Instr_jo:
Instr_jp:
Instr_jpe:
Instr_js:
Instr_jz:
Instr_je:
Instr_lahf:
Instr_lar:
Instr_lddqu:
Instr_ldmxcsr:
Instr_lea:
Instr_leave:
Instr_lfence:
Instr_lfs:
Instr_lgdt:
Instr_lgs:
Instr_lidt:
Instr_lldt:
Instr_lmsw:
Instr_lods:
Instr_lodsb:
Instr_lodsw:
Instr_lodsd:
Instr_lodsq:
Instr_loop:
Instr_loopnz:
Instr_loopne:
Instr_loopz:
Instr_loope:
Instr_lsl:
Instr_lss:
Instr_ltr:
Instr_lzcnt:
Instr_maskmovdqu:
Instr_maskmovq:
Instr_maxpd:
Instr_maxps:
Instr_maxsd:
Instr_maxss:
Instr_mfence:
Instr_minpd:
Instr_minps:
Instr_minsd:
Instr_minss:
Instr_monitor:
Instr_mov:
Instr_movapd:
Instr_movaps:
Instr_movbe:
Instr_movd:
Instr_movq:
Instr_movddup:
Instr_movdq2q:
Instr_movdqa:
Instr_movdqu:
Instr_movhlps:
Instr_movhpd:
Instr_movhps:
Instr_movlhps:
Instr_movlpd:
Instr_movlps:
Instr_movmskpd:
Instr_movmskps:
Instr_movntdq:
Instr_movntdqa:
Instr_movnti:
Instr_movntpd:
Instr_movntps:
Instr_movntss:
Instr_movntsd:
Instr_movntq:
Instr_movq:
Instr_movq2dq:
Instr_movs:
Instr_movsb:
Instr_movsw:
Instr_movsd:
Instr_movsq:
Instr_movsd:
Instr_movshdup:
Instr_movsldup:
Instr_movss:
Instr_movsx:
Instr_movsxd:
Instr_movupd:
Instr_movups:
Instr_movzx:
Instr_mpsadbw:
Instr_mul:
Instr_mulpd:
Instr_mulps:
Instr_mulsd:
Instr_mulss:
Instr_mwait:
Instr_neg:
Instr_nop:
Instr_not:
Instr_or:
Instr_orpd:
Instr_orps:
Instr_out:
Instr_outs:
Instr_outsb:
Instr_outsw:
Instr_outsd:
Instr_packssdw:
Instr_packsswb:
Instr_packuswb:
Instr_packusdw:
Instr_paddb:
Instr_paddd:
Instr_paddq:
Instr_paddsb:
Instr_paddsw:
Instr_paddusb:
Instr_paddusw:
Instr_paddw:
Instr_palignr:
Instr_pand:
Instr_pandn:
Instr_pause:
Instr_pavgb:
Instr_pavgw:
Instr_pblendvb:
Instr_pblendw:
Instr_pcmpeqb:
Instr_pcmpeqd:
Instr_pcmpeqq:
Instr_pcmpeqw:
Instr_pcmpestri:
Instr_pcmpestrm:
Instr_pcmpgtb:
Instr_pcmpgtd:
Instr_pcmpgtw:
Instr_pcmpgtq:
Instr_pcmpistri:
Instr_pcmpistrm:
Instr_pextrb:
Instr_pextrd:
Instr_pextrq:
Instr_pextrw:
Instr_phminposuw:
Instr_pinsrb:
Instr_pinsrd:
Instr_pinsrq:
Instr_pinsrw:
Instr_pmaddwd:
Instr_pminsb:
Instr_pmaxsb:
Instr_pminuw:
Instr_pmaxuw:
Instr_pminud:
Instr_pmaxud:
Instr_pminsd:
Instr_pmaxsd:
Instr_pmaxsw:
Instr_pmaxub:
Instr_pminsw:
Instr_pminub:
Instr_pmovmskb:
Instr_pmovsxbw:
Instr_pmovzxbw:
Instr_pmovsxbd:
Instr_pmovzxbd:
Instr_pmovsxbq:
Instr_pmovzxbq:
Instr_pmovsxwd:
Instr_pmovzxwd:
Instr_pmovsxwq:
Instr_pmovzxwq:
Instr_pmovsxdq:
Instr_pmovzxdq:
Instr_pmuldq:
Instr_pmulld:
Instr_pmulhw:
Instr_pmullw:
Instr_pmuludq:
Instr_pop:
Instr_popcnt:
Instr_popf:
Instr_popfq:
Instr_por:
Instr_prefetchnta:
Instr_prefetcht0:
Instr_prefetcht1:
Instr_prefetcht2:
Instr_psadbw:
Instr_pshufd:
Instr_pshufhw:
Instr_pshuflw:
Instr_pshufw:
Instr_pslld:
Instr_pslldq:
Instr_psllq:
Instr_psllw:
Instr_psrad:
Instr_psraw:
Instr_psrld:
Instr_psrldq:
Instr_psrlq:
Instr_psrlw:
Instr_psubb:
Instr_psubd:
Instr_psubq:
Instr_psubsb:
Instr_psubsw:
Instr_psubusb:
Instr_psubusw:
Instr_psubw:
Instr_ptest:
Instr_punpckhbw:
Instr_punpckhdq:
Instr_punpckhqdq:
Instr_punpckhwd:
Instr_punpcklbw:
Instr_punpckldq:
Instr_punpcklqdq:
Instr_punpcklwd:
Instr_push:
Instr_pushf:
Instr_pushfq:
Instr_pxor:
Instr_rcl:
Instr_rcpps:
Instr_rcpss:
Instr_rcr:
Instr_rdmsr:
Instr_rdpmc:
Instr_rdtsc:
Instr_rdtscp:
Instr_retf:
Instr_retn:
Instr_rol:
Instr_ror:
Instr_roundpd:
Instr_roundps:
Instr_roundsd:
Instr_roundss:
Instr_rsm:
Instr_rsqrtps:
Instr_rsqrtss:
Instr_sahf:
Instr_sal:
Instr_shl:
Instr_sar:
Instr_sbb:
Instr_scas:
Instr_scasb:
Instr_scasw:
Instr_scasd:
Instr_scasq:
Instr_setb:
Instr_setnae:
Instr_setc:
Instr_setbe:
Instr_setna:
Instr_setl:
Instr_setnge:
Instr_setle:
Instr_setng:
Instr_setnb:
Instr_setae:
Instr_setnc:
Instr_setnbe:
Instr_seta:
Instr_setnl:
Instr_setge:
Instr_setnle:
Instr_setg:
Instr_setno:
Instr_setnp:
Instr_setpo:
Instr_setns:
Instr_setnz:
Instr_setne:
Instr_seto:
Instr_setp:
Instr_setpe:
Instr_sets:
Instr_setz:
Instr_sete:
Instr_sfence:
Instr_sgdt:
Instr_shl:
Instr_sal:
Instr_shld:
Instr_shr:
Instr_shrd:
Instr_shufpd:
Instr_shufps:
Instr_sidt:
Instr_sldt:
Instr_smsw:
Instr_sqrtpd:
Instr_sqrtps:
Instr_sqrtsd:
Instr_sqrtss:
Instr_stc:
Instr_std:
Instr_sti:
Instr_stmxcsr:
Instr_stos:
Instr_stosb:
Instr_stosw:
Instr_stosd:
Instr_stosq:
Instr_str:
Instr_sub:
Instr_subpd:
Instr_subps:
Instr_subsd:
Instr_subss:
Instr_swapgs:
Instr_syscall:
Instr_sysenter:
Instr_sysexit:
Instr_sysret:
Instr_test:
Instr_ucomisd:
Instr_ucomiss:
Instr_ud:
Instr_ud2:
Instr_unpckhpd:
Instr_unpckhps:
Instr_unpcklpd:
Instr_unpcklps:
Instr_verr:
Instr_verw:
Instr_vmcall:
Instr_vmclear:
Instr_vmlaunch:
Instr_vmptrld:
Instr_vmptrst:
Instr_vmread:
Instr_vmresume:
Instr_vmwrite:
Instr_vmxoff:
Instr_vmxon:
Instr_wbinvd:
Instr_wrmsr:
Instr_xadd:
Instr_xchg:
Instr_xgetbv:
Instr_xlat:
Instr_xlatb:
Instr_xor:
Instr_xorpd:
Instr_xorps:
Instr_xrstor:
Instr_xsave:
Instr_xsetbv:
Instr_vbroadcastss:
Instr_vbroadcastsd:
Instr_vbroadcastf128:
Instr_vinsertf128:
Instr_vextractf128:
Instr_vmaskmovps:
Instr_vmaskmovpd:
Instr_vpermilps:
Instr_vpermilpd:
Instr_vperm2f128:
Instr_vzeroall:
Instr_vzeroupper:
*/

/*
----------------------------------------
INSTRUCTION PREFIXES
----------------------------------------
*/

Instr_Prefix_lock:    'lock';
Instr_Prefix_rep:     'rep';
Instr_Prefix_repnz:   'repnz';
Instr_Prefix_repne:   'repne';
Instr_Prefix_repz:    'repz';
Instr_Prefix_repe:    'repe';
Instr_Prefix_btaken:  'btaken';
Instr_Prefix_bntaken: 'bntaken';

/*
----------------------------------------
REGISTERS AND REGISTER GROUPS
----------------------------------------
*/

AnyReg:      (GPReg8 | GPReg16 | GPReg32 | GPReg64 | RegMM | RegXMM | RegYMM | DebugReg | ControlReg | Reg_rsp | Reg_sp | Reg_esp);

DebugReg:    (Reg_dr0 | Reg_dr1 | Reg_dr2 | Reg_dr3 | Reg_dr4 | Reg_dr5 | Reg_dr6 | Reg_dr7);
TestReg:     (Reg_tr0 | Reg_tr1 | Reg_tr2 | Reg_tr3 | Reg_tr4 | Reg_tr5 | Reg_tr6 | Reg_tr7);
ControlReg:  (Reg_cr0 | Reg_cr1 | Reg_cr2 | Reg_cr3 | Reg_cr4 | Reg_cr5 | Reg_cr6 | Reg_cr7);
FloatReg:    (Reg_st | Reg_st0 | Reg_st1 | Reg_st2 | Reg_st3 | Reg_st4 | Reg_st5 | Reg_st6 | Reg_st7);
SegReg:      (Reg_cs | Reg_ds | Reg_es | Reg_fs | Reg_gs | Reg_ss);

GPReg:       (GPReg8 | GPReg16 | GPReg32 | GPReg64);

GPReg8:      (Reg_al | Reg_bl | Reg_cl | Reg_dl | Reg_ah | Reg_bh | Reg_ch | Reg_dh | Reg_r8b | Reg_r9b | Reg_r10b | Reg_r11b | Reg_r12b | Reg_r13b | Reg_r14b | Reg_r15b | Reg_sil | Reg_dil | Reg_spl | Reg_bpl);
GPReg16:     (Reg_ax | Reg_bx | Reg_cx | Reg_dx | Reg_si | Reg_di | Reg_bp | Reg_r8w | Reg_r9w | Reg_r10w | Reg_r11w | Reg_r12w | Reg_r13w | Reg_r14w | Reg_r15w );
GPReg32:     (Reg_eax | Reg_ebx | Reg_ecx | Reg_edx | Reg_esi | Reg_edi | Reg_esp | Reg_ebp | Reg_r8d | Reg_r9d | Reg_r10d | Reg_r11d | Reg_r12d | Reg_r13d | Reg_r14d | Reg_r15d);
GPReg64:     (Reg_rax | Reg_rbx | Reg_rcx | Reg_rdx | Reg_rsi | Reg_rdi | Reg_rsp | Reg_rbp | Reg_r8 | Reg_r9 | Reg_r10 | Reg_r11 | Reg_r12 | Reg_r13 | Reg_r14 | Reg_r15);
RegMM:       (Reg_mm0 | Reg_mm1 | Reg_mm2 | Reg_mm3 | Reg_mm4 | Reg_mm5 | Reg_mm6 | Reg_mm7);
RegXMM:      (Reg_xmm0 | Reg_xmm1 | Reg_xmm2 | Reg_xmm3 | Reg_xmm4 | Reg_xmm5 | Reg_xmm6 | Reg_xmm7 | Reg_xmm8 | Reg_xmm9 | Reg_xmm10 | Reg_xmm11 | Reg_xmm12 | Reg_xmm13 | Reg_xmm14 | Reg_xmm15);
RegYMM:      (Reg_ymm0 | Reg_ymm1 | Reg_ymm2 | Reg_ymm3 | Reg_ymm4 | Reg_ymm5 | Reg_ymm6 | Reg_ymm7 | Reg_ymm8 | Reg_ymm9 | Reg_ymm10 | Reg_ymm11 | Reg_ymm12 | Reg_ymm13 | Reg_ymm14 | Reg_ymm15);

Reg_al       : 'al';
Reg_ah       : 'ah';
Reg_bl       : 'bl';
Reg_bh       : 'bh';
Reg_cl       : 'cl';
Reg_ch       : 'ch';
Reg_dl       : 'dl';
Reg_dh       : 'dh';
Reg_ax       : 'ax';
Reg_bx       : 'bx';
Reg_cx       : 'cx';
Reg_dx       : 'dx';
Reg_ip       : 'ip';
Reg_si       : 'si';
Reg_di       : 'di';
Reg_sp       : 'sp';
Reg_bp       : 'bp';
Reg_eax      : 'eax';
Reg_ebx      : 'ebx';
Reg_ecx      : 'ecx';
Reg_edx      : 'edx';
Reg_eip      : 'eip';
Reg_esi      : 'esi';
Reg_edi      : 'edi';
Reg_esp      : 'esp';
Reg_ebp      : 'ebp';
Reg_rax      : 'rax';
Reg_rbx      : 'rbx';
Reg_rcx      : 'rcx';
Reg_rdx      : 'rdx';
Reg_rip      : 'rip';
Reg_rsi      : 'rsi';
Reg_rdi      : 'rdi';
Reg_rbp      : 'rbp';
Reg_rsp      : 'rsp';
Reg_r8       : 'r8';
Reg_r9       : 'r9';
Reg_r10      : 'r10';
Reg_r11      : 'r11';
Reg_r12      : 'r12';
Reg_r13      : 'r13';
Reg_r14      : 'r14';
Reg_r15      : 'r15';
Reg_r8b      : 'r8b';
Reg_r9b      : 'r9b';
Reg_r10b     : 'r10b';
Reg_r11b     : 'r11b';
Reg_r12b     : 'r12b';
Reg_r13b     : 'r13b';
Reg_r14b     : 'r14b';
Reg_r15b     : 'r15b';
Reg_r8w      : 'r8w';
Reg_r9w      : 'r9w';
Reg_r10w     : 'r10w';
Reg_r11w     : 'r11w';
Reg_r12w     : 'r12w';
Reg_r13w     : 'r13w';
Reg_r14w     : 'r14w';
Reg_r15w     : 'r15w';
Reg_r8d      : 'r8d';
Reg_r9d      : 'r9d';
Reg_r10d     : 'r10d';
Reg_r11d     : 'r11d';
Reg_r12d     : 'r12d';
Reg_r13d     : 'r13d';
Reg_r14d     : 'r14d';
Reg_r15d     : 'r15d';
Reg_sil      : 'sil';
Reg_dil      : 'dil';
Reg_spl      : 'spl';
Reg_bpl      : 'bpl';
Reg_cs       : 'cs';
Reg_ds       : 'ds';
Reg_es       : 'es';
Reg_fs       : 'fs';
Reg_gs       : 'gs';
Reg_ss       : 'ss';
Reg_st       : 'st';
Reg_st0      : 'st0' | 'st(0)';
Reg_st1      : 'st1' | 'st(1)';
Reg_st2      : 'st2' | 'st(2)';
Reg_st3      : 'st3' | 'st(3)';
Reg_st4      : 'st4' | 'st(4)';
Reg_st5      : 'st5' | 'st(5)';
Reg_st6      : 'st6' | 'st(6)';
Reg_st7      : 'st7' | 'st(7)';
Reg_mm0      : 'mm0';
Reg_mm1      : 'mm1';
Reg_mm2      : 'mm2';
Reg_mm3      : 'mm3';
Reg_mm4      : 'mm4';
Reg_mm5      : 'mm5';
Reg_mm6      : 'mm6';
Reg_mm7      : 'mm7';
Reg_xmm0     : 'xmm0';
Reg_xmm1     : 'xmm1';
Reg_xmm2     : 'xmm2';
Reg_xmm3     : 'xmm3';
Reg_xmm4     : 'xmm4';
Reg_xmm5     : 'xmm5';
Reg_xmm6     : 'xmm6';
Reg_xmm7     : 'xmm7';
Reg_xmm8     : 'xmm8';
Reg_xmm9     : 'xmm9';
Reg_xmm10    : 'xmm10';
Reg_xmm11    : 'xmm11';
Reg_xmm12    : 'xmm12';
Reg_xmm13    : 'xmm13';
Reg_xmm14    : 'xmm14';
Reg_xmm15    : 'xmm15';
Reg_ymm0     : 'ymm0';
Reg_ymm1     : 'ymm1';
Reg_ymm2     : 'ymm2';
Reg_ymm3     : 'ymm3';
Reg_ymm4     : 'ymm4';
Reg_ymm5     : 'ymm5';
Reg_ymm6     : 'ymm6';
Reg_ymm7     : 'ymm7';
Reg_ymm8     : 'ymm8';
Reg_ymm9     : 'ymm9';
Reg_ymm10    : 'ymm10';
Reg_ymm11    : 'ymm11';
Reg_ymm12    : 'ymm12';
Reg_ymm13    : 'ymm13';
Reg_ymm14    : 'ymm14';
Reg_ymm15    : 'ymm15';
Reg_cr0      : 'cr0';
Reg_cr1      : 'cr1';
Reg_cr2      : 'cr2';
Reg_cr3      : 'cr3';
Reg_cr4      : 'cr4';
Reg_cr5      : 'cr5';
Reg_cr6      : 'cr6';
Reg_cr7      : 'cr7';
Reg_efer     : 'efer';
Reg_dr0      : 'dr0';
Reg_dr1      : 'dr1';
Reg_dr2      : 'dr2';
Reg_dr3      : 'dr3';
Reg_dr4      : 'dr4';
Reg_dr5      : 'dr5';
Reg_dr6      : 'dr6';
Reg_dr7      : 'dr7';
Reg_tr0      : 'tr0';
Reg_tr1      : 'tr1';
Reg_tr2      : 'tr2';
Reg_tr3      : 'tr3';
Reg_tr4      : 'tr4';
Reg_tr5      : 'tr5';
Reg_tr6      : 'tr6';
Reg_tr7      : 'tr7';
Reg_flags    : 'flags';
Reg_eflags   : 'eflags';

/*
-----------------------------------------------
   Parser Rules
-----------------------------------------------
*/

Equ:        Identifier Dir_Equ Immediate;

/*
#####################################
Legacy Instructions
#####################################
*/

Movr8r8:    Instr_mov GPReg8 Comma GPReg8;
Movr8m8:    Instr_mov GPReg8 Comma MemAddr;
Movm8r8:    Instr_mov MemAddr Comma GPReg8;

/*
#####################################
  VMX Instructions
#####################################
*/

/*
-----------------------------------------------
   Memory Addressing Formats
MemAddr: (OParen? ( PtrType? Identifier? SegOverride? '[' (Memi|Memd) (Expression)? ']' CParen?);
       | (OParen? PtrType? Identifier (Expression)? CParen?);       

-----------------------------------------------
*/

MemAddr: (OParen? PtrType? Identifier (Expression)? CParen?);       
Memi:  GPReg64 ('+' (GPReg64|GPReg32) )? ('*' ('1'|'2'|'4'|'8') )?;
Memd:  Expression; 
PtrType: (Identifier|Dir_Byte|Dir_Word|Dir_Dword|Dir_QWord|Dir_DQWord|Dir_OWord|Dir_MMWord|Dir_XMMWord|Dir_TByte|Dir_Real4|Dir_Real8|Dir_Real10) Dir_Ptr;

SegOverride: ('cs'|'ds'|'es'|'fs'|'gs'|'ss') ':';

/*
-----------------------------------------------
   Numeric Expressions (Simplifiable)
-----------------------------------------------
*/

Expression: Immediate; /*TODO*/

/*
-----------------------------------------------
   Lexer Rules
-----------------------------------------------
*/

LogicalOr  : '||';
LogicalAnd : '&&';
OpNot    : '!';
BitwiseOr  : '|';
BitwiseAnd : '&';
Equals     : '==';
NEquals    : '!=' | '<>';
GTEquals   : '>=';
LTEquals   : '<=';
GT         : '>';
LT         : '<';
Pow        : '^';
OpAdd      : '+';
Subtract   : '-';
Multiply   : '*';
Divide     : '/';
Modulus    : '%';
OBrace     : '{';
CBrace     : '}';
OBracket   : '[';
CBracket   : ']';
OParen     : '(';
CParen     : ')';
SColon     : ';';
Colon      : ':';
QMark      : '?';
Assign     : '=';
Comma      : ',';

Bool:         ('true'|'TRUE'|'false'|'FALSE');

Label:        Identifier Colon;
LineContinue: '\\';
Identifier:   (Letter|'_') ('_'|Letter|DecDigit)*;

String:     '\'' ( '\\' . | ~('\\'|'\'') )* '\'';

Immediate:  HexNum | OctNum | DecNum | BinNum | FloatNum;

HexNum:     DecDigit HexDigit* ('h'|'H')
      |     '0' ('x'|'X') HexDigit+;

OctNum:     OctDigit+ ('o'|'O');
DecNum:     ('+'|'-')? DecDigit+ ('d'|'D')?;
BinNum:     BinDigit+ ('b'|'B');

FloatNum:   ('+'|'-')? DecDigit+ '.' DecDigit+ Exponent? /*1.2, -1.2, 2.1e-3, 2.1e3*/
        |   ('+'|'-')? DecDigit+ Exponent? /*1, 2, 2e3, 2e-3*/
        |   ('+'|'-')? '.' DecDigit+ Exponent?; /* .2, -.3, -.2e-3*/

fragment Exponent: ('e'|'E') ('+'|'-')? ('0'..'9')+;

fragment DecDigit: '0'..'9';
fragment HexDigit: ('0'..'9'|'a'..'f'|'A'..'F');
fragment OctDigit: '0'..'7';
fragment BinDigit: ('0'|'1');
fragment Letter:   ('a'..'z'|'A'..'Z');

WS:            (' '|'\t'|'\n'|'\r') {$channel=HIDDEN;};
LINE_COMMENT:  ';' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
C_COMMENT:     '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
BLOCK_COMMENT: '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}; /* non-greedy so the comment can span lines and stops at the first close marker */
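
As a side note on the numeric rules (HexNum, OctNum, DecNum, BinNum) above: the suffix letters 'b' and 'd' are also hex digits, so whatever converts the matched text has to look at the last character first. A small Python sketch of that conversion, assuming the text has already been matched as one of those integer shapes (`parse_int_literal` is a hypothetical helper name):

```python
def parse_int_literal(text):
    """Convert a MASM-style integer literal to an int, following the
    HexNum / OctNum / DecNum / BinNum shapes (suffix letter or 0x prefix)."""
    s = text.lower()
    if s.startswith('0x'):          # 0x1d: prefix form, checked before any suffix
        return int(s[2:], 16)
    if s.endswith('h'):             # 0dh, 1bh: 'd' and 'b' here are hex digits
        return int(s[:-1], 16)
    if s.endswith('o'):
        return int(s[:-1], 8)
    if s.endswith('b'):
        return int(s[:-1], 2)
    if s.endswith('d'):
        return int(s[:-1], 10)
    return int(s, 10)               # plain decimal, optional sign
```

The same ordering concern applies inside the lexer: "1bh" has to be matched as one hex token, not as the binary literal "1b" followed by "h".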


dedndave

a daunting task, to be sure

maybe if i were younger.......
as it is, i'd be too old to use it if i finished it

johnsa

I like daunting :) I'm always up for a challenge.
I've been working on an OS project on the side, and to really take it any further I need an assembler for it. It's x64 only.
From my Windows development point of view, I really can't live without 64bit anymore. I refuse to move to VC just to have decent 64bit support.
Habran and I had some discussions around making fixes to jwasm to get it fully functional. I've discussed them with Japheth too but he's seriously time constrained. Habran made some of the fixes, I was attempting to get CodeView v8 support right, but after several weeks of trying, I gave up.. mostly because I'm just not familiar enough with jwasm's internal tables and workings to make the shift. I do finally have sufficient documentation to implement the OBJ file format with Symbolic Debug Info though.

As far as I'm concerned, under Windows masm can't be beat for 32bit work. So if I'm going to do this, I'm happy for it to be 64bit only, which removes a significant portion of the work: no need to support legacy operating modes, full sets of opcode encodings and legacy segment declarations. So by MASM compatible I would hope to achieve compatibility with how it's commonly used now under Windows.

i.e. no support for things like:

data32 segment para public use32 'data'
assume ds:data32

etc.. My OS doesn't need that and I haven't needed that since DOS :)

BogdanOntanu

Been there done that kind of feeling here ;)
(with my SOL_ASM 32/64 bits) and my Solar_OS (again 32/64 bits) :D

And of course the current Sol_Asm (version N) assembles with Sol_Asm (version N-1).

Good luck with your path and a few tips:

- You can do it with a hand-made parser; you do not need the standard grammar / lexer kind of tool chain and regex parsing, but if you like that kind of parser then I guess that is OK too.

I have used a custom hand-made parser and it has worked perfectly until now. I fully support MACROs and .IF .ELSEIF and INVOKE and ENUMS and STRUCS and UNIONS and a bunch of other advanced stuff in both 32 and 64 bit modes... I generate binary and OBJ for COFF32/64, ELF32/64 and MACHO and even older formats like OMF.
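
To give a feel for the OBJ output side, here is a hedged Python sketch that packs just the 20-byte COFF file header found at the start of an x64 object file. The field order follows the published PE/COFF layout; the function name and parameters are made up for illustration, and section headers, the symbol table and relocations are left out entirely.

```python
import struct

IMAGE_FILE_MACHINE_AMD64 = 0x8664   # machine type value for x64

def coff_file_header(num_sections, symtab_offset, num_symbols, timestamp=0):
    """Pack the 20-byte COFF file header that starts an x64 .obj file."""
    return struct.pack('<HHIIIHH',
                       IMAGE_FILE_MACHINE_AMD64,   # Machine
                       num_sections,               # NumberOfSections
                       timestamp,                  # TimeDateStamp
                       symtab_offset,              # PointerToSymbolTable
                       num_symbols,                # NumberOfSymbols
                       0,                          # SizeOfOptionalHeader (0 for objects)
                       0)                          # Characteristics
```

Everything after this header depends on values only known at the end of assembly, which is one reason the output stage is usually written against an in-memory model of sections and symbols.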

I made the code portable on Intel CPUs, hence it runs on Windows (all versions) and Linux and FreeBSD and MacOSX and of course Solar_OS...

It is great to have your own assembler for your own OS, hence I say that you should go for it! The only problem is time and real life... after a while real life has a tendency to catch up with you :D But with perseverance and tolerance for long pauses ... one can easily succeed.

I would somehow suggest VC also for real life projects but for fun your own ASM is OK.

Do not base any profit or fame on such things though. There are tons of free open source assemblers out there besides MASM (FASM, JWASM, NASM, YASM). The same goes for custom hobby OSes.

But for fun and learning nothing really equals such a project. I extract great pleasure and joy from coding Sol_Asm and Sol_OS in "pure" ASM... and it relaxes me a lot.

IMHO you will still need most of the 32 bit encodings and address calculations for 64 bits anyway, because many are still valid in 64 bit mode.
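
A small illustration of that point: the register-to-register MOV (opcode 89 /r) uses the same opcode and ModRM byte in 32 and 64 bit form, and the 64 bit variant only adds a REX prefix in front. The Python sketch below covers just this one instruction form, for eax..edi, rax..rdi, r8..r15 and r8d..r15d; the helper names are hypothetical.

```python
LOW_REGS = ['ax', 'cx', 'dx', 'bx', 'sp', 'bp', 'si', 'di']

def reg_info(name):
    """Return (register number 0-15, is_64bit) for a 32/64-bit GP register."""
    name = name.lower()
    digits = name[1:].rstrip('d')
    if name.startswith('r') and digits.isdigit() and 8 <= int(digits) <= 15:
        return int(digits), not name.endswith('d')
    if name[:1] in ('e', 'r') and name[1:] in LOW_REGS:
        return LOW_REGS.index(name[1:]), name[0] == 'r'
    raise ValueError(f"unknown register {name}")

def encode_mov_rr(dst, src):
    """Encode 'mov dst, src' with opcode 89 /r (dst in r/m, src in reg)."""
    d, d64 = reg_info(dst)
    s, s64 = reg_info(src)
    if d64 != s64:
        raise ValueError("operand sizes differ")
    # REX = 0100WRXB: W = 64-bit operand, R extends ModRM.reg, B extends ModRM.rm
    rex = 0x40 | (d64 << 3) | ((s >> 3) << 2) | (d >> 3)
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)      # mod=11: register direct
    prefix = bytes([rex]) if rex != 0x40 else b''
    return prefix + bytes([0x89, modrm])
```

So mov eax,ebx comes out as 89 D8 and mov rax,rbx as 48 89 D8, with the last two bytes shared.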

For an OS you need 16 bits (bootloader) and 32 bits (protected mode setup and older CPUs without 64 bits, like netbooks) and 64 bits for long mode (advanced).

We can discuss these kinds of issues here and exchange ideas about assembler and OS creation if you like ;)

I have somewhat limited but functional support for symbol generation in SOL_ASM (my own text kind of files and OBJ symbols) and I would be interested if you have any information on generating PDBs and more complex symbol information.


Ambition is a lame excuse for the ones not brave enough to be lazy.
http://www.oby.ro

habran

Hi johnsa,
I think that it is a great project
and that you should use your parser, because then you know exactly what you are doing
it would be great to have a self-compiling assembler (I was thinking about translating JWasm to asm code)
are you going to write it in 32-bit or 64-bit code?
if you are writing 64-bit code you will have to implement the OBJ file format with Symbolic Debug Info first into the JWasm

regards

johnsa

I think going hand-made might actually be less work in the long run, especially for an assembler. It's not like I'm trying to recreate C# or anything :)
Agreed on real life, but I have a fair amount of time on my hands and this has huge benefits for me even for work. Most of my day to day work is usually stuck with .NET / Web stuff, however I have managed to train several other devs in using MASM and they're all in love with it as I knew they would be ! ;) I've built a couple of full commercial projects in Asm (one is quite large at around 50k lines), which we maintain and to be honest I actually find it more manageable than the C++ code would have been.
That being said I'd never rule VC out for other work where required, just when I have a choice I'd go asm.
If I got the asm finished, it would certainly be free.. for two reasons, one I don't think there is money in it and secondly I feel it only fair to give back to the asm community in any way possible.

For my OS code I've used fasm for the boot loaders (16-bit up to long mode etc) and I'm happy to keep using it for that portion. I really only need a new one for the main OS + applications to run on it.
I see your point about the addressing modes etc.. It might land up making sense to fully support 32 and 64bit.

The information I have on the debug data is as follows:

http://www.fantastictimes.co.za/specs.html

That is the elusive reference that disappeared from the internet some time ago on which you can build. It's the CodeView 5 info (at the bottom). To migrate from CV5 to CV8 I found the following:

http://www.hackchina.com/en/r/48474/yasm-0.7.1-_-modules-_-dbgfmts-_-codeview-_-cv8.txt__html

Extract from the YASM info; it has some reverse-engineered notes about the CV8 format and its changes from CV5.

Then there is the pecoff_v8.docx file (on the same server/url). That covers the full COFF and 64bit OBJ file spec.
Codeview.pdf is also at the same URL.

Initially I plan to code this under Windows using masm (32bit) which should be able to output BIN/COFF/COFF64+debug/my custom format (VX)+debug .. from there I'd port it to 64bit using itself.

So based on the plan to go hand-coded lexer/parser without a full EBNF, grammar etc. let me re-factor my questions

1) Loading a single source file, I plan to convert each line into a linked-list structure storing other information about each line (this will also help to allow line inserts from macro parsing).
2) Should I pre-process ALL include statements first or should I allow conditional includes to be picked up from the start by parsing a line at a time and handling macros/conditional assembly?
3) Should parsing of macro and high level stuff be a separate pass, or can/should I roll this all up into a single parser? - If not it kind of answers 2 by essentially pre-processing the source file into more pure/traditional rolled out ASM source ( a temp file perhaps? )
4) Assuming the lexer is right.. It will be processing a line at a time and outputting a stream of tokens:

Each token structure will have attributes/meta data.. for example its type, group, numerical values, ptr to the original text, length etc.

Based on that if I take several different lines of code:

a equ 22h
mov eax,10
mov eax,10+(2*3)
mov eax,10+(2*sizeof(var))
mov eax,(mytype ptr [esi+ebx*2]).member

The token streams for certain lines are inherently simple already, like "a EQU 22h".
Others, where there is some form of numeric expression or a memory address calculation (which can be seen in much the same way), need to be simplified first.
This is where I'm not sure of the best (or even just a workable) approach.
I was thinking of having a generic process of converting every stream of tokens into a tree that can be evaluated and simplified upwards. Each node could pass attributes and calculated values back up the tree. In theory that could simplify expressions and deal with things like sizeof, type ptr etc.
I would imagine you don't want to write translation code for EVERY single possible scenario, so you'd want to group stuff together as much as possible..

ie: translation blocks for
mov r32,r32
mov r32,imm32
mov r32,mem32
mov mem32,r32

Some of these might be groupable as well...

Perhaps the best way to start is to do some manual worked examples of something like MOV EAX,10+(2*4) through lexing and parsing..
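
A minimal sketch of that worked example, written in Python for brevity rather than in asm. Everything here is an assumption made for illustration: the function names (tokenize, eval_expr, assemble_mov_r32_imm), the tuple-based token layout and the tiny grammar are hypothetical, and only the "mov r32, constant-expression" form is handled. The one hard fact it relies on is the Intel encoding of MOV r32, imm32 as opcode B8+rd followed by a little-endian 32-bit immediate.

```python
import re
import struct

# Hypothetical token pattern: numbers (decimal, or MASM-style hex such as 22h),
# identifiers, and any other single character as punctuation.
TOKEN_RE = re.compile(r"\s*(?:(\d[0-9A-Fa-f]*[hH]|\d+)|([A-Za-z_]\w*)|(.))")

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def tokenize(line):
    """Lexing stage: turn one source line into (kind, value) tuples."""
    tokens = []
    for num, ident, punct in TOKEN_RE.findall(line):
        if num:
            value = int(num[:-1], 16) if num[-1] in "hH" else int(num)
            tokens.append(("NUM", value))
        elif ident:
            tokens.append(("ID", ident.lower()))
        elif punct.strip():
            tokens.append(("PUNCT", punct))
    return tokens


def eval_expr(tokens, pos):
    """expr := term (('+' | '-') term)*"""
    value, pos = eval_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (("PUNCT", "+"), ("PUNCT", "-")):
        op = tokens[pos][1]
        rhs, pos = eval_term(tokens, pos + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, pos


def eval_term(tokens, pos):
    """term := factor (('*' | '/') factor)*"""
    value, pos = eval_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (("PUNCT", "*"), ("PUNCT", "/")):
        op = tokens[pos][1]
        rhs, pos = eval_factor(tokens, pos + 1)
        value = value * rhs if op == "*" else value // rhs
    return value, pos


def eval_factor(tokens, pos):
    """factor := NUM | '(' expr ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of line")
    kind, text = tokens[pos]
    if kind == "NUM":
        return text, pos + 1
    if (kind, text) == ("PUNCT", "("):
        value, pos = eval_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ("PUNCT", ")"):
            raise SyntaxError("expected ')'")
        return value, pos + 1
    raise SyntaxError(f"unexpected token {text!r}")


def assemble_mov_r32_imm(line):
    """Parsing + code generation for the single form 'mov r32, expr'."""
    tokens = tokenize(line)
    if (len(tokens) < 4 or tokens[0] != ("ID", "mov")
            or tokens[1][1] not in REG32 or tokens[2] != ("PUNCT", ",")):
        raise SyntaxError("only 'mov r32, expr' is handled in this sketch")
    value, pos = eval_expr(tokens, 3)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    # B8+rd id: opcode carries the register number, then the imm32.
    return bytes([0xB8 + REG32[tokens[1][1]]]) + struct.pack("<I", value & 0xFFFFFFFF)


print(assemble_mov_r32_imm("MOV EAX,10+(2*4)").hex())
```

The expression is folded to the constant 18 before any encoding happens, so the code generator only ever sees "mov r32, imm32". Symbols, sizeof and memory operands would need more than this.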





shlomok

Quote from: BogdanOntanu on April 23, 2012, 09:38:06 PM
I would somehow suggest VC also for real life projects but for fun your own ASM is OK.

BogdanOntanu,
If you were at the design phase of an open source project that includes a KMD driver (or a windows service) and a UI which needs to be shipped for both 32 and 64 bit windows, would you opt for asm?
I expect around 5000 lines of code for the 32 bit version, and VC 11 beta has fantastic support for driver development for both 32 and 64 bit targets.

Would the time spent on porting 32 bit asm to 64 bit make it not "profitable" in terms of the time invested compared with VC 11?

Thanks,

S.

dedndave

i can think of one reason to write an assembler
that would be to have one that interprets windows .h files directly
a lot of effort goes into creating the .inc files we use
wouldn't it be nice if the assembler used the C versions, directly   :P

johnsa

Exactly!

I love incbin, which masm doesn't have.
Apparently goasm has cinclude.. I would like to have all of these options in one.. so as to never have to create another inc file again.
Of course this does mean the assembler must support typedef, { }, C structs with pragmas etc.. # directives etc. which is going to add a lot of extra work.

dedndave

yes...
the macros would be nice
but, i think we usually find a way of making our own that perform as well or better

i was thinking primarily of enumerations
it is a c construct for which masm doesn't really have a counterpart
we have to kludge something together to make it work
i think a lot more forum members would write COM stuff if we could handle them properly
you might look into programming DirectX stuff in ASM to get ideas

BogdanOntanu

Quote from: dedndave on April 24, 2012, 11:09:57 AM
i can think of one reason to write an assembler
that would be to have one that interprets windows .h files directly
a lot of effort goes into creating the .inc files we use
wouldn't it be nice if the assembler used the C versions, directly   :P

This idea crossed my mind also. I guess that I will detach the .h parsing part from my work in progress  C compiler and attach it to Sol_Asm as a plugin ... that should be fun  :U

BogdanOntanu

Quote from: shlomok on April 24, 2012, 09:12:31 AM
Quote from: BogdanOntanu on April 23, 2012, 09:38:06 PM
I would somehow suggest VC also for real life projects but for fun your own ASM is OK.

BogdanOntanu,
If you were at the design phase of an open source project that includes a KMD driver (or a windows service) and a UI which needs to be shipped for both 32 and 64 bit windows, would you opt for asm?
I expect around 5000 lines of code for the 32 bit version, and VC 11 beta has fantastic support for driver development for both 32 and 64 bit targets.

Would the time spent on porting 32 bit asm to 64 bit make it not "profitable" in terms of the time invested compared with VC 11?

Thanks,

S.


If the project is for fun and learning I have a natural tendency to make it fully in ASM.
It does not matter if it is a KMD or a service or a user-land application. It does not matter if it is a small utility or a huge IDE like VC.

Of course, I have the luxury of having my own assembler that I can change whenever I feel like I need a new feature or option added. I am also very comfortable with ASM. Most people are not, and for them this is not such a good path. At least not until they become experienced with an HLL.

When it comes to professional programming I would advocate for C/C++ (minimum) with additional ASM DLLs, but only if needed.

There is no market for anything done in ASM anymore out there. ASM knowledge is still required and highly appreciated in the AV industry, BUT ASM development is non-existent and the developers are scarce or NULL, so to say.

Hence for fun and learning it is very much OK but for business it is almost always a NO NO.

I agree that it should not be like this ... BUT this is the real-life situation out there now, and one must understand and acknowledge it as it is, not as one wishes it would be.

To put it bluntly: I once had a big contract for doing some drivers for security purposes for a big US company. I did them a driver in ASM as a proof of concept and it worked perfectly. After a while I was "kindly" asked to convert the driver to C/C++ IF I wanted the collaboration to continue. Using ASM was already NOT a viable option back then.

This event was approximately 8 years ago (or more). Today the situation is even worse.




BogdanOntanu

Quote from: johnsa on April 24, 2012, 08:59:24 AM
I think going hand-made might actually be less work in the long run, especially for an assembler. It's not like I'm trying to recreate C# or anything :)

I did it like this and it works OK. However you must love state machines and ASM and you must keep stuff highly organized. Optimizations are always for later or they can ruin your architecture and organization.

Quote
Agreed on real life, but I have a fair amount of time on my hands and this has huge benefits for me even for work.

Wow .... this is unexpected (to have a fair amount of time on your hands, that is ...) You will need it. An assembler is a huge project for various reasons. At times I find it more complicated than an HLL compiler, a C compiler for example.

Quote
Most of my day to day work is usually stuck with .NET / Web stuff, however I have managed to train several other devs in using MASM and they're all in love with it as I knew they would be ! ;) I've built a couple of full commercial projects in Asm (one is quite large at around 50k lines), which we maintain and to be honest I actually find it more manageable than the C++ code would have been.

Again wow for having an ASM project that is actually selling... I have 3 huge ASM projects  around 300K lines each (HE RTS, Solar OS and Sol_ASM) but only HE RTS was originally designed to become commercial... and it failed for various non technical reasons (lack of finance mainly)

I know that ASM is easier to develop in and to manage than HLLs, and the opposite opinion is a myth ...  or at least the result of inexperienced ASM programmers....

But I also know that ASM has one major flaw: it is not portable across different CPUs.

For example I have to rewrite my OS for 64 bits ... if you know what I mean... and I will have to maintain both the 32-bit and 64-bit versions separately. Of course I will do it for fun, but here is where C (or other HLLs) shine above ASM ...


Quote
That being said I'd never rule VC out for other work where required, just when I have a choice I'd go asm.
If I got the asm finished, it would certainly be free.. for two reasons, one I don't think there is money in it and secondly I feel it only fair to give back to the asm community in any way possible.

There is no money ... But there is a lot of work in it ... as for community, I do not care much... I only value individuals and their skills ;)

Quote
For my OS code I've used fasm for the boot loaders (16-bit up to long mode etc) and I'm happy to keep using it for that portion. I really only need a new one for the main OS + applications to run on it.

Hmmm... you really really want to depend on FASM once you have your own assembler?

Quote
I see your point about the addressing modes etc.. It might land up making sense to fully support 32 and 64bit.

I think that sooner or later you will implement 16/32/64 handling. At least I did.

It is OK to charge into 32 bits or even 64 bits directly at first, with only minimal or no support for the other modes, and come back later to finish those as needed ... at least this is what I did

Quote
The information I have on the debug data is as follows:

http://www.fantastictimes.co.za/specs.html

That is the elusive reference that disappeared from the internet some time ago on which you can build. It's the CodeView 5 info (at the bottom). To migrate from CV5 to CV8 I found the following:

http://www.hackchina.com/en/r/48474/yasm-0.7.1-_-modules-_-dbgfmts-_-codeview-_-cv8.txt__html

Extract from the YASM info; it has some reverse-engineered notes about the CV8 format and its changes from CV5.

Then there is the pecoff_v8.docx file (on the same server/url). That covers the full COFF and 64bit OBJ file spec.
Codeview.pdf is also at the same URL.

Thanks for the info, I am aware of some of it but not all of it... although going by the names of some of the sites in your links I might never visit them  :green  Yeah, I am the kind of guy that does not follow his best interest when his best interest goes against his principles

I seek only legitimate information that is available to the general public. Of course I have the PE/COFF specs and some older public CV specs ;)

Quote
Initially I plan to code this under Windows using masm (32bit) which should be able to output BIN/COFF/COFF64+debug/my custom format (VX)+debug .. from there I'd port it to 64bit using itself.

This sounds like a very logical and correct course of action. MASM is very reliable and capable in 32 bits and should ease your path a lot.

Quote
So based on the plan to go hand-coded lexer/parser without a full EBNF, grammar etc. let me re-factor my questions

I can tell you what I did in Sol_ASM ...

Quote
1) Loading a single source file, I plan to convert each line into a linked-list structure storing other information about each line (this will also help to allow line inserts from macro parsing).

I do not. I parse each file and tokenize it and then update state machine states. The parser is re-entrant / recursive.

Hence the parser invokes the parser for macros ... very easy and elegant IMHO ;)

Quote
2) Should I pre-process ALL include statements first or should I allow conditional includes to be picked up from the start by parsing a line at a time and handling macros/conditional assembly?

I parse each line, in fact each token ... lines are not a key concept in my parser... tokens and states are the key concepts.
Some tokens are directives like INCLUDE. Because the parser is re-entrant it just continues parsing the include. I do not pre-process anything at all here.

You will need pre-processing in the MACRO body... again probably recursive :D

This gives great flexibility but it might reduce speed a little.
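
A rough sketch of what a re-entrant parser of this kind could look like, in Python for brevity. This is not Sol_Asm's code. The function names (parse, substitute), the in-memory "files" dictionary and the simplified "name MACRO params ... ENDM" syntax are all assumptions for illustration, and nested macro definitions are not handled.

```python
import re


def substitute(text, mapping):
    """Replace whole-word macro parameters with the caller's arguments."""
    return re.sub(r"\b\w+\b", lambda m: mapping.get(m.group(0), m.group(0)), text)


def parse(lines, files, macros, out):
    """Parse source lines; plain instructions are appended to 'out'.

    INCLUDE and macro invocations do not pre-process anything: they simply
    call parse() again on the new text and then continue where they left off.
    """
    it = iter(lines)
    for line in it:
        words = line.split()
        if not words:
            continue
        head = words[0].lower()
        if head == "include":
            # Re-enter the parser on the included file (a dict lookup here).
            parse(files[words[1]], files, macros, out)
        elif len(words) >= 2 and words[1].lower() == "macro":
            # Definition: remember parameter names and the raw body lines.
            params = [p.strip() for p in " ".join(words[2:]).split(",") if p.strip()]
            body = []
            for body_line in it:
                if body_line.strip().lower() == "endm":
                    break
                body.append(body_line)
            macros[head] = (params, body)
        elif head in macros:
            # Invocation: expand the body, then re-enter the parser on it.
            params, body = macros[head]
            rest = line.split(None, 1)
            args = [a.strip() for a in rest[1].split(",")] if len(rest) > 1 else []
            mapping = dict(zip(params, args))
            parse([substitute(b, mapping) for b in body], files, macros, out)
        else:
            out.append(" ".join(words))
    return out


files = {
    "main.asm": ["include defs.inc", "load2 eax, ebx", "nop"],
    "defs.inc": ["load2 MACRO r1, r2", "  mov r1, 1", "  mov r2, 2", "ENDM"],
}
result = parse(files["main.asm"], files, {}, [])
print(result)
```

Because an expanded macro body goes back through the same parse() function, a macro that invokes another macro, or contains an INCLUDE, needs no special handling. The cost is the extra recursion, which matches the remark about reduced speed.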


Quote
3) Should parsing of macro and high level stuff be a separate pass, or can/should I roll this all up into a single parser?

Multiple passes are not done because of HLL features like .IF, .WHILE, INVOKE or MACROs.

Multiple passes are done mainly because you need to resolve symbols that are going to be defined AFTER you use them, and other symbols and even encodings will depend on them, and so on... Assemblers are harder than HLL compilers because of this: you do not clearly define what things are before using them.
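
To make the forward-reference problem concrete, here is a minimal two-pass sketch in Python. The program representation (tuples such as ("jmp", "end")) and the function name assemble are hypothetical. It assumes every jmp is encoded in the 5-byte near form (E9 rel32) so that instruction sizes are known up front, which a real assembler would not assume.

```python
import struct

# Fixed sizes, under the simplifying assumption that jmp is always E9 rel32.
SIZES = {"nop": 1, "jmp": 5}


def assemble(program):
    """program is a list of ("label", name), ("jmp", target) or ("nop",)."""
    # Pass 1: emit nothing, only learn the address of every label.
    symbols, addr = {}, 0
    for item in program:
        if item[0] == "label":
            symbols[item[1]] = addr
        else:
            addr += SIZES[item[0]]

    # Pass 2: all symbols are known now, including ones defined after use.
    code = bytearray()
    for item in program:
        if item[0] == "nop":
            code.append(0x90)
        elif item[0] == "jmp":
            # rel32 is relative to the address of the NEXT instruction.
            rel = symbols[item[1]] - (len(code) + 5)
            code += b"\xE9" + struct.pack("<i", rel)
    return bytes(code), symbols


code, symbols = assemble([("jmp", "end"), ("nop",), ("label", "end"), ("nop",)])
print(code.hex(), symbols)
```

The jmp at offset 0 refers to "end", which is only defined two items later. Pass 1 is what makes that address available when the jump is finally encoded.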

Quote
- If not it kind of answers 2 by essentially pre-processing the source file into more pure/traditional rolled out ASM source ( a temp file perhaps? )

I do not do things like this. Each file is loaded into memory processed and then unloaded when parsing is done.

However, I can directly generate binary and PE32/PE64 executable files, and I do not restrict myself to generating only OBJ files and asking the linker to link them and create an executable image (although this can be done if wanted).

I also have projects that include asm files in asm files into a kind of single module made of multiple modules but only at source level. This would generate a huge ASM file if it was unified even for a brief time in memory.


Quote
4) Assuming the lexer is right.. It will be processing a line at a time and outputting a stream of tokens:

Trust me you have a lot of testing and unit testing ahead of you ;) you need some form of automation to test such a big project.

Quote
Each token structure will have attributes/meta data.. for example its type, group, numerical values, ptr to the original text, length etc.

I do not really think that one is capable of anticipating the future in design. I strongly believe in incremental improvements until it becomes perfect.

I think that the belief that you are able to design something in advance without the required experience is a big failing of the human race as a whole, but mainly of its assumed "intelligent" part ;)

A token is not THAT important in an assembler; the key is more in the state transitions that a token might generate.
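
One way to picture "tokens drive state transitions" is a small table-driven state machine. The sketch below is an assumption-laden toy in Python, not how Sol_Asm is organized: the state names, the token classes and the tiny word lists are all invented for illustration.

```python
MNEMONICS = {"mov", "add", "nop"}
DIRECTIVES = {"equ"}


def classify(word):
    """Reduce a token to the class the state machine cares about."""
    w = word.lower()
    if w in MNEMONICS:
        return "MNEMONIC"
    if w in DIRECTIVES:
        return "DIRECTIVE"
    if w == ",":
        return "COMMA"
    if w[0].isdigit():
        return "NUMBER"
    return "NAME"


# (current state, token class) -> next state
TRANSITIONS = {
    ("START", "MNEMONIC"): "AFTER_MNEMONIC",
    ("START", "NAME"): "HAVE_NAME",
    ("HAVE_NAME", "DIRECTIVE"): "WANT_OPERAND",
    ("AFTER_MNEMONIC", "NAME"): "HAVE_OPERAND",
    ("AFTER_MNEMONIC", "NUMBER"): "HAVE_OPERAND",
    ("WANT_OPERAND", "NAME"): "HAVE_OPERAND",
    ("WANT_OPERAND", "NUMBER"): "HAVE_OPERAND",
    ("HAVE_OPERAND", "COMMA"): "WANT_OPERAND",
}
# States in which a statement may legally end.
ACCEPTING = {"AFTER_MNEMONIC", "HAVE_OPERAND"}


def run(words):
    """Feed tokens through the table and return the list of visited states."""
    state = "START"
    trace = [state]
    for word in words:
        key = (state, classify(word))
        if key not in TRANSITIONS:
            raise SyntaxError(f"unexpected {word!r} in state {state}")
        state = TRANSITIONS[key]
        trace.append(state)
    if state not in ACCEPTING:
        raise SyntaxError(f"statement ended in state {state}")
    return trace


print(run(["mov", "eax", ",", "10"]))
print(run(["a", "equ", "22h"]))
```

Adding a new statement form means adding rows to the table rather than new parsing code, which is one reason the states end up mattering more than the token record itself.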

Quote
Based on that if I take several different lines of code:

a equ 22h
mov eax,10
mov eax,10+(2*3)
mov eax,10+(2*sizeof(var))
mov eax,(mytype ptr [esi+ebx*2]).member

the token streams for certain lines are inherently simplified already like "a EQU 22h"
Others where there is some form of numeric expression, or a memory address calculation which can be seen in much the same way need to be simplified first.

You will surely need a part/module that must handle parsing and calculating expressions with operators and numbers and variables and functions.

However DO NOT confuse this with ModRM address calculations like mov eax,[esi+my_struct.my_enum.my_field+edx*4]

The ModRM is a different beast with its own rules.
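
As a taste of those rules, here is a hedged Python sketch that encodes the ModRM, optional SIB and displacement bytes for a 32-bit operand of the form [base + index*scale + disp]. The function name encode_mem_operand and its parameters are hypothetical, a base register is required, and 16-bit addressing, RIP-relative addressing and REX prefixes are all left out. The special cases it does cover (ebp needing a displacement, esp needing a SIB byte, esp not being usable as an index) come from the Intel manuals.

```python
import struct

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}


def encode_mem_operand(reg, base, index=None, scale=1, disp=0):
    """Return ModRM [+ SIB] [+ disp] bytes for 'reg, [base+index*scale+disp]'."""
    if index == "esp":
        raise ValueError("esp cannot be used as an index register")

    # Choose mod and the displacement size. mod=00 with base ebp would mean
    # "disp32, no base", so [ebp] has to be emitted as [ebp+0] with a disp8.
    if disp == 0 and base != "ebp":
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, struct.pack("<b", disp)
    else:
        mod, disp_bytes = 0b10, struct.pack("<i", disp)

    reg_bits = REG32[reg] << 3
    if index is None and base != "esp":
        # Simple case: the r/m field names the base register directly.
        return bytes([mod << 6 | reg_bits | REG32[base]]) + disp_bytes

    # r/m=100 announces a SIB byte. It is required for any index register
    # and also whenever the base is esp. Index field 100 means "no index".
    index_bits = REG32[index] if index is not None else 0b100
    sib = SCALE_BITS[scale] << 6 | index_bits << 3 | REG32[base]
    return bytes([mod << 6 | reg_bits | 0b100, sib]) + disp_bytes


# mov eax,[esi+ebx*2] is opcode 8B followed by these operand bytes.
print((b"\x8b" + encode_mem_operand("eax", "esi", "ebx", 2)).hex())
```

None of this resembles ordinary expression evaluation: the expression evaluator can fold the constant part into disp, but picking mod, r/m and SIB is a separate set of encoding rules.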

Quote
This is where I'm not sure of the best (or just workable approach).

It serves no purpose to give you hints here. You must experiment and gain the needed experience. Start with small expressions and ModRM cases (you will need to learn them ALL from the Intel manuals) and build up one step at a time.

After a while... you will start to see the pattern ... or the matrix so to say :D

If you get lost you can always come back for more help ;)

Quote
I was thinking of having a generic process of converting every stream of tokens into a tree that can be evaluated and simplified upwards. Each node could pass attributes and calculated values back up the tree. In theory that could simplify expressions and deal with things like sizeof, type ptr etc.

An AST is only useful for HLL compilers, because they need it for internal optimizations and for allocating resources (registers). An assembler does not really need it IMHO.

You will have a more complicated optimization problem... the size of instructions and the avalanche of changes that it will create in the symbols... all kinds of symbols: variables, labels, procedures, code relocations, even structure members. Then the macro features are much more powerful than what standard HLLs permit.
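
The "avalanche" can be illustrated with jump sizing. The Python sketch below is one common approach (start optimistic, grow until nothing changes) and not necessarily what Sol_Asm does. The names relax and long_jumps and the tuple-based program format are made up, and only the two jmp forms EB rel8 (2 bytes) and E9 rel32 (5 bytes) are modelled.

```python
def relax(program):
    """program: list of ("label", name), ("jmp", target) or ("bytes", n).

    Every jmp starts as the 2-byte short form. A jump whose displacement
    does not fit in a signed byte is promoted to the 5-byte near form, which
    moves every later label and may push OTHER jumps out of range, so the
    layout is repeated until a whole round changes nothing.
    """
    long_jumps = set()  # indices of jumps promoted to the 5-byte form
    while True:
        # Lay out the program under the current size assumptions.
        addr, labels, starts = 0, {}, {}
        for i, item in enumerate(program):
            starts[i] = addr
            if item[0] == "label":
                labels[item[1]] = addr
            elif item[0] == "jmp":
                addr += 5 if i in long_jumps else 2
            else:
                addr += item[1]

        # Promote every short jump that no longer reaches its target.
        changed = False
        for i, item in enumerate(program):
            if item[0] == "jmp" and i not in long_jumps:
                rel = labels[item[1]] - (starts[i] + 2)
                if not -128 <= rel <= 127:
                    long_jumps.add(i)
                    changed = True
        if not changed:
            return long_jumps, labels


program = [
    ("jmp", "a"),     # fits in rel8 at first (distance 126)
    ("jmp", "b"),     # clearly too far, becomes a near jump
    ("bytes", 124),
    ("label", "a"),
    ("bytes", 200),
    ("label", "b"),
]
long_jumps, labels = relax(program)
print(sorted(long_jumps), labels)
```

In this example the first jump is in range on the first round. It only overflows because the second jump, which sits between it and its target, grew by three bytes. Jumps are only ever promoted, never demoted, so the loop is guaranteed to terminate.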

Quote
I would imagine you don't want to write translation code for EVERY single possible scenario, so you'd want to group stuff together as much as possible..

Depends on how much speed you want. However you will have many more cases than you expect. The CPU does not really care, BUT for an assembler programmer the number of instructions and encoding variations is simply huge... and many do not fit into any category or group that would organize them nicely.

Quote
ie: translation blocks for
mov r32,r32
mov r32,imm32
mov r32,mem32
mov mem32,r32

Some of these might be groupable as well...

Keyword ..."some" ... but not all or as many as expected

I group them more by INSTRUCTION TYPE: MOV, PUSH/POP/Stack, Arithmetic, Logical, JUMPS, Jcc, CALL/RET, SSE, MMX, FPU
and by argument type: REGISTER, MEMORY location, stack location, ModRM, Immediate

But still there are sub groups and sub groups and exceptions
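
A small Python sketch of grouping by instruction and operand type, using a lookup table keyed on (mnemonic, operand kind, operand kind). The table layout and the helper names (classify, enc_rm_r, enc_alu_r_imm, encode) are assumptions for illustration, only register and immediate operands are recognized, and the opcodes used are the standard ones from the Intel manuals (89 /r, 01 /r, 29 /r, B8+rd, 81 /0, 81 /5).

```python
import struct

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def classify(operand):
    """Operand kind: only registers and decimal immediates in this sketch."""
    return "r32" if operand in REG32 else "imm32"


def enc_rm_r(opcode):
    """Shared encoder for the 'op r/m32, r32' family (ModRM with mod=11)."""
    def encode(dst, src):
        return bytes([opcode, 0xC0 | REG32[src] << 3 | REG32[dst]])
    return encode


def enc_mov_r_imm(dst, imm):
    """MOV r32, imm32 is an exception: the register lives in the opcode."""
    return bytes([0xB8 + REG32[dst]]) + struct.pack("<I", int(imm) & 0xFFFFFFFF)


def enc_alu_r_imm(ext):
    """Shared encoder for '81 /ext id': the reg field selects the operation."""
    def encode(dst, imm):
        modrm = 0xC0 | ext << 3 | REG32[dst]
        return bytes([0x81, modrm]) + struct.pack("<I", int(imm) & 0xFFFFFFFF)
    return encode


ENCODERS = {
    ("mov", "r32", "r32"): enc_rm_r(0x89),
    ("add", "r32", "r32"): enc_rm_r(0x01),
    ("sub", "r32", "r32"): enc_rm_r(0x29),
    ("mov", "r32", "imm32"): enc_mov_r_imm,
    ("add", "r32", "imm32"): enc_alu_r_imm(0),
    ("sub", "r32", "imm32"): enc_alu_r_imm(5),
}


def encode(mnemonic, dst, src):
    key = (mnemonic, classify(dst), classify(src))
    if key not in ENCODERS:
        raise ValueError(f"no encoding for {key}")
    return ENCODERS[key](dst, src)


print(encode("mov", "eax", "ebx").hex(), encode("sub", "ecx", "5").hex())
```

Even in this toy, "mov r32, imm32" already needs its own encoder, and a real assembler would usually prefer the shorter 83 /ext ib form for small immediates. That is the "sub groups and exceptions" point in miniature.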

Quote
Perhaps the best way to start is to do some manual worked examples of something like MOV EAX,10+(2*4) through lexing and parsing..

Absolutely.

Practically I have started with a simple file containing only a NOP :D

There are so many parts to be done: file management, tokenizer, state machines (lexer), code generation, directives/sections management, HLL, MACROS, OBJ formats generation, IMPORTS and EXPORTS, PE BIN EXE generation...

And a lot of  TESTING  and unit testing


zemtex

I think that we need fewer assemblers. Assembly programming is such a scarce hobby that if we divide it up even more, there will be a lack of the necessary unity. It is like Jesus, fish and 5000 men. If you cut the fish into pieces, it's barely noticeable.

The assembly community should come together more than they should split apart, in order to make the community more attractive and more visible to outsiders. If everyone writes their own assembler just to "solve" a tiny problem that hasn't been solved in other assemblers, the world will be full of different assemblers. If everyone comes together around one assembler, they will all be forced to solve all problems in that one assembler.
I have been puzzling with lego bricks all my life. I know how to do this. When Peter, at age 6 is competing with me, I find it extremely neccessary to show him that I can puzzle bricks better than him, because he is so damn talented that all that is called rational has gone haywire.

dedndave

well - a few is nice to choose from   :bg

pros of masm:
the syntax is compatible with a wide selection of source code (not to mention - i am most familiar with masm syntax)
powerful macros
SSE support (varying degrees in different versions)
it's free   :P

cons of masm:
no support for certain aspects of C include files
licensing issues

if i am writing code that is not affected by the cons, i am likely to use masm
if i want to write code that is affected by the cons, i have to look for alternatives

while i like masm, i am glad there are alternatives out there to choose from