Bare Bones Assembler

From CPUDev Wiki

This page or section is in progress. It may therefore not be complete and/or accurate.

This tutorial aims to provide the reader with instructions on how to write a simple assembler for the Bare Bones ISA. We will use C# as the programming language.

Programming Environment

We will be targeting the .NET Core environment in this tutorial, which is available here. Mono is directly compatible with this source code as well, but .NET Core is recommended. Visual Studio Code is a free, open-source editor that you can use to follow along. If you are not using an IDE such as Visual Studio Code, you can create, build and run the project from the command line like this:

$ dotnet new console

$ dotnet run

What is an Assembler?

The goal of the assembler is to take raw assembly code and turn it into machine code.

How do Assemblers Work?

To take raw source code, which is just a string, and convert it into a binary file, we need to go through the following steps:

  • Designing The Syntax
  • Parsing The Syntax
  • Designing the binary layout
  • Validating and Emitting the AST

Designing The Syntax

Syntax is how the actual code is formatted. Syntax exists because computers cannot understand human languages such as English; we therefore need to define specific patterns that allow us to parse the code correctly.

Some general pointers in designing your syntax:

  • Keep it simple and easy to memorize
  • Try not to add unnecessary literals and keywords, e.g. instead of move $eax into $ebx use mov eax, ebx
  • Do not use eccentric designs like BF code

The nature of assembly makes its syntax much easier to design than that of a higher-level language like C. Assembly usually revolves around opcodes that each express one simple action on one line, so it can use a generic syntax for every opcode, like this:

<Opcode> <space> [<Operand> <Comma>]

Technically there can be an unlimited number of operands, but for our simple instruction set we will never need more than 3. Let's look at the simple syntax that we will be using for this tutorial; the syntax is NASM style.

;this is a comment
;this program will print B to the tty device
LDI R0, char_a
LDI R1, 0x80
LDA R2, R0
LDI R3, 1
ADD R2, R2, R3
STA R1, R2
HLT
db char_a, 0x41

Note: for simplicity, the initialized value syntax (db char_a, 0x41) also follows the generic pattern, unlike NASM x86 assembly.

Parsing The Syntax

To understand parsing we first need to understand the abstract syntax tree or AST for short. The AST is basically just a structure used to store the useful values that are extracted from the source code.

Take this sample code:

LDI R3, 1

We know that the LDI is the opcode and that the R3 is the first operand, and that the 1 is the second operand.

So for this line of code we only want 3 values: the opcode, the first operand and the second operand. We do not want the space after the opcode, nor do we want the comma and space between operands one and two.

This means that if we want to parse the sample code above, we need to store the 3 values we extract from the code. That is the point of the AST. You can simply represent the AST as a struct in a language like C, or as a class in an object-oriented language like C#.
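As a rough illustration, the AST node for one line could look like the sketch below. The name AstNode and its members are hypothetical and chosen only for this example; the listing at the end of this tutorial uses its own Instruction class for the same purpose.

```csharp
using System.Collections.Generic;

// Hypothetical AST node for one line of assembly, e.g. "LDI R3, 1".
// Only the useful values are kept; spaces and commas are discarded.
public class AstNode
{
    public string Opcode { get; }
    public IReadOnlyList<string> Operands { get; }

    public AstNode(string opcode, params string[] operands)
    {
        Opcode = opcode;
        Operands = operands;
    }

    // Rebuilds a readable form of the line, handy when debugging the parser.
    public override string ToString() =>
        Operands.Count == 0 ? Opcode : Opcode + " " + string.Join(", ", Operands);
}
```

With this sketch, new AstNode("LDI", "R3", "1") holds exactly the three values discussed above, and an instruction without operands such as HLT is simply new AstNode("HLT").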

Once we have some way of storing the important bits of the code, we can move on to parsing the actual code. There are many ways to parse text, from a handwritten parser (like the one in this tutorial) to parser generators driven by an EBNF grammar, regular expressions, etc.

The basic idea of how our parser works is simple: we iterate over every char in the string until we hit a specific literal or keyword.

Looking at the sample code, we can parse it like this:

state = 0

L :- add to tmp buffer

D :- add to tmp buffer

I :- add to tmp buffer

(Space) :- state = 1; now that we have hit our first important literal, we know that the opcode part is over, so we can store the value in the tmp buffer into the AST node

R :- add to tmp buffer

3 :- add to tmp buffer

, :- state = 2; another important literal: it indicates that the first operand is over, so we can now store the tmp buffer's value in the AST node

(Space) :- the syntax dictates that we can ignore this space

1 :- add to tmp buffer

(End of line) :- state = 0; now that the line is over, we can add the value of the tmp buffer to the AST


This is the basic idea of how the parser works. Later in this tutorial we will look at actual code, which should make this clearer.
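The walkthrough above can be sketched for a single line as follows. This is a simplified illustration and not the parser used later in this tutorial: the LineParser class and its Parse method are hypothetical names, and comments and error handling are left out.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LineParser
{
    // Splits one line such as "LDI R3, 1" into { "LDI", "R3", "1" }.
    // state 0: reading the opcode; state 1 and up: reading operands.
    public static List<string> Parse(string line)
    {
        var parts = new List<string>();
        var buffer = new StringBuilder();
        var state = 0;

        foreach (var c in line + "\n") // the newline marks the end of the line
        {
            if (state == 0 && c == ' ' && buffer.Length > 0)
            {
                // first space: the opcode is complete
                parts.Add(buffer.ToString());
                buffer.Clear();
                state = 1;
            }
            else if ((state >= 1 && c == ',') || c == '\n')
            {
                // a comma or the end of the line completes the current value
                var value = buffer.ToString().Trim();
                if (value.Length > 0) parts.Add(value);
                buffer.Clear();
                state++;
            }
            else if (c != ' ' || buffer.Length > 0)
            {
                buffer.Append(c); // skip leading spaces, keep everything else
            }
        }

        return parts;
    }
}
```

Calling LineParser.Parse("LDI R3, 1") goes through the same steps as the walkthrough: the first space ends the opcode, the comma ends the first operand, and the end of the line ends the second operand.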

Designing the binary layout

We can think of the binary output file as one big byte array, because that is all it is. We need to design the layout of this array because the addresses of the initialized values have to line up with the addresses used in the code. To explain this, here is some sample code:

LDI R0, char_a
LDI R1, 0x80
LDA R2, R0
LDI R3, 1
ADD R2, R2, R3
STA R1, R2
HLT
db char_a, 0x41

If we look at the first line, we see that it references the initialized value (char_a) that is defined on the last line. If we assemble the code we get this binary:

Hex Dump of Rom:

Machine code: 0x40 0xCE 0x41 0x80 0x22 0x0 0x43 0x1 0x52 0x23 0x31 0x20 0x10 0x0

initialized value(s): 0x41

Looking at the second byte, we can see that the LDI opcode's second operand is 0xCE. The ROM starts at memory address 0xC0 and the machine code is 14 (0x0E) bytes long, so the assembler has calculated that char_a's address is 0xC0 + 0x0E = 0xCE. This exposes a fundamental problem: we first need to calculate the size of the machine code before we can work out the address of an initialized value.

We are covering this because it would make more sense to put the values first and the machine code after them. Unfortunately, our CPU starts executing at address 0xC0, so the machine code must come first, and we therefore order the binary layout as: machine code, then initialized values.
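The address calculation can be sketched as a small helper. The Layout class and its member names are hypothetical; the ROM base of 0xC0 comes from the text above, and the size of 2 bytes per instruction matches the sample binary, where 7 instructions take up 14 bytes.

```csharp
public static class Layout
{
    public const int RomBase = 0xC0;      // address where the ROM starts
    public const int InstructionSize = 2; // every instruction in the sample encodes to 2 bytes

    // Address of the initialized value with the given index (0 for the first db line).
    // Throws an OverflowException if the result does not fit into one byte.
    public static byte AddressOfValue(int instructionCount, int valueIndex)
    {
        var codeSize = instructionCount * InstructionSize;
        return checked((byte)(RomBase + codeSize + valueIndex));
    }
}
```

For the sample program, Layout.AddressOfValue(7, 0) gives 0xCE, the value that appears as the second byte of the binary.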

Validating and Emitting the AST

Validation

The task of validating code is rather simple, take this sample code:

LDI R1, 0x80

This code is valid; we know this because the ISA clearly specifies that LDI has two operands. Operand one must be a register and operand two must be a literal value. Therefore the code:

LDI R1, R2

is not valid, because operand two must be a literal and not a register. What you want to do is make sure that all the operands are of the correct type. More validation is possible and recommended, but that is sadly beyond the scope of a bare bones assembler.
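One possible way to check operand types is a table of expected operand kinds per opcode, as sketched below. The Validator class is hypothetical, and the table reflects how the listing at the end of this tutorial encodes each opcode. The sketch assumes that operands are already lowercase, that a register is written as r followed by digits, and that anything else counts as a literal (a number or the name of an initialized value).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Validator
{
    public enum Kind { Register, Literal }

    // Expected operand kinds per opcode.
    private static readonly Dictionary<string, Kind[]> Signatures = new Dictionary<string, Kind[]>
    {
        { "nop", new Kind[0] },
        { "hlt", new Kind[0] },
        { "lda", new[] { Kind.Register, Kind.Register } },
        { "sta", new[] { Kind.Register, Kind.Register } },
        { "ldi", new[] { Kind.Register, Kind.Literal } },
        { "add", new[] { Kind.Register, Kind.Register, Kind.Register } },
    };

    // Assumption: "r" followed by digits is a register, everything else is a literal.
    public static Kind Classify(string operand)
    {
        var isRegister = operand.Length > 1 && operand[0] == 'r' && operand.Skip(1).All(char.IsDigit);
        return isRegister ? Kind.Register : Kind.Literal;
    }

    public static bool IsValid(string opcode, params string[] operands)
    {
        if (!Signatures.TryGetValue(opcode, out var expected)) return false; // unknown opcode
        if (expected.Length != operands.Length) return false;                // wrong operand count
        return expected.Zip(operands, (kind, op) => Classify(op) == kind).All(ok => ok);
    }
}
```

With this table, Validator.IsValid("ldi", "r1", "0x80") is accepted, while Validator.IsValid("ldi", "r1", "r2") is rejected because the second operand is a register.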

How Can I Improve and Expand the Assembler?

Error Handling

Memory

Just a short note on error handling: you should make sure that the user is not trying to do any of the following things to memory:

  • accessing memory that does not exist
  • writing to ROM
  • reading from the tty

Note that it is not always possible to make sure the user does not violate one of these rules, but it's good practice to at least handle the obvious cases. If your device has expandable memory, just warn the user instead of throwing an error.

Adding more Opcodes

The idea behind adding opcodes that are not in the instruction set is simple: you break the new opcode down into a sequence of smaller, existing opcodes.

Here is a list of things you can experiment with adding:

  • Mov like the x86 opcode
  • More advanced initialized values, e.g. db hello_world "hello world"
  • Macros
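As an example of breaking an opcode down, a Mov could be rewritten as an LDI followed by an ADD. The sketch below is only one possible expansion: the PseudoOps class is hypothetical, and it assumes that add d, a, b stores a + b in d, as the sample program suggests.

```csharp
using System.Collections.Generic;

public static class PseudoOps
{
    // Expands "mov dest, src" into opcodes the Bare Bones ISA already has.
    // Assumption: "add d, a, b" stores a + b in d.
    public static List<string> ExpandMov(string dest, string src)
    {
        var lines = new List<string>();
        if (dest == src) return lines;           // moving a register onto itself needs no code

        lines.Add($"ldi {dest}, 0");             // dest = 0
        lines.Add($"add {dest}, {dest}, {src}"); // dest = 0 + src
        return lines;
    }
}
```

Because one Mov turns into two real instructions, an expansion like this has to run before the size of the machine code is calculated; otherwise the addresses of the initialized values no longer line up.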

Implementation

The assembler's code base consists of three major elements:

The Parser

{todo}

The Binary Writer

{todo}

The Assembler

{todo}

Putting it all Together

Program.cs

using System.IO;

namespace BareBonesAssembler
{
    class Program
    {
        static void Main(string[] args)
        {
            var asmbler = new Assembler();
            asmbler.Assemble(File.ReadAllText("test.asm"), "rom.img");
        }
    }
}


Assembler.cs

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace BareBonesAssembler
{
    public class Assembler
    {
        public class Instruction
        {
            public string Opcode { get; set; }
            public string Destination { get; set; }
            public string Source0 { get; set; }
            public string Source1 { get; set; }


            public Instruction(string opcode)
            {
                Opcode = opcode;
            }

            public Instruction(string opcode, string destination)
            {
                Opcode = opcode;
                Destination = destination;
            }

            public Instruction(string opcode, string destination, string source)
            {
                Opcode = opcode;
                Destination = destination;
                Source0 = source;
            }

            public Instruction(string opcode, string destination, string source0, string source1)
            {
                Opcode = opcode;
                Destination = destination;
                Source0 = source0;
                Source1 = source1;
            }
        }

        public class BinaryWriter
        {
            public string FileName { get; set; }

            private int _index = 0;
            private byte[] _buffer = new byte[64]; // hard coded to 64 because that's the max rom size

            public BinaryWriter(string filename)
            {
                FileName = filename;
            }

            public void FlushToFile()
            {
                File.WriteAllBytes(FileName, _buffer.ToArray());
            }

            private void AddByte(byte b)
            {
                _buffer[_index] = b;
                _index++;
            }

            private void AddShort(ushort b)
            {
                var buf = BitConverter.GetBytes(b);

                AddByte(buf[0]);
                AddByte(buf[1]);
            }

            public void Nop()
            {
                AddShort(0b0000);
            }

            public void Hlt()
            {
                AddShort(0x1 << 4);
            }

            public void Lda(byte dest, byte src)
            {
                AddByte((byte) (dest | (0x2 << 4)));
                AddByte(src);
            }

            public void Sta(byte dest, byte src)
            {
                AddByte((byte) (dest | (0x3 << 4)));
                AddByte((byte) (src<< 4));
            }

            public void Ldi(byte dest, byte src)
            {
                AddByte((byte) (dest | (0x4 << 4)));
                AddByte(src);
            }

            public void Add(byte dest, byte src, byte src1)
            {
                AddByte((byte) (dest | (0x5 << 4)));
                AddByte((byte) (src1 | (src << 4)));
            }

            public void WriteByte(byte b)
            {
                AddByte(b);
            }
            
            public void Dump()
            {
                Console.WriteLine("Hex Dump of Rom:");
                foreach (var b in _buffer)
                {
                    Console.Write("0x" + b.ToString("X") + " ");
                }
                Console.WriteLine();
            }
        }


        public void Assemble(string rawSrc, string outputFileName)
        {
            var instructions = ParseAssemblyCode(rawSrc);

            var writer = new BinaryWriter(outputFileName);

            var intructionsSize = 0;

            var initValues = new Dictionary<string, int>();


            //first we calculate the code's binary size
            foreach (var instruction in instructions)
            {
                if (instruction.Opcode == "db")
                {
                    initValues.Add(instruction.Destination, initValues.Count);
                }
                else
                {
                    intructionsSize += 2;
                }
            }

            foreach (var instruction in instructions)
            {
                switch (instruction.Opcode)
                {
                    case "nop":
                        writer.Nop();
                        break;
                    case "hlt":
                        writer.Hlt();
                        break;
                    case "lda":
                        ValidateArgument(instruction.Destination);
                        ValidateArgument(instruction.Source0);

                        writer.Lda(ParseRegistor(instruction.Destination), ParseRegistor(instruction.Source0));
                        break;
                    case "sta":
                        ValidateArgument(instruction.Destination);
                        ValidateArgument(instruction.Source0);

                        writer.Sta(ParseRegistor(instruction.Destination), ParseRegistor(instruction.Source0));
                        break;
                    case "ldi":
                        ValidateArgument(instruction.Destination);
                        ValidateArgument(instruction.Source0);

                        if (initValues.ContainsKey(instruction.Source0))
                        {
                            writer.Ldi(ParseRegistor(instruction.Destination), (byte) (initValues[instruction.Source0] + 0xC0 + intructionsSize));
                        }
                        else
                        {
                            writer.Ldi(ParseRegistor(instruction.Destination), ParseByte(instruction.Source0));
                        }
                        
                        
                        break;
                    case "add":
                        ValidateArgument(instruction.Destination);
                        ValidateArgument(instruction.Source0);
                        ValidateArgument(instruction.Source1);
                        
                        writer.Add(ParseRegistor(instruction.Destination), ParseRegistor(instruction.Source0), ParseRegistor(instruction.Source1));
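                        break;
                    case "db":
                        //initialized values are written after the machine code (see the loop below)
                        break;
                    default:
                        //an unknown opcode was counted in the code size but emits no bytes, so stop here
                        Console.WriteLine("Unknown opcode: " + instruction.Opcode);
                        Environment.Exit(1);
                        break;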
                }
            }

            foreach (var instruction in instructions)
            {
                if (instruction.Opcode == "db")
                {
                    ValidateArgument(instruction.Source0);
                    writer.WriteByte(ParseByte(instruction.Source0));
                }
            }
            
            writer.FlushToFile();
            writer.Dump();
        }

        private byte ParseRegistor(string raw)
        {
            if (!raw.StartsWith("r"))
            {
                Console.WriteLine("Invalid Register: " + raw);

                Environment.Exit(1); // a non-zero exit code signals failure
            }

            return ParseByte(raw.Remove(0, 1));
        }

        private byte ParseByte(string raw)
        {
            try
            {
                if (raw.StartsWith("0x"))
                {
                    return Convert.ToByte(raw, 16);
                }
                else
                {
                    return byte.Parse(raw);
                }
            }
            catch (Exception)
            {
                Console.WriteLine("[Parsing] Invalid Argument: " + raw);

                Environment.Exit(1); // a non-zero exit code signals failure
            }

            return 0;
        }

        private void ValidateArgument(string s)
        {
            if (string.IsNullOrEmpty(s))
            {
                Console.WriteLine("[Null] Missing Argument");

                Environment.Exit(1); // a non-zero exit code signals failure
            }
        }

        public List<Instruction> ParseAssemblyCode(string rawSrc)
        {
            //we will start by writing a real simple parser

            rawSrc = rawSrc.ToLower() +
                     "\n"; // we append a newline so the parser does not throw away the last instruction

            //the list we will return
            var re = new List<Instruction>();

            //first we remove comments which are nasm style eg. ; this is a comment
            var inComment = false;

            //we use the buffer to tmp store things
            var buffer = new StringBuilder();

            //we use the state to keep track how far into the syntax we are
            var state = 0;

            //tmp storage for values
            string tmpOpcode = "";
            string tmpDest = "";
            string tmpSrc1 = "";
            string tmpSrc2 = "";

            //keep track of the line number (the first line is line 1)
            int line = 1;

            //loop over every char in the src code
            for (int i = 0; i < rawSrc.Length; i++)
            {
                var c = rawSrc[i];

                if (c == '\n') line++; //new line track it

                if (inComment && c != '\n')
                {
                    //we are inside a comment so we can ignore this part of the code until we hit a new line
                }
                else
                {
                    //a new line ends the comment, but it is still processed below
                    //so that it also terminates an instruction on the same line
                    inComment = false;

                    if (c == ';')
                    {
                        //we are starting a comment
                        inComment = true;
                    }
                    else
                    {
                        //we can safely assume this code will not execute in a comment

                        switch (state)
                        {
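                            case 0:
                                //in this state we are expecting the start of an opcode or db
                                if (c == ' ' || c == '\n')
                                {
                                    //the buffer now holds the name of the opcode
                                    tmpOpcode = buffer.ToString()
                                        .Trim(); //the trim removes any extra spaces or tabs etc
                                    buffer.Clear();

                                    if (c == '\n')
                                    {
                                        //the line ended right after the opcode: no operands
                                        if (!string.IsNullOrEmpty(tmpOpcode))
                                        {
                                            re.Add(new Instruction(tmpOpcode));
                                        }
                                        tmpOpcode = "";
                                    }
                                    else if (!string.IsNullOrEmpty(tmpOpcode))
                                    {
                                        //a space after the opcode: operands may follow
                                        state = 1;
                                    }
                                    //otherwise it was only indentation, keep waiting for the opcode
                                }
                                else
                                {
                                    buffer.Append(c);
                                }
                                break;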
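                            case 1:
                                //in this state we are reading the first operand
                                if (c == ',')
                                {
                                    //a comma ends the first operand; more operands follow
                                    tmpDest = buffer.ToString().Trim();
                                    buffer.Clear();
                                    state = 2;
                                }
                                else if (c == '\n')
                                {
                                    //the line ended: we have zero operands or exactly one
                                    tmpDest = buffer.ToString().Trim();
                                    buffer.Clear(); //do not let leftovers leak into the next opcode

                                    if (!string.IsNullOrEmpty(tmpOpcode))
                                    {
                                        re.Add(string.IsNullOrEmpty(tmpDest)
                                            ? new Instruction(tmpOpcode)
                                            : new Instruction(tmpOpcode, tmpDest));
                                    }
                                    tmpOpcode = "";
                                    tmpDest = "";
                                    state = 0;
                                }
                                else
                                {
                                    buffer.Append(c);
                                }
                                break;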
                            case 2:
                                //in this state we are reading the second operand
                                if (c == ',')
                                {
                                    //a comma ends the second operand; a third one follows
                                    tmpSrc1 = buffer.ToString().Trim();
                                    buffer.Clear();
                                    state = 3;
                                }
                                else if (c == '\n')
                                {
                                    tmpSrc1 = buffer.ToString().Trim();
                                    buffer.Clear();

                                    //the line ended: we have a dest and one source
                                    re.Add(new Instruction(tmpOpcode, tmpDest, tmpSrc1));
                                    tmpOpcode = "";
                                    tmpDest = "";
                                    tmpSrc1 = "";

                                    state = 0;
                                }
                                else
                                {
                                    buffer.Append(c);
                                }
                                break;
                            case 3:
                                //in this state we are reading the third operand
                                if (c == ',')
                                {
                                    Console.WriteLine(
                                        $"Error: Line {line}, an instruction can have at most 3 operands");
                                    Environment.Exit(1);
                                }
                                else if (c == '\n')
                                {
                                    //the line ended: the buffer holds the third operand
                                    tmpSrc2 = buffer.ToString().Trim();
                                    buffer.Clear();

                                    re.Add(new Instruction(tmpOpcode, tmpDest, tmpSrc1, tmpSrc2));

                                    tmpOpcode = "";
                                    tmpDest = "";
                                    tmpSrc1 = "";
                                    tmpSrc2 = "";

                                    state = 0;
                                }
                                else
                                {
                                    buffer.Append(c);
                                }
                                break;
                        }
                    }
                }
            }

            //return our parsed opcodes
            return re;
        }
    }
}