SI 413 Fall 2021 / Labs


This is the archived website of SI 413 from the Fall 2021 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 11: Compiler Part 1

This lab is due at 2359 on Friday, 3 December. It should be submitted to the course SI413 as Lab11, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.

1 Starter Code

The starter code for this week's lab is similar to the solution from Lab 8, but modified for the purposes of making a compiler.

This lab starts a new group of labs. These labs are for individual work (no partners).

You should create a new repo for yourself and your instructor by running the following command.

roche@ubuntu$ 413repo compiler

If that command doesn't work, check the initial setup instructions.

You should now have a folder like si413/compiler/lab11 with the starter code for this lab and a handy script to submit your code.

You can also browse the starter code from the Gitlab page (USNA intranet only).

2 Introduction

This is a two-part lab that will be the final crowning achievement of your work with the SPL language. It will be difficult, but it should be fun and rewarding too!

For these labs, you are going to implement a compiler for the SPL language. Specifically, you will write a program splc that reads in an SPL program and writes out a program in the LLVM Intermediate Representation language. This is a real IR language that is used by many popular modern compilers, and it allows your program to be ultimately compiled to machine code for just about any computer that you can find today. The IR code your splc program generates can either be executed directly by using the lli command, compiled to assembly using the llc command, or compiled directly to an executable using clang.

The first segment of this lab is an overview of the LLVM tools you will use and how the LLVM IR language works. There is nothing to turn in for this part, but it will be a useful reference for the rest of the assignment.

The actual coding you need to do for this lab is getting most of SPL supported by your compiler, including arithmetic, reading and writing, variables, but not including function calls yet.

Writing a compiler is a difficult challenge, since it requires you to have three programs in mind: (1) the SPL program your compiler takes as input; (2) your actual compiler which is written in C++; and (3) the LLVM IR code that your compiler is producing. You are ready for this challenge, but it will take careful, deliberate work. Help each other out (within the bounds of the course policy), give yourself plenty of time, ask your instructors for help as needed, and have fun!

3 LLVM

The LLVM Language Reference Manual is the most complete and most useful source of information on how the LLVM IR language works. You may want to bookmark that page now, as you should refer to it frequently. Of particular note is the "Instruction Reference" section, which contains details on all of the instructions in the language.

3.1 Tools

The job of your splc compiler is to convert an SPL program such as myprog.spl into an LLVM IR program such as myprog.ll. Note that LLVM IR code typically ends with a .ll extension.

There are three main tools that you can use to work with LLVM IR code. These should be installed already on the lab machines; on your virtual machine, you may need to install some extra packages by running:

roche@ubuntu$ sudo apt install clang llvm
  • clang: This is a general-purpose C compiler, but it is built on LLVM and has support for LLVM compilation as well. It can be used both to turn a C program into LLVM IR, but also to turn LLVM IR into an executable.
    roche@ubuntu$ # Produce myprog.ll from C program myprog.c
    roche@ubuntu$ clang myprog.c -O0 -S -emit-llvm
    
    roche@ubuntu$ # Produce executable myprog from myprog.ll
    roche@ubuntu$ clang myprog.ll -O2 -Wall -o myprog
  • lli: An LLVM IR interpreter that will run your .ll program directy. What this does under the hood is really just compile your program and then run it.
    roche@ubuntu$ # Run myprog.ll
    roche@ubuntu$ lli myprog.ll
  • llc: The LLVM compiler which transforms your LLVM IR code to actual machine assembly code. This is the first tool used by the clang command above which fully compiles LLVM IR to an executable.
    roche@ubuntu$ # Compile myprog.ll to x86 assembly myprog.s
    roche@ubuntu$ llc myprog.ll

3.2 LLVM IR Types

Every register or global variable in LLVM IR code has a type. It's important to emphasize right away that these types don't necessarily have to match exactly with the types in SPL. In fact, all of the variables in SPL will be stored as 64-bit integers in your compiled code, whether they actually represent a number, a boolean value, or a lambda. This is because, at compile-time, you don't know the type of anything in SPL; therefore all SPL values must be stored in the same way (as 64-bit integers).

The most important types you will need to use in LLVM IR are:

  • i1: This is a 1-bit integer, i.e., a boolean value. This is the type returned by comparison instructions in LLVM, and it is the type that you must use for the condition of conditional branches.
  • i8: An 8-bit integer, i.e., a char. You have to use this type when calling system functions like printf which expect (arrays of) characters.
  • i32: A 32-bit integer, i.e., a int. You will actually not use this type very much in your LLVM IR, except as the return value of system calls such as printf, scanf, or main.
  • i64: A 64-bit integer, i.e., a long. This is the type you will use in LLVM to store any SPL value, whether it be a boolean, a number, or a function pointer.
  • [N x type]: An array of N elements of type type. The most common you will see is an array of characters such as [5 x i8].
  • type*: A pointer to an object of the given type.
  • type(types): This is how you write the type of a function, specifying the return type as well as the parameter types (a comma-separated list). For example, the type of main is typically i32() if you don't care about command-line arguments, or i32(i32,i8**) if you do.

There are various instructions in LLVM to convert between any of these types. Most useful are trunc, to truncate a larger integer type to a smaller one, zext, to extend a smaller integer to a larger one by padding with 0 bits, ptrtoint, to convert a pointer to an i64, and (you guessed it) inttoptr, to do the opposite.

3.3 Names and instructions

LLVM IR is an SSA (Static Single Assignment) language, so every name can only be assigned exactly once in your program. Global names are preceded with a @ and are typically used for global constants or function names, for example @main.

Within each function, you will more commonly use local register names, which are preceded with a %. Because this is an intermediate representation, these names don't really matter, and because the language is SSA, there need to be a lot of them, so it's common to use some numbering scheme like %v1, %v2, etc. You will notice that running clang produces register names tha are just numbers like %1, %2, %3, but you should not use these kind of names in your own LLVM IR because they have to follow some special (and annoying) rules.

LLVM IR is a 3AC language, and most typical instructions have this form:

%dest = instruction type %arg1, %arg2

For example, here is an instruction to subtract 3 from the 64-bit integer stored in register %x, storing the result in register %y:

%y = sub i64 %x, 3

You can think of sub i64 as the entire name of the instruction in the previous case. Some instructions like this are add, sub, mul, sdiv, and, or. The icmp instruction is similar, except there is an extra component for the type of comparison. For example, the following instruction compares two 64-bit values in registers %v1 and %v2 storing the i1 value of 1 or 0 in register %v3, depending on whether %v1 is greater than or equal to %v2:

%v3 = icmp sge i64 %v1, %v2

3.4 Labels and terminators

Instructions within each LLVM IR function are grouped into basic blocks or normal instructions. Each basic block may optionally start with a label such as

somelabel:

and each basic block must end with a special "terminator instruction", probably either br or ret.

Labels are actually stored in registers; for example the label above would be stored in a register called %somelabel.

Branch instructions (br) can be either unconditional, like

br label %somelabel

or conditional, like

br i1 %v3, label %iflabel, label %elselabel

3.5 Functions

Functions in LLVM IR are defined with their return type, name (which should always be a global name with @), and arguments. For example, here is the definition of a function that takes a 64-bit integer \(n\) and returns \(n^2\):

define i64 @sqr(i64 %n) {
  %nsq = mul i64 %n, %n
  ret i64 %nsq
}

To call a function, you use the call instruction, like:

%twentyfive = call i64 @sqr (i64 %five)

Notice that in function calls, we have to specify the function return type, the name of the function, and then the types and names of all the arguments.

The only slight difference is that for some functions, such as printf and scanf that take a variable number of arguments, you have to specify the complete function type, not just the return value, e.g.:

call i32(i8*,...) @printf(i8* %fmtstring, i64 %number)

3.6 Program structure

Every LLVM program (technically called a module) starts out with a specification of the target architecture (and possibly other global attributes):

target triple = "x86_64-pc-linux-gnu"

Next we should see any global variables. These will most commonly be string literals, for example like:

@msg = constant [14 x i8] c"Hello, world!\00"

Next we have function prototypes for any external functions or system calls, using the declare keyword, such as

declare i32 @puts(i8*)

After this, all that remains is function definitions. One of them should of course be called @main, like

define i32 @main() {
INITIAL:
  call i32 @puts(i8* getelementptr([14 x i8], [14 x i8]* @msg, i32 0, i32 0))
  ret i32 0
}

If you find the getelementptr syntax really annoying, you can use bitcast-to instead, like this:

define i32 @main() {
INITIAL:
  %hello = bitcast [14 x i8]* @msg to i8*
  call i32 @puts(i8* %hello)
  ret i32 0
}

Putting all this together gives you a complete, working LLVM IR program.

Note that the order of function and constant definitions does not matter, which means that we don't need to have declarations (prototypes) for any locally-defined functions.

4 Starting to fill in your compiler

Now let's get down to writing your SPL compiler! Remember what you're doing here: reading in SPL code and printing out (to the resout stream) LLVM IR code. You won't actually be running the program, but printing the instructions to run the program later. Your scanner, parser, and AST generation will be the same, but your main(), as well as the eval and exec methods in the AST, will work differently.

If you inspect the starter code for this lab, you'll notice a few differences from the previos interpreter labs:

  • The main() function has been moved out of spl.ypp and into its own file in splc.cpp. So now spl.ypp is just a parser. In fact, you shouldn't need to modify the parser (spl.ypp) or scanner (spl.lpp) for this lab.
  • There is no value.hpp file, and the eval methods in AST expression nodes return a string, not a Value object. This is because the actual values are not known until run-time, so they have no place in the compiler. Instead, each eval method will return a string representing that value in LLVM IR. This will typically be a register name like "%v8 where the value of that expression has been computed.
  • Each eval and exec method takes not only a pointer to the current Frame, but also a pointer to a Context object. You can see the (very short!) definition of the Context class in splc.hpp. Right now, it just maintains a counter and implements a simple function to return the next unused register name. This will be frequently useful, since you have to come up with a new name for every instruction in LLVM IR.

If you compile and run the starter code as-is, you will get an splc program that works only for literal numbers (class Num), arithmetic operators (class ArithOp), and write statements (class Write). Looking at how those eval and exec methods are implemented will be useful to get started here.

Exercises

  1. Get all arithmetic, comparison, and boolean operators working, along with both read and write.
    Although they won't be particularly useful yet, but you may as well also implement expression statements and blocks.
    Here is a program which should successfully compile at this point. Note that true and false should print as 1 and 0 respectively, since you don't have run-time type information (yet).
    write 10;
    write 5 + 6;
    write 20 + (-8);
    write -7 + 4*5;
    write true;
    write 5 < 1;
    write not 1 = 2 and 4 != 5;
    write read * 2;
    20;
    { write 30; }
    { write 40; { write 50; write 60; } write 70; }

    Hint: Getting read to work is a little tricky! You will have to call the C library function scanf Because scanf takes the address of the value you are reading into, you might need to use alloca, load to get it working. You basically want to do something like this, except in LLVM:

    long temp; // this means doing an alloca in LLVM
    scanf(" %ld", &temp);
    // afterwards, in LLVM you have to load here to get the value into a register
    (You also might want to print a nice read> prompt beforehand, but that's up to you. If you do, be sure to print the prompt to stderr so that it's ignored by the submit system.)

    To test your code: If you just run ./splc and type in lines of SPL like above, your compiler should spit out LLVM IR instructions that you can inspect visually.
    To actually run your compiled code, you can do this in two stages: first calling splc to compile from SPL to LLVM IR, and then calling lli to actually run it!
    If you have some SPL code in a file prog.spl, you can do:

    roche@ubuntu$ make splc
    roche@ubuntu$ ./splc prog.spl prog.ll
    roche@ubuntu$ lli prog.ll
    This is what I recommend! When something goes wrong, look at the LLVM IR code your compiler generated in prog.ll and try to work backwards.
Submit now and check the autotesting so far! Just run ./submitme.sh from your lab directory.

5 Variables

Recall from class that variables can pose a challenge with SSA languages like LLVM IR, because each register can only be assigned at one point in the program. One solution to this is to use phi instructions, and that is certainly an option here.

But instead, we'll take an easier route: memory. The idea is that, for each declared variable, your program will reserve a space in memory (on the stack) to store the variable's value, and only the address will be stored in a register. That way, the variable can change multiple times (via load/store operations) but the address never changes, so we don't violate SSA.

You will need to use the instructions alloca, load, store to get this working. Look at the documentation and examples there for help in getting started.

Notice that the AST nodes still make use of Frame objects, but the Frame class itself has changed in two very important ways. First, instead of associating a Value to each name in the symbol table, you associate a string which will store the name of the register holding the address of that variable. For this reason, you no longer need a rebind method in the Frame class, since the address of a variable never changes.

Rather than try the full test case shown below, think about starting very small and working your way up. For example, you could first just try to get a program to compile that declares a new variable. Then try assigning a variable and printing it out right away. Starting slowly like this and working carefully, examining the code your compiler produces for very simple cases first, will make this whole lab much easier.

Exercises

  1. Get variable declaration with new, assignment, and lookup working. Here's a program which should work now:
    new x := 10;
    write x + 3;
    new y := x * x;
    write y;
    write y + x;
    { new x := 20;
      write x; # should be 20 of course
      x := -3;
      write x; # now it's changed
    }
    write x; # should be back to 10 here
    y := 1101;
    write x+y;

6 Control

In order to write any really interesting programs in SPL, you need support for conditionals and loops. Getting those working will be reminiscent of the assembly programming you did in IC210, and will involve careful use of labels and jumps.

For example, a code fragment such as

ifelse 3 < 5
  { write 100; }
  { write 200; }
write 300;

might be compiled to

b0:
  %v1 = icmp slt i64 3, 5
  br i1 %v1, label %b1, label %b2
b1:
  call void @write (i64 100)
  br label %b3
b2:
  call void @write (i64 200)
  br label %b3
b3:
  call void @write (i64 300)

Exercises

  1. Get if/else and while statements working. After this, a program like the following should compile and run successfully:
    new secret := 42;
    new guess := read;
    while guess != secret {
      ifelse guess < secret
      { write -1; }
      { write 1; }
      guess := read;
    }
    write secret;

7 Move on!

This is the end of this lab, but your compiler is not done yet! You can move on to Lab 12 as soon as you are confident everything here is working perfectly.