SI 413 Fall 2023 / Labs


This is the archived website of SI 413 from the Fall 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 10: Compiler Part 1

This lab is due at 2359 on Wednesday, 15 November. It should be submitted to the course SI413 as Lab10, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.

1 Starter Code

The starter code for this week's lab is basically a stripped-down solution to Lab 8, modified for the purposes of making a compiler.

This lab starts a new group of labs. This means two things. First, you need to choose a new lab partner and make sure to notate that on the shared Google sheet.

Second, you need to create a new repo for your shared work running the following command. (One lab partner should run this command first to create the shared repo, and then the other lab partner should run it next to clone the repo.)

roche@ubuntu$ 413repo compiler

If that command doesn't work, check the initial setup instructions.

You should now have a folder like si413/compiler/lab10 with the starter code for this lab, the lab10.md file you will need to fill out, and a handy script to submit your code.

You can also browse the starter code from the Gitlab page (USNA intranet only).

2 Introduction

This is a two-part lab that is honestly kind of the high point of this whole course. We have been deliberating setting you up for this exact moment all semester, so you are ready for this. It will be difficult, but it should be fun and rewarding too!

For these labs, you are going to implement a compiler for the SPL language. Specifically, you will write a program that reads in SPL source code and produces a program in the LLVM Intermediate Representation language. This is a real IR language that is used by many popular modern compilers, and it allows your program to be ultimately compiled to machine code for just about any computer that you can find today. The IR code your compiler generates can either be executed directly by using the lli-15 command, compiled to assembly using the llc-15 command, or compiled directly to an executable using clang-15.

The first segment of this lab is an overview of the LLVM tools you will use and how the LLVM IR language works. This should be somewhat review from what we have already been covering in class, but it will be a useful reference for the rest of the assignment.

The actual coding you need to do for this lab is getting most of SPL supported by your compiler, including arithmetic, reading and writing, variables, but not including function calls yet. You are kind of doing the same tasks as you did in Lab 7, except you are now writing a compiler rather than an interpreter.

Writing a compiler is a difficult challenge, since it requires you to have three programs in mind: (1) the SPL program your compiler takes as input; (2) your actual compiler which is written in Java; and (3) the LLVM IR code that your compiler is producing. You are ready for this challenge, but it will take careful, deliberate work. Help each other out (within the bounds of the course policy), give yourself plenty of time, ask your instructors for help as needed, and have fun!

3 LLVM

The LLVM Language Reference Manual is the most complete and most useful source of information on how the LLVM IR language works. You may want to bookmark that page now, as you should refer to it frequently. Of particular note is the "Instruction Reference" section, which contains details on all of the instructions in the language.

3.1 Tools

The job of your splc compiler is to convert an SPL program such as myprog.spl into an LLVM IR program such as myprog.ll. Note that LLVM IR code typically ends with a .ll extension.

There are three main tools that you can use to work with LLVM IR code. These should be installed already on the lab machines; on your virtual machine or WSL you may need to install some extra packages by running:

roche@ubuntu$ sudo apt install clang-15 llvm-15
  • clang: This is a general-purpose C compiler, but it is built on LLVM and has support for LLVM compilation as well. It can be used both to turn a C program into LLVM IR, but also to turn LLVM IR into an executable.
    roche@ubuntu$ # Produce myprog.ll from C program myprog.c
    roche@ubuntu$ clang-15 -S -emit-llvm myprog.c
    
    roche@ubuntu$ # Produce executable myprog from myprog.ll
    roche@ubuntu$ clang-15 myprog.ll -o myprog
  • lli: An LLVM IR interpreter that will run your .ll program directy. What this does under the hood is really just compile your program and then run it, but lli is a convenient shortcut to do both when you are testing out your LLVM IR code.
    roche@ubuntu$ # Run myprog.ll
    roche@ubuntu$ lli-15 myprog.ll
  • llc: The LLVM compiler which transforms your LLVM IR code to actual machine assembly code. This is the first tool used by the clang command above which fully compiles LLVM IR to an executable. (You probably don't need to use this much directly, unless you like looking at assembly code.)
    roche@ubuntu$ # Compile myprog.ll to x86 assembly myprog.s
    roche@ubuntu$ llc-15 myprog.ll

3.2 LLVM IR Types

Every register or global variable in LLVM IR code has a type. It's important to emphasize right away that these types don't necessarily have to match exactly with the types in SPL. In fact, all of the variables in SPL will be stored as 64-bit integers in your compiled code, whether they actually represent a number, a boolean value, or a lambda. This is because, at compile-time, you don't know the type of anything in SPL; therefore all SPL values must be stored in the same way (as 64-bit integers).

The most important types you will need to use in LLVM IR are:

  • i1: This is a 1-bit integer, i.e., a boolean value. This is the type returned by comparison instructions in LLVM, and it is the type that you must use for the condition of conditional branches.
  • i8: An 8-bit integer, i.e., a char. You have to use this type when calling system functions like printf which expect (arrays of) characters.
  • i32: A 32-bit integer, i.e., a int. You will actually not use this type very much in your LLVM IR, except as the return value of system calls such as printf, scanf, or main.
  • i64: A 64-bit integer, i.e., a long. This is the type you will use in LLVM to store any SPL value, whether it be a boolean, a number, or a function pointer.
  • label: An instruction label in the code. Each basic block of LLVM IR code must begin with a label like mylabel:, which corresponds to a register %mylabel with type (you guessed it) label. These are used in br (branch) instructions.
  • [N x type]: An array of N elements of type type. The most common you will see is an array of characters such as [5 x i8].
  • ptr: Any kind of pointer to anything (could be a function ponter, an array pointer, a pointer to a single int, etc.) Really, it's just a memory address.
  • type(types): This is how you write the type of a function, specifying the return type as well as the parameter types (a comma-separated list). For example, the type of main is typically i32() if you don't care about command-line arguments, or i32(i32,ptr) if you do.

There are various instructions in LLVM to convert between any of these types. Most useful are trunc, to truncate a larger integer type to a smaller one, zext, to extend a smaller integer to a larger one by padding with 0 bits, ptrtoint, to convert a pointer to an i64, and (you guessed it) inttoptr, to do the opposite.

3.3 Names and instructions

LLVM IR is an SSA (Static Single Assignment) language, so every name can only be assigned exactly once in your program. Global names are preceded with a @ and are typically used for global constants or function names, for example @main.

Within each function, you will more commonly use local register names, which are preceded with a %. Because this is an intermediate representation, these names don't really matter, and because the language is SSA, there need to be a lot of them, so it's common to use some numbering scheme like %v1, %v2, etc. You will notice that running clang produces register names tha are just numbers like %1, %2, %3, but you should not use these kind of names in your own LLVM IR because they have to follow some special (and annoying) rules.

LLVM IR is a 3AC language, and most typical instructions have this form:

%dest = instruction type %arg1, %arg2

For example, here is an instruction to subtract 3 from the 64-bit integer stored in register %x, storing the result in register %y:

%y = sub i64 %x, 3

You can think of sub i64 as the entire name of the instruction in the previous case. Some instructions like this are add, sub, mul, sdiv, and, or. The icmp instruction is similar, except there is an extra component for the type of comparison. For example, the following instruction compares two 64-bit values in registers %v1 and %v2 storing the i1 value of 1 or 0 in register %v3, depending on whether %v1 is greater than or equal to %v2. (The sge part stands for "signed greater than or equal to".)

%v3 = icmp sge i64 %v1, %v2

3.4 Labels and terminators

Instructions within each LLVM IR function are grouped into basic blocks or normal instructions. Each basic block may optionally start with a label such as

somelabel:

and each basic block must end with a special "terminator instruction", probably either br or ret.

Labels are actually stored in registers; for example the label above would be stored in a register called %somelabel.

Branch instructions (br) can be either unconditional, like

br label %somelabel

or conditional, like

br i1 %v3, label %iflabel, label %elselabel

3.5 Functions

Functions in LLVM IR are defined with their return type, name (which should always be a global name with @), and arguments. For example, here is the definition of a function that takes a 64-bit integer \(n\) and returns \(n^2\):

define i64 @sqr(i64 %n) {
  %nsq = mul i64 %n, %n
  ret i64 %nsq
}

To call a function, you use the call instruction, like:

%twentyfive = call i64 @sqr (i64 %five)

Notice that in function calls, we have to specify the function return type, the name of the function, and then the types and names of all the arguments.

3.6 Program structure

Every LLVM program (technically called a module) starts out with a specification of the target architecture (and possibly other global attributes):

target triple = "x86_64-pc-linux-gnu"

Next we should see any global variables. These will most commonly be string literals, for example like:

@msg = constant [14 x i8] c"Hello, world!\00"

Next we have function prototypes for any external functions or system calls, using the declare keyword, such as

declare i32 @puts(ptr)

After this, all that remains is function definitions. One of them should of course be called @main, like

define i32 @main() {
Initial:
  call i32 @puts(ptr @msg)
  ret i32 0
}

Putting all this together gives you a complete, working LLVM IR program.

Note that the order of function and constant definitions does not matter, which means that we don't need to have declarations (prototypes) for any locally-defined functions.

4 Starting to fill in your compiler

Now let's get down to writing your SPL compiler! Remember what you're doing here: reading in SPL code and printing out LLVM IR code. You won't actually be running the program, but printing the instructions to run the program later. Your scanner, parser, and AST generation will be the same, but everything after that, including the main() as well as the AST nodes' evalaution/execution, will work differently.

If you inspect the starter code for this lab, you'll notice a few differences from the previous interpreter labs:

  • The Interpreter and Closure classes are gone.
  • The main method is now in a new class Compiler.java, which deals with command-line arguments, does scanning, parsing, and semantic analysis (AST generation), and then calls the generateCode() which is the only function where you will be making some edits.
  • The Frame class is still there, but it has been changed to store strings instead of actual integer values (more on this later).
  • There is a new Context class, which keeps track of the things your compiler needs to know such as which register name to use next.
    (You might need to add some more functionality to this class at some point.)
  • All of the evaluate() and execute() methods in the AST classes have been removed. Instead, we have a function compile(), which so far is only implemented in a few classes.
    (You will need to fill in the compile() methods to make your compiler fully work!)
  • (Understanding this is crucial to getting started!) Remember that an Expression in SPL (or any language) is a code fragment that produces a value at run-time.
    For Expression sub-classes, the compile() method returns a String which is the name of the register where that value will be stored at run-time. Notice, this is not returning the value itself, but instead a register name. At compile-time, we don't know what the actual value will be, but we do know where it will be stored.

You can compile the code by running mvn compile and run the compiler by typing ./splc.sh.

As you have it initially, the compiler works only for literal booleans (true/false), boolean operations (and, or, not) and write statements.

To get started, look at those compile methods which are implemented in the Bool, BoolOp, NotOp, and Write classes, as well as the generateCode(), method in the Compiler class.

Try compiling a small program using only these SPL features, like

write true and not false;
Make sure you undertand where each part of the compiled LLVM output comes from in the Java code.

To actually run your compiled code, you can do this in two stages: first calling splc to compile from SPL to LLVM IR, and then calling lli-15 to actually run it!
If you have some SPL code in a file prog.spl, you can do:

roche@ubuntu$ mvn compile
roche@ubuntu$ ./splc.sh prog.spl
roche@ubuntu$ lli-15 prog.ll

This is what I recommend! When something goes wrong, look at the LLVM IR code your compiler generated in prog.ll and try to work backwards.

Exercises

  1. Get numeric literals, negation, arithmetic, comparison Get all arithmetic, comparison, and boolean operators working, as well as expression statements and blocks. (Note that expression statements and blocks won't be particularly useful or interesting yet; implementing these should be very very simple!)

    This means implementing the compile(Frame,Context) methods in all of those classes.

    Note that in some ways the numeric classes should be simpler than the boolean arithmetic that is in the starter code, since all values should be stored as i64 and you don't have to do any conversion with trunc and zext for integers.

    Here is a program which should successfully compile at this point. Note that true and false should print as 1 and 0 respectively, since you don't have run-time type information (yet).

    write 10;
    write 5 + 6;
    write 20 + (-8);
    write -7 + 4*5;
    write true;
    write 5 < 1;
    write not 1 = 2 and 4 != 5;
    20;
    { write 30; }
    { write 40; { write 50; write 60; } write 70; }
  2. Now get read expressions working.

    This is a little bit tricky! Remember, your compile() method in the Read class should not actually read anything from System.in; instead, it should be outputting LLVM IR code that can do the reading at run-time.

    You will have to call the C library function scanf to make this happen. Because scanf takes the address of the value you are reading into, you might need to use alloca, load to get it working. You basically want to do something like this, except in LLVM:

    long temp; // this means doing an alloca in LLVM
    scanf(" %ld", &temp);
    // afterwards, in LLVM you have to load here to get the value into a register

    (You also might want to print a nice read> prompt beforehand, but that's up to you. If you do, be sure to print the prompt to stderr so that it's ignored by the submit system. It might be easiest to use the write system call, remembering that the file descriptor for standard error is always 2.)

    Once you have this, then a program like this should work. Again, when you compile it, it should NOT ask for any input. But when you run the resulting LLVM code, it should prompt the user for two numbers.

    write 2 + read * 3 - read;
Submit now and check the autotesting so far! Just run ./submitme.sh from your lab directory.

5 Variables

Recall from class that variables can pose a challenge with SSA languages like LLVM IR, because each register can only be assigned at one point in the program. One solution to this is to use phi instructions, and that is certainly an option here.

But instead, we'll take an easier route: memory. The idea is that, for each declared variable, your program will reserve a space in memory (on the stack) to store the variable's value, and only the address will be stored in a register. That way, the variable can change multiple times (via load/store operations) but the address never changes, so we don't violate SSA.

You will need to use the instructions alloca, load, store to get this working. Look at the documentation and examples there for help in getting started.

Notice that the AST nodes still make use of Frame objects, but the Frame class itself has changed in two very important ways. First, instead of associating an actual value to each name in the symbol table, you associate a String which will store the name of the register holding the address of that variable. For this reason, you no longer need a rebind method in the Frame class, since the address of a variable never changes.

Rather than try the full test case shown below, think about starting very small and working your way up. For example, you could first just try to get a program to compile that declares a new variable. Then try assigning a variable and printing it out right away. Starting slowly like this and working carefully, examining the code your compiler produces for very simple cases first, will make this whole lab much easier.

Exercises

  1. Get variable declaration with new, assignment, and lookup working. Here's a program which should work now:
    new x := 10;
    write x + 3;
    new y := x * x;
    write y;
    write y + x;
    { new x := 20;
      write x; # should be 20 of course
      x := -3;
      write x; # now it's changed
    }
    write x; # should be back to 10 here
    y := 1101;
    write x+y;
Submit now and check the autotesting so far! Just run ./submitme.sh from your lab directory.

6 Control

In order to write any really interesting programs in SPL, you need support for conditionals and loops. Getting those working will be reminiscent of the assembly programming you did in IC210, and will involve careful use of labels and branch instructions.

For example, a code fragment such as

ifelse 3 < 5
  { write 100; }
  { write 200; }
write 300;

might be compiled to something like

b0:
  %v1 = icmp slt i64 3, 5
  br i1 %v1, label %b1, label %b2
b1:
  call void @write (i64 100)
  br label %b3
b2:
  call void @write (i64 200)
  br label %b3
b3:
  call void @write (i64 300)

Exercises

  1. Get if/else and while statements working. After this, a program like the following should compile and run successfully:
    new secret := 42;
    new guess := read;
    while guess != secret {
      ifelse guess < secret
      { write -1; }
      { write 1; }
      guess := read;
    }
    write secret;
Submit now and check the autotesting so far! Just run ./submitme.sh from your lab directory.

7 Move on!

This is the end of this lab, but your compiler is not done yet! You can move on to Lab 11 as soon as you are confident everything here is working perfectly.