Lab 10: Compiler Part 1
This lab is due at 2359 on Wednesday, 15 November. It should be submitted to the course SI413 as Lab10, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.
1 Starter Code
The starter code for this week's lab is basically a stripped-down solution to Lab 8, modified for the purposes of making a compiler.
This lab starts a new group of labs. This means two things. First, you need to choose a new lab partner and make sure to notate that on the shared Google sheet.
Second, you need to create a new repo for your shared work running the following command. (One lab partner should run this command first to create the shared repo, and then the other lab partner should run it next to clone the repo.)
roche@ubuntu$
413repo compiler
If that command doesn't work, check the initial setup instructions.
You should now have a folder like
si413/compiler/lab10
with the starter code for this lab, the
lab10.md
file you will need to fill out,
and a handy script to submit your code.
You can also browse the starter code from the Gitlab page (USNA intranet only).
2 Introduction
This is a two-part lab that is honestly kind of the high point of this whole course. We have been deliberating setting you up for this exact moment all semester, so you are ready for this. It will be difficult, but it should be fun and rewarding too!
For these labs, you are going to implement a compiler for the SPL language.
Specifically, you will write a program that reads in SPL
source code and produces a program in the LLVM Intermediate Representation language.
This is a real IR language that is used by many popular modern compilers, and it
allows your program to be ultimately compiled to machine code for just about any
computer that you can find today.
The IR code your compiler generates
can either be executed directly by using the lli-15
command,
compiled to assembly using the llc-15
command, or compiled directly
to an executable using clang-15
.
The first segment of this lab is an overview of the LLVM tools you will use and how the LLVM IR language works. This should be somewhat review from what we have already been covering in class, but it will be a useful reference for the rest of the assignment.
The actual coding you need to do for this lab is getting most of SPL supported by your compiler, including arithmetic, reading and writing, variables, but not including function calls yet. You are kind of doing the same tasks as you did in Lab 7, except you are now writing a compiler rather than an interpreter.
Writing a compiler is a difficult challenge, since it requires you to have three programs in mind: (1) the SPL program your compiler takes as input; (2) your actual compiler which is written in Java; and (3) the LLVM IR code that your compiler is producing. You are ready for this challenge, but it will take careful, deliberate work. Help each other out (within the bounds of the course policy), give yourself plenty of time, ask your instructors for help as needed, and have fun!
3 LLVM
The LLVM Language Reference Manual is the most complete and most useful source of information on how the LLVM IR language works. You may want to bookmark that page now, as you should refer to it frequently. Of particular note is the "Instruction Reference" section, which contains details on all of the instructions in the language.
3.1 Tools
The job of your splc
compiler is to convert an SPL program such
as myprog.spl
into an LLVM IR program such as myprog.ll
.
Note that LLVM IR code typically ends with a .ll
extension.
There are three main tools that you can use to work with LLVM IR code. These should be installed already on the lab machines; on your virtual machine or WSL you may need to install some extra packages by running:
roche@ubuntu$
sudo apt install clang-15 llvm-15
- clang: This is a general-purpose C compiler, but it is built
on LLVM and has support for LLVM compilation as well. It can be used both to turn a
C program into LLVM IR, but also to turn LLVM IR into an executable.
roche@ubuntu$
# Produce myprog.ll from C program myprog.c
roche@ubuntu$
clang-15 -S -emit-llvm myprog.c
roche@ubuntu$
# Produce executable myprog from myprog.ll
roche@ubuntu$
clang-15 myprog.ll -o myprog
- lli: An LLVM IR interpreter that will run your
.ll
program directy. What this does under the hood is really just compile your program and then run it, butlli
is a convenient shortcut to do both when you are testing out your LLVM IR code.roche@ubuntu$
# Run myprog.ll
roche@ubuntu$
lli-15 myprog.ll
- llc: The LLVM compiler which transforms your LLVM IR
code to actual machine assembly code. This is the first tool used by the clang command
above which fully compiles LLVM IR to an executable.
(You probably don't need to use this much directly, unless you like looking at
assembly code.)
roche@ubuntu$
# Compile myprog.ll to x86 assembly myprog.s
roche@ubuntu$
llc-15 myprog.ll
3.2 LLVM IR Types
Every register or global variable in LLVM IR code has a type. It's important to emphasize right away that these types don't necessarily have to match exactly with the types in SPL. In fact, all of the variables in SPL will be stored as 64-bit integers in your compiled code, whether they actually represent a number, a boolean value, or a lambda. This is because, at compile-time, you don't know the type of anything in SPL; therefore all SPL values must be stored in the same way (as 64-bit integers).
The most important types you will need to use in LLVM IR are:
- i1: This is a 1-bit integer, i.e., a boolean value. This is the type returned by comparison instructions in LLVM, and it is the type that you must use for the condition of conditional branches.
- i8: An 8-bit integer, i.e., a
char
. You have to use this type when calling system functions likeprintf
which expect (arrays of) characters. - i32: A 32-bit integer, i.e., a
int
. You will actually not use this type very much in your LLVM IR, except as the return value of system calls such asprintf
,scanf
, ormain
. - i64: A 64-bit integer, i.e., a
long
. This is the type you will use in LLVM to store any SPL value, whether it be a boolean, a number, or a function pointer. - label: An instruction label in the code.
Each basic block of LLVM IR code must begin with a label
like
mylabel:
, which corresponds to a register%mylabel
with type (you guessed it)label
. These are used inbr
(branch) instructions. - [N x type]: An array of N elements of type
type. The most common you will see is an array of characters
such as
[5 x i8]
. - ptr: Any kind of pointer to anything (could be a function ponter, an array pointer, a pointer to a single int, etc.) Really, it's just a memory address.
- type(types): This is how you write the type of
a function, specifying the return type as well as the parameter types
(a comma-separated list). For example, the type of
main
is typicallyi32()
if you don't care about command-line arguments, ori32(i32,ptr)
if you do.
There are various instructions in LLVM to convert between any of these types.
Most useful are
trunc,
to truncate a larger integer type to a smaller one,
zext,
to extend a smaller integer to a larger one by padding with 0 bits,
ptrtoint,
to convert a pointer to an i64
, and (you guessed it)
inttoptr,
to do the opposite.
3.3 Names and instructions
LLVM IR is an SSA (Static Single Assignment) language, so every name can
only be assigned exactly once in your program. Global names are preceded with a
@
and are typically used for global constants or function names, for example
@main
.
Within each function, you will more commonly use local register names, which are
preceded with a %
. Because this is an intermediate representation, these
names don't really matter, and because the language is SSA, there need to be a lot of them,
so it's common to use some numbering scheme like %v1
, %v2
, etc.
You will notice that running clang
produces register names tha are just numbers
like %1
, %2
, %3
, but you should not use these
kind of names in your own LLVM IR because they have to follow some special (and annoying) rules.
LLVM IR is a 3AC language, and most typical instructions have this form:
%dest = instruction type %arg1, %arg2
For example, here is an instruction to subtract 3 from the 64-bit integer
stored in register %x
, storing the result in register %y
:
%y = sub i64 %x, 3
You can think of sub i64
as the entire name of the instruction in the previous
case. Some instructions like this are
add, sub, mul, sdiv, and, or.
The icmp instruction is similar, except there is an extra component
for the type of comparison. For example, the following instruction compares two
64-bit values in registers %v1
and %v2
storing the i1
value of 1 or 0 in register %v3
, depending on whether %v1
is
greater than or equal to %v2
.
(The sge
part stands for "signed greater than or equal to".)
%v3 = icmp sge i64 %v1, %v2
3.4 Labels and terminators
Instructions within each LLVM IR function are grouped into basic blocks or normal instructions. Each basic block may optionally start with a label such as
somelabel:
and each basic block must end with a special "terminator instruction", probably either br or ret.
Labels are actually stored in registers; for example the label above would be
stored in a register called %somelabel
.
Branch instructions (br
) can be either unconditional, like
br label %somelabel
or conditional, like
br i1 %v3, label %iflabel, label %elselabel
3.5 Functions
Functions in LLVM IR are defined with their return type,
name (which should always
be a global name with @
), and arguments. For example,
here is the definition of a function that takes a 64-bit integer
\(n\) and returns \(n^2\):
define i64 @sqr(i64 %n) {
%nsq = mul i64 %n, %n
ret i64 %nsq
}
To call a function, you use the call instruction, like:
%twentyfive = call i64 @sqr (i64 %five)
Notice that in function calls, we have to specify the function return type, the name of the function, and then the types and names of all the arguments.
3.6 Program structure
Every LLVM program (technically called a module) starts out with a specification of the target architecture (and possibly other global attributes):
target triple = "x86_64-pc-linux-gnu"
Next we should see any global variables. These will most commonly be string literals, for example like:
@msg = constant [14 x i8] c"Hello, world!\00"
Next we have function prototypes for any external functions or system calls,
using the declare
keyword, such as
declare i32 @puts(ptr)
After this, all that remains is function definitions. One of them should
of course be called @main
, like
define i32 @main() {
Initial:
call i32 @puts(ptr @msg)
ret i32 0
}
Putting all this together gives you a complete, working LLVM IR program.
Note that the order of function and constant definitions does not matter, which means that we don't need to have declarations (prototypes) for any locally-defined functions.
4 Starting to fill in your compiler
Now let's get down to writing your SPL compiler! Remember what you're doing here:
reading in SPL code and printing out LLVM IR code.
You won't actually be running the program, but printing the instructions
to run the program later. Your scanner, parser, and AST generation will be the
same, but everything after that, including the main()
as well as the
AST nodes' evalaution/execution, will work differently.
If you inspect the starter code for this lab, you'll notice a few differences from the previous interpreter labs:
- The
Interpreter
andClosure
classes are gone. - The main method is now in a new class
Compiler.java
, which deals with command-line arguments, does scanning, parsing, and semantic analysis (AST generation), and then calls thegenerateCode()
which is the only function where you will be making some edits. - The
Frame
class is still there, but it has been changed to store strings instead of actual integer values (more on this later). - There is a new
Context
class, which keeps track of the things your compiler needs to know such as which register name to use next.
(You might need to add some more functionality to this class at some point.) - All of the
evaluate()
andexecute()
methods in the AST classes have been removed. Instead, we have a functioncompile()
, which so far is only implemented in a few classes.
(You will need to fill in thecompile()
methods to make your compiler fully work!) - (Understanding this is crucial to getting started!)
Remember that an Expression in SPL (or any language) is a code fragment
that produces a value at run-time.
ForExpression
sub-classes, thecompile()
method returns a String which is the name of the register where that value will be stored at run-time. Notice, this is not returning the value itself, but instead a register name. At compile-time, we don't know what the actual value will be, but we do know where it will be stored.
You can compile the code by running mvn compile
and run
the compiler by typing ./splc.sh
.
As you have it initially, the compiler
works only for literal booleans (true/false),
boolean operations (and, or, not)
and write
statements.
To get started, look at those compile
methods
which are implemented in the Bool
, BoolOp
,
NotOp
, and Write
classes, as well as the
generateCode(),
method in the Compiler
class.
Try compiling a small program using only these SPL features, like
write true and not false;
Make sure you undertand where each part of the compiled LLVM output comes from
in the Java code.
To actually run your compiled code, you can do this in two stages: first calling splc
to compile from SPL to LLVM IR, and then calling lli-15
to actually run it!
If you have some SPL code in a file prog.spl
, you can do:
roche@ubuntu$
mvn compile
roche@ubuntu$
./splc.sh prog.spl
roche@ubuntu$
lli-15 prog.ll
This is what I recommend! When something goes wrong, look at the LLVM IR code your compiler generated
in prog.ll
and try to work backwards.
Exercises
-
Get numeric literals, negation, arithmetic, comparison
Get all arithmetic, comparison, and boolean operators working, as well
as expression statements and blocks. (Note that expression statements and
blocks won't be particularly useful or interesting yet; implementing these
should be very very simple!)
This means implementing the compile(Frame,Context)
methods in all of those
classes.
Note that in some ways the numeric classes should be simpler than the
boolean arithmetic that is in the starter code, since
all values should be stored as i64 and you don't have to do any
conversion with trunc
and zext
for integers.
Here is a program which should successfully compile at this point. Note that
true and false should print as 1 and 0 respectively, since you don't have
run-time type information (yet).
write 10;
write 5 + 6;
write 20 + (-8);
write -7 + 4*5;
write true;
write 5 < 1;
write not 1 = 2 and 4 != 5;
20;
{ write 30; }
{ write 40; { write 50; write 60; } write 70; }
-
Now get read
expressions working.
This is a little bit tricky! Remember, your compile()
method in the Read
class should not actually
read anything from System.in; instead, it should be outputting LLVM IR code
that can do the reading at run-time.
You will have to call the C library function scanf
to make this happen.
Because scanf takes the address of
the value you are reading into, you might
need to use alloca, load to get it working.
You basically
want to do something like this, except in LLVM:
long temp; // this means doing an alloca in LLVM
scanf(" %ld", &temp);
// afterwards, in LLVM you have to load here to get the value into a register
(You also might want to print a nice read>
prompt beforehand,
but that's up to you. If you do, be sure to print the prompt to stderr
so that it's ignored by the submit system. It might be easiest to use the
write
system call, remembering that the file descriptor for standard error
is always 2
.)
Once you have this, then a program like this should work. Again, when you
compile it, it should NOT ask for any input. But when you run the resulting LLVM
code, it should prompt the user for two numbers.
write 2 + read * 3 - read;
Submit now and check the autotesting so far!
Just run ./submitme.sh
from your lab directory.
5 Variables
Get numeric literals, negation, arithmetic, comparison Get all arithmetic, comparison, and boolean operators working, as well as expression statements and blocks. (Note that expression statements and blocks won't be particularly useful or interesting yet; implementing these should be very very simple!)
This means implementing the compile(Frame,Context)
methods in all of those
classes.
Note that in some ways the numeric classes should be simpler than the
boolean arithmetic that is in the starter code, since
all values should be stored as i64 and you don't have to do any
conversion with trunc
and zext
for integers.
Here is a program which should successfully compile at this point. Note that true and false should print as 1 and 0 respectively, since you don't have run-time type information (yet).
write 10;
write 5 + 6;
write 20 + (-8);
write -7 + 4*5;
write true;
write 5 < 1;
write not 1 = 2 and 4 != 5;
20;
{ write 30; }
{ write 40; { write 50; write 60; } write 70; }
Now get read
expressions working.
This is a little bit tricky! Remember, your compile()
method in the Read
class should not actually
read anything from System.in; instead, it should be outputting LLVM IR code
that can do the reading at run-time.
You will have to call the C library function scanf
to make this happen.
Because scanf takes the address of
the value you are reading into, you might
need to use alloca, load to get it working.
You basically
want to do something like this, except in LLVM:
long temp; // this means doing an alloca in LLVM
scanf(" %ld", &temp);
// afterwards, in LLVM you have to load here to get the value into a register
(You also might want to print a nice read>
prompt beforehand,
but that's up to you. If you do, be sure to print the prompt to stderr
so that it's ignored by the submit system. It might be easiest to use the
write
system call, remembering that the file descriptor for standard error
is always 2
.)
Once you have this, then a program like this should work. Again, when you compile it, it should NOT ask for any input. But when you run the resulting LLVM code, it should prompt the user for two numbers.
write 2 + read * 3 - read;
./submitme.sh
from your lab directory.
Recall from class that variables can pose a challenge with SSA languages like LLVM IR, because each register can only be assigned at one point in the program. One solution to this is to use phi instructions, and that is certainly an option here.
But instead, we'll take an easier route: memory. The idea is that, for each declared variable, your program will reserve a space in memory (on the stack) to store the variable's value, and only the address will be stored in a register. That way, the variable can change multiple times (via load/store operations) but the address never changes, so we don't violate SSA.
You will need to use the instructions alloca, load, store to get this working. Look at the documentation and examples there for help in getting started.
Notice that the AST nodes still make use of Frame
objects, but the
Frame
class itself has changed in two very important ways. First,
instead of associating an actual value to each name in the symbol table, you
associate a String
which will store the name of the register
holding the address of that variable. For this reason, you no longer need
a rebind
method in the Frame class, since the address of a variable
never changes.
Rather than try the full test case shown below, think about starting very small and working your way up. For example, you could first just try to get a program to compile that declares a new variable. Then try assigning a variable and printing it out right away. Starting slowly like this and working carefully, examining the code your compiler produces for very simple cases first, will make this whole lab much easier.
Exercises
-
Get variable declaration with
new
, assignment, and lookup working.
Here's a program which should work now:
new x := 10;
write x + 3;
new y := x * x;
write y;
write y + x;
{ new x := 20;
write x; # should be 20 of course
x := -3;
write x; # now it's changed
}
write x; # should be back to 10 here
y := 1101;
write x+y;
Submit now and check the autotesting so far!
Just run ./submitme.sh
from your lab directory.
6 Control
new
, assignment, and lookup working.
Here's a program which should work now:
new x := 10;
write x + 3;
new y := x * x;
write y;
write y + x;
{ new x := 20;
write x; # should be 20 of course
x := -3;
write x; # now it's changed
}
write x; # should be back to 10 here
y := 1101;
write x+y;
./submitme.sh
from your lab directory.
In order to write any really interesting programs in SPL, you need support for conditionals and loops. Getting those working will be reminiscent of the assembly programming you did in IC210, and will involve careful use of labels and branch instructions.
For example, a code fragment such as
ifelse 3 < 5
{ write 100; }
{ write 200; }
write 300;
might be compiled to something like
b0:
%v1 = icmp slt i64 3, 5
br i1 %v1, label %b1, label %b2
b1:
call void @write (i64 100)
br label %b3
b2:
call void @write (i64 200)
br label %b3
b3:
call void @write (i64 300)
Exercises
-
Get if/else and while statements working. After this, a program like
the following should compile and run successfully:
new secret := 42;
new guess := read;
while guess != secret {
ifelse guess < secret
{ write -1; }
{ write 1; }
guess := read;
}
write secret;
Submit now and check the autotesting so far!
Just run ./submitme.sh
from your lab directory.
7 Move on!
new secret := 42;
new guess := read;
while guess != secret {
ifelse guess < secret
{ write -1; }
{ write 1; }
guess := read;
}
write secret;
./submitme.sh
from your lab directory.
This is the end of this lab, but your compiler is not done yet! You can move on to Lab 11 as soon as you are confident everything here is working perfectly.