Lab 11: Compiler Part 2
This lab is due at 2359 on Wednesday, 29 November. It should be submitted to the course SI413 as Lab11, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.
1 Starter Code
The starter code for this lab is... your solution to the previous lab!
Whenever you're condfident that your Lab 10 solution is working well,
just copy the entirety of your lab10/src
folder
to a new folder lab11/src
and get to work, like:
cd si413/compiler cp -R lab10/src lab11/src cd lab11
You should already have a shared git repository compiler
.
To get the starter code for today's lab, cd to your
si413/compiler
directory and run these commands:
git pull starter main
git pull
You should now have a folder like
si413/compiler/lab11
with the starter code for this lab, the
lab11.md
file you will need to fill out,
and a handy script to submit your code.
You can also browse the starter code from the Gitlab page (USNA intranet only).
2 Introduction
This lab is part two of your two-part compilers lab. You can get started as soon as you are finished with the first part and are confident that your solution is working correctly. Note, we will not release solutions to the first part, or starter code for this part. You really have to do part one first!
The first thing you need to do for this part is getting basic function calls to work. At that point, you will be able to compile many SPL programs, with the restriction that your functions won't support any non-local references.
The rest of this lab is partially up to you! We have six possible ways of making your compiler support a larger subset of the SPL language. You need to pick (at least) three of these six options and implement them. As you will discover, each option has some tricky aspects that will challenge you to think hard about how your compiler needs to work.
Completing everything here will be a significant challenge, but you can do it if you work deliberately and debug carefully from small examples. Remember that your instructors are here to help you too. Good luck!
3 Function calls
Everyone is required to get function calls working at some level. To make things easier, for this part your compiler may insist that the bodies of functions never contain any non-local references. That means that the communication to/from the function consists entirely of the argument and return values, and it makes the implementation much simpler, albeit more restricted in the SPL programs you can compile.
Getting functions to work really consists of two tasks.
First, you have to implement function definitions, as triggered
by lambda
expressions in the code. Each lambda in the SPL code will
correspond to one function in the LLVM IR code that your compiler produces.
But at the point of the lambda
itself, your compiler is in the middle
of emitting code for main()
, so it's not time to print out the
function definition yet. Instead, your compiler should just add that Lambda
node
to some list of functions, whose definitions will be output later, after main()
is over.
At the point of the lambda
itself, all your compiler needs to
do is convert the function pointer to an i64
type in LLVM, using the
ptrtoint-to instruction. For example, the very
simple (and mostly pointless) SPL program
lambda x { write 123; };
write 456;
might compile to something like
; ... header definitions etc up here ...
define i32 @main() {
%v1 = ptrtoint ptr @fun1 to i64
call void @write (i64 456)
ret i32 0
}
define i64 @fun1 (i64 %arg) {
call void @write (i64 123)
ret i64 0
}
(Note, this program is NOT correctly handling arguments and return values yet. You have to figure that out next!)
Once you have lambda
s working well, emitting code for function names
like we see above, the next step is to implement function calls. As with your
interpreter labs, I recommend starting by ignoring arguments and return values
and just get the control to go to the function and come back. Just as the
example above uses ptrtoint
to convert the function pointer to a saved
i64
value, at the function call site you will have to convert
an i64
back to a function pointer using
inttoptr-to.
Exercises
-
Get function definitions and function calls working without nonlocal references.
After this, you can compile some neat programs like this:
new f := false;
ifelse read = 2
{ f := lambda x { ret := x*x; }; }
{ f := lambda x { ret := x*x*x; }; }
new i := 1;
while i <= 10 {
write f@i;
i := i + 1;
}
Submit now and check the autotesting so far!
Just run ./submitme.sh
from your lab directory.
4 Choose your own adventure
new f := false;
ifelse read = 2
{ f := lambda x { ret := x*x; }; }
{ f := lambda x { ret := x*x*x; }; }
new i := 1;
while i <= 10 {
write f@i;
i := i + 1;
}
./submitme.sh
from your lab directory.
There are 6 more exercises in this section. For full credit on the lab, you must complete at least three of them. Of course, you are encouraged to try and implement even more than three!
Exercises
-
Implement run-time type information in your compiler.
This means that you will store - at run time - an extra value alongside
every actual SPL value, to hold the type of that value.
It is up to you how to represent the types in your LLVM IR program.
I recommend using an i8
value with something like 0 indicating unset,
1 indicating a number, 2 indicating a boolean, and 3 indicating a function.
The most visible change in your program is that the write
command in SPL should now respect the type of its argument, printing the value of
a number, or "true"/"false" for a boolean, "Function" for a function pointer,
or "UNSET" for an unset value.
In addition, you should do run-time type checking everywhere your program makes
use of a value. For example, an SPL program like
write 1 * false;
should still compile, but produce a run-time error from the failed coversion of
false
to a number.
Note 1: Your compiled program should terminate with exit code 5 when
a type error occurs at run-time. You may want to use the standard C library call
exit
so you can safely terminate from any point in the program.
Note 2: You may want to look into defining and using a
struct type in
LLVM IR, which allows you to store a (type,value) pair together in a register
and/or in memory. If you go this route, be sure to look up the LLVM instructions
extractvalue, insertvalue.
-
Implement full lexical scope with closures, allowing non-local references in
function calls. This challenging task will allow all kinds of more sophisticated
SPL programs to compile, such as:
new showx := lambda x {
ret := lambda ignored { write x; };
};
new fiver := showx@5;
new sixer := showx@6;
fiver@0;
sixer@0;
fiver@0;
fiver@0;
(Look back to Lab 8 for many more examples!)
To get this working, you will need to first of all allocate storage for your
variables on the heap rather than the stack. Because there
is no LLVM instruction to do heap allocation, you will have to do this with
system calls to malloc
.
Then, you'll need to store a closure for each function declared,
rather than just a pointer to the function itself. This closure needs to
contain this function pointer, but also the addresses of all of the variables
in scope at the point of the function definition.
Then when the function begins, you will need to unwrap this closure and load those
addresses back into registers.
This will be a good challenge, but worthwhile!
There are a few different ways to approach the implementation.
Remember, we aren't so concerned about efficiency at this point.
Ask your instructor if you run into trouble and need some help.
-
Implement the debug statement in SPL, which is any string enclosed in
double-quote characters. That string should print out when the compiled program
is executed, at that point in the program where the debug statement appears.
This is already part of the SPL scanner and parser you have, so all you need to
do is figure out how to get Debug::compile()
method to work.
After this, a program like the following should compile.
Running it should result in different messages output depending on what user
the number enters originally.
"Please enter a positive number."
new x := read;
ifelse x > 0
{ "Thank you and good job." }
{ "You will now be punished with 50 punches."
new i := 0;
while i < 50 { "punch" i := i + 1; }
}
-
Implement built-in functions in SPL like you did for
Lab 9.
You need to have (at least) these three built-ins:
square
which takes one argument for the size, and draws
a box of asterisk characters with that width and height.
For example, the SPL code
square @ 5;
should compile to LLVM IR code that, when run, displays this in the terminal:
*****
* *
* *
* *
*****
rand
, which works the same as in Lab 9,
taking a single argument n
and returning a random
number from 1 up to n (inclusive).
(Note, you will have to use some C standard library functions like
random() to make
this happen in LLVM.)
box
, which should be a two-argument built-in
(similar to note
from Lab 9) that takes a width and a height,
and draws a box of asterisk characters with that width and height.
For example, the following SPL program:
new w := 3;
new h := 3;
while h <= 5 {
write w * 1000 + h;
box @ w @ h;
w := w + 2;
h := h + 1;
}
should compile and, when run, produce output like
3003
***
* *
***
5004
*****
* *
* *
*****
7005
*******
* *
* *
* *
*******
The trick here will be to write the code for the built-in functions
in LLVM IR and emit those function definitions every time your compiler runs.
Inside your main()
you'll probably have to have instructions to store
those function pointers just like any other variable would be stored, so that function
calls to your built-in functions work just like any other function call.
For the two-argument function box
, you have to do it as two
1-argument builtins, where the first "outer" builtin returns the "inner" builtin.
Passing the outer width value to the inner builtin will be challenging
if you didn't do exercise 3 above (lexical scoping). If you don't have lexical scoping,
you can make this work by using a global variable in the LLVM IR code.
Try compiling a C program with a global long
variable to see how that
works!
-
Implement short-circuit evaluation of and
and or
expressions,
using phi expressions in LLVM.
Remember that short-circuiting means the second part of a boolean operation (and
or or
) is only evaluated if it's necessary. For example, if the first part of an
and
is false, then we don't need to bother with evaluating the seecond part,
because we know already that the entire expression is definitely false.
This would not be too difficult to do with a simple if/else construct, except that
after the blocks come back together, you need to have just a single register
that stores the result of the boolean operation. Because of the rules of SSA, you'll
find that this isn't straightforward to do.
To solve it, you will need to use a phi function in the first
basic block after the short-circuited boolean expression completes. Look at the documentation
there for guidance and ask your instructor if you need any help!
-
Right now, in order to get variables working, your compiled LLVM IR code probably does an
alloca
call every time a NewStmt is executed. That means a whole lot of memory gets
used even for simple loops like this one to compute \(n^2\) in a very stupid way:
new n := 1000;
new i := 1;
new s := 0;
while i <= n {
new j := 1;
while j <= n {
new temp := s + 1;
s := temp;
j := j + 1;
}
i := i + 1;
}
write s;
If you compile and run the program above, you will most likely get a seg fault. But
the issue is not in how you are allocating memory, just in how much memory
you are allocating! As written, the space for variable temp
gets allocated
a million itmes. That should be just enough to blow through the 8MB of stack space that Linux
gives you by default.
To fix this, the trick is to allocate the space for every variable just once, at the
beginning of the function. For example, the first thing in the main
that your compiler outputs should statement(s) to allocate memory for all variables
that are used in main. Then, each time a variable is used, you just access that memory
address which was alreay allocated. In the example above, this means that the space
for temp
is only allocated once instead of 1 million times - hooray!
Note: This optimization is incompatible with lexical scopes with closures,
unless you want to implement a full run-time garbage collector in LLVM.
So if you do exercise 3 above, don't choose this one!
Implement run-time type information in your compiler. This means that you will store - at run time - an extra value alongside every actual SPL value, to hold the type of that value.
It is up to you how to represent the types in your LLVM IR program.
I recommend using an i8
value with something like 0 indicating unset,
1 indicating a number, 2 indicating a boolean, and 3 indicating a function.
The most visible change in your program is that the write
command in SPL should now respect the type of its argument, printing the value of
a number, or "true"/"false" for a boolean, "Function" for a function pointer,
or "UNSET" for an unset value.
In addition, you should do run-time type checking everywhere your program makes
use of a value. For example, an SPL program like
write 1 * false;
should still compile, but produce a run-time error from the failed coversion of
false
to a number.
Note 1: Your compiled program should terminate with exit code 5 when
a type error occurs at run-time. You may want to use the standard C library call
exit
so you can safely terminate from any point in the program.
Note 2: You may want to look into defining and using a struct type in LLVM IR, which allows you to store a (type,value) pair together in a register and/or in memory. If you go this route, be sure to look up the LLVM instructions extractvalue, insertvalue.
Implement full lexical scope with closures, allowing non-local references in function calls. This challenging task will allow all kinds of more sophisticated SPL programs to compile, such as:
new showx := lambda x {
ret := lambda ignored { write x; };
};
new fiver := showx@5;
new sixer := showx@6;
fiver@0;
sixer@0;
fiver@0;
fiver@0;
(Look back to Lab 8 for many more examples!)
To get this working, you will need to first of all allocate storage for your
variables on the heap rather than the stack. Because there
is no LLVM instruction to do heap allocation, you will have to do this with
system calls to malloc
.
Then, you'll need to store a closure for each function declared, rather than just a pointer to the function itself. This closure needs to contain this function pointer, but also the addresses of all of the variables in scope at the point of the function definition. Then when the function begins, you will need to unwrap this closure and load those addresses back into registers.
This will be a good challenge, but worthwhile! There are a few different ways to approach the implementation. Remember, we aren't so concerned about efficiency at this point. Ask your instructor if you run into trouble and need some help.
Implement the debug statement in SPL, which is any string enclosed in double-quote characters. That string should print out when the compiled program is executed, at that point in the program where the debug statement appears.
This is already part of the SPL scanner and parser you have, so all you need to
do is figure out how to get Debug::compile()
method to work.
After this, a program like the following should compile. Running it should result in different messages output depending on what user the number enters originally.
"Please enter a positive number."
new x := read;
ifelse x > 0
{ "Thank you and good job." }
{ "You will now be punished with 50 punches."
new i := 0;
while i < 50 { "punch" i := i + 1; }
}
square
which takes one argument for the size, and draws a box of asterisk characters with that width and height.
For example, the SPL code
should compile to LLVM IR code that, when run, displays this in the terminal:square @ 5;
***** * * * * * * *****
rand
, which works the same as in Lab 9, taking a single argumentn
and returning a random number from 1 up to n (inclusive). (Note, you will have to use some C standard library functions like random() to make this happen in LLVM.)box
, which should be a two-argument built-in (similar tonote
from Lab 9) that takes a width and a height, and draws a box of asterisk characters with that width and height.
For example, the following SPL program:
should compile and, when run, produce output likenew w := 3; new h := 3; while h <= 5 { write w * 1000 + h; box @ w @ h; w := w + 2; h := h + 1; }
3003 *** * * *** 5004 ***** * * * * ***** 7005 ******* * * * * * * *******
The trick here will be to write the code for the built-in functions
in LLVM IR and emit those function definitions every time your compiler runs.
Inside your main()
you'll probably have to have instructions to store
those function pointers just like any other variable would be stored, so that function
calls to your built-in functions work just like any other function call.
For the two-argument function box
, you have to do it as two
1-argument builtins, where the first "outer" builtin returns the "inner" builtin.
Passing the outer width value to the inner builtin will be challenging
if you didn't do exercise 3 above (lexical scoping). If you don't have lexical scoping,
you can make this work by using a global variable in the LLVM IR code.
Try compiling a C program with a global long
variable to see how that
works!
Implement short-circuit evaluation of and
and or
expressions,
using phi expressions in LLVM.
Remember that short-circuiting means the second part of a boolean operation (and
or or
) is only evaluated if it's necessary. For example, if the first part of an
and
is false, then we don't need to bother with evaluating the seecond part,
because we know already that the entire expression is definitely false.
This would not be too difficult to do with a simple if/else construct, except that after the blocks come back together, you need to have just a single register that stores the result of the boolean operation. Because of the rules of SSA, you'll find that this isn't straightforward to do.
To solve it, you will need to use a phi function in the first basic block after the short-circuited boolean expression completes. Look at the documentation there for guidance and ask your instructor if you need any help!
Right now, in order to get variables working, your compiled LLVM IR code probably does an
alloca
call every time a NewStmt is executed. That means a whole lot of memory gets
used even for simple loops like this one to compute \(n^2\) in a very stupid way:
new n := 1000;
new i := 1;
new s := 0;
while i <= n {
new j := 1;
while j <= n {
new temp := s + 1;
s := temp;
j := j + 1;
}
i := i + 1;
}
write s;
If you compile and run the program above, you will most likely get a seg fault. But
the issue is not in how you are allocating memory, just in how much memory
you are allocating! As written, the space for variable temp
gets allocated
a million itmes. That should be just enough to blow through the 8MB of stack space that Linux
gives you by default.
To fix this, the trick is to allocate the space for every variable just once, at the
beginning of the function. For example, the first thing in the main
that your compiler outputs should statement(s) to allocate memory for all variables
that are used in main. Then, each time a variable is used, you just access that memory
address which was alreay allocated. In the example above, this means that the space
for temp
is only allocated once instead of 1 million times - hooray!
Note: This optimization is incompatible with lexical scopes with closures, unless you want to implement a full run-time garbage collector in LLVM. So if you do exercise 3 above, don't choose this one!