SI 413 Fall 2023 / Labs


This is the archived website of SI 413 from the Fall 2023 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 11: Compiler Part 2

This lab is due at 2359 on Wednesday, 29 November. It should be submitted to the course SI413 as Lab11, with the files named properly according to the lab instructions. See the submit page for details. You are highly encouraged to submit whatever you have done by the end of this lab time. Remember that you must thoroughly test your code; the tests that are visible before the deadline are only there to make sure you are using the right names and input/output formats.

1 Starter Code

The starter code for this lab is... your solution to the previous lab! Whenever you're condfident that your Lab 10 solution is working well, just copy the entirety of your lab10/src folder to a new folder lab11/src and get to work, like:

cd si413/compiler
cp -R lab10/src lab11/src
cd lab11

You should already have a shared git repository compiler. To get the starter code for today's lab, cd to your si413/compiler directory and run these commands:

git pull starter main
git pull

You should now have a folder like si413/compiler/lab11 with the starter code for this lab, the lab11.md file you will need to fill out, and a handy script to submit your code.

You can also browse the starter code from the Gitlab page (USNA intranet only).

2 Introduction

This lab is part two of your two-part compilers lab. You can get started as soon as you are finished with the first part and are confident that your solution is working correctly. Note, we will not release solutions to the first part, or starter code for this part. You really have to do part one first!

The first thing you need to do for this part is getting basic function calls to work. At that point, you will be able to compile many SPL programs, with the restriction that your functions won't support any non-local references.

The rest of this lab is partially up to you! We have six possible ways of making your compiler support a larger subset of the SPL language. You need to pick (at least) three of these six options and implement them. As you will discover, each option has some tricky aspects that will challenge you to think hard about how your compiler needs to work.

Completing everything here will be a significant challenge, but you can do it if you work deliberately and debug carefully from small examples. Remember that your instructors are here to help you too. Good luck!

3 Function calls

Everyone is required to get function calls working at some level. To make things easier, for this part your compiler may insist that the bodies of functions never contain any non-local references. That means that the communication to/from the function consists entirely of the argument and return values, and it makes the implementation much simpler, albeit more restricted in the SPL programs you can compile.

Getting functions to work really consists of two tasks. First, you have to implement function definitions, as triggered by lambda expressions in the code. Each lambda in the SPL code will correspond to one function in the LLVM IR code that your compiler produces. But at the point of the lambda itself, your compiler is in the middle of emitting code for main(), so it's not time to print out the function definition yet. Instead, your compiler should just add that Lambda node to some list of functions, whose definitions will be output later, after main() is over.

At the point of the lambda itself, all your compiler needs to do is convert the function pointer to an i64 type in LLVM, using the ptrtoint-to instruction. For example, the very simple (and mostly pointless) SPL program

lambda x { write 123; };
write 456;
might compile to something like
; ... header definitions etc up here ...
define i32 @main() {
    %v1 = ptrtoint ptr @fun1 to i64
    call void @write (i64 456)
    ret i32 0
}
define i64 @fun1 (i64 %arg) {
    call void @write (i64 123)
    ret i64 0
}

(Note, this program is NOT correctly handling arguments and return values yet. You have to figure that out next!)

Once you have lambdas working well, emitting code for function names like we see above, the next step is to implement function calls. As with your interpreter labs, I recommend starting by ignoring arguments and return values and just get the control to go to the function and come back. Just as the example above uses ptrtoint to convert the function pointer to a saved i64 value, at the function call site you will have to convert an i64 back to a function pointer using inttoptr-to.

Exercises

  1. Get function definitions and function calls working without nonlocal references. After this, you can compile some neat programs like this:
    new f := false;
    ifelse read = 2
      { f := lambda x { ret := x*x; }; }
      { f := lambda x { ret := x*x*x; }; }
    new i := 1;
    while i <= 10 {
      write f@i;
      i := i + 1;
    }
Submit now and check the autotesting so far! Just run ./submitme.sh from your lab directory.

4 Choose your own adventure

There are 6 more exercises in this section. For full credit on the lab, you must complete at least three of them. Of course, you are encouraged to try and implement even more than three!

Exercises

  1. Implement run-time type information in your compiler. This means that you will store - at run time - an extra value alongside every actual SPL value, to hold the type of that value.

    It is up to you how to represent the types in your LLVM IR program. I recommend using an i8 value with something like 0 indicating unset, 1 indicating a number, 2 indicating a boolean, and 3 indicating a function.

    The most visible change in your program is that the write command in SPL should now respect the type of its argument, printing the value of a number, or "true"/"false" for a boolean, "Function" for a function pointer, or "UNSET" for an unset value.

    In addition, you should do run-time type checking everywhere your program makes use of a value. For example, an SPL program like write 1 * false; should still compile, but produce a run-time error from the failed coversion of false to a number.

    Note 1: Your compiled program should terminate with exit code 5 when a type error occurs at run-time. You may want to use the standard C library call exit so you can safely terminate from any point in the program.

    Note 2: You may want to look into defining and using a struct type in LLVM IR, which allows you to store a (type,value) pair together in a register and/or in memory. If you go this route, be sure to look up the LLVM instructions extractvalue, insertvalue.

  2. Implement full lexical scope with closures, allowing non-local references in function calls. This challenging task will allow all kinds of more sophisticated SPL programs to compile, such as:

    new showx := lambda x {
      ret := lambda ignored { write x; };
    };
    new fiver := showx@5;
    new sixer := showx@6;
    fiver@0;
    sixer@0;
    fiver@0;
    fiver@0;

    (Look back to Lab 8 for many more examples!)

    To get this working, you will need to first of all allocate storage for your variables on the heap rather than the stack. Because there is no LLVM instruction to do heap allocation, you will have to do this with system calls to malloc.

    Then, you'll need to store a closure for each function declared, rather than just a pointer to the function itself. This closure needs to contain this function pointer, but also the addresses of all of the variables in scope at the point of the function definition. Then when the function begins, you will need to unwrap this closure and load those addresses back into registers.

    This will be a good challenge, but worthwhile! There are a few different ways to approach the implementation. Remember, we aren't so concerned about efficiency at this point. Ask your instructor if you run into trouble and need some help.

  3. Implement the debug statement in SPL, which is any string enclosed in double-quote characters. That string should print out when the compiled program is executed, at that point in the program where the debug statement appears.

    This is already part of the SPL scanner and parser you have, so all you need to do is figure out how to get Debug::compile() method to work.

    After this, a program like the following should compile. Running it should result in different messages output depending on what user the number enters originally.

    "Please enter a positive number."
    new x := read;
    ifelse x > 0
      { "Thank you and good job." }
      { "You will now be punished with 50 punches."
        new i := 0;
        while i < 50 { "punch" i := i + 1; }
      }
  4. Implement built-in functions in SPL like you did for Lab 9. You need to have (at least) these three built-ins:

    • square which takes one argument for the size, and draws a box of asterisk characters with that width and height.
      For example, the SPL code
      square @ 5;
      should compile to LLVM IR code that, when run, displays this in the terminal:
      *****
      *   *
      *   *
      *   *
      *****
      
    • rand, which works the same as in Lab 9, taking a single argument n and returning a random number from 1 up to n (inclusive). (Note, you will have to use some C standard library functions like random() to make this happen in LLVM.)
    • box, which should be a two-argument built-in (similar to note from Lab 9) that takes a width and a height, and draws a box of asterisk characters with that width and height.
      For example, the following SPL program:
      new w := 3;
      new h := 3;
      while h <= 5 {
        write w * 1000 + h;
        box @ w @ h;
        w := w + 2;
        h := h + 1;
      }
      should compile and, when run, produce output like
      3003
      ***
      * *
      ***
      5004
      *****
      *   *
      *   *
      *****
      7005
      *******
      *     *
      *     *
      *     *
      *******
      

    The trick here will be to write the code for the built-in functions in LLVM IR and emit those function definitions every time your compiler runs. Inside your main() you'll probably have to have instructions to store those function pointers just like any other variable would be stored, so that function calls to your built-in functions work just like any other function call.

    For the two-argument function box, you have to do it as two 1-argument builtins, where the first "outer" builtin returns the "inner" builtin. Passing the outer width value to the inner builtin will be challenging if you didn't do exercise 3 above (lexical scoping). If you don't have lexical scoping, you can make this work by using a global variable in the LLVM IR code. Try compiling a C program with a global long variable to see how that works!

  5. Implement short-circuit evaluation of and and or expressions, using phi expressions in LLVM.

    Remember that short-circuiting means the second part of a boolean operation (and or or) is only evaluated if it's necessary. For example, if the first part of an and is false, then we don't need to bother with evaluating the seecond part, because we know already that the entire expression is definitely false.

    This would not be too difficult to do with a simple if/else construct, except that after the blocks come back together, you need to have just a single register that stores the result of the boolean operation. Because of the rules of SSA, you'll find that this isn't straightforward to do.

    To solve it, you will need to use a phi function in the first basic block after the short-circuited boolean expression completes. Look at the documentation there for guidance and ask your instructor if you need any help!

  6. Right now, in order to get variables working, your compiled LLVM IR code probably does an alloca call every time a NewStmt is executed. That means a whole lot of memory gets used even for simple loops like this one to compute \(n^2\) in a very stupid way:

    new n := 1000;
    new i := 1;
    new s := 0;
    while i <= n {
      new j := 1;
      while j <= n {
        new temp := s + 1;
        s := temp;
        j := j + 1;
      }
      i := i + 1;
    }
    write s;

    If you compile and run the program above, you will most likely get a seg fault. But the issue is not in how you are allocating memory, just in how much memory you are allocating! As written, the space for variable temp gets allocated a million itmes. That should be just enough to blow through the 8MB of stack space that Linux gives you by default.

    To fix this, the trick is to allocate the space for every variable just once, at the beginning of the function. For example, the first thing in the main that your compiler outputs should statement(s) to allocate memory for all variables that are used in main. Then, each time a variable is used, you just access that memory address which was alreay allocated. In the example above, this means that the space for temp is only allocated once instead of 1 million times - hooray!

    Note: This optimization is incompatible with lexical scopes with closures, unless you want to implement a full run-time garbage collector in LLVM. So if you do exercise 3 above, don't choose this one!