Grades and deadlines

Getting started: files

Make a new directory for this lab.

Download the file lab1.3.yml into that directory. You will fill in and submit this file along with your code.

For your java source code, start by copying your own working solution from the previous lab, and renaming your Interp.java to Compiler.java (and fix the corresponding class name as well).

Task 1: Write your compiler

Your Compile class should have a main method that now takes two command-line arguments: the source code file, and the destination ll file. I will run it like this:

java Compiler source.prog compiled.ll

where source.prog is a program in your chosen language for this unit.

If there are no errors in the code, your computer should not write anything at all to standard out, or read anything from standard in. Instead, it should create the destination file (compiled.ll in the example above) which is a complete, runnable program in LLVM IR.

You should be able to run the resulting LLVM IR program like

lli compiled.ll

and that should work exactly the same as running source.prog through your working interpreter from last week.

(Or, if you want the full power of what you are doing, you can completely compile the source program in our made-up programming language, down to executable machine code, via

clang compiled.ll -o compiled

which could then be run over and over again as a native executable ./compiled.)

Task 2: Test it!

Be sure to thoroughly test your compiler to make sure it works in all cases. You should be testing as you go starting with the simplest possible programs in your language, and eventually getting to programs with multiple print statements and complicated expressions with multiple reversals, inputs, and concatenations.

Tips

Here are some helpful suggestions on how you can get started and make progress. Whether you take these suggestions or decide to work in a different way is fine, as long as your compiler works!

Writing your very first compiler is an exciting day in your life as a computer scientist, and you are ready for the challenge! Work carefully, follow the tips, ask for help when needed, and help each other (within the rules), and you will do great.

Initial setup: transforming your interpreter

First you need to get your new Compiler.java set up. Starting with what you had for the interpreter, modify your main method so that it takes a second command-line argument for the destination file. Go ahead and open that for writing, maybe with a PrintWriter in java.

Then go through and find the places in your interpreter where it actually runs the code. Right now your languages can’t actually do very much, so you’re just looking for the 1-2 lines where each small thing happens:

Change just the few lines that do these tasks, by commenting them out and replacing them with some exception like

throw new UnsupportedOperationException("this doesn't work yet")

Compile and test your compiler on an existing program in your language - it should of course throw that exception. Great! Now you now (some of) what you need to fill in.

Step 0: empty program

Start by getting your compiler to work on a source program that doesn’t contain any statements or code at all. (Maybe just comments, or a blank file.)

The result should be an LLVM IR program that also does nothing! Like this:

target triple = "x86_64-pc-linux-gnu"

declare i32 @puts(ptr noundef) #1
;;; probably other function declarations or definitions will eventually
;;; go up here, but they don't depend on the actual program

define i32 @main() {
  ;;; WHEN SOMETHING HAPPENS, IT WILL EVENTUALLY GO HERE
  ret i32 0
}

See if you can get this to work! What should probably happen is really in three steps:

  1. Your program dumps out the “preamble” to the destination file, which is everything up until the line where “something happens”. Everything up here is global declarations and function definitions, so it should be the same regardless of what’s in the input file.

  2. You call whatever functions or whatnot you already wrote in the interpreter to read in and process the source code. (Since we are testing on a source code with no actual code in it now, this part should actually do nothing at the moment. It should not trigger any of your UnsupportedOperationExceptions from above.

  3. When that’s all done, your program dumps out the last few lines, or the “epilogue”, returning from main and ending the program.

Now test this on some empty programs, or with just comments in them. Make sure the resulting .ll file actually works when you run it with lli - it should successfully do nothing!

Tip: bash one-liner for double compiling and running

To help with testing, I found it convenient to use bash’s && operator on the command line, like

javac *.java && java Compiler some-test.prog some-test.ll && lli some-test.ll

That one line of bash will:

  1. Compile your compiler (in Java)
  2. Run your compiler on some-test.prog, which would be a small test program in your language
  3. Run the resulting some-test.ll LLVM IR program, which should execute according to whatever some-test.prog is supposed to do

By using the && operator, bash will “short-circuit” the operation so that it stops at the first step above which fails. So for example if your java code has some typo in it, it will just show you the compiler error from javac and stop. I recommend working this way!

Doing something small: aloha

Here is a stripped-down LLVM IR program that just prints a single string literal using the C standard library puts() function:

target triple = "x86_64-pc-linux-gnu"

declare i32 @puts(ptr noundef) #1

define i32 @main() {
  call i32 @puts(ptr @lit1)
  ret i32 0
}

@lit1 = constant [6 x i8] c"aloha\00"

Now work backwards. Write a small program in your language which prints out “aloha”, and then try to get your compiler to output something like the code above.

Notice that really two things need to change:

So the questions you need to figure out are: (1) how can your compiler spit out the literal definitions at a later time, after it’s already moved past them in the source code; and (2) how can the print statement “know” the name of the argument (here it’s @lit1) it is supposed to print out?

(Hint: for (1), you are allowed to read the source code file twice in your compiler if you need to. Or you can read it once, and save the string literals into an array or something as you encounter each one…)

(Another hint: for (2), you need to essentially pass some information between the part of your code that handles string literals in the source, and the part that handles a print statement. You probably already have that from your interpreter! But now instead of passing the actual string, you want to pass the name of the register that should be used, like @lit1 in this example.)

Once this works for just “aloha”, don’t stop there. Go crazy and try a program with two print statements and two string literals. If you did things well, that should work too!

Moving on: helper functions

For whatever capability you choose next, be it stdin input, reversing a string, or concatenation, I strongly recommend writing a helper function in C which does (most of) the work for you.

We talked about this approach in class. In the context of your compiler, it will look like this:

  1. You write a C program like helpers.c with just function definitions and no main. The functions should do helpful things for you, like say taking a char* string, then allocating a new one with malloc, filling it in with the reversal of the original string, and returning the new string char* pointer.

  2. Maybe write a main in C in a separate file, so that you can test your helper functions in pure C and make sure they work perfectly. So far this is just C programming, no LLVM involved.

  3. Compile your helpers into LLVM using clang -S -emit-llvm

  4. Next change the preamble part of your compiler so that before main, it dumps out the LLVM code for your helper functions.

And then voila, you can now call these helper functions in the main for the LLVM program your compiler produces. It’s a complicated workflow, but it will really make the job easier when you get it set up.

The idea here is that the really hard work is getting your compiler to put the right commands inside the main, so we want to kind of “offload” as much as we can away from that error-prone piece.

One, two, many

When testing each new aspect that you think you have working, I recommend starting with the smallest possible program that uses that feature, like (say) a program that just reverses a single string literal and prints it out, etc. That way, when you are first getting it to work, you can look at the LLVM code your compiler produced, see what you wanted it to do vs what is actually happening, and adjust accordingly.

Then, when you get it working for one concatenation, one reversal, one print statement, etc, try doing it twice in the same program. If you wrote the code well, it should work with no more changes! But if you need to debug, the program is still small and understandable so you have a better chance at fixing it.

And then, after it works twice in simple settings, make it as complex as you think you should be able to handle and test again. Here your tests won’t really be useful as debugging directly, but will help you make certain that the current feature is working perfectly before you move on to the next one.