Summary

Compilation pipeline

Up to this point in the semester, your compilers have focused on correctness: taking source code and compiling (“lowering”) it into valid LLVM IR. You have likely relied heavily on stack allocation (alloca) to manage variables, treating the LLVM IR almost like a direct translation of the abstract syntax tree.

In this unit, we want to dip our toes briefly into optimization, taking that initial LLVM IR code as the starting point and eventually producing machine code which is equivalent (so we don’t mess with correctness) but uses as few resources (time and space) as possible.

We can think of the compilation as happening in a pipeline like this:

  source code → front end → LLVM IR → middle end → optimized LLVM IR → back end → machine code

When someone creates a new programming language, they only need to worry about the “front end” part, which is what we have done this semester so far. The job of the compiler toolchain (LLVM in our case) is to perform multiple optimization passes at the middle-end and back-end levels, transforming the raw LLVM IR code into efficient machine code.

First Goal: Eliminate memory usage

Today we will look at a “middle-end” optimization that converts memory usage to register usage.

The “Naive” Approach (Stack)

The naive code generated by most front-ends is “safe but slow.” It treats every local variable as a memory address.

Consider this C code:

int x = 10;
int y = x + 5;
printf("%d + 5 = %d\n", x, y);

The initial code generated by a compiler will generally be something like this:

%ptr_x = alloca i32            ; Allocate memory on stack
%ptr_y = alloca i32
store i32 10, ptr %ptr_x       ; Write to memory (Expensive!)
%tmp1  = load i32, ptr %ptr_x  ; Read from memory (Expensive!)
%tmp2  = add i32 %tmp1, 5      ; Do the math
store i32 %tmp2, ptr %ptr_y    ; Write to memory again (Expensive!)
%tmp3  = load i32, ptr %ptr_x  ; Read both variables before printing (Expensive!)
%tmp4  = load i32, ptr %ptr_y  ; Read both variables before printing (Expensive!)
; finally call printf (assuming @lit1 is defined outside of main)
call i32 @printf(ptr @lit1, i32 %tmp3, i32 %tmp4)

Problem: Memory access (Load/Store) is significantly slower than register access.

You know from your computer architecture class that memory access has real latency: even a hit in the CPU cache takes multiple cycles, and a miss that goes all the way to main memory can be hundreds of times slower than direct register access.

The “Optimized” Approach

To optimize this code and eliminate unnecessary stack memory usage, we are really doing two things:

  1. Replacing each load with the value that was most recently stored to that address.
  2. Deleting the stores and allocas that have become unused as a result.

In this small example it’s easy to see what should happen:

; x is just the constant 10, don't even need to use a register for it
%y = add i32 10, 5   ; Calculate y directly
call i32 @printf(ptr @lit1, i32 10, i32 %y)

This optimization will have a huge effect in practice! We managed to reduce from 9 instructions down to just 2, and eliminated 5 expensive memory load/store operations.

Phi nodes

Remember that, because LLVM IR is an SSA (Static Single Assignment) language, it requires that every virtual register be assigned exactly once.
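For example, here is a tiny illustrative snippet showing what that rule forbids, and the renaming that satisfies it:

; NOT valid SSA: %x is assigned twice.
%x = add i32 1, 2
%x = add i32 %x, 3

; Valid SSA: each update gets a fresh register name.
%x1 = add i32 1, 2
%x2 = add i32 %x1, 3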

Unfortunately this creates some new challenges when we try to eliminate memory usage.

The Problem: Control Flow & Mutability

Real programs have branches (if, while). This creates a conflict with SSA. How do we represent a variable that changes value based on which path the code took?

Branching example:

int x = 10;
if (some_condition()) {
    x += 5;
} else {
    x = 20;
}
return x * 2; // Which x is this?

Here is how this might be naively translated to LLVM (using memory):

entry:
  %xptr = alloca i32
  store i32 10, ptr %xptr
  %cond = call i1 @some_condition()
  br i1 %cond, label %true_block, label %false_block

true_block:
  %tmp1 = load i32, ptr %xptr
  %tmp2 = add i32 %tmp1, 5
  store i32 %tmp2, ptr %xptr
  br label %merge_block

false_block:
  store i32 20, ptr %xptr
  br label %merge_block

merge_block:
  %x = load i32, ptr %xptr
  %tmp3 = mul i32 %x, 2
  ret i32 %tmp3

In a Control Flow Graph (CFG), this looks like a diamond. We have a “True Block” and a “False Block” merging into a “Merge Block.”

The SSA Paradox: at the merge block we need a single register %x that holds %tmp2 if we arrived from the true block, but 20 if we arrived from the false block. Since SSA forbids assigning %x in more than one place, no ordinary instruction can express this.

The Solution: The Phi Node (\(\phi\))

To solve this, SSA introduces a special instruction called the Phi function. It conceptually lives at the very start of a Basic Block where control flow merges.

The \(\phi\) node works like a selector: “If we arrived at this block from Path A, select Value A. If from Path B, select Value B.”

Here is the syntax in LLVM IR:

%result = phi <type> [ <value1>, <label1> ], [ <value2>, <label2> ], ...

Example Solution

Here is the previous if/else logic implemented correctly in LLVM SSA:

entry:
  ; initial alloca and store eliminated
  ; no more alloca or store needed; we are not using memory!
  %cond = call i1 @some_condition()
  br i1 %cond, label %true_block, label %false_block

true_block:
  ; load from xptr eliminated; we know it equals 10 here
  %tmp2 = add i32 10, 5
  ; store to xptr eliminated
  br label %merge_block

false_block:
  ; store to xptr eliminated
  br label %merge_block

merge_block:
  ; load replaced with phi
  %x = phi i32 [%tmp2, %true_block], [20, %false_block]
  %tmp3 = mul i32 %x, 2
  ret i32 %tmp3

Rules for phi nodes

  1. Position: Phi nodes must be the first instructions in a Basic Block (before any math or other logic).
  2. Completeness: You must list a value for every predecessor block — that is, every basic block in the control flow graph (CFG) that has an edge into this block (see the loop sketch below).
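
Here is a small sketch of both rules in action (function and label names invented for illustration): a counting loop, where one of the phi’s predecessors is the block’s own back edge.

; Sketch: sum the integers 1..n (assumes n >= 1).
define i32 @sum_to_n(i32 %n) {
entry:
  br label %loop

loop:
  ; Rule 1: both phi nodes come first in the block.
  ; Rule 2: each lists one value per predecessor (%entry and %loop).
  %i   = phi i32 [ 1, %entry ], [ %i_next, %loop ]
  %acc = phi i32 [ 0, %entry ], [ %acc_next, %loop ]
  %acc_next = add i32 %acc, %i
  %i_next   = add i32 %i, 1
  %done = icmp sgt i32 %i_next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %acc_next
}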

General strategy for replacing memory with registers

While we can do these optimizations “by hand” for small examples, it’s important to remember that a real compiler has to do this automatically, with an algorithm.

Here is one way that algorithm can be constructed:

  1. For each stack slot (alloca), find all of the stores to it and loads from it.
  2. If a load is reached by exactly one store, replace every use of the loaded register with the stored value and delete the load (the “easy case” below).
  3. If a load could be reached by stores in multiple predecessor blocks, insert a phi node at the start of the block, with one incoming value per predecessor, and replace the load with it (the “hard case” below).
  4. Once no loads remain, delete the now-dead stores and the alloca itself.

Easy case example

In the above example, there is only one memory location to eliminate: %xptr. The first load appeared in this snippet:

entry:
  store i32 10, ptr %xptr
  ; ...
  br i1 %cond, label %true_block, label %false_block

true_block:
  %tmp1 = load i32, ptr %xptr
  ; ...

This is an “easy case” for elimination: we know this load corresponds to the initial store with the constant value 10. So we just replace every use of %tmp1 with the literal 10 and remove the load operation.

Hard case example

The second load in the above example comes from this part of the code:

true_block:
  ; ...
  store i32 %tmp2, ptr %xptr
  br label %merge_block

false_block:
  store i32 20, ptr %xptr
  br label %merge_block

merge_block:
  %x = load i32, ptr %xptr
  ; ...

This is a “hard case” because the value being loaded could have been stored by either of the two predecessor blocks. So we construct a phi node to replace the load operation:

  %x = phi i32 [%tmp2, %true_block], [20, %false_block]

In this case we don’t actually get to eliminate the register (%x), but we do eliminate the memory usage — which was the original goal.

Constant propagation

There is one more kind of “easy” optimization to do at this level:

For any operation (like add, or, mul, etc.) whose operands are all constants, just perform the operation at compile time and replace all uses of the destination register with that constant.

After we eliminated the memory usage, there is one instruction in the above example which is ripe for constant propagation:

%tmp2 = add i32 10, 5

This fits the bill because both operands to the 3AC instruction (10 and 5) are constants. So the compiler can just perform that operation directly, replace all references to register %tmp2 with the constant 15, and then remove the add instruction. Here’s the resulting, further optimized example code:

entry:
  %cond = call i1 @some_condition()
  br i1 %cond, label %true_block, label %false_block

true_block:
  ; add operation removed from here
  br label %merge_block

false_block:
  br label %merge_block

merge_block:
  ; reference to %tmp2 replaced with constant 15
  %x = phi i32 [15, %true_block], [20, %false_block]
  %tmp3 = mul i32 %x, 2
  ret i32 %tmp3

(In fact, even more optimizations are possible here, such as replacing the branch and two trivial basic blocks with a single select operation. But we won’t get into those optimizations in this class.)

In general, this process of constant propagation is performed after memory-to-register optimization (see why?), and it is performed again after other optimization steps later in the pipeline.
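
To make the ordering concrete, here is the relevant add from our example before and after memory-to-register optimization:

; Before mem2reg: one operand comes from a load, so the rule
; "all operands are constants" cannot fire.
%tmp1 = load i32, ptr %xptr
%tmp2 = add i32 %tmp1, 5

; After mem2reg: both operands are literal constants,
; so the add can be folded to 15 at compile time.
%tmp2 = add i32 10, 5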

Optimization Passes in LLVM

The opt command (part of the LLVM tool suite) performs various “middle-end” optimizations at the LLVM IR code level.

Today we have seen (part of) what two “passes” in LLVM perform. You can try these yourself on example LLVM IR code! The syntax is like this:

opt --passes=XXX -S source.ll

where XXX is the name of the “optimization pass” you want. There are hundreds of possible optimization passes in LLVM (see them all by running opt -print-passes), but the two techniques we have learned about today correspond to:

  1. mem2reg: “memory to register” promotion, which eliminates allocas, loads, and stores, inserting phi nodes where needed.
  2. sccp: “sparse conditional constant propagation,” which performs (a more powerful version of) the constant propagation we saw.
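For example, to run each pass on a file, or to chain them in order:

opt --passes=mem2reg -S source.ll
opt --passes=sccp -S source.ll
opt --passes=mem2reg,sccp -S source.ll

One caveat worth knowing: if your .ll file was produced by clang at -O0, clang marks each function optnone, which blocks these passes; compiling with -Xclang -disable-O0-optnone avoids that.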

These are both actual steps the compiler takes when you ask it to optimize your C or C++ code, for example by specifying -O2.