Summary

Context

Throughout the semester, we have traversed the pipeline from Source Code \(\rightarrow\) AST \(\rightarrow\) LLVM IR. We recently explored IR-level optimizations (Constant Propagation, DCE, Loop Unrolling). Today, we cross the final bridge: translating optimized, target-independent IR into target-specific Assembly (e.g., x86-64, ARM).

Instruction Selection (Brief Overview)

The first step in the backend is Instruction Selection. This processes the mapping of abstract LLVM IR operations to the concrete instruction set of the target architecture.

Register Allocation

Once instructions are selected, we face a resource constraint problem.

Register Allocation is the process of mapping the infinite virtual registers to the finite physical registers. We treat this as a Graph Coloring problem.

The Setup: Liveness & Interference

To determine which variables can share a physical register, we must first understand when a variable is “in use.”

The Big Picture: Iterative Coloring & Spilling

Conceptually, the register allocation process is a loop. Let \(K\) be the number of physical registers available on the target machine.

  1. Build the Interference Graph based on current Live Ranges.
  2. Attempt to \(K\)-Color the graph:
    • Assign one of \(K\) colors to each node such that no adjacent nodes share the same color.
  3. Decision:
    • Success: If the graph is \(K\)-colorable, we are done! Each color corresponds to a physical register (e.g., Color 1 = rax, Color 2 = rbx).
    • Failure: If the graph cannot be \(K\)-colored, we must Spill.
  4. Spill:
    • Pick a variable to move to memory (stack).
    • Rewrite the code to insert load and store instructions for that variable.
    • This breaks the variable’s continuous live range into tiny, disjoint intervals, drastically reducing interference.
    • Repeat: Go back to Step 1 and rebuild the graph.

The “How”: Approximating an NP-Hard Problem

In reality, determining if a graph is \(K\)-colorable is an NP-Complete problem. We cannot afford an exponential time algorithm in a compiler. Therefore, we use a heuristic (greedy) strategy known as Chaitin’s Algorithm.

  1. The Coloring Heuristic (Simplify & Select)

    We use the degree of nodes to simplify the graph:

    • Simplify: Find a node with degree \(< K\). Remove it and push it onto a stack.

      Rationale: If a node has fewer than \(K\) neighbors, we are guaranteed to find a color for it later, no matter what colors its neighbors take.

    • Select: Once the graph is shrunk down to \(\le K\) nodes, we can start assigning colors.

      For the initial \(\le K\) nodes, just pick different colors for them arbitrarily.

      Then pop nodes from the stack one by one and assign them the first available color that doesn’t conflict with their already-colored neighbors.

  2. The Spilling Heuristic

    If, during the “Simplify” phase, we have a graph with \(\gt K\) nodes remaining nodes all with degree \(\ge K\), we are “stuck.” We cannot guarantee a coloring.

    • We must optimistically pick a node to remove (spill) and hope the graph becomes colorable.

    • Heuristic: Which node do we spill? There is a greedy choice based on which register is least “dense”:

      \(\text{density} \approx \frac{\text{number of usages}}{\text{size of live range}}\)

      where “usages” refers to each time that register is accessed within its live range.

      The rough idea is to try and spill “sparse” registers which are infrequently used but “hog” an allocation over a large range.

      (In actual LLVM, the formula is more complicated to count “usages” higher when they are within loops.)

Worked Example: Loop & Liveness

Let’s look at a simple function that sums integers from v0 to v1. We want to color this with \(K=3\) physical registers.

LLVM IR Code:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  br label %loop_cond

loop_cond:
  %v2 = phi i64 [ %v0, %top ], [ %v6, %loop_body ]
  %v3 = phi i64 [ 0, %top ], [ %v5, %loop_body ]
  %v4 = icmp sle i64 %v2, %v1
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

Phi elimination

Before we can start to assign registers, we have to convert the phi instructions into fake copy instructions. I say “fake” because these copy instructions don’t actually exist in LLVM IR; this is just a temporary intermediate stage of the code.

Remember, we put the copy instructions in the blocks that come before the actual phi. Sometimes it might be necessary to insert another basic block just for the copy commands, but here it’s not necessary since both predecessor blocks have unconditional branches. This is what we get:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  %v2 = %v0 ;; FAKE copy for phi elimination
  %v3 = 0   ;; FAKE copy for phi elimination
  br label %loop_cond

loop_cond:
  %v4 = icmp sle i64 %v2, %v1
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  %v2 = %v6 ;; FAKE copy for phi elimination
  %v3 = %v5 ;; FAKE copy for phi elimination
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

As you can see, each “option” in the phi command corresponds to a “fake” copy instruction, that goes right before the unconditional branch in each of the predecessor basic blocks.

Live ranges:

Now we can evaluate the “live range” of each register. Here is the same code, annotated with the live range according to where each register is defined and where it is used:

Notice a few things here: First, a register that is used for the last time as an argument to some instruction, does not overlap with the register that is defined in that same instruction. This is the meaning of the open/closed bubbles on the end of the live ranges in our image.

For example, in this code, %v2 and %v6 do not overlap; we could (and probably should!) allocate the same physical register for both of those.

Interference Graph:

Next we build a graph with the virtual registers as the nodes, and edges whenever two live ranges overlap:

First attempt at coloring (need to spill)

Now we can attempt to color this graph with \(K=3\) colors.

We can easily remove the v0 node since it has degree 1. Similarly for v6 which has degree 2.

After removing v6, v5 ends up with degree 2 also, so that is removed next.

But after this, we are “stuck” with 4 remaining nodes (v1, v2, v3, v4) which all have degree 3.

So, we need to spill one of these registers to memory. The one with the lowest density here is v1, since it is only used once in the loop, but it has the largest live range.

Spilling v1 involves allocating memory for it at the top, and replacing the one usage of v1 with a load instruction. Note that this actually involves using another register v7 to hold the loaded value (but only for a very short window of time!). Here is the resulting code after spilling v1:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  %v1mem = alloca i64               ; space for spilled register
  store i64 %v1, ptr %v1mem         ; store value in stack memory
  %v2 = %v0 ;; FAKE copy
  %v3 = 0   ;; FAKE copy
  br label %loop_cond

loop_cond:
  %v7 = load i64, ptr %v1mem        ; load spilled value before use
  %v4 = icmp sle i64 %v2, %v7       ; note, v1 changed to v7 here
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  %v2 = %v6 ;; FAKE copy
  %v3 = %v5 ;; FAKE copy
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

Second attempt at coloring

After removing (spilling) v1 to memory, we end up with the following updated interference graph:

Compare this with the interference graph above. It has a lot fewer edges, even though there is the one extra v7 node for the new temporary loaded value.

Now let’s try again to color this with \(K=3\) colors, say Blue, Red, and Green.

The resulting colored graph looks like this:

Wrapping up

After this spilling and register assignment, we are ready to generate actual assembly code in the target language. Each of the “colors” in the graph coloring will correspond to a different physical register, and each “spill” will correspond to stack memory being used for that register.

Here is how the code above might be compiled to ARM assembly.

Some reminders about how ARM works:

OK, here is the complete ARM code with instruction selection, register spilling and allocation. It uses only registers x0 (Blue), x1 (Green), and x8 (Red). In comments on each line are the corresponding LLVM IR instructions from the code above.

        .text
        .globl  sum           // -- Begin function sum
        .type   sum,@function
sum:
        sub     sp, sp, 8     // %v1mem = alloca i64
        str     x1, [sp, #0]  // store i64 %v1(Green), ptr %v1mem
        mov     x8, xzr       // %v3(Red) = 0
        jmp     .loop_cond    // br label %loop_cond
.loop_body:
        add     x8, x8, x0    // %v5(Red) = add i64 %v3(Red), %v2(Blue)
        add     x0, x0, #1    // %v6(Blue) = add i64 %v2(Blue), 1
.loop_cond:
        ldr     x1, [sp, #0]  // %v7(Green) = load i64, ptr %v1mem
        cmp     x0, x1        // %v4 = icmp sle i64 %v2(Blue), %v7(Green)
        b.le    .loop_body    // br i1 %v4, label %loop_body, label %loop_exit
.loop_exit:
        mov     x0, x8        // ret i64 %v3(Red)
        add     sp, sp, #8    // cleanup the stack
        ret

(The very astute reader will notice that we didn’t actually use a register for %v4 at all. This is due to the way ARM works with a conditional branch; the cmp instruction in ARM sets the “condition flags” and then the b.gt conditional branch uses those flags, so an extra register is not actually needed!

In reality, LLVM will realize this during instruction selection, prior to register allocation, so v4 will not be part of the live ranges and interference graph at all. Feel free to go back and see how this might affect the spilling and register allocation for the example.)