Summary

Instruction selection
Phi node elimination
Live range analysis
Interference graph
Register allocation via Graph coloring
Spilling

Context

Throughout the semester, we have traversed the pipeline from Source Code \(\rightarrow\) AST \(\rightarrow\) LLVM IR. We recently explored IR-level optimizations (Constant Propagation, DCE, Loop Unrolling). Today, we cross the final bridge: translating optimized, target-independent IR into target-specific Assembly (e.g., x86-64, ARM).

Instruction Selection (Brief Overview)

The first step in the backend is Instruction Selection. This processes the mapping of abstract LLVM IR operations to the concrete instruction set of the target architecture.

The Challenge: There is rarely a 1-to-1 mapping. A single IR instruction might require multiple assembly instructions, or multiple IR instructions might be fused into a single specialized CPU instruction (e.g., FMA - Fused Multiply Add).
The Approach: LLVM typically uses a pattern-matching algorithm (often based on DAG covering) to find the “cheapest” sequence of assembly instructions that matches the semantics of the IR.

Register Allocation

Once instructions are selected, we face a resource constraint problem.

LLVM IR: Uses an infinite supply of virtual registers (%1, %2, %name).
Physical Hardware: Has a finite, small set of physical registers (e.g., \(K\) registers).

Register Allocation is the process of mapping the infinite virtual registers to the finite physical registers. We treat this as a Graph Coloring problem.

The Setup: Liveness & Interference

To determine which variables can share a physical register, we must first understand when a variable is “in use.”

Liveness: A variable is Live at a point in the program if it holds a value that may be needed in the future.
- The Live Range is the set of instructions where the variable is live (from definition to last use).
Interference Graph \(G=(V,E)\):
- Nodes (\(V\)): Every virtual register.
- Edges (\(E\)): Connect two nodes if their Live Ranges overlap. If they overlap, they are both alive simultaneously and cannot share a register.

The Big Picture: Iterative Coloring & Spilling

Conceptually, the register allocation process is a loop. Let \(K\) be the number of physical registers available on the target machine.

Build the Interference Graph based on current Live Ranges.
Attempt to \(K\)-Color the graph:
- Assign one of \(K\) colors to each node such that no adjacent nodes share the same color.
Decision:
- Success: If the graph is \(K\)-colorable, we are done! Each color corresponds to a physical register (e.g., Color 1 = rax, Color 2 = rbx).
- Failure: If the graph cannot be \(K\)-colored, we must Spill.
Spill:
- Pick a variable to move to memory (stack).
- Rewrite the code to insert load and store instructions for that variable.
- This breaks the variable’s continuous live range into tiny, disjoint intervals, drastically reducing interference.
- Repeat: Go back to Step 1 and rebuild the graph.

The “How”: Approximating an NP-Hard Problem

In reality, determining if a graph is \(K\)-colorable is an NP-Complete problem. We cannot afford an exponential time algorithm in a compiler. Therefore, we use a heuristic (greedy) strategy known as Chaitin’s Algorithm.

The Coloring Heuristic (Simplify & Select)

We use the degree of nodes to simplify the graph:
- Simplify: Find a node with degree \(< K\). Remove it and push it onto a stack.
  
  Rationale: If a node has fewer than \(K\) neighbors, we are guaranteed to find a color for it later, no matter what colors its neighbors take.
- Select: Once the graph is shrunk down to \(\le K\) nodes, we can start assigning colors.
  
  For the initial \(\le K\) nodes, just pick different colors for them arbitrarily.
  
  Then pop nodes from the stack one by one and assign them the first available color that doesn’t conflict with their already-colored neighbors.
The Spilling Heuristic

If, during the “Simplify” phase, we have a graph with \(\gt K\) nodes remaining nodes all with degree \(\ge K\), we are “stuck.” We cannot guarantee a coloring.
- We must optimistically pick a node to remove (spill) and hope the graph becomes colorable.
- Heuristic: Which node do we spill? There is a greedy choice based on which register is least “dense”:
  
  \(\text{density} \approx \frac{\text{number of usages}}{\text{size of live range}}\)
  
  where “usages” refers to each time that register is accessed within its live range.
  
  The rough idea is to try and spill “sparse” registers which are infrequently used but “hog” an allocation over a large range.
  
  (In actual LLVM, the formula is more complicated to count “usages” higher when they are within loops.)

Worked Example: Loop & Liveness

Let’s look at a simple function that sums integers from v0 to v1. We want to color this with \(K=3\) physical registers.

LLVM IR Code:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  br label %loop_cond

loop_cond:
  %v2 = phi i64 [ %v0, %top ], [ %v6, %loop_body ]
  %v3 = phi i64 [ 0, %top ], [ %v5, %loop_body ]
  %v4 = icmp sle i64 %v2, %v1
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

Phi elimination

Before we can start to assign registers, we have to convert the phi instructions into fake copy instructions. I say “fake” because these copy instructions don’t actually exist in LLVM IR; this is just a temporary intermediate stage of the code.

Remember, we put the copy instructions in the blocks that come before the actual phi. Sometimes it might be necessary to insert another basic block just for the copy commands, but here it’s not necessary since both predecessor blocks have unconditional branches. This is what we get:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  %v2 = %v0 ;; FAKE copy for phi elimination
  %v3 = 0   ;; FAKE copy for phi elimination
  br label %loop_cond

loop_cond:
  %v4 = icmp sle i64 %v2, %v1
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  %v2 = %v6 ;; FAKE copy for phi elimination
  %v3 = %v5 ;; FAKE copy for phi elimination
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

As you can see, each “option” in the phi command corresponds to a “fake” copy instruction, that goes right before the unconditional branch in each of the predecessor basic blocks.

Live ranges:

Now we can evaluate the “live range” of each register. Here is the same code, annotated with the live range according to where each register is defined and where it is used:

Notice a few things here: First, a register that is used for the last time as an argument to some instruction, does not overlap with the register that is defined in that same instruction. This is the meaning of the open/closed bubbles on the end of the live ranges in our image.

For example, in this code, %v2 and %v6 do not overlap; we could (and probably should!) allocate the same physical register for both of those.

Interference Graph:

Next we build a graph with the virtual registers as the nodes, and edges whenever two live ranges overlap:

First attempt at coloring (need to spill)

Now we can attempt to color this graph with \(K=3\) colors.

We can easily remove the v0 node since it has degree 1. Similarly for v6 which has degree 2.

After removing v6, v5 ends up with degree 2 also, so that is removed next.

But after this, we are “stuck” with 4 remaining nodes (v1, v2, v3, v4) which all have degree 3.

So, we need to spill one of these registers to memory. The one with the lowest density here is v1, since it is only used once in the loop, but it has the largest live range.

Spilling v1 involves allocating memory for it at the top, and replacing the one usage of v1 with a load instruction. Note that this actually involves using another register v7 to hold the loaded value (but only for a very short window of time!). Here is the resulting code after spilling v1:

define i64 @sum(i64 %v0, i64 %v1) {
top:
  %v1mem = alloca i64               ; space for spilled register
  store i64 %v1, ptr %v1mem         ; store value in stack memory
  %v2 = %v0 ;; FAKE copy
  %v3 = 0   ;; FAKE copy
  br label %loop_cond

loop_cond:
  %v7 = load i64, ptr %v1mem        ; load spilled value before use
  %v4 = icmp sle i64 %v2, %v7       ; note, v1 changed to v7 here
  br i1 %v4, label %loop_body, label %loop_exit

loop_body:
  %v5 = add i64 %v3, %v2
  %v6 = add i64 %v2, 1
  %v2 = %v6 ;; FAKE copy
  %v3 = %v5 ;; FAKE copy
  br label %loop_cond

loop_exit:
  ret i64 %v3
}

Second attempt at coloring

After removing (spilling) v1 to memory, we end up with the following updated interference graph:

Compare this with the interference graph above. It has a lot fewer edges, even though there is the one extra v7 node for the new temporary loaded value.

Now let’s try again to color this with \(K=3\) colors, say Blue, Red, and Green.

First we see v0, v1, v4, v5, v6, and v7 all have degree \(\lt K\), so remove them from the graph (we know we will be able to color them later based on their neighbors).
The resulting graph has just 2 nodes, so we can assign different colors to v2 and v3 arbitrarily (say, Blue and Red respectively).
Then we add back in the 6 vertices we originally removed, one at a time, assigning each one any color that is not used by one of its neighbors.
- Add v0: it doesn’t touch any node currently in the graph. So it can be any color. But the compiler will notice that we copy from v0 to v2, and v2 is already Blue, so we make v0 Blue to avoid that copy instruction.
- Add v1: it touches v0, so it can be either Red or Green. We’ll make v1 Green.
- Add v4: it touches v2 (Blue) and v3 (Red), so v4 must be Green.
- Add v5: it touches v2, so it could be Red or Green. To avoid the copy from v5 to v3, we will make it the same color as v3, Red.
- Add v6: It touches v5 (Red), so it could be Blue or Green. To avoid the copy from v2 to v6, we make it Blue like v2.
- Add v7: It touches v2 (Blue) and v3 (Red), so it must be Green.

The resulting colored graph looks like this:

Wrapping up

After this spilling and register assignment, we are ready to generate actual assembly code in the target language. Each of the “colors” in the graph coloring will correspond to a different physical register, and each “spill” will correspond to stack memory being used for that register.

Here is how the code above might be compiled to ARM assembly.

Some reminders about how ARM works:

The function arguments come in registers x0 and x1 automatically. So those will be the “Blue” and “Green” registers, respectively.
We will use the temporary register x8 as “Red”.
The return value must be stored into x0 at the end. So we have to have one move instruction at the end to account for that.

OK, here is the complete ARM code with instruction selection, register spilling and allocation. It uses only registers x0 (Blue), x1 (Green), and x8 (Red). In comments on each line are the corresponding LLVM IR instructions from the code above.

        .text
        .globl  sum           // -- Begin function sum
        .type   sum,@function
sum:
        sub     sp, sp, 8     // %v1mem = alloca i64
        str     x1, [sp, #0]  // store i64 %v1(Green), ptr %v1mem
        mov     x8, xzr       // %v3(Red) = 0
        jmp     .loop_cond    // br label %loop_cond
.loop_body:
        add     x8, x8, x0    // %v5(Red) = add i64 %v3(Red), %v2(Blue)
        add     x0, x0, #1    // %v6(Blue) = add i64 %v2(Blue), 1
.loop_cond:
        ldr     x1, [sp, #0]  // %v7(Green) = load i64, ptr %v1mem
        cmp     x0, x1        // %v4 = icmp sle i64 %v2(Blue), %v7(Green)
        b.le    .loop_body    // br i1 %v4, label %loop_body, label %loop_exit
.loop_exit:
        mov     x0, x8        // ret i64 %v3(Red)
        add     sp, sp, #8    // cleanup the stack
        ret

(The very astute reader will notice that we didn’t actually use a register for %v4 at all. This is due to the way ARM works with a conditional branch; the cmp instruction in ARM sets the “condition flags” and then the b.gt conditional branch uses those flags, so an extra register is not actually needed!

In reality, LLVM will realize this during instruction selection, prior to register allocation, so v4 will not be part of the live ranges and interference graph at all. Feel free to go back and see how this might affect the spilling and register allocation for the example.)