Summary
- The concept of compiler optimization and passes
- Constant propagation
- Memory elimination (mem2reg)
- Phi nodes in LLVM IR
Compilation pipeline
Up to this point in the semester, your compilers have focused on
correctness: taking source code and compiling (“lowering”) it into valid LLVM IR.
You have likely relied heavily on stack allocation (alloca) to manage variables,
treating the LLVM IR almost like a direct translation of the abstract syntax
tree.
In this unit, we want to dip our toes briefly into optimization: taking that initial LLVM IR code as the starting point, and eventually producing machine code which is equivalent (so we don’t mess with correctness) but uses as few resources (time and space) as possible.
We can think of the compilation as happening in a pipeline like this:
- Front-End: Checks semantics, generates “raw” IR. (You built this already.) This encompasses scanning, parsing, and semantic analysis.
- Middle-End: Analyzes and transforms the IR to be more efficient. (We are here for the next two days.)
- Back-End: Maps the IR to specific hardware (x86, ARM) and assigns physical registers. (We will also think about this a little bit.)
When someone creates a new programming language, they only need to worry about the “front end” part, which is what we have done this semester so far. The job of the compiler (LLVM in our case) is to perform multiple optimization passes at the middle-end and back-end levels, transforming the raw LLVM IR code into efficient machine code.
First Goal: Eliminate memory usage
Today we will look at a “middle-end” optimization that converts memory usage to register usage.
The “Naive” Approach (Stack)
The naive code generated by most front-ends is “safe but slow.” It treats every local variable as a memory address.
Consider this C code:
int x = 10;
int y = x + 5;
printf("%d + 5 = %d\n", x, y);
The first code generated by a compiler will generally be something like this:
%ptr_x = alloca i32 ; Allocate memory on stack
%ptr_y = alloca i32
store i32 10, ptr %ptr_x ; Write to memory (Expensive!)
%tmp1 = load i32, ptr %ptr_x ; Read from memory (Expensive!)
%tmp2 = add i32 %tmp1, 5 ; Do the math
store i32 %tmp2, ptr %ptr_y ; Write to memory again (Expensive!)
%tmp3 = load i32, ptr %ptr_x ; Read both variables before printing (Expensive!)
%tmp4 = load i32, ptr %ptr_y ; Read both variables before printing (Expensive!)
; finally call printf (assuming @lit1 is defined outside of main)
call i32 @printf(ptr @lit1, i32 %tmp3, i32 %tmp4)
Problem: Memory access (Load/Store) is significantly slower than register access.
You know from your computer architecture class that a memory access can be many times slower than a register access: even a hit in the L1 cache takes several cycles, and going all the way to main memory can cost hundreds of cycles.
The “Optimized” Approach
To optimize this code and eliminate unnecessary stack memory usage, we are really doing two things:
- Replacing variables that hold known constant values with the constants themselves (constant propagation)
- Replacing memory usage with registers (promotion)
In this small example it’s easy to see what should happen:
; x is just the constant 10, don't even need to use a register for it
%y = add i32 10, 5 ; Calculate y directly
call i32 @printf(ptr @lit1, i32 10, i32 %y)
This optimization will have a huge effect in practice! We managed to reduce from 9 instructions down to just 2, and eliminated 5 expensive memory load/store operations.
Phi nodes
Remember that, because LLVM IR is an SSA language, it requires that every virtual register is assigned exactly once.
Unfortunately this creates some new challenges when we try to eliminate memory usage.
The Problem: Control Flow & Mutability
Real programs have branches (if, while). This creates a conflict with SSA. How do we represent a variable that changes value based on which path the code took?
Branching example:
int x = 10;
if (some_condition()) {
x += 5;
} else {
x = 20;
}
return x * 2; // Which x is this?
Here is how this might be naively translated to LLVM (using memory):
entry:
%xptr = alloca i32
store i32 10, ptr %xptr
%cond = call i1 @some_condition()
br i1 %cond, label %true_block, label %false_block
true_block:
%tmp1 = load i32, ptr %xptr
%tmp2 = add i32 %tmp1, 5
store i32 %tmp2, ptr %xptr
br label %merge_block
false_block:
store i32 20, ptr %xptr
br label %merge_block
merge_block:
%x = load i32, ptr %xptr
%tmp3 = mul i32 %x, 2
ret i32 %tmp3
In a Control Flow Graph (CFG), this looks like a diamond. We have a “True Block” and a “False Block” merging into a “Merge Block.”
The SSA Paradox:
- We cannot simply write %x = 10 in one branch and %x = 20 in the other (this violates Single Assignment).
- We cannot create %x1 and %x2, because the return statement in the merge block won’t know which one to use.
The Solution: The Phi Node (\(\phi\))
To solve this, SSA introduces a special instruction called the Phi function. It conceptually lives at the very start of a Basic Block where control flow merges.
The \(\phi\) node works like a selector: “If we arrived at this block from Path A, select Value A. If from Path B, select Value B.”
Here is the syntax in LLVM IR:
%result = phi <type> [ <value1>, <label1> ], [ <value2>, <label2> ], ...
Example Solution
Here is the previous if/else logic implemented correctly in LLVM SSA:
entry:
; initial alloca and store eliminated
; no more alloca or store needed; we are not using memory!
%cond = call i1 @some_condition()
br i1 %cond, label %true_block, label %false_block
true_block:
; load from xptr eliminated; we know it equals 10 here
%tmp2 = add i32 10, 5
; store to xptr eliminated
br label %merge_block
false_block:
; store to xptr eliminated
br label %merge_block
merge_block:
; load replaced with phi
%x = phi i32 [%tmp2, %true_block], [20, %false_block]
%tmp3 = mul i32 %x, 2
ret i32 %tmp3
Rules for phi nodes
- Position: Phi nodes must be the first instructions in a Basic Block (before any math or other logic).
- Completeness: You must list a value for every predecessor block — that is, every basic block in the control flow graph (CFG) that has an arrow to this block.
General strategy for replacing memory with registers
While we can do these optimizations “by hand” for small examples, it’s important to remember that a real compiler has to do this automatically, with an algorithm.
Here is one way that algorithm can be constructed (a rough code sketch follows the list):
- Identify each alloca stack memory allocation that can potentially be optimized away. Make sure that address is not used for any pointer arithmetic or passed/returned outside of the local function.
- Eliminate each load from that address, one at a time.
  - For each load, trace backwards in the CFG to the last place where that memory would have previously been stored.
  - Easy case: If there’s only one corresponding store (for example, if it’s just a few lines up in the same basic block), then we don’t need a phi. Just replace every use of the destination register with whatever value was stored above.
  - Hard case: If there is more than one corresponding store depending on the control flow, then we have to add a phi node. Make a phi node for this value at the top of the basic block, assign that to the destination register, and get rid of the load.
- Once all the loads have been replaced, we can safely remove the alloca and all stores to that address.
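To make the steps concrete, here is a minimal Python sketch of this strategy. The tuple-based “toy IR” encoding and the promote function are invented just for this illustration, and the sketch assumes a loop-free CFG (visited in order) where every path to a load contains a store to the pointer. It is not how LLVM’s real mem2reg pass is implemented.
# Toy illustration only: instructions are tuples, blocks are lists of tuples,
# and `preds` records the CFG predecessors of each block.

def promote(blocks, preds, order, ptr):
    """Eliminate loads/stores of `ptr`, inserting phi nodes where needed."""
    last_store = {}   # block name -> value held in *ptr at the end of that block
    renames = {}      # register defined by a removed load -> replacement value

    for name in order:
        new_instrs = []
        # Trace backwards: what value does *ptr hold on entry to this block?
        incoming = [(last_store[p], p) for p in preds.get(name, []) if p in last_store]
        current = None
        if len({v for v, _ in incoming}) == 1:
            current = incoming[0][0]              # easy case: one reaching store
        elif len(incoming) > 1:
            current = f"%phi_{name}"              # hard case: make a phi node
            new_instrs.append(("phi", current, incoming))
        for instr in blocks[name]:
            op = instr[0]
            if op == "alloca" and instr[1] == ptr:
                continue                          # drop the alloca
            if op == "store" and instr[2] == ptr:
                current = renames.get(instr[1], instr[1])
                continue                          # remember the value, drop the store
            if op == "load" and instr[2] == ptr:
                renames[instr[1]] = current       # uses of the dest get this value
                continue                          # drop the load
            # Rewrite operands that referred to a register we just eliminated.
            new_instrs.append(tuple(renames.get(x, x) for x in instr))
        blocks[name] = new_instrs
        if current is not None:
            last_store[name] = current

# The if/else (diamond) example from above, in the toy encoding:
blocks = {
    "entry":       [("alloca", "%xptr"), ("store", 10, "%xptr"),
                    ("br", "%cond", "true_block", "false_block")],
    "true_block":  [("load", "%tmp1", "%xptr"), ("add", "%tmp2", "%tmp1", 5),
                    ("store", "%tmp2", "%xptr"), ("br", "merge_block")],
    "false_block": [("store", 20, "%xptr"), ("br", "merge_block")],
    "merge_block": [("load", "%x", "%xptr"), ("mul", "%tmp3", "%x", 2),
                    ("ret", "%tmp3")],
}
preds = {"true_block": ["entry"], "false_block": ["entry"],
         "merge_block": ["true_block", "false_block"]}
promote(blocks, preds, ["entry", "true_block", "false_block", "merge_block"], "%xptr")
for name, instrs in blocks.items():
    print(name, instrs)   # matches the hand-optimized IR from the phi example
LLVM’s actual mem2reg pass uses dominance information to decide where phi nodes are required and handles loops, but the easy-case/hard-case split above is the same basic idea.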
Easy case example
In the above example, there is only one memory location to eliminate, %xptr. The first load was from this snippet:
entry:
store i32 10, ptr %xptr
; ...
br i1 %cond, label %true_block, label %false_block
true_block:
%tmp1 = load i32, ptr %xptr
; ...
This is an “easy case” for elimination: we know this load corresponds to the initial store of the constant value 10. So we just replace every use of %tmp1 with the literal 10 and remove the load operation.
Hard case example
The second load in the above example comes from this part of the code:
true_block:
; ...
store i32 %tmp2, ptr %xptr
br label %merge_block
false_block:
store i32 20, ptr %xptr
br label %merge_block
merge_block:
%x = load i32, ptr %xptr
; ...
This is a “hard case” because the store could have come from either of
the two predecessor blocks. So we construct a phi node to replace the
load operation:
%x = phi i32 [%tmp2, %true_block], [20, %false_block]
In this case we don’t actually get to eliminate the register (%x), but we do
eliminate the memory usage — which was the original goal.
Constant propagation
There is one more kind of “easy” optimization to do at this level:
For any operation (like add, or, mul, etc.) whose operands are all constants, just perform the operation at compile time and replace all uses of the destination register with that constant.
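As a small illustration, here is a Python sketch of this folding rule over the same kind of toy instruction tuples used in the earlier sketch; the fold_constants function and its encoding are invented for this example and are not LLVM’s actual implementation.
import operator

# Toy constant folder: instructions are (op, dest, operand1, operand2) tuples.
FOLDABLE = {"add": operator.add, "mul": operator.mul,
            "or": operator.or_, "and": operator.and_}

def fold_constants(instrs):
    """Fold operations whose operands are all constants; return the remaining
    instructions and a map from eliminated registers to their constant values."""
    known = {}                                   # register -> constant value
    remaining = []
    for op, dest, a, b in instrs:
        a = known.get(a, a)                      # propagate already-known constants
        b = known.get(b, b)
        if op in FOLDABLE and isinstance(a, int) and isinstance(b, int):
            known[dest] = FOLDABLE[op](a, b)     # compute at compile time, drop the instruction
        else:
            remaining.append((op, dest, a, b))
    return remaining, known

kept, folded = fold_constants([("add", "%tmp2", 10, 5),     # both operands constant
                               ("mul", "%tmp3", "%x", 2)])  # %x comes from a phi
print(kept)     # [('mul', '%tmp3', '%x', 2)]
print(folded)   # {'%tmp2': 15} -- every use of %tmp2 (e.g. in the phi) becomes 15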
After we eliminated the memory usage, there is one instruction in the above example which is ripe for constant propagation:
%tmp2 = add i32 10, 5
This fits the bill because both operands to the 3AC instruction (10
and 5) are constants. So the compiler can just perform that operation
directly, replace all references to register %tmp2 with the constant
15, and then remove this add instruction. Here’s the resulting,
further optimized example code:
entry:
%cond = call i1 @some_condition()
br i1 %cond, label %true_block, label %false_block
true_block:
; add operation removed from here
br label %merge_block
false_block:
br label %merge_block
merge_block:
; reference to %tmp2 replaced with constant 15
%x = phi i32 [15, %true_block], [20, %false_block]
%tmp3 = mul i32 %x, 2
ret i32 %tmp3
(In fact, even more optimizations are possible here, such as replacing
the branch and two trivial basic blocks with a single select
operation. But we won’t get into those optimizations in this class.)
In general, this process of constant propagation is performed after memory-to-register optimization (see why?), and it is repeated again after other optimization steps later in the pipeline.
Optimization Passes in LLVM
The opt command (part of the llvm tool suite) performs various
“middle-end” optimizations at the LLVM IR code level.
Today we have seen (part of) what two “passes” in LLVM perform. You can check these yourself on example LLVM IR code! The syntax is like this:
opt --passes=XXX -S source.ll
where XXX is the name of the “optimization pass” you want. There are
hundreds of possible optimization passes in llvm (see them all by
running opt -print-passes), but the two techniques we have learned
about today are:
- mem2reg: Eliminating stack memory, using phi nodes where necessary
- instcombine: Constant propagation
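For example, to run both of today’s passes in sequence (assuming your IR is in a file named example.ll; the filename is just for illustration):
opt --passes=mem2reg,instcombine -S example.ll
The -S flag prints the resulting IR as text so you can compare it with the input.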
These are both actual steps in the compiler when you ask it to optimize
your C or C++ code by specifying -O2 for example.