Summary

Preheader/header/body/exit blocks
Loop-invariant code motion (licm)
Loop unrolling
- Trip count
- Unrolling factor
- Epilogue

Optimizing loops in LLVM

Today we will look at more “middle-end” optimization passes in LLVM. Recall that this refers to machine-independent optimizations that happen within the LLVM IR code itself, after it is generated by a “front-end” compiler like the ones you have written for the labs.

Last class we looked at optimizing away one source of slowness: memory loads and stores. Today we will look at two optimizations on loops, with the following high-level goals:

Reduce the number of instructions within each loop iteration, by moving some instructions outside of the loop before it starts
Reduce the actual number of loop iterations, by combining multiple iterations into a single (larger) basic block

Overall, the goal as in any optimization, is to take correct (but slow) code and make it fast without ever breaking the correctness.

The Anatomy of a Loop in LLVM

When your compiler generates a while or for loop, it likely creates a structure like this. To optimize it, we need to agree on a few key locations:

The Preheader: This is a block that executes immediately before the loop starts. It implies “we are about to loop.” If your code doesn’t have one, we often create a blank one just to hold optimized code.
The Header: The entry point of the loop. This usually contains the condition check (icmp) and the conditional branch.
The Body: The code that runs if the condition is true.
The Exit: Where we go when the loop is done.

Visualizing the structure:

    [ Preheader ]  <-- Insert "One-time setup" code here
         |
         v
  ->[  Header   ]  <-- Loop condition check
 /     /    \
/   True   False
|    /        \
| [ Body ]   [ Exit ]
|   /
\__/

Optimization 1: Loop Invariant Code Motion (LICM)

The Intuition:

If you calculate x = y * 5 inside a loop, but y never changes, you are wasting CPU cycles re-calculating the same answer on every iteration. We should calculate it once, before the loop starts, and just use the result inside.

The Strategy:

Identify an instruction in the Body where the operands are defined outside the loop (or are constants).
Move (don’t copy) that instruction into the Preheader.

LICM Example

Source Code (Mental Model): We want to move the calculation of step_size out of the loop.

// 'scale' and 'base' are passed in as arguments
int sum = 0;
for (int i = 0; i < 100; i++) {
    int step_size = scale * base; // <--- This never changes!
    sum = sum + step_size;
}

Before Optimization (Slow LLVM): Notice that %step_size is recalculated every single time.

loop_header:
  %i = phi i32 [ 0, %preheader ], [ %i.next, %loop_body ]
  %sum = phi i32 [ 0, %preheader ], [ %sum.new, %loop_body ]
  %cmp = icmp slt i32 %i, 100
  br i1 %cmp, label %loop_body, label %exit

loop_body:
  ; PROBLEM: 'scale' and 'base' are loop-invariant.
  %step_size = mul i32 %scale, %base

  %sum.new = add i32 %sum, %step_size
  %i.next = add i32 %i, 1
  br label %loop_header

After Optimization (Fast LLVM): We move the calculation to the Preheader.

preheader:
  ; SOLUTION: Calculate once, reuse forever.
  %step_size = mul i32 %scale, %base
  br label %loop_header

loop_header:
  ; Now %step_size is available here without recalculation
  %i = phi i32 [ 0, %preheader ], [ %i.next, %loop_body ]
  ...

LICM rules

More generally, we can move any lines to before the loop if they satisfy these conditions:

Performing basic arithmetic with no side effects, like addition, logical operations, etc. Definitely not function calls.
All operands in the 3AC are defined outside the loop

Of course, after moving one line, then condition (2) might be true for another line, so we have to keep looking for more lines to move until there are none. (It’s sufficient to just consider each line in order of execution.)

Optimization 2: Loop Unrolling

The Intuition: Every loop iteration has “administrative overhead”, like:

Incrementing the counter (++i).
Checking the condition (icmp).
Branching (br).

If the work inside the loop is small, the CPU spends more time on administration than actual work. Unrolling duplicates the work to reduce the number of checks.

General idea

We want to combine multiple iterations into a single block, then do the “administrative” work for all these iterations just once, reducing the total overhead.

As an analogy, imagine you want to work out until 1600. One way to do this would be to follow this “algorithm”:

Check your watch
If it’s 1600 or later, quit
Otherwise, do a 1-minute workout
Go back to step 1

But if you did this, you would waste a lot of time checking your watch. Instead, you could try something like this:

Check your watch
If it’s 1556 or later, go to step 5
Otherwise, do a 5-minute workout
Go back to step 1
Do 1-minute workouts until it’s 1600

Hopefully you see the idea here. By structuring your workout in larger 5-minute chunks, you spend more time actually doing stuff and less time checking your watch. This is what we want to do with loops in LLVM.

Calculating the “Trip Count”

Before unrolling, the compiler tries to determine: “How many times does this loop run?”

Static Count: If the loop runs 0 to 100, we know the count is 100. We can unroll perfectly.
Dynamic Count: If the loop runs 0 to N, we don’t know the count until the code actually runs. This is harder.

If we unroll a 0..N loop by 2, but N turns out to be 9, we will process {0,1}, {2,3}, {4,5}, {6,7}. We still need to process {8}. This requires a cleanup structure called the Epilogue.

Example: Dynamic Unrolling & The Epilogue

Source Code:

// We don't know what N is!
for (int i = 0; i < N; i++) {
    sum = sum + i;
}

LLVM IR Visualized (Unrolled by 2): To handle this, we create two distinct loop sections.

The Fast Loop: Runs as long as we have chunks of 2 left.
The Epilogue: Picks up where the fast loop stopped and finishes any remainder.

; --- STAGE 1: THE FAST LOOP (Steps by 2) ---
unrolled_header:
  %i = phi i32 [ 0, %preheader ], [ %i.next, %unrolled_body ]
  %sum = phi i32 [ 0, %preheader ], [ %sum.2, %unrolled_body ]

  ; Safety check: Do we have at least 2 items left to process?
  ; (simplified logic for readability)
  %limit_check = add i32 %i, 2
  %can_unroll = icmp sle i32 %limit_check, %N
  br i1 %can_unroll, label %unrolled_body, label %epilogue_header

unrolled_body:
  ; Work 1 (Index i)
  %sum.1 = add i32 %sum, %i

  ; Work 2 (Index i+1)
  %i_plus_1 = add i32 %i, 1
  %sum.2 = add i32 %sum.1, %i_plus_1

  ; Admin: Increment by 2 and jump back
  %i.next = add i32 %i, 2
  br label %unrolled_header


; --- STAGE 2: THE EPILOGUE (Steps by 1) ---
epilogue_header:
  ; PHI Nodes here are CRITICAL. They ask:
  ; "Did we come here from the start? Or from the unrolled loop?"
  ; We inherit the current values of %i and %sum.
  %i.epi = phi i32 [ %i, %unrolled_header ], [ 0, %preheader ]
  %sum.epi = phi i32 [ %sum, %unrolled_header ], [ 0, %preheader ]

  ; Standard loop check: Are we fully done?
  %is_done = icmp sge i32 %i.epi, %N
  br i1 %is_done, label %exit, label %epilogue_body

epilogue_body:
  ; This runs at most 1 time (because we unrolled by 2)
  %sum.final = add i32 %sum.epi, %i.epi

  ; Finish up
  br label %exit

exit:
  %final_result = phi i32 [ %sum, %epilogue_header ], [ %sum.final, %epilogue_body ]