Loop Unrolling Explained (Part 1)

Yuttapichai (Guide) Kerdcharoen
5 min read · Mar 12, 2022

Loop unrolling is a loop optimization technique that reduces the number of iterations by performing the work of several iterations within a single one.

Consider the example of computing the sum of an array.

for (size_t i = 0; i < N; i++) {
    sum += A[i];
}

If we analyze the code by linearizing it, each iteration executes the following:

loop:
    if not (i < N) {
        goto end;
    }
    sum += A[i];
    i++;
    goto loop;
end:

As you may have noticed, every iteration has to evaluate the if condition (loop condition) and the i++ statement (loop increment). Suppose that N = 1 million. We then compute 1 million loop conditions and 1 million loop increments as overhead on top of the actual task, which is summing the array.

But we also cannot spend time writing sum += A[0], sum += A[1], and so on by hand to remove the overhead. More importantly, N may still be unknown at compile time. In other words, the compiler cannot write the N sum-increment statements for you either.

One thing to note is that computers and humans are different: you may have insight into what N is, or what its characteristics are. Let's assume that N is always 1 million. You could then write a program that generates 1 million sum-increment statements and get rid of the overhead.
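For illustration, here is a minimal sketch of such a generator, assuming N is fixed at 1 million and the array holds 64-bit integers (the function name unrolled_sum and the types are just illustrative choices). It prints a fully unrolled function to standard output, so you can redirect the output into a source file and compile it.

#include <stddef.h>
#include <stdio.h>

int main(void) {
    const size_t N = 1000000; /* assumed to be known ahead of time */

    /* Emit a function whose body is the loop fully written out:
     * no loop condition, no loop counter, just N additions. */
    printf("uint64_t unrolled_sum(const uint64_t *A) {\n");
    printf("    uint64_t sum = 0;\n");
    for (size_t i = 0; i < N; i++) {
        printf("    sum += A[%zu];\n", i);
    }
    printf("    return sum;\n");
    printf("}\n");
    return 0;
}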

To be more precise, I set up a small experiment to test whether this hypothesis about the overhead is correct.

Execution time (in ms) between unroll and no-unroll implementations (lower is better)

In the experiment, I ran two programs that do the same thing: compute the sum of an array.

The first program uses typical loops.

The second program unrolls the loop above.

Both of them were compiled with the -O0 flag (without compiler optimizations) to make sure that the compiler does not do anything unexpected.

I captured the time right before the first loop/increment started and right after the last loop/increment was executed.
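Roughly, the measurement looks like the sketch below. This is only a minimal sketch, assuming a POSIX system (clock_gettime with CLOCK_MONOTONIC); the array size and element type here are illustrative choices.

#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps. */
static double elapsed_ms(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

int main(void) {
    enum { N = 1000000 };
    static long A[N];   /* static so the array does not blow up the stack */
    long sum = 0;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);   /* right before the loop */
    for (size_t i = 0; i < N; i++) {
        sum += A[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &end);     /* right after the last increment */

    printf("sum = %ld, time = %.3f ms\n", sum, elapsed_ms(start, end));
    return 0;
}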

However, the unrolled program was slower than the typical loop, which should not be the case. Why?

One reason is the size of the program. The second program contains more than 1 million lines of code, and its binary (executable) file is even larger than the source code (~20 MB on my machine). Hence, when the processor ran the program, it needed to fetch far more instructions (which may lead to instruction cache misses). This might be the cause of the longer execution time.

Therefore, the previous experiment did not work out well, since the number of actual instructions matters too. We need another experiment.

Before moving on to the next experiment, another number I would like to share is the compilation time, a trade-off I have not talked about yet.

Compiling the first program was roughly 1,400x faster than compiling the second: on my machine, compilation took ~0.5 seconds for the typical loop versus ~750 seconds for the unrolled program (painfully slow 😡).

Let me sum everything up before we move forward.

  1. Unrolling loops reduces loop overheads (e.g., loop conditions, loop variable increments)
  2. Unrolling too much may not help since there is a trade-off between the number of instructions and the loop overheads
  3. Unrolling too much produces larger programs
  4. Unrolling too much consumes higher compilation time

Since the previous experiment was inadequate, how do we achieve better execution time with unrolling (or is there no such thing)?

What if I do not unroll it too much? Specifically, I can unroll by a factor of 2, handling two elements per iteration, as illustrated in the example below.

for (size_t i = 0; i < N; i += 2) {
    sum += A[i];
    sum += A[i+1];
}

This removes half of the loop overhead, which is a significant amount.

Looks good. Let me try to set this experiment up.

Execution time (in ms) between three implementations (lower is better)

And it is now better (~1.43x speedup)! Thus, our hypothesis is correct: the loop overhead was cut down by half.

Roughly, we went from 300 million statements down to 200 million, i.e., about 67% of the original count. More importantly, 67% is quite close to the inverse of the 1.43x speedup (1/1.43 ≈ 70%), which makes the measured speedup look reasonable.

In reality, the processor does not execute 200 or 300 million "statements"; they are all compiled into machine instructions. Let's look at the assembly of the typical loop program. (I used objdump -d [binary_file_path] to dump the assembly.)

100003b3d: cmpq  $1000000, -8000080(%rbp)
100003b48: jae 0x100003b82
100003b4e: movq -8000080(%rbp), %rax
100003b55: movq -8000016(%rbp,%rax,8), %rax
100003b5d: addq -8000072(%rbp), %rax
100003b64: movq %rax, -8000072(%rbp)
100003b6b: movq -8000080(%rbp), %rax
100003b72: addq $1, %rax
100003b76: movq %rax, -8000080(%rbp)
100003b7d: jmp 0x100003b3d <_main+0xcd>

I will not go into details, but I want to highlight that each iteration involves more than just the "statements". (Please compare these points with the linearized code above.)

  • The cmpq at address 100003b3d corresponds to if (i < N)
  • The jae at address 100003b48 corresponds to goto end;
  • The addq at address 100003b72 corresponds to i++;
  • The jmp at address 100003b7d corresponds to goto loop;
  • The instructions at addresses 100003b6b and 100003b76 are also part of the i++; statement

From this, 6 of the 10 instructions per iteration are loop overhead, and unrolling by 2 effectively halves them: each iteration of the unrolled loop executes the 6 overhead instructions plus two copies of the 4-instruction body, i.e., 14 instructions for 2 elements, or about 7 instructions per element instead of 10.

More importantly, 70% of the instructions and the ~1.43x speedup (inverse ≈ 70%) now match up closely. Thus, we can be more confident that the numbers are consistent.

To sum up, unrolling loops actually helps reduce the program’s execution time by removing loop overheads.

Another reality is that compilers already do this for you (that is why I had to use -O0 in the experiments), since it is such a well-known optimization.

Moreover, loop unrolling does not only reduce the number of instructions executed; it also helps your processor's pipelining*. See the links below for more information.

A few questions before we part:

  • What if I unroll the loop by 4 or more than that?
  • What is the break-even point of the loop unrolling?
  • How can I deal with more complicated loops?
  • How about loops with a dynamic number of iterations?
  • How about loops where the number of iterations is a prime number?

Let’s talk about them in the later parts!

For folks who want to reproduce the results, the repository below contains all the code (that I can share). Check it out!


Yuttapichai (Guide) Kerdcharoen

I am Guide (Yuttapichai). I love seeing people enjoy their life. Understanding software is my job. Sharing knowledge is my life.