Author Robert Yates
Category Program Analysis
Tags 2026, Clang, LLVM, obfuscation, software-protection, compilers, reverse-engineering
How one Commit Broke Obfuscation: A blog post exploring the role of compilers and optimizations in the field of obfuscation and de-obfuscation.
Introduction
Obfuscation is security through obscurity; its purpose is to transform a piece of code into a much more complex representation, whilst preserving the original semantics of the code. A compiler's job is to transform source code into binary code and produce the simplest and most optimized representation it can for a given architecture. These are contrary goals, yet this contradiction is where obfuscators find their greatest leverage.
In this blog post, we will explore the relationship between compilers, obfuscation, and de-obfuscation. We will first cover some LLVM background, framed so that it is a little deeper and more relevant to this topic. Finally, we will walk through an example of obfuscation, watch the tug-of-war between our code and the optimization passes, and see how a single commit in LLVM breaks our obfuscation. Hopefully, by the end, we will have a better understanding of how this tug-of-war is, in fact, more of a yin-yang.
Meet the mystery function
The star of the blog will be the following function. We will watch how the compiler removes the obfuscation, and we will try to fight back.
#include <stdint.h>

uint8_t mystery(void) {
    return (uint8_t)(
        ((((40u ^ 0xFFu) | 0x9Bu) & 65u) +
         (((0u - (40u & 110u)) - 1u) | 81u) +
         (40u & 110u) - 65u) ^
        0xFFu
    );
}
Before we watch LLVM tear this down, here’s the minimal background you need.
A quick LLVM primer
LLVM is a framework for building compilers: a collection of reusable components that helps authors build up their compiler stages.
A compiler is often described in three stages: front-end, middle-end, and back-end. The so-called middle-end is the stage of compilation where transformations and analyses take place to support optimizations.
Before that, it is the front-end's responsibility to parse the source language into an abstract syntax tree (AST), which is then lowered into an intermediate representation (IR). In the reverse engineering world, this is sometimes referred to as an intermediate language; both IL and IR are used.
The IR is an important stage because its aim is to represent the semantics of the source language in a way that enables the compiler to reason about its behaviour and perform optimizations. IR is target independent and therefore, in theory, generic and simple.
The IR is eventually passed to the back-end; it's here that further lowering occurs toward a specific target architecture, and instructions are selected to generate binary code for, say, x86. The beauty of this architecture is that you can have many input languages and many output architectures, yet the middle-end works to optimize the same IR using a large collection of complex analysis and transformation passes that don't break the semantics of the code, helping the back-end produce fast and/or small code.
Try it yourself
In this blog, we will be working with IR snippets, and knowing how to generate and work with these files would be useful.
We can generate IR from C or C++ code using clang:
clang hello.c -S -emit-llvm -o hello.ll
or, to compile without optimizations while still allowing opt to work on the output later (at -O0, clang marks functions optnone, which would otherwise block later optimization):
clang -O0 -Xclang -disable-O0-optnone hello.c -S -emit-llvm -o hello.ll
To run optimization pipelines or specific passes:
opt hello.ll -O2 -S
opt hello.ll -passes=sroa,mem2reg -S
-O0, -O1, -O2, and -O3 are optimization levels; each option triggers a ready-to-use arrangement of passes in a pipeline.
To generate object files:
llc -filetype=obj hello.ll -o hello.o
You can also use these tools within Compiler Explorer.
Why the middle-end matters for both sides
LLVM’s middle-end is the product of decades of compiler research made concrete: Theory turned into analyses, algorithms into passes, and ideas refined through real implementation work. That makes it a rich source of knowledge for both reverse engineers and obfuscator authors. If we can produce code that remains difficult for these passes to simplify or reason about, that suggests the obfuscation is doing its job. On the other hand, if we can bring similar algorithms to RE tooling, then we have the beginnings of a capable de-obfuscator. The same machinery can help either hide intent or recover it.
As you saw earlier, we can run passes on the LLVM IR from the command line. LLVM has several passes, although that's a bit of an understatement. The LLVM pass list is split into analysis, transformation, and utility passes. They aim to eliminate unnecessary computation through methods such as dead code elimination, redundancy removal, control-flow simplification, memory optimizations, and much more.
On the flip side, we could also write our own passes to do the exact opposite.
The optimizer's toolkit
In the context of reverse engineering, obfuscation, and de-obfuscation, I would categorise them by their effect on simplification. These categories help explain how compiler optimizations reduce code complexity: the very mechanisms that make optimization and de-obfuscation possible. Here is an extremely brief look at a few passes and my own groupings (inter-procedural analysis is purposely left out).
- Dead Code/Store Elimination -> DSEPass, DCEPass, BDCEPass. Removes code that does not affect program output. Obfuscators often insert junk code, opaque predicates, or unreachable paths; DCE passes eliminate these.
- Constant Propagation & Folding -> SCCPPass, CorrelatedValuePropagationPass. Evaluates expressions at compile time and propagates known values. Defeats obfuscation that relies on dynamic computation of constants (opaque predicates, encoded values).
- Control Flow Simplification -> SimplifyCFGPass, JumpThreadingPass. Simplifies the control flow graph by merging blocks, removing redundant branches, and threading jumps. Critical for defeating control flow flattening and bogus control flow.
- Redundancy Elimination -> GVNPass, EarlyCSEPass. Removes redundant computations, such as duplicate expressions or equivalent computations inserted by obfuscators across different code paths.
- Instruction Simplification & Combining -> InstCombinePass, ReassociatePass. Simplifies and canonicalises instructions. Defeats arithmetic obfuscation (MBA expressions, substitution patterns, identity operations). A bit of a Swiss Army knife; I highly recommend looking through the code of this one.
- Memory Optimization -> SROAPass, MemCpyOptPass. Optimizes memory access patterns. Simplifies obfuscation that deliberately routes values through the stack and memory rather than keeping them in registers.
The arms race
Now that we know some of the tools, let's watch some of them in action.
Since the middle-end is designed to be generic, it's a great place to optimize code; we could, in fact, de-optimize it or, in more familiar terms, obfuscate it. Our obfuscation should be resistant to LLVM's optimization pipelines at a bare minimum. Let's take a piece of already obfuscated code, such as might be generated by a beginner's pass, and see how we fare against the optimization pipeline.
Round 1 — all constants, no contest
Back to our mystery function, let's work with it in LLVM IR form:
define i8 @mystery() {
  %notx = xor i8 40, -1
  %a = or i8 %notx, -101
  %b = and i8 %a, 65
  %c = and i8 40, 110
  %neg = sub i8 0, %c
  %comp = sub i8 %neg, 1
  %d = or i8 %comp, 81
  %sum1 = add i8 %b, %d
  %sum2 = add i8 %sum1, %c
  %sum3 = add i8 %sum2, -65
  %r = xor i8 %sum3, -1
  ret i8 %r
}
The syntax of LLVM IR is quite assembly-like. Here is a more 1:1 C version to help with reading the IR.
#include <stdint.h>

uint8_t mystery(void) {
    uint8_t notx = (uint8_t)(40u ^ 0xFFu);
    uint8_t a    = (uint8_t)(notx | (uint8_t)-101);
    uint8_t b    = (uint8_t)(a & 65u);
    uint8_t c    = (uint8_t)(40u & 110u);
    uint8_t neg  = (uint8_t)(0u - c);
    uint8_t comp = (uint8_t)(neg - 1u);
    uint8_t d    = (uint8_t)(comp | 81u);
    uint8_t sum1 = (uint8_t)(b + d);
    uint8_t sum2 = (uint8_t)(sum1 + c);
    uint8_t sum3 = (uint8_t)(sum2 + (uint8_t)-65);
    uint8_t r    = (uint8_t)(sum3 ^ 0xFFu);
    return r;
}
The code is a function that returns an 8-bit integer, but its contents are opaque: it is difficult to reason about what the result should be, so for now the code is successfully obfuscated.
Let's see what happens when we run an O2 optimization pipeline on this. We shall use LLVM 18 and its tool opt, which allows us to run pipelines and passes.
opt sample01.ll -O2 -S
; ModuleID = 'sample01.ll'
source_filename = "sample01.ll"
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define noundef i8 @mystery() local_unnamed_addr #0 {
  ret i8 0
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
Round 2 — why it collapsed instantly
The code was complex, but the optimization process quickly discovered the result, revealing the mystery: the function returns 0. The function is simplified because the code can be seen as a constant expression, and the optimization pipeline fully folded it.
We don't even need to run an entire O2 pipeline on it, because a single pass, EarlyCSEPass, is responsible. We can achieve the same result with: opt sample01.ll -passes=early-cse -S
If you wish to follow along, you can use Compiler Explorer: on the left side, choose LLVM IR; on the right side, choose opt 18.1.0 and add the compiler option -O2. Also, click Add New and then Opt Pipeline.
The pass instcombine could also achieve this, but "Early Common Subexpression Elimination" was run first and easily saw through the code, evaluating it as a constant expression. The pass knows that the first instruction %notx = xor i8 40, -1 is the same as a not, so %notx can be replaced with the constant 0xD7. Therefore %a = or i8 %notx, -101 becomes %a = 0xDF, and so on and so forth until the whole thing folds down to our 0.
Modern RE tools will also easily see through this; they lift assembly code into their own IR in order to optimize and reason about it for the final decompiler layer.
For example, take this Binary Ninja snippet. It shows data flow tracking in its disassembly view within the {}, and the folding happens line by line:
0x00400000 b0ff mov al, 0xff
0x00400002 3428 xor al, 0x28 {0xd7}
0x00400004 0c9b or al, 0x9b {0xdf}
0x00400006 2441 and al, 0x41
0x00400008 b16e mov cl, 0x6e
0x0040000a 80e128 and cl, 0x28
0x0040000d 31d2 xor edx, edx {0x0}
0x0040000f 28ca sub dl, cl {0xd8}
0x00400011 80ea01 sub dl, 0x1 {0xd7}
0x00400014 80ca51 or dl, 0x51 {0xd7}
0x00400017 00d0 add al, dl {0x18}
0x00400019 00c8 add al, cl {0x40}
0x0040001b 04bf add al, 0xbf {0xff}
0x0040001d 34ff xor al, 0xff {0x0} <-----
0x0040001f c3 retn {__return_addr}
1) As an author of obfuscation, we have learnt that a linear set of simple instructions using constant values can easily be broken.
2) As a reverse engineer trying to de-obfuscate code, we have learnt that proving that something is constant is very important; once something is known to be constant, it has a cascading impact on analysis.
If you are a C++ coder, you might remember being taught to mark variables and class members const wherever possible. Marking things const informs the compiler what cannot change, thereby enabling stronger optimizations.
The same principle applies to RE tools, where asserting immutability improves analysis and de-obfuscation.
Round 3 — hiding behind a variable
Let's improve upon our example to make it stronger. We need to somehow prevent the compiler from knowing something is constant. In our example, we have the value 40 twice; we could replace both occurrences with an unknown value.
Our first instinct might be:
define i8 @mystery() {
  %unknown = call i8 asm "", "=r"()
  %notx = xor i8 %unknown, -1
  %a = or i8 %notx, -101
  %b = and i8 %a, 65
  %c = and i8 %unknown, 110
  %neg = sub i8 0, %c
  %comp = sub i8 %neg, 1
  %d = or i8 %comp, 81
  %sum1 = add i8 %b, %d
  %sum2 = add i8 %sum1, %c
  %sum3 = add i8 %sum2, -65
  %r = xor i8 %sum3, -1
  ret i8 %r
}
The new version of the code uses a random register, and it breaks both the optimizer and our reversing tools. However, the use of that register sticks out: it appears out of nowhere and looks like use of an uninitialised value. We can do better, for instance by interweaving our expression with the existing code, using a variable that already exists in the program. Since this contrived example doesn't have one, I will add a parameter to the function and use that instead.
define i8 @mystery(i8 %arg1) {
  %notx = xor i8 %arg1, -1
  %a = or i8 %notx, -101
  %b = and i8 %a, 65
  %c = and i8 %arg1, 110
  %neg = sub i8 0, %c
  %comp = sub i8 %neg, 1
  %d = or i8 %comp, 81
  %sum1 = add i8 %b, %d
  %sum2 = add i8 %sum1, %c
  %sum3 = add i8 %sum2, -65
  %r = xor i8 %sum3, -1
  ret i8 %r
}
The new code is now a mix of variables, arithmetic, and bitwise operations; this is known as Mixed Boolean Arithmetic (MBA), and our example is, in fact, a semi-linear MBA used for constant obfuscation.
Now the optimizer can't figure out that this is a constant expression (even if it simplifies it a bit):
opt sample02.ll -O2 -S
define i8 @mystery(i8 %arg1) local_unnamed_addr #0 {
  %c = and i8 %arg1, 110
  %comp = xor i8 %c, -1
  %d = or i8 %comp, 81
  %1 = or i8 %arg1, -65
  %sub.neg = add nsw i8 %1, 64
  %2 = add nsw i8 %c, %d
  %r = sub nsw i8 %sub.neg, %2
  ret i8 %r
}
When viewed inside a decompiler, the new complex expression now looks like part of the functionality of the program. The interweaving with an existing value makes it hard for the decompiler to reason about the code. This is where using features of your reverse engineering tool to inform the decompiler about the state of certain values will help you de-obfuscate it.
For now, the mystery value is once again secure; we require more work to figure it out before we can know the answer again.
Round 4 — version shock: LLVM 18 vs LLVM 19
Up until now, we have been testing with LLVM version 18.1.8, but some time has passed in our contrived scenario, and we now have access to LLVM 19 (version 19.1.7). Let's rerun our command opt sample02.ll -O2 -S
; ModuleID = 'sample02.ll'
source_filename = "sample02.ll"
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define noundef range(i8 -110, 112) i8 @mystery(i8 %arg1) local_unnamed_addr #0 {
  ret i8 0
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
Wait... what happened? The upgraded LLVM version can now reverse our encoded secret. If we run opt sample02.ll -passes=early-cse -S once more, we get:
define i8 @mystery(i8 %arg1) {
  %notx = xor i8 %arg1, -1
  %a = or i8 %notx, -101
  %b = and i8 %a, 65
  %c = and i8 %arg1, 110
  %neg = sub i8 0, %c
  %comp = sub i8 %neg, 1
  %d = or i8 %comp, 81
  %sum1 = add i8 %b, %d
  %sum2 = add i8 %sum1, %c
  %sum3 = add i8 %sum2, -65
  %r = xor i8 %sum3, -1
  ret i8 %r
}
From the pipeline, we can see it's not the early-cse pass reverting our changes, but something new! We can figure out the exact cause of the optimization through the Compiler Explorer opt pipeline viewer.
opt sample02.ll -passes=instcombine,reassociate,instcombine,gvn,bdce -S
; ModuleID = 'sample02.ll'
source_filename = "sample02.ll"
define i8 @mystery(i8 %arg1) {
  ret i8 0
}
A chain of just five passes (InstCombine, Reassociate, InstCombine again, GVN, and BDCE) is all it takes to unravel the expression down to zero. The culprit is a single commit that landed in LLVM 19, adding several lines of code to InstCombine's getFreelyInvertedImpl function.
The change is an example of the middle-end evolving and finding new ways to augment optimization. It teaches the pass to apply De Morgan's law, ~(A | B) → (~A & ~B), allowing it to push a bitwise NOT recursively through OR and AND operations. Our obfuscation relied on exactly this: a final NOT tangled through nested ORs that the compiler couldn't see through. With the De Morgan inversion, the NOT layers peel away, and the expression flattens into a form where both sides of a subtraction are visibly identical; the compiler folds x - x to zero. A single rule of boolean algebra that LLVM 19 learned to apply collapsed our obfuscated expression: a beautiful example of how a seemingly small algebraic rule can unlock a much larger simplification.
Round 5 — one constant away from survival
This is the arms race; obfuscation techniques that exploit gaps in compiler reasoning have an expiry date. The middle-end only gets smarter with each release.
One last fun remark: if we change both 65s in our expression to 66 (and the -65 to -66), like so:
define i8 @mystery(i8 %arg1) {
  %notx = xor i8 %arg1, -1
  %a = or i8 %notx, -101
  %b = and i8 %a, 66
  %c = and i8 %arg1, 110
  %neg = sub i8 0, %c
  %comp = sub i8 %neg, 1
  %d = or i8 %comp, 81
  %sum1 = add i8 %b, %d
  %sum2 = add i8 %sum1, %c
  %sum3 = add i8 %sum2, -66
  %r = xor i8 %sum3, -1
  ret i8 %r
}
then that is enough to defeat the LLVM 19 change. The expression still returns zero for every input, but the altered constants misalign the bit masks that InstCombine needs for its algebraic cancellation; the XOR residue left behind poisons the entire simplification chain. Not even opt version 22.1.0 can fold it, which is even more intriguing.
The yin-yang
This post was a very simple primer on the topic, but it demonstrates that staying ahead means understanding not just what the compiler can do today, but what it will be able to do tomorrow. Whether you are building obfuscation or building tools to break it, the knowledge is the same: understanding how optimization passes reason about code is the foundation for both sides of the game.
That's the yin-yang at the heart of this whole story: the same machinery that helps hide intent can also reveal it, and each side sharpens the other over time. Better obfuscation pressures optimizers and analysis tools to evolve, while better optimization and de-obfuscation force obfuscation to become more thoughtful and less fragile. They are not opposites moving apart; they are complementary forces in the same cycle, and understanding that cycle is what makes you dangerous on either side.
If you made it this far, then I thank you for your time and hope you enjoyed the post :)
Acknowledgments
Thanks to Béatrice Creusillet for her thorough review of my post. To Jean François for his encouragement, support and general jolliness :)