While building an LLVM-based obfuscator, we explore some unexpected code areas. For instance, what happens when you try to optimize a single function that holds millions of instructions? Some LLVM passes start to suffer, including an unexpected one: Global Dead Code Elimination. Let's investigate!
Context
A trivial yet effective way to obfuscate code is to turn a simple computation into a very complex (and non-trivially reducible) one. For instance, given a pair of identities that rewrite an operation on integer values and integer constants into an equivalent mixed boolean-arithmetic expression, one can easily build a generator of arbitrarily complex expressions that challenges state-of-the-art expression simplifiers (see http://blog.quarkslab.com/what-theoretical-tools-are-needed-to-simplify-mba-expressions.html for more details).
A Forged Use Case
Consider the following C code:
```c
float foo(float i, float j) {
    i += j;
    i += j;
    // repeat n times
    i += j;
    return i;
}
```
Thanks to the non-associativity of floating point operations, it translates into very similar LLVM bitcode when piped through clang -S -emit-llvm -O2 large.c:
```llvm
define float @foo(float %i, float %j) {
  %1 = fadd float %i, %j
  %2 = fadd float %1, %j
  ; repeat N times
  %N = fadd float %2, %j
  ret float %N
}
```
This may look like a very stupid test case, but it's actually a good abstraction of what an obfuscator can generate from a simple function using a huge combination of substitutions as described above: a lot of arithmetic operations that are non-reducible by LLVM.
Running the standard LLVM optimization pipeline after regular obfuscations is generally a good idea: it cleans up the generated code, and adds a layer of diversification, folds constants and removes dead code that may have been introduced by the obfuscation process.
Obviously, this only works if the developers of the obfuscator are 100% confident that LLVM does not remove their obfuscation. But they have unit tests for this, and would you expect an obfuscation that can be undone by LLVM to resist a decent reverser?
Global Dead Code Elimination
Global Dead Code Elimination is an LLVM pass that removes unused global symbols, like a static function that got inlined and now does not have any remaining call site: it can be pruned.
So the challenge here is simply to track down all users of all globals and, based on this information, put some of them in the live bucket and the others in the dead bucket.
As of r293321, the LLVM algorithm implementing this task loops through all globals and, if a global is needed (for instance because it has externally visible linkage), marks everything that is part of the definition of this global as needed too.
For instance, in the following, local is marked as alive because it is used in the initializer of global:

```c
static int local = 1;
int global = local;
```
The same goes for two less known features of LLVM:
global aliases, which can be used to create a new entry in the symbol table pointing to an existing symbol. For instance, after compiling the following code:

```c
char acter() { return 42; }
char ism() __attribute__((weak, alias("acter")));
```

the object file holds two symbols:

```
$> nm a.o
0000000000000000 T acter
0000000000000000 T ism
```
ifunc, which can be used to create a dynamic alias whose value is resolved at runtime (only once, during the relocation performed by the dynamic linker). For instance, to (very slightly) disturb a reverser performing dynamic analysis of a binary, one could set up the following code:

```c
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/random.h>

static unsigned mod0(unsigned n, unsigned s) {
  const unsigned int d = 1U << s;
  return n % d;
}

static unsigned mod1(unsigned n, unsigned s) {
  const unsigned int d = 1U << s;
  return n & (d - 1);
}

typedef unsigned (*mod_type)(unsigned, unsigned);

static mod_type resolve_mod() {
  unsigned char buffer;
  syscall(SYS_getrandom, &buffer, sizeof(buffer), 0); // slightly overkill
  if (buffer & 1)
    return mod0;
  else
    return mod1;
}

unsigned mod(unsigned, unsigned) __attribute__((ifunc("resolve_mod")));
```
A single symbol mod appears in the symbol table; a direct call is generated at each call site, but either mod0 or mod1 is actually executed, depending on the result of resolve_mod. That's a kind of relocation-time indirect call.
Comdat station [0]
Another source of bafflement is the handling of comdat. It seems that if a single global from a given comdat section is alive, then all globals in the same comdat section are alive too. It's an all-or-nothing rule.
I couldn't find the official reason for this rule, but quoting the test case test/Transforms/GlobalDCE/comdats.ll:
; First test checks that if one function in a comdat group is used, both other ; functions and other globals even if unused will be preserved.
[...]
; Tenth test checks that we can eliminate a comdat when it has multiple ; participants that form internal cyclic uses but are never used externally and ; thus the entire ifunc can safely be eliminated.
Functions
In a similar manner to global variables, anything involved in the body of a live function is alive. The piece of code that handles this is:
```cpp
for (Use &U : F->operands())
  MarkUsedGlobalsAsNeeded(cast<Constant>(U.get()));

for (BasicBlock &BB : *F)
  for (Instruction &I : BB)
    for (Use &U : I.operands())
      if (GlobalValue *GV = dyn_cast<GlobalValue>(U))
        GlobalIsNeeded(GV);
      else if (Constant *C = dyn_cast<Constant>(U))
        MarkUsedGlobalsAsNeeded(C);
```
And it's the source of a funny problem. If one goes back to the original source code with multiple fadd float %i, %j and benchmarks the Global DCE pass using opt -o /dev/null -analyze -globaldce -time-passes -disable-verify as a rough way to get Global DCE timings for various function sizes, we get the following numbers (LLVM r293321 compiled in RelWithDebInfo mode; the trend matters more than the absolute values):
| SLOC     | 100000 | 500000 | 1000000 |
|----------|--------|--------|---------|
| Time (s) | 0.0012 | 0.0040 | 0.0120  |
Even if we only have a single global that is trivially alive in the module, we still observe non-constant complexity (with respect to the number of instructions in the input bitcode) as the function grows.
The problem lies in the way the algorithm is designed: it needs to scan every instruction of every live function to update liveness information. As a corollary, if we use a slightly different code that adds a global variable instead of a constant, we get a similar execution time.
Faster Global Dead Code Elimination
Another approach to the problem is to compute liveness based on the uses of every global. Walking the users of all globals builds a dependency graph between globals, and any global connected to a live global is itself alive (it means its value is used by the live global), which makes the coloring of the graph trivial. Once liveness has been correctly propagated, all dead globals are removed.
A pseudo algorithm could be:
```cpp
// init: globals that are live no matter what (visible linkage, ...)
SmallPtrSet<GlobalValue *, 8> Live = ComputeTriviallyLiveVariables(M);

// dependency graph: an edge User -> Used means that if User is live,
// Used must be kept alive too
std::unordered_multimap<GlobalValue *, GlobalValue *> Deps;
for (GlobalValue &GV : M.global_values())
  for (GlobalValue *User : ComputeGlobalsInUsers(&GV))
    Deps.insert(std::make_pair(User, &GV));

// propagate liveness through the graph with a worklist
SmallVector<GlobalValue *, 8> LiveGVs{Live.begin(), Live.end()};
while (!LiveGVs.empty()) {
  GlobalValue *LGV = LiveGVs.pop_back_val();
  for (auto &&Dep : make_range(Deps.equal_range(LGV))) {
    auto const Ret = Live.insert(Dep.second);
    if (Ret.second) // newly discovered live global
      LiveGVs.push_back(Dep.second);
  }
}

// removal: everything not marked live is dead
std::vector<GlobalVariable *> DeadGV;
for (GlobalVariable &GV : M.globals())
  if (!Live.count(&GV))
    DeadGV.push_back(&GV);
```
In a sense, it's like precomputing part of the liveness computation: the algorithm requires some additional bookkeeping, but it has the benefit of scaling extremely well on the original example:
| SLOC     | 100000 | 500000 | 1000000 |
|----------|--------|--------|---------|
| Time (s) | 0.0    | 0.0    | 0.0     |
So, we're extremely fast if we have nothing to do :-)
If we produce the worst situation, where a global is involved in most instructions:
```c
extern volatile float j;

float foo(float i) {
    i += j;
    i += j;
    // repeat n times
    i += j;
    return i;
}
```
```llvm
@j = external global float

define float @foo(float %i) {
  %1 = load volatile float, float* @j
  %2 = fadd float %i, %1
  %3 = load volatile float, float* @j
  %4 = fadd float %2, %3
  ; repeat N times
  %N_1 = load volatile float, float* @j
  %N = fadd float %N_2, %N_1
  ret float %N
}
```
In a similar manner to the first testbed, the C code is first compiled with clang -O2, then benchmarked with opt -globaldce.
We get the following timing:
| SLOC     | 100000 | 500000 | 1000000 |
|----------|--------|--------|---------|
| Time (s) | 0.006  | 0.035  | 0.069   |
Compared to the original implementation, on the same code:
| SLOC     | 100000 | 500000 | 1000000 |
|----------|--------|--------|---------|
| Time (s) | 0.006  | 0.031  | 0.058   |
That's within the same order of magnitude: maintaining the graph structure is slightly more costly than the original approach in the worst case, when the number of global uses is of the same order as the number of instructions.
On an intermediate test case generated from the pattern i += j + k + 3;, the new algorithm starts to pay off. The benefit actually depends on the ratio of global to non-global operands.
Conclusion
These changes to the global dead code elimination algorithm should be unnoticeable in most cases. But in the context of code obfuscation, some corner cases appear, and the new algorithm handles them.
And from a theoretical point of view, we've made a different compromise on the complexity of the algorithm, which is already a great source of satisfaction :-)
The final patch landed in the LLVM repo, so feel free to report any performance bug!
Thanks
Submitting the patch to LLVM led to a very interesting review; thanks a lot to @mehdi_amini and @davide for their advice!
My colleague Adrien Guinet helped me a lot polishing the details of the new implementation and (as too often) pointing out its shortcomings!
[0] | Can you find the reference hidden in this one? |