What theoretical tools are needed to simplify MBA expressions?

Mixed Boolean-Arithmetic expressions can be used as an obfuscation technique. Why are they hard to de-obfuscate, and what do we need to do so?

Context

A Mixed Boolean-Arithmetic expression (or MBA) is any expression composed of integer arithmetic operators, e.g. \((+, -, *)\) and bitwise operators, e.g. \((\land, \lor, \oplus, \neg)\). For example,

\begin{equation*} E = (x \oplus y) + 2*(x \land y) \end{equation*}

is an MBA expression.

Expressions mixing arithmetic and bitwise operators already exist in cryptographic algorithms: ARX designs are used in hash functions and symmetric encryption algorithms, and IDEA is a well-known block cipher of the 90s, famous for its combined use of integer arithmetic and bitwise operators. The rationale is to get efficient non-linear Boolean functions (addition is already implemented in hardware in most processors), while keeping the design simple and still difficult to analyze by cryptanalysts. Another example of a cryptographic primitive using MBA is the T-function: integer arithmetic operators such as \((+, -, *)\) are triangular T-functions. A careful mix of arithmetic and bitwise operators is needed to resist pure algebraic or bitwise attacks, as well as classical cryptanalysis such as linear or differential attacks.

MBA expressions and their use in obfuscation were defined in [1], even though, for the purpose of this blog post, we have extended their definition to consider more general types of expressions. Where Zhou et al. consider linear MBA expressions, we're interested in any kind of expression mixing both types of operators. Here, have a fancy mathematical definition (because who doesn't love those?): an expression \(E\) is a Mixed Boolean-Arithmetic expression on \(n\) bits if it is of the form:

\begin{equation*} E = \varphi_1 \circ \dots \circ \varphi_s(x_1, \dots, x_t) \qquad \text{ with } \quad \varphi_j = \sum_{i \in I}a_i e_i(x_1, \dots, x_t) \end{equation*}

where \(a_i\) are constants modulo \(2^n\) and \(e_i\) are bitwise expressions on \(t\) variables \((x_1, \dots, x_t)\) of \(n\) bits.
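To make the definition concrete, here is our first example \(E = (x \oplus y) + 2*(x \land y)\) written in exactly this shape: a list of coefficients \(a_i\) modulo \(2^n\) paired with bitwise expressions \(e_i\). This is an illustrative sketch; the names are ours.

```python
# The example E = (x ^ y) + 2*(x & y) in the shape of the definition:
# a sum of coefficients a_i times bitwise expressions e_i, reduced mod 2^n.
N = 8
MOD = 1 << N

coeffs = [1, 2]                              # a_1, a_2 (constants mod 2^8)
bitwise = [lambda x, y: x ^ y,               # e_1(x, y)
           lambda x, y: x & y]               # e_2(x, y)

def E(x, y):
    return sum(a * e(x, y) for a, e in zip(coeffs, bitwise)) % MOD

print(E(12, 10))  # evaluates the linear MBA at one point
```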

MBA expressions are used to obfuscate the data flow of a program: they can either hide a constant or "complexify" an operation. In this post, we'll focus on operator obfuscation, which is carried out mainly by rewriting one or several operations with MBA expressions. For example, the expression \(x + y\) can be rewritten as our first example, \((x \oplus y) + 2*(x \land y)\).
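This rewriting is exact, as a quick exhaustive check on 8-bit words confirms. A minimal Python sketch (the helper name is ours; the mask models arithmetic modulo \(2^8\)):

```python
# Exhaustive check that (x ^ y) + 2*(x & y) agrees with x + y on all
# 8-bit words; the & MASK models arithmetic modulo 2^8.
MASK = 0xFF

def obfuscated_add(x, y):
    return ((x ^ y) + 2 * (x & y)) & MASK

assert all(obfuscated_add(x, y) == (x + y) & MASK
           for x in range(256) for y in range(256))
print("identity holds on all 8-bit pairs")
```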

MBA simplification: not that simple

Or, put another way: why are MBA expressions used as an obfuscation technique?

Bibliography

First of all, MBA expressions barely exist in the scientific literature. As said before, the concept can be found in subjects related to cryptography, but aside from Zhou et al.'s work, there is, to our knowledge, no mention of MBA in academic papers. Even in works concerning cryptography or MBA directly, the main points of the papers do not concern the simplification of such expressions. This is rather understandable, since bitwise and arithmetic operators do not interact well: there are no rules (distributivity, factorization...) to cope with this mixing of operators. But obfuscation concerns programs, and machines can use both types of operators in a very transparent way. And that's where bit-vector logic appears, because it is the theoretical ground for machine computing, and the only place our MBA expressions can live happily (or almost happily).

Bit-vector logic, as the name suggests, accounts for the internal representation of variables and integers as bit words (or bit vectors). It also considers every operator we are interested in, meaning our mixed expressions can exist within this logic. Unfortunately, the literature on bit-vectors is focused on proving that a predicate is satisfiable, not on finding an alternative version of an expression that we could consider "simpler". Still, there are some interesting concepts here, such as bit-blasting, that can point us towards the beginning of a simplification technique. But this could fill a whole article, so we won't detail the method here (sorry reader, but your bit-blasted princess is in another article).
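To give a flavor of what bit-blasting does without spoiling that other article: it decomposes word-level operators into per-bit Boolean formulas. A minimal sketch (names are ours) is word addition expressed as a ripple-carry adder, i.e. only single-bit XOR/AND/OR operations:

```python
# Minimal illustration of bit-blasting: word-level addition decomposed
# into per-bit Boolean formulas (a ripple-carry adder), the kind of
# translation a bit-vector solver performs before calling a SAT engine.
N = 8

def bit(v, i):
    return (v >> i) & 1

def blasted_add(x, y):
    out, carry = 0, 0
    for i in range(N):
        a, b = bit(x, i), bit(y, i)
        s = a ^ b ^ carry                     # sum bit
        carry = (a & b) | (carry & (a ^ b))   # carry-out bit
        out |= s << i
    return out

assert all(blasted_add(x, y) == (x + y) % (1 << N)
           for x in range(256) for y in range(256))
```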

Tools

Same observation here: our MBA expressions are not welcome.

On the side of computer algebra software (Maple, Sage, ...), while arithmetic expressions (namely, polynomials) are well supported, bitwise operators often cannot be applied to symbolic variables, meaning that even the simple manipulation of an MBA expression is not possible. On the other hand, SMT solvers such as Z3 or Boolector implement bit-vector logic, but simplification is not their role: they aim at deciding satisfiability.
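What a solver can do for us is equivalence checking: ask whether "obfuscated \(\neq\) simplified" is satisfiable; an UNSAT answer means the two expressions agree everywhere. The sketch below phrases that query, but answers it by brute force on 8 bits (names are ours); a solver such as Z3 would answer the same question symbolically, for any width, via its bit-vector theory.

```python
# The shape of an SMT equivalence query: is obfuscated(x, y) != simplified(x, y)
# satisfiable? Here the question is settled exhaustively on 8-bit words;
# an SMT solver would settle it symbolically.
MOD = 1 << 8

def obfuscated(x, y):
    return ((x ^ y) + 2 * (x & y)) % MOD

def simplified(x, y):
    return (x + y) % MOD

counterexamples = [(x, y) for x in range(MOD) for y in range(MOD)
                   if obfuscated(x, y) != simplified(x, y)]
print("UNSAT" if not counterexamples else f"SAT: {counterexamples[0]}")
```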

Simplification?

We didn't even address one of the main questions here: what is simplifying? What does it mean to de-obfuscate MBA expressions?

If we consider polynomials, i.e. pure arithmetic expressions, we know they have a canonical form: the expanded form of a polynomial \(P = a_0 + a_1X + a_2X^2 + \dots + a_nX^n\) characterizes \(P\) in a unique way. Boolean expressions also have canonical forms (e.g. conjunctive normal form), and these extend easily to bitwise expressions.

But a canonical (resp. normal) form is not always synonymous with "more readable". For example, considering the expression \((1 + x)^{100}\), the expanded form \(1 + 100x + 4950x^2 + \dots + 4950x^{98} + 100x^{99} + x^{100}\), composed of 101 monomials, is clearly more confusing for a human than the factorized form. Even in terms of internal representation, a machine would probably prefer storing the factorized form.
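The figures quoted above can be checked in a couple of lines with the standard library: the expanded form of \((1 + x)^{100}\) has the binomial coefficients \(\binom{100}{k}\) as coefficients, so there are 101 monomials.

```python
# Check the numbers quoted above for (1 + x)^100: the expansion has 101
# monomials, and the coefficient of x^k is the binomial coefficient C(100, k).
from math import comb

coefficients = [comb(100, k) for k in range(101)]

assert len(coefficients) == 101        # 101 monomials
assert coefficients[1] == 100          # coefficient of x
assert coefficients[2] == 4950         # coefficient of x^2
assert coefficients[98] == 4950        # coefficient of x^98 (by symmetry)
print(len(coefficients), "monomials")
```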

Thus, even if there were a normal form for MBA expressions (and there does not seem to be one that is easy to find), it would not guarantee something simpler to process for human or automatic analysis. This is part of the problem: obfuscation is designed to resist automatic analysis of programs with tools (disassemblers, debuggers, ...), but also human analysis. And trying to define or quantify human comprehension would be quite a challenge!

Still, we can come up with some basic ideas of what could make an expression simpler:

  • Reducing the size of the expression should make it easier to apprehend and manipulate.
  • Reducing the "MBA aspect" (sorry, no fancy mathematical definition here), meaning the parts of the expressions that are mixing both operators, should allow us to have bigger parts containing only one type of operator. Then we could use what is available to simplify pure arithmetic or bitwise expressions.

This could be our ticket to some kind of simplification algorithm! Let's see what we can do with that.

Symbolic simplification

As said before, there are theoretical grounds and tools to manipulate pure arithmetic or pure bitwise expressions (the tools are less convincing on the bitwise side, but let's forget that for now). Even if the normal form is not always what we are seeking, some standard rules definitely simplify an expression, for example by removing useless terms: the transformations \((x + 0) \rightarrow x\) and \((x \land 0) \rightarrow 0\) produce simpler expressions in terms of size and operators (the results are neither bitwise nor arithmetic expressions, just a variable or a constant). Could we not build on what already exists, and just add what is needed to handle the MBA parts?
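Such standard rules are easy to apply bottom-up on an expression tree. A minimal sketch, with our own toy representation (tuples `(op, left, right)`) and only the two rules just mentioned; a real simplifier would of course carry many more rules.

```python
# Rule-based cleanup on a tiny expression tree: nodes are (op, left, right)
# tuples, leaves are variable names or integer constants. Illustrative only.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr                      # leaf: variable or constant
    op, a, b = expr
    a, b = simplify(a), simplify(b)      # simplify children first
    if op == '+' and b == 0:
        return a                         # (x + 0) -> x
    if op == '&' and b == 0:
        return 0                         # (x & 0) -> 0
    return (op, a, b)

assert simplify(('+', 'x', 0)) == 'x'
assert simplify(('+', ('&', 'y', 0), 0)) == 0
```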

We indicated in our introduction that operator obfuscation with MBA expressions is conducted through multiple rewritings of operations. For example, by applying the rewriting rule \((x + y) \color{red}{\rightarrow} (x \oplus y) + 2*(x \land y)\) recursively on the input \((x + 45)\), we obtain:

\begin{equation*} x + 45 \color{red}{\rightarrow} (x \oplus 45) + 2*(x \land 45) \color{red}{\rightarrow} ((x \oplus 45) \oplus 2*(x \land 45)) + 2*((x \oplus 45) \land 2*(x \land 45)) \end{equation*}

Supposing we know the set of rewriting rules used, and that they are invertible (more on that later), we can simplify MBA expressions by reversing this process! This is called term rewriting (the theoretical ground) or pattern matching (the implementation of the actual rewriting).
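Both rewriting steps shown above preserve the value of the expression, which is easy to confirm numerically on 8-bit words (helper names are ours):

```python
# The two rewriting steps applied to (x + 45), checked on 8 bits: each
# application of (x + y) -> (x ^ y) + 2*(x & y) preserves the value mod 2^8.
MOD = 1 << 8

def step1(x):
    # (x ^ 45) + 2*(x & 45)
    return ((x ^ 45) + 2 * (x & 45)) % MOD

def step2(x):
    # the rule applied again, to the outer +
    a, b = x ^ 45, (2 * (x & 45)) % MOD
    return ((a ^ b) + 2 * (a & b)) % MOD

assert all(step1(x) == (x + 45) % MOD for x in range(MOD))
assert all(step2(x) == (x + 45) % MOD for x in range(MOD))
```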

Then what are the difficulties of this simplification technique?

Detecting the pattern

The first thing we need to do is detect the pattern we want to replace. For example, if we suppose the obfuscated expression has been obtained with the previous rewriting rule \((x + y) \color{red}{\rightarrow} (x \oplus y) + 2*(x \land y)\), we might want to detect any pattern of the form \((A \oplus B) + 2*(A \land B)\), with \(A, B\) wildcards representing variables, constants, or even expressions. The problem is that detecting this kind of pattern is not always trivial.

Detecting patterns can be hard because expressions take multiple forms: for example, \(\neg x\) could also appear as \((-x - 1)\) or \((x \oplus -1)\). One can imagine some process of "normalization" for those examples, but as we said before, normalization for MBA is not something we master, and some forms might be trickier to deal with. An annoying example is \(2*(x \land y) \rightarrow (2x \land 2y)\) (which also works with \(\lor\) or \(\oplus\) instead of \(\land\)). Such rewriting rules add work for the analyst, since there is not always a single way to invert them.

Let's take a part of our previous obfuscation of \((x + 45)\): the subexpression \(2*(x \land 45)\). If we use the previous rewriting rule and then apply some constant folding (supposing we work on 8 bits, for example), we obtain the expression \((2x \land 90)\). If we want to apply the inverse rewriting rule, namely \((2x \land 2y) \color{blue}{\rightarrow} 2*(x \land y)\), we are confronted with a choice, because both \(45*2\) and \(173*2\) give us \(90\) modulo \(2^8\). Which means that, by trying to simplify our example of obfuscated \((x + 45)\), we could make the wrong choice:

\begin{align*} (x \oplus 45) + (2x \land 90) &\color{blue}{\rightarrow} (x \oplus 45) + 2*(x \land 173) &\text{stuck...}\\ &\color{blue}{\rightarrow} (x \oplus 45) + 2*(x \land 45) \color{red}{\rightarrow} (x + 45) &\text{yay!} \end{align*}

This is a very simple example, but it is just designed to show you one of the multiple little things that will bother you when you try to detect your pattern.
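The frustrating part is that both inversions are semantically valid on 8 bits; only one of them recreates the pattern \((A \oplus B) + 2*(A \land B)\) that lets the simplification continue. A quick numerical check:

```python
# The 8-bit ambiguity above: 2*45 and 2*173 are both congruent to 90
# mod 256, so (2x & 90) folds back to 2*(x & 45) or 2*(x & 173), and
# both rewritings are semantically correct.
MOD = 1 << 8

for c in (45, 173):
    assert (2 * c) % MOD == 90
    assert all((2 * x) & 90 == (2 * (x & c)) % MOD for x in range(MOD))

# But only c = 45 recreates the pattern (A ^ B) + 2*(A & B) with A = x,
# B = 45, which is what allows the final step back to (x + 45):
assert all(((x ^ 45) + 2 * (x & 45)) % MOD == (x + 45) % MOD
           for x in range(MOD))
```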

On another note, commutativity and associativity of operators will also give you a hard time. Commutativity just means inverting the order of operands in your pattern, but associativity can drastically increase the number of possibilities you have to match. This is because expressions are often represented with binary operators, taking only two operands: \((x + y + z)\) can be represented as \(((x + y) + z)\) or \((x + (y + z))\), not to mention all the other variations induced by commutativity. A good approach is to use n-ary operators, but this still means that every combination of your pattern's operands with the n operands has to be tried.

Properties of the set of rules

If we look into the theoretical side of pattern matching, called term rewriting (check [2] for insights), we find some interesting questions too. Let's imagine we have a list of rewriting rules that could allow us to simplify an MBA expression. Some properties of this set of rules can guarantee certain outcomes for our simplification algorithm.

The first of these properties is termination, which means that, in a finite number of rewriting steps, we always reach a state in which no more rewriting rules apply: our set of rewriting rules cannot loop forever, whatever the input expression. In our case, termination can be quite easy to achieve: if all the rules reduce the size of the expressions (in terms of number of operators, for example), the set is terminating. In our experience, simplification rules very often reduce the size of the expressions they are applied to.

The other interesting property is confluence, which means that if different rules apply to an expression \(E\), leading to two different expressions \(E_1\) and \(E_2\), then there is always an expression \(E'\) reachable from both \(E_1\) and \(E_2\) with the set of rewriting rules. This certifies that, when given multiple choices while applying the rewriting rules, either choice eventually leads to the same expression. Without confluence, the simplification algorithm would probably need a backtracking part, to test different orders of application of the rewriting rules. If we remember the annoying \((2x \land 2y) \color{blue}{\rightarrow} 2*(x \land y)\), we can guess that confluence won't be easy to attain with such rules.

Conclusion

MBA expressions are interesting partly because there is little public knowledge about them, even though they are related to a lot of disciplines: cryptography, computer algebra, bit-vector logic... As we have seen, MBA simplification raises interesting points to look into, both practical and theoretical. We have tried to highlight the most interesting, challenging, or just troublesome ones.

References

[1] Y. Zhou, A. Main, Y. X. Gu, and H. Johnson. Information hiding in software with mixed boolean-arithmetic transforms. In WISA, 2007.
[2] F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press, Aug. 1999.