Where building a custom obfuscated Python interpreter for a Python packer turns into building a (slightly) optimized Python interpreter.
The paper Looking inside the (Drop) box by Przemysław Węgrzyn and Dhiru Kholia gave me the motivation to spend some time playing around with the CPython interpreter, trying to leverage freeze.py to build an obfuscated Python packer. Part of the obfuscation used by Dropbox relies on a custom interpreter: the opcodes are permuted, so that one needs to recover the permutation in order to understand the bytecode. In this post, we will go a step further by extending the base opcodes with new ones. This modification forces an attacker to understand the custom interpreter (its main evaluation loop) in order to decode the obfuscated bytecode.
Python Bytecode Primer
It's pretty simple to inspect the bytecode of a module/class/function in Python, using the (standard!) dis module:
>>> def fib(n):
...     return fib(n - 1) + fib(n - 2) if n > 1 else n
...
>>> import dis
>>> dis.dis(fib)
  2           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               4 (>)
              9 POP_JUMP_IF_FALSE       40
             12 LOAD_GLOBAL              0 (fib)
             15 LOAD_FAST                0 (n)
             18 LOAD_CONST               1 (1)
             21 BINARY_SUBTRACT
             22 CALL_FUNCTION            1
             25 LOAD_GLOBAL              0 (fib)
             28 LOAD_FAST                0 (n)
             31 LOAD_CONST               2 (2)
             34 BINARY_SUBTRACT
             35 CALL_FUNCTION            1
             38 BINARY_ADD
             39 RETURN_VALUE
        >>   40 LOAD_FAST                0 (n)
             43 RETURN_VALUE
CPython uses a virtual stack machine and its bytecode is fairly easy to read. Some opcodes take an argument, e.g. LOAD_FAST takes the index of the variable to load; others do not, e.g. BINARY_SUBTRACT pops the two topmost elements of the stack and pushes their difference. Each opcode is encoded on a single byte, but not all 256 values are used: more than a hundred are still available. That's more than a hundred opportunities to wreak havoc in the interpreter!
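To see how much room is actually left, the standard opcode module exposes the opcode table of the running interpreter; a throwaway snippet such as the following (Python 2 syntax, to match the interpreter used throughout this post) counts the unassigned byte values:

# How many one-byte values are not mapped to any opcode in this interpreter?
import opcode
free = sorted(set(range(256)) - set(opcode.opmap.values()))
print "%d values available, e.g. %r" % (len(free), free[:5])

Any of those values can host a brand-new instruction, which is exactly what we are about to do.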
Introducing CISC for Python
If you look at the above code in detail, you may notice that:
LOAD_FAST 0
LOAD_CONST 1
or:
LOAD_FAST 0
LOAD_CONST 2
are used a lot. What about combining them into a new instruction, LOAD_FAST_ZERO_LOAD_CONST, that would perform the two operations in a row? This way (and without this explicit name) someone trying to understand the code will hopefully choke on this unknown instruction :-) To do so, we have to build our own interpreter. Easy enough: download the CPython source archive, untar it, and dive into the source. The first thing to do is to declare the new opcode, which can be done by adding this line at the end of Include/opcode.h:
#define LOAD_FAST_ZERO_LOAD_CONST 148
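The opcode can also be given a name on the Python side, in Lib/opcode.py, so that dis displays it instead of a raw number. A possible edit (just a sketch, assuming it is inserted next to the existing def_op calls, before the file deletes its helpers) would be:

# Hypothetical Lib/opcode.py addition, mirroring the C-side #define.
def_op('LOAD_FAST_ZERO_LOAD_CONST', 148)
hasconst.append(148)    # its argument indexes co_consts, just like LOAD_CONST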
That step is not mandatory, though; leaving the name out even helps the obfuscation, since dis will only display a raw value. Then we need to perform some pattern matching on the bytecode to detect our sequence. Fortunately, there is already a peephole optimizer in CPython (in Python/peephole.c); it is quite easy to understand and to modify in order to turn any sequence of:
LOAD_FAST 0
LOAD_CONST n
into:
LOAD_FAST_ZERO_LOAD_CONST n
plus a few NOPs as padding (the optimizer removes the NOPs and fixes up the jump targets in a later pass). A pure-Python illustration of the rewrite is sketched below.
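The real change is written in C inside Python/peephole.c; the snippet below is only a rough illustration of the idea, assuming the CPython 2.7 bytecode layout (one opcode byte, followed by a two-byte little-endian argument for opcodes at or above HAVE_ARGUMENT) and leaving the NOP clean-up to the existing machinery:

# Illustration only: the rewrite performed by the peephole.c patch,
# expressed on a raw CPython 2.7 bytecode string.
import opcode

LOAD_FAST = opcode.opmap['LOAD_FAST']
LOAD_CONST = opcode.opmap['LOAD_CONST']
NOP = opcode.opmap['NOP']
LOAD_FAST_ZERO_LOAD_CONST = 148          # must match Include/opcode.h

def merge_load_fast_zero_load_const(codestring):
    code = bytearray(codestring)
    i = 0
    while i + 6 <= len(code):
        # Match LOAD_FAST 0 (3 bytes) immediately followed by LOAD_CONST n (3 bytes).
        if (code[i] == LOAD_FAST and code[i+1] == 0 and code[i+2] == 0
                and code[i+3] == LOAD_CONST):
            arg_lo, arg_hi = code[i+4], code[i+5]
            # Replace the 6 bytes with the merged opcode plus 3 NOPs, so that
            # jump targets stay valid; the optimizer strips the NOPs afterwards.
            code[i:i+6] = bytearray([LOAD_FAST_ZERO_LOAD_CONST, arg_lo, arg_hi,
                                     NOP, NOP, NOP])
            i += 6
        else:
            i += 3 if code[i] >= opcode.HAVE_ARGUMENT else 1
    return bytes(code)

Back to the actual patch: if we compile the interpreter now, we end up with this message: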
$> make -j
XXX lineno: 409, opcode: 148
Traceback (most recent call last):
  [...]
SystemError: unknown opcode
Good news! The peephole optimizer works, as the new opcode is now generated. Bad news: it is still unknown to the evaluation loop. In fact, that is no surprise: we have done nothing to describe its behaviour. Let us update the big switch in Python/ceval.c with a case that handles the new opcode:
case LOAD_FAST_ZERO_LOAD_CONST:
    /* first half: LOAD_FAST 0 */
    x = GETLOCAL(0);
    if (x != NULL) {
        Py_INCREF(x);
        PUSH(x);
        /* second half: LOAD_CONST oparg */
        x = GETITEM(consts, oparg);
        Py_INCREF(x);
        PUSH(x);
        goto fast_next_opcode;
    }
    /* local 0 is unbound: report it, as LOAD_FAST would */
    format_exc_check_arg(PyExc_UnboundLocalError,
                         UNBOUNDLOCAL_ERROR_MSG,
                         PyTuple_GetItem(co->co_varnames, 0));
    break;
then recompile:
$> make -j [...]
It works! How does fib get disassembled now? Let's use our custom Python interpreter:
$> ./python
>>> def fib(n):
...     return fib(n - 1) + fib(n - 2) if n > 1 else n
...
>>> import dis
>>> dis.dis(fib)
  1           0 <148>                    1
              3 COMPARE_OP               4 (>)
              6 POP_JUMP_IF_FALSE       31
              9 LOAD_GLOBAL              0 (fib)
             12 <148>                    1
             15 BINARY_SUBTRACT
             16 CALL_FUNCTION            1
             19 LOAD_GLOBAL              0 (fib)
             22 <148>                    2
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1
             29 BINARY_ADD
             30 RETURN_VALUE
        >>   31 LOAD_FAST                0 (n)
             34 RETURN_VALUE
And there they are: our new opcodes, left unnamed because we did not update Lib/opcode.py. The code is now slightly more difficult to understand!
Performance
Python is not a particularly efficient language, as showcased by the Computer Language Benchmarks Game. Let's take a reference timing of the above function with the unmodified interpreter (also built from source, to get a fair comparison):
$> ./python -m timeit -s 'def fib(n):return fib(n - 1) + fib(n - 2) if n > 1 else n' 'fib(20)'
100 loops, best of 3: 2.41 msec per loop
What is the impact of our modification? Let's use the modified interpreter:
$> ./python -m timeit -s 'def fib(n):return fib(n - 1) + fib(n - 2) if n > 1 else n' 'fib(20)'
100 loops, best of 3: 2.23 msec per loop
It's slightly faster! Why? Because:
I have chosen a stupid benchmark that always calls the same instructions over and over;
We have removed an interpretation step: each merged pair of instructions now costs a single fetch/decode/dispatch instead of two.
Conclusion
Extending Python's opcode set by aggregating several opcodes into a new one provides several benefits:
the resulting bytecode is slightly smaller: less disk usage! (a quick check is sketched after this list)
the resulting bytecode is slightly faster: less CPU usage!
the resulting bytecode is slightly obfuscated: more funz!
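Regarding the first point, a quick and admittedly crude way to check it is to compare the length of the compiled bytecode produced by the stock and the modified builds; the exact numbers depend on the function, but for fib the merged opcodes shave a few bytes off co_code:

# Run this under both ./python builds and compare the printed sizes.
def fib(n):
    return fib(n - 1) + fib(n - 2) if n > 1 else n

print len(fib.__code__.co_code), "bytes of bytecode"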
Resources
The interpreter patch is available here: https://gist.github.com/serge-sans-paille/71a4b1e656d70ae94bb4
Acknowledgment
Thanks to Haypo for the fruitful discussions, and to the Quarkslab team for their reviews!