Where building a custom obfuscated Python interpreter for a Python packer turned into an optimized Python interpreter.

The paper Looking inside the (Drop) box by Przemysław Węgrzyn and Dhiru Kholia gave me the motivation to spend some time playing with the CPython interpreter, trying to leverage freeze.py to build an obfuscated Python packer. Part of the obfuscation used by Dropbox relies on a custom interpreter: the opcodes are permuted, so that one needs to recover the permutation in order to understand the bytecode. In this post, we will go a step further by extending the base opcodes with new ones. This modification will force an attacker to understand the custom interpreter (its main evaluation loop) in order to decode the obfuscated bytecode.

Python Bytecode Primer

It's pretty simple to inspect the bytecode of a module/class/function in Python, using the (standard!) dis module:

>>> def fib(n):
...    return fib(n - 1) + fib(n - 2) if n > 1 else n
>>> import dis
>>> dis.dis(fib)
2           0 LOAD_FAST                0 (n)
            3 LOAD_CONST               1 (1)
            6 COMPARE_OP               4 (>)
            9 POP_JUMP_IF_FALSE       40
           12 LOAD_GLOBAL              0 (fib)
           15 LOAD_FAST                0 (n)
           18 LOAD_CONST               1 (1)
           21 BINARY_SUBTRACT
           22 CALL_FUNCTION            1
           25 LOAD_GLOBAL              0 (fib)
           28 LOAD_FAST                0 (n)
           31 LOAD_CONST               2 (2)
           34 BINARY_SUBTRACT
           35 CALL_FUNCTION            1
           38 BINARY_ADD
           39 RETURN_VALUE
      >>   40 LOAD_FAST                0 (n)
           43 RETURN_VALUE

CPython uses a virtual stack machine and its bytecode is fairly easy to read. Some of the opcodes take an argument, e.g. LOAD_FAST takes the index of the variable to load; others don't, e.g. BINARY_SUBTRACT reads the two topmost elements of the stack, pops them and pushes their difference. Opcodes are encoded on a single byte, but not all values are used: one hundred are still available. That's one hundred opportunities to wreak havoc in the interpreter!
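You can check how many byte values are free directly from the dis module (a quick look; the exact count depends on the CPython version):

>>> import dis
>>> used = set(dis.opmap.values())
>>> len([op for op in range(256) if op not in used])  # at least one hundred values are free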

Introducing CISC for Python

If you look at the bytecode above in detail, you may notice that the sequences:

LOAD_FAST                0
LOAD_CONST               1

or:

LOAD_FAST                0
LOAD_CONST               2

are used a lot. What about combining them into a new instruction, LOAD_FAST_ZERO_LOAD_CONST, that would perform the two operations in a row? This way (and without such an explicit name), someone trying to understand the code will hopefully choke on this unknown instruction :-) To do so, we have to build our own interpreter. Easy enough: download the CPython source archive, untar it, and dive into the source. The first thing to do is to declare the new opcode. This can be done by adding this line at the end of Include/opcode.h:

#define LOAD_FAST_ZERO_LOAD_CONST         148

Optionally, one can also update Lib/opcode.py so that dis prints a proper name for the new opcode, but that's not mandatory.
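For the curious, that addition would look roughly like the following, next to the existing def_op() calls in Lib/opcode.py; this is an illustration only, not something the patch below relies on:

def_op('LOAD_FAST_ZERO_LOAD_CONST', 148)  # argument: index into co_consts
hasconst.append(148)                      # let dis resolve the constant operand

Then we need to perform some pattern matching on the bytecode to detect our sequence. Fortunately, there is already a peephole optimizer in CPython (in Python/peephole.c); it's quite easy to understand, and to modify it in order to turn any sequence of: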

LOAD_FAST                0
LOAD_CONST               n

into:

LOAD_FAST_ZERO_LOAD_CONST n

and a few NOPs.
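The real patch (linked in the Resources section) is a bit more careful; here is a minimal sketch of the kind of case one can add to the opcode switch of PyCode_Optimize(), reusing helpers that already exist in Python/peephole.c (GETARG, SETARG, ISBASICBLOCK, NOP):

case LOAD_FAST:
    /* LOAD_FAST 0 ; LOAD_CONST n  -->  LOAD_FAST_ZERO_LOAD_CONST n + 3 NOPs */
    if (GETARG(codestr, i) == 0 &&
        codestr[i+3] == LOAD_CONST &&
        ISBASICBLOCK(blocks, i, 6)) {        /* no jump may land inside the pair */
        codestr[i] = LOAD_FAST_ZERO_LOAD_CONST;
        SETARG(codestr, i, GETARG(codestr, i+3));
        memset(codestr + i + 3, NOP, 3);     /* NOPs are compacted away by a later pass */
    }
    break;

If we compile the interpreter now, we end up with this message: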

$> make -j
XXX lineno: 409, opcode: 148
Traceback (most recent call last):
[...]
SystemError: unknown opcode

Good news! The peephole optimizer works, as the new opcode is now generated. Bad news: it is still unknown to the interpreter. No surprise there: we have done nothing to describe its behaviour. Let us update the big switch in Python/ceval.c with the case that handles the new opcode:

case LOAD_FAST_ZERO_LOAD_CONST:
    /* first half: LOAD_FAST 0 */
    x = GETLOCAL(0);
    if (x != NULL) {
        Py_INCREF(x);
        PUSH(x);
        /* second half: LOAD_CONST oparg */
        x = GETITEM(consts, oparg);
        Py_INCREF(x);
        PUSH(x);
        goto fast_next_opcode;
    }
    /* local slot 0 is unbound: report its name, not consts[oparg] */
    format_exc_check_arg(PyExc_UnboundLocalError,
                         UNBOUNDLOCAL_ERROR_MSG,
                         PyTuple_GetItem(co->co_varnames, 0));
    break;

then recompile:

$> make -j
[...]

It works! How does fib get disassembled now? Let's use our custom Python interpreter:

$> ./python
>>> def fib(n):
...     return fib(n - 1) + fib(n - 2) if n > 1 else n
>>> import dis
>>> dis.dis(fib)
  1           0 <148>                    1
              3 COMPARE_OP               4 (>)
              6 POP_JUMP_IF_FALSE       31
              9 LOAD_GLOBAL              0 (fib)
             12 <148>                    1
             15 BINARY_SUBTRACT
             16 CALL_FUNCTION            1
             19 LOAD_GLOBAL              0 (fib)
             22 <148>                    2
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1
             29 BINARY_ADD
             30 RETURN_VALUE
        >>   31 LOAD_FAST                0 (n)
             34 RETURN_VALUE

And there they are: our opcodes (unnamed, since we did not update Lib/opcode.py) show up as <148>. The code is now slightly more difficult to understand!
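As a quick sanity check, the fused opcode should not change the semantics; fib(20) still evaluates to 6765 under the patched interpreter:

$> ./python -c 'def fib(n): return fib(n - 1) + fib(n - 2) if n > 1 else n
print fib(20)'
6765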

Performance

Python is not a particularly efficient language, as showcased by the Computer Language Benchmarks Game. Let's take a reference timing of the above function with the unmodified interpreter (also built from source, to have a fair reference):

$> ./python -m timeit -s 'def fib(n):return fib(n - 1) + fib(n - 2) if n > 1 else n' 'fib(20)'
100 loops, best of 3: 2.41 msec per loop

What is the impact of our modification? Let's use the modified interpreter:

$> ./python -m timeit -s 'def fib(n):return fib(n - 1) + fib(n - 2) if n > 1 else n' 'fib(20)'
100 loops, best of 3: 2.23 msec per loop

It's slightly faster! Why? Because:

  1. I have chosen a stupid benchmark that runs the same few instructions over and over;

  2. We have removed an interpretation step: each fused pair now costs a single opcode fetch and dispatch instead of two.

Conclusion

Extending Python's opcode set by aggregating several opcodes into a new one provides several benefits:

  • the resulting bytecode is slightly smaller: less disk usage!

  • the resulting bytecode is slightly faster: less CPU usage!

  • the resulting bytecode is slightly obfuscated: more funz!

Resources

The interpreter patch is available here: https://gist.github.com/serge-sans-paille/71a4b1e656d70ae94bb4

Acknowledgment

Thanks to Haypo for the fruitful discussions, and to the Quarkslab team for their reviews!

