Quokka: A Fast and Accurate Binary Exporter

Quarkslab is open-sourcing Quokka, a binary exporter to manipulate a program's disassembly without a disassembler. This blog post introduces the project, details some parts of its inner workings, and showcases some potential usages. Quokka enables users to write complex analyses on a disassembled binary without dealing with the disassembler API.


Quokka Logo (generated by DALL·E)

Introduction

Analyzing binary programs often requires disassembling them. It is the backbone of security workflows for multiple topics like malware analysis, vulnerability research, or binary instrumentation. Thus, disassembling is crucial to inspect untrusted or proprietary binaries whose source code is not available.

As correctly disassembling is an open problem, the security community has offloaded this task to specialized tools. Some are commercial (IDA, Binary Ninja, Jeb...), and others are open-source (Ghidra, BAP, McSema). The main problem faced by disassemblers is recovering information (e.g. symbols, types) lost during compilation. Indeed, converting a sequence of bytes into meaningful assembly instructions is often insufficient. Typical tasks for disassemblers involve finding references between code and data, recovering function boundaries, identifying typical language structures (e.g. jump tables or virtual tables), or reconstructing the Control Flow Graph. To do so, disassemblers rely either on algorithms, which produce results with some correctness guarantees, or on heuristics based on common patterns, which offer fewer guarantees.

Disassemblers are usually heavy and complex software, ill-suited both to performing custom analyses on a disassembled program and to analyzing multiple binaries simultaneously. Moreover, their APIs may be convoluted and painful to use (looking at you, IDA!). If only the disassembler's output is needed for further analysis, why not extract it to run offline queries? That is what Quokka is about.

A Review of Existing Binary Exporters

This blog post follows one we published in 2019: An Experimental Study of Different Binary Exporters.

In this previous blog post, we established the state of the art for various binary exporters: tools producing binary exports, i.e. standalone files (usable without the disassembler) containing data from the disassembled binary. The situation is almost the same three years later: no new player has entered the game.

Today's best choice for a user is BinExport, which exports the disassembly from IDA, Binary Ninja, and Ghidra. However, it lacks bindings to read the disassembly seamlessly and is tailored to be used with BinDiff.

Quokka: A Fast and Accurate Binary Exporter

Quokka offers a generic binary exporter, suited for various contexts. It abides by the following properties:

  • Exhaustivity: To be used in various contexts, Quokka exports as much data as possible.
  • Efficiency: To ease the integration inside analysis workflows and not create a bottleneck, Quokka is fast. The export time is negligible compared to the disassembly time.
  • Compactness: To avoid unnecessary disk usage and allow seamless export file sharing between users, the Quokka export file is compact.

Quokka is composed of two independent parts:

  • An IDA plugin that generates an export file.
  • Python bindings to manipulate the exported file seamlessly.

Of note, while generating an export file requires an IDA installation, the result is usable without it.

Using Quokka

Generating the export file

The first step before using Quokka is to generate an export file using the IDA plugin. If you don't have an IDA installation, you can skip this part and directly download an export file from here.

After installing the plugin, the easiest way of generating the export file is to run the following command:

$ idat64 -OQuokkaAuto:true -A path/to/the/binary
[...]
INFO 12:35:24 Starting to register Quokka (version 0.0.3)
INFO 12:35:24 Auto Export
INFO 12:35:24 Exporter set in NORMAL
INFO 12:35:24 Starting to export to [...]/docs/samples/qb-crackme.quokka
INFO 12:35:24 Start to export FileMetadata
INFO 12:35:24 FileMetadata exported (took 0.00s)
INFO 12:35:24 Start to export segments
INFO 12:35:24 Segments exported (took 0.00s)
INFO 12:35:24 Start export enums and structures
INFO 12:35:24 Enum and structures written (took 0.00s)
INFO 12:35:24 Start to export Layout
INFO 12:35:24 End export layout in 0.00s
INFO 12:35:24 Start to write layout.
INFO 12:35:24 Start to write mnemonic.
INFO 12:35:24 Finished to write mnemonics (took: 0.00s)
INFO 12:35:24 Start to write operand strings.
INFO 12:35:24 Finished to write operand_strings (took: 0.00s)
INFO 12:35:24 Start to write operands
INFO 12:35:24 Finished to write operands (took: 0.00s)
INFO 12:35:24 Start to write instructions
INFO 12:35:24 Finished to write instructions (took: 0.00s)
INFO 12:35:24 Start to write func chunks
INFO 12:35:24 Finished to write func_chunks (took: 0.00s)
INFO 12:35:24 Start to export and write functions
INFO 12:35:24 Finished to export/write functions (took : 0.00s)
INFO 12:35:24 Start to transform references
INFO 12:35:24 Start to write data, comments and references
INFO 12:35:24 Finished to write data comments and references (took : 0.00s)
INFO 12:35:24 File [..]/docs/samples/qb-crackme.quokka is written
INFO 12:35:24 quokka finished (took 0.01s)
INFO 12:35:24 Quokka: terminate

It is also worth mentioning that the export can be generated using the Python API.

import quokka

# Quokka respects IDA_PATH to find idat64
program = quokka.Program.from_binary("docs/samples/qb-crackme")

Warning: The IDA plugin support for Windows is only experimental.

Loading and Manipulating the Export

import quokka

# To load a Program, use the paths to the export file and the binary itself
prog = quokka.Program("docs/samples/qb-crackme.quokka", "docs/samples/qb-crackme")
print(prog)

for func in prog.values():
    print(f"Function {func.name} at 0x{func.start:x}")
    for block_start in func.graph.nodes:
        block = func.get_block(block_start)
        print(f"\tBlock at 0x{block_start:x} with {len(block)} instructions")

<Program qb-crackme (ArchX86)>

Function _init_proc at 0x8049000
        Block at 0x8049000 with 7 instructions
        Block at 0x8049019 with 1 instructions
        Block at 0x804901b with 3 instructions
Function sub_8049020 at 0x8049020
        Block at 0x8049020 with 2 instructions
[...]

The snippet above shows how to load a program with Quokka and print the list of functions within the binary. Interested readers can refer to the documentation for a more thorough example.

Architecture

The IDA plugin

The IDA plugin is composed of about 3,500 lines of C++ code and targets IDA's latest versions (7.6 and onwards). The export phase is divided into three parts:

  • The first one exports everything related to the program itself but not in its address space. During this phase, the metadata, the segments, and the structures are exported.
[...]
INFO 12:56:53 Start to export FileMetadata
INFO 12:56:53 Start to export segments
INFO 12:56:53 Start export enums and structures
[...]
  • The second phase is the main one. It performs a single linear scan of the program address space and exports every item found during the scan. In this phase, the instructions, the functions (and their chunks), and the data are exported.
INFO 12:56:53 Start to export Layout
INFO 12:56:53 Start to write layout.
INFO 12:56:53 Start to write mnemonic.
INFO 12:56:53 Start to write operands
INFO 12:56:53 Start to write instructions
INFO 12:56:53 Start to write func chunks
INFO 12:56:53 Start to export and write functions
  • Finally, during the last phase, all the references are sorted and resolved between the different items (e.g. structures, instructions, or data). This step is crucial because references are one of the most important elements in the disassembler output.
INFO 12:56:53 Start to transform references
INFO 12:56:53 Start to write data, comments and references

Space optimizations on the wire

Quokka generates a Protobuf file to store the information on the wire. We discussed different binary serialization formats in the previous blog post (An Experimental Study...). For our use cases, Protobuf offers the best trade-off: it is compact while still being fast at deserializing data. However, to further reduce the exported file size, Quokka leverages some of Protobuf's optimizations.

Addresses or Offsets

Most program items (e.g. functions or instructions) have an associated address within the program address space. This element is key for numerous analyses and needs to be exported. However, programs usually have a large base address (e.g. 0x400000), and in Protobuf the size of an integer on the wire depends on its absolute value: with the varint encoding, small integers take fewer bytes than large ones. Storing absolute addresses is therefore wasteful.

In Quokka, function addresses are stored as offsets to the program base address and block addresses as offsets to the function start. To go even further, only the instruction sizes are kept, and their addresses are dynamically recomputed during deserialization.
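To make the gain concrete, here is a minimal sketch in plain Python. It does not use Quokka or the Protobuf library: `varint_size` and `instruction_addresses` are illustrative helpers written for this post, and the base address and offset are hypothetical values. It computes how many bytes a varint needs for an absolute address versus an offset, then rebuilds instruction addresses from a block start and a list of sizes.

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to encode a non-negative integer as a varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    size = 1
    while value >= 0x80:  # each varint byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


def instruction_addresses(block_start: int, sizes: list[int]) -> list[int]:
    """Rebuild instruction addresses from a block start and the instruction sizes."""
    addresses = []
    current = block_start
    for size in sizes:
        addresses.append(current)
        current += size
    return addresses


base = 0x400000           # hypothetical program base address
function_offset = 0x1020  # hypothetical offset of a function from the base

print(varint_size(base + function_offset))  # bytes for the absolute address
print(varint_size(function_offset))         # bytes for the offset only
print([hex(a) for a in instruction_addresses(base + function_offset, [1, 2, 3])])
```

With these values, the absolute address needs 4 bytes while the offset needs 2, and the three instruction addresses are recovered without ever being stored.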

As we show at the end of this article, this optimization (and the next ones) helps improve the compactness of the files generated by Quokka.

Data Deduplication

A program may use the same item multiple times, but at different addresses. For example, the instruction push ebp may be used by each function. To improve the storage compactness, Quokka stores each item only once in a table and refers to it by its index in this table. To reduce storage usage further, items are sorted by frequency so that the most frequent items get the lowest indexes, which are the cheapest to encode.

In Quokka, the operands, the mnemonics, the instructions, and the data are stored in deduplicated tables. However, it's challenging to evaluate how much space the deduplication saves.
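The idea can be sketched in a few lines of plain Python. This is not Quokka's actual implementation (the plugin is written in C++): `build_table` is a hypothetical helper and the mnemonic list is made up for the illustration. It deduplicates items into a table sorted by decreasing frequency and replaces each item by its index.

```python
from collections import Counter


def build_table(items: list[str]) -> tuple[list[str], list[int]]:
    """Deduplicate items into a table sorted by decreasing frequency.

    Returns the table and, for each original item, its index in the table.
    """
    counts = Counter(items)
    # Most frequent first; ties are broken alphabetically to stay deterministic
    table = sorted(counts, key=lambda item: (-counts[item], item))
    index_of = {item: index for index, item in enumerate(table)}
    return table, [index_of[item] for item in items]


# Made-up sequence of mnemonics for the illustration
mnemonics = ["push", "mov", "sub", "mov", "call", "mov", "pop", "push", "ret"]
table, indexes = build_table(mnemonics)
print(table)    # each mnemonic appears once, the most frequent first
print(indexes)  # the original sequence, expressed as table indexes
```

The most frequent mnemonic receives index 0, and the original sequence can be rebuilt by looking up each index in the table.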

Default Values

In Protobuf, each field has an associated type, and each type possesses a default value. For example, the default value of the string type is the empty string, and numeric types default to 0. This is interesting because the Protobuf serializer does not write default values on the wire.

We leverage this property in Quokka to reduce the exported file size. For example, in the Instruction message, the field is_thumb defaults to False and is only set when dealing with thumb instructions in an ARM binary.

Summary

Let's consider the following extract of the Quokka Protobuf schema. It implements each of the optimizations mentioned previously.

message Quokka {

    message Instruction {
        uint32 size = 1;
        uint32 mnemonic_index = 2;
        bool is_thumb = 4;
    }

    repeated string mnemonics = 8;
}

  1. The instruction has no address.
  2. The instruction mnemonic is stored in the mnemonics table and only its index is used.
  3. The is_thumb field is only set to True for thumb instructions (The protobuf default value for boolean is False).
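To see what the three points above mean in bytes, the sketch below hand-encodes the Instruction message in plain Python. It is an illustration of the Protobuf wire format for varint fields only, assuming proto3 semantics where default values are omitted; `encode_varint` and `encode_instruction` are hypothetical helpers, not part of Quokka or of the Protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer with the Protobuf varint encoding."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_instruction(size: int, mnemonic_index: int, is_thumb: bool = False) -> bytes:
    """Hand-rolled encoding of the Instruction message from the schema extract."""
    out = bytearray()
    # (field number, value) pairs taken from the schema extract
    for field_number, value in ((1, size), (2, mnemonic_index), (4, int(is_thumb))):
        if value == 0:
            continue  # default values are not written on the wire
        out += encode_varint(field_number << 3)  # tag with wire type 0 (varint)
        out += encode_varint(value)
    return bytes(out)


print(encode_instruction(size=1, mnemonic_index=2).hex())
print(encode_instruction(size=1, mnemonic_index=2, is_thumb=True).hex())
print(encode_instruction(size=3, mnemonic_index=0).hex())
```

In this sketch, a non-Thumb instruction takes 4 bytes, setting is_thumb adds 2 bytes, and an instruction whose mnemonic has index 0 (the most frequent one in a frequency-sorted table) shrinks to 2 bytes.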

Usage Examples

Extracting data from a disassembler in a reusable format is a useful building block for numerous workflows. Let's try to see some potential use cases. While every example is feasible within IDA using its API, we believe it will be more natural with Quokka.

Feature Extraction

In some machine learning workflows, researchers need to extract data from a dataset to train or evaluate their algorithms. For example, AlphaDiff's authors developed a custom plugin to extract data from functions within the binary. The snippet below shows how to extract the same data (and others) with Quokka:

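import quokka
# Assumed import path for the FunctionType enum used below
from quokka.types import FunctionType


def extract_features(function: quokka.Function):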
    vector = [
        # In / Out degrees of the function
        (function.in_degree, function.out_degree),
        # Function bytes
        function.bytes,
        # Functions used
        set(imp.name for imp in function.calls if imp.type == FunctionType.IMPORTED),
        # Function size
        function.end - function.start,
        # Number of basic blocks
        len(function.graph),
        # Bag of mnemonics
        set(inst.mnemonic for inst in function.instructions),
    ]
    return vector

Binary Analysis

Sometimes, when analyzing a binary, it can be interesting to see if some so-called dangerous functions are used within the binary. With the following code, a user can quickly search through the program and flag potential usages:

candidate_functions = {"strcpy", "strcmp", "memcpy", ...}
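def find_dangerous_functions(program: quokka.Program):
    # Iterate over the functions themselves, as in the loading example
    for function in program.values():
        # Compare names with names: function.calls holds objects, not strings
        called_names = {callee.name for callee in function.calls}
        if found := candidate_functions.intersection(called_names):
            print(f"Function {function.name} calls {found}")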

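# Alternative: look up each dangerous function by name and list its callers
# (prog is the Program loaded in the previous section)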
for dangerous_function in candidate_functions:
    # Filter out functions not used in the program
    if dangerous_function not in prog.fun_names:
        continue

    for func in prog.fun_names[dangerous_function].callers:
        print(f"Function {func.name} calls {dangerous_function}")

Side by Side Analysis

In this example, we show an interesting feature of Quokka: it is possible to load multiple binaries at the same time. The following snippet simply loads two binaries, computes a hash for every common function, and reports any difference found. This is a poor person's differ that could be improved (stay tuned!).

def hash_func(function: quokka.Function):
    # Compute a hash for the function
    return ...

prog1 = quokka.Program("prog1.qk", "prog1")
prog2 = quokka.Program("prog2.qk", "prog2")

for func_name in set(prog1.fun_names).intersection(prog2.fun_names):
    func1 = prog1.fun_names[func_name]
    func2 = prog2.fun_names[func_name]

    if hash_func(func1) != hash_func(func2):
        print(f"Function {func_name} has changed between prog1 and prog2")
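The body of hash_func is deliberately left open above. As one possible choice, and only as an assumption for illustration, the sketch below hashes the sequence of instruction mnemonics with SHA-256. It relies only on the `instructions` and `mnemonic` attributes already used in the feature extraction example; `fake_function` is a hypothetical stand-in so that the snippet runs without an export file.

```python
import hashlib
from types import SimpleNamespace


def hash_func(function) -> str:
    """Hypothetical hash: SHA-256 over the sequence of instruction mnemonics."""
    digest = hashlib.sha256()
    for inst in function.instructions:
        digest.update(inst.mnemonic.encode("utf-8"))
        digest.update(b"\x00")  # separator so that ("a", "bc") differs from ("ab", "c")
    return digest.hexdigest()


def fake_function(*mnemonics: str) -> SimpleNamespace:
    """Stand-in exposing only the attributes used by hash_func."""
    return SimpleNamespace(
        instructions=[SimpleNamespace(mnemonic=m) for m in mnemonics]
    )


old = fake_function("push", "mov", "call", "ret")
new = fake_function("push", "mov", "xor", "call", "ret")
print(hash_func(old) == hash_func(old))  # same mnemonics, same hash
print(hash_func(old) == hash_func(new))  # an added instruction changes the hash
```

Hashing mnemonics instead of raw bytes ignores operand and address changes, so it only flags functions whose instruction sequence differs. Whether that is the right granularity depends on the analysis.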

Benchmarks

Let’s reuse the binaries from the 2019 blog post to draw a fair comparison. We compare the results between Quokka and BinExport (version 12) on a laptop running Debian 11 with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz and 16 GB of RAM.

The commands used for getting the following results were:

  • BinExport: idat64 -OBinExportAutoAction:BinExportBinary -OBinExportAlsoLogToStdErr:TRUE -A ts3server.i64
  • Quokka: idat64 -OQuokkaAuto:true -OQuokkaLog:INFO -A ts3server.i64

Export Size

Program             Binary size  IDA database (i64)  BinExport  Quokka
elf-Linux-x64-bash  908 KB       11 MB               4.2 MB     3.1 MB
ts3server           7.8 MB       58 MB               20 MB      13 MB
llvm-opt            34 MB        304 MB              144 MB     87 MB

Export Time

Program             Binary size  Disassembly  BinExport  Quokka
elf-Linux-x64-bash  908 KB       8.30 s       2.49 s     0.86 s
ts3server           7.8 MB       60.88 s      15.42 s    5.36 s
llvm-opt            34 MB        395 s        108 s      35.7 s
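
To put these numbers in perspective, the short script below derives the relative gains from the figures reported above. It is plain arithmetic on the table values; the variable names are ours.

```python
# (BinExport, Quokka) pairs copied from the tables above
export_size_mb = {
    "elf-Linux-x64-bash": (4.2, 3.1),
    "ts3server": (20, 13),
    "llvm-opt": (144, 87),
}
export_time_s = {
    "elf-Linux-x64-bash": (2.49, 0.86),
    "ts3server": (15.42, 5.36),
    "llvm-opt": (108, 35.7),
}

summary = {}
for name in export_size_mb:
    binexport_size, quokka_size = export_size_mb[name]
    binexport_time, quokka_time = export_time_s[name]
    size_saving = 1 - quokka_size / binexport_size  # fraction of disk space saved
    speedup = binexport_time / quokka_time          # how many times faster
    summary[name] = (size_saving, speedup)
    print(f"{name}: {size_saving:.0%} smaller, {speedup:.1f}x faster")
```

On these three binaries, the Quokka export is roughly 26% to 40% smaller than the BinExport one and is produced about three times faster.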

Conclusion

As Quokka is still a relatively young project, we are looking for feedback, ideas, and pull requests (for example to help support Windows, or to export other elements).

The source code is available here under the Apache 2.0 license, and the documentation on its website: https://quarkslab.github.io/quokka/.

This blog post only introduces the project. More examples and use cases are present in the documentation.

Context & Acknowledgments

This work was conducted during Alexis's PhD (Towards 1-day Vulnerability Detection using Semantic Patch Signature).

I would also like to thank the people who reviewed this blog post and the tool for their helpful comments.
