Quarkslab is open-sourcing Quokka, a binary exporter to manipulate a program's disassembly without a disassembler. This blog post introduces the project, details some parts of its inner workings, and showcases some potential usages. Quokka enables users to write complex analyses on a disassembled binary without dealing with the disassembler API.
Quokka Logo (generated by DALL·E)
Introduction
Analyzing binary programs often requires disassembling them. It is the backbone of security workflows for multiple topics like malware analysis, vulnerability research, or binary instrumentation. Thus, disassembling is crucial to inspect untrusted or proprietary binaries whose source code is not available.
As correctly disassembling is an open problem, the security community has offloaded this task to specialized tools. Some are commercial (IDA, Binary Ninja, JEB...), and others are open-source (Ghidra, BAP, McSema). The main problem faced by disassemblers is recovering information (e.g. symbols, types) lost during compilation. Indeed, converting a sequence of bytes into meaningful assembly instructions is often insufficient. Typical tasks for disassemblers involve finding references between code and data, recovering function boundaries, identifying typical language structures (e.g. jumps or virtual tables), or reconstructing the Control Flow Graph. To do so, disassemblers rely either on algorithms, which produce results with some correctness guarantees, or on heuristics based on common patterns, which offer fewer guarantees.
Disassemblers are usually heavy and complex pieces of software, which makes them ill-suited both to performing custom analyses on a disassembled program and to analyzing multiple binaries simultaneously. Moreover, their APIs may be convoluted and painful to use (looking at you, IDA!). If only the disassembler's output is needed for further analysis, why not extract it to run offline queries? That is what Quokka is about.
A Review of Existing Binary Exporters
This blog post follows one we published in 2019: An Experimental Study of Different Binary Exporters.
In this previous blog post, we established the state of the art for various binary exporters: tools producing binary exports, i.e. standalone files (usable without the disassembler) containing data from the disassembled binary. The situation is almost the same three years later: no new player has entered the game.
Today's best choice for a user is to use BinExport, which exports the disassembly from IDA, Binary Ninja, and Ghidra. However, it lacks bindings to read the disassembly seamlessly and is tailored to be used with BinDiff.
Quokka: A Fast and Accurate Binary Exporter
Quokka offers a generic binary exporter, suited for various contexts. It abides by the following properties:
- Exhaustivity: To be used in various contexts, Quokka exports as much data as possible.
- Efficiency: To ease the integration inside analysis workflows and avoid creating a bottleneck, Quokka is fast. The export time is negligible compared to the disassembly time.
- Compactness: To avoid unnecessary disk usage and allow seamless sharing of export files between users, the Quokka export file is compact.
Quokka is composed of two independent parts:
- An IDA plugin that generates an export file.
- Python bindings to manipulate the exported file seamlessly.
Of note, while generating an export file requires an IDA installation, the result is usable without it.
Using Quokka
Generating the export file
The first step before using Quokka is to generate an export file using the IDA plugin. If you don't have an IDA installation, you can skip this part and directly download an export file from here.
After installing the plugin, the easiest way of generating the export file is to run the following command:
$ idat64 -OQuokkaAuto:true -A path/to/the/binary
[...]
INFO 12:35:24 Starting to register Quokka (version 0.0.3)
INFO 12:35:24 Auto Export
INFO 12:35:24 Exporter set in NORMAL
INFO 12:35:24 Starting to export to [...]/docs/samples/qb-crackme.quokka
INFO 12:35:24 Start to export FileMetadata
INFO 12:35:24 FileMetadata exported (took 0.00s)
INFO 12:35:24 Start to export segments
INFO 12:35:24 Segments exported (took 0.00s)
INFO 12:35:24 Start export enums and structures
INFO 12:35:24 Enum and structures written (took 0.00s)
INFO 12:35:24 Start to export Layout
INFO 12:35:24 End export layout in 0.00s
INFO 12:35:24 Start to write layout.
INFO 12:35:24 Start to write mnemonic.
INFO 12:35:24 Finished to write mnemonics (took: 0.00s)
INFO 12:35:24 Start to write operand strings.
INFO 12:35:24 Finished to write operand_strings (took: 0.00s)
INFO 12:35:24 Start to write operands
INFO 12:35:24 Finished to write operands (took: 0.00s)
INFO 12:35:24 Start to write instructions
INFO 12:35:24 Finished to write instructions (took: 0.00s)
INFO 12:35:24 Start to write func chunks
INFO 12:35:24 Finished to write func_chunks (took: 0.00s)
INFO 12:35:24 Start to export and write functions
INFO 12:35:24 Finished to export/write functions (took : 0.00s)
INFO 12:35:24 Start to transform references
INFO 12:35:24 Start to write data, comments and references
INFO 12:35:24 Finished to write data comments and references (took : 0.00s)
INFO 12:35:24 File [..]/docs/samples/qb-crackme.quokka is written
INFO 12:35:24 quokka finished (took 0.01s)
INFO 12:35:24 Quokka: terminate
It is also worth mentioning that the export can be generated using the Python API.
import quokka
# Quokka respects IDA_PATH to find idat64
program = quokka.Program.from_binary("docs/samples/qb-crackme")
Warning: The IDA plugin support for Windows is only experimental.
Loading and Manipulating the Export
import quokka
# To load a Program, use the paths to the export file and the binary itself
prog = quokka.Program("docs/samples/qb-crackme.quokka", "docs/samples/qb-crackme")
print(prog)
for func in prog.values():
    print(f"Function {func.name} at 0x{func.start:x}")
    for block_start in func.graph.nodes:
        block = func.get_block(block_start)
        print(f"\tBlock at 0x{block_start:x} with {len(block)} instructions")
<Program qb-crackme (ArchX86)>
Function _init_proc at 0x8049000
	Block at 0x8049000 with 7 instructions
	Block at 0x8049019 with 1 instructions
	Block at 0x804901b with 3 instructions
Function sub_8049020 at 0x8049020
	Block at 0x8049020 with 2 instructions
[...]
The snippet above shows how to load a program with Quokka and print the list of functions within the binary, along with their basic blocks. Interested readers can refer to the documentation for a more thorough example.
Architecture
The IDA plugin
The IDA plugin consists of about 3,500 lines of C++ code and targets IDA's latest versions (7.6 and onwards). The export phase is divided into three parts:
- The first one exports everything related to the program itself but not in its address space. During this phase, the metadata, the segments, and the structures are exported.
[...]
INFO 12:56:53 Start to export FileMetadata
INFO 12:56:53 Start to export segments
INFO 12:56:53 Start export enums and structures
[...]
- The second phase is the main one. It performs a single linear scan of the program address space and exports every item found during the scan. In this phase, the instructions, the functions (and their chunks), and the data are exported.
INFO 12:56:53 Start to export Layout
INFO 12:56:53 Start to write layout.
INFO 12:56:53 Start to write mnemonic.
INFO 12:56:53 Start to write operands
INFO 12:56:53 Start to write instructions
INFO 12:56:53 Start to write func chunks
INFO 12:56:53 Start to export and write functions
- Finally, during the last phase, all the references are sorted and resolved between the different items (e.g. structures, instructions, or data). This step is crucial because references are one of the most important elements in the disassembler output.
INFO 12:56:53 Start to transform references
INFO 12:56:53 Start to write data, comments and references
Space optimizations on the wire
Quokka generates a Protobuf file to store the information on the wire. In the previous blog post (An Experimental Study...), we discussed different binary serialization formats. For our use cases, Protobuf offers the best trade-off: it is compact while still being fast at deserializing data. However, to further reduce the exported file size, Quokka leverages some of Protobuf's optimizations.
Addresses or Offsets
Most program items (e.g. functions or instructions) have an associated address within the program address space. This element is key for numerous analyses and needs to be exported. However, programs usually have a large base address (e.g. 0x400000). Because the size on the wire of an integer in Protobuf depends on its absolute value when using the varint encoding, it is more efficient to store relatively small integers.
In Quokka, function addresses are stored as offsets to the program base address and block addresses as offsets to the function start. To go even further, only the instruction sizes are kept, and their addresses are dynamically recomputed during deserialization.
As we show at the end of this article, this optimization (and the next ones) helps improve the compactness of the files generated by Quokka.
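To make the size argument concrete, here is a minimal sketch (not Quokka's actual code) that counts the bytes a varint needs and rebuilds instruction addresses from their sizes. The function names and the sample addresses are illustrative assumptions.

```python
from typing import List


def varint_size(value: int) -> int:
    """Number of bytes the varint encoding needs for a non-negative integer."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    size = 1
    while value >= 0x80:  # each encoded byte carries 7 bits of payload
        value >>= 7
        size += 1
    return size


def instruction_addresses(block_start: int, sizes: List[int]) -> List[int]:
    """Rebuild instruction addresses from a block start and instruction sizes."""
    addresses = []
    current = block_start
    for size in sizes:
        addresses.append(current)
        current += size
    return addresses


base = 0x400000               # illustrative base address
function_address = 0x401136   # hypothetical absolute address

print(varint_size(function_address))          # absolute address: 4 bytes
print(varint_size(function_address - base))   # offset to the base: 2 bytes
print([hex(a) for a in instruction_addresses(function_address, [1, 2, 4])])
```

Storing the offset rather than the absolute address halves the cost of this single field, and the saving repeats for every function and block in the file.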
Data Deduplication
A program may use the same item multiple times, but at different addresses. For example, the instruction push ebp may be used by each function. To improve the storage compactness, Quokka only stores items once in a table and refers to them by their index in this table. To reduce the storage usage further, items are sorted by frequency to lower the indexes of the most frequent items.
In Quokka, the operands, the mnemonics, the instructions, and the data are stored in deduplicated tables. However, it's challenging to evaluate how much space the deduplication saves.
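The idea of a frequency-sorted table can be sketched in a few lines. This is an illustration of the technique, not the plugin's implementation; `build_table` and the sample mnemonic list are made up for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def build_table(items: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Deduplicate items into a table sorted by decreasing frequency.

    Returns the table and, for each original item, its index in the table.
    """
    counts = Counter(items)
    # most_common() sorts by decreasing count, so frequent items get small indexes
    table = [item for item, _ in counts.most_common()]
    index_of = {item: index for index, item in enumerate(table)}
    return table, [index_of[item] for item in items]


mnemonics = ["push", "mov", "sub", "mov", "call", "mov", "push", "ret"]
table, indexes = build_table(mnemonics)
print(table)    # the most frequent mnemonic ("mov") comes first
print(indexes)  # each instruction now only stores a small integer
```

Because small integers are cheaper to encode as varints, giving the lowest indexes to the most frequent items reduces the total size of the references.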
Default Values
In Protobuf, each field has an associated type and these types possess a default value. For example, the string type default value is the empty string and numeric types default to 0. This is interesting because the Protobuf serializer does not write default values on the wire.
We leverage this property in Quokka to reduce the exported file size. For example, in the Instruction message, the field is_thumb defaults to False and is only set when dealing with Thumb instructions in an ARM binary.
Summary
Let's consider the following extract of the Quokka Protobuf schema. It implements each of the optimizations previously mentioned.
message Quokka {
    message Instruction {
        uint32 size = 1;
        uint32 mnemonic_index = 2;
        bool is_thumb = 4;
    }
    repeated string mnemonics = 8;
}
- The instruction has no address.
- The instruction mnemonic is stored in the mnemonics table and only its index is used.
- The is_thumb field is only set to True for Thumb instructions (the Protobuf default value for booleans is False).
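To see what this means in bytes, the following sketch hand-encodes an Instruction message using the field numbers from the schema extract. It assumes proto3 semantics (scalar fields equal to their default are omitted) and is a simplified stand-in for the real Protobuf serializer, not Quokka's code.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_instruction(size: int, mnemonic_index: int, is_thumb: bool = False) -> bytes:
    """Encode the three Instruction fields, skipping default (zero) values."""
    fields = ((1, size), (2, mnemonic_index), (4, int(is_thumb)))
    out = bytearray()
    for field_number, value in fields:
        if value == 0:
            continue  # default value: nothing is written on the wire
        out += encode_varint(field_number << 3)  # tag with wire type 0 (varint)
        out += encode_varint(value)
    return bytes(out)


# A 1-byte instruction using the most frequent mnemonic (index 0), not Thumb
print(encode_instruction(1, 0).hex())        # 0801
# A 4-byte Thumb instruction using the mnemonic at index 3
print(encode_instruction(4, 3, True).hex())  # 080410032001
```

In the first case only the size field reaches the wire: the mnemonic index 0 and the False flag cost nothing, which is also why sorting the tables by frequency pays off.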
Usage Examples
Extracting data from a disassembler in a reusable format is a useful building block for numerous workflows. Let's try to see some potential use cases. While every example is feasible within IDA using its API, we believe it will be more natural with Quokka.
Feature Extraction
In some machine learning workflows, researchers need to extract data from a dataset to train or evaluate their algorithms. For example, AlphaDiff's authors developed a custom plugin to extract data from functions within the binary. The snippet below shows how to extract the same data (and others) with Quokka:
import quokka
from quokka.types import FunctionType

def extract_features(function: quokka.Function):
    vector = [
        # In / Out degrees of the function
        (function.in_degree, function.out_degree),
        # Function bytes
        function.bytes,
        # Functions used
        set(imp.name for imp in function.calls if imp.type == FunctionType.IMPORTED),
        # Function size
        function.end - function.start,
        # Number of basic blocks
        len(function.graph),
        # Bag of mnemonics
        set(inst.mnemonic for inst in function.instructions),
    ]
    return vector
Binary Analysis
Sometimes, when analyzing a binary, it can be interesting to see if some so-called dangerous functions are used within the binary. With the following code, a user can quickly search through the program and flag potential usages:
candidate_functions = {"strcpy", "strcmp", "memcpy", ...}

def find_dangerous_functions(program: quokka.Program):
    for function in program.values():
        # Compare names: function.calls yields objects with a name, not strings
        called_names = {callee.name for callee in function.calls}
        if found := candidate_functions.intersection(called_names):
            print(f"Function {function.name} calls {found}")

    # Other solution
    for dangerous_function in candidate_functions:
        # Filter out functions not used in the program
        if dangerous_function not in program.fun_names:
            continue

        for func in program.fun_names[dangerous_function].callers:
            print(f"Function {func.name} calls {dangerous_function}")
Side by Side Analysis
In this example, we show an interesting feature of Quokka: it is possible to load multiple binaries at the same time. The following snippet simply loads two binaries, computes a hash for every common function, and reports any difference found. This is a poor person's differ that could be improved (stay tuned!).
def hash_func(function: quokka.Function):
    # Compute a hash for the function
    return ...

prog1 = quokka.Program("prog1.qk", "prog1")
prog2 = quokka.Program("prog2.qk", "prog2")

for func_name in set(prog1.fun_names).intersection(prog2.fun_names):
    func1 = prog1.fun_names[func_name]
    func2 = prog2.fun_names[func_name]
    if hash_func(func1) != hash_func(func2):
        print(f"Function {func_name} has changed between prog1 and prog2")
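The body of hash_func is left open in the snippet. One possible choice, sketched here as an assumption rather than as the authors' method, is to hash the sequence of mnemonics, relying only on the function.instructions and inst.mnemonic attributes used in the feature extraction example. The stand-in objects below replace real quokka.Function instances so the sketch runs without an export file.

```python
import hashlib
from types import SimpleNamespace


def hash_func(function) -> str:
    """Hash the sequence of mnemonics of a function (one choice among many)."""
    digest = hashlib.sha256()
    for inst in function.instructions:
        digest.update(inst.mnemonic.encode("utf-8"))
        digest.update(b"\x00")  # separator, so ["a", "bc"] differs from ["ab", "c"]
    return digest.hexdigest()


def fake_function(mnemonics):
    """Hypothetical stand-in exposing the same attributes as a quokka.Function."""
    return SimpleNamespace(
        instructions=[SimpleNamespace(mnemonic=m) for m in mnemonics]
    )


old = fake_function(["push", "mov", "call", "ret"])
same = fake_function(["push", "mov", "call", "ret"])
patched = fake_function(["push", "mov", "cmp", "jz", "call", "ret"])

print(hash_func(old) == hash_func(same))     # True: identical mnemonic sequence
print(hash_func(old) == hash_func(patched))  # False: the sequence has changed
```

Such a hash ignores operands, so it is insensitive to address or register changes but misses patches that only alter constants; hashing function.bytes instead would make the opposite trade-off.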
Benchmarks
Let's reuse the binaries from the 2019 blog post to draw a fair comparison. We compare the results between Quokka and BinExport (version 12) on a laptop running Debian 11 with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz and 16 GB of RAM.
The commands used for getting the following results were:
- BinExport:
idat64 -OBinExportAutoAction:BinExportBinary -OBinExportAlsoLogToStdErr:TRUE -A ts3server.i64
- Quokka:
idat64 -OQuokkaAuto:true -OQuokkaLog:INFO -A ts3server.i64
Export Size
Program | Binary size | IDA database (i64) | BinExport | Quokka |
---|---|---|---|---|
elf-Linux-x64-bash | 908 KB | 11 MB | 4.2 MB | 3.1 MB |
ts3server | 7.8 MB | 58 MB | 20 MB | 13 MB |
llvm-opt | 34 MB | 304 MB | 144 MB | 87 MB |
Export Time
Program | Binary size | Disassembly time | BinExport | Quokka |
---|---|---|---|---|
elf-Linux-x64-bash | 908 KB | 8.30 s | 2.49 s | 0.86 s |
ts3server | 7.8 MB | 60.88 s | 15.42 s | 5.36 s |
llvm-opt | 34 MB | 395 s | 108 s | 35.7 s |
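The ratios behind these two tables can be computed directly. The short sketch below only reuses the numbers reported above (sizes in MB, times in seconds) and adds no new measurement.

```python
# (BinExport, Quokka) values copied from the two tables above
results = {
    "elf-Linux-x64-bash": {"size_mb": (4.2, 3.1), "time_s": (2.49, 0.86)},
    "ts3server": {"size_mb": (20, 13), "time_s": (15.42, 5.36)},
    "llvm-opt": {"size_mb": (144, 87), "time_s": (108, 35.7)},
}

ratios = {}
for program, metrics in results.items():
    binexport_size, quokka_size = metrics["size_mb"]
    binexport_time, quokka_time = metrics["time_s"]
    ratios[program] = (binexport_size / quokka_size, binexport_time / quokka_time)

for program, (size_ratio, time_ratio) in ratios.items():
    print(f"{program}: export {size_ratio:.2f}x smaller, {time_ratio:.2f}x faster")
```

On these three binaries, the Quokka export is between roughly 1.35x and 1.65x smaller than the BinExport one and is produced close to 3x faster.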
Conclusion
As Quokka is still a relatively young project, we are looking for feedback, ideas, and pull requests (for example to help support Windows, or to export other elements).
The source code is available here under the Apache 2.0 license, and the documentation on its website: https://quarkslab.github.io/quokka/.
This blog post only introduces the project. More examples and use cases are present in the documentation.
Context & Acknowledgments
This work was conducted during Alexis's PhD (Towards 1-day Vulnerability Detection using Semantic Patch Signature).
I would also like to thank the people who reviewed this blog post and the tool for their helpful comments.