Leveraging Sourcetrail as a mapping tool: meet Numbat and Pyrrha

Posted Thu 07 March 2024
Authors Eloïse Brocas, Sami Babigeon
Category Reverse-Engineering
Tags reverse-engineering, tool, release, 2024

Ever wanted to find a nice tool to easily represent cartography results and other graphs? The Sourcetrail tool could be a nice solution! In this blog post, we will introduce two of our tools: Numbat, a new Python API for Sourcetrail, and Pyrrha, a mapper collection for firmware cartography.

Going beyond Sourcetrail

Sourcetrail is a source code explorer that helps users quickly understand any project, especially complex ones. The user can navigate through its different components (functions, classes, types, etc.) and observe their interactions as shown by the animation below. Originally developed by CoatiSoftware, it supports indexing C, C++, Java and Python. Unfortunately, it is no longer maintained.

Given any C or C++ project and a preprocessing of its Makefile/CMake (cf. Sourcetrail Documentation), Sourcetrail indexes all of the source code and the different structures involved. One can then navigate through the resulting data with a graph view or a source code view. The graph view groups the elements by type; then, given a specific element, for example a class, it shows its interactions, like imports, with other project elements. It is also possible to see where this class is defined in the source code and where it is used, thanks to dynamic links between the graph part and the source code.

Sourcetrail is very powerful for source code analysis and whitebox security reviews. In summary, it helps the analyst understand a lot of data in a limited amount of time, so why not extend it to show other kinds of data?

Let’s meet Numbat

To that end, Quarkslab developed a Python API, called Numbat, to create and manipulate Sourcetrail databases. Thanks to Numbat, anyone can easily write their own indexer to store arbitrary data as a graph in a Sourcetrail database. The data can then be visualized with the nice graphical Sourcetrail interface.

Why develop a new SDK?

Numbat's main goal is to offer a user-oriented Python SDK, given that the current one, SourcetrailDB, cannot be used efficiently anymore. First of all, it is no longer maintained, and as it is based on bindings that need to be compiled to create a Python package, it is more and more difficult to build, especially on Windows. Moreover, SourcetrailDB has a steep learning curve as it does not hide the internal database structure from the user. We wanted an API that anyone can use easily to obtain results quickly. That’s why we decided to develop a Python SDK with a simple workflow:

Create or open a database.
Create nodes with a given type (class, functions, etc.).
Create relationships between nodes.

Source code can also be added, which makes it possible to associate nodes with the corresponding elements in it.

Finally, extra features have been added, such as the ability to search for an element in the database. Numbat is free software: it is available on GitHub and can be installed directly from PyPI with the following command:

pip install numbat

Explore Numbat possibilities

Numbat offers the possibility to store any kind of data which can be visualized as graphs. It also decouples data generation from its visualization. Moreover, analysis outputs can easily be distributed without giving access to the original target, which can be useful in some situations, for example in DFIR.

First, let’s take a simple example to illustrate the API usage: two classes, with the method of one using a field of the other.

from numbat import SourcetrailDB

# Create DB
db = SourcetrailDB.open('my_db', clear=True)

# Create a first class containing the method 'main'
my_main = db.record_class(name="MyMainClass")
meth_id = db.record_method(name="main", parent_id=my_main)

# Create a second class with a public field 'first_name'
class_id = db.record_class(name="PersonalInfo")
field_id = db.record_field(name="first_name", parent_id=class_id)

# The method 'main' is using the 'first_name' field
db.record_ref_usage(meth_id, field_id)

# Save modifications and close the DB
db.commit()
db.close()

After running this code, opening the resulting database with Sourcetrail will produce the following result.

Numbat can be used to record any kind of data that can be visualized with Sourcetrail. For example, we developed a Ghidra script which, given a binary, decompiles it, iterates over the functions to recreate the function-level call graph with Numbat, and, for each function, registers the associated decompiled source code. It allows the user to quickly understand the code structure and to target specific functions without having to deal with the Ghidra UI at the beginning of their analysis.
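To make the call-graph idea concrete, here is a minimal sketch of the recording step only. It is not our Ghidra script: the call graph is a hypothetical dictionary, and `InMemoryRecorder` is a stand-in written for this example so that the code runs without Sourcetrail. Its two methods are assumed to mirror Numbat's `record_function` and `record_ref_call`; check the Numbat documentation before swapping in a real database.

```python
from itertools import count


class InMemoryRecorder:
    """Stand-in for a Numbat database (hypothetical, for illustration only).

    It mimics two calls, record_function and record_ref_call; that these
    match Numbat's method names is an assumption to verify in its docs.
    """

    def __init__(self):
        self._ids = count(1)
        self.functions = {}  # node id -> function name
        self.calls = []      # (caller id, callee id) pairs

    def record_function(self, name):
        node_id = next(self._ids)
        self.functions[node_id] = name
        return node_id

    def record_ref_call(self, source_id, dest_id):
        self.calls.append((source_id, dest_id))


def record_call_graph(db, call_graph):
    """Record each function once, then one call edge per (caller, callee)."""
    ids = {}
    for caller, callees in call_graph.items():
        for name in (caller, *callees):
            if name not in ids:
                ids[name] = db.record_function(name=name)
    for caller, callees in call_graph.items():
        # set() removes repeated calls to the same callee
        for callee in sorted(set(callees)):
            db.record_ref_call(ids[caller], ids[callee])
    return ids


# Hypothetical call graph, as a decompiler script might export it
call_graph = {
    "main": ["parse_args", "run"],
    "run": ["read_config", "parse_args"],
    "read_config": [],
}

db = InMemoryRecorder()
ids = record_call_graph(db, call_graph)
```

Recording all nodes before any edge keeps the function simple: every identifier an edge needs already exists when the edge is created.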

Tools are not limited to the reverse engineering and program analysis area. Numbat can be used in other fields, as in the following example for network visualization. The complete script is available here.

[...]
    # Create a new database
    db = SourcetrailDB.open(args.outfile, clear=True)
    nodes = {}
    edges = {}

    for file in args.infile:
        # Open pcap file using scapy
        packets = rdpcap(file)
        for packet in packets:
            # Read packet information
            protocol = packet.lastlayer().name
            src, sport, dst, dport = get_packet_info(packet)
            if not src or not dst:
                continue

            # Update nodes for src/dst
            if src not in nodes:
                nodes[src] = db.record_class(prefix="Machine", name=src, postfix="")
            if dst not in nodes:
                nodes[dst] = db.record_class(prefix="Machine", name=dst, postfix="")
            sname = f'{src}:{sport} {protocol}'
            dname = f'{dst}:{dport} {protocol}'

            # Add ports as class fields
            if sname not in nodes:
                nodes[sname] = db.record_field(name=f'{sport} {protocol}', parent_id=nodes[src])
            if dname not in nodes:
                nodes[dname] = db.record_field(name=f'{dport} {protocol}', parent_id=nodes[dst])

            # Add the edges between nodes
            edge_name = f'{sname}|{dname}'
            if edge_name not in edges:
                # Record a usage between the src port and dst port
                edges[edge_name] = db.record_ref_usage(nodes[sname], nodes[dname])
    db.commit()
    db.close()

This example takes a network capture in the .pcap format and outputs a Sourcetrail database. With fewer than a hundred lines of Python, it's possible to quickly visualize the interactions between the different capture elements. We ran this script on a capture of the network traffic generated by a malware sample obtained through Hybrid Analysis. This sample was interesting because it interacted with a lot of different devices.
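As a side note, the deduplication logic of that script can be isolated from scapy and Numbat. The sketch below is not part of the released script: `summarize_flows` is a name made up for this example, and the flows are hypothetical tuples using documentation address ranges. It only shows which machines, ports and edges would end up being recorded once repeated packets are collapsed.

```python
def summarize_flows(flows):
    """Return the unique machines, ports and port-to-port edges in flows.

    Each flow is a (src, sport, dst, dport, protocol) tuple, i.e. the
    values extracted from one packet. Dicts are used as ordered sets.
    """
    machines, ports, edges = {}, {}, {}
    for src, sport, dst, dport, protocol in flows:
        if not src or not dst:
            continue  # same guard as the script: skip packets without addresses
        machines.setdefault(src)
        machines.setdefault(dst)
        sname = f"{src}:{sport} {protocol}"
        dname = f"{dst}:{dport} {protocol}"
        ports.setdefault(sname)
        ports.setdefault(dname)
        edges.setdefault((sname, dname))
    return list(machines), list(ports), list(edges)


# Hypothetical flows: one repeated DNS query, one TCP flow, one unusable packet
flows = [
    ("192.0.2.10", 49152, "192.0.2.1", 53, "DNS"),
    ("192.0.2.10", 49152, "192.0.2.1", 53, "DNS"),
    ("192.0.2.10", 49153, "198.51.100.7", 443, "TCP"),
    (None, None, "192.0.2.1", 53, "DNS"),
]

machines, ports, edges = summarize_flows(flows)
```

Because a capture typically repeats the same flow many times, keying nodes and edges by name keeps the resulting graph proportional to the number of distinct endpoints rather than the number of packets.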

Opening the database produced from this capture in Sourcetrail gives the result below:

In addition to all of these options, we could imagine developing various visualization tools to help security analysts. For instance, they could parse:

a mass scan on a given infrastructure, showing which port is open on which machine, which service is exposed;
an Active Directory dump to show the rights;
and so on.

The possibilities are endless! We have written a detailed step-by-step tutorial. Do not hesitate to take a look at it and the whole documentation to discover how Numbat can be used for new tools!

Pyrrha: Numbat applied to filesystems

Now that we have an efficient API to create Sourcetrail-compatible databases, let's take a look at one project we developed using Numbat: Pyrrha, a mapper collection for firmware analysis. The goal of this tool is to map a firmware using several mappers. For the moment only one has been developed, which maps the ELF/PE imports and exports, and the associated symlinks, of the filesystem under analysis.

The Pyrrha filesystem mapper workflow is quite simple, as described in the diagram below. It uses the LIEF library to parse each ELF (or PE) file contained in the filesystem and extract all the imported/exported symbols. We have implemented a simple linker to resolve all of these imports. Despite its limitations (e.g., it does not handle all the options given to ld for import resolution), it works well to give the analyst a first view of the structure of the OS they are working on.
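The resolution step can be sketched in a few lines. The following is not Pyrrha's implementation: it is a minimal illustration under simplifying assumptions (libraries are searched by name in a fixed list of directories, search-path options such as RPATH are ignored), and all file names, symbols and function names in it are hypothetical.

```python
import posixpath


def resolve_symlink(path, symlinks, max_hops=16):
    """Follow symlinks (link path -> target) until a non-link path is reached."""
    hops = 0
    while path in symlinks:
        if hops == max_hops:
            raise ValueError(f"too many symlink hops for {path}")
        # Relative targets are resolved against the directory of the link
        path = posixpath.normpath(
            posixpath.join(posixpath.dirname(path), symlinks[path])
        )
        hops += 1
    return path


def link_imports(binaries, symlinks, lib_dirs=("/lib", "/usr/lib")):
    """Map each (binary, imported symbol) to the library file exporting it.

    binaries: path -> {"needed": [...], "imports": [...], "exports": [...]}
    Imports that no needed library exports are mapped to None.
    """
    resolved = {}
    for path, info in binaries.items():
        # Locate each needed library by name in the search directories
        candidates = []
        for name in info["needed"]:
            for lib_dir in lib_dirs:
                real = resolve_symlink(posixpath.join(lib_dir, name), symlinks)
                if real in binaries:
                    candidates.append(real)
                    break
        # The first needed library exporting the symbol provides it
        for symbol in info["imports"]:
            resolved[(path, symbol)] = next(
                (lib for lib in candidates if symbol in binaries[lib]["exports"]),
                None,
            )
    return resolved


# Hypothetical filesystem content, as a parser could report it
symlinks = {"/usr/lib/libfoo.so.1": "libfoo.so.1.2.3"}
binaries = {
    "/usr/bin/app": {
        "needed": ["libfoo.so.1", "libc.so.6"],
        "imports": ["foo_init", "printf", "not_exported_anywhere"],
        "exports": [],
    },
    "/usr/lib/libfoo.so.1.2.3": {
        "needed": ["libc.so.6"],
        "imports": ["malloc"],
        "exports": ["foo_init"],
    },
    "/lib/libc.so.6": {"needed": [], "imports": [], "exports": ["printf", "malloc"]},
}

resolved = link_imports(binaries, symlinks)
```

Keeping unresolved imports as `None` instead of dropping them lets the analyst see where such a simplified linker falls short on a given firmware.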

As a result, the analyst can visualize which file imports which function and thus quickly understand which binaries are related to "critical" functions/libraries. For the image below, we have used Pyrrha on the Netgear RAX30 router firmware. Visualizing the result with Sourcetrail allows us to directly obtain the list of binaries that use the curl option to set parameters, and that could therefore deactivate certificate verification. In a few seconds, using Pyrrha, we are able to narrow our analysis scope to only a few binaries. (To learn about the end of this ’curl’ story, take a look at our blog post on the subject.)
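The underlying query is a reverse lookup from a symbol to the binaries importing it. Here is a minimal sketch of that lookup, assuming the curl option mentioned above corresponds to libcurl's `curl_easy_setopt`; the import lists are hypothetical and are not taken from the RAX30 firmware.

```python
from collections import defaultdict


def importers_by_symbol(imports):
    """Invert a binary -> imported symbols mapping into symbol -> binaries."""
    index = defaultdict(set)
    for binary, symbols in imports.items():
        for symbol in symbols:
            index[symbol].add(binary)
    return index


# Hypothetical import lists, for illustration only
imports = {
    "/bin/updater": ["curl_easy_init", "curl_easy_setopt", "printf"],
    "/bin/telemetry": ["curl_easy_setopt", "malloc"],
    "/bin/httpd": ["socket", "printf"],
}

index = importers_by_symbol(imports)
# Binaries worth a closer look for certificate verification settings
suspects = sorted(index["curl_easy_setopt"])
```

In Sourcetrail the same question is answered interactively by selecting the function node and reading its incoming edges; the sketch only shows why the list of candidates shrinks so quickly.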

New mappers can easily be developed, as described in the Pyrrha documentation. Pyrrha is available on Quarkslab’s GitHub and can be installed directly from PyPI with the following command:

pip install pyrrha-mapper

Conclusion

We are releasing Numbat to create arbitrary Sourcetrail databases that can be used for various topics, as shown with our examples (Ghidra call graphs or network captures). We are already using Numbat in our firmware mapping tool Pyrrha. It's now time to play with them!

If you are using Numbat to create a database, let us know! We welcome any kind of contribution.

If you would like to learn more about our security audits and explore how we can help you, get in touch with us!