In this blog post we present SightHouse, an open-source tool designed to assist reverse engineers by retrieving information and metadata from programs and by identifying functions similar to those already known from libraries, binaries, or any other source code found online.


Introduction


SightHouse's logo

Whether you are new to reverse engineering or have years of experience, you have likely encountered a common challenge: distinguishing relevant software components from third-party libraries within firmware or programs. This task is often difficult and time-consuming, as considerable effort gets wasted reversing irrelevant code.

Software evolves rapidly, compelling reverse engineers to continuously adapt. Modern programs are complex, requiring analysis of thousands of functions and layers of abstraction introduced by SDKs and new programming languages like Rust or Golang. Additionally, while LLM-generated code accelerates development, it tends to produce repetitive, often vulnerable patterns across models [1], leaving reverse engineers to sift through yet another source of redundant code.

To address this challenge, numerous approaches have emerged over the years, spanning from IDA FLIRT [2], released in 1996, to the latest innovations of the current Large Language Model (LLM) era. Most of these static analysis approaches aim to solve the binary similarity problem: identifying similar functions based on a given representation, such as raw bytes, assembly code, an Intermediate Representation (IR), or source code. However, choosing the right tool is not straightforward, as each solution has its own strengths and limitations.

Once you have selected a specific algorithm for your needs, it is often necessary to compute a large database of known function signatures to make the tool effective. The creation and maintenance of these signature databases can be particularly challenging for researchers, as they need to continuously identify, compile, and extract new signatures from programs.

Moreover, the reverse engineering ecosystem is fragmented, which limits collaboration and contribution among reverse engineers. Many available solutions are tightly coupled with specific Software Reverse Engineering (SRE) tools like IDA Pro, Binary Ninja, or Ghidra. This fragmentation can hinder the broader adoption and integration of these tools across different workflows.

To address these challenges, we present SightHouse, a new function identification tool designed to automate the creation of signature databases and seamlessly integrate with your preferred SRE environment.

Choosing the right tool

We stand on the shoulders of giants.

As mentioned earlier, many tools have emerged over the years, and we aimed to identify the best fit for our specific use cases. First and foremost, the algorithm needed to be free and open-source, with a permissive license allowing integration into our project. This constraint ruled out commercial solutions like IDA Pro or Binary Ninja.

We sought a solution that could handle multiple architectures while ultimately providing cross-architecture capability (for example, enabling comparisons between the x86 and ARM32 versions of memcpy). Additionally, the algorithm needed to be scalable, capable of supporting server-based queries from multiple clients, and able to deliver strong performance even when processing millions of functions.

To evaluate potential solutions, we benchmarked approaches that represent the state of the art in academia, such as jTrans [3] or GMN [4], as well as more "industrial" ones like FunctionSimSearch [5], FunctionID [6], and BSIM [7].

For our experiments, we created a new dataset using projects from PlatformIO [8], a software aggregator for embedded projects, to cover architectures like ARM, RISC-V, and Xtensa. We also added well-known projects such as glibc, sqlite, openssl, curl, and zlib, all compiled for x86. This resulted in 9,775 programs, 379,822 functions, and 782 MB of storage.

We duplicated the dataset, stripped the symbols, and then applied each algorithm to reassign function names. Some might argue that using the same dataset for both signature extraction and comparison is problematic (a well-known pitfall in machine learning). However, we did not use this dataset to train any model: each algorithm produces its results through deterministic computations that are independent of our data. Furthermore, some algorithms are designed to recognize specific byte sequences, which means they would fail if those sequences did not appear in the final database.

Here are the results of our experiments. For those unfamiliar with the chosen metrics, here is a short explanation:

  • Precision: Measures the ability to retrieve accurate matches.
  • Recall: Indicates how effectively the algorithm identifies all instances of the same function.
  • F1-Score: Represents the harmonic mean between Precision and Recall, providing a balanced measure of both accuracy and effectiveness.
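These three metrics reduce to simple ratios over per-function match counts, obtained by comparing the reassigned names against the ground-truth symbols. As a quick reference:

```python
# Precision, recall, and F1 from raw match counts:
#   tp: functions reassigned their correct name
#   fp: functions reassigned a wrong name
#   fn: functions for which no correct match was returned
def scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: low whenever either component is low.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, a matcher with high precision but low recall still ends up with a low F1, a pattern visible in the results.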

From the table below, we can draw the following conclusions:

  • While GMN is an appealing state-of-the-art approach, it currently lacks scalability for real-world applications.
  • FunctionSimSearch delivers the best results but frequently crashes, raising questions about the validity and reliability of its outcomes.
  • Simpler methods like FunctionID are notably fast yet struggle to generalize on unseen functions.

Ultimately, despite its slightly less impressive performance compared to others, BSIM emerges as a robust choice for production scenarios. It achieves decent results and benefits from strong server-side backend support, such as compatibility with PostgreSQL or Elasticsearch, making it a practical solution for real-world deployment.

| Method            | Architecture | Time (s) | Precision | Recall | F1-score |
|-------------------|--------------|----------|-----------|--------|----------|
| GMN               | x86          | 2472000  | -         | -      | -        |
| jTrans            | x86          | 16612    | 0.14      | 0.19   | 0.16     |
| FunctionSimSearch | x86          | 13662    | 0.41      | 0.67   | 0.51     |
| FunctionID        | All          | 164      | 0.82      | 0.10   | 0.18     |
| FunctionID        | x86          | 41       | 0.51      | 0.20   | 0.29     |
| BSIM              | All          | 2909     | 0.64      | 0.13   | 0.22     |
| BSIM              | x86          | 728      | 0.30      | 0.23   | 0.26     |

Overview of SightHouse

A picture is worth a thousand words, so let's see SightHouse in action!

The video demonstrates how SightHouse can be used to query for known signatures using scripts tailored for different SRE tools. Currently, SightHouse supports IDA Pro, Ghidra, and Binary Ninja.

When a signature is found, it is added as a bookmark, and some comments are included to show the name of the matched function along with its origin.

The project is organized into three main components:

At the bottom are the SightHouse plugins, one for each supported SRE tool. Each plugin is built on a shared Python package that contains the core functionality, which ensures consistency across all plugins and reduces code duplication.

The SightHouse clients interact with an HTTP REST API called the frontend server, which exposes a unified interface abstracting the underlying reverse engineering tooling. When analyzing a new file, the client sends the raw binary along with metadata about the program, its sections, and its functions; the server then runs Ghidra in headless mode with a custom loader and uses BSIM to query signatures.
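As a rough sketch of that exchange, a client could assemble and ship its payload as below. Note that the `/analyze` route, the payload fields, and the binary encoding are illustrative assumptions made for this post, not SightHouse's documented API:

```python
import json
import urllib.request

def build_payload(binary: bytes, name: str, functions: list[dict]) -> dict:
    # Bundle the raw binary with the metadata the plugin extracts from the
    # SRE tool: program name, sections, and function boundaries.
    return {
        "name": name,
        "binary": binary.hex(),  # illustrative encoding choice
        "functions": functions,  # e.g. [{"name": "sub_1000", "start": 0x1000, "size": 64}]
    }

def send(payload: dict, frontend: str = "http://localhost:6669") -> bytes:
    # "/analyze" is a placeholder route used purely for illustration.
    req = urllib.request.Request(
        frontend + "/analyze",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # blocks until the server answers
        return resp.read()
```

The real routes and payload schema are described in the project documentation.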

Note: Only use SightHouse instances that you trust, as they will handle your program's binaries. You can run your own server instance — see the Going Further section below.

While this setup provides a solid foundation, we wanted to address the challenge of creating and maintaining a signatures database. To solve this, we developed the Signature Pipeline! This pipeline consists of tailored workers that can search for new projects online, download them, compile them, and extract function signatures, which are then automatically added to the database.

Quick Start

SightHouse is available on PyPI and as Docker images on GitHub Container Registry.

SRE client

The easiest way to install the SightHouse client for your SRE is to install the sighthouse-client package and then run one of the following commands.

pip install sighthouse-client
# Ghidra
sighthouse client install ghidra --ghidra-install-dir /path/to/ghidra

# IDA Pro
sighthouse client install ida --ida-dir /path/to/ida_dir

# Binary Ninja
sighthouse client install binja

After restarting your SRE tool, SightHouse will appear in the plugin list.

Note: Some clients, like Ghidra, manage their own virtual environments, so the installation script automatically detects and manages them. Other clients, like IDA, do not provide a virtual environment, though some users create one inside IDA_DIR. If you are already in a virtual environment, the installer will perform the installation there.
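As background, detecting an active virtual environment from Python is a one-liner, which is what makes this kind of installer logic possible. This is a generic sketch, not SightHouse's actual installer code:

```python
import sys
import sysconfig

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix

def install_dir() -> str:
    # Where `pip install` would drop packages for this interpreter,
    # whether or not a venv is active.
    return sysconfig.get_paths()["purelib"]
```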

Frontend Server

The easiest way to run a SightHouse frontend is via Docker Compose. The following minimal setup starts the frontend along with its dependencies (Redis and a BSIM-enabled PostgreSQL):

docker pull ghcr.io/quarkslab/sighthouse/sighthouse-frontend:1.0.1
docker pull ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
# docker-compose.yml
services:
  redis:
    image: redis:7
    volumes:
      - ./data/redis:/data

  bsim_postgres:
    image: ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
    volumes:
      - ./data/postgres:/home/user/ghidra-data

  create_user:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-frontend:1.0.1
    entrypoint: '/home/user/.local/bin/sighthouse'
    command: "frontend add-user -d sqlite:////data/frontend.db user -p password"
    restart: "no"
    volumes:
      - ./data/frontend:/data

  sighthouse_frontend:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-frontend:1.0.1
    entrypoint: '/home/user/.local/bin/sighthouse'
    command: >
      frontend -g /ghidra -d sqlite:////data/frontend.db
      -r local://data start
      -w redis://redis:6379/0
      -b postgresql://user@bsim_postgres:5432/bsim
    ports:
      - "6669:6671"
    volumes:
      - ./data/frontend:/data
    depends_on: [create_user, bsim_postgres, redis]

Then, the frontend can be started using the following script:

#!/bin/sh

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" >/dev/null 2>&1 && pwd)


mkdir -p "$SCRIPT_DIR/data/postgres"
mkdir -p "$SCRIPT_DIR/data/redis"
mkdir -p "$SCRIPT_DIR/data/frontend"

chown -R 1000:1000 "$SCRIPT_DIR/data"

docker compose -f "$SCRIPT_DIR/docker-compose.yml" up -d

The API will be available on port 6669.

Signature Pipeline

The easiest way to run a full pipeline (scraper + compiler + analyzer) is via Docker Compose:

docker pull ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
docker pull ghcr.io/quarkslab/sighthouse/create_bsim_db:1.0.1 
docker pull ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
# docker-compose.yml
services:
  redis:
    image: redis:7
    hostname: redis
    user: "1000:1000"
    volumes:
      - ./data/redis:/data
    networks:
      - internal-net

  minio:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    hostname: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    command: 'minio server --console-address ":9001" /data'
    volumes:
      - ./data/minio:/data
    networks:
      - internal-net
      - external-net

  createbuckets:
    image: minio/minio:RELEASE.2025-04-22T22-12-26Z
    depends_on:
      - minio
    restart: on-failure
    entrypoint: >
      /bin/sh -c "
      sleep 3;
      /usr/bin/mc alias set dockerminio http://minio:9000 admin password;
      /usr/bin/mc mb dockerminio/uploads;
      /usr/bin/mc anonymous set public dockerminio/uploads;
      exit 0;
      "
    networks:
      - internal-net

  bsim_postgres:
    image: ghcr.io/quarkslab/sighthouse/ghidra-bsim-postgres:1.0.1
    hostname: bsim_postgres
    volumes:
      - ./data/postgres:/home/user/ghidra-data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "/ghidra/Ghidra/Features/BSim/support/pg_is_ready.sh || exit 1 "]
      retries: 5
      interval: "30s"
      timeout: "5s"
    networks:
      - internal-net

  create_bsim_db_postgres:
    image: ghcr.io/quarkslab/sighthouse/create_bsim_db:1.0.1
    command: 'user "" bsim_postgres postgresql 5432'
    depends_on:
      bsim_postgres:
        condition: service_healthy
    restart: "no"
    networks:
      - internal-net

  ghidra_analyzer:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GhidraAnalyzer",
      "Ghidra Analyzer",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "-g", "/ghidra",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Ghidra_Analyzer_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    depends_on:
      - bsim_postgres
      - minio
      - redis
    networks:
      - internal-net

  autotools_compiler:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/AutotoolsCompiler",
      "Autotools Compiler",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
      "--strict"
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Autotools_Compiler_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      ghidra_analyzer:
        condition: service_healthy
    networks:
      - internal-net

  git_scrapper:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    restart: unless-stopped
    command: [
      "sighthouse-pipeline/src/sighthouse/pipeline/core_modules/GitScrapper",
      "Git Scrapper",
      "-w", "redis://redis:6379/0",
      "-r", "s3://minio:9000/uploads",
    ]
    healthcheck:
      test: ["CMD-SHELL", "ls /tmp/sighthouse_Git_Scrapper_*.ready 2>/dev/null | grep -q ."]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      autotools_compiler:
        condition: service_healthy
    networks:
      - internal-net
      - external-net

  create_recipe:
    image: ghcr.io/quarkslab/sighthouse/sighthouse-pipeline:1.0.1
    entrypoint: >
      /home/user/.local/bin/sighthouse pipeline -r s3://minio:9000/uploads -w redis://redis:6379/0 start pipeline.yml
    volumes:
      - ./data/pipeline.yml:/build/pipeline.yml:ro
    depends_on:
      git_scrapper:
        condition: service_healthy
    restart: on-failure
    networks:
      - internal-net

networks:
  internal-net:
    driver: bridge
    internal: true  # Blocks host access
  external-net:
    driver: bridge
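The healthchecks in this file all poll `/tmp` for a `.ready` marker, so a worker signals readiness simply by creating one. Conceptually it looks like the sketch below; only the `sighthouse_<Worker_Name>_*.ready` pattern comes from the compose file, and the suffix after the worker name is an assumption:

```python
import os
import pathlib
import tempfile

def signal_ready(worker_name: str) -> pathlib.Path:
    # Create the marker the Docker healthcheck greps for, e.g.
    # /tmp/sighthouse_Git_Scrapper_<pid>.ready (the PID suffix is an
    # illustrative choice, not necessarily what SightHouse uses).
    safe = worker_name.replace(" ", "_")
    marker = pathlib.Path(tempfile.gettempdir()) / f"sighthouse_{safe}_{os.getpid()}.ready"
    marker.touch()
    return marker
```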

Now we need to feed some jobs into the pipeline. To accomplish this, we have created a custom YAML format, similar to CI/CD pipeline files, which allows you to specify which jobs should run on which workers.

Write the following content into ./data/pipeline.yml:

# pipeline.yml
name: My pipeline
description: A simple pipeline
workers:

  - name: fetch_glibc
    package: Git Scrapper
    target: compile_glibc
    args:
      repositories:
        - name: libc
          url: git://sourceware.org/git/glibc.git
          branches:
            - glibc-2.25.90

  # Glibc cannot be compiled without optimization
  - name: compile_glibc
    package: Autotools Compiler
    target: analyzer
    foreach:
      - compiler_variants:
          x86_64-O1:
            cc: gcc
            cflags: -O1 -Wno-error=array-parameter
            configure_extra_args: --disable-werror

  - name: analyzer
    package: Ghidra Analyzer
    args:
      bsim:
        urls:
          - postgresql://user@bsim_postgres:5432/bsim
        min_instructions: 10
        max_instructions: 0
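The `foreach` key fans the single `compile_glibc` job out into one compile job per compiler variant, each carrying its own flags. A rough sketch of that expansion (illustrative only, not SightHouse's scheduler code; the `-O2` variant is invented for the example):

```python
def expand_variants(variants: dict[str, dict]) -> list[dict]:
    # One compile job per named compiler variant, flags carried along.
    return [{"variant": name, **flags} for name, flags in variants.items()]

jobs = expand_variants({
    "x86_64-O1": {"cc": "gcc", "cflags": "-O1 -Wno-error=array-parameter"},
    "x86_64-O2": {"cc": "gcc", "cflags": "-O2"},  # hypothetical extra variant
})
```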

Finally, the pipeline can be started using the following script:

#!/bin/sh

SCRIPT_DIR=$(cd -- "$(dirname -- "$0")" >/dev/null 2>&1 && pwd)

mkdir -p "$SCRIPT_DIR/data/postgres"
mkdir -p "$SCRIPT_DIR/data/redis"
mkdir -p "$SCRIPT_DIR/data/minio"
mkdir -p "$SCRIPT_DIR/data/scrapper"
cp "$SCRIPT_DIR/pipeline.yml" "$SCRIPT_DIR/data/pipeline.yml"

chown -R 1000:1000 "$SCRIPT_DIR/data"

docker compose -f "$SCRIPT_DIR/docker-compose.yml" up -d

The final directory structure should look like this:

.
|-- docker-compose.yml
|-- pipeline.yml
`-- start.sh

Conclusion

In this blog post, we introduced SightHouse, a tool designed to help reverse engineers by identifying similar functions. The code is open-source under the MIT license, and is hosted on GitHub, along with its documentation.

SightHouse was presented at Re//verse 2026.

Don't hesitate to take a look! Feedback and contributions are welcome!

Going Further

The documentation covers each component in detail.


  1. Maxime Rossi Bellom, Ramtine Tofighi Shirazi. Is Vibe Coding a Security Nightmare? A Benchmark of AI Coding Agents. https://blog.secmate.dev/posts/vibe-coding-security-benchmark/ 

  2. Hex-Rays Team. IDA F.L.I.R.T. Technology In-Depth. https://docs.hex-rays.com/user-guide/signatures/flirt/ida-f.l.i.r.t.-technology-in-depth 

  3. Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao. jTrans: Jump-Aware Transformer for Binary Code Similarity. https://doi.org/10.1145/3533767.3534367 

  4. Andrea Marcelli and Mariano Graziano and Xabier Ugarte-Pedrero and Yanick Fratantonio and Mohamad Mansouri and Davide Balzarotti. How Machine Learning Is Solving the Binary Function Similarity Problem. https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli 

  5. Thomas Dullien. FunctionSimSearch: SimHash-based similarity search over CFGs. https://github.com/thomasdullien/functionsimsearch 

  6. Ghidra Team. FunctionID. https://github.com/NationalSecurityAgency/ghidra/blob/master/Ghidra/Features/FunctionID/src/main/doc/fid.xml 

  7. Ghidra Team. BSim Tutorial. https://ghidra.re/ghidra_docs/GhidraClass/BSim/README.html 

  8. PlatformIO team. A cross-platform, cross-architecture tool for embedded products. https://docs.platformio.org/en/latest/what-is-platformio.html#what-is-platformio 


If you would like to learn more about our security audits and explore how we can help you, get in touch with us!