Author Sami Babigeon
Category Program Analysis
Tags 2026, binary analysis, program analysis, reverse-engineering, binary similarity, BSIM
Since its initial release in December 2023, many people have used and built tools around Ghidra's BSIM feature, but until now its internals were largely undocumented. This post sheds some light on how BSIM works, both theoretically and in its C++ implementation.
Introduction
During our work on SightHouse, we evaluated several binary similarity engines to find one that met our needs. After thorough evaluation, we chose Ghidra's Behavioral Similarity (BSIM) feature. One key difference of BSIM compared to other approaches is that, despite being open-source, its algorithm is sparsely documented.
Existing documentation [1] indicates BSIM uses locality-sensitive hashing and cosine similarity, but the description is brief and incomplete. So here it is, once and for all, BSIM finally explained!
All information in this post regarding Ghidra refers to the code in the Ghidra_12.0_build tag on GitHub.
BSIM Overview
BSIM is designed to identify whether two binary functions implement the same semantics, regardless of compiler, optimization level, or target architecture. It works by first lifting each function through Ghidra's decompiler to obtain P-code instructions, Ghidra's architecture-independent Intermediate Representation of the decompiled code [2]. These instructions are considered "raw" or "Low P-code"; the decompiler then normalizes away compiler noise, stripping dead flag computations, abstracting stack mechanics, and producing a clean SSA (Static Single Assignment) dataflow graph. This refined form of P-code is called "High P-code". It shares the same grammar as raw P-code but is rewritten into a cleaner, normalized form, with a few notable differences; for instance, the MULTIEQUAL operation (Phi-node) only appears in High P-code.
Once generated, BSIM iterates over these refined instructions and incrementally hashes them into a "feature vector" (a vector of integer hash values). These feature hashes form a function fingerprint, which is stored in a database (local, PostgreSQL, or Elasticsearch). When querying for similar functions, BSIM retrieves candidates from the database by comparing feature vector similarity scores. The result is a similarity score between 0 and 1 that reliably identifies semantically equivalent functions.
The figure below presents the different steps of the BSIM pipeline:
The next parts of this post break down these two steps: feature generation and how the resulting vectors are compared.
Ghidra Architecture
To understand how BSIM works, we need to explain how Ghidra operates. Ghidra is
mainly written in Java, except for a few components including the decompiler,
which is written in C++. The decompiler sources are located under
Ghidra/Features/Decompiler/src/decompile/cpp, referred to later in this post
as DECOMP_DIR.
The interaction between these two environments uses a small custom serial
protocol that reads input from the decompiler process's stdin and returns
results on stdout. The implementation is available at
DECOMP_DIR/ghidra_process.cc.
Whenever Ghidra needs to decompile a function, it spawns (or reuses) one of the decompiler processes. It sends all necessary information (raw bytes, processor definitions, address spaces, etc.) to that process and then displays the decompilation results in the UI.
The decompiler loads a SLEIGH [3] definition corresponding to the processor identifier (for example, x86:LE:64:default). SLEIGH is a processor description language originally based on SLED [4] but refined for Ghidra's needs. SLEIGH has two main goals: enabling disassembly and decompilation.
For decompilation, SLEIGH specifies the translation from machine instructions into P-code. P-code is a register-transfer language (RTL) designed to capture the semantics of machine instructions in a uniform, processor-independent form. Code for different processors can be translated straightforwardly into P-code, allowing a single suite of analysis tools to perform data-flow analysis and decompilation.
Finally, to fully understand P-code, we need to introduce three concepts: the address space, the varnode, and the operation.
- Address Space: A named region where bytes can be addressed and manipulated, such as RAM, registers, or special internal storage. The defining characteristics of a space are its name, size, and endianness.
- Varnode: The fundamental unit of data in P-code, representing a contiguous sequence of bytes within an address space, uniquely characterized by its address space, offset, and size.
- Operation: A single, primitive action (often called a P-code op) that takes one or more varnodes as inputs and optionally produces one output varnode.
To illustrate P-code, consider the following bytes: b82a000000. Using the
x86_64 instruction set in little-endian, we can disassemble those bytes as
MOV EAX, 0x2a, which can be translated to the following P-code operation:
RAX = COPY 42:8. The destination varnode is RAX and it's being assigned
a copy of a source varnode with an immediate value of 42 and size 8 bytes
(i.e., a 64-bit value).
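To make these concepts concrete, here is a small Python sketch (illustrative only, not Ghidra's actual data structures; the offsets are assumptions) modeling the RAX = COPY 42:8 operation as varnode and operation records:

```python
from typing import NamedTuple, Optional, Tuple

class Varnode(NamedTuple):
    space: str    # address space name: "register", "const", "ram", ...
    offset: int   # byte offset within that space (illustrative value)
    size: int     # size in bytes

class PcodeOp(NamedTuple):
    opcode: str
    inputs: Tuple[Varnode, ...]
    output: Optional[Varnode]

# RAX = COPY 42:8 -- constants live in a dedicated "const" space
rax = Varnode("register", 0x0, 8)
const42 = Varnode("const", 42, 8)
op = PcodeOp("COPY", (const42,), rax)

print(op.output.size)  # 8
```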
Down the rabbit hole
P-code lifting and normalization
The main entrypoint of the signature generation is the SignaturesAt::rawAction
function located in DECOMP_DIR/signature_ghidra.cc. This function is called
whenever the "generateSignatures" action is triggered by Ghidra through the
custom serial protocol.
This function takes the address of the function and loads it. It then runs the function through Ghidra's decompiler under the normalize action, a specific subset of the full decompilation pipeline.
void SignaturesAt::rawAction(void) {
  Funcdata *fd = ghidra->symboltab->getGlobalScope()->queryFunction(addr);
  // ...
  if (!fd->isProcStarted()) {
    string curname = ghidra->allacts.getCurrentName();
    Action *sigact;
    if (curname != "normalize")
      sigact = ghidra->allacts.setCurrent("normalize");
    else
      sigact = ghidra->allacts.getCurrent();
    sigact->reset(*fd);
    sigact->perform(*fd);
    if (curname != "normalize")
      ghidra->allacts.setCurrent(curname);
  }
  PackedEncode encoder(sout); // Write output XML directly to outstream
  simpleSignature(fd,encoder);
}
The normalize action performs a specific subset of the full pipeline. The result is a function represented in SSA form as a multigraph of Varnodes (SSA values) connected via P-code Operations.
The action applies a sequence of analysis passes:
const char *normali[] = { "base", "protorecovery", "protorecovery_b", "deindirect", "localrecovery",
"deadcode", "stackptrflow", "normalanalysis",
"stackvars", "deadcontrolflow", "analysis", "fixateproto", "nodejoin",
"unreachable", "subvar", "floatprecision", "normalizebranches",
"conditionalexe", "" };
setGroup("normalize",normali);
Among them, we find the following:
- Dead code elimination: On x86, every arithmetic instruction in low P-code produces six separate flag outputs (CF, OF, SF, ZF, PF, AF). After dead-code elimination, only the flags actually read by a downstream branch or operation survive.
- Stack pointer abstraction: stackptrflow removes the RSP/RBP juggling of function prologues/epilogues.
To illustrate the difference between low and high P-code, here is a concrete example:
As you can see, High P-code really captures the semantics of the function, is closer to the actual source code, and is much easier to work with as it has less noise.
To easily visualize the High P-code produced by the different simplification passes, one can use the following script from NCC Group [5].
Locality-Sensitive Hashing and Weisfeiler-Lehman
Locality-sensitive hashing (LSH) is commonly used in binary similarity detection. Unlike cryptographic hashes, which avoid collisions, LSH is designed to map similar inputs to the same buckets, reducing the amount of data stored in the database while preserving similarity relationships.
However, LSH does not account for the internal structure of inputs, so structural algorithms like the Weisfeiler-Lehman graph refinement can be used to inject structural awareness.
The next section first introduces the Weisfeiler-Lehman algorithm and then describes the different LSH variants used by BSIM.
Weisfeiler-Lehman isomorphism test
With the normalized function in hand, BSIM extracts a set of 32-bit feature hashes. The algorithm is an application of the 1-dimensional Weisfeiler-Lehman (WL) graph isomorphism test [6] to both the data-flow graph and the control-flow graph.
Life can be funny sometimes: our research began years after Elie Mengin published his post on this blog [6]. The original goal was to implement the test as a feature within QBinDiff. As we dug deeper, we eventually set out to understand the algorithm behind BSIM, only to discover later that a former colleague of ours had worked on it. Elie's article does an excellent job of explaining how Weisfeiler-Lehman works, and we highly recommend reading it.
The WL test works by iteratively re-labeling nodes based on their neighborhood:
- Iteration 0: Assign each node an initial label based purely on its own local properties.
- Iteration k: Update each node's label by hashing together its current label and the labels of its immediate neighbors. In BSIM, however, only input neighbors are considered and outputs are excluded.
After k iterations, isomorphic k-hop subgraphs will always produce the same label (but the same label does not guarantee isomorphism). This principle extends to similarity as well: if two expression trees differ in a single leaf, their root hashes will likely diverge. BSIM runs 3 data-flow hashing iterations and 1 block control-flow hashing iteration.
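A minimal sketch of this relabeling loop (using Python's built-in tuple hash in place of BSIM's CRC-based mixing, and, like BSIM, considering only input neighbors):

```python
def wl_iterate(labels, in_neighbors, rounds=3):
    """labels: {node: int}; in_neighbors: {node: [nodes feeding into it]}."""
    for _ in range(rounds):
        new_labels = {}
        for node, label in labels.items():
            # mix the node's own label with its input neighbors' labels
            mixed = label
            for pred in in_neighbors[node]:
                mixed = hash((mixed, labels[pred])) & 0xFFFFFFFF
            new_labels[node] = mixed
        labels = new_labels  # all reads above used the previous round's values
    return labels

# two isomorphic chains converge to identical labels, node for node
chain1 = wl_iterate({"a": 1, "b": 2, "c": 3}, {"a": [], "b": ["a"], "c": ["b"]})
chain2 = wl_iterate({"x": 1, "y": 2, "z": 3}, {"x": [], "y": ["x"], "z": ["y"]})
print(chain1["c"] == chain2["z"])  # True
```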
You may wonder: why 3 iterations? The short answer is that we don't know.
The iteration count, defined by the maxiter variable, appears to have been
set empirically. It is user-configurable via
GraphSigManager::initializeFromStream(), and is explicitly acknowledged as
a tunable parameter rather than a mathematically derived constant. The value
of 3 seems to strike a practical balance: enough context to be meaningfully
discriminating across a function's features, but shallow enough to remain
robust against compiler-introduced noise.
Data-flow graph hashing (varnode features)
The simpleSignature function performs the following:
void simpleSignature(Funcdata *fd, Encoder &encoder) {
  GraphSigManager sigmanager;
  sigmanager.setCurrentFunction(fd);
  sigmanager.generate();
  vector<uint4> feature;
  sigmanager.getSignatureVector(feature);
  // Sends the feature array to the encoder
}
GraphSigManager::setCurrentFunction begins by allocating a SignatureEntry
object for each varnode in the SSA graph of the function. Then, depending
on the configuration, it attempts to remove redundant information using the
SignatureEntry::removeNoise method. This method traverses the P-code graph,
marking nodes that are part of COPY/INDIRECT/MULTIEQUAL chains, then applies a
dominator analysis to collapse redundant copies back to their original value.
A varnode that is merely a renamed copy of another, such as a Phi-node selecting between two copies of the same value, is excluded from feature emission.
As an example, consider the following function:
int foo(int a) {
return a + 1;
}
This produces the following High P-code:
EAX#1 = INT_ADD EDI#2 1:4#3
EAX#4 = COPY EAX#1
RETURN 0:8#5 EAX#4
Varnodes are suffixed with a unique identifier, as in SSA form, to make shadow relationships explicit. The dominator analysis produces the following graph:
Here, EAX#4 falls under EAX#1 in the dominator tree, meaning it carries no
additional information and can safely be ignored during hashing. Once shadow
nodes have been identified, an initial hash is computed for each remaining
node based on its local properties:
SignatureEntry::localHash(v) = hashSize(v) // byte width of the value
^ opcode_hash(def_op(v)) // the operation that defines it
^ constant_value(v) // if it's a constant (optional)
^ 0x55055055 // if it's a persistent global
^ 0x10101 // if it's a function input
Once non-shadowed nodes have their initial hash value (the label),
GraphSigManager::generate runs the Weisfeiler-Lehman algorithm: each round
mixes a node's current hash with its inputs' hashes from the previous round:
void GraphSigManager::generate(void) {
  int4 minusone,firsthalf,secondhalf;
  minusone = maxiter - 1;
  firsthalf = minusone/2;
  secondhalf = minusone - firsthalf;
  signatureIterate();
  for(int4 i=0;i<firsthalf;++i)
    signatureIterate();
  // Do the block signatures incorporating varnode sigs halfway thru
  if (maxblockiter >=0 ) {
    initializeBlocks();
    for(int4 i=0;i<maxblockiter;++i) {
      signatureBlockIterate();
    }
    collectBlockSigs();
    blockClear();
  }
  for(int4 i=0;i<secondhalf;++i)
    signatureIterate();
  collectVarnodeSigs();
  varnodeClear(); // Varnodes are used in block sigs
}
Here, GraphSigManager::signatureIterate propagates hashes across varnode
entries, while GraphSigManager::signatureBlockIterate propagates hashes
across a different kind of entry: BlockSignatureEntry objects. These hold a
hash value representing structural information derived from the CFG. They
are covered in the control-flow graph hashing section
below.
For varnode hashing, commutative operations (MULTIEQUAL, ADD, XOR, etc.)
accumulate inputs in an order-independent way; non-commutative operations
(shifts, subtractions) preserve input order. The following is a pseudocode
version of SignatureEntry::hashIn:
def hashIn(v):
    if isCommutative(v):
        # Commutative case: order-independent accumulation
        accum = 0
        for inp in inputs(v):
            accum += hash_mixin(hash_prev[v], hash_prev[inp])
        hash_new[v] = hash_mixin(hash_prev[v], accum)
    else:
        # Non-commutative case: input order matters
        h = hash_prev[v]
        for inp in inputs(v):
            h = hash_mixin(h, hash_prev[inp])
        hash_new[v] = h
hash_mixin is a custom fuzzy hash function based on two rounds of CRC32
combined with XOR-shift-multiply operations. A double-buffer
(hash[0]/hash[1]) ensures all nodes read from the previous round's values
during each update, making the result independent of iteration order
through the node list.
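The exact structure of hash_mixin lives in the decompiler sources; the following Python sketch is only a loose approximation of the idea (CRC32 rounds combined with xor-shift-multiply), and its constants are assumptions, not Ghidra's:

```python
import zlib

def hash_mixin(a: int, b: int) -> int:
    """Mix two 32-bit hashes: two CRC32 rounds plus an xor-shift-multiply
    finalizer. Illustrative only; Ghidra's real constants differ."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    h = zlib.crc32(b.to_bytes(4, "little"), a)   # CRC32 round 1
    h = zlib.crc32(a.to_bytes(4, "little"), h)   # CRC32 round 2
    h ^= h >> 15
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF            # multiply, mask to 32 bits
    h ^= h >> 13
    return h
```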
After the configured number of iterations (3 by default), every varnode written
by a non-trivial operation and not shadowed emits its final hash as a
VarnodeSignature feature.
Control-flow graph hashing (block features)
The attentive reader may have noticed that between varnode hashing iterations, BSIM runs a parallel hashing pass over the function's basic blocks. This allows structural information to be incorporated into the final signature. Each block is initially seeded purely by its degree (the number of basic blocks entering and leaving it):
BlockSignatureEntry::localHash(B) = (in_degree(B) << 8) | out_degree(B)
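For instance (a sketch of the seeding formula above), a block with two predecessors and one successor gets the initial label 0x201:

```python
def block_local_hash(in_degree: int, out_degree: int) -> int:
    # pack both degrees into a single integer label
    return (in_degree << 8) | out_degree

print(hex(block_local_hash(2, 1)))  # 0x201
```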
As with varnodes, once the initial hash value is computed, iterative
propagation begins through GraphSigManager::signatureBlockIterate.
Predecessor block hashes are mixed in commutatively, but with a twist: for
conditional branches, the true edge and false edge carry different mixing
constants:
def hashIn(B):
    accum = 0xbafabaca
    for pred, edge_kind in predecessors(B):
        h = hash_mixin(hash_prev[B], hash_prev[pred])
        if edge_kind == TRUE_EDGE:
            h = hash_mixin(h, 0x777)
        elif edge_kind == FALSE_EDGE:
            h = hash_mixin(h, 0x777 ^ 0x7abc7abc)
        accum += h
    hash_new[B] = hash_mixin(hash_prev[B], accum)
This means a block's hash encodes which path through a conditional leads to it,
not merely that it has predecessors. Final block features are then generated
by GraphSigManager::collectBlockSigs. For each basic block, BSIM scans for "root" operations, those with side effects visible beyond the function boundary: CALL, CALLIND, STORE, CBRANCH, and RETURN. For each consecutive pair
of root operations, it fuses the block's structural hash with the output
varnode's expression hash at that point:
BlockSignature.hash = hash_mixin(varnode_hash_half_iter, block_hash)
This is the key fusion point: each block feature blends expression semantics with control-flow topology into a single 32-bit value. Once this process is complete, the block signature entries are cleared and varnode hashing resumes for the final iterations, producing the feature vector.
The Feature Vector
The BSIM generation pipeline outputs a sorted list of 32-bit hash values derived from three feature types:
- VarnodeSignature: one hash for each non-shadowed, non-trivially defined varnode (produced by data-flow hashing).
- BlockSignature: one hash for each root operation inside a basic block, as well as a final hash for the full block (produced by control-flow hashing).
- CopySignature: one hash that aggregates all COPY operations per basic block.
This sorted vector encodes the function as a set of structural motif identifiers: semantically equivalent functions yield largely overlapping sets, while unrelated functions yield largely disjoint sets.
Note 1: the vectors are sorted only to speed up the subsequent comparison step of the algorithm.
Note 2: Root operations are operations that represent the roots of expressions.
A BSIM feature vector typically looks like this:
(1:545c6155,1:7086215d,2:bd945601,1:ca0bb8a0,1:e123ddbb)
The number before the colon represents the frequency of consecutive identical hash elements (which allows the vector to be factorized when repeated values are present), and the number after is the feature hash itself.
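Decoding this factorized text form back into the flat sorted hash list can be sketched as follows (the helper name is ours, not Ghidra's):

```python
def decode_feature_vector(text: str) -> list:
    """Expand '(count:hash,...)' run-length pairs into a flat list of ints."""
    hashes = []
    for item in text.strip("()").split(","):
        count, value = item.split(":")
        hashes.extend([int(value, 16)] * int(count))
    return hashes

vec = decode_feature_vector("(1:545c6155,1:7086215d,2:bd945601,1:ca0bb8a0,1:e123ddbb)")
print(len(vec))  # 6 hashes total, one of them repeated twice
```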
To inspect these vectors, Ghidra provides the DumpBSimSignaturesScript.py [7] script.
Comparing the vectors using TF-IDF
Now that we have our feature vectors, how do we compare them? A raw set intersection would be naïve, because not all features are equally informative. A feature encoding "integer addition of two 4-byte values" appears in virtually every compiled function; a feature encoding a specific 3-hop expression tree combining a shift, an XOR, and a masked store is extremely rare and highly discriminating.
BSIM borrows TF-IDF (Term Frequency / Inverse Document Frequency) from information retrieval to weight each feature by its global rarity across a training corpus. In the BSIM context:
- A document is a function stored in the database
- A term is a 32-bit feature hash
- N is the total number of functions in the training database
- n_t is the number of functions containing feature hash t

The IDF weight of a feature t is then defined as:

idf(t) = log(N / n_t)

Features present in nearly every function receive a weight close to 0. Rare, distinctive features receive a high weight. This weighting is not computed at extraction time; it is applied at query time using pre-computed IDF values fitted from a training corpus shipped with Ghidra.
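As a toy illustration of this weighting (using the textbook IDF formula; Ghidra quantizes its pre-computed values into the bucketed tables described below):

```python
import math

def idf(total_functions: int, functions_with_feature: int) -> float:
    # ubiquitous features -> weight near 0; rare features -> large weight
    return math.log(total_functions / functions_with_feature)

N = 100_000
print(round(idf(N, 99_000), 3))  # near-universal feature: ~0.01
print(round(idf(N, 10), 3))      # rare, discriminating feature: ~9.21
```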
Those weight files can be found under Ghidra/Features/BSim/data and are
stored as XML. When creating a database, a weight file is implicitly selected
by setting the config_template parameter via the support/bsim tool.
Taking lshweights_nosize.xml as an example:
<weights settings="0x4d">
<weightfactory scale="1.55369941" addend="6.00980084">
<idf>1.00000000e+00</idf> <!-- bucket 0: rarest features, max weight -->
<idf>9.99459862e-01</idf>
... <!-- 512 IDF weights total -->
<tf>1.00000000e+00</tf> <!-- tf=1: baseline -->
<tf>1.41421356e+00</tf> <!-- tf=2: sqrt(2) -->
... <!-- 64 TF weights total -->
<probflip0>2.67731136e-01</probflip0>
<probflip1>6.20184175e-01</probflip1>
<probdiff0>2.01821663e-02</probdiff0>
<probdiff1>7.10384098e+00</probdiff1>
</weightfactory>
<idflookup size="1000">
<hash count="0">0xd99bb820</hash> <!-- hash seen in 0 functions -->
<hash count="1">0x26111c79</hash>
... <!-- 1000 most common hashes -->
</idflookup>
</weights>
Each feature's weight has two components. The IDF weight reflects how rarely the feature appears across the training corpus. BSIM maintains a lookup table of 1000 feature hashes observed during training, each annotated with a normalized frequency count. When a vector is built, each feature hash is looked up in this table; the resulting count (capped at 511) serves as an index into a 512-entry IDF weight table, where index 0 yields the maximum weight of 1.0 and higher indices yield progressively smaller values approaching 0. Features absent from the table receive index 0 and are therefore treated as maximally rare and maximally informative.
The TF weight reflects how often a feature appears within the specific function being analyzed. Repetition increases the weight, but with diminishing returns following a square-root curve: a feature seen once has weight 1, twice yields √2 ≈ 1.414, four times yields 2, and eight times 2√2. This prevents a function that mechanically repeats a trivial pattern from dominating the similarity score.
The final coefficient for a HashEntry is the product of both components:

coeff = idf_weight × tf_weight

This scalar is computed once when the vector is constructed and stored directly in the entry.
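Both components can be combined in a short sketch (assuming the √tf curve read off the weight-file values above; the helper names are ours):

```python
import math

def tf_weight(count: int) -> float:
    # diminishing returns: 1 -> 1.0, 2 -> ~1.414, 4 -> 2.0
    return math.sqrt(count)

def coefficient(idf_weight: float, count: int) -> float:
    # the scalar stored in each HashEntry at vector construction time
    return idf_weight * tf_weight(count)

print(coefficient(1.0, 4))  # 2.0
```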
Cosine similarity
The vector comparison is implemented across multiple backends and languages:
the H2 (local) and Elasticsearch backends are written in Java, while PostgreSQL
uses a dedicated C extension. We will focus on the Java implementation,
available in Ghidra/Framework/Generic/src/main/java/generic/lsh/vector/LSHCosineVector.java.
With both vectors represented as sorted arrays of HashEntry(hash, coeff, tf) entries, LSHCosineVector.compare() computes their cosine similarity using a merge-join, in O(n + m) time, where n and m are the numbers of distinct features in each vector. No quadratic search is needed because both arrays are already sorted by hash value.
The algorithm maintains two iterators, one per vector, and advances them according to three cases at each step:
- Matching hashes: when both iterators point to entries with the same hash, the feature is shared between the two functions. Its contribution to the dot product is c², where c is the coefficient taken from whichever vector has the lower term frequency. This conservative choice credits only the genuine overlap: if function A uses a pattern three times and function B uses it once, only one occurrence is considered shared. Both iterators then advance.
- Hash in A only: when the hash under iterator A is less than the hash under iterator B, the feature exists only in A. It contributes nothing to the dot product and iterator A advances alone.
- Hash in B only: the symmetric case, where iterator B advances alone.
Non-shared features still matter, however: they were factored in when computing each vector's length, the Euclidean norm ‖v‖ = √(Σᵢ cᵢ²), which is pre-computed during construction.
The final cosine score is:

score(A, B) = (A · B) / (‖A‖ · ‖B‖)

Unmatched features inflate the denominator without contributing to the numerator, naturally penalizing vectors that diverge significantly in their feature sets. The result is a value in [0, 1], where 1 indicates perfectly aligned weighted feature sets and values near 0 indicate little to no overlap.
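The whole comparison step can be sketched in a few lines (illustrative Python mirroring the Java merge-join; entries are (hash, coeff, tf) tuples sorted by hash, and the lower-TF coefficient is credited for shared features):

```python
import math

def cosine_compare(a, b):
    """a, b: lists of (hash, coeff, tf) tuples sorted by hash value."""
    dot, i, j = 0.0, 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            # shared feature: credit the coefficient of the lower-TF side
            c = a[i][1] if a[i][2] <= b[j][2] else b[j][1]
            dot += c * c
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1  # feature only in a: skipped, but still part of a's norm
        else:
            j += 1  # feature only in b
    norm_a = math.sqrt(sum(c * c for _, c, _ in a))
    norm_b = math.sqrt(sum(c * c for _, c, _ in b))
    return dot / (norm_a * norm_b)

v = [(0x11, 1.0, 1), (0x22, 2.0, 1)]
print(cosine_compare(v, v))  # ~1.0
```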
Conclusion
This post has walked through the complete BSIM pipeline: from raw machine instructions lifted to High P-code, through Weisfeiler-Lehman hashing of both the data-flow and control-flow graphs, to the final TF-IDF-weighted cosine similarity comparison.
A few open questions remain. The hashing constants (0x55055055,
0xbafabaca, 0x777, 0x7abc7abc) and the choice of 3 data-flow iterations
are clearly empirical but no public documentation explains the experiments
that informed them. Similarly, the training corpus used to fit the IDF weights
shipped with Ghidra is undocumented; the distribution of functions it contains
will directly influence which features are considered rare and therefore
discriminating.
The use of the Weisfeiler-Lehman test for binary function analysis was already explored by Elie Mengin in his work [6], whose post we strongly recommend reading for the theoretical underpinnings of the graph kernel. Another great piece of work that we need to mention is Hashashin [8] by River Loop Security, which presents a similar approach using Binary Ninja IL and LSH for cross-architecture function similarity, before BSIM's public release. Whether these works directly influenced the Ghidra team's design is unknown, but they share the same core intuitions: normalize away architecture noise, encode semantics and code structure information as graph features, and compare functions in a metric space where similarity implies behavioral equivalence.
Understanding these internals matters. Knowing how features are generated exposes the limits of the approach: very small functions (few varnodes, no root operations) produce sparse vectors and are inherently harder to match; heavily inlined or LTO-compiled code may fragment a logical function into shapes that look unlike the original; and an IDF table trained on a Windows x86-64 userspace corpus may transfer poorly to a very different domain, such as RTOS ARM baremetal firmware.
It is precisely these trade-offs that shaped our design choices when building SightHouse. If you are curious to see BSIM put to work in practice, feel free to check it out!
Acknowledgments
First of all, thanks to the Ghidra developers and the community behind it for creating this awesome tool available to everyone!
Thanks to all my Quarkslab colleagues for proofreading this article. I also would like to express my gratitude to Roxane Cohen and Aldo Moscattelli for their help and guidance regarding the understanding of the implementation and theories behind it.
References
1. National Security Agency (NSA) Ghidra Team, How Does BSIM Work?
2. National Security Agency (NSA) Ghidra Team, P-Code Reference Manual.
3. National Security Agency (NSA) Ghidra Team, SLEIGH Overview.
4. Norman Ramsey and Mary F. Fernández, Specifying Representations of Machine Instructions. ACM Transactions on Programming Languages and Systems, Volume 19, Issue 2, Pages 492-524.
5. NCC Group, PrintHighPCode.java.
6. Elie Mengin, Weisfeiler-Lehman Graph Kernel for Binary Function Analysis, Quarkslab, 2019.
7. National Security Agency (NSA) Ghidra Team, DumpBSimSignaturesScript.
8. Rylan O'Connell and Ryan Speers, Hashashin: Using Binary Hashing to Port Annotations, 2019.