<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Quarkslab's blog - BSIM</title><link href="http://blog.quarkslab.com/" rel="alternate"></link><link href="http://blog.quarkslab.com/feeds/bsim.rss.xml" rel="self"></link><id>http://blog.quarkslab.com/</id><updated>2026-04-14T00:00:00+02:00</updated><entry><title>BSIM explained once and for all!</title><link href="http://blog.quarkslab.com/bsim-explained-once-and-for-all.html" rel="alternate"></link><published>2026-04-14T00:00:00+02:00</published><updated>2026-04-14T00:00:00+02:00</updated><author><name>Sami Babigeon</name></author><id>tag:blog.quarkslab.com,2026-04-14:/bsim-explained-once-and-for-all.html</id><summary type="html">&lt;p&gt;Since its initial release in December 2023, many people have used and built tools around the BSIM feature of Ghidra, but until now its internals were unknown. This post sheds some light on how BSIM works, both theoretically and in its C++ implementation.&lt;/p&gt;</summary><content type="html">&lt;h1 id="introduction"&gt;Introduction&lt;/h1&gt;
&lt;p&gt;During our work on &lt;a href="https://github.com/quarkslab/sighthouse"&gt;SightHouse&lt;/a&gt;, we 
evaluated several binary similarity engines to find one that met our needs. 
After thorough evaluation, we chose Ghidra's &lt;strong&gt;B&lt;/strong&gt;ehavioral &lt;strong&gt;Sim&lt;/strong&gt;ilarity
(BSIM) feature. One thing that sets BSIM apart from other approaches is 
that, despite being open source, its algorithm is sparsely documented.&lt;/p&gt;
&lt;p&gt;Existing documentation&lt;sup id="fnref:1"&gt;&lt;a class="footnote-ref" href="#fn:1"&gt;1&lt;/a&gt;&lt;/sup&gt; indicates BSIM uses &lt;em&gt;locality-sensitive hashing&lt;/em&gt;
and &lt;em&gt;cosine similarity&lt;/em&gt;, but the description is brief and incomplete. 
So here it is, once and for all, BSIM finally explained!&lt;/p&gt;
&lt;p&gt;All information in this post regarding Ghidra refers to the code in the 
&lt;a href="https://github.com/NationalSecurityAgency/ghidra/tree/Ghidra_12.0_build"&gt;Ghidra_12.0_build&lt;/a&gt; 
tag on GitHub.&lt;/p&gt;
&lt;h1 id="bsim-overview"&gt;BSIM Overview&lt;/h1&gt;
&lt;p&gt;BSIM is designed to identify whether two binary functions implement the same
semantics, regardless of compiler, optimization level, or target architecture.
It works by first lifting each function through Ghidra's decompiler to 
obtain P-code instructions, which are Ghidra's architecture-independent 
Intermediate Representation of the decompiled code&lt;sup id="fnref:2"&gt;&lt;a class="footnote-ref" href="#fn:2"&gt;2&lt;/a&gt;&lt;/sup&gt;. These instructions are
considered "raw" or "Low P-code"; the decompiler then normalizes away compiler
noise, stripping dead flag computations, abstracting stack mechanics, and 
producing a clean SSA (Static Single Assignment) dataflow graph. This refined
form of P-code is called "High P-code". It shares the same grammar as raw 
P-code but is rewritten into a cleaner, normalized form. There are a few
notable differences: for instance, the MULTIEQUAL operation (phi-node) only
appears in High P-code.&lt;/p&gt;
&lt;p&gt;Once generated, BSIM iterates over these refined instructions and incrementally
hashes them into a "feature vector" (a vector of integer hash values). These
feature hashes form a function fingerprint, which is stored in a database 
(local, PostgreSQL, or Elasticsearch). When querying for similar functions, 
BSIM retrieves candidates from the database by comparing their feature 
vectors with the query's. The result is a similarity score between 0 and 1,
where values close to 1 indicate functions that are very likely to be
semantically equivalent.&lt;/p&gt;
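&lt;p&gt;To make the idea of a score between 0 and 1 concrete, here is a minimal Python
sketch of cosine similarity computed over two bags of feature hashes. It is only an
illustration: the hash values are made up, the helper name &lt;code&gt;cosine_similarity&lt;/code&gt; is
ours and not Ghidra's, and BSIM's actual scoring may weight features differently.&lt;/p&gt;

```python
import math
from collections import Counter


def cosine_similarity(features_a, features_b):
    """Cosine similarity between two bags of integer feature hashes."""
    counts_a = Counter(features_a)
    counts_b = Counter(features_b)
    # Dot product: only hashes present in both bags contribute.
    dot = sum(count * counts_b[h] for h, count in counts_a.items() if h in counts_b)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is not similar to anything
    return dot / (norm_a * norm_b)


# Made-up 32-bit feature hashes for two hypothetical functions.
func_a = [0x1A2B3C4D, 0x12345678, 0x0BADF00D, 0x0BADF00D]
func_b = [0x1A2B3C4D, 0x12345678, 0x0BADF00D, 0xDEADBEEF]

score = cosine_similarity(func_a, func_b)
print(f"similarity: {score:.3f}")
```

&lt;p&gt;Because the counts are never negative, the score always falls between 0 (no shared
feature) and 1 (identical bags of features).&lt;/p&gt;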
&lt;p&gt;The figure below presents the different steps of the BSIM pipeline:&lt;/p&gt;
&lt;div class="row"&gt;
&lt;center&gt;
&lt;a href="resources/2026-04-14_bsim_explained_once_and_for_all/bsim_pipeline.svg" target="_blank"&gt;
&lt;img alt="BSIM pipeline" height="70%" src="resources/2026-04-14_bsim_explained_once_and_for_all/bsim_pipeline.svg"/&gt;
&lt;/a&gt;
&lt;/center&gt;
&lt;/div&gt;
&lt;p&gt;The next parts of this post break down these two steps: feature generation
and how the resulting vectors are compared.&lt;/p&gt;
&lt;h1 id="ghidra-architecture"&gt;Ghidra Architecture&lt;/h1&gt;
&lt;p&gt;To understand how BSIM works, we need to explain how Ghidra operates. Ghidra is
mainly written in Java, except for a few components including the decompiler, 
which is written in C++. The decompiler sources are located under 
&lt;code&gt;Ghidra/Features/Decompiler/src/decompile/cpp&lt;/code&gt;, referred to later in this post
as &lt;code&gt;DECOMP_DIR&lt;/code&gt;. &lt;/p&gt;
&lt;p&gt;The interaction between these two environments uses a small custom serial 
protocol that reads input from the decompiler process's &lt;em&gt;stdin&lt;/em&gt; and returns
results on &lt;em&gt;stdout&lt;/em&gt;. The implementation is available at 
&lt;code&gt;DECOMP_DIR/ghidra_process.cc&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Whenever Ghidra needs to decompile a function, it spawns (or reuses) one of the
decompiler processes. It sends all necessary information (raw bytes, processor
definitions, address spaces, etc.) to that process and then displays the 
decompilation results in the UI.&lt;/p&gt;
&lt;p&gt;The decompiler loads a SLEIGH&lt;sup id="fnref:3"&gt;&lt;a class="footnote-ref" href="#fn:3"&gt;3&lt;/a&gt;&lt;/sup&gt; definition corresponding to the processor 
identifier (for example, &lt;code&gt;x86:LE:64:default&lt;/code&gt;). SLEIGH is a processor 
description language originally based on SLED&lt;sup id="fnref:4"&gt;&lt;a class="footnote-ref" href="#fn:4"&gt;4&lt;/a&gt;&lt;/sup&gt; but refined for Ghidra's
needs. SLEIGH has two main goals: enabling disassembly and decompilation.&lt;/p&gt;
&lt;p&gt;For decompilation, SLEIGH specifies the translation from machine instructions
into P-code. P-code is a register-transfer language (RTL) designed to capture
the semantics of machine instructions in a uniform, processor-independent form.
Code for different processors can be translated straightforwardly into P-code, 
allowing a single suite of analysis tools to perform data-flow analysis
and decompilation.&lt;/p&gt;
&lt;p&gt;Finally, to fully understand P-code, we need to introduce 3 concepts:&lt;br/&gt;
the &lt;strong&gt;address space&lt;/strong&gt;, the &lt;strong&gt;varnode&lt;/strong&gt;, and the &lt;strong&gt;operation&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Address Space&lt;/strong&gt;: A named region where bytes can be addressed and 
  manipulated, such as RAM, registers, or special internal storage. 
  The defining characteristics of a space are its name, size and endianness.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Varnode&lt;/strong&gt;: The fundamental unit of data in P-code, representing a 
  contiguous sequence of bytes within an address space, uniquely characterized
  by its address space, offset, and size.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Operation&lt;/strong&gt;: An operation (often called a P-code op) is a single, primitive
  action that takes one or more varnodes as inputs and optionally produces
  one output varnode.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To illustrate P-code, consider the following bytes: &lt;code&gt;b82a000000&lt;/code&gt;. Using the 
x86_64 instruction set in little-endian, we can disassemble those bytes as
&lt;code&gt;MOV EAX, 0x2a&lt;/code&gt;, which can be translated to the following P-code operation: 
&lt;code&gt;RAX = COPY 42:8&lt;/code&gt;. Although the instruction names EAX, the destination varnode
is RAX: on x86_64, writing a 32-bit register zero-extends into the full 64-bit
register. RAX is therefore assigned a copy of a source varnode with an immediate
value of 42 and a size of 8 bytes (i.e., a 64-bit value).&lt;/p&gt;
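&lt;p&gt;The three concepts can be modelled with a few small data classes. The sketch below
is our own simplified model, not Ghidra's API: the class names, the &lt;code&gt;describe&lt;/code&gt; and
&lt;code&gt;render&lt;/code&gt; helpers, and the assumption that RAX sits at offset 0 of the register space
are all illustrative.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AddressSpace:
    name: str          # e.g. "ram", "register", "const"
    size: int          # size of an address in bytes
    big_endian: bool


@dataclass(frozen=True)
class Varnode:
    space: AddressSpace
    offset: int        # in the "const" space, the offset is the value itself
    size: int          # number of bytes


@dataclass(frozen=True)
class PcodeOp:
    opcode: str
    inputs: Tuple[Varnode, ...]
    output: Optional[Varnode] = None


def describe(vn, register_names):
    """Render a varnode roughly the way this post writes it."""
    if vn.space.name == "const":
        return f"{vn.offset}:{vn.size}"
    fallback = f"{vn.space.name}[{vn.offset:#x}:{vn.size}]"
    return register_names.get((vn.offset, vn.size), fallback)


def render(op, register_names):
    args = " ".join(describe(v, register_names) for v in op.inputs)
    if op.output is None:
        return f"{op.opcode} {args}"
    return f"{describe(op.output, register_names)} = {op.opcode} {args}"


REGISTER = AddressSpace("register", 4, False)
CONST = AddressSpace("const", 8, False)

# Assumption for the example: RAX is (offset 0, size 8) in the register space.
names = {(0, 8): "RAX"}
rax = Varnode(REGISTER, 0, 8)
imm = Varnode(CONST, 42, 8)
op = PcodeOp("COPY", (imm,), rax)
print(render(op, names))  # RAX = COPY 42:8
```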
&lt;h1 id="down-the-rabbit-hole"&gt;Down the rabbit hole&lt;/h1&gt;
&lt;h2 id="p-code-lifting-and-normalization"&gt;P-code lifting and normalization&lt;/h2&gt;
&lt;p&gt;The main entrypoint of the signature generation is the &lt;code&gt;SignaturesAt::rawAction&lt;/code&gt;
function located in &lt;code&gt;DECOMP_DIR/signature_ghidra.cc&lt;/code&gt;. This function is called 
whenever the "generateSignatures" action is triggered by Ghidra through the 
custom serial protocol. &lt;/p&gt;
&lt;p&gt;This function takes the address of the function and loads it. It then runs
the function through Ghidra's decompiler under the normalize action, 
a specific subset of the full decompilation pipeline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;SignaturesAt::rawAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Funcdata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ghidra&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;symboltab&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;getGlobalScope&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;queryFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;isProcStarted&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;curname&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ghidra&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;allacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCurrentName&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Action&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;sigact&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curname&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"normalize"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;sigact&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ghidra&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;allacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setCurrent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"normalize"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;sigact&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ghidra&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;allacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCurrent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sigact&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;sigact&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curname&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"normalize"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;ghidra&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;allacts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setCurrent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curname&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;PackedEncode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="c1"&gt;// Write output XML directly to outstream&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;simpleSignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;strong&gt;normalize&lt;/strong&gt; action runs a &lt;a href="https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_12.0_build/Ghidra/Features/Decompiler/src/main/java/ghidra/app/decompiler/DecompInterface.java#L454-L490"&gt;specific subset&lt;/a&gt; of the full
decompilation pipeline. The result is a function represented in SSA form as
a multigraph of varnodes (SSA values) connected by P-code operations.&lt;/p&gt;
&lt;p&gt;The action applies a sequence of analysis passes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;normali&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"base"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"protorecovery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"protorecovery_b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"deindirect"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"localrecovery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;"deadcode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"stackptrflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"normalanalysis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;"stackvars"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"deadcontrolflow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"analysis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"fixateproto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"nodejoin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;"unreachable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"subvar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"floatprecision"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"normalizebranches"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;"conditionalexe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;setGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"normalize"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;normali&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Two of these passes are particularly relevant here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dead code is eliminated&lt;/strong&gt;: On x86, every arithmetic instruction in 
  low P-code produces six separate flag outputs (CF, OF, SF, ZF, PF, AF). 
  After dead-code elimination, only flags actually read by a downstream
  branch or operation survive. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stack pointer is abstracted away&lt;/strong&gt;: &lt;code&gt;stackptrflow&lt;/code&gt; removes the RSP/RBP
  juggling of function prologues/epilogues. &lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To illustrate the difference between low and high P-code, here is a 
concrete example: &lt;/p&gt;
&lt;div class="row"&gt;
&lt;center&gt;
&lt;a href="resources/2026-04-14_bsim_explained_once_and_for_all/pcode_comparison.svg" target="_blank"&gt;
&lt;img alt="Different stages of P-code lifting" src="resources/2026-04-14_bsim_explained_once_and_for_all/pcode_comparison.svg" width="90%"/&gt;
&lt;/a&gt;
&lt;/center&gt;
&lt;/div&gt;
&lt;p&gt;As you can see, High P-code really captures the semantics of the function, is
closer to the actual source code, and is much easier to work with as it has
less noise.&lt;/p&gt;
&lt;p&gt;To easily visualize the High P-code produced by the different simplification 
passes, one can use the following script from NCC Group&lt;sup id="fnref:5"&gt;&lt;a class="footnote-ref" href="#fn:5"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;h2 id="local-sensitive-hashing-and-weisfeiler-lehman"&gt;Local-sensitive Hashing and Weisfeiler-Lehman&lt;/h2&gt;
&lt;p&gt;Locality-sensitive hashing (LSH) is commonly used in binary similarity 
detection. Unlike cryptographic hashes, which are designed to avoid collisions,
LSH deliberately maps similar inputs to the same bucket with high probability,
reducing the amount of data stored in the database while preserving
similarity relationships.&lt;/p&gt;
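&lt;p&gt;As a rough illustration of the principle, the sketch below implements a generic
random-hyperplane LSH, a textbook scheme suited to cosine similarity. It is not
necessarily the variant BSIM uses, and the function names &lt;code&gt;make_hyperplanes&lt;/code&gt; and
&lt;code&gt;lsh_bucket&lt;/code&gt; are ours.&lt;/p&gt;

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes (normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_bucket(vector, hyperplanes):
    """One bit per hyperplane: which side of it the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


planes = make_hyperplanes(dim=4, bits=8)
base = [3.0, 1.0, 0.0, 2.0]
scaled = [2 * x for x in base]   # same direction, so same bucket
opposite = [-x for x in base]    # opposite direction

print(lsh_bucket(base, planes))
print(lsh_bucket(scaled, planes))
print(lsh_bucket(opposite, planes))
```

&lt;p&gt;Vectors separated by a small angle rarely fall on different sides of a random
hyperplane, so they tend to share most or all of their bucket bits, whereas a
cryptographic hash would scatter them.&lt;/p&gt;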
&lt;p&gt;However, LSH does not account for the internal structure of inputs, so structural
algorithms like the Weisfeiler-Lehman graph refinement can be used to inject 
structural awareness.&lt;/p&gt;
&lt;p&gt;The next section first introduces the Weisfeiler-Lehman algorithm and then
describes the different LSH variants used by BSIM.&lt;/p&gt;
&lt;h3 id="weisfeiler-lehman-isomorphism-test"&gt;Weisfeiler-Lehman isomorphism test&lt;/h3&gt;
&lt;p&gt;With the normalized function in hand, BSIM extracts a set of 32-bit 
feature hashes. The algorithm is an application of the 1-dimensional
Weisfeiler-Lehman (WL) graph isomorphism test&lt;sup id="fnref:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt; to both the data-flow graph
and the control-flow graph.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Life can be funny sometimes: our research began years after Elie Mengin
published his post on this blog&lt;sup id="fnref3:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt;. The original goal was to implement the test
as a feature within &lt;a href="https://github.com/quarkslab/qbindiff"&gt;QBinDiff&lt;/a&gt;. As we
dug deeper, we eventually set out to understand the algorithm behind BSIM; 
only to discover later that a former colleague of ours had worked on it. 
Elie's article does an excellent job of explaining how Weisfeiler-Lehman
works, and we highly recommend reading it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The WL test works by iteratively re-labeling nodes based on their neighborhood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Iteration 0&lt;/strong&gt;: Assign each node an initial label based purely on its own
  local properties.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Iteration k&lt;/strong&gt;: Update each node's label by hashing together its current
  label and the labels of its immediate neighbors. In BSIM, however, only
  input neighbors are considered and outputs are excluded.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After &lt;em&gt;k&lt;/em&gt; iterations, isomorphic k-hop subgraphs will always produce the same
label (but the same label does not guarantee isomorphism). This is also what
makes the labels useful for similarity: if two expression trees differ in a
single leaf within &lt;em&gt;k&lt;/em&gt; hops of the root, their root hashes will almost certainly
diverge, while nodes that do not depend on that leaf keep identical labels.
BSIM runs 3 data-flow hashing iterations and 1 block control-flow hashing
iteration.&lt;/p&gt;
&lt;p&gt;You may wonder: &lt;em&gt;why 3 iterations?&lt;/em&gt; The short answer is that we don't know.
The iteration count, defined by the &lt;code&gt;maxiter&lt;/code&gt; variable, appears to have been
set empirically. It is user-configurable via 
&lt;code&gt;GraphSigManager::initializeFromStream()&lt;/code&gt;, and is explicitly acknowledged as
a tunable parameter rather than a mathematically derived constant. The value
of 3 seems to strike a practical balance: enough context to be meaningfully
discriminating across a function's features, but shallow enough to remain
robust against compiler-introduced noise.&lt;/p&gt;
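&lt;p&gt;The relabeling loop described above can be sketched in a few lines of Python. This
is a simplified model under our own assumptions: &lt;code&gt;mix&lt;/code&gt; uses CRC32 as a stand-in
for BSIM's real hash mixing, inputs are hashed in the order given, and the initial
labels are arbitrary integers.&lt;/p&gt;

```python
import zlib


def mix(label, input_labels):
    """Combine a label with the ordered labels of its inputs into 32 bits."""
    data = ",".join(str(x) for x in [label, *input_labels]).encode()
    return zlib.crc32(data)


def wl_refine(initial_labels, inputs, iterations):
    """Input-only Weisfeiler-Lehman relabeling.

    initial_labels: node -> integer label (iteration 0)
    inputs:         node -> list of nodes feeding into it
    """
    labels = dict(initial_labels)
    for _ in range(iterations):
        previous = labels
        # Every node is updated from the labels of the previous round only.
        labels = {
            node: mix(previous[node], [previous[src] for src in inputs.get(node, [])])
            for node in previous
        }
    return labels


# Two tiny data-flow graphs, "a + 1" and "a + 2": they differ in one leaf.
graph_inputs = {"sum": ["a", "c"]}
plus_one = wl_refine({"a": 1, "c": 100, "sum": 7}, graph_inputs, 3)
plus_two = wl_refine({"a": 1, "c": 101, "sum": 7}, graph_inputs, 3)

print(plus_one["a"] == plus_two["a"])      # "a" has no inputs and is unaffected
print(plus_one["sum"] == plus_two["sum"])  # "sum" depends on the changed leaf
```

&lt;p&gt;Since only inputs are mixed in, a change propagates downstream to the values
computed from it and never upstream, which keeps unrelated parts of the graph
stable.&lt;/p&gt;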
&lt;h3 id="data-flow-graph-hashing-varnode-features"&gt;Data-flow graph hashing (varnode features)&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;simpleSignature&lt;/code&gt; function performs the following: &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;simpleSignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Funcdata&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Encoder&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;GraphSigManager&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sigmanager&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sigmanager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;setCurrentFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sigmanager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;uint4&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sigmanager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getSignatureVector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Sends the feature array to the encoder&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;code&gt;GraphSigManager::setCurrentFunction&lt;/code&gt; begins by allocating a &lt;code&gt;SignatureEntry&lt;/code&gt;
object for each varnode in the SSA graph of the function. Then, depending
on the configuration, it attempts to remove redundant information using the
&lt;code&gt;SignatureEntry::removeNoise&lt;/code&gt; method. This method traverses the P-code graph,
marking nodes that are part of COPY/INDIRECT/MULTIEQUAL chains, then applies a
dominator analysis to collapse redundant copies back to their original value.
A varnode that is merely a renamed copy of another (for example, a phi-node
selecting between two copies of the same input) is excluded from 
feature emission.&lt;/p&gt;
&lt;p&gt;As an example, consider the following function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This produces the following High P-code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;EAX#1 = INT_ADD EDI#2 1:4#3
EAX#4 = COPY EAX#1
RETURN 0:8#5 EAX#4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Varnodes are suffixed with a unique identifier, as in SSA form, to make shadow 
relationships explicit. The dominator analysis produces the following graph:&lt;/p&gt;
&lt;div class="row"&gt;
&lt;center&gt;
&lt;a href="resources/2026-04-14_bsim_explained_once_and_for_all/dominator_tree.svg" target="_blank"&gt;
&lt;img alt="Example of dominator tree" src="resources/2026-04-14_bsim_explained_once_and_for_all/dominator_tree.svg" width="25%"/&gt;
&lt;/a&gt;
&lt;/center&gt;
&lt;/div&gt;
&lt;p&gt;Here, &lt;code&gt;EAX#4&lt;/code&gt; falls under &lt;code&gt;EAX#1&lt;/code&gt; in the dominator tree, meaning it carries no
additional information and can safely be ignored during hashing. Once shadow
nodes have been identified, an initial hash is computed for each remaining
node based on its local properties:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;SignatureEntry&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;localHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hashSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="c1"&gt;// byte width of the value&lt;/span&gt;
&lt;span class="w"&gt;                             &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;opcode_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;def_op&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// the operation that defines it&lt;/span&gt;
&lt;span class="w"&gt;                             &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;constant_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="c1"&gt;// if it's a constant (optional)&lt;/span&gt;
&lt;span class="w"&gt;                             &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x55055055&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="c1"&gt;// if it's a persistent global&lt;/span&gt;
&lt;span class="w"&gt;                             &lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;0x10101&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="c1"&gt;// if it's a function input&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
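&lt;p&gt;The same formula can be written as a small runnable Python sketch. The two tag
constants come from the formula above; everything else is an assumption made for
illustration, in particular &lt;code&gt;hash_size&lt;/code&gt; and &lt;code&gt;opcode_hash&lt;/code&gt;, which are CRC32-based
stand-ins for Ghidra's real helper functions.&lt;/p&gt;

```python
import zlib

PERSISTENT_GLOBAL_TAG = 0x55055055  # constant quoted in the formula above
FUNCTION_INPUT_TAG = 0x10101        # constant quoted in the formula above


def hash_size(size):
    # Hypothetical stand-in: the real size hashing is not reproduced here.
    return zlib.crc32(b"size:" + str(size).encode())


def opcode_hash(opcode):
    # Hypothetical stand-in for the per-opcode hash.
    return zlib.crc32(b"op:" + opcode.encode())


def local_hash(size, def_opcode=None, constant=None,
               is_persistent=False, is_input=False):
    """XOR the local properties of a varnode into a 32-bit label."""
    h = hash_size(size)
    if def_opcode is not None:
        h ^= opcode_hash(def_opcode)
    if constant is not None:
        h ^= constant % 2**32  # keep the label within 32 bits
    if is_persistent:
        h ^= PERSISTENT_GLOBAL_TAG
    if is_input:
        h ^= FUNCTION_INPUT_TAG
    return h


# Initial labels for: EAX#1 = INT_ADD EDI#2 1:4#3
eax1 = local_hash(4, def_opcode="INT_ADD")
edi2 = local_hash(4, is_input=True)
one3 = local_hash(4, constant=1)
print(f"{eax1:#010x} {edi2:#010x} {one3:#010x}")
```

&lt;p&gt;Since XOR is its own inverse, each property toggles a fixed set of bits, so two
varnodes that differ in a single property always receive different initial labels.&lt;/p&gt;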
&lt;p&gt;Once non-shadowed nodes have their initial hash value (the label), 
&lt;code&gt;GraphSigManager::generate&lt;/code&gt; runs the Weisfeiler-Lehman algorithm: each round
mixes a node's current hash with its inputs' hashes from the previous round:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;GraphSigManager::generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;int4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minusone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;firsthalf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;secondhalf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;minusone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;maxiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;firsthalf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minusone&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;secondhalf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;minusone&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;firsthalf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;signatureIterate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;int4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;firsthalf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;signatureIterate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Do the block signatures incorporating varnode sigs halfway thru&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxblockiter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;initializeBlocks&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;int4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;maxblockiter&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;signatureBlockIterate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;collectBlockSigs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;blockClear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;int4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;secondhalf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;signatureIterate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;collectVarnodeSigs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;varnodeClear&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="c1"&gt;// Varnodes are used in block sigs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here, &lt;code&gt;GraphSigManager::signatureIterate&lt;/code&gt; propagates hashes across varnode
entries, while &lt;code&gt;GraphSigManager::signatureBlockIterate&lt;/code&gt; propagates hashes
across a different kind of entry: &lt;code&gt;BlockSignatureEntry&lt;/code&gt; objects. These hold a
hash value representing structural information derived from the CFG. They
are covered in the control-flow graph hashing section 
&lt;a href="#control-flow-graph-hashing-block-features"&gt;below&lt;/a&gt;.&lt;/p&gt;
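&lt;p&gt;To make the scheduling in &lt;code&gt;generate&lt;/code&gt; concrete, the short Python sketch below mirrors its integer arithmetic. The helper name &lt;code&gt;iteration_schedule&lt;/code&gt; is hypothetical (it does not exist in Ghidra); it only reports how many varnode rounds run before and after the block pass for a given &lt;code&gt;maxiter&lt;/code&gt;.&lt;/p&gt;

```python
def iteration_schedule(maxiter):
    """Return (varnode rounds before the block pass, rounds after it)."""
    minusone = maxiter - 1
    firsthalf = minusone // 2          # matches C++ int division for maxiter of 1 or more
    secondhalf = minusone - firsthalf
    # generate() always runs one round, then `firsthalf` more, before the blocks
    return 1 + firsthalf, secondhalf


before, after = iteration_schedule(3)
print(before, after)  # 2 1
```

&lt;p&gt;With &lt;code&gt;maxiter = 3&lt;/code&gt;, block features are built from varnode hashes that have already been through two rounds, and one round remains afterwards. The two numbers always add up to &lt;code&gt;maxiter&lt;/code&gt;.&lt;/p&gt;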
&lt;p&gt;For varnode hashing, commutative operations (MULTIEQUAL, ADD, XOR, etc.) 
accumulate inputs in an order-independent way; non-commutative operations
(shifts, subtractions) preserve input order. The following is a pseudocode
version of &lt;code&gt;SignatureEntry::hashIn&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hashIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isCommutative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Commutative case&lt;/span&gt;
    &lt;span class="n"&gt;accum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;accum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;hash_new&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;accum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Non-commutative case&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;hash_new&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
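&lt;p&gt;The pseudocode above can be turned into a small runnable sketch. Note the assumptions: &lt;code&gt;toy_mix&lt;/code&gt; is a stand-in built on &lt;code&gt;zlib.crc32&lt;/code&gt; and is not BSIM's &lt;code&gt;hash_mixin&lt;/code&gt;, and &lt;code&gt;hash_in&lt;/code&gt; takes the previous-round hashes as plain integers instead of reading them from a double buffer. It only illustrates why summing makes the commutative branch order-independent.&lt;/p&gt;

```python
import zlib

WORD = 2**32  # hashes are 32-bit values


def toy_mix(a, b):
    # Stand-in mixer for illustration only; NOT BSIM's hash_mixin.
    return zlib.crc32(a.to_bytes(4, "little") + b.to_bytes(4, "little"))


def hash_in(own, input_hashes, commutative):
    """One propagation round for a single node, reading previous-round hashes."""
    if commutative:
        accum = 0
        for inp in input_hashes:
            accum = (accum + toy_mix(own, inp)) % WORD  # a sum ignores input order
        return toy_mix(own, accum)
    h = own
    for inp in input_hashes:
        h = toy_mix(h, inp)  # chaining keeps input order
    return h


same = hash_in(0x1234, [0xAAAA, 0xBBBB], True) == hash_in(0x1234, [0xBBBB, 0xAAAA], True)
print(same)  # True
```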
&lt;p&gt;&lt;code&gt;hash_mixin&lt;/code&gt; is a custom fuzzy hash function based on two rounds of CRC32 
combined with XOR-shift-multiply operations. A double-buffer 
(&lt;code&gt;hash[0]&lt;/code&gt;/&lt;code&gt;hash[1]&lt;/code&gt;) ensures all nodes read from the previous round's values
during each update, making the result independent of iteration order 
through the node list.&lt;/p&gt;
&lt;p&gt;After the configured number of iterations (3 by default), every non-shadowed
varnode written by a non-trivial operation emits its final hash as a
&lt;code&gt;VarnodeSignature&lt;/code&gt; feature.&lt;/p&gt;
&lt;h3 id="control-flow-graph-hashing-block-features"&gt;Control-flow graph hashing (block features)&lt;/h3&gt;
&lt;p&gt;The attentive reader may have noticed that between varnode hashing iterations,
BSIM runs a parallel hashing pass over the function's basic blocks. This allows
structural information to be incorporated into the final signature. Each block
is initially seeded purely by its degree, that is, its number of incoming and
outgoing edges:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;BlockSignatureEntry&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;localHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;in_degree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out_degree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As with varnodes, once the initial hash value is computed, iterative 
propagation begins through &lt;code&gt;GraphSigManager::signatureBlockIterate&lt;/code&gt;. 
Predecessor block hashes are mixed in commutatively, but with a twist: for
conditional branches, the true edge and false edge carry different mixing
constants:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hashIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;accum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0xbafabaca&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edge_kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;predecessors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;edge_kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;TRUE_EDGE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x777&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;edge_kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;FALSE_EDGE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x777&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="mh"&gt;0x7abc7abc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;accum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
  &lt;span class="n"&gt;hash_new&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;accum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
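&lt;p&gt;As a hedged illustration of the edge-dependent mixing, here is a runnable version of the same logic. Again &lt;code&gt;toy_mix&lt;/code&gt; is a CRC32-based stand-in and not BSIM's &lt;code&gt;hash_mixin&lt;/code&gt;, and the string edge labels are our own convention; only the three constants are taken from the pseudocode above.&lt;/p&gt;

```python
import zlib

WORD = 2**32


def toy_mix(a, b):
    # Stand-in mixer for illustration only; NOT BSIM's hash_mixin.
    return zlib.crc32(a.to_bytes(4, "little") + b.to_bytes(4, "little"))


def block_hash_in(own, preds):
    """preds is a list of (predecessor hash, edge kind) pairs."""
    accum = 0xBAFABACA
    for pred_hash, edge_kind in preds:
        h = toy_mix(own, pred_hash)
        if edge_kind == "true":
            h = toy_mix(h, 0x777)
        elif edge_kind == "false":
            h = toy_mix(h, 0x777 ^ 0x7ABC7ABC)
        accum = (accum + h) % WORD  # predecessors are accumulated commutatively
    return toy_mix(own, accum)


via_true = block_hash_in(0x0101, [(0x0102, "true")])
via_false = block_hash_in(0x0101, [(0x0102, "false")])
print(via_true != via_false)  # True
```

&lt;p&gt;The same predecessor reached through the true edge and through the false edge yields two different block hashes, while swapping the order of the predecessor list leaves the result unchanged.&lt;/p&gt;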
&lt;p&gt;This means a block's hash encodes which path through a conditional leads to it,
not merely that it has predecessors. Final block features are then generated
by &lt;code&gt;GraphSigManager::collectBlockSigs&lt;/code&gt;. For each basic block, BSIM scans for
"root" operations, meaning those with side effects visible beyond the function
boundary: CALL, CALLIND, STORE, CBRANCH, and RETURN. For each consecutive pair
of root operations, it fuses the block's structural hash with the output
varnode's expression hash at that point:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;BlockSignature&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hash_mixin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;varnode_hash_half_iter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;block_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is the key fusion point: each block feature blends expression semantics
with control-flow topology into a single 32-bit value. Once this process is
complete, the block signature entries are cleared and varnode hashing resumes
for the final iterations, producing the feature vector.&lt;/p&gt;
&lt;h2 id="the-feature-vector_1"&gt;The Feature Vector&lt;/h2&gt;
&lt;p&gt;The BSIM generation pipeline outputs a sorted list of 32-bit hash values
derived from three feature types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;VarnodeSignature&lt;/strong&gt;: one hash for each non-shadowed, non-trivially defined varnode
  (produced by data-flow hashing).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BlockSignature&lt;/strong&gt;: one hash for each &lt;em&gt;root operation&lt;/em&gt; inside a
  basic block as well as a final hash for the full block 
  (produced by control-flow hashing).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CopySignature&lt;/strong&gt;: one hash that aggregates all COPY operations per basic
  block.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This sorted vector encodes the function as a set of structural motif 
identifiers: semantically equivalent functions yield largely overlapping sets,
while unrelated functions yield largely disjoint sets.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note 1: The vectors are sorted only to speed up the subsequent comparison step of the algorithm.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note 2: Root operations are operations that represent the roots of expressions.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A BSIM feature vector typically looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;(1:545c6155,1:7086215d,2:bd945601,1:ca0bb8a0,1:e123ddbb)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The number before the colon is the count of consecutive identical hashes in
the sorted list (a run-length encoding that collapses repeated values into a
single entry), and the number after it is the feature hash itself, in hexadecimal.&lt;/p&gt;
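&lt;p&gt;A dumped vector in this textual form is easy to post-process. The following sketch assumes the layout is exactly the one shown in the example above; &lt;code&gt;parse_feature_vector&lt;/code&gt; is a hypothetical helper, not part of Ghidra.&lt;/p&gt;

```python
def parse_feature_vector(text):
    """Split '(tf:hash,...)' into a list of (tf, hash) integer pairs."""
    entries = []
    for item in text.strip().strip("()").split(","):
        tf, hash_hex = item.split(":")
        entries.append((int(tf), int(hash_hex, 16)))
    return entries


vec = parse_feature_vector("(1:545c6155,1:7086215d,2:bd945601,1:ca0bb8a0,1:e123ddbb)")
print(sum(tf for tf, _ in vec))  # 6
```

&lt;p&gt;The example vector holds five distinct hashes but six features in total, because &lt;code&gt;bd945601&lt;/code&gt; occurs twice.&lt;/p&gt;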
&lt;p&gt;To inspect these vectors, Ghidra provides the 
&lt;code&gt;DumpBSimSignaturesScript.py&lt;/code&gt;&lt;sup id="fnref:7"&gt;&lt;a class="footnote-ref" href="#fn:7"&gt;7&lt;/a&gt;&lt;/sup&gt; script.&lt;/p&gt;
&lt;h2 id="comparing-the-vectors-using-tf-idf"&gt;Comparing the vectors using TF-IDF&lt;/h2&gt;
&lt;p&gt;Now that we have our feature vectors, how do we compare them? A raw set 
intersection would be na&amp;iuml;ve, because not all features are equally informative.
A feature encoding "integer addition of two 4-byte values" appears in virtually
every compiled function; a feature encoding a specific 3-hop expression tree
combining a shift, an XOR, and a masked store is extremely rare and highly
discriminating.&lt;/p&gt;
&lt;p&gt;BSIM borrows TF-IDF (Term Frequency / Inverse Document Frequency) from
information retrieval to weight each feature by its global rarity across a
training corpus. In the BSIM context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;document&lt;/strong&gt; is a function stored in the database&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;term&lt;/strong&gt; is a 32-bit feature hash&lt;/li&gt;
&lt;li&gt;&lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
N&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; is the total number of functions in the &lt;em&gt;training&lt;/em&gt; database&lt;/li&gt;
&lt;li&gt;&lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
df(f)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; is the number of functions containing feature hash &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
f&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The IDF weight of a feature is defined as:&lt;/p&gt;
&lt;p&gt;&lt;span class="katex"&gt;&lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mtext&gt;IDF&lt;/mtext&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mi&gt;log&lt;/mi&gt;&lt;mo&gt;&amp;af;&lt;/mo&gt;&lt;mtext&gt;&amp;thinsp;&amp;ic;&lt;/mtext&gt;&lt;mrow&gt;&lt;mo fence="true"&gt;(&lt;/mo&gt;&lt;mfrac&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo fence="true"&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\text{IDF}(f) = \log\!\left(\frac{N}{df(f)}\right)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Under this definition, features present in nearly every function receive a weight close to &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;,
while rare, distinctive features receive a high weight. This weighting is not
computed at extraction time; it is applied at query time using pre-computed
IDF values, shipped with Ghidra, that were fitted from a training corpus.&lt;/p&gt;
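&lt;p&gt;For intuition, the textbook formula can be evaluated directly. This is only a sketch of the definition with made-up corpus sizes; BSIM itself reads pre-computed weights from a file and does not evaluate the logarithm at query time.&lt;/p&gt;

```python
import math


def idf(total_functions, functions_with_feature):
    """Textbook IDF: log(N / df)."""
    return math.log(total_functions / functions_with_feature)


print(idf(1000, 1000))                # 0.0, the feature is in every function
print(idf(1000, 1) - idf(1000, 500))  # positive: the rare feature weighs more
```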
&lt;p&gt;Those weight files can be found under &lt;code&gt;Ghidra/Features/BSim/data&lt;/code&gt; and are
stored as XML. When creating a database, a weight file is implicitly selected
by setting the &lt;code&gt;config_template&lt;/code&gt; parameter via the &lt;code&gt;support/bsim&lt;/code&gt; tool.&lt;/p&gt;
&lt;p&gt;Taking &lt;code&gt;lshweights_nosize.xml&lt;/code&gt; as an example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;weights&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;settings=&lt;/span&gt;&lt;span class="s"&gt;"0x4d"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;weightfactory&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;scale=&lt;/span&gt;&lt;span class="s"&gt;"1.55369941"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;addend=&lt;/span&gt;&lt;span class="s"&gt;"6.00980084"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;idf&amp;gt;&lt;/span&gt;1.00000000e+00&lt;span class="nt"&gt;&amp;lt;/idf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- bucket 0: rarest features, max weight --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;idf&amp;gt;&lt;/span&gt;9.99459862e-01&lt;span class="nt"&gt;&amp;lt;/idf&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;...&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- 512 IDF weights total --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;tf&amp;gt;&lt;/span&gt;1.00000000e+00&lt;span class="nt"&gt;&amp;lt;/tf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- tf=1: baseline --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;tf&amp;gt;&lt;/span&gt;1.41421356e+00&lt;span class="nt"&gt;&amp;lt;/tf&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- tf=2: sqrt(2) --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;...&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- 64 TF weights total --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;probflip0&amp;gt;&lt;/span&gt;2.67731136e-01&lt;span class="nt"&gt;&amp;lt;/probflip0&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;probflip1&amp;gt;&lt;/span&gt;6.20184175e-01&lt;span class="nt"&gt;&amp;lt;/probflip1&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;probdiff0&amp;gt;&lt;/span&gt;2.01821663e-02&lt;span class="nt"&gt;&amp;lt;/probdiff0&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;probdiff1&amp;gt;&lt;/span&gt;7.10384098e+00&lt;span class="nt"&gt;&amp;lt;/probdiff1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/weightfactory&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;idflookup&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;size=&lt;/span&gt;&lt;span class="s"&gt;"1000"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;hash&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;count=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;0xd99bb820&lt;span class="nt"&gt;&amp;lt;/hash&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- hash seen in 0 functions --&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;hash&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;count=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;0x26111c79&lt;span class="nt"&gt;&amp;lt;/hash&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;...&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="cm"&gt;&amp;lt;!-- 1000 most common hashes --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/idflookup&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/weights&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each feature's weight has two components. The &lt;strong&gt;IDF weight&lt;/strong&gt; reflects how
rarely the feature appears across the training corpus. BSIM maintains a lookup
table of 1000 feature hashes observed during training, each annotated with a
normalized frequency count. When a vector is built, each feature hash is 
looked up in this table; the resulting count (capped at 511) serves as an
index into a 512-entry IDF weight table, where index 0 yields the maximum
weight of &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;1.0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
1.0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; and higher indices yield progressively smaller values 
approaching &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;0.67&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
0.67&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;. Features absent from the table receive index 0 and are
therefore treated as maximally rare and maximally informative.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;TF weight&lt;/strong&gt; reflects how often a feature appears within the specific
function being analyzed. Repetition increases the weight, but with diminishing
returns following a &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msqrt&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;msub&gt;&lt;mrow&gt;&lt;mi&gt;log&lt;/mi&gt;&lt;mo&gt;&amp;af;&lt;/mo&gt;&lt;/mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;/msqrt&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\sqrt{1 + \log_2(tf)}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; curve: a feature seen once has
weight &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;1.0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
1.0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, twice yields &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo&gt;&amp;asymp;&lt;/mo&gt;&lt;mn&gt;1.41&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\approx 1.41&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, four times &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo&gt;&amp;asymp;&lt;/mo&gt;&lt;mn&gt;1.73&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\approx 1.73&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, and eight
times exactly &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;2.0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
2.0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;. This prevents a function that mechanically repeats a 
trivial pattern from dominating the similarity score.&lt;/p&gt;
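&lt;p&gt;The diminishing-returns curve is simple enough to check numerically. The function name &lt;code&gt;tf_weight&lt;/code&gt; below is ours; the sketch evaluates the formula from the text and does not read the values stored in the weight file.&lt;/p&gt;

```python
import math


def tf_weight(tf):
    """Diminishing-returns weight sqrt(1 + log2(tf)) for tf of 1 or more."""
    return math.sqrt(1.0 + math.log2(tf))


for tf in (1, 2, 4, 8):
    print(tf, round(tf_weight(tf), 2))
```

&lt;p&gt;This prints 1.0, 1.41, 1.73 and 2.0, matching the values quoted above: multiplying the repetition count by eight only doubles the weight.&lt;/p&gt;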
&lt;p&gt;The final coefficient for a &lt;code&gt;HashEntry&lt;/code&gt; is the product of both components:&lt;/p&gt;
&lt;p&gt;&lt;span class="katex"&gt;&lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mtext&gt;idfweight&lt;/mtext&gt;&lt;mo stretchy="false"&gt;[&lt;/mo&gt;&lt;mtext&gt;idf&lt;/mtext&gt;&lt;mo stretchy="false"&gt;]&lt;/mo&gt;&lt;mo&gt;&amp;times;&lt;/mo&gt;&lt;mtext&gt;tfweight&lt;/mtext&gt;&lt;mo stretchy="false"&gt;[&lt;/mo&gt;&lt;mtext&gt;tf&lt;/mtext&gt;&lt;mo stretchy="false"&gt;]&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\text{coeff} = \text{idfweight}[\text{idf}] \times \text{tfweight}[\text{tf}]&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;This scalar is computed once when the vector is constructed and stored directly in the entry.&lt;/p&gt;
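&lt;p&gt;Putting both lookups together, the computation can be sketched as follows. Everything here is illustrative: the IDF table is a linear placeholder between 1.0 and 0.67 (the real values come from the weight file), the TF table is assumed to follow the square-root curve, clamping the term frequency to the 64-entry table is our assumption, and &lt;code&gt;coeff&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import math

# Placeholder tables: the real values come from the weight file.
IDF_WEIGHTS = [1.0 - 0.33 * i / 511 for i in range(512)]            # 1.0 down to 0.67
TF_WEIGHTS = [math.sqrt(1.0 + math.log2(n)) for n in range(1, 65)]  # tf = 1..64
IDF_LOOKUP = {0x26111C79: 1}  # hash to normalized count, as in the idflookup section


def coeff(feature_hash, tf):
    count = IDF_LOOKUP.get(feature_hash, 0)  # absent hashes get index 0
    idf_index = min(count, 511)              # cap described in the text
    tf_index = min(tf, 64) - 1               # assumption: clamp tf to the 64-entry table
    return IDF_WEIGHTS[idf_index] * TF_WEIGHTS[tf_index]


print(coeff(0xDEADBEEF, 1))  # 1.0
```

&lt;p&gt;A hash that is absent from the lookup table and appears once in the function gets the maximum IDF weight and the baseline TF weight, hence a coefficient of 1.0.&lt;/p&gt;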
&lt;h2 id="cosine-similarity"&gt;Cosine similarity&lt;/h2&gt;
&lt;p&gt;The vector comparison is implemented across multiple backends and languages:
the H2 (local) and Elasticsearch backends are written in Java, while PostgreSQL
uses a dedicated C extension. We will focus on the Java implementation, 
available in &lt;code&gt;Ghidra/Framework/Generic/src/main/java/generic/lsh/vector/LSHCosineVector.java&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With both vectors represented as sorted arrays of &lt;code&gt;HashEntry(hash, coeff, tf)&lt;/code&gt;
entries, &lt;code&gt;LSHCosineVector.compare()&lt;/code&gt; computes their cosine similarity using a
merge-join (in &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;O&lt;/mi&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
O(n + m)&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; time, where &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
n&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; and &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
m&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; are the numbers of distinct
features in each vector). No quadratic search is needed because both arrays
are already sorted by hash value.&lt;/p&gt;
&lt;p&gt;The algorithm maintains two iterators, one per vector, and advances them
according to three cases at each step:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Matching hashes&lt;/strong&gt;: when both iterators point to entries with the same hash,
  the feature is shared between the two functions. Its contribution to the
  dot product is &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;min&lt;/mi&gt;&lt;mo&gt;&amp;af;&lt;/mo&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo separator="true"&gt;,&lt;/mo&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/msub&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\min(\text{coeff}_A, \text{coeff}_B)^2&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;. Specifically, the
  code compares the term frequencies of both entries and uses the coefficient
  from whichever vector has the lower TF. This conservative choice credits
  only the genuine overlap: if function &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
A&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; uses a pattern three times and
  function &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
B&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; uses it once, only one occurrence is considered shared. Both
  iterators then advance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hash in &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
A&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; only&lt;/strong&gt;: when the hash under iterator &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
A&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; is less than the hash
  under iterator &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
B&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, the feature exists only in &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
A&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;. It contributes nothing
  to the dot product and iterator &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
A&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; advances alone.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hash in &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
B&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; only&lt;/strong&gt;: the symmetric case, where iterator &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
B&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; advances.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
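&lt;p&gt;The three cases above can be sketched in a few lines of Python. This is an
illustrative sketch, not Ghidra's code: the &lt;code&gt;HashEntry&lt;/code&gt; dataclass mirrors the
&lt;code&gt;HashEntry(hash, coeff, tf)&lt;/code&gt; triple described earlier, while the function name
&lt;code&gt;shared_dot_product&lt;/code&gt; and the sample hashes and coefficients are assumptions made
up for the example.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HashEntry:
    hash: int     # feature hash, used as the sort key
    coeff: float  # weight of the feature in the vector
    tf: int       # term frequency of the feature in the function


def shared_dot_product(a: Sequence[HashEntry], b: Sequence[HashEntry]) -> float:
    """Merge-join two hash-sorted entry lists and sum the shared contributions."""
    dot = 0.0
    i, j = 0, 0
    while i != len(a) and j != len(b):
        ea, eb = a[i], b[j]
        if ea.hash == eb.hash:
            # Matching hashes: credit the coefficient of the entry with the lower TF.
            c = ea.coeff if eb.tf > ea.tf else eb.coeff
            dot += c * c
            i += 1
            j += 1
        elif eb.hash > ea.hash:
            i += 1  # hash in A only: contributes nothing, advance A
        else:
            j += 1  # hash in B only: contributes nothing, advance B
    return dot


# Hypothetical vectors: only the feature 0x10 is present in both.
a = [HashEntry(0x10, 2.0, 2), HashEntry(0x20, 1.0, 1)]
b = [HashEntry(0x10, 1.0, 1), HashEntry(0x30, 2.0, 1)]
print(shared_dot_product(a, b))  # 1.0: the shared feature is credited as 1.0 ** 2
```

&lt;p&gt;Each loop iteration advances at least one index, so the loop body runs at most
once per entry of either list. Entries present in only one vector are skipped
without adding anything to the sum.&lt;/p&gt;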
&lt;p&gt;Non-shared features still matter, however: they were factored in when computing
each vector's length, the Euclidean norm &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msqrt&gt;&lt;mrow&gt;&lt;mo&gt;&amp;sum;&lt;/mo&gt;&lt;msup&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/msqrt&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\sqrt{\sum \text{coeff}^2}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, which
is pre-computed during construction.&lt;/p&gt;
&lt;p&gt;The final cosine score is:&lt;/p&gt;
&lt;p&gt;&lt;span class="katex"&gt;&lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mtext&gt;score&lt;/mtext&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo separator="true"&gt;,&lt;/mo&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mfrac&gt;&lt;mstyle displaystyle="true" scriptlevel="0"&gt;&lt;munder&gt;&lt;mo&gt;&amp;sum;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo&gt;&amp;isin;&lt;/mo&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;cap;&lt;/mo&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;/munder&gt;&lt;mi&gt;min&lt;/mi&gt;&lt;mo&gt;&amp;af;&lt;/mo&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mo separator="true"&gt;,&lt;/mo&gt;&lt;mtext&gt;&amp;thinsp;&lt;/mtext&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mstyle&gt;&lt;mrow&gt;&lt;msqrt&gt;&lt;mstyle displaystyle="true" scriptlevel="0"&gt;&lt;munder&gt;&lt;mo&gt;&amp;sum;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo&gt;&amp;isin;&lt;/mo&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;/munder&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mstyle&gt;&lt;/msqrt&gt;&lt;mo&gt;&amp;times;&lt;/mo&gt;&lt;msqrt&gt;&lt;mstyle displaystyle="true" 
scriptlevel="0"&gt;&lt;munder&gt;&lt;mo&gt;&amp;sum;&lt;/mo&gt;&lt;mrow&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;mo&gt;&amp;isin;&lt;/mo&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/mrow&gt;&lt;/munder&gt;&lt;msub&gt;&lt;mtext&gt;coeff&lt;/mtext&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;(&lt;/mo&gt;&lt;mi&gt;f&lt;/mi&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;)&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mstyle&gt;&lt;/msqrt&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
\text{score}(A, B) = \frac{\displaystyle\sum_{f \in A \cap B} \min(\text{coeff}_A(f),\, \text{coeff}_B(f))^2}{\sqrt{\displaystyle\sum_{f \in A} \text{coeff}_A(f)^2} \times \sqrt{\displaystyle\sum_{f \in B} \text{coeff}_B(f)^2}}&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Unmatched features inflate the denominators without contributing to the 
numerator, naturally penalizing vectors that diverge significantly in their
feature sets. The result is a value in &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;[&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;mo separator="true"&gt;,&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo stretchy="false"&gt;]&lt;/mo&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
[0, 1]&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt;, where &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;1.0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
1.0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; indicates
perfectly aligned weighted feature sets and values near &lt;span class="katex"&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;annotation encoding="application/x-tex"&gt;
0&lt;/annotation&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/span&gt; indicate little to
no overlap.&lt;/p&gt;
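&lt;p&gt;As a quick sanity check of the formula, consider two hypothetical vectors (the
features and coefficients are made up for illustration): A holds a feature h1
with coefficient 2 and a feature h2 with coefficient 1, while B holds h1 with
coefficient 1 and a feature h3 with coefficient 2. Only h1 is shared, so:&lt;/p&gt;

```latex
\[
\text{score}(A, B)
  = \frac{\min(2, 1)^2}{\sqrt{2^2 + 1^2} \times \sqrt{1^2 + 2^2}}
  = \frac{1}{\sqrt{5} \times \sqrt{5}}
  = 0.2
\]
```

&lt;p&gt;Comparing A with itself instead gives (2^2 + 1^2) / 5 = 1, the upper bound of
the range.&lt;/p&gt;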
&lt;h1 id="conclusion_1"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This post has walked through the complete BSIM pipeline: from raw machine
instructions lifted to High P-code, through Weisfeiler-Lehman hashing of both
the data-flow and control-flow graphs, to the final TF-IDF-weighted cosine
similarity comparison.&lt;/p&gt;
&lt;p&gt;A few open questions remain. The hashing constants (&lt;code&gt;0x55055055&lt;/code&gt;,
&lt;code&gt;0xbafabaca&lt;/code&gt;, &lt;code&gt;0x777&lt;/code&gt;, &lt;code&gt;0x7abc7abc&lt;/code&gt;) and the choice of 3 data-flow iterations
appear to be empirical, but no public documentation explains the experiments
that informed them. Similarly, the training corpus used to fit the IDF weights
shipped with Ghidra is undocumented, even though the distribution of functions
it contains directly influences which features are considered rare and
therefore discriminating.&lt;/p&gt;
&lt;p&gt;The use of the Weisfeiler-Lehman test for binary function analysis was already
explored by Elie Mengin&lt;sup id="fnref2:6"&gt;&lt;a class="footnote-ref" href="#fn:6"&gt;6&lt;/a&gt;&lt;/sup&gt;, whose post we strongly recommend
reading for the theoretical underpinnings of the graph kernel. Another great
piece of work worth mentioning is Hashashin&lt;sup id="fnref:8"&gt;&lt;a class="footnote-ref" href="#fn:8"&gt;8&lt;/a&gt;&lt;/sup&gt; by River Loop Security,
which presented a similar approach, using Binary Ninja IL and LSH for
cross-architecture function similarity, before BSIM's public release. Whether
these works directly influenced the Ghidra team's design is unknown, but they
share the same core intuitions: normalize away architecture noise, encode
semantics and code structure as graph features, and compare functions in a
metric space where proximity suggests behavioral similarity.&lt;/p&gt;
&lt;p&gt;Understanding these internals matters. Knowing how features are generated
exposes the limits of the approach: very small functions (few varnodes, no
root operations) produce sparse vectors and are inherently harder to match;
heavily inlined or LTO-compiled code may fragment a logical function into
shapes that look unlike the original; and an IDF table trained on a Windows
x86-64 userspace corpus may transfer poorly to a very different domain, such
as bare-metal ARM RTOS firmware.&lt;/p&gt;
&lt;p&gt;It is precisely these trade-offs that shaped our design choices when building
&lt;a href="https://github.com/quarkslab/sighthouse"&gt;SightHouse&lt;/a&gt;. If you are curious to
see BSIM put to work in practice, feel free to check it out!&lt;/p&gt;
&lt;h1 id="acknowledgments"&gt;Acknowledgments&lt;/h1&gt;
&lt;p&gt;First of all, thanks to the Ghidra developers and the community behind it for
creating this awesome tool and making it available to everyone!&lt;/p&gt;
&lt;p&gt;Thanks to all my Quarkslab colleagues for proofreading this article. I would also
like to express my gratitude to Roxane Cohen and Aldo Moscattelli for their
help and guidance in understanding the implementation and the theory
behind it.&lt;/p&gt;
&lt;h1 id="references"&gt;References&lt;/h1&gt;
&lt;div class="footnote"&gt;
&lt;hr/&gt;
&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;National Security Agency (NSA) Ghidra Team, &lt;a href="https://ghidra.re/ghidra_docs/GhidraClass/BSIM/BSIMTutorial_Intro.html#how-does-bsim-work"&gt;&lt;em&gt;How Does BSIM Work?&lt;/em&gt;&lt;/a&gt;.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;National Security Agency (NSA) Ghidra Team, &lt;a href="https://ghidra.re/ghidra_docs/languages/html/pcoderef.html"&gt;&lt;em&gt;P-Code Reference Manual&lt;/em&gt;&lt;/a&gt;.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;National Security Agency (NSA) Ghidra Team, &lt;a href="https://ghidra.re/ghidra_docs/languages/html/sleigh.html#sleigh_overview"&gt;&lt;em&gt;SLEIGH Overview&lt;/em&gt;&lt;/a&gt;.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:4"&gt;
&lt;p&gt;Norman Ramsey and Mary F. Fernández. &lt;a href="https://www.cs.tufts.edu/~nr/pubs/specifying.pdf"&gt;&lt;em&gt;Specifying Representations of Machine Instructions&lt;/em&gt;&lt;/a&gt;. ACM Trans. Programming Languages and Systems, Volume 19, Issue 2, Pages 492-524.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:5"&gt;
&lt;p&gt;NCC Group, &lt;a href="https://github.com/nccgroup/ghostrings/blob/main/ghidra_scripts/PrintHighPCode.java"&gt;PrintHighPCode.java&lt;/a&gt;.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:6"&gt;
&lt;p&gt;Elie Mengin, &lt;a href="https://blog.quarkslab.com/weisfeiler-lehman-graph-kernel-for-binary-function-analysis.html"&gt;Weisfeiler-Lehman Graph Kernel for Binary Function Analysis&lt;/a&gt;, Quarkslab, 2019.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:6" title="Jump back to footnote 6 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref2:6" title="Jump back to footnote 6 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;a class="footnote-backref" href="#fnref3:6" title="Jump back to footnote 6 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:7"&gt;
&lt;p&gt;National Security Agency (NSA) Ghidra Team, &lt;a href="https://github.com/NationalSecurityAgency/ghidra/blob/Ghidra_12.0_build/Ghidra/Features/BSim/ghidra_scripts/DumpBSimSignaturesScript.py"&gt;DumpBSimSignaturesScript&lt;/a&gt;.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:7" title="Jump back to footnote 7 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:8"&gt;
&lt;p&gt;Rylan O'Connell and Ryan Speers, &lt;a href="https://riverloopsecurity.com/blog/2019/12/binary-hashing-hashashin/"&gt;Hashashin: Using Binary Hashing to Port Annotations&lt;/a&gt;, 2019.&amp;nbsp;&lt;a class="footnote-backref" href="#fnref:8" title="Jump back to footnote 8 in the text"&gt;&amp;larrhk;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><category term="Program Analysis"></category><category term="2026"></category><category term="binary analysis"></category><category term="program analysis"></category><category term="reverse-engineering"></category><category term="binary similarity"></category><category term="BSIM"></category></entry></feed>