Commit Level Vulnerability Dataset

In this blog post, we present a new vulnerability dataset composed of thousands of vulnerabilities aimed at helping security practitioners to develop, test and enhance their tools. Unlike others, this dataset contains both the vulnerable and fixed states with source data.


The CVE standard [1] is used to publish vulnerability advisories through a common and universal mechanism. To cover the wide variety of affected software, each entry offers a common set of fields (CVE-ID, CVSS, description), but there is no standardized way to report the corresponding fix. Some data-driven vulnerability research workflows depend on this information being available. We aim to fill this gap by linking vulnerabilities and their fixes in a single dataset.

Vulnerability datasets have already been introduced in the literature, but they all suffer from limitations:

  • usage of synthetic code [4];
  • handcrafted vulnerabilities [5];
  • unrelated vulnerabilities in various software [6].

For this work, we approach the problem from another perspective. A list of real-world vulnerabilities already exists: the CVEs. We leverage this list to build a vulnerability dataset that security practitioners and academic researchers can use to enhance their products or tools. This approach is similar to the Open Source Security Foundation (OpenSSF) CVE Benchmark [8], which targeted JavaScript/TypeScript vulnerabilities.

More specifically, we focus on the CVEs affecting Android and the Android Open Source Project (AOSP). AOSP is the perfect target for such a dataset because it provides the monthly Android Security Bulletins and presents the following advantages:

  • AOSP is the heart of a complete OS and the bulletins cover each of its open source components.
  • Android is used by billions: vulnerabilities inside affect numerous users and are not artificially constructed.
  • Android's vulnerabilities are de facto representative because they have not been hand-picked by anyone.
  • Since bulletins are regularly published, vulnerabilities in the dataset are always up to date and a new class of vulnerability will automagically appear if instances are found in Android.

Potential Usages

This dataset can be used in numerous ways. We list below some potential applications to various open research subjects:

  • Silent Fix Detection: Not all security patches are labeled as such, and some projects silently fix vulnerabilities. Being able to detect them could lead to interesting results.
  • Cross-architecture Binary Diffing/Matching: Since we offer the same binaries for multiple architectures, this dataset could be used to improve cross-architecture binary diffing tools and methods.
  • Patch Detection: Our binaries could help to develop methods to solve the Patch Presence Problem (determining whether a binary has been patched).
  • ...


The dataset is released on Quarkslab's GitHub and comprises about 3,900 CVEs, for 1,359 of which we retrieved a fixing commit. Moreover, to help develop binary-only methods, we provide precompiled binaries for a subset of those vulnerabilities.

For interested readers, the remainder of this blog post details how we built this dataset.

Android CVE Data Aggregation

March 2022 Android Security Bulletin Extract

Since August 2015, Google has published, through the Android Security Bulletin Monthly Release, advisories on issues fixed in the latest Android release. Google's bulletins are usually divided into two Security Patch Levels (SPL), themselves divided into categories. Each category contains a list of vulnerabilities. A list entry contains the CVE-ID, the vulnerability type, its severity, the updated AOSP versions and, if the component is open source, a direct link to the fixing commit. [2]

Roy Architecture

We crawl Android Security Bulletins using a homemade tool named Roy. Its architecture is depicted in the Figure above. For each bulletin published since its last run, Roy recovers the list of new vulnerabilities. If a link to a fixing commit exists, the tool also parses the changes introduced by the commit (e.g. changed files, new lines...). To reduce the workload, Roy works incrementally: an already parsed bulletin is never reanalysed. Thus, the parser complexity remains stable over time because only the parser for the latest bulletin format has to be maintained (and not one for every past format).
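The incremental behaviour described above can be illustrated with a minimal sketch. It is not Roy's actual code: the function name, the bulletin identifiers and the `parse` callback are all hypothetical, and the persistent state is reduced to an in-memory set.

```python
from typing import Callable, Iterable


def crawl_new_bulletins(
    published: Iterable[str],
    already_parsed: set[str],
    parse: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Parse only the bulletins that no previous run has handled.

    `published` holds bulletin identifiers (for example "2022-03"),
    `already_parsed` is the state kept between runs, and `parse` stands
    in for the bulletin parser. All of these names are hypothetical.
    """
    new_entries: dict[str, list[str]] = {}
    for bulletin_id in published:
        if bulletin_id in already_parsed:
            continue  # incremental: an already parsed bulletin is skipped
        new_entries[bulletin_id] = parse(bulletin_id)
        already_parsed.add(bulletin_id)
    return new_entries
```

Calling the function twice with a growing list of bulletins only invokes `parse` on the identifiers added since the previous call.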

Generating Binary Artifacts

Vulnerabilities affect both open source and closed-source projects, and for the latter only binaries are available. To develop and test binary-level SAST (Static Application Security Testing) tools, DAST (Dynamic Application Security Testing) tools or CVE-checkers, a dataset that also contains binaries could prove useful.

AOSP is also a perfect target to provide precompiled binaries for the vulnerabilities:

  • the build system is open sourced and documented;
  • thanks to the fixing commit, we know that the project is vulnerable just before it and fixed just after;
  • AOSP targets various architectures, which allows generating multiple binaries from a single vulnerability.

AOSP Builder Workflow

This process was automated using a tool (unsurprisingly) named AOSPBuilder whose workflow is depicted in the Figure above. As input, it takes a fixing commit hash from Roy and builds the project twice: at the commit just before the fix and at the fixing commit itself. Finally, we only keep the binaries that differ between the two builds. To reduce noise introduced by compiler optimization, the same build settings are used for both builds.
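The "keep only the binaries that differ" step can be sketched as follows. This is an illustration under assumptions, not AOSPBuilder's implementation: it supposes that each build produced an output directory, and it compares files by SHA256 digest. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def index_build(root: Path) -> dict[str, str]:
    """Map each file path (relative to the build root) to its SHA256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_binaries(before: Path, after: Path) -> list[str]:
    """Return the files present in both builds whose content differs."""
    old, new = index_build(before), index_build(after)
    common = old.keys() & new.keys()
    return sorted(name for name in common if old[name] != new[name])
```

Files that exist in only one of the two builds are ignored here; whether the real tool keeps them is not stated in this post.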

Our automated compilation suffers from various problems and managed to compile only about half of the vulnerabilities. Most issues stem from synchronization problems between a project and its dependencies, as compiling a precise commit in AOSP is not trivial [3]. Additional work could reduce the error rate, for instance by using the manifest file that is present in each Android build and lists the commit id of every project.
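As a rough illustration of that last idea, the sketch below reads a pinned manifest and extracts the revision recorded for each project. It assumes the manifest follows the repo manifest XML format, where each `project` element carries `name` and `revision` attributes. The function name is hypothetical and the project names and hashes in the sample are placeholders, not real AOSP data.

```python
import xml.etree.ElementTree as ET


def pinned_revisions(manifest_xml: str) -> dict[str, str]:
    """Map each project name to the revision recorded in the manifest."""
    root = ET.fromstring(manifest_xml)
    return {
        project.attrib["name"]: project.attrib["revision"]
        for project in root.iter("project")
        if "revision" in project.attrib  # skip projects without a pinned revision
    }


# Made-up manifest excerpt: names and hashes are placeholders.
EXAMPLE_MANIFEST = """
<manifest>
  <project name="platform/example/a" path="example/a" revision="1111111111111111111111111111111111111111"/>
  <project name="platform/example/b" path="example/b" revision="2222222222222222222222222222222222222222"/>
  <project name="platform/example/c" path="example/c"/>
</manifest>
"""

if __name__ == "__main__":
    for name, revision in pinned_revisions(EXAMPLE_MANIFEST).items():
        print(name, revision)
```

With such a mapping, every dependency could in principle be checked out at the revision that was actually used for a given build, instead of guessing a compatible state.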

Dataset Overview

At Source Level

A CVE in the dataset is represented as illustrated below:

    {
        "cveId": "CVE-2017-0738",
        "dateReported": "2017-08-09",
        "vulnerabilityType": "Information Disclosure Vulnerability",
        "language": "c",
        "fixes": [
            {
                "commitId": "1d919d737b374b98b900c08c9d0c82fe250feb08",
                "patchUrl": ""
            },
            {
                "commitId": "234848dd6756c5d636f5a103e51636d60932983c",
                "patchUrl": ""
            }
        ],
        "severity": "Moderate",
        "component": "Media Framework"
    }

Note that a vulnerability may be fixed by multiple commits (as in the example above), in which case all of them are listed. The exact JSON Schema used is available in the repository [7].
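A record of this shape can be consumed with a few lines of Python. The sketch below is only an illustration: it embeds the example record as a string instead of reading the dataset files, whose layout is not described here, and the helper name `fix_commits` is hypothetical.

```python
import json

# The example record from this post, embedded as a JSON string.
RECORD = """
{
    "cveId": "CVE-2017-0738",
    "dateReported": "2017-08-09",
    "vulnerabilityType": "Information Disclosure Vulnerability",
    "language": "c",
    "fixes": [
        {"commitId": "1d919d737b374b98b900c08c9d0c82fe250feb08", "patchUrl": ""},
        {"commitId": "234848dd6756c5d636f5a103e51636d60932983c", "patchUrl": ""}
    ],
    "severity": "Moderate",
    "component": "Media Framework"
}
"""


def fix_commits(record: dict) -> list[str]:
    """Return every fixing commit id of a CVE record (possibly none)."""
    return [fix["commitId"] for fix in record.get("fixes", [])]


if __name__ == "__main__":
    cve = json.loads(RECORD)
    print(cve["cveId"], fix_commits(cve))
```

Using `record.get("fixes", [])` keeps the helper usable on closed source entries, assuming those simply lack fixing commits.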

Source Level Dataset Distribution

The Source Level Dataset contains 3,903 CVEs. Closed source vulnerabilities affect components from vendors such as Qualcomm, NVIDIA, or Google. The fixing commit id is available for 1,359 open source vulnerabilities (64%).

The Figure below shows the distribution of CVEs over the years. We considered bulletins from August 2015 to March 2022 and used the Published Date information of each CVE. Some CVEs published before August 2015 may thus appear because they were only fixed in later bulletins. The number of reports for closed source projects decreases after 2017-2018, which explains the drop on the graph. The data for 2022 stops at the March 2022 Android Security Bulletin.

CVE Evolution over the Years

Vulnerabilities affecting Android usually have a huge impact. The Figure below shows the cumulative distribution of their CVSS scores. Of the 3,500 vulnerabilities considered, only 1,000 have a CVSS score below 7.

Cumulative CVSS Scores

At Binary Level

Each compiled vulnerability follows the same layout, described in the Figure below. Each file is prefixed by its SHA256 hash to prevent name collisions. When available, binaries with symbols are also kept to ease some analyses.
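The hash-prefix naming can be sketched as below. The exact file name format used in the dataset is not specified in this post, so the underscore separator and the function name are assumptions made for the illustration.

```python
import hashlib


def prefixed_name(filename: str, content: bytes, sep: str = "_") -> str:
    """Prefix a file name with the SHA256 digest of the file content.

    The separator is an assumption; only the idea of a hash prefix
    comes from the dataset description.
    """
    digest = hashlib.sha256(content).hexdigest()
    return f"{digest}{sep}{filename}"


if __name__ == "__main__":
    # Two different builds of the same library get distinct names.
    print(prefixed_name("libexample.so", b"build for one architecture"))
    print(prefixed_name("libexample.so", b"build for another architecture"))
```

Because the prefix depends on the content, two binaries that share a file name but differ in content (for instance across architectures) cannot overwrite each other.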

Example of a CVE Artifact

Based on file extensions, our dataset contains shared libraries (35%), object files (37%) and executables (13%). The largest file is libv8 (1.6 GiB), the static library of v8, a JavaScript engine.

The Table below lists the most frequent binaries found in the dataset. It's interesting to see the prevalence of libbluetooth, both because it was affected by several vulnerabilities and because it's easier to compile than some other AOSP projects.

Name            Count
[...]           953
[...]           748
[...]           650
[...]           421
net_test_btif   417
net_test_stack  299
hevcdec         268
[...]           242
[...]           234
[...]           229


Our dataset suffers from some limitations detailed below:

  • Component Diversity: All components used in this dataset are open source, which does not reflect the diversity of real-world code. For example, low-level firmware code and drivers for specific hardware remain closed source and thus are not considered in this dataset.
  • Single Point of Failure: Our dataset relies exclusively on Google's commitment to regularly publish Android Security Bulletins. If this publication stops, the dataset can no longer be updated.
  • Data Quality: Our dataset assumes that the commit referenced as the vulnerability fixing commit is complete, i.e. it completely fixes the vulnerability, and minimal, i.e. it neither fixes any other problem nor adds functionality.


I would like to thank the authors of the initial version of Roy. They did most of the work and we only stand on their shoulders. ;)

Many thanks also to the reviewers of this blog post for their valuable feedback and corrections.


This work was part of the author's thesis and has been published at CODASPY 2022 under the name Building a Commit-level Dataset of Real-world Vulnerabilities [9].