The OCI Image Specification is the core concept behind container images, yet it remains little known even as container technologies grow more and more popular. In this blog post we will demystify it and look into its internals.

Introduction

The Open Container Initiative (OCI) gave birth to the way we create containers today. It defined two specifications - the runtime-spec and the image-spec. In the previous article of the container series, we analyzed runc - an implementation of the OCI runtime-spec. In this article, we are going to study the OCI image-spec in detail. It defines how a runtime specification (hence a container) can be moved from one host to another. Furthermore, the specification provides means for OS resources to be efficiently shared between different container processes. The image-spec is implemented by Containerd and Docker, which are considered more "high-level" runtimes than runc. Let's dig into it!

Limitations of the Runtime Specification (runtime-spec)

The runtime specification defines the following two necessary elements to launch a container:

  • a root directory containing the program to be executed in the isolated environment, along with all of its dependencies;
  • a configuration file defining the isolation mechanisms to be applied to the Linux process (e.g. namespaces, cgroups, capabilities).

With runc it is possible to checkpoint a running container process and restart it later in the same process state (memory pages, file descriptors, etc.) thanks to CRIU. Sadly, runc doesn't provide similar functionality for the process's root file system, which would be extremely useful for live migration or for a simple backup of the container's current state. Another big limitation of runc is that each independent container process needs a separate runtime-spec bundle (configuration + root directory) if we want to preserve the isolation between containers. This quickly wastes resources, especially if several container processes use the same configuration and/or don't modify the predefined layout. If we want to modify a runtime bundle and then move it from one container host to another, the procedure gets even more complicated.

Fortunately, the image specification resolves these challenges. With the definition of this specification, two types of runtimes can be distinguished: "low-level" runtimes, which do not implement it, and "high-level" runtimes, which implement it alongside other cool features (which we will discuss in other articles).

Image Format Specification (image-spec)

This specification defines an OCI Image, consisting of a manifest, an image index (optional), a set of filesystem layers, and a configuration. The goal of this specification is to enable the creation of interoperable tools for building, transporting, and preparing a container image to run.

Introduction of the Open Container Initiative Image Format Specification.

Java application packed into an OCI image

Above we can see a graphical representation of how a Java application can be packed into an OCI image following the above definition. All the necessary system libraries and dependencies of the application are referenced as layers. The properties associated with the Linux process executing the program are defined in a file called config. The third element, called the image manifest, specifies the CPU architecture for which the previous two elements are suitable.

If we look closely we can see a relation between these components and the ones defined in the runtime specification. To some extent, these specification elements are sufficient to define a runtime-spec for a particular host.

Formally, the image-spec defines a few more elements:

  • Image Index - a set of image manifests;
  • Image Layout - file system layout of the contents of an image;
  • Conversion - a procedure of how the image-spec can be converted to runtime-spec;
  • Descriptor - a unique pointer to particular content of an image;

Note that in the next sections, the term "content descriptor" refers to the hash of the content.

Let's now demystify the purpose of all these specification components.

Image Manifest

Manifest element of the OCI image-spec

The Image Manifest describes the necessary elements to create a container on a given host with a given CPU architecture. In other words, it defines the runtime bundle. Furthermore, according to the documentation, this specification element also aims to achieve the following:

  • content-addressable images: every image manifest is assigned a unique identifier (ID). This ID is in reality a hash (descriptor) of the manifest itself. Since the manifest embeds the hash of the configuration file and the hashes of the set of layers, each calculated on the real contents, its fingerprint covers the whole bundle. The manifest fingerprint is unique as it references a unique set of descriptors. Hence, the contents of an image can be uniquely retrieved using the image manifest;
  • multi-architecture images: the hash of the image manifest can be used as a pointer entry in a bigger manifest called 'fat manifest'. This bigger manifest defines the same image for a variety of different CPU architectures;
  • translation to OCI Runtime Specification: the image manifest allows the complete generation of a runtime bundle as defined in the runtime-spec;

The image manifest is implemented as a JSON document containing the following fields:

  • schemaVersion (int), the template version used as a JSON structure;
  • mediaType (string), metadata field describing the type of the file following the OCI specification;
  • config (descriptor), content-based reference (hash) to the configuration element;
  • layers (array of descriptors), a set of content-based descriptors to different states of the root file system. Each state can be either a deletion, modification, or addition of a file. Each of these actions is relative to the previous layer. These alterations of the file system are compressed and the final root file system directory layout is the union-join of all of the layers into an empty directory;
  • annotations (string-string map), metadata;
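
To make the content-addressing idea concrete, here is a minimal Python sketch, not taken from any particular tool, that builds a manifest from in-memory blobs. The helper names (descriptor, build_manifest, manifest_digest) and the sample bytes are assumptions made for illustration; the media types are the OCI ones rather than the Docker ones that appear in the real output further down.

```python
import hashlib
import json


def descriptor(media_type: str, content: bytes) -> dict:
    """Build a content descriptor (type, size, digest) for raw bytes."""
    return {
        "mediaType": media_type,
        "size": len(content),
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
    }


def build_manifest(config: bytes, layers: list) -> bytes:
    """Serialize a manifest that references the config and layers by digest."""
    manifest = {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", layer)
            for layer in layers
        ],
    }
    return json.dumps(manifest, indent=2).encode()


def manifest_digest(manifest: bytes) -> str:
    """The manifest ID is simply the hash of its serialized bytes."""
    return "sha256:" + hashlib.sha256(manifest).hexdigest()


# Hypothetical blobs standing in for a real config and real layer tarballs.
config = b'{"architecture": "amd64", "os": "linux"}'
original = build_manifest(config, [b"layer-1", b"layer-2"])
tampered = build_manifest(config, [b"layer-1", b"layer-2 (modified)"])

print(manifest_digest(original))
print(manifest_digest(tampered))  # differs: the manifest embeds the layer digests
```

Changing a single byte in any layer changes that layer's digest, hence the manifest bytes, hence the manifest ID, which is what makes the image content-addressable.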

Image Manifest in Practice

To illustrate the image manifest, we are going to use Containerd's CLI tool, ctr, to inspect the contents of an Alpine Linux image downloaded from Docker Hub.

cryptonite@host:~$ sudo ctr image pull docker.io/library/alpine:latest
index-sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454:    done       ...
manifest-sha256:a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89: done           ...
config-sha256:0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b:   done           ...
layer-sha256:df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139:    done          ...
...
unpacking linux/amd64 sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454 ...

Using the pull command we can see that the following image elements were downloaded:

  • Image manifest;
  • Image index;
  • OCI Configuration;
  • 1 layer of file system data.

Each of them is identified by a sha256 hash calculated on its contents as downloaded (for the layer, this is the compressed archive). Ctr stores the downloaded blobs in Containerd's content store, which can be found under /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/, and unpacks the layers for the host platform. The descriptor of the image manifest is sha256:a777.... Let's inspect its content.

cryptonite@host:~$ sudo cat /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89 | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 1472,
    "digest": "sha256:0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 2814559,
      "digest": "sha256:df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139"
    }
  ]
}

Here we find the content descriptors of the file system layer and of the configuration file.

In addition, the hash values of the contents can be obtained by doing the following:

# calculate the hash of the image manifest
cryptonite@host:~$ sudo sha256sum /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89
a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89 ...
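
The same check can be done programmatically. The following Python sketch verifies a blob against a descriptor by comparing both its size and its digest. The function name verify_blob and the sample blob are assumptions for illustration; with a real content store you would read the blob file in binary mode instead.

```python
import hashlib


def verify_blob(content: bytes, descriptor: dict) -> bool:
    """Check that content matches the size and digest of a descriptor."""
    algorithm, _, expected_hex = descriptor["digest"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm}")
    if len(content) != descriptor["size"]:
        return False
    return hashlib.sha256(content).hexdigest() == expected_hex


# Hypothetical blob and a descriptor computed for it.
blob = b'{"schemaVersion": 2}'
good = {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "size": len(blob),
    "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
}

print(verify_blob(blob, good))         # True
print(verify_blob(blob + b" ", good))  # False: the size no longer matches
```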

Image Index

Index element of the OCI image-spec

The Image Index is a higher-level manifest containing content descriptors of architecture-specific image manifests (e.g. amd64, arm, x86). It can be thought of as a Merkle tree that contains different versions of the same image. This element of the specification is again represented as a JSON document and contains the following:

  • schemaVersion (int), the template version;
  • mediaType (string), metadata field describing the type of the specification element;
  • manifests (array of descriptors), content descriptors of the image manifests for the different CPU architectures. Each element of the array contains:
    • mediaType (string);
    • size (int), the size of the manifest in bytes;
    • digest (string), content descriptor of a particular manifest;
    • platform (object), information about the CPU architecture for which the manifest is suitable.

Image Index in Practice

To illustrate the Image Index we'll use ctr and the same Alpine Linux image downloaded in the previous section. Its index is identified by the digest sha256:4edbd... (see the pull output above).

cryptonite@host:~$ sudo cat \
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/\
4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454 | jq .
{
  "manifests": [
    {
      "digest": "sha256:a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      },
      "size": 528
    },
    {
      "digest": "sha256:70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm",
        "os": "linux",
        "variant": "v6"
      },
      "size": 528
    },
    {
      "digest": "sha256:dc18010aabc13ce121123c7bb0f4dcb6879ce22b4f7c65669a2c634b5ceecafb",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm",
        "os": "linux",
        "variant": "v7"
      },
      "size": 528
    },
    {
      "digest": "sha256:f3bec467166fd0e38f83ff32fb82447f5e89b5abd13264a04454c75e11f1cdc6",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm64",
        "os": "linux",
        "variant": "v8"
      },
      "size": 528
    },
    {
      "digest": "sha256:51103b3f2993cbc1b45ff9d941b5d461484002792e02aa29580ec5282de719d4",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "386",
        "os": "linux"
      },
      "size": 528
    },

    ...

  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "schemaVersion": 2
}

With the help of the Image Index, a container manager such as Docker or Containerd can retrieve the contents of a given container image for the host's current CPU architecture and derive a container bundle from them. The experiments in this article were conducted using the amd64 manifest (here identified by sha256:a777...). Of course, "high-level" container managers are smart and will only download the image manifest which is suitable for the current CPU architecture:

# try to inspect the manifest of Linux arm v6
cryptonite@host:~$ sudo cat \
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/\
70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a | jq .
cat: /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a: No such file or directory

To download all the manifests with ctr we have to specify the --all-platforms flag.
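
The platform selection described above can be sketched in a few lines of Python. This is a simplified illustration, not how Containerd or Docker actually implement it: real runtimes apply more elaborate matching rules (for instance around CPU variants). The function name select_manifest is an assumption, and the index below is a trimmed copy of the Alpine index shown earlier.

```python
from typing import Optional


def select_manifest(index: dict, os_name: str, architecture: str,
                    variant: Optional[str] = None) -> Optional[dict]:
    """Return the first manifest descriptor matching the requested platform."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") != os_name:
            continue
        if platform.get("architecture") != architecture:
            continue
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry
    return None


# Trimmed copy of the Alpine index shown above (two entries only).
index = {
    "schemaVersion": 2,
    "manifests": [
        {
            "digest": "sha256:a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89",
            "platform": {"architecture": "amd64", "os": "linux"},
            "size": 528,
        },
        {
            "digest": "sha256:70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a",
            "platform": {"architecture": "arm", "os": "linux", "variant": "v6"},
            "size": 528,
        },
    ],
}

print(select_manifest(index, "linux", "amd64")["digest"])
```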

Image Layout

The Image Layout is a file system directory layout where the contents of a container image are extracted. It contains all the elements of the image-spec represented as file system objects (or just files and directories). The directory structure format is the following:

  • index.json - the Image Index;
  • oci-layout - provides the version of the Image Layout;
  • blobs - a directory containing the content referenced by the Image Index.

Image Layout in Practice

Let's inspect all that using ctr's export functionality and the same Alpine Linux container image.

Note that to be able to use this feature one has to download all manifests contained in the Image Index.

cryptonite@host:~$ sudo ctr image pull --all-platforms \
docker.io/library/alpine:latest
...
cryptonite@host:~$ sudo ctr image export --all-platforms  \
image-layout-alpine.tar  docker.io/library/alpine:latest
# this produces a tar archive with the name image-layout-alpine.tar
# let's extract it
cryptonite@host:~$ tar xf image-layout-alpine.tar
cryptonite@host:~$ tree .
├── blobs
│   └── sha256
│     ├── 09addbcf0db5a11803f29bddbdbfd31adce7e40d68750359f9a4eb4dcc54078f
│     ├── 0a6a2a45b31cd5e28a366a035185eb75020ec28866957c2cb82422ff68fae065
│     ├── 0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b
│     ├── 1877acf2d48ed8bcb5bd9756a95aca0c077457be7cf4fcef25807f4e9be88db1
│     ├── 24d2ad2d4b14ac9edb48fb580d067884a93067ba026d6e47cd94dbc7d97b80d5
│     ├── 28ca6b2fc07057618ad749d6f6403afde056ed534ba14271d6c9ead8cd1ea136
│     ├── 3fb3c9af89a9178a2ab12a1f30d8df607fa46a6f176acf9448328b22d31086a2
│     ├── 4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454
│     ├── 51103b3f2993cbc1b45ff9d941b5d461484002792e02aa29580ec5282de719d4
│     ├── 57fb4b5f1a47c953ca5703f0f81ce14e5d01cf23aa79558b5adb961cc526e320
│     ├── 70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a
│     ├── 73b28a5955ec7fb46f2cf0434e4901a889f7dda3f8c9ec496300feb256c7eda8
│     ├── 9981e73032c8833e387a8f96986e560edbed12c38119e0edb0439c9c2234eac9
│     ├── a27b630f446c3da376a30cf610e4bfa6847f8b87c83702c29e72b986f4e52d28
│     ├── a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89
│     ├── c319b1fc4ed70b8241a7ce6ac0c4015d354bf5cf8c01eb73c50b6709c0c46e49
│     ├── cf7b6fa1108a7ad1dfcc61d4e7d7c1b62cd4550ef574df4212d7a8c7a6fada81
│     ├── d378343b49e42a7f34e8c5a63abea857964e5ebc62628e6f9d21dda419f0efc3
│     ├── dc18010aabc13ce121123c7bb0f4dcb6879ce22b4f7c65669a2c634b5ceecafb
│     ├── dc95795d85e881384cd236fbe303e7b9b31a65d2c76ec61ab55dba539f27e158
│     ├── df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139
│     └── f3bec467166fd0e38f83ff32fb82447f5e89b5abd13264a04454c75e11f1cdc6
├── image-layout-alpine.tar
├── index.json
├── manifest.json
└── oci-layout

From above, we can see the described directory layout. There is one more element than the specification describes: manifest.json. It is a direct reference to the contents of the appropriate Image Manifest:

# Inspecting the manifest
cryptonite@host:~$  cat manifest.json  | jq .
[
  {
    "Config": "blobs/sha256/0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b",
    "RepoTags": [
      "alpine:latest"
    ],
    "Layers": [
      "blobs/sha256/df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139"
    ]
  }
]

We can confirm that by retrieving the contents following the Image Index and then the Image Manifest.

# Inspecting the index to find the right manifest (amd64).
cryptonite@host:~$ cat index.json | jq .
{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
      "digest": "sha256:4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454",
      "size": 1638,
      "annotations": {
        "io.containerd.image.name": "docker.io/library/alpine:latest",
        "org.opencontainers.image.ref.name": "latest"
      }
    }
  ]
}
# the digest field is a pointer to the image index which
# is placed in the blobs directory
cryptonite@host:~$ cat ./blobs/sha256/4edbd2beb5f78b1014028f4fbb99f3237d9561100b6881aabbf5acce2c4f9454  | jq .
{
  "manifests": [
    {
      "digest": "sha256:a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      },
      "size": 528
    },
    {
      "digest": "sha256:70dc0b1a6029d999b9bba6b9d8793e077a16c938f5883397944f3bd01f8cd48a",
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "platform": {
        "architecture": "arm",
        "os": "linux",
        "variant": "v6"
      },
      "size": 528
    },
...

And now we inspect the contents of the amd64 manifest (sha256:a777...):

cryptonite@host:~$ cat ./blobs/sha256/a777c9c66ba177ccfea23f2a216ff6721e78a662cd17019488c417135299cd89
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
      "mediaType": "application/vnd.docker.container.image.v1+json",
      "size": 1472,
      "digest": "sha256:0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b"
   },
   "layers": [
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 2814559,
         "digest": "sha256:df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139"
      }
   ]
}

Indeed, the manifest.json file is a shortcut to the contents of the suitable manifest.
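
The chain followed above (index.json, then the image index, then the platform manifest) boils down to mapping each digest to a path under blobs/. Here is a minimal Python sketch of that walk. It builds a tiny, hypothetical layout in a temporary directory so that it runs on its own; the helper names blob_path, read_blob and write_blob are assumptions for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def blob_path(layout: Path, digest: str) -> Path:
    """Map a digest such as 'sha256:abc...' to blobs/sha256/abc... in the layout."""
    algorithm, _, hex_value = digest.partition(":")
    return layout / "blobs" / algorithm / hex_value


def read_blob(layout: Path, descriptor: dict) -> dict:
    """Load the JSON blob that a descriptor points to."""
    return json.loads(blob_path(layout, descriptor["digest"]).read_bytes())


def write_blob(layout: Path, document: dict) -> str:
    """Store a JSON document as a blob and return its digest (setup helper)."""
    data = json.dumps(document).encode()
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    path = blob_path(layout, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return digest


# Build a tiny, hypothetical layout: index.json -> image index -> manifest.
layout = Path(tempfile.mkdtemp())
manifest_digest = write_blob(layout, {"schemaVersion": 2, "layers": []})
index_digest = write_blob(layout, {"manifests": [
    {"digest": manifest_digest,
     "platform": {"architecture": "amd64", "os": "linux"}},
]})
(layout / "index.json").write_text(json.dumps(
    {"schemaVersion": 2, "manifests": [{"digest": index_digest}]}))

# Follow the same chain as the shell session above.
top = json.loads((layout / "index.json").read_text())
image_index = read_blob(layout, top["manifests"][0])
amd64 = next(m for m in image_index["manifests"]
             if m["platform"]["architecture"] == "amd64")
manifest = read_blob(layout, amd64)
print(manifest)
```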

Image Configuration

Configuration element of the OCI image-spec

The Image Configuration contains Linux process attributes which will be inherited by the container process. These attributes include the command to be executed, its arguments, Unix data stream configurations (stdin/stdout/stderr), mount points, as well as the layers' DiffIDs. It looks a lot like a runtime-spec configuration file, but it is not one: the image-spec configuration element does not include any isolation mechanisms such as namespaces or cgroups. These are set and configured by the container engine of the host. However, the Image Configuration contains additional information about the image components which are used by the container process. For example, it contains an ordered history of root file system changes, which describes how and when the container image was created. This configuration file is passed through a conversion procedure to be transformed into a classical runtime-spec config.json.

Image Configuration in Practice

Let's now inspect the configuration of our favorite Alpine image identified by sha256:0ac33e5... (see above).

cryptonite@host:~$ cat ./blobs/sha256/\
0ac33e5f5afa79e084075e8698a22d574816eea8d7b7d480586835657c3e1c8b | jq .
{
  "architecture": "amd64",
  "config": {
    "Hostname": "",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/sh"
    ],
    "Image": "sha256:d49869997c508135352366cebd3509ee756bba1ceb8eef708a4c3ff0d481084a",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": null
  },
  "container": "b714116bd3f3418e7b61a6d70dd7244382f0844e47a8d1d66dbf61cb1cb02b2b",
  "container_config": {
    "Hostname": "b714116bd3f3",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/sh",
      "-c",
      "#(nop) ",
      "CMD [\"/bin/sh\"]"
    ],
    "Image": "sha256:d49869997c508135352366cebd3509ee756bba1ceb8eef708a4c3ff0d481084a",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": {}
  },
  "created": "2022-04-05T00:19:59.912662499Z",
  "docker_version": "20.10.12",
  "history": [
    {
      "created": "2022-04-05T00:19:59.790636867Z",
      "created_by": "/bin/sh -c #(nop) ADD file:5d673d25da3a14ce1f6cf66e4c7fd4f4b85a3759a9d93efb3fd9ff852b5b56e4 in / "
    },
    {
      "created": "2022-04-05T00:19:59.912662499Z",
      "created_by": "/bin/sh -c #(nop)  CMD [\"/bin/sh\"]",
      "empty_layer": true
    }
  ],
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:4fc242d58285699eca05db3cc7c7122a2b8e014d9481f323bd9277baacfa0628"
    ]
  }
}

From the above snippet, we can recognize many elements that also appear in the runtime-spec configuration, alongside the image-specific information mentioned earlier. It is important to note the difference between the descriptors used for navigation in the image layout directory (the sha256 blobs) and the DiffIDs (diff_ids) present in the configuration file. The former may be calculated on compressed or uncompressed content, while the DiffIDs are always calculated on uncompressed content.

# We take the image blob of the amd64 manifest
cryptonite@host:~$ sha256sum ./blobs/sha256/\
df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139

df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139 ...

# and now inspect the type of data that is in it
cryptonite@host:~$ file ./blobs/sha256/\
df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139

./blobs/sha256/df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139: \
gzip compressed data, original size modulo 2^32 5855232

We can see that the sha256 digest from the configuration file (sha256:4fc...) and the name of the blob in the image layout (sha256:df9...) are different. But if we now decompress the contents and compute their hash, this small mystery is resolved:

cryptonite@host:~$ cat ./blobs/sha256/df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139 \
| gunzip - | sha256sum -
4fc242d58285699eca05db3cc7c7122a2b8e014d9481f323bd9277baacfa0628  -
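
The relation between the two hashes can be reproduced without any real image. In this Python sketch a small tar archive built in memory stands in for a layer (the file name and payload are assumptions for illustration): the layer digest is computed on the gzip-compressed bytes, the DiffID on the uncompressed tar.

```python
import gzip
import hashlib
import io
import tarfile


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Build a small uncompressed tar archive in memory (a stand-in for a layer).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    payload = b"hello\n"
    info = tarfile.TarInfo(name="etc/hi.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
layer_tar = buffer.getvalue()

# The blob stored in a registry or image layout is the compressed tarball.
layer_blob = gzip.compress(layer_tar)

layer_digest = sha256_digest(layer_blob)              # referenced by the manifest
diff_id = sha256_digest(gzip.decompress(layer_blob))  # listed in rootfs.diff_ids

print(layer_digest)
print(diff_id)
```

One practical consequence is that recompressing a layer with different gzip settings may change its digest in the manifest while its DiffID stays the same.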

Conversion

As mentioned above, the image-spec configuration element has to be transformed into a runtime-spec configuration file. The conversion part of the image-spec indicates how this could happen. Broadly speaking, it can be seen as an intuitive 1-to-1 field conversion:

Image-spec          Runtime-spec
Config.WorkingDir   process.cwd
Config.Env          process.env
Config.User         process.user

This conversion is done by the container engine, which also adds the isolation mechanisms appropriate to the current host configuration. The actual rules are more detailed than this table suggests, so please refer to the documentation for the specifics.
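
As a rough illustration of the field mapping, here is a Python sketch that derives a partial runtime-spec process object from an image configuration. It is deliberately simplified and is not the conversion used by any particular engine: the function name to_runtime_process is an assumption, only numeric "uid" or "uid:gid" users are handled, and defaulting an empty working directory to "/" is an assumption of this sketch. Entrypoint and Cmd are concatenated to form process.args.

```python
def to_runtime_process(image_config: dict) -> dict:
    """Derive a partial runtime-spec 'process' object from an image config."""
    config = image_config.get("config") or {}
    # Entrypoint followed by Cmd forms the argument vector.
    args = (config.get("Entrypoint") or []) + (config.get("Cmd") or [])
    process = {
        "args": args,
        "env": config.get("Env") or [],
        # Assumption of this sketch: fall back to "/" when WorkingDir is empty.
        "cwd": config.get("WorkingDir") or "/",
    }
    # Simplification: only numeric "uid" or "uid:gid" values are handled.
    user = config.get("User") or ""
    if user:
        uid, _, gid = user.partition(":")
        process["user"] = {"uid": int(uid)}
        if gid:
            process["user"]["gid"] = int(gid)
    return process


# Relevant fields of the Alpine image configuration shown earlier.
alpine_config = {
    "config": {
        "User": "",
        "Env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
        "Cmd": ["/bin/sh"],
        "WorkingDir": "",
        "Entrypoint": None,
    }
}

print(to_runtime_process(alpine_config))
```

Namespaces, cgroups and capabilities are absent on purpose: they are not part of the image and have to be added by the engine.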

Descriptors

Throughout the whole article, we used content descriptors to reference the elements of the image-spec. Let's see a more formal definition:

  • an OCI image consists of several different components, arranged in a Merkle Directed Acyclic Graph (DAG);
  • references between components in the graph are expressed through Content Descriptors;
  • a Content Descriptor (or simply Descriptor) describes the disposition of the targeted content;
  • a Content Descriptor includes the type of the content, a content identifier (digest of the described object), and the byte size of the raw content.

For example, a descriptor pointing to an image manifest:

{
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "size": 7682,
  "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270"
}

File System Layer

Layer OCI specification element

We saw that an Image Manifest contains references to different file system states which are referred to as layers. Each layer contains a modification (addition, deletion, or change) of the file system relative to the previous one. When all of this file system "history" is combined in the right order, a root directory layout for a container is created. This part of the specification defines the types of file system modifications and how they can be assembled to form a container root directory. Let's consider the following modification types:

  • additions;
  • deletions;
  • modifications.

These actions can be applied to every kind of Linux file system object (regular files, directories, sockets, devices, hard links, etc.).

Here we will demonstrate, using a toy example, how the magic happens. Let's consider an initial filesystem root having the following structure:

root/
    etc/
        hi.txt
    bin/
        binhi.txt

and the following changes:

1. rm /etc/hi.txt
2. mkdir /etc/hello
3. echo 'binhi' >> /bin/binhi.txt

To convert this sequence of actions to a set of image layers one has to do the following:

  1. create an exact file system snapshot of the initial rootfs, resulting in rootfs.s1. This is going to be layer 1;
  2. apply the first change (rm /etc/hi.txt) on rootfs.s1;
  3. compare the resulting snapshot with the original directory and construct a new layer containing: DELETED /etc/hi.txt
  4. compute an integrity proof of the new layer and store it as layer 2;
  5. create a new snapshot of the initial rootfs and apply the changes from layer 2;
  6. apply the changes of the current phase (mkdir /etc/hello);
  7. construct a new layer containing: ADD /etc/hello
  8. compute an integrity proof of the new layer and store it as layer 3;
  9. create a new snapshot of the initial rootfs and apply the changes from layer 2 and layer 3 in the right order;
  10. apply the current changes (echo 'binhi' >> /bin/binhi.txt);
  11. construct a new layer containing: MOD /bin/binhi.txt (previous content + 'binhi')
  12. compute an integrity proof of the new layer and store it as layer 4.

When constructing the final root directory, the layers (also called changesets) are applied in the order in which they were created, starting from the base layer (first-in-first-out). These layers are often compressed, as seen in the previous examples, to reduce transfer costs. Because of the number of snapshots involved, it is recommended to use a copy-on-write or union filesystem (OverlayFS or AUFS) when building images.
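
The snapshot comparison at the heart of this procedure can be sketched in Python. Here a snapshot is modeled as a dictionary mapping paths to file contents (directories map to None), which is an assumption made to keep the example self-contained; real tooling diffs actual directory trees. The function name diff_snapshots and the ADD/DELETED/MOD labels simply mirror the toy notation used in the steps.

```python
def diff_snapshots(before: dict, after: dict) -> list:
    """Compare two snapshots (path -> content) and return a sorted changeset."""
    changes = []
    for path in before.keys() - after.keys():
        changes.append(("DELETED", path))
    for path in after.keys() - before.keys():
        changes.append(("ADD", path))
    for path in before.keys() & after.keys():
        if before[path] != after[path]:
            changes.append(("MOD", path))
    return sorted(changes, key=lambda change: change[1])


# Snapshots of the toy rootfs; directories are stored with content None.
s1 = {"/etc": None, "/etc/hi.txt": "hi\n", "/bin": None, "/bin/binhi.txt": ""}
s2 = {k: v for k, v in s1.items() if k != "/etc/hi.txt"}         # rm /etc/hi.txt
s3 = {**s2, "/etc/hello": None}                                  # mkdir /etc/hello
s4 = {**s3, "/bin/binhi.txt": s3["/bin/binhi.txt"] + "binhi\n"}  # echo 'binhi' >>

for number, (old, new) in enumerate([(s1, s2), (s2, s3), (s3, s4)], start=2):
    print(f"layer {number}:", diff_snapshots(old, new))
```

Each changeset is computed relative to the state produced by the previous layers, which is why the layers must later be applied in the same order.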

Building a File System Layer in Practice

Let's see a more practical example of how all this works using a good old Dockerfile. Docker has an integrated image-building feature that turns a container root directory into an image. For more details on how this is done, please read the documentation.

FROM alpine
RUN mkdir hello
RUN touch /hello/hi
RUN rm /etc/alpine-release
RUN rm hello/hi && \
    touch hello/hi2

Let's now build it:

cryptonite@host:~$ docker build -t layers .
Sending build context to Docker daemon  2.048kB
Step 1/5 : FROM alpine
 ---> 0ac33e5f5afa
Step 2/5 : RUN mkdir hello
 ---> 8f200d4b4388
Step 3/5 : RUN touch /hello/hi
 ---> e282cf66f48d
Step 4/5 : RUN rm /etc/alpine-release
 ---> 68edb2bec8d7
Step 5/5 : RUN rm hello/hi &&     touch hello/hi2
 ---> Running in 4cdf386e2ad9
Removing intermediate container 4cdf386e2ad9
 ---> 06441d7c908f
Successfully built 06441d7c908f
Successfully tagged layers:latest

To analyze it with ctr again (let's be consistent, even though this can be done using Docker), we need to push this custom image to a private image registry so that we can download it afterward:

cryptonite@host:~$ docker run -d -p 5000:5000 --restart=always --name registry-test registry:2
# retag the image for an easier pull with ctr
cryptonite@host:~$ docker tag layers localhost:5000/layers:latest
# push it locally
cryptonite@host:~$ docker push localhost:5000/layers:latest
...

Let's now download and analyze it:

# now download it into the containerd's store
cryptonite@host:~/test $ ctr image pull --plain-http localhost:5000/layers:latest
...
# We export the image to a tar to obtain a directory layout
cryptonite@host:~/test $ ctr image export layers.tar localhost:5000/layers:latest
# obtain an image layout as before
cryptonite@host:~/test $ tar xf layers.tar
# let's inspect the manifest
cryptonite@host:~/test $ cat manifest.json | jq .
[
  {
    "Config": "blobs/sha256/06441d7c908f6010a8ed742bc37265455a042483670031ca359fd82d1bd1c714",
    "RepoTags": [
      "localhost:5000/layers:latest"
    ],
    "Layers": [
      "blobs/sha256/df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139",
      "blobs/sha256/6f59d6f00b77a27e0cc7009c4793b4fa0c103a5acf751511fa433fb3b23e3615",
      "blobs/sha256/3f26df9cf37166c953e0110eaa785f39901074741b03953657f66d774bb87289",
      "blobs/sha256/d505d1a3d8171860c26f441438eb2753882085f793849e7d574a5958957e2d16",
      "blobs/sha256/75aabadffcdcd22f7196d3073bf6a3a0273b78f87621db27c705eb5a855777f2"
    ]
  }
]

We can see that the number of layers corresponds to the number of Dockerfile instructions that change the file system: here, one layer comes from the Alpine base image (FROM) and one from each of the four RUN instructions. Now you probably understand why people try to write Dockerfiles as compactly as possible ;). Let's dig into the configuration and inspect the different layers:

cryptonite@host:~/test $ cat ./blobs/sha256/06441d7c908f6010a8ed742bc37265455a042483670031ca359fd82d1bd1c714 | jq .
...
"history": [
    {
      "created": "2022-04-05T00:19:59.790636867Z",
      "created_by": "/bin/sh -c #(nop) ADD file:5d673d25da3a14ce1f6cf66e4c7fd4f4b85a3759a9d93efb3fd9ff852b5b56e4 in / "
    },
    {
      "created": "2022-04-05T00:19:59.912662499Z",
      "created_by": "/bin/sh -c #(nop)  CMD [\"/bin/sh\"]",
      "empty_layer": true
    },
    {
      "created": "2022-04-20T14:18:44.267013462Z",
      "created_by": "/bin/sh -c mkdir hello"
    },
    {
      "created": "2022-04-20T14:51:46.03773869Z",
      "created_by": "/bin/sh -c touch /hello/hi"
    },
    {
      "created": "2022-04-20T14:51:47.088511078Z",
      "created_by": "/bin/sh -c rm /etc/alpine-release"
    },
    {
      "created": "2022-04-20T14:52:09.712954334Z",
      "created_by": "/bin/sh -c rm hello/hi &&     touch hello/hi2"
    }
  ],
...

Okay, we can see the history log of the actions in the Dockerfile. Let's now inspect how they are represented at the file system level:

# create one directory per layer
cryptonite@host:~/test $ for i in $(seq 1 5); do mkdir -p layer$i; done
# extract each layer into a separate directory
cryptonite@host:~/test $ tar xf ./blobs/sha256/df9b9388f04ad6279a7410b85cedfdcb2208c0a003da7ab5613af71079148139 -C layer1/
cryptonite@host:~/test $ tar xf ./blobs/sha256/6f59d6f00b77a27e0cc7009c4793b4fa0c103a5acf751511fa433fb3b23e3615 -C layer2/
cryptonite@host:~/test $ tar xf ./blobs/sha256/3f26df9cf37166c953e0110eaa785f39901074741b03953657f66d774bb87289 -C layer3/
cryptonite@host:~/test $ tar xf ./blobs/sha256/d505d1a3d8171860c26f441438eb2753882085f793849e7d574a5958957e2d16 -C layer4/
cryptonite@host:~/test $ tar xf ./blobs/sha256/75aabadffcdcd22f7196d3073bf6a3a0273b78f87621db27c705eb5a855777f2 -C layer5/
# investigate the contents of each directory
### LAYER1
cryptonite@host:~/test$ ls -a layer1/
.  ..  bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
### LAYER2
cryptonite@host:~/test$ ls -a layer2/
.  ..  hello
### LAYER3
cryptonite@host:~/test$ ls -a layer3/
.  ..  hello
cryptonite@host:~/test$ ls -a layer3/hello/
.  ..  hi
### LAYER4
cryptonite@host:~/test $ ls -a layer4/
.  ..  etc
cryptonite@host:~/test$ ls -a layer4/etc
.  ..  .wh.alpine-release
### LAYER5
cryptonite@host:~/test $ ls -a layer5/
.  ..  hello
cryptonite@host:~/test $ ls -a layer5/hello/
.  ..  hi2  .wh.hi

We can see that our toy example was not so far from the truth after all.

Note that a file with the .wh. prefix (a whiteout file) indicates that the file of the same name, without the prefix, is to be deleted when the layer is applied.
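
To see how whiteouts play out when layers are stacked, here is a simplified Python sketch that applies layers to an in-memory rootfs (path to content, with directories mapped to None). It is only a model of the idea: it ignores opaque whiteouts (.wh..wh..opq), permissions and every other detail a real implementation has to handle. The name apply_layer, the dictionary representation and the placeholder file content are assumptions for illustration.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."


def apply_layer(rootfs: dict, layer: dict) -> dict:
    """Return a new rootfs (path -> content) with one layer applied on top."""
    result = dict(rootfs)
    # First pass: whiteouts hide entries coming from the lower layers.
    for path in layer:
        directory, name = posixpath.split(path)
        if name.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(directory, name[len(WHITEOUT_PREFIX):])
            for existing in list(result):
                if existing == target or existing.startswith(target + "/"):
                    del result[existing]
    # Second pass: regular entries are added or overwritten.
    for path, content in layer.items():
        if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
            result[path] = content
    return result


# Simplified versions of layers 2 to 5 from the Dockerfile above.
layers = [
    {"hello": None},
    {"hello": None, "hello/hi": ""},
    {"etc": None, "etc/.wh.alpine-release": ""},
    {"hello": None, "hello/hi2": "", "hello/.wh.hi": ""},
]

# Stand-in for layer 1, reduced to the one file we care about.
rootfs = {"etc": None, "etc/alpine-release": "placeholder\n"}
for layer in layers:
    rootfs = apply_layer(rootfs, layer)

print(sorted(rootfs))  # ['etc', 'hello', 'hello/hi2']
```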

Conclusion

The OCI image-spec extends the runtime-spec allowing portability of container applications between different hosts. In combination with other cool technologies such as OverlayFS/COW (Copy-on-write), resources between containers running on the same host can be efficiently shared. Modern container runtimes such as Containerd and Docker implement the OCI image specification, hence they deliver some nice features facilitating container management.

Acknowledgments

I would like to thank Frédéric Raynal (pappy), Mahé Tardy (mahé), Damiano Melotti (dmell) and Sébastien Rolland (Sébastien), members of Quarkslab, for their insightful comments, encouragement and intellectual guidance.
