
· 3 min read

This summer, over four engineer-weeks, Trail of Bits and OSTIF collaborated on a security audit of Dragonfly2. Dragonfly2, a CNCF Incubating Project, is a peer-to-peer file distribution system. The scope also included the repository of its sub-project Nydus, which handles image distribution. The engagement was framed around several goals relevant to the security and longevity of the project as it moves towards graduation.

The Trail of Bits audit team approached the audit with static and manual testing, combining automated and manual processes. By introducing semgrep and CodeQL tooling, manually reviewing the client, scheduler, and manager code, and fuzz testing the gRPC handlers, the team identified a variety of findings that the project can use to improve its security. Focusing on high-level business logic and externally accessible endpoints let the team direct its effort during the audit and provide guidance and recommendations for Dragonfly2’s future work.

The audit report records 19 findings: five ranked high, one medium, four low, five informational, and four undetermined. Nine of the findings were categorized as Data Validation, three of which were high severity. The report also ranks and reviews Dragonfly2’s Codebase Maturity, which comprises eleven aspects of the project code that are analyzed individually.

This is a large project and could not be reviewed in total due to time constraints and scope. Multiple specialized features were outside the scope of this audit for those reasons. This project is a great opportunity for continued audit work to improve and elevate the code and harden security before graduation. Ongoing security effort is critical, as security is a moving target.

We would like to thank the Trail of Bits team, particularly Dan Guido, Jeff Braswell, Paweł Płatek, and Sam Alws for their work on this project. Thank you to the Dragonfly2 maintainers and contributors, specifically Wenbo Qi, for their ongoing work and contributions to this engagement. Finally, we are grateful to the CNCF for funding this audit and supporting open source security efforts.

OSTIF & Trail of Bits

Dragonfly community

Nydus community

· 9 min read
Gaius

Introduction

Dragonfly has been selected and put into production use by many Internet companies since it was open-sourced in 2017. It entered the CNCF in October 2018, becoming the third project from China to enter the CNCF Sandbox. In April 2020, the CNCF TOC voted to accept Dragonfly as a CNCF Incubating project. Dragonfly has developed its next version through production practice, absorbing the advantages of Dragonfly 1.x and making many optimizations for known problems.

Nydus optimizes the OCIv1 image format and provides a brand-new image-based filesystem, so that a container can download image data on demand and no longer needs to download the complete image before it starts. In the latest version, dragonfly has completed its integration with nydus, allowing containers to start while downloading images on demand, which reduces the amount of data downloaded. Dragonfly's P2P transmission can also be used during the transfer to reduce back-to-source traffic and increase speed.

Quick start

Prerequisites

| Name | Version | Document |
| --- | --- | --- |
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| Containerd | v1.4.3+ | containerd.io |
| Nerdctl | 0.22+ | containerd/nerdctl |

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonfly

For detailed installation documentation based on kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes cluster

Create kind multi-node cluster configuration file kind-config.yaml, configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
      - containerPort: 30951
        hostPort: 40901
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create helm charts configuration file charts-config.yaml and enable prefetching, configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    download:
      prefetch: true

dfdaemon:
  hostNetwork: true
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    download:
      prefetch: true
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
      tcpListen:
        listen: 0.0.0.0
        port: 65001
      registryMirror:
        dynamic: true
        url: https://index.docker.io
      proxies:
        - regx: blobs/sha256.*

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
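
To make the two `proxy` settings above more concrete, here is a minimal Python sketch. It assumes that `regx` is matched anywhere in the request URL and that `defaultFilter` lists query parameters that are ignored when two URLs are compared for being the same download task. `matches_proxy_rule` and `task_key` are hypothetical helper names, not part of Dragonfly, and the URLs used with them are made up.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PROXY_RULE = re.compile(r"blobs/sha256.*")  # value of proxies[0].regx
DEFAULT_FILTER = "Expires&Signature&ns"     # value of proxy.defaultFilter


def matches_proxy_rule(url: str) -> bool:
    """Return True if the URL contains a match for the proxy rule."""
    return PROXY_RULE.search(url) is not None


def task_key(url: str, url_filter: str = DEFAULT_FILTER) -> str:
    """Drop the filtered query parameters so equivalent URLs compare equal."""
    ignored = set(url_filter.split("&"))
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Under these assumptions, two signed URLs for the same blob that differ only in `Expires` and `Signature` reduce to the same key, so they can be treated as one task.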

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
NAME: dragonfly
LAST DEPLOYED: Wed Oct 19 04:23:22 2022
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS        AGE
dragonfly-dfdaemon-rhnr6             1/1     Running   4 (101s ago)    3m27s
dragonfly-dfdaemon-s6sv5             1/1     Running   5 (111s ago)    3m27s
dragonfly-manager-67f97d7986-8dgn8   1/1     Running   0               3m27s
dragonfly-mysql-0                    1/1     Running   0               3m27s
dragonfly-redis-master-0             1/1     Running   0               3m27s
dragonfly-redis-replicas-0           1/1     Running   1 (115s ago)    3m27s
dragonfly-redis-replicas-1           1/1     Running   0               95s
dragonfly-redis-replicas-2           1/1     Running   0               70s
dragonfly-scheduler-0                1/1     Running   0               3m27s
dragonfly-seed-peer-0                1/1     Running   2 (95s ago)     3m27s

Create peer service configuration file peer-service-config.yaml, configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
    - name: http-40901
      nodePort: 30951
      port: 40901
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
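
The port numbers in kind-config.yaml and peer-service-config.yaml have to line up: kind forwards each hostPort to a containerPort on the node, and that node port must equal a nodePort of the peer Service. The following is a small Python sketch that checks this chain. The values are copied from the two files above, and `resolve_host_ports` is a hypothetical helper written only for this illustration.

```python
# Values copied from kind-config.yaml and peer-service-config.yaml above.
KIND_PORT_MAPPINGS = [
    {"containerPort": 30950, "hostPort": 65001},
    {"containerPort": 30951, "hostPort": 40901},
]
SERVICE_PORTS = [
    {"name": "http-65001", "nodePort": 30950, "port": 65001},
    {"name": "http-40901", "nodePort": 30951, "port": 40901},
]


def resolve_host_ports(mappings, service_ports):
    """Map each host port to the Service port it finally reaches."""
    by_node_port = {p["nodePort"]: p["port"] for p in service_ports}
    routes = {}
    for mapping in mappings:
        node_port = mapping["containerPort"]  # kind forwards hostPort to this node port
        if node_port not in by_node_port:
            raise ValueError(f"no Service nodePort for {node_port}")
        routes[mapping["hostPort"]] = by_node_port[node_port]
    return routes


print(resolve_host_ports(KIND_PORT_MAPPINGS, SERVICE_PORTS))
# {65001: 65001, 40901: 40901}
```

With this mapping, 127.0.0.1:65001 on the host reaches port 65001 of the peer Service, and 127.0.0.1:40901 reaches port 40901.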

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Install nydus for containerd

For detailed nydus installation documentation based on containerd environment, please refer to nydus-setup-for-containerd-environment. The example uses Systemd to manage the nydus-snapshotter service.

Install nydus tools

Download containerd-nydus-grpc binary, please refer to nydus-snapshotter/releases:

NYDUS_SNAPSHOTTER_VERSION=0.3.3
wget https://github.com/containerd/nydus-snapshotter/releases/download/v$NYDUS_SNAPSHOTTER_VERSION/nydus-snapshotter-v$NYDUS_SNAPSHOTTER_VERSION-x86_64.tgz
tar zxvf nydus-snapshotter-v$NYDUS_SNAPSHOTTER_VERSION-x86_64.tgz

Install containerd-nydus-grpc tool:

sudo cp nydus-snapshotter/containerd-nydus-grpc /usr/local/bin/

Download nydus-image, nydusd and nydusify binaries, please refer to dragonflyoss/image-service:

NYDUS_VERSION=2.1.1
wget https://github.com/dragonflyoss/image-service/releases/download/v$NYDUS_VERSION/nydus-static-v$NYDUS_VERSION-linux-amd64.tgz
tar zxvf nydus-static-v$NYDUS_VERSION-linux-amd64.tgz

Install nydus-image, nydusd and nydusify tools:

sudo cp nydus-static/nydus-image nydus-static/nydusd nydus-static/nydusify /usr/local/bin/

Install nydus snapshotter plugin for containerd

Configure containerd to use the nydus-snapshotter plugin, please refer to configure-and-start-containerd.

127.0.0.1:65001 is the proxy address of dragonfly peer, and the X-Dragonfly-Registry header is the address of origin registry, which is provided for dragonfly to download the images.
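
As a rough illustration of the sentence above, the Python sketch below builds the URL and header for a blob request that goes to the dragonfly peer instead of the origin registry. It assumes the standard registry blob path `/v2/<name>/blobs/<digest>`; `build_mirror_request` is a hypothetical helper, and the repository and digest passed to it are placeholders.

```python
MIRROR_HOST = "http://127.0.0.1:65001"       # proxy address of the dragonfly peer
ORIGIN_REGISTRY = "https://index.docker.io"  # value of X-Dragonfly-Registry


def build_mirror_request(repository: str, digest: str):
    """Return the (url, headers) pair for a blob fetched via the mirror."""
    url = f"{MIRROR_HOST}/v2/{repository}/blobs/{digest}"
    # The header tells the peer which origin registry holds the blob.
    headers = {"X-Dragonfly-Registry": ORIGIN_REGISTRY}
    return url, headers


url, headers = build_mirror_request("library/python", "sha256:<digest>")
print(url)
print(headers)
```

The request is addressed to the local peer, while the header carries the registry that the peer should fall back to when no other peer has the data.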

Change configuration of containerd in /etc/containerd/config.toml:

[proxy_plugins]
[proxy_plugins.nydus]
type = "snapshot"
address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

[plugins.cri]
[plugins.cri.containerd]
snapshotter = "nydus"
disable_snapshot_annotations = false

Restart containerd service:

sudo systemctl restart containerd

Check that containerd uses the nydus-snapshotter plugin:

$ ctr -a /run/containerd/containerd.sock plugin ls | grep nydus
io.containerd.snapshotter.v1 nydus - ok

Systemd starts nydus snapshotter service

For detailed configuration documentation based on nydus mirror mode, please refer to enable-mirrors-for-storage-backend.

Create nydusd configuration file nydusd-config.json, configuration content is as follows:

{
  "device": {
    "backend": {
      "type": "registry",
      "config": {
        "mirrors": [
          {
            "host": "http://127.0.0.1:65001",
            "auth_through": false,
            "headers": {
              "X-Dragonfly-Registry": "https://index.docker.io"
            },
            "ping_url": "http://127.0.0.1:40901/server/ping"
          }
        ],
        "scheme": "https",
        "skip_verify": false,
        "timeout": 10,
        "connect_timeout": 10,
        "retry_limit": 2
      }
    },
    "cache": {
      "type": "blobcache",
      "config": {
        "work_dir": "/var/lib/nydus/cache/"
      }
    }
  },
  "mode": "direct",
  "digest_validate": false,
  "iostats_files": false,
  "enable_xattr": true,
  "fs_prefetch": {
    "enable": true,
    "threads_count": 10,
    "merging_size": 131072,
    "bandwidth_rate": 1048576
  }
}
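
Before copying the file into place, it can help to confirm that it parses as JSON and that the mirror entry carries the fields this guide relies on. The sketch below is one way to do that in Python. The checks are this example's own assumptions rather than rules enforced by nydusd, `check_mirrors` is a hypothetical helper, and the embedded text is a trimmed copy of the mirror section above.

```python
import json

# A trimmed copy of the mirror section from nydusd-config.json above.
CONFIG_TEXT = """
{
  "device": {
    "backend": {
      "type": "registry",
      "config": {
        "mirrors": [
          {
            "host": "http://127.0.0.1:65001",
            "auth_through": false,
            "headers": {"X-Dragonfly-Registry": "https://index.docker.io"},
            "ping_url": "http://127.0.0.1:40901/server/ping"
          }
        ]
      }
    }
  }
}
"""


def check_mirrors(config: dict) -> list:
    """Return a list of human-readable problems found in the mirror entries."""
    problems = []
    mirrors = config["device"]["backend"]["config"].get("mirrors", [])
    if not mirrors:
        problems.append("no mirrors configured")
    for index, mirror in enumerate(mirrors):
        if not mirror.get("host", "").startswith(("http://", "https://")):
            problems.append(f"mirror {index}: host has no http(s) scheme")
        if "X-Dragonfly-Registry" not in mirror.get("headers", {}):
            problems.append(f"mirror {index}: missing X-Dragonfly-Registry header")
    return problems


print(check_mirrors(json.loads(CONFIG_TEXT)))  # []
```

To check the real file instead, the same function can be given the result of `json.load` on an opened nydusd-config.json.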

Copy configuration file to /etc/nydus/config.json:

sudo mkdir -p /etc/nydus && sudo cp nydusd-config.json /etc/nydus/config.json

Create systemd configuration file nydus-snapshotter.service of nydus snapshotter, configuration content is as follows:

[Unit]
Description=nydus snapshotter
After=network.target
Before=containerd.service

[Service]
Type=simple
Environment=HOME=/root
ExecStart=/usr/local/bin/containerd-nydus-grpc --config-path /etc/nydus/config.json
Restart=always
RestartSec=1
KillMode=process
OOMScoreAdjust=-999
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Copy configuration file to /etc/systemd/system/:

sudo cp nydus-snapshotter.service /etc/systemd/system/

Systemd starts nydus snapshotter service:

$ sudo systemctl enable nydus-snapshotter
$ sudo systemctl start nydus-snapshotter
$ sudo systemctl status nydus-snapshotter
● nydus-snapshotter.service - nydus snapshotter
Loaded: loaded (/etc/systemd/system/nydus-snapshotter.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-10-19 08:01:00 UTC; 2s ago
Main PID: 2853636 (containerd-nydu)
Tasks: 9 (limit: 37574)
Memory: 4.6M
CPU: 20ms
CGroup: /system.slice/nydus-snapshotter.service
└─2853636 /usr/local/bin/containerd-nydus-grpc --config-path /etc/nydus/config.json

Oct 19 08:01:00 kvm-gaius-0 systemd[1]: Started nydus snapshotter.
Oct 19 08:01:00 kvm-gaius-0 containerd-nydus-grpc[2853636]: time="2022-10-19T08:01:00.493700269Z" level=info msg="gc goroutine start..."
Oct 19 08:01:00 kvm-gaius-0 containerd-nydus-grpc[2853636]: time="2022-10-19T08:01:00.493947264Z" level=info msg="found 0 daemons running"

Convert an image to nydus format

Convert the python:3.9.15 image to nydus format. Alternatively, you can use the already converted dragonflyoss/python:3.9.15-nydus image and skip this step. Both nydusify and acceld can be used as the conversion tool.

Login to Dockerhub:

docker login

Convert the python:3.9.15 image to nydus format. The DOCKERHUB_REPO_NAME environment variable needs to be set to the user's image repository:

DOCKERHUB_REPO_NAME=dragonflyoss
sudo nydusify convert --nydus-image /usr/local/bin/nydus-image --source python:3.9.15 --target $DOCKERHUB_REPO_NAME/python:3.9.15-nydus

Try nydus with nerdctl

Running python:3.9.15-nydus with nerdctl:

sudo nerdctl --snapshotter nydus run --rm -it $DOCKERHUB_REPO_NAME/python:3.9.15-nydus

Check that nydus is downloaded via dragonfly based on mirror mode:

$ grep mirrors /var/lib/containerd-nydus/logs/**/*log
[2022-10-19 10:16:13.276548 +00:00] INFO [storage/src/backend/connection.rs:271] backend config: ConnectionConfig { proxy: ProxyConfig { url: "", ping_url: "", fallback: false, check_interval: 5, use_http: false }, mirrors: [MirrorConfig { host: "http://127.0.0.1:65001", headers: {"X-Dragonfly-Registry": "https://index.docker.io"}, auth_through: false }], skip_verify: false, timeout: 10, connect_timeout: 10, retry_limit: 2 }

Performance testing

Test the performance of single-machine image download after the integration of nydus mirror mode and dragonfly P2P. The tests run version commands using images of different languages; for example, the startup command used to run a python image is python -V. All tests were performed on the same machine. Because the machine's own network environment affects the results, the absolute download times are not important; what matters is the ratio between the download times in the different scenarios.

nydus-mirror-dragonfly

  • OCIv1: Use containerd to pull image directly.
  • Nydus Cold Boot: Use containerd to pull image via nydus-snapshotter and doesn't hit any cache.
  • Nydus & Dragonfly Cold Boot: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the local peer cache.
  • Hit Nydus Cache: Use containerd to pull image via nydus-snapshotter. Transfer the traffic to dragonfly P2P based on nydus mirror mode and hit the nydus local cache.

The test results show the effect of integrating nydus mirror mode with dragonfly P2P. Compared with OCIv1 mode, downloading images with nydus effectively reduces the image download time. The cold boot times of nydus alone and of nydus with dragonfly are close. All scenarios that hit a dragonfly cache perform better than nydus alone. Most importantly, when a very large kubernetes cluster uses nydus to pull images, the download of each image layer generates as many range requests as needed, so the QPS on the source registry becomes very high. Dragonfly can effectively reduce the number of requests and the download traffic that go back to the source registry. In the best case, dragonfly makes the same task go back to the source only once.
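
The back-to-source effect described above can be illustrated with a toy model in Python. It is only a sketch: it assumes every node requests the same set of ranges and that any range fetched once can be served by a peer afterwards. The function name and the node and range counts are hypothetical and unrelated to the measurements in this post.

```python
def origin_requests(nodes: int, ranges_per_layer: int, shared_cache: bool) -> int:
    """Count the requests that reach the source registry in a toy model."""
    seen = set()
    requests_to_origin = 0
    for _node in range(nodes):
        for range_id in range(ranges_per_layer):
            if shared_cache and range_id in seen:
                continue  # served by a peer, no back-to-source request
            seen.add(range_id)
            requests_to_origin += 1
    return requests_to_origin


print(origin_requests(nodes=100, ranges_per_layer=50, shared_cache=False))  # 5000
print(origin_requests(nodes=100, ranges_per_layer=50, shared_cache=True))   # 50
```

Without sharing, the load on the registry grows with the number of nodes. With sharing, it depends only on the number of distinct ranges, which is the best case mentioned above.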

Dragonfly community

Nydus community

· 14 min read

The Evolution of the Nydus Image Acceleration

Optimized container images together with technologies such as P2P networks can effectively speed up the process of container deployment and startup. In order to achieve this, we developed the Nydus image acceleration service (also a sub-project of CNCF Dragonfly).

In addition to startup speed, core features such as image layering and lazy pulling are also particularly important in the field of container images. But since no native filesystem supports them, most projects opt for a userspace solution, and Nydus initially did the same. However, user-mode solutions are encountering more and more challenges nowadays, such as a huge gap in performance compared with native filesystems and noticeable resource overhead in high-density deployment scenarios.

Therefore, we designed and implemented the RAFS v6 format, which is compatible with the in-kernel EROFS filesystem, hoping to form a content-addressable in-kernel filesystem for container images. After the lazy-pulling technology of "EROFS over Fscache" was merged into the 5.19 kernel, the next-generation architecture of Nydus is gradually becoming clear. This is the first native in-kernel solution for container images, promoting a high-density, high-performance and high-availability solution for container images.

This article will introduce the evolution of Nydus from three perspectives: Nydus architecture outline, RAFS v6 image format and "EROFS over Fscache" on-demand loading technology.

Please refer to Nydus for more details of this project. Now you can experience all these new features with this user guide.

Nydus Architecture Outline

In brief, Nydus is a filesystem-based image acceleration service that designs the RAFS (Registry Acceleration File System) disk format, optimizing the startup performance of OCIv1 container images.

The fundamental idea of the container image is to provide the root directory (rootfs) of the container, which can be carried by the filesystem or the archive format. Besides, it can also be implemented together with a custom block format, but anyway it needs to present as a directory tree, providing the file interface to containers.

Let's take a look at the OCIv1 standard image format first. The OCIv1 format is an image format specification based on the Docker Image Manifest Version 2 Schema 2. It consists of a manifest, an image index (optional), a series of container image layers and configuration files. Essentially, OCIv1 is a layer-based image format, with each layer storing file-level diff data in tgz archive format. ociv1

Due to the limitation of tgz, OCIv1 has some inherent issues, such as inability to load on demand, coarser level deduplication granularity, unstable hash digest for each layer, etc.
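
The article does not say where the unstable digest comes from. One plausible contributor, shown here purely as an assumption, is that a gzip stream embeds metadata such as a timestamp, so compressing identical content twice can yield different bytes and therefore different digests. A minimal Python demonstration:

```python
import gzip
import hashlib

payload = b"same layer content" * 1024

# Same bytes, compressed twice with different timestamps in the gzip header.
blob_a = gzip.compress(payload, mtime=1)
blob_b = gzip.compress(payload, mtime=2)

digest_a = hashlib.sha256(blob_a).hexdigest()
digest_b = hashlib.sha256(blob_b).hexdigest()

print(digest_a == digest_b)                                # False
print(gzip.decompress(blob_a) == gzip.decompress(blob_b))  # True
```

The decompressed content is identical, yet the compressed layers hash differently, so a digest computed over the archive cannot be used to detect that two layers hold the same files.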

As for the custom block format, it also has some flaws by design.

  • Since the container image should be eventually presented as a directory tree, a filesystem (such as ext4) is needed upon that. In this case the dependency chain is "custom block format + userspace block device + filesystem", which is obviously more complex compared to the native filesystem solution;
  • Since the block format is not aware of the upper filesystem, it would be impossible to distinguish the metadata and data of the filesystem and process them separately (such as compression);
  • Similarly, it is unable to implement file-based image analysis features such as security scanning, hotspot analysis, and runtime interception, etc.;
  • Unable to directly merge multiple existing images into one large image without modifying the blob image, which is the natural ability of the filesystem solution.

Therefore, Nydus is a filesystem-based container image acceleration solution. It introduces the RAFS image format, in which the data (blobs) and metadata (bootstrap) of the image are separated, and the image layer stores only the data part. Files within the image are divided into chunks for deduplication, with each image layer storing the corresponding chunk data. As a result, chunk-level deduplication is possible between layers and between images. It also helps to implement on-demand loading. Since the metadata is separated from the data and combined into one place, accessing the metadata does not require pulling the corresponding data, which speeds up file access considerably.
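
The separation of metadata and chunked data can be sketched in a few lines of Python. This is a simplified model rather than the RAFS on-disk format: the tiny chunk size, the SHA-256 chunk identifiers, and the helper names `split_chunks`, `build_image` and `read_file` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; an assumption for the demo, not the RAFS chunk size


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Cut file content into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_image(files: dict, blob_store: dict) -> dict:
    """Store unique chunks in blob_store and return the metadata ("bootstrap")."""
    bootstrap = {}
    for path, content in files.items():
        digests = []
        for chunk in split_chunks(content):
            digest = hashlib.sha256(chunk).hexdigest()
            blob_store.setdefault(digest, chunk)  # identical chunks are stored once
            digests.append(digest)
        bootstrap[path] = digests
    return bootstrap


def read_file(path: str, bootstrap: dict, blob_store: dict) -> bytes:
    """Reassemble a file from the chunk digests recorded in the metadata."""
    return b"".join(blob_store[d] for d in bootstrap[path])


store = {}
image_a = build_image({"/bin/app": b"AAAABBBBCCCC"}, store)
image_b = build_image({"/bin/app2": b"AAAABBBBDDDD"}, store)
print(len(store))     # 4 unique chunks instead of 6
print(list(image_a))  # listing paths touches only metadata
```

Listing the files of an image needs only the bootstrap, while the two images share the chunks they have in common.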

The Nydus RAFS image format is shown below: nydus_rafs

RAFS v6 image format

Evolution of RAFS image format

Prior to the introduction of the RAFS v6 format, Nydus used an image format implemented entirely in userspace, working via FUSE or virtiofs. However, the userspace filesystem has the following defects:

  • The overhead of a large number of system calls cannot be ignored, especially in the case of random small I/Os with depth 1;
  • Frequent file operations will generate a large number of FUSE requests, resulting in frequent switching of kernel/user mode context, which becomes the performance bottleneck then;
  • In non-FSDAX scenarios, the buffer copy from user to kernel mode will consume CPUs;
  • In the FSDAX (via virtiofs) scenario, a large number of small files will occupy considerable DAX window resources, resulting in potential performance jitter; frequent switching between small files will also generate noticeable DAX mapping setup overhead.

Essentially these problems are caused by the natural limitations of the userspace filesystem solution, and if the container filesystem is an in-kernel filesystem, the problems above can be resolved in practice. Therefore, we introduced RAFS v6 image format, a container image format implemented in kernel based on EROFS filesystem.

Introduction to EROFS filesystem

The EROFS filesystem has been in the Linux mainline since Linux 4.19. In the past, it was mainly used for mobile devices. It is available in the current major distributions (such as Fedora, Ubuntu, Archlinux, Debian, Gentoo, etc.). The userspace tools, erofs-utils, also already exist in these distributions and in the OIN Linux system definition list, and the community is quite active.

EROFS filesystem has the following characteristics:

  • Native local read-only block-based filesystem suitable for various scenarios, the disk format has the minimum I/O unit definition;
  • Page-sized block-aligned uncompressed metadata;
  • Effective space saving through Tail-packing inline technology while keeping high performance;
  • Data is addressed in blocks (mmap I/O friendly, no post I/O processing required);
  • Disk directory format friendly for random access;
  • Simple on-disk format, easy to increase the payload, better scalability;
  • Support DIRECT I/O access; support block devices, FSDAX and other backends;
  • A boot sector is reserved, which can help bootstrap and other requirements.

Introduction to RAFS v6 image format

Over the past year, the Alibaba Cloud kernel team has made several improvements and enhancements to EROFS filesystem, adapting it to the container image storage scenarios, and finally presenting it as a container image format implemented on the kernel side, RAFS v6. In addition, RAFS v6 also carries out a series of optimizations on the image format, such as block alignment, more compact metadata, and more.

The new RAFS v6 image format is as follows: rafsv6

The improved Nydus image service architecture is illustrated as below, adding support for the (EROFS-based) RAFS v6 image format: rafsv6_arch

EROFS over Fscache

erofs over fscache is the next-generation container image on-demand loading technology developed by the Alibaba Cloud kernel team for Nydus. It is also the native image on-demand loading feature of the Linux kernel. It was integrated into the Linux kernel mainline 5.19. erofs_over_fscache_merge

It was also covered on LWN.net as a highlighted feature of the 5.19 merge window: erofs_over_fscache_lwn

Prior to this, almost all lazy pulling solutions available were in the user mode. The userspace solution involves frequent kernel/user mode context switching and memory copying between kernel/user mode, resulting in performance bottlenecks. This problem is especially prominent when all the container images have been downloaded locally, in which case the file access will still switch to userspace.

In order to avoid the unnecessary overhead, we can decouple the two operations of 1) cache management of image data and 2) fetching data through various sources (such as network) on cache miss. Cache management can be all in the kernel mode, so that it can avoid kernel/user mode context switching when the image is locally ready. This is exactly the main benefit of erofs over fscache technology.

Brief Introduction

fscache/cachefiles (hereinafter collectively referred to as fscache) is a relatively mature file caching solution in Linux systems, and is widely used in network filesystems (such as NFS, Ceph, etc.). Our attempt is to make it work with the on-demand loading for local filesystems such as EROFS.

In this case, when the container accesses the container image, fscache will check whether the requested data has been cached. On cache hit, the data will be read directly from the cache file. It is processed directly in kernel, and will not switch to userspace. erofs_over_fscache_cache_hit

Otherwise (cache miss), the userspace service Nydusd is notified to process the request, while the container process sleeps waiting for it. Nydusd fetches the data from the remote, writes it to the cache file through fscache, and wakes the sleeping process. Once woken, the process is able to read the data from the cache file. erofs_over_fscache_cache_miss
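
The hit and miss paths described above can be modelled with a short Python sketch. It is a toy in a single process, not kernel code: the dictionary stands in for the cache file, the callback stands in for Nydusd, and `ToyFscache` is a hypothetical name.

```python
class ToyFscache:
    """Toy model: hits are served directly, misses go to a userspace fetcher."""

    def __init__(self, fetch_from_remote):
        self.cache = {}                             # stands in for the cache file
        self.fetch_from_remote = fetch_from_remote  # stands in for Nydusd
        self.userspace_calls = 0

    def read(self, block: int) -> bytes:
        if block in self.cache:    # cache hit: handled without userspace
            return self.cache[block]
        self.userspace_calls += 1  # cache miss: the userspace daemon is notified
        self.cache[block] = self.fetch_from_remote(block)
        return self.cache[block]   # the woken reader reads from the cache


cache = ToyFscache(lambda block: f"block-{block}".encode())
cache.read(7)  # miss: fetched through the userspace callback
cache.read(7)  # hit: served from the cache
print(cache.userspace_calls)  # 1
```

Only the first read of a block involves the userspace side; every later read of the same block is served from the cache.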

Advantages of the Solution

As described above, when the image has been downloaded locally, userspace solutions still need to switch to userspace on access, and they also incur the overhead of copying memory between kernel and user mode. With erofs over fscache, there is no longer any switch to userspace, so that on-demand loading is truly "on-demand". In other words, it has native performance and stability when images have been locally ready. In brief, it implements a real one-stop and lossless solution in the following two scenarios of 1) on-demand loading and 2) downloading container images in advance.

Specifically, erofs over fscache has the following advantages over userspace solutions.

1. Asynchronous prefetch

After the container is created, Nydusd can start to download images even without the on-demand loading (cache miss) triggered. Nydusd will download data and write it to the cache file. Then when the specific file range is accessed, EROFS will directly read from the cache file, without switching to the userspace, whilst the other userspace solutions have to go the round trip. erofs_over_fscache_prefetch

2. Network IO optimization

When on-demand loading (cache miss) is triggered, Nydusd can download more data at one time than requested. For example, when a 4KB I/O is requested, Nydusd can actually download 1MB of data at a time to reduce the network transmission delay per unit file size. Then, when the container accesses the remaining data within this 1MB, it won't switch to userspace anymore. Userspace solutions cannot work like this, since they still need to switch to userspace on data access within the prefetched range. erofs_over_fscache_readahead
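
The arithmetic behind this example can be sketched in Python. The sketch assumes that each miss downloads the whole 1MB window aligned around the requested offset. The alignment policy and the helper names `fetch_window` and `count_fetches` are illustrative assumptions, not a description of Nydusd's actual strategy.

```python
REQUEST_SIZE = 4 * 1024   # 4KB I/O issued by the container
FETCH_SIZE = 1024 * 1024  # 1MB downloaded per miss, as in the example above


def fetch_window(offset: int, fetch_size: int = FETCH_SIZE):
    """Return the aligned [start, end) byte range to download for a miss."""
    start = (offset // fetch_size) * fetch_size
    return start, start + fetch_size


def count_fetches(offsets, fetch_size: int = FETCH_SIZE) -> int:
    """Count remote fetches when each miss pulls a whole aligned window."""
    cached = set()
    fetches = 0
    for offset in offsets:
        window = fetch_window(offset, fetch_size)
        if window not in cached:
            cached.add(window)
            fetches += 1
    return fetches


sequential = range(0, FETCH_SIZE, REQUEST_SIZE)  # 256 reads of 4KB each
print(count_fetches(sequential))                 # 1
```

In this model, 256 sequential 4KB reads cost a single remote fetch, and the remaining 255 reads are served from the cache.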

3. Better performance

When images have been downloaded locally (the impact of on-demand loading is not considered in this case), erofs over fscache performs significantly better than userspace solutions, while achieving performance similar to the native filesystem. The performance statistics under several workloads are given below [1].

read/randread IO

The following are the performance statistics of file read/randread buffered IO [2].

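| read | IOPS | BW | Performance |
| --- | --- | --- | --- |
| native ext4 | 267K | 1093MB/s | 1 |
| loop | 240K | 982MB/s | 0.90 |
| fscache | 227K | 931MB/s | 0.85 |
| fuse | 191K | 764MB/s | 0.70 |

| randread | IOPS | BW | Performance |
| --- | --- | --- | --- |
| native ext4 | 10.1K | 41.2MB/s | 1 |
| loop | 8.7K | 34.8MB/s | 0.84 |
| fscache | 9.5K | 38.2MB/s | 0.93 |
| fuse | 7.6K | 31.2MB/s | 0.76 |
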
  • "native" means that the test file is directly on the local ext4 filesystem
  • "loop" means that the test file is inside an erofs image, while the erofs image is mounted through the DIRECT IO mode of the loop device
  • "fscache" means that the test file is inside an erofs image, while the erofs image is mounted through the erofs over fscache scheme
  • "fuse" means that the test file is in the fuse filesystem [3]
  • The "Performance" column normalizes the performance statistics of each mode, based on the performance of the native ext4 filesystem

It can be seen that the read/randread performance in fscache mode is basically the same as that in loop mode, and is better than that in fuse mode; however, there is still a certain gap compared with the native ext4 filesystem. We are further analyzing and optimizing it. In theory, it can achieve performance basically on par with the native filesystem.

File metadata manipulation

Test the performance of file metadata operations by performing a tar operation [4] on a large number of small files.

| | Time | Performance |
| --- | --- | --- |
| native ext4 | 1.04s | 1 |
| loop | 0.550s | 1.89 |
| fscache | 0.570s | 1.82 |
| fuse | 3.2s | 0.33 |

It can be seen that the erofs format performs even better than the native ext4 filesystem, which is due to the optimized filesystem format of erofs. Since erofs is a read-only filesystem, all its metadata can be closely arranged, while ext4 is a writable filesystem, and its metadata is scattered among multiple BGs (block groups).
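
For a time-based test, a lower value is better, so the normalized score is presumably the native time divided by the time of each mode. The Python sketch below applies that assumption to the tar timings reported above; `normalize_by_time` is a hypothetical helper name.

```python
# Execution times of the tar test, in seconds, taken from the results above.
TAR_SECONDS = {"native ext4": 1.04, "loop": 0.550, "fscache": 0.570, "fuse": 3.2}


def normalize_by_time(times: dict, baseline: str = "native ext4") -> dict:
    """Lower time is better, so the score is baseline_time / mode_time."""
    base = times[baseline]
    return {mode: base / seconds for mode, seconds in times.items()}


for mode, score in normalize_by_time(TAR_SECONDS).items():
    print(f"{mode}: {score:.2f}")
```

The resulting ratios agree with the reported Performance values to within rounding, which suggests this is how the column was derived.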

Typical workload

Test the performance of linux source code compilation [5] as the typical workload.

| Linux Compiling | Time | Performance |
| --- | --- | --- |
| native ext4 | 156s | 1 |
| loop | 154s | 1.0 |
| fscache | 156s | 1.0 |
| fuse | 200s | 0.78 |

It can be seen that fscache mode is basically the same as that of loop mode and native ext4 filesystem, and is better than fuse mode.

4. High-density deployment

Since the erofs over fscache technology is implemented based on files, i.e. each container image is represented as a cache file under fscache, it naturally supports high-density deployment scenarios. For example, a typical node.js container image corresponds to ~20 cache files under this scheme, then in a machine with hundreds of containers deployed, only thousands of cache files need to be maintained.

5. Failover and Hot Upgrade

When all the image files have been downloaded locally, file access no longer requires the intervention of the user-mode service process, which gives that process a wider time window in which to perform failure recovery and hot upgrade. The user-mode process is not even required in this scenario, which improves the stability of the solution.

6. A one-stop solution for container image

With RAFS v6 image format and erofs over fscache on-demand loading technology, Nydus is suitable for both runc and Kata as a one-stop solution for container image distribution in these two scenarios.

More importantly, erofs over fscache is truly a one-stop and lossless solution in the following two scenarios of 1) on-demand loading and 2) downloading container images in advance. On the one hand, with the on-demand loading feature implemented, it can significantly speed up container startup, as it does not need to download the complete container image locally. On the other hand, it is compatible with the scenario where the container image has already been downloaded locally. It no longer switches to userspace in this case, achieving performance and stability almost on par with the native filesystem.

The Future

Going forward, we will keep improving the erofs over fscache technology, with work such as more fine-grained image deduplication among containers, stargz support, FSDAX support, and performance optimization.

Last but not least, I would like to thank all the individuals and teams who have supported and helped us during the development of the project, with special thanks to the ByteDance and Kuaishou folks for their solid support. Let us work together to build a better container image ecosystem :)

  1. Test environment: ecs.i2ne.4xlarge (16 vCPU, 128 GiB Mem, local NVMe disk)
  2. Test command "fio -ioengine=psync -bs=4k -direct=0 -rw=[read|randread] -numjobs=1"
  3. Use passthrough_hp as fuse daemon
  4. Test the execution time of "tar -cf /dev/null linux_src_dir" command
  5. Test the execution time of the "time make -j16" command

· 4 min read

Containerd Accepted Nydus-snapshotter

In early January, the containerd community accepted nydus-snapshotter as a sub-project. Check out the code, the detailed introduction, and the tutorial in its new repository. We believe that the donation to containerd will attract more users and developers to nydus itself and bring much value to the community's users.

Nydus-snapshotter is a remote snapshotter for containerd. It runs as a standalone process outside containerd, pulls only the nydus image's bootstrap from the remote registry, and forks another process called nydusd. Nydusd has a unified architecture, which means it can work as a FUSE user-space filesystem daemon, a virtio-fs daemon, or an fscache user-space daemon. Nydusd is responsible for fetching data blocks from remote storage, such as object storage or a standard image registry, in order to fulfill containers' requests to read their rootfs.

Nydus is an excellent container image acceleration solution that significantly reduces the time it takes to start a container. It was originally developed by a virtual team from Alibaba Cloud and Ant Group and is deployed at very large scale: millions of containers are created from nydus images each day in Alibaba Cloud and Ant Group. The underlying technique is a newly designed, container-optimized, read-only filesystem named Rafs. Several approaches are provided for creating a Rafs format container image. The image can be pushed to and stored in a standard registry, since it is compatible with the OCI image and distribution specifications. A nydus image can be converted from an OCI source image, whose metadata and file data are split into a "bootstrap" and one or more "blobs", together with the necessary manifest.json and config.json. Development of the integration with Buildkit is in progress.

rafs disk layout

Nydus provides the following key features:

  • Chunk level data de-duplication among layers in a single repository to reduce storage, transport and memory cost
  • Deleted (whiteout) files in a given layer aren't packed into the nydus image, so the image size may be reduced
  • E2E image data integrity check, so security issues like "Supply Chain Attack" can be detected and avoided at runtime
  • Integrated with CNCF incubating project Dragonfly to distribute container images in P2P fashion and mitigate the pressure on container registries
  • Different container image storage backends are supported, for example Registry, NAS, and Aliyun/OSS. Adding other remote storage backends like AWS S3 is also possible
  • File access patterns are recorded at runtime as an access trace/log, which makes abnormal user behavior easy to catch and helps ensure the image can be trusted

Beyond the essential features above, nydus can be flexibly configured as a FUSE-based user-space filesystem or as in-kernel EROFS with an on-demand loader user-space daemon, and integrating nydus with VM-based container runtimes is much easier.

  • Lightweight integration with VM-based container runtimes like KataContainers. In fact, KataContainers is considering supporting nydus as a native image acceleration solution.
  • Nydus closely cooperates with the Linux in-kernel disk filesystem EROFS. A container's rootfs can be set up directly by EROFS with lazy pulling capability. The corresponding changes have been merged into the Linux kernel since v5.16

To run with runc, nydusd works as FUSE user-space daemon:

runc nydus

To work with KataContainers, it works as a virtio-fs daemon:

kata nydus

The nydus community is working together with the Linux kernel community to develop erofs+fscache based user-space on-demand read.

runc erofs nydus

Nydus and eStargz developers are working together on a new project named acceld in the Harbor community. It provides a general service that converts OCI v1 images into the various acceleration image formats of different accelerator providers, so that the upgrade from OCI v1 images stays smooth. In addition to the conversion service acceld and the conversion tool nydusify, nydus is also adding buildkit support so that a nydus image can be exported directly from a Dockerfile as a compression type.

In the future, the nydus community will work closely with the containerd community on fast and efficient methods and solutions for distributing container images, container image security, container image content storage efficiency, etc.

· 7 min read

Introducing Dragonfly Container Image Service

Small is Fast, Large is Slow

With containers, it is relatively fast to deploy web apps, mobile backends, and API services right out of the box. Why? Because the container images they use are generally small (hundreds of MBs).

A larger challenge is deploying applications with a huge container image (several GBs). It takes a good amount of time to have these images ready to use. We want to shorten this time so that the powerful container abstractions can be leveraged to run and scale applications fast.

Dragonfly has been doing well at distributing container images. However, users still have to download an entire container image before creating a new container. Another big challenge is the rising security concern about container images.

Conceptually, we pack an application's environment into a single image that is more easily shared with consumers. The image is then put into a local filesystem, on top of which an application can run. The pieces that are now being launched as nydus are the culmination of years of work and experience of our team in building filesystems. Here we introduce the dragonfly image service (codename nydus) as an extension to the Dragonfly project. It's software that minimizes download time and provides image integrity checks across the whole lifetime of a container, enabling users to manage applications fast and safely.

Nydus is co-developed by engineers from Alibaba Cloud and Ant Group. It is widely used in internal production deployments. From our experience, we value its container creation speedup and image isolation enhancement the most, and we see interesting new use cases for it from time to time.

Nydus: Dragonfly Image Service

The nydus project designs and implements a user-space filesystem on top of a container image format that improves on the current OCI image specification. Its key features include:

  • Container images are downloaded on demand
  • Chunk level data deduplication
  • Flatten image metadata and data to remove all intermediate layers
  • Only usable image data is saved when building a container image
  • Only usable image data is downloaded when running a container
  • End-to-end image data integrity
  • Compatible with the OCI artifacts spec and distribution spec
  • Integrated with existing CNCF project dragonfly to support image distribution in large clusters
  • Different container image storage backends are supported

Nydus mainly consists of a new container image format and a FUSE (Filesystem in USErspace) daemon that translates it into a container-accessible mountpoint.

nydus-architecture

The FUSE daemon speaks either the FUSE or the virtiofs protocol to serve pods created by conventional runc containers or Kata Containers. It supports pulling container image data from a container image registry, OSS, NAS, as well as Dragonfly supernodes and node peers. It can also optionally use a local directory to cache all container image data to speed up future container creation.

Internally, nydus splits a container image into two parts: a metadata layer and a data layer. The metadata layer is a self-verifiable merkle tree. Each file and directory is a node in the merkle tree with a hash alongside. A file's hash is the hash of its file content, and a directory's hash is the hash of all of its descendants. Each file is divided into even-sized chunks and saved in a data layer. File chunks can be shared among different container images by letting their file nodes point to the same chunk location in the shared data layer.
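The split described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual Rafs on-disk format: the names `DataLayer` and `build_node`, the use of SHA-256, the 4-byte chunk size, and hashing a directory from its children's names and hashes (which transitively covers all descendants) are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny chunk size for illustration only


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DataLayer:
    """Hypothetical content-addressed chunk store shared by all images."""

    def __init__(self):
        self.chunks = {}  # chunk digest -> chunk bytes

    def put_file(self, content: bytes) -> list:
        """Split a file into even-sized chunks, storing each unique chunk once."""
        ids = []
        for off in range(0, len(content), CHUNK_SIZE):
            chunk = content[off:off + CHUNK_SIZE]
            cid = digest(chunk)
            self.chunks.setdefault(cid, chunk)  # identical chunks are shared
            ids.append(cid)
        return ids


def build_node(tree, layer: DataLayer) -> dict:
    """Build a metadata node from bytes (a file) or a dict of name -> tree (a directory)."""
    if isinstance(tree, bytes):
        return {"hash": digest(tree), "chunks": layer.put_file(tree)}
    children = {name: build_node(sub, layer) for name, sub in sorted(tree.items())}
    h = hashlib.sha256()
    for name, child in children.items():
        h.update(name.encode())
        h.update(child["hash"].encode())
    return {"hash": h.hexdigest(), "children": children}


# Two images that differ only in one config file share one data layer.
layer = DataLayer()
image_a = build_node({"bin": {"app": b"AAAABBBB"}, "etc": {"conf": b"CCCC"}}, layer)
image_b = build_node({"bin": {"app": b"AAAABBBB"}, "etc": {"conf": b"DDDD"}}, layer)
```

In this sketch the two images reference the same chunks for `bin/app`, so the shared data layer stores them only once, while the differing `etc/conf` changes the root hash of each metadata tree.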

nydus-format

How can you benefit from nydus?

The immediate benefit of running nydus image service is that users can launch containers almost instantly. In our tests, we found out that nydus can boost container creation from minutes to seconds.

nydus-performance

Another less obvious but important benefit is the runtime data integrity check. With OCIv1 container images, the image data cannot be verified after being unpacked to a local directory. If some files in the local directory are tampered with, intentionally or not, containers will simply take them as is, incurring a data leak risk. In contrast, a nydus image is not unpacked to a local directory at all. What's more, since verification can be enforced on every data access to a nydus image, the data leak risk can be avoided by forcing the data to be fetched from the trusted image registry again.
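A minimal sketch of verify-on-access with refetch follows. It assumes chunks are addressed by their SHA-256 digest and that a trusted registry can be modeled as a callable from digest to bytes; `VerifiedChunkReader` and its fields are hypothetical names and do not reflect nydusd's real interfaces.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class VerifiedChunkReader:
    """Hypothetical reader that checks every chunk against its expected digest."""

    def __init__(self, fetch_from_registry):
        self.fetch = fetch_from_registry  # assumption: callable, digest -> bytes
        self.cache = {}                   # local cache, which may get corrupted
        self.refetches = 0

    def read(self, expected: str) -> bytes:
        data = self.cache.get(expected)
        if data is not None and digest(data) == expected:
            return data                   # cached copy verified on this access
        if data is not None:
            self.refetches += 1           # cached copy failed verification
        data = self.fetch(expected)       # fall back to the trusted registry
        if digest(data) != expected:
            raise ValueError("registry data failed verification")
        self.cache[expected] = data
        return data


registry = {digest(b"hello"): b"hello"}
reader = VerifiedChunkReader(registry.__getitem__)
cid = digest(b"hello")
first = reader.read(cid)
reader.cache[cid] = b"hellO"              # simulate local tampering
second = reader.read(cid)                 # detected, then refetched
```

Because the check runs on every read rather than once at unpack time, a locally modified chunk is never handed to the container in this model.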

nydus-integraty

The Future of Nydus

The above examples showcase the power of nydus. For the last year, we've worked alongside the production team, laser-focused on making nydus stable, secure, and easy to use.

Now, as the foundation for nydus has been laid, our new focus is the ecosystem it aims to serve broadly. We envision a future where users install dragonfly and nydus on their clusters, run containers with large images as fast as they do with regular-sized images today, and feel confident about the safety of the data in their container images.

For the community

While we have widely deployed nydus in our production, we believe a proper upgrade to OCI image spec shouldn’t be built without the community. To this end, we propose nydus as a reference implementation that aligns well with the OCI image spec v2 proposal [1], and we look forward to working with other industry leaders should this project come to fruition.

FAQ

Q: What are the challenges with oci image spec v1?

Q: How is this different than crfs?

  • The basic idea of the two is quite similar. Deep down, the nydus image format supports chunk level data deduplication and end-to-end data integrity at runtime, which is an improvement over the stargz format used by crfs.

Q: How is this different than Teleport of Azure?

  • Azure Teleport is like the current OCI image format plus an SMB-enabled snapshotter. It supports container image lazy-fetching but suffers from all the Tar format defects. Nydus, on the other hand, deprecates the legacy Tar format and takes advantage of the merkle tree format to provide more advantages over the Tar format.

Q: What if network is down while container is running with nydus?

  • With OCIv1, a container fails to start at all if the network goes down before its image is fully downloaded. Nydus changes this considerably: because it uses a lazy fetch/load mechanism, a network failure may take down a running container. Nydus addresses the problem with a prefetch mechanism, which can be configured to run in the background right after a container starts.
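The trade-off in this answer can be modeled with a small Python sketch. It is a toy under stated assumptions, not nydus's implementation: `LazyImage`, the `network_up` flag, and the dict standing in for the remote registry are all hypothetical.

```python
import threading


class LazyImage:
    """Hypothetical lazily loaded image with an optional background prefetcher."""

    def __init__(self, remote: dict, prefetch: bool = True):
        self.remote = remote      # assumption: chunk id -> bytes, stands in for the registry
        self.local = {}           # chunks already available locally
        self.lock = threading.Lock()
        self.network_up = True
        self.prefetcher = None
        if prefetch:
            # Start pulling all chunks in the background right after "container start".
            self.prefetcher = threading.Thread(target=self._prefetch_all, daemon=True)
            self.prefetcher.start()

    def _fetch(self, cid: str) -> bytes:
        if not self.network_up:
            raise ConnectionError("network is down")
        return self.remote[cid]

    def _prefetch_all(self):
        for cid in list(self.remote):
            with self.lock:
                if cid not in self.local:
                    self.local[cid] = self._fetch(cid)

    def read(self, cid: str) -> bytes:
        """On-demand read: served locally if present, otherwise fetched remotely."""
        with self.lock:
            if cid not in self.local:
                self.local[cid] = self._fetch(cid)
            return self.local[cid]


chunks = {"c1": b"x", "c2": b"y"}

prefetched = LazyImage(chunks, prefetch=True)
prefetched.prefetcher.join()      # wait until background prefetch has finished
prefetched.network_up = False
survived = prefetched.read("c2")  # still works: the chunk is already local

lazy_only = LazyImage(chunks, prefetch=False)
lazy_only.network_up = False
try:
    lazy_only.read("c2")
    failed = False
except ConnectionError:
    failed = True                 # pure lazy loading cannot serve the read offline
```

Once the prefetcher has finished, reads no longer depend on the network, which is the situation the prefetch mechanism is meant to reach as early as possible.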

[1]:OCI Image Specification V2 Requirements

In the meantime, the OCI (Open Container Initiative) community has been actively discussing the emergence of OCI image spec v2, which aims to address new challenges with OCI image spec v1.

Starting from June 2020, the OCI community spent more than a month discussing the requirements for OCI image specification v2. It is important to note that OCIv2 is just a marketing term for updating the OCI specification to better address some use cases. It is not a brand new specification.

The discussion went from an email thread (Proposal Draft for OCI Image Spec V2) and a shared document to several OCI community online meetings, and the result is quite inspiring. The concluded OCIv2 requirements are:

  • Reduced Duplication
  • Canonical Representation (Reproducible Image Building)
  • Explicit (and Minimal) Filesystem Objects and Metadata
  • Mountable Filesystem Format
  • Bill of Materials
  • Lazy Fetch Support
  • Extensibility
  • Verifiability and/or Repairability
  • Reduced Uploading
  • Untrusted Storage

For the detailed meaning of each requirement, please refer to the original shared document. We actively joined the community discussions and found that the nydus project fits these requirements nicely. This further encouraged us to open source the nydus project, to help the community discussion with a working code base.