20 posts tagged with "containerd"

Using dragonfly to distribute images and files for multi-cluster kubernetes

· 14 min read

Posted on September 1, 2023

CNCF projects highlighted in this post, and migrated by mingcheng.

Dragonfly provides efficient, stable, and secure file distribution and image acceleration based on P2P technology, with the goal of being the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.

This article introduces how to deploy Dragonfly for multi-cluster Kubernetes. A Dragonfly cluster manages a cluster within a single network. If you have two clusters on disconnected networks, you can use two Dragonfly clusters, one to manage each.

The recommended deployment for multi-cluster Kubernetes is to use one Dragonfly cluster per Kubernetes cluster, and a centralized manager service to manage the Dragonfly clusters. A peer can only transmit data within its own Dragonfly cluster, so when each Kubernetes cluster deploys its own Dragonfly cluster, each Kubernetes cluster forms one P2P network, and its peers are scheduled and transmit data only within that Kubernetes cluster.

Screenshot showing diagram flow between Network A / Kubernetes Cluster A and Network B / Kubernetes Cluster B towards Manager
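
The isolation rule described above can be illustrated with a small model. This is only a sketch, not Dragonfly's implementation: the `Manager` and `DragonflyCluster` classes, the cluster names, and the node names are all hypothetical, and real scheduling takes many more factors into account. It shows the one property the text states: a peer is only ever offered parents from its own Dragonfly cluster, while a single manager keeps track of all clusters.

```python
from dataclasses import dataclass, field


@dataclass
class DragonflyCluster:
    """One P2P network, paired with one Kubernetes cluster (hypothetical model)."""
    name: str
    peers: set = field(default_factory=set)


@dataclass
class Manager:
    """Centralized manager that tracks every Dragonfly cluster (hypothetical model)."""
    clusters: dict = field(default_factory=dict)

    def register(self, cluster_name, peer):
        # Create the cluster on first use, then record the peer in it.
        cluster = self.clusters.setdefault(cluster_name, DragonflyCluster(cluster_name))
        cluster.peers.add(peer)

    def candidate_parents(self, peer):
        """Return the peers this peer may download from: same cluster only."""
        for cluster in self.clusters.values():
            if peer in cluster.peers:
                return sorted(cluster.peers - {peer})
        raise KeyError(f"unknown peer: {peer}")


manager = Manager()
manager.register("cluster-a", "a-node-1")
manager.register("cluster-a", "a-node-2")
manager.register("cluster-b", "b-node-1")

# a-node-1 can only be scheduled to peers in cluster-a, never to b-node-1.
print(manager.candidate_parents("a-node-1"))
```

With this layout, a peer in Network B never appears as a candidate for a peer in Network A, which matches the case where the two networks are disconnected.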

Ant Group security technology’s Nydus and Dragonfly image acceleration practices

· 23 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Introduction

ZOLOZ is a global security and risk management platform under Ant Group. Through biometrics, big data analysis, and artificial intelligence technologies, ZOLOZ provides safe and convenient security and risk management solutions for users and institutions. ZOLOZ has provided security and risk management technology support for more than 70 partners in 14 countries and regions, including China, Indonesia, Malaysia, and the Philippines. It covers areas such as finance, insurance, securities, credit, telecommunications, and public services, and has served over 1.2 billion users.

With the rapid growth of Kubernetes and cloud native technology, ZOLOZ applications have begun to be deployed at large scale on public clouds using containers. The images of ZOLOZ applications have been maintained and updated for a long time, and both the number of layers and the overall size have grown large (hundreds of MBs to several GBs). In particular, the base image of ZOLOZ’s AI algorithm inference application is much larger than general application images (pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime on Docker Hub is 4.92GB, compared to centos:latest at only about 234MB).

In a container cold start, i.e., when the image is not available locally, the image must be downloaded from the registry before the container can be created. In production, a container cold start often takes several minutes, and as the scale increases, images may not download quickly from the registry because of network congestion within the cluster. Such large images have brought many challenges to application updates and scaling. With the continued adoption of containers on public clouds, ZOLOZ applications mainly face three challenges:

  1. The algorithm image is large, and pushing it to the cloud image repository takes a long time. During development, developers want to iterate and verify quickly in the testing environment. However, every time a branch is modified and released for verification, it takes several tens of minutes, which is very inefficient.
  2. Pulling the algorithm image takes a long time, and pulling many images during cluster expansion can easily saturate the cluster's network cards and affect normal business operation.
  3. Cluster machines take a long time to start up, making it difficult to meet the needs of sudden traffic increases and elastic automatic scaling.

Although various compromise solutions have been attempted, these solutions all have their shortcomings. Now, in collaboration with multiple technical teams such as Ant Group, Alibaba Cloud, and ByteDance, a more universal solution on public clouds has been developed, which has low transformation costs and good performance, and currently appears to be an ideal solution.

Volcano Engine, distributed image acceleration practice based on Dragonfly

· 11 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Terms and definitions

| Term | Definition |
| --- | --- |
| OCI | The Open Container Initiative is a Linux Foundation project launched by Docker in June 2015 to design open standards for operating system-level virtualization (and most importantly Linux containers). |
| OCI Artifact | Products that follow the OCI image spec. |
| Image | The image in this article refers to OCI Artifact |
| Image Distribution | A product distribution implemented according to the OCI distribution spec. |
| ECS | It is a collection of resources composed of CPU, memory, and Cloud Drive, each of which logically corresponds to the computing hardware entity of the Data center infrastructure. |
| CR | Volcano Engine image distribution service. |
| VKE | Volcano Engine deeply integrates the new generation of Cloud Native technology to provide high-performance Kubernetes container cluster management services with containers as the core, helping users to quickly build containerized applications. |
| VCI | Volcano is a serverless and containerized computing service. The current VCI seamlessly integrates with the Container Service VKE to provide Kubernetes orchestration capabilities. With VCI, you can focus on building the app itself, without having to buy and manage infrastructure such as the underlying Cloud as a Service, and pay only for the resources that the container actually consumes to run. VCI also supports second startup, high concurrent creation, sandbox Container Security isolation, and more. |
| TOS | Volcano Engine provides massive, secure, low-cost, easy-to-use, highly reliable and highly available distributed cloud storage services. |
| Private Zone | Private DNS service based on a proprietary network VPC (Virtual Private Cloud) environment. This service allows private domain names to be mapped to IP addresses in one or more custom VPCs. |
| P2P | Peer-to-peer technology, when a peer in a P2P network downloads data from the server, it can also be used as a server level for other peers to download after downloading the data. When a large number of nodes download at the same time, it can ensure that the subsequent downloaded data does not need to be downloaded from the server side. Thereby reducing the pressure on the server side. |
| Dragonfly | Dragonfly is a file distribution and image acceleration system based on P2P technology, and is the standard solution and best practice in the field of image acceleration in Cloud Native architecture. Now hosted as an incubation project by the Cloud Native Computing Foundation (CNCF). |
| Nydus | Nydus Acceleration Framework implements a content-addressable filesystem that can accelerate container image startup by lazy loading. It has supported the creation of millions of accelerated image containers daily, and deeply integrated with the linux kernel's erofs and fscache, enabling in-kernel support for image acceleration. |

Background

Volcano Engine image repository CR uses TOS to store container images. Currently, it can meet the demand of large-scale concurrent image pulling to a certain extent. However, the final concurrency of pulling is limited by the bandwidth and QPS of TOS.

Here is a brief introduction of the two scenarios that are currently encountered for large-scale image pulling:

  1. The number of clients is increasing, and the images are getting larger. The bandwidth of TOS will eventually be insufficient.
  2. If the client uses Nydus to convert the image format, the request volume to TOS will increase by an order of magnitude. The QPS limit of TOS API makes it unable to meet the demand.

Whether it is the image repository service itself or the underlying storage, there will be bandwidth and QPS limitations in the end. If you rely solely on the bandwidth and QPS provided by the server, it is easy to be unable to meet the demand. Therefore, P2P needs to be introduced to reduce server pressure and meet the demand for large-scale concurrent image pulling.
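
A rough, idealized calculation shows why P2P relieves the server side. The function below is a sketch under stated assumptions: the image size, client count, and number of back-to-source peers are made-up example values, and it assumes every peer that does not go back to source can obtain all of its data from other peers.

```python
def origin_traffic_gb(image_size_gb, clients, back_to_source_peers=None):
    """Data served by the origin (registry or object storage), in GB.

    back_to_source_peers=None models plain pulls: every client hits the origin.
    Otherwise only that many peers download from the origin, and the rest
    fetch from other peers (an idealized assumption).
    """
    if back_to_source_peers is None:
        return image_size_gb * clients
    return image_size_gb * min(back_to_source_peers, clients)


# Hypothetical example: 1000 clients pulling a 2 GB image at the same time.
without_p2p = origin_traffic_gb(2, 1000)
with_p2p = origin_traffic_gb(2, 1000, back_to_source_peers=1)
print(without_p2p, with_p2p)
```

In this model the origin traffic grows linearly with the number of clients when there is no P2P, but stays flat once peers can serve each other. The same reasoning applies to request counts against a QPS limit.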

Dragonfly v2.0.9 is released

· 5 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Project post originally published on GitHub by Dragonfly maintainers

Dragonfly provides efficient, stable, and secure file distribution and image acceleration based on P2P technology, with the goal of being the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.

Dragonfly v2.0.9 is released! 🎉🎉🎉 Thanks to the Google Cloud Platform (GCP) Team, Volcano Engine Team, and Baidu AI Cloud Team for helping Dragonfly integrate with their public clouds. You are welcome to visit the d7y.io website.

Features

  • Download tasks based on priority. Priority can be passed as a parameter of the download task, or associated with an application in the Manager console; refer to the priority protoc definition.
  • Scheduler adds PieceDownloadTimeout parameter, which indicates that if the piece download times out, the scheduler will change the task state to TaskStateFailed.
  • Add health service to each gRPC service.
  • Add reflection to each gRPC service.
  • Manager supports Redis Sentinel mode.
  • Refactor dynconfig package to remove json.Unmarshal, improving its runtime efficiency.
  • Fix panic caused by hashring not being built.
  • Previously, most of the pieces were downloaded from the same parent. Now, different pieces are downloaded from different parents to improve download efficiency and distribute bandwidth among multiple parents.
  • If Manager’s searcher cannot find candidate scheduler clusters, it returns all the clusters so that peers can run health checks against them. If a health check succeeds, that scheduler cluster can be used.
  • Support ORAS source client to pull image.
  • Add UDP ping package and gRPC protoc definition for building virtual network topology.
  • The V2 P2P protocol has been added, and both Scheduler and Manager have implemented the API of the V2 P2P protocol, in preparation for the future Rust version of Dfdaemon.
  • The OSS source client supports STS access; users can set the security token in the header.
  • Dynconfig supports resolving addresses with the health service.
  • Add hostTTL and hostGCInterval in Scheduler to prevent information of abnormally exited Dfdaemon from becoming dirty data in the Scheduler.
  • Add CIDR to searcher to provide more precise scheduler cluster selection for Dfdaemon.
  • Refactor the metric definitions for the V1 P2P protocol and add the metric definitions for the V2 P2P protocol. Additionally, reorganize the Dragonfly Grafana Dashboards, refer to monitoring.
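
To illustrate the change from downloading most pieces from one parent to spreading pieces across parents, here is a minimal sketch. It uses plain round-robin assignment, which is an assumption made for illustration; the release notes do not describe the scheduler's actual selection logic, and the function and parent names are hypothetical.

```python
from collections import defaultdict


def assign_pieces(piece_count, parents):
    """Spread piece indexes across parents in round-robin order.

    Round-robin is a stand-in here; a real scheduler would also weigh
    factors such as parent load and which pieces each parent holds.
    """
    if not parents:
        raise ValueError("at least one parent is required")
    plan = defaultdict(list)
    for piece in range(piece_count):
        parent = parents[piece % len(parents)]
        plan[parent].append(piece)
    return dict(plan)


plan = assign_pieces(6, ["parent-1", "parent-2", "parent-3"])
print(plan)
```

Each parent ends up serving roughly the same number of pieces, so the upload bandwidth is shared instead of being drawn from a single parent.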

Breaking Change

  • Using the default value for the key used to generate JWT tokens in Manager can lead to security issues. Therefore, Manager has added JWT Key in the configuration, and upgrading Manager requires generating a new JWT Key and setting it in the Manager configuration.
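
One way to produce a fresh random key before upgrading is sketched below. The key length and hex encoding are assumptions, as the release notes do not state the required format or the name of the configuration field; check the Manager configuration reference before using the value.

```python
import secrets


def generate_jwt_key(num_bytes: int = 32) -> str:
    """Return a random key as a hex string (assumed format, 32 bytes by default)."""
    return secrets.token_hex(num_bytes)


key = generate_jwt_key()
print(key)
```

The `secrets` module draws from the operating system's cryptographically secure random source, which makes it more suitable for key material than the `random` module.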

Public Cloud Providers

Others

See the CHANGELOG for more details.

Dragonfly integrates nydus for image acceleration practice

· 11 min read
Gaius
Dragonfly Maintainer

Definitions

Dragonfly has been selected and put into production use by many Internet companies since it was open-sourced in 2017. It entered CNCF in October 2018, becoming the third project from China to enter the CNCF Sandbox. In April 2020, the CNCF TOC voted to accept Dragonfly as a CNCF Incubating project. Dragonfly has developed its next version through production practice, keeping the advantages of Dragonfly 1.x and making many optimizations for known problems.

Nydus optimizes the OCIv1 image format and introduces a brand new image-based filesystem, so that a container can download image data on demand and no longer needs the complete image before it starts. In the latest version, Dragonfly has completed its integration with Nydus, allowing containers to start while downloading image data on demand, which reduces the amount downloaded. Dragonfly's P2P transmission can also be used during the transfer to reduce back-to-source traffic and increase speed.
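
The on-demand idea can be sketched in a few lines. This is not Nydus code: the `LazyImage` class, the chunk size, and the in-memory `REMOTE` blob are all hypothetical stand-ins. In a real deployment the fetch function would go to a registry or through Dragonfly's P2P network rather than slicing a local byte string.

```python
class LazyImage:
    """Fetch fixed-size chunks only when a read actually touches them."""

    def __init__(self, fetch_chunk, chunk_size):
        self._fetch = fetch_chunk      # callable: chunk index -> bytes
        self._chunk_size = chunk_size
        self._cache = {}               # chunk index -> bytes already fetched
        self.fetched = []              # order in which chunks were downloaded

    def _chunk(self, index):
        if index not in self._cache:
            self._cache[index] = self._fetch(index)
            self.fetched.append(index)
        return self._cache[index]

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self._chunk_size
        last = (offset + length - 1) // self._chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        start = offset - first * self._chunk_size
        return data[start:start + length]


# Hypothetical "remote" image: 64 bytes split into 16 chunks of 4 bytes.
REMOTE = bytes(range(64))
image = LazyImage(lambda i: REMOTE[i * 4:(i + 1) * 4], chunk_size=4)

data = image.read(10, 4)   # touches only chunks 2 and 3
print(data, image.fetched)
```

Only the chunks covering the requested byte range are downloaded; the other 14 chunks are never fetched unless something reads them.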

The Evolution of the Nydus Image Acceleration

· 14 min read
Jingbo Xu

Optimized container images together with technologies such as P2P networks can effectively speed up the process of container deployment and startup. In order to achieve this, we developed the Nydus image acceleration service (also a sub-project of CNCF Dragonfly).

In addition to startup speed, core features such as image layering and lazy pulling are also particularly important in the field of container images. But since no native filesystem supports them, most projects opt for a userspace solution, and Nydus initially did the same. However, user-mode solutions are encountering more and more challenges nowadays, such as a huge performance gap compared with native filesystems and noticeable resource overhead in high-density deployment scenarios.

Therefore, we designed and implemented the RAFS v6 format, which is compatible with the in-kernel EROFS filesystem, hoping to form a content-addressable in-kernel filesystem for container images. After the lazy-pulling technology of "EROFS over Fscache" was merged into the Linux 5.19 kernel, the next-generation architecture of Nydus is gradually becoming clear. This is the first native in-kernel solution for container images, promoting a high-density, high-performance and high-availability solution for container images.
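
As a small illustration of what "content-addressable" means, the sketch below stores chunks under the SHA-256 digest of their contents. It is a conceptual example only and says nothing about the actual RAFS v6 on-disk layout; the `ChunkStore` class is hypothetical.

```python
import hashlib


class ChunkStore:
    """Store chunks keyed by the SHA-256 digest of their contents."""

    def __init__(self):
        self._chunks = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._chunks.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._chunks[digest]
        # The address doubles as an integrity check on the stored bytes.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("chunk content does not match its digest")
        return data

    def __len__(self):
        return len(self._chunks)


store = ChunkStore()
first = store.put(b"shared base layer data")
second = store.put(b"shared base layer data")   # same content, e.g. from another image
third = store.put(b"application data")
print(len(store), first == second)
```

Because the key is derived from the content, identical chunks shared by different images are deduplicated automatically, and any chunk can be verified against its address when it is read.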

This article will introduce the evolution of Nydus from three perspectives: Nydus architecture outline, RAFS v6 image format and "EROFS over Fscache" on-demand loading technology.

Please refer to Nydus for more details of this project. Now you can experience all these new features with this user guide.

Containerd Accepted Nydus-snapshotter

· 4 min read
Changwei Ge

In early January, the containerd community accepted nydus-snapshotter as a sub-project. Check out the code, detailed introduction, and tutorial in its new repository. We believe that the donation to containerd will attract more users and developers to nydus itself and bring much value to community users.

Introducing Nydus – Dragonfly Container Image Service

· 8 min read

Guest post by Pengtao and Liubo, Software Engineers at Ant Group

Tao is a software engineer at Ant Group. He has been working on Linux file system development for more than 10 years and is a core maintainer of the Kata Containers project. In recent years, Tao has mainly worked on container runtimes and services. He is a strong believer in and advocate for open source and cloud native technology.

Bo Liu has been an active contributor to the Linux kernel since 2009, mostly working on the Btrfs filesystem. He now works at Alibaba Group, and his main interests are Linux filesystems and container technologies.

Small is Fast, Large is Slow

With containers, it is relatively fast to deploy web apps, mobile backends, and API services right out of the box. Why? Because the container images they use are generally small (hundreds of MB).

A larger challenge is deploying applications with a huge container image (several GB). It takes a good amount of time to have these images ready to use. We want to shorten that time so that users can leverage the powerful container abstractions to run and scale applications quickly.

Dragonfly has been doing well at distributing container images. However, users still have to download an entire container image before creating a new container.

Another big challenge is the rising security concern about container images.

Conceptually, we pack an application’s environment into a single image that is more easily shared with consumers. The image is then placed in a local filesystem, on top of which an application can run. The pieces that are now being launched as nydus are the culmination of our team's years of work and experience in building filesystems.

Here we introduce the Dragonfly image service called nydus as an extension to the Dragonfly project. It is software that minimizes download time and provides image integrity checks across the whole lifetime of a container, enabling users to manage applications quickly and safely.

nydus is co-developed by engineers from Alibaba Cloud and Ant Group. It is widely used in internal production deployments. From our experience, we value its container creation speedup and image isolation enhancement the most. We also see interesting use cases of it from time to time.

TOC votes to move Dragonfly into CNCF incubator

· 4 min read

This post was migrated by mingcheng from a CNCF Blog post.

Today, the CNCF Technical Oversight Committee (TOC) voted to accept Dragonfly as an incubation-level hosted project.

Dragonfly, which was accepted into the CNCF Sandbox in October 2018, is an open source, cloud native image and file distribution system. Dragonfly was created in June 2015 by Alibaba Cloud to improve the user experience of image and file distribution in Kubernetes. This allows engineers in enterprises to focus on the application itself rather than infrastructure management.

“Dragonfly is one of the backbone technologies for container platforms within Alibaba’s ecosystem, supporting billions of application deliveries each year, and in use by many enterprise customers around the world,” said Li Yi, senior staff engineer, Alibaba. “Alibaba looks forward to continually improving Dragonfly, making it more efficient and easier to use.”

The goal of Dragonfly is to tackle distribution problems in cloud native scenarios. The project comprises three main components: supernode plays the role of central scheduler and controls all distribution procedures across the peer network; dfget resides on each peer as an agent to download file pieces; and dfdaemon plays the role of a proxy that intercepts image download requests from the container engine and forwards them to dfget.

“Dragonfly improves the user experience by taking advantage of a P2P image and file distribution protocol and easing the network load of the image registry,” said Sheng Liang, TOC member and project sponsor. “As organizations across the world migrate their workloads onto container stacks, we expect the adoption of Dragonfly to continue to increase significantly.”

Dragonfly integrates with other CNCF projects, including Prometheus, containerd, Harbor, Kubernetes, and Helm. Project maintainers come from Alibaba, ByteDance, eBay, and Meitu, and there are more than 20 contributing companies, including NetEase, JD.com, Walmart, VMware, Shopee, ChinaMobile, Qunar, ZTE, Qiniu, NVIDIA, and others.

Main Dragonfly Features:

  • P2P-based file distribution: Dragonfly uses P2P technology for file transmission, which makes full use of each peer's bandwidth to improve download efficiency and saves a lot of cross-IDC bandwidth, especially costly cross-border bandwidth.
  • Non-invasive support for all kinds of container technologies: Dragonfly can seamlessly support various containers for distributing images.
  • Host-level speed limit: Many download tools (wget/curl) only rate-limit the current download task, but Dragonfly also provides a rate limit for the entire host.
  • Passive CDN: The CDN mechanism can avoid repetitive remote downloads.
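
The difference between a per-task limit and a host-level limit can be shown with a token bucket shared by all download tasks on a host. This is a generic sketch, not Dragonfly's rate limiter: the `TokenBucket` class, the rate, and the byte counts are illustrative assumptions, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """A token bucket; one instance shared by all tasks models a host-wide limit."""

    def __init__(self, rate_bytes_per_s, capacity_bytes, now=0.0):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.updated = now

    def try_consume(self, nbytes, now):
        # Refill according to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Two download tasks on the same host draw from the same bucket.
host_limit = TokenBucket(rate_bytes_per_s=100, capacity_bytes=100)
task_a_ok = host_limit.try_consume(60, now=0.0)      # allowed, 40 tokens left
task_b_ok = host_limit.try_consume(60, now=0.0)      # denied, host budget is used up
task_b_retry = host_limit.try_consume(60, now=0.5)   # allowed after the bucket refills
print(task_a_ok, task_b_ok, task_b_retry)
```

With a per-task limiter, both tasks would have been allowed at the same instant and the host would have sent twice the intended rate. Sharing one bucket caps the total.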

Notable Milestones:

  • 7 project maintainers from 4 organizations