Blog | Dragonfly

Dragonfly v2.3.0 has been released

July 1, 2025 · 7 min read

Dragonfly v2.3.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to the d7y.io website.

dragonfly

Features

Persistent Cache Task

Persistent Cache Task is designed to provide persistent caching for tasks. The dfcache tool can import files into and export files from the P2P network. The solution is specifically engineered for high-speed read and write operations. This makes it particularly advantageous for scenarios involving large files, such as machine learning model checkpoints, where rapid, reliable access and distribution across the network are critical for training and inference workflows. By leveraging P2P distribution and persistent caching, dfcache significantly reduces I/O bottlenecks and accelerates the lifecycle of large data assets.

For documentation on how to use the dfcache command-line tool, please refer to the following link: dfcache.

$ dfcache import /tmp/file.txt
⣷ Done: 2229733261

$ dfcache export 2229733261 -O /tmp/file.txt

Resource Search (Tasks and Persistent Cache Tasks)

The Resource Search feature enables seamless querying of tasks, including files, images, and persistent cache tasks. It optimizes resource access, improving task management and retrieval efficiency.

Vortex: A P2P File Transfer Protocol Based on TLV

Vortex protocol is a high-performance peer-to-peer (P2P) file transfer protocol implementation in Rust, designed as part of the Dragonfly project. It utilizes the TLV (Tag-Length-Value) format for efficient and flexible data transmission, making it ideal for large-scale file distribution scenarios.

Packet Format:

Packet Identifier (8 bits): Uniquely identifies each packet
Tag (8 bits): Specifies data type in value field
Length (32 bits): Indicates Value field length, up to 4 GiB
Value (variable): Actual data content, maximum 1 GiB

Protocol Format:

-------------------------------------------------------------------------------------------------
|                            |                   |                    |                         |
| Packet Identifier (8 bits) |    Tag (8 bits)   |  Length (32 bits)  |   Value (up to 4 GiB)   |
|                            |                   |                    |                         |
-------------------------------------------------------------------------------------------------

For more information, please refer to the Vortex Protocol.
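As a rough illustration of the layout above, the following Python sketch packs and unpacks a packet with a 1-byte identifier, a 1-byte tag, and a 4-byte length followed by the value. The big-endian byte order, the tag value, and the function names are assumptions made for this example and are not taken from the Vortex specification.

```python
import struct

# Header layout from the packet format above:
# identifier (8 bits), tag (8 bits), length (32 bits).
# Big-endian byte order is an assumption made for this sketch.
HEADER = struct.Struct(">BBI")


def encode_packet(packet_id: int, tag: int, value: bytes) -> bytes:
    """Prefix the value with the 6-byte header (hypothetical helper)."""
    return HEADER.pack(packet_id, tag, len(value)) + value


def decode_packet(data: bytes) -> tuple[int, int, bytes]:
    """Split a buffer into (packet_id, tag, value), checking the declared length."""
    if len(data) < HEADER.size:
        raise ValueError("buffer is shorter than the 6-byte header")
    packet_id, tag, length = HEADER.unpack_from(data)
    value = data[HEADER.size:HEADER.size + length]
    if len(value) != length:
        raise ValueError("buffer is shorter than the declared value length")
    return packet_id, tag, value


if __name__ == "__main__":
    raw = encode_packet(packet_id=1, tag=0, value=b"piece-bytes")
    print(decode_packet(raw))
```

Because the length travels in the header, a reader can skip over packets whose tag it does not understand, which is the main flexibility a TLV layout offers.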

Enhanced Large File Distribution

This release significantly enhances Dragonfly's large file distribution capabilities, delivering improved efficiency and performance. We've revamped our scheduling algorithms for large file scenarios to ensure smarter resource and task allocation. Additionally, new mechanisms now more effectively balance the load across peers during large file transfers. Optimizations to the peer-to-peer (P2P) protocol and network transport layers further boost transmission efficiency.

These improvements include performance optimizations for both Client and Scheduler. You can find more details in the project's pull request.

Support scopes for Personal Access Tokens (PATs)

By enabling users to define specific access rights (scopes) for each PAT, we significantly enhance the security of Open API interactions. Instead of granting broad permissions, PATs can now be limited to only the necessary privileges required for a particular integration or task.

Enhanced Preheating

Implement Distributed Rate Limiting for Preheating Tasks

By limiting the rate at which preheating requests are initiated across the distributed system, it prevents excessive preheating activities from stressing the origin. This enhancement ensures a more stable preheating.

Support to set piece length for preheating

By allowing adjustment of the piece size, users can optimize data transfer efficiency, particularly in scenarios involving large files.

Flexible Preheating: Set Peer scope by Percentage or Count

This feature enhances preheating capabilities by allowing users to specify the preheating scope more precisely.

Implement Audit Logging for User Operations

This feature introduces comprehensive audit logging capabilities to track user operations within the system. Audit logs will record critical actions performed by users, such as initiating preheating tasks, deleting task caches, and other significant system interactions.

Garbage Collection

Dragonfly supports Garbage Collection (GC) Audit Logs and GC Job Records to track and manage garbage collection activities. The Manager enables automated GC retention, allowing records to be preserved for a configurable time period. Additionally, it provides the capability to manually trigger forced GC operations as needed.

This feature ensures efficient monitoring and management of GC processes, offering flexibility through automated retention policies and manual intervention for immediate GC execution.

Optimized File Download with Hard Link

File downloads need to be efficient and secure. When a user downloads a large file, downloading it and then copying it to the output path is inefficient. Instead, dfdaemon can create a hard link to the file at the output path, which avoids the copy and saves time and resources. If the hard link fails (e.g., because the paths are on different file systems), dfdaemon falls back to copying the file.

For more information, please refer to the file download workflow.
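The link-then-copy fallback can be sketched in Python as follows. This is only an illustration of the idea using the standard library; the function name and return values are hypothetical and this is not dfdaemon's actual implementation.

```python
import os
import shutil
import tempfile


def link_or_copy(cached_path: str, output_path: str) -> str:
    """Hard-link cached_path to output_path, copying if linking is not possible.

    Returns "link" or "copy" so the caller can see which path was taken.
    """
    try:
        os.link(cached_path, output_path)
        return "link"
    except OSError:
        # Raised, for example, when the two paths are on different file systems.
        shutil.copyfile(cached_path, output_path)
        return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "cached.bin")
        dst = os.path.join(tmp, "output.bin")
        with open(src, "wb") as f:
            f.write(b"model-checkpoint")
        print(link_or_copy(src, dst))
```

A hard link only adds a directory entry for the same data, so its cost does not grow with the file size, whereas the copy fallback does.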

Hardware Acceleration for Piece Hash Computation

This feature enables hardware-accelerated piece hash computation, significantly boosting performance and efficiency. By utilizing specialized hardware, the hash computation process is accelerated, allowing faster processing of large files.

Advanced Storage Management

Disk Space Validation for Operations

This feature enhances the client's storage functionality by implementing disk space validation. When insufficient disk space is detected, the client will return a failure response, preventing potential data corruption or incomplete operations.
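A minimal sketch of such a check, assuming the required size is known before the operation starts. The exception and function names are hypothetical, and the real client may account for space differently.

```python
import shutil


class InsufficientDiskSpace(Exception):
    """Raised when an operation would not fit on the target file system."""


def ensure_disk_space(path: str, required_bytes: int) -> None:
    """Fail early instead of starting a write that cannot complete."""
    free = shutil.disk_usage(path).free
    if required_bytes > free:
        raise InsufficientDiskSpace(
            f"need {required_bytes} bytes but only {free} are free at {path}"
        )


if __name__ == "__main__":
    ensure_disk_space(".", 1024)  # passes when at least 1 KiB is free
    print("enough space")
```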

Disk Garbage Collection Management

This feature enhances Peer's disk management by introducing configurable garbage collection (GC) thresholds based on disk usage. The distThreshold parameter allows users to define a specific disk capacity (e.g., 10TiB) as the base for calculating GC trigger points. If set, the distHighThresholdPercent (e.g., 80%) and distLowThresholdPercent (e.g., 60%) are applied relative to this capacity. If distThreshold is not provided or set to 0, these percentages are calculated based on the total actual disk space. When disk usage exceeds the high threshold, Dragonfly triggers GC (LRU) to reclaim space. GC stops when usage falls below the low threshold. This enables efficient management of a logical disk portion for caching, improving resource utilization and system performance.
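The threshold arithmetic described above can be worked through in a short Python sketch. The function and argument names are hypothetical stand-ins for distThreshold, distHighThresholdPercent, and distLowThresholdPercent, and integer rounding is an assumption.

```python
TIB = 1024 ** 4


def gc_thresholds(total_disk_bytes: int, dist_threshold_bytes: int = 0,
                  high_percent: int = 80, low_percent: int = 60) -> tuple[int, int]:
    """Return (high, low) usage thresholds in bytes.

    The base is dist_threshold_bytes when it is set, otherwise the total disk size.
    """
    base = dist_threshold_bytes if dist_threshold_bytes > 0 else total_disk_bytes
    return base * high_percent // 100, base * low_percent // 100


def should_start_gc(used_bytes: int, high: int) -> bool:
    """GC is triggered once usage exceeds the high threshold."""
    return used_bytes > high


def should_stop_gc(used_bytes: int, low: int) -> bool:
    """GC stops once usage falls below the low threshold."""
    return used_bytes < low


if __name__ == "__main__":
    high, low = gc_thresholds(total_disk_bytes=20 * TIB, dist_threshold_bytes=10 * TIB)
    print(high // TIB, low // TIB)
```

With a 10 TiB base, GC starts above 8 TiB of usage and stops below 6 TiB, regardless of how large the physical disk is.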

p10

Support for OpenTelemetry Tracing

Dragonfly supports tracing based on OpenTelemetry, covering the Manager, Scheduler, and Peers. This enables end-to-end visibility into the download process, allowing users to query detailed information, such as overall download latency, using a specific task ID. The integration ensures efficient monitoring and performance analysis across the entire system.

Add tracing configuration as follows (in Manager, Scheduler and Peer):

p11

You can access the Jaeger UI to visualize the traces.

p12

For more information, please refer to the Tracing.

Security Enhancements

We extend our sincere gratitude to the CNCF TAG Security for their collaboration on a joint security audit. Their expertise and thorough review were invaluable in helping us identify areas for security improvement within Dragonfly. For detailed information on the specific security issues addressed and the corresponding fixes, please refer to the following issue: #3811

Nydus

Significant bug fixes

Fixed memory leaks and file descriptor leaks caused by the sysinfo library.
Cleaned up the Unix domain socket (UDS) to prevent dfdaemon startup crashes.
Prevented the client from repeatedly downloading the same piece from multiple parents.

Others

You can see CHANGELOG for more details.

Dragonfly Github

p13

Dragonfly v2.2.0 has been released

January 7, 2025 · 8 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Dragonfly v2.2.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to the d7y.io website.

Features

Client written in Rust

The client is written in Rust, which brings advantages such as memory safety and improved performance. The client is a submodule of Dragonfly; refer to dragonflyoss/client.

scheduler schema

second scheduler schema

Client supports bandwidth rate limiting for prefetching

Client now supports rate limiting for prefetch requests, which can prevent network overload and reduce competition with other active download tasks, thereby enhancing overall system performance. Refer to the documentation to configure the proxy.prefetchRateLimit option.

code

The following diagram illustrates the usage of the download rate limit, upload rate limit, and prefetch rate limit for the client.

rate limit

Client supports leeching

If the user configures the client to disable sharing, it will become a leech.

code

Optimize client’s performance for handling a large number of small I/Os by Nydus

Add the X-Dragonfly-Prefetch HTTP header. If X-Dragonfly-Prefetch is set to true and it is a range request, the client will prefetch the entire task. This feature allows Nydus to control which requests need prefetching.
The client’s HTTP proxy adds an independent cache to reduce requests to the gRPC server, thereby reducing request latency.
Increase the memory cache size in RocksDB and enable prefix search for quickly searching piece metadata.
Use the CRC-32-Castagnoli algorithm with hardware acceleration to reduce the hash calculation cost for piece content.
Reuse the gRPC connections for downloading and optimize the download logic.
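The prefetch rule from the first item above (header set to true and the request is a range request) can be expressed as a small predicate. This is a sketch only: the function name is hypothetical, and treating header names and the value case-insensitively is an assumption rather than documented client behavior.

```python
def should_prefetch_entire_task(headers: dict[str, str]) -> bool:
    """Return True when a range request asks for the whole task to be prefetched."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    wants_prefetch = normalized.get("x-dragonfly-prefetch", "").strip().lower() == "true"
    is_range_request = "range" in normalized
    return wants_prefetch and is_range_request


if __name__ == "__main__":
    request_headers = {"Range": "bytes=0-4095", "X-Dragonfly-Prefetch": "true"}
    print(should_prefetch_entire_task(request_headers))
```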

Defines the V2 of the P2P transfer protocol

Define the V2 of the P2P transfer protocol to make it more standard, clearer, and better performing, refer to dragonflyoss/api.

Enhanced Harbor Integration with P2P Preheating

Dragonfly improves its integration with Harbor v2.13 for preheating images, including the following enhancements:

Support for preheating multi-architecture images.
Users can select the preheat scope for multi-granularity preheating. (Single Seed Peer, All Seed Peers, All Peers)
Users can specify the scheduler cluster IDs for preheating images to the desired Dragonfly clusters.

Refer to documentation for more details.

create P2P Provider policy

Task Manager

Users can search all peers holding a cached task by task ID or download URL, and delete the cache on the selected peers. Refer to the documentation.

dragonfly dashboard

Peer Manager

Manager will regularly synchronize peers’ information and also allows for manual refreshes. Additionally, it will display peers’ information on the Manager Console.

dragonfly dashboard

Add hostname regexes and CIDRs to cluster scopes for matching clients

When the client starts, it reports its hostname and IP to the Manager. The Manager then returns the best matching cluster (including schedulers and seed peers) to the client based on the cluster scopes configuration.
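A simplified sketch of scope matching with hostname regexes and CIDRs is shown below. The data structure and function names are hypothetical, and returning the first matching cluster is an assumption; how the Manager ranks candidates to find the best match is not modelled here.

```python
import ipaddress
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ClusterScope:
    """Hypothetical representation of one cluster's scopes configuration."""
    name: str
    hostname_regexes: list[str] = field(default_factory=list)
    cidrs: list[str] = field(default_factory=list)


def scope_matches(scope: ClusterScope, hostname: str, ip: str) -> bool:
    """A client matches when its hostname fits a regex or its IP falls in a CIDR."""
    if any(re.search(pattern, hostname) for pattern in scope.hostname_regexes):
        return True
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr, strict=False) for cidr in scope.cidrs)


def pick_cluster(scopes: list[ClusterScope], hostname: str, ip: str) -> Optional[str]:
    """Return the name of the first matching cluster, or None if nothing matches."""
    for scope in scopes:
        if scope_matches(scope, hostname, ip):
            return scope.name
    return None


if __name__ == "__main__":
    scopes = [
        ClusterScope("cluster-a", hostname_regexes=[r"^gpu-.*"]),
        ClusterScope("cluster-b", cidrs=["10.0.0.0/8"]),
    ]
    print(pick_cluster(scopes, "gpu-node-1", "192.168.1.5"))
```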

Creating a cluster on the Dragonfly dashboard

Supports distributed rate limiting for creating jobs across different clusters

Users can configure rate limiting for job creation across different clusters in the Manager Console.

creating a cluster on the dragonfly dashboard

Support preheating images using self-signed certificates

Preheating requires calling the container registry to parse the image manifest and construct the URL for downloading blobs. If the container registry uses a self-signed certificate, users can configure that certificate in the Manager’s config so that calls to the container registry succeed.

code

Support mTLS for gRPC calls between services

By setting self-signed certificates in the configurations of the Manager, Scheduler, Seed Peer, and Peer, gRPC calls between services will use mTLS.

Observability

Dragonfly recommends using Prometheus for monitoring. Prometheus and Grafana configurations are maintained in the dragonflyoss/monitoring repository.

Grafana dashboards are listed below:

Name	ID	Link	Description
Dragonfly Manager	15945	https://grafana.com/grafana/dashboards/15945	Grafana dashboard for dragonfly manager.
Dragonfly Scheduler	15944	https://grafana.com/grafana/dashboards/15944	Grafana dashboard for dragonfly scheduler.
Dragonfly Client	21053	https://grafana.com/grafana/dashboards/21053	Grafana dashboard for dragonfly client and dragonfly seed client.
Dragonfly Seed Client	21054	https://grafana.com/grafana/dashboards/21054	Grafana dashboard for dragonfly seed client.

dashboard

creating a cluster on the dragonfly dashboard

Nydus

Nydus v2.3.0 is released, refer to Nydus Image Service v2.3.0 for more details.

builder: support --parent-bootstrap for merge.
builder/nydusd: support batch chunks merging.
nydusify/nydus-snapshotter: support OCI reference types.
nydusify: support export/import for remote images.
nydusify: support --push-chunk-size for large size image.
nydusd/nydus-snapshotter: support basic failover and hot upgrade.
nydusd: support overlay writable mount for fusedev.

Console

Console v0.2.0 is released, featuring a redesigned UI and an improved interaction flow. Additionally, more functional pages have been added, such as preheating, task manager, PATs (Personal Access Tokens) manager, etc. Refer to the documentation for more details.

cluster overview

deeper dive image into cluster-1 on dashboard

Document

Refactor the website documentation to make Dragonfly simpler and more practical for users, refer to d7y.io.

dragonfly website

Significant bug fixes

The following content only highlights the significant bug fixes in this release.

Fix the thread safety issue that occurs when constructing the DAG (Directed Acyclic Graph) during scheduling.
Fix the memory leak caused by the OpenTelemetry library.
Avoid hot reload when dynconfig refreshes data from Manager.
Prevent concurrent download requests from causing failures in state machine transitions.
Use context.Background() to avoid stream cancel by dfdaemon.
Fix the database performance issue caused by clearing expired jobs when there are too many job records.
Reuse the gRPC connection pool to prevent redundant request construction.

AI Infrastructure

Model Spec

The Dragonfly community is collaboratively defining the OCI Model Specification. OCI Model Specification aims to provide a standard way to package, distribute and run AI models in a cloud native environment. The goal of this specification is to package models in an OCI artifact to take advantage of OCI distribution and ensure efficient model deployment, refer to CloudNativeAI/model-spec for more details.

OCI Model Specification image

node

Support accelerated distribution of AI models in Hugging Face Hub (Git LFS)

Distribute large files downloaded via the Git LFS protocol through Dragonfly P2P, refer to the documentation.

hugging face hub clusters

Maintainers

The community has added four new Maintainers, hoping to help more contributors participate in the community.

Han Jiang: He works for Kuaishou and will focus on the engineering work for Dragonfly.
Yuan Yang: He works for Alibaba Group and will focus on the engineering work for Dragonfly.

Other

You can see CHANGELOG for more details.

Triton Server accelerates distribution of models based on Dragonfly

April 15, 2024 · 10 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Project post by Yufei Chen, Miao Hao, and Min Huang, Dragonfly project

This document will help you experience how to use Dragonfly with Triton Server. Model files are large, and many services often download them at the same time. As a result, the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing nodes in Triton Server in Cluster A and Cluster B to Model Registry

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Cluster A and Cluster B Peer to Root Peer to Model Registry

Installation

By integrating the Dragonfly Repository Agent into Triton, download traffic is routed through Dragonfly to pull models stored in S3, OSS, GCS, and ABS, and to register the models in Triton. The Dragonfly Repository Agent is in the dragonfly-repository-agent repository.

Prerequisites

Name	Version	Document
Kubernetes cluster	1.20+	kubernetes.io
Helm	3.8.0+	helm.sh
Triton Server	23.08-py3	Triton Server

Notice: Kind is recommended if no Kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setup

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Cluster

Create the kind multi-node cluster configuration file kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull the latest Dragonfly images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Load the latest Dragonfly images into the kind cluster:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create the helm charts configuration file charts-config.yaml and set dfdaemon.config.agents.regx to match the download path of the object storage. For example, add regx:.*models.* to match download requests from the object storage bucket models. The configuration content is as follows:

scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
        cacert: ''
        cert: ''
        key: ''
      tcpListen:
        namespace: ''
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        # Proxy all http download requests of model bucket path.
        - regx: .*models.*
manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
jaeger:
  enable: true

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Example output:

LAST DEPLOYED: Wed Nov 29 21:23:48 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
  export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
  kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
  echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

kubectl get pods -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-8qcpd             1/1     Running   4 (118s ago)   2m45s
dragonfly-dfdaemon-qhkn8             1/1     Running   4 (108s ago)   2m45s
dragonfly-jaeger-6c44dc44b9-dfjfv    1/1     Running   0              2m45s
dragonfly-manager-549cd546b9-ps5tf   1/1     Running   0              2m45s
dragonfly-mysql-0                    1/1     Running   0              2m45s
dragonfly-redis-master-0             1/1     Running   0              2m45s
dragonfly-redis-replicas-0           1/1     Running   0              2m45s
dragonfly-redis-replicas-1           1/1     Running   0              2m7s
dragonfly-redis-replicas-2           1/1     Running   0              101s
dragonfly-scheduler-0                1/1     Running   0              2m45s
dragonfly-seed-peer-0                1/1     Running   1 (52s ago)    2m45s

Expose the Proxy service port

Create the dfstore.yaml configuration file to expose the port on which the Dragonfly Peer’s HTTP proxy listens. The default port is 65001, so set targetPort to 65001.

kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
    - protocol: TCP
      port: 65001
      targetPort: 65001
  type: NodePort

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward request to Dragonfly Peer’s HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Repository Agent

Set Dragonfly Repository Agent configuration

Create the dragonfly_config.json configuration file with the following content:

{
  "proxy": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ]
}

proxy: The address of Dragonfly Peer’s HTTP Proxy.
header: Adds a request header to the request.
filter: Used to generate unique tasks and filter unnecessary query parameters in the URL.

In the filter of the configuration, set different values when using different object storage:

Type	Value
OSS	["Expires","Signature","ns"]
S3	["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"]
OBS	["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Obs-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"]
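To illustrate why the filter matters, the sketch below strips the filtered query parameters from a signed URL before deriving a key, so that two signed URLs for the same object map to the same task. Hashing the filtered URL with SHA-256 and the function names are assumptions for illustration; the exact way Dragonfly derives task IDs is not described in this post.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Filter values for S3, taken from the table above.
S3_FILTER = {
    "X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date",
    "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature",
}


def strip_filtered_params(url: str, filtered: set[str]) -> str:
    """Remove query parameters that change per request, such as signatures."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in filtered]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def task_key(url: str, filtered: set[str]) -> str:
    """Hypothetical task key: SHA-256 of the URL after filtering."""
    return hashlib.sha256(strip_filtered_params(url, filtered).encode()).hexdigest()


if __name__ == "__main__":
    first = "https://example.com/models/a.onnx?versionId=3&X-Amz-Signature=abc"
    second = "https://example.com/models/a.onnx?versionId=3&X-Amz-Signature=xyz"
    print(task_key(first, S3_FILTER) == task_key(second, S3_FILTER))
```

Without the filter, every freshly signed URL would look like a new task and peers could not share cached pieces of the same model file.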

Set Model Repository configuration

Create the cloud_credential.json cloud storage credential file with the following content:

{
  "gs": {
    "": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS",
    "gs://gcs-bucket-002": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS_2"
  },
  "s3": {
    "": {
      "secret_key": "AWS_SECRET_ACCESS_KEY",
      "key_id": "AWS_ACCESS_KEY_ID",
      "region": "AWS_DEFAULT_REGION",
      "session_token": "",
      "profile": ""
    },
    "s3://s3-bucket-002": {
      "secret_key": "AWS_SECRET_ACCESS_KEY_2",
      "key_id": "AWS_ACCESS_KEY_ID_2",
      "region": "AWS_DEFAULT_REGION_2",
      "session_token": "AWS_SESSION_TOKEN_2",
      "profile": "AWS_PROFILE_2"
    }
  },
  "as": {
    "": {
      "account_str": "AZURE_STORAGE_ACCOUNT",
      "account_key": "AZURE_STORAGE_KEY"
    },
    "as://Account-002/Container": {
      "account_str": "",
      "account_key": ""
    }
  }
}

To pull the model through Dragonfly, add the following to the model's config.pbtxt file:

model_repository_agents
{
  agents [
    {
      name: "dragonfly",
    }
  ]
}

The densenet_onnx example contains the modified configuration and model file. The modified config.pbtxt is as follows:

name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 1, 3, 224, 224 ] }
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    reshape { shape: [ 1, 1000, 1, 1 ] }
    label_filename: "densenet_labels.txt"
  }
]
model_repository_agents
{
  agents [
    {
      name: "dragonfly",
    }
  ]
}

Triton Server integrates Dragonfly Repository Agent plugin

Install Triton Server with Docker

Pull the dragonflyoss/dragonfly-repository-agent image, which integrates the Dragonfly Repository Agent plugin into Triton Server; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-repository-agent:latest

Run the container and mount the configuration directory:

docker run --network host --rm \
  -v ${path-to-config-dir}:/home/triton/ \
  dragonflyoss/dragonfly-repository-agent:latest tritonserver \
  --model-repository=${model-repository-path}

path-to-config-dir: The directory containing dragonfly_config.json and cloud_credential.json.
model-repository-path: The path of the remote model repository.

The correct output is as follows:

=============================== Triton Inference Server ===============================
successfully loaded 'densenet_onnx'
I1130 09:43:22.595672 1 server.cc:604]
+------------------+------------------------------------------------------------------------+
| Repository Agent | Path                                                                   |
+------------------+------------------------------------------------------------------------+
| dragonfly        | /opt/tritonserver/repoagents/dragonfly/libtritonrepoagent_dragonfly.so |
+------------------+------------------------------------------------------------------------+

I1130 09:43:22.596011 1 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}                                                                                                                                                            |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.596112 1 server.cc:674]
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1       | READY  |
+---------------+---------+--------+

I1130 09:43:22.598318 1 metrics.cc:703] Collecting CPU metrics
I1130 09:43:22.599373 1 tritonserver.cc:2435]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.37.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | s3://192.168.36.128:9000/models                                                                                                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.610334 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1130 09:43:22.612623 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1130 09:43:22.695843 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Execute the following command to check the Dragonfly logs:

kubectl exec -it -n dragonfly-system dragonfly-dfdaemon-<id> -- tail -f /var/log/dragonfly/daemon/core.log

Check that the model was downloaded successfully through Dragonfly:

{"level":"info","ts":"2024-02-02 05:28:02.631","caller":"peer/peertask_conductor.go:1349","msg":"peer task done, cost: 352ms","peer":"10.244.2.3-1-4398a429-d780-423a-a630-57d765f1ccfc","task":"974aaf56d4877cc65888a4736340fb1d8fecc93eadf7507f531f9fae650f1b4d","component":"PeerTask","trace":"4cca9ce80dbf5a445d321cec593aee65"}

Verify

Call the inference API:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

Check that the response is successful:

Request 01
Image '/workspace/images/mug.jpg':
    15.349563 (504) = COFFEE MUG
    13.227461 (968) = CUP
    10.424893 (505) = COFFEEPOT

Performance testing

Test the performance of single-machine model download through the Triton API after integrating Dragonfly P2P. Because the machine's own network environment affects the results, the absolute download time matters less than the relative download speed across the different scenarios:

Bar chart showing time to download large Triton API; Triton API & Dragonfly Cold Boot; Hit Dragonfly Remote Peer Cache; Hit Dragonfly Local Peer Cache

Triton API: Use the signed URL provided by the object storage to download the model directly.
Triton API & Dragonfly Cold Boot: Use the Triton Serve API to download the model via the Dragonfly P2P network with no cache hits.
Hit Remote Peer: Use the Triton Serve API to download the model via the Dragonfly P2P network and hit the remote peer cache.
Hit Local Peer: Use the Triton Serve API to download the model via the Dragonfly P2P network and hit the local peer cache.

Test results show that integrating Triton with Dragonfly can effectively reduce the file download time. Note that this was a single-machine test, which means that when the cache is hit, performance is limited by the disk. If Dragonfly is deployed on multiple machines for P2P download, the model download speed will be faster.

Resources

Dragonfly Community

Website: https://d7y.io/
GitHub Repo: https://github.com/dragonflyoss/dragonfly
Dragonfly Repository Agent GitHub Repo: https://github.com/dragonflyoss/dragonfly-repository-agent
Slack Channel: #dragonfly on CNCF Slack
Discussion Group: dragonfly-discuss@googlegroups.com
Twitter: @dragonfly_oss

NVIDIA Triton Inference Server

Website: https://developer.nvidia.com/triton-inference-server
GitHub Repo: https://github.com/triton-inference-server/server

Dragonfly accelerates distribution of large files with Git LFS

January 15, 2024 · 11 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

What is Git LFS?

Git LFS (Large File Storage) is an open-source extension for Git that enables users to handle large files more efficiently in Git repositories. Git is a version control system designed primarily for text files such as source code and it can become less efficient when dealing with large binary files like audio, videos, datasets, graphics and other large assets. These files can significantly increase the size of a repository and make cloning and fetching operations slow.

Diagram flow showing Remote to Large File Storage

Git LFS addresses this issue by storing these large files on a separate server and replacing them in the Git repository with small placeholder files (pointers). When a user clones or pulls from the repository, Git LFS fetches the large files from the LFS server as needed rather than downloading all the large files with the initial clone of the repository. For specifications, please refer to the Git LFS Specification. The server is implemented based on the HTTP protocol, refer to Git LFS API. Usually Git LFS’s content storage uses object storage to store large files.

Git LFS Usage

Git LFS manages large files

GitHub and GitLab usually manage large files based on Git LFS.

For how GitHub uses Git LFS, refer to About Git Large File Storage.
For how GitLab uses Git LFS, refer to Git Large File Storage.

Git LFS manages AI models and AI datasets

Large files of models and datasets in AI are usually managed based on Git LFS. Hugging Face Hub and ModelScope Hub manage models and datasets based on Git LFS.

For how Hugging Face Hub uses Git LFS, refer to Getting Started with Repositories.
For how ModelScope Hub uses Git LFS, refer to Getting Started with ModelScope.

Hugging Face Hub’s Python Library implements Git LFS to download models and datasets. To accelerate the distribution of models and datasets through the Hugging Face Hub’s Python Library, refer to Hugging Face accelerates distribution of models and datasets based on Dragonfly.

Dragonfly eliminates the bandwidth limit of Git LFS’s content storage

This document will help you experience how to use Dragonfly with Git LFS. The files are large, and when many services download them at the same time, the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing Cluster A and Cluster B to Large File Storage

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating large file downloads.

Diagram flow showing Cluster A and Cluster B to Large File Storage using Peer and Root Peer

Dragonfly accelerates downloads with Git LFS

By proxying the HTTP protocol file download request of Git LFS to Dragonfly Peer Proxy, the file download traffic is forwarded to the P2P network. The following documentation is based on GitHub LFS.

Get the Content Storage address of Git LFS

Add GIT_CURL_VERBOSE=1 to print verbose logs of git clone and get the address of content storage of Git LFS.

GIT_CURL_VERBOSE=1 git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Look for the trace git-lfs keyword in the logs and you can see the log of Git LFS download files. Pay attention to the content of actions and download in the log.

15:31:04.848308 trace git-lfs: HTTP: {"objects":[{"oid":"c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29","size":5242880,"actions":{"download":{"href":"https://github-cloud.githubusercontent.com/alambic/media/376919987/c0/36/c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIMWPLRQEC4XCWWPA%2F20231221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231221T073104Z&X-Amz-Expires=3600&X-Amz-Signature=4dc757dff0ac96eac3f0cd2eb29ca887035d3a6afba41cb10200ed0aa22812fa&
15:31:04.848403 trace git-lfs: HTTP: X-Amz-SignedHeaders=host&actor_id=15955374&key_id=0&repo_id=392935134&token=1","expires_at":"2023-12-21T08:31:04Z","expires_in":3600}}}]}

The download URL can be found in actions.download.href of each entry in objects. The content of GitHub LFS is stored at github-cloud.githubusercontent.com, and the query parameters include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders. These are the AWS Signature Version 4 query parameters used to authenticate requests. The keys of these query parameters will be used later when configuring the Dragonfly Peer Proxy.

Information about Git LFS:

The content storage address of Git LFS is github-cloud.githubusercontent.com.
The query parameters of the download URL include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders.
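Reading the host and the query parameter names out of the trace log by eye is error-prone, so the lookup described above can be sketched in Python. This is a minimal sketch that assumes the JSON body of the trace line has already been copied into a string; the response below is a hypothetical, shortened stand-in (the host lfs.example.com, the oid and the signature values are placeholders, not real GitHub data).

```python
import json
from urllib.parse import urlsplit, parse_qsl

# Hypothetical, shortened batch response; host, oid and query values are placeholders.
RESPONSE = json.dumps({
    "objects": [{
        "oid": "c036cbb7",
        "size": 5242880,
        "actions": {"download": {
            "href": "https://lfs.example.com/media/c036cbb7"
                    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600"
                    "&X-Amz-Signature=abc123",
            "expires_in": 3600,
        }},
    }]
})


def content_storage_info(batch_response):
    """Yield (host, sorted query parameter names) for every download href."""
    for obj in json.loads(batch_response).get("objects", []):
        href = obj.get("actions", {}).get("download", {}).get("href")
        if href is None:
            continue  # object without a download action
        parts = urlsplit(href)
        keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
        yield parts.netloc, keys


info = list(content_storage_info(RESPONSE))
for host, keys in info:
    print(host, keys)
```

The host is what goes into the proxy rule, and the parameter names are the candidates for the query filter configured later.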

Installation

Prerequisites

Notice: Kind is recommended if no Kubernetes cluster is available for testing.

Install dragonfly

For detailed installation documentation based on a Kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes cluster

Create the kind multi-node cluster configuration file kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create the Helm chart configuration file charts-config.yaml. Add the github-cloud.githubusercontent.com rule to dfdaemon.config.proxy.proxies.regx to forward HTTP file downloads from the Git LFS content storage to the P2P network. Set dfdaemon.config.proxy.defaultFilter to X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders so that these query parameters are filtered out. Dragonfly generates the task id from the URL, and the signed query parameters change with every request, so they must be filtered out for the same file to always map to the same task id. The configuration content is as follows:

scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-Signature&X-Amz-SignedHeaders"
      security:
        insecure: true
        cacert: ""
        cert: ""
        key: ""
      tcpListen:
        namespace: ""
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
      - regx: blobs/sha256.*
      - regx: github-cloud.githubusercontent.com.*

manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true
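To see why defaultFilter matters, the following Python sketch strips the filtered query parameters from two differently signed URLs for the same object and hashes what remains. It is only an illustration: the SHA-256 key is an assumption made for this example and is not Dragonfly's actual task id algorithm, and the URL path and parameter values are placeholders.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Same value as dfdaemon.config.proxy.defaultFilter in charts-config.yaml.
DEFAULT_FILTER = (
    "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date"
    "&X-Amz-Expires&X-Amz-Signature&X-Amz-SignedHeaders"
)


def strip_filtered_params(url, default_filter=DEFAULT_FILTER):
    """Return the URL without the query parameters named in the filter."""
    filtered = set(default_filter.split("&"))
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in filtered]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def illustrative_task_key(url):
    """SHA-256 of the filtered URL (illustration only, not the real task id)."""
    return hashlib.sha256(strip_filtered_params(url).encode()).hexdigest()


# Two signed URLs for the same (placeholder) object, issued an hour apart.
BASE = "http://github-cloud.githubusercontent.com/alambic/media/1/obj"
first = BASE + "?X-Amz-Date=20231221T073104Z&X-Amz-Signature=aaa&repo_id=1"
second = BASE + "?X-Amz-Date=20231221T083104Z&X-Amz-Signature=bbb&repo_id=1"

print(strip_filtered_params(first))
print(illustrative_task_key(first) == illustrative_task_key(second))
```

Without the filter, the two URLs would hash to different keys and the second download could not reuse the pieces cached for the first.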

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Output:

NAME: dragonfly
LAST DEPLOYED: Thu Dec 21 17:24:37 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
  export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
  kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
  echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-cttxz             1/1     Running   4 (116s ago)   2m51s
dragonfly-dfdaemon-k62vd             1/1     Running   4 (117s ago)   2m51s
dragonfly-jaeger-84dbfd5b56-mxpfs    1/1     Running   0              2m51s
dragonfly-manager-5c598d5754-fd9tf   1/1     Running   0              2m51s
dragonfly-mysql-0                    1/1     Running   0              2m51s
dragonfly-redis-master-0             1/1     Running   0              2m51s
dragonfly-redis-replicas-0           1/1     Running   0              2m51s
dragonfly-redis-replicas-1           1/1     Running   0              106s
dragonfly-redis-replicas-2           1/1     Running   0              78s
dragonfly-scheduler-0                1/1     Running   0              2m51s
dragonfly-seed-peer-0                1/1     Running   1 (37s ago)    2m51s

Create the peer service configuration file peer-service-config.yaml with the following content:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Git LFS downloads large files via dragonfly

Proxy Git LFS download requests to the Dragonfly Peer Proxy (http://127.0.0.1:65001) through Git configuration. Set the http.proxy, lfs.transfer.enablehrefrewrite and url.{YOUR-LFS-CONTENT-STORAGE}.insteadOf properties:

git config --global http.proxy http://127.0.0.1:65001
git config --global lfs.transfer.enablehrefrewrite true
git config --global url.http://github-cloud.githubusercontent.com/.insteadOf https://github-cloud.githubusercontent.com/
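The effect of the insteadOf rule on a download href can be sketched in Python. Git replaces the longest matching insteadOf prefix of a URL with the configured base, and lfs.transfer.enablehrefrewrite makes Git LFS apply the same rewrite to the href returned by the LFS API. The function below is a simplified model of that prefix replacement, not Git's implementation, and the object path is a placeholder.

```python
def apply_instead_of(url, rules):
    """Rewrite url using {base: insteadOf_prefix} rules; the longest prefix wins."""
    best_base, best_prefix = None, ""
    for base, prefix in rules.items():
        if url.startswith(prefix) and len(prefix) > len(best_prefix):
            best_base, best_prefix = base, prefix
    if best_base is None:
        return url  # no rule applies, URL is left untouched
    return best_base + url[len(best_prefix):]


# Mirrors: url.http://github-cloud.githubusercontent.com/.insteadOf https://github-cloud.githubusercontent.com/
RULES = {
    "http://github-cloud.githubusercontent.com/": "https://github-cloud.githubusercontent.com/",
}

rewritten = apply_instead_of(
    "https://github-cloud.githubusercontent.com/alambic/media/obj", RULES
)
print(rewritten)
```

After the rewrite, the request is plain HTTP and goes through http.proxy, where the peer proxy can match it against its proxy rules.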

Forward Git LFS download requests to the P2P network via Dragonfly Peer Proxy and Git clone the large files.

git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Verify large files download with Dragonfly

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon
# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

2023-12-21T16:55:20.495+0800    INFO    peer/peertask_conductor.go:1326    peer task done, cost: 2238ms    {"peer": "30.54.146.131-15874-f6729352-950e-412f-b876-0e5c8e3232b1", "task": "70c644474b6c986e3af27d742d3602469e88f8956956817f9f67082c6967dc1a", "component": "PeerTask", "trace": "35c801b7dac36eeb0ea43a58d1c82e77"}
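When several files are downloaded, the matching log lines can be summarized with a short Python script. This is a minimal sketch that assumes each line ends with the JSON field block shown in the example output; the first sample line is taken from that output, while the second (120 ms) is hypothetical and only there to make the summary non-trivial.

```python
import json
import re

# Matches the cost and the trailing JSON fields, with or without separators.
LINE_RE = re.compile(r"peer task done, cost: (\d+)ms\s*(\{.*\})\s*$")


def parse_line(line):
    """Return cost, task and peer for a 'peer task done' line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = json.loads(match.group(2))
    return {
        "cost_ms": int(match.group(1)),
        "task": fields.get("task"),
        "peer": fields.get("peer"),
    }


def summarize(lines):
    """Return count, min and max cost over all matching lines, else None."""
    costs = [p["cost_ms"] for p in map(parse_line, lines) if p is not None]
    if not costs:
        return None
    return {"count": len(costs), "min_ms": min(costs), "max_ms": max(costs)}


LINES = [
    '2023-12-21T16:55:20.495+0800 INFO peer/peertask_conductor.go:1326 '
    'peer task done, cost: 2238ms {"peer": "30.54.146.131-15874-f6729352-950e-412f-b876-0e5c8e3232b1", '
    '"task": "70c644474b6c986e3af27d742d3602469e88f8956956817f9f67082c6967dc1a", "component": "PeerTask"}',
    # Hypothetical second line, e.g. a later download served from a cache.
    'peer task done, cost: 120ms{"peer": "peer-b", "task": "task-b", "component": "PeerTask"}',
    "unrelated log line",
]

summary = summarize(LINES)
print(summary)
```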

Performance testing

Test the performance of single-machine large file download after integrating Git LFS with Dragonfly P2P. Because the machine's own network environment affects the results, the absolute download time matters less than the ratio of download times across the different scenarios.

Bar chart showing time to download large files (512M and 1G) between Git LFS, Git LFS & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache

Git LFS: Use Git LFS to download large files directly.
Git LFS & Dragonfly Cold Boot: Use Git LFS to download large files via the Dragonfly P2P network with no cache hits.
Hit Dragonfly Remote Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the remote peer cache.
Hit Dragonfly Local Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the local peer cache.

Test results show that integrating Git LFS with Dragonfly P2P can effectively reduce the file download time. Note that this was a single-machine test, which means that when the cache is hit, performance is limited by the disk. If Dragonfly is deployed on multiple machines for P2P download, the large file download speed will be faster.

TorchServe accelerates the distribution of models based on Dragonfly

December 6, 2023 · 14 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

This document will help you experience how to use Dragonfly with TorchServe. Model files are large, and when many services download them at the same time, the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Architecture

Dragonfly Endpoint architecture

Dragonfly Endpoint plugin forwards TorchServe download model requests to the Dragonfly P2P network.

Dragonfly Endpoint architecture

The model download steps:

TorchServe sends a model download request and the request is forwarded to the Dragonfly Peer.
The Dragonfly Peer registers the task with the Dragonfly Scheduler.
The Dragonfly Scheduler returns the candidate parents to the Dragonfly Peer.
The Dragonfly Peer downloads the model from the candidate parents.
After the model is downloaded, TorchServe registers the model.
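The steps above can be sketched as a toy simulation in Python. Everything here is hypothetical: ToyScheduler and download_model are illustrative stand-ins rather than Dragonfly APIs, and falling back to object storage when no candidate parent exists is an assumption made to model the cold boot case.

```python
class ToyScheduler:
    """Hypothetical stand-in for the Dragonfly Scheduler."""

    def __init__(self):
        self._holders = {}  # task id -> peers that already hold the task

    def register(self, task_id, peer):
        """Register a download and return the candidate parents for it."""
        return [p for p in self._holders.get(task_id, []) if p != peer]

    def report_finished(self, task_id, peer):
        """Record that a peer now holds the task and can serve others."""
        holders = self._holders.setdefault(task_id, [])
        if peer not in holders:
            holders.append(peer)


def download_model(scheduler, peer, task_id):
    """Return the name of the source the toy peer fetched the model from."""
    parents = scheduler.register(task_id, peer)
    # Assumption: with no candidate parents the peer goes back to the source.
    source = parents[0] if parents else "object-storage"
    scheduler.report_finished(task_id, peer)
    return source


scheduler = ToyScheduler()
first = download_model(scheduler, "peer-a", "squeezenet1_1.mar")
second = download_model(scheduler, "peer-b", "squeezenet1_1.mar")
print(first, second)
```

In this toy model the first peer pulls from the object storage, and the second peer is handed the first one as a parent, which is the behaviour that takes load off the storage.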

Installation

Integrating the Dragonfly Endpoint into TorchServe routes download traffic through Dragonfly to pull models stored in S3, OSS, GCS and ABS, and registers the models in TorchServe. The Dragonfly Endpoint plugin is in the dragonfly-endpoint repository.

Prerequisites

Name	Version	Document
Kubernetes cluster	1.20+	kubernetes.io
Helm	3.8.0+	helm.sh
TorchServe	0.4.0+	pytorch.org/serve/

Notice: Kind is recommended if no Kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setup

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Cluster

Create the kind multi-node cluster configuration file kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create the Helm chart configuration file charts-config.yaml and set dfdaemon.config.proxy.proxies.regx to match the download path of the object storage. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  hostNetwork: true
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: "Expires&Signature&ns"
      security:
        insecure: true
        cacert: ""
        cert: ""
        key: ""
      tcpListen:
        namespace: ""
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
      - regx: blobs/sha256.*
      - regx: .*amazonaws.*

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true
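To check which download URLs the proxies rules in this configuration would pick up, the patterns can be tried out in Python. This is a sketch under the assumption that a rule applies when its pattern is found anywhere in the URL (an unanchored search); the example URLs and bucket name are placeholders.

```python
import re

# Patterns copied from dfdaemon.config.proxy.proxies in charts-config.yaml.
PROXY_PATTERNS = ["blobs/sha256.*", ".*amazonaws.*"]


def matching_rule(url, patterns=PROXY_PATTERNS):
    """Return the first pattern found in the URL, or None if none applies."""
    for pattern in patterns:
        if re.search(pattern, url):  # assumption: unanchored match
            return pattern
    return None


EXAMPLES = [
    "https://my-bucket.s3.us-east-1.amazonaws.com/models/squeezenet1_1.mar",
    "https://index.docker.io/v2/library/nginx/blobs/sha256:abc",
    "https://example.com/file",
]

for url in EXAMPLES:
    print(url, "->", matching_rule(url))
```

A URL that matches no pattern would not be forwarded to the P2P network, so the pattern has to cover the host or path of the object storage in use.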

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
LAST DEPLOYED: Mon Sep  4 10:24:55 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/
4. Get Jaeger query URL by running these commands:
  export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
  kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
  echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS      AGE
dragonfly-dfdaemon-7r2cn             1/1     Running   0          3m31s
dragonfly-dfdaemon-fktl4             1/1     Running   0          3m31s
dragonfly-jaeger-c7947b579-2xk44     1/1     Running   0          3m31s
dragonfly-manager-5d4f444c6c-wq8d8   1/1     Running   0          3m31s
dragonfly-mysql-0                    1/1     Running   0          3m31s
dragonfly-redis-master-0             1/1     Running   0          3m31s
dragonfly-redis-replicas-0           1/1     Running   0          3m31s
dragonfly-redis-replicas-1           1/1     Running   0          3m5s
dragonfly-redis-replicas-2           1/1     Running   0          2m44s
dragonfly-scheduler-0                1/1     Running   0          3m31s
dragonfly-seed-peer-0                1/1     Running   0          3m31s

Expose the Proxy service port

Create the dfstore.yaml configuration to expose the port on which the Dragonfly Peer’s HTTP proxy listens. The default port is 65001, so set targetPort to 65001.

kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
  - protocol: TCP
    port: 65001
    targetPort: 65001
  type: NodePort

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward request to Dragonfly Peer’s HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Endpoint plugin

Set environment variables for Dragonfly Endpoint configuration

Create the config.json configuration, and set the DRAGONFLY_ENDPOINT_CONFIG environment variable to the config.json file path.

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

The default configuration path is:

linux: /etc/dragonfly-endpoint/config.json
darwin: ~/.dragonfly-endpoint/config.json
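The lookup order implied above can be sketched in Python. This is an illustration, not the plugin's code: it assumes that DRAGONFLY_ENDPOINT_CONFIG takes precedence over the platform default, and resolve_config_path is a hypothetical helper name.

```python
import os
import sys

# Default locations listed above, keyed by sys.platform value.
DEFAULT_PATHS = {
    "linux": "/etc/dragonfly-endpoint/config.json",
    "darwin": "~/.dragonfly-endpoint/config.json",
}


def resolve_config_path(environ=None, platform=None):
    """Return the config.json path: environment override first, then default."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    override = environ.get("DRAGONFLY_ENDPOINT_CONFIG")
    if override:
        return override
    default = DEFAULT_PATHS.get(platform)
    if default is None:
        raise ValueError(f"no default config path known for platform {platform!r}")
    return os.path.expanduser(default)  # expands '~' on darwin


print(resolve_config_path({"DRAGONFLY_ENDPOINT_CONFIG": "/tmp/config.json"}, "linux"))
print(resolve_config_path({}, "linux"))
```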

Dragonfly Endpoint configuration

Create the config.json configuration to configure the Dragonfly Endpoint for S3, the configuration is as follows:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

In the filter of the configuration, set different values when using different object storage:

Type	Value
OSS	"Expires&Signature&ns"
S3	"X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature"
OBS	"X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Obs-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature"

Object storage configuration

In addition to S3, Dragonfly Endpoint plugin also supports OSS, GCS and ABS. Different object storage configurations are as follows:

OSS(Object Storage Service)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": ["Expires", "Signature"],
  "object_storage": {
    "type": "oss",
    "bucket_name": "your_oss_bucket_name",
    "endpoint": "your_oss_endpoint",
    "access_key_id": "your_oss_access_key_id",
    "access_key_secret": "your_oss_access_key_secret"
  }
}

GCS(Google Cloud Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "gcs",
    "bucket_name": "your_gcs_bucket_name",
    "project_id": "your_gcs_project_id",
    "service_account_path": "your_gcs_service_account_path"
  }
}

ABS(Azure Blob Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "abs",
    "account_name": "your_abs_account_name",
    "account_key": "your_abs_account_key",
    "container_name": "your_abs_container_name"
  }
}
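Because each object storage type expects a different set of keys, a quick check before starting TorchServe can catch a misnamed field. The Python sketch below derives its required keys only from the four example configurations above; whether the plugin treats all of them as mandatory is an assumption, and missing_keys is a hypothetical helper.

```python
# Keys taken from the example configurations for each object storage type.
REQUIRED_KEYS = {
    "s3": {"bucket_name", "region", "access_key", "secret_key"},
    "oss": {"bucket_name", "endpoint", "access_key_id", "access_key_secret"},
    "gcs": {"bucket_name", "project_id", "service_account_path"},
    "abs": {"account_name", "account_key", "container_name"},
}


def missing_keys(config):
    """Return the sorted list of keys missing from a Dragonfly Endpoint config."""
    storage = config.get("object_storage", {})
    kind = storage.get("type")
    if kind not in REQUIRED_KEYS:
        raise ValueError(f"unsupported object storage type: {kind!r}")
    problems = sorted(REQUIRED_KEYS[kind] - set(storage))
    if "addr" not in config:
        problems.insert(0, "addr")  # proxy address is used by every example
    return problems


# Example: a GCS configuration where project_id was forgotten.
config = {
    "addr": "http://127.0.0.1:65001",
    "header": {},
    "object_storage": {
        "type": "gcs",
        "bucket_name": "your_gcs_bucket_name",
        "service_account_path": "your_gcs_service_account_path",
    },
}

print(missing_keys(config))
```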

TorchServe integrates Dragonfly Endpoint plugin

For detailed installation documentation, please refer to TorchServe document.

Binary installation

The Prerequisites

Name	Version	Document
Python	3.8.0+	https://www.python.org/
TorchServe	0.4.0+	pytorch.org/serve/
Java	11	https://openjdk.org/projects/jdk/11/

Clone the TorchServe repository:

git clone https://github.com/pytorch/serve.git
cd serve

Install the TorchServe dependencies and torch-model-archiver from inside the cloned repository:

python ./ts_scripts/install_dependencies.py
conda install torchserve torch-model-archiver torch-workflow-archiver -c pytorch

Create the model-store directory to store the models:

mkdir model-store
chmod 777 model-store

Create the plugins-path directory to store the binaries of the plugin:

mkdir plugins-path

Package Dragonfly Endpoint plugin

Clone the dragonfly-endpoint repository:

git clone https://github.com/dragonflyoss/dragonfly-endpoint.git

Build the dragonfly-endpoint project to generate the Jar file in the build/libs directory:

cd ./dragonfly-endpoint
gradle shadowJar

Note: Due to the limitations of TorchServe’s JVM, Java 11 is the best version to use with Gradle, because a plugin built with a higher version will fail to parse.

Move the Jar file into the plugins-path directory:

mv build/libs/dragonfly_endpoint-1.0-all.jar <your plugins-path>

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

--model-store sets the previously created directory that stores the models, and --plugins-path sets the previously created directory that stores the plugins. Start TorchServe with the Dragonfly Endpoint plugin:

torchserve --start --model-store <path-to-model-store-file> --plugins-path=<path-to-plugin-jars>

Verify

Prepare the model. Download a model from the Model Zoo, or package your own model as described in Torch Model archiver for TorchServe. Use the squeezenet1_1_scripted.mar model to verify:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<Target path>

The TorchServe plugin is named dragonfly; please refer to the TorchServe Register API for details of the plugin API. The url parameter is not supported. Instead, pass the file_name parameter, which is the name of the model file to download.

Download the model:

curl -X POST  "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"

Verify that the model was downloaded successfully:

{"Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."}
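The same registration call can be built from Python, which is handy when the file name comes from a script. This sketch only constructs the request with the standard library; the management address is the one used in the curl command, and build_register_request is a hypothetical helper name. Sending it requires a running TorchServe with the plugin loaded.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_register_request(file_name, management_addr="http://localhost:8081"):
    """Build the POST request that asks the dragonfly plugin to fetch a model."""
    query = urlencode({"file_name": file_name})
    return Request(f"{management_addr}/dragonfly/models?{query}", method="POST")


request = build_register_request("squeezenet1_1.mar")
print(request.get_method(), request.full_url)

# To actually send it against a running TorchServe:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode())
```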

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers has increased:

* About to connect() to localhost port 8081 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{  "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact

Call inference API:

# Prepare images for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}
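When the check is scripted, the response can be reduced to its most likely label. The Python sketch below uses the response shown above as input; top_prediction is a hypothetical helper, and it assumes the body is a flat JSON object that maps labels to scores.

```python
import json

# Response body copied from the output above.
RESPONSE = """
{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}
"""


def top_prediction(body):
    """Return (label, score) for the highest-scoring class in the response."""
    scores = json.loads(body)
    label = max(scores, key=scores.get)
    return label, scores[label]


label, score = top_prediction(RESPONSE)
print(label, round(score, 3))
```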

Install TorchServe with Docker

Docker configuration

Pull the dragonflyoss/dragonfly-endpoint image, which includes the plugin. The following example uses the CPU version of TorchServe; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-endpoint

Create the model-store directory to store the model files:

mkdir model-store
chmod 777 model-store

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

Mount the model-store and dragonfly-endpoint configuration directory. Run the container:

sudo docker run --rm -it --network host \
  -v $(pwd)/model-store:/home/model-server/model-store \
  -v ${DRAGONFLY_ENDPOINT_CONFIG}:${DRAGONFLY_ENDPOINT_CONFIG} \
  dragonflyoss/dragonfly-endpoint:latest

How to Verify

Prepare the model. Download a model from the Model Zoo, or package your own model by referring to Torch Model Archiver for TorchServe. This example uses the squeezenet1_1_scripted.mar model for verification:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<Target path>

Download a model:

curl -X POST  "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"

Verify that the model was downloaded successfully:

{"Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."}

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers has increased:

* About to connect() to localhost port 8081 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{  "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact
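
The two management calls above (download the model through Dragonfly, then add a worker) can also be scripted. The following is a minimal sketch using only the Python standard library. The endpoint paths and query parameters are taken from the curl commands above, while the helper names are hypothetical. The requests are only built here; send has to be called explicitly against a running TorchServe.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Management API address used by the curl commands above.
MANAGEMENT_API = "http://localhost:8081"


def download_model_request(file_name: str) -> Request:
    """Build the POST request that asks the plugin to download a model."""
    query = urlencode({"file_name": file_name})
    return Request(f"{MANAGEMENT_API}/dragonfly/models?{query}", method="POST")


def scale_workers_request(model_name: str, min_worker: int) -> Request:
    """Build the PUT request that sets the minimum number of workers."""
    query = urlencode({"min_worker": min_worker})
    return Request(f"{MANAGEMENT_API}/models/{model_name}?{query}", method="PUT")


def send(request: Request, timeout: float = 30.0) -> str:
    """Send a request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode()


# Show what would be sent, without touching the network.
for req in (download_model_request("squeezenet1_1.mar"),
            scale_workers_request("squeezenet1_1", 1)):
    print(req.get_method(), req.full_url)
```

With TorchServe running, send(download_model_request("squeezenet1_1.mar")) performs the same call as the first curl command.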

Call inference API:

# Prepare the images used for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}
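
The response maps class labels to scores. As a small illustration of how a client might consume it, the sketch below picks the highest-scoring label from the body shown above. The helper name top_prediction is hypothetical.

```python
import json
from typing import Tuple

# Response body copied from the example above.
response_body = """
{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}
"""


def top_prediction(body: str) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    scores = json.loads(body)
    label = max(scores, key=scores.get)
    return label, scores[label]


label, score = top_prediction(response_body)
print(f"{label}: {score:.3f}")  # lynx: 0.546
```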

Performance testing

Test the performance of single-machine model download through the TorchServe API after integrating Dragonfly P2P. Because the results depend on the network environment of the machine itself, the absolute download time is not important; what matters is the relative difference in download time between scenarios.

Bar chart showing TorchServe API, TorchServe API & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache performance based on time to download

TorchServe API: Use the signed URL provided by object storage to download the model directly.
TorchServe API & Dragonfly Cold Boot: Use the TorchServe API to download the model via the Dragonfly P2P network with no cache hits.
Hit Remote Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the remote peer cache.
Hit Local Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the local peer cache.

The test results show that integrating TorchServe with Dragonfly effectively reduces the file download time. Note that this was a single-machine test, which means that when the cache is hit, performance is limited by the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be faster.
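
Since the comparison is about ratios rather than absolute times, one way to read such a chart is to express each scenario as a speedup over the direct download. The sketch below shows the arithmetic. The timings are hypothetical placeholders for illustration only, not values from the chart.

```python
def speedup(baseline_seconds: float, scenario_seconds: float) -> float:
    """Return how many times faster the scenario is than the baseline."""
    if baseline_seconds <= 0 or scenario_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / scenario_seconds


# Hypothetical timings in seconds, for illustration only; NOT measurements.
timings = {
    "TorchServe API": 120.0,
    "TorchServe API & Dragonfly Cold Boot": 130.0,
    "Hit Remote Peer": 40.0,
    "Hit Local Peer": 10.0,
}

baseline = timings["TorchServe API"]
for scenario, seconds in timings.items():
    print(f"{scenario}: {speedup(baseline, seconds):.2f}x")
```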

Hugging Face accelerates distribution of models and datasets based on Dragonfly

November 16, 2023 · 10 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

This document shows how to use Dragonfly with Hugging Face. Datasets and models are large files, and many services often download them at the same time. As a result, the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Prerequisites

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonfly

For detailed installation documentation based on kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes cluster

Create the kind multi-node cluster configuration file kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm charts

Create the helm charts configuration file charts-config.yaml and set dfdaemon.config.proxy.registryMirror.url to the address of the Hugging Face Hub’s LFS server. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  metrics:
    enable: true
  hostNetwork: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Key-Pair-Id&Policy&Signature'
      security:
        insecure: true
      tcpListen:
        listen: 0.0.0.0
        port: 65001
      registryMirror:
        # When enabled, use the "X-Dragonfly-Registry" header for the remote instead of url.
        dynamic: true
        # URL for the registry mirror.
        url: https://cdn-lfs.huggingface.co
        # Whether to ignore https certificate errors.
        insecure: true
        # Optional certificates if the remote server uses self-signed certificates.
        certs: []
        # Whether to request the remote registry directly.
        direct: false
        # Whether to use proxies to decide if dragonfly should be used.
        useProxies: true
      proxies:
        - regx: repos.*
          useHTTPS: true

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
NAME: dragonfly
LAST DEPLOYED: Wed Oct 19 04:23:22 2022
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-rhnr6             1/1     Running   4 (101s ago)   3m27s
dragonfly-dfdaemon-s6sv5             1/1     Running   5 (111s ago)   3m27s
dragonfly-manager-67f97d7986-8dgn8   1/1     Running   0              3m27s
dragonfly-mysql-0                    1/1     Running   0              3m27s
dragonfly-redis-master-0             1/1     Running   0              3m27s
dragonfly-redis-replicas-0           1/1     Running   1 (115s ago)   3m27s
dragonfly-redis-replicas-1           1/1     Running   0              95s
dragonfly-redis-replicas-2           1/1     Running   0              70s
dragonfly-scheduler-0                1/1     Running   0              3m27s
dragonfly-seed-peer-0                1/1     Running   2 (95s ago)    3m27s
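
When this check is scripted, the table can be parsed instead of read by eye. The sketch below assumes the default column layout of kubectl get po shown above (NAME, READY, STATUS first). The helper name not_ready_pods is hypothetical, and the sample rows are copied from the output above.

```python
from typing import List

# A few rows copied from the output above.
sample_output = """\
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-rhnr6             1/1     Running   4 (101s ago)   3m27s
dragonfly-manager-67f97d7986-8dgn8   1/1     Running   0              3m27s
dragonfly-scheduler-0                1/1     Running   0              3m27s
dragonfly-seed-peer-0                1/1     Running   2 (95s ago)    3m27s
"""


def not_ready_pods(kubectl_output: str) -> List[str]:
    """Return pods that are not Running or whose containers are not all ready."""
    problems = []
    for row in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = row.split()[:3]
        ready_now, ready_total = ready.split("/")
        if status != "Running" or ready_now != ready_total:
            problems.append(name)
    return problems


print(not_ready_pods(sample_output))  # []
```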

Create the peer service configuration file peer-service-config.yaml with the following content:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml
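
Once the service exists, the Dragonfly proxy should be reachable from the host through the port chain configured above: hostPort 65001 in kind-config.yaml, nodePort 30950 in the peer service, and port 65001 on dfdaemon. A minimal reachability check is sketched below. The helper name port_is_open is hypothetical, and an open port only shows that something is listening, not that the proxy is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 65001 is the hostPort mapped in kind-config.yaml.
    print(port_is_open("127.0.0.1", 65001))
```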

Use Hub Python Library to download files and distribute traffic through Dragonfly

Any API in the Hub Python Library that uses the Requests library for downloading files can distribute the download traffic in the P2P network by mounting DragonflyAdapter on the requests Session.

Download a single file with Dragonfly

A single file can be downloaded using hf_hub_download, with the traffic distributed through the Dragonfly peer.

Create the hf_hub_download_dragonfly.py file. It uses DragonflyAdapter to forward the file download requests of the LFS protocol to the Dragonfly HTTP proxy, so that the files are distributed over the P2P network. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import hf_hub_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
  def get_connection(self, url, proxies=None):
    # Change the scheme of the LFS request to download large files from https:// to http://,
    # so that the Dragonfly HTTP proxy can be used.
    if url.startswith('https://cdn-lfs.huggingface.co'):
      url = url.replace('https://', 'http://')
    return super().get_connection(url, proxies)

  def add_headers(self, request, kwargs):
    super().add_headers(request, kwargs)
    # If there are multiple different LFS repositories, you can override the
    # default repository address by adding X-Dragonfly-Registry header.
    if request.url.find('example.com') != -1:
      request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
  session = requests.Session()
  session.mount('http://', DragonflyAdapter())
  session.mount('https://', DragonflyAdapter())
  session.proxies = {'http': 'http://127.0.0.1:65001'}
  return session

# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)

hf_hub_download(repo_id="tiiuae/falcon-rw-1b", filename="pytorch_model.bin")

Download a single file of the LFS protocol with Dragonfly:

$ python3 hf_hub_download_dragonfly.py
(…)YkNX13a46FCg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.62G/2.62G [00:52<00:00, 49.8MB/s]
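
The session in the script only configures a proxy for the http scheme, which is why DragonflyAdapter rewrites LFS URLs from https:// to http://. The rewrite can be checked in isolation with the sketch below. The function name rewrite_lfs_url is hypothetical, and unlike the script it replaces only the leading scheme (count of 1).

```python
# Host prefix checked by DragonflyAdapter.get_connection in the script above.
LFS_HOST_PREFIX = "https://cdn-lfs.huggingface.co"


def rewrite_lfs_url(url: str) -> str:
    """Switch LFS download URLs to http://; leave every other URL unchanged."""
    if url.startswith(LFS_HOST_PREFIX):
        return url.replace("https://", "http://", 1)
    return url


# LFS download: rewritten so that it goes through the HTTP proxy.
print(rewrite_lfs_url("https://cdn-lfs.huggingface.co/repos/example"))
# Regular Hub API call: left as-is and sent directly.
print(rewrite_lfs_url("https://huggingface.co/api/models"))
```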

Verify a single file download with Dragonfly

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon
# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms   {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}
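
For repeated checks, the cost and task ID can be extracted from such a line programmatically. The sketch below assumes the log format matches the example output above (a cost in milliseconds followed by a JSON object). The helper name parse_peer_task_done is hypothetical.

```python
import json
import re
from typing import Optional

# Matches the "peer task done" format shown in the example output.
LOG_PATTERN = re.compile(r"peer task done, cost: (?P<cost_ms>\d+)ms\s+(?P<fields>\{.*\})")

# Log line copied from the example output above.
sample_line = (
    'peer task done, cost: 28349ms   {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", '
    '"task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", '
    '"component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}'
)


def parse_peer_task_done(line: str) -> Optional[dict]:
    """Return the JSON fields plus cost_ms, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    record = json.loads(match.group("fields"))
    record["cost_ms"] = int(match.group("cost_ms"))
    return record


record = parse_peer_task_done(sample_line)
print(record["task"], record["cost_ms"])
```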

Download a snapshot of the repo with Dragonfly

A snapshot of the repo can be downloaded using snapshot_download, with the traffic distributed through the Dragonfly peer.

Create the snapshot_download_dragonfly.py file. It uses DragonflyAdapter to forward the file download requests of the LFS protocol to the Dragonfly HTTP proxy, so that the files are distributed over the P2P network. Only the files of the LFS protocol will be distributed through the Dragonfly P2P network. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import snapshot_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
  def get_connection(self, url, proxies=None):
    # Change the scheme of the LFS request to download large files from https:// to http://,
    # so that the Dragonfly HTTP proxy can be used.
    if url.startswith('https://cdn-lfs.huggingface.co'):
      url = url.replace('https://', 'http://')
    return super().get_connection(url, proxies)

  def add_headers(self, request, kwargs):
    super().add_headers(request, kwargs)
    # If there are multiple different LFS repositories, you can override the
    # default repository address by adding X-Dragonfly-Registry header.
    if request.url.find('example.com') != -1:
      request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
  session = requests.Session()
  session.mount('http://', DragonflyAdapter())
  session.mount('https://', DragonflyAdapter())
  session.proxies = {'http': 'http://127.0.0.1:65001'}
  return session

# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)

snapshot_download(repo_id="tiiuae/falcon-rw-1b")

Download a snapshot of the repo with Dragonfly:

$ python3 snapshot_download_dragonfly.py
(…)03165eb22f0a867d4e6a64d34fce19/README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.60k/7.60k [00:00<00:00, 374kB/s]
(…)7d4e6a64d34fce19/configuration_falcon.py: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.70k/6.70k [00:00<00:00, 762kB/s]
(…)f0a867d4e6a64d34fce19/modeling_falcon.py: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 56.9k/56.9k [00:00<00:00, 5.35MB/s]
(…)3165eb22f0a867d4e6a64d34fce19/merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 9.07MB/s]
(…)867d4e6a64d34fce19/tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 234/234 [00:00<00:00, 106kB/s]
(…)eb22f0a867d4e6a64d34fce19/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 27.7MB/s]
(…)3165eb22f0a867d4e6a64d34fce19/vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 798k/798k [00:00<00:00, 19.7MB/s]
(…)7d4e6a64d34fce19/special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99.0/99.0 [00:00<00:00, 45.3kB/s]
(…)67d4e6a64d34fce19/generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 115/115 [00:00<00:00, 5.02kB/s]
(…)165eb22f0a867d4e6a64d34fce19/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.05k/1.05k [00:00<00:00, 75.9kB/s]
(…)eb22f0a867d4e6a64d34fce19/.gitattributes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 171kB/s]
(…)t-oSSW23tawg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.62G/2.62G [00:50<00:00, 52.1MB/s]
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:50<00:00,  4.23s/it]

Verify a snapshot of the repo download with Dragonfly

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon
# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms   {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}

Performance testing

Test the performance of single-machine file download through the hf_hub_download API after integrating the Hugging Face Python Library with Dragonfly P2P. Because the results depend on the network environment of the machine itself, the absolute download time is not important; what matters is the relative difference in download time between scenarios.

Bar chart showing performance testing result

Hugging Face Python Library: Use hf_hub_download API to download models directly.
Hugging Face Python Library & Dragonfly Cold Boot: Use hf_hub_download API to download models via Dragonfly P2P network and no cache hits.
Hit Dragonfly Remote Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the remote peer cache.
Hit Dragonfly Local Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the local peer cache.
Hit Hugging Face Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the Hugging Face local cache.

The test results show that integrating the Hugging Face Python Library with Dragonfly P2P effectively reduces the file download time. Note that this was a single-machine test, which means that when the cache is hit, performance is limited by the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be faster.

Dragonfly completes security audit!

September 15, 2023 · 3 min read

This summer, over four engineer weeks, Trail of Bits and OSTIF collaborated on a security audit of dragonfly. A CNCF Incubating Project, dragonfly provides peer-to-peer file distribution. Also in scope was the repository of the sub-project Nydus, which handles image distribution. The engagement was outlined and framed around several goals relevant to the security and longevity of the project as it moves towards graduation.

The Trail of Bits audit team approached the audit with static analysis and manual testing, combining automated and manual processes. By introducing semgrep and CodeQL tooling, performing a manual review of client, scheduler, and manager code, and fuzz testing the gRPC handlers, the audit team identified a variety of findings that the project can use to improve its security. By concentrating on high-level business logic and externally accessible endpoints, the Trail of Bits audit team was able to focus the audit and provide guidance and recommendations for dragonfly’s future work.

Recorded in the audit report are 19 findings. Five of the findings were ranked as high, one as medium, four low, five informational, and four were considered undetermined. Nine of the findings were categorized as Data Validation, three of which were high severity. Ranked and reviewed as well was dragonfly’s Codebase Maturity, comprising eleven aspects of project code which are analyzed individually in the report.

This is a large project and could not be reviewed in total due to time constraints and scope. Multiple specialized features were outside the scope of this audit for those reasons. This project is a great opportunity for continued audit work to improve and elevate code and harden security before graduation. Ongoing security efforts are critical, as security is a moving target.

We would like to thank the Trail of Bits team, particularly Dan Guido, Jeff Braswell, Paweł Płatek, and Sam Alws for their work on this project. Thank you to the dragonfly maintainers and contributors, specifically Wenbo Qi, for their ongoing work and contributions to this engagement. Finally, we are grateful to the CNCF for funding this audit and supporting open source security efforts.

Using dragonfly to distribute images and files for multi-cluster Kubernetes

September 1, 2023 · 14 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Dragonfly provides efficient, stable and secure file distribution and image acceleration based on P2P technology, aiming to be the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.

This article introduces the deployment of dragonfly for multi-cluster kubernetes. A dragonfly cluster manages a cluster within a single network. If you have two clusters with disconnected networks, you can use two dragonfly clusters, each managing its own cluster.

The recommended deployment for multi-cluster kubernetes is to use one dragonfly cluster per kubernetes cluster, and a centralized manager service to manage multiple dragonfly clusters. A peer can only transmit data within its own dragonfly cluster, so when each kubernetes cluster deploys its own dragonfly cluster, that kubernetes cluster forms one P2P network, and its peers can only schedule and transmit data inside it.

Screenshot showing diagram flow between Network A / Kubernetes Cluster A and Network B / Kubernetes Cluster B towards Manager

Setup kubernetes cluster

Kind is recommended if no Kubernetes cluster is available for testing.

Create kind cluster configuration file kind-config.yaml, configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 8080
    labels:
      cluster: a
  - role: worker
    labels:
      cluster: a
  - role: worker
    labels:
      cluster: b
  - role: worker
    labels:
      cluster: b

Create cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster A:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster A

To create dragonfly cluster A, install the schedulers, seed peers, peers and centralized manager included in the cluster using helm.

Create dragonfly cluster A based on helm charts

Create dragonfly cluster A charts configuration file charts-config-cluster-a.yaml, configuration content is as follows:

containerRuntime:
  containerd:
    enable: true
    injectConfigPath: true
    registries:
      - 'https://ghcr.io'
scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  nodeSelector:
    cluster: a
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  nodeSelector:
    cluster: a
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  nodeSelector:
    cluster: a
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
manager:
  image: dragonflyoss/manager
  tag: latest
  nodeSelector:
    cluster: a
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
jaeger:
  enable: true

Create dragonfly cluster A using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace cluster-a dragonfly dragonfly/dragonfly -f charts-config-cluster-a.yaml
NAME: dragonfly
LAST DEPLOYED: Mon Aug  7 22:07:02 2023
NAMESPACE: cluster-a
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace cluster-a -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace cluster-a $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace cluster-a port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace cluster-a -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace cluster-a $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
  export JAEGER_QUERY_PORT=$(kubectl --namespace cluster-a get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
  kubectl --namespace cluster-a port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
  echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly cluster A is deployed successfully:

$ kubectl get po -n cluster-a
NAME                                 READY   STATUS    RESTARTS      AGE
dragonfly-dfdaemon-7t6wc             1/1     Running   0             3m18s
dragonfly-dfdaemon-r45bk             1/1     Running   0             3m18s
dragonfly-jaeger-84dbfd5b56-fmhh6    1/1     Running   0             3m18s
dragonfly-manager-75f4c54d6d-tr88v   1/1     Running   0             3m18s
dragonfly-mysql-0                    1/1     Running   0             3m18s
dragonfly-redis-master-0             1/1     Running   0             3m18s
dragonfly-redis-replicas-0           1/1     Running   1 (2m ago)    3m18s
dragonfly-redis-replicas-1           1/1     Running   0             96s
dragonfly-redis-replicas-2           1/1     Running   0             45s
dragonfly-scheduler-0                1/1     Running   0             3m18s
dragonfly-seed-peer-0                1/1     Running   1 (37s ago)   3m18s

Create NodePort service of the manager REST service

Create the manager REST service configuration file manager-rest-svc.yaml, configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: manager-rest
  namespace: cluster-a
spec:
  type: NodePort
  ports:
    - name: http
      nodePort: 30950
      port: 8080
  selector:
    app: dragonfly
    component: manager
    release: dragonfly

Create manager REST service using the configuration file:

kubectl apply -f manager-rest-svc.yaml -n cluster-a

Visit manager console

Visit localhost:8080 to see the manager console. Sign in to the console with the default root user; the username is root and the password is dragonfly.

Screenshot showing Dragonfly welcome back page

Screenshot showing Dragonfly cluster page

By default, Dragonfly automatically creates a record for dragonfly cluster A in the manager when it is installed for the first time. You can click dragonfly cluster A to view the details.

Screenshot showing Cluster-1 page on Dragonfly

Create dragonfly cluster B

To create dragonfly cluster B, first create a dragonfly cluster record in the manager console, then install the schedulers, seed peers and peers included in the dragonfly cluster using helm.

Create dragonfly cluster B in the manager console

Visit manager console and click the ADD CLUSTER button to add dragonfly cluster B record. Note that the IDC is set to cluster-2 to match the peer whose IDC is cluster-2.

Screenshot showing Create Cluster page on Dragonfly

The dragonfly cluster B record is created successfully.

Screenshot showing Cluster page on Dragonfly

Use scopes to distinguish different dragonfly clusters

A dragonfly cluster serves the peers that fall within its scopes. It will provide scheduler services and seed peer services to peers in the scope. The scopes of the dragonfly cluster are configured in the console when the cluster is created or updated. The scopes of the peer are configured in the peer YAML config; the fields are host.idc, host.location and host.advertiseIP, refer to dfdaemon config.

If the peer scopes match the dragonfly cluster scopes, the peer will use that dragonfly cluster’s scheduler and seed peer first. If there is no matching dragonfly cluster, the peer uses the default dragonfly cluster.

Location: The dragonfly cluster needs to serve all peers in the location. When the location in the peer configuration matches the location in the dragonfly cluster, the peer will preferentially use the scheduler and the seed peer of the dragonfly cluster. The location levels are separated by “|”, for example “area|country|province|city”.

IDC: The dragonfly cluster needs to serve all peers in the IDC. When the IDC in the peer configuration matches the IDC in the dragonfly cluster, the peer will preferentially use the scheduler and the seed peer of the dragonfly cluster. IDC has higher priority than location in the scopes.

CIDRs: The dragonfly cluster needs to serve all peers in the CIDRs. The advertise IP is reported from the peer configuration when the peer starts; if the advertise IP is empty in the peer configuration, the peer automatically uses its exposed IP as the advertise IP. When the advertise IP of the peer matches the CIDRs in the dragonfly cluster, the peer will preferentially use the scheduler and the seed peer of the dragonfly cluster. CIDRs have higher priority than IDC in the scopes.
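
The priority order described above (CIDRs over IDC over location, with the default cluster as fallback) can be sketched as a small selection function. This is a simplified model for illustration, not the manager's actual implementation: the class and function names are hypothetical, and location matching is assumed to be an exact string comparison.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterScopes:
    name: str
    idc: str = ""
    location: str = ""
    cidrs: List[str] = field(default_factory=list)
    is_default: bool = False


@dataclass
class PeerHost:
    advertise_ip: str
    idc: str = ""
    location: str = ""


def match_rank(peer: PeerHost, cluster: ClusterScopes) -> int:
    """Return 3 for a CIDR match, 2 for IDC, 1 for location, 0 for no match."""
    ip = ipaddress.ip_address(peer.advertise_ip)
    if any(ip in ipaddress.ip_network(cidr) for cidr in cluster.cidrs):
        return 3
    if peer.idc and peer.idc == cluster.idc:
        return 2
    if peer.location and peer.location == cluster.location:
        return 1
    return 0


def select_cluster(peer: PeerHost, clusters: List[ClusterScopes]) -> Optional[ClusterScopes]:
    """Pick the best-matching cluster, falling back to the default cluster."""
    best = max(clusters, key=lambda c: match_rank(peer, c), default=None)
    if best is not None and match_rank(peer, best) > 0:
        return best
    return next((c for c in clusters if c.is_default), None)


clusters = [
    ClusterScopes(name="cluster-1", is_default=True),
    ClusterScopes(name="cluster-2", idc="cluster-2"),
]
peer = PeerHost(advertise_ip="10.244.1.5", idc="cluster-2")
print(select_cluster(peer, clusters).name)  # cluster-2
```

In this walkthrough only the IDC scope is used: the cluster B record sets IDC to cluster-2, and the peers in cluster B set host.idc to the same value.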

Create dragonfly cluster B based on helm charts

Create charts configuration with cluster information in the manager console.

Screenshot showing Cluster-2 page on Dragonfly

scheduler.config.manager.schedulerClusterID is the Scheduler cluster ID from the cluster-2 information in the manager console.
scheduler.config.manager.addr is the address of the manager gRPC server.
seedPeer.config.scheduler.manager.seedPeer.clusterID is the Seed peer cluster ID from the cluster-2 information in the manager console.
seedPeer.config.scheduler.manager.netAddrs[0].addr is the address of the manager gRPC server.
dfdaemon.config.host.idc is the IDC from the cluster-2 information in the manager console.
dfdaemon.config.scheduler.manager.netAddrs[0].addr is the address of the manager gRPC server.
externalManager.host is the host of the manager gRPC server.
externalRedis.addrs[0] is the address of the redis.

Create the charts configuration file charts-config-cluster-b.yaml for dragonfly cluster B with the following content:

containerRuntime:
  containerd:
    enable: true
    injectConfigPath: true
    registries:
      - 'https://ghcr.io'
scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  nodeSelector:
    cluster: b
  replicas: 1
  config:
    manager:
      addr: dragonfly-manager.cluster-a.svc.cluster.local:65003
      schedulerClusterID: 2
seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  nodeSelector:
    cluster: b
  replicas: 1
  config:
    scheduler:
      manager:
        netAddrs:
          - type: tcp
            addr: dragonfly-manager.cluster-a.svc.cluster.local:65003
        seedPeer:
          enable: true
          clusterID: 2
dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  nodeSelector:
    cluster: b
  config:
    host:
      idc: cluster-2
    scheduler:
      manager:
        netAddrs:
          - type: tcp
            addr: dragonfly-manager.cluster-a.svc.cluster.local:65003
manager:
  enable: false
externalManager:
  enable: true
  host: dragonfly-manager.cluster-a.svc.cluster.local
  restPort: 8080
  grpcPort: 65003
redis:
  enable: false
externalRedis:
  addrs:
    - dragonfly-redis-master.cluster-a.svc.cluster.local:6379
  password: dragonfly
mysql:
  enable: false
jaeger:
  enable: true

Create dragonfly cluster B using the configuration file:

$ helm install --wait --create-namespace --namespace cluster-b dragonfly dragonfly/dragonfly -f charts-config-cluster-b.yaml
NAME: dragonfly
LAST DEPLOYED: Mon Aug  7 22:13:51 2023
NAMESPACE: cluster-b
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace cluster-b -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace cluster-b $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace cluster-b port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace cluster-b -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace cluster-b $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
  export JAEGER_QUERY_PORT=$(kubectl --namespace cluster-b get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
  kubectl --namespace cluster-b port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
  echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly cluster B is deployed successfully:

$ kubectl get po -n cluster-b
NAME                                READY   STATUS    RESTARTS   AGE
dragonfly-dfdaemon-q8bsg            1/1     Running   0          67s
dragonfly-dfdaemon-tsqls            1/1     Running   0          67s
dragonfly-jaeger-84dbfd5b56-rg5dv   1/1     Running   0          67s
dragonfly-scheduler-0               1/1     Running   0          67s
dragonfly-seed-peer-0               1/1     Running   0          67s

Create dragonfly cluster B successfully.

Screenshot showing Cluster-2 page on Dragonfly

Using dragonfly to distribute images for multi-cluster kubernetes

Containerd pull image back-to-source for the first time through dragonfly in cluster A

Pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5 image in kind-worker node:

docker exec -i kind-worker /usr/local/bin/crictl pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5

Expose jaeger’s port 16686:

kubectl --namespace cluster-a port-forward service/dragonfly-jaeger-query 16686:16686

Visit the Jaeger page at http://127.0.0.1:16686/search and search for traces with the tag http.url="/v2/dragonflyoss/dragonfly2/scheduler/blobs/sha256:82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399?ns=ghcr.io":

Screenshot showing Jaeger page (dragonfly-dfget)

Tracing details:

Screenshot showing dragonfly-dfget tracing details on Jaeger UI

When the image is pulled back-to-source for the first time through dragonfly, the peer uses cluster-a’s scheduler and seed peer. It takes 1.47s to download the 82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399 layer.

Containerd pull image hits the cache of remote peer in cluster A

Pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5 image in kind-worker2 node:

docker exec -i kind-worker2 /usr/local/bin/crictl pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5

Expose jaeger’s port 16686:

kubectl --namespace cluster-a port-forward service/dragonfly-jaeger-query 16686:16686

Visit the Jaeger page at http://127.0.0.1:16686/search and search for traces with the tag http.url="/v2/dragonflyoss/dragonfly2/scheduler/blobs/sha256:82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399?ns=ghcr.io":

Screenshot showing dragonfly-dfget on Jaeger UI

Tracing details:

Screenshot showing dragonfly-dfget Tracing Details on Jaeger UI

When the image pull hits the cache of a remote peer, the peer uses cluster-a’s scheduler and seed peer. It takes 37.48ms to download the 82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399 layer.

Containerd pull image back-to-source for the first time through dragonfly in cluster B

Pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5 image in kind-worker3 node:

docker exec -i kind-worker3 /usr/local/bin/crictl pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5

Expose jaeger’s port 16686:

kubectl --namespace cluster-b port-forward service/dragonfly-jaeger-query 16686:16686

Visit the Jaeger page at http://127.0.0.1:16686/search and search for traces with the tag http.url="/v2/dragonflyoss/dragonfly2/scheduler/blobs/sha256:82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399?ns=ghcr.io":

Screenshot showing dragonfly-dfget on Jaeger UI

Tracing details:

Screenshot showing dragonfly-dfget Tracing Details on Jaeger UI

When the image is pulled back-to-source for the first time through dragonfly, the peer uses cluster-b’s scheduler and seed peer. It takes 4.97s to download the 82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399 layer.

Containerd pull image hits the cache of remote peer in cluster B

Pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5 image in kind-worker4 node:

docker exec -i kind-worker4 /usr/local/bin/crictl pull ghcr.io/dragonflyoss/dragonfly2/scheduler:v2.0.5

Expose jaeger’s port 16686:

kubectl --namespace cluster-b port-forward service/dragonfly-jaeger-query 16686:16686

Screenshot showing dragonfly-dfget on Jaeger UI

Tracing details:

Screenshot showing dragonfly-dfget Tracing Details on Jaeger UI

When the image pull hits the cache of a remote peer, the peer uses cluster-b’s scheduler and seed peer. It takes 14.53ms to download the 82cbeb56bf8065dfb9ff5a0c6ea212ab3a32f413a137675df59d496e68eaf399 layer.
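To put the four measurements side by side, the following sketch computes how much faster the remote-peer cache hit was than the back-to-source download in each cluster. The timings are the ones reported above; the dictionary layout and the speedup helper are only illustrative, and single measurements like these are indicative rather than a benchmark.

```python
# Layer download times reported in the Jaeger traces above, in seconds.
timings = {
    "cluster A": {"back_to_source": 1.47, "remote_peer_cache": 0.03748},
    "cluster B": {"back_to_source": 4.97, "remote_peer_cache": 0.01453},
}


def speedup(back_to_source: float, cache_hit: float) -> float:
    """How many times faster the cache hit was than the first download."""
    return back_to_source / cache_hit


for name, t in timings.items():
    ratio = speedup(t["back_to_source"], t["remote_peer_cache"])
    print(f"{name}: cache hit is about {ratio:.1f}x faster")
```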

Links

Dragonfly Website: https://d7y.io/

Dragonfly Github Repo: https://github.com/dragonflyoss/dragonfly

Dragonfly Slack Channel: #dragonfly on CNCF Slack

Dragonfly Discussion Group: dragonfly-discuss@googlegroups.com

Dragonfly Twitter: @dragonfly_oss

Nydus Website: https://nydus.dev/

Nydus Github Repo: https://github.com/dragonflyoss/image-service

Dragonfly v2.1.0 is released!

August 7, 2023 · 6 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Dragonfly v2.1.0 is released! 🎉🎉🎉 Thanks to Xinxin Zhao[1] for helping to refactor the console[2]; the manager now provides a new console for users to operate Dragonfly. You are welcome to visit the d7y.io[3] website.

Announcement screenshot from Github mentioning "Dragonfly v2.1.0 is released!"

Features

Console v1.0.0[4] is released and provides a new console for users to operate Dragonfly.
Add the network topology feature, which probes the network latency between peers to provide better scheduling capabilities.
Provide the ability to control the features of the scheduler in the manager. If the scheduler preheat feature is not in the feature flags, the scheduler stops providing preheating.
dfstore adds GetObjectMetadatas and CopyObject to support using Dragonfly as the JuiceFS backend.
Add the personal access tokens feature in the manager; a personal access token contains your security credentials for the RESTful open API.
Add TLS config to the manager REST server.
Fix dfdaemon failing to start when there is no available scheduler address.
Add cluster in the manager; a cluster contains a scheduler cluster and a seed peer cluster.
Fix object downloads failing in dfstore when concurrent is enabled in dfdaemon.
Scheduler adds the database field in its config and moves the redis config to the database field.
Replace net.Dial with gRPC health check in dfdaemon.
Fix filtering and evaluation in scheduling. Since the final length of the filter is the candidateParentLimit used, the parents after the filter are wrong.
Fix storage being unable to write records to file when bufferSize is zero.
Hide sensitive information in logs, such as the token in the header.
Use unscoped delete when destroying the manager’s resources.
Add uk_scheduler index and uk_seed_peer index in the table of the database.
Remove security domain feature and security feature in the manager.
Add advertise port config to manager and scheduler.
Fix fsm failing to change state when registering a task.

Break Change

The M:N relationship model between the scheduler cluster and the seed peer cluster is no longer supported. In the future, a P2P cluster will be a cluster in the manager, and a cluster will only include a scheduler cluster and a seed peer cluster.

Console

Screenshot showing Dragonfly Console welcome back page

You can see Manager Console[5] for more details.

AI Infrastructure

Triton Inference Server[6] uses Dragonfly to distribute model files; refer to #2185[7]. Developers who are interested in the dragonfly repository agent[8] project can contact gaius.qi@gmail.com.
TorchServe[9] uses Dragonfly to distribute model files. Developers have already participated in the dragonfly endpoint[10] project, and the feature will be released in v2.1.1.
Fluid[11] downloads data through Dragonfly when running on JuiceFS[12]; the feature will be released in v2.1.1.
Dragonfly helps Volcano Engine AIGC inference accelerate image distribution through P2P technology[13].
There have been many cases in the community of using Dragonfly to distribute data in AI scenarios based on P2P technology. In the inference stage, when inference services download models concurrently, Dragonfly can effectively relieve the bandwidth pressure on the model registry and improve the download speed. The community will share the topic 《Dragonfly: Intro, Updates and AI Model Distribution in the Practice of Kuaishou – Wenbo Qi, Ant Group & Zekun Liu, Kuaishou Technology》[14] with Kuaishou[15] at KubeCon + CloudNativeCon + Open Source Summit China 2023[16]; please follow it if you are interested.

Maintainers

The community has added four new Maintainers, hoping to help more contributors participate in the community.

Yiyang Huang[17]: He works for Volcano Engine and will focus on the engineering work for Dragonfly.
Manxiang Wen[18]: He works for Baidu and will focus on the engineering work for Dragonfly.
Mohammed Farooq[19]: He works for Intel and will focus on the engineering work for Dragonfly.
Zhou Xu[20]: He is a PhD student at Dalian University of Technology and will focus on the intelligent scheduling algorithms.

Others

You can see CHANGELOG[21] for more details.

Ant Group security technology’s Nydus and Dragonfly image acceleration practices

May 1, 2023 · 24 min read

CNCF projects highlighted in this post, and migrated by mingcheng.

Introduction

ZOLOZ is a global security and risk management platform under Ant Group. Through biometric, big data analysis, and artificial intelligence technologies, ZOLOZ provides safe and convenient security and risk management solutions for users and institutions. ZOLOZ has provided security and risk management technology support for more than 70 partners in 14 countries and regions, including China, Indonesia, Malaysia, and the Philippines. It has already covered areas such as finance, insurance, securities, credit, telecommunications, and public services, and has served over 1.2 billion users.

With the rapid adoption of Kubernetes and cloud-native technology, ZOLOZ applications have begun to be deployed on public clouds at large scale using containers. The images of ZOLOZ applications have been maintained and updated for a long time, and both the number of layers and the overall size have grown large (hundreds of MBs to several GBs). In particular, the base image of ZOLOZ’s AI algorithm inference application is much larger than general application images (pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime on Docker Hub is 4.92GB, compared to centos:latest at only about 234MB).

For a container cold start, i.e., when the image is not available locally, the image must be downloaded from the registry before the container can be created. In the production environment, a container cold start often takes several minutes, and as the scale increases, images may no longer download quickly from the registry because of network congestion within the cluster. Such large images have brought many challenges to application updates and scaling. With the continuous promotion of containerization on public clouds, ZOLOZ applications mainly face three challenges:

The algorithm image is large, and pushing it to the cloud image repository takes a long time. During the development process, when testing in the testing environment, developers often hope to iterate quickly and verify quickly. However, every time a branch is modified and released for verification, it takes several tens of minutes, which is very inefficient.
Pulling the algorithm image takes a long time, and pulling many image files during cluster expansion can easily saturate the network interfaces of the cluster and affect the normal operation of the business.
The cluster machine takes a long time to start up, making it difficult to meet the needs of sudden traffic increases and elastic automatic scaling.

Although various compromise solutions have been attempted, these solutions all have their shortcomings. Now, in collaboration with multiple technical teams such as Ant Group, Alibaba Cloud, and ByteDance, a more universal solution on public clouds has been developed, which has low transformation costs and good performance, and currently appears to be an ideal solution.

Terminology

OCI: Open Container Initiative, a Linux Foundation project initiated by Docker in June 2015, aimed at designing open standards for operating system-level virtualization, most importantly Linux containers.

OCI Manifest: An artifact that follows the OCI Image Spec.

BuildKit: A new generation Docker build tool produced by Docker that is more efficient, Dockerfile-independent, and more suitable for cloud-native applications.

Image: In this article, the image refers to OCI Manifest, including Helm Chart and other OCI Manifests.

Image Repository: An artifact repository implemented in accordance with the OCI Distribution Spec.

ECS: A resource collection consisting of CPUs, memory, and cloud disks, with each type of resource logically corresponding to a computing hardware entity in a data center.

ACR: Alibaba Cloud’s image repository service.

ACK: Alibaba Cloud Container Service Kubernetes version provides high-performance and scalable container application management capabilities and supports full lifecycle management of enterprise-level containerized applications.

ACI: Ant Continuous Integration, is a CI/CD efficiency product under the Ant Group’s research and development efficiency umbrella, which is centered around pipelines. With intelligent automated construction, testing, and deployment, it provides a lightweight continuous delivery solution based on code flow to improve the work efficiency of team development.

Private Zone: A private DNS service based on the Virtual Private Cloud (VPC) environment. This service allows private domain names to be mapped to IP addresses in one or more custom VPCs.

P2P: Peer-to-peer technology. When a Peer in a P2P network downloads data from the server, it can also act as a server for other Peers once it has downloaded the data. When a large number of nodes download simultaneously, subsequent downloads can be served by other Peers instead of the server, which reduces the pressure on the server.

Dragonfly: Dragonfly is a file distribution and image acceleration system based on P2P technology and is the standard solution and best practice in the field of image acceleration in cloud-native architecture. It is now hosted by the Cloud Native Computing Foundation (CNCF) as an incubation-level project.

Nydus: Nydus is a sub-project of Dragonfly’s image acceleration framework that provides on-demand loading of container images and supports millions of accelerated image container creations in production environments every day. It has significant advantages over OCIv1 in terms of startup performance, image space optimization, end-to-end data consistency, kernel-level support, etc.

LifseaOS: A lightweight, fast, secure, and image-atomic management container optimization operating system launched by Alibaba Cloud for container scenarios. Compared with traditional operating systems, the number of software packages is reduced by 60%, and the image size is reduced by 70%. The first-time startup time is reduced from over 1 minute to around 2 seconds. It supports image read-only and OSTree technology, versioning management of OS images, and updating software packages or fixed configurations on the operating system on an image-level basis.

Solution

1: Large image size

Reduce the size of the base image

The basic OS is changed from CentOS 7 to AnolisOS 8, and the installation of maintenance tools is streamlined. Only a list of essential tools (basic maintenance tools, runtime dependencies, log cleaning, security baselines, etc.) is installed by default, and the configuration of security hardening is simplified. The base image is reduced from 1.63GB to 300MB.

AnolisOS Repository: https://hub.docker.com/r/openanolis/anolisos/tags

Dockerfile optimization

Reduce unnecessary build resources and time through Dockerfile writing constraints, image inspection, and other means.

Dockerfile Best Practices: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Parallel building and build caching

Ant Group’s Build Center uses the Nydus community-optimized version of BuildKit, which supports layer-level caching. Previous artifacts are accurately referenced and cached, and for multi-stage Dockerfiles, BuildKit can execute different stages in parallel.

2: Slow pushing image

Use Nydus images for block-level data deduplication

In traditional OCI images, the smallest unit that can be shared between different images is the layer, so deduplication is very inefficient. There may be a lot of duplicate data between layers, yet layers with even slight differences are treated as different layers. Because of the way the OCI Image Spec handles deleted files and hard links, files that have been deleted in an upper layer may still exist in a lower layer and remain part of the image. In addition, OCI images use the tar+gzip format to express layers, and the tar format does not define an order for archive entries. As a result, users who build the same image on different machines may get different images because the underlying file systems differ, even though the actual content of those images is identical. This sharply increases the amount of uploaded and downloaded data.

Issues with OCIv1 and OCIv2 proposals: https://hackmd.io/@cyphar/ociv2-brainstorm
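The entry-order problem can be reproduced with a small sketch. It packs the same two files into two in-memory tar archives in opposite order and compares their SHA-256 digests. The file names and contents are placeholders, and plain tar is used instead of tar+gzip to keep the example short.

```python
import hashlib
import io
import tarfile


def build_layer(files):
    """Pack (name, content) pairs into an in-memory tar, in the order given."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            info.mtime = 0  # fixed metadata, so only the entry order differs
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def contents(layer):
    """Map of file name to file content stored in the archive."""
    with tarfile.open(fileobj=io.BytesIO(layer)) as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


def digest(layer):
    return hashlib.sha256(layer).hexdigest()


files = [("app/a.txt", b"alpha"), ("app/b.txt", b"beta")]
layer_1 = build_layer(files)
layer_2 = build_layer(list(reversed(files)))

same_content = contents(layer_1) == contents(layer_2)
same_digest = digest(layer_1) == digest(layer_2)
print(same_content, same_digest)
```

The two archives hold identical files but hash differently, so a registry that addresses layers by digest treats them as two unrelated layers.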

Nydus image files are divided into file chunks, and the metadata layer is flattened (removing intermediate layers). Each chunk is only saved once in the image, and a base image can be specified as a chunk dictionary for other Nydus images. Based on chunk-level deduplication, it provides low-cost data deduplication capabilities between different images, greatly reducing the amount of uploaded and downloaded data for the images.

nydus-image-files

As shown in the figure above, Nydus image 1 and image 2 have the same data blocks B2, C, E1, and F. Image 2 adds E2, G1, H1, and H2. If image 1 already exists in the image repository, image 2 can be built based on image 1. Only E2, G1, H1, and H2 need to be built into one layer, and only this layer needs to be uploaded to the image repository. This achieves the effect of uploading and pulling only file differences, which shortens the development cycle.
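The chunk dictionary idea can be sketched as follows. The chunk labels come from the figure, but the byte contents, the chunk_id helper, and the chunks_to_upload function are hypothetical and do not reflect the actual Nydus builder.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Content address of a chunk (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def chunks_to_upload(image_chunks, chunk_dict):
    """Return only the chunks whose digest is not already in the dictionary."""
    seen = set(chunk_dict)
    new_chunks = []
    for data in image_chunks:
        cid = chunk_id(data)
        if cid not in seen:
            seen.add(cid)  # each chunk is saved only once per image
            new_chunks.append(data)
    return new_chunks


# Labels follow the figure; the byte contents are placeholders.
shared = [b"B2", b"C", b"E1", b"F"]
image_1 = shared
image_2 = shared + [b"E2", b"G1", b"H1", b"H2"]

# Image 1 is already in the repository and serves as the chunk dictionary.
chunk_dict = {chunk_id(c) for c in image_1}
new_chunks = chunks_to_upload(image_2, chunk_dict)
print([c.decode() for c in new_chunks])
```

Only the four chunks that image 1 does not already contain are selected for the new layer.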

Directly building Nydus images

Currently, in most production scenarios for acceleration images, the acceleration images are produced by image conversion. The following two Nydus conversion schemes are currently in place:

i. Repository conversion

After a traditional image is built and pushed to the image repository, the conversion action of the image repository is triggered to complete the image conversion. The disadvantage of this approach is that the build and conversion are often done on different machines. After the image is built and pushed, it needs to be pulled to the conversion machine and the output needs to be pushed to the image repository, which adds a complete image circulation process and causes high latency. Also, it occupies the network resources of the image repository. Before the acceleration image conversion is complete, application deployment cannot enjoy the acceleration effect and still needs to pull the entire image.

ii. Double version building

After the traditional image is built, it is converted directly on the local build machine. To improve efficiency, the conversion of each layer can start immediately after that layer is built, which significantly reduces the delay in generating acceleration images. With this approach, conversion can begin without waiting for the traditional image to be uploaded, and because the conversion is local, the cost of transfer between the conversion machine and the image repository is saved compared with repository conversion. If the accelerated image corresponding to the base image does not exist, it is converted; if it exists, pulling can be skipped. Inevitably, however, pushing always requires twice the data.

iii. Direct building

Compared with directly building Nydus acceleration images, the two conversion-based schemes mentioned above have obvious production delays. First, OCI-based image construction is significantly slower than Nydus image construction. Second, conversion happens after the fact, so there is always some delay. Third, both schemes involve additional data transmission. Direct building, on the other hand, has fewer steps, is faster, and saves resources.

It can be seen that the steps and data transmission volume for building acceleration images are significantly reduced. After the construction is completed, the ability of the acceleration image can be directly enjoyed and the speed of application deployment can be greatly improved.

3: Slow container startup

Nydus images load on demand

The actual usage rate of image data is very low. For example, CERN’s paper mentions that only 6% of the content of a general image is actually used. The purpose of on-demand loading is to allow the container runtime to selectively download and extract files from the image layers in the Blob, but the OCI/Docker image specifications package each image layer into a tar or tar.gz archive. This means that even to extract a single file, the entire Blob still has to be scanned. If the image is compressed using gzip, it is even more difficult to extract specific files.

nydus-images-load-on-demand

The RAFS image format is an archive compression format proposed by Nydus. It separates the data (Blobs) and metadata (Bootstrap) of the container image file system, so that the original image layers only store the data part of the files (i.e. the Blob layer in the figure). Furthermore, the files are divided into chunks according to a certain granularity, and the corresponding chunk data is stored in each Blob layer. Working at chunk granularity refines deduplication and makes it easier to share data between layers and images and to load data on demand.

The Blob layer stores the chunk files, which are chunks of file data. For example, a 10MB file can be sliced into 10 1MB blocks, and the offset of each chunk can be recorded in an index. When requesting part of the data from a file, the container runtime can selectively obtain the file from the image repository by combining with the HTTP Range Request supported by the OCI/Docker image repository specification, thus saving unnecessary network overhead. For more details about the Nydus image format, please refer to the Nydus Image Service project.
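The 10MB example can be made concrete with a short sketch that maps a read request to chunk indexes and then to HTTP Range headers. It assumes fixed-size, uncompressed 1 MiB chunks stored contiguously in the Blob; a real Nydus blob records a per-chunk offset and compressed size in its index, so the function names and layout here are illustrative only.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB chunks, as in the 10 MB example above


def chunks_for_read(offset: int, count: int, chunk_size: int = CHUNK_SIZE):
    """Indexes of the chunks that cover bytes [offset, offset + count)."""
    if count <= 0:
        return []
    first = offset // chunk_size
    last = (offset + count - 1) // chunk_size
    return list(range(first, last + 1))


def range_header(chunk_index: int, blob_offset: int = 0,
                 chunk_size: int = CHUNK_SIZE) -> str:
    """Range header value for one chunk (HTTP byte ranges are inclusive)."""
    start = blob_offset + chunk_index * chunk_size
    end = start + chunk_size - 1
    return f"bytes={start}-{end}"


# A 2-byte read that straddles the boundary between chunk 0 and chunk 1.
needed = chunks_for_read(offset=CHUNK_SIZE - 1, count=2)
for index in needed:
    print(index, range_header(index))
```

A read that touches two bytes across a chunk boundary needs two chunks, but still far less than the whole 10MB file.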

The metadata and chunk indexes are combined to form the Meta layer in the figure above, which is the entire filesystem structure that the container can see after all image layers are stacked. It includes the directory tree structure, file metadata, and chunk information (block size and offset, as well as metadata such as file name, file type, owner, etc. for each file). With Meta, the required files can be extracted without scanning the entire archive file. In addition, the Meta layer contains a hash tree and the hash of each chunk data block, so the entire file tree can be verified at runtime, and the signature of the entire Meta layer can be checked so that any tampering with the data is detected at runtime.
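The role of the per-chunk hashes and the hash tree can be illustrated with a generic Merkle-root sketch. This is a textbook construction, not the on-disk layout Nydus uses: the pairing rule, the duplicate-last-node convention, and the function names are assumptions.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    """Fold leaf hashes pairwise until a single root hash remains."""
    if not leaf_hashes:
        return sha256(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify(chunks, trusted_root: bytes) -> bool:
    """Recompute the root from chunk data and compare it with the trusted one."""
    return merkle_root([sha256(c) for c in chunks]) == trusted_root


chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
trusted_root = merkle_root([sha256(c) for c in chunks])  # would live in Meta

tampered = [b"chunk-0", b"chunk-X", b"chunk-2"]
print(verify(chunks, trusted_root), verify(tampered, trusted_root))
```

Changing any single chunk changes the root, so a signed root is enough to detect tampering anywhere in the tree.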

nydus-meta-layer

Nydus uses the user-mode file system implementation FUSE to implement on-demand loading by default. The user-mode Nydus daemon process mounts the Nydus image mount point as the container RootFS directory. When the container generates a file system IO such as read(fd, count), the kernel-mode FUSE driver adds the request to the processing queue. The user-mode Nydus daemon reads and processes the request through the FUSE Device, pulls the corresponding number of Chunk data blocks from the remote Registry, and finally replies to the container through the kernel-mode FUSE. Nydus also implements a layer of local cache, where chunks that have been pulled from the remote are uncompressed and cached locally. The cache can be shared between images on a layer-by-layer basis, or at a chunk level.
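The interaction between on-demand loading and the local cache can be sketched as a read-through cache keyed by chunk ID. The ChunkCache class and the fake_registry function are hypothetical stand-ins: the real daemon serves FUSE requests and pulls from a remote registry, neither of which is modeled here.

```python
class ChunkCache:
    """Read-through cache: a chunk is pulled from the remote only on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: chunk_id -> bytes
        self._chunks = {}
        self.remote_pulls = 0

    def read_chunk(self, chunk_id: str) -> bytes:
        if chunk_id not in self._chunks:
            self._chunks[chunk_id] = self._fetch_remote(chunk_id)
            self.remote_pulls += 1
        return self._chunks[chunk_id]


def fake_registry(chunk_id: str) -> bytes:
    # Hypothetical stand-in for pulling a chunk from the remote registry.
    return f"data-for-{chunk_id}".encode()


cache = ChunkCache(fake_registry)
cache.read_chunk("c1")  # miss: pulled from the remote
cache.read_chunk("c1")  # hit: served from the local cache
cache.read_chunk("c2")  # miss: pulled from the remote
print(cache.remote_pulls)
```

Because the cache is keyed by chunk rather than by image, two images that share a chunk would trigger only one remote pull in this sketch.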

nydus-uses-the-user-mode-file-system

After using Nydus for image acceleration, the startup time of different applications has made a qualitative leap, enabling applications to be launched in a very short time, meeting the requirements of rapid scaling in the cloud.

Read-only file system EROFS

When there are many files in the container image, frequent file operations generate a large number of FUSE requests, which causes frequent context switching between kernel-space and user-space, resulting in performance bottlenecks. Based on the kernel-space EROFS file system (originating from Linux 4.19), Nydus has made a series of improvements and enhancements to expand its capabilities in the image scenario. The final result is a kernel-space container image format, Nydus RAFS (Registry Acceleration File System) v6. Compared with the previous format, it has the advantages of block data alignment, more concise metadata, high scalability, and high performance. When all image data is downloaded locally, the FUSE user-space solution can cause the process that accesses the file to frequently trap to user-space, and involves memory copies between kernel-space and user-space.

Furthermore, Nydus supports the EROFS over FS-Cache scheme (Linux 5.19-rc1), where the user-space Nydusd directly writes downloaded chunks into the FS-Cache cache. When the container accesses the data, it can directly read the data through the kernel-space FS-Cache without trapping to user-space. In the container image scenario, this achieves almost lossless performance and stability, outperforming the FUSE user-space solution, and comparable to native file systems (without on-demand loading).

Scenario	OCI	Fuse + rafsv5	Fuse + rafsv6	Fscache + rafsv6	Fscache + rafsv6 + opt patch
e2e startup wordpress	11.704s, 11.651s, 11.330s	5.237s, 5.489s, 5.337s	5.094s, 5.382s, 5.314s	10.167s, 9.999s, 9.884s	4.659s, 4.541s, 4.658s
e2e startup Hello bench java	9.2186s, 8.9132s, 8.8412s	2.8325s, 2.7671s, 2.7671s	2.7543s, 2.8104s, 2.8692s	4.6904s, 4.7012s, 4.6654s	2.9691s, 3.0485s, 3.0294s

Currently Nydus has supported this scheme in building, running, and kernel-space (Linux 5.19-rc1). For detailed usage, please refer to the Nydus EROFS FS-Cache user guide. If you want to learn more about the implementation details of Nydus in kernel-space, you can refer to Nydus Image Acceleration: The Evolutionary Road of the Kernel.

the-evolution-of-the-nydus-kernel

Dragonfly P2P Accelerates Image Downloading

Both the image repository service and the underlying storage have bandwidth and QPS limitations. If we rely solely on the bandwidth and QPS provided by the server, demand can easily exceed what the server can deliver. Therefore, P2P needs to be introduced to relieve server pressure and meet the demand for large-scale concurrent image pulling. In scenarios where large-scale image pulling is required, using Dragonfly and Nydus can save more than 90% of container startup time compared to using OCIv1.

dragonfly-p2p-accelerates-image-downloading

The shorter startup time after using Nydus is due to the lazy loading feature of the image: only a small portion of metadata needs to be pulled for the Pod to start. In large-scale scenarios, the number of image pulls that Dragonfly sends back to the source is very small, whereas in the OCIv1 scenario every image pull goes back to the source. The back-to-source peak and the back-to-source traffic with Dragonfly are therefore much lower than in the OCIv1 scenario. Furthermore, with Dragonfly, the back-to-source peak and traffic do not increase significantly as concurrency increases.

Test with a 1GB random file
Concurrency	Completion time for OCI image	Completion time for Nydus+Dragonfly image	Performance improvement ratio
1	63s	41s	53%
5	63s	51s	23%
50	145s	65s	123%
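The ratios in the table appear to be the time saved relative to the Nydus+Dragonfly completion time, truncated to a whole percent. That formula is inferred from the numbers rather than stated in the original, so the following is only a check under that assumption; the function name is hypothetical.

```python
def improvement_percent(oci_seconds: int, nydus_seconds: int) -> int:
    """Time saved relative to the Nydus+Dragonfly time, as a truncated percent.

    Assumed formula (inferred from the table): (OCI - Nydus) / Nydus.
    """
    return (oci_seconds - nydus_seconds) * 100 // nydus_seconds


# (concurrency, OCI completion time, Nydus+Dragonfly completion time) from the table
rows = [(1, 63, 41), (5, 63, 51), (50, 145, 65)]
for concurrency, oci, nydus in rows:
    print(f"concurrency={concurrency}: {improvement_percent(oci, nydus)}%")
```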

4: Slow Cluster scaling

ACR Image Repository Global Synchronization

To meet customer demand for a high-quality experience and to satisfy data compliance requirements, ZOLOZ is deployed at multiple sites around the world in the cloud. The ACR image repository accelerates cross-border synchronization, keeping multiple regions in sync and improving the efficiency of container image distribution. Images are uploaded and downloaded within the local data center, so even in countries with poor network conditions, deployments behave as they would in a local data center, achieving one-click deployment of applications around the world.

Use ContainerOS for High-Speed Startup

With cloud native, customers can scale resources quickly and use elasticity to reduce costs. On the cloud, virtual machines need to be provisioned quickly and added to the cluster. ContainerOS simplifies the OS startup process and pre-installs the container images needed by cluster management components. This reduces the time spent pulling images during node startup, greatly improves OS startup speed, and shortens node scaling time in the ACK link. ContainerOS is optimized in the following ways:

ContainerOS simplifies the OS startup process to reduce OS startup time. ContainerOS is positioned as an operating system for cloud-based virtual machines, so it needs few hardware drivers and builds the necessary kernel driver modules into the kernel. In addition, ContainerOS removes initramfs and greatly simplifies udev rules, which significantly improves OS startup speed. For example, on an ecs.g7.large ECS instance, the first startup of LifseaOS takes about 2 seconds, while Alinux3 requires more than 1 minute.
ContainerOS pre-installs the container images needed by cluster management components to reduce the time spent pulling images during node startup. After an ECS node starts, it must pull the images of components that perform basic work in the ACK scenario. For example, the Terway component is responsible for networking, and the node can become Ready only once the Terway component container is ready. Because the long-tail effect of pulling over the network can be very time-consuming, pre-installing the component in the OS lets the node load it from a local directory and avoid pulling the image from the network.
ContainerOS also improves node elasticity performance through ACK control link optimization.

Finally, we measured the end-to-end P90 time for scaling out from an empty ACK node pool, timed from the moment the scale-out request is issued until 90% of the nodes are ready, and compared it with the CentOS and Alinux2 Optimized-OS solutions. ContainerOS shows a significant performance advantage.
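The P90 measurement described above can be sketched as follows. This is a minimal illustration, not the tooling used for the benchmark: the function name is hypothetical and the sample durations are made up for the example.

```python
from typing import Sequence


def ready_time_at_percent(ready_seconds: Sequence[float], percent: int = 90) -> float:
    """Seconds after the scale-out request at which `percent`% of nodes are Ready.

    `ready_seconds` holds, for each node, the time from the request being
    issued to that node becoming Ready.
    """
    if not ready_seconds:
        raise ValueError("at least one node is required")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(ready_seconds)
    # Smallest node count that covers `percent`% of the pool (integer ceiling).
    needed = (percent * len(ordered) + 99) // 100
    return ordered[needed - 1]


# Hypothetical per-node ready times, in seconds, for a 10-node scale-out.
sample = [31, 35, 36, 38, 40, 41, 43, 47, 52, 95]
print(ready_time_at_percent(sample))
```

Using P90 rather than the time of the last node keeps a single slow straggler (95 seconds in the sample) from dominating the result.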

The overall solution

the-overall-solution

By using a streamlined base image and following Dockerfile conventions, we can reduce the size of our images.
We can use the BuildKit provided by Ant Group for multistage and parallel image building, and use caching to speed up repeated builds. When building Nydus accelerated images directly, we can deduplicate by analyzing the blocks that images have in common and upload only the blocks that differ to the remote image repository.
By utilizing ACR’s global acceleration synchronization capability, we can distribute our images to different repositories around the world for faster pulling.
We can use the Dragonfly P2P network to accelerate the on-demand pulling of Nydus image blocks.
We can use the ContainerOS operating system on our nodes to improve both OS and image startup speed.
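The block-level deduplication mentioned in the list can be illustrated with a content-addressed sketch. This is not how Nydus implements it: the fixed chunk size, the SHA-256 digests, and the function names are all assumptions chosen to keep the example short.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int) -> dict:
    """Split `data` into fixed-size chunks, keyed by SHA-256 digest."""
    chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    return chunks


def chunks_to_upload(new_image: bytes, remote_digests: set, chunk_size: int) -> dict:
    """Return only the chunks of `new_image` that the repository lacks."""
    return {
        digest: chunk
        for digest, chunk in chunk_digests(new_image, chunk_size).items()
        if digest not in remote_digests
    }


# Two toy "images" that share their first two 4-byte chunks.
base_image = b"AAAABBBBCCCC"
new_image = b"AAAABBBBDDDD"

remote = set(chunk_digests(base_image, 4))
missing = chunks_to_upload(new_image, remote, 4)
print(f"{len(missing)} of {len(chunk_digests(new_image, 4))} chunks need uploading")
```

Only the chunk that differs between the two images is uploaded; the shared chunks are already present in the repository and are skipped.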

mprove-the-startup-speed-of-the-image

Time (3GB image as an example)	Build Image	Push Image	Schedule Node	Pull Image
Before	180s	60s	506s	4m15s
After	130s	1s	56s	560ms

*Schedule Node: This refers to the time from creating an ECS instance on Alibaba Cloud to the node joining the K8s cluster and becoming Ready. Thanks to the optimization of ContainerOS, this time is reduced significantly.
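The table mixes units (seconds, minutes, and milliseconds), so comparing the rows is easier after normalizing them to seconds. The sketch below does that and sums the stages. The totals are derived here rather than reported in the original, and they assume the four stages run strictly one after another; the helper name is hypothetical.

```python
def to_seconds(text: str) -> float:
    """Parse the table's duration strings, e.g. '180s', '4m15s', '560ms'."""
    if text.endswith("ms"):
        return float(text[:-2]) / 1000
    minutes = 0
    if "m" in text:
        minutes_part, text = text.split("m", 1)
        minutes = int(minutes_part)
    return minutes * 60 + float(text.rstrip("s") or 0)


# Build Image, Push Image, Schedule Node, Pull Image (values from the table)
before = ["180s", "60s", "506s", "4m15s"]
after = ["130s", "1s", "56s", "560ms"]

total_before = sum(to_seconds(t) for t in before)
total_after = sum(to_seconds(t) for t in after)
reduction = (1 - total_after / total_before) * 100

print(f"before={total_before:.2f}s after={total_after:.2f}s reduction={reduction:.1f}%")
```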

After thorough optimization of each stage of the R&D process, both R&D efficiency and online stability have improved substantially. The entire solution is currently deployed on both Alibaba Cloud and AWS and has been running stably for three months. In the future, cloud vendors will provide standard deployment environments to meet the needs of more types of business scenarios.

Usage Guide

Dragonfly installation

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --timeout 10m --dependency-update --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly --set dfdaemon.config.download.prefetch=true,seedPeer.config.download.prefetch=true
NAME: dragonfly
LAST DEPLOYED: Fri Apr  7 10:35:12 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
  export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
  export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
  echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
  export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
  export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
  https://d7y.io/docs/getting-started/quick-start/kubernetes/

For more details, please refer to: nydus
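The `--set` flag in the install command uses dotted keys, which Helm expands into nested chart values. The sketch below shows the nested structure that the two prefetch settings correspond to and flattens it back into the same flag string. The helper is hypothetical and is not part of Helm or Dragonfly.

```python
def to_set_flag(values: dict, prefix: str = "") -> str:
    """Flatten nested chart values into a comma-separated `--set` argument."""
    pairs = []
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.append(to_set_flag(value, path))
        elif isinstance(value, bool):
            # Helm expects lowercase booleans.
            pairs.append(f"{path}={str(value).lower()}")
        else:
            pairs.append(f"{path}={value}")
    return ",".join(pairs)


# Nested form of the prefetch settings used in the install command.
values = {
    "dfdaemon": {"config": {"download": {"prefetch": True}}},
    "seedPeer": {"config": {"download": {"prefetch": True}}},
}
print(to_set_flag(values))
```

The same nested structure could be kept in a values file and passed with `-f`, as the Nydus installation does with config-nydus.yaml.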

Nydus installation

$ curl -fsSL -o config-nydus.yaml https://raw.githubusercontent.com/dragonflyoss/Dragonfly2/main/test/testdata/charts/config-nydus.yaml
$ helm install --wait --timeout 10m --dependency-update --create-namespace --namespace nydus-snapshotter nydus-snapshotter dragonfly/nydus-snapshotter -f config-nydus.yaml
NAME: nydus-snapshotter
LAST DEPLOYED: Fri Apr  7 10:40:50 2023
NAMESPACE: nydus-snapshotter
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing nydus-snapshotter.
Your release is named nydus-snapshotter.
To learn more about the release, try:
  $ helm status nydus-snapshotter
  $ helm get all nydus-snapshotter

For more details, please refer to: https://github.com/dragonflyoss/helm-charts/blob/main/INSTALL.md

ContainerOS

ContainerOS has implemented high-speed scaling for elastic expansion scenarios in ACK cluster node pools through the following optimizations:

OS Startup Speed Improvement

LifseaOS significantly improves OS startup speed by:

Removing unnecessary hardware drivers for cloud scenarios
Modifying essential kernel driver modules to built-in mode
Removing initramfs
Simplifying udev rules

These optimizations reduce first boot time from over 1 minute (traditional OS) to approximately 2 seconds.

ACK-Specific Optimizations

ContainerOS is customized for ACK environments:

Pre-installed container images for cluster management components eliminate network pull time
ACK control link optimizations:
- Adjusted detection frequency for critical logic
- Modified system bottleneck thresholds under high load

How to Use ContainerOS

When creating a managed node pool for an ACK cluster in the Alibaba Cloud console:

Go to the ECS instance OS configuration menu
Select ContainerOS from the dropdown menu
Note: The version number (e.g., "1.24.6") in the OS image name corresponds to your cluster's Kubernetes version

For optimal node scaling performance, refer to ContainerOS documentation on high-speed node scaling.

Nydus Project: https://nydus.dev/

Dragonfly Project: https://d7y.io/

Reference

[1] ZOLOZ: https://www.zoloz.com/

[2] BuildKit: https://github.com/moby/buildkit/blob/master/docs/nydus.md

[3] Paper by Cern: https://indico.cern.ch/event/567550/papers/2627182/files/6153-paper.pdf

[4] OCI: https://github.com/opencontainers/image-spec/

[5] Docker: https://github.com/moby/moby/blob/master/image/spec/v1.2.md

[6] RAFS image format: https://d7y.io/blog/2022/06/06/evolution-of-nydus/

[7] Nydus Image Service project: https://github.com/dragonflyoss/image-service

[8] FUSE: https://www.kernel.org/doc/html/latest/filesystems/fuse.html

[9] Nydus EROFS fscache user guide: https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-fscache.md

[10] The path of kernel evolution for Nydus image acceleration: https://d7y.io/blog/2022/06/06/evolution-of-nydus/

[11] Alinux2 Optimized-OS: https://www.alibabacloud.com/help/en/container-service-for-kubernetes/latest/containeros-overview
