
Cloud Native Computing Foundation Announces Dragonfly's Graduation

· 11 min read

Dragonfly graduates after demonstrating production readiness, powering container and AI workloads at scale.

Key Highlights:

  • Dragonfly graduates from CNCF after demonstrating production readiness and widespread adoption across container and AI workloads.
  • Dragonfly is used by major organizations, including Ant Group, Alibaba, Datadog, DiDi, and Kuaishou, to power large-scale container and AI model distribution.
  • Since joining the CNCF, Dragonfly has seen over 3,000% growth in code contributions and a growing contributor community spanning over 130 companies.

SAN FRANCISCO, Calif. – January 14, 2026 – The Cloud Native Computing Foundation® (CNCF®), which builds sustainable ecosystems for cloud native software, today announced the graduation of Dragonfly, a cloud native open source image and file distribution system designed to solve cloud native image distribution in Kubernetes-centered applications.

"Dragonfly's graduation reflects the project's maturity, broad industry adoption and critical role in scaling cloud native infrastructure," said Chris Aniszczyk, CTO, CNCF. "It's especially exciting to see the project's impact in accelerating image distribution and meeting the data demands of AI workloads. We're proud to support a community that continues to push forward scalable, efficient and open solutions."

Dragonfly's Technical Capabilities

Dragonfly delivers efficient, stable, and secure data distribution and acceleration powered by peer-to-peer (P2P) technology. It aims to provide a best-practice, standards-based solution for cloud native architectures to improve large-scale delivery of files, container images, OCI artifacts, AI models, caches, logs, and dependencies.

Dragonfly runs on Kubernetes and is installed via Helm, with its official chart available on Artifact Hub. It integrates with tools like Prometheus for tracking performance, OpenTelemetry for collecting and sharing telemetry data, and gRPC for fast communication between components, and it enhances Harbor's ability to distribute images and OCI artifacts through the preheat feature. In the GenAI era, as model serving becomes increasingly important, Dragonfly delivers even more value by distributing AI model artifacts defined by the ModelPack specification.

Dragonfly continues to advance container image distribution, supporting tens of millions of container launches per day in production, saving storage bandwidth by up to 90%, and reducing launch time from minutes to seconds, with large-scale adoption across different cloud native scenarios.

Dragonfly is also driving standards and acceleration solutions for distributing both AI model weights and optimized image layout in AI workloads. The technology reduces data loading for large-scale AI applications and enables the distribution of model weights at a hundred-terabyte scale to hundreds of nodes in minutes. As AI continues to integrate into operations, Dragonfly becomes crucial to powering large-scale AI workloads.

Milestones Driving Graduation

Dragonfly was open-sourced by Alibaba Group in November 2017. It then joined the CNCF as a Sandbox project in October 2018. During this stage, Dragonfly 1.0 became production-ready in November 2019 and the Dragonfly subproject, Nydus, was open-sourced in January 2020. Dragonfly then reached Incubation phase in April 2020, with Dragonfly 2.0 later released in 2021.

Since then, the community has significantly matured and attracted hundreds of contributors from organizations such as Ant Group, Alibaba Cloud, ByteDance, Kuaishou, Intel, Datadog, Zhipu AI, and more, who use Dragonfly to deliver efficient image and AI model distribution.

Since joining CNCF, contributors have increased by 500%, from 45 individuals across 5 companies to 271 individuals across over 130 companies. Commit activity has grown by over 3,000%, from roughly 800 to 26,000 commits, and the number of overall participants has reached 1,890.

What's Next For Dragonfly

Dragonfly will accelerate AI model weight distribution based on RDMA, improving throughput and reducing end-to-end latency. It will also optimize image layout to reduce data loading time for large-scale AI workloads. A load-aware two-phase scheduling mechanism will be introduced, leveraging collaboration between the scheduler and clients to enhance overall distribution efficiency. To provide more stable and reliable services, Dragonfly will support automatic updates and fault recovery, ensuring stable operation of all components during traffic bursts while controlling back-to-source traffic.

Dragonfly's Graduation Process

To officially graduate from Incubation status, the Dragonfly team enhanced the election policy, clarified the maintainer lifecycle, standardized the contribution process, defined the community ladder, and added community guidelines for subprojects. The graduation process is supported by CNCF's Technical Oversight Committee (TOC) sponsors for Dragonfly, Karena Angell and Kevin Wang, who conducted a thorough technical due diligence with Dragonfly's project maintainers.

Additionally, a third-party security audit of Dragonfly was conducted. The Dragonfly team, with the guidance of their TOC sponsors, completed both a self-assessment and a joint assessment with CNCF TAG Security, then collaborated with the Dragonfly security team on a threat model. After this, the team improved the project's security policy.

Learn more about Dragonfly and join the community: https://d7y.io/

Supporting Quotes

"I am thrilled, as the founder of Dragonfly, to announce its graduation from the CNCF. We are grateful to each and every open source contributor in the community, whose tenacity and commitment have enabled Dragonfly to reach its current state. Dragonfly was created to resolve Alibaba Group's challenges with ultra-large-scale file distribution and was open-sourced in 2017. Looking back on this journey over the past eight years, every step has embodied the open source spirit and the tireless efforts of the many contributors. This graduation marks a new starting point for Dragonfly. I hope that the project will embark on a new journey, continue to explore more possibilities in the field of data distribution, and provide greater value!"

—Zuozheng Hu, founder of Dragonfly, emeritus maintainer

"I am delighted that Dragonfly is now a CNCF graduated project. This is a significant milestone, reflecting the maturity of the community, the trust of end users, and the reliability of the service. In the future, with the support of CNCF, the Dragonfly team will work together to drive the community's sustainable growth and attract more contributors. Facing the challenges of large-scale model distribution and data distribution in the GenAI era, our team will continue to explore the future of data distribution within the cloud native ecosystem."

—Wenbo Qi (Gaius), core Dragonfly maintainer

"Since open-sourcing in 2020, Nydus, alongside Dragonfly, has been validated at production scale. Dragonfly's graduation is a key milestone for Nydus as a subproject, allowing the project to continue improving the image filesystem's usability and performance. It will also allow us to further explore ecosystem standardization and AGI use cases that will advance the underlying infrastructure."

—Song Yan, core Nydus maintainer

"The combination of Dragonfly and Nydus substantially shortens launch times for container images and AI models, enhancing system resilience and efficiency."

—Jiang Liu, Nydus maintainer

"Thanks to the community's collective efforts, Dragonfly has evolved from a tool for accelerating container images into a secure and stable distribution system widely adopted by many enterprises. Continuous improvements in usability and stability enable the project to support a variety of scenarios, including CI/CD, edge computing, and AI. New challenges are emerging for the distribution of model weights and data in the age of AI. Dragonfly is becoming a key infrastructure in mitigating these challenges. With the support of the CNCF, Dragonfly will continue to drive the future evolution of cloud native distribution technologies."

—Yuan Yang, Dragonfly maintainer

TOC Sponsors

"We're grateful to the TOC members who dedicated significant time to the technical due diligence required for Dragonfly's advancement, as well as the technical leads and community members who supported governance and technical reviews. We also thank the project maintainers for their openness and responsiveness throughout this process, and the end users who met with TOC and TAB members to share their experiences with Dragonfly. This level of collaboration is what helps ensure the strength and credibility of the CNCF ecosystem."

—Karena Angell, chair, TOC, CNCF

"Dragonfly's graduation is a testament to the project's technical maturity and the community's consistent focus on performance, reliability, and tangible impact. It's been impressive to see Dragonfly evolve to meet the needs of large-scale production environments and AI workloads. Congratulations to the maintainers and contributors who've worked hard to reach this milestone."

—Kevin Wang, vice chair, TOC, CNCF

Project End Users

"Over the past few years, as part of Ant Group's container infrastructure, the Dragonfly project has accelerated container image and code package delivery across several 10K-node Kubernetes clusters. It has significantly saved image transmission bandwidth, and the Nydus subproject has additionally helped us to reduce image pull time to near zero. The project also supports the delivery of large language models within our AI infrastructure. It is a great honor to have contributed to Dragonfly and to have shared our practices with the community."

—Xu Wang, head of the container infra team, Ant Group, and co-launcher of Kata Containers Project

"Dragonfly has become a key infrastructure component of the container image and data distribution system for Alibaba's large-scale distributed systems. In ultra-large-scale scenarios such as the Double 11 (Singles' Day) shopping festival, Dragonfly has provided stable and efficient distribution capabilities and has improved the efficiency of system scaling and delivery. Facing the new technological challenges of the AI era, Dragonfly has played an important role in model data distribution and cache acceleration, helping us to build a more efficient, intelligent computing platform. We are happy to see Dragonfly graduate, which represents an enhancement in community maturity and validates its reliability in large-scale production environments."

—Li Yi, director of engineering for container service, Alibaba Cloud & Tao Huang, director of engineering for cloud native transformation project, Alibaba Group

"Datadog recently adopted the Dragonfly subproject, Nydus, and it has helped significantly reduce time spent pulling images. This includes AI workloads, where image pulls previously took 5 minutes, and node daemonsets, whose startup speeds directly affect how quickly applications can be scheduled on nodes. We have seen significant improvements using Nydus; now everything starts in a matter of seconds. We are thrilled to see Dragonfly graduate and hope to continue to contribute to this impressive ecosystem!"

—Baptiste Girard-Carrabin, Datadog

"DiDi uses a distributed cloud platform to handle a large number of user requests and quickly adjust resources, which requires very efficient and stable management of resource distribution and image synchronization. Dragonfly is a core component of our technical architecture due to its strong cloud native adaptability, excellent P2P acceleration capabilities, and proven stability in large-scale scenarios. We believe that Dragonfly's graduation is a strong testament to its technical maturity and industry value. We also look forward to its continued advancement in the field of cloud native distribution, providing more efficient solutions for large-scale file synchronization, image distribution, and other enterprise scenarios."

—Feng Wu, head of the Elastic Cloud, DiDi & Rapier Yang, head of the Elastic Cloud, DiDi

"At Kuaishou, Dragonfly is considered the cornerstone of our container infrastructure, and it will soon provide stable and reliable image distribution capabilities for tens of thousands of services and hundreds of thousands of servers. Integrated with its subproject Nydus, Dragonfly dramatically enhances application startup efficiency while significantly alleviating disk I/O pressure—ensuring stability for services. In the era of AI large models, Dragonfly also functions as a critical component of our AI infrastructure, providing exceptional acceleration capabilities for large model distribution. We are deeply honored to partner with the vibrant Dragonfly community in collectively exploring future innovations for cloud native distribution technologies."

—Wang Yao, head of container registry service of Kuaishou

About Cloud Native Computing Foundation

Cloud native computing empowers organizations to build and run scalable applications with an open source software stack in public, private, and hybrid clouds. The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure, including Kubernetes, Prometheus, and Envoy. CNCF brings together the industry's top developers, end users, and vendors and runs the largest open source developer conferences in the world. Supported by nearly 800 members, including the world's largest cloud computing and software companies, as well as over 200 innovative startups, CNCF is part of the nonprofit Linux Foundation. For more information, please visit www.cncf.io.

Dragonfly v2.4.0 has been released

· 5 min read

Dragonfly v2.4.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.


New features and enhancements

Load-aware scheduling algorithm

A two-stage scheduling algorithm combining central scheduling with node-level secondary scheduling to optimize P2P download performance based on real-time load awareness.


For more information, please refer to the Scheduling documentation.
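The two phases can be sketched in Go: the Scheduler's central phase ranks candidate parents by the load it last observed, and each client's local phase re-checks fresh load reports before picking a parent. This is a simplified illustration of the idea, not the actual scheduler code; the types and the load metric are hypothetical.

```go
package scheduling

import "sort"

// Peer is a hypothetical view of a candidate parent with a real-time
// load score reported by the node (0.0 = idle, 1.0 = saturated).
type Peer struct {
	ID   string
	Load float64
}

// centralPhase stands in for the Scheduler: it returns up to k candidate
// parents for a task, ranked by the load the Scheduler last observed.
func centralPhase(candidates []Peer, k int) []Peer {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Load < candidates[j].Load
	})
	if len(candidates) > k {
		candidates = candidates[:k]
	}
	return candidates
}

// localPhase stands in for the client: among the Scheduler's candidates,
// it consults fresh load reports and picks the least-loaded parent.
func localPhase(candidates []Peer, freshLoad func(id string) float64) Peer {
	best := candidates[0]
	bestLoad := freshLoad(best.ID)
	for _, p := range candidates[1:] {
		if l := freshLoad(p.ID); l < bestLoad {
			best, bestLoad = p, l
		}
	}
	return best
}
```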

Vortex Protocol Support for P2P File Transfer

Dragonfly provides the new Vortex transfer protocol, based on TLV (Tag-Length-Value), to improve download performance on internal networks. The lightweight TLV format replaces gRPC for data transfer between peers. TCP-based Vortex reduces large file download time by 50% and QUIC-based Vortex by 40% compared to gRPC, and both effectively reduce peak memory usage.

For more information, please refer to the TCP Protocol Support for P2P File Transfer and QUIC Protocol Support for P2P File Transfer.

Request SDK

An SDK for routing user requests to Seed Peers using consistent hashing, replacing the previous Kubernetes Service load-balancing approach.


For more details, please refer to Request SDK.
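Consistent hashing maps each request key to a point on a hash ring, so the same task always routes to the same Seed Peer, and adding or removing a peer only remaps a small fraction of keys (unlike round-robin Service load balancing). A minimal ring in Go, as an illustrative sketch with hypothetical names rather than the SDK's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a minimal consistent-hash ring over seed peer addresses.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> seed peer address
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each seed peer at several virtual points to smooth
// out the key distribution across peers.
func NewRing(peers []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, p := range peers {
		for i := 0; i < replicas; i++ {
			pt := hashKey(p + "#" + strconv.Itoa(i))
			r.points = append(r.points, pt)
			r.owner[pt] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the seed peer owning the first point clockwise of the key.
func (r *Ring) Pick(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"seed-a:4001", "seed-b:4001", "seed-c:4001"}, 100)
	fmt.Println(ring.Pick("task-1234")) // the same key always maps to the same seed peer
}
```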

Simple Multi-Cluster Kubernetes Deployment with Scheduler Cluster ID

Dragonfly supports a simplified feature for deploying and managing multiple Kubernetes clusters by explicitly assigning a schedulerClusterID to each cluster. This approach allows users to directly control cluster affinity without relying on location-based scheduling metadata such as IDC, hostname, or IP.

Using this feature, each Peer, Seed Peer, and Scheduler determines its target scheduler cluster through a clearly defined scheduler cluster ID. This ensures precise separation between clusters and predictable crossโ€‘cluster behavior.


For more information, please refer to Create Dragonfly Cluster Simple.
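As a rough sketch, the per-component pinning could look like the following Helm values fragment. Only the schedulerClusterID concept comes from this post; the exact key paths below are hypothetical and may differ by chart version:

```yaml
# Illustrative only: key paths are assumptions, not the documented schema.
scheduler:
  config:
    manager:
      schedulerClusterID: 1   # this Scheduler serves cluster 1
seedPeer:
  config:
    seedPeer:
      clusterID: 1            # Seed Peers join the same cluster
dfdaemon:
  config:
    host:
      schedulerClusterID: 1   # Peers select schedulers from cluster 1
```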

Performance and Resource Optimization for Manager and Scheduler Components

Enhanced service performance and resource utilization across Manager and Scheduler components while significantly reducing CPU and memory overhead, delivering improved system efficiency and better resource management.

Enhanced Preheating

  • Support for IP-based peer selection in preheating jobs, with priority-based selection logic where IP specification takes highest priority, followed by count-based and percentage-based selection.

  • Support for preheating multiple URLs in a single request.

  • Support for preheating files and images via the Scheduler gRPC interface.


Calculate task ID based on image blob SHA256 to avoid redundant downloads

The Client now supports calculating task IDs directly from the SHA256 hash of image blobs instead of using the download URL. This enhancement prevents redundant downloads and data duplication when the same blob is accessed from different registry domains.
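In other words, the task key becomes content-addressed rather than location-addressed. A simplified illustration in Go (the real client's task ID derivation involves more fields; this only demonstrates the URL-independence property):

```go
package taskid

import (
	"crypto/sha256"
	"encoding/hex"
)

// FromBlobDigest derives a task ID from the blob's SHA256 digest, so the
// same blob served from different registry domains maps to a single task.
func FromBlobDigest(blobSHA256 string) string {
	sum := sha256.Sum256([]byte(blobSHA256))
	return hex.EncodeToString(sum[:])
}

// Example: both URLs reference the same blob digest and therefore share
// one task (and one download) in the P2P cache:
//
//	https://registry-a.example.com/v2/app/blobs/sha256:ab12cd...
//	https://registry-b.example.com/v2/app/blobs/sha256:ab12cd...
```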

Cache HTTP 307 redirects for split downloads

Support for caching HTTP 307 (Temporary Redirect) responses to optimize Dragonfly's multi-piece download performance. When a download URL is split into multiple pieces, the redirect target is now cached, eliminating redundant redirect requests and reducing latency.
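Conceptually, the first piece request resolves the 307 and the resolved target is reused for the remaining pieces. A minimal sketch of that caching in Go (hypothetical types; the actual client implements this inside its HTTP backend):

```go
package redirectcache

import (
	"net/http"
	"sync"
)

// Cache remembers the redirect target of a 307 response per source URL,
// so subsequent piece requests for the same URL skip the extra round trip.
type Cache struct {
	mu      sync.Mutex
	targets map[string]string // source URL -> resolved target URL
}

func New() *Cache { return &Cache{targets: make(map[string]string)} }

// Resolve returns the cached target for url, or performs one request
// without auto-following redirects and caches the 307 Location.
func (c *Cache) Resolve(url string) (string, error) {
	c.mu.Lock()
	if t, ok := c.targets[url]; ok {
		c.mu.Unlock()
		return t, nil
	}
	c.mu.Unlock()

	client := &http.Client{
		// Do not auto-follow: we want to observe the 307 and cache it.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Head(url)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	target := url
	if resp.StatusCode == http.StatusTemporaryRedirect {
		target = resp.Header.Get("Location")
	}
	c.mu.Lock()
	c.targets[url] = target
	c.mu.Unlock()
	return target, nil
}
```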

Go Client Deprecated and Replaced by Rust Client

The Go client has been deprecated and replaced by the Rust Client. All future development and maintenance will focus exclusively on the Rust client, which offers improved performance, stability, and reliability.

For more information, please refer to dragonflyoss/client.

Additional Enhancements

  • Enable 64K page size support for ARM64 in the Dragonfly Rust client.
  • Fix missing git commit metadata in dfget version output.
  • Support for the config_path option of the io.containerd.cri.v1.images plugin for containerd v3 configuration.
  • Replace the glibc DNS resolver with hickory-dns in reqwest to implement DNS caching and prevent excessive DNS lookups during piece downloads.
  • Support for the --include-files flag to selectively download files from a directory (see the example after this list).
  • Add the --no-progress flag to disable the download progress bar output.
  • Support for custom request headers in backend operations, enabling flexible header configuration for HTTP requests.
  • Refactored log output to reduce redundant logging and improve overall logging efficiency.
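As an example, the two download flags above might be combined as follows; the URL and paths are hypothetical, and the invocation shape is illustrative rather than the exact CLI synopsis:

```shell
# Download only selected files from a directory task, without a progress bar.
dfget --include-files models/config.json --include-files models/weights.bin \
  --no-progress \
  --output /tmp/models/ https://example.com/models/
```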

Significant bug fixes

  • Modified the database field type from text to longtext to support storing preheating job information.
  • Fixed panic on repeated seed peer service stops during Scheduler shutdown.
  • Fixed broker authentication failure when specifying the Redis password without setting a username.

Nydus

New features and enhancements

  • Nydusd: Add CRC32 validation support for both RAFS V5 and V6 formats, enhancing data integrity verification.
  • Nydusd: Support resending FUSE requests during nydusd restoration, improving daemon recovery reliability.
  • Nydusd: Enhance VFS state saving mechanism for daemon hot upgrade and failover.
  • Nydusify: Introduce Nydus-to-OCI reverse conversion capability, enabling seamless migration back to OCI format.
  • Nydusify: Implement zero-disk transfer for image copy, significantly reducing local disk usage during copy operations.
  • Snapshotter: Build blob.meta into the bootstrap to improve blob fetch reliability for RAFS v6 images.

Significant bug fixes

  • Nydusd: Fix auth token fetching for access_token field in registry authentication.
  • Nydusd: Add recursive inode/dentry invalidation for umount API.
  • Nydus Image: Fix multiple issues in optimize subcommand and add backend configuration support.
  • Snapshotter: Implement lazy parent recovery for proxy mode to handle missing parent snapshots.

Others

You can see CHANGELOG for more details.

Dragonfly GitHub


Dragonfly Project Paper Accepted by IEEE Transactions on Networking (TON)!

· 4 min read

We're excited to announce that a paper on model distribution, co-authored by researchers from Dalian University of Technology and Ant Group, has been accepted for publication in the IEEE Transactions on Networking (TON), a high-impact journal recognized by IEEE for its significant influence in networking and systems research. 🎉🎉🎉

Title

Empowering Dragonfly: A Lightweight and Scalable Distribution System for Large Models with High Concurrency

IEEE Xplore

Authors

Zhou Xu (Dalian University of Technology); Lizhen Zhou (Dalian University of Technology); Zichuan Xu (Dalian University of Technology); Wenbo Qi (Ant Group); Jinjing Ma (Ant Group); Song Yan (Ant Group); Yuan Yang (Alibaba Cloud); Min Huang (Dalian University of Technology); Haomiao Jiang (Dalian University of Technology); Qiufen Xia (Dalian University of Technology); Guowei Wu (Dalian University of Technology)

Background

With the continuous evolution of Artificial Intelligence Generated Content (AIGC) technologies, the distribution of container images and large models has become a challenge. Traditional centralized registries often face single-node bandwidth bottlenecks during peak concurrent downloads, leading to severe network congestion and excessively long download times. Conversely, while Content Delivery Networks (CDNs) or private links can mitigate some hotspot loads, they fail to fully leverage the idle bandwidth of cluster nodes and introduce additional overhead. Consequently, cloud-native applications and AI services urgently need a dynamic, efficient, scalable, and non-intrusive distribution system for large-scale images and models, compatible with mainstream formats like the OCI spec.

Design


Key Design 1: A Lightweight Network Measurement Mechanism

  • Network Latency Probing: Each peer proactively sends probes to a designated set of targets within the cluster, as assigned by the Scheduler, to measure end-to-end latency. This approach ensures efficient probing of all peers in the cluster while operating under constrained network resources.


  • Network Bandwidth Prediction: We employ non-intrusive measurements by analyzing historical loading data to estimate network bandwidth with reasonable accuracy. When a peer initiates a model loading request, it pulls pieces of the model from other peers and reports the size and loading time of these pieces to the Scheduler. The Scheduler leverages historical data from successful loading operations to predict bandwidth (see the sketch after this list).

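The core of the non-intrusive estimate is observed throughput per reported piece, smoothed over history. A toy illustration in Go (hypothetical types; the paper layers graph learning on top of such samples):

```go
package bandwidth

import "time"

// PieceReport is one observation reported to the Scheduler: how many
// bytes a peer pulled from a parent and how long the transfer took.
type PieceReport struct {
	Bytes    int64
	Duration time.Duration
}

// Estimator keeps an exponentially weighted moving average of the
// throughput samples for one peer-parent link.
type Estimator struct {
	Alpha float64 // smoothing factor, e.g. 0.2
	bps   float64 // current estimate in bytes/second
}

// Observe folds a new piece report into the running estimate.
func (e *Estimator) Observe(r PieceReport) {
	sample := float64(r.Bytes) / r.Duration.Seconds()
	if e.bps == 0 {
		e.bps = sample
		return
	}
	e.bps = e.Alpha*sample + (1-e.Alpha)*e.bps
}

// BytesPerSecond returns the current bandwidth estimate.
func (e *Estimator) BytesPerSecond() float64 { return e.bps }
```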

Key Design 2: A Scalable Scheduling Framework

  • Separating Inference from Scheduling: To ensure the Scheduler has adequate resources for task scheduling, we separate inference services from scheduling operations.

  • Real-Time Data Synchronization: To maintain data consistency across multiple schedulers, we store and continuously update end-to-end latency information for the entire network, as collected by the lightweight network measurement module.

Key Design 3: Asynchronous Model Training and Inference

  • Asynchronous Model Training: Asynchronous training and inference are facilitated through collaboration between the Trainer and Triton. The Scheduler retrieves end-to-end latency and bandwidth predictions from Redis and sends them to the Trainer, which then initiates training and persists the updated model. Triton periodically polls for updates and loads the new model for inference in the subsequent cycle.

  • Graph Learning Algorithm: This algorithm aggregates feature parameters from peers, modeling each sample as an interaction between a peer and its parent. It also incorporates information from neighboring peers to capture similarities within the cluster, thereby improving the accuracy of bandwidth predictions.


Summary

This paper presents an efficient and scalable peer-to-peer (P2P) model distribution system that optimizes resource utilization and ensures data synchronization through a multi-layered architecture.

  • First, it introduces a lightweight network measurement mechanism that probes latency and infers bandwidth to predict real-time network conditions.

  • Second, it proposes a scalable scheduling framework that separates inference services from scheduling, enhancing both resource efficiency and system responsiveness.

  • Finally, it designs the Trainer module, which integrates asynchronous model training and inference with graph learning algorithms to support incremental learning for bursty tasks.

GitHub Repository


Dragonfly v2.3.0 has been released

· 7 min read

Dragonfly v2.3.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.


Features

Persistent Cache Task

It is designed to provide persistent caching for tasks. The dfcache tool can import and export files in the P2P network. The solution is specifically engineered for high-speed read and write operations, which makes it particularly advantageous for scenarios involving large files, such as machine learning model checkpoints, where rapid, reliable access and distribution across the network are critical for training and inference workflows. By leveraging P2P distribution and persistent caching, dfcache significantly reduces I/O bottlenecks and accelerates the lifecycle of large data assets.


For documentation on how to use the dfcache command-line tool, please refer to the following link: dfcache.

```shell
$ dfcache import /tmp/file.txt
⣷ Done: 2229733261
$ dfcache export 2229733261 -O /tmp/file.txt
```


Resource Search (Tasks and Persistent Cache Tasks)

The Resource Search feature enables seamless querying of tasks, including files, images, and persistent cache tasks. It optimizes resource access, improving task management and retrieval efficiency.


Vortex: A P2P File Transfer Protocol Based on TLV

Vortex protocol is a high-performance peer-to-peer (P2P) file transfer protocol implementation in Rust, designed as part of the Dragonfly project. It utilizes the TLV (Tag-Length-Value) format for efficient and flexible data transmission, making it ideal for large-scale file distribution scenarios.

Packet Format:

  • Packet Identifier (8 bits): Uniquely identifies each packet
  • Tag (8 bits): Specifies data type in value field
  • Length (32 bits): Indicates Value field length, up to 4 GiB
  • Value (variable): Actual data content, up to 1 GiB

Protocol Format:

```text
---------------------------------------------------------------------------------------
| Packet Identifier (8 bits) | Tag (8 bits) | Length (32 bits) | Value (up to 1 GiB)   |
---------------------------------------------------------------------------------------
```

For more information, please refer to the Vortex Protocol.
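Encoding a packet under this layout is straightforward: two fixed-size header bytes, a 32-bit length, then the raw value. A minimal encoder in Go as an illustration (the reference implementation is in Rust, and the byte order here is an assumption):

```go
package vortex

import (
	"encoding/binary"
	"errors"
)

const maxValueSize = 1 << 30 // 1 GiB cap on the value field

// Encode serializes one packet: an 8-bit packet identifier, an 8-bit
// tag, a 32-bit big-endian length, then the value bytes.
func Encode(packetID, tag uint8, value []byte) ([]byte, error) {
	if len(value) > maxValueSize {
		return nil, errors.New("vortex: value exceeds 1 GiB")
	}
	buf := make([]byte, 6+len(value))
	buf[0] = packetID
	buf[1] = tag
	binary.BigEndian.PutUint32(buf[2:6], uint32(len(value)))
	copy(buf[6:], value)
	return buf, nil
}
```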

Enhanced Large File Distribution

This release significantly enhances Dragonfly's large file distribution capabilities, delivering improved efficiency and performance. We've revamped our scheduling algorithms for large file scenarios to ensure smarter resource and task allocation. Additionally, new mechanisms now more effectively balance the load across peers during large file transfers. Optimizations to the peer-to-peer (P2P) protocol and network transport layers further boost transmission efficiency.

These improvements include performance optimizations for both Client and Scheduler. You can find more details in the project's pull request.

Support scopes for Personal Access Tokens (PATs)

By enabling users to define specific access rights (scopes) for each PAT, we significantly enhance the security of Open API interactions. Instead of granting broad permissions, PATs can now be limited to only the necessary privileges required for a particular integration or task.


Enhanced Preheating

Implement Distributed Rate Limiting for Preheating Tasks

By limiting the rate at which preheating requests are initiated across the distributed system, it prevents excessive preheating activity from stressing the origin. This enhancement ensures more stable preheating.


Support for setting the piece length for preheating

By allowing adjustment of the piece size, users can optimize data transfer efficiency, particularly in scenarios involving large files.


Flexible Preheating: Set Peer Scope by Percentage or Count

This feature enhances preheating capabilities by allowing users to specify the preheating scope more precisely.


Implement Audit Logging for User Operations

This feature introduces comprehensive audit logging capabilities to track user operations within the system. Audit logs will record critical actions performed by users, such as initiating preheating tasks, deleting task caches, and other significant system interactions.


Garbage Collection

Dragonfly supports Garbage Collection (GC) Audit Logs and GC Job Records to track and manage garbage collection activities. The Manager enables automated GC retention, allowing records to be preserved for a configurable time period. Additionally, it provides the capability to manually trigger forced GC operations as needed.

This feature ensures efficient monitoring and management of GC processes, offering flexibility through automated retention policies and manual intervention for immediate GC execution.


Download Files via Hard Link

File downloads need to be efficient and secure. When a user downloads a large file, it is inefficient to download the file and then copy it to the output path. Instead, the client can create a hard link to the file and return the link to the user, avoiding the copy and saving time and resources. If the hard link fails (e.g., due to different file systems), dfdaemon falls back to copying the file.

For more information, please refer to the file download workflow.
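The fallback behavior described above is easy to sketch in Go: try a hard link first and copy bytes only when linking fails, for example across file systems. This illustrates the behavior, not the dfdaemon source:

```go
package hardlink

import (
	"io"
	"os"
)

// Place makes target available at outputPath, preferring a zero-copy
// hard link and falling back to a byte copy when linking fails
// (e.g., outputPath lives on a different file system).
func Place(target, outputPath string) error {
	if err := os.Link(target, outputPath); err == nil {
		return nil // hard link succeeded: no data was copied
	}
	src, err := os.Open(target)
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := os.Create(outputPath)
	if err != nil {
		return err
	}
	defer dst.Close()
	_, err = io.Copy(dst, src)
	return err
}
```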

Hardware Acceleration for Piece Hash Computation

This feature enables hardware-accelerated piece hash computation, significantly boosting performance and efficiency. By utilizing specialized hardware instructions, the hash computation process is accelerated, allowing faster processing of large files.
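As a concrete illustration, the CRC-32-Castagnoli checksum used for piece content (see the v2.2.0 notes later on this page) is hardware-accelerated by Go's standard library on amd64 and arm64, so a sketch needs nothing beyond hash/crc32:

```go
package piecehash

import "hash/crc32"

// castagnoli selects the CRC-32C polynomial; Go dispatches to the CPU's
// CRC32 instructions for this table where they are available.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// Sum returns the hardware-accelerated CRC-32C of one piece's content.
func Sum(piece []byte) uint32 {
	return crc32.Checksum(piece, castagnoli)
}
```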

Advanced Storage Management

Disk Space Validation for Operations

This feature enhances the client's storage functionality by implementing disk space validation. When insufficient disk space is detected, the client will return a failure response, preventing potential data corruption or incomplete operations.

Disk Garbage Collection Management

This feature enhances the Peer's disk management by introducing configurable garbage collection (GC) thresholds based on disk usage. The distThreshold parameter allows users to define a specific disk capacity (e.g., 10 TiB) as the base for calculating GC trigger points. If set, the distHighThresholdPercent (e.g., 80%) and distLowThresholdPercent (e.g., 60%) are applied relative to this capacity. If distThreshold is not provided or set to 0, these percentages are calculated based on the total actual disk space. When disk usage exceeds the high threshold, Dragonfly triggers GC (LRU) to reclaim space; GC stops when usage falls below the low threshold. This enables efficient management of a logical disk portion for caching, improving resource utilization and system performance.
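A sketch of the corresponding client configuration, using the parameter names above; the surrounding key structure is an assumption and may differ from the actual config layout:

```yaml
# Hypothetical layout; the three parameter names come from the text above.
gc:
  policy:
    distThreshold: 10TiB          # logical capacity used as the GC base
    distHighThresholdPercent: 80  # start LRU GC above 8 TiB used
    distLowThresholdPercent: 60   # stop GC once usage drops below 6 TiB
```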


Support for OpenTelemetry Tracing

Dragonfly supports tracing based on OpenTelemetry, covering the Manager, Scheduler, and Peers. This enables end-to-end visibility into the download process, allowing users to query detailed information, such as overall download latency, using a specific task ID. The integration ensures efficient monitoring and performance analysis across the entire system.

Add the tracing configuration in the Manager, Scheduler, and Peer.
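The original post shows this configuration as a screenshot. As an illustrative stand-in, a minimal sketch might look like the following, assuming an OTLP collector endpoint; the key names are hypothetical and may differ between components and versions:

```yaml
# Illustrative only: consult the Tracing documentation for the exact keys.
tracing:
  protocol: grpc                          # OTLP transport to the collector
  endpoint: jaeger-collector.example:4317 # hypothetical collector address
```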


You can access the Jaeger UI to visualize the traces.


For more information, please refer to the Tracing.

Security Enhancements

We extend our sincere gratitude to the CNCF TAG Security for their collaboration on a joint security audit. Their expertise and thorough review were invaluable in helping us identify areas for security improvement within Dragonfly. For detailed information on the specific security issues addressed and the corresponding fixes, please refer to the following issue: #3811

Nydus

Significant bug fixes

  • Fixed memory leaks and file descriptor leaks caused by the sysinfo library.
  • Cleaned up the Unix domain socket (UDS) to prevent dfdaemon startup crashes.
  • Prevented the client from repeatedly downloading the same piece from multiple parents.

Others

You can see CHANGELOG for more details.

Dragonfly GitHub


Dragonfly v2.2.0 has been released

· 8 min read


Dragonfly v2.2.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.

Features

Client written in Rust

The client is written in Rust, offering advantages such as memory safety and improved performance. The client is a submodule of Dragonfly; refer to dragonflyoss/client.

scheduler schema

second scheduler schema

Client supports bandwidth rate limiting for prefetching

Client now supports rate limiting for prefetch requests, which can prevent network overload and reduce competition with other active download tasks, thereby enhancing overall system performance. Refer to the documentation to configure the proxy.prefetchRateLimit option.
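A sketch of what that looks like in the client configuration; only the proxy.prefetchRateLimit key comes from the documentation reference above, and the value is illustrative:

```yaml
proxy:
  # Cap bandwidth spent on speculative prefetching so it does not
  # compete with interactive download tasks (illustrative value).
  prefetchRateLimit: 2GiB
```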


The following diagram illustrates the usage of the download rate limit, upload rate limit, and prefetch rate limit for the client.

rate limit

Client supports leeching

If the user configures the client to disable sharing, it will become a leech.


Optimize client's performance for handling a large number of small I/Os by Nydus

  • Add the X-Dragonfly-Prefetch HTTP header. If X-Dragonfly-Prefetch is set to true and the request is a range request, the client will prefetch the entire task. This allows Nydus to control which requests need prefetching (see the example after this list).
  • The client's HTTP proxy adds an independent cache to reduce requests to the gRPC server, thereby reducing request latency.
  • Increase the memory cache size in RocksDB and enable prefix search for quickly searching piece metadata.
  • Use the CRC-32-Castagnoli algorithm with hardware acceleration to reduce the hash calculation cost for piece content.
  • Reuse the gRPC connections for downloading and optimize the download logic.
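For instance, a range request routed through the client's HTTP proxy could opt in to whole-task prefetching like this; the URL is hypothetical, and the proxy port matches the default 65001 used elsewhere on this page:

```shell
# Ask the proxy to prefetch the entire task behind this range request.
curl -x http://127.0.0.1:65001 \
  -H "X-Dragonfly-Prefetch: true" \
  -r 0-4095 \
  http://registry.example.com/v2/app/blobs/sha256:ab12cd...
```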

Defines V2 of the P2P transfer protocol

Dragonfly defines V2 of the P2P transfer protocol to make it more standard, clearer, and better performing; refer to dragonflyoss/api.

Enhanced Harbor Integration with P2P Preheating

Dragonfly improves its integration with Harbor v2.13 for preheating images, which includes the following enhancements:

  • Support for preheating multi-architecture images.
  • Users can select the preheat scope for multi-granularity preheating (Single Seed Peer, All Seed Peers, All Peers).
  • Users can specify the scheduler cluster IDs for preheating images to the desired Dragonfly clusters.

Refer to documentation for more details.

create P2P Provider policy

Task Manager

Users can search all peers of a cached task by task ID or download URL, and delete the cache on the selected peers; refer to the documentation.

dragonfly dashboard

Peer Manager

The Manager will regularly synchronize peers' information and also allows for manual refreshes. Additionally, it will display peers' information on the Manager Console.

dragonfly dashboard

Add hostname regexes and CIDRs to cluster scopes for matching clients

When the client starts, it reports its hostname and IP to the Manager. The Manager then returns the best matching cluster (including schedulers and seed peers) to the client based on the cluster scopes configuration.
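As a rough illustration, a cluster's scopes might pin clients by CIDR and hostname pattern as below; the field names are assumptions based on the feature description, not a documented schema:

```yaml
# Hypothetical scopes for one cluster: clients whose IP or hostname
# matches are steered to this cluster's schedulers and seed peers.
scopes:
  idc: na-east
  cidrs:
    - 10.0.0.0/16
  hostnames:
    - ^gpu-node-.*$
```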

Creating a cluster on the Dragonfly dashboard

Supports distributed rate limiting for creating jobs across different clusters

Users can configure rate limiting for job creation across different clusters in the Manager Console.

creating a cluster on the dragonfly dashboard

Support preheating images using self-signed certificates

Preheating requires calling the container registry to parse the image manifest and construct the URL for downloading blobs. If the container registry uses a self-signed certificate, users can configure the self-signed certificate in the Manager's config for calls to the container registry.


Support mTLS for gRPC calls between services

By setting self-signed certificates in the configurations of the Manager, Scheduler, Seed Peer, and Peer, gRPC calls between services will use mTLS.

Observability

Dragonfly recommends using Prometheus for monitoring. Prometheus and Grafana configurations are maintained in the dragonflyoss/monitoring repository.

Grafana dashboards are listed below:

| Name | ID | Link | Description |
|------|----|------|-------------|
| Dragonfly Manager | 15945 | https://grafana.com/grafana/dashboards/15945 | Grafana dashboard for dragonfly manager. |
| Dragonfly Scheduler | 15944 | https://grafana.com/grafana/dashboards/15944 | Grafana dashboard for dragonfly scheduler. |
| Dragonfly Client | 21053 | https://grafana.com/grafana/dashboards/21053 | Grafana dashboard for dragonfly client and dragonfly seed client. |
| Dragonfly Seed Client | 21054 | https://grafana.com/grafana/dashboards/21054 | Grafana dashboard for dragonfly seed client. |


creating a cluster on the dragonfly dashboard

Nydus

Nydus v2.3.0 is released, refer to Nydus Image Service v2.3.0 for more details.

  • builder: support --parent-bootstrap for merge.
  • builder/nydusd: support batch chunk merging.
  • nydusify/nydus-snapshotter: support OCI reference types.
  • nydusify: support export/import for remote images.
  • nydusify: support --push-chunk-size for large images.
  • nydusd/nydus-snapshotter: support basic failover and hot upgrade.
  • nydusd: support overlay writable mount for fusedev.

Console

Console v0.2.0 is released, featuring a redesigned UI and an improved interaction flow. Additionally, more functional pages have been added, such as preheating, the task manager, the PATs (Personal Access Tokens) manager, etc. Refer to the documentation for more details.

cluster overview

deeper dive image into cluster-1 on dashboard

Document

Refactor the website documentation to make Dragonfly simpler and more practical for users, refer to d7y.io.

dragonfly website

Significant bug fixes

The following content only highlights the significant bug fixes in this release.

  • Fix the thread safety issue that occurs when constructing the DAG (Directed Acyclic Graph) during scheduling.
  • Fix the memory leak caused by the OpenTelemetry library.
  • Avoid hot reload when dynconfig refreshes data from the Manager.
  • Prevent concurrent download requests from causing failures in state machine transitions.
  • Use context.Background() to avoid the stream being canceled by dfdaemon.
  • Fix the database performance issue caused by clearing expired jobs when there are too many job records.
  • Reuse the gRPC connection pool to prevent redundant request construction.

AI Infrastructure

Model Spec

The Dragonfly community is collaboratively defining the OCI Model Specification. OCI Model Specification aims to provide a standard way to package, distribute and run AI models in a cloud native environment. The goal of this specification is to package models in an OCI artifact to take advantage of OCI distribution and ensure efficient model deployment, refer to CloudNativeAI/model-spec for more details.

OCI Model Specification image


Support accelerated distribution of AI models in Hugging Face Hub (Git LFS)

Distribute large files downloaded via the Git LFS protocol through Dragonfly P2P, refer to the documentation.

hugging face hub clusters

Maintainers

The community has added four new Maintainers, hoping to help more contributors participate in the community.

  • Han Jiang: He works for Kuaishou and will focus on the engineering work for Dragonfly.
  • Yuan Yang: He works for Alibaba Group and will focus on the engineering work for Dragonfly.

Other

You can see CHANGELOG for more details.

Triton Server accelerates distribution of models based on Dragonfly

· 10 min read


Project post by Yufei Chen, Miao Hao, and Min Huang, Dragonfly project

This document will help you experience how to use Dragonfly with Triton Server. When downloading models, the files are large and many services download them at the same time, so the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing nodes in Triton Server in Cluster A and Cluster B to Model Registry

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Cluster A and Cluster B Peer to Root Peer to Model Registry

Installation

By integrating the Dragonfly Repository Agent into Triton, download traffic is routed through Dragonfly to pull models stored in S3, OSS, GCS, and ABS, and the models are registered in Triton. The Dragonfly Repository Agent lives in the dragonfly-repository-agent repository.

Prerequisites

| Name | Version | Document |
|------|---------|----------|
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| Triton Server | 23.08-py3 | Triton Server |

Notice: Kind is recommended if no Kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setup

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Cluster

Create kind multi-node cluster configuration file kind-config.yaml, configuration content is as follows:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

```shell
docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest
```

Kind cluster loads dragonfly latest images:

```shell
kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest
```

Create dragonfly cluster based on helm charts

Create the helm charts configuration file charts-config.yaml and set the regx entries under dfdaemon.config.proxy.proxies to match the download path of the object storage. Example: add regx: .*models.* to match download requests for the object storage bucket models. Configuration content is as follows:

```yaml
scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
        cacert: ''
        cert: ''
        key: ''
      tcpListen:
        namespace: ''
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        # Proxy all HTTP download requests for the model bucket path.
        - regx: .*models.*

manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true
```

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Example output:

LAST DEPLOYED: Wed Nov 29 21:23:48 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

```shell
kubectl get pods -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-8qcpd             1/1     Running   4 (118s ago)   2m45s
dragonfly-dfdaemon-qhkn8             1/1     Running   4 (108s ago)   2m45s
dragonfly-jaeger-6c44dc44b9-dfjfv    1/1     Running   0              2m45s
dragonfly-manager-549cd546b9-ps5tf   1/1     Running   0              2m45s
dragonfly-mysql-0                    1/1     Running   0              2m45s
dragonfly-redis-master-0             1/1     Running   0              2m45s
dragonfly-redis-replicas-0           1/1     Running   0              2m45s
dragonfly-redis-replicas-1           1/1     Running   0              2m7s
dragonfly-redis-replicas-2           1/1     Running   0              101s
dragonfly-scheduler-0                1/1     Running   0              2m45s
dragonfly-seed-peer-0                1/1     Running   1 (52s ago)    2m45s
```

Expose the Proxy service port

Create the dfstore.yaml configuration file to expose the port on which the Dragonfly Peer's HTTP proxy listens. The default port is 65001; set targetPort to 65001.

```yaml
kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
    - protocol: TCP
      port: 65001
      targetPort: 65001
  type: NodePort
```

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward requests to the Dragonfly Peer's HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Repository Agent

Set Dragonfly Repository Agent configuration

Create the dragonfly_config.json configuration file; the configuration is as follows:

```json
{
  "proxy": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ]
}
```
  • proxy: The address of the Dragonfly Peer's HTTP proxy.
  • header: Adds request headers to the request.
  • filter: Used to generate unique tasks and filter unnecessary query parameters in the URL.

In the filter of the configuration, set different values when using different object storage:

| Type | Value |
|------|-------|
| OSS | ["Expires","Signature","ns"] |
| S3 | ["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"] |
| OBS | ["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Obs-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"] |

Set Model Repository configuration

Create the cloud_credential.json cloud storage credential file; the configuration is as follows:

```json
{
  "gs": {
    "": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS",
    "gs://gcs-bucket-002": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS_2"
  },
  "s3": {
    "": {
      "secret_key": "AWS_SECRET_ACCESS_KEY",
      "key_id": "AWS_ACCESS_KEY_ID",
      "region": "AWS_DEFAULT_REGION",
      "session_token": "",
      "profile": ""
    },
    "s3://s3-bucket-002": {
      "secret_key": "AWS_SECRET_ACCESS_KEY_2",
      "key_id": "AWS_ACCESS_KEY_ID_2",
      "region": "AWS_DEFAULT_REGION_2",
      "session_token": "AWS_SESSION_TOKEN_2",
      "profile": "AWS_PROFILE_2"
    }
  },
  "as": {
    "": {
      "account_str": "AZURE_STORAGE_ACCOUNT",
      "account_key": "AZURE_STORAGE_KEY"
    },
    "as://Account-002/Container": {
      "account_str": "",
      "account_key": ""
    }
  }
}
```

In order to pull models through Dragonfly, the following content needs to be added to the model's config.pbtxt file:

```text
model_repository_agents {
  agents [
    {
      name: "dragonfly",
    }
  ]
}
```

The densenet_onnx example contains the modified configuration and model file. The modified config.pbtxt is as follows:

name: "densenet_onnx"platform: "onnxruntime_onnx"max_batch_size : 0input [ย  {ย  ย  name: "data_0"ย  ย  data_type: TYPE_FP32ย  ย  format: FORMAT_NCHWย  ย  dims: [ 3, 224, 224 ]ย  ย  reshape { shape: [ 1, 3, 224, 224 ] }ย  }]output [ย  {ย  ย  name: "fc6_1"ย  ย  data_type: TYPE_FP32ย  ย  dims: [ 1000 ]ย  ย  reshape { shape: [ 1, 1000, 1, 1 ] }ย  ย  label_filename: "densenet_labels.txt"ย  }]model_repository_agents{ย  agents [ย  ย  {ย  ย  ย  name: "dragonfly",ย  ย  }ย  ]}

Triton Server integrates Dragonfly Repository Agent plugin

Install Triton Server with Docker

Pull the dragonflyoss/dragonfly-repository-agent image, which integrates the Dragonfly Repository Agent plugin into Triton Server; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-repository-agent:latest

Run the container and mount the configuration directory:

```shell
docker run --network host --rm \
  -v ${path-to-config-dir}:/home/triton/ \
  dragonflyoss/dragonfly-repository-agent:latest tritonserver \
  --model-repository=${model-repository-path}
```
  • path-to-config-dir: The path of the directory containing dragonfly_config.json and cloud_credential.json.
  • model-repository-path: The path of the remote model repository.

The correct output is as follows:

=============================== Triton Inference Server ===============================
successfully loaded 'densenet_onnx'
I1130 09:43:22.595672 1 server.cc:604]
+------------------+------------------------------------------------------------------------+
| Repository Agent | Path |
+------------------+------------------------------------------------------------------------+
| dragonfly | /opt/tritonserver/repoagents/dragonfly/libtritonrepoagent_dragonfly.so |
+------------------+------------------------------------------------------------------------+

I1130 09:43:22.596011 1 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.596112 1 server.cc:674]
+---------------+---------+--------+
| Model | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1 | READY |
+---------------+---------+--------+

I1130 09:43:22.598318 1 metrics.cc:703] Collecting CPU metrics
I1130 09:43:22.599373 1 tritonserver.cc:2435]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | s3://192.168.36.128:9000/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.610334 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1130 09:43:22.612623 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1130 09:43:22.695843 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Execute the following command to check the Dragonfly logs:

kubectl exec -it -n dragonfly-system dragonfly-dfdaemon-<id> -- tail -f /var/log/dragonfly/daemon/core.log

Check that the download succeeded through Dragonfly:

```json
{
  "level": "info",
  "ts": "2024-02-02 05:28:02.631",
  "caller": "peer/peertask_conductor.go:1349",
  "msg": "peer task done, cost: 352ms",
  "peer": "10.244.2.3-1-4398a429-d780-423a-a630-57d765f1ccfc",
  "task": "974aaf56d4877cc65888a4736340fb1d8fecc93eadf7507f531f9fae650f1b4d",
  "component": "PeerTask",
  "trace": "4cca9ce80dbf5a445d321cec593aee65"
}
```

Verify

Call the inference API:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

Check that the response is successful:

```text
Request 01
Image '/workspace/images/mug.jpg':
    15.349563 (504) = COFFEE MUG
    13.227461 (968) = CUP
    10.424893 (505) = COFFEEPOT
```

Performance testing

Test the performance of single-machine model downloads via the Triton API after integrating Dragonfly P2P. Because the machine's own network environment affects the results, the absolute download time matters less than the relative download speed across scenarios:

Bar chart showing download time for: Triton API; Triton API & Dragonfly Cold Boot; Hit Dragonfly Remote Peer Cache; Hit Dragonfly Local Peer Cache

  • Triton API: Use the signed URL provided by the object storage to download the model directly.
  • Triton API & Dragonfly Cold Boot: Use the Triton Server API to download the model via the Dragonfly P2P network with no cache hits.
  • Hit Remote Peer: Use the Triton Server API to download the model via the Dragonfly P2P network, hitting the remote peer cache.
  • Hit Local Peer: Use the Triton Server API to download the model via the Dragonfly P2P network, hitting the local peer cache.

Test results show that the Triton and Dragonfly integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits, the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, the model download speed will be even faster.

Resources

Dragonfly Community

NVIDIA Triton Inference Server

Dragonfly accelerates distribution of large files with Git LFS

· 11 min read


What is Git LFS?

Git LFS (Large File Storage) is an open-source extension for Git that enables users to handle large files more efficiently in Git repositories. Git is a version control system designed primarily for text files such as source code, and it can become less efficient when dealing with large binary files like audio, videos, datasets, graphics, and other large assets. These files can significantly increase the size of a repository and make cloning and fetching operations slow.

Diagram flow showing Remote to Large File Storage

Git LFS addresses this issue by storing these large files on a separate server and replacing them in the Git repository with small placeholder files (pointers). When a user clones or pulls from the repository, Git LFS fetches the large files from the LFS server as needed rather than downloading them all with the initial clone. For specifications, please refer to the Git LFS Specification. The server is implemented on top of the HTTP protocol; refer to the Git LFS API. Git LFS's content storage usually uses object storage to store the large files.
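For illustration, the pointer committed to the repository is just a small text stub; the example below reuses the oid and size that appear in the GitHub LFS log later in this post:

version https://git-lfs.github.com/spec/v1
oid sha256:c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29
size 5242880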

Git LFS Usageโ€‹

Git LFS manages large filesโ€‹

GitHub and GitLab usually manage large files based on Git LFS.

Git LFS manages AI models and AI datasetsโ€‹

Large model and dataset files in AI are usually managed with Git LFS. The Hugging Face Hub and ModelScope Hub manage models and datasets based on Git LFS.

Hugging Face Hub's Python library implements the Git LFS protocol to download models and datasets. To accelerate those downloads with Dragonfly, refer to Hugging Face accelerates distribution of models and datasets based on Dragonfly.

Dragonfly eliminates the bandwidth limit of Git LFSโ€™s content storageโ€‹

This document will help you experience how to use Dragonfly with Git LFS. When large files are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Cluster A and Cluster B to Large File Storage

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating large files downloading.

Diagram flow showing Cluster A and Cluster B to Large File Storage using Peer and Root Peer

Dragonfly accelerates downloads with Git LFSโ€‹

By proxying Git LFS's HTTP file download requests to the Dragonfly Peer Proxy, the download traffic is forwarded into the P2P network. The following documentation is based on GitHub LFS.
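Conceptually, any HTTP client can hand such a download to the Peer Proxy; a hedged Python sketch (the object path is hypothetical, and real URLs come from the LFS batch API response shown below):

import requests

# Hedged sketch: route one LFS object download through the Dragonfly Peer Proxy.
# The proxy address matches the deployment below; the object path is hypothetical.
proxies = {"http": "http://127.0.0.1:65001"}
url = "http://github-cloud.githubusercontent.com/alambic/media/hypothetical-object-path"

with requests.get(url, proxies=proxies, stream=True) as resp:
    resp.raise_for_status()
    with open("large-file.bin", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)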

Get the Content Storage address of Git LFSโ€‹

Add GIT_CURL_VERBOSE=1 to print verbose logs of git clone and find the address of Git LFS's content storage.

GIT_CURL_VERBOSE=1 git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Look for the trace git-lfs keyword in the logs to see how Git LFS downloads files. Pay attention to the actions and download fields in the log.

15:31:04.848308 trace git-lfs: HTTP: {"objects":[{"oid":"c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29","size":5242880,"actions":{"download":{"href":"https://github-cloud.githubusercontent.com/alambic/media/376919987/c0/36/c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIMWPLRQEC4XCWWPA%2F20231221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231221T073104Z&X-Amz-Expires=3600&X-Amz-Signature=4dc757dff0ac96eac3f0cd2eb29ca887035d3a6afba41cb10200ed0aa22812fa&15:31:04.848403 trace git-lfs: HTTP: X-Amz-SignedHeaders=host&actor_id=15955374&key_id=0&repo_id=392935134&token=1","expires_at":"2023-12-21T08:31:04Z","expires_in":3600}}}]}

The download URL can be found in actions.download.href in objects. You can see that the content storage of GitHub LFS is actually hosted at github-cloud.githubusercontent.com, and that the query parameters include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders. These are AWS Authenticating Requests parameters. The keys of the query parameters will be used later when configuring the Dragonfly Peer Proxy; a sketch of why they are filtered follows the list below.

Information about Git LFS:

  1. The content storage address of Git LFS is github-cloud.githubusercontent.com.
  2. The query parameters of the download URL include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders.
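Why these parameters must be filtered becomes clearer with a small sketch: Dragonfly derives a task ID from the URL (as the Helm chart configuration later in this post notes), and the signature parameters change on every request. A hedged Python illustration, not Dragonfly's actual code:

import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hedged sketch (not Dragonfly's implementation): stripping the signed query
# parameters makes the derived task id stable across differently signed URLs.
FILTERED = {
    "X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date",
    "X-Amz-Expires", "X-Amz-Signature", "X-Amz-SignedHeaders",
}

def task_id(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FILTERED]
    stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
    return hashlib.sha256(stripped.encode()).hexdigest()

# Two clients holding differently signed URLs for the same object get one task:
a = task_id("https://storage.example.com/object?X-Amz-Signature=aaa&X-Amz-Date=20231221")
b = task_id("https://storage.example.com/object?X-Amz-Signature=bbb&X-Amz-Date=20231222")
assert a == b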

Installationโ€‹

Prerequisitesโ€‹

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonflyโ€‹

For detailed installation documentation on a Kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind
Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml. Add the github-cloud.githubusercontent.com rule to dfdaemon.config.proxy.proxies.regx to forward the HTTP download traffic of Git LFS's content storage to the P2P network. Also add the X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders parameters to dfdaemon.config.proxy.defaultFilter to filter the query parameters. Dragonfly generates a unique task ID based on the URL, so the volatile signed query parameters must be filtered out for the task ID to be stable. The configuration content is as follows:

scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-Signature&X-Amz-SignedHeaders"
      security:
        insecure: true
        cacert: ""
        cert: ""
        key: ""
      tcpListen:
        namespace: ""
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        - regx: github-cloud.githubusercontent.com.*

manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Output:

NAME: dragonfly
LAST DEPLOYED: Thu Dec 21 17:24:37 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-cttxz             1/1     Running   4 (116s ago)   2m51s
dragonfly-dfdaemon-k62vd             1/1     Running   4 (117s ago)   2m51s
dragonfly-jaeger-84dbfd5b56-mxpfs    1/1     Running   0              2m51s
dragonfly-manager-5c598d5754-fd9tf   1/1     Running   0              2m51s
dragonfly-mysql-0                    1/1     Running   0              2m51s
dragonfly-redis-master-0             1/1     Running   0              2m51s
dragonfly-redis-replicas-0           1/1     Running   0              2m51s
dragonfly-redis-replicas-1           1/1     Running   0              106s
dragonfly-redis-replicas-2           1/1     Running   0              78s
dragonfly-scheduler-0                1/1     Running   0              2m51s
dragonfly-seed-peer-0                1/1     Running   1 (37s ago)    2m51s

Create the peer service configuration file peer-service-config.yaml. The configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Git LFS downloads large files via Dragonfly​

Proxy Git LFS download requests to the Dragonfly Peer Proxy (http://127.0.0.1:65001) through Git configuration. The Git configuration includes the http.proxy, lfs.transfer.enablehrefrewrite and url.{YOUR-LFS-CONTENT-STORAGE}.insteadOf properties.

git config --global http.proxy http://127.0.0.1:65001
git config --global lfs.transfer.enablehrefrewrite true
git config --global url.http://github-cloud.githubusercontent.com/.insteadOf https://github-cloud.githubusercontent.com/

Forward Git LFS download requests to the P2P network via Dragonfly Peer Proxy and Git clone the large files.

git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Verify large files download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

2023-12-21T16:55:20.495+0800  INFO  peer/peertask_conductor.go:1326  peer task done, cost: 2238ms  {"peer": "30.54.146.131-15874-f6729352-950e-412f-b876-0e5c8e3232b1", "task": "70c644474b6c986e3af27d742d3602469e88f8956956817f9f67082c6967dc1a", "component": "PeerTask", "trace": "35c801b7dac36eeb0ea43a58d1c82e77"}

Performance testingโ€‹

Test the performance of single-machine large file download after the integration of Git LFS and Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing time to download large files (512M and 1G) between Git LFS, Git LFS & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache

  • Git LFS: Use Git LFS to download large files directly.
  • Git LFS & Dragonfly Cold Boot: Use Git LFS to download large files via the Dragonfly P2P network with no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the local peer cache.

Test results show that the Git LFS and Dragonfly P2P integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, large file downloads will be even faster.

Dragonfly communityโ€‹

Git LFSโ€‹

TorchServe accelerates the distribution of models based on Dragonfly

ยท 14 min read


This document will help you experience how to use Dragonfly with TorchServe. When models are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Architectureโ€‹

Dragonfly Endpoint architecture

Dragonfly Endpoint plugin forwards TorchServe download model requests to the Dragonfly P2P network.

Dragonfly Endpoint architecture

The model download steps:

  1. TorchServe sends a model download request, and the request is forwarded to the Dragonfly Peer.
  2. The Dragonfly Peer registers the task with the Dragonfly Scheduler.
  3. The Dragonfly Scheduler returns candidate parents to the Dragonfly Peer.
  4. The Dragonfly Peer downloads the model from the candidate parents.
  5. After the model is downloaded, TorchServe registers the model.

Installationโ€‹

By integrating the Dragonfly Endpoint plugin into TorchServe, download traffic goes through Dragonfly to pull models stored in S3, OSS, GCS, and ABS, and the models are then registered in TorchServe. The Dragonfly Endpoint plugin lives in the dragonfly-endpoint repository.

Prerequisitesโ€‹

| Name | Version | Document |
| --- | --- | --- |
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| TorchServe | 0.4.0+ | pytorch.org/serve/ |

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setupโ€‹

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml and set dfdaemon.config.proxy.proxies.regx to match the download path of the object storage. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  hostNetwork: true
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
        cacert: ''
        cert: ''
        key: ''
      tcpListen:
        namespace: ''
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        - regx: .*amazonaws.*

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
LAST DEPLOYED: Mon Sep 4 10:24:55 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/
4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME READY STATUS RESTARTS AGE
dragonfly-dfdaemon-7r2cn 1/1 Running 0 3m31s
dragonfly-dfdaemon-fktl4 1/1 Running 0 3m31s
dragonfly-jaeger-c7947b579-2xk44 1/1 Running 0 3m31s
dragonfly-manager-5d4f444c6c-wq8d8 1/1 Running 0 3m31s
dragonfly-mysql-0 1/1 Running 0 3m31s
dragonfly-redis-master-0 1/1 Running 0 3m31s
dragonfly-redis-replicas-0 1/1 Running 0 3m31s
dragonfly-redis-replicas-1 1/1 Running 0 3m5s
dragonfly-redis-replicas-2 1/1 Running 0 2m44s
dragonfly-scheduler-0 1/1 Running 0 3m31s
dragonfly-seed-peer-0 1/1 Running 0 3m31s

Expose the Proxy service portโ€‹

Create the dfstore.yaml configuration to expose the port on which the Dragonfly Peer's HTTP proxy listens. The default port is 65001, so set targetPort to 65001.

kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
    - protocol: TCP
      port: 65001
      targetPort: 65001
  type: NodePort

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward request to Dragonfly Peerโ€™s HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Endpoint pluginโ€‹

Set environment variables for Dragonfly Endpoint configurationโ€‹

Create the config.json configuration and set the DRAGONFLY_ENDPOINT_CONFIG environment variable to the config.json file path (a resolution sketch follows the list below).

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

The default configuration path is:

  • linux: /etc/dragonfly-endpoint/config.json
  • darwin: ~/.dragonfly-endpoint/config.json
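As a hedged sketch of that lookup order (the environment variable wins, otherwise the per-OS default path applies):

import json
import os
import pathlib
import sys

# Hedged sketch: resolve the plugin configuration path as described above.
default = (
    "/etc/dragonfly-endpoint/config.json"
    if sys.platform.startswith("linux")
    else str(pathlib.Path.home() / ".dragonfly-endpoint" / "config.json")
)
path = os.environ.get("DRAGONFLY_ENDPOINT_CONFIG", default)
config = json.loads(pathlib.Path(path).read_text())
print(config["addr"])  # the peer proxy address, e.g. http://127.0.0.1:65001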

Dragonfly Endpoint configurationโ€‹

Create the config.json file to configure the Dragonfly Endpoint for S3. The configuration is as follows:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

In the filter field of the configuration, set different values for different object storage providers:

| Type | Value |
| --- | --- |
| OSS | "Expires&Signature&ns" |
| S3 | "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature" |
| OBS | "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Obs-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature" |

Object storage configurationโ€‹

In addition to S3, the Dragonfly Endpoint plugin also supports OSS, GCS and ABS. The configurations for the different object storage providers are as follows:

OSS (Object Storage Service)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": ["Expires", "Signature"],
  "object_storage": {
    "type": "oss",
    "bucket_name": "your_oss_bucket_name",
    "endpoint": "your_oss_endpoint",
    "access_key_id": "your_oss_access_key_id",
    "access_key_secret": "your_oss_access_key_secret"
  }
}

GCS (Google Cloud Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "gcs",
    "bucket_name": "your_gcs_bucket_name",
    "project_id": "your_gcs_project_id",
    "service_account_path": "your_gcs_service_account_path"
  }
}

ABS (Azure Blob Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "abs",
    "account_name": "your_abs_account_name",
    "account_key": "your_abs_account_key",
    "container_name": "your_abs_container_name"
  }
}

TorchServe integrates Dragonfly Endpoint pluginโ€‹

For detailed installation documentation, please refer to TorchServe document.

Binary installationโ€‹

Prerequisites​

| Name | Version | Document |
| --- | --- | --- |
| Python | 3.8.0+ | https://www.python.org/ |
| TorchServe | 0.4.0+ | pytorch.org/serve/ |
| Java | 11 | https://openjdk.org/projects/jdk/11/ |

Install TorchServe dependencies and torch-model-archiver:

python ./ts_scripts/install_dependencies.py
conda install torchserve torch-model-archiver torch-workflow-archiver -c pytorch

Clone TorchServe repository:

git clone https://github.com/pytorch/serve.git
cd serve

Create model-store directory to store the models:

mkdir model-store
chmod 777 model-store

Create plugins-path directory to store the binaries of the plugin:

mkdir plugins-path

Package Dragonfly Endpoint pluginโ€‹

Clone dragonfly-endpoint repository:

git clone https://github.com/dragonflyoss/dragonfly-endpoint.git

Build the dragonfly-endpoint project to generate the JAR file in the build/libs directory:

cd ./dragonfly-endpoint
gradle shadowJar

Note: Due to the limitations of TorchServeโ€™s JVM, the best Java version for Gradle is 11, as a higher version will cause the plugin to fail to parse.

Move the Jar file into the plugins-path directory:

mv build/libs/dragonfly_endpoint-1.0-all.jar <your plugins-path>

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

โ€“model-storesets the previously created directory to store the models and โ€“plugins-path sets the previously created directory to store the plugins. Start the TorchServe with Dragonfly Endpoint plugin:

torchserve --start --model-store <path-to-model-store-file> --plugins-path=<path-to-plugin-jars>

Verifyโ€‹

Prepare the model. Download a model from the Model ZOO, or package one yourself by referring to Torch Model archiver for TorchServe. Use the squeezenet1_1_scripted.mar model to verify:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<target path>

The TorchServe plugin is named dragonfly; please refer to the TorchServe Register API for details of the plugin API. The url parameter is not supported; instead, add the file_name parameter, which is the name of the model file to download.

Download the model:

curl -X POST "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"
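The same registration call can also be made from Python; a hedged sketch using requests (8081 is TorchServe's default management port):

import requests

# Hedged sketch, equivalent to the curl call above.
resp = requests.post(
    "http://localhost:8081/dragonfly/models",
    params={"file_name": "squeezenet1_1.mar"},
)
print(resp.status_code, resp.json())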

Verify that the model downloaded successfully:

{
  "Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers increased:

* About to connect() to localhost port 8081 (#0)
* Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{ "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact

Call inference API:

# Prepare images for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}

Install TorchServe with Dockerโ€‹

Docker configurationโ€‹

Pull the dragonflyoss/dragonfly-endpoint image with the plugin. The following is an example of the CPU version of TorchServe; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-endpoint

Create model-store directory to store the model files:

mkdir model-store
chmod 777 model-store

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

Mount the model-store and dragonfly-endpoint configuration directory. Run the container:

sudo docker run --rm -it --network host \
-v $(pwd)/model-store:/home/model-server/model-store \
-v ${DRAGONFLY_ENDPOINT_CONFIG}:${DRAGONFLY_ENDPOINT_CONFIG} \
dragonflyoss/dragonfly-endpoint:latest

How to Verifyโ€‹

Prepare the model. Download a model from the Model ZOO, or package one yourself by referring to Torch Model archiver for TorchServe. Use the squeezenet1_1_scripted.mar model to verify:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<target path>

The TorchServe plugin is named dragonfly; please refer to the TorchServe Register API for details of the plugin API. The url parameter is not supported; instead, add the file_name parameter, which is the name of the model file to download.

Download a model:

curl -X POST "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"

Verify that the model downloaded successfully:

{
  "Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers increased:

* About to connect() to localhost port 8081 (#0)
* Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{ "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact

Call inference API:

# Prepare images for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}

Performance testingโ€‹

Test the performance of single-machine model download through the TorchServe API after the integration of Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing TorchServe API, TorchServe API & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache performance based on time to download

  • TorchServe API: Use the signed URL provided by the object storage to download the model directly.
  • TorchServe API & Dragonfly Cold Boot: Use the TorchServe API to download the model via the Dragonfly P2P network with no cache hits.
  • Hit Remote Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the remote peer cache.
  • Hit Local Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the local peer cache.

Test results show that the TorchServe and Dragonfly integration can effectively reduce model download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be even faster.

Dragonfly communityโ€‹

Pytorchโ€‹

TorchServe Github Repo: https://github.com/pytorch/serve

Hugging Face accelerates distribution of models and datasets based on Dragonfly

ยท 10 min read


This document will help you experience how to use Dragonfly with Hugging Face. When datasets or models are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Prerequisitesโ€‹

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonflyโ€‹

For detailed installation documentation on a Kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml and set dfdaemon.config.proxy.registryMirror.url to the address of the Hugging Face Hub's LFS server. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  metrics:
    enable: true
  hostNetwork: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Key-Pair-Id&Policy&Signature'
      security:
        insecure: true
      tcpListen:
        listen: 0.0.0.0
        port: 65001
      registryMirror:
        # When enable, using header "X-Dragonfly-Registry" for remote instead of url.
        dynamic: true
        # URL for the registry mirror.
        url: https://cdn-lfs.huggingface.co
        # Whether to ignore https certificate errors.
        insecure: true
        # Optional certificates if the remote server uses self-signed certificates.
        certs: []
        # Whether to request the remote registry directly.
        direct: false
        # Whether to use proxies to decide if dragonfly should be used.
        useProxies: true
      proxies:
        - regx: repos.*
          useHTTPS: true

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
NAME: dragonfly
LAST DEPLOYED: Wed Oct 19 04:23:22 2022
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME READY STATUS RESTARTS AGE
dragonfly-dfdaemon-rhnr6 1/1 Running 4 (101s ago) 3m27s
dragonfly-dfdaemon-s6sv5 1/1 Running 5 (111s ago) 3m27s
dragonfly-manager-67f97d7986-8dgn8 1/1 Running 0 3m27s
dragonfly-mysql-0 1/1 Running 0 3m27s
dragonfly-redis-master-0 1/1 Running 0 3m27s
dragonfly-redis-replicas-0 1/1 Running 1 (115s ago) 3m27s
dragonfly-redis-replicas-1 1/1 Running 0 95s
dragonfly-redis-replicas-2 1/1 Running 0 70s
dragonfly-scheduler-0 1/1 Running 0 3m27s
dragonfly-seed-peer-0 1/1 Running 2 (95s ago) 3m27s

Create the peer service configuration file peer-service-config.yaml. The configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Use Hub Python Library to download files and distribute traffic through Dragonfly​

Any API in the Hub Python Library that uses the Requests library to download files can distribute its download traffic over the P2P network by mounting a DragonflyAdapter on the requests Session.

Download a single file with Dragonflyโ€‹

A single file can be downloaded using hf_hub_download, distributing the traffic through the Dragonfly peer.

Create the hf_hub_download_dragonfly.py file. Use DragonflyAdapter to forward LFS download requests to the Dragonfly HTTP proxy so that the P2P network can distribute the files. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import hf_hub_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):
        # Change the schema of the LFS request to download large files from https:// to http://,
        # so that the Dragonfly HTTP proxy can be used.
        if url.startswith('https://cdn-lfs.huggingface.co'):
            url = url.replace('https://', 'http://')
        return super().get_connection(url, proxies)

    def add_headers(self, request, **kwargs):
        super().add_headers(request, **kwargs)
        # If there are multiple different LFS repositories, you can override the
        # default repository address by adding the X-Dragonfly-Registry header.
        if request.url.find('example.com') != -1:
            request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.mount('http://', DragonflyAdapter())
    session.mount('https://', DragonflyAdapter())
    session.proxies = {'http': 'http://127.0.0.1:65001'}
    return session

# Set it as the default session factory.
configure_http_backend(backend_factory=backend_factory)

hf_hub_download(repo_id="tiiuae/falcon-rw-1b", filename="pytorch_model.bin")

Download a single file of the LFS protocol with Dragonfly:

$ python3 hf_hub_download_dragonfly.py
(โ€ฆ)YkNX13a46FCg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.62G/2.62G [00:52<00:00, 49.8MB/s]

Verify a single file download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms  {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}

Download a snapshot of the repo with Dragonflyโ€‹

A snapshot of the repo can be downloaded using snapshot_download, distributing the traffic through the Dragonfly peer.

Create the snapshot_download_dragonfly.py file. Use DragonflyAdapter to forward LFS download requests to the Dragonfly HTTP proxy so that the P2P network can distribute the files. Only files under the LFS protocol are distributed through the Dragonfly P2P network. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import snapshot_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):
        # Change the schema of the LFS request to download large files from https:// to http://,
        # so that the Dragonfly HTTP proxy can be used.
        if url.startswith('https://cdn-lfs.huggingface.co'):
            url = url.replace('https://', 'http://')
        return super().get_connection(url, proxies)

    def add_headers(self, request, **kwargs):
        super().add_headers(request, **kwargs)
        # If there are multiple different LFS repositories, you can override the
        # default repository address by adding the X-Dragonfly-Registry header.
        if request.url.find('example.com') != -1:
            request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.mount('http://', DragonflyAdapter())
    session.mount('https://', DragonflyAdapter())
    session.proxies = {'http': 'http://127.0.0.1:65001'}
    return session

# Set it as the default session factory.
configure_http_backend(backend_factory=backend_factory)

snapshot_download(repo_id="tiiuae/falcon-rw-1b")

Download a snapshot of the repo with Dragonfly:

$ python3 snapshot_download_dragonfly.py
(โ€ฆ)03165eb22f0a867d4e6a64d34fce19/README.md: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 7.60k/7.60k [00:00<00:00, 374kB/s]
(โ€ฆ)7d4e6a64d34fce19/configuration_falcon.py: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 6.70k/6.70k [00:00<00:00, 762kB/s]
(โ€ฆ)f0a867d4e6a64d34fce19/modeling_falcon.py: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 56.9k/56.9k [00:00<00:00, 5.35MB/s]
(โ€ฆ)3165eb22f0a867d4e6a64d34fce19/merges.txt: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 456k/456k [00:00<00:00, 9.07MB/s]
(โ€ฆ)867d4e6a64d34fce19/tokenizer_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 234/234 [00:00<00:00, 106kB/s]
(โ€ฆ)eb22f0a867d4e6a64d34fce19/tokenizer.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.11M/2.11M [00:00<00:00, 27.7MB/s]
(โ€ฆ)3165eb22f0a867d4e6a64d34fce19/vocab.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 798k/798k [00:00<00:00, 19.7MB/s]
(โ€ฆ)7d4e6a64d34fce19/special_tokens_map.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 99.0/99.0 [00:00<00:00, 45.3kB/s]
(โ€ฆ)67d4e6a64d34fce19/generation_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 115/115 [00:00<00:00, 5.02kB/s]
(โ€ฆ)165eb22f0a867d4e6a64d34fce19/config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1.05k/1.05k [00:00<00:00, 75.9kB/s]
(โ€ฆ)eb22f0a867d4e6a64d34fce19/.gitattributes: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1.48k/1.48k [00:00<00:00, 171kB/s]
(โ€ฆ)t-oSSW23tawg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.62G/2.62G [00:50<00:00, 52.1MB/s]
Fetching 12 files: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 12/12 [00:50<00:00, 4.23s/it]

Verify a snapshot of the repo download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms  {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}

Performance testingโ€‹

Test the performance of single-machine file download through the hf_hub_download API after the integration of the Hugging Face Python library and Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing performance testing result

  • Hugging Face Python Library: Use hf_hub_download API to download models directly.
  • Hugging Face Python Library & Dragonfly Cold Boot: Use hf_hub_download API to download models via Dragonfly P2P network and no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the local peer cache.
  • Hit Hugging Face Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the Hugging Face local cache.

Test results show that the Hugging Face Python library and Dragonfly P2P integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be even faster.

Dragonfly communityโ€‹

Hugging Faceโ€‹

Dragonfly completes security audit!

ยท 3 min read

This summer, over four engineer-weeks, Trail of Bits and OSTIF collaborated on a security audit of Dragonfly. A CNCF Incubating project, Dragonfly provides file distribution based on peer-to-peer technology. Included in the scope was the sub-project Nydus's repository, which handles image distribution. The engagement was outlined and framed around several goals relevant to the security and longevity of the project as it moves towards graduation.

The Trail of Bits audit team approached the audit with a mix of automated and manual processes, combining static analysis and manual review. By introducing semgrep and CodeQL tooling, performing a manual review of the client, scheduler, and manager code, and fuzz testing the gRPC handlers, the audit team identified a variety of findings to help the project improve its security. By focusing on high-level business logic and externally accessible endpoints, the Trail of Bits team was able to direct its efforts during the audit and provide guidance and recommendations for Dragonfly's future work.

Recorded in the audit report are 19 findings: five were ranked high severity, one medium, four low, five informational, and four undetermined. Nine of the findings were categorized as Data Validation, three of which were high severity. Also ranked and reviewed was Dragonfly's codebase maturity, comprising eleven aspects of project code that are analyzed individually in the report.

This is a large project and could not be reviewed in full due to time and scope constraints; multiple specialized features were outside the scope of this audit for those reasons. The project is a great opportunity for continued audit work to improve the code and harden security before graduation. Ongoing security effort is critical, as security is a moving target.

We would like to thank the Trail of Bits team, particularly Dan Guido, Jeff Braswell, Paweล‚ Pล‚atek, and Sam Alws for their work on this project. Thank you to the dragonfly maintainers and contributors, specifically Wenbo Qi, for their ongoing work and contributions to this engagement. Finally, we are grateful to the CNCF for funding this audit and supporting open source security efforts.

OSTIF & Trail of Bitsโ€‹

Dragonfly communityโ€‹

Nydus communityโ€‹