
Cloud Native Computing Foundation Announces Dragonfly's Graduation

· 11 min read

Dragonfly graduates after demonstrating production readiness, powering container and AI workloads at scale.

Key Highlights:

  • Dragonfly graduates from CNCF after demonstrating production readiness and widespread adoption across container and AI workloads.
  • Dragonfly is used by major organizations, including Ant Group, Alibaba, Datadog, DiDi, and Kuaishou, to power large-scale container and AI model distribution.
  • Since joining the CNCF, Dragonfly has seen over 3,000% growth in code contributions and a growing contributor community spanning over 130 companies.

SAN FRANCISCO, Calif. – January 14, 2026 – The Cloud Native Computing Foundation® (CNCF®), which builds sustainable ecosystems for cloud native software, today announced the graduation of Dragonfly, a cloud native open source image and file distribution system designed to solve cloud native image distribution in Kubernetes-centered applications.

"Dragonfly's graduation reflects the project's maturity, broad industry adoption and critical role in scaling cloud native infrastructure," said Chris Aniszczyk, CTO, CNCF. "It's especially exciting to see the project's impact in accelerating image distribution and meeting the data demands of AI workloads. We're proud to support a community that continues to push forward scalable, efficient and open solutions."

Dragonfly's Technical Capabilities

Dragonfly delivers efficient, stable, and secure data distribution and acceleration powered by peer-to-peer (P2P) technology. It aims to provide a best-practice, standards-based solution for cloud native architectures to improve large-scale delivery of files, container images, OCI artifacts, AI models, caches, logs, and dependencies.

Dragonfly runs on Kubernetes and is installed via Helm, with its official chart available on Artifact Hub. It integrates with tools like Prometheus for tracking performance, OpenTelemetry for collecting and sharing telemetry data, and gRPC for fast communication between components, and it enhances Harbor's ability to distribute images and OCI artifacts through the preheat feature. In the GenAI era, as model serving becomes increasingly important, Dragonfly delivers even more value by distributing AI model artifacts defined by the ModelPack specification.

Dragonfly continues to advance container image distribution, supporting tens of millions of container launches per day in production, saving storage bandwidth by up to 90%, and reducing launch time from minutes to seconds, with large-scale adoption across different cloud native scenarios.

Dragonfly is also driving standards and acceleration solutions for distributing both AI model weights and optimized image layout in AI workloads. The technology reduces data loading for large-scale AI applications and enables the distribution of model weights at a hundred-terabyte scale to hundreds of nodes in minutes. As AI continues to integrate into operations, Dragonfly becomes crucial to powering large-scale AI workloads.

Milestones Driving Graduation

Dragonfly was open-sourced by Alibaba Group in November 2017. It then joined the CNCF as a Sandbox project in October 2018. During this stage, Dragonfly 1.0 became production-ready in November 2019 and the Dragonfly subproject, Nydus, was open-sourced in January 2020. Dragonfly then reached Incubation phase in April 2020, with Dragonfly 2.0 later released in 2021.

Since then, the community has significantly matured and attracted hundreds of contributors from organizations such as Ant Group, Alibaba Cloud, ByteDance, Kuaishou, Intel, Datadog, Zhipu AI, and more, who use Dragonfly to deliver efficient image and AI model distribution.

Since joining CNCF, contributors have increased by 500%, from 45 individuals across 5 companies to 271 individuals across over 130 companies. Commit activity has grown by over 3,000%, from roughly 800 to 26,000 commits, and the number of overall participants has reached 1,890.

What's Next For Dragonfly

Dragonfly will accelerate AI model weight distribution based on RDMA, improving throughput and reducing end-to-end latency. It will also optimize image layout to reduce data loading time for large-scale AI workloads. A load-aware two-phase scheduling mechanism will be introduced, leveraging collaboration between the scheduler and clients to enhance overall distribution efficiency. To provide more stable and reliable services, Dragonfly will support automatic updates and fault recovery, ensuring stable operation of all components during traffic bursts while controlling back-to-source traffic.

Dragonfly's Graduation Process

To officially graduate from Incubation status, the Dragonfly team enhanced the election policy, clarified the maintainer lifecycle, standardized the contribution process, defined the community ladder, and added community guidelines for subprojects. The graduation process is supported by CNCF's Technical Oversight Committee (TOC) sponsors for Dragonfly, Karena Angell and Kevin Wang, who conducted a thorough technical due diligence with Dragonfly's project maintainers.

Additionally, a third-party security audit of Dragonfly was conducted. The Dragonfly team, with the guidance of their TOC sponsors, completed both a self-assessment and a joint assessment with CNCF TAG Security, then collaborated with the Dragonfly security team on a threat model. After this, the team improved the project's security policy.

Learn more about Dragonfly and join the community: https://d7y.io/

Supporting Quotes

"I am thrilled, as the founder of Dragonfly, to announce its graduation from the CNCF. We are grateful to each and every open source contributor in the community, whose tenacity and commitment have enabled Dragonfly to reach its current state. Dragonfly was created to resolve Alibaba Group's challenges with ultra-large-scale file distribution and was open-sourced in 2017. Looking back on this journey over the past eight years, every step has embodied the open source spirit and the tireless efforts of the many contributors. This graduation marks a new starting point for Dragonfly. I hope that the project will embark on a new journey, continue to explore more possibilities in the field of data distribution, and provide greater value!"

—Zuozheng Hu, founder of Dragonfly, emeritus maintainer

"I am delighted that Dragonfly is now a CNCF graduated project. This is a significant milestone, reflecting the maturity of the community, the trust of end users, and the reliability of the service. In the future, with the support of CNCF, the Dragonfly team will work together to drive the community's sustainable growth and attract more contributors. Facing the challenges of large-scale model distribution and data distribution in the GenAI era, our team will continue to explore the future of data distribution within the cloud native ecosystem."

—Wenbo Qi (Gaius), core Dragonfly maintainer

"Since open-sourcing in 2020, Nydus, alongside Dragonfly, has been validated at production scale. Dragonfly's graduation is a key milestone for Nydus as a subproject, allowing the project to continue improving the image filesystem's usability and performance. It will also allow us to further explore ecosystem standardization and AGI use cases that will advance the underlying infrastructure."

—Song Yan, core Nydus maintainer

"The combination of Dragonfly and Nydus substantially shortens launch times for container images and AI models, enhancing system resilience and efficiency."

—Jiang Liu, Nydus maintainer

"Thanks to the community's collective efforts, Dragonfly has evolved from a tool for accelerating container images into a secure and stable distribution system widely adopted by many enterprises. Continuous improvements in usability and stability enable the project to support a variety of scenarios, including CI/CD, edge computing, and AI. New challenges are emerging for the distribution of model weights and data in the age of AI. Dragonfly is becoming a key infrastructure in mitigating these challenges. With the support of the CNCF, Dragonfly will continue to drive the future evolution of cloud native distribution technologies."

—Yuan Yang, Dragonfly maintainer

TOC Sponsors

"We're grateful to the TOC members who dedicated significant time to the technical due diligence required for Dragonfly's advancement, as well as the technical leads and community members who supported governance and technical reviews. We also thank the project maintainers for their openness and responsiveness throughout this process, and the end users who met with TOC and TAB members to share their experiences with Dragonfly. This level of collaboration is what helps ensure the strength and credibility of the CNCF ecosystem."

—Karena Angell, chair, TOC, CNCF

"Dragonfly's graduation is a testament to the project's technical maturity and the community's consistent focus on performance, reliability, and tangible impact. It's been impressive to see Dragonfly evolve to meet the needs of large-scale production environments and AI workloads. Congratulations to the maintainers and contributors who've worked hard to reach this milestone."

—Kevin Wang, vice chair, TOC, CNCF

Project End Users

"Over the past few years, as part of Ant Group's container infrastructure, the Dragonfly project has accelerated container image and code package delivery across several 10K-node Kubernetes clusters. It has significantly saved image transmission bandwidth, and the Nydus subproject has additionally helped us to reduce image pull time to near zero. The project also supports the delivery of large language models within our AI infrastructure. It is a great honor to have contributed to Dragonfly and to have shared our practices with the community."

—Xu Wang, head of the container infra team, Ant Group, and co-launcher of Kata Containers Project

"Dragonfly has become a key infrastructure component of the container image and data distribution system for Alibaba's large-scale distributed systems. In ultra-large-scale scenarios such as the Double 11 (Singles' Day) shopping festival, Dragonfly has provided stable and efficient distribution capabilities and has improved the efficiency of system scaling and delivery. Facing the new technological challenges of the AI era, Dragonfly has played an important role in model data distribution and cache acceleration, helping us to build a more efficient, intelligent computing platform. We are happy to see Dragonfly graduate, which represents an enhancement in community maturity and validates its reliability in large-scale production environments."

—Li Yi, director of engineering for container service, Alibaba Cloud & Tao Huang, director of engineering for cloud native transformation project, Alibaba Group

"Datadog recently adopted the Dragonfly subproject, Nydus, and it has helped significantly reduce time spent pulling images. This includes AI workloads, where image pulls previously took 5 minutes, and node daemonsets, whose startup speeds directly affect how quickly applications can be scheduled on nodes. We have seen significant improvements using Nydus; now everything starts in a matter of seconds. We are thrilled to see Dragonfly graduate and hope to continue to contribute to this impressive ecosystem!"

—Baptiste Girard-Carrabin, Datadog

"DiDi uses a distributed cloud platform to handle a large number of user requests and quickly adjust resources, which requires very efficient and stable management of resource distribution and image synchronization. Dragonfly is a core component of our technical architecture due to its strong cloud native adaptability, excellent P2P acceleration capabilities, and proven stability in large-scale scenarios. We believe that Dragonfly's graduation is a strong testament to its technical maturity and industry value. We also look forward to its continued advancement in the field of cloud native distribution, providing more efficient solutions for large-scale file synchronization, image distribution, and other enterprise scenarios."

—Feng Wu, head of the Elastic Cloud, DiDi & Rapier Yang, head of the Elastic Cloud, DiDi

"At Kuaishou, Dragonfly is considered the cornerstone of our container infrastructure, and it will soon provide stable and reliable image distribution capabilities for tens of thousands of services and hundreds of thousands of servers. Integrated with its subproject Nydus, Dragonfly dramatically enhances application startup efficiency while significantly alleviating disk I/O pressure—ensuring stability for services. In the era of AI large models, Dragonfly also functions as a critical component of our AI infrastructure, providing exceptional acceleration capabilities for large model distribution. We are deeply honored to partner with the vibrant Dragonfly community in collectively exploring future innovations for cloud native distribution technologies."

—Wang Yao, head of container registry service of Kuaishou

About Cloud Native Computing Foundation

Cloud native computing empowers organizations to build and run scalable applications with an open source software stack in public, private, and hybrid clouds. The Cloud Native Computing Foundation (CNCF) hosts critical components of the global technology infrastructure, including Kubernetes, Prometheus, and Envoy. CNCF brings together the industry's top developers, end users, and vendors and runs the largest open source developer conferences in the world. Supported by nearly 800 members, including the world's largest cloud computing and software companies, as well as over 200 innovative startups, CNCF is part of the nonprofit Linux Foundation. For more information, please visit www.cncf.io.

Dragonfly v2.4.0 has been released

· 5 min read

Dragonfly v2.4.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.


New features and enhancements

Load-aware scheduling algorithm

A two-stage scheduling algorithm combining central scheduling with node-level secondary scheduling to optimize P2P download performance based on real-time load awareness.


For more information, please refer to the Scheduling documentation.
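The two phases can be sketched in Go: the Scheduler's central phase ranks candidate parents by the load it last observed, and each client's local phase re-checks fresh load reports before picking a parent. This is a simplified illustration of the idea, not the actual scheduler code; the types and the load metric are hypothetical.

```go
package scheduling

import "sort"

// Peer is a hypothetical view of a candidate parent with a real-time
// load score reported by the node (0.0 = idle, 1.0 = saturated).
type Peer struct {
	ID   string
	Load float64
}

// centralPhase stands in for the Scheduler: it returns up to k candidate
// parents for a task, ranked by the load the Scheduler last observed.
func centralPhase(candidates []Peer, k int) []Peer {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Load < candidates[j].Load
	})
	if len(candidates) > k {
		candidates = candidates[:k]
	}
	return candidates
}

// localPhase stands in for the client: among the Scheduler's candidates,
// it consults fresh load reports and picks the least-loaded parent.
func localPhase(candidates []Peer, freshLoad func(id string) float64) Peer {
	best := candidates[0]
	bestLoad := freshLoad(best.ID)
	for _, p := range candidates[1:] {
		if l := freshLoad(p.ID); l < bestLoad {
			best, bestLoad = p, l
		}
	}
	return best
}
```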

Vortex Protocol Support for P2P File Transfer

Dragonfly provides the new Vortex transfer protocol, based on TLV (Tag-Length-Value), to improve download performance on internal networks. The lightweight TLV format replaces gRPC for data transfer between peers. TCP-based Vortex reduces large file download time by 50% and QUIC-based Vortex by 40% compared to gRPC, and both effectively reduce peak memory usage.

For more information, please refer to the TCP Protocol Support for P2P File Transfer and QUIC Protocol Support for P2P File Transfer.

Request SDK

An SDK for routing user requests to Seed Peers using consistent hashing, replacing the previous Kubernetes Service load-balancing approach.


For more details, please refer to Request SDK.
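Consistent hashing maps each request key to a point on a hash ring, so the same task always routes to the same Seed Peer, and adding or removing a peer only remaps a small fraction of keys (unlike round-robin Service load balancing). A minimal ring in Go, as an illustrative sketch with hypothetical names rather than the SDK's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a minimal consistent-hash ring over seed peer addresses.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> seed peer address
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each seed peer at several virtual points to smooth
// out the key distribution across peers.
func NewRing(peers []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, p := range peers {
		for i := 0; i < replicas; i++ {
			pt := hashKey(p + "#" + strconv.Itoa(i))
			r.points = append(r.points, pt)
			r.owner[pt] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the seed peer owning the first point clockwise of the key.
func (r *Ring) Pick(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"seed-a:4001", "seed-b:4001", "seed-c:4001"}, 100)
	fmt.Println(ring.Pick("task-1234")) // the same key always maps to the same seed peer
}
```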

Simple Multi-Cluster Kubernetes Deployment with Scheduler Cluster ID

Dragonfly supports a simplified feature for deploying and managing multiple Kubernetes clusters by explicitly assigning a schedulerClusterID to each cluster. This approach allows users to directly control cluster affinity without relying on location-based scheduling metadata such as IDC, hostname, or IP.

Using this feature, each Peer, Seed Peer, and Scheduler determines its target scheduler cluster through a clearly defined scheduler cluster ID. This ensures precise separation between clusters and predictable crossโ€‘cluster behavior.


For more information, please refer to Create Dragonfly Cluster Simple.
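As a rough sketch, the per-component pinning could look like the following Helm values fragment. Only the schedulerClusterID concept comes from this post; the exact key paths below are hypothetical and may differ by chart version:

```yaml
# Illustrative only: key paths are assumptions, not the documented schema.
scheduler:
  config:
    manager:
      schedulerClusterID: 1   # this Scheduler serves cluster 1
seedPeer:
  config:
    seedPeer:
      clusterID: 1            # Seed Peers join the same cluster
dfdaemon:
  config:
    host:
      schedulerClusterID: 1   # Peers select schedulers from cluster 1
```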

Performance and Resource Optimization for Manager and Scheduler Components

Enhanced service performance and resource utilization across Manager and Scheduler components while significantly reducing CPU and memory overhead, delivering improved system efficiency and better resource management.

Enhanced Preheating

  • Support for IP-based peer selection in preheating jobs, with priority-based selection logic where IP specification takes highest priority, followed by count-based and percentage-based selection.

  • Support for preheating multiple URLs in a single request.

  • Support for preheating files and images via the Scheduler gRPC interface.


Calculate task ID based on image blob SHA256 to avoid redundant downloads

The Client now supports calculating task IDs directly from the SHA256 hash of image blobs instead of using the download URL. This enhancement prevents redundant downloads and data duplication when the same blob is accessed from different registry domains.
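In other words, the task key becomes content-addressed rather than location-addressed. A simplified illustration in Go (the real client's task ID derivation involves more fields; this only demonstrates the URL-independence property):

```go
package taskid

import (
	"crypto/sha256"
	"encoding/hex"
)

// FromBlobDigest derives a task ID from the blob's SHA256 digest, so the
// same blob served from different registry domains maps to a single task.
func FromBlobDigest(blobSHA256 string) string {
	sum := sha256.Sum256([]byte(blobSHA256))
	return hex.EncodeToString(sum[:])
}

// Example: both URLs reference the same blob digest and therefore share
// one task (and one download) in the P2P cache:
//
//	https://registry-a.example.com/v2/app/blobs/sha256:ab12cd...
//	https://registry-b.example.com/v2/app/blobs/sha256:ab12cd...
```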

Cache HTTP 307 redirects for split downloads

Support for caching HTTP 307 (Temporary Redirect) responses to optimize Dragonfly's multi-piece download performance. When a download URL is split into multiple pieces, the redirect target is now cached, eliminating redundant redirect requests and reducing latency.
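Conceptually, the first piece request resolves the 307 and the resolved target is reused for the remaining pieces. A minimal sketch of that caching in Go (hypothetical types; the actual client implements this inside its HTTP backend):

```go
package redirectcache

import (
	"net/http"
	"sync"
)

// Cache remembers the redirect target of a 307 response per source URL,
// so subsequent piece requests for the same URL skip the extra round trip.
type Cache struct {
	mu      sync.Mutex
	targets map[string]string // source URL -> resolved target URL
}

func New() *Cache { return &Cache{targets: make(map[string]string)} }

// Resolve returns the cached target for url, or performs one request
// without auto-following redirects and caches the 307 Location.
func (c *Cache) Resolve(url string) (string, error) {
	c.mu.Lock()
	if t, ok := c.targets[url]; ok {
		c.mu.Unlock()
		return t, nil
	}
	c.mu.Unlock()

	client := &http.Client{
		// Do not auto-follow: we want to observe the 307 and cache it.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Head(url)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	target := url
	if resp.StatusCode == http.StatusTemporaryRedirect {
		target = resp.Header.Get("Location")
	}
	c.mu.Lock()
	c.targets[url] = target
	c.mu.Unlock()
	return target, nil
}
```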

Go Client Deprecated and Replaced by Rust Client

The Go client has been deprecated and replaced by the Rust Client. All future development and maintenance will focus exclusively on the Rust client, which offers improved performance, stability, and reliability.

For more information, please refer to dragonflyoss/client.

Additional Enhancements

  • Enable 64K page size support for ARM64 in the Dragonfly Rust client.
  • Fix missing git commit metadata in dfget version output.
  • Support for the config_path option of the io.containerd.cri.v1.images plugin for containerd v3 configuration.
  • Replace the glibc DNS resolver with hickory-dns in reqwest to implement DNS caching and prevent excessive DNS lookups during piece downloads.
  • Support for the --include-files flag to selectively download files from a directory (see the example after this list).
  • Add the --no-progress flag to disable the download progress bar output.
  • Support for custom request headers in backend operations, enabling flexible header configuration for HTTP requests.
  • Refactored log output to reduce redundant logging and improve overall logging efficiency.
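As an example, the two download flags above might be combined as follows; the URL and paths are hypothetical, and the invocation shape is illustrative rather than the exact CLI synopsis:

```shell
# Download only selected files from a directory task, without a progress bar.
dfget --include-files models/config.json --include-files models/weights.bin \
  --no-progress \
  --output /tmp/models/ https://example.com/models/
```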

Significant bug fixes

  • Modified the database field type from text to longtext to support storing preheating job information.
  • Fixed panic on repeated seed peer service stops during Scheduler shutdown.
  • Fixed broker authentication failure when specifying the Redis password without setting a username.

Nydus

New features and enhancements

  • Nydusd: Add CRC32 validation support for both RAFS V5 and V6 formats, enhancing data integrity verification.
  • Nydusd: Support resending FUSE requests during nydusd restoration, improving daemon recovery reliability.
  • Nydusd: Enhance VFS state saving mechanism for daemon hot upgrade and failover.
  • Nydusify: Introduce Nydus-to-OCI reverse conversion capability, enabling seamless migration back to OCI format.
  • Nydusify: Implement zero-disk transfer for image copy, significantly reducing local disk usage during copy operations.
  • Snapshotter: Build blob.meta into the bootstrap to improve blob fetch reliability for RAFS v6 images.

Significant bug fixes

  • Nydusd: Fix auth token fetching for access_token field in registry authentication.
  • Nydusd: Add recursive inode/dentry invalidation for umount API.
  • Nydus Image: Fix multiple issues in optimize subcommand and add backend configuration support.
  • Snapshotter: Implement lazy parent recovery for proxy mode to handle missing parent snapshots.

Others

You can see CHANGELOG for more details.

Dragonfly GitHub


Dragonfly Project Paper Accepted by IEEE Transactions on Networking (TON)!

· 4 min read

We're excited to announce that a paper on model distribution, co-authored by researchers from Dalian University of Technology and Ant Group, has been accepted for publication in the IEEE Transactions on Networking (TON), a high-impact journal recognized by IEEE for its significant influence in networking and systems research. 🎉🎉🎉

Title

Empowering Dragonfly: A Lightweight and Scalable Distribution System for Large Models with High Concurrency

IEEE Xplore

Authors

Zhou Xu (Dalian University of Technology); Lizhen Zhou (Dalian University of Technology); Zichuan Xu (Dalian University of Technology); Wenbo Qi (Ant Group); Jinjing Ma (Ant Group); Song Yan (Ant Group); Yuan Yang (Alibaba Cloud); Min Huang (Dalian University of Technology); Haomiao Jiang (Dalian University of Technology); Qiufen Xia (Dalian University of Technology); Guowei Wu (Dalian University of Technology)

Background

With the continuous evolution of Artificial Intelligence Generated Content (AIGC) technologies, the distribution of container images and large models has become a challenge. Traditional centralized registries often face single-node bandwidth bottlenecks during peak concurrent downloads, leading to severe network congestion and excessively long download times. Conversely, while Content Delivery Networks (CDNs) or private links can mitigate some hotspot loads, they fail to fully leverage the idle bandwidth of cluster nodes and introduce additional overhead. Consequently, cloud-native applications and AI services urgently need a dynamic, efficient, scalable, and non-intrusive distribution system for large-scale images and models, compatible with mainstream formats like the OCI spec.

Design


Key Design 1: A Lightweight Network Measurement Mechanism

  • Network Latency Probing: Each peer proactively sends probes to a designated set of targets within the cluster, as assigned by the Scheduler, to measure end-to-end latency. This approach ensures efficient probing of all peers in the cluster while operating under constrained network resources.


  • Network Bandwidth Prediction: We employ non-intrusive measurements by analyzing historical loading data to estimate network bandwidth with reasonable accuracy. When a peer initiates a model loading request, it pulls pieces of the model from other peers and reports the size and loading time of these pieces to the Scheduler. The Scheduler leverages historical data from successful loading operations to predict bandwidth (see the sketch after this list).

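The core of the non-intrusive estimate is observed throughput per reported piece, smoothed over history. A toy illustration in Go (hypothetical types; the paper layers graph learning on top of such samples):

```go
package bandwidth

import "time"

// PieceReport is one observation reported to the Scheduler: how many
// bytes a peer pulled from a parent and how long the transfer took.
type PieceReport struct {
	Bytes    int64
	Duration time.Duration
}

// Estimator keeps an exponentially weighted moving average of the
// throughput samples for one peer-parent link.
type Estimator struct {
	Alpha float64 // smoothing factor, e.g. 0.2
	bps   float64 // current estimate in bytes/second
}

// Observe folds a new piece report into the running estimate.
func (e *Estimator) Observe(r PieceReport) {
	sample := float64(r.Bytes) / r.Duration.Seconds()
	if e.bps == 0 {
		e.bps = sample
		return
	}
	e.bps = e.Alpha*sample + (1-e.Alpha)*e.bps
}

// BytesPerSecond returns the current bandwidth estimate.
func (e *Estimator) BytesPerSecond() float64 { return e.bps }
```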

Key Design 2: A Scalable Scheduling Framework

  • Separating Inference from Scheduling: To ensure the Scheduler has adequate resources for task scheduling, we separate inference services from scheduling operations.

  • Real-Time Data Synchronization: To maintain data consistency across multiple schedulers, we store and continuously update end-to-end latency information for the entire network, as collected by the lightweight network measurement module.

Key Design 3: Asynchronous Model Training and Inference

  • Asynchronous Model Training: Asynchronous training and inference are facilitated through collaboration between the Trainer and Triton. The Scheduler retrieves end-to-end latency and bandwidth predictions from Redis and sends them to the Trainer, which then initiates training and persists the updated model. Triton periodically polls for updates and loads the new model for inference in the subsequent cycle.

  • Graph Learning Algorithm: This algorithm aggregates feature parameters from peers, modeling each sample as an interaction between a peer and its parent. It also incorporates information from neighboring peers to capture similarities within the cluster, thereby improving the accuracy of bandwidth predictions.


Summary

This paper presents an efficient and scalable peer-to-peer (P2P) model distribution system that optimizes resource utilization and ensures data synchronization through a multi-layered architecture.

  • First, it introduces a lightweight network measurement mechanism that probes latency and infers bandwidth to predict real-time network conditions.

  • Second, it proposes a scalable scheduling framework that separates inference services from scheduling, enhancing both resource efficiency and system responsiveness.

  • Finally, it designs the Trainer module, which integrates asynchronous model training and inference with graph learning algorithms to support incremental learning for bursty tasks.

GitHub Repository


Dragonfly v2.3.0 has been released

· 7 min read

Dragonfly v2.3.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.


Features

Persistent Cache Task

It is designed to provide persistent caching for tasks. The dfcache tool can import and export files in the P2P network. The solution is specifically engineered for high-speed read and write operations, which makes it particularly advantageous for scenarios involving large files, such as machine learning model checkpoints, where rapid, reliable access and distribution across the network are critical for training and inference workflows. By leveraging P2P distribution and persistent caching, dfcache significantly reduces I/O bottlenecks and accelerates the lifecycle of large data assets.


For documentation on how to use the dfcache command-line tool, please refer to the following link: dfcache.

```shell
$ dfcache import /tmp/file.txt
⣷ Done: 2229733261
$ dfcache export 2229733261 -O /tmp/file.txt
```


Resource Search (Tasks and Persistent Cache Tasks)

The Resource Search feature enables seamless querying of tasks, including files, images, and persistent cache tasks. It optimizes resource access, improving task management and retrieval efficiency.


Vortex: A P2P File Transfer Protocol Based on TLV

Vortex protocol is a high-performance peer-to-peer (P2P) file transfer protocol implementation in Rust, designed as part of the Dragonfly project. It utilizes the TLV (Tag-Length-Value) format for efficient and flexible data transmission, making it ideal for large-scale file distribution scenarios.

Packet Format:

  • Packet Identifier (8 bits): Uniquely identifies each packet
  • Tag (8 bits): Specifies data type in value field
  • Length (32 bits): Indicates Value field length, up to 4 GiB
  • Value (variable): Actual data content, up to 1 GiB

Protocol Format:

```text
---------------------------------------------------------------------------------------
| Packet Identifier (8 bits) | Tag (8 bits) | Length (32 bits) | Value (up to 1 GiB)   |
---------------------------------------------------------------------------------------
```

For more information, please refer to the Vortex Protocol.
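Encoding a packet under this layout is straightforward: two fixed-size header bytes, a 32-bit length, then the raw value. A minimal encoder in Go as an illustration (the reference implementation is in Rust, and the byte order here is an assumption):

```go
package vortex

import (
	"encoding/binary"
	"errors"
)

const maxValueSize = 1 << 30 // 1 GiB cap on the value field

// Encode serializes one packet: an 8-bit packet identifier, an 8-bit
// tag, a 32-bit big-endian length, then the value bytes.
func Encode(packetID, tag uint8, value []byte) ([]byte, error) {
	if len(value) > maxValueSize {
		return nil, errors.New("vortex: value exceeds 1 GiB")
	}
	buf := make([]byte, 6+len(value))
	buf[0] = packetID
	buf[1] = tag
	binary.BigEndian.PutUint32(buf[2:6], uint32(len(value)))
	copy(buf[6:], value)
	return buf, nil
}
```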

Enhanced Large File Distribution

This release significantly enhances Dragonfly's large file distribution capabilities, delivering improved efficiency and performance. We've revamped our scheduling algorithms for large file scenarios to ensure smarter resource and task allocation. Additionally, new mechanisms now more effectively balance the load across peers during large file transfers. Optimizations to the peer-to-peer (P2P) protocol and network transport layers further boost transmission efficiency.

These improvements include performance optimizations for both Client and Scheduler. You can find more details in the project's pull request.

Support scopes for Personal Access Tokens (PATs)

By enabling users to define specific access rights (scopes) for each PAT, we significantly enhance the security of Open API interactions. Instead of granting broad permissions, PATs can now be limited to only the necessary privileges required for a particular integration or task.


Enhanced Preheating

Implement Distributed Rate Limiting for Preheating Tasks

By limiting the rate at which preheating requests are initiated across the distributed system, it prevents excessive preheating activity from stressing the origin. This enhancement ensures more stable preheating.


Support for setting the piece length for preheating

By allowing adjustment of the piece size, users can optimize data transfer efficiency, particularly in scenarios involving large files.


Flexible Preheating: Set Peer Scope by Percentage or Count

This feature enhances preheating capabilities by allowing users to specify the preheating scope more precisely.


Implement Audit Logging for User Operations

This feature introduces comprehensive audit logging capabilities to track user operations within the system. Audit logs will record critical actions performed by users, such as initiating preheating tasks, deleting task caches, and other significant system interactions.


Garbage Collection

Dragonfly supports Garbage Collection (GC) Audit Logs and GC Job Records to track and manage garbage collection activities. The Manager enables automated GC retention, allowing records to be preserved for a configurable time period. Additionally, it provides the capability to manually trigger forced GC operations as needed.

This feature ensures efficient monitoring and management of GC processes, offering flexibility through automated retention policies and manual intervention for immediate GC execution.


Download Files via Hard Link

File downloads need to be efficient and secure. When a user downloads a large file, it is inefficient to download the file and then copy it to the output path. Instead, the client can create a hard link to the file and return the link to the user, avoiding the copy and saving time and resources. If the hard link fails (e.g., due to different file systems), dfdaemon falls back to copying the file.

For more information, please refer to the file download workflow.
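The fallback behavior described above is easy to sketch in Go: try a hard link first and copy bytes only when linking fails, for example across file systems. This illustrates the behavior, not the dfdaemon source:

```go
package hardlink

import (
	"io"
	"os"
)

// Place makes target available at outputPath, preferring a zero-copy
// hard link and falling back to a byte copy when linking fails
// (e.g., outputPath lives on a different file system).
func Place(target, outputPath string) error {
	if err := os.Link(target, outputPath); err == nil {
		return nil // hard link succeeded: no data was copied
	}
	src, err := os.Open(target)
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := os.Create(outputPath)
	if err != nil {
		return err
	}
	defer dst.Close()
	_, err = io.Copy(dst, src)
	return err
}
```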

Hardware Acceleration for Piece Hash Computation

This feature enables hardware-accelerated piece hash computation, significantly boosting performance and efficiency. By utilizing specialized hardware instructions, the hash computation process is accelerated, allowing faster processing of large files.
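As a concrete illustration, the CRC-32-Castagnoli checksum used for piece content (see the v2.2.0 notes later on this page) is hardware-accelerated by Go's standard library on amd64 and arm64, so a sketch needs nothing beyond hash/crc32:

```go
package piecehash

import "hash/crc32"

// castagnoli selects the CRC-32C polynomial; Go dispatches to the CPU's
// CRC32 instructions for this table where they are available.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// Sum returns the hardware-accelerated CRC-32C of one piece's content.
func Sum(piece []byte) uint32 {
	return crc32.Checksum(piece, castagnoli)
}
```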

Advanced Storage Management

Disk Space Validation for Operations

This feature enhances the client's storage functionality by implementing disk space validation. When insufficient disk space is detected, the client will return a failure response, preventing potential data corruption or incomplete operations.

Disk Garbage Collection Management

This feature enhances the Peer's disk management by introducing configurable garbage collection (GC) thresholds based on disk usage. The distThreshold parameter allows users to define a specific disk capacity (e.g., 10 TiB) as the base for calculating GC trigger points. If set, the distHighThresholdPercent (e.g., 80%) and distLowThresholdPercent (e.g., 60%) are applied relative to this capacity. If distThreshold is not provided or set to 0, these percentages are calculated based on the total actual disk space. When disk usage exceeds the high threshold, Dragonfly triggers GC (LRU) to reclaim space; GC stops when usage falls below the low threshold. This enables efficient management of a logical disk portion for caching, improving resource utilization and system performance.
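A sketch of the corresponding client configuration, using the parameter names above; the surrounding key structure is an assumption and may differ from the actual config layout:

```yaml
# Hypothetical layout; the three parameter names come from the text above.
gc:
  policy:
    distThreshold: 10TiB          # logical capacity used as the GC base
    distHighThresholdPercent: 80  # start LRU GC above 8 TiB used
    distLowThresholdPercent: 60   # stop GC once usage drops below 6 TiB
```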


Support for OpenTelemetry Tracing

Dragonfly supports tracing based on OpenTelemetry, covering the Manager, Scheduler, and Peers. This enables end-to-end visibility into the download process, allowing users to query detailed information, such as overall download latency, using a specific task ID. The integration ensures efficient monitoring and performance analysis across the entire system.

Add the tracing configuration in the Manager, Scheduler, and Peer.
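The original post shows this configuration as a screenshot. As an illustrative stand-in, a minimal sketch might look like the following, assuming an OTLP collector endpoint; the key names are hypothetical and may differ between components and versions:

```yaml
# Illustrative only: consult the Tracing documentation for the exact keys.
tracing:
  protocol: grpc                          # OTLP transport to the collector
  endpoint: jaeger-collector.example:4317 # hypothetical collector address
```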


You can access the Jaeger UI to visualize the traces.


For more information, please refer to the Tracing.

Security Enhancements

We extend our sincere gratitude to the CNCF TAG Security for their collaboration on a joint security audit. Their expertise and thorough review were invaluable in helping us identify areas for security improvement within Dragonfly. For detailed information on the specific security issues addressed and the corresponding fixes, please refer to the following issue: #3811

Nydus

Significant bug fixes

  • Fixed memory leaks and file descriptor leaks caused by the sysinfo library.
  • Cleaned up the Unix domain socket (UDS) to prevent dfdaemon startup crashes.
  • Prevented the client from repeatedly downloading the same piece from multiple parents.

Others

You can see CHANGELOG for more details.

Dragonfly GitHub


Dragonfly v2.2.0 has been released

· 8 min read


Dragonfly v2.2.0 is released! 🎉🎉🎉 Thanks to the contributors who made this release happen, and welcome to visit the d7y.io website.

Features

Client written in Rust

The client is written in Rust, offering advantages such as memory safety and improved performance. The client is a submodule of Dragonfly; refer to dragonflyoss/client.

scheduler schema

second scheduler schema

Client supports bandwidth rate limiting for prefetching

Client now supports rate limiting for prefetch requests, which can prevent network overload and reduce competition with other active download tasks, thereby enhancing overall system performance. Refer to the documentation to configure the proxy.prefetchRateLimit option.
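A sketch of what that looks like in the client configuration; only the proxy.prefetchRateLimit key comes from the documentation reference above, and the value is illustrative:

```yaml
proxy:
  # Cap bandwidth spent on speculative prefetching so it does not
  # compete with interactive download tasks (illustrative value).
  prefetchRateLimit: 2GiB
```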


The following diagram illustrates the usage of the download rate limit, upload rate limit, and prefetch rate limit for the client.

rate limit

Client supports leeching

If the user configures the client to disable sharing, it will become a leech.


Optimize client's performance for handling a large number of small I/Os by Nydus

  • Add the X-Dragonfly-Prefetch HTTP header. If X-Dragonfly-Prefetch is set to true and the request is a range request, the client will prefetch the entire task. This allows Nydus to control which requests need prefetching (see the example after this list).
  • The client's HTTP proxy adds an independent cache to reduce requests to the gRPC server, thereby reducing request latency.
  • Increase the memory cache size in RocksDB and enable prefix search for quickly searching piece metadata.
  • Use the CRC-32-Castagnoli algorithm with hardware acceleration to reduce the hash calculation cost for piece content.
  • Reuse the gRPC connections for downloading and optimize the download logic.
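For instance, a range request routed through the client's HTTP proxy could opt in to whole-task prefetching like this; the URL is hypothetical, and the proxy port matches the default 65001 used elsewhere on this page:

```shell
# Ask the proxy to prefetch the entire task behind this range request.
curl -x http://127.0.0.1:65001 \
  -H "X-Dragonfly-Prefetch: true" \
  -r 0-4095 \
  http://registry.example.com/v2/app/blobs/sha256:ab12cd...
```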

Defines V2 of the P2P transfer protocol

Dragonfly defines V2 of the P2P transfer protocol to make it more standard, clearer, and better performing; refer to dragonflyoss/api.

Enhanced Harbor Integration with P2P Preheating

Dragonfly improves its integration with Harbor v2.13 for preheating images, which includes the following enhancements:

  • Support for preheating multi-architecture images.
  • Users can select the preheat scope for multi-granularity preheating (Single Seed Peer, All Seed Peers, All Peers).
  • Users can specify the scheduler cluster IDs for preheating images to the desired Dragonfly clusters.

Refer to documentation for more details.

create P2P Provider policy

Task Manager

Users can search all peers of a cached task by task ID or download URL, and delete the cache on the selected peers; refer to the documentation.

dragonfly dashboard

Peer Manager

The Manager will regularly synchronize peers' information and also allows for manual refreshes. Additionally, it will display peers' information on the Manager Console.

dragonfly dashboard

Add hostname regexes and CIDRs to cluster scopes for matching clients

When the client starts, it reports its hostname and IP to the Manager. The Manager then returns the best matching cluster (including schedulers and seed peers) to the client based on the cluster scopes configuration.
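As a rough illustration, a cluster's scopes might pin clients by CIDR and hostname pattern as below; the field names are assumptions based on the feature description, not a documented schema:

```yaml
# Hypothetical scopes for one cluster: clients whose IP or hostname
# matches are steered to this cluster's schedulers and seed peers.
scopes:
  idc: na-east
  cidrs:
    - 10.0.0.0/16
  hostnames:
    - ^gpu-node-.*$
```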

Creating a cluster on the Dragonfly dashboard

Supports distributed rate limiting for creating jobs across different clusters

Users can configure rate limiting for job creation across different clusters in the Manager Console.

creating a cluster on the dragonfly dashboard

Support preheating images using self-signed certificates

Preheating requires calling the container registry to parse the image manifest and construct the URL for downloading blobs. If the container registry uses a self-signed certificate, users can configure the self-signed certificate in the Manager's config for calls to the container registry.


Support mTLS for gRPC calls between services

By setting self-signed certificates in the configurations of the Manager, Scheduler, Seed Peer, and Peer, gRPC calls between services will use mTLS.

Observability

Dragonfly recommends using Prometheus for monitoring. Prometheus and Grafana configurations are maintained in the dragonflyoss/monitoring repository.

Grafana dashboards are listed below:

| Name | ID | Link | Description |
|------|----|------|-------------|
| Dragonfly Manager | 15945 | https://grafana.com/grafana/dashboards/15945 | Grafana dashboard for dragonfly manager. |
| Dragonfly Scheduler | 15944 | https://grafana.com/grafana/dashboards/15944 | Grafana dashboard for dragonfly scheduler. |
| Dragonfly Client | 21053 | https://grafana.com/grafana/dashboards/21053 | Grafana dashboard for dragonfly client and dragonfly seed client. |
| Dragonfly Seed Client | 21054 | https://grafana.com/grafana/dashboards/21054 | Grafana dashboard for dragonfly seed client. |


creating a cluster on the dragonfly dashboard

Nydus

Nydus v2.3.0 is released, refer to Nydus Image Service v2.3.0 for more details.

  • builder: support --parent-bootstrap for merge.
  • builder/nydusd: support batch chunk merging.
  • nydusify/nydus-snapshotter: support OCI reference types.
  • nydusify: support export/import for remote images.
  • nydusify: support --push-chunk-size for large images.
  • nydusd/nydus-snapshotter: support basic failover and hot upgrade.
  • nydusd: support overlay writable mount for fusedev.

Console

Console v0.2.0 is released, featuring a redesigned UI and an improved interaction flow. Additionally, more functional pages have been added, such as preheating, the task manager, the PATs (Personal Access Tokens) manager, etc. Refer to the documentation for more details.

cluster overview

deeper dive image into cluster-1 on dashboard

Document

Refactor the website documentation to make Dragonfly simpler and more practical for users, refer to d7y.io.

dragonfly website

Significant bug fixes

The following content only highlights the significant bug fixes in this release.

  • Fix the thread safety issue that occurs when constructing the DAG (Directed Acyclic Graph) during scheduling.
  • Fix the memory leak caused by the OpenTelemetry library.
  • Avoid hot reload when dynconfig refreshes data from the Manager.
  • Prevent concurrent download requests from causing failures in state machine transitions.
  • Use context.Background() to avoid the stream being canceled by dfdaemon.
  • Fix the database performance issue caused by clearing expired jobs when there are too many job records.
  • Reuse the gRPC connection pool to prevent redundant request construction.

AI Infrastructure

Model Spec

The Dragonfly community is collaboratively defining the OCI Model Specification. OCI Model Specification aims to provide a standard way to package, distribute and run AI models in a cloud native environment. The goal of this specification is to package models in an OCI artifact to take advantage of OCI distribution and ensure efficient model deployment, refer to CloudNativeAI/model-spec for more details.

OCI Model Specification image


Support accelerated distribution of AI models in Hugging Face Hub (Git LFS)

Distribute large files downloaded via the Git LFS protocol through Dragonfly P2P, refer to the documentation.

hugging face hub clusters

Maintainers

The community has added four new Maintainers, hoping to help more contributors participate in the community.

  • Han Jiang: He works for Kuaishou and will focus on the engineering work for Dragonfly.
  • Yuan Yang: He works for Alibaba Group and will focus on the engineering work for Dragonfly.

Other

You can see CHANGELOG for more details.

Triton Server accelerates distribution of models based on Dragonfly

· 10 min read


Project post by Yufei Chen, Miao Hao, and Min Huang, Dragonfly project

This document will help you experience how to use Dragonfly with Triton Server. When downloading models, the files are large and many services download them at the same time, so the bandwidth of the storage reaches its limit and downloads become slow.

Diagram flow showing nodes in Triton Server in Cluster A and Cluster B to Model Registry

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Cluster A and Cluster B Peer to Root Peer to Model Registry

Installation

By integrating the Dragonfly Repository Agent into Triton, download traffic is routed through Dragonfly to pull models stored in S3, OSS, GCS, and ABS, and the models are registered in Triton. The Dragonfly Repository Agent lives in the dragonfly-repository-agent repository.

Prerequisites

| Name | Version | Document |
|------|---------|----------|
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| Triton Server | 23.08-py3 | Triton Server |

Notice: Kind is recommended if no Kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setup

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Cluster

Create kind multi-node cluster configuration file kind-config.yaml, configuration content is as follows:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly image

Pull dragonfly latest images:

```shell
docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest
```

Kind cluster loads dragonfly latest images:

```shell
kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest
```

Create dragonfly cluster based on helm charts

Create the helm charts configuration file charts-config.yaml and set the regx entries under dfdaemon.config.proxy.proxies to match the download path of the object storage. Example: add regx: .*models.* to match download requests for the object storage bucket models. Configuration content is as follows:

```yaml
scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
        cacert: ''
        cert: ''
        key: ''
      tcpListen:
        namespace: ''
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        # Proxy all HTTP download requests for the model bucket path.
        - regx: .*models.*

manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true
```

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Example output:

LAST DEPLOYED: Wed Nov 29 21:23:48 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

```shell
kubectl get pods -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-8qcpd             1/1     Running   4 (118s ago)   2m45s
dragonfly-dfdaemon-qhkn8             1/1     Running   4 (108s ago)   2m45s
dragonfly-jaeger-6c44dc44b9-dfjfv    1/1     Running   0              2m45s
dragonfly-manager-549cd546b9-ps5tf   1/1     Running   0              2m45s
dragonfly-mysql-0                    1/1     Running   0              2m45s
dragonfly-redis-master-0             1/1     Running   0              2m45s
dragonfly-redis-replicas-0           1/1     Running   0              2m45s
dragonfly-redis-replicas-1           1/1     Running   0              2m7s
dragonfly-redis-replicas-2           1/1     Running   0              101s
dragonfly-scheduler-0                1/1     Running   0              2m45s
dragonfly-seed-peer-0                1/1     Running   1 (52s ago)    2m45s
```

Expose the Proxy service port

Create the dfstore.yaml configuration file to expose the port on which the Dragonfly Peer's HTTP proxy listens. The default port is 65001; set targetPort to 65001.

```yaml
kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
    - protocol: TCP
      port: 65001
      targetPort: 65001
  type: NodePort
```

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward requests to the Dragonfly Peer's HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Repository Agent

Set Dragonfly Repository Agent configuration

Create the dragonfly_config.json configuration file; the configuration is as follows:

```json
{
  "proxy": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ]
}
```
  • proxy: The address of the Dragonfly Peer's HTTP proxy.
  • header: Adds request headers to the request.
  • filter: Used to generate unique tasks and filter unnecessary query parameters in the URL.

In the filter of the configuration, set different values when using different object storage:

| Type | Value |
|------|-------|
| OSS | ["Expires","Signature","ns"] |
| S3 | ["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"] |
| OBS | ["X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date", "X-Obs-Date", "X-Amz-Expires", "X-Amz-SignedHeaders", "X-Amz-Signature"] |

Set Model Repository configuration

Create the cloud_credential.json cloud storage credential file; the configuration is as follows:

```json
{
  "gs": {
    "": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS",
    "gs://gcs-bucket-002": "PATH_TO_GOOGLE_APPLICATION_CREDENTIALS_2"
  },
  "s3": {
    "": {
      "secret_key": "AWS_SECRET_ACCESS_KEY",
      "key_id": "AWS_ACCESS_KEY_ID",
      "region": "AWS_DEFAULT_REGION",
      "session_token": "",
      "profile": ""
    },
    "s3://s3-bucket-002": {
      "secret_key": "AWS_SECRET_ACCESS_KEY_2",
      "key_id": "AWS_ACCESS_KEY_ID_2",
      "region": "AWS_DEFAULT_REGION_2",
      "session_token": "AWS_SESSION_TOKEN_2",
      "profile": "AWS_PROFILE_2"
    }
  },
  "as": {
    "": {
      "account_str": "AZURE_STORAGE_ACCOUNT",
      "account_key": "AZURE_STORAGE_KEY"
    },
    "as://Account-002/Container": {
      "account_str": "",
      "account_key": ""
    }
  }
}
```

In order to pull models through Dragonfly, the following content needs to be added to the model's config.pbtxt file:

```text
model_repository_agents {
  agents [
    {
      name: "dragonfly",
    }
  ]
}
```

The densenet_onnx example contains the modified configuration and model file. The modified config.pbtxt is as follows:

name: "densenet_onnx"platform: "onnxruntime_onnx"max_batch_size : 0input [ย  {ย  ย  name: "data_0"ย  ย  data_type: TYPE_FP32ย  ย  format: FORMAT_NCHWย  ย  dims: [ 3, 224, 224 ]ย  ย  reshape { shape: [ 1, 3, 224, 224 ] }ย  }]output [ย  {ย  ย  name: "fc6_1"ย  ย  data_type: TYPE_FP32ย  ย  dims: [ 1000 ]ย  ย  reshape { shape: [ 1, 1000, 1, 1 ] }ย  ย  label_filename: "densenet_labels.txt"ย  }]model_repository_agents{ย  agents [ย  ย  {ย  ย  ย  name: "dragonfly",ย  ย  }ย  ]}

Triton Server integrates Dragonfly Repository Agent plugin

Install Triton Server with Docker

Pull the dragonflyoss/dragonfly-repository-agent image, which integrates the Dragonfly Repository Agent plugin into Triton Server; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-repository-agent:latest

Run the container and mount the configuration directory:

```shell
docker run --network host --rm \
  -v ${path-to-config-dir}:/home/triton/ \
  dragonflyoss/dragonfly-repository-agent:latest tritonserver \
  --model-repository=${model-repository-path}
```
  • path-to-config-dir: The path of the directory containing dragonfly_config.json and cloud_credential.json.
  • model-repository-path: The path of the remote model repository.

The correct output is as follows:

=============================== Triton Inference Server ===============================
successfully loaded 'densenet_onnx'
I1130 09:43:22.595672 1 server.cc:604]
+------------------+------------------------------------------------------------------------+
| Repository Agent | Path |
+------------------+------------------------------------------------------------------------+
| dragonfly | /opt/tritonserver/repoagents/dragonfly/libtritonrepoagent_dragonfly.so |
+------------------+------------------------------------------------------------------------+

I1130 09:43:22.596011 1 server.cc:631]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.596112 1 server.cc:674]
+---------------+---------+--------+
| Model | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1 | READY |
+---------------+---------+--------+

I1130 09:43:22.598318 1 metrics.cc:703] Collecting CPU metrics
I1130 09:43:22.599373 1 tritonserver.cc:2435]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.37.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | s3://192.168.36.128:9000/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1130 09:43:22.610334 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1130 09:43:22.612623 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1130 09:43:22.695843 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Execute the following command to check the Dragonfly logs:

kubectl exec -it -n dragonfly-system dragonfly-dfdaemon-<id> -- tail -f /var/log/dragonfly/daemon/core.log

Check that the download succeeded through Dragonfly:

```json
{
  "level": "info",
  "ts": "2024-02-02 05:28:02.631",
  "caller": "peer/peertask_conductor.go:1349",
  "msg": "peer task done, cost: 352ms",
  "peer": "10.244.2.3-1-4398a429-d780-423a-a630-57d765f1ccfc",
  "task": "974aaf56d4877cc65888a4736340fb1d8fecc93eadf7507f531f9fae650f1b4d",
  "component": "PeerTask",
  "trace": "4cca9ce80dbf5a445d321cec593aee65"
}
```

Verify

Call the inference API:

docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.08-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

Check that the response is successful:

```text
Request 01
Image '/workspace/images/mug.jpg':
    15.349563 (504) = COFFEE MUG
    13.227461 (968) = CUP
    10.424893 (505) = COFFEEPOT
```

Performance testing

Test the performance of single-machine model downloads via the Triton API after integrating Dragonfly P2P. Because the machine's own network environment affects the results, the absolute download time matters less than the relative download speed across scenarios:

Bar chart showing download time for: Triton API; Triton API & Dragonfly Cold Boot; Hit Dragonfly Remote Peer Cache; Hit Dragonfly Local Peer Cache

  • Triton API: Use the signed URL provided by the object storage to download the model directly.
  • Triton API & Dragonfly Cold Boot: Use the Triton Server API to download the model via the Dragonfly P2P network with no cache hits.
  • Hit Remote Peer: Use the Triton Server API to download the model via the Dragonfly P2P network, hitting the remote peer cache.
  • Hit Local Peer: Use the Triton Server API to download the model via the Dragonfly P2P network, hitting the local peer cache.

Test results show that the Triton and Dragonfly integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits, the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, the model download speed will be even faster.

Resources

Dragonfly Community

NVIDIA Triton Inference Server

Dragonfly accelerates distribution of large files with Git LFS

· 11 min read


What is Git LFS?

Git LFS (Large File Storage) is an open-source extension for Git that enables users to handle large files more efficiently in Git repositories. Git is a version control system designed primarily for text files such as source code, and it can become less efficient when dealing with large binary files like audio, videos, datasets, graphics, and other large assets. These files can significantly increase the size of a repository and make cloning and fetching operations slow.

Diagram flow showing Remote to Large File Storage

Git LFS addresses this issue by storing these large files on a separate server and replacing them in the Git repository with small placeholder files (pointers). When a user clones or pulls from the repository, Git LFS fetches the large files from the LFS server as needed rather than downloading them all with the initial clone. For specifications, please refer to the Git LFS Specification. The server is implemented on top of the HTTP protocol; refer to the Git LFS API. Git LFS's content storage usually uses object storage to store the large files.
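For illustration, the pointer committed to the repository is just a small text stub; the example below reuses the oid and size that appear in the GitHub LFS log later in this post:

version https://git-lfs.github.com/spec/v1
oid sha256:c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29
size 5242880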

Git LFS Usageโ€‹

Git LFS manages large filesโ€‹

GitHub and GitLab usually manage large files based on Git LFS.

Git LFS manages AI models and AI datasetsโ€‹

Large model and dataset files in AI are usually managed with Git LFS. The Hugging Face Hub and ModelScope Hub manage models and datasets based on Git LFS.

Hugging Face Hub's Python library implements the Git LFS protocol to download models and datasets. To accelerate those downloads with Dragonfly, refer to Hugging Face accelerates distribution of models and datasets based on Dragonfly.

Dragonfly eliminates the bandwidth limit of Git LFSโ€™s content storageโ€‹

This document will help you experience how to use Dragonfly with Git LFS. When large files are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Cluster A and Cluster B to Large File Storage

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating large files downloading.

Diagram flow showing Cluster A and Cluster B to Large File Storage using Peer and Root Peer

Dragonfly accelerates downloads with Git LFSโ€‹

By proxying Git LFS's HTTP file download requests to the Dragonfly Peer Proxy, the download traffic is forwarded into the P2P network. The following documentation is based on GitHub LFS.
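Conceptually, any HTTP client can hand such a download to the Peer Proxy; a hedged Python sketch (the object path is hypothetical, and real URLs come from the LFS batch API response shown below):

import requests

# Hedged sketch: route one LFS object download through the Dragonfly Peer Proxy.
# The proxy address matches the deployment below; the object path is hypothetical.
proxies = {"http": "http://127.0.0.1:65001"}
url = "http://github-cloud.githubusercontent.com/alambic/media/hypothetical-object-path"

with requests.get(url, proxies=proxies, stream=True) as resp:
    resp.raise_for_status()
    with open("large-file.bin", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)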

Get the Content Storage address of Git LFSโ€‹

Add GIT_CURL_VERBOSE=1 to print verbose logs of git clone and find the address of Git LFS's content storage.

GIT_CURL_VERBOSE=1 git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Look for the trace git-lfs keyword in the logs to see how Git LFS downloads files. Pay attention to the actions and download fields in the log.

15:31:04.848308 trace git-lfs: HTTP: {"objects":[{"oid":"c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29","size":5242880,"actions":{"download":{"href":"https://github-cloud.githubusercontent.com/alambic/media/376919987/c0/36/c036cbb7553a909f8b8877d4461924307f27ecb66cff928eeeafd569c3887e29?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIMWPLRQEC4XCWWPA%2F20231221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231221T073104Z&X-Amz-Expires=3600&X-Amz-Signature=4dc757dff0ac96eac3f0cd2eb29ca887035d3a6afba41cb10200ed0aa22812fa&15:31:04.848403 trace git-lfs: HTTP: X-Amz-SignedHeaders=host&actor_id=15955374&key_id=0&repo_id=392935134&token=1","expires_at":"2023-12-21T08:31:04Z","expires_in":3600}}}]}

The download URL can be found in actions.download.href in objects. You can see that the content storage of GitHub LFS is actually hosted at github-cloud.githubusercontent.com, and that the query parameters include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders. These are AWS Authenticating Requests parameters. The keys of the query parameters will be used later when configuring the Dragonfly Peer Proxy; a sketch of why they are filtered follows the list below.

Information about Git LFS:

  1. The content storage address of Git LFS is github-cloud.githubusercontent.com.
  2. The query parameters of the download URL include X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders.
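Why these parameters must be filtered becomes clearer with a small sketch: Dragonfly derives a task ID from the URL (as the Helm chart configuration later in this post notes), and the signature parameters change on every request. A hedged Python illustration, not Dragonfly's actual code:

import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hedged sketch (not Dragonfly's implementation): stripping the signed query
# parameters makes the derived task id stable across differently signed URLs.
FILTERED = {
    "X-Amz-Algorithm", "X-Amz-Credential", "X-Amz-Date",
    "X-Amz-Expires", "X-Amz-Signature", "X-Amz-SignedHeaders",
}

def task_id(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FILTERED]
    stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
    return hashlib.sha256(stripped.encode()).hexdigest()

# Two clients holding differently signed URLs for the same object get one task:
a = task_id("https://storage.example.com/object?X-Amz-Signature=aaa&X-Amz-Date=20231221")
b = task_id("https://storage.example.com/object?X-Amz-Signature=bbb&X-Amz-Date=20231222")
assert a == b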

Installationโ€‹

Prerequisitesโ€‹

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonflyโ€‹

For detailed installation documentation on a Kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind
Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml. Add the github-cloud.githubusercontent.com rule to dfdaemon.config.proxy.proxies.regx to forward the HTTP download traffic of Git LFS's content storage to the P2P network. Also add the X-Amz-Algorithm, X-Amz-Credential, X-Amz-Date, X-Amz-Expires, X-Amz-Signature and X-Amz-SignedHeaders parameters to dfdaemon.config.proxy.defaultFilter to filter the query parameters. Dragonfly generates a unique task ID based on the URL, so the volatile signed query parameters must be filtered out for the task ID to be stable. The configuration content is as follows:

scheduler:
  image: dragonflyoss/scheduler
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  image: dragonflyoss/dfdaemon
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  image: dragonflyoss/dfdaemon
  tag: latest
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-Signature&X-Amz-SignedHeaders"
      security:
        insecure: true
        cacert: ""
        cert: ""
        key: ""
      tcpListen:
        namespace: ""
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        - regx: github-cloud.githubusercontent.com.*

manager:
  image: dragonflyoss/manager
  tag: latest
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true

Create a dragonfly cluster using the configuration file:

helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml

Output:

NAME: dragonfly
LAST DEPLOYED: Thu Dec 21 17:24:37 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"

2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.

3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

kubectl get po -n dragonfly-system
NAME                                 READY   STATUS    RESTARTS       AGE
dragonfly-dfdaemon-cttxz             1/1     Running   4 (116s ago)   2m51s
dragonfly-dfdaemon-k62vd             1/1     Running   4 (117s ago)   2m51s
dragonfly-jaeger-84dbfd5b56-mxpfs    1/1     Running   0              2m51s
dragonfly-manager-5c598d5754-fd9tf   1/1     Running   0              2m51s
dragonfly-mysql-0                    1/1     Running   0              2m51s
dragonfly-redis-master-0             1/1     Running   0              2m51s
dragonfly-redis-replicas-0           1/1     Running   0              2m51s
dragonfly-redis-replicas-1           1/1     Running   0              106s
dragonfly-redis-replicas-2           1/1     Running   0              78s
dragonfly-scheduler-0                1/1     Running   0              2m51s
dragonfly-seed-peer-0                1/1     Running   1 (37s ago)    2m51s

Create the peer service configuration file peer-service-config.yaml. The configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Git LFS downloads large files via Dragonfly​

Proxy Git LFS download requests to the Dragonfly Peer Proxy (http://127.0.0.1:65001) through Git configuration. The Git configuration includes the http.proxy, lfs.transfer.enablehrefrewrite and url.{YOUR-LFS-CONTENT-STORAGE}.insteadOf properties.

git config --global http.proxy http://127.0.0.1:65001
git config --global lfs.transfer.enablehrefrewrite true
git config --global url.http://github-cloud.githubusercontent.com/.insteadOf https://github-cloud.githubusercontent.com/

Forward Git LFS download requests to the P2P network via Dragonfly Peer Proxy and Git clone the large files.

git clone git@github.com:{YOUR-USERNAME}/{YOUR-REPOSITORY}.git

Verify large files download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

2023-12-21T16:55:20.495+0800  INFO  peer/peertask_conductor.go:1326  peer task done, cost: 2238ms  {"peer": "30.54.146.131-15874-f6729352-950e-412f-b876-0e5c8e3232b1", "task": "70c644474b6c986e3af27d742d3602469e88f8956956817f9f67082c6967dc1a", "component": "PeerTask", "trace": "35c801b7dac36eeb0ea43a58d1c82e77"}

Performance testingโ€‹

Test the performance of single-machine large file download after the integration of Git LFS and Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing time to download large files (512M and 1G) between Git LFS, Git LFS & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache

  • Git LFS: Use Git LFS to download large files directly.
  • Git LFS & Dragonfly Cold Boot: Use Git LFS to download large files via the Dragonfly P2P network with no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use Git LFS to download large files via the Dragonfly P2P network and hit the local peer cache.

Test results show that the Git LFS and Dragonfly P2P integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, large file downloads will be even faster.

Dragonfly communityโ€‹

Git LFSโ€‹

TorchServe accelerates the distribution of models based on Dragonfly

ยท 14 min read


This document will help you experience how to use Dragonfly with TorchServe. When models are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Model Registry flow from Cluster A and Cluster B

Architectureโ€‹

Dragonfly Endpoint architecture

Dragonfly Endpoint plugin forwards TorchServe download model requests to the Dragonfly P2P network.

Dragonfly Endpoint architecture

The model download steps:

  1. TorchServe sends a model download request, and the request is forwarded to the Dragonfly Peer.
  2. The Dragonfly Peer registers the task with the Dragonfly Scheduler.
  3. The Dragonfly Scheduler returns candidate parents to the Dragonfly Peer.
  4. The Dragonfly Peer downloads the model from the candidate parents.
  5. After the model is downloaded, TorchServe registers the model.

Installationโ€‹

By integrating the Dragonfly Endpoint plugin into TorchServe, download traffic goes through Dragonfly to pull models stored in S3, OSS, GCS, and ABS, and the models are then registered in TorchServe. The Dragonfly Endpoint plugin lives in the dragonfly-endpoint repository.

Prerequisitesโ€‹

| Name | Version | Document |
| --- | --- | --- |
| Kubernetes cluster | 1.20+ | kubernetes.io |
| Helm | 3.8.0+ | helm.sh |
| TorchServe | 0.4.0+ | pytorch.org/serve/ |

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Dragonfly Kubernetes Cluster Setupโ€‹

For detailed installation documentation, please refer to quick-start-kubernetes.

Prepare Kubernetes Clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml and set dfdaemon.config.proxy.proxies.regx to match the download path of the object storage. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  hostNetwork: true
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Signature&ns'
      security:
        insecure: true
        cacert: ''
        cert: ''
        key: ''
      tcpListen:
        namespace: ''
        port: 65001
      registryMirror:
        url: https://index.docker.io
        insecure: true
        certs: []
        direct: false
      proxies:
        - regx: blobs/sha256.*
        - regx: .*amazonaws.*

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

jaeger:
  enable: true

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
LAST DEPLOYED: Mon Sep 4 10:24:55 2023
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/
4. Get Jaeger query URL by running these commands:
export JAEGER_QUERY_PORT=$(kubectl --namespace dragonfly-system get services dragonfly-jaeger-query -o jsonpath="{.spec.ports[0].port}")
kubectl --namespace dragonfly-system port-forward service/dragonfly-jaeger-query 16686:$JAEGER_QUERY_PORT
echo "Visit http://127.0.0.1:16686/search?limit=20&lookback=1h&maxDuration&minDuration&service=dragonfly to query download events"

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME READY STATUS RESTARTS AGE
dragonfly-dfdaemon-7r2cn 1/1 Running 0 3m31s
dragonfly-dfdaemon-fktl4 1/1 Running 0 3m31s
dragonfly-jaeger-c7947b579-2xk44 1/1 Running 0 3m31s
dragonfly-manager-5d4f444c6c-wq8d8 1/1 Running 0 3m31s
dragonfly-mysql-0 1/1 Running 0 3m31s
dragonfly-redis-master-0 1/1 Running 0 3m31s
dragonfly-redis-replicas-0 1/1 Running 0 3m31s
dragonfly-redis-replicas-1 1/1 Running 0 3m5s
dragonfly-redis-replicas-2 1/1 Running 0 2m44s
dragonfly-scheduler-0 1/1 Running 0 3m31s
dragonfly-seed-peer-0 1/1 Running 0 3m31s

Expose the Proxy service portโ€‹

Create the dfstore.yaml configuration to expose the port on which the Dragonfly Peer's HTTP proxy listens. The default port is 65001, so set targetPort to 65001.

kind: Service
apiVersion: v1
metadata:
  name: dfstore
spec:
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly
  ports:
    - protocol: TCP
      port: 65001
      targetPort: 65001
  type: NodePort

Create service:

kubectl --namespace dragonfly-system apply -f dfstore.yaml

Forward request to Dragonfly Peerโ€™s HTTP proxy:

kubectl --namespace dragonfly-system port-forward service/dfstore 65001:65001

Install Dragonfly Endpoint pluginโ€‹

Set environment variables for Dragonfly Endpoint configurationโ€‹

Create the config.json configuration and set the DRAGONFLY_ENDPOINT_CONFIG environment variable to the config.json file path (a resolution sketch follows the list below).

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

The default configuration path is:

  • linux: /etc/dragonfly-endpoint/config.json
  • darwin: ~/.dragonfly-endpoint/config.json
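As a hedged sketch of that lookup order (the environment variable wins, otherwise the per-OS default path applies):

import json
import os
import pathlib
import sys

# Hedged sketch: resolve the plugin configuration path as described above.
default = (
    "/etc/dragonfly-endpoint/config.json"
    if sys.platform.startswith("linux")
    else str(pathlib.Path.home() / ".dragonfly-endpoint" / "config.json")
)
path = os.environ.get("DRAGONFLY_ENDPOINT_CONFIG", default)
config = json.loads(pathlib.Path(path).read_text())
print(config["addr"])  # the peer proxy address, e.g. http://127.0.0.1:65001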

Dragonfly Endpoint configurationโ€‹

Create the config.json file to configure the Dragonfly Endpoint for S3. The configuration is as follows:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

In the filter field of the configuration, set different values for different object storage providers:

| Type | Value |
| --- | --- |
| OSS | "Expires&Signature&ns" |
| S3 | "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature" |
| OBS | "X-Amz-Algorithm&X-Amz-Credential&X-Amz-Date&X-Obs-Date&X-Amz-Expires&X-Amz-SignedHeaders&X-Amz-Signature" |

Object storage configurationโ€‹

In addition to S3, the Dragonfly Endpoint plugin also supports OSS, GCS and ABS. The configurations for the different object storage providers are as follows:

OSS (Object Storage Service)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": ["Expires", "Signature"],
  "object_storage": {
    "type": "oss",
    "bucket_name": "your_oss_bucket_name",
    "endpoint": "your_oss_endpoint",
    "access_key_id": "your_oss_access_key_id",
    "access_key_secret": "your_oss_access_key_secret"
  }
}

GCS (Google Cloud Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "gcs",
    "bucket_name": "your_gcs_bucket_name",
    "project_id": "your_gcs_project_id",
    "service_account_path": "your_gcs_service_account_path"
  }
}

ABS (Azure Blob Storage)

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "object_storage": {
    "type": "abs",
    "account_name": "your_abs_account_name",
    "account_key": "your_abs_account_key",
    "container_name": "your_abs_container_name"
  }
}

TorchServe integrates Dragonfly Endpoint pluginโ€‹

For detailed installation documentation, please refer to TorchServe document.

Binary installationโ€‹

Prerequisites​

| Name | Version | Document |
| --- | --- | --- |
| Python | 3.8.0+ | https://www.python.org/ |
| TorchServe | 0.4.0+ | pytorch.org/serve/ |
| Java | 11 | https://openjdk.org/projects/jdk/11/ |

Install TorchServe dependencies and torch-model-archiver:

python ./ts_scripts/install_dependencies.py
conda install torchserve torch-model-archiver torch-workflow-archiver -c pytorch

Clone TorchServe repository:

git clone https://github.com/pytorch/serve.git
cd serve

Create model-store directory to store the models:

mkdir model-store
chmod 777 model-store

Create plugins-path directory to store the binaries of the plugin:

mkdir plugins-path

Package Dragonfly Endpoint pluginโ€‹

Clone dragonfly-endpoint repository:

git clone https://github.com/dragonflyoss/dragonfly-endpoint.git

Build the dragonfly-endpoint project to generate the JAR file in the build/libs directory:

cd ./dragonfly-endpoint
gradle shadowJar

Note: Due to the limitations of TorchServeโ€™s JVM, the best Java version for Gradle is 11, as a higher version will cause the plugin to fail to parse.

Move the Jar file into the plugins-path directory:

mv build/libs/dragonfly_endpoint-1.0-all.jar <your plugins-path>

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

โ€“model-storesets the previously created directory to store the models and โ€“plugins-path sets the previously created directory to store the plugins. Start the TorchServe with Dragonfly Endpoint plugin:

torchserve --start --model-store <path-to-model-store-file> --plugins-path=<path-to-plugin-jars>

Verifyโ€‹

Prepare the model. Download a model from the Model ZOO, or package one yourself by referring to Torch Model archiver for TorchServe. Use the squeezenet1_1_scripted.mar model to verify:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<target path>

The TorchServe plugin is named dragonfly; please refer to the TorchServe Register API for details of the plugin API. The url parameter is not supported; instead, add the file_name parameter, which is the name of the model file to download.

Download the model:

curl -X POST "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"
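The same registration call can also be made from Python; a hedged sketch using requests (8081 is TorchServe's default management port):

import requests

# Hedged sketch, equivalent to the curl call above.
resp = requests.post(
    "http://localhost:8081/dragonfly/models",
    params={"file_name": "squeezenet1_1.mar"},
)
print(resp.status_code, resp.json())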

Verify that the model downloaded successfully:

{
  "Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers increased:

* About to connect() to localhost port 8081 (#0)
* Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{ "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact

Call inference API:

# Prepare images for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}

Install TorchServe with Dockerโ€‹

Docker configurationโ€‹

Pull the dragonflyoss/dragonfly-endpoint image with the plugin. The following is an example of the CPU version of TorchServe; refer to the Dockerfile.

docker pull dragonflyoss/dragonfly-endpoint

Create model-store directory to store the model files:

mkdir model-store
chmod 777 model-store

Prepare the plugin configuration config.json, and use S3 as the object storage:

{
  "addr": "http://127.0.0.1:65001",
  "header": {},
  "filter": [
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Date",
    "X-Amz-Expires",
    "X-Amz-SignedHeaders",
    "X-Amz-Signature"
  ],
  "object_storage": {
    "type": "s3",
    "bucket_name": "your_s3_bucket_name",
    "region": "your_s3_region",
    "access_key": "your_s3_access_key",
    "secret_key": "your_s3_secret_key"
  }
}

Set the environment variables for the configuration:

export DRAGONFLY_ENDPOINT_CONFIG=/etc/dragonfly-endpoint/config.json

Mount the model-store and dragonfly-endpoint configuration directory. Run the container:

sudo docker run --rm -it --network host \
-v $(pwd)/model-store:/home/model-server/model-store \
-v ${DRAGONFLY_ENDPOINT_CONFIG}:${DRAGONFLY_ENDPOINT_CONFIG} \
dragonflyoss/dragonfly-endpoint:latest

How to Verifyโ€‹

Prepare the model. Download a model from the Model ZOO, or package one yourself by referring to Torch Model archiver for TorchServe. Use the squeezenet1_1_scripted.mar model to verify:

wget https://torchserve.pytorch.org/mar_files/squeezenet1_1_scripted.mar

Upload the model to object storage. For details on uploading the model to S3, please refer to S3.

# Download the command line tool
pip install awscli

# Configure the key as prompted
aws configure

# Upload file
aws s3 cp <local file path> s3://<bucket name>/<target path>

The TorchServe plugin is named dragonfly; please refer to the TorchServe Register API for details of the plugin API. The url parameter is not supported; instead, add the file_name parameter, which is the name of the model file to download.

Download a model:

curl -X POST "http://localhost:8081/dragonfly/models?file_name=squeezenet1_1.mar"

Verify that the model downloaded successfully:

{
  "Status": "Model \"squeezenet1_1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Add a model worker for inference:

curl -v -X PUT "http://localhost:8081/models/squeezenet1_1?min_worker=1"

Check that the number of workers increased:

* About to connect() to localhost port 8081 (#0)
* Trying ::1...
* Connected to localhost (::1) port 8081 (#0)
> PUT /models/squeezenet1_1?min_worker=1 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:8081
> Accept: */*
>
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 66761b5a-54a7-4626-9aa4-12041e0e4e63
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 47
< connection: keep-alive
<
{ "status": "Processing worker updates..."}
* Connection #0 to host localhost left intact

Call inference API:

# Prepare images for inference
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg
curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/dogs-before.jpg

# Call inference API
curl http://localhost:8080/predictions/squeezenet1_1 -T kitten_small.jpg -T dogs-before.jpg

Check that the response is successful:

{
  "lynx": 0.5455784201622009,
  "tabby": 0.2794168293476105,
  "Egyptian_cat": 0.10391931980848312,
  "tiger_cat": 0.062633216381073,
  "leopard": 0.005019133910536766
}

Performance testingโ€‹

Test the performance of single-machine model download through the TorchServe API after the integration of Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing TorchServe API, TorchServe API & Dragonfly Cold Boot, Hit Dragonfly Remote Peer Cache and Hit Dragonfly Local Peer Cache performance based on time to download

  • TorchServe API: Use the signed URL provided by the object storage to download the model directly.
  • TorchServe API & Dragonfly Cold Boot: Use the TorchServe API to download the model via the Dragonfly P2P network with no cache hits.
  • Hit Remote Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the remote peer cache.
  • Hit Local Peer: Use the TorchServe API to download the model via the Dragonfly P2P network and hit the local peer cache.

Test results show that the TorchServe and Dragonfly integration can effectively reduce model download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be even faster.

Dragonfly communityโ€‹

Pytorchโ€‹

TorchServe Github Repo: https://github.com/pytorch/serve

Hugging Face accelerates distribution of models and datasets based on Dragonfly

ยท 10 min read


This document will help you experience how to use Dragonfly with Hugging Face. When datasets or models are downloaded, the file size is large and many services download the same files at the same time; the bandwidth of the storage reaches its limit and the downloads become slow.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Dragonfly can be used to eliminate the bandwidth limit of the storage through P2P technology, thereby accelerating file downloading.

Diagram flow showing Hugging Face Hub flow from Cluster A and Cluster B

Prerequisitesโ€‹

Notice: Kind is recommended if no kubernetes cluster is available for testing.

Install dragonflyโ€‹

For detailed installation documentation on a Kubernetes cluster, please refer to quick-start-kubernetes.

Setup kubernetes clusterโ€‹

Create the kind multi-node cluster configuration file kind-config.yaml. The configuration content is as follows:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    extraPortMappings:
      - containerPort: 30950
        hostPort: 65001
  - role: worker

Create a kind multi-node cluster using the configuration file:

kind create cluster --config kind-config.yaml

Switch the context of kubectl to kind cluster:

kubectl config use-context kind-kind

Kind loads dragonfly imageโ€‹

Pull dragonfly latest images:

docker pull dragonflyoss/scheduler:latest
docker pull dragonflyoss/manager:latest
docker pull dragonflyoss/dfdaemon:latest

Kind cluster loads dragonfly latest images:

kind load docker-image dragonflyoss/scheduler:latest
kind load docker-image dragonflyoss/manager:latest
kind load docker-image dragonflyoss/dfdaemon:latest

Create dragonfly cluster based on helm chartsโ€‹

Create the Helm charts configuration file charts-config.yaml and set dfdaemon.config.proxy.registryMirror.url to the address of the Hugging Face Hub's LFS server. The configuration content is as follows:

scheduler:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

seedPeer:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

dfdaemon:
  metrics:
    enable: true
  hostNetwork: true
  config:
    verbose: true
    pprofPort: 18066
    proxy:
      defaultFilter: 'Expires&Key-Pair-Id&Policy&Signature'
      security:
        insecure: true
      tcpListen:
        listen: 0.0.0.0
        port: 65001
      registryMirror:
        # When enable, using header "X-Dragonfly-Registry" for remote instead of url.
        dynamic: true
        # URL for the registry mirror.
        url: https://cdn-lfs.huggingface.co
        # Whether to ignore https certificate errors.
        insecure: true
        # Optional certificates if the remote server uses self-signed certificates.
        certs: []
        # Whether to request the remote registry directly.
        direct: false
        # Whether to use proxies to decide if dragonfly should be used.
        useProxies: true
      proxies:
        - regx: repos.*
          useHTTPS: true

manager:
  replicas: 1
  metrics:
    enable: true
  config:
    verbose: true
    pprofPort: 18066

Create a dragonfly cluster using the configuration file:

$ helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
$ helm install --wait --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f charts-config.yaml
NAME: dragonfly
LAST DEPLOYED: Wed Oct 19 04:23:22 2022
NAMESPACE: dragonfly-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the scheduler address by running these commands:
export SCHEDULER_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=scheduler" -o jsonpath={.items[0].metadata.name})
export SCHEDULER_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $SCHEDULER_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl --namespace dragonfly-system port-forward $SCHEDULER_POD_NAME 8002:$SCHEDULER_CONTAINER_PORT
echo "Visit http://127.0.0.1:8002 to use your scheduler"
2. Get the dfdaemon port by running these commands:
export DFDAEMON_POD_NAME=$(kubectl get pods --namespace dragonfly-system -l "app=dragonfly,release=dragonfly,component=dfdaemon" -o jsonpath={.items[0].metadata.name})
export DFDAEMON_CONTAINER_PORT=$(kubectl get pod --namespace dragonfly-system $DFDAEMON_POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
You can use $DFDAEMON_CONTAINER_PORT as a proxy port in Node.
3. Configure runtime to use dragonfly:
https://d7y.io/docs/getting-started/quick-start/kubernetes/

Check that dragonfly is deployed successfully:

$ kubectl get po -n dragonfly-system
NAME READY STATUS RESTARTS AGE
dragonfly-dfdaemon-rhnr6 1/1 Running 4 (101s ago) 3m27s
dragonfly-dfdaemon-s6sv5 1/1 Running 5 (111s ago) 3m27s
dragonfly-manager-67f97d7986-8dgn8 1/1 Running 0 3m27s
dragonfly-mysql-0 1/1 Running 0 3m27s
dragonfly-redis-master-0 1/1 Running 0 3m27s
dragonfly-redis-replicas-0 1/1 Running 1 (115s ago) 3m27s
dragonfly-redis-replicas-1 1/1 Running 0 95s
dragonfly-redis-replicas-2 1/1 Running 0 70s
dragonfly-scheduler-0 1/1 Running 0 3m27s
dragonfly-seed-peer-0 1/1 Running 2 (95s ago) 3m27s

Create the peer service configuration file peer-service-config.yaml. The configuration content is as follows:

apiVersion: v1
kind: Service
metadata:
  name: peer
  namespace: dragonfly-system
spec:
  type: NodePort
  ports:
    - name: http-65001
      nodePort: 30950
      port: 65001
  selector:
    app: dragonfly
    component: dfdaemon
    release: dragonfly

Create a peer service using the configuration file:

kubectl apply -f peer-service-config.yaml

Use Hub Python Library to download files and distribute traffic through Dragonfly​

Any API in the Hub Python Library that uses the Requests library to download files can distribute its download traffic over the P2P network by mounting a DragonflyAdapter on the requests Session.

Download a single file with Dragonflyโ€‹

A single file can be downloaded using hf_hub_download, distributing the traffic through the Dragonfly peer.

Create the hf_hub_download_dragonfly.py file. Use DragonflyAdapter to forward LFS download requests to the Dragonfly HTTP proxy so that the P2P network can distribute the files. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import hf_hub_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):
        # Change the schema of the LFS request to download large files from https:// to http://,
        # so that the Dragonfly HTTP proxy can be used.
        if url.startswith('https://cdn-lfs.huggingface.co'):
            url = url.replace('https://', 'http://')
        return super().get_connection(url, proxies)

    def add_headers(self, request, **kwargs):
        super().add_headers(request, **kwargs)
        # If there are multiple different LFS repositories, you can override the
        # default repository address by adding the X-Dragonfly-Registry header.
        if request.url.find('example.com') != -1:
            request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.mount('http://', DragonflyAdapter())
    session.mount('https://', DragonflyAdapter())
    session.proxies = {'http': 'http://127.0.0.1:65001'}
    return session

# Set it as the default session factory.
configure_http_backend(backend_factory=backend_factory)

hf_hub_download(repo_id="tiiuae/falcon-rw-1b", filename="pytorch_model.bin")

Download a single file of the LFS protocol with Dragonfly:

$ python3 hf_hub_download_dragonfly.py
(โ€ฆ)YkNX13a46FCg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.62G/2.62G [00:52<00:00, 49.8MB/s]

Verify a single file download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms  {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}

Download a snapshot of the repo with Dragonflyโ€‹

A snapshot of the repo can be downloaded using snapshot_download, distributing the traffic through the Dragonfly peer.

Create the snapshot_download_dragonfly.py file. Use DragonflyAdapter to forward LFS download requests to the Dragonfly HTTP proxy so that the P2P network can distribute the files. Only files under the LFS protocol are distributed through the Dragonfly P2P network. The content is as follows:

import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse
from huggingface_hub import snapshot_download
from huggingface_hub import configure_http_backend

class DragonflyAdapter(HTTPAdapter):
    def get_connection(self, url, proxies=None):
        # Change the schema of the LFS request to download large files from https:// to http://,
        # so that the Dragonfly HTTP proxy can be used.
        if url.startswith('https://cdn-lfs.huggingface.co'):
            url = url.replace('https://', 'http://')
        return super().get_connection(url, proxies)

    def add_headers(self, request, **kwargs):
        super().add_headers(request, **kwargs)
        # If there are multiple different LFS repositories, you can override the
        # default repository address by adding the X-Dragonfly-Registry header.
        if request.url.find('example.com') != -1:
            request.headers["X-Dragonfly-Registry"] = 'https://example.com'

# Create a factory function that returns a new Session.
def backend_factory() -> requests.Session:
    session = requests.Session()
    session.mount('http://', DragonflyAdapter())
    session.mount('https://', DragonflyAdapter())
    session.proxies = {'http': 'http://127.0.0.1:65001'}
    return session

# Set it as the default session factory.
configure_http_backend(backend_factory=backend_factory)

snapshot_download(repo_id="tiiuae/falcon-rw-1b")

Download a snapshot of the repo with Dragonfly:

$ python3 snapshot_download_dragonfly.py
(โ€ฆ)03165eb22f0a867d4e6a64d34fce19/README.md: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 7.60k/7.60k [00:00<00:00, 374kB/s]
(โ€ฆ)7d4e6a64d34fce19/configuration_falcon.py: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 6.70k/6.70k [00:00<00:00, 762kB/s]
(โ€ฆ)f0a867d4e6a64d34fce19/modeling_falcon.py: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 56.9k/56.9k [00:00<00:00, 5.35MB/s]
(โ€ฆ)3165eb22f0a867d4e6a64d34fce19/merges.txt: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 456k/456k [00:00<00:00, 9.07MB/s]
(โ€ฆ)867d4e6a64d34fce19/tokenizer_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 234/234 [00:00<00:00, 106kB/s]
(โ€ฆ)eb22f0a867d4e6a64d34fce19/tokenizer.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.11M/2.11M [00:00<00:00, 27.7MB/s]
(โ€ฆ)3165eb22f0a867d4e6a64d34fce19/vocab.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 798k/798k [00:00<00:00, 19.7MB/s]
(โ€ฆ)7d4e6a64d34fce19/special_tokens_map.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 99.0/99.0 [00:00<00:00, 45.3kB/s]
(โ€ฆ)67d4e6a64d34fce19/generation_config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 115/115 [00:00<00:00, 5.02kB/s]
(โ€ฆ)165eb22f0a867d4e6a64d34fce19/config.json: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1.05k/1.05k [00:00<00:00, 75.9kB/s]
(โ€ฆ)eb22f0a867d4e6a64d34fce19/.gitattributes: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1.48k/1.48k [00:00<00:00, 171kB/s]
(โ€ฆ)t-oSSW23tawg__&Key-Pair-Id=KVTP0A1DKRTAX: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 2.62G/2.62G [00:50<00:00, 52.1MB/s]
Fetching 12 files: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 12/12 [00:50<00:00, 4.23s/it]

Verify a snapshot of the repo download with Dragonflyโ€‹

Execute the command:

# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon

# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log

Example output:

peer task done, cost: 28349ms  {"peer": "89.116.64.101-77008-a95a6918-a52b-47f5-9b18-cec6ada03daf", "task": "2fe93348699e07ab67823170925f6be579a3fbc803ff3d33bf9278a60b08d901", "component": "PeerTask", "trace": "b34ed802b7afc0f4acd94b2cedf3fa2a"}

Performance testingโ€‹

Test the performance of single-machine file download through the hf_hub_download API after the integration of the Hugging Face Python library and Dragonfly P2P. Because the machine's own network environment affects the absolute numbers, the actual download time is not important; the relative download times across the different scenarios are what matter.

Bar chart showing performance testing result

  • Hugging Face Python Library: Use hf_hub_download API to download models directly.
  • Hugging Face Python Library & Dragonfly Cold Boot: Use hf_hub_download API to download models via Dragonfly P2P network and no cache hits.
  • Hit Dragonfly Remote Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the remote peer cache.
  • Hit Dragonfly Local Peer Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the local peer cache.
  • Hit Hugging Face Cache: Use hf_hub_download API to download models via Dragonfly P2P network and hit the Hugging Face local cache.

Test results show that the Hugging Face Python library and Dragonfly P2P integration can effectively reduce file download time. Note that this was a single-machine test, which means that in the case of cache hits the performance limitation is the disk. If Dragonfly is deployed on multiple machines for P2P download, model downloads will be even faster.

Dragonfly communityโ€‹

Hugging Faceโ€‹

Dragonfly completes security audit!

ยท 3 min read

This summer, over four engineer-weeks, Trail of Bits and OSTIF collaborated on a security audit of Dragonfly. A CNCF Incubating project, Dragonfly provides file distribution based on peer-to-peer technology. Included in the scope was the sub-project Nydus's repository, which handles image distribution. The engagement was outlined and framed around several goals relevant to the security and longevity of the project as it moves towards graduation.

The Trail of Bits audit team approached the audit with a mix of automated and manual processes, combining static analysis and manual review. By introducing semgrep and CodeQL tooling, performing a manual review of the client, scheduler, and manager code, and fuzz testing the gRPC handlers, the audit team identified a variety of findings to help the project improve its security. By focusing on high-level business logic and externally accessible endpoints, the Trail of Bits team was able to direct its efforts during the audit and provide guidance and recommendations for Dragonfly's future work.

Recorded in the audit report are 19 findings: five were ranked high severity, one medium, four low, five informational, and four undetermined. Nine of the findings were categorized as Data Validation, three of which were high severity. Also ranked and reviewed was Dragonfly's codebase maturity, comprising eleven aspects of project code that are analyzed individually in the report.

This is a large project and could not be reviewed in full due to time and scope constraints; multiple specialized features were outside the scope of this audit for those reasons. The project is a great opportunity for continued audit work to improve the code and harden security before graduation. Ongoing security effort is critical, as security is a moving target.

We would like to thank the Trail of Bits team, particularly Dan Guido, Jeff Braswell, Paweล‚ Pล‚atek, and Sam Alws for their work on this project. Thank you to the dragonfly maintainers and contributors, specifically Wenbo Qi, for their ongoing work and contributions to this engagement. Finally, we are grateful to the CNCF for funding this audit and supporting open source security efforts.

OSTIF & Trail of Bitsโ€‹

Dragonfly communityโ€‹

Nydus communityโ€‹