generated from kubernetes/kubernetes-template-project
Init english version of core concepts (#2635)
* Update concept, To #43000843 Signed-off-by: cheyang <[email protected]> * Update concept docs, To #43000843 Signed-off-by: cheyang <[email protected]> * Update concept, To #43000843 Signed-off-by: cheyang <[email protected]> * Update concept, To #43000843 Signed-off-by: cheyang <[email protected]> --------- Signed-off-by: cheyang <[email protected]>
Showing 3 changed files with 159 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Architecture | ||
|
||
The overall architecture of Fluid is as follows: | ||
|
||
<div align="center"> | ||
<img src="../../../static/concepts/architecture.png" title="perspective" height="60%" width="60%" alt=""> | ||
</div> | ||
|
||
|
||
Fluid has two core concepts: Dataset and Runtime. To support them, Fluid's architecture is split into **a control plane** and **a data plane**. | ||
|
||
|
||
- Control Plane | ||
|
||
- **Dataset/Runtime Manager**: Responsible for scheduling and orchestrating datasets and their supporting runtimes in Kubernetes. This includes scheduling, migration, and elastic scaling of the runtimes that back datasets, as well as automated dataset operations: fine-grained data preheating (for example, preheating a specific folder), metadata backup and recovery to improve data access performance for workloads with massive numbers of small files, and pinning policies for cached data to avoid performance fluctuations caused by data eviction. | ||
|
||
- **Application Manager**: Responsible for scheduling and running application Pods that use datasets. It consists of two core components: the Scheduler and the Sidecar Webhook. | ||
|
||
- Scheduler: Schedules application Pods that use datasets in the Kubernetes cluster. Using cache information obtained from the Runtime, it preferentially places these Pods on nodes that already hold cached data, so users do not need to specify caching nodes themselves. | ||
|
||
- Sidecar Webhook: For Kubernetes environments where the csi-plugin cannot be run, the Sidecar webhook automatically replaces the PVC with a FUSE sidecar and controls the startup order of containers in the Pod to ensure that the FUSE container starts first. | ||
|
||
|
||
- Data Plane | ||
|
||
- **Runtime Plugin**: A highly extensible plugin that supports various data access engines. Fluid abstracts common features, such as cache media, quotas, and directories, so that different distributed cache engines can be plugged in. For example, AlluxioRuntime uses a Master-Slave architecture, while JuiceFSRuntime uses a Worker P2P architecture; both can be configured in the Runtime's CRD. Besides specific Runtimes such as Alluxio and JuiceFS, Fluid also supports a generic ThinRuntime, which lets users access general-purpose storage without additional development. | ||
|
||
- **CSI Plugin**: The storage client is started in a containerized manner, completely decoupled from the business container. Upgrading the CSI plugin will not affect the business container, and it also supports deploying multiple versions of the storage client in the same Kubernetes cluster. Running the client independently in a Pod also provides observability within the Kubernetes system. Additionally, resource quotas can be set for the client's computing resources. |
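
The cache-aware placement described for the Scheduler above can be illustrated with a small Python sketch. It is not Fluid's scheduler code: the `Node` type, the `rank_nodes` helper, and the node and dataset names are all hypothetical, and the only idea shown is "prefer nodes that already cache the requested dataset".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node and the datasets cached on it."""
    name: str
    cached_datasets: frozenset


def rank_nodes(nodes, wanted_dataset):
    """Order candidate nodes so that nodes caching the dataset come first.

    The sort key is False (sorts first) for nodes that cache the dataset.
    sorted() is stable, so nodes with the same key keep their input order.
    """
    return sorted(nodes, key=lambda n: wanted_dataset not in n.cached_datasets)


# Illustrative data only.
nodes = [
    Node("node-a", frozenset()),
    Node("node-b", frozenset({"imagenet"})),
    Node("node-c", frozenset({"imagenet", "logs"})),
]
ranked = [n.name for n in rank_nodes(nodes, "imagenet")]
print(ranked)  # ['node-b', 'node-c', 'node-a']
```

Nodes without the cache stay in the candidate list, which mirrors the "preferentially scheduled" wording: cache locality is treated as a preference here, not a hard requirement.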
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Concept | ||
|
||
## Dataset | ||
|
||
A dataset is a collection of logically related data that is used by computation engines, such as Spark for big data and TensorFlow for AI. Intelligent applications of these datasets create core value for industry. | ||
|
||
Dataset management has multiple dimensions, including security, version control, and data acceleration. We aim to provide support for dataset management with a focus on data acceleration. For example, we support aggregation of data from different storage sources, portability, and data features. | ||
|
||
**Data Source**: Supports multiple data sources with different protocols, including HDFS, S3, OSS, and the native Kubernetes Persistent Volume Claim protocol. Multiple data sources can also be mounted under different subdirectories in a unified namespace. | ||
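
As a rough illustration of mounting several data sources under subdirectories of one namespace, the Python sketch below maps a path in the unified namespace back to its backing source. The `resolve` helper, the mount names, and the URIs are assumptions made for this example and are not part of Fluid's API.

```python
from typing import Dict


def resolve(mounts: Dict[str, str], path: str) -> str:
    """Map a path in the unified namespace to a location in its backing source.

    `mounts` maps a top-level subdirectory name to a source URI prefix.
    """
    parts = path.strip("/").split("/", 1)
    subdir = parts[0]
    if subdir not in mounts:
        raise KeyError(f"no data source mounted at /{subdir}")
    rest = parts[1] if len(parts) > 1 else ""
    base = mounts[subdir].rstrip("/")
    return f"{base}/{rest}" if rest else base


# Hypothetical mount table: two sources with different protocols.
mounts = {
    "hdfs-data": "hdfs://namenode:8020/warehouse",
    "s3-data": "s3://example-bucket/train",
}
print(resolve(mounts, "/s3-data/images/0001.jpg"))
```

The application only sees `/hdfs-data/...` and `/s3-data/...`; which protocol serves a given path is decided by the mount table.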
|
||
**Placement Policy**: Caches the dataset on nodes of different types, using strong and weak node affinity (Kubernetes `nodeAffinity` semantics) and tolerations. | ||
|
||
|
||
<div align="center"> | ||
<img src="../../../static/concepts/dataset.png" title="perspective" height="60%" width="60%" alt=""> | ||
</div> | ||
|
||
At the same time, Dataset provides observability, such as how much data is in the dataset, how much cache space is currently available, and what the cache hit rate is. Users can use this information to decide whether to scale up or down. | ||
|
||
<div align="center"> | ||
<img src="../../../static/concepts/dataset-status.png" title="perspective" height="60%" width="60%" alt=""> | ||
</div> | ||
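
The following Python sketch shows one way such status figures could feed a scaling decision. The function names, the thresholds (`low_hit_ratio`, and the factor of two for spare capacity), and the decision rule itself are assumptions chosen for illustration; Fluid does not prescribe them.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of reads served from cache; 0.0 when nothing has been read yet."""
    total = hits + misses
    return hits / total if total else 0.0


def scaling_hint(dataset_bytes: int, cache_capacity_bytes: int,
                 hit_ratio: float, low_hit_ratio: float = 0.5) -> str:
    """Suggest a scaling action from dataset size, cache capacity, and hit ratio.

    Assumed rule: scale up when the cache is smaller than the dataset and the
    hit ratio is poor; scale down when capacity is at least twice the dataset.
    """
    if cache_capacity_bytes < dataset_bytes and hit_ratio < low_hit_ratio:
        return "scale up"
    if cache_capacity_bytes >= 2 * dataset_bytes:
        return "scale down"
    return "keep"


ratio = cache_hit_ratio(hits=30, misses=70)
print(scaling_hint(dataset_bytes=100, cache_capacity_bytes=50, hit_ratio=ratio))
```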
|
||
## Runtime | ||
|
||
Dataset is a unified abstraction; the actual data operations are implemented by specific runtimes. Because storage systems differ, there are different runtime interfaces, and a runtime is required to access the data. The API specification of each runtime can be defined relatively flexibly, but the runtime lifecycle is defined uniformly by Fluid, and a runtime implementer must complete the implementation according to the common interface definition. | ||
|
||
|
||
In Fluid, the Runtime is divided into two main categories: | ||
|
||
1. CacheRuntime implements cache acceleration. It includes the open-source distributed cache Alluxio, which mainly accelerates S3 and HDFS; JuiceFS; JindoFS, which accelerates OSS and OSS+HDFS; and GooseFS, which supports COS. | ||
2. ThinRuntime provides a unified access interface, for example for storage accessed through s3fs and nfs-fuse. | ||
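
The split between a lifecycle fixed by the framework and an implementation supplied per runtime can be sketched in Python as an abstract base class. This is a conceptual illustration only: the `Runtime` class, its method names (`setup`, `mount`, `teardown`), and `run_lifecycle` are hypothetical and do not reflect Fluid's actual interfaces.

```python
from abc import ABC, abstractmethod


class Runtime(ABC):
    """Hypothetical common interface that every runtime implementer fills in."""

    @abstractmethod
    def setup(self) -> None: ...

    @abstractmethod
    def mount(self, source: str) -> None: ...

    @abstractmethod
    def teardown(self) -> None: ...


class RecordingThinRuntime(Runtime):
    """Toy implementation that only records which lifecycle steps were called."""

    def __init__(self) -> None:
        self.events = []

    def setup(self) -> None:
        self.events.append("setup")

    def mount(self, source: str) -> None:
        self.events.append(f"mount {source}")

    def teardown(self) -> None:
        self.events.append("teardown")


def run_lifecycle(runtime: Runtime, sources) -> None:
    """The framework, not the implementer, decides the order of the steps."""
    runtime.setup()
    try:
        for source in sources:
            runtime.mount(source)
    finally:
        runtime.teardown()  # always runs, even if a mount fails


rt = RecordingThinRuntime()
run_lifecycle(rt, ["nfs://example/export"])
print(rt.events)
```

Each runtime is free in how it implements the steps, while the order in which they run stays the same for all of them.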
|
||
|
||
## Operations | ||
|
||
Fluid's universal data operations cover data prefetch, data migration, elastic scaling, cache cleaning, and metadata backup and recovery. | ||
|
||
|
||
### Data Prefetch | ||
|
||
You can specify the directory to prefetch and the prefetch strategy, which can be one-time, scheduled, or event-triggered. | ||
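
To make the combination of a target directory and a trigger strategy concrete, here is a minimal Python model of a prefetch request. The `PrefetchRequest` and `Trigger` types, their field names, and the validation rules are assumptions for illustration; they are not the fields of Fluid's CRDs.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    ONCE = "once"
    SCHEDULED = "scheduled"
    EVENT = "event"


@dataclass
class PrefetchRequest:
    """Hypothetical description of what to prefetch and when."""
    dataset: str
    path: str = "/"                # directory inside the dataset to warm up
    trigger: Trigger = Trigger.ONCE
    schedule: str = ""             # cron-style string, used only when SCHEDULED

    def validate(self) -> None:
        if not self.path.startswith("/"):
            raise ValueError("path must be absolute within the dataset")
        if self.trigger is Trigger.SCHEDULED and not self.schedule:
            raise ValueError("a scheduled prefetch needs a schedule")
        if self.trigger is not Trigger.SCHEDULED and self.schedule:
            raise ValueError("schedule is only valid for a scheduled prefetch")


nightly = PrefetchRequest("imagenet", "/train", Trigger.SCHEDULED, "0 2 * * *")
nightly.validate()  # passes: scheduled trigger with a schedule
```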
|
||
|
||
### Scale up and down | ||
|
||
Supports three scaling strategies: manual scaling, elastic scaling, and scheduled scaling. | ||
|
||
|
||
### Data Migration | ||
|
||
|
||
Supports both importing data from external storage into a dataset before using it, and using a dataset while importing data into it. | ||
|
||
|
||
Full concept: | ||
|
||
<div align="center"> | ||
<img src="../../../static/concepts/concept.png" title="perspective" height="60%" width="60%" alt=""> | ||
</div> | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
# Introduction | ||
|
||
## Why Fluid? | ||
|
||
1. Running AI, big data, and other workloads in the cloud on a cloud-native architecture takes advantage of elastic computing resources, but the separation of compute and storage also brings data access latency and high bandwidth overhead. This is especially true for deep learning training on GPUs, where iterative remote access to large amounts of training data significantly reduces computing efficiency. | ||
|
||
2. Kubernetes provides a standard interface for accessing and managing heterogeneous storage services (CSI, the Container Storage Interface), but it does not define how applications use and manage data. When running machine learning tasks, data scientists need to define file features of a dataset, manage dataset versions, control access permissions, pre-process the dataset, accelerate reads from heterogeneous data sources, and so on. Kubernetes has no standard scheme for this, which is one of its important missing capabilities. | ||
|
||
3. Kubernetes comes in a variety of forms, such as native Kubernetes, edge Kubernetes, and Serverless Kubernetes. These forms differ in their support for CSI plug-ins; for example, many Serverless Kubernetes offerings do not support deploying third-party CSI plug-ins. | ||
|
||
## What is Fluid? | ||
|
||
Unlike the traditional PVC-based storage abstraction, Fluid takes an application-oriented perspective and abstracts the “process of using data on Kubernetes”. It introduces the concept of an elastic Dataset and implements it as a first-class citizen in Kubernetes to enable Dataset CRUD operations, permission control, and access acceleration. | ||
|
||
Fluid is responsible for converting distributed caching systems (such as Alluxio and JuiceFS) into observable caching services with self-management, elastic scaling, and self-healing capabilities, and it does so by supporting dataset operations. At the same time, through the data caching location information, Fluid can provide data-affinity scheduling for applications using datasets. | ||
|
||
|
||
<div align="center"> | ||
<img src="../../../static/concepts/perspective_cn.png" title="perspective" height="60%" width="60%" alt=""> | ||
</div> | ||
|
||
|
||
## Key Features: | ||
|
||
1. **Application-oriented Dataset Unified Abstraction**: A Dataset not only consolidates data from multiple storage sources but also describes the data's portability and features. It also provides observability, such as the total data volume of the Dataset, the current cache space size, and the cache hit rate. Users can use this information to evaluate whether a cache system needs to be scaled up or down. | ||
|
||
2. **Lightweight but highly extensible Runtime Plugins**: Dataset is an abstract concept, and its data operations are implemented by the Runtime. Different storage systems have different Runtime interfaces. Fluid's Runtimes fall into two categories: CacheRuntime, which accelerates data access, such as AlluxioRuntime for S3 and HDFS and JuiceFSRuntime for JuiceFS; and ThinRuntime, which provides a unified access interface to make it easier to connect third-party storage. | ||
|
||
3. **Automated data operation**: Provides data prefetch, migration, backup, and other operations via CRDs, and supports trigger modes such as one-time, scheduled, and event-driven, so that users can integrate them into automated operation and maintenance systems. | ||
|
||
4. **Data acceleration**: Combines distributed data caching with autoscaling, portability, observability, and affinity scheduling, improving data access performance through an observable, elastically scalable cache and data-affinity scheduling. | ||
|
||
5. **Platform independent**: Supports diverse environments such as native Kubernetes, edge, Serverless Kubernetes, and Kubernetes multi-cluster deployments, and can run on cloud platforms as well as at the edge. Depending on the environment, the storage client can run in different modes, using either the CSI Plugin or a sidecar. | ||
|
||
|
||
## Demo: | ||
The following demos show how to improve AI model training speed in the cloud by using Fluid. | ||
|
||
|
||
|
||
### Demo 1: Accelerate Remote File Accessing with Fluid | ||
|
||
[![](../../../static/remote_file_accessing.png)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/277753111709.mp4) | ||
|
||
|
||
### Demo 2: Machine Learning with Fluid | ||
|
||
[![](../../../static/machine_learning.png)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/277528130570.mp4) | ||
|
||
## Quick Start | ||
|
||
You can refer to the following documents to install and use Fluid. | ||
|
||
|
||
- [English](docs/en/TOC.md) | ||
- [简体中文](docs/zh/TOC.md) | ||
|
||
|
||
## Roadmap: | ||
|
||
Fluid provides support for data scenarios in three stages: | ||
|
||
1. Achieving seamless integration between computation and data to enable interoperability between computation and data. | ||
2. Improving data access speed through universal approaches. | ||
3. Coordinating workloads and data in container clusters, managing multiple datasets, and improving data management efficiency. | ||
|
||
|
||
![](../../../static/concepts/roadmap.png) |