feat: add design docs for dataflow affinity using any preceding data operations #68

Merged: 2 commits, Sep 25, 2024
@@ -0,0 +1,129 @@
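# DataFlow affinity: support for a specified preceding operation

## Motivation

Currently the `AffinityStrategy` field lives in the struct that `RunAfter` points to, so affinity can only depend on the preceding data operation that `RunAfter` names.

Some scenarios need affinity with a preceding operation that is not the direct predecessor. In the DataFlow below:

- The DataProcess in step (4) must use the same GPU node as the DataProcess in step (2).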

```mermaid
graph BT
B("(2) DataProcess: model conversion (GPU node)") --RunAfter--> A("(1) DataProcess: model download")
C("(3) DataLoad: data warm-up") --RunAfter--> B
D("(4) DataProcess: start model inference service asynchronously (GPU node)") --RunAfter--> C

```



## Goals

The affinity of a data operation in a DataFlow can depend on any preceding operation.



## Design

> Fields change, so this design is not compatible with DataFlow Affinity in v1.0.1 and v1.0.2.
>

Contributor:

Could we define a generic struct `ObjectRef` that holds the fields needed to reference an operation, so that the design stays compatible with earlier versions?

type ObjectRef struct {
	// API version of the referent operation
	// +optional
	APIVersion string `json:"apiVersion,omitempty"`

	// Kind specifies the type of the referent operation
	// +required
	// +kubebuilder:validation:Enum=DataLoad;DataBackup;DataMigrate;DataProcess
	Kind string `json:"kind"`

	// Name specifies the name of the referent operation
	// +required
	Name string `json:"name"`

	// Namespace specifies the namespace of the referent operation.
	// +optional
	Namespace string `json:"namespace,omitempty"`
}

Using `ObjectRef` in other structs

Then use this generic struct wherever an operation needs to be referenced. For example, use `ObjectRef` in `OperationRef` and `AffinityStrategy`:

type OperationRef struct {
	ObjectRef `json:",inline"`

	// AffinityStrategy specifies the pod affinity strategy with the referent operation.
	// +optional
	AffinityStrategy AffinityStrategy `json:"affinityStrategy,omitempty"`
}

type AffinityStrategy struct {
	// Specifies the dependent preceding operation in a workflow. If not set, use `RunAfter` field.
	// +optional
	DependOn *ObjectRef `json:"dependOn,omitempty"`
	// Policy one of: "", "Require", "Prefer"
	// +optional
	Policy AffinityPolicy `json:"policy,omitempty"`

	Prefers  []Prefer  `json:"prefers,omitempty"`
	Requires []Require `json:"requires,omitempty"`
}

Contributor Author (@xliuqq, Sep 9, 2024):

My question: do we keep `AffinityStrategy` defined inside `RunAfter: OperationRef`? If so, the design stays compatible and `ObjectRef` can be factored out.
@cheyang

Contributor:

I prefer keeping `AffinityStrategy` inside `OperationRef`. I do not want to offer a standalone `AffinityStrategy`, and placed there it is still meaningful and carries the `runAfter` semantics. FYI @TrafalgarZZZ

Contributor Author:

Updated.

Contributor Author:

Regarding the inline field in the `ObjectRef` proposal above: does it require changes to the Java/Python SDKs? Can the SDKs stay compatible?
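
On the Go side, one way to look at the inline question is to check the wire format. The sketch below is only an illustration with the standard `encoding/json` package: it copies `ObjectRef` from the proposal, trims `AffinityStrategy` to two fields, and makes `AffinityStrategy` a pointer (an assumption, so that `omitempty` can drop it). The helper `marshalRunAfter` is hypothetical. It does not establish whether generated Java/Python SDK models need changes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectRef copies the fields of the struct proposed in the review comment.
type ObjectRef struct {
	APIVersion string `json:"apiVersion,omitempty"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	Namespace  string `json:"namespace,omitempty"`
}

// AffinityStrategy is trimmed to two fields for this sketch.
type AffinityStrategy struct {
	DependOn *ObjectRef `json:"dependOn,omitempty"`
	Policy   string     `json:"policy,omitempty"`
}

// OperationRef embeds ObjectRef. encoding/json flattens an embedded struct
// that has no JSON name, so kind/name/namespace stay at the top level.
// Assumption: AffinityStrategy is a pointer here so omitempty can drop it.
type OperationRef struct {
	ObjectRef        `json:",inline"`
	AffinityStrategy *AffinityStrategy `json:"affinityStrategy,omitempty"`
}

// marshalRunAfter is a hypothetical helper that returns the JSON form of a reference.
func marshalRunAfter(ref OperationRef) string {
	b, err := json.Marshal(ref)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	ref := OperationRef{ObjectRef: ObjectRef{
		Kind:      "DataProcess",
		Name:      "step2-trtllm-convert",
		Namespace: "default",
	}}
	// The embedded fields appear as top-level keys, not under an "objectRef" key.
	fmt.Println(marshalRunAfter(ref))
}
```

In this sketch the serialized `runAfter` object keeps flat `kind`, `name`, and `namespace` keys, which is the shape a client with flat fields would already expect.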

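> - (To be decided) If `AffinityStrategy` is not moved out of `RunAfter`, older versions stay compatible. Is that reasonable in terms of semantics and hierarchy?

Currently the `AffinityStrategy` field sits inside the `RunAfter` field and can only depend on the data operation that `RunAfter` specifies.

**Separate the `AffinityStrategy` field from `RunAfter`** and add a field for the preceding data operation it depends on, `DependOn *OperationRef`.

- When injecting affinity into a data operation, use the operation named by `DependOn` in `AffinityStrategy` rather than the preceding operation named by `RunAfter`.

- The user is responsible for ensuring that `DependOn *OperationRef` refers to a preceding data operation.
- **(To be decided) If `DependOn` is not specified, use the preceding data operation named by `RunAfter`.**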

```go
type DataLoadSpec struct {
	// Specifies the preceding operation in a workflow.
	// +optional
	RunAfter *OperationRef `json:"runAfter,omitempty"`

	// Modified: moved out of OperationRef.
	// AffinityStrategy specifies the pod affinity strategy with the referent operation.
	// +optional
	AffinityStrategy *AffinityStrategy `json:"affinityStrategy,omitempty"`
}

type AffinityStrategy struct {
	// Added.
	// Specifies the dependent preceding operation in a workflow. If not set, use the `RunAfter` field.
	// +optional
	DependOn *OperationRef `json:"dependOn,omitempty"`
	// Policy one of: "", "Require", "Prefer"
	// +optional
	Policy AffinityPolicy `json:"policy,omitempty"`

	Prefers  []Prefer  `json:"prefers,omitempty"`
	Requires []Require `json:"requires,omitempty"`
}

```
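
The resolution rule described in this section (use `DependOn` when set, otherwise fall back to `RunAfter`, which is still marked as to be decided) can be sketched as below. This is a minimal illustration, not the controller's code: `OperationSpec`, `affinityTarget`, and `isPreceding` are hypothetical names, and the types are trimmed local stand-ins. `isPreceding` shows one way a check could walk the `RunAfter` chain, although the design leaves that guarantee to the user.

```go
package main

import "fmt"

// OperationRef is a trimmed stand-in that identifies a data operation.
type OperationRef struct {
	Kind, Name, Namespace string
}

// AffinityStrategy is trimmed to the fields this sketch needs.
type AffinityStrategy struct {
	DependOn *OperationRef
	Policy   string
}

// OperationSpec is a hypothetical common view of a data operation's spec.
type OperationSpec struct {
	RunAfter         *OperationRef
	AffinityStrategy *AffinityStrategy
}

// affinityTarget returns the operation whose placement should be followed:
// DependOn when set, otherwise RunAfter (the fallback is still to be decided).
// It returns nil when no affinity strategy is configured.
func affinityTarget(spec OperationSpec) *OperationRef {
	if spec.AffinityStrategy == nil {
		return nil
	}
	if spec.AffinityStrategy.DependOn != nil {
		return spec.AffinityStrategy.DependOn
	}
	return spec.RunAfter
}

// isPreceding reports whether target is reachable from start by following
// RunAfter links. The seen set guards against a cyclic flow.
func isPreceding(flow map[OperationRef]OperationSpec, start OperationSpec, target OperationRef) bool {
	seen := map[OperationRef]bool{}
	for cur := start.RunAfter; cur != nil; {
		if *cur == target {
			return true
		}
		if seen[*cur] {
			return false
		}
		seen[*cur] = true
		next, ok := flow[*cur]
		if !ok {
			return false
		}
		cur = next.RunAfter
	}
	return false
}

func main() {
	step1 := OperationRef{Kind: "DataProcess", Name: "step1-download-model", Namespace: "default"}
	step2 := OperationRef{Kind: "DataProcess", Name: "step2-trtllm-convert", Namespace: "default"}
	step3 := OperationRef{Kind: "DataLoad", Name: "step3-warmup-cache", Namespace: "default"}
	flow := map[OperationRef]OperationSpec{
		step1: {},
		step2: {RunAfter: &step1},
		step3: {RunAfter: &step2},
	}
	step4 := OperationSpec{
		RunAfter:         &step3,
		AffinityStrategy: &AffinityStrategy{DependOn: &step2, Policy: "Require"},
	}
	target := affinityTarget(step4)
	fmt.Println(target.Name, isPreceding(flow, step4, *target))
}
```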



## Example

For the example workflow:

```mermaid
graph BT
B("(2) DataProcess: model conversion (GPU node)") --RunAfter--> A("(1) DataProcess: model download")
C("(3) DataLoad: data warm-up") --RunAfter--> B
D("(4) DataProcess: start model inference service asynchronously (GPU node)") --RunAfter--> C
```

The example YAML configuration is as follows:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
  # exposed affinity which will be filled in OperationStatus.
  fluid.io/affinity.labels: "node.kubernetes.io/instance-type"
# Review thread on the key above:
# Contributor (@cheyang, Sep 9, 2024): How about data-operation.fluid.io/affinity.labels?
# Contributor Author: Is that compatible with the original label?
# Contributor: No need for compatibility yet, because nobody is really using this feature.
# Contributor Author: OK.
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  ...
---
# Review thread:
# Contributor: Could you provide an example where the affinity depends on the previous DataProcess?
# Contributor Author: Provided.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  ...
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step4-infer-server
spec:
  runAfter:
    kind: DataLoad
    name: step3-warmup-cache
    namespace: default
  affinityStrategy:
    # the data operation to take the affinity from
    dependOn:
      kind: DataProcess
      name: step2-trtllm-convert
      namespace: default
    policy: Require
    # Require to run on a node with the same label value as the dependent operation
    requires:
      - name: node.kubernetes.io/instance-type
```
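
To make the `Require` policy in the example concrete, here is a rough sketch of how the `requires` entries could be turned into node selector requirements from the label values that the dependent operation exposes. Everything in it is an assumption for illustration: `requiredNodeTerms` is a hypothetical helper, `NodeSelectorRequirement` is a local stand-in shaped like the Kubernetes type of the same name, and the label value `gpu-type-a` is made up.

```go
package main

import "fmt"

// Require names a node label whose value must match the dependent operation's node.
type Require struct {
	Name string
}

// NodeSelectorRequirement is a local stand-in shaped like the Kubernetes
// core/v1 type of the same name, so the sketch needs no external modules.
type NodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

// requiredNodeTerms builds one "In" requirement per Require entry, using the
// label values exposed by the dependent operation. It fails when a requested
// label was not exposed, since a hard requirement could not be honored.
func requiredNodeTerms(requires []Require, exposed map[string]string) ([]NodeSelectorRequirement, error) {
	terms := make([]NodeSelectorRequirement, 0, len(requires))
	for _, r := range requires {
		value, ok := exposed[r.Name]
		if !ok {
			return nil, fmt.Errorf("label %q is not exposed by the dependent operation", r.Name)
		}
		terms = append(terms, NodeSelectorRequirement{
			Key:      r.Name,
			Operator: "In",
			Values:   []string{value},
		})
	}
	return terms, nil
}

func main() {
	// Hypothetical labels recorded for step2-trtllm-convert after it ran.
	exposed := map[string]string{"node.kubernetes.io/instance-type": "gpu-type-a"}
	requires := []Require{{Name: "node.kubernetes.io/instance-type"}}

	terms, err := requiredNodeTerms(requires, exposed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range terms {
		fmt.Printf("%s %s %v\n", t.Key, t.Operator, t.Values)
	}
}
```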