feat: add design docs for dataflow affinity using any preceding data operations #68
@@ -0,0 +1,129 @@
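# DataFlow affinity with a specified preceding operation

## Motivation

The `AffinityStrategy` field currently lives in the struct behind `RunAfter`, so an operation's affinity can only depend on the preceding data operation that `RunAfter` references.

In some scenarios, such as the DataFlow below, the affinity has to come from a preceding operation that is not the direct predecessor:

- The DataProcess in step (4) must run on the same GPU node as the DataProcess in step (2).
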
```mermaid
graph BT
B((2)DataProcess: model conversion (GPU node)) --RunAfter--> A((1)DataProcess: model download)
C((3)DataLoad: cache warm-up) --RunAfter--> B
D((4)DataProcess: asynchronously start the model inference service (GPU node)) --RunAfter-->C

```

## Goal

The affinity of a data operation in a DataFlow can be made to depend on any preceding operation.

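## Design

> This design changes fields, so it is not compatible with DataFlow Affinity in v1.0.1 and v1.0.2.
>
> - (To be decided) Keeping `AffinityStrategy` inside `RunAfter` would stay compatible with older versions, but is that reasonable in terms of semantics and field hierarchy?

Currently the `AffinityStrategy` field sits inside the `RunAfter` field and can only depend on the data operation that `RunAfter` specifies.

**Move the `AffinityStrategy` field out of `RunAfter`** and add a field for the preceding data operation it depends on, `DependOn *OperationRef`.

- When affinity is injected into a data operation, it is derived from the operation referenced by `DependOn` in `AffinityStrategy`, not from the preceding operation referenced by `RunAfter`.

- The user is responsible for ensuring that `DependOn *OperationRef` refers to a preceding data operation.
- **(To be decided) If `DependOn` is not set, use the preceding data operation referenced by `RunAfter`.**
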
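Since the design leaves it to the user to make sure `DependOn` names a preceding operation, a small check can help catch mistakes before a DataFlow is submitted. The following is only a sketch under assumptions: the `isPreceding` helper and the in-memory `runAfter` map are hypothetical and not part of Fluid, and `OperationRef` is reduced to the `kind`, `name`, and `namespace` fields that appear in the example YAML.

```go
package main

import "fmt"

// OperationRef is a simplified stand-in for the reference type in the design,
// holding only the kind, name and namespace fields.
type OperationRef struct {
	Kind      string
	Name      string
	Namespace string
}

// isPreceding reports whether target can be reached from start by following
// RunAfter links. runAfter is a hypothetical in-memory view of the workflow
// that maps each operation to the operation it runs after.
func isPreceding(runAfter map[OperationRef]OperationRef, start, target OperationRef) bool {
	seen := map[OperationRef]bool{} // guards against accidental cycles
	cur, ok := runAfter[start]
	for ok && !seen[cur] {
		if cur == target {
			return true
		}
		seen[cur] = true
		cur, ok = runAfter[cur]
	}
	return false
}

func main() {
	step1 := OperationRef{"DataProcess", "step1-download-model", "default"}
	step2 := OperationRef{"DataProcess", "step2-trtllm-convert", "default"}
	step3 := OperationRef{"DataLoad", "step3-warmup-cache", "default"}
	step4 := OperationRef{"DataProcess", "step4-infer-server", "default"}

	chain := map[OperationRef]OperationRef{
		step2: step1,
		step3: step2,
		step4: step3,
	}

	fmt.Println(isPreceding(chain, step4, step2)) // true
	fmt.Println(isPreceding(chain, step2, step4)) // false
}
```

The walk only follows single `RunAfter` links, which matches the linear workflow used in this document.
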
```go
type DataLoadSpec struct {
	// RunAfter specifies the preceding operation in a workflow.
	// +optional
	RunAfter *OperationRef `json:"runAfter,omitempty"`

	// Modified: moved out of OperationRef.
	// AffinityStrategy specifies the pod affinity strategy with the referent operation.
	// +optional
	AffinityStrategy *AffinityStrategy `json:"affinityStrategy,omitempty"`
}

type AffinityStrategy struct {
	// Added.
	// DependOn specifies the preceding operation in a workflow that the affinity depends on.
	// If not set, the `RunAfter` field is used.
	// +optional
	DependOn *OperationRef `json:"dependOn,omitempty"`

	// Policy is one of: "", "Require", "Prefer".
	// +optional
	Policy AffinityPolicy `json:"policy,omitempty"`

	Prefers  []Prefer  `json:"prefers,omitempty"`
	Requires []Require `json:"requires,omitempty"`
}

```

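The lookup rule described for `DependOn` could be sketched as follows. This is a hypothetical helper, not Fluid's implementation: `affinitySource` is an invented name, the types are trimmed to the fields it needs, and the fallback to `RunAfter` is the behavior the design still marks as to be decided.

```go
package main

import "fmt"

// OperationRef is a simplified stand-in for the reference type in the design.
type OperationRef struct {
	Kind      string
	Name      string
	Namespace string
}

// AffinityStrategy is trimmed to the fields this sketch needs.
type AffinityStrategy struct {
	DependOn *OperationRef
	Policy   string
}

// affinitySource returns the operation whose affinity should be inherited.
// It returns nil when no affinity strategy is configured.
func affinitySource(runAfter *OperationRef, strategy *AffinityStrategy) *OperationRef {
	if strategy == nil {
		return nil // no affinity requested
	}
	if strategy.DependOn != nil {
		return strategy.DependOn // explicit dependency wins
	}
	// Assumed fallback: the design lists this behavior as "to be decided".
	return runAfter
}

func main() {
	step2 := &OperationRef{"DataProcess", "step2-trtllm-convert", "default"}
	step3 := &OperationRef{"DataLoad", "step3-warmup-cache", "default"}

	explicit := &AffinityStrategy{DependOn: step2, Policy: "Require"}
	fmt.Println(affinitySource(step3, explicit).Name) // step2-trtllm-convert

	implicit := &AffinityStrategy{Policy: "Require"}
	fmt.Println(affinitySource(step3, implicit).Name) // step3-warmup-cache
}
```
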
## Example

For the example workflow:

```mermaid
graph BT
B((2)DataProcess: model conversion (GPU node)) --RunAfter--> A((1)DataProcess: model download)
C((3)DataLoad: cache warm-up) --RunAfter--> B
D((4)DataProcess: asynchronously start the model inference service (GPU node)) --RunAfter-->C
```

The example YAML configuration is as follows:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
  # exposed affinity which will be filled in OperationStatus.
  fluid.io/affinity.labels: "node.kubernetes.io/instance-type"
  # Review comment: How about data-operation.fluid.io/affinity.labels?
  # Review comment: Is it compatible with the original label?
  # Review comment: No need for compatibility yet, since nobody is really using this feature.
  # Review comment: OK.
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  ...
---
# Review comment: Could you provide an example where the affinity depends on the previous DataProcess?
# Review comment: Provided.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  ...
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step4-infer-server
spec:
  runAfter:
    kind: DataLoad
    name: step3-warmup-cache
    namespace: default
  affinityStrategy:
    # the data operation to take the affinity from
    dependOn:
      kind: DataProcess
      name: step2-trtllm-convert
      namespace: default
    policy: Require
    # Require running on a node with the same label value as the dependent operation
    requires:
      - name: node.kubernetes.io/instance-type
```

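To illustrate how the `requires` entries in step 4 might be combined with the labels that step 2 exposes, here is a rough sketch. Everything in it is an assumption made for illustration: `Requirement` is a simplified "key In [value]" term rather than a Kubernetes API type, `requiredTerms` is a hypothetical helper, the `exposed` map stands in for whatever the dependent operation records in its status, and the label value is made up.

```go
package main

import "fmt"

// Require names a node label whose value must match the dependent operation's node.
type Require struct {
	Name string
}

// Requirement is a simplified "key In [values]" node selector term (hypothetical type).
type Requirement struct {
	Key    string
	Values []string
}

// requiredTerms builds one term per required label, using the label values
// exposed by the dependent operation. A missing label is treated as an error,
// which is an assumption of this sketch.
func requiredTerms(requires []Require, exposed map[string]string) ([]Requirement, error) {
	terms := make([]Requirement, 0, len(requires))
	for _, r := range requires {
		value, ok := exposed[r.Name]
		if !ok {
			return nil, fmt.Errorf("dependent operation did not expose label %q", r.Name)
		}
		terms = append(terms, Requirement{Key: r.Name, Values: []string{value}})
	}
	return terms, nil
}

func main() {
	// Hypothetical label value recorded for step2-trtllm-convert.
	exposed := map[string]string{"node.kubernetes.io/instance-type": "example-gpu-type"}
	requires := []Require{{Name: "node.kubernetes.io/instance-type"}}

	terms, err := requiredTerms(requires, exposed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range terms {
		fmt.Printf("%s In %v\n", t.Key, t.Values)
	}
}
```
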
Could we define a generic struct, ObjectRef, that holds the fields needed to reference an operation, so that we stay forward compatible:
Use ObjectRef in the other structs.
Then use this generic struct wherever an operation needs to be referenced. For example, use ObjectRef in OperationRef and AffinityStrategy:

My question is: should AffinityStrategy stay defined inside `RunAfter: OperationRef`? If so, it stays compatible and ObjectRef can be abstracted out. @cheyang

I lean toward keeping AffinityStrategy in OperationRef, because I don't want to provide a standalone AffinityStrategy. That placement is also meaningful, since it carries the runAfter semantics. FYI @TrafalgarZZZ

Updated.

For this `inline` field, do the Java/Python SDKs need to change? Can the SDKs stay compatible?