feat: KEP 2841 Flux Policy to support Flux Framework #2909
google-oss-prow[bot] merged 4 commits into kubeflow:master
Conversation
This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
andreyvelich
left a comment
Thanks for driving this great feature @vsoch, and sorry for the delay; I got swamped with KubeCon. I left my initial thoughts.
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team
Pull Request Test Coverage Report for Build 20256291728 (Coveralls)
/ok-to-test
- Changed crd examples to reflect documentation
- removed tasks from definition - can go in settings
- removed mentions of minicluster out of context
- specified train image instead of custom logic
- added user stories

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
8354e2b to 70533d6
I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard-coded variables (akin to the current MPI plugin).
afcd709 to 21353da
For the

```go
type MLPolicySource struct {
	Torch *TorchMLPolicySource `json:"torch,omitempty"`
	MPI   *MPIMLPolicySource   `json:"mpi,omitempty"`

	// FluxMLPolicy defines policy only for Flux
	// +optional
	Flux *FluxMLPolicySource `json:"flux,omitempty"`
}

// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {
	// numNodes is the number of physical nodes for the job.
	// This is defined a level up on the Trainer

	// numProcPerNode is the number of processes per node.
	// Defaults to 1.
	// +kubebuilder:default=1
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

	// fluxInstall describes how to install Flux
	// +optional
	FluxInstall FluxInstall `json:"install,omitempty"`
}

// FluxInstall describes the install, network, and scheduling policy
// This is more modular for the Flux operator, and squashed here.
type FluxInstall struct {
	// Container image to use for Flux view that installs Flux
	// This must be compatible with the application container
	// Get the flux view container (these are choices)
	// ghcr.io/converged-computing/flux-view-rocky:arm-9
	// ghcr.io/converged-computing/flux-view-rocky:arm-8
	// ghcr.io/converged-computing/flux-view-rocky:tag-9
	// ghcr.io/converged-computing/flux-view-rocky:tag-8
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
	// ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
	// ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
	// We use an ubuntu (more recent) default since it is common
	// +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
	Image string `json:"image,omitempty"`

	// Network device for flux to use
	// +kubebuilder:default="eth0"
	NetworkDevice string `json:"networkDevice,omitempty"`

	// Queue policy for Flux to use
	// +kubebuilder:default="fcfs"
	QueuePolicy string `json:"queuePolicy,omitempty"`
}
```

That said, the design of the others adheres to a flat structure, so I have refactored to reflect that - no

```go
// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {
	// numNodes is the number of physical nodes for the job.
	// This is defined a level up on the Trainer

	// numProcPerNode is the number of processes per node.
	// Defaults to 1.
	// +kubebuilder:default=1
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

	// Container image to use for Flux view that installs Flux
	// This must be compatible with the application container
	// Get the flux view container (these are choices)
	// ghcr.io/converged-computing/flux-view-rocky:arm-9
	// ghcr.io/converged-computing/flux-view-rocky:arm-8
	// ghcr.io/converged-computing/flux-view-rocky:tag-9
	// ghcr.io/converged-computing/flux-view-rocky:tag-8
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
	// ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
	// ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
	// ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
	// We use an ubuntu (more recent) default since it is common
	// +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
	Image string `json:"image,omitempty"`

	// Network device for flux to use
	// +kubebuilder:default="eth0"
	NetworkDevice string `json:"networkDevice,omitempty"`

	// Queue policy for Flux to use
	// +kubebuilder:default="fcfs"
	QueuePolicy string `json:"queuePolicy,omitempty"`
}
```

After staring at it, I think I prefer it! I will update the PR as needed for tests to pass (or mostly pass) and then we can do another review pass.
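
To make the flat layout concrete, here is a minimal sketch of how the refactored `FluxMLPolicySource` might serialize. The struct fields and JSON tags are copied from the comment above; `renderFluxPolicy` is a hypothetical helper, `MLPolicySource` is trimmed to the Flux field, and the defaults are filled in by hand here on the assumption that the `+kubebuilder:default` markers would supply the same values in a real cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FluxMLPolicySource mirrors the flat struct from the discussion above.
type FluxMLPolicySource struct {
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
	Image          string `json:"image,omitempty"`
	NetworkDevice  string `json:"networkDevice,omitempty"`
	QueuePolicy    string `json:"queuePolicy,omitempty"`
}

// MLPolicySource is trimmed to the Flux field for this sketch.
type MLPolicySource struct {
	Flux *FluxMLPolicySource `json:"flux,omitempty"`
}

// renderFluxPolicy is a hypothetical helper. It sets by hand the values
// that the kubebuilder default markers declare, then returns the JSON form.
func renderFluxPolicy(numProcPerNode int32) (string, error) {
	policy := MLPolicySource{
		Flux: &FluxMLPolicySource{
			NumProcPerNode: &numProcPerNode,
			Image:          "ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy",
			NetworkDevice:  "eth0",
			QueuePolicy:    "fcfs",
		},
	}
	out, err := json.Marshal(policy)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := renderFluxPolicy(2)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With the flat layout, every Flux setting sits directly under `flux`, which matches how the Torch and MPI policy sources are described in the thread.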
Update the KEP to define a FluxMLPolicySource that exposes attributes specific to Flux.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
21353da to 805b25f
@andreyvelich ready for another look!
andreyvelich
left a comment
Thanks for the updates @vsoch!
Overall lgtm, I just left a few questions.
Updates applied. Thanks for the review @andreyvelich! Let me know if you have follow-up questions in the discussion above.
40369b6 to 48aa394
Updates:
|
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
48aa394 to b9a8f3d
andreyvelich
left a comment
Thank you for this @vsoch! Exciting to see this moving forward!
/lgtm
/assign @astefanutti @tenzen-y @Electronic-Waste
Thank you @andreyvelich for the speedy follow-up reviews! I am also pumped to add Flux (and continue working on the implementation, which is the next step after the KEP). Thanks to the other reviewers in advance for their feedback.
Hi @astefanutti @tenzen-y @Electronic-Waste, what do you need from our side to make progress here? Thanks!
## Proposal

The core of this proposal is to introduce a new Kubeflow Trainer plugin named `Flux`. This plugin will implement the `ComponentBuilderPlugin` interface to modify the `JobSet` specification generated for a `TrainJob`. The mechanism for creating the Flux cluster (the set of pods mapped to physical nodes) is dynamic and non-intrusive to the user's container image:
There was a problem hiding this comment.
Maybe you've PoC'ed this already, but I don't think the runtime framework currently handles multiple plugins contributing to the same resource (in this case the JobSet resource). We may need to improve the framework machinery to handle this properly, which is going to be useful generally.
There was a problem hiding this comment.
Yes, the order of plugin execution is not guaranteed.
So we might need to introduce a single source of truth across plugins.
We could probably use PodSetInfo as that source of truth, or introduce another shared cache.
There was a problem hiding this comment.
I guess the single source of truth across plugins is the Info object.
As in the MPI plugin, we apply volumes to it, so once the JobSet plugin is called, we sync those changes: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L142
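
The pattern described in this thread can be sketched roughly as follows. This is not the Kubeflow Trainer API: `Info`, `Plugin`, `JobSetSpec`, `buildJobSet`, and the volume and init container names are all hypothetical stand-ins, chosen only to show plugins writing to one shared object that is synced into the JobSet in a single final step.

```go
package main

import (
	"fmt"
	"sort"
)

// Info is a hypothetical, trimmed stand-in for the shared runtime Info object.
type Info struct {
	Volumes        []string
	InitContainers []string
}

// Plugin is a hypothetical interface: each plugin only mutates the shared Info.
type Plugin interface {
	Enforce(info *Info)
}

// fluxPlugin records what a Flux plugin might need (names are made up).
type fluxPlugin struct{}

func (fluxPlugin) Enforce(info *Info) {
	info.Volumes = append(info.Volumes, "flux-view")
	info.InitContainers = append(info.InitContainers, "flux-installer")
}

// mpiPlugin records a volume, loosely following the MPI plugin mentioned above.
type mpiPlugin struct{}

func (mpiPlugin) Enforce(info *Info) {
	info.Volumes = append(info.Volumes, "mpi-ssh-auth")
}

// JobSetSpec is a hypothetical, trimmed stand-in for the generated JobSet.
type JobSetSpec struct {
	Volumes        []string
	InitContainers []string
}

// buildJobSet runs every plugin against one shared Info and then syncs Info
// into the JobSet once. Sorting makes the result independent of plugin order.
func buildJobSet(plugins []Plugin) JobSetSpec {
	info := &Info{}
	for _, p := range plugins {
		p.Enforce(info)
	}
	sort.Strings(info.Volumes)
	sort.Strings(info.InitContainers)
	return JobSetSpec{Volumes: info.Volumes, InitContainers: info.InitContainers}
}

func main() {
	a := buildJobSet([]Plugin{fluxPlugin{}, mpiPlugin{}})
	b := buildJobSet([]Plugin{mpiPlugin{}, fluxPlugin{}})
	fmt.Printf("%+v\n%+v\n", a, b)
}
```

Because no plugin touches the JobSet directly, the two orderings in `main` yield the same spec, which is the property the reviewers are asking for when execution order is not guaranteed.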
andreyvelich
left a comment
We should be good to move this forward.
Thanks again for this effort @vsoch!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
Thank you @andreyvelich @tenzen-y - the review was excellent, and I'm glad to see this moving through. It might be a bit early, but Happy New Year! I appreciate everything you do for the Kubeflow (and larger Kubernetes) communities.
Thank you, excited to see this finally move forward!
* feat: kep for flux hpc (2841)
  This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features.
  Signed-off-by: vsoch <vsoch@users.noreply.github.com>
* review: see updates below.
  Changed crd examples to reflect documentation
  removed tasks from definition - can go in settings
  removed mentions of minicluster out of context
  specified train image instead of custom logic
  added user stories
  Signed-off-by: vsoch <vsoch@users.noreply.github.com>
* feat: flux policy
  Update the KEP to define a FluxMLPolicySource that exposes attributes specific to Flux.
  Signed-off-by: vsoch <vsoch@users.noreply.github.com>
* review: add details of cm and init container
  Signed-off-by: vsoch <vsoch@users.noreply.github.com>

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
What this PR does / why we need it:
This KEP proposes adding a policy to support Flux Framework, which provides more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new flux policy in the `TrainJob` API, allowing users to select and configure the HPC workload manager, Flux.
The WIP implementation for the design discussed.
Authors
Myself and @milroy
Which issue(s) this PR fixes:
This will fix #2841.
Ping @andreyvelich @astefanutti
Checklist: