feat: KEP 2841 Flux Policy to support Flux Framework by vsoch · Pull Request #2909 · kubeflow/trainer

vsoch · 2025-10-31T02:09:40Z

What this PR does / why we need it:

This KEP proposes adding a policy to support Flux Framework that provides more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new flux policy in the TrainJob API, allowing users to select and configure the HPC workload manager, Flux.

The WIP implementation for the design discussed.

Authors

Myself and @milroy

Which issue(s) this PR fixes:

This will fix #2841.

Ping @andreyvelich @astefanutti

Checklist:

Docs included if any changes are user facing

This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features. Signed-off-by: vsoch <vsoch@users.noreply.github.com>

andreyvelich

Thanks for driving this great feature @vsoch, and sorry for the delay, got swamped with the KubeCon. I left my initial thoughts.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

coveralls · 2025-11-18T03:35:26Z

Pull Request Test Coverage Report for Build 20256291728

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 51.435%

Totals
Change from base Build 20255809727:	0.0%
Covered Lines:	1237
Relevant Lines:	2405

💛 - Coveralls

andreyvelich · 2025-11-18T13:48:36Z

/ok-to-test

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories
Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch · 2025-11-25T03:03:46Z

I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard coded variables (akin to the current MPI plugin).

vsoch · 2025-12-14T08:30:23Z

For the FluxMLPolicySource, we define the minimum parameters required to run Flux and install the view, along with the parameters most commonly used in HPC. These are largely the network device and the view image, which must be compatible with the application container. If either is wrong, Flux cannot work. My first thought was to create a clean separation between fields for the different components that will likely emerge:

type MLPolicySource struct {
    Torch *TorchMLPolicySource `json:"torch,omitempty"`
    MPI   *MPIMLPolicySource   `json:"mpi,omitempty"`

    // FluxMLPolicy defines policy only for Flux
    // +optional
    Flux  *FluxMLPolicySource  `json:"flux,omitempty"`
}

// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {

    // numNodes is the number of physical nodes for the job.
    // This is defined a level up on the Trainer

    // numProcPerNode is the number of processes per node.
    // Defaults to 1.
    // +kubebuilder:default=1
    // +optional
    NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

    // fluxInstall describes how to install Flux
    // +optional
    FluxInstall FluxInstall `json:"install,omitempty"`
}

// FluxInstall describes the install, network, and scheduling policy
// This is more modular for the Flux operator, and squashed here.
type FluxInstall struct {

    // Container image to use for Flux view that installs Flux
    // This must be compatible with the application container
    // Get the flux view container (these are choices)
    // ghcr.io/converged-computing/flux-view-rocky:arm-9
    // ghcr.io/converged-computing/flux-view-rocky:arm-8
    // ghcr.io/converged-computing/flux-view-rocky:tag-9
    // ghcr.io/converged-computing/flux-view-rocky:tag-8
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
    // We use an ubuntu (more recent) default since it is common
    // +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
    Image string `json:"image,omitempty"`

    // Network device for flux to use
    // +kubebuilder:default="eth0"
    NetworkDevice string `json:"networkDevice,omitempty"`

    // Queue policy for Flux to use
    // +kubebuilder:default="fcfs"
    QueuePolicy string `json:"queuePolicy,omitempty"`
}

That said, the design of the others adheres to a flat structure, so I have refactored to reflect that - no FluxInstall but everything under one group.

// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {

    // numNodes is the number of physical nodes for the job.
    // This is defined a level up on the Trainer

    // numProcPerNode is the number of processes per node.
    // Defaults to 1.
    // +kubebuilder:default=1
    // +optional
    NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

    // Container image to use for Flux view that installs Flux
    // This must be compatible with the application container
    // Get the flux view container (these are choices)
    // ghcr.io/converged-computing/flux-view-rocky:arm-9
    // ghcr.io/converged-computing/flux-view-rocky:arm-8
    // ghcr.io/converged-computing/flux-view-rocky:tag-9
    // ghcr.io/converged-computing/flux-view-rocky:tag-8
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
    // We use an ubuntu (more recent) default since it is common
    // +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
    Image string `json:"image,omitempty"`

    // Network device for flux to use
    // +kubebuilder:default="eth0"
    NetworkDevice string `json:"networkDevice,omitempty"`

    // Queue policy for Flux to use
    // +kubebuilder:default="fcfs"
    QueuePolicy string `json:"queuePolicy,omitempty"`
}

After staring at it, I think I prefer it! I will update the PR as needed for tests to pass (or mostly pass) and then we can do another review pass.
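To make the flat layout concrete, here is a minimal sketch of how the documented defaults could be filled in from plain Go. It assumes the field names and default values shown in the struct above. The helper `applyFluxDefaults` is hypothetical and is not part of Kubeflow Trainer. In a real cluster, `+kubebuilder:default` values are normally applied by the API server from the generated CRD schema, so a helper like this would mostly be useful in unit tests.

```go
package main

import "fmt"

// FluxMLPolicySource mirrors the flat struct from the comment above.
// It is re-declared here (without JSON tags) only so the sketch compiles on its own.
type FluxMLPolicySource struct {
	NumProcPerNode *int32
	Image          string
	NetworkDevice  string
	QueuePolicy    string
}

// applyFluxDefaults is a hypothetical helper, not an existing Trainer function.
// It returns a copy with unset fields filled from the kubebuilder default markers.
func applyFluxDefaults(in FluxMLPolicySource) FluxMLPolicySource {
	out := in
	if out.NumProcPerNode == nil {
		one := int32(1)
		out.NumProcPerNode = &one
	}
	if out.Image == "" {
		out.Image = "ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
	}
	if out.NetworkDevice == "" {
		out.NetworkDevice = "eth0"
	}
	if out.QueuePolicy == "" {
		out.QueuePolicy = "fcfs"
	}
	return out
}

func main() {
	// Only the network device is set; "ib0" is an illustrative value.
	p := applyFluxDefaults(FluxMLPolicySource{NetworkDevice: "ib0"})
	fmt.Println(*p.NumProcPerNode, p.Image, p.NetworkDevice, p.QueuePolicy)
}
```

With a flat struct, each field is checked and defaulted directly; the nested `FluxInstall` variant would need one extra level of access for the same result.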

Update the KEP to define a FluxMLPolicySource that exposes attributes specific to Flux. Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch · 2025-12-14T13:22:10Z

@andreyvelich ready for another look!

andreyvelich

Thanks for the updates @vsoch!
Overall lgtm, I just left a few questions.

vsoch · 2025-12-16T00:53:15Z

Updates applied. Thanks for the review @andreyvelich ! Let me know if you have follow up questions in the discussion above.

vsoch · 2025-12-16T03:24:57Z

Updates:

The network device and queue policy fields are removed. Defaults will be eth0 and fcfs
initContainers is added as an example to the flux-runtime yaml spec.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

andreyvelich

Thank you for this @vsoch! Exciting to see this moving forward!
/lgtm
/assign @astefanutti @tenzen-y @Electronic-Waste

vsoch · 2025-12-16T18:01:39Z

Thank you @andreyvelich for the speedy follow up reviews! I am also pumped to add Flux (and continue working on the implementation, which is the next step after the KEP).

Thanks to the other reviewers in advance for their feedback.

vsoch · 2025-12-23T18:49:39Z

Hi @astefanutti @tenzen-y @Electronic-Waste what do you need from our side to make progress here? Thanks!

astefanutti

/lgtm

Thanks @vsoch @milroy!

astefanutti · 2025-12-24T11:13:24Z

docs/proposals/2841-flux-hpc/README.md

+
+## Proposal
+
+The core of this proposal is to introduce a new Kubeflow Trainer plugin named `Flux`. This plugin will implement the `ComponentBuilderPlugin` interface to modify the `JobSet` specification generated for a `TrainJob`. The mechanism for creating the Flux cluster (the set of pods mapped to physical nodes) is dynamic and non-intrusive to the user's container image:


Maybe you've PoC'ed this already, but I don't think the runtime framework currently handles multiple plugins contributing to the same resource (in that case the JobSet resource). We may need to improve the framework machinery to handle this properly which is going to be useful generally.

Yes, the order of plugin execution is not guaranteed.
So, we might need to introduce a single source of truth across plugins.

We could probably make PodSetInfo the source-of-truth cache, or introduce a separate one.

I guess the single source of truth across plugins is the Info object.
As in the MPI plugin, we apply volumes to it, so once the JobSet plugin is called, we sync those changes: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L142
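The idea discussed in this thread can be sketched roughly as follows: each plugin records its changes on a shared object, and the JobSet is assembled only afterwards from that object, so the result does not depend on plugin order. Everything here is a simplified assumption for illustration: the `Info`, `Plugin`, `fluxPlugin`, and `otherPlugin` types and the volume names are hypothetical and do not match the real Kubeflow Trainer framework types.

```go
package main

import (
	"fmt"
	"sort"
)

// Info is a hypothetical stand-in for the shared runtime Info object.
type Info struct {
	Volumes map[string]string // volume name -> source kind
}

// Plugin is a hypothetical, simplified plugin interface.
type Plugin interface {
	Enforce(info *Info)
}

// fluxPlugin and otherPlugin are illustrative; volume names are made up.
type fluxPlugin struct{}

func (fluxPlugin) Enforce(info *Info) { info.Volumes["flux-view"] = "emptyDir" }

type otherPlugin struct{}

func (otherPlugin) Enforce(info *Info) { info.Volumes["other-config"] = "configMap" }

// buildJobSetVolumes plays the role of the JobSet builder: it runs after
// all plugins and reads only from Info, never from another plugin.
func buildJobSetVolumes(info *Info) []string {
	names := make([]string, 0, len(info.Volumes))
	for name := range info.Volumes {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic output regardless of write order
	return names
}

// run applies the plugins in the given order, then syncs Info into the result.
func run(plugins []Plugin) []string {
	info := &Info{Volumes: map[string]string{}}
	for _, p := range plugins {
		p.Enforce(info)
	}
	return buildJobSetVolumes(info)
}

func main() {
	a := run([]Plugin{otherPlugin{}, fluxPlugin{}})
	b := run([]Plugin{fluxPlugin{}, otherPlugin{}})
	fmt.Println(a, b)
}
```

This only holds while plugins write to disjoint keys; two plugins writing the same key would still need an explicit conflict rule.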

tenzen-y

This looks great, thank you for this effort, @vsoch !
/lgtm

andreyvelich

We should be good to move this forward.
Thanks again for this effort @vsoch!
/lgtm
/approve

google-oss-prow · 2025-12-29T18:36:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vsoch · 2025-12-29T20:47:49Z

Thank you @andreyvelich @tenzen-y - the review was excellent, and I'm glad to see this moving through. It might be a bit early, but Happy New Year! I appreciate everything you do for the Kubeflow (and larger Kubernetes) communities.

andreyvelich · 2025-12-29T23:18:55Z

Thank you, excited to see this finally move forward!
Happy holidays @vsoch!

feat: kep for flux hpc (2841)

28f9140

google-oss-prow bot requested review from jinchihe and kuizhiqing October 31, 2025 02:09

google-oss-prow bot added the size/L label Oct 31, 2025

kannon92 mentioned this pull request Oct 31, 2025

feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905

Merged

andreyvelich mentioned this pull request Nov 5, 2025

Add documentation for Kubeflow Trainer v2 TrainJob integration with Kueue kubernetes-sigs/kueue#7533

Merged

andreyvelich reviewed Nov 18, 2025

google-oss-prow bot added the ok-to-test label Nov 18, 2025

vsoch force-pushed the kep-2841-add-flux-hpc branch from 8354e2b to 70533d6 Compare November 25, 2025 02:04

vsoch changed the title ~~KEP 2841: HPC Policy to support Flux Framework~~ feat: KEP 2841 HPC Policy to support Flux Framework Nov 25, 2025

vsoch requested a review from andreyvelich November 25, 2025 03:02

andreyvelich mentioned this pull request Dec 2, 2025

Support Workload API for TrainJob Scheduling #3015

Open

vsoch force-pushed the kep-2841-add-flux-hpc branch 4 times, most recently from afcd709 to 21353da Compare December 14, 2025 08:27

feat: flux policy

805b25f

vsoch force-pushed the kep-2841-add-flux-hpc branch from 21353da to 805b25f Compare December 14, 2025 08:33

vsoch changed the title ~~feat: KEP 2841 HPC Policy to support Flux Framework~~ feat: KEP 2841 Flux Policy to support Flux Framework Dec 14, 2025

andreyvelich reviewed Dec 16, 2025

vsoch force-pushed the kep-2841-add-flux-hpc branch from 40369b6 to 48aa394 Compare December 16, 2025 03:23

andreyvelich reviewed Dec 16, 2025

review: add details of cm and init container

b9a8f3d

vsoch force-pushed the kep-2841-add-flux-hpc branch from 48aa394 to b9a8f3d Compare December 16, 2025 04:15

andreyvelich reviewed Dec 16, 2025

google-oss-prow bot assigned astefanutti, Electronic-Waste, tenzen-y and andreyvelich Dec 16, 2025

google-oss-prow bot added the lgtm label Dec 16, 2025

astefanutti reviewed Dec 24, 2025

tenzen-y reviewed Dec 24, 2025

andreyvelich reviewed Dec 29, 2025

google-oss-prow bot added the approved label Dec 29, 2025

google-oss-prow bot merged commit 1fe3bd3 into kubeflow:master Dec 29, 2025
33 checks passed

google-oss-prow bot added this to the v2.2 milestone Dec 29, 2025

andreyvelich mentioned this pull request Jan 9, 2026

TAS: TopologyUngater can not recognize rank-based ordering for MPIJob with runLauncherAsWorker kubernetes-sigs/kueue#8471

Closed

