
feat: KEP 2841 Flux Policy to support Flux Framework#2909

Merged
google-oss-prow[bot] merged 4 commits into kubeflow:master from
converged-computing:kep-2841-add-flux-hpc
Dec 29, 2025

Conversation

@vsoch
Contributor

@vsoch vsoch commented Oct 31, 2025

What this PR does / why we need it:

This KEP proposes adding a policy to support Flux Framework that provides more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new flux policy in the TrainJob API, allowing users to select and configure the HPC workload manager, Flux.

The WIP implementation for the design discussed.

Authors

Myself and @milroy

Which issue(s) this PR fixes:

This will fix #2841.

Ping @andreyvelich @astefanutti

Checklist:

  • Docs included if any changes are user facing

This KEP proposes adding an hpcPolicy to support Flux
Framework and (in the future) other workload managers
that provide more traditional HPC features.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Member

@andreyvelich andreyvelich left a comment

Thanks for driving this great feature @vsoch, and sorry for the delay, got swamped with the KubeCon. I left my initial thoughts.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

@coveralls

coveralls commented Nov 18, 2025

Pull Request Test Coverage Report for Build 20256291728

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.435%

Totals Coverage Status
Change from base Build 20255809727: 0.0%
Covered Lines: 1237
Relevant Lines: 2405

💛 - Coveralls

@andreyvelich
Member

/ok-to-test

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 8354e2b to 70533d6 Compare November 25, 2025 02:04
@vsoch vsoch changed the title KEP 2841: HPC Policy to support Flux Framework feat: KEP 2841 HPC Policy to support Flux Framework Nov 25, 2025
@vsoch vsoch requested a review from andreyvelich November 25, 2025 03:02
@vsoch
Contributor Author

vsoch commented Nov 25, 2025

I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard coded variables (akin to the current MPI plugin).

@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch 4 times, most recently from afcd709 to 21353da Compare December 14, 2025 08:27
@vsoch
Contributor Author

vsoch commented Dec 14, 2025

For the FluxMLPolicySource, we define the minimum required parameters needed for Flux and for installing the view, along with the parameters most commonly used in HPC. This largely includes the network device and the view, both of which must be right for compatibility: if either is wrong, Flux cannot work. My first thought was to create a clean separation between fields for the different components that will likely emerge:

type MLPolicySource struct {
    Torch *TorchMLPolicySource `json:"torch,omitempty"`
    MPI   *MPIMLPolicySource   `json:"mpi,omitempty"`

    // FluxMLPolicy defines policy only for Flux
    // +optional
    Flux  *FluxMLPolicySource  `json:"flux,omitempty"`
}

// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {

    // numNodes is the number of physical nodes for the job.
    // This is defined a level up on the Trainer

    // numProcPerNode is the number of processes per node.
    // Defaults to 1.
    // +kubebuilder:default=1
    // +optional
    NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

    // fluxInstall describes how to install Flux
    // +optional
    FluxInstall FluxInstall `json:"install,omitempty"`
}

// FluxInstall describes the install, network, and scheduling policy
// This is more modular for the Flux operator, and squashed here.
type FluxInstall struct {

    // Container image to use for Flux view that installs Flux
    // This must be compatible with the application container
    // Get the flux view container (these are choices)
    // ghcr.io/converged-computing/flux-view-rocky:arm-9
    // ghcr.io/converged-computing/flux-view-rocky:arm-8
    // ghcr.io/converged-computing/flux-view-rocky:tag-9
    // ghcr.io/converged-computing/flux-view-rocky:tag-8
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
    // We use an ubuntu (more recent) default since it is common
    // +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
    Image string `json:"image,omitempty"`

    // Network device for flux to use
    // +kubebuilder:default="eth0"
    NetworkDevice string `json:"networkDevice,omitempty"`

    // Queue policy for Flux to use
    // +kubebuilder:default="fcfs"
    QueuePolicy string `json:"queuePolicy,omitempty"`
}

That said, the design of the others adheres to a flat structure, so I have refactored to reflect that - no FluxInstall but everything under one group.

// FluxMLPolicySource represents a Flux HPC runtime configuration.
type FluxMLPolicySource struct {

    // numNodes is the number of physical nodes for the job.
    // This is defined a level up on the Trainer

    // numProcPerNode is the number of processes per node.
    // Defaults to 1.
    // +kubebuilder:default=1
    // +optional
    NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

    // Container image to use for Flux view that installs Flux
    // This must be compatible with the application container
    // Get the flux view container (these are choices)
    // ghcr.io/converged-computing/flux-view-rocky:arm-9
    // ghcr.io/converged-computing/flux-view-rocky:arm-8
    // ghcr.io/converged-computing/flux-view-rocky:tag-9
    // ghcr.io/converged-computing/flux-view-rocky:tag-8
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-noble
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:tag-focal
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy
    // ghcr.io/converged-computing/flux-view-ubuntu:arm-focal
    // We use an ubuntu (more recent) default since it is common
    // +kubebuilder:default="ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
    Image string `json:"image,omitempty"`

    // Network device for flux to use
    // +kubebuilder:default="eth0"
    NetworkDevice string `json:"networkDevice,omitempty"`

    // Queue policy for Flux to use
    // +kubebuilder:default="fcfs"
    QueuePolicy string `json:"queuePolicy,omitempty"`
}

After staring at it, I think I prefer it! I will update the PR as needed for tests to pass (or mostly pass) and then we can do another review pass.
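
To make the flat layout concrete, here is a minimal Go sketch, assuming simplified local copies of the types above. The `applyFluxDefaults` helper is hypothetical: it only mimics in plain Go what the `+kubebuilder:default` markers would ask the API server to do, and it is not part of the Trainer codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FluxMLPolicySource is a simplified local copy of the flat struct above.
type FluxMLPolicySource struct {
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
	Image          string `json:"image,omitempty"`
	NetworkDevice  string `json:"networkDevice,omitempty"`
	QueuePolicy    string `json:"queuePolicy,omitempty"`
}

// MLPolicySource is trimmed to the Flux field only for this sketch.
type MLPolicySource struct {
	Flux *FluxMLPolicySource `json:"flux,omitempty"`
}

// applyFluxDefaults is a hypothetical helper that fills unset fields with
// the default values named in the kubebuilder markers.
func applyFluxDefaults(in FluxMLPolicySource) FluxMLPolicySource {
	out := in
	if out.NumProcPerNode == nil {
		one := int32(1)
		out.NumProcPerNode = &one
	}
	if out.Image == "" {
		out.Image = "ghcr.io/converged-computing/flux-view-ubuntu:arm-jammy"
	}
	if out.NetworkDevice == "" {
		out.NetworkDevice = "eth0"
	}
	if out.QueuePolicy == "" {
		out.QueuePolicy = "fcfs"
	}
	return out
}

func main() {
	// The user sets only the process count; everything else is defaulted.
	procs := int32(4)
	flux := applyFluxDefaults(FluxMLPolicySource{NumProcPerNode: &procs})
	policy := MLPolicySource{Flux: &flux}

	b, err := json.MarshalIndent(policy, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

With the flat layout, every Flux setting serializes directly under the `flux` key, with no intermediate `install` object, which matches the shape of the existing Torch and MPI policies.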

Update the KEP to define a FluxMLPolicySource that
exposes attributes specific to Flux.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 21353da to 805b25f Compare December 14, 2025 08:33
@vsoch vsoch changed the title feat: KEP 2841 HPC Policy to support Flux Framework feat: KEP 2841 Flux Policy to support Flux Framework Dec 14, 2025
@vsoch
Contributor Author

vsoch commented Dec 14, 2025

@andreyvelich ready for another look!

Member

@andreyvelich andreyvelich left a comment

Thanks for the updates @vsoch!
Overall lgtm, I just left a few questions.

@vsoch
Contributor Author

vsoch commented Dec 16, 2025

Updates applied. Thanks for the review @andreyvelich ! Let me know if you have follow up questions in the discussion above.

@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 40369b6 to 48aa394 Compare December 16, 2025 03:23
@vsoch
Contributor Author

vsoch commented Dec 16, 2025

Updates:

  • The network device and queue policy fields are removed. The defaults will be eth0 and fcfs.
  • initContainers is added as an example to the flux-runtime YAML spec.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 48aa394 to b9a8f3d Compare December 16, 2025 04:15
Member

@andreyvelich andreyvelich left a comment

Thank you for this @vsoch! Exciting to see this moving forward!
/lgtm
/assign @astefanutti @tenzen-y @Electronic-Waste

@vsoch
Contributor Author

vsoch commented Dec 16, 2025

Thank you @andreyvelich for the speedy follow up reviews! I am also pumped to add Flux (and continue working on the implementation, which is the next step after the KEP).

Thanks to the other reviewers in advance for their feedback.

@vsoch
Contributor Author

vsoch commented Dec 23, 2025

Hi @astefanutti @tenzen-y @Electronic-Waste what do you need from our side to make progress here? Thanks!

Contributor

@astefanutti astefanutti left a comment

/lgtm

Thanks @vsoch @milroy!


## Proposal

The core of this proposal is to introduce a new Kubeflow Trainer plugin named `Flux`. This plugin will implement the `ComponentBuilderPlugin` interface to modify the `JobSet` specification generated for a `TrainJob`. The mechanism for creating the Flux cluster (the set of pods mapped to physical nodes) is dynamic and non-intrusive to the user's container image:
Contributor

Maybe you've PoC'ed this already, but I don't think the runtime framework currently handles multiple plugins contributing to the same resource (in that case the JobSet resource). We may need to improve the framework machinery to handle this properly which is going to be useful generally.

Member

Yes, the order of plugin execution is not guaranteed.
So, we might need to introduce a single source of truth shared across plugins.

We could probably make PodSetInfo that source of truth, or introduce a separate cache for it.

Member

I guess the single source of truth across plugins is the Info object.
Like in the MPI plugin, we apply volumes to it, so once the JobSet plugin is called, we sync those changes: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L142
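
The pattern described in this thread can be sketched as follows. This is a minimal illustration with hypothetical, heavily simplified types (`Info`, `Plugin`, `JobSetSpec`, and the volume names are all assumptions, not the real Trainer framework API): plugins only write to a shared `Info`, and a single final step syncs it into the JobSet, so the result does not depend on plugin order.

```go
package main

import (
	"fmt"
	"sort"
)

// Info is a hypothetical stand-in for the shared object that plugins mutate.
type Info struct {
	Volumes map[string]string // volume name -> volume source kind
}

// Plugin is a hypothetical interface: each plugin contributes to Info only,
// never to the JobSet directly.
type Plugin interface {
	Contribute(info *Info)
}

// fluxPlugin adds an assumed volume for the Flux view.
type fluxPlugin struct{}

func (fluxPlugin) Contribute(info *Info) { info.Volumes["flux-view"] = "emptyDir" }

// otherPlugin represents any second plugin touching the same resource.
type otherPlugin struct{}

func (otherPlugin) Contribute(info *Info) { info.Volumes["scratch"] = "emptyDir" }

// JobSetSpec is a hypothetical, trimmed-down JobSet.
type JobSetSpec struct {
	Volumes []string
}

// buildJobSet runs the plugins in the given order, then syncs Info into the
// JobSet exactly once. Sorting makes the output deterministic.
func buildJobSet(plugins []Plugin) JobSetSpec {
	info := &Info{Volumes: map[string]string{}}
	for _, p := range plugins {
		p.Contribute(info)
	}
	names := make([]string, 0, len(info.Volumes))
	for name := range info.Volumes {
		names = append(names, name)
	}
	sort.Strings(names)
	return JobSetSpec{Volumes: names}
}

func main() {
	a := buildJobSet([]Plugin{fluxPlugin{}, otherPlugin{}})
	b := buildJobSet([]Plugin{otherPlugin{}, fluxPlugin{}})
	fmt.Println(a.Volumes, b.Volumes)
}
```

Because neither plugin edits the JobSet itself, swapping their execution order yields the same volume list. Contributions that conflict on the same key would still need an explicit merge rule, which this sketch does not attempt.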

Member

@tenzen-y tenzen-y left a comment

This looks great, thank you for this effort, @vsoch !
/lgtm

Member

@andreyvelich andreyvelich left a comment

We should be good to move this forward.
Thanks again for this effort @vsoch!
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 1fe3bd3 into kubeflow:master Dec 29, 2025
33 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.2 milestone Dec 29, 2025
@vsoch
Contributor Author

vsoch commented Dec 29, 2025

Thank you @andreyvelich @tenzen-y - the review was excellent, and I'm glad to see this moving through. It might be a bit early, but Happy New Year! I appreciate everything you do for the Kubeflow (and larger Kubernetes) communities.

@andreyvelich
Member

Thank you, excited to see this finally move forward!
Happy holidays @vsoch!

Snehadas2005 pushed a commit to Snehadas2005/trainer that referenced this pull request Jan 4, 2026
* feat: kep for flux hpc (2841)

This KEP proposes adding an hpcPolicy to support Flux
Framework and (in the future) other workload managers
that provide more traditional HPC features.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: see updates below.

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* feat: flux policy

Update the KEP to define a FluxMLPolicySource that
exposes attributes specific to Flux.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: add details of cm and init container

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

feat: Flux Framework as a plugin for HPC and MPI bootstrap

6 participants