KEP-5832: Decouple PodGroup API from Workload API #5833
k8s-ci-robot merged 20 commits into kubernetes:master from
Conversation
keps/sig-apps/3541-add-recreate-strategy-to-statefulset/README.md
```go
	Conditions []metav1.Condition
}
```
Can we add one more subsection about deletion protection?
I would like to ensure that a PodGroup will not get deleted if there are still some pods in a non-terminal state in that PodGroup (i.e. not in the Succeeded/Failed state).
The way I envision it is that the PodGroup will be created with some dedicated finalizer (name TBD).
We will have a controller that watches PodGroups with initiated deletion (deletionTimestamp set), waits for all pods linking to them to terminate, and only then removes the finalizer.
Thinking about it, this may not be a requirement for Alpha (though it would be nice), but let's put it into the KEP and add it to the beta criteria so we won't forget about it.
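The gating rule described in this comment could be sketched roughly as follows. This is a minimal illustration, not the KEP's implementation: the finalizer name is a placeholder (the KEP leaves it TBD), and a tiny local `PodPhase` type stands in for `corev1.PodPhase` so the sketch is self-contained.

```go
package main

import "fmt"

// Hypothetical finalizer name; the actual name is TBD in the KEP.
const podGroupFinalizer = "example.k8s.io/podgroup-protection"

// Minimal stand-in for corev1.PodPhase to keep the sketch self-contained.
type PodPhase string

const (
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
	PodRunning   PodPhase = "Running"
)

// isTerminal mirrors the rule above: only Succeeded/Failed count as
// terminal; any other phase blocks PodGroup deletion.
func isTerminal(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// canRemoveFinalizer returns true only when every pod still referencing
// the PodGroup has reached a terminal phase.
func canRemoveFinalizer(referencingPods []PodPhase) bool {
	for _, phase := range referencingPods {
		if !isTerminal(phase) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(canRemoveFinalizer([]PodPhase{PodSucceeded, PodRunning})) // false
	fmt.Println(canRemoveFinalizer([]PodPhase{PodSucceeded, PodFailed}))  // true
}
```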
For alpha, could we at least mention that true workload controllers (or whoever creates a PodGroup) should (must?) add a finalizer to make sure a Pod doesn't outlive its PodGroup? I'm depending on some mechanism like that to exist for #5736. Or should we document that expectation there instead?
> Can we add one more subsection about deletion protection?

Added. PTAL.
> For alpha, could we at least mention that true workload controllers

I believe we should document this expectation and add it as a beta blocker, similar to here and https://github.com/kubernetes/enhancements/pull/5871/files#diff-03402ccdde6d2da9ed283ef0c1b203ef09baec88c05c9631e6ec0e7a8463a29dR214
### Risks and Mitigations
- Increase API calls volume: More objects means more API calls for creation, updates, and watches. The mitigation is split the responsibility. `Workload` is rarely updated (as a policy object) while `PodGroup` handles runtime state. In addition, `PodGroups` allow per-replica sharding of status updates.
NIT: ... The mitigation is to split the responsibility. The Workload object is rarely updated (as a template object), while the PodGroup handles runtime state:
```go
// PodGroupTemplateReference references the PodGroupTemplate object that
// defines the template used to create the PodGroup.
type PodGroupTemplateReference struct {
	// WorkloadName defines the name of the Workload object (Scheduling Policy) this Pod belongs to.
```
NIT: I'd remove the "(Scheduling Policy)" or rename to "(Scheduling Policy Template)".
In the decoupled model, the actual policy is inside the PodGroup.
- Read `pod.spec.podGroupRef.name` to identify the `PodGroup`
- Look up the `PodGroup` object to check its existence and to get the scheduling policy
- Read `pod.spec.podGroupRef.workloadName` to identify the Workload and check its existence
+1. The TL;DR change for scheduler is: Instead of reading the Workload read the PodGroup. It contains all the information needed by scheduler.
I'd write it explicitly (one sentence) here - maybe as the first sentence.
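The one-sentence summary suggested here could be illustrated with a minimal sketch of the lookup. The types below are tiny stand-ins (a map plays the role of an informer-backed `PodGroup` lister), not the real scheduler code:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-in for the real API type; in the decoupled model the
// PodGroup carries the scheduling policy the scheduler needs.
type PodGroup struct {
	Name   string
	Policy string
}

// Minimal stand-in for a Pod; PodGroupName models pod.spec.podGroupRef.name.
type Pod struct {
	Name         string
	PodGroupName string
}

// lookupPolicy sketches the TL;DR above: instead of reading the Workload,
// the scheduler reads the referenced PodGroup, which contains all the
// information it needs.
func lookupPolicy(pod Pod, podGroups map[string]PodGroup) (string, error) {
	pg, ok := podGroups[pod.PodGroupName]
	if !ok {
		return "", errors.New("referenced PodGroup not found")
	}
	return pg.Policy, nil
}

func main() {
	groups := map[string]PodGroup{"pg-0": {Name: "pg-0", Policy: "gang"}}
	policy, err := lookupPolicy(Pod{Name: "pod-a", PodGroupName: "pg-0"}, groups)
	fmt.Println(policy, err)
}
```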
Signed-off-by: helayoty <heelayot@microsoft.com>
The `PodGroup` lifecycle needs to make sure that `PodGroup` will not be deleted while any pod that references it is in a non-terminal phase (i.e. not `Succeeded` or `Failed`).
`PodGroup` objects are created with a dedicated finalizer that the controller is responsible for removing only when the deletion-safe condition is met. The mechanism for this is:
Could we clarify here whether the controller is the true workload controller or one run by kube-controller-manager?
There was a problem hiding this comment.
updated. @wojtek-t can you please confirm the update is correct?
- If any referencing pod is non-terminal, the controller leaves the finalizer in place and re-enqueues (e.g., on pod updates)
- To find the referencing pods, we can use an index keyed by `workloadRef.podGroupName` (and optionally namespace) so the controller can efficiently list pods that reference a given `PodGroup`
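The indexing idea in the second bullet could be sketched like this. In real code this would be an informer indexer over `corev1.Pod` objects; here a plain map and a minimal local `Pod` type stand in so the sketch runs on its own:

```go
package main

import "fmt"

// Minimal stand-in for a Pod; PodGroupName models the
// workloadRef.podGroupName field mentioned above.
type Pod struct {
	Namespace    string
	Name         string
	PodGroupName string
}

// buildIndex keys pods by namespace/podGroupName so the controller can
// list all pods referencing a given PodGroup without scanning every pod.
func buildIndex(pods []Pod) map[string][]Pod {
	idx := make(map[string][]Pod)
	for _, p := range pods {
		key := p.Namespace + "/" + p.PodGroupName
		idx[key] = append(idx[key], p)
	}
	return idx
}

func main() {
	idx := buildIndex([]Pod{
		{Namespace: "default", Name: "a", PodGroupName: "pg-0"},
		{Namespace: "default", Name: "b", PodGroupName: "pg-0"},
		{Namespace: "default", Name: "c", PodGroupName: "pg-1"},
	})
	fmt.Println(len(idx["default/pg-0"])) // 2
}
```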
Deletion protection is not required for alpha (nice-to-have), however it is required for beta graduation.
If this section is describing new functionality that will become part of kube-controller-manager only when this KEP graduates to beta, do we need to say that true workload controllers need to handle this themselves in the meantime? Or is this only relevant to #5736 so I should mention this there?
For me it's actually relevant here (it's rather infrastructure of the PodGroup API).
The reason I'm saying it's not alpha is that I want to ensure that we will be able to deliver everything. Alpha should prove the feature, and the finalizer stuff is part of making it production-ready.
@nojnhuh - my claim is that given that no one will really use Alpha in production anyway, we can ignore that problem in Alpha and ensure that it's solved by the controller for Beta; I don't think you should be adding any custom logic for that on your side.
dom4ha left a comment
All changes look good; waiting with approval for the update to workloadRef.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The increase of CPU/MEM consumption of kube-apiserver and kube-scheduler should be a negligible percentage of the current resource usage.
Can you revisit this section, especially the use of 'negligible'?
As you mentioned above in the "Informers and Watches" section:
> The kube-scheduler will add a new informer to watch `PodGroup` objects and stop watching `Workload` objects.
While the design replaces a Workload informer with a PodGroup informer, the effective cardinality changes significantly. Instead of the scheduler's cache scaling with the number of high-level Workloads, it will now scale with the number of individual PodGroups.
What do you think?
I agree that this section is a bit optimistic. Updated. PTAL.
This looks great both with my scheduling hat as well as with my PRR hat. /lgtm
@dom4ha - I think this is ready for your (hopefully last) pass
dom4ha left a comment
/approve
I noticed two leftovers, the rest is great. Thank you Heba!
`PodGroup` status mirrors `Pod` status semantics:
- If pods are unschedulable (e.g., timeout, resources, affinity, etc.), the scheduler updates the `PodGroupScheduled` condition to `False` and sets the reason fields accordingly.
- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
Suggested change:
- If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the last pod in the gang completes binding.
+ If pods are scheduled, the scheduler updates the `PodGroupScheduled` condition to `True` after the group got accepted by the Permit phase.
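The status semantics discussed in this thread could be condensed into a small decision function. This is only a sketch of the logic, with a local `ConditionStatus` type standing in for `metav1.ConditionStatus`; the function name and parameters are placeholders, not the KEP's API:

```go
package main

import "fmt"

// Minimal stand-in for metav1.ConditionStatus.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// podGroupScheduledStatus mirrors Pod status semantics for the group:
// False when the scheduler marked the group unschedulable, True once the
// whole gang is placed, Unknown while scheduling is still in flight.
func podGroupScheduledStatus(boundPods, gangSize int, unschedulable bool) ConditionStatus {
	switch {
	case unschedulable:
		return ConditionFalse
	case boundPods >= gangSize:
		return ConditionTrue
	default:
		return ConditionUnknown
	}
}

func main() {
	fmt.Println(podGroupScheduledStatus(2, 3, false)) // Unknown
	fmt.Println(podGroupScheduledStatus(3, 3, false)) // True
	fmt.Println(podGroupScheduledStatus(0, 3, true))  // False
}
```

Whether `boundPods >= gangSize` means "last pod completed binding" or "group accepted by the Permit phase" is exactly the wording question raised in the suggestion above.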
#### GangScheduling plugin
The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Suggested change:
- The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
+ The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists. This is in addition to the following changes:
mm4tt left a comment
One minor comment. Other than that
/lgtm
Thanks!
#### GangScheduling plugin
The GangScheduling plugin will maintain a lister for `PodGroup` and check if the `PodGroup` object exists along with the `Workload` object. This is in addition to the following changes:
Above we're saying that the scheduler will stop watching Workload objects. Here we're saying that it "will check existence of PodGroup along with the Workload". Let's make it consistent - i.e. remove the "along with the Workload object".
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dom4ha, helayoty, wojtek-t
```go
// - "Scheduled": All required pods have been successfully scheduled.
// - "Unschedulable": The PodGroup cannot be scheduled due to resource constraints,
//   affinity/anti-affinity rules, or insufficient capacity for the gang.
// - "SchedulingGated": One or more pods in the PodGroup have scheduling gates
```
I think setting the SchedulingGated condition would be difficult since sending any synchronous API calls from the scheduling queue is undesirable. The apiserver applies this condition for scheduling gates on pods when it sees that the pod has scheduling gates. However, I don't think we can do the same for pod groups.
Are you saying we should remove this? Or can we just send the API call from other place in scheduler code (e.g. a separate go-routine)?
I'd rather remove it and reconsider when we have the fully working async API calls feature (#5229)
+1 for removing it - I think that deciding exact reasons we support is the discussion for the code review
Anyway, if this is non-trivial to implement then I don't think this should be part of alpha. Removing in #5912.
We can remove it. We can send the Unschedulable status for Pods (maybe for the PodGroup as well) when they are waiting too long in PreEnqueue for the min or desired count, but it's indeed dependent on the Async API calls feature, which introduces such a possibility.
/area workload-aware
One-line PR description: Decouple PodGroup API as a runtime object from Workload API
Issue link: WAS: Decouple PodGroup API #5832
Other comments: This PR is part of the workload-aware scheduling workstream to implement gang-scheduling. It introduces a new separate runtime object, `PodGroup`, that helps keep the `Workload` API simple and user-friendly.
/sig scheduling