Conversation
mm4tt
commented
Jan 30, 2026
- One-line PR description: Promoting KEP-4671 Gang Scheduling to Beta in 1.36
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments:
| specific group. | ||
| - Events: Repeated `FailedScheduling` events on the Pods with workloadRef. | ||
| - Mitigations: If the gang cannot fit due to resource constraints, delete the Workload object which should disable | ||
| the gang-scheduling TODO(mm4tt@): Discuss with Wojtek |
@wojtek-t this is related to the opt-out discussion we need to have. PTAL and let me know your thoughts
| - `scheduler_pod_group_scheduling_attempts_total` | ||
| - `scheduler_pod_group_scheduling_duration_seconds` | ||
| - `scheduler_pod_group_scheduling_algorithm_duration_seconds` |
@soltysh - FYI regarding #5558 (comment)
These are matching what we were talking about 4 months ago.
| 5. Create a Workload object named gang-test with minCount=2. | ||
| 6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test. | ||
| 7. The Pod stays in Pending state (waiting for the gang). We verified that | ||
| `scheduler_pod_group_scheduling_attempts_total` metric is incremented. |
The metric should not be incremented, as the pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.
Ok, is there any other metric / event that we can use here?
Can we just check "pending_pods" instead?
The pods are blocked, so will they be reported in the "gated" queue? Or "gated" is only for the ones with scheduling gates? @macsko
The pods are blocked, so will they be reported in the "gated" queue?
Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.
Or we can utilize scheduler_unschedulable_pods or scheduler_pending_pods metric?
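For concreteness, the "gated" check could be automated against a /metrics scrape. The `scheduler_pending_pods` metric and its `queue` label exist today; that gang-blocked pods land in `queue="gated"` is the assumption under discussion here, and the parsing sketch below is purely illustrative:

```python
# Minimal sketch: extract the "gated" bucket of scheduler_pending_pods from
# Prometheus text-exposition output. That gang-blocked pods are counted in
# queue="gated" is the assumption discussed in this thread.
import re

def gated_pending_pods(metrics_text: str) -> int:
    pattern = re.compile(r'^scheduler_pending_pods\{queue="gated"\}\s+(\d+)', re.M)
    match = pattern.search(metrics_text)
    return int(match.group(1)) if match else 0

sample = """\
scheduler_pending_pods{queue="active"} 3
scheduler_pending_pods{queue="backoff"} 0
scheduler_pending_pods{queue="gated"} 2
scheduler_pending_pods{queue="unschedulable"} 1
"""
print(gated_pending_pods(sample))  # 2 pods blocked in PreEnqueue
```

In an e2e test the same check would typically be done via a PromQL query on `scheduler_pending_pods{queue="gated"}` rather than raw scraping.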
| 11. Create test-pod-3 and test-pod-4 pointing to a workload. | ||
| 12. The pods are scheduled immediately one-by-one (Workload logic is ignored/unavailable because the field is dropped). | ||
| 13. Upgrade API Server and Scheduler back to v1.36. | ||
| 14. Create new pods referencing a Workload; verifying that Gang Scheduling functionality is restored (pods wait for minCount before scheduling). |
You mean that pod3 and pod4 won't be considered in calculating minCount, but only new pods will, is that correct?
What I had in mind is that we create a new workload with minCount=2 and pod3-4 pointing to it in step 11.
I don't want to consider cases where minCount < podCount, as it's semantically ambiguous.
However, I now realized we won't be able to create a new workload in step 11 because the API is disabled.
Updated the test scenario to fix that.
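For concreteness, the objects in steps 5-6 could look roughly like the sketch below. The `workloadRef` and `minCount` field names come from the quoted test steps; the API group/version and the rest of the schema are assumptions, since the alpha Workload API may differ:

```yaml
# Illustrative only: hypothetical group/version and schema for the alpha API.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: gang-test
spec:
  minCount: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
```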
| ###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
|
||
| 1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and | ||
| `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling |
Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.
Answered in the other comment, PTAL.
| Not required until feature graduated to beta. | ||
| - Testing: Are there any tests for failure mode? If not, describe why. | ||
| --> | ||
| - Pods Pending Indefinitely (Gang Starvation) |
There should be two main cases:
- Pods waiting in PreEnqueue until minCount is reached
- Pods that cannot be scheduled because minCount pods do not fit
In both cases, the pod status should be set, informing about the reason.
Thanks, split into two cases.
| metrics: | ||
| - scheduler_pod_group_scheduling_attempts_total | ||
| - scheduler_pod_group_scheduling_duration_seconds | ||
| - scheduler_pod_group_scheduling_algorithm_duration_seconds |
This metric should not bring any new information over the metric above.
Are we saying we'll not be adding it?
IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?
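The distinction can be sketched with hypothetical timing code (not the scheduler's actual implementation): the end-to-end timer wraps the whole cycle while the algorithm timer wraps only the core computation, so their difference isolates system overhead such as snapshotting or queue locking:

```python
# Illustrative sketch of why both durations are recorded. The sleeps stand in
# for real work; the names and structure are assumptions, not scheduler code.
import time

def run_group_scheduling_cycle():
    cycle_start = time.monotonic()

    time.sleep(0.02)        # stand-in for snapshotting / queue operations

    algo_start = time.monotonic()
    time.sleep(0.01)        # stand-in for the core placement computation
    algorithm_seconds = time.monotonic() - algo_start

    total_seconds = time.monotonic() - cycle_start
    return total_seconds, algorithm_seconds

total, algo = run_group_scheduling_cycle()
overhead = total - algo     # a regression here points at overheads, not the algorithm
assert total >= algo
```

If `..._duration_seconds` regresses while `..._algorithm_duration_seconds` stays flat, the overhead term is the culprit; if both regress together, the algorithm itself is.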
I'd consider scheduler_unschedulable_pod_groups, which should be similar to the pods one that collects the number of unschedulable pods broken down by plugin name.
| alpha: | ||
| approver: "@soltysh" | ||
| beta: | ||
| approver: "@soltysh" |
Can't comment on the lines that didn't change, so adding some comments here:
How can this feature be enabled / disabled in a live cluster?
As part of Beta, we will be adding two new feature gates (that may go directly to beta):
- WorkloadSchedulingCycle - to gate the logic related to it
- DelayedPreemption - to gate delayed preemption logic
We should reflect that in the answer
Does enabling the feature change any default behavior?
Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
This question needs adjustments to reflect the newly introduced gates - see above
Are there any tests for feature enablement/disablement?
Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
As part of Beta, we will be adding two new feature gates (that may go directly to beta):
WorkloadSchedulingCycle - to gate the logic related to it
DelayedPreemption - to gate delayed preemption logic
We should reflect that in the answer
I believe the PodGroup feature gate GenericWorkloadPodGroup should be included as well.
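If those gates materialize, enablement would presumably follow the usual feature-gate flag pattern. The gate names below are the ones proposed in this thread, not shipped flags, and which component consumes which gate is also part of the open question:

```
kube-scheduler --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
kube-apiserver --feature-gates=GenericWorkloadPodGroup=true
```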
| - Details: | ||
| - [x] API .spec | ||
| - Other field: workloadRef is set on the Pods. | ||
| - [x] Events |
@macsko - are we going to have this event?
Anyway - I think this is misleading, because the lack of these events doesn't mean it's not working, so I would remove it anyway.
|
|
||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
| - Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s) | ||
| when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods. |
How would you measure it using existing metrics?
[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]
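The measurement hinted at above amounts to a rate over a counter of binding calls. Which exact API-server metric to use (e.g. filtering an existing request counter down to the "binding" subresource) is the open question; the arithmetic itself is just a PromQL-style rate between two scrapes:

```python
# Sketch: derive scheduling throughput (bindings/s) from two samples of a
# counter of POST .../binding calls. Which API-server metric supplies the
# counter is an assumption under discussion, not settled in the KEP.
def throughput_pods_per_second(count_t0: float, count_t1: float,
                               t0_seconds: float, t1_seconds: float) -> float:
    """Rate of binding calls between two scrapes, like PromQL rate()."""
    if t1_seconds <= t0_seconds:
        raise ValueError("samples must be ordered in time")
    return (count_t1 - count_t0) / (t1_seconds - t0_seconds)

# 300 new bindings observed over a 30 s scrape interval -> 10 pods/s
print(throughput_pods_per_second(1200, 1500, 0.0, 30.0))
```

Comparing this rate for workload-attached pods against a pod-by-pod baseline would give the non-regression signal the SLI asks for.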
| approvers: | ||
| - "@sanposhiho" | ||
|
|
||
| see-also: |
| @@ -806,11 +828,11 @@ previous answers based on experience in the field. | |||
|
|
|||
| ###### How can an operator determine if the feature is in use by workloads? | |||
Should we trigger an event for this feature?
| question. | ||
| --> | ||
| Since there are no formal SLOs for the kube-scheduler apart from scalability SLOs, we define the objectives for this | ||
| feature primarily in terms of non-regression to ensure the workload scheduling does not degrade the performance of the |
I'd argue that we need to specify what percentage degradation is acceptable.
| <!-- | ||
| Pick one more of these and delete the rest. | ||
| --> | ||
| - Scheduling Latency: There should be no significant regression in pod scheduling latency |
SLO for gang-scheduling latency needs to be defined
| Pick one more of these and delete the rest. | ||
| --> | ||
| - Scheduling Latency: There should be no significant regression in pod scheduling latency | ||
| (`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline. |
| API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will be | ||
| retried with standard exponential backoff once connectivity is restored. | ||
|
|
||
| ###### What are other known failure modes? |
- What happens if a Workload (PodGroup) object is deleted while its pods are in the scheduling queue or waiting at WaitOnPermit?
- What if some pods successfully bind but others fail?
|
|
Right - that's the plan. [This PR will be useful eventually, but not this cycle :)] |