Conversation
mm4tt
commented
Jan 30, 2026
- One-line PR description: Promoting KEP-4671 Gang Scheduling to Beta in 1.36
- Issue link: Gang Scheduling Support in Kubernetes #4671
- Other comments:
| specific group. | ||
| - Events: Repeated `FailedScheduling` events on the Pods with workloadRef. | ||
| - Mitigations: If the gang cannot fit due to resource constraints, delete the Workload object which should disable | ||
| the gang-scheduling TODO(mm4tt@): Discuss with Wojtek |
@wojtek-t this is related to the opt-out discussion we need to have. PTAL and let me know your thoughts
| - `scheduler_pod_group_scheduling_attempts_total` | ||
| - `scheduler_pod_group_scheduling_duration_seconds` | ||
| - `scheduler_pod_group_scheduling_algorithm_duration_seconds` |
@soltysh - FYI regarding #5558 (comment)
These are matching what we were talking about 4 months ago.
| 5. Create a Workload object named gang-test with minCount=2. | ||
| 6. Create a Pod test-pod-1 with spec.workloadRef pointing to gang-test. | ||
| 7. The Pod stays in Pending state (waiting for the gang). We verified that | ||
| `scheduler_pod_group_scheduling_attempts_total` metric is incremented. |
The metric should not be incremented, as the pods will be blocked on PreEnqueue, so the workload scheduling cycle should not be triggered yet.
Ok, is there any other metric / event that we can use here?
Can we just check "pending_pods" instead?
The pods are blocked, so will they be reported in the "gated" queue? Or "gated" is only for the ones with scheduling gates? @macsko
The pods are blocked, so will they be reported in the "gated" queue?
Right, the pods blocked on PreEnqueue will be counted as "gated" in the "pending_pods" metric.
Or we can utilize scheduler_unschedulable_pods or scheduler_pending_pods metric?
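For concreteness, the "gated" check could be automated against a /metrics scrape. The `scheduler_pending_pods` metric and its `queue` label exist today; that gang-blocked pods land in `queue="gated"` is the assumption under discussion here, and the parsing sketch below is purely illustrative:

```python
# Minimal sketch: extract the "gated" bucket of scheduler_pending_pods from
# Prometheus text-exposition output. That gang-blocked pods are counted in
# queue="gated" is the assumption discussed in this thread.
import re

def gated_pending_pods(metrics_text: str) -> int:
    pattern = re.compile(r'^scheduler_pending_pods\{queue="gated"\}\s+(\d+)', re.M)
    match = pattern.search(metrics_text)
    return int(match.group(1)) if match else 0

sample = """\
scheduler_pending_pods{queue="active"} 3
scheduler_pending_pods{queue="backoff"} 0
scheduler_pending_pods{queue="gated"} 2
scheduler_pending_pods{queue="unschedulable"} 1
"""
print(gated_pending_pods(sample))  # 2 pods blocked in PreEnqueue
```

In an e2e test the same check would typically be done via a PromQL query on `scheduler_pending_pods{queue="gated"}` rather than raw scraping.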
| 11. Create test-pod-3 and test-pod-4 pointing to a workload. | ||
| 12. The pods are scheduled immediately one-by-one (Workload logic is ignored/unavailable because the field is dropped). | ||
| 13. Upgrade API Server and Scheduler back to v1.36. | ||
| 14. Create new pods referencing a Workload; verifying that Gang Scheduling functionality is restored (pods wait for minCount before scheduling). |
You mean that pod3 and pod4 won't be considered in calculating minCount, but only new pods will, is that correct?
What I had in mind is that we create a new workload with minCount=2 and pod3-4 pointing to it in step 11.
I don't want to consider cases where minCount < podCount, as it's semantically ambiguous.
However, I now realized we won't be able to create a new workload in step 11 because the API is disabled.
Updated the test scenario to fix that.
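For concreteness, the objects in steps 5-6 could look roughly like the sketch below. The `workloadRef` and `minCount` field names come from the quoted test steps; the API group/version and the rest of the schema are assumptions, since the alpha Workload API may differ:

```yaml
# Illustrative only: hypothetical group/version and schema for the alpha API.
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: gang-test
spec:
  minCount: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  workloadRef:
    name: gang-test
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
```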
| ###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
|
||
| 1. Analyze Latency Metrics: Check `scheduler_pod_group_scheduling_duration_seconds` and | ||
| `scheduler_pod_group_scheduling_algorithm_duration_seconds`. High values here indicate that the Workload Scheduling |
Not sure if scheduler_pod_group_scheduling_algorithm_duration_seconds (the one with algorithm) brings anything over the one without algorithm when there is no TAS nor WAS preemption yet.
Answered in the other comment, PTAL.
| Not required until feature graduated to beta. | ||
| - Testing: Are there any tests for failure mode? If not, describe why. | ||
| --> | ||
| - Pods Pending Indefinitely (Gang Starvation) |
There should be two main cases:
- Pods waiting in PreEnqueue until minCount is reached
- Pods that cannot be scheduled because minCount pods do not fit
In both cases, the pod status should be set, informing about the reason.
Thanks, split into two cases.
| metrics: | ||
| - scheduler_pod_group_scheduling_attempts_total | ||
| - scheduler_pod_group_scheduling_duration_seconds | ||
| - scheduler_pod_group_scheduling_algorithm_duration_seconds |
This metric should not bring any new information over the metric above.
Are we saying we'll not be adding it?
IIUC, they serve different purposes, similar to standard pod scheduling metrics. scheduler_pod_group_scheduling_duration_seconds covers the end-to-end latency of the cycle (including queue operations, snapshotting, etc.), while _algorithm_duration_seconds measures strictly the core calculation time. Having both allows us to distinguish whether a potential regression is caused by the algorithm's complexity or by system overheads (like snapshotting or queue locking), doesn't it?
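The distinction can be sketched with hypothetical timing code (not the scheduler's actual implementation): the end-to-end timer wraps the whole cycle while the algorithm timer wraps only the core computation, so their difference isolates system overhead such as snapshotting or queue locking:

```python
# Illustrative sketch of why both durations are recorded. The sleeps stand in
# for real work; the names and structure are assumptions, not scheduler code.
import time

def run_group_scheduling_cycle():
    cycle_start = time.monotonic()

    time.sleep(0.02)        # stand-in for snapshotting / queue operations

    algo_start = time.monotonic()
    time.sleep(0.01)        # stand-in for the core placement computation
    algorithm_seconds = time.monotonic() - algo_start

    total_seconds = time.monotonic() - cycle_start
    return total_seconds, algorithm_seconds

total, algo = run_group_scheduling_cycle()
overhead = total - algo     # a regression here points at overheads, not the algorithm
assert total >= algo
```

If `..._duration_seconds` regresses while `..._algorithm_duration_seconds` stays flat, the overhead term is the culprit; if both regress together, the algorithm itself is.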
I'd consider scheduler_unschedulable_pod_groups, which should be similar to the pods one that collects the number of unschedulable pods broken down by plugin name.
| alpha: | ||
| approver: "@soltysh" | ||
| beta: | ||
| approver: "@soltysh" |
Can't comment on the lines that didn't change, so adding some comments here:
How can this feature be enabled / disabled in a live cluster?
As part of Beta, we will be adding two new feature gates (that may go directly to beta):
- WorkloadSchedulingCycle - to gate the logic related to it
- DelayedPreemption - to gate delayed preemption logic
We should reflect that in the answer
Does enabling the feature change any default behavior?
Technically, we will use delayed preemption also for pod-by-pod scheduling. It should be a no-op from the end-user perspective, but maybe it's worth adding it there too?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
This question needs adjustments to reflect the newly introduced gates - see above
Are there any tests for feature enablement/disablement?
Has this been implemented? If not, please ensure that we will not promote to beta without this test.
[Or maybe we can even implement it in the meantime?]
As part of Beta, we will be adding two new feature gates (that may go directly to beta):
WorkloadSchedulingCycle - to gate the logic related to it
DelayedPreemption - to gate delayed preemption logic
We should reflect that in the answer
I believe the PodGroup feature gate GenericWorkloadPodGroup should be included as well.
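If those gates materialize, enablement would presumably follow the usual feature-gate flag pattern. The gate names below are the ones proposed in this thread, not shipped flags, and which component consumes which gate is also part of the open question:

```
kube-scheduler --feature-gates=WorkloadSchedulingCycle=true,DelayedPreemption=true
kube-apiserver --feature-gates=GenericWorkloadPodGroup=true
```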
| - Details: | ||
| - [x] API .spec | ||
| - Other field: workloadRef is set on the Pods. | ||
| - [x] Events |
@macsko - are we going to have this event?
Anyway - I think this is misleading, because the lack of these events doesn't mean it's not working, so I would remove it anyway.
|
|
||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
| - Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s) | ||
| when scheduling pods attached to a Workload compared to scheduling an equivalent number of individual pods. |
How would you measure it using existing metrics?
[I'm silently assuming that it's via API server metrics counting "/binding" calls, but it would be good to clarify.]
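The measurement hinted at above amounts to a rate over a counter of binding calls. Which exact API-server metric to use (e.g. filtering an existing request counter down to the "binding" subresource) is the open question; the arithmetic itself is just a PromQL-style rate between two scrapes:

```python
# Sketch: derive scheduling throughput (bindings/s) from two samples of a
# counter of POST .../binding calls. Which API-server metric supplies the
# counter is an assumption under discussion, not settled in the KEP.
def throughput_pods_per_second(count_t0: float, count_t1: float,
                               t0_seconds: float, t1_seconds: float) -> float:
    """Rate of binding calls between two scrapes, like PromQL rate()."""
    if t1_seconds <= t0_seconds:
        raise ValueError("samples must be ordered in time")
    return (count_t1 - count_t0) / (t1_seconds - t0_seconds)

# 300 new bindings observed over a 30 s scrape interval -> 10 pods/s
print(throughput_pods_per_second(1200, 1500, 0.0, 30.0))
```

Comparing this rate for workload-attached pods against a pod-by-pod baseline would give the non-regression signal the SLI asks for.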
| approvers: | ||
| - "@sanposhiho" | ||
|
|
||
| see-also: |
| @@ -806,11 +828,11 @@ previous answers based on experience in the field. | |||
|
|
|||
| ###### How can an operator determine if the feature is in use by workloads? | |||
Should we trigger an event for this feature?
| question. | ||
| --> | ||
| Since there are no formal SLOs for the kube-scheduler apart from scalability SLOs, we define the objectives for this | ||
| feature primarily in terms of non-regression to ensure the workload scheduling does not degrade the performance of the |
I'd argue that we need to specify what percentage degradation is acceptable.
| <!-- | ||
| Pick one more of these and delete the rest. | ||
| --> | ||
| - Scheduling Latency: There should be no significant regression in pod scheduling latency |
SLO for gang-scheduling latency needs to be defined
| Pick one more of these and delete the rest. | ||
| --> | ||
| - Scheduling Latency: There should be no significant regression in pod scheduling latency | ||
| (`scheduler_pod_scheduling_duration_seconds`) for both workload and non-workload pods compared to the baseline. |
| API server, any in-flight workload scheduling will eventually fail at the binding/update stage. These attempts will be | ||
| retried with standard exponential backoff once connectivity is restored. | ||
|
|
||
| ###### What are other known failure modes? |
- What happens if a Workload (PodGroup) object is deleted while its pods are in the scheduling queue or waiting at WaitOnPermit?
- What if some pods successfully bind but others fail?
|
|
Right - that's the plan. [This PR will be useful eventually, but not this cycle :)] |