From eae3ddb1fd3cad00e92a47741b19aa8d6ccab036 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Wed, 10 Dec 2025 08:59:23 +0000 Subject: [PATCH 01/23] Add a section about scheduler changes for v1.36 --- keps/prod-readiness/sig-scheduling/4671.yaml | 2 + .../4671-gang-scheduling/README.md | 315 ++++++++++++++++-- .../4671-gang-scheduling/kep.yaml | 22 +- 3 files changed, 299 insertions(+), 40 deletions(-) diff --git a/keps/prod-readiness/sig-scheduling/4671.yaml b/keps/prod-readiness/sig-scheduling/4671.yaml index 17a4b734bff8..3257880a90d5 100644 --- a/keps/prod-readiness/sig-scheduling/4671.yaml +++ b/keps/prod-readiness/sig-scheduling/4671.yaml @@ -1,3 +1,5 @@ kep-number: 4671 alpha: approver: "@soltysh" +beta: + approver: "@soltysh" diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 5119a6b45c8f..772d752263c4 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -469,12 +469,14 @@ the intention from the desired state. Note that given scheduling options are stored in the `Workload` object, pods linked to the `Workload` object will not be scheduled until this `Workload` object is created and observed by the kube-scheduler. +#### North Star Vision + The north star vision for gang scheduling implementation should satisfy the following requirements: 1. Ensure that pods being part of a gang are not bound if all pods belonging to it can't be scheduled. 2. Provide the "optimal enough" placement by considering all pods from a gang together. -3. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by kube-scheduler. -4. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by different +3. Avoid deadlock and livelock scenario when multiple workloads are being scheduled at the same time by kube-scheduler. +4. 
Avoid deadlock and livelock scenario when multiple workloads are being scheduled at the same time by different schedulers. 5. Avoid premature preemptions of already running pods in case a higher priority gang will be rejected. 6. Support gang-level (or workload-level in general) level preemption (if pods form a gang also @@ -488,6 +490,8 @@ Addressing all these requirements in a single shot would be a huge change, so as will only focus on a subset of those. However, we very briefly sketch the path towards the vision to ensure that this KEP is moving in the right direction. +#### GangScheduling Plugin + For `Alpha`, we are focusing on introducing the concept of the `Workload` and plumbing it into kube-scheduler in the simplest possible way. We will implement a new plugin implementing the following hooks: @@ -499,28 +503,7 @@ hooks: This seems to be the simplest possible implementation to address the requirement (1). We are consciously ignoring the rest of the requirements for `Alpha` phase. - -For `Beta`, we want to also touch requirements (2) and (3) by extending the scheduling framework with -a new dedicated phase (tentatively called Workload). In that phase, -kube-scheduler will be looking at all pods from a gang (part of `Workload`) and compute the placement -for all of these pods in a single scheduling cycle. Those placements will be stored only in-memory and -block the required resources from scheduling. Tentatively we plan to use `NominatedNodeName` field for it. -After that, pods will go through regular pod-by-pod scheduling phases (including Filter and Score) -with a nomination as a form of validation the proposed placement and execution of this placement decision. -Therefore we expect the order of processing pods won't ever be important, but all-or-nothing nature of -gangs will be preserved while advancing through the further steps of the binding process. 
- -While we will not target addressing "optimal enough" part of requirement (2), we will assure that we -can process all gang pods together. The single scheduling cycle and blocking resources in beta -will address the requirement (3). - -We will also introduce delayed preemption by moving it after `WaitOnPermit` phase. Together with -introduction of a dedicated phase for scheduling all pods in a single scheduling cycle this -will address the requirement (5). If accompanied with blocking the resources in-memory as -mentioned above, this basically mitigates the problem. - -More detail about scheduler changes is described in [this document](https://docs.google.com/document/d/1lMYkDuGqEoZWfE2b8vjQx0vHieOMyfmi6VHUef5-5is/edit?tab=t.0#heading=h.1p88ilpefnb). - +#### Future plans We will continue with further improvements on top of it with follow-up KEPs. We are planning to introduce the concept of `Reservation` that will allow to treat distributed subset of resources as @@ -535,12 +518,6 @@ states (e.g. not yet block resources) will help with improving the scheduling ac Finally making the binding process aware of gangs will allow to make sure the process is either successful or triggers workload rescheduling satisfying requirement (7). -The workload-aware preemption is tightly coupled, but separate feature that will also be designed -in a dedicated KEP. The current vision includes introducing a dedicated preemption policy (that -will result in pods no longer being treated individually for preemption purposes) which makes it -an additive feature. However, having a next level of details is required to ensure that we really -have a feasible backward-compatible plan before promoting this feature to Beta. - Addressing requirement (8) is the biggest effort as it requires much closer integration between scheduler and autoscaling components. So in the initial steps we will only focus on mitigating this problem with existing mechanisms (e.g. 
reserving resources via NominatedNodeName). @@ -548,6 +525,275 @@ this problem with existing mechanisms (e.g. reserving resources via NominatedNod However, approval for this KEP is NOT an approval for this vision. We only sketch it to show that we see a viable path forward from the proposed design that will not require significant rework. +### Scheduler Changes for Beta + +For the `Alpha` phase, we focused on plumbing the `Workload` API and implementing +the `GangScheduling` plugin using simple barriers (`PreEnqueue` and `Permit`). +While this satisfied the correctness requirement for "all-or-nothing" scheduling, +it did not address performance or efficiency at scale, scheduling livelocks, +nor did it solve the problem of partial preemption application. + +For `Beta`, we propose introducing a **Workload Scheduling Cycle**. +This mechanism processes all Pods belonging to a single `PodGroup` in one batch, +rather than attempting to schedule them individually in isolation using the +traditional pod-by-pod approach. +While this won't fully address the "optimal enough" part of requirement (2), +it ensures that all gang pods are processed together. +The single scheduling cycle, together with blocking resources using nomination, +will address requirement (3). + +We will also introduce delayed preemption (described in [KEP-5710](https://kep.k8s.io/5711)). +Together with the introduction of a dedicated Workload Scheduling Cycle, +this will address requirement (5). + +#### The Workload Scheduling Cycle + +We introduce a new phase in the main scheduling loop (`scheduleOne`). In the +end-to-end Pod scheduling flow, it is planned to place this new phase *before* +the standard pod-by-pod scheduling cycle. + +When the scheduler pops a Pod from the active queue, it checks if that Pod +belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler +initiates the Workload Scheduling Cycle. 
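To make the dispatch decision concrete, here is a minimal, hypothetical Go sketch of the check described above. `PodGroupState`, its fields, and `needsWorkloadCycle` are illustrative stand-ins only, not the scheduler's actual types:

```go
package main

import "fmt"

// SchedulingPolicy is an illustrative stand-in for the Workload API's policy kinds.
type SchedulingPolicy string

const (
	GangPolicy  SchedulingPolicy = "Gang"
	BasicPolicy SchedulingPolicy = "Basic"
)

// PodGroupState is a hypothetical view of the scheduler's internal bookkeeping
// for one PodGroup instance; the real state lives in the scheduler cache.
type PodGroupState struct {
	Policy SchedulingPolicy
	// MinCount is the gang's all-or-nothing quorum.
	MinCount int
	// PlacedCount counts Pods already bound or holding a valid nomination.
	PlacedCount int
}

// needsWorkloadCycle decides, for a Pod popped from the active queue, whether
// to enter the Workload Scheduling Cycle instead of the pod-by-pod path: the
// Pod's group must use the Gang policy and still be (at least partially)
// unscheduled, exactly as the paragraph above describes.
func needsWorkloadCycle(group *PodGroupState) bool {
	if group == nil {
		return false // a plain Pod with no Workload association
	}
	return group.Policy == GangPolicy && group.PlacedCount < group.MinCount
}

func main() {
	partial := &PodGroupState{Policy: GangPolicy, MinCount: 4, PlacedCount: 1}
	fmt.Println(needsWorkloadCycle(partial)) // true: gang still unscheduled
	fmt.Println(needsWorkloadCycle(nil))     // false: pod-by-pod path
}
```

A fully placed gang (`PlacedCount >= MinCount`) falls through to the ordinary pod-by-pod path, which is the behavior sketched here under the assumption that placement progress is tracked per group.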
+ +```md +&lt;&lt;[UNRESOLVED Scope of the Cycle]&gt;&gt; +It is currently unresolved whether the Workload Scheduling Cycle should operate +on the entire `Workload` object (handling all defined PodGroups simultaneously) +or strictly at the `PodGroup` level. + +* PodGroup Level: The cycle processes only the specific `PodGroup` (and replica key) + associated with the popped Pod. This is simpler and aligns with + the Gang Scheduling definition and current implementation. +* Workload Level: The cycle attempts to schedule all PodGroups within the Workload. + This allows for complex dependencies between groups but increases the complexity + and doesn't bring immediate added value. + +*Proposed:* Implement it on PodGroup Level for Beta. However, future migration +to the Workload Level might necessitate non-trivial changes to the phase +introduced by this KEP. +&lt;&lt;[/UNRESOLVED]&gt;&gt; +``` + +The cycle proceeds as follows: + +1. The scheduler takes either the pod group itself or its Pod representative from + the scheduling queue. If the pod group is unscheduled (even partially), it temporarily removes + all of the group's pods from the queue for processing. The order of processing + is determined by the queueing mechanism (see *Queuing and Ordering* below). + +2. A single cluster state snapshot is taken for the entire group operation + to ensure consistency during the cycle. + +3. The scheduler runs a specialized algorithm (detailed below) + to find placements for the group. + +4. Outcome: + * If the group (i.e., at least `minCount` Pods) can be placed, + these Pods have the `.status.nominatedNodeName` set. + They are then effectively "reserved" on those nodes in the + scheduler's internal cache. Pods are then pushed to the + active queue (restoring their original timestamps to ensure fairness) + to pass through the standard scheduling and binding cycle, + which will respect the nomination.
* If `minCount` cannot be met (even after calculating potential + preemptions), the scheduler rejects the entire group. Standard backoff + logic applies (see *Failure Handling*), and Pods are returned to + the scheduling queue. + +#### Queuing and Ordering + +Workload-aware preemption (an `Alpha` effort in [KEP-5710](https://github.com/kubernetes/enhancements/pull/5711)) +will introduce a specific scheduling priority for a workload. +With that in mind, it is beneficial to design a queueing mechanism that is open +to taking a workload's scheduling priority into account. +However, as we need to support ordering before that feature can be enabled, +we also need to derive the priority from the pod group's pods. +One such formula is to set it to the lowest priority found within the pod group, +which effectively acts as the weakest link for determining whether the whole pod group is schedulable +and reduces unnecessary preemption attempts. + +```md +&lt;&lt;[UNRESOLVED Queue Implementation Strategy]&gt;&gt; +To ensure that we process the pod group (replica) at an appropriate time and +don't starve other pods (including gang pods in the pod-by-pod scheduling phase) +from being scheduled, we need to have a good queueing mechanism for pod groups. +There are several alternatives: + +Alternative 1 (Modify sorting logic): + +Modify the sorting logic within the existing `PriorityQueue` to put all pods +from a gang group one after another. +* *Pros:* Fits the current architecture. +* *Cons:* Might be problematic when some of the gang's pods are in the + backoffQ or unschedulablePods and need to be retrieved efficiently. + Makes it hard to further evolve the Workload Scheduling Cycle. + Would need to inject the workload priority into each of the Pods + or somehow apply the lowest pod's priority to the rest of the group. + +Alternative 2 (Store a gang representative): + +Only one "representative" Pod from the gang is allowed in the `activeQ` at a time.
+Others are held in a separate internal structure (e.g., a new map inside the queue). +When the representative is popped, it pulls the rest of the gang for the Workload Cycle. +* *Pros:* Makes it easier to obtain all of a pod group's pods, reduces queue size. +* *Cons:* High complexity in managing the lifecycle of the representative + (e.g., what if the representative Pod is deleted or other changes to the workload happen? + Would need a workload manager to handle all such cases). + +Alternative 3 (Dedicated PodGroup queue): + +Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. +The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. +Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. +* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. + Can report dedicated logs and metrics with less confusion to the user. +* *Cons:* Significant and non-trivial architectural change to the scheduling queue + and `scheduleOne` loop. + +*Proposed:* Alternative 3 (Dedicated PodGroup queue). While this requires an architectural change to the scheduling queue, +the effort involved in adding pod group queuing will be comparable to modifying the code for the previous alternatives. +This will also lay the foundation for future workload-aware scheduling (WAS) features and support workload priority by design. +&lt;&lt;[/UNRESOLVED]&gt;&gt; +``` + +#### Scheduling Algorithm + +The internal algorithm for placing the group utilizes the optimization defined +in *Opportunistic Batching* ([KEP-5598](https://kep.k8s.io/5598)) for improved performance. +The approach described below allows mitigating some restrictions of that feature, e.g., +by sorting the Pods appropriately by their signatures. In case Opportunistic Batching +is disabled or not applicable, this falls back to non-optimized filtering and scoring for each Pod.
+The list and configuration of plugins used by this algorithm will be the same as in the pod-by-pod cycle. + +1. The scheduler iterates through the retrieved Pods and groups + them into homogeneous sub-groups (using the signatures defined in + [KEP-5598](https://kep.k8s.io/5598)). + +2. These sub-groups are sorted. Initially, we sort by the highest priority + of the sub-group (assuming homogeneity enforces uniform sub-group priority). + In the future, sorting may use the size of the sub-group (larger groups first) to + tackle the hardest placement problems early. + +3. The scheduler iterates through the sorted sub-groups. It finds a feasible node + for each pod from a sub-group using standard filtering and scoring phases. + It also utilizes the Opportunistic Batching feature where possible, + reducing overall scheduling time. + + * If a pod fits, it is tentatively nominated. + * If a pod cannot fit, the scheduler tries preemption by running + the `PostFilter` extension point. *Note:* With workload-aware preemption + this phase will be replaced by a workload-level algorithm. + * If preemption is successful, the pod is nominated on the selected node. + * If preemption fails, the pod is considered unscheduled for this cycle. + + The phase can effectively stop once `minCount` pods have a placement, + though attempting to schedule the full group is preferred to maximize utilization. + +4. The scheduler checks if the number of schedulable (including those after delayed preemption) + Pods meets the `minCount`. + + * If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed + to the active queue and will soon attempt to be scheduled on their + nominated nodes in their own, pod-by-pod cycles. If a pod selects a + different node than its nomination during the individual cycle, the + gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). 
```md + &lt;&lt;[UNRESOLVED Pod-by-pod cycle preemption]&gt;&gt; + Should gang pods be allowed to preempt anything in their pod-by-pod cycles? + + *Proposed:* Preemption should be forbidden. Otherwise, it may complicate reasoning + about the workload scheduling cycle and workload-aware preemption. + When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, + and all necessary preemptions will be simulated in the next workload scheduling cycle. + &lt;&lt;[/UNRESOLVED]&gt;&gt; + ``` + * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers + and nominations for them are cleared to ensure the other workloads (pod groups) + can be attempted in their place. See *Failure Handling*. + +While this algorithm might be suboptimal, it is a solid first step for ensuring we have +a single-cycle workload scheduling phase. As long as PodGroups consist of homogeneous pods, +opportunistic batching itself will provide significant improvements. +Future features like Topology Aware Scheduling can further improve other subsets of use cases. + +#### Interaction with Basic Policy + +For pod groups using the `Basic` policy, the Workload Scheduling Cycle is +optional. In the `Beta` timeframe, we may opportunistically apply this cycle to +`Basic` pod groups to leverage the batching performance benefits, but the +"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to +schedule as many pods from such PodGroup as possible. + +#### Delayed Preemption + +A critical requirement for moving Gang Scheduling to Beta is the integration +with *Delayed Preemption*. + +Standard Kubernetes preemption is eager: when a `PostFilter` selects victims to preempt, +they are deleted immediately. For Gang Scheduling, this behavior is risky and can lead to +*partial preemption application*, meaning we might do some unnecessary preemptions +when the gang, ultimately, won't fit.
Delayed Preemption solves this by separating the +*selection* of victims from the *execution* of preemption. + +1. During the Workload Scheduling Cycle, the scheduler calculates necessary + preemptions for all Pods in the gang (Step 3 of Scheduling Algorithm). + +2. The scheduler nominates the victims for preemption and the gang Pod + for scheduling on their place. This way, the gang can be attempted + without making any intermediate disruptions to the cluster. + * If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod. + Victims are preempted in the new bulk-deletion mechanism after `WaitOnPermit`, + but only because the *whole* gang (or sufficient quorum) was schedulable. + * If the quorum is not met, the preemption is aborted. No victims are deleted. + The gang returns to the queue. + +Read more about the proposal in +[KEP-5710: Workload Aware Preemption](https://github.com/kubernetes/enhancements/pull/5711) PR. + +#### Workload-aware Preemption + +Workload-aware preemption ([KEP-5710](https://kep.k8s.io/5710)) aims to +enable preemption for a whole pod group at once. In the context of this cycle, +it means that if the cycle determines preemption for a single pod is necessary, +it won't run the `PostFilter` phase, but defer that to the end of the scheduling phase, +running a new, single workload-aware preemption step. + +Read more about the proposal in +[KEP-5710: Workload Aware Preemption](https://github.com/kubernetes/enhancements/pull/5711) PR. + +#### Failure Handling + +If a Workload Scheduling Cycle fails (e.g., `minCount` is not met, preemption fails, +or a timeout occurs), the scheduler must handle the failure efficiently. + +1. Rejection + +When the cycle fails, the scheduler rejects the entire group. +* All Pods in the group are moved back to the scheduling queue. +* Crucially, any `.status.nominatedNodeName` entries set during the failed attempt + (or from previous cycles) must be cleared. 
This ensures that the resources + tentatively reserved for this gang are immediately released for other workloads. + +2. Backoff strategy + +A backoff mechanism has to be applied to a pod group, similarly to what we do for individual pods. +For Beta, we will apply the standard Pod backoff logic to the group. + +At the same time, we can consider increasing the default maximum backoff value, +as the current 10 seconds has proven to be too low in larger clusters, +and the same might be the case for workloads. + +3. Retries + +We rely on the existing Queueing Hints mechanism to determine when to retry the gang. +It is considered for a retry when *at least one* member Pod receives a `Queue` hint +(indicating a relevant cluster event, such as a Node addition or Pod deletion, +has made that specific Pod potentially schedulable). + +While checking a single Pod does not guarantee the *whole* gang can fit, +calculating gang-level schedulability inside the event handler can be difficult at the moment. +Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves. ### Test Plan @@ -636,7 +882,7 @@ promoted to the conformance. #### Beta - Providing "optimal enough" placement by considering all pods from a gang together -- Avoiding deadlock scenario when multiple workloads are being scheduled at the same time +- Avoiding livelock scenario when multiple workloads are being scheduled at the same time by kube-scheduler - Implementing delayed preemption to avoid premature preemptions - Workload-aware preemption design to ensure we won't break backward compatibility with it. @@ -720,6 +966,13 @@ This section must be completed when targeting alpha to a release.
- Feature gate name: GangScheduling - Components depending on the feature gate: - kube-scheduler + - Feature gate name: WorkloadSchedulingCycle + - Components depending on the feature gate: + - kube-scheduler + - Feature gate name: WorkloadBasicPolicyDesiredCount + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 217d8053979e..209a57bf373d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -1,13 +1,14 @@ title: Gang Scheduling kep-number: 4671 authors: - - "@erictune" - - "@wojtek-t" - - "@helayoty" - - "@dom4ha" - - "@44past4" - - "@andreyvelich" - - "@thockin" + - "@erictune" + - "@wojtek-t" + - "@helayoty" + - "@dom4ha" + - "@44past4" + - "@andreyvelich" + - "@thockin" + - "@macsko" owning-sig: sig-scheduling participating-sigs: @@ -27,12 +28,12 @@ replaces: # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. -stage: alpha +stage: beta # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.35" +latest-milestone: "v1.36" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: @@ -50,6 +51,9 @@ feature-gates: - name: GangScheduling components: - kube-scheduler + - name: WorkloadSchedulingCycle + components: + - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 9e672be8e04924c80eb47e8ab1c8a5d42dc4040e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 16 Dec 2025 14:27:35 +0000 Subject: [PATCH 02/23] Add a section about basic policy update --- .../4671-gang-scheduling/README.md | 38 +++++++++++++++++++ .../4671-gang-scheduling/kep.yaml | 4 ++ 2 files changed, 42 insertions(+) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 772d752263c4..d4d6864da6cf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -458,6 +458,39 @@ not be split into two. A `LeaderWorkerSet` is a good example of it, where a sing of a single leader and `N` workers and that forms a scheduling (and runtime unit), but workload as a whole may consist of a number of such replicas. +#### Basic Policy Extension + +While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class +of batch workloads that requires best-effort optimization without +the strict blocking semantics of a gang. + +Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy +to accept a `desiredCount` field. This feature will be gated behind a separate +feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. + +```go +// BasicSchedulingPolicy indicates that standard Kubernetes +// scheduling behavior should be used. +type BasicSchedulingPolicy struct { + // DesiredCount is the expected number of pods that will belong to this + // PodGroup. This field is a hint to the scheduler to help it make better + // placement decisions for the group as a whole. 
+ // + // Unlike gang's minCount, this field does not block scheduling. If the number + // of available pods is less than desiredCount, the scheduler can still attempt + // to schedule the available pods, but will optimistically try to select a + // placement that can accommodate the future pods. + // + // +optional + DesiredCount *int32 +} +``` + +This field allows users to express their "true" workloads more easily +and enables the scheduler to optimize the placement of such pod groups by taking the desired state +into account. Ideally, the scheduler should prefer placements that can accommodate +the full `desiredCount`, even if not all pods are created yet. + ### Scheduler Changes The kube-scheduler will be watching for `Workload` objects (using informers) and will use them to map pods @@ -725,6 +758,11 @@ optional. In the `Beta` timeframe, we may opportunistically apply this cycle to "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. +If the `Basic` policy has `desiredCount` set, the Workload Scheduling Cycle +may utilize this value to simulate the full group size during feasibility checks. +Note that the implementation of this specific logic might follow in a Beta stage +of this API field. 
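As an illustration of the idea (not a committed implementation), the feasibility check could size the group with a hypothetical helper like the one below; `BasicPolicy` mirrors the API fragment above, while `simulatedGroupSize` is an assumed name:

```go
package main

import "fmt"

// BasicPolicy mirrors the proposed API fragment; DesiredCount is optional.
type BasicPolicy struct {
	DesiredCount *int32
}

// simulatedGroupSize is a hypothetical helper: when desiredCount exceeds the
// number of Pods observed so far, the Workload Scheduling Cycle can pretend
// the missing Pods already exist, so the chosen placement leaves room for them.
func simulatedGroupSize(observedPods int, p BasicPolicy) int {
	if p.DesiredCount != nil && int(*p.DesiredCount) > observedPods {
		return int(*p.DesiredCount)
	}
	return observedPods
}

func main() {
	want := int32(8)
	// Only 3 of the 8 expected Pods were created so far: simulate all 8.
	fmt.Println(simulatedGroupSize(3, BasicPolicy{DesiredCount: &want})) // 8
	// Without desiredCount, only the observed Pods are considered.
	fmt.Println(simulatedGroupSize(3, BasicPolicy{})) // 3
}
```

Because `desiredCount` is only a hint, the observed Pods always win when they outnumber it; nothing here blocks scheduling the way a gang's `minCount` does.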
+ #### Delayed Preemption A critical requirement for moving Gang Scheduling to Beta is the integration diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 209a57bf373d..12ee0bbc50c8 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -54,6 +54,10 @@ feature-gates: - name: WorkloadSchedulingCycle components: - kube-scheduler + - name: WorkloadBasicPolicyDesiredCount + components: + - kube-apiserver + - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 43b5aa940e1876f1b3000403a661025e7adf5104 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 22 Dec 2025 11:18:51 +0000 Subject: [PATCH 03/23] Remove beta graduation from the PR, extend sections about workload scheduling cycle --- keps/prod-readiness/sig-scheduling/4671.yaml | 2 - .../4671-gang-scheduling/README.md | 84 ++++++++----------- .../4671-gang-scheduling/kep.yaml | 5 +- 3 files changed, 38 insertions(+), 53 deletions(-) diff --git a/keps/prod-readiness/sig-scheduling/4671.yaml b/keps/prod-readiness/sig-scheduling/4671.yaml index 3257880a90d5..17a4b734bff8 100644 --- a/keps/prod-readiness/sig-scheduling/4671.yaml +++ b/keps/prod-readiness/sig-scheduling/4671.yaml @@ -1,5 +1,3 @@ kep-number: 4671 alpha: approver: "@soltysh" -beta: - approver: "@soltysh" diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index d4d6864da6cf..ddc6bcc8112a 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -461,8 +461,7 @@ may consist of a number of such replicas. 
#### Basic Policy Extension While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class -of batch workloads that requires best-effort optimization without -the strict blocking semantics of a gang. +of workloads that requires best-effort optimization without the strict blocking semantics of a gang. Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy to accept a `desiredCount` field. This feature will be gated behind a separate @@ -490,6 +489,8 @@ This field allows users to express their "true" workloads more easily and enables the scheduler to optimize the placement of such pod groups by taking the desired state into account. Ideally, the scheduler should prefer placements that can accommodate the full `desiredCount`, even if not all pods are created yet. +When `desiredCount` is specified, the scheduler can delay scheduling the first Pod it sees +for a short amount of time in order to wait for more Pods to be observed. ### Scheduler Changes @@ -569,9 +570,10 @@ nor did it solve the problem of partial preemption application. For `Beta`, we propose introducing a **Workload Scheduling Cycle**. This mechanism processes all Pods belonging to a single `PodGroup` in one batch, rather than attempting to schedule them individually in isolation using the -traditional pod-by-pod approach. -While this won't fully address the "optimal enough" part of requirement (2), -it ensures that all gang pods are processed together. +traditional pod-by-pod approach. While introduction of this phase itself won't +fully address the "optimal enough" part of requirement (2), +it provides the necessary foundation for applying workload scheduling algorithms +to process the entire gang together. The single scheduling cycle, together with blocking resources using nomination, will address requirement (3). 
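The resource-blocking effect of nominations can be illustrated with a deliberately simplified, single-resource Go sketch. The names (`nominationCache`, `nominate`) and the one-number-per-node capacity model are assumptions for illustration; the real scheduler tracks nominated Pods per node in its internal cache:

```go
package main

import "fmt"

// nominationCache is a toy, in-memory view of free CPU (millicores) per node.
// Nominating a gang Pod onto a node immediately shrinks the capacity observed
// by subsequent scheduling cycles; that is what "blocks" the reserved
// resources until the Pods are bound or the nominations are cleared.
type nominationCache struct {
	freeMilliCPU map[string]int64
}

// nominate records an in-memory reservation for a Pod on a node. It fails,
// leaving the cache untouched, when the node cannot hold the request.
func (c *nominationCache) nominate(node string, milliCPU int64) bool {
	if c.freeMilliCPU[node] < milliCPU {
		return false
	}
	c.freeMilliCPU[node] -= milliCPU
	return true
}

func main() {
	c := &nominationCache{freeMilliCPU: map[string]int64{"node-a": 1000}}
	fmt.Println(c.nominate("node-a", 600)) // true: 400m left on node-a
	fmt.Println(c.nominate("node-a", 600)) // false: blocked by the first nomination
}
```

In this sketch a second gang (or any later cycle) consulting the same cache immediately sees the reduced capacity, which is how a single cycle's nominations prevent two workloads from claiming the same resources.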
@@ -586,27 +588,13 @@ end-to-end Pod scheduling flow, it is planned to place this new phase *before* the standard pod-by-pod scheduling cycle. When the scheduler pops a Pod from the active queue, it checks if that Pod -belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler +belongs to an unscheduled `PodGroup`. If so, the scheduler initiates the Workload Scheduling Cycle. -```md -<<[UNRESOLVED Scope of the Cycle]>> -It is currently unresolved whether the Workload Scheduling Cycle should operate -on the entire `Workload` object (handling all defined PodGroups simultaneously) -or strictly at the `PodGroup` level. - -* PodGroup Level: The cycle processes only the specific `PodGroup` (and replica key) - associated with the popped Pod. This is simpler and aligns with - the Gang Scheduling definition and current implementation. -* Workload Level: The cycle attempts to schedule all PodGroups within the Workload. - This allows for complex dependencies between groups but increases the complexity - and doesn't bring immediate added value. - -*Proposed:* Implement it on PodGroup Level for Beta. However, future migration -to the Workload Level might necessitate non-trivial changes to the phase -introduced by this KEP. -<<[/UNRESOLVED]>> -``` +Since the `PodGroup` instance (defined by the group name and replica key) +is the effective scheduling unit, the Workload Scheduling Cycle will operate +at the `PodGroup` instance level, i.e., each instance will be scheduled separately +in its own cycle. The cycle proceeds as follows: @@ -628,9 +616,9 @@ The cycle proceeds as follows: scheduler's internal cache. Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, - which will respect the nomination. + which will consider the nomination. * If `minCount` cannot be met (even after calculating potential - preemptions), the scheduler rejects the entire group. 
Standard backoff + preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to the scheduling queue. @@ -702,11 +690,13 @@ The list and configuration of plugins used by this algorithm will be the same as 1. The scheduler iterates through the retrieved Pods and groups them into homogeneous sub-groups (using the signatures defined in [KEP-5598](https://kep.k8s.io/5598)). + *This aggregation can be done in the scheduler's cache earlier to optimize performance.* 2. These sub-groups are sorted. Initially, we sort by the highest priority of the sub-group (assuming homogeneity enforces uniform sub-group priority). In the future, sorting may use the size of the sub-group (larger groups first) to tackle the hardest placement problems early. + *This sorting can be done in the scheduler's cache earlier to optimize performance.* 3. The scheduler iterates through the sorted sub-groups. It finds a feasible node for each pod from a sub-group using standard filtering and scoring phases. @@ -719,6 +709,10 @@ The list and configuration of plugins used by this algorithm will be the same as this phase will be replaced by a workload-level algorithm. * If preemption is successful, the pod is nominated on the selected node. * If preemption fails, the pod is considered unscheduled for this cycle. + However, the scheduling of subsequent pods continues as long as + the `minCount` constraint remains satisfiable. The processing can also be + optimized by rejecting all subsequent pods from the same + homogeneous sub-group, as their failed scheduling outcome will be the same. The phase can effectively stop once `minCount` pods have a placement, though attempting to schedule the full group is preferred to maximize utilization. @@ -731,16 +725,18 @@ The list and configuration of plugins used by this algorithm will be the same as nominated nodes in their own, pod-by-pod cycles. 
If a pod selects a different node than its nomination during the individual cycle, the gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). - ```md - <<[UNRESOLVED Pod-by-pod cycle preemption]>> - Should gang pods be allowed to preempt anything in their pod-by-pod cycles? - - *Proposed:* Preemption should be forbidden. Otherwise, it may complicate reasoning - about the workload scheduling cycle and workload-aware preemption. - When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, - and all necessary preemptions will be simulated in the next workload scheduling cycle. - <<[/UNRESOLVED]>> - ``` + + In the pod-by-pod cycle, the preemption made by the workload pods will be forbidden. + Otherwise, it may complicate reasoning about the workload scheduling cycle and workload-aware preemption. + When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, + and all necessary preemptions will be simulated in the next workload scheduling cycle. + + In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. + Allowing it would complicate reasoning about the consistency of the + Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary + (e.g., the nominated node is no longer valid), the gang will time out at `WaitOnPermit` + and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. + * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers and nominations for them are cleared to ensure the other workloads (pod groups) can be attemtped on that place. See *Failure Handling*. @@ -753,7 +749,7 @@ Future features like Topology Aware Scheduling can further improve other subsets #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is -optional. 
In the `Beta` timeframe, we may opportunistically apply this cycle to +optional. In the `Beta` timeframe, this cycle will be applied to `Basic` pod groups to leverage the batching performance benefits, but the "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. @@ -774,15 +770,12 @@ they are deleted immediately. For Gang Scheduling, this behavior is risky and ca when the gang, ultimately, won't fit. Delayed Preemption solves this by separating the *selection* of victims from the *execution* of preemption. -1. During the Workload Scheduling Cycle, the scheduler calculates necessary +1. During the Workload Scheduling Cycle loop, the scheduler calculates necessary preemptions for all Pods in the gang (Step 3 of Scheduling Algorithm). -2. The scheduler nominates the victims for preemption and the gang Pod - for scheduling on their place. This way, the gang can be attempted - without making any intermediate disruptions to the cluster. - * If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod. - Victims are preempted in the new bulk-deletion mechanism after `WaitOnPermit`, - but only because the *whole* gang (or sufficient quorum) was schedulable. +2. At the end of the Workload Scheduling Cycle: + * If the quorum is met, the scheduler actuates the preemptions, + initiating the removal of victims from the cluster. * If the quorum is not met, the preemption is aborted. No victims are deleted. The gang returns to the queue. @@ -1002,9 +995,6 @@ This section must be completed when targeting alpha to a release. 
- kube-apiserver - kube-scheduler - Feature gate name: GangScheduling - - Components depending on the feature gate: - - kube-scheduler - - Feature gate name: WorkloadSchedulingCycle - Components depending on the feature gate: - kube-scheduler - Feature gate name: WorkloadBasicPolicyDesiredCount diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 12ee0bbc50c8..a9c83db4eadf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -28,7 +28,7 @@ replaces: # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. -stage: beta +stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively @@ -51,9 +51,6 @@ feature-gates: - name: GangScheduling components: - kube-scheduler - - name: WorkloadSchedulingCycle - components: - - kube-scheduler - name: WorkloadBasicPolicyDesiredCount components: - kube-apiserver From 8eefcd3edec953cb0e402a3de596d3f1edd32095 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Fri, 2 Jan 2026 15:18:26 +0000 Subject: [PATCH 04/23] Expand queueing alternatives. Add unresolved section about enforcing minCount --- .../4671-gang-scheduling/README.md | 69 +++++++++++++++---- 1 file changed, 56 insertions(+), 13 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index ddc6bcc8112a..0fcf560baf30 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -594,7 +594,9 @@ initiates the Workload Scheduling Cycle. 
Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate at the `PodGroup` instance level, i.e., each instance will be scheduled separately -in its own cycle. +in its own cycle. If new Pods belonging to an already scheduled `PodGroup` instance appear, +they are also processed via the Workload Scheduling Cycle, which takes the previously +scheduled Pods into consideration. The cycle proceeds as follows: @@ -636,31 +638,50 @@ and reduce unnecessary preemption attempts. ```md <<[UNRESOLVED Queue Implementation Strategy]>> -To ensure that we process the pod group (replica) at an appropriate time and +To ensure that we process the `PodGroup` instance at an appropriate time and don't starve other pods (including gang pods in the pod-by-pod scheduling phase) from being scheduled, we need to have a good queueing mechanism for pod groups. There are several alternatives: +Alternative 0 (Keep current queueing and ordering): + +We can minimize changes by retaining the current queueing and ordering logic. +When a Pod is popped, the scheduler can check if it belongs to a `PodGroup` +requiring a Workload Scheduling Cycle. As we add scheduling priorities +for pod groups later, this alternative naturally evolves into Alternative 1. +* *Pros:* Fits the current architecture. Retains current reasoning about the + scheduling queue. Minimizes implementation effort. +* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ + or unschedulablePods and need to be retrieved efficiently. + Makes it hard to further evolve the Workload Scheduling Cycle. + Observability, currently suited for pod-by-pod scheduling, may not + accurately reflect the state of the queue (e.g., pending gangs). + Likely harder to support future extensions and won't work well + if `PodGroup` becomes a separate top-level resource. 
+ The pod group will be likely scheduled based on the highest priority member, + meaning the latter pod-by-pod cycles might be visibly delayed for lower priority Pods. + Alternative 1 (Modify sorting logic): Modify the sorting logic within the existing `PriorityQueue` to put all pods -from a gang group one after another. +from a pod group one after another. * *Pros:* Fits the current architecture. -* *Cons:* Might be problematic when some of the gang's pods are in the +* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ or unschedulablePods and need to be retrieved efficiently. Makes it hard to further evolve the Workload Scheduling Cycle. Would need to inject the workload priority into each of the Pods or somehow apply the lowest pod's priority to the rest of the group. -Alternative 2 (Store a gang representative): +Alternative 2 (Store a PodGroup instance): -Only one "representative" Pod from the gang is allowed in the `activeQ` at a time. -Others are held in a separate internal structure (e.g., a new map inside the queue). -When the representative is popped, it pulls the rest of the gang for the Workload Cycle. -* *Pros:* Makes it easier to obtain all pod group's pods, reduces queue size. -* *Cons:* High complexity in managing the lifecycle of the representative - (e.g., what if the representative Pod is deleted or other changes to the workload happen? - Would need a workload manager to handle all such cases). +Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`. +This allows reusing existing queue logic while extending it to `PodGroups`. +All queued members would be stored in a new dara structure +and retrieved for the Workload Cycle when the `PodGroup` is popped. +* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size. + Reuses current logic for popping, enforcing backoff, and processing unschedulable entities. 
+* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as + queueable entities, which is non-trivial and might clutter the code. Alternative 3 (Dedicated PodGroup queue): @@ -739,7 +760,29 @@ The list and configuration of plugins used by this algorithm will be the same as * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers and nominations for them are cleared to ensure the other workloads (pod groups) - can be attemtped on that place. See *Failure Handling*. + can be attempted on that place. See *Failure Handling*. + +```md +<<[UNRESOLVED Enforcing minCount constraint in algorithm]>> +Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint +is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm +needs to know if this constraint is met to decide whether to commit the results. +We have two ways of verifying this: + +1. Explicit check in the algorithm: Hardcode the `minCount` check within the framework's logic. + This implies that Gang Scheduling becomes a core scheduler framework feature rather than + just a specific plugin. + +2. New Extension Point: Introduce a new extension point allowing plugins to validate the group's + scheduled pods. This would function similarly to a `Permit` check (likely requiring `Reserve` state) + but without the suspension (`WaitOnPermit`) gate. Crucially, this extension should support two checks: + * Validation: Check whether the currently scheduled pods meet the requirements, + e.g., if the `minCount` pods from a pod group was successfully scheduled. + * Feasibility: Given the number of pods that have already failed scheduling in this cycle, + check whether is it still *possible* to meet the constraint. If not, the cycle should abort early + to save time. +<<[/UNRESOLVED]>> +``` While this algorithm might be suboptimal, it is a solid first step for ensuring we have a single-cycle workload scheduling phase. 
As long as PodGroups consist of homogeneous pods, From 38342637ffc6232cb655b85d653a8498ccd6282a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 5 Jan 2026 14:53:44 +0000 Subject: [PATCH 05/23] Apply comments --- .../4671-gang-scheduling/README.md | 21 ++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 0fcf560baf30..61b9b9be52cf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -125,6 +125,7 @@ The following are non-goals for this KEP but will probably soon appear to be goa - Address the problem of premature preemptions in case the higher priority workloads does not eventually schedule. +See [Future plans](#future-plans) for more details. ## Proposal @@ -463,8 +464,9 @@ may consist of a number of such replicas. While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class of workloads that requires best-effort optimization without the strict blocking semantics of a gang. -Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy -to accept a `desiredCount` field. This feature will be gated behind a separate +In the first alpha version of the Workload API, the `Basic` policy was a no-op. +We propose extending the `Basic` policy to accept a `desiredCount` field. +This feature will be gated behind a separate feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. ```go @@ -746,11 +748,9 @@ The list and configuration of plugins used by this algorithm will be the same as nominated nodes in their own, pod-by-pod cycles. If a pod selects a different node than its nomination during the individual cycle, the gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). 
-
-   In the pod-by-pod cycle, the preemption made by the workload pods will be forbidden.
-   Otherwise, it may complicate reasoning about the workload scheduling cycle and workload-aware preemption.
-   When preemption is necessary, the gang will be retried after timing out at WaitOnPermit,
-   and all necessary preemptions will be simulated in the next workload scheduling cycle.
+   The `minCount` check can consider the number of pods that have passed the Workload Scheduling Cycle
+   to ensure that Pods are not waiting unnecessarily when some have been rejected
+   but other new pods have been added to the cluster.
 
    In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden.
    Allowing it would complicate reasoning about the consistency of the
@@ -845,6 +845,7 @@ or a timeout occurs), the scheduler must handle the failure efficiently.
 
 When the cycle fails, the scheduler rejects the entire group.
 
 * All Pods in the group are moved back to the scheduling queue.
+  Their status is updated and an event with the failure reason is sent.
 * Crucially, any `.status.nominatedNodeName` entries set during the failed
   attempt (or from previous cycles) must be cleared. This ensures that the resources
   tentatively reserved for this gang are immediately released for other workloads.
@@ -869,6 +870,12 @@ While checking a single Pod does not guarantee the *whole* gang can fit,
 calculating gang-level schedulability inside the event handler can be difficult
 at the moment. Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves.
 
+It might be beneficial to retry the pod group without being triggered by any cluster event.
+Ideally, this would involve scrambling the pods and subgroups within the group that have the same priority.
+This could be useful because the pods could be scheduled without any cluster changes
+ + ### Test Plan -We will create integration test(s) to ensure basic functionalities of gang-scheduling including: +Initially, we created integration tests to ensure the basic functionalities of gang scheduling including: + - Pods linked to the non-existing workload are not scheduled - Pods get unblocked when workload is created and observed by scheduler - Pods are not scheduled if there is no space for the whole gang + +With Workload Scheduling Cycle and Delayed Preemption features, we will significantly expand test coverage to verify: + +- Pods referencing a `Workload` (both gang and basic policies) are correctly processed via the Workload Scheduling Cycle. +- `PodGroup` queuing ensures that all available members are retrieved and processed correctly. +- Deadlocks and livelocks do not occur when multiple gangs compete for resources or interleave with standard pods. +- Delayed Preemption works correctly for pod-by-pod (non-workload) scheduling. +- Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler determines the entire gang can fit, + otherwise, the cycle aborts with zero disruption. +- Failed pod groups are requeued correctly and retry successfully when resources become available. + +We will also benchmark the performance impact of these changes to measure: -In Beta, we will add tests to verify that deadlocks are not happening. +- The scheduling throughput of the workload scheduling, including gang and basic policies and preemptions. +- The performance impact on standard pod scheduling when there are many nominated pods, + for scenarios mentioned in the [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance). ##### e2e tests From 3045923f6b7afe80df5426e5441fad7d8794d464 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Thu, 15 Jan 2026 12:14:06 +0000 Subject: [PATCH 10/23] Resolve queueing strategy and feasibility plugin. List algorithm limitations. 
Make NNN a hard requirement. Apply comments --- .../4671-gang-scheduling/README.md | 235 +++++++++++------- .../4671-gang-scheduling/kep.yaml | 4 +- 2 files changed, 143 insertions(+), 96 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 8d4aa2699066..e5da086679aa 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -23,10 +23,11 @@ - [North Star Vision](#north-star-vision) - [GangScheduling Plugin](#gangscheduling-plugin) - [Future plans](#future-plans) - - [Scheduler Changes for Beta](#scheduler-changes-for-beta) + - [Scheduler Changes for v1.36](#scheduler-changes-for-beta) - [The Workload Scheduling Cycle](#the-workload-scheduling-cycle) - [Queuing and Ordering](#queuing-and-ordering) - [Scheduling Algorithm](#scheduling-algorithm) + - [Algorithm Limitations](#algorithm-limitations) - [Interaction with Basic Policy](#interaction-with-basic-policy) - [Delayed Preemption](#delayed-preemption) - [Workload-aware Preemption](#workload-aware-preemption) @@ -256,7 +257,7 @@ However, this impact is mitigated by several factors: the overall window of time where these nominations are active is expected to be short enough to prevent severe degradation. -The real impact will be verified hrough scalability tests (scheduler-perf benchmark). +The real impact will be verified through scalability tests (scheduler-perf benchmark). ## Design Details @@ -599,15 +600,15 @@ this problem with existing mechanisms (e.g. reserving resources via NominatedNod However, approval for this KEP is NOT an approval for this vision. We only sketch it to show that we see a viable path forward from the proposed design that will not require significant rework. 
-### Scheduler Changes for Beta +### Scheduler Changes for v1.36 -For the `Alpha` phase, we focused on plumbing the `Workload` API and implementing +For the `Alpha` phase in v1.35, we focused on plumbing the `Workload` API and implementing the `GangScheduling` plugin using simple barriers (`PreEnqueue` and `Permit`). While this satisfied the correctness requirement for "all-or-nothing" scheduling, it did not address performance or efficiency at scale, scheduling livelocks, nor did it solve the problem of partial preemption application. -For `Beta`, we propose introducing a **Workload Scheduling Cycle**. +For v1.36, we propose introducing a **Workload Scheduling Cycle**. This mechanism processes all Pods belonging to a single `PodGroup` in one batch, rather than attempting to schedule them individually in isolation using the traditional pod-by-pod approach. While introduction of this phase itself won't @@ -634,14 +635,19 @@ initiates the Workload Scheduling Cycle. Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate at the `PodGroup` instance level, i.e., each instance will be scheduled separately -in its own cycle. If new Pods belonging to an already scheduled `PodGroup` instance appear, +in its own cycle. + +If new Pods belonging to an already scheduled `PodGroup` instance +(i.e., one that already passed `WaitOnPemit`) appear, they are also processed via the Workload Scheduling Cycle, which takes the previously -scheduled Pods into consideration. +scheduled Pods into consideration. This is done for safety reasons to ensure +the PodGroup-level constraints are still satisfied. However, if the `PodGroup` is being processed, +these new Pods must wait for the ongoing pod group scheduling to be finished, before being considered. The cycle proceeds as follows: -1. The scheduler takes either pod group itself or its Pod representative from - the scheduling queue. 
If the pod group is unscheduled (even partially), it temporarily removes +1. The scheduler takes pod group from the scheduling queue. + If the pod group is unscheduled (even partially), it temporarily removes all group's pods from the queue for processing. The order of processing is determined by the queueing mechanism (see *Queuing and Ordering* below). @@ -658,7 +664,7 @@ The cycle proceeds as follows: scheduler's internal cache. Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, - which will consider the nomination. + which will consider and follow the nomination. * If `minCount` cannot be met (even after calculating potential preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to @@ -676,68 +682,36 @@ One such formula can be to set it to the lowest priority found within the pod gr what will be effectively the weakest link to determine if the whole pod group is schedulable and reduce unnecessary preemption attempts. -```md -<<[UNRESOLVED Queue Implementation Strategy]>> To ensure that we process the `PodGroup` instance at an appropriate time and don't starve other pods (including gang pods in the pod-by-pod scheduling phase) from being scheduled, we need to have a good queueing mechanism for pod groups. -There are several alternatives: -Alternative 0 (Keep current queueing and ordering): +We have decided to make the scheduling queue explicitly workload-aware. +The queue will support queuing `PodGroup` instances alongside individual Pods. -We can minimize changes by retaining the current queueing and ordering logic. -When a Pod is popped, the scheduler can check if it belongs to a `PodGroup` -requiring a Workload Scheduling Cycle. As we add scheduling priorities -for pod groups later, this alternative naturally evolves into Alternative 1. 
-* *Pros:* Fits the current architecture. Retains current reasoning about the - scheduling queue. Minimizes implementation effort. -* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ - or unschedulablePods and need to be retrieved efficiently. - Makes it hard to further evolve the Workload Scheduling Cycle. - Observability, currently suited for pod-by-pod scheduling, may not - accurately reflect the state of the queue (e.g., pending gangs). - Likely harder to support future extensions and won't work well - if `PodGroup` becomes a separate top-level resource. - The pod group will be likely scheduled based on the highest priority member, - meaning the latter pod-by-pod cycles might be visibly delayed for lower priority Pods. +1. When Pods belonging to a `PodGroup` are added to the scheduler and pass the `PreEnqueue`, + they are initially stored in a dedicated internal data structure (tentatively named `workloadPods`) + rather than the standard active queue. -Alternative 1 (Modify sorting logic): +2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), + a `QueuedPodGroupInfo` object (analogous to `QueuedPodInfo`) is created + and injected into the main scheduling queue. -Modify the sorting logic within the existing `PriorityQueue` to put all pods -from a pod group one after another. -* *Pros:* Fits the current architecture. -* *Cons:* Might be problematic when some of the pod groups's pods are in the - backoffQ or unschedulablePods and need to be retrieved efficiently. - Makes it hard to further evolve the Workload Scheduling Cycle. - Would need to inject the workload priority into each of the Pods - or somehow apply the lowest pod's priority to the rest of the group. +3. The `scheduleOne` loop will pop the highest-priority item from the queue, + which may now be either a single Pod (triggering the standard cycle) + or a `PodGroup` (triggering the Workload Scheduling Cycle). 
-Alternative 2 (Store a PodGroup instance): +4. During a Workload Scheduling Cycle, all member Pods are retrieved from `workloadPods`. + Based on the cycle's outcome: + * **Success:** Pods are moved to the standard `activeQ` (with nominations set) + to proceed to the pod-by-pod scheduling soon. + * **Failure/Preemption:** Pods are returned to `workloadPods` or the unschedulable queue. + The `PodGroup` enters a backoff state and is eligible for retry only when + a relevant cluster event wakes up at least one of its member pods. -Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`. -This allows reusing existing queue logic while extending it to `PodGroups`. -All queued members would be stored in a new dara structure -and retrieved for the Workload Cycle when the `PodGroup` is popped. -* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size. - Reuses current logic for popping, enforcing backoff, and processing unschedulable entities. -* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as - queueable entities, which is non-trivial and might clutter the code. - -Alternative 3 (Dedicated PodGroup queue): - -Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. -The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. -Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. -* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. - Can report dedicated logs and metrics with less confusion to the user. -* *Cons:* Significant and non-trivial architectural change to the scheduling queue - and `scheduleOne` loop. - -*Proposed:* Alternative 3 (Dedicated PodGroup queue). While this requires architectural change to the scheduling queue, -the effort involved in adding pod group queuing will be comparable to modifying the code for the previous alternatives. 
-This will also make the foundation for future WAS features and support workload priority by design.
-<<[/UNRESOLVED]>>
-```
+While this represents a significant architectural change to the scheduling
+queue and `scheduleOne` loop, it provides a clean separation of concerns and
+establishes a necessary foundation for future Workload Aware Scheduling features.
 
 #### Scheduling Algorithm
 
@@ -756,7 +730,8 @@ The list and configuration of plugins used by this algorithm will be the same as
 2. These sub-groups are sorted. Initially, we sort by the highest priority of the
    sub-group (assuming homogeneity enforces uniform sub-group priority).
    In the future, sorting may use the size of the sub-group (larger groups first) to
-   tackle the hardest placement problems early.
+   tackle the hardest placement problems early. Crucially, the ordering should be deterministic
+   and stable if the pod group state doesn't change.
    *This sorting can be done in the scheduler's cache earlier to optimize performance.*
 
 3. The scheduler iterates through the sorted sub-groups. It finds a feasible node
@@ -805,10 +780,10 @@ The list and configuration of plugins used by this algorithm will be the same as
    pushed directly to the active queue, and will soon attempt to be scheduled on their
    nominated nodes in their own, pod-by-pod cycles.
 
-   If a pod selects a different node than its nomination during the individual cycle, the
-   gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`).
-   The `minCount` check can consider the number of pods that have passed the Workload Scheduling Cycle
-   to ensure that Pods are not waiting unnecessarily when some have been rejected
+   The Pod will be restricted to its nominated node during the individual cycle.
+   If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that
+   into consideration. 
The `minCount` check can consider the number of pods that have passed + the Workload Scheduling Cycle to ensure that Pods are not waiting unnecessarily when some have been rejected but other new pods have been added to the cluster. In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. @@ -823,45 +798,58 @@ The list and configuration of plugins used by this algorithm will be the same as and nominations for them are cleared to ensure the other workloads (pod groups) can be attempted on that place. See *Failure Handling*. -```md -<<[UNRESOLVED Enforcing minCount constraint in algorithm]>> -Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint -is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm -needs to know if this constraint is met to decide whether to commit the results. -We have two ways of verifying this: - -1. Explicit check in the algorithm: Hardcode the `minCount` check within the framework's logic. - This implies that Gang Scheduling becomes a core scheduler framework feature rather than - just a specific plugin. - -2. New Extension Point: Introduce a new extension point allowing plugins to validate the group's - scheduled pods. This would function similarly to a `Permit` check (likely requiring `Reserve` state) + Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint + is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm + needs to know if this constraint is met to decide whether to commit the results. + To verify this, a new extension point will be introduced, allowing plugins to validate the group's + scheduled pods. This will function similarly to a `Permit` check (likely requiring `Reserve` state) but without the suspension (`WaitOnPermit`) gate. 
Crucially, this extension should support two checks:
+
 * Validation: Check whether the currently scheduled pods meet the requirements,
 e.g., if the `minCount` pods from a pod group were successfully scheduled.
+
 * Feasibility: Given the number of pods that have already failed scheduling in this cycle,
 check whether it is still *possible* to meet the constraint. If not, the cycle should abort early
 to save time.
-<<[/UNRESOLVED]>>
-```
 
 While this algorithm might be suboptimal, it is a solid first step for ensuring we have
 a single-cycle workload scheduling phase. As long as PodGroups consist of homogeneous pods,
 opportunistic batching itself will provide significant improvements.
 Future features like Topology Aware Scheduling can further improve other subsets of use cases.
 
-Moreover, this default algorithm relies on specific sorting and may fail to find
+#### Algorithm Limitations
+
+The default algorithm proposed above relies on specific sorting and may fail to find
 a valid placement that could have been discovered by processing the group's pods
 in a different order. While resolving this limitation could be desirable,
 implementing a generalized solver for arbitrary constraints would introduce
 excessive complexity for the default implementation.
 The current proposal addresses the vast majority of standard use cases
-(homogeneous workloads). Future improvements for this should be delivered via specialized algorithms
-based on specific `PodGroup` constraints, such as Topology Aware Scheduling (TAS).
+(specifically homogeneous workloads). Future improvements for this should be delivered
+via specialized algorithms based on specific pod group constraints,
+such as Topology Aware Scheduling (TAS). 
+ +Since the scheduler cannot exhaustively analyze all possible placement permutations, +we will advise users via documentation regarding which pod group types +are well-supported and which scenarios are handled on a +best-effort basis (where a successful placement is not guaranteed, even if +one theoretically exists). + +In particular: +* For basic **homogeneous** pod groups without inter-pod dependencies, this + algorithm is expected to find a placement whenever one exists. +* For **heterogeneous** pod groups, finding a valid placement is not guaranteed. +* For pod groups with **inter-pod dependencies** (e.g., affinity/anti-affinity + or topology spreading rules), finding a valid placement is not guaranteed. + +Moreover, if a pod using these features is rejected by the Workload Scheduling Cycle, +its rejection message (exposed via Pod status) will explicitly indicate +that the rejection may be due to the use of features for which finding an existing +placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason. #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is -optional. In the `Beta` timeframe, this cycle will be applied to +optional. In the v1.36 timeframe, this cycle will be applied to `Basic` pod groups to leverage the batching performance benefits, but the "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. @@ -973,12 +961,14 @@ When the cycle fails, the scheduler rejects the entire group. 2. Backoff strategy Backoff mechanism has to be applied for a pod group similarly as we do for individual pods. -For Beta, we will apply the standard Pod backoff logic to the group. +Initially, we will apply the standard Pod backoff logic to the group. 
At the same time, we should consider increasing the maximum backoff duration for pod groups
+or potentially scaling it based on the number of pods within the group.
 The current default of 10 seconds has proven insufficient in large clusters,
-so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle can take a significant
-amount of time, retrying it too frequently risks starving individual pods.
+so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle
+can be computationally expensive, retrying it too frequently risks starving individual pods.
+Moreover, retries triggered by the Delayed Preemption feature may further strenghten the problem.

 3. Retries

@@ -991,10 +981,9 @@ While checking a single Pod does not guarantee the *whole* gang can fit,
 calculating gang-level schedulability inside the event handler can be difficult at the moment.
 Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves.

-It might be beneficial to retry the pod group without being triggered by any cluster event.
-Ideally, this would involve scrambling the pods and subgroups within the group that have the same priority.
-This could be useful because the pods could be scheduled without any cluster changes
-when considered in a different order.
+It might be beneficial to retry the pod group without being triggered by any cluster event,
+because a single Workload Scheduling Cycle cannot determine that a placement doesn't really exist,
+especially for heterogeneous workloads or inter-pod dependencies.

 ### Test Plan

@@ -1457,6 +1446,8 @@ However:

 ## Alternatives

+### API
+
 The longer version of this design describing the whole thought process of choosing the above
described approach can be found in the [extended proposal] document.
@@ -1516,6 +1507,62 @@ type PodGroup struct {
 }
 ```

+### Pod group queueing in scheduler
+
+In selecting the optimal pod group queuing mechanism, we evaluated several alternatives:
+
+Alternative 0 (Keep current queueing and ordering):
+
+We can minimize changes by retaining the current queueing and ordering logic.
+When a Pod is popped, the scheduler can check if it belongs to a `PodGroup`
+requiring a Workload Scheduling Cycle. As we add scheduling priorities
+for pod groups later, this alternative naturally evolves into Alternative 1.
+* *Pros:* Fits the current architecture. Retains current reasoning about the
+  scheduling queue. Minimizes implementation effort.
+* *Cons:* Might be problematic when some of the pod group's pods are in the backoffQ
+  or unschedulablePods and need to be retrieved efficiently.
+  Makes it hard to further evolve the Workload Scheduling Cycle.
+  Observability, currently suited for pod-by-pod scheduling, may not
+  accurately reflect the state of the queue (e.g., pending gangs).
+  Likely harder to support future extensions and won't work well
+  if `PodGroup` becomes a separate top-level resource.
+  The pod group will likely be scheduled based on the highest priority member,
+  meaning the later pod-by-pod cycles might be visibly delayed for lower priority Pods.
+
+Alternative 1 (Modify sorting logic):
+
+Modify the sorting logic within the existing `PriorityQueue` to put all pods
+from a pod group one after another.
+* *Pros:* Fits the current architecture.
+* *Cons:* Might be problematic when some of the pod group's pods are in the
+  backoffQ or unschedulablePods and need to be retrieved efficiently.
+  Makes it hard to further evolve the Workload Scheduling Cycle.
+  Would need to inject the workload priority into each of the Pods
+  or somehow apply the lowest pod's priority to the rest of the group.
+
+Alternative 2 (Store a PodGroup instance):
+
+Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`.
+This allows reusing existing queue logic while extending it to `PodGroups`.
+All queued members would be stored in a new data structure
+and retrieved for the Workload Cycle when the `PodGroup` is popped.
+* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size.
+  Reuses current logic for popping, enforcing backoff, and processing unschedulable entities.
+* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as
+  queueable entities, which is non-trivial and might clutter the code.
+
+Alternative 3 (Dedicated PodGroup queue):
+
+Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods.
+The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp.
+Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`.
+* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority.
+  Can report dedicated logs and metrics with less confusion to the user.
+* *Cons:* Significant and non-trivial architectural change to the scheduling queue
+  and `scheduleOne` loop.
+
+Ultimately, Alternative 3 (Dedicated PodGroup queue) was chosen as the best long-term solution.
+

 ## Infrastructure Needed (Optional)

@@ -968,7 +970,7 @@ or potentially scaling it based on the number of pods within the group.
 The current default of 10 seconds has proven insufficient in large clusters,
 so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle
 can be computationally expensive, retrying it too frequently risks starving individual pods.
-Moreover, retries triggered by the Delayed Preemption feature may further strenghten the problem.
+Moreover, retries triggered by the Delayed Preemption feature may further exacerbate the problem.

 3.
Retries From 581613b9373b1640daf3e1b6fa88a5ef88c9001f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Fri, 16 Jan 2026 15:27:04 +0000 Subject: [PATCH 12/23] Apply review comments --- .../4671-gang-scheduling/README.md | 18 ++++++++++++------ .../4671-gang-scheduling/kep.yaml | 4 ++-- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 955e32c1ed55..78e00f79823d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -628,11 +628,8 @@ this will address requirement (5). We introduce a new phase in the main scheduling loop (`scheduleOne`). In the end-to-end Pod scheduling flow, it is planned to place this new phase *before* -the standard pod-by-pod scheduling cycle. - -When the scheduler pops a Pod from the active queue, it checks if that Pod -belongs to an unscheduled `PodGroup`. If so, the scheduler -initiates the Workload Scheduling Cycle. +the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from +the active queue, it initiates the Workload Scheduling Cycle. Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate @@ -644,7 +641,8 @@ If new Pods belonging to an already scheduled `PodGroup` instance they are also processed via the Workload Scheduling Cycle, which takes the previously scheduled Pods into consideration. This is done for safety reasons to ensure the PodGroup-level constraints are still satisfied. However, if the `PodGroup` is being processed, -these new Pods must wait for the ongoing pod group scheduling to be finished, before being considered. +these new Pods must wait for the ongoing pod group scheduling to be finished (pass `WaitOnPermit`), +before being considered. 
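The pop-and-dispatch flow described in the hunk above (the `scheduleOne` loop popping either a Pod or a `PodGroup` and routing it to the matching cycle) can be sketched roughly as follows; the toy types below are illustrative and are not the real `QueuedPodInfo`/`QueuedPodGroupInfo` scheduler types:

```go
package main

import "fmt"

// queueItem is a stand-in for whatever the scheduling queue pops.
type queueItem interface{ name() string }

type queuedPod struct{ pod string }

func (p queuedPod) name() string { return p.pod }

type queuedPodGroup struct{ group string }

func (g queuedPodGroup) name() string { return g.group }

// scheduleOne routes the popped item: PodGroups go through the Workload
// Scheduling Cycle first; plain Pods take the standard pod-by-pod cycle.
func scheduleOne(item queueItem) string {
	switch it := item.(type) {
	case queuedPodGroup:
		return "workload scheduling cycle for group " + it.group
	case queuedPod:
		return "pod-by-pod cycle for pod " + it.pod
	default:
		return "unknown item"
	}
}

func main() {
	fmt.Println(scheduleOne(queuedPodGroup{group: "training-job"}))
	fmt.Println(scheduleOne(queuedPod{pod: "web-0"}))
}
```

After a successful Workload Scheduling Cycle, the group's members would still flow through the pod-by-pod path individually, restricted to their nominated nodes.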
The cycle proceeds as follows:

@@ -717,6 +715,10 @@ establishes a necessary foundation for future Workload Aware Scheduling features

 #### Scheduling Algorithm

+*Note: The algorithm described below is a simplified default version based on baseline scheduling logic.
+It is expected to evolve to more effectively handle complex scenarios and specific features
+in future iterations.*
+
 The internal algorithm for placing the group utilizes the optimization defined in
 *Opportunistic Batching* ([KEP-5598](https://kep.k8s.io/5598)) for improved performance.
 The approach described below allows mitigating some restrictions of that feature, e.g.,

@@ -921,6 +923,8 @@ We will address it with what we call *delayed preemption* mechanism as following
   In other words, a different placement can be chosen in subsequent (workload) scheduling cycles
   only if it doesn't require additional preemptions or the previously chosen placement is no longer
   feasible (e.g. because higher priority pods were scheduled in the meantime).
+  This can be done by ignoring the pods with `deletionTimestamp` set in these preemption attempts
+  (when the previous preemption is ongoing for the preemptor).

 The rationale behind the above design is to maintain the current scheduling property where preemption
 doesn't result in a commitment for a particular placement. If a different possible placement appears

@@ -986,6 +990,8 @@ Therefore, we optimistically retry the Workload Scheduling Cycle if any member's

 It might be beneficial to retry the pod group without being triggered by any cluster event,
 because a single Workload Scheduling Cycle cannot determine that a placement doesn't really exist,
 especially for heterogeneous workloads or inter-pod dependencies.
+To avoid introducing subtle errors in the initial implementation,
+we can start by skipping the Queueing Hints mechanism and relying solely on the backoff time.
### Test Plan

diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
index 919ce6230533..0659f2a990d3 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
+++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
@@ -38,8 +38,8 @@ latest-milestone: "v1.36"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.35"
-  beta: "v1.37"
-  stable: "v1.39"
+  beta: "v1.36"
+  stable: "v1.38"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled

From e83f6d514b0f660e95cbb72f28eabc94623c0cfa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Mon, 19 Jan 2026 14:12:11 +0000
Subject: [PATCH 13/23] Apply comments

---
 .../4671-gang-scheduling/README.md | 39 +++++++++++++------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 78e00f79823d..7334fe81ff6b 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -642,7 +642,8 @@
 they are also processed via the Workload Scheduling Cycle, which takes the previously
 scheduled Pods into consideration. This is done for safety reasons to ensure the PodGroup-level
 constraints are still satisfied. However, if the `PodGroup` is being processed,
 these new Pods must wait for the ongoing pod group scheduling to be finished (pass `WaitOnPermit`),
-before being considered.
+before being considered. This simplifies preemption, as we can be sure the decision won't change
+while the previous attempt hasn't finished yet.
The queue will support queuing `PodGroup` instances alongside individual Pods. -1. When Pods belonging to a `PodGroup` are added to the scheduler and pass the `PreEnqueue`, - they are initially stored in a dedicated internal data structure (tentatively named `workloadPods`) - rather than the standard active queue. +1. When Pods belonging to a `PodGroup` are added to the scheduler, if a corresponding `QueuedPodGroupInfo` + is not yet present in the scheduling queue, it is created and enqueued. + This object will have an aggregated `PreEnqueue` check, evaluating conditions for all its members. + Crucially, the individual Pods themselves are **not** stored in any standard scheduling queue + data structure (active, backoff, or unschedulable) at this stage, but they are effectively managed + via the `QueuedPodGroupInfo`. 2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), - a `QueuedPodGroupInfo` object (analogous to `QueuedPodInfo`) is created - and injected into the main scheduling queue. + a `QueuedPodGroupInfo` object is moved to the activeQ, following the logic similar to individual pods. 3. The `scheduleOne` loop will pop the highest-priority item from the queue, which may now be either a single Pod (triggering the standard cycle) or a `PodGroup` (triggering the Workload Scheduling Cycle). -4. During a Workload Scheduling Cycle, all member Pods are retrieved from `workloadPods`. +4. During a Workload Scheduling Cycle, all member Pods are retrieved from the `QueuedPodGroupInfo`. Based on the cycle's outcome: * **Success:** Pods are moved to the standard `activeQ` (with nominations set) to proceed to the pod-by-pod scheduling soon. - * **Failure/Preemption:** Pods are returned to `workloadPods` or the unschedulable queue. - The `PodGroup` enters a backoff state and is eligible for retry only when - a relevant cluster event wakes up at least one of its member pods. 
+ * **Failure/Preemption:** The `QueuedPodGroupInfo` (containing the unschedulable pods) is returned + to the `unschedulablePodInfos` structure. The `PodGroup` enters a backoff state and is eligible + for retry only when a relevant cluster event wakes up at least one of its member pods. While this represents a significant architectural change to the scheduling queue and `scheduleOne` loop, it provides a clean separation of concerns and @@ -850,6 +853,16 @@ its rejection message (exposed via Pod status) will explicitly indicate that the rejection may be due to the use of features for which finding an existing placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason. +In addition to the above, for cases involving **intra-group dependencies** +(e.g., when the schedulability of one pod depends on another group member via inter-pod affinity), +this algorithm may fail to find a placement regardless of cluster state, +due to the deterministic processing order. + +Users will be advised that such dependencies are discouraged. However, they could mitigate this +by assigning a lower priority to the dependent pods. Since the algorithm processes higher-priority +pods first, this ensures that the required pods are scheduled earlier, +to satisfy the affinity rules of the subsequent dependent pods. + #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is @@ -892,6 +905,10 @@ We will address it with what we call *delayed preemption* mechanism as following `Preempt`) that will be responsible for actuation. However, for now we don't see evidence for this being needed. + Relying on the actuation logic is optional for plugins. For example, + the DynamicResources plugin can still actuate its decision (claim deallocation) in the PostFilter phase. + However, any pod-based removals in other plugins should be delegated to the delayed actuation phase. + 3. 
For individual pods (not being part of a workload), we will adjust the scheduling framework
 implementation of `schedulingCycle` to actuate preemptions of returned victims if calling
 `PostFilter` plugins resulted in finding a feasible placement.

@@ -958,7 +975,7 @@ or a timeout occurs), the scheduler must handle the failure efficiently.

 1. Rejection

 When the cycle fails, the scheduler rejects the entire group.
-* All Pods in the group are moved back to the scheduling queue.
+* All Pods in the group are moved back to the scheduling queue (stored in the `unschedulablePodGroups` data structure).
 Their status is updated and an event with the failure reason is sent.
 * Crucially, any `.status.nominatedNodeName` entries set during the failed
 attempt (or from previous cycles) must be cleared. This ensures that the resources

From f68d82b753b59404fb4661daa9fa2a8c3ef87d53 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Tue, 20 Jan 2026 08:39:33 +0000
Subject: [PATCH 14/23] Apply comments

---
 keps/sig-scheduling/4671-gang-scheduling/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 7334fe81ff6b..1c69bfcc79cc 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -851,7 +851,11 @@ In particular:

 Moreover, if a pod using these features is rejected by the Workload Scheduling Cycle,
 its rejection message (exposed via Pod status) will explicitly indicate
 that the rejection may be due to the use of features for which finding an existing
-placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason.
+placement cannot be guaranteed. This will be accompanied by a specific failure
+reason, distinguishing it from a generic `Unschedulable` condition.
This distinction +is particularly relevant for Cluster Autoscaler or Karpenter, which can act +differently based on the new reason. In addition to the above, for cases involving **intra-group dependencies** (e.g., when the schedulability of one pod depends on another group member via inter-pod affinity), From 97557b5f439ee581d1e420b3f84a1557d8e5a6b6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 08:16:33 +0000 Subject: [PATCH 15/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 1c69bfcc79cc..1a4e4bd79ddd 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -784,8 +784,8 @@ The list and configuration of plugins used by this algorithm will be the same as but cannot cause additional disruption to do so. * If preemptions are not needed: Pods are nominated to their chosen nodes, - pushed directly to the active queue, and will soon attempt to be scheduled - on their nominated nodes in their own, pod-by-pod cycles. + pushed directly to the active queue in the order they were evaluated in the Workload Scheduling Cycle. + They will soon attempt to be scheduled on their nominated nodes in their own, pod-by-pod cycles. Pod will be restricted to its nominated node during the individual cycle. If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that @@ -806,11 +806,10 @@ The list and configuration of plugins used by this algorithm will be the same as can be attempted on that place. See *Failure Handling*. Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint - is enforced at the plugin level. 
However, the proposed Workload Scheduling Cycle algorithm
+  is enforced at the plugin level. The proposed Workload Scheduling Cycle algorithm
   needs to know if this constraint is met to decide whether to commit the results.
-  To verify this, a new extension point will be introduced, allowing plugins to validate the group's
-  scheduled pods. This will function similarly to a `Permit` check (likely requiring `Reserve` state)
-  but without the suspension (`WaitOnPermit`) gate. Crucially, this extension should support two checks:
+  To achieve this, we will reuse the existing `Permit` extension point,
+  but without the suspension phase (`WaitOnPermit`). Crucially, this check has to support two modes:
   * Validation: Check whether the currently scheduled pods meet the requirements,
     e.g., whether at least `minCount` pods from a pod group were successfully scheduled.
   * Feasibility: Given the number of pods that have already failed scheduling in this cycle,
     check whether it is still *possible* to meet the constraint.
     If not, the cycle should abort early to save time.
-
+
 While this algorithm might be suboptimal, it is a solid first step for ensuring
 we have a single-cycle workload scheduling phase. As long as PodGroups consist
 of homogeneous pods, opportunistic batching itself will provide significant
 improvements.
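Using the two modes described above, the cycle can stop as soon as the constraint becomes unsatisfiable instead of evaluating the rest of the group. The toy sketch below is not the framework code; all names are made up for illustration:

```go
package main

import "fmt"

// tryScheduleGroup walks the group's pods in order and aborts as soon as
// minCount can no longer be reached. schedule returns the chosen node,
// or "" if the pod is unschedulable.
func tryScheduleGroup(pods []string, minCount int, schedule func(string) string) bool {
	failed := 0
	for _, p := range pods {
		if schedule(p) == "" {
			failed++
			// Feasibility mode: abort early when minCount is out of reach.
			if len(pods)-failed < minCount {
				return false
			}
		}
	}
	// Validation mode: enough pods actually got a placement.
	return len(pods)-failed >= minCount
}

func main() {
	// Toy "cluster" with room for two pods only; minCount of 3 cannot be met.
	free := 2
	schedule := func(pod string) string {
		if free > 0 {
			free--
			return "node-a"
		}
		return ""
	}
	fmt.Println(tryScheduleGroup([]string{"p0", "p1", "p2", "p3"}, 3, schedule))
}
```

On failure, no results would be committed and no preemptions actuated, matching the all-or-nothing semantics described in this section.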
From 592b33f1a10e7c838b979ec8f332dfeca0efeeb7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 09:45:16 +0000 Subject: [PATCH 16/23] Remove Basic policy desiredCount from the KEP scope --- .../4671-gang-scheduling/README.md | 50 ++----------------- .../4671-gang-scheduling/kep.yaml | 4 -- 2 files changed, 3 insertions(+), 51 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 1a4e4bd79ddd..ed9a03b7b8bb 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -500,41 +500,6 @@ not be split into two. A `LeaderWorkerSet` is a good example of it, where a sing of a single leader and `N` workers and that forms a scheduling (and runtime unit), but workload as a whole may consist of a number of such replicas. -#### Basic Policy Extension - -While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class -of workloads that requires best-effort optimization without the strict blocking semantics of a gang. - -In the first alpha version of the Workload API, the `Basic` policy was a no-op. -We propose extending the `Basic` policy to accept a `desiredCount` field. -This feature will be gated behind a separate -feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. - -```go -// BasicSchedulingPolicy indicates that standard Kubernetes -// scheduling behavior should be used. -type BasicSchedulingPolicy struct { - // DesiredCount is the expected number of pods that will belong to this - // PodGroup. This field is a hint to the scheduler to help it make better - // placement decisions for the group as a whole. - // - // Unlike gang's minCount, this field does not block scheduling. 
If the number - // of available pods is less than desiredCount, the scheduler can still attempt - // to schedule the available pods, but will optimistically try to select a - // placement that can accommodate the future pods. - // - // +optional - DesiredCount *int32 -} -``` - -This field allows users to express their "true" workloads more easily -and enables the scheduler to optimize the placement of such pod groups by taking the desired state -into account. Ideally, the scheduler should prefer placements that can accommodate -the full `desiredCount`, even if not all pods are created yet. -When `desiredCount` is specified, the scheduler can delay scheduling the first Pod it sees -for a short amount of time in order to wait for more Pods to be observed. - ### Scheduler Changes The kube-scheduler will be watching for `Workload` objects (using informers) and will use them to map pods @@ -796,9 +761,9 @@ The list and configuration of plugins used by this algorithm will be the same as In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. Allowing it would complicate reasoning about the consistency of the Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary - (e.g., the nominated node is no longer valid), the gang will either time out - or be instantly rejected (when the `minCount` cannot be satisfied) at `WaitOnPermit` and all necessary preemptions - will be simulated again in the next Workload Scheduling Cycle. + (e.g., the nominated node is no longer valid), the gang will either be instantly rejected + (when the `minCount` cannot be satisfied) or time out (safety check) at `WaitOnPermit` + and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated during this cycle are discarded. Pods go through traditional failure handlers @@ -874,11 +839,6 @@ optional. 
In the v1.36 timeframe, this cycle will be applied to "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. -If the `Basic` policy has `desiredCount` set, the Workload Scheduling Cycle -may utilize this value to simulate the full group size during feasibility checks. -Note that the implementation of this specific logic might follow in a Beta stage -of this API field. - #### Delayed Preemption A critical requirement for moving Gang Scheduling to Beta is the integration with *Delayed Preemption*, @@ -1203,10 +1163,6 @@ This section must be completed when targeting alpha to a release. - Feature gate name: DelayedPreemption - Components depending on the feature gate: - kube-scheduler - - Feature gate name: WorkloadBasicPolicyDesiredCount - - Components depending on the feature gate: - - kube-apiserver - - kube-scheduler - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 0659f2a990d3..c945bcd66d79 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -54,10 +54,6 @@ feature-gates: - name: DelayedPreemption components: - kube-scheduler - - name: WorkloadBasicPolicyDesiredCount - components: - - kube-apiserver - - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 287ec804b53e40c990db503a484ebbba5a1cb62c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 10:57:12 +0000 Subject: [PATCH 17/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index ed9a03b7b8bb..4ff27cbebe9f 100644 --- 
a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -585,7 +585,7 @@ to process the entire gang together. The single scheduling cycle, together with blocking resources using nomination, will address requirement (3). -We will also introduce delayed preemption (described in [KEP-5710](https://kep.k8s.io/5711)). +We will also introduce [Delayed Preemption](#delayed-preemption). Together with the introduction of a dedicated Workload Scheduling Cycle, this will address requirement (5). From b6c6d4c47636e0c5b92744bc72ad88bc38411fc5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 11:33:41 +0000 Subject: [PATCH 18/23] Update toc --- keps/sig-scheduling/4671-gang-scheduling/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 4ff27cbebe9f..7755ab2446fc 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -18,7 +18,6 @@ - [Naming](#naming) - [Associating Pod into PodGroups](#associating-pod-into-podgroups) - [API](#api) - - [Basic Policy Extension](#basic-policy-extension) - [Scheduler Changes](#scheduler-changes) - [North Star Vision](#north-star-vision) - [GangScheduling Plugin](#gangscheduling-plugin) From bcc4ade5aecbb54696aaa6a308e1b7c31fddfcf0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Wed, 28 Jan 2026 17:25:17 +0000 Subject: [PATCH 19/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 7755ab2446fc..52787522c384 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -237,7 +237,7 @@ 
usecases. You can read more about it in the [extended proposal] document. #### NominatedNodeName impact on filtering performance Using `.status.nominatedNodeName` as an output of the Workload Scheduling Cycle -can impact the performance of the standard pod-by-pod scheduling cycle. +can impact the performance of the standard pod-by-pod scheduling cycle for all other pods. Whenever the scheduler filters a node, it must temporarily add nominated pods (with equal or higher priority) to the cached NodeInfo. In large clusters, the number of such operations multiplied by the scheduling throughput can yield to a visible overhead. @@ -248,6 +248,8 @@ having to consider such nomination also increases. However, this impact is mitigated by several factors: * Nominations are temporary. As soon as workload-scheduled pods pass their individual scheduling cycle and are assumed, what cleans the in-memory nominations. +* In case the nominations are no longer feasible, + the gang gets rejected as soon as the scheduler determines this. * For the workload pods themselves, the performance impact is negligible. They will typically only execute filters for the single node they are nominated to, rather than evaluating the entire cluster. @@ -759,9 +761,9 @@ The list and configuration of plugins used by this algorithm will be the same as In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. Allowing it would complicate reasoning about the consistency of the - Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary + Workload Scheduling Cycle and Workload-Aware Preemption. 
If preemption is necessary
   (e.g., the nominated node is no longer valid), the gang will either be instantly rejected
-  (when the `minCount` cannot be satisfied) or time out (safety check) at `WaitOnPermit`
+  (when the `minCount` cannot be satisfied) or time out (safety check, in case a bug appears) at `WaitOnPermit`
   and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle.

 * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated
 during this cycle are discarded. Pods go through traditional failure handlers

@@ -1031,10 +1033,11 @@ With Workload Scheduling Cycle and Delayed Preemption features, we will signific
 - Pods referencing a `Workload` (both gang and basic policies) are correctly processed via the Workload Scheduling Cycle.
 - `PodGroup` queuing ensures that all available members are retrieved and processed correctly.
 - Deadlocks and livelocks do not occur when multiple gangs compete for resources or interleave with standard pods.
-- Delayed Preemption works correctly for pod-by-pod (non-workload) scheduling.
+- Delayed Preemption feature doesn't break pod-by-pod (non-workload) scheduling.
 - Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler
 determines the entire gang can fit; otherwise, the cycle aborts with zero disruption.
 - Failed pod groups are requeued correctly and retry successfully when resources become available.
+- Gang is rejected if pod-by-pod scheduling cannot follow a nomination. All other nominations should also be cleared.
We will also benchmark the performance impact of these changes to measure: From 69fe3361c49826b231595472e68642b3c78f89db Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 2 Feb 2026 14:30:42 +0000 Subject: [PATCH 20/23] Update the KEP with a decision to skip pod-by-pod scheduling phase after workload cycle --- .../4671-gang-scheduling/README.md | 95 ++++++------------- 1 file changed, 31 insertions(+), 64 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 52787522c384..c98709efde3d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -234,33 +234,20 @@ We try to mitigate it by an extensive analysis of usecases and already sketching how we envision the direction in which the API will need to evolve to support further usecases. You can read more about it in the [extended proposal] document. -#### NominatedNodeName impact on filtering performance - -Using `.status.nominatedNodeName` as an output of the Workload Scheduling Cycle -can impact the performance of the standard pod-by-pod scheduling cycle for all other pods. -Whenever the scheduler filters a node, it must temporarily add nominated pods -(with equal or higher priority) to the cached NodeInfo. In large clusters, -the number of such operations multiplied by the scheduling throughput can yield to a visible overhead. -If the latency between the end of the Workload Scheduling Cycle -and the actual processing of those pods is high, the number of unrelated pods -having to consider such nomination also increases. - -However, this impact is mitigated by several factors: -* Nominations are temporary. As soon as workload-scheduled pods pass - their individual scheduling cycle and are assumed, what cleans the in-memory nominations. 
-* In case the nominations are no longer feasible, - the gang gets rejected as soon as the scheduler determines this. -* For the workload pods themselves, the performance impact is negligible. - They will typically only execute filters for the single node they are nominated to, - rather than evaluating the entire cluster. -* These pods are expected to be retried quickly after the Workload Scheduling Cycle because - their initial timestamps are preserved. This places them near the head of the active queue, - minimizing the duration they remain in the "nominated but not assumed" state. -* While higher-priority or long-standing (equal priority) pods might interleave and be scheduled before the gang pods, - the overall window of time where these nominations are active is expected to be short enough - to prevent severe degradation. - -The real impact will be verified through scalability tests (scheduler-perf benchmark). +#### Exacerbating the race window by proceeding directly to binding + +Since the entire Workload Scheduling Cycle operates on a single cluster snapshot, +a long-running cycle means decisions are based on snapshotted state that may become stale. +This implies that if the cluster state changes in the meantime +(e.g., a Node suffers a hardware failure or is deleted), +the binding phase could fail for some pods in the workload, potentially causing the entire gang to fail. + +However, assuming all scheduling decisions go through kube-scheduler, +the primary source of race conditions is external infrastructure events (e.g., Node health changes). +While this is a valid concern, this race window exists in the standard scheduling cycle as well. +Although the Workload Scheduling Cycle extends this window, +the propagation latency of Node status updates or deletions is typically non-trivial, +meaning the marginal increase in risk is acceptable compared to the benefits of atomic scheduling. ## Design Details @@ -593,7 +580,7 @@ this will address requirement (5). 
#### The Workload Scheduling Cycle We introduce a new phase in the main scheduling loop (`scheduleOne`). In the -end-to-end Pod scheduling flow, it is planned to place this new phase *before* +end-to-end Pod scheduling flow, it is planned to place this new phase instead of the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from the active queue, it initiates the Workload Scheduling Cycle. @@ -626,12 +613,11 @@ The cycle proceeds as follows: 4. Outcome: * If the group (i.e., at least `minCount` Pods) can be placed, - these Pods have the `.status.nominatedNodeName` set. - They are then effectively "reserved" on those nodes in the - scheduler's internal cache. Pods are then pushed to the - active queue (restoring their original timestamps to ensure fairness) - to pass through the standard scheduling and binding cycle, - which will consider and follow the nomination. + these Pods proceed directly to the pod-by-pod binding cycle with their selected nodes. + these Pods proceed to the binding bycle with their selected nodes. + * In case preemption is required, the PodGroup is moved back to the scheduling queue + to wait for the preemption to take effect. This requires a subsequent + Workload Scheduling Cycle to verify that the released resources make the placement feasible. * If `minCount` cannot be met (even after calculating potential preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to @@ -650,8 +636,8 @@ what will be effectively the weakest link to determine if the whole pod group is and reduce unnecessary preemption attempts. To ensure that we process the `PodGroup` instance at an appropriate time and -don't starve other pods (including gang pods in the pod-by-pod scheduling phase) -from being scheduled, we need to have a good queueing mechanism for pod groups. 
+don't starve other pods from being scheduled, we need to have a good queueing mechanism +for pod groups. We have decided to make the scheduling queue explicitly workload-aware. The queue will support queuing `PodGroup` instances alongside individual Pods. @@ -672,8 +658,7 @@ The queue will support queuing `PodGroup` instances alongside individual Pods. 4. During a Workload Scheduling Cycle, all member Pods are retrieved from the `QueuedPodGroupInfo`. Based on the cycle's outcome: - * **Success:** Pods are moved to the standard `activeQ` (with nominations set) - to proceed to the pod-by-pod scheduling soon. + * **Success:** Pods are moved directly to the binding cycle. * **Failure/Preemption:** The `QueuedPodGroupInfo` (containing the unschedulable pods) is returned to the `unschedulablePodInfos` structure. The `PodGroup` enters a backoff state and is eligible for retry only when a relevant cluster event wakes up at least one of its member pods. @@ -749,22 +734,14 @@ The list and configuration of plugins used by this algorithm will be the same as can be scheduled in a different location if resources become available earlier, but cannot cause additional disruption to do so. - * If preemptions are not needed: Pods are nominated to their chosen nodes, - pushed directly to the active queue in the order they were evaluated in the Workload Scheduling Cycle. - They will soon attempt to be scheduled on their nominated nodes in their own, pod-by-pod cycles. + * If preemptions are not needed: Pods proceed directly to their binding cycles + using the nodes selected during the Workload Scheduling Cycle. - Pod will be restricted to its nominated node during the individual cycle. - If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that - into consideration. 
The `minCount` check can consider the number of pods that have passed - the Workload Scheduling Cycle to ensure that Pods are not waiting unnecessarily when some have been rejected - but other new pods have been added to the cluster. - - In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. - Allowing it would complicate reasoning about the consistency of the - Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary, - (e.g., the nominated node is no longer valid), the gang will either be instantly rejected - (when the `minCount` cannot be satisfied) or time out (safety check, in case a bug appears) at `WaitOnPermit` - and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. + The `WaitOnPermit` gate is retained to ensure that the `minCount` pods are successfully + admitted before binding occurs. Additionally, the `minCount` check can consider + the number of pods that have passed the Workload Scheduling Cycle to ensure + that Pods do not wait unnecessarily if some have been rejected while new pods + have been added to the cluster. * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated during this cycle are discarded. Pods go through traditional failure handlers @@ -913,13 +890,6 @@ in the meantime (e.g. due to other pods terminating or new nodes appearing), sub attempts may pick it up, improving the end-to-end scheduling latency. Returning pods to scheduling queue if these need to wait for preemption to become schedulable maintains that property. -We acknowledge the two limitations of the above approach: (a) dependency on the introduction of -Workload Scheduling Cycle (delayed preemption will not work if workload pods will not be processed -by Workload Scheduling Cycle) and (b) the fact that the placement computed in -Workload Scheduling Cycle may be invalidated in pod-by-pod scheduling later. 
-However, those features should be used together, -and the simplicity of the approach and target architecture outweigh these limitations. - #### Workload-aware Preemption Workload-aware preemption ([KEP-5710](https://kep.k8s.io/5710)) aims to @@ -1037,13 +1007,10 @@ With Workload Scheduling Cycle and Delayed Preemption features, we will signific - Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler determines the entire gang can fit, otherwise, the cycle aborts with zero disruption. - Failed pod groups are requeued correctly and retry successfully when resources become available. -- Gang is rejected if pod-by-pod scheduling cannot follow a nomination. All other nominations should be also cleared. We will also benchmark the performance impact of these changes to measure: -- The scheduling throughput of the workload scheduling, including gang and basic policies and preemptions. -- The performance impact on standard pod scheduling when there are many nominated pods, - for scenarios mentioned in the [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance). +- The scheduling throughput of the workload scheduling, including gang and basic policies, and preemptions. ##### e2e tests From 1ca8c1f0c9bb7a20b7e9ca775b6825793de5ebdc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 3 Feb 2026 08:01:56 +0000 Subject: [PATCH 21/23] Apply comments --- .../4671-gang-scheduling/README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index c98709efde3d..c5e5a252c1a7 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -579,10 +579,11 @@ this will address requirement (5). 
#### The Workload Scheduling Cycle
 
-We introduce a new phase in the main scheduling loop (`scheduleOne`). In the
-end-to-end Pod scheduling flow, it is planned to place this new phase instead of
-the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from
-the active queue, it initiates the Workload Scheduling Cycle.
+We introduce a new phase in the main scheduling loop (`scheduleOne`).
+This phase replaces the standard pod-by-pod scheduling cycle for all Pods
+belonging to a `PodGroup`. This means that these individual Pods do not enter
+the standard scheduling queue for independent processing. Instead, when the loop pops a
+`PodGroup` from the active queue, it initiates the Workload Scheduling Cycle.
 
 Since the `PodGroup` instance (defined by the group name and replica key)
 is the effective scheduling unit, the Workload Scheduling Cycle will operate
@@ -601,9 +602,8 @@ while the previous attempt hasn't finished yet.
 The cycle proceeds as follows:
 
 1. The scheduler takes pod group from the scheduling queue.
-   If the pod group is unscheduled (even partially), it temporarily removes
-   all group's pods from the queue for processing. The order of processing
-   is determined by the queueing mechanism (see *Queuing and Ordering* below).
+   The retrieved object contains the list of all pending pods belonging to this group.
+   The order of processing is determined by the queueing mechanism (see *Queuing and Ordering* below).
 
 2. A single cluster state snapshot is taken for the entire group operation
    to ensure consistency during the cycle.
@@ -613,8 +613,7 @@ The cycle proceeds as follows:
 
 4. Outcome:
    * If the group (i.e., at least `minCount` Pods) can be placed,
-     these Pods proceed directly to the pod-by-pod binding cycle with their selected nodes.
-     these Pods proceed to the binding bycle with their selected nodes.
+     these Pods proceed directly to the binding cycle with their selected nodes.
* In case preemption is required, the PodGroup is moved back to the scheduling queue to wait for the preemption to take effect. This requires a subsequent Workload Scheduling Cycle to verify that the released resources make the placement feasible. @@ -646,7 +645,7 @@ The queue will support queuing `PodGroup` instances alongside individual Pods. is not yet present in the scheduling queue, it is created and enqueued. This object will have an aggregated `PreEnqueue` check, evaluating conditions for all its members. Crucially, the individual Pods themselves are **not** stored in any standard scheduling queue - data structure (active, backoff, or unschedulable) at this stage, but they are effectively managed + data structure (active, backoff, or unschedulable), but they are effectively managed via the `QueuedPodGroupInfo`. 2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), From ae9b3a3456d720b122af76b69c42efae00241875 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 3 Feb 2026 08:03:11 +0000 Subject: [PATCH 22/23] Update toc --- keps/sig-scheduling/4671-gang-scheduling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index c5e5a252c1a7..1700a15c3d82 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -13,7 +13,7 @@ - [Story 2: Gang-scheduling of a custom workload](#story-2-gang-scheduling-of-a-custom-workload) - [Risks and Mitigations](#risks-and-mitigations) - [The API needs to be extended in an unpredictable way](#the-api-needs-to-be-extended-in-an-unpredictable-way) - - [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance) + - [Exacerbating the race window by proceeding directly to 
binding](#exacerbating-the-race-window-by-proceeding-directly-to-binding)
 - [Design Details](#design-details)
   - [Naming](#naming)
   - [Associating Pod into PodGroups](#associating-pod-into-podgroups)

From 4c8bcd9120ac60b643dc40df754709878e52bbf6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Wed, 4 Feb 2026 15:23:41 +0000
Subject: [PATCH 23/23] Add a paragraph about requirement of consistent
 schedulerName

---
 keps/sig-scheduling/4671-gang-scheduling/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 1700a15c3d82..edc77311628a 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -808,6 +808,12 @@ by assigning a lower priority to the dependent pods. Since the algorithm process
 pods first, this ensures that the required pods are scheduled earlier, to satisfy
 the affinity rules of the subsequent dependent pods.
 
+All pods belonging to a single pod group must share the same `.spec.schedulerName`.
+Divergent scheduler names would complicate reasoning about placement decisions
+and make future pod group-based constraints more difficult to manage.
+The scheduler will validate this condition: if a mismatch is detected,
+all of the pod group's pods will be rejected as unschedulable.
+
 #### Interaction with Basic Policy
 
 For pod groups using the `Basic` policy, the Workload Scheduling Cycle is