From eae3ddb1fd3cad00e92a47741b19aa8d6ccab036 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Wed, 10 Dec 2025 08:59:23 +0000 Subject: [PATCH 01/23] Add a section about scheduler changes for v1.36 --- keps/prod-readiness/sig-scheduling/4671.yaml | 2 + .../4671-gang-scheduling/README.md | 315 ++++++++++++++++-- .../4671-gang-scheduling/kep.yaml | 22 +- 3 files changed, 299 insertions(+), 40 deletions(-) diff --git a/keps/prod-readiness/sig-scheduling/4671.yaml b/keps/prod-readiness/sig-scheduling/4671.yaml index 17a4b734bff8..3257880a90d5 100644 --- a/keps/prod-readiness/sig-scheduling/4671.yaml +++ b/keps/prod-readiness/sig-scheduling/4671.yaml @@ -1,3 +1,5 @@ kep-number: 4671 alpha: approver: "@soltysh" +beta: + approver: "@soltysh" diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 5119a6b45c8f..772d752263c4 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -469,12 +469,14 @@ the intention from the desired state. Note that given scheduling options are stored in the `Workload` object, pods linked to the `Workload` object will not be scheduled until this `Workload` object is created and observed by the kube-scheduler. +#### North Star Vision + The north star vision for gang scheduling implementation should satisfy the following requirements: 1. Ensure that pods being part of a gang are not bound if all pods belonging to it can't be scheduled. 2. Provide the "optimal enough" placement by considering all pods from a gang together. -3. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by kube-scheduler. -4. Avoid deadlock scenario when multiple workloads are being scheduled at the same time by different +3. Avoid deadlock and livelock scenario when multiple workloads are being scheduled at the same time by kube-scheduler. +4. 
Avoid deadlock and livelock scenario when multiple workloads are being scheduled at the same time by different schedulers. 5. Avoid premature preemptions of already running pods in case a higher priority gang will be rejected. 6. Support gang-level (or workload-level in general) level preemption (if pods form a gang also @@ -488,6 +490,8 @@ Addressing all these requirements in a single shot would be a huge change, so as will only focus on a subset of those. However, we very briefly sketch the path towards the vision to ensure that this KEP is moving in the right direction. +#### GangScheduling Plugin + For `Alpha`, we are focusing on introducing the concept of the `Workload` and plumbing it into kube-scheduler in the simplest possible way. We will implement a new plugin implementing the following hooks: @@ -499,28 +503,7 @@ hooks: This seems to be the simplest possible implementation to address the requirement (1). We are consciously ignoring the rest of the requirements for `Alpha` phase. - -For `Beta`, we want to also touch requirements (2) and (3) by extending the scheduling framework with -a new dedicated phase (tentatively called Workload). In that phase, -kube-scheduler will be looking at all pods from a gang (part of `Workload`) and compute the placement -for all of these pods in a single scheduling cycle. Those placements will be stored only in-memory and -block the required resources from scheduling. Tentatively we plan to use `NominatedNodeName` field for it. -After that, pods will go through regular pod-by-pod scheduling phases (including Filter and Score) -with a nomination as a form of validation the proposed placement and execution of this placement decision. -Therefore we expect the order of processing pods won't ever be important, but all-or-nothing nature of -gangs will be preserved while advancing through the further steps of the binding process. 
- -While we will not target addressing "optimal enough" part of requirement (2), we will assure that we -can process all gang pods together. The single scheduling cycle and blocking resources in beta -will address the requirement (3). - -We will also introduce delayed preemption by moving it after `WaitOnPermit` phase. Together with -introduction of a dedicated phase for scheduling all pods in a single scheduling cycle this -will address the requirement (5). If accompanied with blocking the resources in-memory as -mentioned above, this basically mitigates the problem. - -More detail about scheduler changes is described in [this document](https://docs.google.com/document/d/1lMYkDuGqEoZWfE2b8vjQx0vHieOMyfmi6VHUef5-5is/edit?tab=t.0#heading=h.1p88ilpefnb). - +#### Future plans We will continue with further improvements on top of it with follow-up KEPs. We are planning to introduce the concept of `Reservation` that will allow to treat distributed subset of resources as @@ -535,12 +518,6 @@ states (e.g. not yet block resources) will help with improving the scheduling ac Finally making the binding process aware of gangs will allow to make sure the process is either successful or triggers workload rescheduling satisfying requirement (7). -The workload-aware preemption is tightly coupled, but separate feature that will also be designed -in a dedicated KEP. The current vision includes introducing a dedicated preemption policy (that -will result in pods no longer being treated individually for preemption purposes) which makes it -an additive feature. However, having a next level of details is required to ensure that we really -have a feasible backward-compatible plan before promoting this feature to Beta. - Addressing requirement (8) is the biggest effort as it requires much closer integration between scheduler and autoscaling components. So in the initial steps we will only focus on mitigating this problem with existing mechanisms (e.g. 
reserving resources via NominatedNodeName). @@ -548,6 +525,275 @@ this problem with existing mechanisms (e.g. reserving resources via NominatedNod However, approval for this KEP is NOT an approval for this vision. We only sketch it to show that we see a viable path forward from the proposed design that will not require significant rework. +### Scheduler Changes for Beta + +For the `Alpha` phase, we focused on plumbing the `Workload` API and implementing +the `GangScheduling` plugin using simple barriers (`PreEnqueue` and `Permit`). +While this satisfied the correctness requirement for "all-or-nothing" scheduling, +it did not address performance or efficiency at scale, scheduling livelocks, +nor did it solve the problem of partial preemption application. + +For `Beta`, we propose introducing a **Workload Scheduling Cycle**. +This mechanism processes all Pods belonging to a single `PodGroup` in one batch, +rather than attempting to schedule them individually in isolation using the +traditional pod-by-pod approach. +While this won't fully address the "optimal enough" part of requirement (2), +it ensures that all gang pods are processed together. +The single scheduling cycle, together with blocking resources using nomination, +will address requirement (3). + +We will also introduce delayed preemption (described in [KEP-5710](https://kep.k8s.io/5711)). +Together with the introduction of a dedicated Workload Scheduling Cycle, +this will address requirement (5). + +#### The Workload Scheduling Cycle + +We introduce a new phase in the main scheduling loop (`scheduleOne`). In the +end-to-end Pod scheduling flow, it is planned to place this new phase *before* +the standard pod-by-pod scheduling cycle. + +When the scheduler pops a Pod from the active queue, it checks if that Pod +belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler +initiates the Workload Scheduling Cycle. 
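To make the dispatch decision concrete, here is a minimal, hypothetical Go sketch of the check described above. `PodGroupState`, its fields, and `needsWorkloadCycle` are illustrative stand-ins only, not the scheduler's actual types:

```go
package main

import "fmt"

// SchedulingPolicy is an illustrative stand-in for the Workload API's policy kinds.
type SchedulingPolicy string

const (
	GangPolicy  SchedulingPolicy = "Gang"
	BasicPolicy SchedulingPolicy = "Basic"
)

// PodGroupState is a hypothetical view of the scheduler's internal bookkeeping
// for one PodGroup instance; the real state lives in the scheduler cache.
type PodGroupState struct {
	Policy SchedulingPolicy
	// MinCount is the gang's all-or-nothing quorum.
	MinCount int
	// PlacedCount counts Pods already bound or holding a valid nomination.
	PlacedCount int
}

// needsWorkloadCycle decides, for a Pod popped from the active queue, whether
// to enter the Workload Scheduling Cycle instead of the pod-by-pod path: the
// Pod's group must use the Gang policy and still be (at least partially)
// unscheduled, exactly as the paragraph above describes.
func needsWorkloadCycle(group *PodGroupState) bool {
	if group == nil {
		return false // a plain Pod with no Workload association
	}
	return group.Policy == GangPolicy && group.PlacedCount < group.MinCount
}

func main() {
	partial := &PodGroupState{Policy: GangPolicy, MinCount: 4, PlacedCount: 1}
	fmt.Println(needsWorkloadCycle(partial)) // true: gang still unscheduled
	fmt.Println(needsWorkloadCycle(nil))     // false: pod-by-pod path
}
```

A fully placed gang (`PlacedCount >= MinCount`) falls through to the ordinary pod-by-pod path, which is the behavior sketched here under the assumption that placement progress is tracked per group.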
+ +```md +&lt;&lt;[UNRESOLVED Scope of the Cycle]&gt;&gt; +It is currently unresolved whether the Workload Scheduling Cycle should operate +on the entire `Workload` object (handling all defined PodGroups simultaneously) +or strictly at the `PodGroup` level. + +* PodGroup Level: The cycle processes only the specific `PodGroup` (and replica key) + associated with the popped Pod. This is simpler and aligns with + the Gang Scheduling definition and current implementation. +* Workload Level: The cycle attempts to schedule all PodGroups within the Workload. + This allows for complex dependencies between groups but increases the complexity + and doesn't bring immediate added value. + +*Proposed:* Implement it on PodGroup Level for Beta. However, future migration +to the Workload Level might necessitate non-trivial changes to the phase +introduced by this KEP. +&lt;&lt;[/UNRESOLVED]&gt;&gt; +``` + +The cycle proceeds as follows: + +1. The scheduler takes either the pod group itself or its Pod representative from + the scheduling queue. If the pod group is unscheduled (even partially), it temporarily removes + all of the group's pods from the queue for processing. The order of processing + is determined by the queueing mechanism (see *Queuing and Ordering* below). + +2. A single cluster state snapshot is taken for the entire group operation + to ensure consistency during the cycle. + +3. The scheduler runs a specialized algorithm (detailed below) + to find placements for the group. + +4. Outcome: + * If the group (i.e., at least `minCount` Pods) can be placed, + these Pods have the `.status.nominatedNodeName` set. + They are then effectively "reserved" on those nodes in the + scheduler's internal cache. Pods are then pushed to the + active queue (restoring their original timestamps to ensure fairness) + to pass through the standard scheduling and binding cycle, + which will respect the nomination.
* If `minCount` cannot be met (even after calculating potential + preemptions), the scheduler rejects the entire group. Standard backoff + logic applies (see *Failure Handling*), and Pods are returned to + the scheduling queue. + +#### Queuing and Ordering + +Workload-aware preemption (an `Alpha` effort in [KEP-5710](https://github.com/kubernetes/enhancements/pull/5711)) +will introduce a specific scheduling priority for a workload. +With that in mind, it is beneficial to design a queueing mechanism that is open +to taking a workload's scheduling priority into account. +However, as we need to support ordering before that feature can be enabled, +we also need to derive the priority from the pod group's pods. +One such formula is to set it to the lowest priority found within the pod group, +which effectively acts as the weakest link for determining whether the whole pod group is schedulable +and reduces unnecessary preemption attempts. + +```md +&lt;&lt;[UNRESOLVED Queue Implementation Strategy]&gt;&gt; +To ensure that we process the pod group (replica) at an appropriate time and +don't starve other pods (including gang pods in the pod-by-pod scheduling phase) +from being scheduled, we need to have a good queueing mechanism for pod groups. +There are several alternatives: + +Alternative 1 (Modify sorting logic): + +Modify the sorting logic within the existing `PriorityQueue` to put all pods +from a gang group one after another. +* *Pros:* Fits the current architecture. +* *Cons:* Might be problematic when some of the gang's pods are in the + backoffQ or unschedulablePods and need to be retrieved efficiently. + Makes it hard to further evolve the Workload Scheduling Cycle. + Would need to inject the workload priority into each of the Pods + or somehow apply the lowest pod's priority to the rest of the group. + +Alternative 2 (Store a gang representative): + +Only one "representative" Pod from the gang is allowed in the `activeQ` at a time.
+Others are held in a separate internal structure (e.g., a new map inside the queue). +When the representative is popped, it pulls the rest of the gang for the Workload Cycle. +* *Pros:* Makes it easier to obtain all of a pod group's pods, reduces queue size. +* *Cons:* High complexity in managing the lifecycle of the representative + (e.g., what if the representative Pod is deleted or other changes to the workload happen? + Would need a workload manager to handle all such cases). + +Alternative 3 (Dedicated PodGroup queue): + +Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. +The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. +Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. +* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. + Can report dedicated logs and metrics with less confusion to the user. +* *Cons:* Significant and non-trivial architectural change to the scheduling queue + and `scheduleOne` loop. + +*Proposed:* Alternative 3 (Dedicated PodGroup queue). While this requires an architectural change to the scheduling queue, +the effort involved in adding pod group queuing will be comparable to modifying the code for the previous alternatives. +This will also lay the foundation for future workload-aware scheduling (WAS) features and support workload priority by design. +&lt;&lt;[/UNRESOLVED]&gt;&gt; +``` + +#### Scheduling Algorithm + +The internal algorithm for placing the group utilizes the optimization defined +in *Opportunistic Batching* ([KEP-5598](https://kep.k8s.io/5598)) for improved performance. +The approach described below allows mitigating some restrictions of that feature, e.g., +by sorting the Pods appropriately by their signatures. In case Opportunistic Batching +is disabled or not applicable, this falls back to non-optimized filtering and scoring for each Pod.
+The list and configuration of plugins used by this algorithm will be the same as in the pod-by-pod cycle. + +1. The scheduler iterates through the retrieved Pods and groups + them into homogeneous sub-groups (using the signatures defined in + [KEP-5598](https://kep.k8s.io/5598)). + +2. These sub-groups are sorted. Initially, we sort by the highest priority + of the sub-group (assuming homogeneity enforces uniform sub-group priority). + In the future, sorting may use the size of the sub-group (larger groups first) to + tackle the hardest placement problems early. + +3. The scheduler iterates through the sorted sub-groups. It finds a feasible node + for each pod from a sub-group using standard filtering and scoring phases. + It also utilizes the Opportunistic Batching feature where possible, + reducing overall scheduling time. + + * If a pod fits, it is tentatively nominated. + * If a pod cannot fit, the scheduler tries preemption by running + the `PostFilter` extension point. *Note:* With workload-aware preemption + this phase will be replaced by a workload-level algorithm. + * If preemption is successful, the pod is nominated on the selected node. + * If preemption fails, the pod is considered unscheduled for this cycle. + + The phase can effectively stop once `minCount` pods have a placement, + though attempting to schedule the full group is preferred to maximize utilization. + +4. The scheduler checks if the number of schedulable (including those after delayed preemption) + Pods meets the `minCount`. + + * If `schedulableCount >= minCount`, the cycle succeeds. Pods are pushed + to the active queue and will soon attempt to be scheduled on their + nominated nodes in their own, pod-by-pod cycles. If a pod selects a + different node than its nomination during the individual cycle, the + gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). 
```md + &lt;&lt;[UNRESOLVED Pod-by-pod cycle preemption]&gt;&gt; + Should gang pods be allowed to preempt anything in their pod-by-pod cycles? + + *Proposed:* Preemption should be forbidden. Otherwise, it may complicate reasoning + about the workload scheduling cycle and workload-aware preemption. + When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, + and all necessary preemptions will be simulated in the next workload scheduling cycle. + &lt;&lt;[/UNRESOLVED]&gt;&gt; + ``` + * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers + and nominations for them are cleared to ensure the other workloads (pod groups) + can be attempted in their place. See *Failure Handling*. + +While this algorithm might be suboptimal, it is a solid first step for ensuring we have +a single-cycle workload scheduling phase. As long as PodGroups consist of homogeneous pods, +opportunistic batching itself will provide significant improvements. +Future features like Topology Aware Scheduling can further improve other subsets of use cases. + +#### Interaction with Basic Policy + +For pod groups using the `Basic` policy, the Workload Scheduling Cycle is +optional. In the `Beta` timeframe, we may opportunistically apply this cycle to +`Basic` pod groups to leverage the batching performance benefits, but the +"all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to +schedule as many pods from such PodGroup as possible. + +#### Delayed Preemption + +A critical requirement for moving Gang Scheduling to Beta is the integration +with *Delayed Preemption*. + +Standard Kubernetes preemption is eager: when a `PostFilter` selects victims to preempt, +they are deleted immediately. For Gang Scheduling, this behavior is risky and can lead to +*partial preemption application*, meaning we might do some unnecessary preemptions +when the gang, ultimately, won't fit.
Delayed Preemption solves this by separating the +*selection* of victims from the *execution* of preemption. + +1. During the Workload Scheduling Cycle, the scheduler calculates necessary + preemptions for all Pods in the gang (Step 3 of Scheduling Algorithm). + +2. The scheduler nominates the victims for preemption and the gang Pod + for scheduling on their place. This way, the gang can be attempted + without making any intermediate disruptions to the cluster. + * If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod. + Victims are preempted in the new bulk-deletion mechanism after `WaitOnPermit`, + but only because the *whole* gang (or sufficient quorum) was schedulable. + * If the quorum is not met, the preemption is aborted. No victims are deleted. + The gang returns to the queue. + +Read more about the proposal in +[KEP-5710: Workload Aware Preemption](https://github.com/kubernetes/enhancements/pull/5711) PR. + +#### Workload-aware Preemption + +Workload-aware preemption ([KEP-5710](https://kep.k8s.io/5710)) aims to +enable preemption for a whole pod group at once. In the context of this cycle, +it means that if the cycle determines preemption for a single pod is necessary, +it won't run the `PostFilter` phase, but defer that to the end of the scheduling phase, +running a new, single workload-aware preemption step. + +Read more about the proposal in +[KEP-5710: Workload Aware Preemption](https://github.com/kubernetes/enhancements/pull/5711) PR. + +#### Failure Handling + +If a Workload Scheduling Cycle fails (e.g., `minCount` is not met, preemption fails, +or a timeout occurs), the scheduler must handle the failure efficiently. + +1. Rejection + +When the cycle fails, the scheduler rejects the entire group. +* All Pods in the group are moved back to the scheduling queue. +* Crucially, any `.status.nominatedNodeName` entries set during the failed attempt + (or from previous cycles) must be cleared. 
This ensures that the resources + tentatively reserved for this gang are immediately released for other workloads. + +2. Backoff strategy + +A backoff mechanism has to be applied to a pod group, similarly to what we do for individual pods. +For Beta, we will apply the standard Pod backoff logic to the group. + +At the same time, we can consider increasing the default maximum backoff value, +as the current 10 seconds has proven to be too low in larger clusters, +and the same might be the case for workloads. + +3. Retries + +We rely on the existing Queueing Hints mechanism to determine when to retry the gang. +It is considered for a retry when *at least one* member Pod receives a `Queue` hint +(indicating a relevant cluster event, such as a Node addition or Pod deletion, +has made that specific Pod potentially schedulable). + +While checking a single Pod does not guarantee the *whole* gang can fit, +calculating gang-level schedulability inside the event handler can be difficult at the moment. +Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves. ### Test Plan @@ -636,7 +882,7 @@ promoted to the conformance. #### Beta - Providing "optimal enough" placement by considering all pods from a gang together -- Avoiding deadlock scenario when multiple workloads are being scheduled at the same time +- Avoiding livelock scenario when multiple workloads are being scheduled at the same time by kube-scheduler - Implementing delayed preemption to avoid premature preemptions - Workload-aware preemption design to ensure we won't break backward compatibility with it. @@ -720,6 +966,13 @@ This section must be completed when targeting alpha to a release.
- Feature gate name: GangScheduling - Components depending on the feature gate: - kube-scheduler + - Feature gate name: WorkloadSchedulingCycle + - Components depending on the feature gate: + - kube-scheduler + - Feature gate name: WorkloadBasicPolicyDesiredCount + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 217d8053979e..209a57bf373d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -1,13 +1,14 @@ title: Gang Scheduling kep-number: 4671 authors: - - "@erictune" - - "@wojtek-t" - - "@helayoty" - - "@dom4ha" - - "@44past4" - - "@andreyvelich" - - "@thockin" + - "@erictune" + - "@wojtek-t" + - "@helayoty" + - "@dom4ha" + - "@44past4" + - "@andreyvelich" + - "@thockin" + - "@macsko" owning-sig: sig-scheduling participating-sigs: @@ -27,12 +28,12 @@ replaces: # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. -stage: alpha +stage: beta # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.35" +latest-milestone: "v1.36" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: @@ -50,6 +51,9 @@ feature-gates: - name: GangScheduling components: - kube-scheduler + - name: WorkloadSchedulingCycle + components: + - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 9e672be8e04924c80eb47e8ab1c8a5d42dc4040e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 16 Dec 2025 14:27:35 +0000 Subject: [PATCH 02/23] Add a section about basic policy update --- .../4671-gang-scheduling/README.md | 38 +++++++++++++++++++ .../4671-gang-scheduling/kep.yaml | 4 ++ 2 files changed, 42 insertions(+) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 772d752263c4..d4d6864da6cf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -458,6 +458,39 @@ not be split into two. A `LeaderWorkerSet` is a good example of it, where a sing of a single leader and `N` workers and that forms a scheduling (and runtime unit), but workload as a whole may consist of a number of such replicas. +#### Basic Policy Extension + +While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class +of batch workloads that requires best-effort optimization without +the strict blocking semantics of a gang. + +Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy +to accept a `desiredCount` field. This feature will be gated behind a separate +feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. + +```go +// BasicSchedulingPolicy indicates that standard Kubernetes +// scheduling behavior should be used. +type BasicSchedulingPolicy struct { + // DesiredCount is the expected number of pods that will belong to this + // PodGroup. This field is a hint to the scheduler to help it make better + // placement decisions for the group as a whole. 
+ // + // Unlike gang's minCount, this field does not block scheduling. If the number + // of available pods is less than desiredCount, the scheduler can still attempt + // to schedule the available pods, but will optimistically try to select a + // placement that can accommodate the future pods. + // + // +optional + DesiredCount *int32 +} +``` + +This field allows users to express their "true" workloads more easily +and enables the scheduler to optimize the placement of such pod groups by taking the desired state +into account. Ideally, the scheduler should prefer placements that can accommodate +the full `desiredCount`, even if not all pods are created yet. + ### Scheduler Changes The kube-scheduler will be watching for `Workload` objects (using informers) and will use them to map pods @@ -725,6 +758,11 @@ optional. In the `Beta` timeframe, we may opportunistically apply this cycle to "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. +If the `Basic` policy has `desiredCount` set, the Workload Scheduling Cycle +may utilize this value to simulate the full group size during feasibility checks. +Note that the implementation of this specific logic might follow in a Beta stage +of this API field. 
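As an illustration of the idea (not a committed implementation), the feasibility check could size the group with a hypothetical helper like the one below; `BasicPolicy` mirrors the API fragment above, while `simulatedGroupSize` is an assumed name:

```go
package main

import "fmt"

// BasicPolicy mirrors the proposed API fragment; DesiredCount is optional.
type BasicPolicy struct {
	DesiredCount *int32
}

// simulatedGroupSize is a hypothetical helper: when desiredCount exceeds the
// number of Pods observed so far, the Workload Scheduling Cycle can pretend
// the missing Pods already exist, so the chosen placement leaves room for them.
func simulatedGroupSize(observedPods int, p BasicPolicy) int {
	if p.DesiredCount != nil && int(*p.DesiredCount) > observedPods {
		return int(*p.DesiredCount)
	}
	return observedPods
}

func main() {
	want := int32(8)
	// Only 3 of the 8 expected Pods were created so far: simulate all 8.
	fmt.Println(simulatedGroupSize(3, BasicPolicy{DesiredCount: &want})) // 8
	// Without desiredCount, only the observed Pods are considered.
	fmt.Println(simulatedGroupSize(3, BasicPolicy{})) // 3
}
```

Because `desiredCount` is only a hint, the observed Pods always win when they outnumber it; nothing here blocks scheduling the way a gang's `minCount` does.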
+ #### Delayed Preemption A critical requirement for moving Gang Scheduling to Beta is the integration diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 209a57bf373d..12ee0bbc50c8 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -54,6 +54,10 @@ feature-gates: - name: WorkloadSchedulingCycle components: - kube-scheduler + - name: WorkloadBasicPolicyDesiredCount + components: + - kube-apiserver + - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 43b5aa940e1876f1b3000403a661025e7adf5104 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 22 Dec 2025 11:18:51 +0000 Subject: [PATCH 03/23] Remove beta graduation from the PR, extend sections about workload scheduling cycle --- keps/prod-readiness/sig-scheduling/4671.yaml | 2 - .../4671-gang-scheduling/README.md | 84 ++++++++----------- .../4671-gang-scheduling/kep.yaml | 5 +- 3 files changed, 38 insertions(+), 53 deletions(-) diff --git a/keps/prod-readiness/sig-scheduling/4671.yaml b/keps/prod-readiness/sig-scheduling/4671.yaml index 3257880a90d5..17a4b734bff8 100644 --- a/keps/prod-readiness/sig-scheduling/4671.yaml +++ b/keps/prod-readiness/sig-scheduling/4671.yaml @@ -1,5 +1,3 @@ kep-number: 4671 alpha: approver: "@soltysh" -beta: - approver: "@soltysh" diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index d4d6864da6cf..ddc6bcc8112a 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -461,8 +461,7 @@ may consist of a number of such replicas. 
#### Basic Policy Extension While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class -of batch workloads that requires best-effort optimization without -the strict blocking semantics of a gang. +of workloads that requires best-effort optimization without the strict blocking semantics of a gang. Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy to accept a `desiredCount` field. This feature will be gated behind a separate @@ -490,6 +489,8 @@ This field allows users to express their "true" workloads more easily and enables the scheduler to optimize the placement of such pod groups by taking the desired state into account. Ideally, the scheduler should prefer placements that can accommodate the full `desiredCount`, even if not all pods are created yet. +When `desiredCount` is specified, the scheduler can delay scheduling the first Pod it sees +for a short amount of time in order to wait for more Pods to be observed. ### Scheduler Changes @@ -569,9 +570,10 @@ nor did it solve the problem of partial preemption application. For `Beta`, we propose introducing a **Workload Scheduling Cycle**. This mechanism processes all Pods belonging to a single `PodGroup` in one batch, rather than attempting to schedule them individually in isolation using the -traditional pod-by-pod approach. -While this won't fully address the "optimal enough" part of requirement (2), -it ensures that all gang pods are processed together. +traditional pod-by-pod approach. While introduction of this phase itself won't +fully address the "optimal enough" part of requirement (2), +it provides the necessary foundation for applying workload scheduling algorithms +to process the entire gang together. The single scheduling cycle, together with blocking resources using nomination, will address requirement (3). 
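The resource-blocking effect of nominations can be illustrated with a deliberately simplified, single-resource Go sketch. The names (`nominationCache`, `nominate`) and the one-number-per-node capacity model are assumptions for illustration; the real scheduler tracks nominated Pods per node in its internal cache:

```go
package main

import "fmt"

// nominationCache is a toy, in-memory view of free CPU (millicores) per node.
// Nominating a gang Pod onto a node immediately shrinks the capacity observed
// by subsequent scheduling cycles; that is what "blocks" the reserved
// resources until the Pods are bound or the nominations are cleared.
type nominationCache struct {
	freeMilliCPU map[string]int64
}

// nominate records an in-memory reservation for a Pod on a node. It fails,
// leaving the cache untouched, when the node cannot hold the request.
func (c *nominationCache) nominate(node string, milliCPU int64) bool {
	if c.freeMilliCPU[node] < milliCPU {
		return false
	}
	c.freeMilliCPU[node] -= milliCPU
	return true
}

func main() {
	c := &nominationCache{freeMilliCPU: map[string]int64{"node-a": 1000}}
	fmt.Println(c.nominate("node-a", 600)) // true: 400m left on node-a
	fmt.Println(c.nominate("node-a", 600)) // false: blocked by the first nomination
}
```

In this sketch a second gang (or any later cycle) consulting the same cache immediately sees the reduced capacity, which is how a single cycle's nominations prevent two workloads from claiming the same resources.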
@@ -586,27 +588,13 @@ end-to-end Pod scheduling flow, it is planned to place this new phase *before* the standard pod-by-pod scheduling cycle. When the scheduler pops a Pod from the active queue, it checks if that Pod -belongs to an unscheduled `PodGroup` with a `Gang` policy. If so, the scheduler +belongs to an unscheduled `PodGroup`. If so, the scheduler initiates the Workload Scheduling Cycle. -```md -<<[UNRESOLVED Scope of the Cycle]>> -It is currently unresolved whether the Workload Scheduling Cycle should operate -on the entire `Workload` object (handling all defined PodGroups simultaneously) -or strictly at the `PodGroup` level. - -* PodGroup Level: The cycle processes only the specific `PodGroup` (and replica key) - associated with the popped Pod. This is simpler and aligns with - the Gang Scheduling definition and current implementation. -* Workload Level: The cycle attempts to schedule all PodGroups within the Workload. - This allows for complex dependencies between groups but increases the complexity - and doesn't bring immediate added value. - -*Proposed:* Implement it on PodGroup Level for Beta. However, future migration -to the Workload Level might necessitate non-trivial changes to the phase -introduced by this KEP. -<<[/UNRESOLVED]>> -``` +Since the `PodGroup` instance (defined by the group name and replica key) +is the effective scheduling unit, the Workload Scheduling Cycle will operate +at the `PodGroup` instance level, i.e., each instance will be scheduled separately +in its own cycle. The cycle proceeds as follows: @@ -628,9 +616,9 @@ The cycle proceeds as follows: scheduler's internal cache. Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, - which will respect the nomination. + which will consider the nomination. * If `minCount` cannot be met (even after calculating potential - preemptions), the scheduler rejects the entire group. 
Standard backoff + preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to the scheduling queue. @@ -702,11 +690,13 @@ The list and configuration of plugins used by this algorithm will be the same as 1. The scheduler iterates through the retrieved Pods and groups them into homogeneous sub-groups (using the signatures defined in [KEP-5598](https://kep.k8s.io/5598)). + *This aggregation can be done in the scheduler's cache earlier to optimize performance.* 2. These sub-groups are sorted. Initially, we sort by the highest priority of the sub-group (assuming homogeneity enforces uniform sub-group priority). In the future, sorting may use the size of the sub-group (larger groups first) to tackle the hardest placement problems early. + *This sorting can be done in the scheduler's cache earlier to optimize performance.* 3. The scheduler iterates through the sorted sub-groups. It finds a feasible node for each pod from a sub-group using standard filtering and scoring phases. @@ -719,6 +709,10 @@ The list and configuration of plugins used by this algorithm will be the same as this phase will be replaced by a workload-level algorithm. * If preemption is successful, the pod is nominated on the selected node. * If preemption fails, the pod is considered unscheduled for this cycle. + However, the scheduling of subsequent pods continues as long as + the `minCount` constraint remains satisfiable. The processing can also be + optimized by rejecting all subsequent pods from the same + homogeneous sub-group, as their failed scheduling outcome will be the same. The phase can effectively stop once `minCount` pods have a placement, though attempting to schedule the full group is preferred to maximize utilization. @@ -731,16 +725,18 @@ The list and configuration of plugins used by this algorithm will be the same as nominated nodes in their own, pod-by-pod cycles. 
If a pod selects a different node than its nomination during the individual cycle, the gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). - ```md - <<[UNRESOLVED Pod-by-pod cycle preemption]>> - Should gang pods be allowed to preempt anything in their pod-by-pod cycles? - - *Proposed:* Preemption should be forbidden. Otherwise, it may complicate reasoning - about the workload scheduling cycle and workload-aware preemption. - When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, - and all necessary preemptions will be simulated in the next workload scheduling cycle. - <<[/UNRESOLVED]>> - ``` + + In the pod-by-pod cycle, the preemption made by the workload pods will be forbidden. + Otherwise, it may complicate reasoning about the workload scheduling cycle and workload-aware preemption. + When preemption is necessary, the gang will be retried after timing out at WaitOnPermit, + and all necessary preemptions will be simulated in the next workload scheduling cycle. + + In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. + Allowing it would complicate reasoning about the consistency of the + Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary + (e.g., the nominated node is no longer valid), the gang will time out at `WaitOnPermit` + and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. + * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers and nominations for them are cleared to ensure the other workloads (pod groups) can be attemtped on that place. See *Failure Handling*. @@ -753,7 +749,7 @@ Future features like Topology Aware Scheduling can further improve other subsets #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is -optional. 
In the `Beta` timeframe, we may opportunistically apply this cycle to +optional. In the `Beta` timeframe, this cycle will be applied to `Basic` pod groups to leverage the batching performance benefits, but the "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. @@ -774,15 +770,12 @@ they are deleted immediately. For Gang Scheduling, this behavior is risky and ca when the gang, ultimately, won't fit. Delayed Preemption solves this by separating the *selection* of victims from the *execution* of preemption. -1. During the Workload Scheduling Cycle, the scheduler calculates necessary +1. During the Workload Scheduling Cycle loop, the scheduler calculates necessary preemptions for all Pods in the gang (Step 3 of Scheduling Algorithm). -2. The scheduler nominates the victims for preemption and the gang Pod - for scheduling on their place. This way, the gang can be attempted - without making any intermediate disruptions to the cluster. - * If the quorum is met, the scheduler continues scheduling the gang Pods pod-by-pod. - Victims are preempted in the new bulk-deletion mechanism after `WaitOnPermit`, - but only because the *whole* gang (or sufficient quorum) was schedulable. +2. At the end of the Workload Scheduling Cycle: + * If the quorum is met, the scheduler actuates the preemptions, + initiating the removal of victims from the cluster. * If the quorum is not met, the preemption is aborted. No victims are deleted. The gang returns to the queue. @@ -1002,9 +995,6 @@ This section must be completed when targeting alpha to a release. 
- kube-apiserver - kube-scheduler - Feature gate name: GangScheduling - - Components depending on the feature gate: - - kube-scheduler - - Feature gate name: WorkloadSchedulingCycle - Components depending on the feature gate: - kube-scheduler - Feature gate name: WorkloadBasicPolicyDesiredCount diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 12ee0bbc50c8..a9c83db4eadf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -28,7 +28,7 @@ replaces: # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature # and a Deprecated feature gates are added, they should be deprecated|disabled|removed. -stage: beta +stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively @@ -51,9 +51,6 @@ feature-gates: - name: GangScheduling components: - kube-scheduler - - name: WorkloadSchedulingCycle - components: - - kube-scheduler - name: WorkloadBasicPolicyDesiredCount components: - kube-apiserver From 8eefcd3edec953cb0e402a3de596d3f1edd32095 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Fri, 2 Jan 2026 15:18:26 +0000 Subject: [PATCH 04/23] Expand queueing alternatives. Add unresolved section about enforcing minCount --- .../4671-gang-scheduling/README.md | 69 +++++++++++++++---- 1 file changed, 56 insertions(+), 13 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index ddc6bcc8112a..0fcf560baf30 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -594,7 +594,9 @@ initiates the Workload Scheduling Cycle. 
Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate at the `PodGroup` instance level, i.e., each instance will be scheduled separately -in its own cycle. +in its own cycle. If new Pods belonging to an already scheduled `PodGroup` instance appear, +they are also processed via the Workload Scheduling Cycle, which takes the previously +scheduled Pods into consideration. The cycle proceeds as follows: @@ -636,31 +638,50 @@ and reduce unnecessary preemption attempts. ```md <<[UNRESOLVED Queue Implementation Strategy]>> -To ensure that we process the pod group (replica) at an appropriate time and +To ensure that we process the `PodGroup` instance at an appropriate time and don't starve other pods (including gang pods in the pod-by-pod scheduling phase) from being scheduled, we need to have a good queueing mechanism for pod groups. There are several alternatives: +Alternative 0 (Keep current queueing and ordering): + +We can minimize changes by retaining the current queueing and ordering logic. +When a Pod is popped, the scheduler can check if it belongs to a `PodGroup` +requiring a Workload Scheduling Cycle. As we add scheduling priorities +for pod groups later, this alternative naturally evolves into Alternative 1. +* *Pros:* Fits the current architecture. Retains current reasoning about the + scheduling queue. Minimizes implementation effort. +* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ + or unschedulablePods and need to be retrieved efficiently. + Makes it hard to further evolve the Workload Scheduling Cycle. + Observability, currently suited for pod-by-pod scheduling, may not + accurately reflect the state of the queue (e.g., pending gangs). + Likely harder to support future extensions and won't work well + if `PodGroup` becomes a separate top-level resource. 
+ The pod group will be likely scheduled based on the highest priority member, + meaning the latter pod-by-pod cycles might be visibly delayed for lower priority Pods. + Alternative 1 (Modify sorting logic): Modify the sorting logic within the existing `PriorityQueue` to put all pods -from a gang group one after another. +from a pod group one after another. * *Pros:* Fits the current architecture. -* *Cons:* Might be problematic when some of the gang's pods are in the +* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ or unschedulablePods and need to be retrieved efficiently. Makes it hard to further evolve the Workload Scheduling Cycle. Would need to inject the workload priority into each of the Pods or somehow apply the lowest pod's priority to the rest of the group. -Alternative 2 (Store a gang representative): +Alternative 2 (Store a PodGroup instance): -Only one "representative" Pod from the gang is allowed in the `activeQ` at a time. -Others are held in a separate internal structure (e.g., a new map inside the queue). -When the representative is popped, it pulls the rest of the gang for the Workload Cycle. -* *Pros:* Makes it easier to obtain all pod group's pods, reduces queue size. -* *Cons:* High complexity in managing the lifecycle of the representative - (e.g., what if the representative Pod is deleted or other changes to the workload happen? - Would need a workload manager to handle all such cases). +Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`. +This allows reusing existing queue logic while extending it to `PodGroups`. +All queued members would be stored in a new dara structure +and retrieved for the Workload Cycle when the `PodGroup` is popped. +* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size. + Reuses current logic for popping, enforcing backoff, and processing unschedulable entities. 
+* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as + queueable entities, which is non-trivial and might clutter the code. Alternative 3 (Dedicated PodGroup queue): @@ -739,7 +760,29 @@ The list and configuration of plugins used by this algorithm will be the same as * If `schedulableCount < minCount`, the cycle fails. Pods go through traditional failure handlers and nominations for them are cleared to ensure the other workloads (pod groups) - can be attemtped on that place. See *Failure Handling*. + can be attempted on that place. See *Failure Handling*. + +```md +<<[UNRESOLVED Enforcing minCount constraint in algorithm]>> +Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint +is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm +needs to know if this constraint is met to decide whether to commit the results. +We have two ways of verifying this: + +1. Explicit check in the algorithm: Hardcode the `minCount` check within the framework's logic. + This implies that Gang Scheduling becomes a core scheduler framework feature rather than + just a specific plugin. + +2. New Extension Point: Introduce a new extension point allowing plugins to validate the group's + scheduled pods. This would function similarly to a `Permit` check (likely requiring `Reserve` state) + but without the suspension (`WaitOnPermit`) gate. Crucially, this extension should support two checks: + * Validation: Check whether the currently scheduled pods meet the requirements, + e.g., if the `minCount` pods from a pod group was successfully scheduled. + * Feasibility: Given the number of pods that have already failed scheduling in this cycle, + check whether is it still *possible* to meet the constraint. If not, the cycle should abort early + to save time. +<<[/UNRESOLVED]>> +``` While this algorithm might be suboptimal, it is a solid first step for ensuring we have a single-cycle workload scheduling phase. 
As long as PodGroups consist of homogeneous pods, From 38342637ffc6232cb655b85d653a8498ccd6282a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 5 Jan 2026 14:53:44 +0000 Subject: [PATCH 05/23] Apply comments --- .../4671-gang-scheduling/README.md | 21 ++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 0fcf560baf30..61b9b9be52cf 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -125,6 +125,7 @@ The following are non-goals for this KEP but will probably soon appear to be goa - Address the problem of premature preemptions in case the higher priority workloads does not eventually schedule. +See [Future plans](#future-plans) for more details. ## Proposal @@ -463,8 +464,9 @@ may consist of a number of such replicas. While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class of workloads that requires best-effort optimization without the strict blocking semantics of a gang. -Currently, the `Basic` policy is a no-op. We propose extending the `Basic` policy -to accept a `desiredCount` field. This feature will be gated behind a separate +In the first alpha version of the Workload API, the `Basic` policy was a no-op. +We propose extending the `Basic` policy to accept a `desiredCount` field. +This feature will be gated behind a separate feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. ```go @@ -746,11 +748,9 @@ The list and configuration of plugins used by this algorithm will be the same as nominated nodes in their own, pod-by-pod cycles. If a pod selects a different node than its nomination during the individual cycle, the gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`). 
-
-   In the pod-by-pod cycle, the preemption made by the workload pods will be forbidden.
-   Otherwise, it may complicate reasoning about the workload scheduling cycle and workload-aware preemption.
-   When preemption is necessary, the gang will be retried after timing out at WaitOnPermit,
-   and all necessary preemptions will be simulated in the next workload scheduling cycle.
+   The `minCount` check can consider the number of pods that have passed the Workload Scheduling Cycle
+   to ensure that Pods are not waiting unnecessarily when some have been rejected
+   but other new pods have been added to the cluster.
 
    In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden.
    Allowing it would complicate reasoning about the consistency of the
@@ -845,6 +845,7 @@ or a timeout occurs), the scheduler must handle the failure efficiently.
 
 When the cycle fails, the scheduler rejects the entire group.
 
 * All Pods in the group are moved back to the scheduling queue.
+  Their status is updated and an event with the failure reason is sent.
 * Crucially, any `.status.nominatedNodeName` entries set during the failed
   attempt (or from previous cycles) must be cleared. This ensures that the resources
   tentatively reserved for this gang are immediately released for other workloads.
@@ -869,6 +870,12 @@ While checking a single Pod does not guarantee the *whole* gang can fit,
 calculating gang-level schedulability inside the event handler can be difficult
 at the moment. Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves.
 
+It might be beneficial to retry the pod group without being triggered by any cluster event.
+Ideally, this would involve scrambling the pods and subgroups within the group that have the same priority.
+This could be useful because the pods could be scheduled without any cluster changes
+ + ### Test Plan -We will create integration test(s) to ensure basic functionalities of gang-scheduling including: +Initially, we created integration tests to ensure the basic functionalities of gang scheduling including: + - Pods linked to the non-existing workload are not scheduled - Pods get unblocked when workload is created and observed by scheduler - Pods are not scheduled if there is no space for the whole gang + +With Workload Scheduling Cycle and Delayed Preemption features, we will significantly expand test coverage to verify: + +- Pods referencing a `Workload` (both gang and basic policies) are correctly processed via the Workload Scheduling Cycle. +- `PodGroup` queuing ensures that all available members are retrieved and processed correctly. +- Deadlocks and livelocks do not occur when multiple gangs compete for resources or interleave with standard pods. +- Delayed Preemption works correctly for pod-by-pod (non-workload) scheduling. +- Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler determines the entire gang can fit, + otherwise, the cycle aborts with zero disruption. +- Failed pod groups are requeued correctly and retry successfully when resources become available. + +We will also benchmark the performance impact of these changes to measure: -In Beta, we will add tests to verify that deadlocks are not happening. +- The scheduling throughput of the workload scheduling, including gang and basic policies and preemptions. +- The performance impact on standard pod scheduling when there are many nominated pods, + for scenarios mentioned in the [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance). ##### e2e tests From 3045923f6b7afe80df5426e5441fad7d8794d464 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Thu, 15 Jan 2026 12:14:06 +0000 Subject: [PATCH 10/23] Resolve queueing strategy and feasibility plugin. List algorithm limitations. 
Make NNN a hard requirement. Apply comments --- .../4671-gang-scheduling/README.md | 235 +++++++++++------- .../4671-gang-scheduling/kep.yaml | 4 +- 2 files changed, 143 insertions(+), 96 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 8d4aa2699066..e5da086679aa 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -23,10 +23,11 @@ - [North Star Vision](#north-star-vision) - [GangScheduling Plugin](#gangscheduling-plugin) - [Future plans](#future-plans) - - [Scheduler Changes for Beta](#scheduler-changes-for-beta) + - [Scheduler Changes for v1.36](#scheduler-changes-for-beta) - [The Workload Scheduling Cycle](#the-workload-scheduling-cycle) - [Queuing and Ordering](#queuing-and-ordering) - [Scheduling Algorithm](#scheduling-algorithm) + - [Algorithm Limitations](#algorithm-limitations) - [Interaction with Basic Policy](#interaction-with-basic-policy) - [Delayed Preemption](#delayed-preemption) - [Workload-aware Preemption](#workload-aware-preemption) @@ -256,7 +257,7 @@ However, this impact is mitigated by several factors: the overall window of time where these nominations are active is expected to be short enough to prevent severe degradation. -The real impact will be verified hrough scalability tests (scheduler-perf benchmark). +The real impact will be verified through scalability tests (scheduler-perf benchmark). ## Design Details @@ -599,15 +600,15 @@ this problem with existing mechanisms (e.g. reserving resources via NominatedNod However, approval for this KEP is NOT an approval for this vision. We only sketch it to show that we see a viable path forward from the proposed design that will not require significant rework. 
-### Scheduler Changes for Beta +### Scheduler Changes for v1.36 -For the `Alpha` phase, we focused on plumbing the `Workload` API and implementing +For the `Alpha` phase in v1.35, we focused on plumbing the `Workload` API and implementing the `GangScheduling` plugin using simple barriers (`PreEnqueue` and `Permit`). While this satisfied the correctness requirement for "all-or-nothing" scheduling, it did not address performance or efficiency at scale, scheduling livelocks, nor did it solve the problem of partial preemption application. -For `Beta`, we propose introducing a **Workload Scheduling Cycle**. +For v1.36, we propose introducing a **Workload Scheduling Cycle**. This mechanism processes all Pods belonging to a single `PodGroup` in one batch, rather than attempting to schedule them individually in isolation using the traditional pod-by-pod approach. While introduction of this phase itself won't @@ -634,14 +635,19 @@ initiates the Workload Scheduling Cycle. Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate at the `PodGroup` instance level, i.e., each instance will be scheduled separately -in its own cycle. If new Pods belonging to an already scheduled `PodGroup` instance appear, +in its own cycle. + +If new Pods belonging to an already scheduled `PodGroup` instance +(i.e., one that already passed `WaitOnPemit`) appear, they are also processed via the Workload Scheduling Cycle, which takes the previously -scheduled Pods into consideration. +scheduled Pods into consideration. This is done for safety reasons to ensure +the PodGroup-level constraints are still satisfied. However, if the `PodGroup` is being processed, +these new Pods must wait for the ongoing pod group scheduling to be finished, before being considered. The cycle proceeds as follows: -1. The scheduler takes either pod group itself or its Pod representative from - the scheduling queue. 
If the pod group is unscheduled (even partially), it temporarily removes +1. The scheduler takes pod group from the scheduling queue. + If the pod group is unscheduled (even partially), it temporarily removes all group's pods from the queue for processing. The order of processing is determined by the queueing mechanism (see *Queuing and Ordering* below). @@ -658,7 +664,7 @@ The cycle proceeds as follows: scheduler's internal cache. Pods are then pushed to the active queue (restoring their original timestamps to ensure fairness) to pass through the standard scheduling and binding cycle, - which will consider the nomination. + which will consider and follow the nomination. * If `minCount` cannot be met (even after calculating potential preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to @@ -676,68 +682,36 @@ One such formula can be to set it to the lowest priority found within the pod gr what will be effectively the weakest link to determine if the whole pod group is schedulable and reduce unnecessary preemption attempts. -```md -<<[UNRESOLVED Queue Implementation Strategy]>> To ensure that we process the `PodGroup` instance at an appropriate time and don't starve other pods (including gang pods in the pod-by-pod scheduling phase) from being scheduled, we need to have a good queueing mechanism for pod groups. -There are several alternatives: -Alternative 0 (Keep current queueing and ordering): +We have decided to make the scheduling queue explicitly workload-aware. +The queue will support queuing `PodGroup` instances alongside individual Pods. -We can minimize changes by retaining the current queueing and ordering logic. -When a Pod is popped, the scheduler can check if it belongs to a `PodGroup` -requiring a Workload Scheduling Cycle. As we add scheduling priorities -for pod groups later, this alternative naturally evolves into Alternative 1. 
-* *Pros:* Fits the current architecture. Retains current reasoning about the - scheduling queue. Minimizes implementation effort. -* *Cons:* Might be problematic when some of the pod groups's pods are in the backoffQ - or unschedulablePods and need to be retrieved efficiently. - Makes it hard to further evolve the Workload Scheduling Cycle. - Observability, currently suited for pod-by-pod scheduling, may not - accurately reflect the state of the queue (e.g., pending gangs). - Likely harder to support future extensions and won't work well - if `PodGroup` becomes a separate top-level resource. - The pod group will be likely scheduled based on the highest priority member, - meaning the latter pod-by-pod cycles might be visibly delayed for lower priority Pods. +1. When Pods belonging to a `PodGroup` are added to the scheduler and pass the `PreEnqueue`, + they are initially stored in a dedicated internal data structure (tentatively named `workloadPods`) + rather than the standard active queue. -Alternative 1 (Modify sorting logic): +2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), + a `QueuedPodGroupInfo` object (analogous to `QueuedPodInfo`) is created + and injected into the main scheduling queue. -Modify the sorting logic within the existing `PriorityQueue` to put all pods -from a pod group one after another. -* *Pros:* Fits the current architecture. -* *Cons:* Might be problematic when some of the pod groups's pods are in the - backoffQ or unschedulablePods and need to be retrieved efficiently. - Makes it hard to further evolve the Workload Scheduling Cycle. - Would need to inject the workload priority into each of the Pods - or somehow apply the lowest pod's priority to the rest of the group. +3. The `scheduleOne` loop will pop the highest-priority item from the queue, + which may now be either a single Pod (triggering the standard cycle) + or a `PodGroup` (triggering the Workload Scheduling Cycle). 
-Alternative 2 (Store a PodGroup instance): +4. During a Workload Scheduling Cycle, all member Pods are retrieved from `workloadPods`. + Based on the cycle's outcome: + * **Success:** Pods are moved to the standard `activeQ` (with nominations set) + to proceed to the pod-by-pod scheduling soon. + * **Failure/Preemption:** Pods are returned to `workloadPods` or the unschedulable queue. + The `PodGroup` enters a backoff state and is eligible for retry only when + a relevant cluster event wakes up at least one of its member pods. -Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`. -This allows reusing existing queue logic while extending it to `PodGroups`. -All queued members would be stored in a new dara structure -and retrieved for the Workload Cycle when the `PodGroup` is popped. -* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size. - Reuses current logic for popping, enforcing backoff, and processing unschedulable entities. -* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as - queueable entities, which is non-trivial and might clutter the code. - -Alternative 3 (Dedicated PodGroup queue): - -Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods. -The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. -Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`. -* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority. - Can report dedicated logs and metrics with less confusion to the user. -* *Cons:* Significant and non-trivial architectural change to the scheduling queue - and `scheduleOne` loop. - -*Proposed:* Alternative 3 (Dedicated PodGroup queue). While this requires architectural change to the scheduling queue, -the effort involved in adding pod group queuing will be comparable to modifying the code for the previous alternatives. 
-This will also make the foundation for future WAS features and support workload priority by design.
-<<[/UNRESOLVED]>>
-```
+While this represents a significant architectural change to the scheduling
+queue and `scheduleOne` loop, it provides a clean separation of concerns and
+establishes a necessary foundation for future Workload Aware Scheduling features.
 
 #### Scheduling Algorithm
 
@@ -756,7 +730,8 @@ The list and configuration of plugins used by this algorithm will be the same as
 2. These sub-groups are sorted. Initially, we sort by the highest priority of the
    sub-group (assuming homogeneity enforces uniform sub-group priority).
    In the future, sorting may use the size of the sub-group (larger groups first) to
-   tackle the hardest placement problems early.
+   tackle the hardest placement problems early. Crucially, the ordering should be deterministic
+   and stable if the pod group state doesn't change.
    *This sorting can be done in the scheduler's cache earlier to optimize performance.*
 
 3. The scheduler iterates through the sorted sub-groups. It finds a feasible node
@@ -805,10 +780,10 @@ The list and configuration of plugins used by this algorithm will be the same as
    pushed directly to the active queue, and will soon attempt to be scheduled on their
    nominated nodes in their own, pod-by-pod cycles.
 
-   If a pod selects a different node than its nomination during the individual cycle, the
-   gang remains valid as long as `minCount` is satisfied globally (enforced at `WaitOnPermit`).
-   The `minCount` check can consider the number of pods that have passed the Workload Scheduling Cycle
-   to ensure that Pods are not waiting unnecessarily when some have been rejected
+   The Pod will be restricted to its nominated node during the individual cycle.
+   If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that
+   into consideration. 
The `minCount` check can consider the number of pods that have passed + the Workload Scheduling Cycle to ensure that Pods are not waiting unnecessarily when some have been rejected but other new pods have been added to the cluster. In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. @@ -823,45 +798,58 @@ The list and configuration of plugins used by this algorithm will be the same as and nominations for them are cleared to ensure the other workloads (pod groups) can be attempted on that place. See *Failure Handling*. -```md -<<[UNRESOLVED Enforcing minCount constraint in algorithm]>> -Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint -is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm -needs to know if this constraint is met to decide whether to commit the results. -We have two ways of verifying this: - -1. Explicit check in the algorithm: Hardcode the `minCount` check within the framework's logic. - This implies that Gang Scheduling becomes a core scheduler framework feature rather than - just a specific plugin. - -2. New Extension Point: Introduce a new extension point allowing plugins to validate the group's - scheduled pods. This would function similarly to a `Permit` check (likely requiring `Reserve` state) + Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint + is enforced at the plugin level. However, the proposed Workload Scheduling Cycle algorithm + needs to know if this constraint is met to decide whether to commit the results. + To verify this, a new extension point will be introduced, allowing plugins to validate the group's + scheduled pods. This will function similarly to a `Permit` check (likely requiring `Reserve` state) but without the suspension (`WaitOnPermit`) gate. 
Crucially, this extension should support two checks:
+
 * Validation: Check whether the currently scheduled pods meet the requirements,
 e.g., if the `minCount` pods from a pod group were successfully scheduled.
+
 * Feasibility: Given the number of pods that have already failed scheduling in this cycle,
 check whether it is still *possible* to meet the constraint. If not, the cycle should abort early
 to save time.
-<<[/UNRESOLVED]>>
-```
 
 While this algorithm might be suboptimal, it is a solid first step for ensuring we have
 a single-cycle workload scheduling phase. As long as PodGroups consist of homogeneous pods,
 opportunistic batching itself will provide significant improvements.
 Future features like Topology Aware Scheduling can further improve other subsets of use cases.
 
-Moreover, this default algorithm relies on specific sorting and may fail to find
+#### Algorithm Limitations
+
+The default algorithm proposed above relies on specific sorting and may fail to find
 a valid placement that could have been discovered by processing the group's pods
 in a different order. While resolving this limitation could be desirable,
 implementing a generalized solver for arbitrary constraints would introduce
 excessive complexity for the default implementation.
 The current proposal addresses the vast majority of standard use cases
-(homogeneous workloads). Future improvements for this should be delivered via specialized algorithms
-based on specific `PodGroup` constraints, such as Topology Aware Scheduling (TAS).
+(specifically homogeneous workloads). Future improvements for this should be delivered
+via specialized algorithms based on specific pod group constraints,
+such as Topology Aware Scheduling (TAS). 
+ +Since the scheduler cannot exhaustively analyze all possible placement permutations, +we will advise users via documentation regarding which pod group types +are well-supported and which scenarios are handled on a +best-effort basis (where a successful placement is not guaranteed, even if +one theoretically exists). + +In particular: +* For basic **homogeneous** pod groups without inter-pod dependencies, this + algorithm is expected to find a placement whenever one exists. +* For **heterogeneous** pod groups, finding a valid placement is not guaranteed. +* For pod groups with **inter-pod dependencies** (e.g., affinity/anti-affinity + or topology spreading rules), finding a valid placement is not guaranteed. + +Moreover, if a pod using these features is rejected by the Workload Scheduling Cycle, +its rejection message (exposed via Pod status) will explicitly indicate +that the rejection may be due to the use of features for which finding an existing +placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason. #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is -optional. In the `Beta` timeframe, this cycle will be applied to +optional. In the v1.36 timeframe, this cycle will be applied to `Basic` pod groups to leverage the batching performance benefits, but the "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. @@ -973,12 +961,14 @@ When the cycle fails, the scheduler rejects the entire group. 2. Backoff strategy Backoff mechanism has to be applied for a pod group similarly as we do for individual pods. -For Beta, we will apply the standard Pod backoff logic to the group. +Initially, we will apply the standard Pod backoff logic to the group. 
At the same time, we should consider increasing the maximum backoff duration for pod groups
+or potentially scaling it based on the number of pods within the group.
 The current default of 10 seconds has proven insufficient in large clusters,
-so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle can take a significant
-amount of time, retrying it too frequently risks starving individual pods.
+so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle
+can be computationally expensive, retrying it too frequently risks starving individual pods.
+Moreover, retries triggered by the Delayed Preemption feature may further strenghten the problem.

 3. Retries

@@ -991,10 +981,9 @@ While checking a single Pod does not guarantee the *whole* gang can fit,
 calculating gang-level schedulability inside the event handler can be difficult at the moment.
 Therefore, we optimistically retry the Workload Scheduling Cycle if any member's condition improves.

-It might be beneficial to retry the pod group without being triggered by any cluster event.
-Ideally, this would involve scrambling the pods and subgroups within the group that have the same priority.
-This could be useful because the pods could be scheduled without any cluster changes
-when considered in a different order.
+It might be beneficial to retry the pod group without being triggered by any cluster event,
+because a single Workload Scheduling Cycle cannot determine that a placement doesn't really exist,
+especially for heterogeneous workloads or inter-pod dependencies.

 ### Test Plan

@@ -1457,6 +1446,8 @@ However:

 ## Alternatives

+### API
+
 The longer version of this design describing the whole thought process of choosing the above
described approach can be found in the [extended proposal] document.
@@ -1516,6 +1507,62 @@ type PodGroup struct {
 }
 ```

+### Pod group queueing in scheduler
+
+In selecting the optimal pod group queuing mechanism, we evaluated several alternatives:
+
+Alternative 0 (Keep current queueing and ordering):
+
+We can minimize changes by retaining the current queueing and ordering logic.
+When a Pod is popped, the scheduler can check if it belongs to a `PodGroup`
+requiring a Workload Scheduling Cycle. As we add scheduling priorities
+for pod groups later, this alternative naturally evolves into Alternative 1.
+* *Pros:* Fits the current architecture. Retains current reasoning about the
+  scheduling queue. Minimizes implementation effort.
+* *Cons:* Might be problematic when some of the pod group's pods are in the backoffQ
+  or unschedulablePods and need to be retrieved efficiently.
+  Makes it hard to further evolve the Workload Scheduling Cycle.
+  Observability, currently suited for pod-by-pod scheduling, may not
+  accurately reflect the state of the queue (e.g., pending gangs).
+  Likely harder to support future extensions and won't work well
+  if `PodGroup` becomes a separate top-level resource.
+  The pod group will likely be scheduled based on the highest priority member,
+  meaning the later pod-by-pod cycles might be visibly delayed for lower priority Pods.
+
+Alternative 1 (Modify sorting logic):
+
+Modify the sorting logic within the existing `PriorityQueue` to put all pods
+from a pod group one after another.
+* *Pros:* Fits the current architecture.
+* *Cons:* Might be problematic when some of the pod group's pods are in the
+  backoffQ or unschedulablePods and need to be retrieved efficiently.
+  Makes it hard to further evolve the Workload Scheduling Cycle.
+  Would need to inject the workload priority into each of the Pods
+  or somehow apply the lowest pod's priority to the rest of the group.
+
+Alternative 2 (Store a PodGroup instance):
+
+Modify the scheduling queue's data structures to accept `QueuedPodGroupInfo` alongside `QueuedPodInfo`.
+This allows reusing existing queue logic while extending it to `PodGroups`.
+All queued members would be stored in a new data structure
+and retrieved for the Workload Cycle when the `PodGroup` is popped.
+* *Pros:* Makes it easier to obtain all pods in a group and reduces queue size.
+  Reuses current logic for popping, enforcing backoff, and processing unschedulable entities.
+* *Cons:* Requires adapting the scheduling queue to handle `PodGroups` as
+  queueable entities, which is non-trivial and might clutter the code.
+
+Alternative 3 (Dedicated PodGroup queue):
+
+Introduce a completely separate queue for PodGroups alongside the `activeQ` for Pods.
+The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp.
+Pods belonging to an enqueued PodGroup won't be allowed in the `activeQ`.
+* *Pros:* Clean separation of concerns. Can easily use the Workload scheduling priority.
+  Can report dedicated logs and metrics with less confusion to the user.
+* *Cons:* Significant and non-trivial architectural change to the scheduling queue
+  and `scheduleOne` loop.
+
+Ultimately, Alternative 3 (Dedicated PodGroup queue) was chosen as the best long-term solution.
+

 ## Infrastructure Needed (Optional)

@@ -968,7 +970,7 @@ or potentially scaling it based on the number of pods within the group.
 The current default of 10 seconds has proven insufficient in large clusters,
 so this might be the case for workloads. Crucially, because the Workload Scheduling Cycle
 can be computationally expensive, retrying it too frequently risks starving individual pods.
-Moreover, retries triggered by the Delayed Preemption feature may further strenghten the problem.
+Moreover, retries triggered by the Delayed Preemption feature may further exacerbate the problem.

 3.
Retries From 581613b9373b1640daf3e1b6fa88a5ef88c9001f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Fri, 16 Jan 2026 15:27:04 +0000 Subject: [PATCH 12/23] Apply review comments --- .../4671-gang-scheduling/README.md | 18 ++++++++++++------ .../4671-gang-scheduling/kep.yaml | 4 ++-- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 955e32c1ed55..78e00f79823d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -628,11 +628,8 @@ this will address requirement (5). We introduce a new phase in the main scheduling loop (`scheduleOne`). In the end-to-end Pod scheduling flow, it is planned to place this new phase *before* -the standard pod-by-pod scheduling cycle. - -When the scheduler pops a Pod from the active queue, it checks if that Pod -belongs to an unscheduled `PodGroup`. If so, the scheduler -initiates the Workload Scheduling Cycle. +the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from +the active queue, it initiates the Workload Scheduling Cycle. Since the `PodGroup` instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate @@ -644,7 +641,8 @@ If new Pods belonging to an already scheduled `PodGroup` instance they are also processed via the Workload Scheduling Cycle, which takes the previously scheduled Pods into consideration. This is done for safety reasons to ensure the PodGroup-level constraints are still satisfied. However, if the `PodGroup` is being processed, -these new Pods must wait for the ongoing pod group scheduling to be finished, before being considered. +these new Pods must wait for the ongoing pod group scheduling to be finished (pass `WaitOnPermit`), +before being considered. 
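The pop-and-dispatch flow described in the hunk above (the `scheduleOne` loop popping either a Pod or a `PodGroup` and routing it to the matching cycle) can be sketched roughly as follows; the toy types below are illustrative and are not the real `QueuedPodInfo`/`QueuedPodGroupInfo` scheduler types:

```go
package main

import "fmt"

// queueItem is a stand-in for whatever the scheduling queue pops.
type queueItem interface{ name() string }

type queuedPod struct{ pod string }

func (p queuedPod) name() string { return p.pod }

type queuedPodGroup struct{ group string }

func (g queuedPodGroup) name() string { return g.group }

// scheduleOne routes the popped item: PodGroups go through the Workload
// Scheduling Cycle first; plain Pods take the standard pod-by-pod cycle.
func scheduleOne(item queueItem) string {
	switch it := item.(type) {
	case queuedPodGroup:
		return "workload scheduling cycle for group " + it.group
	case queuedPod:
		return "pod-by-pod cycle for pod " + it.pod
	default:
		return "unknown item"
	}
}

func main() {
	fmt.Println(scheduleOne(queuedPodGroup{group: "training-job"}))
	fmt.Println(scheduleOne(queuedPod{pod: "web-0"}))
}
```

After a successful Workload Scheduling Cycle, the group's members would still flow through the pod-by-pod path individually, restricted to their nominated nodes.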
The cycle proceeds as follows:

@@ -717,6 +715,10 @@ establishes a necessary foundation for future Workload Aware Scheduling features

 #### Scheduling Algorithm

+*Note: The algorithm described below is a simplified default version based on baseline scheduling logic.
+It is expected to evolve to more effectively handle complex scenarios and specific features
+in future iterations.*
+
 The internal algorithm for placing the group utilizes the optimization defined in
 *Opportunistic Batching* ([KEP-5598](https://kep.k8s.io/5598)) for improved performance.
 The approach described below allows mitigating some restrictions of that feature, e.g.,

@@ -921,6 +923,8 @@ We will address it with what we call *delayed preemption* mechanism as following
   In other words, a different placement can be chosen in subsequent (workload) scheduling cycles
   only if it doesn't require additional preemptions or the previously chosen placement is no longer
   feasible (e.g. because higher priority pods were scheduled in the meantime).
+  This can be done by ignoring the pods with `deletionTimestamp` set in these preemption attempts
+  (when the previous preemption is ongoing for the preemptor).

 The rationale behind the above design is to maintain the current scheduling property where preemption
 doesn't result in a commitment for a particular placement. If a different possible placement appears

@@ -986,6 +990,8 @@ Therefore, we optimistically retry the Workload Scheduling Cycle if any member's

 It might be beneficial to retry the pod group without being triggered by any cluster event,
 because a single Workload Scheduling Cycle cannot determine that a placement doesn't really exist,
 especially for heterogeneous workloads or inter-pod dependencies.
+To avoid introducing subtle errors in the initial implementation,
+we can start by skipping the Queueing Hints mechanism and relying solely on the backoff time.
### Test Plan

diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
index 919ce6230533..0659f2a990d3 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
+++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml
@@ -38,8 +38,8 @@ latest-milestone: "v1.36"
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.35"
-  beta: "v1.37"
-  stable: "v1.39"
+  beta: "v1.36"
+  stable: "v1.38"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled

From e83f6d514b0f660e95cbb72f28eabc94623c0cfa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Mon, 19 Jan 2026 14:12:11 +0000
Subject: [PATCH 13/23] Apply comments

---
 .../4671-gang-scheduling/README.md | 39 +++++++++++++------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 78e00f79823d..7334fe81ff6b 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -642,7 +642,8 @@
 they are also processed via the Workload Scheduling Cycle, which takes the previously
 scheduled Pods into consideration. This is done for safety reasons to ensure the PodGroup-level
 constraints are still satisfied. However, if the `PodGroup` is being processed,
 these new Pods must wait for the ongoing pod group scheduling to be finished (pass `WaitOnPermit`),
-before being considered.
+before being considered. This simplifies preemption, as we can be sure the decision won't change
+while the previous attempt hasn't finished yet.
The queue will support queuing `PodGroup` instances alongside individual Pods. -1. When Pods belonging to a `PodGroup` are added to the scheduler and pass the `PreEnqueue`, - they are initially stored in a dedicated internal data structure (tentatively named `workloadPods`) - rather than the standard active queue. +1. When Pods belonging to a `PodGroup` are added to the scheduler, if a corresponding `QueuedPodGroupInfo` + is not yet present in the scheduling queue, it is created and enqueued. + This object will have an aggregated `PreEnqueue` check, evaluating conditions for all its members. + Crucially, the individual Pods themselves are **not** stored in any standard scheduling queue + data structure (active, backoff, or unschedulable) at this stage, but they are effectively managed + via the `QueuedPodGroupInfo`. 2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), - a `QueuedPodGroupInfo` object (analogous to `QueuedPodInfo`) is created - and injected into the main scheduling queue. + a `QueuedPodGroupInfo` object is moved to the activeQ, following the logic similar to individual pods. 3. The `scheduleOne` loop will pop the highest-priority item from the queue, which may now be either a single Pod (triggering the standard cycle) or a `PodGroup` (triggering the Workload Scheduling Cycle). -4. During a Workload Scheduling Cycle, all member Pods are retrieved from `workloadPods`. +4. During a Workload Scheduling Cycle, all member Pods are retrieved from the `QueuedPodGroupInfo`. Based on the cycle's outcome: * **Success:** Pods are moved to the standard `activeQ` (with nominations set) to proceed to the pod-by-pod scheduling soon. - * **Failure/Preemption:** Pods are returned to `workloadPods` or the unschedulable queue. - The `PodGroup` enters a backoff state and is eligible for retry only when - a relevant cluster event wakes up at least one of its member pods. 
+ * **Failure/Preemption:** The `QueuedPodGroupInfo` (containing the unschedulable pods) is returned + to the `unschedulablePodInfos` structure. The `PodGroup` enters a backoff state and is eligible + for retry only when a relevant cluster event wakes up at least one of its member pods. While this represents a significant architectural change to the scheduling queue and `scheduleOne` loop, it provides a clean separation of concerns and @@ -850,6 +853,16 @@ its rejection message (exposed via Pod status) will explicitly indicate that the rejection may be due to the use of features for which finding an existing placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason. +In addition to the above, for cases involving **intra-group dependencies** +(e.g., when the schedulability of one pod depends on another group member via inter-pod affinity), +this algorithm may fail to find a placement regardless of cluster state, +due to the deterministic processing order. + +Users will be advised that such dependencies are discouraged. However, they could mitigate this +by assigning a lower priority to the dependent pods. Since the algorithm processes higher-priority +pods first, this ensures that the required pods are scheduled earlier, +to satisfy the affinity rules of the subsequent dependent pods. + #### Interaction with Basic Policy For pod groups using the `Basic` policy, the Workload Scheduling Cycle is @@ -892,6 +905,10 @@ We will address it with what we call *delayed preemption* mechanism as following `Preempt`) that will be responsible for actuation. However, for now we don't see evidence for this being needed. + Relying on the actuation logic is optional for plugins. For example, + the DynamicResources plugin can still actuate its decision (claim deallocation) in the PostFilter phase. + However, any pod-based removals in other plugins should be delegated to the delayed actuation phase. + 3. 
For individual pods (not being part of a workload), we will adjust the scheduling framework
 implementation of `schedulingCycle` to actuate preemptions of returned victims if calling
 `PostFilter` plugins resulted in finding a feasible placement.

@@ -958,7 +975,7 @@ or a timeout occurs), the scheduler must handle the failure efficiently.

 1. Rejection

 When the cycle fails, the scheduler rejects the entire group.
-* All Pods in the group are moved back to the scheduling queue.
+* All Pods in the group are moved back to the scheduling queue (stored in the `unschedulablePodGroups` data structure).
 Their status is updated and an event with the failure reason is sent.
 * Crucially, any `.status.nominatedNodeName` entries set during the failed
 attempt (or from previous cycles) must be cleared. This ensures that the resources

From f68d82b753b59404fb4661daa9fa2a8c3ef87d53 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Tue, 20 Jan 2026 08:39:33 +0000
Subject: [PATCH 14/23] Apply comments

---
 keps/sig-scheduling/4671-gang-scheduling/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 7334fe81ff6b..1c69bfcc79cc 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -851,7 +851,11 @@ In particular:

 Moreover, if a pod using these features is rejected by the Workload Scheduling Cycle,
 its rejection message (exposed via Pod status) will explicitly indicate
 that the rejection may be due to the use of features for which finding an existing
-placement cannot be guaranteed, distinguishing it from a generic `Unschedulable` reason.
+placement cannot be guaranteed. This will be accompanied by a specific failure
+reason, distinguishing it from a generic `Unschedulable` condition.
This distinction +is particularly relevant for Cluster Autoscaler or Karpenter, which can act +differently based on the new reason. In addition to the above, for cases involving **intra-group dependencies** (e.g., when the schedulability of one pod depends on another group member via inter-pod affinity), From 97557b5f439ee581d1e420b3f84a1557d8e5a6b6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 08:16:33 +0000 Subject: [PATCH 15/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 1c69bfcc79cc..1a4e4bd79ddd 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -784,8 +784,8 @@ The list and configuration of plugins used by this algorithm will be the same as but cannot cause additional disruption to do so. * If preemptions are not needed: Pods are nominated to their chosen nodes, - pushed directly to the active queue, and will soon attempt to be scheduled - on their nominated nodes in their own, pod-by-pod cycles. + pushed directly to the active queue in the order they were evaluated in the Workload Scheduling Cycle. + They will soon attempt to be scheduled on their nominated nodes in their own, pod-by-pod cycles. Pod will be restricted to its nominated node during the individual cycle. If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that @@ -806,11 +806,10 @@ The list and configuration of plugins used by this algorithm will be the same as can be attempted on that place. See *Failure Handling*. Gang Scheduling is currently implemented as a plugin, meaning the `minCount` constraint - is enforced at the plugin level. 
However, the proposed Workload Scheduling Cycle algorithm
+  is enforced at the plugin level. The proposed Workload Scheduling Cycle algorithm
   needs to know if this constraint is met to decide whether to commit the results.
-  To verify this, a new extension point will be introduced, allowing plugins to validate the group's
-  scheduled pods. This will function similarly to a `Permit` check (likely requiring `Reserve` state)
-  but without the suspension (`WaitOnPermit`) gate. Crucially, this extension should support two checks:
+  To achieve this, we will reuse the existing `Permit` extension point,
+  but without the suspension phase (`WaitOnPermit`). Crucially, this check has to support two modes:
   * Validation: Check whether the currently scheduled pods meet the requirements,
     e.g., whether at least `minCount` pods from a pod group were successfully scheduled.
   * Feasibility: Given the number of pods that have already failed scheduling in this cycle,
     check whether it is still *possible* to meet the constraint.
     If not, the cycle should abort early to save time.
-
+
 While this algorithm might be suboptimal, it is a solid first step for ensuring
 we have a single-cycle workload scheduling phase. As long as PodGroups consist
 of homogeneous pods, opportunistic batching itself will provide significant
 improvements.
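Using the two modes described above, the cycle can stop as soon as the constraint becomes unsatisfiable instead of evaluating the rest of the group. The toy sketch below is not the framework code; all names are made up for illustration:

```go
package main

import "fmt"

// tryScheduleGroup walks the group's pods in order and aborts as soon as
// minCount can no longer be reached. schedule returns the chosen node,
// or "" if the pod is unschedulable.
func tryScheduleGroup(pods []string, minCount int, schedule func(string) string) bool {
	failed := 0
	for _, p := range pods {
		if schedule(p) == "" {
			failed++
			// Feasibility mode: abort early when minCount is out of reach.
			if len(pods)-failed < minCount {
				return false
			}
		}
	}
	// Validation mode: enough pods actually got a placement.
	return len(pods)-failed >= minCount
}

func main() {
	// Toy "cluster" with room for two pods only; minCount of 3 cannot be met.
	free := 2
	schedule := func(pod string) string {
		if free > 0 {
			free--
			return "node-a"
		}
		return ""
	}
	fmt.Println(tryScheduleGroup([]string{"p0", "p1", "p2", "p3"}, 3, schedule))
}
```

On failure, no results would be committed and no preemptions actuated, matching the all-or-nothing semantics described in this section.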
From 592b33f1a10e7c838b979ec8f332dfeca0efeeb7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 09:45:16 +0000 Subject: [PATCH 16/23] Remove Basic policy desiredCount from the KEP scope --- .../4671-gang-scheduling/README.md | 50 ++----------------- .../4671-gang-scheduling/kep.yaml | 4 -- 2 files changed, 3 insertions(+), 51 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 1a4e4bd79ddd..ed9a03b7b8bb 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -500,41 +500,6 @@ not be split into two. A `LeaderWorkerSet` is a good example of it, where a sing of a single leader and `N` workers and that forms a scheduling (and runtime unit), but workload as a whole may consist of a number of such replicas. -#### Basic Policy Extension - -While Gang Scheduling focuses on atomic, all-or-nothing scheduling, there is a significant class -of workloads that requires best-effort optimization without the strict blocking semantics of a gang. - -In the first alpha version of the Workload API, the `Basic` policy was a no-op. -We propose extending the `Basic` policy to accept a `desiredCount` field. -This feature will be gated behind a separate -feature gate (`WorkloadBasicPolicyDesiredCount`) to decouple it from the core Gang Scheduling graduation path. - -```go -// BasicSchedulingPolicy indicates that standard Kubernetes -// scheduling behavior should be used. -type BasicSchedulingPolicy struct { - // DesiredCount is the expected number of pods that will belong to this - // PodGroup. This field is a hint to the scheduler to help it make better - // placement decisions for the group as a whole. - // - // Unlike gang's minCount, this field does not block scheduling. 
If the number - // of available pods is less than desiredCount, the scheduler can still attempt - // to schedule the available pods, but will optimistically try to select a - // placement that can accommodate the future pods. - // - // +optional - DesiredCount *int32 -} -``` - -This field allows users to express their "true" workloads more easily -and enables the scheduler to optimize the placement of such pod groups by taking the desired state -into account. Ideally, the scheduler should prefer placements that can accommodate -the full `desiredCount`, even if not all pods are created yet. -When `desiredCount` is specified, the scheduler can delay scheduling the first Pod it sees -for a short amount of time in order to wait for more Pods to be observed. - ### Scheduler Changes The kube-scheduler will be watching for `Workload` objects (using informers) and will use them to map pods @@ -796,9 +761,9 @@ The list and configuration of plugins used by this algorithm will be the same as In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. Allowing it would complicate reasoning about the consistency of the Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary - (e.g., the nominated node is no longer valid), the gang will either time out - or be instantly rejected (when the `minCount` cannot be satisfied) at `WaitOnPermit` and all necessary preemptions - will be simulated again in the next Workload Scheduling Cycle. + (e.g., the nominated node is no longer valid), the gang will either be instantly rejected + (when the `minCount` cannot be satisfied) or time out (safety check) at `WaitOnPermit` + and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated during this cycle are discarded. Pods go through traditional failure handlers @@ -874,11 +839,6 @@ optional. 
In the v1.36 timeframe, this cycle will be applied to "all-or-nothing" (`minCount`) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible. -If the `Basic` policy has `desiredCount` set, the Workload Scheduling Cycle -may utilize this value to simulate the full group size during feasibility checks. -Note that the implementation of this specific logic might follow in a Beta stage -of this API field. - #### Delayed Preemption A critical requirement for moving Gang Scheduling to Beta is the integration with *Delayed Preemption*, @@ -1203,10 +1163,6 @@ This section must be completed when targeting alpha to a release. - Feature gate name: DelayedPreemption - Components depending on the feature gate: - kube-scheduler - - Feature gate name: WorkloadBasicPolicyDesiredCount - - Components depending on the feature gate: - - kube-apiserver - - kube-scheduler - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control diff --git a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml index 0659f2a990d3..c945bcd66d79 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/kep.yaml +++ b/keps/sig-scheduling/4671-gang-scheduling/kep.yaml @@ -54,10 +54,6 @@ feature-gates: - name: DelayedPreemption components: - kube-scheduler - - name: WorkloadBasicPolicyDesiredCount - components: - - kube-apiserver - - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From 287ec804b53e40c990db503a484ebbba5a1cb62c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 10:57:12 +0000 Subject: [PATCH 17/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index ed9a03b7b8bb..4ff27cbebe9f 100644 --- 
a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -585,7 +585,7 @@ to process the entire gang together. The single scheduling cycle, together with blocking resources using nomination, will address requirement (3). -We will also introduce delayed preemption (described in [KEP-5710](https://kep.k8s.io/5711)). +We will also introduce [Delayed Preemption](#delayed-preemption). Together with the introduction of a dedicated Workload Scheduling Cycle, this will address requirement (5). From b6c6d4c47636e0c5b92744bc72ad88bc38411fc5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 27 Jan 2026 11:33:41 +0000 Subject: [PATCH 18/23] Update toc --- keps/sig-scheduling/4671-gang-scheduling/README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 4ff27cbebe9f..7755ab2446fc 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -18,7 +18,6 @@ - [Naming](#naming) - [Associating Pod into PodGroups](#associating-pod-into-podgroups) - [API](#api) - - [Basic Policy Extension](#basic-policy-extension) - [Scheduler Changes](#scheduler-changes) - [North Star Vision](#north-star-vision) - [GangScheduling Plugin](#gangscheduling-plugin) From bcc4ade5aecbb54696aaa6a308e1b7c31fddfcf0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Wed, 28 Jan 2026 17:25:17 +0000 Subject: [PATCH 19/23] Apply comments --- keps/sig-scheduling/4671-gang-scheduling/README.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 7755ab2446fc..52787522c384 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -237,7 +237,7 @@ 
usecases. You can read more about it in the [extended proposal] document. #### NominatedNodeName impact on filtering performance Using `.status.nominatedNodeName` as an output of the Workload Scheduling Cycle -can impact the performance of the standard pod-by-pod scheduling cycle. +can impact the performance of the standard pod-by-pod scheduling cycle for all other pods. Whenever the scheduler filters a node, it must temporarily add nominated pods (with equal or higher priority) to the cached NodeInfo. In large clusters, the number of such operations multiplied by the scheduling throughput can yield to a visible overhead. @@ -248,6 +248,8 @@ having to consider such nomination also increases. However, this impact is mitigated by several factors: * Nominations are temporary. As soon as workload-scheduled pods pass their individual scheduling cycle and are assumed, what cleans the in-memory nominations. +* In case the nominations are no longer feasible, + the gang gets rejected as soon as the scheduler determines this. * For the workload pods themselves, the performance impact is negligible. They will typically only execute filters for the single node they are nominated to, rather than evaluating the entire cluster. @@ -759,9 +761,9 @@ The list and configuration of plugins used by this algorithm will be the same as In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. Allowing it would complicate reasoning about the consistency of the - Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary + Workload Scheduling Cycle and Workload-Aware Preemption. 
If preemption is necessary
   (e.g., the nominated node is no longer valid), the gang will either be instantly rejected
-  (when the `minCount` cannot be satisfied) or time out (safety check) at `WaitOnPermit`
+  (when the `minCount` cannot be satisfied) or time out (safety check, in case a bug appears) at `WaitOnPermit`
   and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle.

 * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated
 during this cycle are discarded. Pods go through traditional failure handlers

@@ -1031,10 +1033,11 @@ With Workload Scheduling Cycle and Delayed Preemption features, we will signific
 - Pods referencing a `Workload` (both gang and basic policies) are correctly processed via the Workload Scheduling Cycle.
 - `PodGroup` queuing ensures that all available members are retrieved and processed correctly.
 - Deadlocks and livelocks do not occur when multiple gangs compete for resources or interleave with standard pods.
-- Delayed Preemption works correctly for pod-by-pod (non-workload) scheduling.
+- Delayed Preemption feature doesn't break pod-by-pod (non-workload) scheduling.
 - Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler
 determines the entire gang can fit; otherwise, the cycle aborts with zero disruption.
 - Failed pod groups are requeued correctly and retry successfully when resources become available.
+- Gang is rejected if pod-by-pod scheduling cannot follow a nomination. All other nominations should also be cleared.
We will also benchmark the performance impact of these changes to measure: From 69fe3361c49826b231595472e68642b3c78f89db Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Mon, 2 Feb 2026 14:30:42 +0000 Subject: [PATCH 20/23] Update the KEP with a decision to skip pod-by-pod scheduling phase after workload cycle --- .../4671-gang-scheduling/README.md | 95 ++++++------------- 1 file changed, 31 insertions(+), 64 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index 52787522c384..c98709efde3d 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -234,33 +234,20 @@ We try to mitigate it by an extensive analysis of usecases and already sketching how we envision the direction in which the API will need to evolve to support further usecases. You can read more about it in the [extended proposal] document. -#### NominatedNodeName impact on filtering performance - -Using `.status.nominatedNodeName` as an output of the Workload Scheduling Cycle -can impact the performance of the standard pod-by-pod scheduling cycle for all other pods. -Whenever the scheduler filters a node, it must temporarily add nominated pods -(with equal or higher priority) to the cached NodeInfo. In large clusters, -the number of such operations multiplied by the scheduling throughput can yield to a visible overhead. -If the latency between the end of the Workload Scheduling Cycle -and the actual processing of those pods is high, the number of unrelated pods -having to consider such nomination also increases. - -However, this impact is mitigated by several factors: -* Nominations are temporary. As soon as workload-scheduled pods pass - their individual scheduling cycle and are assumed, what cleans the in-memory nominations. 
-* In case the nominations are no longer feasible, - the gang gets rejected as soon as the scheduler determines this. -* For the workload pods themselves, the performance impact is negligible. - They will typically only execute filters for the single node they are nominated to, - rather than evaluating the entire cluster. -* These pods are expected to be retried quickly after the Workload Scheduling Cycle because - their initial timestamps are preserved. This places them near the head of the active queue, - minimizing the duration they remain in the "nominated but not assumed" state. -* While higher-priority or long-standing (equal priority) pods might interleave and be scheduled before the gang pods, - the overall window of time where these nominations are active is expected to be short enough - to prevent severe degradation. - -The real impact will be verified through scalability tests (scheduler-perf benchmark). +#### Exacerbating the race window by proceeding directly to binding + +Since the entire Workload Scheduling Cycle operates on a single cluster snapshot, +a long-running cycle means decisions are based on snapshotted state that may become stale. +This implies that if the cluster state changes in the meantime +(e.g., a Node suffers a hardware failure or is deleted), +the binding phase could fail for some pods in the workload, potentially causing the entire gang to fail. + +However, assuming all scheduling decisions go through kube-scheduler, +the primary source of race conditions is external infrastructure events (e.g., Node health changes). +While this is a valid concern, this race window exists in the standard scheduling cycle as well. +Although the Workload Scheduling Cycle extends this window, +the propagation latency of Node status updates or deletions is typically non-trivial, +meaning the marginal increase in risk is acceptable compared to the benefits of atomic scheduling. ## Design Details @@ -593,7 +580,7 @@ this will address requirement (5). 
#### The Workload Scheduling Cycle We introduce a new phase in the main scheduling loop (`scheduleOne`). In the -end-to-end Pod scheduling flow, it is planned to place this new phase *before* +end-to-end Pod scheduling flow, it is planned to place this new phase instead of the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from the active queue, it initiates the Workload Scheduling Cycle. @@ -626,12 +613,11 @@ The cycle proceeds as follows: 4. Outcome: * If the group (i.e., at least `minCount` Pods) can be placed, - these Pods have the `.status.nominatedNodeName` set. - They are then effectively "reserved" on those nodes in the - scheduler's internal cache. Pods are then pushed to the - active queue (restoring their original timestamps to ensure fairness) - to pass through the standard scheduling and binding cycle, - which will consider and follow the nomination. + these Pods proceed directly to the pod-by-pod binding cycle with their selected nodes. + these Pods proceed to the binding bycle with their selected nodes. + * In case preemption is required, the PodGroup is moved back to the scheduling queue + to wait for the preemption to take effect. This requires a subsequent + Workload Scheduling Cycle to verify that the released resources make the placement feasible. * If `minCount` cannot be met (even after calculating potential preemptions), the scheduler considers the `PodGroup` unschedulable. Standard backoff logic applies (see *Failure Handling*), and Pods are returned to @@ -650,8 +636,8 @@ what will be effectively the weakest link to determine if the whole pod group is and reduce unnecessary preemption attempts. To ensure that we process the `PodGroup` instance at an appropriate time and -don't starve other pods (including gang pods in the pod-by-pod scheduling phase) -from being scheduled, we need to have a good queueing mechanism for pod groups. 
+don't starve other pods from being scheduled, we need to have a good queueing mechanism +for pod groups. We have decided to make the scheduling queue explicitly workload-aware. The queue will support queuing `PodGroup` instances alongside individual Pods. @@ -672,8 +658,7 @@ The queue will support queuing `PodGroup` instances alongside individual Pods. 4. During a Workload Scheduling Cycle, all member Pods are retrieved from the `QueuedPodGroupInfo`. Based on the cycle's outcome: - * **Success:** Pods are moved to the standard `activeQ` (with nominations set) - to proceed to the pod-by-pod scheduling soon. + * **Success:** Pods are moved directly to the binding cycle. * **Failure/Preemption:** The `QueuedPodGroupInfo` (containing the unschedulable pods) is returned to the `unschedulablePodInfos` structure. The `PodGroup` enters a backoff state and is eligible for retry only when a relevant cluster event wakes up at least one of its member pods. @@ -749,22 +734,14 @@ The list and configuration of plugins used by this algorithm will be the same as can be scheduled in a different location if resources become available earlier, but cannot cause additional disruption to do so. - * If preemptions are not needed: Pods are nominated to their chosen nodes, - pushed directly to the active queue in the order they were evaluated in the Workload Scheduling Cycle. - They will soon attempt to be scheduled on their nominated nodes in their own, pod-by-pod cycles. + * If preemptions are not needed: Pods proceed directly to their binding cycles + using the nodes selected during the Workload Scheduling Cycle. - Pod will be restricted to its nominated node during the individual cycle. - If the node is unavailable, the pod will remain unschedulable and the `WaitOnPermit` gate will take that - into consideration. 
The `minCount` check can consider the number of pods that have passed - the Workload Scheduling Cycle to ensure that Pods are not waiting unnecessarily when some have been rejected - but other new pods have been added to the cluster. - - In the pod-by-pod cycle, preemption initiated by the workload pods will be forbidden. - Allowing it would complicate reasoning about the consistency of the - Workload Scheduling Cycle and Workload-Aware Preemption. If preemption is necessary, - (e.g., the nominated node is no longer valid), the gang will either be instantly rejected - (when the `minCount` cannot be satisfied) or time out (safety check, in case a bug appears) at `WaitOnPermit` - and all necessary preemptions will be simulated again in the next Workload Scheduling Cycle. + The `WaitOnPermit` gate is retained to ensure that the `minCount` pods are successfully + admitted before binding occurs. Additionally, the `minCount` check can consider + the number of pods that have passed the Workload Scheduling Cycle to ensure + that Pods do not wait unnecessarily if some have been rejected while new pods + have been added to the cluster. * If `schedulableCount < minCount`, the cycle fails. Preemptions computed but not actuated during this cycle are discarded. Pods go through traditional failure handlers @@ -913,13 +890,6 @@ in the meantime (e.g. due to other pods terminating or new nodes appearing), sub attempts may pick it up, improving the end-to-end scheduling latency. Returning pods to scheduling queue if these need to wait for preemption to become schedulable maintains that property. -We acknowledge the two limitations of the above approach: (a) dependency on the introduction of -Workload Scheduling Cycle (delayed preemption will not work if workload pods will not be processed -by Workload Scheduling Cycle) and (b) the fact that the placement computed in -Workload Scheduling Cycle may be invalidated in pod-by-pod scheduling later. 
-However, those features should be used together, -and the simplicity of the approach and target architecture outweigh these limitations. - #### Workload-aware Preemption Workload-aware preemption ([KEP-5710](https://kep.k8s.io/5710)) aims to @@ -1037,13 +1007,10 @@ With Workload Scheduling Cycle and Delayed Preemption features, we will signific - Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler determines the entire gang can fit, otherwise, the cycle aborts with zero disruption. - Failed pod groups are requeued correctly and retry successfully when resources become available. -- Gang is rejected if pod-by-pod scheduling cannot follow a nomination. All other nominations should be also cleared. We will also benchmark the performance impact of these changes to measure: -- The scheduling throughput of the workload scheduling, including gang and basic policies and preemptions. -- The performance impact on standard pod scheduling when there are many nominated pods, - for scenarios mentioned in the [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance). +- The scheduling throughput of the workload scheduling, including gang and basic policies, and preemptions. ##### e2e tests From 1ca8c1f0c9bb7a20b7e9ca775b6825793de5ebdc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 3 Feb 2026 08:01:56 +0000 Subject: [PATCH 21/23] Apply comments --- .../4671-gang-scheduling/README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index c98709efde3d..c5e5a252c1a7 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -579,10 +579,11 @@ this will address requirement (5). 
#### The Workload Scheduling Cycle
 
-We introduce a new phase in the main scheduling loop (`scheduleOne`). In the
-end-to-end Pod scheduling flow, it is planned to place this new phase instead of
-the standard pod-by-pod scheduling cycle. When the loop pops a `PodGroup` from
-the active queue, it initiates the Workload Scheduling Cycle.
+We introduce a new phase in the main scheduling loop (`scheduleOne`).
+This phase replaces the standard pod-by-pod scheduling cycle for all Pods
+belonging to a `PodGroup`. This means that these individual Pods do not enter
+the standard scheduling queue for independent processing. Instead, when the loop pops a
+`PodGroup` from the active queue, it initiates the Workload Scheduling Cycle.
 
 Since the `PodGroup` instance (defined by the group name and replica key)
 is the effective scheduling unit, the Workload Scheduling Cycle will operate
@@ -601,9 +602,8 @@ while the previous attempt hasn't finished yet.
 The cycle proceeds as follows:
 
 1. The scheduler takes pod group from the scheduling queue.
-   If the pod group is unscheduled (even partially), it temporarily removes
-   all group's pods from the queue for processing. The order of processing
-   is determined by the queueing mechanism (see *Queuing and Ordering* below).
+   The retrieved object contains the list of all pending pods belonging to this group.
+   The order of processing is determined by the queueing mechanism (see *Queuing and Ordering* below).
 
 2. A single cluster state snapshot is taken for the entire group operation
    to ensure consistency during the cycle.
@@ -613,8 +613,7 @@ The cycle proceeds as follows:
 
 4. Outcome:
    * If the group (i.e., at least `minCount` Pods) can be placed,
-     these Pods proceed directly to the pod-by-pod binding cycle with their selected nodes.
-     these Pods proceed to the binding bycle with their selected nodes.
+     these Pods proceed directly to the binding cycle with their selected nodes.
* In case preemption is required, the PodGroup is moved back to the scheduling queue to wait for the preemption to take effect. This requires a subsequent Workload Scheduling Cycle to verify that the released resources make the placement feasible. @@ -646,7 +645,7 @@ The queue will support queuing `PodGroup` instances alongside individual Pods. is not yet present in the scheduling queue, it is created and enqueued. This object will have an aggregated `PreEnqueue` check, evaluating conditions for all its members. Crucially, the individual Pods themselves are **not** stored in any standard scheduling queue - data structure (active, backoff, or unschedulable) at this stage, but they are effectively managed + data structure (active, backoff, or unschedulable), but they are effectively managed via the `QueuedPodGroupInfo`. 2. Once the number of accumulated Pods meets the scheduling requirements (e.g., `minCount`), From ae9b3a3456d720b122af76b69c42efae00241875 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Maciej=20Skocze=C5=84?= Date: Tue, 3 Feb 2026 08:03:11 +0000 Subject: [PATCH 22/23] Update toc --- keps/sig-scheduling/4671-gang-scheduling/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md index c5e5a252c1a7..1700a15c3d82 100644 --- a/keps/sig-scheduling/4671-gang-scheduling/README.md +++ b/keps/sig-scheduling/4671-gang-scheduling/README.md @@ -13,7 +13,7 @@ - [Story 2: Gang-scheduling of a custom workload](#story-2-gang-scheduling-of-a-custom-workload) - [Risks and Mitigations](#risks-and-mitigations) - [The API needs to be extended in an unpredictable way](#the-api-needs-to-be-extended-in-an-unpredictable-way) - - [NominatedNodeName impact on filtering performance](#nominatednodename-impact-on-filtering-performance) + - [Exacerbating the race window by proceeding directly to 
binding](#exacerbating-the-race-window-by-proceeding-directly-to-binding)
 - [Design Details](#design-details)
   - [Naming](#naming)
   - [Associating Pod into PodGroups](#associating-pod-into-podgroups)

From 4c8bcd9120ac60b643dc40df754709878e52bbf6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Maciej=20Skocze=C5=84?=
Date: Wed, 4 Feb 2026 15:23:41 +0000
Subject: [PATCH 23/23] Add a paragraph about requirement of consistent
 schedulerName

---
 keps/sig-scheduling/4671-gang-scheduling/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/keps/sig-scheduling/4671-gang-scheduling/README.md b/keps/sig-scheduling/4671-gang-scheduling/README.md
index 1700a15c3d82..edc77311628a 100644
--- a/keps/sig-scheduling/4671-gang-scheduling/README.md
+++ b/keps/sig-scheduling/4671-gang-scheduling/README.md
@@ -808,6 +808,12 @@ by assigning a lower priority to the dependent pods. Since the algorithm process
 pods first, this ensures that the required pods are scheduled earlier, to satisfy
 the affinity rules of the subsequent dependent pods.
 
+All pods belonging to a single pod group must share the same `.spec.schedulerName`.
+Divergent scheduler names would complicate reasoning about placement decisions
+and make future pod group-based constraints more difficult to manage.
+The scheduler will validate this condition: if a mismatch is detected,
+all of the pod group's pods will be rejected as unschedulable.
+
 #### Interaction with Basic Policy
 
 For pod groups using the `Basic` policy, the Workload Scheduling Cycle is