diff --git a/keps/prod-readiness/sig-scheduling/5732.yaml b/keps/prod-readiness/sig-scheduling/5732.yaml new file mode 100644 index 000000000000..2afe6ab584e8 --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5732.yaml @@ -0,0 +1,3 @@ +kep-number: 5732 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-scheduling/5732-topology-aware-workload-scheduling/README.md b/keps/sig-scheduling/5732-topology-aware-workload-scheduling/README.md new file mode 100644 index 000000000000..2dee3d6b43ae --- /dev/null +++ b/keps/sig-scheduling/5732-topology-aware-workload-scheduling/README.md @@ -0,0 +1,1005 @@ +# KEP-5732: Topology-aware workload scheduling + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1: AI Training in a Single Rack](#story-1-ai-training-in-a-single-rack) + - [Story 2: Workload using Interconnected DRA Devices](#story-2-workload-using-interconnected-dra-devices) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Workload API Changes](#workload-api-changes) + - [Basic and Gang Policy Extension](#basic-and-gang-policy-extension) + - [Scheduling Framework Extensions](#scheduling-framework-extensions) + - [1. Data Structures](#1-data-structures) + - [2. New Plugin Interfaces](#2-new-plugin-interfaces) + - [Scheduling Algorithm Phases](#scheduling-algorithm-phases) + - [Phase 1: Candidate Placement Generation](#phase-1-candidate-placement-generation) + - [Phase 2: Pod-Level Filtering and Feasibility Check](#phase-2-pod-level-filtering-and-feasibility-check) + - [Phase 3: Placement Scoring and Selection](#phase-3-placement-scoring-and-selection) + - [Scheduler Plugins](#scheduler-plugins) + - [Beta Extensions](#beta-extensions) + - [Potential Future Extensions](#potential-future-extensions) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Pod Inter-Affinities](#pod-inter-affinities) + - [Standalone Schedulers (e.g., Volcano)](#standalone-schedulers-eg-volcano) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This KEP describes the architectural design and implementation details for +integrating a Topology-Aware and DRA-Aware workload scheduling algorithm into +the Kubernetes kube-scheduler to address the complex placement requirements of +modern, high-performance distributed applications. + +The proposed topology algorithm leverages the workload-oriented scheduling +lifecycle introduced in KEP-4671, rather than fundamentally altering the scheduling +loop itself. It extends this foundation by enabling the evaluation of scheduling +options within specific "Placements" (subsets of the cluster). These Placements +represent candidate domains (sets of +nodes or DRA resources) where the entire workload is theoretically feasible. The +scheduler then simulates the placement of the full group of pods within these +domains, utilizing existing filtering and scoring logic to ensure high-fidelity +decisions before committing resources. + +This design introduces specific extensions to the Kubernetes Workload API to +support `TopologyConstraints` and `DRAConstraints`, defines new interfaces +within the Scheduling Framework (`PlacementGeneratorPlugin`, `PlacementStatePlugin`, +`PlacementScorerPlugin`), and details the algorithmic flow required to schedule Pod +Groups while maintaining compatibility with the scheduler's existing ecosystem. + +## Motivation + +Distributed workloads, particularly those driving the current AI/ML era, often +require high-bandwidth and low-latency communication between multiple pods to +function efficiently. While the [KEP-4671: Workload API](https://kep.k8s.io/4671) +makes the first step towards managing these applications as cohesive units, it +primarily establishes the API structure. 
For workloads sensitive to inter-pod +communication, simply grouping pods is insufficient; their physical placement +within the cluster's network topology is a decisive factor in their performance. + +In this KEP, we propose an algorithm for topology-aware and DRA-aware scheduling +that operates directly within the Kubernetes kube-scheduler. The core objective +is to ensure that pods belonging to a Workload are co-located within optimal +topological domains - such as specific racks or blocks - or are bound to shared +Dynamic Resource Allocation (DRA) devices that require cohesive management. +Without this level of precision, workloads may be fragmented across disparate +network domains, drastically degrading performance and wasting the potential of +expensive hardware. + +Given the economics of high-performance accelerators and network infrastructure, +maximizing application performance and resource utilization is a primary goal +for users. Achieving this requires intelligent placement decisions that +understand the physical constraints of the cluster. However, the default +scheduler's current pod-centric logic lacks the native mechanisms to efficiently +resolve these complex group-level constraints during the scheduling cycle. + +Topology-aware scheduling is not a new concept and is currently addressed by +external admission control systems like Kueue or alternative schedulers like +Volcano. However, relying on external admission controllers decouples the +topology decision from the scheduler's core logic, while alternative schedulers +introduce operational complexity. We believe that embedding topology and DRA +awareness deeply into the kube-scheduler is critical enough to warrant +standardization. This integration allows the algorithm to leverage the full +fidelity of the scheduler's existing pod-level filtering and scoring plugins, +ensuring highly accurate feasibility checks and placement outcomes without the +need for external dependencies. + +### Goals + +- To enhance kube-scheduler to be able to perform topology-aware and DRA-aware + scheduling for multi-pod workloads, as defined by the Workload API + ([KEP-4671](https://kep.k8s.io/4671)). +- To optimize the placement of distributed workloads by co-locating pods based + on network topology and DRA resource availability. +- To introduce new extension points and phases within the Kubernetes scheduler + framework to support the concept of "Placements" (candidate sets of nodes + and DRA resources). +- To define the required changes to the Workload API (KEP-4671) to support + Topology scheduling constraints. +- To leverage the scheduler's existing pod-level filtering and scoring logic + within the evaluation of each Placement. +- To provide a flexible framework extensible by plugins for various topology + sources (e.g., node labels) and resource types (e.g., DRA). + +### Non-Goals + +- To define the required changes to the Workload API (KEP-4671) to support + ResourceClaims for DRA-aware workload scheduling. These changes will be + proposed in a separate KEP: + [KEP-5729: DRA: ResourceClaim Support for Workloads](https://github.com/kubernetes/enhancements/pull/5736) +- To replace the functionality of external workload queueing and admission + control systems like Kueue. This proposal focuses on the in-scheduler + placement decision for a single Workload at a time. +- To implement Workload-level queueing, fairness, or resource quotas within + kube-scheduler. 
+- To handle all aspects of the workload lifecycle management beyond
+  scheduling.
+- To implement Workload-level preemption logic.
+- To integrate with cluster autoscaling mechanisms in this phase.
+- To support complex multi-PodSet dependency resolution with backtracking or
+  parallel processing in the initial version.
+- To automatically discover network topology; the mechanisms rely on topology
+  information being present (e.g., via node labels or DRA ResourceSlices).
+
+## Proposal
+
+This proposal introduces an API to define constraints on a PodGroup (a
+collection of pods within a Workload) requiring it to be scheduled onto a
+specific subset of nodes or resources.
+
+We support two fundamental types of constraints:
+
+1. **Topology Constraint (Node Label Co-location)**: Ensures all pods in a
+   PodGroup are placed onto nodes sharing a common topological characteristic
+   (e.g., same rack), defined by a specific node label.
+
+2. **DRA Constraint (Shared Dynamic Resource Allocation)**: Ensures all pods in a
+   PodGroup bind to a single DRA claim fulfilled from a single, shared,
+   co-located resource (e.g., interconnected network interfaces or
+   accelerators).
+
+The scheduler is extended to interpret these new PodGroup-level scheduling
+constraints and, similarly to how it evaluates the available scheduling options
+when placing individual pods on nodes, to find a "Placement" for the PodGroup
+among the feasible options (subsets of nodes and DRA resources) that satisfies
+them.
+
+### User Stories (Optional)
+
+#### Story 1: AI Training in a Single Rack
+
+As a data scientist, I want to run a distributed training job where all pods
+need to be located in the same server rack to minimize latency. I define a
+`TopologyConstraint` on the Workload's PodGroup specifying the rack topology
+label. The scheduler identifies a rack with sufficient capacity and schedules
+all pods there at once.
+
+#### Story 2: Workload using Interconnected DRA Devices
+
+As a cluster administrator, I want to schedule a workload that requires a set of
+specialized accelerators that are physically interconnected. I use a
+`DRAConstraint` targeting a specific `ResourceClaimTemplate`. The scheduler
+finds a set of DRA resources (ResourceSlice) that are co-located and binds the
+workload's pods to them.
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+- **Scheduling Latency:** Evaluating multiple placements involves running
+  filter/score plugins multiple times (multiple attempts to schedule a PodGroup
+  considering all topology options).
+
+  - **Mitigation:** Implement pre-filtering optimizations to reject infeasible
+    placements early based on aggregate resource availability.
+
+- **Complexity of Pod Group Scheduling:** Scheduling heterogeneous Pod Groups
+  can be complex.
+
+  - **Mitigation:** The initial version supports sequential processing of pods
+    within a PodGroup, avoiding complex backtracking or parallel processing
+    in the alpha release.
+
+## Design Details
+
+### Workload API Changes
+
+The Workload API (KEP-4671) will be extended to allow specifying group-level
+scheduling constraints. An optional `SchedulingConstraints` field is added to
+the `PodGroup` spec.
+
+```go
+// PodGroup (definition from KEP-4671, with additions)
+type PodGroup struct {
+	Name *string
+
+	// SchedulingConstraints defines group-level scheduling requirements,
+	// including topology.
+	SchedulingConstraints *PodGroupSchedulingConstraints
+}
+
+// PodGroupSchedulingConstraints holds the scheduling constraints for the PodGroup.
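+// For illustration only (not part of the proposed API): a PodGroup that must be
+// colocated within a single rack could, for example, set
+//
+//	SchedulingConstraints: &PodGroupSchedulingConstraints{
+//		TopologyConstraints: []TopologyConstraint{{Level: "topology.kubernetes.io/rack"}},
+//	}
+//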
+type PodGroupSchedulingConstraints struct { + // TopologyConstraints specifies desired topological placements for all pods + // within this PodGroup. + TopologyConstraints []TopologyConstraint +} + +// TopologyConstraint describes a desired topological colocation for all pods in the PodGroup. +type TopologyConstraint struct { + // Level specifies the key of the node label representing the topology domain. + // All pods within the PodGroup must be colocated within the same domain instance. + // Different replicas of the PodGroup can land on different domain instances. + // Examples: "topology.kubernetes.io/rack" + Level string +} +``` + +The Workload API changes for DRA-aware scheduling, including the definition of +DRA constraints, are out of scope for the alpha version of this KEP. These changes +will be defined in a separate KEP: +[KEP-5729: DRA: ResourceClaim Support for Workloads](https://github.com/kubernetes/enhancements/pull/5736). + +Note: For the initial alpha scope, only a single TopologyConstraint will be +supported. + +#### Basic and Gang Policy Extension + +In the first alpha version of the Workload API, the `Basic` policy was a no-op. +We propose extending the `Basic` and `Gang` policies to accept a `desiredCount` +field. This field serves as a scheduler hint to improve placement decisions +without imposing hard scheduling constraints. + +This feature will be gated behind a separate feature gate +(`PodGroupDesiredCount`) to decouple it from the core Gang Scheduling +and Topology Aware Scheduling features. + +**1. Basic Policy Update** + +We introduce `desiredCount` to the `Basic` policy to allow users to signal the +expected group size for optimization purposes. + +```go +// BasicSchedulingPolicy indicates that standard Kubernetes +// scheduling behavior should be used. +type BasicSchedulingPolicy struct { + // DesiredCount is the expected number of pods that will belong to this + // PodGroup. This field is a hint to the scheduler to help it make better + // placement decisions for the group as a whole. + // + // Unlike gang's minCount, this field does not block scheduling. If the number + // of available pods is less than desiredCount, the scheduler can still attempt + // to schedule the available pods, but will optimistically try to select a + // placement that can accommodate the future pods. + // + // +optional + DesiredCount *int32 +} +``` + +**2. Gang Policy Update** + +We similarly extend the `Gang` policy. While `minCount` provides a hard constraint +for admission, `desiredCount` provides a soft target for placement optimization. + +```go +// GangSchedulingPolicy defines the parameters for gang scheduling. +type GangSchedulingPolicy struct { + // MinCount is the minimum number of pods that must be schedulable or scheduled + // at the same time for the scheduler to admit the entire group. + // It must be a positive integer. + // + // +required + MinCount int32 + + // DesiredCount is the expected number of pods that will belong to this + // PodGroup. This field is a hint to the scheduler to help it make better + // placement decisions for the group as a whole. + // + // Unlike gang's minCount, this field does not block scheduling. If the number + // of available pods is less than desiredCount but at least minCount, the scheduler + // can still attempt to schedule the available pods, but will optimistically try + // to select a placement that can accommodate the future pods. + // + // When provided desiredCount must be greater or equal to minCount. 
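+	//
+	// Illustrative example (not normative): with minCount=4 and desiredCount=8,
+	// the scheduler may admit the gang once 4 pods are schedulable, but it will
+	// prefer a placement that has room for all 8 pods.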
+	//
+	// +optional
+	DesiredCount *int32
+}
+```
+
+Those fields allow users to express their "true" workloads more easily and enable
+the scheduler to optimize the placement of such pod groups by taking the desired
+state into account. Ideally, the scheduler should prefer placements that can
+accommodate the full `desiredCount`, even if not all pods are created yet. When
+`desiredCount` is specified, the scheduler can delay scheduling the first Pod it
+sees for a short amount of time in order to wait for more Pods to be observed.
+
+### Scheduling Framework Extensions
+
+The scheduler framework requires new plugin interfaces to handle "Placements". A
+Placement represents a candidate domain (nodes and resources) for a PodGroup.
+
+#### 1. Data Structures
+
+```go
+// PodGroupInfo holds information about a specific PodGroup within a Workload,
+// including a reference to the Workload, the PodGroup's name, and its replica index.
+// This struct is designed to be extensible with more fields in the future.
+type PodGroupInfo struct {
+	// WorkloadRef is a reference to the parent Workload object.
+	WorkloadRef *workloadv1alpha1.Workload
+
+	// PodGroupName is the name of the PodGroup.
+	PodGroupName string
+
+	// PodGroupReplicaIndex is the index of the PodGroup replica, as defined in KEP-4671.
+	// This is relevant for PodGroups that have more than one replica.
+	PodGroupReplicaIndex int
+
+	// PodSets is a list of PodSet objects within this PodGroup.
+	PodSets []*PodSetInfo
+
+	// -- Add other fields below for future extensions --
+}
+
+// PodSetInfo holds information about a specific PodSet within a PodGroup,
+// primarily the list of Pods.
+// Pods within a PodSet must be homogeneous (using the semantics defined in KEP-5598).
+// This struct is designed to be extensible with more fields in the future.
+type PodSetInfo struct {
+	// Pods is a list of Pod objects belonging to this PodSet.
+	Pods []*corev1.Pod
+
+	// -- Add other fields below for future extensions --
+}
+
+// Placement represents a candidate domain for scheduling a PodGroup.
+// It defines a set of nodes and/or proposed Dynamic Resource Allocation (DRA)
+// resource bindings necessary to satisfy the PodGroup's requirements within that domain.
+// Placement is valid only in the context of a given PodGroup for a single cycle of
+// workload scheduling.
+type Placement struct {
+	// NodeSelector specifies the node constraints for this Placement.
+	// For Topology, this is derived from topology labels (e.g., all nodes with label
+	// 'topology-rack: rack-1').
+	// For DRA, this selector would be constructed based on the nodeSelector from
+	// DRA's AllocationResult in DRAAllocations.
+	// All pods within the PodGroup, when being evaluated against this Placement,
+	// are restricted to the nodes matching this NodeSelector.
+	NodeSelector *corev1.NodeSelector
+
+	// DRAAllocations details the proposed DRA resource assignments for
+	// the ResourceClaims made by the PodGroup. This field is primarily used
+	// by DRA-aware plugins.
+	DRAAllocations []DraClaimAllocation
+}
+
+// DraClaimAllocation maps a specific ResourceClaim name to a set of proposed
+// device allocations. These allocations are tentative and used by the scheduler's
+// AssumePlacement phase to simulate resource commitment.
+type DraClaimAllocation struct {
+	// ResourceClaimName is the name of the ResourceClaim within the PodGroup's
+	// context that these allocations are intended to satisfy.
+	ResourceClaimName string
+
+	// Allocation contains DRA AllocationResult structures, specifying devices
+	// from ResourceSlices that are proposed to fulfill the ResourceClaim.
+	// The scheduler will use this information in AssumePlacement to temporarily
+	// consider these devices as allocated.
+	Allocation dra.AllocationResult
+}
+```
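+
+For illustration, a rack-level `TopologyConstraint` could translate into a
+Placement similar to the following sketch. The label key and the value `rack-1`
+are illustrative assumptions, and imports are omitted to match the snippets
+above; this is not part of the proposed API surface.
+
+```go
+// A candidate Placement restricting a PodGroup to the nodes of a single rack.
+// A topology-based PlacementGeneratorPlugin could derive this NodeSelector from
+// the node label "topology.kubernetes.io/rack" having the value "rack-1".
+placement := &Placement{
+	NodeSelector: &corev1.NodeSelector{
+		NodeSelectorTerms: []corev1.NodeSelectorTerm{{
+			MatchExpressions: []corev1.NodeSelectorRequirement{{
+				Key:      "topology.kubernetes.io/rack",
+				Operator: corev1.NodeSelectorOpIn,
+				Values:   []string{"rack-1"},
+			}},
+		}},
+	},
+	// DRAAllocations stays empty for a pure topology constraint.
+}
+```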
+
+#### 2. New Plugin Interfaces
+
+**PlacementGeneratorPlugin:** Generates candidate placements based on constraints.
+
+```go
+// PlacementGeneratorPlugin is an interface for plugins that generate candidate Placements.
+// Plugins implementing the PlacementGeneratorPlugin interface should also implement
+// the EnqueueExtensions interface.
+type PlacementGeneratorPlugin interface {
+	Name() string
+
+	// GeneratePlacements generates a list of potential Placements for the given PodGroup.
+	// Each Placement represents a candidate set of resources (e.g., nodes matching a selector)
+	// and potential DRA allocations where the PodGroup might be scheduled.
+	GeneratePlacements(ctx context.Context, state *framework.CycleState, podGroup *PodGroupInfo, parentPlacements []*Placement) ([]*Placement, *framework.Status)
+}
+```
+
+**PlacementStatePlugin:** Manages state changes (simulating binding) during
+feasibility checks.
+
+```go
+// PlacementStatePlugin is an interface for plugins that manage state changes
+// when a Placement is being considered.
+type PlacementStatePlugin interface {
+	Name() string
+
+	// AssumePlacement temporarily configures the scheduling context to evaluate the feasibility
+	// of the given Placement for the PodGroup.
+	AssumePlacement(ctx context.Context, state *framework.CycleState, podGroup *PodGroupInfo, placement *Placement) *framework.Status
+
+	// RevertPlacement reverts the temporary scheduling context changes made by AssumePlacement.
+	// This should be called after the evaluation of a Placement is complete to restore
+	// the scheduler's state and allow other Placements to be considered.
+	RevertPlacement(ctx context.Context, state *framework.CycleState, podGroup *PodGroupInfo, placement *Placement) *framework.Status
+}
+```
+
+**PlacementScorerPlugin:** Scores feasible placements to select the best one.
+
+```go
+// PodGroupAssignment represents the assignment of pods to nodes within a PodGroup for a specific Placement.
+type PodGroupAssignment struct {
+	// PodToNodeMap maps a Pod name (string) to a Node name (string).
+	PodToNodeMap map[string]string
+}
+
+// PlacementScorerPlugin is an interface for plugins that score feasible Placements.
+type PlacementScorerPlugin interface {
+	Name() string
+
+	// ScorePlacement calculates a score for a given Placement. This function is called in Phase 3
+	// (Placement Scoring and Selection) only for Placements that have been deemed feasible
+	// for all pods in the PodGroup during Phase 2. The PodGroupAssignment indicates the
+	// node assigned to each pod within this Placement. The returned score is a float64,
+	// with higher scores generally indicating more preferable Placements.
+	// Plugins can implement various scoring strategies, such as bin packing to minimize
+	// resource fragmentation.
+	ScorePlacement(ctx context.Context, state *framework.CycleState, podGroup *PodGroupInfo, placement *Placement, podsAssignment *PodGroupAssignment) (float64, *framework.Status)
+}
+```
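+
+To make the contract above concrete, the following is a minimal, purely
+illustrative sketch of a topology-based `PlacementGeneratorPlugin` that emits
+one Placement per distinct value of a topology label. The plugin name, the use
+of `framework.Handle` to reach the node snapshot, and the hard-coded label key
+are assumptions made for brevity; a real implementation would read the level
+from the PodGroup's `TopologyConstraint` and could skip domains that are
+obviously too small. Imports are omitted to match the snippets above.
+
+```go
+// topologyPlacementGenerator emits one candidate Placement per distinct value
+// of the configured topology label observed on the cluster's nodes.
+type topologyPlacementGenerator struct {
+	handle framework.Handle // assumed accessor to the scheduler's node snapshot
+}
+
+func (pl *topologyPlacementGenerator) Name() string { return "TopologyPlacementExample" }
+
+func (pl *topologyPlacementGenerator) GeneratePlacements(ctx context.Context, state *framework.CycleState, podGroup *PodGroupInfo, parentPlacements []*Placement) ([]*Placement, *framework.Status) {
+	// The label key would normally come from the Workload's TopologyConstraint;
+	// it is hard-coded here to keep the sketch short.
+	level := "topology.kubernetes.io/rack"
+
+	nodes, err := pl.handle.SnapshotSharedLister().NodeInfos().List()
+	if err != nil {
+		return nil, framework.AsStatus(err)
+	}
+
+	// Collect the distinct topology domain values present on the nodes.
+	domains := sets.New[string]()
+	for _, nodeInfo := range nodes {
+		if v, ok := nodeInfo.Node().Labels[level]; ok {
+			domains.Insert(v)
+		}
+	}
+
+	// Emit one candidate Placement per domain value.
+	var placements []*Placement
+	for _, v := range sets.List(domains) {
+		placements = append(placements, &Placement{
+			NodeSelector: &corev1.NodeSelector{
+				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
+					MatchExpressions: []corev1.NodeSelectorRequirement{{
+						Key:      level,
+						Operator: corev1.NodeSelectorOpIn,
+						Values:   []string{v},
+					}},
+				}},
+			},
+		})
+	}
+	return placements, nil
+}
+```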
+
+### Scheduling Algorithm Phases
+
+The algorithm proceeds in three main phases for a given Workload/PodGroup.
+
+#### Phase 1: Candidate Placement Generation
+
+- **Input:** PodGroupInfo.
+
+- **Action:** Iterate over distinct values of the topology label (TAS) or
+  available Devices (DRA).
+
+- **Output:** A list of Placement objects.
+
+- Placement generation is executed after PreFilter, giving PlacementGeneratorPlugins
+  a chance to get the list of nodes in the cluster.
+
+- Example: If the label is rack, placements are generated for rack-1, rack-2,
+  etc.
+
+#### Phase 2: Pod-Level Filtering and Feasibility Check
+
+- **Action:** For each generated Placement:
+
+  1. Call `AssumePlacement` (binds context to the specific node selector/DRA
+     resources).
+
+  2. Run the default workload scheduling algorithm with the given context.
+
+  3. If all pods fit, the Placement is marked Feasible.
+
+  4. Call `RevertPlacement`.
+
+- **Potential Optimization:** Pre-filtering can check aggregate resources
+  requested by PodGroup Pods before running the full simulation.
+
+- **Heterogeneous PodGroup Handling**: Sequential processing will be used
+  initially. Pods are processed sequentially; if any fail, the placement is
+  rejected.
+
+#### Phase 3: Placement Scoring and Selection
+
+- **Action:** Call `ScorePlacement` for all feasible placements.
+
+- **Selection:** Select the Placement with the highest score.
+
+- **Binding:** Proceed to bind pods to the assigned nodes and resources using
+  pod-by-pod scheduling logic, with each pod pre-bound to its selected node
+  by setting the `nominatedNodeName` value.
+
+### Scheduler Plugins
+
+**TopologyPlacementPlugin (New)** Implements `PlacementGeneratorPlugin`. Generates
+Placements based on distinct values of the designated node label (TAS).
+
+**PlacementBinPackingPlugin (New)** Implements `PlacementScorerPlugin`. Scores
+Placements to maximize utilization (tightest fit) and minimize fragmentation.
+
+**PlacementPodCountScorerPlugin (New)** Implements `PlacementScorerPlugin`. Scores
+Placements based on the number of pods fitting into each Placement.
+
+**DRATestPlugin (New)** Implements `PlacementGeneratorPlugin` and `PlacementStatePlugin`
+and is used only for testing the algorithm's support for DRA-aware scheduling.
+
+- **Generator:** Returns Placements derived from available Devices satisfying
+  claims shared by all Pods within a PodGroup.
+
+- **State:** Temporarily assigns AllocationResults to Devices during the
+  Assume phase.
+
+### Beta Extensions
+
+The beta version of this KEP will introduce full support for DRA-aware workload
+scheduling. This enhancement will enable the scheduler to consider DRA claims
+defined by users when making placement decisions, ensuring that workloads are
+placed on nodes that can satisfy their resource requirements. This will be
+achieved by using the API to be defined in
+[KEP-5729: DRA: ResourceClaim Support for Workloads](https://github.com/kubernetes/enhancements/pull/5736).
+
+The implementation will build upon the extension points introduced in the
+alpha version of this feature and the `DRATestPlugin` implementation.
+Specifically, the `DRAPlugin` will be enhanced to generate placements based
+on the ResourceClaim objects associated with the PodGroup. The plugin will
+interact with the DRA framework to ensure that the selected placement can
+satisfy the resource requirements of the workload, as expressed in its
+ResourceClaim.
+
+### Potential Future Extensions
+
+The following features are out of scope for this KEP but are considered for
+future separate KEPs improving and extending the proposed functionality:
+
+1. **Prioritized Placement Scheduling:** Allowing a set of preferred placements
+   with fallbacks (e.g., prefer Rack, fall back to Block). This would introduce
+   a Rank field to the Placement struct.
+
+2. **Optional/Preferred Scheduling Constraints:** Constraints that serve purely
+   as scoring mechanisms without hard requirements.
+
+3. **Multi-level Scheduling Constraints:** Handling nested constraints (e.g.,
+   Block -> Rack). This would involve iterative placement generation and a
+   Parent field in the Placement struct.
+
+4. **Pod Group Replicas Optimization:** Optimizing scheduling for identical
+   PodGroups (replicas) by scheduling the maximum feasible number of replicas
+   within a single placement pass.
+
+5. **Explicit Topology Definition:** Using a Custom Resource (NodeTopology) to
+   define and alias topology levels, removing the need for users to know exact
+   node label keys and opening additional optimization and validation options.
+
+6. **Feasible Placements Limit:** Adding an option to provide a limit on the
+   number of feasible Placements that need to be found before moving to
+   Phase 3: Placement Scoring and Selection.
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes
+necessary to implement this enhancement.
+
+#### Prerequisite testing updates
+
+#### Unit tests
+
+- PlacementGeneratorPlugin: Test generation of placements for various topology
+  labels and DRA ResourceSlices.
+
+- PlacementStatePlugin: Verify AssumePlacement and RevertPlacement correctly modify
+  and restore the CycleState.
+
+- Algorithm Logic: Test the sequential processing of Placements and the
+  selection logic based on scores.
+
+- DRA Integration: specific tests for the DRATestPlugin.
+
+#### Integration tests
+
+- Topology Awareness: Verify that pods with TopologyConstraint are correctly
+  co-located on nodes sharing the label.
+
+- DRA Awareness: Verify that pods with shared ResourceClaims are bound to shared
+  Devices.
+
+- Infeasibility: Verify that Workloads remain pending if no Placement
+  satisfies the constraints.
+
+#### e2e tests
+
+- End-to-End Workload Scheduling: Submit a Workload with TopologyConstraint
+  (e.g., Rack) and verify all pods land on the same rack.
+
+- DRA Co-location: Submit a Workload requiring shared DRA devices and verify
+  correct allocation and placement.
+
+### Graduation Criteria
+
+#### Alpha
+
+- Feature implemented behind a feature flag.
+- PodGroupSchedulingConstraints API defined.
+- Basic topology (Node Label) working.
+- Initial unit and integration tests.
+
+#### Beta
+
+- Support for "Potential Future Extensions" (Prioritized placement, etc.)
+  evaluated.
+- Scalability tests on large clusters with high placement counts.
+- Comprehensive e2e testing.
+- Cluster autoscaling components are aware of workload topology constraints.
+
+### Upgrade / Downgrade Strategy
+
+This KEP is additive and can safely fall back to the original behavior on
+downgrade.
+
+When a user upgrades the cluster to a version that supports topology-aware
+workload scheduling:
+
+- they can enable scheduling plugins implementing the new Scheduling Framework
+  interfaces in the kube-scheduler config
+- they can start using the new API to create Workload objects with the
+  `schedulingConstraints` field
+- the scheduler will use the enabled plugins to generate placements for the
+  Workload and check their feasibility
+
+When a user downgrades the cluster to a version that no longer supports
+topology-aware workload scheduling:
+
+- the `schedulingConstraints` field can no longer be set on Workloads
+  (fields that are already set remain set, though)
+- the scheduler will revert to the original behavior of scheduling pods belonging
+  to a gang, without considering different potential placements.
+
+### Version Skew Strategy
+
+The feature is limited to the control plane, so the version skew with nodes
+(kubelets) doesn't matter.
+
+For the API changes, old versions of components (in particular kube-apiserver)
+may not handle the new fields. Thus, users should not set those fields before
+confirming that all control-plane instances were upgraded to a version
+supporting them.
+
+For the topology-aware workload scheduling itself, this is a purely in-memory
+kube-scheduler feature, so the skew doesn't matter (as there is always only a
+single kube-scheduler instance being a leader).
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: TopologyAwareWorkloadScheduling
+  - Components depending on the feature gate:
+    - kube-apiserver
+    - kube-scheduler
+  - Feature gate name: PodGroupDesiredCount
+  - Components depending on the feature gate:
+    - kube-apiserver
+    - kube-scheduler
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+
+No - even with the feature enabled, the scheduler by default uses the existing
+scheduling algorithm to schedule workloads. The alternative algorithm is used
+only when a workload has an explicit topology constraint set.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, the workload scheduling algorithm changes can be disabled by simply disabling
+the feature gate in kube-scheduler.
+
+The new API changes can also be disabled by disabling the feature gate in kube-apiserver.
+However, that doesn't result in clearing the new fields for workloads that already have
+them set in the storage.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+The feature starts working again.
+
+###### Are there any tests for feature enablement/disablement?
+
+The scheduler algorithm changes are purely in-memory and don't require any dedicated
+enablement/disablement tests - the logic will be covered by regular feature tests.
+
+For the newly introduced API fields, dedicated enablement/disablement tests at the
+kube-apiserver registry layer will be added in Alpha.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+No.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+Using this feature requires setting a topology constraint on the Workload object.
+The related increase in the size of the Workload object should, however, be
+negligible.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+Although the proposed algorithm was designed with performance in mind, the scheduling
+latency / Pod Startup SLO may potentially increase, especially for large clusters and
+fine-grained topology constraints.
+
+We will measure the exact impact using performance benchmarks and scalability tests and
+update this section based on the results. The complexity of scheduling a single workload
+is O(#pods * #nodes), which is comparable to the algorithm not using topology constraints,
+so the benchmarks are primarily meant to surface potential inefficiencies in the
+implementation.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+For large clusters and fine-grained topology constraints we may observe some increase in
+CPU and RAM usage for kube-scheduler. The exact scale of this increase will be confirmed
+by scalability tests.
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+- **Complexity:** This proposal adds significant logic to the kube-scheduler
+  framework, specifically the "Placement" abstraction and the simulation loop
+  (Phase 2).
+
+- **Performance:** Generating and simulating a large number of Placements
+  (e.g., every rack in a massive cluster) could be computationally expensive.
+ + - **Mitigation:** Pre-filtering of Placements will be implemented to discard + clearly infeasible Placements (insufficient total resources) before the + expensive pod-level simulation. + +## Alternatives + +### Pod Inter-Affinities + +Currently, users may attempt to simulate gang scheduling using podAffinity (to +co-locate pods) or podAntiAffinity. + +- **Pros:** Native to Kubernetes, no new CRDs. +- **Cons:** Affinity is evaluated per-Pod at the time of that Pod's + scheduling. It does not look ahead. This means that the scheduler might + place the first Pod on a node that satisfies its immediate affinity needs + but prevents the rest of the group from scheduling (e.g., locking a topology + domain that is too small for the rest of the group). + +### Standalone Schedulers (e.g., Volcano) + +Users can run a secondary scheduler like Volcano or Yunikorn. + +- **Pros:** Feature-rich, mature for batch workloads. +- **Cons:** Operationally complex (two schedulers), race conditions when + sharing cluster resources, and lack of integration with standard Kubernetes + features like common admission controllers or newer features like DRA + (initially). + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-scheduling/5732-topology-aware-workload-scheduling/kep.yaml b/keps/sig-scheduling/5732-topology-aware-workload-scheduling/kep.yaml new file mode 100644 index 000000000000..99bc28c58277 --- /dev/null +++ b/keps/sig-scheduling/5732-topology-aware-workload-scheduling/kep.yaml @@ -0,0 +1,51 @@ +title: Topology-aware workload scheduling +kep-number: 5732 +authors: + - "@44past4" +owning-sig: sig-scheduling +participating-sigs: +status: provisional +creation-date: 2025-12-10 +reviewers: + - sanposhiho + - dom4ha + - macsko + - wojtek-t +approvers: + - sanposhiho + - dom4ha + +see-also: + - "/keps/sig-scheduling/4671-gang-scheduling" + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.36" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.36" + beta: "v1.37" + stable: "v1.39" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: TopologyAwareWorkloadScheduling + components: + - kube-apiserver + - kube-scheduler + - name: PodGroupDesiredCount + components: + - kube-apiserver + - kube-scheduler +disable-supported: true + +# The following PRR answers are required at beta release +metrics: