KEP-5732: Topology-aware workload scheduling#5733

Merged
k8s-ci-robot merged 25 commits into kubernetes:master from
44past4:5728-topology-aware-workload-scheduling
Feb 6, 2026
Conversation

@44past4
Contributor

@44past4 44past4 commented Dec 11, 2025

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 11, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 11, 2025
@44past4
Copy link
Contributor Author

44past4 commented Dec 11, 2025

/cc @wojtek-t @erictune @johnbelamaric

type DRAConstraint struct {
// ResourceClaimName specifies the name of a specific ResourceClaim
// within the PodGroup's pods that this constraint applies to.
ResourceClaimName *string
Member

What does ResourceClaimName mean if a given PodGroup is replicated (there are multiple podgroup instances/replicas)?

This would effectively mean sharing the same RC across multiple instances, which in many cases would be highly misleading.
However, there are arguably use cases for it too; in that case the algorithm would effectively have to consider all podgroup instances in a single round, but for that we don't even know how many groups we have.
@macsko - FYI (as this is slightly colliding with the kep-4671 update)

So thinking about that more, I'm wondering if we can introduce that without further enhancing the API now (i.e. adding the replicas field to PodGroup).

Another alternative would be to very explicitly split the pod-group-replica constraints from the constraints across all pod-group-replicas and (at least for Alpha) focus only on the former.
So something more like (exact names and structures to be refined):

type PodGroupAllReplicasSchedulingConstraints struct {
  ResourceClaimName *string  // This one is supported only if Replicas=1.
}

type PodGroupReplicaSchedulingConstraints struct {
  ResourceClaimTemplateName *string // A separate RC is created from this template for every replica.
}


Contributor Author

If the PodGroup is replicated, the meaning of ResourceClaimName will depend on whether we schedule those replicas together or not. If they are scheduled separately, then scheduling the first replica will lock the referenced ResourceClaim, and the subsequent replicas will have no freedom in its allocation - there will be only one possible placement for them. When scheduling multiple replicas at once, we can try to choose a DRA allocation which allows us to schedule the highest number of replicas (assuming that we do not provide all-or-nothing semantics for multiple replicas).

Member

I wasn't asking about the implementation aspect.
I wanted to take a step back, understand the actual use case we're trying to address, and figure out if/how we should represent it so that it is intuitive to users who have a replicated PodGroup. I feel that the API as currently described can be pretty confusing in this case.

Contributor Author

I agree that the proposed API might be confusing. Because of this, as suggested in some other comments, I have decided to move the DRA-aware scheduling implementation to the beta of this feature and wait for KEP-5729: DRA: ResourceClaim Support for Workloads to define the required API for the PodGroup-level ResourceClaims. I hope this makes the alpha scope of this feature clearer.


// ResourceClaimTemplateName specifies the name of a ResourceClaimTemplate.
// This applies to all ResourceClaim instances generated from this template.
ResourceClaimTemplateName *string
Member

Who creates and manages the lifecycle of the RC created from that template?

Contributor Author

Here we are assuming that the lifecycle of the RC is managed outside of kube-scheduler. One option is to have it managed by the specific workload controller, for instance LeaderWorkerSet, which could create an RC when creating a new replica. That would be very inconvenient, so probably we should have a single controller which could do this just by watching Workload objects. We had a discussion with @johnbelamaric about this. Either way, this should be outside the scope of this KEP.

Contributor

The lifecycle is what I plan to address in #5729.

Member

OK - so this matches my thinking.

But the primary question now is - why do we need it then?
If we have some external entity (whether a dedicated controller or e.g. the LWS controller) that creates the RC whenever it is needed (it should create it before we actually do the scheduling), then what the scheduler really needs to be aware of - its input - is that RC (the one it will be finding the best allocation for), not the template itself. It doesn't care about the template.

So I think we're aligned on the intention, but I don't really understand how that will be used.

Contributor Author

I have updated the KEP and removed the support for DRA-based constraints from the alpha version; for beta I have proposed to wait for KEP-5729: DRA: ResourceClaim Support for Workloads to define the required API and lifecycle for the PodGroup-level ResourceClaims.

// PodSetAssignment represents the assignment of pods to nodes within a PodSet for a specific Placement.
type PodSetAssignment struct {
// PodToNodeMap maps a Pod name (string) to a Node name (string).
PodToNodeMap map[string]string
Member

Do we need DRA assignments too?

Contributor Author

This is a good question. We might need them for the PodGroup pods binding phase, which comes after the placement for a PodGroup has been selected. So provided that we can capture those while checking placement feasibility, then yes, we should have DRA assignments here as well.

Contributor Author

After looking into the current state of the beta scope of KEP-4671, which introduces the Workload Scheduling Cycle into kube-scheduler, it seems that the actual binding of pods to nodes will be done in a separate pod-by-pod scheduling cycle, for which nominatedNodeName will be the main input. Taking this into account, at least for now we do not need to track pod-level DRA assignments as part of PodGroupAssignment.

// DRA's AllocationResult from DRAAllocations.
// All pods within the PodSet, when being evaluated against this Placement,
// are restricted to the nodes matching this NodeAffinity.
NodeAffinity *corev1.NodeAffinity
Member

Implementation detail - given a NodeAffinity, finding the nodes that match it is an O(N) operation, with N being the number of nodes in the cluster. So together with the NodeAffinity here, we should probably also store the exact list of nodes to avoid recomputing it over and over again.

Contributor Author

I agree that caching the actual list of nodes matching the nodeAffinity will be important, especially in the case of a large number of small placements. This list can be set by the TopologyPlacementPlugin and DRAPlugin when generating placements. That said, I would consider this an optimization, so I would leave it to the implementation phase.

Comment on lines 265 to 271
// ResourceClaimName specifies the name of a specific ResourceClaim
// within the PodGroup's pods that this constraint applies to.
ResourceClaimName *string

// ResourceClaimTemplateName specifies the name of a ResourceClaimTemplate.
// This applies to all ResourceClaim instances generated from this template.
ResourceClaimTemplateName *string
Contributor

How do these fields relate to the ResourceClaim references that Pods already have? What happens if the sets of claims referenced by a Workload and its Pods are different?

Member

+1 to this question, it needs to be answered here

Contributor Author

I have updated the KEP and removed the support for DRA-based constraints from the alpha version; for beta I have proposed to wait for KEP-5729: DRA: ResourceClaim Support for Workloads to define the required API and lifecycle for the PodGroup-level ResourceClaims.

type PodGroupSchedulingConstraints struct {
// TopologyConstraints specifies desired topological placements for all pods
// within this PodGroup.
TopologyConstraints []TopologyConstraint
Member

Do multiple topology constraints actually make sense here? What would be the use case?

Contributor Author

There are two main use cases for defining multiple topology constraints which I can see right now:

  • when node label values are not unique among all nodes - for instance, racks have indexes which are unique only within a given block - in this case we would like to be able to provide both of those labels as required constraints

  • when some constraints are optional / best-effort - this would require introducing another field in TopologyConstraint to mark a given constraint as optional / best-effort.

Member

For the latter I would actually expect a "TopologyPreferences" field (or something like that), so that one doesn't convince me.

But the first usecase is interesting - I would actually mention it in the KEP explicitly.

Maybe we should actually explicitly mention in the API comment that in huge majority of cases we expect exactly 1 item in this list and mention this example as a potential exception.

Member

@sanposhiho sanposhiho Jan 4, 2026

when node label values are not unique among all nodes - for instance racks have indexes which are unique only within a given block - in this case we would like to be able to provide both of those labels as required constraints

I don't understand why. In that case, they are supposed to just use the rack label that exists only in specific blocks as the level, and then the scheduler should ignore the nodes in other blocks because those blocks don't have this label in the first place, no?

Contributor Author

The problematic case is when we have nodes with labels like below:

  • Node 1: block = 1, rack = 1
  • Node 2: block = 1, rack = 2
  • Node 3: block = 2, rack = 1
  • Node 4: block = 2, rack = 2

In this case, just using the rack label as the constraint and the basis for Placement generation is not sufficient: the rack label alone does not identify the rack. Instead of putting Node 1 and Node 3 in one Placement, we would like them to go to two separate Placements, as they are parts of two different blocks.

Member

Ok, I see what you're saying.


// DRAConstraints specifies constraints on how Dynamic Resources are allocated
// across the PodGroup.
DRAConstraints []DRAConstraint
Member

Continuing my thoughts from other comments here.

The primary goals that we wanted to ensure with this KEP are:

  • building the foundations for TAS and having the first version of the algorithm
  • proving that the algorithm is compatible with both DRA and topology-based requirements

I think this KEP is achieving them.

However, the more I think about it, the more concerns I have about this kind of API. Up until now I thought that we can actually decouple and postpone the discussion of lifecycle of pod-group-owned (or workload-owned) RCs to later, but some of my comments below already suggest it's not that clear and may influence the API.

So I actually started thinking if (for the sake of faster and incremental progress), we shouldn't slightly revise the scope and goals of this KEP, in particular:

  • remove the "DRAConstraints" from the scope (and couple it with the lifecycle-of-PodGroup/RC discussion we'll have in DRA: ResourceClaim Support for Workloads #5729 - @nojnhuh )
  • ensure that the proposal is compatible with DRA-based constraints at a lower level;
    namely, the scheduler should not really manage the lifecycle of RCs, and those RCs should just be an input to the scheduler (whether at the PodGroup level, the Workload level, or some to-be-introduced level).
    So what if instead we prove that it works by simply:
  1. ensuring that some internal interface in the scheduler (or maybe a scheduler-framework-level one?) can actually accept RCs as an additional constraint to the WorkloadCycle
  2. adding a test at that level that scheduling works if we pass topology constraints as RCs

That would allow us to decouple the core of the changes in this KEP from all the discussions about how to represent it in the API, how it is coupled with lifecycle, etc. And hopefully unblock this KEP much faster while still proving the core of what we need.

@johnbelamaric @erictune @44past4 @dom4ha @sanposhiho @macsko - for your thoughts too

Member

@johnbelamaric johnbelamaric Dec 15, 2025

I think that makes sense. Decoupling can help execution. We would treat the lifecycle and allocation of RCs in #5729. Allocation implies the constraint. #5194 should also merge with #5729, I think. It was conceived prior to the existence of the Workload API and I think #5729 encompasses a more holistic set of functionality.

Contributor

Agree with decoupling.

It is possible to implement #5729 without #5732.
Even if we only implement one of the two for 1.36, we still learn something.

Contributor Author

I have updated the KEP and removed the support for DRA-based constraints from the alpha version; for beta I have proposed to wait for KEP-5729: DRA: ResourceClaim Support for Workloads to define the required API and lifecycle for the PodGroup-level ResourceClaims.

@sanposhiho
Member

/assign

I'm a bit low on bandwidth these days, but will take a look at this one for sure.

Contributor

@erictune erictune left a comment

Looks great overall!

- **State:** Temporarily assigns AllocationResults to ResourceClaims during
the Assume phase.

**PlacementBinPackingPlugin (New)** Implements `PlacementScorer`. Scores
Contributor

I think this plugin can prevent the current PodGroup from fragmenting larger levels, but it cannot prevent it from fragmenting smaller levels. If the current PodGroup uses fewer than all the nodes in this Placement, then there could be multiple podsAssignment options, and different options may have different fragmentation effects. Since pod-at-a-time scheduling within the Placement is greedy, we won't consider multiple podsAssignment options.

It's not clear to me that you can influence this enough using the per-pod Score plugins.

Contributor Author

This is an important problem, but in order to define what fragmentation of lower/smaller levels even means, we need at least two topology levels defined for a PodGroup - a lower/smaller one, which is a preferred/best-effort placement for the PodGroup, and a higher/larger one, which is a required placement.

This is not in the scope of this KEP.

That said, when we add support for multiple levels to address fragmentation of lower/smaller levels, we will need to solve two subproblems: scheduling pods within a higher/larger placement, and scoring higher/larger placements.

When it comes to scheduling: while generating potential placements, we can keep track of placements and their sub-placements. When we go through all lower/smaller placements, we can record the number of pods we were able to schedule in each. We can also extend the scoring function to work with partial placements (placements which do not contain all pods within a PodGroup), so for each lower/smaller placement we also get its score together with the number of pods we were able to fit. While checking the higher/larger placements, instead of simply going pod by pod and checking all nodes within the placement, for PodGroups which have only one PodSet we can:

  • Check if there is any lower/smaller sub-placement which can fit the remaining pods from the PodGroup.
  • If there are such sub-placements, select the one which fits the fewest pods and has the highest score, and try to schedule there as many of the remaining pods as possible.
  • If none of the sub-placements can fit all remaining pods, choose the one which fits the highest number of pods and has the highest score, schedule there as many of the remaining pods as possible, and repeat this process.

This should lead to a pod assignment which uses as few sub-placements as possible.

When it comes to scoring those higher/larger placements, we can extend the PodGroupAssignment struct to contain the number of sub-placements used by a given placement and their scores. This information could be used instead of the normal bin-packing logic to score such placements.

Apart from the API to define multi-level placements, all proposed interfaces should be able to support this logic, though their implementations may need to change. All of this should be considered in a future KEP for multi-level scheduling support.

Contributor

It is not necessary to have two levels on one PodGroup for this to be a problem. It is only necessary that unrelated PodGroups can have different levels. It is also sufficient to have a PodGroup asking for 1 level, and some plain pods (e.g. from a Deployment).

We will need some place to score a placement's effect on each Level that it touches.

Member

+1 to Eric
I think two-level scheduling is a separate problem - we don't need it at all to talk about fragmentation.

The simplest example is the following topology:

              superblock
      block               block
  node1 node2         node3 node4

  1. workload 1 that requests superblock and has 2 pods
  2. workload 2 that requests block and has 2 pods

If we give workload 1 node1 and node3 (which is a valid placement), then we no longer have a full block for workload 2.

That being said, I would like to decouple two things:

  • whether the proposed framework plugins enable addressing the problem
  • having appropriate plugins with logic that address that

I want this KEP to be focused on the first one and keep the second for a follow-up.

I think that, given ScorePlacement includes PodPlacement, it is in a position to assess the fragmentation resulting from that placement. So the missing bit is how to generate the best placements, and it sounds to me that we can do that in a follow-up.

Contributor Author

When it comes to the provided example, kube-scheduler needs to know from somewhere that when scheduling workload 1 it should try to minimize the number of blocks used. This information has to come from somewhere. One option is to define it on workload 1 directly, for instance by marking the block-level constraint as a best-effort/optional scheduling constraint. Another option is to have an explicit Topology definition referenced from workload 1's scheduling constraints. So while I agree that the fragmentation problem is important, it can/should be solved while working on one of the proposed extensions to this KEP (Prioritized Placement Scheduling, Optional/Preferred Scheduling Constraints, or Explicit Topology Definition) and not in this KEP directly, because right now we do not have enough information about what should be optimized.

Name() string

// GeneratePlacements generates a list of potential Placements for the given PodGroup and PodSet.
// Each Placement represents a candidate set of resources (e.g., nodes matching a selector)
Contributor

Consider saying that the GeneratePlacements interface does not have any compatibility guarantees across versions. If/when we later add Prioritized Placement Scheduling, or Multi-level Scheduling Constraints, we will want to change GeneratePlacements.

Contributor Author

With the addition of the parentPlacements argument to GeneratePlacements and the potential to extend the Placement struct, we should be able to support features like Prioritized Placement Scheduling or Multi-level Scheduling Constraints without changing the GeneratePlacements signature.


5. **Explicit Topology Definition:** Using a Custom Resource (NodeTopology) to
define and alias topology levels, removing the need for users to know exact
node label keys.
Contributor

@erictune erictune Dec 15, 2025

Explicit Topology Information also provides these things:

  • An explicit total order on levels within one Topology object (needed for Multi-level and Prioritized Placement Scheduling)
  • An implicit label hierarchy requirement
    • A level n label's nodes must be a subset of only one level n+1 label.
    • Useful for Multi-level placement, and for the hierarchical aggregated capacity optimization.
  • A way to limit the number of levels
    • Limit by validating the list length in a Topology object.
    • Limiting levels limits one term of algorithm complexity.
  • A way to discourage creation of too many Topology objects
    • Only admins or cloud providers should create these usually.

Taken together, these properties make it easier to avoid the case where there are many more TAS-relevant labels (key/value pairs) than there are nodes.

Also, while the initial algorithm is going to be greedy, in the sense that it examines one workload at a time, future algorithms may want to examine multiple workloads at once to find jointly optimal placements. By allowing excess complexity in the structure of topology labels at the outset, we will limit our ability to do future global optimizations.

I think it is fine to leave Explicit Topology Definition out of Alpha. However, before GA, we should either have a beta Explicit Topology Definition, or have documented requirements for (1) the maximum number of label keys used for TAS, (2) a partial order requirement over all TAS keys, and (3) a nesting requirement for TAS labels.

Otherwise, it will be hard to enforce those later.

Contributor

One more thing about explicit topology levels:

  • By defining levels, it is implied that we may wish to start a workload of a size which uses all nodes of a given level member (nodes with label level:value). I would say it is a statement that PodGroups with sizes equal to the size of a level member are statistically more likely than other sizes, and it is an implicit request to avoid fragmenting (partially allocating) all level members of any level.

Member

Playing a bit devil's advocate - I'm not sure that all these arguments are convincing enough to me. In particular:

  1. we want to support DRA-based constraints eventually too, and these implicitly also imply certain topologies. They will not be defined by Topology definition anyway, so it will by definition cover only a subset of potential constraints. @johnbelamaric - for your thoughts too

  2. Nodes are not objects that arbitrary users can access (and thus add arbitrary labels to them). So we're effectively limited to labels that only cluster administrators can set anyway.

So despite the fact that I see potential benefits from having explicit Topology definition, especially the point (1) above makes me suspicious that we will be able to fully utilize its consequences.

But assuming for now we will create the Topology object, doesn't that mean that the API for TopologyConstraint should actually be different and explicitly reference the Topology object?

Member

we want to support DRA-based constraints eventually too, and these implicitly also imply certain topologies. They will not be defined by Topology definition anyway, so it will by definition cover only a subset of potential constraints. @johnbelamaric - for your thoughts too

If I understand what you mean, I would argue that DRA devices can actually serve as the way to define these explicit topologies.

Contributor Author

For me, having an explicit topology definition is an important problem. That said, I believe it is closely related to the Prioritized Placement Scheduling, Optional/Preferred Scheduling Constraints, and Multi-level Scheduling Constraints extensions listed above, and to the support for DRA constraints which has been moved to beta. Because of this we may need a common API proposal for all of those, which probably should also include an explicit Topology CRD.

So I agree on the importance, but I am not sure if this should be in scope of this KEP.

When it comes to the required changes to the proposed API, I believe we may need to add a Topology name to PodGroupSchedulingConstraints to distinguish between different topologies, but we should be able to do this as an incremental change.

I have updated the description of this potential extension to include new optimization and validation options related to the introduction of Explicit Topology Definition.

Member

If I understand what you mean, I would argue that DRA devices can actually serve as the way to define these explicit topologies.

That's what I meant here.
The reason why I wrote "implicitly" is that even though it's an explicit definition of a topology, it is hidden in DRA objects, which is not where the majority of users would expect the topology to be defined.

And this is what I had in mind when writing the above comment - we already have a way to define the topology via DRA - so I'm not 100% convinced yet that we need a separate dedicated topology definition.

Member

So I agree on the importance, but I am not sure this should be in the scope of this KEP.

+1

When it comes to the required changes to the proposed API, I believe that we may need to add a Topology name to the PodGroupSchedulingConstraints to distinguish between different topologies, but we should be able to do this as an incremental change.

+1
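The incremental change discussed here could look roughly like the following Go sketch. Everything below is a simplified stand-in for the KEP's proposed API, and `TopologyName` in particular is a hypothetical field name, not a settled design.

```go
package main

import "fmt"

// Simplified stand-in for the proposed API; not the final type.
type TopologyConstraint struct {
	TopologyKey string
}

type PodGroupSchedulingConstraints struct {
	// TopologyName is a hypothetical field that would reference an
	// explicit Topology definition, distinguishing between different
	// topologies; adding it later keeps the change incremental.
	TopologyName *string
	// TopologyConstraints specifies desired topological placements for
	// all pods within this PodGroup.
	TopologyConstraints []TopologyConstraint
}

func main() {
	name := "accelerator-fabric" // hypothetical Topology object name
	c := PodGroupSchedulingConstraints{
		TopologyName:        &name,
		TopologyConstraints: []TopologyConstraint{{TopologyKey: "topology.kubernetes.io/block"}},
	}
	fmt.Println(*c.TopologyName, len(c.TopologyConstraints))
}
```

Because the field is an optional pointer, existing objects without a TopologyName would keep today's behavior, which is what makes the change backward compatible.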


// DRAConstraints specifies constraints on how Dynamic Resources are allocated
// across the PodGroup.
DRAConstraints []DRAConstraint
Contributor

Agree with decoupling.

It is possible to implement #5729 without #5732.
Even if we only implement one of the two for 1.36, we still learn something.
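For concreteness, a minimal sketch of how the quoted `DRAConstraints` field might compose with the `DRAConstraint` type. These are simplified stand-in types; the field comments echo the proposal, while the claim name below is invented for illustration.

```go
package main

import "fmt"

// Simplified stand-ins for the proposed API types; not the final API.
type DRAConstraint struct {
	// ResourceClaimName specifies the name of a specific ResourceClaim
	// within the PodGroup's pods that this constraint applies to.
	ResourceClaimName *string
}

type PodGroupSchedulingConstraints struct {
	// DRAConstraints specifies constraints on how Dynamic Resources are
	// allocated across the PodGroup.
	DRAConstraints []DRAConstraint
}

func main() {
	claim := "shared-accelerator-claim" // hypothetical ResourceClaim name
	c := PodGroupSchedulingConstraints{
		DRAConstraints: []DRAConstraint{{ResourceClaimName: &claim}},
	}
	fmt.Println(len(c.DRAConstraints), *c.DRAConstraints[0].ResourceClaimName)
}
```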

Member

@wojtek-t wojtek-t left a comment

This is great - I'm pretty well aligned with the proposal now.

type PodGroupSchedulingConstraints struct {
// TopologyConstraints specifies desired topological placements for all pods
// within this PodGroup.
TopologyConstraints []TopologyConstraint
Member

For the latter I would actually expect a "TopologyPreferences" field (or something like that), so I don't think that convinces me.

But the first usecase is interesting - I would actually mention it in the KEP explicitly.

Maybe we should actually explicitly mention in the API comment that in the huge majority of cases we expect exactly 1 item in this list, and mention this example as a potential exception.

@wojtek-t wojtek-t self-assigned this Dec 18, 2025
Member

@dom4ha dom4ha left a comment

LGTM

Added minor wording suggestions, but we need a decision on whether we want to add DesiredCount to the Gang policy as well.

Note: For the initial alpha scope, only a single TopologyConstraint will be
supported.

#### Basic Policy Extension
Member

It was brought up in other comments on the original KEP #5730 (comment); it makes sense to add DesiredCount for the Gang policy as well, for exactly the same reasons.

@wojtek-t @sanposhiho @helayoty @andreyvelich

Member

Yes, I agree. And that raises another question - should we add it somewhere else, where we can prevent having two duplicated fields? e.g., in PodGroup.

Member

I don't want to repeat the same discussion again, but the decision here will be related to whether we really need the Basic policy or not. If Gang likely needs the fields that Basic will get, what's the point of having a Basic policy just to duplicate some Gang fields? Can we simply have those fields in PodGroup, or wherever else we can prevent such duplication?

Contributor Author

I have updated the KEP to include DesiredCount for Gang policy as well.

Comment on lines +289 to +290
// PodGroup. This field is a hint to the scheduler to help it make better
// placement decisions for the group as a whole.
Member

Suggested change
// PodGroup. This field is a hint to the scheduler to help it make better
// placement decisions for the group as a whole.
// PodGroup. This field is a hint to the scheduler specifying how many Pods will be created, to help it make better
// placement decisions for the group as a whole. The scheduler won't attempt to schedule this PodGroup until the desired number of pods have been created.

Member

@dom4ha dom4ha Feb 5, 2026

The semantics of DesiredCount are a bit vague now (it's a hint for the scheduler), but considering it's alpha, we can even decide to update the KEP once we have the implementation in place. The exact behavior may even depend on the type of the PodGroup:

  • homogeneous - we can pick a topology option that can fit the future desired count
  • heterogeneous pods - we need to wait for all pods, as we would not know their shape
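One way to picture that difference is a small gating sketch. This is purely illustrative: the types, the `Homogeneous` flag, and the gating behavior are assumptions for discussion, not the KEP's settled semantics.

```go
package main

import "fmt"

// Illustrative only: DesiredCount gating might differ by PodGroup shape.
type PodGroup struct {
	DesiredCount int
	Homogeneous  bool // all pods share one shape (resource requests)
}

// readyToPlace sketches when the scheduler could commit to a topology
// placement for the group, under the assumptions above.
func readyToPlace(g PodGroup, createdPods int) bool {
	if g.Homogeneous {
		// Shapes are known up front, so a topology option that fits the
		// future desired count can be picked before all pods exist.
		return createdPods > 0
	}
	// Heterogeneous pods: shapes are unknown until every pod is created.
	return createdPods >= g.DesiredCount
}

func main() {
	g := PodGroup{DesiredCount: 4, Homogeneous: false}
	fmt.Println(readyToPlace(g, 2), readyToPlace(g, 4)) // prints: false true
}
```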

Contributor Author

As there are some questions about the semantics of desiredCount, I would like to avoid making promises now that we would need to withdraw later. Because of this I will leave the current, somewhat vague description of desiredCount, so that it can be clarified during implementation or when we move to beta.

Member

+1

@sanposhiho
Member

Sorry for not coming back here for a while. It looks great now, except for the DesiredCount issue that @dom4ha also pointed out:
#5733 (comment)

44past4 and others added 5 commits February 5, 2026 12:34
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
@dom4ha
Member

dom4ha commented Feb 5, 2026

/lgtm
/approve
/hold for Kensei

We're still waiting for a decision regarding the workload ref, but that's just a nit change to the KEP to keep it aligned.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2026
@sanposhiho
Member

/lgtm
/approve

still not sure what the best API around Basic (desiredCount) is, but LGTM anyway for now for alpha.

@wojtek-t
Member

wojtek-t commented Feb 6, 2026

/lgtm
/approve PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 44past4, dom4ha, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2026
@44past4
Contributor Author

44past4 commented Feb 6, 2026

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 6, 2026
@wojtek-t
Member

wojtek-t commented Feb 6, 2026

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2026
@k8s-ci-robot k8s-ci-robot merged commit bfb40c4 into kubernetes:master Feb 6, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 6, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in SIG Scheduling Feb 6, 2026
Karthik-K-N pushed a commit to Karthik-K-N/enhancements that referenced this pull request Feb 10, 2026
* Topology-aware workload scheduling KEP

* Fixed Toc

* Added KEP reviewers and approvers

* Initial batch of fixes after reviews

* Move DRA constraints support to beta

* Fix TOC

* Smaller fixes based on review feedback.

* Update Explicit Topology Definition description.

* Added Plugin suffix to PlacementGenerator, PlacementState and PlacementScorer

* Updating README.md based on the comments

- Added requirement to PlacementGeneratorPlugin to implement EnqueueExtensions
- Added information about PlacementGeneratorPlugins to be called after PreFilter scheduling phase.
- Changed NodeAffinity to NodeSelector in Placement struct

* Add prod readiness file

* Production Readiness Review Questionnaire

* Fixed spelling errors

* Update kep.yaml

* Extend KEP with desiredCount.

* Address comments from dom4ha

* Update README.md

* Update README.md

* Fixed Toc

* Add desiredCount to Gang policy

* Added cluster autoscaling support as requirement for beta

* Fix phrasing

Co-authored-by: Dominik Marciński <gmidon@gmail.com>

* Fix phrasing

Co-authored-by: Dominik Marciński <gmidon@gmail.com>

* Updates from review.

* Updates from review.

---------

Co-authored-by: Dominik Marciński <gmidon@gmail.com>
