KEP-5710: Add initial KEP docs for workload-aware preemption #5711

Merged
k8s-ci-robot merged 9 commits into kubernetes:master from wojtek-t:workload_aware_preemption
Feb 5, 2026

Conversation

@wojtek-t
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2025
@wojtek-t
Member Author

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch 2 times, most recently from ce04eca to 0ff3958 Compare December 1, 2025 08:52
Comment on lines 385 to 387
1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Contributor

@44past4 44past4 Dec 1, 2025

Having two independent priorities for a workload (one for scheduling and one for preemption), or a single preemption priority that can be dynamically updated, can potentially lead to preemption cycles.

Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.

Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.

Workload B will preempt workload A and start running because its scheduling priority > preemption priority of workload A.

However, when workload A restarts and is rescheduled, it will preempt workload B and start running because its scheduling priority > preemption priority of workload B.

The same issue can happen if we have only one priority but that priority is reduced while the workload is running. After preemption, when the workload reappears with its original higher priority, it can preempt the workload that preempted it.
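The cycle described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the `Workload` class, the `can_preempt` rule, the numeric priority values, and a cluster with room for exactly one workload are all hypothetical and chosen to mirror the example, not taken from the KEP or from kube-scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    scheduling_priority: int  # compared when the workload is the preemptor
    preemption_priority: int  # compared when the workload is a potential victim


def can_preempt(preemptor, victim):
    """Assumed rule: scheduling priority of the preemptor vs preemption priority of the victim."""
    return preemptor.scheduling_priority > victim.preemption_priority


def simulate(running, pending, max_rounds):
    """Cluster fits one workload; a preempted workload is re-queued as pending."""
    events = []
    for _ in range(max_rounds):
        if not can_preempt(pending, running):
            break
        events.append(f"{pending.name} preempts {running.name}")
        running, pending = pending, running
    return events


HIGH, MEDIUM, LOW = 300, 200, 100  # illustrative values only
workload_a = Workload("A", scheduling_priority=HIGH, preemption_priority=LOW)
workload_b = Workload("B", scheduling_priority=MEDIUM, preemption_priority=MEDIUM)

for event in simulate(running=workload_a, pending=workload_b, max_rounds=4):
    print(event)
```

With these values the loop never reaches a stable state on its own; it stops only because `max_rounds` caps it.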

Contributor

One potential solution / mitigation to the described problem could be stating that preemption priority >= scheduling priority. This way after restarting the preempted workload will not be able to preempt the preemptor workload.
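A brute-force check over a small priority range illustrates why this constraint rules out mutual preemption when priorities are static. It is a sketch, not a proof for the real scheduler: workloads are modeled as hypothetical `(scheduling_priority, preemption_priority)` tuples, and the `can_preempt` rule is an assumption matching the example in this thread.

```python
from itertools import product


def can_preempt(preemptor, victim):
    """Workloads are (scheduling_priority, preemption_priority) tuples."""
    return preemptor[0] > victim[1]


def satisfies_invariant(workload):
    """Proposed constraint: preemption priority >= scheduling priority."""
    return workload[1] >= workload[0]


def find_two_cycles(workloads):
    """Return all pairs that could preempt each other."""
    return [
        (a, b)
        for a, b in product(workloads, repeat=2)
        if can_preempt(a, b) and can_preempt(b, a)
    ]


levels = range(4)  # a small, arbitrary set of priority levels
all_workloads = list(product(levels, repeat=2))
constrained = [w for w in all_workloads if satisfies_invariant(w)]

print(len(find_two_cycles(all_workloads)) > 0)  # cycles exist without the constraint
print(find_two_cycles(constrained))             # no pair remains with it
```

The reason is a short chain of inequalities: a mutual preemption would need `a.sched > b.preempt >= b.sched > a.preempt >= a.sched`, which is a contradiction. The check says nothing about priorities that change over time.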

Member Author

Thanks for pointing that out!

Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any use cases that would benefit from the reverse.

That said, I need to think a bit more about whether that is enough. I think it prevents cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.

Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.

Member Author

OK - I have added an unresolved section about that to the Workload priorities section above describing the problem, potential solution and alternatives. Let's continue the discussion there.

@sanposhiho
Member

/assign

Contributor

@erictune erictune left a comment

Great to see this, and I like how it is decoupled from the other work planned for 1.36.

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor

Let's assume that nodes have either a high pod-per-node count or a low pod-per-node count. It's a bimodal distribution.

Let's further assume that if Gang scheduling is used, then the node is going to usually be low pod-per-node count.

So, then we can do the following:

  1. Individual Pod as preemptor - assume high pod-per-node, use current algorithm, which is optimized for many pods per node, consider all victims.
  2. Gang as preemptor - assume low pod-per-node in all cases, consider a maximum of e.g. 4 reprieves per node to keep compute time down, and just stop reprieving when there are more things on the node.

Member Author

Every split in the algorithm/code path makes it harder to reason about. This is why I'm trying to avoid that whenever possible.

Additionally, while I agree with you that in the majority of cases it will be true, there are definitely use cases where people run gang workloads with many pods per node. So in my opinion the split as proposed could potentially result in decisions that would be really far from optimal.

In the spirit of trying to simplify and unify things as much as possible, I actually adjusted the algorithm so that we can have a single scheme that addresses all four use cases that we have. I think this is a much better option.

PTAL

Contributor

LGTM

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Dec 2, 2025
@xigang
Member

xigang commented Dec 3, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang December 3, 2025 00:55
1. From remaining potential victims, we start to reprieve pods starting from the highest priority
and working down until the set of remaining victims still keeps the node feasible.

Once we compute the feasibility and list of victims for all nodes, we score that and choose the
Contributor

Nit: it's possible that we will not do that for all nodes in the cluster. We find feasible nodes until we have max(numNodes * 0.1, 100) nodes from which we can choose from: https://github.com/kubernetes/kubernetes/blob/ec1bf8a4f3a5f054065225dc8275c66b93310d17/pkg/scheduler/framework/preemption/preemption.go#L363-L364

Member Author

Good catch - updated (although I don't think it changes anything for this particular proposal).

Contributor

Probably not for the initial implementation, but it's worth keeping in mind once we look into the scalability of workload preemption.

- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W

1. If removing all the potential victims would not make the new workload W schedulable,
Contributor

I think we should point out that this depends on workload aware scheduling which is not yet implemented and is planned for 1.36.

1. If removing all the potential victims would not make the new workload W schedulable,
the workload is unschedulable even with preemption.

```
Contributor

Nit: you need to indent this "code block" to keep the numbering continuous.


1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Member

What if there is a workload and an individual pod, where only one is needed to make the new workload schedulable. Which one will be chosen?

Member Author

I think we should choose pod, but I don't have super strong preference. I added a point about sorting to reflect that but I'm happy to take any suggestions there.

Contributor

I guess if they have the same priority then: single pod > pod from workload with gang preemptable = false > workload with gang preemptable = true?
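One way to read the ordering suggested in this thread is as a sort key over victim candidates. The sketch below is illustrative only: `VictimKind`, `Candidate`, and `preemption_order` are hypothetical names, and sorting by priority first (lowest priority preempted first) is an assumption layered on top of the tie-break discussed here.

```python
from dataclasses import dataclass
from enum import IntEnum


class VictimKind(IntEnum):
    # Lower value = preferred as a victim when priorities are equal.
    SINGLE_POD = 0
    POD_OF_NON_GANG_WORKLOAD = 1  # workload with gang preemptable = false
    GANG_WORKLOAD = 2             # workload with gang preemptable = true


@dataclass(frozen=True)
class Candidate:
    name: str
    priority: int
    kind: VictimKind


def preemption_order(candidates):
    """Lowest priority first; ties broken by victim kind."""
    return sorted(candidates, key=lambda c: (c.priority, c.kind))


candidates = [
    Candidate("training-gang", 10, VictimKind.GANG_WORKLOAD),
    Candidate("batch-pod-3", 10, VictimKind.POD_OF_NON_GANG_WORKLOAD),
    Candidate("web-1", 10, VictimKind.SINGLE_POD),
    Candidate("low-gang", 5, VictimKind.GANG_WORKLOAD),
]

print([c.name for c in preemption_order(candidates)])
```

Encoding the tie-break as an ordered enum keeps the rule in a single place, which makes it easy to change if the discussion settles on a different order.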

Comment on lines 478 to 484
1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
`WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
`GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
already available to use. The implementation will work similarly to `WaitOnPermit` to
ensure that `GetResources` was executed for all pods from within a `PodGroup`.
Member

How will the preemption targets be released if we end up not running RunGetResourcesPlugins after all? For example, when a gang turns out to be unschedulable.

Member Author

That's a very good question. I think we want something conceptually similar to the "Reserve/Unreserve" pattern from DRA.

So the scheduling phase will effectively serve as the "reserve" phase, and we will have a sibling method, "unschedule", that will be able to re-assume the victims.

It requires some description though.
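As a rough illustration of the reserve/unreserve idea, the sketch below tentatively removes victims from an in-memory view and puts them back if the gang turns out not to fit. Everything here is hypothetical: `Snapshot`, `reserve_victims`, `unreserve_victims`, and `try_schedule_gang` are made-up names for illustration and are not scheduler framework APIs.

```python
class Snapshot:
    """Hypothetical in-memory view of which pods are assumed on which node."""

    def __init__(self, pods_by_node):
        self.pods_by_node = {node: set(pods) for node, pods in pods_by_node.items()}

    def reserve_victims(self, victims):
        """Tentatively drop victims from the view; return an undo record."""
        removed = []
        for node, pod in victims:
            if pod in self.pods_by_node.get(node, set()):
                self.pods_by_node[node].remove(pod)
                removed.append((node, pod))
        return removed

    def unreserve_victims(self, removed):
        """Re-assume previously reserved victims."""
        for node, pod in removed:
            self.pods_by_node[node].add(pod)


def try_schedule_gang(snapshot, victims, gang_fits):
    """Reserve victims, check the gang, and roll back if it does not fit."""
    removed = snapshot.reserve_victims(victims)
    if gang_fits(snapshot):
        return True
    snapshot.unreserve_victims(removed)
    return False


snapshot = Snapshot({"node-1": ["victim-a", "other"], "node-2": ["victim-b"]})
victims = [("node-1", "victim-a"), ("node-2", "victim-b")]

# The gang turns out to be unschedulable, so the victims are restored.
print(try_schedule_gang(snapshot, victims, gang_fits=lambda s: False))
print(sorted(snapshot.pods_by_node["node-1"]))
```

No actual preemption happens in either branch of this sketch; it only models the bookkeeping, which is the part the question above is about.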

We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
becomes a challenge, thus we modify to the approach below.

To check if a workload W can be scheduled on a given cluster with preemption we:
Member

Shouldn't we talk about a "gang pod group" rather than a "workload"?

Member Author

I don't have strong opinion here - let me change it.

@Argh4k
Contributor

Argh4k commented Dec 4, 2025

Do we want to add as a part of this KEP a description of how the preemption fits the workload aware scheduling (codewise)? Or do we want to have this other way around, have the KEP for workload aware scheduling reference this one when talking about preemption?

In the gang scheduling KEP we talk about adding a "Workload" phase where we will end up with pods from the gang with nominated node names. I assume that this preemption will be part of this phase. The open question is what the outcome of the preemption will actually be:

  • will the workload preemption trigger the preemption, counting on delayed preemption to actuate it?
  • will the workload preemption mark pods for preemption, with the trigger done by the current preemption in the pod post filter? This is actually my preferred option, as it will also take into account changes that happened in the cluster between the workload scheduling cycle and pod scheduling.
  • something else?


As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
preemptions. However, this is not true for the current gang scheduling implementation.
In the current implementation, preemption is triggered in the `PostFilter`. However, it's entirely
Contributor

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

Member Author

> So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

Great point - I updated this paragraph to reflect that.

> I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

I hope that an updated KEP for gang scheduling, describing the workload scheduling phase, will be opened pretty soon, so that I can just link to it here :)
@macsko ^^

1. New field in the workload object (delayed preemption will not bring much value in
case of scheduling individual pods, though there would be significant benefit from
unification, so probably this isn't ideal option).
1. Storing it in private kube-scheduler' structures (PodInfo for individual pods and
Contributor

This does not allow external schedulers to use the same concept for victims nomination.

Member Author

I would like to keep external schedulers out of scope for now - added explicitly to the non-goals section.

Member Author

@wojtek-t wojtek-t left a comment

I tried to address most of the comments, I will try to respond/address the remaining ones later today/tomorrow.

- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
if it gets recreated, it will preempt workload C causing unnecessary cascading preemption.
This is the reason why a cost-based model was discarded.
Member Author

@erictune - I thought a bit more about the idea of "preemption priority" vs "preemption cost" that we chatted about offline.
I acknowledge the deficiencies of the currently proposed model, but I think that switching to preemption cost and a purely scoring-based approach will not prevent cascading preemptions, which we should really try to avoid.

I tried to update the KEP to reflect that - PTAL and I'm happy to chat more about it.

Member

I don't think we need to assume the preemption decision is simply a "priority, then cost" decision, but could in fact be some function of them. I guess that's what you mean by "scoring". I think when you combine "cost" with non-isolated decisions, you can get a better result. By isolated, I mean, not choosing to consider A, B, and C all in the same scheduling decision for "A", but instead just pairwise decisions of "A" and "B" vs "A" and "C". From what I am understanding, the plan is to consider only the pairwise options; I think cascading preemptions may be inevitable in that case (or we may have to severely limit utility).
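The cascading effect under discussion can be illustrated with a toy model. The quoted KEP excerpt only shows workload C (scheduling priority `low`, preemption cost `high`); the attributes of A and B below, the numeric values, the one-victim-per-preemption rule, and all names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int  # scheduling priority
    cost: int      # preemption cost


def run(running, arriving, victim_key):
    """Each arriving or re-queued workload preempts one lower-priority victim, if any."""
    running = list(running)
    pending = [arriving]
    events = []
    while pending:
        workload = pending.pop(0)
        eligible = [v for v in running if v.priority < workload.priority]
        if not eligible:
            continue  # nothing to preempt; the workload stays unscheduled
        victim = min(eligible, key=victim_key)
        running.remove(victim)
        running.append(workload)
        pending.append(victim)  # the victim is recreated and retried
        events.append(f"{workload.name} preempts {victim.name}")
    return events


a = Workload("A", priority=300, cost=5)   # assumed preemptor
b = Workload("B", priority=200, cost=1)   # assumed: medium priority, low cost
c = Workload("C", priority=100, cost=10)  # low priority, high cost

print(run([b, c], a, victim_key=lambda w: w.cost))      # cost-based victim choice
print(run([b, c], a, victim_key=lambda w: w.priority))  # priority-based victim choice
```

In this model the cost-based choice produces two preemptions where the priority-based choice produces one. A scoring function that looks at A, B, and C together, as suggested above, is not captured by this pairwise sketch.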

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2026

We will create integration test(s) to ensure basic functionalities of workload preemption:

- Pods from a single PodGroup with `DisruptionMode=Pod` can be preempted individually by the
Member

Since you split the second case for DisruptionMode=PodGroup into two cases depending on whether a Pod or a PodGroup is the preemptor, we should split the DisruptionMode=Pod case as well for symmetry.

Member Author

The other case is: "pod preempting pod", which is the logic that we already have now :)
That's why it wasn't listed.

Member

It's pod preempting a PodGroup in pod disruption mode. Thanks for adding.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 9d1e5b8 to cc9ade4 Compare February 3, 2026 14:12
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2026
@dom4ha
Member

dom4ha commented Feb 3, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2026

The feature starts working again.

###### Are there any tests for feature enablement/disablement?
Contributor

Wouldn't this KEP also depend on PodGroup APIs and PodGroups feature gates?

Plus you would need workload API enabled also?

There seems to be a relatively complicated way to roll out WAS features so I think we should at least comment on all the dependencies needed for this feature to work.

Member Author

That's a good callout. I just added a paragraph about it in the enablement question.

Contributor

@kannon92 kannon92 left a comment

PRR is very close.

I think we need some comments on how this will interact with other features for WAS.

https://github.com/kubernetes/enhancements/pull/5711/changes#r2759717842

We do have a dependency on the workload API but it looks like we may not have a dependency on the pod group feature.

But in order for this feature to work I think you have other dependent feature gates and APIs that need to be enabled, no?

@kannon92
Contributor

kannon92 commented Feb 3, 2026

#5711 (review)

Thinking more on this I feel that it is more of a nit and not blocking PRR approval.

/approve

for PRR.

@sanposhiho
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: erictune, kannon92, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2026
@helayoty helayoty moved this from In Progress to Needs Final Approver in SIG Scheduling Feb 5, 2026
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2026
@wojtek-t wojtek-t added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Feb 5, 2026
@wojtek-t
Member Author

wojtek-t commented Feb 5, 2026

/hold cancel
Given approvals from Kensei, Dominik, and Kevin.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2026
@dom4ha
Member

dom4ha commented Feb 5, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2026
@k8s-ci-robot k8s-ci-robot merged commit 8eedb5d into kubernetes:master Feb 5, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 5, 2026
@github-project-automation github-project-automation bot moved this from Needs Final Approver to Done in SIG Scheduling Feb 5, 2026
Karthik-K-N pushed a commit to Karthik-K-N/enhancements that referenced this pull request Feb 10, 2026
…tes#5711)

* Workload-aware preemption KEP

* Expand on review comments

* Improved delayed preemption design

* Few proposed actions in unresolved sections as plan of record

* Further redesign of delayed preemption

* Remove PreemptionPriority from the initial scope & review comments

* Move Delayed Preemption to KEP-4671

* Apply review comments

* Mention dependency on Workload API
