KEP-3990: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #4150
Conversation
sanposhiho commented on Aug 12, 2023
- One-line PR description: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode
- Issue link: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990
- Other comments:
Force-pushed from 3cb93d9 to 5318bd5
/assign @alculquicondor @MaciekPytel
### introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed`
It's a design similar to what we discussed in the issue. But I switched this to `fallbackCriteria` for the reason described here. Open for discussion about which design we should choose.
Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.
One possible mitigation is to add `UnschedulablePlugins`, equivalent to [QueuedPodInfo.UnschedulablePlugins](https://github.com/kubernetes/kubernetes/blob/8a7df727820bafed8cef27e094a0212d758fcd40/pkg/scheduler/framework/types.go#L180), somewhere in Pod status.
@alculquicondor What do you think?
@@ -0,0 +1,3 @@
kep-number: 3990
beta:
  approver: "@wojtek-t"
@wojtek-t I assigned it to you, as other sig-scheduling KEPs do.
But please reassign it to someone else if needed. 🙏
Just quickly skimmed through it - I will review it more deeply once you have buy-in from the SIG.
#### the fallback could be done when it's actually not needed.

Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.
I don't understand it - can you clarify?
For example, let's say PodA has a required Pod Topology Spread constraint with `fallbackCriteria: ScaleUpFailed`. During PodA's first scheduling attempt, all Nodes passed the Pod Topology Spread filter plugin but were rejected by other plugins (e.g., the resource fit filter plugin because of insufficient resources), and PodA ended up unschedulable as a result. Then the cluster autoscaler tried to create a new Node for PodA, but couldn't because of a stockout.
In this case, Pod Topology Spread doesn't actually need to fall back, because it accepted all Nodes in the previous scheduling cycle -- meaning Pod Topology Spread wasn't the cause of unschedulability, and the fallback won't make PodA schedulable.
But the problem is that Pod Topology Spread cannot see why this Pod was rejected in the previous scheduling cycle. It only knows that PodA was rejected by some plugin in the previous scheduling cycle and that the cluster autoscaler couldn't create a new Node.
@@ -0,0 +1,35 @@
title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
nit:
- title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
+ title: Pod Topology Spread DoNotSchedule to ScheduleAnyway fallback mode
/cc @mwielgus
A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
to represent when to fall back from DoNotSchedule to ScheduleAnyway.
It can contain two values: `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for Pods,
Could it not simply default to `FailedScheduling`? As far as I know, `ScaleUpFailed` is a cluster-autoscaler-specific event, so for those who use something else (e.g. Karpenter), this will not work.
One challenge with `FailedScheduling` is that it's what the cluster autoscaler reacts to, so there could be a race between the cluster autoscaler scaling up and the scheduler switching to fallback mode.
This part only means that `ScaleUpFailed` is a new value that will be added to `fallbackCriteria`. Yes, I took the name from the cluster autoscaler's event name, but it should work with any cluster autoscaler, as long as it updates the `TriggeredScaleUp` Pod condition, which is also proposed in this KEP.
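For illustration only, here is roughly what that could look like on a Pod, assuming the proposed (not yet existing) `fallbackCriteria` field and `TriggeredScaleUp` condition from this KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
    # Proposed field (not in any released Kubernetes API): fall back to
    # ScheduleAnyway once the autoscaler reports it cannot create a Node.
    fallbackCriteria:
    - ScaleUpFailed
status:
  conditions:
  # Proposed condition, set by the autoscaler when scale-up fails.
  - type: TriggeredScaleUp
    status: "False"
    reason: ScaleUpFailed        # illustrative reason, e.g. stockout/quota
    message: "no node group could be scaled up for this pod"
```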
We had too much to bite off in this cycle. This will have to wait for the next release :( Try to get some input from SIG Autoscaling in one of their meetings.
Yes..
Sure.
# KEP-3990: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
Curious if there's an analogous use case for:
- Preferential node affinity
- Preferential pod affinity/antiaffinity
Is there any world where a user may want to have more control over how preferences are handled with autoscalers?
> if there's an analogous use case for:

Yes, there might be. We would wait for actual use cases from users before implementing similar support there, though; we should make this design easy to follow in other scheduling constraints.
On the other hand, `FallBackCriteria` allows us to unify APIs in all scheduling constraints.
We will just introduce the `FallBackCriteria` field in them and there we go.
This makes me think that we might want to model fallback criteria at the pod level.
preferentialFallback:
- PodAffinity
- AntiAffinity
- TopologySpreadConstraints
It may be difficult to get the granularity right, though. At another extreme (as you propose), you'd add a preference policy for each preferred term across the pod spec.
That's an interesting idea.
When people want to use fallback, they may want to do the same in all required scheduling constraints the Pod has. Also, we may want to give `preferentialFallback` a way to "fall back in all".
So, maybe it'd be like this:
preferentialFallback:
  schedulingConstraints:
  - "*" # do fallback in all scheduling constraints.
  # You can also specify individual scheduling constraint names, which support fallback.
  # (But note that this KEP's scope is only TopologySpread.)
  fallBackCriteria:
  - ScaleUpFailed
Let me include this idea in the alternatives section for now. Like the idea of introducing enums, we can gather feedback from more people about the design choice among them.
Instead of `FallBackCriteria`, introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed` in `WhenUnsatisfiable`.
`DoNotScheduleUntilScaleUpFailed` corresponds to `ScaleUpFailed`,
and `DoNotScheduleUntilPreemptionFailed` corresponds to `PreemptionFailed`.
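For contrast (illustrative only; neither enum value exists in the current API), the alternative quoted above would surface directly on the existing field:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  labelSelector:
    matchLabels:
      app: my-app
  # Alternative design: new enum values on whenUnsatisfiable instead of a
  # separate fallbackCriteria field.
  whenUnsatisfiable: DoNotScheduleUntilScaleUpFailed
```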
There are some benefits to reusing an existing field with new enum values. I'm a touch concerned with how complicated topologySpreadConstraints are getting, with two new honor fields (with different defaults).
I interact with a bunch of customers who are learning Kubernetes scheduling for the first time when onboarding to Karpenter, and communicating the complexity and gotchas of topology is always a huge hurdle to get over.
Obviously, I agree that we need to make it as easy to understand as possible. But introducing a new field doesn't always result in confusion.
For this particular KEP, I think `FallBackCriteria` is not much more complicated than introducing new enum values in `WhenUnsatisfiable`. So currently I'm leaning towards `FallBackCriteria`, for the ease of introducing it into other scheduling constraints, as described.
But maybe I'm biased since I know the scheduler/CA, I know everything that this KEP wants to do, etc. 😅
So I'd like to keep this discussion open to gather more feedback from reviewers.
I agree that topologySpreadConstraints are difficult to explain to clients (been there), but I think this added field is optional complexity. It is also the other way around, right? It is a complex solution to a complex setup, but the complex setup came first. Clients who need to hear this are already dealing with specific cases that call for all of this, and in theory are easier to talk to. I also agree that I don't see the existing field with new enum values as easier to explain, so I guess that if it is hard to explain, it would be hard either way.
Apparently I wrote this a while ago...
Describing the current situation here: I brought this KEP into SIG/Autoscaling weekly meeting a couple of months ago, and concluded that we had to consider how CA set
Not sure if we'll make it in this release though. I'll spend some time on this KEP in this release cycle as well - I'll go through CA's implementation to understand the scenarios there myself, and elaborate the KEP around that. Thanks for all the feedback so far; I'll bring the fixes for all of it at the same time.
Force-pushed from a545da3 to ea3fc17
Force-pushed from 2d8f39b to 9db7b55
@MaciekPytel can you (or assign someone to) review this KEP from the autoscaling (CA) PoV?
@sanposhiho I can take a look from SIG autoscaling PoV, but late next week at the earliest (busy with some last-minute 1.32 changes in CA right now).
Thanks @towca for picking it up!
@towca Any update?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@towca Any update?
@wojtek-t A new
@sanposhiho - agree; although I think it's an extension and we don't want to include that in the current proposal
Yeah, definitely no need to include
- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
I don't think we want to introduce an autoscaler-specific construct into the pod spec. In addition to smelling like a separation-of-concerns violation, the cluster autoscaler in particular is designed to scale nodes, not pods. (And the "scale" terminology here could be confused with HPA as well.)
> I don't think we want to introduce an autoscaler-specific construct into the pod spec.

Rather, I think we do. Look at other recent KEPs happening around sig-scheduling/autoscaling: the scheduler and the autoscaler are getting closer and closer, working tightly together.
E.g., #5287 proposes to use NNN to get the intention from the autoscaler, and there is also a workload scheduling initiative we're planning: basically, we're trying to take "whether CA can potentially create nodes for pods" into consideration and take an appropriate action based on that.

> And "scale" terminology here could be confused between HPA, as well.

This is true, though; I'm open to renaming it to anything else.
IMO this seems fine. CA is designed to scale Nodes, but it reacts to Pods, so it makes sense to me that we'd put autoscaling-related information (either in status or spec) on Pods.

> And "scale" terminology here could be confused between HPA, as well.
> But, this is true, I'm open to rename it to anything else.

We actually decided to align the core concepts naming between Cluster Autoscaler and Karpenter, and the Karpenter names made more sense. We still need to change the naming in CA (kubernetes/autoscaler#6647), but e.g. the official k8s Node autoscaling docs already use the new naming. In particular, IMO this should be named `NodeProvisioningFailed`.
Should we distinguish `NodeProvisioningFailedAndAutoscalingRanOutOfOptions` from `NodeProvisioningFailedButThereAreOtherOptionsToTry`? If there are many potential options, should there be a timeout?
Actually, even if there's just one option, wouldn't it be a valid use case to wait for a specified amount of time in case of some temporary issues (e.g. stockouts)?
Does this timeout idea mean CA never puts the NodeProvisioningFailed condition on its own, even when CA is alive and healthy (i.e., not only when CA is down)? Would the scheduler always set NodeProvisioningFailed based on the timeout? In other words, is there any case where CA wants to set NodeProvisioningFailed on its own before the timeout is reached and the scheduler would set it?
No, the NodeProvisioningFailed condition should only be added when node provisioning has failed; I was thinking the timeout could be used in the absence of any signal regarding node provisioning.
The timeout duration specified at the topology spread constraint level would be a way of saying: I'm willing to wait for up to N seconds if this is going to help me satisfy my spreading constraints. This is a clear trade-off the users could make. It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.
> No, NodeProvisioningFailed condition should only be added when node provisioning has failed, I was thinking the timeout could be used in absence of any signal regarding node provisioning.

But earlier in this thread we said that "node provisioning has failed" might also have to be determined by a timeout, right? Then, if we implement the timeout config for this fallback feature in the scheduler configuration, that might be all we need: the timeout would catch both cases - CA cannot create nodes (due to stockout etc.), and CA is down or not working properly. That's what I meant.

> It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.

That's a new option that we haven't explored, though - can users actually set an appropriate timeout?
If you're just a cluster user, not an admin, you don't know how long the cluster autoscaler usually takes to provision nodes.
So, as a first impression, I'm still inclined towards the current idea: end users just set a new field, `fallbackCriteria`. On top of that, maybe we can allow the cluster admin to configure a global timeout for the fallback in the scheduler configuration, based on their cluster autoscaler's usual provisioning time.
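A rough sketch of what such an admin-level knob could look like in the scheduler configuration (purely hypothetical; the `fallbackTimeoutSeconds` argument is invented here for illustration and does not exist in the PodTopologySpread plugin args):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      # Hypothetical: how long the scheduler waits for an autoscaling signal
      # before treating DoNotSchedule constraints as ScheduleAnyway.
      fallbackTimeoutSeconds: 300
```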
Making it a part of scheduler configuration is definitely worth considering. I wonder if there may be use cases for different thresholds based on business requirements. But if the timeout is just a safeguard protecting from autoscaling being down, perhaps a single value set by a platform team or cluster admin is good enough.
Based on the discussion here, I'll update the KEP later when I've got time.
@sanposhiho Sorry for the delay, I've been swamped with reviews recently.
Overall, the proposal looks good to me from the perspective of Node autoscaling. I'd like to iterate on a couple of things before approving:
- We should take care to design the conditions properly, so that they are understandable and useful beyond just this KEP.
- We should define the autoscaler behavior when scheduler still can't schedule the Pod after falling back.
Left related comments on the PR!
- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
- `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- introduce `TriggeredScaleUp` in Pod condition
This would be a great addition regardless of this KEP, I've been hearing demand for well-defined autoscaling-related conditions/events for a while now.
I know that @elmiko has been looking into this (or talking to someone who has). Do you know if there's been any progress on this Michael?
Also keep in mind that we need to take Karpenter into account for any new standardized autoscaling-related APIs like this condition. This seems fairly uncontroversial though, WDYT @jonathan-innis?
Exactly - not only for this TopologySpread fallback feature; this new condition would be a standard way for the scheduler to receive a response from CA and take an appropriate action based on it. As I mentioned in another thread, scheduling and autoscaling are getting closer and going to work more closely together towards better pod placement. This new condition allows us to explore various scheduling options (including this kind of fallback mechanism) at scheduling time.
// TriggeredScaleUp indicates that the Pod triggered scaling up the cluster.
// If it's true, a new Node for the Pod was successfully created.
// Otherwise, creating a new Node for the Pod was attempted but failed.
TriggeredScaleUp PodConditionType = "TriggeredScaleUp"
I'm not sure about the naming here, especially considering the scenario you mention where an autoscaler first scales up, notices a stockout, and then can't scale again because the initial group is in backoff.
Given the naming, I wouldn't expect the condition to ever flip back to False (the Pod did trigger a scale-up after all). Maybe we need an additional condition specifying whether a Pod is "helpable" by an autoscaler?
As I mentioned in another thread, I don't have a strong opinion here. I'm open to changing it to anything else!
1. Pod is rejected and stays unschedulable.
2. The cluster autoscaler finds those unschedulable Pod(s) but cannot create Nodes because of stockouts.
3. The cluster autoscaler adds `TriggeredScaleUp: false`.
4. The scheduler notices `TriggeredScaleUp: false` on the Pod and schedules that Pod while falling back to `ScheduleAnyway` on Pod Topology Spread.
What if at step 4 there's still no space on existing Nodes to schedule the Pod even with `ScheduleAnyway`? Should an autoscaler attempt to scale up again, but this time falling back to `ScheduleAnyway`? It sounds desirable, but complicates things a bit. And FWIW I think we would need to explicitly prevent Cluster Autoscaler from doing this - otherwise the scheduler code vendored in CA will start falling back to `ScheduleAnyway` after CA adds the condition.
> What if at step 4. there's still no space on existing Nodes to schedule the Pod even with ScheduleAnyway?

It means some other scheduling factors at the Filter phase (e.g., resource requirements, pod affinity, etc.) are not satisfied even if we ignore topology spread at the Filter phase. The Pods just become unschedulable again.

> Should an autoscaler attempt to scale-up again, but this time falling back to ScheduleAnyway?

I think so, given that the cluster autoscaler could not find a place at step 2. The cluster autoscaler should re-compute which node group has to be tried, but this time considering the spreading constraint as ScheduleAnyway, i.e., ignoring it in CA's scheduling simulation.

> It sounds desirable, but complicates things a bit.

Does it? We change the PodTopologySpread plugin's Filter to ignore pods with `fallbackCriteria: ScaleUpFailed` when those pods have `TriggeredScaleUp: false`.
When the cluster autoscaler tries to find a new node for this pending pod again (after steps 2 and 3), the PodTopologySpread plugin's Filter (running within CA) will just ignore those pods.
So, to me, it looks very straightforward.
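A minimal sketch of the check described above (the `fallbackCriteria` values and the `TriggeredScaleUp` condition are the proposed, not-yet-existing API from this KEP, and the helper below is invented for illustration, not actual plugin code):

```go
package fallback

import (
	v1 "k8s.io/api/core/v1"
)

// Proposed condition type from this KEP; not part of any released API.
const triggeredScaleUp v1.PodConditionType = "TriggeredScaleUp"

// shouldFallBack reports whether a topology spread constraint that asked for
// the ScaleUpFailed fallback should be treated as ScheduleAnyway, because the
// autoscaler already reported that it could not create a Node for the Pod.
func shouldFallBack(pod *v1.Pod, fallbackCriteria []string) bool {
	wantsScaleUpFallback := false
	for _, c := range fallbackCriteria {
		if c == "ScaleUpFailed" {
			wantsScaleUpFallback = true
			break
		}
	}
	if !wantsScaleUpFallback {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == triggeredScaleUp && cond.Status == v1.ConditionFalse {
			return true
		}
	}
	return false
}
```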
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sanposhiho. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
#### How we implement `TriggeredScaleUp` in the cluster autoscaler

Basically, we just put `TriggeredScaleUp: false` on Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) in every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
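Purely as an illustration of the quoted idea (the condition type is the one proposed in this KEP and the helper is invented; this is not existing Cluster Autoscaler code), the status-processor side could look roughly like this:

```go
package status

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Proposed condition type from the KEP; not part of any released API.
const triggeredScaleUp v1.PodConditionType = "TriggeredScaleUp"

// markScaleUpFailed returns an updated copy of the Pod's conditions with
// TriggeredScaleUp=False; the autoscaler's status processor would patch this
// back through the API server for every Pod that remained unschedulable.
func markScaleUpFailed(pod *v1.Pod, reason, message string) []v1.PodCondition {
	newCond := v1.PodCondition{
		Type:               triggeredScaleUp,
		Status:             v1.ConditionFalse,
		Reason:             reason,  // e.g. "ScaleUpFailed" (illustrative)
		Message:            message, // e.g. "all matching node groups are out of stock"
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	conds := append([]v1.PodCondition{}, pod.Status.Conditions...)
	for i := range conds {
		if conds[i].Type == triggeredScaleUp {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}
```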
We should also clean up the condition if there's a change afterwards. Say a pod was marked unschedulable due to a stockout, but later the stockout was gone and a new attempt to create a Node started. In such a case the pod should clean up the existing condition (or rather add one with value `true`). Btw, per my other comment I would rename this to `NodeProvisioningInProgress`. `NodeProvisioningInProgress: true` means autoscaling started and we should expect a new Node (ideally we'd combine that with setting `nominatedNodeName` on the pod). `NodeProvisioningInProgress: false` means autoscaling ran out of options - there is no attempt to create any new node for this pod, so the fallback should happen.
@towca @MartynaGrotek @jackfrancis for thoughts.
That is a good point.
Actually, this KEP misses the point of "how CA should handle those pods in later provisioning attempts if the first attempt fails".
I believe we have two options here:
- CA should always keep trying to create the node while considering the topology spreading as required (w/o fallback). I.e., even if the first provisioning attempt fails, the cluster autoscaler keeps considering the pod topology spread as a required scheduling constraint. (Although, in the meantime, the scheduler can try to schedule those pods with the topology spread as a preferred scheduling constraint, i.e., fallback.)
- After the first provisioning attempt fails, CA should fall back as well, considering the topology spreading as a preferred scheduling constraint. (That means CA just ignores the topology spreading, as CA doesn't take preferred scheduling constraints into consideration.)

And then, if we go with the first option, we should:
- always set `NodeProvisioningInProgress: true` on pending pods in memory (i.e., not actually applied through kube-apiserver at this point) before CA's scheduling simulation happens, so that the pod topology spread plugin (which CA internally uses for the scheduling simulation) works without the fallback;
- and if CA then finds a node candidate and triggers the node provisioning, it actually sets `NodeProvisioningInProgress: true` through kube-apiserver.

But one small disadvantage of this first option is that, depending on the timing, if CA triggers the scale-up for the pending pod and the scheduler schedules the same pending pod successfully at the same time, the pod could end up with `NodeProvisioningInProgress: true` while it was actually scheduled with the fallback. That might be confusing when users have to debug scheduling results.
OTOH, if we go with the second option, we don't have to do anything more than what the KEP describes here. However, the name `NodeProvisioningInProgress` no longer makes sense, because the node provisioning might be in progress in CA's later iterations. So we should probably rename `NodeProvisioningInProgress` to `FirstNodeProvisioning` (or something similar).
What do you think?
I think we need CA to be able to provision new nodes also for the fallback scenario, which isn't really possible in option 1 if I understand correctly. Also, `NodeProvisioningInProgress: true` is mostly useful for debugging; `NodeProvisioningInProgress: false` is what should trigger the scheduler to do the fallback. Then we could have:
- Pods with `NodeProvisioningInProgress: true` are treated the same as pods without any condition by the topology spreading filter plugin: they won't pass the filter.
- Pods with `NodeProvisioningInProgress: false` are ignored (not blocked) by the topology spreading filter plugin. They only matter for scoring. Autoscaling either succeeded or gave up at this point.
- Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, `NodeProvisioningInProgress: false` is set on the pod, which can trigger another round of autoscaling.
- The previous point can be optimized to happen in CA's memory, but should actually work out of the box with an apiserver roundtrip.

Btw, I think a `NodeProvisioningInProgress: true` condition on a scheduled pod is ok: it means that as of `lastTransitionTime` the pod triggered some autoscaling, nothing more.
(Sorry for not replying for a while! I had to take a long leave after the busyness of the code freeze.)
Maybe my previous explanation was bad. Let me rephrase.

> Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, NodeProvisioningInProgress: false is set on the pod, which can trigger another round of autoscaling.

My point is whether CA should keep trying the node scaling after this point while still taking the topology spread into consideration, or whether CA should ignore the topology spread after this point.
For example, let's say all nodepools are out of stock, and CA puts `NodeProvisioningInProgress: false`. It triggers the fallback of the topology spread on the scheduler side, but let's say the pod is unfortunately still unschedulable for another reason, e.g., simply because all running nodes are full from the resource capacity perspective.
And, while this pod is unschedulable and already has `NodeProvisioningInProgress: false`, do we want CA to try to provision a new node with or without taking the topology spread into consideration?
My proposal is that we have two options:
- If we do want CA to keep trying to provision a new node while taking the topology spread into consideration, even after CA has once put `NodeProvisioningInProgress: false`, then CA has to always internally put `NodeProvisioningInProgress: true` so that the topology spread plugin in CA's scheduling simulation would regard the topology spread constraint as required.
- If we want CA to ignore the topology spread after CA has once put `NodeProvisioningInProgress: false`, then we don't have to do anything beyond the current proposal. But we should just rename `NodeProvisioningInProgress` to something else like `FirstNodeProvisioningInProgress`.
This would be easier for me to reason about if we deprioritize the exact way that CA (or Karpenter) will solve for this and focus in on the core pod + kube-scheduler behaviors.
1. Pod is configured with `TopologySpreadConstraint` configuration, w/ `UnsatisfiableConstraintAction` of `DoNotSchedule`.
2. Some component (could be CA/Karpenter) determines that the infrastructure required to fulfill the spread constraints is ~durably unavailable, and posts a well-known pod condition update.
3. kube-scheduler interprets the updated pod condition, combined w/ the `TopologySpreadConstraint` fallback configuration, to find a candidate node without spread constraints.
4. No existing infra fulfills the pod requirements (even without the spread constraints); pod stays Pending.
5. Some component (could be CA/Karpenter) creates new infra according to the (updated) scheduling criteria (step 4).
6. New infra comes online; pod is scheduled.

Is the above correct? I don't know how we "clean up" the above outcome when the operational condition in step 2 changes, unless we update kube-scheduler to actively preempt already scheduled pods and then restart the scheduling lifecycle from scratch (w/ the original, strict `TopologySpreadConstraint` configuration).
What am I missing?
cc @MartynaGrotek
Here we still have a discussion about the condition.
I just realized I never replied here. I wouldn't differentiate between first and subsequent node provisioning attempts in pod conditions. To me, autoscaling is conceptually yet another controller, trying to achieve an eventually consistent state. Such a controller may at some point discover it is not possible to make progress, and that is a signal useful for kube-scheduler. I think provisioning fallback capacity on the autoscaling side is still a part of provisioning, so `NodeProvisioningInProgress: false` should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.
For the user looking at these conditions it also becomes easier to understand - either there is some provisioning in progress or not. It helps with the ability to debug the system: a `NodeProvisioningInProgress: true` condition persisting for a long time indicates a problem in the autoscaling stack.
If we really wanted to somehow differentiate provisioning that honors PTS from fallback provisioning, I'd make that explicit: `TopologySpreadingIgnored: true`, or similar. I'm not sure that's necessary though.
> NodeProvisioningInProgress: false should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.

So, are you saying that if CA cannot provision a new node and sets `NodeProvisioningInProgress: false`, it will never switch `NodeProvisioningInProgress` back to `true`?
It can switch back (stockouts end, limits/quotas get raised, other resources can free up), but once it becomes `false`, it should be OK to do the fallback in the scheduler. Do you anticipate problems with the condition changing back and forth? I think it would be simpler not to have separate conditions for the first and subsequent autoscaling attempts, so I wouldn't add them unless there are problems with that approach.
I agree with @x13n here. If the fallback only works in kube-scheduler but not in Node autoscalers, it seems both less useful (because autoscalers are still blocked from provisioning Nodes that would violate PTS) and more confusing (the autoscaler behavior doesn't match the kube-scheduler behavior, which is normally a problem). And it would require more complexity in autoscaler code.
On the other hand we have this:
- Node autoscalers set the `NodeProvisioningInProgress` condition on pending Pods; the condition can switch back and forth depending on whether the autoscaler has something to provision that could help the Pod. I proposed more detailed semantics in the extracted proposal: KEP-5616: Cluster Autoscaler Pod Condition #5619 (comment).
- If a pending Pod has the condition `NodeProvisioningInProgress=True` and PTS configured with the proposed fallback, the PTS is treated as ScheduleAnyway by the PTS scheduler plugin. This allows both kube-scheduler to find Nodes for the Pod that it couldn't before, and Node autoscalers to provision new Nodes for the Pod that they couldn't before.

Which seems pretty straightforward.
In `triggerTimeBasedQueueingHints`, Queueing Hints with the `Resource: Type` event are triggered for pods rejected by those plugins,
and the scheduling queue requeues / doesn't requeue pods based on QHints, as usual.

`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**. Probably once every 30 sec is enough.
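A minimal sketch of such a periodic trigger (assuming a hypothetical callback; `triggerTimeBasedQueueingHints` itself is only proposed in this KEP, not an existing scheduler function):

```go
package queue

import "time"

// startTimeBasedQueueingHintTrigger periodically invokes the supplied trigger
// function (e.g. the proposed triggerTimeBasedQueueingHints) until stopCh is
// closed. The interval is intentionally coarse (e.g. 30s) so the scheduling
// queue is not flooded with requeue attempts.
func startTimeBasedQueueingHintTrigger(interval time.Duration, trigger func(), stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				trigger()
			case <-stopCh:
				return
			}
		}
	}()
}
```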
Should it somehow depend on the configured timeout (e.g. be a fixed fraction of it)? If the timeout is configured to 5m, 30s sounds reasonable, but maybe less so for other values. OTOH, maybe I'm over-complicating this...
But setting too small a timeout probably doesn't make sense in most clusters, right? I'm imagining that the timeout would be a few minutes in the real world.
Of course, we will surely need to clarify that the configured timeout is best-effort and could include a delay of up to 30s, though. I think that's a good trade-off since, as mentioned below, too frequent `triggerTimeBasedQueueingHints()` is not recommended for scheduler performance either way.
So, I feel like it's OK to just keep it as it is for now, and consider making the interval configurable if users actually complain.
Yeah, I was worried performance may vary between different cloud providers, but starting simple and making it configurable only if needed makes sense to me.
1. Pod is rejected in the scheduling cycle.
2. In the PostFilter extension point, the scheduler tries to make space via preemption, but finds that preemption doesn't help.
3. When the Pod is moved back to the scheduling queue, the scheduler adds the `PodScheduled: false` condition to the Pod.
4. The scheduler notices that preemption wasn't performed for the Pod, based on the `PodScheduled: false` condition and the empty `NominatedNodeName` on the Pod.
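For illustration, detecting the signal described in steps 3-4 could look roughly like this (a sketch under the KEP's assumptions, not actual scheduler code):

```go
package fallback

import v1 "k8s.io/api/core/v1"

// preemptionDidNotHelp reports whether the Pod was previously marked
// unschedulable (PodScheduled=False) without any node being nominated for it,
// i.e. the earlier preemption attempt could not make room for it.
func preemptionDidNotHelp(pod *v1.Pod) bool {
	if pod.Status.NominatedNodeName != "" {
		return false // a node was nominated via preemption; no fallback needed
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodScheduled && cond.Status == v1.ConditionFalse {
			return true
		}
	}
	return false
}
```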
With https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md, the nominated node name can be set by components other than the scheduler, for various reasons. Wouldn't this break this mechanism if another component sets it on the pod?
@sanposhiho Are you planning to get this into 1.35? I reviewed the proposal and left some comments in the previous release cycle; is the KEP ready for the next review pass?
I'm waiting for a reply in this thread. But I think we are not going to include it in v1.35, given my/our bandwidth before the KEP freeze. At least, it's not on sig-scheduling's table at the moment.
@sanposhiho Got it, thanks. FWIW the proposal looks good to me from the SIG Autoscaling side as long as we reach alignment in that comment thread, and assuming that defining the detailed semantics of