
Conversation

sanposhiho
Member

  • One-line PR description: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Aug 12, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 12, 2023
@sanposhiho sanposhiho force-pushed the fallback-mode branch 2 times, most recently from 3cb93d9 to 5318bd5 on August 12, 2023 02:32
@sanposhiho
Member Author

/assign @alculquicondor @MaciekPytel
/sig scheduling autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Aug 12, 2023

### introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed`
Member Author

@sanposhiho sanposhiho Aug 12, 2023

It's a design similar to what we discussed in the issue, but I switched this to fallbackCriteria for the reason described here. I'm open to discussion about which design we should choose.

Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.

One possible mitigation is to add `UnschedulablePlugins`, which is equivalent to [QueuedPodInfo.UnschedulablePlugins](https://github.com/kubernetes/kubernetes/blob/8a7df727820bafed8cef27e094a0212d758fcd40/pkg/scheduler/framework/types.go#L180), somewhere in the Pod status
Member Author

@alculquicondor What do you think?

@@ -0,0 +1,3 @@
kep-number: 3990
beta:
approver: "@wojtek-t"
Member Author

@wojtek-t I assigned it to you as other sig-scheduling KEPs do.
But, please reassign it to other people if needed. 🙏

Member

Just quickly skimmed through it - I will review more deeply when you have buy-in from the SIG.

#### the fallback could be done when it's actually not needed.

Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.
Member

I don't understand it - can you clarify?

Member Author

@sanposhiho sanposhiho Aug 24, 2023

For example, let's say PodA has a required Pod Topology Spread constraint with fallbackCriteria: ScaleUpFailed. During PodA's first scheduling attempt, all Nodes passed the Pod Topology Spread filter plugin, but they were rejected by other plugins (e.g., the resource fit filter plugin, because of insufficient resources), and PodA ended up unschedulable. Then the cluster autoscaler tried to create a new Node for PodA but couldn't because of a stockout.

In this case, Pod Topology Spread doesn't actually need to fall back, because it accepted all Nodes in the previous scheduling cycle -- meaning Pod Topology Spread wasn't the cause of unschedulability, and the fallback won't make PodA schedulable.
But the problem is that Pod Topology Spread cannot see why this Pod was rejected in the previous scheduling cycle. It only knows that PodA was rejected by some plugin in the previous scheduling attempt, and that the cluster autoscaler couldn't create a new Node.

@wojtek-t wojtek-t self-assigned this Aug 24, 2023
@@ -0,0 +1,35 @@
title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode

nit:

Suggested change
title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
title: Pod Topology Spread DoNotSchedule to ScheduleAnyway fallback mode

@alculquicondor
Member

/cc @mwielgus


A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
to represent when to fall back from DoNotSchedule to ScheduleAnyway.
It can contain two values: `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for Pods,
@a7i a7i Sep 5, 2023

Could it not simply default to FailedScheduling? As far as I know, ScaleUpFailed is a cluster-autoscaler-specific event, so for those who use something else (e.g., Karpenter), this will not work.

One challenge with FailedScheduling is that it's what cluster-autoscaler reacts to, so there could be a race condition between cluster-autoscaler scaling up and switching to fallback mode.

Member Author

@sanposhiho sanposhiho Sep 15, 2023

This part only means that ScaleUpFailed is a new value that will be added to fallbackCriteria. Yes, I took this name from the cluster autoscaler's event name, but it should work with any cluster autoscaler as long as it updates the TriggeredScaleUp Pod condition, which is also proposed in this KEP.
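For illustration, here is a minimal, self-contained Go sketch of the proposed API shape; it mirrors the field on a local type because `fallbackCriteria` does not exist on `corev1.TopologySpreadConstraint` today, and only the enum values proposed in this KEP are shown:

package main

import "fmt"

// FallbackCriterion mirrors the enum proposed in this KEP; it does not exist in k8s.io/api today.
type FallbackCriterion string

const (
	ScaleUpFailed    FallbackCriterion = "ScaleUpFailed"
	PreemptionFailed FallbackCriterion = "PreemptionFailed"
)

// topologySpreadConstraint is a trimmed-down, hypothetical mirror of
// corev1.TopologySpreadConstraint with the proposed FallbackCriteria field.
type topologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable string // "DoNotSchedule" or "ScheduleAnyway"
	FallbackCriteria  []FallbackCriterion
}

func main() {
	c := topologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: "DoNotSchedule",
		// Fall back to ScheduleAnyway once the cluster autoscaler reports
		// that it could not provision a Node for this Pod.
		FallbackCriteria: []FallbackCriterion{ScaleUpFailed},
	}
	fmt.Printf("%+v\n", c)
}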

@alculquicondor
Member

We had too much on our plate this cycle. This will have to wait for the next release :(

Try to get some input from SIG Autoscaling in one of their meetings.

@sanposhiho
Member Author

Yes..

Try to get some input from SIG Autoscaling in one of their meetings.

Sure.

# KEP-3990: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode


Curious if there's an analogous use case for:

  • Preferential node affinity
  • Preferential pod affinity/antiaffinity

Is there any world where a user may want to have more control over how preferences are handled with autoscalers?

Member Author

@sanposhiho sanposhiho Oct 17, 2023

if there's an analogous use case for:

Yes, there might be. We would wait for actual use cases from users before implementing similar support there, though; we should make this design easy to follow for other scheduling constraints.

Comment on lines +1014 to +1036
On the other hand, `FallBackCriteria` allows us to unify APIs in all scheduling constraints.
We will just introduce `FallBackCriteria` field in them and there we go.
@ellistarn ellistarn Oct 19, 2023

This makes me think that we might want to model fallback criteria at the pod level.

preferentialFallback:
  - PodAffinity
  - AntiAffinity
  - TopologySpreadConstraints

It may be difficult to get the granularity right, though. At another extreme (as you propose), you'd add a preference policy for each preferred term across the pod spec.

Member Author

@sanposhiho sanposhiho Oct 20, 2023

That's an interesting idea.
When people want to use fallback, they may want to do the same in all the required scheduling constraints the Pod has. Also, we may want to give preferentialFallback a way to "fall back in all".

So, maybe it'd be like this:

preferentialFallback:
  schedulingConstraints: 
  - "*" # do fallback in all scheduling constraints.
  # You can also specify individual scheduling constraints name, which supports fallback.
  # (But, note that this KEP's scope is only TopologySpread.)
  fallBackCriteria:
  - ScaleUpFailed

Let me include this idea in the alternatives section for now. As with the idea of introducing enums, we can gather feedback from more people about which design to choose.

Comment on lines 1001 to 1024
Instead of `FallBackCriteria`, introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed` in `WhenUnsatisfiable`.
`DoNotScheduleUntilScaleUpFailed` corresponds to `ScaleUpFailed`,
and `DoNotScheduleUntilPreemptionFailed` corresponds to `PreemptionFailed`.


There are some benefits to reusing an existing field with new enum values. I'm a touch concerned with how complicated topologySpreadConstraints are getting, with two new honor fields (with different defaults).

I interact with a bunch of customers who are learning kubernetes scheduling for the first time when onboarding to Karpenter, and communicating the complexity and gotchas of topology is always a huge hurdle to get over.

Member Author

Obviously, I agree that we need to make it as easy to understand as possible. But introducing a new field doesn't always result in confusion.
For this particular KEP, I don't think FallBackCriteria is much more complicated than introducing new enum values in WhenUnsatisfiable. So I'm currently leaning towards FallBackCriteria because of how easily it can be introduced into other scheduling constraints, as described.
But maybe I'm biased since I know the scheduler/CA, I know everything this KEP wants to do, etc. 😅
So, I'd like to keep this discussion open to gather more feedback from reviewers.

Contributor

I agree that topologySpreadConstraints are difficult to explain to clients (been there), but I think this added field is optional complexity. It is also the other way around, right? It is a complex solution to a complex setup, but the complex setup came first. Clients who need to hear this are already dealing with specific cases that call for all of this, and in theory are easier to talk to. I also agree that an existing field with new enum values wouldn't be easier to explain, so if it is hard to explain, it would be hard either way.

Contributor

@jsoref jsoref left a comment

Apparently I wrote this a while ago...

@sanposhiho
Member Author

Describing the current situation here:

I brought this KEP to the SIG Autoscaling weekly meeting a couple of months ago, and we concluded that we had to consider how CA can reliably set TriggeredScaleUp: false.

[maciekpytel] - One comment: in general, being unable to trigger a scale-up is one failure scenario. However, CA can also trigger a scale-up but fail to get capacity, for cloud-provider reasons etc.

  • We’d need to reliably set the status in these conditions, not sure if the CA is able to do this today.
    Could end up in a situation where the CA keeps trying other node groups
  • Will talk internally with other CA maintainers and come back with someone best able to commit to feedback on this.

Weekly meeting note: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#bookmark=id.oxf7zvbyd9ya

Not sure if we'll make it in this release, though. I'll spend some time on this KEP in this release cycle as well - I'll go through CA's implementation to understand the scenarios there myself, and elaborate the KEP accordingly. Thanks for all the feedback so far; I'll push the fixes for everything at the same time.

@sanposhiho sanposhiho force-pushed the fallback-mode branch 2 times, most recently from a545da3 to ea3fc17 on March 17, 2024 10:03
@sanposhiho
Member Author

sanposhiho commented Sep 17, 2024

@MaciekPytel can you (or assign someone to) review this KEP from the autoscaling (CA) PoV?

@towca

towca commented Dec 2, 2024

@sanposhiho I can take a look from SIG autoscaling PoV, but late next week at the earliest (busy with some last-minute 1.32 changes in CA right now).

@sanposhiho
Member Author

Thanks @towca for picking it up!
I'm hoping we can push this feature forward; I remember some features on the sig-scheduling side are blocked by it.

@sanposhiho
Member Author

@towca Any update?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2025
@sanposhiho
Member Author

/remove-lifecycle stale

@towca Any update?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 27, 2025
@sanposhiho
Member Author

@wojtek-t A new TriggeredScaleUp condition discussed here might be another "state" we want to track, as we discussed in #5329 (comment).
For example, an external scheduler might want the scheduler to schedule pods, and CA to be triggered as well, in unschedulable scenarios before it updates NNN. In that case, PodScheduled is not enough to decide when to re-compute NNN, and it would need TriggeredScaleUp (or something similar) to notice that CA tried to scale up but failed.

@wojtek-t
Member

@sanposhiho - agree; although I think it's an extension and we don't want to include that in the current proposal

@sanposhiho
Member Author

Yeah, definitely no need to include TriggeredScaleUp in NNN kep.


- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.


I don't think we want to introduce an autoscaler-specific construct into the pod spec. In addition to that smelling like a separation of concerns violation, the cluster autoscaler in particular is designed to scale nodes, not pods. (And "scale" terminology here could be confused between HPA, as well.)

Member Author

@sanposhiho sanposhiho May 28, 2025

I don't think we want to introduce an autoscaler-specific construct into the pod spec.

Rather, I think we do.
If you look at other recent KEPs around sig-scheduling/autoscaling, the scheduler and the autoscaler are getting closer and closer, working on things tightly together.
E.g., #5287 proposes using NNN to get the intention from the autoscaler, and there is a workload scheduling initiative we're planning as well: basically, we're trying to take how CA could potentially create nodes for pods into consideration and take an appropriate action based on that.

And "scale" terminology here could be confused between HPA, as well.

But, this is true, I'm open to rename it to anything else.


IMO this seems fine. CA is designed to scale Nodes, but it reacts to Pods so it makes sense to me that we'd put autoscaling-related information (either in status or spec) on Pods.

And "scale" terminology here could be confused between HPA, as well.

But, this is true, I'm open to rename it to anything else.

We actually decided to align the core concepts naming between Cluster Autoscaler and Karpenter, and the Karpenter names made more sense. We still need to change the naming in CA (kubernetes/autoscaler#6647), but e.g. the official k8s Node autoscaling docs already use the new naming. In particular, IMO this should be named NodeProvisioningFailed.

Member

Should we distinguish NodeProvisioningFailedAndAutoscalingRanOutOfOptions from NodeProvisioningFailedButThereAreOtherOptionsToTry? If there are many potential options, should there be a timeout?

Member

Actually, even if there's just one option, wouldn't it be a valid use case to wait for a specified amount of time in case of some temporary issues (e.g. stockouts)?

Member Author

@sanposhiho sanposhiho Jun 4, 2025

Does this timeout mean CA would never add the NodeProvisioningFailed condition on its own, even when CA is alive and working correctly (i.e., not only when CA is down)? Would the scheduler always add NodeProvisioningFailed based on the timeout? In other words, is there any case where CA wants to add NodeProvisioningFailed on its own before the timeout is reached and the scheduler adds it?

Member

No, NodeProvisioningFailed condition should only be added when node provisioning has failed, I was thinking the timeout could be used in absence of any signal regarding node provisioning.

The timeout duration specified on topology spread constraints level would be a way of saying: I'm willing to wait for up to N seconds if this is going to help me satisfy my spreading constraints. This is a clear trade-off the users could make. It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.

Member Author

@sanposhiho sanposhiho Jun 18, 2025

No, NodeProvisioningFailed condition should only be added when node provisioning has failed, I was thinking the timeout could be used in absence of any signal regarding node provisioning.

But earlier in this thread we said that "node provisioning has failed" might also have to be determined by a timeout, right? Then, if we implement the timeout config for this fallback feature in the scheduler configuration, that might be all we need: the timeout can catch both cases - CA cannot create nodes due to stockouts etc., and CA is down or not working properly. That's what I meant.

It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.

That's a new option we haven't explored, though: can users actually set an appropriate timeout?
If you're just a cluster user, not an admin, you don't know how long the cluster autoscaler usually takes to provision nodes.
So, as a first impression, I'm still inclined towards the current idea: end users just set a new fallbackCriteria field. On top of that, maybe we can allow the cluster admin to configure a global timeout for the fallback in the scheduler configuration, based on their cluster autoscaler's usual provisioning time.

Member

Making it a part of scheduler configuration is definitely worth considering. I wonder if there may be use cases for different thresholds based on business requirements. But if the timeout is just a safeguard protecting from autoscaling being down, perhaps a single value set by a platform team or cluster admin is good enough.

Member Author

Based on the discussion here, I'll update the KEP later when I've got time.

@towca towca left a comment

@sanposhiho Sorry for the delay, I've been swamped with reviews recently.

Overall, the proposal looks good to me from the perspective of Node autoscaling. I'd like to iterate on a couple of things before approving:

  • We should take care to design the conditions properly, so that they are understandable and useful beyond just this KEP.
  • We should define the autoscaler behavior when scheduler still can't schedule the Pod after falling back.

Left related comments on the PR!

- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
- `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- introduce a `TriggeredScaleUp` Pod condition

This would be a great addition regardless of this KEP, I've been hearing demand for well-defined autoscaling-related conditions/events for a while now.

I know that @elmiko has been looking into this (or talking to someone who has). Do you know if there's been any progress on this Michael?

Also keep in mind that we need to take Karpenter into account for any new standardized autoscaling-related APIs like this condition. This seems fairly uncontroversial though, WDYT @jonathan-innis?

Member Author

Exactly - not only for this TopologySpread fallback feature, this new condition would be a standard way for the scheduler to receive a response from CA and take an appropriate action based on it. As I mentioned in another thread, scheduling and autoscaling are getting closer and are going to work more closely together towards better pod placement. This new condition allows us to explore various scheduling options (incl. this kind of fallback mechanism) at scheduling time.

// TriggeredScaleUp indicates whether the Pod triggered scaling up the cluster.
// If it's true, a new Node for the Pod was successfully created.
// Otherwise, creating a new Node for the Pod was attempted but failed.
TriggeredScaleUp PodConditionType = "TriggeredScaleUp"

I'm not sure about the naming here, especially considering the scenario you mention where an autoscaler first scales up, notices a stockout, and then can't scale again because the initial group is in backoff.

Given the naming, I wouldn't expect the condition to ever flip back to False (the Pod did trigger a scale-up after all). Maybe we need an additional condition specifying whether a Pod is "helpable" by an autoscaler?

Member Author

As I mentioned in another thread, I don't have a strong opinion here. I'm open to changing it to anything else!

1. Pod is rejected and stays unschedulable.
2. The cluster autoscaler finds those unschedulable Pod(s) but cannot create Nodes because of stockouts.
3. The cluster autoscaler adds `TriggeredScaleUp: false`.
4. The scheduler notices `TriggeredScaleUp: false` on Pod and schedules that Pod while falling back to `ScheduleAnyway` on Pod Topology Spread.

What if at step 4. there's still no space on existing Nodes to schedule the Pod even with ScheduleAnyway? Should an autoscaler attempt to scale-up again, but this time falling back to ScheduleAnyway? It sounds desirable, but complicates things a bit. And FWIW I think we would need to explicitly prevent Cluster Autoscaler from doing this - otherwise the scheduler code vendored in CA will start falling back to ScheduleAnyway after CA adds the condition.

Member Author

What if at step 4. there's still no space on existing Nodes to schedule the Pod even with ScheduleAnyway?

It means that some other scheduling factor at the Filter phase (e.g., resource requirements, pod affinity, etc.) is not satisfied even when we ignore topology spread at the Filter phase. The Pod just becomes Unschedulable again.

Should an autoscaler attempt to scale-up again, but this time falling back to ScheduleAnyway?

I think so, given that the cluster autoscaler could not find a placement at step 2. The cluster autoscaler should re-compute which node group has to be tried, but this time treating the spreading constraint as ScheduleAnyway, i.e., ignoring it in CA's scheduling simulation.

It sounds desirable, but complicates things a bit.

Does it? We change the PodTopologySpread plugin's Filter to skip the constraint for pods with fallbackCriteria: ScaleUpFailed when those pods have TriggeredScaleUp: false.
When the cluster autoscaler tries to find a new node for this pending pod again (after steps 2 and 3), the PodTopologySpread plugin's Filter (running within CA) will just skip the constraint for those pods.
So, to me, it looks very straightforward.
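For example, the Filter-side check could look roughly like the sketch below; this is illustrative only, the helper names are made up, and it assumes the `fallbackCriteria` field and `TriggeredScaleUp` condition proposed in this KEP plus the `PodScheduled`/`NominatedNodeName` signal discussed for `PreemptionFailed`:

import v1 "k8s.io/api/core/v1"

// shouldFallback reports whether a DoNotSchedule constraint should be treated
// as ScheduleAnyway for this Pod, per the proposed fallbackCriteria semantics.
// Sketch only; the real logic would live in the PodTopologySpread plugin.
func shouldFallback(criteria []string, pod *v1.Pod) bool {
	for _, c := range criteria {
		switch c {
		case "ScaleUpFailed":
			// The autoscaler reported that it could not provision a Node.
			if hasCondition(pod, "TriggeredScaleUp", v1.ConditionFalse) {
				return true
			}
		case "PreemptionFailed":
			// The Pod stayed unschedulable and no victim Node was nominated.
			if hasCondition(pod, v1.PodScheduled, v1.ConditionFalse) && pod.Status.NominatedNodeName == "" {
				return true
			}
		}
	}
	return false
}

func hasCondition(pod *v1.Pod, t v1.PodConditionType, s v1.ConditionStatus) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == t && cond.Status == s {
			return true
		}
	}
	return false
}

Because CA vendors the scheduler plugins for its simulation, the same check would relax the constraint inside CA once the condition is set.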


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanposhiho
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sanposhiho sanposhiho requested a review from x13n July 6, 2025 05:47
@sanposhiho
Member Author

@x13n @towca
Introduced scaleupTimeout to the scheduler config based on the discussion here.
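For reference, a sketch of what such a knob could look like if it were added to the PodTopologySpread plugin args; `scaleupTimeout` is only the name used in this thread, and its exact placement in the scheduler configuration is still an open question:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// podTopologySpreadArgs is a hypothetical extension of the existing
// PodTopologySpreadArgs in kubescheduler.config.k8s.io; only ScaleupTimeout is new.
type podTopologySpreadArgs struct {
	// DefaultingType already exists today ("System" or "List").
	DefaultingType string `json:"defaultingType,omitempty"`
	// ScaleupTimeout (proposed): if a Pod with fallbackCriteria stays pending
	// this long without any scale-up signal from the autoscaler, the scheduler
	// falls back to ScheduleAnyway. Set by the cluster admin based on how long
	// node provisioning usually takes in their cluster.
	ScaleupTimeout metav1.Duration `json:"scaleupTimeout,omitempty"`
}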

@sanposhiho sanposhiho requested a review from towca July 6, 2025 05:48

#### How we implement `TriggeredScaleUp` in the cluster autoscaler

Basically, we just put `TriggeredScaleUp: false` for Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
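As a rough illustration of the quoted approach (not actual cluster-autoscaler code), the update could look like the sketch below; the helper name is hypothetical, and real CA code would hook into its ScaleUpStatus processor and batch the updates:

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
)

// markScaleUpFailed sets TriggeredScaleUp=False on every pod that remained
// unschedulable after a scale-up attempt. Sketch only.
func markScaleUpFailed(ctx context.Context, c kubernetes.Interface, pods []*v1.Pod) error {
	for _, p := range pods {
		updated := p.DeepCopy()
		podutil.UpdatePodCondition(&updated.Status, &v1.PodCondition{
			Type:               "TriggeredScaleUp",
			Status:             v1.ConditionFalse,
			Reason:             "ScaleUpFailed",
			Message:            "cluster autoscaler could not provision a Node for this Pod",
			LastTransitionTime: metav1.Now(),
		})
		if _, err := c.CoreV1().Pods(updated.Namespace).UpdateStatus(ctx, updated, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}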
Member

We should also clean up the condition if there's a change afterwards. Say a pod was marked unschedulable due to a stockout, but later the stockout was gone and a new attempt to create a Node started. In such a case the pod's existing condition should be cleaned up (or rather one with value true should be added). Btw, per my other comment I would rename this to NodeProvisioningInProgress. NodeProvisioningInProgress: true means autoscaling started and we should expect a new Node (ideally we'd combine that with setting nominatedNodeName on the pod). NodeProvisioningInProgress: false means autoscaling ran out of options - there is no attempt to create any new node for this pod, so the fallback should happen.

@towca @MartynaGrotek @jackfrancis for thoughts.

Member Author

@sanposhiho sanposhiho Jul 15, 2025

That is a good point..

Actually, this KEP doesn't cover how CA should handle those pods in later provisioning attempts if the first attempt fails.

I believe we have two options here:

  1. CA always keeps trying to create the node while treating the topology spreading as required (w/o fallback). I.e., even if the first provisioning attempt fails, the cluster autoscaler keeps treating the pod topology spread as a required scheduling constraint. (Although, in the meantime, the scheduler can try to schedule those pods treating the constraint as preferred, i.e., the fallback.)
  2. After the first provisioning attempt fails, CA does the fallback as well, treating the topology spreading as a preferred scheduling constraint. (That means CA just ignores the topology spreading, since CA doesn't take preferred scheduling constraints into consideration.)

And then, if we go with the first option, we should

  • always set NodeProvisioningInProgress: true on pending pods in memory (i.e., not actually applied through kube-apiserver at this point) before CA's scheduling simulation happens, so that the pod topology spread plugin (which CA internally uses for the scheduling simulation) works without the fallback.
  • and then, if CA finds a node candidate and triggers the node provisioning, actually set NodeProvisioningInProgress: true through kube-apiserver.

But one small disadvantage of this first option is that, depending on the timing, if CA triggers the scale-up for the pending pod and the scheduler schedules the same pod successfully at the same time, the pod could end up with NodeProvisioningInProgress: true while it was actually scheduled with the fallback. That might be confusing when users have to debug scheduling results.

OTOH, if we go with the second option, we don't have to do anything beyond what the KEP describes here. However, the name NodeProvisioningInProgress no longer makes sense, because node provisioning might be in progress in a later CA iteration. So we should probably rename NodeProvisioningInProgress to FirstNodeProvisioning (or something similar).

What do you think?

Member

I think we need CA to be able to provision new nodes also for the fallback scenario, which isn't really possible in option 1 if I understand correctly. Also NodeProvisioningInProgress: true is mostly useful for debugging, NodeProvisioningInProgress: false is what should trigger scheduler to do the fallback. Then we could have:

  • Pods with NodeProvisioningInProgress: true are treated same as without any condition by topology spreading filter plugin: won't pass the filter.
  • Pods with NodeProvisioningInProgress: false are ignored (not blocked) by the topology spreading filter plugin. They only matter for scoring. Autoscaling either succeeded or gave up at this point.
  • Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, NodeProvisioningInProgress: false is set on the pod, which can trigger another round of autoscaling.
  • The previous point can be optimized to happen in memory of CA, but should actually work out of the box with an apiserver roundtrip.

Btw, I think NodeProvisioningInProgress: true condition on a scheduled pod is ok: that means as of lastTransitionTime the pod triggered some autoscaling, nothing more.

Member Author

@sanposhiho sanposhiho Aug 24, 2025

(Sorry for not replying for a while! I had to take a long leave after the busy period around code freeze.)

Maybe my previous explanation was bad. Let me rephrase.

Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, NodeProvisioningInProgress: false is set on the pod, which can trigger another round of autoscaling.

My point is whether CA should keep trying to scale nodes after this point while still taking the topology spread into consideration, or whether CA should ignore the topology spread after this point.

For example, let's say all node pools are hit by stockouts, and CA puts NodeProvisioningInProgress: false. That triggers the topology spread fallback on the scheduler side, but let's say the pod is unfortunately still unschedulable for another reason, e.g., simply because all running nodes are full from a resource capacity perspective.
While this pod is unschedulable and already has NodeProvisioningInProgress: false, do we want CA to try to provision a new node with or without taking the topology spread into consideration?

And, my proposal is that we have two options:

  1. If we do want CA to keep trying to provision a new node while taking the topology spread into consideration, even after CA has once put NodeProvisioningInProgress: false, then CA has to always internally set NodeProvisioningInProgress: true (in memory) so that the topology spread plugin in CA's scheduling simulation regards the topology spread constraint as required.
  2. If we want CA to ignore the topology spread after CA has once put NodeProvisioningInProgress: false, then we don't have to do anything beyond the current proposal. But we should rename NodeProvisioningInProgress to something else, like FirstNodeProvisioningInProgress.


This would be easier for me to reason about if we deprioritize the exact way that CA (or karpenter) will solve for this and if we focus in on the core pod + kube-scheduler behaviors.

  1. Pod is configured with TopologySpreadConstraint configuration, w/ UnsatisfiableConstraintAction of DoNotSchedule
  2. (could be CA/karpenter) determines that the infrastructure required to fulfill the spread constraints is ~durably unavailable, and posts a well-known pod condition update.
  3. kube-scheduler interprets the updated pod condition, when combined w/ the TopologySpreadConstraint fallback configuration, to find a candidate node without spread constraints.
  4. No existing infra fulfills the pod requirements (even without the spread constraints), pod stays Pending
  5. (CA/Karpenter) creates new infra according to the (updated) scheduling criteria (step 4)
  6. New infra comes online, pod is scheduled

Is the above correct? I don't know how we "clean up" the above outcome when the operational condition in step 2 changes unless we update kube-scheduler to actively pre-empt already scheduled pods and then restart the scheduling lifecycle from scratch (w/ the original, strict TopologySpreadConstraint configuration).

What am I missing?

Member Author

cc @MartynaGrotek
Here we still have a discussion about the condition.

Member

I just realized I never replied here. I wouldn't differentiate between first and subsequent node provisioning attempts in pod conditions. To me, autoscaling is conceptually yet another controller trying to achieve an eventually consistent state. Such a controller may at some point discover it is not possible to make progress, and that is a signal useful for kube-scheduler. I think provisioning fallback capacity on the autoscaling side is still a part of provisioning, so NodeProvisioningInProgress: false should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.

For the user looking at these conditions it also becomes easier to understand - either there is some provisioning in progress or not. It also helps with debugging the system: a NodeProvisioningInProgress: true condition persisting for a long time indicates a problem in the autoscaling stack.

If we really wanted to somehow differentiate between provisioning honoring PTS from fallback provisioning, I'd make that explicit: TopologySpreadingIgnored: true, or similar. I'm not sure if that's necessary though.

Member Author

NodeProvisioningInProgress: false should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.

So, are you saying that if CA cannot provision a new node and sets NodeProvisioningInProgress: false, it will never switch NodeProvisioningInProgress back to true?

Member

It can switch back (stockouts end, limits/quotas get raised, other resources can free up), but once it becomes false, it should be ok to do the fallback in scheduler. Do you anticipate problems with the condition changing back and forth? I think it would be simpler not to have separate conditions for the first and subsequent autoscaling attempts, so I wouldn't add them unless there are problems with that approach.


I agree with @x13n here. If the fallback only works in kube-scheduler but not in Node autoscalers, it seems both less useful (because autoscalers are still blocked from provisioning Nodes that would violate PTS), and more confusing (the autoscaler behavior doesn't match the kube-scheduler behavior, which is normally a problem). And it would require more complexity in autoscaler code.

On the other hand we have this:

  • Node autoscalers set the NodeProvisioningInProgress condition on pending Pods, the condition can switch back and forth depending on if the autoscaler has something to provision that could help the Pod. I proposed more detailed semantics in the extracted proposal: KEP-5616: Cluster Autoscaler Pod Condition #5619 (comment).
  • If a pending Pod has the condition NodeProvisioningInProgress=False and PTS configured with the proposed fallback, the PTS is treated as ScheduleAnyway by the PTS scheduler plugin. This allows both kube-scheduler to find Nodes for the Pod that it couldn't before, and the Node autoscaler to provision new Nodes for the Pod that it couldn't before.

Which seems pretty straightforward.

In `triggerTimeBasedQueueingHints`, Queueing Hints with the `Resource: Type` event are triggered for pods rejected by those plugins,
and the scheduling queue requeues/doesn't requeue pods based on QHints, as usual.

`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**. Probably once every 30 seconds is enough.
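A sketch of this kind of periodic trigger, using a plain ticker; the queue interface and method name below are placeholders rather than the actual scheduling-queue API:

import (
	"context"
	"time"
)

// timeBasedRequeuer stands in for the scheduling queue; a real implementation
// would re-run the time-based QueueingHints for unschedulable pods and move
// the ones that now return Queue back to the active queue.
type timeBasedRequeuer interface {
	MoveTimedOutPodsToActiveQueue()
}

// runTimeBasedQueueingHints periodically re-evaluates unschedulable pods whose
// time-based criteria (e.g. the scale-up timeout discussed above) may now be met.
// The interval is deliberately coarse (~30s) to keep scheduler overhead low.
func runTimeBasedQueueingHints(ctx context.Context, q timeBasedRequeuer, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			q.MoveTimedOutPodsToActiveQueue()
		}
	}
}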
Member

Should it somehow depend on the configured timeout (e.g. be a fixed fraction of it)? If the timeout is configured to 5m, 30s sounds reasonable, but maybe less so for other values. OTOH, maybe I'm over-complicating this...

Member Author

@sanposhiho sanposhiho Jul 15, 2025

But setting too small a timeout probably doesn't make sense in most clusters, right? I'm imagining the timeout would be a few minutes in the real world.
Of course, we will need to clarify that the configured timeout is best-effort and could include a delay of up to 30s. I think that's a good trade-off since, as mentioned below, triggering triggerTimeBasedQueueingHints() too frequently isn't recommended for scheduler performance either way.

So I feel it's OK to just keep it as it is for now, and start considering making the interval configurable once we actually get complaints from users.

Member

Yeah, I was worried performance may vary between different cloud providers, but starting simple and making configurable only if needed makes sense to me.

1. Pod is rejected in the scheduling cycle.
2. In the PostFilter extension point, the scheduler tries to make space by the preemption, but finds the preemption doesn't help.
3. When the Pod is moved back to the scheduling queue, the scheduler adds `PodScheduled: false` condition to Pod.
4. The scheduler notices that the preemption wasn't performed for Pod by `PodScheduled: false` and empty `NominatedNodeName` on the Pod.

With https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md, the nominated node name can be set by components other than the scheduler, for various reasons. Wouldn't this break this mechanism if another component sets it on the pod?

@helayoty helayoty moved this from Needs Triage to Backlog in SIG Scheduling Sep 9, 2025
@helayoty helayoty moved this from Backlog to Needs Review in SIG Scheduling Sep 9, 2025
@towca

towca commented Oct 7, 2025

@sanposhiho Are you planning to get this into 1.35? I reviewed the proposal and left some comments for the previous release cycle, is the KEP ready for the next review pass?

@sanposhiho
Member Author

I'm waiting for a reply in this thread. But I don't think we are going to include it in v1.35, given my/our bandwidth ahead of KEP freeze. At least, it's not on sig-scheduling's table at the moment.

@towca

towca commented Oct 13, 2025

@sanposhiho Got it, thanks. FWIW the proposal looks good to me from SIG Autoscaling side as long as we reach alignment in that comment thread, and assuming that defining the detailed semantics of NodeProvisioningInProgress is left to #5619 (we need to align with the Karpenter side on this part).
