KEP-3990: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #4150
Conversation
sanposhiho commented on Aug 12, 2023
- One-line PR description: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode
- Issue link: PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990
- Other comments:
Force-pushed from 3cb93d9 to 5318bd5
/assign @alculquicondor @MaciekPytel
### introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed`
It's a design similar to what we discussed in the issue. But I switched this to `fallbackCriteria` for the reason described here. Open for discussion about which design we should choose.
Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.
One possible mitigation is to add `UnschedulablePlugins`, equivalent to [QueuedPodInfo.UnschedulablePlugins](https://github.com/kubernetes/kubernetes/blob/8a7df727820bafed8cef27e094a0212d758fcd40/pkg/scheduler/framework/types.go#L180), somewhere in Pod status.
@alculquicondor What do you think?
@@ -0,0 +1,3 @@
kep-number: 3990
beta:
  approver: "@wojtek-t"
@wojtek-t I assigned it to you, as other sig-scheduling KEPs do.
But please reassign it to someone else if needed. 🙏
Just quickly skimmed through it - I will review it more deeply once you have buy-in from the SIG.
#### the fallback could be done when it's actually not needed.

Even if the Pod is rejected by plugins other than Pod Topology Spread,
when one of the specified criteria is satisfied, the scheduler falls back from DoNotSchedule to ScheduleAnyway.
I don't understand it - can you clarify?
For example, let's say PodA has a required Pod Topology Spread constraint with `fallbackCriteria: ScaleUpFailed`. During PodA's first scheduling attempt, all Nodes passed the Pod Topology Spread filter plugin but were rejected by other plugins (e.g., the resource fit filter plugin because of insufficient resources), and PodA ended up unschedulable as a result. Then the cluster autoscaler tried to create a new Node for PodA, but couldn't because of a stockout.
In this case, Pod Topology Spread doesn't actually need to fall back, because it accepted all Nodes in the previous scheduling cycle -- meaning Pod Topology Spread wasn't the cause of unschedulability, and the fallback won't make PodA schedulable.
But the problem is that Pod Topology Spread cannot see why this Pod was rejected in the previous scheduling cycle. It only knows that PodA was rejected by some plugin in the previous scheduling cycle and that the cluster autoscaler couldn't create a new Node.
@@ -0,0 +1,35 @@
title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
nit:
- title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
+ title: Pod Topology Spread DoNotSchedule to ScheduleAnyway fallback mode
/cc @mwielgus
A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
to represent when to fall back from DoNotSchedule to ScheduleAnyway.
It can contain two values: `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for Pods,
Could it not simply default to `FailedScheduling`? As far as I know, `ScaleUpFailed` is a cluster-autoscaler-specific event, so for those who use something else (e.g. Karpenter), this will not work.
One challenge with `FailedScheduling` is that it's what the cluster autoscaler reacts to, so there could be a race between the cluster autoscaler scaling up and the scheduler switching to fallback mode.
This part only means that `ScaleUpFailed` is a new value that will be added to `fallbackCriteria`. Yes, I took the name from the cluster autoscaler's event name, but it should work with any cluster autoscaler, as long as it updates the `TriggeredScaleUp` Pod condition, which is also proposed in this KEP.
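For illustration only, here is roughly what that could look like on a Pod, assuming the proposed (not yet existing) `fallbackCriteria` field and `TriggeredScaleUp` condition from this KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
    # Proposed field (not in any released Kubernetes API): fall back to
    # ScheduleAnyway once the autoscaler reports it cannot create a Node.
    fallbackCriteria:
    - ScaleUpFailed
status:
  conditions:
  # Proposed condition, set by the autoscaler when scale-up fails.
  - type: TriggeredScaleUp
    status: "False"
    reason: ScaleUpFailed        # illustrative reason, e.g. stockout/quota
    message: "no node group could be scaled up for this pod"
```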
We had too much to bite off in this cycle. This will have to wait for the next release :( Try to get some input from SIG Autoscaling in one of their meetings.
Yes..
Sure.
# KEP-3990: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode
Curious if there's an analogous use case for:
- Preferential node affinity
- Preferential pod affinity/antiaffinity
Is there any world where a user may want to have more control over how preferences are handled with autoscalers?
> if there's an analogous use case for:

Yes, there might be. We would wait for actual use cases from users before implementing similar support there, though; we should make this design easy to follow in other scheduling constraints.
On the other hand, `FallBackCriteria` allows us to unify APIs in all scheduling constraints.
We will just introduce the `FallBackCriteria` field in them and there we go.
This makes me think that we might want to model fallback criteria at the pod level.
preferentialFallback:
- PodAffinity
- AntiAffinity
- TopologySpreadConstraints
It may be difficult to get the granularity right, though. At another extreme (as you propose), you'd add a preference policy for each preferred term across the pod spec.
That's an interesting idea.
When people want to use fallback, they may want to do the same in all required scheduling constraints the Pod has. Also, we may want to give `preferentialFallback` a way to "fall back in all".
So, maybe it'd be like this:
preferentialFallback:
  schedulingConstraints:
  - "*" # do fallback in all scheduling constraints.
  # You can also specify individual scheduling constraint names, which support fallback.
  # (But note that this KEP's scope is only TopologySpread.)
  fallBackCriteria:
  - ScaleUpFailed
Let me include this idea in the alternatives section for now. Like the idea of introducing enums, we can gather feedback from more people about the design choice among them.
Instead of `FallBackCriteria`, introduce `DoNotScheduleUntilScaleUpFailed` and `DoNotScheduleUntilPreemptionFailed` in `WhenUnsatisfiable`.
`DoNotScheduleUntilScaleUpFailed` corresponds to `ScaleUpFailed`,
and `DoNotScheduleUntilPreemptionFailed` corresponds to `PreemptionFailed`.
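For contrast (illustrative only; neither enum value exists in the current API), the alternative quoted above would surface directly on the existing field:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  labelSelector:
    matchLabels:
      app: my-app
  # Alternative design: new enum values on whenUnsatisfiable instead of a
  # separate fallbackCriteria field.
  whenUnsatisfiable: DoNotScheduleUntilScaleUpFailed
```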
There are some benefits to reusing an existing field with new enum values. I'm a touch concerned with how complicated topologySpreadConstraints are getting, with two new honor fields (with different defaults).
I interact with a bunch of customers who are learning Kubernetes scheduling for the first time when onboarding to Karpenter, and communicating the complexity and gotchas of topology is always a huge hurdle to get over.
Obviously, I agree that we need to make it as easy to understand as possible. But introducing a new field doesn't always result in confusion.
For this particular KEP, I think `FallBackCriteria` is not much more complicated than introducing new enum values in `WhenUnsatisfiable`. So currently I'm leaning towards `FallBackCriteria`, for the ease of introducing it into other scheduling constraints, as described.
But maybe I'm biased since I know the scheduler/CA, I know everything that this KEP wants to do, etc. 😅
So I'd like to keep this discussion open to gather more feedback from reviewers.
I agree that topologySpreadConstraints are difficult to explain to clients (been there), but I think this added field is optional complexity. It is also the other way around, right? It is a complex solution to a complex setup, but the complex setup came first. Clients who need to hear this are already dealing with specific cases that call for all of this, and in theory are easier to talk to. I also agree that I don't see the existing field with new enum values as easier to explain, so I guess that if it is hard to explain, it would be hard either way.
Apparently I wrote this a while ago...
Describing the current situation here: I brought this KEP into SIG/Autoscaling weekly meeting a couple of months ago, and concluded that we had to consider how CA set
Not sure if we'll make it in this release though. I'll spend some time on this KEP in this release cycle as well - I'll go through CA's implementation to understand the scenarios there myself, and elaborate the KEP around that. Thanks for all the feedback so far; I'll bring the fixes for all of it at the same time.
Force-pushed from a545da3 to ea3fc17
Force-pushed from 2d8f39b to 9db7b55
@MaciekPytel can you (or assign someone to) review this KEP from the autoscaling (CA) PoV?
@sanposhiho I can take a look from SIG autoscaling PoV, but late next week at the earliest (busy with some last-minute 1.32 changes in CA right now).
Thanks @towca for picking it up!
@towca Any update?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@towca Any update?
@wojtek-t A new
@sanposhiho - agree; although I think it's an extension and we don't want to include that in the current proposal
Yeah, definitely no need to include
- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
I don't think we want to introduce an autoscaler-specific construct into the pod spec. In addition to smelling like a separation-of-concerns violation, the cluster autoscaler in particular is designed to scale nodes, not pods. (And the "scale" terminology here could be confused with HPA as well.)
> I don't think we want to introduce an autoscaler-specific construct into the pod spec.

Rather, I think we do. Look at other recent KEPs happening around sig-scheduling/autoscaling: the scheduler and the autoscaler are getting closer and closer, working tightly together.
E.g., #5287 proposes to use NNN to get the intention from the autoscaler, and there is also a workload scheduling initiative we're planning: basically, we're trying to take "whether CA can potentially create nodes for pods" into consideration and take an appropriate action based on that.

> And "scale" terminology here could be confused between HPA, as well.

This is true, though; I'm open to renaming it to anything else.
IMO this seems fine. CA is designed to scale Nodes, but it reacts to Pods, so it makes sense to me that we'd put autoscaling-related information (either in status or spec) on Pods.

> And "scale" terminology here could be confused between HPA, as well.
> But, this is true, I'm open to rename it to anything else.

We actually decided to align the core concepts naming between Cluster Autoscaler and Karpenter, and the Karpenter names made more sense. We still need to change the naming in CA (kubernetes/autoscaler#6647), but e.g. the official k8s Node autoscaling docs already use the new naming. In particular, IMO this should be named `NodeProvisioningFailed`.
Should we distinguish `NodeProvisioningFailedAndAutoscalingRanOutOfOptions` from `NodeProvisioningFailedButThereAreOtherOptionsToTry`? If there are many potential options, should there be a timeout?
Actually, even if there's just one option, wouldn't it be a valid use case to wait for a specified amount of time in case of some temporary issues (e.g. stockouts)?
Does this timeout idea mean CA never puts the NodeProvisioningFailed condition on its own, even when CA is alive and healthy (i.e., not only when CA is down)? Would the scheduler always set NodeProvisioningFailed based on the timeout? In other words, is there any case where CA wants to set NodeProvisioningFailed on its own before the timeout is reached and the scheduler would set it?
No, the NodeProvisioningFailed condition should only be added when node provisioning has failed; I was thinking the timeout could be used in the absence of any signal regarding node provisioning.
The timeout duration specified at the topology spread constraint level would be a way of saying: I'm willing to wait for up to N seconds if this is going to help me satisfy my spreading constraints. This is a clear trade-off the users could make. It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.
> No, NodeProvisioningFailed condition should only be added when node provisioning has failed, I was thinking the timeout could be used in absence of any signal regarding node provisioning.

But earlier in this thread we said that "node provisioning has failed" might also have to be determined by a timeout, right? Then, if we implement the timeout config for this fallback feature in the scheduler configuration, that might be all we need: the timeout would catch both cases - CA cannot create nodes (due to stockout etc.), and CA is down or not working properly. That's what I meant.

> It would actually remove the need for the additional mode - ScheduleAnyway with a timeout would be enough to express willingness to wait for better spreading.

That's a new option that we haven't explored, though - can users actually set an appropriate timeout?
If you're just a cluster user, not an admin, you don't know how long the cluster autoscaler usually takes to provision nodes.
So, as a first impression, I'm still inclined towards the current idea: end users just set a new field, `fallbackCriteria`. On top of that, maybe we can allow the cluster admin to configure a global timeout for the fallback in the scheduler configuration, based on their cluster autoscaler's usual provisioning time.
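A rough sketch of what such an admin-level knob could look like in the scheduler configuration (purely hypothetical; the `fallbackTimeoutSeconds` argument is invented here for illustration and does not exist in the PodTopologySpread plugin args):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      # Hypothetical: how long the scheduler waits for an autoscaling signal
      # before treating DoNotSchedule constraints as ScheduleAnyway.
      fallbackTimeoutSeconds: 300
```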
Making it a part of scheduler configuration is definitely worth considering. I wonder if there may be use cases for different thresholds based on business requirements. But if the timeout is just a safeguard protecting from autoscaling being down, perhaps a single value set by a platform team or cluster admin is good enough.
Based on the discussion here, I'll update the KEP later when I've got time.
@sanposhiho Sorry for the delay, I've been swamped with reviews recently.
Overall, the proposal looks good to me from the perspective of Node autoscaling. I'd like to iterate on a couple of things before approving:
- We should take care to design the conditions properly, so that they are understandable and useful beyond just this KEP.
- We should define the autoscaler behavior when scheduler still can't schedule the Pod after falling back.
Left related comments on the PR!
- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
- `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
- `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- introduce `TriggeredScaleUp` in Pod condition
This would be a great addition regardless of this KEP, I've been hearing demand for well-defined autoscaling-related conditions/events for a while now.
I know that @elmiko has been looking into this (or talking to someone who has). Do you know if there's been any progress on this Michael?
Also keep in mind that we need to take Karpenter into account for any new standardized autoscaling-related APIs like this condition. This seems fairly uncontroversial though, WDYT @jonathan-innis?
Exactly - not only for this TopologySpread fallback feature; this new condition would be a standard way for the scheduler to receive a response from CA and take an appropriate action based on it. As I mentioned in another thread, scheduling and autoscaling are getting closer and going to work more closely together towards better pod placement. This new condition allows us to explore various scheduling options (including this kind of fallback mechanism) at scheduling time.
// TriggeredScaleUp indicates that the Pod triggered scaling up the cluster.
// If it's true, a new Node for the Pod was successfully created.
// Otherwise, creating a new Node for the Pod was attempted but failed.
TriggeredScaleUp PodConditionType = "TriggeredScaleUp"
I'm not sure about the naming here, especially considering the scenario you mention where an autoscaler first scales up, notices a stockout, and then can't scale again because the initial group is in backoff.
Given the naming, I wouldn't expect the condition to ever flip back to False (the Pod did trigger a scale-up after all). Maybe we need an additional condition specifying whether a Pod is "helpable" by an autoscaler?
As I mentioned in another thread, I don't have a strong opinion here. I'm open to changing it to anything else!
1. Pod is rejected and stays unschedulable.
2. The cluster autoscaler finds those unschedulable Pod(s) but cannot create Nodes because of stockouts.
3. The cluster autoscaler adds `TriggeredScaleUp: false`.
4. The scheduler notices `TriggeredScaleUp: false` on the Pod and schedules that Pod while falling back to `ScheduleAnyway` on Pod Topology Spread.
What if at step 4 there's still no space on existing Nodes to schedule the Pod even with `ScheduleAnyway`? Should an autoscaler attempt to scale up again, but this time falling back to `ScheduleAnyway`? It sounds desirable, but complicates things a bit. And FWIW I think we would need to explicitly prevent Cluster Autoscaler from doing this - otherwise the scheduler code vendored in CA will start falling back to `ScheduleAnyway` after CA adds the condition.
> What if at step 4. there's still no space on existing Nodes to schedule the Pod even with ScheduleAnyway?

It means some other scheduling factors at the Filter phase (e.g., resource requirements, pod affinity, etc.) are not satisfied even if we ignore topology spread at the Filter phase. The Pods just become unschedulable again.

> Should an autoscaler attempt to scale-up again, but this time falling back to ScheduleAnyway?

I think so, given that the cluster autoscaler could not find a place at step 2. The cluster autoscaler should re-compute which node group has to be tried, but this time considering the spreading constraint as ScheduleAnyway, i.e., ignoring it in CA's scheduling simulation.

> It sounds desirable, but complicates things a bit.

Does it? We change the PodTopologySpread plugin's Filter to ignore pods with `fallbackCriteria: ScaleUpFailed` when those pods have `TriggeredScaleUp: false`.
When the cluster autoscaler tries to find a new node for this pending pod again (after steps 2 and 3), the PodTopologySpread plugin's Filter (running within CA) will just ignore those pods.
So, to me, it looks very straightforward.
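A minimal sketch of the check described above (the `fallbackCriteria` values and the `TriggeredScaleUp` condition are the proposed, not-yet-existing API from this KEP, and the helper below is invented for illustration, not actual plugin code):

```go
package fallback

import (
	v1 "k8s.io/api/core/v1"
)

// Proposed condition type from this KEP; not part of any released API.
const triggeredScaleUp v1.PodConditionType = "TriggeredScaleUp"

// shouldFallBack reports whether a topology spread constraint that asked for
// the ScaleUpFailed fallback should be treated as ScheduleAnyway, because the
// autoscaler already reported that it could not create a Node for the Pod.
func shouldFallBack(pod *v1.Pod, fallbackCriteria []string) bool {
	wantsScaleUpFallback := false
	for _, c := range fallbackCriteria {
		if c == "ScaleUpFailed" {
			wantsScaleUpFallback = true
			break
		}
	}
	if !wantsScaleUpFallback {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == triggeredScaleUp && cond.Status == v1.ConditionFalse {
			return true
		}
	}
	return false
}
```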
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sanposhiho. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
#### How we implement `TriggeredScaleUp` in the cluster autoscaler

Basically, we just put `TriggeredScaleUp: false` on Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) in every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
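Purely as an illustration of the quoted idea (the condition type is the one proposed in this KEP and the helper is invented; this is not existing Cluster Autoscaler code), the status-processor side could look roughly like this:

```go
package status

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Proposed condition type from the KEP; not part of any released API.
const triggeredScaleUp v1.PodConditionType = "TriggeredScaleUp"

// markScaleUpFailed returns an updated copy of the Pod's conditions with
// TriggeredScaleUp=False; the autoscaler's status processor would patch this
// back through the API server for every Pod that remained unschedulable.
func markScaleUpFailed(pod *v1.Pod, reason, message string) []v1.PodCondition {
	newCond := v1.PodCondition{
		Type:               triggeredScaleUp,
		Status:             v1.ConditionFalse,
		Reason:             reason,  // e.g. "ScaleUpFailed" (illustrative)
		Message:            message, // e.g. "all matching node groups are out of stock"
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	conds := append([]v1.PodCondition{}, pod.Status.Conditions...)
	for i := range conds {
		if conds[i].Type == triggeredScaleUp {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}
```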
We should also clean up the condition if there's a change afterwards. Say a pod was marked unschedulable due to a stockout, but later the stockout was gone and a new attempt to create a Node started. In such a case the pod should clean up the existing condition (or rather add one with value `true`). Btw, per my other comment I would rename this to `NodeProvisioningInProgress`. `NodeProvisioningInProgress: true` means autoscaling started and we should expect a new Node (ideally we'd combine that with setting `nominatedNodeName` on the pod). `NodeProvisioningInProgress: false` means autoscaling ran out of options - there is no attempt to create any new node for this pod, so the fallback should happen.
@towca @MartynaGrotek @jackfrancis for thoughts.
That is a good point.
Actually, this KEP misses the point of "how CA should handle those pods in later provisioning attempts if the first attempt fails".
I believe we have two options here:
- CA should always keep trying to create the node while considering the topology spreading as required (w/o fallback). I.e., even if the first provisioning attempt fails, the cluster autoscaler keeps considering the pod topology spread as a required scheduling constraint. (Although, in the meantime, the scheduler can try to schedule those pods with the topology spread as a preferred scheduling constraint, i.e., fallback.)
- After the first provisioning attempt fails, CA should fall back as well, considering the topology spreading as a preferred scheduling constraint. (That means CA just ignores the topology spreading, as CA doesn't take preferred scheduling constraints into consideration.)

And then, if we go with the first option, we should:
- always set `NodeProvisioningInProgress: true` on pending pods in memory (i.e., not actually applied through kube-apiserver at this point) before CA's scheduling simulation happens, so that the pod topology spread plugin (which CA internally uses for the scheduling simulation) works without the fallback;
- and if CA then finds a node candidate and triggers the node provisioning, it actually sets `NodeProvisioningInProgress: true` through kube-apiserver.

But one small disadvantage of this first option is that, depending on the timing, if CA triggers the scale-up for the pending pod and the scheduler schedules the same pending pod successfully at the same time, the pod could end up with `NodeProvisioningInProgress: true` while it was actually scheduled with the fallback. That might be confusing when users have to debug scheduling results.
OTOH, if we go with the second option, we don't have to do anything more than what the KEP describes here. However, the name `NodeProvisioningInProgress` no longer makes sense, because the node provisioning might be in progress in CA's later iterations. So we should probably rename `NodeProvisioningInProgress` to `FirstNodeProvisioning` (or something similar).
What do you think?
I think we need CA to be able to provision new nodes also for the fallback scenario, which isn't really possible in option 1 if I understand correctly. Also, `NodeProvisioningInProgress: true` is mostly useful for debugging; `NodeProvisioningInProgress: false` is what should trigger the scheduler to do the fallback. Then we could have:
- Pods with `NodeProvisioningInProgress: true` are treated the same as pods without any condition by the topology spreading filter plugin: they won't pass the filter.
- Pods with `NodeProvisioningInProgress: false` are ignored (not blocked) by the topology spreading filter plugin. They only matter for scoring. Autoscaling either succeeded or gave up at this point.
- Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, `NodeProvisioningInProgress: false` is set on the pod, which can trigger another round of autoscaling.
- The previous point can be optimized to happen in CA's memory, but should actually work out of the box with an apiserver roundtrip.

Btw, I think a `NodeProvisioningInProgress: true` condition on a scheduled pod is ok: it means that as of `lastTransitionTime` the pod triggered some autoscaling, nothing more.
(Sorry for not replying for a while! I had to take a long leave after the busyness of the code freeze.)
Maybe my previous explanation was bad. Let me rephrase.

> Cluster Autoscaler runs scheduler predicates as today. When it runs out of options, NodeProvisioningInProgress: false is set on the pod, which can trigger another round of autoscaling.

My point is whether CA should keep trying the node scaling after this point while still taking the topology spread into consideration, or whether CA should ignore the topology spread after this point.
For example, let's say all nodepools are out of stock, and CA puts `NodeProvisioningInProgress: false`. It triggers the fallback of the topology spread on the scheduler side, but let's say the pod is unfortunately still unschedulable for another reason, e.g., simply because all running nodes are full from the resource capacity perspective.
And, while this pod is unschedulable and already has `NodeProvisioningInProgress: false`, do we want CA to try to provision a new node with or without taking the topology spread into consideration?
My proposal is that we have two options:
- If we do want CA to keep trying to provision a new node while taking the topology spread into consideration, even after CA has once put `NodeProvisioningInProgress: false`, then CA has to always internally put `NodeProvisioningInProgress: true` so that the topology spread plugin in CA's scheduling simulation would regard the topology spread constraint as required.
- If we want CA to ignore the topology spread after CA has once put `NodeProvisioningInProgress: false`, then we don't have to do anything beyond the current proposal. But we should just rename `NodeProvisioningInProgress` to something else like `FirstNodeProvisioningInProgress`.
This would be easier for me to reason about if we deprioritize the exact way that CA (or Karpenter) will solve for this and focus in on the core pod + kube-scheduler behaviors.
1. Pod is configured with `TopologySpreadConstraint` configuration, w/ `UnsatisfiableConstraintAction` of `DoNotSchedule`.
2. Some component (could be CA/Karpenter) determines that the infrastructure required to fulfill the spread constraints is ~durably unavailable, and posts a well-known pod condition update.
3. kube-scheduler interprets the updated pod condition, combined w/ the `TopologySpreadConstraint` fallback configuration, to find a candidate node without spread constraints.
4. No existing infra fulfills the pod requirements (even without the spread constraints); pod stays Pending.
5. Some component (could be CA/Karpenter) creates new infra according to the (updated) scheduling criteria (step 4).
6. New infra comes online; pod is scheduled.

Is the above correct? I don't know how we "clean up" the above outcome when the operational condition in step 2 changes, unless we update kube-scheduler to actively preempt already scheduled pods and then restart the scheduling lifecycle from scratch (w/ the original, strict `TopologySpreadConstraint` configuration).
What am I missing?
cc @MartynaGrotek
Here we still have a discussion about the condition.
I just realized I never replied here. I wouldn't differentiate between first and subsequent node provisioning attempts in pod conditions. To me, autoscaling is conceptually yet another controller, trying to achieve an eventually consistent state. Such a controller may at some point discover it is not possible to make progress, and that is a signal useful for kube-scheduler. I think provisioning fallback capacity on the autoscaling side is still a part of provisioning, so `NodeProvisioningInProgress: false` should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.
For the user looking at these conditions it also becomes easier to understand - either there is some provisioning in progress or not. It helps with the ability to debug the system: a `NodeProvisioningInProgress: true` condition persisting for a long time indicates a problem in the autoscaling stack.
If we really wanted to somehow differentiate provisioning that honors PTS from fallback provisioning, I'd make that explicit: `TopologySpreadingIgnored: true`, or similar. I'm not sure that's necessary though.
> NodeProvisioningInProgress: false should only be set when either it is not possible to provision any capacity at all or the capacity has already been provisioned.

So, are you saying that if CA cannot provision a new node and sets `NodeProvisioningInProgress: false`, it will never switch `NodeProvisioningInProgress` back to `true`?
It can switch back (stockouts end, limits/quotas get raised, other resources can free up), but once it becomes `false`, it should be OK to do the fallback in the scheduler. Do you anticipate problems with the condition changing back and forth? I think it would be simpler not to have separate conditions for the first and subsequent autoscaling attempts, so I wouldn't add them unless there are problems with that approach.
I agree with @x13n here. If the fallback only works in kube-scheduler but not in Node autoscalers, it seems both less useful (because autoscalers are still blocked from provisioning Nodes that would violate PTS) and more confusing (the autoscaler behavior doesn't match the kube-scheduler behavior, which is normally a problem). And it would require more complexity in autoscaler code.
On the other hand we have this:
- Node autoscalers set the `NodeProvisioningInProgress` condition on pending Pods; the condition can switch back and forth depending on whether the autoscaler has something to provision that could help the Pod. I proposed more detailed semantics in the extracted proposal: KEP-5616: Cluster Autoscaler Pod Condition #5619 (comment).
- If a pending Pod has the condition `NodeProvisioningInProgress=True` and PTS configured with the proposed fallback, the PTS is treated as ScheduleAnyway by the PTS scheduler plugin. This allows both kube-scheduler to find Nodes for the Pod that it couldn't before, and Node autoscalers to provision new Nodes for the Pod that they couldn't before.

Which seems pretty straightforward.
In `triggerTimeBasedQueueingHints`, Queueing Hints with the `Resource: Type` event are triggered for pods rejected by those plugins,
and the scheduling queue requeues / doesn't requeue pods based on QHints, as usual.

`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**. Probably once every 30 sec is enough.
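A minimal sketch of such a periodic trigger (assuming a hypothetical callback; `triggerTimeBasedQueueingHints` itself is only proposed in this KEP, not an existing scheduler function):

```go
package queue

import "time"

// startTimeBasedQueueingHintTrigger periodically invokes the supplied trigger
// function (e.g. the proposed triggerTimeBasedQueueingHints) until stopCh is
// closed. The interval is intentionally coarse (e.g. 30s) so the scheduling
// queue is not flooded with requeue attempts.
func startTimeBasedQueueingHintTrigger(interval time.Duration, trigger func(), stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				trigger()
			case <-stopCh:
				return
			}
		}
	}()
}
```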
Should it somehow depend on the configured timeout (e.g. be a fixed fraction of it)? If the timeout is configured to 5m, 30s sounds reasonable, but maybe less so for other values. OTOH, maybe I'm over-complicating this...
But setting too small a timeout probably doesn't make sense in most clusters, right? I'm imagining that the timeout would be a few minutes in the real world.
Of course, we will surely need to clarify that the configured timeout is best-effort and could include a delay of up to 30s, though. I think that's a good trade-off since, as mentioned below, too frequent `triggerTimeBasedQueueingHints()` is not recommended for scheduler performance either way.
So, I feel like it's OK to just keep it as it is for now, and consider making the interval configurable if users actually complain.
Yeah, I was worried performance may vary between different cloud providers, but starting simple and making it configurable only if needed makes sense to me.
1. Pod is rejected in the scheduling cycle.
2. In the PostFilter extension point, the scheduler tries to make space via preemption, but finds that preemption doesn't help.
3. When the Pod is moved back to the scheduling queue, the scheduler adds the `PodScheduled: false` condition to the Pod.
4. The scheduler notices that preemption wasn't performed for the Pod, based on the `PodScheduled: false` condition and the empty `NominatedNodeName` on the Pod.
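For illustration, detecting the signal described in steps 3-4 could look roughly like this (a sketch under the KEP's assumptions, not actual scheduler code):

```go
package fallback

import v1 "k8s.io/api/core/v1"

// preemptionDidNotHelp reports whether the Pod was previously marked
// unschedulable (PodScheduled=False) without any node being nominated for it,
// i.e. the earlier preemption attempt could not make room for it.
func preemptionDidNotHelp(pod *v1.Pod) bool {
	if pod.Status.NominatedNodeName != "" {
		return false // a node was nominated via preemption; no fallback needed
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodScheduled && cond.Status == v1.ConditionFalse {
			return true
		}
	}
	return false
}
```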
With https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md, the nominated node name can be set by components other than the scheduler, for various reasons. Wouldn't this break this mechanism if another component sets it on the pod?
@sanposhiho Are you planning to get this into 1.35? I reviewed the proposal and left some comments in the previous release cycle; is the KEP ready for the next review pass?
I'm waiting for a reply in this thread. But I think we are not going to include it in v1.35, given my/our bandwidth before the KEP freeze. At least, it's not on sig-scheduling's table at the moment.
@sanposhiho Got it, thanks. FWIW the proposal looks good to me from the SIG Autoscaling side as long as we reach alignment in that comment thread, and assuming that defining the detailed semantics of