docs: RFC for node limits per NodePool #2525
Conversation
designs/nodepool-node-limits.md (outdated):

> ### Option 2: Soft Limits
>
> **Description**: Node limits are enforced only during provisioning, allowing disruption operations to continue even when at the limit.
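To make this option concrete, a soft node limit might sit alongside the existing resource limits in the NodePool spec. This is only an illustrative sketch based on the discussion in this thread; the `nodes` key and its placement are assumptions, not final API:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  limits:
    cpu: "100"    # existing resource limit
    nodes: "10"   # assumed soft node limit: blocks new provisioning only;
                  # disruption (consolidation, drift) may still proceed
```

Under this reading, reaching 10 nodes stops new launches but never blocks voluntary disruption.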
We've discussed a strategy in the past like "launchBeforeTerminate" or "terminateBeforeLaunch", which could interact with this proposal.
Thanks for the response. Is there an issue or thread for this, or was it an internal discussion?
Ideally, if that feature were implemented, we probably wouldn't need hard/soft limits and could simply change the strategy depending on whether we are at the limit. But that probably requires a bit more detail on that part as well.
Just a coffee chat, I think. Hard vs soft is more generally applicable to limits, not just node limits -- I could see us pushing that aspect of the problem into the future and starting simple with just the node key.
The reason we haven't supported this in the past is because of the interaction with the scheduling algorithm. For a trivial example, imagine a node limit of 1, and 33 pods with a cpu request of 1. You could achieve this with a 32 core machine and a 1 core machine. You can imagine how this gets more complicated with more complicated scenarios. This is a variant of the knapsack problem.
With our limit, we would be forced into a single 64 core machine, which comes with significant waste. Further, our scheduler would need to "look-ahead" in the binpacking process, and start to factor in limits across multiple placement decisions. The complexity of this explodes pretty quick. In practice, our current implementation is relatively naïve and might not be impacted by this today, but adding this will make it more challenging to implement better heuristics.
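The waste in that example can be reproduced with a toy brute-force packer. This is not Karpenter's scheduler; `cheapest_fit` is a hypothetical helper written only to illustrate how a node-count cap changes the cheapest feasible placement:

```python
from itertools import combinations_with_replacement

def cheapest_fit(total_cpu, instance_sizes, max_nodes):
    """Return the instance combination with the least total capacity that
    covers total_cpu using at most max_nodes instances (brute force;
    fine for tiny examples like this one)."""
    best = None
    for count in range(1, max_nodes + 1):
        for combo in combinations_with_replacement(instance_sizes, count):
            if sum(combo) >= total_cpu and (best is None or sum(combo) < sum(best)):
                best = combo
    return sorted(best, reverse=True)

sizes = [1, 2, 4, 8, 16, 32, 64]  # available instance CPU sizes
# 33 pods requesting 1 CPU each:
print(cheapest_fit(33, sizes, max_nodes=4))  # → [32, 1]: 33 cores, no waste
print(cheapest_fit(33, sizes, max_nodes=1))  # → [64]: 31 cores sit idle
```

With a node limit of 1, the only feasible choice is the single 64-core machine, which is exactly the look-ahead interaction described above.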
Not trying to say we won't do this, just highlighting something that's not covered in the doc.
I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?
That makes a lot of sense. I can totally see the problem here, where adding limits on the number of knapsacks/bins will add another layer of complexity if eventually we are to improve the scheduling algorithm.
I think you are right: as Karpenter is today, implementing this will not affect the complexity (it would simply block any scheduling simulation that creates more nodes than the limit), but it could force suboptimal resource usage, as in your example (depending on the size of the instance types). Although, I would assume the cluster admin would have to live with that usage if they are enforcing strict node limits.
> I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?
Our particular use case is that we provide clusters as a service (like EKS) and we want to integrate a managed Karpenter offering. However, we cannot guarantee that our clusters will function well over a certain number of nodes, as per a service-level agreement. Pretty much what is detailed here: #2385.
I am aware that what I am proposing is per NodePool and not global, but given the other issue/RFC that tried to solve this in the past, I think the best path for us would be to implement this particular RFC, and then support something like #1747, which could serve as a global limit.
> ```yaml
> spec:
>   limits:
> ```
I am not sure we should rely on the existing limit, especially if users already use this as a soft node constraint. The other "limits" here are also soft, so I don't want nodes to behave differently, with nodes hard and the others soft. I am wondering if we should have a different spec variable instead where we can introduce all hard limits.
Alternatively, we could introduce a new variable that specifies whether limits are hard or soft; however, I am unsure whether there are use cases where users would want a mix of both.
Personally, I am interested in introducing a hard limit variable, for example:
```yaml
spec:
  limits:
    pods: 10
  hardLimits:
    nodes: 10
```
Makes sense to me, although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.
I would imagine if we did later support a mechanism like that, that we would just switch to terminateBeforeLaunch while we are at the limit. But then I'm wondering if we would just change the behaviour of the regular spec.limits to be "hard" and then deprecate spec.hardLimits entirely.
I think how we have it right now, where we have a separate field spec.hardLimits totally makes sense (and only supporting nodes for this particular RFC) since the current limits are technically soft and wouldn't have the same behaviour across resource types. But just wondering about the future if we ever decide to do terminateBeforeLaunch, because that would solve part of the issues here in this RFC with Drifting and Consolidating at the hard limit, and may even remove the need to have separate limits. But I would need some input from maintainers about that.
> although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.
I don't think having "terminateBeforeLaunch" would remove the need for a mix of soft and hard limits. That feature would come with tradeoffs, the main one being that there is no guarantee that you could get optimal capacity after the initial termination (e.g. due to InsufficientCapacityErrors). This makes it a poor choice for consolidation IMO. Additionally, the evicted pods will be stuck pending for a longer period of time. While this shouldn't cause availability issues with properly configured PDBs, it's still not desirable.
Soft limits are nice because they give you the ability to opt into hard limits without relying on "terminateBeforeLaunch" and still get voluntary disruption.
designs/nodepool-node-limits.md (outdated):

> ### Option 3: Configurable Hard/Soft Node Limits
>
> **Description**: Allow users to configure whether limits are enforced as hard or soft constraints per NodePool. This can be configured via a new field in the NodePool spec, `nodepool.spec.disruption.nodeLimitMode`.
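For illustration, the field named in the description might be set like this. Only the field path `nodepool.spec.disruption.nodeLimitMode` comes from the RFC text; the enum values shown are assumptions:

```yaml
spec:
  disruption:
    nodeLimitMode: Hard   # assumed values: Hard | Soft
  limits:
    nodes: "10"
```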
I'd rather see us stage support for generic hard limits, since I think that is the natural extension of this feature. It's pretty easy to come up with hypothetical examples of why you'd want a hard limit on specific resources (e.g. GPUs). This could be something like what @GnatorX proposed in this comment or we could consider something like this:
```yaml
limits:
  soft:
    nodes: 10
  hard:
    nodes: 12
  # Soft limits can still be specified at the top-level for backwards compatibility,
  # but this approach would no longer be documented.
  cpu: 10
```

I feel like this is the cleanest solution, but I'm not sure how I feel about the backwards compatibility strategy.
Signed-off-by: Max Cao <[email protected]>
I've updated the doc according to all the comments so far, thanks for the reviews everyone :-)

Ran across this comment on a different issue as well, and I think it brings up a good point about hard limits: what should Karpenter's behavior be when a NodePool's limits are tightened and there are now more nodes than the allowed limits? This is worth considering.

I talk about it here: I think it makes sense to enforce the limit if it is set lower than the current number of nodes in the NodePool, and to reuse the existing deprovisioning logic from the static capacity code.
I'm closing this RFC since "hard" limits cannot feasibly be kept consistent in a distributed system. Limits today will continue to be best-effort, including node limits. See this thread for more information: kubernetes/autoscaler#8702
Fixes #732
Related: #2385
Description
RFC for formally supporting node limits per NodePool. See #2526 for the WIP implementation.
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.