docs: RFC for node limits per NodePool #2525
Conversation
designs/nodepool-node-limits.md (outdated):

> ### Option 2: Soft Limits
>
> **Description**: Node limits are enforced only during provisioning, allowing disruption operations to continue even when at the limit.
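To make this option concrete, a soft node limit might sit alongside the existing resource limits in the NodePool spec. This is only an illustrative sketch based on the discussion in this thread; the `nodes` key and its placement are assumptions, not final API:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  limits:
    cpu: "100"    # existing resource limit
    nodes: "10"   # assumed soft node limit: blocks new provisioning only;
                  # disruption (consolidation, drift) may still proceed
```

Under this reading, reaching 10 nodes stops new launches but never blocks voluntary disruption.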
We've discussed a strategy in the past like "launchBeforeTerminate" or "terminateBeforeLaunch", which could interact with this proposal.
Thanks for the response. Is there an issue or thread for this, or was it an internal discussion?
Ideally, if that feature were implemented, we probably wouldn't need hard/soft limits and could simply change the strategy depending on whether we are at the limit. But that probably requires a bit more detail on that part as well.
Just a coffee chat, I think. Hard vs soft is more generally applicable to limits, not just node limits -- I could see us pushing that aspect of the problem into the future and starting simple with just the node key.
The reason we haven't supported this in the past is because of the interaction with the scheduling algorithm. For a trivial example, imagine a node limit of 1, and 33 pods with a cpu request of 1. You could achieve this with a 32 core machine and a 1 core machine. You can imagine how this gets more complicated with more complicated scenarios. This is a variant of the knapsack problem.
With our limit, we would be forced into a single 64 core machine, which comes with significant waste. Further, our scheduler would need to "look-ahead" in the binpacking process, and start to factor in limits across multiple placement decisions. The complexity of this explodes pretty quick. In practice, our current implementation is relatively naïve and might not be impacted by this today, but adding this will make it more challenging to implement better heuristics.
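The waste in that example can be reproduced with a toy brute-force packer. This is not Karpenter's scheduler; `cheapest_fit` is a hypothetical helper written only to illustrate how a node-count cap changes the cheapest feasible placement:

```python
from itertools import combinations_with_replacement

def cheapest_fit(total_cpu, instance_sizes, max_nodes):
    """Return the instance combination with the least total capacity that
    covers total_cpu using at most max_nodes instances (brute force;
    fine for tiny examples like this one)."""
    best = None
    for count in range(1, max_nodes + 1):
        for combo in combinations_with_replacement(instance_sizes, count):
            if sum(combo) >= total_cpu and (best is None or sum(combo) < sum(best)):
                best = combo
    return sorted(best, reverse=True)

sizes = [1, 2, 4, 8, 16, 32, 64]  # available instance CPU sizes
# 33 pods requesting 1 CPU each:
print(cheapest_fit(33, sizes, max_nodes=4))  # → [32, 1]: 33 cores, no waste
print(cheapest_fit(33, sizes, max_nodes=1))  # → [64]: 31 cores sit idle
```

With a node limit of 1, the only feasible choice is the single 64-core machine, which is exactly the look-ahead interaction described above.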
Not trying to say we won't do this, just highlighting something that's not covered in the doc.
I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?
That makes a lot of sense. I can totally see the problem here, where adding limits on the number of knapsacks/bins will add another layer of complexity if eventually we are to improve the scheduling algorithm.
I think you are right: as Karpenter is today, implementing this will not affect the complexity (it would simply block any scheduling simulation that creates more nodes than the limit), but it could force suboptimal resource usage, as in your example (depending on the size of the instance types). Although, I would assume the cluster admin would have to live with that usage if they are enforcing strict node limits.
> I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?
Our particular use case is that we provide clusters as a service (like EKS) and we want to integrate a managed Karpenter offering. However, we cannot guarantee that our clusters will function well over a certain number of nodes, as per a service-level agreement. Pretty much what is detailed here: #2385.
I am aware that what I am proposing is per NodePool and not global, but given the other issue/RFC that tried to solve this in the past, I think the best path for us would be to implement this particular RFC, and then support something like #1747, which could serve as a global limit.
> ```yaml
> spec:
>   limits:
> ```
I am not sure we should rely on the existing limit, especially if users already use this as a soft node constraint. The other "limits" here are also soft, so I don't want nodes to behave differently, with nodes hard and the others soft. I am wondering if we should have a different spec variable instead where we can introduce all hard limits.
Alternatively, we could introduce a new variable that specifies whether limits are hard or soft; however, I am unsure whether there are use cases where users would want a mix of both.
Personally, I am interested in introducing a hard limit variable, for example:
```yaml
spec:
  limits:
    pods: 10
  hardLimits:
    nodes: 10
```
Makes sense to me, although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.
I would imagine if we did later support a mechanism like that, that we would just switch to terminateBeforeLaunch while we are at the limit. But then I'm wondering if we would just change the behaviour of the regular spec.limits to be "hard" and then deprecate spec.hardLimits entirely.
I think how we have it right now, where we have a separate field spec.hardLimits totally makes sense (and only supporting nodes for this particular RFC) since the current limits are technically soft and wouldn't have the same behaviour across resource types. But just wondering about the future if we ever decide to do terminateBeforeLaunch, because that would solve part of the issues here in this RFC with Drifting and Consolidating at the hard limit, and may even remove the need to have separate limits. But I would need some input from maintainers about that.
> although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.
I don't think having "terminateBeforeLaunch" would remove the need for a mix of soft and hard limits. That feature would come with tradeoffs, the main one being that there is no guarantee that you could get optimal capacity after the initial termination (e.g. due to InsufficientCapacityErrors). This makes it a poor choice for consolidation IMO. Additionally, the evicted pods will be stuck pending for a longer period of time. While this shouldn't cause availability issues with properly configured PDBs, it's still not desirable.
Soft limits are nice because they give you the ability to opt into hard limits without relying on "terminateBeforeLaunch" and still get voluntary disruption.
designs/nodepool-node-limits.md (outdated):

> ### Option 3: Configurable Hard/Soft Node Limits
>
> **Description**: Allow users to configure whether limits are enforced as hard or soft constraints per NodePool. This can be configured via a new field in the NodePool spec, `nodepool.spec.disruption.nodeLimitMode`.
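For illustration, the field named in the description might be set like this. Only the field path `nodepool.spec.disruption.nodeLimitMode` comes from the RFC text; the enum values shown are assumptions:

```yaml
spec:
  disruption:
    nodeLimitMode: Hard   # assumed values: Hard | Soft
  limits:
    nodes: "10"
```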
I'd rather see us stage support for generic hard limits, since I think that is the natural extension of this feature. It's pretty easy to come up with hypothetical examples of why you'd want a hard limit on specific resources (e.g. GPUs). This could be something like what @GnatorX proposed in this comment or we could consider something like this:
```yaml
limits:
  soft:
    nodes: 10
  hard:
    nodes: 12
  # Soft limits can still be specified at the top-level for backwards compatibility,
  # but this approach would no longer be documented.
  cpu: 10
```

I feel like this is the cleanest solution, but I'm not sure how I feel about the backwards compatibility strategy.
Signed-off-by: Max Cao <[email protected]>
I've updated the doc according to all the comments so far, thanks for the reviews everyone :-)

Ran across this comment on a different issue as well, and I think it brings up a good point about hard limits: what should Karpenter's behavior be when a NodePool's limits are tightened and there are now more nodes than the allowed limits? This is worth considering.

I talk about it here: I think it makes sense to enforce the limit if it is set lower than the current number of nodes in the NodePool, and to reuse the existing deprovisioning logic from the static capacity code.
I'm closing this RFC since "hard" limits cannot feasibly be kept consistent in a distributed system. Limits today will continue to be best-effort, including node limits. See this thread for more information: kubernetes/autoscaler#8702
Fixes #732
Related: #2385
Description
RFC for formally supporting node limits per NodePool. See #2526 for the WIP implementation.
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.