
Conversation

@maxcao13
Contributor

@maxcao13 maxcao13 commented Sep 20, 2025

Fixes #732
Related: #2385

Description
RFC for formally supporting node limits per NodePool.

See #2526 for WIP implementation.

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: maxcao13
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 20, 2025
@coveralls

coveralls commented Sep 20, 2025

Pull Request Test Coverage Report for Build 18791540167

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 81.706%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/node/termination/controller.go | 2 | 77.14% |

Totals Coverage Status

  • Change from base Build 18763384482: 0.0%
  • Covered Lines: 11581
  • Relevant Lines: 14174

💛 - Coveralls

@maxcao13 maxcao13 force-pushed the nodepool-node-limits branch from b6b34ca to 0dfa25f on September 22, 2025 17:52

### Option 2: Soft Limits

**Description**: Node limits are enforced only during provisioning, allowing disruption operations to continue even when at the limit.
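
For illustration, a minimal sketch of what a soft node limit could look like under the existing `limits` block, assuming the `nodes` resource key this RFC proposes; the exact field shape is not final:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  limits:
    # Soft limit: provisioning stops once the NodePool owns 10 nodes, but
    # voluntary disruption (consolidation, drift) can still proceed and may
    # temporarily exceed the limit while replacements launch.
    nodes: "10"
```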
Contributor


We've discussed a strategy in the past like "launchBeforeTerminate" or "terminateBeforeLaunch", which could interact with this proposal.

Contributor Author


Thanks for the response. Is there an issue or thread for this, or was it an internal discussion?

Ideally, if that feature were implemented, we probably wouldn't need hard/soft limits and could simply change the strategy depending on whether we are at the limit or not? But that probably requires a bit more detail on that part as well.

Contributor


Just a coffee chat, I think. Hard vs soft is more generally applicable to limits, not just node limits -- I could see us pushing that aspect of the problem into the future and starting simple with just the node key.

The reason we haven't supported this in the past is the interaction with the scheduling algorithm. For a trivial example, imagine a node limit of 1 and 33 pods, each with a cpu request of 1. Without the limit, you could satisfy this with a 32 core machine and a 1 core machine. You can imagine how this gets more complicated in more complex scenarios. This is a variant of the knapsack problem.

With our limit, we would be forced into a single 64 core machine, which comes with significant waste (roughly 31 of the 64 cores would sit idle). Further, our scheduler would need to "look-ahead" in the binpacking process, and start to factor in limits across multiple placement decisions. The complexity of this explodes pretty quickly. In practice, our current implementation is relatively naïve and might not be impacted by this today, but adding this will make it more challenging to implement better heuristics.

Not trying to say we won't do this, just highlighting something that's not covered in the doc.


I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?

Contributor Author


That makes a lot of sense. I can totally see the problem here, where adding limits on the number of knapsacks/bins will add another layer of complexity if we eventually improve the scheduling algorithm.

I think you are right: as Karpenter is today, implementing this will not affect the complexity (it would simply block any scheduling simulations that create more nodes than the limit), but it could force suboptimal resource usage as in your example (depending on the size of the instance types). Although, I would assume the cluster admin would have to live with that usage if they are enforcing strict node limits.

> I am curious what is most motivating for you for this RFC. Is it solely IP Address management or 3p licensing? Or the others?

Our particular use case is that we provide clusters as a service (like EKS) and we want to integrate a managed Karpenter offering. However, we cannot guarantee that our clusters will function well over a certain number of nodes, as per a service level agreement. Pretty much what is detailed here: #2385.

I am aware that what I am proposing is per NodePool and not global, but given the other issue/RFC that tried to solve this in the past, I think the best path for us would be to implement this particular RFC, and then support something like #1747, which could serve as a global limit.

@maxcao13 maxcao13 force-pushed the nodepool-node-limits branch from 0dfa25f to e05f2b8 on September 25, 2025 22:52

```yaml
spec:
  limits:
```

I am not sure we should rely on the existing limits field, especially if users already use it as a soft node constraint. The other "limits" in here are also soft, so I don't want nodes to behave differently, with nodes being hard while the others are soft. I am wondering if we should have a different spec field instead where we can introduce all hard limits.

Alternatively, we could introduce a new field that specifies whether limits are hard or soft; however, I am unsure if there are use cases where a user would want a mix of both.

Personally, I am interested in introducing a hard limit field, for example:

```yaml
spec:
  limits:
    pods: 10
  hardLimits:
    nodes: 10
```

Contributor Author


Makes sense to me, although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.

I would imagine that if we did later support a mechanism like that, we would just switch to terminateBeforeLaunch while at the limit. But then I'm wondering if we would just change the behaviour of the regular spec.limits to be "hard" and then deprecate spec.hardLimits entirely.

I think how we have it right now, with a separate spec.hardLimits field (supporting only nodes for this particular RFC), totally makes sense, since the current limits are technically soft and wouldn't have the same behaviour across resource types. I'm just wondering about the future if we ever decide to do terminateBeforeLaunch, because that would solve part of the issues in this RFC with drifting and consolidating at the hard limit, and may even remove the need to have separate limits. But I would need some input from maintainers about that.

Member


> although if Karpenter plans in the future to support something like "terminateBeforeLaunch", I'm wondering if it would still be useful to have separate fields for this.

I don't think having "terminateBeforeLaunch" would remove the need for a mix of soft and hard limits. That feature would come with tradeoffs, the main one being that there is no guarantee that you could get optimal capacity after the initial termination (e.g. due to InsufficientCapacityErrors). This makes it a poor choice for consolidation IMO. Additionally, the evicted pods will be stuck pending for a longer period of time. While this shouldn't cause availability issues with properly configured PDBs, it's still not desirable.

Soft limits are nice because they give you the ability to opt into hard limits without relying on "terminateBeforeLaunch" and still get voluntary disruption.


### Option 3: Configurable Hard/Soft Node Limits

**Description**: Allow users to configure whether limits are enforced as hard or soft constraints per NodePool. This can be configured via a new field in the NodePool spec `nodepool.spec.disruption.nodeLimitMode`.
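
As a rough sketch of how that could be configured: the `nodepool.spec.disruption.nodeLimitMode` field path comes from the option description above, but the enum values and default shown here are assumptions, not a finalized API:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Assumed values: "Hard" blocks any disruption that would exceed the node
    # limit, while "Soft" only enforces the limit at provisioning time.
    nodeLimitMode: Hard
  limits:
    # Node limit key as proposed by this RFC.
    nodes: "10"
```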
Member


I'd rather see us stage support for generic hard limits, since I think that is the natural extension of this feature. It's pretty easy to come up with hypothetical examples of why you'd want a hard limit on specific resources (e.g. GPUs). This could be something like what @GnatorX proposed in this comment or we could consider something like this:

```yaml
limits:
  soft:
    nodes: 10
  hard:
    nodes: 12
  # Soft limits can still be specified at the top level for backwards compatibility,
  # but this approach would no longer be documented.
  cpu: 10
```

I feel like this is the cleanest solution, but I'm not sure how I feel about the backwards compatibility strategy.

@maxcao13 maxcao13 force-pushed the nodepool-node-limits branch from e05f2b8 to 67cab9d on October 24, 2025 20:35
@maxcao13 maxcao13 force-pushed the nodepool-node-limits branch from 67cab9d to 95cf14b on October 24, 2025 20:36
@maxcao13
Contributor Author

I've updated the doc according to all the comments so far; thanks for the reviews, everyone :-)

@jmdeal
Member

jmdeal commented Oct 24, 2025

Ran across this comment on a different issue as well, and I think it brings up a good point about hard limits: what should Karpenter's behavior be when a NodePool's limits are tightened and there are now more nodes than the allowed limit? This is worth considering.

@maxcao13
Contributor Author

> Ran across this #1177 (comment) on a different issue as well, and I think it brings up a good point about hard limits: what should Karpenter's behavior be when a NodePool's limits are tightened and there are now more nodes than the allowed limit? This is worth considering.

I talk about it here:
https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R207-R215

I think it makes sense to enforce the limit if it is set lower than the current number of nodes in the NodePool, and to reuse the deprovisioning logic that already exists in the static capacity code.
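
For illustration, the over-limit scenario could look like this; the field name follows this RFC's proposal, and the behaviour in the comment describes the intent rather than a finalized implementation:

```yaml
spec:
  limits:
    # Suppose this NodePool currently owns 8 nodes and the limit is then
    # lowered to 5. The pool is now over its limit, and the idea above is to
    # reuse the static capacity deprovisioning logic to drain and remove the
    # 3 excess nodes.
    nodes: "5"
```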

@maxcao13
Contributor Author

I'm closing this RFC since "hard" limits cannot feasibly be kept consistent in a distributed system. Limits today will continue to be best-effort, including node limits.

See this thread for more information: kubernetes/autoscaler#8702

@maxcao13 maxcao13 closed this Nov 21, 2025
@maxcao13 maxcao13 deleted the nodepool-node-limits branch November 21, 2025 00:53