Conversation

@norbertcyran
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a proposal for support of granular resource limits in node autoscalers.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2025
@k8s-ci-robot
Contributor

Hi @norbertcyran. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 28, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbertcyran
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@norbertcyran
Contributor Author

This proposal was initially discussed within autoscaling SIG in this doc: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?usp=sharing

@ellistarn
Contributor

FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.

Thoughts on expanding this proposal to reason about this case?

@norbertcyran
Contributor Author

> FYI -- there is a karpenter specific proposal for this feature: https://github.com/kubernetes-sigs/karpenter/pull/2525/files#diff-5eac97882a24e1c56d7ac0dc9cd56c6c5d7ca182f5e1344bfe644eee898a5132R23. One of the key challenges mentioned is the "launch before terminate" challenge when doing upgrades and gracefully rolling capacity. This can cause you to get stuck if you're at your limits.
>
> Thoughts on expanding this proposal to reason about this case?

From the perspective of the current state of CAS, differentiation between hard and soft limits does not make much sense, as there's no equivalent of Karpenter's node replacement. Scale down only consolidates the node if all pods running on it can be scheduled on other nodes already existing in the cluster. Therefore, functionality-wise, there would be no difference between hard and soft limits. However, such a solution would be more future-proof.

FWIW, in GKE we have surge upgrades: https://docs.cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies#surge. As they are not handled by CAS, they bypass resource limits, but additionally we don't want the surge nodes to be counted towards the limits at all (i.e. if both the old node and the surge node exist, we count them as one), so they don't block scale ups. To handle that, in the CAS resource limits implementation we plan to have a NodeFilter interface that allows filtering certain nodes out of the usage calculation: https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR12-R14, https://github.com/kubernetes/autoscaler/pull/8662/files#diff-03d8f6b8cba668e6137329c4df0e8a979be244ea7c7465a4d8f10ca08849eb5aR35-R37

As an alternative to soft and hard limits, I was thinking that something like NodeFilter could be used to bypass the limits during graceful consolidation, but it doesn't seem perfect either:

  • some users might expect that the autoscaler never exceeds the limits, even temporarily
  • some users might be constrained by external factors, such as IPs, licensing (as mentioned in the doc you linked), so exceeding the limits might be simply not possible
  • user can't control by how much the limits are exceeded. User might want to specify something like "I want to allow exceeding the limit temporarily during the consolidation, but only by x nodes". x could be defined as a difference between soft and hard limits

That said, I can see the potential usefulness of having both soft and hard limits, even if it makes the API slightly more complicated.

Would you suggest an API like the following?

```yaml
...
limits:
  soft:
    resources:
      nodes: 8
  hard:
    resources:
      nodes: 12
```

@towca @x13n do you have any thoughts about that?

@ellistarn
Contributor

> • some users might expect that the autoscaler never exceeds the limits, even temporarily
> • some users might be constrained by external factors, such as IPs, licensing (as mentioned in the doc you linked), so exceeding the limits might be simply not possible
> • user can't control by how much the limits are exceeded. User might want to specify something like "I want to allow exceeding the limit temporarily during the consolidation, but only by x nodes". x could be defined as a difference between soft and hard limits

Agree with these. Personally, I am not sure I am convinced by the soft/hard proposal. As you mention, there are other factors (like IPs) that cause limits as well. I suspect we will be forced to "terminate-before-launch" when at those limits (within PDBs).

I lean towards a solution that treats limits as best-effort, where there are cases (i.e. surge updates) where the limits can temporarily be exceeded. I'm not sure of the state of the art today, but when we first released limits in Karpenter, they were best effort due to launch parallelism.

Mostly, I wanted to highlight this problem and get you in touch with the karpenter folks who are thinking about this (@maxcao13, @jmdeal, @jonathan-innis). I want to avoid Karpenter's limit semantics diverging from SIG Autoscaling API standards unless absolutely necessary.

@maxcao13
Member

Thanks for the ping.

Yeah, I have a similar proposal for Karpenter itself, since it does not support node limits for non-static capacity: kubernetes-sigs/karpenter#2525

I'm proposing soft/hard limits in a similar way. The API fields are slightly different because of backwards compatibility concerns, but generally I agree with the semantics where a soft limit can be temporarily exceeded, while a hard limit definitively constrains nodes from ever going over.

If we can agree on one semantic across both proposals that would be ideal. For now I don't see a reason why it would be necessary to differ.

@norbertcyran
Contributor Author

> If we can agree on one semantic across both proposals that would be ideal. For now I don't see a reason why it would be necessary to differ.

Yeah, that would be ideal. I think the only difference is that we don't need the distinction between hard and soft limits yet in CAS, but we might need in the future, and I believe it's not a great cost to add it.

Would you agree on the API suggested in #8702 (comment)? I'd prefer that API over limits and hardLimits, as there are no backwards compatibility concerns.

Do you think you'll implement this API at some point in the future on the Karpenter side?

@maxcao13
Member

maxcao13 commented Nov 4, 2025

> Would you agree on the API suggested in #8702 (comment)? I'd prefer that API over limits and hardLimits, as there are no backwards compatibility concerns.

Well for Karpenter, limits already exist in the spec of NodePools, so it would probably look like:

```yaml
spec:
  limits:
    soft:
      nodes: 10
    hard:
      nodes: 12
    # Soft limits can still be specified at the top-level for backwards compatibility,
    # but this approach would no longer be documented.
    cpu: 10
```

like what was proposed in kubernetes-sigs/karpenter#2525 (comment). But it doesn't matter too much whether the API looks the same or not, just that the intended semantic behaviour is consistent.

A question just about the proposal in this PR: is there a reason why there needs to be a `resources` field in `spec.limits.resources.<resource>`?

> Do you think you'll implement this API at some point in the future on the Karpenter side?

Would need to defer to the Karpenter reviewers/maintainers on this one :-)
But it's been noted in the PR review that hard limits for Karpenter would be valuable in the future for other resources, e.g. GPUs, and that the current Karpenter limit semantics are effectively "soft".

@ellistarn
Contributor

ellistarn commented Nov 4, 2025

> Yeah, that would be ideal. I think the only difference is that we don't need the distinction between hard and soft limits yet in CAS, but we might need in the future, and I believe it's not a great cost to add it.

To be clear, hard/soft is @maxcao13's proposal for Karpenter. I am not convinced that making this distinction is the right direction. I am curious to explore https://kubernetes.io/docs/concepts/policy/resource-quotas/#quota-scopes as an answer to the launch before terminate problem. I could imagine customers including or excluding disruption reasons (i.e. drift/underutilized, cc: @jmdeal for commentary) as a way to achieve "soft" quotas.

> Do you think you'll implement this API at some point in the future on the Karpenter side?

If the shape of the API works for the Karpenter product experience, it's definitely preferable to support a standard -- I like how it's shaping up :). If we go this route, I could see these AutoscalingResourceQuotas being the preferred mechanism for specifying limits instead of nodepool.spec.limits. Specifically, because this more flexible approach is a good answer to a 3 year old request kubernetes-sigs/karpenter#745. We will, of course, need to maintain support for compatibility, and I imagine some customers may prefer the simplicity of the nodepool scoped API.

```yaml
selector:
  matchLabels:
    example.cloud.com/machine-family: e2
limits:
```
Contributor

We may want to consider hard here to conceptually align with https://kubernetes.io/docs/concepts/policy/resource-quotas/

Contributor Author

Especially if we agree on hard/soft limits distinction. Let's see how the discussion goes and let's get back to this one

```yaml
      memory: 256Gi
```
* `selector`: A standard Kubernetes label selector that determines which nodes
Contributor

I have finally realized why you called it scopeSelector, it's because it's in ResourceQuota 🤦‍♂️ .

Contributor Author

yeah exactly, though @jonathan-innis noted that scopeSelector in ResourceQuota is a bit different, since it's not just a label selector, but rather a named filter that you need to reference via scopeName field. Here we just want to use a plain label selector. Therefore, indeed scopeSelector might not be an accurate name, and probably it's better to go with selector or nodeSelector (though this one might be inaccurate if we were to support DRA)

@norbertcyran
Contributor Author

> This is a question just about the proposal in this PR, Is there a reason why there needs to be a `resources` field in `spec.limits.resources.<resource>`?

@maxcao13 That might seem redundant, yes. The main reason was to account for possible future support for DRA limits. There was a discussion around this here: https://docs.google.com/document/d/1ORj3oW2ZaciROAbTmqBG1agCmP_8B4BqmCNnQAmqmyc/edit?disco=AAABsjPQRlI

In general, if we were to support DRA limits, we would probably put limits on devices. We will not be able to simply put them at the same level of nesting as other resources. Therefore, we have 2 options:

1. We account for that and nest limits under `resources`. If there are plans to add DRA limits, we can just add a new field under `limits`, next to `resources` (current proposal):

   ```yaml
   spec:
     limits:
       resources:
         cpu: 64
       draDevices: ... # possibly added in the future
   ```

2. We ignore it for now, and in the future we will probably have to add a top-level field for DRA limits:

   ```yaml
   spec:
     limits:
       cpu: 64
     draLimits: ...
   ```

API-wise, option 1 looks cleaner to me, even if it leaves more boilerplate initially. I'm open to other suggestions, though.

@x13n
Member

x13n commented Nov 17, 2025

I like the current proposal (modulo naming, but the comment thread about that seems to be converging to a better alternative). Regarding soft/hard limits, I would stick with proposed simple semantics. Excluding nodes from quota is possible based on labels, which will work in some cases. What I don't like about quota scopes is that they require specifying arbitrary categories that are not derived from labels. Would ensuring proper labels be a better solution? Specifically, a label saying "this node is intended to be removed" would be enough to exclude it from quota accounting.

@norbertcyran
Contributor Author

norbertcyran commented Nov 17, 2025

@x13n Thanks for review!

> What I don't like about quota scopes is that they require specifying arbitrary categories that are not derived from labels

Just to clarify, we are not planning to have scopes similar to these in k8s ResourceQuotas in this design.

> Would ensuring proper labels be a better solution? Specifically, a label saying "this node is intended to be removed" would be enough to exclude it from quota accounting.

To be honest, this is something that I'd prefer to avoid - very often in cases like that, excluding/including specific categories of nodes should be considered a part of the business logic, and it doesn't really make sense to override this behavior. For example, nodes that are intended to be removed should always be included in the calculations. Otherwise, the following scenario is possible:

  1. We are at the max limit
  2. A node is picked for a scale down
  3. The quota excludes nodes undergoing deletion, so CA initiates a scale up, as some quota has been freed up
  4. Scale down fails, scale up succeeds. Now, the max limit is exceeded

Similarly, excluding surge upgrade nodes in GKE should also be done by the business logic, as opposed to the user having to remember to exclude surge nodes (and the label used for marking them) in their quota definitions; see #8702 (comment) for an idea of how I plan to handle this specific case.

If we take Karpenter's launch-before-terminate consolidation as an example, excluding nodes undergoing consolidation via labels is far from perfect too: first, again, the user would have to remember to exclude those nodes in the quotas. Second, there would be no limit on how many extra resources could be spun up during the consolidation (the user might be fine with going 2 or 3 nodes above the limit temporarily, but not more). Third, the scenario I described before also applies here, I think: when we exclude nodes undergoing deletion from the usage calculations, Karpenter could provision new nodes, and if the consolidation fails, we end up with an exceeded limit.

That said, I think that the hard/soft distinction might indeed be the most suitable solution for use cases like that. Though initially, I'd probably suggest starting with only one of the two kinds of limits, potentially adding the other in the future. That could look like the proposal in #8702 (comment), except that the first iteration of the API would only include hard or soft (like in k8s ResourceQuota). Alternatively, we could leave the current proposal as is and in the future add a burst, softLimits, or hardLimits field, for example:

```yaml
spec:
  limits:
    resources:
      cpu: 16
  burst:
    resources:
      cpu: 8  # during consolidation we can temporarily provision at most 8 additional CPUs
```

or:

```yaml
spec:
  limits:
    resources:
      cpu: 16
  hardLimits:
    resources:
      cpu: 24
```

Knowing that the hard/soft limit distinction will be required in Karpenter, it seems like a good decision to make the API extensible enough for it. Therefore, I think that starting with a hard field, similarly to k8s ResourceQuota, makes the most sense from the API design perspective. For example:

```yaml
spec:
  limits:
    hard:
      resources:
        cpu: 8
    soft: # to be added in the future
```

@ellistarn WDYT? Do you possibly have ideas other than hard/soft limits distinction for the Karpenter's launch-before-terminate use case? I remember that you mentioned scopes like in k8s ResourceQuota, but I believe there might be the same caveats I described above about using labels to exclude nodes undergoing consolidation

@norbertcyran norbertcyran force-pushed the resource-limits-proposal branch from b668556 to fbb591b on November 17, 2025 18:01
@x13n
Member

x13n commented Nov 19, 2025

We had a chat about this with @norbertcyran. The summary from my perspective:

  • There is value in specifying both numbers in a single object. There are edge cases where we could stay above soft limit for a prolonged period of time, but it should be acceptable and shouldn't force deletion of nodes that have workloads running on them.
  • The separation makes sense in case of both Karpenter consolidation and Cluster Autoscaler scale down: in case of scale down, nodes that started draining should no longer count towards the soft quota.
  • Because of the two previous points, we should add both fields in the first version of the API, not add one incrementally after the other.
  • Naming: soft / hard limits are likely more understandable than burst, but one alternative would be a steadyState quota and a burst quota with the same semantics as soft/hard (burst would use absolute values, not deltas on top of steadyState).
  • Omitting the hard/burst limit is possible and is semantically equivalent to setting it to +inf.
  • Omitting the soft/steadyState limit is also possible and semantically equivalent to setting it equal to the hard/burst limit.
  • The use case of specifying "scope" for when the burst limit applies (e.g. drift/underutilized as mentioned by @ellistarn above) is not clear to us yet and could be added incrementally in future versions of the API, but should be left out for now.

@norbertcyran let me know if I missed anything!

@norbertcyran
Contributor Author

@x13n I think you summarized it very well, thank you! One thing to add regarding:

> in case of scale down, nodes that started draining should no longer count towards the soft quota.

That probably needs further discussion, and it's unlikely that we will implement it in the first iteration. Though we definitely agree that there are potential current and future use cases for the soft/hard limits distinction in CAS. For API completeness, we can include both soft and hard limits, even if we don't implement soft limits in CAS from the start. We would at least document that soft limits are a no-op in CAS for now, or we could also add an admission webhook to ensure that soft limits are not used.

@ellistarn
Contributor

ellistarn commented Nov 19, 2025

> To be honest, this is something that I'd prefer to avoid - very often in cases like that, excluding/including specific categories of nodes should be considered a part of the business logic, and it doesn't really make sense to override this behavior.

I really agree with this assertion. I'm not sure how or when customers would configure this, or how I would advise them to do so.

This is also something that makes me generally uneasy about the soft/hard limits. I do not have any idea how I would explain to our customers when or how to use them. From an algorithm perspective, a soft limit would allow our launch-before-terminate behavior to work. A hard limit would not, leading to customers getting stuck. I think this is essentially unacceptable for 99% of customers, so we would likely recommend that everyone use soft.

Further, from a distributed systems perspective, hard may simply not be feasible. Informer caches can be delayed, leading us to make decisions on stale information and potentially overshooting. This is designed to heal/converge over time. We do what we can to make these cases rare, but we are careful to explain to customers that limits in Karpenter are best effort.

I think we are overcomplicating this proposal, and should just focus on the core use case, which is granular limits based on label selections of nodes. From a naming perspective, I think we probably want to follow ResourceQuota's lead and call it hard, where the definition of hard is best effort. Alternatively, I could see dropping subcategorization entirely (e.g. #8702 (comment)).

cc: @x13n @norbertcyran . Happy to jump on a call if it helps us close on this discussion more quickly.

@maxcao13
Member

> Further, from a distributed systems perspective, hard may simply not be feasible. Informer caches can be delayed, leading us to make decisions on stale information and potentially overshooting. This is designed to heal/converge over time. We do what we can to make these cases rare, but we are careful to explain to customers that limits in Karpenter are best effort.

I am wondering then, is Karpenter not trending in the direction of general hard limits altogether? I think it is mentioned in the Karpenter specific proposal that the idea of hard limits is an acceptable one and that there are real use cases for wanting hard limits, at least for nodes, but due to the constraints you just mentioned I understand it is a very difficult problem by design. I know @jmdeal mentioned general hard limits are perhaps a natural next step, but I want to make sure we are on the same page here.

FWIW, as someone who would very much like to see this proposal pass in both CAS and Karpenter, whatever limits are decided on are not crucial to my particular use cases, and I am not opinionated on the direction the community wants to go.

@ellistarn
Contributor

Limits have to be best effort in my view, as making them consistent would require leasing/locking and have unacceptable performance implications. For this reason, I agree that hard is misleading. Given that we're best effort anyway, I don't think it's unreasonable that we would use limits for safe upgrades / etc. in Karpenter. I could also see letting customers configure "terminate-before-launch" in some scenarios, but that would be a Karpenter feature, and unrelated to limits in my view.

Soft could be interpreted in many ways -- I think of it more like a preference than anything. i.e., maybe soft would mean that customers would want to "fill up" all of their soft limits and then start breaking them? The use cases are very unclear to me, and I typically avoid trying to predict customer requirements unless I have to. It's somewhat indicative to me that ResourceQuota never found a use case for soft limits in a ~decade.

In summary, I like this option best:

```yaml
kind: CapacityQuota
spec:
  limits:
    nodes: 3
    resources:
      cpu: 16
```
