Contributor

@rschalo rschalo commented May 1, 2025

Prioritizing Emptiness - Reordering Graceful Consolidation

Background:

Karpenter performs dataplane right-sizing for Kubernetes clusters. When performing consolidation, Karpenter considers whether a node should be deleted, should be replaced, or is not a candidate for disruption at all. Multiple criteria determine a node's disruptability, and this PR proposes reordering them. Disruption via Expiration, Interruption, and Node Auto Repair are forceful consolidation methods and are out of scope for this change.

Goal:

Prioritize emptiness consolidation, since it is fast to validate and is the only graceful consolidation method whose result is purely a deletion operation.

Graceful Consolidation:

Karpenter evaluates four methods of graceful disruption.

The four methods happen sequentially and are:

  • Drift
  • Emptiness
  • Multi-node consolidation
  • Single-node consolidation

If a valid disruption is found for any of these methods, it is sent to the termination orchestrator, and Karpenter exits the loop and begins evaluating Drift again.
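
The sequential evaluation described above can be sketched in Go as follows. This is a minimal illustration, not Karpenter's actual code: `Method`, `evaluateOnce`, and the stub finders `none` and `firstNode` are hypothetical names made up for this example.

```go
package main

import "fmt"

// Method is a hypothetical stand-in for one graceful disruption method.
type Method struct {
	Name string
	// Find returns the nodes this method would disrupt, or nil if none qualify.
	Find func(nodes []string) []string
}

// evaluateOnce walks the methods in order and returns the first valid
// disruption it finds. The caller then restarts from the first method.
func evaluateOnce(methods []Method, nodes []string) (string, []string) {
	for _, m := range methods {
		if picked := m.Find(nodes); len(picked) > 0 {
			return m.Name, picked
		}
	}
	return "", nil
}

// none is a stub finder that never proposes a disruption.
func none(nodes []string) []string { return nil }

// firstNode is a stub finder that proposes the first node, if any.
func firstNode(nodes []string) []string {
	if len(nodes) == 0 {
		return nil
	}
	return nodes[:1]
}

func main() {
	// Order as listed in the description: Drift, Emptiness, multi-node, single-node.
	methods := []Method{
		{Name: "Drift", Find: none},
		{Name: "Emptiness", Find: firstNode},
		{Name: "MultiNodeConsolidation", Find: firstNode},
		{Name: "SingleNodeConsolidation", Find: firstNode},
	}
	name, picked := evaluateOnce(methods, []string{"node-a", "node-b"})
	fmt.Println(name, picked)
}
```

Because the first method with a valid result wins and the loop restarts, the order of the slice decides which method gets to act on a node that qualifies for more than one of them.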

Consolidation Controls:

Disruption can be controlled by Node Disruption Budgets. These budgets are part of the NodePool spec and enable users to determine how many nodes may be disrupted for a given reason and at what times. For example, a user may want to block all consolidation except during a 2-hour overnight maintenance window.

When multiple budgets are present on a NodePool, the most restrictive budget applies.

spec:
  disruption:
    budgets:
    - nodes: 50%
      reasons:
      - Drifted
      - Underutilized
    - nodes: 100%
      reasons:
      - Empty
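
As a rough illustration of the "most restrictive budget applies" rule, the sketch below resolves the allowed disruptions for one reason by taking the minimum over every budget that matches it. The `Budget` type and `allowedDisruptions` function are hypothetical and not Karpenter's API. Percentages and schedules are left out, budgets carry absolute node counts, and a budget with no reasons is assumed to apply to every reason.

```go
package main

import "fmt"

// Budget is a simplified, hypothetical view of one disruption budget entry.
type Budget struct {
	Nodes   int      // absolute node count; percentages are not modeled here
	Reasons []string // assumption: an empty list matches every reason
}

// appliesTo reports whether the budget constrains the given reason.
func (b Budget) appliesTo(reason string) bool {
	if len(b.Reasons) == 0 {
		return true
	}
	for _, r := range b.Reasons {
		if r == reason {
			return true
		}
	}
	return false
}

// allowedDisruptions returns the smallest node count among the budgets
// that apply to reason, capped at totalNodes.
func allowedDisruptions(budgets []Budget, reason string, totalNodes int) int {
	allowed := totalNodes
	for _, b := range budgets {
		if b.appliesTo(reason) && b.Nodes < allowed {
			allowed = b.Nodes
		}
	}
	return allowed
}

func main() {
	// Mirrors the budgets above for a 10-node NodePool (50% -> 5, 100% -> 10).
	budgets := []Budget{
		{Nodes: 5, Reasons: []string{"Drifted", "Underutilized"}},
		{Nodes: 10, Reasons: []string{"Empty"}},
	}
	fmt.Println(allowedDisruptions(budgets, "Drifted", 10)) // 5
	fmt.Println(allowedDisruptions(budgets, "Empty", 10))   // 10
}
```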

While the semantics around the application of these budgets are not wrong, this PR argues that it is unexpected for emptiness to be blocked by other disruption methods, and that emptiness should instead be the first graceful disruption performed.

Problem:

Today, when Karpenter runs Drift, it first checks for any nodes that are both empty and drifted.
These nodes are then sent to the termination orchestrator, and Karpenter exits, returns to the top of the graceful consolidation loop, and starts evaluating Drift again. As a result, nodes that are Drifted && Empty take priority over other Drifted nodes, are counted against the Drift node disruption budget, and fewer nodes than expected are sent for termination.

Even more confusing for users, when Karpenter logs these disruptions, the disruption reason is Drifted as evidenced by this log line:

"controller":"disruption.queue","namespace":"","name":"","reason":"Drifted","decision":"delete","disrupted-node-count":1,"replacement-node-count":0,"disrupted-nodes":[{"Node":{"name":"ip-node"},"NodeClaim":{"name":"nodeclaim"}}],"replacement-nodes":[]}

While this isn’t strictly wrong, it’s perhaps unexpected.

To reproduce this, a sleep was added before evaluating consolidation methods (otherwise the 1-node disruption budget causes Karpenter to skip over Drift entirely), using the following cluster setup:

1. Create 1000 nodes with one pod per node
2. Block all disruption
3. Drift the NodePool
4. Scale the deployment down by half
5. Set the Drift budget to 1 node and the Empty budget to 500 nodes
6. Observe nodes being terminated by Drift one at a time, down to 500 nodes
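
The arithmetic behind this repro can be sketched with a toy simulation. It assumes only what is described above: each Drift pass picks nodes that are both drifted and empty first and is capped by the Drift budget. `Node`, `driftPass`, and `simulate` are hypothetical names, and real timing, pod scheduling, and the termination orchestrator are not modeled.

```go
package main

import "fmt"

// Node is a toy model of a node's disruption-relevant state.
type Node struct {
	Drifted bool
	Empty   bool
}

// driftPass models one pass of Drift as described above: nodes that are both
// drifted and empty are disrupted, limited by the Drift budget.
func driftPass(nodes []Node, driftBudget int) []Node {
	remaining := make([]Node, 0, len(nodes))
	disrupted := 0
	for _, n := range nodes {
		if n.Drifted && n.Empty && disrupted < driftBudget {
			disrupted++
			continue
		}
		remaining = append(remaining, n)
	}
	return remaining
}

// countEmpty returns how many nodes are still empty.
func countEmpty(nodes []Node) int {
	c := 0
	for _, n := range nodes {
		if n.Empty {
			c++
		}
	}
	return c
}

// simulate returns the number of Drift passes needed to remove every empty
// node, and the number of nodes left afterwards.
func simulate(total, empty, driftBudget int) (int, int) {
	if driftBudget <= 0 {
		return 0, total // nothing can be disrupted
	}
	nodes := make([]Node, total)
	for i := range nodes {
		nodes[i] = Node{Drifted: true, Empty: i < empty}
	}
	passes := 0
	for countEmpty(nodes) > 0 {
		nodes = driftPass(nodes, driftBudget)
		passes++
	}
	return passes, len(nodes)
}

func main() {
	passes, left := simulate(1000, 500, 1)
	fmt.Println(passes, left) // 500 500
}
```

With a Drift budget of 1, the 500 empty nodes take 500 passes to remove, even though the Empty budget would have allowed all of them at once.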

Proposals:

1. Update Drift's Handling of Empty Nodes (not implemented)

Specifically handle Empty nodes within Drift as being Empty, applying relevant Node Disruption Budgets for Emptiness as well as correctly logging the disruption reason as Empty. Alternatively, empty nodes can be skipped over when evaluating Drift and then handled by the Emptiness consolidation method.

Pros:

  • Keeps Drift the first considered graceful disruption, prioritizing patching and getting requirements in line over everything else

Cons:

  • The issue covered in this PR still exists
  • Complicates how we reason about disruption methods

2. Reorder Graceful Consolidation and Skip Emptiness in Other Methods (this PR)

Update Drift so that if it detects empty nodes, those nodes are skipped, as they are in other disruption methods. These nodes are then picked up and deleted by Emptiness. Additionally, Emptiness should be evaluated before Drift so that zero-simulation consolidation options are performed first. In the above repro, 500 nodes would be consolidated before evaluating drift for the remaining nodes.

Pros:

  • Higher priority for deleting empty nodes
  • Requirements changing doesn't mean anything for a node destined for deletion
  • Fully separates out Emptiness as its own method and budget

Cons:

  • Drift not being performed first could introduce some unexpected behavior where a user is relying on patching due to AMI or requirements drift before any other changes are made to their dataplane.
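
A minimal sketch of the behavior proposed in option 2, under simplifying assumptions: Emptiness is evaluated first with its own budget, and Drift only considers drifted nodes that are not empty. The names `Node`, `emptyCandidates`, `driftCandidates`, `limit`, and `nextDisruption` are hypothetical and do not come from the Karpenter codebase.

```go
package main

import "fmt"

// Node is a toy model of a node's disruption-relevant state.
type Node struct {
	Name    string
	Drifted bool
	Empty   bool
}

// emptyCandidates returns every empty node, drifted or not.
func emptyCandidates(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Empty {
			out = append(out, n)
		}
	}
	return out
}

// driftCandidates returns drifted nodes, skipping empty ones so that
// Emptiness owns them.
func driftCandidates(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Drifted && !n.Empty {
			out = append(out, n)
		}
	}
	return out
}

// limit truncates the candidate list to the budget.
func limit(candidates []Node, budget int) []Node {
	if budget < 0 {
		budget = 0
	}
	if len(candidates) > budget {
		return candidates[:budget]
	}
	return candidates
}

// nextDisruption evaluates Emptiness before Drift, each against its own budget.
func nextDisruption(nodes []Node, emptyBudget, driftBudget int) (string, []Node) {
	if picked := limit(emptyCandidates(nodes), emptyBudget); len(picked) > 0 {
		return "Empty", picked
	}
	if picked := limit(driftCandidates(nodes), driftBudget); len(picked) > 0 {
		return "Drifted", picked
	}
	return "", nil
}

func main() {
	// The repro cluster: 1000 drifted nodes, half of them empty.
	nodes := make([]Node, 1000)
	for i := range nodes {
		nodes[i] = Node{Name: fmt.Sprintf("node-%d", i), Drifted: true, Empty: i < 500}
	}
	reason, picked := nextDisruption(nodes, 500, 1)
	fmt.Println(reason, len(picked)) // Empty 500

	// Once the empty nodes are gone, Drift proceeds under its own budget.
	reason, picked = nextDisruption(nodes[500:], 500, 1)
	fmt.Println(reason, len(picked)) // Drifted 1
}
```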

@k8s-ci-robot k8s-ci-robot requested review from engedaam and tallaxes May 1, 2025 01:30
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 1, 2025
Member

@jonathan-innis jonathan-innis left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 1, 2025
@cnmcavoy
Contributor

cnmcavoy commented May 2, 2025

lgtm

I like this change, and I don't think the "cons" is a big concern. Emptiness is scaling down nodes and not replacing them, so if you need to patch an AMI for a security issue, scaling down nodes is still the desired behavior because it reduces the attack surface (fewer vulnerable nodes) faster than drift can.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 5, 2025
@coveralls

coveralls commented May 5, 2025

Pull Request Test Coverage Report for Build 14846033812

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 7 of 7 (100.0%) changed or added relevant lines in 2 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.006%) to 82.019%

Files with Coverage Reduction | New Missed Lines | %
pkg/controllers/disruption/drift.go | 2 | 87.5%
pkg/controllers/nodeclaim/lifecycle/registration.go | 7 | 82.76%

Totals
Change from base Build 14766283735: 0.006%
Covered Lines: 10067
Relevant Lines: 12274

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2025
ExpectSingletonReconciled(ctx, queue)
Expect(len(ExpectNodeClaims(ctx, env.Client))).To(Equal(10))
})
It("should allow all nodes from each nodePool to be deleted", func() {
Contributor Author

This would now happen in emptiness because the test used to have empty && drifted nodes. There is a test that covers disruptions across multiple nodepools.

Expect(ExpectNodes(ctx, env.Client)).To(HaveLen(1))
ExpectExists(ctx, env.Client, nodeClaim)
})
It("should ignore nodes with the drifted status condition set to false", func() {
Contributor Author

Moved.

Expect(ExpectNodeClaims(ctx, env.Client)).To(HaveLen(1))
ExpectExists(ctx, env.Client, nodeClaim)
})
It("can delete drifted nodes", func() {
Contributor Author

Covered by "should delete drifted nodes".

Expect(ExpectNodes(ctx, env.Client)).To(HaveLen(0))
Expect(ExpectNodeClaims(ctx, env.Client)).To(HaveLen(0))
})
It("can replace drifted nodes", func() {
Contributor Author

Covered by the new "should replace drifted nodes" test.

ExpectExists(ctx, env.Client, nodeClaim)
ExpectExists(ctx, env.Client, node)
})
It("should delete nodes with the karpenter.sh/do-not-disrupt annotation set to false", func() {
Contributor Author

Moved to other annotation test in same file.

Expect(ExpectNodeClaims(ctx, env.Client)).To(HaveLen(0))
Expect(ExpectNodes(ctx, env.Client)).To(HaveLen(0))
ExpectNotFound(ctx, env.Client, nodeClaim, node)
ExpectMetricGaugeValue(disruption.EligibleNodes, 1, map[string]string{
Contributor Author

Validate via metrics that the disruption was due to emptiness.

Member

@jonathan-innis jonathan-innis left a comment

/lgtm
/approve

Expect(ExpectNodes(ctx, env.Client)).To(HaveLen(0))
ExpectNotFound(ctx, env.Client, nodeClaim, node)
})
It("can delete empty and drifted nodes", func() {
Member

Nice! 🎉

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cnmcavoy, jonathan-innis, rschalo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 8a6239b into kubernetes-sigs:main May 6, 2025
16 checks passed
jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Sep 19, 2025