Skip to content

Conversation

@rschalo
Copy link
Contributor

@rschalo rschalo commented May 8, 2025

Fixes #N/A

Description
Today, when Karpenter performs validation for empty nodes, if a single node experience pod churn, the entire batch of empty nodes is considered invalid. This change alters validation such that only the nodes that experience churn are excluded from disruption.

How was this change tested?
Unit tests, also:

I applied this change with a sleep during validation (see "validating candidates" log) to a cluster with two nodes, one pod per node, and then scaled the deployment down to zero. While in the sleep, I scaled up the deployment to one pod. The node the pod scheduled to had its consolidatable status condition removed (practical-tharp-919765395) and the command proceeded with the nodeclaim that was still empty eager-feynman-1734482834.

karpenter-5944796d5f-sf9zg controller {"level":"DEBUG","time":"2025-05-08T17:09:37.713Z","logger":"controller","caller":"disruption/validation.go:123","message":"validating candidates","commit":"3087cce-dirty","controller":"disruption","namespace":"","name":"","reconcileID":"7ee690c9-7c65-476c-860b-37a6e699204d"}

karpenter-5944796d5f-sf9zg controller {"level":"DEBUG","time":"2025-05-08T17:09:40.964Z","logger":"controller","caller":"disruption/consolidation.go:66","message":"removing consolidatable status condition","commit":"3087cce-dirty","controller":"nodeclaim.disruption","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"arm-2cpu-dxf5m"},"namespace":"","name":"arm-2cpu-dxf5m","reconcileID":"4c049d78-633c-4615-9887-5b12c02af3ed","Node":{"name":"practical-tharp-919765395"}}

karpenter-5944796d5f-sf9zg controller {"level":"INFO","time":"2025-05-08T17:09:47.714Z","logger":"controller","caller":"disruption/controller.go:206","message":"disrupting node(s)","commit":"3087cce-dirty","controller":"disruption","namespace":"","name":"","reconcileID":"7ee690c9-7c65-476c-860b-37a6e699204d","command-id":"6c95a5be-9992-47e5-b6e4-4645e55e03db","reason":"empty","decision":"delete","disrupted-node-count":1,"replacement-node-count":0,"pod-count":0,"disrupted-nodes":[{"Node":{"name":"eager-feynman-1734482834"},"NodeClaim":{"name":"arm-2cpu-k72c5"},"capacity-type":"on-demand","instance-type":"c-1x-arm64-linux"}],"replacement-nodes":[]}

karpenter-5944796d5f-sf9zg controller {"level":"DEBUG","time":"2025-05-08T17:09:48.659Z","logger":"controller","caller":"orchestration/queue.go:236","message":"command succeeded","commit":"3087cce-dirty","controller":"disruption.queue","namespace":"","name":"","reconcileID":"58809c5a-348c-472b-ac72-b0f3e75843bb","command-id":"6c95a5be-9992-47e5-b6e4-4645e55e03db","reason":"Empty","decision":"delete","disrupted-node-count":1,"replacement-node-count":0,"disrupted-nodes":[{"Node":{"name":"eager-feynman-1734482834"},"NodeClaim":{"name":"arm-2cpu-k72c5"}}],"replacement-nodes":[]}

karpenter-5944796d5f-sf9zg controller {"level":"INFO","time":"2025-05-08T17:09:48.683Z","logger":"controller","caller":"terminator/terminator.go:89","message":"tainted node","commit":"3087cce-dirty","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"eager-feynman-1734482834"},"namespace":"","name":"eager-feynman-1734482834","reconcileID":"cdf22cb6-0686-42f7-be34-68fec7cb6418","NodeClaim":{"name":"arm-2cpu-k72c5"},"taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2025
@k8s-ci-robot k8s-ci-robot requested review from engedaam and tallaxes May 8, 2025 17:37
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 8, 2025
@coveralls
Copy link

coveralls commented May 8, 2025

Pull Request Test Coverage Report for Build 15120678788

Details

  • 19 of 26 (73.08%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 81.947%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/disruption/validation.go 16 23 69.57%
Totals Coverage Status
Change from base Build 15075049606: 0.01%
Covered Lines: 10150
Relevant Lines: 12386

💛 - Coveralls

@rschalo
Copy link
Contributor Author

rschalo commented May 8, 2025

I think that the best way to test this properly would require a refactoring of the disruption controller for injecting the methods rather than adding them directly to the methods member of the controller. Alternatively, validation could be passed as a param and a mocked validator that creates a pod and binds it to a node during validation could be added.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 16, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 16, 2025
@jmdeal
Copy link
Member

jmdeal commented May 19, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmdeal, rschalo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 246a485 into kubernetes-sigs:main May 19, 2025
16 checks passed
rschalo added a commit to rschalo/karpenter that referenced this pull request May 19, 2025
Raj-Popat added a commit to acquia/karpenter that referenced this pull request Jun 11, 2025
* test: Lower resource requests for NodeClaim test (kubernetes-sigs#2229)

* perf: Don't deepcopy inside of watch handler functions (kubernetes-sigs#2232)

* test: Add random name string for NodePool and NodeClass (kubernetes-sigs#2231)

* test: Update E2E testing suite to be named Regression (kubernetes-sigs#2234)

* refactor: convert validation to an interface (kubernetes-sigs#2220)

* fix: allow non-churn empty nodes to be disrupted (kubernetes-sigs#2206)

* perf: Only deep copy nodes during GetCandidates once (kubernetes-sigs#2233)

* feat: add metrics for disruption candidate validation (kubernetes-sigs#2239)

* perf: Only call .Available() once which prevents duplicate allocs (kubernetes-sigs#2241)

* docs: update issue triage meeting schedule (kubernetes-sigs#2244)

* test: deflake NodeClaim and presubmit tests (kubernetes-sigs#2240)

* perf: Avoid deepcopy when get nodePool/cluster health (kubernetes-sigs#2247)

* perf: Improve OrderByPrice performance (kubernetes-sigs#2250)

* test: add validating admission policy for nodeclass status (kubernetes-sigs#2251)

Co-authored-by: Jonathan Innis <[email protected]>

---------

Co-authored-by: Amanuel Engeda <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: DerekFrank <[email protected]>
Co-authored-by: Jason Deal <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
harshad3339 added a commit to acquia/karpenter that referenced this pull request Jul 31, 2025
* test: Lower resource requests for NodeClaim test (kubernetes-sigs#2229)

* perf: Don't deepcopy inside of watch handler functions (kubernetes-sigs#2232)

* test: Add random name string for NodePool and NodeClass (kubernetes-sigs#2231)

* test: Update E2E testing suite to be named Regression (kubernetes-sigs#2234)

* refactor: convert validation to an interface (kubernetes-sigs#2220)

* fix: allow non-churn empty nodes to be disrupted (kubernetes-sigs#2206)

* perf: Only deep copy nodes during GetCandidates once (kubernetes-sigs#2233)

* feat: add metrics for disruption candidate validation (kubernetes-sigs#2239)

* perf: Only call .Available() once which prevents duplicate allocs (kubernetes-sigs#2241)

* docs: update issue triage meeting schedule (kubernetes-sigs#2244)

* test: deflake NodeClaim and presubmit tests (kubernetes-sigs#2240)

* perf: Avoid deepcopy when get nodePool/cluster health (kubernetes-sigs#2247)

* perf: Improve OrderByPrice performance (kubernetes-sigs#2250)

* test: add validating admission policy for nodeclass status (kubernetes-sigs#2251)

Co-authored-by: Jonathan Innis <[email protected]>

* feat: drain and volume detachment status conditions (kubernetes-sigs#1876)

* fix: show the cron parse error to users to allow them to debug (kubernetes-sigs#2258)

* perf: Don't deep-copy nodes and nodeclaims in our synced check (kubernetes-sigs#2260)

* chore: Fix getting current script directory in install-kwok.sh (kubernetes-sigs#2262)

* perf: Perform quick checks in node health first (kubernetes-sigs#2264)

* chore: Update pod metrics when pod is completed (kubernetes-sigs#2259)

* fix: Correctly build nodepool mapping for complex clusters (kubernetes-sigs#2263)

* fix: fail open for missing nodeclaims in termination (kubernetes-sigs#2266)

* perf: Limit GetInstanceTypes() calls per-NodeClaim (kubernetes-sigs#2271)

* perf: Parallelize disruption execution actions (kubernetes-sigs#2270)

* fix: Fix node owner reference update (kubernetes-sigs#2274)

* perf: Be more resilient to deletion failures in disruption controller (kubernetes-sigs#2272)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2277)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: Ensure we can stand up multiple partitions with kwok (kubernetes-sigs#2283)

* chore: Inject resources into Kwok through a patch (kubernetes-sigs#2285)

* chore: Update NodeClaim E2E test to only replace one status condition (kubernetes-sigs#2284)

* chore: Avoid validating admission policy for clusters older then 1.30 (kubernetes-sigs#2289)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2295)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump go version to 1.24.4 (kubernetes-sigs#2298)

* chore: Only log that the command succeeded when it actually did (kubernetes-sigs#2302)

* fix: Fix bug with MarkForDeletion before creating replacements (kubernetes-sigs#2300)

* perf: Refactor the eviction queue to be multithreaded (kubernetes-sigs#2252)

* docs: Add Bizfly Cloud provider (kubernetes-sigs#2303)

* chore: Bump lifecycle cache expiration to one hour (kubernetes-sigs#2307)

* chore: Use cluster state to check replacement NodeClaim existence (kubernetes-sigs#2308)

* chore(deps): bump github.com/samber/lo from 1.50.0 to 1.51.0 in the go-deps group (kubernetes-sigs#2315)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump operatorpkg (kubernetes-sigs#2314)

* chore(deps): bump the k8s-go-deps group across 1 directory with 4 updates (kubernetes-sigs#2317)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: Refactor Orchestration Queue and Handle Mark/Unmark Deletion in Queue (kubernetes-sigs#2305)

* chore(deps): bump the k8s-go-deps group with 7 updates (kubernetes-sigs#2326)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* perf: multithreaded orchestration queue (kubernetes-sigs#2293)

* test: Add nodeclaim name when you have garbage collection (kubernetes-sigs#2333)

* perf: Reduce multiple patch calls in instance termination (kubernetes-sigs#2324)

* fix: add helm rbac for kwok-provider to update finalizers (kubernetes-sigs#2336)

Signed-off-by: Max Cao <[email protected]>

* feat: configure CRD status operator with larger histogram buckets (kubernetes-sigs#2328)

* chore(deps): bump sigs.k8s.io/yaml from 1.4.0 to 1.5.0 in the k8s-go-deps group (kubernetes-sigs#2339)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump github.com/docker/docker from 28.2.2+incompatible to 28.3.0+incompatible in the go-deps group (kubernetes-sigs#2340)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: Fix re-retrieving object on retry (kubernetes-sigs#2337)

* fix: Fix overriding error with patch call (kubernetes-sigs#2338)

* fix: add missing rlock to disruption queue (kubernetes-sigs#2348)

* test: allow e2e tests to output junit report (kubernetes-sigs#2334)

Signed-off-by: Max Cao <[email protected]>

* docs: Add Oracle Cloud Infrastructure (OCI) provider  (kubernetes-sigs#2342)

* fix: no longer allow the same hostname to take multiple capacity (kubernetes-sigs#2356)

* feat: support auto relaxing min values (kubernetes-sigs#2299)

* fix: update provider ID to ensure that Cloud Provider tests pass (kubernetes-sigs#2363)

* fix: remove unsupported capacity_type label from karpenter_nodeclaims… (kubernetes-sigs#2364)

* fix: update deletionTimestamp on terminating pods when after nodeDeletionTimestamp (kubernetes-sigs#2316)

Co-authored-by: Amanuel Engeda <[email protected]>

* chore: promote ReservedCapacity feature gate to beta (kubernetes-sigs#2365)

* fix: flakiness in expiration tests (kubernetes-sigs#2366)

* test: Bump the termination time for the deletion timestamp (kubernetes-sigs#2367)

* chore: cherry-pick kubernetes-sigs#2399 (kubernetes-sigs#2401)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Max Cao <[email protected]>
Co-authored-by: Amanuel Engeda <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: DerekFrank <[email protected]>
Co-authored-by: Jason Deal <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Todd Neal <[email protected]>
Co-authored-by: Jigisha Patil <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lê Minh Quân <[email protected]>
Co-authored-by: Max Cao <[email protected]>
Co-authored-by: Aidan Rowe <[email protected]>
Co-authored-by: Daniel Lopes <[email protected]>
Co-authored-by: Saurav Agarwalla <[email protected]>
Co-authored-by: cosimomeli <[email protected]>
jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Sep 19, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 25, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 25, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 25, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 25, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 26, 2025
k8s-ci-robot pushed a commit that referenced this pull request Sep 26, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 29, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 29, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 29, 2025
AndrewMitchell25 pushed a commit to AndrewMitchell25/karpenter that referenced this pull request Sep 29, 2025
k8s-ci-robot pushed a commit that referenced this pull request Oct 1, 2025
k8s-ci-robot pushed a commit that referenced this pull request Oct 1, 2025
k8s-ci-robot pushed a commit that referenced this pull request Oct 1, 2025
k8s-ci-robot pushed a commit that referenced this pull request Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants