
Conversation

@cosimomeli
Contributor

Description
When a node receives the unreachable taint, the Kubernetes taint controller triggers the deletion of all pods after 5 minutes. When the Node Repair threshold is reached, Karpenter's drain procedure waits for all pods either to be evicted or to become stuck in termination (i.e., past their deletionTimestamp). But if a Pod has a long termination grace period (RabbitMQ operator pods have 7 days, for example), the node will wait far too long before being deleted.

To improve the forced termination, I changed the drain so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.
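
For illustration, here is a minimal Go sketch of the idea; this is not the actual Karpenter code, and `needsRedelete`, `redeleteAligned`, and `nodeTerminationTime` are hypothetical names:

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// needsRedelete reports whether a terminating pod's deletionTimestamp falls
// after the node's termination deadline and would therefore block node deletion.
func needsRedelete(pod *corev1.Pod, nodeTerminationTime time.Time) bool {
	return pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Time.After(nodeTerminationTime)
}

// redeleteAligned deletes the pod again with a grace period that expires at
// the node termination deadline, shortening an overly long original period.
func redeleteAligned(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, nodeTerminationTime time.Time) error {
	grace := int64(time.Until(nodeTerminationTime).Seconds())
	if grace < 0 {
		grace = 0 // deadline already passed: delete immediately
	}
	return c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}
```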

How was this change tested?
I added a unit test for this and also tested the change with both an Unhealthy Node on AWS (dead kubelet) and a simple node deletion.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 17, 2025
@k8s-ci-robot
Contributor

Welcome @cosimomeli!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 17, 2025
@k8s-ci-robot
Contributor

Hi @cosimomeli. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 17, 2025
@coveralls

coveralls commented Jun 17, 2025

Pull Request Test Coverage Report for Build 16201702039

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 6 of 6 (100.0%) changed or added relevant lines in 2 files are covered.
  • 171 unchanged lines in 13 files lost coverage.
  • Overall coverage decreased (-0.1%) to 81.916%

| File with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/disruption/consolidation.go | 3 | 88.14% |
| pkg/controllers/disruption/drift.go | 4 | 87.76% |
| pkg/controllers/disruption/singlenodeconsolidation.go | 4 | 93.62% |
| pkg/controllers/disruption/emptiness.go | 5 | 87.3% |
| pkg/controllers/state/statenode.go | 5 | 87.05% |
| pkg/controllers/controllers.go | 9 | 0.0% |
| pkg/controllers/disruption/multinodeconsolidation.go | 10 | 86.76% |
| pkg/test/ratelimitinginterface.go | 10 | 0.0% |
| pkg/controllers/disruption/helpers.go | 11 | 87.43% |
| pkg/controllers/disruption/validation.go | 15 | 81.92% |

Totals Coverage Status

  • Change from base Build 15692799185: -0.1%
  • Covered Lines: 10219
  • Relevant Lines: 12475

💛 - Coveralls

@jonathan-innis
Member

/assign @engedaam

Amanuel implemented Node Auto Repair, so I'm assigning him since he's the relevant owner.

@chicco785

hey @engedaam any estimated time for the review? thx!

@engedaam
Contributor

engedaam commented Jul 9, 2025

When the Node Repair threshold is reached, Karpenter's drain procedure waits for all pods either to be evicted or to become stuck in termination (i.e., past their deletionTimestamp). But if a Pod has a long termination grace period (RabbitMQ operator pods have 7 days, for example), the node will wait far too long before being deleted.

Currently, Karpenter does not immediately drain pods when initiating a Node Repair action. Instead, it relies on a tolerationDuration configured by the cloud provider. For example, in the AWS Provider, unreachable nodes are given a 30-minute toleration duration before Karpenter begins the process of deleting the node. During this termination period, Karpenter waits for pods to be terminated, which is handled by the drain logic implemented in the terminator.go file (specifically at this line: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/terminator/terminator.go#L140). The behavior you're describing in this PR aligns with our current expectations. To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?

To improve the forced termination, I changed the drain so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.

Can you help me understand why this would help here? We only really look at the deletionTimestamp when filtering pods, not when deciding whether to force delete.

@chicco785

Currently, Karpenter does not immediately drain pods when initiating a Node Repair action. Instead, it relies on a tolerationDuration configured by the cloud provider. For example, in the AWS Provider, unreachable nodes are given a 30-minute toleration duration before Karpenter begins the process of deleting the node. During this termination period, Karpenter waits for pods to be terminated, which is handled by the drain logic implemented in the terminator.go file (specifically at this line: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/terminator/terminator.go#L140). The behavior you're describing in this PR aligns with our current expectations. To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?

As far as I understood, if a Pod has a long termination grace period, the node will not be removed at the end of the node toleration duration; instead it will wait for the pod's termination grace period. For example, RabbitMQ operator pods have a 7-day termination period, so the node won't be terminated for 7 days.

@cosimomeli can explain better.

@cosimomeli
Contributor Author

To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?

Hello @engedaam, thanks for the answer.
When a node becomes unreachable, Karpenter triggers Node Repair after 30 minutes (on AWS). Meanwhile, the Kubernetes taint controller starts evicting pods after 5 minutes when a node receives the node.kubernetes.io/unreachable taint.

Karpenter's terminator logic immediately drains every pod on the Node: since the node.health controller sets the node termination timestamp to the current timestamp, this is effectively a forced shutdown, and termination proceeds as expected. But there is one exception:
The podsToDelete here filters out every pod that is already terminating and has not yet passed its graceful termination period. This means that if I have a pod with a very long termination period (my example was RabbitMQ with 7 days), the termination controller will not touch it, and Karpenter will wait for its natural termination before deleting the node.
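
A sketch of that filter's effect (a paraphrase, not the actual terminator.go code; `isStillInGracePeriod` is a hypothetical name):

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// isStillInGracePeriod paraphrases the existing filter: a pod that is already
// terminating and whose deletionTimestamp lies in the future is skipped, so a
// pod with a very long grace period is never deleted again.
func isStillInGracePeriod(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil && pod.DeletionTimestamp.Time.After(time.Now())
}
```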

To improve the forced termination, I changed the drain so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.

Can you help me understand why this would help here? We only really look at the deletionTimestamp when filtering pods, not when deciding whether to force delete.

My change adds to podsToDelete the pods whose graceful shutdown period ends after the node termination deadline. This way they get deleted again with a different grace period, one compatible with the node termination deadline.
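
A sketch of the amended selection (illustrative only; `shouldDelete` and `nodeTerminationTime` are hypothetical names, not the PR's actual identifiers):

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// shouldDelete includes a terminating pod in podsToDelete again when its
// current grace period would outlive the node termination deadline.
func shouldDelete(pod *corev1.Pod, nodeTerminationTime time.Time) bool {
	if pod.DeletionTimestamp == nil {
		return true // not yet terminating: delete as before
	}
	// Already terminating: re-delete only if its deletionTimestamp falls
	// after the node deadline, so the new grace period aligns with it.
	return pod.DeletionTimestamp.Time.After(nodeTerminationTime)
}
```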

@cosimomeli
Contributor Author

Considering that after 30 minutes all pods (except those with an explicit toleration of node.kubernetes.io/unreachable) have already been deleted by the taint controller, as documented here, it's not uncommon for a terminating pod to need a second delete that shortens its default deletion time and speeds up the draining process.
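
For reference, the default injected by the DefaultTolerationSeconds admission plugin looks like this when expressed as a Go literal (a sketch of the well-known default, not code from this PR):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Pods without an explicit toleration get this one by default: tolerate an
// unreachable node for 300 seconds (5 minutes), after which the taint
// controller evicts them.
var unreachableToleration = corev1.Toleration{
	Key:               "node.kubernetes.io/unreachable",
	Operator:          corev1.TolerationOpExists,
	Effect:            corev1.TaintEffectNoExecute,
	TolerationSeconds: func(s int64) *int64 { return &s }(300),
}
```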

@engedaam
Contributor

Considering that after 30 minutes all pods (except those with an explicit toleration of node.kubernetes.io/unreachable) have already been deleted by the taint controller, as documented here, it's not uncommon for a terminating pod to need a second delete that shortens its default deletion time and speeds up the draining process.

This clears things up and thanks for the thorough explanation!

@engedaam
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2025
Contributor

@engedaam engedaam left a comment


Just one small nit

@engedaam
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2025
@jmdeal
Member

jmdeal commented Jul 11, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cosimomeli, jmdeal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit 58bf160 into kubernetes-sigs:main Jul 11, 2025
16 checks passed
rlanhellas pushed a commit to rlanhellas/karpenter that referenced this pull request Jul 12, 2025
harshad3339 added a commit to acquia/karpenter that referenced this pull request Jul 31, 2025
jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Sep 19, 2025
harshad3339 added a commit to acquia/karpenter that referenced this pull request Nov 3, 2025