@DerekFrank
Contributor

Fixes #N/A

Description

To improve performance, this change makes the termination controller evict pods on multiple threads.

How was this change tested?

make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025
@k8s-ci-robot k8s-ci-robot requested review from engedaam and tallaxes May 23, 2025 19:45
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 23, 2025
@k8s-ci-robot
Contributor

Hi @DerekFrank. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 23, 2025
@DerekFrank DerekFrank force-pushed the threaded-eviction-queue branch from 12fe218 to 062e604 Compare May 27, 2025 23:59
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2025
Namespace: key.Namespace,
}}, serrors.Wrap(fmt.Errorf("evicting pod violates a PDB"), "Pod", klog.KRef(key.Namespace, key.Name))))
return false
Name: pod.Name,
Member

This event looks wrong -- it's referencing a node but that node is based on the pod info?

Contributor Author

That was the way the event was before this change. Given that we are draining a node by evicting pods from it, we publish an event on the node when we can't evict a pod.

Member

Right -- I think the event before was wrong just reading through this -- it expects a node with a name and we are giving it the pod name and namespace for some reason

Contributor

Yeah, this event seems to be bugged in the current release of Karpenter. Here's an example of this event from our Datadog event collector:

[image]

We have no nodes named "arc/arc-4"; that's a namespace and a StatefulSet pod name.

ExpectApplied(ctx, env.Client, pod)
Expect(queue.Evict(ctx, terminator.NewQueueKey(pod, node.Spec.ProviderID))).To(BeTrue())
ExpectApplied(ctx, env.Client, pod, node)
Expect(queue.Evict(ctx, pod)).To(BeNil())
Member

Suggested change
- Expect(queue.Evict(ctx, pod)).To(BeNil())
+ Expect(queue.Evict(ctx, pod)).To(Succeed())

nit: This reads a little better -- you could change all the places you have BeNil() to this

Member

You resolved this, but I couldn't tell whether you did or didn't want to change it.

Contributor Author

Whoops I missed that one!

@DerekFrank DerekFrank changed the title feat: Refactor the eviction queue to be multithreaded WIP, DO NOT MERGE. feat: Refactor the eviction queue to be multithreaded May 29, 2025
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 29, 2025
@DerekFrank DerekFrank closed this Jun 3, 2025
@DerekFrank DerekFrank force-pushed the threaded-eviction-queue branch from 05bf8b8 to e62bac3 Compare June 3, 2025 21:48
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 3, 2025
@DerekFrank DerekFrank reopened this Jun 3, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 3, 2025
@DerekFrank DerekFrank changed the title WIP, DO NOT MERGE. feat: Refactor the eviction queue to be multithreaded feat: Refactor the eviction queue to be multithreaded Jun 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2025
// Trigger eviction queue with the pod key still in it
ExpectSingletonReconciled(ctx, queue)
Expect(queue.Has(pod)).To(BeTrue())
queue.Add(pod)
Member

We're adding a pod to the queue when it's already there?

Contributor Author

That's intentional: we want to verify that when we attempt to add the pod to the queue, it doesn't then get evicted, because it's already returning true for Has.

Member

I think this is actually making me realize that this change breaks the logic in this code -- the following scenario is now possible, and it would lead us to evict a new pod that shouldn't have been evicted:

  1. We add a pod to the channel
  2. This pod's namespaced name is added to the controller workqueue
  3. The pod has previously been evicted and a new pod has launched, but we didn't see it -- this means that when we pull the pod from the cache, we get the new pod and then evict it

A core tenet of whatever we do here has to be that when we add data to the queue, we also add the UUID of the pod that we pull -- if we don't do that, we have the potential to evict the wrong pod.
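
The scenario above can be sketched with a queue keyed on namespace, name, and UID together. This is a minimal illustration, not the PR's implementation: `Pod`, `QueueKey`, `Queue`, and `ShouldEvict` are hypothetical stand-ins, and `Pod` carries only the fields the example needs.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical stand-in for corev1.Pod with only the fields used here.
type Pod struct {
	Namespace, Name, UID string
}

// QueueKey identifies one specific pod instance, not just a namespace/name pair.
type QueueKey struct {
	Namespace, Name, UID string
}

func keyFor(p Pod) QueueKey {
	return QueueKey{Namespace: p.Namespace, Name: p.Name, UID: p.UID}
}

// Queue is a hypothetical set of pod instances that are pending eviction.
type Queue struct {
	mu   sync.RWMutex
	keys map[QueueKey]struct{}
}

func NewQueue() *Queue { return &Queue{keys: map[QueueKey]struct{}{}} }

// Add records the exact pod instance, including its UID, that should be evicted.
func (q *Queue) Add(p Pod) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.keys[keyFor(p)] = struct{}{}
}

// Has reports whether this exact pod instance was queued.
func (q *Queue) Has(p Pod) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	_, ok := q.keys[keyFor(p)]
	return ok
}

// ShouldEvict takes whatever pod the cache currently returns for a
// namespace/name. A replacement pod has a different UID, so it is skipped.
func (q *Queue) ShouldEvict(fromCache Pod) bool { return q.Has(fromCache) }

func main() {
	q := NewQueue()
	old := Pod{Namespace: "default", Name: "web-0", UID: "uid-old"}
	q.Add(old)

	// A replacement pod reuses the name but gets a new UID.
	replacement := Pod{Namespace: "default", Name: "web-0", UID: "uid-new"}
	fmt.Println(q.ShouldEvict(old), q.ShouldEvict(replacement))
}
```

Adding the same pod twice is a no-op with this structure, since the key is already in the set.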

pod = ExpectExists(ctx, env.Client, pod)
Expect(pod.DeletionTimestamp.IsZero()).To(BeFalse())
ExpectObjectReconciled(ctx, env.Client, queue, pod)
ExpectDeleted(ctx, env.Client, pod)
Member

Like I mentioned below -- this actually does a deletion, which I don't think is what we were doing before -- is this what we want? Since we aren't really checking anything here after reconciling

ctx = injection.WithControllerName(ctx, q.Name())

if !q.Has(pod) {
//This is a different pod than in the queue, we should exit without evicting
Member

nit: It may be worth adding some more verbose commenting on why we need to handle this case, given that this race condition is a little hard to think about and the impact of not handling this is a little bit on the edge

ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Pod", klog.KRef(key.Namespace, key.Name)))
// Evict returns nil if successful eviction call, and an error if there was an eviction-related error
func (q *Queue) Evict(ctx context.Context, pod *corev1.Pod) error {
ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Pod", klog.KRef(pod.Namespace, pod.Name)))
Member

This should automatically be injected by the Reconcile() method so there's actually no need to add it to the context here

return false
node, err2 := podutils.NodeForPod(ctx, q.kubeClient, pod)
if err2 != nil {
log.FromContext(ctx).V(1).Error(err2, "pod has no node")
Member

This will double-log the error -- controller-runtime automatically logs the error if you return an error from the Reconcile by passing it up the stack

Contributor Author

I'm returning err not err2 out of the function, so both get logged instead of swallowing err in favor of err2

Member

I see -- that's a good point, maybe an if/else makes this more clear -- that way you would only need a single return statement?

Contributor Author

Do we ever want to return err2? I don't think so, as it's not the thing blocking eviction.

ExpectApplied(ctx, env.Client, pod)
Expect(queue.Evict(ctx, terminator.NewQueueKey(pod, node.Spec.ProviderID))).To(BeTrue())
ExpectApplied(ctx, env.Client, pod, node)
Expect(queue.Evict(ctx, pod)).To(BeNil())
Member

You resolved this, but I couldn't tell whether you did or didn't want to change it.

It("should delete a pod with less than terminationGracePeriodSeconds remaining before nodeTerminationTime", func() {
pod.Spec.TerminationGracePeriodSeconds = lo.ToPtr[int64](120)
// overwrite the node name or the delete does not succeed
pod.Spec.NodeName = ""
Member

Why does the delete not succeed if we don't override the nodeName with empty?

Contributor Author

I'm not exactly sure. I think it has to do with the mock client we use for testing. I've verified that client.Delete() is getting called for the pod, but the pod doesn't actually get deleted. FWIW, before this PR if someone had set the nodename in the test it would have failed in the same way. This PR sets the nodename for the pod at the start of each test.

Contributor Author

I'll have the node/pod bound in tests where they need to be, and not have the overwrite here.

Member

@jonathan-innis jonathan-innis left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DerekFrank, jonathan-innis


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 6eb3133 into kubernetes-sigs:main Jun 12, 2025
16 checks passed
jonathan-innis pushed a commit to jonathan-innis/karpenter that referenced this pull request Jun 13, 2025
InftyAI-Agent pushed a commit to InftyAI/karpenter that referenced this pull request Jun 13, 2025
* perf: Refactor the eviction queue to be multithreaded (kubernetes-sigs#2252)
rschalo pushed a commit to rschalo/karpenter that referenced this pull request Jun 15, 2025
@DerekFrank DerekFrank deleted the threaded-eviction-queue branch July 7, 2025 20:20
harshad3339 added a commit to acquia/karpenter that referenced this pull request Jul 31, 2025
* perf: Refactor the eviction queue to be multithreaded (kubernetes-sigs#2252)
jigisha620 pushed a commit to jigisha620/karpenter that referenced this pull request Sep 19, 2025
harshad3339 added a commit to acquia/karpenter that referenced this pull request Nov 3, 2025
* chore: bump go version to 1.24.4 (kubernetes-sigs#2298)

* chore: Only log that the command succeeded when it actually did (kubernetes-sigs#2302)

* fix: Fix bug with MarkForDeletion before creating replacements (kubernetes-sigs#2300)

* perf: Refactor the eviction queue to be multithreaded (kubernetes-sigs#2252)

* docs: Add Bizfly Cloud provider (kubernetes-sigs#2303)

* chore: Bump lifecycle cache expiration to one hour (kubernetes-sigs#2307)

* chore: Use cluster state to check replacement NodeClaim existence (kubernetes-sigs#2308)

* chore(deps): bump github.com/samber/lo from 1.50.0 to 1.51.0 in the go-deps group (kubernetes-sigs#2315)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: bump operatorpkg (kubernetes-sigs#2314)

* chore(deps): bump the k8s-go-deps group across 1 directory with 4 updates (kubernetes-sigs#2317)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: Refactor Orchestration Queue and Handle Mark/Unmark Deletion in Queue (kubernetes-sigs#2305)

* chore(deps): bump the k8s-go-deps group with 7 updates (kubernetes-sigs#2326)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* perf: multithreaded orchestration queue (kubernetes-sigs#2293)

* test: Add nodeclaim name when you have garbage collection (kubernetes-sigs#2333)

* perf: Reduce multiple patch calls in instance termination (kubernetes-sigs#2324)

* fix: add helm rbac for kwok-provider to update finalizers (kubernetes-sigs#2336)

Signed-off-by: Max Cao <[email protected]>

* feat: configure CRD status operator with larger histogram buckets (kubernetes-sigs#2328)

* chore(deps): bump sigs.k8s.io/yaml from 1.4.0 to 1.5.0 in the k8s-go-deps group (kubernetes-sigs#2339)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump github.com/docker/docker from 28.2.2+incompatible to 28.3.0+incompatible in the go-deps group (kubernetes-sigs#2340)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: Fix re-retrieving object on retry (kubernetes-sigs#2337)

* fix: Fix overriding error with patch call (kubernetes-sigs#2338)

* fix: add missing rlock to disruption queue (kubernetes-sigs#2348)

* test: allow e2e tests to output junit report (kubernetes-sigs#2334)

Signed-off-by: Max Cao <[email protected]>

* docs: Add Oracle Cloud Infrastructure (OCI) provider  (kubernetes-sigs#2342)

* fix: no longer allow the same hostname to take multiple capacity (kubernetes-sigs#2356)

* feat: support auto relaxing min values (kubernetes-sigs#2299)

* fix: update provider ID to ensure that Cloud Provider tests pass (kubernetes-sigs#2363)

* fix: remove unsupported capacity_type label from karpenter_nodeclaims… (kubernetes-sigs#2364)

* fix: update deletionTimestamp on terminating pods when after nodeDeletionTimestamp (kubernetes-sigs#2316)

Co-authored-by: Amanuel Engeda <[email protected]>

* chore: promote ReservedCapacity feature gate to beta (kubernetes-sigs#2365)

* fix: flakiness in expiration tests (kubernetes-sigs#2366)

* test: Bump the termination time for the deletion timestamp (kubernetes-sigs#2367)

* chore(deps): bump github.com/docker/docker from 28.3.0+incompatible to 28.3.1+incompatible in the go-deps group (kubernetes-sigs#2355)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: pod errors when nodepool requirements filter all instance types (kubernetes-sigs#2341)

* refactor: Create a NopValidator for the disruption testing (kubernetes-sigs#2369)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2373)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* refactor: Update disruption testing from PR comments (kubernetes-sigs#2372)

* feat: (BREAKING) addition of launch timeout for nodeclaim lifecycle (kubernetes-sigs#2349)

* chore: Consider node.kubernetes.io/not-ready:NoExecute as ephemeral (kubernetes-sigs#2265)

* perf: Optimistically delete from the cache after launch (kubernetes-sigs#2380)

* docs: Node Overlay RFC (kubernetes-sigs#2166)

* fix: handle multiple PDBs for the same pod more gracefully (kubernetes-sigs#2379)

* docs: Add IBM Cloud provider (kubernetes-sigs#2396)

Signed-off-by: Josephine Pfeiffer <[email protected]>

* fix: rate limit eviction when PDBs are blocking (kubernetes-sigs#2399)

* feat: Add the Node Overlay CRD (kubernetes-sigs#2296)

* chore: ignore pods that use unsupported provisioner in the storageClass (kubernetes-sigs#2400)

* feat: Add a feature flag for Node Overlay (kubernetes-sigs#2404)

* feat: Add StaticCapacity feature flag (kubernetes-sigs#2405)

* fix(BREAKING): update naming of karpenter_pods_drained_total (kubernetes-sigs#2421)

* fix: pod metrics when pod is terminal (kubernetes-sigs#2417)

* chore: ignore pods that have unbound pvc with volumeBindingMode immediate (kubernetes-sigs#2415)

* docs: static capacity RFC (kubernetes-sigs#2309)

* chore: bump go version to 1.24.6 (kubernetes-sigs#2432)

* feat: Create optional operator arguments to leverage leader lease functionality (kubernetes-sigs#2433)

* chore(deps): bump the go-deps group with 5 updates (kubernetes-sigs#2442)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump actions/checkout from 4.2.2 to 5.0.0 in /.github/actions/install-pyroscope in the action-deps group (kubernetes-sigs#2428)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the actions-deps group across 1 directory with 2 updates (kubernetes-sigs#2443)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump actions/cache from 4.2.3 to 4.2.4 in /.github/actions/install-deps in the action-deps group (kubernetes-sigs#2425)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: do not block drifted nodes from being terminated if consolidation is disabled (kubernetes-sigs#2423)

* chore: Pin GH action SHAs for run-bench-test (kubernetes-sigs#2448)

* chore: update operatorpkg (kubernetes-sigs#2455)

* chore: Track NodeClaims in NodePoolState (kubernetes-sigs#2449)

* chore(deps): bump the k8s-go-deps group across 1 directory with 7 updates (kubernetes-sigs#2456)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* perf: Add flag to disable costly metrics controllers (kubernetes-sigs#2354)

* perf: concurrent reconciles CPU-based scaling (kubernetes-sigs#2406)

* perf: Disruption Queue Retry Duration Scaling (kubernetes-sigs#2411)

* perf: Typed Bucket Scaling (kubernetes-sigs#2420)

* ci: Include K8s version 1.33 and 1.34 in testing (kubernetes-sigs#2465)

* chore: increase MaxInstanceTypes to give cloud-providers more control over instance type truncation (kubernetes-sigs#2430)

* chore(deps): bump the go-deps group with 2 updates (kubernetes-sigs#2461)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump amannn/action-semantic-pull-request from 6.0.1 to 6.1.1 in the actions-deps group (kubernetes-sigs#2462)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* ci: revert k8s 1.34 addition (kubernetes-sigs#2475)

* fix: Don't schedule a pod with DRA requirements (kubernetes-sigs#2384)

* fix: support arbitrary reserved capacity labels for drift (kubernetes-sigs#2476)

* chore(deps): bump actions/checkout from 4.2.2 to 5.0.0 in /.github/actions/install-prometheus in the action-deps group (kubernetes-sigs#2426)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: Fix nil pointer exception for multiNodeConsolidation (kubernetes-sigs#2472)

* fix: avoid hash collisions with duplicate match expressions (kubernetes-sigs#2479)

* ci: enable k8s 1.34 tests (kubernetes-sigs#2481)

* fix: Validate unsupported provisioners on bound PVs (kubernetes-sigs#2480)

* refactor: use iterator for iterating state nodes (kubernetes-sigs#2483)

* fix: make toolchain failing due to deletion of asciicheck (kubernetes-sigs#2485)

* fix: Handle PVC edge cases handled by kube-scheduler (kubernetes-sigs#2488)

* chore: Change appName from const to var (kubernetes-sigs#2489)

* fix: Handle unbound volumes with volumeName defined (kubernetes-sigs#2487)

* chore(deps): bump actions/setup-go from 5.5.0 to 6.0.0 in /.github/actions/install-deps in the action-deps group (kubernetes-sigs#2494)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump actions/setup-python from 5.6.0 to 6.0.0 in the actions-deps group (kubernetes-sigs#2493)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the go-deps group with 6 updates (kubernetes-sigs#2491)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the k8s-go-deps group with 4 updates (kubernetes-sigs#2492)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: remove duplicate reconcile logging (kubernetes-sigs#2496)

* chore: bump operatorpkg version (kubernetes-sigs#2500)

* perf: Update the Node Repair Controller for requeue time (kubernetes-sigs#2286)

* feat: Add NodeOverlay Controller Support (kubernetes-sigs#2306)

* chore(deps): bump the k8s-go-deps group with 3 updates (kubernetes-sigs#2504)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: rolling back to 1.34 (kubernetes-sigs#2512)

* fix: handle nil selector when hashing in topology (kubernetes-sigs#2511)

* feat: Support Pod Level Resources (kubernetes-sigs#2383)

Signed-off-by: Tsubasa Nagasawa <[email protected]>

* fix: merge limits into requests when constructing ds pods (kubernetes-sigs#2514)

* fix: default CPU_REQUESTS when non-positive value is provided (kubernetes-sigs#2516)

* fix(node): prevent empty providerID causing false NodeClaim matches (kubernetes-sigs#2507)

* feat: Support Static Capacity (kubernetes-sigs#2521)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Jason Deal <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Andrew Mitchell <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ryan Mistretta <[email protected]>

* fix: over provisioning static nodeclaims during controller crashes (kubernetes-sigs#2534)

* chore: drop consistency error to info log (kubernetes-sigs#2542)

* fix: flaky static provisioning unit test (kubernetes-sigs#2546)

* fix: nodepool crd definition should explicitly say replicas field as alpha (kubernetes-sigs#2554)

* chore: Update NodeRegistrationHealthy SC to use a buffer mechanism (kubernetes-sigs#2520)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Max Cao <[email protected]>
Signed-off-by: Josephine Pfeiffer <[email protected]>
Signed-off-by: Tsubasa Nagasawa <[email protected]>
Co-authored-by: Derek Frank <[email protected]>
Co-authored-by: Jonathan Innis <[email protected]>
Co-authored-by: Lê Minh Quân <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jigisha Patil <[email protected]>
Co-authored-by: Amanuel Engeda <[email protected]>
Co-authored-by: Max Cao <[email protected]>
Co-authored-by: Aidan Rowe <[email protected]>
Co-authored-by: Daniel Lopes <[email protected]>
Co-authored-by: Saurav Agarwalla <[email protected]>
Co-authored-by: cosimomeli <[email protected]>
Co-authored-by: Jason Deal <[email protected]>
Co-authored-by: Reed Schalo <[email protected]>
Co-authored-by: Josephine Pfeiffer <[email protected]>
Co-authored-by: Sumukha Radhakrishna <[email protected]>
Co-authored-by: Andy Townsend <[email protected]>
Co-authored-by: ryan-mist <[email protected]>
Co-authored-by: Brandon Wagner <[email protected]>
Co-authored-by: Alima Azamat <[email protected]>
Co-authored-by: Andrew Mitchell <[email protected]>
Co-authored-by: Tsubasa Nagasawa <[email protected]>
Co-authored-by: Neil <[email protected]>

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
