perf: Refactor the eviction queue to be multithreaded #2252
Hi @DerekFrank. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
12fe218 to 062e604
    Namespace: key.Namespace,
    }}, serrors.Wrap(fmt.Errorf("evicting pod violates a PDB"), "Pod", klog.KRef(key.Namespace, key.Name))))
    return false
    Name: pod.Name,
This event looks wrong -- it's referencing a node but that node is based on the pod info?
That was the way the event was before this change. Given that we are draining a node by evicting pods from it, we publish an event on the node when we can't evict a pod.
Right -- I think the event before was wrong just reading through this -- it expects a node with a name and we are giving it the pod name and namespace for some reason
    ExpectApplied(ctx, env.Client, pod)
    Expect(queue.Evict(ctx, terminator.NewQueueKey(pod, node.Spec.ProviderID))).To(BeTrue())
    ExpectApplied(ctx, env.Client, pod, node)
    Expect(queue.Evict(ctx, pod)).To(BeNil())
Suggested change:
    - Expect(queue.Evict(ctx, pod)).To(BeNil())
    + Expect(queue.Evict(ctx, pod)).To(Succeed())
nit: This reads a little better -- you could change all the places you have BeNil() to this
You resolved this, but I couldn't tell whether you did or didn't want to change it.
Whoops I missed that one!
05bf8b8 to e62bac3
    // Trigger eviction queue with the pod key still in it
    ExpectSingletonReconciled(ctx, queue)
    Expect(queue.Has(pod)).To(BeTrue())
    queue.Add(pod)
We're adding a pod to the queue when it's already there?
That's intentional. We want to verify that when we attempt to add the pod to the queue, it doesn't then get evicted, because it's already returning true for Has.
I think this is making me realize that this change breaks the logic here. The following scenario is now possible, and it would evict a new pod that shouldn't have been evicted:
- We add a pod to the channel
- This pod's namespace name is added to the controller workqueue
- The pod has previously been evicted and a new pod has launched but we didn't see it -- this means that when we pull the pod from the cache, we actually get the new pod and then we actually evict the new pod
A core tenet of whatever we do here has to be that when we add data to the queue, we also add the UUID of the pod that we pull -- if we don't do that, we have the potential to evict the wrong pod.
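To make the scenario above concrete, here is a minimal, self-contained sketch of a queue keyed on the pod's UID as well as its namespace and name. It is not the code from this PR: `Pod`, `QueueKey`, `Queue`, and `ShouldEvict` are hypothetical stand-ins (the real code works with `corev1.Pod`), and the mutex is only there because the queue is assumed to be shared across worker goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a hypothetical stand-in for corev1.Pod with only the fields needed here.
type Pod struct {
	Namespace, Name, UID string
}

// QueueKey identifies one specific pod instance, not just a namespaced name.
type QueueKey struct {
	Namespace, Name, UID string
}

// NewQueueKey builds the key from the pod that was actually observed.
func NewQueueKey(p Pod) QueueKey {
	return QueueKey{Namespace: p.Namespace, Name: p.Name, UID: p.UID}
}

// Queue tracks the pod instances that were explicitly requested for eviction.
type Queue struct {
	mu      sync.RWMutex
	pending map[QueueKey]struct{}
}

func NewQueue() *Queue { return &Queue{pending: map[QueueKey]struct{}{}} }

// Add records the exact pod instance (including its UID).
func (q *Queue) Add(p Pod) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[NewQueueKey(p)] = struct{}{}
}

// Has reports whether this exact pod instance was queued.
func (q *Queue) Has(p Pod) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	_, ok := q.pending[NewQueueKey(p)]
	return ok
}

// ShouldEvict is called with the pod as currently read from the cache.
// A replacement pod with the same name but a different UID is rejected.
func (q *Queue) ShouldEvict(current Pod) bool { return q.Has(current) }

func main() {
	q := NewQueue()
	old := Pod{Namespace: "default", Name: "web-0", UID: "uid-old"}
	q.Add(old)

	// A replacement pod reuses the namespace and name but gets a new UID.
	replacement := Pod{Namespace: "default", Name: "web-0", UID: "uid-new"}

	fmt.Println(q.ShouldEvict(old))         // true
	fmt.Println(q.ShouldEvict(replacement)) // false
}
```

With a key of only namespace and name, both calls in `main` would return true, which is exactly the wrong-pod eviction described in the comment.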
    pod = ExpectExists(ctx, env.Client, pod)
    Expect(pod.DeletionTimestamp.IsZero()).To(BeFalse())
    ExpectObjectReconciled(ctx, env.Client, queue, pod)
    ExpectDeleted(ctx, env.Client, pod)
Like I mentioned below -- this actually does a deletion, which I don't think is what we were doing before -- is this what we want? Since we aren't really checking anything here after reconciling
    ctx = injection.WithControllerName(ctx, q.Name())

    if !q.Has(pod) {
    //This is a different pod than in the queue, we should exit without evicting
nit: It may be worth adding some more verbose commenting on why we need to handle this case, given that this race condition is a little hard to think about and the impact of not handling this is a little bit on the edge
    ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Pod", klog.KRef(key.Namespace, key.Name)))
    // Evict returns nil if successful eviction call, and an error if there was an eviction-related error
    func (q *Queue) Evict(ctx context.Context, pod *corev1.Pod) error {
    ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Pod", klog.KRef(pod.Namespace, pod.Name)))
This should automatically be injected by the Reconcile() method so there's actually no need to add it to the context here
    return false
    node, err2 := podutils.NodeForPod(ctx, q.kubeClient, pod)
    if err2 != nil {
    log.FromContext(ctx).V(1).Error(err2, "pod has no node")
This will double-log the error -- controller-runtime automatically logs the error if you return an error from the Reconcile by passing it up the stack
I'm returning err not err2 out of the function, so both get logged instead of swallowing err in favor of err2
I see -- that's a good point, maybe an if/else makes this more clear -- that way you would only need a single return statement?
Do we ever want to return err2? I don't think so, as it's not the thing blocking eviction.
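The shape this thread converges on (an if/else around the node lookup, a single return, and only the eviction error handed back to the caller) might look like the sketch below. This is an illustration under assumptions, not the PR's code: `reportEvictionFailure`, its callback parameters, and `errPDBViolation` are hypothetical names, and the callbacks stand in for `podutils.NodeForPod`, the event recorder, and the logger.

```go
package main

import (
	"errors"
	"fmt"
)

// errPDBViolation reuses the error text from the diff; the variable name is hypothetical.
var errPDBViolation = errors.New("evicting pod violates a PDB")

// reportEvictionFailure is a hypothetical helper. It publishes an event on the
// pod's node when the node can be found, logs the lookup error otherwise, and
// always returns the original eviction error through a single return statement.
func reportEvictionFailure(
	evictErr error,
	lookupNode func() (string, error),
	publish func(nodeName string),
	logf func(format string, args ...any),
) error {
	nodeName, lookupErr := lookupNode()
	if lookupErr != nil {
		// Logged here because it is never returned, so nothing upstream would log it.
		logf("pod has no node: %v", lookupErr)
	} else {
		publish(nodeName)
	}
	// The lookup error is not what blocks eviction, so it is never returned.
	return evictErr
}

func main() {
	err := reportEvictionFailure(
		errPDBViolation,
		func() (string, error) { return "", errors.New("node not found") },
		func(nodeName string) { fmt.Println("event published on node", nodeName) },
		func(format string, args ...any) { fmt.Printf(format+"\n", args...) },
	)
	fmt.Println("returned:", err)
}
```

Because the lookup error is logged locally and the eviction error is returned for the caller to log, each error is reported once.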
    It("should delete a pod with less than terminationGracePeriodSeconds remaining before nodeTerminationTime", func() {
    pod.Spec.TerminationGracePeriodSeconds = lo.ToPtr[int64](120)
    // overwrite the node name or the delete does not succeed
    pod.Spec.NodeName = ""
Why does the delete not succeed if we don't override the nodeName with empty?
I'm not exactly sure. I think it has to do with the mock client we use for testing. I've verified that client.Delete() is getting called for the pod, but the pod doesn't actually get deleted. FWIW, before this PR if someone had set the nodename in the test it would have failed in the same way. This PR sets the nodename for the pod at the start of each test.
I'll have the node/pod bound in tests where they need to be, and not have the overwrite here.
jonathan-innis left a comment
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: DerekFrank, jonathan-innis
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing

Fixes #N/A
Description
To improve performance, this change makes the termination controller evict pods on multiple threads.
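As a rough illustration of what evicting pods concurrently means, the sketch below fans eviction calls out to a fixed number of worker goroutines. It does not show how this PR implements it: `evictAll`, `errBlocked`, and the string pod keys are hypothetical, and the `evict` callback stands in for the real eviction API call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBlocked is a hypothetical error for an eviction that cannot proceed yet.
var errBlocked = errors.New("eviction blocked")

// evictAll calls evict for every key using a fixed pool of worker goroutines
// and returns the keys that failed together with their errors.
func evictAll(keys []string, workers int, evict func(key string) error) map[string]error {
	if workers < 1 {
		workers = 1 // at least one worker, otherwise sends on jobs would block forever
	}
	jobs := make(chan string)
	failed := map[string]error{}
	var mu sync.Mutex // guards failed, which several workers write to
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range jobs {
				if err := evict(key); err != nil {
					mu.Lock()
					failed[key] = err
					mu.Unlock()
				}
			}
		}()
	}
	for _, key := range keys {
		jobs <- key
	}
	close(jobs) // lets the workers exit once the channel is drained
	wg.Wait()
	return failed
}

func main() {
	keys := []string{"default/a", "default/b", "default/c"}
	failed := evictAll(keys, 2, func(key string) error {
		if key == "default/b" {
			return errBlocked
		}
		return nil
	})
	fmt.Println(len(failed), failed["default/b"])
}
```

A slow or blocked eviction then ties up one worker instead of stalling every pod behind it in the queue.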
How was this change tested?
make presubmit
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.