
Conversation

@benluddy (Contributor) commented Jul 9, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The kube-apiserver expects to terminate connections itself during graceful shutdown. As soon as kube-apiserver has received SIGTERM, its /readyz endpoint begins serving HTTP 500 responses. To allow time for load balancers to mark it unhealthy, it continues accepting new connections and serving requests on existing connections for a period of time (controlled by the --shutdown-delay-duration option). Once the shutdown delay has elapsed, it stops accepting new requests and drains in-flight requests before exiting.
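
To make the sequencing concrete, here is a toy sketch of that pattern (not kube-apiserver's implementation; the 20s value stands in for --shutdown-delay-duration): on SIGTERM the readiness endpoint flips to 500 immediately, the server keeps serving for the shutdown delay so load balancers can notice, and only then stops accepting new requests and drains in-flight ones.

```go
// Toy illustration of the graceful-shutdown pattern described above.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	shutdownDelay := 20 * time.Second // stand-in for --shutdown-delay-duration
	var terminating atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// Signal load balancers to stop sending new traffic here.
			http.Error(w, "shutting down", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	terminating.Store(true)   // /readyz now returns 500
	time.Sleep(shutdownDelay) // keep serving while load balancers mark this target unhealthy

	// Stop accepting new requests and drain in-flight ones before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("drain did not complete: %v", err)
	}
}
```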

By default, NLBs immediately terminate established connections when a target becomes unhealthy. This causes client-facing disruption for clients connected via NLB to a kube-apiserver instance that is shutting down.
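
On the NLB side, that behavior is controlled by target group attributes. As a rough sketch of the kind of change involved (using the AWS SDK for Go v2 rather than the provider's actual code path; the target group ARN is made up and the values are illustrative, though the attribute keys themselves are AWS's documented NLB target group attributes):

```go
// Sketch: disable immediate connection termination for unhealthy NLB targets and
// allow established connections to drain instead.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	elbv2types "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Hypothetical ARN of the API server target group.
	tgARN := "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/apiserver/0123456789abcdef"

	_, err = client.ModifyTargetGroupAttributes(ctx, &elbv2.ModifyTargetGroupAttributesInput{
		TargetGroupArn: aws.String(tgARN),
		Attributes: []elbv2types.TargetGroupAttribute{
			// Don't reset established connections the moment the target fails health checks...
			{
				Key:   aws.String("target_health_state.unhealthy.connection_termination.enabled"),
				Value: aws.String("false"),
			},
			// ...keep them draining for up to this long, which should comfortably cover the
			// apiserver's --shutdown-delay-duration plus the time to drain in-flight requests.
			{
				Key:   aws.String("target_health_state.unhealthy.draining_interval_seconds"),
				Value: aws.String("300"),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```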

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5475

Special notes for your reviewer:

As mentioned in the issue, OpenShift runs a test that continually performs rolling restarts of kube-apiserver in a three-node HA cluster. Throughout the test, clients make requests to the API and report any errors they observe, including errors that aren't visible to the server.

Here's a timeline of what we currently see:

[image: "before" disruption timeline]

The gray bars represent the period during which a kube-apiserver is shutting down, and each row is a different kube-apiserver instance. The red bars are client-facing errors reported by the polling test client. All of the error samples are `read: connection reset by peer`, occurring approximately 30 seconds after the start of each kube-apiserver shutdown window.

With this patch applied, the polling test client doesn't see any errors:

[image: "after" disruption timeline]

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds or updates unit tests
  • adds or updates e2e tests

Release note:

Enable NLB target group connection draining to allow for graceful shutdown of apiserver processes  

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 9, 2025
@k8s-ci-robot k8s-ci-robot requested review from Ankitasw and cnmcavoy July 9, 2025 15:00
@k8s-ci-robot k8s-ci-robot added needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2025
@nrb (Contributor) commented Jul 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 9, 2025
@benluddy benluddy marked this pull request as ready for review July 9, 2025 15:07
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from AndiDog July 9, 2025 15:07
@nrb (Contributor) commented Jul 9, 2025

Basic test failing on this:

```
E0709 15:15:18.758979   46931 awsmachinepool_controller.go:429] "failed updating instances" err="failed to get node status by provider id: failed to retrieve kubeconfig secret for Cluster /: Secret \"-kubeconfig\" not found" instances=null
```

This change doesn't touch kubeconfigs, so I'll take a look through logs a little later to see if there's something that accidentally changed.

@benluddy (Contributor, Author) commented Jul 9, 2025

> This change doesn't touch kubeconfigs, so I'll take a look through logs a little later to see if there's something that accidentally changed.

Passing now. I'd missed adding the new target group attributes to one mock expectation.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 15, 2025
@damdo (Member) left a comment

Thanks!

/lgtm

/assign @nrb @richardcase @dlipovetsky @AndiDog

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 15, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 88d7446cec367dd25b9c6e8450bde290f1a64b10

@AndiDog (Contributor) commented Jul 15, 2025

/lgtm

Fine to approve this. Just one question: any concerns with the (maximum) 300 seconds delay that we get with this?

@damdo (Member) commented Jul 15, 2025

> Fine to approve this. Just one question: any concerns with the (maximum) 300 seconds delay that we get with this?

@AndiDog at the moment the delay is zero, so all connections are terminated immediately.

My understanding is that with the new change, if there are no connections left, the delay is still effectively zero. If some remain, the NLB waits up to 300 seconds for them to go away before eventually killing them.

I don't think there should be concerns with adding more delay. If anything, it improves things: client connections to the APIServer aren't disrupted until they are closed client-side or eventually killed.

If you are concerned about 300s being too low for certain use cases, I guess we could make this configurable in the future.
For now I think this is a good short/medium-term improvement.

@benluddy (Contributor, Author) commented

> any concerns with the (maximum) 300 seconds delay that we get with this?

Thinking it through...

We'd expect to see this when an apiserver is continuing to serve, but returning 500 from /readyz. Typically we've seen this due to etcd probe failures, and the etcd probe failures have almost always been due to saturation affecting the entire etcd cluster (and so affecting all apiservers simultaneously) rather than due to connectivity issues between a single apiserver and all etcd members. In that case, migrating clients to other apiservers doesn't help.

In cases where apiserver traffic is black-holed, Kube's client-go HTTP/2 clients send a ping frame if they haven't received a frame for 30s and close the connection if the ping goes unacked for 15s, so they should have a 45s upper bound. HTTP/1 clients close a connection if they time out waiting for a response, so they shouldn't see more than one error per established connection (the same worst-case number of client-perceptible errors). Client-go also closes idle connections after about 90s. In either case, the NLB health checks and established clients will probably lose contact with the target at approximately the same time, so the client-perceived time to close one of these connections will overlap with the time it takes for the NLB to notice that the target is unhealthy.
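
For reference, those HTTP/2 numbers correspond to Go's HTTP/2 transport health-check settings, which client-go enables by default as far as I know. A minimal sketch of configuring them on a plain transport (not client-go's actual wiring):

```go
// Sketch: HTTP/2 connection health checking. If no frame arrives for ReadIdleTimeout,
// the transport sends a ping; if the ping isn't acked within PingTimeout, the
// connection is closed. 30s + 15s gives the ~45s upper bound described above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	tr := &http.Transport{
		TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		panic(err)
	}
	h2.ReadIdleTimeout = 30 * time.Second // ping after 30s without any frames
	h2.PingTimeout = 15 * time.Second     // close the connection if the ping isn't acked in 15s

	client := &http.Client{Transport: tr}
	fmt.Printf("HTTP/2 keepalive pings configured on %T\n", client.Transport)
}
```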

@nrb (Contributor) commented Jul 15, 2025

/approve

but

/hold

Just to be sure @AndiDog's satisfied with the answers.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 15, 2025
@AndiDog (Contributor) commented Jul 16, 2025

/approve
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndiDog, nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0006312 into kubernetes-sigs:main Jul 16, 2025
29 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v2.8 milestone Jul 16, 2025

Development

Successfully merging this pull request may close these issues.

Kubernetes API server rolling restarts experience client-side disruption
