
Conversation

@benluddy (Contributor) commented Jul 9, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The kube-apiserver expects to terminate connections itself during graceful shutdown. As soon as kube-apiserver has received SIGTERM, its /readyz endpoint begins serving HTTP 500 responses. To allow time for load balancers to mark it unhealthy, it continues accepting new connections and serving requests on existing connections for a period of time (controlled by the --shutdown-delay-duration option). Once the shutdown delay has elapsed, it stops accepting new requests and drains in-flight requests before exiting.
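
To make the sequencing concrete, here is a toy sketch of that pattern (not kube-apiserver's implementation; the 20s value stands in for --shutdown-delay-duration): on SIGTERM the readiness endpoint flips to 500 immediately, the server keeps serving for the shutdown delay so load balancers can notice, and only then stops accepting new requests and drains in-flight ones.

```go
// Toy illustration of the graceful-shutdown pattern described above.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	shutdownDelay := 20 * time.Second // stand-in for --shutdown-delay-duration
	var terminating atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// Signal load balancers to stop sending new traffic here.
			http.Error(w, "shutting down", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	terminating.Store(true)   // /readyz now returns 500
	time.Sleep(shutdownDelay) // keep serving while load balancers mark this target unhealthy

	// Stop accepting new requests and drain in-flight ones before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("drain did not complete: %v", err)
	}
}
```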

By default, NLBs immediately terminate established connections when a target becomes unhealthy. This causes client-facing disruption for clients connected via NLB to a kube-apiserver instance that is shutting down.
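
On the NLB side, that behavior is controlled by target group attributes. As a rough sketch of the kind of change involved (using the AWS SDK for Go v2 rather than the provider's actual code path; the target group ARN is made up and the values are illustrative, though the attribute keys themselves are AWS's documented NLB target group attributes):

```go
// Sketch: disable immediate connection termination for unhealthy NLB targets and
// allow established connections to drain instead.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	elbv2types "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Hypothetical ARN of the API server target group.
	tgARN := "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/apiserver/0123456789abcdef"

	_, err = client.ModifyTargetGroupAttributes(ctx, &elbv2.ModifyTargetGroupAttributesInput{
		TargetGroupArn: aws.String(tgARN),
		Attributes: []elbv2types.TargetGroupAttribute{
			// Don't reset established connections the moment the target fails health checks...
			{
				Key:   aws.String("target_health_state.unhealthy.connection_termination.enabled"),
				Value: aws.String("false"),
			},
			// ...keep them draining for up to this long, which should comfortably cover the
			// apiserver's --shutdown-delay-duration plus the time to drain in-flight requests.
			{
				Key:   aws.String("target_health_state.unhealthy.draining_interval_seconds"),
				Value: aws.String("300"),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```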

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5475

Special notes for your reviewer:

As mentioned in the issue, OpenShift runs a test that continually performs rolling restarts of kube-apiserver in a three-node HA cluster. Throughout the test, clients make requests to the API and report any errors they observe, including errors that aren't visible to the server.

Here's a timeline of what we currently see:

[image: "before" disruption timeline]

The gray bars represent the period during which a kube-apiserver is shutting down, and each row is a different kube-apiserver instance. The red bars are client-facing errors reported by the polling test client. All of the error samples are `read: connection reset by peer`, occurring approximately 30 seconds after the start of each kube-apiserver shutdown window.

With this patch applied, the polling test client doesn't see any errors:

[image: "after" disruption timeline]

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds or updates unit tests
  • adds or updates e2e tests

Release note:

Enable NLB target group connection draining to allow for graceful shutdown of apiserver processes  

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 9, 2025
@k8s-ci-robot k8s-ci-robot requested review from Ankitasw and cnmcavoy July 9, 2025 15:00
@k8s-ci-robot k8s-ci-robot added needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 9, 2025
@nrb (Contributor) commented Jul 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jul 9, 2025
@benluddy benluddy marked this pull request as ready for review July 9, 2025 15:07
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 9, 2025
@k8s-ci-robot k8s-ci-robot requested a review from AndiDog July 9, 2025 15:07
@nrb (Contributor) commented Jul 9, 2025

Basic test failing on this:

```
E0709 15:15:18.758979   46931 awsmachinepool_controller.go:429] "failed updating instances" err="failed to get node status by provider id: failed to retrieve kubeconfig secret for Cluster /: Secret \"-kubeconfig\" not found" instances=null
```

This change doesn't touch kubeconfigs, so I'll take a look through logs a little later to see if there's something that accidentally changed.

@benluddy (Contributor, Author) commented Jul 9, 2025

> This change doesn't touch kubeconfigs, so I'll take a look through logs a little later to see if there's something that accidentally changed.

Passing now. I'd missed adding the new target group attributes to one mock expectation.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 15, 2025
@damdo (Member) left a comment

Thanks!

/lgtm

/assign @nrb @richardcase @dlipovetsky @AndiDog

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 15, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: 88d7446cec367dd25b9c6e8450bde290f1a64b10

@AndiDog (Contributor) commented Jul 15, 2025

/lgtm

Fine to approve this. Just one question: any concerns with the (maximum) 300 seconds delay that we get with this?

@damdo (Member) commented Jul 15, 2025

> Fine to approve this. Just one question: any concerns with the (maximum) 300 seconds delay that we get with this?

@AndiDog at the moment the delay is zero, so all connections are terminated immediately.

My understanding is that with the new change, if there are no connections left, the delay is still effectively zero. If some remain, the NLB waits up to 300 seconds for them to go away before eventually killing them.

I don't think there should be concerns with adding more delay. If anything, it improves things: client connections to the APIServer aren't disrupted until they are closed client-side or eventually killed.

If you are concerned about 300s being too low for certain use cases, I guess we could make this configurable in the future.
For now I think this is a good short/medium-term improvement.

@benluddy (Contributor, Author) commented

> any concerns with the (maximum) 300 seconds delay that we get with this?

Thinking it through...

We'd expect to see this when an apiserver is continuing to serve, but returning 500 from /readyz. Typically we've seen this due to etcd probe failures, and the etcd probe failures have almost always been due to saturation affecting the entire etcd cluster (and so affecting all apiservers simultaneously) rather than due to connectivity issues between a single apiserver and all etcd members. In that case, migrating clients to other apiservers doesn't help.

In cases where apiserver traffic is black-holed, Kube's client-go HTTP/2 clients send a ping frame if they haven't received a frame for 30s and close the connection if the ping goes unacked for 15s, so they should have a 45s upper bound. HTTP/1 clients close a connection if they time out waiting for a response, so they shouldn't see more than one error per established connection (the same worst-case number of client-perceptible errors). Client-go also closes idle connections after about 90s. In either case, the NLB health checks and established clients will probably lose contact with the target at approximately the same time, so the client-perceived time to close one of these connections will overlap with the time it takes for the NLB to notice that the target is unhealthy.
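
For reference, those HTTP/2 numbers correspond to Go's HTTP/2 transport health-check settings, which client-go enables by default as far as I know. A minimal sketch of configuring them on a plain transport (not client-go's actual wiring):

```go
// Sketch: HTTP/2 connection health checking. If no frame arrives for ReadIdleTimeout,
// the transport sends a ping; if the ping isn't acked within PingTimeout, the
// connection is closed. 30s + 15s gives the ~45s upper bound described above.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	tr := &http.Transport{
		TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		panic(err)
	}
	h2.ReadIdleTimeout = 30 * time.Second // ping after 30s without any frames
	h2.PingTimeout = 15 * time.Second     // close the connection if the ping isn't acked in 15s

	client := &http.Client{Transport: tr}
	fmt.Printf("HTTP/2 keepalive pings configured on %T\n", client.Transport)
}
```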

@nrb (Contributor) commented Jul 15, 2025

/approve

but

/hold

Just to be sure @AndiDog's satisfied with the answers.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 15, 2025
@AndiDog (Contributor) commented Jul 16, 2025

/approve
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndiDog, nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 0006312 into kubernetes-sigs:main Jul 16, 2025
29 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v2.8 milestone Jul 16, 2025

Development

Successfully merging this pull request may close these issues.

Kubernetes API server rolling restarts experience client-side disruption
