🐛 Enable NLB connection draining for graceful apiserver shutdown. #5589
Conversation
Skipping CI for Draft Pull Request.
/ok-to-test
Basic test failing on this:
This change doesn't touch kubeconfigs, so I'll take a look through logs a little later to see if there's something that accidentally changed.
Passing now. I'd missed adding the new target group attributes to one mock expectation.
LGTM label has been added. Git tree hash: 88d7446cec367dd25b9c6e8450bde290f1a64b10
/lgtm Fine to approve this. Just one question: any concerns with the (maximum) 300-second delay that we get with this?
@AndiDog at the moment it is zero, so all connections are immediately terminated. My understanding is that with the new change, if there are no connections left, the delay will still be zero. If there are some, we'll wait up to 300 seconds for them to go away and eventually kill them. I don't think there should be concerns with adding more delay; if anything, it keeps client connections towards the apiserver from being disrupted until they're closed client-side or eventually killed. If you're concerned about 300s being too low for certain use cases(?), I guess we could make this configurable in the future.
Thinking it through... We'd expect to see this when an apiserver is continuing to serve but returning 500 from /readyz. Typically we've seen this due to etcd probe failures, and the etcd probe failures have almost always been due to saturation affecting the entire etcd cluster (and so affecting all apiservers simultaneously) rather than due to connectivity issues between a single apiserver and all etcd members. In that case, migrating clients to other apiservers doesn't help.

In cases where apiserver traffic is black-holed, Kube's client-go HTTP/2 clients will send ping frames if they haven't received a frame for 30s and will close the connection if a ping goes unacked for 15s, so they should have a 45s upper bound. HTTP/1 clients will close a connection if they time out waiting for a response, so they shouldn't see more than one error per established connection (the same worst-case number of client-perceptible errors). Client-go closes idle connections after about 90s. In either case, the NLB health checks and the established clients will probably lose contact with the target at approximately the same time, so the client-perceived time to close one of these connections will overlap with the time it takes the NLB to notice that the target is unhealthy.
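For reference, the HTTP/2 connection health-check behavior described above maps onto the ReadIdleTimeout and PingTimeout knobs exposed by golang.org/x/net/http2. Below is a minimal sketch of a client configured with the 30s/15s/90s values quoted above; the values and the newPingingClient helper are illustrative, not client-go's actual construction code.

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingingClient returns an HTTP client whose HTTP/2 connections send a
// ping frame after 30s without receiving any frame and close the connection
// if that ping goes unanswered for 15s. Idle connections are dropped after 90s.
// The values mirror the behavior described in the comment above and are
// illustrative, not copied from client-go.
func newPingingClient() (*http.Client, error) {
	t1 := &http.Transport{
		IdleConnTimeout: 90 * time.Second, // close idle connections after ~90s
	}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 30 * time.Second // ping if no frame has been received for 30s
	t2.PingTimeout = 15 * time.Second     // close the connection if the ping is unacked for 15s
	return &http.Client{Transport: t1}, nil
}

func main() {
	client, err := newPingingClient()
	if err != nil {
		panic(err)
	}
	_ = client // use client.Get(...) etc. against the apiserver endpoint
}
```

With settings like these, a black-holed connection is detected after at most ReadIdleTimeout + PingTimeout = 45s, which is where the 45s upper bound above comes from.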
/approve but /hold Just to be sure @AndiDog's satisfied with the answers.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: AndiDog, nrb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
The kube-apiserver expects to terminate connections itself during graceful shutdown. As soon as kube-apiserver has received SIGTERM, its /readyz endpoint begins serving HTTP 500 responses. To allow time for load balancers to mark it unhealthy, it continues accepting new connections and serving requests on existing connections for a period of time (controlled by the --shutdown-delay-duration option). Once the shutdown delay has elapsed, it stops accepting new requests and drains in-flight requests before exiting.
By default, NLBs immediately terminate established connections when a target becomes unhealthy. This causes client-facing disruption for clients connected via NLB to a kube-apiserver instance that is shutting down.
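The diff itself isn't shown in this thread, so the exact attributes and values are an assumption, but based on the 300-second discussion above the fix amounts to disabling the NLB's immediate termination of connections to unhealthy targets and giving established connections a draining interval instead. A standalone sketch of setting those ELBv2 target group attributes with the AWS SDK for Go v2 follows; the target group ARN and the 300-second value are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	"github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Hypothetical ARN of the apiserver target group.
	targetGroupARN := "arn:aws:elasticloadbalancing:...:targetgroup/apiserver/..."

	// Stop the NLB from resetting established connections the moment the
	// target fails health checks, and let in-flight connections drain for up
	// to 300 seconds instead. The attribute keys are the ELBv2 target group
	// attributes that control unhealthy-target connection termination; the
	// 300s value is assumed from the discussion above.
	_, err = client.ModifyTargetGroupAttributes(ctx, &elbv2.ModifyTargetGroupAttributesInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Attributes: []types.TargetGroupAttribute{
			{
				Key:   aws.String("target_health_state.unhealthy.connection_termination.enabled"),
				Value: aws.String("false"),
			},
			{
				Key:   aws.String("target_health_state.unhealthy.draining_interval_seconds"),
				Value: aws.String("300"),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```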
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5475
Special notes for your reviewer:
As mentioned in the issue, OpenShift runs a test that continually performs rolling restarts of kube-apiserver in a 3-node HA cluster. Throughout the test, clients make requests to the API and report any errors they observe. These reports include errors that aren't visible to the server.
Here's a timeline of what we currently see:
The gray bars represent the period during which a kube-apiserver is shutting down, and each row is a different kube-apiserver instance. The red bars are client-facing errors reported by the polling test client. All of the error samples are
read: connection reset by peer
occurring approximately 30 seconds after the start of each kube-apiserver shutdown window.

With this patch applied, the polling test client doesn't see any errors:
Checklist:
adds/updates unit tests
Release note: