Wrap http transport with retry in kube proxy to handle GOAWAY.#57881
Conversation
Force-pushed from 86c5f51 to 0a06ea2
…o implement unit test.
```go
// transportWrapper wraps an http.RoundTripper and ensures requests with bodies
// are retryable by setting GetBody when possible. This allows the underlying
// transport (typically http2.Transport) to automatically retry on GOAWAY.
type transportWrapper struct {
```
we should move this under the forward library so that if App Access needs to use it, it's available there and can be enabled. I didn't enable it there because I was afraid of breaking anything without running the test plan.
While there may be a case for making it widely available under the forward lib, here we discriminate on which endpoints get the retryable handling, which is quite specific to kube proxy. To move it somewhere else we'd need to add an option to pass a function that checks whether or not to apply the logic, which is a lot of overhead.
Let's start simple and iterate here and keep this limited to Kubernetes for now.
Force-pushed from a566bdb to 6ecb987
```go
// transportWrapper wraps an http.RoundTripper and ensures requests with bodies
// are retryable by setting GetBody when possible. This allows the underlying
// transport (typically http2.Transport) to automatically retry on GOAWAY.
type transportWrapper struct {
```
Suggestion: Rename this to something less generic and more descriptive.
```go
// are retryable by setting GetBody when possible. This allows the underlying
// transport (typically http2.Transport) to automatically retry on GOAWAY.
type transportWrapper struct {
	http.RoundTripper
```
Suggestion: Remove the embedding if possible.
```go
// Check for log streaming / watch operations. Those can be HTTP/2 and susceptible to the GOAWAY issue;
// however, as they are long-running, the only way to recover would be to dump the body to disk,
// resulting in concerns about infinitely growing files, and privacy/security as we deal with unencrypted
// traffic and we don't want to store that somewhere, even if it is a tmpfs.
```
As discussed offline, is this accurate? Aren't we storing the request body, and not the response body? Why would this be any less of a problem for requests that are not for /log, or for watches?
All endpoints - watch requests included - suffer from this issue. Only HTTP/1 requests (such as WebSocket or SPDY) do not, and a simple protocol check avoids the problem.
In the Kubernetes codebase, every other type of request is required to be able to replay the request body - for them it's not a problem, because request bodies are a bytes.Buffer and the HTTP client already populates GetBody after a type assertion.
Basically, Kubernetes added this feature for watch requests.
I was asking why the comment indicated that logs and watches needed special handling. If we are only caching the request and not the response, then does it really matter that the response for a long-lived request could be very large? Shouldn't this approach work for all requests?
Let me explain what I meant.
The reason goaway-chance was introduced in Kubernetes was to close long-lived connections to the same Kubernetes API server and balance the incoming connections across all replicas - it was designed specifically for watch streams.
Looking at the Kubernetes implementation, you can see that only HTTP/2 requests might receive the GOAWAY connection close. This means that any HTTP/1 request won't receive the connection close header.
This HTTP/2 filter excludes long-lived bidirectional connections such as kubectl exec, kubectl port-forward, and kubectl cp, where the user can send very large payloads, since all of these use HTTP/1 + WebSocket/SPDY.
Long-lived read-only streams - such as watch or log streams - aren't affected either, because the Go HTTP/2 client only replays the request body, meaning we only need to record the request body and not the response. This makes our life simpler, given that Teleport only needs to be able to reset the request body and never the response body. Request bodies in Kubernetes are far smaller than responses for every endpoint apart from the bidirectional streams mentioned above.
In theory, we can record the body into memory and retry - the Go HTTP/2 client already does that automatically if you provide a GetBody function. In practice we can set a threshold of 1 MB that goes into memory and, if the body is bigger, write it to a temp file.
I still think the approach that this PR tries to implement isn't the best - loading the body in advance, trimming the body... - and we can definitely improve it. This PR also covers HTTP/1 requests, which will cause some problems, especially with connection upgrade mechanisms.
This is an attempt to fix #57766. When a request is terminated because the upstream Kubernetes API Server's GOAWAY chance is exceeded, clients are informed to retry by replying with a 429 status code and a Retry-After header. This deviates from the approaches taken in #57881 and #60695 to favor simplicity and avoid buffering request data in a Teleport process. The downside to this approach is that it requires clients to properly handle retry responses.
* Kubernetes: Handle GOAWAY requests. This is an attempt to address #57766. When a request is terminated because the upstream Kubernetes API Server's GOAWAY chance is exceeded, clients are informed to retry by replying with a 429 status code and a Retry-After header. This deviates from the approaches taken in #57881 and #60695 to favor simplicity and avoid buffering request data in a Teleport process. The downside to this approach is that it requires clients to properly handle retry responses.
* Populate GOAWAY response body (#61264). Follow-up to #61142, which sets the response body so that clients that only look at the reason and not the headers will behave appropriately.
Fixes #57766
Adding a test is quite tricky: it needs a Kube server with the --goaway-chance flag, but that flag is limited to 2% max.
Go script to reproduce the issue / confirm the fix
Kind config with goaway-chance
changelog: Add retry logic in the kube proxy to handle the EKS GOAWAY chance