kube: handle net/http rewind body GOAWAY error #66605

Merged: jakealti merged 4 commits into master from jakealti/goaway on May 11, 2026

@jakealti jakealti commented May 9, 2026

Summary

Follow-up to #61142, which translated GOAWAY-related errors from the upstream Kubernetes API server into 429 Too Many Requests responses so kube clients retry. That fix matched the error string emitted by golang.org/x/net/http2. After it shipped, #65611 reported that v17.7.21 still leaks a different error string from the same root cause:

net/http: cannot rewind body after connection loss

This PR extends the detection in formatStatusResponseError to also match that string and produce the same 429 response.

Why two strings, same cause

When an HTTP/2 GOAWAY interrupts an in-flight request whose body has already been read, the Go transport may attempt to retry on a new connection. Without Request.GetBody, the retry can't replay the body. Two layers can hit this dead end:

  • golang.org/x/net/http2 returns http2: Transport: cannot retry err [...] after Request.Body was written; define Request.GetBody to avoid this error — already handled.
  • net/http returns net/http: cannot rewind body after connection loss from errCannotRewind when the http1 retry path fires after the http2 conn pool is drained. This happens in concurrent-request scenarios (e.g. helmwave) and is what this PR adds.

The kube agent intentionally does not buffer the request body, so GetBody is unset; previous PRs (#57881, #60695) explored buffering and rejected it. Detection by literal-string match continues to be the only available mechanism — both stdlib sentinels are unexported.

Tests

Three tests now cover GOAWAY handling, complementary in scope:

  • TestGOAWAYHandling (existing): real *http2.Transport against the fake goawayServer. Single request, error A end-to-end.
  • TestKubeForwarder_GOAWAYErrors (new): stub http.RoundTripper plugged into staticKubeCreds.transport, deterministic table over both error strings, exercises the full reverse-proxy error pipeline (formatForwardResponseError -> formatStatusResponseError -> response writer).
  • TestGOAWAYHandling_Concurrent (new): production newH2Transport (net/http.Transport upgraded with http2.ConfigureTransport) against the fake goawayServer. Each invocation fires 50 concurrent requests with a 64KB body to provoke the GOAWAY race. The assertion is narrow: the rewind-body string must never reach a client — other unrelated 500s (broken pipe, force-closed conn, etc.) are tolerated. Without this PR's change the test fails reliably; with it, the test stays green across 100 sequential runs under -count=100 -race -shuffle on (matches the flaky-test detector's CI configuration).

Test plan

  • go test ./lib/kube/proxy/ -run 'TestKubeForwarder_GOAWAYErrors|TestGOAWAYHandling$|TestGOAWAYHandling_Concurrent' -count=100 -race -shuffle on
  • go test ./lib/kube/proxy/... (full package)

Notes on scope

Concurrent reproduction surfaced other GOAWAY-adjacent transport errors (broken pipe, connection reset by peer, client connection force closed, ReverseProxy does an invalid Read on closed Body) that this PR does not translate — they're separate bugs the existing fix doesn't address either. The TestGOAWAYHandling_Concurrent assertion is intentionally narrow: only the rewind-body string is required to stay off the wire.
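For illustration, the narrow assertion could look roughly like this. It is a sketch under assumptions: `collectBodies` and `leakedRewindErr` are hypothetical names, and the httptest server here just returns an unrelated 500 rather than sending GOAWAY like the fake goawayServer does.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

const rewindMsg = "net/http: cannot rewind body after connection loss"

// collectBodies fires n concurrent POSTs with a payload of the given size
// and returns each response body, or the client error string on failure.
func collectBodies(url string, n, size int) []string {
	payload := bytes.Repeat([]byte("x"), size)
	out := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(payload))
			if err != nil {
				out[i] = err.Error()
				return
			}
			defer resp.Body.Close()
			b, _ := io.ReadAll(resp.Body)
			out[i] = string(b)
		}(i)
	}
	wg.Wait()
	return out
}

// leakedRewindErr is the narrow check: only the rewind-body string counts
// as a failure; any other error text is tolerated.
func leakedRewindErr(bodies []string) bool {
	for _, b := range bodies {
		if strings.Contains(b, rewindMsg) {
			return true
		}
	}
	return false
}

func main() {
	// Stand-in server (assumption): always answers with an unrelated 500.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(io.Discard, r.Body)
		http.Error(w, "write: broken pipe", http.StatusInternalServerError)
	}))
	defer srv.Close()

	bodies := collectBodies(srv.URL, 50, 64<<10) // 50 requests, 64KB each
	fmt.Println("rewind-body string leaked:", leakedRewindErr(bodies))
}
```

Checking for one specific string, instead of requiring every response to succeed, keeps the test from failing on the other transport errors listed above.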

Closes #65611

@jakealti jakealti added the no-changelog (Indicates that a PR does not require a changelog entry) and no-test-plan (Bypasses the test plan validation bot) labels May 9, 2026
@jakealti jakealti requested review from rosstimothy and tigrato May 9, 2026 23:07
@jakealti jakealti marked this pull request as ready for review May 9, 2026 23:07
@github-actions github-actions Bot requested review from Joerger and avatus May 9, 2026 23:08
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 079279cb2d

Comment thread lib/kube/proxy/forwarder.go Outdated
isHTTP2RetryErr := strings.Contains(errString, `http2: Transport: cannot retry err`) &&
	strings.HasSuffix(errString, `after Request.Body was written; define Request.GetBody to avoid this error`)
isHTTP1RewindErr := strings.Contains(errString, `net/http: cannot rewind body after connection loss`)
if isHTTP2RetryErr || isHTTP1RewindErr {
Contributor

We can fix this problem at least for create operations because we read the object payload into memory to ensure the user can actually see the created object.

// extractResourceNameFromPostRequest extracts the resource name from a POST body.
// It reads the full body - required because data can be proto encoded -
// and decodes it into a Kubernetes object. It then extracts the resource name
// from the object.
// The body is then reset to the original request body using a new buffer.
func extractResourceNameFromPostRequest(
	req *http.Request,
	codecs *serializer.CodecFactory,
	defaults *schema.GroupVersionKind,
) (string, error) {
	if req.Body == nil {
		return "", trace.BadParameter("request body is empty")
	}
	negotiator := newClientNegotiator(codecs)
	_, decoder, err := newEncoderAndDecoderForContentType(
		responsewriters.GetContentTypeHeader(req.Header),
		negotiator,
	)
	if err != nil {
		return "", trace.Wrap(err)
	}
	newBody := bytes.NewBuffer(make([]byte, 0, 2048))
	if _, err := io.Copy(newBody, req.Body); err != nil {
		return "", trace.Wrap(err)
	}
	if err := req.Body.Close(); err != nil {
		return "", trace.Wrap(err)
	}
	req.Body = io.NopCloser(newBody)
	// decode memory rw body.
	obj, err := decodeAndSetGVK(decoder, newBody.Bytes(), defaults)
	if err != nil {
		return "", trace.Wrap(err)
	}
	namer, ok := obj.(kubeObjectInterface)
	if !ok {
		return "", trace.BadParameter("object %T does not implement kubeObjectInterface", obj)
	}
	return namer.GetName(), nil
}

For these requests, we can set Request.GetBody.
For patch/update, we can do the same but people didn't like

Contributor Author

I've added handling for create requests in 1bf7429

@jakealti jakealti requested a review from tigrato May 11, 2026 15:37
Contributor

@tigrato tigrato left a comment

thank you!

@jakealti jakealti enabled auto-merge May 11, 2026 15:45
@jakealti jakealti added this pull request to the merge queue May 11, 2026
Merged via the queue into master with commit 6b12b89 May 11, 2026
47 checks passed
@jakealti jakealti deleted the jakealti/goaway branch May 11, 2026 16:13
@backport-bot-workflows
Contributor

@jakealti See the table below for backport results.

Branch      Result
branch/v17  Create PR
branch/v18  Create PR

Labels

backport/branch/v17, backport/branch/v18, kubernetes-access, no-changelog (Indicates that a PR does not require a changelog entry), no-test-plan (Bypasses the test plan validation bot), size/md

Development

Successfully merging this pull request may close these issues.

Get "cannot rewind body after connection loss" after enabling GOAWAY

3 participants