
client-go: refactor retry logic so it can be reused #102217

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from tkashem:client-go-refactor-retry
May 24, 2021

Conversation


@tkashem tkashem commented May 21, 2021

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Refactor the retry logic into a reusable unit so we can also use it for Watch and Stream.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2021
@tkashem tkashem force-pushed the client-go-refactor-retry branch from c4edb1e to 73508e1 Compare May 21, 2021 21:27

tkashem commented May 22, 2021

e2e suite: [sig-node] Probing container should be ready immediately after startupProbe succeeds

/test pull-kubernetes-e2e-kind

@tkashem tkashem force-pushed the client-go-refactor-retry branch 3 times, most recently from 729bd48 to 421dc61 Compare May 24, 2021 02:05

tkashem commented May 24, 2021

Refactored the retry logic into its own reusable unit, please review.

/assign @wojtek-t @deads2k

@wojtek-t wojtek-t left a comment

This looks great - just a couple small comments.

Member

nit:

...) (*RetryAfter, bool)

Member

maxRetries is always 1 in all cases?

Please either hardcode it here or add (if it makes sense - those cases will be tested below) cases that exercise it.

Contributor Author

Yes, maxRetries is always 1 in this test; I hard-coded it for now.

Member

Seems like a complete copy-paste of the above.

Unless I'm missing something, please extract that to a separate function and do:

func TestRequestDoRawWithRetry(t *testing.T) {
  newFunc(t, func(ctx context.Context, r *Request) { r.DoRaw(ctx) })
}

[same for Do above]

Contributor Author

Good catch; I am hoping I can reuse it for Watch and Stream in the follow-up PR.

@tkashem tkashem force-pushed the client-go-refactor-retry branch from 421dc61 to 5fdf196 Compare May 24, 2021 13:41

tkashem commented May 24, 2021

@wojtek-t comments addressed, I also added noSleepBackOff to make the unit test execution faster.

// noSleepBackOff is a NoBackoff except it does not sleep,
// used for faster execution of the unit tests.
type noSleepBackOff struct {
	*NoBackoff
}

func (n *noSleepBackOff) Sleep(d time.Duration) {}
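To illustrate how a no-op Sleep keeps retry tests fast, here is a minimal self-contained sketch; the `backoff` interface, `retryWith`, and the type names below are hypothetical stand-ins for illustration, not client-go's actual types:

```go
package main

import (
	"errors"
	"time"
)

var errTransient = errors.New("transient")

// backoff mirrors the Sleep-based contract the retry logic depends on
// (hypothetical interface for illustration).
type backoff interface {
	Sleep(d time.Duration)
}

// realBackoff actually sleeps between attempts.
type realBackoff struct{}

func (realBackoff) Sleep(d time.Duration) { time.Sleep(d) }

// noSleepBackoff satisfies the same contract but returns immediately,
// so unit tests that exercise the retry path do not wait.
type noSleepBackoff struct{}

func (noSleepBackoff) Sleep(time.Duration) {}

// retryWith runs fn up to maxRetries+1 times, sleeping via b between
// failed attempts; tests inject noSleepBackoff to run instantly.
func retryWith(b backoff, maxRetries int, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		b.Sleep(time.Second)
	}
	return err
}
```

Because the retry loop only talks to the interface, swapping in the no-sleep implementation changes test runtime without touching the logic under test.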

@wojtek-t
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2021

tkashem commented May 24, 2021

e2e suite: [sig-instrumentation] MetricsGrabber should grab all metrics from a Scheduler.

/test pull-kubernetes-e2e-gce-ubuntu-containerd


tkashem commented May 24, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@k8s-ci-robot k8s-ci-robot merged commit a898587 into kubernetes:master May 24, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 24, 2021
@fedebongio
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 25, 2021
func checkWait(resp *http.Response) (int, bool) {
	switch r := resp.StatusCode; {
	// any 500 error code and 429 can trigger a wait
	case r == http.StatusTooManyRequests, r >= 500:
Contributor

@deads2k deads2k May 26, 2021

Pushing this case into a generic NextRetry makes it error-prone. I don't see a check of the request verb to ensure it's a GET. I can accept that a GET should be retriable on a 5xx-series error, but other HTTP verbs cannot automatically be retried on 5xx in all cases. Some action may have been taken before the failure that would make an auto-replay unpredictable. I could probably buy 503, given its usage by the aggregator for "you didn't get to your endpoint".

I think if you split the check for GET requests from non-GETs, you can keep the generic NextRetry function.
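One way the GET/non-GET split suggested above could look, as a hedged sketch; `isIdempotentVerb` and `canRetryTransient` are illustrative names, not client-go's actual API:

```go
package main

import "net/http"

// isIdempotentVerb reports whether a verb can safely be replayed after a
// transient failure. Per the discussion above, only GET is treated as
// automatically retriable (hypothetical helper for illustration).
func isIdempotentVerb(verb string) bool {
	return verb == "GET"
}

// canRetryTransient sketches the split: idempotent verbs may be retried on
// 429 or any 5xx; other verbs would need an explicit server signal such as
// a Retry-After header (not modeled here).
func canRetryTransient(verb string, statusCode int) bool {
	if statusCode != http.StatusTooManyRequests && statusCode < 500 {
		return false
	}
	return isIdempotentVerb(verb)
}
```

Keeping the verb check in one small predicate is what lets the surrounding NextRetry function stay generic.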

Contributor Author

It would retry only if StatusCode == 429 or StatusCode >= 500 and the response has a Retry-After header with an integer value.

This was resolved in kube slack - https://kubernetes.slack.com/archives/C0EG7JC6T/p1622037135027600

	resp.Body.Close()
}()

retries++
Contributor

This change broke the thread safety of *Request. Previously, all methods on *Request tracked retry state in function-local variables (so different goroutines could execute the same request independently). After this change, we can't. I don't know that this is related to #109050, but we should not have made this change (we previously guaranteed the thread safety of methods called on Request).

Member

@smarterclayton - can you clarify this one? I think I'm not following this comment.

The Request itself doesn't seem to claim to be thread-safe anywhere (and I'm not entirely sure what it would really mean to call different methods on a request from different threads). There isn't any locking anywhere here, etc. either.

Maybe I missed something important in the review, but with this comment I still don't understand what that was.

Contributor Author

@tkashem tkashem Mar 29, 2022

we previously guaranteed the thread safety of methods called on Request

@smarterclayton I am not sure how that would make sense semantically (multiple goroutines accessing a Request instance concurrently), but I can make retry a local variable with a factory function; I am working on a PR now.
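A minimal sketch of the factory-function approach mentioned here (`retryState` and `newRetryFn` are illustrative names, not the eventual PR's code): each call produces its own retry counter, so concurrent executions of the same request no longer share mutable state.

```go
package main

// retryState holds per-invocation retry bookkeeping. Keeping it out of the
// shared Request object avoids data races when the same request is executed
// from multiple goroutines.
type retryState struct {
	attempts   int
	maxRetries int
}

// next reports whether another attempt is allowed, consuming one retry.
func (r *retryState) next() bool {
	if r.attempts >= r.maxRetries {
		return false
	}
	r.attempts++
	return true
}

// newRetryFn is the factory function: each call returns a closure over a
// fresh, independent retryState.
func newRetryFn(maxRetries int) func() bool {
	state := &retryState{maxRetries: maxRetries}
	return state.next
}
```

Two goroutines that each call the factory get disjoint state, which is the property the shared-field version lost.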


Contributor

@smarterclayton smarterclayton Mar 29, 2022

You can call Stream() twice concurrently on a Request and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).

Contributor

I'm fine with an approach that has a helper object within the methods (vs on the request) as long as it's reviewable.

Member

You can call Stream() twice concurrently on a Request() and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).

Dohh...
I'm not saying that we are allowed to break it (I agree with you that we should fix it), but what was the reason for doing that?

Contributor

Stream is a bit more questionable; reusing a LIST or GET request as a closure in some machinery would be more likely.

@smarterclayton
Contributor

This breaks the thread safety of *Requests that have no body (requests with a body are not properly reentrant today). I'm not comfortable with doing that without discussion, and this could easily have regressed a client that reuses Requests from multiple contexts.

For that reason, this probably shouldn't have been factored this way, and I'd prefer to revert this and see a different factoring (one that uses a transient variable inside the various Watch/Stream/Do methods), with either a note about Body preventing reentrancy, or making it explicit by forcing a Request with a body to be safe for reentrancy.


smarterclayton commented Mar 29, 2022

@tkashem @wojtek-t as reviewers, @liggitt as well. I don't like ANY change to the visible safety behavior of Request, as it's such a low-level part of the ecosystem. As a result, this isn't safe by default, and I don't think we should leave this as-is in tree. I would like to remove footguns, not add them (the argument that we already have a footgun is not sufficient, since you can have a reentrant reader today).

// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
// Thus in case of "GET" operations, we simply retry it.
// We are not automatically retrying "write" operations, as they are not idempotent.
if r.verb != "GET" {
Contributor

This change looks like it broke the previous behavior. Prior to this PR, the following things were retried:

  • "transient errors on GET"
  • "explicit retriable errors from the server for any method where the body was reseekable"

After this change, we only retry:

  • "GET methods where a transient or explicit retry was indicated"

For that reason this change is incorrect and broke production behavior. The whole change (and any dependent changes) must be reverted, and we will reintroduce the new code (new factoring) once new test cases are introduced.

Contributor

I.e. the previous method:

		if err != nil {
			// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
			// Thus in case of "GET" operations, we simply retry it.
			// We are not automatically retrying "write" operations, as
			// they are not idempotent.
			if r.verb != "GET" {
				return err
			}
			// For connection errors and apiserver shutdown errors retry.
			if net.IsConnectionReset(err) || net.IsProbableEOF(err) {
				// For the purpose of retry, we set the artificial "retry-after" response.
				// TODO: Should we clean the original response if it exists?
				resp = &http.Response{
					StatusCode: http.StatusInternalServerError,
					Header:     http.Header{"Retry-After": []string{"1"}},
					Body:       ioutil.NopCloser(bytes.NewReader([]byte{})),
				}
			} else {
				return err
			}

^ retries transient errors

		}

		done := func() bool {
			// Ensure the response body is fully read and closed
			// before we reconnect, so that we reuse the same TCP
			// connection.
			defer func() {
				const maxBodySlurpSize = 2 << 10
				if resp.ContentLength <= maxBodySlurpSize {
					io.Copy(ioutil.Discard, &io.LimitedReader{R: resp.Body, N: maxBodySlurpSize})
				}
				resp.Body.Close()
			}()

			retries++
			if seconds, wait := checkWait(resp); wait && retries <= r.maxRetries {

^ if the error from the server indicates a retry (Retry-After means the client is free to retry, on any verb), then this method would honor it, which means the new logic prevents retries of POST/PUT/DELETE with bodies.


				retryInfo = getRetryReason(retries, seconds, resp, err)
				if seeker, ok := r.body.(io.Seeker); ok && r.body != nil {
					_, err := seeker.Seek(0, 0)
					if err != nil {
						klog.V(4).Infof("Could not retry request, can't Seek() back to beginning of body for %T", r.body)
						fn(req, resp)
						return true
					}
				}

				klog.V(4).Infof("Got a Retry-After %ds response for attempt %d to %v", seconds, retries, url)
				r.backoff.Sleep(time.Duration(seconds) * time.Second)
				return false
			}
			fn(req, resp)
			return true
		}()
		if done {
			return nil
		}
	}

Member

OK yeah - I agree with this.
Thanks for catching this Clayton

@tkashem - I agree with Clayton - at this point of the release we should revert this PR.

Contributor Author

@tkashem tkashem Mar 29, 2022

explicit retriable errors from the server for any method where the body was reseekable

are you referring to the following scenario?

  • request verb is POST
  • request body is re-seekable
  • server sends 429, Retry-After: 1

This should be retried -

var errIsRetryable bool
if f != nil && err != nil && f.IsErrorRetryable(httpReq, err) {
	errIsRetryable = true
	// we have a retryable error, for which we will create an
	// artificial "Retry-After" response.
	resp = retryAfterResponse()
}
if err != nil && !errIsRetryable {
	return false
}
// if we are here, we have either a or b:
// a: we have a retryable error, for which we already
//    have an artificial "Retry-After" response.
// b: we have a response from the server for which we
//    need to check if it is retryable
seconds, wait := checkWait(resp)
if !wait {
	return false
}
r.retryAfter.Wait = time.Duration(seconds) * time.Second
r.retryAfter.Reason = getRetryReason(r.attempts, seconds, resp, err)

Also, TestCheckRetryHandles429And5xx seems to exercise POST with bodies:

func TestCheckRetryHandles429And5xx(t *testing.T) {
	count := 0
	ch := make(chan struct{})
	testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		data, err := ioutil.ReadAll(req.Body)
		if err != nil {
			t.Fatalf("unable to read request body: %v", err)
		}
		if !bytes.Equal(data, []byte(strings.Repeat("abcd", 1000))) {
			t.Fatalf("retry did not send a complete body: %s", data)
		}
		t.Logf("attempt %d", count)
		if count >= 4 {
			w.WriteHeader(http.StatusOK)
			close(ch)
			return
		}
		w.Header().Set("Retry-After", "0")
		w.WriteHeader([]int{http.StatusTooManyRequests, 500, 501, 504}[count])
		count++
	}))
	defer testServer.Close()
	c := testRESTClient(t, testServer)
	_, err := c.Verb("POST").
		Prefix("foo", "bar").
		Suffix("baz").
		Timeout(time.Second).
		Body([]byte(strings.Repeat("abcd", 1000))).
		DoRaw(context.Background())
	if err != nil {
		t.Fatalf("Unexpected error: %v %#v", err, err)
	}
	<-ch
	if count != 4 {
		t.Errorf("unexpected retries: %d", count)
	}
}

Contributor Author

I agree with Clayton - at this point of the release we should revert this PR.

@wojtek-t @smarterclayton Yeah, I am working on a revert PR with all the dependencies.

Contributor Author

two options:

Contributor Author

@smarterclayton let me know which option you want to go with; I will close the undesired PR(s) based on your decision.

Contributor

In your test, I added some debugging to the isErrRetryableFunc closure in request() and it is not being called where I thought it was being called.

Digging deeper, it's because IsNextRetry is being called on the error from the http client, but NOT the response from the server (which is only checkWait). Comparing that to 1.21 yields the same logic - we retry if Retry-After is present and the body is seekable, and IsErrRetryable.

So you're right, this didn't change logic, and we did not regress retries when the server tells us to retry, and we have a test that proves it that is working correctly (good!).

Given that, we're back to the thread safety change being the only issue. That is not a release blocker, but the other issue that depends on the construction of the client potentially is. That means we need to understand the scope of the change needed to avoid breaking thread safety (the refactor mentioned elsewhere), to figure out the risk of fixing both issues versus just fixing #108906 and dealing with the thread safety afterwards.

Contributor

@smarterclayton smarterclayton Mar 30, 2022

As a follow-up for the next release, the comments on some of these methods and sections could be improved. For instance, we may want to rename IsErrorRetryable to IsHttpErrRetryable and checkWait to isServerIndicatedRetry. That would have helped bring this back into mental cache faster, and possibly help some future refactorer.

Contributor Author

@tkashem tkashem Mar 30, 2022

Thanks for the in-depth review. I closed the revert PRs and opened a new issue with your suggestions: #109155


smarterclayton commented Mar 29, 2022

@tkashem #102217 (comment) looks like it broke the previous behavior of retry, which means this regressed production client behavior (from 1.22 onwards, it looks like). If you don't think we regressed, please show me why and connect it to a test case we have. Otherwise, this and any PR that depends on this logic need to be reverted now for 1.24 (and probably backported to 1.22+), and we'll re-introduce it with the correct behavior once we have tests for the new problem.

tkashem added a commit to tkashem/kubernetes that referenced this pull request Mar 29, 2022
…efactor-retry"

This reverts commit a898587, reversing
changes made to 77937b1.
