
client-go: refactor retry logic so it can be reused #102217

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from tkashem:client-go-refactor-retry
May 24, 2021

Conversation


@tkashem tkashem commented May 21, 2021

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Refactor the retry logic into a reusable unit so we can also use it for Watch and Stream.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2021
@tkashem tkashem force-pushed the client-go-refactor-retry branch from c4edb1e to 73508e1 Compare May 21, 2021 21:27

tkashem commented May 22, 2021

e2e suite: [sig-node] Probing container should be ready immediately after startupProbe succeeds

/test pull-kubernetes-e2e-kind

@tkashem tkashem force-pushed the client-go-refactor-retry branch 3 times, most recently from 729bd48 to 421dc61 Compare May 24, 2021 02:05

tkashem commented May 24, 2021

Refactored the retry logic into its own reusable unit, please review.

/assign @wojtek-t @deads2k

@wojtek-t wojtek-t left a comment

This looks great - just a couple small comments.

Member

nit:

...) (*RetryAfter, bool)

Member

maxRetries is always 1 in all cases?

Please either hardcode it here or add (if it makes sense - those cases will be tested below) cases that exercise it.

Contributor Author

Yes, maxRetries is always 1 in this test; I hard-coded it for now.

Member

Seems like a complete copy-paste of the above.

Unless I'm missing something, please extract that to a separate function and do:

func TestRequestDoRawWithRetry(t *testing.T) {
  newFunc(t, func(ctx context.Context, r *Request) { r.DoRaw(ctx) })
}

[same for Do above]

Contributor Author

Good catch; I am hoping I can reuse it for Watch and Stream in the follow-up PR.

@tkashem tkashem force-pushed the client-go-refactor-retry branch from 421dc61 to 5fdf196 Compare May 24, 2021 13:41

tkashem commented May 24, 2021

@wojtek-t comments addressed, I also added noSleepBackOff to make the unit test execution faster.

// noSleepBackOff is a NoBackoff except it does not sleep,
// used for faster execution of the unit tests.
type noSleepBackOff struct {
	*NoBackoff
}

func (n *noSleepBackOff) Sleep(d time.Duration) {}
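To illustrate how a no-op Sleep keeps retry tests fast, here is a minimal self-contained sketch; the `backoff` interface, `retryWith`, and the type names below are hypothetical stand-ins for illustration, not client-go's actual types:

```go
package main

import (
	"errors"
	"time"
)

var errTransient = errors.New("transient")

// backoff mirrors the Sleep-based contract the retry logic depends on
// (hypothetical interface for illustration).
type backoff interface {
	Sleep(d time.Duration)
}

// realBackoff actually sleeps between attempts.
type realBackoff struct{}

func (realBackoff) Sleep(d time.Duration) { time.Sleep(d) }

// noSleepBackoff satisfies the same contract but returns immediately,
// so unit tests that exercise the retry path do not wait.
type noSleepBackoff struct{}

func (noSleepBackoff) Sleep(time.Duration) {}

// retryWith runs fn up to maxRetries+1 times, sleeping via b between
// failed attempts; tests inject noSleepBackoff to run instantly.
func retryWith(b backoff, maxRetries int, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		b.Sleep(time.Second)
	}
	return err
}
```

Because the retry loop only talks to the interface, swapping in the no-sleep implementation changes test runtime without touching the logic under test.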

@wojtek-t
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2021

tkashem commented May 24, 2021

e2e suite: [sig-instrumentation] MetricsGrabber should grab all metrics from a Scheduler.

/test pull-kubernetes-e2e-gce-ubuntu-containerd


tkashem commented May 24, 2021

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@k8s-ci-robot k8s-ci-robot merged commit a898587 into kubernetes:master May 24, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 24, 2021
@fedebongio
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 25, 2021
func checkWait(resp *http.Response) (int, bool) {
	switch r := resp.StatusCode; {
	// any 500 error code and 429 can trigger a wait
	case r == http.StatusTooManyRequests, r >= 500:
Contributor

@deads2k deads2k May 26, 2021

Pushing this case into a generic NextRetry makes it error-prone. I don't see a check of the request verb to ensure it's a GET. I can accept that a GET should be retriable on a 5xx-series error, but other HTTP verbs cannot automatically be retried on 5xx in all cases. Some action may have been taken before the failure that would make an auto-replay unpredictable. I could probably buy 503, given its usage by the aggregator for "you didn't get to your endpoint".

I think if you split the check for GET requests from non-GETs, you can keep the generic NextRetry function.
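One way the GET/non-GET split suggested above could look, as a hedged sketch; `isIdempotentVerb` and `canRetryTransient` are illustrative names, not client-go's actual API:

```go
package main

import "net/http"

// isIdempotentVerb reports whether a verb can safely be replayed after a
// transient failure. Per the discussion above, only GET is treated as
// automatically retriable (hypothetical helper for illustration).
func isIdempotentVerb(verb string) bool {
	return verb == "GET"
}

// canRetryTransient sketches the split: idempotent verbs may be retried on
// 429 or any 5xx; other verbs would need an explicit server signal such as
// a Retry-After header (not modeled here).
func canRetryTransient(verb string, statusCode int) bool {
	if statusCode != http.StatusTooManyRequests && statusCode < 500 {
		return false
	}
	return isIdempotentVerb(verb)
}
```

Keeping the verb check in one small predicate is what lets the surrounding NextRetry function stay generic.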

Contributor Author

It would retry only if StatusCode == 429 or StatusCode >= 500 and the response has a Retry-After header with an integer value.

This was resolved in kube slack - https://kubernetes.slack.com/archives/C0EG7JC6T/p1622037135027600

	resp.Body.Close()
}()

retries++
Contributor

This change broke the thread safety of *Request. Previously, all methods on *Request tracked retry state in function-local variables (so different goroutines could execute the same request independently). After this change, we can't. I don't know that this is related to #109050, but we should not have made this change (we previously guaranteed the thread safety of methods called on Request).

Member

@smarterclayton - can you clarify this one? I think I'm not following this comment.

The Request itself doesn't seem to claim to be thread-safe anywhere (and I'm not entirely sure what it would really mean to call different methods on a request from different threads). There isn't any locking anywhere here, etc. either.

Maybe I missed something important in the review, but with this comment I still don't understand what that was.

Contributor Author

@tkashem tkashem Mar 29, 2022

we previously guaranteed the thread safety of methods called on Request

@smarterclayton I am not sure how that would make sense semantically (multiple goroutines accessing a Request instance concurrently), but I can make retry a local variable with a factory function; I am working on a PR now.
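A minimal sketch of the factory-function approach mentioned here (`retryState` and `newRetryFn` are illustrative names, not the eventual PR's code): each call produces its own retry counter, so concurrent executions of the same request no longer share mutable state.

```go
package main

// retryState holds per-invocation retry bookkeeping. Keeping it out of the
// shared Request object avoids data races when the same request is executed
// from multiple goroutines.
type retryState struct {
	attempts   int
	maxRetries int
}

// next reports whether another attempt is allowed, consuming one retry.
func (r *retryState) next() bool {
	if r.attempts >= r.maxRetries {
		return false
	}
	r.attempts++
	return true
}

// newRetryFn is the factory function: each call returns a closure over a
// fresh, independent retryState.
func newRetryFn(maxRetries int) func() bool {
	state := &retryState{maxRetries: maxRetries}
	return state.next
}
```

Two goroutines that each call the factory get disjoint state, which is the property the shared-field version lost.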


Contributor

@smarterclayton smarterclayton Mar 29, 2022

You can call Stream() twice concurrently on a Request and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).

Contributor

I'm fine with an approach that has a helper object within the methods (vs on the request) as long as it's reviewable.

Member

You can call Stream() twice concurrently on a Request() and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).

Dohh...
I'm not saying that we are allowed to break it (I agree with you that we should fix it), but what was the reason for doing that?

Contributor

Stream is a bit more questionable; reusing a LIST or GET request as a closure in some machinery would be more likely.

@smarterclayton
Contributor

This breaks the thread safety of *Requests that have no body (requests with a body are not properly reentrant today). I'm not comfortable with doing that without discussion, and this could easily have regressed a client that reuses Requests from multiple contexts.

For that reason, this probably shouldn't have been factored this way, and I'd prefer to revert this and see a different factoring (one that uses a transient variable inside the various Watch/Stream/Do methods), with either a note about Body preventing reentrancy, or making it explicit by forcing a Request with a body to be safe for reentrancy.


smarterclayton commented Mar 29, 2022

@tkashem @wojtek-t as reviewers, @liggitt as well. I don't like ANY change to the visible safety behavior of Request, as it's such a low-level part of the ecosystem. As a result, this isn't safe by default, and I don't think we should leave this as-is in tree. I would like to remove footguns, not add them (the argument that we already have a footgun is not sufficient, since you can have a reentrant reader today).

// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
// Thus in case of "GET" operations, we simply retry it.
// We are not automatically retrying "write" operations, as they are not idempotent.
if r.verb != "GET" {
Contributor

This change looks like it broke the previous behavior. Prior to this PR, the following things were retried:

  • "transient errors on GET"
  • "explicit retriable errors from the server for any method where the body was reseekable"

After this change, we only retry:

  • "GET methods where a transient or explicit retry was indicated"

For that reason this change is incorrect and broke production behavior. The whole change (and any dependent changes) must be reverted, and we will reintroduce the new code (new factoring) once new test cases are introduced.

Contributor

I.e. the previous method:

		if err != nil {
			// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
			// Thus in case of "GET" operations, we simply retry it.
			// We are not automatically retrying "write" operations, as
			// they are not idempotent.
			if r.verb != "GET" {
				return err
			}
			// For connection errors and apiserver shutdown errors retry.
			if net.IsConnectionReset(err) || net.IsProbableEOF(err) {
				// For the purpose of retry, we set the artificial "retry-after" response.
				// TODO: Should we clean the original response if it exists?
				resp = &http.Response{
					StatusCode: http.StatusInternalServerError,
					Header:     http.Header{"Retry-After": []string{"1"}},
					Body:       ioutil.NopCloser(bytes.NewReader([]byte{})),
				}
			} else {
				return err
			}

^ retries transient errors

		}

		done := func() bool {
			// Ensure the response body is fully read and closed
			// before we reconnect, so that we reuse the same TCP
			// connection.
			defer func() {
				const maxBodySlurpSize = 2 << 10
				if resp.ContentLength <= maxBodySlurpSize {
					io.Copy(ioutil.Discard, &io.LimitedReader{R: resp.Body, N: maxBodySlurpSize})
				}
				resp.Body.Close()
			}()

			retries++
			if seconds, wait := checkWait(resp); wait && retries <= r.maxRetries {

^ if the error from the server indicates a retry (Retry-After means the client is free to retry, on any verb), then this method would honor it, which means the new logic prevents retries of POST/PUT/DELETE with bodies.


				retryInfo = getRetryReason(retries, seconds, resp, err)
				if seeker, ok := r.body.(io.Seeker); ok && r.body != nil {
					_, err := seeker.Seek(0, 0)
					if err != nil {
						klog.V(4).Infof("Could not retry request, can't Seek() back to beginning of body for %T", r.body)
						fn(req, resp)
						return true
					}
				}

				klog.V(4).Infof("Got a Retry-After %ds response for attempt %d to %v", seconds, retries, url)
				r.backoff.Sleep(time.Duration(seconds) * time.Second)
				return false
			}
			fn(req, resp)
			return true
		}()
		if done {
			return nil
		}
	}

Member

OK yeah - I agree with this.
Thanks for catching this Clayton

@tkashem - I agree with Clayton - at this point of the release we should revert this PR.

Contributor Author

@tkashem tkashem Mar 29, 2022

explicit retriable errors from the server for any method where the body was reseekable

are you referring to the following scenario?

  • request verb is POST
  • request body is re-seekable
  • server sends 429, Retry-After: 1

This should be retried -

var errIsRetryable bool
if f != nil && err != nil && f.IsErrorRetryable(httpReq, err) {
	errIsRetryable = true
	// we have a retryable error, for which we will create an
	// artificial "Retry-After" response.
	resp = retryAfterResponse()
}
if err != nil && !errIsRetryable {
	return false
}
// if we are here, we have either a or b:
// a: we have a retryable error, for which we already
//    have an artificial "Retry-After" response.
// b: we have a response from the server for which we
//    need to check if it is retryable
seconds, wait := checkWait(resp)
if !wait {
	return false
}
r.retryAfter.Wait = time.Duration(seconds) * time.Second
r.retryAfter.Reason = getRetryReason(r.attempts, seconds, resp, err)

Also, TestCheckRetryHandles429And5xx seems to exercise POST with bodies:

func TestCheckRetryHandles429And5xx(t *testing.T) {
	count := 0
	ch := make(chan struct{})
	testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		data, err := ioutil.ReadAll(req.Body)
		if err != nil {
			t.Fatalf("unable to read request body: %v", err)
		}
		if !bytes.Equal(data, []byte(strings.Repeat("abcd", 1000))) {
			t.Fatalf("retry did not send a complete body: %s", data)
		}
		t.Logf("attempt %d", count)
		if count >= 4 {
			w.WriteHeader(http.StatusOK)
			close(ch)
			return
		}
		w.Header().Set("Retry-After", "0")
		w.WriteHeader([]int{http.StatusTooManyRequests, 500, 501, 504}[count])
		count++
	}))
	defer testServer.Close()
	c := testRESTClient(t, testServer)
	_, err := c.Verb("POST").
		Prefix("foo", "bar").
		Suffix("baz").
		Timeout(time.Second).
		Body([]byte(strings.Repeat("abcd", 1000))).
		DoRaw(context.Background())
	if err != nil {
		t.Fatalf("Unexpected error: %v %#v", err, err)
	}
	<-ch
	if count != 4 {
		t.Errorf("unexpected retries: %d", count)
	}
}

Contributor Author

I agree with Clayton - at this point of the release we should revert this PR.

@wojtek-t @smarterclayton Yeah, I am working on a revert PR with all the dependencies.

Contributor Author

two options:

Contributor Author

@smarterclayton let me know which option you want to go with; I will close the undesired PR(s) based on your decision.

Contributor

In your test, I added some debugging to the isErrRetryableFunc closure in request() and it is not being called where I thought it was being called.

Digging deeper, it's because IsNextRetry is being called on the error from the http client, but NOT the response from the server (which is only checkWait). Comparing that to 1.21 yields the same logic - we retry if Retry-After is present and the body is seekable, and IsErrRetryable.

So you're right, this didn't change logic, and we did not regress retries when the server tells us to retry, and we have a test that proves it that is working correctly (good!).

Given that, we're back to the thread safety change being the only issue. That is not a release blocker, but the other issue that depends on the construction of the client potentially is. That means we need to understand the scope of the change needed to avoid breaking thread safety (the refactor mentioned elsewhere), to figure out the risk of fixing both issues versus just fixing #108906 and dealing with the thread safety afterwards.

Contributor

@smarterclayton smarterclayton Mar 30, 2022

As a follow-up for the next release, the comments on some of these methods and sections could be improved. For instance, we may want to rename IsErrorRetryable to IsHttpErrRetryable and checkWait to isServerIndicatedRetry. That would have helped bring this back into mental cache faster, and possibly help some future refactorer.

Contributor Author

@tkashem tkashem Mar 30, 2022

Thanks for the in-depth review. I closed the revert PRs and opened a new issue with your suggestions: #109155


smarterclayton commented Mar 29, 2022

@tkashem #102217 (comment) looks like it broke the previous behavior of retry, which means this regressed production client behavior (from 1.22 onwards, it looks like). If you don't think we regressed, please show me why and connect it to a test case we have. Otherwise, this and any PR that depends on this logic need to be reverted now for 1.24 (and probably backported to 1.22+), and we'll re-introduce it with the correct behavior once we have tests for the new problem.

tkashem added a commit to tkashem/kubernetes that referenced this pull request Mar 29, 2022
…efactor-retry"

This reverts commit a898587, reversing
changes made to 77937b1.
