client-go: refactor retry logic so it can be reused #102217
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
Force-pushed from c4edb1e to 73508e1
/test pull-kubernetes-e2e-kind
Force-pushed from 729bd48 to 421dc61
wojtek-t left a comment:
This looks great - just a couple small comments.
maxRetries is always 1 in all cases?
Please either hardcode it here or add (if it makes sense - those cases will be tested below) cases that exercise it.
yes, maxRetries is always 1 in this test; I hard-coded it for now.
Seems like a complete copy-paste of the above.
Unless I'm missing something, please extract that to a separate function and do:
```go
func TestRequestDoRawWithRetry(t *testing.T) {
	newFunc(t, func(ctx context.Context, r *Request) { r.DoRaw(ctx) })
}
```
[same for Do above]
good catch, I am hoping I can reuse it for Watch and Stream in the follow-up PR
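To illustrate the extraction being discussed, here is a self-contained sketch of the pattern using a plain `net/http/httptest` server instead of client-go's `*Request`; `testRetryHelper` and `getWithOneRetry` are invented names for this sketch, not the helpers in the PR:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// testRetryHelper holds the shared retry-exercising machinery: it spins up a
// server that throttles the first attempt with a 429 and succeeds afterwards,
// then runs the caller-supplied request closure against it.
func testRetryHelper(do func(url string) error) (int, error) {
	attempts := 0
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		attempts++
		if attempts == 1 {
			// First attempt: tell the client it may retry immediately.
			w.Header().Set("Retry-After", "0")
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	err := do(srv.URL)
	return attempts, err
}

// getWithOneRetry is one possible closure a test would pass in: a GET that
// retries exactly once on a 429 response.
func getWithOneRetry(url string) error {
	for try := 0; try < 2; try++ {
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if resp.StatusCode == http.StatusTooManyRequests && try == 0 {
			continue // retry once on 429
		}
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		return nil
	}
	return fmt.Errorf("still throttled after retry")
}

func main() {
	attempts, err := testRetryHelper(getWithOneRetry)
	fmt.Println(attempts, err) // 2 <nil>
}
```

Each per-verb test (Do, DoRaw, and later Watch/Stream) would then be a one-liner passing its own closure to the shared helper, which is the shape wojtek-t is asking for.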
Force-pushed from 421dc61 to 5fdf196
@wojtek-t comments addressed, I also added
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubernetes-e2e-gce-ubuntu-containerd
/test pull-kubernetes-e2e-gce-ubuntu-containerd
/triage accepted
```go
func checkWait(resp *http.Response) (int, bool) {
	switch r := resp.StatusCode; {
	// any 500 error code and 429 can trigger a wait
	case r == http.StatusTooManyRequests, r >= 500:
```
Pushing this case into a generic NextRetry makes it error prone. I don't see a check of the request verb to ensure it's a GET. I can accept that a GET should be retriable on a 5xx series error, but other HTTP verbs cannot automatically be retried on 5xx in all cases. Some action may have been taken before the failure that would make an auto-replay unpredictable. I could probably buy 503, given its usage for "you didn't get to your endpoint" from the aggregator.
I think if you split the check for GET requests and non-GETs you can keep the generic NextRetry function.
it would retry only if StatusCode == 429 or StatusCode >= 500 and the response has a Retry-After header with an integer value.
This was resolved in kube slack - https://kubernetes.slack.com/archives/C0EG7JC6T/p1622037135027600
```go
		resp.Body.Close()
	}()

	retries++
```
This change broke the thread safety of *Request. Previously, all methods on *Request tracked retries in function-local variables (thus different threads could call a request independently). After this change, we can't. I don't know that this is related to #109050, but we should not have made this change (we previously guaranteed the thread safety of methods called on Request).
@smarterclayton - can you clarify this one? I think I'm not following this comment.
The Request itself doesn't claim to be thread-safe anywhere (and I'm not entirely sure what it would really mean to call different methods on a request from different threads). There isn't any locking anywhere here etc. either.
Maybe I missed something important in the review, but with this comment I still don't understand what that was.
> we previously guaranteed the thread safety of methods called on Request

@smarterclayton I am not sure how that would make sense semantically - multiple goroutines accessing a Request instance concurrently - but I can make retry a local variable with a factory function; I am working on a PR now.
You can call Stream() twice concurrently on a Request() and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).
I'm fine with an approach that has a helper object within the methods (vs on the request) as long as it's reviewable.
> You can call Stream() twice concurrently on a Request() and it would work before. After this change, those two calls would compete IF there were retries (you would have data races on the retry count).

dohh..
I'm not saying that we are allowed to break it (I agree with you that we should fix it), but what would be the reason for doing that?
Stream is a bit more questionable, reusing a LIST or GET request in some machinery as a closure would be more likely.
This breaks the thread safety of *Requests that have no body (requests with a body are not properly reentrant today). I'm not comfortable with doing that without discussion, and this could easily have regressed a client that reuses Requests from multiple contexts. For that reason, this probably shouldn't have been factored this way, and I'd prefer to revert this and see a different factoring (one that uses a transient variable inside the various Watch/Stream/Do methods), with either a note about Body preventing reentrancy, or making it explicit by forcing a Request with a body to be safe for reentrancy.
@tkashem @wojtek-t as reviewers, @liggitt as well. I don't like ANY change to the visible safety behavior of Request, as it's such a low-level part of the ecosystem. As a result, this isn't safe by default, and I don't think we should leave this as is in tree. I would like to remove footguns, not add them (the argument that we already have a footgun is not sufficient, since you can have a reentrant reader today).
```go
// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
// Thus in case of "GET" operations, we simply retry it.
// We are not automatically retrying "write" operations, as they are not idempotent.
if r.verb != "GET" {
```
This change looks like it broke the previous behavior. Prior to this PR, the following things were retried:
- "transient errors on GET"
- "explicit retriable errors from the server for any method where the body was reseekable"

After this change, we only retry:
- "GET methods where a transient or explicit retriable error occurs"

For that reason this change is incorrect and broke production behavior. The whole change (and any dependent changes) must be reverted, and we will reintroduce the new code (new factoring) once new test cases are introduced.
I.e. the previous method:

```go
	if err != nil {
		// "Connection reset by peer" or "apiserver is shutting down" are usually a transient errors.
		// Thus in case of "GET" operations, we simply retry it.
		// We are not automatically retrying "write" operations, as
		// they are not idempotent.
		if r.verb != "GET" {
			return err
		}
		// For connection errors and apiserver shutdown errors retry.
		if net.IsConnectionReset(err) || net.IsProbableEOF(err) {
			// For the purpose of retry, we set the artificial "retry-after" response.
			// TODO: Should we clean the original response if it exists?
			resp = &http.Response{
				StatusCode: http.StatusInternalServerError,
				Header:     http.Header{"Retry-After": []string{"1"}},
				Body:       ioutil.NopCloser(bytes.NewReader([]byte{})),
			}
		} else {
			return err
		}
		// ^ retries transient errors
	}
	done := func() bool {
		// Ensure the response body is fully read and closed
		// before we reconnect, so that we reuse the same TCP
		// connection.
		defer func() {
			const maxBodySlurpSize = 2 << 10
			if resp.ContentLength <= maxBodySlurpSize {
				io.Copy(ioutil.Discard, &io.LimitedReader{R: resp.Body, N: maxBodySlurpSize})
			}
			resp.Body.Close()
		}()

		retries++
		if seconds, wait := checkWait(resp); wait && retries <= r.maxRetries {
			// ^ if the error from the server indicates a retry (retry-after means the
			// client is free to retry, on any verb), then this method will return it,
			// which means the new logic prevents retries of POST/PUT/DELETE with bodies.
			retryInfo = getRetryReason(retries, seconds, resp, err)
			if seeker, ok := r.body.(io.Seeker); ok && r.body != nil {
				_, err := seeker.Seek(0, 0)
				if err != nil {
					klog.V(4).Infof("Could not retry request, can't Seek() back to beginning of body for %T", r.body)
					fn(req, resp)
					return true
				}
			}
			klog.V(4).Infof("Got a Retry-After %ds response for attempt %d to %v", seconds, retries, url)
			r.backoff.Sleep(time.Duration(seconds) * time.Second)
			return false
		}
		fn(req, resp)
		return true
	}()
	if done {
		return nil
	}
}
```
There was a problem hiding this comment.
OK yeah - I agree with this.
Thanks for catching this Clayton
@tkashem - I agree with Clayton - at this point of the release we should revert this PR.
> explicit retriable errors from the server for any method where the body was reseekable

are you referring to the following scenario?
- request verb is POST
- request body is re-seekable
- server sends 429, Retry-After: 1

This should be retried -
kubernetes/staging/src/k8s.io/client-go/rest/with_retry.go, lines 173 to 195 in 1ea07d4

also, TestCheckRetryHandles429And5xx seems to exercise POST with bodies:
kubernetes/staging/src/k8s.io/client-go/rest/request_test.go, lines 1446 to 1483 in 1ea07d4
> I agree with Clayton - at this point of the release we should revert this PR.

@wojtek-t @smarterclayton yeah, I am working on a revert PR with all dependencies.
two options:
- if we still want to revert the entire refactor of retry: client-go: revert the refactor of retry #109149
- make `retry` thread safe in `Request` and revert the wrapping of error from a previous retry: Revert "Merge pull request #108698 from tkashem/last-error" #109148 and client-go: make retry in Request thread safe #109114
@smarterclayton let me know which option you want to go with; I will close the undesired PR(s) based on your decision.
In your test, I added some debugging to the isErrRetryableFunc closure in request() and it is not being called where I thought it was being called.
Digging deeper, it's because IsNextRetry is being called on the error from the http client, but NOT the response from the server (which is only checkWait). Comparing that to 1.21 yields the same logic - we retry if Retry-After is present and the body is seekable, and IsErrRetryable.
So you're right, this didn't change the logic, and we did not regress retries when the server tells us to retry, and we have a correctly working test that proves it (good!).
Given that, we're back to the thread safety change being the only issue. That is not a release blocker, but the other issue that depends on the construction of the client potentially is, which means we need to understand the scope of the change to avoid breaking thread safety (the refactor mentioned elsewhere) to figure out the risk of fixing both issues or just fixing #108906 and dealing with the thread safety afterwards.
As a follow up for the next release, the comments on some of these methods and sections could be improved. For instance, we may want to rename IsErrorRetryable to IsHttpErrRetryable and checkWait to isServerIndicatedRetry. That would have helped bring this back into mental cache faster and possibly help some future refactorer.
thanks for the in-depth review, I closed the revert PRs and opened a new issue with your suggestions. #109155
@tkashem #102217 (comment) looks like it broke the previous behavior of retry, which means this regressed production client behavior (from 1.22 onwards, it looks like). If you don't think we regressed, please show me why and connect it to a test case we have. Otherwise, this and any PR that depends on this logic needs to be reverted now for 1.24 (and probably backported to 1.22+), and we'll re-introduce it with the correct behavior once we have tests for the new problem.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Refactor retry logic into a reusable unit so we can use it for other methods such as `Watch` and `Stream`.

Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: