-
Notifications
You must be signed in to change notification settings - Fork 404
Improve download robustness with retry logic for body operations #9629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
itaiad200
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Not all errors should be retryable, e.g. I don't think we need to retry HTTP operations again.
- Can we leverage the existing retry packages instead of using custom logic?
- How was this tested?
pkg/api/helpers/download.go
Outdated
| for attempt := 0; attempt <= d.BodyRetries; attempt++ { | ||
| if attempt > 0 { | ||
| // sleep for a random time between 0 and 1 second or break on context cancellation | ||
| select { | ||
| //nolint:gosec | ||
| case <-time.After(time.Duration(rand.IntN(DefaultDownloadRetryDelayMs)) * time.Millisecond): | ||
| case <-ctx.Done(): | ||
| return ctx.Err() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some retry package that we use often. Why did you decide on using some custom logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will switch to use backoff.Retry
pkg/api/helpers/errors.go
Outdated
| // executeHTTPRequest executes an HTTP request and ensures the response body is closed with defer. | ||
| // The callback function receives the response and should return an error if the request should be retried. | ||
| // The response body will be automatically closed after the callback returns. | ||
| func executeHTTPRequest(client *http.Client, req *http.Request, callback func(*http.Response) error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be in errors.go
pkg/api/helpers/download.go
Outdated
| } | ||
| req.Header.Set("Range", rangeHeader) | ||
|
|
||
| err = executeHTTPRequest(d.HTTPClient, req, func(resp *http.Response) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't you passing the retryClient here? It means that it will retry (attempts)^2 before failing, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the helper is to verify that we do close the body.
- can't reuse a request so for any error during upload and download we perform a new request
pkg/api/helpers/download.go
Outdated
| _, err = io.ReadFull(resp.Body, buf) | ||
| var err error | ||
| for attempt := 0; attempt <= d.BodyRetries; attempt++ { | ||
| if attempt > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that all errors should be retryable?
- HTTP requests are already being retried.
- There are some unrelated failures, like OS operations, that we should retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update the code to consider the http request and any other check (like etag) to be final error not be counted as part of the network error retry.
pkg/api/helpers/download.go
Outdated
| return ctx.Err() | ||
| } | ||
| // Remove destination file if retrying (will be recreated) | ||
| _ = os.Remove(dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hard to maintain, since it's so easy to miss this when the code below evolves.
WDYT about truncating the file upon first write instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. kept the same code as we added remove when we added symlink support where truncate is not an option.
|
@itaiad200 tested manually - will do it again after we complete reviewing the latest changes. |
itaiad200
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
Please test again before merging
currently we use post to upload data, there is no retry mechnism except for a full retry of the entire upload.
|
@itaiad200 removed the upload logic as we don't have 'real' retry over post request body - the previous code was retry the complete operation. |
This change adds 3 retries for lakectl fs download, when request fails.
Wait 0-1000ms (random) between each attempt.
Close #9631