Conversation

@nopcoder (Contributor) commented Nov 1, 2025

This change adds up to 3 retries for lakectl fs download when a request fails, waiting a random 0-1000ms between attempts.

Close #9631

@nopcoder nopcoder self-assigned this Nov 1, 2025
@nopcoder nopcoder added area/lakectl Issues related to lakeFS' command line interface (lakectl) include-changelog PR description should be included in next release changelog labels Nov 1, 2025
@nopcoder nopcoder requested review from itaiad200 and removed request for itaiad200 November 2, 2025 07:52
@nopcoder nopcoder requested a review from itaiad200 November 2, 2025 09:06
@nopcoder nopcoder marked this pull request as ready for review November 2, 2025 09:06
@nopcoder nopcoder requested a review from a team November 2, 2025 09:35
@itaiad200 (Contributor) left a comment

  1. Not all errors should be retryable, e.g. I don't think we need to retry HTTP operations again.
  2. Can we leverage the existing retry packages instead of using custom logic?
  3. How was this tested?

Comment on lines 239 to 248
for attempt := 0; attempt <= d.BodyRetries; attempt++ {
	if attempt > 0 {
		// sleep for a random time between 0 and 1 second, or break on context cancellation
		select {
		//nolint:gosec
		case <-time.After(time.Duration(rand.IntN(DefaultDownloadRetryDelayMs)) * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
Contributor:

We have some retry package that we use often. Why did you decide on using some custom logic?

Contributor Author (@nopcoder):

Will switch to backoff.Retry.

// executeHTTPRequest executes an HTTP request and ensures the response body is closed with defer.
// The callback function receives the response and should return an error if the request should be retried.
// The response body will be automatically closed after the callback returns.
func executeHTTPRequest(client *http.Client, req *http.Request, callback func(*http.Response) error) error {
Contributor:

This should not be in errors.go

}
req.Header.Set("Range", rangeHeader)

err = executeHTTPRequest(d.HTTPClient, req, func(resp *http.Response) error {
Contributor:

Aren't you passing the retryClient here? It means that it will retry (attempts)^2 before failing, no?

Contributor Author (@nopcoder):

  • The helper is there to verify that we do close the body.
  • A request can't be reused, so for any error during upload or download we perform a new request.

_, err = io.ReadFull(resp.Body, buf)
var err error
for attempt := 0; attempt <= d.BodyRetries; attempt++ {
if attempt > 0 {
Contributor:

Do you think that all errors should be retryable?

  1. HTTP requests are already being retried.
  2. There are some unrelated failures, like OS operations, that we should retry.

Contributor Author (@nopcoder):

Will update the code so that the HTTP request and any other check (like the ETag) produce a final error that is not counted as part of the network-error retries.

return ctx.Err()
}
// Remove destination file if retrying (will be recreated)
_ = os.Remove(dst)
Contributor:

This is hard to maintain, since it's so easy to miss this when the code below evolves.
WDYT about truncating the file upon first write instead?

Contributor Author (@nopcoder):

Sure. We kept the same code because the remove was added along with symlink support, where truncate is not an option.

@nopcoder (Contributor Author) commented Nov 3, 2025

@itaiad200 tested manually - will do it again after we complete reviewing the latest changes.

@nopcoder nopcoder requested a review from itaiad200 November 3, 2025 09:05
@itaiad200 (Contributor) left a comment

Looks great.
Please test again before merging

Currently we use POST to upload data; there is no retry mechanism except for a full retry of the entire upload.
@nopcoder nopcoder changed the title Improve upload/download robustness with retry logic for body operations Improve download robustness with retry logic for body operations Nov 5, 2025
@nopcoder nopcoder requested a review from itaiad200 November 5, 2025 07:09
@nopcoder
Copy link
Contributor Author

nopcoder commented Nov 5, 2025

@itaiad200 removed the upload logic, as we don't have a 'real' retry over the POST request body - the previous code retried the complete operation. Let me know if we'd still like to have that and I'll revert.

@nopcoder nopcoder merged commit 11f0017 into master Nov 5, 2025
44 checks passed
@nopcoder nopcoder deleted the task/lakectl-fs-retry branch November 5, 2025 08:07


Development

Successfully merging this pull request may close these issues.

lakectl fs download retry

2 participants