storage: automatic retries with exponential backoff for download failures caused by load shed #3040
Comments
Can you speak more specifically to the error(s) you are seeing and the library methods that return them? Most methods in this library are retried with exponential backoff on code 429, which is the expected error for too many requests. See google-cloud-go/storage/go110.go, line 26, at commit 0033acc.
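For illustration, here is a minimal sketch (not the library's actual retry code) of what retry-on-429 with exponential backoff looks like against the googleapi error type the HTTP client surfaces; the helper name, package name, and backoff constants are assumptions:

```go
// Illustrative only: a generic retry loop that backs off exponentially when an
// operation fails with HTTP 429 (Too Many Requests).
package gcsutil

import (
	"errors"
	"math/rand"
	"time"

	"google.golang.org/api/googleapi"
)

func retryOn429(maxAttempts int, fn func() error) error {
	backoff := 250 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		var apiErr *googleapi.Error
		if !errors.As(err, &apiErr) || apiErr.Code != 429 || attempt >= maxAttempts {
			return err // not a rate-limit error, or out of attempts
		}
		// Exponential backoff with a little jitter before the next attempt.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
}
```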
I think the error is when calling Read() with the storage.Reader. The error we saw was an http2 GOAWAY error from the server; we've also seen another, related error.
Ah interesting, I've never seen that error before. Can you clarify which version of the library you are using? Have you done anything unusual with the underlying http client for your storage client (e.g., subbing out using option.WithHTTPClient)?
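For readers unfamiliar with that option, swapping out the underlying http client looks roughly like this; a sketch only, with a hypothetical package and function name, and in practice the substituted client must still carry credentials (e.g. an OAuth2 transport):

```go
// Sketch of constructing the storage client with a custom *http.Client via
// option.WithHTTPClient. The transport choice here is purely illustrative; a
// bare http.DefaultTransport will not authenticate requests on its own.
package gcsutil

import (
	"context"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func newClientWithCustomHTTP(ctx context.Context) (*storage.Client, error) {
	hc := &http.Client{Transport: http.DefaultTransport}
	return storage.NewClient(ctx, option.WithHTTPClient(hc))
}
```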
Hello! We have also been seeing this GOAWAY error from reads. Like @ihuh0 above, the ones we see come from Read() calls on the storage.Reader. I don't think I'm overriding the http client or anything -- the code I'm using to init the client is all open source, right over here: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/cloud/gcp/gcs_storage.go#L141 We were planning to wrap the returned storage.Reader in our own io.Reader that would inspect any returned errors and automatically re-open the underlying reader (at a tracked offset) to retry on these GOAWAY errors, but before we started doing that at the application level, I wanted to check whether it was expected that the SDK would return these errors, or whether it sounds like something is wrong or we'd misconfigured it somehow.
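A minimal sketch of the application-level wrapper described above (the type name and details are hypothetical; it retries any mid-stream read error rather than GOAWAY specifically, and a production version would cap the number of re-open attempts):

```go
// Hypothetical application-level workaround: an io.Reader that tracks its
// offset and re-opens the underlying GCS reader when a read fails mid-stream.
package gcsutil

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

type resumingReader struct {
	ctx    context.Context
	obj    *storage.ObjectHandle
	r      *storage.Reader
	offset int64
}

func (rr *resumingReader) Read(p []byte) (int, error) {
	for {
		if rr.r == nil {
			// Re-open from the tracked offset; length -1 means "to the end".
			r, err := rr.obj.NewRangeReader(rr.ctx, rr.offset, -1)
			if err != nil {
				return 0, err
			}
			rr.r = r
		}
		n, err := rr.r.Read(p)
		rr.offset += int64(n)
		if err != nil && err != io.EOF && rr.ctx.Err() == nil {
			// Mid-stream failure (e.g. a GOAWAY): drop this reader and let the
			// next iteration re-open at the current offset.
			rr.r.Close()
			rr.r = nil
			if n > 0 {
				return n, nil
			}
			continue
		}
		return n, err
	}
}
```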
Hey @dt, thanks for reporting. I've done some research and I think we likely should add retries for this error, specifically for reads. The best description of what I think is happening is this: golang/go#18639 (comment) (substitute GCS for ALB here). Your use case (intermittent reads over several hours) seems the most prone to this scenario, since the period between when the server sends the headers and when the body reads occur is drawn out, so there's more opportunity for the server to close the connection in the meantime. The error type is this: https://github.com/golang/go/blob/master/src/net/http/h2_bundle.go#L8359 . Unfortunately I don't think there is any way of detecting this directly, but we already check for http2 INTERNAL errors here, so I think it makes sense to add GOAWAY to that check as well. Actually, I think there are other errors that might make sense to add too, but I'll probably stick with GOAWAY for now to be conservative. Also, I'm curious about your use case: have you considered smaller ranged reads? Or some kind of buffering, potentially? Either would probably increase reliability.
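To illustrate the "smaller ranged reads" suggestion, here is a rough sketch that downloads an object in fixed-size chunks so that no single response body stays open for long; the chunk size and function name are arbitrary choices, not recommendations from the library:

```go
// Sketch of chunked ("ranged") downloads: each NewRangeReader call fetches a
// bounded slice of the object, so each HTTP response body is short-lived.
package gcsutil

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

const chunkSize = 8 << 20 // 8 MiB, arbitrary

func downloadInChunks(ctx context.Context, obj *storage.ObjectHandle, w io.Writer) error {
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return err
	}
	for off := int64(0); off < attrs.Size; off += chunkSize {
		r, err := obj.NewRangeReader(ctx, off, chunkSize)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(w, r)
		r.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```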
This error can occur in calls to reader.Read if the CFE closes the connection between when the response headers are received and when the caller reads from the reader. Fixes googleapis#3040
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2. Fixes #3040
The fix for this has been released in storage/v1.16.0.
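With that release, a plain read loop should be enough, since transient body-read errors are re-opened inside the reader itself. A small usage sketch, with hypothetical bucket and object names passed in by the caller:

```go
// After storage/v1.16.0, Reader.Read reopens the download on transient body
// errors, so a straightforward copy like this should suffice.
package gcsutil

import (
	"context"
	"io"
	"os"

	"cloud.google.com/go/storage"
)

func download(ctx context.Context, client *storage.Client, bucket, object string, dst *os.File) error {
	r, err := client.Bucket(bucket).Object(object).NewReader(ctx)
	if err != nil {
		return err
	}
	defer r.Close()
	_, err = io.Copy(dst, r)
	return err
}
```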
Is your feature request related to a problem? Please describe.
Occasionally we get failures to download objects from GCS because too many requests are being made at a time, resulting in a load-shed error. We considered adding retries in our own code, but were wondering if this could be handled automatically in the storage library instead. Perhaps there could be automatic retries with exponential backoff in this case?