storage: automatic retries with exponential backoff for download failures caused by load shed #3040
Comments
Can you speak more specifically to the error(s) you are seeing and the library methods that return them? Most methods in this library are retried with exponential backoff on code 429, which is the expected error for too many requests. See google-cloud-go/storage/go110.go, line 26, at commit 0033acc.
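For illustration, here is a minimal sketch (not the library's actual retry code) of what retry-on-429 with exponential backoff looks like against the googleapi error type the HTTP client surfaces; the helper name, package name, and backoff constants are assumptions:

```go
// Illustrative only: a generic retry loop that backs off exponentially when an
// operation fails with HTTP 429 (Too Many Requests).
package gcsutil

import (
	"errors"
	"math/rand"
	"time"

	"google.golang.org/api/googleapi"
)

func retryOn429(maxAttempts int, fn func() error) error {
	backoff := 250 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		var apiErr *googleapi.Error
		if !errors.As(err, &apiErr) || apiErr.Code != 429 || attempt >= maxAttempts {
			return err // not a rate-limit error, or out of attempts
		}
		// Exponential backoff with a little jitter before the next attempt.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
}
```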
I think the error is when calling Read() with the storage.Reader. The error we saw was an http2 GOAWAY error from the server; we've also seen another, related error.
Ah interesting, I've never seen that error before. Can you clarify which version of the library you are using? Have you done anything unusual with the underlying http client for your storage client (e.g., subbing out using option.WithHTTPClient)?
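For readers unfamiliar with that option, swapping out the underlying http client looks roughly like this; a sketch only, with a hypothetical package and function name, and in practice the substituted client must still carry credentials (e.g. an OAuth2 transport):

```go
// Sketch of constructing the storage client with a custom *http.Client via
// option.WithHTTPClient. The transport choice here is purely illustrative; a
// bare http.DefaultTransport will not authenticate requests on its own.
package gcsutil

import (
	"context"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func newClientWithCustomHTTP(ctx context.Context) (*storage.Client, error) {
	hc := &http.Client{Transport: http.DefaultTransport}
	return storage.NewClient(ctx, option.WithHTTPClient(hc))
}
```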
Hello! We have also been seeing this GOAWAY error from reads. Like @ihuh0 above, the ones we see come from Read() calls on the storage.Reader. I don't think I'm overriding the http client or anything -- the code I'm using to init the client is all open source, right over here: https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/cloud/gcp/gcs_storage.go#L141 We were planning to wrap the returned storage.Reader in our own io.Reader that would inspect any returned errors and automatically re-open the underlying reader (at a tracked offset) to retry on these GOAWAY errors, but before we started doing that at the application level, I wanted to check whether it was expected that the SDK would return these errors, or whether it sounds like something is wrong or we'd misconfigured it somehow.
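A minimal sketch of the application-level wrapper described above (the type name and details are hypothetical; it retries any mid-stream read error rather than GOAWAY specifically, and a production version would cap the number of re-open attempts):

```go
// Hypothetical application-level workaround: an io.Reader that tracks its
// offset and re-opens the underlying GCS reader when a read fails mid-stream.
package gcsutil

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

type resumingReader struct {
	ctx    context.Context
	obj    *storage.ObjectHandle
	r      *storage.Reader
	offset int64
}

func (rr *resumingReader) Read(p []byte) (int, error) {
	for {
		if rr.r == nil {
			// Re-open from the tracked offset; length -1 means "to the end".
			r, err := rr.obj.NewRangeReader(rr.ctx, rr.offset, -1)
			if err != nil {
				return 0, err
			}
			rr.r = r
		}
		n, err := rr.r.Read(p)
		rr.offset += int64(n)
		if err != nil && err != io.EOF && rr.ctx.Err() == nil {
			// Mid-stream failure (e.g. a GOAWAY): drop this reader and let the
			// next iteration re-open at the current offset.
			rr.r.Close()
			rr.r = nil
			if n > 0 {
				return n, nil
			}
			continue
		}
		return n, err
	}
}
```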
Hey @dt, thanks for reporting. I've done some research and I think we likely should add retries for this error, specifically for reads. The best description of what I think is happening is this: golang/go#18639 (comment) (substitute GCS for ALB here). Your use case (intermittent reads over several hours) seems the most prone to this scenario, since the period between when the server sends the headers and when the body reads occur is drawn out, so there's more opportunity for the server to close the connection in the meantime. The error type is this: https://github.com/golang/go/blob/master/src/net/http/h2_bundle.go#L8359 . Unfortunately I don't think there is any way of detecting this directly, but we already check for http2 INTERNAL errors here, so I think it makes sense to add GOAWAY to that check as well. Actually, I think there are other errors that might make sense to add too, but I'll probably stick with GOAWAY for now to be conservative. Also, I'm curious about your use case: have you considered smaller ranged reads? Or some kind of buffering, potentially? Either would probably increase reliability.
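To illustrate the "smaller ranged reads" suggestion, here is a rough sketch that downloads an object in fixed-size chunks so that no single response body stays open for long; the chunk size and function name are arbitrary choices, not recommendations from the library:

```go
// Sketch of chunked ("ranged") downloads: each NewRangeReader call fetches a
// bounded slice of the object, so each HTTP response body is short-lived.
package gcsutil

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

const chunkSize = 8 << 20 // 8 MiB, arbitrary

func downloadInChunks(ctx context.Context, obj *storage.ObjectHandle, w io.Writer) error {
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		return err
	}
	for off := int64(0); off < attrs.Size; off += chunkSize {
		r, err := obj.NewRangeReader(ctx, off, chunkSize)
		if err != nil {
			return err
		}
		_, copyErr := io.Copy(w, r)
		r.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```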
This error can occur in calls to reader.Read if the CFE closes the connection between when the response headers are received and when the caller reads from the reader. Fixes googleapis#3040
Errors from reading the response body in Reader.Read will now always trigger a reopen() call (unless the context has been canceled). Previously, this was limited to only INTERNAL_ERROR from HTTP/2. Fixes #3040
The fix for this has been released in storage/v1.16.0.
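With that release, a plain read loop should be enough, since transient body-read errors are re-opened inside the reader itself. A small usage sketch, with hypothetical bucket and object names passed in by the caller:

```go
// After storage/v1.16.0, Reader.Read reopens the download on transient body
// errors, so a straightforward copy like this should suffice.
package gcsutil

import (
	"context"
	"io"
	"os"

	"cloud.google.com/go/storage"
)

func download(ctx context.Context, client *storage.Client, bucket, object string, dst *os.File) error {
	r, err := client.Bucket(bucket).Object(object).NewReader(ctx)
	if err != nil {
		return err
	}
	defer r.Close()
	_, err = io.Copy(dst, r)
	return err
}
```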
Is your feature request related to a problem? Please describe.
Occasionally we get failures to download objects from GCS because too many requests are being made at a time, resulting in a load-shed error. We considered adding retries in our own code, but were wondering if this could be handled automatically in the storage library instead. Perhaps there could be automatic retries with exponential backoff in this case?