
Should ResponseErrors be retriable? #303

Closed · Tracked by #1583
benesch opened this issue Nov 22, 2021 · 10 comments
Labels: bug (This issue is a bug.)

Comments


benesch commented Nov 22, 2021

I just noticed that ResponseErrors are not automatically retried by the SDK. The specific example I was surprised by was an S3 ListObjectsV2 request that failed due to an incomplete body:

ResponseError { err: hyper::Error(Body, Custom { kind: UnexpectedEof, error: IncompleteBody }), raw: Response { inner: Response { status: 200, version: HTTP/1.1, headers: {"content-type": "application/xml; charset=utf-8", "content-length": "1064", "x-amz-bucket-region": "us-east-1", "x-amzn-requestid": "B7K3ZBYJ4GOS6NNXINT08OONX033EREZYSK4I1VPIK1PU0BI9VYR", "access-control-allow-origin": "*", "last-modified": "Mon, 22 Nov 2021 04:56:57 GMT", "x-amz-request-id": "0E15FCA83A3F698A", "x-amz-id-2": "MzRISOwyjmnup0E15FCA83A3F698A7/JypPGXLh0OVFGcJaaO3KW/hRAqKOpIEEp", "accept-ranges": "bytes", "content-language": "en-US", "access-control-allow-methods": "HEAD,GET,PUT,POST,DELETE,OPTIONS,PATCH", "access-control-allow-headers": "authorization,cache-control,content-length,content-md5,content-type,etag,location,x-amz-acl,x-amz-content-sha256,x-amz-date,x-amz-request-id,x-amz-security-token,x-amz-tagging,x-amz-target,x-amz-user-agent,x-amz-version-id,x-amzn-requestid,x-localstack-target,amz-sdk-invocation-id,amz-sdk-request", "access-control-expose-headers": "etag,x-amz-version-id", "connection": "close", "date": "Mon, 22 Nov 2021 04:57:03 GMT", "server": "hypercorn-h11"}, body: SdkBody { inner: Taken, retryable: false } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }

(FWIW I'm specifically testing retry behavior with injected faults; this isn't an actual error I've observed in production.)

Is it intentional that these aren't automatically retried? I'm not actually sure what the other SDKs do in this situation. I guess it could be a problem for non-idempotent requests, because they'll get duplicated, but that can already happen with requests that are retried after a timeout, where the client doesn't know whether the server actually processed the request already.

@benesch benesch added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Nov 22, 2021

rcoh commented Nov 22, 2021

This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API plus with_retry_policy to drop in a different policy.
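To illustrate the "drop in a different policy" idea, here is a minimal, self-contained sketch. The trait and type names below (ClassifyRetry, RetryResponseErrors, the pared-down SdkError) are stand-ins for illustration, not the actual smithy-rs API; the real hook is with_retry_policy on the low-level operation, as mentioned above.

```rust
// Hypothetical sketch: a pluggable classifier that decides, per result,
// whether a retry should happen. A custom policy like this could choose
// to treat ResponseError as transient.

#[derive(Debug, PartialEq)]
enum RetryKind {
    Retry,
    Unnecessary,
    UnretryableFailure,
}

// Stand-in for the SDK's error type: only the variants this sketch needs.
#[derive(Debug)]
enum SdkError {
    ResponseError(String),
    ServiceError(String),
}

trait ClassifyRetry {
    fn classify(&self, result: Result<(), &SdkError>) -> RetryKind;
}

// A drop-in policy that treats ResponseError as retryable.
struct RetryResponseErrors;

impl ClassifyRetry for RetryResponseErrors {
    fn classify(&self, result: Result<(), &SdkError>) -> RetryKind {
        match result {
            Ok(()) => RetryKind::Unnecessary,
            Err(SdkError::ResponseError(_)) => RetryKind::Retry,
            Err(SdkError::ServiceError(_)) => RetryKind::UnretryableFailure,
        }
    }
}

fn main() {
    let policy = RetryResponseErrors;
    let err = SdkError::ResponseError("incomplete body".to_string());
    assert_eq!(policy.classify(Err(&err)), RetryKind::Retry);
}
```

The point is only structural: whatever object the operation is configured with gets the final say on retry classification, so a policy that understands ResponseError can be substituted without changing the rest of the client.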

@rcoh rcoh removed the needs-triage This issue or PR still needs to be triaged. label Nov 22, 2021
@rcoh rcoh self-assigned this Nov 22, 2021

rcoh commented Nov 22, 2021

As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.


benesch commented Nov 22, 2021

> As an aside, I'm curious about the framework / tooling you're using for fault generation, we'd like to use something similar in the SDK.

Ah yeah, it's not too complicated. It's closer to random chaos testing than it is to principled fault injection. Specifically we test that Materialize's S3 source is resilient to transient network failures by putting Toxiproxy between materialized and localstack. We add a toxic that drops out the connection after various amounts of traffic, sleep a bit, restore the connection, and then verify that Materialize still receives all the data from S3 that it ought to.


benesch commented Nov 23, 2021

> this is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + with_retry_policy to drop in a different policy.

Sounds good! For now we're just blindly retrying any errors that we see with some backoff, which works well enough: https://github.com/MaterializeInc/materialize/blob/dbf6c5df193f785e5bff364b5f21b1957a7c0322/src/dataflow/src/source/s3.rs#L316-L325. The downside is that we'll retry permanent errors a few times, but the upside is that we're 100% guaranteed to catch any transient errors.
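The "blindly retry with backoff" approach can be sketched in a few lines of plain Rust. This is an illustrative stand-in, not Materialize's actual code (linked above); the function name, attempt cap, and delay values are all assumptions for the example.

```rust
use std::time::Duration;

// Retry every error up to a cap, doubling the delay between attempts.
// This trades a few wasted retries on permanent errors for a guarantee
// that transient errors are always absorbed.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}

fn main() {
    // An operation that fails twice, then succeeds: the wrapper absorbs
    // the transient failures and returns the eventual success.
    let mut calls = 0;
    let result = retry_with_backoff(5, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```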


rcoh commented Nov 26, 2021

Thanks again for reporting this. These errors should be retried automatically by the SDK, and we're working on a fix.

@Velfi Velfi added bug This issue is a bug. and removed guidance Question that needs advice or information. labels Apr 1, 2022

jdisanti commented Apr 1, 2022

I think this was fixed in version 0.8.0 by smithy-lang/smithy-rs#1197

Are you still running into this?

@jdisanti jdisanti added the response-requested Waiting on additional info and feedback. Will move to 'closing-soon' in 7 days. label Apr 1, 2022

rcoh commented Apr 1, 2022

Hmm, I'm not sure. ResponseError isn't covered in the retry policy:

        let (err, response) = match err {
            Ok(_) => return RetryKind::Unnecessary,
            Err(SdkError::ServiceError { err, raw }) => (err, raw),
            Err(SdkError::DispatchFailure(err)) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }
            Err(_) => return RetryKind::UnretryableFailure,
        };

The issue is that ResponseError's inner error is currently generic, which prevents us from downcasting it. I think we should change the inner type to ConnectorError.
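To see why the generic inner type is the obstacle, here is a small self-contained sketch (the ConnectorError struct and is_transient function below are illustrative stand-ins, not the real smithy-rs types). A classifier holding a boxed `dyn Error` can only recover retry information by downcasting to a concrete type it already knows about; if the inner type is an arbitrary generic, there is nothing to try downcasting to.

```rust
use std::error::Error;
use std::fmt;

// Stand-in for a concrete connector error carrying retry-relevant info.
#[derive(Debug)]
struct ConnectorError {
    is_io: bool,
}

impl fmt::Display for ConnectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connector error (io: {})", self.is_io)
    }
}

impl Error for ConnectorError {}

// A classifier can only act on the boxed error if the downcast succeeds.
fn is_transient(err: &(dyn Error + 'static)) -> bool {
    match err.downcast_ref::<ConnectorError>() {
        Some(c) => c.is_io, // known concrete type: inspect it
        None => false,      // opaque error: no information to act on
    }
}

fn main() {
    let boxed: Box<dyn Error> = Box::new(ConnectorError { is_io: true });
    assert!(is_transient(boxed.as_ref()));

    // An unrelated error type downcasts to None, so it is never retried.
    assert!(!is_transient(&fmt::Error));
}
```

Pinning the inner type to a concrete ConnectorError sidesteps this: the classifier can read fields like a timeout/IO flag directly instead of guessing at downcast targets.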

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to 'closing-soon' in 7 days. label Apr 1, 2022
@Velfi Velfi moved this to Researching in AWS Rust SDK Public Roadmap Apr 4, 2022
@jdisanti

From discussion with @rcoh: The inner Box<Error> on ResponseError needs to be passed into RetryClassifier.

@jdisanti

ResponseErrors are retried as of release-2022-09-21.

Repository owner moved this from Pending Release to Done in AWS Rust SDK Public Roadmap Sep 21, 2022
