
Should ResponseErrors be retriable? #303

Closed · Tracked by #1583
benesch opened this issue Nov 22, 2021 · 10 comments
Labels: bug (This issue is a bug.)

Comments


benesch commented Nov 22, 2021

I just noticed that ResponseErrors are not automatically retried by the SDK. The specific example I was surprised by was an S3 ListObjectsV2 request that failed due to an incomplete body:

ResponseError { err: hyper::Error(Body, Custom { kind: UnexpectedEof, error: IncompleteBody }), raw: Response { inner: Response { status: 200, version: HTTP/1.1, headers: {"content-type": "application/xml; charset=utf-8", "content-length": "1064", "x-amz-bucket-region": "us-east-1", "x-amzn-requestid": "B7K3ZBYJ4GOS6NNXINT08OONX033EREZYSK4I1VPIK1PU0BI9VYR", "access-control-allow-origin": "*", "last-modified": "Mon, 22 Nov 2021 04:56:57 GMT", "x-amz-request-id": "0E15FCA83A3F698A", "x-amz-id-2": "MzRISOwyjmnup0E15FCA83A3F698A7/JypPGXLh0OVFGcJaaO3KW/hRAqKOpIEEp", "accept-ranges": "bytes", "content-language": "en-US", "access-control-allow-methods": "HEAD,GET,PUT,POST,DELETE,OPTIONS,PATCH", "access-control-allow-headers": "authorization,cache-control,content-length,content-md5,content-type,etag,location,x-amz-acl,x-amz-content-sha256,x-amz-date,x-amz-request-id,x-amz-security-token,x-amz-tagging,x-amz-target,x-amz-user-agent,x-amz-version-id,x-amzn-requestid,x-localstack-target,amz-sdk-invocation-id,amz-sdk-request", "access-control-expose-headers": "etag,x-amz-version-id", "connection": "close", "date": "Mon, 22 Nov 2021 04:57:03 GMT", "server": "hypercorn-h11"}, body: SdkBody { inner: Taken, retryable: false } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }

(FWIW I'm specifically testing retry behavior with injected faults; this isn't an actual error I've observed in production.)

Is it intentional that these aren't automatically retried? I'm not actually sure what the other SDKs do in this situation. I guess it could be a problem for non-idempotent requests, because they'll get duplicated, but that can already happen with requests that are retried after a timeout, where the client doesn't know whether the server actually processed the request already.

@benesch benesch added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Nov 22, 2021

rcoh commented Nov 22, 2021

This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API plus with_retry_policy to drop in a different policy.
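To illustrate the "drop in a different policy" idea, here is a minimal, self-contained sketch. The trait and type names below (ClassifyRetry, RetryResponseErrors, the pared-down SdkError) are stand-ins for illustration, not the actual smithy-rs API; the real hook is with_retry_policy on the low-level operation, as mentioned above.

```rust
// Hypothetical sketch: a pluggable classifier that decides, per result,
// whether a retry should happen. A custom policy like this could choose
// to treat ResponseError as transient.

#[derive(Debug, PartialEq)]
enum RetryKind {
    Retry,
    Unnecessary,
    UnretryableFailure,
}

// Stand-in for the SDK's error type: only the variants this sketch needs.
#[derive(Debug)]
enum SdkError {
    ResponseError(String),
    ServiceError(String),
}

trait ClassifyRetry {
    fn classify(&self, result: Result<(), &SdkError>) -> RetryKind;
}

// A drop-in policy that treats ResponseError as retryable.
struct RetryResponseErrors;

impl ClassifyRetry for RetryResponseErrors {
    fn classify(&self, result: Result<(), &SdkError>) -> RetryKind {
        match result {
            Ok(()) => RetryKind::Unnecessary,
            Err(SdkError::ResponseError(_)) => RetryKind::Retry,
            Err(SdkError::ServiceError(_)) => RetryKind::UnretryableFailure,
        }
    }
}

fn main() {
    let policy = RetryResponseErrors;
    let err = SdkError::ResponseError("incomplete body".to_string());
    assert_eq!(policy.classify(Err(&err)), RetryKind::Retry);
}
```

The point is only structural: whatever object the operation is configured with gets the final say on retry classification, so a policy that understands ResponseError can be substituted without changing the rest of the client.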

@rcoh rcoh removed the needs-triage This issue or PR still needs to be triaged. label Nov 22, 2021
@rcoh rcoh self-assigned this Nov 22, 2021

rcoh commented Nov 22, 2021

As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.


benesch commented Nov 22, 2021

> As an aside, I'm curious about the framework / tooling you're using for fault generation, we'd like to use something similar in the SDK.

Ah yeah, it's not too complicated. It's closer to random chaos testing than it is to principled fault injection. Specifically we test that Materialize's S3 source is resilient to transient network failures by putting Toxiproxy between materialized and localstack. We add a toxic that drops out the connection after various amounts of traffic, sleep a bit, restore the connection, and then verify that Materialize still receives all the data from S3 that it ought to.


benesch commented Nov 23, 2021

> this is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + with_retry_policy to drop in a different policy.

Sounds good! For now we're just blindly retrying any errors that we see with some backoff, which works well enough: https://github.com/MaterializeInc/materialize/blob/dbf6c5df193f785e5bff364b5f21b1957a7c0322/src/dataflow/src/source/s3.rs#L316-L325. The downside is that we'll retry permanent errors a few times, but the upside is that we're 100% guaranteed to catch any transient errors.
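The "blindly retry with backoff" approach can be sketched in a few lines of plain Rust. This is an illustrative stand-in, not Materialize's actual code (linked above); the function name, attempt cap, and delay values are all assumptions for the example.

```rust
use std::time::Duration;

// Retry every error up to a cap, doubling the delay between attempts.
// This trades a few wasted retries on permanent errors for a guarantee
// that transient errors are always absorbed.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
                attempt += 1;
            }
        }
    }
}

fn main() {
    // An operation that fails twice, then succeeds: the wrapper absorbs
    // the transient failures and returns the eventual success.
    let mut calls = 0;
    let result = retry_with_backoff(5, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("transient") } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```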


rcoh commented Nov 26, 2021

Thanks again for reporting this. These errors should be retried automatically by the SDK, and we're working on a fix.

@Velfi Velfi added bug This issue is a bug. and removed guidance Question that needs advice or information. labels Apr 1, 2022

jdisanti commented Apr 1, 2022

I think this was fixed in version 0.8.0 by smithy-lang/smithy-rs#1197

Are you still running into this?

@jdisanti jdisanti added the response-requested Waiting on additional info and feedback. Will move to 'closing-soon' in 7 days. label Apr 1, 2022

rcoh commented Apr 1, 2022

Hmm, I'm not sure. ResponseError isn't covered in the retry policy:

        let (err, response) = match err {
            Ok(_) => return RetryKind::Unnecessary,
            Err(SdkError::ServiceError { err, raw }) => (err, raw),
            Err(SdkError::DispatchFailure(err)) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }
            Err(_) => return RetryKind::UnretryableFailure,
        };

The issue is that ResponseError's inner error is currently generic, which prevents us from downcasting it. I think we should change the inner type to ConnectorError.
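To see why the generic inner type is the obstacle, here is a small self-contained sketch (the ConnectorError struct and is_transient function below are illustrative stand-ins, not the real smithy-rs types). A classifier holding a boxed `dyn Error` can only recover retry information by downcasting to a concrete type it already knows about; if the inner type is an arbitrary generic, there is nothing to try downcasting to.

```rust
use std::error::Error;
use std::fmt;

// Stand-in for a concrete connector error carrying retry-relevant info.
#[derive(Debug)]
struct ConnectorError {
    is_io: bool,
}

impl fmt::Display for ConnectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connector error (io: {})", self.is_io)
    }
}

impl Error for ConnectorError {}

// A classifier can only act on the boxed error if the downcast succeeds.
fn is_transient(err: &(dyn Error + 'static)) -> bool {
    match err.downcast_ref::<ConnectorError>() {
        Some(c) => c.is_io, // known concrete type: inspect it
        None => false,      // opaque error: no information to act on
    }
}

fn main() {
    let boxed: Box<dyn Error> = Box::new(ConnectorError { is_io: true });
    assert!(is_transient(boxed.as_ref()));

    // An unrelated error type downcasts to None, so it is never retried.
    assert!(!is_transient(&fmt::Error));
}
```

Pinning the inner type to a concrete ConnectorError sidesteps this: the classifier can read fields like a timeout/IO flag directly instead of guessing at downcast targets.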

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to 'closing-soon' in 7 days. label Apr 1, 2022
@Velfi Velfi moved this to Researching in AWS Rust SDK Public Roadmap Apr 4, 2022
@jdisanti

From discussion with @rcoh: The inner Box<Error> on ResponseError needs to be passed into RetryClassifier.

@jdisanti

ResponseErrors are retried as of release-2022-09-21.

Repository owner moved this from Pending Release to Done in AWS Rust SDK Public Roadmap Sep 21, 2022
