-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature Request] Allow configuration of RetryClassifier on the client #2417
Comments
What is your use-case for needing to do this? |
I may want to retry a wider set of exceptions that the default retrclassifier. For example - I'm interacting with DDB and my transactions are written in such a way that I want to retry any TransactionCanceledException that happens because of a TransactionConflict reason. Or if I'm using a service that throws ConflictExceptions and for my application I always want to retry these. Or if I'm using an eventually consistent service and I always want to retry the not-found errors that might come after fresh creates. |
One more thing to tag onto this issue - the smithy rs client is in general extremely restrictive. In other languages its easy to keep the default behaviour but plug into the logic to write some logs or post a metric when retries happen. The rust sdk makes that impossible. Either an entire retry config has to be written which cannot benefit from the default logic, or no additional behaviour can be customized. I've self-served some of this in a hacky way by depending on some presumably unstable interfaces for the way the smithy clients use tracing, but this should be supported in a well-defined way. |
I want to first say that I understand your frustration. I often feel that the user experience of the SDK is lacking, and it's not easy to extend the client with custom behavior outside of things like operation customization.
I agree that this should not be so difficult. However, it helps to remember that we're maintaining a client that is used to interact with 300+ services and we're still in developer preview. We've had to prioritize what features we work on and that means that some features don't get much polish, if any.
Rust is a statically typed language, which means that any time we want something to be fully customizable by users, we need to rely on traits. However, as we add more traits, our code becomes more and more difficult to maintain. One need only look at the trait bounds on our middleware to see the challenges we're facing. In many cases, we believe that the right thing to do is to design less, and design simpler. While this prevents users like yourself from fully customizing the client, it makes the client easier to maintain for our team, and easier to use for customers with simpler requirements.
I agree that this should be supported in a well defined way that's pleasant to use and I'm confident that it will be by the time we release our 1.0 version. I am currently working on a re-write of our client which will allow users more fine-grained control over what happens during the sending of a request. Additionally, the rewrite is based on a new internal standard for how AWS SDKs should work, so it will also bring our SDK closer to the design of other SDKs, meaning less surprises for people like yourself that use several languages to interact with AWS. I appreciate your passion for this issue, and I hope that you'll find our SDK to be more useful in the future. In the meantime, I hope that our default tracing logs are helpful to you. If you're looking for some specific improvements to our current logging, then let's get a list of possible improvements started. We can then discuss the ideas on a point-by-point basis. It's most helpful to us when you can describe what you hope to accomplish, rather than how you'd like to accomplish it. As a stopgap, I'd suggest running your request in a loop with your own classification method: let sdk_config = aws_config::load_from_env().await;
let client = aws_sdk_dynamodb::Client::new(&sdk_config);
let mut attempts = 0;
let max_attempts = 3;
loop {
if attempts >= max_attempts {
break;
}
match client.your_operation.send().await {
Ok(res) => {
/* it worked, so we can stop */
break;
}
Err(err) => {
let service_err = err.into_service_error();
let cancellation_reasons = service_err.cancellation_reason().unwrap_or_default();
if cancellation_reason
.flat_map(|reason| reason.code())
.find(|code| code == "TransactionConflict")
{
// it failed due to a transaction conflict, we'll try again
attempts += 1;
continue;
} else {
// else, it's some error we don't want to retry
break;
}
}
}
} I wrote that code in GitHub, so it may not be quite right, but I hope it gets you closer to your goal. |
Thanks for the comprehensive response @Velfi. I understand that the client is in developer preview, I'm not making this issue to say that there's a bug, but to make sure this is tracked so that it can be put on the radar for implementation in future. I appreciate all the work that's been done so far.
I understand this desire, but I think for the tooling that aws customers are going to use this functionality has to be there. It can be hidden from the simple customers, but the functionality should not be blocked. We (aws customers) like you go oncall, and have to investigate errors and latency issues in our systems, many of these will be related to our dependencies. We need to have clear insight into cases were retries are happening, what the failures are and how long each attempt takes so that we can have good metrics, and good logs for debugging specific cases where (as one example) we have a request that took very long and we want to find out why.
I see that we are in agreement :) Looking forward to seeing the next version! I definitely do understand the challenges though! I am constantly spending much longer than expected digging into rust code and jumping around trying to find the bounds on the mysterious type parameters on the functions I care about. I am planning to implement a similar stopgap to what you suggest, the unfortunate part is that we make many different calls to deps our service and so need to add this layer at all those call sites instead of being able to configure it once. I will most likely create my own client that wraps the smithy one, implements all the methods we use, and routes to the smithy client wrapped with our retry logic. Thanks for the code snippet. |
To phrase what I am looking for without going into implementation details
Tangentially related to this, I'd love to have support for https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics.html - this doesn't give all the retry insights I'm talking about but it gives a lot of other useful operational metrics for free. |
There's one more thing that came up when I bit the bullet to go ahead and implement this retry at the application layer. It would be nice to have this functionality wrapped up somewhere so we don't have to add this logic all-over our code wherever we make a service call. One idea to do this would be to make my own FooServiceRetryingClient that has it's own FooServiceClient and to then implement all the service functions with the retry logic added in. Unfortunately in this case - that doesn't work either. Because the rust client only exposes calls through the fluent builders and send() and none of these are traits. This means there's no good way to create your own client that wraps the existing client and adds some functionality. You have to rather add your logic all over the code whenever .send() is called. EG: if the client had functions like Here's the most generic way I could come up with - it ain't super pretty and it still has to be called at every .send() location but at least the main logic can be re-used. Fortunately status code works for me, because the typed rust enum ErrorKinds for each operation are different and so generic logic can't be implemented for those. pub async fn retry_conflict<
R,
T: Future + Future<Output = Result<R, SdkError<E>>>,
F,
E,
>(
mut f: F,
) -> Result<R, SdkError<E>>
where
F: FnMut() -> T,
{
let mut num_attempts = 0;
loop {
let res = f().await;
match &res {
Ok(_) => return res,
Err(e) => match e {
ServiceError(r) => {
if r.raw().http().status().as_u16() != 409 {
return res;
}
}
_ => {
return res;
}
},
}
tracing::info!(
"Retrying attempt {} which failed on resource conflict",
num_attempts + 1
);
num_attempts += 1;
if num_attempts >= 2 {
return res;
}
}
} Used like let req = client.foo().some_param("bar");
let response = retry_conflict(|| req.clone().send()).await; |
We're working on a replacement for the current smithy client that will remove all (or nearly all) mysterious type parameters needed to send a request. Additionally, it will make granular logging of the request/response lifecycle a simple manner. I'm very excited to get it in your hands and see if we're still not meeting all of your needs. I'll reach out to you once it's ready for testing. You can view the preliminary version of the change in these PRs: Thanks for your additional comments. Regarding your "retry resource conflicts", I'll be sure to consider that while I integrate our retry handler into the new client. I'm imagining an API similar to: let res = client.foo()
.some_param("bar")
.with_runtime_plugin(
RetryOrchestrator::default()
.also_retry_err(aws_sdk_blah::Error::ResourceConflict)
.build()
)
.send()
.await?; |
The changes you made look really cool. Thanks for sharing! I only saw this comment now or I would have looked at it sooner! The API you suggest with The RFC and PRs you submitted also address some observability issues we are having with the rust SDK. Rust (and many non java languages) don't have the awesome sdk metrics that java has https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics.html We however do want metrics for our dependency calls and in addition to knowing how long a whole depdendency interaction takes we also want metrics for how many retries there were and how long each individual attempt took. We currently time the whole interaction by making all our SDK calls in a closure that we pass to an instrumentation function which adds the metric. This would let us to get all the info we want (including the internal retry / attempt_duration stuff) by instrumenting the client and we'd be able to stop having to use our instrumentation function each time we make a service call with the SDK. We have currently self-serviced some of this by hooking into the spans that the SDK uses, but we understand that this is not a public interface and isn't ideal for us to base our metrics on. |
Retry classifier customization in my current prototype works like this:
Users will be able to configure this by "mapping" the classifiers with a function that will probably look just like this: let res = client.get_object()
<normal get object params>
.customize()
.map_retry_classifier(|classifiers, output_or_error, response| -> Option<ShouldAttempt> {
let should_attempt = classifiers.run(output_or_error, response);
// do whatever you want to classify retries
should_attempt
})
.send()
.await; This way you can front-run the classifier, second-guess it, or ignore it altogether. |
[Read the RFC here](#3018) ## Motivation and Context <!--- Why is this change required? What problem does it solve? --> <!--- If it fixes an open issue, please link to the issue here --> #2417 ## Description <!--- Describe your changes in detail --> Exactly what it says on the tin. I have a related RFC to publish that goes into more depth. ## Testing <!--- Please describe in detail how you tested your changes --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> I wrote an integration test that ensures a custom retry classifier can be set and is called. ## Checklist <!--- If a checkbox below is not applicable, then please DELETE it rather than leaving it unchecked --> - [x] I have updated `CHANGELOG.next.toml` if I made changes to the smithy-rs codegen or runtime crates - [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates ---- _By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice._
We're about to release a new version with the capability to set classifiers. See the above link for more details. |
I believe this was released a while ago. |
This lack of functionality is a significant regression over clients in other languages.
The default retry classifier in smithy is the AwsResponseRetryClassifier.
Smithy allows for the retry config to be modified (max attempts and backoff params) but it doesn't allow for any way to inject customer retry classification. It's possible (but not super ergonomic) to inject one's own retry policy but then one loses out on being able to use the supplied configuration for max attempts and backoff and the good default classification which one might want to just add a few cases to.
The generated code does create a customize method on the client but it is parametrized to the specific AwsResponseRetryClassifier so even when using customize on the request level a custom classifier cannot be supplied. Even if it could, doing this at the request level might be needed in some cases but the client should allow a client level configuration of a retry classifier.
The locked down cuztomize method:
TLDR, support a function on the client builder to set a retry classifier. This will allow consumers of smithy clients to use the AwsResponseRetryClassifier and based on its result to modify the classification however they choose. This can be done with java and go, and is functionality that a client to be used in production by customers should support.
This more specific issue was created after some digging - resolved the old issue.
The text was updated successfully, but these errors were encountered: