feat: implement basic retry support in runtime #487
Conversation
 * @param cause An underlying exception which caused this exception.
 * @param attempts The number of attempts made before this failure was encountered.
 * @param lastResponse The last response received from the retry strategy before this failure was encountered.
 * @param lastException The exception caught by the retry strategy before this failure was encountered. Note that the
suggestion: if there is a way to further distinguish between `cause` and `lastException`, for clarity...
It's a difficult nuance to convey. `cause` is pretty much the standard exception re-wrapping mechanism that exists throughout Java/Kotlin. `lastException` is more analogous to `lastResponse`...it's the last output we received from the block but it's not necessarily the underlying cause of this exception.

Take for example a retry loop that keeps getting a throttling exception (which is retryable). Eventually, we're going to give up and throw a `TimedOutException` or `TooManyAttemptsException`. In those cases, the most recent retryable exception we got from our code block isn't the cause of the timeout or whatever, it's merely the last thing we saw and may be interesting to code that catches the timeout.
Can you think of a succinct way to capture that in KDoc?
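One possible way to capture that nuance in KDoc, sketched as a hypothetical class. The names mirror the thread, but the wording, signature, and class name here are illustrative only, not the PR's actual code:

```kotlin
/**
 * @param message A description of the failure.
 * @param cause The underlying exception which caused *this* exception to be
 * thrown, if any (standard Java/Kotlin exception chaining).
 * @param attempts The number of attempts made before giving up.
 * @param lastException The most recent exception returned by the retried
 * block. Unlike [cause], this is not necessarily why this exception was
 * thrown; it is merely the final output observed before the strategy gave up
 * (e.g., a retryable throttling exception seen just before a timeout).
 */
class TooManyAttemptsExceptionSketch(
    message: String,
    cause: Throwable? = null,
    val attempts: Int,
    val lastException: Throwable? = null,
) : RuntimeException(message, cause)
```

The key phrase is contrasting "why this exception was thrown" (`cause`) with "the final output observed" (`lastException`).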
Yeah it's a tough nut to crack. In what cases will both `lastResponse` and `lastException` contain a value? One minor improvement may be to use the word `previous` rather than `last`. IMO `last` is somewhat ambiguous as it can mean just the last time something happened, which could be multiple iterations in the past, whereas `previous` to me implies something that only could have occurred in the last iteration. As I write this it's pretty obvious how subjective this all is, so 🤷.
> In what cases will both `lastResponse` and `lastException` contain a value?
Under no circumstances will both fields contain a value. The last result from a block execution can only be a response or an exception, not both.
In that case would it make sense to represent the combination as a union via a sealed class?
We had a union called `Outcome` in the design review but it seemed like popular opinion was against it. I can create a union here but it will conflict conceptually and semantically with `Result` (which is not a union).

Do you have any suggestion on names for the union and the members?
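For reference, the union under discussion could be sketched as a sealed class. `Outcome` and the member names below are placeholders, not a settled API; the design-review version may have differed:

```kotlin
// Hypothetical union of "last thing we saw from the block": either a
// response or an exception, never both.
sealed class Outcome<out R> {
    /** The block completed and returned a response. */
    data class Response<out R>(val response: R) : Outcome<R>()

    /** The block threw an exception. */
    data class Exception(val exception: Throwable) : Outcome<Nothing>()
}
```

The ergonomic benefit over two nullable fields is that a `when` over the sealed type covers both arms exhaustively, and the compiler enforces that exactly one arm applies.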
 * @param result The [Result] of the retry attempt.
 * @return A [RetryDirective] indicating what action the [RetryStrategy] should take next.
 */
fun evaluate(result: Result<R>): RetryDirective
question

I feel this was brought up in discussion but I forget: what's the reasoning behind which functions suspend and which do not?
A `RetryStrategy` has to be able to suspend to sleep/delay between attempts. A policy (as far as I can tell) shouldn't need to suspend to provide an answer to whether a result should be retried or not.
For retries, I put `suspend` on functions that would either need to execute the code block (which probably makes IO calls) or delay (e.g., backoff, token bucket interactions, etc.). All the retry policies I've imagined so far happen in straightforward, on-box logic. While I suppose it's technically possible some retry policy could involve network calls or local storage calls, it seems unlikely that we'll ever need that for the bare Smithy or SDK use cases.
    val jitter: Double,
) {
    init {
        require(initialDelayMs > 0) { "initialDelayMs must be greater than 0" }
comment

Do we want to be more conservative here? Setting initial delay to 0 and scale factor to 1 seems like something people may want to do (for reasons) and may be a bad trip server-side. I wonder if other SDKs constrain the lower bounds more...
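For context, the delay computation and the lower bounds under discussion might look roughly like this. The option names mirror the PR, but the validation and math below are an assumption, not the actual implementation:

```kotlin
import kotlin.math.pow
import kotlin.random.Random

// Illustrative exponential-backoff-with-jitter options, with the tighter
// lower bounds the comment suggests (delay > 0, scale factor >= 1).
data class BackoffOptionsSketch(
    val initialDelayMs: Long,
    val scaleFactor: Double,
    val jitter: Double, // 0.0 = no jitter, 1.0 = full jitter
) {
    init {
        require(initialDelayMs > 0) { "initialDelayMs must be greater than 0" }
        require(scaleFactor >= 1.0) { "scaleFactor must be at least 1" }
        require(jitter in 0.0..1.0) { "jitter must be between 0 and 1" }
    }
}

// delay = initialDelay * scaleFactor^attempt, reduced by up to `jitter` percent.
fun calculateDelayMs(options: BackoffOptionsSketch, attempt: Int, random: Random = Random.Default): Long {
    val base = options.initialDelayMs * options.scaleFactor.pow(attempt)
    val jitterFactor = 1.0 - options.jitter * random.nextDouble()
    return (base * jitterFactor).toLong()
}
```

With `initialDelayMs = 0` or `scaleFactor = 1` the delay never grows, which is exactly the hammering behavior the comment worries about.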
 * The default backoff configuration to use.
 */
val Default = ExponentialBackoffWithJitterOptions(
    initialDelayMs = 10, // start with 10ms
comment
Thank you for the clarifying comment, very helpful
My default posture is to document every public type and member, although I concede that for some members, the name and placement alone is all the documentation many will need. Do you think there's better documentation to be written here?
No, I was just saying that these comments let me quickly understand how the parameters work.
protected open fun evaluateOtherExceptions(ex: Throwable): RetryDirective? = null

private fun evaluateNonSdkException(ex: Throwable): RetryDirective? =
    // TODO Write logic to find connection errors, timeouts, stream faults, etc.
question

Is this follow-up work, or blocked on something else?
I'd hoped this very PR would spawn a discussion about how to fill this in. Do we have a sane way of detecting unmodeled exceptions for things like IO?

I'd imagine the `HttpClientEngine` implementations are what will be throwing most of these but the only engine-level exception I can find is `HttpClientEngineClosedException`. Absent a common exception layer, it seems like the retry policy would have to know the individual exceptions thrown by implementations (which sounds too tightly coupled).

Do we need a more clearly defined set of exceptions thrown by engines? Are there non-engine sources of useful exceptions?
It's true that we probably need to have more taxonomy for particular exceptions but I'd start with at least classifying transient errors off raw response codes (we can leave this to `aws-sdk-kotlin` if we like since the SEP for this behavior is AWS specific).
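A first pass at such a classification might look like this. The status-code set below follows common retry guidance (throttling plus transient server-side failures) and is an assumption, not a confirmed SEP-defined list:

```kotlin
// Hypothetical classification of transient errors off raw HTTP status codes.
fun isTransientStatusCode(code: Int): Boolean = when (code) {
    429 -> true                // throttling
    500, 502, 503, 504 -> true // server-side transient failures
    else -> false
}
```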
} else {
    // Prep for another loop
    backoffDelayer.backoff(attempt)
    fromToken.scheduleRetry(evaluation.reason)
question
So this is the only condition that would result in a recursive call?
Affirmative.
Fantastic. Some overall minor comments/suggestions/questions.
/**
 * Set if an error represents a throttling condition
 */
val Throttling: AttributeKey<Boolean> = AttributeKey("Throttling")
suggestion

This name irks me for some reason. I think it's because `Retryable` is an adjective but `Throttling` is a verb/noun? Maybe there is a better name? I don't have a great suggestion... `ThrottlingError`? `IsThrottlingError`?
/**
 * An object that can be used to delay between iterations of code.
 */
interface BackoffDelayer {
style: `fun interface`?
name suggestion: `DelayProvider` or `RetryDelayProvider`
> style: `fun interface`?

Unfortunately, `suspend` functions are not allowed in SAM interfaces.
That was true at one point but I thought it was lifted? It seems to compile fine locally
It appears you're right. My IDE lied to me.
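For reference, a minimal compile check of the idea discussed above. `DelayProvider` follows the rename suggested in this thread; the recording lambda and the synchronous driver are purely illustrative:

```kotlin
import kotlin.coroutines.*

// A fun interface with a suspending member compiles on recent Kotlin versions.
fun interface DelayProvider {
    suspend fun backoff(attempt: Int)
}

var lastAttempt = -1

// SAM conversion: a lambda satisfies the single abstract suspend method.
val recordingDelayer = DelayProvider { attempt -> lastAttempt = attempt }

// Minimal synchronous driver for demonstration (nothing here actually suspends).
fun drive(provider: DelayProvider, attempt: Int) {
    suspend { provider.backoff(attempt) }
        .startCoroutine(Continuation(EmptyCoroutineContext) { it.getOrThrow() })
}
```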
package aws.smithy.kotlin.runtime.retries

/**
 * A policy that evaluates a [Result] from a retry attempt and indicates action a [RetryStrategy] should take next.
nit: "the action"
 * SPDX-License-Identifier: Apache-2.0.
 */

package aws.smithy.kotlin.runtime.retries
comment

This package name is totally fine but I also wonder if `aws.smithy.kotlin.runtime.execution` wouldn't be an alternative parent that is more generalized (other things related to how an operation or block of code is executed).
I thought about finding a more general home for these but eventually I'd created 11 different runtime/retry-related files. Seemed like that was past the threshold where a shared namespace was cool.
Ya that's fair. Like I said not a big deal.
 */
data class StandardRetryTokenBucketOptions(
    val maxCapacity: Int,
    val refillUnitsPerSecond: Int,
suggestion: `refillRate`
I wanted to be explicit about the units here. `refillRate = 10` doesn't really indicate what unit of thing is being refilled nor how often.
yeah I mean you could do `refillRatePerSecond` or be fancy and define a type. Something like `value class TokenRefillRate(val tokensPerSecond: Int)`. Not a big deal, my brain just kept saying "there's a name for xyz measured per unit of time... RATE".
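A runnable version of that idea might look like this. The type names mirror the comment's example and are illustrative only, not part of the actual API:

```kotlin
// A value class keeps the units in the type without boxing overhead at runtime.
@JvmInline
value class TokenRefillRate(val tokensPerSecond: Int)

// Hypothetical options type carrying the typed rate.
data class TokenBucketOptionsSketch(
    val maxCapacity: Int,
    val refillRate: TokenRefillRate,
)
```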
 * SPDX-License-Identifier: Apache-2.0.
 */

package aws.smithy.kotlin.runtime.retries.impl
fix

The file location of this looks to be `runtime/retries.impl` rather than `runtime/retries/impl`.
@Test
fun testClientException() {
    val result = test(ClientException())
    assertEquals(RetryDirective.RetryError(RetryErrorType.ClientSide), result)
question

Should this attempt to retry any `ClientException` or should this terminate and fail? I would assume most client-side exceptions (if not all) are not actually retryable.
That's a good question and I suppose it depends on what else is a `ClientException`. We just decided to make all of the retryer exceptions (e.g., `TimedOutException`) inherit from `ClientException` and those are technically retryable (in that they could succeed with more tries).

Taking a look through the runtime, looks like most other instances are serialization (likely not retryable) or configuration related (possibly retryable if the configuration is coming from somewhere remote). Any other instances we can think of?
No, but in general `ClientException` is basically reserved for runtime-related exceptions (not to be confused with `ServiceException` where the `ErrorType` is set to `Client`).
 * The standard implementation of a [RetryTokenBucket].
 * @param options The configuration to use for this bucket.
 */
class StandardRetryTokenBucket(
question

What makes this `Standard`?
question

This only supports `standard` not `adaptive` behavior, correct?
> What makes this `Standard`?

This is standard in that it's recommended and the only implementation provided by smithy-kotlin.

> This only supports `standard` not `adaptive` behavior, correct?

For now, yes that's correct.
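For context, the acquire/refill math of a standard token bucket can be sketched as follows. This is an illustrative model with an injected clock for determinism, not the actual `StandardRetryTokenBucket` implementation:

```kotlin
// Minimal token bucket: capacity is consumed on acquire and refilled over
// time at refillUnitsPerSecond, capped at maxCapacity.
class TokenBucketSketch(
    private val maxCapacity: Int,
    private val refillUnitsPerSecond: Int,
    private val nowMs: () -> Long, // injected clock for testability
) {
    private var capacity = maxCapacity
    private var lastRefillMs = nowMs()

    private fun refill() {
        val now = nowMs()
        val refilled = (now - lastRefillMs) * refillUnitsPerSecond / 1_000
        capacity = minOf(maxCapacity, capacity + refilled.toInt())
        lastRefillMs = now
    }

    /** Attempts to remove [units] tokens; returns false if not enough are available. */
    fun tryAcquire(units: Int): Boolean {
        refill()
        if (units > capacity) return false
        capacity -= units
        return true
    }
}
```

A real retry bucket would also return capacity on success (or suspend until capacity is available) rather than simply failing fast.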
import kotlin.time.Duration
import kotlin.time.ExperimentalTime

class StandardRetryTokenBucketTest {
question
Can we get the SEP standard mode test cases in here?
private const val timeToleranceMs = 20

suspend fun <T> TestCoroutineScope.assertTime(expectedMs: Int, block: suspend () -> T): T {
<3
private suspend fun <R> throwFailure(token: RetryToken, attempt: Int, result: Result<R>): Nothing {
    token.notifyFailure()
    throw RetryFailureException(
        "The operation failed",
suggestion

Under the assumption that this exception's message will be visible to customers, perhaps adding some of the KDoc content into the message would be clarifying, e.g., "The operation resulted in a non-retryable failure".
Incredible effort, very clean and well thought out. Bravo @ianbotsf!
* `BackoffDelayer` → `DelayProvider`
* add integration tests sourced from spec
* `Throttling` → `ThrottlingError`
* retry exceptions now inherit from `ClientException`
* updated retry defaults to 20s total, 3 tries
* refactored `StandardRetryPolicy` evaluation strategies for clarity
* added handling of `CancellationException`
* cleaned up token failure handling
* improved non-retryable exception message
* moved companion object constant to top-level

additionally, as part of getting integration tests to work:

* added circuit-breaker mode for token bucket
* added max backoff configuration option
Other than previous suggestions overall looks good. Nothing blocking from my end.
import kotlinx.coroutines.test.TestCoroutineScope
import kotlin.test.assertTrue

private const val timeToleranceMs = 20

@ExperimentalCoroutinesApi
why are we propagating this? Just opt-in
package aws.smithy.kotlin.runtime.retries.impl

val standardRetryIntegrationTestCases = mapOf(
nice
    else -> fail("Unexpected outcome for $name: ${finalState.outcome}")
}

if (finalState.outcome != Outcome.RetryQuotaExceeded) {
is it possible to make any assertion about the expected retry quota?
Good idea, I can add that.
… vs propagating; added assertion on retry quota remaining for the StandardRetryStrategy integration test
Issue #
Addresses #224.
Description of changes
This change adds support for retries in the runtime (but does not yet integrate them into codegen—separate PR coming). See the included design doc (docs/design/retries.md) for more details.
Revision 2

Addressed the main points of PR feedback:

* `BackoffDelayer` → `DelayProvider`
* `Throttling` → `ThrottlingError`
* retry exceptions now inherit from `ClientException`
* refactored `StandardRetryPolicy` evaluation strategies for clarity
* added handling of `CancellationException`

Additionally, while adding the integration tests:

* added circuit-breaker mode for token bucket
* added max backoff configuration option
Scope
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.