feat: implement basic retry support in runtime #487

ianbotsf · 2021-09-30T17:34:42Z

Issue #

Addresses #224.

Description of changes

This change adds support for retries in the runtime (but does not yet integrate them into codegen—separate PR coming). See the included design doc (docs/design/retries.md) for more details.

Revision 2

Addressed the main points of PR feedback:

BackoffDelayer → DelayProvider
added integration tests sourced from spec
Throttling → ThrottlingError
retry exceptions now inherit from ClientException
updated retry defaults to 20s total, 3 tries
refactored StandardRetryPolicy evaluation strategies for clarity
added handling of CancellationException
cleaned up token failure handling
improved non-retryable exception message
moved companion object constant to top-level

Additionally, while adding the integration tests:

added circuit-breaker mode for token bucket
added max backoff configuration option

Scope

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

kggilmer

PR review in progress...

kggilmer · 2021-10-01T01:23:12Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/Exceptions.kt

+ * @param cause An underlying exception which caused this exception.
+ * @param attempts The number of attempts made before this failure was encountered.
+ * @param lastResponse The last response received from the retry strategy before this failure was encountered.
+ * @param lastException The exception caught by the retry strategy before this failure was encountered. Note that the


suggestion

if there is a way to further distinguish between cause and lastException for clarity...

It's a difficult nuance to convey. cause is pretty much the standard exception re-wrapping mechanism that exists throughout Java/Kotlin. lastException is more analogous to lastResponse...it's the last output we received from the block but it's not necessarily the underlying cause of this exception.

Take for example a retry loop that keeps getting a throttling exception (which is retryable). Eventually, we're going to give up and throw a TimedOutException or TooManyAttemptsException. In those cases, the most recent retryable exception we got from our code block isn't the cause of the timeout or whatever, it's merely the last thing we saw and may be interesting to code that catches the timeout.

Can you think of a succinct way to capture that in KDoc?

Yeah it's a tough nut to crack. In what cases will both lastResponse and lastException contain a value? One minor improvement may be to use the word previous rather than last. IMO last is somewhat ambiguous as it can mean just the last time something happened, which could be multiple iterations in the past, whereas previous to me implies something that only could have occurred in the last iteration. As I write this it's pretty obvious how subjective this all is, so 🤷.

In what cases will both lastResponse and lastException contain a value?

Under no circumstances will both fields contain a value. The last result from a block execution can only be a response or an exception, not both.

In that case would it make sense to represent the combination as a union via a sealed class?

We had a union called Outcome in the design review but it seemed like popular opinion was against it. I can create a union here but it will conflict conceptually and semantically with Result (which is not a union).

Do you have any suggestion on names for the union and the members?
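For illustration, here is a minimal sketch of what such a union could look like. The names AttemptOutcome, Response, and Failure are hypothetical and were not proposed in this PR; the point is only that a sealed class makes "a response or an exception, never both" explicit in the type rather than relying on two nullable fields.

```kotlin
// Hypothetical names: AttemptOutcome, Response, and Failure do not appear in the PR.
sealed class AttemptOutcome<out R> {
    /** The block returned a value on its most recent attempt. */
    data class Response<out R>(val value: R) : AttemptOutcome<R>()

    /** The block threw on its most recent attempt. */
    data class Failure(val exception: Throwable) : AttemptOutcome<Nothing>()
}

/** Describes an outcome; the compiler checks that both cases are handled. */
fun <R> describe(outcome: AttemptOutcome<R>): String = when (outcome) {
    is AttemptOutcome.Response<*> -> "last response: ${outcome.value}"
    is AttemptOutcome.Failure -> "last exception: ${outcome.exception.message}"
}
```

A retry failure exception could then carry a single property of this type, which would sidestep the question of whether lastResponse and lastException can both be set.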

kggilmer · 2021-10-01T01:34:32Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/RetryPolicy.kt

+     * @param result The [Result] of the retry attempt.
+     * @return A [RetryDirective] indicating what action the [RetryStrategy] should take next.
+     */
+    fun evaluate(result: Result<R>): RetryDirective


question

I feel that this was brought up in discussion but I forget, what's the reasoning behind which functions suspend vs which do not?

A RetryStrategy has to be able to suspend to sleep/delay between attempts. A policy (as far as I can tell) shouldn't need to suspend to provide an answer to whether a result should be retried or not.

For retries, I put suspend around functions that would either need to execute the code block (which probably makes IO calls) or delay (e.g., backoff, token bucket interactions, etc.). All the retry policies that I've imagined so far happen in straightforward, on-box logic. While I suppose it's technically possible some retry policy could involve network calls or local storage calls, it seems unlikely that we'll ever need that for the bare smithy or SDK use cases.

kggilmer · 2021-10-01T01:38:37Z

...ntime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/ExponentialBackoffWithJitter.kt

+    val jitter: Double,
+) {
+    init {
+        require(initialDelayMs > 0) { "initialDelayMs must be at least 0" }


comment

Do we want to be more conservative here? Setting initial delay to 0 and scale factor to 1 seems like something people may want to do (for reasons) and may be a bad trip server side. I wonder if other SDKs constrain the lower bounds more...
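To make the lower-bound question concrete, here is a rough sketch of an exponential backoff calculation with validated options. Only initialDelayMs and jitter are visible in the diff above; BackoffOptions, scaleFactor, maxBackoffMs, backoffDelayMs, and the specific bounds chosen are assumptions for illustration and may not match the PR's implementation.

```kotlin
import kotlin.math.min
import kotlin.math.pow
import kotlin.random.Random

// Hypothetical options type; only initialDelayMs and jitter are visible in the diff above.
data class BackoffOptions(
    val initialDelayMs: Int,
    val scaleFactor: Double,
    val jitter: Double,
    val maxBackoffMs: Int,
) {
    init {
        // One possible set of lower bounds; the right limits are the open question in this thread.
        require(initialDelayMs > 0) { "initialDelayMs must be greater than 0" }
        require(scaleFactor >= 1.0) { "scaleFactor must be at least 1" }
        require(jitter in 0.0..1.0) { "jitter must be between 0 and 1" }
        require(maxBackoffMs >= initialDelayMs) { "maxBackoffMs must be at least initialDelayMs" }
    }
}

/** Delay before retry number [attempt] (1-based): grown exponentially, capped, then randomly reduced. */
fun backoffDelayMs(options: BackoffOptions, attempt: Int, random: Random = Random.Default): Int {
    require(attempt > 0) { "attempt must be greater than 0" }
    val uncapped = options.initialDelayMs * options.scaleFactor.pow(attempt - 1)
    val capped = min(uncapped, options.maxBackoffMs.toDouble())
    // nextDouble(until) requires a positive bound, so a jitter of 0 is handled separately.
    val jitterProportion = if (options.jitter > 0.0) random.nextDouble(options.jitter) else 0.0
    return (capped * (1.0 - jitterProportion)).toInt()
}
```

With a scale factor of exactly 1 the delay never grows, which is the configuration the comment above worries about; a stricter check on scaleFactor would rule it out.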

kggilmer · 2021-10-01T01:39:11Z

...ntime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/ExponentialBackoffWithJitter.kt

+         * The default backoff configuration to use.
+         */
+        val Default = ExponentialBackoffWithJitterOptions(
+            initialDelayMs = 10, // start with 10ms


comment

Thank you for the clarifying comment, very helpful

My default posture is to document every public type and member, although I concede some members' very name and placement is all the documentation many will need. Do you think there's better documentation to be written here?

No, I was just saying that these comments let me quickly understand how the parameters work.

kggilmer · 2021-10-01T01:40:21Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/StandardRetryPolicy.kt

+    protected open fun evaluateOtherExceptions(ex: Throwable): RetryDirective? = null
+
+    private fun evaluateNonSdkException(ex: Throwable): RetryDirective? =
+        // TODO Write logic to find connection errors, timeouts, stream faults, etc.


question

is this follow up work or blocked on something else?

I'd hoped this very PR would spawn a discussion about how to fill this in. Do we have a sane way of detecting unmodeled exceptions for things like IO?

I'd imagine the HttpClientEngine implementations are what will be throwing most of these but the only engine-level exception I can find is HttpClientEngineClosedException. Absent a common exception layer, it seems like the retry policy would have to know the individual exceptions thrown by implementations (which sounds too tightly coupled).

Do we need a more clearly defined set of exceptions thrown by engines? Are there non-engine sources of useful exceptions?

It's true that we probably need to have more taxonomy for particular exceptions but I'd start with at least classifying transient errors off raw response codes (we can leave this to aws-sdk-kotlin if we like since the SEP for this behavior is AWS specific)

kggilmer · 2021-10-01T01:44:22Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/StandardRetryStrategy.kt

+                    } else {
+                        // Prep for another loop
+                        backoffDelayer.backoff(attempt)
+                        fromToken.scheduleRetry(evaluation.reason)


question

So this is the only condition that would result in a recursive call?

Affirmative.

aajtodd

Fantastic. Some overall minor comments/suggestions/questions.

aajtodd · 2021-10-01T14:45:02Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/Exceptions.kt

+        /**
+         * Set if an error represents a throttling condition
+         */
+        val Throttling: AttributeKey<Boolean> = AttributeKey("Throttling")


suggestion

This name irks me for some reason. I think it's because Retryable is an adjective but Throttling is a verb/noun?

Maybe there is a better name? I don't have a great suggestion...ThrottlingError? IsThrottlingError?

aajtodd · 2021-10-01T14:46:06Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/BackoffDelayer.kt

+/**
+ * An object that can be used to delay between iterations of code.
+ */
+interface BackoffDelayer {


style

fun interface?

name suggestion: DelayProvider or RetryDelayProvider

style

fun interface?

Unfortunately, suspend functions are not allowed in SAM interfaces.

That was true at one point but I thought it was lifted? It seems to compile fine locally

It appears you're right. My IDE lied to me.
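As a rough illustration of the point settled above, the sketch below declares a fun interface with a single suspend member and builds an instance from a lambda. DelayProvider and backoff are names from this discussion; runImmediately and recordAttempts are hypothetical helpers added so the example runs with only the standard library, and a reasonably recent Kotlin version is assumed.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.startCoroutine

// DelayProvider and backoff come from the PR discussion; the rest is a hypothetical sketch.
fun interface DelayProvider {
    suspend fun backoff(attempt: Int)
}

/** Runs a suspend block that never actually suspends, using only the standard library. */
fun <T> runImmediately(block: suspend () -> T): T {
    var outcome: Result<T>? = null
    block.startCoroutine(Continuation(EmptyCoroutineContext) { outcome = it })
    return checkNotNull(outcome) { "block suspended without resuming" }.getOrThrow()
}

/** Records the attempts passed to a provider created by SAM conversion from a lambda. */
fun recordAttempts(attempts: Int): List<Int> {
    val seen = mutableListOf<Int>()
    val provider = DelayProvider { attempt -> seen.add(attempt) }
    runImmediately { for (attempt in 1..attempts) provider.backoff(attempt) }
    return seen
}
```

The practical benefit is in tests, where a fake provider can be written as a one-line lambda instead of an object expression.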

aajtodd · 2021-10-01T14:50:23Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/RetryPolicy.kt

+package aws.smithy.kotlin.runtime.retries
+
+/**
+ * A policy that evaluates a [Result] from a retry attempt and indicates action a [RetryStrategy] should take next.


nit

the action

aajtodd · 2021-10-01T14:53:18Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/RetryStrategy.kt

+ * SPDX-License-Identifier: Apache-2.0.
+ */
+
+package aws.smithy.kotlin.runtime.retries


comment

This package name is totally fine but I also wonder if aws.smithy.kotlin.runtime.execution wouldn't be an alternative parent that is more generalized (other things related to how an operation or block of code is executed).

I thought about finding a more general home for these but eventually I'd created 11 different runtime/retry-related files. Seemed like that was past the threshold where a shared namespace was cool.

Ya that's fair. Like I said not a big deal.

aajtodd · 2021-10-01T15:33:56Z

...e/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/StandardRetryTokenBucket.kt

+ */
+data class StandardRetryTokenBucketOptions(
+    val maxCapacity: Int,
+    val refillUnitsPerSecond: Int,


suggestion

refillRate

I wanted to be explicit about the units here. refillRate = 10 doesn't really indicate what unit of thing is being refilled nor how often.

yeah I mean you could do refillRatePerSecond or be fancy and define a type. Something like:

value class TokenRefillRate(val tokensPerSecond: Int)

Not a big deal, my brain just kept saying "there's a name for xyz measured per unit of time...RATE"
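A slightly fuller sketch of the typed-rate idea follows, under assumptions: TokenRefillRate and tokensPerSecond come from the comment above, while BucketOptions, tokensAfter, and refill are hypothetical names used only for illustration. On the JVM a value class also needs the @JvmInline annotation.

```kotlin
// Hypothetical sketch: only the TokenRefillRate name and tokensPerSecond property come from the comment above.
@JvmInline
value class TokenRefillRate(val tokensPerSecond: Int) {
    init {
        require(tokensPerSecond >= 0) { "tokensPerSecond must not be negative" }
    }

    /** Whole tokens earned over [elapsedMs] milliseconds at this rate. */
    fun tokensAfter(elapsedMs: Long): Long = tokensPerSecond * elapsedMs / 1000
}

// Hypothetical stand-in for StandardRetryTokenBucketOptions.
data class BucketOptions(val maxCapacity: Int, val refillRate: TokenRefillRate)

/** Capacity after refilling for [elapsedMs], never exceeding maxCapacity. */
fun refill(options: BucketOptions, currentCapacity: Int, elapsedMs: Long): Int =
    minOf(options.maxCapacity.toLong(), currentCapacity + options.refillRate.tokensAfter(elapsedMs)).toInt()
```

This keeps the unit visible at the call site, as in TokenRefillRate(10), while letting the option itself be named refillRate.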

aajtodd · 2021-10-01T15:34:52Z

...-core/common/test/aws/smithy/kotlin/runtime/retries.impl/ExponentialBackoffWithJitterTest.kt

+ * SPDX-License-Identifier: Apache-2.0.
+ */
+
+package aws.smithy.kotlin.runtime.retries.impl


fix

The file location of this looks to be runtime/retries.impl rather than runtime/retries/impl

aajtodd · 2021-10-01T15:37:01Z

...e/runtime-core/common/test/aws/smithy/kotlin/runtime/retries.impl/StandardRetryPolicyTest.kt

+    @Test
+    fun testClientException() {
+        val result = test(ClientException())
+        assertEquals(RetryDirective.RetryError(RetryErrorType.ClientSide), result)


question

Should this attempt to retry any ClientException or should this terminate and fail? I would assume most client-side exceptions (if not all) are not actually retryable.

That's a good question and I suppose it depends on what else is a ClientException. We just decided to make all of the retryer exceptions (e.g., TimedOutException) inherit from ClientException and those are technically retryable (in that they could succeed with more tries).

Taking a look through the runtime, looks like most other instances are serialization (likely not retryable) or configuration related (possibly retryable if the configuration is coming from somewhere remote). Any other instances we can think of?

No but in general ClientException is basically reserved for runtime related exceptions (not to be confused with ServiceException where the ErrorType is set to Client)

aajtodd · 2021-10-01T15:49:30Z

...e/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/StandardRetryTokenBucket.kt

+ * The standard implementation of a [RetryTokenBucket].
+ * @param options The configuration to use for this bucket.
+ */
+class StandardRetryTokenBucket(


question

What makes this Standard?

question

This only supports standard not adaptive behavior correct?

What makes this Standard?

This is standard in that it's recommended and the only implementation provided by smithy-kotlin.

This only supports standard not adaptive behavior correct?

For now, yes that's correct.

aajtodd · 2021-10-01T15:59:01Z

...time-core/common/test/aws/smithy/kotlin/runtime/retries.impl/StandardRetryTokenBucketTest.kt

+import kotlin.time.Duration
+import kotlin.time.ExperimentalTime
+
+class StandardRetryTokenBucketTest {


question

Can we get the SEP standard mode test cases in here?

kggilmer · 2021-10-01T23:17:53Z

runtime/runtime-core/common/test/aws/smithy/kotlin/runtime/retries.impl/TimingUtils.kt

+
+private const val timeToleranceMs = 20
+
+suspend fun <T> TestCoroutineScope.assertTime(expectedMs: Int, block: suspend () -> T): T {


kggilmer · 2021-10-01T23:25:54Z

runtime/runtime-core/common/src/aws/smithy/kotlin/runtime/retries/impl/StandardRetryStrategy.kt

+    private suspend fun <R> throwFailure(token: RetryToken, attempt: Int, result: Result<R>): Nothing {
+        token.notifyFailure()
+        throw RetryFailureException(
+            "The operation failed",


suggestion

under the assumption that this exception's message will be visible to customers, perhaps adding some of the kdoc content into the message would be clarifying. eg "The operation resulted in a non-retryable failure"

kggilmer · 2021-10-01T23:28:48Z

Incredible effort, very clean and well thought out. Bravo @ianbotsf !

aajtodd

Other than previous suggestions overall looks good. Nothing blocking from my end.

aajtodd · 2021-10-05T14:31:58Z

runtime/runtime-core/common/test/aws/smithy/kotlin/runtime/retries/impl/TimingUtils.kt

 import kotlinx.coroutines.test.TestCoroutineScope
 import kotlin.test.assertTrue

 private const val timeToleranceMs = 20

+@ExperimentalCoroutinesApi


why are we propagating this? Just opt-in
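The difference between the two approaches can be sketched as follows. ExperimentalTimingApi is a hypothetical marker standing in for ExperimentalCoroutinesApi so the example has no dependencies, and the function names are illustrative only; Kotlin 1.7 or later is assumed, where @RequiresOptIn is stable.

```kotlin
// Hypothetical marker standing in for ExperimentalCoroutinesApi so the sketch needs no dependencies.
@RequiresOptIn(message = "This API is experimental.")
@Retention(AnnotationRetention.BINARY)
@Target(AnnotationTarget.FUNCTION)
annotation class ExperimentalTimingApi

@ExperimentalTimingApi
fun currentVirtualTimeMs(): Long = 0L

// Propagating: every caller of this helper must now opt in as well.
@ExperimentalTimingApi
fun elapsedPropagating(startMs: Long): Long = currentVirtualTimeMs() - startMs

// Opting in: the experimental usage stays an implementation detail of the helper.
@OptIn(ExperimentalTimingApi::class)
fun elapsedOptedIn(startMs: Long): Long = currentVirtualTimeMs() - startMs
```

For a test utility, opting in is usually the lighter choice because test code calling the helper does not need its own annotation.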

aajtodd · 2021-10-05T14:39:25Z

...ore/jvm/test/aws/smithy/kotlin/runtime/retries/impl/StandardRetryIntegrationTestResources.kt

+
+package aws.smithy.kotlin.runtime.retries.impl
+
+val standardRetryIntegrationTestCases = mapOf(


aajtodd · 2021-10-05T14:45:14Z

...runtime-core/jvm/test/aws/smithy/kotlin/runtime/retries/impl/StandardRetryIntegrationTest.kt

+                else -> fail("Unexpected outcome for $name: ${finalState.outcome}")
+            }
+
+            if (finalState.outcome != Outcome.RetryQuotaExceeded) {


is it possible to make any assertion about the expected retry quota?

Good idea, I can add that.


feat: implement basic retry support in runtime

12350cb

ianbotsf requested review from aajtodd and kggilmer September 30, 2021 17:34

kggilmer reviewed Oct 1, 2021

aajtodd suggested changes Oct 1, 2021

kggilmer reviewed Oct 1, 2021

kggilmer approved these changes Oct 1, 2021

ianbotsf requested review from kggilmer and aajtodd October 4, 2021 20:22

aajtodd approved these changes Oct 5, 2021

ianbotsf and others added 3 commits October 5, 2021 17:22

addressing PR feedback: using @OptIn for experimental coroutines APIs vs propagating; added assertion on retry quota remaining for the StandardRetryStrategy integration test

a2a81fe

Merge branch 'main' into retries

cd93247

linting

e8c17d6

ianbotsf merged commit 76f754d into main Oct 5, 2021

ianbotsf deleted the retries branch October 5, 2021 19:01

ianbotsf mentioned this pull request Oct 5, 2021

feat: add codegen wrappers for retries #490

Merged

2 tasks

aajtodd mentioned this pull request Oct 12, 2021

add throttling metadata to generated exceptions #478

Closed

aajtodd pushed a commit that referenced this pull request Mar 11, 2024

fix(codegen): Fix usage of unicode in bucket names of s3 presigner (#487)

180ae5d

