Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement basic retry support in runtime #487

Merged
merged 5 commits into from
Oct 5, 2021
Merged

feat: implement basic retry support in runtime #487

merged 5 commits into from
Oct 5, 2021

Conversation

ianbotsf
Copy link
Contributor

@ianbotsf ianbotsf commented Sep 30, 2021

Issue #

Addresses #224.

Description of changes

This change adds support for retries in the runtime (but does not yet integrate them into codegen—separate PR coming). See the included design doc (docs/design/retries.md) for more details.

Revision 2

Addressed the main points of PR feedback:

  • BackoffDelayerDelayProvider
  • added integration tests sourced from spec
  • ThrottlingThrottlingError
  • retry exceptions now inherit from ClientException
  • updated retry defaults to 20s total, 3 tries
  • refactored StandardRetryPolicy evaluation strategies for clarity
  • added handling of CancellationException
  • cleaned up token failure handling
  • improved non-retryable exception message
  • moved companion object constant to top-level

Additionally, while adding the integration tests:

  • added circuit-breaker mode for token bucket
  • added max backoff configuration option

Scope

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor

@kggilmer kggilmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR review in progress...

* @param cause An underlying exception which caused this exception.
* @param attempts The number of attempts made before this failure was encountered.
* @param lastResponse The last response received from the retry strategy before this failure was encountered.
* @param lastException The exception caught by the retry strategy before this failure was encountered. Note that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

if there is a way to further distinguish between cause and lastException for clarity...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a difficult nuance to convey. cause is pretty much the standard exception re-wrapping mechanism that exists throughout Java/Kotlin. lastException is more analogous to lastResponse...it's the last output we received from the block but it's not necessarily the underlying cause of this exception.

Take for example a retry loop that keeps getting a throttling exception (which is retryable). Eventually, we're going to give up and throw a TimedOutException or TooManyAttemptsException. In those cases, the most recent retryable exception we got from our code block isn't the cause of the timeout or whatever, it's merely the last thing we saw and may be interesting to code that catches the timeout.

Can you think of a succinct way to capture that in KDoc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a tough nut to crack. In what cases will both lastResponse and lastException contain a value? One minor improvement may be to use the word previous rather than last. IMO last is somewhat ambiguous as it can mean just the last time something happened, which could be multiple iterations in the past, whereas previous to me implies something that only could have occurred in the last iteration. As I write this it's pretty obvious how subjective this all so 🤷 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what cases will both lastResponse and lastException contain a value?

Under no circumstances will both fields contain a value. The last result from a block execution can only be a response or an exception, not both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case would it make sense to represent the combination as a union via a sealed class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a union called Outcome in the design review but it seemed like popular opinion was against it. I can create a union here but it will conflict conceptually and semantically with Result (which is not a union).

Do you have any suggestion on names for the union and the members?

* @param result The [Result] of the retry attempt.
* @return A [RetryDirective] indicating what action the [RetryStrategy] should take next.
*/
fun evaluate(result: Result<R>): RetryDirective
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

I feel that this was brought up in discussion but I forget, what's the reasoning behind which functions suspend vs which do not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A RetryStrategy has to be able to suspend to sleep/delay between attempts. A policy (as far as I can tell) shouldn't need to suspend to provide an answer to whether a result should be retried or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For retries, I put suspend around functions that would either need to execute the code block (which probably makes IO calls) or delay (e.g., backoff, token bucket interactions, etc.). All the retry policies that I've imagined so far all happen in straightforward, on-box logic. While I suppose it's technically possible some retry policy could involve network calls or local storage calls, it seems unlikely that we'll ever need that for the bare smithy or SDK use cases.

val jitter: Double,
) {
init {
require(initialDelayMs > 0) { "initialDelayMs must be at least 0" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment

Do we want to be more conservative here? Setting initial delay to 0 and scale factor to 1 seems like it something people may want to do (for reasons) and may be a bad trip server side. I wonder if other SDKs constrain the lower bounds more...

* The default backoff configuration to use.
*/
val Default = ExponentialBackoffWithJitterOptions(
initialDelayMs = 10, // start with 10ms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment

Thank you for the clarifying comment, very helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My default posture is to document every public type and member, although I concede some members' very name and placement is the all documentation many will need. Do you think there's better documentation to be written here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I was just saying that these comments let me quickly understand how the parameters work.

protected open fun evaluateOtherExceptions(ex: Throwable): RetryDirective? = null

private fun evaluateNonSdkException(ex: Throwable): RetryDirective? =
// TODO Write logic to find connection errors, timeouts, stream faults, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

is this follow up work or blocked on something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hoped this very PR would spawn a discussion about how to fill this in. Do we have a sane way of detecting unmodeled exceptions for things like IO?

I'd imagine the HttpClientEngine implementations are what will be throwing most of these but the only engine-level exception I can find is HttpClientEngineClosedException. Absent a common exception layer, it seems like the retry policy would have to know the individual exceptions thrown by implementations (which sounds too tightly coupled).

Do we need a more clearly defined set of exceptions thrown by engines? Are there non-engine sources of useful exceptions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that we probably need to have more taxonomy for particular exceptions but I'd start with at least classifying transient errors off raw response codes (we can leave this to aws-sdk-kotlin if we like since the SEP for this behavior is AWS specific)

} else {
// Prep for another loop
backoffDelayer.backoff(attempt)
fromToken.scheduleRetry(evaluation.reason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

So this is the only condition that would result in a recursive call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Affirmative.

Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic. Some overall minor comments/suggestions/questions.

/**
* Set if an error represents a throttling condition
*/
val Throttling: AttributeKey<Boolean> = AttributeKey("Throttling")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

This name irks me for some reason. I think it's because Retryable is an adjective but Throttling is verb/noun?

Maybe there is a better name? I don't have a great suggestion...ThrottlingError? IsThrottlingError?

/**
* An object that can be used to delay between iterations of code.
*/
interface BackoffDelayer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

fun interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name suggestion: DelayProvider or RetryDelayProvider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style

fun interface?

Unfortunately, suspend functions are not allowed in SAM interfaces.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was true at one point but I thought it was lifted? It seems to compile fine locally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears you're right. My IDE lied to me.

* @param result The [Result] of the retry attempt.
* @return A [RetryDirective] indicating what action the [RetryStrategy] should take next.
*/
fun evaluate(result: Result<R>): RetryDirective
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A RetryStrategy has to be able to suspend to sleep/delay between attempts. A policy (as far as I can tell) shouldn't need to suspend to provide an answer to whether a result should be retried or not.

package aws.smithy.kotlin.runtime.retries

/**
* A policy that evaluates a [Result] from a retry attempt and indicates action a [RetryStrategy] should take next.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

the action

* SPDX-License-Identifier: Apache-2.0.
*/

package aws.smithy.kotlin.runtime.retries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment

This package name is totally fine but I also wonder if aws.smithy.kotlin.runtime.execution wouldn't be an alternative parent that is more generalized (other things related to how an operation or block of code is executed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about finding a more general home for these but eventually I'd created 11 different runtime/retry-related files. Seemed like that was past the threshold where a shared namespace was cool.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya that's fair. Like I said not a big deal.

*/
data class StandardRetryTokenBucketOptions(
val maxCapacity: Int,
val refillUnitsPerSecond: Int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

refillRate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be explicit about the units here. refillRate = 10 doesn't really indicate what unit of thing is being refilled nor how often.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I mean you could do refillRatePerSecond or be fancy and define a type. Something like:

value class TokenRefillRate(val tokensPerSecond: Int)

Not a big deal, my brain just kept saying "theres a name for xyz measured per unit of time...RATE"

* SPDX-License-Identifier: Apache-2.0.
*/

package aws.smithy.kotlin.runtime.retries.impl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

The file location of this looks to be runtime/retries.impl rather than runtime/retries/impl

@Test
fun testClientException() {
val result = test(ClientException())
assertEquals(RetryDirective.RetryError(RetryErrorType.ClientSide), result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

should this attempt to retry any ClientException or should this terminate and fail? I would assume most client side (if not all) are not actually retryable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question and I suppose it depends on what else is a ClientException. We just decided to make all of the retryer exceptions (e.g., TimedOutException) inherit from ClientException and those are technically retryable (in that they could succeed with more tries).

Taking a look through the runtime, looks like most other instances are serialization (likely not retryable) or configuration related (possibly retryable if the configuration is coming from somewhere remote). Any other instances we can think of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No but in general ClientException is basically reserved for runtime related exceptions (not to be confused with ServiceException where the ErrorType is set to Client)

* The standard implementation of a [RetryTokenBucket].
* @param options The configuration to use for this bucket.
*/
class StandardRetryTokenBucket(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

What makes this Standard?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

This only supports standard not adaptive behavior correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes this Standard?

This is standard in that it's recommended and the only implementation provided by smithy-kotlin.

This only supports standard not adaptive behavior correct?

For now, yes that's correct.

import kotlin.time.Duration
import kotlin.time.ExperimentalTime

class StandardRetryTokenBucketTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

Can we get the SEP standard mode test cases in here?


private const val timeToleranceMs = 20

suspend fun <T> TestCoroutineScope.assertTime(expectedMs: Int, block: suspend () -> T): T {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3

private suspend fun <R> throwFailure(token: RetryToken, attempt: Int, result: Result<R>): Nothing {
token.notifyFailure()
throw RetryFailureException(
"The operation failed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion

under the assumption that this exception's message will be visible to customers, perhaps adding some of the kdoc content into the message would be clarifying. eg "The operation resulted in a non-retryable failure"

@kggilmer
Copy link
Contributor

kggilmer commented Oct 1, 2021

Incredible effort, very clean and well thought out. Bravo @ianbotsf !

* BackoffDelayer → DelayProvider
* add integration tests sourced from spec
* Throttling → ThrottlingError
* retry exceptions now inherit from ClientException
* updated retry defaults to 20s total, 3 tries
* refactored StandardRetryPolicy evaluation strategies for clarity
* added handling of CancellationException
* cleaned up token failure handling
* improved non-retryable exception message
* moved companion object constant to top-level

additionally, as part of getting integration tests to work:
* added circuit-breaker mode for token bucket
* added max backoff configuration option
@ianbotsf ianbotsf requested review from kggilmer and aajtodd October 4, 2021 20:22
Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than previous suggestions overall looks good. Nothing blocking from my end.

import kotlinx.coroutines.test.TestCoroutineScope
import kotlin.test.assertTrue

private const val timeToleranceMs = 20

@ExperimentalCoroutinesApi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we propagating this? Just opt-in


package aws.smithy.kotlin.runtime.retries.impl

val standardRetryIntegrationTestCases = mapOf(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

else -> fail("Unexpected outcome for $name: ${finalState.outcome}")
}

if (finalState.outcome != Outcome.RetryQuotaExceeded) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to make any assertion about the expected retry quota?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I can add that.

ianbotsf and others added 3 commits October 5, 2021 17:22
… vs propagating; added assertion on retry quota remaining for the StandardRetryStrategy integration test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants