[core-amqp] Implement exponential retry mechanism#4175
[core-amqp] Implement exponential retry mechanism#4175ramya0820 merged 14 commits intoAzure:masterfrom
Conversation
|
I'm looking to add some tests soon to this PR if we can confirm that general assumptions and approach seem okay |
sdk/core/core-amqp/src/retry.ts
Outdated
| const targetDelayInMs = Math.min( | ||
| config.minExponentialRetryDelayInMs + Math.pow(2, i), | ||
| config.maxExponentialRetryDelayInMs | ||
| ); |
There was a problem hiding this comment.
With default values for min and max delays here, this results in a progression of 3, 5, 9, 17, 33, 65, 129, 257, 513, 1025, 2049 etc.
Questions:
- Shouldn't we be starting from the min value instead of min value + 2 ?
- This method differs from both what is in core-http and in storage. Which of the two did we take as reference here?
- In the exponential retry policy in storage,
Math.pow(2, i)is multiplied by the delay provided by the user instead of adding to it. - In the exponential retry policy in core-http, some other convoluted way is used
- In the exponential retry policy in storage,
There was a problem hiding this comment.
- Shouldn't we be starting from the min value instead of min value + 2 ?
This mirrors what we have in core-http and would depend on the configured value - did we want to set the default min to 0 ?
- This method differs from both what is in core-http and in storage. Which of the two did we take as reference here?
We had decided to use what we have in core-http as captured in #3331 (comment) as well, do we have consensus to go with one in storage instead? @bterlson @daviwil @sadasant
There was a problem hiding this comment.
Overall, if we were to investigate best approach to take forward and not mirror core-http as decided, then following findings may help -
Based on discussion from cloud computing communities online,
Slowing clients down may help, and the classic way to slow clients down is capped exponential backoff. Capped exponential backoff means that clients multiply their backoff by a constant after each attempt, up to some maximum value. In our case, after each unsuccessful attempt, clients sleep for:
This seems to mirror what we have in storage. We'll likely need to update core-http as well.
And also decide if we want to do away with concept of minimum delay and use a maximum cap value along with retry attempts instead.
This would address, simplify other comment too where instead of delayInSeconds and min/max expoenential retry specific config, we can have one maxRetryDelay and maxRetries that would apply to both types of retry.
Additionally the number 2 can be made configurable as well, calling it the backoffFactor. (This seems to be a thing as well if you search for exponential time delay calculator /"online-backoff-calculator" online)
There was a problem hiding this comment.
From #3331 (comment)
When an operation fails, the delay is computed as (2^retry-attempts) milliseconds and is thus incremented with each failed attempt in order of (1 ms, 3ms, 7ms, so on..) This is same as what we have in core-http
This is not what core-http is doing.
In core-http, (2^retry-attempts) is not used as the time itself. This value is used to multiply with the delay (after introducing a randomization factor) provided by the user. Please review https://github.com/Azure/azure-sdk-for-js/blob/%40azure/core-http_1.0.0-preview.1/sdk/core/core-http/lib/policies/exponentialRetryPolicy.ts#L131-L137 one more time.
There was a problem hiding this comment.
Ah, yes missed the * in line 135.
This makes it overall similar to what we have in storage except for the use of minimum retry delay and boundedRandDelta computation.
There was a problem hiding this comment.
@ramya-rao-a @bterlson
I've updated the implementation.
Moreover, @sadasant and I just discussed offline about having one copy of retry policy implementation where possible.
Should we house the exponential delay computation in one place and export, use that in here?
Would be good if we can clarify on which computation to use, so we can consider just having it modularized and exported, say from core-http.
There was a problem hiding this comment.
Actually, currently it isn't making sense to take on core-http as dependency. But eventually we could have a separate module for retry I think.
There was a problem hiding this comment.
Update: For clarity now made sure it's mirroring core-http much closely - @ramya-rao-a. Please check updated PR
There was a problem hiding this comment.
Latest update looks much better and closer to what core-http is doing.
Sharing the retry code between core-http and core-amqp is not trivial at the moment. core-http operates on the model of a "pipeline". The pipeline consists of "policies". Each policy does what it needs to do and then passes the "request" to the next policy in the pipeline.
core-amqp does not have this concept of "pipeline". There are talks about making it have a pipeline, but that is not going to happen any time soon.
So, I am all in favor of aligning the retry logic between core-http and core-amqp, but not sharing actual re-usable code yet.
sdk/core/core-amqp/src/retry.ts
Outdated
| /** | ||
| * @property {number} [delayInSeconds] Amount of time to wait in seconds before making the | ||
| * next attempt. Default: 15. | ||
| * next attempt. Applicable only when performing linear retry. Default: 15. |
There was a problem hiding this comment.
Can we not use this instead of introducing minExponentialRetryDelayInMs?
There was a problem hiding this comment.
What would be the benefits?
One is intended for with linear retry where min/max don't lend any meaning.
For exponential retry, we have notion of min and max, so that would likely be a con as it may be confusing to mix these into one. Unless maybe the naming is clear enough?
@bterlson thoughts on usage/name?
There was a problem hiding this comment.
Ah, never mind. I was looking at the storage implementation where there is no "min" value and the delay for the first time is the same as delayInSeconds and was thinking that maybe we don't need to configure the minimum value.
I see that we do let user configure the min value, and that is fine by me
bterlson
left a comment
There was a problem hiding this comment.
Approved modulo @ramya-rao-a's concerns.
sdk/core/core-amqp/src/retry.ts
Outdated
| /** | ||
| * @property {string} connectionHost The host "<yournamespace>.servicebus.windows.net". | ||
| * Used to check network connectivity. | ||
| * Used to check network connectivity. Applicable only when performing linear retry. |
There was a problem hiding this comment.
connectionHost gets used regardless of linear or exponential retry.
sdk/core/core-amqp/src/retry.ts
Outdated
| * @property {boolean} [exponentialRetry] Flag to denote if we want to perform exponential retry and not | ||
| * the default, which is linear. | ||
| */ | ||
| exponentialRetry?: boolean; |
There was a problem hiding this comment.
I have generally found it helpful to name properties/variables/arguments that are boolean to have a preceding verb like is or has. In this case, I would suggest isExponential.
@bterlson Since the RetryOptions passed when creating a EventHubClient would need this property too and so this is exposed to our users, any thoughts on the name?
There was a problem hiding this comment.
Agree, updated to isExponentialRetry for now
sdk/core/core-amqp/src/retry.ts
Outdated
| i, | ||
| err | ||
| ); | ||
| let targetDelayInMs: number; |
There was a problem hiding this comment.
We can initialize targetDelayInMs with config.delayInSeconds to avoid the else block
| export const defaultDelayBetweenOperationRetriesInSeconds = 5; | ||
| export const defaultDelayBetweenRetriesInSeconds = 15; | ||
| export const defaultMaxDelayForExponentialRetryInMs = 50000; | ||
| export const defaultMinDelayForExponentialRetryInMs = 1; |
There was a problem hiding this comment.
How did we arrive at these defaults? Assuming that we wan't to mirror the retry policy in core-http, these should be different
There was a problem hiding this comment.
Oh, the defaults seemed different in general across these 2, such as for delay between retries to be same as well (5 vs 15 vs 30 seconds based on type of retry).
Fixed this as well to use default as 30 sec.
sdk/core/core-amqp/src/retry.ts
Outdated
|
|
||
| targetDelayInMs = Math.min( | ||
| config.minExponentialRetryDelayInMs + incrementDelta, | ||
| config.delayInSeconds |
There was a problem hiding this comment.
We should be choosing between config.minExponentialRetryDelayInMs + incrementDelta and maxExponentialRetryDelayInMs not config.delayInSeconds.
sdk/core/core-amqp/src/retry.ts
Outdated
| * If `exponentialRetry` option is set, then the delay between retries is adjusted to increase | ||
| * exponentially with each attempt using back-off factor of power 2. | ||
| * | ||
| * @param {RetryConfig<T>} config Parameters that define what type of retry will be performed |
There was a problem hiding this comment.
The parameters define more than just the type of retry. The previous statement was good enough i.e Parameters to configure retry operation
| ); | ||
| let targetDelayInMs: number; | ||
| if (config.exponentialRetry) { | ||
| let incrementDelta = Math.pow(2, i) - 1; |
There was a problem hiding this comment.
At a quick glance, the words minExponentialRetryDelayInMs and maxExponentialRetryDelayInMs tell me that when using exponential retry policy, the delay between retries starts from the minExponentialRetryDelayInMs and is incremented exponentially till maxExponentialRetryDelayInMs
This understanding holds true only if incrementDelta is initialized to Math.pow(2, i - 1) - 1 instead of Math.pow(2, i) - 1
But, I am not sure if this what we want it to mean :)
i.e I am not sure if the first delay interval should be minExponentialRetryDelayInMs or minExponentialRetryDelayInMs + (a random value between 0 and 1 * delayInSeconds)
There was a problem hiding this comment.
Resolution to this is not a blocker for merging this PR, but is a blocker in closing the linked issue
There was a problem hiding this comment.
I agree with your intuition here - the first time we retry, incrementDelta should be zero, and since we start on "try 1", Math.pow(2, i - 1) - 1 makes sense to me.
There was a problem hiding this comment.
In that case minExponentialRetryDelayInMs would need a better default ... Logged #4201 to discuss this
sdk/core/core-amqp/src/retry.ts
Outdated
| /** | ||
| * @property {number} [delayInSeconds] Amount of time to wait in seconds before making the | ||
| * next attempt. Default: 15. | ||
| * When `exponentialRetry` is set to `true`, this is used to compute the exponentially increasing delays between retries. |
There was a problem hiding this comment.
exponentialRetry -> isExponentialRetry
Same in a few other places where the old name is being used in comments
There was a problem hiding this comment.
Addressed, re-addressed :)
sdk/core/core-amqp/src/retry.ts
Outdated
| * @property {boolean} [exponentialRetry] Flag to denote if we want to perform exponential retry and not | ||
| * the default, which is linear. | ||
| */ | ||
| isExponentialRetry?: boolean; |
There was a problem hiding this comment.
@bterlson Do you foresee any other kind of retry we would be supporting? If so, then we should probably have this as an enum
There was a problem hiding this comment.
Probably, given that there are 3 or more supported retry schemes in core-http (linear, exponential, logarithmic, also see Fibonacci and a few others looking around outside Azure).
There was a problem hiding this comment.
How can a single interface RetryOptions cater to the different kinds of inputs required for the different retry mechanisms?
In case of libraries using the http pipeline, each retry mechanism is a different "policy", so they won't have this problem.
Logged #4199 to continue this discussion
There was a problem hiding this comment.
Updated to use enum as agreed
How can a single interface RetryOptions cater to the different kinds of inputs required for the different retry mechanisms?
Agree I recall touching upon this point during initial discussion and had captured it in the assumptions as well
| ); | ||
| let targetDelayInMs: number; | ||
| if (config.exponentialRetry) { | ||
| let incrementDelta = Math.pow(2, i) - 1; |
| let incrementDelta = Math.pow(2, i) - 1; | ||
| const boundedRandDelta = | ||
| config.delayInSeconds * 0.8 + | ||
| Math.floor(Math.random() * (config.delayInSeconds * 1.2 - config.delayInSeconds * 0.8)); |
There was a problem hiding this comment.
Isn't (config.delayInSeconds * 1.2 - config.delayInSeconds * 0.8) the same as configDelayInSeconds * 0.4?
There was a problem hiding this comment.
Right, maybe the magic numbers/factors have a certain meaning? For instance we have time in milliseconds (say for 30 sec) being expressed as 1000 * 30 in some places for readability.
| ); | ||
| let targetDelayInMs: number; | ||
| if (config.exponentialRetry) { | ||
| let incrementDelta = Math.pow(2, i) - 1; |
There was a problem hiding this comment.
I agree with your intuition here - the first time we retry, incrementDelta should be zero, and since we start on "try 1", Math.pow(2, i - 1) - 1 makes sense to me.
ramya-rao-a
left a comment
There was a problem hiding this comment.
@ramya0820 The changes look good. Can we also update the tests in retry.spec.ts to get coverage for the exponential retry? I believe all the tests in there would be applicable for the exponential case.
Refer to the use forEach in errros.spec.ts where the same test is run for different inputs. We can use this model to update the current tests to run for both retry types
@ramya-rao-a
|
| [RetryPolicy.ExponentialRetryPolicy, RetryPolicy.LinearRetryPolicy].forEach( | ||
| () => | ||
| function(retryPolicy) { | ||
| describe(`retry function for ${retryPolicy}`, function() { |
There was a problem hiding this comment.
Wont this print 0 and 1 because retry policy is an enum? That wouldnt help much. The words "fixed"/"linear" and "exponential" are preferable.
Would suggest in having a constant with the required value and then use it in the describe title
| }, | ||
| connectionId: "connection-1", | ||
| operationType: RetryOperationType.cbsAuth, | ||
| delayInSeconds: retryPolicy === RetryPolicy.ExponentialRetryPolicy ? 0 : 15, |
There was a problem hiding this comment.
Why 0 in case of exponential? That will result in no exponential increase at all right?
I would imagine that we don't have to make any changes to the delayInSeconds property in any of the tests

For more context refer to #3331
Specifically, please see this comment #3331 (comment) for assumptions based on which this PR is submitted.
Enlisting here as well for reference-
(2^retry-attempts)milliseconds and is thus incremented with each failed attempt in order of (1 ms, 3ms, 7ms, so on..) This is same as what we have incore-httpnwherenis computed per equationminimum-retry-delay + 2^(n) <= maximum-allowed-retry-delayas the delay induced before this attempt 'n' would have reached the maximum allowed delay value governed by the configured
maxExponentialRetryDelayInMilliseconds.In
core-httpthe retries continue by capping max delay at the configured value and converts into a linear retry until either targeted retry attempt count is reached or abort signal is determined.core-httpand we will be going with introducing this new type of retry policy using a boolean flag on the RetryConfig as opposed to defining a new interface/type to RetryConfig.