common: jittered backoff implementation (#3791)
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
htuch left a comment:
Awesome @ramaraochavali, a few comments and we can merge. Would be good to make this the default backoff for everything as per @mattklein123 earlier comment.
```cpp
uint64_t JitteredBackOffStrategy::computeNextInterval() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
  uint64_t new_interval = random_.random() % (base_interval_ * multiplier);
```
What are the behavioral tradeoffs between doing what you have here and something more like `(1 << current_retry_) * base_interval + random.random() % some_fixed_interval`? I.e. you still do exponential backoff, but just add a small jitter around the point that you back off to?
I think what you have is fine actually, reading https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, but it would be worth exploring the tradeoff in the comments here and in review.
I have compared both approaches; it looks like the one I have has more randomness. Here is the output.

`(1 << current_retry_) * base_interval + random.random() % some_fixed_interval` (fixed interval of 100):

```
1082
2069
4079
8054
16010
30000
```

`random_.random() % (base_interval_ * (1 << current_retry_) - 1)`:

```
102
1065
954
1084
7196
28940
23291
30000
9979
```
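The two schemes being compared above can be put side by side in a minimal standalone sketch. The function names are illustrative, not Envoy's actual API, and `std::mt19937_64` stands in for Envoy's `RandomGenerator`; `retry` is assumed to be at least 1, since the real code increments `current_retry_` before computing the interval.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Scheme A: exponential backoff plus a small jitter around the backoff point.
// Successive intervals grow predictably; only the last ~jitter_range ms vary.
uint64_t smallJitterIntervalMs(uint32_t retry, uint64_t base_interval,
                               uint64_t jitter_range, uint64_t max_interval,
                               std::mt19937_64& rng) {
  const uint64_t interval = (1ULL << retry) * base_interval + rng() % jitter_range;
  return std::min(interval, max_interval);
}

// Scheme B ("full jitter", per the AWS article): pick uniformly from
// [0, base_interval * (2^retry - 1)), so successive intervals are
// decorrelated across the whole range.
uint64_t fullJitterIntervalMs(uint32_t retry, uint64_t base_interval,
                              uint64_t max_interval, std::mt19937_64& rng) {
  const uint64_t ceiling = base_interval * ((1ULL << retry) - 1);
  return std::min(rng() % ceiling, max_interval);
}
```

This matches the outputs above: scheme A stays clustered just above each exponential step, while scheme B spreads retries over the full window, which is what reduces the thundering-herd effect the AWS article describes.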
```cpp
uint64_t JitteredBackOffStrategy::computeNextInterval() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
```
Nit: please use const for these intermediate variables.
```cpp
public:
  /**
   * Use this constructor if max_interval need not be enforced.
```
How come we need this? I think it is always sensible to add some bound.
```cpp
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
  // for retries that take longer, multiplier may overflow and become zero.
  if (multiplier == 0) {
```
@htuch for retries that take longer, I noticed that `multiplier` becomes zero because of overflow, hence I added this check to reset it. Also added a test to verify this. LMK if there are better options here.
I don't think this should be possible to happen if we bound the maximum backoff. When we hit the cap, let's just immediately return that, rather than doing math that can overflow. I.e. once `(1 << current_retry_) * base_interval_ > max_interval_`, stop incrementing `current_retry_`.
```cpp
  retry_on_ |= route_policy.retryOn();
  retries_remaining_ = std::max(retries_remaining_, route_policy.numRetries());
  uint32_t base = runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25);
  backoff_strategy_ptr_ = std::make_unique<JitteredBackOffStrategy>(base, base * 10, random_);
```
Here I have added `base * 10` as the maximum interval (bound). I think it should be sufficient. LMK if you think otherwise.
Seems reasonable, maybe add a comment or constant.
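Naming the multiplier, as suggested, could look something like the sketch below. Both the constant name and the helper are hypothetical, chosen only to illustrate the "10x base" bound proposed in this PR.

```cpp
#include <cstdint>

// Hypothetical constant for the bound discussed above: cap retry backoff
// at 10x the (runtime-configurable) base interval.
static constexpr uint64_t kBackoffMaxIntervalMultiplier = 10;

uint64_t maxRetryBackoffMs(uint64_t base_retry_backoff_ms) {
  return base_retry_backoff_ms * kBackoffMaxIntervalMultiplier;
}
```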
@htuch made the jittered backoff the default for management server connections as well, and removed the earlier ExponentialBackOff. Left a couple of questions, PTAL.
```cpp
  current_interval_ = new_interval > max_interval_ ? max_interval_ : new_interval;

uint64_t JitteredBackOffStrategy::nextBackOffMs() {
  current_retry_++;
  uint32_t multiplier = (1 << current_retry_) - 1;
```
Good point, I just took it from the existing implementation. Not sure why it was like that, but one thing that comes to mind is that `1 << current_retry_` always returns a power of two; subtracting 1 may make the result more random. I left it as-is since it is not a big deal; let me know if you want me to change it.
```cpp
  DoRetryCallback callback_;
  Event::TimerPtr retry_timer_;
  Upstream::ResourcePriority priority_;
  BackOffStrategyPtr backoff_strategy_ptr_;
```
FWIW, I'm not a fan of putting `_ptr` in the variable name unless it is really needed to make things clearer (e.g. to disambiguate). Here, I would just call this `backoff_strategy_`. I think you'll find this consistent with other Envoy code (e.g. `retry_timer_` above).
@htuch made changes. PTAL.
* origin/master:
  config: making v2-config-only a boolean flag (envoyproxy#3847)
  lc trie: add exclusive flag. (envoyproxy#3825)
  upstream: introduce PriorityStateManager, refactor EDS (envoyproxy#3783)
  test: deflaking header_integration_test (envoyproxy#3849)
  http: new style WebSockets, where headers and data are processed by the filter chain. (envoyproxy#3776)
  common: minor doc updates (envoyproxy#3845)
  fix master build (envoyproxy#3844)
  logging: Requiring details for RELEASE_ASSERT (envoyproxy#3842)
  test: add test for consistency of RawStatData internal memory representation (envoyproxy#3843)
  common: jittered backoff implementation (envoyproxy#3791)
  format: run buildifier on .bzl files. (envoyproxy#3824)
  Support mutable metadata for endpoints (envoyproxy#3814)
  test: deflaking a test, improving debugability (envoyproxy#3829)
  Update ApiConfigSource docs with grpc_services only for GRPC configs (envoyproxy#3834)
  Add hard-coded /hot_restart_version test (envoyproxy#3832)
  healthchecks: Add interval_jitter_percent healthcheck option (envoyproxy#3816)

Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
Description:
Implements a fully jittered exponential backoff algorithm and refactors the router retry timer to use it.
Risk Level: Low
Testing: Added automated tests
Docs Changes: N/A
Release Notes: N/A