router: support customizable retry back-off intervals#6568
router: support customizable retry back-off intervals#6568htuch merged 2 commits intoenvoyproxy:masterfrom
Conversation
Default behavior remains unchanged: retries will use the runtime parameter defaulted to 25ms as the base interval and 250ms as the maximum. Allows routes to customize the base and maximum intervals. Risk Level: low (no change to default behavior) Testing: unit tests Doc Changes: included, plus updated description of back-off algorithm Release Notes: added Signed-off-by: Stephan Zuercher <zuercher@gmail.com>
f8b12d1 to
e963e18
Compare
snowp
left a comment
There was a problem hiding this comment.
This LGTM, just one question about the design
| std::chrono::milliseconds base_interval( | ||
| runtime_.snapshot().getInteger("upstream.base_retry_backoff_ms", 25)); | ||
| if (route_policy.baseInterval()) { | ||
| base_interval = *route_policy.baseInterval(); |
There was a problem hiding this comment.
Seems like this means that it's not possible to override the interval with runtime anymore once it specified in the route config. Did you consider keeping the runtime value as an override for retries specified in the config?
Not sure which option is better, I can see both sides.
There was a problem hiding this comment.
I could imagine adding a config point for a runtime key to use for overriding a route's back-off interval, but it doesn't feel right to me to have the runtime key override the interval for all the routes.
Given that there's no runtime key override for the per-try timeout, I think it makes sense to do it this way. I think it could even be argued that I should deprecate the runtime key.
htuch
left a comment
There was a problem hiding this comment.
LGTM modulo minor comments.
/wait
| base_interval_ = std::chrono::milliseconds( | ||
| PROTOBUF_GET_MS_REQUIRED(retry_policy.retry_back_off(), base_interval)); | ||
| if ((*base_interval_).count() < 1) { | ||
| base_interval_ = std::chrono::milliseconds(1); |
There was a problem hiding this comment.
Why is the floor function needed? Won't the backoff eventually (or quite quickly) raise this to a useful value?
There was a problem hiding this comment.
The current implementation takes an integer and assumes it's milliseconds. It also asserts that the value is >= 1. I think I could convert it to handle arbitrary std::chrono::duration types, but I wanted to limit the blast radius of this change.
Signed-off-by: Stephan Zuercher <zuercher@gmail.com>
* master: (26 commits) docs: update docs to recommend /retest repokitteh command (envoyproxy#6655) http timeout integration test: wait for 15s for upstream reset (envoyproxy#6646) access log: add response code details to the access log formatter (envoyproxy#6626) build: add ppc build badge to README (envoyproxy#6629) Revert dispatcher stats (envoyproxy#6649) Batch implementation with timer (envoyproxy#6452) fault filter: reset token bucket on data start (envoyproxy#6627) event: update libevent dependency to fix race condition (envoyproxy#6637) examples: standardize docker-compose version and yaml extension (envoyproxy#6613) quiche: Implement SpdyUnsafeArena using SpdySimpleArena (envoyproxy#6612) router: support customizable retry back-off intervals (envoyproxy#6568) api: create OpenRCA service proto file (envoyproxy#6497) ext_authz: option for clearing route cache of authorized requests (envoyproxy#6503) build: update jinja to 2.10.1. (envoyproxy#6623) tools: check spelling in pre-push hook (envoyproxy#6631) security: blameless postmortem template. (envoyproxy#6553) Implementing Endpoint lease for ClusterLoadAssigment (envoyproxy#6477) add HTTP integration tests exercising timeouts (envoyproxy#6621) event: fix DispatcherImplTest::InitializeStats flake (envoyproxy#6619) Add tag extractor for RDS route config name (envoyproxy#6618) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Default behavior remains unchanged: retries will use the runtime parameter
defaulted to 25ms as the base interval and 250ms as the maximum. Allows
routes to customize the base and maximum intervals.
Risk Level: low (no change to default behavior)
Testing: unit tests
Doc Changes: included, plus updated description of back-off algorithm
Release Notes: added
Signed-off-by: Stephan Zuercher zuercher@gmail.com