overload: scale selected timers in response to load #13475
mattklein123 merged 29 commits into envoyproxy:master
This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Add the API protos and implementation framework for scaling timers based on resource pressure with a scaled trigger. Signed-off-by: Alex Konradi <akonradi@google.com>
Allow Envoy to drop HTTP downstreams with idle connections in response to overload conditions. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
enum class OverloadTimerType {
  // Timers created with this type will never be scaled. This should only be used for testing.
  UnscaledRealTimer,
I added this in an intermediate commit for testing. It could be removed but is convenient for testing without a real timer type.
Can you add ForTest to the end of this enum name?
ScaledTimerMinimum minimum =
    minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
return std::make_unique<FixedMinimumScaledTimer>(
    scaled_timer_manager_->createTimer(std::move(callback)), minimum);
What does this method do when there is no configured minimum? I think it should result in a timer that has minimum equal to max, or result in creation of a regular timer instead of a scaled one in order to preserve behavior before we introduced scaled timers.
The default minimum is ScaledTimerMinimum(1.0) which makes min = max. It seemed more straightforward to avoid special casing, though the behavior of the returned range timer is degenerate and it triggers exactly when its wrapper TimerPtr triggers.
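To make the "no configured minimum" case concrete, here is a minimal sketch of the lookup-with-default pattern described above. The type names `TimerType` and `ScaleFactorMinimum` and the helper `minimumFor` are hypothetical stand-ins, not the PR's actual definitions; the only behavior taken from the thread is that a missing entry falls back to a factor of 1.0, so the minimum equals the maximum.

```cpp
#include <chrono>
#include <map>

// Hypothetical stand-ins for the types in this thread; not Envoy's definitions.
enum class TimerType { HttpDownstreamIdle, UnscaledForTest };

// A minimum expressed as a fraction of the timer's maximum duration.
struct ScaleFactorMinimum {
  double factor; // 1.0 means minimum == maximum, so the timer is never shortened.

  std::chrono::milliseconds compute(std::chrono::milliseconds max) const {
    return std::chrono::duration_cast<std::chrono::milliseconds>(max * factor);
  }
};

// Returns the configured minimum, or the degenerate 1.0 default when none is configured.
ScaleFactorMinimum minimumFor(const std::map<TimerType, ScaleFactorMinimum>& configured,
                              TimerType type) {
  const auto it = configured.find(type);
  return it != configured.end() ? it->second : ScaleFactorMinimum{1.0};
}
```

With the default in place, an unconfigured timer type behaves like a regular timer without a separate code path, which is the trade-off the reply argues for.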
auto minimum_it = timer_minimums_.find(timer_type);
ScaledTimerMinimum minimum =
    minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
return std::make_unique<FixedMinimumScaledTimer>(
Should the implementation of FixedMinimumScaledTimer and ScaledTimerMinimum be moved to scaled timer manager, and have scaled timer manager return instances of TimerPtr interface instead?
I think this may allow us to avoid holding a unique_ptr to the scaled timer inside FixedMinimumScaledTimer, which would reduce allocations and memory usage.
I have no objections, though in that case we'd want to get rid of the RangeTimer interface as well. Yeah I'm not a big fan of my existing implementation where we wrap a TimerPtr with a RangeTimerPtr and then again in a TimerPtr. I'm a little unclear on how that would work though: would ScaledTimerManager::createTimer take either a minimum duration or scaling factor?
Let's start by moving this wrapper to scaled timer manager
Have ScaledTimerManager::createTimer take ScaledTimerMinimum as an input.
- name: "envoy.resource_monitors.fake_resource1"
  scaled:
    scaling_threshold: 0.9
    saturation_threshold: 0.9
There seems to be significant whitespace change here due to the change in syntax used to represent the protos. Is this necessary? Consider doing this change in a separate PR.
Not necessary in this PR; I'll break it up.
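For readers unfamiliar with the two thresholds in the test config above, the following sketch shows one plausible reading of a scaled trigger. The function name and the linear interpolation are assumptions made for illustration: the action state is 0 at or below `scaling_threshold`, 1 at or above `saturation_threshold`, and linear in between.

```cpp
// Hypothetical helper; maps resource pressure in [0, 1] to an action state in [0, 1].
double scaledTriggerState(double pressure, double scaling_threshold,
                          double saturation_threshold) {
  // Check saturation first so equal thresholds (as in the test config) never divide by zero.
  if (pressure >= saturation_threshold) {
    return 1.0;
  }
  if (pressure <= scaling_threshold) {
    return 0.0;
  }
  return (pressure - scaling_threshold) / (saturation_threshold - scaling_threshold);
}
```

Under this reading, a config with both thresholds at 0.9 degenerates into a simple on/off threshold.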
test/mocks/event/mocks.h
Outdated
MOCK_METHOD(bool, enabled, (), (override));

const ScopeTrackedObject* scope_{};
bool enabled_{};
scope_ and enabled_ seem unused.
Yep, overzealous copy-paste. I'll remove them in the next commit.
This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>
Add a method to ScaledRangeTimerManager to return a TimerPtr that wraps a RangeTimerImpl. While the wrapper could be implemented externally by wrapping a RangeTimerPtr, wrapping the impl class is more efficient since it requires less indirection and fewer heap allocations. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
message ScaleTimersOverloadActionConfig {
  enum TimerType {
    // Adjusts the idle timer for downstream HTTP connections that takes effect when there are no active streams.
    HTTP_DOWNSTREAM_CONNECTION_IDLE = 0;
Is it worth mentioning this override in route_components.proto near:
google.protobuf.Duration idle_timeout = 24;
For discoverability.
Also, add here a reference to the timeout parameter that is affected by this min override.
Explicitly provide the template parameters to absl::visit since using deduction causes Windows builds to fail. It appears that there is an internal instantiation of std::variant_size, which doesn't work with subclasses of absl::variant. Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
// Configuration for the action being instantiated.
google.protobuf.Any typed_config = 3;
nit: Consider ordering by field id, since there's no obvious reason to put typed_config next to the name parameter.
virtual RangeTimerPtr createTimer(TimerCb callback) PURE;

/**
 * Sets the scale factor for all timers created through this manager. The value should be between
change "should" to "must"
/**
 * Sets the scale factor for all timers created through this manager. The value should be between
 * 0 and 1, inclusive. The scale factor affects the amount of time timers spend in their target
 * of their range by the owner of the manager object. Users of this class can call createTimer() to
 * receive a new RangeTimer object that they can then enable or disable at will (but only on the
 * same dispatcher), and setScaleFactor() to change the scaling factor. The current scale factor is
 * applied to all timers, including those that are created later.
I think the current wording makes it unclear if updates to scale apply to timers that are already created.
Consider: Updates to the current scale factor are applied to all timers, including those created in the past.
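The wording question is easier to judge with the intended semantics spelled out. Below is a minimal sketch, assuming the manager stores a single scale factor that each timer reads when its duration is evaluated, and assuming a linear mapping between the minimum and maximum. The class and method names are hypothetical and do not correspond to the PR's ScaledRangeTimerManager implementation.

```cpp
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// Hypothetical manager: owns one scale factor shared by every timer it hands out.
class ScaledManagerSketch {
public:
  // The factor must be in [0, 1]: 1 leaves timers at their maximum, 0 collapses them to the minimum.
  void setScaleFactor(double factor) {
    assert(factor >= 0.0 && factor <= 1.0);
    scale_factor_ = factor;
  }

  // Assumed mapping: interpolate linearly between the minimum and the maximum.
  ms effectiveDuration(ms min, ms max) const {
    return min + std::chrono::duration_cast<ms>((max - min) * scale_factor_);
  }

private:
  double scale_factor_{1.0};
};

// Hypothetical timer handle: reads the manager's factor lazily, so an update to the
// factor is visible to timers created before and after the update.
struct RangeTimerSketch {
  const ScaledManagerSketch& manager;
  ms min;
  ms max;

  ms currentDuration() const { return manager.effectiveDuration(min, max); }
};
```

Because the timers hold no copy of the factor, a timer created before `setScaleFactor(0.5)` and one created after it both observe 0.5, which matches the suggested rewording.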
}
}

message ScaleTimersOverloadActionConfig {
There are no mentions of this config corresponding to the "envoy.overload_actions.reduce_timeouts" action.
Should envoy.overload_actions.reduce_timeouts be mentioned in this config message comment?
Should there be some doc updates that mention how to configure this action?
// A set of timer scaling rules to be applied.
repeated ScaleTimer timer_scale_factors = 1 [(validate.rules).repeated = {min_items: 1}];
}
Please add release notes describing this new feature.
std::vector<std::string> names_;
};

class ScaledTimerMinimum {
Please add class and method comments.
std::chrono::milliseconds compute(std::chrono::milliseconds ms) const;

struct ScaleFactor {
I think these 2 structs can be private now.
scale_timer.has_min_timeout()
    ? ScaledTimerMinimum(std::chrono::milliseconds(
          DurationUtil::durationToMilliseconds(scale_timer.min_timeout())))
    : ScaledTimerMinimum(scale_timer.min_scale().value() / 100);
Done for clarity, though I think type promotion makes this functionally identical.
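The type-promotion point can be checked in isolation. This sketch assumes the percent value is a `double` (as it is for the `value` field of `type.v3.Percent`) and models it as a plain function parameter; the function names are hypothetical. Dividing a `double` by the integer literal `100` converts the literal to `double` first, so it matches dividing by `100.0`; truncation only appears when both operands are integers.

```cpp
// double / int: the int literal is converted to double, so this is floating-point division.
double percentToFraction(double percent) { return percent / 100; }

// Same result, with the intent spelled out in the literal.
double percentToFractionExplicit(double percent) { return percent / 100.0; }

// Contrast: int / int truncates toward zero, which is the case the explicit literal guards against.
int truncatingFraction(int percent) { return percent / 100; }
```

The explicit literal costs nothing and protects the expression if the left operand ever becomes an integer type.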
…rface Signed-off-by: Alex Konradi <akonradi@google.com>
…rface Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
// Sets the minimum duration as a percentage of the maximum value.
type.v3.Percent min_scale = 3;
Seems like this might be a way to shoot yourself in the foot as it's not as intuitive and easy to compare as absolute time -- unless you end up doing the calculation yourself in which case maybe just use the min_timeout to set the absolute duration.
If this gets set too low, I could see the aggressive timeouts causing cascading failures, e.g. a user's request times out, so they end up refreshing the page.
I agree, though a lot of the overload manager actions let you shoot yourself in the foot. Having a scale factor allows this to apply meaningfully to timers that can have different values across listeners.
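A small sketch of the trade-off being discussed, assuming a minimum is either an absolute duration or a fraction of the configured timeout. `MinimumSketch`, `AbsoluteMinimum`, and `computeMinimum` are hypothetical names (only `ScaleFactor` and a `compute` method appear in the diff), and clamping the absolute value to the configured maximum is an assumption added here so that the minimum never exceeds the maximum.

```cpp
#include <algorithm>
#include <chrono>
#include <variant>

using ms = std::chrono::milliseconds;

struct AbsoluteMinimum {
  ms value;
};
struct ScaleFactor {
  double value; // Fraction of the configured timeout, in [0, 1].
};
using MinimumSketch = std::variant<AbsoluteMinimum, ScaleFactor>;

// Computes the shortest duration a timer may be scaled down to.
ms computeMinimum(const MinimumSketch& minimum, ms configured_max) {
  if (const auto* absolute = std::get_if<AbsoluteMinimum>(&minimum)) {
    // Assumption: an absolute minimum larger than the timeout is clamped to the timeout.
    return std::min(absolute->value, configured_max);
  }
  return std::chrono::duration_cast<ms>(configured_max * std::get<ScaleFactor>(minimum).value);
}
```

For two listeners with idle timeouts of 60 s and 10 s, a scale factor of 0.5 yields minimums of 30 s and 5 s, while an absolute minimum of 20 s yields 20 s for the first and leaves the second effectively unscaled. That is the sense in which a percentage applies "meaningfully" across listeners, at the cost of being harder to reason about than a fixed duration.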
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts Signed-off-by: Alex Konradi <akonradi@google.com>
: action_symbol_table_(action_symbol_table), timer_minimums_(timer_minimums),
  actions_(action_symbol_table.size(), OverloadActionState(0)),
  scaled_timer_action_(action_symbol_table.lookup(OverloadActionNames::get().ReduceTimeouts)),
  scaled_timer_manager_(std::move(scaled_timer_manager_)) {}
scaled_timer_manager_(std::move(scaled_timer_manager)) {}
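The one-character difference is easy to miss, so here is a minimal sketch of the corrected pattern. The class names are hypothetical, and holding the manager in a `std::unique_ptr` is an assumption made for illustration. The point is that the initializer must consume the constructor parameter (no trailing underscore); naming the member instead would read it before it has been initialized.

```cpp
#include <memory>
#include <utility>

// Hypothetical placeholder for the scaled timer manager type.
struct TimerManagerSketch {};

class ThreadLocalStateSketch {
public:
  explicit ThreadLocalStateSketch(std::unique_ptr<TimerManagerSketch> scaled_timer_manager)
      // Correct: moves from the parameter. Writing std::move(scaled_timer_manager_) here
      // would initialize the member from itself and silently drop the argument.
      : scaled_timer_manager_(std::move(scaled_timer_manager)) {}

  bool hasManager() const { return scaled_timer_manager_ != nullptr; }

private:
  std::unique_ptr<TimerManagerSketch> scaled_timer_manager_;
};
```

Compilers can flag the self-initialization form (for example with Clang's `-Wuninitialized`), but it is not an error, which is why it survived until review.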
Also fix formatting Signed-off-by: Alex Konradi <akonradi@google.com>
The admin handler won't ever call the method to create a scaled timer since it won't have the timeout set. Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts Signed-off-by: Alex Konradi <akonradi@google.com>
The macOS check looks to be the same GOAWAY issue seen elsewhere. Would merging master help, or should I just kick off a retest?
/retest
Retrying Azure Pipelines:
mattklein123 left a comment
Thanks, very cool. Just some API/doc comments from a review.
/wait
// upstream response header has been received, otherwise a stream reset
// occurs.
//
// If the overload action "envoy.overload_actions.reduce_timeouts" is configured, this timeout
Can you ref link "overload action" to the overload manager docs?
// Typed configuration for the "envoy.overload_actions.reduce_timeouts" action.
message ScaleTimersOverloadActionConfig {
  enum TimerType {
    UNSPECIFIED = 0;
Is the reason this is specified to force the user to explicitly configure the timeout they want? I think that's OK but can you add a small comment?
Yep, done. I would have liked to just start at 1 but the compiler (or maybe lint checker?) complains.
oneof overload_adjust {
  option (validate.required) = true;

  // Sets the minimum duration as an absolute value.
From reading the docs, it's not immediately clear to me how min_timeout interacts with the scaling. Can you flesh this out a bit more to better describe how everything fits together in terms of scaling? An example would probably help.
Elaborated in the docs.
Thanks, the example is very helpful. Can you add a ref link to the docs from here so that people can go back and forth if they are confused about what this does?
/wait
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
mattklein123 left a comment
Thanks, LGTM. Can you merge main?
/wait
Commit Message: Scale timers in response to overload state
Additional Description:
Add a new overload action and use the new typed_config field to adjust timers based on resource pressure.
Risk Level: low - everything is disabled by default
Testing: ran unit and integration tests
Docs Changes: documented new overload action
Release Notes: reference new overload action
Towards #11427