overload: scale selected timers in response to load by akonradi · Pull Request #13475 · envoyproxy/envoy

akonradi · 2020-10-09T21:28:08Z

Commit Message: Scale timers in response to overload state
Additional Description:
Add a new overload action and use the new typed_config field to adjust timers based on resource pressure.
Risk Level: low - everything is disabled by default
Testing: ran unit and integration tests
Docs Changes: documented new overload action
Release Notes: reference new overload action

Towards #11427

repokitteh-read-only · 2020-10-09T21:28:14Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13475 was opened by akonradi.

see: more, trace.

antoniovicente · 2020-10-12T18:38:02Z

include/envoy/server/overload_manager.h


+enum class OverloadTimerType {
+  // Timers created with this type will never be scaled. This should only be used for testing.
+  UnscaledRealTimer,


unused enum value?

I added this in an intermediate commit for testing. It could be removed but is convenient for testing without a real timer type.

Can you add ForTest to the end of this enum name?

antoniovicente · 2020-10-12T18:41:55Z

source/server/overload_manager_impl.cc

+    ScaledTimerMinimum minimum =
+        minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
+    return std::make_unique<FixedMinimumScaledTimer>(
+        scaled_timer_manager_->createTimer(std::move(callback)), minimum);


What does this method do when there is no configured minimum? I think it should result in a timer that has minimum equal to max, or result in creation of a regular timer instead of a scaled one in order to preserve behavior before we introduced scaled timers.

The default minimum is ScaledTimerMinimum(1.0) which makes min = max. It seemed more straightforward to avoid special casing, though the behavior of the returned range timer is degenerate and it triggers exactly when its wrapper TimerPtr triggers.
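To illustrate the fallback described above, here is a minimal sketch of a lookup that treats a missing entry as a scale factor of 1.0, so the minimum equals the maximum and the timer is never shortened. The names `TimerType`, `minimumScaleFor`, and `minimumDuration` are hypothetical stand-ins, not the types in this PR.

```cpp
#include <cassert>
#include <chrono>
#include <map>

// Hypothetical stand-in for the timer type enum discussed in this thread.
enum class TimerType { HttpDownstreamConnectionIdle, UnscaledForTest };

// Returns the configured minimum scale factor for a timer type, or 1.0 when
// nothing is configured. A factor of 1.0 means min == max.
double minimumScaleFor(const std::map<TimerType, double>& minimums, TimerType type) {
  const auto it = minimums.find(type);
  return it != minimums.end() ? it->second : 1.0;
}

// Computes the minimum duration for a timer whose configured maximum is `max`.
// The result is truncated to whole milliseconds.
std::chrono::milliseconds minimumDuration(std::chrono::milliseconds max, double scale) {
  assert(scale >= 0.0 && scale <= 1.0);
  return std::chrono::duration_cast<std::chrono::milliseconds>(max * scale);
}
```

With no configured entry, `minimumDuration(max, minimumScaleFor(...))` returns `max` unchanged, which matches the pre-existing unscaled behavior without a special case.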

antoniovicente · 2020-10-12T18:44:48Z

source/server/overload_manager_impl.cc

+    auto minimum_it = timer_minimums_.find(timer_type);
+    ScaledTimerMinimum minimum =
+        minimum_it != timer_minimums_.end() ? minimum_it->second : ScaledTimerMinimum(1.0);
+    return std::make_unique<FixedMinimumScaledTimer>(


Should the implementation of FixedMinimumScaledTimer and ScaledTimerMinimum be moved to scaled timer manager, and have scaled timer manager return instances of TimerPtr interface instead?

I think this may allow us to avoid having a unique_ptr scaled timer inside FixedMinimumScaledTimer, which would reduce allocations and memory usage.

I have no objections, though in that case we'd want to get rid of the RangeTimer interface as well. Yeah I'm not a big fan of my existing implementation where we wrap a TimerPtr with a RangeTimerPtr and then again in a TimerPtr. I'm a little unclear on how that would work though: would ScaledTimerManager::createTimer take either a minimum duration or scaling factor?

Let's start by moving this wrapper to scaled timer manager

Have ScaledTimerManager::createTimer take ScaledTimerMinimum as an input.

Split out as #13524

antoniovicente · 2020-10-12T18:54:00Z

test/server/overload_manager_impl_test.cc

+          - name: "envoy.resource_monitors.fake_resource1"
+            scaled:
+              scaling_threshold: 0.9
+              saturation_threshold: 0.9


There seems to be significant whitespace change here due to the change in syntax used to represent the protos. Is this necessary? Consider doing this change in a separate PR.

Not necessary in this PR; I'll break it up.

Syntax change is now in #13518

antoniovicente · 2020-10-12T19:01:15Z

test/mocks/event/mocks.h

+  MOCK_METHOD(bool, enabled, (), (override));
+
+  const ScopeTrackedObject* scope_{};
+  bool enabled_{};


scope_ and enabled_ seem unused.

Yep, overzealous copy-paste. I'll remove them in the next commit.

Add a method to ScaledRangeTimerManager to return a TimerPtr that wraps a RangeTimerImpl. While the wrapper could be implemented externally by wrapping a RangeTimerPtr, wrapping the impl class is more efficient since it requires less indirection and fewer heap allocations. Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente · 2020-10-14T01:18:59Z

api/envoy/config/overload/v3/overload.proto

+message ScaleTimersOverloadActionConfig {
+  enum TimerType {
+    // Adjusts the idle timer for downstream HTTP connections that takes effect when there are no active streams.
+    HTTP_DOWNSTREAM_CONNECTION_IDLE = 0;


Is it worth mentioning this override in route_components.proto near:
google.protobuf.Duration idle_timeout = 24;

For discoverability.

Also, add here a reference to the timeout parameter that is affected by this min override.

Explicitly provide the template parameters to absl::visit since using deduction causes Windows builds to fail. It appears that there is an internal instantiation of std::variant_size, which doesn't work with subclasses of absl::variant. Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente · 2020-10-14T21:54:15Z

api/envoy/config/overload/v3/overload.proto


+  // Configuration for the action being instantiated.
+  google.protobuf.Any typed_config = 3;
+


nit: Consider ordering by field id, since there's no obvious reason to put typed_config next to the name parameter.

antoniovicente · 2020-10-14T21:55:36Z

include/envoy/event/scaled_range_timer_manager.h

+  virtual RangeTimerPtr createTimer(TimerCb callback) PURE;
+
+  /**
+   * Sets the scale factor for all timers created through this manager. The value should be between


change "should" to "must"

antoniovicente · 2020-10-14T21:56:07Z

include/envoy/event/scaled_range_timer_manager.h

+
+  /**
+   * Sets the scale factor for all timers created through this manager. The value should be between
+   * 0 and 1, inclusive. The scale factor affects the amount of time timers spend in their target


0.0 and 1.0

antoniovicente · 2020-10-14T22:01:53Z

include/envoy/server/overload_manager.h


+enum class OverloadTimerType {
+  // Timers created with this type will never be scaled. This should only be used for testing.
+  UnscaledRealTimer,


Can you add ForTest to the end of this enum name?

antoniovicente · 2020-10-14T22:12:47Z

include/envoy/event/scaled_range_timer_manager.h

+ * of their range by the owner of the manager object. Users of this class can call createTimer() to
+ * receive a new RangeTimer object that they can then enable or disable at will (but only on the
+ * same dispatcher), and setScaleFactor() to change the scaling factor. The current scale factor is
+ * applied to all timers, including those that are created later.


I think the current wording makes it unclear if updates to scale apply to timers that are already created.

Consider: Updates to the current scale factor are applied to all timers, including those created in the past.
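As a rough illustration of what the scale factor does to a range timer, the sketch below assumes linear interpolation between the minimum and maximum: a factor of 1.0 yields the full maximum and 0.0 yields the minimum. The function name `effectiveTimeout` is hypothetical, and the interpolation is an assumption about the intended semantics rather than a copy of the implementation.

```cpp
#include <cassert>
#include <chrono>

// Assumed semantics: the timer fires after min + (max - min) * scale_factor.
// The scaled part of the range is truncated to whole milliseconds.
std::chrono::milliseconds effectiveTimeout(std::chrono::milliseconds min,
                                           std::chrono::milliseconds max,
                                           double scale_factor) {
  assert(scale_factor >= 0.0 && scale_factor <= 1.0);
  assert(min <= max);
  const auto range = max - min;
  return min + std::chrono::duration_cast<std::chrono::milliseconds>(range * scale_factor);
}
```

Because the factor is read when the timeout is computed, a manager that recomputes on every update would naturally apply a new factor to timers created earlier as well, which is the point the suggested wording tries to make explicit.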

antoniovicente · 2020-10-14T22:18:37Z

api/envoy/config/overload/v3/overload.proto

  }
 }

+message ScaleTimersOverloadActionConfig {


There are no mentions of this config corresponding to the "envoy.overload_actions.reduce_timeouts" action.

Should envoy.overload_actions.reduce_timeouts be mentioned in this config message comment?
Should there be some doc updates that mention how to configure this action?

antoniovicente · 2020-10-14T22:19:26Z

api/envoy/config/overload/v3/overload.proto

+
+  // A set of timer scaling rules to be applied.
+  repeated ScaleTimer timer_scale_factors = 1 [(validate.rules).repeated = {min_items: 1}];
+}


Please add release notes describing this new feature.

antoniovicente · 2020-10-14T22:34:20Z

source/server/overload_manager_impl.h

  std::vector<std::string> names_;
 };

+class ScaledTimerMinimum {


class and method comments.

antoniovicente · 2020-10-14T22:34:52Z

source/server/overload_manager_impl.h

+
+  std::chrono::milliseconds compute(std::chrono::milliseconds ms) const;
+
+  struct ScaleFactor {


I think these 2 structs can be private now.

antoniovicente · 2020-10-14T22:36:27Z

source/server/overload_manager_impl.cc

+        scale_timer.has_min_timeout()
+            ? ScaledTimerMinimum(std::chrono::milliseconds(
+                  DurationUtil::durationToMilliseconds(scale_timer.min_timeout())))
+            : ScaledTimerMinimum(scale_timer.min_scale().value() / 100);


Done for clarity, though I think type promotion makes this functionally identical.

KBaichoo · 2020-10-20T18:50:24Z

api/envoy/config/overload/v3/overload.proto

+      // Sets the minimum duration as a percentage of the maximum value.
+      type.v3.Percent min_scale = 3;


Seems like this might be a way to shoot yourself in the foot as it's not as intuitive and easy to compare as absolute time -- unless you end up doing the calculation yourself in which case maybe just use the min_timeout to set the absolute duration.

If this gets set too low, I could see the aggressive timeouts causing cascading failures, i.e. user requests time out so they end up refreshing the page.

I agree, though a lot of the overload manager actions let you shoot yourself in the foot. Having a scale factor allows this to apply meaningfully to timers that can have different values across listeners.
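The trade-off between the two options can be sketched as follows. This is a minimal illustration, assuming a minimum is either an absolute duration or a percentage of the configured maximum; `AbsoluteMinimum`, `PercentMinimum`, and `computeMinimum` are hypothetical names, and clamping an absolute minimum to the maximum is an assumption about the min > max case.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <variant>

using Ms = std::chrono::milliseconds;

// Hypothetical representations of the two configuration choices.
struct AbsoluteMinimum { Ms value; };
struct PercentMinimum { double percent; };  // 0 to 100
using TimerMinimum = std::variant<AbsoluteMinimum, PercentMinimum>;

// Resolves a configured minimum against a timer's configured maximum.
Ms computeMinimum(const TimerMinimum& minimum, Ms max) {
  if (const auto* absolute = std::get_if<AbsoluteMinimum>(&minimum)) {
    // Assumption: an absolute minimum larger than max is clamped to max.
    return std::min(absolute->value, max);
  }
  const double percent = std::get<PercentMinimum>(minimum).percent;
  assert(percent >= 0.0 && percent <= 100.0);
  // Truncates to whole milliseconds.
  return std::chrono::duration_cast<Ms>(max * (percent / 100.0));
}
```

With a percentage of 50, a listener with a 60 s idle timeout and one with a 300 s idle timeout get different minimums (30 s and 150 s), whereas an absolute minimum gives both the same floor. That is the "applies meaningfully across listeners" argument, and also why the percentage is harder to compare at a glance.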

antoniovicente · 2020-10-23T03:01:47Z

source/server/overload_manager_impl.cc

+      : action_symbol_table_(action_symbol_table), timer_minimums_(timer_minimums),
+        actions_(action_symbol_table.size(), OverloadActionState(0)),
+        scaled_timer_action_(action_symbol_table.lookup(OverloadActionNames::get().ReduceTimeouts)),
+        scaled_timer_manager_(std::move(scaled_timer_manager_)) {}


scaled_timer_manager_(std::move(scaled_timer_manager)) {}
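For context on why the one-character change matters: initializing a member from itself reads the member before it has been constructed, so the constructor argument is silently dropped. A minimal sketch of the corrected pattern, using hypothetical placeholder types rather than the classes in this PR:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical placeholder for the real manager type.
struct ScaledTimerManager {};

class ThreadLocalState {
public:
  explicit ThreadLocalState(std::unique_ptr<ScaledTimerManager> scaled_timer_manager)
      // Move from the constructor parameter (no trailing underscore), not from
      // the member being initialized.
      : scaled_timer_manager_(std::move(scaled_timer_manager)) {
    assert(scaled_timer_manager_ != nullptr);
  }

  bool hasManager() const { return scaled_timer_manager_ != nullptr; }

private:
  std::unique_ptr<ScaledTimerManager> scaled_timer_manager_;
};
```

Compilers can often flag the self-initialization form (for example with `-Wuninitialized` or `-Winit-self`, depending on the toolchain), but it is easy to miss when the parameter and member names differ only by an underscore.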

akonradi · 2020-10-26T15:07:47Z

The macos check looks to be the same GOAWAY issue seen elsewhere. Would merging master help, or should I just kick off a retest?

akonradi · 2020-10-27T14:54:52Z

/retest

repokitteh-read-only · 2020-10-27T14:54:56Z

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13475 (comment) was created by @akonradi.

see: more, trace.

mattklein123

Thanks very cool. Just some API/doc comments from a review.

/wait

mattklein123 · 2020-10-27T16:46:32Z

api/envoy/config/route/v3/route_components.proto

  // upstream response header has been received, otherwise a stream reset
  // occurs.
+  //
+  // If the overload action "envoy.overload_actions.reduce_timeouts" is configured, this timeout


Can you ref link "overload action" to the overload manager docs?

mattklein123 · 2020-10-27T16:47:21Z

api/envoy/config/overload/v3/overload.proto

+// Typed configuration for the "envoy.overload_actions.reduce_timeouts" action.
+message ScaleTimersOverloadActionConfig {
+  enum TimerType {
+    UNSPECIFIED = 0;


Is the reason this is specified to force the user to explicitly configure the timeout they want? I think that's OK but can you add a small comment?

Yep, done. I would have liked to just start at 1 but the compiler (or maybe lint checker?) complains.

mattklein123 · 2020-10-27T16:50:43Z

api/envoy/config/overload/v3/overload.proto

+    oneof overload_adjust {
+      option (validate.required) = true;
+
+      // Sets the minimum duration as an absolute value.


From reading the docs, it's not immediately clear to me how min_timeout interacts with the scaling. Can you flesh this out a bit more to better describe how everything fits together in terms of scaling? An example would probably help.

Elaborated in the docs.

Thanks the example is very helpful. Can you potentially ref link to the docs somehow from here just so that people can go back and forth if they are confused what this does?

/wait
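To make the interaction between the trigger and the minimum concrete, here is a rough end-to-end sketch. It assumes a scaled trigger that maps resource pressure linearly onto an action state in [0, 1] between `scaling_threshold` and `saturation_threshold`, and that the timer scale factor is one minus that state. Both are assumptions for illustration, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Maps resource pressure to an action state in [0, 1] for a scaled trigger.
// Checking saturation first also covers the degenerate case where both
// thresholds are equal, so the division below never divides by zero.
double actionState(double pressure, double scaling_threshold, double saturation_threshold) {
  assert(scaling_threshold <= saturation_threshold);
  if (pressure >= saturation_threshold) {
    return 1.0;
  }
  if (pressure <= scaling_threshold) {
    return 0.0;
  }
  return (pressure - scaling_threshold) / (saturation_threshold - scaling_threshold);
}

// Assumed relationship: the timeout shrinks linearly from max to min as the
// action state grows from 0 to 1.
double scaledTimeoutSeconds(double min_seconds, double max_seconds, double state) {
  const double scale_factor = 1.0 - std::clamp(state, 0.0, 1.0);
  return min_seconds + (max_seconds - min_seconds) * scale_factor;
}
```

Under these assumptions, a timer configured with a 600 s maximum and a 2 s `min_timeout` keeps the full 600 s while pressure is below the scaling threshold and drops to 2 s once the trigger saturates.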

mattklein123

Thanks LGTM can you merge main?

/wait

akonradi added 4 commits October 9, 2020 10:01

Add interface for ScaledRangeTimerManager

90f775a

This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>

Convert config literals to YAML

0dd57aa

Signed-off-by: Alex Konradi <akonradi@google.com>

Add reduce timeouts overload action

151f1ca

Add the API protos and implementation framework for scaling timers based on resource pressure with a scaled trigger. Signed-off-by: Alex Konradi <akonradi@google.com>

Enable scaling HTTP connection idle timeout

c5c2be9

Allow Envoy to drop HTTP downstreams with idle connections in response to overload conditions. Signed-off-by: Alex Konradi <akonradi@google.com>

repokitteh-read-only bot added the api label Oct 9, 2020

akonradi marked this pull request as draft October 9, 2020 21:28

mattklein123 assigned antoniovicente Oct 10, 2020

Fix formatting

26a4b2d

Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente mentioned this pull request Oct 12, 2020

http: add request header timer #13341

Merged

antoniovicente reviewed Oct 12, 2020

akonradi added 3 commits October 12, 2020 15:49

Add interface for ScaledRangeTimerManager

f95bd82

This will allow mocking for users of the timer manager. Signed-off-by: Alex Konradi <akonradi@google.com>

Address review feedback

324198d

Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente reviewed Oct 14, 2020

akonradi added 2 commits October 14, 2020 10:56

Add comments

dc82f73

Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente reviewed Oct 14, 2020

akonradi added 6 commits October 15, 2020 10:15

Merge remote-tracking branch 'upstream/master' into scaled-timer-interface

b3f1c99

Signed-off-by: Alex Konradi <akonradi@google.com>

Merge remote-tracking branch 'upstream/master' into scaled-timer-interface

794e9e1

Signed-off-by: Alex Konradi <akonradi@google.com>

Fix scaled timer min > max behavior

5808eed

Signed-off-by: Alex Konradi <akonradi@google.com>

Fix formatting

e29329b

Signed-off-by: Alex Konradi <akonradi@google.com>

Merge branch 'scaled-timer-interface' into overload-scaled-timeouts

20c8af4

Signed-off-by: Alex Konradi <akonradi@google.com>

Address feedback

08f1331

Signed-off-by: Alex Konradi <akonradi@google.com>

akonradi marked this pull request as ready for review October 20, 2020 17:50

KBaichoo reviewed Oct 20, 2020

akonradi added 4 commits October 20, 2020 15:31

Add missed generated api change

226c183

Signed-off-by: Alex Konradi <akonradi@google.com>

Remove RangeTimer interface

4b00ebe

Signed-off-by: Alex Konradi <akonradi@google.com>

Merge branch 'scaled-timer-interface' into overload-scaled-timeouts

72916ca

Signed-off-by: Alex Konradi <akonradi@google.com>

Merge remote-tracking branch 'upstream/master' into overload-scaled-timeouts

99fb591

Signed-off-by: Alex Konradi <akonradi@google.com>

antoniovicente previously approved these changes Oct 23, 2020

Fix constructor argument name

6bbcd2a

Also fix formatting Signed-off-by: Alex Konradi <akonradi@google.com>

akonradi dismissed antoniovicente’s stale review via 6bbcd2a October 23, 2020 15:20

antoniovicente previously approved these changes Oct 23, 2020

akonradi added 2 commits October 23, 2020 16:12

Fix coverage for new method never called

43f0673

The admin handler won't ever call the method to create a scaled timer since it won't have the timeout set. Signed-off-by: Alex Konradi <akonradi@google.com>

Merge remote-tracking branch 'upstream/master' into overload-scaled-timeouts

94821c0

Signed-off-by: Alex Konradi <akonradi@google.com>

akonradi dismissed antoniovicente’s stale review via 94821c0 October 23, 2020 20:13

antoniovicente previously approved these changes Oct 23, 2020

mattklein123 self-assigned this Oct 24, 2020

mattklein123 requested changes Oct 27, 2020

repokitteh-read-only bot added the waiting label Oct 27, 2020

Add example for overload timeout reduction config

28500e1

Signed-off-by: Alex Konradi <akonradi@google.com>

akonradi dismissed antoniovicente’s stale review via 28500e1 October 27, 2020 19:39

repokitteh-read-only bot added waiting and removed waiting labels Oct 27, 2020

Add link to example

7825f39

Signed-off-by: Alex Konradi <akonradi@google.com>

repokitteh-read-only bot removed the waiting label Oct 27, 2020

mattklein123 requested changes Oct 27, 2020

repokitteh-read-only bot added the waiting label Oct 27, 2020

Merge branch 'master' into overload-scaled-timeouts

2e2d5e0

repokitteh-read-only bot removed the waiting label Oct 27, 2020

mattklein123 approved these changes Oct 28, 2020

repokitteh-read-only bot removed the api label Oct 28, 2020

mattklein123 merged commit a68755d into envoyproxy:master Oct 28, 2020

akonradi deleted the overload-scaled-timeouts branch October 29, 2020 16:32

antoniovicente mentioned this pull request Nov 20, 2020

[http] Adaptive timeouts for idle HTTP client connections waiting for a request #11427

Closed

