Merged
Changes from 41 commits
75 commits
4173b08
Support slow Start mode in Envoy
Sep 18, 2020
2f8dad0
Support slow Start mode in Envoy
Sep 18, 2020
627c910
Introduce creation_time field into host description
Sep 22, 2020
ed27cb7
Propagate timeSource to edf lb
Sep 30, 2020
80cd8eb
Fix weight adjustment formula
Oct 1, 2020
161cbaf
Draft: Track hosts in slow start mode in edf lb
Oct 1, 2020
d7f395c
Track hosts in slow start mode in edf lb
Oct 5, 2020
944607e
Merge remote-tracking branch 'origin/master'
Nov 27, 2020
f966f8b
switch to btree_set for tracking hosts in slow start
Dec 8, 2020
f61216c
Fix logical statement
Dec 8, 2020
6bfb2e0
Parametrize time bias
Dec 10, 2020
a02698d
Fix logger inheritance
Dec 14, 2020
23e517e
Add config validation for slow start in orig_dst_cluster lb
Dec 16, 2020
a4f697d
Merge remote-tracking branch 'origin/master'
Dec 16, 2020
d0f2cd2
Fix logic when tracking hosts in slow start
Jan 22, 2021
3ab3951
Adding tests
Feb 3, 2021
e5a8534
Add support for "first passing HC" slow start mode
Feb 9, 2021
e9e93ea
Fix comparator, add test for runtime updates
Feb 10, 2021
a365f6e
Merge remote-tracking branch 'origin/main'
Feb 15, 2021
2156dc3
Cleanup
Feb 15, 2021
fe0e551
Fix CI
Feb 16, 2021
38f792a
Fix CI
Feb 16, 2021
bbc3fda
Fix more CI
Feb 17, 2021
7d1cdb4
Some docs, some CI fixes...
Feb 17, 2021
18f0463
Fix clang
Feb 18, 2021
3cf6f9a
Update documentation
Feb 19, 2021
1510abb
Revert extra formatting
Feb 22, 2021
c038daf
Apply review comments
Mar 8, 2021
c4b8f8b
Apply review comments
Mar 8, 2021
bd87893
Fix format
Mar 8, 2021
0963656
Merge remote-tracking branch 'origin/main' into main
Mar 8, 2021
33737f8
Fix spelling in docs
Mar 10, 2021
0cbdbe7
Fix spelling
Mar 10, 2021
43b2f54
Fix build, apply rome view comments
Mar 15, 2021
4e8b9d7
Get rid of endpoint warming policy
Mar 29, 2021
78be70e
Remove unused import
Mar 29, 2021
dc1bb99
Fix tests, clarify docs
Mar 30, 2021
5d5d231
Clarify docs
Mar 30, 2021
fdbbd5f
remove extra space
Mar 30, 2021
b371ece
Apply review comment and fix build
Apr 7, 2021
1602a7b
Update formula, docs and clean up
Apr 16, 2021
bf32ee5
Update API+docs with new formula
Apr 27, 2021
9d96d4b
Merge remote-tracking branch 'origin/main'
Apr 28, 2021
f1670a9
Introduce aggression parameter
Apr 28, 2021
7a495e0
Fix docs format
Apr 29, 2021
941a43e
Fix math bug and add basic test
Apr 29, 2021
3467ca4
add more tests
Apr 29, 2021
bd467d6
Apply review comments, finish tests for RR
May 5, 2021
7d8022d
Slow start support in LR and initial test
May 6, 2021
49cd453
More tests for LR slow start
May 11, 2021
c6f2b86
Refactor duplicated code
May 17, 2021
514dabf
Update slow start example table
May 18, 2021
96d7b76
Bump memory limit per cluster
May 19, 2021
6a98431
Merge remote-tracking branch 'origin/main'
May 19, 2021
74557b9
Applied review comments
Aug 16, 2021
1a23da6
Merge remote-tracking branch 'origin/main' into HEAD
Aug 27, 2021
3e4f49a
Fix merge errors
Aug 27, 2021
4a2a508
Fix weird formatting
Aug 27, 2021
875c763
Fix proto and extra formatting
Aug 27, 2021
ccc9338
Move out slow start config from common lb config
Aug 27, 2021
19d288d
Apply more comments and fix some tests
Sep 3, 2021
2766a4f
fix doc and format
Sep 3, 2021
a2b1261
Fix mock default behaviour
Sep 6, 2021
5e18212
Update diagram with example
Sep 6, 2021
2e9d0ff
fix asan
Sep 8, 2021
2002d00
Bump memory limit
Sep 10, 2021
2128535
Apply review comment
Sep 10, 2021
224daa2
Fix graph and spelling in docs
Sep 10, 2021
b3c5c43
Merge remote-tracking branch 'origin/main' into HEAD
Sep 15, 2021
4d3efe7
apply review comments
Sep 15, 2021
3ed1ff4
Fix doc format
Sep 16, 2021
e4a3c84
Apply review comments
Sep 28, 2021
5c587e9
Merge branch 'main' into slow-start
Sep 28, 2021
7f4b258
fix format
Sep 28, 2021
9ed50d9
Fix merge error
Sep 30, 2021
26 changes: 25 additions & 1 deletion api/envoy/config/cluster/v3/cluster.proto
@@ -439,7 +439,7 @@ message Cluster {
}

// Common configuration for all load balancer implementations.
// [#next-free-field: 8]
// [#next-free-field: 9]
message CommonLbConfig {
option (udpa.annotations.versioning).previous_message_type =
"envoy.api.v2.Cluster.CommonLbConfig";
@@ -507,6 +507,26 @@ message Cluster {
google.protobuf.UInt32Value hash_balance_factor = 2 [(validate.rules).uint32 = {gte: 100}];
}

// Configuration for :ref:`slow start mode <arch_overview_load_balancing_slow_start>`.
message SlowStartConfig {
// Size of slow start window in seconds, should be set to value greater than zero.
// If set, the newly created host remains in slow start mode starting from its creation time
// for the duration of slow start window.
uint32 slow_start_window = 1 [(validate.rules).uint32 = {gt: 0}];

Member

Is there a reason this wasn't a Duration?

Member Author

there is no smart reason for it


// This parameter linearly affects load balancing weight of an endpoint. Defaults to 1.0.
// The smaller the time bias is, the less traffic would be sent to endpoint that is within slow start window.
// The value of time bias should be greater than or equal to 0.0 and less than or equal to 1.0.
// If slow start window and time bias are configured, effective weight of an endpoint would be scaled with time bias and time factor during the duration of slow start window:
// `weight = load_balancing_weight * time_bias * time_factor`,
// where `time_factor = (1 / slow_start_window_in_seconds) * host_create_duration_in_seconds`.
//
// When `time bias == 0.0` the Round Robin and Least Request Load Balancers will not scale endpoint weight with time bias
// and will behave as if slow start is not enabled.
// Once host exits slow start, its weight is no longer being scaled with time bias parameter.
core.v3.RuntimeDouble time_bias = 3;
Comment thread
mattklein123 marked this conversation as resolved.
Outdated
}

// Configures the :ref:`healthy panic threshold <arch_overview_load_balancing_panic_threshold>`.
// If not specified, the default is 50%.
// To disable panic mode, set to 0%.
@@ -548,6 +568,10 @@ message Cluster {

// Common Configuration for all consistent hashing load balancers (MaglevLb, RingHashLb, etc.)
ConsistentHashingLbConfig consistent_hashing_lb_config = 7;

// Configuration for slow start mode.
Comment thread
mattklein123 marked this conversation as resolved.
Outdated
// If this configuration is not set, slow start will not be enabled.
SlowStartConfig slow_start_config = 8;
}

message RefreshRate {
29 changes: 28 additions & 1 deletion api/envoy/config/cluster/v4alpha/cluster.proto
@@ -443,7 +443,7 @@ message Cluster {
}

// Common configuration for all load balancer implementations.
// [#next-free-field: 8]
// [#next-free-field: 9]
message CommonLbConfig {
option (udpa.annotations.versioning).previous_message_type =
"envoy.config.cluster.v3.Cluster.CommonLbConfig";
@@ -511,6 +511,29 @@ message Cluster {
google.protobuf.UInt32Value hash_balance_factor = 2 [(validate.rules).uint32 = {gte: 100}];
}

// Configuration for :ref:`slow start mode <arch_overview_load_balancing_slow_start>`.
message SlowStartConfig {
option (udpa.annotations.versioning).previous_message_type =
"envoy.config.cluster.v3.Cluster.CommonLbConfig.SlowStartConfig";

// Size of slow start window in seconds, should be set to value greater than zero.
// If set, the newly created host remains in slow start mode starting from its creation time
// for the duration of slow start window.
uint32 slow_start_window = 1 [(validate.rules).uint32 = {gt: 0}];

// This parameter linearly affects load balancing weight of an endpoint. Defaults to 1.0.
// The smaller the time bias is, the less traffic would be sent to endpoint that is within slow start window.
// The value of time bias should be greater than or equal to 0.0 and less than or equal to 1.0.
// If slow start window and time bias are configured, effective weight of an endpoint would be scaled with time bias and time factor during the duration of slow start window:
// `weight = load_balancing_weight * time_bias * time_factor`,
// where `time_factor = (1 / slow_start_window_in_seconds) * host_create_duration_in_seconds`.
//
// When `time bias == 0.0` the Round Robin and Least Request Load Balancers will not scale endpoint weight with time bias
// and will behave as if slow start is not enabled.
// Once host exits slow start, its weight is no longer being scaled with time bias parameter.
core.v4alpha.RuntimeDouble time_bias = 3;
}

// Configures the :ref:`healthy panic threshold <arch_overview_load_balancing_panic_threshold>`.
// If not specified, the default is 50%.
// To disable panic mode, set to 0%.
@@ -552,6 +575,10 @@ message Cluster {

// Common Configuration for all consistent hashing load balancers (MaglevLb, RingHashLb, etc.)
ConsistentHashingLbConfig consistent_hashing_lb_config = 7;

// Configuration for slow start mode.
// If this configuration is not set, slow start will not be enabled.
SlowStartConfig slow_start_config = 8;
}

message RefreshRate {
4 changes: 4 additions & 0 deletions bazel/repositories.bzl
@@ -547,6 +547,10 @@ def _com_google_absl():
name = "abseil_node_hash_set",
actual = "@com_google_absl//absl/container:node_hash_set",
)
native.bind(
name = "abseil_btree",
actual = "@com_google_absl//absl/container:btree",
)
native.bind(
name = "abseil_str_format",
actual = "@com_google_absl//absl/strings:str_format",
@@ -104,7 +104,7 @@ Envoy will return no addresses and set the response code appropriately. Converse
matching records for the query type, each configured address is returned. This is also true for
AAAA records. Only A, AAAA, and SRV records are supported. If the filter parses queries for other
record types, the filter immediately responds indicating that the type is not supported. The
filter can also redirect a query for a DNS name to the enpoints of a cluster. "www.domain4.com"
filter can also redirect a query for a DNS name to the endpoints of a cluster. "www.domain4.com"
in the configuration demonstrates this. Along with an address list, a cluster name is a valid
endpoint for a DNS name.

@@ -15,3 +15,4 @@ Load Balancing
original_dst
zone_aware
subsets
slow_start
@@ -0,0 +1,57 @@
.. _arch_overview_load_balancing_slow_start:
Comment thread
mattklein123 marked this conversation as resolved.

Slow start mode
===============

Slow start mode is a configuration setting in Envoy to progressively increase the amount of traffic sent to newly added upstream endpoints.
Without slow start enabled, Envoy sends a proportional amount of traffic to new upstream endpoints.

Member

Can you fix the spelling in this doc as well, please. I'm not sure why the build didn't catch this.. I thought it used to.

Member Author

I was also surprised that spell check did not catch those, perhaps it does not cover new .rst files...Fixed (with online spell checker)

This could be undesirable for services that require warm up time to serve full production load and could result in request timeouts, loss of data and deteriorated user experience.

Slow start mode is a mechanism that affects load balancing weight of upstream endpoints and can be configured per upstream cluster.
Currently, slow start is supported in Round Robin and Least Request load balancer types.
Comment thread
nezdolik marked this conversation as resolved.
Outdated

Member

nit: can you ref link to the relevant fields for each type?


Users can specify a :ref:`slow start window parameter<envoy_v3_api_field_config.cluster.v3.Cluster.CommonLbConfig.SlowStartConfig.slow_start_window>` (in seconds), so that if an endpoint's "cluster membership duration" (the amount of time since it joined the cluster) is within the configured window, it enters slow start mode.


“cluster

Member Author

good catch, hard to spot

During slow start window, load balancing weight of a particular endpoint will be scaled with :ref:`time bias parameter<envoy_v3_api_field_config.cluster.v3.Cluster.CommonLbConfig.SlowStartConfig.time_bias>`, e.g.:
`weight = load_balancing_weight * time_bias * time_factor`.
Time factor is a value that increases as time progresses, and is calculated as:
`time_factor = (1 / slow_start_window_seconds) * endpoint_create_duration_seconds`

The longer the slow start window, the more gradually traffic to the endpoint ramps up: at any given moment within the window, a longer window yields a smaller time factor and therefore less traffic.
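
The two formulas above can be sketched in a few lines of C++. This is a minimal illustration, not Envoy's implementation: the function name `slowStartWeight` and its parameter names are assumptions made for this example. It also models two statements from this change: the weight is no longer scaled once the window has elapsed, and (per the API comment on `time_bias`) a time bias of 0.0 behaves as if slow start were not enabled.

```cpp
#include <cassert>

// Hypothetical helper (not part of Envoy): scales a configured load balancing
// weight using the formulas quoted in the documentation above.
double slowStartWeight(double load_balancing_weight, double time_bias,
                       double slow_start_window_seconds,
                       double endpoint_create_duration_seconds) {
  assert(slow_start_window_seconds > 0.0);
  // Per the API comment, time_bias == 0.0 behaves as if slow start is disabled.
  if (time_bias == 0.0) {
    return load_balancing_weight;
  }
  // Once the membership duration exceeds the window, the weight is not scaled.
  if (endpoint_create_duration_seconds > slow_start_window_seconds) {
    return load_balancing_weight;
  }
  // time_factor = (1 / slow_start_window_seconds) * endpoint_create_duration_seconds
  const double time_factor =
      (1.0 / slow_start_window_seconds) * endpoint_create_duration_seconds;
  // weight = load_balancing_weight * time_bias * time_factor
  return load_balancing_weight * time_bias * time_factor;
}
```

For instance, with a configured weight of 100, a time bias of 0.5 and a 10 second window, an endpoint that joined 5 seconds ago gets a time factor of 0.5 and an effective weight of about 25 under this sketch.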

When the slow start window elapses, the upstream endpoint exits slow start mode and receives a regular amount of traffic according to the load balancing algorithm.
Its load balancing weight is no longer scaled with time bias. An endpoint also exits slow start mode if it leaves the cluster.

To reiterate, endpoint enters slow start mode when:

Member

Suggested change
To reiterate, endpoint enters slow start mode when:
To reiterate, endpoint enters slow start mode:

* If no active healthcheck is configured per cluster, immediately if its cluster membership duration is within slow start window.
* In case an active healthcheck is configured per cluster, when its cluster membership duration is within slow start window and endpoint has passed an active healthcheck.
If the endpoint does not pass an active healthcheck during the entire slow start window (measured from when it was added to the upstream cluster), it never enters slow start mode.

An endpoint exits slow start mode when:

* It leaves the cluster.
* Its cluster membership duration is greater than the slow start window.
* It does not pass an active healthcheck configured per cluster.

An endpoint can later re-enter slow start mode if it passes an active healthcheck and its creation time is still within the slow start window.
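
The enter and exit conditions listed above can be condensed into a single predicate. The sketch below is only an illustration of those rules as written in this document; `EndpointState`, its fields and `inSlowStart` are hypothetical names, not Envoy types.

```cpp
#include <cassert>

// Hypothetical snapshot of one endpoint; field names are illustrative only.
struct EndpointState {
  bool in_cluster;            // endpoint is still a member of the cluster
  double membership_seconds;  // time since the endpoint joined the cluster
  bool active_hc_configured;  // an active healthcheck is configured per cluster
  bool passing_active_hc;     // endpoint currently passes that healthcheck
};

// Returns true when the endpoint should be treated as being in slow start mode.
bool inSlowStart(const EndpointState& e, double slow_start_window_seconds) {
  assert(slow_start_window_seconds > 0.0);
  // Exit condition: the endpoint left the cluster.
  if (!e.in_cluster) {
    return false;
  }
  // Exit condition: membership duration is greater than the window.
  if (e.membership_seconds > slow_start_window_seconds) {
    return false;
  }
  // With an active healthcheck configured, the endpoint must be passing it.
  if (e.active_hc_configured && !e.passing_active_hc) {
    return false;
  }
  return true;
}
```

Because the predicate is evaluated on current state, an endpoint that fails a healthcheck and then passes again while still inside the window is reported as being in slow start again, which matches the re-entry rule above.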

Below is an example of how requests would be distributed across endpoints with the Round Robin load balancer, a slow start window of 10 seconds, no active healthcheck and a time bias of 0.5.
Endpoint E1 has a statically configured initial weight of X and endpoint E2 a weight of Y; the actual numerical values are of no significance for this example.

+-------------+--------------------+------------+------------+-----------+----------+-------------+

Member

This diagram is pretty confusing to me. I get what's trying to be conveyed, but I'm not sure why the events are significant and the timestamps are hard to reason about. There's got to be a better way.

Perhaps a graph with time on the x-axis and weights on the y-axis would be a bit easier to parse? Similar to the graph you have showing the effect of aggression, but each line should be a different endpoint.

Member Author

have not started this one

| Timestamp | Event | E1 in slow | E2 in slow | E1 LB | E2 LB | LB decision |
| | | start | start | weight | weight | |
+=============+====================+============+============+===========+==========+=============+
| 1 | E1 create | YES | -- | 0.5X | -- | -- |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 11 | E2 create | NO | YES | X | 0.5Y | -- |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 12 | LB select endpoint | NO | YES | X | 0.5Y | E1 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 13 | LB select endpoint | NO | YES | X | 0.5Y | E1 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 14 | LB select endpoint | NO | YES | X | 0.5Y | E1 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 15 | LB select endpoint | NO | YES | X | 0.5Y | E2 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 22 | LB select endpoint | NO | NO | X | Y | E1 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
| 23 | LB select endpoint | NO | NO | X | Y | E2 |
+-------------+--------------------+------------+------------+-----------+----------+-------------+
26 changes: 25 additions & 1 deletion generated_api_shadow/envoy/config/cluster/v3/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

29 changes: 28 additions & 1 deletion generated_api_shadow/envoy/config/cluster/v4alpha/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions source/common/upstream/BUILD
@@ -177,6 +177,7 @@ envoy_cc_library(
name = "load_balancer_lib",
srcs = ["load_balancer_impl.cc"],
hdrs = ["load_balancer_impl.h"],
external_deps = ["abseil_btree"],
deps = [
":edf_scheduler_lib",
"//include/envoy/common:random_generator_interface",
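
The `abseil_btree` dependency added here lines up with the commit "switch to btree_set for tracking hosts in slow start". The load balancer code itself is not part of the hunks shown, so the following is only a rough sketch of why an ordered set suits this job. It assumes hosts are ordered by creation time, and it uses `std::set` as a stand-in for `absl::btree_set` to stay self-contained; `SlowStartTracker` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical tracker (not Envoy code). Entries are (creation time, host name)
// pairs ordered oldest first, so expired hosts always sit at the front.
class SlowStartTracker {
public:
  explicit SlowStartTracker(double window_seconds) : window_seconds_(window_seconds) {
    assert(window_seconds > 0.0);
  }

  // Registers a host that has just entered slow start.
  void add(const std::string& host, double creation_time_seconds) {
    hosts_.emplace(creation_time_seconds, host);
  }

  // Drops every host whose age exceeds the window; returns how many remain.
  std::size_t expire(double now_seconds) {
    while (!hosts_.empty() && now_seconds - hosts_.begin()->first > window_seconds_) {
      hosts_.erase(hosts_.begin());
    }
    return hosts_.size();
  }

  // Reports whether the given host is still tracked as being in slow start.
  bool contains(const std::string& host, double creation_time_seconds) const {
    return hosts_.count(std::make_pair(creation_time_seconds, host)) > 0;
  }

private:
  double window_seconds_;
  std::set<std::pair<double, std::string>> hosts_;
};
```

With creation-time ordering, expiring hosts only ever touches the front of the container, so each pass costs time proportional to the number of hosts leaving slow start rather than to the number tracked.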
9 changes: 6 additions & 3 deletions source/common/upstream/cluster_manager_impl.cc
@@ -1315,14 +1315,16 @@ ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry(
cluster->lbType(), priority_set_, parent_.local_priority_set_, cluster->stats(),
cluster->statsScope(), parent.parent_.runtime_, parent.parent_.random_,
cluster->lbSubsetInfo(), cluster->lbRingHashConfig(), cluster->lbMaglevConfig(),
cluster->lbLeastRequestConfig(), cluster->lbConfig());
cluster->lbLeastRequestConfig(), cluster->lbConfig(),
parent_.thread_local_dispatcher_.timeSource());
Comment thread
nezdolik marked this conversation as resolved.
} else {
switch (cluster->lbType()) {
case LoadBalancerType::LeastRequest: {
ASSERT(lb_factory_ == nullptr);
lb_ = std::make_unique<LeastRequestLoadBalancer>(
priority_set_, parent_.local_priority_set_, cluster->stats(), parent.parent_.runtime_,
parent.parent_.random_, cluster->lbConfig(), cluster->lbLeastRequestConfig());
parent.parent_.random_, cluster->lbConfig(), cluster->lbLeastRequestConfig(),
parent.thread_local_dispatcher_.timeSource());
break;
}
case LoadBalancerType::Random: {
@@ -1336,7 +1338,8 @@ ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::ClusterEntry(
ASSERT(lb_factory_ == nullptr);
lb_ = std::make_unique<RoundRobinLoadBalancer>(priority_set_, parent_.local_priority_set_,
cluster->stats(), parent.parent_.runtime_,
parent.parent_.random_, cluster->lbConfig());
parent.parent_.random_, cluster->lbConfig(),
parent.thread_local_dispatcher_.timeSource());
break;
}
case LoadBalancerType::ClusterProvided: