diff --git a/docs/root/intro/arch_overview/load_balancing.rst b/docs/root/intro/arch_overview/load_balancing.rst index d09ccf0054651..547962dc086ce 100644 --- a/docs/root/intro/arch_overview/load_balancing.rst +++ b/docs/root/intro/arch_overview/load_balancing.rst @@ -143,26 +143,6 @@ percentage of healthy hosts multiplied by the overprovisioning factor drops below 100. The default value is 1.4, so a priority level or locality will not be considered unhealthy until the percentage of healthy endpoints goes below 72%. -.. _arch_overview_load_balancing_panic_threshold: - -Panic threshold ---------------- - -During load balancing, Envoy will generally only consider healthy hosts in an upstream cluster. -However, if the percentage of healthy hosts in the cluster becomes too low, Envoy will disregard -health status and balance amongst all hosts. This is known as the *panic threshold*. The default -panic threshold is 50%. This is :ref:`configurable ` via -runtime as well as in the :ref:`cluster configuration -`. The panic threshold -is used to avoid a situation in which host failures cascade throughout the cluster as load -increases. - -Note that panic thresholds are *per-priority*. This means that if the percentage of healthy nodes -in a single priority goes below the threshold, that priority will enter panic mode. In general -it is discouraged to use panic thresholds in conjunction with priorities, as by the time enough -nodes are unhealthy to trigger the panic threshold most of the traffic should already have spilled -over to the next priority level. - .. _arch_overview_load_balancing_priority_levels: Priority levels @@ -181,43 +161,58 @@ healthy because 80*1.4 > 100. As the number of healthy endpoints dips below 72%, goes below 100. At that point the percent of traffic equivalent to the health of P=0 will go to P=0 and remaining traffic will flow to P=1. +It is important to understand how Envoy evaluates priority levels' health. 
Each priority level is assigned +a health value, which is the percentage of healthy hosts relative to the total number of hosts in a given +priority level, multiplied by the overprovisioning factor of 1.4. +A priority level's health is an integer value between 0% and 100%. When there is more than one priority level +in a cluster, Envoy adds up all priority levels' health values and caps the sum at 100%. This is called the normalized total health. A value of 0% for +normalized total health means that all hosts in all priority levels are unhealthy. A normalized total health of 100% may +describe many situations: all levels may have a health of 100%, or 4 levels may each have a health value of 30%. +When the normalized total health is 100%, Envoy assumes that there are enough healthy hosts across all priority +levels to handle the load. Not all hosts need to be in one priority, as Envoy distributes traffic across priority +levels based on each priority level's health value. +For the load distribution algorithm and the normalized total health calculation to work properly, the number of hosts +in each priority level should be similar. Envoy assumes, for example, that a 100% healthy priority level P=1 is able to take +the entire traffic from P=0 should all of P=0's hosts become unhealthy. If P=0 has 10 hosts and P=1 has only 2 hosts, P=1 may be unable +to take the entire load from P=0, even though P=1's health is 100%. + +Assume a simple set-up with 2 priority levels, P=1 100% healthy.
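As review commentary before the tables: the health and normalized-total-health arithmetic described above can be sketched as standalone helpers (hypothetical names, not Envoy's actual API; percentages kept as integers, with the 1.4 overprovisioning factor expressed as 140, matching the integer math in the patch):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// A priority level's health: healthy/total ratio scaled by the
// overprovisioning factor (140 == 1.4 in integer percent), capped at 100.
uint32_t priorityHealth(uint32_t healthy_hosts, uint32_t total_hosts,
                        uint32_t overprovisioning_factor = 140) {
  if (total_hosts == 0) {
    return 0;
  }
  return std::min<uint32_t>(100, overprovisioning_factor * healthy_hosts / total_hosts);
}

// Normalized total health: the sum of all priorities' health values,
// capped at 100.
uint32_t normalizedTotalHealth(const std::vector<uint32_t>& per_priority_health) {
  return std::min<uint32_t>(
      100, std::accumulate(per_priority_health.begin(), per_priority_health.end(), 0u));
}
```

For example, 4 healthy hosts out of 5 gives min(100, 140*4/5) = 100, which is why a level stays "fully healthy" down to 72% healthy endpoints.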
-+----------------------------+---------------------------+----------------------------+ -| P=0 healthy endpoints | Percent of traffic to P=0 | Percent of traffic to P=1 | -+============================+===========================+============================+ -| 100% | 100% | 0% | -+----------------------------+---------------------------+----------------------------+ -| 72% | 100% | 0% | -+----------------------------+---------------------------+----------------------------+ -| 71% | 99% | 1% | -+----------------------------+---------------------------+----------------------------+ -| 50% | 70% | 30% | -+----------------------------+---------------------------+----------------------------+ -| 25% | 35% | 65% | -+----------------------------+---------------------------+----------------------------+ -| 0% | 0% | 100% | -+----------------------------+---------------------------+----------------------------+ ++----------------------------+----------------+-----------------+ +| P=0 healthy endpoints | Traffic to P=0 | Traffic to P=1 | ++============================+================+=================+ +| 100% | 100% | 0% | ++----------------------------+----------------+-----------------+ +| 72% | 100% | 0% | ++----------------------------+----------------+-----------------+ +| 71% | 99% | 1% | ++----------------------------+----------------+-----------------+ +| 50% | 70% | 30% | ++----------------------------+----------------+-----------------+ +| 25% | 35% | 65% | ++----------------------------+----------------+-----------------+ +| 0% | 0% | 100% | ++----------------------------+----------------+-----------------+ If P=1 becomes unhealthy, it will continue to take spilled load from P=0 until the sum of the health P=0 + P=1 goes below 100. At this point the healths will be scaled up to an "effective" health of 100%. 
-+------------------------+-------------------------+-----------------+-----------------+ -| P=0 healthy endpoints | P=1 healthy endpoints | Traffic to P=0 | Traffic to P=1 | -+========================+=========================+=================+=================+ -| 100% | 100% | 100% | 0% | -+------------------------+-------------------------+-----------------+-----------------+ -| 72% | 72% | 100% | 0% | -+------------------------+-------------------------+-----------------+-----------------+ -| 71% | 71% | 99% | 1% | -+------------------------+-------------------------+-----------------+-----------------+ -| 50% | 50% | 70% | 30% | -+------------------------+-------------------------+-----------------+-----------------+ -| 25% | 100% | 35% | 65% | -+------------------------+-------------------------+-----------------+-----------------+ -| 25% | 25% | 50% | 50% | -+------------------------+-------------------------+-----------------+-----------------+ ++------------------------+-------------------------+-----------------+----------------+ +| P=0 healthy endpoints | P=1 healthy endpoints | Traffic to P=0 | Traffic to P=1 | ++========================+=========================+=================+================+ +| 100% | 100% | 100% | 0% | ++------------------------+-------------------------+-----------------+----------------+ +| 72% | 72% | 100% | 0% | ++------------------------+-------------------------+-----------------+----------------+ +| 71% | 71% | 99% | 1% | ++------------------------+-------------------------+-----------------+----------------+ +| 50% | 50% | 70% | 30% | ++------------------------+-------------------------+-----------------+----------------+ +| 25% | 100% | 35% | 65% | ++------------------------+-------------------------+-----------------+----------------+ +| 25% | 25% | 50% | 50% | ++------------------------+-------------------------+-----------------+----------------+ As more priorities are added, each level consumes load equal to its 
"scaled" effective health, so P=2 would only receive traffic if the combined health of P=0 + P=1 was less than 100. @@ -242,11 +237,82 @@ To sum this up in pseudo algorithms: :: - load to P_0 = min(100, health(P_0) * 100 / total_health) + load to P_0 = min(100, health(P_0) * 100 / normalized_total_health) health(P_X) = 140 * healthy_P_X_backends / total_P_X_backends - total_health = min(100, Σ(health(P_0)...health(P_X)) + normalized_total_health = min(100, Σ(health(P_0)...health(P_X))) load to P_X = 100 - Σ(percent_load(P_0)..percent_load(P_X-1)) +.. _arch_overview_load_balancing_panic_threshold: + +Panic threshold +--------------- + +During load balancing, Envoy will generally only consider healthy hosts in an upstream cluster. +However, if the percentage of healthy hosts in the cluster becomes too low, Envoy will disregard +health status and balance amongst all hosts. This is known as the *panic threshold*. The default +panic threshold is 50%. This is :ref:`configurable ` via +runtime as well as in the :ref:`cluster configuration +`. The panic threshold +is used to avoid a situation in which host failures cascade throughout the cluster as load +increases. + +Panic thresholds work in conjunction with priorities. If the number of healthy hosts in a given priority +goes down, Envoy will try to shift some traffic to lower priorities. If it succeeds in finding enough +healthy hosts in lower priorities, Envoy will disregard panic thresholds. In mathematical terms, +if the normalized total health across all priority levels is 100%, Envoy disregards panic thresholds but continues to +distribute traffic load across priorities according to the algorithm described :ref:`here `. +However, when the normalized total health drops below 100%, Envoy assumes that there are not enough healthy +hosts across all priority levels.
It continues to distribute traffic load across priorities, but if a specific priority level's +health is below the panic threshold, traffic will go to all hosts in that priority regardless of their health. + +The following examples explain the relationship between normalized total health and the panic threshold. It is +assumed that the default panic threshold of 50% is used. + +Assume a simple set-up with 2 priority levels, P=1 100% healthy. In this scenario the +normalized total health is always 100%, P=0 never enters panic mode, and Envoy is able to shift the entire traffic to P=1. + ++-------------+------------+--------------+------------+--------------+--------------+ +| P=0 healthy | Traffic | P=0 in panic | Traffic | P=1 in panic | normalized | +| endpoints | to P=0 | | to P=1 | | total health | ++=============+============+==============+============+==============+==============+ +| 100% | 100% | NO | 0% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ +| 72% | 100% | NO | 0% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ +| 71% | 99% | NO | 1% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ +| 50% | 70% | NO | 30% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ +| 25% | 35% | NO | 65% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ +| 0% | 0% | NO | 100% | NO | 100% | ++-------------+------------+--------------+------------+--------------+--------------+ + +If P=1 becomes unhealthy, the panic threshold continues to be disregarded until the sum of the health +P=0 + P=1 goes below 100%. At this point Envoy starts checking the panic threshold value for each +priority.
+ ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| P=0 healthy | P=1 healthy | Traffic | P=0 in panic | Traffic | P=1 in panic | normalized | +| endpoints | endpoints | to P=0 | | to P=1 | | total health| ++=============+=============+==========+==============+==========+==============+=============+ +| 100% | 100% | 100% | NO | 0% | NO | 100% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 72% | 72% | 100% | NO | 0% | NO | 100% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 71% | 71% | 99% | NO | 1% | NO | 100% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 50% | 60% | 70% | NO | 30% | NO | 100% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 25% | 100% | 35% | NO | 65% | NO | 100% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 25% | 25% | 50% | YES | 50% | YES | 70% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ +| 5% | 65% | 7% | YES | 93% | NO | 98% | ++-------------+-------------+----------+--------------+----------+--------------+-------------+ + +Note that panic thresholds can be configured *per-priority*. + .. _arch_overview_load_balancing_zone_aware_routing: Zone aware routing diff --git a/docs/root/intro/version_history.rst b/docs/root/intro/version_history.rst index 1927b261d03da..6ca70d4b0cd05 100644 --- a/docs/root/intro/version_history.rst +++ b/docs/root/intro/version_history.rst @@ -13,6 +13,7 @@ Version history * stats: added :ref:`stats_matcher ` to the bootstrap config for granular control of stat instantiation. * stream: renamed the `RequestInfo` namespace to `StreamInfo` to better match its behaviour within TCP and HTTP implementations.
+* upstream: changed how the load calculation for :ref:`priority levels` and :ref:`panic thresholds` interact: as long as the normalized total health is 100%, panic thresholds are disregarded. 1.8.0 (Oct 4, 2018) =================== diff --git a/source/common/upstream/load_balancer_impl.cc b/source/common/upstream/load_balancer_impl.cc index 54a0f91865fa9..1e7980828cad1 100644 --- a/source/common/upstream/load_balancer_impl.cc +++ b/source/common/upstream/load_balancer_impl.cc @@ -47,12 +47,32 @@ LoadBalancerBase::LoadBalancerBase(const PrioritySet& priority_set, ClusterStats recalculatePerPriorityState(host_set->priority(), priority_set_, per_priority_load_, per_priority_health_); } + // Recalculate panic mode for all levels. + recalculatePerPriorityPanic(); + priority_set_.addMemberUpdateCb([this](uint32_t priority, const HostVector&, const HostVector&) -> void { recalculatePerPriorityState(priority, priority_set_, per_priority_load_, per_priority_health_); }); + priority_set_.addMemberUpdateCb( + [this](uint32_t priority, const HostVector&, const HostVector&) -> void { + UNREFERENCED_PARAMETER(priority); + recalculatePerPriorityPanic(); + }); } +// The following cases are handled by the +// recalculatePerPriorityState and recalculatePerPriorityPanic methods (normalized total health is the +// sum of all priorities' health values, capped at 100). +// - normalized total health is 100%. There are enough healthy hosts to handle the load. +// Do not enter panic mode, even if a specific priority has a low number of healthy hosts. +// - normalized total health is < 100%. There are not enough healthy hosts to handle the load. +// Continue +// distributing the load among priority sets, but turn on panic mode for a given priority +// if the # of healthy hosts in that priority set is low. +// - normalized total health is 0%. All hosts are down. Redirect 100% of traffic to P=0 and enable +// panic mode.
+ void LoadBalancerBase::recalculatePerPriorityState(uint32_t priority, const PrioritySet& priority_set, PriorityLoad& per_priority_load, @@ -66,22 +86,29 @@ void LoadBalancerBase::recalculatePerPriorityState(uint32_t priority, HostSet& host_set = *priority_set.hostSetsPerPriority()[priority]; per_priority_health[priority] = 0; if (host_set.hosts().size() > 0) { + // Each priority level's health is the ratio of healthy hosts to the total number of hosts in a + // priority, multiplied by the overprovisioning factor of 1.4 and capped at 100%. This means that if + // all hosts are healthy, that priority's health is 100%*1.4=140%, capped at 100%, which results + // in 100%. If 80% of hosts are healthy, that priority's health is still 100% (80%*1.4=112%, + // capped at 100%). per_priority_health[priority] = std::min(100, (host_set.overprovisioning_factor() * host_set.healthyHosts().size() / host_set.hosts().size())); } - // Now that we've updated health for the changed priority level, we need to caculate percentage + // Now that we've updated health for the changed priority level, we need to calculate percentage load for all priority levels. + // First, determine if the load needs to be scaled relative to health. For example if there are // 3 host sets with 20% / 20% / 10% health they will get 40% / 40% / 20% load to ensure total load // adds up to 100. - const uint32_t total_health = std::min( std::accumulate(per_priority_health.begin(), per_priority_health.end(), 0), 100); - if (total_health == 0) { + // The sum of the priority levels' health values may exceed 100, so it is capped at 100 and referred to as + // the normalized total health. + const uint32_t normalized_total_health = calcNormalizedTotalHealth(per_priority_health); + if (normalized_total_health == 0) { // Everything is terrible. Send all load to P=0. - // In this one case sumEntries(per_priority_load_) != 100 since we sinkhole all traffic in P=0.
+ // In this one case sumEntries(per_priority_load) != 100 since we sinkhole all traffic in P=0. per_priority_load[0] = 100; return; } @@ -95,7 +122,7 @@ void LoadBalancerBase::recalculatePerPriorityState(uint32_t priority, // Now assign as much load as possible to the high priority levels and cease assigning load // when total_load runs out. per_priority_load[i] = - std::min(total_load, per_priority_health[i] * 100 / total_health); + std::min(total_load, per_priority_health[i] * 100 / normalized_total_health); total_load -= per_priority_load[i]; } @@ -107,6 +134,29 @@ void LoadBalancerBase::recalculatePerPriorityState(uint32_t priority, } } +// This method iterates through the priority levels and turns panic mode on or off for each of them. +void LoadBalancerBase::recalculatePerPriorityPanic() { + per_priority_panic_.resize(priority_set_.hostSetsPerPriority().size()); + + const uint32_t normalized_total_health = calcNormalizedTotalHealth(per_priority_health_); + + if (normalized_total_health == 0) { + // Everything is terrible. All load should go to P=0. Turn on panic mode. + ASSERT(per_priority_load_[0] == 100); + per_priority_panic_[0] = true; + return; + } + + for (size_t i = 0; i < per_priority_health_.size(); ++i) { + // For each level check if it should run in panic mode. Never set panic mode if + // normalized total health is 100%, even when an individual priority level has a very low # of + // healthy hosts. + const HostSet& priority_host_set = *priority_set_.hostSetsPerPriority()[i]; + per_priority_panic_[i] = + (normalized_total_health == 100 ? false : isGlobalPanic(priority_host_set)); + } +} + HostSet& LoadBalancerBase::chooseHostSet(LoadBalancerContext* context) { if (context) { const auto& per_priority_load = @@ -389,7 +439,7 @@ ZoneAwareLoadBalancerBase::hostSourceToUse(LoadBalancerContext* context) { hosts_source.priority_ = host_set.priority(); // If the selected host set has insufficient healthy hosts, return all hosts.
- if (isGlobalPanic(host_set)) { + if (per_priority_panic_[hosts_source.priority_]) { stats_.lb_healthy_panic_.inc(); hosts_source.source_type_ = HostsSource::SourceType::AllHosts; return hosts_source; diff --git a/source/common/upstream/load_balancer_impl.h b/source/common/upstream/load_balancer_impl.h index 1a216a1885df9..ffec62688ee9a 100644 --- a/source/common/upstream/load_balancer_impl.h +++ b/source/common/upstream/load_balancer_impl.h @@ -62,6 +62,7 @@ class LoadBalancerBase : public LoadBalancer { HostSet& chooseHostSet(LoadBalancerContext* context); uint32_t percentageLoad(uint32_t priority) const { return per_priority_load_[priority]; } + bool isInPanic(uint32_t priority) const { return per_priority_panic_[priority]; } ClusterStats& stats_; Runtime::Loader& runtime_; @@ -77,12 +78,25 @@ class LoadBalancerBase : public LoadBalancer { void static recalculatePerPriorityState(uint32_t priority, const PrioritySet& priority_set, PriorityLoad& priority_load, std::vector<uint32_t>& per_priority_health); + void recalculatePerPriorityPanic(); protected: + // Calculates the normalized total health. Each priority level's health is the ratio of + // healthy hosts to the total number of hosts in a priority, multiplied by the overprovisioning factor + // of 1.4 and capped at 100%. Effectively each priority's health is a value between 0-100%. + // Calculating the normalized total health starts by summing all priorities' health values. + // The sum can exceed 100%. For example, if there are three priorities and each is 100% healthy, the + // total of all priorities is 300%. The normalized total health is then capped at 100%. + static uint32_t calcNormalizedTotalHealth(std::vector<uint32_t>& per_priority_health) { + return std::min( + std::accumulate(per_priority_health.begin(), per_priority_health.end(), 0), 100); + } // The percentage load (0-100) for each priority level std::vector<uint32_t> per_priority_load_; // The health (0-100) for each priority level.
std::vector<uint32_t> per_priority_health_; + // Whether each priority level is in panic mode. + std::vector<bool> per_priority_panic_; }; class LoadBalancerContextBase : public LoadBalancerContext { diff --git a/source/common/upstream/maglev_lb.h b/source/common/upstream/maglev_lb.h index 27ef4db632a80..58ab7e714bd67 100644 --- a/source/common/upstream/maglev_lb.h +++ b/source/common/upstream/maglev_lb.h @@ -58,13 +58,13 @@ class MaglevLoadBalancer : public ThreadAwareLoadBalancerBase { private: // ThreadAwareLoadBalancerBase - HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set) override { + HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set, bool in_panic) override { // Note that we only compute global panic on host set refresh. Given that the runtime setting // will rarely change, this is a reasonable compromise to avoid creating extra LBs when we only // need to create one per priority level. const bool has_locality = host_set.localityWeights() != nullptr && !host_set.localityWeights()->empty(); - if (isGlobalPanic(host_set)) { + if (in_panic) { if (!has_locality) { return std::make_shared(HostsPerLocalityImpl(host_set.hosts(), false), nullptr, table_size_); diff --git a/source/common/upstream/ring_hash_lb.h b/source/common/upstream/ring_hash_lb.h index 7d9eff87e7515..0de89f105db6f 100644 --- a/source/common/upstream/ring_hash_lb.h +++ b/source/common/upstream/ring_hash_lb.h @@ -45,11 +45,11 @@ class RingHashLoadBalancer : public ThreadAwareLoadBalancerBase, typedef std::shared_ptr<const Ring> RingConstSharedPtr; // ThreadAwareLoadBalancerBase - HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set) override { + HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set, bool in_panic) override { // Note that we only compute global panic on host set refresh. Given that the runtime setting // will rarely change, this is a reasonable compromise to avoid creating extra LBs when we only // need to create one per priority level.
- if (isGlobalPanic(host_set)) { + if (in_panic) { return std::make_shared(config_, host_set.hosts()); } else { return std::make_shared(config_, host_set.healthyHosts()); diff --git a/source/common/upstream/thread_aware_lb_impl.cc b/source/common/upstream/thread_aware_lb_impl.cc index a685604b538ab..7bc1ef1e8c589 100644 --- a/source/common/upstream/thread_aware_lb_impl.cc +++ b/source/common/upstream/thread_aware_lb_impl.cc @@ -25,8 +25,11 @@ void ThreadAwareLoadBalancerBase::refresh() { const uint32_t priority = host_set->priority(); (*per_priority_state_vector)[priority].reset(new PerPriorityState); const auto& per_priority_state = (*per_priority_state_vector)[priority]; - per_priority_state->current_lb_ = createLoadBalancer(*host_set); - per_priority_state->global_panic_ = isGlobalPanic(*host_set); + // Copy the panic flag from LoadBalancerBase. It is calculated when there is a change + // in the host set or the hosts' health. + per_priority_state->global_panic_ = per_priority_panic_[priority]; + per_priority_state->current_lb_ = + createLoadBalancer(*host_set, per_priority_state->global_panic_); } { @@ -42,6 +45,7 @@ ThreadAwareLoadBalancerBase::LoadBalancerImpl::chooseHost(LoadBalancerContext* c if (per_priority_state_ == nullptr) { return nullptr; } + // If there is no hash in the context, just choose a random value (this effectively becomes // the random LB but it won't crash if someone configures it this way). // computeHashKey() may be computed on demand, so get it only once.
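As a reviewer aid between these hunks: the documented load split — `load to P_X = min(remaining, health(P_X) * 100 / normalized_total_health)`, with rounding leftovers going to the first priority that received load — can be exercised end to end as a standalone sketch (hypothetical helper, not Envoy code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Sketch of the documented load split across priority levels.
// per_priority_health holds each level's health value (0-100).
std::vector<uint32_t> splitLoad(const std::vector<uint32_t>& per_priority_health) {
  const uint32_t normalized = std::min<uint32_t>(
      100, std::accumulate(per_priority_health.begin(), per_priority_health.end(), 0u));
  std::vector<uint32_t> load(per_priority_health.size(), 0);
  if (normalized == 0) {
    load[0] = 100;  // everything is down: sinkhole all traffic in P=0
    return load;
  }
  uint32_t remaining = 100;
  for (size_t i = 0; i < load.size(); ++i) {
    // Assign as much load as possible to high priority levels, ceasing
    // once the 100% budget runs out.
    load[i] = std::min(remaining, per_priority_health[i] * 100 / normalized);
    remaining -= load[i];
  }
  // Hand integer-rounding leftovers to the first priority that received load.
  for (size_t i = 0; remaining > 0 && i < load.size(); ++i) {
    if (load[i] > 0) {
      load[i] += remaining;
      remaining = 0;
    }
  }
  return load;
}
```

For instance, two levels at health 35 each (normalized total health 70) scale up to a 50/50 split, matching the tables in the docs portion of this patch.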
diff --git a/source/common/upstream/thread_aware_lb_impl.h b/source/common/upstream/thread_aware_lb_impl.h index 75511ab5c6cc6..1a1597d762c40 100644 --- a/source/common/upstream/thread_aware_lb_impl.h +++ b/source/common/upstream/thread_aware_lb_impl.h @@ -75,7 +75,8 @@ class ThreadAwareLoadBalancerBase : public LoadBalancerBase, public ThreadAwareL std::shared_ptr<std::vector<uint32_t>> per_priority_load_ GUARDED_BY(mutex_); }; - virtual HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set) PURE; + virtual HashingLoadBalancerSharedPtr createLoadBalancer(const HostSet& host_set, + bool in_panic) PURE; void refresh(); std::shared_ptr factory_; diff --git a/test/common/upstream/load_balancer_impl_test.cc b/test/common/upstream/load_balancer_impl_test.cc index 178b0216ae03c..245287825edbf 100644 --- a/test/common/upstream/load_balancer_impl_test.cc +++ b/test/common/upstream/load_balancer_impl_test.cc @@ -48,6 +48,7 @@ class TestLb : public LoadBalancerBase { const envoy::api::v2::Cluster::CommonLbConfig& common_config) : LoadBalancerBase(priority_set, stats, runtime, random, common_config) {} using LoadBalancerBase::chooseHostSet; + using LoadBalancerBase::isInPanic; using LoadBalancerBase::percentageLoad; HostConstSharedPtr chooseHostOnce(LoadBalancerContext*) override { @@ -71,14 +72,25 @@ class LoadBalancerBaseTest : public LoadBalancerTestBase { host_set.runCallbacks({}, {}); } - std::vector<uint32_t> getLoadPercentage() { - std::vector<uint32_t> ret; + template <typename RetType, typename FUNC> + std::vector<RetType> aggregatePrioritySetsValues(TestLb& lb, FUNC func) { + std::vector<RetType> ret; + for (size_t i = 0; i < priority_set_.host_sets_.size(); ++i) { - ret.push_back(lb_.percentageLoad(i)); + ret.push_back((lb.*func)(i)); } + return ret; } + std::vector<uint32_t> getLoadPercentage() { + return aggregatePrioritySetsValues<uint32_t>(lb_, &TestLb::percentageLoad); + } + + std::vector<bool> getPanic() { + return aggregatePrioritySetsValues<bool>(lb_, &TestLb::isInPanic); + } + envoy::api::v2::Cluster::CommonLbConfig common_config_; TestLb lb_{priority_set_,
stats_, runtime_, random_, common_config_}; }; @@ -157,25 +169,34 @@ TEST_P(LoadBalancerBaseTest, OverProvisioningFactor) { TEST_P(LoadBalancerBaseTest, GentleFailover) { // With 100% of P=0 hosts healthy, P=0 gets all the load. + // None of the levels is in panic mode. updateHostSet(host_set_, 1, 1); updateHostSet(failover_host_set_, 1, 1); ASSERT_THAT(getLoadPercentage(), ElementsAre(100, 0)); + ASSERT_THAT(getPanic(), ElementsAre(false, false)); // Health P=0 == 50*1.4 == 70 + // Total health = 70 + 70 >= 100%. None of the levels should be in panic mode. updateHostSet(host_set_, 2 /* num_hosts */, 1 /* num_healthy_hosts */); updateHostSet(failover_host_set_, 2 /* num_hosts */, 1 /* num_healthy_hosts */); ASSERT_THAT(getLoadPercentage(), ElementsAre(70, 30)); + ASSERT_THAT(getPanic(), ElementsAre(false, false)); // Health P=0 == 25*1.4 == 35 P=1 is healthy so takes all spillover. + // Total health = 35+100 >= 100%. P=0 is below the panic level but it is ignored because + // total health >= 100%. updateHostSet(host_set_, 4 /* num_hosts */, 1 /* num_healthy_hosts */); updateHostSet(failover_host_set_, 2 /* num_hosts */, 2 /* num_healthy_hosts */); ASSERT_THAT(getLoadPercentage(), ElementsAre(35, 65)); + ASSERT_THAT(getPanic(), ElementsAre(false, false)); // Health P=0 == 25*1.4 == 35 P=1 == 35 // Health is then scaled up by (100 / (35 + 35) == 50) + // Total health = 35% + 35% is less than 100%. Per-priority panic thresholds kick in.
updateHostSet(host_set_, 4 /* num_hosts */, 1 /* num_healthy_hosts */); updateHostSet(failover_host_set_, 4 /* num_hosts */, 1 /* num_healthy_hosts */); ASSERT_THAT(getLoadPercentage(), ElementsAre(50, 50)); + ASSERT_THAT(getPanic(), ElementsAre(true, true)); } TEST_P(LoadBalancerBaseTest, GentleFailoverWithExtraLevels) { @@ -185,6 +206,7 @@ TEST_P(LoadBalancerBaseTest, GentleFailoverWithExtraLevels) { updateHostSet(failover_host_set_, 1, 1); updateHostSet(tertiary_host_set_, 1, 1); ASSERT_THAT(getLoadPercentage(), ElementsAre(100, 0, 0)); + ASSERT_THAT(getPanic(), ElementsAre(false, false, false)); // Health P=0 == 50*1.4 == 70 // Health P=0 == 50, so can take the 30% spillover. @@ -219,6 +241,42 @@ TEST_P(LoadBalancerBaseTest, GentleFailoverWithExtraLevels) { updateHostSet(failover_host_set_, 5 /* num_hosts */, 1 /* num_healthy_hosts */); updateHostSet(tertiary_host_set_, 5 /* num_hosts */, 1 /* num_healthy_hosts */); ASSERT_THAT(getLoadPercentage(), ElementsAre(34, 33, 33)); + ASSERT_THAT(getPanic(), ElementsAre(true, true, true)); + + // Levels P=0 and P=1 are totally down. P=2 is totally healthy. + // 100% of the traffic should go to P=2 and P=0 and P=1 should + // not be in panic mode. + updateHostSet(host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(failover_host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(tertiary_host_set_, 5 /* num_hosts */, 5 /* num_healthy_hosts */); + ASSERT_THAT(getLoadPercentage(), ElementsAre(0, 0, 100)); + ASSERT_THAT(getPanic(), ElementsAre(false, false, false)); + + // Levels P=0 and P=1 are totally down. P=2 is 80*1.4 >= 100% healthy. + // 100% of the traffic should go to P=2 and P=0 and P=1 should + // not be in panic mode. 
+ updateHostSet(host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(failover_host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(tertiary_host_set_, 5 /* num_hosts */, 4 /* num_healthy_hosts */); + ASSERT_THAT(getLoadPercentage(), ElementsAre(0, 0, 100)); + ASSERT_THAT(getPanic(), ElementsAre(false, false, false)); + + // Levels P=0 and P=1 are totally down. P=2 is 40*1.4=56% healthy. + // 100% of the traffic should go to P=2. All levels P=0, P=1 and P=2 should + // be in panic mode even though P=0 and P=1 do not receive any load. + updateHostSet(host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(failover_host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(tertiary_host_set_, 5 /* num_hosts */, 2 /* num_healthy_hosts */); + ASSERT_THAT(getLoadPercentage(), ElementsAre(0, 0, 100)); + ASSERT_THAT(getPanic(), ElementsAre(true, true, true)); + + // All levels are completely down. 100% of traffic should go to P=0 + // and P=0 should be in panic mode. + updateHostSet(host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(failover_host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + updateHostSet(tertiary_host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); + ASSERT_THAT(getLoadPercentage(), ElementsAre(100, _, _)); + ASSERT_THAT(getPanic(), ElementsAre(true, _, _)); // Rounding errors should be picked up by the first healthy priority.
updateHostSet(host_set_, 5 /* num_hosts */, 0 /* num_healthy_hosts */); @@ -852,7 +910,6 @@ TEST_P(RoundRobinLoadBalancerTest, NoZoneAwareRoutingLocalEmpty) { HostsPerLocalitySharedPtr local_hosts_per_locality = makeHostsPerLocality({{}, {}}); EXPECT_CALL(runtime_.snapshot_, getInteger("upstream.healthy_panic_threshold", 50)) - .WillOnce(Return(50)) .WillOnce(Return(50)); EXPECT_CALL(runtime_.snapshot_, featureEnabled("upstream.zone_routing.enabled", 100)) .WillOnce(Return(true)); diff --git a/test/mocks/upstream/mocks.h b/test/mocks/upstream/mocks.h index d6e51b61c5da4..5707ee47ed451 100644 --- a/test/mocks/upstream/mocks.h +++ b/test/mocks/upstream/mocks.h @@ -75,6 +75,7 @@ class MockHostSet : public HostSet { Common::CallbackManager<uint32_t, const HostVector&, const HostVector&> member_update_cb_helper_; uint32_t priority_{}; uint32_t overprovisioning_factor_{}; + bool run_in_panic_mode_ = false; }; class MockPrioritySet : public PrioritySet {