upstream: add support for setting degraded through LoadAssignment by snowp · Pull Request #5649 · envoyproxy/envoy

snowp · 2019-01-18T06:42:58Z

Adds a DEGRADED HealthStatus value that can be set on a host through
LoadAssignment, allowing for a host to be marked degraded without
the need for active health checking.

Moves the mapping of EDS flag to health flag to inside
registerHostForPriority, which means that we're now consistently setting
the EDS health flag for EDS/STATIC/STRICT_DNS/LOGICAL_DNS.

Simplifies the check for whether the health flag value of a host has
changed during EDS updates.

Adds tests for the EDS mapping as well as tests to verify that we're
honoring the EDS flag for non-EDS cluster types.

Signed-off-by: Snow Pettersen <snowp@squareup.com>

Description:
Risk Level: High, substantial refactoring of how we determine whether health flag has changed.
Testing: Unit test coverage for new health flag values.
Docs Changes: n/a
Release Notes: n/a
Fixes #5637, #5063

snowp · 2019-01-18T07:30:08Z

source/common/upstream/upstream_impl.cc

              lb_endpoint_.metadata(), lb_endpoint_.load_balancing_weight().value(),
              locality_lb_endpoint_.locality(), lb_endpoint_.endpoint().health_check_config(),
              locality_lb_endpoint_.priority()));
+          setEdsHealthFlag(*new_hosts.back(), lb_endpoint_.health_status());


This is necessary because we create a new HostImpl when we resolve the target, and the new host needs to have its health flags set. Maybe we should pass the EDS flag to the HostImpl ctor and compute the value there, instead of having to set the flag externally everywhere?

Yeah, I think this would make sense.
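
To illustrate the constructor approach discussed in this thread, here is a minimal sketch. `SketchHost`, `HealthStatus`, and `applyEdsStatus` are hypothetical stand-ins, not Envoy's real types, and the `FAILED_EDS_HEALTH` value is an assumption; only `DEGRADED_EDS_HEALTH = 0x10` appears in this PR's diff.

```cpp
#include <cstdint>

// Simplified stand-in for the EDS health status; not Envoy's real proto enum.
enum class HealthStatus { Unknown, Healthy, Unhealthy, Degraded };

class SketchHost {
public:
  enum HealthFlag : uint32_t {
    FAILED_EDS_HEALTH = 0x04,   // value assumed for illustration
    DEGRADED_EDS_HEALTH = 0x10, // value taken from the diff in this PR
  };

  // The EDS status is consumed by the constructor, so every code path that
  // builds a host (including DNS re-resolution) gets the flag without having
  // to remember a separate setter call.
  explicit SketchHost(HealthStatus eds_status) { applyEdsStatus(eds_status); }

  bool healthFlagGet(HealthFlag flag) const { return (flags_ & flag) != 0; }

private:
  // Hypothetical helper: translates the EDS status into at most one flag.
  void applyEdsStatus(HealthStatus status) {
    if (status == HealthStatus::Unhealthy) {
      flags_ |= FAILED_EDS_HEALTH;
    } else if (status == HealthStatus::Degraded) {
      flags_ |= DEGRADED_EDS_HEALTH;
    }
  }

  uint32_t flags_{0};
};
```

With this shape, a caller writes `SketchHost host(HealthStatus::Degraded);` and never sees a half-initialized host whose flag has not been set yet.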

venilnoronha

Looks good. Just a few comments. Thanks!

venilnoronha · 2019-01-22T20:12:30Z

source/common/upstream/upstream_impl.cc

+// @param existing_host the host to update.
+// @param flag the health flag to update.
+// @return bool whether the flag update caused the host health to change.
+bool updateHealthFlags(const Host& updated_host, Host& existing_host, Host::HealthFlag flag) {


nit: s/updateHealthFlags/updateHealthFlag ?
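
For readers without the full diff, the documented contract (copy one flag from the updated host to the existing host, and report whether the host's health changed) could look roughly like the sketch below. `SketchHost` and its `health()` rule are simplified assumptions for illustration, and the function body is a guess at the intent, not the code from this PR.

```cpp
#include <cstdint>

enum class Health { Unhealthy, Degraded, Healthy };

// Simplified stand-in for Host; only the pieces the helper needs.
struct SketchHost {
  enum HealthFlag : uint32_t {
    FAILED_EDS_HEALTH = 0x04,   // value assumed for illustration
    DEGRADED_EDS_HEALTH = 0x10, // value taken from the diff in this PR
  };

  uint32_t flags{0};

  bool healthFlagGet(HealthFlag f) const { return (flags & f) != 0; }
  void healthFlagSet(HealthFlag f) { flags |= f; }
  void healthFlagClear(HealthFlag f) { flags &= ~static_cast<uint32_t>(f); }

  // Assumed rule: a failed flag wins over a degraded flag.
  Health health() const {
    if (healthFlagGet(FAILED_EDS_HEALTH)) {
      return Health::Unhealthy;
    }
    if (healthFlagGet(DEGRADED_EDS_HEALTH)) {
      return Health::Degraded;
    }
    return Health::Healthy;
  }
};

// Copies `flag` from updated_host to existing_host.
// Returns true only if the copy changed existing_host's overall health.
bool updateHealthFlag(const SketchHost& updated_host, SketchHost& existing_host,
                      SketchHost::HealthFlag flag) {
  if (updated_host.healthFlagGet(flag) == existing_host.healthFlagGet(flag)) {
    return false; // Flag already matches, so nothing can change.
  }
  const Health previous = existing_host.health();
  if (updated_host.healthFlagGet(flag)) {
    existing_host.healthFlagSet(flag);
  } else {
    existing_host.healthFlagClear(flag);
  }
  return previous != existing_host.health();
}
```

Comparing the health before and after, rather than just the flag, means that adding a degraded flag to a host that is already unhealthy reports no change.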

venilnoronha · 2019-01-22T20:16:00Z

test/common/upstream/eds_test.cc

    EXPECT_EQ(Host::Health::Healthy, hosts[0]->health());
  }
+
+  const auto rebuild_conter = stats_.counter("cluster.name.update_no_rebuild").value();


s/rebuild_conter/rebuild_counter

venilnoronha · 2019-01-22T20:17:17Z

test/common/upstream/eds_test.cc

+    EXPECT_EQ(Host::Health::Degraded, hosts[0]->health());
+  }
+
+  std::cerr << cluster_->prioritySet().hostSetsPerPriority()[0]->hosts().size() << std::endl;


Use logger?

Left-behind debug logging; I'll just remove it.

venilnoronha

LGTM. Thanks!

snowp · 2019-01-25T16:58:07Z

@dio Wanna give this a look?

snowp · 2019-01-30T19:29:17Z

Ping @dio

snowp · 2019-01-31T16:09:59Z

@htuch Wanna give this a look? Talked to @dio and he's not gonna be able to get to it this week

htuch · 2019-01-31T21:12:15Z

include/envoy/upstream/upstream.h

-  m(DEGRADED_ACTIVE_HC, 0x08)
+  m(DEGRADED_ACTIVE_HC, 0x08)                                                    \
+  /* The host is currently marked as degraded by EDS. */                         \
+  m(DEGRADED_EDS_HEALTH, 0x10)


Can you remind me why we can't just use FAILED_EDS_HEALTH for this use case? AFAICT it fits the bill, and looking at the underlying issue being fixed, it seemed we just needed some plumbing around that.

Because this is for degrading an endpoint, not marking it as unhealthy. The new value is necessary to differentiate it here https://github.com/envoyproxy/envoy/pull/5649/files#diff-583237ddb4e16f38ccda2e9affdb0ad8R210 from the host being marked as unhealthy.

Degraded docs if you're not familiar: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/load_balancing/degraded

Maybe it wasn't super clear, but this PR adds support for marking endpoints as degraded, and while I was in here I also made sure to fix #5637

Thanks, that clarifies.
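
As a rough illustration of why a separate bit is needed, the sketch below derives a host's health from its flag bits. The `0x08` and `0x10` values come from the diff above; the remaining flag values, the `healthFromFlags` name, and the precedence rule (any failed flag wins over any degraded flag) are assumptions for illustration, not a copy of Envoy's implementation.

```cpp
#include <cstdint>

enum class Health { Unhealthy, Degraded, Healthy };

enum HealthFlag : uint32_t {
  FAILED_ACTIVE_HC = 0x01,     // value assumed for illustration
  FAILED_OUTLIER_CHECK = 0x02, // value assumed for illustration
  FAILED_EDS_HEALTH = 0x04,    // value assumed for illustration
  DEGRADED_ACTIVE_HC = 0x08,   // from the diff above
  DEGRADED_EDS_HEALTH = 0x10,  // from the diff above
};

// Hypothetical helper: collapses the flag bits into a single health value.
Health healthFromFlags(uint32_t flags) {
  constexpr uint32_t failed_mask =
      FAILED_ACTIVE_HC | FAILED_OUTLIER_CHECK | FAILED_EDS_HEALTH;
  constexpr uint32_t degraded_mask = DEGRADED_ACTIVE_HC | DEGRADED_EDS_HEALTH;

  if ((flags & failed_mask) != 0) {
    return Health::Unhealthy; // Any failure source makes the host unhealthy.
  }
  if ((flags & degraded_mask) != 0) {
    return Health::Degraded; // Still routable, but at lower preference.
  }
  return Health::Healthy;
}
```

If EDS degradation reused `FAILED_EDS_HEALTH`, this function could not distinguish a degraded endpoint from an unhealthy one.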

htuch

LGTM; do you think any integration tests make sense? Not necessary if too much of a pain, but always nice to trust and verify.

htuch · 2019-02-01T19:23:50Z

source/common/upstream/upstream_impl.cc

+
+void setEdsHealthFlag(Host& host, envoy::api::v2::core::HealthStatus health_status) {
+  switch (health_status) {
+  case envoy::api::v2::core::HealthStatus::UNHEALTHY:


Nit: Maybe add fall-thru annotations or comments here.
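
A sketch of what annotated fall-through could look like is shown below, using the standard C++17 `[[fallthrough]]` attribute rather than a project macro. The enumerator names mirror the `HealthStatus` proto values, but the enum itself, the `edsHealthFlags` function, the `FAILED_EDS_HEALTH` value, and the choice of which statuses map to which flag are assumptions for illustration.

```cpp
#include <cstdint>

// Stand-in for envoy::api::v2::core::HealthStatus.
enum class HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY, DRAINING, TIMEOUT, DEGRADED };

enum HealthFlag : uint32_t {
  FAILED_EDS_HEALTH = 0x04,   // value assumed for illustration
  DEGRADED_EDS_HEALTH = 0x10, // value taken from the diff in this PR
};

// Hypothetical helper: returns the flag bits implied by an EDS health status.
uint32_t edsHealthFlags(HealthStatus health_status) {
  switch (health_status) {
  case HealthStatus::UNHEALTHY:
    [[fallthrough]];
  case HealthStatus::DRAINING:
    [[fallthrough]];
  case HealthStatus::TIMEOUT:
    // All three statuses are treated as "failed by EDS" in this sketch.
    return FAILED_EDS_HEALTH;
  case HealthStatus::DEGRADED:
    return DEGRADED_EDS_HEALTH;
  case HealthStatus::UNKNOWN:
    [[fallthrough]];
  case HealthStatus::HEALTHY:
    return 0;
  }
  return 0; // Not reached for valid enumerators; keeps compilers quiet.
}
```

The explicit annotations make it clear to a reader (and to `-Wimplicit-fallthrough`) that grouping the cases is intentional.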

htuch · 2019-02-01T19:31:33Z

include/envoy/upstream/upstream.h

-  m(DEGRADED_ACTIVE_HC, 0x08)
+  m(DEGRADED_ACTIVE_HC, 0x08)                                                    \
+  /* The host is currently marked as degraded by EDS. */                         \
+  m(DEGRADED_EDS_HEALTH, 0x10)


Thanks, that clarifies.

htuch

LGTM modulo comments.

htuch · 2019-02-05T02:08:26Z

source/common/upstream/upstream_impl.cc

              lb_endpoint_.metadata(), lb_endpoint_.load_balancing_weight().value(),
              locality_lb_endpoint_.locality(), lb_endpoint_.endpoint().health_check_config(),
              locality_lb_endpoint_.priority()));
+          setEdsHealthFlag(*new_hosts.back(), lb_endpoint_.health_status());


Yeah, I think this would make sense.

htuch · 2019-02-05T02:09:08Z

test/common/upstream/upstream_impl_test.cc

  EXPECT_FALSE(cluster.info()->addedViaApi());
 }

+TEST_F(StaticClusterImplTest, LoadAssignmentEdsHealth) {


Can you add a // comment above this explaining what it is validating?

htuch

LGTM, but CI seems broken.

htuch · 2019-02-05T23:05:50Z

source/common/upstream/upstream_impl.cc

+Host::CreateConnectionData HostImpl::createConnection(
+    Event::Dispatcher& dispatcher, const Network::ConnectionSocket::OptionsSharedPtr& options,
+    Network::TransportSocketOptionsSharedPtr transport_socket_options) const {
+  return {createConnection(dispatcher, *cluster_, address_, options, transport_socket_options),


Nit: why this move?

htuch

Thanks!

Snow Pettersen added 2 commits January 17, 2019 22:37

add admin output for new flag (5f6fe9a)

snowp commented Jan 18, 2019

lizan requested a review from dio January 18, 2019 08:27

lizan assigned dio Jan 18, 2019

venilnoronha reviewed Jan 22, 2019

PR feedback naming changes, remove std::cerrs (c980745)

venilnoronha previously approved these changes Jan 23, 2019

Snow Pettersen added 2 commits January 29, 2019 18:35

Merge remote-tracking branch 'origin/master' into degraded-eds (8f413f5)

clean up merge (0071ab8)

snowp dismissed venilnoronha’s stale review via 0071ab8 January 30, 2019 00:21

snowp assigned htuch Jan 31, 2019

htuch reviewed Jan 31, 2019

htuch reviewed Feb 1, 2019

Snow Pettersen added 3 commits February 4, 2019 18:28

specify FALLTHRU and add integration test coverage (aa99b0e)

Merge remote-tracking branch 'origin/master' into degraded-eds (5cb9477)

fix bad merge (198eb6e)

htuch reviewed Feb 5, 2019

move setEdsFlag to HostImpl ctor, add test description (3626ea7)

htuch reviewed Feb 5, 2019

fix load balancer simulation (44015a1)

htuch approved these changes Feb 8, 2019

htuch merged commit 8c6bf40 into envoyproxy:master Feb 8, 2019

RyanTheOptimist mentioned this pull request Apr 7, 2021: make the failover timeout a constructor argument of ConnectionGrid #15881 (Merged)
