healthcheck filter: compute response based on upstream cluster health #2387

zuercher merged 8 commits into envoyproxy:master from brian-pane:healthcheck/2362
Conversation
*Description*: Extend the health check filter to optionally compute its HTTP response status based on whether at least a specified percentage of servers in a specified set of upstream clusters are healthy.
*Risk Level*: Medium
*Testing*: Unit tests included
*Docs Changes*: TODO
*Release Notes*: Included in this PR
*Fixes*: #2362
*API Changes*: [#417](envoyproxy/data-plane-api#417)

Signed-off-by: Brian Pane <bpane@pinterest.com>
I need to prepare a data-plane-api PR to update the docs for this feature. Based on my schedule this week, it may be a couple of days before I have the docs ready. But in the meantime, here's the code.

@zuercher do you mind owning review of this?
zuercher
left a comment
This looks good. Thanks for taking it on.
source/server/http/health_check.cc
Outdated
```cpp
      break;
    }
  }
  if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
```
Is there an advantage to writing it this way? I think it would be clearer if it were `stats.membership_healthy_.value() < membership_total * min_healthy_percentage / 100.0`. I suppose that requires a cast, but perhaps the computation of the minimum required healthy hosts (e.g., `membership_total * min_healthy_percentage / 100.0`) could be moved into a variable.
I did it that way to avoid a division operation. But, now that you mention it, g++ might implement the division as multiplication by the reciprocal. I know it does that optimization for division by an integer constant. I'll see if it does the same thing for division by a floating point constant.
After some testing and reading, I found that:

- g++ only converts division by a floating point literal into multiplication by its reciprocal when a special flag `-freciprocal-math` is specified. That's probably because the optimization can change the results of computations, due to the limits of floating-point precision.
- Division is still slow compared to multiplication, but it's gotten much better in recent years: the `DIVSD` instruction that g++ generates for `/ 100.0` will add under 20 clock cycles of latency on recent x86 processors.

So I'll just go with `/ 100.0` to improve readability.
You could remove the division from the health check request path by moving it to `HealthCheckFilterConfig::createFilter`. That is, store the value in the range [0.0, 1.0] instead of [0.0, 100.0].

That said, I don't think performance is critical here, and the current version lgtm.
If we were really going for performance, all of the calculations would be done on the main thread using a timer (say every few seconds) and then just referenced via TLS on the workers. I don't think it's worth doing for this use case assuming a sane health check interval, but if you feel like it you could add a TODO.
The conversions from int to floating point are also expensive, so the fastest implementation probably would be to do the whole check using integers only. I.e., during config loading, we could precompute and store `threshold = uint64_t(min_healthy_percentage) * 1000000` and then implement the check as

```cpp
if (stats.membership_healthy_.value() < membership_total * threshold / (100 * 1000000)) {
```
For now, I'll just add a TODO comment.
```cpp
}

TEST_F(HealthCheckFilterNoPassThroughTest, ComputedHealth) {
  // Test health non-pass-through health checks without upstream cluster
    filter_->decodeHeaders(request_headers_, true));
  }

  // Test health non-pass-through health checks with upstream cluster
    filter_->decodeHeaders(request_headers_, true));
  }
  {
    // This should fail, because one upstream cluster has no servers at all.
```
Consider adding a test for the empty cluster but min health == 0% case.
```cpp
Http::TestHeaderMapImpl request_headers_;
Http::TestHeaderMapImpl request_headers_no_hc_;

class MockHealthCheckCluster : public Upstream::MockCluster
```
Rather than creating a one-off mock here, I think you should add a helper function in the test file that just uses the existing `MockCluster`. I think something like this will work to create values you can use for `cluster_www1/2`:
```cpp
MockCluster makeHealthCheckCluster(uint64_t membership_total, uint64_t membership_healthy) {
  MockCluster cluster;
  cluster.info_->stats_.membership_total_.set(membership_total);
  cluster.info_->stats_.membership_healthy_.set(membership_healthy);
  return cluster;
}
```
Thanks for the pointer! MockCluster isn't copyable (something in the gmock macros seems to be explicitly deleting the copy constructor), but I was able to use your approach to replace all of MockHealthCheckCluster with just a constructor that sets the stats in MockCluster.
Signed-off-by: Brian Pane <bpane@pinterest.com>
I still need to create a documentation PR. I'll add that tomorrow.

Documentation PR: envoyproxy/data-plane-api#425
mattklein123
left a comment
Generally looks great, few comments.
RAW_RELEASE_NOTES.md
Outdated
```
* Added support for route matching based on URL query string parameters.
  :ref:`QueryParameterMatcher<envoy_api_msg_QueryParameterMatcher>`
* Added `/runtime` admin endpoint to read the current runtime values.
* Extended the health check filter to support computation of the health check resopnse based on the percent of healthy servers is upstream clusters.
```
typo "resopnse". Also, nit, please try to wrap around 100 cols
source/server/http/health_check.h
Outdated
```cpp
bool pass_through_mode_{};
HealthCheckCacheManagerSharedPtr cache_manager_{};
const std::string endpoint_;
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages_{};
```
source/server/http/health_check.h
Outdated
```cpp
typedef std::shared_ptr<HealthCheckCacheManager> HealthCheckCacheManagerSharedPtr;

typedef std::map<std::string, double> ClusterMinHealthyPercentages;
typedef std::shared_ptr<const ClusterMinHealthyPercentages> ClusterMinHealthyPercentagesSharedPtr;
```
`ClusterMinHealthyPercentagesConstSharedPtr`
source/server/http/health_check.cc
Outdated
```cpp
new HealthCheckFilter(context, pass_through_mode, cache_manager, hc_endpoint)});
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages;
if (!pass_through_mode && !proto_config.cluster_min_healthy_percentages().empty()) {
  auto* cluster_to_percentage = new ClusterMinHealthyPercentages();
```
please avoid naked memory allocations. I would either a) assign to unique_ptr than release/move into the final shared_ptr, or b) assign into a non-const shared_ptr which I think might be convertible to the const one (not sure).
source/server/http/health_check.cc
Outdated
```cpp
return [&context, pass_through_mode, cache_manager, hc_endpoint,
        cluster_min_healthy_percentages](Http::FilterChainFactoryCallbacks& callbacks) -> void {
  callbacks.addStreamFilter(Http::StreamFilterSharedPtr{new HealthCheckFilter(
```
nit: not your code but can you switch to `std::make_shared`?
source/server/http/health_check.cc
Outdated
```cpp
final_status = cache_manager_->getCachedResponseCode();
} else if (cluster_min_healthy_percentages_ != nullptr &&
           !cluster_min_healthy_percentages_->empty()) {
  const auto clusters(context_.clusterManager().clusters());
```
Unfortunately this is not safe. clusters() right now is only safe to be called from the main thread (not workers) since there is no locking. What you actually want to do is go through your targets clusters and call get() which will give you the thread local version which I think should be sufficient for the computations you need.
```cpp
const double min_healthy_percentage = item.second;
auto match = clusters.find(cluster_name);
if (match == clusters.end()) {
  final_status = Http::Code::ServiceUnavailable;
```
Can you add some comments here? I guess I see how the lack of the cluster at all is sufficient ground for failure, though I wonder if there should be a stat. At the very least I would probably add a small comment.
Signed-off-by: Brian Pane <bpane@pinterest.com>
```cpp
  filter_->decodeHeaders(request_headers_, true));
}
// Test the cases where an upstream cluster is empty, or has no healthy servers, but
// the minimum required percent healthy is zero. The health check should return a 200.
```
source/server/http/health_check.cc
Outdated
```cpp
}
if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
  // In the general case, consider the service unhealthy if fewer than the
  // specified percentage of the servers in the cluster are healthy.
```
Signed-off-by: Brian Pane <bpane@pinterest.com>
@brian-pane do you mind merging master?
Signed-off-by: Brian Pane <bpane@pinterest.com>
```cpp
{
  Http::TestHeaderMapImpl health_check_response{{":status", "200"}};
  EXPECT_CALL(context_, healthCheckFailed()).WillOnce(Return(false));
  EXPECT_CALL(callbacks_, encodeHeaders_(HeaderMapEqualRef(&health_check_response), true))
```
nit: `.Times(1)` is redundant (it's implied). You can remove. Same below.
mattklein123
left a comment
LGTM, just small nit from me. Will defer to @zuercher for other comments.
Signed-off-by: Brian Pane <bpane@pinterest.com>