healthcheck filter: compute response based on upstream cluster health #2387

zuercher merged 8 commits into envoyproxy:master from brian-pane:healthcheck/2362
Conversation
*Description*: Extend the health check filter to optionally compute its HTTP response status based on whether at least a specified percentage of servers in a specified set of upstream clusters are healthy.
*Risk Level*: Medium
*Testing*: Unit tests included
*Docs Changes*: TODO
*Release Notes*: Included in this PR
*Fixes*: #2362
*API Changes*: [#417](envoyproxy/data-plane-api#417)

Signed-off-by: Brian Pane <bpane@pinterest.com>
I need to prepare a data-plane-api PR to update the docs for this feature. Based on my schedule this week, it may be a couple of days before I have the docs ready. But in the meantime, here's the code.

@zuercher do you mind owning review of this?
zuercher
left a comment
This looks good. Thanks for taking it on.
source/server/http/health_check.cc
Outdated
```cpp
      break;
    }
  }
  if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
```
Is there an advantage to writing it this way? I think it would be clearer if it were `stats.membership_healthy_.value() < membership_total * min_healthy_percentage / 100.0`. I suppose that requires a cast, but perhaps the computation of the minimum required healthy hosts (e.g., `membership_total * min_healthy_percentage / 100.0`) could be moved into a variable.
I did it that way to avoid a division operation. But, now that you mention it, g++ might implement the division as multiplication by the reciprocal. I know it does that optimization for division by an integer constant. I'll see if it does the same thing for division by a floating point constant.
After some testing and reading, I found that:

- g++ only converts division by a floating point literal into multiplication by its reciprocal when a special flag `-freciprocal-math` is specified. That's probably because the optimization can change the results of computations, due to the limits of floating-point precision.
- Division is still slow compared to multiplication, but it's gotten much better in recent years: the `DIVSD` instruction that g++ generates for `/ 100.0` will add under 20 clock cycles of latency on recent x86 processors.

So I'll just go with `/ 100.0` to improve readability.
You could remove the division from the health check request path by moving it to `HealthCheckFilterConfig::createFilter`. That is, store the value in the range [0.0, 1.0] instead of [0.0, 100.0].

That said, I don't think performance is critical here, and the current version lgtm.
If we were really going for performance, all of the calculations would be done on the main thread using a timer (say every few seconds) and then just referenced via TLS on the workers. I don't think it's worth doing for this use case assuming a sane health check interval, but if you feel like it you could add a TODO.
The conversions from int to floating point are also expensive, so the fastest implementation probably would be to do the whole check using integers only. I.e., during config loading, we could precompute and store `threshold = uint64_t(min_healthy_percentage) * 1000000` and then implement the check as

```cpp
if (stats.membership_healthy_.value() < membership_total * threshold / (100 * 1000000)) {
```
For now, I'll just add a TODO comment.
```cpp
}

TEST_F(HealthCheckFilterNoPassThroughTest, ComputedHealth) {
  // Test health non-pass-through health checks without upstream cluster
    filter_->decodeHeaders(request_headers_, true));
  }

  // Test health non-pass-through health checks with upstream cluster
    filter_->decodeHeaders(request_headers_, true));
  }
  {
    // This should fail, because one upstream cluster has no servers at all.
```
Consider adding a test for the empty cluster but min health == 0% case.
```cpp
Http::TestHeaderMapImpl request_headers_;
Http::TestHeaderMapImpl request_headers_no_hc_;

class MockHealthCheckCluster : public Upstream::MockCluster
```
Rather than creating a one-off mock here, I think you should add a helper function in the test file that just uses the existing `MockCluster`. I think something like this will work to create values you can use for `cluster_www1/2`:
```cpp
MockCluster makeHealthCheckCluster(uint64_t membership_total, uint64_t membership_healthy) {
  MockCluster cluster;
  cluster.info_->stats_.membership_total_.set(membership_total);
  cluster.info_->stats_.membership_healthy_.set(membership_healthy);
  return cluster;
}
```
Thanks for the pointer! MockCluster isn't copyable (something in the gmock macros seems to be explicitly deleting the copy constructor), but I was able to use your approach to replace all of MockHealthCheckCluster with just a constructor that sets the stats in MockCluster.
Signed-off-by: Brian Pane <bpane@pinterest.com>
I still need to create a documentation PR. I'll add that tomorrow.

Documentation PR: envoyproxy/data-plane-api#425
mattklein123
left a comment
Generally looks great, few comments.
RAW_RELEASE_NOTES.md
Outdated
```
* Added support for route matching based on URL query string parameters.
  :ref:`QueryParameterMatcher<envoy_api_msg_QueryParameterMatcher>`
* Added `/runtime` admin endpoint to read the current runtime values.
* Extended the health check filter to support computation of the health check resopnse based on the percent of healthy servers is upstream clusters.
```
typo "resopnse". Also, nit, please try to wrap around 100 cols
source/server/http/health_check.h
Outdated
```cpp
bool pass_through_mode_{};
HealthCheckCacheManagerSharedPtr cache_manager_{};
const std::string endpoint_;
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages_{};
```
source/server/http/health_check.h
Outdated
```cpp
typedef std::shared_ptr<HealthCheckCacheManager> HealthCheckCacheManagerSharedPtr;

typedef std::map<std::string, double> ClusterMinHealthyPercentages;
typedef std::shared_ptr<const ClusterMinHealthyPercentages> ClusterMinHealthyPercentagesSharedPtr;
```
`ClusterMinHealthyPercentagesConstSharedPtr`
source/server/http/health_check.cc
Outdated
```cpp
new HealthCheckFilter(context, pass_through_mode, cache_manager, hc_endpoint)});
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages;
if (!pass_through_mode && !proto_config.cluster_min_healthy_percentages().empty()) {
  auto* cluster_to_percentage = new ClusterMinHealthyPercentages();
```
please avoid naked memory allocations. I would either a) assign to unique_ptr than release/move into the final shared_ptr, or b) assign into a non-const shared_ptr which I think might be convertible to the const one (not sure).
source/server/http/health_check.cc
Outdated
```cpp
return [&context, pass_through_mode, cache_manager, hc_endpoint,
        cluster_min_healthy_percentages](Http::FilterChainFactoryCallbacks& callbacks) -> void {
  callbacks.addStreamFilter(Http::StreamFilterSharedPtr{new HealthCheckFilter(
```
nit: not your code but can you switch to `std::make_shared`?
source/server/http/health_check.cc
Outdated
```cpp
final_status = cache_manager_->getCachedResponseCode();
} else if (cluster_min_healthy_percentages_ != nullptr &&
           !cluster_min_healthy_percentages_->empty()) {
  const auto clusters(context_.clusterManager().clusters());
```
Unfortunately this is not safe. clusters() right now is only safe to be called from the main thread (not workers) since there is no locking. What you actually want to do is go through your targets clusters and call get() which will give you the thread local version which I think should be sufficient for the computations you need.
```cpp
const double min_healthy_percentage = item.second;
auto match = clusters.find(cluster_name);
if (match == clusters.end()) {
  final_status = Http::Code::ServiceUnavailable;
```
Can you add some comments here? I guess I see how the lack of the cluster at all is sufficient ground for failure, though I wonder if there should be a stat. At the very least I would probably add a small comment.
Signed-off-by: Brian Pane <bpane@pinterest.com>
```cpp
  filter_->decodeHeaders(request_headers_, true));
}
// Test the cases where an upstream cluster is empty, or has no healthy servers, but
// the minimum required percent healthy is zero. The health check should return a 200.
```
source/server/http/health_check.cc
Outdated
```cpp
}
if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
  // In the general case, consider the service unhealthy if fewer than the
  // specified percentage of the servers in the cluster are healthy.
```
Signed-off-by: Brian Pane <bpane@pinterest.com>
@brian-pane do you mind merging master?
Signed-off-by: Brian Pane <bpane@pinterest.com>
```cpp
{
  Http::TestHeaderMapImpl health_check_response{{":status", "200"}};
  EXPECT_CALL(context_, healthCheckFailed()).WillOnce(Return(false));
  EXPECT_CALL(callbacks_, encodeHeaders_(HeaderMapEqualRef(&health_check_response), true))
```
nit: `.Times(1)` is redundant (it's implied). You can remove. Same below.
mattklein123
left a comment
LGTM, just small nit from me. Will defer to @zuercher for other comments.
Signed-off-by: Brian Pane <bpane@pinterest.com>