
healthcheck filter: compute response based on upstream cluster health #2387

Merged
zuercher merged 8 commits into envoyproxy:master from brian-pane:healthcheck/2362
Jan 23, 2018

Conversation

@brian-pane
Contributor

@brian-pane brian-pane commented Jan 17, 2018

Description:
Extend the health check filter to optionally compute its HTTP response status
based on whether at least a specified percentage of servers in a specified set
of upstream clusters are healthy.

Risk Level: Medium

Testing: Unit tests included

Docs Changes: #425

Release Notes: Included in this PR

Fixes: #2362

API Changes: #417

Signed-off-by: Brian Pane <bpane@pinterest.com>

@brian-pane
Contributor Author

I need to prepare a data-plane-api PR to update the docs for this feature. Based on my schedule this week, it may be a couple of days before I have the docs ready. But in the meantime, here's the code.

@mattklein123
Member

@zuercher do you mind owning review of this?

Member

@zuercher zuercher left a comment

This looks good. Thanks for taking it on.

break;
}
}
if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
Member

Is there an advantage to writing it this way? I think it would be clearer if it were stats.membership_healthy_.value() < membership_total * min_healthy_percentage / 100.0. I suppose that requires a cast, but perhaps the computation of the minimum required healthy hosts (e.g., membership_total * min_healthy_percentage / 100.0) could be moved into a variable.

Contributor Author

I did it that way to avoid a division operation. But, now that you mention it, g++ might implement the division as multiplication by the reciprocal. I know it does that optimization for division by an integer constant. I'll see if it does the same thing for division by a floating point constant.

Contributor Author

@brian-pane brian-pane Jan 21, 2018

After some testing and reading, I found that:

  • g++ only converts division by a floating point literal into multiplication by its reciprocal when a special flag -freciprocal-math is specified. That's probably because the optimization can change the results of computations, due to the limits of floating-point precision.
  • Division is still slow compared to multiplication, but it's gotten much better in recent years: the DIVSD instruction that g++ generates for / 100.0 will add under 20 clock cycles of latency on recent x86 processors.

So I'll just go with / 100.0 to improve readability.

Member

You could remove the division from the health check request path by moving it to HealthCheckFilterConfig::createFilter. That is, store the value in the range [0.0, 1.0] instead of [0.0, 100.0].

That said, I don't think performance is critical here, and the current version lgtm.

Member

If we were really going for performance, all of the calculations would be done on the main thread using a timer (say every few seconds) and then just referenced via TLS on the workers. I don't think it's worth doing for this use case assuming a sane health check interval, but if you feel like it you could add a TODO.

Contributor Author

The conversions from int to floating point are also expensive, so the fastest implementation probably would be to do the whole check using integers only. I.e., during config loading, we could precompute and store threshold = uint64_t(min_healthy_percentage) * 1000000 and then implement the check as

if (stats.membership_healthy_.value() < membership_total * threshold / (100 * 1000000)) {

For now, I'll just add a TODO comment.

}

TEST_F(HealthCheckFilterNoPassThroughTest, ComputedHealth) {
// Test health non-pass-through health checks without upstream cluster
Member

nit: s/Test health/Test/

filter_->decodeHeaders(request_headers_, true));
}

// Test health non-pass-through health checks with upstream cluster
Member

s/Test health/Test/

filter_->decodeHeaders(request_headers_, true));
}
{
// This should fail, because one upstream cluster has no servers at all.
Member

Consider adding a test for the empty cluster but min health == 0% case.

Http::TestHeaderMapImpl request_headers_;
Http::TestHeaderMapImpl request_headers_no_hc_;

class MockHealthCheckCluster : public Upstream::MockCluster {
Member

Rather than creating a one-off mock here, I think you should add a helper function in the test file that just uses the existing MockCluster. I think something like this will work to create values you can use for cluster_www1/2:

MockCluster makeHealthCheckCluster(uint64_t membership_total, uint64_t membership_healthy) {
  MockCluster cluster;
  cluster.info_->stats_.membership_total_.set(membership_total);
  cluster.info_->stats_.membership_healthy_.set(membership_healthy);
  return cluster;
}

Contributor Author

Thanks for the pointer! MockCluster isn't copyable (something in the gmock macros seems to be explicitly deleting the copy constructor), but I was able to use your approach to replace all of MockHealthCheckCluster with just a constructor that sets the stats in MockCluster.

Brian Pane added 2 commits January 21, 2018 06:43
Signed-off-by: Brian Pane <bpane@pinterest.com>
Signed-off-by: Brian Pane <bpane@pinterest.com>
@brian-pane
Contributor Author

I still need to create a documentation PR. I'll add that tomorrow.

@brian-pane
Contributor Author

Documentation PR: envoyproxy/data-plane-api#425

Member

@mattklein123 mattklein123 left a comment

Generally looks great, few comments.

* Added support for route matching based on URL query string parameters.
:ref:`QueryParameterMatcher<envoy_api_msg_QueryParameterMatcher>`
* Added `/runtime` admin endpoint to read the current runtime values.
* Extended the health check filter to support computation of the health check resopnse based on the percent of healthy servers is upstream clusters.
Member

typo "resopnse". Also, nit, please try to wrap around 100 cols

bool pass_through_mode_{};
HealthCheckCacheManagerSharedPtr cache_manager_{};
const std::string endpoint_;
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages_{};
Member

nit: {} not needed

typedef std::shared_ptr<HealthCheckCacheManager> HealthCheckCacheManagerSharedPtr;

typedef std::map<std::string, double> ClusterMinHealthyPercentages;
typedef std::shared_ptr<const ClusterMinHealthyPercentages> ClusterMinHealthyPercentagesSharedPtr;
Member

ClusterMinHealthyPercentagesConstSharedPtr

new HealthCheckFilter(context, pass_through_mode, cache_manager, hc_endpoint)});
ClusterMinHealthyPercentagesSharedPtr cluster_min_healthy_percentages;
if (!pass_through_mode && !proto_config.cluster_min_healthy_percentages().empty()) {
auto* cluster_to_percentage = new ClusterMinHealthyPercentages();
Member

please avoid naked memory allocations. I would either a) assign to a unique_ptr, then release/move into the final shared_ptr, or b) assign into a non-const shared_ptr, which I think might be convertible to the const one (not sure).


return [&context, pass_through_mode, cache_manager, hc_endpoint,
cluster_min_healthy_percentages](Http::FilterChainFactoryCallbacks& callbacks) -> void {
callbacks.addStreamFilter(Http::StreamFilterSharedPtr{new HealthCheckFilter(
Member

nit: not your code but can you switch to std::make_shared?

final_status = cache_manager_->getCachedResponseCode();
} else if (cluster_min_healthy_percentages_ != nullptr &&
!cluster_min_healthy_percentages_->empty()) {
const auto clusters(context_.clusterManager().clusters());
Member

Unfortunately this is not safe. clusters() right now is only safe to call from the main thread (not workers) since there is no locking. What you actually want to do is go through your target clusters and call get(), which will give you the thread-local version; I think that should be sufficient for the computations you need.

const double min_healthy_percentage = item.second;
auto match = clusters.find(cluster_name);
if (match == clusters.end()) {
final_status = Http::Code::ServiceUnavailable;
Member

Can you add some comments here? I guess I see how the lack of the cluster at all is sufficient grounds for failure, though I wonder if there should be a stat. At the very least I would probably add a small comment.

Brian Pane added 2 commits January 21, 2018 22:58
Signed-off-by: Brian Pane <bpane@pinterest.com>
Signed-off-by: Brian Pane <bpane@pinterest.com>
filter_->decodeHeaders(request_headers_, true));
}
// Test the cases where an upstream cluster is empty, or has no healthy servers, but
// the minimum required percent healthy is zero. The hhealth check should return a 200.
Member

typo: hhealth

}
if (100.0 * stats.membership_healthy_.value() < membership_total * min_healthy_percentage) {
// In the general case, consider the service unhealthy if fewer than the
// specified percentage of the servers inthe cluster are healthy.
Member

typo: inthe

Signed-off-by: Brian Pane <bpane@pinterest.com>
@mattklein123
Member

@brian-pane do you mind merging master?

Signed-off-by: Brian Pane <bpane@pinterest.com>
{
Http::TestHeaderMapImpl health_check_response{{":status", "200"}};
EXPECT_CALL(context_, healthCheckFailed()).WillOnce(Return(false));
EXPECT_CALL(callbacks_, encodeHeaders_(HeaderMapEqualRef(&health_check_response), true))
Member

nit: .Times(1) is redundant (it's implied). You can remove. Same below.

Member

@mattklein123 mattklein123 left a comment

LGTM, just small nit from me. Will defer to @zuercher for other comments.

Signed-off-by: Brian Pane <bpane@pinterest.com>
@zuercher zuercher merged commit 130604f into envoyproxy:master Jan 23, 2018
@brian-pane brian-pane deleted the healthcheck/2362 branch January 25, 2018 21:59
jpsim added a commit that referenced this pull request Nov 28, 2022
#2387)

To prevent it from force-pushing to an existing branch like it did in
envoyproxy/envoy-mobile#2373

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request Nov 29, 2022
#2387)

To prevent it from force-pushing to an existing branch like it did in
envoyproxy/envoy-mobile#2373

Signed-off-by: JP Simard <jp@jpsim.com>