Healthcheck filter: compute response based on upstream cluster health by brian-pane · Pull Request #417 · envoyproxy/data-plane-api

brian-pane · 2018-01-14T02:04:23Z

Description:
Extend the "non pass through" mode of the HTTP health check filter to allow an
optional map from cluster names to percentages. If specified, each of the named
clusters must have at least the specified percentage of its servers healthy for
the filter to return a 200. (The proxy also must be in a non-draining state, or
else the filter will return a 503 like it did previously.)

Associated envoyproxy/envoy issue: envoyproxy/envoy#2362

Signed-off-by: Brian Pane <bpane@pinterest.com>
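
The rule described above can be sketched roughly as follows. This is only an illustration of the decision logic, not the Envoy implementation: the function name, the `cluster_health` input shape, and the choice to treat an unknown or empty cluster as 0% healthy are all assumptions made for the example.

```python
from typing import Mapping, Tuple


def health_check_status(
    draining: bool,
    cluster_min_healthy_percentages: Mapping[str, float],
    cluster_health: Mapping[str, Tuple[int, int]],
) -> int:
    """Return the HTTP status for a non-pass-through health check.

    cluster_health maps a cluster name to (healthy_hosts, total_hosts);
    this input shape is hypothetical and exists only for the sketch.
    """
    # A draining proxy fails the health check regardless of cluster health.
    if draining:
        return 503
    for name, min_percent in cluster_min_healthy_percentages.items():
        healthy, total = cluster_health.get(name, (0, 0))
        # Assumption: an unknown or empty cluster counts as 0% healthy.
        percent = 100.0 * healthy / total if total else 0.0
        if percent < min_percent:
            return 503
    # Every named cluster met its minimum (or no clusters were named).
    return 200
```

With an empty map the function depends only on the draining state, which matches the "like it did previously" behavior in the description.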

htuch

Looks good, some minor tweaks suggested.

htuch · 2018-01-14T17:07:55Z

api/filter/http/health_check.proto

  google.protobuf.Duration cache_time = 3;
+
+  // [#not-implemented-hide:]
+  // If operating in non pass through mode, specifies a set of upstream cluster


Nit: non-pass-through

htuch · 2018-01-14T17:08:24Z

api/filter/http/health_check.proto

+  // [#not-implemented-hide:]
+  // If operating in non pass through mode, specifies a set of upstream cluster
+  // names and the minimum percentage of servers in each of those clusters that
+  //  must be healthy in order for the filter to return a 200.


Maybe call it cluster_min_healthy_percentages.

htuch · 2018-01-14T17:09:03Z

api/filter/http/health_check.proto

+  // If operating in non pass through mode, specifies a set of upstream cluster
+  // names and the minimum percentage of servers in each of those clusters that
+  //  must be healthy in order for the filter to return a 200.
+  map<string, uint32> cluster_min_percentages = 4;


See discussion in #375 on percentage modeling in Envoy.

mattklein123

LGTM other than @htuch's comments. Note that when the docs are unhidden, we should also add arch docs on the new feature.

brian-pane · 2018-01-14T22:26:31Z

@htuch since #375 has been dormant for a while, I suppose I could add the Percent type to this PR, like this:

message Percent {
  double value = 1 [(validate.rules).double = {gte: 0, lte: 100}];
}

What do you think?
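
For readers less familiar with the validation annotation, the constraint in the proposed message (`gte: 0, lte: 100`) amounts to an inclusive range check. A minimal Python sketch of the same rule follows; the class is purely illustrative and is not generated from the proto or part of any Envoy tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percent:
    """Illustrative stand-in for the proposed Percent message."""

    value: float

    def __post_init__(self) -> None:
        # Mirrors the proposed rule: 0 <= value <= 100, both ends inclusive.
        # NaN fails both comparisons, so it is rejected as well.
        if not (0.0 <= self.value <= 100.0):
            raise ValueError(f"percent out of range [0, 100]: {self.value}")
```

Constructing `Percent(75.0)` succeeds, while `Percent(101.0)` or `Percent(-1.0)` raises `ValueError`.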

wora · 2018-01-14T22:51:09Z

api/filter/http/health_check.proto

  // names and the minimum percentage of servers in each of those clusters that
  //  must be healthy in order for the filter to return a 200.
-  map<string, uint32> cluster_min_percentages = 4;
+  map<string, uint32> cluster_min_healthy_percentages = 4;


per_cluster_min_healthy_percentages?

wora · 2018-01-14T22:53:52Z

api/filter/http/health_check.proto

-  // If operating in non pass through mode, specifies a set of upstream cluster
+  // If operating in non-pass-through mode, specifies a set of upstream cluster
  // names and the minimum percentage of servers in each of those clusters that
  //  must be healthy in order for the filter to return a 200.


"The filter" lacks context. Which filter are we talking about here? What does "return a 200" mean?

I think it is less than ideal to use a boolean for the health check result. In a distributed system, we should measure health as some form of percentage or probability, and let the caller decide the threshold.

@wora, although I like the idea of signaling a percent-availability or current-capacity status in health check responses, I think it's outside the scope of this PR. The reason I say that is: interoperability. For an Envoy-to-Envoy service mesh, it's easy to add some new HTTP header like "X-Envoy-Health-Level: 75%." But in some deployments, the thing sending health checks to Envoy is a hardware load balancer or a cloud hosting provider's "load balancer as a service" (like AWS's NLB and Google's Cloud Load Balancing). To get multiple 3rd party systems/services to support a "partially healthy" health check response status from Envoy, we'd need to find or write an RFC that defines a standardized way of communicating fractional health status over HTTP.

I don't insist that you support it. I'm only suggesting that this is how we should design a distributed system. I will leave the review to Envoy experts. Thanks.

+1, I think what we have here is good for now. I agree with @wora that % based health is very interesting from a mesh perspective, but I think we should track looking into that separately.

htuch · 2018-01-16T01:30:21Z

@brian-pane Percent LGTM

wora · 2018-01-16T16:17:30Z

api/base.proto

  string sub_zone = 3;
 }

+// Identifies a percentage, in the range [0.0, 100.0].


Please expand this comment, for example:

Specifies a percentage, in the range [0.0, 100.0]. For Envoy APIs, this message must be used instead of int32 or double for specifying traffic split and similar settings based on percentage.

PS: if we define a message to replace double, we must force people to use it. Otherwise, we will end up with a mix of double and Percent, which would be a bad outcome. You may even want to update STYLE.md to require it.

Sounds good, I’ll add an update to STYLE.md

unicell · 2018-01-22T21:40:03Z

api/filter/http/health_check.proto

+  // If operating in non-pass-through mode, specifies a set of upstream cluster
+  // names and the minimum percentage of servers in each of those clusters that
+  //  must be healthy in order for the filter to return a 200.
+  map<string, Percent> cluster_min_healthy_percentages = 4;


Any thoughts on the CDS use case? @brian-pane

htuch reviewed Jan 14, 2018

mattklein123 reviewed Jan 14, 2018

Updated comments and field names

fdc99ad

Signed-off-by: Brian Pane <bpane@pinterest.com>

wora reviewed Jan 14, 2018

Add a Percent message type

06b11c5

Signed-off-by: Brian Pane <bpane@pinterest.com>

wora reviewed Jan 16, 2018

Document the new Percent message type in STYLE.md

4f68beb

Signed-off-by: Brian Pane <bpane@pinterest.com>

mattklein123 approved these changes Jan 16, 2018

mattklein123 merged commit b4a3031 into envoyproxy:master Jan 16, 2018

This was referenced Jan 16, 2018

Expose trace sampling controls in the public API #375

Closed

healthcheck filter: compute response based on upstream cluster health envoyproxy/envoy#2387

Merged

unicell reviewed Jan 22, 2018

brian-pane deleted the healthcheck/2362 branch January 22, 2018 22:45
