Healthcheck filter: compute response based on upstream cluster health #417
mattklein123 merged 4 commits into envoyproxy:master from brian-pane:healthcheck/2362
*Description*: Extend the "non pass through" mode of the HTTP health check filter to allow an optional map from cluster names to percentages. If specified, each of the named clusters must have at least the specified percentage of its servers healthy for the filter to return a 200. (The proxy also must be in a non-draining state, or else the filter will return a 503 like it did previously.)
Associated envoyproxy/envoy issue: envoyproxy/envoy#2362
Signed-off-by: Brian Pane <bpane@pinterest.com>
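A minimal sketch of the decision rule in the description, written in Python for brevity rather than in Envoy's C++. The names `ClusterHealth` and `health_check_status` are hypothetical and exist only for this illustration. The handling of a cluster that is missing or has no hosts is an assumption of the sketch, not something the description specifies.

```python
from typing import Mapping, NamedTuple


class ClusterHealth(NamedTuple):
    """Host counts for one upstream cluster (hypothetical type for this sketch)."""
    healthy: int
    total: int


def health_check_status(
    draining: bool,
    min_healthy_percentages: Mapping[str, float],
    clusters: Mapping[str, ClusterHealth],
) -> int:
    """Return 200 or 503 following the rule stated in the PR description."""
    # A draining proxy fails the health check regardless of upstream state.
    if draining:
        return 503
    for name, min_percent in min_healthy_percentages.items():
        stats = clusters.get(name)
        # Assumption: an unknown cluster, or an empty cluster with a
        # non-zero requirement, counts as failing the check.
        if stats is None or (stats.total == 0 and min_percent > 0):
            return 503
        # healthy / total < min_percent / 100, rearranged to avoid division.
        if stats.healthy * 100 < min_percent * stats.total:
            return 503
    # No map entries (or all satisfied) keeps the previous behavior: 200.
    return 200
```

With 3 of 4 hosts healthy, a cluster meets a 75% requirement exactly and the check passes; with 1 of 2 healthy it fails a 60% requirement.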
htuch
left a comment
Looks good, some minor tweaks suggested.
api/filter/http/health_check.proto
Outdated
google.protobuf.Duration cache_time = 3;

// [#not-implemented-hide:]
// If operating in non pass through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
Maybe call it cluster_min_healthy_percentages.
api/filter/http/health_check.proto
Outdated
// If operating in non pass through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
map<string, uint32> cluster_min_percentages = 4;
See discussion in #375 on percentage modeling in Envoy.
mattklein123
left a comment
LGTM other than @htuch's comments. Note that when the docs are unhidden we should also add arch docs on the new feature.
Signed-off-by: Brian Pane <bpane@pinterest.com>
api/filter/http/health_check.proto
Outdated
  // names and the minimum percentage of servers in each of those clusters that
  // must be healthy in order for the filter to return a 200.
- map<string, uint32> cluster_min_percentages = 4;
+ map<string, uint32> cluster_min_healthy_percentages = 4;
per_cluster_min_healthy_percentages?
- // If operating in non pass through mode, specifies a set of upstream cluster
+ // If operating in non-pass-through mode, specifies a set of upstream cluster
  // names and the minimum percentage of servers in each of those clusters that
  // must be healthy in order for the filter to return a 200.
"The filter" lacks context. Which filter are we talking about here? What does "return a 200" mean?
I think using a boolean for the health check result is less than ideal. In a distributed system, we should measure health as some form of percentage or probability, and let the caller decide the threshold.
@wora, although I like the idea of signaling a percent-availability or current-capacity status in health check responses, I think it's outside the scope of this PR. The reason I say that is: interoperability. For an Envoy-to-Envoy service mesh, it's easy to add some new HTTP header like "X-Envoy-Health-Level: 75%." But in some deployments, the thing sending health checks to Envoy is a hardware load balancer or a cloud hosting provider's "load balancer as a service" (like AWS's NLB and Google's Cloud Load Balancing). To get multiple 3rd party systems/services to support a "partially healthy" health check response status from Envoy, we'd need to find or write an RFC that defines a standardized way of communicating fractional health status over HTTP.
I don't insist that you support it. I am only suggesting that this is how we should design a distributed system. I will leave the review to the Envoy experts. Thanks.
+1, I think what we have here is good for now. I agree with @wora that % based health is very interesting from a mesh perspective, but I think we should track looking into that separately.
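To make the deferred idea concrete, here is a rough sketch of what "let the caller decide the threshold" could look like on the health checker's side. It assumes the hypothetical `X-Envoy-Health-Level` header floated earlier in this thread, which is not an existing Envoy feature. The function names, the fallback when the header is absent, and the choice to treat any non-200 status as unhealthy are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def parse_health_level(value: str) -> Optional[float]:
    """Parse a value such as "75%" into 75.0; return None if malformed."""
    text = value.strip()
    if not text.endswith("%"):
        return None
    try:
        level = float(text[:-1])
    except ValueError:
        return None
    # Reject out-of-range values (this also rejects NaN and infinity).
    return level if 0.0 <= level <= 100.0 else None


def caller_considers_healthy(
    status: int, headers: Mapping[str, str], threshold: float
) -> bool:
    """Apply a caller-chosen threshold to a fractional health signal."""
    # Assumption: a non-200 status is unhealthy whatever the header says.
    if status != 200:
        return False
    # Header lookup here is case-sensitive; a real client would normalize names.
    raw = headers.get("X-Envoy-Health-Level")
    if raw is None:
        # No fractional signal: fall back to the plain 200/503 result.
        return True
    level = parse_health_level(raw)
    return level is not None and level >= threshold
```

A checker that never sends or reads the header still works off the status code alone, which matches the interoperability concern raised above about third-party load balancers.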
Signed-off-by: Brian Pane <bpane@pinterest.com>
string sub_zone = 3;
}

// Identifies a percentage, in the range [0.0, 100.0].
Please expand this comment. For example:
Specifies a percentage, in the range [0.0, 100.0]. For Envoy APIs, this message must be used instead of int32 or double for specifying traffic split and similar settings based on percentage.
PS: if we define a message to replace double, we must force people to use it. Otherwise, we will get mixed double vs Percent, which would be a bad outcome. You may even want to update STYLE.md to force people to use it.
Sounds good, I’ll add an update to STYLE.md
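The argument for a dedicated percentage type over a bare number can be illustrated with a small wrapper that enforces the documented range at construction time. This is a Python sketch only: the class mirrors the idea of the `Percent` message, but the field name `value` and the validation behavior are assumptions of this example, not taken from the proto under review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percent:
    """A percentage in the range [0.0, 100.0] (illustrative wrapper type)."""
    value: float  # assumed field name for this sketch

    def __post_init__(self) -> None:
        # Reject anything outside the documented range, including NaN.
        if not 0.0 <= self.value <= 100.0:
            raise ValueError(
                f"percentage out of range [0.0, 100.0]: {self.value!r}"
            )
```

Because every percentage passes through one type, a mix of raw doubles and wrapped values is easier to spot in review, which is the outcome the style rule is meant to prevent.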
Signed-off-by: Brian Pane <bpane@pinterest.com>
// If operating in non-pass-through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
map<string, Percent> cluster_min_healthy_percentages = 4;
Any thoughts on the CDS use case? @brian-pane