-
Notifications
You must be signed in to change notification settings - Fork 273
Healthcheck filter: compute response based on upstream cluster health #417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,4 +21,10 @@ message HealthCheck { | |
| // If operating in pass through mode, the amount of time in milliseconds | ||
| // that the filter should cache the upstream response. | ||
| google.protobuf.Duration cache_time = 3; | ||
|
|
||
| // [#not-implemented-hide:] | ||
| // If operating in non-pass-through mode, specifies a set of upstream cluster | ||
| // names and the minimum percentage of servers in each of those clusters that | ||
| // must be healthy in order for the filter to return a 200. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "the filter" is lacking context. Which filter are we talking about here? What "return a 200" means? I think it is less ideal to use boolean for health check result. In a distributed system, we should measure the health in some form of percentage or probability, and let the caller decide the threshold.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @wora, although I like the idea of signaling a percent-availability or current-capacity status in health check responses, I think it's outside the scope of this PR. The reason I say that is: interoperability. For an Envoy-to-Envoy service mesh, it's easy to add some new HTTP header like "X-Envoy-Health-Level: 75%." But in some deployments, the thing sending health checks to Envoy is a hardware load balancer or a cloud hosting provider's "load balancer as a service" (like AWS's NLB and Google's Cloud Load Balancing). To get multiple 3rd party systems/services to support a "partially healthy" health check response status from Envoy, we'd need to find or write an RFC that defines a standardized way of communicating fractional health status over HTTP.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't insist that you need to support it. I just suggest that is how we should design a distributed system. I will leave the review to Envoy experts. Thanks.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1, I think what we have here is good for now. I agree with @wora that % based health is very interesting from a mesh perspective, but I think we should track looking into that separately. |
||
| map<string, uint32> cluster_min_healthy_percentages = 4; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. per_cluster_mint_healthy_percentages ? |
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call it
cluster_min_healthy_percentages.