Skip to content

Healthcheck filter: compute response based on upstream cluster health#417

Merged
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
brian-pane:healthcheck/2362
Jan 16, 2018
Merged

Healthcheck filter: compute response based on upstream cluster health#417
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
brian-pane:healthcheck/2362

Conversation

@brian-pane
Copy link
Contributor

Description:
Extend the "non pass through" mode of the HTTP health check filter to allow an
optional map from cluster names to percentages. If specified, each of the named
clusters must have at least the specified percentage of its servers healthy for
the filter to return a 200. (The proxy also must be in a non-draining state, or
else the filter will return a 503 like it did previously.)

Associated envoyproxy/envoy issue: envoyproxy/envoy#2362

Signed-off-by: Brian Pane bpane@pinterest.com

*Description*:
Extend the "non pass through" mode of the HTTP health check filter to allow an
optional map from cluster names to percentages. If specified, each of the named
clusters must have at least the specified percentage of its servers healthy for
the filter to return a 200. (The proxy also must be in a non-draining state, or
else the filter will return a 503 like it did previously.)

Associated envoyproxy/envoy issue: envoyproxy/envoy#2362

Signed-off-by: Brian Pane <bpane@pinterest.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor tweaks suggested.

google.protobuf.Duration cache_time = 3;

// [#not-implemented-hide:]
// If operating in non pass through mode, specifies a set of upstream cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: non-pass-through

// [#not-implemented-hide:]
// If operating in non pass through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it cluster_min_healthy_percentages.

// If operating in non pass through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
map<string, uint32> cluster_min_percentages = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See discussion in #375 on percentage modeling in Envoy.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than @htuch comments. Note that when the docs get unhid we should also add arch docs on the new feature.

Signed-off-by: Brian Pane <bpane@pinterest.com>
@brian-pane
Copy link
Contributor Author

@htuch since #375 has been dormant for a while, I suppose I could add the Percent type to this PR, like this:

message Percent {
  double value = 1 [(validate.rules).double = {gte: 0, lte: 100}];
}

What do you think?

// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
map<string, uint32> cluster_min_percentages = 4;
map<string, uint32> cluster_min_healthy_percentages = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per_cluster_mint_healthy_percentages ?

// If operating in non pass through mode, specifies a set of upstream cluster
// If operating in non-pass-through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the filter" is lacking context. Which filter are we talking about here? What "return a 200" means?

I think it is less ideal to use boolean for health check result. In a distributed system, we should measure the health in some form of percentage or probability, and let the caller decide the threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wora, although I like the idea of signaling a percent-availability or current-capacity status in health check responses, I think it's outside the scope of this PR. The reason I say that is: interoperability. For an Envoy-to-Envoy service mesh, it's easy to add some new HTTP header like "X-Envoy-Health-Level: 75%." But in some deployments, the thing sending health checks to Envoy is a hardware load balancer or a cloud hosting provider's "load balancer as a service" (like AWS's NLB and Google's Cloud Load Balancing). To get multiple 3rd party systems/services to support a "partially healthy" health check response status from Envoy, we'd need to find or write an RFC that defines a standardized way of communicating fractional health status over HTTP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't insist that you need to support it. I just suggest that is how we should design a distributed system. I will leave the review to Envoy experts. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think what we have here is good for now. I agree with @wora that % based health is very interesting from a mesh perspective, but I think we should track looking into that separately.

@htuch
Copy link
Member

htuch commented Jan 16, 2018

@brian-pane Percen LGTM

Signed-off-by: Brian Pane <bpane@pinterest.com>
string sub_zone = 3;
}

// Identifies a percentage, in the range [0.0, 100.0].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more comment to this.

Specifies a percentage, in the range [0.0, 100.0]. For Envoy APIs, this message must be used instead of int32 or double for specifying traffic split and similar settings based on percentage.

PS: if we define a message to replace double, we must force people to use it. Otherwise, we will get mixed double vs Percent, which would be a bad outcome. You may even want to update STYLE.md to force people to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I’ll add an update to STYLE.md

Signed-off-by: Brian Pane <bpane@pinterest.com>
// If operating in non-pass-through mode, specifies a set of upstream cluster
// names and the minimum percentage of servers in each of those clusters that
// must be healthy in order for the filter to return a 200.
map<string, Percent> cluster_min_healthy_percentages = 4;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on CDS use case? @brian-pane

@brian-pane brian-pane deleted the healthcheck/2362 branch January 22, 2018 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants