admin: add json serialized proto for /clusters#3478
admin: add json serialized proto for /clusters#3478htuch merged 20 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
source/server/http/admin.cc
Outdated
| } | ||
| Http::Utility::QueryParams query_params = Http::Utility::parseQueryString(url); | ||
| auto it = query_params.find("format"); | ||
| if (it != query_params.end() && it->second == "json") { |
There was a problem hiding this comment.
What do you think of splitting this function in two to make it a little more readable?
| // Admin endpoint uses this wrapper for `/clusters` to display cluster status information. | ||
| // See :ref:`/clusters <operations_admin_interface_clusters>` for more information. | ||
| message Clusters { | ||
| // Aapping from cluster name to each cluster's status. |
| message ClusterStatus { | ||
| // General outlier statistics if installed for this cluster. | ||
| OutlierInfo outlier_info = 1; | ||
| // Mapping from priority to circuit breaker settings for that priority. |
There was a problem hiding this comment.
style nit: I'd prefer a blank line between fields for readability.
|
@zuercher Thanks for the quick review! Responded to your comments. PTAL. |
| // Details an individual cluster's current status. | ||
| message ClusterStatus { | ||
| // General outlier statistics if installed for this cluster. | ||
| OutlierInfo outlier_info = 1; |
There was a problem hiding this comment.
Can you reorder the protos so that they are in topological sort order? This make the documentation clearer and is a convention we have adopted.
There was a problem hiding this comment.
Just to be clear, you're suggesting that if one message depends on another, the dependent message should go below the dependency? Is this a new convention? I was following the example of the bootstrap proto and others, which seem to do the opposite.
There was a problem hiding this comment.
Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.
| bool added_via_api = 3; | ||
|
|
||
| // Mapping from host address to the host's current status. | ||
| map<string, HostStatus> host_statuses = 4; |
There was a problem hiding this comment.
Do we want to use https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/address.proto#L93 here? I.e. have some structure since this is proto, rather than encoding ports etc. in strings.
There was a problem hiding this comment.
This brings up a broader question. I think I have made uncharacteristic (in the scope of Envoy) use of maps. Both messages and enums (as mentioned in your other comment below) cannot be used as key types. Should we try to avoid maps here, generally (even when the key types allow), or just remove them in these two incompatible places?
There was a problem hiding this comment.
I think it probably makes sense for the HostStatus.stats field, but I think the rest are debatable.
| OutlierInfo outlier_info = 1; | ||
|
|
||
| // Mapping from priority to circuit breaker settings for that priority. | ||
| map<string, CircuitSettings> circuit_settings = 2; |
There was a problem hiding this comment.
Do you want
envoy/api/envoy/api/v2/core/base.proto
Line 114 in 49b122d
| // Cluster outlier detection statistics. -1 for any of these statistics denotes that there was not | ||
| // enough data to compute it. | ||
| message OutlierInfo { | ||
| // The average success rate of the hosts in the Detector for the last aggregation interval. |
| double success_rate_average = 1; | ||
|
|
||
| // The success rate threshold used in the last interval. The threshold is used to eject hosts | ||
| // based on their success rate. |
There was a problem hiding this comment.
Are these percentages/ratios? If so, consider expressing with https://github.com/envoyproxy/envoy/blob/master/api/envoy/type/percent.proto.
| // Circuit settings for a particular priority setting. | ||
| message CircuitSettings { | ||
| // Total TCP connection limit. | ||
| int64 max_connections = 1; |
There was a problem hiding this comment.
In Envoy, we prefer unsigned types for things that are actually unsigned.
| map<string, int64> stats = 1; | ||
|
|
||
| // A string representation of the host's current health status. | ||
| string health_flags = 2; |
There was a problem hiding this comment.
Prefer proto structure rather than opaque strings here.
|
|
||
| // Pending request limit. A request is pending if it is waiting to be attached to a connection | ||
| // pool connection. | ||
| int64 max_pending_requests = 2; |
There was a problem hiding this comment.
Aren't these
? They are already available on/config_dump, so can we elide here?
There was a problem hiding this comment.
I was trying to be consistent with the info provided in the text printout, but I'll remove as long as there are no back-compat objections.
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
|
@htuch, updated API as suggested. PTAL, and let me know if the new structure makes sense. |
| // Details an individual cluster's current status. | ||
| message ClusterStatus { | ||
| // General outlier statistics if installed for this cluster. | ||
| OutlierInfo outlier_info = 1; |
There was a problem hiding this comment.
Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.
| } | ||
|
|
||
| // :ref:`Cluster outlier detection <arch_overview_outlier_detection>` statistics. -1 for any of | ||
| // these statistics denotes that there was not enough data to compute it. |
There was a problem hiding this comment.
Are you using -1 here or just omitting the field?
| // Health status for a host. | ||
| message HostHealthStatus { | ||
| // The host is currently marked as healthy. | ||
| bool healthy = 1; |
There was a problem hiding this comment.
I think we can simplify this to just indicating when a host is unhealthy, and then when there are no bits set saying it's unhealthy, it's considered healthy. I wonder how this relates to https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182.. it seems we should have the ability to also indicate draining? Should we have a single health status message?
There was a problem hiding this comment.
As for the former, I thought about that. My only hesitation is from a usability perspective. The highest order bit here is whether the host is healthy or not, but that would be the hardest for the user to determine if we were to eliminate the redundant healthy bool. You could probably get a little clever by omitting this message or a submessage for a healthy host to create an artificial way to signal a healthy host, but that might be more trouble than it's worth. WDYT?
As for how this relates to the EDS health status, that is reduced to a bool here IIUC - failed_eds_health_check. We could make this bool an enum to represent the full range of EDS health states rather than just healthy or unhealthy.
There was a problem hiding this comment.
I think we don't have to spend too much time making these "usable", in that the APIs are intended for machine parsing first. So, I think we can drop the healthy status. The EDS health status is really a reflection of other health status possibilities. We might for example, via HDS, discover that a host is draining independent to EDS. So, I think we should be able to convey these possibilities. I would argue that we probably just want a single health status proto and it should be https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182, we should make it work if it's not expressive enough for the requiremetns here (e.g. indicating outlier detection).
| // | ||
| // Note: the message will be omitted if there were not enough hosts with enough request volume to | ||
| // proceed with success rate based outlier ejection. | ||
| envoy.type.Percent success_rate_average = 1; |
There was a problem hiding this comment.
Can this be derived from the individual hosts' success_rate by the consumer of the endpoint?
There was a problem hiding this comment.
Basically, yes (see below for a more detailed answer). Is it worth providing it for usability?
They will be close. Technically, they may differ under two scenarios IIUC:
- A host has been removed since the last time the success rates were updated (they operate on a timer).
- I'm not 100% sure, but I think hosts that go from getting enough requests to be considered back to not having enough requests to be considered will report an outdated number for the
successRate()call (this one is probably a bug since this goes against the documented behavior). Same effect if the number of hosts drops back below the threshold for calculating the cluster-wide success rate average - none of their values get updated until the cluster gets back above that threshold.
There was a problem hiding this comment.
Unless the distinction is important to the use for (1) and (2), thenI would vote to just provide it at the host-level and allow the consumer to compute.
There was a problem hiding this comment.
Since the internally computed cluster-wide average is compared to each host's average to determine whether the host gets ejected or not, I would think the exact cluster-wide average might be important for users when debugging why a host got ejected. However, it's something that's easy to add later if someone specifically requests it, so SGTM.
| uint64 weight = 4; | ||
|
|
||
| // Configured locality for the host. | ||
| envoy.api.v2.core.Locality locality = 5; |
There was a problem hiding this comment.
Can we reuse https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/endpoint/endpoint.proto#L46 and related structures? Seems like a reasonable overlap.
There was a problem hiding this comment.
Hmmmm, yes, there's definitely some overlap, but I worry that this will creep into us providing the entire EDS config in /clusters (the overlap isn't 100%, so we would have to leave out fields). Didn't you have a suggestion on the previous round of review that we should eliminate config details that could easily be found in /config_dump to delineate status from configuration? Given that EDS info should be available in /config_dump, should we try to remove all of that from the host portion as well? I'm not super opinionated, but I think our approach should be consistent.
There was a problem hiding this comment.
/config_dump doesn't provide EDS today :) @mattklein123 do you think we should provide EDS here or in /config_dump? I think the latter, but this PR seems related.
There was a problem hiding this comment.
Yeah I think it should be provided in /config_dump. The reason I didn't do it is that it isn't a trivial change/design to figure out how to do it. Should it be top level? Should it be embedded within clusters? How will the config source be managed? Etc. I opened the issue to track adding it and hopefully we can discuss there when someone wants to work on it.
There was a problem hiding this comment.
SGTM, I'll remove weight, locality, and canary from HostStatus.
| envoy.api.v2.core.Address address = 1; | ||
|
|
||
| // Mapping from the name of the statistic to the current value. | ||
| map<string, int64> stats = 2; |
There was a problem hiding this comment.
Is this the first time we're defining in proto the stats data model?
There was a problem hiding this comment.
Not exactly - see here. The only more sophisticated definition I could imagine would be differentiating between gauges and counters.
There was a problem hiding this comment.
OK, I think it's fine to have a simpler variant without relying on Prometheus protos. Might belong in its own message though, so we can reuse again later. Would suggest the domain to be uint64.
There was a problem hiding this comment.
SGTM. Where do you think this stat message should go? Under api/envoy/data/, in api/envoy/metrics/v2/stats.proto, or somewhere else?
Signed-off-by: Matt Rice <mattrice@google.com>
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Matt Rice <mattrice@google.com>
|
@htuch, I've updated the proto. PTAL. I'm holding off on making the corresponding changes to |
Signed-off-by: Matt Rice <mattrice@google.com>
|
@htuch, thanks for the review. I've updated |
test/server/http/admin_test.cc
Outdated
| MessageUtil::loadFromJson(expected_json, expected_proto); | ||
|
|
||
| // Ensure the protos created from each JSON are equivalent. | ||
| EXPECT_TRUE(Protobuf::util::MessageDifferencer::Equivalent(output_proto, expected_proto)); |
There was a problem hiding this comment.
Can you move this to
envoy/test/test_common/utility.h
Line 156 in 20c0454
envoy/test/test_common/utility.h
Line 381 in 20c0454
test/server/http/admin_test.cc
Outdated
| }, | ||
| }, | ||
| "health_status": { | ||
| "eds_health_status": "HEALTHY" |
There was a problem hiding this comment.
Maybe add some extra host statuses to this test to exercise non-EDS health check status.
Signed-off-by: Matt Rice <mattrice@google.com>
htuch
left a comment
There was a problem hiding this comment.
Looks like docs build needs fixing.
Signed-off-by: Matt Rice <mattrice@google.com>
|
@mrice32 still has merge conflict. |
Description:
Added the
/clusters?format=jsonadmin endpoint along with a proto representation of/clusters.Risk Level: Low
Testing: Added a unit test for the new format.
Docs Changes: Added a brief description on the admin docs and linked to the more detailed proto definition.
Release Notes: Added release notes.
Fixes #2020