Skip to content

admin: add json serialized proto for /clusters#3478

Merged
htuch merged 20 commits intoenvoyproxy:masterfrom
mrice32:cluster_dump
Jun 27, 2018
Merged

admin: add json serialized proto for /clusters#3478
htuch merged 20 commits intoenvoyproxy:masterfrom
mrice32:cluster_dump

Conversation

@mrice32
Copy link
Member

@mrice32 mrice32 commented May 23, 2018

Description:
Added the /clusters?format=json admin endpoint along with a proto representation of /clusters.

Risk Level: Low

Testing: Added a unit test for the new format.

Docs Changes: Added a brief description on the admin docs and linked to the more detailed proto definition.

Release Notes: Added release notes.

Fixes #2020

Signed-off-by: Matt Rice <mattrice@google.com>
mrice32 added 4 commits May 23, 2018 18:17
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
Signed-off-by: Matt Rice <mattrice@google.com>
@mattklein123
Copy link
Member

@mrice32 this fixes #2020 so I added it to the description.

}
Http::Utility::QueryParams query_params = Http::Utility::parseQueryString(url);
auto it = query_params.find("format");
if (it != query_params.end() && it->second == "json") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of splitting this function in two to make it a little more readable?

// Admin endpoint uses this wrapper for `/clusters` to display cluster status information.
// See :ref:`/clusters <operations_admin_interface_clusters>` for more information.
message Clusters {
// Aapping from cluster name to each cluster's status.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: mapping

message ClusterStatus {
// General outlier statistics if installed for this cluster.
OutlierInfo outlier_info = 1;
// Mapping from priority to circuit breaker settings for that priority.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: I'd prefer a blank line between fields for readability.

Signed-off-by: Matt Rice <mattrice@google.com>
@mrice32
Copy link
Member Author

mrice32 commented May 24, 2018

@zuercher Thanks for the quick review! Responded to your comments. PTAL.

zuercher
zuercher previously approved these changes May 24, 2018
Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrice32, this takes the admin endpoint further in a really nice direction. I have a bunch of API-level feedback that I thought would be good to discuss.

// Details an individual cluster's current status.
message ClusterStatus {
// General outlier statistics if installed for this cluster.
OutlierInfo outlier_info = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reorder the protos so that they are in topological sort order? This make the documentation clearer and is a convention we have adopted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, you're suggesting that if one message depends on another, the dependent message should go below the dependency? Is this a new convention? I was following the example of the bootstrap proto and others, which seem to do the opposite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.

bool added_via_api = 3;

// Mapping from host address to the host's current status.
map<string, HostStatus> host_statuses = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/address.proto#L93 here? I.e. have some structure since this is proto, rather than encoding ports etc. in strings.

Copy link
Member Author

@mrice32 mrice32 May 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings up a broader question. I think I have made uncharacteristic (in the scope of Envoy) use of maps. Both messages and enums (as mentioned in your other comment below) cannot be used as key types. Should we try to avoid maps here, generally (even when the key types allow), or just remove them in these two incompatible places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes sense for the HostStatus.stats field, but I think the rest are debatable.

OutlierInfo outlier_info = 1;

// Mapping from priority to circuit breaker settings for that priority.
map<string, CircuitSettings> circuit_settings = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want

enum RoutingPriority {
?

// Cluster outlier detection statistics. -1 for any of these statistics denotes that there was not
// enough data to compute it.
message OutlierInfo {
// The average success rate of the hosts in the Detector for the last aggregation interval.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

References back to docs on this?

double success_rate_average = 1;

// The success rate threshold used in the last interval. The threshold is used to eject hosts
// based on their success rate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Circuit settings for a particular priority setting.
message CircuitSettings {
// Total TCP connection limit.
int64 max_connections = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Envoy, we prefer unsigned types for things that are actually unsigned.

map<string, int64> stats = 1;

// A string representation of the host's current health status.
string health_flags = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer proto structure rather than opaque strings here.


// Pending request limit. A request is pending if it is waiting to be attached to a connection
// pool connection.
int64 max_pending_requests = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these

? They are already available on /config_dump, so can we elide here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be consistent with the info provided in the text printout, but I'll remove as long as there are no back-compat objections.

@htuch htuch self-assigned this May 25, 2018
Signed-off-by: Matt Rice <mattrice@google.com>
mrice32 added 2 commits May 28, 2018 16:08
Signed-off-by: Matt Rice <mattrice@google.com>
@mrice32
Copy link
Member Author

mrice32 commented May 29, 2018

@htuch, updated API as suggested. PTAL, and let me know if the new structure makes sense.

// Details an individual cluster's current status.
message ClusterStatus {
// General outlier statistics if installed for this cluster.
OutlierInfo outlier_info = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.

}

// :ref:`Cluster outlier detection <arch_overview_outlier_detection>` statistics. -1 for any of
// these statistics denotes that there was not enough data to compute it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using -1 here or just omitting the field?

// Health status for a host.
message HostHealthStatus {
// The host is currently marked as healthy.
bool healthy = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this to just indicating when a host is unhealthy, and then when there are no bits set saying it's unhealthy, it's considered healthy. I wonder how this relates to https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182.. it seems we should have the ability to also indicate draining? Should we have a single health status message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the former, I thought about that. My only hesitation is from a usability perspective. The highest order bit here is whether the host is healthy or not, but that would be the hardest for the user to determine if we were to eliminate the redundant healthy bool. You could probably get a little clever by omitting this message or a submessage for a healthy host to create an artificial way to signal a healthy host, but that might be more trouble than it's worth. WDYT?

As for how this relates to the EDS health status, that is reduced to a bool here IIUC - failed_eds_health_check. We could make this bool an enum to represent the full range of EDS health states rather than just healthy or unhealthy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't have to spend too much time making these "usable", in that the APIs are intended for machine parsing first. So, I think we can drop the healthy status. The EDS health status is really a reflection of other health status possibilities. We might for example, via HDS, discover that a host is draining independent to EDS. So, I think we should be able to convey these possibilities. I would argue that we probably just want a single health status proto and it should be https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182, we should make it work if it's not expressive enough for the requiremetns here (e.g. indicating outlier detection).

//
// Note: the message will be omitted if there were not enough hosts with enough request volume to
// proceed with success rate based outlier ejection.
envoy.type.Percent success_rate_average = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be derived from the individual hosts' success_rate by the consumer of the endpoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, yes (see below for a more detailed answer). Is it worth providing it for usability?

They will be close. Technically, they may differ under two scenarios IIUC:

  1. A host has been removed since the last time the success rates were updated (they operate on a timer).
  2. I'm not 100% sure, but I think hosts that go from getting enough requests to be considered back to not having enough requests to be considered will report an outdated number for the successRate() call (this one is probably a bug since this goes against the documented behavior). Same effect if the number of hosts drops back below the threshold for calculating the cluster-wide success rate average - none of their values get updated until the cluster gets back above that threshold.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the distinction is important to the use for (1) and (2), thenI would vote to just provide it at the host-level and allow the consumer to compute.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the internally computed cluster-wide average is compared to each host's average to determine whether the host gets ejected or not, I would think the exact cluster-wide average might be important for users when debugging why a host got ejected. However, it's something that's easy to add later if someone specifically requests it, so SGTM.

uint64 weight = 4;

// Configured locality for the host.
envoy.api.v2.core.Locality locality = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, yes, there's definitely some overlap, but I worry that this will creep into us providing the entire EDS config in /clusters (the overlap isn't 100%, so we would have to leave out fields). Didn't you have a suggestion on the previous round of review that we should eliminate config details that could easily be found in /config_dump to delineate status from configuration? Given that EDS info should be available in /config_dump, should we try to remove all of that from the host portion as well? I'm not super opinionated, but I think our approach should be consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/config_dump doesn't provide EDS today :) @mattklein123 do you think we should provide EDS here or in /config_dump? I think the latter, but this PR seems related.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it should be provided in /config_dump. The reason I didn't do it is that it isn't a trivial change/design to figure out how to do it. Should it be top level? Should it be embedded within clusters? How will the config source be managed? Etc. I opened the issue to track adding it and hopefully we can discuss there when someone wants to work on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, I'll remove weight, locality, and canary from HostStatus.

envoy.api.v2.core.Address address = 1;

// Mapping from the name of the statistic to the current value.
map<string, int64> stats = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the first time we're defining in proto the stats data model?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly - see here. The only more sophisticated definition I could imagine would be differentiating between gauges and counters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think it's fine to have a simpler variant without relying on Prometheus protos. Might belong in its own message though, so we can reuse again later. Would suggest the domain to be uint64.

Copy link
Member Author

@mrice32 mrice32 Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. Where do you think this stat message should go? Under api/envoy/data/, in api/envoy/metrics/v2/stats.proto, or somewhere else?

Signed-off-by: Matt Rice <mattrice@google.com>
@stale
Copy link

stale bot commented Jun 19, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 19, 2018
@stale stale bot removed stale stalebot believes this issue/PR has not been touched recently labels Jun 19, 2018
Signed-off-by: Matt Rice <mattrice@google.com>
@mrice32
Copy link
Member Author

mrice32 commented Jun 19, 2018

@htuch, I've updated the proto. PTAL. I'm holding off on making the corresponding changes to AdminImpl - I'll do so once we finalize the proto (this is why the CI is failing).

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 yep, I'm good with this proto now, thanks for fine tuning it.

@mrice32
Copy link
Member Author

mrice32 commented Jun 22, 2018

@htuch, thanks for the review. I've updated AdminImpl and the test. PTAL. I've also still got a lingering question about where the SimpleMetric proto should live: under api/envoy/data/, in api/envoy/metrics/v2/stats.proto, or somewhere else?

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrice32 since SimpleMetric is just for admin output, I think it could live in api/envoy/admin/v2alpha/metrics.proto.

MessageUtil::loadFromJson(expected_json, expected_proto);

// Ensure the protos created from each JSON are equivalent.
EXPECT_TRUE(Protobuf::util::MessageDifferencer::Equivalent(output_proto, expected_proto));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to

static bool protoEqual(const Protobuf::Message& lhs, const Protobuf::Message& rhs) {
and then you can take advantage of
MATCHER_P(ProtoEq, rhs, "") { return TestUtility::protoEqual(arg, rhs); }
? This would be a really nice cleanup.

},
},
"health_status": {
"eds_health_status": "HEALTHY"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some extra host statuses to this test to exercise non-EDS health check status.

Signed-off-by: Matt Rice <mattrice@google.com>
htuch
htuch previously approved these changes Jun 25, 2018
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like docs build needs fixing.

Signed-off-by: Matt Rice <mattrice@google.com>
mrice32 added 2 commits June 25, 2018 17:26
@htuch
Copy link
Member

htuch commented Jun 26, 2018

@mrice32 still has merge conflict.

@htuch htuch merged commit 6460533 into envoyproxy:master Jun 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants