admin: add json serialized proto for /clusters by mrice32 · Pull Request #3478 · envoyproxy/envoy

mrice32 · 2018-05-23T21:49:17Z

Description:
Added the /clusters?format=json admin endpoint along with a proto representation of /clusters.

Risk Level: Low

Testing: Added a unit test for the new format.

Docs Changes: Added a brief description on the admin docs and linked to the more detailed proto definition.

Release Notes: Added release notes.

Fixes #2020

Signed-off-by: Matt Rice <mattrice@google.com>

mattklein123 · 2018-05-24T03:50:20Z

@mrice32 this fixes #2020 so I added it to the description.

zuercher · 2018-05-24T17:55:44Z

source/server/http/admin.cc

-        }
+  Http::Utility::QueryParams query_params = Http::Utility::parseQueryString(url);
+  auto it = query_params.find("format");
+  if (it != query_params.end() && it->second == "json") {


What do you think of splitting this function in two to make it a little more readable?
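
For illustration, a minimal sketch of the kind of split suggested here: the handler only picks the format, and each format gets its own helper. Everything in it is an assumption made for the example (the `parseQuery` helper, the two writer functions, and their placeholder output); it does not use Envoy's `Http::Utility` and is not the code that was merged.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for a query-string parser (not Envoy's Http::Utility).
using QueryParams = std::map<std::string, std::string>;

QueryParams parseQuery(const std::string& url) {
  QueryParams params;
  const auto question = url.find('?');
  if (question == std::string::npos) {
    return params;  // No query string at all.
  }
  std::istringstream stream(url.substr(question + 1));
  std::string pair;
  while (std::getline(stream, pair, '&')) {
    const auto equals = pair.find('=');
    if (equals == std::string::npos) {
      params[pair] = "";  // Key without a value.
    } else {
      params[pair.substr(0, equals)] = pair.substr(equals + 1);
    }
  }
  return params;
}

// One helper per output format. The bodies are placeholders for the example.
std::string writeClustersAsText() { return "text"; }
std::string writeClustersAsJson() { return "{}"; }

// The handler itself only decides which helper to call.
std::string handleClusters(const std::string& url) {
  const QueryParams params = parseQuery(url);
  const auto it = params.find("format");
  if (it != params.end() && it->second == "json") {
    return writeClustersAsJson();
  }
  return writeClustersAsText();
}
```

With this shape, `handleClusters("/clusters?format=json")` takes the JSON path and any other URL falls back to the text path, so each writer can be read and tested on its own.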

zuercher · 2018-05-24T18:00:01Z

api/envoy/admin/v2alpha/clusters.proto

+// Admin endpoint uses this wrapper for `/clusters` to display cluster status information.
+// See :ref:`/clusters <operations_admin_interface_clusters>` for more information.
+message Clusters {
+  // Aapping from cluster name to each cluster's status.


typo: mapping

zuercher · 2018-05-24T18:05:21Z

api/envoy/admin/v2alpha/clusters.proto

+message ClusterStatus {
+  // General outlier statistics if installed for this cluster.
+  OutlierInfo outlier_info = 1;
+  // Mapping from priority to circuit breaker settings for that priority.


style nit: I'd prefer a blank line between fields for readability.

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 · 2018-05-24T21:03:38Z

@zuercher Thanks for the quick review! Responded to your comments. PTAL.

zuercher

Looks good. Thanks!

htuch

Thanks @mrice32, this takes the admin endpoint further in a really nice direction. I have a bunch of API-level feedback that I thought would be good to discuss.

htuch · 2018-05-25T03:11:14Z

api/envoy/admin/v2alpha/clusters.proto

+// Details an individual cluster's current status.
+message ClusterStatus {
+  // General outlier statistics if installed for this cluster.
+  OutlierInfo outlier_info = 1;


Can you reorder the protos so that they are in topological sort order? This makes the documentation clearer and is a convention we have adopted.

Just to be clear, you're suggesting that if one message depends on another, the dependent message should go below the dependency? Is this a new convention? I was following the example of the bootstrap proto and others, which seem to do the opposite.

Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.

htuch · 2018-05-25T03:12:39Z

api/envoy/admin/v2alpha/clusters.proto

+  bool added_via_api = 3;
+
+  // Mapping from host address to the host's current status.
+  map<string, HostStatus> host_statuses = 4;


Do we want to use https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/address.proto#L93 here? I.e. have some structure since this is proto, rather than encoding ports etc. in strings.

This brings up a broader question. I think I have made uncharacteristic (in the scope of Envoy) use of maps. Both messages and enums (as mentioned in your other comment below) cannot be used as key types. Should we try to avoid maps here, generally (even when the key types allow), or just remove them in these two incompatible places?

I think it probably makes sense for the HostStatus.stats field, but I think the rest are debatable.

htuch · 2018-05-25T03:13:33Z

api/envoy/admin/v2alpha/clusters.proto

+  OutlierInfo outlier_info = 1;
+
+  // Mapping from priority to circuit breaker settings for that priority.
+  map<string, CircuitSettings> circuit_settings = 2;


Do you want enum RoutingPriority (envoy/api/envoy/api/v2/core/base.proto, line 114 in 49b122d)?

htuch · 2018-05-25T03:16:37Z

api/envoy/admin/v2alpha/clusters.proto

+// Cluster outlier detection statistics. -1 for any of these statistics denotes that there was not
+// enough data to compute it.
+message OutlierInfo {
+  // The average success rate of the hosts in the Detector for the last aggregation interval.


References back to docs on this?

htuch · 2018-05-25T03:17:44Z

api/envoy/admin/v2alpha/clusters.proto

+  double success_rate_average = 1;
+
+  // The success rate threshold used in the last interval. The threshold is used to eject hosts
+  // based on their success rate.


Are these percentages/ratios? If so, consider expressing with https://github.com/envoyproxy/envoy/blob/master/api/envoy/type/percent.proto.

htuch · 2018-05-25T03:18:16Z

api/envoy/admin/v2alpha/clusters.proto

+// Circuit settings for a particular priority setting.
+message CircuitSettings {
+  // Total TCP connection limit.
+  int64 max_connections = 1;


In Envoy, we prefer unsigned types for things that are actually unsigned.

htuch · 2018-05-25T03:18:56Z

api/envoy/admin/v2alpha/clusters.proto

+  map<string, int64> stats = 1;
+
+  // A string representation of the host's current health status.
+  string health_flags = 2;


Prefer proto structure rather than opaque strings here.

htuch · 2018-05-25T03:20:55Z

api/envoy/admin/v2alpha/clusters.proto

+
+  // Pending request limit. A request is pending if it is waiting to be attached to a connection
+  // pool connection.
+  int64 max_pending_requests = 2;


Aren't these message Thresholds (envoy/api/envoy/api/v2/cluster/circuit_breaker.proto, line 22 in 49b122d)? They are already available on /config_dump, so can we elide here?

I was trying to be consistent with the info provided in the text printout, but I'll remove as long as there are no back-compat objections.

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 · 2018-05-29T14:19:38Z

@htuch, updated API as suggested. PTAL, and let me know if the new structure makes sense.

htuch · 2018-05-25T14:54:00Z

api/envoy/admin/v2alpha/clusters.proto

+// Details an individual cluster's current status.
+message ClusterStatus {
+  // General outlier statistics if installed for this cluster.
+  OutlierInfo outlier_info = 1;


Actually, I think I'm wrong here, the other opposite convention seems to mostly hold.

htuch · 2018-05-31T20:56:54Z

api/envoy/admin/v2alpha/clusters.proto

+}
+
+// :ref:`Cluster outlier detection <arch_overview_outlier_detection>` statistics. -1 for any of
+// these statistics denotes that there was not enough data to compute it.


Are you using -1 here or just omitting the field?

htuch · 2018-05-31T20:59:10Z

api/envoy/admin/v2alpha/clusters.proto

+// Health status for a host.
+message HostHealthStatus {
+  // The host is currently marked as healthy.
+  bool healthy = 1;


I think we can simplify this to just indicating when a host is unhealthy, and then when there are no bits set saying it's unhealthy, it's considered healthy. I wonder how this relates to https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182. It seems we should have the ability to also indicate draining? Should we have a single health status message?

As for the former, I thought about that. My only hesitation is from a usability perspective. The highest order bit here is whether the host is healthy or not, but that would be the hardest for the user to determine if we were to eliminate the redundant healthy bool. You could probably get a little clever by omitting this message or a submessage for a healthy host to create an artificial way to signal a healthy host, but that might be more trouble than it's worth. WDYT?

As for how this relates to the EDS health status, that is reduced to a bool here IIUC - failed_eds_health_check. We could make this bool an enum to represent the full range of EDS health states rather than just healthy or unhealthy.

I think we don't have to spend too much time making these "usable", in that the APIs are intended for machine parsing first. So, I think we can drop the healthy status. The EDS health status is really a reflection of other health status possibilities. We might, for example, discover via HDS that a host is draining independently of EDS. So, I think we should be able to convey these possibilities. I would argue that we probably just want a single health status proto and it should be https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/core/health_check.proto#L182, we should make it work if it's not expressive enough for the requirements here (e.g. indicating outlier detection).
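
As a rough sketch of the "no unhealthy bits set means healthy" idea discussed above, a consumer could derive the top-level answer itself. The struct below is hypothetical: only `failed_eds_health_check` is named in this thread, the other two flag names are assumptions for the example, and none of this is the merged proto.

```cpp
// Hypothetical per-host flags; each one records a reason the host is unhealthy.
// Only failed_eds_health_check is mentioned in the review; the rest are assumed.
struct HostHealthFlags {
  bool failed_active_health_check = false;
  bool failed_outlier_check = false;
  bool failed_eds_health_check = false;
};

// A host counts as healthy exactly when no "unhealthy" flag is set, so no
// separate, redundant `healthy` field is needed.
bool isHealthy(const HostHealthFlags& flags) {
  return !flags.failed_active_health_check && !flags.failed_outlier_check &&
         !flags.failed_eds_health_check;
}
```

The trade-off raised in the thread still applies: a machine consumer gets an unambiguous model, while a human reader has to check every flag to see that a host is healthy.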

htuch · 2018-05-31T21:00:49Z

api/envoy/admin/v2alpha/clusters.proto

+  //
+  // Note: the message will be omitted if there were not enough hosts with enough request volume to
+  // proceed with success rate based outlier ejection.
+  envoy.type.Percent success_rate_average = 1;


Can this be derived from the individual hosts' success_rate by the consumer of the endpoint?

Basically, yes (see below for a more detailed answer). Is it worth providing it for usability?

They will be close. Technically, they may differ under two scenarios IIUC:

1. A host has been removed since the last time the success rates were updated (they operate on a timer).

2. I'm not 100% sure, but I think hosts that go from getting enough requests to be considered back to not having enough requests to be considered will report an outdated number for the successRate() call (this one is probably a bug since this goes against the documented behavior). Same effect if the number of hosts drops back below the threshold for calculating the cluster-wide success rate average - none of their values get updated until the cluster gets back above that threshold.

Unless the distinction is important to users in cases (1) and (2), I would vote to just provide it at the host level and allow the consumer to compute it.

Since the internally computed cluster-wide average is compared to each host's average to determine whether the host gets ejected or not, I would think the exact cluster-wide average might be important for users when debugging why a host got ejected. However, it's something that's easy to add later if someone specifically requests it, so SGTM.
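
To make the "let the consumer compute it" option concrete, here is a small sketch of averaging per-host success rates on the client side. It assumes a plain arithmetic mean over the hosts that report a value, with an absent value modelled as `std::nullopt`; the function name is hypothetical, and, as noted above, the result may differ from the average Envoy computes internally.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Returns the mean of the per-host success rates that are present, or
// std::nullopt when no host reported one. Hypothetical consumer-side helper.
std::optional<double> averageSuccessRate(
    const std::vector<std::optional<double>>& host_rates) {
  double sum = 0.0;
  std::size_t counted = 0;
  for (const auto& rate : host_rates) {
    if (rate.has_value()) {  // Skip hosts without enough request volume.
      sum += *rate;
      ++counted;
    }
  }
  if (counted == 0) {
    return std::nullopt;
  }
  return sum / static_cast<double>(counted);
}
```

For hosts reporting 90.0 and 100.0 plus one host with no value, this yields 95.0; with no reporting hosts it yields no value rather than a sentinel such as -1.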

htuch · 2018-05-31T21:01:58Z

api/envoy/admin/v2alpha/clusters.proto

+  uint64 weight = 4;
+
+  // Configured locality for the host.
+  envoy.api.v2.core.Locality locality = 5;


Can we reuse https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/endpoint/endpoint.proto#L46 and related structures? Seems like a reasonable overlap.

Hmmmm, yes, there's definitely some overlap, but I worry that this will creep into us providing the entire EDS config in /clusters (the overlap isn't 100%, so we would have to leave out fields). Didn't you have a suggestion on the previous round of review that we should eliminate config details that could easily be found in /config_dump to delineate status from configuration? Given that EDS info should be available in /config_dump, should we try to remove all of that from the host portion as well? I'm not super opinionated, but I think our approach should be consistent.

/config_dump doesn't provide EDS today :) @mattklein123 do you think we should provide EDS here or in /config_dump? I think the latter, but this PR seems related.

Yeah I think it should be provided in /config_dump. The reason I didn't do it is that it isn't a trivial change/design to figure out how to do it. Should it be top level? Should it be embedded within clusters? How will the config source be managed? Etc. I opened the issue to track adding it and hopefully we can discuss there when someone wants to work on it.

SGTM, I'll remove weight, locality, and canary from HostStatus.

htuch · 2018-05-31T21:02:42Z

api/envoy/admin/v2alpha/clusters.proto

+  envoy.api.v2.core.Address address = 1;
+
+  // Mapping from the name of the statistic to the current value.
+  map<string, int64> stats = 2;


Is this the first time we're defining in proto the stats data model?

Not exactly - see here. The only more sophisticated definition I could imagine would be differentiating between gauges and counters.

OK, I think it's fine to have a simpler variant without relying on Prometheus protos. Might belong in its own message though, so we can reuse again later. Would suggest the domain to be uint64.

SGTM. Where do you think this stat message should go? Under api/envoy/data/, in api/envoy/metrics/v2/stats.proto, or somewhere else?

Signed-off-by: Matt Rice <mattrice@google.com>

stale · 2018-06-19T00:10:59Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 · 2018-06-19T23:55:21Z

@htuch, I've updated the proto. PTAL. I'm holding off on making the corresponding changes to AdminImpl - I'll do so once we finalize the proto (this is why the CI is failing).

htuch

@mrice32 yep, I'm good with this proto now, thanks for fine tuning it.

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 · 2018-06-22T02:00:15Z

@htuch, thanks for the review. I've updated AdminImpl and the test. PTAL. I've also still got a lingering question about where the SimpleMetric proto should live: under api/envoy/data/, in api/envoy/metrics/v2/stats.proto, or somewhere else?

htuch

@mrice32 since SimpleMetric is just for admin output, I think it could live in api/envoy/admin/v2alpha/metrics.proto.

htuch · 2018-06-25T00:31:53Z

test/server/http/admin_test.cc

+  MessageUtil::loadFromJson(expected_json, expected_proto);
+
+  // Ensure the protos created from each JSON are equivalent.
+  EXPECT_TRUE(Protobuf::util::MessageDifferencer::Equivalent(output_proto, expected_proto));


Can you move this to protoEqual and then take advantage of the ProtoEq matcher? Both are in envoy/test/test_common/utility.h at 20c0454.

Line 156:

static bool protoEqual(const Protobuf::Message& lhs, const Protobuf::Message& rhs) {

Line 381:

MATCHER_P(ProtoEq, rhs, "") { return TestUtility::protoEqual(arg, rhs); }

This would be a really nice cleanup.

htuch · 2018-06-25T00:35:12Z

test/server/http/admin_test.cc

+      },
+     },
+     "health_status": {
+      "eds_health_status": "HEALTHY"


Maybe add some extra host statuses to this test to exercise non-EDS health check status.

Signed-off-by: Matt Rice <mattrice@google.com>

htuch

Looks like docs build needs fixing.

Signed-off-by: Matt Rice <mattrice@google.com>

htuch · 2018-06-26T01:35:19Z

@mrice32 still has merge conflict.

Add /clusters?format=json.

ec9e498

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 force-pushed the cluster_dump branch from 1e7cfe1 to ec9e498 May 23, 2018 21:53

mrice32 added 4 commits May 23, 2018 18:17

Added new proto to docs build file.

e7b9bd8

Signed-off-by: Matt Rice <mattrice@google.com>

Merge branch 'master' into cluster_dump

94ac601

Add clusters proto to docs build proto list

b0c76b2

Signed-off-by: Matt Rice <mattrice@google.com>

More doc fixes

cf51d8b

Signed-off-by: Matt Rice <mattrice@google.com>

zuercher reviewed May 24, 2018

Comments.

cf1ac78

Signed-off-by: Matt Rice <mattrice@google.com>

zuercher previously approved these changes May 24, 2018

htuch suggested changes May 25, 2018

htuch self-assigned this May 25, 2018

Responded to API comments.

4a72fd5

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 dismissed zuercher’s stale review via 4a72fd5 May 28, 2018 20:07

mrice32 added 2 commits May 28, 2018 16:08

Merge branch 'master' into cluster_dump

0a035f3

Add doc ref

63a67f6

Signed-off-by: Matt Rice <mattrice@google.com>

htuch suggested changes May 31, 2018

Fixed typo.

a5ce4cc

Signed-off-by: Matt Rice <mattrice@google.com>

mattklein123 mentioned this pull request Jun 1, 2018

health check: structured active healthcheck logging #3176

Merged

stale bot added the stale label Jun 19, 2018

Merge branch 'master' into cluster_dump

7e1133e

stale bot removed the stale label Jun 19, 2018

updated proto

b60e1b1

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 force-pushed the cluster_dump branch from 00e032c to b60e1b1 June 19, 2018 23:52

htuch reviewed Jun 21, 2018

mrice32 added 2 commits June 21, 2018 21:56

Updated AdminImpl and the test.

e57410f

Signed-off-by: Matt Rice <mattrice@google.com>

Merge branch 'master' into cluster_dump

0989b2e

htuch suggested changes Jun 25, 2018

Responded to comments

8a5e317

Signed-off-by: Matt Rice <mattrice@google.com>

htuch previously approved these changes Jun 25, 2018

Fix docs

35717cc

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 dismissed htuch’s stale review via 35717cc June 25, 2018 21:22

mrice32 added 2 commits June 25, 2018 17:26

Merge branch 'master' into cluster_dump

d9a259d

Next attempt at fixing docs

f32ad14

Signed-off-by: Matt Rice <mattrice@google.com>

mrice32 added 2 commits June 25, 2018 21:58

Merge branch 'master' into cluster_dump

a82839a

Merge branch 'master' into cluster_dump

94a23cb

htuch approved these changes Jun 26, 2018

htuch merged commit 6460533 into envoyproxy:master Jun 27, 2018
