Skip to content

api: deprecate hosts in Cluster.#9663

Merged
htuch merged 9 commits intoenvoyproxy:masterfrom
htuch:deprecate-hosts
Jan 15, 2020
Merged

api: deprecate hosts in Cluster.#9663
htuch merged 9 commits intoenvoyproxy:masterfrom
htuch:deprecate-hosts

Conversation

@htuch
Copy link
Member

@htuch htuch commented Jan 13, 2020

Now that we have stable Envoy API versioning, deprecate in v2, with the
understanding that we won't ever make this a runtime controlled field.

Risk level: Low
Testing: CI

Signed-off-by: Harvey Tuch htuch@google.com

@htuch htuch requested a review from mattklein123 as a code owner January 13, 2020 00:00
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9663 was opened by htuch.

see: more, trace.

@htuch htuch requested a review from lizan January 13, 2020 00:00
@htuch htuch added the api/v3 Major version release @ end of Q3 2019 label Jan 13, 2020
@htuch
Copy link
Member Author

htuch commented Jan 13, 2020

This is a followup to #6419, CC @dio

@htuch
Copy link
Member Author

htuch commented Jan 13, 2020

Some tests are failing, but I'd like to ensure we're aligned on going ahead with this for the v3 cut before investing time into fixing.

@mattklein123
Copy link
Member

+1 let's do it.

/wait

htuch added 2 commits January 13, 2020 22:35
Now that we have stable Envoy API versioning, deprecate in v2, with the
understanding that we won't ever make this a runtime controlled field.

Risk level: Low
Testing: CI

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Member Author

htuch commented Jan 14, 2020

@mattklein123 great, should be ready for review (modulo potential futzing with the host overheads in stats_integration_test.

Signed-off-by: Harvey Tuch <htuch@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing out a few comments. This looks generally correct but IMO we need a bunch more cleanup here. I'm going to also assign integration test guru @alyssawilk to take a look. Thank you!

/wait

type: static
lb_policy: fakelbtype
hosts:
hidden_envoy_deprecated_hosts:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be testing this field like this? Either convert or use the deprecated field with a deprecated field test? (We should make sure we still have tests that cover the conversion path)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this one. There are actually a bunch of places where hosts: still appears in this file. These are never going to be deprecated tests as I don't think we plan to turn down v2 hosts until v2 disappears. So, I think we have sufficient coverage. The reason this specific test was problematic was due to this being upgraded to v3 due to an invalid LB type and then failing the config check for the wrong reason at v3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, doesn't marking it deprecated cause the fail on warn tests to fail? Or is the issue that those tests don't run through the validation path? I would expect to see a DEPRECATED_FEATURE_TEST here unless the config is not going through the validation path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would fail due to this, except it fails even earlier, because lb_policy has an invalid value (which is what the test is trying to catch).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow up shouldn't we fix all these tests even though they fail for some other reason? Mainly just for reduced confusion in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For hosts specifically, I think we are supporting this forever for v2, we're not going to disable-by-default, so it's fine to keep using this. We probably do need a bucket of work in moving the trailing example configs/docs/unit tests that still use v2 to v3, but I think that's out of scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are talking past each other here, even without disable-by-default, we still have DEPRECATED_FEATURE_TEST and a compile time options build that fails on any use of a deprecated feature, whether disabled by default or not. My point is that I would expect any use of host to fail in this build, but I think it's failing for other reasons. We should still clean up hosts in these tests otherwise it's debt that will be confusing later. cc @alyssawilk for further comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, basically I think this config would be fatal anyway, but as Matt points out, we hit the fatal condition we care about first so it's not a problem as far as gunit is concerned.
I agree with Matt that I'd prefer we clean configs up as much as possible (if for no other reason than for folks who copy-paste-edit) but I don't think it needs to be release blocking

I think in general once v3 lands, we need to do a sweep for v2 config and stop using it apart from coverage checks. We have to do the clean up eventually and doing it sooner rather than later avoids hacks like this. I think that'd resolve this concern since hosts isn't in v3 at all.


for (int i = 0; i < static_resources->clusters_size(); ++i) {
auto* cluster = static_resources->mutable_clusters(i);
if (!cluster->hosts().empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you assert here and below that the deprecated hosts are empty just to avoid any confusing oopses?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given it's deprecated, shouldn't we go with the normal way where we let folks deprecate when it goes fatal, or are we worried about mixing and matching? It's nice to give folks the grace period to migrate config and tests over.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it pretty confusing to have both around IMHO and we should try and force folks to move to load_assignment, at least for the integration tests. Part of this goes back to the original design of how we populate host ports, by doing an ordered traversal of the hosts. Some tests then want to go in and do a bunch of surgery on the load assignment, but then need to worry about clearing out hosts, which is totally unrelated, and vice versa. This is not the most declarative thing to have.

I would like to eventually move away from this model, perhaps towards registering configuration lambdas that can mutate specific entries with the late port binding, but that's a cleanup PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I guess we can special case if it's painful to support both, it's just kinda harsh to make everyone update their tests when they pull HEAD. This is a pretty special case though so your call!

// 2019/11/01 8859 1299 1315 build: switch to libc++ by default
// 2019/11/12 8998 1299 1350 test: adjust memory limit for macOS
// 2019/11/15 9040 1283 1350 build: update protobuf to 3.10.1
// 2020/01/13 9663 1619 1655 api: deprecate hosts in Cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. cc @jmarantz. Probably worth a follow up issue here to look into this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thinking about this more I'm a little confused why this change effects the per-host size? Is any of this data stored on a per-host basis?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done some analysis on this (using a pre #9633 branch to avoid any possible interference from that stuff). I have a simple benchmark script at https://github.com/htuch/envoy/blob/cla-overhead/test/common/upstream/host_overhead_test.cc that just measures the protobuf overhead difference between the two host data structures.

The results look like:

hosts structure overhead per host: 106.568
load_assignment structure overhead per host: 218.572
hosts overhead per host: 138.568
load_assignment overhead per host: 250.572

The load assignment comes at the cost of an extra 112 bytes per host due to the additional proto structure (when configured similarly to stats_integration_test). We have two obvious copies of each host being measured by the benchmark, one for the bootstrap, which is still around at the end of the test and another in the cluster's config dump copy, since these are inlined in Cluster, so that takes us to 224 bytes. The above regression is 336 bytes, which strongly suggest via interpolation that we have a third copy somewhere.

So, I think the overhead here is purely explicable by the more deeply nested structure. Not sure where we want to go in this PR or beyond. Ideas:

  • We could exclude the inlined hosts from the cluster proto copy kept for config dump, since it's somewhat redundant with /clusters. However, we want to add EDS to config dump in the future, so maybe not.
  • We probably should try and avoid including the input config boostrap in the measured part of stats_integration_test. If we do that, we 're changing our methodology, so would need to note that in the measurement history.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK I see. This test is not actually the size per host/host description, but the amortized size. Got it. Yeah probably worth filing a follow up issue about this and briefly chatting with @jmarantz before submitting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the actual product, do we keep the input protobufs around after the xDS events are processed? If not, then we should also make sure we drop them in the memory integration test. I can follow up on that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We keep copies for config dump purposes. The original bootstrap is also kept in Server::InstanceImpl. That probably isn't needed post-init, so we could in theory save memory there by removing. I'm not sure how practically useful it is, since I imagine most of the scale-up endpoint scenarios are dynamic, so you would only be making improvements for synthetic situations.

Signed-off-by: Harvey Tuch <htuch@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general LGTM but would still like @alyssawilk and @jmarantz to take a quick look.

// 2019/11/01 8859 1299 1315 build: switch to libc++ by default
// 2019/11/12 8998 1299 1350 test: adjust memory limit for macOS
// 2019/11/15 9040 1283 1350 build: update protobuf to 3.10.1
// 2020/01/13 9663 1619 1655 api: deprecate hosts in Cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In thinking about this more I'm a little confused why this change effects the per-host size? Is any of this data stored on a per-host basis?

htuch added 3 commits January 14, 2020 20:55
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks, though I would still recommend quickly checking with @jmarantz and @alyssawilk before merging. Thanks for slogging through this!

// 2019/11/01 8859 1299 1315 build: switch to libc++ by default
// 2019/11/12 8998 1299 1350 test: adjust memory limit for macOS
// 2019/11/15 9040 1283 1350 build: update protobuf to 3.10.1
// 2020/01/13 9663 1619 1655 api: deprecate hosts in Cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK I see. This test is not actually the size per host/host description, but the amortized size. Got it. Yeah probably worth filing a follow up issue about this and briefly chatting with @jmarantz before submitting.

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, major clean up, thanks!

: false),
host_map_(std::make_shared<HostMap>()) {
// TODO(dio): Remove hosts check once the hosts field is removed.
if (config.has_load_assignment() || !config.hosts().empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that the point of doing conversion was to not have to deal with the old style things in Envoy
Is this due to the special case TODO above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in this case, we don't do any conversion from hosts --> load assignment, since neither of these are used by original dst clusters. So, here we just check that both are not set.


for (int i = 0; i < static_resources->clusters_size(); ++i) {
auto* cluster = static_resources->mutable_clusters(i);
if (!cluster->hosts().empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given it's deprecated, shouldn't we go with the normal way where we let folks deprecate when it goes fatal, or are we worried about mixing and matching? It's nice to give folks the grace period to migrate config and tests over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api/v3 Major version release @ end of Q3 2019

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants