api: deprecate hosts in Cluster. #9663
Conversation
Some tests are failing, but I'd like to ensure we're aligned on going ahead with this for the v3 cut before investing time into fixing.
+1 let's do it. /wait
Now that we have stable Envoy API versioning, deprecate in v2, with the understanding that we won't ever make this a runtime controlled field.

Risk level: Low
Testing: CI

Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Force-pushed d5bf890 to f349baa
@mattklein123 great, should be ready for review (modulo potential futzing with the host overheads in …)
Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123 left a comment:
Flushing out a few comments. This looks generally correct, but IMO we need a bunch more cleanup here. I'm also going to assign integration test guru @alyssawilk to take a look. Thank you!
/wait
  type: static
  lb_policy: fakelbtype
- hosts:
+ hidden_envoy_deprecated_hosts:
I don't think we should be testing this field like this? Either convert or use the deprecated field with a deprecated field test? (We should make sure we still have tests that cover the conversion path)
I fixed this one. There are actually a bunch of places where hosts: still appears in this file. These are never going to be deprecated-field tests, as I don't think we plan to turn down v2 hosts until v2 disappears, so I think we have sufficient coverage. The reason this specific test was problematic was that it gets upgraded to v3 with an invalid LB type and then fails the config check for the wrong reason at v3.
I'm confused, doesn't marking it deprecated cause the fail on warn tests to fail? Or is the issue that those tests don't run through the validation path? I would expect to see a DEPRECATED_FEATURE_TEST here unless the config is not going through the validation path?
It would fail due to this, except it fails even earlier, because lb_policy has an invalid value (which is what the test is trying to catch).
As a follow up shouldn't we fix all these tests even though they fail for some other reason? Mainly just for reduced confusion in the future?
For hosts specifically, I think we are supporting this forever for v2, we're not going to disable-by-default, so it's fine to keep using this. We probably do need a bucket of work in moving the trailing example configs/docs/unit tests that still use v2 to v3, but I think that's out of scope for this PR.
I think we are talking past each other here. Even without disable-by-default, we still have DEPRECATED_FEATURE_TEST and a compile-time-options build that fails on any use of a deprecated feature, whether disabled by default or not. My point is that I would expect any use of hosts to fail in this build, but I think it's failing for other reasons. We should still clean up hosts in these tests, otherwise it's debt that will be confusing later. cc @alyssawilk for further comment.
Yeah, basically I think this config would be fatal anyway, but as Matt points out, we hit the fatal condition we care about first so it's not a problem as far as gunit is concerned.
I agree with Matt that I'd prefer we clean configs up as much as possible (if for no other reason than for folks who copy-paste-edit) but I don't think it needs to be release blocking
I think in general once v3 lands, we need to do a sweep for v2 config and stop using it apart from coverage checks. We have to do the clean up eventually and doing it sooner rather than later avoids hacks like this. I think that'd resolve this concern since hosts isn't in v3 at all.
for (int i = 0; i < static_resources->clusters_size(); ++i) {
  auto* cluster = static_resources->mutable_clusters(i);
  if (!cluster->hosts().empty()) {
Could you assert here and below that the deprecated hosts are empty just to avoid any confusing oopses?
Given it's deprecated, shouldn't we go with the normal way, where we let folks keep using the deprecated field until it goes fatal? Or are we worried about mixing and matching? It's nice to give folks the grace period to migrate config and tests over.
I found it pretty confusing to have both around IMHO and we should try and force folks to move to load_assignment, at least for the integration tests. Part of this goes back to the original design of how we populate host ports, by doing an ordered traversal of the hosts. Some tests then want to go in and do a bunch of surgery on the load assignment, but then need to worry about clearing out hosts, which is totally unrelated, and vice versa. This is not the most declarative thing to have.
I would like to eventually move away from this model, perhaps towards registering configuration lambdas that can mutate specific entries with the late port binding, but that's a cleanup PR.
Got it. I guess we can special case if it's painful to support both, it's just kinda harsh to make everyone update their tests when they pull HEAD. This is a pretty special case though so your call!
// 2019/11/01 8859 1299 1315 build: switch to libc++ by default
// 2019/11/12 8998 1299 1350 test: adjust memory limit for macOS
// 2019/11/15 9040 1283 1350 build: update protobuf to 3.10.1
// 2020/01/13 9663 1619 1655 api: deprecate hosts in Cluster.
Yikes. cc @jmarantz. Probably worth a follow up issue here to look into this.
In thinking about this more, I'm a little confused why this change affects the per-host size? Is any of this data stored on a per-host basis?
I have done some analysis on this (using a pre #9633 branch to avoid any possible interference from that stuff). I have a simple benchmark script at https://github.com/htuch/envoy/blob/cla-overhead/test/common/upstream/host_overhead_test.cc that just measures the protobuf overhead difference between the two host data structures.
The results look like:
hosts structure overhead per host: 106.568
load_assignment structure overhead per host: 218.572
hosts overhead per host: 138.568
load_assignment overhead per host: 250.572
The load assignment comes at the cost of an extra 112 bytes per host due to the additional proto structure (when configured similarly to stats_integration_test). We have two obvious copies of each host being measured by the benchmark: one for the bootstrap, which is still around at the end of the test, and another in the cluster's config dump copy, since these are inlined in Cluster, so that takes us to 224 bytes. The above regression is 336 bytes, which strongly suggests via interpolation that we have a third copy somewhere.
So, I think the overhead here is purely explicable by the more deeply nested structure. Not sure where we want to go in this PR or beyond. Ideas:
- We could exclude the inlined hosts from the cluster proto copy kept for config dump, since it's somewhat redundant with /clusters. However, we want to add EDS to config dump in the future, so maybe not.
- We probably should try and avoid including the input config bootstrap in the measured part of stats_integration_test. If we do that, we're changing our methodology, so would need to note that in the measurement history.
Ah OK I see. This test is not actually the size per host/host description, but the amortized size. Got it. Yeah probably worth filing a follow up issue about this and briefly chatting with @jmarantz before submitting.
In the actual product, do we keep the input protobufs around after the xDS events are processed? If not, then we should also make sure we drop them in the memory integration test. I can follow up on that.
We keep copies for config dump purposes. The original bootstrap is also kept in Server::InstanceImpl. That probably isn't needed post-init, so we could in theory save memory there by removing. I'm not sure how practically useful it is, since I imagine most of the scale-up endpoint scenarios are dynamic, so you would only be making improvements for synthetic situations.
Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123 left a comment:
Thanks, in general LGTM, but would still like @alyssawilk and @jmarantz to take a quick look.
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123 left a comment:
LGTM, thanks, though I would still recommend quickly checking with @jmarantz and @alyssawilk before merging. Thanks for slogging through this!
alyssawilk left a comment:
Wow, major clean up, thanks!
      : false),
        host_map_(std::make_shared<HostMap>()) {
  // TODO(dio): Remove hosts check once the hosts field is removed.
  if (config.has_load_assignment() || !config.hosts().empty()) {
I thought that the point of doing conversion was to not have to deal with the old style things in Envoy
Is this due to the special case TODO above?
So, in this case, we don't do any conversion from hosts to load_assignment, since neither of these is used by original dst clusters. So here we just check that neither is set.