config: support recovery of original message when upgrading.#9633
config: support recovery of original message when upgrading.#9633htuch merged 10 commits intoenvoyproxy:masterfrom
Conversation
2bf151d to
b401d47
Compare
b401d47 to
23dd04f
Compare
|
@lizan @mattklein123 can I get an API and design level review on this one? There's a handful of tests still broken due to config dump changes, but this is basically the approach I think is most solid. |
mattklein123
left a comment
There was a problem hiding this comment.
At a very high level this makes sense to me. Will do a full review once ready. Thank you!
/wait
To better support config dump, deprecated field detection and other debug, it's helpful to leave a type name breadcrumb and be able to synthesize a Protobuf::Message that corresponds to what was delivered on the wire. TODO: plumb this to config dump, add some tests. Risk level: Low Testing: new version converter unit test. Fixes envoyproxy#9612 Signed-off-by: Harvey Tuch <htuch@google.com>
23dd04f to
0da7f83
Compare
|
@mattklein123 ready for full review. I'll add a few more tests, but the implementation is there and passes existing tests. @alyssawilk this should be fully complete for the deprecation checks, I've moved |
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
mattklein123
left a comment
There was a problem hiding this comment.
At a high level this makes sense to me. Mostly flushing out some small comments and questions. If you want to merge this and do follow ups that's OK to me. Did you want @alyssawilk to take a look at this also?
| // TODO(htuch): This is a terrible hack, there should be no per-resource | ||
| // business logic in this file. The reason this is required is that | ||
| // endpoints, when captured in configuration such as inlined hosts in | ||
| // Clusters for config dump purposes, can potentially contribute a | ||
| // significant amount to memory consumption. stats_integration_test | ||
| // complains as a result if we increase any memory due to type annotations. | ||
| // In theory, we should be able to just clean up these annotations in | ||
| // ClusterManagerImpl with type erasure, but protobuf doesn't free up memory | ||
| // as expected, we probably need some arena level trick to address this. |
There was a problem hiding this comment.
Sorry I'm not 100% following what the problem is and what we are doing here? It looks like this causes the prev <-> current comparison to be true? And why does this matter?
There was a problem hiding this comment.
The issue is that tagging original type information on endpoints is really expensive memory-wise, since we may have lots of endpoints and endpoints themselves are tiny, i.e. just addresses in some cases. So, the overhead is likely unacceptable for these of supporting original message recovery. In addition, we don't (today) support config dumping of endpoints directly, so it seems to make sense to just bypass adding this overhead.
This is only there because of stats_integration_test, which complained about the extra overhead, it noted 1763 per endpoint, vs the previous bound of 1350, so ~30% extra overhead. For clusters, it's much less, since they have other overheads (you can see the updates to that test in this file).
The solution in this PR is to special case endpoints and avoid ever annotating with original type information.
There was a problem hiding this comment.
It seems like we are probably going to have a similar problem in the future, but I guess we can figure out a better solution by then. In general this makes sense to me though I agree it's horribly hacky.
| non_shadowed_message->msg_.reset(message.New()); | ||
| non_shadowed_message->msg_->MergeFrom(message); | ||
| VersionUtil::scrubHiddenEnvoyDeprecated(*non_shadowed_message->msg_); | ||
| eraseOriginalTypeInformation(*non_shadowed_message->msg_); |
There was a problem hiding this comment.
I'm slightly worried about forgetting to do the various transforms in various scenarios. Are they all localized to this file? Is there any benefit in having these all be bundled into the transform/cast function with enums on what additional things to do? TIOLI just throwing it out there.
There was a problem hiding this comment.
I think broadly the transformations depend on the macro-level operation.e.g. are we reinterpreting vs. upgrading vs. downgrading, so are mostly local. There are some exceptions you can see in this PR outside of test:
- Local info
Nodetype erasure. We synthesize messages for various discovery services and includeNode, which comes from the wire with original type information added during upgrades. I had in the PR that you reviewed the erasure taking place inlocal_info_impl.h. I think the right thing to do is make sure things like load stats reporter, HDS, etc. useVersionConverter::reinterpret, so I will fix to do this and get rid of that exception case. eds.ccuses this to scrub on a best effort for reducing per-host overhead of annotations.
| // Scrub original type information; we don't config dump endpoints today and | ||
| // this is significant memory overhead. |
There was a problem hiding this comment.
You mean just adding all the recursive fields?
There was a problem hiding this comment.
Adding the original type information comes at a memory overhead cost, as discussed in the above thread regarding inlined endpoints.
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
|
@mattklein123 feedback addressed, PTAL. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks, LGTM. I am mildly concerned about the perf impact of all this config munging, especially in the transition period from v2 -> v3 while management servers are being updated. Are you worried about an increase in fleet CPU usage during config ingestion? Is this something we should try to measure with a perf benchmark as a follow up?
| // TODO(htuch): This is a terrible hack, there should be no per-resource | ||
| // business logic in this file. The reason this is required is that | ||
| // endpoints, when captured in configuration such as inlined hosts in | ||
| // Clusters for config dump purposes, can potentially contribute a | ||
| // significant amount to memory consumption. stats_integration_test | ||
| // complains as a result if we increase any memory due to type annotations. | ||
| // In theory, we should be able to just clean up these annotations in | ||
| // ClusterManagerImpl with type erasure, but protobuf doesn't free up memory | ||
| // as expected, we probably need some arena level trick to address this. |
There was a problem hiding this comment.
It seems like we are probably going to have a similar problem in the future, but I guess we can figure out a better solution by then. In general this makes sense to me though I agree it's horribly hacky.
|
@mattklein123 yeah, I have given that some thought. One thing to note that you called out is that a lot of the munging and recovering etc. isn't that bad for v3 if Envoy is at v3, but it will be potentially bad for v2 --> v3. I'm not sure how much impact this will have TBH, but it's worth noting that we already do a bunch of serializations on the happy path, as we're unpacking from Some potential optimizations include not doing the deprecated/unknown checks as a CLI option; if a particular deployment doesn't care about tracking these, they should be able to turn it off and get a performance boost. We have some gRPC control plane load tests that should be able to capture this, CC @mum4k (if you see some regression here, please ping back). I have opened #9671 to track. |
To better support config dump, deprecated field detection and other debug, it's helpful to leave a type name breadcrumb and be able to synthesize a Protobuf::Message that corresponds to what was delivered on the wire.
While working on this PR, it became apparent that config dump is broken post v3alpha, since a single config dump might have both v2 and v3 Listeners, etc. The only way to resolve this generically is to make the inner resources in config dump
Any. This is a breaking API change, but these are v2alpha/v3alpha at this point, so allowed.Risk level: Low
Testing: new version converter unit test, config dump tests now verify that the correct versioned inner resource is returned.
Fixes #9612
Signed-off-by: Harvey Tuch htuch@google.com