Skip to content

config: support recovery of original message when upgrading.#9633

Merged
htuch merged 10 commits intoenvoyproxy:masterfrom
htuch:original-type-annotation
Jan 14, 2020
Merged

config: support recovery of original message when upgrading.#9633
htuch merged 10 commits intoenvoyproxy:masterfrom
htuch:original-type-annotation

Conversation

@htuch
Copy link
Member

@htuch htuch commented Jan 9, 2020

To better support config dump, deprecated field detection and other debug, it's helpful to leave a type name breadcrumb and be able to synthesize a Protobuf::Message that corresponds to what was delivered on the wire.

While working on this PR, it became apparent that config dump is broken post v3alpha, since a single config dump might have both v2 and v3 Listeners, etc. The only way to resolve this generically is to make the inner resources in config dump Any. This is a breaking API change, but these are v2alpha/v3alpha at this point, so allowed.

Risk level: Low
Testing: new version converter unit test, config dump tests now verify that the correct versioned inner resource is returned.

Fixes #9612

Signed-off-by: Harvey Tuch htuch@google.com

@htuch htuch requested review from alyssawilk and lizan January 9, 2020 14:27
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9633 was synchronize by htuch.

see: more, trace.

@htuch htuch force-pushed the original-type-annotation branch from b401d47 to 23dd04f Compare January 10, 2020 21:29
@htuch
Copy link
Member Author

htuch commented Jan 10, 2020

@lizan @mattklein123 can I get an API and design level review on this one? There's a handful of tests still broken due to config dump changes, but this is basically the approach I think is most solid.

@mattklein123 mattklein123 self-assigned this Jan 10, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a very high level this makes sense to me. Will do a full review once ready. Thank you!

/wait

To better support config dump, deprecated field detection and other
debug, it's helpful to leave a type name breadcrumb and be able to
synthesize a Protobuf::Message that corresponds to what was delivered on
the wire.

TODO: plumb this to config dump, add some tests.

Risk level: Low
Testing: new version converter unit test.

Fixes envoyproxy#9612

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch htuch force-pushed the original-type-annotation branch from 23dd04f to 0da7f83 Compare January 12, 2020 22:15
@htuch htuch marked this pull request as ready for review January 12, 2020 22:15
@htuch htuch changed the title [WiP] config: support recovery of original message when upgrading. config: support recovery of original message when upgrading. Jan 12, 2020
@htuch
Copy link
Member Author

htuch commented Jan 12, 2020

@mattklein123 ready for full review. I'll add a few more tests, but the implementation is there and passes existing tests.

@alyssawilk this should be fully complete for the deprecation checks, I've moved MessageUtil::checkForUnexpectedFields over to using the recovered original message.

htuch added 5 commits January 12, 2020 18:06
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch htuch added the api/v3 Major version release @ end of Q3 2019 label Jan 13, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this makes sense to me. Mostly flushing out some small comments and questions. If you want to merge this and do follow ups that's OK to me. Did you want @alyssawilk to take a look at this also?

Comment on lines +108 to +116
// TODO(htuch): This is a terrible hack, there should be no per-resource
// business logic in this file. The reason this is required is that
// endpoints, when captured in configuration such as inlined hosts in
// Clusters for config dump purposes, can potentially contribute a
// significant amount to memory consumption. stats_integration_test
// complains as a result if we increase any memory due to type annotations.
// In theory, we should be able to just clean up these annotations in
// ClusterManagerImpl with type erasure, but protobuf doesn't free up memory
// as expected, we probably need some arena level trick to address this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm not 100% following what the problem is and what we are doing here? It looks like this causes the prev <-> current comparison to be true? And why does this matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that tagging original type information on endpoints is really expensive memory-wise, since we may have lots of endpoints and endpoints themselves are tiny, i.e. just addresses in some cases. So, the overhead is likely unacceptable for these of supporting original message recovery. In addition, we don't (today) support config dumping of endpoints directly, so it seems to make sense to just bypass adding this overhead.

This is only there because of stats_integration_test, which complained about the extra overhead, it noted 1763 per endpoint, vs the previous bound of 1350, so ~30% extra overhead. For clusters, it's much less, since they have other overheads (you can see the updates to that test in this file).

The solution in this PR is to special case endpoints and avoid ever annotating with original type information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are probably going to have a similar problem in the future, but I guess we can figure out a better solution by then. In general this makes sense to me though I agree it's horribly hacky.

non_shadowed_message->msg_.reset(message.New());
non_shadowed_message->msg_->MergeFrom(message);
VersionUtil::scrubHiddenEnvoyDeprecated(*non_shadowed_message->msg_);
eraseOriginalTypeInformation(*non_shadowed_message->msg_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly worried about forgetting to do the various transforms in various scenarios. Are they all localized to this file? Is there any benefit in having these all be bundled into the transform/cast function with enums on what additional things to do? TIOLI just throwing it out there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think broadly the transformations depend on the macro-level operation.e.g. are we reinterpreting vs. upgrading vs. downgrading, so are mostly local. There are some exceptions you can see in this PR outside of test:

  1. Local info Node type erasure. We synthesize messages for various discovery services and include Node, which comes from the wire with original type information added during upgrades. I had in the PR that you reviewed the erasure taking place in local_info_impl.h. I think the right thing to do is make sure things like load stats reporter, HDS, etc. use VersionConverter::reinterpret, so I will fix to do this and get rid of that exception case.
  2. eds.cc uses this to scrub on a best effort for reducing per-host overhead of annotations.

Comment on lines +125 to +126
// Scrub original type information; we don't config dump endpoints today and
// this is significant memory overhead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean just adding all the recursive fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the original type information comes at a memory overhead cost, as discussed in the above thread regarding inlined endpoints.

htuch added 2 commits January 13, 2020 17:55
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added 2 commits January 13, 2020 18:11
Signed-off-by: Harvey Tuch <htuch@google.com>
Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Copy link
Member Author

htuch commented Jan 13, 2020

@mattklein123 feedback addressed, PTAL.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. I am mildly concerned about the perf impact of all this config munging, especially in the transition period from v2 -> v3 while management servers are being updated. Are you worried about an increase in fleet CPU usage during config ingestion? Is this something we should try to measure with a perf benchmark as a follow up?

Comment on lines +108 to +116
// TODO(htuch): This is a terrible hack, there should be no per-resource
// business logic in this file. The reason this is required is that
// endpoints, when captured in configuration such as inlined hosts in
// Clusters for config dump purposes, can potentially contribute a
// significant amount to memory consumption. stats_integration_test
// complains as a result if we increase any memory due to type annotations.
// In theory, we should be able to just clean up these annotations in
// ClusterManagerImpl with type erasure, but protobuf doesn't free up memory
// as expected, we probably need some arena level trick to address this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are probably going to have a similar problem in the future, but I guess we can figure out a better solution by then. In general this makes sense to me though I agree it's horribly hacky.

@htuch
Copy link
Member Author

htuch commented Jan 14, 2020

@mattklein123 yeah, I have given that some thought. One thing to note that you called out is that a lot of the munging and recovering etc. isn't that bad for v3 if Envoy is at v3, but it will be potentially bad for v2 --> v3. I'm not sure how much impact this will have TBH, but it's worth noting that we already do a bunch of serializations on the happy path, as we're unpacking from Any objects every xDS resource and from within them the filter configs again (even worse if there is Struct involved today). So, I expect this extra overhead to be bad at the margins, e.g. taking us from 3N (where N is the config size) to 4N overhead, for example, rather than from N to 10N. But, that's just a guess until we measure it.

Some potential optimizations include not doing the deprecated/unknown checks as a CLI option; if a particular deployment doesn't care about tracking these, they should be able to turn it off and get a performance boost.

We have some gRPC control plane load tests that should be able to capture this, CC @mum4k (if you see some regression here, please ping back).

I have opened #9671 to track.

@htuch htuch merged commit 233838a into envoyproxy:master Jan 14, 2020
@htuch htuch deleted the original-type-annotation branch January 14, 2020 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api/v3 Major version release @ end of Q3 2019

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track original type URL in upgraded protos

4 participants