Skip to content

server: change hot restart parent<-->child protocol, remove stats shared memory#5910

Merged
mattklein123 merged 101 commits intoenvoyproxy:masterfrom
fredlas:RPC_non_recvmg_rpc
May 1, 2019
Merged

server: change hot restart parent<-->child protocol, remove stats shared memory#5910
mattklein123 merged 101 commits intoenvoyproxy:masterfrom
fredlas:RPC_non_recvmg_rpc

Conversation

@fredlas
Copy link
Contributor

@fredlas fredlas commented Feb 11, 2019

Description: Change hot restart child<-->parent communication: stats are now transferred in messages rather than shared memory, that messaging protocol was changed from packed structs to protobufs, and each process now talks on two domain sockets (one for its parent, one for child) rather than one.

This PR was reviewed in four steps:

Step 1) The hot restart logic is now split out into a parent and child side. As part of this refactor, each of those halves now binds its own Unix domain socket, where before there was just one socket used to exchange datagrams with both the client and parent (and therefore a race condition if both were actively talking to us). In addition to fixing the race condition, this brings the code into a state more friendly to introducing protobufs as discussed in #4974.

Step 2) Replace the packed struct protocol with a similar one using serialized protobufs. Still uses datagram domain sockets.

Step 3) Add stats dump to the protobuf protocol. Add logic to combine those received stat values with current values. Shift the implementation of the main stats store from shared memory to ordinary heap memory.

Step 4) Delete a bunch of obsoleted stuff: RawStatData, BlockMemoryHashSet, StatsOptions.

Risk Level: high
Docs Changes: Stripped shared memory details out of source/docs/stats.md. Stripped references to removed --max-stats flag from a few places.
Testing: added test for stat combination logic. Existing tests, in particular integration:hotrestart, still pass.
Release Notes: HOT RESTART VERSION INCREMENTED to 11. Hot restart protocol changed. Hot restart stats transfer changed from shared memory to message passing.

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented Feb 11, 2019

It would obviously not be a good idea to write more code before asking for a review. However, the immediate followup to this will also include yet more incompatible hot restart logic. Should we make any effort to ultimately have the followup be submitted in the same commit as this, to only increment hot restart version once?

@mattklein123
Copy link
Member

Should we make any effort to ultimately have the followup be submitted in the same commit as this, to only increment hot restart version once?

Yes please. I'm willing to review the larger change at once. Let me know if you want me to review this or wait for more code?

@mattklein123 mattklein123 self-assigned this Feb 11, 2019
@fredlas
Copy link
Contributor Author

fredlas commented Feb 11, 2019

That's totally up to you - it's no additional effort for me either way. The follow-up changes will re-touch pretty much all of the newly created files, so one big review would mean not reviewing the "same" lines twice. On the other hand, the current change is much simpler than the diff size suggests; much of it is just cutting+pasting existing code into new classes. Maybe it's better to understand it while it's still simple, rather than after it's mixed with the followup changes.

But, ultimately up to you!

@mattklein123
Copy link
Member

But, ultimately up to you!

Can you push them in 2 individual commits to start out with? Then I can review together and separate if I want? But we can plan on squashing it all?

@fredlas
Copy link
Contributor Author

fredlas commented Feb 11, 2019

Sure. Actually, maybe I should just start a new branch (without a PR) off of this one, where the followup changes can live, and keep it up to date with any changes to this one that you request: https://github.com/fredlas/envoy/tree/RPC_change_to_protobufs

@mattklein123
Copy link
Member

Sure. Actually, maybe I should just start a new branch (without a PR) off of this one, where the followup changes can live, and keep it up to date with any changes to this one that you request

However you want to do it is fine with me, but I think we should aim to merge as 1 PR. So maybe after we +1 this change, you can just add the next changes to this PR with an additional commit and we can then just look at diffs?

@fredlas
Copy link
Contributor Author

fredlas commented Feb 11, 2019

Yes, sounds good: once ready, I can merge that separate branch into the branch this PR is examining.

@mattklein123
Copy link
Member

@fredlas on this PR, can you do me a favor and drop some comments in on the code that actually changed vs. what was copied? That will help me efficiently review.

@fredlas
Copy link
Contributor Author

fredlas commented Feb 11, 2019

Good idea; comments added.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is awesome. Great job. One small comment, but let's move forward with the rest of the changes on this PR and I can now review them as incremental diffs? Thank you!

/wait

Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented Feb 12, 2019

:) thanks! Got it, I will keep all the changes in here.

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented Apr 30, 2019

Ah, ok. Maybe this would be a good time to verify what I think I've picked up from context: is it the case that there are few enough gauges that, when it comes to gauges, we can just skip all the memory-saving fancy stat name stuff and still be safe?

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing comments...some of these nits can be TODOs if you prefer.

fredlas added 2 commits April 30, 2019 16:19
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@mattklein123
Copy link
Member

@fredlas I will run another smoke test later, but do you have a test coming that shows the error with the previous hash map? It seems like we should have a test that covers that case?

fredlas added 3 commits April 30, 2019 16:34
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented Apr 30, 2019

It seems like the issue was that the whole ConstCharStarHashMap type is not, generally speaking, safe to use when you're getting your const char* keys from c_str(). I think the right response there might be additional comments/warnings in the ConstCharStarHashMap header.

As for this particular instance of the problem, I don't think anyone is likely to want to switch it back to ConstCharStarHashMap now that it's working with the even-better StatNameHashMap (assuming it does in fact work with your testing). So, there shouldn't really be a danger of regression there.

@mattklein123
Copy link
Member

I don't think it would be very difficult to have a test that exposed the previous issue? Basically have a few stack allocated proto messages when testing the merging code that are destroyed in between merge invocations? Seems generally useful?

Signed-off-by: Fred Douglas <fredlas@google.com>
@fredlas
Copy link
Contributor Author

fredlas commented Apr 30, 2019

Ok, got it. Enhanced the multiple imports test to do that.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another smoke test on a production box with several hot restarts and stats look good. This is ready to go from my perspective. Thank you so much @fredlas for sticking with this and your awesome work. This is a tremendous amount of tech debt gone.

@jmarantz any remaining substantial comments? We can do nits/cleanups in the planned follow ups.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for all this work. Any remaining nits can be resolved in follow-ups.

@fredlas
Copy link
Contributor Author

fredlas commented May 1, 2019

Phew, awesome! Thanks for working through this huge review burden, through all the stages and major changes. There certainly were a lot of improvements you both suggested to get here.

@fredlas
Copy link
Contributor Author

fredlas commented May 1, 2019

Going by https://github.com/envoyproxy/envoy/commits/master, the coverage test appears to be like 90% flakey recently, due to the same timeout thing as here. I'll try a /retest but maybe it still won't work.

@fredlas
Copy link
Contributor Author

fredlas commented May 1, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #5910 (comment) was created by @fredlas.

see: more, trace.

Signed-off-by: Fred Douglas <fredlas@google.com>
@mattklein123 mattklein123 merged commit b6c23c5 into envoyproxy:master May 1, 2019
@fredlas fredlas deleted the RPC_non_recvmg_rpc branch May 1, 2019 20:49
jeffpiazza-google pushed a commit to jeffpiazza-google/envoy that referenced this pull request May 3, 2019
…red memory (envoyproxy#5910)

Signed-off-by: Fred Douglas <fredlas@google.com>

Signed-off-by: Jeff Piazza <jeffpiazza@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no stalebot Disables stalebot from closing an issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants