server: change hot restart parent<-->child protocol, remove stats shared memory#5910
server: change hot restart parent<-->child protocol, remove stats shared memory#5910mattklein123 merged 101 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
|
It would obviously not be a good idea to write more code before asking for a review. However, the immediate followup to this will also include yet more incompatible hot restart logic. Should we make any effort to ultimately have the followup be submitted in the same commit as this, to only increment hot restart version once? |
Yes please. I'm willing to review the larger change at once. Let me know if you want me to review this or wait for more code? |
|
That's totally up to you - it's no additional effort for me either way. The follow-up changes will re-touch pretty much all of the newly created files, so one big review would mean not reviewing the "same" lines twice. On the other hand, the current change is much simpler than the diff size suggests; much of it is just cutting+pasting existing code into new classes. Maybe it's better to understand it while it's still simple, rather than after it's mixed with the followup changes. But, ultimately up to you! |
Can you push them in 2 individual commits to start out with? Then I can review together and separate if I want? But we can plan on squashing it all? |
|
Sure. Actually, maybe I should just start a new branch (without a PR) off of this one, where the followup changes can live, and keep it up to date with any changes to this one that you request: https://github.com/fredlas/envoy/tree/RPC_change_to_protobufs |
However you want to do it is fine with me, but I think we should aim to merge as 1 PR. So maybe after we +1 this change, you can just add the next changes to this PR with an additional commit and we can then just look at diffs? |
|
Yes, sounds good: once ready, I can merge that separate branch into the branch this PR is examining. |
|
@fredlas on this PR, can you do me a favor and drop some comments in on the code that actually changed vs. what was copied? That will help me efficiently review. |
|
Good idea; comments added. |
mattklein123
left a comment
There was a problem hiding this comment.
Thanks, this is awesome. Great job. One small comment, but let's move forward with the rest of the changes on this PR and I can now review them as incremental diffs? Thank you!
/wait
Signed-off-by: Fred Douglas <fredlas@google.com>
|
:) thanks! Got it, I will keep all the changes in here. |
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
|
Ah, ok. Maybe this would be a good time to verify what I think I've picked up from context: is it the case that there are few enough gauges that, when it comes to gauges, we can just skip all the memory-saving fancy stat name stuff and still be safe? |
jmarantz
left a comment
There was a problem hiding this comment.
flushing comments...some of these nits can be TODOs if you prefer.
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
|
@fredlas I will run another smoke test later, but do you have a test coming that shows the error with the previous hash map? It seems like we should have a test that covers that case? |
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
|
It seems like the issue was that the whole ConstCharStarHashMap type is not, generally speaking, safe to use when you're getting your As for this particular instance of the problem, I don't think anyone is likely to want to switch it back to ConstCharStarHashMap now that it's working with the even-better StatNameHashMap (assuming it does in fact work with your testing). So, there shouldn't really be a danger of regression there. |
|
I don't think it would be very difficult to have a test that exposed the previous issue? Basically have a few stack allocated proto messages when testing the merging code that are destroyed in between merge invocations? Seems generally useful? |
Signed-off-by: Fred Douglas <fredlas@google.com>
|
Ok, got it. Enhanced the multiple imports test to do that. |
mattklein123
left a comment
There was a problem hiding this comment.
I did another smoke test on a production box with several hot restarts and stats look good. This is ready to go from my perspective. Thank you so much @fredlas for sticking with this and your awesome work. This is a tremendous amount of tech debt gone.
@jmarantz any remaining substantial comments? We can do nits/cleanups in the planned follow ups.
jmarantz
left a comment
There was a problem hiding this comment.
Looks great, thanks for all this work. Any remaining nits can be resolved in follow-ups.
|
Phew, awesome! Thanks for working through this huge review burden, through all the stages and major changes. There certainly were a lot of improvements you both suggested to get here. |
|
Going by https://github.com/envoyproxy/envoy/commits/master, the coverage test appears to be like 90% flakey recently, due to the same timeout thing as here. I'll try a /retest but maybe it still won't work. |
|
/retest |
|
🔨 rebuilding |
Signed-off-by: Fred Douglas <fredlas@google.com>
…red memory (envoyproxy#5910) Signed-off-by: Fred Douglas <fredlas@google.com> Signed-off-by: Jeff Piazza <jeffpiazza@google.com>
Description: Change hot restart child<-->parent communication: stats are now transferred in messages rather than shared memory, that messaging protocol was changed from packed structs to protobufs, and each process now talks on two domain sockets (one for its parent, one for child) rather than one.
This PR was reviewed in four steps:
Step 1) The hot restart logic is now split out into a parent and child side. As part of this refactor, each of those halves now binds its own Unix domain socket, where before there was just one socket used to exchange datagrams with both the client and parent (and therefore a race condition if both were actively talking to us). In addition to fixing the race condition, this brings the code into a state more friendly to introducing protobufs as discussed in #4974.
Step 2) Replace the packed struct protocol with a similar one using serialized protobufs. Still uses datagram domain sockets.
Step 3) Add stats dump to the protobuf protocol. Add logic to combine those received stat values with current values. Shift the implementation of the main stats store from shared memory to ordinary heap memory.
Step 4) Delete a bunch of obsoleted stuff: RawStatData, BlockMemoryHashSet, StatsOptions.
Risk Level: high
Docs Changes: Stripped shared memory details out of source/docs/stats.md. Stripped references to removed
--max-statsflag from a few places.Testing: added test for stat combination logic. Existing tests, in particular integration:hotrestart, still pass.
Release Notes: HOT RESTART VERSION INCREMENTED to 11. Hot restart protocol changed. Hot restart stats transfer changed from shared memory to message passing.