Mpuncel/hot restarter term passthrough #2596
restarter/hot-restarter.py
should I make this configurable via a flag? My assumption was that that overly complicates this script, since it's likely it will be customized anyway
I think this is fine, we can modify this later if we need more configurability.
restarter/hot-restarter.py
is 1 second too long? If we're here because of an unusual child exit that might be a long time to block our supervisor (e.g. runit) from restarting this process
Seems reasonable to me, I don't see a big win from speeding this up given the amount of time the restart takes anyway.
restarter/hot-restarter.py
are all users of this script going to want to term here? or kill and exit for faster supervisor restarting?
@mpuncel I'm a little late here but I would probably switch this back to term_all_children() since I do think in the abnormal case we should kill as fast as possible and allow the supervisor to restart. Do you mind doing a follow up?
Feel free to let me know if the changes here aren't applicable to all situations that hot-restarter.py is meant to address. I figured I may as well offer up this PR to see if they are.
restarter/hot-restarter.py
as written, this isn't necessary because it's only called from term_all_children() which uninstalls the SIGCHLD handler, however I figured it's harmless to leave here so developers modifying the script to call force_kill_all_children() directly won't need to remember to do it
I'd just remove it, this is a short script, so we can expect folks to understand the signal state.
@danielhochman can you take a first pass and verify this from the Lyft perspective? Thanks.
works for me in theory. will take a closer pass this evening.
7783aed to 3f5faf6
rebased on master and fixed RAW_RELEASE_NOTES.md conflict
Please merge master to pick up #2613, this should fix CI TSAN.
3f5faf6 to 489a919
restarter/hot-restarter.py
Nit: prefer not to use catch-all exception handling.
are there certain exceptions that you think should be retried here? or do you want the script to exit? For the containers use case it's fairly important that the python process doesn't exit until the children have exited or else we will leak container state in tmpfs
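For reference, here is a minimal sketch of one way the handler could be narrowed without letting the script exit early. The helper name `signal_child` is hypothetical and is not taken from hot-restarter.py. It assumes the only failure worth swallowing is a child that has already exited, which `os.kill` reports as `ProcessLookupError`; whether other errors should also be caught is exactly the open question in this thread.

```python
import os
import signal


def signal_child(pid, sig=signal.SIGTERM):
    """Send sig to pid. Return True if delivered, False if the process is gone.

    Hypothetical helper for illustration only.
    """
    try:
        os.kill(pid, sig)
        return True
    except ProcessLookupError:
        # The child already exited and was reaped, so there is nothing to signal.
        return False
    # Any other exception (for example PermissionError) propagates to the
    # caller, which can then decide whether to retry or to keep waiting.
```

Catching only the "no such process" case keeps the loop over children running in the common race where a child exits on its own, while unexpected errors remain visible instead of being silently swallowed.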
restarter/hot-restarter.py
I would just write `for pid in list(pid_list)`. I have a mild concern about this being O(n^2), but it's fine given how few children exist in reality. It would probably be cleaner to model the pid_list as a set.
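To illustrate the difference, here is a small sketch of both approaches. The function names and the `exited` argument are hypothetical and are not taken from hot-restarter.py; the sketch assumes the caller already knows which PIDs have exited.

```python
def drop_exited_list(pid_list, exited):
    """List version: iterate over a copy so removal does not skip entries."""
    for pid in list(pid_list):
        if pid in exited:
            # list.remove scans the list, so this is O(n) per removal
            # and O(n^2) in the worst case overall.
            pid_list.remove(pid)
    return pid_list


def drop_exited_set(pid_set, exited):
    """Set version: each removal is O(1) on average."""
    pid_set.difference_update(exited)
    return pid_set
```

With only a handful of children either version is fast enough; the set mainly removes the need to copy the collection before mutating it.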
489a919 to cddc72b
addressed comments except for the exception handling, because I'm unsure which exceptions we don't want to catch |
htuch left a comment
Thanks, one last fix and we can ship.
This allows hot-restarter.py to be used as a parent process to a container engine (e.g. `runc`).
Prior to this change, hot-restarter.py would send a SIGKILL to all children when it receives a SIGTERM for simplicity, because it assumed that its children were envoy processes which have no state to clean up. However, when running hot-restarter.py with children that have state that should be cleaned up (e.g. container state), it's better to propagate the SIGTERM in order to allow children to exit gracefully.
SIGTERM is now handled by propagating SIGTERM to all children, and then waiting until either all children have exited gracefully or a constant timeout has been hit, at which point the children are forcibly killed as they were before.
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
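The behavior described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the function name `term_then_kill` and the `poll_interval` parameter are hypothetical, children are modeled as `subprocess.Popen` objects rather than the raw PID list the script tracks, and the 30-second default for `TERM_WAIT_SECONDS` is taken from the testing notes in the PR description.

```python
import signal
import subprocess
import sys
import time

TERM_WAIT_SECONDS = 30  # assumed default, based on the PR's testing notes


def term_then_kill(children, wait_seconds=TERM_WAIT_SECONDS, poll_interval=0.1):
    """Propagate SIGTERM, wait for a graceful exit, then SIGKILL stragglers.

    Returns True if every child exited before the timeout, False otherwise.
    """
    # Step 1: pass the TERM along to every child that is still running.
    for child in children:
        if child.poll() is None:
            child.send_signal(signal.SIGTERM)

    # Step 2: wait until all children have exited or the timeout is hit.
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if all(child.poll() is not None for child in children):
            return True
        time.sleep(poll_interval)

    # Step 3: the timeout expired, so fall back to the old SIGKILL behavior.
    for child in children:
        if child.poll() is None:
            child.kill()
    for child in children:
        child.wait()  # reap so no zombie processes are left behind
    return False
```

A child that exits on TERM makes the function return `True` almost immediately, while a child that ignores TERM is killed once `wait_seconds` has elapsed.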
cddc72b to 99d9c1d
@mpuncel Looking at what you wrote about integration test, I think we could easily write a |
PR: envoyproxy#2596 changed the behavior of the SIGTERM and SIGCHLD handlers to attempt to allow child processes to exit gracefully before force killing them. This PR reverts the behavior of the SIGCHLD handler back to force killing children if a child exits uncleanly. This should allow the supervisor of the python process (e.g. runit) to restart envoy with a shorter delay (whereas an attempt at graceful TERM might delay up to TERM_WAIT_SECONDS). Note: If the child process of hot-restarter.py is a container framework (e.g. runc), the force kill might result in container state being leaked. This should hopefully be a rare occurrence. Signed-off-by: Michael Puncel <mpuncel@squareup.com>
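As a rough sketch of the reverted SIGCHLD behavior: reap whatever has exited, and if any child exited uncleanly, force kill the rest so the supervisor can restart quickly. The function names and the definition of "unclean" (a nonzero exit status or death by signal) are assumptions for illustration. Installing this as an actual SIGCHLD handler is left out.

```python
import os
import signal


def force_kill_all(pids):
    """SIGKILL every PID in pids, skipping any that are already gone."""
    for pid in list(pids):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass


def on_child_exit(pids):
    """Reap exited children; force kill the rest if any exit was unclean.

    pids is a set of child PIDs and is updated in place.
    Returns True if a force kill was triggered.
    """
    unclean = False
    for pid in list(pids):
        try:
            done, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            pids.discard(pid)  # not our child, or already reaped elsewhere
            continue
        if done == 0:
            continue  # still running
        pids.discard(pid)
        # Assumption: "unclean" means a nonzero exit code or death by signal.
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            unclean = True
    if unclean:
        force_kill_all(pids)
    return unclean
```

The caller is still responsible for reaping the children that were force killed, for example with a blocking `os.waitpid` on each remaining PID.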
Only configure Bazel to run against remote execution if the GITHUB_TOKEN environment variable is set. This should prevent forked repos (that do not have a valid GITHUB_TOKEN) from attempting to run against the remote execution cluster. Signed-off-by: Will Martin <will@engflow.com> Signed-off-by: JP Simard <jp@jpsim.com>
Title: Enable using hot-restarter.py as a parent of containers by propagating SIGTERM to children
Description:
This PR modifies the way hot-restarter.py handles SIGTERM in order to make it suitable for running as the parent process of containers. It's important to send the SIGTERM along to the container because the envoy process isn't the direct child of hot-restarter.py, and sending SIGKILL to a container runner will leave container state on the host machine.
Risk Level:
High. If there's an issue with this PR, it could break hot restarting functionality.
Testing:
I verified this PR on a Linux machine with `hot-restarter.py` configured to launch `runc` containers with envoy processes inside them. SIGTERM properly propagated the signal to `runc`, which caused envoy and runc to exit and left no container state on the machine, verified via `sudo runc list`.
I also verified this PR on macOS by running `hot-restarter.py` with a "well-behaved" (exiting when it gets a TERM) subprocess, and confirmed that sending a SIGTERM to the `hot-restarter.py` process propagated the TERM to the child and printed the expected output.
I also ran `hot-restarter.py` with a "misbehaving" subprocess (ignores TERM and stays in an infinite loop) and verified that after 30 seconds a SIGKILL was issued, and saw the expected output.
I'm open to writing an integration test to verify all of this functionality, but could use some input from reviewers for how to accomplish this.
Docs Changes:
No doc changes. The docs already seem to claim that SIGTERM cleanly terminates children; however, that does not seem to be the case today (unless SIGKILL is considered clean).
Release Notes: