Skip to content

Mpuncel/hot restarter term passthrough#2596

Merged
htuch merged 2 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/hot-restarter-term-passthrough
Feb 18, 2018
Merged

Mpuncel/hot restarter term passthrough#2596
htuch merged 2 commits intoenvoyproxy:masterfrom
mpuncel:mpuncel/hot-restarter-term-passthrough

Conversation

@mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Feb 13, 2018

Title: Enable using hot-restarter.py as a parent of containers by propagating SIGTERM to children

Description:
This PR modifies the way hot-restarter.py handles SIGTERM in order to make it suitable for running as the parent process of containers. It's important to send the SIGTERM along to the container because the envoy process isn't the direct child of hot-restarter.py, and sending SIGKILL to a container runner will leave container state on the host machine.

Risk Level: Low | Medium | High
High. If there's an issue with this PR, it could break hot restarting functionality.

Testing:
I verified this PR on a linux machine with hot-restarter.py configured to launch runc containers with envoy processes inside them. SIGTERM properly propagated the signal to runc which caused envoy and runc to exit, and left no container state on the machine, verified via sudo runc list.

I also verified this PR on MacOS by running hot-restarter.py with a "well-behaved" (exiting when it gets a TERM) subprocess, and confirmed that sending a SIGTERM to the hot-restarter.py process propagated the TERM to the child and printed the expected output.

I also ran hot-restarter.py with a "misbehaving" subprocess (ignores TERM and stays in an infinite loop) and verified that after 30 seconds a SIGKILL was issued, and saw the expected output.

I'm open to writing an integration test to verify all of this functionality, but could use some input from reviewers for how to accomplish this.

Docs Changes:
No doc changes, the docs actually already seem to claim that SIGTERM cleanly terminates children, however that seems to not be the actual case (unless SIGKILL is considered clean).

Release Notes:

* Added SIGTERM propagation to children to hot-restarter.py, which enables using it as a parent of containers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I make this configurable via a flag? My assumption was that that overly complicates this script, since it's likely it will be customized anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, we can modify this later if we need more configurability.

Copy link
Contributor Author

@mpuncel mpuncel Feb 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 1 second too long? If we're here because of an unusual child exit that might be a long time to block our supervisor (e.g. runit) from restarting this process

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, I don't see a big win from speeding this up given the amount of time the restart takes anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are all users of this script going to want to term here? or kill and exit for faster supervisor restarting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, @danielhochman for comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpuncel I'm a little late here but I would probably switch this back to term_all_children() since I do think in the abnormal case we should kill as fast as possible and allow the supervisor to restart. Do you mind doing a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpuncel
Copy link
Contributor Author

mpuncel commented Feb 13, 2018

Feel free to let me know if the changes here aren't applicable to all situations that hot-restarter.py is meant to address, I figured I may as well offer up this PR to see if it is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written, this isn't necessary because it's only called from term_all_children() which uninstalls the SIGCHLD handler, however I figured it's harmless to leave here so developers modifying the script to call force_kill_all_children() directly won't need to remember to do it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just remove it, this is a short script, so we can expect folks to understand the signal state.

@htuch
Copy link
Member

htuch commented Feb 13, 2018

@danielhochman can you take a first pass and verify this from the Lyft perspective? Thanks.

@danielhochman
Copy link
Contributor

works for me in theory. will take a closer pass this evening.

danielhochman
danielhochman previously approved these changes Feb 14, 2018
@mpuncel mpuncel force-pushed the mpuncel/hot-restarter-term-passthrough branch from 7783aed to 3f5faf6 Compare February 14, 2018 21:55
@mpuncel
Copy link
Contributor Author

mpuncel commented Feb 14, 2018

rebased on master and fixed RAW_RELEASE_NOTES.md conflict

@htuch
Copy link
Member

htuch commented Feb 15, 2018

Please merge master to pick up #2613, this should fix CI TSAN.

@mpuncel mpuncel force-pushed the mpuncel/hot-restarter-term-passthrough branch from 3f5faf6 to 489a919 Compare February 15, 2018 04:15
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer not catch-all exception handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there certain exceptions that you think should be retried here? or do you want the script to exit? For the containers use case it's fairly important that the python process doesn't exit until the children have exited or else we will leak container state in tmpfs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait for TERM_WAIT_SECONDS...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, we can modify this later if we need more configurability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, I don't see a big win from speeding this up given the amount of time the restart takes anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just write for pid in list(pid_list). I have a mild concern about this being O(n^2), but it's fine given how few children that exist in reality. It would probably be cleaner to model the pid_list as a set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just remove it, this is a short script, so we can expect folks to understand the signal state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, @danielhochman for comment.

@mpuncel mpuncel force-pushed the mpuncel/hot-restarter-term-passthrough branch from 489a919 to cddc72b Compare February 15, 2018 23:12
@mpuncel
Copy link
Contributor Author

mpuncel commented Feb 15, 2018

addressed comments except for the exception handling, because I'm unsure which exceptions we don't want to catch

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, one last fix and we can ship.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list(pid_list)

This allows hot-restarter.py to be used as a parent process to a
container engine (e.g. `runc`). Prior to this change, hot-restarter.py
would send a SIGKILL to all children when it receives a SIGTERM for
simplicity, because it assumed that its children were envoy processes
which have no state to clean up.

However, when running hot-restarter.py with children that have state
that should be cleaned up (e.g. container state), it's better to
propagate the SIGTERM in order to allow children to exit gracefully.

SIGTERM is now handled by propagating SIGTERM to all children, and then
waiting until either all children have exited gracefully or a constant
timeout has been hit, at which point the children are forcibly killed as
they were before.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel mpuncel force-pushed the mpuncel/hot-restarter-term-passthrough branch from cddc72b to 99d9c1d Compare February 16, 2018 21:07
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@htuch htuch merged commit 3e5f19e into envoyproxy:master Feb 18, 2018
@htuch
Copy link
Member

htuch commented Feb 18, 2018

@mpuncel Looking at what you wrote about integration test, I think we could easily write a sh_test in the style of https://github.com/envoyproxy/envoy/blob/master/test/integration/hotrestart_test.sh. If we hit any issues with this script at your end, Lyft or wider user base, I think we should definitely add this.

mpuncel added a commit to mpuncel/envoy that referenced this pull request Feb 19, 2018
PR: envoyproxy#2596 changed the behavior of the SIGTERM and SIGCHLD handlers to
attempt to allow child processes to exit gracefully before force killing
them. This PR reverts the behavior of the SIGCHLD handler back to force
killing children if a child exits uncleanly. This should allow the
supervisor of the python process (e.g. runit) to restart envoy with a
shorter delay (whereas an attempt at graceful TERM might delay up to
TERM_WAIT_SECONDS).

Note: If the child process of hot-restarter.py is a container framework
(e.g. runc), the force kill might result in container state being
leaked. This should hopefully be a rare occurrence.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
htuch pushed a commit that referenced this pull request Feb 22, 2018
#2640)

PR: #2596 changed the behavior of the SIGTERM and SIGCHLD handlers to
attempt to allow child processes to exit gracefully before force killing
them. This PR reverts the behavior of the SIGCHLD handler back to force
killing children if a child exits uncleanly. This should allow the
supervisor of the python process (e.g. runit) to restart envoy with a
shorter delay (whereas an attempt at graceful TERM might delay up to
TERM_WAIT_SECONDS).

Note: If the child process of hot-restarter.py is a container framework
(e.g. runc), the force kill might result in container state being
leaked. This should hopefully be a rare occurrence.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Only configure Bazel to run against remote execution if the GITHUB_TOKEN
environment variable is set. This should prevent forked repos (that do
not have a valid GITHUB_TOKEN) from attempting to run against the remote
execution cluster.

Signed-off-by: Will Martin <will@engflow.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Only configure Bazel to run against remote execution if the GITHUB_TOKEN
environment variable is set. This should prevent forked repos (that do
not have a valid GITHUB_TOKEN) from attempting to run against the remote
execution cluster.

Signed-off-by: Will Martin <will@engflow.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants