Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion RAW_RELEASE_NOTES.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,4 +66,4 @@ final version.
* Added `GEORADIUS_RO` and `GEORADIUSBYMEMBER_RO` to the Redis command splitter whitelist.
* Added support for trusting additional hops in the X-Forwarded-For request header.
* Added setting host header value for http health check request.

* Added SIGTERM propagation to children to hot-restarter.py, which enables using it as a parent of containers.
65 changes: 56 additions & 9 deletions restarter/hot-restarter.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,18 +5,65 @@
import sys
import time

# The number of seconds to wait for children to gracefully exit after
# propagating SIGTERM before force killing children.
# NOTE: If using a shutdown mechanism such as runit's `force-stop` which sends
# a KILL after a specified timeout period, it's important to ensure that this
# constant is smaller than the KILL timeout
TERM_WAIT_SECONDS = 30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I make this configurable via a flag? My assumption was that that overly complicates this script, since it's likely it will be customized anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, we can modify this later if we need more configurability.


restart_epoch = 0
pid_list = []

def force_kill_all_children():
""" Iterate through all known child processes and force kill them. In the future we might consider
possibly giving the child processes time to exit but this is fine for now. If someone force kills
us and does not clean the process tree this will leave child processes around unless they choose
to end themselves if their parent process dies. """
def term_all_children():
""" Iterate through all known child processes, send a TERM signal to each of
them, and then wait up to TERM_WAIT_SECONDS for them to exit gracefully,
exiting early if all children go away. If one or more children have not
exited after TERM_WAIT_SECONDS, they will be forcibly killed """

# First uninstall the SIGCHLD handler so that we don't get called again.
signal.signal(signal.SIGCHLD, signal.SIG_DFL)

global pid_list
for pid in pid_list:
print "sending TERM to PID={}".format(pid)
try:
os.kill(pid, signal.SIGTERM)
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: prefer not catch-all exception handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there certain exceptions that you think should be retried here? or do you want the script to exit? For the containers use case it's fairly important that the python process doesn't exit until the children have exited or else we will leak container state in tmpfs

print "error sending TERM to PID={} continuing".format(pid)

all_exited = False

# wait for TERM_WAIT_SECONDS seconds for children to exit cleanly
retries = 0
while not all_exited and retries < TERM_WAIT_SECONDS:
for pid in list(pid_list):
ret_pid, exit_status = os.waitpid(pid, os.WNOHANG)
if ret_pid == 0 and exit_status == 0:
# the child is still running
continue

pid_list.remove(pid)

if len(pid_list) == 0:
all_exited = True
else:
retries += 1
time.sleep(1)
Copy link
Contributor Author

@mpuncel mpuncel Feb 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 1 second too long? If we're here because of an unusual child exit that might be a long time to block our supervisor (e.g. runit) from restarting this process

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me, I don't see a big win from speeding this up given the amount of time the restart takes anyway.


if all_exited:
print "all children exited cleanly"
else:
for pid in pid_list:
print "child PID={} did not exit cleanly, killing".format(pid)
force_kill_all_children()
sys.exit(1) # error status because a child did not exit cleanly

def force_kill_all_children():
""" Iterate through all known child processes and force kill them. Typically
term_all_children() should be attempted first to give child processes an
opportunity to clean up state before exiting """

global pid_list
for pid in pid_list:
print "force killing PID={}".format(pid)
Expand All @@ -29,10 +76,10 @@ def force_kill_all_children():


def sigterm_handler(signum, frame):
""" Handler for SIGTERM. See force_kill_all_children() for further discussion. """
""" Handler for SIGTERM. See term_all_children() for further discussion. """

print "got SIGTERM"
force_kill_all_children()
term_all_children()
sys.exit(0)


Expand Down Expand Up @@ -92,8 +139,8 @@ def sigchld_handler(signum, frame):
kill_all_and_exit = True

if kill_all_and_exit:
print "Due to abnormal exit, force killing all child processes and exiting"
force_kill_all_children()
print "Due to abnormal exit, terminating all child processes and exiting"
term_all_children()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are all users of this script going to want to term here? or kill and exit for faster supervisor restarting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, @danielhochman for comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpuncel I'm a little late here but I would probably switch this back to term_all_children() since I do think in the abnormal case we should kill as fast as possible and allow the supervisor to restart. Do you mind doing a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Our last child died, so we have no purpose. Exit.
if not pid_list:
Expand Down