Kill supervised process after stop timeout #6203
Signed-off-by: Tom Wieczorek <[email protected]>
There are three error conditions:

1. The process exited with a zero exit code. (err == nil)
2. The process exited with a nonzero exit code. (err is an ExitErr)
3. Something else went wrong, like I/O, rare syscall problems ...

Distinguish those cases for the logs, and log all of them at the error level, as those are really abnormal situations that shouldn't happen under normal circumstances.

Signed-off-by: Tom Wieczorek <[email protected]>
Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a request-shutdown -> wait -> kill -> wait logic. This is to ensure that k0s won't hang when the supervised processes can't be terminated for whichever reason: the code will terminate, at least after all the timeouts expired.

Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore.

Introduce fine-grained error reporting to differentiate shutdown outcomes (graceful shutdown, forced kill, failure, and so on).

Signed-off-by: Tom Wieczorek <[email protected]>
In my head there are two, from OS perspective:
I would personally appreciate more an error message saying: So IMHO the second commit (Cleanly distinguish Supervisor Cmd.Wait errors) adds more lines of code but makes logs less clear. The Go API for collecting the exit status via Error is clumsy so I think we better use I'm ok to change warning to error though.
Under what circumstances would the supervised process not exit cleanly? Do we have reported issues for this? My thinking is that if a supervised process does not shut down cleanly, something is broken and may need attention, and this change may hide when that happens (and may result in a corrupt database/state and similar). I am also worried about slow systems where it may take longer than 30 sec to shut things down cleanly (think riscv64).
Talked with @twz123 and apparently this has happened on Windows where containerd didn't gracefully shut down. May be a good idea to avoid do
You're absolutely right. The questionable API really bullied me into making that artificial distinction, without really realizing that it's kinda pointless. What bothered me about the initial logging was actually the claim "Failed to wait for process" while there was actually no problem at all with waiting; it was just a non-zero exit code.
I've faced this while working on the Windows support. There's actually a bug in containerd on Windows, which makes it not respond to console control events (the things that have the closest resemblance to SIGINT and friends on Linux). This bug made k0s hang indefinitely during its shutdown.
This is what I'm aiming for now. Since the PID file cleanup is in place, we can choose to shut down after a timeout and log an error / return a non-zero exit code. When k0s gets restarted, we can do the same: try to terminate the program and then bail out after a timeout. This leaves the decision of what to do with the hanging process up to the users without having to kill k0s itself. Right now, k0s will kill the process referenced by the PID file after the timeout. That needs to change, too.
Description
Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a request-shutdown -> wait -> kill -> wait logic. This is to ensure that k0s won't hang when the supervised processes can't be terminated for whichever reason: the code will terminate, at least after all the timeouts expired.
Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore. Introduce fine-grained error reporting to differentiate shutdown outcomes (graceful shutdown, forced kill, failure, and so on).