Kill supervised process after stop timeout #6203

twz123 · 2025-07-24T12:50:18Z

Description

Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a request-shutdown -> wait -> kill -> wait logic. This is to ensure that k0s won't hang when the supervised processes can't be terminated for whichever reason: the code will terminate, at least after all the timeouts expired.

Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore. Introduce fine-grained error reporting to differentiate shutdown outcomes:

The process exited unexpectedly, with or without nonzero exit code
The process exited after requesting shutdown, with or without nonzero exit code
Requesting shutdown or killing failed
The process was killed because it didn't exit in time
Something went wrong while waiting for the process to exit

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

How Has This Been Tested?

Manual test
Auto test added

Checklist

Signed-off-by: Tom Wieczorek <[email protected]>

There are three error conditions: 1. The process exited with a zero exit code. (err == nil) 2. The process exited with a nonzero exit code. (err is an ExitErr) 3. Something else went wrong, like I/O, rare syscall problems ... Distinguish those cases for the logs, and log all of them at the error level, as those are really abnormal situations that shouldn't happen under normal circumstances. Signed-off-by: Tom Wieczorek <[email protected]>

Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a request-shutdown -> wait -> kill -> wait logic. This is to ensure that k0s won't hang when the supervised processes can't be terminated for whichever reason: the code will terminate, at least after all the timeouts expired. Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore. Introduce fine-grained error reporting to differentiate shutdown outcomes (graceful shutdown, forced kill, failure, and so on). Signed-off-by: Tom Wieczorek <[email protected]>

ncopa · 2025-08-11T12:30:27Z

There are three error conditions:

The process exited with a zero exit code. (err == nil)

The process exited with a nonzero exit code. (err is an ExitErr)

Something else went wrong, like I/O, rare syscall problems ...

In my head there are two, from OS perspective:

The process exited (with numeric exit code)
Something else went wrong, like I/O, rare syscall problems ...

I would personally appreciate more an error message saying: process exited: exit code 0 rather than Process finished prematurely. The former leaves no doubt what happened, the latter would make me think "Yeah? But what is the exit code?"

So IMHO the second commit (Cleanly distinguish Supervisor Cmd.Wait errors) adds more lines of code but makes logs less clear. The Go API for collecting the exit status via Error is clumsy so I think we better use ProcessState string which does exactly what I would expect.

I'm ok to change warning to error though.

ncopa · 2025-08-11T12:43:14Z

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

I am thinking that if supervised process does not cleanly shut down, something is broken and may need attention, and this may hide when that happens (but may result in corrupt database/state and similar)

I am worried about slow systems where it may take longer than 30 sec to cleanly shut down things (think riscv64).

ncopa · 2025-08-14T12:38:28Z

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

Talked with @twz123 and apparently this has happened on windows where containerd didn't gracefully shut down.

May be a good idea to avoid do kill -9, including when we do pidfile cleanup. We can either block k0s shutdown or timeout and leave the unkilled supervised process running.

twz123 · 2025-08-20T15:25:26Z

In my head there are two, from OS perspective:

The process exited (with numeric exit code)

Something else went wrong, like I/O, rare syscall problems ...

[...] The Go API for collecting the exit status via Error is clumsy so I think we better use ProcessState string which does exactly what I would expect.

You're absolutely right. The questionable API really bullied me into making that artificial distinction, without really realizing that it's kinda pointless. What bothered me about the initial logging was actually the claim "Failed to wait for process" while there was actually no problem at all with waiting; it was just a non-zero exit code.

twz123 · 2025-08-20T15:44:00Z

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

I've faced this while working on the Windows support. There's actually a bug in containerd on Windows, which makes it not respond to console control events (the things that have the closest resemblance to SIGINT and friends on Linux).

This bug made k0s hang indefinitely during its shutdown.

We can either block k0s shutdown or timeout and leave the unkilled supervised process running.

This is what I'm aiming for now. Since the PID file cleanup is in place, we can choose to shutdown after a timeout and log an error / return a non-zero exit code. When k0s gets restarted, we can do the same: try to terminate the program and then bail out after a timeout. This leaves the decision of what to do with the hanging process up to the users without having to kill k0s itself.

Right now, k0s will kill the process referenced by the PID file after the timeout. That needs to change, too.

twz123 · 2025-08-20T16:38:35Z

Closing in favor of #6311 and #6312.

twz123 added 3 commits July 24, 2025 08:38

Pass command as parameter into processWaitQuit

4edd15e

Signed-off-by: Tom Wieczorek <[email protected]>

twz123 added the enhancement New feature or request label Jul 24, 2025

twz123 marked this pull request as ready for review July 24, 2025 13:27

twz123 requested review from a team as code owners July 24, 2025 13:27

twz123 requested review from juanluisvaladas and kke July 24, 2025 13:27

twz123 marked this pull request as draft August 18, 2025 14:40

This was referenced Aug 20, 2025

Don't kill processes when cleaning up PID files #6311

Merged

Bail out after stop timeout when stopping supervised processes #6312

Draft

twz123 closed this Aug 20, 2025

twz123 deleted the supervisor-kill-after-timeout branch August 20, 2025 16:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Kill supervised process after stop timeout #6203

Kill supervised process after stop timeout #6203

Uh oh!

twz123 commented Jul 24, 2025

Uh oh!

ncopa commented Aug 11, 2025

Uh oh!

ncopa commented Aug 11, 2025

Uh oh!

ncopa commented Aug 14, 2025

Uh oh!

twz123 commented Aug 20, 2025 •

edited

Loading

Uh oh!

twz123 commented Aug 20, 2025

Uh oh!

twz123 commented Aug 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kill supervised process after stop timeout #6203

Kill supervised process after stop timeout #6203

Uh oh!

Conversation

twz123 commented Jul 24, 2025

Description

Type of change

How Has This Been Tested?

Checklist

Uh oh!

ncopa commented Aug 11, 2025

Uh oh!

ncopa commented Aug 11, 2025

Uh oh!

ncopa commented Aug 14, 2025

Uh oh!

twz123 commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

twz123 commented Aug 20, 2025

Uh oh!

twz123 commented Aug 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

twz123 commented Aug 20, 2025 •

edited

Loading