Skip to content

Conversation

@twz123
Copy link
Member

@twz123 twz123 commented Jul 24, 2025

Description

Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a request-shutdown -> wait -> kill -> wait logic. This is to ensure that k0s won't hang when the supervised processes can't be terminated for whichever reason: the code will terminate, at least after all the timeouts expired.

Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore. Introduce fine-grained error reporting to differentiate shutdown outcomes:

  • The process exited unexpectedly, with or without nonzero exit code
  • The process exited after requesting shutdown, with or without nonzero exit code
  • Requesting shutdown or killing failed
  • The process was killed because it didn't exit in time
  • Something went wrong while waiting for the process to exit

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

twz123 added 3 commits July 24, 2025 08:38
There are three error conditions:

1. The process exited with a zero exit code. (err == nil)
2. The process exited with a nonzero exit code. (err is an ExitErr)
3. Something else went wrong, like I/O, rare syscall problems ...

Distinguish those cases for the logs, and log all of them at the error
level, as those are really abnormal situations that shouldn't happen
under normal circumstances.

Signed-off-by: Tom Wieczorek <[email protected]>
Previously, the shutdown code looped endlessly until the child process
finished, requesting graceful termination over and over again. Change
this to a request-shutdown -> wait -> kill -> wait logic. This is
to ensure that k0s won't hang when the supervised processes can't be
terminated for whichever reason: the code will terminate, at least after
all the timeouts expired.

Use a buffered channel for the wait result, so that the goroutine
will be able to exit, even if nothing reads from the channel anymore.
Introduce fine-grained error reporting to differentiate shutdown
outcomes (graceful shutdown, forced kill, failure, and so on).

Signed-off-by: Tom Wieczorek <[email protected]>
@twz123 twz123 added the enhancement New feature or request label Jul 24, 2025
@twz123 twz123 marked this pull request as ready for review July 24, 2025 13:27
@twz123 twz123 requested review from a team as code owners July 24, 2025 13:27
@twz123 twz123 requested review from juanluisvaladas and kke July 24, 2025 13:27
@ncopa
Copy link
Collaborator

ncopa commented Aug 11, 2025

There are three error conditions:

  1. The process exited with a zero exit code. (err == nil)
  2. The process exited with a nonzero exit code. (err is an ExitErr)
  3. Something else went wrong, like I/O, rare syscall problems ...

In my head there are two, from OS perspective:

  1. The process exited (with numeric exit code)
  2. Something else went wrong, like I/O, rare syscall problems ...

I would personally appreciate more an error message saying: process exited: exit code 0 rather than Process finished prematurely. The former leaves no doubt what happened, the latter would make me think "Yeah? But what is the exit code?"

So IMHO the second commit (Cleanly distinguish Supervisor Cmd.Wait errors) adds more lines of code but makes logs less clear. The Go API for collecting the exit status via Error is clumsy so I think we better use ProcessState string which does exactly what I would expect.

I'm ok to change warning to error though.

@ncopa
Copy link
Collaborator

ncopa commented Aug 11, 2025

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

I am thinking that if supervised process does not cleanly shut down, something is broken and may need attention, and this may hide when that happens (but may result in corrupt database/state and similar)

I am worried about slow systems where it may take longer than 30 sec to cleanly shut down things (think riscv64).

@ncopa
Copy link
Collaborator

ncopa commented Aug 14, 2025

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

Talked with @twz123 and apparently this has happened on windows where containerd didn't gracefully shut down.

May be a good idea to avoid do kill -9, including when we do pidfile cleanup. We can either block k0s shutdown or timeout and leave the unkilled supervised process running.

@twz123 twz123 marked this pull request as draft August 18, 2025 14:40
@twz123
Copy link
Member Author

twz123 commented Aug 20, 2025

In my head there are two, from OS perspective:

  1. The process exited (with numeric exit code)
  2. Something else went wrong, like I/O, rare syscall problems ...

[...] The Go API for collecting the exit status via Error is clumsy so I think we better use ProcessState string which does exactly what I would expect.

You're absolutely right. The questionable API really bullied me into making that artificial distinction, without really realizing that it's kinda pointless. What bothered me about the initial logging was actually the claim "Failed to wait for process" while there was actually no problem at all with waiting; it was just a non-zero exit code.

@twz123
Copy link
Member Author

twz123 commented Aug 20, 2025

Under what circumstance would the supervised process not exit cleanly? Do we have reported issues for this?

I've faced this while working on the Windows support. There's actually a bug in containerd on Windows, which makes it not respond to console control events (the things that have the closest resemblance to SIGINT and friends on Linux).

This bug made k0s hang indefinitely during its shutdown.

We can either block k0s shutdown or timeout and leave the unkilled supervised process running.

This is what I'm aiming for now. Since the PID file cleanup is in place, we can choose to shutdown after a timeout and log an error / return a non-zero exit code. When k0s gets restarted, we can do the same: try to terminate the program and then bail out after a timeout. This leaves the decision of what to do with the hanging process up to the users without having to kill k0s itself.

Right now, k0s will kill the process referenced by the PID file after the timeout. That needs to change, too.

@twz123
Copy link
Member Author

twz123 commented Aug 20, 2025

Closing in favor of #6311 and #6312.

@twz123 twz123 closed this Aug 20, 2025
@twz123 twz123 deleted the supervisor-kill-after-timeout branch August 20, 2025 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants