Skip to content

Improve Teleport reload behavior#37773

Merged
espadolini merged 5 commits intomasterfrom
espadolini/resumption-reload
Feb 13, 2024
Merged

Improve Teleport reload behavior#37773
espadolini merged 5 commits intomasterfrom
espadolini/resumption-reload

Conversation

@espadolini
Copy link
Copy Markdown
Contributor

@espadolini espadolini commented Feb 5, 2024

This PR tweaks some of the Teleport behavior around SIGHUP reloads and forking:

  • the PID file (if any) is not recreated after a host CA rotation (which spins up a new TeleportProcess in the same process)
  • the PID file (if any) is created atomically, and the process that creates it holds an exclusive advisory lock on it that only unlocks at process end (so that we can pkill -L -F <pidfile> and greatly reduce the chance that we accidentally send a signal to an unintended process
  • systemd unit files are configured to use pkill -L -F on the pidfile rather than relying on the $MAINPID as detected by systemd: this is needed because the way Teleport forks a new subprocess on HUP to upgrade itself leaves systemd with no way to accurately keep track of what the "main" Teleport process should be, with the end result being that a second systemctl reload while the first Teleport process hasn't exited yet is likely to not get directed at the new Teleport process (weirdly enough, after the second SIGHUP systemd seems to reload the PIDFile configured in the unit 🤷)
  • child Teleport processes are actively reaped on death even after exiting the WaitForSignals loop; this prevents the accumulation of zombies while Teleport is gracefully shutting down (possible when reloading more than once, which is plausible with a combination of very long lived sessions and frequent updates)
  • failing to start a new child process doesn't leave the parent process hanging for two minutes while not responding (but buffering) signals
  • exponential backoff is added to the logging of "Waiting for services: [...] to finish." and "Shutdown: waiting for N connections to finish.", so that slow graceful shutdowns don't fill the log too much

Changelog: improved the stability of Teleport during graceful upgrades

@espadolini espadolini added the no-changelog Indicates that a PR does not require a changelog entry label Feb 5, 2024
@espadolini espadolini force-pushed the espadolini/resumption-reload branch from e77a2ef to df191fb Compare February 7, 2024 16:54
@espadolini espadolini force-pushed the espadolini/resumption-reload branch 2 times, most recently from 1857176 to 128ede0 Compare February 7, 2024 17:11
@espadolini espadolini changed the title Connection resumption over agent reloading Improve Teleport reload behavior Feb 7, 2024
@espadolini espadolini marked this pull request as ready for review February 8, 2024 15:59
Comment thread lib/config/systemd.go Outdated
Comment thread lib/service/service.go Outdated
@espadolini espadolini force-pushed the espadolini/resumption-reload branch from ca93e2a to 882dc84 Compare February 9, 2024 14:18
Copy link
Copy Markdown
Contributor

@EdwardDowling EdwardDowling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me but is not a part of Teleport I am familiar with with.

@espadolini espadolini added this pull request to the merge queue Feb 13, 2024
Merged via the queue into master with commit bd47f0e Feb 13, 2024
@espadolini espadolini deleted the espadolini/resumption-reload branch February 13, 2024 15:44
@public-teleport-github-review-bot
Copy link
Copy Markdown

@espadolini See the table below for backport results.

Branch Result
branch/v15 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-changelog Indicates that a PR does not require a changelog entry size/sm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants