Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry sending signals to pid_file_name before logging errors #252

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ringerc
Copy link
Contributor

@ringerc ringerc commented Feb 2, 2025

There is a race where the process pointed to by pid_file_name could be restarted and read the old certificates but not have its new pid written into pid_file_name yet. In this case, we would try to signal the old process and fail, then give up, so the new process would never get its certs refreshed.

On initial startup, unavoidable errors may also be logged when the process to be launched has not yet been started (because its certs don't exist yet) so pid_file_name is missing or invalid. These shouldn't be logged as errors, but we don't really want to suppress all errors for all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling pid_file_name. On failure, messages are logged at Info level until the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for the creation of the certificates, and to launch the controlled process and create the pid file in a timely manner once the certs appear. So it becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale pid in the pid file, so we will retry until the new pid is written, so the new process that loaded stale certs will be properly signalled to reload.

Fixes: #250, #251

I added an extensive test suite as part of this, covering command execution, signal handling, etc. If you want some of that split out from this PR then this PR rebased on top I can probably do that.

I also fixed Windows builds and ensured the existing and added tests run on Windows - or at least, under wine when cross-compiled

I strongly suggest merging this via fast-forward merge or with a merge commit, not a squash merge. Otherwise the separation of commits will be lost and it'll be a pain to understand in the history.

@ringerc ringerc marked this pull request as draft February 2, 2025 23:49
@ringerc ringerc force-pushed the retry-signal-pid-file branch from 4f81e68 to 630fb75 Compare February 2, 2025 23:49
@ringerc
Copy link
Contributor Author

ringerc commented Feb 3, 2025

Added some test cover for command exec. Will add some for pid signalling next.

@ringerc ringerc force-pushed the retry-signal-pid-file branch 4 times, most recently from 7d742a6 to b3fda05 Compare February 4, 2025 01:02
@ringerc ringerc marked this pull request as ready for review February 4, 2025 01:02
@ringerc ringerc force-pushed the retry-signal-pid-file branch 3 times, most recently from ea321a3 to 8eb8181 Compare February 4, 2025 04:30
@faisal-memon faisal-memon added this to the 0.11.0 milestone Feb 4, 2025
@ringerc
Copy link
Contributor Author

ringerc commented Feb 5, 2025

I see some linter complaints here. I'll address them, and probably split the PR.

Fix compilation on GOOS=windows for the main binary and tests,
both of which were broken.

Fix a leaked open file handle which caused config_test.go to fail
on windows with "Sharing violation" errors when cleaning up the
tempdir.

Add a 'test-wine' Makefile target. Explain how to run the tests
for windows via wine in the readme.

Signed-off-by: Craig Ringer <[email protected]>
Add test cover for:

* long-running and short-lived cmd and cmd_args invocations
* signalling via pid_file_name and renew_signal
* windows cmd and cmd_args execution

If rebased onto the 'fix-windows' branch for windows test fixes, these
pass for windows test runs too.

To make these tests possible the following changes are made to the
sidecar code:

* Sidecar.certReadyChan yields *workloadapi.X509Context not struct{}
  so the context can be observed. This could also be useful for
  code extending spiffe-helper.
* A new Sidecar.cmdExitChan channel reports whenever a process
  managed by spiffe-helper exits, and the exit state. This allows
  tests to wait for process exits and see the exit code, etc.
* A new Sidecar.pidFileSignalledChan channel reports whenever
  signalling the file in pid_file_name is attempted, the pid
  signalled, and the outcome. This allows signal based tests
  to see what spiffe-helper tried to signal.

Channels with values are used to ensure that there's no race between
what might be present in the Sidecar struct state when the test observes
it and what was there when the event of interest occurred.

It might be handy to have a channel for observing when the managed
process is started too, but the tests just busy-wait polling
Sidecar.processRunning to detect this for now.

I considered combining these channels into some kind of unified observer
channel that emits a union struct depending on the event type, but it
seemed pointless to do this when we can just select{} on multiple
channels of interest and keep each channel simple.

Signed-off-by: Craig Ringer <[email protected]>
@ringerc ringerc marked this pull request as draft February 5, 2025 01:05
… into retry-signal-pid-file

These are tracked by PRs

* spiffe#258 for test suite
* spiffe#257 for windows

Once they are merged the upcoming commit can be merged in its own PR
@ringerc ringerc force-pushed the retry-signal-pid-file branch 2 times, most recently from 92a60c4 to cb78cab Compare February 5, 2025 01:32
@ringerc
Copy link
Contributor Author

ringerc commented Feb 5, 2025

Windows fixes split off to simpler individual PR: #257

Added test cover split off to a PR that focuses on adding the test cover, #258

I've rebased this PR onto a merge-commit created from the two listed above and fixed the linter complaints.

Github doesn't seem to have any way to say "this PR depends on these other two", and only appears to understand linear histories, so I'll keep this PR marked as a draft to prevent it being inadvertently squash-merged until the other two are reviewed and merged. Then this will be a trivial git rebase origin/main.

The only commit unique to this PR is cb78cab so that's the only one to focus on for review.

There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

To enable better testing for this, the retry count is exposed in the
test channel for process signalling.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
@ringerc ringerc force-pushed the retry-signal-pid-file branch from cb78cab to f3c7cec Compare February 5, 2025 01:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition between signalling pid file and certificate reading
2 participants