Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error-free startup handling when using pid_file_name #251

Open
ringerc opened this issue Feb 2, 2025 · 1 comment
Open

Error-free startup handling when using pid_file_name #251

ringerc opened this issue Feb 2, 2025 · 1 comment

Comments

@ringerc
Copy link
Contributor

ringerc commented Feb 2, 2025

Presently it's difficult to use spiffe-helper as a co-process to manage the certs for an independent process in a way that doesn't result in spurious errors during startup.

The problem

signalPID() in pkg/sidecar/sidecar.go will return error s failed to read pid file, failed to parse pid file and failed to find process id if the pid file is missing, empty/invalid, or points to a pid that doesn't exist, respectively. This will result in Unable to signal PID file error logs.

The problem is that many processes will require their certs to exist before they can be started. So it's necessary to start spiffe-helper first, and wait for the certs to be created before launching the process that uses them. It is normal for the pid file to be empty, missing or invalid during this period.

It's desirable to get errors when there's actually a problem with signalling the process to be managed, so these errors should not just be demoted to info or debug level. But there should be a away to avoid getting "normal errors" during a normal phase of operation, as routinely ignoring "normal errors" makes it much harder to diagnose real problems when they occur.

(There's also a related ordering issue, a race in restart, per #250)

Current workarounds

One workaround is to put the pid of the controlling process for both spiffe-helper and the intended final command into the pid file, then replace the content once the real process to be managed launches. Then trap the signal sent by spiffe-helper in the controlling process and use it to trigger launch of the process that uses the certs. This is clunky and awkward to use though.

See #250 (comment) for an approach that uses a short-lived process to write the pid file then exec the real one, as suggested by @kfox1111 .

Another alternative is to just ignore the errors, but as noted that's not really good practice.

Proposed solution

spiffe-helper needs to know when the process to signal should be running, so it knows whether failing to signal it is an error or not. Or it needs to have a retry loop so that transient errors can be handled and only reported if a retry fails.

Ideally this could be done by hot-reloading the helper's configuration, so pid_file_name could be updated at runtime once the process to run is known to exist. But that's not easy to shoehorn into the helper's current config logic.

Instead I suggest that the helper should retry signalling the process, logging initial failures at at debug or info level. After a short timeout it should log an error if signalling it still fails.

This would also mostly close the race in #250

ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 2, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

Fixes spiffe#250, spiffe#251
ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 4, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

To enable better testing for this, the retry count is exposed in the
test channel for process signalling.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

To enable better testing for this, the retry count is exposed in the
test channel for process signalling.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

To enable better testing for this, the retry count is exposed in the
test channel for process signalling.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
@ringerc
Copy link
Contributor Author

ringerc commented Feb 25, 2025

The PR that proposes to fix this is:

but it depends on a series of other PRs that need merging first so it can be rebased onto them:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant