Skip to content

MWI: Fall back to registering without an existing auth client#56927

Merged
timothyb89 merged 6 commits intomasterfrom
timothyb89/mwi-reauth-fix
Jul 22, 2025
Merged

MWI: Fall back to registering without an existing auth client#56927
timothyb89 merged 6 commits intomasterfrom
timothyb89/mwi-reauth-fix

Conversation

@timothyb89
Copy link
Copy Markdown
Contributor

@timothyb89 timothyb89 commented Jul 18, 2025

This tweaks bot joining logic to discard an existing auth client (i.e. bots, used to verify bot instance IDs between cert refreshes) if the backing identity is expired per the current local time and the certificate's NotAfter field.

As an example, currently if you lock a bot, it will get stuck in a bad state, and will not recover until the tbot process is restarted:

# in one terminal...
$ tbot start identity --join-token=example --certificate-ttl=30s --renewal-interval 20s

# in another terminal...
$ tctl lock --join-token=example # pick your favorite lock target
$ sleep 30                       # observe that the bot fails to renew and enters a failure loop
$ tctl rm lock/$id               # observe that the bot cannot recover after access is restored

Notably, after some recent changes, tbot no longer necessarily crashes if renewals fail. We used to be able to assume systemd/etc would restart for us eventually and clumsily resolve the broken client. This was never a good system to rely on, though, so we should probably resolve this class of issue properly.

changelog: Machine and Workload ID: The tbot client will now discard expired identities if needed during renewal to allow automatic recovery without restarting the process

This tweaks joining logic to allow clients that provided an existing
auth client (i.e. bots, used to verify bot instance IDs between cert
refreshes) to fall back to not using that auth client if said client
appears to be broken.

It attempts to perform a ping, and if the ping fails, proceeds as if
no existing client was provided.
Comment thread lib/auth/join/join.go Outdated
result, err := registerThroughAuthClient(ctx, token, params, params.AuthClient)
if err != nil {
slog.ErrorContext(ctx, "Registration with existing auth client failed.", "error", err)
return nil, trace.Wrap(err)
Copy link
Copy Markdown
Contributor Author

@timothyb89 timothyb89 Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I could also see us falling back to the usual logic in case of any error, not just if a ping fails? But then we could potentially toss out a perfectly good bot instance ID (and consume a bound keypair recovery) for other errors if something spurious happens.

There's probably a lot of solutions here, so I'd love to hear suggestions!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing out an otherwise valid bot instance ID is also bad because it could also lead to us overbilling customers. If end up with a new instance ID every time there's an error, and we're billing by bot instances...

Copy link
Copy Markdown
Contributor Author

@timothyb89 timothyb89 Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, yeah, and I think with that in mind my implementation isn't good enough.

I've just added another (duplicate) check outside of the joining client so tbot will refuse to pass along an auth client built with an identity known to be expired. This follows existing bot start-up behavior and will only result in a new bot instance if the identity expires. (I'll revert the ping test change shortly)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the joining client change and confirmed this solves my use case (lock and unlock with a bound keypair token, allowing bot certs to expire before unlocking)

@timothyb89 timothyb89 marked this pull request as ready for review July 18, 2025 01:00
@github-actions github-actions bot requested review from hugoShaka and vapopov July 18, 2025 01:01
This adds an explicit check for an expired identity during renewal.
If expired, the existing auth client will not be used.
@timothyb89 timothyb89 force-pushed the timothyb89/mwi-reauth-fix branch from fe2d0eb to 4c1f268 Compare July 19, 2025 01:16
//
// If the existing identity appears to be expired (`time.Now()` >
// `NotAfter`), the existing auth client will be discarded and the bot will
// try to join without it. This will result in a new bot instance ID.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment about the new bot instance ID still accurate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a new ID is issued whenever a bot joins without an existing valid cert bundle.

Comment thread lib/tbot/service_bot_identity.go Outdated
Comment on lines +495 to +498
"The bot identity appears to be expired and will not be used to "+
"authenticate the identity renewal. If it is possible to "+
"rejoin, a new bot instance will be issued. Ensure the "+
"certificate TTL and renewal interval are configured properly.",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can improve this message, it's not super clear to me what actions are required of the user who sees this message. We tell them to configure things properly, but it's not obvious (at least to me) whats not proper about their current configuration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to reword the message a bit:

The bot identity appears to be expired and will not be used to authenticate the identity renewal. If it is possible to rejoin, a new bot instance will be issued. Ensure the configured certificate TTL is long enough to accommodate your environment and that no external issues will prevent the bot from renewing its identity on schedule.

The problem is that a lot of things can cause this condition and we can't really know the root cause here. It could be caused by a misconfigured TTL (certificate_ttl < renewal_interval) and in that case we'll also have logged an additional warning about this specifically, or it could be caused by a network or Teleport outage preventing several renewals in a row, or the bot could've been locked for a while, or ...

Hopefully this can serve as a call to review preceding tbot logs to look for the source?

Comment thread lib/tbot/service_bot_identity.go Outdated
}

now := time.Now()
if expiry, ok := facade.Expiry(); !ok || now.After(expiry) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only thought with client-side checks of a certificates validity is that the certificate /could/ be valid but the client could be heavily drifted, but, I think in that case we'd probably want to encourage people to fix the drift rather than adjusting the product to accomodate it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our 1x1 today - This client-side check is definitely not the best possible solution but is relatively simple and shouldn't cause a meaningful jump in bot instance cycling - in practice it only restores previous behavior, since bots would crash, restart, and discard their identity with this same check at startup.

I've added some more detail to the warning message to call out potential clock drift along with a note that we might iterate on this more in the future. (And also tweaked wording to emphasize that its only a Problem if it happens repeatedly)

@timothyb89 timothyb89 enabled auto-merge July 22, 2025 22:38
@timothyb89 timothyb89 added this pull request to the merge queue Jul 22, 2025
Merged via the queue into master with commit d71c299 Jul 22, 2025
43 checks passed
@timothyb89 timothyb89 deleted the timothyb89/mwi-reauth-fix branch July 22, 2025 23:21
@backport-bot-workflows
Copy link
Copy Markdown
Contributor

@timothyb89 See the table below for backport results.

Branch Result
branch/v17 Create PR
branch/v18 Create PR

timothyb89 added a commit that referenced this pull request Jul 23, 2025
The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
github-merge-queue bot pushed a commit that referenced this pull request Jul 23, 2025
The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
timothyb89 added a commit that referenced this pull request Jul 23, 2025
The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
timothyb89 added a commit that referenced this pull request Jul 23, 2025
The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
github-merge-queue bot pushed a commit that referenced this pull request Jul 23, 2025
…57060)

* MWI: Fall back to registering without an existing auth client

This tweaks joining logic to allow clients that provided an existing
auth client (i.e. bots, used to verify bot instance IDs between cert
refreshes) to fall back to not using that auth client if said client
appears to be broken.

It attempts to perform a ping, and if the ping fails, proceeds as if
no existing client was provided.

* Explicitly check for an expired identity during renewal

This adds an explicit check for an expired identity during renewal.
If expired, the existing auth client will not be used.

* Revert changes to the join client

* Linter appeasement

* Tweak warning message on expired identity renewal

* Add note about future improvements and tweak logged warning

* MWI: Fix identity facade's `Expiry()` certificate parsing (#57063)

The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
github-merge-queue bot pushed a commit that referenced this pull request Jul 28, 2025
…57062)

* MWI: Fall back to registering without an existing auth client

This tweaks joining logic to allow clients that provided an existing
auth client (i.e. bots, used to verify bot instance IDs between cert
refreshes) to fall back to not using that auth client if said client
appears to be broken.

It attempts to perform a ping, and if the ping fails, proceeds as if
no existing client was provided.

* Explicitly check for an expired identity during renewal

This adds an explicit check for an expired identity during renewal.
If expired, the existing auth client will not be used.

* Revert changes to the join client

* Linter appeasement

* Tweak warning message on expired identity renewal

* Add note about future improvements and tweak logged warning

* MWI: Fix identity facade's `Expiry()` certificate parsing (#57063)

The `Expiry()` function was trying to parse DER-encoded data in the
`tls.Certificate` as PEM, causing a silent failure. This method was
not used until #56927 but failed every time as it was trying to parse
certificates using the wrong encoding type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants