MWI: Fall back to registering without an existing auth client#56927
MWI: Fall back to registering without an existing auth client#56927timothyb89 merged 6 commits intomasterfrom
Conversation
This tweaks joining logic to allow clients that provided an existing auth client (i.e. bots, used to verify bot instance IDs between cert refreshes) to fall back to not using that auth client if said client appears to be broken. It attempts to perform a ping, and if the ping fails, proceeds as if no existing client was provided.
| result, err := registerThroughAuthClient(ctx, token, params, params.AuthClient) | ||
| if err != nil { | ||
| slog.ErrorContext(ctx, "Registration with existing auth client failed.", "error", err) | ||
| return nil, trace.Wrap(err) |
There was a problem hiding this comment.
Alternatively, I could also see us falling back to the usual logic in case of any error, not just if a ping fails? But then we could potentially toss out a perfectly good bot instance ID (and consume a bound keypair recovery) for other errors if something spurious happens.
There's probably a lot of solutions here, so I'd love to hear suggestions!
There was a problem hiding this comment.
Throwing out an otherwise valid bot instance ID is also bad because it could also lead to us overbilling customers. If end up with a new instance ID every time there's an error, and we're billing by bot instances...
There was a problem hiding this comment.
Hrm, yeah, and I think with that in mind my implementation isn't good enough.
I've just added another (duplicate) check outside of the joining client so tbot will refuse to pass along an auth client built with an identity known to be expired. This follows existing bot start-up behavior and will only result in a new bot instance if the identity expires. (I'll revert the ping test change shortly)
There was a problem hiding this comment.
I've removed the joining client change and confirmed this solves my use case (lock and unlock with a bound keypair token, allowing bot certs to expire before unlocking)
This adds an explicit check for an expired identity during renewal. If expired, the existing auth client will not be used.
fe2d0eb to
4c1f268
Compare
| // | ||
| // If the existing identity appears to be expired (`time.Now()` > | ||
| // `NotAfter`), the existing auth client will be discarded and the bot will | ||
| // try to join without it. This will result in a new bot instance ID. |
There was a problem hiding this comment.
Is this comment about the new bot instance ID still accurate?
There was a problem hiding this comment.
Yes, a new ID is issued whenever a bot joins without an existing valid cert bundle.
| "The bot identity appears to be expired and will not be used to "+ | ||
| "authenticate the identity renewal. If it is possible to "+ | ||
| "rejoin, a new bot instance will be issued. Ensure the "+ | ||
| "certificate TTL and renewal interval are configured properly.", |
There was a problem hiding this comment.
I think we can improve this message, it's not super clear to me what actions are required of the user who sees this message. We tell them to configure things properly, but it's not obvious (at least to me) whats not proper about their current configuration.
There was a problem hiding this comment.
I've tried to reword the message a bit:
The bot identity appears to be expired and will not be used to authenticate the identity renewal. If it is possible to rejoin, a new bot instance will be issued. Ensure the configured certificate TTL is long enough to accommodate your environment and that no external issues will prevent the bot from renewing its identity on schedule.
The problem is that a lot of things can cause this condition and we can't really know the root cause here. It could be caused by a misconfigured TTL (certificate_ttl < renewal_interval) and in that case we'll also have logged an additional warning about this specifically, or it could be caused by a network or Teleport outage preventing several renewals in a row, or the bot could've been locked for a while, or ...
Hopefully this can serve as a call to review preceding tbot logs to look for the source?
| } | ||
|
|
||
| now := time.Now() | ||
| if expiry, ok := facade.Expiry(); !ok || now.After(expiry) { |
There was a problem hiding this comment.
My only thought with client-side checks of a certificates validity is that the certificate /could/ be valid but the client could be heavily drifted, but, I think in that case we'd probably want to encourage people to fix the drift rather than adjusting the product to accomodate it.
There was a problem hiding this comment.
From our 1x1 today - This client-side check is definitely not the best possible solution but is relatively simple and shouldn't cause a meaningful jump in bot instance cycling - in practice it only restores previous behavior, since bots would crash, restart, and discard their identity with this same check at startup.
I've added some more detail to the warning message to call out potential clock drift along with a note that we might iterate on this more in the future. (And also tweaked wording to emphasize that its only a Problem if it happens repeatedly)
|
@timothyb89 See the table below for backport results.
|
The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
…57060) * MWI: Fall back to registering without an existing auth client This tweaks joining logic to allow clients that provided an existing auth client (i.e. bots, used to verify bot instance IDs between cert refreshes) to fall back to not using that auth client if said client appears to be broken. It attempts to perform a ping, and if the ping fails, proceeds as if no existing client was provided. * Explicitly check for an expired identity during renewal This adds an explicit check for an expired identity during renewal. If expired, the existing auth client will not be used. * Revert changes to the join client * Linter appeasement * Tweak warning message on expired identity renewal * Add note about future improvements and tweak logged warning * MWI: Fix identity facade's `Expiry()` certificate parsing (#57063) The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
…57062) * MWI: Fall back to registering without an existing auth client This tweaks joining logic to allow clients that provided an existing auth client (i.e. bots, used to verify bot instance IDs between cert refreshes) to fall back to not using that auth client if said client appears to be broken. It attempts to perform a ping, and if the ping fails, proceeds as if no existing client was provided. * Explicitly check for an expired identity during renewal This adds an explicit check for an expired identity during renewal. If expired, the existing auth client will not be used. * Revert changes to the join client * Linter appeasement * Tweak warning message on expired identity renewal * Add note about future improvements and tweak logged warning * MWI: Fix identity facade's `Expiry()` certificate parsing (#57063) The `Expiry()` function was trying to parse DER-encoded data in the `tls.Certificate` as PEM, causing a silent failure. This method was not used until #56927 but failed every time as it was trying to parse certificates using the wrong encoding type.
This tweaks bot joining logic to discard an existing auth client (i.e. bots, used to verify bot instance IDs between cert refreshes) if the backing identity is expired per the current local time and the certificate's
NotAfterfield.As an example, currently if you lock a bot, it will get stuck in a bad state, and will not recover until the
tbotprocess is restarted:Notably, after some recent changes, tbot no longer necessarily crashes if renewals fail. We used to be able to assume systemd/etc would restart for us eventually and clumsily resolve the broken client. This was never a good system to rely on, though, so we should probably resolve this class of issue properly.
changelog: Machine and Workload ID: The
tbotclient will now discard expired identities if needed during renewal to allow automatic recovery without restarting the process