Skip to content

lib/teleterm: Do not reissue user certs before issuing gateway cert#62096

Merged
ravicious merged 2 commits intomasterfrom
r7s/gateway-user-cert
Dec 10, 2025
Merged

lib/teleterm: Do not reissue user certs before issuing gateway cert#62096
ravicious merged 2 commits intomasterfrom
r7s/gateway-user-cert

Conversation

@ravicious
Copy link
Copy Markdown
Member

@ravicious ravicious commented Dec 9, 2025

Closes #42291.

changelog: Fixed "the client connection is closing" error happening under certain conditions in Teleport Connect when connecting to resources with per-session MFA enabled

This is one of those bugs where the fix is fairly simple in terms of lines changed, but explaining the issue takes way more effort.

The background

I noticed that on both master and 18.5.0, I can no longer connect to databases through a gateway when using my user account with low TTL. When trying to connect to a database, the app would show the per-session MFA prompt, accept my hardware key touch and fail with "grpc: the client connection is closing" immediately after. There was no debug logs in the auth service, indicating that the problem is completely client side.

Shared tsh home

In 18.5.0, we finally made Connect and tsh use the same tsh home dir. The way it works is that Connect starts a file watcher on the dir and on any change loads the profiles from disk and compares them against what's in the memory. It is able to detect certain specific actions (assuming access requests, logging in, logging out) and update the UI accordingly.

When it detects that the state of a profile on disk has changed compared to what's in memory, it clears stale cached cluster clients and syncs cluster details. This is done so that when the user assumes an access request, the app stops using cached cluster clients which still operate with the old user cert with no access requests assumed. Detecting a stale client is done by comparing certificate contents on disk vs what they were during the creation of the client.

Reissuing user certs

In #12293 we made it so that just before certs for a gateway are issued, tshd reissues user certs. This was a long way ago, I think before local proxies stored certs in memory. It was done in order to fix an issue with accessing databases in leaf clusters.

For some reason, when the TTL of a user cert is very short or the cert is almost expired, lib/client.Client.ReissueUserCerts inflates the remaining TTL. I haven't looked into why it happens but I've documented this in #42291.

What's also important here is that we perform two operations with the client: first we refresh user certs on disk and then create local proxy cert in memory.

The problem

The combination of these two things means that when a user with a short TTL tries to connect to a resource, the following sequence of events happens:

  1. The user issues an RPC to create a gateway.
  2. The RPC handler in tshd reissues the user cert and then waits for a hardware key touch to get through per-session MFA.
  3. The profile watcher in Connect notices the change in the certs (since the TTL is now different) and closes stale clients. The client used by the RPC handler is considered stale since it uses the old cert.
  4. The user touches the hardware key.
  5. The RPC handler attempts to create a gateway, but the cluster client is already closed, resulting in "grpc: the client connection is closing".

The solution

I fixed this by removing the code which reissues user certs before a local proxy cert is issued. Removing those lines does not break integration tests (which test each kind of a gateway, both in the context of a root cluster and a leaf cluster). Refreshing certs on disk is no longer necessary since gateways (AKA local proxies) no longer use certs from disk.

This problem does show that the profile watcher makes certain assumptions about RPC handlers: a handler that updates certs on disk never performs any further operations. This is mostly true for handlers which perform logout or access request assumption. But it wasn't the case for the handlers which create gateways. So the fix essentially removes those handlers from this group by making them not modify certs on disk (since it was not necessary anyway).

@ravicious ravicious requested a review from gzdunek December 9, 2025 15:08
@ravicious ravicious added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v18 labels Dec 9, 2025
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gzdunek Do you remember why we clear stale clients before checking if access has changed rather than after this check? Off the top of my head I cannot tell what situation this would address.

// Only clear clients with outdated certificates.
// The watcher 'changed' event may be emitted right after the user logs in
// or assumes a role via Connect (which already closes all clients
// for the profile), so we avoid closing them again if they're already up to date.
await client.clearStaleClusterClients({ rootClusterUri: next.uri });
await this.syncOrUpdateCluster(next);
if (!hasAccessChanged(previous.loggedInUser, next.loggedInUser)) {
return;
}
await this.rendererEventHandler.send({
op: 'did-change-access',
uri: next.uri,
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.

My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reject requests from a stale client as in a client with a stale cert in case disconnect_expired_cert is enabled?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this explanation to the comment in code.

@ravicious ravicious removed the request for review from aadc-dev December 9, 2025 15:10
Copy link
Copy Markdown
Contributor

@gzdunek gzdunek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for fixing it.

I tested opening connections to databases in a root and leaf clusters, work fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.

My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.

Copy link
Copy Markdown
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed PR explanation, this is very appreciated!

@ravicious ravicious enabled auto-merge December 10, 2025 11:10
@ravicious ravicious added this pull request to the merge queue Dec 10, 2025
Merged via the queue into master with commit 5617051 Dec 10, 2025
44 of 45 checks passed
@ravicious ravicious deleted the r7s/gateway-user-cert branch December 10, 2025 12:01
@backport-bot-workflows
Copy link
Copy Markdown
Contributor

@ravicious See the table below for backport results.

Branch Result
branch/v18 Create PR

@ravicious ravicious removed the no-changelog Indicates that a PR does not require a changelog entry label Dec 10, 2025
21KennethTran pushed a commit that referenced this pull request Jan 6, 2026
…62096)

* lib/teleterm: Do not ReissueUserCerts before issuing gateway cert

* Explain why clearing stale clients needs to be done before check for changed access
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connect's logic for app cert issue causestsh status to report incorrect "Valid until" time when TTL is low

3 participants