lib/teleterm: Do not reissue user certs before issuing gateway cert by ravicious · Pull Request #62096 · gravitational/teleport

ravicious · 2025-12-09T15:08:28Z

changelog: Fixed "the client connection is closing" error happening under certain conditions in Teleport Connect when connecting to resources with per-session MFA enabled

This is one of those bugs where the fix is fairly simple in terms of lines changed, but explaining the issue takes way more effort.

The background

I noticed that on both master and 18.5.0, I can no longer connect to databases through a gateway when using my user account with low TTL. When trying to connect to a database, the app would show the per-session MFA prompt, accept my hardware key touch and fail with "grpc: the client connection is closing" immediately after. There was no debug logs in the auth service, indicating that the problem is completely client side.

Shared tsh home

In 18.5.0, we finally made Connect and tsh use the same tsh home dir. The way it works is that Connect starts a file watcher on the dir and on any change loads the profiles from disk and compares them against what's in the memory. It is able to detect certain specific actions (assuming access requests, logging in, logging out) and update the UI accordingly.

When it detects that the state of a profile on disk has changed compared to what's in memory, it clears stale cached cluster clients and syncs cluster details. This is done so that when the user assumes an access request, the app stops using cached cluster clients which still operate with the old user cert with no access requests assumed. Detecting a stale client is done by comparing certificate contents on disk vs what they were during the creation of the client.

Reissuing user certs

In #12293 we made it so that just before certs for a gateway are issued, tshd reissues user certs. This was a long way ago, I think before local proxies stored certs in memory. It was done in order to fix an issue with accessing databases in leaf clusters.

For some reason, when the TTL of a user cert is very short or the cert is almost expired, lib/client.Client.ReissueUserCerts inflates the remaining TTL. I haven't looked into why it happens but I've documented this in #42291.

What's also important here is that we perform two operations with the client: first we refresh user certs on disk and then create local proxy cert in memory.

The problem

The combination of these two things means that when a user with a short TTL tries to connect to a resource, the following sequence of events happens:

The user issues an RPC to create a gateway.
The RPC handler in tshd reissues the user cert and then waits for a hardware key touch to get through per-session MFA.
The profile watcher in Connect notices the change in the certs (since the TTL is now different) and closes stale clients. The client used by the RPC handler is considered stale since it uses the old cert.
The user touches the hardware key.
The RPC handler attempts to create a gateway, but the cluster client is already closed, resulting in "grpc: the client connection is closing".

The solution

I fixed this by removing the code which reissues user certs before a local proxy cert is issued. Removing those lines does not break integration tests (which test each kind of a gateway, both in the context of a root cluster and a leaf cluster). Refreshing certs on disk is no longer necessary since gateways (AKA local proxies) no longer use certs from disk.

This problem does show that the profile watcher makes certain assumptions about RPC handlers: a handler that updates certs on disk never performs any further operations. This is mostly true for handlers which perform logout or access request assumption. But it wasn't the case for the handlers which create gateways. So the fix essentially removes those handlers from this group by making them not modify certs on disk (since it was not necessary anyway).

ravicious · 2025-12-09T15:10:32Z

lib/teleterm/clusters/cluster_apps.go

@gzdunek Do you remember why we clear stale clients before checking if access has changed rather than after this check? Off the top of my head I cannot tell what situation this would address.

teleport/web/packages/teleterm/src/mainProcess/clusterLifecycleManager/clusterLifecycleManager.ts

Lines 247 to 260 in 9ad1f18

// Only clear clients with outdated certificates.

// The watcher 'changed' event may be emitted right after the user logs in

// or assumes a role via Connect (which already closes all clients

// for the profile), so we avoid closing them again if they're already up to date.

await client.clearStaleClusterClients({ rootClusterUri: next.uri });

await this.syncOrUpdateCluster(next);

if (!hasAccessChanged(previous.loggedInUser, next.loggedInUser)) {

return;

}

await this.rendererEventHandler.send({

op: 'did-change-access',

uri: next.uri,

});

Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.

My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.

Reject requests from a stale client as in a client with a stale cert in case disconnect_expired_cert is enabled?

I added this explanation to the comment in code.

gzdunek

Many thanks for fixing it.

I tested opening connections to databases in a root and leaf clusters, work fine.

gzdunek · 2025-12-09T16:13:27Z

lib/teleterm/clusters/cluster_apps.go

Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.

My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.

hugoShaka

Thanks for the detailed PR explanation, this is very appreciated!

…changed access

backport-bot-workflows · 2025-12-10T12:02:57Z

@ravicious See the table below for backport results.

Branch	Result
branch/v18	Create PR

…62096) * lib/teleterm: Do not ReissueUserCerts before issuing gateway cert * Explain why clearing stale clients needs to be done before check for changed access

lib/teleterm: Do not ReissueUserCerts before issuing gateway cert

b838fd9

ravicious requested a review from gzdunek December 9, 2025 15:08

ravicious added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v18 labels Dec 9, 2025

github-actions bot requested review from aadc-dev and hugoShaka December 9, 2025 15:09

github-actions bot added the size/sm label Dec 9, 2025

ravicious commented Dec 9, 2025

View reviewed changes

ravicious removed the request for review from aadc-dev December 9, 2025 15:10

gzdunek approved these changes Dec 9, 2025

View reviewed changes

hugoShaka approved these changes Dec 9, 2025

View reviewed changes

Explain why clearing stale clients needs to be done before check for …

047dbc3

…changed access

ravicious enabled auto-merge December 10, 2025 11:10

ravicious added this pull request to the merge queue Dec 10, 2025

Merged via the queue into master with commit 5617051 Dec 10, 2025
44 of 45 checks passed

ravicious deleted the r7s/gateway-user-cert branch December 10, 2025 12:01

ravicious mentioned this pull request Dec 10, 2025

[v18] lib/teleterm: Do not reissue user certs before issuing gateway cert #62127

Merged

ravicious removed the no-changelog Indicates that a PR does not require a changelog entry label Dec 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib/teleterm: Do not reissue user certs before issuing gateway cert#62096

lib/teleterm: Do not reissue user certs before issuing gateway cert#62096
ravicious merged 2 commits intomasterfrom
r7s/gateway-user-cert

ravicious commented Dec 9, 2025 •

edited

Loading

Uh oh!

ravicious Dec 9, 2025

Uh oh!

gzdunek Dec 9, 2025

Uh oh!

ravicious Dec 9, 2025

Uh oh!

gzdunek Dec 9, 2025

Uh oh!

ravicious Dec 10, 2025

Uh oh!

gzdunek left a comment

Uh oh!

gzdunek Dec 9, 2025

Uh oh!

hugoShaka left a comment

Uh oh!

Uh oh!

backport-bot-workflows bot commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// Only clear clients with outdated certificates.
	// The watcher 'changed' event may be emitted right after the user logs in
	// or assumes a role via Connect (which already closes all clients
	// for the profile), so we avoid closing them again if they're already up to date.
	await client.clearStaleClusterClients({ rootClusterUri: next.uri });
	await this.syncOrUpdateCluster(next);

	if (!hasAccessChanged(previous.loggedInUser, next.loggedInUser)) {
	return;
	}
	await this.rendererEventHandler.send({
	op: 'did-change-access',
	uri: next.uri,
	});

Conversation

ravicious commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The background

Shared tsh home

Reissuing user certs

The problem

The solution

Uh oh!

ravicious Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

gzdunek Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ravicious Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

gzdunek Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

ravicious Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

gzdunek left a comment

Choose a reason for hiding this comment

Uh oh!

gzdunek Dec 9, 2025

Choose a reason for hiding this comment

Uh oh!

hugoShaka left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

backport-bot-workflows bot commented Dec 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ravicious commented Dec 9, 2025 •

edited

Loading