
Invalidate cached client when TLS cert changes #59370

Closed

gzdunek wants to merge 4 commits into master from gzdunek/invalidate-cached-client-when-cert-changes

Conversation

@gzdunek (Contributor) commented Sep 19, 2025

Closes #44059
Contributes to #25806

As described in the first issue, a relogin can race with another thread retrieving the client from the cache, which may result in the cache retaining an outdated client.

The problem with invalidating the cache will become even more visible once tsh and Connect start sharing the ~/.tsh directory. If a user logs in or assumes a role via tsh, Connect will receive a file system event and must decide whether to clear the cache (and I suspect we can run into some edge cases there).
Additionally, I'd like to avoid exposing any ClientCache-related functionality outside of tsh, since it's really just an implementation detail. Even now, it's cumbersome to remember to clear the cache after logging in or assuming a role.

A better option is to shift that responsibility to the ClientCache itself. On any get request, it can verify that the TLS certificate hasn't changed since the client was created. This approach resolves the linked issue and eliminates the need to manually clear the cache (except on logout).

@gzdunek gzdunek added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v16 backport/branch/v17 backport/branch/v18 labels Sep 19, 2025
@github-actions github-actions bot added size/sm tsh tsh - Teleport's command line tool for logging into nodes running Teleport. labels Sep 19, 2025
@github-actions github-actions bot requested review from boxofrad and r0mant September 19, 2025 15:04
Comment thread lib/client/clientcache/clientcache.go Outdated
return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
@gzdunek (Contributor, Author) commented Sep 19, 2025:

It seemed to me that comparing the TLS cert is the easiest way to detect whether a new client should be created (since it was used to create the client), but I'm open to suggestions.

Comment thread lib/client/clientcache/clientcache.go Outdated
Comment on lines +246 to +256
tc, err := c.cfg.NewClientFunc(ctx, profile, "")
if err != nil {
return false, trace.Wrap(err)
}

keyRing, err := tc.LocalAgent().GetCoreKeyRing()
if err != nil {
return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
Contributor commented:

My one concern is that this seems to do quite a bit of extra work just to get the latest TLS cert and see whether it has changed: loading the full profile, parsing and validating certificates; I think it may even try to add SSH keys to the system key agent. I'm not sure this will be tolerable on every cache get; I've been calling it as if it were basically a map lookup. E.g. VNet may call into this multiple times while resolving a DNS query, and generally uses the clientcache quite a bit.

What you have is very general, since it only uses the existing NewClientFunc, but I wonder how hard it would be to pass in a more targeted way to read just the current TLS cert.

@gzdunek (Contributor, Author) replied:

Ah, that's a good point. I wasn't aware that VNet relies so heavily on the client cache (though the regular tsh daemon will benefit from this as well).

I've updated the code to only read the TLS cert.

Member commented:

Reading from disk on each cache read still feels like a lot of overhead, though I can't assess whether it's significant enough or not. At least there's singleflight.Group for concurrent cache reads, but it still feels like a lot of disk IO.

@gzdunek gzdunek requested a review from nklaassen September 23, 2025 09:08
}

type clientWithCert struct {
// client is cluster client.
Member commented:

You can drop this comment hehe.

type clientWithCert struct {
// client is cluster client.
client *client.ClusterClient
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
Member commented:

Suggested change
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
// coreTLSCert is the cert used in [client.TeleportClient.ConnectToCluster] to create the client.

or maybe even

Suggested change
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
// coreTLSCert is the contents of the cert at the time of creating the client.

// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
coreTLSCert []byte
// readCoreTLSCert reads a fresh cert from disk.
readCoreTLSCert func() ([]byte, error)
Member commented:

I don't think it's very Go-like; I assume the more idiomatic way to do it is to create a profile interface with a TLSCert method.

return fromCache, nil
unchanged, err := fromCache.isCoreTLSCertUnchanged()
if err != nil {
c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
Member commented:

Suggested change
c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
c.cfg.Logger.WarnContext(ctx, "Failed to check if TLS certificate has changed, removing client from cache", "cluster", k, "error", err)

c.cfg.Logger.DebugContext(ctx, "Retrieved client from cache", "cluster", k)
return fromCache.client, nil
} else {
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
Member commented:

Suggested change
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing client from cache", "cluster", k)

return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
Member commented:

Does the removal of these constitute a significant change in behavior?

In the previous version, after logging in or assuming a role, all clients created so far would be closed. With the new behavior, no client will be closed after one of those operations until another RPC attempts to get a client from the cache.

Does this have any significance? It feels like it could mostly affect leaf clients, as those might remain open well after a relogin / role assumption.


return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
Member commented:

In #59760, I tried re-enabling the client cache in tests. Currently there's a test which just straight up fails (#59760 (comment)). I think it's because in the real world, we depend on the cache being cleared during the login RPCs. In tests however we circumvent those RPCs completely (because we don't have access to user credentials so we depend on test helpers to generate new certs on disk).

The test passes when I manually clear the cache on relogin (see the diff). I think once this PR is merged, and the cache no longer depends on being manually cleared, we should be able to re-enable it in integration tests without workarounds like the one in the diff.

Diff
diff --git a/integration/proxy/teleterm_test.go b/integration/proxy/teleterm_test.go
index ee19d22c1d2..02b5112a9cb 100644
--- a/integration/proxy/teleterm_test.go
+++ b/integration/proxy/teleterm_test.go
@@ -264,6 +264,7 @@ func testGatewayCertRenewal(ctx context.Context, t *testing.T, params gatewayCer
 	t.Cleanup(func() {
 		daemonService.Stop()
 	})
+	tshdEventsService.daemonService = daemonService
 
 	// Connect the daemon to the tshd events service, like it would
 	// during normal initialization of the app.
@@ -320,6 +321,7 @@ type mockTSHDEventsService struct {
 	sendNotificationCallCount atomic.Uint32
 	promptMFACallCount        atomic.Uint32
 	generateAndSetupUserCreds generateAndSetupUserCredsFunc
+	daemonService             *daemon.Service
 }
 
 func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient, generateAndSetupUserCreds generateAndSetupUserCredsFunc) (service *mockTSHDEventsService) {
@@ -360,11 +362,20 @@ func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient,
 
 // Relogin simulates the act of the user logging in again in the Electron app by replacing the user
 // cert on disk with a valid one.
-func (c *mockTSHDEventsService) Relogin(context.Context, *api.ReloginRequest) (*api.ReloginResponse, error) {
+func (c *mockTSHDEventsService) Relogin(ctx context.Context, req *api.ReloginRequest) (*api.ReloginResponse, error) {
 	c.reloginCallCount.Add(1)
 
 	// Generate valid certs with the default TTL.
 	c.generateAndSetupUserCreds(c.t, c.tc, 0 /* ttl */)
+	if c.daemonService != nil {
+		clusterURI, err := uri.Parse(req.RootClusterUri)
+		if err != nil {
+			return nil, err
+		}
+		if err := c.daemonService.ClearCachedClientsForRoot(clusterURI); err != nil {
+			return nil, err
+		}
+	}
 
 	return &api.ReloginResponse{}, nil
 }

@gzdunek (Contributor, Author) commented Nov 4, 2025

Reading from disk on each cache read still feels like a lot of overhead, though I can't assess whether it's significant enough or not. At least there's singleflight.Group for concurrent cache reads, but it still feels like a lot of disk IO.

I think I'll abandon this solution. It turns out that my fix of reading only the TLS certificate doesn't work when you relogin as a different user: it still returns the certificate for the previous user (because tc.Profile() still points to the previous username).
To fix this properly, we'd need to re-read the profile first and then load the certificate for the correct user, which doubles the number of calls.

But I also looked closer into how VNet uses the client cache, and we do indeed use it a lot. Reading from disk on every cache.Get makes it pretty inefficient.

Instead, I'm going to fix #44059 by clearing the cache after login finishes.
As for reacting to external changes to the tsh directory, I will add an RPC to the tsh daemon to invalidate the clients when the watcher detects a change. Previously I was worried about potential race conditions, but now I think it should be fine.

@gzdunek gzdunek closed this Nov 4, 2025
@gzdunek gzdunek deleted the gzdunek/invalidate-cached-client-when-cert-changes branch December 23, 2025 16:00

Development

Successfully merging this pull request may close these issues:

Connect: Clear client cache only after a successful login

3 participants