
Invalidate cached client when TLS cert changes #59370

Closed

gzdunek wants to merge 4 commits into master from gzdunek/invalidate-cached-client-when-cert-changes

Conversation

@gzdunek (Contributor) commented Sep 19, 2025

Closes #44059
Contributes to #25806

As described in the first issue, a relogin can race with another thread retrieving the client from the cache, which may result in the cache retaining an outdated client.

The problem with invalidating the cache will become even more visible once tsh and Connect start sharing the ~/.tsh directory. If a user logs in or assumes a role via tsh, Connect will receive a file system event and must decide whether to clear the cache (and I suspect we can run into some edge cases there).
Additionally, I'd like to avoid exposing any ClientCache-related functionality outside of tsh, since it's really just an implementation detail. Even now, it's cumbersome to remember to clear the cache after logging in or assuming a role.

A better option is to shift that responsibility to the ClientCache itself. On any get request, it can verify that the TLS certificate hasn't changed since the client was created. This approach resolves the linked issue and eliminates the need to manually clear the cache (except on logout).

@gzdunek gzdunek added no-changelog Indicates that a PR does not require a changelog entry backport/branch/v16 backport/branch/v17 backport/branch/v18 labels Sep 19, 2025
@github-actions github-actions bot added size/sm tsh tsh - Teleport's command line tool for logging into nodes running Teleport. labels Sep 19, 2025
@github-actions github-actions bot requested review from boxofrad and r0mant September 19, 2025 15:04
Comment thread lib/client/clientcache/clientcache.go Outdated
return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
@gzdunek (Contributor, Author) commented Sep 19, 2025:

It seemed to me that comparing the TLS cert is the easiest way to detect whether a new client should be created (since it was used to create the client), but I'm open to suggestions.

Comment thread lib/client/clientcache/clientcache.go Outdated
Comment on lines +246 to +256
tc, err := c.cfg.NewClientFunc(ctx, profile, "")
if err != nil {
return false, trace.Wrap(err)
}

keyRing, err := tc.LocalAgent().GetCoreKeyRing()
if err != nil {
return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
Contributor commented:

My one concern is that this seems to do quite a bit of extra work just to get the latest TLS cert and see whether it has changed: loading the full profile, parsing and validating certificates; I think it may even try to add SSH keys to the system key agent. I'm not sure this will be tolerable on every cache get; I've been calling it as if it were basically a map lookup. E.g. VNet may call into this multiple times while resolving a DNS query, and generally uses the clientcache quite a bit.

What you have is very general, since it only uses the existing NewClientFunc, but I wonder how hard it would be to pass in a more targeted way to read just the current TLS cert.

@gzdunek (Contributor, Author) replied:

Ah, that's a good point. I wasn't aware that VNet relies so heavily on the client cache (though the regular tsh daemon will benefit from this as well).

I've updated the code to only read the TLS cert.

Member commented:

Reading from disk on each cache read still feels like a lot of overhead, though I can't assess whether it's significant enough or not. At least there's singleflight.Group for concurrent cache reads, but it still feels like a lot of disk IO.

@gzdunek gzdunek requested a review from nklaassen September 23, 2025 09:08
}

type clientWithCert struct {
// client is cluster client.
Member commented:

You can drop this comment hehe.

type clientWithCert struct {
// client is cluster client.
client *client.ClusterClient
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
Member commented:

Suggested change
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
// coreTLSCert is the cert used in [client.TeleportClient.ConnectToCluster] to create the client.

or maybe even

Suggested change
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
// coreTLSCert is the contents of the cert at the time of creating the client.

// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
coreTLSCert []byte
// readCoreTLSCert reads a fresh cert from disk.
readCoreTLSCert func() ([]byte, error)
Member commented:

I don't think it's very Go-like; I assume the more idiomatic way to do it is to create a profile interface with a TLSCert method.

return fromCache, nil
unchanged, err := fromCache.isCoreTLSCertUnchanged()
if err != nil {
c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
Member commented:

Suggested change
c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
c.cfg.Logger.WarnContext(ctx, "Failed to check if TLS certificate has changed, removing client from cache", "cluster", k, "error", err)

c.cfg.Logger.DebugContext(ctx, "Retrieved client from cache", "cluster", k)
return fromCache.client, nil
} else {
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
Member commented:

Suggested change
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing client from cache", "cluster", k)

return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
Member commented:

Does the removal of these constitute a significant change in behavior?

In the previous version, after logging in or assuming a role, all clients created so far would be closed. With the new behavior, no client will be closed after one of those operations until another RPC attempts to get a client from the cache.

Does this have any significance? It feels like it could mostly affect leaf clients, as those might remain open well after a relogin / role assumption.


return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
Member commented:

In #59760, I tried re-enabling the client cache in tests. Currently there's a test which just straight up fails (#59760 (comment)). I think it's because in the real world, we depend on the cache being cleared during the login RPCs. In tests however we circumvent those RPCs completely (because we don't have access to user credentials so we depend on test helpers to generate new certs on disk).

The test passes when I manually clear the cache on relogin (see the diff). I think once this PR is merged, and the cache no longer depends on being manually cleared, we should be able to re-enable it in integration tests without workarounds like the one in the diff.

Diff
diff --git a/integration/proxy/teleterm_test.go b/integration/proxy/teleterm_test.go
index ee19d22c1d2..02b5112a9cb 100644
--- a/integration/proxy/teleterm_test.go
+++ b/integration/proxy/teleterm_test.go
@@ -264,6 +264,7 @@ func testGatewayCertRenewal(ctx context.Context, t *testing.T, params gatewayCer
 	t.Cleanup(func() {
 		daemonService.Stop()
 	})
+	tshdEventsService.daemonService = daemonService
 
 	// Connect the daemon to the tshd events service, like it would
 	// during normal initialization of the app.
@@ -320,6 +321,7 @@ type mockTSHDEventsService struct {
 	sendNotificationCallCount atomic.Uint32
 	promptMFACallCount        atomic.Uint32
 	generateAndSetupUserCreds generateAndSetupUserCredsFunc
+	daemonService             *daemon.Service
 }
 
 func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient, generateAndSetupUserCreds generateAndSetupUserCredsFunc) (service *mockTSHDEventsService) {
@@ -360,11 +362,20 @@ func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient,
 
 // Relogin simulates the act of the user logging in again in the Electron app by replacing the user
 // cert on disk with a valid one.
-func (c *mockTSHDEventsService) Relogin(context.Context, *api.ReloginRequest) (*api.ReloginResponse, error) {
+func (c *mockTSHDEventsService) Relogin(ctx context.Context, req *api.ReloginRequest) (*api.ReloginResponse, error) {
 	c.reloginCallCount.Add(1)
 
 	// Generate valid certs with the default TTL.
 	c.generateAndSetupUserCreds(c.t, c.tc, 0 /* ttl */)
+	if c.daemonService != nil {
+		clusterURI, err := uri.Parse(req.RootClusterUri)
+		if err != nil {
+			return nil, err
+		}
+		if err := c.daemonService.ClearCachedClientsForRoot(clusterURI); err != nil {
+			return nil, err
+		}
+	}
 
 	return &api.ReloginResponse{}, nil
 }

@gzdunek (Contributor, Author) commented Nov 4, 2025

Reading from disk on each cache read still feels like a lot of overhead, though I can't assess whether it's significant enough or not. At least there's singleflight.Group for concurrent cache reads, but it still feels like a lot of disk IO.

I think I'll abandon this solution. It turns out that my fix of reading only the TLS certificate doesn't work when you relogin as a different user: it still returns the certificate for the previous user (because tc.Profile() still points to the previous username).
To fix this properly, we'd need to re-read the profile first and then load the certificate for the correct user, which doubles the number of calls.

But I also looked closer into how VNet uses the client cache, and we do indeed use it a lot. Reading from disk on every cache.Get makes it pretty inefficient.

Instead, I'm going to fix #44059 by clearing the cache after login finishes.
As for reacting to external changes to the tsh directory, I will add an RPC to the tsh daemon to invalidate the clients when the watcher detects a change. Previously I was worried about potential race conditions, but now I think it should be fine.

@gzdunek gzdunek closed this Nov 4, 2025
@gzdunek gzdunek deleted the gzdunek/invalidate-cached-client-when-cert-changes branch December 23, 2025 16:00

Development

Successfully merging this pull request may close these issues:

Connect: Clear client cache only after a successful login

3 participants