Invalidate cached client when TLS cert changes (#59370)

Conversation
```go
	return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
```
It seemed to me that comparing the TLS cert is the easiest way to detect if a new client should be created (since it was used to create the client), but I'm open to suggestions.
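To make the comparison concrete, the idea can be sketched as a self-contained example. The `clientEntry` type and `isStale` method below are illustrative names, not the actual Teleport code:

```go
package main

import "bytes"

// clientEntry pairs a cached client with the TLS cert bytes that were
// current when the client was created. Names are illustrative only.
type clientEntry struct {
	client      any
	coreTLSCert []byte
}

// isStale reports whether a freshly read cert differs from the one the
// cached client was created with. A byte-for-byte comparison suffices,
// since any relogin or role assumption rewrites the cert on disk.
func (e *clientEntry) isStale(freshCert []byte) bool {
	return !bytes.Equal(e.coreTLSCert, freshCert)
}
```

A stale entry would then be evicted and a new client created in its place.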
```go
tc, err := c.cfg.NewClientFunc(ctx, profile, "")
if err != nil {
	return false, trace.Wrap(err)
}

keyRing, err := tc.LocalAgent().GetCoreKeyRing()
if err != nil {
	return false, trace.Wrap(err)
}

if bytes.Equal(coreTLSCertFromCache, keyRing.TLSCert) {
```
my one concern is that this seems to do quite a bit of extra work just to get the latest TLS cert and see if it has changed: loading the full profile, parsing and validating certificates, and I think it may even try to add SSH keys to the system key agent. I'm not sure this will be tolerable on every cache get; I've been calling it as if it were basically a map lookup. E.g. VNet may call into this multiple times while resolving a DNS query, and generally uses the client cache quite a bit.
What you have is very general, since it only uses the existing NewClientFunc, but I wonder how hard it would be to pass in a more targeted way to just read the current TLS cert.
Ah, that's a good point. I wasn't aware that VNet relies so heavily on the client cache (though the regular tsh daemon will benefit from this as well).
I've updated the code to only read the TLS cert.
Reading from disk on each cache read still feels like a lot of overhead, though I can't assess whether it's significant enough or not. At least there's singleflight.Group for concurrent cache reads, but it still feels like a lot of disk IO.
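For context, singleflight.Group comes from golang.org/x/sync/singleflight; it coalesces concurrent calls for the same key into a single execution. The sketch below is my own stdlib-only simplification of that idea, not the real package:

```go
package main

import "sync"

// group coalesces concurrent Do calls with the same key into one
// execution of fn; later callers wait and reuse the first result.
// Minimal sketch in the spirit of singleflight.Group.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	wg  sync.WaitGroup
	val []byte
	err error
}

func (g *group) Do(key string, fn func() ([]byte, error)) ([]byte, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		// Another goroutine is already reading; wait for its result.
		c.wg.Wait()
		return c.val, c.err
	}
	c := new(call)
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val, c.err
}
```

This deduplicates simultaneous disk reads, but as noted it does nothing for sequential gets, each of which still pays for a read.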
Lack of the `Dir` caused the cert path to be incorrect.
```go
}

type clientWithCert struct {
	// client is cluster client.
```
You can drop this comment hehe.
```go
type clientWithCert struct {
	// client is cluster client.
	client *client.ClusterClient
	// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
```
Suggested change:

```diff
-	// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
+	// coreTLSCert is the cert used in [client.TeleportClient.ConnectToCluster] to create the client.
```
or maybe even

```diff
-	// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
+	// coreTLSCert is the contents of the cert at the time of creating the client.
```
```go
// coreTLSCert is the cert used in TeleportClient.ConnectToCluster to create the client.
coreTLSCert []byte
// readCoreTLSCert reads a fresh cert from disk.
readCoreTLSCert func() ([]byte, error)
```
I don't think it's very Go-like; I assume the more idiomatic way to do it is to create a profile interface with a TLSCert method.
```go
	return fromCache, nil
unchanged, err := fromCache.isCoreTLSCertUnchanged()
if err != nil {
	c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
```
Suggested change:

```diff
-	c.cfg.Logger.WarnContext(ctx, "Failed to validate TLS certificate, removing from cache", "cluster", k, "error", err)
+	c.cfg.Logger.WarnContext(ctx, "Failed to check if TLS certificate has changed, removing client from cache", "cluster", k, "error", err)
```
```go
	c.cfg.Logger.DebugContext(ctx, "Retrieved client from cache", "cluster", k)
	return fromCache.client, nil
} else {
	c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
```
Suggested change:

```diff
-	c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing from cache", "cluster", k)
+	c.cfg.Logger.DebugContext(ctx, "TLS certificate for cached client has changed, removing client from cache", "cluster", k)
```
```go
	return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
```
Does the removal of these constitute a significant change in behavior?
In the previous version, after logging in or assuming a role, all clients created so far would be closed. With the new behavior, no client will be closed after one of those operations until another RPC attempts to get a client from the cache.
Does this have any significance? It feels like it mostly could affect leaf clients as those might continue to not be closed well after a relogin / role assumption.
```go
	return nil, trace.BadParameter("cluster URI must be a root URI")
}

if err = s.DaemonService.ClearCachedClientsForRoot(cluster.URI); err != nil {
```
In #59760, I tried re-enabling the client cache in tests. Currently there's a test which just straight up fails (#59760 (comment)). I think it's because in the real world, we depend on the cache being cleared during the login RPCs. In tests however we circumvent those RPCs completely (because we don't have access to user credentials so we depend on test helpers to generate new certs on disk).
The test passes when I manually clear the cache on relogin (see the diff). I think once this PR is merged and when the cache no longer depends on being manually cleared, then we should be able to re-enable it in integration tests without such workarounds like the one in the diff.
Diff

```diff
diff --git a/integration/proxy/teleterm_test.go b/integration/proxy/teleterm_test.go
index ee19d22c1d2..02b5112a9cb 100644
--- a/integration/proxy/teleterm_test.go
+++ b/integration/proxy/teleterm_test.go
@@ -264,6 +264,7 @@ func testGatewayCertRenewal(ctx context.Context, t *testing.T, params gatewayCer
 	t.Cleanup(func() {
 		daemonService.Stop()
 	})
+	tshdEventsService.daemonService = daemonService

 	// Connect the daemon to the tshd events service, like it would
 	// during normal initialization of the app.
@@ -320,6 +321,7 @@ type mockTSHDEventsService struct {
 	sendNotificationCallCount atomic.Uint32
 	promptMFACallCount        atomic.Uint32
 	generateAndSetupUserCreds generateAndSetupUserCredsFunc
+	daemonService             *daemon.Service
 }

 func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient, generateAndSetupUserCreds generateAndSetupUserCredsFunc) (service *mockTSHDEventsService) {
@@ -360,11 +362,20 @@ func newMockTSHDEventsServiceServer(t *testing.T, tc *libclient.TeleportClient,

 // Relogin simulates the act of the user logging in again in the Electron app by replacing the user
 // cert on disk with a valid one.
-func (c *mockTSHDEventsService) Relogin(context.Context, *api.ReloginRequest) (*api.ReloginResponse, error) {
+func (c *mockTSHDEventsService) Relogin(ctx context.Context, req *api.ReloginRequest) (*api.ReloginResponse, error) {
 	c.reloginCallCount.Add(1)

 	// Generate valid certs with the default TTL.
 	c.generateAndSetupUserCreds(c.t, c.tc, 0 /* ttl */)

+	if c.daemonService != nil {
+		clusterURI, err := uri.Parse(req.RootClusterUri)
+		if err != nil {
+			return nil, err
+		}
+		if err := c.daemonService.ClearCachedClientsForRoot(clusterURI); err != nil {
+			return nil, err
+		}
+	}
+
 	return &api.ReloginResponse{}, nil
 }
```
I think I'll abandon this solution. It turns out that my fix for reading only the TLS certificate doesn't work when you relogin as a different user, as it still returns the certificate for the previous user. But I also looked closer into how VNet uses the client cache, and well, we indeed use the client cache a lot, so reading from disk on every cache get would add too much overhead. Instead, I'm going to fix #44059 by clearing the cache after login finishes.
Closes #44059
Contributes to #25806
As described in the first issue, a relogin can race with another thread retrieving the client from the cache, which may result in the cache retaining an outdated client.
The problem with invalidating the cache will become even more visible once tsh and Connect start sharing the ~/.tsh directory. If a user logs in or assumes a role via tsh, Connect will receive a file system event and must decide whether to clear the cache (and I suspect we can run into some edge cases there).
Additionally, I'd like to avoid exposing any `ClientCache`-related functionality outside of tsh, since it's really just an implementation detail. Even now, it's cumbersome to remember to clear the cache after login or assuming a role. A better option is to shift that responsibility to the `ClientCache` itself. On any get request, it can verify that the TLS certificate hasn't changed since the client was created. This approach resolves the linked issue and eliminates the need to manually clear the cache (except on logout).
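The get path described above can be sketched roughly as follows. All names, and the `string` stand-in for the client, are illustrative; the real `ClientCache` is more involved (it also coalesces concurrent gets and closes evicted clients):

```go
package main

import (
	"bytes"
	"sync"
)

// cache validates the TLS cert on every Get: if the cert on disk no
// longer matches the cert the cached client was created from, the
// entry is considered stale and a new client is built.
type cache struct {
	mu      sync.Mutex
	entries map[string]entry
	// readCert returns the cert currently on disk for a cluster.
	readCert func(cluster string) ([]byte, error)
	// newClient builds a client and returns the cert it was built with.
	newClient func(cluster string) (string, []byte, error)
}

type entry struct {
	client string
	cert   []byte
}

func (c *cache) Get(cluster string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	fresh, err := c.readCert(cluster)
	if err != nil {
		return "", err
	}
	if e, ok := c.entries[cluster]; ok && bytes.Equal(e.cert, fresh) {
		// Cert unchanged since the client was created: reuse it.
		return e.client, nil
	}
	// No entry, or the cert changed (relogin / role assumption):
	// build a new client and cache it alongside its cert.
	client, cert, err := c.newClient(cluster)
	if err != nil {
		return "", err
	}
	if c.entries == nil {
		c.entries = make(map[string]entry)
	}
	c.entries[cluster] = entry{client: client, cert: cert}
	return client, nil
}
```

With this shape, a relogin that rewrites the cert on disk is picked up by the very next Get, with no external cache-clearing call required.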