
Cache remote clients in Connect #38201

Merged
gzdunek merged 29 commits into master from gzdunek/cache-remote-client
Feb 28, 2024

Conversation

@gzdunek
Contributor

@gzdunek gzdunek commented Feb 14, 2024

1/2 of #15603

Currently, to make a remote API call, we need to connect to a proxy every time, which is time-consuming. To improve performance, we should create clients once and keep them in memory.

This task is handled by RemoteClientCache, which maintains a map of cluster URIs to proxy clients. When a call site needs a proxy client, it calls the cache, which returns an existing client (if there is one) or connects to the proxy, stores the new client, and returns it.

We only add a client to the cache in RemoteClientCache.Get(); it would be ideal to store the client during login, but unfortunately we don't have a *client.ProxyClient there, only a *proxyclient.Client, which is a different type.
We could probably obtain a *client.ProxyClient there somehow; I will look into it separately.

The cache keeps clients for both root and leaf clusters. To get the auth client, call proxyClient.CurrentCluster(); it returns the stored auth client (set during the tc.ConnectToProxy() call).
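
For illustration only, here is a minimal sketch of the caching idea, assuming string cluster URIs, a pluggable connect function, and a simplified Client interface; the real types and package layout in Connect differ.

```go
package clientcache

import (
	"context"
	"sync"
)

// Client stands in for the cached proxy client type.
type Client interface {
	Close() error
}

// Cache maps cluster URIs to connected clients and dials on a cache miss.
type Cache struct {
	mu        sync.Mutex
	clients   map[string]Client
	newClient func(ctx context.Context, clusterURI string) (Client, error)
}

func New(newClient func(ctx context.Context, clusterURI string) (Client, error)) *Cache {
	return &Cache{clients: make(map[string]Client), newClient: newClient}
}

// Get returns a cached client for the cluster URI, or connects,
// stores the new client, and returns it.
func (c *Cache) Get(ctx context.Context, clusterURI string) (Client, error) {
	c.mu.Lock()
	if clt, ok := c.clients[clusterURI]; ok {
		c.mu.Unlock()
		return clt, nil
	}
	c.mu.Unlock()

	// Dial outside the lock so a slow connection doesn't block other lookups.
	clt, err := c.newClient(ctx, clusterURI)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Another goroutine may have connected in the meantime; prefer its client.
	if existing, ok := c.clients[clusterURI]; ok {
		_ = clt.Close()
		return existing, nil
	}
	c.clients[clusterURI] = clt
	return clt, nil
}
```

A call site would then do clt, err := cache.Get(ctx, clusterURI) and reuse the underlying connection on every subsequent call.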

Please let me know if the description above is unclear.

Testing

I wanted to write unit tests for RemoteClientCache, but I didn't find a good way to mock tc.ConnectToProxy (probably because of my lack of experience with the tsh code), which wants to make a remote call. In the end, I decided to add an integration test, which has the advantage of testing the flow end-to-end.

Part 2/2 (actual use of the cache): #38202.

Changelog: Improve performance of remote calls in Teleport Connect.

@gzdunek gzdunek added the backport/branch/v13 and no-changelog (indicates that a PR does not require a changelog entry) labels Feb 14, 2024
@gzdunek gzdunek force-pushed the gzdunek/cache-remote-client branch from f30979c to 111651e Compare February 14, 2024 11:22
Contributor Author

@gzdunek gzdunek Feb 14, 2024

In the web sessions cache we also compare the remote site:

return remoteClt.ClientI, ok && remoteClt.RemoteSite == site
Why is it needed, and when can it change? Do we need it here too?

@gzdunek gzdunek force-pushed the gzdunek/cache-remote-client branch from 111651e to 928bae2 Compare February 14, 2024 11:28
@gzdunek gzdunek marked this pull request as ready for review February 14, 2024 11:49
@gzdunek gzdunek requested review from espadolini, ravicious and rosstimothy and removed request for atburke and smallinsky February 14, 2024 11:52
@gzdunek
Contributor Author

gzdunek commented Feb 14, 2024

@rosstimothy @espadolini I added you here because I saw you were pushing some improvements to the remote client cache in web/sessions, which I relied on.

@rosstimothy
Contributor

I would recommend converting from tc.ConnectToProxy() to tc.ConnectToCluster() and using a lib/client/ClusterClient instead of a lib/client/ProxyClient to connect and communicate to the cluster.

ConnectToProxy is the legacy method, which connects to the Proxy via SSH. ConnectToCluster is the replacement that connects to the Proxy via gRPC. The SSH handshake in the former requires several round trips between the client and the Proxy to establish a connection, which makes connection latency poor, especially if the client and Proxy are not located near each other. The gRPC mechanism reduces connection latency and should help boost Connect performance when there is no client in the cache.
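
A hedged sketch of the suggested switch; the exact lib/client signatures may differ between Teleport versions, so treat this as illustrative rather than the final code.

```go
package example

import (
	"context"

	"github.com/gravitational/teleport/lib/client"
	"github.com/gravitational/trace"
)

// dialCluster shows the shape of the change: instead of tc.ConnectToProxy(ctx),
// which performs an SSH handshake with the Proxy (several client<->Proxy round
// trips), dial the Proxy over gRPC with ConnectToCluster.
func dialCluster(ctx context.Context, tc *client.TeleportClient) (*client.ClusterClient, error) {
	clusterClient, err := tc.ConnectToCluster(ctx)
	if err != nil {
		return nil, trace.Wrap(err)
	}
	return clusterClient, nil
}
```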

Comment on lines +74 to +92
// Check if we already have a cached client for this cluster and a valid cert.
// Theoretically, we don't have to check the validity
// because auth server will disconnect expired clients.
// However, by default, it isn't done immediately
// (until it is enabled with disconnect_expired_cert on cluster) and to keep
// the previous behavior, we need to check the cert.
//
// If the cert is expired, we will remove the client and make an attempt
// to connect to proxy which will fail with an appropriate error.
proxyClient := c.getFromCache(clusterURI)
if proxyClient != nil {
	if cluster.Connected() {
		return proxyClient, nil
	}
	err := c.InvalidateForRootCluster(clusterURI.GetRootClusterURI())
	if err != nil {
		c.log.WithError(err).Errorf("Failed to invalidate expired remote client for %q.", clusterURI)
	}
}
Member

What's the benefit of invalidating the client on Get?

As it is, each call to get a client will incur the overhead of c.ResolveCluster (which reads from the tsh home dir) in order to check if the cert is valid. But couldn't we make Get fairly simple and move the invalidation logic to places where we're certain that invalidation needs to happen?

Off the top of my head, Logout would need to remove the client from the cache while Login and AssumeRole would need to replace any old client with a new one.
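
A rough sketch of that approach, using hypothetical names rather than the actual teleterm API: Get stays a plain map lookup plus dial, and the places that change credentials clear or replace the cached entry.

```go
package example

import "context"

// clientCache is a hypothetical view of the cache used below.
type clientCache interface {
	ClearForRoot(rootClusterURI string) error
}

// logoutAndClear drops any cached clients for the root cluster (and its leaves)
// before logging out, so later calls can't reuse connections backed by
// soon-to-be-revoked certs. Login and AssumeRole would similarly replace the
// old client with a freshly dialed one.
func logoutAndClear(ctx context.Context, cache clientCache, rootClusterURI string, doLogout func(context.Context, string) error) error {
	if err := cache.ClearForRoot(rootClusterURI); err != nil {
		return err
	}
	return doLogout(ctx, rootClusterURI)
}
```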

Contributor Author

Actually, in the first version I didn't have that client invalidation here. It was something like this:

if proxyClient := c.getFromCache(clusterURI); proxyClient != nil && cluster.Connected() {
	return proxyClient, nil
}

newProxyClient, err := clusterClient.ConnectToCluster(ctx)
if err != nil {
...

The previous client would be closed when adding a new client or by the remote server, but then I thought that it would be good to close the connection immediately and added that InvalidateForRootCluster call.

But in general, do we need to check whether the cert is valid? Probably not; the auth server will terminate the session anyway.

Contributor

I agree, I don't think there's a lot of value in client-side checking of certificate expiration; only the auth server's point of view matters, after all.

Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
Comment thread integration/teleterm_test.go Outdated
Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
@gzdunek
Contributor Author

gzdunek commented Feb 14, 2024

I would recommend converting from tc.ConnectToProxy() to tc.ConnectToCluster() and using a lib/client/ClusterClient instead of a lib/client/ProxyClient to connect and communicate to the cluster.
ConnectToProxy is the legacy method which connects to the Proxy via SSH. ConnectToCluster is the replacement that connects to the Proxy via gRPC.

@rosstimothy thanks, I didn't know about this difference. I think it is a good idea to convert it, but I have two questions:

  • How can I handle session termination? Currently, I wait for client.Wait and then remove the client; when there is another call to GetRemoteClient, I can create a new connection (see the sketch after this list).
  • I don't see an equivalent of proxyClient.ReissueUserCerts; should it be ported from there? For now, I could probably keep using lib/client/ProxyClient for the parts that touch certs.
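
A sketch of the Wait-based eviction mentioned in the first bullet, assuming the cached client exposes a Wait method that blocks until the connection terminates; names are illustrative only.

```go
package example

import "sync"

// waiter is a hypothetical view of a client whose Wait blocks until the
// underlying connection is closed.
type waiter interface {
	Wait() error
}

// evictOnClose removes the client from the cache once its connection dies,
// so the next Get dials a fresh one.
func evictOnClose(mu *sync.Mutex, clients map[string]waiter, clusterURI string, clt waiter) {
	go func() {
		_ = clt.Wait() // returns when the session/connection terminates
		mu.Lock()
		defer mu.Unlock()
		// Only evict if the cached entry is still this client.
		if clients[clusterURI] == clt {
			delete(clients, clusterURI)
		}
	}()
}
```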

@rosstimothy
Contributor

  • How can I handle session termination? Currently, I wait for client.Wait and then remove the client; when there is another call to GetRemoteClient, I can create a new connection.

Why do we need this behavior? Shouldn't we be preventing the cluster client from ever being closed until the user either explicitly removes the cluster, or the user's credentials are refreshed and a new cluster replaces the old one?

  • I don't see an equivalent of proxyClient.ReissueUserCerts; should it be ported from there? For now, I could probably keep using lib/client/ProxyClient for the parts that touch certs.

Hrmm there used to be an unexported equivalent implemented on the ClusterClient. We can definitely improve the MFA and credential refresh experience for the ClusterClient.

@ravicious
Member

For now, I could probably keep using lib/client/ProxyClient for parts that touch certs.

That sounds reasonable if porting it to ClusterClient isn't straightforward.

@gzdunek
Contributor Author

gzdunek commented Feb 15, 2024

How can I handle session termination? Currently, I wait for client.Wait and then remove the client; when there is another call to GetRemoteClient, I can create a new connection.

Why do we need this behavior? Shouldn't we be preventing the cluster client from ever being closed until the user either explicitly removes the cluster, or the user's credentials are refreshed and a new cluster replaces the old one?

I was afraid that if the cluster becomes unavailable or the connection goes down, the client would be stuck in a "broken" state, unable to make new calls.
But from what I see, ClusterClient works fine after I started the cluster again or re-enabled my internet connection.

Comment thread lib/teleterm/services/remoteclientcache/remoteclientcache.go Outdated
@gzdunek
Contributor Author

gzdunek commented Feb 15, 2024

In the end I decided to keep caching ProxyClient. It will allow us to make almost every call faster today (except for the initial call when the cache is empty), in contrast to ClusterClient, which lacks cert reissuing APIs.
Tim started deprecating ProxyClient, but this work will probably be done only for v16+, so it would be a bit problematic to backport these client caching PRs to v13-v15 (if we would also like to speed up reissuing certs).

In v16, we should have methods to reissue user certs (and some other issues fixed) in ClusterClient, so I will go back to this and switch the cache to use ClusterClient instead of ProxyClient.
It should be simple work, mostly replacing one client with another.

Member

@ravicious ravicious left a comment

Looks fine to me, I'll give it a try once I get around to reviewing the other PR.

Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
Comment thread lib/teleterm/daemon/config.go
Comment thread lib/teleterm/daemon/daemon.go Outdated
Comment thread lib/teleterm/services/clientcache/clientcache.go
Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
Comment thread lib/teleterm/services/clientcache/clientcache.go Outdated
gzdunek and others added 2 commits February 23, 2024 17:05
* Replace all simple `c.clusterClient.ConnectToProxy()` calls

* Use cached proxy client to create gateways

* Use cached proxy client to assume roles

* Invalidate clients when logging in and out

* Gracefully handle expired cert error returned by the server

* Drop `GetRootClusterURI` in headless auth watcher since URIs are already root URIs

* Simplify error check

* Make `auth.ClientI` parameter naming more consistent, use `root` prefix when needed

* Reduce error scope where possible

* Clear cached clients before passwordless login

* Use `fakeClientCache` without pointers

* Move separate `proxyClient` parameter to `CreateGatewayParams` in the gateways code
@gzdunek gzdunek removed the backport/branch/v13 and no-changelog (indicates that a PR does not require a changelog entry) labels Feb 28, 2024
@gzdunek gzdunek added this pull request to the merge queue Feb 28, 2024
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 28, 2024
@gzdunek gzdunek added this pull request to the merge queue Feb 28, 2024
Merged via the queue into master with commit 39f9951 Feb 28, 2024
@gzdunek gzdunek deleted the gzdunek/cache-remote-client branch February 28, 2024 08:10
@public-teleport-github-review-bot

@gzdunek See the table below for backport results.

Branch | Result
branch/v15 | Failed

gzdunek added a commit that referenced this pull request Mar 6, 2024
* Add remote client cache

* Add an integration test

* Close all clients when stopping the service

* Move RemoteClientCache to the place where it is used

* Do not check client cert in `Get`

* Fix code style issues

* Prevent potential race condition when removing a cached client

* Test concurrent calls to `Get`

* Add TODO

* `remoteclientcache` -> `clientcache`

* Reduce the `err` scope

* Move `Config` closer to `New` and docs

* Fix lint

* Improve logging and error handling

* Add missing comments

* `Close` -> `Clear`

* Improve the test

* Remove mentions about "remote" client

* Pass `cfg` directly to `Cache`

* `InvalidateForRootCluster` -> `ClearForRootCluster`

* Add docs for the interface

* `ClearForRootCluster` -> `ClearForRoot`

* Add config validation

* Log multiple fields at once

* Improve setting logger

* Use cached remote clients in Connect (#38202)

* Replace all simple `c.clusterClient.ConnectToProxy()` calls

* Use cached proxy client to create gateways

* Use cached proxy client to assume roles

* Invalidate clients when logging in and out

* Gracefully handle expired cert error returned by the server

* Drop `GetRootClusterURI` in headless auth watcher since URIs are already root URIs

* Simplify error check

* Make `auth.ClientI` parameter naming more consistent, use `root` prefix when needed

* Reduce error scope where possible

* Clear cached clients before passwordless login

* Use `fakeClientCache` without pointers

* Move separate `proxyClient` parameter to `CreateGatewayParams` in the gateways code

* Replace checking error string with `client.ErrClientCredentialsHaveExpired`

(cherry picked from commit 39f9951)
github-merge-queue Bot pushed a commit that referenced this pull request Mar 7, 2024
* Cache remote clients in Connect  (#38201)

* Add remote client cache

* Add an integration test

* Close all clients when stopping the service

* Move RemoteClientCache to the place where it is used

* Do not check client cert in `Get`

* Fix code style issues

* Prevent potential race condition when removing a cached client

* Test concurrent calls to `Get`

* Add TODO

* `remoteclientcache` -> `clientcache`

* Reduce the `err` scope

* Move `Config` closer to `New` and docs

* Fix lint

* Improve logging and error handling

* Add missing comments

* `Close` -> `Clear`

* Improve the test

* Remove mentions about "remote" client

* Pass `cfg` directly to `Cache`

* `InvalidateForRootCluster` -> `ClearForRootCluster`

* Add docs for the interface

* `ClearForRootCluster` -> `ClearForRoot`

* Add config validation

* Log multiple fields at once

* Improve setting logger

* Use cached remote clients in Connect (#38202)

* Replace all simple `c.clusterClient.ConnectToProxy()` calls

* Use cached proxy client to create gateways

* Use cached proxy client to assume roles

* Invalidate clients when logging in and out

* Gracefully handle expired cert error returned by the server

* Drop `GetRootClusterURI` in headless auth watcher since URIs are already root URIs

* Simplify error check

* Make `auth.ClientI` parameter naming more consistent, use `root` prefix when needed

* Reduce error scope where possible

* Clear cached clients before passwordless login

* Use `fakeClientCache` without pointers

* Move separate `proxyClient` parameter to `CreateGatewayParams` in the gateways code

* Replace checking error string with `client.ErrClientCredentialsHaveExpired`

(cherry picked from commit 39f9951)

* Temporarily disable flaky part of `TestClientCache` (#38798)

(cherry picked from commit 7067a88)

* Make calls to the auth server concurrently (#38955)

* Make calls to the auth server concurrently

* Enhance the comment about preferences upsert

* Summarize how `userpreferences.Update` works

(cherry picked from commit ddc45a4)