lib/teleterm: Do not reissue user certs before issuing gateway cert#62096
lib/teleterm: Do not reissue user certs before issuing gateway cert#62096
Conversation
There was a problem hiding this comment.
@gzdunek Do you remember why we clear stale clients before checking if access has changed rather than after this check? Off the top of my head I cannot tell what situation this would address.
There was a problem hiding this comment.
Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.
My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.
There was a problem hiding this comment.
Reject requests from a stale client as in a client with a stale cert in case disconnect_expired_cert is enabled?
There was a problem hiding this comment.
I added this explanation to the comment in code.
gzdunek
left a comment
There was a problem hiding this comment.
Many thanks for fixing it.
I tested opening connections to databases in a root and leaf clusters, work fine.
There was a problem hiding this comment.
Clients should also be cleared when the user logs in again (via tsh), otherwise the auth server might reject requests coming from a stale client. In that case, only the validUntil field would be different.
My intent here was to clear the clients when any property of the profile changes, and then run a more detailed check to detect potential access changes.
hugoShaka
left a comment
There was a problem hiding this comment.
Thanks for the detailed PR explanation, this is very appreciated!
|
@ravicious See the table below for backport results.
|
…62096) * lib/teleterm: Do not ReissueUserCerts before issuing gateway cert * Explain why clearing stale clients needs to be done before check for changed access
Closes #42291.
changelog: Fixed "the client connection is closing" error happening under certain conditions in Teleport Connect when connecting to resources with per-session MFA enabled
This is one of those bugs where the fix is fairly simple in terms of lines changed, but explaining the issue takes way more effort.
The background
I noticed that on both master and 18.5.0, I can no longer connect to databases through a gateway when using my user account with low TTL. When trying to connect to a database, the app would show the per-session MFA prompt, accept my hardware key touch and fail with "grpc: the client connection is closing" immediately after. There was no debug logs in the auth service, indicating that the problem is completely client side.
Shared tsh home
In 18.5.0, we finally made Connect and tsh use the same tsh home dir. The way it works is that Connect starts a file watcher on the dir and on any change loads the profiles from disk and compares them against what's in the memory. It is able to detect certain specific actions (assuming access requests, logging in, logging out) and update the UI accordingly.
When it detects that the state of a profile on disk has changed compared to what's in memory, it clears stale cached cluster clients and syncs cluster details. This is done so that when the user assumes an access request, the app stops using cached cluster clients which still operate with the old user cert with no access requests assumed. Detecting a stale client is done by comparing certificate contents on disk vs what they were during the creation of the client.
Reissuing user certs
In #12293 we made it so that just before certs for a gateway are issued, tshd reissues user certs. This was a long way ago, I think before local proxies stored certs in memory. It was done in order to fix an issue with accessing databases in leaf clusters.
For some reason, when the TTL of a user cert is very short or the cert is almost expired,
lib/client.Client.ReissueUserCertsinflates the remaining TTL. I haven't looked into why it happens but I've documented this in #42291.What's also important here is that we perform two operations with the client: first we refresh user certs on disk and then create local proxy cert in memory.
The problem
The combination of these two things means that when a user with a short TTL tries to connect to a resource, the following sequence of events happens:
The solution
I fixed this by removing the code which reissues user certs before a local proxy cert is issued. Removing those lines does not break integration tests (which test each kind of a gateway, both in the context of a root cluster and a leaf cluster). Refreshing certs on disk is no longer necessary since gateways (AKA local proxies) no longer use certs from disk.
This problem does show that the profile watcher makes certain assumptions about RPC handlers: a handler that updates certs on disk never performs any further operations. This is mostly true for handlers which perform logout or access request assumption. But it wasn't the case for the handlers which create gateways. So the fix essentially removes those handlers from this group by making them not modify certs on disk (since it was not necessary anyway).