Connect: Detect & reissue expired db certs#17950
Conversation
9bb2795 to
9338cf1
Compare
499797b to
f9dfbf5
Compare
d094631 to
f8ff02f
Compare
There was a problem hiding this comment.
Here we pass a structured message as the reason so that the Electron app can show an appropriately formatted message in the login modal.
There was a problem hiding this comment.
Same with the notification subject: instead of accepting a string, we accept a structured message and let the Electron app handle exact message in the UI.
There was a problem hiding this comment.
This is the method that implements the logic described in the PR description.
There was a problem hiding this comment.
Maybe then rename it to something like ResolveClusterFromStorage?
There was a problem hiding this comment.
Maybe then rename it to something like
ResolveClusterFromStorage?
Good idea! I'll create a separate PR for that to limit the number of unrelated changes in this one.
alpnproxy.LocalProxy.CheckDBCerts already emits a log statement, so this line was just noise.
This is irrelevant to this PR but I figured I'd do this while I'm editing this file.
This will let us create expired user certs by providing a negative TTL.
f8ff02f to
3dc875a
Compare
2f21e15 to
90cca1e
Compare
GavinFrazar
left a comment
There was a problem hiding this comment.
Looks good! Just have a few nits and suggestions
Co-authored-by: Gavin Frazar <gavin.frazar@goteleport.com>
| "github.com/gravitational/teleport/lib/teleterm/daemon" | ||
| ) | ||
|
|
||
| func TestTeletermGatewayCertRenewal(t *testing.T) { |
There was a problem hiding this comment.
This test took 17s on my machine, which is bit heavy. Ideally we could speed it up a bit, but at a minimum we should see if we can make it a parallel test.
There was a problem hiding this comment.
This test took 17s on my machine, which is bit heavy. Ideally we could speed it up a bit, but at a minimum we should see if we can make it a parallel test.
I have some ideas on how to make it faster, I have just asked the db access team about feedback on them.
Is it okay if I address that in a separate PR or should wait with merging this? Making the test faster would likely require changes in shared helpers for which I'd want to get some feedback from the db access team.
GCB says the past two runs of this test took ~19s. I think I could cut this time by about 40%
There was a problem hiding this comment.
I saw 17s as well, but 16.5s of that was just setting up the test clusters. This test doesn't actually need the leaf cluster either, and by commenting out the code that sets up leaf cluster and trust in the setup the time was reduced to ~8 seconds. We should really make that leaf cluster optional.
There was a problem hiding this comment.
Now that I think about it, we should test Connect gateways with a leaf cluster too. I'll move this test into a separate file in integration/proxy and run it as part of TestALPNSNIProxyDatabaseAccess so that the setup time is amortized.
|
@ravicious See the table below for backport results.
|
* Add TTL field to integration/helpers.UserCredsRequest This will let us create expired user certs by providing a negative TTL. * Reissue gateway cert if middleware detects it expired * Add integration test for gateway cert renewal
* Add TTL field to integration/helpers.UserCredsRequest This will let us create expired user certs by providing a negative TTL. * Reissue gateway cert if middleware detects it expired * Add integration test for gateway cert renewal
Closes gravitational/webapps.e#299 & closes gravitational/webapps.e#360.
Needs #18740 to be merged first.
webapps counterpart: gravitational/webapps#1383
Out of 2332 additions, ~1500 are proto updates and 481 are tests, leaving us with about 350 lines of actual implementation code. Best reviewed commit by commit.
Why
If the user attempts to make a connection through a local db proxy but the db & user certs have expired, we want Connect to bring focus to the Connect app, let the user relogin and then proxy the connection.
This is how the end result looks like:
connect-cert-renewal.mov
#18259 added a custom middleware for
alpnproxy.LocalProxywhich checks the cert used by the gateway when there's an incoming connection. If the cert has expired, the middleware fires the callback it received.#18259 added that middleware but it wasn't actually used in lib/teleterm. This PR makes use of that middleware to implement the business logic required to reissue the gateway cert.
How
The happy path of the whole flow looks like this:
sequenceDiagram participant user as db client participant app participant tsh as gateway goroutine participant cluster note over user,cluster: Expired gateway cert, expired cluster cert user->>tsh: Make a connection to the db proxy activate user tsh->>tsh: Notice that the db cert has expired tsh->>cluster: Reissue the cert cluster->>tsh: Error (expired cluster cert) tsh->>app: Ask for relogin activate app app->>cluster: Log in cluster->>app: Success app->>tsh: Success deactivate app tsh->>cluster: Reissue the cert cluster->>tsh: Success tsh->>user: Proxy the connection deactivate userThe initial call to reissue the certs is done in case the cert used by the gateway has expired but the user cert is valid. This can happen if you have two gateways running. The first one will run through the flow shown above, while the second will go through the one below:
sequenceDiagram participant user as db client participant app participant tsh as gateway goroutine participant cluster note over user,cluster: Expired gateway cert, valid cluster cert user->>tsh: Make a connection to the db proxy activate user tsh->>tsh: Notice that the db cert has expired tsh->>cluster: Reissue the cert cluster->>tsh: Success tsh->>user: Proxy the connection deactivate userError handling
Any action taken by the gateway goroutine (outgoing arrows) can fail. If it fails, the connection to the db is closed. Since the action is performed in a totally separate goroutine, the user would have no way of seeing the exact error, other than looking at the logs.
To make it clearer for the user as to why the connection has failed, the struct performing the actual action forwards any error to the Electron app so that it can be shown to the user in form of a notification.
Mutex on relogin request
Since we can only show one modal at a time but multiple clients can attempt to make simultaneous connections through proxies set up by Connect, the action of asking the Electron app to show the login modal is guarded by a mutex. If there's already a relogin request in flight, any other attempt will fail immediately with an appropriate error message.