Connect: Add prerequisites for gracefully handling expired db proxy certs#18259
Connect: Add prerequisites for gracefully handling expired db proxy certs#18259
Conversation
Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point.
It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config.
gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway.
ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request.
In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter.
267c998 to
6853363
Compare
This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.
6853363 to
7676f38
Compare
| if cfg.OnExpiredCert != nil { | ||
| localProxyMiddleware.onExpiredCert = func(ctx context.Context) error { | ||
| err := cfg.OnExpiredCert(ctx, gateway) | ||
| return trace.Wrap(err) | ||
| } | ||
| } |
There was a problem hiding this comment.
It's kind of an awkward way to initialize the middleware and add the callback later. However, I want to be able to pass gateway.Gateway as the argument so that's why I first instantiate the middleware, then the gateway and then set the callback.
|
Wow, I feel like I won the lottery with the auto review requests here. :D I wanted to request a review from Gavin anyway as he worked on #16958. |
| // In the future, we're probably going to make this method accept the cert as an arg rather than | ||
| // reading from disk. | ||
| func (g *Gateway) ReloadCert() error { | ||
| g.mu.Lock() |
There was a problem hiding this comment.
Now that I'm looking at this, I'm not sure if this mutex here is needed. alpn.LocalProxy processes connections serially, so there's no risk of two separate goroutines trying to set the cert.
There was a problem hiding this comment.
While reviewing this code I realized I needed a mutex in the local proxy still.
See: #18278
The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.
I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.
That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.
There was a problem hiding this comment.
That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.
Thanks for letting me know. I'll keep the mutex for the mutable fields but remove its usage from ReloadCert.
There was a problem hiding this comment.
Mostly looks good but I think the OnExpiredCert func is never assigned. Your test mocked it so that's why it passes there. There's an integration test, maybe you could add something similar for the teleterm middleware:
teleport/integration/proxy/proxy_test.go
Line 759 in 7676f38
edit: if you plan to add that in a separate PR I think that's fine, just wanted to make sure this was intentional. I can approve if that's the case, my other comments are minor things.
| routeToDatabase := tlsca.RouteToDatabase{ | ||
| ServiceName: db.GetName(), | ||
| Protocol: db.GetProtocol(), | ||
| Username: params.TargetUser, | ||
| } |
There was a problem hiding this comment.
What about database schema name (--db-name in the cli)?
There was a problem hiding this comment.
Connect never encodes the schema name in the certificate. From what I know, it can be skipped.
In Connect, the user is able to change the schema name at any moment and it'll affect only the CLI command and nothing else.
There was a problem hiding this comment.
Hmm, I was not aware of this - I just tested this and my postgres db cert omits the schema name extension (1.3.9999.2.4). However, if I try to connect it uses my --db-user as the --db-name. Not sure where that's happening
There was a problem hiding this comment.
In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:
teleport/lib/srv/db/postgres/engine.go
Line 177 in 1b0a083
The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.
| // In the future, we're probably going to make this method accept the cert as an arg rather than | ||
| // reading from disk. | ||
| func (g *Gateway) ReloadCert() error { | ||
| g.mu.Lock() |
There was a problem hiding this comment.
While reviewing this code I realized I needed a mutex in the local proxy still.
See: #18278
The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.
I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.
That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.
|
Adding the mutex to gateway.Gateway turned out to be too complex for this PR, and also not necessary for the new feature it adds (see #18259 (comment)). Still, it's worthwhile to add it but I'm going to move this to a completely new PR since it requires some bigger changes and I don't want to pollute this one even more. |
|
@GavinFrazar @smallinsky ping The last commit makes the middleware work with changes from #18278. |
smallinsky
left a comment
There was a problem hiding this comment.
@ravicious
Sorry for the delay I have missed this PR.
I have left some suggestions but generally the flow looks good.
| routeToDatabase := tlsca.RouteToDatabase{ | ||
| ServiceName: db.GetName(), | ||
| Protocol: db.GetProtocol(), | ||
| Username: params.TargetUser, | ||
| } |
There was a problem hiding this comment.
In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:
teleport/lib/srv/db/postgres/engine.go
Line 177 in 1b0a083
The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.
| func (m *localProxyMiddleware) OnNewConnection(ctx context.Context, lp *alpn.LocalProxy, conn net.Conn) error { | ||
| m.log.Debug("Checking local proxy certs") | ||
|
|
||
| err := lp.CheckDBCerts(m.dbRoute) |
There was a problem hiding this comment.
The CheckDBCerts seems to also returns diffrents errors apart from x509.CertificateInvalidError like: certificate subject is for user Should we call onExpireCert login in case of "certificate subject is for user" error ?
There was a problem hiding this comment.
The CheckDBCerts seems to also returns diffrents errors apart from
x509.CertificateInvalidErrorlike:certificate subject is for userShould we call onExpireCert login in case of "certificate subject is for user" error ?
We shouldn't, I'll post a fix in a sec which inspects the error more closely.
There was a problem hiding this comment.
CheckDBCerts looks at db name/user in the database route in case user has previously logged in with tsh db login using a different db name/user than what they used for tsh proxy db --tunnel --db-user=$dbuser --db-name=$dbname, and the previous cert was loaded from disk
For teleterm flow, why should a database route mismatch not trigger cert reissue?
There was a problem hiding this comment.
Good question, I went back and forth on this when writing this reply.
In general, Connect doesn't let you execute the equivalent of tsh db login and tsh proxy db --tunnel separately. Both always happen within a single UI interaction of creating the gateway so you cannot provide different args to them. During gateway creation, we obtain a lock on daemon.Service.mu. This way we guarantee that no two gateways are being created at the same time… but only within Connect. The cert for the gateway is fetched right before we create it, but unlike tsh proxy db we save and load it from disk, even during reissue managed by the middleware.
And I guess here's where problems might happen. Technically we cannot guarantee that between being saved and loaded the cert is not modified by, let's say, tsh db login. Connect also has no real need to save the cert to disk, this is just the tech debt of not supporting per-session MFA: Connect calls TeleportClient.ReissueUserCerts (which doesn't return the cert) rather than TeleportClient.IssueUserCertsWithMFA.
This quarter we also plan to make Connect use ~/.tsh rather than its own TELEPORT_HOME so the risk of some weird mismatched will be increased.
From what I see, the db cert on disk is scoped per cluster user & db server name, so I suppose we could run into something like this:
sequenceDiagram
participant connect as Connect
participant keys as key agent
participant tsh as outside tsh
connect ->> keys: Save db cert for db user foo
tsh ->> keys: Save db cert for db user bar
connect ->> keys: Get cert for db
keys ->> connect: Return cert for db user bar
connect ->> connect: Start local proxy
Which means:
- The mismatch between the cert subject and the local proxy params won't happen in Connect unless an incorrect cert was loaded into the gateway.
- For now, Connect cannot avoid loading certs from the disk.
- …so Connect should
CheckDBCertsbefore instantiatingalpn.LocalProxyand before doingalpn.LocalProxy.SetCertswhen reissuing certs and return an error if there's a mismatch. - By always setting the cert only after we check it, we ensure that in Connect,
OnNewConnectioncannot encounter a cert subject mismatch. - …so Connect should treat that as unrecoverable error and not execute
OnExpiredCert.
@GavinFrazar So I think the way to proceed would be to keep current code, that is Connect's middleware will not call OnExpiredCert if the check failed due to a subject mismatch. I'm going to extract alpnproxy.CheckDBCerts out of alpnproxy.LocalProxy.CheckDBCerts so that Connect can check the certs before actually setting them.
Does that sound good?
There was a problem hiding this comment.
Checking the certs will happen in the next PR where I actually utilize the callback, this one again is just about checking if they expired.
There was a problem hiding this comment.
I'm going to do something else in the mean time in case this plan makes no sense, so I'd appreciate if you could take a look and let me know if I'm not overengineering this.
There was a problem hiding this comment.
But essentially this would entail fixing some tech debt / possible bug in Connect related to proxies that we simply weren't aware of.
There was a problem hiding this comment.
Sorry for spamming comments here. I've just realized that instead of executing the entirety of CheckDBCerts befor eConnect is creating a local proxy or before setting the certs, I can check for the subject mismatch alone. Essentially this is what we care about at that point (as I described in the comment with the graph above). In Connect it's not possible to start a proxy with an expired cert so the only concern would be having a cert that's for another username.
Edit: I told Zac I'll have a PR out for expired certs by the end of the week so I'm really inclined to add that cert subject cert and be done with it. 😏 But even without the time pressure, I think it'd work just fine here.
There was a problem hiding this comment.
By always setting the cert only after we check it, we ensure that in Connect, OnNewConnection cannot encounter a cert subject mismatch.
…so Connect should treat that as unrecoverable error and not execute OnExpiredCert.
I guess my thinking is that requesting new certs from the cluster is a way to handle that error - however you're saying it's a situation that shouldn't happen, and therefore it should be "unrecoverable". I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine. However you want to handle the error is ok, I was just curious why the logic was different from tsh middleware.
There was a problem hiding this comment.
I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine.
Yep, exactly.
I'll have to reconsider this when adding per-session MFA support anyway, as I'd like to reuse a single middleware between tsh and Connect. But it doesn't seem like we'll touch per-session MFA in Connect in the next two or three months.
| } | ||
| onExpiredCertErr := m.onExpiredCert(ctx) | ||
| if onExpiredCertErr != nil { | ||
| return trace.Wrap(onExpiredCertErr) |
There was a problem hiding this comment.
I didn't like how it showed up in the logs. trace.NewAggregate treats the first error as more important. But here we really care only about onExpiredCertErr - err is logged separately above anyway.
|
@ravicious See the table below for backport results.
|
…erts (#18259) * Log gateway.Close errors during test cleanups Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point. * Automatically add useful fields to gateway loggers It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config. * Remove copylocks warning from Gateway.NewWithLocalPort gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway. * Make ReissueDBCerts accept tlsca.RouteToDatabase as arg ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request. * Add Gateway.ReloadCert In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter. * Add basic implementation of LocalProxyMiddleware * Add OnExpiredCert callback to gateway.Config This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.
…erts (#18259) (#18678) * Log gateway.Close errors during test cleanups Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point. * Automatically add useful fields to gateway loggers It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config. * Remove copylocks warning from Gateway.NewWithLocalPort gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway. * Make ReissueDBCerts accept tlsca.RouteToDatabase as arg ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request. * Add Gateway.ReloadCert In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter. * Add basic implementation of LocalProxyMiddleware * Add OnExpiredCert callback to gateway.Config This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.

The first three commits are small cleanups around gateways. The next four commits are prerequisites for reissuing certs in Connect's db proxies (#17950).
Rationale
#16958 added a middleware implementation for
alpn.LocalProxy. This way if a client tries to make a connection through the proxy but the db cert has expired,tsh proxy dbis able to ask the user to log in again and reissue the cert. We want to implement a similar behavior in Connect.That middleware from #16958 also supports per-session MFA. In Connect however we need to just detect if the cert has expired. In the future, when we get around to adding per-session MFA support to Connect, we plan to refactor the existing middleware from #16958 so that it can be reused between tsh and Connect.
This PR provides the plumbing required by #17950. It adds a simpler implementation of LocalProxy middleware which just checks the cert expiry and fires the provided callback if the cert has expired. In #17950, I'm going to implement the rest of the logic: within the callback I'm going to ask the user to relogin and then reissue the cert.
Each of the four commits explains why a particular change is needed.