Connect: Add prerequisites for gracefully handling expired db proxy certs by ravicious · Pull Request #18259 · gravitational/teleport

ravicious · 2022-11-08T16:23:46Z

The first three commits are small cleanups around gateways. The next four commits are prerequisites for reissuing certs in Connect's db proxies (#17950).

Rationale

#16958 added a middleware implementation for alpn.LocalProxy. This way if a client tries to make a connection through the proxy but the db cert has expired, tsh proxy db is able to ask the user to log in again and reissue the cert. We want to implement a similar behavior in Connect.

That middleware from #16958 also supports per-session MFA. In Connect however we need to just detect if the cert has expired. In the future, when we get around to adding per-session MFA support to Connect, we plan to refactor the existing middleware from #16958 so that it can be reused between tsh and Connect.

This PR provides the plumbing required by #17950. It adds a simpler implementation of LocalProxy middleware which just checks the cert expiry and fires the provided callback if the cert has expired. In #17950, I'm going to implement the rest of the logic: within the callback I'm going to ask the user to relogin and then reissue the cert.

Each of the four commits explains why a particular change is needed.

Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point.

It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config.

gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway.

ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request.

In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter.

This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.

ravicious · 2022-11-08T16:41:26Z

+	if cfg.OnExpiredCert != nil {
+		localProxyMiddleware.onExpiredCert = func(ctx context.Context) error {
+			err := cfg.OnExpiredCert(ctx, gateway)
+			return trace.Wrap(err)
+		}
+	}


It's kind of an awkward way to initialize the middleware and add the callback later. However, I want to be able to pass gateway.Gateway as the argument so that's why I first instantiate the middleware, then the gateway and then set the callback.

ravicious · 2022-11-08T16:45:10Z

Wow, I feel like I won the lottery with the auto review requests here. :D I wanted to request a review from Gavin anyway as he worked on #16958.

ravicious · 2022-11-08T16:48:56Z

+// In the future, we're probably going to make this method accept the cert as an arg rather than
+// reading from disk.
+func (g *Gateway) ReloadCert() error {
+	g.mu.Lock()


Now that I'm looking at this, I'm not sure if this mutex here is needed. alpn.LocalProxy processes connections serially, so there's no risk of two separate goroutines trying to set the cert.

While reviewing this code I realized I needed a mutex in the local proxy still.

See: #18278

The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.

I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

Thanks for letting me know. I'll keep the mutex for the mutable fields but remove its usage from ReloadCert.

GavinFrazar

Mostly looks good but I think the OnExpiredCert func is never assigned. Your test mocked it so that's why it passes there. There's an integration test, maybe you could add something similar for the teleterm middleware:

teleport/integration/proxy/proxy_test.go

Line 759 in 7676f38

t.Run("authenticated tunnel with cert renewal", func(t *testing.T) {

edit: if you plan to add that in a separate PR I think that's fine, just wanted to make sure this was intentional. I can approve if that's the case, my other comments are minor things.

GavinFrazar · 2022-11-08T18:08:34Z

+	routeToDatabase := tlsca.RouteToDatabase{
+		ServiceName: db.GetName(),
+		Protocol:    db.GetProtocol(),
+		Username:    params.TargetUser,
+	}


What about database schema name (--db-name in the cli)?

Connect never encodes the schema name in the certificate. From what I know, it can be skipped.

In Connect, the user is able to change the schema name at any moment and it'll affect only the CLI command and nothing else.

Hmm, I was not aware of this - I just tested this and my postgres db cert omits the schema name extension (1.3.9999.2.4). However, if I try to connect it uses my --db-user as the --db-name. Not sure where that's happening

In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:

teleport/lib/srv/db/postgres/engine.go

Line 177 in 1b0a083

sessionCtx.DatabaseName = value

The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.

GavinFrazar · 2022-11-08T20:39:42Z

+// In the future, we're probably going to make this method accept the cert as an arg rather than
+// reading from disk.
+func (g *Gateway) ReloadCert() error {
+	g.mu.Lock()


While reviewing this code I realized I needed a mutex in the local proxy still.

See: #18278

The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.

I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

ravicious · 2022-11-10T11:39:25Z

Adding the mutex to gateway.Gateway turned out to be too complex for this PR, and also not necessary for the new feature it adds (see #18259 (comment)).

Still, it's worthwhile to add it but I'm going to move this to a completely new PR since it requires some bigger changes and I don't want to pollute this one even more.

This reverts parts of commit 67297d3.

…kDBCerts

ravicious · 2022-11-18T09:58:52Z

@GavinFrazar @smallinsky ping

The last commit makes the middleware work with changes from #18278.

smallinsky

@ravicious
Sorry for the delay I have missed this PR.
I have left some suggestions but generally the flow looks good.

smallinsky · 2022-11-18T11:01:55Z

+	routeToDatabase := tlsca.RouteToDatabase{
+		ServiceName: db.GetName(),
+		Protocol:    db.GetProtocol(),
+		Username:    params.TargetUser,
+	}


In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:

teleport/lib/srv/db/postgres/engine.go

Line 177 in 1b0a083

sessionCtx.DatabaseName = value

The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.

smallinsky · 2022-11-18T11:20:17Z

+func (m *localProxyMiddleware) OnNewConnection(ctx context.Context, lp *alpn.LocalProxy, conn net.Conn) error {
+	m.log.Debug("Checking local proxy certs")
+
+	err := lp.CheckDBCerts(m.dbRoute)


The CheckDBCerts seems to also returns diffrents errors apart from x509.CertificateInvalidError like: certificate subject is for user Should we call onExpireCert login in case of "certificate subject is for user" error ?

The CheckDBCerts seems to also returns diffrents errors apart from x509.CertificateInvalidError like: certificate subject is for user Should we call onExpireCert login in case of "certificate subject is for user" error ?

We shouldn't, I'll post a fix in a sec which inspects the error more closely.

CheckDBCerts looks at db name/user in the database route in case user has previously logged in with tsh db login using a different db name/user than what they used for tsh proxy db --tunnel --db-user=$dbuser --db-name=$dbname, and the previous cert was loaded from disk

For teleterm flow, why should a database route mismatch not trigger cert reissue?

Good question, I went back and forth on this when writing this reply.

In general, Connect doesn't let you execute the equivalent of tsh db login and tsh proxy db --tunnel separately. Both always happen within a single UI interaction of creating the gateway so you cannot provide different args to them. During gateway creation, we obtain a lock on daemon.Service.mu. This way we guarantee that no two gateways are being created at the same time… but only within Connect. The cert for the gateway is fetched right before we create it, but unlike tsh proxy db we save and load it from disk, even during reissue managed by the middleware.

And I guess here's where problems might happen. Technically we cannot guarantee that between being saved and loaded the cert is not modified by, let's say, tsh db login. Connect also has no real need to save the cert to disk, this is just the tech debt of not supporting per-session MFA: Connect calls TeleportClient.ReissueUserCerts (which doesn't return the cert) rather than TeleportClient.IssueUserCertsWithMFA.

This quarter we also plan to make Connect use ~/.tsh rather than its own TELEPORT_HOME so the risk of some weird mismatched will be increased.

From what I see, the db cert on disk is scoped per cluster user & db server name, so I suppose we could run into something like this:

sequenceDiagram participant connect as Connect participant keys as key agent participant tsh as outside tsh connect ->> keys: Save db cert for db user foo tsh ->> keys: Save db cert for db user bar connect ->> keys: Get cert for db keys ->> connect: Return cert for db user bar connect ->> connect: Start local proxy

Loading

Which means:

The mismatch between the cert subject and the local proxy params won't happen in Connect unless an incorrect cert was loaded into the gateway.

For now, Connect cannot avoid loading certs from the disk.

…so Connect should CheckDBCerts before instantiating alpn.LocalProxy and before doing alpn.LocalProxy.SetCerts when reissuing certs and return an error if there's a mismatch.

By always setting the cert only after we check it, we ensure that in Connect, OnNewConnection cannot encounter a cert subject mismatch.

…so Connect should treat that as unrecoverable error and not execute OnExpiredCert.

@GavinFrazar So I think the way to proceed would be to keep current code, that is Connect's middleware will not call OnExpiredCert if the check failed due to a subject mismatch. I'm going to extract alpnproxy.CheckDBCerts out of alpnproxy.LocalProxy.CheckDBCerts so that Connect can check the certs before actually setting them.

Does that sound good?

Checking the certs will happen in the next PR where I actually utilize the callback, this one again is just about checking if they expired.

I'm going to do something else in the mean time in case this plan makes no sense, so I'd appreciate if you could take a look and let me know if I'm not overengineering this.

But essentially this would entail fixing some tech debt / possible bug in Connect related to proxies that we simply weren't aware of.

Sorry for spamming comments here. I've just realized that instead of executing the entirety of CheckDBCerts befor eConnect is creating a local proxy or before setting the certs, I can check for the subject mismatch alone. Essentially this is what we care about at that point (as I described in the comment with the graph above). In Connect it's not possible to start a proxy with an expired cert so the only concern would be having a cert that's for another username.

Edit: I told Zac I'll have a PR out for expired certs by the end of the week so I'm really inclined to add that cert subject cert and be done with it. 😏 But even without the time pressure, I think it'd work just fine here.

By always setting the cert only after we check it, we ensure that in Connect, OnNewConnection cannot encounter a cert subject mismatch.
…so Connect should treat that as unrecoverable error and not execute OnExpiredCert.

I guess my thinking is that requesting new certs from the cluster is a way to handle that error - however you're saying it's a situation that shouldn't happen, and therefore it should be "unrecoverable". I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine. However you want to handle the error is ok, I was just curious why the logic was different from tsh middleware.

I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine.

Yep, exactly.

I'll have to reconsider this when adding per-session MFA support anyway, as I'd like to reuse a single middleware between tsh and Connect. But it doesn't seem like we'll touch per-session MFA in Connect in the next two or three months.

ravicious · 2022-11-18T12:21:49Z

-		}
+	onExpiredCertErr := m.onExpiredCert(ctx)
+	if onExpiredCertErr != nil {
+		return trace.Wrap(onExpiredCertErr)


I didn't like how it showed up in the logs. trace.NewAggregate treats the first error as more important. But here we really care only about onExpiredCertErr - err is logged separately above anyway.

github-actions · 2022-11-22T12:57:34Z

@ravicious See the table below for backport results.

Branch	Result
branch/v11	Failed

…erts (#18259) * Log gateway.Close errors during test cleanups Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point. * Automatically add useful fields to gateway loggers It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config. * Remove copylocks warning from Gateway.NewWithLocalPort gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway. * Make ReissueDBCerts accept tlsca.RouteToDatabase as arg ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request. * Add Gateway.ReloadCert In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter. * Add basic implementation of LocalProxyMiddleware * Add OnExpiredCert callback to gateway.Config This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.

…erts (#18259) (#18678) * Log gateway.Close errors during test cleanups Unless gateway.Close is called on a gateway that was already closed, it shouldn't return an error. However, when working on handling expired certs in Connect I ran into a buggy test where that error from gateway.Close provided a crucial clue in fixing the bug. But because initially it was simply ignored, it took me a while to figure out what was going on. That's why this commit adds logging around those errors. 2 out of those 3 places are helper functions which get used in a variety of tests, hence why they call t.Cleanup. The other place does call gateway.Close eventually but we still use t.Cleanup in case the execution doesn't get to that point. * Automatically add useful fields to gateway loggers It's useful to see what resource the gateway is targeting and what is the URI of the gateway. Previously the field with URI was hardcoded in cluster_gateways.go or added only when cfg.Log was nil, meaning that we weren't able to benefit from it in places such as gateway_test.go. This commit makes it so that the `resource` and `gateway` fields are added to any logger that is passed through gateway.Config. * Remove copylocks warning from Gateway.NewWithLocalPort gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort accepted gateway by value so vet was issuing a warning about copying a lock. NewWithLocalPort doesn't actually use the copied lock. But it makes sense to get rid of the warning anyway. * Make ReissueDBCerts accept tlsca.RouteToDatabase as arg ReissueDBCerts used to accept a full-blown types.Database object just to read a couple of fields from it. In the context of Connect, such object is obtainable only by making a request to the cluster. However, in the upcoming PR we want to be able to reissue the cert without having to perform an unnecessary request to the cluster. gateway.Gateway already holds all data we need to reissue the cert, so let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of types.Database to avoid making that extra request. * Add Gateway.ReloadCert In the upcoming PR, after we reissue the db cert, we need to be able to update the cert used by the running alpn.LocalProxy. This commit exposes exactly that functionality. Also, this commit adds RWMutex to Gateway to avoid a situation where multiple goroutines attempt to reload the cert. This shouldn't happen under normal circumstances but better safe than sorry. RWMutex is also used for any field on Gateway that has a setter. * Add basic implementation of LocalProxyMiddleware * Add OnExpiredCert callback to gateway.Config This callback will let the layer above gateway.Gateway handle a situation in which the cert used by the gateway has expired but there's a client that tries to make a connection through the gateway. gateway.Gateway doesn't have the ability to reissue the cert by itself, hence why we need to accept a callback from above.

ravicious added 5 commits November 8, 2022 17:24

ravicious force-pushed the ravicious/gateway-middleware branch from 267c998 to 6853363 Compare November 8, 2022 16:38

ravicious added 2 commits November 8, 2022 17:39

Add basic implementation of LocalProxyMiddleware

e27e75f

ravicious force-pushed the ravicious/gateway-middleware branch from 6853363 to 7676f38 Compare November 8, 2022 16:39

ravicious commented Nov 8, 2022

View reviewed changes

ravicious marked this pull request as ready for review November 8, 2022 16:41

github-actions Bot requested review from GavinFrazar and smallinsky November 8, 2022 16:42

ravicious requested a review from avatus November 8, 2022 16:43

ravicious commented Nov 8, 2022

View reviewed changes

GavinFrazar requested changes Nov 8, 2022

View reviewed changes

ravicious added 2 commits November 9, 2022 12:07

Remove mutex use from Gateway.ReloadCert

faeff11

Use logrus.Entry.WithFields

fd42fe1

ravicious requested a review from GavinFrazar November 9, 2022 11:16

ravicious mentioned this pull request Nov 9, 2022

Add mutex for certs in local proxy #18278

Merged

Add godoc for OnExpiredCertFunc

9d855c5

ravicious added 3 commits November 10, 2022 12:52

Revert adding mutex to gateway.Gateway

9172db4

This reverts parts of commit 67297d3.

Merge branch 'master' into ravicious/gateway-middleware

408fcae

Update gateway.LocalProxyMiddleware to work with alpn.LocalProxy.Chec…

1cb6411

…kDBCerts

smallinsky reviewed Nov 18, 2022

View reviewed changes

ravicious added 2 commits November 18, 2022 12:44

Pass nil as conn instead of using net.Pipe

a9cd805

Verify that err from CheckDBCerts is about expiry

9d4a55f

ravicious commented Nov 18, 2022

View reviewed changes

ravicious requested a review from smallinsky November 18, 2022 12:22

ravicious added the backport/branch/v11 label Nov 18, 2022

smallinsky approved these changes Nov 18, 2022

View reviewed changes

GavinFrazar approved these changes Nov 22, 2022

View reviewed changes

github-actions Bot removed the request for review from avatus November 22, 2022 08:11

GavinFrazar reviewed Nov 22, 2022

View reviewed changes

Comment thread lib/teleterm/gateway/local_proxy_middleware.go Outdated

GavinFrazar reviewed Nov 22, 2022

View reviewed changes

Comment thread lib/teleterm/gateway/local_proxy_middleware_test.go Outdated

ravicious added 3 commits November 22, 2022 13:35

Simplify error handling in OnNewConnection

4f93f0f

Remove tt := tt from serial tests

6f0cb3f

Merge branch 'master' into ravicious/gateway-middleware

35a22d0

ravicious enabled auto-merge (squash) November 22, 2022 12:38

ravicious merged commit 8bfc149 into master Nov 22, 2022

ravicious deleted the ravicious/gateway-middleware branch November 22, 2022 13:49

ravicious mentioned this pull request Nov 22, 2022

[v11] Connect: Add prerequisites for gracefully handling expired db proxy certs #18678

Merged

This was referenced Nov 23, 2022

Connect: Check db cert before using it for local proxy #18740

Merged

Connect: Detect & reissue expired db certs #17950

Merged

ravicious mentioned this pull request Mar 1, 2023

Connect: Add mutex to gateway.Gateway #18341

Closed

Conversation

ravicious commented Nov 8, 2022

Rationale

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ravicious commented Nov 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GavinFrazar left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ravicious commented Nov 10, 2022

Uh oh!

ravicious commented Nov 18, 2022

Uh oh!

smallinsky left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ravicious Nov 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Nov 22, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

ravicious commented Nov 8, 2022 •

edited

Loading

GavinFrazar left a comment •

edited

Loading

ravicious Nov 21, 2022 •

edited

Loading