Skip to content

Connect: Add prerequisites for gracefully handling expired db proxy certs#18259

Merged
ravicious merged 18 commits intomasterfrom
ravicious/gateway-middleware
Nov 22, 2022
Merged

Connect: Add prerequisites for gracefully handling expired db proxy certs#18259
ravicious merged 18 commits intomasterfrom
ravicious/gateway-middleware

Conversation

@ravicious
Copy link
Copy Markdown
Member

The first three commits are small cleanups around gateways. The next four commits are prerequisites for reissuing certs in Connect's db proxies (#17950).

Rationale

#16958 added a middleware implementation for alpn.LocalProxy. This way if a client tries to make a connection through the proxy but the db cert has expired, tsh proxy db is able to ask the user to log in again and reissue the cert. We want to implement a similar behavior in Connect.

That middleware from #16958 also supports per-session MFA. In Connect however we need to just detect if the cert has expired. In the future, when we get around to adding per-session MFA support to Connect, we plan to refactor the existing middleware from #16958 so that it can be reused between tsh and Connect.

This PR provides the plumbing required by #17950. It adds a simpler implementation of LocalProxy middleware which just checks the cert expiry and fires the provided callback if the cert has expired. In #17950, I'm going to implement the rest of the logic: within the callback I'm going to ask the user to relogin and then reissue the cert.

Each of the four commits explains why a particular change is needed.

Unless gateway.Close is called on a gateway that was already closed, it
shouldn't return an error.

However, when working on handling expired certs in Connect I ran into a
buggy test where that error from gateway.Close provided a crucial clue
in fixing the bug. But because initially it was simply ignored, it took
me a while to figure out what was going on.

That's why this commit adds logging around those errors.

2 out of those 3 places are helper functions which get used in a variety
of tests, hence why they call t.Cleanup. The other place does call
gateway.Close eventually but we still use t.Cleanup in case the
execution doesn't get to that point.
It's useful to see what resource the gateway is targeting and what is
the URI of the gateway.

Previously the field with URI was hardcoded in cluster_gateways.go or
added only when cfg.Log was nil, meaning that we weren't able to benefit
from it in places such as gateway_test.go.

This commit makes it so that the `resource` and `gateway` fields are
added to any logger that is passed through gateway.Config.
gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort
accepted gateway by value so vet was issuing a warning about copying a
lock.

NewWithLocalPort doesn't actually use the copied lock. But it makes
sense to get rid of the warning anyway.
ReissueDBCerts used to accept a full-blown types.Database object just to
read a couple of fields from it. In the context of Connect, such object
is obtainable only by making a request to the cluster.

However, in the upcoming PR we want to be able to reissue the cert
without having to perform an unnecessary request to the cluster.

gateway.Gateway already holds all data we need to reissue the cert, so
let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of
types.Database to avoid making that extra request.
In the upcoming PR, after we reissue the db cert, we need to be able to
update the cert used by the running alpn.LocalProxy. This commit exposes
exactly that functionality.

Also, this commit adds RWMutex to Gateway to avoid a situation where
multiple goroutines attempt to reload the cert. This shouldn't happen
under normal circumstances but better safe than sorry.

RWMutex is also used for any field on Gateway that has a setter.
@ravicious ravicious force-pushed the ravicious/gateway-middleware branch from 267c998 to 6853363 Compare November 8, 2022 16:38
This callback will let the layer above gateway.Gateway handle a
situation in which the cert used by the gateway has expired but there's
a client that tries to make a connection through the gateway.

gateway.Gateway doesn't have the ability to reissue the cert by itself,
hence why we need to accept a callback from above.
@ravicious ravicious force-pushed the ravicious/gateway-middleware branch from 6853363 to 7676f38 Compare November 8, 2022 16:39
Comment on lines +116 to +121
if cfg.OnExpiredCert != nil {
localProxyMiddleware.onExpiredCert = func(ctx context.Context) error {
err := cfg.OnExpiredCert(ctx, gateway)
return trace.Wrap(err)
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of an awkward way to initialize the middleware and add the callback later. However, I want to be able to pass gateway.Gateway as the argument so that's why I first instantiate the middleware, then the gateway and then set the callback.

@ravicious ravicious marked this pull request as ready for review November 8, 2022 16:41
@ravicious ravicious requested a review from avatus November 8, 2022 16:43
@ravicious
Copy link
Copy Markdown
Member Author

ravicious commented Nov 8, 2022

Wow, I feel like I won the lottery with the auto review requests here. :D I wanted to request a review from Gavin anyway as he worked on #16958.

Comment thread lib/teleterm/gateway/gateway.go Outdated
// In the future, we're probably going to make this method accept the cert as an arg rather than
// reading from disk.
func (g *Gateway) ReloadCert() error {
g.mu.Lock()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I'm looking at this, I'm not sure if this mutex here is needed. alpn.LocalProxy processes connections serially, so there's no risk of two separate goroutines trying to set the cert.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing this code I realized I needed a mutex in the local proxy still.

See: #18278

The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.

I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

Thanks for letting me know. I'll keep the mutex for the mutable fields but remove its usage from ReloadCert.

Copy link
Copy Markdown
Contributor

@GavinFrazar GavinFrazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good but I think the OnExpiredCert func is never assigned. Your test mocked it so that's why it passes there. There's an integration test, maybe you could add something similar for the teleterm middleware:

t.Run("authenticated tunnel with cert renewal", func(t *testing.T) {

edit: if you plan to add that in a separate PR I think that's fine, just wanted to make sure this was intentional. I can approve if that's the case, my other comments are minor things.

Comment on lines +50 to +54
routeToDatabase := tlsca.RouteToDatabase{
ServiceName: db.GetName(),
Protocol: db.GetProtocol(),
Username: params.TargetUser,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about database schema name (--db-name in the cli)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connect never encodes the schema name in the certificate. From what I know, it can be skipped.

In Connect, the user is able to change the schema name at any moment and it'll affect only the CLI command and nothing else.

Connect db connection tab

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was not aware of this - I just tested this and my postgres db cert omits the schema name extension (1.3.9999.2.4). However, if I try to connect it uses my --db-user as the --db-name. Not sure where that's happening

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:

sessionCtx.DatabaseName = value

The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.

Comment thread lib/teleterm/gateway/config.go
Comment thread lib/teleterm/gateway/config.go Outdated
Comment thread lib/teleterm/gateway/gateway.go Outdated
// In the future, we're probably going to make this method accept the cert as an arg rather than
// reading from disk.
func (g *Gateway) ReloadCert() error {
g.mu.Lock()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing this code I realized I needed a mutex in the local proxy still.

See: #18278

The window for a data race is extremely small, but we do call GetCerts to copy the certs into the TLS config when we dial upstream.

I originally had a mutex in there, but then we made the middleware a blocking call and I thought it was safe to remove.

That being said, once that PR merges I don't think you need a mutex here as the local proxy will guard its own certs with a mutex.

Comment thread lib/teleterm/gateway/gateway_test.go
Comment thread lib/teleterm/gateway/local_proxy_middleware.go
Comment thread lib/teleterm/gateway/local_proxy_middleware_test.go
Comment thread lib/teleterm/clusters/cluster_gateways.go
@ravicious
Copy link
Copy Markdown
Member Author

Adding the mutex to gateway.Gateway turned out to be too complex for this PR, and also not necessary for the new feature it adds (see #18259 (comment)).

Still, it's worthwhile to add it but I'm going to move this to a completely new PR since it requires some bigger changes and I don't want to pollute this one even more.

@ravicious
Copy link
Copy Markdown
Member Author

@GavinFrazar @smallinsky ping

The last commit makes the middleware work with changes from #18278.

Copy link
Copy Markdown
Contributor

@smallinsky smallinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravicious
Sorry for the delay I have missed this PR.
I have left some suggestions but generally the flow looks good.

Comment on lines +50 to +54
routeToDatabase := tlsca.RouteToDatabase{
ServiceName: db.GetName(),
Protocol: db.GetProtocol(),
Username: params.TargetUser,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of Postgres and Mongo databaseName is part part of Wire database protocol: For instance:

sessionCtx.DatabaseName = value

The --db-name sets the value in connection string that is used my pgwire protocol and propagated in StartupMessage so event if the RouteToDatabase.Database is not set we are obtaining database name that was set by --db-name from the database wire protocol.

Comment thread lib/teleterm/gateway/local_proxy_middleware_test.go Outdated
Comment thread lib/teleterm/gateway/local_proxy_middleware_test.go Outdated
func (m *localProxyMiddleware) OnNewConnection(ctx context.Context, lp *alpn.LocalProxy, conn net.Conn) error {
m.log.Debug("Checking local proxy certs")

err := lp.CheckDBCerts(m.dbRoute)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CheckDBCerts seems to also returns diffrents errors apart from x509.CertificateInvalidError like: certificate subject is for user Should we call onExpireCert login in case of "certificate subject is for user" error ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CheckDBCerts seems to also returns diffrents errors apart from x509.CertificateInvalidError like: certificate subject is for user Should we call onExpireCert login in case of "certificate subject is for user" error ?

We shouldn't, I'll post a fix in a sec which inspects the error more closely.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckDBCerts looks at db name/user in the database route in case user has previously logged in with tsh db login using a different db name/user than what they used for tsh proxy db --tunnel --db-user=$dbuser --db-name=$dbname, and the previous cert was loaded from disk

For teleterm flow, why should a database route mismatch not trigger cert reissue?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I went back and forth on this when writing this reply.

In general, Connect doesn't let you execute the equivalent of tsh db login and tsh proxy db --tunnel separately. Both always happen within a single UI interaction of creating the gateway so you cannot provide different args to them. During gateway creation, we obtain a lock on daemon.Service.mu. This way we guarantee that no two gateways are being created at the same time… but only within Connect. The cert for the gateway is fetched right before we create it, but unlike tsh proxy db we save and load it from disk, even during reissue managed by the middleware.

And I guess here's where problems might happen. Technically we cannot guarantee that between being saved and loaded the cert is not modified by, let's say, tsh db login. Connect also has no real need to save the cert to disk, this is just the tech debt of not supporting per-session MFA: Connect calls TeleportClient.ReissueUserCerts (which doesn't return the cert) rather than TeleportClient.IssueUserCertsWithMFA.

This quarter we also plan to make Connect use ~/.tsh rather than its own TELEPORT_HOME so the risk of some weird mismatched will be increased.

From what I see, the db cert on disk is scoped per cluster user & db server name, so I suppose we could run into something like this:

sequenceDiagram
    participant connect as Connect
    participant keys as key agent
    participant tsh as outside tsh
    connect ->> keys: Save db cert for db user foo
    tsh ->> keys: Save db cert for db user bar
    connect ->> keys: Get cert for db
    keys ->> connect: Return cert for db user bar
    connect ->> connect: Start local proxy
Loading

Which means:

  • The mismatch between the cert subject and the local proxy params won't happen in Connect unless an incorrect cert was loaded into the gateway.
  • For now, Connect cannot avoid loading certs from the disk.
  • …so Connect should CheckDBCerts before instantiating alpn.LocalProxy and before doing alpn.LocalProxy.SetCerts when reissuing certs and return an error if there's a mismatch.
  • By always setting the cert only after we check it, we ensure that in Connect, OnNewConnection cannot encounter a cert subject mismatch.
  • …so Connect should treat that as unrecoverable error and not execute OnExpiredCert.

@GavinFrazar So I think the way to proceed would be to keep current code, that is Connect's middleware will not call OnExpiredCert if the check failed due to a subject mismatch. I'm going to extract alpnproxy.CheckDBCerts out of alpnproxy.LocalProxy.CheckDBCerts so that Connect can check the certs before actually setting them.

Does that sound good?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the certs will happen in the next PR where I actually utilize the callback, this one again is just about checking if they expired.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to do something else in the mean time in case this plan makes no sense, so I'd appreciate if you could take a look and let me know if I'm not overengineering this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But essentially this would entail fixing some tech debt / possible bug in Connect related to proxies that we simply weren't aware of.

Copy link
Copy Markdown
Member Author

@ravicious ravicious Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for spamming comments here. I've just realized that instead of executing the entirety of CheckDBCerts befor eConnect is creating a local proxy or before setting the certs, I can check for the subject mismatch alone. Essentially this is what we care about at that point (as I described in the comment with the graph above). In Connect it's not possible to start a proxy with an expired cert so the only concern would be having a cert that's for another username.

Edit: I told Zac I'll have a PR out for expired certs by the end of the week so I'm really inclined to add that cert subject cert and be done with it. 😏 But even without the time pressure, I think it'd work just fine here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By always setting the cert only after we check it, we ensure that in Connect, OnNewConnection cannot encounter a cert subject mismatch.
…so Connect should treat that as unrecoverable error and not execute OnExpiredCert.

I guess my thinking is that requesting new certs from the cluster is a way to handle that error - however you're saying it's a situation that shouldn't happen, and therefore it should be "unrecoverable". I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine. However you want to handle the error is ok, I was just curious why the logic was different from tsh middleware.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean "unrecoverable" as in, the connection just gets dropped/closed instead of trying to handle the nonsensical error. If that's what you mean, then that's fine.

Yep, exactly.

I'll have to reconsider this when adding per-session MFA support anyway, as I'd like to reuse a single middleware between tsh and Connect. But it doesn't seem like we'll touch per-session MFA in Connect in the next two or three months.

Comment thread lib/teleterm/gateway/local_proxy_middleware.go
}
onExpiredCertErr := m.onExpiredCert(ctx)
if onExpiredCertErr != nil {
return trace.Wrap(onExpiredCertErr)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't like how it showed up in the logs. trace.NewAggregate treats the first error as more important. But here we really care only about onExpiredCertErr - err is logged separately above anyway.

@github-actions github-actions Bot removed the request for review from avatus November 22, 2022 08:11
Comment thread lib/teleterm/gateway/local_proxy_middleware.go Outdated
Comment thread lib/teleterm/gateway/local_proxy_middleware_test.go Outdated
@ravicious ravicious enabled auto-merge (squash) November 22, 2022 12:38
@ravicious ravicious merged commit 8bfc149 into master Nov 22, 2022
@github-actions
Copy link
Copy Markdown
Contributor

@ravicious See the table below for backport results.

Branch Result
branch/v11 Failed

@ravicious ravicious deleted the ravicious/gateway-middleware branch November 22, 2022 13:49
ravicious added a commit that referenced this pull request Nov 22, 2022
…erts (#18259)

* Log gateway.Close errors during test cleanups

Unless gateway.Close is called on a gateway that was already closed, it
shouldn't return an error.

However, when working on handling expired certs in Connect I ran into a
buggy test where that error from gateway.Close provided a crucial clue
in fixing the bug. But because initially it was simply ignored, it took
me a while to figure out what was going on.

That's why this commit adds logging around those errors.

2 out of those 3 places are helper functions which get used in a variety
of tests, hence why they call t.Cleanup. The other place does call
gateway.Close eventually but we still use t.Cleanup in case the
execution doesn't get to that point.

* Automatically add useful fields to gateway loggers

It's useful to see what resource the gateway is targeting and what is
the URI of the gateway.

Previously the field with URI was hardcoded in cluster_gateways.go or
added only when cfg.Log was nil, meaning that we weren't able to benefit
from it in places such as gateway_test.go.

This commit makes it so that the `resource` and `gateway` fields are
added to any logger that is passed through gateway.Config.

* Remove copylocks warning from Gateway.NewWithLocalPort

gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort
accepted gateway by value so vet was issuing a warning about copying a
lock.

NewWithLocalPort doesn't actually use the copied lock. But it makes
sense to get rid of the warning anyway.

* Make ReissueDBCerts accept tlsca.RouteToDatabase as arg

ReissueDBCerts used to accept a full-blown types.Database object just to
read a couple of fields from it. In the context of Connect, such object
is obtainable only by making a request to the cluster.

However, in the upcoming PR we want to be able to reissue the cert
without having to perform an unnecessary request to the cluster.

gateway.Gateway already holds all data we need to reissue the cert, so
let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of
types.Database to avoid making that extra request.

* Add Gateway.ReloadCert

In the upcoming PR, after we reissue the db cert, we need to be able to
update the cert used by the running alpn.LocalProxy. This commit exposes
exactly that functionality.

Also, this commit adds RWMutex to Gateway to avoid a situation where
multiple goroutines attempt to reload the cert. This shouldn't happen
under normal circumstances but better safe than sorry.

RWMutex is also used for any field on Gateway that has a setter.

* Add basic implementation of LocalProxyMiddleware

* Add OnExpiredCert callback to gateway.Config

This callback will let the layer above gateway.Gateway handle a
situation in which the cert used by the gateway has expired but there's
a client that tries to make a connection through the gateway.

gateway.Gateway doesn't have the ability to reissue the cert by itself,
hence why we need to accept a callback from above.
ravicious added a commit that referenced this pull request Nov 23, 2022
…erts (#18259) (#18678)

* Log gateway.Close errors during test cleanups

Unless gateway.Close is called on a gateway that was already closed, it
shouldn't return an error.

However, when working on handling expired certs in Connect I ran into a
buggy test where that error from gateway.Close provided a crucial clue
in fixing the bug. But because initially it was simply ignored, it took
me a while to figure out what was going on.

That's why this commit adds logging around those errors.

2 out of those 3 places are helper functions which get used in a variety
of tests, hence why they call t.Cleanup. The other place does call
gateway.Close eventually but we still use t.Cleanup in case the
execution doesn't get to that point.

* Automatically add useful fields to gateway loggers

It's useful to see what resource the gateway is targeting and what is
the URI of the gateway.

Previously the field with URI was hardcoded in cluster_gateways.go or
added only when cfg.Log was nil, meaning that we weren't able to benefit
from it in places such as gateway_test.go.

This commit makes it so that the `resource` and `gateway` fields are
added to any logger that is passed through gateway.Config.

* Remove copylocks warning from Gateway.NewWithLocalPort

gateway.Gateway holds a mutex in one of its fields. NewWithLocalPort
accepted gateway by value so vet was issuing a warning about copying a
lock.

NewWithLocalPort doesn't actually use the copied lock. But it makes
sense to get rid of the warning anyway.

* Make ReissueDBCerts accept tlsca.RouteToDatabase as arg

ReissueDBCerts used to accept a full-blown types.Database object just to
read a couple of fields from it. In the context of Connect, such object
is obtainable only by making a request to the cluster.

However, in the upcoming PR we want to be able to reissue the cert
without having to perform an unnecessary request to the cluster.

gateway.Gateway already holds all data we need to reissue the cert, so
let's make ReissueDBCerts accept tlsca.RouteToDatabase instead of
types.Database to avoid making that extra request.

* Add Gateway.ReloadCert

In the upcoming PR, after we reissue the db cert, we need to be able to
update the cert used by the running alpn.LocalProxy. This commit exposes
exactly that functionality.

Also, this commit adds RWMutex to Gateway to avoid a situation where
multiple goroutines attempt to reload the cert. This shouldn't happen
under normal circumstances but better safe than sorry.

RWMutex is also used for any field on Gateway that has a setter.

* Add basic implementation of LocalProxyMiddleware

* Add OnExpiredCert callback to gateway.Config

This callback will let the layer above gateway.Gateway handle a
situation in which the cert used by the gateway has expired but there's
a client that tries to make a connection through the gateway.

gateway.Gateway doesn't have the ability to reissue the cert by itself,
hence why we need to accept a callback from above.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants