-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mfa: per-session MFA certs for SSH and Kubernetes #5564
Conversation
5f4fba5
to
5580e2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain about other ramifications of MFA check but if the 20% slowdown is a concern, the connect/port forward APIs could attempt to connect with the local agent's ssh key first (w/o the embedded node name) before requesting the certificate with MFA check if RBAC requires it.
5580e2a
to
4c251dd
Compare
@fspmarshall Can you review this? |
lib/auth/grpcserver.go
Outdated
// Match NodeName to UUID, hostname or self-reported server address. | ||
if n.GetName() == req.NodeName || n.GetHostname() == req.NodeName || addr == req.NodeName { | ||
node = n | ||
break | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some folks have multiple nodes with the same hostname and rely on uuid-based dialing to access them. This loop should probably check all nodes for matches with UUID, and only fallback to hostname/addr matches if none are found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n.GetName()
returns the UUID, so the first condition in this if
statement matches nodes by UUID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what I mean is that iteration shouldn't halt if a hostname/addr match is found because there might still be a node which matches UUID and that would need to take priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If some node has a hostname matching other node's UUID (which seems very unlikely), then we have multiple matches and no way to resolve the conflict and for the user to pick one of them.
I think the more likely issue is when user passes a hostname and multiple nodes match.
Changed the logic to catch that scenario - if there are multiple matches return an error.
4c251dd
to
45dfb57
Compare
45dfb57
to
096fe3a
Compare
PTAL @fspmarshall |
lib/auth/grpcserver.go
Outdated
} | ||
// Errors other than ErrSessionMFARequired mean something else is wrong, | ||
// most likely access denied. | ||
if noMFAAccessErr != services.ErrSessionMFARequired { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if any CheckAccessToXXX
APIs get inadvertently updated to wrap the error return, this will start to behave unexpectedly, so either errors.Is
or trace.Unwrap
might still be on the safe side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
I used trace.Unwrap
originally and had trouble (because I think trace.AccessDenied
unwraps into the underlying string error). But errors.Is
works well, updated.
b05a4f8
to
c8acbe1
Compare
@r0mant @russjones please review |
@@ -42,7 +42,7 @@ func PromptMFAChallenge(ctx context.Context, proxyAddr string, c *proto.MFAAuthe | |||
return &proto.MFAAuthenticateResponse{}, nil | |||
// TOTP only. | |||
case c.TOTP != nil && len(c.U2F) == 0: | |||
totpCode, err := prompt.Input(os.Stdout, os.Stdin, fmt.Sprintf("Enter an OTP code from a %sdevice", promptDevicePrefix)) | |||
totpCode, err := prompt.Input(os.Stderr, os.Stdin, fmt.Sprintf("Enter an OTP code from a %sdevice", promptDevicePrefix)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious why this change to stderr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of kubectl
's exec plugins.
We configure kubectl
to exec a tsh
subcommand every time it needs to get credentials to connect.
kubectl
expects to read the credentials from stdout
of the execed process.
Anything written to stderr
will be forwarded to kubectl
's own stderr
to allow interactive commands like this.
@@ -126,6 +126,7 @@ func promptU2FChallenges(ctx context.Context, proxyAddr string, challenges []*pr | |||
}) | |||
} | |||
|
|||
log.Debugf("prompting U2F devices with facet %q", facet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like a debug leftover, need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is left on purpose.
If a user mis-configures U2F facets (proxy addresses, basically) on the auth_service
, it's really hard to debug when they use a different value in tsh --proxy=...
and U2F auth will fail.
c8acbe1
to
282a31a
Compare
262a22d
to
e49a15a
Compare
Added @r0mant @fspmarshall PTAL |
c76e250
to
4ee0366
Compare
lib/auth/auth_with_roles.go
Outdated
if hasLocalUserRole(a.context.Checker) && username == a.context.Identity.GetIdentity().Username { | ||
return nil | ||
} | ||
if hasRemoteUserRole(a.context.Checker) && username == a.context.UnmappedIdentity.GetIdentity().Username { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the implication of this change? Wouldn't this mean that, for example, a remote user has permissions to change the local user's password as long as they have the same username? I may be missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, didn't think about that.
It also turns out I was confusing "username" with "ssh login name", so I renamed that request field and it's no longer compared to the caller username.
Changes to currentUserAction
reverted.
lib/auth/auth_with_roles.go
Outdated
} | ||
|
||
var noMFAAccessErr, notFoundErr error | ||
switch t := req.Target.(type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a meta comment, but don't we want to keep the ACL layer slim without much business logic? Any downsides of moving this switch downstream, e.g. to an auth server method and invoke it here similar to how other methods here do? You can pass checker to it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this code felt like pure authz logic, which is why I put it here.
But I think you're right, moved all of this code into auth.Server
.
013402e
to
4d96b0d
Compare
This is client-side support for requesting single-use certs with an MFA check. The client doesn't know whether they need MFA check when accessing a resource, this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required. This is an extra round-trip for every session which causes ~20% slowdown in SSH logins: ``` $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date' Benchmark #1: /tmp/tsh-old ssh talos date Time (mean ± σ): 49.9 ms ± 1.0 ms [User: 15.1 ms, System: 7.4 ms] Range (min … max): 48.4 ms … 54.1 ms 59 runs Benchmark #2: /tmp/tsh-new ssh talos date Time (mean ± σ): 60.2 ms ± 1.6 ms [User: 19.1 ms, System: 8.3 ms] Range (min … max): 59.0 ms … 69.7 ms 50 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary '/tmp/tsh-old ssh talos date' ran 1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date' ``` Another few other internal changes: - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file) which easily causes panics. I added a noLocalKeyStore type instead that returns errors from all methods. - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing
An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/ In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection.
IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller.
This RPC is ran before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation.
Also, move the logic into auth.Server out of ServerWithRoles.
e7c417f
to
e1c31bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple more questions but otherwise lgtm.
tlsCert, err := key.TeleportTLSCertificate() | ||
if err != nil { | ||
return nil, trace.Wrap(err) | ||
} | ||
rootClusterName, err := tlsca.ClusterName(tlsCert.Issuer) | ||
if err != nil { | ||
return nil, trace.Wrap(err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There's RootClusterName()
method on the ProxyClient which you could probably just use instead of extracting from the TLS cert, although the outcome would probably be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RootClusterName
does the exact same thing as this code, including loading the key from localAgent
.
I need the key below too, so I chose to manually extract rootClusterName
to avoid loading the key twice.
* mfa: per-session MFA certs for SSH and Kubernetes This is client-side support for requesting single-use certs with an MFA check. The client doesn't know whether they need MFA check when accessing a resource, this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required. This is an extra round-trip for every session which causes ~20% slowdown in SSH logins: ``` $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date' Benchmark #1: /tmp/tsh-old ssh talos date Time (mean ± σ): 49.9 ms ± 1.0 ms [User: 15.1 ms, System: 7.4 ms] Range (min … max): 48.4 ms … 54.1 ms 59 runs Benchmark #2: /tmp/tsh-new ssh talos date Time (mean ± σ): 60.2 ms ± 1.6 ms [User: 19.1 ms, System: 8.3 ms] Range (min … max): 59.0 ms … 69.7 ms 50 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary '/tmp/tsh-old ssh talos date' ran 1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date' ``` Another few other internal changes: - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file) which easily causes panics. I added a noLocalKeyStore type instead that returns errors from all methods. - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing * Capture access approvals on MFA-issued certs * Address review feedback * Address review feedback * mfa: accept unknown nodes during short-term MFA cert creation An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/ In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection. * Validate username against unmapped user identity IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller. * mfa: add IsMFARequired RPC This RPC is ran before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation. * Fix integration tests * Correctly treat "Username" as login name in IsMFARequired Also, move the logic into auth.Server out of ServerWithRoles. * Fix TestHA * Address review feedback
* mfa: per-session MFA certs for SSH and Kubernetes This is client-side support for requesting single-use certs with an MFA check. The client doesn't know whether they need MFA check when accessing a resource, this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required. This is an extra round-trip for every session which causes ~20% slowdown in SSH logins: ``` $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date' Benchmark #1: /tmp/tsh-old ssh talos date Time (mean ± σ): 49.9 ms ± 1.0 ms [User: 15.1 ms, System: 7.4 ms] Range (min … max): 48.4 ms … 54.1 ms 59 runs Benchmark #2: /tmp/tsh-new ssh talos date Time (mean ± σ): 60.2 ms ± 1.6 ms [User: 19.1 ms, System: 8.3 ms] Range (min … max): 59.0 ms … 69.7 ms 50 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary '/tmp/tsh-old ssh talos date' ran 1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date' ``` Another few other internal changes: - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file) which easily causes panics. I added a noLocalKeyStore type instead that returns errors from all methods. - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing * Capture access approvals on MFA-issued certs * Address review feedback * Address review feedback * mfa: accept unknown nodes during short-term MFA cert creation An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/ In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection. * Validate username against unmapped user identity IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller. * mfa: add IsMFARequired RPC This RPC is ran before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation. * Fix integration tests * Correctly treat "Username" as login name in IsMFARequired Also, move the logic into auth.Server out of ServerWithRoles. * Fix TestHA * Address review feedback
* mfa: per-session MFA certs for SSH and Kubernetes This is client-side support for requesting single-use certs with an MFA check. The client doesn't know whether they need MFA check when accessing a resource, this is decided during an RBAC check on the server. So a client will always try to get a single-use cert, and the server will respond with NotNeeded if MFA is not required. This is an extra round-trip for every session which causes ~20% slowdown in SSH logins: ``` $ hyperfine '/tmp/tsh-old ssh talos date' '/tmp/tsh-new ssh talos date' Benchmark #1: /tmp/tsh-old ssh talos date Time (mean ± σ): 49.9 ms ± 1.0 ms [User: 15.1 ms, System: 7.4 ms] Range (min … max): 48.4 ms … 54.1 ms 59 runs Benchmark #2: /tmp/tsh-new ssh talos date Time (mean ± σ): 60.2 ms ± 1.6 ms [User: 19.1 ms, System: 8.3 ms] Range (min … max): 59.0 ms … 69.7 ms 50 runs Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options. Summary '/tmp/tsh-old ssh talos date' ran 1.21 ± 0.04 times faster than '/tmp/tsh-new ssh talos date' ``` Another few other internal changes: - client.LocalKeyAgent will now always have a non-nil LocalKeyStore. Previously, it would be nil (e.g. in a web UI handler or when using an identity file) which easily causes panics. I added a noLocalKeyStore type instead that returns errors from all methods. - requesting a user cert with a TTL < 1min will now succeed and return a 1min cert instead of failing * Capture access approvals on MFA-issued certs * Address review feedback * Address review feedback * mfa: accept unknown nodes during short-term MFA cert creation An unknown node could be an OpenSSH node set up via https://goteleport.com/teleport/docs/openssh-teleport/ In this case, we shouldn't prevent the user from connecting. There's a small risk of authz bypass - an attacker might know a different name/IP for a registered node which Teleport doesn't know about. But a Teleport node will still check RBAC and reject the connection. * Validate username against unmapped user identity IssueUserCertsWithMFA is called on the leaf auth server in case of trusted clusters. Username in the request object will be that of the original unmapped caller. * mfa: add IsMFARequired RPC This RPC is ran before every connection to check whether MFA is required. If a connection is against the leaf cluster, this request is forwarded from root to leaf for evaluation. * Fix integration tests * Correctly treat "Username" as login name in IsMFARequired Also, move the logic into auth.Server out of ServerWithRoles. * Fix TestHA * Address review feedback
This is client-side support for requesting single-use certs with an MFA
check.
The client doesn't know whether they need MFA check when accessing a
resource, this is decided during an RBAC check on the server. So a
client will always try to get a single-use cert, and the server will
respond with NotNeeded if MFA is not required. This is an extra
round-trip for every session which causes ~20% slowdown in SSH logins:
Another few other internal changes:
client.LocalKeyAgent will now always have a non-nil LocalKeyStore.
Previously, it would be nil (e.g. in a web UI handler or when using an
identity file) which easily causes panics. I added a noLocalKeyStore
type instead that returns errors from all methods.
requesting a user cert with a TTL < 1min will now succeed and return a
1min cert instead of failing