RFD 64: Bot for certificate renewals (Machine ID)#7986
Conversation
klizhentas
left a comment
There was a problem hiding this comment.
A couple of early comments
| First, a token will be generated for the bot to join the cluster. | ||
|
|
||
| ``` | ||
| $ tctl tokens add --type=bot |
There was a problem hiding this comment.
Let's consider scoping the permissions of each individual bot. It seems like the bot should be able to only issue credentials for limited set of roles. So when adding bot we should ask for roles the bot should assume:
tctl tokens add --type=bot --roles=Node,Proxy,my-role
# behind the scenes, create an impersonator bot role that is allowed to impersonate the set above
# read here about impersonation: https://goteleport.com/docs/access-controls/guides/impersonation/
| ``` | ||
| $ tctl tokens add --type=bot | ||
| The invite token: 13b74f49d27536dd5c514073097c197b | ||
| This token will expire in 60 minutes |
There was a problem hiding this comment.
Note that simplified node joining will make it possible to join without short-lived token, consider it in the design, what would change?
There was a problem hiding this comment.
Interesting! The biggest issue I see here is the simplified node joining is looking for the node to join the cluster within a 5 minute TTL from when the instance is launched. This seems fine if you've got a packaged AMI with everything already pre-installed and configured, but I'm not sure it makes sense when you want to join an existing non-Teleport instance like an OpenSSH server. You'd have to copy the software down, set up a system service, and then reboot the instance in order to get it to join - is that much of an improvement?
There was a problem hiding this comment.
packaged AMI with everything already pre-installed and configured
I agree, but that's surprisingly popular use-case on AWS. I'd guess that 80% of our users follow it.
There was a problem hiding this comment.
@timothyb89 add a note on future support for simplified tbot joining, so we don't forget to resolve this comment
There was a problem hiding this comment.
note: with the IAM join method there will be no TTL from when the instance is launched, which should make this nicer
| 0123-4567 foo | ||
| ``` | ||
|
|
||
| TODO: what else might we want to see? public addr? number (and kind) of certs? |
There was a problem hiding this comment.
yes, would be cool to see certs and roles under management for this bot
|
|
||
| In the event of an incident, the bot can be locked with the new `tctl lock` | ||
| functionality. A locked bot will be unable to renew certificates and will have | ||
| its existing certificates revoked. (TODO: this is how it works, right?) |
There was a problem hiding this comment.
The certs won't be revoked, that's why it's important to set aggressive default validity of the certificate. For example, by default the certbot will have a certificate that is scoped. It should be able to only request very short lived certificate using impersonation https://goteleport.com/docs/access-controls/guides/impersonation/.
Then once you block the bot's impersonator role, it won't be able to renew the certificates that will expire in a couple of hours.
|
|
||
| TODO: | ||
|
|
||
| - do we need a mode equivalant to `tctl auth sign`'s `--compat` flag for openssh |
There was a problem hiding this comment.
Not really, that should eventually go away, because it was targeting an ancient OpenSSH version that treated optional SSH cert extensions as critical by mistake.
|
|
||
| #### RBAC | ||
|
|
||
| The new system role for the bot will need: (verify this) |
There was a problem hiding this comment.
See the impersonation point above. It seems like there is no need for a hardcoded bot role, but instead each bot will have it's own unique impersonator role limiting it's scope.
| functionality. A locked bot will be unable to renew certificates and will have | ||
| its existing certificates revoked. (TODO: this is how it works, right?) | ||
|
|
||
| Note that a bot can manage multiple sets of certificates, and locking a |
There was a problem hiding this comment.
Please add a separate section on attack vector tree and possible compromise scenarios. E.g. machine with certbot was taken over. How would lock mitigate that and what is the blast radius? I assume with impersonation, the impersonator user will be blocked from issuing new certificates and the existing certificates will expire within 5 minute window. Any other possible attack vectors?
|
What's the behaviour for trusted clusters? My (admittedly limited) experience with OpenSSH nodes is that you needed to manually install CA certs from all the upstream clusters that you want to be able to access the OpenSSH node. Does the cert bot handle this for you, or is it out of scope for rev 1? I suppose you could always run a new bot for every upstream cluster. |
| ##### Security | ||
| In the above example, as no additional client configuration has been specified, | ||
| the bot will generate only a single set of end-user certificates with access to | ||
| the sum of all roles specified. Users with additional security requirements may |
There was a problem hiding this comment.
nit: I would say the "union" of all roles.
| durations) but this is out of scope for the initial implementation. | ||
|
|
||
| *TODO (Tim):* I think the renewal interval should be some fraction of the total | ||
| TTL (at most 1/2) rather than 75%. For example, given a 4 hour TTL, we should |
There was a problem hiding this comment.
I don't disagree. The 75% was a totally arbitrary number I came up with, so I would just propose whatever you want for the group to discuss.
There was a problem hiding this comment.
Fair enough, I'll make the change.
|
|
||
| *TODO* (Tim): Non-filesystem destinations are likely to be very difficult to | ||
| encode as CLI flags (e.g. k8s secrets). Perhaps the CLI should only support | ||
| local FS and other sinks should defer to `tbot.yaml`? |
There was a problem hiding this comment.
Could we avoid the need for tbot.yaml entirely if we made the flag only work for filesystem sinks, and then add a config file in the future if/when we decide to support other options?
If we can keep the CLI simple enough, I like the idea of not needing another config file.
Edit: should have read further. Looks like you're proposing exactly this.
There was a problem hiding this comment.
Yup. This is itself sort of a second-order effect stemming from moving multiple certs to a YAML file. When we do that, it makes sense to then also move "advanced" destinations to the file too (in my opinion, at least).
There was a problem hiding this comment.
yaml has it's advantages. I would keep the yaml file fully-functional. People can commit file to their repository and use it to deploy new versions for example.
| * (TODO) Additional artifacts may be written if we decide to write app config | ||
| files at this stage. | ||
|
|
||
| #### Destination |
There was a problem hiding this comment.
It looks like you've been preferring the term "sink" over destination. I would just pick one and be consistent.
There was a problem hiding this comment.
I think they're both still distinct and important:
- A destination contains files in a backend-agnostic way (with some configuration)
- A sink is a destination + certificates and their configuration (which certificate types, config files to generate, etc)
I think the distinction is important because we'll also store the bot's own data (the renewable cert and other state) in a destination, and the configuration for this will share that configuration syntax. For example, I could see Kubernetes secrets as a useful backend both for the renewable cert and end-user certs. (Or no-op to keep it only in memory, etc)
That's not to say I haven't been using the two terms somewhat interchangeably, I'll see about consolidating terminology where it makes sense.
There was a problem hiding this comment.
I agree with Zac, the difference is not that obvious to me. I'd say destination is pretty clear - target for a cert. Tbot's own state directory can be a completely separate thing.
There was a problem hiding this comment.
@timothyb89 added several comments after the first review pass
| ``` | ||
| $ tctl tokens add --type=bot | ||
| The invite token: 13b74f49d27536dd5c514073097c197b | ||
| This token will expire in 60 minutes |
There was a problem hiding this comment.
@timothyb89 add a note on future support for simplified tbot joining, so we don't forget to resolve this comment
| ... | ||
| ``` | ||
|
|
||
| Note that both direct dialing or connecting through a Teleport proxy are |
There was a problem hiding this comment.
style nit: if you remove "Note that" the sentence will read just as well.
| 0123-4567 foo false dev,ssh | ||
| ``` | ||
|
|
||
| When the bot starts, it generates a set of keys, and uses the provided join |
There was a problem hiding this comment.
generates a private key and certificate pair to be more precise?
There was a problem hiding this comment.
I'll reword this to be more accurate.
| secretName: example-secret | ||
| roles: [c] | ||
|
|
||
| - webhook: http://localhost:8001 |
| - directory: /home/alice/.tbot | ||
| roles: [a, b, c] | ||
|
|
||
| # TODO: probably need to specify permissions (this will be ugly) |
There was a problem hiding this comment.
why not supply secure defaults? (bot is only allowed to write, not read)
There was a problem hiding this comment.
We most likely will not be able to enforce secure defaults from the bot's process (unless it runs as root, which I'd prefer to avoid). I'll remove the TODO here since it just adds confusion.
This relates back to filesystem permissions from before, if we go the ACL route users would need to create the files using their target Unix user and then grant the bot's Unix user write-only access via ACLs (we could provide a helper subcommand for this). There's still more research to do here, I intend test this out and will update the doc.
| secretName: example-secret | ||
| roles: [c] | ||
|
|
||
| # Webhooks support only store |
There was a problem hiding this comment.
I'd remove webhooks for now
|
|
||
| ``` | ||
| $ tbot start ... | ||
| $ tbot config --ssh |
There was a problem hiding this comment.
most likely this will be many preset commands, so subcommand will make more sense. There will be one for client and another for server, for example. The advantage of the subcommand is that each subcommand can have separate parameter flags that will let users tweak base folder for example.
There was a problem hiding this comment.
It seems there's 2 sections on the configuration assistant, so I'll merge the relevant details into this section and remove the other one.
Anyway, I also prefer the subcommand approach, it will more closely match tsh config too.
| When the backend receives the request to generate new user credentials, it validates | ||
| the token, ensuring that the token is of the correct type and has not expired. | ||
|
|
||
| *TODO (Tim): Investigate adding a new gRPC endpoint for this functionality instead |
There was a problem hiding this comment.
I agree, this should be a separate GRPC endpoint, probably integrated with AWS node joining? Can we modify the existing GRPC endpoint to support multiple use-cases?
There was a problem hiding this comment.
I've now migrated this to a pure gRPC endpoint on the tbot branch (see the timothyb89/tbot branch).
As for AWS node joining, I see no reason why the endpoint (specifically the RenewableCertsRequest message) couldn't be extended to accept an EC2 identity document as we do with existing node joining.
|
@timothyb89 let's cut the scope a bit, remove the webhooks and other parts we won't include in the first version. This will allow us to focus on the security core - file permissions, signals, renewals, generations, GRPC endpoints and clients. |
|
|
||
| When the bot starts, it generates a set of keys, and uses the provided join | ||
| token to request a signed, renewable certificate from the auth server. These | ||
| certificates are stored in a restricted location, such as `/var/lib/tbot`, and |
There was a problem hiding this comment.
@zmb3 @klizhentas What do you guys think about putting the bots certificates in /var/lib/teleport/bot instead? I imagine customers are already locking down /var/lib/teleport and they can just get those protections for free.
There was a problem hiding this comment.
This seems like an easy win to me so I'll just change it (but can revert back if desired)
| ``` | ||
| $ tbot start \ | ||
| --name=foo \ | ||
| --output=file:/home/ubuntu/.tbot \ |
There was a problem hiding this comment.
Here is see we use --output but in file configuration it's directory and other words. I think it will be more consistent if the word matches. @klizhentas what do you think?
There was a problem hiding this comment.
I think it should match too. Since we're standardizing on only supporting directory destinations via CLI, it should just be --directory.
This adds the ability to impersonate one or more roles without
impersonating a particular user.
In Teleport today, when creating an impersonator role, both users and
roles must be specified as impersonation is fundamentally tied to an
existing Teleport user:
```yaml
allow:
impersonate:
users: ['jenkins']
roles: ['jenkins']
```
This is inconvenient for two reasons:
1. A user must exist for each set of roles you'd like to
impersonate, creating a UX burden.
2. It makes it difficult to use impersonation to reduce one's
permissions as you always inherit all of the roles granted to the
target user.
For the [certificate bot][bot] we'd instead like to use impersonation
to generate end-user (impersonated) certificates with a reduced set
of permissions. For example, given the following role:
```yaml
allow:
impersonate:
roles: ['jenkins', 'deploy']
```
We can then use `GenerateUserCerts` to issue certifices for a subset
of the allowed roles, e.g. one set of certificates with only the
`jenkins` role attached, and another with only `deploy`.
To that end, this patch:
1. Removes the requirement that roles define both `users` and
`roles` in impersonate conditions
2. Introduces a new `RoleRequests` field in `UserCertsRequest`
3. Modifies `generateUserCerts` to gather `roles` from
`RoleRequests` if allowed by an `allow` (with no `users`)
[bot]: #7986
|
|
||
| First, register a bot with the new `tctl bots add` subcommand, giving the bot a | ||
| name and one or more roles the bot is allowed to | ||
| [impersonate](https://goteleport.com/docs/access-controls/guides/impersonation/). |
There was a problem hiding this comment.
Can a bot have a role with an allow rule for role, auth_connector, oidc, saml, token? Is this a legit use case or it should be prevented against escalations?
There was a problem hiding this comment.
We have some customers interested in using the bot to manage access to Teleport's APIs for automation purposes, like issuing node join tokens on demand. I could see it enabling a lot of automation use-cases so I'm wary of locking out whole classes of functionality.
We could have a sanity check / blacklist in tctl bots add ... but that could be bypassed if users were motivated enough to modify the roles after the fact.
There was a problem hiding this comment.
I like the idea of a check warning the users about potential escalations after issuing a tctl bots add ... command. Is assigning roles with allowed writes/updates for role also a legit use-case?
| | TLS client certificate | `tlscert` | Signed by Teleport `UserCA` | TLS | Always | ||
|
|
||
| In the case of `directory` destinations, we encourage the use of POSIX ACLs. | ||
| This is meant to ensure that the primary (renewable) credentials are |
There was a problem hiding this comment.
Actively checking this when writing to them and throwing a warning/event is advisable.
There was a problem hiding this comment.
I'll add a note to warn about this. I'd like to make it a harder error but unfortunately we can't depend on ACLs being supported on any given filesystem / OS.
| | TLS client certificate | `tlscert` | Signed by Teleport `UserCA` | | ||
| | Private key (TLS) | `key` | | | ||
|
|
||
| The bot then generates a set of non-renewable credentials for each configured |
There was a problem hiding this comment.
The bot should check if the destination is a symlink and warn the user about it (e.g. if the symlink points to the bot directory or another directory outside the targeted user path, indicating an exploitation attempt). In order to achieve this securely and avoid race conditions, the destination should be locked first.
There was a problem hiding this comment.
I definitely like warning about symlinks, I'll add a note on that.
As for locking, I'm not sure I fully understand the recommendation. Are Linux locks really suitable as a security control? flock and friends seem to be advisory only, and I'm not sure what lock we'd hold to prevent a rename-style attack even if we could depend on mandatory locks being available.
There was a problem hiding this comment.
I was referring to a general locking (e.g. w ownership/permission change of the destination directory first), or using one of the two common strategies to avoid these attacks (either recreating the implementation of os.Create or use os.Rename to move it to the destination)
There was a problem hiding this comment.
This could be a challenge since we don't expect to have anything beyond minimal write privileges (via ACLs) to the specific destination files, they'll be (ideally) owned by an entirely separate user and created in advance by that user with tbot init .... I don't expect we'll be able to change permissions from the bot's end, and I assume the usual file moving tricks would erase our ACLs.
My read of the linked article makes me think the O_NOFOLLOW method should still work for our case, does that seem sane to you?
There was a problem hiding this comment.
While the O_NOFOLLOW flag causes the open to fail if the final component in the name is a symbolic link, the RESOLVE_NO_SYMLINKS should block symbolic-link traversal at all stages of pathname traversal, rather than just for the final component, so it's desirable (source).
|
|
||
| #### Audit Log | ||
|
|
||
| The auth server will emit new events to the audit log when: |
There was a problem hiding this comment.
Add an event to report when a bot is created
| The auth server will emit new events to the audit log when: | ||
|
|
||
| - a new renewable certificate is issued for the first time | ||
| - a certificate is renewed |
There was a problem hiding this comment.
This event should contain:
- The final
destinationpath that was set at the time - A flag indicating the ACL set on the destination (if insecure)
- A flag indicating if the destination was a symlink, and what it resolved to
- Which certs were generated (
key/key.pub/sshcert/known_hosts/ssh_config...) and eventually their pins/hashes
There was a problem hiding this comment.
As useful as information this info would be, there is currently no mechanism for relaying that sort of information to generateUserCerts(). The auth server also generates both cert types (TLS and SSH) for all cert requests, bot or otherwise.
I suppose we could extend the existing UserCertsRequest to add these as (optional) fields? It feels a bit odd to add fields expressly for audit logging purposes, but it is an option.
| for signalling or otherwise executing user code when a certificate rotation | ||
| occurs. Instead, end users are encouraged to use filesystem watches (e.g. | ||
| `inotify`) to watch for changes directly using a least-privileged system user. | ||
| We will include a helper command, `tbot watch`, to perform simple actions on |
There was a problem hiding this comment.
Will the command be just part of a “standalone” utility? What are the privileges that the launched script/command will have?
There was a problem hiding this comment.
tbot watch will be effectively standalone and will just provide a nice UX on top of inotify. It'll inherit privileges of the Unix user that runs it, but otherwise does nothing you couldn't also do with other tools, e.g. inotifywatch.
|
|
||
| #### User Experience | ||
|
|
||
| First, register a bot with the new `tctl bots add` subcommand, giving the bot a |
There was a problem hiding this comment.
The enrollment endpoint to add bots (through tctl bots add) using invite codes should be protected by appropriate rate-limiting mechanisms, just like the other token-based invitations.
There was a problem hiding this comment.
I've added an appendix detailing the new gRPC functions to be added and noted the need for ratelimiting on GenerateInitialRenewableUserCerts(), which is where the admin-created token is exchanged for renewable certificates.
There was a problem hiding this comment.
Small change here, we've removed the GenerateInitialRenewableUserCerts() endpoint entirely in favor of adding support for what we need in Teleport's standard auth.Register() flow.
* Allow impersonation of roles without users
This adds the ability to impersonate one or more roles without
impersonating a particular user.
In Teleport today, when creating an impersonator role, both users and
roles must be specified as impersonation is fundamentally tied to an
existing Teleport user:
```yaml
allow:
impersonate:
users: ['jenkins']
roles: ['jenkins']
```
This is inconvenient for two reasons:
1. A user must exist for each set of roles you'd like to
impersonate, creating a UX burden.
2. It makes it difficult to use impersonation to reduce one's
permissions as you always inherit all of the roles granted to the
target user.
For the [certificate bot][bot] we'd instead like to use impersonation
to generate end-user (impersonated) certificates with a reduced set
of permissions. For example, given the following role:
```yaml
allow:
impersonate:
roles: ['jenkins', 'deploy']
```
We can then use `GenerateUserCerts` to issue certifices for a subset
of the allowed roles, e.g. one set of certificates with only the
`jenkins` role attached, and another with only `deploy`.
To that end, this patch:
1. Removes the requirement that roles define both `users` and
`roles` in impersonate conditions
2. Introduces a new `RoleRequests` field in `UserCertsRequest`
3. Modifies `generateUserCerts` to gather `roles` from
`RoleRequests` if allowed by an `allow` (with no `users`)
[bot]: #7986
* Add `determineDesiredRolesAndTraits`; audit log on role impersonation
This splits initial role and trait determination into a new function,
`determineDesiredRolesAndTraits`, to improve control flow and clarity
given the new branches introduced for role impersonation.
Additionally, this moves the call to `CheckRoleImpersonation` down
to match regular user impersonation's flow, and emits an audit log
event on failure.
* Formatting fix
* Unit testing for role requests
This adds a new set of unit tests for role requests.
Also discovered the `impersonator` field wasn't being set for
role impersonation, so it's now set to the user's own username.
In other words, role impersonation will appear (in the audit log and
elsewhere) as self-impersonation.
* Clean up testing users between runs
* Deny most reimpersonation cases and add tests
This attempts to deny most cases of reimpersonation, where an
impersonated certificate might be used to generate certificates for
other roles the user is allowed to impersonate.
One test case is currently failing pending a solution.
* Add new DisallowReissue certificate extension
This adds a new DisallowReissue certificate extension that, if set,
prevents that identity from interacting with `GenerateUserCerts`.
This flag is always set when RoleRequests are used to prevent
unintended privilege escalation (while avoiding breaking changes to
Teleport's existing certificate generation behavior).
* Fix test lints
* Fix typo
* Fix test doc typo, add testcase for user impersonation misuse
* Apply suggestions from code review
Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
* Accept context in CreateRole per review feedback
* Fix misleading comment
Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
| ``` | ||
|
|
||
| In future iterations, additional output destination backends may be considered. | ||
| A few ideas include: |
There was a problem hiding this comment.
Would we able to use MachineID to obtain AWS creds via Teleport Application Access?
There was a problem hiding this comment.
I don't know enough about our AWS app access story today to say for certain, but if AWS can be configured to the trust certs issued by Teleport it should be possible.
This initial iteration of Machine ID only issues barebones certs so at the moment app / db / kubernetes access aren't expected to work. We'll be looking into supporting these as soon as possible, though.
|
The RFD is a handy reference, is now a good time to merge? |
|
@timothyb89 Yeah, I think we can merge this now. |
* Detailed new UX with config file fallback for complex cases * Added significantly more detail on how + why we'll use impersonation * Detailed plans for preventing certificate propagation * Added several notes on security plans (and several unsolved issues) * Updated several TODOs * Added more TODOs than removed, oops
* standardize on `destinations` * use only directory destinations on the CLI, remove unneeded section on CLI syntax * only generate one type of cert by default, allow generating both * document new config assistant approach (auto templates + helper CLI) * document memory-only storage backend * add detail to config example * use `/var/lib/teleport/bot` instead of `/var/lib/tbot` * mostly remove discussion of webhooks and k8s secrets (still referenced in a "future ideas" section) * several wording improvements
* Document plans on ACL use (`tbot init`) * Clarify plans on automatic bot locking on cert generation conflict * Add notes about EC2 identity documents in the future * Add note on disallow-reissue * Simplify plan for reload events (FS watch only + `tbot watch` helper) * Remove Bot resource * Use only gRPC endpoints
Also, remove "API Client Refresh" section
Aiming for merge readiness.
Co-authored-by: Ben Arent <ben@goteleport.com>
ae7e8e7 to
3830933
Compare
No description provided.