RFD 0205: Improved On-Prem Joining for Machine ID #52546
This RFD discusses improvements to on-prem and non-delegated bot joining, focusing on a new `challenge` join method.
> authentication ceremony, clients can use `go-jose` to marshal and sign a JWT
> which can then be verified easily on the server.
>
> TODO: This needs significant further elaboration and feedback.
Intentionally brief at the moment, I think I'd like to consult with some experts before committing to any crypto implementation. Very much open to ideas and feedback on this!
I think JWS/JWT seems to be a sensible choice here.
How will this flow work when the bot is providing its public key on initial join? Will it still need to perform an initial signature?
I think so, it seems like it'd be tidier if the secret-for-pubkey exchange was effectively an independent step and then it did a full regular rejoin process.
That said, keeping them fully separate may create unnecessary complexity elsewhere, since we'd need to have yet another public RPC. I'll try to explore both options and see which is simpler and easier to keep secure.
> #### Token Resource Example
>
> `challenge`-type tokens differ from other types in that they are intended to
We'll definitely want to discourage folks explicitly in documentation from setting a resource level expiry if it may break the rejoin mechanism.
I could see it still being useful: if I want a bot that only lasts for e.g. 1 month, it would be nice if it could expire automatically, though we'd need some additional logic during regular cert renewals to ensure the backing token still exists.
> though we'd need some additional logic during regular cert renewals to ensure the backing token still exists.
Yeah I think so - either way we just need to make sure the behaviour is fairly sane/well documented. I think worst case scenario would be someone deleting a known-bad token thinking it'll disconnect the bot - and it still has access.
Adds sections on alerting, keypair rotation, and intention to eventually support node joining.
> significantly more flexibility than today's `token` join method. This works by -
> in a sense - inverting the token joining procedure: bots generate an ED25519
> keypair, and the public key is copied to the server. The public key can be
> copied out-of-band, or bots can provide their public key on first join using a
I feel like I'm not understanding this correctly. If they can offer their public key on first join, what's to stop a malicious actor from just joining a bunch of bots? The security of the token is that the server generates it and knows it's valid. I get adding the public key out-of-band, but not this.
Ok, reading again, in this scenario where the public key is not added ahead of time, there would be an initial join secret, so basically a token would still be needed, but after that initial contact it wouldn't need the token again?
Right, there are two joining scenarios:
1. A Teleport admin creates a new `challenge`-type `ProvisionToken` with the ED25519 public key embedded in it. The bot joins by solving a challenge using its private key.
2. A Teleport admin creates a new `challenge`-type `ProvisionToken` without a public key. The server then generates an initial joining secret, which the admin can provide to a bot exactly like a `token`-type token today. The bot presents this joining secret along with its public key to bind its public key to the `ProvisionToken`, then follows the process from Option 1 to authenticate.
We'll also only allow a single bot instance (w/ generation counter, to avoid copied identities) to be associated with a single public key at any time. I'll make this more explicit, and try to clarify the two joining flows more.
Ok. Yeah, I'd like to know more on that, that feels like it would be hard to scale. If I have 10k servers that are old and don't have TPMs, I guess I could be automating creation of 10k public/private pairs but I wouldn't like it, especially if all the instances are from one bot.
I guess I get that if it's not like that, then if I'm malicious and get the public key I can do what I was worried about above, and just send it in and have access.
For bulk joining, you'd want to use the Terraform provider with scenario 2, and have it generate an initial join secret for each node - there's an example of this further down in the document. The generated secret value can be provided to tbot and it will behave similarly to token joining today; keypair generation and exchange with Auth will be entirely transparent to the user.
Overall, in its default mode (i.e. scenario 2) it's functionally identical to traditional token joining for end users during initial deployment, with a few enhancements to make Terraform automation a bit nicer (server side secret fulfillment, a sane story for restoring broken bots, etc).
Yeah, same could apply to Ansible or something for an on-prem scenario where Terraform won't work.
>     # If set, rejoining is only valid before this timestamp; may be
>     # incremented to extend bot lifespan.
>     expires: ""
Yeah, I'm thinking 3 different expiration times in the same resource is bound to get confusing.
Yeah... this is the biggest sore spot I think. I think I might remove the resource-level expiration (or, well, make it an error to set one), and might drop this secondary expiration. Playing with some ideas to make the ergonomics here make sense, but it's confusing as is.
As noted in another thread, I've renamed the duplicate fields. I'm not sure if that's ultimately a perfect solution but it's hopefully at least less confusing.
I do think (the field now known as) must_rejoin_before is optional. I think it's a reasonably sensible additional rejoining control admins might like to have available, but I think ultimately the design works alright with just a counter. If you think the end result is still confusing, we can remove this field.
This renames the join method to `bound-keypair`, adds sections on extensible keystore backends, non-Terraform UX, and scoped RBAC.
Adds a new URI joining proposal
New sections on state storage and rejected alternatives, plus rewrote several sections for clarity.
> precedent here as bots can view e.g. their own roles without explicitly
> having RBAC permissions to do so.
>
> The remaining rejoin counter should then be exposed as a Prometheus metric to
Perhaps the last time it (re)joined would be a good addition to the metrics as well.
It's a bit tricky to publish date fields directly in Prometheus, but I think we could publish a counter metric like this:
`teleport_bot_bound_keypair_joins{bot_name="example"} 1`
You could then alert against it with a query like this:
`sum by (bot_name) (increase(teleport_bot_bound_keypair_joins[1h])) > 1`
With `increase()`, Prometheus counts each increment of the counter over the window and accounts for counter resets if auth restarts. I think this would meet your needs?
Perhaps a good middle-ground would be teleport_bot_rejoined_seconds where it has an ever increasing value with the number of seconds since the last rejoin.
Creating an alert would be as simple as checking if the value is less than 1h.
I think a metric like system_uptime would be comparable.
> We should take steps to improve visibility of bots at or near expiry, including:
>
> - Configurable cluster alerts when the number of available renewals has crossed
Maybe add an alert that states something like: "A bot has recently automatically rejoined", without specifically watching the count.
Where 'recently' is configurable.
That's a good thought, I've added a note about it to this section. There's some mild complexity here, I think, since alerts could get noisy if there are hundreds (or more) of bots active. I think we'll definitely need to iterate on this as we see how this method gets used.
> token. This may require introduction of a new certificate field to track the
> exact join token used.
>
> - Public key locking: locks bots joining with a particular public key. A
Another option would be to allow a bot to be limited to only a single instance. If a second instance joins, the first one is revoked (by design).
Another option would be to only allow rejoining if all instances are rejoining at once. Say after a power outage on a site with multiple instances. The logic would validate that there are other instances in the same bot, and only allow a generation increase if they all 'agree'.
I've tried to further clarify the approach here in the "bot lifecycle" section since it was vague and wasn't fully written down, and actually pulling on this thread led to introducing "Join State Documents" in the new "Preventing Credential Duplication" section.
The short version is, we'll only allow one active bot instance per (bot, bound-keypair token) pair. If a rejoin occurs, the previous instance will not be allowed to refresh its certificates any further and will need to rejoin - and might additionally get locked out. If a bot's keypair is cloned and 2 clients start competing to rejoin, we'll detect this and lock all bots using the join token, similar to generation counter lockouts today.
> The URI syntax might look like this:
> ```
> tbot+[auth|proxy]://[join method]:[token value]@[addr]:[port]?key=val&foo=bar
> ```
For ease of copying this, you could base64 encode this and/or even sign it like a JWT.
I think there are some benefits to it being cleartext, mainly ease of confirming bots are routed to the right place, and the ability to tweak the URI if needed, e.g. connecting to a leaf cluster. From a UX perspective, it's still functionally a single token for copy and paste purposes, which Teleport can fully compose for you in the CLI/web. Do you see a use case for shorter connection strings?
I'm not too sure of the value of signing these - could you elaborate on the use case a bit? Bots still verify TLS on startup, and in general I don't see bots connecting to a hostile Teleport instance as a meaningful threat vector.
A token is easily copy-pastable in a shell (e.g. vim, bash), whereas a complex URL is usually a bit more annoying, especially if it contains + and & characters.
Having those replaced via base64 encoding significantly reduces the amount of user error in most cases.
Signing would just be to validate that you have copied the whole string. Maybe I should have called it hashing.
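The reviewer's suggestion could look roughly like the sketch below: the cleartext URI is wrapped in URL-safe base64 together with a short SHA-256 checksum so that a truncated paste is detected. This only illustrates the idea raised in this thread and is not part of the proposal; the function names and the 4-byte checksum length are assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

// checksumLen is the number of SHA-256 bytes appended to the URI (assumed).
const checksumLen = 4

// encodeJoinString wraps a joining URI into a shell-friendly string that
// contains no '+' or '&' characters.
func encodeJoinString(uri string) string {
	sum := sha256.Sum256([]byte(uri))
	payload := append([]byte(uri), sum[:checksumLen]...)
	return base64.RawURLEncoding.EncodeToString(payload)
}

// decodeJoinString reverses encodeJoinString and rejects strings whose
// checksum does not match, e.g. because part of the string was not copied.
func decodeJoinString(s string) (string, error) {
	payload, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return "", fmt.Errorf("decoding join string: %w", err)
	}
	if len(payload) < checksumLen {
		return "", errors.New("join string is too short")
	}
	uri := payload[:len(payload)-checksumLen]
	got := payload[len(payload)-checksumLen:]
	want := sha256.Sum256(uri)
	if !bytes.Equal(got, want[:checksumLen]) {
		return "", errors.New("checksum mismatch: join string may be truncated")
	}
	return string(uri), nil
}

func main() {
	uri := "tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443"
	encoded := encodeJoinString(uri)
	fmt.Println(encoded)

	decoded, err := decodeJoinString(encoded)
	fmt.Println(decoded, err)
}
```

The trade-off is the one noted in the reply above: the encoded form is easier to paste but can no longer be inspected or tweaked by hand.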
> joining as well as bots, as a more secure alternative to static or long-lived
> join tokens.
>
> ### Additional Keypair Protections
Would you allow CA based joining? In that case someone could pregenerate the tbot private keypair and sign it from a centralized CA.
One could then add that public CA to the ProvisionToken as allowed to rejoin.
That way all tbots could rejoin if they have a certificate signed by a single matching CA.
> URL paths and query parameters may also provide options for future extension if
> desired.
>
> ## Future Extensions and Alternatives
Would you consider a little scope creep and add extra rejoin 'conditions'?
I would for instance think that rejoining on the same IP address could automatically be allowed. It's not much, but rejoining from a different IP should (in our use-case) automatically be denied.
The 'extension' could be:
- Allow external factors or 'proof' to be sent along with a rejoin request to automatically allow it.
Things I was thinking about:
- (Soft)TPM joining (if you do not want auto TPM join, just a limited amount, and still want to do custom keypairs)
- Public IP Address
- Physical Keys
- Information it can exfiltrate from the place it's running. For instance a challenge against a networked HSM (Hashicorp Vault?).
I think there's definitely some room to add additional joining requirements over time, public IP in particular could be helpful.
I think your "soft TPM" idea might be covered by the proposal already, if I understand correctly? That sounds like the TPM/HSM private key storage backend?
What I'm proposing is to have a custom keypair provisioned on the machine as primary joining method and have the TPM joining be a secondary factor to enable auto joining.
The reason is, that we as a company don't want the hassle of managing TPM certificates.
I would allow an option to have the tbot machine id publish the TPM certificate that is always present, and 'register' it as an extra factor along with a first-time join token.
That way we could enable auto rejoining if you can prove you come from the same TPM.
The flow would be:
- We preprovision a disk with a keypair.
- We mount the disk in an edge device.
- We boot up the machine for the first time.
- It registers itself with the certificate and the TPM certificate.
- If the node goes down for >24h, it can rejoin if it still has a valid provisioned certificate AND can prove itself via the TPM.
- We periodically rotate the generated certificate (every 6m or so)
To be clear: in our use-case it's normal node joining. We are looking at Machine ID for 2-way communication with edge devices, but are not there yet.
I do understand that auto rejoin might be less risky for Nodes than for Machine ID, so perhaps the use-cases should be split.
I think this is an interesting flow for sure, though I do think we might have something mostly equivalent with the initial join secret + HSM storage:
- Create bound keypair token with initial join secret
- Provision disk with initial join secret
- On boot, the machine generates a keypair on the TPM/HSM and establishes trust with Auth using the join secret
- For each join, auth issues a challenge that must be completed using the keypair on the TPM/HSM
- If desired, the keypair can be rotated at any time, which would create a new HSM-stored keypair and switch Auth's trust to it.
My understanding is that an HSM or TPM-stored keypair should be more or less equivalent to the module's built-in key.
I think the main limitation here is that through this method Auth can't 100% trust that the client is in fact using a hardware keystore, but I'd argue that's strictly true for any automatic TPM enrollment.
It's probably out of scope for this first revision, but it might be interesting to add some additional challenge requirements in the future, like an additional EKCert attestation or similar. I think that's the only way we could (kind of) ensure a real TPM was in use, but that'd need some more in depth design.
Somewhat significant update after a discussion with @strideynet today - we're tentatively pivoting toward implementing this as a delegated join method, meaning bots will complete the challenge ceremony on every renewal attempt.
This means we won't use traditional renewable certs at all, and we determine if a particular join attempt costs a rejoin credit (probably need a better term for this) based on whether or not the bot presents an existing client certificate. We already use this mechanism to preserve bot instance IDs today.
This should keep the implementation more in line with other delegated joining methods (i.e. all of them except token) and could put us on a path toward eliminating traditional renewable certs in the future.
Quick summary of additional updates today:
- Added credential duplication mitigation (aka generation counter lite)
- Added protobuf draft
- Added keypair rotation procedure
- Added bot lifecycle details
- Tweaked the joining URL scheme to allow for both token name + additional parameter (i.e. initial joining secret)
I think my only outstanding item is that I'm thinking of renaming "rejoining" to just "joining", especially in the token spec. It's really the same process and I think it's conceptually simpler if we don't special case onboarding - that's a join, just like an expired connection.
Another small summary of updates:
- Renamed `rejoining` to `joining` in most contexts, since it's the same process.
- Expanded join state document, added example. During a discussion with @boxofrad and @strideynet, we decided to move the remaining join counter and rotation flag into the document as well.
- Added notes about locking old instances after a rejoin. There's probably some value here, but I think it could cause disruptions in apps as well.
@strideynet also suggested exclusively using the join state sequence number for all joins and refreshes, instead of using it for joins, and then using the standard generation counter (stored in the bot instance) for refreshes. I think there's some merit to this and haven't come to a firm conclusion yet. We agreed it's okay as written, but I may change it here after thinking on it further.
strideynet left a comment
I think I'm at a point where I'm happy with the overall design of this now - we probably want to begin implementation before merging down this RFD to see if anything shakes out.
* MWI: Enforce generation counter for bound keypair joining
This enables generation counter enforcement for bound keypair joining, and adds a new function, `shouldEnforceGenerationCounter`, to make enabling it for other join methods trivial. Bound keypair joining introduces a similar mechanism for use between its own recovery attempts, but it does rely on the standard generation counter for its renewal-style certificates, so every join attempt is subject to a generation check. This wasn't enabled in the original set of bound keypair PRs, so it's enabled here.
RFD: #52546
* Add tests for generation counter enforcement, fix error handling bug
This adds a test case for traditional generation counter enforcement with bound keypair joining, and fixes an error handling bug around certificate generation. This bug was mostly harmless before and would've just returned nil certs at worst, but is now meaningfully fallible.
* Fix broken test
* Fix lint
* Remove references to registration secret in test for rebase onto master
* Empty commit for CI
This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.
For example, consider these two equivalent CLI commands:
```
$ tbot start identity \
--proxy-server example.teleport.sh:443 \
--join-method bound_keypair \
--token my-token \
--registration-secret abc123 \
--storage ./tbot-data \
--destination ./tbot-user
$ tbot start identity \
tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
--storage ./tbot-data \
--destination ./tbot-user
```
As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.
End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:
```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
type: directory
path: ./tbot-data
services:
- type: identity
destination:
type: directory
path: ./tbot-user
```
This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.
RFD: #52546
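The parse-and-merge behavior described in this commit message can be sketched with `net/url`. This is a minimal sketch and not tbot's actual implementation: the `joinConfig` type is a hypothetical subset of the real config, and it assumes a "conflict" means a field is already set to a different value than the one in the URI.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// joinConfig is a hypothetical subset of tbot's connection settings.
type joinConfig struct {
	AddrKind   string // "proxy" or "auth"
	Addr       string // host:port
	JoinMethod string
	Token      string
	Secret     string // optional registration secret
}

// parseJoinURI splits a tbot+[auth|proxy]+[join method]:// URI into fields.
func parseJoinURI(raw string) (*joinConfig, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing joining URI: %w", err)
	}
	parts := strings.Split(u.Scheme, "+")
	if len(parts) != 3 || parts[0] != "tbot" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if parts[1] != "proxy" && parts[1] != "auth" {
		return nil, fmt.Errorf("unsupported address kind %q", parts[1])
	}
	if u.User == nil || u.User.Username() == "" || u.Host == "" {
		return nil, errors.New("joining URI must include a token and an address")
	}
	secret, _ := u.User.Password()
	return &joinConfig{
		AddrKind:   parts[1],
		Addr:       u.Host,
		JoinMethod: parts[2],
		Token:      u.User.Username(),
		Secret:     secret,
	}, nil
}

// mergeURI layers URI-derived values on top of c. It checks every field for
// conflicts first so that a failed merge leaves c unchanged.
func (c *joinConfig) mergeURI(u *joinConfig) error {
	fields := []struct {
		name string
		dst  *string
		val  string
	}{
		{"address kind", &c.AddrKind, u.AddrKind},
		{"address", &c.Addr, u.Addr},
		{"join method", &c.JoinMethod, u.JoinMethod},
		{"token", &c.Token, u.Token},
		{"registration secret", &c.Secret, u.Secret},
	}
	for _, f := range fields {
		if f.val != "" && *f.dst != "" && *f.dst != f.val {
			return fmt.Errorf("%s conflicts with the joining URI", f.name)
		}
	}
	for _, f := range fields {
		if f.val != "" {
			*f.dst = f.val
		}
	}
	return nil
}

func main() {
	parsed, err := parseJoinURI("tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *parsed)

	cfg := &joinConfig{Token: "other-token"}
	fmt.Println("merge error:", cfg.mergeURI(parsed))
}
```

Note that this sketch does not normalize the join method name, so the `bound-keypair` spelling from the URI is kept as-is instead of being mapped to the `bound_keypair` value used by `--join-method`.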
* MWI: Add joining URIs for tbot
* Fix lints
* Set `omitempty` flag on the URI field
This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.
* Add additional tests for joining URI config merging
* Add additional integration-style test for joining URIs
* Fix lint
* Consistently rename field to JoinURI and convert from arg to flag
* Remove interspersed flag as arg has been removed.
* Fix broken tests after rebase
* MWI: Bound Keypair Rotation (#55240)
* MWI: Bound Keypair Joining: Keypair rotation
This adds keypair rotation for bound keypair joining. When a rotation flag is set in the token spec, joining clients will be required to generate a new keypair and complete an additional joining challenge against the new keypair.
The flag is a timestamp token to allow for some level of idempotency; to make setting this flag easier, a new `tctl` command is included: `tctl bound-keypair request-rotation [token]`. This sets the flag to the current timestamp, and joining clients will be required to perform a rotation on their next authentication attempt.
Closes #55084
* Properly initialize the tctl command
* Refactor ClientState to allow storing intermediate state during rotation
* Fix invalid comparison and mutation logic
* Log signature suite and use cryptosuites helper
* Remove outdated TODO
* Frontload MFA check to avoid prompting twice
* Fix tctl command logging
* Fix incomplete docstring
* Fix imports
* Fix typo in log message
* Add tests for server-side rotation
Adjusts the test harness a bit and adds a batch of test cases for keypair rotation. Also fixes a lint error.
* Add additional test case for reused keys
* Add ClientState unit test
* Remove unnecessary log
* Fix test lints
* Fix reference to wrong key field
Now that the key can change, fix a dangling reference to the initial key field. Also s/marshalled/marshaled
* Wrap KeyHistoryEntry in a containing struct
This should allow for some future extension if needed.
* MWI: Bound Keypair - Registration Secrets (#55380)
* MWI: Bound Keypair - Registration Secrets
This adds support for initial joining via registration secrets. These one-time-use secrets emulate traditional token joining and allow clients to perform their initial join.
With this, no options are required for bound keypair-type tokens. While admins can specify a joining secret if they wish, if none is provided, one will be generated on the server and can be found in `status.bound_keypair.registration_secret` on the token resource.
When joining, this secret can be shared with clients in addition to the (no longer sensitive) token name. This secret is verified and a keypair rotation is requested, prompting the client to generate a new keypair, provide the public key to the server, and complete a joining challenge. It then joins the cluster as usual.
* Remove unnecessary token validation checks
* Rename tbot flag to --registration-secret
* Fix reference to renamed flag
* Various fixes, mostly more unwanted checks
* Add test cases for registration secrets
* Fix broken test
Onboarding config is no longer required, so fix the now-broken test
* Allow empty .spec.bound_keypair field for bound keypair tokens
This allows .spec.bound_keypair to be empty or entirely unset, since we can build defaults at creation time.
* Add test for secret expiry enforcement
* Handle nonexistent client state when using a registration secret
* Fix test lints
* Hide exact registration secret rejection reason from client
Registration secret errors now return a single error message to the client and log a more specific message on the server.
* MWI: Enforce generation counter for bound keypair joining (#55543)
* MWI: Enforce generation counter for bound keypair joining
This enables generation counter enforcement for bound keypair joining, and adds a new function, `shouldEnforceGenerationCounter`, to make enabling it for other join methods trivial.
Bound keypair joining introduces a similar mechanism for use between its own recovery attempts but does rely on the standard generation counter for its renewal-style certificates, so every join attempt is subject to a generation check. This wasn't enabled in the original set of bound keypair PRs, so it's enabled here.
RFD: #52546
* Add tests for generation counter enforcement, fix error handling bug
This adds a test case for traditional generation counter enforcement with bound keypair joining, and fixes an error handling bug around certificate generation. This bug was mostly harmless before and would've just returned nil certs at worst, but is now meaningfully fallible.
* Fix broken test
* Fix lint
* Remove references to registration secret in test for rebase onto master
* Empty commit for CI
* MWI: Add audit events for bound keypair joining (#55701)
* MWI: Add audit events for bound keypair joining
This adds 3 new audit events for bound keypair joining:
- `join_token.bound_keypair.recovery` - emitted when a join triggers a recovery (first join, or join with expired certs)
- `join_token.bound_keypair.rotation` - emitted when a keypair rotation takes place
- `join_token.bound_keypair.join_state_verification_failed` - emitted when the client provides an invalid join state document
* Fix UI lint
* Fix more UI lints
* Remove outdated TODO
* Fix tests broken by error message changes
* MWI: Add lock targets for join token name and bot instance ID (#56021)
* MWI: Add lock targets for join token name and bot instance ID
This adds two new lock targets meant to help lock specific bot instances without affecting all bots sharing a single user:
- Bot Instance ID: Targets a bot instance UUID, which has been assigned automatically to unique bot instances for some time
- Join token name: Targets the join token through which the bot joined
Bot instance ID locks are most useful for traditional token-joined bots, since tokens are single use and bots have no way to onboard again without human intervention if their old certs (and old bot instance) expire.
Join token locks are useful for bots using delegated join methods. They are particularly useful for bound keypair joining, where there is a direct 1:1 relationship between a "bot instance" and a token, even though that bot ID will change each time a recovery takes place.
Note that this does not currently set the join token for nodes even though that would theoretically be possible. We could consider supporting node locking in the future if there's demand.
* Set join token cert request field for non-renewable bot identities
* Fix ASN ID and pass through join token name in impersonated certs
* Tweak docstrings and add missing references for lib/decision
* Clarify docstrings
Clarifies various docstrings and makes sure they mention `token` joined bots cannot be targeted.
* Fix failing tests
* MWI: Use specific lock targets when locking out bots (#56110)
* MWI: Use specific lock targets when locking out bots
Building on #56021, this takes advantage of the new granular lock targets to lock bots during verification failures, namely:
- Generation counter mismatch: Locks a bot instance (token) or token name (bound keypair).
- Join state verification failure (bound keypair only)
Additionally, as the bound keypair joining process now generates locks, join state verification has been moved to take place explicitly *after* the main joining challenge has been completed. Without this, unauthenticated clients could abuse the new locking behavior by simply sending any invalid join state document.
* Use new lock targets for traditional generation counter lockouts
* Enforce new bot lock targets during cert generation
* Fix lint in `mutateStatusConsumeRecovery()`
* Add tests for new lock events
This adds new tests and updates existing tests to account for the new locking strategies, and to make sure existing clients are actually denied cluster access.
Additionally, as join state is now verified only after the regular challenge ceremony, a number of tests were broken as they set up the token in a technically impossible state, depending on the join state being checked first. Tests now explicitly specify their token keypair (bound or initial) to resolve this.
* Remove resolved TODOs
* Fix cut off comment
* MWI: Fix flaky tests for automatic bot lockouts (#56323)
* MWI: Fix flaky tests for automatic bot lockouts
This fixes a flaky test, `TestRegisterBotCertificateGenerationStolen`, which assumed authenticated clients would immediately lose access if locked. It also fixes another test introduced at the same time that contains a similar check.
* Increase maximum time limit
* MWI: Remove bound keypair experiment flag (#56592)
This removes the environment variable gating use of the bound keypair experiment.
* MWI: Fix bound keypair initial join secret field name (#56603)
* MWI: Fix bound keypair initial join secret field name
The `initial_join_secret` field was not given a proper YAML field name and was rendering as `initialjoinsecret`. Additionally, we've tried to standardize on referring to this field as "the registration secret", so this renames the field to match new terminology.
This hopefully does not count as a breaking change as registration secret functionality has not been made available in a release.
* Rename to `registration_secret`
* MWI: Fix typos in bound keypair ProvisionTokenV2 proto (#56653)
This fixes a number of spelling and grammar issues in the proto comments for ProvisionTokenSpecV2BoundKeypair and ProvisionTokenStatusV2BoundKeypair.
* MWI: Fix flaky test for bound keypair generation counter (#56732)
* MWI: Fix flaky test for bound keypair generation counter
This fixes another flaky test in TestServer_RegisterUsingBoundKeypairMethod_GenerationCounter, caused by locks occasionally not immediately taking effect.
* Apply suggestions from code review
* MWI: Add joining URIs for tbot (#56267)
* MWI: Add joining URIs for tbot
This adds support for joining URIs to tbot. Joining URIs are intended to condense tbot's growing list of required server-side config options or CLI parameters into a single string that can be provided to the `tbot` client.
For example, consider these two equivalent CLI commands:
```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data \
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```
As shown, all parameters necessary for bots to actually connect to and authenticate with the remote Teleport instance are included in a single parameter. This parameter can be generated by existing tooling, like the example command printed via `tctl bots add`, or the web UI. End users will only need to paste a single "token", provide their own client-side parameters (if any), and run. Similarly, we now have a new minimally viable YAML config:
```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```
This implementation is designed to be only additive, and should not interfere with existing config files or CLI strings. Parsed URI parameters are merged on top of the traditional config fields during the bot's pre-run check, and raise an error if any field conflicts.
RFD: #52546
* Fix lints
* Set `omitempty` flag on the URI field
This excludes the URI field when empty, to avoid polluting generated config files when not using URIs - which remains fully supported - and to clear test failures since a large number of golden tests would otherwise need to be regenerated.
* Add additional tests for joining URI config merging
* Add additional integration-style test for joining URIs
* Fix lint
* Consistently rename field to JoinURI and convert from arg to flag
* Remove interspersed flag as arg has been removed.
* Fix broken tests after rebase
* MWI: Verify locks against bound keypair tokens before mutating state (#56829)
* MWI: Verify locks against bound keypair tokens before mutating state
This adds an additional check for locks against a bound keypair token before any server-side state can be mutated, e.g. before potentially generating additional locks.
Locks were always checked before credentials were issued, so access was reliably prevented. However, if bots get locked, they will retry the connection in a loop. The locks are generated before they're checked, which can lead to an infinite lock creation loop.
This PR adds an additional check for locks against the join token before any server-side mutation takes place, but after we've at least partially verified the client's identity (via a challenge or registration secret) to avoid leaking new information about whether or not a token is locked.
* Don't test for exact lock counts
Preventing duplicate locks is best effort and subject to the lock checks actually returning an error when a lock exists in a timely manner, so don't assume we won't have duplicates in the test.
* Try to call t.Helper() when possible in testExtractBotParamsFromCerts
)
* MWI: Minimal bound-keypair joining implementation (#54371)
* MWI: Minimal bound-keypair joining implementation
This includes a minimal implementation of bound-keypair joining. This first iteration requires preregistered public keys, and requires `unlimited` and `insecure` flags to be set on bound keypair tokens. Minimal client-side implementation will be in a follow up PR.
RFD: #52546
Closes #53373
* Refactor challenge response function, rebase on updated protos branch
This includes a number of changes:
- Rebases on the latest protos branch. This includes removal of the new keypair field on initial join, and adds messages for interactive keypair rotation.
- Per the rebase, remaining_joins is removed in favor of using join_count for all calculations. The registration method and validity checks have been updated to reference that instead.
- Refactors challenge response function to allow for keypair rotation. We still don't implement rotation but the handler now receives the full proto message and produces a full proto response, so that we can easily handle the rotation case in the future.
- Challenge validation checks time fields explicitly to ensure the client didn't tamper with them.
- Added some missing docstrings
* Add joinserver test
* Fix lint error and add docstring
* Add tests for bound keypair challenge validation
* Remove client side package intended for other PR
* Fix various lints
* Add tests for RegisterUsingBoundKeypairMethod()
* Fix lints
* Add basic provisioning token CheckAndSetDefaults() tests
* Include bound public key in RegisterUsingBoundKeypairMethod return
This is passed back to clients as part of the proto certs message as confirmation that rotation succeeded, so the value needed to be plumbed through.
* Fixes after upstream proto change
We renamed and tweaked a number of proto fields, so this updates field references.
* Apply suggestions from code review
Co-authored-by: Dan Upton <daniel.upton@goteleport.com>
* Remove TODO
* Fix missed field rename
* Fix broken test
* Fix lurking nil pointer deref after field rename
---------
Co-authored-by: Dan Upton <daniel.upton@goteleport.com>
* Fix build due to backport changes
* Backport additional test changes
---------
Co-authored-by: Dan Upton <daniel.upton@goteleport.com>
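The keypair rotation change (#55240) in the commit list above describes the rotation flag as a timestamp so that requests are idempotent. A minimal Go sketch of why a timestamp gives that property is shown here. The `tokenState` type, its field names, and the `at` helper are hypothetical and are not taken from the Teleport source; the comparison rule is an assumption about how such a flag could be evaluated.

```go
package main

import (
	"fmt"
	"time"
)

// tokenState is a hypothetical, simplified view of a bound keypair token.
// The field names are illustrative only.
type tokenState struct {
	RotateAfter   *time.Time // set by an admin when a rotation is requested
	LastRotatedAt *time.Time // recorded by the server after a successful rotation
}

// at is a small helper that builds a timestamp pointer from Unix seconds.
func at(sec int64) *time.Time {
	t := time.Unix(sec, 0).UTC()
	return &t
}

// rotationRequired reports whether the next join must rotate the keypair.
// Because the request is a timestamp rather than a boolean, re-applying the
// same request after the client has rotated does not force another rotation.
func rotationRequired(t tokenState) bool {
	if t.RotateAfter == nil {
		return false
	}
	return t.LastRotatedAt == nil || t.LastRotatedAt.Before(*t.RotateAfter)
}

func main() {
	tok := tokenState{RotateAfter: at(100)}
	fmt.Println(rotationRequired(tok)) // requested, not yet performed

	tok.LastRotatedAt = at(150)
	fmt.Println(rotationRequired(tok)) // already satisfied

	tok.RotateAfter = at(200)
	fmt.Println(rotationRequired(tok)) // a newer request arrives
}
```

Under these assumptions, a boolean flag would need to be cleared atomically with the rotation itself, whereas a timestamp can simply be compared against the time of the last completed rotation.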
Of note:
- s/rejoin/recovery (give or take some conjugation)
- Updated keypair rotation and sequence diagram
- Updated join state spec
- Updated token resource example to match public docs
- Lots of misc terminology tweaks
I've done a pass to update the RFD to reflect the current state of what we've implemented, please take another look! Ideally we can merge this down now that it's implemented and more or less reflects what exists in v18/master.
* RFD 0205: Improved On-Prem Joining for Machine ID
This RFD discusses improvements to on-prem and non-delegated bot joining, focusing on a new `challenge` join method.
* Various whitespace fixes
* Add details after first feedback pass
Adds sections on alerting, keypair rotation, and intention to eventually support node joining.
* Add section detailing joining flows, various other details
* Fix cspell nits
* Rename to bound keypair, address review feedback
This renames the join method to `bound-keypair`, adds sections on extensible keystore backends, non-Terraform UX, and scoped RBAC.
* Rewrite join UX improvement to use URIs
Adds a new URI joining proposal
* Rewrite some sections, discuss state storage
New sections on state storage and rejected alternatives, plus rewrote several sections for clarity.
* Rename overlapping `expires` fields
* Pivot to delegated joining impl. Add sequence diagram.
* Don't assume ED25519; fix renew->refresh terminology
* Tweak joining URL scheme
Moves join method to the URL scheme to allow joining secrets; more examples added.
* Bot lifecycle; removed uniqueness requirement
Removes "soft bot expiration" section as this has been resolved with the switch to delegated joining. Also added a Bot Lifecycle section to describe how bots are expected to be disabled.
Also removed the public key uniqueness requirement. At join time bots now specify both the token name and joining secret (if any), so we won't need to search all tokens for a matching key. It was also not efficient to ensure uniqueness among all provision tokens.
* Add keypair rotation details
* Credential duplication mitigation, proto draft
Describes a method for mitigating credential duplication, and includes a protobuf draft.
* Rename most references to "rejoining" to just "joining"
These were fundamentally the same processes, so we'll standardize on calling both initial joining and rejoining different modes of just "joining".
* Add example join state document
* Add note about locking old instance after rejoin
* Small fixes
* Fix word missing from cspell
* Remove kubernetes reference
* Fix hanging sentence, add reference to join state document in lifecycle
* SSH key format, introduce "insecure" flag
* Mark as implemented
* Various updates to reflect implemented state
Of note:
- s/rejoin/recovery (give or take some conjugation)
- Updated keypair rotation and sequence diagram
- Updated join state spec
- Updated token resource example to match public docs
- Lots of misc terminology tweaks
* MWI: Enforce generation counter for bound keypair joining
This enables generation counter enforcement for bound keypair joining, and adds a new function, `shouldEnforceGenerationCounter`, to make enabling it for other join methods trivial.
Bound keypair joining introduces a similar mechanism for use between its own recovery attempts but does rely on the standard generation counter for its renewal-style certificates, so every join attempt is subject to a generation check. This wasn't enabled in the original set of bound keypair PRs, so it's enabled here.
RFD: #52546
* Add tests for generation counter enforcement, fix error handling bug
This adds a test case for traditional generation counter enforcement with bound keypair joining, and fixes an error handling bug around certificate generation. This bug was mostly harmless before and would've just returned nil certs at worst, but is now meaningfully fallible.
* Fix broken test
* Fix lint
* Remove references to registration secret in test for rebase onto master
* Empty commit for CI
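To illustrate the generation check this commit refers to, here is a minimal Go sketch of the general technique: the server keeps a counter per bot instance and rejects any renewal that presents a stale value. This is not the body of `shouldEnforceGenerationCounter`; the `instanceState` type and `verifyAndAdvance` function are hypothetical names, and the exact comparison Teleport performs is assumed rather than quoted.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrGenerationMismatch signals that a client presented a stale or duplicated
// certificate generation.
var ErrGenerationMismatch = errors.New("generation counter mismatch")

// instanceState is a hypothetical stand-in for the server-side record of a
// bot instance.
type instanceState struct {
	Generation uint64
}

// verifyAndAdvance compares the generation embedded in the client's current
// certificate with the stored value. On success it increments the stored
// counter and returns the generation to embed in the newly issued certificate.
func verifyAndAdvance(st *instanceState, presented uint64) (uint64, error) {
	if presented != st.Generation {
		return 0, fmt.Errorf("%w: presented %d, expected %d",
			ErrGenerationMismatch, presented, st.Generation)
	}
	st.Generation++
	return st.Generation, nil
}

func main() {
	st := &instanceState{Generation: 1}

	// The legitimate bot renews with generation 1 and receives generation 2.
	next, err := verifyAndAdvance(st, 1)
	fmt.Println(next, err)

	// A copy of the old credentials still presents generation 1 and is rejected.
	_, err = verifyAndAdvance(st, 1)
	fmt.Println(errors.Is(err, ErrGenerationMismatch))
}
```

In this sketch a mismatch only returns an error; the lockout behavior on mismatch belongs to a separate change and is not modeled here.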
* MWI: Add joining URIs for tbot
This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.
For example, consider these two equivalent CLI commands:
```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data \
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```
As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.
End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:
```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```
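As a rough illustration of how a joining URI of this shape breaks down into the individual parameters, here is a minimal Go sketch using the standard `net/url` package. The `JoinParams` type and `ParseJoinURI` function are hypothetical names, not tbot's actual parser, and mapping `bound-keypair` in the scheme to the `bound_keypair` join method is an assumption based on the two equivalent commands above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// JoinParams holds the connection settings carried by a joining URI.
// The type and its field names are hypothetical, for illustration only.
type JoinParams struct {
	AddrKind   string // "proxy" in the example URI
	JoinMethod string // e.g. "bound_keypair"
	Token      string // join token name
	Secret     string // optional registration secret
	Addr       string // host:port
}

// ParseJoinURI splits a URI of the form
// tbot+<kind>+<method>://<token>[:<secret>]@<host>:<port>.
func ParseJoinURI(raw string) (*JoinParams, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing joining URI: %w", err)
	}
	parts := strings.Split(u.Scheme, "+")
	if len(parts) != 3 || parts[0] != "tbot" {
		return nil, fmt.Errorf("unsupported joining URI scheme %q", u.Scheme)
	}
	if u.User == nil || u.User.Username() == "" {
		return nil, fmt.Errorf("joining URI is missing a token name")
	}
	if u.Host == "" {
		return nil, fmt.Errorf("joining URI is missing an address")
	}
	secret, _ := u.User.Password() // empty when no secret is present
	return &JoinParams{
		AddrKind: parts[1],
		// Assumption: hyphens in the scheme map to underscores in the method name.
		JoinMethod: strings.ReplaceAll(parts[2], "-", "_"),
		Token:      u.User.Username(),
		Secret:     secret,
		Addr:       u.Host,
	}, nil
}

func main() {
	p, err := ParseJoinURI("tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *p)
}
```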
This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.
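The merge-with-conflict-detection behavior described above could look roughly like the following Go sketch. The `Onboarding` type, `mergeField`, and `ApplyURI` are hypothetical names rather than tbot's real config code, and treating an identical value in both places as a non-conflict is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Onboarding is a hypothetical subset of the traditional config fields.
type Onboarding struct {
	ProxyServer string
	JoinMethod  string
	Token       string
}

// mergeField copies a URI-derived value into a config field, refusing to
// overwrite a different value that was already set explicitly.
func mergeField(name string, dst *string, fromURI string) error {
	switch {
	case fromURI == "":
		return nil // the URI did not provide this field
	case *dst == "" || *dst == fromURI:
		*dst = fromURI
		return nil
	default:
		return fmt.Errorf("field %q is set to %q but the joining URI specifies %q",
			name, *dst, fromURI)
	}
}

// ApplyURI merges URI-derived values on top of the existing config and
// reports every conflicting field at once.
func (o *Onboarding) ApplyURI(uri Onboarding) error {
	return errors.Join(
		mergeField("proxy_server", &o.ProxyServer, uri.ProxyServer),
		mergeField("join_method", &o.JoinMethod, uri.JoinMethod),
		mergeField("token", &o.Token, uri.Token),
	)
}

func main() {
	fromURI := Onboarding{
		ProxyServer: "example.teleport.sh:443",
		JoinMethod:  "bound_keypair",
		Token:       "my-token",
	}

	// An empty config simply takes the URI's values.
	cfg := Onboarding{}
	fmt.Println(cfg.ApplyURI(fromURI), cfg.Token)

	// A config that already names a different token is rejected.
	conflicting := Onboarding{Token: "other-token"}
	fmt.Println(conflicting.ApplyURI(fromURI))
}
```

Collecting all conflicts with `errors.Join` (Go 1.20+) lets a user fix every mismatched field in one pass instead of discovering them one at a time.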
RFD: #52546
* Fix lints
* Set `omitempty` flag on the URI field
This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.
* Add additional tests for joining URI config merging
* Add additional integration-style test for joining URIs
* Fix lint
* Consistently rename field to JoinURI and convert from arg to flag
* Remove interspersed flag as arg has been removed.
* Fix broken tests after rebase
* MWI: Bound Keypair Rotation (#55240) * MWI: Bound Keypair Joining: Keypair rotation This adds keypair rotation for bound keypair rotation. When a rotation flag is set in the token spec, joining clients will be required to generate a new keypair and complete an additional joining challenge against the new keypair. The flag is a timestamp token to allow for some level of idempotency; to make setting this flag easier, a new `tctl` command is included: `tctl bound-keypair request-rotation [token]`. This sets the flag to the current timestamp, and joining clients will be required to perform a rotation on their next authentication attempt. Closes #55084 * Properly initialize the tctl command * Refactor ClientState to allow storing intermediate state during rotation * Fix invalid comparison and mutation logic * Log signature suite and use cryptosuites helper * Remove outdated TODO * Frontload MFA check to avoid prompting twice * Fix tctl command logging * Fix incomplete docstring * Fix imports * Fix typo in log message * Add tests for server-side rotation Adjusts the test harness a bit and adds a batch of test cases for keypair rotation. Also fixes a lint error. * Add additional test case for reused keys * Add ClientState unit test * Remove unnecessary log * Fix test lints * Fix reference to wrong key field Now that the key can change, fix a dangling reference to the initial key field. Also s/marshalled/marshaled * Wrap KeyHistoryEntry in a containing struct This should allow for some future extension if needed. * MWI: Bound Keypair - Registration Secrets (#55380) * MWI: Bound Keypair - Registration Secrets This adds support for initial joining via registration secrets. These one time use secrets emulate traditional token joining and allow clients to perform their initial join With this, no options are required for bound keypair-type tokens. 
While admins can specify a joining secret if they wish, if none is provided, one will be generated on the server and can be found in `status.bound_keypair.registration_secret` on the token resource. When joining, this secret can be shared with clients in addition to the (no longer sensitive) token name. This secret is verified and a keypair rotation is requested, prompting the client to generate a new keypair, provide the public key to the server, and complete a joining challenge. It then joins the cluster as usual. * Remove unnecessary token validation checks * Rename tbot flag to --registration-secret * Fix reference to renamed flag * Various fixes, mostly more unwanted checks * Add test cases for registration secrets * Fix broken test Onboarding config is no longer required, so fix the now-broken test * Allow empty .spec.bound_keypair field for bound keypair tokens This allows .spec.bound_keypair to be empty or entirely unset, since we can build defaults at creation time. * Add test for secret expiry enforcement * Handle nonexistent client state when using a registration secret * Fix test lints * Hide exact registration secret rejection reason from client Registration secret errors now return a single error message to the client and log a more specific message on the server. * MWI: Enforce generation counter for bound keypair joining (#55543) * MWI: Enforce generation counter for bound keypair joining This enable generation counter enforcement for bound keypair joining, and adds a new function, `shouldEnforceGenerationCounter`, to make enabling it for other join methods trivial. Bound keypair joining introduces a similar mechanism for use between its own recovery attempts but does rely on the standard generation counter for it's renewal-style certificates so every join attempt is subject to a generation check. This wasn't enabled in the original set of bound keypair PRs so it's enabled here. 
RFD: #52546 * Add tests for generation counter enforcement, fix error handling bug This adds a test case for traditional generation counter enforcement with bound keypair joining, and fixes an error handling bug around certificate generation. This bug was mostly harmless before and would've just returned nil certs at worst, but is now meaningfully fallible. * Fix broken test * Fix lint * Remove references to registration secret in test for rebase onto master * Empty commit for CI * MWI: Add audit events for bound keypair joining (#55701) * MWI: Add audit events for bound keypair joining This adds 3 new audit events for bound keypair joining: - `join_token.bound_keypair.recovery` - emitted when a join triggers a recovery (first join, or join with expired certs) - `join_token.bound_keypair.rotation` - emitted when a keypair rotation takes place - `join_token.bound_keypair.join_state_verification_failed` - emitted when the client provides an invalid join state document * Fix UI lint * Fix more UI lints * Remove outdated TODO * Fix tests broken by error message changes * Fix lint * MWI: Add lock targets for join token name and bot instance ID (#56021) * MWI: Add lock targets for join token name and bot instance ID This adds two new lock targets meant to help lock specific bot instances without affecting all bots sharing a single user: - Bot Instance ID: Targets a bot instance UUID, which has been assigned automatically to unique bot instances for some time - Join token name: Targets the join token through which the bot joined Bot instance ID locks are most useful for traditional token-joined bots, since tokens are single use and bots have no way to onboard again without human intervention if their old certs (and old bot instance) expire. Join token locks are useful for bots using delegated join methods. 
They are particularly useful for bound keypair joining, where there is a direct 1:1 relationship between a "bot instance" and a token, even though that bot ID will change each time a recovery takes place. Note that this does not currently set the join token for nodes even though that would theoretically be possible. We could consider supporting node locking in the future if there's demand. * Set join token cert request field for non-renewable bot identities * Fix ASN ID and pass through join token name in impersonated certs * Tweak docstrings and add missing references for lib/decision * Clarify docstrings Clarifies various docstrings and makes sure they mention `token` joined bots cannot be targeted. * Fix failing tests * MWI: Use specific lock targets when locking out bots (#56110) * MWI: Use specific lock targets when locking out bots Building on #56021, this takes advantage of the new granular lock targets to lock bots during verification failures, namely: - Generation counter mismatch: Locks a bot instance (token) or token name (bound keypair). - Join state verification failure (bound keypair only) Additionally, as the bound keypair joining process now generates locks, join state verification has been moved to take place explicitly *after* the main joining challenge has been completed. Without this, unauthenticated clients could abuse the new locking behavior by simply sending any invalid join state document. * Use new lock targets for traditional generation counter lockouts * Enforce new bot lock targets during cert generation * Fix lint in `mutateStatusConsumeRecovery()` * Add tests for new lock events This adds new tests and updates existing tests to account for the new locking strategies, and to make sure existing clients are actually denied cluster access. 
  Additionally, as join state is now verified only after the regular challenge ceremony, a number of tests were broken as they set up the token in a technically impossible state, depending on the join state being checked first. Tests now explicitly specify their token keypair (bound or initial) to resolve this.

* Remove resolved TODOs
* Fix cut off comment
* MWI: Fix flaky tests for automatic bot lockouts (#56323)
* MWI: Fix flaky tests for automatic bot lockouts

  This fixes a flaky test, `TestRegisterBotCertificateGenerationStolen`, which assumed authenticated clients would immediately lose access if locked. It also fixes another test introduced at the same time that contains a similar check.

* Increase maximum time limit
* MWI: Remove bound keypair experiment flag (#56592)

  This removes the environment variable gating use of the bound keypair experiment.

* MWI: Fix bound keypair initial join secret field name (#56603)
* MWI: Fix bound keypair initial join secret field name

  The `initial_join_secret` field was not given a proper YAML field name and was rendering as `initialjoinsecret`. Additionally, we've tried to standardize on referring to this field as "the registration secret", so this renames the field to match new terminology.

  This hopefully does not count as a breaking change as registration secret functionality has not been made available in a release.

* Rename to `registration_secret`
* MWI: Fix typos in bound keypair ProvisionTokenV2 proto (#56653)

  This fixes a number of spelling and grammar issues in the proto comments for ProvisionTokenSpecV2BoundKeypair and ProvisionTokenStatusV2BoundKeypair.

* MWI: Fix flaky test for bound keypair generation counter (#56732)
* MWI: Fix flaky test for bound keypair generation counter

  This fixes another flaky test in TestServer_RegisterUsingBoundKeypairMethod_GenerationCounter, caused by locks occasionally not immediately taking effect.
* Apply suggestions from code review
* MWI: Add joining URIs for tbot (#56267)
* MWI: Add joining URIs for tbot

  This adds support for joining URIs to tbot. Joining URIs are intended to condense tbot's growing list of required server-side config options or CLI parameters into a single string that can be provided to the `tbot` client.

  For example, consider these two equivalent CLI commands:

  ```
  $ tbot start identity \
      --proxy-server example.teleport.sh:443 \
      --join-method bound_keypair \
      --token my-token \
      --registration-secret abc123 \
      --storage ./tbot-data --destination ./tbot-user

  $ tbot start identity \
      tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
      --storage ./tbot-data \
      --destination ./tbot-user
  ```

  As shown, all parameters necessary for bots to actually connect to and authenticate with the remote Teleport instance are included in a single parameter. This parameter can be generated by existing tooling, like the example command printed via `tctl bots add`, or the web UI. End users will only need to paste a single "token", provide their own client-side parameters (if any), and run.

  Similarly, we now have a new minimally viable YAML config:

  ```yaml
  version: v2
  uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
  storage:
    type: directory
    path: ./tbot-data
  services:
    - type: identity
      destination:
        type: directory
        path: ./tbot-user
  ```

  This implementation is designed to be only additive, and should not interfere with existing config files or CLI strings. Parsed URI parameters are merged on top of the traditional config fields during the bot's pre-run check, and raise an error if any field conflicts.

  RFD: #52546

* Fix lints
* Set `omitempty` flag on the URI field

  This excludes the URI field when empty, to avoid polluting generated config files when not using URIs - which remains fully supported - and to clear test failures since a large number of golden tests would otherwise need to be regenerated.
* Add additional tests for joining URI config merging
* Add additional integration-style test for joining URIs
* Fix lint
* Consistently rename field to JoinURI and convert from arg to flag
* Remove interspersed flag as arg has been removed.
* Fix broken tests after rebase
* MWI: Verify locks against bound keypair tokens before mutating state (#56829)
* MWI: Verify locks against bound keypair tokens before mutating state

  This adds an additional check for locks against a bound keypair token before any server-side state can be mutated, e.g. before potentially generating additional locks.

  Locks were always checked before credentials were issued, so access was reliably prevented. However, if bots get locked, they will retry the connection in a loop. The locks are generated before they're checked, which can lead to an infinite lock creation loop.

  This PR adds an additional check for locks against the join token before any server-side mutation takes place, but after we've at least partially verified the client's identity (via a challenge or registration secret) to avoid leaking new information about whether or not a token is locked.

* Don't test for exact lock counts

  Preventing duplicate locks is best effort and subject to the lock checks actually returning an error when a lock exists in a timely manner, so don't assume we won't have duplicates in the test.

* Try to call t.Helper() when possible in testExtractBotParamsFromCerts
* Bound Keypair: Fix lock generation on sequence desync (#57687)
* Bound Keypair: Fix lock generation on sequence desync

  This fixes an issue where locks may not be generated as expected when join state sequences desync unless the original client is also performing a recovery.

  Currently, if the original client is renewing with a valid identity, its bot instance ID is checked against the stored instance ID. If they don't match, access is denied without generating a lock.
  However, if client credentials are stolen and used to perform a recovery, this implicitly generates a new bot instance ID. The original client, presumably with still-valid certs containing the original ID, will try to renew as usual, but will only be denied. Join state verification is skipped, and no lock is created.

  (Note that, given enough time, the client's credentials will eventually expire. The next join attempt will then attempt a recovery, fail to verify join state, and generate a lock as expected. This just means locking takes ~1hr instead of ~20min, based on default values.)

  The fix is straightforward enough: the bot instance ID check is moved after join state verification. In practice this check is unlikely to be useful as any action that could cause the bot instance IDs to change should also cause join state verification to fail. The check remains at the end of the renewal flow as a sanity check, but only after the challenge ceremony and join state verification are performed.

  changelog: Bound Keypair Joining: Fix lock generation on sequence desync

* Add some test detail in a comment
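
The check ordering that #56110, #56829, and #57687 converge on can be summarized in a small sketch: the challenge comes first, then the existing-lock check, then join state verification (the step that may create a lock), and the bot instance ID comparison last. This is only an illustration of the ordering described in the commit messages above, not Teleport's implementation; `joinAttempt`, `outcome`, `evaluateJoin`, and the error values are hypothetical names, and the sketch assumes an instance ID mismatch is still denied without a new lock.

```go
package main

import (
	"errors"
	"fmt"
)

// joinAttempt is a hypothetical summary of what the server learned about one
// bound keypair join request. None of these names come from Teleport.
type joinAttempt struct {
	challengeOK  bool // client passed the challenge (or registration secret)
	tokenLocked  bool // a lock already targets this join token
	joinStateOK  bool // the join state document verified
	instanceIDOK bool // bot instance ID in the certs matches the stored one
}

// outcome reports whether the join is denied and whether a new lock is made.
type outcome struct {
	err        error
	createLock bool
}

var (
	errChallenge  = errors.New("challenge failed")
	errLocked     = errors.New("join token is locked")
	errJoinState  = errors.New("join state verification failed")
	errInstanceID = errors.New("bot instance ID mismatch")
)

// evaluateJoin applies the checks in the order described above.
func evaluateJoin(a joinAttempt) outcome {
	// 1. Unauthenticated clients are rejected before anything can create a
	//    lock, so sending a bogus join state document cannot lock a bot out.
	if !a.challengeOK {
		return outcome{err: errChallenge}
	}
	// 2. Existing locks are checked before any state is mutated, so a locked
	//    bot retrying in a loop does not keep generating new locks.
	if a.tokenLocked {
		return outcome{err: errLocked}
	}
	// 3. Join state verification runs before the instance ID comparison, so a
	//    sequence desync produces a lock even during a plain renewal.
	if !a.joinStateOK {
		return outcome{err: errJoinState, createLock: true}
	}
	// 4. Sanity check at the end of the flow (assumed: deny, no new lock).
	if !a.instanceIDOK {
		return outcome{err: errInstanceID}
	}
	return outcome{}
}

func main() {
	// Original client renewing after stolen credentials performed a recovery:
	// both the join state and the instance ID are stale.
	res := evaluateJoin(joinAttempt{challengeOK: true})
	fmt.Println(res.err, res.createLock)
}
```

With the instance ID check placed before step 3, the stale client in `main` would have been denied without a lock, which is the behavior #57687 set out to fix.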
This RFD discusses improvements to on-prem and non-delegated bot joining, focusing on a new `bound-keypair` join method.
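
As a rough illustration of the joining URI format from #56267 above, the example string can be taken apart with Go's standard `net/url` package. This is a minimal sketch under assumptions, not tbot's parser: `joinURI`, `parseJoinURI`, and `mergeField` are hypothetical names, the three-part `tbot+<address kind>+<join method>` scheme layout is inferred from the single example in the commit message, and a "conflict" is assumed to mean that a config field and the URI both set different values.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// joinURI holds the pieces of a joining URI (hypothetical type).
type joinURI struct {
	AddressKind string // e.g. "proxy"
	JoinMethod  string // e.g. "bound-keypair"
	Token       string // join token name
	Secret      string // optional secret, e.g. a registration secret
	Address     string // host:port of the cluster
}

// parseJoinURI splits a string such as
// tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
func parseJoinURI(raw string) (*joinURI, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing joining URI: %w", err)
	}
	// Assumed scheme layout: tbot+<address kind>+<join method>.
	parts := strings.SplitN(u.Scheme, "+", 3)
	if len(parts) != 3 || parts[0] != "tbot" {
		return nil, fmt.Errorf("unsupported joining URI scheme %q", u.Scheme)
	}
	if u.User == nil || u.User.Username() == "" {
		return nil, errors.New("joining URI must include a token name")
	}
	if u.Host == "" {
		return nil, errors.New("joining URI must include an address")
	}
	secret, _ := u.User.Password() // empty when no secret is present
	return &joinURI{
		AddressKind: parts[1],
		JoinMethod:  parts[2],
		Token:       u.User.Username(),
		Secret:      secret,
		Address:     u.Host,
	}, nil
}

// mergeField layers a URI value on top of a traditional config value and
// reports a conflict when both are set to different values (an assumption).
func mergeField(name, configured, fromURI string) (string, error) {
	if configured != "" && fromURI != "" && configured != fromURI {
		return "", fmt.Errorf("field %q conflicts with the joining URI", name)
	}
	if fromURI != "" {
		return fromURI, nil
	}
	return configured, nil
}

func main() {
	u, err := parseJoinURI("tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *u)
}
```

Keeping the merge step separate from parsing mirrors the description in the commit message, where parsed URI parameters are merged onto the existing config fields during the pre-run check rather than replacing them.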