RFD 0205: Improved On-Prem Joining for Machine ID #52546

Merged
timothyb89 merged 26 commits into master from rfd/0205-improved-onprem-joining on Aug 18, 2025
@timothyb89 timothyb89 commented Feb 27, 2025

This RFD discusses improvements to on-prem and non-delegated bot joining, focusing on a new bound-keypair join method.

Rendered

This RFD discusses improvements to on-prem and non-delegated bot
joining, focusing on a new `challenge` join method.
Comment thread rfd/0205-improved-onprem-joining.md Outdated
authentication ceremony, clients can use `go-jose` to marshal and sign a JWT
which can then be verified easily on the server.

TODO: This needs significant further elaboration and feedback.
Contributor Author

Intentionally brief at the moment, I think I'd like to consult with some experts before committing to any crypto implementation. Very much open to ideas and feedback on this!

Contributor

I think JWS/JWT seems to be a sensible choice here.

How will this flow work when the bot is providing its public key on initial join? Will it still need to perform an initial signature?

Contributor Author

I think so, it seems like it'd be tidier if the secret-for-pubkey exchange was effectively an independent step and then it did a full regular rejoin process.

That said, keeping them fully separate may create unnecessary complexity elsewhere, since we'd need to have yet another public RPC. I'll try to explore both options and see which is simpler and easier to keep secure.

Comment thread rfd/0205-improved-onprem-joining.md Outdated

#### Token Resource Example

`challenge`-type tokens differ from other types in that they are intended to
Contributor

We'll definitely want to discourage folks explicitly in documentation from setting a resource level expiry if it may break the rejoin mechanism.

Contributor Author

I could see it still being useful: if I want a bot that only lasts for e.g. 1 month, it would be nice if it could expire automatically, though we'd need some additional logic during regular cert renewals to ensure the backing token still exists.

Contributor

> though we'd need some additional logic during regular cert renewals to ensure the backing token still exists.

Yeah I think so - either way we just need to make sure the behaviour is fairly sane/well documented. I think worst case scenario would be someone deleting a known-bad token thinking it'll disconnect the bot - and it still has access.

Adds sections on alerting, keypair rotation, and intention to
eventually support node joining.
Comment thread rfd/0205-improved-onprem-joining.md Outdated
significantly more flexibility than today's `token` join method. This works by -
in a sense - inverting the token joining procedure: bots generate an ED25519
keypair, and the public key is copied to the server. The public key can be
copied out-of-band, or bots can provide their public key on first join using a
Contributor

I feel like I'm not understanding this correctly. If they can offer their public key on first join, what's to stop a malicious actor from just joining a bunch of bots? The security of the token is that the server generates it and knows it's valid. I get adding the public key out-of-band, but not this.

Ok, reading again, in this scenario where the public key is not added ahead of time, there would be an initial join secret, so basically a token would still be needed, but after that initial contact it wouldn't need the token again?

Contributor Author
timothyb89, Feb 27, 2025

Right, there are two joining scenarios:

  1. A Teleport admin creates a new challenge-type ProvisionToken with the ED25519 public key embedded in it. The bot joins by solving a challenge using its private key.
  2. A Teleport admin creates a new challenge-type ProvisionToken without a public key. The server then generates an initial joining secret, which the admin can provide to a bot exactly like a token-type token today. The bot presents this joining secret along with its public key to bind its public key to the ProvisionToken, then follows the process from scenario 1 to authenticate.

We'll also only allow a single bot instance (w/ generation counter, to avoid copied identities) to be associated with a single public key at any time. I'll make this more explicit, and try to clarify the two joining flows more.
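To make the two scenarios concrete, here is a minimal sketch using only Go's standard library. It is not the RFD's implementation: `boundToken`, `bindPublicKey`, `newChallenge`, `solveChallenge`, and `verifyChallenge` are hypothetical names, and the bot signs a raw nonce here rather than a JWT as the RFD text suggests.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/subtle"
	"errors"
)

// boundToken is a hypothetical, simplified stand-in for a challenge-type
// ProvisionToken. The field names are illustrative only.
type boundToken struct {
	PublicKey     ed25519.PublicKey // empty until a key is bound
	InitialSecret string            // server-generated; cleared once used
}

// newBotKeypair is the bot-side step: generate an Ed25519 keypair.
func newBotKeypair() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// bindPublicKey covers scenario 2: the bot presents the initial joining
// secret along with its public key, and the key is bound to the token.
func (t *boundToken) bindPublicKey(secret string, pub ed25519.PublicKey) error {
	if len(t.PublicKey) != 0 {
		return errors.New("a public key is already bound to this token")
	}
	if t.InitialSecret == "" ||
		subtle.ConstantTimeCompare([]byte(secret), []byte(t.InitialSecret)) != 1 {
		return errors.New("invalid initial joining secret")
	}
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("invalid public key")
	}
	t.PublicKey = pub
	t.InitialSecret = "" // assumption: the secret is single use
	return nil
}

// newChallenge returns a random nonce for the bot to sign.
func newChallenge() ([]byte, error) {
	nonce := make([]byte, 32)
	_, err := rand.Read(nonce)
	return nonce, err
}

// solveChallenge is the bot-side step: sign the nonce with the private key.
func solveChallenge(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

// verifyChallenge checks the signature over the nonce against the bound
// public key. Both scenarios end with this step.
func (t *boundToken) verifyChallenge(nonce, sig []byte) bool {
	if len(t.PublicKey) != ed25519.PublicKeySize {
		return false // no key bound yet
	}
	return ed25519.Verify(t.PublicKey, nonce, sig)
}
```

In scenario 1 the admin would populate `PublicKey` up front and the bot would go straight to the challenge; in scenario 2 `bindPublicKey` runs once, then the same challenge follows.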

Contributor

Ok. Yeah, I'd like to know more on that, that feels like it would be hard to scale. If I have 10k servers that are old and don't have TPMs, I guess I could be automating creation of 10k public/private pairs but I wouldn't like it, especially if all the instances are from one bot.

I guess I get that if it's not like that, then if I'm malicious and get the public key I can do what I was worried about above, and just send it in and have access.

Contributor Author
timothyb89, Feb 27, 2025

For bulk joining, you'd want to use the Terraform provider with scenario 2, and have it generate an initial join secret for each node - there's an example of this further down in the document. The generated secret value can be provided to tbot and it will behave similarly to token joining today; keypair generation and exchange with Auth will be entirely transparent to the user.

Overall, in its default mode (i.e. scenario 2) it's functionally identical to traditional token joining for end users during initial deployment, with a few enhancements to make Terraform automation a bit nicer (server side secret fulfillment, a sane story for restoring broken bots, etc).

Contributor

Yeah, same could apply to Ansible or something for an on-prem scenario where Terraform won't work.

Comment thread rfd/0205-improved-onprem-joining.md Outdated

# If set, rejoining is only valid before this timestamp; may be
# incremented to extend bot lifespan.
expires: ""
Collaborator

Yeah, I'm thinking 3 different expiration times in the same resource is bound to get confusing.

Contributor Author

Yeah... this is the biggest sore spot I think. I think I might remove the resource-level expiration (or, well, make it an error to set one), and might drop this secondary expiration. Playing with some ideas to make the ergonomics here make sense, but it's confusing as is.

Contributor Author

As noted in another thread, I've renamed the duplicate fields. I'm not sure if that's ultimately a perfect solution but it's hopefully at least less confusing.

I do think (the field now known as) must_rejoin_before is optional. I think it's a reasonably sensible additional rejoining control admins might like to have available, but I think ultimately the design works alright with just a counter. If you think the end result is still confusing, we can remove this field.

This renames the join method to `bound-keypair`, adds sections on
extensible keystore backends, non-Terraform UX, and scoped RBAC.
Adds a new URI joining proposal
New sections on state storage and rejected alternatives, plus
rewrote several sections for clarity.
Comment thread rfd/0205-improved-onprem-joining.md Outdated
precedent here as bots can view e.g. their own roles without explicitly
having RBAC permissions to do so.

The remaining rejoin counter should then be exposed as a Prometheus metric to
Contributor

Perhaps the last time it (re)joined would be a good addition to the metrics as well.

Contributor Author

It's a bit tricky to publish date fields directly in Prometheus, but I think we could publish a metric like this:

teleport_bot_bound_keypair_joins{bot_name="example"} 1

You could then alert against it with a query like this:

sum by (bot_name) (increase(teleport_bot_bound_keypair_joins[1h])) > 1

Prometheus should count each increase in the field and will account for counter resets if auth restarts. I think this would meet your needs?

Contributor

Perhaps a good middle-ground would be teleport_bot_rejoined_seconds where it has an ever increasing value with the number of seconds since the last rejoin.
Creating an alert would be as simple as checking if the value is less than 1h.

I think a metric like system_uptime would be comparable.
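For comparison, the two metric shapes discussed in this thread can be sketched side by side. This is only an illustration using the standard library: `joinStats`, `recordJoin`, and `render` are hypothetical names, and a real implementation would presumably register these with a Prometheus client library rather than format the samples by hand.

```go
package main

import (
	"fmt"
	"sort"
)

// joinStats is a hypothetical in-memory record of join activity per bot.
type joinStats struct {
	joins    map[string]int   // total joins seen per bot
	lastJoin map[string]int64 // Unix time of the most recent join per bot
}

func newJoinStats() *joinStats {
	return &joinStats{joins: map[string]int{}, lastJoin: map[string]int64{}}
}

// recordJoin notes that a bot joined at the given Unix time.
func (s *joinStats) recordJoin(bot string, unixNow int64) {
	s.joins[bot]++
	s.lastJoin[bot] = unixNow
}

// render returns both proposed metrics as text exposition samples,
// sorted by bot name: the join counter, and seconds since the last join.
func (s *joinStats) render(unixNow int64) []string {
	bots := make([]string, 0, len(s.joins))
	for bot := range s.joins {
		bots = append(bots, bot)
	}
	sort.Strings(bots)

	var out []string
	for _, bot := range bots {
		out = append(out,
			fmt.Sprintf("teleport_bot_bound_keypair_joins{bot_name=%q} %d", bot, s.joins[bot]),
			fmt.Sprintf("teleport_bot_rejoined_seconds{bot_name=%q} %d", bot, unixNow-s.lastJoin[bot]),
		)
	}
	return out
}
```

The counter needs a rate-style query to alert on, while the seconds-since value can be compared against a threshold directly, which is the trade-off raised above.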


We should take steps to improve visibility of bots at or near expiry, including:

- Configurable cluster alerts when the number of available renewals has crossed
Contributor

Maybe add an alert that states something like: "A bot has recently automatically rejoined", without specifically watching the count.
Where 'recently' is configurable.

Contributor Author

That's a good thought, I've added a note about it to this section. There's some mild complexity here, I think, since alerts could get noisy if there's hundreds (or more) bots active. I think we'll definitely need to iterate on this as we see how this method gets used.

token. This may require introduction of a new certificate field to track the
exact join token used.

- Public key locking: locks bots joining with a particular public key. A
Contributor

Another option would be to allow a bot to be limited to only a single instance. If a second instance joins, the first one is revoked (by design).

Another option would be to only allow rejoining if all instances are rejoining at once. Say after a power outage on a site with multiple instances. The logic would validate that there are other instances in the same bot, and only allow a generation increase if they all 'agree'.

Contributor Author

I've tried to further clarify the approach here in the "bot lifecycle" section since it was vague and wasn't fully written down, and actually pulling on this thread led to introducing "Join State Documents" in the new "Preventing Credential Duplication" section.

The short version is, we'll only allow one active bot instance per (bot, bound-keypair token) pair. If a rejoin occurs, the previous instance will not be allowed to refresh its certificates any further and will need to rejoin - and might additionally get locked out. If a bot's keypair is cloned and 2 clients start competing to rejoin, we'll detect this and lock all bots using the join token, similar to generation counter lockouts today.
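A rough sketch of the detection idea, assuming the join state reduces to a single sequence number that the server advances on every successful join. The RFD's join state document carries more than this; `tokenState`, `join`, and the error values are hypothetical names.

```go
package main

import "errors"

var (
	errLocked     = errors.New("join token is locked")
	errStaleState = errors.New("stale join state presented; locking token")
)

// tokenState is a hypothetical server-side record for one
// (bot, bound-keypair token) pair.
type tokenState struct {
	Sequence uint64 // sequence number issued on the last successful join
	Locked   bool
}

// join validates the sequence number a client presents. A client holding
// a cloned keypair will eventually present an outdated sequence number,
// at which point the token is locked for everyone using it.
func (s *tokenState) join(presented uint64) (uint64, error) {
	if s.Locked {
		return 0, errLocked
	}
	if presented != s.Sequence {
		s.Locked = true
		return 0, errStaleState
	}
	s.Sequence++
	return s.Sequence, nil
}
```

Once a second client joins with the same keypair, whichever client next presents the older sequence number trips the lock, similar in spirit to today's generation counter lockouts.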

Comment thread rfd/0205-improved-onprem-joining.md Outdated

The URI syntax might look like this:
```
tbot+[auth|proxy]://[join method]:[token value]@[addr]:[port]?key=val&foo=bar
```
Contributor

For ease of copying this, you could base64 encode this and/or even sign it like a JWT.

Contributor Author

I think there are some benefits to it being cleartext, mainly ease of confirming bots are routed to the right place, and ability to tweak the URI if needed, e.g. connecting to a leaf cluster. From a UX perspective, it's still functionally a single token for copy and paste purposes, which Teleport can fully compose for you in the CLI/web. Do you see a use case for shorter connection strings?

I'm not too sure of the value of signing these - could you elaborate on the use case a bit? Bots still verify TLS on startup, and in general I don't see bots connecting to a hostile Teleport instance as a meaningful threat vector.

Contributor

A token is easily copy-pastable in a shell (i.e. vim/bash etc), where a complex url usually is a bit more annoying. Especially if you have + and & characters.
Having those replaced via base64 encoding significantly reduces the amount of user error in most cases.

Signing would just be to validate that you have copied the whole string. Maybe I should have called it hashing.
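For reference, the encoding suggested here could look roughly like the sketch below: the URI is base64url-encoded with a short hash suffix so that a partially copied string is rejected. This illustrates the reviewer's suggestion only, not something the RFD adopts; `encodeJoinString`, `decodeJoinString`, and the 4-byte checksum length are assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/base64"
	"errors"
)

// checksumLen is an arbitrary choice for this sketch.
const checksumLen = 4

// encodeJoinString appends a truncated SHA-256 of the URI and encodes the
// result with the URL-safe base64 alphabet (no '+', '&', or padding).
func encodeJoinString(uri string) string {
	sum := sha256.Sum256([]byte(uri))
	payload := append([]byte(uri), sum[:checksumLen]...)
	return base64.RawURLEncoding.EncodeToString(payload)
}

// decodeJoinString reverses encodeJoinString and rejects strings whose
// checksum does not match, e.g. because part of the string was not copied.
func decodeJoinString(s string) (string, error) {
	payload, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return "", err
	}
	if len(payload) < checksumLen {
		return "", errors.New("join string is too short")
	}
	uri := payload[:len(payload)-checksumLen]
	sum := sha256.Sum256(uri)
	if !bytes.Equal(sum[:checksumLen], payload[len(payload)-checksumLen:]) {
		return "", errors.New("join string checksum mismatch")
	}
	return string(uri), nil
}
```

The checksum only guards against copy and paste mistakes; it provides no authenticity, which matches the "maybe I should have called it hashing" clarification.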

Comment thread rfd/0205-improved-onprem-joining.md Outdated
joining as well as bots, as a more secure alternative to static or long-lived
join tokens.

### Additional Keypair Protections
Contributor

Would you allow CA based joining? In that case someone could pregenerate the tbot private keypair and sign it from a centralized CA.
One could then add that public CA to the ProvisionToken as allowed to rejoin.

That way all tbot instances could rejoin if they have a certificate signed by a single matching CA.

URL paths and query parameters may also provide options for future extension if
desired.

## Future Extensions and Alternatives
Contributor

Would you consider a little scope creep and add extra rejoin 'conditions'?
I would for instance think that rejoining on the same IP address could automatically be allowed. It's not much, but rejoining from a different IP should (in our use-case) automatically be denied.

The 'extension' could be:

  • Allow external factors or 'proof' to be sent along with a rejoin request to automatically allow it.

Things I was thinking about:

  • (Soft)TPM joining (if you do not want auto TPM join, just a limited amount, and still want to do custom keypairs)
  • Public IP Address
  • Physical Keys
  • Information it can exfiltrate from the place it's running. For instance a challenge against a networked HSM (HashiCorp Vault?).

Contributor Author

I think there's definitely some room to add additional joining requirements over time, public IP in particular could be helpful.

I think your "soft TPM" idea might be covered by the proposal already, if I understand correctly? That sounds like the TPM/HSM private key storage backend?

Contributor

What I'm proposing is to have a custom keypair provisioned on the machine as primary joining method and have the TPM joining be a secondary factor to enable auto joining.

The reason is, that we as a company don't want the hassle of managing TPM certificates.
I would allow an option to have the tbot machine id publish the TPM certificate that is always present, and 'register' it as an extra factor along with a first-time join token.

That way we could enable auto rejoining if you can prove you come from the same TPM.

The flow would be:

  • We preprovision a disk with a keypair.
  • We mount the disk in an edge device.
  • We boot up the machine for the first time.
  • It registers itself with the certificate and the TPM certificate.
  • If the node goes down for >24h, it can rejoin if it still has a valid provisioned certificate AND can prove itself via the TPM.
  • We periodically rotate the generated certificate (every 6m or so)

To be clear: in our use-case it's normal node joining. We are looking at Machine ID for 2-way communication with edge devices, but are not there yet.

I do understand that auto rejoin might be less risky for Nodes than for Machine ID, so perhaps the use-cases should be split.

Contributor Author

I think this is an interesting flow for sure, though I do think we might have something mostly equivalent with the initial join secret + HSM storage:

  1. Create bound keypair token with initial join secret
  2. Provision disk with initial join secret
  3. On boot, the machine generates a keypair on the TPM/HSM and establishes trust with Auth using the join secret
  4. For each join, auth issues a challenge that must be completed using the keypair on the TPM/HSM
  5. If desired, the keypair can be rotated at any time, which would create a new HSM-stored keypair and switch Auth's trust to it.

My understanding is that an HSM or TPM-stored keypair should be more or less equivalent to the module's built-in key.

I think the main limitation here is that through this method Auth can't 100% trust that the client is in fact using a hardware keystore, but I'd argue that's strictly true for any automatic TPM enrollment.

It's probably out of scope for this first revision, but it might be interesting to add some additional challenge requirements in the future, like an additional EKCert attestation or similar. I think that's the only way we could (kind of) ensure a real TPM was in use, but that'd need some more in depth design.

Contributor Author

Somewhat significant update after a discussion with @strideynet today - we're tentatively pivoting toward implementing this as a delegated join method, meaning bots will complete the challenge ceremony on every renewal attempt.

This means we won't use traditional renewable certs at all, and we determine if a particular join attempt costs a rejoin credit (probably need a better term for this) based on whether or not the bot presents an existing client certificate. We already use this mechanism to preserve bot instance IDs today.

This should keep the implementation more in line with other delegated joining methods (i.e. all of them except token) and could put us on a path toward eliminating traditional renewable certs in the future.
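The accounting described above might be sketched as follows, assuming a simple integer counter. `joinState`, `Remaining`, and `authorizeJoin` are hypothetical names, and the challenge ceremony itself is assumed to have already succeeded before this check runs.

```go
package main

import "errors"

var errNoJoinsRemaining = errors.New("no joins remaining for this token")

// joinState is a hypothetical server-side record. Remaining stands in for
// the count of "rejoin credits" discussed in this thread.
type joinState struct {
	Remaining int
}

// authorizeJoin decides whether a join attempt consumes a credit. A bot
// that presents an existing, valid client certificate is treated as an
// ordinary renewal; a bot without one must spend a credit.
func (s *joinState) authorizeJoin(hasValidClientCert bool) error {
	if hasValidClientCert {
		return nil // renewal: no credit consumed
	}
	if s.Remaining <= 0 {
		return errNoJoinsRemaining
	}
	s.Remaining--
	return nil
}
```

With this shape, a bot that stays online never touches the counter, while one whose certificates expired needs an available credit to get back in.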

Contributor Author
timothyb89, Mar 20, 2025

Quick summary of additional updates today:

  • Added credential duplication mitigation (aka generation counter lite)
  • Added protobuf draft
  • Added keypair rotation procedure
  • Added bot lifecycle details
  • Tweaked the joining URL scheme to allow for both token name + additional parameter (i.e. initial joining secret)

I think my only outstanding item is that I'm thinking of renaming "rejoining" to just "joining", especially in the token spec. It's really the same process and I think it's conceptually simpler if we don't special case onboarding - that's a join, just like an expired connection.

Contributor Author

Another small summary of updates:

  • Renamed rejoining to joining in most contexts, since it's the same process.
  • Expanded join state document, added example. During a discussion with @boxofrad and @strideynet, we decided to move the remaining join counter and rotation flag into the document as well.
  • Added notes about locking old instances after a rejoin. There's probably some value here, but I think it could cause disruptions in apps as well.

@strideynet also suggested exclusively using the join state sequence number for all joins and refreshes, instead of using it for joins, and then using the standard generation counter (stored in the bot instance) for refreshes. I think there's some merit to this and haven't come to a firm conclusion yet. We agreed it's okay as written, but I may change it here after thinking on it further.

Contributor
strideynet left a comment

I think I'm at a point where I'm happy with the overall design of this now - we probably want to begin implementation before merging down this RFD to see if anything shakes out.

github-merge-queue bot pushed a commit that referenced this pull request Jun 17, 2025
* MWI: Enforce generation counter for bound keypair joining

This enables generation counter enforcement for bound keypair joining,
and adds a new function, `shouldEnforceGenerationCounter`, to make
enabling it for other join methods trivial.

Bound keypair joining introduces a similar mechanism for use between
its own recovery attempts but does rely on the standard generation
counter for its renewal-style certificates, so every join attempt is
subject to a generation check. This wasn't enabled in the original set
of bound keypair PRs so it's enabled here.

RFD: #52546

* Add tests for generation counter enforcement, fix error handling bug

This adds a test case for traditional generation counter enforcement
with bound keypair joining, and fixes an error handling bug around
certificate generation. This bug was mostly harmless before and
would've just returned nil certs at worst, but is now meaningfully
fallible.

* Fix broken test

* Fix lint

* Remove references to registration secret in test for rebase onto master

* Empty commit for CI
timothyb89 added a commit that referenced this pull request Jul 1, 2025
This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.

For example, consider these two equivalent CLI commands:

```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data \
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```

As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.

End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:

```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```

This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.

RFD: #52546
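
The parse-and-merge behavior described in this commit message can be approximated with Go's `net/url`. This is a sketch, not the code from the commit: `joinConfig`, `parseJoinURI`, and `mergeJoinConfig` are hypothetical names, and treating only differing values as a conflict is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// joinConfig is a hypothetical subset of tbot's connection settings.
type joinConfig struct {
	AddrKind   string // "auth" or "proxy"
	Addr       string // host:port
	JoinMethod string
	Token      string
	Secret     string // optional, e.g. a registration secret
}

// parseJoinURI parses URIs of the form
// tbot+[auth|proxy]+[join method]://[token][:secret]@[addr]:[port]
func parseJoinURI(raw string) (joinConfig, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return joinConfig{}, err
	}
	parts := strings.Split(u.Scheme, "+")
	if len(parts) != 3 || parts[0] != "tbot" {
		return joinConfig{}, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if parts[1] != "auth" && parts[1] != "proxy" {
		return joinConfig{}, fmt.Errorf("unsupported address kind %q", parts[1])
	}
	if u.User == nil || u.User.Username() == "" || u.Host == "" {
		return joinConfig{}, errors.New("joining URI needs a token and an address")
	}
	secret, _ := u.User.Password()
	return joinConfig{
		AddrKind:   parts[1],
		Addr:       u.Host,
		JoinMethod: parts[2],
		Token:      u.User.Username(),
		Secret:     secret,
	}, nil
}

// mergeJoinConfig layers URI-derived values on top of existing config and
// returns an error if a field is set to different values in both.
func mergeJoinConfig(base, fromURI joinConfig) (joinConfig, error) {
	fields := []struct {
		name     string
		dst      *string
		override string
	}{
		{"address kind", &base.AddrKind, fromURI.AddrKind},
		{"address", &base.Addr, fromURI.Addr},
		{"join method", &base.JoinMethod, fromURI.JoinMethod},
		{"token", &base.Token, fromURI.Token},
		{"secret", &base.Secret, fromURI.Secret},
	}
	for _, f := range fields {
		if f.override == "" {
			continue
		}
		if *f.dst != "" && *f.dst != f.override {
			return joinConfig{}, fmt.Errorf("%s conflicts with the joining URI", f.name)
		}
		*f.dst = f.override
	}
	return base, nil
}
```

Parsing the example URI above would yield the proxy address `example.teleport.sh:443`, the token `my-token`, and the secret `abc123`, which are the same values the long-form flags carry.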
github-merge-queue bot pushed a commit that referenced this pull request Jul 11, 2025
* MWI: Add joining URIs for tbot

* Fix lints

* Set `omitempty` flag on the URI field

This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.

* Add additional tests for joining URI config merging

* Add additional integration-style test for joining URIs

* Fix lint

* Consistently rename field to JoinURI and convert from arg to flag

* Remove interspersed flag as arg has been removed.
timothyb89 added a commit that referenced this pull request Jul 11, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 11, 2025
* MWI: Add joining URIs for tbot

This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.

For example, consider these two equivalent CLI commands:

```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```

As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.

End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:

```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```

This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.

RFD: #52546
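
As a rough illustration of the URI shape described above, the following Go sketch splits a joining URI into its parts with the standard `net/url` package. It is not the `tbot` implementation: `JoinParams`, `ParseJoinURI`, and `mergeField` are hypothetical names, the mapping from `bound-keypair` in the scheme to `bound_keypair` is inferred from the two example commands, and the conflict rule (error only when both values are set and differ) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// JoinParams is a hypothetical container for the values carried by a joining URI.
type JoinParams struct {
	AddrKind   string // e.g. "proxy", taken from the URI scheme
	JoinMethod string // e.g. "bound_keypair"
	Token      string
	Secret     string // optional registration secret
	Addr       string // host:port
}

// ParseJoinURI splits a URI of the form
// tbot+<addr kind>+<join method>://<token>[:<secret>]@<host:port>.
func ParseJoinURI(raw string) (*JoinParams, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing joining URI: %w", err)
	}
	parts := strings.SplitN(u.Scheme, "+", 3)
	if len(parts) != 3 || parts[0] != "tbot" {
		return nil, errors.New("joining URI scheme must look like tbot+<kind>+<method>")
	}
	if u.User == nil || u.User.Username() == "" {
		return nil, errors.New("joining URI must include a token name")
	}
	if u.Host == "" {
		return nil, errors.New("joining URI must include an address")
	}
	secret, _ := u.User.Password() // the secret is optional
	return &JoinParams{
		AddrKind: parts[1],
		// Assumption: the CLI spelling uses underscores where the URI uses hyphens.
		JoinMethod: strings.ReplaceAll(parts[2], "-", "_"),
		Token:      u.User.Username(),
		Secret:     secret,
		Addr:       u.Host,
	}, nil
}

// mergeField layers a URI-derived value on top of an existing config value
// and reports a conflict if both are set to different values (assumed rule).
func mergeField(name, existing, fromURI string) (string, error) {
	if existing != "" && fromURI != "" && existing != fromURI {
		return "", fmt.Errorf("field %q conflicts with the joining URI", name)
	}
	if fromURI != "" {
		return fromURI, nil
	}
	return existing, nil
}

func main() {
	p, err := ParseJoinURI("tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *p)
}
```

Putting the join method in the scheme leaves the userinfo portion free to carry the token name and the optional secret, which is why a single string is enough to describe the connection.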

* Fix lints

* Set `omitempty` flag on the URI field

This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.

* Add additional tests for joining URI config merging

* Add additional integration-style test for joining URIs

* Fix lint

* Consistently rename field to JoinURI and convert from arg to flag

* Remove interspersed flag as arg has been removed.

* Fix broken tests after rebase
timothyb89 added a commit that referenced this pull request Jul 12, 2025
* MWI: Enforce generation counter for bound keypair joining

This enables generation counter enforcement for bound keypair joining,
and adds a new function, `shouldEnforceGenerationCounter`, to make
enabling it for other join methods trivial.

Bound keypair joining introduces a similar mechanism for use between
its own recovery attempts, but it still relies on the standard generation
counter for its renewal-style certificates, so every join attempt is
subject to a generation check. This wasn't enabled in the original set
of bound keypair PRs, so it's enabled here.

RFD: #52546

* Add tests for generation counter enforcement, fix error handling bug

This adds a test case for traditional generation counter enforcement
with bound keypair joining, and fixes an error handling bug around
certificate generation. This bug was mostly harmless before and
would've just returned nil certs at worst, but is now meaningfully
fallible.

* Fix broken test

* Fix lint

* Remove references to registration secret in test for rebase onto master

* Empty commit for CI
timothyb89 added a commit that referenced this pull request Jul 15, 2025
* MWI: Enforce generation counter for bound keypair joining

timothyb89 added a commit that referenced this pull request Jul 15, 2025
* MWI: Add joining URIs for tbot

github-merge-queue bot pushed a commit that referenced this pull request Jul 24, 2025
* MWI: Bound Keypair Rotation (#55240)

* MWI: Bound Keypair Joining: Keypair rotation

This adds keypair rotation for bound keypair joining. When a rotation
flag is set in the token spec, joining clients will be required to
generate a new keypair and complete an additional joining challenge
against the new keypair.

The flag is a timestamp token to allow for some level of idempotency;
to make setting this flag easier, a new `tctl` command is included:
`tctl bound-keypair request-rotation [token]`. This sets the flag
to the current timestamp, and joining clients will be required to
perform a rotation on their next authentication attempt.

Closes #55084

* Properly initialize the tctl command

* Refactor ClientState to allow storing intermediate state during rotation

* Fix invalid comparison and mutation logic

* Log signature suite and use cryptosuites helper

* Remove outdated TODO

* Frontload MFA check to avoid prompting twice

* Fix tctl command logging

* Fix incomplete docstring

* Fix imports

* Fix typo in log message

* Add tests for server-side rotation

Adjusts the test harness a bit and adds a batch of test cases for
keypair rotation.

Also fixes a lint error.

* Add additional test case for reused keys

* Add ClientState unit test

* Remove unnecessary log

* Fix test lints

* Fix reference to wrong key field

Now that the key can change, fix a dangling reference to the initial
key field. Also s/marshalled/marshaled

* Wrap KeyHistoryEntry in a containing struct

This should allow for some future extension if needed.

* MWI: Bound Keypair - Registration Secrets (#55380)

* MWI: Bound Keypair - Registration Secrets

This adds support for initial joining via registration secrets. These
one-time-use secrets emulate traditional token joining and allow
clients to perform their initial join.

With this, no options are required for bound keypair-type tokens.
While admins can specify a joining secret if they wish, if none is
provided, one will be generated on the server and can be found in
`status.bound_keypair.registration_secret` on the token resource.

When joining, this secret can be shared with clients in addition to
the (no longer sensitive) token name. This secret is verified and
a keypair rotation is requested, prompting the client to generate a
new keypair, provide the public key to the server, and complete a
joining challenge. It then joins the cluster as usual.

* Remove unnecessary token validation checks

* Rename tbot flag to --registration-secret

* Fix reference to renamed flag

* Various fixes, mostly more unwanted checks

* Add test cases for registration secrets

* Fix broken test

Onboarding config is no longer required, so fix the now-broken test

* Allow empty .spec.bound_keypair field for bound keypair tokens

This allows .spec.bound_keypair to be empty or entirely unset,
since we can build defaults at creation time.

* Add test for secret expiry enforcement

* Handle nonexistent client state when using a registration secret

* Fix test lints

* Hide exact registration secret rejection reason from client

Registration secret errors now return a single error message to the
client and log a more specific message on the server.

* MWI: Enforce generation counter for bound keypair joining (#55543)

* MWI: Add audit events for bound keypair joining (#55701)

* MWI: Add audit events for bound keypair joining

This adds 3 new audit events for bound keypair joining:
- `join_token.bound_keypair.recovery` - emitted when a join triggers
  a recovery (first join, or join with expired certs)
- `join_token.bound_keypair.rotation` - emitted when a keypair
  rotation takes place
- `join_token.bound_keypair.join_state_verification_failed` - emitted
  when the client provides an invalid join state document

* Fix UI lint

* Fix more UI lints

* Remove outdated TODO

* Fix tests broken by error message changes

* MWI: Add lock targets for join token name and bot instance ID (#56021)

* MWI: Add lock targets for join token name and bot instance ID

This adds two new lock targets meant to help lock specific bot
instances without affecting all bots sharing a single user:
- Bot Instance ID: Targets a bot instance UUID, which has been
  assigned automatically to unique bot instances for some time
- Join token name: Targets the join token through which the bot
  joined

Bot instance ID locks are most useful for traditional token-joined
bots, since tokens are single use and bots have no way to onboard
again without human intervention if their old certs (and old bot
instance) expire.

Join token locks are useful for bots using delegated join methods.
They are particularly useful for bound keypair joining, where there
is a direct 1:1 relationship between a "bot instance" and a token,
even though that bot ID will change each time a recovery takes place.

Note that this does not currently set the join token for nodes even
though that would theoretically be possible. We could consider
supporting node locking in the future if there's demand.

* Set join token cert request field for non-renewable bot identities

* Fix ASN ID and pass through join token name in impersonated certs

* Tweak docstrings and add missing references for lib/decision

* Clarify docstrings

Clarifies various docstrings and makes sure they mention `token`
joined bots cannot be targeted.

* Fix failing tests

* MWI: Use specific lock targets when locking out bots (#56110)

* MWI: Use specific lock targets when locking out bots

Building on #56021, this takes advantage of the new granular lock
targets to lock bots during verification failures, namely:
- Generation counter mismatch: Locks a bot instance (token) or token
  name (bound keypair).
- Join state verification failure (bound keypair only)

Additionally, as the bound keypair joining process now generates
locks, join state verification has been moved to take place explicitly
*after* the main joining challenge has been completed. Without this,
unauthenticated clients could abuse the new locking behavior by simply
sending any invalid join state document.

* Use new lock targets for traditional generation counter lockouts

* Enforce new bot lock targets during cert generation

* Fix lint in `mutateStatusConsumeRecovery()`

* Add tests for new lock events

This adds new tests and updates existing tests to account for the new
locking strategies, and to make sure existing clients are actually
denied cluster access.

Additionally, as join state is now verified only after the regular
challenge ceremony, a number of tests were broken as they set up
the token in a technically impossible state, depending on the join
state being checked first. Tests now explicitly specify their token
keypair (bound or initial) to resolve this.

* Remove resolved TODOs

* Fix cut off comment

* MWI: Fix flaky tests for automatic bot lockouts (#56323)

* MWI: Fix flaky tests for automatic bot lockouts

This fixes a flaky test, `TestRegisterBotCertificateGenerationStolen`,
which assumed authenticated clients would immediately lose access if
locked. It also fixes another test introduced at the same time that
contains a similar check.

* Increase maximum time limit

* MWI: Remove bound keypair experiment flag (#56592)

This removes the environment variable gating use of the bound keypair
experiment.

* MWI: Fix bound keypair initial join secret field name (#56603)

* MWI: Fix bound keypair initial join secret field name

The `initial_join_secret` field was not given a proper YAML field
name and was rendering as `initialjoinsecret`. Additionally, we've
tried to standardize on referring to this field as "the registration
secret", so this renames the field to match new terminology.

This hopefully does not count as a breaking change as registration
secret functionality has not been made available in a release.

* Rename to `registration_secret`

* MWI: Fix typos in bound keypair ProvisionTokenV2 proto (#56653)

This fixes a number of spelling and grammar issues in the proto
comments for ProvisionTokenSpecV2BoundKeypair and
ProvisionTokenStatusV2BoundKeypair.

* MWI: Fix flaky test for bound keypair generation counter (#56732)

* MWI: Fix flaky test for bound keypair generation counter

This fixes another flaky test in
TestServer_RegisterUsingBoundKeypairMethod_GenerationCounter, caused
by locks occasionally not immediately taking effect.

* Apply suggestions from code review

* MWI: Add joining URIs for tbot (#56267)

* MWI: Verify locks against bound keypair tokens before mutating state (#56829)

* MWI: Verify locks against bound keypair tokens before mutating state

This adds an additional check for locks against a bound keypair token
before any server-side state can be mutated, e.g. before potentially
generating additional locks.

Locks were always checked before credentials were issued, so access
was reliably prevented. However, if bots get locked, they will retry
the connection in a loop. The locks are generated before they're
checked, which can lead to an infinite lock creation loop.

This PR adds an additional check for locks against the join token
before any server-side mutation takes place, but after we've at least
partially verified the client's identity (via a challenge or
registration secret) to avoid leaking new information about whether
or not a token is locked.

* Don't test for exact lock counts

Preventing duplicate locks is best effort and subject to the lock
checks actually returning an error when a lock exists in a timely
manner, so don't assume we won't have duplicates in the test.

* Try to call t.Helper() when possible in testExtractBotParamsFromCerts
github-merge-queue bot pushed a commit that referenced this pull request Aug 16, 2025
)

* MWI: Minimal bound-keypair joining implementation (#54371)

* MWI: Minimal bound-keypair joining implementation

This includes a minimal implementation of bound-keypair joining. This
first iteration requires preregistered public keys, and requires
`unlimited` and `insecure` flags to be set on bound keypair tokens.

Minimal client-side implementation will be in a follow up PR.

RFD: #52546
Closes #53373

* Refactor challenge response function, rebase on updated protos branch

This includes a number of changes:
- Rebases on the latest protos branch. This includes removal of the
  new keypair field on initial join, and adds messages for
  interactive keypair rotation.
- Per the rebase, remaining_joins is removed in favor of using
  join_count for all calculations. The registration method and
  validity checks have been updated to reference that instead.
- Refactors challenge response function to allow for keypair
  rotation. We still don't implement rotation but the handler now
  receives the full proto message and produces a full proto response,
  so that we can easily handle the rotation case in the future.
- Challenge validation checks time fields explicitly to ensure the
  client didn't tamper with them.
- Added some missing docstrings

* Add joinserver test

* Fix lint error and add docstring

* Add tests for bound keypair challenge validation

* Remove client side package intended for other PR

* Fix various lints

* Add tests for RegisterUsingBoundKeypairMethod()

* Fix lints

* Add basic provisioning token CheckAndSetDefaults() tests

* Include bound public key in RegisterUsingBoundKeypairMethod return

This is passed back to clients as part of the proto certs message as
confirmation that rotation succeeded, so the value needed to be
plumbed through.

* Fixes after upstream proto change

We renamed and tweaked a number of proto fields, so this updates
field references.

* Apply suggestions from code review

Co-authored-by: Dan Upton <daniel.upton@goteleport.com>

* Remove TODO

* Fix missed field rename

* Fix broken test

* Fix lurking nil pointer deref after field rename

---------

Co-authored-by: Dan Upton <daniel.upton@goteleport.com>

* Fix build due to backport changes

* Backport additional test changes

---------

Co-authored-by: Dan Upton <daniel.upton@goteleport.com>
Of note:
- s/rejoin/recovery (give or take some conjugation)
- Updated keypair rotation and sequence diagram
- Updated join state spec
- Updated token resource example to match public docs
- Lots of misc terminology tweaks
@timothyb89 timothyb89 requested a review from strideynet August 16, 2025 01:59
@timothyb89
Contributor Author

I've done a pass to update the RFD to reflect the current state of what we've implemented, please take another look! Ideally we can merge this down now that it's implemented and more or less reflects what exists in v18/master.

@timothyb89 timothyb89 added this pull request to the merge queue Aug 18, 2025
Merged via the queue into master with commit cb9bd12 Aug 18, 2025
40 checks passed
@timothyb89 timothyb89 deleted the rfd/0205-improved-onprem-joining branch August 18, 2025 21:24
ryanclark pushed a commit that referenced this pull request Aug 19, 2025
* RFD 0205: Improved On-Prem Joining for Machine ID

This RFD discusses improvements to on-prem and non-delegated bot
joining, focusing on a new `challenge` join method.

* Various whitespace fixes

* Add details after first feedback pass

Adds sections on alerting, keypair rotation, and intention to
eventually support node joining.

* Add section detailing joining flows, various other details

* Fix cspell nits

* Rename to bound keypair, address review feedback

This renames the join method to `bound-keypair`, adds sections on
extensible keystore backends, non-Terraform UX, and scoped RBAC.

* Rewrite join UX improvement to use URIs

Adds a new URI joining proposal

* Rewrite some sections, discuss state storage

New sections on state storage and rejected alternatives, plus
rewrote several sections for clarity.

* Rename overlapping `expires` fields

* Pivot to delegated joining impl. Add sequence diagram.

* Don't assume ED25519; fix renew->refresh terminology

* Tweak joining URL scheme

Moves join method to the URL scheme to allow joining secrets; more
examples added.

* Bot lifecycle; removed uniqueness requirement

Removes "soft bot expiration" section as this has been resolved
with the switch to delegated joining. Also added a Bot Lifecycle
section to describe how bots are expected to be disabled.

Also removed the public key uniqueness requirement. At join time bots
now specify both the token name and joining secret (if any), so we
won't need to search all tokens for a matching key. It was also not
efficient to ensure uniqueness among all provision tokens.

* Add keypair rotation details

* Credential duplication mitigation, proto draft

Describes a method for mitigating credential duplication, and
includes a protobuf draft.

* Rename most references to "rejoining" to just "joining"

These were fundamentally the same processes, so we'll standardize on
calling both initial joining and rejoining different modes of just
"joining".

* Add example join state document

* Add note about locking old instance after rejoin

* Small fixes

* Fix word missing from cspell

* Remove kubernetes reference

* Fix hanging sentence, add reference to join state document in lifecycle

* SSH key format, introduce "insecure" flag

* Mark as implemented

* Various updates to reflect implemented state

Of note:
- s/rejoin/recovery (give or take some conjugation)
- Updated keypair rotation and sequence diagram
- Updated join state spec
- Updated token resource example to match public docs
- Lots of misc terminology tweaks
ryanclark pushed a commit that referenced this pull request Aug 19, 2025
* RFD 0205: Improved On-Prem Joining for Machine ID

timothyb89 added a commit that referenced this pull request Aug 26, 2025
* MWI: Enforce generation counter for bound keypair joining

timothyb89 added a commit that referenced this pull request Aug 28, 2025
* MWI: Enforce generation counter for bound keypair joining

timothyb89 added a commit that referenced this pull request Aug 29, 2025
* MWI: Add joining URIs for tbot

timothyb89 added a commit that referenced this pull request Sep 4, 2025
* MWI: Enforce generation counter for bound keypair joining

This enable generation counter enforcement for bound keypair joining,
and adds a new function, `shouldEnforceGenerationCounter`, to make
enabling it for other join methods trivial.

Bound keypair joining introduces a similar mechanism for use between
its own recovery attempts, but it still relies on the standard
generation counter for its renewal-style certificates, so every join
attempt is subject to a generation check. This wasn't enabled in the
original set of bound keypair PRs, so it's enabled here.
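
As a rough sketch of what such a gate could look like: the commit names `shouldEnforceGenerationCounter` but does not show its signature, so the parameters below, the `JoinMethod` constants, and the `nextGeneration` helper are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// JoinMethod is a stand-in for the server's join method type.
type JoinMethod string

const (
	JoinMethodToken        JoinMethod = "token"
	JoinMethodGitHub       JoinMethod = "github"
	JoinMethodBoundKeypair JoinMethod = "bound_keypair"
)

// shouldEnforceGenerationCounter reports whether a join or renewal should be
// subject to a generation check. Assumed signature: renewable certificates
// are always checked, and individual join methods can opt in via the switch.
func shouldEnforceGenerationCounter(renewable bool, method JoinMethod) bool {
	if renewable {
		return true
	}
	switch method {
	case JoinMethodBoundKeypair:
		return true
	default:
		return false
	}
}

var errGenerationMismatch = errors.New("generation counter mismatch")

// nextGeneration compares the counter presented by the client with the
// counter stored on the server. On a match it returns the incremented value
// to embed in the next certificate; on a mismatch it returns an error.
func nextGeneration(stored, presented uint64) (uint64, error) {
	if presented != stored {
		return stored, errGenerationMismatch
	}
	return stored + 1, nil
}

func main() {
	fmt.Println(shouldEnforceGenerationCounter(false, JoinMethodBoundKeypair))
	fmt.Println(shouldEnforceGenerationCounter(false, JoinMethodGitHub))
	fmt.Println(nextGeneration(4, 4))
	fmt.Println(nextGeneration(4, 3))
}
```

Keeping the decision in a single function means adding another join method is a one-line change to the switch, which matches the stated goal of making this trivial to enable elsewhere.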

RFD: #52546

* Add tests for generation counter enforcement, fix error handling bug

This adds a test case for traditional generation counter enforcement
with bound keypair joining, and fixes an error handling bug around
certificate generation. This bug was mostly harmless before and
would've just returned nil certs at worst, but is now meaningfully
fallible.

* Fix broken test

* Fix lint

* Remove references to registration secret in test for rebase onto master

* Empty commit for CI
github-merge-queue bot pushed a commit that referenced this pull request Sep 12, 2025
* MWI: Bound Keypair Rotation (#55240)

* MWI: Bound Keypair Joining: Keypair rotation

This adds keypair rotation for bound keypair joining. When a rotation
flag is set in the token spec, joining clients will be required to
generate a new keypair and complete an additional joining challenge
against the new keypair.

The flag is a timestamp token to allow for some level of idempotency;
to make setting this flag easier, a new `tctl` command is included:
`tctl bound-keypair request-rotation [token]`. This sets the flag
to the current timestamp, and joining clients will be required to
perform a rotation on their next authentication attempt.

Closes #55084

* Properly initialize the tctl command

* Refactor ClientState to allow storing intermediate state during rotation

* Fix invalid comparison and mutation logic

* Log signature suite and use cryptosuites helper

* Remove outdated TODO

* Frontload MFA check to avoid prompting twice

* Fix tctl command logging

* Fix incomplete docstring

* Fix imports

* Fix typo in log message

* Add tests for server-side rotation

Adjusts the test harness a bit and adds a batch of test cases for
keypair rotation.

Also fixes a lint error.

* Add additional test case for reused keys

* Add ClientState unit test

* Remove unnecessary log

* Fix test lints

* Fix reference to wrong key field

Now that the key can change, fix a dangling reference to the initial
key field. Also s/marshalled/marshaled

* Wrap KeyHistoryEntry in a containing struct

This should allow for some future extension if needed.

* MWI: Bound Keypair - Registration Secrets (#55380)

* MWI: Bound Keypair - Registration Secrets

This adds support for initial joining via registration secrets. These
one time use secrets emulate traditional token joining and allow
clients to perform their initial join.

With this, no options are required for bound keypair-type tokens.
While admins can specify a joining secret if they wish, if none is
provided, one will be generated on the server and can be found in
`status.bound_keypair.registration_secret` on the token resource.

When joining, this secret can be shared with clients in addition to
the (no longer sensitive) token name. This secret is verified and
a keypair rotation is requested, prompting the client to generate a
new keypair, provide the public key to the server, and complete a
joining challenge. It then joins the cluster as usual.

* Remove unnecessary token validation checks

* Rename tbot flag to --registration-secret

* Fix reference to renamed flag

* Various fixes, mostly more unwanted checks

* Add test cases for registration secrets

* Fix broken test

Onboarding config is no longer required, so fix the now-broken test

* Allow empty .spec.bound_keypair field for bound keypair tokens

This allows .spec.bound_keypair to be empty or entirely unset,
since we can build defaults at creation time.

* Add test for secret expiry enforcement

* Handle nonexistent client state when using a registration secret

* Fix test lints

* Hide exact registration secret rejection reason from client

Registration secret errors now return a single error message to the
client and log a more specific message on the server.

* MWI: Enforce generation counter for bound keypair joining (#55543)

* MWI: Enforce generation counter for bound keypair joining

This enables generation counter enforcement for bound keypair joining,
and adds a new function, `shouldEnforceGenerationCounter`, to make
enabling it for other join methods trivial.

Bound keypair joining introduces a similar mechanism for use between
its own recovery attempts, but it still relies on the standard
generation counter for its renewal-style certificates, so every join
attempt is subject to a generation check. This wasn't enabled in the
original set of bound keypair PRs, so it's enabled here.

RFD: #52546

* Add tests for generation counter enforcement, fix error handling bug

This adds a test case for traditional generation counter enforcement
with bound keypair joining, and fixes an error handling bug around
certificate generation. This bug was mostly harmless before and
would've just returned nil certs at worst, but is now meaningfully
fallible.

* Fix broken test

* Fix lint

* Remove references to registration secret in test for rebase onto master

* Empty commit for CI

* MWI: Add audit events for bound keypair joining (#55701)

* MWI: Add audit events for bound keypair joining

This adds 3 new audit events for bound keypair joining:
- `join_token.bound_keypair.recovery` - emitted when a join triggers
  a recovery (first join, or join with expired certs)
- `join_token.bound_keypair.rotation` - emitted when a keypair
  rotation takes place
- `join_token.bound_keypair.join_state_verification_failed` - emitted
  when the client provides an invalid join state document

* Fix UI lint

* Fix more UI lints

* Remove outdated TODO

* Fix tests broken by error message changes

* Fix lint

* MWI: Add lock targets for join token name and bot instance ID (#56021)

* MWI: Add lock targets for join token name and bot instance ID

This adds two new lock targets meant to help lock specific bot
instances without affecting all bots sharing a single user:
- Bot Instance ID: Targets a bot instance UUID, which has been
  assigned automatically to unique bot instances for some time
- Join token name: Targets the join token through which the bot
  joined

Bot instance ID locks are most useful for traditional token-joined
bots, since tokens are single use and bots have no way to onboard
again without human intervention if their old certs (and old bot
instance) expire.

Join token locks are useful for bots using delegated join methods.
They are particularly useful for bound keypair joining, where there
is a direct 1:1 relationship between a "bot instance" and a token,
even though that bot ID will change each time a recovery takes place.

Note that this does not currently set the join token for nodes even
though that would theoretically be possible. We could consider
supporting node locking in the future if there's demand.
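
To illustrate why the narrower targets matter, the sketch below models lock matching with simplified, hypothetical types. It assumes a lock applies when every non-empty field in the target equals the corresponding value on the bot's identity; the real lock target type has more fields and may match differently.

```go
package main

import "fmt"

// LockTarget is a simplified stand-in for a lock target. Empty fields are
// treated as "not part of this lock".
type LockTarget struct {
	User          string
	BotInstanceID string
	JoinToken     string
}

// BotIdentity carries the values embedded in a bot's certificate that a
// lock can be evaluated against.
type BotIdentity struct {
	User          string
	BotInstanceID string
	JoinToken     string
}

// Matches reports whether the lock applies to the given identity. A target
// with no fields set matches nothing.
func (t LockTarget) Matches(id BotIdentity) bool {
	if t == (LockTarget{}) {
		return false
	}
	return (t.User == "" || t.User == id.User) &&
		(t.BotInstanceID == "" || t.BotInstanceID == id.BotInstanceID) &&
		(t.JoinToken == "" || t.JoinToken == id.JoinToken)
}

func main() {
	// Two instances of the same bot user, joined through different tokens.
	a := BotIdentity{User: "bot-ci", BotInstanceID: "instance-a", JoinToken: "token-a"}
	b := BotIdentity{User: "bot-ci", BotInstanceID: "instance-b", JoinToken: "token-b"}

	byInstance := LockTarget{BotInstanceID: "instance-a"}
	byToken := LockTarget{JoinToken: "token-b"}
	byUser := LockTarget{User: "bot-ci"}

	fmt.Println(byInstance.Matches(a), byInstance.Matches(b))
	fmt.Println(byToken.Matches(a), byToken.Matches(b))
	fmt.Println(byUser.Matches(a), byUser.Matches(b))
}
```

A user-level lock hits both instances, while the instance and token targets isolate a single one, which is the behavior the new targets are meant to enable.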

* Set join token cert request field for non-renewable bot identities

* Fix ASN ID and pass through join token name in impersonated certs

* Tweak docstrings and add missing references for lib/decision

* Clarify docstrings

Clarifies various docstrings and makes sure they mention `token`
joined bots cannot be targeted.

* Fix failing tests

* MWI: Use specific lock targets when locking out bots (#56110)

* MWI: Use specific lock targets when locking out bots

Building on #56021, this takes advantage of the new granular lock
targets to lock bots during verification failures, namely:
- Generation counter mismatch: Locks a bot instance (token) or token
  name (bound keypair).
- Join state verification failure (bound keypair only)

Additionally, as the bound keypair joining process now generates
locks, join state verification has been moved to take place explicitly
*after* the main joining challenge has been completed. Without this,
unauthenticated clients could abuse the new locking behavior by simply
sending any invalid join state document.

* Use new lock targets for traditional generation counter lockouts

* Enforce new bot lock targets during cert generation

* Fix lint in `mutateStatusConsumeRecovery()`

* Add tests for new lock events

This adds new tests and updates existing tests to account for the new
locking strategies, and to make sure existing clients are actually
denied cluster access.

Additionally, as join state is now verified only after the regular
challenge ceremony, a number of tests were broken as they set up
the token in a technically impossible state, depending on the join
state being checked first. Tests now explicitly specify their token
keypair (bound or initial) to resolve this.

* Remove resolved TODOs

* Fix cut off comment

* MWI: Fix flaky tests for automatic bot lockouts (#56323)

* MWI: Fix flaky tests for automatic bot lockouts

This fixes a flaky test, `TestRegisterBotCertificateGenerationStolen`,
which assumed authenticated clients would immediately lose access if
locked. It also fixes another test introduced at the same time that
contains a similar check.

* Increase maximum time limit

* MWI: Remove bound keypair experiment flag (#56592)

This removes the environment variable gating use of the bound keypair
experiment.

* MWI: Fix bound keypair initial join secret field name (#56603)

* MWI: Fix bound keypair initial join secret field name

The `initial_join_secret` field was not given a proper YAML field
name and was rendering as `initialjoinsecret`. Additionally, we've
tried to standardize on referring to this field as "the registration
secret", so this renames the field to match new terminology.

This hopefully does not count as a breaking change as registration
secret functionality has not been made available in a release.

* Rename to `registration_secret`

* MWI: Fix typos in bound keypair ProvisionTokenV2 proto (#56653)

This fixes a number of spelling and grammar issues in the proto
comments for ProvisionTokenSpecV2BoundKeypair and
ProvisionTokenStatusV2BoundKeypair.

* MWI: Fix flaky test for bound keypair generation counter (#56732)

* MWI: Fix flaky test for bound keypair generation counter

This fixes another flaky test in
TestServer_RegisterUsingBoundKeypairMethod_GenerationCounter, caused
by locks occasionally not immediately taking effect.

* Apply suggestions from code review

* MWI: Add joining URIs for tbot (#56267)

* MWI: Add joining URIs for tbot

This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.

For example, consider these two equivalent CLI commands:

```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data \
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```

As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.

End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:

```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```

This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.
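
The merge-with-conflict behavior could be sketched as below. The `mergeField` helper is hypothetical, and it assumes a conflict means the traditional field is already set to a different value than the URI provides; the real pre-run check may be stricter and reject any field that is set in both places.

```go
package main

import "fmt"

// mergeField overlays a value parsed from the joining URI onto an existing
// config field. It returns the merged value, or an error if both sources
// provide different non-empty values.
func mergeField(name, existing, fromURI string) (string, error) {
	switch {
	case fromURI == "":
		// The URI says nothing about this field; keep the config value.
		return existing, nil
	case existing == "" || existing == fromURI:
		return fromURI, nil
	default:
		return "", fmt.Errorf("field %q is set to %q but the joining URI specifies %q",
			name, existing, fromURI)
	}
}

func main() {
	fmt.Println(mergeField("token", "", "my-token"))
	fmt.Println(mergeField("token", "my-token", ""))
	fmt.Println(mergeField("token", "other-token", "my-token"))
}
```

Failing loudly on a mismatch avoids silently preferring one source over the other when a config file and a pasted URI disagree.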

RFD: #52546

* Fix lints

* Set `omitempty` flag on the URI field

This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.

* Add additional tests for joining URI config merging

* Add additional integration-style test for joining URIs

* Fix lint

* Consistently rename field to JoinURI and convert from arg to flag

* Remove interspersed flag as arg has been removed.

* Fix broken tests after rebase

* MWI: Verify locks against bound keypair tokens before mutating state (#56829)

* MWI: Verify locks against bound keypair tokens before mutating state

This adds an additional check for locks against a bound keypair token
before any server-side state can be mutated, e.g. before potentially
generating additional locks.

Locks were always checked before credentials were issued, so access
was reliably prevented. However, if bots get locked, they will retry
the connection in a loop. The locks are generated before they're
checked, which can lead to an infinite lock creation loop.

This PR adds an additional check for locks against the join token
before any server-side mutation takes place, but after we've at least
partially verified the client's identity (via a challenge or
registration secret) to avoid leaking new information about whether
or not a token is locked.
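
The ordering described above can be sketched as a short pipeline. Everything here is a simplified, hypothetical model: the `joinAttempt` flags stand in for the real challenge ceremony and join state verification, and `lockStore` stands in for the lock backend.

```go
package main

import (
	"errors"
	"fmt"
)

// joinAttempt records the outcome of each verification step for one attempt.
type joinAttempt struct {
	ChallengeOK bool // client proved possession of the key (or secret)
	JoinStateOK bool // client's join state document verified
}

// lockStore counts locks per join token name.
type lockStore struct {
	locks map[string]int
}

func (s *lockStore) isLocked(token string) bool { return s.locks[token] > 0 }
func (s *lockStore) lock(token string)          { s.locks[token]++ }

var errAccessDenied = errors.New("access denied")

// handleJoin applies the checks in the order described above:
//  1. authenticate the client, so unauthenticated callers learn nothing and
//     cannot trigger locks;
//  2. check for an existing lock before mutating any state, so a retrying
//     locked bot does not generate further locks;
//  3. verify join state, creating a lock on failure.
func handleJoin(store *lockStore, token string, a joinAttempt) error {
	if !a.ChallengeOK {
		return errAccessDenied
	}
	if store.isLocked(token) {
		return errAccessDenied
	}
	if !a.JoinStateOK {
		store.lock(token)
		return errAccessDenied
	}
	return nil
}

func main() {
	store := &lockStore{locks: map[string]int{}}
	bad := joinAttempt{ChallengeOK: true, JoinStateOK: false}

	// A locked bot retrying in a loop should not accumulate locks.
	for i := 0; i < 3; i++ {
		_ = handleJoin(store, "my-token", bad)
	}
	fmt.Println("locks:", store.locks["my-token"])
}
```

In this model the retry loop produces a single lock. As the commit notes, the real check is best effort, so duplicates remain possible if the lock is not yet visible when the next attempt arrives.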

* Don't test for exact lock counts

Preventing duplicate locks is best effort and subject to the lock
checks actually returning an error when a lock exists in a timely
manner, so don't assume we won't have duplicates in the test.

* Try to call t.Helper() when possible in testExtractBotParamsFromCerts

* Bound Keypair: Fix lock generation on sequence desync (#57687)

* Bound Keypair: Fix lock generation on sequence desync

This fixes an issue where locks may not be generated as expected when
join state sequences desync unless the original client is also
performing a recovery.

Currently, if the original client is renewing with a valid identity,
its bot instance ID is checked against the stored instance ID. If they
don't match, access is denied without generating a lock. However, if
client credentials are stolen and used to perform a recovery, this
implicitly generates a new bot instance ID. The original client,
presumably with still-valid certs containing the original ID, will
try to renew as usual, but will only be denied. Join state
verification is skipped, and no lock is created.

(Note that, given enough time, the client's credentials will
eventually expire. The next join attempt will then attempt a recovery,
fail to verify join state, and generate a lock as expected. This just
means locking takes ~1hr instead of ~20min, based on default values.)

The fix is straightforward enough: the bot instance ID check is moved
after join state verification. In practice this check is unlikely to
be useful as any action that could cause the bot instance IDs to
change should also cause join state verification to fail. The check
remains at the end of the renewal flow as a sanity check, but only
after the challenge ceremony and join state verification are
performed.

changelog: Bound Keypair Joining: Fix lock generation on sequence desync

* Add some test detail in a comment

Labels

- no-changelog: Indicates that a PR does not require a changelog entry
- rfd: Request for Discussion
- size/lg
