Skip to content

[v18] Bound Keypair Joining Backport#56746

Merged
timothyb89 merged 16 commits intobranch/v18from
timothyb89/v18/bound-keypair-omnibackport
Jul 24, 2025
Merged

[v18] Bound Keypair Joining Backport#56746
timothyb89 merged 16 commits intobranch/v18from
timothyb89/v18/bound-keypair-omnibackport

Conversation

@timothyb89
Copy link
Copy Markdown
Contributor

@timothyb89 timothyb89 commented Jul 12, 2025

This is a backport of Bound Keypair joining for branch/v18, containing several PRs:

This does not include:

A partial implementation of bound keypair joining made it into v18.0.0 when the branch was cut, so around half of the PRs are already merged (#52566, #54766, #54371, #54372, #54822, #54940).

Note that we're targeting this for release in v18.1.0 and it should not be merged until we're confident there will be no further v18.0.x minor releases. Backports of the documentation PRs (#56604, #56824) will follow separately.

changelog: Machine and Workload ID: Add new bound_keypair join method to better support bots in on-prem and other environments without a platform-specific join method

@timothyb89 timothyb89 changed the title [v18] Bound Keypair Joining Backport [v18] Bound Keypair Joining Backport - Part 1 Jul 12, 2025
* MWI: Bound Keypair Joining: Keypair rotation

This adds keypair rotation for bound keypair rotation. When a rotation
flag is set in the token spec, joining clients will be required to
generate a new keypair and complete an additional joining challenge
against the new keypair.

The flag is a timestamp token to allow for some level of idempotency;
to make setting this flag easier, a new `tctl` command is included:
`tctl bound-keypair request-rotation [token]`. This sets the flag
to the current timestamp, and joining clients will be required to
perform a rotation on their next authentication attempt.

Closes #55084

* Properly initialize the tctl command

* Refactor ClientState to allow storing intermediate state during rotation

* Fix invalid comparison and mutation logic

* Log signature suite and use cryptosuites helper

* Remove outdated TODO

* Frontload MFA check to avoid prompting twice

* Fix tctl command logging

* Fix incomplete docstring

* Fix imports

* Fix typo in log message

* Add tests for server-side rotation

Adjusts the test harness a bit and adds a batch of test cases for
keypair rotation.

Also fixes a lint error.

* Add additional test case for reused keys

* Add ClientState unit test

* Remove unnecessary log

* Fix test lints

* Fix reference to wrong key field

Now that the key can change, fix a dangling reference to the initial
key field. Also s/marshalled/marshaled

* Wrap KeyHistoryEntry in a containing struct

This should allow for some future extension if needed.
* MWI: Bound Keypair - Registration Secrets

This adds support for initial joining via registration secrets. These
one time use secrets emulate traditional token joining and allow
clients to perform their initial join

With this, no options are required for bound keypair-type tokens.
While admins can specify a joining secret if they wish, if none is
provided, one will be generated on the server and can be found in
`status.bound_keypair.registration_secret` on the token resource.

When joining, this secret can be shared with clients in addition to
the (no longer sensitive) token name. This secret is verified and
a keypair rotation is requested, prompting the client to generate a
new keypair, provide the public key to the server, and complete a
joining challenge. It then joins the cluster as usual.

* Remove unnecessary token validation checks

* Rename tbot flag to --registration-secret

* Fix reference to renamed flag

* Various fixes, mostly more unwanted checks

* Add test cases for registration secrets

* Fix broken test

Onboarding config is no longer required, so fix the now-broken test

* Allow empty .spec.bound_keypair field for bound keypair tokens

This allows .spec.bound_keypair to be empty or entirely unset,
since we can build defaults at creation time.

* Add test for secret expiry enforcement

* Handle nonexistent client state when using a registration secret

* Fix test lints

* Hide exact registration secret rejection reason from client

Registration secret errors now return a single error message to the
client and log a more specific message on the server.
* MWI: Enforce generation counter for bound keypair joining

This enable generation counter enforcement for bound keypair joining,
and adds a new function, `shouldEnforceGenerationCounter`, to make
enabling it for other join methods trivial.

Bound keypair joining introduces a similar mechanism for use between
its own recovery attempts but does rely on the standard generation
counter for it's renewal-style certificates so every join attempt is
subject to a generation check. This wasn't enabled in the original set
of bound keypair PRs so it's enabled here.

RFD: #52546

* Add tests for generation counter enforcement, fix error handling bug

This adds a test case for traditional generation counter enforcement
with bound keypair joining, and fixes an error handling bug around
certificate generation. This bug was mostly harmless before and
would've just returned nil certs at worst, but is now meaningfully
fallible.

* Fix broken test

* Fix lint

* Remove references to registration secret in test for rebase onto master

* Empty commit for CI
* MWI: Add audit events for bound keypair joining

This adds 3 new audit events for bound keypair joining:
- `join_token.bound_keypair.recovery` - emitted when a join triggers
  a recovery (first join, or join with expired certs)
- `join_token.bound_keypair.rotation` - emitted when a keypair
  rotation takes place
- `join_token.bound_keypair.join_state_verification_failed` - emitted
  when the client provides an invalid join state document

* Fix UI lint

* Fix more UI lints

* Remove outdated TODO

* Fix tests broken by error message changes
* MWI: Add lock targets for join token name and bot instance ID

This adds two new lock targets meant to help lock specific bot
instances without affecting all bots sharing a single user:
- Bot Instance ID: Targets a bot instance UUID, which has been
  assigned automatically to unique bot instances for some time
- Join token name: Targets the join token through which the bot
  joined

Bot instance ID locks are most useful for traditional token-joined
bots, since tokens are single use and bots have no way to onboard
again without human intervention if their old certs (and old bot
instance) expire.

Join token locks are useful for bots using delegated join methods.
They are particularly useful for bound keypair joining, where there
is a direct 1:1 relationship between a "bot instance" and a token,
even though that bot ID will change each time a recovery takes place.

Note that this does not currently set the join token for nodes even
though that would theoretically be possible. We could consider
supporting node locking in the future if there's demand.

* Set join token cert request field for non-renewable bot identities

* Fix ASN ID and pass through join token name in impersonated certs

* Tweak docstrings and add missing references for lib/decision

* Clarify docstrings

Clarifies various docstrings and makes sure they mention `token`
joined bots cannot be targeted.

* Fix failing tests
@timothyb89 timothyb89 force-pushed the timothyb89/v18/bound-keypair-omnibackport branch from 7509472 to 17b58ed Compare July 15, 2025 23:35
* MWI: Use specific lock targets when locking out bots

Building on #56021, this takes advantage of the new granular lock
targets to lock bots during verification failures, namely:
- Generation counter mismatch: Locks a bot instance (token) or token
  name (bound keypair).
- Join state verification failure (bound keypair only)

Additionally, as the bound keypair joining process now generates
locks, join state verification has been moved to take place explicitly
*after* the main joining challenge has been completed. Without this,
unauthenticated clients could abuse the new locking behavior by simply
sending any invalid join state document.

* Use new lock targets for traditional generation counter lockouts

* Enforce new bot lock targets during cert generation

* Fix lint in `mutateStatusConsumeRecovery()`

* Add tests for new lock events

This adds new tests and updates existing tests to account for the new
locking strategies, and to make sure existing clients are actually
denied cluster access.

Additionally, as join state is now verified only after the regular
challenge ceremony, a number of tests were broken as they set up
the token in a technically impossible state, depending on the join
state being checked first. Tests now explicitly specify their token
keypair (bound or initial) to resolve this.

* Remove resolved TODOs

* Fix cut off comment
* MWI: Fix flaky tests for automatic bot lockouts

This fixes a flaky test, `TestRegisterBotCertificateGenerationStolen`,
which assumed authenticated clients would immediately lose access if
locked. It also fixes another test introduced at the same time that
contains a similar check.

* Increase maximum time limit
This removes the environment variable gating use of the bound keypair
experiment.
* MWI: Fix bound keypair initial join secret field name

The `initial_join_secret` field was not given a proper YAML field
name and was rendering as `initialjoinsecret`. Additionally, we've
tried to standardize on referring to this field as "the registration
secret", so this renames the field to match new terminology.

This hopefully does not count as a breaking change as registration
secret functionality has not been made available in a release.

* Rename to `registration_secret`
This fixes a number of spelling and grammar issues in the proto
comments for ProvisionTokenSpecV2BoundKeypair and
ProvisionTokenStatusV2BoundKeypair.
* MWI: Fix flaky test for bound keypair generation counter

This fixes another flaky test in
TestServer_RegisterUsingBoundKeypairMethod_GenerationCounter, caused
by locks occasionally not immediately taking effect.

* Apply suggestions from code review
* MWI: Add joining URIs for tbot

This adds support for joining URIs to tbot. Joining URIs are intended
to condense tbot's growing list of required server-side config options
or CLI parameters into a single string that can be provided to the
`tbot` client.

For example, consider these two equivalent CLI commands:

```
$ tbot start identity \
    --proxy-server example.teleport.sh:443 \
    --join-method bound_keypair \
    --token my-token \
    --registration-secret abc123 \
    --storage ./tbot-data
    --destination ./tbot-user

$ tbot start identity \
    tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443 \
    --storage ./tbot-data \
    --destination ./tbot-user
```

As shown, all parameters necessary for bots to actually connect to
and authenticate with the remote Teleport instance are included in a
single parameter. This parameter can be generated by existing tooling,
like the example command printed via `tctl bots add`, or the web UI.

End users will only need to paste a single "token", provide their own
client-side parameters (if any), and run. Similarly, we now have a new
minimally viable YAML config:

```yaml
version: v2
uri: tbot+proxy+bound-keypair://my-token:abc123@example.teleport.sh:443
storage:
  type: directory
  path: ./tbot-data
services:
  - type: identity
    destination:
      type: directory
      path: ./tbot-user
```

This implementation is designed to be only additive, and should not
interfere with existing config files or CLI strings. Parsed URI
parameters are merged on top of the traditional config fields during
the bot's pre-run check, and raise an error if any field conflicts.

RFD: #52546

* Fix lints

* Set `omitempty` flag on the URI field

This excludes the URI field when empty, to avoid polluting generated
config files when not using URIs - which remains fully supported - and
to clear test failures since a large number of golden tests would
otherwise need to be regenerated.

* Add additional tests for joining URI config merging

* Add additional integration-style test for joining URIs

* Fix lint

* Consistently rename field to JoinURI and convert from arg to flag

* Remove interspersed flag as arg has been removed.

* Fix broken tests after rebase
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Jul 15, 2025

Amplify deployment status

Branch Commit Job ID Status Preview Updated (UTC)
timothyb89/v18/bound-keypair-omnibackport 3f5cfae 5 ✅SUCCEED timothyb89-v18-bound-keypair-omnibackport 2025-07-23 23:25:02

@timothyb89 timothyb89 changed the title [v18] Bound Keypair Joining Backport - Part 1 [v18] Bound Keypair Joining Backport Jul 15, 2025
@timothyb89
Copy link
Copy Markdown
Contributor Author

timothyb89 commented Jul 16, 2025

Manual testing notes with steps I ran both on a local cluster and a cloud tenant. (May make a follow up PR to add these to the Machine ID testplan)

First pass with registration secrets:

  1. Create a bound keypair token with defaults (empty / no bound_keypair: {} struct). Extract the registration secret and join a bot using it.

    tbot start identity --join-uri tbot+proxy+bound-keypair://token-name:registration-secret@example.teleport.sh:443 --storage ./storage --destination ./destination
    
  2. Restart the bot with --certificate-ttl=30s and let its certs expire. Try again after 30s and ensure it fails. An audit log entry should be generated ("Bot Join Failed", a bot.join event with error "no rejoins remaining").

  3. Use tctl edit to edit the token and add an additional recovery attempt and restart. It should succeed now.

  4. Request a keypair rotation: tctl bound-keypair request-rotation token-name.

    Restart the bot (or pkill -usr1 tbot) and ensure it logs "Set new active keypair". A "Bound Keypair Rotation" audit event should be logged (even type is join_token.bound_keypair.rotation). The .status.bound_keypair.bound_public_key should change.

  5. Start the bot again, this time with a short renewal interval, say --renewal-interval=1m or less. Trigger an automatic rotation of the bound_keypair CA:

    tctl auth rotate --type=bound_keypair --grace-period=3m
    

    (May need to be done from the Teleport node if your user lacks perms.)

    Ensure the bot continually reauthenticates throughout the process.

  6. Once the rotation finishes, ensure it can still recover from expired certs. Increase the recovery limit if needed, start with --certificate-ttl=30s, quit, let the certs expire, and start again.

Next, try a new token with a preregistered key:

  1. Create a keypair

    mkdir storage-preregistered
    tbot keypair create --storage=./storage-preregistered --proxy-server example.teleport.sh:443
    
  2. Copy the SSH public key into a token's .spec.bound_keypair.onboarding.initial_public_key field.

  3. Start the bot:

    tbot start identity --join-uri tbot+proxy+bound-keypair://token-name@example.teleport.sh:443 --storage=./storage-preregistered --destination=./destination
    

    It should onboard successfully. A "Bound Keypair Recovery" audit event should be shown in the console with "New counter value: 1", indicating a first recovery.

  4. Try to lock the bot. Start with the traditional generation counter: copy the storage directory and restart the bot. Let it fetch new certs and kill it again (or use --oneshot). Remove the directory and move the copy back:

    cp -r storage-preregistered storage-preregistered-copy
    tbot start identity ... --oneshot
    rm -rf storage-preregistered
    mv storage-preregistered-copy storage-preregistered
    tbot start identity ... --certificate-ttl=1m
    

    This should fail.

  5. Make sure a lock was generated: tctl get lock. It should mention "certificate generation mismatch".

  6. Remove the lock. Let the certs expire normally (or else it'll just get locked again) and let the bot rejoin correctly (increase recovery attempts as needed).

  7. Repeat this process but with the storage-preregistered/bkp_state file:

    tbot start identity ... --certificate-ttl=30s --oneshot # start once to get clean state
    cp storage-preregistered/bkp_state .                    # duplicate the clean state
    sleep 31                                                # state must expire for counter to increment
    tbot start identity ... --certificate-ttl=30s --oneshot # new state with counter++
    sleep 31
    cp bkp_state storage-preregistered/
    tbot start identity ... --certificate-ttl=30s --oneshot # start with outdated state
    

    This should fail and generate a lock. The description should instead mention join state verification failure.

…56829)

* MWI: Verify locks against bound keypair tokens before mutating state

This adds an additional check for locks against a bound keypair token
before any server-side state can be mutated, e.g. before potentially
generating additional locks.

Locks were always checked before credentials were issued, so access
was reliably prevented. However, if bots get locked, they will retry
the connection in a loop. The locks are generated before they're
checked, which can lead to an infinite lock creation loop.

This PR adds an additional check for locks against the join token
before any server-side mutation takes place, but after we've at least
partially verified the client's identity (via a challenge or
registration secret) to avoid leaking new information about whether
or not a token is locked.

* Don't test for exact lock counts

Preventing duplicate locks is best effort and subject to the lock
checks actually returning an error when a lock exists in a timely
manner, so don't assume we won't have duplicates in the test.

* Try to call t.Helper() when possible in testExtractBotParamsFromCerts
@timothyb89 timothyb89 marked this pull request as ready for review July 18, 2025 01:50
@github-actions github-actions bot added audit-log Issues related to Teleports Audit Log backport labels Jul 18, 2025
@github-actions github-actions bot requested a review from boxofrad July 18, 2025 01:51
@github-actions github-actions bot requested a review from strideynet July 18, 2025 01:51
@github-actions github-actions bot added tctl tctl - Teleport admin tool ui labels Jul 18, 2025
@public-teleport-github-review-bot
Copy link
Copy Markdown

@timothyb89 - this PR will require admin approval to merge due to its size. Consider breaking it up into a series smaller changes.

@timothyb89 timothyb89 requested a review from zmb3 July 18, 2025 18:02
Copy link
Copy Markdown
Collaborator

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot.

@timothyb89 timothyb89 enabled auto-merge July 23, 2025 23:19
@timothyb89 timothyb89 added this pull request to the merge queue Jul 23, 2025
Merged via the queue into branch/v18 with commit e8597c6 Jul 24, 2025
45 checks passed
@timothyb89 timothyb89 deleted the timothyb89/v18/bound-keypair-omnibackport branch July 24, 2025 00:02
@camscale camscale added the no-changelog Indicates that a PR does not require a changelog entry label Jul 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

audit-log Issues related to Teleports Audit Log backport documentation machine-id no-changelog Indicates that a PR does not require a changelog entry size/xl tctl tctl - Teleport admin tool ui

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants