From 764aac89d9117dc357ea6410275bd4050e07ae80 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 26 Feb 2025 21:08:20 -0700 Subject: [PATCH 01/25] RFD 0205: Improved On-Prem Joining for Machine ID This RFD discusses improvements to on-prem and non-delegated bot joining, focusing on a new `challenge` join method. --- rfd/0205-improved-onprem-joining.md | 418 ++++++++++++++++++++++++++++ 1 file changed, 418 insertions(+) create mode 100644 rfd/0205-improved-onprem-joining.md diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md new file mode 100644 index 0000000000000..6d5b7367f1a30 --- /dev/null +++ b/rfd/0205-improved-onprem-joining.md @@ -0,0 +1,418 @@ +--- +authors: Tim Buckley () +state: draft +--- + +# RFD 0205 - Improved On-Prem Bot Joining + +## Required Approvers + +- Engineering: @strideynet && @zmb3 + +- Product: @thedevelopnik + +## What + +This RFD proposes serveral improvements to better support non-delegated and +on-prem joining, particularly for Machine ID. + +Primarily, we discuss a new `challenge` join method intended to replace the +traditional `token` join method for many use cases, but also proposes a number +of UX improvements to improve bot joining generally and `token` or +`challenge-response` joining in particular. + +## Why + +Today, if some form of delegated joining is not available, bots must fall back +to the traditional `token` join method. This join method is simple and universal: +it has effectively zero hardware or software requirements, works with any (or +no) cloud provider, and is perfect for demos and experimentation. It's also +relatively secure: single use tokens ensure a Teleport admin user is directly +involved with each bot join, and generation counter checks help ensure bot +identities are difficult to exfiltrate unnoticed. 
+ +Unfortunately, that's a fairly exhaustive list of positives, and when used in a +production environment, bot token joining has major operational problems: + +- Onboarding scales poorly: joining a large fleet of bot instances means + provisioning a token for each bot, which also means generating secrets + yourself, and distributing them appropriately. + +- Ongoing maintenance requires manual intervention, breaking IaC principles. We + should assume `token` joined bots will inevitably fail at some point, and when + this happens, a new token must be issued - manually. + +- Internal bot identities have a hard 24 hour TTL limit, limiting the maximum + possible resiliency to 24 hours before a bot can no longer rejoin without + manual human intervention + +- `token`-method tokens are themselves secret values and their names need to be + treated carefully to avoid leaking secrets + +- Bots occasionally trigger generation counter lockouts, killing themselves and + any instances on the same bot + +These limitations led to a surprisingly narrow set of use cases where token +joining was really a *good* experience, effectively just: + +- Experimentation and development use. It's simple and comprehensible which + makes it great for use in documentation - `tctl bots add` even gives you the + command to run to start a bot immediately! + +- Running very few, very reliable, long-lived systems. If you can reasonably + expect your system to never go down for more than 24 hours, bots can happily + run for months. + + (Ironically, Kubernetes is a great environment in which to run `token`-joined + bots since it'll rapidly reschedule any bot deployments that fail... but we + have a dedicated `kubernetes` delegated join method.) + +End users willing to create their own automation around token issuance could +work around some of these limitations, but this creates an unnecessary barrier +to entry for use of Machine ID on-prem. 
+ +## Details + +### Context: Priorities and Security Invariants + +It's become clear through conversations with customers and previous attempts at +solving this problem that some of these pain points are contradictory. As a case +study, an ideal UX from an end user's perspective might be to create one token, +join all their bots with it, and leave them running indefinitely. This would +create several issues: + +- The initial joining secret would have an ambiguous lifetime. At what point + does this multi-use token expire? + +- How many bots will join? + +- How can we tell joined bots apart? Can we trust a bot identity if it can be + thrown away and regenerated? + +- When a bot needs to rejoin, does it use the same token? Can that token *ever* + expire? + +With this in mind, we need to strike some balance between effective UX and a +system we can trust to not allow unauthorized or unintended access. To that end, +we'll focus on these explicit compromises we believe improve today's UX while +making minimal security concessions: + +- **There must be a 1:1 relationship between a `ProvisionToken` and a bot + instance.** Allowing many joins on one token creates unnecessary backend + contention and - depending on implementation - creates severe traceability + problems. + + However, we can greatly improve today's automation story. Secret fulfillment + can take place server side to reduce the number of resources to generate in + Terraform, and `ProvisionToken` resources can be made reusable to fully enable + IaC workflows. + +- **Bot credentials can be long-lived, but their *trust* must be controlled and + renewed.** We can't allow any secret values to either remain both valid and + useful for a long time, or allow currently valid identities to generate + credentials that can be used for future, unchecked extension of access. + + However, we can create a new state where an identity is technically valid + indefinitely, but useless unless explicitly allowed by the server. 
Existing + controls like the generation counter can still effectively prevent unintended + reuse of the bot identity. + +### Challenge-Response Joining + +TODO: Consider alternative join method names? + +We believe a new join method, `challenge`, can meet our needs and provide +significantly more flexibility than today's `token` join method. This works by - +in a sense - inverting the token joining procedure: bots generate an ED25519 +keypair, and the public key is copied to the server. The public key can be +copied out-of-band, or bots can provide their public key on first join using a +one-time use shared secret, much like today's `token` method. + +Once the public key has been shared, bots may then join by requesting a +challenge from the Teleport Auth service and complete it by signing it with +their private key. If successful, the bot is issued a renewable identity just as +`token`-joined bots are today, and the bot will actively renew this identity for +as long as possible. + +If the identity renewal fails at any point, bots may attempt to reauthenticate, +and the Auth service can use predefined per-bot rules to decide if this specific +bot is allowed to rejoin, including a rejoin counter and expiration date. If a +rejoin is rejected, the bot's identity does not necessarily remain invalid: if +server-side rules are adjusted, for example by increasing the token's rejoin +limit, it can then rejoin without any client-side reconfiguration. + +This has several important differences to existing join methods: + +- Onboarding secrets are optional, and the secret exchange process may be + skipped if the `ProvisionToken` is configured with a public key directly. + Otherwise, joining bots authenticate with an onboarding secret to + automatically share their public key with the server. + +- When joining or rejoining, Teleport issues a challenge that the client must + solve. 
This is similar to TPM joining today, but backed by a local keypair + rather than (necessarily) a hardware token. + +- When a bot's identity expires, assuming it has some rejoin allocations left, + it can simply repeat the joining process to receive a fresh renewable + certificate. + +- If a bot exhausts its rejoining limit, it will not be able to fetch new + certificates, similar to today's behavior. However, this bot can be restored + without needing to generate a new identity: an admin user can edit the backing + `ProvisionToken` to increment `spec.challenge.rejoining.total_rejoins`. The + failed `tbot` instance can then retry the joining process, and it will + succeed. + +It otherwise functions similarly to `token`-joined bots today. It proves its +identity - either via an onboarding secret or public key - to receive a +renewable identity and renews it as usual for as long as possible. The +generation counter is still used to detect identity reuse. When the internal +identity expires, the bot loses access to resources (until it reauthenticates). + +#### Token Resource Example + +`challenge`-type tokens differ from other types in that they are intended to +have no resource-level expiration (though that is allowed), are meant to have +their spec modified over time by users or automation tools, and publish +information about their current state in the immutable (to users) `status` +field. + +``` yaml +kind: token +version: v2 +metadata: + name: my-join-token +spec: + bot_name: example + + join_method: challenge + challenge: + + # `onboarding` parameters control initial join behavior + onboarding: + # If set, no joining secret is generated; the secret exchange ceremony is + # skipped and instance will directly prove its identity using its private + # key. 
+ public_key: null + + # If set, use an explicit initial joining secret; if both this and + # `public_key` are unset, a value will be generated server-side and + # stored in `.status.challenge.initial_join_secret` + initial_join_secret: "" + + # Initial joining must take place before this timestamp. May be + # modified if bot has not yet joined. + expires: "2025-03-01T21:45:40.104524Z" + + # Parameteres to tune rejoining behavior when the regular bot identity has + # expired + rejoining: + # If true, `total_rejoins` is ignored and bots may rejoin indefinitely; + # must be opt-in. + unlimited: false + + # Total number of allowed rejoins; this may be incremented to allow + # additional rejoins, even if a bot identity has already expired. May + # be decremented, but only by the current value of + # `.status.challenge.remaining_rejoins`. + total_rejoins: 10 + + # If set, rejoining is only valid before this timestamp; may be + # incremented to extend bot lifespan. + expires: "" + +status: + challenge: + # If `public_key` is unset, this value will be generated server-side and + # made available here. + initial_join_secret: + + # The public key of the bot associated with this token, set on first join. + bound_public_key: + + # The current bot instance UUID. A new UUID is issued on rejoin; the previous + # UUID will be linked via a `previous_instance_id` in the bot instance. + bound_bot_instance_id: + + # A count of remaining rejoins; if `.spec.challenge.rejoining.total_rejoins` + # is incremented + remaining_rejoins: 10 +``` + +#### Terraform Example + +This join method is explicitly designed to be used with Terraform and IaC +workflows. Today, it's possible to generate one or more secret values, compute +an expiration time, provision a token for each secret, and spawn many VM +instances with that token passed along via user data. While a bit verbose, this +initial deployment workflow is mostly sane. 
For an example, refer to [this +documentation +snapshot](https://github.com/gravitational/teleport/blob/cb0a69d09550e45c2c327ab7dcc6a023e3bb162a/docs/pages/reference/terraform-provider/resources/bot.mdx#example-usage) +with a working example. + +However, the critical issue in this workflow is in maintenance: if any one of +these bots expires, there is no reasonable method to restore it short of +manually issuing a new join token (`tctl bots instance add foo`), connecting to +the node, and manually copying the new token into the bot config. + +The new `challenge` method improves this situation in two primary ways: one +fewer resource needs to be generated (the secret value), and maintenance can be +performed simply by adjusting values in the resource. If a bot fails, its rejoin +counter can be incremented easily. + +As an example, we can consider provisioning several bots. We'll need to account +for future overrides so we can fix a single bot in the future if needed: + +``` hcl +locals { + nodes = toset(["foo", "bar", "baz"]) +} + +resource "teleport_bot" "example" { + name = "example" + roles = ["access"] +} + +variable "bot_rejoin_overrides" { + type = map(number) + default = { + foo = 5 + } +} + +resource "teleport_provision_token" "example" { + for_each = local.nodes + + version = "v2" + metadata = { + name = "example-${each.key}" + } + spec = { + roles = ["Bot"] + bot_name = teleport_bot.example.name + join_method = "challenge" + + challenge = { + rejoining = { + # look up node-specific count in the rejoin overrides map, default to 2 + total_rejoins = lookup(var.bot_rejoin_overrides, each.key, 2) + } + } + } +} + +resource "aws_instance" "example" { + for_each = teleport_provision_token.example + + ami = "ami-12345678" + instance_type = "t2.micro" + + user_data = templatefile("${path.module}/user-data.tpl", { + initial_join_token = each.value.status.challenge.initial_join_secret + }) +} +``` + +In this example, if node `bar` uses its 2 renewals, we can add a 
new entry for +it in `bot_rejoin_overrides` and its `ProvisionToken` will be updated to allow +additional renewals. + +#### Challenge Ceremony + +The challenge ceremony will take inspiration from several existing join methods. +We can create an interactive challenge similar to TPM joining and present a +challenge containing a nonce. To avoid completely implementing our own +authentication ceremony, clients can use `go-jose` to marshal and sign a JWT +which can then be verified easily on the server. + +TODO: This needs significant further elaboration and feedback. + +#### Client-Side Changes in `tbot` + +Bots should be informed of their number of remaining rejoins. We can give bots +permission to view their own join token, or include the number of remaining +rejoins as an informational field in the bot's current user certificate. We +should then expose this as a Prometheus metric to allow for alerting if a bot + +We should also investigate hardware key storage backends using HSMs / PKCS#11. +We have some prior art here in Teleport's [HSM +support](https://goteleport.com/docs/admin-guides/deploy-a-cluster/hsm/). + +#### Non-Terraform UX + +TODO + +#### Remaining Downsides + +- Repairing a bot that has exhausted all of its rejoins is still a semi-manual + process. It is significantly easier, and does not necessarily require any + changes on the impacted bot node itself, but is still annoying. Users can opt + out of this by setting `.spec.rejoining.unlimited=true`, but this has obvious + security implications. + +- Effort required to configure IaC / Terraform is still fairly high, even if + reduced. + +### Other Supporting Improvements + +Alongside `challenge` joining, we have several UX proposals to further improve +the usability of non-delegated joining. + +#### Longer-Lived Bots + +The renewable identity's 24 hour maximum TTL is too restrictive and should be +lengthened. We propose raising this limit to 7 days, but keeping the default +(1hr) the same. 
+ +#### Bot Instance Locking + +When we introduced bot instances in +[RFD0162](0162-machine-id-token-join-method-bot-instance.md), it allowed many +bot instances to join under a single bot (and Teleport user). Generation counter +checks were moved out of a bot's user and into bot instances, however when a +generation counter mismatch occurs, the resulting lock is still filed against +the user as a whole. This means a generation counter lockout in one instance can +easily impact all other instances under the same bot. + +We should make this more granular and make locks that can target a bot instance +UUID, and as this RFD introduces a method for bots to rejoin as a new instance, +a lock target for specific join tokens. We may need to introduce join token +tracking, e.g. by introducing a join token certificate field. + +#### Token CLI UX Improvements + +Instead of separating join method and token name, we propose combining the two +into a single CLI flag and referring to tokens as a (method, value) tuple. For +example: + +``` +$ tctl bots add example --roles=access +The bot token: challenge:04f0ceff1bd0589ba45c1832dfc8feaf +This token will expire in 59 minutes. + +[...snip...] + +$ tbot start --token=challenge:04f0ceff1bd0589ba45c1832dfc8feaf ... +``` + +## Alternatives and Future Extensions + +### Explicitly Insecure Token Joining + +There are perfectly valid use cases for allowing relatively insecure access to +resources that do not have strict trust requirements, and Teleport's RBAC system +is robust enough to only allow these bots access to an acceptable subset of +resources. It may be worthwhile to add an `insecure-shared-secret` join method +that allows for arbitrary joining in use cases that still fall through the +cracks, so long as end users understand the security implications. 
+ +### Client-side multi-token support + +A simpler variant of N-Token Resiliency, this would allow `tbot` clients to +accept an ordered list of joining token strings which could be used +sequentially. If the internal identity expires, the next token in the list will +be used to attempt a rejoin. + +## Rejected Alternatives + +### N-Token Resiliency From 8604f44cba75c7e9acecb56ac0b03d1029754b1f Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 26 Feb 2025 21:18:06 -0700 Subject: [PATCH 02/25] Various whitespace fixes --- rfd/0205-improved-onprem-joining.md | 56 ++++++++++++++--------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 6d5b7367f1a30..28992762894f1 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -186,40 +186,40 @@ spec: bot_name: example join_method: challenge - challenge: + challenge: - # `onboarding` parameters control initial join behavior - onboarding: - # If set, no joining secret is generated; the secret exchange ceremony is - # skipped and instance will directly prove its identity using its private - # key. - public_key: null + # `onboarding` parameters control initial join behavior + onboarding: + # If set, no joining secret is generated; the secret exchange ceremony is + # skipped and instance will directly prove its identity using its private + # key. + public_key: null - # If set, use an explicit initial joining secret; if both this and - # `public_key` are unset, a value will be generated server-side and - # stored in `.status.challenge.initial_join_secret` - initial_join_secret: "" + # If set, use an explicit initial joining secret; if both this and + # `public_key` are unset, a value will be generated server-side and + # stored in `.status.challenge.initial_join_secret` + initial_join_secret: "" - # Initial joining must take place before this timestamp. 
May be - # modified if bot has not yet joined. - expires: "2025-03-01T21:45:40.104524Z" + # Initial joining must take place before this timestamp. May be + # modified if bot has not yet joined. + expires: "2025-03-01T21:45:40.104524Z" - # Parameteres to tune rejoining behavior when the regular bot identity has + # Parameters to tune rejoining behavior when the regular bot identity has # expired - rejoining: - # If true, `total_rejoins` is ignored and bots may rejoin indefinitely; - # must be opt-in. - unlimited: false + rejoining: + # If true, `total_rejoins` is ignored and bots may rejoin indefinitely; + # must be opt-in. + unlimited: false - # Total number of allowed rejoins; this may be incremented to allow - # additional rejoins, even if a bot identity has already expired. May - # be decremented, but only by the current value of - # `.status.challenge.remaining_rejoins`. - total_rejoins: 10 + # Total number of allowed rejoins; this may be incremented to allow + # additional rejoins, even if a bot identity has already expired. May + # be decremented, but only by the current value of + # `.status.challenge.remaining_rejoins`. + total_rejoins: 10 - # If set, rejoining is only valid before this timestamp; may be - # incremented to extend bot lifespan. - expires: "" + # If set, rejoining is only valid before this timestamp; may be + # incremented to extend bot lifespan. + expires: "" status: challenge: @@ -385,7 +385,7 @@ Instead of separating join method and token name, we propose combining the two into a single CLI flag and referring to tokens as a (method, value) tuple. For example: -``` +``` $ tctl bots add example --roles=access The bot token: challenge:04f0ceff1bd0589ba45c1832dfc8feaf This token will expire in 59 minutes. 
From 405bd940f17b979700a84fa01e0b750d479d928a Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Thu, 27 Feb 2025 09:39:07 -0700 Subject: [PATCH 03/25] Add details after first feedback pass Adds sections on alerting, keypair rotation, and intention to eventually support node joining. --- rfd/0205-improved-onprem-joining.md | 40 +++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 28992762894f1..34480d41f8ba0 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -52,6 +52,9 @@ production environment, bot token joining has major operational problems: - Bots occasionally trigger generation counter lockouts, killing themselves and any instances on the same bot +- When a bot instance's renewable identity expires, there's no clear way to tell + that it stopped functioning other than checking the `tbot` process directly. + These limitations led to a surprisingly narrow set of use cases where token joining was really a *good* experience, effectively just: @@ -223,8 +226,10 @@ spec: status: challenge: - # If `public_key` is unset, this value will be generated server-side and - # made available here. + # If `spec.onboarding.public_key` is unset, this value will be generated + # server-side and made available here. If + # `spec.onboarding.initial_join_secret` is set, its value will be copied + # here. initial_join_secret: # The public key of the bot associated with this token, set on first join. @@ -342,6 +347,31 @@ support](https://goteleport.com/docs/admin-guides/deploy-a-cluster/hsm/). TODO +#### Expiration Alerting UX + +A major deficiency in `token` joining is that bots fail silently. Their status +can be partially monitored via `tctl get bot_instance` but this does not +effectively notify administrators when something has gone wrong. 
+ +We should take steps to improve visibility of bots at or near expiry, including: + +- Configurable cluster alerts when the number of available renewals has crossed + some threshold + +- Exposing the number of available renewals in the web UI and `tctl bot ls` + +- Exposing per-token renewal counts as Prometheus metrics, both on the Auth + Serivce and via `tbot`'s metrics endpoint. + +#### Keypair Rotation + +Given the long-lived nature of the keypair credential, it's important to support +rotation without bot downtime. Ideally, it should be possible to initiate a +rotation from either the server (e.g. by setting a `rotate_on_next_renewal` flag +on the token/bot instance) or `tbot` client. + +TODO: Expand this. + #### Remaining Downsides - Repairing a bot that has exhausted all of its rejoins is still a semi-manual @@ -397,6 +427,12 @@ $ tbot start --token=challenge:04f0ceff1bd0589ba45c1832dfc8feaf ... ## Alternatives and Future Extensions +### Agent Joining Support + +We should explore expanding this join method to cover regular Teleport agent +joining as well as bots, as a more secure alternative to static or long-lived +join tokens. + ### Explicitly Insecure Token Joining There are perfectly valid use cases for allowing relatively insecure access to From 6e265d8626d11449f813dbeb6aa23a1238dfec35 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Thu, 27 Feb 2025 21:21:48 -0700 Subject: [PATCH 04/25] Add section detailing joining flows, various other details --- rfd/0205-improved-onprem-joining.md | 97 +++++++++++++++++++++++++++-- 1 file changed, 92 insertions(+), 5 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 34480d41f8ba0..ac0a25214941f 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -129,13 +129,14 @@ significantly more flexibility than today's `token` join method. 
This works by - in a sense - inverting the token joining procedure: bots generate an ED25519 keypair, and the public key is copied to the server. The public key can be copied out-of-band, or bots can provide their public key on first join using a -one-time use shared secret, much like today's `token` method. +one-time use shared secret to authenticate the exchange, much like today's +`token` method. Once the public key has been shared, bots may then join by requesting a -challenge from the Teleport Auth service and complete it by signing it with +challenge from the Teleport Auth service and completing it by signing it with their private key. If successful, the bot is issued a renewable identity just as `token`-joined bots are today, and the bot will actively renew this identity for -as long as possible. +as long as possible, or until its backing token expires. If the identity renewal fails at any point, bots may attempt to reauthenticate, and the Auth service can use predefined per-bot rules to decide if this specific @@ -172,6 +173,56 @@ renewable identity and renews it as usual for as long as possible. The generation counter is still used to detect identity reuse. When the internal identity expires, the bot loses access to resources (until it reauthenticates). +#### Joining UX Flows + +This join method creates two new joining flows: + +1. **Static Binding**: A keypair is pregenerated on the client and the public + key is directly included in the token resource by a Teleport admin. + + Example UX (subject to change): + + ``` + $ tbot generate-keypair + Wrote id_ed25519 + Wrote id_ed25519.pub + $ tctl bots add example --public-key id_ed25519.pub + $ tbot start identity --token=challenge:id_ed25519 + ``` + + (In this example, `tctl bots add` creates a `challenge` token automatically, + much like a `token`-type token is created automatically today.) + + The public key can be copied as needed, similar to SSH `authorized_keys` and + GitHub's SSH authentication. 
This is arguably more secure since no secret is + ever copied. + + On startup, Auth issues a challenge to the bot which is solved with its + private key, and it receives a standard renewable identity. + +2. **Bind-on-join**: The `tbot` client is given a joining secret. + + Example UX: + ``` + $ tctl bots add example --join-method=challenge + The bot token: challenge:04f0ceff1bd0589ba45c1832dfc8feaf + This token will expire in 59 minutes. + $ tbot start identity --token=challenge:04f0ceff1bd0589ba45c1832dfc8feaf + ``` + + On `tbot` startup, a keypair is transparently generated and exchanged with + Auth, after which the bot internally behaves as if flow 1 was used, and the + now-bound keypair performs its first full join. + + From an end user's PoV, this process is nearly identical to traditional + `token` joining. While a joining secret does need to be copied to the bot + node, these secrets remain short-lived and one-time use. + +We expect most users to use Flow 2: it's much easier to provision new nodes and +requires less back-and-forth between the admin's workstation and bot node. Flow +1 is particularly ill-suited to Terraform use since keypairs would need to be +pregenerated and copied to nodes, which is not ideal from a security PoV. + #### Token Resource Example `challenge`-type tokens differ from other types in that they are intended to @@ -195,7 +246,8 @@ spec: onboarding: # If set, no joining secret is generated; the secret exchange ceremony is # skipped and instance will directly prove its identity using its private - # key. + # key. It is an error for a public key to be associated with more than one + # token, and creation or update will fail if a public key is reused. public_key: null # If set, use an explicit initial joining secret; if both this and @@ -233,6 +285,9 @@ status: initial_join_secret: # The public key of the bot associated with this token, set on first join. 
+ # The bound public key must be unique among all `challenge` tokens; token + # resource creation/update or bot joining will fail if a public key is + # reused. bound_public_key: # The current bot instance UUID. A new UUID is issued on rejoin; the previous @@ -240,7 +295,8 @@ status: bound_bot_instance_id: # A count of remaining rejoins; if `.spec.challenge.rejoining.total_rejoins` - # is incremented + # is incremented, this value will be incremented by the same amount. If + # decremented, this value cannot fall below zero. remaining_rejoins: 10 ``` @@ -363,6 +419,37 @@ We should take steps to improve visibility of bots at or near expiry, including: - Exposing per-token renewal counts as Prometheus metrics, both on the Auth Serivce and via `tbot`'s metrics endpoint. +#### Outstanding Issue: Soft Bot Expiration + +The `spec.rejoining.expires` field can be used to prevent rejoining after a +certain time, in tandem with the rejoin count limit. This has the - likely +confusing - downside that bots will still be able to renew their certificates +indefinitely past the expiration date, assuming their certs were valid at the +time of expiration. + +Also, as with all Teleport resources, the `metadata.expires` field can also +remove the token resource after a set time. Bots will also continue to renew +certs as long as possible until they are either locked or otherwise fail to +renew their certs on time. + +These two expirations create some confusion, and do not allow for an obvious +method to deny a bot access, aside from creating a lock. + +We would like to solve two expiration use cases: +1. We should be able to prevent all bot resource access after a certain date, + including renewals, in a way that allows the bot to be resumed later if + desired. (I.e. the token resource must still exist.) + + Locks may accomplish this, but some centralized management in the token + resource would be convenient. + +2. 
We should be able to prevent bot rejoins after a certain date, to control + rejoining conditions in tandem with the rejoin counter. The + `spec.rejoining.expires` field accomplishes this, but does have a naming + collision with `metadata.expires`. + +TODO: Ensure this is solved and not confusing. + #### Keypair Rotation Given the long-lived nature of the keypair credential, it's important to support From dc4fb2f0c2ff35d3955f70b7224e8e5d9a2185e6 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 28 Feb 2025 19:04:33 -0700 Subject: [PATCH 05/25] Fix cspell nits --- rfd/0205-improved-onprem-joining.md | 4 ++-- rfd/cspell.json | 5 ++++- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index ac0a25214941f..7d9e651f548cb 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -13,7 +13,7 @@ state: draft ## What -This RFD proposes serveral improvements to better support non-delegated and +This RFD proposes several improvements to better support non-delegated and on-prem joining, particularly for Machine ID. Primarily, we discuss a new `challenge` join method intended to replace the @@ -417,7 +417,7 @@ We should take steps to improve visibility of bots at or near expiry, including: - Exposing the number of available renewals in the web UI and `tctl bot ls` - Exposing per-token renewal counts as Prometheus metrics, both on the Auth - Serivce and via `tbot`'s metrics endpoint. + Service and via `tbot`'s metrics endpoint. 
#### Outstanding Issue: Soft Bot Expiration diff --git a/rfd/cspell.json b/rfd/cspell.json index 4f7f16bb37698..aa387bfeb8b1e 100644 --- a/rfd/cspell.json +++ b/rfd/cspell.json @@ -633,6 +633,7 @@ "rdsproxy", "readyz", "reauth", + "reauthenticates", "reccfg", "reconnections", "redisenterprise", @@ -757,6 +758,7 @@ "teleportrolev", "teleterm", "teleuser", + "templatefile", "templating", "tenantid", "testdb", @@ -776,6 +778,7 @@ "tokenless", "tolerations", "topotentially", + "toset", "trivy", "trunc", "tshd", @@ -843,4 +846,4 @@ "ykpiv", "yubihsm" ] -} +} \ No newline at end of file From f8602f487907383ca64ac0b28dd8b2ec985668bf Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Thu, 6 Mar 2025 21:07:33 -0700 Subject: [PATCH 06/25] Rename to bound keypair, address review feedback This renames the join method to `bound-keypair`, adds sections on extensible keystore backends, non-Terraform UX, and scoped RBAC. --- rfd/0205-improved-onprem-joining.md | 148 +++++++++++++++++++++------- 1 file changed, 110 insertions(+), 38 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 7d9e651f548cb..ec5f1de184e1a 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -3,7 +3,7 @@ authors: Tim Buckley () state: draft --- -# RFD 0205 - Improved On-Prem Bot Joining +# RFD 0205 - Improved On-Prem Bots With `bound-keypair` Joining ## Required Approvers @@ -16,10 +16,10 @@ state: draft This RFD proposes several improvements to better support non-delegated and on-prem joining, particularly for Machine ID. -Primarily, we discuss a new `challenge` join method intended to replace the +Primarily, we discuss a new `bound-keypair` join method intended to replace the traditional `token` join method for many use cases, but also proposes a number -of UX improvements to improve bot joining generally and `token` or -`challenge-response` joining in particular. 
+of related UX improvements to improve bot joining generally and `token` and +`bound-keypair` joining in particular. ## Why @@ -120,11 +120,11 @@ making minimal security concessions: controls like the generation counter can still effectively prevent unintended reuse of the bot identity. -### Challenge-Response Joining +### Bound Keypair Joining TODO: Consider alternative join method names? -We believe a new join method, `challenge`, can meet our needs and provide +We believe a new join method, `bound-keypair`, can meet our needs and provide significantly more flexibility than today's `token` join method. This works by - in a sense - inverting the token joining procedure: bots generate an ED25519 keypair, and the public key is copied to the server. The public key can be @@ -163,8 +163,8 @@ This has several important differences to existing join methods: - If a bot exhausts its rejoining limit, it will not be able to fetch new certificates, similar to today's behavior. However, this bot can be restored without needing to generate a new identity: an admin user can edit the backing - `ProvisionToken` to increment `spec.challenge.rejoining.total_rejoins`. The - failed `tbot` instance can then retry the joining process, and it will + `ProvisionToken` to increment `spec.bound_keypair.rejoining.total_rejoins`. + The failed `tbot` instance can then retry the joining process, and it will succeed. It otherwise functions similarly to `token`-joined bots today. It proves its @@ -187,10 +187,10 @@ This join method creates two new joining flows: Wrote id_ed25519 Wrote id_ed25519.pub $ tctl bots add example --public-key id_ed25519.pub - $ tbot start identity --token=challenge:id_ed25519 + $ tbot start identity --token=bound-keypair:id_ed25519 ``` - (In this example, `tctl bots add` creates a `challenge` token automatically, + (In this example, `tctl bots add` creates a `bound-keypair` token automatically, much like a `token`-type token is created automatically today.) 
The public key can be copied as needed, similar to SSH `authorized_keys` and @@ -204,10 +204,10 @@ This join method creates two new joining flows: Example UX: ``` - $ tctl bots add example --join-method=challenge - The bot token: challenge:04f0ceff1bd0589ba45c1832dfc8feaf + $ tctl bots add example --join-method=bound-keypair + The bot token: bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf This token will expire in 59 minutes. - $ tbot start identity --token=challenge:04f0ceff1bd0589ba45c1832dfc8feaf + $ tbot start identity --token=bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf ``` On `tbot` startup, a keypair is transparently generated and exchanged with @@ -225,7 +225,7 @@ pregenerated and copied to nodes, which is not ideal from a security PoV. #### Token Resource Example -`challenge`-type tokens differ from other types in that they are intended to +`bound-keypair`-type tokens differ from other types in that they are intended to have no resource-level expiration (though that is allowed), are meant to have their spec modified over time by users or automation tools, and publish information about their current state in the immutable (to users) `status` @@ -239,8 +239,8 @@ metadata: spec: bot_name: example - join_method: challenge - challenge: + join_method: bound-keypair + bound_keypair: # `onboarding` parameters control initial join behavior onboarding: @@ -252,7 +252,7 @@ spec: # If set, use an explicit initial joining secret; if both this and # `public_key` are unset, a value will be generated server-side and - # stored in `.status.challenge.initial_join_secret` + # stored in `.status.bound_keypair.initial_join_secret` initial_join_secret: "" # Initial joining must take place before this timestamp. May be @@ -269,7 +269,7 @@ spec: # Total number of allowed rejoins; this may be incremented to allow # additional rejoins, even if a bot identity has already expired. May # be decremented, but only by the current value of - # `.status.challenge.remaining_rejoins`. 
+ # `.status.bound_keypair.remaining_rejoins`. total_rejoins: 10 # If set, rejoining is only valid before this timestamp; may be @@ -277,7 +277,7 @@ spec: expires: "" status: - challenge: + bound_keypair: # If `spec.onboarding.public_key` is unset, this value will be generated # server-side and made available here. If # `spec.onboarding.initial_join_secret` is set, its value will be copied @@ -285,8 +285,8 @@ status: initial_join_secret: # The public key of the bot associated with this token, set on first join. - # The bound public key must be unique among all `challenge` tokens; token - # resource creation/update or bot joining will fail if a public key is + # The bound public key must be unique among all `bound-keypair` tokens; + # token resource creation/update or bot joining will fail if a public key is # reused. bound_public_key: @@ -294,7 +294,7 @@ status: # UUID will be linked via a `previous_instance_id` in the bot instance. bound_bot_instance_id: - # A count of remaining rejoins; if `.spec.challenge.rejoining.total_rejoins` + # A count of remaining rejoins; if `.spec.bound_keypair.rejoining.total_rejoins` # is incremented, this value will be incremented by the same amount. If # decremented, this value cannot fall below zero. remaining_rejoins: 10 @@ -316,7 +316,7 @@ these bots expires, there is no reasonable method to restore it short of manually issuing a new join token (`tctl bots instance add foo`), connecting to the node, and manually copying the new token into the bot config. -The new `challenge` method improves this situation in two primary ways: one +The new `bound-keypair` method improves this situation in two primary ways: one fewer resource needs to be generated (the secret value), and maintenance can be performed simply by adjusting values in the resource. If a bot fails, its rejoin counter can be incremented easily. 
@@ -351,9 +351,9 @@ resource "teleport_provision_token" "example" { spec = { roles = ["Bot"] bot_name = teleport_bot.example.name - join_method = "challenge" + join_method = "bound-keypair" - challenge = { + bound_keypair = { rejoining = { # look up node-specific count in the rejoin overrides map, default to 2 total_rejoins = lookup(var.bot_rejoin_overrides, each.key, 2) @@ -369,7 +369,7 @@ resource "aws_instance" "example" { instance_type = "t2.micro" user_data = templatefile("${path.module}/user-data.tpl", { - initial_join_token = each.value.status.challenge.initial_join_secret + initial_join_token = each.value.status.bound_keypair.initial_join_secret }) } ``` @@ -395,13 +395,48 @@ permission to view their own join token, or include the number of remaining rejoins as an informational field in the bot's current user certificate. We should then expose this as a Prometheus metric to allow for alerting if a bot -We should also investigate hardware key storage backends using HSMs / PKCS#11. -We have some prior art here in Teleport's [HSM -support](https://goteleport.com/docs/admin-guides/deploy-a-cluster/hsm/). +#### Keystore Storage Backends + +We should support abstract keystore storage backends to enable storage methods +beyond plain file storage. + +For example: + +- HSM storage, for hardware supporting PKCS#11. This includes many TPM 1.2 / 2.0 + implementations. We have some prior art here in Teleport's [HSM + support](https://goteleport.com/docs/admin-guides/deploy-a-cluster/hsm/). + +- [Apple Secure Enclave key storage][enclave]. This would require additional + changes to our release process, as access to this functionality requires app + signing. We again have prior art here with `tsh`. + +Further evaluation will be necessary to ensure these backends support our +challenge process and key types. 
Libraries like [`sks`] provide compatibility +across TPM 2.0 (Windows, Linux) and Apple's Secure Enclave, and should be able +to sign our challenges appropriately, and using our desired key types. + +[enclave]: https://developer.apple.com/documentation/security/protecting-keys-with-the-secure-enclave +[`sks`]: https://github.com/facebookincubator/sks #### Non-Terraform UX -TODO +This proposal also aims to improve the non-Terraform UX, particularly when +automating with `tctl`. All regular token management workflows with +`tctl create -f` will continue to work; upserting resources to modify runtime +values will, for example, properly increase +`status.bound_keypair.remaining_rejoins` while preserving other token fields +like `status.bound_keypair.bound_public_key`. + +Additional `tctl` changes will include: + +- Once satisfied with behavior of the join method, replacing the default + automatically generated join tokens for `tctl bots add` and + `tctl bots instances add` to use this new join method. + +- Adding a column for "rejoins remaining" in `tctl bots instances ls` (where + relevant). + +- Adding support for updating `total_rejoins` in `tctl bots update` #### Expiration Alerting UX @@ -472,8 +507,8 @@ TODO: Expand this. ### Other Supporting Improvements -Alongside `challenge` joining, we have several UX proposals to further improve -the usability of non-delegated joining. +Alongside `bound-keypair` joining, we have several UX proposals to further +improve the usability of non-delegated joining. #### Longer-Lived Bots @@ -491,10 +526,20 @@ generation counter mismatch occurs, the resulting lock is still filed against the user as a whole. This means a generation counter lockout in one instance can easily impact all other instances under the same bot. -We should make this more granular and make locks that can target a bot instance -UUID, and as this RFD introduces a method for bots to rejoin as a new instance, -a lock target for specific join tokens. 
We may need to introduce join token -tracking, e.g. by introducing a join token certificate field. +We can introduce several new lock targets to address this: + +- Bot instance UUID locking: prevent access by a particular instance. This will + not lock bots that have since reauthenticated and received a new bot instance + UUID. + +- Join token locking: locks all bot instances that joined with a particular + token. This may require introduction of a new certificate field to track the + exact join token used. + +- Public key locking: locks bots joining with a particular public key. A + compromised bot could theoretically generate a fresh keypair so a join token + lock is the primary locking solution, however this will prevent joining with + another token. #### Token CLI UX Improvements @@ -504,12 +549,12 @@ example: ``` $ tctl bots add example --roles=access -The bot token: challenge:04f0ceff1bd0589ba45c1832dfc8feaf +The bot token: bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf This token will expire in 59 minutes. [...snip...] -$ tbot start --token=challenge:04f0ceff1bd0589ba45c1832dfc8feaf ... +$ tbot start --token=bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf ... ``` ## Alternatives and Future Extensions @@ -520,6 +565,33 @@ We should explore expanding this join method to cover regular Teleport agent joining as well as bots, as a more secure alternative to static or long-lived join tokens. +### Additional Keypair Protections + +We should investigate supporting additional layers of protection for the private +key. There are several avenues for this, depending on storage backend: + +- Filesystem storage can be encrypted at rest and require a private key to be + entered to unlock it, similar to SSH keys without an agent. + +- Secure Enclave storage can require that the device be unlocked, or require + biometric verification. + +- Other HSM-stored keys may support various types of human presence + verification. 
For example, YubiHSM has a touch sensor that can be required for + access on a key-by-key basis. + +#### Tightly Scoped Token RBAC + +To better support use cases where central administrators vend bot tokens for +teams, we can add scoped RBAC support for `ProvisionToken` CRUD operations. + +For example, this would allow a designated team to update a `bound-keypair` +token to increase the rejoin counter without needing to reach out to the central +administrator. + +This is likely dependent on [Scoped RBAC](https://github.com/gravitational/teleport/pull/38078), +which is still in the planning stage. + ### Explicitly Insecure Token Joining There are perfectly valid use cases for allowing relatively insecure access to From c9a1b209cb0efcde35f4357ec6e3596a05273898 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Thu, 6 Mar 2025 21:29:23 -0700 Subject: [PATCH 07/25] Rewrite join UX improvement to use URIs Adds a new URI joining proposal --- rfd/0205-improved-onprem-joining.md | 35 ++++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index ec5f1de184e1a..12f80b0246453 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -541,22 +541,45 @@ We can introduce several new lock targets to address this: lock is the primary locking solution, however this will prevent joining with another token. -#### Token CLI UX Improvements +#### Bot Joining URIs -Instead of separating join method and token name, we propose combining the two -into a single CLI flag and referring to tokens as a (method, value) tuple. For -example: +Instead of separate flags for proxy, join method, token value, and other +potential join-method specific parameters, we propose adopting joining URIs. +These would provide a single value users can copy to pre-fill various +configuration fields and would greatly improve the onboarding experience. 
+ +The URI syntax might look like this: +``` +tbot+[auth|proxy]://[join method]:[token value]@[addr]:[port]?key=val&foo=bar +``` + +Consider these two equivalent commands: +``` +$ tbot start identity --proxy-server example.teleport.sh:443 --join-method bound-keypair --token example + +$ tbot start identity tbot+proxy://bound-keypair:example@example.teleport.sh:443 +``` + +Joining URIs can greatly simplify the regular onboarding experience by providing +a single value to copy when onboarding a bot: ``` $ tctl bots add example --roles=access -The bot token: bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf +The bot token: tbot+proxy://bound-keypair:example@example.teleport.sh:443 This token will expire in 59 minutes. [...snip...] -$ tbot start --token=bound-keypair:04f0ceff1bd0589ba45c1832dfc8feaf ... +$ tbot start identity tbot+proxy://bound-keypair:example@example.teleport.sh:443 ... ``` +Given the CLI now supports many operational modes, it's much easier for users to +write their given starting command (e.g. `tbot start app`) and paste the joining +URI to get started immediately. + +URL paths and query parameters may also provide options for future extension if +desired. + ## Alternatives and Future Extensions ### Agent Joining Support From 6da9c4e686cfe15b1a9cf3fbd4f223282066353d Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 7 Mar 2025 21:04:12 -0700 Subject: [PATCH 08/25] Rewrite some sections, discuss state storage New sections on state storage and rejected alternatives, plus rewrote several sections for clarity. 
--- rfd/0205-improved-onprem-joining.md | 114 +++++++++++++++++++++++----- 1 file changed, 96 insertions(+), 18 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 12f80b0246453..0d1d4dd0c4dfd 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -3,7 +3,7 @@ authors: Tim Buckley () state: draft --- -# RFD 0205 - Improved On-Prem Bots With `bound-keypair` Joining +# RFD 0205 - Improved On-Prem Bots with Bound Keypair Joining ## Required Approvers @@ -70,6 +70,13 @@ joining was really a *good* experience, effectively just: bots since it'll rapidly reschedule any bot deployments that fail... but we have a dedicated `kubernetes` delegated join method.) +In short, token joining has a complexity cliff. It's extremely easy to get +started, but it can feel like a false start when users learn token joining is +not suitable to their production use case. At best it's back to the docs to +learn about some more complicated join method; at worst, it's even more +disappointing when users learn there simply is _no_ good method for on-prem +joining. (Well, unless they have TPMs.) + End users willing to create their own automation around token issuance could work around some of these limitations, but this creates an unnecessary barrier to entry for use of Machine ID on-prem. @@ -87,13 +94,15 @@ create several issues: - The initial joining secret would have an ambiguous lifetime. At what point does this multi-use token expire? -- How many bots will join? +- How many times will the token be used? Can we trust it's never been used + improperly, and that each use actually originated from the infrastructure we + intended to join? -- How can we tell joined bots apart? Can we trust a bot identity if it can be - thrown away and regenerated? +- How can we tell joined bots apart, even over time? If the original joining + token is still valid, could a malicious bot purge its identity and rejoin? 
-- When a bot needs to rejoin, does it use the same token? Can that token *ever* - expire? +- When a bot needs to rejoin, does it use the same token? If so, can that token + *ever* expire? With this in mind, we need to strike some balance between effective UX and a system we can trust to not allow unauthorized or unintended access. To that end, @@ -122,8 +131,6 @@ making minimal security concessions: ### Bound Keypair Joining -TODO: Consider alternative join method names? - We believe a new join method, `bound-keypair`, can meet our needs and provide significantly more flexibility than today's `token` join method. This works by - in a sense - inverting the token joining procedure: bots generate an ED25519 @@ -167,11 +174,20 @@ This has several important differences to existing join methods: The failed `tbot` instance can then retry the joining process, and it will succeed. -It otherwise functions similarly to `token`-joined bots today. It proves its -identity - either via an onboarding secret or public key - to receive a -renewable identity and renews it as usual for as long as possible. The -generation counter is still used to detect identity reuse. When the internal -identity expires, the bot loses access to resources (until it reauthenticates). +It otherwise functions similarly to `token`-joined bots today: + +- It is still fully infrastructure agnostic and works across operating systems. + +- The joining UX is largely compatible with `token` joining and should still + work great for experimentation and documentation examples. + +- It proves its identity - either via an onboarding secret or public key - to + receive a renewable identity and renews it as usual for as long as possible. + +- When the internal identity expires, the bot loses access to resources until it + reauthenticates. + +- The generation counter is still used to detect identity reuse. 
#### Joining UX Flows @@ -223,6 +239,10 @@ requires less back-and-forth between the admin's workstation and bot node. Flow 1 is particularly ill-suited to Terraform use since keypairs would need to be pregenerated and copied to nodes, which is not ideal from a security PoV. +Flow 2 is also mostly equivalent to `token` joining. Current users will already +be conceptually familiar with the joining process, and documentation updates +will be minimal. + #### Token Resource Example `bound-keypair`-type tokens differ from other types in that they are intended to @@ -390,10 +410,30 @@ TODO: This needs significant further elaboration and feedback. #### Client-Side Changes in `tbot` -Bots should be informed of their number of remaining rejoins. We can give bots -permission to view their own join token, or include the number of remaining -rejoins as an informational field in the bot's current user certificate. We -should then expose this as a Prometheus metric to allow for alerting if a bot +Bots should be informed of their number of remaining rejoins. There's a few +methods by which we could inform bots of their remaining rejoins: + +1. (Recommended) Heartbeats: bots submit heartbeats at startup and on a regular + interval. It would be trivial to include a remaining rejoin counter in the + (currently empty) heartbeat response. + +2. Certificate field: we could include the number of remaining rejoins in a + certificate field. + +3. New RPC: we could add a new RPC for bots to fetch this, alongside any other + potentially useful information. + +4. We could grant bots permission to view their own join tokens. There is + precedent here as bots can view e.g. their own roles without explicitly + having RBAC permissions to do so. + +The remaining rejoin counter should then be exposed as a Prometheus metric to +allow for alerting if a bot drops below some threshold. + +Importantly, this is a potentially lagging indicator. 
The design allows for the +rejoin counter to be decreased (to zero) at any time, so a rejoin attempt may +still fail at any time. This should be acceptable since it can also be increased +after the fact to restore access if desired. #### Keystore Storage Backends @@ -580,7 +620,7 @@ URI to get started immediately. URL paths and query parameters may also provide options for future extension if desired. -## Alternatives and Future Extensions +## Future Extensions and Alternatives ### Agent Joining Support @@ -631,6 +671,44 @@ accept an ordered list of joining token strings which could be used sequentially. If the internal identity expires, the next token in the list will be used to attempt a rejoin. +This may be interesting for users with workload-critical bots wishing to hedge +against an outage in a delegated join method's IdP. With Workload ID being used +to authenticate e.g. database connections, this might be a worthwhile future +addition. + +### Alternative: State in Bot Instances + +We could alternatively store state in bot instances, rather than the token +resource. + +To some extent this better matches current Teleport behavior today. Bot instance +resources already manage quite a bit of backend state and track recent +authentications, and there isn't much precedent for state to be actively managed +in provision tokens themselves. + +On the other hand, bot instances are created automatically and are not generally +edited by users - though there's no compelling reason this can't be the case. + +In practice, the best argument for keeping state in the provision token is +probably that we may wish to enable node joining with this method in the future. + ## Rejected Alternatives ### N-Token Resiliency + +This alternative built on top of the existing `token` join method by providing +bots with additional secrets they could use if their identity expired.
Users +could select their desired level of resiliency by selecting the number of backup +tokens a bot would receive, thus the name. + +This idea still has some merit but we realized this can largely be simplified +into the bind-on-join flow described above. Multiple secrets mainly served to +constrain credential reuse by limiting the number of possible rejoins until a +human has to take some action. + +Bound keypair joining replaces the secrets with a rejoin counter, and allows for +(among other things) resuscitation of dead bots since their credentials remain +available even once expired. + +A lighter weight alternative here could be client-side multi-token support as +described in the alternatives above. From 4a1fe8fc1bf613a3c060d24521985f4854058096 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Mon, 17 Mar 2025 18:48:05 -0600 Subject: [PATCH 09/25] Rename overlapping `expires` fields --- rfd/0205-improved-onprem-joining.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 0d1d4dd0c4dfd..b91f0037c4ff4 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -277,7 +277,7 @@ spec: # Initial joining must take place before this timestamp. May be # modified if bot has not yet joined. - expires: "2025-03-01T21:45:40.104524Z" + must_join_before: "2025-03-01T21:45:40.104524Z" # Parameters to tune rejoining behavior when the regular bot identity has # expired @@ -293,8 +293,9 @@ spec: total_rejoins: 10 # If set, rejoining is only valid before this timestamp; may be - # incremented to extend bot lifespan. - expires: "" + # incremented or reset to the empty string to allow rejoining once + # expired. 
+ must_rejoin_before: "2026-03-01T21:45:40.104524Z" status: bound_keypair: @@ -316,7 +317,9 @@ status: # A count of remaining rejoins; if `.spec.bound_keypair.rejoining.total_rejoins` # is incremented, this value will be incremented by the same amount. If - # decremented, this value cannot fall below zero. + # decremented, this value cannot fall below zero. If + # `.spec.bound_keypair.rejoining.unlimited` is set, this value will always + # be 0 but rejoin attempts will succeed. remaining_rejoins: 10 ``` @@ -496,11 +499,11 @@ We should take steps to improve visibility of bots at or near expiry, including: #### Outstanding Issue: Soft Bot Expiration -The `spec.rejoining.expires` field can be used to prevent rejoining after a -certain time, in tandem with the rejoin count limit. This has the - likely -confusing - downside that bots will still be able to renew their certificates -indefinitely past the expiration date, assuming their certs were valid at the -time of expiration. +The `spec.rejoining.must_rejoin_before` field can be used to prevent rejoining +after a certain time, in tandem with the rejoin count limit. This has the - +likely confusing - downside that bots will still be able to renew their +certificates indefinitely past the expiration date, assuming their certs were +valid at the time of expiration. Also, as with all Teleport resources, the `metadata.expires` field can also remove the token resource after a set time. Bots will also continue to renew @@ -520,8 +523,7 @@ We would like to solve two expiration use cases: 2. We should be able to prevent bot rejoins after a certain date, to control rejoining conditions in tandem with the rejoin counter. The - `spec.rejoining.expires` field accomplishes this, but does have a naming - collision with `metadata.expires`. + `spec.rejoining.must_rejoin_before` field may accomplish this. TODO: Ensure this is solved and not confusing.
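Review note on the patch above: the interaction between `total_rejoins`, `remaining_rejoins`, `unlimited`, and `must_rejoin_before` may be easier to follow as code. The following is a minimal sketch only, not Teleport's implementation. `RejoinState`, `apply_total_rejoins`, and `may_rejoin` are hypothetical names introduced for illustration. The sketch assumes that an over-large decrement is clamped at zero rather than rejected, and that the deadline still applies when `unlimited` is set; the RFD text does not settle either point.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RejoinState:
    """Hypothetical stand-in for the token's rejoining spec and status fields."""
    total_rejoins: int                              # spec.bound_keypair.rejoining.total_rejoins
    remaining_rejoins: int                          # status.bound_keypair.remaining_rejoins
    unlimited: bool = False                         # spec.bound_keypair.rejoining.unlimited
    must_rejoin_before: Optional[datetime] = None   # None means no rejoin deadline


def apply_total_rejoins(state: RejoinState, new_total: int) -> None:
    """Shift remaining_rejoins by the same delta as total_rejoins.

    Assumption: a decrement larger than the remaining count is clamped to
    zero here; a real implementation might reject the update instead.
    """
    delta = new_total - state.total_rejoins
    state.total_rejoins = new_total
    state.remaining_rejoins = max(0, state.remaining_rejoins + delta)


def may_rejoin(state: RejoinState, now: datetime) -> bool:
    """Return True if a rejoin attempt at `now` would be allowed."""
    # Assumption: the deadline is checked first and applies even to
    # unlimited tokens.
    if state.must_rejoin_before is not None and now >= state.must_rejoin_before:
        return False
    return state.unlimited or state.remaining_rejoins > 0


if __name__ == "__main__":
    state = RejoinState(total_rejoins=10, remaining_rejoins=3)
    apply_total_rejoins(state, 12)  # an admin raises the limit by two
    print(state.remaining_rejoins, may_rejoin(state, datetime.now(timezone.utc)))
```

Under these assumptions, raising `total_rejoins` is enough to revive a bot whose counter reached zero, which matches the recovery flow the RFD describes. It also makes the soft-expiration concern visible: nothing in this check stops a bot that already holds valid certificates from continuing to refresh them.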
From a3606f5abae4466353decaa04dad9584ef73693c Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Tue, 18 Mar 2025 21:37:15 -0600 Subject: [PATCH 10/25] Pivot to delegated joining impl. Add sequence diagram. --- rfd/0205-improved-onprem-joining.md | 157 ++++++++++++++++++++++++---- 1 file changed, 135 insertions(+), 22 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index b91f0037c4ff4..5314d0dca81b8 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -140,15 +140,16 @@ one-time use shared secret to authenticate the exchange, much like today's `token` method. Once the public key has been shared, bots may then join by requesting a -challenge from the Teleport Auth service and completing it by signing it with -their private key. If successful, the bot is issued a renewable identity just as -`token`-joined bots are today, and the bot will actively renew this identity for -as long as possible, or until its backing token expires. +challenge from the Teleport Auth service and complete it by signing it with +their private key. If successful, the bot is issued a nonrenewable identity +similar to our existing delegated join methods, and the bot will actively +refresh this identity for as long as possible, or until its backing token +expires. -If the identity renewal fails at any point, bots may attempt to reauthenticate, +If the identity refresh fails at any point, bots may attempt to rejoin, and the Auth service can use predefined per-bot rules to decide if this specific bot is allowed to rejoin, including a rejoin counter and expiration date. If a -rejoin is rejected, the bot's identity does not necessarily remain invalid: if +rejoin is rejected, the bot's keypair does not necessarily remain invalid: if server-side rules are adjusted, for example by increasing the token's rejoin limit, it can then rejoin without any client-side reconfiguration. 
@@ -164,8 +165,7 @@ This has several important differences to existing join methods: rather than (necessarily) a hardware token. - When a bot's identity expires, assuming it has some rejoin allocations left, - it can simply repeat the joining process to receive a fresh renewable - certificate. + it can simply repeat the joining process to receive a fresh certificate. - If a bot exhausts its rejoining limit, it will not be able to fetch new certificates, similar to today's behavior. However, this bot can be restored @@ -174,7 +174,8 @@ This has several important differences to existing join methods: The failed `tbot` instance can then retry the joining process, and it will succeed. -It otherwise functions similarly to `token`-joined bots today: +It otherwise functions similarly to `token`-joined bots today, despite being +implemented as a delegated joining method: - It is still fully infrastructure agnostic and works across operating systems. @@ -182,13 +183,64 @@ It otherwise functions similarly to `token`-joined bots today: work great for experimentation and documentation examples. - It proves its identity - either via an onboarding secret or public key - to - receive a renewable identity and renews it as usual for as long as possible. + receive a nonrenewable identity and refreshes it as usual for as long as + possible by repeating the joining challenge. - When the internal identity expires, the bot loses access to resources until it reauthenticates. - The generation counter is still used to detect identity reuse. +#### Renewing and Rejoining + +> [!NOTE] +> We use the term "renew" fairly loosely in various parts of Teleport and +> especially Machine ID. 
For our purposes here, we'll try to stick to these +> definitions: +> +> - **refreshing**: fetching new identities at a regular interval, regardless of +> method +> - **renewing**: refreshing a renewable identity without completing a joining +> challenge, specific to token joining +> - **rejoining**: in bound keypair joining, a rejoin occurs when attempting a +> refresh with no client certificates, or expired client certificates. This +> triggers additional verifications, and consumes a rejoin. + +Today, Machine ID has two broad categories of joining: + +- Delegated joining, where joining challenges are completed each time a bot + requests new certificates. Regardless of whether a bot is joining for the + first time or refreshing an existing valid identity, it must complete a + challenge to receive certificates. The two cases are functionally equivalent, + with only minor caveats (see below). This is all join methods except `token`. + +- Non-delegated joining, where bots complete an initial joining challenge once + and then use their existing valid identity as its own proof to fetch new + certificates. This is exclusively used with `token` joining. + +Bound keypair joining could be implemented using either of these strategies. +However, we'll opt to implement this as a delegated join method. This provides +several advantages: + +- Standardized implementation, matching all other join methods - except `token`. + +- Regular verification checks. With true renewable certificates, bots could last + indefinitely without completing a challenge. This makes it harder to tell + which bots are still alive, and could leave bots alive if their join token is + deleted. When bots regularly interact with the join method, we can + +- If using hardware key storage backends, repeating the joining challenge helps + ensure the identity can't be effectively exfiltrated.
+ +For the purposes of differentiating a full rejoin from a regular refresh, we can +take advantage of optional authenticated joining added in +[RFD 0162](./0162-machine-id-token-join-method-bot-instance.md). This allows +clients to present an existing valid identity to preserve certain identity +parameters, like the bot instance UUID. + +Using this mechanism, we can ensure that join attempts with an existing client +identity do not consume a rejoin; attempts without one will consume a rejoin. + #### Joining UX Flows This join method creates two new joining flows: @@ -214,7 +266,7 @@ This join method creates two new joining flows: ever copied. On startup, Auth issues a challenge to the bot which is solved with its - private key, and it receives a standard renewable identity. + private key, and it receives a standard bot identity. 2. **Bind-on-join**: The `tbot` client is given a joining secret. @@ -268,7 +320,10 @@ spec: # skipped and instance will directly prove its identity using its private # key. It is an error for a public key to be associated with more than one # token, and creation or update will fail if a public key is reused. - public_key: null + # May not be modified after resource creation. Note that public keys may + # be rotated, so refer to `.status.bound_keypair.bound_public_key` for the + # currently bound key information. + initial_public_key: null # If set, use an explicit initial joining secret; if both this and # `public_key` are unset, a value will be generated server-side and @@ -295,12 +350,12 @@ spec: # If set, rejoining is only valid before this timestamp; may be # incremented or reset to the empty string to allow rejoining once # expired. - must_rejoin_before: "2026-03-01T21:45:40.104524Z" + may_rejoin_until: "2026-03-01T21:45:40.104524Z" status: bound_keypair: - # If `spec.onboarding.public_key` is unset, this value will be generated - # server-side and made available here. 
If + # If `spec.onboarding.initial_public_key` is unset, this value will be + # generated server-side and made available here. If # `spec.onboarding.initial_join_secret` is set, its value will be copied # here. initial_join_secret: @@ -319,8 +374,11 @@ status: # is incremented, this value will be incremented by the same amount. If # decremented, this value cannot fall below zero. If # `.spec.bound_keypair.rejoining.unlimited` is set, this value will always - # be 0 but rejoin attempts will be succeed. + # be 0 but rejoin attempts will succeed. remaining_rejoins: 10 + + # The timestamp of the last successful joining or rejoining attempt, if any. + last_joined_at: null ``` #### Terraform Example @@ -397,9 +455,9 @@ resource "aws_instance" "example" { } ``` -In this example, if node `bar` uses its 2 renewals, we can add a new entry for +In this example, if node `bar` uses its 2 rejoins, we can add a new entry for it in `bot_rejoin_overrides` and its `ProvisionToken` will be updated to allow -additional renewals. +additional rejoins. #### Challenge Ceremony @@ -409,7 +467,48 @@ challenge containing a nonce. To avoid completely implementing our own authentication ceremony, clients can use `go-jose` to marshal and sign a JWT which can then be verified easily on the server. -TODO: This needs significant further elaboration and feedback. +A rough outline of the joining procedure: + +```mermaid +sequenceDiagram + participant keystore as Local Keystore + participant bot as Bot + participant auth as Auth Server + + opt first join & has initial join token + bot->>+keystore: Generate new keypair + keystore-->>-bot: Public key + + bot->>auth: Bind public key with token + auth->>auth: Verify token
& record public key + end + + bot->>auth: Request joining challenge + auth-->>bot: Sends joining challenge + bot->>keystore: Request signed document + keystore-->>bot: Signed challenge document + bot->>auth: Signed challenge document + auth->>auth: Validate signed document
against bound public key + + opt no valid client certificate + auth->>auth: decrement rejoin counter + end + + auth-->>bot: Signed TLS certificates +``` + +To avoid use of traditional renewable certificates, this takes advantage of +recent support for optionally authenticated joining added as part of +[RFD 0162](./0162-machine-id-token-join-method-bot-instance.md). If a bot still +has a valid client certificate from a previous authentication attempt, it uses +it to open an mTLS session. + +When Auth validates the join attempt, clients that presented an existing valid +identity are considered to be "renewals" and do not trigger a "rejoin", leaving +the rejoin counter untouched. Clients that do not present a valid client +certificate are considered to be rejoining and the token associated with this +public key must have `.status.bound_keypair.remaining_rejoins` >= 1. + #### Client-Side Changes in `tbot` @@ -499,7 +598,7 @@ We should take steps to improve visibility of bots at or near expiry, including: #### Outstanding Issue: Soft Bot Expiration -The `spec.rejoining.must_rejoin_before` field can be used to prevent rejoining +The `spec.rejoining.may_rejoin_until` field can be used to prevent rejoining after a certain time, in tandem with the rejoin count limit. This has the - likely confusing - downside that bots will still be able to renew their certificates indefinitely past the expiration date, assuming their certs were @@ -523,10 +622,13 @@ We would like to solve two expiration use cases: 2. We should be able to prevent bot rejoins after a certain date, to control rejoining conditions in tandem with the rejoin counter. The - `spec.rejoining.must_rejoin_before` field may accomplish this. + `spec.rejoining.may_rejoin_until` field may accomplish this. TODO: Ensure this is solved and not confusing. +TODO: Presumably solved by switching to delegated joining. Will remove this +section after some reevaluation. 
+ #### Keypair Rotation Given the long-lived nature of the keypair credential, it's important to support @@ -554,7 +656,7 @@ improve the usability of non-delegated joining. #### Longer-Lived Bots -The renewable identity's 24 hour maximum TTL is too restrictive and should be +The bot identity's 24 hour maximum TTL is too restrictive and should be lengthened. We propose raising this limit to 7 days, but keeping the default (1hr) the same. @@ -645,6 +747,17 @@ key. There are several avenues for this, depending on storage backend: verification. For example, YubiHSM has a touch sensor that can be required for access on a key-by-key basis. +#### Additional Alerting Rules + +A non-exhaustive list of additional alerting rules that may be beneficial: + +- Configurable cluster alerts when a bot rejoins or has used its last attempt. + This should be opt-in as this type of alert may not scale well with lots of + bots. + +- + + #### Tightly Scoped Token RBAC To better support use cases where central administrators vend bot tokens for From 6df3e8da91b9f4621f1130b219b1441c6f742dd7 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Tue, 18 Mar 2025 21:57:20 -0600 Subject: [PATCH 11/25] Don't assume ED25519; fix renew->refresh terminology --- rfd/0205-improved-onprem-joining.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 5314d0dca81b8..9dc3f9be1fc6b 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -133,11 +133,11 @@ making minimal security concessions: We believe a new join method, `bound-keypair`, can meet our needs and provide significantly more flexibility than today's `token` join method. This works by - -in a sense - inverting the token joining procedure: bots generate an ED25519 -keypair, and the public key is copied to the server. 
The public key can be -copied out-of-band, or bots can provide their public key on first join using a -one-time use shared secret to authenticate the exchange, much like today's -`token` method. +in a sense - inverting the token joining procedure: bots generate a keypair, +using the cluster signature algorithm (probably ED25519) and the public key is +copied to the server. The public key can be copied out-of-band, or bots can +provide their public key on first join using a one-time use shared secret to +authenticate the exchange, much like today's `token` method. Once the public key has been shared, bots may then join by requesting a challenge from the Teleport Auth service and complete it by signing it with @@ -504,12 +504,11 @@ has a valid client certificate from a previous authentication attempt, it uses it to open an mTLS session. When Auth validates the join attempt, clients that presented an existing valid -identity are considered to be "renewals" and do not trigger a "rejoin", leaving +identity are considered to be "refresh" and do not trigger a rejoin, leaving the rejoin counter untouched. Clients that do not present a valid client certificate are considered to be rejoining and the token associated with this public key must have `.status.bound_keypair.remaining_rejoins` >= 1. - #### Client-Side Changes in `tbot` Bots should be informed of their number of remaining rejoins. There's a few From 0385874d1c53128e98be66ab658dc0958512bafa Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 19 Mar 2025 18:25:29 -0600 Subject: [PATCH 12/25] Tweak joining URL scheme Moves join method to the URL scheme to allow joining secrets; more examples added. 
--- rfd/0205-improved-onprem-joining.md | 31 ++++++++++++++++++++++------- 1 file changed, 24 insertions(+), 7 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 9dc3f9be1fc6b..25759d78fd7f4 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -504,8 +504,8 @@ has a valid client certificate from a previous authentication attempt, it uses it to open an mTLS session. When Auth validates the join attempt, clients that presented an existing valid -identity are considered to be "refresh" and do not trigger a rejoin, leaving -the rejoin counter untouched. Clients that do not present a valid client +identity are considered to be requesting a refresh rather than rejoining, +leaving the rejoin counter untouched. Clients that do not present a valid client certificate are considered to be rejoining and the token associated with this public key must have `.status.bound_keypair.remaining_rejoins` >= 1. @@ -693,27 +693,35 @@ configuration fields and would greatly improve the onboarding experience. The URI syntax might look like this: ``` -tbot+[auth|proxy]://[join method]:[token value]@[addr]:[port]?key=val&foo=bar +tbot+[auth|proxy]+[join method]://[token name]:[optional parameter]@[addr]:[port]?key=val&foo=bar ``` +The "optional parameter" is a relatively new concept to support specifying the +token name and secret value, as they are now decoupled. Join methods like Azure +that require an additional parameter can take advantage of this as well; +previously this method required a `tbot.yaml` in order to specify a `ClientID`. 
+ Consider these two equivalent commands: ``` -$ tbot start identity --proxy-server example.teleport.sh:443 --join-method bound-keypair --token example +$ tbot start identity --proxy-server example.teleport.sh:443 --join-method bound-keypair:initial-join-secret --token example -$ tbot start identity tbot+proxy://bound-keypair:example@example.teleport.sh:443 +$ tbot start identity tbot+proxy+bound-keypair://example:initial-join-secret@example.teleport.sh:443 ``` +(We will also need to introduce `method:parameter` syntax for the traditional +`--join-method` syntax.) + Joining URIs can greatly simplify the regular onboarding experience by providing a single value to copy when onboarding a bot: ``` $ tctl bots add example --roles=access -The bot token: tbot+proxy://bound-keypair:example@example.teleport.sh:443 +The bot token: tbot+proxy+bound-keypair://example:initial-join-secret@example.teleport.sh:443 This token will expire in 59 minutes. [...snip...] -$ tbot start identity tbot+proxy://bound-keypair:example@example.teleport.sh:443 ... +$ tbot start identity tbot+proxy+bound-keypair://example:initial-join-secret@example.teleport.sh:443 ... ``` Given the CLI now supports many operational modes, it's much easier for users to @@ -723,6 +731,15 @@ URI to get started immediately. URL paths and query parameters may also provide options for future extension if desired. 
+Additional examples for other join methods: +``` +# Traditional token joining, connecting to auth +$ tbot start identity tbot+auth+token://abcde12345@teleport.example.com:3025 + +# Azure joining via proxy with client ID specified +$ tbot start identity tbot+proxy+azure://bot-token:22222222-2222-2222-2222-222222222222@example.teleport.sh:443/ +``` + ## Future Extensions and Alternatives ### Agent Joining Support From 04000855297f9ac3926e37988d3c749d10590481 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 19 Mar 2025 19:15:09 -0600 Subject: [PATCH 13/25] Bot lifecycle; removed uniqueness requirement Removes "soft bot expiration" section as this has been resolved with the switch to delegated joining. Also added a Bot Lifecycle section to describe how bots are expected to be disabled. Also removed the public key uniqueness requirement. At join time bots now specify both the token name and joining secret (if any), so we won't need to search all tokens for a matching key. It was also not efficient to ensure uniqueness among all provision tokens. --- rfd/0205-improved-onprem-joining.md | 108 +++++++++++++++------------- 1 file changed, 58 insertions(+), 50 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 25759d78fd7f4..720bd5ad18cf4 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -295,6 +295,51 @@ Flow 2 is also mostly equivalent to `token` joining. Current users will already be conceptually familiar with the joining process, and documentation updates will be minimal. +#### Bot Lifecycle + +Bots joined with bound-keypair tokens will have a meaningfully different +lifecycle than other types of bots. To summarize some key differences: + +- The `Bot` and `ProvisionToken` resources are never expected to expire + automatically. Users could put an expiration on the token resource, but we + will not recommend doing so.
+ +- The bot keypair has no defined lifespan of its own. + +- As long as a bot retains its keypair, it can always be recovered server-side. + If it runs out of rejoins, the backend token can be reconfigured to allow + more. If the backend token is deleted outright, it can be recreated with the + public key. + +- Each time a bot rejoins, it creates a new bot instance. The bot instance is + tied to the valid client certificate, and we won't change this behavior. The + new bot instance will contain a reference to the previous instance ID based on + the content of `.status.bound_keypair.bound_bot_instance_id` at rejoining + time. + +Bots may stop refreshing under several conditions, triggering a rejoin attempt: + +- The backing `ProvisionToken` resource has been deleted; in this case, the + rejoin attempt is unlikely to succeed + +- The bot has been offline for longer than its certificate TTL + +- A lock targeting the bot in any capacity (username, instance UUID, token name) + is in place + +Bots will be unable to rejoin under any of these conditions: + +- The `ProvisionToken` resource has been deleted + +- Loss of private key + +- A lock is in place + +- `.status.bound_keypair.remaining_rejoins` is zero (and not unlimited) + +- `.spec.bound_keypair.rejoining.may_rejoin_until` is set to a value before the + current time + #### Token Resource Example `bound-keypair`-type tokens differ from other types in that they are intended to @@ -361,9 +406,6 @@ status: initial_join_secret: # The public key of the bot associated with this token, set on first join. - # The bound public key must be unique among all `bound-keypair` tokens; - # token resource creation/update or bot joining will fail if a public key is - # reused. bound_public_key: # The current bot instance UUID. 
A new UUID is issued on rejoin; the previous @@ -595,38 +637,9 @@ We should take steps to improve visibility of bots at or near expiry, including: - Exposing per-token renewal counts as Prometheus metrics, both on the Auth Service and via `tbot`'s metrics endpoint. -#### Outstanding Issue: Soft Bot Expiration - -The `spec.rejoining.may_rejoin_until` field can be used to prevent rejoining -after a certain time, in tandem with the rejoin count limit. This has the - -likely confusing - downside that bots will still be able to renew their -certificates indefinitely past the expiration date, assuming their certs were -valid at the time of expiration. - -Also, as with all Teleport resources, the `metadata.expires` field can also -remove the token resource after a set time. Bots will also continue to renew -certs as long as possible until they are either locked or otherwise fail to -renew their certs on time. - -These two expirations create some confusion, and do not allow for an obvious -method to deny a bot access, aside from creating a lock. - -We would like to solve two expiration use cases: -1. We should be able to prevent all bot resource access after a certain date, - including renewals, in a way that allows the bot to be resumed later if - desired. (I.e. the token resource must still exist.) - - Locks may accomplish this, but some centralized management in the token - resource would be convenient. - -2. We should be able to prevent bot rejoins after a certain date, to control - rejoining conditions in tandem with the rejoin counter. The - `spec.rejoining.may_rejoin_until` field may accomplish this. - -TODO: Ensure this is solved and not confusing. - -TODO: Presumably solved by switching to delegated joining. Will remove this -section after some reevaluation. +In the future, we might also consider configurable cluster alerts when a bot +rejoins or has used its last attempt. This should be opt-in as this type of +alert may not scale well with lots of bots. 
#### Keypair Rotation @@ -742,13 +755,13 @@ $ tbot start identity tbot+proxy+azure://bot-token:22222222-2222-2222-2222-22222 ## Future Extensions and Alternatives -### Agent Joining Support +### Future: Agent Joining Support We should explore expanding this join method to cover regular Teleport agent joining as well as bots, as a more secure alternative to static or long-lived join tokens. -### Additional Keypair Protections +### Future: Additional Keypair Protections We should investigate supporting additional layers of protection for the private key. There are several avenues for this, depending on storage backend: @@ -763,18 +776,13 @@ key. There are several avenues for this, depending on storage backend: verification. For example, YubiHSM has a touch sensor that can be required for access on a key-by-key basis. -#### Additional Alerting Rules - -A non-exhaustive list of additional alerting rules that may be beneficial: - -- Configurable cluster alerts when a bot rejoins or has used its last attempt. - This should be opt-in as this type of alert may not scale well with lots of - bots. - -- - +Note that as a consequence of implementing this as a delegated joining method, +bots are expected to complete joining challenges at regular intervals. This +could be varying levels of impractical depending on keystore backend. Keys +stored encrypted at rest on the filesystem could be decrypted and kept in +memory, but HSMs with presence requirements may not be practical. -#### Tightly Scoped Token RBAC +#### Future: Tightly Scoped Token RBAC To better support use cases where central administrators vend bot tokens for teams, we can add scoped RBAC support for `ProvisionToken` CRUD operations. @@ -786,7 +794,7 @@ administrator. This is likely dependent on [Scoped RBAC](https://github.com/gravitational/teleport/pull/38078), which is still in the planning stage. 
-### Explicitly Insecure Token Joining +### Alternative/Future: Explicitly Insecure Token Joining There are perfectly valid use cases for allowing relatively insecure access to resources that do not have strict trust requirements, and Teleport's RBAC system @@ -795,7 +803,7 @@ resources. It may be worthwhile to add an `insecure-shared-secret` join method that allows for arbitrary joining in use cases that still fall through the cracks, so long as end users understand the security implications. -### Client-side multi-token support +### Alternative: Client-side multi-token support A simpler variant of N-Token Resiliency, this would allow `tbot` clients to accept an ordered list of joining token strings which could be used From 905ceae6cc67b7a52e65240eb2070141c54d2b43 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 19 Mar 2025 20:03:47 -0600 Subject: [PATCH 14/25] Add keypair rotation details --- rfd/0205-improved-onprem-joining.md | 58 +++++++++++++++++++++++++++-- 1 file changed, 55 insertions(+), 3 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 720bd5ad18cf4..ce8201ad9e896 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -358,7 +358,6 @@ spec: join_method: bound-keypair bound_keypair: - # `onboarding` parameters control initial join behavior onboarding: # If set, no joining secret is generated; the secret exchange ceremony is @@ -397,6 +396,11 @@ spec: # expired. may_rejoin_until: "2026-03-01T21:45:40.104524Z" + # If set, the bot will perform a keypair rotation on its next renewal after + # it is informed of the change to this field. Note that this is tied to bot + # heartbeats and may not take effect on the next refresh interval. 
+ rotate_on_next_renewal: false + status: bound_keypair: # If `spec.onboarding.initial_public_key` is unset, this value will be @@ -421,6 +425,9 @@ status: # The timestamp of the last successful joining or rejoining attempt, if any. last_joined_at: null + + # The timestamp of the last successful keypair rotation, if any. + last_rotated_at: null ``` #### Terraform Example @@ -553,7 +560,8 @@ public key must have `.status.bound_keypair.remaining_rejoins` >= 1. #### Client-Side Changes in `tbot` -Bots should be informed of their number of remaining rejoins. There's a few +Bots should be informed of various status metrics, including number of remaining +rejoins and whether or not a keypair rotation has been requested. There's a few methods by which we could inform bots of their remaining rejoins: 1. (Recommended) Heartbeats: bots submit heartbeats at startup and on a regular @@ -648,7 +656,51 @@ rotation without bot downtime. Ideally, it should be possible to initiate a rotation from either the server (e.g. by setting a `rotate_on_next_renewal` flag on the token/bot instance) or `tbot` client. -TODO: Expand this. +To trigger a rotation, an admin can set `.spec.bound_keypair.rotate_on_next_renewal=true` +on the bound keypair token. The value of this field will be synchronized to the +bot using the same mechanism as described above for remaining rejoins, which is +tied to the heartbeat interval (30m, hard coded) rather than the bot's regular +renewal interval, so it will take place on the next renewal once the request has +been synchronized. + +To perform the rotation, additional steps are taken as part of the challenge +ceremony: + +```mermaid +sequenceDiagram + participant keystore as Local Keystore + participant bot as Bot + participant auth as Auth Server + + Note over keystore,auth: Joining secret exchange not shown + bot->>keystore: Request new keypair + keystore-->>bot: New public key + bot->>auth: Request joining challenge
with new public key as
optional parameter + auth-->>bot: Sends joining challenge + + bot->>keystore: Request signed document
with original public key + keystore-->>bot: Signed challenge document + bot->>auth: Signed challenge document + auth->>auth: Validate signed document
against bound public key + + auth-->>bot: Sends new joining challenge
for new public key + bot->>keystore: Request signed document
with new public key + keystore-->>bot: Signed challenge document + + bot->>auth: Signed challenge document + auth->>auth: Validate signed document
against new public key + auth->>auth: Commit new public key to backend,
clear rotate flag + + auth-->>bot: Signed TLS certificates +``` + +As shown above, the join service will return a second challenge rather than +certs if the initial request included a new public key. Certs will only be +returned on completion of the second challenge using the new public key, and the +new key will only be committed to the backend at this point. + +Other bot parameters remain unchanged. No new bot instance is created, the +generation counter is not reset and is checked and incremented as usual. #### Remaining Downsides From ccc4f212a26bfaab3165c4e83a1c584fcbe304b8 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Wed, 19 Mar 2025 21:22:26 -0600 Subject: [PATCH 15/25] Credential duplication mitigation, proto draft Describes a method for mitigating credential duplication, and includes a protobuf draft. --- rfd/0205-improved-onprem-joining.md | 115 ++++++++++++++++++++++++++-- 1 file changed, 110 insertions(+), 5 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index ce8201ad9e896..66fd35d3a1d8c 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -317,6 +317,10 @@ lifecycle than other types of bots. To summarize some key differences: the content of `.status.bound_keypair.bound_bot_instance_id` at rejoining time. +- When a new instance is generated as part of a rejoin, refresh attempts using + the old instance will be denied via a check against the currently bound + `.status.bound_keypair.bound_bot_instance_id`. 
+ Bots may stop refreshing under several conditions, triggering a rejoin attempt: - The backing `ProvisionToken` resource has been deleted; in this case, the @@ -340,6 +344,49 @@ Bots will be unable to rejoin under any of these conditions: - `.spec.bound_keypair.rejoining.may_rejoin_until` is set to a value before the current time +#### Preventing Credential Duplication + +Like `token` joining, `bound-keypair` aims to have relatively robust protections +against stolen or duplicated credentials, including both the long-lived keypair +and short-lived refreshed certificates. + +Refreshed certificates will function like renewable certificates do today, +including generation counter checks. An attacker would need to exfiltrate both +the bot certificates and its keypair, and then also prevent the original bot +instance from attempting to refresh certificates, or else a generation counter +check will lock the bot on the next refresh attempt. + +However, the long-lived keypair introduces a similar class of problem. If an +attacker exfiltrates the keypair, assuming rejoins are available, they can +attempt to rejoin and gain extended access. The original bot will still compete +with the imposter bot and each will be forced to fully rejoin on every attempt, +but this results in minimal real downtime for an attacker, at least until the +rejoin allowance runs out. + +To mitigate this, we'll need to create a mechanism similar to the generation +counter to protect rejoins and the long-lived keypair: + +1. When joining, Auth will return a join state document (signed JWT) that + includes the current join counter, alongside the usual Teleport + certificates. Bots must store this document for the next join attempt. + +2. On subsequent join attempts, bots must include this signed document in the + join request. + +3. On each join attempt, before creating a new bot instance, auth verifies the + JWT, and compares the current join counter to the join counter in the JWT. 
+ + If the values match, the join is allowed, and a new join state document is + returned alongside the certificate bundle. + + If they do not match, the join is rejected, and a lock is generated against + the affected `(bot, token)`. + +Just as with the generation counter, this procedure relies on an imposter bot +successfully joining once. When the original bot fails to refresh and attempts +to rejoin, it presents a valid but outdated join state document, we generate a +lock, and then deny further access to both bots. + #### Token Resource Example `bound-keypair`-type tokens differ from other types in that they are intended to @@ -423,6 +470,9 @@ status: # be 0 but rejoin attempts will succeed. remaining_rejoins: 10 + # A count of the total number of joins performed using this token. + join_count: 0 + # The timestamp of the last successful joining or rejoining attempt, if any. last_joined_at: null @@ -430,6 +480,60 @@ status: last_rotated_at: null ``` +#### Proto Draft + +```protobuf +message RegisterUsingBoundKeypairInitialRequest { + types.RegisterUsingTokenRequest join_request = 1; + + // If set, requests a rotation to the new public key. The joining challenge + // must first be completed using the previous key, and upon completion a new + // challenge will be issued for this key. Certificates will only be returned + // after the second challenge is complete. + bytes new_public_key = 2; + + // A document signed by Auth containing join state parameters from the + // previous join attempt. Not required on initial join; required on all + // subsequent joins. 
+ bytes previous_join_state = 3; +} + +message RegisterUsingBoundKeypairChallengeResponse { + bytes solution = 1; +} + +message RegisterUsingBoundKeypairRequest { + oneof payload { + RegisterUsingBoundKeypairInitialRequest init = 1; + RegisterUsingBoundKeypairChallengeResponse challenge_response = 2; + } +} + +message RegisterUsingBoundKeypairCertificates { + // Signed Teleport certificates resulting from the join process. + Certs certs = 1; + + // A signed join state document to be provided on the next join attempt. + bytes join_state = 2; +} + +message RegisterUsingBoundKeypairResponse { + oneof response { + // A challenge to sign. During keypair rotation, a second challenge will be + // provided to verify the new keypair before certs are returned. + string challenge = 1; + RegisterUsingBoundKeypairCertificates certs = 2; + } +} + +service JoinService { + // ...snip... + + rpc RegisterUsingBoundKeypair(stream RegisterUsingBoundKeypairRequest) returns (stream RegisterUsingBoundKeypairResponse); +} + +``` + #### Terraform Example This join method is explicitly designed to be used with Terraform and IaC @@ -536,14 +640,15 @@ sequenceDiagram auth-->>bot: Sends joining challenge bot->>keystore: Request signed document keystore-->>bot: Signed challenge document - bot->>auth: Signed challenge document + bot->>auth: Signed challenge document
& previous join state auth->>auth: Validate signed document
against bound public key opt no valid client certificate + auth->>auth: Validate join state document auth->>auth: decrement rejoin counter end - auth-->>bot: Signed TLS certificates + auth-->>bot: Signed TLS certificates
& new join state ``` To avoid use of traditional renewable certificates, this takes advantage of @@ -685,13 +790,13 @@ sequenceDiagram auth-->>bot: Sends new joining challenge
for new public key bot->>keystore: Request signed document
with new public key - keystore-->>bot: Signed challenge document + keystore-->>bot: Signed challenge document
& previous join state bot->>auth: Signed challenge document auth->>auth: Validate signed document
against new public key auth->>auth: Commit new public key to backend,
clear rotate flag - auth-->>bot: Signed TLS certificates + auth-->>bot: Signed TLS certificates
& new join state ``` As shown above, the join service will return a second challenge rather than @@ -774,7 +879,7 @@ $ tbot start identity tbot+proxy+bound-keypair://example:initial-join-secret@exa ``` (We will also need to introduce `method:parameter` syntax for the traditional -`--join-method` syntax.) +`--join-method` flag.) Joining URIs can greatly simplify the regular onboarding experience by providing a single value to copy when onboarding a bot: From 4d0d73d7e794ba034b0e2b8b49bc6cbce22262c1 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 21 Mar 2025 15:17:04 -0600 Subject: [PATCH 16/25] Rename most references to "rejoining" to just "joining" These were fundamentally the same processes, so we'll standardize on calling both initial joining and rejoining different modes of just "joining". --- rfd/0205-improved-onprem-joining.md | 123 ++++++++++++++-------------- 1 file changed, 63 insertions(+), 60 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 66fd35d3a1d8c..06bbd38e087a6 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -148,10 +148,10 @@ expires. If the identity refresh fails at any point, bots may attempt to rejoin, and the Auth service can use predefined per-bot rules to decide if this specific -bot is allowed to rejoin, including a rejoin counter and expiration date. If a -rejoin is rejected, the bot's keypair does not necessarily remain invalid: if -server-side rules are adjusted, for example by increasing the token's rejoin -limit, it can then rejoin without any client-side reconfiguration. +bot is allowed to rejoin, including a join counter and expiration date. If a +join attempt is rejected, the bot's keypair does not necessarily remain invalid: +if server-side rules are adjusted, for example by increasing the token's join +allowance, it can then rejoin without any client-side reconfiguration. 
This has several important differences to existing join methods: @@ -164,13 +164,13 @@ This has several important differences to existing join methods: solve. This is similar to TPM joining today, but backed by a local keypair rather than (necessarily) a hardware token. -- When a bot's identity expires, assuming it has some rejoin allocations left, +- When a bot's identity expires, assuming it has a nonzero join allowance, it can simply repeat the joining process to receive a fresh certificate. -- If a bot exhausts its rejoining limit, it will not be able to fetch new +- If a bot exhausts its joining allowance, it will not be able to fetch new certificates, similar to today's behavior. However, this bot can be restored without needing to generate a new identity: an admin user can edit the backing - `ProvisionToken` to increment `spec.bound_keypair.rejoining.total_rejoins`. + `ProvisionToken` to increment `spec.bound_keypair.joining.total_joins`. The failed `tbot` instance can then retry the joining process, and it will succeed. @@ -239,7 +239,7 @@ clients to present an existing valid identity to preserve certain identity parameters, like the bot instance UUID. Using this mechanism, we can ensure that join attempts with an existing client -identity do not consume a rejoin; attempts without one will consume a rejoin. +identity do not consume a join; attempts without one will consume a join. #### Joining UX Flows @@ -307,11 +307,11 @@ lifecycle than other types of bots. To summarize some key differences: - The bot keypair has no defined lifespan of its own. - As long as a bot retains its keypair, it can always be recovered server-side. - If it runs out of rejoins, the backend token can be reconfigured to allow + If it runs out of joins, the backend token can be reconfigured to allow more. If the backend token is deleted outright, it can be recreated with the public key. -- Each time a bot rejoins, it creates a new bot instance. 
The bot instance is +- Each time a bot joins, it creates a new bot instance. The bot instance is tied to the valid client certificate, and we won't change this behavior. The new bot instance will contain a reference to the previous instance ID based on the content of `.status.bound_keypair.bound_bot_instance_id` at rejoining @@ -339,9 +339,9 @@ Bots will be unable to rejoin under any of these conditions: - A lock is in place -- `.status.bound_keypair.remaining_rejoins` is zero (and not unlimited) +- `.status.bound_keypair.remaining_joins` is zero (and not unlimited) -- `.spec.bound_keypair.rejoining.may_rejoin_until` is set to a value before the +- `.spec.bound_keypair.joining.may_join_until` is set to a value before the current time #### Preventing Credential Duplication @@ -357,11 +357,11 @@ instance from attempting to refresh certificates, or else a generation counter check will lock the bot on the next refresh attempt. However, the long-lived keypair introduces a similar class of problem. If an -attacker exfiltrates the keypair, assuming rejoins are available, they can -attempt to rejoin and gain extended access. The original bot will still compete -with the imposter bot and each will be forced to fully rejoin on every attempt, -but this results in minimal real downtime for an attacker, at least until the -rejoin allowance runs out. +attacker exfiltrates the keypair, assuming additional joins are available, they +can attempt to rejoin and gain extended access. The original bot will still +compete with the imposter bot and each will be forced to fully rejoin on every +attempt, but this results in minimal real downtime for an attacker, at least +until the join allowance runs out. 
To mitigate this, we'll need to create a mechanism similar to the generation counter to protect rejoins and the long-lived keypair: @@ -421,31 +421,33 @@ spec: # stored in `.status.bound_keypair.initial_join_secret` initial_join_secret: "" - # Initial joining must take place before this timestamp. May be - # modified if bot has not yet joined. + # Use of `initial_join_secret` must take place before this timestamp. May + # be modified if initial secret has not yet been consumed. must_join_before: "2025-03-01T21:45:40.104524Z" - # Parameters to tune rejoining behavior when the regular bot identity has - # expired - rejoining: - # If true, `total_rejoins` is ignored and bots may rejoin indefinitely; + # Parameters to tune joining behavior, both on first join and when rejoining + # when the regular identity expires. + joining: + # If true, `total_joins` is ignored and bots may rejoin indefinitely; # must be opt-in. unlimited: false - # Total number of allowed rejoins; this may be incremented to allow + # Total number of allowed joins; this may be incremented to allow # additional rejoins, even if a bot identity has already expired. May # be decremented, but only by the current value of - # `.status.bound_keypair.remaining_rejoins`. - total_rejoins: 10 + # `.status.bound_keypair.remaining_joins`. This value must be at least 1 + # for a bot to perform an initial join. + total_joins: 10 - # If set, rejoining is only valid before this timestamp; may be + # If set, joining is only valid before this timestamp; may be # incremented or reset to the empty string to allow rejoining once # expired. - may_rejoin_until: "2026-03-01T21:45:40.104524Z" + may_join_until: "2026-03-01T21:45:40.104524Z" # If set, the bot will perform a keypair rotation on its next renewal after # it is informed of the change to this field. Note that this is tied to bot - # heartbeats and may not take effect on the next refresh interval. + # heartbeats and may not take effect on the next refresh interval. 
This flag + # will be reset to `false` by Auth upon successful keypair rotation. rotate_on_next_renewal: false status: @@ -459,16 +461,17 @@ status: # The public key of the bot associated with this token, set on first join. bound_public_key: - # The current bot instance UUID. A new UUID is issued on rejoin; the previous - # UUID will be linked via a `previous_instance_id` in the bot instance. + # The current bot instance UUID. A new UUID is issued on rejoin; the + # previous UUID will be linked via a `previous_instance_id` in the bot + # instance. bound_bot_instance_id: - # A count of remaining rejoins; if `.spec.bound_keypair.rejoining.total_rejoins` + # A count of remaining joins; if `.spec.bound_keypair.joining.total_joins` # is incremented, this value will be incremented by the same amount. If # decremented, this value cannot fall below zero. If - # `.spec.bound_keypair.rejoining.unlimited` is set, this value will always + # `.spec.bound_keypair.joining.unlimited` is set, this value will always # be 0 but rejoin attempts will succeed. - remaining_rejoins: 10 + remaining_joins: 10 # A count of the total number of joins performed using this token. join_count: 0 @@ -552,8 +555,8 @@ the node, and manually copying the new token into the bot config. The new `bound-keypair` method improves this situation in two primary ways: one fewer resource needs to be generated (the secret value), and maintenance can be -performed simply by adjusting values in the resource. If a bot fails, its rejoin -counter can be incremented easily. +performed simply by adjusting values in the resource. If a bot fails, its join +allowance can be incremented easily. As an example, we can consider provisioning several bots. 
We'll need to account for future overrides so we can fix a single bot in the future if needed: @@ -568,7 +571,7 @@ resource "teleport_bot" "example" { roles = ["access"] } -variable "bot_rejoin_overrides" { +variable "bot_join_overrides" { type = map(number) default = { foo = 5 @@ -588,9 +591,9 @@ resource "teleport_provision_token" "example" { join_method = "bound-keypair" bound_keypair = { - rejoining = { - # look up node-specific count in the rejoin overrides map, default to 2 - total_rejoins = lookup(var.bot_rejoin_overrides, each.key, 2) + joining = { + # look up node-specific count in the join overrides map, default to 2 + total_joins = lookup(var.bot_join_overrides, each.key, 2) } } } @@ -608,9 +611,9 @@ resource "aws_instance" "example" { } ``` -In this example, if node `bar` uses its 2 rejoins, we can add a new entry for -it in `bot_rejoin_overrides` and its `ProvisionToken` will be updated to allow -additional rejoins. +In this example, if node `bar` uses its 2 joins, we can add a new entry for +it in `bot_join_overrides` and its `ProvisionToken` will be updated to allow +additional joins. #### Challenge Ceremony @@ -645,7 +648,7 @@ sequenceDiagram opt no valid client certificate auth->>auth: Validate join state document - auth->>auth: decrement rejoin counter + auth->>auth: decrement remaining join counter end auth-->>bot: Signed TLS certificates
& new join state @@ -659,21 +662,21 @@ it to open an mTLS session. When Auth validates the join attempt, clients that presented an existing valid identity are considered to be requesting a refresh rather than rejoining, -leaving the rejoin counter untouched. Clients that do not present a valid client +leaving the join counter untouched. Clients that do not present a valid client certificate are considered to be rejoining and the token associated with this -public key must have `.status.bound_keypair.remaining_rejoins` >= 1. +public key must have `.status.bound_keypair.remaining_joins` >= 1. #### Client-Side Changes in `tbot` Bots should be informed of various status metrics, including number of remaining -rejoins and whether or not a keypair rotation has been requested. There's a few -methods by which we could inform bots of their remaining rejoins: +joins and whether or not a keypair rotation has been requested. There's a few +methods by which we could inform bots of their remaining join allowance: 1. (Recommended) Heartbeats: bots submit heartbeats at startup and on a regular - interval. It would be trivial to include a remaining rejoin counter in the + interval. It would be trivial to include a remaining join counter in the (currently empty) heartbeat response. -2. Certificate field: we could include the number of remaining rejoins in a +2. Certificate field: we could include the number of remaining joins in a certificate field. 3. New RPC: we could add a new RPC for bots to fetch this, alongside any other @@ -683,11 +686,11 @@ methods by which we could inform bots of their remaining rejoins: precedent here as bots can view e.g. their own roles without explicitly having RBAC permissions to do so. -The remaining rejoin counter should then be exposed as a Prometheus metric to +The remaining join counter should then be exposed as a Prometheus metric to allow for alerting if a bot drops below some threshold. Importantly, this is a potentially lagging indicator. 
The design allows for the -rejoin counter to be decreased (to zero) at any time, so a rejoin attempt may +join counter to be decreased (to zero) at any time, so a rejoin attempt may still fail at any time. This should be acceptable since it can also be increased after the fact to restore access if desired. @@ -720,7 +723,7 @@ This proposal also aims to improve the non-Terraform UX, particularly when automating with `tctl`. All regular token management workflows with `tctl create -f` will continue to work; upserting resources to modify runtime values will, for example, properly increase -`status.bound_keypair.remaining_rejoins` while preserving other token fields +`status.bound_keypair.remaining_joins` while preserving other token fields like `status.bound_keypair.bound_public_key`. Additional `tctl` changes will include: @@ -729,10 +732,10 @@ Additional `tctl` changes will include: automatically generated join tokens for `tctl bots add` and `tctl bots instances add` to use this new join method. -- Adding a column for "rejoins remaining" in `tctl bots instances ls` (where +- Adding a column for "joins remaining" in `tctl bots instances ls` (where relevant). -- Adding support for updating `total_rejoins` in `tctl bots update` +- Adding support for updating `total_joins` in `tctl bots update` #### Expiration Alerting UX @@ -763,7 +766,7 @@ on the token/bot instance) or `tbot` client. To trigger a rotation, an admin can set `.spec.bound_keypair.rotate_on_next_renewal=true` on the bound keypair token. The value of this field will be synchronized to the -bot using the same mechanism as described above for remaining rejoins, which is +bot using the same mechanism as described above for remaining joins, which is tied to the heartbeat interval (30m, hard coded) rather than the bot's regular renewal interval, so it will take place on the next renewal once the request has been synchronized. 
@@ -809,10 +812,10 @@ generation counter is not reset and is checked and incremented as usual. #### Remaining Downsides -- Repairing a bot that has exhausted all of its rejoins is still a semi-manual +- Repairing a bot that has exhausted all of its joins is still a semi-manual process. It is significantly easier, and does not necessarily require any changes on the impacted bot node itself, but is still annoying. Users can opt - out of this by setting `.spec.rejoining.unlimited=true`, but this has obvious + out of this by setting `.spec.joining.unlimited=true`, but this has obvious security implications. - Effort required to configure IaC / Terraform is still fairly high, even if @@ -945,7 +948,7 @@ To better support use cases where central administrators vend bot tokens for teams, we can add scoped RBAC support for `ProvisionToken` CRUD operations. For example, this would allow a designated team to update a `bound-keypair` -token to increase the rejoin counter without needing to reach out to the central +token to increase the join counter without needing to reach out to the central administrator. This is likely dependent on [Scoped RBAC](https://github.com/gravitational/teleport/pull/38078), @@ -1002,7 +1005,7 @@ into the bind-on-join flow described above. Multiple secrets mainly served to constrain credential reuse by limiting the number of possible rejoins until a human has to take some action. -Bound keypair joining replaces the secrets with a rejoin counter, and allows for +Bound keypair joining replaces the secrets with a join counter, and allows for (among other things) resuscitation of dead bots since their credentials remain available even once expired. 
From 8811f21fc6465028c351c91a7c1487caca0db23c Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 21 Mar 2025 15:45:51 -0600 Subject: [PATCH 17/25] Add example join state document --- rfd/0205-improved-onprem-joining.md | 52 ++++++++++++++++++++++++----- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 06bbd38e087a6..fef1de75a63d6 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -387,6 +387,38 @@ successfully joining once. When the original bot fails to refresh and attempts to rejoin, it presents a valid but outdated join state document, we generate a lock, and then deny further access to both bots. +#### The Join State Document + +As discussed above, the join state document is a JWT signed by Auth included +alongside the regular Teleport user certificates after successfully completing +the challenge ceremony. + +This contains fields that + +An example join state document JWT payload: +```json +{ + "iat": 1234567890, + "iss": "example.teleport.sh", + "aud": "bot-name", + "sequence": 10, + "remaining_joins": 1, + "rotate_on_next_renewal": false, +} +``` + +Our unique claims include: + +- `sequence`: The identity sequence number, analogous to the generation counter + used for `token` joining. This is used to ensure the identity can't be renewed + simultaneously by two different clients. + +- `remaining_joins`: Used to inform clients of how many remaining join attempts + they can make before expiring. + +- `rotate_on_next_renewal`: Used to inform clients of a request to rotate their + keypair on the next refresh attempt. + #### Token Resource Example `bound-keypair`-type tokens differ from other types in that they are intended to @@ -672,17 +704,21 @@ Bots should be informed of various status metrics, including number of remaining joins and whether or not a keypair rotation has been requested. 
There's a few methods by which we could inform bots of their remaining join allowance: -1. (Recommended) Heartbeats: bots submit heartbeats at startup and on a regular - interval. It would be trivial to include a remaining join counter in the - (currently empty) heartbeat response. +1. Join State Document (recommended): we include this information as part of the + join state document which is returned alongside the certificate bundle + following a successful join. + +2. Heartbeats: bots submit heartbeats at startup and on a regular interval. It + would be trivial to include a remaining join counter in the (currently empty) + heartbeat response. -2. Certificate field: we could include the number of remaining joins in a +3. Certificate field: we could include the number of remaining joins in a certificate field. -3. New RPC: we could add a new RPC for bots to fetch this, alongside any other +4. New RPC: we could add a new RPC for bots to fetch this, alongside any other potentially useful information. -4. We could grant bots permission to view their own join tokens. There is +5. We could grant bots permission to view their own join tokens. There is precedent here as bots can view e.g. their own roles without explicitly having RBAC permissions to do so. @@ -691,8 +727,8 @@ allow for alerting if a bot drops below some threshold. Importantly, this is a potentially lagging indicator. The design allows for the join counter to be decreased (to zero) at any time, so a rejoin attempt may -still fail at any time. This should be acceptable since it can also be increased -after the fact to restore access if desired. +still fail. This should be acceptable since it can also be increased after the +fact to restore access if desired. 
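
As a rough illustration of the recommended option, a client could read the join state document's claims and surface them for alerting. The sketch below is in Python for brevity and is not the `tbot` implementation. It assumes the document is a compact JWT carrying the `remaining_joins` and `rotate_on_next_renewal` claims from the example payload. The function names and the threshold are hypothetical, and it deliberately does not verify the signature, which a real client or server would need to do.

```python
import base64
import json


def decode_join_state_payload(token: str) -> dict:
    """Return the UNVERIFIED payload of a compact JWT (header.payload.signature)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a compact JWT with three segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def needs_attention(claims: dict, low_join_threshold: int = 2) -> list:
    """Hypothetical helper: list conditions an operator may want to alert on.

    Caveat: with unlimited joining the counter may read 0 even though joins
    still succeed, so a real check would need to account for that mode.
    """
    warnings = []
    if claims.get("remaining_joins", 0) < low_join_threshold:
        warnings.append("low remaining joins")
    if claims.get("rotate_on_next_renewal", False):
        warnings.append("keypair rotation requested")
    return warnings
```

Because the counter can be changed server-side at any time, a value read this way is only a snapshot from the last successful join.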
#### Keystore Storage Backends From cde1c3199134fb410633c825ade482e7dac43352 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 21 Mar 2025 16:22:35 -0600 Subject: [PATCH 18/25] Add note about locking old instance after rejoin --- rfd/0205-improved-onprem-joining.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index fef1de75a63d6..c9aebdbfadeb4 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -387,6 +387,18 @@ successfully joining once. When the original bot fails to refresh and attempts to rejoin, it presents a valid but outdated join state document, we generate a lock, and then deny further access to both bots. +As an additional level of protection, following a successful rejoin attempt, +we can optionally insert a lock targeting the previous bot instance UUID. This +lock can have a modest expiration date to avoid resource leakage on the cluster +(max renewable cert TTL of the proposed 7 days). + +Today, bots do not immediately notice if they have been locked. However, we can +investigate methods to ensure clients notice locks early and trigger a +rapid renewal, which would in turn fully lock the `(bot, token)` pair once the +original bot attempts to rejoin with an outdated join state document. This would +be an improvement over `token`-joined bots today, which will take up to a full +renewal interval to trigger a generation counter check. 
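
The imposter scenario described above can be sketched as a toy model. This is a simplification under stated assumptions rather than Auth's actual logic: the server is assumed to keep a single sequence number per `(bot, token)` pair, the presented value is assumed to come from an already-verified join state document, and `JoinStateTracker` is a hypothetical name.

```python
from typing import Optional


class JoinStateTracker:
    """Toy model of the server-side sequence check for one (bot, token) pair."""

    def __init__(self, sequence: int = 0):
        self.sequence = sequence
        self.locked = False

    def attempt(self, presented_sequence: int) -> Optional[int]:
        """Return the new sequence on success, or None if the join is denied."""
        if self.locked:
            return None
        if presented_sequence != self.sequence:
            # A valid but outdated document suggests the credentials were
            # duplicated, so lock the pair and deny this and later attempts.
            self.locked = True
            return None
        self.sequence += 1
        return self.sequence
```

In this model, an imposter presenting the current sequence succeeds once. When the original bot later presents the same, now stale, value, the pair is locked and both parties lose access.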
+ #### The Join State Document As discussed above, the join state document is a JWT signed by Auth included From 8f8b8e41aed4150ff5b4e80784888b0dd457c21b Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 21 Mar 2025 16:59:58 -0600 Subject: [PATCH 19/25] Small fixes --- rfd/0205-improved-onprem-joining.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index c9aebdbfadeb4..a6ac125452311 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -220,7 +220,7 @@ Today, Machine ID has two broad categories of joining: Bound keypair joining could be implemented using either of these strategies. However, we'll opt to implement this as a delegated join method. This provides -several advantanges: +several advantages: - Standardized implementation, matching all other join methods - except `token`. @@ -380,7 +380,7 @@ counter to protect rejoins and the long-lived keypair: returned alongside the certificate bundle. If they do not match, the join is rejected, and a lock is generated against - the affected `(bot, token)`. + the affected `(bot, token)` pair. Just as with the generation counter, this procedure relies on an imposter bot successfully joining once. When the original bot fails to refresh and attempts @@ -489,9 +489,8 @@ spec: may_join_until: "2026-03-01T21:45:40.104524Z" # If set, the bot will perform a keypair rotation on its next renewal after - # it is informed of the change to this field. Note that this is tied to bot - # heartbeats and may not take effect on the next refresh interval. This flag - # will be reset to `false` by Auth upon successful keypair rotation. + # it is informed of the change to this field. This flag will be reset to + # `false` by Auth upon successful keypair rotation. 
rotate_on_next_renewal: false status: From aa96dc0c6c7fdc93718994437dac70c80264f25e Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 21 Mar 2025 17:03:12 -0600 Subject: [PATCH 20/25] Fix word missing from cspell --- rfd/cspell.json | 1 + 1 file changed, 1 insertion(+) diff --git a/rfd/cspell.json b/rfd/cspell.json index aa387bfeb8b1e..8dc6259f0a24e 100644 --- a/rfd/cspell.json +++ b/rfd/cspell.json @@ -633,6 +633,7 @@ "rdsproxy", "readyz", "reauth", + "reauthenticated", "reauthenticates", "reccfg", "reconnections", From 9d68564913e42c44ef92eebe6cc087dd5c4c93de Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Mon, 24 Mar 2025 16:32:29 -0600 Subject: [PATCH 21/25] Remove kubernetes reference --- rfd/0205-improved-onprem-joining.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index a6ac125452311..717b2ff855a28 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -66,10 +66,6 @@ joining was really a *good* experience, effectively just: expect your system to never go down for more than 24 hours, bots can happily run for months. - (Ironically, Kubernetes is a great environment in which to run `token`-joined - bots since it'll rapidly reschedule any bot deployments that fail... but we - have a dedicated `kubernetes` delegated join method.) - In short, token joining has a complexity cliff. It's extremely easy to get started, but it can feel like a false start when users learn token joining is not suitable to their production use case. 
At best it's back to the docs to From 6cb7735f0634356ad28f2e94ebfdf16418b74a83 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Mon, 24 Mar 2025 19:52:49 -0600 Subject: [PATCH 22/25] Fix hanging sentence, add reference to join state document in lifecycle --- rfd/0205-improved-onprem-joining.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 717b2ff855a28..5d2f102812692 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -223,7 +223,8 @@ several advantages: - Regular verification checks. With true renewable certificates, bots could last indefinitely without completing a challenge. This makes it harder to tell which bots are still alive, and could leave bots alive if their join token is - deleted. When bots regularly interact with the join method, we can + deleted. When bots regularly interact with the join method, we can ensure they + _stop_ working more rapidly once disabled (see lifecycle section below) - If using hardware key storage backends, repeating the joining challenge helps ensure the identity can't be effectively exfiltrated. 
@@ -340,6 +341,8 @@ Bots will be unable to rejoin under any of these conditions: - `.spec.bound_keypair.joining.may_join_until` is set to a value before the current time +- Inability to provide valid join state document after first join attempt + #### Preventing Credential Duplication Like `token` joining, `bound-keypair` aims to have relatively robust protections From 951ade4be7d343ddad12b0161c50d8b74ddac1dd Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Mon, 14 Apr 2025 15:05:17 -0600 Subject: [PATCH 23/25] SSH key format, introduce "insecure" flag --- rfd/0205-improved-onprem-joining.md | 30 +++++++++++++++++++++++++---- 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 5d2f102812692..fa85f53c49026 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -457,6 +457,8 @@ spec: # May not be modified after resource creation. Note that public keys may # be rotated, so refer to `.status.bound_keypair.bound_public_key` for the # currently bound key information. + # This key should be written in SSH public key format, including the + # algorithm. initial_public_key: null # If set, use an explicit initial joining secret; if both this and @@ -487,6 +489,15 @@ spec: # expired. may_join_until: "2026-03-01T21:45:40.104524Z" + # Insecure is an optional flag that enables insecure joining and + # rejoining. This method disables generation counter checks during joining + # and rejoining. When combined with the `unlimited` flag, this allows + # unlimited reuse of this token provided the client has access to the + # keypair. This may be useful in certain cases - like use in unsupported + # CI/CD providers - but cannot offer the same security assurances and + # should be used with care. + insecure: false + # If set, the bot will perform a keypair rotation on its next renewal after # it is informed of the change to this field. 
This flag will be reset to # `false` by Auth upon successful keypair rotation. @@ -501,6 +512,7 @@ status: initial_join_secret: # The public key of the bot associated with this token, set on first join. + # This key is written in SSH public key format. bound_public_key: # The current bot instance UUID. A new UUID is issued on rejoin; the @@ -1000,14 +1012,24 @@ administrator. This is likely dependent on [Scoped RBAC](https://github.com/gravitational/teleport/pull/38078), which is still in the planning stage. -### Alternative/Future: Explicitly Insecure Token Joining +### Future: Explicitly Insecure Token Joining There are perfectly valid use cases for allowing relatively insecure access to resources that do not have strict trust requirements, and Teleport's RBAC system is robust enough to only allow these bots access to an acceptable subset of -resources. It may be worthwhile to add an `insecure-shared-secret` join method -that allows for arbitrary joining in use cases that still fall through the -cracks, so long as end users understand the security implications. +resources. + +Initially, we will provide a minimal version of this via the +`.spec.bound_keypair.joining.insecure` flag, which bypasses the generation +counter check. When combined with the `unlimited` flag, this allows for +effectively unlimited joining, provided a keypair can still be provided. This +should be enough to enable basic support on otherwise-unsupported CI/CD +provides, at least provided a keypair can be stored in a platform secret. + +Alternatively, it may be worthwhile to add an `insecure-shared-secret` join +method that further reduces security enforcement, and allows for arbitrary +joining in use cases that still fall through the cracks, so long as end users +understand the security implications. 
### Alternative: Client-side multi-token support From 3ed2d909e257cac7ab0daa91717dc0598b322b76 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 15 Aug 2025 19:28:45 -0600 Subject: [PATCH 24/25] Mark as implemented --- rfd/0205-improved-onprem-joining.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index fa85f53c49026..9b650bbd6550e 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -1,6 +1,6 @@ --- authors: Tim Buckley () -state: draft +state: implemented --- # RFD 0205 - Improved On-Prem Bots with Bound Keypair Joining From 5bd66fff7e9884a849d27387093459f7a6e631d1 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 15 Aug 2025 19:57:25 -0600 Subject: [PATCH 25/25] Various updates to reflect implemented state Of note: - s/rejoin/recovery (give or take some conjugation) - Updated keypair rotation and sequence diagram - Updated join state spec - Updated token resource example to match public docs - Lots of misc terminology tweaks --- rfd/0205-improved-onprem-joining.md | 308 ++++++++++++++-------------- 1 file changed, 152 insertions(+), 156 deletions(-) diff --git a/rfd/0205-improved-onprem-joining.md b/rfd/0205-improved-onprem-joining.md index 9b650bbd6550e..571ccac51bf73 100644 --- a/rfd/0205-improved-onprem-joining.md +++ b/rfd/0205-improved-onprem-joining.md @@ -43,7 +43,7 @@ production environment, bot token joining has major operational problems: this happens, a new token must be issued - manually. - Internal bot identities have a hard 24 hour TTL limit, limiting the maximum - possible resiliency to 24 hours before a bot can no longer rejoin without + possible resiliency to 24 hours before a bot can no longer recover without manual human intervention - `token`-method tokens are themselves secret values and their names need to be @@ -95,9 +95,9 @@ create several issues: intended to join? 
- How can we tell joined bots apart, even over time? If the original joining - token is still valid, could a malicious bot purge its identity and rejoin? + token is still valid, could a malicious bot purge its identity and recover? -- When a bot needs to rejoin, does it use the same token? If so, can that token +- When a bot needs to recover, does it use the same token? If so, can that token *ever* expire? With this in mind, we need to strike some balance between effective UX and a @@ -142,12 +142,12 @@ similar to our existing delegated join methods, and the bot will actively refresh this identity for as long as possible, or until its backing token expires. -If the identity refresh fails at any point, bots may attempt to rejoin, -and the Auth service can use predefined per-bot rules to decide if this specific -bot is allowed to rejoin, including a join counter and expiration date. If a -join attempt is rejected, the bot's keypair does not necessarily remain invalid: -if server-side rules are adjusted, for example by increasing the token's join -allowance, it can then rejoin without any client-side reconfiguration. +If the identity refresh fails at any point, bots may attempt to recover, and the +Auth service can use predefined per-bot rules to decide if this specific bot is +allowed to recover, including a join counter and expiration date. If a join +attempt is rejected, the bot's keypair does not necessarily remain invalid: if +server-side rules are adjusted, for example by increasing the token's join +allowance, it can then recover without any client-side reconfiguration. This has several important differences to existing join methods: @@ -156,9 +156,9 @@ This has several important differences to existing join methods: Otherwise, joining bots authenticate with an onboarding secret to automatically share their public key with the server. -- When joining or rejoining, Teleport issues a challenge that the client must - solve. 
This is similar to TPM joining today, but backed by a local keypair - rather than (necessarily) a hardware token. +- When joining, Teleport issues a challenge that the client must solve. This is + similar to TPM joining today, but backed by a local keypair rather than + (necessarily) a hardware token. - When a bot's identity expires, assuming it has a nonzero join allowance, it can simply repeat the joining process to receive a fresh certificate. @@ -166,7 +166,7 @@ This has several important differences to existing join methods: - If a bot exhausts its joining allowance, it will not be able to fetch new certificates, similar to today's behavior. However, this bot can be restored without needing to generate a new identity: an admin user can edit the backing - `ProvisionToken` to increment `spec.bound_keypair.joining.total_joins`. + `ProvisionToken` to increment `spec.bound_keypair.recovery.limit`. The failed `tbot` instance can then retry the joining process, and it will succeed. @@ -187,7 +187,7 @@ implemented as a delegated joining method: - The generation counter is still used to detect identity reuse. -#### Renewing and Rejoining +#### Renewing and Recovering > [!NOTE] > We use the term "renew" fairly loosely in various parts of Teleport and @@ -198,9 +198,9 @@ implemented as a delegated joining method: > method > - **renewing**: refreshing a renewable identity without completing a joining > challenge, specific to token joining -> - **rejoining**: in bound keypair joining, a rejoin occurs when attempting a -> refresh with no client certificates, or expired client certificates. This -> triggers additional verifications, and consumes a rejoin. +> - **recovering**: in bound keypair joining, a recovery occurs when attempting +> a refresh with no client certificates, or expired client certificates. This +> triggers additional verifications, and consumes a recovery. 
Today, Machine ID has two broad categories of joining: @@ -229,7 +229,7 @@ several advantages: - If using hardware key storage backends, repeating the joining challenge helps ensure the identity can't be effectively exfiltrated. -For the purposes of differentiating a full rejoin from a regular refresh, we can +For the purposes of differentiating a recovery from a regular refresh, we can take advantage of optional authenticated joining added in [RFD 0162](./0162-machine-id-token-join-method-bot-instance.md). This allows clients to present an existing valid identity to preserve certain identity @@ -248,15 +248,16 @@ This join method creates two new joining flows: Example UX (subject to change): ``` - $ tbot generate-keypair + $ tbot keypair create --storage=./storage --proxy-server=example.teleport.sh:443 Wrote id_ed25519 Wrote id_ed25519.pub $ tctl bots add example --public-key id_ed25519.pub $ tbot start identity --token=bound-keypair:id_ed25519 ``` - (In this example, `tctl bots add` creates a `bound-keypair` token automatically, - much like a `token`-type token is created automatically today.) + (In this example, `tctl bots add` creates a `bound-keypair` token + automatically, much like a `token`-type token is created automatically + today.) The public key can be copied as needed, similar to SSH `authorized_keys` and GitHub's SSH authentication. This is arguably more secure since no secret is @@ -311,24 +312,25 @@ lifecycle than other types of bots. To summarize some key differences: - Each time a bot joins, it creates a new bot instance. The bot instance is tied to the valid client certificate, and we won't change this behavior. The new bot instance will contain a reference to the previous instance ID based on - the content of `.status.bound_keypair.bound_bot_instance_id` at rejoining + the content of `.status.bound_keypair.bound_bot_instance_id` at recovery time. 
-- When a new instance is generated as part of a rejoin, refresh attempts using +- When a new instance is generated as part of a recovery, refresh attempts using the old instance will be denied via a check against the currently bound `.status.bound_keypair.bound_bot_instance_id`. -Bots may stop refreshing under several conditions, triggering a rejoin attempt: +Bots may stop refreshing under several conditions, triggering a recovery +attempt: - The backing `ProvisionToken` resource has been deleted; in this case, the - rejoin attempt is unlikely to succeed + recovery attempt is unlikely to succeed - The bot has been offline for longer than its certificate TTL - A lock targeting the bot in any capacity (username, instance UUID, token name) is in place -Bots will be unable to rejoin under any of these conditions: +Bots will be unable to recover under any of these conditions: - The `ProvisionToken` resource has been deleted @@ -336,10 +338,11 @@ Bots will be unable to rejoin under any of these conditions: - A lock is in place -- `.status.bound_keypair.remaining_joins` is zero (and not unlimited) +- `.status.bound_keypair.recovery_count` is greater than or equal to + `.spec.bound_keypair.recovery.limit` (and mode is `standard`) -- `.spec.bound_keypair.joining.may_join_until` is set to a value before the - current time +- `.spec.bound_keypair.onboarding.must_register_before` is set to a value before + the current time - Inability to provide valid join state document after first join attempt @@ -357,13 +360,13 @@ check will lock the bot on the next refresh attempt. However, the long-lived keypair introduces a similar class of problem. If an attacker exfiltrates the keypair, assuming additional joins are available, they -can attempt to rejoin and gain extended access. The original bot will still -compete with the imposter bot and each will be forced to fully rejoin on every +can attempt to recover and gain extended access. 
The original bot will still +compete with the imposter bot and each will be forced to recover on every attempt, but this results in minimal real downtime for an attacker, at least until the join allowance runs out. To mitigate this, we'll need to create a mechanism similar to the generation -counter to protect rejoins and the long-lived keypair: +counter to protect recoveries and the long-lived keypair: 1. When joining, Auth will return a join state document (signed JWT) that includes the current join counter, alongside the usual Teleport @@ -383,10 +386,10 @@ counter to protect rejoins and the long-lived keypair: Just as with the generation counter, this procedure relies on an imposter bot successfully joining once. When the original bot fails to refresh and attempts -to rejoin, it presents a valid but outdated join state document, we generate a +to recover, it presents a valid but outdated join state document, we generate a lock, and then deny further access to both bots. -As an additional level of protection, following a successful rejoin attempt, +As an additional level of protection, following a successful recovery attempt, we can optionally insert a lock targeting the previous bot instance UUID. This lock can have a modest expiration date to avoid resource leakage on the cluster (max renewable cert TTL of the proposed 7 days). @@ -394,9 +397,9 @@ lock can have a modest expiration date to avoid resource leakage on the cluster Today, bots do not immediately notice if they have been locked. However, we can investigate methods to ensure clients notice locks early and trigger a rapid renewal, which would in turn fully lock the `(bot, token)` pair once the -original bot attempts to rejoin with an outdated join state document. This would -be an improvement over `token`-joined bots today, which will take up to a full -renewal interval to trigger a generation counter check. +original bot attempts to recover with an outdated join state document. 
This +would be an improvement over `token`-joined bots today, which will take up to a +full renewal interval to trigger a generation counter check. #### The Join State Document @@ -412,24 +415,22 @@ An example join state document JWT payload: "iat": 1234567890, "iss": "example.teleport.sh", "aud": "bot-name", - "sequence": 10, - "remaining_joins": 1, - "rotate_on_next_renewal": false, + "bot_instance_id": "aaaa-bbbb", + "recovery_sequence": 10, + "recovery_limit": 1, + "recovery_mode": "standard", } ``` Our unique claims include: -- `sequence`: The identity sequence number, analogous to the generation counter - used for `token` joining. This is used to ensure the identity can't be renewed - simultaneously by two different clients. +- `recovery_sequence`: The identity sequence number, analogous to the generation + counter used for `token` joining. This is used to ensure the identity can't be + renewed simultaneously by two different clients. -- `remaining_joins`: Used to inform clients of how many remaining join attempts +- `recovery_limit`: Used to inform clients of how many remaining join attempts they can make before expiring. -- `rotate_on_next_renewal`: Used to inform clients of a request to rotate their - keypair on the next refresh attempt. - #### Token Resource Example `bound-keypair`-type tokens differ from other types in that they are intended to @@ -448,90 +449,81 @@ spec: join_method: bound-keypair bound_keypair: - # `onboarding` parameters control initial join behavior + # Fields related to the initial join attempt. onboarding: - # If set, no joining secret is generated; the secret exchange ceremony is - # skipped and instance will directly prove its identity using its private - # key. It is an error for a public key to be associated with more than one - # token, and creation or update will fail if a public key is reused. - # May not be modified after resource creation. 
Note that public keys may - # be rotated, so refer to `.status.bound_keypair.bound_public_key` for the - # currently bound key information. - # This key should be written in SSH public key format, including the - # algorithm. - initial_public_key: null - - # If set, use an explicit initial joining secret; if both this and - # `public_key` are unset, a value will be generated server-side and - # stored in `.status.bound_keypair.initial_join_secret` - initial_join_secret: "" - - # Use of `initial_join_secret` must take place before this timestamp. May - # be modified if initial secret has not yet been consumed. - must_join_before: "2025-03-01T21:45:40.104524Z" - - # Parameters to tune joining behavior, both on first join and when rejoining - # when the regular identity expires. - joining: - # If true, `total_joins` is ignored and bots may rejoin indefinitely; - # must be opt-in. - unlimited: false - - # Total number of allowed joins; this may be incremented to allow - # additional rejoins, even if a bot identity has already expired. May - # be decremented, but only by the current value of - # `.status.bound_keypair.remaining_joins`. This value must be at least 1 - # for a bot to perform an initial join. - total_joins: 10 - - # If set, joining is only valid before this timestamp; may be - # incremented or reset to the empty string to allow rejoining once - # expired. - may_join_until: "2026-03-01T21:45:40.104524Z" - - # Insecure is an optional flag that enables insecure joining and - # rejoining. This method disables generation counter checks during joining - # and rejoining. When combined with the `unlimited` flag, this allows - # unlimited reuse of this token provided the client has access to the - # keypair. This may be useful in certain cases - like use in unsupported - # CI/CD providers - but cannot offer the same security assurances and - # should be used with care. 
- insecure: false - - # If set, the bot will perform a keypair rotation on its next renewal after - # it is informed of the change to this field. This flag will be reset to - # `false` by Auth upon successful keypair rotation. - rotate_on_next_renewal: false + # If set to a public key in SSH authorized_keys format, the + # joining client must have the corresponding private key to join. This + # keypair may be created using `tbot keypair create`. If set, + # `registration_secret` and `must_register_before` are ignored. + initial_public_key: "" + + # If set to a secret string value, a client may use this secret to perform + # the first join without pre-registering a public key in + # `initial_public_key`. If unset and no `initial_public_key` is provided, + # a random value will be generated automatically into + # `.status.bound_keypair.registration_secret`. + registration_secret: "" + + # If set to an RFC 3339 timestamp, attempts to register via + # `registration_secret` will be denied once the timestamp has elapsed. If + # more time is needed, this field can be edited to extend the registration + # period. + must_register_before: "" + + # Fields related to recovery after certificates have expired. + recovery: + # The maximum number of allowed recovery attempts. This value may + # be raised or lowered after creation to allow additional recovery + # attempts should the initial limit be exhausted. If `mode` is set to + # `standard`, recovery attempts will only be allowed if + # `.status.bound_keypair.recovery_count` is less than this limit. This + # limit is not enforced if `mode` is set to `relaxed` or `insecure`. This + # value must be at least 1 to allow for the initial join during + # onboarding, which counts as a recovery. + limit: 1 + + # The recovery rule enforcement mode. Valid values: + # - standard (or unset): all configured rules enforced. The recovery limit + # and client join state are required and verified. This is the most + # secure recovery mode. 
+ # - relaxed: recovery limit is not enforced, but client join state is + # still required. This effectively allows unlimited recovery attempts, + # but client join state still helps mitigate stolen credentials. + # - insecure: neither the recovery limit nor client join state are + # enforced. This allows any client with the private key to join freely. + # This is less secure, but can be useful in certain situations, like in + # otherwise unsupported CI/CD providers. This mode should be used with + # care, and RBAC rules should be configured to heavily restrict which + # resources this identity can access. + mode: "standard" + + # If set to an RFC 3339 timestamp, once elapsed, a keypair rotation will be + # forced on next join if it has not already been rotated. The most recent + # rotation is recorded in `.status.bound_keypair.last_rotated_at`. + rotate_after: "" status: bound_keypair: # If `spec.onboarding.initial_public_key` is unset, this value will be # generated server-side and made available here. If - # `spec.onboarding.initial_join_secret` is set, its value will be copied + # `spec.onboarding.registration_secret` is set, its value will be copied # here. - initial_join_secret: + registration_secret: # The public key of the bot associated with this token, set on first join. # This key is written in SSH public key format. bound_public_key: - # The current bot instance UUID. A new UUID is issued on rejoin; the + # The current bot instance UUID. A new UUID is issued on recovery; the # previous UUID will be linked via a `previous_instance_id` in the bot # instance. bound_bot_instance_id: - # A count of remaining joins; if `.spec.bound_keypair.joining.total_joins` - # is incremented, this value will be incremented by the same amount. If - # decremented, this value cannot fall below zero. If - # `.spec.bound_keypair.joining.unlimited` is set, this value will always - # be 0 but rejoin attempts will succeed. 
- remaining_joins: 10 - - # A count of the total number of joins performed using this token. - join_count: 0 + # A count of the total number of recoveries performed using this token. + recovery_count: 0 - # The timestamp of the last successful joining or rejoining attempt, if any. - last_joined_at: null + # The timestamp of the last successful recovery attempt, if any. + last_recovered_at: null # The timestamp of the last successful keypair rotation, if any. last_rotated_at: null @@ -715,10 +707,11 @@ has a valid client certificate from a previous authentication attempt, it uses it to open an mTLS session. When Auth validates the join attempt, clients that presented an existing valid -identity are considered to be requesting a refresh rather than rejoining, -leaving the join counter untouched. Clients that do not present a valid client -certificate are considered to be rejoining and the token associated with this -public key must have `.status.bound_keypair.remaining_joins` >= 1. +identity are considered to be requesting a refresh rather than recovering, +leaving the recovery counter untouched. Clients that do not present a valid +client certificate are considered to be recovering and the token associated with +this public key must have `.status.bound_keypair.recovery_count` less than +`.spec.bound_keypair.recovery.limit`. #### Client-Side Changes in `tbot` @@ -748,7 +741,7 @@ The remaining join counter should then be exposed as a Prometheus metric to allow for alerting if a bot drops below some threshold. Importantly, this is a potentially lagging indicator. The design allows for the -join counter to be decreased (to zero) at any time, so a rejoin attempt may +join counter to be decreased (to zero) at any time, so a recovery attempt may still fail. This should be acceptable since it can also be increased after the fact to restore access if desired. @@ -779,10 +772,9 @@ to sign our challenges appropriately, and using our desired key types. 
This proposal also aims to improve the non-Terraform UX, particularly when automating with `tctl`. All regular token management workflows with -`tctl create -f` will continue to work; upserting resources to modify runtime -values will, for example, properly increase -`status.bound_keypair.remaining_joins` while preserving other token fields -like `status.bound_keypair.bound_public_key`. +`tctl edit` will continue to work; upserting resources to modify runtime +values will not interfere with bots as this is expected to be a regular part of +bot / token lifecycle. Additional `tctl` changes will include: @@ -812,7 +804,7 @@ We should take steps to improve visibility of bots at or near expiry, including: Service and via `tbot`'s metrics endpoint. In the future, we might also consider configurable cluster alerts when a bot -rejoins or has used its last attempt. This should be opt-in as this type of +recovers or has used its last attempt. This should be opt-in as this type of alert may not scale well with lots of bots. #### Keypair Rotation @@ -822,48 +814,52 @@ rotation without bot downtime. Ideally, it should be possible to initiate a rotation from either the server (e.g. by setting a `rotate_on_next_renewal` flag on the token/bot instance) or `tbot` client. -To trigger a rotation, an admin can set `.spec.bound_keypair.rotate_on_next_renewal=true` -on the bound keypair token. The value of this field will be synchronized to the -bot using the same mechanism as described above for remaining joins, which is -tied to the heartbeat interval (30m, hard coded) rather than the bot's regular -renewal interval, so it will take place on the next renewal once the request has -been synchronized. +To trigger a rotation, an admin can set `.spec.bound_keypair.rotate_after=$timestamp` +on the bound keypair token. On its next refresh attempt, the server will require +a keypair rotation as part of the usual challenge process. 
Once the previous +keypair has been validated, the client will be asked to generate a new keypair, +then repeats the challenge process to validate ownership of the new private key, +and is only then issued certificates. To perform the rotation, additional steps are taken as part of the challenge ceremony: ```mermaid sequenceDiagram - participant keystore as Local Keystore - participant bot as Bot - participant auth as Auth Server + participant keystore as Local Keystore + participant bot as Bot + participant auth as Auth Server - Note over keystore,auth: Joining secret exchange not shown - bot->>keystore: Request new keypair - keystore-->>bot: New public key - bot->>auth: Request joining challenge
with new public key as
optional parameter - auth-->>bot: Sends joining challenge + Note over keystore,auth: Initial onboarding not shown - bot->>keystore: Request signed document
with original public key - keystore-->>bot: Signed challenge document - bot->>auth: Signed challenge document - auth->>auth: Validate signed document
against bound public key + bot->>auth: Request joining challenge,
provides previous join state + auth-->>bot: Sends joining challenge
against original public key - auth-->>bot: Sends new joining challenge
for new public key - bot->>keystore: Request signed document
with new public key - keystore-->>bot: Signed challenge document
& previous join state + bot->>keystore: Request signed document
with original public key + keystore-->>bot: Signed challenge document + bot->>auth: Signed challenge document + auth->>auth: Validate signed document
against bound public key - bot->>auth: Signed challenge document - auth->>auth: Validate signed document
against new public key - auth->>auth: Commit new public key to backend,
clear rotate flag + auth->>bot: Requests new public key + bot->>keystore: Request new keypair + keystore-->>bot: New public key + bot->>auth: New public key - auth-->>bot: Signed TLS certificates
& new join state + auth-->>bot: Sends new joining challenge
for new public key + bot->>keystore: Request signed document
with new public key + keystore-->>bot: Signed challenge document + + bot->>auth: Signed challenge document + auth->>auth: Validate signed document
against new public key + auth->>auth: Commit new public key to backend + + auth-->>bot: Signed TLS certificates
& new join state ``` As shown above, the join service will return a second challenge rather than -certs if the initial request included a new public key. Certs will only be -returned on completion of the second challenge using the new public key, and the -new key will only be committed to the backend at this point. +certs if a rotation was requested server side. Certs will only be returned on +completion of the second challenge using the new public key, and the new key +will only be committed to the backend at this point. Other bot parameters remain unchanged. No new bot instance is created, the generation counter is not reset and is checked and incremented as usual. @@ -873,8 +869,8 @@ generation counter is not reset and is checked and incremented as usual. - Repairing a bot that has exhausted all of its joins is still a semi-manual process. It is significantly easier, and does not necessarily require any changes on the impacted bot node itself, but is still annoying. Users can opt - out of this by setting `.spec.joining.unlimited=true`, but this has obvious - security implications. + out of this by setting `.spec.bound_keypair.recovery.mode="relaxed"`, but this has obvious + security implications, which may be tolerable for some use cases. - Effort required to configure IaC / Terraform is still fairly high, even if reduced. @@ -1036,7 +1032,7 @@ understand the security implications. A simpler variant of N-Token Resiliency, this would allow `tbot` clients to accept an ordered list of joining token strings which could be used sequentially. If the internal identity expires, the next token in the list will -be used to attempt a rejoin. +be used to attempt a recovery. This may be interesting for users with workload-critical bots wishing to hedge against an outage in a delegated join method's IdP. With Workload ID being used @@ -1070,7 +1066,7 @@ tokens a bot would receive, thus the name. 
This idea still has some merit but we realized this can largely be simplified into the bind-on-join flow described above. Multiple secrets mainly served to -constrain credential reuse by limiting the number of possible rejoins until a +constrain credential reuse by limiting the number of possible recoveries until a human has to take some action. Bound keypair joining replaces the secrets with a join counter, and allows for