add self-repair for malformed instance certs by fspmarshall · Pull Request #41467 · gravitational/teleport

fspmarshall · 2024-05-13T15:01:58Z

Fixes an issue where mix-and-match of join tokens with different system role permissions over time would cause the instance certificate to be malformed. This could lead to various issues, including services not showing up in instance heartbeats, or service-level heartbeats failing to be emitted. The current fix works by having agents prove which system roles they hold via assertions in order to get their primary instance cert reissued. This is the same mechanism by which the instance certs were originally generated back in v10, and has been ported forward and modified to work within our existing cert reissue framework.
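The detection half of the self-repair idea can be pictured as a set difference: compare the system roles listed on the instance cert with the roles the agent actually holds, and ask for a reissue if any are missing. The sketch below is only an illustration of that comparison. `SystemRole` is a local stand-in for Teleport's `types.SystemRole`, `missingRoles` is a hypothetical helper, and the role names are example values; none of this is the code from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// SystemRole is a local stand-in for Teleport's types.SystemRole (assumption).
type SystemRole string

// missingRoles is a hypothetical helper. It returns, in sorted order, the
// roles the agent holds locally that are absent from the instance cert.
func missingRoles(instanceCertRoles, heldRoles []SystemRole) []SystemRole {
	onCert := make(map[SystemRole]struct{}, len(instanceCertRoles))
	for _, r := range instanceCertRoles {
		onCert[r] = struct{}{}
	}
	var missing []SystemRole
	for _, r := range heldRoles {
		if _, ok := onCert[r]; !ok {
			missing = append(missing, r)
		}
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	return missing
}

func main() {
	// An agent that first joined with a Node-only token and later picked up
	// App and Db identities from other tokens (example values).
	cert := []SystemRole{"Node"}
	held := []SystemRole{"Node", "App", "Db"}
	fmt.Println(missingRoles(cert, held)) // prints [App Db]
}
```

A non-empty result is the condition under which an agent would need to assert the missing roles to get its instance cert reissued.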

In the long run, a more well-defined and structured model for token mix-and-match will likely be needed in order to properly support future features like scoped RBAC and static label assignments. This PR initially sidestepped the issue by disallowing mix-and-match in v16 onwards, but after some discussion that has been deemed too drastic a change. The most realistic alternative I know of at the time of writing would be to always require that new tokens also grant all old roles, and to fully regenerate the agent's identity based on the new token. This has the upside of mostly preserving existing behavior and not requiring users to reset their data directories, but it also presents some confusing edge cases, such as what to do about identities related to services that aren't currently active on the agent but were in the past.

Related: https://github.com/gravitational/teleport.e/pull/4240

Fixes: #38977

changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats.

GavinFrazar

can we add a test that simulates an older agent starting up with/without new roles as well, i.e. where the initial local version state is not set?

GavinFrazar · 2024-05-20T21:24:11Z

I just want to make sure I understand this right:

1. The first check in this func is for non-"instance" roles (i.e. RoleDiscovery, RoleDatabase, etc.). It needs to wait for the "instance" connector to be broadcast.
2. This check is for the "instance" role registration itself, so we look to see if we already have an instance identity stored.
3. If we do have an instance identity already, assert that we either have all the instance roles in that identity OR that the identity is older than v16.
4. If we don't have an identity already, then this is a first-time connect and we will write the Teleport version in "InitialLocalVersion".
5. Finally, the instance connector is made available to 1. and we assert again that the role is present in the instance identity.

orca-security-us

Orca Security Scan Summary

Status	Check	Issues by priority
Passed	Infrastructure as Code	0 0 0 0	View in Orca
Failed	Secrets	1 0 0 0	View in Orca
Passed	Vulnerabilities	0 0 0 0	View in Orca

🔑 The following Secrets have been detected in your pull request across all commits

⚠️ Please take action to mitigate the risk of the identified secrets by revoking them, and if already in use, updating all dependent systems

	NAME	FILE	LINE NUM	COMMIT
	PEM File With Private Key	...uth/webproxy_key.pem	1	`f66cc33`	View in code

fspmarshall · 2024-05-24T04:19:49Z

"secrets" in the above scan result are self-signed certs used by a test.

webvictim · 2024-05-28T12:18:23Z

As someone who's been on the rough end of user experience with Teleport's join tokens for a while (and is particularly aware of the customer frustration/confusion which can exist around them), I wanted to suggest that it would be nicer if we allow the mix-and-match behaviour.

To be clear; I always tell people that they should use a join token which is valid for the full set of services that an agent provides. However, people don't always take advice, and the fact that things have just worked in this circumstance historically means that users have pre-set expectations. They don't understand the difference between an App cert and an Instance cert, and nor should they.

It sounds like there is an explicit requirement here for the agent's storage to be cleared so a new instance certificate can be issued? This will cause confusion and frustration for anyone who has mixed and matched tokens in the past and expects things to work. IMO, we should definitely make agents just "do the right thing" and automatically handle Instance cert regeneration/reissue as long as the agent has separate certs for each service it provides.

Also - I don't know whether we still advertise teleport app start and other similar commands as a viable strategy for adding more services to a running agent, but this seems in direct opposition to the spirit of this PR.

Related (for a different, but no less frustrating reason): #2838

tigrato · 2024-05-28T13:30:58Z

@fspmarshall

I have concerns about the current form of this PR and its impact on the product.

First, agents installed via the teleport-kube-agents Helm chart and running in Kubernetes store their credentials in Kubernetes secrets following the pattern {pod-name}-state. In contrast, agents running on nodes store their credentials in /var/lib/teleport. It's not as simple as saying "clean up all /var/lib/teleport folders for all agents/replicas you wish to update". Credentials are crucial in Teleport, yet it is not transparent how and when agents store them to disk or secrets. Enforcing a complete data wipe introduces other issues, starting with the server-id. The server-id is vital in SSH as it allows dialing to remote hosts using the server ID instead of relying on the host's name or hostname. Forcing users to change the server-id can disrupt automations if a specific host ID is hardcoded.

Second, adding or removing services from an agent configuration is a common practice. Users often begin with limited knowledge of Teleport, implementing and adopting it incrementally. They might start with Kubernetes and later expand to application or database services to expose assets running in Kubernetes. This incremental adoption is standard and should not be hindered. The effort required to expand a service, especially when customers purchase new licenses for additional protocols or request trials, would be so great that it would discourage implementation. This could be a significant drawback for larger customers who expand their usage regularly by purchasing new protocols or features after a successful rollout.

Instead, we should allow agents to exchange all their secrets for a new instance certificate valid for the appropriate services. By exchanging all secrets, I mean the agents should call an endpoint where, through a cryptographic challenge, they prove possession of the cert-key pair for a given role and a token for the new role they wish to adopt. After Auth validates the possession of the certificates for the current enabled services and the new token, it will issue a new instance certificate valid for all requested roles. This approach is more sustainable in the long term because:

It allows users to expand their current usage seamlessly.
Inventory will consistently display the correct instance profiles.
In the future, it will enable hot config reloads to activate new services with minimal disruption.

This strategy provides a better framework for scalability and flexibility, accommodating the evolving needs of our users without significant overhead or disruption.

When a service is disabled and moved to another service, Teleport should replace the stored cert-key pairs with a new instance certificate that has the necessary permissions for the remaining services. This ensures that only the required credentials are stored, enhancing security. Instead of retaining the cert-key pairs of all previously enabled services, we should transition to a more secure model where a new, less privileged instance certificate is issued. This certificate should be sufficient to meet the needs of the remaining active services without retaining unnecessary credentials.

fspmarshall · 2024-05-28T22:16:10Z

Based on feedback from @tigrato and @webvictim, and some supplementary discussions elsewhere, this PR has been modified to no longer reject new mix-and-match attempts. The self-repair logic should fix the existing issues caused by mix-and-match. Future features may still require that we formalize a more specific model of how mix-and-match should work (and possibly limit the cases where it is permitted), but for now we are going to preserve mix-and-match ability.

rosstimothy

Approving to not block merging with the assumption that s/instance-assets/testdata happens prior to merging

public-teleport-github-review-bot · 2024-05-29T23:26:35Z

@fspmarshall See the table below for backport results.

Branch	Result
branch/v13	Failed
branch/v14	Failed
branch/v15	Failed
branch/v16	Create PR

ravicious · 2025-05-20T11:42:30Z

+	// behaves equivalent to instanceRoles except that while instance roles are static assignments
+	// set up when the teleport process starts, hosted plugin roles are dynamically assigned by
+	// runtime configuration, and may not necessarily be present on the instance cert.
+	hostedPluginRoles map[types.SystemRole]string


I'm working on Intune support in addition to Jamf. The Jamf integration can be currently enabled as either a standalone service or a hosted plugin.

In the context of this, I have trouble understanding the purpose of hostedPluginRoles. Unlike instanceRoles, its values are never read. Its usage seems to boil down to a hosted plugin calling service.TeleportProcess.SetExpectedHostedPluginRole in its init function, after which the check introduced in this PR tests whether hostedPluginRoles has a key for the given system role.

teleport/lib/service/service.go

Lines 3325 to 3330 in 0c96c72

	if role.IsLocalService() && !process.instanceRoleExpected(role) && !process.hostedPluginRoleExpected(role) {
		// if you hit this error, your probably forgot to call SetExpectedInstanceRole inside of
		// the registerExpectedServices function, or forgot to call SetExpectedHostedPluginRole during
		// the hosted plugin init process.
		process.logger.ErrorContext(process.ExitContext(), "Register called for unexpected instance role (this is a bug).", "role", role)
	}

What would I need to do if I wanted to run two hosted plugins at the same time (one for Jamf, one for Intune) where both plugins would use the same system role? Is it something just inherently unsupported and instead there should be a single hosted plugin that handles both MDMs?

I talked with Sakshyam and I think I understand it a little bit better now.

startJamfService is called both from the Jamf service and the Jamf hosted plugin. That function then calls process.RegisterWithAuthServer, which triggers the check that's embedded in my previous comment. That's why the plugin needs to call SetExpectedHostedPluginRole.

So if I added a hosted plugin for Intune, it could simply not call SetExpectedHostedPluginRole I guess? But then I also wonder why startJamfService is used for both the service and the hosted plugin, couldn't the plugin avoid calling RegisterWithAuthServer? 🤔

After some further discussion, it might be due to the access checker looking for the MDM role, as suggested by Sakshyam.
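If the check really is key-presence only, as described above, then hostedPluginRoles behaves like a set, and two plugins marking the same system role as expected would not conflict. The sketch below illustrates that reading with a stripped-down stand-in for service.TeleportProcess. The struct, the second parameter of `SetExpectedHostedPluginRole`, and the "MDM" role value are assumptions for illustration, not the actual Teleport definitions.

```go
package main

import "fmt"

// SystemRole is a local stand-in for Teleport's types.SystemRole (assumption).
type SystemRole string

// process is a hypothetical, stripped-down stand-in for service.TeleportProcess.
type process struct {
	hostedPluginRoles map[SystemRole]string
}

// SetExpectedHostedPluginRole marks a role as expected for a hosted plugin.
// The meaning of the stored string is an assumption; this sketch never reads it.
func (p *process) SetExpectedHostedPluginRole(role SystemRole, name string) {
	p.hostedPluginRoles[role] = name
}

// hostedPluginRoleExpected only tests whether the key is present.
func (p *process) hostedPluginRoleExpected(role SystemRole) bool {
	_, ok := p.hostedPluginRoles[role]
	return ok
}

func main() {
	p := &process{hostedPluginRoles: make(map[SystemRole]string)}
	const mdm SystemRole = "MDM" // example role value

	fmt.Println(p.hostedPluginRoleExpected(mdm)) // prints false

	// Two plugins sharing one system role: the second call overwrites the
	// stored value, but the presence check is unaffected.
	p.SetExpectedHostedPluginRole(mdm, "jamf")
	p.SetExpectedHostedPluginRole(mdm, "intune")

	fmt.Println(p.hostedPluginRoleExpected(mdm), len(p.hostedPluginRoles)) // prints true 1
}
```

Under these assumptions a second plugin could call the setter again or skip it; either way the registration check would pass once any plugin has marked the role.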

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 342d525 to 9302729 Compare May 15, 2024 18:04

fspmarshall marked this pull request as ready for review May 15, 2024 18:05

github-actions Bot requested review from GavinFrazar and timothyb89 May 15, 2024 18:05

github-actions Bot added the size/md label May 15, 2024

timothyb89 reviewed May 18, 2024

Comment thread lib/service/connect.go

Comment thread lib/service/connect.go

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 9302729 to 36792f5 Compare May 20, 2024 16:06

fspmarshall requested a review from timothyb89 May 20, 2024 16:14

GavinFrazar reviewed May 20, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 36792f5 to d521ba0 Compare May 22, 2024 23:13

orca-security-us Bot reviewed May 24, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from f66cc33 to 5af7a4e Compare May 28, 2024 19:42

orca-security-us Bot reviewed May 28, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 5af7a4e to 59f4482 Compare May 28, 2024 21:40

orca-security-us Bot reviewed May 28, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 59f4482 to 7024909 Compare May 28, 2024 22:08

fspmarshall requested a review from GavinFrazar May 28, 2024 22:16

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7024909 to 563e7c3 Compare May 28, 2024 22:55

fspmarshall added backport/branch/v14 labels May 28, 2024

fspmarshall changed the title ~~add self-repair for malformed instance certs and explicitly disallow future mix-and-match of join tokens~~ add self-repair for malformed instance certs May 28, 2024

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch 2 times, most recently from 71c75bf to afa3d8e Compare May 28, 2024 23:45

timothyb89 approved these changes May 29, 2024

rosstimothy added the backport/branch/v16 label May 29, 2024

rosstimothy linked an issue May 29, 2024 that may be closed by this pull request

Upgrading from 14.3.18 to 14.3.20 broke configured agent applications #42040

Closed

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 7e0f143 to 4f1fc65 Compare May 29, 2024 18:53

rosstimothy approved these changes May 29, 2024

self-repair instance certs

8228c07

fspmarshall force-pushed the fspmarshall/reject-new-partial-tokens branch from 4f1fc65 to 8228c07 Compare May 29, 2024 22:47

fspmarshall added the backport/branch/v13 label May 29, 2024

fspmarshall enabled auto-merge May 29, 2024 22:58

fspmarshall added this pull request to the merge queue May 29, 2024

Merged via the queue into master with commit f198694 May 29, 2024

fspmarshall deleted the fspmarshall/reject-new-partial-tokens branch May 29, 2024 23:24

fspmarshall added a commit that referenced this pull request May 30, 2024

self-repair instance certs (#41467)

da4da27

fspmarshall added a commit that referenced this pull request May 30, 2024

self-repair instance certs (#41467)

928de43

This was referenced May 30, 2024

[v13] self-repair instance certs #42187

Merged

[v14] self-repair instance certs #42188

Merged

[v15] self-repair instance certs #42189

Merged

[v16] self-repair instance certs #42190

Merged

github-merge-queue Bot pushed a commit that referenced this pull request May 31, 2024

self-repair instance certs (#41467) (#42188)

5d2291d

github-merge-queue Bot pushed a commit that referenced this pull request May 31, 2024

self-repair instance certs (#41467) (#42189)

c3842e3

This was referenced May 31, 2024

tctl inventory ls shows incorrect services for host when multiple Teleport instances run #41244

Closed

Discrepancy between Instance system roles and configuration #13788

Closed

fspmarshall added a commit that referenced this pull request May 31, 2024

self-repair instance certs (#41467)

1677f08

github-merge-queue Bot pushed a commit that referenced this pull request May 31, 2024

self-repair instance certs (#41467) (#42187)

4f28cba

fspmarshall mentioned this pull request Aug 30, 2024

fix local re-register #46087

Merged

ravicious reviewed May 20, 2025

programmerq mentioned this pull request Sep 2, 2025

Update procedure to reflect self-repair capability when adding a new instance role #58612

Open

milos-teleport mentioned this pull request Sep 11, 2025

Adding the App agent role via a new token requires and extra restart for the agent to serve apps #59013

Open
