add self-repair for malformed instance certs#41467
Conversation
342d525 to
9302729
Compare
9302729 to
36792f5
Compare
GavinFrazar
left a comment
There was a problem hiding this comment.
can we add a test that simulates an older agent starting up with/without new roles as well, i.e. where the initial local version state is not set?
There was a problem hiding this comment.
I just want to make sure I understand this right:
- the first check in this func is for non "instance" roles, i.e.
RoleDiscovery,RoleDatabaseetc.). It needs to wait for the "instance" connector to be broadcast. - this check is for the "instance" role registration itself, so we look to see if we already have an instance identity stored
- if we do have an instance identity already, assert that we either have all the instance roles in that identity OR that the identity is older than v16
- if we dont have an identity already, then this is first time connect and we will write the teleport version in "InitialLocalVersion"
- finally, the instance connector is made available to 1. and we assert again that the role is present in the instance identity
36792f5 to
d521ba0
Compare
There was a problem hiding this comment.
Orca Security Scan Summary
| Status | Check | Issues by priority | |
|---|---|---|---|
| Infrastructure as Code | View in Orca | ||
| Secrets | View in Orca | ||
| Vulnerabilities | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
| NAME | FILE | LINE NUM | COMMIT | ||
|---|---|---|---|---|---|
| PEM File With Private Key | ...uth/webproxy_key.pem | 1 | f66cc33 | View in code |
|
"secrets" in the above scan result are self-signed certs used by a test. |
|
As someone who's been on the rough end of user experience with Teleport's join tokens for a while (and is particularly aware of the customer frustration/confusion which can exist around them), I wanted to suggest that it would be nicer if we allow the mix-and-match behaviour. To be clear; I always tell people that they should use a join token which is valid for the full set of services that an agent provides. However, people don't always take advice, and the fact that things have just worked in this circumstance historically means that users have pre-set expectations. They don't understand the difference between an App cert and an Instance cert, and nor should they. It sounds like there is an explicit requirement here for the agent's storage to be cleared so a new instance certificate can be issued? This will cause confusion and frustration for anyone who has mixed and matched tokens in the past and expects things to work. IMO, we should definitely make agents just "do the right thing" and automatically handle Instance cert regeneration/reissue as long as the agent has separate certs for each service it provides. Also - I don't know whether we still advertise Related (for a different, but no less frustrating reason): #2838 |
|
I have concerns about the current form of this PR and its impact on the product. First, agents installed via the Second, adding or removing services from an agent configuration is a common practice. Users often begin with limited knowledge of Teleport, implementing and adopting it incrementally. They might start with Kubernetes and later expand to application or database services to expose assets running in Kubernetes. This incremental adoption is standard and should not be hindered. The effort required to expand a service, especially when customers purchase new licenses for additional protocols or request trials, would be so great that it would discourage implementation. This could be a significant drawback for larger customers who expand their usage regularly by purchasing new protocols or features after a successful rollout. Instead, we should allow agents to exchange all their secrets for a new instance certificate valid for the appropriate services. By exchanging all secrets, I mean the agents should call an endpoint where, through a cryptographic challenge, they prove possession of the cert-key pair for a given role and a token for the new role they wish to adopt. After Auth validates the possession of the certificates for the current enabled services and the new token, it will issue a new instance certificate valid for all requested roles. This approach is more sustainable in the long term because:
This strategy provides a better framework for scalability and flexibility, accommodating the evolving needs of our users without significant overhead or disruption. When a service is disabled and moved to another service, Teleport should replace the stored cert-key pairs with a new instance certificate that has the necessary permissions for the remaining services. This ensures that only the required credentials are stored, enhancing security. Instead of retaining the cert-key pairs of all previously enabled services, we should transition to a more secure model where a new, less privileged instance certificate is issued. This certificate should be sufficient to meet the needs of the remaining active services without retaining unnecessary credentials. |
f66cc33 to
5af7a4e
Compare
There was a problem hiding this comment.
Orca Security Scan Summary
| Status | Check | Issues by priority | |
|---|---|---|---|
| Infrastructure as Code | View in Orca | ||
| Secrets | View in Orca | ||
| Vulnerabilities | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
| NAME | FILE | LINE NUM | COMMIT | ||
|---|---|---|---|---|---|
| PEM File With Private Key | ...uth/webproxy_key.pem | 1 | 5af7a4ea0 | View in code |
5af7a4e to
59f4482
Compare
There was a problem hiding this comment.
Orca Security Scan Summary
| Status | Check | Issues by priority | |
|---|---|---|---|
| Infrastructure as Code | View in Orca | ||
| Secrets | View in Orca | ||
| Vulnerabilities | View in Orca |
🔑 The following Secrets have been detected in your pull request across all commits
| NAME | FILE | LINE NUM | COMMIT | ||
|---|---|---|---|---|---|
| PEM File With Private Key | ...uth/webproxy_key.pem | 1 | 59f4482a3 | View in code |
59f4482 to
7024909
Compare
|
Based on feedback from @tigrato and @webvictim, and some supplementary discussions elsewhere, this PR has been modified to no longer reject new mix-and-match attempts. The self-repair logic should fix the existing issues caused by mix-and-match. Future features may still require that we formalize a more specific model of how mix-and-match should work (and possibly limit the cases where it is permitted), but for now we are going to preserve mix-and-match ability. |
7024909 to
563e7c3
Compare
71c75bf to
afa3d8e
Compare
7e0f143 to
4f1fc65
Compare
rosstimothy
left a comment
There was a problem hiding this comment.
Approving to not block merging with the assumption that s/instance-assets/testdata happens prior to merging
4f1fc65 to
8228c07
Compare
|
@fspmarshall See the table below for backport results.
|
| // behaves equivalent to instanceRoles except that while instance roles are static assignments | ||
| // set up when the teleport process starts, hosted plugin roles are dynamically assigned by | ||
| // runtime configuration, and may not necessarily be present on the instance cert. | ||
| hostedPluginRoles map[types.SystemRole]string |
There was a problem hiding this comment.
I'm working on Intune support in addition to Jamf. The Jamf integration can be currently enabled as either a standalone service or a hosted plugin.
In the context of this, I have trouble understanding the purpose of hostedPluginRoles. Unlike instanceRoles, its values are never read. It's usage seems to boil down to a hosted plugin calling service.TeleportProcess.SetExpectedHostedPluginRole in its init function and then the check introduced in this PR looks at hostedPluginRoles if it has the given key that corresponds to the system role.
teleport/lib/service/service.go
Lines 3325 to 3330 in 0c96c72
What would I need to do if I wanted to run two hosted plugins at the same time (one for Jamf, one for Intune) where both plugins would use the same system role? Is it something just inherently unsupported and instead there should be a single hosted plugin that handles both MDMs?
There was a problem hiding this comment.
I talked with Sakshyam and I think I understand it a little bit better now.
startJamfService is called both from the Jamf service and the Jamf hosted plugin. That function then calls process.RegisterWithAuthServer, which triggers the check that's embedded in my previous comment. That's why the plugin needs to call SetExpectedHostedPluginRole.
So if I added a hosted plugin for Intune, it could simply not call SetExpectedHostedPluginRole I guess? But then I also wonder why startJamfService is used for both the service and the hosted plugin, couldn't the plugin avoid calling RegisterWithAuthServer? 🤔
There was a problem hiding this comment.
After some further discussion, it might be due to the access checker looking for the MDM role, as suggested by Sakshyam.
Fixes an issue where mix-and-match of join tokens with different system role permissions over time would cause the instance certificate to be malformed. This could lead to various issues, including services not showing up in instance heartbeats, or service-level heartbeats failing to be emitted. The current fix works by having agents prove which system roles they hold via assertions in order to get their primary instance cert reissued. This is the same mechanism by which the instance certs were originally generated back in v10, and has been ported forward and modified to work within our existing cert reissue framework.
In the long run, a more well-defined and structured model for token mix-and-match will likely be needed in order to properly support future features like scoped RBAC and static label assignments. This PR initially sidestepped the issue by disallowing mix-and-match in v16 onwards, but after some discussion that has been deemed too drastic of a change. The most realistic alternative I know of at the time of writing would be to start always requiring that that new tokens also grant all old roles, and fully regenerating the agent's identity based on the new token. This has the upside of mostly preserving existing behavior and not requiring users to reset their data directories, but also presents some confusing edge-cases, such as what to do about identities related to services that aren't currently active on the agent, but were in the past.
Related: https://github.com/gravitational/teleport.e/pull/4240
Fixes: #38977
changelog: fixed an issue where mix-and-match of join tokens could interfere with some services appearing correctly in heartbeats.