Conversation
🤖 Vercel preview here: https://docs-551l07gpx-goteleport.vercel.app/docs/ver/preview
1. Verify that you have added the label by running the following command on your local computer. Your Teleport user must be authorized to access the server. Teleport
So anyone who can access a server can update ephemeral labels? Is this concerning?
You only need access to the server to verify that the labels were added.
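The verification command itself is cut off in the quoted snippet above, so it is left elided here. Purely as an illustration, filtering the servers your user can access by a label with `tsh` looks like this (the label name and value are hypothetical):

```shell
# List servers your Teleport user can access that carry env=dev
# (hypothetical label; requires an active tsh login to the cluster).
tsh ls env=dev
```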
🤖 Vercel preview here: https://docs-p9prrcm3h-goteleport.vercel.app/docs/ver/preview
Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose its resource-based labels until the Auth Service applies them again. To update resource-based labels,
Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose
its resource-based labels until the Auth Service applies them again.
This is not a docs question but rather an implementation detail which I either missed or forgot about, but I'm not sure I understand why this is the case. Aren't these labels supposed to be applied on every heartbeat? If there's a period of time when not all labels are present on the server, doesn't that mean they effectively can't be used in RBAC "deny" rules?
The auth server has to send the labels to the node over the inventory control stream before the node can report them in future heartbeats. So yes, they shouldn't be used for deny rules.
If the auth server has to send them to the node, why can't it merge them with the labels reported by the node at the time it receives the node's heartbeat? I understand that in the former case the auth server can control how fast it sends them out, but if the ServerInfo resources are cached, applying them to heartbeats wouldn't increase the load, would it?
@fspmarshall Do you recall why it was implemented this way? I'm quite worried that with the current implementation we just give users the ability to shoot themselves in the foot: now you have to track which labels can or can't be used in "deny" rules. At the very least, I think we should not allow connecting to servers that haven't yet received all their labels.
cc @rosstimothy
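The merge being floated here is cheap in isolation. A hedged sketch in Python (not Teleport's actual implementation; all names are invented) of overlaying cached ServerInfo labels onto the labels a node reports in its heartbeat:

```python
def merge_heartbeat_labels(heartbeat_labels: dict, server_info_labels: dict) -> dict:
    """Overlay dynamic (ServerInfo) labels onto node-reported labels.

    ServerInfo labels win on key conflicts, treating the auth server as
    the source of truth for dynamically applied labels.
    """
    merged = dict(heartbeat_labels)
    merged.update(server_info_labels)
    return merged

# The auth server would call this on each heartbeat it receives, reading
# the ServerInfo from its cache so no extra backend read is needed.
print(merge_heartbeat_labels({"env": "dev"}, {"team": "core"}))
# → {'env': 'dev', 'team': 'core'}
```

The open question in the thread is not the merge itself but when it runs and what load the ServerInfo reads add, which the replies below dig into.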
@r0mant This is a pretty tricky feature to track the history of because discussion was spread out all over the place. I think #21079 (comment) is perhaps the most relevant bit, but everything is really spread out.
The tl;dr as I remember it is something like this:
- Loading all ServerInfos immediately, either on startup (i.e. cache init) or on first heartbeat, would be bad since that had the potential to effectively double the amount of load experienced when restarting a teleport cluster (something that is already problematically intensive and one of our biggest scaling challenges).
- None of the solutions proposed for having nodes refuse connections until all labels had been discovered seemed practical (see linked discussion).
- It's doubtful that there even is a sane way to decide whether or not a node will acquire dynamically applied labels, short of requiring that users categorically separate nodes into statically and dynamically labeled groups, either via config file options or by using separate join tokens.
I'm quite a bit worried that with current implementation we just give users ability to shoot themselves in the foot, now you have to track which labels can or can't be used in "deny" rules.
Oof, I thought we had addressed it, but I don't see it looking back at the code. I think the fact that we don't have a better story around this is my fault, I'm sorry to say. I was pushing for categorically disallowing dynamic label usage in deny rules early on (#21079 (comment), #21079 (comment)), but it looks like that point got lost in the shuffle when that issue was closed. I was too focused on the scalability discussion in the follow-up RFD and failed to re-raise the security issue. Fixing this retroactively without breaking anyone's stuff might be a little tricky 😬
Update: I've been giving the deny rule issue more thought. Here's where I'm at currently:
Deprecating dynamic labels in deny rules at this point seems like it would be pretty messy. It's the better fix, but I think that ship has sailed. Unless anyone has a new idea not covered during the various other issues/discussions related to this, I think label discovery is going to stay lazy, and whether or not a given agent might have dynamically applied labels in the future is likely to continue to be undecidable in most cases. Given that, my suggestion would be to do one or more of the following:
- Add a warning to role creation/update of the form: "Label <label-name> appears to be a cloud label. Use of cloud labels in deny rules is not recommended due to their dynamic nature. Cloud labels are discovered asynchronously, and agents remain accessible even if teleport failed to discover corresponding cloud labels."
- Add the ability to opt into requiring cloud labels for specific agents statically, either via the agent's own configuration, or via attributes applied to its joining method (e.g. `tctl token add --type=node --require-cloud-labels=yes`). Basically, have some mechanism of explicitly marking those agents that we know will have cloud labels eventually. That subset of nodes that joined via discovery could then start pre-configured to require cloud labels, without us breaking nodes that might have cloud labels but were initially added to the cluster via other means (and therefore might not have cloud labels known to the auth server yet).
- Cache cloud labels on the agent's disk so that nodes can use "last known cloud labels" when restarting, reducing the total amount of time spent without cloud labels (not a fix, just a mitigation).
@fspmarshall This is not just about cloud labels; in #34008 we basically added the ability for users to set these labels server-side by creating a ServerInfo resource.
I don't know if caching these labels on the node helps much tbh, because we'll still have this problem when, say, labels are updated while the node is restarted, down, or re-joining with an empty data dir.
Instead of (or in addition to) merging the labels during heartbeats, can't we do that at connect time, on-demand? I.e. when connecting to a server, pull its ServerInfo if there is one, and see if there are labels there that aren't on the server? Then it wouldn't put more strain on the cache during startup.
Ah, didn't realize user-created dynamic labels had landed (jumped into this comment thread without having full context 😬). That does complicate things, especially since, at least from a quick glance, it looks like user-created dynamic labels aren't constrained to a namespace the way cloud labels are, so we can't tell which labels might or might not be dynamic by looking at roles/nodes. Also, as far as I can tell, there isn't any mechanism in place for cleanup, so user-created labels apply until the node restarts even if the ServerInfo is deleted, which is its own kind of problem.
Loading server infos at the time when an access-control decision is needed is definitely appealing in terms of mitigating load on restart, though an additional round trip to auth for every access-control decision wouldn't be ideal. We have a lot of users requesting that we do whatever we can to reduce resource access latency. Also, if the point of doing that would be to make using deny rules with dynamic labels sane, then we'd really need to also apply dynamic labels to all heartbeats, and that means the benefits versus traditional caching might be minimal (it might even be worse, since presumably most backends are more performant per-item read when serving range requests versus individual reads, hard to say for sure without testing).
I've been running through various scenarios for the past couple hours, and I don't see any good ways to have our cake and eat it too here. I'll keep mulling it over, but right now I'm thinking that we're going to end up needing to pick a direction and commit to it:
- Lazy dynamic labels that shouldn't be used for deny rules (likely requiring some trickery to reliably detect and warn about use of dynamic labels in deny rules now that user-created labels have landed and they are un-namespaced).
- Always consistent labels that significantly increase peak load when in general use (with the risk of serious production outages if anyone currently making liberal use of dynamic labels is also pushing their backend to its limit).
I lean toward option 1. Not only for itself, but because I find the precedent set by option 2 scary. There's plenty of room in teleport for more lazy features that can be safely rate-limited without breaking their contracts. There is very little room for sensitive cross-resource synchronization in the "hot path", so to speak.
Alright, we've had some out of band discussions and come up with a gameplan:
We're going to continue to use the existing system of propagating dynamic labels lazily to agents. We will fix the security issue by disallowing the use of dynamic labels in deny rules for all new roles that get created. User-created dynamic labels just landed and no one is actually using the feature yet, so we're going to make a breaking change to that feature prior to publishing documentation of its use. We'll add a namespace prefix to all user-created dynamic labels (e.g. dynamic/<label>). This will allow us to statically determine which deny rules are trying to use dynamic labels by simply checking all labels to see if they start with any of the dynamic label prefix values. Newly created roles doing this will be rejected. Existing roles doing this (hopefully there aren't any, but you never know) will generate a warning, or possibly a cluster alert. This will be fairly easy to check with namespacing, since we'll be able to just load all roles and iterate over them.
We will reserve the right to start allowing dynamic labels in deny rules in the future (which won't be a breaking change since it will be allowing a previously disallowed configuration), but won't implement it unless we can come up with a plan to do so without breaking existing clusters that both have limited backend/network capacity and a large number of existing dynamic labels.
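Once labels are namespaced, the static check described in the gameplan is a simple prefix test. A hedged sketch in Python (the `dynamic/` prefix is the one proposed in this thread; every other name here is invented, not Teleport's real code):

```python
# Prefixes that mark a label key as dynamically applied. The plan above
# only commits to "dynamic/"; cloud-label prefixes could be added too.
DYNAMIC_LABEL_PREFIXES = ("dynamic/",)

def uses_dynamic_labels(deny_labels: dict) -> bool:
    """True if any label key in a deny rule carries a dynamic-label prefix."""
    return any(
        key.startswith(prefix)
        for key in deny_labels
        for prefix in DYNAMIC_LABEL_PREFIXES
    )

def validate_new_role(deny_labels: dict) -> None:
    # Per the plan: new roles are rejected outright; existing roles
    # would instead generate a warning or a cluster alert.
    if uses_dynamic_labels(deny_labels):
        raise ValueError("dynamic labels may not be used in deny rules")

validate_new_role({"env": "prod"})             # accepted
# validate_new_role({"dynamic/team": "core"})  # would raise ValueError
```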
🤖 Vercel preview here: https://docs-n5pvthnb1-goteleport.vercel.app/docs/ver/preview
Friendly ping @r0mant
@atburke Just checking to see if we're still going ahead with this one. Thanks!
Force-pushed 0f676e1 to 92bf7e0 (Compare)
server_info-based labels
🤖 Vercel preview here: https://docs-rko9yl6a7-goteleport.vercel.app/docs/ver/preview
r0mant left a comment
A couple of comments, but after that it should be good to go from my perspective.
(!docs/pages/includes/tctl.mdx!)
## Step 1/3. Install Teleport
### Install Teleport
I'm curious as to why we're turning these "Step ..." H2s into H3s of the Prerequisites section. I would keep them as they were and add a Step 4/4. for resource-based labels. Otherwise, the document structure is a little jarring to me.
This was a response to this comment (although making them H3s was just a mistake on my part). I've updated the docs with a hybrid approach that preserves the structure in the prereqs but allows the label sections to be done in any order; let me know what you think.
🤖 Vercel preview here: https://docs-46rzx12u6-goteleport.vercel.app/docs/ver/preview
r0mant left a comment
lgtm, but I have a question about the "si" prefix
```yaml
# server_info.yaml
kind: server_info
metadata:
  name: si-<node-name>
```
Hmm do we require the si- prefix? What's the reasoning for it? It's a little weird.
It's to differentiate server infos matched by name from server infos matched in other ways, namely aws-<account-id>-<instance-id>.
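For context, a fuller version of the resource under discussion might look like the sketch below. The spec field name and the label key/value are assumptions for illustration only; this thread does not confirm them:

```yaml
# server_info.yaml — hypothetical expanded example. Labels listed here
# are pushed to the matching node by the Auth Service over the
# inventory control stream (lazily, per the discussion above).
kind: server_info
metadata:
  name: si-<node-name>   # "si-" prefix: matched against the node's name
spec:
  new_labels:            # assumed field name for the labels to apply
    team: core
```

If the field name holds, the resource would be created with `tctl create server_info.yaml`.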
This change adds documentation for labels created with the server_info resource.