Conversation
🤖 Vercel preview here: https://docs-551l07gpx-goteleport.vercel.app/docs/ver/preview
1. Verify that you have added the label by running the following command on your local computer. Your Teleport user must be authorized to access the server. Teleport
So anyone who can access a server can update ephemeral labels? Is this concerning?
You only need access to the server to verify that the labels were added.
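The verification command itself is cut off in the quoted snippet above, so it is left elided here. Purely as an illustration, filtering the servers your user can access by a label with `tsh` looks like this (the label name and value are hypothetical):

```shell
# List servers your Teleport user can access that carry env=dev
# (hypothetical label; requires an active tsh login to the cluster).
tsh ls env=dev
```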
🤖 Vercel preview here: https://docs-p9prrcm3h-goteleport.vercel.app/docs/ver/preview
Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose its resource-based labels until the Auth Service applies them again. To update resource-based labels,
Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose
its resource-based labels until the Auth Service applies them again.
This is not a docs question but rather an implementation detail which I either missed or forgot about, but I'm not sure I understand why this is the case. Aren't these labels supposed to be applied on every heartbeat? If there's a period of time when not all labels are present on the server, doesn't that mean they effectively can't be used in RBAC "deny" rules?
The auth server has to send the labels to the node over the inventory control stream before the node can report them in future heartbeats. So yes, they shouldn't be used for deny rules.
If the auth server has to send them to the node, why can't it merge them with the labels reported by the node at the time it receives the node's heartbeat? I understand that in the former case the auth server can control how fast it sends them out, but if the ServerInfo resources are cached, applying them to heartbeats wouldn't increase the load, would it?
@fspmarshall Do you recall why it was implemented this way? I'm quite worried that with the current implementation we just give users the ability to shoot themselves in the foot: now you have to track which labels can or can't be used in "deny" rules. At the very least, I think we should not allow connecting to servers that haven't yet received all their labels.
cc @rosstimothy
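The merge being floated here is cheap in isolation. A hedged sketch in Python (not Teleport's actual implementation; all names are invented) of overlaying cached ServerInfo labels onto the labels a node reports in its heartbeat:

```python
def merge_heartbeat_labels(heartbeat_labels: dict, server_info_labels: dict) -> dict:
    """Overlay dynamic (ServerInfo) labels onto node-reported labels.

    ServerInfo labels win on key conflicts, treating the auth server as
    the source of truth for dynamically applied labels.
    """
    merged = dict(heartbeat_labels)
    merged.update(server_info_labels)
    return merged

# The auth server would call this on each heartbeat it receives, reading
# the ServerInfo from its cache so no extra backend read is needed.
print(merge_heartbeat_labels({"env": "dev"}, {"team": "core"}))
# → {'env': 'dev', 'team': 'core'}
```

The open question in the thread is not the merge itself but when it runs and what load the ServerInfo reads add, which the replies below dig into.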
@r0mant This is a pretty tricky feature to track the history of because discussion was spread out all over the place. I think #21079 (comment) is perhaps the most relevant bit, but everything is really spread out.
The tl;dr as I remember it is something like this:
- Loading all ServerInfos immediately, either on startup (i.e. cache init) or on first heartbeat, would be bad since that had the potential to effectively double the amount of load experienced when restarting a teleport cluster (something that is already problematically intensive and one of our biggest scaling challenges).
- None of the solutions proposed for having nodes refuse connections until all labels had been discovered seemed practical (see linked discussion).
- It's doubtful that there even is a sane way to decide whether or not a node will acquire dynamically applied labels, short of requiring that users categorically separate nodes into statically and dynamically labeled groups, either via config file options or by using separate join tokens.
I'm quite a bit worried that with current implementation we just give users ability to shoot themselves in the foot, now you have to track which labels can or can't be used in "deny" rules.
Oof, I thought we had addressed it, but I don't see it looking back at the code. I think the fact that we don't have a better story around this is my fault, I'm sorry to say. I was pushing for categorically disallowing dynamic label usage in deny rules early on (#21079 (comment), #21079 (comment)), but it looks like that point got lost in the shuffle when that issue was closed. I was too focused on the scalability discussion in the follow-up RFD and failed to re-raise the security issue. Fixing this retroactively without breaking anyone's stuff might be a little tricky 😬
Update: I've been giving the deny rule issue more thought. Here's where I'm at currently:
Deprecating dynamic labels in deny rules at this point seems like it would be pretty messy. It's the better fix, but I think that ship has sailed. Unless anyone has a new idea not covered during the various other issues/discussions related to this, I think label discovery is going to stay lazy, and whether or not a given agent might have dynamically applied labels in the future is likely to continue to be undecidable in most cases. Given that, my suggestion would be to do one or more of the following:
- Add a warning to role creation/update of the form: "Label <label-name> appears to be a cloud label. Use of cloud labels in deny rules is not recommended due to their dynamic nature. Cloud labels are discovered asynchronously, and agents remain accessible even if teleport failed to discover corresponding cloud labels."
- Add the ability to opt into requiring cloud labels for specific agents statically, either via the agent's own configuration, or via attributes applied to its joining method (e.g. `tctl token add --type=node --require-cloud-labels=yes`). Basically, have some mechanism of explicitly marking those agents that we know will have cloud labels eventually. That subset of nodes that joined via discovery could then start pre-configured to require cloud labels, without us breaking nodes that might have cloud labels but were initially added to the cluster via other means (and therefore might not have cloud labels known to the auth server yet).
- Cache cloud labels on the agent's disk so that nodes can use "last known cloud labels" when restarting, reducing the total amount of time spent without cloud labels (not a fix, just a mitigation).
@fspmarshall This is not just about cloud labels; in #34008 we basically added the ability for users to set these labels server-side by creating a ServerInfo resource.
I don't know if caching these labels on the node helps much tbh, because we'll still have this problem when, say, labels are updated while the node is restarted, down, or re-joining with an empty data dir.
Instead of (or in addition to) merging the labels during heartbeats, can't we do that at connect time, on-demand? I.e. when connecting to a server, pull its ServerInfo if there is one, and see if there are labels there that aren't on the server? Then it wouldn't put more strain on the cache during startup.
Ah, didn't realize user-created dynamic labels had landed (jumped into this comment thread without having full context 😬). That does complicate things, especially since, at least from a quick glance, it looks like user-created dynamic labels aren't constrained to a namespace the way cloud labels are, so we can't tell which labels might or might not be dynamic by looking at roles/nodes. Also, as far as I can tell, there isn't any mechanism in place for cleanup, so user-created labels apply until the node restarts even if the ServerInfo is deleted, which is its own kind of problem.
Loading server infos at the time when an access-control decision is needed is definitely appealing in terms of mitigating load on restart, though an additional round trip to auth for every access-control decision wouldn't be ideal. We have a lot of users requesting that we do whatever we can to reduce resource access latency. Also, if the point of doing that would be to make using deny rules with dynamic labels sane, then we'd really need to also apply dynamic labels to all heartbeats, and that means the benefits versus traditional caching might be minimal (it might even be worse, since presumably most backends are more performant per-item read when serving range requests versus individual reads, hard to say for sure without testing).
I've been running through various scenarios for the past couple hours, and I don't see any good ways to have our cake and eat it too here. I'll keep mulling it over, but right now I'm thinking that we're going to end up needing to pick a direction and commit to it:
- Lazy dynamic labels that shouldn't be used for deny rules (likely requiring some trickery to reliably detect and warn about use of dynamic labels in deny rules now that user-created labels have landed and they are un-namespaced).
- Always consistent labels that significantly increase peak load when in general use (with the risk of serious production outages if anyone currently making liberal use of dynamic labels is also pushing their backend to its limit).
I lean toward option 1. Not only for itself, but because I find the precedent set by option 2 scary. There's plenty of room in teleport for more lazy features that can be safely rate-limited without breaking their contracts. There is very little room for sensitive cross-resource synchronization in the "hot path", so to speak.
Alright, we've had some out of band discussions and come up with a gameplan:
We're going to continue to use the existing system of propagating dynamic labels lazily to agents. We will fix the security issue by disallowing the use of dynamic labels in deny rules for all new roles that get created. User-created dynamic labels just landed and no one is actually using the feature yet, so we're going to make a breaking change to that feature prior to publishing documentation of its use. We'll add a namespace prefix to all user-created dynamic labels (e.g. dynamic/<label>). This will allow us to statically determine which deny rules are trying to use dynamic labels by simply checking all labels to see if they start with any of the dynamic label prefix values. Newly created roles doing this will be rejected. Existing roles doing this (hopefully there aren't any, but you never know) will generate a warning, or possibly a cluster alert. This will be fairly easy to check with namespacing, since we'll be able to just load all roles and iterate over them.
We will reserve the right to start allowing dynamic labels in deny rules in the future (which won't be a breaking change since it will be allowing a previously disallowed configuration), but won't implement it unless we can come up with a plan to do so without breaking existing clusters that both have limited backend/network capacity and a large number of existing dynamic labels.
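Once labels are namespaced, the static check described in the gameplan is a simple prefix test. A hedged sketch in Python (the `dynamic/` prefix is the one proposed in this thread; every other name here is invented, not Teleport's real code):

```python
# Prefixes that mark a label key as dynamically applied. The plan above
# only commits to "dynamic/"; cloud-label prefixes could be added too.
DYNAMIC_LABEL_PREFIXES = ("dynamic/",)

def uses_dynamic_labels(deny_labels: dict) -> bool:
    """True if any label key in a deny rule carries a dynamic-label prefix."""
    return any(
        key.startswith(prefix)
        for key in deny_labels
        for prefix in DYNAMIC_LABEL_PREFIXES
    )

def validate_new_role(deny_labels: dict) -> None:
    # Per the plan: new roles are rejected outright; existing roles
    # would instead generate a warning or a cluster alert.
    if uses_dynamic_labels(deny_labels):
        raise ValueError("dynamic labels may not be used in deny rules")

validate_new_role({"env": "prod"})             # accepted
# validate_new_role({"dynamic/team": "core"})  # would raise ValueError
```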
🤖 Vercel preview here: https://docs-n5pvthnb1-goteleport.vercel.app/docs/ver/preview
Friendly ping @r0mant
@atburke Just checking to see if we're still going ahead with this one. Thanks!
Force-pushed 0f676e1 to 92bf7e0 (Compare)
server_info-based labels
🤖 Vercel preview here: https://docs-rko9yl6a7-goteleport.vercel.app/docs/ver/preview
r0mant left a comment
A couple of comments, but after that it should be good to go from my perspective.
(!docs/pages/includes/tctl.mdx!)
## Step 1/3. Install Teleport
### Install Teleport
I'm curious as to why we're turning these "Step ..." H2s into H3s of the Prerequisites section. I would keep them as they were and add a Step 4/4. for resource-based labels. Otherwise, the document structure is a little jarring to me.
This was a response to this comment (although making them H3s was just a mistake on my part). I've updated the docs with a hybrid approach that preserves the structure in the prereqs but allows the label sections to be done in any order; let me know what you think.
🤖 Vercel preview here: https://docs-46rzx12u6-goteleport.vercel.app/docs/ver/preview
r0mant left a comment
lgtm, but I have a question about the "si" prefix
```yaml
# server_info.yaml
kind: server_info
metadata:
  name: si-<node-name>
```
Hmm do we require the si- prefix? What's the reasoning for it? It's a little weird.
It's to differentiate server infos matched by name from server infos matched in other ways, namely aws-<account-id>-<instance-id>.
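For context, a fuller version of the resource under discussion might look like the sketch below. The spec field name and the label key/value are assumptions for illustration only; this thread does not confirm them:

```yaml
# server_info.yaml — hypothetical expanded example. Labels listed here
# are pushed to the matching node by the Auth Service over the
# inventory control stream (lazily, per the discussion above).
kind: server_info
metadata:
  name: si-<node-name>   # "si-" prefix: matched against the node's name
spec:
  new_labels:            # assumed field name for the labels to apply
    team: core
```

If the field name holds, the resource would be created with `tctl create server_info.yaml`.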
This change adds documentation for labels created with the server_info resource.