Skip to content

Add docs for resource-based labels#34475

Merged
atburke merged 6 commits intomasterfrom
atburke/server-info-docs
Jan 12, 2024
Merged

Add docs for resource-based labels#34475
atburke merged 6 commits intomasterfrom
atburke/server-info-docs

Conversation

@atburke
Copy link
Copy Markdown
Contributor

@atburke atburke commented Nov 10, 2023

This change adds documentation for labels created with the server_info resource.

@atburke atburke added backport/branch/v14 no-changelog Indicates that a PR does not require a changelog entry labels Nov 10, 2023
@github-actions
Copy link
Copy Markdown
Contributor

🤖 Vercel preview here: https://docs-551l07gpx-goteleport.vercel.app/docs/ver/preview

Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment thread docs/pages/management/admin/labels.mdx Outdated
```

1. Verify that you have added the label by running the following command on your local
computer. Your Teleport user must be authorized to access the server. Teleport
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So anyone who can access a server can update ephemeral labels? Is this concerning?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need access to the server to verify that the labels were added.

Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment thread docs/pages/management/admin/labels.mdx Outdated
@github-actions
Copy link
Copy Markdown
Contributor

🤖 Vercel preview here: https://docs-p9prrcm3h-goteleport.vercel.app/docs/ver/preview

@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Nov 20, 2023

Friendly ping @zmb3 @r0mant

Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment on lines +331 to +332
Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose
its resource-based labels until the Auth Service applies them again. To update resource-based labels,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resource-based labels are stored in memory on servers, so if a server restarts it may temporarily lose
its resource-based labels until the Auth Service applies them again.

This is not the docs question but rather implementation detail which I either missed or forgot about but I'm not sure I understand why this is the case. Aren't these labels supposed to be applied on every heartbeat? If there's a period of time where not all labels are present on the server, this means they effectively can't be used in RBAC's "deny" rules?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auth server has to send the labels to the node over the inventory control stream before the node can report them in future heartbeats. So yes, the shouldn't be used for deny rules.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If auth server has to send them to the node, why can't it merge them with the labels reported by the node at the time it receives the node heartbeat? I understand that in the former case auth server can control how fast it sends them out, but if the ServerInfo resources are cached, applying them to heartbeats wouldn't increase the load, would it?

@fspmarshall Do you recall why it was implemented this way? I'm quite a bit worried that with current implementation we just give users ability to shoot themselves in the foot, now you have to track which labels can or can't be used in "deny" rules. At the very least I think we should not allow connecting to servers that haven't yet received all labels IMO.

cc @rosstimothy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r0mant This is a pretty tricky feature to track the history of because discussion was spread out all over the place. I think #21079 (comment) is perhaps the most relevant bit, but everything is really spread out.

The tl;dr as I remember it is something like this:

  • Loading all ServerInfos immediately either on startup (i.e. cache init) or on first heartbeat would be bad since that had the potentially to effectively double the amount of load experience when restarting a teleport cluster (something that is already problematically intensive and one of our biggest scaling challenges).
  • None of the solutions proposed for having nodes refuse connections until all labels had been discovered seemed practical (see linked discussion).
  • Its doubtful that there even is a sane way to decide wether or not a node will acquire dynamically applied labels short of requiring that users categorically separate nodes into statically and dynamically labeled groups either via config file options or by using separate join tokens.

I'm quite a bit worried that with current implementation we just give users ability to shoot themselves in the foot, now you have to track which labels can or can't be used in "deny" rules.

Oof, I thought we had addressed it, but I don't see it looking back at the code. I think the fact that we don't have a better story around this is my fault I'm sorry to say. I was pushing for categorically disallowing dynamic label usage in deny rules early on (#21079 (comment), #21079 (comment)), but it looks like that point got lost in the shuffle when that issue was closed. I was too focused on the scalability discussion in the followup RFD and failed to re-raise the security issue. Fixing this retroactively without breaking anyone's stuff might be a little tricky 😬

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've been giving the deny rule issue more thought. Here's where my I'm at currently:

Deprecating dynamic labels in deny rules at this point seems like it would be pretty messy. Its the better fix, but I think that ship has sailed. Unless anyone has a new idea not covered during the various other issues/discussions related to this, I think label discovery is going to stay lazy, and wether or not a given agent might have dynamically applied labels in the future is likely to continue to be undecidable in most cases. Given that, my suggestion would be to do one or more of the following:

  • Add a warning to role creation/update of the form Label <label-name> appears to be a cloud label. Use of cloud labels in deny rules is not recommended due to their dynamic nature. Cloud labels are discovered asynchronously, and agents remain accessible even if teleport failed to discover corresponding cloud labels.
  • Add the ability to opt into requiring cloud labels for specific agents statically, either via the agent's own configuration, or via attributes applied to its joining method (e.g. tctl token add --type=node --require-cloud-labels=yes). Basically, have some mechanism of explicitly marking those agents that we know will have cloud labels eventually. That subset of nodes that joined via discovery could then start pre-configured to require cloud labels, without us breaking nodes that might have cloud labels, but were initially added to the cluster via other means (and therefore might not have cloud labels known to the auth server yet).
  • Cache cloud labels on the agent's disk so that nodes can use "last known cloud labels" when restarting, reducing the total amount of time spent without cloud labels (not a fix, just a mitigation).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspmarshall This is not just about cloud labels, as in #34008 we basically added ability for users to set these labels server-side by creating ServerInfo resource.

I don't know if caching these labels on the node helps much tbh because we'll still have this problem when, say, labels are updated while the node is being restarted or down or re-joining with empty data dir.

Instead of (or in addition to) merging the labels during heartbeats, can't we do that at connect time, on-demand? I.e. when connecting to a server, pull its ServerInfo if there's one, and see if there're labels there that aren't on the server? Then it wouldn't put more strain on cache during startup?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't realize user-created dynamic labels had landed (jumped into this comment thread without having full context 😬). That does complicate things. Especially since, at least from a quick glance, it looks like user-created dynamic labels aren't constrained to a namespace the way cloud labels are so we can't tell which labels might or might not be dynamic from looking at roles/nodes. Also, as far as I can tell, there isn't any mechanism in-place for cleanup, so user-created labels apply until the node restarts even if the ServerInfo is deleted, which is its own kind of problem.

Loading server infos at the time when an access-control decision is needed is definitely appealing in terms of mitigating load on restart, though an additional round trip to auth for every access-control decision wouldn't be ideal. We have a lot of users requesting that we do whatever we can to reduce resource access latency. Also, if the point of doing that would be to make using deny rules with dynamic labels sane, then we'd really need to also apply dynamic labels to all heartbeats, and that means the benefits versus traditional caching might be minimal (it might even be worse, since presumably most backends are more performant per-item read when serving range requests versus individual reads, hard to say for sure without testing).

I've been running through various scenarios for the past couple hours, and I don't see any good ways to have our cake and eat it too here. I'll keep mulling it over, but right now I'm thinking that we're going to end up needing to pick a direction and commit to it:

  1. Lazy dynamic labels that shouldn't be used for deny rules (likely requiring some trickery to reliably detect and warn about use of dynamic labels in deny rules now that user-created labels have landed and they are un-namespaced).

  2. Always consistent labels that significantly increase peak load when in general use (with the risk of serious production outages if anyone currently making liberal use of dynamic labels is also pushing their backend to its limit).

I lean toward option 1. Not only for itself, but because I find the precedence set by option 2 scary. There's plenty of room in teleport for more lazy features that can be safely rate-limited without breaking their contracts. There is very little room for sensitive cross-resource synchronization in the "hot path" so to speak.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, we've had some out of band discussions and come up with a gameplan:

We're going to continue to use the existing system of propagating dynamic labels lazily to agents. We will fix the security issue by disallowing use of dynamic labels in deny rules for all new roles that get created. User-created dynamic labels just landed and no one is actually using the feature yet, so we're going to make a breaking change to that feature prior to publishing documentation of its use. We'll add a namespace prefix to all user-created dynamic labels (e.g. dynamic/<label>). This will allow us to statically determine which deny rules are trying to use dynamic labels by simply checking all labels to see if they start with any of the dynamic label prefixes values. Newly created roles doing this will be rejected. Existing roles doing this (hopefully there aren't any, but you never know) will generate a warning, or possibly a cluster alert. This will be fairly easy to check with namespacing since we'll be able to just load all roles and iterate them.

We will reserve the right to start allowing dynamic labels in deny rules in the future (which won't be a breaking change since it will be allowing a previously disallowed configuration), but won't implement it unless we can come up with a plan to do so without breaking existing clusters that both have limited backend/network capacity and a large number of existing dynamic labels.

@github-actions
Copy link
Copy Markdown
Contributor

🤖 Vercel preview here: https://docs-n5pvthnb1-goteleport.vercel.app/docs/ver/preview

@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Nov 27, 2023

Friendly ping @r0mant

@ptgott
Copy link
Copy Markdown
Contributor

ptgott commented Jan 5, 2024

@atburke Just checking to see if we're still going ahead with this one. Thanks!

@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Jan 5, 2024

@ptgott We will once #36219 is merged

@atburke atburke changed the title Add docs for ephemeral/server_info-based labels Add docs for resource-based labels Jan 11, 2024
@atburke
Copy link
Copy Markdown
Contributor Author

atburke commented Jan 11, 2024

@ptgott @r0mant Thank you for your patience on this one! I've updated the docs to include the changes from #36219 and would appreciate another review pass.

@atburke atburke requested review from ptgott and r0mant January 11, 2024 01:08
@github-actions
Copy link
Copy Markdown
Contributor

🤖 Vercel preview here: https://docs-rko9yl6a7-goteleport.vercel.app/docs/ver/preview

Copy link
Copy Markdown
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments but after that should be good to go from my perspective.

Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment thread docs/pages/management/admin/labels.mdx Outdated
Comment thread docs/pages/management/admin/labels.mdx Outdated
(!docs/pages/includes/tctl.mdx!)

## Step 1/3. Install Teleport
### Install Teleport
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious as to why we're turning these "Step ..." H2s into H3s of the Prerequisites section. I would keep them as they were and add a Step 4/4. for resource-based labels. Otherwise, the document structure is a little jarring to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a response to this comment (although making them H3s was just a mistake on my part). I've updated the docs with a hybrid approach that preserves the structure in the prereqs but allows the label sections to be done in any order, let me know what you think.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks!

@github-actions
Copy link
Copy Markdown
Contributor

🤖 Vercel preview here: https://docs-46rzx12u6-goteleport.vercel.app/docs/ver/preview

Copy link
Copy Markdown
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but I've a question about "si" prefix

# server_info.yaml
kind: server_info
metadata:
name: si-<node-name>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm do we require the si- prefix? What's the reasoning for it? It's a little weird.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to differentiate server infos matched by name from server infos matched in other ways, namely aws-<account-id>-<instance-id>.

@atburke atburke added this pull request to the merge queue Jan 12, 2024
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jan 12, 2024
@atburke atburke added this pull request to the merge queue Jan 12, 2024
Merged via the queue into master with commit 832988c Jan 12, 2024
@atburke atburke deleted the atburke/server-info-docs branch January 12, 2024 23:40
@public-teleport-github-review-bot
Copy link
Copy Markdown

@atburke See the table below for backport results.

Branch Result
branch/v14 Create PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation no-changelog Indicates that a PR does not require a changelog entry size/sm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants