Skip to content

RFD 57: Add agentless mode section and AWS tags forwarding section#18676

Merged
lxea merged 6 commits intomasterfrom
lxea/rfd-update-agentless
Feb 20, 2023
Merged

RFD 57: Add agentless mode section and AWS tags forwarding section#18676
lxea merged 6 commits intomasterfrom
lxea/rfd-update-agentless

Conversation

@lxea
Copy link
Copy Markdown
Contributor

@lxea lxea commented Nov 22, 2022

This adds some sections to the EC2 discovery rfd outlining some more information on how agentless mode will be configured and implemented as well as information on how AWS tags will be set for the teleport node labels.

@lxea lxea force-pushed the lxea/rfd-update-agentless branch 2 times, most recently from caca05c to 517d966 Compare November 23, 2022 16:36
@lxea lxea marked this pull request as ready for review November 23, 2022 16:38
@github-actions github-actions Bot requested review from greedy52 and r0mant November 23, 2022 16:38
@github-actions github-actions Bot added the rfd Request for Discussion label Nov 23, 2022
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md
@r0mant r0mant requested review from capnspacehook and jakule and removed request for greedy52 November 30, 2022 23:42
@r0mant
Copy link
Copy Markdown
Collaborator

r0mant commented Nov 30, 2022

@jakule @capnspacehook Can you take a look as well when you get a chance? This is going to rely on the "OpenSSH inventory management" functionality you're working on.

@lxea lxea force-pushed the lxea/rfd-update-agentless branch from 517d966 to 9c8f8a1 Compare December 1, 2022 14:35
will be created, each secret will have the following contents

```json
{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about CA rotations?

In order to not break access to these agentless nodes, the discovery service would need to watch for CA rotations and push updated certs to the agentless nodes.

Are we including this in scope, or are we okay with the fact that CA rotations will break things for now?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @lxea and I talked about it - we should address this in the RFD but we probably won't include this in the scope of the initial implementation. I wonder if we'll be able to reuse some of the automatic upgrade functionality @fspmarshall is working on for this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why automatic upgrades apply. That's about updating the Teleport binary, not getting new certs.

I'm no expert here, but I would think discovery service could set up a watcher for CA rotations just like Machine ID does, and push the new certs to the instances when this happens.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean automatic upgrades specifically but rather machinery that's part of script-based upgrades for example, basically the machinery that will run some script on the nodes based on some conditions (new version availability is just one of the conditions in my mind). But yeah, it could be done in discovery service too.

Copy link
Copy Markdown
Contributor

@capnspacehook capnspacehook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as cert rotation is addressed

Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
Comment thread rfd/0057-automatic-aws-server-discovery.md Outdated
@lxea lxea force-pushed the lxea/rfd-update-agentless branch 4 times, most recently from 3b8a21a to 47490e1 Compare December 23, 2022 15:08
@lxea lxea force-pushed the lxea/rfd-update-agentless branch from 47490e1 to 0f3d07b Compare January 18, 2023 13:27
Comment on lines +375 to +389
### Including AWS Tags as Teleport labels

The AWS tags on discovered EC2 instances will be included as Teleport labels on the
discovered Nodes.

In order to achieve this a helper resource named `DiscoveredServer` will be
introduced with will store metadata about discovered nodes that was retrieved via the
AWS API.

When Teleport is installed and registers the EC2 instance, the Auth server will check
for a corresponding `DiscoveredServer` resource by matching on `instance-id` and
`account-id` labels. If there is a matching `DiscoveredServer`, it will create a
`Server` resource using the metadata from the `DiscoveredServer` and ignore labels
sent via heartbeat from the node.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for jumping into this late, got here via #21079

I think this needs some more consideration from a scalability perspective.

As currently implemented in #21079 this induces an additional backend read for every node heartbeat (regardless of wether this feature is in use), and has the potential to induce an additional backend write per instance in the worst case. This is very problematic.

For context, I just got out of an S1 induced by a change that increased reads/writes by about 30%. IIUC this solution has the potential to be far more impactful. The 30% change took care to use very aggressive jittering s.t. the added operations were evenly distributed across a ~3 min interval. As currently implemented in #21079, this change performs writes in a tight loop once per minute, so I'd hazard that this change would be more than 3x as impactful.

In general, adding any new resource that has the potential to scale with cluster size is a pretty big deal.


At the very least, I think we need to modify the implementation s.t. we don't incur the additional read on node heartbeat creation if this feature is not in use, so that people can opt into the significantly increased backend load. E.g. if the cert of the node encoded wether or not it was subject to discovered labels, then heartbeat logic could selectively decide when it was appropriate to incur the added load.

Ideally, we'd find a way to implement this without significantly changing the current load characteristics. E.g. if it were acceptable for there to be a short (1-6 min) delay between node joining and full label population, I think we could work this feature into the existing Instance resource model without significantly changing load characteristics (tho it'd be tricky).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fspmarshall Yep, we shouldn't be placing this check in the local backend, I agree. We'll move it up the stack so auth server maintains local cache for discovered servers as for other resources and checks that for node heartbeats.

@r0mant r0mant changed the title Add agentless mode section and AWS tags forwarding section to EC2 discovery RFD RFD 57: Add agentless mode section and AWS tags forwarding section Feb 7, 2023
Copy link
Copy Markdown
Contributor

@klizhentas klizhentas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r0mant lgtm, assuming you will address @fspmarshall's scalability concerns.

@r0mant
Copy link
Copy Markdown
Collaborator

r0mant commented Feb 17, 2023

@lxea I've moved the part about tags import into a separate RFD: #22033.

Can you remove it from here and let's merge this one.

@lxea lxea force-pushed the lxea/rfd-update-agentless branch from 0f3d07b to adbb574 Compare February 20, 2023 11:01
@lxea lxea enabled auto-merge February 20, 2023 11:01
@lxea lxea added this pull request to the merge queue Feb 20, 2023
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 20, 2023
@lxea lxea added this pull request to the merge queue Feb 20, 2023
Merged via the queue into master with commit a15e987 Feb 20, 2023
@lxea lxea deleted the lxea/rfd-update-agentless branch February 20, 2023 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rfd Request for Discussion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants