RFD 57: Add agentless mode section and AWS tags forwarding section#18676
RFD 57: Add agentless mode section and AWS tags forwarding section#18676
Conversation
caca05c to
517d966
Compare
|
@jakule @capnspacehook Can you take a look as well when you get a chance? This is going to rely on the "OpenSSH inventory management" functionality you're working on. |
517d966 to
9c8f8a1
Compare
| will be created, each secret will have the following contents | ||
|
|
||
| ```json | ||
| { |
There was a problem hiding this comment.
What about CA rotations?
In order to not break access to these agentless nodes, the discovery service would need to watch for CA rotations and push updated certs to the agentless nodes.
Are we including this in scope, or are we okay with the fact that CA rotations will break things for now?
There was a problem hiding this comment.
Yes, @lxea and I talked about it - we should address this in the RFD but we probably won't include this in the scope of the initial implementation. I wonder if we'll be able to reuse some of the automatic upgrade functionality @fspmarshall is working on for this.
There was a problem hiding this comment.
I don't see why automatic upgrades apply. That's about updating the Teleport binary, not getting new certs.
I'm no expert here, but I would think discovery service could set up a watcher for CA rotations just like Machine ID does, and push the new certs to the instances when this happens.
There was a problem hiding this comment.
I didn't mean automatic upgrades specifically but rather machinery that's part of script-based upgrades for example, basically the machinery that will run some script on the nodes based on some conditions (new version availability is just one of the conditions in my mind). But yeah, it could be done in discovery service too.
capnspacehook
left a comment
There was a problem hiding this comment.
LGTM as long as cert rotation is addressed
3b8a21a to
47490e1
Compare
47490e1 to
0f3d07b
Compare
| ### Including AWS Tags as Teleport labels | ||
|
|
||
| The AWS tags on discovered EC2 instances will be included as Teleport labels on the | ||
| discovered Nodes. | ||
|
|
||
| In order to achieve this a helper resource named `DiscoveredServer` will be | ||
| introduced with will store metadata about discovered nodes that was retrieved via the | ||
| AWS API. | ||
|
|
||
| When Teleport is installed and registers the EC2 instance, the Auth server will check | ||
| for a corresponding `DiscoveredServer` resource by matching on `instance-id` and | ||
| `account-id` labels. If there is a matching `DiscoveredServer`, it will create a | ||
| `Server` resource using the metadata from the `DiscoveredServer` and ignore labels | ||
| sent via heartbeat from the node. | ||
|
|
There was a problem hiding this comment.
Sorry for jumping into this late, got here via #21079
I think this needs some more consideration from a scalability perspective.
As currently implemented in #21079 this induces an additional backend read for every node heartbeat (regardless of wether this feature is in use), and has the potential to induce an additional backend write per instance in the worst case. This is very problematic.
For context, I just got out of an S1 induced by a change that increased reads/writes by about 30%. IIUC this solution has the potential to be far more impactful. The 30% change took care to use very aggressive jittering s.t. the added operations were evenly distributed across a ~3 min interval. As currently implemented in #21079, this change performs writes in a tight loop once per minute, so I'd hazard that this change would be more than 3x as impactful.
In general, adding any new resource that has the potential to scale with cluster size is a pretty big deal.
At the very least, I think we need to modify the implementation s.t. we don't incur the additional read on node heartbeat creation if this feature is not in use, so that people can opt into the significantly increased backend load. E.g. if the cert of the node encoded wether or not it was subject to discovered labels, then heartbeat logic could selectively decide when it was appropriate to incur the added load.
Ideally, we'd find a way to implement this without significantly changing the current load characteristics. E.g. if it were acceptable for there to be a short (1-6 min) delay between node joining and full label population, I think we could work this feature into the existing Instance resource model without significantly changing load characteristics (tho it'd be tricky).
There was a problem hiding this comment.
@fspmarshall Yep, we shouldn't be placing this check in the local backend, I agree. We'll move it up the stack so auth server maintains local cache for discovered servers as for other resources and checks that for node heartbeats.
klizhentas
left a comment
There was a problem hiding this comment.
@r0mant lgtm, assuming you will address @fspmarshall's scalability concerns.
0f3d07b to
adbb574
Compare
This adds some sections to the EC2 discovery rfd outlining some more information on how agentless mode will be configured and implemented as well as information on how AWS tags will be set for the teleport node labels.