RFD 41: Simplified Node Joining for AWS #7292
klizhentas
left a comment
Some early review @russjones @nklaassen
| enabled: yes | ||
| aws_join: | ||
| allow: | ||
| - organization: "o-1111111111" |
Have you considered using the AWS ARN format? It allows flexible patterns.
Could do that. We would still need a separate ARN for the organization and the account, so I didn't see a major benefit. It would, however, also allow admins to restrict based on the IAM role in the account ARN, so it may be better.
| # Example: | ||
| allow: | ||
| - account: "2222222222" # allow any node from this account | ||
| - organization: "o-1111111111" # allow any node in any account this org |
I'm curious about the data types of account and organization. Would it make much of a difference if they were []string? I wonder what the config will look like for multiple accounts/organizations.
See the latest revision: multiple accounts can be listed in separate rules (and we no longer plan to support an organization rule).
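As a rough illustration of the "one account per rule" shape, here is a minimal Go sketch of how a list of allow rules might be matched against the account of a joining node. The `allowRule` type and `accountAllowed` function are hypothetical names for this example, not Teleport's actual types, and the second account ID is made up.

```go
package main

import "fmt"

// allowRule is a hypothetical in-memory form of one entry under `allow:`.
// Each rule names exactly one AWS account.
type allowRule struct {
	Account string
}

// accountAllowed reports whether any rule admits the given AWS account ID.
func accountAllowed(rules []allowRule, account string) bool {
	for _, r := range rules {
		if r.Account == account {
			return true
		}
	}
	return false
}

func main() {
	// Multiple accounts are expressed as multiple rules.
	rules := []allowRule{
		{Account: "2222222222"},
		{Account: "3333333333"}, // hypothetical second account
	}
	fmt.Println(accountAllowed(rules, "3333333333")) // true
	fmt.Println(accountAllowed(rules, "4444444444")) // false
}
```

Keeping each rule scalar means further per-account restrictions could later be attached to a single rule without changing the list shape.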
| (`organizations:DescribeAccount` or `organizations:ListAccounts`) can only be | ||
| called from the organization's management account. At a minimum, this would | ||
| still require creating signed requests on the node and sending them to the auth | ||
| server as the current design does. |
This doesn't simplify the flow. We replace the token with a request, and instead of attaching a policy to read secrets from Secrets Manager we attach a different one.
The main idea was to eliminate the need for attachments on the node side by shifting all required checks and verifications to the auth node, and to replace the token with the identity document.
As discussed in Slack, I am going to revamp this design to use the Instance Identity Document method over the IAM method.
In the latest revision I describe both the EC2 (Instance Identity Document) method and the IAM method
366a7d1 to
83b6423
Whatever heartbeat is used on AWS should identify the node by instance ID and include the signature from the IID. To prevent replay attacks, the auth server will first verify the IID signature, then call getNode(instanceID).
If it's running and has joined the cluster, the heartbeat should be present, so the lookup works if the node name == instance ID.
For the lookup to be efficient, the node name should equal the instance ID.
Updated so that the node name will be set to <aws_account_id>-<aws_instance_id> (instance IDs are not necessarily unique across accounts) so auth can efficiently check if the node has already joined.
We don't need to store the IID signature, as we will block any second join from the same instance.
Extend token type for auth configuration
a96dc31 to
1359bbe
| 3. Check that the AWS join token matches the AWS account in the IID, and the | ||
| requested Teleport service role. | ||
| 4. Check that this EC2 instance has not already joined the cluster. | ||
| - The node name will be set to `<aws_account_id>-<aws_instance_id>` so that |
- For detection/incident response purposes, this should raise an alert or at least be logged in detail.
This would definitely be logged in detail if the same instance attempts to join multiple times. I'm not sure whether we currently have any applicable mechanism for raising an alert other than logging, @klizhentas?
| https://github.com/gravitational/teleport/blob/d4247cb150d720be97521347b74bf9c526ae869f/lib/auth/auth.go#L1538-L1563). | ||
| The auth server will then: | ||
| 1. Use the AWS DSA public certificate to check that the PKCS7 signature for the |
Will Teleport trust the region specified in the IID and use that to pick the specific AWS DSA public certificate? Or will this be region-locked via the teleport config/only support the primary AWS DSA? (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/verify-pkcs7.html)
It would be possible to add region-locking to the config, but my current plan is just to trust the region in the IID to select the correct cert. Is there a security issue with this?
I'm mainly worried about the potential security risk of having the attacker "pick" the certificate. Such risks would mainly relate to the implementation of the code we'll see later (e.g. path injection/traversal when retrieving the cert from disk or network). On a different note, an operator may need to restrict this (e.g. GovCloud instances, or existing IAM policies that deny access to any actions outside the specified Regions for defense in depth), so I would consider the pros/cons of optionally restricting this.
Good point, I've added a configuration option to restrict the regions from which a node can join for the EC2 method (we won't have this info for the IAM method as it isn't included in the sts:GetCallerIdentity response)
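One way to address both concerns raised here is to treat the region from the IID purely as a key into a fixed, compiled-in table and to apply an optional allow-list first. The Go sketch below assumes hypothetical names (`certsByRegion`, `selectCert`) and uses placeholder strings instead of real certificates; it is not the actual implementation.

```go
package main

import "fmt"

// certsByRegion maps an AWS region to its public certificate.
// The values here are placeholders, not real certificates.
var certsByRegion = map[string]string{
	"us-east-1": "<PEM for us-east-1>",
	"us-west-2": "<PEM for us-west-2>",
}

// selectCert returns the certificate for the region claimed in the IID.
// The region is only ever used as a map key, never to build a file path
// or URL, and an optional allow-list restricts which regions may join.
// An empty allow-list means any known region is accepted.
func selectCert(region string, allowedRegions []string) (string, error) {
	if len(allowedRegions) > 0 {
		allowed := false
		for _, r := range allowedRegions {
			if r == region {
				allowed = true
				break
			}
		}
		if !allowed {
			return "", fmt.Errorf("region %q is not allowed to join", region)
		}
	}
	cert, ok := certsByRegion[region]
	if !ok {
		return "", fmt.Errorf("no known certificate for region %q", region)
	}
	return cert, nil
}

func main() {
	_, err := selectCert("us-east-1", nil)
	fmt.Println(err) // <nil>

	_, err = selectCert("us-east-1", []string{"us-west-2"})
	fmt.Println(err) // region not allowed

	_, err = selectCert("../../etc/passwd", nil)
	fmt.Println(err) // unknown region, nothing is read from disk
}
```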
| the AWS SDK for Go. | ||
| Because the signed request can include arbitrary headers, this allows us to | ||
| issue a challenge (a crypto random string of bytes) that the node must include |
The lifetime of this challenge should be limited.
The challenge will only be valid within the original gRPC streaming request, and only one attempt will be allowed. We could put a timeout of something like 1 minute on the complete gRPC request.
Add a 1-minute timeout to the doc.
| ### IAM Method | ||
| In place of a join token, nodes will present a signed `sts:GetCallerIdentity` |
Ideally, customers should be advised in the docs that this design will work as long as the sts:AssumeRole action is restricted. This should be ensured by them depending on their AWS setups.
Good point, this will definitely be noted. Customers could possibly restrict access for an account to a specific set of IAM roles which cannot be assumed from other accounts, but this does impose a greater configuration burden.
knisbet
left a comment
I was just curious whether this was considered in the context of Teleport Cloud. Based on the examples, it looks like this should work with multiple AWS accounts. I just wanted to make sure that when we host the auth/proxy servers in Teleport Cloud and the nodes/db/kube connect using reverse tunnel mode, they'll be able to present tokens that can be validated.
Also, from talking to other customers about similar things (I don't know whether it really applies to this RFD or not): can the labels be enforced/restricted based on the token they use? Some customers might want to ensure that proper RBAC carries through based on the node joining, in a way that isn't totally up to the node-side configuration.
@nklaassen ping on updating and merging it, and @pierrebeaucamp please review for compatibility with cloud
| 1. Check that the `pendingTime` of the IID, which is the time the instance was | ||
| launched, is within a 5 minute TTL. | ||
| - if the node fails to join the cluster during this window, the user can | ||
| stop and restart the EC2 instance to reset the `pendingTime` (and | ||
| effectively create a new IID with a new signature) |
I'm not sure why the node joining window is narrowed to only 5 minutes from instance creation time.
Is it not possible for an instance to be started and the Teleport service deployed to the node later?
That is possible, but not really the intended use-case for this feature where ideally teleport will be installed in the AMI and start up immediately so that the dev can access the instance with Teleport instead of ssh. The 5-minute TTL was introduced to decrease the window where an attacker could steal the Instance Identity Document and join the cluster with a fake/malicious node.
I don't know if this is the right number, but there are 2 pending "RFD 33"s so I'll claim 35 for now.
Issue #7145