Skip to content

RFD 41: Simplified Node Joining for AWS#7292

Merged
nklaassen merged 14 commits into
masterfrom
rfd/0035-aws-node-join
Sep 15, 2021
Merged

RFD 41: Simplified Node Joining for AWS#7292
nklaassen merged 14 commits into
masterfrom
rfd/0035-aws-node-join

Conversation

@nklaassen
Copy link
Copy Markdown
Contributor

@nklaassen nklaassen commented Jun 14, 2021

I don't know if this is the right number, but there are 2 pending "RFD 33"s so I'll claim 35 for now.

Issue #7145

Copy link
Copy Markdown
Contributor

@klizhentas klizhentas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early review @russjones @nklaassen

Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
enabled: yes
aws_join:
allow:
- organization: "o-1111111111"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you consider to use AWS arn format? It allows flexible patterns

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do that. We would still need a separate ARN for the organization and the account so I didn't see a major benefit, but it would allow admins to restrict based on the IAM role in the account ARN as well so may be better

Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
# Example:
allow:
- account: "2222222222" # allow any node from this account
- organization: "o-1111111111" # allow any node in any account this org
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about data types of account and organization. Is it a big of a difference in case of []string? I wonder how the config will look like for multiple accounts/organizations?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the latest revision, multiple accounts can be listed in separate rules (and we no longer plan to support an organization rule)

Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
Comment thread rfd/0035-aws-node-join.md Outdated
(`organizations:DescribeAccount` or `organizations:ListAccounts`) can only be
called from the organization's management account. At a minimum, this would
still require creating signed requests on the node and sending them to the auth
server as the current design does.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't simplify the flow. We replace the token with request and instead of attaching the policy to read secrets from secret manager we attach another one.
The main idea was to eliminate the need of attachments on node side with shifting all required checks and verifications to the auth node and replace the token with identity document.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed in slack I am going to revamp this design to use the Instance Identity Document method over the IAM method

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest revision I describe both the EC2 (Instance Identity Document) method and the IAM method

@nklaassen nklaassen force-pushed the rfd/0035-aws-node-join branch from 366a7d1 to 83b6423 Compare June 24, 2021 17:38
Comment thread rfd/0035-aws-node-join.md Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatever heartbeat on AWS is used, identifies the node by instance ID, and includes the signature from the IID. To prevent replay attack, the auth server will first verify the IID signature, getNode(instanceID).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's running and it has joined the cluster, the heartbeat should be present, so if the node name == instance id

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the lookup to be efficient, the node name == instance-id

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated so that the node name will be set to <aws_account_id>-<aws_instance_id> (instance IDs are not necessarily unique across accounts) so auth can efficiently check if the node has already joined.

Don't need to store the IID signature as we will block any second join from the same instance.

@nklaassen nklaassen force-pushed the rfd/0035-aws-node-join branch from a96dc31 to 1359bbe Compare June 24, 2021 19:55
Comment thread rfd/0035-aws-node-join.md
3. Check that the AWS join token matches the AWS account in the IID, and the
requested Teleport service role.
4. Check that this EC2 instance has not already joined the cluster.
- The node name will be set to `<aws_account_id>-<aws_instance_id>` so that
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For detection/incident response purposes, this should raise an alert or at least be logged in detail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would definitely be logged in detail if the same instance attempts to join multiple times, I'm not sure if we currently have any applicable mechanism for raising an alert other than logging @klizhentas ?

Comment thread rfd/0035-aws-node-join.md Outdated
https://github.com/gravitational/teleport/blob/d4247cb150d720be97521347b74bf9c526ae869f/lib/auth/auth.go#L1538-L1563).

The auth server will then:
1. Use the AWS DSA public certificate to check that the PKCS7 signature for the
Copy link
Copy Markdown

@phosphore phosphore Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will Teleport trust the region specified in the IID and use that to pick the specific AWS DSA public certificate? Or will this be region-locked via the teleport config/only support the primary AWS DSA? (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/verify-pkcs7.html)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be possible to add region-locking to the config, but my current plan is just to trust the region in the IID to select the correct cert, is there a security issue with this?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mainly worried about the potential security risk of having the attacker "pick" the certificate. Such risks would be mainly related to the implementation of the code we'll see later (e.g. path injections/traversal when retrieving the cert from disk or network). On a different note, an operator may need to restrict this (e.g. Govcloud instances, existing IAM policies that deny access to any actions outside the Regions specified for defense in depth), so I would consider the pros/cons of optionally restrict this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've added a configuration option to restrict the regions from which a node can join for the EC2 method (we won't have this info for the IAM method as it isn't included in the sts:GetCallerIdentity response)

Comment thread rfd/0035-aws-node-join.md
the AWS SDK for Go.

Because the signed request can include arbitrary headers, this allows us to
issue a challenge (a crypto random string of bytes) that the node must include
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lifetime of this challenge should be limited.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge will only be valid within the original gRPC streaming request, and only one attempt will be allowed. We could put a timeout of something like 1 minute on the complete gRPC request.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the timeout idea!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a 1 minute timeout to the doc

Comment thread rfd/0035-aws-node-join.md

### IAM Method

In place of a join token, nodes will present a signed `sts:GetCallerIdentity`
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, customers should be advised in the docs that this design will work as long as the sts:AssumeRole action is restricted. This should be ensured by them depending on their AWS setups.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this will definitely be noted. Customers could possibly restrict access for an account to a specific set of IAM roles which cannot be assumed from other accounts, but this does impose a greater configuration burden.

Comment thread rfd/0035-aws-node-join.md
Comment thread rfd/0035-aws-node-join.md
@russjones russjones added the rfd Request for Discussion label Jul 28, 2021
Copy link
Copy Markdown
Contributor

@knisbet knisbet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just curious whether this was considered in the context of Teleport Cloud. It looks like based on the examples this should work with multiple AWS accounts, I just wanted to make sure when we host the auth/proxy servers in teleport cloud and the nodes/db/kube connect using reverse tunnel mode that they'll be able to present tokens that can be validated.

Also, talking to other customers about similar stuff, I don't know if it really applies to this RFD or not, but can be of the labels be enforced/restricted based on the token they use? Some customers might want to ensure that proper RBAC carries through based on the node joining that isn't totally up to the node side configuration.

@klizhentas
Copy link
Copy Markdown
Contributor

@nklaassen ping on updating and merging it, and @pierrebeaucamp please review for compatibility with cloud

Comment thread rfd/0035-aws-node-join.md
Comment on lines +55 to +59
1. Check that the `pendingTime` of the IID, which is the time the instance was
launched, is within a 5 minute TTL.
- if the node fails to join the cluster during this window, the user can
stop and restart the EC2 instance to reset the `pendingTime` (and
effectively create a new IID with a new signature)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why node joining window is narrowed to only 5 min counting from instance creation time.
Does situation when an instance is started and teleport service is deployed to the node later is not possible ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is possible, but not really the intended use-case for this feature where ideally teleport will be installed in the AMI and start up immediately so that the dev can access the instance with Teleport instead of ssh. The 5-minute TTL was introduced to decrease the window where an attacker could steal the Instance Identity Document and join the cluster with a fake/malicious node.

@nklaassen nklaassen changed the title RFD 35: Simplified Node Joining for AWS RFD 41: Simplified Node Joining for AWS Sep 15, 2021
@nklaassen nklaassen enabled auto-merge (squash) September 15, 2021 22:46
@nklaassen nklaassen merged commit 3e5480d into master Sep 15, 2021
@nklaassen nklaassen deleted the rfd/0035-aws-node-join branch September 15, 2021 22:49
zmb3 pushed a commit that referenced this pull request Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rfd Request for Discussion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants