RFD 0119 - Integrate with AWS using OIDC (#22556)
Conversation
Force-pushed from b2bcaa9 to f94fc24
> ## What
>
> Discover AWS RDS Databases without manual configuration.
>
> #### Goals
>
> Easily discover AWS RDS Databases of an account.
Isn't the goal to just have a generic way to create an AWS integration in Teleport? RDS discovery/configuration would be just one of potential users of this integration.
I was targeting something more specific to avoid the "where are we going to use this?" question.
Let me rephrase the goal and RFD title to "Integrate with AWS using OIDC".
> Our current discovery process for AWS resources requires multiple steps before the user gets to the "aha moment".
> Some of those steps are a little complicated and require the user to know some Teleport specific internals.
>
> We can reduce the number of steps and remove most of the Teleport specific configuration by creating an OIDC IdP in Teleport.
Be specific: instead of saying "some steps", "a little complicated" and "some specific internals", say that IAM setup/configuration is the challenge, and that using the AWS OIDC integration will allow the Teleport proxy/auth services to gain some limited privileges in users' AWS accounts to simplify these steps. And that it will work with Cloud.
👍
Mentioning cloud here seems out of place 🤔
I'll add it to the goals instead, if that's ok
> ```yaml
> kind: integration
> subkind: aws-oidc
> ```
Nit: Do we need -oidc suffix?
I'm trying to leave space/pattern to have other types of AWS integrations (using secrets or using an ec2 instance).
Should I remove the -oidc anyway?
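For context, a fuller sketch of what such a resource might look like; only `kind` and `subkind` appear in the RFD excerpt above, so every other field name here is an illustrative assumption, not part of the proposal:

```yaml
# Hypothetical sketch of an AWS OIDC integration resource.
# Fields other than kind/subkind are assumptions for illustration.
kind: integration
subkind: aws-oidc
version: v1
metadata:
  name: my-aws-account
spec:
  aws_oidc:
    # ARN of the IAM role that trusts the Teleport OIDC IdP (assumed field)
    role_arn: arn:aws:iam::123456789012:role/TeleportDiscover
```

Keeping AWS-specific settings under a per-subkind key would leave room for the other AWS integration types (secrets, EC2 instance) mentioned above.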
> #### AWS Calls from unauthorized Teleport Users
> We'll be able to call `DescribeDBInstances` AWS endpoint, however we must ensure that this is protected by RBAC.
>
> In order to do so, we'll re-use the `db` resource with the `list` verb to allow listing RDS Databases.
This part is not very clear tbh. Can you explain what we're trying to protect against here? How can unauthorized Teleport users leverage the AWS integration? Reusing existing database RBAC rules for the AWS integration seems strange tbh.
Picture the state where we have an Integration with AWS.
We'll have an HTTP endpoint in Proxy that calls the AWS API Describe DB Instance.
We'll have a new resource Integration, and subsequent rules (create, delete, list, update), but that only works for CRUD operations over the resource itself.
Should any user be able to list the AWS RDS DBs? Or do we want some kind of check before calling the AWS Endpoint?
I'm overloading the db.list rule with something else not entirely relevant.
However, I think we should add some check before calling the AWS Endpoint.
We'll have a new verb: `use`
If the user has `integration.use` then they can issue Integration API Calls
> We'll have a new verb: `use`. If the user has `integration.use` then they can issue Integration API Calls.
Apologies for a naive, after-the-merge question, but what is stopping us from using one of the existing verbs like "read" for this? Does it have to do with the part where you mention rules working only for CRUD operations over the resource itself?
We have managed to avoid adding a verb like "use" for resources where one could theoretically make use of it as well ("using" a db server or a node) so I'm wondering how this situation is different.
If it's related to the POST .../integration/:id/action/aws-oidc-list-databases endpoint mentioned in the RFD, then we could even argue that describing access to call an integration action could be made through checking the "create" verb for a new resource called "integration action". OTOH I've seen someone mention that it's hard to implement those checks for "virtual" resources that don't actually exist on the backend.
There's some discussion here which provides some context:
#24256 (comment)
> We have managed to avoid adding a verb like "use" for resources where one could theoretically make use of it as well ("using" a db server or a node) so I'm wondering how this situation is different.
They are protected by `<resource>_labels`, so we could also use that here.
We discussed that approach but ended up deciding on using the `VerbUse`.
As for creating a virtual resource integration action, I don't think it was considered.
I'm not aware of any issues of implementing resources without a backend. But I didn't try it.
I think it could be an alternative, but I don't actually see a big disadvantage in using `VerbUse`.
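Under the `use`-verb approach discussed above, a role granting access to Integration API calls might look like this; a sketch only, since the exact rule shape is not spelled out in the thread:

```yaml
kind: role
metadata:
  name: aws-discover
spec:
  allow:
    rules:
    - resources:
      - integration
      verbs:
      - use   # new verb: allows issuing Integration API calls
      - list  # CRUD verbs still govern the resource itself
```

This keeps the authorization check on the `integration` resource itself rather than overloading `db.list`.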
I'm really excited for an OIDC IdP starting to be added to Teleport. This will eventually unlock the ability for Teleport to issue ID tokens to processes running on hosts with Machine ID installed in order for them to be able to access cloud APIs.
One thing I would consider is issuing the token in the name of the Teleport proxy rather than the user for the purposes of AWS RDS DB discovery, especially if the Proxy itself is going to be making the call. This accounts for a future scenario where Teleport also issues tokens directly to users, but operators do not want to grant all their users the ability to list RDS DBs. For example:
`sub: system:proxy:discover`
This allows the configured trust relationship to be explicitly for the discover flow.
> - JWKS URI: the endpoint where the provider returns the public keys.
> - Claims Supported: a list of supported claims that the OpenID Provider (ie, Teleport) MAY be able to supply values for when issuing a token:
>   - `iss`: the issuer identity; in our case it will be the same as the `issuer` key.
>   - `sub`: identifies the subject of the token; in our case it will contain the user's name.
Do we want to prefix the subject with some string to identify that the token has been issued to a user? I can foresee use-cases in the future where we may want the Teleport system itself to be capable of talking to an API using OIDC for authentication without a user action triggering it.
Perhaps, for a user, `user:noah.stride@goteleport.com`, and for a system service something like `system:proxy`?
Sure, seems fair 👍
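The prefixed-subject idea can be illustrated with a minimal sketch of how the unsigned part of such a token would be assembled. This is illustrative only: the real implementation signs with the cluster's private key (RS256 is assumed from context), and the issuer URL is taken from the RFD's `proxy.example.com` examples.

```python
import base64
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding (RFC 7515)
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def unsigned_jwt(issuer: str, subject: str, audience: str) -> str:
    # Header advertises RS256; the signature itself is omitted here,
    # since producing it requires the cluster's private key.
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {"iss": issuer, "sub": subject, "aud": audience}
    return ".".join(b64url(json.dumps(p).encode()) for p in (header, claims))

# A user-triggered token vs. a system-triggered one, per the prefix idea:
user_token = unsigned_jwt(
    "https://proxy.example.com", "user:noah.stride@goteleport.com", "sts.amazonaws.com"
)
system_token = unsigned_jwt(
    "https://proxy.example.com", "system:proxy", "sts.amazonaws.com"
)
```

AWS (or any relying party) then distinguishes user-issued from system-issued tokens purely by matching on the `sub` prefix in its trust conditions.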
Force-pushed from f94fc24 to bb1e569
Proxy is making the call but the user was the one who triggered it 🤔
Perhaps an "on-behalf-of" claim or similar? I'm unsure, but I do think that identifying the proxy discover subsystem lets us craft a much more specific trust rule in AWS.
> After this step, when the user tries to list RDS DBs, Teleport generates a token and issues the API call.
>
> To generate a token, Teleport creates a JWT with the claims described above and the configured role, and signs it with the private key (the public key is provided via the public JWKS endpoint).
> JWT with the claims described above
What are these claims? I don't see any claims being described above.
They are below, not above, sorry.
Replaced "above" with "below" and added a markdown link to the section.
> The first interaction between AWS and Teleport happens when the user clicks on `Get Thumbprint` (step 4 of the User's Point of View flow).
> This will trigger a request started by AWS to Teleport hitting the OpenID Configuration endpoint and subsequently the JSON Web Key Set endpoint.
>
> After this step, when the user tries to list RDS DBs, Teleport generates a token and issues the API call.
Can you have multiple AWS integrations e.g. for different accounts? How do we pick which one to use?
Yes, multiple AWS Integrations must be supported.
I'll add that.
I'll also add that the user must be able to select which AWS Integration to use.
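For reference, the OpenID Configuration document that AWS fetches in this first interaction (before following `jwks_uri`) would look roughly like this. The host `proxy.example.com` comes from the RFD's examples; the exact path and field set are assumptions based on the OIDC discovery spec:

```json
{
  "issuer": "https://proxy.example.com",
  "jwks_uri": "https://proxy.example.com/.well-known/jwks.json",
  "claims_supported": ["iss", "sub", "aud", "exp", "iat"],
  "id_token_signing_alg_values_supported": ["RS256"]
}
```

With multiple AWS integrations per cluster, each integration would point at the same issuer but carry its own IAM role on the AWS side.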
> ```
> $ tctl create aws-integration.yaml
> ```
>
> Resource representation as `yaml`:
Can you add a section covering IaC flows to the RFD as well? Are there any changes needed to Terraform provider, Kube operator and Helm charts?
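If the Kube operator were one of the IaC flows, the resource could map onto a CRD along these lines. This is purely hypothetical: a `TeleportIntegration` kind does not exist today, and every field name here is an assumption for illustration only:

```yaml
# Hypothetical CRD manifest; the kind and all fields are illustrative only.
apiVersion: resources.teleport.dev/v1
kind: TeleportIntegration
metadata:
  name: my-aws-account
spec:
  subKind: aws-oidc
```

A Terraform resource and Helm chart values would presumably mirror the same fields.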
> We'll be able to call `DescribeDBInstances` AWS endpoint, however we must ensure that this is protected by RBAC.
>
> In order to do so, we'll re-use the `db` resource with the `list` verb to allow listing RDS Databases.
>
> ```yaml
> kind: role
> metadata:
>   name: editor
> spec:
>   allow:
>     rules:
>     - resources:
>       - db
>       verbs:
>       - list
> ```
I still don't get this tbh. In your comment you said:
> We'll have an HTTP endpoint in Proxy that calls the AWS API Describe DB Instance.
What endpoint is this and why would it be unauthenticated?
There's still authentication, however my concern is authorization.
Let's say we have a UserA that has editor role and is able to List AWS RDS DBs.
However, there's also UserB that has access role.
Should UserB be allowed to call the List AWS RDS DBs?
I added some more details on Listing AWS RDS DBs
83c735b
Let me know if that helps in understanding my concern
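To make the UserA/UserB scenario concrete: under the `db`/`list` proposal, whether the List AWS RDS DBs endpoint answers would hinge on whether the caller's role carries that rule. A sketch, with the `access` role's rules simplified for illustration:

```yaml
# UserA's role: carries db/list, so the List AWS RDS DBs endpoint responds
kind: role
metadata:
  name: editor
spec:
  allow:
    rules:
    - resources: [db]
      verbs: [list]
---
# UserB's role: no db/list rule, so the endpoint should refuse the call
kind: role
metadata:
  name: access
spec:
  allow:
    rules: []
```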
> ```json
> "Action": "sts:AssumeRoleWithWebIdentity",
> "Condition": {
>   "StringEquals": {
>     "proxy.example.com:aud": "sts.amazonaws.com"
> ```
If we used an ID token subject that was specific to the Teleport Discovery process then we could specify this here to ensure that this role can only be used by that process and not by other entities which have been issued tokens.
Something like this?
```json
"Condition": {
  "StringEquals": {
    "proxy.example.com:aud": "sts.amazonaws.com",
    "proxy.example.com:sub": "system:proxy:discover",
```

Assuming we issue tokens whose sub is `system:proxy:discover`.
What would be the pros of this approach?
Allow for multiple roles to be associated with the same provider but having different policies.
If we are strict here, then we would need users to create extra roles in order to have multiple Teleport "systems" interacting with it.
Honestly, I think we should not add that constraint, so as not to increase scope.
We can always add that condition later on if needed.
Yeah, something like that.
My main "pro" for it is that when we start issuing tokens from the Teleport OIDC provider to users or machines, it prevents them from assuming this role. I know we can change this recommendation in future, but it'll be up to operators to actually go back and update this IAM role configuration.
This isn't a hard requirement from me, I'm happy to proceed as is but I just think we need to be aware that we'll be issuing these ID tokens to things other than the Teleport system in the near future.
I'd rather avoid this as it requires further setup of the trust relationship (which is created automatically in the IdP setup flow), with not much added benefit.
Mhm, I can definitely see the concern from the UX point of view.
But I really think that, as part of the RFD, we need to acknowledge that we are opening up access to list all RDS DBs to any machine or user in the future, if an operator misses an update advisory and does not modify existing trust relationships created before we issued the change in advice. While this is certainly a low-privilege action, one concern I have is that the design from this RFD gets used with another feature that requires more privileged permissions, and because this wasn't openly acknowledged in this RFD, it will be missed.
We can add:

```json
"StringEquals": {
  "proxy.example.com:aud": "sts.amazonaws.com",
  "proxy.example.com:sub": "system:proxy",
```
And if we ever need to evolve that, we can just add another suffix to the sub.
What do you think?
@strideynet I might be misinterpreting your concern, let's chat on Slack if I'm off -
> But, I really think as part of the RFD we need to acknowledge that we are opening up access to list all RDS DBs to any machine or user in future if an operator misses an update advisory and does not modify their existing trust relationships created before we issued the change in advice
Hmm, I don't see this happening - the AWS provider is a one-time setup with an IAM policy, that grants the Teleport auth service access to AWS, which can only be accessed through the AWS integration. This mechanism would be the only thing that sets the sub when signing the OIDC JWT, so I don't see the advantage in adding a condition that's always going to evaluate to true
> Whilst this is certainly a low privilege action, one concern I have is that the design from this RfD is used with another feature that requires more privileged permissions and because this wasn't openly acknowledged in this RfD, that this will be missed.
Like above, there's only one feature that's going to use this
@marcoandredinis - @strideynet and I had a chat last night about this section and came to the compromise of setting the audience to something other than `sts.amazonaws.com`, maybe `discover.teleport` or `discover.proxy-url`.
As per the AWS docs the audience is meant to be the client ID of the application, so setting it to something Discover related is more suitable anyway.
This means we can avoid setting a subject as we're not going to conflict with any future developments of the IdP provider (they'll be requesting a different audience).
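With that compromise, the trust relationship's condition would look something like this; a sketch in which only the `aud` value differs from the earlier excerpt:

```json
{
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "proxy.example.com:aud": "discover.teleport"
    }
  }
}
```

Future IdP uses would request a different audience, so no `sub` condition is needed to keep this role scoped to Discover.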
mdwn left a comment:
We should specifically call out which auth flows we support for this OIDC. One thing I ran into while experimenting with this is back channel vs. front channel auth. Back channel means there has to be direct communication between the IdP and the receiving service, which may or may not be the case depending on a user's setup.
I mention this because Okta in particular is opinionated about the use of the id_token flow, which could be a problem when generalizing this.
Handy reference: https://www.scottbrady91.com/openid-connect/openid-connect-flows
> 3. Configures the Teleport instance as an OIDC IdP:
>    1. Clicks `Add provider` of type `OpenID Connect`.
>    2. Fills in the Provider URL as provided by the Discover Wizard.
>    3. Fills in the Audience as provided by the Discover Wizard: `sts.amazonaws.com`.
Does the audience have to be sts.amazonaws.com? We could skip this step if we had a fixed audience as part of the OIDC IdP in Teleport.
Whilst AWS doesn't enforce it, and I'd argue they should, sts.amazonaws.com definitely makes the most semantic sense for the value here as that's the host of the AWS service that allows the OIDC ID token to be swapped for AWS credentials. I imagine they don't enforce it universally as not too long ago some CI/CD providers did not allow you to customise the audience of the ID tokens issued to workloads.
I've changed the audience to reflect the client.
Per the AWS docs:
> The client ID (also known as audience) is a unique identifier for your app that is issued to you when you register your app with the IdP
So the audience is going to be `discover.teleport`.
Force-pushed from bb1e569 to 07b1cd3
Force-pushed from 323adba to 79bf009
Got this question after reviewing #23989. What's the use-case for having different statuses? I.e. why not just create the integration in the "running" state? And what's the scenario in which it will be placed in "error" state and how to recover from it?
Just wondering if it's really needed.
The integration may become unhealthy (OIDC IdP was removed from AWS, Role doesn't allow for required api calls, thumbprint changed).
In that case, there will be a Cluster Alert and the Integration will switch to Error Status.
The Paused status is to allow the user to disable the Integration without removing it.
Maybe that's something we can work later on, and is not really needed for this first iteration.
If the proxy rotates its SSL certificate the integration will also become unhealthy as AWS saves and compares the fingerprint of the SSL certificate on setup
It's not quite that drastic, it actually uses the fingerprint of the TLS certificate's CA. So you can rotate certificates as long as it's issued by the same CA.
@marcoandredinis I understand the general intent but tbh it just raises more questions: how would Teleport detect that it's unhealthy and why; is the unhealthiness permanent or temporary; would we detect that it's healthy again after being unhealthy; and what does the user need to do in that case? And so on.
We did have some ideas (or rather, "aspirations") for upstream resource health checks (e.g. #20544) but it feels like it's a large enough problem that would require a separate RFD and consistent approach we can use for all resources.
I would cut this from the scope of this RFD and implementation for now.
Sounds fair 👍
I'll remove the status from the RFD and implementation (backport to V12 is not merged yet, so it will wait to include this removal)
@r0mant I'd like to emphasize that I think this flow will work great for smaller teams that are just starting out with their AWS account. We must have a parallel IaC Terraform flow for more mature teams that would like to configure everything declaratively, without magic.
Force-pushed from 79bf009 to 036c2b7
Rendered
Part of #22129