RFD 0094: Kubernetes node joining #17905

Merged
hugoShaka merged 5 commits into master from rfd/0094-kubernetes-node-joining on Nov 29, 2022
Conversation

@hugoShaka (Contributor) commented Oct 27, 2022

Rendered version

This RFD proposes a mechanism to join Teleport nodes living in the same Kubernetes cluster as the auth nodes. This is part of our Q4 efforts to improve the Helm experience. It might also benefit Cloud hosting; I did not focus on that use case but asked a couple of questions here.

@hugoShaka hugoShaka added the rfd Request for Discussion label Oct 27, 2022
@hugoShaka hugoShaka requested a review from r0mant October 27, 2022 23:28
@hugoShaka hugoShaka force-pushed the rfd/0094-kubernetes-node-joining branch from 6cea8ab to e494e6f November 1, 2022 19:02
@klizhentas (Contributor) left a comment

@hugoShaka @r0mant @reedloden your token based approach makes sense, but the design and the code have to be audited.

@strideynet (Contributor) left a comment

I looked into Kubernetes joining some time ago and the approach described under "Approach 1" roughly lines up with what I concluded was the best path forward.

I have a few thoughts here as this work is quite similar to the CircleCI and GitHub joining work I've completed recently.

One thing that occurs to me is that we shouldn't rule this out as not being useful for Teleport Cloud customers. With some minor tweaks to this RFD, it could let Kubernetes workloads join a cluster from clusters where the Auth Server is not running.

In cases where their Kubernetes API server is publicly exposed, we could allow them to configure the address of the API server within the ProvisionToken spec. If this address is not configured, the Auth Server could fall back to the k8s API detectable from its environment.

In reality, many Kubernetes API servers are not publicly exposed in their entirety. However, the Kubernetes documentation on service account tokens does mention that exposing just the discovery endpoints through some kind of service is a valid option:

In many cases, Kubernetes API servers are not available on the public internet, but public endpoints that serve cached responses from the API server can be made available by users or service providers. In these cases, it is possible to override the jwks_uri in the OpenID Provider Configuration so that it points to the public endpoint, rather than the API server's address, by passing the --service-account-jwks-uri flag to the API server. Like the issuer URL, the JWKS URI is required to use the https scheme.

Even if we decide against supporting this in the initial version, I think it's worth acknowledging this within the RFD and stating why we don't want to support it.

Comment thread on rfd/0094-kubernetes-node-joining.md (outdated)

Contributor:
We should aim for consistency with the CircleCI and GitHub work, and pull the configuration for Kubernetes joining into a separate block rather than reusing the overloaded spec.allow, e.g.:

spec:
  k8s:
    kube_api_server: ""
    allow:
    - service_account: "my-namespace:my-service-account"

I'd also suggest (as shown) that we allow the URL of a Kubernetes API server to be directly configured. When this value is omitted, the Auth Server should try to detect it from its environment, but by making it configurable, we make automatic joining possible from Kubernetes clusters that the auth server is not within.
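For illustration, matching such an allow rule against the identity Kubernetes reports for a valid service account token (usernames have the form `system:serviceaccount:<namespace>:<name>`) could look like the following Go sketch. `serviceAccountAllowed` is a hypothetical helper, not code from the RFD or from Teleport:

```go
package main

import (
	"fmt"
	"strings"
)

// serviceAccountAllowed checks a Kubernetes-reported username such as
// "system:serviceaccount:my-namespace:my-service-account" against the
// token resource's allow rules, which use the shorter
// "<namespace>:<service-account>" form shown in the spec above.
func serviceAccountAllowed(username string, allowRules []string) bool {
	const prefix = "system:serviceaccount:"
	if !strings.HasPrefix(username, prefix) {
		return false // not a service account identity
	}
	got := strings.TrimPrefix(username, prefix)
	for _, rule := range allowRules {
		if got == rule {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"my-namespace:my-service-account"}
	fmt.Println(serviceAccountAllowed("system:serviceaccount:my-namespace:my-service-account", allow))
	fmt.Println(serviceAccountAllowed("system:serviceaccount:other:sa", allow))
}
```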

@hugoShaka (Contributor, author) commented Nov 3, 2022

I think we want to rely on Kubernetes to validate the tokens instead of using the OIDC discovery endpoint and validating this on our own. Kubernetes does much more than just checking validity, audience and expiry and we cannot benefit from that without valid Kubernetes credentials.

In this context, making the apiserver configurable makes sense, but this would imply:

  • having a way to trust the remote Kubernetes certs. Either adding them to the trust store or passing them in the resource.
  • a way to pass Kubernetes credentials (I was planning on only using the in-cluster kube client because it made everything easy)

We cannot rely on env vars for this because it would not work if the user creates different tokens targeting distinct Kubernetes clusters. We would either have to extend the token definition to contain a working kubeconfig, or have the user provide kubeconfig files on all auth nodes. We might also encounter extra issues with cloud-provider-specific auth methods.

I totally see the value of being able to trust additional clusters though. If you're OK with it, I suggest we mention this as a possible next step in the RFD but keep the "remote apiserver" feature out of the first implementation, to keep it focused on our current Helm chart issues.

Contributor:

Awesome, I wasn't sure how much additional work the special k8s API performed on token validation, but it definitely sounds like it's bringing value.

I think I definitely agree with keeping this as a next step - just useful to make sure we record what's in and out of scope.

@hugoShaka hugoShaka requested a review from strideynet November 3, 2022 18:46
@hugoShaka hugoShaka force-pushed the rfd/0094-kubernetes-node-joining branch from cab94c1 to 0d5b6a1 November 7, 2022 21:25
@hugoShaka hugoShaka requested a review from strideynet November 7, 2022 21:25
@hugoShaka hugoShaka marked this pull request as ready for review November 7, 2022 21:25
@strideynet (Contributor) left a comment

Looks good - just a few random nits.

Co-authored-by: Noah Stride <noah.stride@goteleport.com>
@r0mant (Collaborator) left a comment

lgtm with a couple of comments

Comment thread on rfd/0094-kubernetes-node-joining.md (outdated)
Comment on lines +34 to +35
- As a Teleport power user, I want to easily join my custom sets of Teleport
nodes on Kubernetes so that I have less join automation to build and maintain.
Collaborator:

I don't know if deploying nodes, as in "SSH nodes", in the same Kube cluster has much value tbh, since we have Kube Access. I would maybe say "Teleport services like Machine ID or plugins".

Contributor (author):

I was thinking about kubernetes_service, app_service, db_service, discovery_service, windows_desktop_service. I'll make the list explicit.

Contributor (author):

done

Comment on lines +332 to +335
Joining nodes from outside the Kubernetes cluster would imply generating and
exporting a Kubernetes token from the service account. While this pattern is
feasible, it does not provide much added value compared to generating a static
token from Teleport directly.
Collaborator:

Would it make sense to provide more details on how it would work, maybe in a "future work" section? I just feel like having a "Kubernetes" join method that does not support agents joining from other Kube clusters is strange. If you run everything on Kubernetes, you'd probably want to use the same join method even if your workloads are in different clusters. For comparison, the IAM join method supports nodes joining from different AWS accounts.

@hugoShaka (Contributor, author) commented Nov 9, 2022

It should work the same way, except Teleport uses the kubeconfig from the resource instead of the inCluster config.

My biggest concern is that managed Kubernetes offerings like GKE and EKS require more than a kubeconfig to interact with them. Suddenly we would have to support all the ways of passing service accounts to Teleport, plus each cloud's auth provider in the Go code. The spec would look like:

kind: token
version: v2
metadata:
  name: proxy-token
  expires: "3000-01-01T00:00:00Z"
spec:
  roles: ["Proxy"]
  k8s:
    allow:
      - service_account: "my-namespace:my-service-account"
    # empty authType defaults to "inCluster" for compatibility
    authType: "inCluster|kubeconfig|gcp|aws|azure|rancher|..."
    kubeconfig:
      certificate-authority-data: base64
      server: http://1.2.3.4
      # either set token or user && client-*
      token: my-secret-token
      user: client-sa-name
      client-certificate-data: base64
      client-key-data: base64
    gcp:
      # optional SA account key/name
      # if they are empty, teleport should try to use ambient credentials (see google.DefaultTokenSource)
      service-account-key: base64
      service-account-name: "foo@sa-project.iam.gserviceaccount.com"
      # mandatory fields to identify the cluster
      project: cluster-project
      location: us-east1
      cluster-name: my-cluster
    aws:
      # optional SA account id/secret
      # if they are empty, Teleport should try to use ambient credentials
      key-id: ASDFGHJK
      key-secret: QWERTYUIOP
      # mandatory
      region: us-east-1
      account-id: 1234567890
      cluster-name: my-cluster
    azure:
    # [...]
    rancher:
    # [...]

Users would have to map the cloud's service account to a Kubernetes user/group and create a RoleBinding allowing this group to use the TokenReview API. This step is cloud-specific and can be straightforward (as in GKE) or complex (hello AWS).

I'll add those details to the RFD so they can be reviewed at the same time by the auditors.

Contributor (author):

done

@reedloden (Contributor) left a comment

Approving based on #18659 (review)

@hugoShaka hugoShaka enabled auto-merge (squash) November 28, 2022 21:53
@hugoShaka hugoShaka merged commit 2bb4e59 into master Nov 29, 2022
hugoShaka added a commit that referenced this pull request Dec 2, 2022
This commit adds a new joinMethod as described in #17905

This method allows pods running in the same Kubernetes cluster as the auth servers to join the Teleport cluster. It relies on Kubernetes tokens to establish trust. The goal is to be able to deploy proxies and auths separately and join them in a single cluster.

Before Kubernetes 1.20, tokens are static, long-lived, and not bound to pods. We support them for compatibility reasons. Starting with Kubernetes 1.20, tokens are bound to pods (and starting with 1.21 they can be mounted through projected volumes). Starting with 1.21 we should only accept bound tokens. The chart will ensure tokens are properly mounted with projected volumes so we can benefit from short token lifetimes (1h down to 10min).
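The projected-volume mounting mentioned above can be sketched as a pod spec fragment. The field names are standard Kubernetes; the volume name and audience value are illustrative, not taken from the chart:

```yaml
# Hypothetical pod spec fragment: mount a short-lived, pod-bound service
# account token through a projected volume (Kubernetes 1.21+).
volumes:
  - name: join-sa-token
    projected:
      sources:
        - serviceAccountToken:
            path: token
            expirationSeconds: 600         # 10 min; the kubelet rotates it proactively
            audience: example.teleport.sh  # illustrative audience value
```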
hugoShaka added a commit that referenced this pull request Feb 16, 2023 (same commit message)
hugoShaka added a commit that referenced this pull request Feb 20, 2023 (same commit message)
hugoShaka added a commit that referenced this pull request Feb 24, 2023 (same commit message)
hugoShaka added a commit that referenced this pull request Feb 24, 2023
Backport #18659

### About the backport

This backport provides a rollback path from v12 to v11. Helm users upgrading to v12 end up with Kubernetes tokens automatically created; currently, if they roll back to v11, the token is unknown and Teleport crashes during cache warming. Once this is backported, users will have a v11 version they can roll back to if the v12 upgrade fails.
@hugoShaka hugoShaka deleted the rfd/0094-kubernetes-node-joining branch June 30, 2023 15:18
Labels: rfd (Request for Discussion)

5 participants