
add support for running agent helm chart on persistent volume #7123

Merged
knisbet merged 16 commits into master from kevin/master/helm-stateful2 on Jul 22, 2021

Conversation

@knisbet
Contributor

@knisbet knisbet commented Jun 1, 2021

Problem: In Teleport Cloud we are set up to automatically rotate the join tokens for the teleport cluster. The helm chart isn't currently compatible with this, as the join token gets added as a Kubernetes secret but is not kept in sync with the cluster.

This is an addition that optionally allows the pods to be deployed as a StatefulSet with a persistent volume from Kubernetes/the cloud provider, to maintain the state directory through pod restarts. I believe the Teleport team is also looking at using the Kubernetes API to store this state, but I don't know if/when that will be available, so this seemed like a reasonable solution.

Submitting it here if the core team is interested, otherwise we'll have to fork the chart for use in the cloud.

I originally approached this as a set of replacements on the deployment file, changing the API version and other parameters inline, but that approach seemed fragile given the number of tweaks. So I went with this approach instead: disable the Deployment and add the StatefulSet when persistent storage is enabled. This largely duplicates the object definition, so certain updates will need to be made in both places.

Also, I'm not super familiar with the linting approach used with helm, but I think I added the relevant tests.

@knisbet knisbet requested review from awly and webvictim June 1, 2021 14:03
@webvictim
Contributor

#7096 does a similar thing but seems a lot simpler (although it doesn't add linting as your PR does) - what's the incentive for using a StatefulSet instead?

@knisbet
Contributor Author

knisbet commented Jun 1, 2021

It's not totally clear to me how #7096 is meant to work, whether it's attempting to use a localDir for persistence or some other way to map a single volume. Since the Deployment controller doesn't normally account for persistence, I'm not sure this is a recommended model; it could run into problems when, say, bumping the container tag, where the Deployment controller would want to replace the pod with a new one and temporarily run two instances against the same persistence store.

The StatefulSet controller is better geared towards handling persistence. In this approach the prerequisite is that the cluster is connected to a cloud environment / storage controller with a StorageClass defined; the volumes are then provisioned by Kubernetes in the cloud environment when the chart is installed. The StatefulSet controller also understands the persistence model: it won't try to run more than one pod attached to the same volume, or schedule pods in a different AZ than the volume. It will probably also be more appropriate if we want to create redundant deployments by increasing the replica count.

TL;DR: StatefulSets are geared towards applications with state; Deployments are geared towards stateless applications (or more specifically, pods with ephemeral storage).
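For context, a trimmed sketch of what the StatefulSet variant described here could look like (the names, image tag, and storage class are illustrative, not the chart's actual template):

```yaml
# Illustrative only: a StatefulSet binds each pod to its own
# PersistentVolumeClaim via volumeClaimTemplates, so the data
# directory survives pod restarts and re-creations.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: teleport-kube-agent        # hypothetical name
spec:
  serviceName: teleport-kube-agent
  replicas: 1
  selector:
    matchLabels:
      app: teleport-kube-agent
  template:
    metadata:
      labels:
        app: teleport-kube-agent
    spec:
      containers:
        - name: teleport
          image: quay.io/gravitational/teleport:6.2.0   # illustrative tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/teleport
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2       # assumed; depends on the cluster
        resources:
          requests:
            storage: 128Mi
```

The key difference from a Deployment is the volumeClaimTemplates section: the controller creates and tracks one PVC per ordinal pod, rather than treating every replacement pod as disposable.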

@webvictim
Contributor

@klizhentas @awly What are your thoughts on this from a product standpoint? Are we looking to prioritise the work to use the Kubernetes API for certificate storage, or should we look to merge a chart-based solution for this in the short term?

@awly
Contributor

awly commented Jun 3, 2021

Storing state in k8s API (#4832) is not on the horizon any time soon.
Adding support for persistent volumes in our chart in the meantime makes sense.
The only reason I didn't add it originally was cross-platform compatibility.

@knisbet will your change work across at least AWS/GCP/Azure?

@knisbet
Contributor Author

knisbet commented Jun 4, 2021

@knisbet will your change work across at least AWS/GCP/Azure?

It should...

This really just leverages the Kubernetes support for persistent volumes, which can integrate with the various cloud providers and different plugins. It just requires having or configuring a StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/) that matches the particular cloud provider; when installing the helm chart, you point at the StorageClass created by the customer, which carries the cloud-provider-specific parameters.
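For example, on AWS a customer-created StorageClass might look something like the following (the name here is hypothetical; the provisioner and parameters are cloud-provider-specific):

```yaml
# Illustrative StorageClass for AWS EBS; other clouds swap in their
# own provisioner (e.g. GCE PD or Azure Disk) and parameters.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: teleport-agent-storage    # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
```

The chart install would then reference that name via the chart's storageClassName value (assuming the values layout in this PR), and Kubernetes would provision the volumes in the cloud environment.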

Contributor

@awly awly left a comment


Overall I like this change.
To release it as a part of our public Helm chart we should at least document how to set this up (maybe link to cloud provider-specific docs if available).
Also, if it's possible to migrate from a non-stateful setup, we should document how. If it's not possible, we need a warning somewhere.

storage:
  enabled: false
  storageClassName: ""
  requests: 128Mi
Contributor

I actually have no idea how big the teleport storage needs are.
Do you have any numbers on the real-world size?
If this needs to be increased after deployment, will existing data be preserved?

Contributor Author

@knisbet knisbet Jun 8, 2021

I actually have no idea how big the teleport storage needs are.
Do you have any numbers on the real-world size?

Over the last 7 days that I've had this deployed (idle), it's used 19MiB. I may have assumed incorrectly that, because this is a node agent and logs/audit/etc. are meant to be centralized, the default for this volume can be relatively small.

Is there something I might be missing that would suggest a larger default may be needed here?

If this needs to be increased after deployment, will existing data be preserved?

It depends on how the storage class is set up, and presumably on whether the block driver supports it. The storage class has a boolean field called allowVolumeExpansion; if it's set, the volume can be expanded without data loss. It will interrupt the pod temporarily, since the volume needs to be unmapped and the filesystem expanded (Edit: that's on AWS; other systems might support online expansion).
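As a sketch, expansion requires the StorageClass to opt in; after that, raising the requested size on the bound PersistentVolumeClaim triggers the resize (names below are hypothetical):

```yaml
# Illustrative StorageClass that permits in-place volume expansion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-storage        # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
allowVolumeExpansion: true
```

With that in place, bumping spec.resources.requests.storage on the PVC (e.g. via kubectl edit pvc) grows the underlying volume; data is preserved, though the pod may restart while the filesystem is expanded.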

Contributor

I think this volume can be fairly small - no logs are really kept locally (they all go to Docker/containerd) and the sqlite database for the agent itself should only be holding a keypair and some cache data.

Comment thread examples/chart/teleport-kube-agent/templates/deployment.yaml Outdated
Comment thread examples/chart/teleport-kube-agent/templates/statefulset.yaml Outdated
Comment thread examples/chart/teleport-kube-agent/templates/statefulset.yaml
@knisbet
Contributor Author

knisbet commented Jun 8, 2021

Overall I like this change.
To release it as a part of our public Helm chart we should at least document how to set this up (maybe link to cloud provider-specific docs if available).

Are these agent helm charts documented anywhere in general? I see a few references to the other chart in the docs, and I guess there is a values reference for this chart (https://goteleport.com/docs/kubernetes-access/helm/reference/#teleport-kube-agent), which is very confusing to read, but I don't see any docs specific to the agent charts.

Also, if it's possible to migrate from non-stateful setup, we should document how. If it's not possible, we need a warning somewhere.

It's not really a migration: when not using the volume, every startup of teleport is a new installation. When adding the persistent volume, the first start of the pod "installs" teleport by persisting the /var/lib/teleport directories, so subsequent creations of the pod just reuse the data from the last startup. So I don't think there is really a migration here, other than you no longer need long-lived tokens (or, like us, have to manually reset the token every time the pod restarts).
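Under the values layout quoted earlier in the review, enabling persistence would then be a values override along these lines (a sketch; only the key names come from the PR, the storage class value is assumed):

```yaml
# Values override for the teleport-kube-agent chart (sketch).
# Key names taken from the storage block quoted in the review above.
storage:
  enabled: true
  storageClassName: gp2    # assumed; whatever StorageClass the cluster provides
  requests: 128Mi
```

With storage.enabled left false, the chart behaves as before and each pod start is a fresh install.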

Contributor

@webvictim webvictim left a comment


There aren't decent docs for the teleport-kube-agent chart AFAIK - the only real reference is under https://goteleport.com/docs/kubernetes-access/controls/#kubernetes-labels.

@inertial-frame @benarent Do we have plans to write guides for using teleport-kube-agent similar to those for teleport-cluster?

I think that to merge this, we would at least need to add the additional values to the chart reference - unclear or not, the reference is the canonical source for all the chart's options.

Comment thread examples/chart/teleport-kube-agent/templates/statefulset.yaml

@awly
Contributor

awly commented Jun 9, 2021

@knisbet
Contributor Author

knisbet commented Jun 10, 2021

@knisbet @webvictim yeah, we can just expand comments in values.yaml and add a section in https://github.com/gravitational/teleport/blob/master/examples/chart/teleport-kube-agent/README.md

Added some more text in 25533c5

Mainly trying to follow how it's already set up.

Comment thread examples/chart/teleport-kube-agent/values.yaml Outdated
Comment thread examples/chart/teleport-kube-agent/values.yaml Outdated
Comment on lines +111 to +113
# nodeSelector to apply for pod assignment
# https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
nodeSelector: {}
Contributor

Please add this to the correct section of https://github.com/gravitational/teleport/blob/master/docs/pages/kubernetes-access/helm/reference.mdx as well to keep everything documented. You can just copy a section and change the appropriate parts.

Contributor Author

Alright, I added to the reference doc.

Co-authored-by: Gus Luxton <gus@goteleport.com>
@knisbet knisbet requested a review from webvictim July 21, 2021 14:46
Contributor

@webvictim webvictim left a comment


LGTM with a few nits

Comment thread docs/pages/kubernetes-access/helm/reference.mdx Outdated
Comment thread docs/pages/kubernetes-access/helm/reference.mdx Outdated
Comment thread docs/pages/kubernetes-access/helm/reference.mdx Outdated
Comment thread docs/pages/kubernetes-access/helm/reference.mdx Outdated
@corkrean corkrean added the c-bi Internal Customer Reference label Jul 22, 2021
@knisbet knisbet merged commit 4cacc2c into master Jul 22, 2021
Contributor

@russjones russjones left a comment


Bot.

@knisbet knisbet deleted the kevin/master/helm-stateful2 branch July 22, 2021 16:26
webvictim pushed a commit that referenced this pull request Aug 20, 2021
* add support for running agent on persistent volume

* address review feedback

* add more comments on the fields

* update statefulset to be compatible with upstream change
webvictim added a commit that referenced this pull request Aug 26, 2021
* Add ability to configure postStart handler for teleport-cluster chart (#7168)

* Add support for PDB with the teleport-cluster helm chart (#7138)

* Allow teleport-cluster-agent chart to use an existing volume for the data directory (#7096)

* Mount teleport-tls to the init container for the teleport-cluster helm chart (#7166)

* add support for running agent helm chart on persistent volume (#7123)

* helm: Make auth type configurable (#7508)

* Add imagePullSecrets in kube-agent chart (#6941)

Co-authored-by: Gus Luxton <gus@goteleport.com>
Co-authored-by: Stefan Sedich <stefan.sedich@gmail.com>
Co-authored-by: Kevin Nisbet <kevin@gravitational.com>
Co-authored-by: Antonio Gurgel <66754096+antonio-te@users.noreply.github.com>
Co-authored-by: Aymen Memni <7290430+amemni@users.noreply.github.com>

Labels

c-bi Internal Customer Reference helm

6 participants