add support for running agent helm chart on persistent volume #7123
Conversation
#7096 does a similar thing but seems a lot simpler (although it doesn't add linting as your PR does, etc.) - what's the incentive for using a StatefulSet here?
It's not totally clear to me how #7096 is meant to work - whether it's attempting to use a localDir for persistence, or some other way to map a single volume. Since the Deployment controller doesn't normally account for persistence, I'm not sure that's a recommended model, or that it won't run into problems when, say, bumping the container tag: the Deployment controller would then want to replace the pod with a new one, temporarily running two instances against the same persistence store. The StatefulSet controller is more geared towards handling persistence. In this approach the prerequisite is for the cluster to be connected to a cloud environment / storage controller with a storage class defined, and the volumes will be provisioned by Kubernetes in the cloud environment when the chart is installed. The StatefulSet controller also understands the volumes, so it won't try to run more than one pod connected to the same volume, or schedule pods in AZs other than the volume's. It will probably also be more appropriate if we want to create redundant deployments by increasing the replicas. TL;DR: StatefulSets are geared towards applications with state, Deployments are geared towards stateless workloads (or more specifically, ephemeral storage for pods).
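To illustrate the model (a minimal sketch, not this chart's actual template - the names, image tag, and storage class below are just placeholders):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: teleport-agent                            # hypothetical name, for illustration only
spec:
  serviceName: teleport-agent
  replicas: 1
  selector:
    matchLabels:
      app: teleport-agent
  template:
    metadata:
      labels:
        app: teleport-agent
    spec:
      containers:
        - name: teleport
          image: quay.io/gravitational/teleport:6 # placeholder image/tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/teleport        # the state directory to persist
  # the StatefulSet controller creates one PVC per pod from this template
  # and always reattaches the same volume to the same pod identity
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp2                     # assumes such a storage class exists in the cluster
        resources:
          requests:
            storage: 128Mi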
@klizhentas @awly What are your thoughts on this from a product standpoint? Are we looking to prioritise the work to use the Kubernetes API for certificate storage, or should we look to merge a chart-based solution for this in the short term?
It should... This really just leverages the Kubernetes support for persistent volumes, which can integrate with various cloud providers and different plugins. So it just requires having or configuring a StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/) that matches the particular cloud provider, and when installing the helm chart, pointing at the StorageClass created by the customer, which will carry the cloud-provider-specific parameters.
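For example (hypothetical names; AWS shown here, but other providers have their own provisioners and parameters), the customer-defined storage class and the matching chart values could look roughly like:

# storageclass.yaml - created by the customer for their cloud provider
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: teleport-agent-storage       # arbitrary name chosen by the customer
provisioner: kubernetes.io/aws-ebs   # cloud-provider-specific plugin
parameters:
  type: gp2                          # provider-specific parameters live here

# values passed when installing the chart, pointing it at that storage class
storage:
  enabled: true
  storageClassName: teleport-agent-storage
  requests: 128Mi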
awly
left a comment
Overall I like this change.
To release it as a part of our public Helm chart we should at least document how to set this up (maybe link to cloud provider-specific docs if available).
Also, if it's possible to migrate from a non-stateful setup, we should document how. If it's not possible, we need a warning somewhere.
storage:
  enabled: false
  storageClassName: ""
  requests: 128Mi
I actually have no idea how big the teleport storage needs are.
Do you have any numbers on the real-world size?
If this needs to be increased after deployment, will existing data be preserved?
> I actually have no idea how big the teleport storage needs are.
> Do you have any numbers on the real-world size?
Over the last 7 days I've had this deployed and idle, and it's used 19MiB. I may have assumed incorrectly that, because this is a node agent and logs/audit/etc. are meant to be centralized, the default for this volume can be relatively small.
Is there something I might be missing that would suggest a larger default may be needed here?
> If this needs to be increased after deployment, will existing data be preserved?
It depends on how the storage class is set up, and presumably on whether the block driver supports it. The storage class has a boolean field called allowVolumeExpansion, and if it's set, the volume can be expanded without data loss. It will interrupt the pod temporarily since the volume needs to be unmapped and the filesystem expanded (Edit: that's on AWS - other systems might support online expansion).
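Roughly, assuming the storage class and driver allow it, the expansion would look like this (same hypothetical names as above; for an existing StatefulSet volume, the PVC itself is edited rather than the chart values):

# the storage class must opt in to expansion (set before resizing the claim)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: teleport-agent-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
allowVolumeExpansion: true

# the existing PersistentVolumeClaim is then edited directly
# (e.g. `kubectl edit pvc <claim-name>`), raising the requested size:
#   spec:
#     resources:
#       requests:
#         storage: 256Mi            # example new size; data on the volume is preserved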
I think this volume can be fairly small - no logs are really kept locally (they all go to Docker/containerd) and the sqlite database for the agent itself should only be holding a keypair and some cache data.
Are these agent helm charts documented anywhere in general? I see a few references to the other chart in the docs, and I guess there is a values reference for this chart (https://goteleport.com/docs/kubernetes-access/helm/reference/#teleport-kube-agent), which is very confusing to read, but I don't see any specific docs on the agent charts.
It's not really a migration: when not using the volume, every startup of teleport is a new installation. When adding the persistent volume, the first start of the pod will "install" teleport by persisting the /var/lib/teleport directories, so any subsequent creation of the pod just picks up the data from the last startup. So I don't think there is really a migration here, other than that you no longer need long-lived tokens (or, like us, have to manually reset the token every time the pod restarts).
webvictim
left a comment
There aren't decent docs for the teleport-kube-agent chart AFAIK - the only real reference is under https://goteleport.com/docs/kubernetes-access/controls/#kubernetes-labels.
@inertial-frame @benarent Do we have plans to write guides for using teleport-kube-agent similar to those for teleport-cluster?
I think that to merge this, we would at least need to add the additional values to the chart reference - unclear or not, the reference is the canonical source for all the chart's options.
@knisbet @webvictim yeah, we can just expand comments in
Added some more text in 25533c5. Mainly trying to follow how it's already set up.
# nodeSelector to apply for pod assignment
# https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
nodeSelector: {}
Please add this to the correct section of https://github.com/gravitational/teleport/blob/master/docs/pages/kubernetes-access/helm/reference.mdx as well to keep everything documented. You can just copy a section and change the appropriate parts.
Alright, I added it to the reference doc.
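For reference, a values override using this field might look something like this (the labels are just examples):

# values.yaml override - constrain agent pods to matching nodes
nodeSelector:
  kubernetes.io/os: linux             # well-known node label
  teleport.example.com/agent: "true"  # hypothetical custom node label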
Co-authored-by: Gus Luxton <gus@goteleport.com>
…o kevin/master/helm-stateful2
* add support for running agent on persistent volume
* address review feedback
* add more comments on the fields
* update statefulset to be compatible with upstream change
* Add ability to configure postStart handler for teleport-cluster chart (#7168)
* Add support for PDB with the teleport-cluster helm chart (#7138)
* Allow teleport-cluster-agent chart to use an existing volume for the data directory (#7096)
* Mount teleport-tls to the init container for the teleport-cluster helm chart (#7166)
* add support for running agent helm chart on persistent volume (#7123)
* helm: Make auth type configurable (#7508)
* Add imagePullSecrets in kube-agent chart (#6941)

Co-authored-by: Gus Luxton <gus@goteleport.com>
Co-authored-by: Stefan Sedich <stefan.sedich@gmail.com>
Co-authored-by: Kevin Nisbet <kevin@gravitational.com>
Co-authored-by: Antonio Gurgel <66754096+antonio-te@users.noreply.github.com>
Co-authored-by: Aymen Memni <7290430+amemni@users.noreply.github.com>
Problem: In Teleport Cloud we are set up to automatically rotate the join tokens for the teleport cluster. The helm chart isn't currently compatible with this, as the join token gets added as a kubernetes secret but is not kept in sync with the cluster.
This is an addition to optionally allow the pods to be deployed as a statefulset with a persistent volume from kubernetes/cloud to maintain the state directory through pod restarts. I believe the Teleport team is also looking at using the kubernetes API to store this state, but I don't know if/when that will be available, so this seemed like a reasonable solution.
Submitting it here if the core team is interested, otherwise we'll have to fork the chart for use in the cloud.
I originally approached this as a bunch of replacements on the deployment file, changing the API and other parameters inline, but that approach seemed fragile given the number of tweaks. So I went with this approach of disabling the deployment and adding the statefulset when persistent storage is enabled, which largely duplicates the object definition and will require certain updates to be made in both places.
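As a rough sketch of that gating (not the chart's exact template code; only the .Values.storage.enabled key comes from this PR's values):

# templates/deployment.yaml - rendered only when persistent storage is disabled
{{- if not .Values.storage.enabled }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
# ... existing Deployment pod spec, unchanged ...
{{- end }}

# templates/statefulset.yaml - rendered only when persistent storage is enabled
{{- if .Values.storage.enabled }}
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: {{ .Release.Name }}
# ... largely the same pod spec, plus volumeClaimTemplates ...
{{- end }}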
Also, I'm not super familiar with the linting approach used with helm, but I think I added the relevant tests.