Conversation
Force-pushed from 4045fb0 to f56ff5d
We might also want folks from the scalability team to take a look at this once we're happy with the design.
> Proxies will have an initContainer checking if all auth pods from the past
> version were removed. Version check via the Teleport gRPC api (`PingResponse`)
> requires valid credentials to connect to Teleport. To work around this issue we
> can rely on Kubernetes's service discovery through DNS to discover how many
> pods are running which version:
>
> - the chart labels auth pods with their major teleport version
> - the chart creates two headless services:
>   - one selecting pods with the current major version `teleport-auth-v11`
>   - one selecting pods with the previous major version `teleport-auth-v10`
> - proxy pods have an initContainer
> - the `v11` initContainer resolves `teleport-auth-v10` every 5 seconds until no
>   IP is returned
> - the initContainer exits, the proxy starts
> - this unlocks the proxy deployment rollout
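For illustration, a minimal sketch of what the version-pinned headless service and the proxy initContainer described above could look like. The service name, label keys, and the busybox-based DNS loop are assumptions for the sketch, not the chart's actual implementation (busybox `nslookup` exit codes also vary between versions):

```yaml
# Headless service selecting only the previous-major-version auth pods.
apiVersion: v1
kind: Service
metadata:
  name: teleport-auth-v10             # hypothetical name, one service per major version
spec:
  clusterIP: None                     # headless: DNS returns one record per ready pod
  selector:
    app.kubernetes.io/name: teleport-auth
    teleport.dev/major-version: "10"  # hypothetical label set by the chart
---
# Fragment of the v11 proxy Deployment's pod spec (not a standalone manifest):
# block the proxy rollout until the v10 service no longer resolves.
initContainers:
  - name: wait-for-auth-rollout
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        while nslookup teleport-auth-v10 >/dev/null 2>&1; do
          echo "v10 auth pods still present, waiting 5s..."
          sleep 5
        done
```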
This is unsafe with a StatefulSet because a pod killed during a rollout will be re-created with the latest version. Deployments are implemented on top of ReplicaSets, which ensures this does not happen and we cannot have a capacity loss because of a stuck rollout.
I wonder if it would be possible for the chart to refuse `sessionRecording: proxy`. In the previous chart, running in proxy mode was almost identical to running in auth mode (almost, because if users manually added proxies it would have had a different behaviour).
I believe that in order to improve the overall experience we have to provide a more opinionated deployment of Teleport, one that works great in 99% of the use-cases. This could be one of those changes.
I discussed with @r0mant and it seems that not supporting the proxy session recording mode does not align with our strategy. However, we could gracefully handle emptying the session recording queue. This would mean proxy session recording would only lose data on hardware failure, not on controlled disruptions like autoscaling events or version updates/configuration rollouts.
After discussing with several people, it appears graceful shutdown is not currently supported by the chart, and recordings are not even shipped before exiting. Working around this with a graceful restart also raises issues, because the restart will hang until user sessions end, and the proxy does not preemptively close user sessions.
Might be good to investigate if we can use the `lifecycle.preStop` hook. The exec handler could force Teleport to close connections and ship the data volume to other proxy nodes to avoid losing the session recording inventory.

Using a StatefulSet it's possible to know whether your replica will be scaled down (using the number of replicas and the pod name). For cases where a scale-down will happen, we can ship the payload to another proxy.
As far as I understand, the proxy storage is temporary, even in proxy mode. The recordings will eventually be shipped to an auth server for archiving purposes. In a permanent downscale case we would need a way to forcefully empty the queue. I initially assumed people using proxy mode would not downscale, but that assumption was unfounded and this case is valid.
My biggest concern is what happens if we're in a stuck proxy rollout (auths are slowly rolling out, proxies are waiting) and we lose one or many proxies. With a StatefulSet they would come back up as v11 and not v10, so they would not resume serving and would instead wait for the auth rollout to finish. This would cause a capacity loss, and maybe a full blackout if all proxies get rescheduled (e.g. a GKE automated nodepool update happening at the same time). With a Deployment, a v10 pod getting deleted would be replaced by another v10 pod, which is what we want.
I changed back to a Deployment; this means users might lose active session recordings during a pod rollout when using the proxy recording mode.
IIRC the way we do this in Cloud is to rewrite the default container's SIGTERM (15) signal to SIGQUIT (3) using dumb-init, so that we do in fact ask the pods to shut down gracefully: `/usr/bin/dumb-init --rewrite 15:3 -- teleport start -c /etc/teleport/teleport.yaml`
(from https://goteleport.com/docs/reference/signals/):
QUIT: Graceful shutdown. The daemon will wait until connections are dropped.
TERM, INT or KILL: Immediate non-graceful shutdown. All existing connections will be dropped.
I've never proposed this change in the Teleport Docker container itself because I feel it would be too much of a breaking change, but doing manually inside the Helm chart with args/command might work and be a good compromise?
I know that someone on the Teleport side was investigating the ability for Teleport to properly flush its session recording queue as part of a graceful shutdown but I'm unsure what progress they made on that. Cloud doesn't support proxy recording mode so there isn't a ton of prior art unfortunately.
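For reference, a sketch of what that entrypoint override could look like in the chart's pod spec. The image name is a placeholder, and dumb-init must actually be present in the image for this to work:

```yaml
containers:
  - name: teleport
    image: my-teleport-image:11   # placeholder: use the chart's configured image
    # Deliver SIGTERM to Teleport as SIGQUIT so Kubernetes-initiated shutdowns are graceful.
    command: ["/usr/bin/dumb-init"]
    args: ["--rewrite", "15:3", "--", "teleport", "start", "-c", "/etc/teleport/teleport.yaml"]
```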
Yeah, @espadolini took a lot of time to explain this to me (thank you ❤️). I don't think I want to/can jump into this rabbit hole right now: SIGQUIT waits for sessions to end, which can take several hours as we don't have a session eviction/migration mechanism. In Cloud this kind of works because we can afford slow rollouts with a 30-minute grace period. I'd love to raise this issue again once we have done the main refactors.
r0mant left a comment:
I think this RFD needs to include details about additional updates we wanted to make to the charts, such as:

- Deployment resources always named as the release name
- Lack of standard static labels on Teleport-deployed resources, e.g. `app.kubernetes.io/component`
- Including auth connectors
- Service monitors for metrics
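For the last item, a sketch of what a ServiceMonitor shipped by the chart could look like. It assumes the Prometheus Operator CRDs are installed; the resource name, labels, and port name are illustrative rather than the chart's actual values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: teleport-metrics              # illustrative name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: teleport-cluster
  endpoints:
    - port: diag                      # assumes the Service exposes Teleport's diagnostics port
      interval: 30s
```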
> auth:
>   teleportConfig:
>     auth_service:
>       authentication:
>         connector_name: "my-connector"
> proxy:
>   teleportConfig:
>     proxy_service:
>       acme:
>         enabled: true
>         email: foo@example.com
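For context, a minimal sketch of how such user-supplied `teleportConfig` values could be merged over chart-generated defaults in a template. It assumes Sprig's `mustMergeOverwrite` merge helper (available in Helm templates); the default values and resource names are illustrative:

```yaml
# templates/auth-config.yaml (sketch): chart defaults merged with user-provided values
{{- $defaults := dict "version" "v3" "auth_service" (dict "enabled" true) -}}
{{- $config := mustMergeOverwrite $defaults (.Values.auth.teleportConfig | default dict) -}}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-auth
data:
  teleport.yaml: |
    {{- toYaml $config | nindent 4 }}
```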
@webvictim Do you recall why our initial Helm chart design took the approach of exposing only certain fields via Helm chart values rather than mirroring the config structure as-is and merging like proposed here?

This approach definitely makes it easier to reconcile Helm values with config, but I wonder if it makes it easy for users to shoot themselves in the foot. E.g. what happens if you add an `auth_service` config section to the `proxy` values section? Just thinking out loud.
@r0mant Essentially it was because of all the complexity it would add. To be honest though, I didn't know that mustMergeOverride even existed, or I'd probably have written it that way in the first place 😁
One notable upside of doing it the old way with specific named values is that we could use enums in the values.yaml.json schema to set acceptable values for the chart and make sure that it's pretty difficult for people to break their Teleport config. When we take the guard rails off it'll be much easier to make the config file fail to validate, but I think the benefits of not needing to use chartMode: custom for more advanced deployments will definitely outweigh the downsides.
Another reason was that toYaml can be a bit of a blunt object and will rewrite lists in a certain way which made config files harder to read. That isn't a huge problem though.
@hugoShaka I wonder whether there's a way for us to do some kind of config file validation via a pre-deploy hook or InitContainer where we template out the config file that the user has specified and run teleport configure --test /etc/teleport/teleport.yaml against it to at least check whether Teleport is able to parse the file correctly. It would offload the validation to the Teleport binary itself and could maybe be used to abort a rollout with a failed configuration before taking the pods offline?
+1 for a pre-deploy hook validating the configuration.

This would also fit perfectly with a suggestion from @programmerq: validate backend reachability/S3 permissions on startup, as this is the most complex part of our setup and would allow users to catch broken setups waaaaay more easily. I don't think `configure --test` has this behaviour yet but it sounds like a nice little RFD.
Added `configure --test` hooks to the RFD: ec8823c
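A rough sketch of what such a hook could look like. Resource names, the image reference, and the ConfigMap wiring are assumptions, not the final implementation:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-config-check
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: config-check
          image: my-teleport-image:11          # placeholder: same image as the release
          # Fail the release early if the rendered config does not parse.
          command: ["teleport", "configure", "--test", "/etc/teleport/teleport.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/teleport
      volumes:
        - name: config
          configMap:
            # Assumes the rendered config ConfigMap is also created as a hook
            # (with a lower hook-weight) so it exists before this Job runs.
            name: {{ .Release.Name }}-auth
```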
> but it would have required a remote backend anyway (because our SQLite backend
> does not support replication).
>
> […]
>
> The main LB service should send traffic to the proxies, two additional services […]
If we want a graceful shutdown, we have to have a readiness probe that marks the container as not ready as soon as the container receives the signal to start a graceful shutdown. This is necessary so that the LB does not continue to send traffic to a container that is shutting down.
I don't think this is a requirement, see the note in the kube docs. Kubernetes will set the pod's endpoints unready as soon as the pod deletion process starts, even before transitioning to the Terminated state. If we don't want to lose requests we have to keep the pod running for at least 30 seconds. After 30 seconds every kube-proxy will have reconciled and no traffic will be routed to the pod through native kube services. This is usually done with a preStop hook waiting 30 seconds.

Users wanting to ensure 100% seamless rollouts would also have to rely on their LB readiness gates. Readiness gate injection can be controlled through an annotation on the EKS namespace, or on the GKE service.
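A sketch of that mitigation on the proxy pod spec. The 30-second figure comes from the kube-proxy reconciliation window mentioned above; the grace period value is an assumption, and later in the thread the `sleep` is replaced by a dedicated `teleport wait duration 30s` subcommand because hardened images ship neither `sh` nor `sleep`:

```yaml
spec:
  terminationGracePeriodSeconds: 60        # must cover the preStop wait plus Teleport's shutdown
  containers:
    - name: proxy
      lifecycle:
        preStop:
          exec:
            # Keep the pod alive while endpoints/kube-proxy converge, so no new
            # traffic is routed to it once the container starts shutting down.
            command: ["sh", "-c", "sleep 30"]
```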
> rollout will cause the load to spread unevenly across auth pods, which will be
> harmful at scale.
>
> […]
>
> Proxies will have an initContainer checking if all auth pods from the past […]
Will the `helm upgrade` wait for the full operation to end? If yes, we need to check the timing. By default it waits just 5 minutes, but in operations where new nodes are spawned it could take longer.
By default, no; if you run with `--wait` it will. Otherwise it's fire and forget: we only rely on Kubernetes' logic and hope we reach the desired state at some point.
I would assume that most of them would run with `--atomic`:

> `--atomic`: if set, upgrade process rolls back changes made in case of failed upgrade. The `--wait` flag will be set automatically if `--atomic` is used
In this case yes, we must warn users that the rollout can take some time and that they should adjust their timeouts. Releasing will take longer than before if we set up disruption mitigations like preStop hooks and graceful shutdowns, and update auth and proxies sequentially.
Added a section about documenting the possible rollout time and timeout tuning in the RFD: ec8823c
> Proxies will have an initContainer checking if all auth pods from the past
> version were removed. Version check via the Teleport gRPC api (`PingResponse`)
> requires valid credentials to connect to Teleport. To work around this issue we
> can rely on Kubernetes's service discovery through DNS to discover how many […]
The DNS server will only reply with the ready pods. To show not-ready pods, you must configure the service with `Service.spec.publishNotReadyAddresses: true`.

In situations where you upgrade from 1 auth to 3 auths, `teleport-auth-v10` will return no endpoints after the first pod is upgraded, while you still have 2 pods that are not yet ready/running. It will result in every proxy connecting to a single auth server.

The init container must be aware of the number of auth servers. In situations where auth pods are moved to other nodes (cluster upgrades or rebalancing), it should also wait until every auth server is ready.
> The DNS server will only reply with the ready pods. To show not-ready pods, you must configure the service with `Service.spec.publishNotReadyAddresses: true`.

👍

> In situations where you upgrade from 1 auth to 3 auths, `teleport-auth-v10` will return no endpoints after the first pod is upgraded, while you still have 2 pods that are not yet ready/running. It will result in every proxy connecting to a single auth server.

Interesting edge case. I think this is an acceptable temporary state, as all proxies were previously connected to a single auth instance. This might be mitigated by doing surge-only updates, which is generally not a bad idea to avoid losing capacity during rollouts.

> The init container must be aware of the number of auth servers. In situations where auth pods are moved to other nodes (cluster upgrades or rebalancing), it should also wait until every auth server is ready.

I don't know yet what to think about this. It makes sense, but I wonder if it is really necessary, as it would likely involve adding some kind of env variable/cmdline argument. An auth upscale would cause a proxy rollout, plus in certain cases we have stateless auth (full AWS mode) that users might want to upscale/downscale without Helm.
The more I think about those cases, the more I just want an operator to take care of the updates instead of hacking through Kubernetes 🙃
Added the `publishNotReadyAddresses` setting to the RFD: 4c1ca75
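For completeness, the per-version headless service sketched earlier would then carry that flag (the label key remains a hypothetical example):

```yaml
spec:
  clusterIP: None
  publishNotReadyAddresses: true      # DNS also returns pods that are not Ready yet
  selector:
    teleport.dev/major-version: "10"  # hypothetical label key
```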
@r0mant I added a commit covering labels, deploying CRs and serviceMonitor. Resources are already named based on the release name.

I will review, but it would be more helpful if @webvictim and @bcg62 would take a look.

Myself and @gclawes will provide feedback.
webvictim left a comment:
I really like the proposal overall and think that both decoupling the auth/proxy pods and offloading full config validation to Teleport itself rather than implementing a small subset in Helm will make customers' lives a lot easier. I have zero doubt that people have already built their own changes around this and the migration will likely be a breaking change for them, but unfortunately a little friction is necessary to make the journey smoother for everyone else.
Force-pushed from bdcf624 to ec8823c
Part of [RFD-0096](#18274). This PR adds Helm hooks deploying a configuration test job that runs `teleport configure --test` to validate that the `teleport.yaml` configuration is sane.
It seems like `app.kubernetes.io/component` should have a value that indicates which Teleport components are active: `auth+kube` or `proxy+kube`, etc. If not in that label, is there another label that would better indicate the Teleport component(s) that are active?

It would make the Helm chart feel more in tune with the product. It'd be great if this could be generated based on the rendered `teleport.yaml` in the ConfigMap for the deployment, if possible.
Fair, currently it would be `proxy` and `auth+kube`. However I'd like to move the kube service out of the auth pods soon; I kept this out of the main PR in a (failed) attempt to limit the number of changes. Eventually this should be `proxy`, `auth`, and `kube`. I'll update the RFD to reflect this.

However, I think it will be easier to just write `auth` instead of `auth+kube` today: we rely on those labels as selectors and Kubernetes does not handle label edits well, so the label change would require another downtime when we move the kube service out of the auth pods.
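An illustrative label set on the auth workload under that scheme (the name/instance values are assumptions following the Kubernetes recommended labels):

```yaml
metadata:
  labels:
    app.kubernetes.io/name: teleport-cluster
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/component: auth   # "proxy" on the proxy workload
```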
Since helm can't deploy the CRD and its CRs in the same release, would it be possible to support bootstrap.yaml in the helm chart? That would hopefully allow folks to do a one-shot initial deploy that doesn't require them to separately create custom resources or exec into the pod.
Technically you can do `bootstrap.yaml` with the Helm chart currently using `extraVolumes`, `extraVolumeMounts` and `extraArgs`. Maybe we should just document how to do this better, as it's intended to be a one-off thing?
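A hedged example of how that could look in the chart values today. The ConfigMap is assumed to be created by the user beforehand, the mount path is arbitrary, and this assumes `extraArgs` is appended to the `teleport start` invocation:

```yaml
extraArgs:
  - "--bootstrap=/etc/teleport-bootstrap/bootstrap.yaml"
extraVolumes:
  - name: bootstrap
    configMap:
      name: teleport-bootstrap          # user-created ConfigMap holding bootstrap.yaml
extraVolumeMounts:
  - name: bootstrap
    mountPath: /etc/teleport-bootstrap
```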
extraVolume/extraVolumeMounts/extraArgs also require that the end-user do a two-step operation to get that initial data in place. Allowing the chart to natively support as much of this initialization process as possible all from the chart values will reduce the need for external steps/scripts/processes.
From what I read, `--bootstrap` does not support all resources (only User, CA, TrustedCluster, Role, AuthConnectors) and allows you to import severely broken states (it does some validation of the CA material but does not check if a user has existing roles, for example). Therefore, the flag should be used only for restoring an existing state and must not be used as a substitute for the CR/CRD mechanism.

We do need to make populating resources easier, but I think `--bootstrap` is not the adequate way.
Are we intending to add a migration guide from a custom config to the new format in the docs? I know a lot of people are already using custom and so it'd be nice to guide them through the process to ease the pain of a breaking change.
Like this guide? https://github.com/gravitational/teleport/pull/19881/files#diff-c7dd7c2ab1f43e582d7662823d997dc4317c42d334fb179a4b3a5cd4dbc911dd
Feel free to jump in the PR if you have ideas on how to make the migration less painful
Part of #18274 This commit introduces a new hidden `wait` CLI subcommand: - `teleport wait no-resolve <domain-name>` resolves a domain name and exits only when no IPs are resolved. This CLI command should be used in the Helm chart, as an init-container, to block proxies from rolling out until all auth pods have been successfully rolled-out. - `teleport wait duration 30s` has the same behaviour as `sleep 30`. Due to image hardening we won't have `sleep` available, but waiting 30 seconds in a preStop hook is required to ensure a 100% seamless pod rollout on kube-proxy-based clusters.
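A sketch of how those subcommands could be wired into the chart; the image, service name, and namespace are placeholders:

```yaml
# initContainer on the proxy pods: block the rollout until the previous-version
# auth service stops resolving.
initContainers:
  - name: wait-auth-update
    image: my-teleport-image:11        # placeholder
    command:
      - teleport
      - wait
      - no-resolve
      - teleport-auth-v10.my-namespace.svc.cluster.local
containers:
  - name: proxy
    lifecycle:
      preStop:
        exec:
          # Replacement for `sleep 30`, which is unavailable in hardened images.
          command: ["teleport", "wait", "duration", "30s"]
```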
Co-authored-by: Marco André Dinis <marco.dinis@goteleport.com>
Force-pushed from 9a07a29 to 03f3eb4
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
This commit refactors the `teleport-cluster` Helm chart to deploy proxy and auth pods separately. It allows users to pass raw Teleport configuration to the deployed Teleport nodes. Finally, it removes the `custom` chart mode, as the mode was broken by the split. A new `scratch` mode has been introduced. See [the corresponding RFD](#18274) describing the design.
Part of [RFD-096](#18274): managing major upgrades safely. This commit's main purpose is to block proxies running a new Teleport major version from connecting to auth pods running an old Teleport version. This commit does 3 things:

- adding initContainers and preStop hooks to the `teleport-cluster` Helm chart (initContainers were designed in RFD 096, preStop was a nice addition coming from [the wait PR](#19277))
- fixing a bug in the `wait` command (the DNS error was not properly unwrapped and not recognized as a DNS error)
- fixing missing override support on some auth Deployment values

As a rule of thumb for future review, we should not use `.Values` directly and should prefer using `$auth` and `$proxy`.
Rendered version
This RFD is part of our Q4 efforts to improve the teleport Helm experience.