Deploy split proxy/auth with helm chart #18857
Conversation
Upgrading from a past chart caused ~3 min of downtime, mainly due to:
I'm not sure how to minimize this. One could deploy a second release next to the original one, then change the DNS record to point to the new LB. The biggest issue would be reverse-tunnel clients trying to connect to an unreachable proxy, but this might be acceptable.
webvictim
left a comment
Looks really good overall. I think the new templating style is going to give people a lot more flexibility.
We might want to update the README a bit to remove the values listing and point people to the comprehensive chart reference too... which unfortunately is also going to need overhauling as part of the changes 😢
Part of [RFD-0096](#18274). This PR adds Helm hooks that deploy a test configuration Job and run `teleport configure --test` to validate that the `teleport.yaml` configuration is sane.
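A hook of that shape can be sketched as follows. This is an illustrative assumption, not the chart's actual template: the resource names, image tag, and mount paths are made up, and only the `helm.sh/hook` mechanics and the `teleport configure --test` invocation come from the description above.

```yaml
# Hypothetical sketch of a pre-install/pre-upgrade hook Job that
# validates the rendered teleport.yaml before the release rolls out.
apiVersion: batch/v1
kind: Job
metadata:
  name: teleport-config-check          # assumed name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: config-check
          image: public.ecr.aws/gravitational/teleport:11   # assumed image
          command: ["teleport", "configure", "--test", "/etc/teleport/teleport.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/teleport
      volumes:
        - name: config
          configMap:
            name: teleport-config      # assumed ConfigMap name
```

Because the Job runs as a `pre-upgrade` hook, a broken configuration fails the `helm upgrade` before any Teleport pod is replaced.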
c83b7d1 to
3ec9f8f
Compare
Documentation PR is here: #19881. I'll wait until there are no more comments on this one, then rebase and send it for review to the docs team.
```yaml
# TLS multiplexing is not supported when using ACM+NLB for TLS termination.
#
# Possible values are 'separate' and 'multiplex'
proxyListenerMode: "separate"
```
Shouldn't the default value be multiplex nowadays?
I think more people are terminating TLS in front of Teleport than not when deploying in Kubernetes. The use of ACM is widespread, and the lack of Ingress support is a huge gripe people have with the Helm charts in general. As such I think it's best if we're still explicit about requiring the use of separate ports/listeners until such time as Teleport can work reliably behind TLS termination. If we default to multiplex here we're just making a rod for our own backs.
If/when we manage to have Teleport run behind an Ingress, we'll be able to multiplex by default on those setups. Based on support requests, a lot of people are using ACM-based setups, and switching to multiplex would break them.
Following discussions, I searched GitHub issues and learned Teleport had several issues with stale proxy/auth nodes in the past. I'll run a couple more tests to observe how the cluster reacts during rollouts and validate that the topology change does not introduce a regression or make things worse.
**With split auth/proxy**

The auth rollout took ~5 min for the metrics to go back to normal.

The proxy rollout happened really fast, and all nodes reconnected quickly. However, CPU usage increased dramatically after the rollout and did not go back to regular levels (for proxies, nodes, and a bit for auth). According to the logs, the nodes were discovering 3 or 4 proxies for 30 minutes. Once the nodes stopped discovering stale proxies, CPU usage went back to nominal.

**With auth and proxy bundled**

The proxy rollout issue also happens.

**Split impact**

We were already facing issue #20057 in bundled mode. This is non-blocking.

Splitting auth and proxies causes auth rollouts to disconnect every node. Nodes can take up to 5 minutes to reconnect. Most of the wait does look like a bug though: #8793

While the auth rollout impact is not ideal, this does not seem to be a blocker. We might want to investigate why reconnection is delayed, so the update experience becomes smoother.
This commit implements arbitrary configuration passing to Teleport, like what was done for `teleport-cluster` in #18857. This allows users to deploy services or set fields the chart does not support.

The huge snapshot diffs are caused by order changes in the config (the YAML export orders keys alphabetically). I validated that the old and new snapshots were strictly equivalent with the following Python snippet:

```python
import yaml
from pathlib import Path

import deepdiff

old = yaml.safe_load(Path("./config-snapshot.old").open())
new = yaml.safe_load(Path("./config-snapshot.new").open())

# Each snapshot value holds a rendered manifest; extract and re-parse the
# embedded teleport.yaml so the comparison is on parsed mappings, not text.
old_content = {
    k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"])
    for (k, v) in old.items()
}
new_content = {
    k: yaml.safe_load(yaml.safe_load(v[1])["data"]["teleport.yaml"])
    for (k, v) in new.items()
}

diff = deepdiff.DeepDiff(old_content, new_content)
print(diff)  # empty diff means the snapshots are semantically identical
```
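The reason a semantic comparison is needed at all can be shown with a minimal, self-contained sketch (the config data below is made up for illustration): key order makes the raw text differ, but once the documents are parsed into mappings, as `yaml.safe_load` does, equality is order-independent.

```python
# Hypothetical sketch: two snapshots whose only difference is key order.
old_text = "auth_service:\n  enabled: false\nproxy_service:\n  enabled: true\n"
new_text = "proxy_service:\n  enabled: true\nauth_service:\n  enabled: false\n"

# A textual diff flags every reordered key as a change...
assert old_text != new_text

# ...but the parsed mappings (what yaml.safe_load would return for each
# text) compare strictly equal, so a semantic diff reports nothing.
old = {"auth_service": {"enabled": False}, "proxy_service": {"enabled": True}}
new = {"proxy_service": {"enabled": True}, "auth_service": {"enabled": False}}
assert old == new
print("snapshots are semantically equivalent")
```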



This PR implements big chunks of #18274
The following changes are implemented in the PR:
- `kubernetes` joinMethod to join proxies

The following changes have been implemented and reviewed in sub-PRs:
The following changes will be implemented later, in subsequent PRs:
The following changes will not be implemented and the RFD will be modified to reflect this change:
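For reference, the `kubernetes` joinMethod listed above corresponds roughly to a `teleport.yaml` fragment like the one below. This is a hedged sketch, not the chart's rendered output: the token name and auth address are assumptions.

```yaml
# Illustrative only: proxy pods joining the auth service via the
# Kubernetes join method, authenticating with their service-account token.
version: v3
teleport:
  join_params:
    method: kubernetes
    token_name: proxy                                        # assumed token name
  auth_server: teleport-auth.teleport.svc.cluster.local:3025 # assumed address
proxy_service:
  enabled: true
```

With this join method, proxies need no static join token: the auth service validates the pod's projected service-account token against the Kubernetes API.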
I am not happy with the state of the tests, but the PR is already too large. I would like to revamp the tests in a subsequent PR.
Before merging I want to run the following tests:
Should address:
- `auth` and `proxy` pods #16871
- `teleport-cluster` helm chart #16096 (although it should be tested and we might want to add a containerPort to officialise proxy-peering support)