CA rotation does not apply to teleport-kube-agent pods, preventing them from rejoining a cluster #9815

Closed
webvictim opened this issue Jan 17, 2022 · 2 comments
Labels: bug, helm, kubernetes-access, robustness (Resistance to crashes and reliability)

Comments

@webvictim (Contributor)

Description

What happened: When running a CA rotation on a Teleport cluster (tctl auth rotate), pods running the teleport-kube-agent Helm chart do not appear to have their certificates rotated at the correct point in the rotation. As such, all these agents lose connectivity to the cluster during a CA rotation and the Kubernetes clusters/apps/databases that they are serving are no longer accessible.

If your teleport-kube-agent pod is not using persistent storage, you can just kubectl delete pod teleport-kube-agent and have the deployment recreate the pod. As long as the join token provided in the values is still valid, the pod will rejoin with a new certificate and work correctly.
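For example, assuming the chart was installed into a namespace named teleport (the namespace and pod name below are placeholders for your environment):

```
# Find the agent pod created by the chart's Deployment
kubectl get pods -n teleport

# Delete it; the Deployment recreates the pod, which re-joins the
# Teleport cluster using the join token from the chart values
kubectl delete pod -n teleport <teleport-kube-agent-pod-name>
```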

If your teleport-kube-agent pod is using persistent storage (i.e. storage.enabled is set to true in the chart values), then simply deleting/restarting the pod will not work. The old pre-rotation certificates are retained on the volume, so the procedure to get the pod working again is more involved - something like this (a rough command sequence is sketched after the list):

  • set the teleport-kube-agent chart replica count to 0
  • delete the PersistentVolumeClaim associated with the chart, and the associated volume (depending on the PV reclaim policy)
  • run a helm upgrade on the teleport-kube-agent chart to recreate the missing PVC
  • make sure that the pod is recreated and the PVC/PV attach correctly
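A minimal sketch of that procedure, assuming the release is named teleport-kube-agent, lives in a teleport namespace, the Teleport Helm repo is added as teleport, and the original install values are in values.yaml (adjust all of these, and use statefulset/ instead of deployment/ if that is how your chart version deploys the agent):

```
# 1. Scale the agent down so nothing holds the volume
kubectl scale deployment/teleport-kube-agent -n teleport --replicas=0

# 2. Delete the PVC created for the chart (and the PV, depending on
#    its reclaim policy)
kubectl get pvc -n teleport
kubectl delete pvc -n teleport <teleport-kube-agent-pvc-name>

# 3. Re-run helm upgrade with the original values so the chart
#    recreates the missing PVC and restores the replica count
helm upgrade teleport-kube-agent teleport/teleport-kube-agent \
  -n teleport -f values.yaml

# 4. Check that the pod is recreated and the new PVC/PV bind correctly
kubectl get pods,pvc -n teleport
```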

I suspect that the root of this issue may have something to do with the way Kubernetes handles volume mounts and how that interacts with SQLite databases/locks, since I presume that is where the certificates are supposed to be updated during the CA rotation.

Possibly related issues:

What you expected to happen: When a CA rotation takes place on a Teleport cluster, all joined/running agent pods which are actively heartbeating should have their certificates automatically rotated and stay connected to the cluster. Currently, performing a CA rotation could result in a major loss of service.

Reproduction Steps

  1. Deploy a Teleport cluster (any method - Terraform, teleport-cluster Helm chart, set it up from scratch on EC2, etc)
  2. Use Helm to deploy a teleport-kube-agent pod attached to that cluster (example commands below)
  3. Run a CA rotation on the auth server with tctl auth rotate
  4. Observe that the teleport-kube-agent pod loses connection and does not rejoin.
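A minimal sketch of steps 2-4, assuming the Teleport Helm repo is added as teleport, the agent is installed into a teleport namespace as a release (and Deployment) named teleport-kube-agent, and the proxy address, join token and cluster name are placeholders to replace with real values:

```
# 2. Deploy an agent attached to the cluster
helm repo add teleport https://charts.releases.teleport.dev
helm install teleport-kube-agent teleport/teleport-kube-agent \
  --create-namespace -n teleport \
  --set proxyAddr=example.teleport.com:443 \
  --set authToken=<join-token> \
  --set kubeClusterName=example-cluster

# 3. On the auth server, start a CA rotation
tctl auth rotate

# 4. Watch the agent logs; the reverse tunnel drops and the pod never rejoins
kubectl logs -n teleport deploy/teleport-kube-agent -f
```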

Server Details

  • Teleport version (run teleport version): Teleport v8.0.7 git:v8.0.7-0-geb8076446 go1.17.3

Debug Logs

These are the logs from the teleport-kube-agent pod - they just repeat in a loop every minute or two and will not resolve without manual intervention.

2022-01-17T14:51:43Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27835 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:45Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:47Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:47Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:51Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27836 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:55Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:57Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:57Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:59Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27837 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:05Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:07Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:07Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:07Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27838 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:15Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:15Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27839 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:17Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:17Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:23Z [KUBERNETE] WARN Heartbeat failed rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". srv/heartbeat.go:261
2022-01-17T14:52:23Z INFO [PROC:1]    Detected Teleport component "kubernetes" is running in a degraded state. service/state.go:105
2022-01-17T14:52:23Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27840 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:25Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:27Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:27Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:28Z INFO [PROC:1]    Teleport component "kubernetes" is recovering from a degraded state. service/state.go:119
2022-01-17T14:52:31Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27841 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:35Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:37Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:37Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:38Z INFO [PROC:1]    Teleport component "kubernetes" has recovered from a degraded state. service/state.go:123

@russjones (Contributor)

I noticed that the following integration tests are flaky. I wonder if the underlying problem is the same as this issue?

TestRotateTrustedClusters
TestRotateSuccess

@espadolini (Contributor)

Fixed by the combined efforts of #9418, #11074 and #10706.
