CA rotation does not apply to teleport-kube-agent pods, preventing them from rejoining a cluster #9815

Closed
webvictim opened this issue Jan 17, 2022 · 2 comments
Labels: bug, helm, kubernetes-access, robustness (Resistance to crashes and reliability)

Comments

@webvictim (Contributor)

Description

What happened: When running a CA rotation on a Teleport cluster (tctl auth rotate), pods running the teleport-kube-agent Helm chart do not appear to have their certificates rotated at the correct point in the rotation. As such, all these agents lose connectivity to the cluster during a CA rotation and the Kubernetes clusters/apps/databases that they are serving are no longer accessible.

If your teleport-kube-agent pod is not using persistent storage, you can just kubectl delete pod teleport-kube-agent and have the deployment recreate the pod. As long as the join token provided in the values is still valid, the pod will rejoin with a new certificate and work correctly.
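For example, assuming the chart was installed into a namespace named teleport (the namespace and pod name below are placeholders for your environment):

```
# Find the agent pod created by the chart's Deployment
kubectl get pods -n teleport

# Delete it; the Deployment recreates the pod, which re-joins the
# Teleport cluster using the join token from the chart values
kubectl delete pod -n teleport <teleport-kube-agent-pod-name>
```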

If your teleport-kube-agent pod is using persistent storage (i.e. storage.enabled is set to true in the chart values), then simply deleting/restarting the pod will not work. The old pre-rotation certificates are retained on the volume, so the procedure to get the pod working again is more involved - something like this (a rough command sequence is sketched after the list):

  • set the teleport-kube-agent chart replica count to 0
  • delete the PersistentVolumeClaim associated with the chart, and the associated volume (depending on the PV reclaim policy)
  • run a helm upgrade on the teleport-kube-agent chart to recreate the missing PVC
  • make sure that the pod is recreated and the PVC/PV attach correctly
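A minimal sketch of that procedure, assuming the release is named teleport-kube-agent, lives in a teleport namespace, the Teleport Helm repo is added as teleport, and the original install values are in values.yaml (adjust all of these, and use statefulset/ instead of deployment/ if that is how your chart version deploys the agent):

```
# 1. Scale the agent down so nothing holds the volume
kubectl scale deployment/teleport-kube-agent -n teleport --replicas=0

# 2. Delete the PVC created for the chart (and the PV, depending on
#    its reclaim policy)
kubectl get pvc -n teleport
kubectl delete pvc -n teleport <teleport-kube-agent-pvc-name>

# 3. Re-run helm upgrade with the original values so the chart
#    recreates the missing PVC and restores the replica count
helm upgrade teleport-kube-agent teleport/teleport-kube-agent \
  -n teleport -f values.yaml

# 4. Check that the pod is recreated and the new PVC/PV bind correctly
kubectl get pods,pvc -n teleport
```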

I suspect that the root of this issue may have something to do with the way Kubernetes handles volume mounts and how that interacts with SQLite databases/locks, since I presume that is where the certificates are supposed to be updated during the CA rotation.

Possibly related issues:

What you expected to happen: When a CA rotation takes place on a Teleport cluster, all joined/running agent pods which are actively heartbeating should have their certificates automatically rotated and stay connected to the cluster. Currently, performing a CA rotation could result in a major loss of service.

Reproduction Steps

  1. Deploy a Teleport cluster (any method - Terraform, teleport-cluster Helm chart, set it up from scratch on EC2, etc)
  2. Use Helm to deploy a teleport-kube-agent pod attached to that cluster (example commands below)
  3. Run a CA rotation on the auth server with tctl auth rotate
  4. Observe that the teleport-kube-agent pod loses connection and does not rejoin.
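A minimal sketch of steps 2-4, assuming the Teleport Helm repo is added as teleport, the agent is installed into a teleport namespace as a release (and Deployment) named teleport-kube-agent, and the proxy address, join token and cluster name are placeholders to replace with real values:

```
# 2. Deploy an agent attached to the cluster
helm repo add teleport https://charts.releases.teleport.dev
helm install teleport-kube-agent teleport/teleport-kube-agent \
  --create-namespace -n teleport \
  --set proxyAddr=example.teleport.com:443 \
  --set authToken=<join-token> \
  --set kubeClusterName=example-cluster

# 3. On the auth server, start a CA rotation
tctl auth rotate

# 4. Watch the agent logs; the reverse tunnel drops and the pod never rejoins
kubectl logs -n teleport deploy/teleport-kube-agent -f
```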

Server Details

  • Teleport version (run teleport version): Teleport v8.0.7 git:v8.0.7-0-geb8076446 go1.17.3

Debug Logs

These are the logs from the teleport-kube-agent pod - they just repeat in a loop every minute or two and will not resolve without manual intervention.

2022-01-17T14:51:43Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27835 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:45Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:47Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:47Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:51Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27836 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:55Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:57Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:57Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:59Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27837 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:05Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:07Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:07Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:07Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27838 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:15Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:15Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27839 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:17Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:17Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:23Z [KUBERNETE] WARN Heartbeat failed rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". srv/heartbeat.go:261
2022-01-17T14:52:23Z INFO [PROC:1]    Detected Teleport component "kubernetes" is running in a degraded state. service/state.go:105
2022-01-17T14:52:23Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27840 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:25Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:27Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:27Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:28Z INFO [PROC:1]    Teleport component "kubernetes" is recovering from a degraded state. service/state.go:119
2022-01-17T14:52:31Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27841 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:35Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:37Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:37Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:38Z INFO [PROC:1]    Teleport component "kubernetes" has recovered from a degraded state. service/state.go:123

@russjones (Contributor)

I noticed that the following integration tests are flaky. I wonder if the underlying problem is the same as this issue?

TestRotateTrustedClusters
TestRotateSuccess

@espadolini (Contributor)

Fixed by the combined efforts of #9418, #11074 and #10706.
