What happened: When running a CA rotation on a Teleport cluster (`tctl auth rotate`), pods running the `teleport-kube-agent` Helm chart do not appear to have their certificates rotated at the correct point in the rotation. As such, all these agents lose connectivity to the cluster during a CA rotation, and the Kubernetes clusters/apps/databases that they are serving are no longer accessible.
If your `teleport-kube-agent` pod is not using persistent storage, you can simply run `kubectl delete pod teleport-kube-agent` and let the Deployment recreate the pod. As long as the join token provided in the values is still valid, the pod will rejoin with a new certificate and work correctly.
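For reference, a minimal sketch of that non-persistent recovery, assuming the chart was installed as a Deployment named `teleport-kube-agent` in a `teleport` namespace (the namespace and label selector are assumptions; adjust to your install):

```bash
# Delete the stuck agent pod; the Deployment recreates it automatically
kubectl -n teleport delete pod -l app=teleport-kube-agent

# Watch the replacement pod start and rejoin with a fresh certificate
kubectl -n teleport get pods -w
```

If the label selector does not match your deployment, delete the pod by name as listed in `kubectl -n teleport get pods` instead.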
If your `teleport-kube-agent` pod is using persistent storage (i.e. `storage.enabled` is set to `true` in the chart values), then simply deleting/restarting the pod will not work. The old pre-rotation certificates are kept on the volume, so the procedure to get the pod working again is more involved - something like this (a rough command sketch follows the list):
- set the `teleport-kube-agent` chart replica count to 0
- delete the `PersistentVolumeClaim` associated with the chart, and the associated volume (depending on the PV reclaim policy)
- run a `helm upgrade` on the `teleport-kube-agent` chart to recreate the missing PVC
- make sure that the pod is recreated and the PVC/PV attach correctly
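A rough command sequence for the list above, assuming the release is named `teleport-kube-agent` in a `teleport` namespace and was installed from a Helm repo aliased `teleport`; the PVC/PV names are placeholders to be read from `kubectl get pvc`:

```bash
# 1. Scale the agent to zero so nothing holds the volume
helm -n teleport upgrade teleport-kube-agent teleport/teleport-kube-agent \
  --reuse-values --set replicaCount=0

# 2. Delete the PVC (and the PV as well, if its reclaim policy retained it)
kubectl -n teleport get pvc
kubectl -n teleport delete pvc <pvc-name>
kubectl delete pv <pv-name>   # only needed when the volume was retained

# 3. Run helm upgrade again with the original replica count to recreate the PVC
helm -n teleport upgrade teleport-kube-agent teleport/teleport-kube-agent \
  --reuse-values --set replicaCount=1

# 4. Confirm the pod is recreated and the new PVC/PV are bound
kubectl -n teleport get pods,pvc,pv
```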
I suspect that the root of this issue has something to do with the way Kubernetes handles volume mounts and how this interacts with SQLite databases/locks, as I presume that is where the certificates are supposed to be updated during the CA rotation.
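One quick way to check that theory is to look at the agent's data directory on the persistent volume before and after the rotation; a sketch, assuming the chart mounts its state at `/var/lib/teleport` (the path and pod name are assumptions):

```bash
# Inspect the agent's on-disk state; the identity/certificates live in the
# SQLite-backed process storage under the data directory
kubectl -n teleport exec <teleport-kube-agent-pod> -- ls -la /var/lib/teleport
kubectl -n teleport exec <teleport-kube-agent-pod> -- ls -la /var/lib/teleport/proc
```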
What you expected to happen: When a CA rotation takes place on a Teleport cluster, all joined/running agent pods which are actively heartbeating should have their certificates automatically rotated and stay connected to the cluster. Currently, performing a CA rotation could result in a major loss of service.
Reproduction Steps
1. Deploy a Teleport cluster (any method: Terraform, the `teleport-cluster` Helm chart, set it up from scratch on EC2, etc.)
2. Use Helm to deploy a `teleport-kube-agent` pod attached to that cluster (see the sketch after this list)
3. Run a CA rotation on the auth server with `tctl auth rotate`
4. Observe that the `teleport-kube-agent` pod loses its connection and does not rejoin.
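A sketch of the reproduction with placeholder values (`example.teleport.com`, the join token, the kube cluster name, and the pod name are illustrative, not from a real deployment):

```bash
# Deploy an agent against an existing cluster, with persistent storage enabled
helm repo add teleport https://charts.releases.teleport.dev
helm -n teleport install teleport-kube-agent teleport/teleport-kube-agent \
  --create-namespace \
  --set proxyAddr=example.teleport.com:443 \
  --set authToken=<join-token> \
  --set kubeClusterName=example-kube-cluster \
  --set storage.enabled=true

# On the auth server, start a CA rotation with the default grace period
tctl auth rotate

# Watch the agent: it drops its reverse tunnel and does not rejoin
kubectl -n teleport logs -f <teleport-kube-agent-pod>
```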
Server Details
Teleport version (run `teleport version`): `Teleport v8.0.7 git:v8.0.7-0-geb8076446 go1.17.3`
Debug Logs
These are the logs from the `teleport-kube-agent` pod - they just repeat in a loop every minute or two and will not resolve without manual intervention.
2022-01-17T14:51:43Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27835 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:45Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:47Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:47Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:51Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27836 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:51:55Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:51:57Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:51:57Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:51:59Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27837 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:05Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:07Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:07Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:07Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27838 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:15Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:15Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27839 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:17Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:17Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:23Z [KUBERNETE] WARN Heartbeat failed rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". srv/heartbeat.go:261
2022-01-17T14:52:23Z INFO [PROC:1] Detected Teleport component "kubernetes" is running in a degraded state. service/state.go:105
2022-01-17T14:52:23Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27840 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:25Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:27Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:27Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:28Z INFO [PROC:1] Teleport component "kubernetes" is recovering from a degraded state. service/state.go:119
2022-01-17T14:52:31Z [KUBERNETE] WARN Failed to create remote tunnel: failed to dial: all auth methods failed, conn: <nil>. leaseID:27841 target:example.teleport.com:3024 reversetunnel/agent.go:340
2022-01-17T14:52:35Z [KUBERNETE] WARN Re-init the cache on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". cache/cache.go:711
2022-01-17T14:52:37Z WARN [KUBERNETE] Restart watch on error: connection error: desc = "transport: Error while dialing failed to dial: ssh: handshake failed: no matching keys found". resource-kind:lock services/watcher.go:187
2022-01-17T14:52:37Z WARN [KUBERNETE] Maximum staleness of 5m0s exceeded, failure started at 2022-01-14 16:06:29.126434696 +0000 UTC m=+11627706.178752111. resource-kind:lock services/watcher.go:195
2022-01-17T14:52:38Z INFO [PROC:1] Teleport component "kubernetes" has recovered from a degraded state. service/state.go:123