Reverse Tunnel Nodes getting stuck initializing #13911

Closed
rosstimothy opened this issue Jun 27, 2022 · 6 comments · Fixed by #14021
Labels: bug, test-plan-problem (issues which have been surfaced by running the manual release test plan)

rosstimothy (Contributor):

Expected behavior:

All reverse tunnel nodes are connected and reachable

Current behavior:

Running scaling tests for load testing revealed that a small number of nodes were not actually connecting.

The deployment shows that it has fully scaled up to 1000 reverse tunnel node pods:

kubectl describe -n loadtest-tross deploy/iot-node
Name:                   iot-node
Namespace:              loadtest-tross
CreationTimestamp:      Mon, 27 Jun 2022 15:51:30 -0400
Labels:                 teleport-role=node
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               node=iot,teleport-role=node
Replicas:               1000 desired | 1000 updated | 1000 total | 1000 available | 0 unavailable

tctl shows that only a subset have registered:

tctl get nodes --format=json | jq -r '.[].spec.hostname' | wc -l
809

Bug details:

  • Teleport version 10.0.0-alpha.1, 10.0.0-alpha.2
  • Recreation steps: Run a cluster with a large number of reverse tunnel nodes. Watch that they don't all appear.
Debug log
kubectl logs -n loadtest-tross iot-node-9ccb8bc98-kn4xj
2022-06-27T19:59:51Z INFO [PROC:1]    Generating new host UUID: a0ea36f9-0dc0-44fb-8537-93a48188c8e9. pid:7.1 service/service.go:735
2022-06-27T19:59:51Z DEBU [SQLITE]    Connected to: file:%2Fvar%2Flib%2Fteleport%2Fproc%2Fsqlite.db?_busy_timeout=10000&_sync=FULL&_txlock=immediate, poll stream period: 1s lite/lite.go:246
2022-06-27T19:59:52Z DEBU [SQLITE]    journal_mode=delete, synchronous=2, busy_timeout=10000 lite/lite.go:293
2022-06-27T19:59:52Z INFO [PROC:1]    Adding diagnostic debugging handlers. To connect with profiler, use `go tool pprof 0.0.0.0:3434`. pid:7.1 service/service.go:2387
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:readyz.monitor service/supervisor.go:207
2022-06-27T19:59:52Z INFO [PROC:1]    Service diag is creating new listener on 0.0.0.0:3434. pid:7.1 service/signals.go:212
2022-06-27T19:59:52Z INFO [DIAG:1]    Starting diagnostic service on 0.0.0.0:3434. pid:7.1 service/service.go:2467
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:diagnostic.service service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:diagnostic.shutdown service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:register.node service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:ssh.node service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:ssh.shutdown service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:common.rotate service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:diagnostic.shutdown service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    No signal pipe to import, must be first Teleport process. pid:7.1 service/service.go:995
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:readyz.monitor service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:diagnostic.service service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:ssh.node service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:register.node service/supervisor.go:268
2022-06-27T19:59:52Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T19:59:52Z DEBU [PROC:1]    Generating new key pair for Node first-time-connect. pid:7.1 service/connect.go:314
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:ssh.shutdown service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:common.rotate service/supervisor.go:268
2022-06-27T20:00:01Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:01Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:01Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:01Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:01Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:17Z DEBU [AUTH]      Registration via proxy server failed. error:[
ERROR REPORT:
Original Error: *url.Error Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF
Stack Trace:
	/go/src/github.com/gravitational/teleport/lib/client/https_client.go:113 github.com/gravitational/teleport/lib/client.(*WebClient).PostJSONWithFallback
	/go/src/github.com/gravitational/teleport/lib/client/weblogin.go:444 github.com/gravitational/teleport/lib/client.HostCredentials
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:220 github.com/gravitational/teleport/lib/auth.registerThroughProxy
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:177 github.com/gravitational/teleport/lib/auth.Register
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:418 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:155 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:137 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:64 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/service.go:2191 github.com/gravitational/teleport/lib/service.(*TeleportProcess).registerWithAuthServer.func1
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:521 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:269 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit
User Message: Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF] auth/register.go:180
2022-06-27T20:00:17Z INFO [AUTH]      Attempting registration with auth server. auth/register.go:176
2022-06-27T20:00:17Z WARN [AUTH]      Joining cluster without validating the identity of the Auth Server. This may open you up to a Man-In-The-Middle (MITM) attack if an attacker can gain privileged network access. To remedy this, use the CA pin value provided when join token was generated to validate the identity of the Auth Server. auth/register.go:332
2022-06-27T20:00:20Z DEBU [AUTH]      Registration with auth server failed. error:[
ERROR REPORT:
Original Error: *json.SyntaxError invalid character '<' looking for beginning of value
Stack Trace:
	/go/src/github.com/gravitational/teleport/lib/auth/clt.go:480 github.com/gravitational/teleport/lib/auth.(*Client).RegisterUsingToken
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:267 github.com/gravitational/teleport/lib/auth.registerThroughAuth
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:177 github.com/gravitational/teleport/lib/auth.Register
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:418 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:155 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:137 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:64 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/service.go:2191 github.com/gravitational/teleport/lib/service.(*TeleportProcess).registerWithAuthServer.func1
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:521 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:269 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit
User Message: invalid character '<' looking for beginning of value] auth/register.go:180
2022-06-27T20:00:20Z ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF, invalid character '<' looking for beginning of value. pid:7.1 service/connect.go:87
2022-06-27T20:00:22Z DEBU [PROC:1]    Retrying connection to auth server after waiting 2.205020838s. pid:7.1 service/connect.go:99
2022-06-27T20:00:22Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T20:00:22Z DEBU [PROC:1]    Returning existing key pair for Node first-time-connect. pid:7.1 service/connect.go:311
2022-06-27T20:00:22Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:22Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:22Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:22Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:22Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:28Z INFO [AUTH]      Successfully registered via proxy server. auth/register.go:183
2022-06-27T20:00:28Z DEBU [PROC:1]    Deleted generated key pair Node first-time-connect. pid:7.1 service/connect.go:300
2022-06-27T20:00:28Z INFO [PROC:1]    Node has obtained credentials to connect to the cluster. pid:7.1 service/connect.go:446
2022-06-27T20:00:28Z DEBU [PROC:1]    Attempting to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:880
2022-06-27T20:00:29Z DEBU [PROC:1]    Failed to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:886
2022-06-27T20:00:29Z DEBU [PROC:1]    Attempting to discover reverse tunnel address. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:895
2022-06-27T20:00:29Z DEBU [PROC:1]    Attempting to connect to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:897
2022-06-27T20:00:29Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:31Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:00:31Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:00:31Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
2022-06-27T20:00:38Z DEBU [PROC:1]    Connected to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:911
2022-06-27T20:00:38Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:44Z ERRO [PROC:1]    Failed to resolve tunnel address Get "https://monster.gravitational.co:3080/webapi/find": read tcp 10.32.6.161:49798->35.188.83.105:3080: read: connection reset by peer pid:7.1 reversetunnel/transport.go:90
2022-06-27T20:00:44Z ERRO [PROC:1]    Node failed to establish connection to cluster: Get "https://teleport.cluster.local/v2/authorities/host/one?load_keys=false": Get "https://monster.gravitational.co:3080/webapi/find": read tcp 10.32.6.161:49798->35.188.83.105:3080: read: connection reset by peer. pid:7.1 service/connect.go:87
2022-06-27T20:00:54Z DEBU [PROC:1]    Retrying connection to auth server after waiting 10.293630975s. pid:7.1 service/connect.go:99
2022-06-27T20:00:54Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T20:00:54Z DEBU [PROC:1]    Generating new key pair for Node first-time-connect. pid:7.1 service/connect.go:314
2022-06-27T20:00:54Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:54Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:54Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:54Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:54Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:56Z INFO [AUTH]      Successfully registered via proxy server. auth/register.go:183
2022-06-27T20:00:56Z DEBU [PROC:1]    Deleted generated key pair Node first-time-connect. pid:7.1 service/connect.go:300
2022-06-27T20:00:56Z INFO [PROC:1]    Node has obtained credentials to connect to the cluster. pid:7.1 service/connect.go:446
2022-06-27T20:00:56Z DEBU [PROC:1]    Attempting to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:880
2022-06-27T20:00:57Z DEBU [PROC:1]    Failed to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:886
2022-06-27T20:00:57Z DEBU [PROC:1]    Attempting to discover reverse tunnel address. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:895
2022-06-27T20:00:57Z DEBU [PROC:1]    Attempting to connect to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:897
2022-06-27T20:00:57Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:58Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:00:59Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:00:59Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
2022-06-27T20:01:01Z DEBU [PROC:1]    Connected to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:911
2022-06-27T20:01:01Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:01:01Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:01:01Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
rosstimothy added the bug and test-plan-problem labels on Jun 27, 2022
rosstimothy (Contributor, Author):

@dboslee @NajiObeid FYI. I'm not sure if this is related to any of the reversetunnel changes that were made for proxy peering, but this seems to be a problem only in v10.

dboslee (Contributor) commented Jun 27, 2022:

Based on the logs you sent, it looks like the agent is still trying to tunnel to the auth server and hasn't started the agentpool yet (which is where the bulk of our changes are). I am not certain what the issue is here, but if you have some more logs I can dig deeper into this.

rosstimothy (Contributor, Author):

Unfortunately, what is provided is all the output from the nodes that are experiencing this. They seem to be stuck somewhere in the initialization process. I didn't think the proxy peering changes were in play here, since we were able to test with 10k nodes before; I just wanted to give you a heads up in case it was somehow tangentially related.

rosstimothy (Contributor, Author):

After some further digging, v10 is hitting i/o timeout issues when a burst of nodes joins at once:

2022-06-28T13:12:39Z [MX:1]      WARN "\nERROR REPORT:\nOriginal Error: *net.OpError read tcp 10.32.0.3:3025->10.32.4.2:44488: i/o timeout\nStack Trace:\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:396 github.com/gravitational/teleport/lib/multiplexer.detectProto\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:259 github.com/gravitational/teleport/lib/multiplexer.detect\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:221 github.com/gravitational/teleport/lib/multiplexer.(*Mux).detectAndForward\n\t/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit\nUser Message: failed to peek connection\n\tread tcp 10.32.0.3:3025->10.32.4.2:44488: i/o timeout" multiplexer/multiplexer.go:224
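For context, the error above comes from the multiplexer peeking at the first bytes of each incoming connection to detect its protocol. A minimal sketch of that pattern (not Teleport's actual code; the function names and the 3-second window here are assumptions) shows how a client that is slow to send its first bytes during a join burst surfaces as an i/o timeout:

package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// peekProtocol is an illustrative stand-in for the multiplexer's detect
// step: read a few leading bytes under a read deadline.
func peekProtocol(conn net.Conn) ([]byte, error) {
	// Bound how long we wait for the client's first bytes.
	if err := conn.SetReadDeadline(time.Now().Add(3 * time.Second)); err != nil {
		return nil, err
	}
	hdr, err := bufio.NewReader(conn).Peek(8)
	if err != nil {
		// A client that connected but hasn't written anything yet ends up
		// here with an i/o timeout, like the multiplexer warning above.
		return nil, fmt.Errorf("failed to peek connection: %w", err)
	}
	// Clear the deadline before handing the connection off.
	return hdr, conn.SetReadDeadline(time.Time{})
}

func main() {
	client, server := net.Pipe()
	defer client.Close()
	go func() {
		time.Sleep(5 * time.Second) // simulate a slow joiner
		client.Write([]byte("SSH-2.0-Teleport"))
	}()
	if _, err := peekProtocol(server); err != nil {
		fmt.Println(err)
	}
}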

Profiles taken from stuck nodes show that they are waiting on a response when fetching the initial CA. Blocking forever on a response can be resolved by giving the auth http.Client a timeout. That won't address the underlying timeout problem, but it does let clients retry and eventually connect.
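A minimal sketch of that mitigation (illustrative only, not the actual Teleport change; the endpoint URL and 30-second value are assumptions): give the HTTP client used for registration an overall timeout so a hung response becomes a retryable error instead of blocking forever.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Without Timeout, a request whose response never arrives blocks the
	// registering goroutine indefinitely, which is what the stuck nodes'
	// profiles showed while fetching the initial CA.
	client := &http.Client{
		Timeout: 30 * time.Second, // illustrative value
	}
	// Placeholder URL standing in for the proxy/auth registration endpoint.
	resp, err := client.Get("https://proxy.example.com:3080/v1/webapi/host/credentials")
	if err != nil {
		fmt.Println("registration attempt failed, retry with backoff:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}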

rosstimothy (Contributor, Author) commented Jun 30, 2022:

I performed the same 1000 reverse tunnel node scaling test, binary-search style, on all commits in v10. It looks like the culprit is #12113. All prior commits exhibit the same behavior as v9.3.7; from that commit on, I see the i/o timeout errors and stuck nodes.

There is also a really stark contrast in the amount of time it takes the nodes to connect. In the graph below, the blueish lines on the left depict the time it took the 1000 nodes to connect at cc4d4df, and the orange/green lines are from 9911640.
[graph: time for 1000 nodes to connect, cc4d4df vs 9911640]

rosstimothy (Contributor, Author):

The underlying cause seems to be the removal of this check, which limited precomputing keys to the auth and proxy services. Precomputing keys on all nodes now makes them slower and requires more CPU from the GKE node pool, which the auth and proxy are also running on. I confirmed that restoring the check returns the behavior to what was observed at cc4d4df.
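Purely for illustration (the names and config fields below are assumptions, not Teleport's actual API), the restored check amounts to only starting background key precomputation when the auth or proxy service is enabled:

package main

import "fmt"

// Config is a stand-in for the process configuration.
type Config struct {
	AuthEnabled  bool
	ProxyEnabled bool
}

// startKeyPrecomputation stands in for the background keygen loop that was
// being started unconditionally, costing CPU on every reverse tunnel node.
func startKeyPrecomputation() {
	fmt.Println("precomputing host key pairs in the background")
}

func maybePrecomputeKeys(cfg Config) {
	// Restored guard: plain nodes skip the precompute loop so they do not
	// compete for CPU with the auth and proxy pods on the same node pool.
	if cfg.AuthEnabled || cfg.ProxyEnabled {
		startKeyPrecomputation()
	}
}

func main() {
	// A reverse tunnel node: neither auth nor proxy, so no precomputation.
	maybePrecomputeKeys(Config{AuthEnabled: false, ProxyEnabled: false})
}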
