Reverse Tunnel Nodes getting stuck initializing #13911

Closed
rosstimothy opened this issue Jun 27, 2022 · 6 comments · Fixed by #14021
Labels: bug, test-plan-problem (issues which have been surfaced by running the manual release test plan)

rosstimothy (Contributor):

Expected behavior:

All reverse tunnel nodes are connected and reachable

Current behavior:

Running scaling tests for load testing revealed that a small number of nodes were not actually connecting.

The deployment shows that it has fully scaled up to 1000 reverse tunnel node pods:

kubectl describe -n loadtest-tross deploy/iot-node
Name:                   iot-node
Namespace:              loadtest-tross
CreationTimestamp:      Mon, 27 Jun 2022 15:51:30 -0400
Labels:                 teleport-role=node
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               node=iot,teleport-role=node
Replicas:               1000 desired | 1000 updated | 1000 total | 1000 available | 0 unavailable

tctl shows that only a subset have registered:

tctl get nodes --format=json | jq -r '.[].spec.hostname' | wc -l
809

Bug details:

  • Teleport version 10.0.0-alpha.1, 10.0.0-alpha.2
  • Recreation steps: Run a cluster with a large number of reverse tunnel nodes. Watch that they don't all appear.
Debug log
kubectl logs -n loadtest-tross iot-node-9ccb8bc98-kn4xj
2022-06-27T19:59:51Z INFO [PROC:1]    Generating new host UUID: a0ea36f9-0dc0-44fb-8537-93a48188c8e9. pid:7.1 service/service.go:735
2022-06-27T19:59:51Z DEBU [SQLITE]    Connected to: file:%2Fvar%2Flib%2Fteleport%2Fproc%2Fsqlite.db?_busy_timeout=10000&_sync=FULL&_txlock=immediate, poll stream period: 1s lite/lite.go:246
2022-06-27T19:59:52Z DEBU [SQLITE]    journal_mode=delete, synchronous=2, busy_timeout=10000 lite/lite.go:293
2022-06-27T19:59:52Z INFO [PROC:1]    Adding diagnostic debugging handlers. To connect with profiler, use `go tool pprof 0.0.0.0:3434`. pid:7.1 service/service.go:2387
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:readyz.monitor service/supervisor.go:207
2022-06-27T19:59:52Z INFO [PROC:1]    Service diag is creating new listener on 0.0.0.0:3434. pid:7.1 service/signals.go:212
2022-06-27T19:59:52Z INFO [DIAG:1]    Starting diagnostic service on 0.0.0.0:3434. pid:7.1 service/service.go:2467
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:diagnostic.service service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:diagnostic.shutdown service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:register.node service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:ssh.node service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:ssh.shutdown service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Adding service to supervisor. pid:7.1 service:common.rotate service/supervisor.go:207
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:diagnostic.shutdown service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    No signal pipe to import, must be first Teleport process. pid:7.1 service/service.go:995
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:readyz.monitor service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:diagnostic.service service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:ssh.node service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:register.node service/supervisor.go:268
2022-06-27T19:59:52Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T19:59:52Z DEBU [PROC:1]    Generating new key pair for Node first-time-connect. pid:7.1 service/connect.go:314
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:ssh.shutdown service/supervisor.go:268
2022-06-27T19:59:52Z DEBU [PROC:1]    Service has started. pid:7.1 service:common.rotate service/supervisor.go:268
2022-06-27T20:00:01Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:01Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:01Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:01Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:01Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:17Z DEBU [AUTH]      Registration via proxy server failed. error:[
ERROR REPORT:
Original Error: *url.Error Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF
Stack Trace:
	/go/src/github.com/gravitational/teleport/lib/client/https_client.go:113 github.com/gravitational/teleport/lib/client.(*WebClient).PostJSONWithFallback
	/go/src/github.com/gravitational/teleport/lib/client/weblogin.go:444 github.com/gravitational/teleport/lib/client.HostCredentials
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:220 github.com/gravitational/teleport/lib/auth.registerThroughProxy
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:177 github.com/gravitational/teleport/lib/auth.Register
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:418 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:155 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:137 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:64 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/service.go:2191 github.com/gravitational/teleport/lib/service.(*TeleportProcess).registerWithAuthServer.func1
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:521 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:269 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit
User Message: Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF] auth/register.go:180
2022-06-27T20:00:17Z INFO [AUTH]      Attempting registration with auth server. auth/register.go:176
2022-06-27T20:00:17Z WARN [AUTH]      Joining cluster without validating the identity of the Auth Server. This may open you up to a Man-In-The-Middle (MITM) attack if an attacker can gain privileged network access. To remedy this, use the CA pin value provided when join token was generated to validate the identity of the Auth Server. auth/register.go:332
2022-06-27T20:00:20Z DEBU [AUTH]      Registration with auth server failed. error:[
ERROR REPORT:
Original Error: *json.SyntaxError invalid character '<' looking for beginning of value
Stack Trace:
	/go/src/github.com/gravitational/teleport/lib/auth/clt.go:480 github.com/gravitational/teleport/lib/auth.(*Client).RegisterUsingToken
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:267 github.com/gravitational/teleport/lib/auth.registerThroughAuth
	/go/src/github.com/gravitational/teleport/lib/auth/register.go:177 github.com/gravitational/teleport/lib/auth.Register
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:418 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:155 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:137 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/connect.go:64 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	/go/src/github.com/gravitational/teleport/lib/service/service.go:2191 github.com/gravitational/teleport/lib/service.(*TeleportProcess).registerWithAuthServer.func1
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:521 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	/go/src/github.com/gravitational/teleport/lib/service/supervisor.go:269 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit
User Message: invalid character '<' looking for beginning of value] auth/register.go:180
2022-06-27T20:00:20Z ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://monster.gravitational.co:3080/v1/webapi/host/credentials": EOF, invalid character '<' looking for beginning of value. pid:7.1 service/connect.go:87
2022-06-27T20:00:22Z DEBU [PROC:1]    Retrying connection to auth server after waiting 2.205020838s. pid:7.1 service/connect.go:99
2022-06-27T20:00:22Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T20:00:22Z DEBU [PROC:1]    Returning existing key pair for Node first-time-connect. pid:7.1 service/connect.go:311
2022-06-27T20:00:22Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:22Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:22Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:22Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:22Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:28Z INFO [AUTH]      Successfully registered via proxy server. auth/register.go:183
2022-06-27T20:00:28Z DEBU [PROC:1]    Deleted generated key pair Node first-time-connect. pid:7.1 service/connect.go:300
2022-06-27T20:00:28Z INFO [PROC:1]    Node has obtained credentials to connect to the cluster. pid:7.1 service/connect.go:446
2022-06-27T20:00:28Z DEBU [PROC:1]    Attempting to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:880
2022-06-27T20:00:29Z DEBU [PROC:1]    Failed to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:886
2022-06-27T20:00:29Z DEBU [PROC:1]    Attempting to discover reverse tunnel address. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:895
2022-06-27T20:00:29Z DEBU [PROC:1]    Attempting to connect to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:897
2022-06-27T20:00:29Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:31Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:00:31Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:00:31Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
2022-06-27T20:00:38Z DEBU [PROC:1]    Connected to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:911
2022-06-27T20:00:38Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:44Z ERRO [PROC:1]    Failed to resolve tunnel address Get "https://monster.gravitational.co:3080/webapi/find": read tcp 10.32.6.161:49798->35.188.83.105:3080: read: connection reset by peer pid:7.1 reversetunnel/transport.go:90
2022-06-27T20:00:44Z ERRO [PROC:1]    Node failed to establish connection to cluster: Get "https://teleport.cluster.local/v2/authorities/host/one?load_keys=false": Get "https://monster.gravitational.co:3080/webapi/find": read tcp 10.32.6.161:49798->35.188.83.105:3080: read: connection reset by peer. pid:7.1 service/connect.go:87
2022-06-27T20:00:54Z DEBU [PROC:1]    Retrying connection to auth server after waiting 10.293630975s. pid:7.1 service/connect.go:99
2022-06-27T20:00:54Z INFO [PROC:1]    Joining the cluster with a secure token. pid:7.1 service/connect.go:411
2022-06-27T20:00:54Z DEBU [PROC:1]    Generating new key pair for Node first-time-connect. pid:7.1 service/connect.go:314
2022-06-27T20:00:54Z DEBU [AUTH]      Registering node to the cluster. auth-servers:[{monster.gravitational.co:3080 tcp }] auth/register.go:156
2022-06-27T20:00:54Z DEBU [AUTH]      The first specified auth server appears to be a proxy. auth/register.go:170
2022-06-27T20:00:54Z INFO [AUTH]      Attempting registration via proxy server. auth/register.go:176
2022-06-27T20:00:54Z DEBU [CLIENT]    HTTPS client init(proxyAddr=monster.gravitational.co:3080, insecure=true) client/weblogin.go:233
WARNING: You are using insecure connection to SSH proxy https://monster.gravitational.co:3080
2022-06-27T20:00:54Z DEBU [CLIENT]    Attempting https://monster.gravitational.co:3080/v1/webapi/host/credentials client/https_client.go:93
2022-06-27T20:00:56Z INFO [AUTH]      Successfully registered via proxy server. auth/register.go:183
2022-06-27T20:00:56Z DEBU [PROC:1]    Deleted generated key pair Node first-time-connect. pid:7.1 service/connect.go:300
2022-06-27T20:00:56Z INFO [PROC:1]    Node has obtained credentials to connect to the cluster. pid:7.1 service/connect.go:446
2022-06-27T20:00:56Z DEBU [PROC:1]    Attempting to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:880
2022-06-27T20:00:57Z DEBU [PROC:1]    Failed to connect to Auth Server directly. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:886
2022-06-27T20:00:57Z DEBU [PROC:1]    Attempting to discover reverse tunnel address. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:895
2022-06-27T20:00:57Z DEBU [PROC:1]    Attempting to connect to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:897
2022-06-27T20:00:57Z DEBU             Attempting GET monster.gravitational.co:3080/webapi/find webclient/webclient.go:115
2022-06-27T20:00:58Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:00:59Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:00:59Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
2022-06-27T20:01:01Z DEBU [PROC:1]    Connected to Auth Server through tunnel. auth-addrs:[monster.gravitational.co:3080] pid:7.1 service/connect.go:911
2022-06-27T20:01:01Z DEBU             Attempting GET monster.gravitational.co:3024/webapi/find webclient/webclient.go:115
2022-06-27T20:01:01Z DEBU [PROC:1]    Failed to ping web proxy "monster.gravitational.co:3024" addr: Get "https://monster.gravitational.co:3024/webapi/find": tls: first record does not look like a TLS handshake pid:7.1 reversetunnel/transport.go:100
2022-06-27T20:01:01Z DEBU [HTTP:PROX] No proxy set in environment, returning direct dialer. proxy/proxy.go:296
rosstimothy added the bug and test-plan-problem labels on Jun 27, 2022
rosstimothy (Contributor, Author):

@dboslee @NajiObeid FYI. I'm not sure if this is related to any of the reversetunnel changes that were made for proxy peering, but this seems to be a problem only in v10.

dboslee (Contributor) commented Jun 27, 2022:

Based on the logs you sent, it looks like the agent is still trying to tunnel to the auth server and hasn't started the agentpool yet (which is where the bulk of our changes are). I am not certain what the issue is here, but if you have some more logs I can dig deeper into this.

rosstimothy (Contributor, Author):

Unfortunately, what is provided is all the output from the nodes that are experiencing this. They seem to be stuck somewhere in the initialization process. I didn't think the proxy peering changes were in play here, since we were able to test with 10k nodes before; I just wanted to give you a heads up in case it was somehow tangentially related.

rosstimothy (Contributor, Author):

After some further digging, v10 is hitting i/o timeout issues when a burst of nodes joins at once:

2022-06-28T13:12:39Z [MX:1]      WARN "\nERROR REPORT:\nOriginal Error: *net.OpError read tcp 10.32.0.3:3025->10.32.4.2:44488: i/o timeout\nStack Trace:\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:396 github.com/gravitational/teleport/lib/multiplexer.detectProto\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:259 github.com/gravitational/teleport/lib/multiplexer.detect\n\t/go/src/github.com/gravitational/teleport/lib/multiplexer/multiplexer.go:221 github.com/gravitational/teleport/lib/multiplexer.(*Mux).detectAndForward\n\t/opt/go/src/runtime/asm_amd64.s:1571 runtime.goexit\nUser Message: failed to peek connection\n\tread tcp 10.32.0.3:3025->10.32.4.2:44488: i/o timeout" multiplexer/multiplexer.go:224
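For context, the error above comes from the multiplexer peeking at the first bytes of each incoming connection to detect its protocol. A minimal sketch of that pattern (not Teleport's actual code; the function names and the 3-second window here are assumptions) shows how a client that is slow to send its first bytes during a join burst surfaces as an i/o timeout:

package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// peekProtocol is an illustrative stand-in for the multiplexer's detect
// step: read a few leading bytes under a read deadline.
func peekProtocol(conn net.Conn) ([]byte, error) {
	// Bound how long we wait for the client's first bytes.
	if err := conn.SetReadDeadline(time.Now().Add(3 * time.Second)); err != nil {
		return nil, err
	}
	hdr, err := bufio.NewReader(conn).Peek(8)
	if err != nil {
		// A client that connected but hasn't written anything yet ends up
		// here with an i/o timeout, like the multiplexer warning above.
		return nil, fmt.Errorf("failed to peek connection: %w", err)
	}
	// Clear the deadline before handing the connection off.
	return hdr, conn.SetReadDeadline(time.Time{})
}

func main() {
	client, server := net.Pipe()
	defer client.Close()
	go func() {
		time.Sleep(5 * time.Second) // simulate a slow joiner
		client.Write([]byte("SSH-2.0-Teleport"))
	}()
	if _, err := peekProtocol(server); err != nil {
		fmt.Println(err)
	}
}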

Profiles taken from stuck nodes show that they are waiting on a response when fetching the initial CA. Blocking forever on a response can be resolved by giving the auth http.Client a timeout. That won't address the underlying timeout problem, but it does let clients retry and eventually connect.
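A minimal sketch of that mitigation (illustrative only, not the actual Teleport change; the endpoint URL and 30-second value are assumptions): give the HTTP client used for registration an overall timeout so a hung response becomes a retryable error instead of blocking forever.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Without Timeout, a request whose response never arrives blocks the
	// registering goroutine indefinitely, which is what the stuck nodes'
	// profiles showed while fetching the initial CA.
	client := &http.Client{
		Timeout: 30 * time.Second, // illustrative value
	}
	// Placeholder URL standing in for the proxy/auth registration endpoint.
	resp, err := client.Get("https://proxy.example.com:3080/v1/webapi/host/credentials")
	if err != nil {
		fmt.Println("registration attempt failed, retry with backoff:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}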

rosstimothy (Contributor, Author) commented Jun 30, 2022:

I performed the same 1000 reverse tunnel node scaling test, binary-search style, on all commits in v10. It looks like the culprit is #12113. All prior commits exhibit the same behavior as v9.3.7; from that commit on, I see the i/o timeout errors and stuck nodes.

There is also a really stark contrast in the amount of time it takes the nodes to connect. In the graph below, the blueish lines on the left depict the time it took the 1000 nodes to connect at cc4d4df, and the orange/green lines are from 9911640.
[graph: time for 1000 nodes to connect, cc4d4df vs 9911640]

rosstimothy (Contributor, Author):

The underlying cause seems to be the removal of this check, which limited precomputing keys to the auth and proxy services. Precomputing keys on all nodes now makes them slower and requires more CPU from the GKE node pool, which the auth and proxy are also running on. I confirmed that restoring the check returns the behavior to what was observed at cc4d4df.
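Purely for illustration (the names and config fields below are assumptions, not Teleport's actual API), the restored check amounts to only starting background key precomputation when the auth or proxy service is enabled:

package main

import "fmt"

// Config is a stand-in for the process configuration.
type Config struct {
	AuthEnabled  bool
	ProxyEnabled bool
}

// startKeyPrecomputation stands in for the background keygen loop that was
// being started unconditionally, costing CPU on every reverse tunnel node.
func startKeyPrecomputation() {
	fmt.Println("precomputing host key pairs in the background")
}

func maybePrecomputeKeys(cfg Config) {
	// Restored guard: plain nodes skip the precompute loop so they do not
	// compete for CPU with the auth and proxy pods on the same node pool.
	if cfg.AuthEnabled || cfg.ProxyEnabled {
		startKeyPrecomputation()
	}
}

func main() {
	// A reverse tunnel node: neither auth nor proxy, so no precomputation.
	maybePrecomputeKeys(Config{AuthEnabled: false, ProxyEnabled: false})
}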
