Reverse Tunnel Nodes getting stuck initializing #13911
@dboslee @NajiObeid FYI. I'm not sure if this is related to any of the
Based on the logs you sent, it looks like the agent is still trying to tunnel to the auth server and hasn't started the agentpool yet (which is where the bulk of our changes are). I am not certain what the issue is here, but if you have some more logs I can dig deeper into this.
Unfortunately, what is provided is all the output from the nodes that are experiencing this. The process seems to be stuck somewhere in initialization. I didn't think the proxy peering changes were in play here, since we were able to test with 10k nodes before; I just wanted to give you a heads up in case it was somehow tangentially related.
After some further digging, v10 is having some
Profiles taken from stuck nodes show that they are waiting on a response while getting the initial CA. Blocking forever on that response can be resolved by providing a timeout to the auth
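For illustration, here is a minimal Go sketch of the kind of fix described above: wrapping the blocking initial-CA fetch in a context with a deadline so a node that never hears back surfaces an error and can retry instead of hanging in initialization. The `AuthClient` interface, `GetCertAuthority` method, and the 30-second value are placeholders, not the actual Teleport API:

```go
package agent

import (
	"context"
	"fmt"
	"time"
)

// CertAuthority and AuthClient are hypothetical stand-ins for the real
// Teleport types; they exist only to keep this sketch self-contained.
type CertAuthority struct{ ClusterName string }

type AuthClient interface {
	GetCertAuthority(ctx context.Context) (*CertAuthority, error)
}

// fetchInitialCA bounds the otherwise-unbounded request with a deadline so
// that a node stuck waiting on the auth server fails fast (and can retry)
// instead of blocking forever during initialization.
func fetchInitialCA(ctx context.Context, clt AuthClient) (*CertAuthority, error) {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	ca, err := clt.GetCertAuthority(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetching initial CA: %w", err)
	}
	return ca, nil
}
```

The point of the sketch is only that the call gets a deadline; the caller's existing retry loop then decides whether to try again or report the node as failed.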
I performed the same 1000 reverse tunnel node scaling test, binary-search style, on all commits in v10. It looks like the culprit for this is #12113; all commits prior exhibit the same behavior as

There is also a really stark contrast in the amount of time it takes the nodes to connect. In the graph below, the bluish lines on the left depict the time it took the 1000 nodes to connect in cc4d4df, and the orange/green lines are from 9911640.
The underlying cause seems to be the removal of this check, which limited key precomputation to the auth and proxy services. Precomputing keys on all nodes now makes them slower and requires more CPU from the GKE node pool, which the auth and proxy are also running on. Confirmed that restoring the check returns the behavior to what was observed in cc4d4df.
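As a rough sketch of the restored gate (with hypothetical names, not the actual Teleport config fields or keygen API): key precomputation only starts when the process runs the auth or proxy service, so plain reverse tunnel nodes stop spending CPU on keys they rarely need.

```go
package agent

// Config is a hypothetical slice of the process configuration; only the
// per-service Enabled flags matter for this sketch.
type Config struct {
	Auth  struct{ Enabled bool }
	Proxy struct{ Enabled bool }
}

// precomputeKeys stands in for the background key-precompute pool that,
// per this thread, started running unconditionally on every node.
func precomputeKeys() {
	// ...spin up background keygen workers...
}

// startKeygen restores the removed check: only auth and proxy services
// precompute keys, since they are the ones that mint certificates often
// enough to benefit from a warm key pool.
func startKeygen(cfg *Config) {
	if !cfg.Auth.Enabled && !cfg.Proxy.Enabled {
		return
	}
	precomputeKeys()
}
```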
Expected behavior:
All reverse tunnel nodes are connected and reachable
Current behavior:
Running scaling tests for load testing revealed a small number of nodes were not actually connecting.
The deployment shows that it has fully scaled up to 1000 reverse tunnel node pods:
tctl only shows a subset have registered:

Bug details:
Teleport version: 10.0.0-alpha.1, 10.0.0-alpha.2
Debug log