Implement waiting for Connect My Computer node to join cluster #30905
Force-pushed from 4f4b2f3 to 2b77354.
For some reason, the first call after completing the setup fails. However, subsequent calls work. Does it happen to you too?
@gzdunek That's probably because the hostname label is dynamic and it's not yet set when the first event arrives. I suspected that returning the resource from the event might bite me, but not in this way. We could certainly attempt to refetch the node after receiving the event rather than returning it straight from the event. What do you think?
Yeah, let's do this.
```go
// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
// timeout internally as well. Both operations in this struct theoretically block forever if an
// error happens elsewhere and gRPC doesn't set a timeout by default.
//
// If there's a bug in the Electron app and it forgets to abort the request, at least the request
// will not hang forever.
```
```diff
-// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
-// timeout internally as well. Both operations in this struct theoretically block forever if an
-// error happens elsewhere and gRPC doesn't set a timeout by default.
-//
-// If there's a bug in the Electron app and it forgets to abort the request, at least the request
-// will not hang forever.
+// The Electron app aborts this request after a timeout, but we set a timeout here as well
+// to ensure that the request can't hang forever if there's a bug in the request abort logic.
```
Two things to consider:

1. Perhaps move this `timeoutCtx` up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.
2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.
> 1. Perhaps move this `timeoutCtx` up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.
Yeah, I wasn't sure if it'd be best to add a timeout tightly around the operation that actually blocks vs for the whole endpoint.
Since those individual services technically could serve as building blocks for other requests, I guess it's best to add the timeout on the endpoint level.
> 2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.
Now that I think about it, under normal circumstances this would be the way to go. However, the UI needs to be able to distinguish between a timeout and other errors. The JS gRPC client encodes that information as custom properties on the error object. gRPC errors in our app pass through the Electron context bridge which strips all custom properties from Error objects, meaning we'd only have the error message to distinguish between a timeout or not.
So until we revamp how those errors are passed, I think I'm going to keep the current implementation as is with regards to the timeout.
I'm still not sure if an endpoint should trouble itself with the deadline; perhaps it should rather return `BadParameter` if no deadline was set on the gRPC request? I'd rather use built-in mechanisms to handle stuff like this vs building my own on top of gRPC.
Ping @ibeckermayer, we have two other PRs stacked on this one so I'd like to merge it if possible. ;)
What this comment is pointing out isn't quite clear to me (granted it could just be a function of the fact that I'm unfamiliar with this part of the codebase).
Oh, I think the comment doesn't make it clear that the config comes from the `teleport node` configuration that we generate during the setup of Connect My Computer.
I'll update the comment to my best ability (3fcd4b2c9f68) and then merge this PR, but please let me know if the updated comment is any better.
Force-pushed from 3fcd4b2 to b900313.
Merged #30910 into this PR with fast forward.
tshd needs to know this out of band, so that when the Electron app tells it to watch the host UUID file for a specific cluster, the Electron app can send just the profile name of the cluster instead of an arbitrary path on the computer.
This aligns it with a regular AbortController, which also emits the event only once.
Force-pushed from 53c98c0 to 09efa36.
Force-pushed from 09efa36 to 692b42d.
@ravicious See the table below for backport results.
https://github.com/gravitational/teleport/blob/master/rfd/0133-connect-my-computer.md#determining-a-successful-launch
In short, after we start a Teleport agent in Connect, we want to wait until the node successfully joins the cluster so that we can change the UI state from "starting" to "running".
This is implemented by an RPC in tsh daemon which reads the host UUID file from disk and then sets up a watcher to wait for `OpPut` of the node. Most of the work is done by `NodeJoinWait` in the `connectmycomputer` package. When the Electron app wants to wait until the node joins, it calls this RPC.

Best reviewed commit-by-commit.
If you'd like to test it on a real cluster, then you need to check out `ravicious/launch-js`, open the config file, set `feature.connectMyComputer` to `true`, restart the app and then click the laptop icon in the upper right of the view of all cluster resources.

The setup and subsequent launches should show a success state only once we detect that the node has joined the cluster.