Skip to content

Implement waiting for Connect My Computer node to join cluster#30905

Merged
ravicious merged 10 commits intomasterfrom
ravicious/launch
Sep 21, 2023
Merged

Implement waiting for Connect My Computer node to join cluster#30905
ravicious merged 10 commits intomasterfrom
ravicious/launch

Conversation

@ravicious
Copy link
Copy Markdown
Member

@ravicious ravicious commented Aug 23, 2023

https://github.com/gravitational/teleport/blob/master/rfd/0133-connect-my-computer.md#determining-a-successful-launch

In short, after we start a Teleport agent in Connect, we want to wait until the node successfully joins the cluster so that we can change the UI state from "starting" to "running".

This is implemented by an RPC in tsh daemon which reads host UUID file from disk and then sets up a watcher to wait for OpPut of the node. Most of the work is done by NodeJoinWait in the connectmycomputer package. When the Electron app wants to wait until the node join, it calls this RPC.

Best reviewed commit-by-commit.


If you'd like to test it on a real cluster, then you need to checkout ravicious/launch-js, open the config file, set feature.connectMyComputer to true, restart the app and then click the laptop icon in the upper right of the view of all cluster resources.

The setup and subsequent launches should show a success state only once we detect that the node has joined the cluster.

Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go Outdated
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go Outdated
@ravicious ravicious marked this pull request as ready for review August 23, 2023 13:45
@github-actions github-actions Bot added size/md tsh tsh - Teleport's command line tool for logging into nodes running Teleport. labels Aug 23, 2023
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go
Comment thread integration/teleterm_test.go
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go Outdated
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go Outdated
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go
Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go
@ravicious ravicious requested a review from ibeckermayer August 24, 2023 11:07
@gzdunek
Copy link
Copy Markdown
Contributor

gzdunek commented Aug 24, 2023

For some reason, after completing the setup, WaitForConnectMyComputerNodeJoin doesn't return the hostname value in labels:

[tshd] info: receive: /teleport.lib.teleterm.v1.TerminalService/WaitForConnectMyComputerNodeJoin -> ({"server":{"uri":"/clusters/mercury.cloud.gravitational.io/servers/fd4c100e-47d4-489d-a0f0-edfcc9eaf086","tunnel":true,"name":"fd4c100e-47d4-489d-a0f0-edfcc9eaf086","hostname":"mbp.home","addr":"","labelsList":{"0":{"name":"hostname","value":""},"1":{"name":"teleport.dev/connect-my-computer/owner","value":"grzegorz.zdunek@goteleport.com"}}}})

However, on subsequent calls it works.

Does it happen to you too?

@ravicious
Copy link
Copy Markdown
Member Author

@gzdunek That's probably because the hostname label is dynamic and it's not yet set when the first OpPut event comes in. On subsequent calls, the RPC returns the node through GetNode rather than from the event.

I suspected that returning the resource from the event might bite me, but not in this way.

We could certainly attempt to refetch the node after receiving the event rather than returning it straight from the event. What do you think?

@gzdunek
Copy link
Copy Markdown
Contributor

gzdunek commented Aug 24, 2023

We could certainly attempt to refetch the node after receiving the event rather than returning it straight from the event. What do you think?

Yeah, let's do this.

Comment on lines +351 to +356
// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
// timeout internally as well. Both operations in this struct theoretically block forever if an
// error happens elsewhere and gRPC doesn't set a timeout by default.
//
// If there's a bug in the Electron app and it forgets to abort the request, at least the request
// will not hang forever.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
// timeout internally as well. Both operations in this struct theoretically block forever if an
// error happens elsewhere and gRPC doesn't set a timeout by default.
//
// If there's a bug in the Electron app and it forgets to abort the request, at least the request
// will not hang forever.
// The Electron app aborts this request after a timeout, but we set a timeout here as well for
// to ensure that the request can't hang forever if there's a bug in the request abort logic.

Two things to consider:

  1. Perhaps move this timeoutCtx up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.
  2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Perhaps move this timeoutCtx up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.

Yeah, I wasn't sure if it'd be best to add a timeout tightly around the operation that actually blocks vs for the whole endpoint.

Since those individual services technically could serve as building blocks for other requests, I guess it's best to add the timeout on the endpoint level.

2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.

Now that I think about it, under normal circumstances this would be the way to go. However, the UI needs to be able to distinguish between a timeout and other errors. The JS gRPC client encodes that information as custom properties on the error object. gRPC errors in our app pass through the Electron context bridge which strips all custom properties from Error objects, meaning we'd only have the error message to distinguish between a timeout or not.

So until we revamp how those errors are passed, I think I'm going to keep current implementation as is with regards to the timeout.

I'm still not sure if an endpoint should trouble itself with the deadline, perhaps it should rather return BadParameter if no deadline was set on the gRPC request? I'd rather use built-in mechanisms to handle stuff like this vs building my own on top of gRPC.

Comment thread lib/teleterm/services/connectmycomputer/connectmycomputer.go Outdated
@ravicious
Copy link
Copy Markdown
Member Author

Ping @ibeckermayer, we have two other PRs stacked on this one so I'd like to merge it if possible. ;)

Comment on lines 363 to 365
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this comment is pointing out isn't quite clear to me (granted it could just be a function of the fact that I'm unfamiliar with this part of the codebase).

Copy link
Copy Markdown
Member Author

@ravicious ravicious Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think the comment doesn't make it clear that the config comes from teleport node config that we call during the setup of Connect My Computer.

I'll update the comment to my best ability (3fcd4b2c9f68) and then merge this PR, but please let me know if the updated comment is any better.

@ravicious
Copy link
Copy Markdown
Member Author

Merged #30910 into this PR with fast forward.

tshd needs to know this out of band, so that when the Electron app tells
it to watch for host UUID file for a specific cluster, the Electron app
can send just the profile name of the cluster instead of an arbitrary path
on the computer.
@ravicious ravicious enabled auto-merge September 21, 2023 07:20
@ravicious ravicious changed the title Add RPC for waiting until Connect My Computer node joins cluster Implement waiting for Connect My Computer node to join cluster Sep 21, 2023
@public-teleport-github-review-bot public-teleport-github-review-bot Bot removed the request for review from ryanclark September 21, 2023 11:43
@ravicious ravicious added this pull request to the merge queue Sep 21, 2023
Merged via the queue into master with commit 847a1b1 Sep 21, 2023
@ravicious ravicious deleted the ravicious/launch branch September 21, 2023 12:01
@public-teleport-github-review-bot
Copy link
Copy Markdown

@ravicious See the table below for backport results.

Branch Result
branch/v14 Failed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/md tsh tsh - Teleport's command line tool for logging into nodes running Teleport.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants