Implement waiting for Connect My Computer node to join cluster by ravicious · Pull Request #30905 · gravitational/teleport

ravicious · 2023-08-23T13:31:08Z

https://github.com/gravitational/teleport/blob/master/rfd/0133-connect-my-computer.md#determining-a-successful-launch

In short, after we start a Teleport agent in Connect, we want to wait until the node successfully joins the cluster so that we can change the UI state from "starting" to "running".

This is implemented by an RPC in tsh daemon which reads the host UUID file from disk and then sets up a watcher to wait for an OpPut event for the node. Most of the work is done by NodeJoinWait in the connectmycomputer package. The Electron app calls this RPC when it wants to wait until the node joins.

Best reviewed commit-by-commit.

If you'd like to test it on a real cluster, check out ravicious/launch-js, open the config file, set feature.connectMyComputer to true, restart the app, and then click the laptop icon in the upper right of the view of all cluster resources.

The setup and subsequent launches should show a success state only once we detect that the node has joined the cluster.

gzdunek · 2023-08-24T13:09:47Z

For some reason, after completing the setup, WaitForConnectMyComputerNodeJoin doesn't return the hostname value in labels:

[tshd] info: receive: /teleport.lib.teleterm.v1.TerminalService/WaitForConnectMyComputerNodeJoin -> ({"server":{"uri":"/clusters/mercury.cloud.gravitational.io/servers/fd4c100e-47d4-489d-a0f0-edfcc9eaf086","tunnel":true,"name":"fd4c100e-47d4-489d-a0f0-edfcc9eaf086","hostname":"mbp.home","addr":"","labelsList":{"0":{"name":"hostname","value":""},"1":{"name":"teleport.dev/connect-my-computer/owner","value":"grzegorz.zdunek@goteleport.com"}}}})

However, on subsequent calls it works.

Does it happen to you too?

ravicious · 2023-08-24T13:21:12Z

@gzdunek That's probably because the hostname label is dynamic and it's not yet set when the first OpPut event comes in. On subsequent calls, the RPC returns the node through GetNode rather than from the event.

I suspected that returning the resource from the event might bite me, but not in this way.

We could certainly attempt to refetch the node after receiving the event rather than returning it straight from the event. What do you think?

gzdunek · 2023-08-24T14:11:29Z

> We could certainly attempt to refetch the node after receiving the event rather than returning it straight from the event. What do you think?

Yeah, let's do this.

ibeckermayer · 2023-08-29T00:01:52Z

+	// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
+	// timeout internally as well. Both operations in this struct theoretically block forever if an
+	// error happens elsewhere and gRPC doesn't set a timeout by default.
+	//
+	// If there's a bug in the Electron app and it forgets to abort the request, at least the request
+	// will not hang forever.


Suggested change

-	// The Electron app aborts the request which calls NodeJoinWait after a timeout, but let's use a
-	// timeout internally as well. Both operations in this struct theoretically block forever if an
-	// error happens elsewhere and gRPC doesn't set a timeout by default.
-	//
-	// If there's a bug in the Electron app and it forgets to abort the request, at least the request
-	// will not hang forever.
+	// The Electron app aborts this request after a timeout, but we set a timeout here as well
+	// to ensure that the request can't hang forever if there's a bug in the request abort logic.

Two things to consider:

1. Perhaps move this timeoutCtx up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.

2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.

> Perhaps move this timeoutCtx up into the endpoint itself. That way it's clear that the whole endpoint just times out after a minute.

Yeah, I wasn't sure if it'd be best to add a timeout tightly around the operation that actually blocks vs for the whole endpoint.

Since those individual services technically could serve as building blocks for other requests, I guess it's best to add the timeout on the endpoint level.

> 2. I'm not sure how big of a change this would be, but since we're adding timeout logic here anyways, there may be a case for just removing it from the frontend code and using this logic exclusively instead, or having the frontend pass the timeout as a request parameter.

Now that I think about it, under normal circumstances this would be the way to go. However, the UI needs to be able to distinguish between a timeout and other errors. The JS gRPC client encodes that information as custom properties on the error object. gRPC errors in our app pass through the Electron context bridge, which strips all custom properties from Error objects, meaning we'd only have the error message to tell a timeout apart from other errors.

So until we revamp how those errors are passed, I think I'm going to keep the current timeout implementation as is.

I'm still not sure if an endpoint should trouble itself with the deadline. Perhaps it should instead return BadParameter if no deadline was set on the gRPC request? I'd rather use built-in mechanisms to handle stuff like this than build my own on top of gRPC.

ravicious · 2023-09-19T09:54:20Z

Ping @ibeckermayer, we have two other PRs stacked on this one so I'd like to merge it if possible. ;)

ibeckermayer · 2023-09-20T18:41:59Z

What this comment is pointing out isn't quite clear to me (granted it could just be a function of the fact that I'm unfamiliar with this part of the codebase).

Oh, I think the comment doesn't make it clear that the config comes from teleport node config that we call during the setup of Connect My Computer.

I'll update the comment to the best of my ability (3fcd4b2c9f68) and then merge this PR, but please let me know if the updated comment is any better.

ravicious · 2023-09-21T07:07:05Z

Merged #30910 into this PR with fast forward.


public-teleport-github-review-bot · 2023-09-21T12:02:48Z

@ravicious See the table below for backport results.

Branch	Result
branch/v14	Failed


ravicious marked this pull request as ready for review August 23, 2023 13:45

github-actions Bot requested review from ibeckermayer and ryanclark August 23, 2023 13:45

github-actions Bot added the size/md and tsh labels Aug 23, 2023

ravicious force-pushed the ravicious/launch branch from 4f4b2f3 to 2b77354 Compare August 23, 2023 13:47


This was referenced Aug 23, 2023

Add proper implementation of waitForNodeToJoin #30910

Merged

Connect My Computer #27881

Closed

ibeckermayer reviewed Aug 24, 2023


ravicious requested a review from ibeckermayer August 24, 2023 11:07

ravicious mentioned this pull request Aug 24, 2023

Generate user login state from access lists and integrate into certificates. #29364

Merged

ibeckermayer reviewed Aug 29, 2023


ravicious requested review from gzdunek and ibeckermayer September 15, 2023 13:47

gzdunek approved these changes Sep 15, 2023


ravicious added the backport/branch/v14 label Sep 15, 2023

ibeckermayer approved these changes Sep 20, 2023


Add daemon.Service.ResolveClusterURI

e659722

ravicious force-pushed the ravicious/launch branch from 3fcd4b2 to b900313 Compare September 21, 2023 06:44

ravicious added 2 commits September 21, 2023 09:07

Accept agents dir through command line flag

83e139b

tshd needs to know this out of band, so that when the Electron app tells it to watch for host UUID file for a specific cluster, the Electron app can send just the profile name of the cluster instead of an arbitrary path on the computer.

Implement WaitForConnectMyComputerNodeJoin in tsh daemon

11169ea

ravicious added 3 commits September 21, 2023 09:07

wait: Use addEventListener instead of onabort

d3cd207

Make TshAbortController emit abort event only once

5397ce1

This aligns it with a regular AbortController, which also emits the event only once.

Refactor how types are imported in tshd fixtures

bda771b

ravicious force-pushed the ravicious/launch branch from 53c98c0 to 09efa36 Compare September 21, 2023 07:07

ravicious added 4 commits September 21, 2023 09:17

Implement WaitForConnectMyComputerNodeJoin in Electron app

db0c591

createAbortController: Add signal.aborted, use emitter.once

b8716fb

Improve wait function based on Deno implementation

7c0f430

https://github.com/denoland/deno_std/blob/72d6e6641e3cd39ae69fba89a78feab354018ef0/async/delay.ts#L39

Add a comment about the events package

692b42d

ravicious force-pushed the ravicious/launch branch from 09efa36 to 692b42d Compare September 21, 2023 07:18

ravicious enabled auto-merge September 21, 2023 07:20

ravicious changed the title ~~Add RPC for waiting until Connect My Computer node joins cluster~~ Implement waiting for Connect My Computer node to join cluster Sep 21, 2023

avatus approved these changes Sep 21, 2023


public-teleport-github-review-bot Bot removed the request for review from ryanclark September 21, 2023 11:43

ravicious added this pull request to the merge queue Sep 21, 2023

Merged via the queue into master with commit 847a1b1 Sep 21, 2023

ravicious deleted the ravicious/launch branch September 21, 2023 12:01

ravicious mentioned this pull request Sep 21, 2023

[v14] Implement waiting for Connect My Computer node to join cluster #32295

Merged
