Move cluster state to main process#59643

Merged
gzdunek merged 19 commits into master from gzdunek/cluster-store
Oct 22, 2025

Conversation

Contributor

@gzdunek gzdunek commented Sep 26, 2025

Contributes to #25806

Part 2/2 of moving cluster state to the main process.

Centralizing the cluster state in the main process makes it easier to read and update from both the main and renderer processes.

Originally, I had planned to refactor the overall shape of the state as well, but that turned out to be a really large change. So for now, I've only moved the necessary parts of the state to the main process, and aimed to keep everything working as before.

The three main behaviors I wanted to maintain are:

  1. State is updated before the handlers resolve
    This is achieved using AwaitableSender, which ensures that the renderer acknowledges each update message before the handler completes. This maintains the previous assumption that the state is immediately available after an update.
  2. useStateSelector can still work effectively
    In the previous implementation, calling setState only updated the parts of the state that actually changed, allowing useStateSelector to avoid unnecessary re-renders.
    If we were to send the full state over IPC, it would be serialized with structuredClone, resulting in a new object each time, which would break referential stability and make useStateSelector useless.
    To avoid this, we now generate patches in the main process (using immer) and apply them in the renderer. This preserves object references where possible and keeps useStateSelector working effectively. This feature is probably not super important, I'm treating it as a nice to have, as it was easy to achieve.
  3. Errors are still handled in the notifications. This is done by invoking all the functions from the renderer, so we get the errors back. I'd still like to change that and add something like an error field to the cluster state; I'm going to refactor the state shape in the future.

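The update-before-resolve guarantee from point 1 can be sketched in isolation. This is an in-memory simulation with hypothetical shapes (the real AwaitableSender rides on Electron IPC, which is not reproduced here); it only illustrates how awaiting the renderer's acknowledgment makes the handler resolve after the state is applied:

```typescript
// In-memory sketch only: FakeRenderer stands in for the renderer process,
// and the Update shape is hypothetical, not the PR's actual types.
type Update = { kind: 'patches'; value: object[] };

class FakeRenderer {
  state: object[] = [];
  // Applies the update, then resolves -- this resolution is the "ack".
  async receive(update: Update): Promise<void> {
    this.state.push(...update.value);
  }
}

// Stand-in for AwaitableSender: send() resolves only once the renderer
// has acknowledged (i.e. applied) the update.
class AwaitableSender {
  constructor(private renderer: FakeRenderer) {}
  send(update: Update): Promise<void> {
    return this.renderer.receive(update);
  }
}

class ClusterStore {
  constructor(private senders: AwaitableSender[]) {}
  // Resolves only after every renderer has acknowledged, so callers can
  // assume the renderer state is already up to date.
  update(patches: object[]): Promise<unknown> {
    return Promise.all(
      this.senders.map(s => s.send({ kind: 'patches', value: patches }))
    );
  }
}
```

After `await store.update(patches)` resolves, every registered renderer has applied the patches, which preserves the old assumption that the state is immediately readable after an update.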
@gzdunek gzdunek requested review from avatus and ravicious September 26, 2025 12:27
@gzdunek gzdunek added the no-changelog label (indicates that a PR does not require a changelog entry) Sep 26, 2025
@gzdunek gzdunek force-pushed the gzdunek/cluster-store branch from a61b146 to 3b0f6fc on September 26, 2025 12:34
@ravicious ravicious self-requested a review October 7, 2025 16:27
@gzdunek gzdunek force-pushed the gzdunek/awaitable-sender branch from c9a9402 to 87fda96 on October 13, 2025 12:32
@gzdunek gzdunek force-pushed the gzdunek/cluster-store branch from 3b0f6fc to 3e4c985 on October 13, 2025 12:33
@gzdunek gzdunek force-pushed the gzdunek/awaitable-sender branch from 87fda96 to 032b4ac on October 13, 2025 13:36
@gzdunek gzdunek force-pushed the gzdunek/cluster-store branch 2 times, most recently from 67d5a63 to 6669dd1 on October 15, 2025 12:00
@gzdunek gzdunek requested a review from ravicious October 15, 2025 12:07
Member

@ravicious ravicious left a comment


Found a bug related to the proxy host allow list.

// The workaround is to update the field in case of a failure,
// so the places that wait for showResources !== UNSPECIFIED don't get stuck indefinitely.
cluster.showResources = ShowResources.ACCESSIBLE_ONLY;
private subscribeToClusterStore(): void {
Member


Do you think it makes sense to add tests which check if subscribeToClusterStore indeed preserves identity? I was thinking that it is something that can easily slip past us, OTOH… Is it even possible for it to not preserve identity if it's backed by Immer? This also got me thinking and I came to the conclusion that this statement:

If we were to send the full state over IPC, it would be serialized with structuredClone, resulting in a new object each time, which would break referential stability and make useStateSelector useless.

is not necessarily correct. If we were sending the full state, then Immer would still take care to change only those parts of the state that actually need to change. It could just end up being super expensive. I remember Bartosz had some performance problems initially when he switched to Immer in the role editor.

With that said I suppose I answered my one question: we don't need those identity tests because it's impossible for subscribeToClusterStore to not preserve object identity.

Funny that I wrote this comment:

// It doesn't appear to be explicitly documented anywhere, but Immer preserves object
// identity, so Object.is works as expected. This behavior is covered by our tests.
const hasSelectedStateChanged = !Object.is(
newSelectedState,
selectedState
);

This feature is probably not super important, I'm treating it as a nice to have, as it was easy to achieve.

What do you mean it's not super important? You mean keeping useStateSelector working? I feel like it's super important because otherwise many places in the app would start re-rendering way more often than necessary! 😅
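A minimal sketch of the comparison being discussed (hypothetical names; the real useStateSelector is a React hook, which is not reproduced here): the selected branch is compared with Object.is, so a re-render is skipped exactly when the branch kept its identity:

```typescript
// Illustrative only: the selector re-render check reduced to its core.
type State = { clusters: { uri: string }[]; gateways: { id: string }[] };

function shouldRerender<T>(
  select: (s: State) => T,
  prev: State,
  next: State
): boolean {
  // Object.is works here only because untouched branches keep their
  // identity across updates -- the property discussed above.
  return !Object.is(select(prev), select(next));
}

const prev: State = { clusters: [{ uri: 'a' }], gateways: [{ id: 'g1' }] };
// A patch-style update rebuilds only `gateways`; `clusters` is reused.
const next: State = { ...prev, gateways: [] };

shouldRerender(s => s.clusters, prev, next); // false: identity preserved
shouldRerender(s => s.gateways, prev, next); // true: branch replaced
```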

Contributor Author


If we were sending the full state, then Immer would still take care to change only those parts of the state that actually need to change. It could just end up being super expensive. I remember Bartosz had some performance problems initially when he switched to Immer in the role editor.

I'm not sure if I understand.
If we were sending the full state, it would be applied in the renderer in the following way:

      this.setState(c => {
        c.clusters = castDraft(e.value);
      });

Now, if we send an update with the full state, but with one cluster's connected flag flipped, would Immer really be smart enough to modify only that one flag? I was under the impression that it would just always replace c.clusters with a new value.

In the docs there's the following example:

        case "adduser-3":
            // OK: returning a new state. But, unnecessary complex and expensive
            return {
                userCount: draft.userCount + 1,
                users: [...draft.users, action.payload]
            }

But here we merge the state manually.

With that said, I think we probably don't need a test for that; as long as we produce patches on one side and consume them on the other, Immer will take care of preserving the identity.
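The tradeoff under discussion can be shown without Immer at all (plain TypeScript, hypothetical state shape): a deep clone, like the one IPC serialization would produce, gives every branch a new identity, while a targeted, patch-style update reuses untouched siblings:

```typescript
type State = { clusters: { connected: boolean }[]; gateways: string[] };

const state: State = { clusters: [{ connected: true }], gateways: ['g1'] };

// Sending the full state over IPC deep-clones it (via structuredClone);
// a JSON round-trip is used here as a stand-in with the same effect.
const fullUpdate: State = JSON.parse(JSON.stringify(state));
Object.is(state.clusters, fullUpdate.clusters); // false, despite equal content

// A patch-style update rebuilds only the changed path; siblings are
// reused, which is what keeps Object.is-based selectors effective.
const patched: State = { ...state, clusters: [{ connected: false }] };
Object.is(state.gateways, patched.gateways); // true: untouched branch kept
```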

What do you mean it's not super important? You mean keeping useStateSelector working? I feel like it's super important because otherwise many places in the app would start re-rendering way more often than necessary! 😅

Ah, I was thinking we still use ClustersService.useState quite a lot, but actually it's not that bad: there are only a few instances of it, and they're not even that high up in the component tree. So indeed, it's important to keep useStateSelector working!

Member


Now, if we send an update with the full state, but with one cluster's connected flag flipped, would Immer really be smart enough to modify only that one flag? I was under the impression that it would just always replace c.clusters with a new value.

It turns out I was wrong, it'd indeed replace it with a new value.

I added this patch:

Patch
diff --git a/web/packages/teleterm/src/mainProcess/clusterStore/clusterStore.ts b/web/packages/teleterm/src/mainProcess/clusterStore/clusterStore.ts
index 985fd0658be..c9c74db872f 100644
--- a/web/packages/teleterm/src/mainProcess/clusterStore/clusterStore.ts
+++ b/web/packages/teleterm/src/mainProcess/clusterStore/clusterStore.ts
@@ -172,8 +172,8 @@ export class ClusterStore {
       this.senders.values().map(sender => {
         const send = this.withErrorHandling(update => sender.send(update));
         return send({
-          kind: 'patches',
-          value: patches,
+          kind: 'state',
+          value: this.state,
         });
       })
     );
diff --git a/web/packages/teleterm/src/ui/StatusBar/StatusBar.tsx b/web/packages/teleterm/src/ui/StatusBar/StatusBar.tsx
index a1d8576b4b7..5b6e0494f52 100644
--- a/web/packages/teleterm/src/ui/StatusBar/StatusBar.tsx
+++ b/web/packages/teleterm/src/ui/StatusBar/StatusBar.tsx
@@ -44,6 +44,8 @@ export function StatusBar(props: { onAssumedRolesClick(): void }) {
   const assumed = useAssumedRequests(rootClusterUri);
   const assumedRolesText = getAssumedRoles(assumed);
 
+  console.log('%c Rendering StatusBar', 'background: #222; color: #bada55');
+
   return (
     <Flex
       width="100%"

You can see that the status bar does not re-render when I close a gateway (thus updating only the gateways part of the clusters service). But when I log out of a cluster, the status bar does get re-rendered.

Rendering with full state updates
rendering.with.full.state.mov

This does not happen when only patches are sent through:

Rendering with patches
rendering.with.patches.mov

@gzdunek gzdunek requested a review from ravicious October 17, 2025 13:37
);
});
this.clusterStore = new ClusterStore(
this.getTshdClients().then(c => c.terminalService),
Member


The coordination between the processes is getting a little hectic and we don't have it well documented. If I were to look at MainProcess and ClusterStore with no knowledge about them, my questions would be:

  1. What if getTshdClients returns an error? How is the error surfaced to the user?
  2. If the renderer process directly depends on ClusterStore and ClusterStore depends on the result of a promise, what happens if that promise hangs indefinitely? Can the renderer reasonably assume that ClusterStore is ready when the renderer wants to talk to it?

The answer to both questions is concealed in the fact that both getTshdClients and the startup of the frontend app depend on the result of the same promise. The error from said promise is surfaced primarily in the UI of the renderer.

Could you add docs for getTshdClients and resolvedChildProcessAddresses that would provide some context behind this?

getTshdClients can also be made private now.

Contributor Author


I realized that passing a promise directly to the constructor is problematic: if it rejects before any callsite attaches a handler, it triggers an unhandled promise rejection. I fixed this by updating ClusterStore to accept a function that returns a promise instead.

  1. The error is propagated to the caller. Currently, only the renderer invokes the ClusterStore (so the errors are handled in the same way as today).

If the renderer process directly depends on ClusterStore and ClusterStore depends on the result of a promise, what happens if that promise hangs indefinitely?

Then any call to ClusterStore that depends on that promise would also hang. This doesn't seem to have any new effect on the renderer. If the resolvedChildProcessAddresses promise hangs, then the renderer will be stuck on the loading screen and won't even call any ClusterStore method.

Can the renderer reasonably assume that ClusterStore is ready when the renderer wants to talk to it?

Yes, I think so. Since the ClusterStore is now initialized synchronously, it can immediately start accepting requests (which will wait for the tshd client to initialize).
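The unhandled-rejection pitfall and its fix can be sketched generically (the LazyClient name is hypothetical; the real code passes a function returning the tshd client): taking `() => Promise<T>` and memoizing it means the promise is created, and any rejection observed, only when a caller actually awaits it:

```typescript
class LazyClient<T> {
  private cached: Promise<T> | undefined;

  // Taking a factory instead of a Promise avoids the window in which a
  // rejected Promise exists with no handler attached yet.
  constructor(private getClient: () => Promise<T>) {}

  get(): Promise<T> {
    // Memoize so every caller waits on the same underlying promise.
    this.cached ??= this.getClient();
    return this.cached;
  }
}
```

With a hypothetical `connectToTshd()` factory, `new LazyClient(() => connectToTshd())` never triggers the connection until `get()` is called, so a failing client cannot reject unobserved before any handler exists.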

Base automatically changed from gzdunek/awaitable-sender to master October 22, 2025 10:29
@gzdunek gzdunek force-pushed the gzdunek/cluster-store branch from 8d21d0d to 75e6c59 on October 22, 2025 10:36
@gzdunek gzdunek enabled auto-merge October 22, 2025 10:43
@gzdunek gzdunek added this pull request to the merge queue Oct 22, 2025
Merged via the queue into master with commit a41d021 Oct 22, 2025
41 checks passed
@gzdunek gzdunek deleted the gzdunek/cluster-store branch October 22, 2025 11:04
smallinsky pushed a commit that referenced this pull request Oct 23, 2025
* Create `ClusterStore` that manages cluster state

* Fix tests that mocked tshd directly

* Remove IPC to notify the main process about cluster list changes

* Load immer plugins in `MainProcess`

* Improve comments

* Refactor `useSender`

* Get rid of unnecessary Map and try/catch around send

* Get rid of `MainProcess.create`

* Do not return early `c.proxyHost` is falsy

* Add more context to test

* Add missing logout handler in main process

* Fix applying patches

* Adjust `subscribeToClusterStore` to updated `startAwaitableSenderListener`

* Crash window when sending state update fails

* Extract WebContents navigation handlers and add tests for opening links

* Improve error message

* Initialize `ClusterStore` synchronously

* Convert `lazyTshdClient` field to `getTshdClient` function, add docs

* Remove unused eslint directive
mmcallister pushed a commit that referenced this pull request Nov 6, 2025
@ravicious
Member

Just a reminder that this will need to be backported together with #61044.

mmcallister pushed a commit that referenced this pull request Nov 19, 2025
mmcallister pushed a commit that referenced this pull request Nov 20, 2025
@gzdunek gzdunek mentioned this pull request Nov 21, 2025
gzdunek added a commit that referenced this pull request Nov 28, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2025
* Combine `ClustersService` logout functions (#59539)

* Remove clusters immediately after a logout, move `useClusterLogout` to `AppContext`

* Review callsites to ensure cluster is properly checked before being accessed

* Revert "Review callsites to ensure cluster is properly checked before being accessed"

This reverts commit 8343c3c.

* Switch to removing the cluster at the end of logout sequence

* Lint

* Move `logoutWithCleanup` to `ui/ClusterLogout`

(cherry picked from commit de6b4ed)

* Enable sending messages from main to renderer with acknowledgments (#59642)

* Create awaitable sender

* Review comments

* Fix test and lint

(cherry picked from commit 5dc76fe)

* Move cluster state to main process (#59643)

(cherry picked from commit a41d021)

* Connect: make logout function idempotent (#60553)

* Remove `ClusterRemove` RPC, make logging out idempotent

* Move calling `removeKubeConfig` and `maybeRemoveAppUpdatesManagingCluster` to main process

The main process should not depend on the renderer to clean up its own resources.

* Remove cleaning up kube dir

* Lint

(cherry picked from commit 2d1bc7b)

* Connect: add profile watcher (#60622)

* Add profile watcher

* Move `makeClusterWithOnlyProfileProperties` to `profileWatcher.ts`, improve test

* Handle watched directory removal

* Improve comments

* Make tests faster, pass abort signal everywhere

* Improve docs

* Make `removing tsh directory does not break watcher` easier to understand

* Make test dir per test

* Improve timing in tests

* Add a limit of how many events can be emitted by `fs.watch` (to break the endless stream of events on Windows when watched dir is removed), go into the polling mode only when it's expected that the watched dir was removed

* Use `expect().rejects.toThrow` correctly

* Deflake 'max file system events count is restricted'

* Replace `makeClusterWithOnlyProfileProperties` with `mergeClusterProfileWithDetails`, move it back to `cluster.ts`

* Attempt to fix tests

* Clarify comment

(cherry picked from commit d4e6f19)

* Initialize tshdClients in MainProcess constructor (#61044)

(cherry picked from commit c7a4233)

* Connect: react to tsh actions by watching tsh dir (#60884)

* Add `ClusterLifecycleManager`

* Register handlers for adding, removing and logging out from cluster

* Provide `rootCluster` in `useWorkspaceContext`

The handlers in the profile watcher will proceed with updating the cluster store, even if the renderer handlers returned errors.
This check protects us from a runtime error if the renderer fails to remove the workspace.

* Improve docs

* Move processing queue to listener

* Make `will-` operations always interrupt main process actions

* Improve error messages

* Do not remove managing cluster when **only** logging out

The app updater displays all clusters, not just those the user is logged into.

* Revert "Provide `rootCluster` in `useWorkspaceContext`"

This reverts commit cf76d2b.

* Rename `logoutWithCleanup` to `cleanUpBeforeLogout`

* Do not pass `AbortSignal` to `this.mainProcessClient.syncRootClusters`

* Lint

* Fix types issues

* Do not stack watcher notifications

(cherry picked from commit 5fa8249)

* Connect: close cluster clients when profile changes (#61090)

* Include expiration time in `LoggedInUser`

This will allow the profile watcher to detect when the user relogged.

* Display expiration time in UI

* Add `ClearStaleClusterClients` RPC

* Implement `ClearStaleClusterClients`

* Clear stale clients when profile changes

* Improve session expiration component

* Move refresh button back to top

* `ClearCachedStaleClientsForRoot` -> `ClearStaleCachedClientsForRoot`

* `unchanged` -> `stale`

* Make "closing stale clients" a subtest

* Add `clientcache` test

* Remove `getProfile` error wrapping

* Improve comment

* Convert story to controls

(cherry picked from commit 6615e42)

* Gracefully handle missing `current-profile` and respect `TELEPORT_PROXY` in `tsh status`  (#61295)

* Respect `TELEPORT_PROXY` env var in `tsh status`

* Enable listing profiles if there is no active profile

* Add test

* Define `err` within the block where it's actually used

* Handle missing current profile in `tsh logout`

* Make check more explicit

* Revert mistakenly commited change

(cherry picked from commit 95bec3a)

* Connect: switch tsh home directory to ~/.tsh (#61352)

* Switch tsh home directory to ~/.tsh

* Migrate old tsh home to new location, disallow updating fields outside the `state` key in app_state.json from the renderer process

* Show banner about migrated tsh home

* `promoteMigratedTshHome` -> `showTshHomeMigrationBanner`

* `MigratedTshHomeBanner` -> `TshHomeMigrationBanner`

* 'Profiles are' -> 'Profiles are now', remove unnecessary space

* Fix assigning colors for new workspaces

* Improve logs

(cherry picked from commit 54b5f6c)

* Connect: refresh resources when access changes and add tests for `ClusterLifecycleManager` (#61479)

* Detect when user's access changes

* Refresh resources in UI when `did-change-access` is received

* Add tests for `ClusterLifecycleManager`

* Add better docs for ClusterLifecycleEvent

* Test assuming requests too

* Improve test names

(cherry picked from commit 4b00520)

* Set up deep links as soon as possible (#61668)

(cherry picked from commit 0b5ab6b)

* Serialize IPC errors  (#61665)

* Serialize all enumerable error fields

* Add wrappers around `ipcMain.handle` and `ipcRenderer.invoke`

* Fix `Method Error.prototype.toString called on incompatible receiver undefined`

* Improve docs

* Lint

(cherry picked from commit a1f2ae0)

* Fix unrecoverable ssh cert errors in tsh/Connect (#61322)

* Initialize default Username/HostLogin only in tsh

* Move `Username()` from `api.go` to `tsh.go`

* Remove wrong `Profile.SiteName` default

* Remove resetting `SiteName`

Not sure why it was needed. Perhaps to clear the default that we just removed? But even if add the default back and remove this fix, everything works.

* Gracefully handle missing SSH/TLS certs

* Remove unused `TeleportClient.LoadKeyForClusterWithReissue`

* Revert "Move `Username()` from `api.go` to `tsh.go`"

This reverts commit f7ff0ff.

* Revert "Initialize default Username/HostLogin only in tsh"

This reverts commit ed38bab.

* When any of SSH/TLS cert is missing, return partial profile

* Only log non-nil errors

* Revert "Remove wrong `Profile.SiteName` default"

* Revert "Remove resetting `SiteName`"

This reverts commit f54ab3f.

* Set `SiteName` when adding cluster

* Improve comments

* Add test

* Fix test

* Add myself to TODO

* Add test for logging out with missing SSH cert

* Lint

(cherry picked from commit cd3c8f8)

* Connect: update docs for sharing ~/.tsh directory (#61467)

* Update docs for sharing ~/.tsh directory

* Review comments

* Lint

(cherry picked from commit 19533bf)

---------

Co-authored-by: ravicious <rafal.cieslak@goteleport.com>

Labels

no-changelog Indicates that a PR does not require a changelog entry size/md ui
