From 697cafff6ff1cbef60b0d7dfa5fcecc81e853346 Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Sun, 10 May 2026 18:02:33 -0700 Subject: [PATCH 1/9] fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adopt only host-services that respond to a health check (2s timeout). A live pid no longer serving on the recorded port was being adopted as "running", leaving every V2 surface stuck on "Host service not available" with no in-app recovery (superset-sh/superset#4299). Also enable tray "Restart" in stopped state — disabling it precisely when users need restart was backwards. Disabled now only while a start is in flight, to avoid racing the pending start. Adds a plan doc covering broader recovery work (banner, in-app reset, retry-with-backoff) for follow-up PRs. --- .../20260510-1430-host-service-recovery.md | 201 ++++++++++++++++++ .../main/lib/host-service-coordinator.test.ts | 184 ++++++++++++++++ .../src/main/lib/host-service-coordinator.ts | 27 +++ apps/desktop/src/main/lib/tray/index.ts | 5 +- 4 files changed, 416 insertions(+), 1 deletion(-) create mode 100644 apps/desktop/plans/20260510-1430-host-service-recovery.md create mode 100644 apps/desktop/src/main/lib/host-service-coordinator.test.ts diff --git a/apps/desktop/plans/20260510-1430-host-service-recovery.md b/apps/desktop/plans/20260510-1430-host-service-recovery.md new file mode 100644 index 00000000000..d4c0e801534 --- /dev/null +++ b/apps/desktop/plans/20260510-1430-host-service-recovery.md @@ -0,0 +1,201 @@ +# Host-Service Recovery: Self-Healing + In-App Escape Hatch + +**Status:** Draft +**Scope:** v2 desktop only — `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts`, `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/*`, `apps/desktop/src/main/lib/tray/index.ts`, plus a new Settings section. +**Tracking issue:** [superset-sh/superset#4299](https://github.com/superset-sh/superset/issues/4299) + +## Goal + +When host-service is unreachable, the V2 right-pane goes silently blank and the user has no in-app recovery path — every interactive surface that needs host-service errors with `"Host service not available"`, including the **delete workspace** action, so the user can't even clean up. Today the only working recovery is filesystem surgery (`mv ~/.superset/host …`). + +This plan adds: + +1. **Self-healing** — adopt only host-services that actually respond, and retry start when status flips to `stopped`. +2. **A user-visible escape hatch** — a "Reset host service" action surfaced both as a banner when the right-pane would otherwise be blank, and as a button in Settings. +3. **Honesty in the tray** — let users click "Restart" when host-service is actually in the state where restart helps. + +## Why + +From #4299 the reporter is stuck in this state: + +- Workspaces list renders (it's hydrated from local-db, not host-service). +- Clicking any workspace shows nothing on the right — every V2 surface uses `useLocalHostService().activeHostUrl`, which is `null`. +- Delete workspace errors `"Host service unavailable"`. +- Account-management section sits in skeleton forever (also a host-service call). +- Sign-out + sign-in, Cmd+R, Quit Completely, Tray → Restart, V1/V2 toggle — none recover. +- Manual fix: deleting `~/.superset/host//manifest.json` and relaunching. There is no in-app equivalent today. + +Symptoms aside, the architectural problem is that **`activeHostUrl === null` is a permanently absorbing state**: nothing in the renderer or main ever tries to climb out of it after the first `start({orgId})` mutation fails or after adoption silently picks up a dead process. + +## Current state + +The pipeline that produces `activeHostUrl`: + +| Step | Where | +| --- | --- | +| Renderer fires `start({orgId})` once per org on mount | `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx:48-52` | +| Renderer polls `getConnection({activeOrgId})` every 5s | `LocalHostServiceProvider.tsx:59-63` | +| `start` → coordinator: `tryAdopt` then `spawn` | `apps/desktop/src/main/lib/host-service-coordinator.ts:74-102` | +| `tryAdopt` only verifies `isProcessAlive(pid)` and app-version match | `host-service-coordinator.ts:280-315` | +| `spawn` health-checks via `pollHealthCheck` then registers `running` | `host-service-coordinator.ts:333-437` | +| `getConnection` returns null unless an instance is `running` | `host-service-coordinator.ts:155-163` | +| Tray "Restart" is gated on `isRunning` | `apps/desktop/src/main/lib/tray/index.ts:172-200` | + +### The four failure buckets + +| # | Bucket | Why we get stuck | +|---|--------|------------------| +| A | Stale/unhealthy adopted process | `tryAdopt` doesn't health-check. A live-pid-but-not-serving process is adopted; `getConnection` returns its dead port forever. | +| B | Auth token missing | `start` throws `"No auth token available"` (coordinator router line 17). Renderer's `useEffect` doesn't retry. | +| C | Spawn fails | `pollHealthCheck` timeout, port collision, DB lock, env validation. Instance is deleted; renderer's `useEffect` doesn't retry. | +| D | activeOrganizationId mismatch | `start` is fired for every org in collections, but `getConnection` is queried only for the session's `activeOrganizationId`. If the two drift (stale session, sync race), host-service is up but the renderer is asking about the wrong org. | + +Buckets A and C/D account for the issue thread. B is the easiest to reproduce locally (clear keychain). + +## Ranked work + +Ranking criteria: **impact** (how many buckets the change unsticks) × **visibility** (does the user know it happened) ÷ **risk + effort**. + +### 1. Health-check during adoption *(highest impact, near-zero risk)* + +**File:** `apps/desktop/src/main/lib/host-service-coordinator.ts:280-315` + +Before registering an adopted manifest as `running`, do a `pollHealthCheck(manifest.endpoint, manifest.authToken, 2_000)`. On failure: SIGKILL the pid, `removeManifest(orgId)`, return null so `start` falls through to `spawn`. + +```ts +private async tryAdopt(organizationId: string): Promise { + const manifest = this.readAndValidateManifest(organizationId); + if (!manifest) return null; + + // … existing app-version gate … + + const healthy = await pollHealthCheck(manifest.endpoint, manifest.authToken, 2_000); + if (!healthy) { + console.log(`[host-service:${organizationId}] Adopted pid=${manifest.pid} did not respond; killing`); + try { process.kill(manifest.pid, "SIGKILL"); } catch {} + removeManifest(organizationId); + return null; + } + + // … existing register-and-return … +} +``` + +Fixes **bucket A**. Invisible when adoption is healthy. Adds at most ~2s to coordinator start when adoption fails — acceptable; it only triggers on app launch / Cmd+R. + +**Tests:** unit test for the new path in `host-service-coordinator.test.ts` — mock `pollHealthCheck` to return false, assert `tryAdopt` returns null, manifest is removed, pid received SIGKILL. + +### 2. Coordinator `reset(orgId)` + tRPC surface *(unlocks 3 and 5)* + +**File:** `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` + +Add: + +```ts +async reset( + organizationId: string, + config: SpawnConfig, + options: { wipeHostDb?: boolean } = {}, +): Promise { + // 1. Stop in-memory + send SIGTERM + this.stop(organizationId); + + // 2. SIGKILL any pid the manifest still references (covers the + // "process up but unresponsive" case that motivated bucket A). + const manifest = readManifest(organizationId); + if (manifest && isProcessAlive(manifest.pid)) { + try { process.kill(manifest.pid, "SIGKILL"); } catch {} + } + + // 3. Remove manifest + removeManifest(organizationId); + + // 4. Optional: archive host.db to host.db.broken- so we keep + // the file for debugging but the next spawn starts clean. + if (options.wipeHostDb) { + const dbPath = path.join(manifestDir(organizationId), "host.db"); + if (fs.existsSync(dbPath)) { + fs.renameSync(dbPath, `${dbPath}.broken-${Date.now()}`); + } + } + + // 5. Fresh start + return this.start(organizationId, config); +} +``` + +Expose via `hostServiceCoordinator.reset` in the tRPC router, mirroring the existing `restart` shape (takes `organizationId`, loads token from `loadToken()`, threads `cloudApiUrl`). Add an `wipeHostDb: z.boolean().optional()` input. + +The default is **not** to wipe host.db — that's data loss. The Settings UI surfaces "Reset" (manifest only) and "Reset and clear local data" (also archives host.db) as separate actions. + +**Tests:** coordinator unit test that `reset` removes the manifest, the optional rename happens only when requested, and the new `start` produces a running instance. + +### 3. Surface failures with a recovery banner *(visibility — the actual user fix)* + +**File:** `LocalHostServiceProvider.tsx` + new `WorkspaceHostUnavailableBanner` component. + +Today `activeHostUrl === null` is silent. Change the contract: + +- Provider tracks "did we attempt `start` for the active org, and has it been stopped/null longer than 5s?" +- When true, render a banner above the workspace content with: + - The last known status from `coordinator.onStatusChange` (`starting` / `stopped`) + any error message from the failed `start` mutation. + - Two actions: **Reset host service** (`reset(orgId, { wipeHostDb: false })`) and **Get diagnostics** (copies manifest path + last 200 log lines to clipboard, for paste into bug reports). + +The banner needs `select-text cursor-text` per `apps/desktop/AGENTS.md` so users can copy errors. + +Hook the banner into the workspace right-pane shell so it appears in place of the blank skeleton — not as a toast (toasts auto-dismiss; this state is sticky). + +**Tests:** RTL test that mounts the provider with a mocked `start` mutation that throws, advances time past 5s, and asserts the banner renders with the expected error and actions. + +### 4. Renderer retry with exponential backoff *(covers transient C/D)* + +**File:** `LocalHostServiceProvider.tsx` + +Subscribe to `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription`. When status for the active org transitions to `stopped` (or `start` mutation throws), schedule a re-`start` with backoff: 1s, 4s, 15s, then give up. Cap visible to the user via the banner from (3) — after backoff exhausts, the banner says "Auto-retry exhausted, click Reset." + +This fixes the **transient** half of bucket C (port race, slow disk on boot) and is the cheapest way to handle bucket B after a sign-in: as soon as `loadToken` returns a token, the next retry succeeds. + +**Tests:** RTL test using fake timers — assert exactly three retries fire on stop events. + +### 5. Tray "Restart" enabled in `stopped` *(one-liner, always-correct)* + +**File:** `apps/desktop/src/main/lib/tray/index.ts:172-200` + +```ts +// Before: enabled: isRunning, +// After: enabled: status !== "starting", +``` + +`stopped` is the state where Restart is most useful; gating it on `running` is backwards. + +### 6. Settings → Advanced → "Reset host service" button *(belt and suspenders)* + +**File:** new section in Settings (the Experimental panel feels right, since V2 toggle lives there). Reuses the tRPC `reset` mutation from (2). Two buttons: + +- **Reset host service** — calls `reset({})`. Safe; loses no data. +- **Reset and clear local host data** — calls `reset({ wipeHostDb: true })`. Behind a confirm dialog that names what's lost (terminal session history, host-side chat state) and what survives (workspaces, projects, chats — those live in `@superset/local-db`). + +Lower priority than the in-context banner from (3) because a user who can't load workspaces probably won't think to open Settings, but it's useful from a support perspective ("ask them to click this button"). + +## Out of scope + +- **Why host-service crashes in the first place.** The "right pane went blank then Cmd+R" half of #4299 is a different bug — something in the workspace render path or host-service runtime that we should track separately. This plan only handles the *recovery* once it has crashed. +- **Auto-wiping `host.db`.** Data loss without a confirm dialog is non-negotiable. +- **Touching the start mutation semantics.** Other call sites depend on `start` being idempotent and resolving once. +- **Cross-org healing.** We retry/reset only the active org. Inactive orgs are out of scope; their host-service can be healed by switching to them. + +## Rollout + +- **Order:** (1) and (5) ship first as a small PR — no UI changes, easy to revert. (2)+(3)+(4) ship as a second PR — that's the user-facing recovery story. (6) tacks on after. +- **Telemetry:** log `[host-service-coordinator] adoption health check failed` (1), `reset(orgId)` invocation source (banner vs tray vs settings) (2/3/6), retry attempt counts (4). PostHog event `host_service_recovery` with `{ source, action, succeeded }`. +- **Risk:** the adoption health check briefly extends coordinator startup on the unhappy path. Bound by `2_000ms`, and only triggers when adoption would have failed silently anyway, so the worst case is "startup is slightly slower when host-service was broken" — strictly better than current. + +## Acceptance criteria + +For #4299 to be considered closed by this work: + +- [ ] A user reproducing the issue can recover without filesystem access, in ≤2 clicks from the blank workspace screen. +- [ ] If the manifest points at a non-responsive pid, the next Cmd+R or app launch heals automatically (no user action). +- [ ] If host-service spawn is failing for a persistent reason (e.g., port binding error), the banner names the reason in copyable text. +- [ ] Tray → Restart works in `stopped` state. +- [ ] No regression in startup time for the happy path (verified via dev-tools timing on a freshly-cloned `.superset` dir). diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts new file mode 100644 index 00000000000..128578c007f --- /dev/null +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -0,0 +1,184 @@ +import { beforeEach, describe, expect, mock, test } from "bun:test"; + +const APP_VERSION = "1.2.3"; + +const manifestStore: { + current: { + pid: number; + endpoint: string; + authToken: string; + startedAt: number; + organizationId: string; + spawnedByAppVersion: string; + } | null; +} = { current: null }; + +const readManifestMock = mock(() => manifestStore.current); +const removeManifestMock = mock(() => { + manifestStore.current = null; +}); +const isProcessAliveMock = mock(() => true); + +mock.module("./host-service-manifest", () => ({ + readManifest: readManifestMock, + removeManifest: removeManifestMock, + isProcessAlive: isProcessAliveMock, + listManifests: mock(() => []), + manifestDir: (orgId: string) => `/tmp/host/${orgId}`, +})); + +const pollHealthCheckMock = mock(() => Promise.resolve(true)); + +mock.module("./host-service-utils", () => ({ + HEALTH_POLL_TIMEOUT_MS: 10_000, + MAX_HOST_LOG_BYTES: 1024, + findFreePort: mock(() => Promise.resolve(40000)), + openRotatingLogFd: mock(() => -1), + pollHealthCheck: pollHealthCheckMock, +})); + +mock.module("electron", () => ({ + app: { + getVersion: () => APP_VERSION, + isPackaged: false, + getAppPath: () => "/tmp/app", + }, +})); + +mock.module("@superset/local-db", () => ({ settings: {} })); +mock.module("@superset/shared/host-info", () => ({ + getHostId: () => "host-1", + getHostName: () => "host", +})); +mock.module("main/env.main", () => ({ + env: { NEXT_PUBLIC_API_URL: "", RELAY_URL: "" }, +})); +mock.module("shared/env.shared", () => ({ + env: { DESKTOP_VITE_PORT: 3000, DESKTOP_NOTIFICATIONS_PORT: 4000 }, +})); +mock.module("./app-environment", () => ({ + SUPERSET_HOME_DIR: "/tmp/superset", +})); +mock.module("./local-db", () => ({ + localDb: { + select: () => ({ from: () => ({ get: () => null }) }), + }, +})); +mock.module("./terminal/env", () => ({ HOOK_PROTOCOL_VERSION: "1" })); +mock.module("../../lib/trpc/routers/workspaces/utils/shell-env", () => ({ + getProcessEnvWithShellPath: async (e: Record) => e, +})); + +const { HostServiceCoordinator } = await import("./host-service-coordinator"); + +const baseManifest = (pid: number, endpoint = "http://127.0.0.1:55555") => ({ + pid, + endpoint, + authToken: "manifest-secret", + startedAt: 0, + organizationId: "org-1", + spawnedByAppVersion: APP_VERSION, +}); + +const spawnConfig = { authToken: "token", cloudApiUrl: "https://api.example" }; + +describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { + let coordinator: InstanceType; + let killedPids: Array<{ pid: number; signal: NodeJS.Signals | number }>; + let originalKill: typeof process.kill; + let spawnMock: ReturnType; + + beforeEach(() => { + manifestStore.current = null; + readManifestMock.mockClear(); + removeManifestMock.mockClear(); + isProcessAliveMock.mockClear(); + pollHealthCheckMock.mockClear(); + + killedPids = []; + originalKill = process.kill; + // `process.kill` is read-only in some Bun versions — assign via cast. + (process as unknown as { kill: typeof process.kill }).kill = (( + pid: number, + signal?: NodeJS.Signals | number, + ) => { + killedPids.push({ pid, signal: signal ?? "SIGTERM" }); + return true; + }) as typeof process.kill; + + coordinator = new HostServiceCoordinator(); + // Replace spawn so a failed adoption doesn't actually launch electron. + spawnMock = mock(async () => ({ + port: 60000, + secret: "fresh-secret", + machineId: "host-1", + })); + (coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock; + }); + + test("adopts when manifest is healthy", async () => { + manifestStore.current = baseManifest(1234); + pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(true)); + + const conn = await coordinator.start("org-1", spawnConfig); + + expect(conn.port).toBe(55555); + expect(conn.secret).toBe("manifest-secret"); + expect(pollHealthCheckMock).toHaveBeenCalledTimes(1); + expect(spawnMock).not.toHaveBeenCalled(); + expect(removeManifestMock).not.toHaveBeenCalled(); + expect(coordinator.getProcessStatus("org-1")).toBe("running"); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("kills the adopted pid with SIGKILL and falls through to spawn when health check fails", async () => { + manifestStore.current = baseManifest(4321); + pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(false)); + + const conn = await coordinator.start("org-1", spawnConfig); + + expect(pollHealthCheckMock).toHaveBeenCalledTimes(1); + expect(killedPids).toContainEqual({ pid: 4321, signal: "SIGKILL" }); + expect(removeManifestMock).toHaveBeenCalledTimes(1); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + expect(conn.secret).toBe("fresh-secret"); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => { + manifestStore.current = baseManifest(7777); + pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(false)); + (process as unknown as { kill: typeof process.kill }).kill = (() => { + throw new Error("ESRCH"); + }) as typeof process.kill; + + const conn = await coordinator.start("org-1", spawnConfig); + + expect(removeManifestMock).toHaveBeenCalledTimes(1); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("kills with SIGTERM (existing behavior) on app-version mismatch, before health check", async () => { + manifestStore.current = { + ...baseManifest(5555), + spawnedByAppVersion: "0.9.0", + }; + + const conn = await coordinator.start("org-1", spawnConfig); + + // App-version gate runs before the new health check. + expect(pollHealthCheckMock).not.toHaveBeenCalled(); + expect(killedPids).toContainEqual({ pid: 5555, signal: "SIGTERM" }); + expect(removeManifestMock).toHaveBeenCalledTimes(1); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); +}); diff --git a/apps/desktop/src/main/lib/host-service-coordinator.ts b/apps/desktop/src/main/lib/host-service-coordinator.ts index e32c75759f5..d42cc08fdb3 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.ts @@ -56,6 +56,13 @@ interface HostServiceProcess { const ADOPTED_LIVENESS_INTERVAL = 5_000; +/** + * Cap how long an adoption health check can take before we decide the adopted + * process is dead and respawn. Short enough that a Cmd+R into a wedged + * host-service heals quickly; long enough to ride out brief startup blips. + */ +const ADOPT_HEALTH_CHECK_TIMEOUT_MS = 2_000; + export class HostServiceCoordinator extends EventEmitter { private instances = new Map(); private pendingStarts = new Map>(); @@ -299,6 +306,26 @@ export class HostServiceCoordinator extends EventEmitter { return null; } + // A live pid is not the same as a serving host-service — the process can + // be hung on migrations, deadlocked, or no longer bound to the recorded + // port. Without this check the renderer's `getConnection` keeps handing + // out a dead port forever, which is the failure mode in superset-sh/superset#4299. + const healthy = await pollHealthCheck( + manifest.endpoint, + manifest.authToken, + ADOPT_HEALTH_CHECK_TIMEOUT_MS, + ); + if (!healthy) { + console.log( + `[host-service:${organizationId}] Adopted pid=${manifest.pid} did not respond on ${manifest.endpoint}, killing and respawning`, + ); + try { + process.kill(manifest.pid, "SIGKILL"); + } catch {} + removeManifest(organizationId); + return null; + } + this.instances.set(organizationId, { pid: manifest.pid, port, diff --git a/apps/desktop/src/main/lib/tray/index.ts b/apps/desktop/src/main/lib/tray/index.ts index 81948961fb1..4717a1cbd42 100644 --- a/apps/desktop/src/main/lib/tray/index.ts +++ b/apps/desktop/src/main/lib/tray/index.ts @@ -169,8 +169,11 @@ function buildHostServiceSubmenu( enabled: false, }); menuItems.push({ + // Enabled in "stopped" too — that's the state where users most need + // restart to work (host-service crashed or never came up). Disabled + // only while a start is in flight, to avoid racing the pending start. label: " Restart", - enabled: isRunning, + enabled: status !== "starting", click: () => { void (async () => { try { From 8330df8d44d1a4d00985e9d1979436d9664a29af Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Sun, 10 May 2026 18:08:41 -0700 Subject: [PATCH 2/9] docs(desktop): mark PR1 shipped in host-service recovery plan, prep PR2 handoff - Status badge updated; PR #4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked #4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch. --- .../20260510-1430-host-service-recovery.md | 168 ++++++++++-------- 1 file changed, 94 insertions(+), 74 deletions(-) diff --git a/apps/desktop/plans/20260510-1430-host-service-recovery.md b/apps/desktop/plans/20260510-1430-host-service-recovery.md index d4c0e801534..6edadd5bab7 100644 --- a/apps/desktop/plans/20260510-1430-host-service-recovery.md +++ b/apps/desktop/plans/20260510-1430-host-service-recovery.md @@ -1,8 +1,9 @@ # Host-Service Recovery: Self-Healing + In-App Escape Hatch -**Status:** Draft -**Scope:** v2 desktop only — `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts`, `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/*`, `apps/desktop/src/main/lib/tray/index.ts`, plus a new Settings section. +**Status:** PR1 shipped ([#4395](https://github.com/superset-sh/superset/pull/4395)). PR2 (items 2 + 3 + 4) ready for handoff. PR3 (item 6) optional follow-up. +**Scope:** v2 desktop only — `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts`, `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/*`, `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/*`, `apps/desktop/src/main/lib/tray/index.ts`, plus a new Settings section. **Tracking issue:** [superset-sh/superset#4299](https://github.com/superset-sh/superset/issues/4299) +**Related:** [#4396](https://github.com/superset-sh/superset/issues/4396) (white-screen / unbounded `ioreg execFileSync` — separate fix path, not blocked by this plan). ## Goal @@ -56,40 +57,19 @@ Buckets A and C/D account for the issue thread. B is the easiest to reproduce lo Ranking criteria: **impact** (how many buckets the change unsticks) × **visibility** (does the user know it happened) ÷ **risk + effort**. -### 1. Health-check during adoption *(highest impact, near-zero risk)* +### 1. Health-check during adoption *(shipped in PR #4395)* -**File:** `apps/desktop/src/main/lib/host-service-coordinator.ts:280-315` +**Status:** ✅ Shipped. See `apps/desktop/src/main/lib/host-service-coordinator.ts:307-326` for the implementation and `apps/desktop/src/main/lib/host-service-coordinator.test.ts` for the regression test. -Before registering an adopted manifest as `running`, do a `pollHealthCheck(manifest.endpoint, manifest.authToken, 2_000)`. On failure: SIGKILL the pid, `removeManifest(orgId)`, return null so `start` falls through to `spawn`. +Fixes **bucket A** structurally — every Cmd+R / app launch now health-checks the adopted manifest endpoint with a 2s timeout, SIGKILLs unresponsive pids, and falls through to a clean `spawn`. Invisible when adoption is healthy. -```ts -private async tryAdopt(organizationId: string): Promise { - const manifest = this.readAndValidateManifest(organizationId); - if (!manifest) return null; - - // … existing app-version gate … - - const healthy = await pollHealthCheck(manifest.endpoint, manifest.authToken, 2_000); - if (!healthy) { - console.log(`[host-service:${organizationId}] Adopted pid=${manifest.pid} did not respond; killing`); - try { process.kill(manifest.pid, "SIGKILL"); } catch {} - removeManifest(organizationId); - return null; - } - - // … existing register-and-return … -} -``` - -Fixes **bucket A**. Invisible when adoption is healthy. Adds at most ~2s to coordinator start when adoption fails — acceptable; it only triggers on app launch / Cmd+R. - -**Tests:** unit test for the new path in `host-service-coordinator.test.ts` — mock `pollHealthCheck` to return false, assert `tryAdopt` returns null, manifest is removed, pid received SIGKILL. - -### 2. Coordinator `reset(orgId)` + tRPC surface *(unlocks 3 and 5)* +### 2. Coordinator `reset(orgId)` + tRPC surface *(PR2 — unlocks 3 and 6)* -**File:** `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` +**Files to edit:** +- `apps/desktop/src/main/lib/host-service-coordinator.ts` — add method on the class, alongside `restart` (currently at ~`148-156`). +- `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` — add a `reset` mutation, mirroring `restart` (currently at lines 37-47). Existing imports already include `loadToken` and `env.NEXT_PUBLIC_API_URL`. -Add: +Add to `HostServiceCoordinator`: ```ts async reset( @@ -97,7 +77,7 @@ async reset( config: SpawnConfig, options: { wipeHostDb?: boolean } = {}, ): Promise { - // 1. Stop in-memory + send SIGTERM + // 1. Stop in-memory + send SIGTERM. No-op if no instance is tracked. this.stop(organizationId); // 2. SIGKILL any pid the manifest still references (covers the @@ -107,7 +87,7 @@ async reset( try { process.kill(manifest.pid, "SIGKILL"); } catch {} } - // 3. Remove manifest + // 3. Remove manifest so adoption can't pick up the stale entry. removeManifest(organizationId); // 4. Optional: archive host.db to host.db.broken- so we keep @@ -119,83 +99,123 @@ async reset( } } - // 5. Fresh start + // 5. Fresh start. return this.start(organizationId, config); } ``` -Expose via `hostServiceCoordinator.reset` in the tRPC router, mirroring the existing `restart` shape (takes `organizationId`, loads token from `loadToken()`, threads `cloudApiUrl`). Add an `wipeHostDb: z.boolean().optional()` input. +Router shape: -The default is **not** to wipe host.db — that's data loss. The Settings UI surfaces "Reset" (manifest only) and "Reset and clear local data" (also archives host.db) as separate actions. - -**Tests:** coordinator unit test that `reset` removes the manifest, the optional rename happens only when requested, and the new `start` produces a running instance. +```ts +reset: publicProcedure + .input(orgInput.extend({ wipeHostDb: z.boolean().optional() })) + .mutation(async ({ input }) => { + const coordinator = getHostServiceCoordinator(); + const { token } = await loadToken(); + if (!token) { + throw new Error("No auth token available — user must be logged in"); + } + return coordinator.reset( + input.organizationId, + { authToken: token, cloudApiUrl: env.NEXT_PUBLIC_API_URL }, + { wipeHostDb: input.wipeHostDb }, + ); + }), +``` -### 3. Surface failures with a recovery banner *(visibility — the actual user fix)* +The default is **not** to wipe host.db — that's data loss. Settings UI (item 6) surfaces "Reset" (manifest only) and "Reset and clear local data" (also archives host.db) as separate actions. -**File:** `LocalHostServiceProvider.tsx` + new `WorkspaceHostUnavailableBanner` component. +**Tests:** extend `host-service-coordinator.test.ts` (already mocks `host-service-manifest`, `host-service-utils`, etc.) with three cases: +1. `reset` with no `wipeHostDb` removes the manifest, calls `start`, returns a new connection. +2. `reset({ wipeHostDb: true })` renames `host.db` to `host.db.broken-` (mock `fs.renameSync`). +3. `reset` is safe to call when no instance is tracked (no manifest, no pid). -Today `activeHostUrl === null` is silent. Change the contract: +### 3. Surface failures with a recovery state *(PR2 — the actual user fix)* -- Provider tracks "did we attempt `start` for the active org, and has it been stopped/null longer than 5s?" -- When true, render a banner above the workspace content with: - - The last known status from `coordinator.onStatusChange` (`starting` / `stopped`) + any error message from the failed `start` mutation. - - Two actions: **Reset host service** (`reset(orgId, { wipeHostDb: false })`) and **Get diagnostics** (copies manifest path + last 200 log lines to clipboard, for paste into bug reports). +**Pattern to follow:** the v2 workspace layout (`apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx:87-101`) **already** renders `` when `useRemoteHostStatus` reports `offline` for a remote host. The local-host branch is currently a gap — when `isLocal && activeHostUrl === null` we render the workspace anyway and downstream calls toast `"Host service not available"`. Mirror the remote pattern for the local case. -The banner needs `select-text cursor-text` per `apps/desktop/AGENTS.md` so users can copy errors. +**Files to create / edit:** +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts` (new). Returns `"skip" | "loading" | "stopped" | "ready"`. Status is `"stopped"` when the workspace is local (`workspace.hostId === machineId`), the provider has attempted `start` for the active org, and `activeHostUrl` has stayed null for ≥5s. Read status events from `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription`, plus the most recent `start`-mutation error (lift the mutation result up from `LocalHostServiceProvider` into context). +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx` (new). Two-button UI (mirror `WorkspaceHostOfflineState` styling so it doesn't feel grafted on): + - **Reset host service** — calls `electronTrpc.hostServiceCoordinator.reset.useMutation({ wipeHostDb: false })`. + - **Copy diagnostics** — copies a small text blob to clipboard: manifest path, manifest JSON (via a new `hostServiceCoordinator.getDiagnostics(orgId)` query if needed, or just print `~/.superset/host//`), last `start` mutation error, and the current status. Use the existing `select-text cursor-text` convention from `apps/desktop/AGENTS.md` so users can copy from the rendered text directly. +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx` — add a branch above the `hostStatus.status === "offline"` block that renders `WorkspaceLocalHostStoppedState` when `useLocalHostStatus(workspace)` returns `"stopped"`. +- `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` — expose the latest `start` mutation error and a "last attempt timestamp" on the context. (Currently the mutation result is discarded.) -Hook the banner into the workspace right-pane shell so it appears in place of the blank skeleton — not as a toast (toasts auto-dismiss; this state is sticky). +**Why this shape, not a toast or top-bar banner:** `WorkspaceHostOfflineState` already establishes the pattern of replacing the workspace content when the host is unreachable, and the failure here is sticky (not a one-shot error). Reusing the slot keeps the design system consistent and means the recovery action appears exactly where the user is trying to work. -**Tests:** RTL test that mounts the provider with a mocked `start` mutation that throws, advances time past 5s, and asserts the banner renders with the expected error and actions. +**Tests:** +- RTL test for `useLocalHostStatus`: mount with mocked context where `activeHostUrl` is null and last-attempt was >5s ago → assert `"stopped"`. <5s → `"loading"`. Remote workspace → `"skip"`. +- RTL test for `WorkspaceLocalHostStoppedState`: clicking **Reset host service** fires the `reset` mutation; clicking **Copy diagnostics** calls `navigator.clipboard.writeText` with text containing the manifest path. -### 4. Renderer retry with exponential backoff *(covers transient C/D)* +### 4. Renderer retry with exponential backoff *(PR2 — covers transient C/D)* -**File:** `LocalHostServiceProvider.tsx` +**File:** `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` -Subscribe to `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription`. When status for the active org transitions to `stopped` (or `start` mutation throws), schedule a re-`start` with backoff: 1s, 4s, 15s, then give up. Cap visible to the user via the banner from (3) — after backoff exhausts, the banner says "Auto-retry exhausted, click Reset." +The current `useEffect` (lines 48-52) fires `start({orgId})` exactly once on mount and never again. Wrap that effect with retry-on-failure: -This fixes the **transient** half of bucket C (port race, slow disk on boot) and is the cheapest way to handle bucket B after a sign-in: as soon as `loadToken` returns a token, the next retry succeeds. +- Subscribe to `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription` (already exported by the router at `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts:49-56`). +- When the active org's status transitions to `stopped` (or the most recent `start` mutation errors), schedule a re-`start` with backoff: 1s, 4s, 15s. After three attempts, stop retrying and let the recovery state from (3) take over. +- Reset the backoff counter when status reaches `running`. -**Tests:** RTL test using fake timers — assert exactly three retries fire on stop events. +Pattern note: subscription must use the observable pattern, not async generators (`apps/desktop/AGENTS.md` covers why). The router already does this correctly. -### 5. Tray "Restart" enabled in `stopped` *(one-liner, always-correct)* +This fixes the **transient** half of bucket C (port race, slow disk on boot) and is the cheapest way to handle bucket B after a sign-in: as soon as `loadToken` returns a token, the next retry succeeds without the user clicking anything. -**File:** `apps/desktop/src/main/lib/tray/index.ts:172-200` +**Tests:** RTL test using fake timers — push three `stopped` events into the mocked subscription, advance time, assert exactly three `start` mutations fire at 1s/4s/15s, and a `running` event after the second attempt resets the counter. -```ts -// Before: enabled: isRunning, -// After: enabled: status !== "starting", -``` +### 5. Tray "Restart" enabled in `stopped` *(shipped in PR #4395)* -`stopped` is the state where Restart is most useful; gating it on `running` is backwards. +**Status:** ✅ Shipped. See `apps/desktop/src/main/lib/tray/index.ts:171-176`. -### 6. Settings → Advanced → "Reset host service" button *(belt and suspenders)* +### 6. Settings → Advanced → "Reset host service" button *(PR3 — belt and suspenders)* -**File:** new section in Settings (the Experimental panel feels right, since V2 toggle lives there). Reuses the tRPC `reset` mutation from (2). Two buttons: +**File:** new section in `apps/desktop/src/renderer/routes/_authenticated/settings/`. The Experimental panel is the right home (the V2 toggle already lives there). Reuses the tRPC `reset` mutation from (2). Two buttons: -- **Reset host service** — calls `reset({})`. Safe; loses no data. -- **Reset and clear local host data** — calls `reset({ wipeHostDb: true })`. Behind a confirm dialog that names what's lost (terminal session history, host-side chat state) and what survives (workspaces, projects, chats — those live in `@superset/local-db`). +- **Reset host service** — calls `reset({ wipeHostDb: false })`. Safe; loses no data. +- **Reset and clear local host data** — calls `reset({ wipeHostDb: true })`. Behind a confirm dialog that names what's lost (terminal session history, host-side chat state) and what survives (workspaces, projects, chats — those live in `@superset/local-db`, not `host.db`). -Lower priority than the in-context banner from (3) because a user who can't load workspaces probably won't think to open Settings, but it's useful from a support perspective ("ask them to click this button"). +Lower priority than the in-context recovery from (3) because a user who can't load workspaces probably won't think to open Settings, but it's useful from a support perspective ("ask them to click this button"). ## Out of scope -- **Why host-service crashes in the first place.** The "right pane went blank then Cmd+R" half of #4299 is a different bug — something in the workspace render path or host-service runtime that we should track separately. This plan only handles the *recovery* once it has crashed. +- **The white-screen path.** Tracked separately at [#4396](https://github.com/superset-sh/superset/issues/4396) — `getHostId()` on macOS shells out to `ioreg` via `execFileSync` with no timeout, blocking the main event loop when subprocess spawning is blocked by sandboxing tools. Fix lands independently. - **Auto-wiping `host.db`.** Data loss without a confirm dialog is non-negotiable. -- **Touching the start mutation semantics.** Other call sites depend on `start` being idempotent and resolving once. +- **Touching the `start` mutation semantics.** Other call sites depend on `start` being idempotent and resolving once. - **Cross-org healing.** We retry/reset only the active org. Inactive orgs are out of scope; their host-service can be healed by switching to them. ## Rollout -- **Order:** (1) and (5) ship first as a small PR — no UI changes, easy to revert. (2)+(3)+(4) ship as a second PR — that's the user-facing recovery story. (6) tacks on after. -- **Telemetry:** log `[host-service-coordinator] adoption health check failed` (1), `reset(orgId)` invocation source (banner vs tray vs settings) (2/3/6), retry attempt counts (4). PostHog event `host_service_recovery` with `{ source, action, succeeded }`. -- **Risk:** the adoption health check briefly extends coordinator startup on the unhappy path. Bound by `2_000ms`, and only triggers when adoption would have failed silently anyway, so the worst case is "startup is slightly slower when host-service was broken" — strictly better than current. +- **Order:** (1)+(5) shipped as PR #4395. (2)+(3)+(4) ship together as PR2 — that's the user-facing recovery story. (6) tacks on as PR3. +- **Telemetry:** PostHog event `host_service_recovery` with `{ source: "banner" | "tray" | "settings", action: "reset" | "retry" | "wipe", succeeded: boolean }`. Wire from the renderer at each click-handler. Server-side, the coordinator already `console.log`s adoption health-check failures (PR1) — keep those. +- **Risk:** Renderer retry could mask a deterministic spawn failure under three rounds of backoff if the recovery state isn't surfaced clearly. Mitigate by making (3) and (4) ship together: the retry counter is visible in the recovery state, and after exhaustion the UI says "Auto-retry exhausted, click Reset." ## Acceptance criteria For #4299 to be considered closed by this work: -- [ ] A user reproducing the issue can recover without filesystem access, in ≤2 clicks from the blank workspace screen. -- [ ] If the manifest points at a non-responsive pid, the next Cmd+R or app launch heals automatically (no user action). -- [ ] If host-service spawn is failing for a persistent reason (e.g., port binding error), the banner names the reason in copyable text. -- [ ] Tray → Restart works in `stopped` state. +- [x] If the manifest points at a non-responsive pid, the next Cmd+R or app launch heals automatically (PR1). +- [x] Tray → Restart works in `stopped` state (PR1). +- [ ] A user reproducing the issue can recover without filesystem access, in ≤2 clicks from the blank workspace screen (PR2). +- [ ] If host-service spawn is failing for a persistent reason (e.g., port binding error), the recovery state names the reason in copyable text (PR2). - [ ] No regression in startup time for the happy path (verified via dev-tools timing on a freshly-cloned `.superset` dir). + +## Handoff checklist for PR2 + +A new implementer picking this up should: + +1. **Read PR #4395** to see the patterns already in place: the coordinator test file structure (`apps/desktop/src/main/lib/host-service-coordinator.test.ts`) and the `pollHealthCheck` usage. +2. **Read `useRemoteHostStatus` + `WorkspaceHostOfflineState`** in `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/` — PR2's local-host equivalent should mirror these in shape and visual style. +3. **Implement in order:** (2) coordinator `reset` → tRPC `reset` → renderer mutation hook; then (4) retry loop in `LocalHostServiceProvider`; then (3) `useLocalHostStatus` + `WorkspaceLocalHostStoppedState` + layout branch. (2) before (3) so the recovery component has the mutation to call; (4) before (3) so the recovery state knows when to surface (after backoff exhausts). +4. **Verify each test can fail** — for every new test, mutate the implementation to confirm the test catches the regression. (Project convention; see `apps/desktop/AGENTS.md` and the existing coordinator test.) +5. **Lint and typecheck before pushing.** `bun run lint` (from repo root) treats Biome warnings as errors. `bun run typecheck` in `apps/desktop`. + +Files touched by PR2 (preview): + +- `apps/desktop/src/main/lib/host-service-coordinator.ts` (add `reset` method) +- `apps/desktop/src/main/lib/host-service-coordinator.test.ts` (extend with reset tests) +- `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` (add `reset` mutation) +- `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` (expose error / last-attempt, add retry loop) +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/{useLocalHostStatus.ts,useLocalHostStatus.test.ts,index.ts}` (new) +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/{WorkspaceLocalHostStoppedState.tsx,WorkspaceLocalHostStoppedState.test.tsx,index.ts}` (new) +- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx` (add branch above the offline branch) From 4525732818e777b7d3a6b3553f3b4931e8c63322 Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Sun, 10 May 2026 19:31:07 -0700 Subject: [PATCH 3/9] fix(desktop): add in-app host-service reset + retry loop + recovery UI PR2 of the recovery plan (#4299): when host-service can't be reached the provider now auto-retries on a 1s/4s/15s backoff, and once retries exhaust the v2-workspace layout swaps in a recovery screen with Reset and Copy-diagnostics actions instead of a blank pane. - coordinator: new reset() forcibly SIGKILLs any pid in the manifest, optionally archives host.db to host.db.broken-, then re-spawns - tRPC: hostServiceCoordinator.reset mutation extends orgInput with wipeHostDb - LocalHostServiceProvider: wraps start with exponential-backoff retry for the active org, resets on running, exposes lastStartError / lastAttemptAt / retryAttempt / retryExhausted on context - new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState pattern; layout branches on it above the offline branch --- .../routers/host-service-coordinator/index.ts | 18 ++ .../main/lib/host-service-coordinator.test.ts | 116 ++++++++++++- .../src/main/lib/host-service-coordinator.ts | 38 +++++ .../WorkspaceLocalHostStoppedState.tsx | 108 ++++++++++++ .../WorkspaceLocalHostStoppedState/index.ts | 1 + .../hooks/useLocalHostStatus/index.ts | 1 + .../useLocalHostStatus/useLocalHostStatus.ts | 58 +++++++ .../_dashboard/v2-workspace/layout.tsx | 13 ++ .../LocalHostServiceProvider.tsx | 155 +++++++++++++++++- 9 files changed, 502 insertions(+), 6 deletions(-) create mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx create mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts create mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts create mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts diff --git a/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts b/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts index 5683a865f4f..4584fe4031c 100644 --- a/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts +++ b/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts @@ -46,6 +46,24 @@ export const createHostServiceCoordinatorRouter = () => { }); }), + reset: publicProcedure + .input(orgInput.extend({ wipeHostDb: z.boolean().optional() })) + .mutation(async ({ input }) => { + const coordinator = getHostServiceCoordinator(); + const { token } = await loadToken(); + if (!token) { + throw new Error("No auth token available — user must be logged in"); + } + return coordinator.reset( + input.organizationId, + { + authToken: token, + cloudApiUrl: env.NEXT_PUBLIC_API_URL, + }, + { wipeHostDb: input.wipeHostDb }, + ); + }), + onStatusChange: publicProcedure.subscription(() => { return observable((emit) => { const coordinator = getHostServiceCoordinator(); diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts index 128578c007f..0cbbc26dd6a 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.test.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -1,4 +1,7 @@ import { beforeEach, describe, expect, mock, test } from "bun:test"; +import * as fs from "node:fs"; +import * as os from "node:os"; +import path from "node:path"; const APP_VERSION = "1.2.3"; @@ -13,6 +16,10 @@ const manifestStore: { } | null; } = { current: null }; +// Per-test temp dir so reset({wipeHostDb}) can rename a real file without +// races between tests. Tests assign this in beforeEach. +let testManifestRoot = ""; + const readManifestMock = mock(() => manifestStore.current); const removeManifestMock = mock(() => { manifestStore.current = null; @@ -24,7 +31,7 @@ mock.module("./host-service-manifest", () => ({ removeManifest: removeManifestMock, isProcessAlive: isProcessAliveMock, listManifests: mock(() => []), - manifestDir: (orgId: string) => `/tmp/host/${orgId}`, + manifestDir: (orgId: string) => path.join(testManifestRoot, orgId), })); const pollHealthCheckMock = mock(() => Promise.resolve(true)); @@ -95,6 +102,8 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { isProcessAliveMock.mockClear(); pollHealthCheckMock.mockClear(); + testManifestRoot = fs.mkdtempSync(path.join(os.tmpdir(), "hsc-test-")); + killedPids = []; originalKill = process.kill; // `process.kill` is read-only in some Bun versions — assign via cast. @@ -182,3 +191,108 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); }); + +describe("HostServiceCoordinator.reset", () => { + let coordinator: InstanceType; + let killedPids: Array<{ pid: number; signal: NodeJS.Signals | number }>; + let originalKill: typeof process.kill; + let spawnMock: ReturnType; + + beforeEach(() => { + manifestStore.current = null; + readManifestMock.mockClear(); + removeManifestMock.mockClear(); + isProcessAliveMock.mockClear(); + pollHealthCheckMock.mockClear(); + + testManifestRoot = fs.mkdtempSync(path.join(os.tmpdir(), "hsc-test-")); + + killedPids = []; + originalKill = process.kill; + (process as unknown as { kill: typeof process.kill }).kill = (( + pid: number, + signal?: NodeJS.Signals | number, + ) => { + killedPids.push({ pid, signal: signal ?? "SIGTERM" }); + return true; + }) as typeof process.kill; + + coordinator = new HostServiceCoordinator(); + spawnMock = mock(async () => ({ + port: 60000, + secret: "fresh-secret", + machineId: "host-1", + })); + (coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock; + }); + + test("removes manifest, SIGKILLs live pid, then spawns fresh", async () => { + manifestStore.current = baseManifest(8888); + + const conn = await coordinator.reset("org-1", spawnConfig); + + expect(killedPids).toContainEqual({ pid: 8888, signal: "SIGKILL" }); + expect(removeManifestMock).toHaveBeenCalledTimes(1); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + expect(conn.secret).toBe("fresh-secret"); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("with wipeHostDb=true, archives host.db to host.db.broken-", async () => { + manifestStore.current = baseManifest(9999); + const orgDir = path.join(testManifestRoot, "org-1"); + fs.mkdirSync(orgDir, { recursive: true }); + const dbPath = path.join(orgDir, "host.db"); + fs.writeFileSync(dbPath, "fake db payload"); + + const conn = await coordinator.reset("org-1", spawnConfig, { + wipeHostDb: true, + }); + + expect(fs.existsSync(dbPath)).toBe(false); + const archived = fs + .readdirSync(orgDir) + .find((f) => f.startsWith("host.db.broken-")); + expect(archived).toBeDefined(); + if (archived) { + expect(fs.readFileSync(path.join(orgDir, archived), "utf8")).toBe( + "fake db payload", + ); + } + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("is safe when no manifest exists — no kill, no rename, still spawns", async () => { + manifestStore.current = null; + + const conn = await coordinator.reset("org-1", spawnConfig); + + expect(killedPids).toHaveLength(0); + // `removeManifest` is called unconditionally — that's fine, the impl + // in host-service-manifest treats a missing file as a no-op. + expect(removeManifestMock).toHaveBeenCalledTimes(1); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); + + test("with wipeHostDb=true but no host.db file, does not throw and still spawns", async () => { + manifestStore.current = baseManifest(1111); + // No host.db on disk. + + const conn = await coordinator.reset("org-1", spawnConfig, { + wipeHostDb: true, + }); + + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); + + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + }); +}); diff --git a/apps/desktop/src/main/lib/host-service-coordinator.ts b/apps/desktop/src/main/lib/host-service-coordinator.ts index d42cc08fdb3..68d77ba4500 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.ts @@ -159,6 +159,44 @@ export class HostServiceCoordinator extends EventEmitter { return this.start(organizationId, config); } + /** + * Forcefully reset host-service state for an org. Used by the in-app + * recovery path (superset-sh/superset#4299) when `stop`/`restart` aren't + * enough — e.g. the manifest points at a live-but-wedged pid that adoption + * keeps picking up. + * + * Differs from `restart` in two ways: + * 1. SIGKILLs any pid in the manifest (even if not in this.instances), + * so a manifest written by a previous app session can't survive. + * 2. Optionally archives `host.db` to `host.db.broken-` so the next + * spawn starts with a clean DB. Off by default — that's data loss. + */ + async reset( + organizationId: string, + config: SpawnConfig, + options: { wipeHostDb?: boolean } = {}, + ): Promise { + this.stop(organizationId); + + const manifest = readManifest(organizationId); + if (manifest && isProcessAlive(manifest.pid)) { + try { + process.kill(manifest.pid, "SIGKILL"); + } catch {} + } + + removeManifest(organizationId); + + if (options.wipeHostDb) { + const dbPath = path.join(manifestDir(organizationId), "host.db"); + if (fs.existsSync(dbPath)) { + fs.renameSync(dbPath, `${dbPath}.broken-${Date.now()}`); + } + } + + return this.start(organizationId, config); + } + getConnection(organizationId: string): Connection | null { const instance = this.instances.get(organizationId); if (!instance || instance.status !== "running") return null; diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx new file mode 100644 index 00000000000..f884f32127a --- /dev/null +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx @@ -0,0 +1,108 @@ +import { Button } from "@superset/ui/button"; +import { toast } from "@superset/ui/sonner"; +import { AlertTriangle, Clipboard, RotateCw } from "lucide-react"; +import { useCopyToClipboard } from "renderer/hooks/useCopyToClipboard"; +import { electronTrpc } from "renderer/lib/electron-trpc"; + +interface WorkspaceLocalHostStoppedStateProps { + organizationId: string; + lastError: string | null; + lastAttemptAt: number | null; + retryAttempt: number; +} + +const MANIFEST_PATH_HINT = "~/.superset/host//"; + +export function WorkspaceLocalHostStoppedState({ + organizationId, + lastError, + lastAttemptAt, + retryAttempt, +}: WorkspaceLocalHostStoppedStateProps) { + const { copyToClipboard } = useCopyToClipboard(); + const resetMutation = electronTrpc.hostServiceCoordinator.reset.useMutation({ + onError: (error) => { + toast.error("Reset failed", { description: error.message }); + }, + }); + + const onReset = () => { + resetMutation.mutate({ organizationId, wipeHostDb: false }); + }; + + const onCopyDiagnostics = () => { + const lines = [ + "Superset host-service diagnostics", + `organizationId: ${organizationId}`, + `manifestDir: ${MANIFEST_PATH_HINT.replace("", organizationId)}`, + `retryAttempt: ${retryAttempt}`, + `lastAttemptAt: ${ + lastAttemptAt ? new Date(lastAttemptAt).toISOString() : "—" + }`, + `lastError: ${lastError ?? "(none — host-service status reported stopped)"}`, + `reporter: superset-sh/superset#4299`, + ]; + void copyToClipboard(lines.join("\n")); + toast.success("Diagnostics copied to clipboard"); + }; + + return ( +
+
+
+
+ +
+

+ Host service stopped +

+

+ The local host service didn't come back up after automatic retries. + Reset it to try again, or copy diagnostics for support. +

+
+ + {lastError ? ( +
+ + Last error + + + {lastError} + +
+ ) : null} + +
+ + +
+
+
+ ); +} diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts new file mode 100644 index 00000000000..c955b0f4cd6 --- /dev/null +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts @@ -0,0 +1 @@ +export { WorkspaceLocalHostStoppedState } from "./WorkspaceLocalHostStoppedState"; diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts new file mode 100644 index 00000000000..3f4c8a71725 --- /dev/null +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts @@ -0,0 +1 @@ +export { type LocalHostStatus, useLocalHostStatus } from "./useLocalHostStatus"; diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts new file mode 100644 index 00000000000..b9bdbdb85c7 --- /dev/null +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts @@ -0,0 +1,58 @@ +import type { SelectV2Workspace } from "@superset/db/schema"; +import { useLocalHostService } from "renderer/routes/_authenticated/providers/LocalHostServiceProvider"; + +export type LocalHostStatus = + | { status: "skip" } + | { status: "loading" } + | { + status: "stopped"; + organizationId: string; + lastError: string | null; + lastAttemptAt: number | null; + retryAttempt: number; + } + | { status: "ready" }; + +/** + * Mirror of `useRemoteHostStatus` for the **local** host. Returns the recovery + * state the v2-workspace layout should render when the local host-service + * is down — see superset-sh/superset#4299. + * + * - `skip` — workspace is not on this machine (remote host); caller falls + * through to `useRemoteHostStatus`. + * - `loading` — provider hasn't decided yet (no machineId, or retry chain + * still in flight). + * - `stopped` — host-service is unreachable AND the provider's automatic + * retry chain has exhausted. The recovery UI takes over here. + * - `ready` — host-service is responding (`activeHostUrl` is non-null). + */ +export function useLocalHostStatus( + workspace: SelectV2Workspace | null, +): LocalHostStatus { + const { + machineId, + activeHostUrl, + lastStartError, + lastAttemptAt, + retryAttempt, + retryExhausted, + } = useLocalHostService(); + + if (!workspace) return { status: "loading" }; + const isLocal = machineId != null && workspace.hostId === machineId; + if (!isLocal) return { status: "skip" }; + + if (activeHostUrl) return { status: "ready" }; + + // We're local, no connection, but the retry chain is still working through + // its delays — render a blank loading state rather than the recovery UI. + if (!retryExhausted) return { status: "loading" }; + + return { + status: "stopped", + organizationId: workspace.organizationId, + lastError: lastStartError, + lastAttemptAt, + retryAttempt, + }; +} diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx index 05a8de21c94..2bc3f60bb1e 100644 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx @@ -8,7 +8,9 @@ import { useWorkspaceCreatesStore } from "renderer/stores/workspace-creates"; import { WorkspaceCreateErrorState } from "./components/WorkspaceCreateErrorState"; import { WorkspaceCreatingState } from "./components/WorkspaceCreatingState"; import { WorkspaceHostIncompatibleState } from "./components/WorkspaceHostIncompatibleState"; +import { WorkspaceLocalHostStoppedState } from "./components/WorkspaceLocalHostStoppedState"; import { WorkspaceNotFoundState } from "./components/WorkspaceNotFoundState"; +import { useLocalHostStatus } from "./hooks/useLocalHostStatus"; import { useRemoteHostStatus } from "./hooks/useRemoteHostStatus"; import { WorkspaceProvider } from "./providers/WorkspaceProvider"; @@ -55,6 +57,7 @@ function V2WorkspaceLayout() { }, [ensureWorkspaceInSidebar, workspace]); const hostStatus = useRemoteHostStatus(workspace); + const localHostStatus = useLocalHostStatus(workspace); if (!workspaceId || !isReady || !workspaces) { return
; @@ -83,6 +86,16 @@ function V2WorkspaceLayout() { return ; } + if (localHostStatus.status === "stopped") { + return ( + + ); + } if (hostStatus.status === "incompatible") { return ( (null); +// Exponential backoff between automatic re-tries of `start` for the active +// org. Index N is the delay before attempt N+1. After RETRY_DELAYS_MS.length +// failures we stop retrying and let the recovery UI take over. +const RETRY_DELAYS_MS = [1_000, 4_000, 15_000]; + export function LocalHostServiceProvider({ children, }: { @@ -45,11 +61,125 @@ export function LocalHostServiceProvider({ [organizations], ); + // Retry bookkeeping. Refs hold the live values so the status subscription + // and the start-mutation callbacks can read them without re-subscribing. + const attemptRef = useRef(0); + const retryTimerRef = useRef | null>(null); + const activeOrgRef = useRef(activeOrganizationId); + + const [retryAttempt, setRetryAttempt] = useState(0); + const [retryExhausted, setRetryExhausted] = useState(false); + const [lastStartError, setLastStartError] = useState(null); + const [lastAttemptAt, setLastAttemptAt] = useState(null); + + const clearRetryTimer = useCallback(() => { + if (retryTimerRef.current) { + clearTimeout(retryTimerRef.current); + retryTimerRef.current = null; + } + }, []); + + // Ref-based indirection: scheduleRetry only needs to *call* fireStart + // deferred via setTimeout. Going through a ref avoids the circular + // useCallback dependency between the two. + const fireStartRef = useRef<((orgId: string) => void) | null>(null); + + const scheduleRetry = useCallback( + (organizationId: string) => { + if (organizationId !== activeOrgRef.current) return; + if (attemptRef.current >= RETRY_DELAYS_MS.length) { + setRetryExhausted(true); + return; + } + const delay = RETRY_DELAYS_MS[attemptRef.current]; + attemptRef.current += 1; + setRetryAttempt(attemptRef.current); + clearRetryTimer(); + retryTimerRef.current = setTimeout(() => { + retryTimerRef.current = null; + if (organizationId !== activeOrgRef.current) return; + fireStartRef.current?.(organizationId); + }, delay); + }, + [clearRetryTimer], + ); + + const fireStart = useCallback( + (organizationId: string) => { + setLastAttemptAt(Date.now()); + startHostService( + { organizationId }, + { + onError: (error: { message: string }) => { + if (organizationId !== activeOrgRef.current) return; + setLastStartError(error.message); + scheduleRetry(organizationId); + }, + onSuccess: () => { + if (organizationId !== activeOrgRef.current) return; + setLastStartError(null); + }, + }, + ); + }, + [startHostService, scheduleRetry], + ); + + useEffect(() => { + fireStartRef.current = fireStart; + }, [fireStart]); + + // Reset retry state whenever the active org changes — we don't carry + // backoff across orgs. + useEffect(() => { + activeOrgRef.current = activeOrganizationId; + attemptRef.current = 0; + setRetryAttempt(0); + setRetryExhausted(false); + setLastStartError(null); + setLastAttemptAt(null); + clearRetryTimer(); + }, [activeOrganizationId, clearRetryTimer]); + + // Initial start: fire once per org on mount/org-list change. Active org + // gets full retry treatment via fireStart; inactive orgs use the bare + // mutation (they get healed when the user switches to them). useEffect(() => { for (const organizationId of organizationIds) { - startHostService({ organizationId }); + if (organizationId === activeOrganizationId) { + fireStart(organizationId); + } else { + startHostService({ organizationId }); + } } - }, [organizationIds, startHostService]); + }, [organizationIds, activeOrganizationId, fireStart, startHostService]); + + electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription( + undefined, + { + onData: (event: { + organizationId: string; + status: "starting" | "running" | "stopped"; + }) => { + if (event.organizationId !== activeOrgRef.current) return; + if (event.status === "running") { + attemptRef.current = 0; + setRetryAttempt(0); + setRetryExhausted(false); + setLastStartError(null); + clearRetryTimer(); + return; + } + if (event.status === "stopped") { + // Child died (or was killed) after running. Re-enter the backoff + // chain unless we've already exhausted it for this active org. + scheduleRetry(event.organizationId); + } + }, + }, + ); + + useEffect(() => clearRetryTimer, [clearRetryTimer]); const { data: machineIdData } = electronTrpc.device.getMachineId.useQuery( undefined, @@ -66,8 +196,16 @@ export function LocalHostServiceProvider({ if (!machineIdData) return null; const machineId = machineIdData.machineId; + const base = { + machineId, + lastStartError, + lastAttemptAt, + retryAttempt, + retryExhausted, + }; + if (!activeConnection?.port) { - return { machineId, activeHostUrl: null }; + return { ...base, activeHostUrl: null }; } const activeHostUrl = `http://127.0.0.1:${activeConnection.port}`; @@ -75,8 +213,15 @@ export function LocalHostServiceProvider({ setHostServiceSecret(activeHostUrl, activeConnection.secret); } - return { machineId, activeHostUrl }; - }, [machineIdData, activeConnection]); + return { ...base, activeHostUrl }; + }, [ + machineIdData, + activeConnection, + lastStartError, + lastAttemptAt, + retryAttempt, + retryExhausted, + ]); if (!value) return null; From f9315a01350414e637c58a0c913a85c3eaeee91e Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Mon, 11 May 2026 22:42:03 -0700 Subject: [PATCH 4/9] fix(desktop): route host-service-coordinator logs through electron-log MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The coordinator's narration (including PR1's adoption health-check failures: "Adopted pid=… did not respond …, killing and respawning") went to bare console.log, which never reaches main.log in a packaged build. Route it through electron-log like the auto-updater does so the next #4299-style report has breadcrumbs. --- .../main/lib/host-service-coordinator.test.ts | 8 +++++++ .../src/main/lib/host-service-coordinator.ts | 21 ++++++++++--------- 2 files changed, 19 insertions(+), 10 deletions(-) diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts index 0cbbc26dd6a..fdc7f8752b2 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.test.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -52,6 +52,14 @@ mock.module("electron", () => ({ }, })); +mock.module("electron-log/main", () => ({ + default: { + info: () => {}, + warn: () => {}, + error: () => {}, + }, +})); + mock.module("@superset/local-db", () => ({ settings: {} })); mock.module("@superset/shared/host-info", () => ({ getHostId: () => "host-1", diff --git a/apps/desktop/src/main/lib/host-service-coordinator.ts b/apps/desktop/src/main/lib/host-service-coordinator.ts index 68d77ba4500..ef2d6111d1d 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.ts @@ -6,6 +6,7 @@ import path from "node:path"; import { settings } from "@superset/local-db"; import { getHostId, getHostName } from "@superset/shared/host-info"; import { app } from "electron"; +import log from "electron-log/main"; import { env } from "main/env.main"; import { env as sharedEnv } from "shared/env.shared"; import { getProcessEnvWithShellPath } from "../../lib/trpc/routers/workspaces/utils/shell-env"; @@ -283,19 +284,19 @@ export class HostServiceCoordinator extends EventEmitter { try { const ready = await waitForStableBundle(); if (!ready) { - console.warn( + log.warn( "[host-service] bundle did not stabilize, skipping reload", ); return; } const config = await configProvider(); if (!config) return; - console.log( + log.info( "[host-service] bundle changed, restarting running instances", ); await this.restartAll(config); } catch (error) { - console.error("[host-service] dev reload failed:", error); + log.error("[host-service] dev reload failed:", error); } finally { reloading = false; } @@ -309,7 +310,7 @@ export class HostServiceCoordinator extends EventEmitter { trigger(); }); } catch (error) { - console.error("[host-service] failed to enable dev reload:", error); + log.error("[host-service] failed to enable dev reload:", error); return () => {}; } @@ -334,7 +335,7 @@ export class HostServiceCoordinator extends EventEmitter { const reason = manifest.spawnedByAppVersion ? `spawned by app ${manifest.spawnedByAppVersion} != current ${currentAppVersion}` : "no recorded app version (pre-upgrade manifest)"; - console.log( + log.info( `[host-service:${organizationId}] Adopted service ${reason}, killing`, ); try { @@ -354,7 +355,7 @@ export class HostServiceCoordinator extends EventEmitter { ADOPT_HEALTH_CHECK_TIMEOUT_MS, ); if (!healthy) { - console.log( + log.info( `[host-service:${organizationId}] Adopted pid=${manifest.pid} did not respond on ${manifest.endpoint}, killing and respawning`, ); try { @@ -372,7 +373,7 @@ export class HostServiceCoordinator extends EventEmitter { }); this.startAdoptedLivenessCheck(organizationId, manifest.pid); - console.log( + log.info( `[host-service:${organizationId}] Adopted pid=${manifest.pid} port=${port}`, ); this.emitStatus(organizationId, "running", null); @@ -473,7 +474,7 @@ export class HostServiceCoordinator extends EventEmitter { instance.pid = childPid; child.on("exit", (code) => { - console.log(`[host-service:${organizationId}] exited with code ${code}`); + log.info(`[host-service:${organizationId}] exited with code ${code}`); const current = this.instances.get(organizationId); if (!current || current.pid !== childPid || current.status === "stopped") return; @@ -496,7 +497,7 @@ export class HostServiceCoordinator extends EventEmitter { instance.status = "running"; - console.log(`[host-service:${organizationId}] listening on port ${port}`); + log.info(`[host-service:${organizationId}] listening on port ${port}`); this.emitStatus(organizationId, "running", "starting"); return { port, secret, machineId: this.machineId }; } @@ -555,7 +556,7 @@ export class HostServiceCoordinator extends EventEmitter { this.adoptedLivenessTimers.delete(organizationId); const instance = this.instances.get(organizationId); if (instance && instance.status !== "stopped") { - console.log( + log.info( `[host-service:${organizationId}] Adopted process ${pid} died`, ); this.instances.delete(organizationId); From 608b6ea86276b9d227b5a712c2456b5c7a179bd3 Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Mon, 11 May 2026 23:22:33 -0700 Subject: [PATCH 5/9] fix(desktop): drop local-host recovery screen; keep reset + auto-retry MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit #4430 just removed the equivalent full-screen "host offline" gate for remote hosts ("render optimistically; downstream queries surface their own errors"). Follow that direction for the local case too: remove WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout branch. What stays from PR2: - coordinator reset() + hostServiceCoordinator.reset tRPC mutation — useful for the planned Settings escape hatch and for support - LocalHostServiceProvider auto-retries start() for the active org on a 1s/4s/15s backoff (and on a post-running "stopped" status event), resetting the counter when status reaches "running" — transient spawn failures self-heal without any UI A non-blocking recovery surface (banner, not takeover) can be a separate PR if we still want one. --- .../WorkspaceLocalHostStoppedState.tsx | 108 ------------------ .../WorkspaceLocalHostStoppedState/index.ts | 1 - .../hooks/useLocalHostStatus/index.ts | 1 - .../useLocalHostStatus/useLocalHostStatus.ts | 58 ---------- .../_dashboard/v2-workspace/layout.tsx | 13 --- .../LocalHostServiceProvider.tsx | 61 ++-------- 6 files changed, 8 insertions(+), 234 deletions(-) delete mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx delete mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts delete mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts delete mode 100644 apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx deleted file mode 100644 index f884f32127a..00000000000 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx +++ /dev/null @@ -1,108 +0,0 @@ -import { Button } from "@superset/ui/button"; -import { toast } from "@superset/ui/sonner"; -import { AlertTriangle, Clipboard, RotateCw } from "lucide-react"; -import { useCopyToClipboard } from "renderer/hooks/useCopyToClipboard"; -import { electronTrpc } from "renderer/lib/electron-trpc"; - -interface WorkspaceLocalHostStoppedStateProps { - organizationId: string; - lastError: string | null; - lastAttemptAt: number | null; - retryAttempt: number; -} - -const MANIFEST_PATH_HINT = "~/.superset/host//"; - -export function WorkspaceLocalHostStoppedState({ - organizationId, - lastError, - lastAttemptAt, - retryAttempt, -}: WorkspaceLocalHostStoppedStateProps) { - const { copyToClipboard } = useCopyToClipboard(); - const resetMutation = electronTrpc.hostServiceCoordinator.reset.useMutation({ - onError: (error) => { - toast.error("Reset failed", { description: error.message }); - }, - }); - - const onReset = () => { - resetMutation.mutate({ organizationId, wipeHostDb: false }); - }; - - const onCopyDiagnostics = () => { - const lines = [ - "Superset host-service diagnostics", - `organizationId: ${organizationId}`, - `manifestDir: ${MANIFEST_PATH_HINT.replace("", organizationId)}`, - `retryAttempt: ${retryAttempt}`, - `lastAttemptAt: ${ - lastAttemptAt ? new Date(lastAttemptAt).toISOString() : "—" - }`, - `lastError: ${lastError ?? "(none — host-service status reported stopped)"}`, - `reporter: superset-sh/superset#4299`, - ]; - void copyToClipboard(lines.join("\n")); - toast.success("Diagnostics copied to clipboard"); - }; - - return ( -
-
-
-
- -
-

- Host service stopped -

-

- The local host service didn't come back up after automatic retries. - Reset it to try again, or copy diagnostics for support. -

-
- - {lastError ? ( -
- - Last error - - - {lastError} - -
- ) : null} - -
- - -
-
-
- ); -} diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts deleted file mode 100644 index c955b0f4cd6..00000000000 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts +++ /dev/null @@ -1 +0,0 @@ -export { WorkspaceLocalHostStoppedState } from "./WorkspaceLocalHostStoppedState"; diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts deleted file mode 100644 index 3f4c8a71725..00000000000 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts +++ /dev/null @@ -1 +0,0 @@ -export { type LocalHostStatus, useLocalHostStatus } from "./useLocalHostStatus"; diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts deleted file mode 100644 index b9bdbdb85c7..00000000000 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts +++ /dev/null @@ -1,58 +0,0 @@ -import type { SelectV2Workspace } from "@superset/db/schema"; -import { useLocalHostService } from "renderer/routes/_authenticated/providers/LocalHostServiceProvider"; - -export type LocalHostStatus = - | { status: "skip" } - | { status: "loading" } - | { - status: "stopped"; - organizationId: string; - lastError: string | null; - lastAttemptAt: number | null; - retryAttempt: number; - } - | { status: "ready" }; - -/** - * Mirror of `useRemoteHostStatus` for the **local** host. Returns the recovery - * state the v2-workspace layout should render when the local host-service - * is down — see superset-sh/superset#4299. - * - * - `skip` — workspace is not on this machine (remote host); caller falls - * through to `useRemoteHostStatus`. - * - `loading` — provider hasn't decided yet (no machineId, or retry chain - * still in flight). - * - `stopped` — host-service is unreachable AND the provider's automatic - * retry chain has exhausted. The recovery UI takes over here. - * - `ready` — host-service is responding (`activeHostUrl` is non-null). - */ -export function useLocalHostStatus( - workspace: SelectV2Workspace | null, -): LocalHostStatus { - const { - machineId, - activeHostUrl, - lastStartError, - lastAttemptAt, - retryAttempt, - retryExhausted, - } = useLocalHostService(); - - if (!workspace) return { status: "loading" }; - const isLocal = machineId != null && workspace.hostId === machineId; - if (!isLocal) return { status: "skip" }; - - if (activeHostUrl) return { status: "ready" }; - - // We're local, no connection, but the retry chain is still working through - // its delays — render a blank loading state rather than the recovery UI. - if (!retryExhausted) return { status: "loading" }; - - return { - status: "stopped", - organizationId: workspace.organizationId, - lastError: lastStartError, - lastAttemptAt, - retryAttempt, - }; -} diff --git a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx index 2bc3f60bb1e..05a8de21c94 100644 --- a/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx +++ b/apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx @@ -8,9 +8,7 @@ import { useWorkspaceCreatesStore } from "renderer/stores/workspace-creates"; import { WorkspaceCreateErrorState } from "./components/WorkspaceCreateErrorState"; import { WorkspaceCreatingState } from "./components/WorkspaceCreatingState"; import { WorkspaceHostIncompatibleState } from "./components/WorkspaceHostIncompatibleState"; -import { WorkspaceLocalHostStoppedState } from "./components/WorkspaceLocalHostStoppedState"; import { WorkspaceNotFoundState } from "./components/WorkspaceNotFoundState"; -import { useLocalHostStatus } from "./hooks/useLocalHostStatus"; import { useRemoteHostStatus } from "./hooks/useRemoteHostStatus"; import { WorkspaceProvider } from "./providers/WorkspaceProvider"; @@ -57,7 +55,6 @@ function V2WorkspaceLayout() { }, [ensureWorkspaceInSidebar, workspace]); const hostStatus = useRemoteHostStatus(workspace); - const localHostStatus = useLocalHostStatus(workspace); if (!workspaceId || !isReady || !workspaces) { return
; @@ -86,16 +83,6 @@ function V2WorkspaceLayout() { return ; } - if (localHostStatus.status === "stopped") { - return ( - - ); - } if (hostStatus.status === "incompatible") { return ( | null>(null); const activeOrgRef = useRef(activeOrganizationId); - const [retryAttempt, setRetryAttempt] = useState(0); - const [retryExhausted, setRetryExhausted] = useState(false); - const [lastStartError, setLastStartError] = useState(null); - const [lastAttemptAt, setLastAttemptAt] = useState(null); - const clearRetryTimer = useCallback(() => { if (retryTimerRef.current) { clearTimeout(retryTimerRef.current); @@ -87,13 +74,9 @@ export function LocalHostServiceProvider({ const scheduleRetry = useCallback( (organizationId: string) => { if (organizationId !== activeOrgRef.current) return; - if (attemptRef.current >= RETRY_DELAYS_MS.length) { - setRetryExhausted(true); - return; - } + if (attemptRef.current >= RETRY_DELAYS_MS.length) return; const delay = RETRY_DELAYS_MS[attemptRef.current]; attemptRef.current += 1; - setRetryAttempt(attemptRef.current); clearRetryTimer(); retryTimerRef.current = setTimeout(() => { retryTimerRef.current = null; @@ -106,19 +89,13 @@ export function LocalHostServiceProvider({ const fireStart = useCallback( (organizationId: string) => { - setLastAttemptAt(Date.now()); startHostService( { organizationId }, { - onError: (error: { message: string }) => { + onError: () => { if (organizationId !== activeOrgRef.current) return; - setLastStartError(error.message); scheduleRetry(organizationId); }, - onSuccess: () => { - if (organizationId !== activeOrgRef.current) return; - setLastStartError(null); - }, }, ); }, @@ -134,15 +111,11 @@ export function LocalHostServiceProvider({ useEffect(() => { activeOrgRef.current = activeOrganizationId; attemptRef.current = 0; - setRetryAttempt(0); - setRetryExhausted(false); - setLastStartError(null); - setLastAttemptAt(null); clearRetryTimer(); }, [activeOrganizationId, clearRetryTimer]); // Initial start: fire once per org on mount/org-list change. Active org - // gets full retry treatment via fireStart; inactive orgs use the bare + // gets the retry treatment via fireStart; inactive orgs use the bare // mutation (they get healed when the user switches to them). useEffect(() => { for (const organizationId of organizationIds) { @@ -164,9 +137,6 @@ export function LocalHostServiceProvider({ if (event.organizationId !== activeOrgRef.current) return; if (event.status === "running") { attemptRef.current = 0; - setRetryAttempt(0); - setRetryExhausted(false); - setLastStartError(null); clearRetryTimer(); return; } @@ -196,16 +166,8 @@ export function LocalHostServiceProvider({ if (!machineIdData) return null; const machineId = machineIdData.machineId; - const base = { - machineId, - lastStartError, - lastAttemptAt, - retryAttempt, - retryExhausted, - }; - if (!activeConnection?.port) { - return { ...base, activeHostUrl: null }; + return { machineId, activeHostUrl: null }; } const activeHostUrl = `http://127.0.0.1:${activeConnection.port}`; @@ -213,15 +175,8 @@ export function LocalHostServiceProvider({ setHostServiceSecret(activeHostUrl, activeConnection.secret); } - return { ...base, activeHostUrl }; - }, [ - machineIdData, - activeConnection, - lastStartError, - lastAttemptAt, - retryAttempt, - retryExhausted, - ]); + return { machineId, activeHostUrl }; + }, [machineIdData, activeConnection]); if (!value) return null; From 4ec73dc822edc685a4adf7736392f7e2e87fc41a Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Mon, 11 May 2026 23:59:56 -0700 Subject: [PATCH 6/9] fix(desktop): address review feedback on host-service reset + tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - reset(): read the manifest pid *before* stop() removes it, so a wedged tracked instance gets SIGTERM (from stop) then SIGKILL (escalation) instead of only SIGTERM. (cubic P1) - log on SIGKILL failure in reset() and the adoption health-check path — ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2) - tests: restore process.kill in afterEach (unconditional) and rm the mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the errno actually simulated); add a reset test for the tracked-instance SIGTERM→SIGKILL escalation. (greptile/coderabbit) --- .../main/lib/host-service-coordinator.test.ts | 58 +++++++++++++------ .../src/main/lib/host-service-coordinator.ts | 27 +++++++-- 2 files changed, 62 insertions(+), 23 deletions(-) diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts index fdc7f8752b2..0c289050470 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.test.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -1,4 +1,4 @@ -import { beforeEach, describe, expect, mock, test } from "bun:test"; +import { afterEach, beforeEach, describe, expect, mock, test } from "bun:test"; import * as fs from "node:fs"; import * as os from "node:os"; import path from "node:path"; @@ -133,6 +133,16 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { (coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock; }); + afterEach(() => { + // Unconditional — if an assertion throws mid-test, the override must + // still be torn down or the next test captures the wrong `originalKill`. + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + if (testManifestRoot) { + fs.rmSync(testManifestRoot, { recursive: true, force: true }); + testManifestRoot = ""; + } + }); + test("adopts when manifest is healthy", async () => { manifestStore.current = baseManifest(1234); pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(true)); @@ -145,8 +155,6 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { expect(spawnMock).not.toHaveBeenCalled(); expect(removeManifestMock).not.toHaveBeenCalled(); expect(coordinator.getProcessStatus("org-1")).toBe("running"); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); test("kills the adopted pid with SIGKILL and falls through to spawn when health check fails", async () => { @@ -161,15 +169,15 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); expect(conn.secret).toBe("fresh-secret"); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); - test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => { + test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => { manifestStore.current = baseManifest(7777); pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(false)); (process as unknown as { kill: typeof process.kill }).kill = (() => { - throw new Error("ESRCH"); + const err: NodeJS.ErrnoException = new Error("kill ESRCH"); + err.code = "ESRCH"; + throw err; }) as typeof process.kill; const conn = await coordinator.start("org-1", spawnConfig); @@ -177,8 +185,6 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { expect(removeManifestMock).toHaveBeenCalledTimes(1); expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); test("kills with SIGTERM (existing behavior) on app-version mismatch, before health check", async () => { @@ -195,8 +201,6 @@ describe("HostServiceCoordinator.tryAdopt — adoption health check", () => { expect(removeManifestMock).toHaveBeenCalledTimes(1); expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); }); @@ -234,6 +238,14 @@ describe("HostServiceCoordinator.reset", () => { (coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock; }); + afterEach(() => { + (process as unknown as { kill: typeof process.kill }).kill = originalKill; + if (testManifestRoot) { + fs.rmSync(testManifestRoot, { recursive: true, force: true }); + testManifestRoot = ""; + } + }); + test("removes manifest, SIGKILLs live pid, then spawns fresh", async () => { manifestStore.current = baseManifest(8888); @@ -244,8 +256,24 @@ describe("HostServiceCoordinator.reset", () => { expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); expect(conn.secret).toBe("fresh-secret"); + }); - (process as unknown as { kill: typeof process.kill }).kill = originalKill; + test("SIGKILLs the manifest pid even when an instance is tracked (stop's SIGTERM may not be enough)", async () => { + // First adopt a healthy instance so it's tracked in `this.instances`. + manifestStore.current = baseManifest(2468); + pollHealthCheckMock.mockImplementationOnce(() => Promise.resolve(true)); + await coordinator.start("org-1", spawnConfig); + expect(coordinator.getProcessStatus("org-1")).toBe("running"); + killedPids.length = 0; + + // Adoption leaves the manifest in place; reset must read its pid before + // stop() removes it, then escalate SIGTERM → SIGKILL on a wedged process. + const conn = await coordinator.reset("org-1", spawnConfig); + + expect(killedPids).toContainEqual({ pid: 2468, signal: "SIGTERM" }); + expect(killedPids).toContainEqual({ pid: 2468, signal: "SIGKILL" }); + expect(spawnMock).toHaveBeenCalledTimes(1); + expect(conn.port).toBe(60000); }); test("with wipeHostDb=true, archives host.db to host.db.broken-", async () => { @@ -271,8 +299,6 @@ describe("HostServiceCoordinator.reset", () => { } expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); test("is safe when no manifest exists — no kill, no rename, still spawns", async () => { @@ -286,8 +312,6 @@ describe("HostServiceCoordinator.reset", () => { expect(removeManifestMock).toHaveBeenCalledTimes(1); expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); test("with wipeHostDb=true but no host.db file, does not throw and still spawns", async () => { @@ -300,7 +324,5 @@ describe("HostServiceCoordinator.reset", () => { expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); - - (process as unknown as { kill: typeof process.kill }).kill = originalKill; }); }); diff --git a/apps/desktop/src/main/lib/host-service-coordinator.ts b/apps/desktop/src/main/lib/host-service-coordinator.ts index ef2d6111d1d..18ddc9b414c 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.ts @@ -177,13 +177,22 @@ export class HostServiceCoordinator extends EventEmitter { config: SpawnConfig, options: { wipeHostDb?: boolean } = {}, ): Promise { + // Capture the manifest pid *before* stop() — stop() removes the manifest + // for tracked instances and only sends SIGTERM, which a wedged process + // can ignore. We escalate to SIGKILL on whatever pid the manifest named. + const manifestPid = readManifest(organizationId)?.pid; + this.stop(organizationId); - const manifest = readManifest(organizationId); - if (manifest && isProcessAlive(manifest.pid)) { + if (manifestPid != null && isProcessAlive(manifestPid)) { try { - process.kill(manifest.pid, "SIGKILL"); - } catch {} + process.kill(manifestPid, "SIGKILL"); + } catch (error) { + log.warn( + `[host-service:${organizationId}] reset: SIGKILL of pid=${manifestPid} failed`, + error, + ); + } } removeManifest(organizationId); @@ -360,7 +369,15 @@ export class HostServiceCoordinator extends EventEmitter { ); try { process.kill(manifest.pid, "SIGKILL"); - } catch {} + } catch (error) { + // ESRCH (already gone) is fine; anything else (EPERM) we want to see. + if ((error as NodeJS.ErrnoException)?.code !== "ESRCH") { + log.warn( + `[host-service:${organizationId}] SIGKILL of stale pid=${manifest.pid} failed`, + error, + ); + } + } removeManifest(organizationId); return null; } From a4991ad3848b376ded1ce1ce8f66e47dcf4946dd Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Tue, 12 May 2026 00:24:02 -0700 Subject: [PATCH 7/9] refactor(desktop): trim host-service recovery PR to the load-bearing bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Drop the speculative pieces: - reset({ wipeHostDb }) + host.db archival — no caller; the Settings "clear local data" button that would use it is out of scope here - LocalHostServiceProvider retry-with-backoff loop — heavier than the #4299 fix needs and invisible without the recovery UI (already dropped); reverted to origin/main What remains: - adopt health-check (the actual #4299 fix) + tray Restart-in-stopped - coordinator reset() (manifest-only force-kill + respawn) + tRPC mutation — small, tested, ready for a support escape hatch - coordinator narration through electron-log --- .../routers/host-service-coordinator/index.ts | 28 ++--- .../main/lib/host-service-coordinator.test.ts | 42 +------ .../src/main/lib/host-service-coordinator.ts | 24 +--- .../LocalHostServiceProvider.tsx | 104 +----------------- 4 files changed, 22 insertions(+), 176 deletions(-) diff --git a/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts b/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts index 4584fe4031c..d7d3aaa16fd 100644 --- a/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts +++ b/apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts @@ -46,23 +46,17 @@ export const createHostServiceCoordinatorRouter = () => { }); }), - reset: publicProcedure - .input(orgInput.extend({ wipeHostDb: z.boolean().optional() })) - .mutation(async ({ input }) => { - const coordinator = getHostServiceCoordinator(); - const { token } = await loadToken(); - if (!token) { - throw new Error("No auth token available — user must be logged in"); - } - return coordinator.reset( - input.organizationId, - { - authToken: token, - cloudApiUrl: env.NEXT_PUBLIC_API_URL, - }, - { wipeHostDb: input.wipeHostDb }, - ); - }), + reset: publicProcedure.input(orgInput).mutation(async ({ input }) => { + const coordinator = getHostServiceCoordinator(); + const { token } = await loadToken(); + if (!token) { + throw new Error("No auth token available — user must be logged in"); + } + return coordinator.reset(input.organizationId, { + authToken: token, + cloudApiUrl: env.NEXT_PUBLIC_API_URL, + }); + }), onStatusChange: publicProcedure.subscription(() => { return observable((emit) => { diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts index 0c289050470..95b1fa908e8 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.test.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -16,8 +16,9 @@ const manifestStore: { } | null; } = { current: null }; -// Per-test temp dir so reset({wipeHostDb}) can rename a real file without -// races between tests. Tests assign this in beforeEach. +// Per-test temp dir backing the mocked `manifestDir`. A real path (not a +// fixed string) so tests stay isolated; assigned in beforeEach, removed in +// afterEach. let testManifestRoot = ""; const readManifestMock = mock(() => manifestStore.current); @@ -276,31 +277,6 @@ describe("HostServiceCoordinator.reset", () => { expect(conn.port).toBe(60000); }); - test("with wipeHostDb=true, archives host.db to host.db.broken-", async () => { - manifestStore.current = baseManifest(9999); - const orgDir = path.join(testManifestRoot, "org-1"); - fs.mkdirSync(orgDir, { recursive: true }); - const dbPath = path.join(orgDir, "host.db"); - fs.writeFileSync(dbPath, "fake db payload"); - - const conn = await coordinator.reset("org-1", spawnConfig, { - wipeHostDb: true, - }); - - expect(fs.existsSync(dbPath)).toBe(false); - const archived = fs - .readdirSync(orgDir) - .find((f) => f.startsWith("host.db.broken-")); - expect(archived).toBeDefined(); - if (archived) { - expect(fs.readFileSync(path.join(orgDir, archived), "utf8")).toBe( - "fake db payload", - ); - } - expect(spawnMock).toHaveBeenCalledTimes(1); - expect(conn.port).toBe(60000); - }); - test("is safe when no manifest exists — no kill, no rename, still spawns", async () => { manifestStore.current = null; @@ -313,16 +289,4 @@ describe("HostServiceCoordinator.reset", () => { expect(spawnMock).toHaveBeenCalledTimes(1); expect(conn.port).toBe(60000); }); - - test("with wipeHostDb=true but no host.db file, does not throw and still spawns", async () => { - manifestStore.current = baseManifest(1111); - // No host.db on disk. - - const conn = await coordinator.reset("org-1", spawnConfig, { - wipeHostDb: true, - }); - - expect(spawnMock).toHaveBeenCalledTimes(1); - expect(conn.port).toBe(60000); - }); }); diff --git a/apps/desktop/src/main/lib/host-service-coordinator.ts b/apps/desktop/src/main/lib/host-service-coordinator.ts index 18ddc9b414c..91eeae74ddc 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.ts @@ -161,21 +161,16 @@ export class HostServiceCoordinator extends EventEmitter { } /** - * Forcefully reset host-service state for an org. Used by the in-app - * recovery path (superset-sh/superset#4299) when `stop`/`restart` aren't - * enough — e.g. the manifest points at a live-but-wedged pid that adoption - * keeps picking up. - * - * Differs from `restart` in two ways: - * 1. SIGKILLs any pid in the manifest (even if not in this.instances), - * so a manifest written by a previous app session can't survive. - * 2. Optionally archives `host.db` to `host.db.broken-` so the next - * spawn starts with a clean DB. Off by default — that's data loss. + * Forcefully reset host-service state for an org. Unlike `restart`, this + * SIGKILLs whatever pid the manifest names — even when no instance is + * tracked in this process (e.g. a manifest left by a previous app session) + * — then removes the manifest so adoption can't pick up the stale entry, + * and respawns. Used by the recovery path for superset-sh/superset#4299 + * where a live-but-wedged host-service keeps getting re-adopted. */ async reset( organizationId: string, config: SpawnConfig, - options: { wipeHostDb?: boolean } = {}, ): Promise { // Capture the manifest pid *before* stop() — stop() removes the manifest // for tracked instances and only sends SIGTERM, which a wedged process @@ -197,13 +192,6 @@ export class HostServiceCoordinator extends EventEmitter { removeManifest(organizationId); - if (options.wipeHostDb) { - const dbPath = path.join(manifestDir(organizationId), "host.db"); - if (fs.existsSync(dbPath)) { - fs.renameSync(dbPath, `${dbPath}.broken-${Date.now()}`); - } - } - return this.start(organizationId, config); } diff --git a/apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx b/apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx index 2677a8ebb78..5ef6fdfd5cc 100644 --- a/apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx +++ b/apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx @@ -2,11 +2,9 @@ import { useLiveQuery } from "@tanstack/react-db"; import { createContext, type ReactNode, - useCallback, useContext, useEffect, useMemo, - useRef, } from "react"; import { env } from "renderer/env.renderer"; import { authClient } from "renderer/lib/auth-client"; @@ -23,12 +21,6 @@ interface LocalHostServiceContextValue { const LocalHostServiceContext = createContext(null); -// Exponential backoff between automatic re-tries of `start` for the active -// org. Index N is the delay before attempt N+1. After RETRY_DELAYS_MS.length -// failures we stop retrying; the right pane stays blank until something else -// (org switch, tray restart, app relaunch) re-triggers a start. -const RETRY_DELAYS_MS = [1_000, 4_000, 15_000]; - export function LocalHostServiceProvider({ children, }: { @@ -53,103 +45,11 @@ export function LocalHostServiceProvider({ [organizations], ); - // Retry bookkeeping. Refs hold the live values so the status subscription - // and the start-mutation callbacks can read them without re-subscribing. - const attemptRef = useRef(0); - const retryTimerRef = useRef | null>(null); - const activeOrgRef = useRef(activeOrganizationId); - - const clearRetryTimer = useCallback(() => { - if (retryTimerRef.current) { - clearTimeout(retryTimerRef.current); - retryTimerRef.current = null; - } - }, []); - - // Ref-based indirection: scheduleRetry only needs to *call* fireStart - // deferred via setTimeout. Going through a ref avoids the circular - // useCallback dependency between the two. - const fireStartRef = useRef<((orgId: string) => void) | null>(null); - - const scheduleRetry = useCallback( - (organizationId: string) => { - if (organizationId !== activeOrgRef.current) return; - if (attemptRef.current >= RETRY_DELAYS_MS.length) return; - const delay = RETRY_DELAYS_MS[attemptRef.current]; - attemptRef.current += 1; - clearRetryTimer(); - retryTimerRef.current = setTimeout(() => { - retryTimerRef.current = null; - if (organizationId !== activeOrgRef.current) return; - fireStartRef.current?.(organizationId); - }, delay); - }, - [clearRetryTimer], - ); - - const fireStart = useCallback( - (organizationId: string) => { - startHostService( - { organizationId }, - { - onError: () => { - if (organizationId !== activeOrgRef.current) return; - scheduleRetry(organizationId); - }, - }, - ); - }, - [startHostService, scheduleRetry], - ); - - useEffect(() => { - fireStartRef.current = fireStart; - }, [fireStart]); - - // Reset retry state whenever the active org changes — we don't carry - // backoff across orgs. - useEffect(() => { - activeOrgRef.current = activeOrganizationId; - attemptRef.current = 0; - clearRetryTimer(); - }, [activeOrganizationId, clearRetryTimer]); - - // Initial start: fire once per org on mount/org-list change. Active org - // gets the retry treatment via fireStart; inactive orgs use the bare - // mutation (they get healed when the user switches to them). useEffect(() => { for (const organizationId of organizationIds) { - if (organizationId === activeOrganizationId) { - fireStart(organizationId); - } else { - startHostService({ organizationId }); - } + startHostService({ organizationId }); } - }, [organizationIds, activeOrganizationId, fireStart, startHostService]); - - electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription( - undefined, - { - onData: (event: { - organizationId: string; - status: "starting" | "running" | "stopped"; - }) => { - if (event.organizationId !== activeOrgRef.current) return; - if (event.status === "running") { - attemptRef.current = 0; - clearRetryTimer(); - return; - } - if (event.status === "stopped") { - // Child died (or was killed) after running. Re-enter the backoff - // chain unless we've already exhausted it for this active org. - scheduleRetry(event.organizationId); - } - }, - }, - ); - - useEffect(() => clearRetryTimer, [clearRetryTimer]); + }, [organizationIds, startHostService]); const { data: machineIdData } = electronTrpc.device.getMachineId.useQuery( undefined, From a88be527a62a4835870e09822655456f3f438bd6 Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Tue, 12 May 2026 00:39:09 -0700 Subject: [PATCH 8/9] docs(desktop): replace host-service recovery plan with a short shipped summary, move to plans/done The original 220-line plan ranked six items and described a recovery screen + retry loop that didn't ship. Replace it with a tight summary of what landed (adopt health-check, reset(), tray fix, electron-log) and what was considered and deferred, and move it under plans/done per the repo convention. --- .../20260510-1430-host-service-recovery.md | 221 ------------------ .../done/20260510-host-service-recovery.md | 19 ++ 2 files changed, 19 insertions(+), 221 deletions(-) delete mode 100644 apps/desktop/plans/20260510-1430-host-service-recovery.md create mode 100644 apps/desktop/plans/done/20260510-host-service-recovery.md diff --git a/apps/desktop/plans/20260510-1430-host-service-recovery.md b/apps/desktop/plans/20260510-1430-host-service-recovery.md deleted file mode 100644 index 6edadd5bab7..00000000000 --- a/apps/desktop/plans/20260510-1430-host-service-recovery.md +++ /dev/null @@ -1,221 +0,0 @@ -# Host-Service Recovery: Self-Healing + In-App Escape Hatch - -**Status:** PR1 shipped ([#4395](https://github.com/superset-sh/superset/pull/4395)). PR2 (items 2 + 3 + 4) ready for handoff. PR3 (item 6) optional follow-up. -**Scope:** v2 desktop only — `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts`, `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/*`, `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/*`, `apps/desktop/src/main/lib/tray/index.ts`, plus a new Settings section. -**Tracking issue:** [superset-sh/superset#4299](https://github.com/superset-sh/superset/issues/4299) -**Related:** [#4396](https://github.com/superset-sh/superset/issues/4396) (white-screen / unbounded `ioreg execFileSync` — separate fix path, not blocked by this plan). - -## Goal - -When host-service is unreachable, the V2 right-pane goes silently blank and the user has no in-app recovery path — every interactive surface that needs host-service errors with `"Host service not available"`, including the **delete workspace** action, so the user can't even clean up. Today the only working recovery is filesystem surgery (`mv ~/.superset/host …`). - -This plan adds: - -1. **Self-healing** — adopt only host-services that actually respond, and retry start when status flips to `stopped`. -2. **A user-visible escape hatch** — a "Reset host service" action surfaced both as a banner when the right-pane would otherwise be blank, and as a button in Settings. -3. **Honesty in the tray** — let users click "Restart" when host-service is actually in the state where restart helps. - -## Why - -From #4299 the reporter is stuck in this state: - -- Workspaces list renders (it's hydrated from local-db, not host-service). -- Clicking any workspace shows nothing on the right — every V2 surface uses `useLocalHostService().activeHostUrl`, which is `null`. -- Delete workspace errors `"Host service unavailable"`. -- Account-management section sits in skeleton forever (also a host-service call). -- Sign-out + sign-in, Cmd+R, Quit Completely, Tray → Restart, V1/V2 toggle — none recover. -- Manual fix: deleting `~/.superset/host//manifest.json` and relaunching. There is no in-app equivalent today. - -Symptoms aside, the architectural problem is that **`activeHostUrl === null` is a permanently absorbing state**: nothing in the renderer or main ever tries to climb out of it after the first `start({orgId})` mutation fails or after adoption silently picks up a dead process. - -## Current state - -The pipeline that produces `activeHostUrl`: - -| Step | Where | -| --- | --- | -| Renderer fires `start({orgId})` once per org on mount | `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx:48-52` | -| Renderer polls `getConnection({activeOrgId})` every 5s | `LocalHostServiceProvider.tsx:59-63` | -| `start` → coordinator: `tryAdopt` then `spawn` | `apps/desktop/src/main/lib/host-service-coordinator.ts:74-102` | -| `tryAdopt` only verifies `isProcessAlive(pid)` and app-version match | `host-service-coordinator.ts:280-315` | -| `spawn` health-checks via `pollHealthCheck` then registers `running` | `host-service-coordinator.ts:333-437` | -| `getConnection` returns null unless an instance is `running` | `host-service-coordinator.ts:155-163` | -| Tray "Restart" is gated on `isRunning` | `apps/desktop/src/main/lib/tray/index.ts:172-200` | - -### The four failure buckets - -| # | Bucket | Why we get stuck | -|---|--------|------------------| -| A | Stale/unhealthy adopted process | `tryAdopt` doesn't health-check. A live-pid-but-not-serving process is adopted; `getConnection` returns its dead port forever. | -| B | Auth token missing | `start` throws `"No auth token available"` (coordinator router line 17). Renderer's `useEffect` doesn't retry. | -| C | Spawn fails | `pollHealthCheck` timeout, port collision, DB lock, env validation. Instance is deleted; renderer's `useEffect` doesn't retry. | -| D | activeOrganizationId mismatch | `start` is fired for every org in collections, but `getConnection` is queried only for the session's `activeOrganizationId`. If the two drift (stale session, sync race), host-service is up but the renderer is asking about the wrong org. | - -Buckets A and C/D account for the issue thread. B is the easiest to reproduce locally (clear keychain). - -## Ranked work - -Ranking criteria: **impact** (how many buckets the change unsticks) × **visibility** (does the user know it happened) ÷ **risk + effort**. - -### 1. Health-check during adoption *(shipped in PR #4395)* - -**Status:** ✅ Shipped. See `apps/desktop/src/main/lib/host-service-coordinator.ts:307-326` for the implementation and `apps/desktop/src/main/lib/host-service-coordinator.test.ts` for the regression test. - -Fixes **bucket A** structurally — every Cmd+R / app launch now health-checks the adopted manifest endpoint with a 2s timeout, SIGKILLs unresponsive pids, and falls through to a clean `spawn`. Invisible when adoption is healthy. - -### 2. Coordinator `reset(orgId)` + tRPC surface *(PR2 — unlocks 3 and 6)* - -**Files to edit:** -- `apps/desktop/src/main/lib/host-service-coordinator.ts` — add method on the class, alongside `restart` (currently at ~`148-156`). -- `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` — add a `reset` mutation, mirroring `restart` (currently at lines 37-47). Existing imports already include `loadToken` and `env.NEXT_PUBLIC_API_URL`. - -Add to `HostServiceCoordinator`: - -```ts -async reset( - organizationId: string, - config: SpawnConfig, - options: { wipeHostDb?: boolean } = {}, -): Promise { - // 1. Stop in-memory + send SIGTERM. No-op if no instance is tracked. - this.stop(organizationId); - - // 2. SIGKILL any pid the manifest still references (covers the - // "process up but unresponsive" case that motivated bucket A). - const manifest = readManifest(organizationId); - if (manifest && isProcessAlive(manifest.pid)) { - try { process.kill(manifest.pid, "SIGKILL"); } catch {} - } - - // 3. Remove manifest so adoption can't pick up the stale entry. - removeManifest(organizationId); - - // 4. Optional: archive host.db to host.db.broken- so we keep - // the file for debugging but the next spawn starts clean. - if (options.wipeHostDb) { - const dbPath = path.join(manifestDir(organizationId), "host.db"); - if (fs.existsSync(dbPath)) { - fs.renameSync(dbPath, `${dbPath}.broken-${Date.now()}`); - } - } - - // 5. Fresh start. - return this.start(organizationId, config); -} -``` - -Router shape: - -```ts -reset: publicProcedure - .input(orgInput.extend({ wipeHostDb: z.boolean().optional() })) - .mutation(async ({ input }) => { - const coordinator = getHostServiceCoordinator(); - const { token } = await loadToken(); - if (!token) { - throw new Error("No auth token available — user must be logged in"); - } - return coordinator.reset( - input.organizationId, - { authToken: token, cloudApiUrl: env.NEXT_PUBLIC_API_URL }, - { wipeHostDb: input.wipeHostDb }, - ); - }), -``` - -The default is **not** to wipe host.db — that's data loss. Settings UI (item 6) surfaces "Reset" (manifest only) and "Reset and clear local data" (also archives host.db) as separate actions. - -**Tests:** extend `host-service-coordinator.test.ts` (already mocks `host-service-manifest`, `host-service-utils`, etc.) with three cases: -1. `reset` with no `wipeHostDb` removes the manifest, calls `start`, returns a new connection. -2. `reset({ wipeHostDb: true })` renames `host.db` to `host.db.broken-` (mock `fs.renameSync`). -3. `reset` is safe to call when no instance is tracked (no manifest, no pid). - -### 3. Surface failures with a recovery state *(PR2 — the actual user fix)* - -**Pattern to follow:** the v2 workspace layout (`apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx:87-101`) **already** renders `` when `useRemoteHostStatus` reports `offline` for a remote host. The local-host branch is currently a gap — when `isLocal && activeHostUrl === null` we render the workspace anyway and downstream calls toast `"Host service not available"`. Mirror the remote pattern for the local case. - -**Files to create / edit:** -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts` (new). Returns `"skip" | "loading" | "stopped" | "ready"`. Status is `"stopped"` when the workspace is local (`workspace.hostId === machineId`), the provider has attempted `start` for the active org, and `activeHostUrl` has stayed null for ≥5s. Read status events from `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription`, plus the most recent `start`-mutation error (lift the mutation result up from `LocalHostServiceProvider` into context). -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx` (new). Two-button UI (mirror `WorkspaceHostOfflineState` styling so it doesn't feel grafted on): - - **Reset host service** — calls `electronTrpc.hostServiceCoordinator.reset.useMutation({ wipeHostDb: false })`. - - **Copy diagnostics** — copies a small text blob to clipboard: manifest path, manifest JSON (via a new `hostServiceCoordinator.getDiagnostics(orgId)` query if needed, or just print `~/.superset/host//`), last `start` mutation error, and the current status. Use the existing `select-text cursor-text` convention from `apps/desktop/AGENTS.md` so users can copy from the rendered text directly. -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx` — add a branch above the `hostStatus.status === "offline"` block that renders `WorkspaceLocalHostStoppedState` when `useLocalHostStatus(workspace)` returns `"stopped"`. -- `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` — expose the latest `start` mutation error and a "last attempt timestamp" on the context. (Currently the mutation result is discarded.) - -**Why this shape, not a toast or top-bar banner:** `WorkspaceHostOfflineState` already establishes the pattern of replacing the workspace content when the host is unreachable, and the failure here is sticky (not a one-shot error). Reusing the slot keeps the design system consistent and means the recovery action appears exactly where the user is trying to work. - -**Tests:** -- RTL test for `useLocalHostStatus`: mount with mocked context where `activeHostUrl` is null and last-attempt was >5s ago → assert `"stopped"`. <5s → `"loading"`. Remote workspace → `"skip"`. -- RTL test for `WorkspaceLocalHostStoppedState`: clicking **Reset host service** fires the `reset` mutation; clicking **Copy diagnostics** calls `navigator.clipboard.writeText` with text containing the manifest path. - -### 4. Renderer retry with exponential backoff *(PR2 — covers transient C/D)* - -**File:** `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` - -The current `useEffect` (lines 48-52) fires `start({orgId})` exactly once on mount and never again. Wrap that effect with retry-on-failure: - -- Subscribe to `electronTrpc.hostServiceCoordinator.onStatusChange.useSubscription` (already exported by the router at `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts:49-56`). -- When the active org's status transitions to `stopped` (or the most recent `start` mutation errors), schedule a re-`start` with backoff: 1s, 4s, 15s. After three attempts, stop retrying and let the recovery state from (3) take over. -- Reset the backoff counter when status reaches `running`. - -Pattern note: subscription must use the observable pattern, not async generators (`apps/desktop/AGENTS.md` covers why). The router already does this correctly. - -This fixes the **transient** half of bucket C (port race, slow disk on boot) and is the cheapest way to handle bucket B after a sign-in: as soon as `loadToken` returns a token, the next retry succeeds without the user clicking anything. - -**Tests:** RTL test using fake timers — push three `stopped` events into the mocked subscription, advance time, assert exactly three `start` mutations fire at 1s/4s/15s, and a `running` event after the second attempt resets the counter. - -### 5. Tray "Restart" enabled in `stopped` *(shipped in PR #4395)* - -**Status:** ✅ Shipped. See `apps/desktop/src/main/lib/tray/index.ts:171-176`. - -### 6. Settings → Advanced → "Reset host service" button *(PR3 — belt and suspenders)* - -**File:** new section in `apps/desktop/src/renderer/routes/_authenticated/settings/`. The Experimental panel is the right home (the V2 toggle already lives there). Reuses the tRPC `reset` mutation from (2). Two buttons: - -- **Reset host service** — calls `reset({ wipeHostDb: false })`. Safe; loses no data. -- **Reset and clear local host data** — calls `reset({ wipeHostDb: true })`. Behind a confirm dialog that names what's lost (terminal session history, host-side chat state) and what survives (workspaces, projects, chats — those live in `@superset/local-db`, not `host.db`). - -Lower priority than the in-context recovery from (3) because a user who can't load workspaces probably won't think to open Settings, but it's useful from a support perspective ("ask them to click this button"). - -## Out of scope - -- **The white-screen path.** Tracked separately at [#4396](https://github.com/superset-sh/superset/issues/4396) — `getHostId()` on macOS shells out to `ioreg` via `execFileSync` with no timeout, blocking the main event loop when subprocess spawning is blocked by sandboxing tools. Fix lands independently. -- **Auto-wiping `host.db`.** Data loss without a confirm dialog is non-negotiable. -- **Touching the `start` mutation semantics.** Other call sites depend on `start` being idempotent and resolving once. -- **Cross-org healing.** We retry/reset only the active org. Inactive orgs are out of scope; their host-service can be healed by switching to them. - -## Rollout - -- **Order:** (1)+(5) shipped as PR #4395. (2)+(3)+(4) ship together as PR2 — that's the user-facing recovery story. (6) tacks on as PR3. -- **Telemetry:** PostHog event `host_service_recovery` with `{ source: "banner" | "tray" | "settings", action: "reset" | "retry" | "wipe", succeeded: boolean }`. Wire from the renderer at each click-handler. Server-side, the coordinator already `console.log`s adoption health-check failures (PR1) — keep those. -- **Risk:** Renderer retry could mask a deterministic spawn failure under three rounds of backoff if the recovery state isn't surfaced clearly. Mitigate by making (3) and (4) ship together: the retry counter is visible in the recovery state, and after exhaustion the UI says "Auto-retry exhausted, click Reset." - -## Acceptance criteria - -For #4299 to be considered closed by this work: - -- [x] If the manifest points at a non-responsive pid, the next Cmd+R or app launch heals automatically (PR1). -- [x] Tray → Restart works in `stopped` state (PR1). -- [ ] A user reproducing the issue can recover without filesystem access, in ≤2 clicks from the blank workspace screen (PR2). -- [ ] If host-service spawn is failing for a persistent reason (e.g., port binding error), the recovery state names the reason in copyable text (PR2). -- [ ] No regression in startup time for the happy path (verified via dev-tools timing on a freshly-cloned `.superset` dir). - -## Handoff checklist for PR2 - -A new implementer picking this up should: - -1. **Read PR #4395** to see the patterns already in place: the coordinator test file structure (`apps/desktop/src/main/lib/host-service-coordinator.test.ts`) and the `pollHealthCheck` usage. -2. **Read `useRemoteHostStatus` + `WorkspaceHostOfflineState`** in `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/` — PR2's local-host equivalent should mirror these in shape and visual style. -3. **Implement in order:** (2) coordinator `reset` → tRPC `reset` → renderer mutation hook; then (4) retry loop in `LocalHostServiceProvider`; then (3) `useLocalHostStatus` + `WorkspaceLocalHostStoppedState` + layout branch. (2) before (3) so the recovery component has the mutation to call; (4) before (3) so the recovery state knows when to surface (after backoff exhausts). -4. **Verify each test can fail** — for every new test, mutate the implementation to confirm the test catches the regression. (Project convention; see `apps/desktop/AGENTS.md` and the existing coordinator test.) -5. **Lint and typecheck before pushing.** `bun run lint` (from repo root) treats Biome warnings as errors. `bun run typecheck` in `apps/desktop`. - -Files touched by PR2 (preview): - -- `apps/desktop/src/main/lib/host-service-coordinator.ts` (add `reset` method) -- `apps/desktop/src/main/lib/host-service-coordinator.test.ts` (extend with reset tests) -- `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts` (add `reset` mutation) -- `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` (expose error / last-attempt, add retry loop) -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/{useLocalHostStatus.ts,useLocalHostStatus.test.ts,index.ts}` (new) -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/{WorkspaceLocalHostStoppedState.tsx,WorkspaceLocalHostStoppedState.test.tsx,index.ts}` (new) -- `apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx` (add branch above the offline branch) diff --git a/apps/desktop/plans/done/20260510-host-service-recovery.md b/apps/desktop/plans/done/20260510-host-service-recovery.md new file mode 100644 index 00000000000..e18816eefb7 --- /dev/null +++ b/apps/desktop/plans/done/20260510-host-service-recovery.md @@ -0,0 +1,19 @@ +# Host-service recovery (#4299) — shipped + +**Issue:** [superset-sh/superset#4299](https://github.com/superset-sh/superset/issues/4299) — after Cmd+R the v2 right pane goes blank because the renderer keeps getting handed a dead host-service port. + +**Root cause:** `tryAdopt` only checked `isProcessAlive(pid)` + app-version. A live-but-not-serving host-service (hung on migrations, deadlocked, port no longer bound) got adopted as `running`, and `getConnection` returned its dead port forever — an absorbing state nothing climbed out of. + +## What shipped (PR #4395) + +- **Adopt health-check** — `tryAdopt` now `pollHealthCheck`s the manifest endpoint (2s cap) before registering an adopted instance; on failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean `spawn`. This is the fix. +- **`coordinator.reset(orgId)`** + `hostServiceCoordinator.reset` tRPC mutation — force-kill (SIGKILL on whatever pid the manifest names, even if untracked) + remove manifest + respawn. No UI caller yet; intended for a support escape hatch / future Settings button. +- **Tray "Restart" enabled in `stopped`** — was gated on `isRunning`, i.e. disabled exactly when restart helps; now disabled only while a start is in flight. +- **Coordinator logs through `electron-log`** — adoption health-check failures now land in `main.log` (were bare `console.log`, invisible in packaged builds). `log.warn` on non-ESRCH SIGKILL failures. + +## Considered, not shipped + +- **Full-screen "host stopped" recovery screen** in the v2-workspace layout — dropped. [#4430](https://github.com/superset-sh/superset/pull/4430) removed the analogous remote `WorkspaceHostOfflineState` ("render optimistically; downstream queries surface their own errors"); a local equivalent would swim against that. A non-blocking banner could be a future PR. +- **Renderer retry-with-backoff** in `LocalHostServiceProvider` — built, then dropped: heavier than the bug needs and invisible without the recovery screen. +- **`reset({ wipeHostDb })`** (archive `host.db` → `host.db.broken-`) + a Settings "Reset and clear local data" button — deferred until there's a caller. +- **The white-screen-before-Cmd+R variant** — tracked separately at [#4396](https://github.com/superset-sh/superset/issues/4396): `getHostId()` shells out to `ioreg` via `execFileSync` with no timeout, blocking the main event loop when subprocess spawning is sandboxed. From 359a8ca0ca7d82edfb220310b32b497ef16fd5eb Mon Sep 17 00:00:00 2001 From: Kiet Ho Date: Tue, 12 May 2026 00:39:45 -0700 Subject: [PATCH 9/9] test(desktop): fix stale 'no rename' in reset test title (wipeHostDb path removed) --- apps/desktop/src/main/lib/host-service-coordinator.test.ts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/apps/desktop/src/main/lib/host-service-coordinator.test.ts b/apps/desktop/src/main/lib/host-service-coordinator.test.ts index 95b1fa908e8..5ed168f1d1e 100644 --- a/apps/desktop/src/main/lib/host-service-coordinator.test.ts +++ b/apps/desktop/src/main/lib/host-service-coordinator.test.ts @@ -277,7 +277,7 @@ describe("HostServiceCoordinator.reset", () => { expect(conn.port).toBe(60000); }); - test("is safe when no manifest exists — no kill, no rename, still spawns", async () => { + test("is safe when no manifest exists — no kill, still spawns", async () => { manifestStore.current = null; const conn = await coordinator.reset("org-1", spawnConfig);