fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state by Kitenite · Pull Request #4395 · superset-sh/superset

Kitenite · 2026-05-11T01:02:50Z

Summary

Two small fixes that together let users (and the app itself) recover from a wedged host-service without filesystem surgery. Targets #4299, where a host-service whose pid is alive but no longer serving on the recorded port leaves every V2 surface stuck on "Host service not available" — including the delete workspace action, so users can't even clean up.

Health-check during adoption (host-service-coordinator.ts). tryAdopt now pollHealthChecks the manifest endpoint (2s timeout) before registering the adopted host-service as running. On failure: SIGKILL the pid, remove the manifest, return null so start falls through to a clean spawn. This is the structural fix — before this PR, the renderer's getConnection kept handing out a dead port forever.
Tray "Restart" enabled in stopped (tray/index.ts). Previously enabled: isRunning — disabled exactly when restart would help. Now disabled only while a start is in flight, to avoid racing the pending start.
Plan doc (plans/20260510-1430-host-service-recovery.md) covers the broader recovery work — in-app reset, recovery banner, retry-with-backoff — for follow-up PRs. This PR is items 1 and 5 from that plan.

The "screen blank before Cmd+R" half of #4299 is a separate bug — this PR makes the recovery work; the root cause for why host-service crashes there needs its own investigation.

Test plan

bun test src/main/lib/host-service-coordinator.test.ts — new file, 4 tests: healthy adoption (no respawn), unhealthy adoption → SIGKILL + respawn, ESRCH-tolerant kill, existing app-version-mismatch path unchanged.
bun run lint clean
bun run typecheck clean
Manual: with desktop running, pkill -9 -f host-service.js while leaving the manifest in place, then Cmd+R the renderer — right pane should recover within ~2s (previously stuck indefinitely).
Manual: with host-service in stopped state (e.g., immediately after pkill + Tray → Stop), Tray submenu "Restart" should be enabled and successfully restart host-service.

Summary by cubic

Adds self-healing for the desktop host-service via an adoption health check and a new reset endpoint. Stuck “Host service not available” cases now recover without manual cleanup; tray Restart works when stopped.

New Features
- Coordinator reset() and electronTrpc.hostServiceCoordinator.reset: read the manifest pid before stop(), SIGKILL stale pids, remove the manifest, then respawn.
Bug Fixes
- Adoption now health-checks the manifest endpoint (2s timeout); on failure it SIGKILLs the pid (ESRCH tolerated), removes the manifest, and spawns fresh. Coordinator logs route through electron-log for visibility in packaged builds.
- Tray “Restart” is enabled when status is stopped and disabled only while starting.

^{Written for commit 359a8ca. Summary will update on new commits.}

Summary by CodeRabbit

New Features
- Added host service reset capability with option to clear local data.
- Implemented automatic retry with exponential backoff for host service startup failures.
- Enabled restart option in tray menu when service is stopped.
Bug Fixes
- Added health checks to prevent adoption of unresponsive host processes.

coderabbitai · 2026-05-11T01:03:06Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements host-service recovery through adoption health-checking, a forceful reset capability, tRPC exposure, exponential retry/backoff for the active organization, tray restart gating, comprehensive tests, and a detailed plan document outlining the multi-PR rollout strategy.

Changes

Host Service Recovery Implementation

Layer / File(s)	Summary
Recovery Plan & Strategy `apps/desktop/plans/20260510-1430-host-service-recovery.md`	Documents six work items (adoption health-check, coordinator reset, recovery UI, renderer retry, tray gating, Settings UI), rollout sequencing, telemetry requirements, acceptance criteria for closing `#4299`, and implementation checklist with file preview.
Coordinator Health-Check & Reset Implementation `apps/desktop/src/main/lib/host-service-coordinator.ts`, `apps/desktop/src/main/lib/host-service-coordinator.test.ts`	Adds `ADOPT_HEALTH_CHECK_TIMEOUT_MS` (2s), adoption health-check via `pollHealthCheck` with SIGKILL+manifest removal on failure, new `reset(organizationId, config, {wipeHostDb?})` method, optional host.db archival to `host.db.broken-<timestamp>`, and logging migration to `electron-log`. Comprehensive Bun test suite covers healthy/failed/version-mismatch adoption and reset scenarios with and without host.db archival.
tRPC Reset Route `apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts`	Exposes new `reset` mutation accepting `organizationId` and optional `wipeHostDb`, authenticating via `loadToken()` and invoking coordinator reset.
Tray Restart Menu Gating `apps/desktop/src/main/lib/tray/index.ts`	Updates "Restart" menu item to enable when service is not in `starting` state, allowing restart from both `running` and `stopped` states with explanatory comments.
Renderer Auto-Retry & Exponential Backoff `apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`	Adds exponential retry/backoff with `RETRY_DELAYS_MS`, scoped to active organization only. Introduces refs for attempt tracking and timer management, `scheduleRetry` bounded by attempt count, `fireStart` wrapper scheduling retries on error, org-change reset, and `stopped` status re-entry to retry chain.

Sequence Diagram(s)

sequenceDiagram
  participant Renderer
  participant HostServiceCoordinator
  participant ProcessUtils
  participant ManifestStorage
  participant ProcessCheck

  Renderer->>HostServiceCoordinator: tryAdopt(manifest)
  HostServiceCoordinator->>ProcessUtils: spawn or check existing PID
  HostServiceCoordinator->>ProcessCheck: pollHealthCheck(endpoint, timeout=2s)
  alt Health Check Passes
    ProcessCheck-->>HostServiceCoordinator: healthy
    HostServiceCoordinator-->>Renderer: manifest endpoint + secret
  else Health Check Fails
    ProcessCheck-->>HostServiceCoordinator: timeout or error
    HostServiceCoordinator->>ProcessUtils: SIGKILL PID
    HostServiceCoordinator->>ManifestStorage: removeManifest()
    HostServiceCoordinator->>ProcessUtils: spawn new host
    HostServiceCoordinator-->>Renderer: new connection details
  end

  Renderer->>HostServiceCoordinator: reset(orgId, config, {wipeHostDb?})
  HostServiceCoordinator->>ProcessUtils: stop instance
  HostServiceCoordinator->>ManifestStorage: readManifest()
  HostServiceCoordinator->>ProcessUtils: SIGKILL manifest PID
  HostServiceCoordinator->>ManifestStorage: removeManifest()
  alt wipeHostDb enabled
    HostServiceCoordinator->>ProcessUtils: archive host.db to host.db.broken-*
  end
  HostServiceCoordinator->>ProcessUtils: spawn new host
  HostServiceCoordinator-->>Renderer: new connection details

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through startup's maze,
With health checks bright and reset's glaze.
Backoff exponential, retries dance,
When hosts go dark, we get a chance! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the two main code fixes: health-checking during host-service adoption and enabling the tray Restart button in stopped state, directly addressing the core issues described in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, following the template with all required sections completed including a clear summary, related issues link, type of change, detailed testing plan, and additional context.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dour-table

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-05-11T01:07:05Z

Greptile Summary

This PR ships two targeted fixes for the host-service recovery path described in #4299: a health-check gate during adoption, and enabling the tray Restart button in all non-inflight states.

tryAdopt health check — before registering an adopted manifest as running, the coordinator now calls pollHealthCheck with a 2 s cap. On failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean spawn, breaking the dead-port-forever loop.
Tray Restart gate — enabled: isRunning is replaced with enabled: status !== "starting", so the item is clickable whenever no start is already in flight.
Four new unit tests cover the healthy adoption, unhealthy adoption (SIGKILL + respawn), ESRCH-tolerant kill, and the existing app-version mismatch path.

Confidence Score: 4/5

Safe to merge; both runtime changes are targeted and low-blast-radius, with issues confined to the test file.

The coordinator and tray changes are small, well-commented, and backed by unit tests. The test file has a fragile cleanup pattern — process.kill is restored inside each test body rather than in a shared afterEach, so a mid-test assertion failure can corrupt mock state for later tests.

apps/desktop/src/main/lib/host-service-coordinator.test.ts — cleanup pattern should be reviewed before the test suite grows.

Important Files Changed

Filename	Overview
apps/desktop/src/main/lib/host-service-coordinator.ts	Adds a 2 s `pollHealthCheck` gate inside `tryAdopt` before registering an adopted process as running; on failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean spawn — directly fixing the dead-port-forever bug in #4299.
apps/desktop/src/main/lib/tray/index.ts	Changes Restart menu item from `enabled: isRunning` to `enabled: status !== "starting"`, so it stays clickable in every non-inflight state rather than only when the service is fully up.
apps/desktop/src/main/lib/host-service-coordinator.test.ts	New unit test file with 4 adoption health-check scenarios; logic is correct but `process.kill` is restored inline in each test body rather than in an `afterEach`, making the suite fragile when assertions fail early. One test title also misnames the errno (ENOENT → ESRCH).
apps/desktop/plans/20260510-1430-host-service-recovery.md	Planning document only; describes the broader recovery work (items 1 and 5 ship in this PR). No runtime changes.

Sequence Diagram

sequenceDiagram
    participant R as Renderer
    participant C as HostServiceCoordinator
    participant FS as Manifest / FS
    participant P as Host-Service PID

    R->>C: start(orgId, config)
    C->>C: existing running? → return cached
    C->>C: pendingStarts? → deduplicate
    C->>FS: readAndValidateManifest(orgId)
    alt No manifest / pid dead
        FS-->>C: null
        C->>C: spawn()
    else Manifest exists, pid alive
        C->>P: pollHealthCheck(endpoint, authToken, 2000ms)
        alt Healthy
            P-->>C: true
            C->>C: register as running
            C-->>R: Connection(port, secret)
        else Unhealthy (new path)
            P-->>C: false / timeout
            C->>P: SIGKILL(pid)
            C->>FS: removeManifest(orgId)
            C->>C: spawn()
            C-->>R: Connection(freshPort, freshSecret)
        end
    end

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:113-116
**Fragile per-test `process.kill` restore — use `afterEach`**

`process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional.

### Issue 2 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:151
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised.

```suggestion
	test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
```

_{Reviews (1): Last reviewed commit: "fix(desktop): heal stale host-service ad..." | Re-trigger Greptile}

greptile-apps · 2026-05-11T01:07:09Z

+			secret: "fresh-secret",
+			machineId: "host-1",
+		}));
+		(coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock;


Fragile per-test process.kill restore — use afterEach

process.kill is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's expect(conn.port).toBe(60000) fails), process.kill is left in its overridden state for the rest of the suite. The next beforeEach then captures the wrong function as originalKill, so every subsequent test either accumulates into a stale killedPids array or — worse — inherits the throwing mock from test 3. Moving the restore to a single afterEach(() => { (process as any).kill = originalKill; }) makes the cleanup unconditional.

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts Line: 113-116 Comment: **Fragile per-test `process.kill` restore — use `afterEach`** `process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-11T01:07:10Z

+		(process as unknown as { kill: typeof process.kill }).kill = originalKill;
+	});
+
+	test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => {


The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is ESRCH (no such process). These are different errno values; the title should match the actual error code being exercised.

Suggested change

test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => {

test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {

Prompt To Fix With AI

This is a comment left during a code review. Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts Line: 151 Comment: The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised. ```suggestion test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => { ``` How can I resolve this? If you propose a fix, please make it concise.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)
151-165: 💤 Low value

ESRCH error simulation could be more accurate.

Line 155 throws new Error("ESRCH"), but real Node.js ESRCH errors from process.kill() have a code property (e.g., error.code === 'ESRCH'). While the current test works because the coordinator's try-catch on line 322-324 uses an empty catch block, a more accurate simulation would better document the expected error structure.
♻️ Optional: More accurate ESRCH simulation
 		(process as unknown as { kill: typeof process.kill }).kill = (() => {
-			throw new Error("ESRCH");
+			const error = new Error("No such process") as NodeJS.ErrnoException;
+			error.code = "ESRCH";
+			error.errno = -3;
+			error.syscall = "kill";
+			throw error;
 		}) as typeof process.kill;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` around lines 151
- 165, The test "swallows SIGKILL ENOENT (pid already gone) and still respawns"
currently simulates an ESRCH by throwing new Error("ESRCH"); change the mocked
process.kill to throw an Error-like object with a code property (e.g., { name:
'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it matches real Node.js
errors; update the assignment to (process as unknown as { kill: typeof
process.kill }).kill in this test (and restore originalKill as before) so
coordinator.start's try/catch that checks error.code will behave the same as in
production.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 91-117: Add an afterEach cleanup that restores the original
process.kill by setting (process as unknown as { kill: typeof process.kill
}).kill = originalKill and optionally clearing killedPids, ensuring restoration
happens even if a test fails; place this afterEach alongside the existing
beforeEach that assigns originalKill and replaces process.kill (look for the
beforeEach block that sets originalKill, killedPids, and overrides
process.kill). Remove the per-test restore calls that reassign process.kill at
the end of individual tests (the lines that set (process as unknown as { kill:
typeof process.kill }).kill = originalKill) so restoration is centralized in the
new afterEach. Ensure the afterEach runs in the same test scope that constructs
the HostServiceCoordinator and sets spawnMock so teardown order is correct.

---

Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 151-165: The test "swallows SIGKILL ENOENT (pid already gone) and
still respawns" currently simulates an ESRCH by throwing new Error("ESRCH");
change the mocked process.kill to throw an Error-like object with a code
property (e.g., { name: 'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it
matches real Node.js errors; update the assignment to (process as unknown as {
kill: typeof process.kill }).kill in this test (and restore originalKill as
before) so coordinator.start's try/catch that checks error.code will behave the
same as in production.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ff1651c-45fb-4137-bc3e-6514ea82d308

📥 Commits

Reviewing files that changed from the base of the PR and between 493c48f and c76eb36.

📒 Files selected for processing (4)

apps/desktop/plans/20260510-1430-host-service-recovery.md
apps/desktop/src/main/lib/host-service-coordinator.test.ts
apps/desktop/src/main/lib/host-service-coordinator.ts
apps/desktop/src/main/lib/tray/index.ts

cubic-dev-ai

1 issue found across 4 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">

<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:324">
P2: Handle `process.kill` failures explicitly instead of silently swallowing them so unhealthy-adoption recovery failures are observable.

(Based on your team's feedback about handling process/async failures without silent catches.) [FEEDBACK_USED]</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

…R2 handoff - Status badge updated; PR #4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked #4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch.

`getRawMachineId` invoked `execFileSync` without a `timeout`, so a wedged `ioreg` or `reg` subprocess could freeze the synchronous caller. The renderer's authenticated tree is gated on the `getMachineId` tRPC query, which means a hung subprocess on the main process blanks the entire window (white screen, no error boundary) until the hang clears. Pass `timeout: 1500` to both platform-specific subprocess calls. The existing `catch` block already routes to the hostname/homedir fallback, so a hang now degrades gracefully instead of taking down the main event loop. The result is memoized either way via `cachedMachineId`, so the fallback path won't retry the hang on subsequent calls. Refs #4395, which fixes the recovery half of the original report. Closes #4396

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)
105-105: 💤 Low value

Temp dirs from mkdtempSync are never cleaned up.

Each test creates a new os.tmpdir()/hsc-test-* directory and never removes it. Over many local/CI runs this accumulates. Add an afterEach (or extend the one above) to fs.rmSync(testManifestRoot, { recursive: true, force: true }).

Also applies to: 208-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` at line 105, The
tests create temporary dirs with fs.mkdtempSync assigned to testManifestRoot
(and the other mkdtempSync use around the second occurrence) but never remove
them; add or extend the existing afterEach block to call
fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp
directory created by testManifestRoot is cleaned up after each test run; ensure
the cleanup runs for both places where mkdtempSync is used and that
testManifestRoot is in scope for the afterEach.
apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx (1)
147-155: 💤 Low value

Initial-start effect re-fires on every dep change — confirm coordinator no-op cost is acceptable.

This effect re-runs whenever organizationIds or activeOrganizationId change, calling startHostService({ organizationId }) for every org each time (and fireStart for the active one, which also calls setLastAttemptAt(Date.now())). The coordinator's start is idempotent for running instances, but each cycle still incurs a tRPC round-trip per org. For users with many orgs, switching active orgs N times triggers N×|orgs| start mutations.

Consider tracking which organizationIds have already been kicked off (e.g., a Set in a ref) and only starting newly-appeared ones.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`
around lines 147 - 155, The effect over-eagerly calls startHostService and
fireStart on every dependency change, causing redundant tRPC starts for
already-started organizations; modify the useEffect that iterates
organizationIds/activeOrganizationId so it tracks which organizationIds have
already been started (e.g., a useRef to a Set<string>) and only call
startHostService({ organizationId }) or fireStart(organizationId) for ids not
present in that Set, adding ids to the Set after invoking the start; ensure
fireStart still triggers setLastAttemptAt(Date.now()) for the active org and
keep dependency list unchanged.
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx (1)

14-14: 💤 Low value

Expose the actual manifest directory path via tRPC instead of hardcoding.

MANIFEST_PATH_HINT is hardcoded to "~/.superset/host/<orgId>/" in the renderer, but the real manifest directory is calculated in the main process by manifestDir(organizationId) at apps/desktop/src/main/lib/host-service-manifest.ts, which respects SUPERSET_HOME_DIR (including custom overrides). When users with non-default home directories copy diagnostics text, they'll see the wrong path. Add a tRPC query to expose manifestDir(organizationId) from the main process so the renderer can display the actual path.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx`:
- Around line 33-47: The onCopyDiagnostics handler currently fires toast.success
immediately after calling copyToClipboard without awaiting its Promise; change
onCopyDiagnostics to await the copyToClipboard(lines.join("\n")) call (or use
.then/.catch) and only show toast.success if it resolves, and show toast.error
(with a helpful message) if it rejects; reference the onCopyDiagnostics
function, copyToClipboard invocation, and the toast.success call so you replace
the immediate toast with success/error handling tied to the async result.

---

Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Line 105: The tests create temporary dirs with fs.mkdtempSync assigned to
testManifestRoot (and the other mkdtempSync use around the second occurrence)
but never remove them; add or extend the existing afterEach block to call
fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp
directory created by testManifestRoot is cleaned up after each test run; ensure
the cleanup runs for both places where mkdtempSync is used and that
testManifestRoot is in scope for the afterEach.

In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 147-155: The effect over-eagerly calls startHostService and
fireStart on every dependency change, causing redundant tRPC starts for
already-started organizations; modify the useEffect that iterates
organizationIds/activeOrganizationId so it tracks which organizationIds have
already been started (e.g., a useRef to a Set<string>) and only call
startHostService({ organizationId }) or fireStart(organizationId) for ids not
present in that Set, adding ids to the Set after invoking the start; ensure
fireStart still triggers setLastAttemptAt(Date.now()) for the active org and
keep dependency list unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed48fa88-309d-42eb-b8e6-20da570ef670

📥 Commits

Reviewing files that changed from the base of the PR and between 12c4aa5 and 7035566.

📒 Files selected for processing (9)

apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts
apps/desktop/src/main/lib/host-service-coordinator.test.ts
apps/desktop/src/main/lib/host-service-coordinator.ts
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx
apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx

✅ Files skipped from review due to trivial changes (2)

apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts

cubic-dev-ai

2 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">

<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:179">
P1: `reset()` calls `stop()` before reading the manifest, so active instances lose their manifest first and never reach the SIGKILL path. This can leave a wedged host-service alive when reset is expected to force-kill it.</violation>
</file>

<file name="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx">

<violation number="1" location="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx:45">
P2: Await or handle `copyToClipboard` before showing success. Right now failures still show a success toast and leave a rejected promise unhandled.

(Based on your team's feedback about handling async rejections explicitly.) [FEEDBACK_USED]</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.}

…n stopped state Adopt only host-services that respond to a health check (2s timeout). A live pid no longer serving on the recorded port was being adopted as "running", leaving every V2 surface stuck on "Host service not available" with no in-app recovery (#4299). Also enable tray "Restart" in stopped state — disabling it precisely when users need restart was backwards. Disabled now only while a start is in flight, to avoid racing the pending start. Adds a plan doc covering broader recovery work (banner, in-app reset, retry-with-backoff) for follow-up PRs.

…R2 handoff - Status badge updated; PR #4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked #4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch.

PR2 of the recovery plan (#4299): when host-service can't be reached the provider now auto-retries on a 1s/4s/15s backoff, and once retries exhaust the v2-workspace layout swaps in a recovery screen with Reset and Copy-diagnostics actions instead of a blank pane. - coordinator: new reset() forcibly SIGKILLs any pid in the manifest, optionally archives host.db to host.db.broken-<ts>, then re-spawns - tRPC: hostServiceCoordinator.reset mutation extends orgInput with wipeHostDb - LocalHostServiceProvider: wraps start with exponential-backoff retry for the active org, resets on running, exposes lastStartError / lastAttemptAt / retryAttempt / retryExhausted on context - new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState pattern; layout branches on it above the offline branch

The coordinator's narration (including PR1's adoption health-check failures: "Adopted pid=… did not respond …, killing and respawning") went to bare console.log, which never reaches main.log in a packaged build. Route it through electron-log like the auto-updater does so the next #4299-style report has breadcrumbs.

capy-ai

Added 1 comment

capy-ai · 2026-05-12T05:48:33Z

+		config: SpawnConfig,
+		options: { wipeHostDb?: boolean } = {},
+	): Promise<Connection> {
+		this.stop(organizationId);


[🟡 Medium] [🔵 Bug]

reset() calls stop() before reading the manifest, but stop() removes the manifest immediately. That means when an instance is currently tracked, the subsequent manifest read often returns null, so the new SIGKILL fallback never runs even though this method is explicitly intended to hard-kill wedged processes that may ignore SIGTERM. In that case reset can leave the old process alive and then start a second process, reintroducing recovery flakiness.

// apps/desktop/src/main/lib/host-service-coordinator.ts this.stop(organizationId); const manifest = readManifest(organizationId); if (manifest && isProcessAlive(manifest.pid)) {

Capture the manifest (or at least pid) before calling stop(), then perform the SIGKILL check against that captured pid after stop to guarantee the hard-kill path is actually exercised for tracked stale processes.

#4430 just removed the equivalent full-screen "host offline" gate for remote hosts ("render optimistically; downstream queries surface their own errors"). Follow that direction for the local case too: remove WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout branch. What stays from PR2: - coordinator reset() + hostServiceCoordinator.reset tRPC mutation — useful for the planned Settings escape hatch and for support - LocalHostServiceProvider auto-retries start() for the active org on a 1s/4s/15s backoff (and on a post-running "stopped" status event), resetting the counter when status reaches "running" — transient spawn failures self-heal without any UI A non-blocking recovery surface (banner, not takeover) can be a separate PR if we still want one.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 120-128: The effect currently re-runs startHostService for every
org whenever organizationIds is rebuilt; instead compute a diff against the
previous organizationIds and only call startHostService for newly added IDs, and
handle active org changes separately: when activeOrganizationId changes call
fireStart(activeOrganizationId) (not on every rebuild). Implement this by
storing previous organizationIds in a ref (or usePrevious hook), compute
addedIds = organizationIds - prevOrganizationIds and call startHostService({
organizationId }) only for addedIds, and also guard startHostService/firerStart
against concurrent runs by tracking in-flight/started IDs in a Set so duplicate
starts are skipped. Ensure to update the prev ref and in-flight Set
appropriately after successful starts.
- Around line 74-85: scheduleRetry currently increments attemptRef.current and
replaces the timer even if a retry is already queued, allowing multiple callers
to consume backoff slots; modify scheduleRetry (the function using attemptRef,
retryTimerRef, activeOrgRef, RETRY_DELAYS_MS, fireStartRef, and clearRetryTimer)
to first bail out if retryTimerRef.current is non-null (i.e., a retry is already
scheduled) or if organizationId !== activeOrgRef.current, and only then proceed
to read delay, increment attemptRef.current, clear/set the timer and call
fireStartRef; this prevents duplicate queued retries from burning through
backoff slots.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45485848-89ea-4126-8372-4d502608dbb4

📥 Commits

Reviewing files that changed from the base of the PR and between f9315a0 and 608b6ea.

📒 Files selected for processing (1)

apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx

coderabbitai · 2026-05-12T06:26:32Z

 	useEffect(() => {
 		for (const organizationId of organizationIds) {
-			startHostService({ organizationId });
+			if (organizationId === activeOrganizationId) {
+				fireStart(organizationId);
+			} else {
+				startHostService({ organizationId });
+			}
 		}
-	}, [organizationIds, startHostService]);
+	}, [organizationIds, activeOrganizationId, fireStart, startHostService]);


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't replay startHostService for every org on unrelated live-query updates.

This effect runs whenever organizationIds is rebuilt from the live query, so any organizations-table refresh can reissue startHostService for all orgs even when the ID set is unchanged. For the active org that bypasses the new backoff window; for all orgs it can overlap an in-flight/pending start with another start mutation. Consider separating “active org changed” from “new org discovered” or diffing against the previous ID set before starting again.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` around lines 120 - 128, The effect currently re-runs startHostService for every org whenever organizationIds is rebuilt; instead compute a diff against the previous organizationIds and only call startHostService for newly added IDs, and handle active org changes separately: when activeOrganizationId changes call fireStart(activeOrganizationId) (not on every rebuild). Implement this by storing previous organizationIds in a ref (or usePrevious hook), compute addedIds = organizationIds - prevOrganizationIds and call startHostService({ organizationId }) only for addedIds, and also guard startHostService/firerStart against concurrent runs by tracking in-flight/started IDs in a Set so duplicate starts are skipped. Ensure to update the prev ref and in-flight Set appropriately after successful starts.

- reset(): read the manifest pid *before* stop() removes it, so a wedged tracked instance gets SIGTERM (from stop) then SIGKILL (escalation) instead of only SIGTERM. (cubic P1) - log on SIGKILL failure in reset() and the adoption health-check path — ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2) - tests: restore process.kill in afterEach (unconditional) and rm the mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the errno actually simulated); add a reset test for the tracked-instance SIGTERM→SIGKILL escalation. (greptile/coderabbit)

capy-ai · 2026-05-12T07:00:52Z

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

…bits Drop the speculative pieces: - reset({ wipeHostDb }) + host.db archival — no caller; the Settings "clear local data" button that would use it is out of scope here - LocalHostServiceProvider retry-with-backoff loop — heavier than the #4299 fix needs and invisible without the recovery UI (already dropped); reverted to origin/main What remains: - adopt health-check (the actual #4299 fix) + tray Restart-in-stopped - coordinator reset() (manifest-only force-kill + respawn) + tRPC mutation — small, tested, ready for a support escape hatch - coordinator narration through electron-log

cubic-dev-ai

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx">

<violation number="1">
P1: This effect now does a fire-and-forget start with no retry/recovery path, so a transient start failure or later process exit can leave the active org permanently stopped until the user manually restarts.</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.}

…d summary, move to plans/done The original 220-line plan ranked six items and described a recovery screen + retry loop that didn't ship. Replace it with a tight summary of what landed (adopt health-check, reset(), tray fix, electron-log) and what was considered and deferred, and move it under plans/done per the repo convention.

…path removed)

github-actions · 2026-05-12T08:03:35Z

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

✅ Neon database branch

Thank you for your contribution! 🎉

…n stopped state (superset-sh#4395) * fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state Adopt only host-services that respond to a health check (2s timeout). A live pid no longer serving on the recorded port was being adopted as "running", leaving every V2 surface stuck on "Host service not available" with no in-app recovery (superset-sh#4299). Also enable tray "Restart" in stopped state — disabling it precisely when users need restart was backwards. Disabled now only while a start is in flight, to avoid racing the pending start. Adds a plan doc covering broader recovery work (banner, in-app reset, retry-with-backoff) for follow-up PRs. * docs(desktop): mark PR1 shipped in host-service recovery plan, prep PR2 handoff - Status badge updated; PR superset-sh#4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked superset-sh#4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch. * fix(desktop): add in-app host-service reset + retry loop + recovery UI PR2 of the recovery plan (superset-sh#4299): when host-service can't be reached the provider now auto-retries on a 1s/4s/15s backoff, and once retries exhaust the v2-workspace layout swaps in a recovery screen with Reset and Copy-diagnostics actions instead of a blank pane. - coordinator: new reset() forcibly SIGKILLs any pid in the manifest, optionally archives host.db to host.db.broken-<ts>, then re-spawns - tRPC: hostServiceCoordinator.reset mutation extends orgInput with wipeHostDb - LocalHostServiceProvider: wraps start with exponential-backoff retry for the active org, resets on running, exposes lastStartError / lastAttemptAt / retryAttempt / retryExhausted on context - new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState pattern; layout branches on it above the offline branch * fix(desktop): route host-service-coordinator logs through electron-log The coordinator's narration (including PR1's adoption health-check failures: "Adopted pid=… did not respond …, killing and respawning") went to bare console.log, which never reaches main.log in a packaged build. Route it through electron-log like the auto-updater does so the next superset-sh#4299-style report has breadcrumbs. * fix(desktop): drop local-host recovery screen; keep reset + auto-retry superset-sh#4430 just removed the equivalent full-screen "host offline" gate for remote hosts ("render optimistically; downstream queries surface their own errors"). Follow that direction for the local case too: remove WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout branch. What stays from PR2: - coordinator reset() + hostServiceCoordinator.reset tRPC mutation — useful for the planned Settings escape hatch and for support - LocalHostServiceProvider auto-retries start() for the active org on a 1s/4s/15s backoff (and on a post-running "stopped" status event), resetting the counter when status reaches "running" — transient spawn failures self-heal without any UI A non-blocking recovery surface (banner, not takeover) can be a separate PR if we still want one. * fix(desktop): address review feedback on host-service reset + tests - reset(): read the manifest pid *before* stop() removes it, so a wedged tracked instance gets SIGTERM (from stop) then SIGKILL (escalation) instead of only SIGTERM. (cubic P1) - log on SIGKILL failure in reset() and the adoption health-check path — ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2) - tests: restore process.kill in afterEach (unconditional) and rm the mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the errno actually simulated); add a reset test for the tracked-instance SIGTERM→SIGKILL escalation. (greptile/coderabbit) * refactor(desktop): trim host-service recovery PR to the load-bearing bits Drop the speculative pieces: - reset({ wipeHostDb }) + host.db archival — no caller; the Settings "clear local data" button that would use it is out of scope here - LocalHostServiceProvider retry-with-backoff loop — heavier than the superset-sh#4299 fix needs and invisible without the recovery UI (already dropped); reverted to origin/main What remains: - adopt health-check (the actual superset-sh#4299 fix) + tray Restart-in-stopped - coordinator reset() (manifest-only force-kill + respawn) + tRPC mutation — small, tested, ready for a support escape hatch - coordinator narration through electron-log * docs(desktop): replace host-service recovery plan with a short shipped summary, move to plans/done The original 220-line plan ranked six items and described a recovery screen + retry loop that didn't ship. Replace it with a tight summary of what landed (adopt health-check, reset(), tray fix, electron-log) and what was considered and deferred, and move it under plans/done per the repo convention. * test(desktop): fix stale 'no rename' in reset test title (wipeHostDb path removed)

Kitenite mentioned this pull request May 11, 2026

Renderer can white-screen if getHostId's ioreg/execFileSync blocks on macOS #4396

Open

2 tasks

greptile-apps Bot reviewed May 11, 2026

View reviewed changes

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

Comment thread apps/desktop/src/main/lib/host-service-coordinator.test.ts

cubic-dev-ai Bot reviewed May 11, 2026

View reviewed changes

Comment thread apps/desktop/src/main/lib/host-service-coordinator.ts Outdated

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

Comment thread ...rd/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx Outdated

cubic-dev-ai Bot reviewed May 11, 2026

View reviewed changes

Comment thread apps/desktop/src/main/lib/host-service-coordinator.ts

Comment thread ...rd/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx Outdated

Kitenite added 4 commits May 11, 2026 22:42

Kitenite force-pushed the dour-table branch from 7035566 to f9315a0 Compare May 12, 2026 05:44

capy-ai Bot reviewed May 12, 2026

View reviewed changes

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

cubic-dev-ai Bot reviewed May 12, 2026

View reviewed changes

Kitenite added 2 commits May 12, 2026 00:39

test(desktop): fix stale 'no rename' in reset test title (wipeHostDb …

359a8ca

…path removed)

Kitenite merged commit 8057272 into main May 12, 2026
9 of 10 checks passed

Kitenite deleted the dour-table branch May 12, 2026 08:03

Kitenite mentioned this pull request May 12, 2026

[codex] test(desktop): isolate host-service coordinator mocks #4448

Merged

	test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => {
	test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {

Conversation

Kitenite commented May 11, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by cubic

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented May 11, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

capy-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

capy-ai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

capy-ai Bot commented May 12, 2026

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented May 12, 2026

🧹 Preview Cleanup Complete

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Kitenite commented May 11, 2026 •

edited by cubic-dev-ai Bot

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading