Skip to content

fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state#4395

Merged
Kitenite merged 9 commits into
mainfrom
dour-table
May 12, 2026
Merged

fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state#4395
Kitenite merged 9 commits into
mainfrom
dour-table

Conversation

@Kitenite
Copy link
Copy Markdown
Collaborator

@Kitenite Kitenite commented May 11, 2026

Summary

Two small fixes that together let users (and the app itself) recover from a wedged host-service without filesystem surgery. Targets #4299, where a host-service whose pid is alive but no longer serving on the recorded port leaves every V2 surface stuck on "Host service not available" — including the delete workspace action, so users can't even clean up.

  • Health-check during adoption (host-service-coordinator.ts). tryAdopt now pollHealthChecks the manifest endpoint (2s timeout) before registering the adopted host-service as running. On failure: SIGKILL the pid, remove the manifest, return null so start falls through to a clean spawn. This is the structural fix — before this PR, the renderer's getConnection kept handing out a dead port forever.
  • Tray "Restart" enabled in stopped (tray/index.ts). Previously enabled: isRunning — disabled exactly when restart would help. Now disabled only while a start is in flight, to avoid racing the pending start.
  • Plan doc (plans/20260510-1430-host-service-recovery.md) covers the broader recovery work — in-app reset, recovery banner, retry-with-backoff — for follow-up PRs. This PR is items 1 and 5 from that plan.

The "screen blank before Cmd+R" half of #4299 is a separate bug — this PR makes the recovery work; the root cause for why host-service crashes there needs its own investigation.

Test plan

  • bun test src/main/lib/host-service-coordinator.test.ts — new file, 4 tests: healthy adoption (no respawn), unhealthy adoption → SIGKILL + respawn, ESRCH-tolerant kill, existing app-version-mismatch path unchanged.
  • bun run lint clean
  • bun run typecheck clean
  • Manual: with desktop running, pkill -9 -f host-service.js while leaving the manifest in place, then Cmd+R the renderer — right pane should recover within ~2s (previously stuck indefinitely).
  • Manual: with host-service in stopped state (e.g., immediately after pkill + Tray → Stop), Tray submenu "Restart" should be enabled and successfully restart host-service.

Summary by cubic

Adds self-healing for the desktop host-service via an adoption health check and a new reset endpoint. Stuck “Host service not available” cases now recover without manual cleanup; tray Restart works when stopped.

  • New Features

    • Coordinator reset() and electronTrpc.hostServiceCoordinator.reset: read the manifest pid before stop(), SIGKILL stale pids, remove the manifest, then respawn.
  • Bug Fixes

    • Adoption now health-checks the manifest endpoint (2s timeout); on failure it SIGKILLs the pid (ESRCH tolerated), removes the manifest, and spawns fresh. Coordinator logs route through electron-log for visibility in packaged builds.
    • Tray “Restart” is enabled when status is stopped and disabled only while starting.

Written for commit 359a8ca. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added host service reset capability with option to clear local data.
    • Implemented automatic retry with exponential backoff for host service startup failures.
    • Enabled restart option in tray menu when service is stopped.
  • Bug Fixes

    • Added health checks to prevent adoption of unresponsive host processes.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements host-service recovery through adoption health-checking, a forceful reset capability, tRPC exposure, exponential retry/backoff for the active organization, tray restart gating, comprehensive tests, and a detailed plan document outlining the multi-PR rollout strategy.

Changes

Host Service Recovery Implementation

Layer / File(s) Summary
Recovery Plan & Strategy
apps/desktop/plans/20260510-1430-host-service-recovery.md
Documents six work items (adoption health-check, coordinator reset, recovery UI, renderer retry, tray gating, Settings UI), rollout sequencing, telemetry requirements, acceptance criteria for closing #4299, and implementation checklist with file preview.
Coordinator Health-Check & Reset Implementation
apps/desktop/src/main/lib/host-service-coordinator.ts, apps/desktop/src/main/lib/host-service-coordinator.test.ts
Adds ADOPT_HEALTH_CHECK_TIMEOUT_MS (2s), adoption health-check via pollHealthCheck with SIGKILL+manifest removal on failure, new reset(organizationId, config, {wipeHostDb?}) method, optional host.db archival to host.db.broken-<timestamp>, and logging migration to electron-log. Comprehensive Bun test suite covers healthy/failed/version-mismatch adoption and reset scenarios with and without host.db archival.
tRPC Reset Route
apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts
Exposes new reset mutation accepting organizationId and optional wipeHostDb, authenticating via loadToken() and invoking coordinator reset.
Tray Restart Menu Gating
apps/desktop/src/main/lib/tray/index.ts
Updates "Restart" menu item to enable when service is not in starting state, allowing restart from both running and stopped states with explanatory comments.
Renderer Auto-Retry & Exponential Backoff
apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx
Adds exponential retry/backoff with RETRY_DELAYS_MS, scoped to active organization only. Introduces refs for attempt tracking and timer management, scheduleRetry bounded by attempt count, fireStart wrapper scheduling retries on error, org-change reset, and stopped status re-entry to retry chain.

Sequence Diagram(s)

sequenceDiagram
  participant Renderer
  participant HostServiceCoordinator
  participant ProcessUtils
  participant ManifestStorage
  participant ProcessCheck

  Renderer->>HostServiceCoordinator: tryAdopt(manifest)
  HostServiceCoordinator->>ProcessUtils: spawn or check existing PID
  HostServiceCoordinator->>ProcessCheck: pollHealthCheck(endpoint, timeout=2s)
  alt Health Check Passes
    ProcessCheck-->>HostServiceCoordinator: healthy
    HostServiceCoordinator-->>Renderer: manifest endpoint + secret
  else Health Check Fails
    ProcessCheck-->>HostServiceCoordinator: timeout or error
    HostServiceCoordinator->>ProcessUtils: SIGKILL PID
    HostServiceCoordinator->>ManifestStorage: removeManifest()
    HostServiceCoordinator->>ProcessUtils: spawn new host
    HostServiceCoordinator-->>Renderer: new connection details
  end

  Renderer->>HostServiceCoordinator: reset(orgId, config, {wipeHostDb?})
  HostServiceCoordinator->>ProcessUtils: stop instance
  HostServiceCoordinator->>ManifestStorage: readManifest()
  HostServiceCoordinator->>ProcessUtils: SIGKILL manifest PID
  HostServiceCoordinator->>ManifestStorage: removeManifest()
  alt wipeHostDb enabled
    HostServiceCoordinator->>ProcessUtils: archive host.db to host.db.broken-*
  end
  HostServiceCoordinator->>ProcessUtils: spawn new host
  HostServiceCoordinator-->>Renderer: new connection details
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through startup's maze,
With health checks bright and reset's glaze.
Backoff exponential, retries dance,
When hosts go dark, we get a chance! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the two main code fixes: health-checking during host-service adoption and enabling the tray Restart button in stopped state, directly addressing the core issues described in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description is comprehensive and well-structured, following the template with all required sections completed including a clear summary, related issues link, type of change, detailed testing plan, and additional context.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dour-table

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 11, 2026

Greptile Summary

This PR ships two targeted fixes for the host-service recovery path described in #4299: a health-check gate during adoption, and enabling the tray Restart button in all non-inflight states.

  • tryAdopt health check — before registering an adopted manifest as running, the coordinator now calls pollHealthCheck with a 2 s cap. On failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean spawn, breaking the dead-port-forever loop.
  • Tray Restart gateenabled: isRunning is replaced with enabled: status !== "starting", so the item is clickable whenever no start is already in flight.
  • Four new unit tests cover the healthy adoption, unhealthy adoption (SIGKILL + respawn), ESRCH-tolerant kill, and the existing app-version mismatch path.

Confidence Score: 4/5

Safe to merge; both runtime changes are targeted and low-blast-radius, with issues confined to the test file.

The coordinator and tray changes are small, well-commented, and backed by unit tests. The test file has a fragile cleanup pattern — process.kill is restored inside each test body rather than in a shared afterEach, so a mid-test assertion failure can corrupt mock state for later tests.

apps/desktop/src/main/lib/host-service-coordinator.test.ts — cleanup pattern should be reviewed before the test suite grows.

Important Files Changed

Filename Overview
apps/desktop/src/main/lib/host-service-coordinator.ts Adds a 2 s pollHealthCheck gate inside tryAdopt before registering an adopted process as running; on failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean spawn — directly fixing the dead-port-forever bug in #4299.
apps/desktop/src/main/lib/tray/index.ts Changes Restart menu item from enabled: isRunning to enabled: status !== "starting", so it stays clickable in every non-inflight state rather than only when the service is fully up.
apps/desktop/src/main/lib/host-service-coordinator.test.ts New unit test file with 4 adoption health-check scenarios; logic is correct but process.kill is restored inline in each test body rather than in an afterEach, making the suite fragile when assertions fail early. One test title also misnames the errno (ENOENT → ESRCH).
apps/desktop/plans/20260510-1430-host-service-recovery.md Planning document only; describes the broader recovery work (items 1 and 5 ship in this PR). No runtime changes.

Sequence Diagram

sequenceDiagram
    participant R as Renderer
    participant C as HostServiceCoordinator
    participant FS as Manifest / FS
    participant P as Host-Service PID

    R->>C: start(orgId, config)
    C->>C: existing running? → return cached
    C->>C: pendingStarts? → deduplicate
    C->>FS: readAndValidateManifest(orgId)
    alt No manifest / pid dead
        FS-->>C: null
        C->>C: spawn()
    else Manifest exists, pid alive
        C->>P: pollHealthCheck(endpoint, authToken, 2000ms)
        alt Healthy
            P-->>C: true
            C->>C: register as running
            C-->>R: Connection(port, secret)
        else Unhealthy (new path)
            P-->>C: false / timeout
            C->>P: SIGKILL(pid)
            C->>FS: removeManifest(orgId)
            C->>C: spawn()
            C-->>R: Connection(freshPort, freshSecret)
        end
    end
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:113-116
**Fragile per-test `process.kill` restore — use `afterEach`**

`process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional.

### Issue 2 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:151
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised.

```suggestion
	test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
```

Reviews (1): Last reviewed commit: "fix(desktop): heal stale host-service ad..." | Re-trigger Greptile

Comment on lines +113 to +116
secret: "fresh-secret",
machineId: "host-1",
}));
(coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile per-test process.kill restore — use afterEach

process.kill is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's expect(conn.port).toBe(60000) fails), process.kill is left in its overridden state for the rest of the suite. The next beforeEach then captures the wrong function as originalKill, so every subsequent test either accumulates into a stale killedPids array or — worse — inherits the throwing mock from test 3. Moving the restore to a single afterEach(() => { (process as any).kill = originalKill; }) makes the cleanup unconditional.

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts
Line: 113-116

Comment:
**Fragile per-test `process.kill` restore — use `afterEach`**

`process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional.

How can I resolve this? If you propose a fix, please make it concise.

(process as unknown as { kill: typeof process.kill }).kill = originalKill;
});

test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is ESRCH (no such process). These are different errno values; the title should match the actual error code being exercised.

Suggested change
test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => {
test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts
Line: 151

Comment:
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised.

```suggestion
	test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)

151-165: 💤 Low value

ESRCH error simulation could be more accurate.

Line 155 throws new Error("ESRCH"), but real Node.js ESRCH errors from process.kill() have a code property (e.g., error.code === 'ESRCH'). While the current test works because the coordinator's try-catch on line 322-324 uses an empty catch block, a more accurate simulation would better document the expected error structure.

♻️ Optional: More accurate ESRCH simulation
 		(process as unknown as { kill: typeof process.kill }).kill = (() => {
-			throw new Error("ESRCH");
+			const error = new Error("No such process") as NodeJS.ErrnoException;
+			error.code = "ESRCH";
+			error.errno = -3;
+			error.syscall = "kill";
+			throw error;
 		}) as typeof process.kill;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` around lines 151
- 165, The test "swallows SIGKILL ENOENT (pid already gone) and still respawns"
currently simulates an ESRCH by throwing new Error("ESRCH"); change the mocked
process.kill to throw an Error-like object with a code property (e.g., { name:
'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it matches real Node.js
errors; update the assignment to (process as unknown as { kill: typeof
process.kill }).kill in this test (and restore originalKill as before) so
coordinator.start's try/catch that checks error.code will behave the same as in
production.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 91-117: Add an afterEach cleanup that restores the original
process.kill by setting (process as unknown as { kill: typeof process.kill
}).kill = originalKill and optionally clearing killedPids, ensuring restoration
happens even if a test fails; place this afterEach alongside the existing
beforeEach that assigns originalKill and replaces process.kill (look for the
beforeEach block that sets originalKill, killedPids, and overrides
process.kill). Remove the per-test restore calls that reassign process.kill at
the end of individual tests (the lines that set (process as unknown as { kill:
typeof process.kill }).kill = originalKill) so restoration is centralized in the
new afterEach. Ensure the afterEach runs in the same test scope that constructs
the HostServiceCoordinator and sets spawnMock so teardown order is correct.

---

Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 151-165: The test "swallows SIGKILL ENOENT (pid already gone) and
still respawns" currently simulates an ESRCH by throwing new Error("ESRCH");
change the mocked process.kill to throw an Error-like object with a code
property (e.g., { name: 'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it
matches real Node.js errors; update the assignment to (process as unknown as {
kill: typeof process.kill }).kill in this test (and restore originalKill as
before) so coordinator.start's try/catch that checks error.code will behave the
same as in production.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ff1651c-45fb-4137-bc3e-6514ea82d308

📥 Commits

Reviewing files that changed from the base of the PR and between 493c48f and c76eb36.

📒 Files selected for processing (4)
  • apps/desktop/plans/20260510-1430-host-service-recovery.md
  • apps/desktop/src/main/lib/host-service-coordinator.test.ts
  • apps/desktop/src/main/lib/host-service-coordinator.ts
  • apps/desktop/src/main/lib/tray/index.ts

Comment thread apps/desktop/src/main/lib/host-service-coordinator.test.ts
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">

<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:324">
P2: Handle `process.kill` failures explicitly instead of silently swallowing them so unhealthy-adoption recovery failures are observable.

(Based on your team's feedback about handling process/async failures without silent catches.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread apps/desktop/src/main/lib/host-service-coordinator.ts Outdated
Kitenite added a commit that referenced this pull request May 11, 2026
…R2 handoff

- Status badge updated; PR #4395 link added.
- Items 1 and 5 marked shipped with implementation pointers.
- Items 2-4 (PR2) tightened with concrete file paths, the existing
  useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror,
  router shape, and test cases.
- Cross-linked #4396 (white-screen ioreg execFileSync) as the related
  but independently-fixable issue.
- Added a "Handoff checklist for PR2" listing every file to touch.
github-actions Bot added a commit that referenced this pull request May 11, 2026
`getRawMachineId` invoked `execFileSync` without a `timeout`, so a wedged
`ioreg` or `reg` subprocess could freeze the synchronous caller. The
renderer's authenticated tree is gated on the `getMachineId` tRPC query,
which means a hung subprocess on the main process blanks the entire
window (white screen, no error boundary) until the hang clears.

Pass `timeout: 1500` to both platform-specific subprocess calls. The
existing `catch` block already routes to the hostname/homedir fallback,
so a hang now degrades gracefully instead of taking down the main event
loop. The result is memoized either way via `cachedMachineId`, so the
fallback path won't retry the hang on subsequent calls.

Refs #4395, which fixes the recovery half of the original report.

Closes #4396
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)

105-105: 💤 Low value

Temp dirs from mkdtempSync are never cleaned up.

Each test creates a new os.tmpdir()/hsc-test-* directory and never removes it. Over many local/CI runs this accumulates. Add an afterEach (or extend the one above) to fs.rmSync(testManifestRoot, { recursive: true, force: true }).

Also applies to: 208-208

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` at line 105, The
tests create temporary dirs with fs.mkdtempSync assigned to testManifestRoot
(and the other mkdtempSync use around the second occurrence) but never remove
them; add or extend the existing afterEach block to call
fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp
directory created by testManifestRoot is cleaned up after each test run; ensure
the cleanup runs for both places where mkdtempSync is used and that
testManifestRoot is in scope for the afterEach.
apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx (1)

147-155: 💤 Low value

Initial-start effect re-fires on every dep change — confirm coordinator no-op cost is acceptable.

This effect re-runs whenever organizationIds or activeOrganizationId change, calling startHostService({ organizationId }) for every org each time (and fireStart for the active one, which also calls setLastAttemptAt(Date.now())). The coordinator's start is idempotent for running instances, but each cycle still incurs a tRPC round-trip per org. For users with many orgs, switching active orgs N times triggers N×|orgs| start mutations.

Consider tracking which organizationIds have already been kicked off (e.g., a Set in a ref) and only starting newly-appeared ones.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`
around lines 147 - 155, The effect over-eagerly calls startHostService and
fireStart on every dependency change, causing redundant tRPC starts for
already-started organizations; modify the useEffect that iterates
organizationIds/activeOrganizationId so it tracks which organizationIds have
already been started (e.g., a useRef to a Set<string>) and only call
startHostService({ organizationId }) or fireStart(organizationId) for ids not
present in that Set, adding ids to the Set after invoking the start; ensure
fireStart still triggers setLastAttemptAt(Date.now()) for the active org and
keep dependency list unchanged.
apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx (1)

14-14: 💤 Low value

Expose the actual manifest directory path via tRPC instead of hardcoding.

MANIFEST_PATH_HINT is hardcoded to "~/.superset/host/<orgId>/" in the renderer, but the real manifest directory is calculated in the main process by manifestDir(organizationId) at apps/desktop/src/main/lib/host-service-manifest.ts, which respects SUPERSET_HOME_DIR (including custom overrides). When users with non-default home directories copy diagnostics text, they'll see the wrong path. Add a tRPC query to expose manifestDir(organizationId) from the main process so the renderer can display the actual path.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx`:
- Around line 33-47: The onCopyDiagnostics handler currently fires toast.success
immediately after calling copyToClipboard without awaiting its Promise; change
onCopyDiagnostics to await the copyToClipboard(lines.join("\n")) call (or use
.then/.catch) and only show toast.success if it resolves, and show toast.error
(with a helpful message) if it rejects; reference the onCopyDiagnostics
function, copyToClipboard invocation, and the toast.success call so you replace
the immediate toast with success/error handling tied to the async result.

---

Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Line 105: The tests create temporary dirs with fs.mkdtempSync assigned to
testManifestRoot (and the other mkdtempSync use around the second occurrence)
but never remove them; add or extend the existing afterEach block to call
fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp
directory created by testManifestRoot is cleaned up after each test run; ensure
the cleanup runs for both places where mkdtempSync is used and that
testManifestRoot is in scope for the afterEach.

In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 147-155: The effect over-eagerly calls startHostService and
fireStart on every dependency change, causing redundant tRPC starts for
already-started organizations; modify the useEffect that iterates
organizationIds/activeOrganizationId so it tracks which organizationIds have
already been started (e.g., a useRef to a Set<string>) and only call
startHostService({ organizationId }) or fireStart(organizationId) for ids not
present in that Set, adding ids to the Set after invoking the start; ensure
fireStart still triggers setLastAttemptAt(Date.now()) for the active org and
keep dependency list unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed48fa88-309d-42eb-b8e6-20da570ef670

📥 Commits

Reviewing files that changed from the base of the PR and between 12c4aa5 and 7035566.

📒 Files selected for processing (9)
  • apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.ts
  • apps/desktop/src/main/lib/host-service-coordinator.test.ts
  • apps/desktop/src/main/lib/host-service-coordinator.ts
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.ts
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsx
  • apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx
✅ Files skipped from review due to trivial changes (2)
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts
  • apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">

<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:179">
P1: `reset()` calls `stop()` before reading the manifest, so active instances lose their manifest first and never reach the SIGKILL path. This can leave a wedged host-service alive when reset is expected to force-kill it.</violation>
</file>

<file name="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx">

<violation number="1" location="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx:45">
P2: Await or handle `copyToClipboard` before showing success. Right now failures still show a success toast and leave a rejected promise unhandled.

(Based on your team's feedback about handling async rejections explicitly.) [FEEDBACK_USED]</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment thread apps/desktop/src/main/lib/host-service-coordinator.ts
Kitenite added 4 commits May 11, 2026 22:42
…n stopped state

Adopt only host-services that respond to a health check (2s timeout).
A live pid no longer serving on the recorded port was being adopted as
"running", leaving every V2 surface stuck on "Host service not available"
with no in-app recovery (#4299).

Also enable tray "Restart" in stopped state — disabling it precisely
when users need restart was backwards. Disabled now only while a start
is in flight, to avoid racing the pending start.

Adds a plan doc covering broader recovery work (banner, in-app reset,
retry-with-backoff) for follow-up PRs.
…R2 handoff

- Status badge updated; PR #4395 link added.
- Items 1 and 5 marked shipped with implementation pointers.
- Items 2-4 (PR2) tightened with concrete file paths, the existing
  useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror,
  router shape, and test cases.
- Cross-linked #4396 (white-screen ioreg execFileSync) as the related
  but independently-fixable issue.
- Added a "Handoff checklist for PR2" listing every file to touch.
PR2 of the recovery plan (#4299): when host-service can't be reached the
provider now auto-retries on a 1s/4s/15s backoff, and once retries
exhaust the v2-workspace layout swaps in a recovery screen with Reset
and Copy-diagnostics actions instead of a blank pane.

- coordinator: new reset() forcibly SIGKILLs any pid in the manifest,
  optionally archives host.db to host.db.broken-<ts>, then re-spawns
- tRPC: hostServiceCoordinator.reset mutation extends orgInput with
  wipeHostDb
- LocalHostServiceProvider: wraps start with exponential-backoff retry
  for the active org, resets on running, exposes lastStartError /
  lastAttemptAt / retryAttempt / retryExhausted on context
- new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI
  mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState
  pattern; layout branches on it above the offline branch
The coordinator's narration (including PR1's adoption health-check
failures: "Adopted pid=… did not respond …, killing and respawning")
went to bare console.log, which never reaches main.log in a packaged
build. Route it through electron-log like the auto-updater does so the
next #4299-style report has breadcrumbs.
Copy link
Copy Markdown

@capy-ai capy-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1 comment

config: SpawnConfig,
options: { wipeHostDb?: boolean } = {},
): Promise<Connection> {
this.stop(organizationId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[🟡 Medium] [🔵 Bug]

reset() calls stop() before reading the manifest, but stop() removes the manifest immediately. That means when an instance is currently tracked, the subsequent manifest read often returns null, so the new SIGKILL fallback never runs even though this method is explicitly intended to hard-kill wedged processes that may ignore SIGTERM. In that case reset can leave the old process alive and then start a second process, reintroducing recovery flakiness.

// apps/desktop/src/main/lib/host-service-coordinator.ts
this.stop(organizationId);

const manifest = readManifest(organizationId);
if (manifest && isProcessAlive(manifest.pid)) {

Capture the manifest (or at least pid) before calling stop(), then perform the SIGKILL check against that captured pid after stop to guarantee the hard-kill path is actually exercised for tracked stale processes.

#4430 just removed the equivalent full-screen "host offline" gate for
remote hosts ("render optimistically; downstream queries surface their
own errors"). Follow that direction for the local case too: remove
WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout
branch.

What stays from PR2:
- coordinator reset() + hostServiceCoordinator.reset tRPC mutation —
  useful for the planned Settings escape hatch and for support
- LocalHostServiceProvider auto-retries start() for the active org on a
  1s/4s/15s backoff (and on a post-running "stopped" status event),
  resetting the counter when status reaches "running" — transient
  spawn failures self-heal without any UI

A non-blocking recovery surface (banner, not takeover) can be a
separate PR if we still want one.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 120-128: The effect currently re-runs startHostService for every
org whenever organizationIds is rebuilt; instead compute a diff against the
previous organizationIds and only call startHostService for newly added IDs, and
handle active org changes separately: when activeOrganizationId changes call
fireStart(activeOrganizationId) (not on every rebuild). Implement this by
storing previous organizationIds in a ref (or usePrevious hook), compute
addedIds = organizationIds - prevOrganizationIds and call startHostService({
organizationId }) only for addedIds, and also guard startHostService/firerStart
against concurrent runs by tracking in-flight/started IDs in a Set so duplicate
starts are skipped. Ensure to update the prev ref and in-flight Set
appropriately after successful starts.
- Around line 74-85: scheduleRetry currently increments attemptRef.current and
replaces the timer even if a retry is already queued, allowing multiple callers
to consume backoff slots; modify scheduleRetry (the function using attemptRef,
retryTimerRef, activeOrgRef, RETRY_DELAYS_MS, fireStartRef, and clearRetryTimer)
to first bail out if retryTimerRef.current is non-null (i.e., a retry is already
scheduled) or if organizationId !== activeOrgRef.current, and only then proceed
to read delay, increment attemptRef.current, clear/set the timer and call
fireStartRef; this prevents duplicate queued retries from burning through
backoff slots.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45485848-89ea-4126-8372-4d502608dbb4

📥 Commits

Reviewing files that changed from the base of the PR and between f9315a0 and 608b6ea.

📒 Files selected for processing (1)
  • apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx

Comment on lines +120 to +128
useEffect(() => {
for (const organizationId of organizationIds) {
startHostService({ organizationId });
if (organizationId === activeOrganizationId) {
fireStart(organizationId);
} else {
startHostService({ organizationId });
}
}
}, [organizationIds, startHostService]);
}, [organizationIds, activeOrganizationId, fireStart, startHostService]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't replay startHostService for every org on unrelated live-query updates.

This effect runs whenever organizationIds is rebuilt from the live query, so any organizations-table refresh can reissue startHostService for all orgs even when the ID set is unchanged. For the active org that bypasses the new backoff window; for all orgs it can overlap an in-flight/pending start with another start mutation. Consider separating “active org changed” from “new org discovered” or diffing against the previous ID set before starting again.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`
around lines 120 - 128, The effect currently re-runs startHostService for every
org whenever organizationIds is rebuilt; instead compute a diff against the
previous organizationIds and only call startHostService for newly added IDs, and
handle active org changes separately: when activeOrganizationId changes call
fireStart(activeOrganizationId) (not on every rebuild). Implement this by
storing previous organizationIds in a ref (or usePrevious hook), compute
addedIds = organizationIds - prevOrganizationIds and call startHostService({
organizationId }) only for addedIds, and also guard startHostService/firerStart
against concurrent runs by tracking in-flight/started IDs in a Set so duplicate
starts are skipped. Ensure to update the prev ref and in-flight Set
appropriately after successful starts.

- reset(): read the manifest pid *before* stop() removes it, so a wedged
  tracked instance gets SIGTERM (from stop) then SIGKILL (escalation)
  instead of only SIGTERM. (cubic P1)
- log on SIGKILL failure in reset() and the adoption health-check path —
  ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2)
- tests: restore process.kill in afterEach (unconditional) and rm the
  mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the
  errno actually simulated); add a reset test for the tracked-instance
  SIGTERM→SIGKILL escalation. (greptile/coderabbit)
@capy-ai
Copy link
Copy Markdown

capy-ai Bot commented May 12, 2026

Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews.

…bits

Drop the speculative pieces:
- reset({ wipeHostDb }) + host.db archival — no caller; the Settings
  "clear local data" button that would use it is out of scope here
- LocalHostServiceProvider retry-with-backoff loop — heavier than the
  #4299 fix needs and invisible without the recovery UI (already
  dropped); reverted to origin/main

What remains:
- adopt health-check (the actual #4299 fix) + tray Restart-in-stopped
- coordinator reset() (manifest-only force-kill + respawn) + tRPC
  mutation — small, tested, ready for a support escape hatch
- coordinator narration through electron-log
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx">

<violation number="1">
P1: This effect now does a fire-and-forget start with no retry/recovery path, so a transient start failure or later process exit can leave the active org permanently stopped until the user manually restarts.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Kitenite added 2 commits May 12, 2026 00:39
…d summary, move to plans/done

The original 220-line plan ranked six items and described a recovery
screen + retry loop that didn't ship. Replace it with a tight summary of
what landed (adopt health-check, reset(), tray fix, electron-log) and
what was considered and deferred, and move it under plans/done per the
repo convention.
@Kitenite Kitenite merged commit 8057272 into main May 12, 2026
9 of 10 checks passed
@Kitenite Kitenite deleted the dour-table branch May 12, 2026 08:03
@github-actions
Copy link
Copy Markdown
Contributor

🧹 Preview Cleanup Complete

The following preview resources have been cleaned up:

  • ✅ Neon database branch

Thank you for your contribution! 🎉

MocA-Love pushed a commit to MocA-Love/superset that referenced this pull request May 25, 2026
…n stopped state (superset-sh#4395)

* fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state

Adopt only host-services that respond to a health check (2s timeout).
A live pid no longer serving on the recorded port was being adopted as
"running", leaving every V2 surface stuck on "Host service not available"
with no in-app recovery (superset-sh#4299).

Also enable tray "Restart" in stopped state — disabling it precisely
when users need restart was backwards. Disabled now only while a start
is in flight, to avoid racing the pending start.

Adds a plan doc covering broader recovery work (banner, in-app reset,
retry-with-backoff) for follow-up PRs.

* docs(desktop): mark PR1 shipped in host-service recovery plan, prep PR2 handoff

- Status badge updated; PR superset-sh#4395 link added.
- Items 1 and 5 marked shipped with implementation pointers.
- Items 2-4 (PR2) tightened with concrete file paths, the existing
  useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror,
  router shape, and test cases.
- Cross-linked superset-sh#4396 (white-screen ioreg execFileSync) as the related
  but independently-fixable issue.
- Added a "Handoff checklist for PR2" listing every file to touch.

* fix(desktop): add in-app host-service reset + retry loop + recovery UI

PR2 of the recovery plan (superset-sh#4299): when host-service can't be reached the
provider now auto-retries on a 1s/4s/15s backoff, and once retries
exhaust the v2-workspace layout swaps in a recovery screen with Reset
and Copy-diagnostics actions instead of a blank pane.

- coordinator: new reset() forcibly SIGKILLs any pid in the manifest,
  optionally archives host.db to host.db.broken-<ts>, then re-spawns
- tRPC: hostServiceCoordinator.reset mutation extends orgInput with
  wipeHostDb
- LocalHostServiceProvider: wraps start with exponential-backoff retry
  for the active org, resets on running, exposes lastStartError /
  lastAttemptAt / retryAttempt / retryExhausted on context
- new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI
  mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState
  pattern; layout branches on it above the offline branch

* fix(desktop): route host-service-coordinator logs through electron-log

The coordinator's narration (including PR1's adoption health-check
failures: "Adopted pid=… did not respond …, killing and respawning")
went to bare console.log, which never reaches main.log in a packaged
build. Route it through electron-log like the auto-updater does so the
next superset-sh#4299-style report has breadcrumbs.

* fix(desktop): drop local-host recovery screen; keep reset + auto-retry

superset-sh#4430 just removed the equivalent full-screen "host offline" gate for
remote hosts ("render optimistically; downstream queries surface their
own errors"). Follow that direction for the local case too: remove
WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout
branch.

What stays from PR2:
- coordinator reset() + hostServiceCoordinator.reset tRPC mutation —
  useful for the planned Settings escape hatch and for support
- LocalHostServiceProvider auto-retries start() for the active org on a
  1s/4s/15s backoff (and on a post-running "stopped" status event),
  resetting the counter when status reaches "running" — transient
  spawn failures self-heal without any UI

A non-blocking recovery surface (banner, not takeover) can be a
separate PR if we still want one.

* fix(desktop): address review feedback on host-service reset + tests

- reset(): read the manifest pid *before* stop() removes it, so a wedged
  tracked instance gets SIGTERM (from stop) then SIGKILL (escalation)
  instead of only SIGTERM. (cubic P1)
- log on SIGKILL failure in reset() and the adoption health-check path —
  ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2)
- tests: restore process.kill in afterEach (unconditional) and rm the
  mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the
  errno actually simulated); add a reset test for the tracked-instance
  SIGTERM→SIGKILL escalation. (greptile/coderabbit)

* refactor(desktop): trim host-service recovery PR to the load-bearing bits

Drop the speculative pieces:
- reset({ wipeHostDb }) + host.db archival — no caller; the Settings
  "clear local data" button that would use it is out of scope here
- LocalHostServiceProvider retry-with-backoff loop — heavier than the
  superset-sh#4299 fix needs and invisible without the recovery UI (already
  dropped); reverted to origin/main

What remains:
- adopt health-check (the actual superset-sh#4299 fix) + tray Restart-in-stopped
- coordinator reset() (manifest-only force-kill + respawn) + tRPC
  mutation — small, tested, ready for a support escape hatch
- coordinator narration through electron-log

* docs(desktop): replace host-service recovery plan with a short shipped summary, move to plans/done

The original 220-line plan ranked six items and described a recovery
screen + retry loop that didn't ship. Replace it with a tight summary of
what landed (adopt health-check, reset(), tray fix, electron-log) and
what was considered and deferred, and move it under plans/done per the
repo convention.

* test(desktop): fix stale 'no rename' in reset test title (wipeHostDb path removed)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant