fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state#4395
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR implements host-service recovery through adoption health-checking, a forceful reset capability, tRPC exposure, exponential retry/backoff for the active organization, tray restart gating, comprehensive tests, and a detailed plan document outlining the multi-PR rollout strategy. ChangesHost Service Recovery Implementation
Sequence Diagram(s)sequenceDiagram
participant Renderer
participant HostServiceCoordinator
participant ProcessUtils
participant ManifestStorage
participant ProcessCheck
Renderer->>HostServiceCoordinator: tryAdopt(manifest)
HostServiceCoordinator->>ProcessUtils: spawn or check existing PID
HostServiceCoordinator->>ProcessCheck: pollHealthCheck(endpoint, timeout=2s)
alt Health Check Passes
ProcessCheck-->>HostServiceCoordinator: healthy
HostServiceCoordinator-->>Renderer: manifest endpoint + secret
else Health Check Fails
ProcessCheck-->>HostServiceCoordinator: timeout or error
HostServiceCoordinator->>ProcessUtils: SIGKILL PID
HostServiceCoordinator->>ManifestStorage: removeManifest()
HostServiceCoordinator->>ProcessUtils: spawn new host
HostServiceCoordinator-->>Renderer: new connection details
end
Renderer->>HostServiceCoordinator: reset(orgId, config, {wipeHostDb?})
HostServiceCoordinator->>ProcessUtils: stop instance
HostServiceCoordinator->>ManifestStorage: readManifest()
HostServiceCoordinator->>ProcessUtils: SIGKILL manifest PID
HostServiceCoordinator->>ManifestStorage: removeManifest()
alt wipeHostDb enabled
HostServiceCoordinator->>ProcessUtils: archive host.db to host.db.broken-*
end
HostServiceCoordinator->>ProcessUtils: spawn new host
HostServiceCoordinator-->>Renderer: new connection details
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR ships two targeted fixes for the host-service recovery path described in #4299: a health-check gate during adoption, and enabling the tray Restart button in all non-inflight states.
Confidence Score: 4/5Safe to merge; both runtime changes are targeted and low-blast-radius, with issues confined to the test file. The coordinator and tray changes are small, well-commented, and backed by unit tests. The test file has a fragile cleanup pattern — apps/desktop/src/main/lib/host-service-coordinator.test.ts — cleanup pattern should be reviewed before the test suite grows.
|
| Filename | Overview |
|---|---|
| apps/desktop/src/main/lib/host-service-coordinator.ts | Adds a 2 s pollHealthCheck gate inside tryAdopt before registering an adopted process as running; on failure it SIGKILLs the stale pid, removes the manifest, and falls through to a clean spawn — directly fixing the dead-port-forever bug in #4299. |
| apps/desktop/src/main/lib/tray/index.ts | Changes Restart menu item from enabled: isRunning to enabled: status !== "starting", so it stays clickable in every non-inflight state rather than only when the service is fully up. |
| apps/desktop/src/main/lib/host-service-coordinator.test.ts | New unit test file with 4 adoption health-check scenarios; logic is correct but process.kill is restored inline in each test body rather than in an afterEach, making the suite fragile when assertions fail early. One test title also misnames the errno (ENOENT → ESRCH). |
| apps/desktop/plans/20260510-1430-host-service-recovery.md | Planning document only; describes the broader recovery work (items 1 and 5 ship in this PR). No runtime changes. |
Sequence Diagram
sequenceDiagram
participant R as Renderer
participant C as HostServiceCoordinator
participant FS as Manifest / FS
participant P as Host-Service PID
R->>C: start(orgId, config)
C->>C: existing running? → return cached
C->>C: pendingStarts? → deduplicate
C->>FS: readAndValidateManifest(orgId)
alt No manifest / pid dead
FS-->>C: null
C->>C: spawn()
else Manifest exists, pid alive
C->>P: pollHealthCheck(endpoint, authToken, 2000ms)
alt Healthy
P-->>C: true
C->>C: register as running
C-->>R: Connection(port, secret)
else Unhealthy (new path)
P-->>C: false / timeout
C->>P: SIGKILL(pid)
C->>FS: removeManifest(orgId)
C->>C: spawn()
C-->>R: Connection(freshPort, freshSecret)
end
end
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:113-116
**Fragile per-test `process.kill` restore — use `afterEach`**
`process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional.
### Issue 2 of 2
apps/desktop/src/main/lib/host-service-coordinator.test.ts:151
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised.
```suggestion
test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
```
Reviews (1): Last reviewed commit: "fix(desktop): heal stale host-service ad..." | Re-trigger Greptile
| secret: "fresh-secret", | ||
| machineId: "host-1", | ||
| })); | ||
| (coordinator as unknown as { spawn: typeof spawnMock }).spawn = spawnMock; |
There was a problem hiding this comment.
Fragile per-test
process.kill restore — use afterEach
process.kill is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's expect(conn.port).toBe(60000) fails), process.kill is left in its overridden state for the rest of the suite. The next beforeEach then captures the wrong function as originalKill, so every subsequent test either accumulates into a stale killedPids array or — worse — inherits the throwing mock from test 3. Moving the restore to a single afterEach(() => { (process as any).kill = originalKill; }) makes the cleanup unconditional.
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts
Line: 113-116
Comment:
**Fragile per-test `process.kill` restore — use `afterEach`**
`process.kill` is restored at the end of each test body. If any assertion fires before the restore line (e.g. test 3's `expect(conn.port).toBe(60000)` fails), `process.kill` is left in its overridden state for the rest of the suite. The next `beforeEach` then captures the wrong function as `originalKill`, so every subsequent test either accumulates into a stale `killedPids` array or — worse — inherits the throwing mock from test 3. Moving the restore to a single `afterEach(() => { (process as any).kill = originalKill; })` makes the cleanup unconditional.
How can I resolve this? If you propose a fix, please make it concise.| (process as unknown as { kill: typeof process.kill }).kill = originalKill; | ||
| }); | ||
|
|
||
| test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => { |
There was a problem hiding this comment.
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is
ESRCH (no such process). These are different errno values; the title should match the actual error code being exercised.
| test("swallows SIGKILL ENOENT (pid already gone) and still respawns", async () => { | |
| test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/desktop/src/main/lib/host-service-coordinator.test.ts
Line: 151
Comment:
The test title says "ENOENT" (file-not-found) but the error thrown and the syscall being simulated is `ESRCH` (no such process). These are different errno values; the title should match the actual error code being exercised.
```suggestion
test("swallows SIGKILL ESRCH (pid already gone) and still respawns", async () => {
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)
151-165: 💤 Low valueESRCH error simulation could be more accurate.
Line 155 throws
new Error("ESRCH"), but real Node.js ESRCH errors fromprocess.kill()have acodeproperty (e.g.,error.code === 'ESRCH'). While the current test works because the coordinator's try-catch on line 322-324 uses an empty catch block, a more accurate simulation would better document the expected error structure.♻️ Optional: More accurate ESRCH simulation
(process as unknown as { kill: typeof process.kill }).kill = (() => { - throw new Error("ESRCH"); + const error = new Error("No such process") as NodeJS.ErrnoException; + error.code = "ESRCH"; + error.errno = -3; + error.syscall = "kill"; + throw error; }) as typeof process.kill;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` around lines 151 - 165, The test "swallows SIGKILL ENOENT (pid already gone) and still respawns" currently simulates an ESRCH by throwing new Error("ESRCH"); change the mocked process.kill to throw an Error-like object with a code property (e.g., { name: 'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it matches real Node.js errors; update the assignment to (process as unknown as { kill: typeof process.kill }).kill in this test (and restore originalKill as before) so coordinator.start's try/catch that checks error.code will behave the same as in production.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 91-117: Add an afterEach cleanup that restores the original
process.kill by setting (process as unknown as { kill: typeof process.kill
}).kill = originalKill and optionally clearing killedPids, ensuring restoration
happens even if a test fails; place this afterEach alongside the existing
beforeEach that assigns originalKill and replaces process.kill (look for the
beforeEach block that sets originalKill, killedPids, and overrides
process.kill). Remove the per-test restore calls that reassign process.kill at
the end of individual tests (the lines that set (process as unknown as { kill:
typeof process.kill }).kill = originalKill) so restoration is centralized in the
new afterEach. Ensure the afterEach runs in the same test scope that constructs
the HostServiceCoordinator and sets spawnMock so teardown order is correct.
---
Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Around line 151-165: The test "swallows SIGKILL ENOENT (pid already gone) and
still respawns" currently simulates an ESRCH by throwing new Error("ESRCH");
change the mocked process.kill to throw an Error-like object with a code
property (e.g., { name: 'Error', message: 'kill ESRCH', code: 'ESRCH' }) so it
matches real Node.js errors; update the assignment to (process as unknown as {
kill: typeof process.kill }).kill in this test (and restore originalKill as
before) so coordinator.start's try/catch that checks error.code will behave the
same as in production.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8ff1651c-45fb-4137-bc3e-6514ea82d308
📒 Files selected for processing (4)
apps/desktop/plans/20260510-1430-host-service-recovery.mdapps/desktop/src/main/lib/host-service-coordinator.test.tsapps/desktop/src/main/lib/host-service-coordinator.tsapps/desktop/src/main/lib/tray/index.ts
There was a problem hiding this comment.
1 issue found across 4 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">
<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:324">
P2: Handle `process.kill` failures explicitly instead of silently swallowing them so unhealthy-adoption recovery failures are observable.
(Based on your team's feedback about handling process/async failures without silent catches.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
…R2 handoff - Status badge updated; PR #4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked #4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch.
`getRawMachineId` invoked `execFileSync` without a `timeout`, so a wedged `ioreg` or `reg` subprocess could freeze the synchronous caller. The renderer's authenticated tree is gated on the `getMachineId` tRPC query, which means a hung subprocess on the main process blanks the entire window (white screen, no error boundary) until the hang clears. Pass `timeout: 1500` to both platform-specific subprocess calls. The existing `catch` block already routes to the hostname/homedir fallback, so a hang now degrades gracefully instead of taking down the main event loop. The result is memoized either way via `cachedMachineId`, so the fallback path won't retry the hang on subsequent calls. Refs #4395, which fixes the recovery half of the original report. Closes #4396
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
apps/desktop/src/main/lib/host-service-coordinator.test.ts (1)
105-105: 💤 Low valueTemp dirs from
mkdtempSyncare never cleaned up.Each test creates a new
os.tmpdir()/hsc-test-*directory and never removes it. Over many local/CI runs this accumulates. Add anafterEach(or extend the one above) tofs.rmSync(testManifestRoot, { recursive: true, force: true }).Also applies to: 208-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts` at line 105, The tests create temporary dirs with fs.mkdtempSync assigned to testManifestRoot (and the other mkdtempSync use around the second occurrence) but never remove them; add or extend the existing afterEach block to call fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp directory created by testManifestRoot is cleaned up after each test run; ensure the cleanup runs for both places where mkdtempSync is used and that testManifestRoot is in scope for the afterEach.apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx (1)
147-155: 💤 Low valueInitial-start effect re-fires on every dep change — confirm coordinator no-op cost is acceptable.
This effect re-runs whenever
organizationIdsoractiveOrganizationIdchange, callingstartHostService({ organizationId })for every org each time (andfireStartfor the active one, which also callssetLastAttemptAt(Date.now())). The coordinator'sstartis idempotent forrunninginstances, but each cycle still incurs a tRPC round-trip per org. For users with many orgs, switching active orgs N times triggers N×|orgs| start mutations.Consider tracking which
organizationIds have already been kicked off (e.g., aSetin a ref) and only starting newly-appeared ones.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx` around lines 147 - 155, The effect over-eagerly calls startHostService and fireStart on every dependency change, causing redundant tRPC starts for already-started organizations; modify the useEffect that iterates organizationIds/activeOrganizationId so it tracks which organizationIds have already been started (e.g., a useRef to a Set<string>) and only call startHostService({ organizationId }) or fireStart(organizationId) for ids not present in that Set, adding ids to the Set after invoking the start; ensure fireStart still triggers setLastAttemptAt(Date.now()) for the active org and keep dependency list unchanged.apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx (1)
14-14: 💤 Low valueExpose the actual manifest directory path via tRPC instead of hardcoding.
MANIFEST_PATH_HINTis hardcoded to"~/.superset/host/<orgId>/"in the renderer, but the real manifest directory is calculated in the main process bymanifestDir(organizationId)atapps/desktop/src/main/lib/host-service-manifest.ts, which respectsSUPERSET_HOME_DIR(including custom overrides). When users with non-default home directories copy diagnostics text, they'll see the wrong path. Add a tRPC query to exposemanifestDir(organizationId)from the main process so the renderer can display the actual path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx`:
- Around line 33-47: The onCopyDiagnostics handler currently fires toast.success
immediately after calling copyToClipboard without awaiting its Promise; change
onCopyDiagnostics to await the copyToClipboard(lines.join("\n")) call (or use
.then/.catch) and only show toast.success if it resolves, and show toast.error
(with a helpful message) if it rejects; reference the onCopyDiagnostics
function, copyToClipboard invocation, and the toast.success call so you replace
the immediate toast with success/error handling tied to the async result.
---
Nitpick comments:
In `@apps/desktop/src/main/lib/host-service-coordinator.test.ts`:
- Line 105: The tests create temporary dirs with fs.mkdtempSync assigned to
testManifestRoot (and the other mkdtempSync use around the second occurrence)
but never remove them; add or extend the existing afterEach block to call
fs.rmSync(testManifestRoot, { recursive: true, force: true }) so each temp
directory created by testManifestRoot is cleaned up after each test run; ensure
the cleanup runs for both places where mkdtempSync is used and that
testManifestRoot is in scope for the afterEach.
In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 147-155: The effect over-eagerly calls startHostService and
fireStart on every dependency change, causing redundant tRPC starts for
already-started organizations; modify the useEffect that iterates
organizationIds/activeOrganizationId so it tracks which organizationIds have
already been started (e.g., a useRef to a Set<string>) and only call
startHostService({ organizationId }) or fireStart(organizationId) for ids not
present in that Set, adding ids to the Set after invoking the start; ensure
fireStart still triggers setLastAttemptAt(Date.now()) for the active org and
keep dependency list unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ed48fa88-309d-42eb-b8e6-20da570ef670
📒 Files selected for processing (9)
apps/desktop/src/lib/trpc/routers/host-service-coordinator/index.tsapps/desktop/src/main/lib/host-service-coordinator.test.tsapps/desktop/src/main/lib/host-service-coordinator.tsapps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsxapps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.tsapps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/index.tsapps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.tsapps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/layout.tsxapps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx
✅ Files skipped from review due to trivial changes (2)
- apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/hooks/useLocalHostStatus/useLocalHostStatus.ts
- apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/index.ts
There was a problem hiding this comment.
2 issues found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/desktop/src/main/lib/host-service-coordinator.ts">
<violation number="1" location="apps/desktop/src/main/lib/host-service-coordinator.ts:179">
P1: `reset()` calls `stop()` before reading the manifest, so active instances lose their manifest first and never reach the SIGKILL path. This can leave a wedged host-service alive when reset is expected to force-kill it.</violation>
</file>
<file name="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx">
<violation number="1" location="apps/desktop/src/renderer/routes/_authenticated/_dashboard/v2-workspace/components/WorkspaceLocalHostStoppedState/WorkspaceLocalHostStoppedState.tsx:45">
P2: Await or handle `copyToClipboard` before showing success. Right now failures still show a success toast and leave a rejected promise unhandled.
(Based on your team's feedback about handling async rejections explicitly.) [FEEDBACK_USED]</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
…n stopped state Adopt only host-services that respond to a health check (2s timeout). A live pid no longer serving on the recorded port was being adopted as "running", leaving every V2 surface stuck on "Host service not available" with no in-app recovery (#4299). Also enable tray "Restart" in stopped state — disabling it precisely when users need restart was backwards. Disabled now only while a start is in flight, to avoid racing the pending start. Adds a plan doc covering broader recovery work (banner, in-app reset, retry-with-backoff) for follow-up PRs.
…R2 handoff - Status badge updated; PR #4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked #4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch.
PR2 of the recovery plan (#4299): when host-service can't be reached the provider now auto-retries on a 1s/4s/15s backoff, and once retries exhaust the v2-workspace layout swaps in a recovery screen with Reset and Copy-diagnostics actions instead of a blank pane. - coordinator: new reset() forcibly SIGKILLs any pid in the manifest, optionally archives host.db to host.db.broken-<ts>, then re-spawns - tRPC: hostServiceCoordinator.reset mutation extends orgInput with wipeHostDb - LocalHostServiceProvider: wraps start with exponential-backoff retry for the active org, resets on running, exposes lastStartError / lastAttemptAt / retryAttempt / retryExhausted on context - new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState pattern; layout branches on it above the offline branch
The coordinator's narration (including PR1's adoption health-check failures: "Adopted pid=… did not respond …, killing and respawning") went to bare console.log, which never reaches main.log in a packaged build. Route it through electron-log like the auto-updater does so the next #4299-style report has breadcrumbs.
| config: SpawnConfig, | ||
| options: { wipeHostDb?: boolean } = {}, | ||
| ): Promise<Connection> { | ||
| this.stop(organizationId); |
There was a problem hiding this comment.
[🟡 Medium] [🔵 Bug]
reset() calls stop() before reading the manifest, but stop() removes the manifest immediately. That means when an instance is currently tracked, the subsequent manifest read often returns null, so the new SIGKILL fallback never runs even though this method is explicitly intended to hard-kill wedged processes that may ignore SIGTERM. In that case reset can leave the old process alive and then start a second process, reintroducing recovery flakiness.
// apps/desktop/src/main/lib/host-service-coordinator.ts
this.stop(organizationId);
const manifest = readManifest(organizationId);
if (manifest && isProcessAlive(manifest.pid)) {Capture the manifest (or at least pid) before calling stop(), then perform the SIGKILL check against that captured pid after stop to guarantee the hard-kill path is actually exercised for tracked stale processes.
#4430 just removed the equivalent full-screen "host offline" gate for remote hosts ("render optimistically; downstream queries surface their own errors"). Follow that direction for the local case too: remove WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout branch. What stays from PR2: - coordinator reset() + hostServiceCoordinator.reset tRPC mutation — useful for the planned Settings escape hatch and for support - LocalHostServiceProvider auto-retries start() for the active org on a 1s/4s/15s backoff (and on a post-running "stopped" status event), resetting the counter when status reaches "running" — transient spawn failures self-heal without any UI A non-blocking recovery surface (banner, not takeover) can be a separate PR if we still want one.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`:
- Around line 120-128: The effect currently re-runs startHostService for every
org whenever organizationIds is rebuilt; instead compute a diff against the
previous organizationIds and only call startHostService for newly added IDs, and
handle active org changes separately: when activeOrganizationId changes call
fireStart(activeOrganizationId) (not on every rebuild). Implement this by
storing previous organizationIds in a ref (or usePrevious hook), compute
addedIds = organizationIds - prevOrganizationIds and call startHostService({
organizationId }) only for addedIds, and also guard startHostService/firerStart
against concurrent runs by tracking in-flight/started IDs in a Set so duplicate
starts are skipped. Ensure to update the prev ref and in-flight Set
appropriately after successful starts.
- Around line 74-85: scheduleRetry currently increments attemptRef.current and
replaces the timer even if a retry is already queued, allowing multiple callers
to consume backoff slots; modify scheduleRetry (the function using attemptRef,
retryTimerRef, activeOrgRef, RETRY_DELAYS_MS, fireStartRef, and clearRetryTimer)
to first bail out if retryTimerRef.current is non-null (i.e., a retry is already
scheduled) or if organizationId !== activeOrgRef.current, and only then proceed
to read delay, increment attemptRef.current, clear/set the timer and call
fireStartRef; this prevents duplicate queued retries from burning through
backoff slots.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 45485848-89ea-4126-8372-4d502608dbb4
📒 Files selected for processing (1)
apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx
| useEffect(() => { | ||
| for (const organizationId of organizationIds) { | ||
| startHostService({ organizationId }); | ||
| if (organizationId === activeOrganizationId) { | ||
| fireStart(organizationId); | ||
| } else { | ||
| startHostService({ organizationId }); | ||
| } | ||
| } | ||
| }, [organizationIds, startHostService]); | ||
| }, [organizationIds, activeOrganizationId, fireStart, startHostService]); |
There was a problem hiding this comment.
Don't replay startHostService for every org on unrelated live-query updates.
This effect runs whenever organizationIds is rebuilt from the live query, so any organizations-table refresh can reissue startHostService for all orgs even when the ID set is unchanged. For the active org that bypasses the new backoff window; for all orgs it can overlap an in-flight/pending start with another start mutation. Consider separating “active org changed” from “new org discovered” or diffing against the previous ID set before starting again.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx`
around lines 120 - 128, The effect currently re-runs startHostService for every
org whenever organizationIds is rebuilt; instead compute a diff against the
previous organizationIds and only call startHostService for newly added IDs, and
handle active org changes separately: when activeOrganizationId changes call
fireStart(activeOrganizationId) (not on every rebuild). Implement this by
storing previous organizationIds in a ref (or usePrevious hook), compute
addedIds = organizationIds - prevOrganizationIds and call startHostService({
organizationId }) only for addedIds, and also guard startHostService/firerStart
against concurrent runs by tracking in-flight/started IDs in a Set so duplicate
starts are skipped. Ensure to update the prev ref and in-flight Set
appropriately after successful starts.
- reset(): read the manifest pid *before* stop() removes it, so a wedged tracked instance gets SIGTERM (from stop) then SIGKILL (escalation) instead of only SIGTERM. (cubic P1) - log on SIGKILL failure in reset() and the adoption health-check path — ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2) - tests: restore process.kill in afterEach (unconditional) and rm the mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the errno actually simulated); add a reset test for the tracked-instance SIGTERM→SIGKILL escalation. (greptile/coderabbit)
|
Capy auto-review is paused for this organization because the monthly auto-review limit has been reached. Increase the limit or turn it off in billing settings to resume automatic reviews. |
…bits
Drop the speculative pieces:
- reset({ wipeHostDb }) + host.db archival — no caller; the Settings
"clear local data" button that would use it is out of scope here
- LocalHostServiceProvider retry-with-backoff loop — heavier than the
#4299 fix needs and invisible without the recovery UI (already
dropped); reverted to origin/main
What remains:
- adopt health-check (the actual #4299 fix) + tray Restart-in-stopped
- coordinator reset() (manifest-only force-kill + respawn) + tRPC
mutation — small, tested, ready for a support escape hatch
- coordinator narration through electron-log
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/desktop/src/renderer/routes/_authenticated/providers/LocalHostServiceProvider/LocalHostServiceProvider.tsx">
<violation number="1">
P1: This effect now does a fire-and-forget start with no retry/recovery path, so a transient start failure or later process exit can leave the active org permanently stopped until the user manually restarts.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
…d summary, move to plans/done The original 220-line plan ranked six items and described a recovery screen + retry loop that didn't ship. Replace it with a tight summary of what landed (adopt health-check, reset(), tray fix, electron-log) and what was considered and deferred, and move it under plans/done per the repo convention.
🧹 Preview Cleanup CompleteThe following preview resources have been cleaned up:
Thank you for your contribution! 🎉 |
…n stopped state (superset-sh#4395) * fix(desktop): heal stale host-service adoption; enable tray Restart in stopped state Adopt only host-services that respond to a health check (2s timeout). A live pid no longer serving on the recorded port was being adopted as "running", leaving every V2 surface stuck on "Host service not available" with no in-app recovery (superset-sh#4299). Also enable tray "Restart" in stopped state — disabling it precisely when users need restart was backwards. Disabled now only while a start is in flight, to avoid racing the pending start. Adds a plan doc covering broader recovery work (banner, in-app reset, retry-with-backoff) for follow-up PRs. * docs(desktop): mark PR1 shipped in host-service recovery plan, prep PR2 handoff - Status badge updated; PR superset-sh#4395 link added. - Items 1 and 5 marked shipped with implementation pointers. - Items 2-4 (PR2) tightened with concrete file paths, the existing useRemoteHostStatus + WorkspaceHostOfflineState pattern to mirror, router shape, and test cases. - Cross-linked superset-sh#4396 (white-screen ioreg execFileSync) as the related but independently-fixable issue. - Added a "Handoff checklist for PR2" listing every file to touch. * fix(desktop): add in-app host-service reset + retry loop + recovery UI PR2 of the recovery plan (superset-sh#4299): when host-service can't be reached the provider now auto-retries on a 1s/4s/15s backoff, and once retries exhaust the v2-workspace layout swaps in a recovery screen with Reset and Copy-diagnostics actions instead of a blank pane. - coordinator: new reset() forcibly SIGKILLs any pid in the manifest, optionally archives host.db to host.db.broken-<ts>, then re-spawns - tRPC: hostServiceCoordinator.reset mutation extends orgInput with wipeHostDb - LocalHostServiceProvider: wraps start with exponential-backoff retry for the active org, resets on running, exposes lastStartError / lastAttemptAt / retryAttempt / retryExhausted on context - new useLocalHostStatus hook + WorkspaceLocalHostStoppedState UI mirroring the existing useRemoteHostStatus / WorkspaceHostOfflineState pattern; layout branches on it above the offline branch * fix(desktop): route host-service-coordinator logs through electron-log The coordinator's narration (including PR1's adoption health-check failures: "Adopted pid=… did not respond …, killing and respawning") went to bare console.log, which never reaches main.log in a packaged build. Route it through electron-log like the auto-updater does so the next superset-sh#4299-style report has breadcrumbs. * fix(desktop): drop local-host recovery screen; keep reset + auto-retry superset-sh#4430 just removed the equivalent full-screen "host offline" gate for remote hosts ("render optimistically; downstream queries surface their own errors"). Follow that direction for the local case too: remove WorkspaceLocalHostStoppedState, useLocalHostStatus, and the layout branch. What stays from PR2: - coordinator reset() + hostServiceCoordinator.reset tRPC mutation — useful for the planned Settings escape hatch and for support - LocalHostServiceProvider auto-retries start() for the active org on a 1s/4s/15s backoff (and on a post-running "stopped" status event), resetting the counter when status reaches "running" — transient spawn failures self-heal without any UI A non-blocking recovery surface (banner, not takeover) can be a separate PR if we still want one. * fix(desktop): address review feedback on host-service reset + tests - reset(): read the manifest pid *before* stop() removes it, so a wedged tracked instance gets SIGTERM (from stop) then SIGKILL (escalation) instead of only SIGTERM. (cubic P1) - log on SIGKILL failure in reset() and the adoption health-check path — ESRCH (pid already gone) stays silent, EPERM/etc. now surface. (cubic P2) - tests: restore process.kill in afterEach (unconditional) and rm the mkdtemp dirs in afterEach; rename the "ENOENT" case to "ESRCH" (the errno actually simulated); add a reset test for the tracked-instance SIGTERM→SIGKILL escalation. (greptile/coderabbit) * refactor(desktop): trim host-service recovery PR to the load-bearing bits Drop the speculative pieces: - reset({ wipeHostDb }) + host.db archival — no caller; the Settings "clear local data" button that would use it is out of scope here - LocalHostServiceProvider retry-with-backoff loop — heavier than the superset-sh#4299 fix needs and invisible without the recovery UI (already dropped); reverted to origin/main What remains: - adopt health-check (the actual superset-sh#4299 fix) + tray Restart-in-stopped - coordinator reset() (manifest-only force-kill + respawn) + tRPC mutation — small, tested, ready for a support escape hatch - coordinator narration through electron-log * docs(desktop): replace host-service recovery plan with a short shipped summary, move to plans/done The original 220-line plan ranked six items and described a recovery screen + retry loop that didn't ship. Replace it with a tight summary of what landed (adopt health-check, reset(), tray fix, electron-log) and what was considered and deferred, and move it under plans/done per the repo convention. * test(desktop): fix stale 'no rename' in reset test title (wipeHostDb path removed)
Summary
Two small fixes that together let users (and the app itself) recover from a wedged host-service without filesystem surgery. Targets #4299, where a host-service whose pid is alive but no longer serving on the recorded port leaves every V2 surface stuck on "Host service not available" — including the delete workspace action, so users can't even clean up.
host-service-coordinator.ts).tryAdoptnowpollHealthChecks the manifest endpoint (2s timeout) before registering the adopted host-service asrunning. On failure: SIGKILL the pid, remove the manifest, return null sostartfalls through to a cleanspawn. This is the structural fix — before this PR, the renderer'sgetConnectionkept handing out a dead port forever.stopped(tray/index.ts). Previouslyenabled: isRunning— disabled exactly when restart would help. Now disabled only while a start is in flight, to avoid racing the pending start.plans/20260510-1430-host-service-recovery.md) covers the broader recovery work — in-app reset, recovery banner, retry-with-backoff — for follow-up PRs. This PR is items 1 and 5 from that plan.The "screen blank before Cmd+R" half of #4299 is a separate bug — this PR makes the recovery work; the root cause for why host-service crashes there needs its own investigation.
Test plan
bun test src/main/lib/host-service-coordinator.test.ts— new file, 4 tests: healthy adoption (no respawn), unhealthy adoption → SIGKILL + respawn, ESRCH-tolerant kill, existing app-version-mismatch path unchanged.bun run lintcleanbun run typecheckcleanpkill -9 -f host-service.jswhile leaving the manifest in place, then Cmd+R the renderer — right pane should recover within ~2s (previously stuck indefinitely).stoppedstate (e.g., immediately afterpkill+ Tray → Stop), Tray submenu "Restart" should be enabled and successfully restart host-service.Summary by cubic
Adds self-healing for the desktop host-service via an adoption health check and a new
resetendpoint. Stuck “Host service not available” cases now recover without manual cleanup; tray Restart works when stopped.New Features
reset()andelectronTrpc.hostServiceCoordinator.reset: read the manifest pid beforestop(), SIGKILL stale pids, remove the manifest, then respawn.Bug Fixes
electron-logfor visibility in packaged builds.Written for commit 359a8ca. Summary will update on new commits.
Summary by CodeRabbit
New Features
Bug Fixes