feat: pty-daemon fd-handoff — sessions survive daemon binary upgrades (#3971)
Phase 2 fd-handoff requires passing the kernel master fd of each PTY to the successor daemon process via stdio inheritance. node-pty does not publicly expose this fd, so we reach into its private `_fd` property and assert at spawn time so a future node-pty bump can't silently break handoff months from now. node-pty stays pinned to 1.1.0 exactly — the Phase 0 harness validated this version's `_fd` shape; the assert in NodePtyAdapter.getMasterFd surfaces a clear error if the contract ever shifts.
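A minimal sketch of that assert, assuming the validation lives in a standalone helper (the real NodePtyAdapter.getMasterFd may be shaped differently):

```typescript
// Hypothetical helper: validate node-pty's private `_fd` at spawn time so a
// future dependency bump fails loudly instead of silently breaking handoff.
function getMasterFdOf(pty: object): number {
  const fd = (pty as { _fd?: unknown })._fd;
  if (typeof fd !== "number" || !Number.isInteger(fd) || fd < 0) {
    throw new Error(
      "node-pty private _fd contract changed; re-validate the pinned version before enabling fd-handoff",
    );
  }
  return fd;
}
```

The check is deliberately strict (integer, non-negative) so any shape change in a future node-pty version surfaces at session spawn, not months later mid-handoff.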
Tiny protocol for the daemon-to-successor control channel during
fd-handoff. The successor sends `upgrade-ack { successorPid }` once it
has adopted the inherited PTY master fds, or `upgrade-nak { reason }`
on a soft failure. Old daemon awaits one or times out.
Travels over a dedicated stdio fd in the successor's inherited stdio
array — NOT exposed to clients. Reuses the existing length-prefixed
JSON framing so it shares encodeFrame / FrameDecoder.
Client wire protocol (messages.ts, version.ts) is untouched.
Also extends mock Pty implementations in handlers.test.ts and
SessionStore.test.ts to satisfy the new getMasterFd() interface
member added in the previous commit.
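The length-prefixed JSON framing the control channel reuses can be sketched as follows, assuming a 4-byte big-endian length header (the real encodeFrame / FrameDecoder live in the daemon package and may differ in detail):

```typescript
// Sketch of length-prefixed JSON framing: [u32be length][JSON body].
function encodeFrame(msg: unknown): Buffer {
  const body = Buffer.from(JSON.stringify(msg), "utf8");
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

function decodeFrame(buf: Buffer): { msg: unknown; rest: Buffer } | null {
  if (buf.length < 4) return null; // header not complete yet
  const len = buf.readUInt32BE(0);
  if (buf.length < 4 + len) return null; // partial body, wait for more bytes
  const msg = JSON.parse(buf.subarray(4, 4 + len).toString("utf8"));
  return { msg, rest: buf.subarray(4 + len) };
}
```

Because both sides already speak this framing on the client socket, the upgrade channel gets framing for free; only the message types (`upgrade-ack` / `upgrade-nak`) are new.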
The handoff snapshot is the on-disk bookkeeping the successor reads to rebuild SessionStore state. Kernel-side state (PTY master fds) flows through the spawned successor's stdio array; this snapshot carries the session id ↔ fdIndex mapping plus per-session metadata and replay buffer. Atomic write via tmp+rename so the successor only ever sees a complete file. Exited sessions are filtered out — they have no live PTY fd to inherit, and their exit event has already been delivered to renderers (see Server.onExit's delete-on-exit behavior). `version: 1` is forward-compat scaffolding only — snapshots are written and consumed by the same daemon-pair on the order of milliseconds, so we don't need cross-version compatibility yet.
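The tmp+rename atomic write can be sketched like this (helper name invented here; it assumes the tmp file sits in the same directory as the target so rename(2) stays on one filesystem and is atomic):

```typescript
import * as fs from "node:fs";

// Hypothetical sketch of the snapshot write: readers only ever observe either
// the previous complete file or the new complete file, never a partial write.
function writeSnapshotAtomic(target: string, snapshot: object): void {
  const tmp = `${target}.tmp-${process.pid}`;
  fs.writeFileSync(tmp, JSON.stringify({ version: 1, ...snapshot }));
  fs.renameSync(tmp, target); // atomic replace on the same filesystem
}
```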
Adds three optional fields to PtyDaemonManifest to track in-flight fd-handoff state:
- handoffInProgress: true between snapshot-write and the successor's bind
- handoffSnapshotPath: where the successor reads its bookkeeping
- handoffSuccessorPid: the spawned successor's pid
All optional, with lenient parsing — older host-service builds that don't know these fields ignore them, and a malformed handoff field doesn't make the whole manifest unreadable. The handoff state is advisory; if it gets corrupted, the supervisor can always fall back to the existing adopt-or-spawn path.
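The lenient-parsing idea can be sketched as a helper that drops malformed fields instead of failing the whole manifest read (names follow the commit; the real parser may differ):

```typescript
// Sketch: advisory handoff state, parsed permissively. A wrong-typed field is
// silently dropped rather than invalidating the manifest.
interface HandoffState {
  handoffInProgress?: boolean;
  handoffSnapshotPath?: string;
  handoffSuccessorPid?: number;
}

function parseHandoffFields(raw: Record<string, unknown>): HandoffState {
  const out: HandoffState = {};
  if (raw.handoffInProgress === true) out.handoffInProgress = true;
  if (typeof raw.handoffSnapshotPath === "string") {
    out.handoffSnapshotPath = raw.handoffSnapshotPath;
  }
  if (
    typeof raw.handoffSuccessorPid === "number" &&
    Number.isInteger(raw.handoffSuccessorPid)
  ) {
    out.handoffSuccessorPid = raw.handoffSuccessorPid;
  }
  return out;
}
```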
AdoptedPty wraps a PTY master fd inherited from a predecessor daemon via stdio. The successor doesn't have a node-pty IPty for these sessions (forkpty was never run; the fd already existed), so we build a thin adapter directly on the fd:
- read via fs.createReadStream (autoClose: false — the supervisor owns the fd)
- write via fs.createWriteStream
- kill via process.kill(pid)
- onExit: read-stream end/error OR PID-liveness poll, whichever fires first
Resize on adopted sessions is a known gap — TIOCSWINSZ requires native ioctl access. Today resize() updates meta cols/rows but leaves the kernel window size untouched. Visible only if a user resizes a session that was carried across an upgrade. Follow-up: koffi or an N-API helper.
Test coverage: API-surface checks. End-to-end I/O on an adopted fd will be validated in the cross-process handoff integration test (Step 10) — testing in-process is unreliable because node-pty's native worker reads from the master fd in parallel with adoptFromFd's stream.
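The PID-liveness half of that exit detection can be sketched with signal 0, which performs existence and permission checks without delivering a signal (helper name invented here):

```typescript
// Sketch: poll-able liveness check. ESRCH means the process is gone;
// EPERM means it exists but we can't signal it, so it still counts as alive.
function pidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}
```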
Phase 2 sender + receiver: a daemon process spawns a successor that
inherits all live PTY master fds via stdio, adopts them via the
already-built `adoptFromFd`, then takes over the socket. The original
shell PIDs survive the daemon binary swap.
**Wire (client-facing):** new `prepare-upgrade` request and
`upgrade-prepared { result }` reply. The supervisor calls
`DaemonClient.prepareUpgrade()` to drive the flow; result carries the
successor's pid on success or a reason string on failure. Client
protocol stays at v1 — the new pair is additive.
**Daemon-to-successor:** Node IPC channel (stdio[3]='ipc'). The
predecessor sends nothing on the IPC; the successor sends one of
`upgrade-ack { successorPid }` or `upgrade-nak { reason }`. Predecessor
times out after 5s.
**Sequencing landmines this commit handles:**
1. server.close() unlinks the bound socket path. The successor must
bind only AFTER the predecessor has unlinked, otherwise the
predecessor's unlink yanks the path entry from under the successor's
chmod, raising ENOENT. We wait for the IPC channel to disconnect
(= predecessor process.exit) before binding.
2. The dispatch handler that called prepareUpgrade is responsible for
sending upgrade-prepared back over the supervisor's wire. Closing
the server inside prepareUpgrade beat the reply and timed out the
supervisor. We schedule close+exit via `setImmediate` after the
dispatcher has had its turn.
3. process.execArgv (--experimental-strip-types in dev/tests) must
forward to the successor or it can't load .ts files.
**Tests:** test/handoff.test.ts spawns a real daemon binary, opens a
session that sleeps 30s, sends prepare-upgrade, verifies the successor
binary lists the same session id with the same shell pid still alive.
This is the core Phase 2 success criterion.
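The predecessor's 5-second ack wait can be sketched with the IPC subscription injected, so the timeout race is testable without a real child process (shape and names are assumptions, not the real implementation):

```typescript
// Sketch: resolve on the first IPC message, or nak on timeout so the
// predecessor can keep serving sessions instead of dying half-upgraded.
type HandoffResult =
  | { ok: true; successorPid: number }
  | { ok: false; reason: string };

function waitForAck(
  onMessage: (
    cb: (msg: { type: string; successorPid?: number; reason?: string }) => void,
  ) => void,
  timeoutMs = 5000,
): Promise<HandoffResult> {
  return new Promise((resolve) => {
    const timer = setTimeout(
      () => resolve({ ok: false, reason: "successor ack timeout" }),
      timeoutMs,
    );
    onMessage((msg) => {
      clearTimeout(timer);
      if (msg.type === "upgrade-ack" && typeof msg.successorPid === "number") {
        resolve({ ok: true, successorPid: msg.successorPid });
      } else {
        resolve({ ok: false, reason: msg.reason ?? "upgrade-nak" });
      }
    });
  });
}
```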
Wraps the daemon-side prepare-upgrade wire call with the supervisor-side
bookkeeping needed to swap in the successor cleanly:
- Suppress crash-respawn for the predecessor's imminent exit
(`stopping` set + stop adopted-liveness)
- Mark `handoffInProgress: true` in the manifest so a host-service
crash mid-handoff is debuggable
- On `upgrade-prepared { ok: true }`: probe successor version, swap
manifest + instances entry to the new pid, restart adopted-liveness
on the successor (we don't have a child handle for it)
- On failure: restore manifest, clear `stopping`, re-arm liveness on
the still-alive predecessor
Distinct from `restart()` — restart kills sessions, update preserves them.
The renderer wires "Update" to this path and falls back to restart() only
on failure.
Wires the supervisor's `update()` to the renderer surface as a mutation that mirrors `restart()` shape but preserves sessions on success. The renderer keeps `restart` as the explicit force-restart fallback; `update` is the new default UX.
Renderer changes for Phase 2's "preserve sessions across upgrades" UX:
- New primary "Update daemon" button when update is pending. Calls
terminal.daemon.update; on success, sessions stay alive and a toast
reports the version bump.
- Existing "Restart daemon" stays as the secondary, force-close path
("Force restart" when an update is pending). The existing confirmation
dialog now nudges users toward Update for the non-destructive path.
- New "Update couldn't preserve sessions" failure dialog. Surfaces the
reason from supervisor.update (snapshot write failed, successor never
acked, transport error). Two paths from there: Force update (calls
restart) or Cancel — keeps sessions alive on the predecessor.
This is decision D5 from the Phase 2 /decide walkthrough: surface
handoff failure to the user, let them choose force-update or cancel,
never silently destroy sessions on a soft failure.
End-to-end supervisor coverage at the integration level:
1. update() preserves live sessions across daemon binary swap
- Spawn fresh daemon via supervisor
- Open a session over the wire, capture the shell pid
- Call sup.update(orgId)
- Verify successor pid differs from predecessor, predecessor is
dead, successor is alive
- Reconnect to the (same) socket and assert the session is still
listed with its original shell pid
2. update() returns ok:false with "no daemon running" when no instance
exists — guards against the renderer firing update() before
bootstrap completes.
Also fixes an AdoptedPty teardown leak: with `autoClose: false` on the
read/write streams, the inherited fd would keep the event loop alive
after the shell exits. AdoptedPty's onExit now destroys both streams
and closes the fd explicitly. (Without this, normal daemon shutdown
of an adopted-session daemon stalls waiting for stale handles to drain.)
Update is non-destructive — there's no reason to gate it behind an `updatePending` flag. Even with no version difference, clicking Update spawns a fresh successor daemon process that adopts existing sessions via fd-handoff. Useful in dev for verifying handoff plumbing without faking a version mismatch, and arguably better UX in prod too — Update always works, Force restart is the explicit destructive opt-in. The version-pending badge in the status row stays gated on updatePending — that signal is meaningful UI; the button gate isn't.
node-pty's "posix_spawnp failed" message swallows errno entirely, leaving callers with zero context for diagnosis. Pre-flight the most-common failure mode (missing cwd) and annotate any other native failure with the cwd + shell that triggered it. Common cause: a workspace's worktreePath gets deleted (manual rm, worktree pruned) but the workspace record stays in the local DB. Opening a terminal in it sends the missing cwd to the daemon. With this change, the renderer's terminal-error toast says "spawn: cwd does not exist: /path/to/dead-worktree" instead of the generic native error.
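A hypothetical sketch of the cwd pre-flight (helper name invented here; the real check sits in front of the node-pty spawn call):

```typescript
import * as fs from "node:fs";

// Sketch: fail before forkpty so the error names the path, instead of
// node-pty's errno-less "posix_spawnp failed".
function assertCwdExists(cwd: string): void {
  if (!fs.existsSync(cwd)) {
    throw new Error(`spawn: cwd does not exist: ${cwd}`);
  }
}
```

The error string is what the renderer's terminal-error toast surfaces, so including the full path is the whole point.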
Phase 2 handoff worked end-to-end against source-direct integration tests and Bun-built bundles, but failed in the deployed Electron build. Root cause: both Bun's bundler AND electron-vite's esbuild statically inline `process.env.<KEY>` references at build time, constant-fold the unset-at-build-time comparison `undefined === "1"`, then dead-code-eliminate the entire receiver branch — including the runHandoffReceiver function. Bracket notation `process.env["KEY"]` doesn't help; both bundlers see through it.
Manually verified end-to-end in a real dev build today: spawned 3 sessions, clicked Update three times in succession, all sessions preserved (same shell PIDs, no flicker beyond the WS reconnect).
Changes:
- Mode signal moves from the `SUPERSET_PTY_DAEMON_HANDOFF=1` env var to a `--handoff` argv flag. Snapshot path + socket path also move to argv (`--snapshot=...`, `--socket=...`). argv is fully dynamic and survives every bundler we run.
- The desktop daemon entry shim at apps/desktop/src/main/pty-daemon/index.ts was an ENTIRELY SEPARATE copy of the package's old main.ts that electron-vite bundles independently — so it had none of the Phase 2 receiver code. Rewritten to mirror the package's main.ts with full handoff support.
- Snapshot helpers (clearSnapshot, readSnapshot, writeSnapshot) and related types are now exported from @superset/pty-daemon for the desktop shim to consume.
- Diagnostic stderr writes in both the predecessor's prepareUpgrade and the successor's runHandoffReceiver. Kept in production code: cheap, and this whole class of bug was invisible without them.
Test gap noted: integration tests ran the .ts source via `node --experimental-strip-types` (no bundling) or the Bun-built package output (different DCE behavior than esbuild). Neither covered apps/desktop/dist/main/pty-daemon.js, where the actual production failure mode lives.
Follow-up: add a CI-runnable smoke that builds the desktop bundle and greps for `runHandoffReceiver` + `--handoff` markers, so future bundler regressions fail fast.
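The argv-based mode signal can be sketched as a small parser; unlike `process.env.<KEY>`, argv is opaque to bundlers, so the handoff branch can't be constant-folded away (flag names follow the commit; the parser itself is an illustrative sketch):

```typescript
// Sketch: parse --handoff / --snapshot= / --socket= from argv. Bundlers
// cannot prove these absent at build time, so no DCE of the receiver branch.
function parseHandoffArgs(argv: string[]): {
  handoff: boolean;
  snapshot?: string;
  socket?: string;
} {
  const valueOf = (prefix: string) =>
    argv.find((a) => a.startsWith(prefix))?.slice(prefix.length);
  return {
    handoff: argv.includes("--handoff"),
    snapshot: valueOf("--snapshot="),
    socket: valueOf("--socket="),
  };
}
```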
Phase 2's fd-handoff broke in the deployed Electron build because both Bun's bundler and electron-vite's esbuild did dead-code elimination on `process.env.X === "Y"` checks at build time, stripping the handoff receiver branch from `apps/desktop/dist/main/pty-daemon.js`. The bug was invisible to existing tests (they ran source-direct or the Bun bundle, never the desktop-shim bundle).
Adds a post-`compile:app` smoke that greps the bundle for runtime markers Phase 2 depends on:
- `runHandoffReceiver` — the successor entry function
- `--handoff` (≥2 times) — argv flag at the spawn + receiver sites
- `upgrade-ack` — IPC handshake message type
- `adoptSnapshot` — Server's snapshot replay path
- `adoptFromFd` — Pty wrapping for an inherited fd
If any marker is missing, the build fails fast with a diagnostic pointing at the runtime conditions that need to survive bundling. This makes future bundler regressions visible at build time, not manual-QA time. Wired into apps/desktop's `compile:app` so it runs in production builds and CI without an extra step. Standalone via `bun run check:pty-daemon-bundle` for ad-hoc verification.
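The marker check itself reduces to counting substring occurrences, a minimal sketch (marker list from the commit; the function shape is an assumption about the real script):

```typescript
// Sketch: return the list of markers missing from the bundle text, with the
// minimum occurrence count each one requires.
function checkMarkers(bundle: string): string[] {
  const required: Array<[string, number]> = [
    ["runHandoffReceiver", 1],
    ["--handoff", 2],
    ["upgrade-ack", 1],
    ["adoptSnapshot", 1],
    ["adoptFromFd", 1],
  ];
  const missing: string[] = [];
  for (const [marker, min] of required) {
    const count = bundle.split(marker).length - 1;
    if (count < min) missing.push(`${marker} (need >=${min}, found ${count})`);
  }
  return missing;
}
```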
Adoption is the right behavior in production — host-service can crash or restart and pick up its still-running daemon, preserving every PTY session. In dev (bun dev), it's a footgun: a leftover detached daemon from the previous bun dev session has the OLD bundle code, so devs edit pty-daemon source, restart bun dev, and silently still talk to the stale daemon. Code changes that ride only on the daemon binary (handoff plumbing, snapshot writes, etc.) appear to do nothing. NODE_ENV != "production" → kill any existing daemon for the org (SIGTERM, escalate to SIGKILL after 1s if still alive) before adopt-or-spawn. Manifest gets removed too. Next bootstrap always spawns fresh from the latest bundle. Idempotent when no daemon is running. Production path is unchanged.
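The SIGTERM-then-SIGKILL escalation can be sketched with the signaling primitives injected, so the logic is testable without a real daemon (a sketch of assumed behavior, not the supervisor's actual code):

```typescript
// Sketch: dev-mode stale-daemon kill. SIGTERM first, escalate to SIGKILL
// after a grace period if the process is still alive. No-op when nothing runs.
async function killStaleDaemon(
  pid: number,
  kill: (pid: number, sig: NodeJS.Signals) => void,
  alive: (pid: number) => boolean,
  graceMs = 1000,
): Promise<void> {
  if (!alive(pid)) return; // idempotent when no daemon is running
  kill(pid, "SIGTERM");
  await new Promise((resolve) => setTimeout(resolve, graceMs));
  if (alive(pid)) kill(pid, "SIGKILL");
}
```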
AdoptedPty.resize previously only updated meta cols/rows but left the kernel-side window size unchanged — visible to users as stale dims after dragging the terminal pane border on a session that was carried across a daemon upgrade. node-pty issues TIOCSWINSZ via its native binding, but adopted sessions don't go through that path. Workaround without adding a native FFI dependency: spawn `stty` with the master fd as its stdin. stty(1) issues TIOCSWINSZ on its own stdin by default, so the master inherits the resize. One spawn per resize is fine — resize is rare (xterm.js throttles, users only resize via drag). Best-effort: if stty isn't available or the syscall fails, meta stays correct for next reattach; the kernel-side just stays stale until then.
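The stty workaround can be sketched as follows; the argument construction is testable, while the spawn itself needs a real PTY master fd (a sketch of the assumed shape, not AdoptedPty's actual code):

```typescript
import { spawnSync } from "node:child_process";

// stty(1) applies TIOCSWINSZ to its own stdin, so pointing stdin at the PTY
// master fd resizes the kernel window without a native FFI dependency.
function sttyResizeArgs(rows: number, cols: number): string[] {
  return ["rows", String(rows), "cols", String(cols)];
}

function resizeAdoptedPty(masterFd: number, rows: number, cols: number): void {
  // Best-effort: if stty is missing or the fd is rejected, meta stays correct
  // for the next reattach and only the kernel-side size goes stale.
  try {
    spawnSync("stty", sttyResizeArgs(rows, cols), {
      stdio: [masterFd, "ignore", "ignore"],
    });
  } catch {
    // swallow; resize on adopted sessions is explicitly best-effort
  }
}
```

One spawn per resize is acceptable because resize is rare and already throttled upstream by xterm.js.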
When host-service boots and adopts a daemon whose runningVersion <
EXPECTED_DAEMON_VERSION (i.e. host-service was just upgraded but the
detached daemon is still on the old binary), kick off `supervisor.update()`
in the background. Sessions survive on success — the new daemon adopts
all PTY masters via fd-handoff. On failure, leave the old daemon
running and let the user retry via the Update button. No fallback to
force-restart; we never auto-destroy sessions.
Concurrency:
- New `updateInFlight` map coalesces concurrent `update()` calls per
org. Both auto-update and a manual Update button click hit the same
entry point — without this, two simultaneous calls would each try
to handoff a daemon that's already mid-handoff. Second caller gets
the cached promise and observes the same outcome.
Telemetry events: `pty_daemon_auto_update_attempt`, `pty_daemon_auto_update_ok`,
`pty_daemon_auto_update_failed { reason }`.
Constructor option `autoUpdate?: boolean` (default true) for tests
that intentionally adopt a stale daemon and don't want a real handoff
racing the assertion.
Tests:
- 3 unit tests for the in-flight cache (coalesce, recycle after settle,
per-org separation)
- 1 real-spawn integration test: pin a daemon to v0.0.1, fresh
supervisor adopts, asserts pid swap to a successor of the bundled
version within 8s, predecessor dead within another 2s.
- Existing real-spawn tests now force NODE_ENV=production in the test
fixture so the dev-mode-kill behavior added earlier doesn't preempt
the adoption paths these tests assert.
Walkthrough: Implements a Phase-2 PTY-daemon binary upgrade: snapshot + fd handoff, successor/adopter startup modes, protocol messages, snapshot I/O, PTY fd adoption, supervisor update API/auto-update, renderer replay suppression, UI update flow, bundle smoke-check, and end-to-end tests.
Sequence Diagram

```mermaid
sequenceDiagram
    participant UI as Desktop UI
    participant Supervisor as DaemonSupervisor
    participant Client as DaemonClient
    participant Pred as Predecessor Daemon
    participant Succ as Successor Daemon
    UI->>Supervisor: update(orgId)
    activate Supervisor
    Supervisor->>Supervisor: coalesce concurrent calls, mark handoffInProgress
    Supervisor->>Client: connect(socket) / prepareUpgrade()
    activate Client
    Client->>Pred: "prepare-upgrade"
    activate Pred
    Pred->>Pred: serializeSessions & writeSnapshot
    Pred->>Succ: spawn with --handoff + inherited master fds
    Pred->>Succ: pass snapshotPath + socket via IPC/stdio
    Pred->>Pred: wait for upgrade-ack/nak or timeout
    activate Succ
    Succ->>Succ: readSnapshot, adoptSnapshot()
    alt adoption ok
        Succ->>Pred: upgrade-ack(successorPid)
    else adoption fail
        Succ->>Pred: upgrade-nak(reason)
        Succ->>Succ: exit
    end
    deactivate Succ
    Pred->>Pred: finalizeHandoff & exit (on ack)
    Client-->>Supervisor: { ok:true, successorPid } or { ok:false, reason }
    deactivate Client
    Supervisor->>Supervisor: swap instance tracking, start liveness polling
    Supervisor-->>UI: result
    deactivate Supervisor
```
**Greptile Summary**
This PR implements Phase 2 of the pty-daemon migration: PTY sessions survive daemon binary upgrades via fd-handoff — the predecessor writes a snapshot, spawns the successor with inherited PTY master fds over stdio, awaits an IPC ack, then exits cleanly. Two P1 issues need attention before merging.
**Confidence Score: 3/5.** Not safe to merge as-is — the env-var inheritance bug causes infinite auto-update loops after every host-service restart following a handoff. Two P1 findings: the SUPERSET_PTY_DAEMON_VERSION inheritance creates an infinite auto-update loop, and the version probe timing races against the predecessor. Both stem from the same handoff sequence in Server.ts and DaemonSupervisor.ts. The fd-handoff architecture itself is sound and the broader session-survival goal works correctly. Key files: packages/pty-daemon/src/Server/Server.ts (env forwarding in prepareUpgrade), packages/host-service/src/daemon/DaemonSupervisor.ts (version probe timing in runUpdate).
| Filename | Overview |
|---|---|
| packages/pty-daemon/src/Server/Server.ts | Core handoff sender: spawns successor with env: process.env, propagating the predecessor's pinned SUPERSET_PTY_DAEMON_VERSION to the successor and causing it to advertise the wrong version post-handoff; also leaves snapshot files on disk after failure. |
| packages/host-service/src/daemon/DaemonSupervisor.ts | New update()/runUpdate() orchestration with coalescing, manifest tagging, and liveness handover — well-structured, but version probe races against predecessor socket and will record stale version data on the successor instance. |
| packages/pty-daemon/src/main.ts | Handoff receiver entrypoint: correctly uses argv flag to avoid bundler DCE, handles disconnect-wait with 1s fallback, and clears snapshot on success. Version inherited from env (the P1 source). |
| packages/pty-daemon/src/Pty/Pty.ts | Adds getMasterFd() to NodePtyAdapter (private _fd accessor with version-guard assertion) and new AdoptedPty class with stty-based resize workaround; implementation is sound with appropriate best-effort fallbacks. |
| packages/pty-daemon/src/SessionStore/snapshot.ts | Atomic snapshot write (tmp+rename), typed reader with version guard and shape validator, clearSnapshot helper — clean and correct. |
| packages/host-service/src/daemon/manifest.ts | Adds Phase 2 optional fields to manifest schema with graceful parse-and-drop for unknown values; handoffSnapshotPath and handoffSuccessorPid are declared and parsed but never written during the actual handoff flow. |
| packages/host-service/src/terminal/DaemonClient/DaemonClient.ts | Adds prepareUpgrade() with generous 15s timeout and proper error routing; clean extension of existing request/response pattern. |
| apps/desktop/scripts/check-pty-daemon-bundle.ts | Post-build bundle integrity check that greps for 5 handoff markers; well-targeted regression guard against bundler DCE of the handoff path. |
| apps/desktop/src/main/pty-daemon/index.ts | Desktop shim mirrors package's main.ts logic for electron-vite build; uses hardcoded "0.1.0" fallback version instead of readPackageVersion(), minor divergence from the package entrypoint. |
| packages/host-service/src/trpc/router/terminal/terminal.ts | Adds update tRPC mutation that delegates to supervisor.update(); correctly awaits daemon-ready before acting. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant SUP as DaemonSupervisor
    participant A as Daemon A (predecessor)
    participant B as Daemon B (successor)
    participant FS as Filesystem
    SUP->>A: connect + send prepare-upgrade
    A->>FS: writeSnapshot(sessions, fdIndexes)
    A->>B: spawn(--handoff, --snapshot, --socket, stdio=[ipc, fd4..fdN])
    B->>FS: readSnapshot()
    B->>B: adoptFromFd() per session
    B-->>A: IPC upgrade-ack {successorPid}
    A-->>SUP: upgrade-prepared {ok:true, successorPid}
    Note over A: setImmediate → finalizeHandoff()
    SUP->>A: probeDaemonVersion(socketPath) ⚠️ may still hit A
    A->>A: server.close() + unlink socket
    A->>A: process.exit(0)
    Note over B: IPC disconnect received
    B->>B: server.listenWithRetry()
    B->>FS: clearSnapshot()
    SUP->>B: startAdoptedLivenessCheck(successorPid)
    Note over B: Sessions live, same shell PIDs
```
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 4
packages/pty-daemon/src/Server/Server.ts:189-203
**Successor inherits predecessor's `SUPERSET_PTY_DAEMON_VERSION`, causing infinite auto-update loop**
`prepareUpgrade()` spawns the successor with `env: process.env`, which includes the predecessor's pinned `SUPERSET_PTY_DAEMON_VERSION`. The successor reads this env var in `runHandoffReceiver()` and reports the old version in its `hello-ack`. As a result, the version probe in `DaemonSupervisor.runUpdate()` returns the predecessor's version for the successor, setting `successorInstance.updatePending = true`. On the next host-service restart, `tryAdopt` probes the socket, sees the still-stale version, and `kickoffAutoUpdate` fires again — this repeats indefinitely across restarts.
The fix is to strip or override the version pin when spawning the successor so it falls back to `readPackageVersion()` on the new binary:
```
const successorEnv = { ...process.env };
delete successorEnv["SUPERSET_PTY_DAEMON_VERSION"];
const child = childProcess.spawn(
process.execPath,
[...],
{
stdio,
env: successorEnv,
detached: false,
},
);
```
### Issue 2 of 4
packages/host-service/src/daemon/DaemonSupervisor.ts:274-278
**Version probe may connect to predecessor, not successor, recording stale version**
`probeDaemonVersion(instance.socketPath, ...)` is called immediately after `prepareUpgrade()` returns. At that instant the predecessor is still alive — its `finalizeHandoff()` is only scheduled via `setImmediate`, meaning the predecessor's server is still listening and the successor hasn't bound yet. If the probe connects to the (still-alive) predecessor, it returns the old version, setting `successorInstance.updatePending = true` erroneously. Combined with the env-var inheritance bug in `Server.ts`, this makes the stale-version state sticky across every probe. A short `waitForSocket` poll before probing would guarantee the connection lands on the successor.
### Issue 3 of 4
packages/host-service/src/daemon/manifest.ts:26-31
**`handoffSnapshotPath` and `handoffSuccessorPid` are declared but never written**
`PtyDaemonManifest` defines `handoffSnapshotPath` and `handoffSuccessorPid`, and `readPtyDaemonManifest` parses them, but `DaemonSupervisor.runUpdate()` only ever writes `handoffInProgress: true` — neither field is populated during the actual handoff. A host-service crash mid-handoff therefore gets no extra recovery hints from the manifest beyond the boolean flag. These fields are either dead code from the design phase or an incomplete implementation.
### Issue 4 of 4
packages/pty-daemon/src/Server/Server.ts:212-222
**Snapshot file not deleted after a failed handoff**
When `waitForHandoffAck` returns `{ok: false}`, the code kills the child but explicitly leaves `snapshotPath` in place ("for diagnostics"). There is no cleanup on any subsequent upgrade attempt, and no TTL. Repeated handoff failures will accumulate `pty-daemon-handoff-*.json` files in `/tmp`. A best-effort `fs.rm` on the old snapshot at the top of `prepareUpgrade()` (or at the start of `runUpdate()`) would cap the footprint.
```ts
const child = childProcess.spawn(
  process.execPath,
  [
    ...process.execArgv,
    scriptPath,
    "--handoff",
    `--snapshot=${snapshotPath}`,
    `--socket=${this.opts.socketPath}`,
  ],
  {
    stdio,
    env: process.env,
    detached: false,
  },
);
```
**Successor inherits predecessor's `SUPERSET_PTY_DAEMON_VERSION`, causing infinite auto-update loop**
prepareUpgrade() spawns the successor with env: process.env, which includes the predecessor's pinned SUPERSET_PTY_DAEMON_VERSION. The successor reads this env var in runHandoffReceiver() and reports the old version in its hello-ack. As a result, the version probe in DaemonSupervisor.runUpdate() returns the predecessor's version for the successor, setting successorInstance.updatePending = true. On the next host-service restart, tryAdopt probes the socket, sees the still-stale version, and kickoffAutoUpdate fires again — this repeats indefinitely across restarts.
The fix is to strip or override the version pin when spawning the successor so it falls back to readPackageVersion() on the new binary:

```ts
const successorEnv = { ...process.env };
delete successorEnv["SUPERSET_PTY_DAEMON_VERSION"];
const child = childProcess.spawn(
  process.execPath,
  [...],
  {
    stdio,
    env: successorEnv,
    detached: false,
  },
);
```
```ts
const probedVersion = await probeDaemonVersion(
  instance.socketPath,
  VERSION_PROBE_TIMEOUT_MS,
);
const runningVersion = probedVersion ?? "unknown";
```
**Version probe may connect to predecessor, not successor, recording stale version**
probeDaemonVersion(instance.socketPath, ...) is called immediately after prepareUpgrade() returns. At that instant the predecessor is still alive — its finalizeHandoff() is only scheduled via setImmediate, meaning the predecessor's server is still listening and the successor hasn't bound yet. If the probe connects to the (still-alive) predecessor, it returns the old version, setting successorInstance.updatePending = true erroneously. Combined with the env-var inheritance bug in Server.ts, this makes the stale-version state sticky across every probe. A short waitForSocket poll before probing would guarantee the connection lands on the successor.
```ts
  handoffInProgress?: boolean;
  /** Path of the on-disk handoff snapshot the successor will read. */
  handoffSnapshotPath?: string;
  /** PID of the spawned successor; pre-bind, supervisor uses this to track. */
  handoffSuccessorPid?: number;
}
```
**`handoffSnapshotPath` and `handoffSuccessorPid` are declared but never written**
PtyDaemonManifest defines handoffSnapshotPath and handoffSuccessorPid, and readPtyDaemonManifest parses them, but DaemonSupervisor.runUpdate() only ever writes handoffInProgress: true — neither field is populated during the actual handoff. A host-service crash mid-handoff therefore gets no extra recovery hints from the manifest beyond the boolean flag. These fields are either dead code from the design phase or an incomplete implementation.
```ts
const result = await waitForHandoffAck(child);
if (!result.ok) {
  // Successor never acked. Leave snapshot for diagnostics; the
  // supervisor will clean stale handoff state on next ensure().
  try {
    child.kill("SIGKILL");
  } catch {
    // already gone
  }
  return result;
}
```
Snapshot file not deleted after a failed handoff
When waitForHandoffAck returns {ok: false}, the code kills the child but explicitly leaves snapshotPath in place ("for diagnostics"). There is no cleanup on any subsequent upgrade attempt, and no TTL. Repeated handoff failures will accumulate pty-daemon-handoff-*.json files in /tmp. A best-effort fs.rm on the old snapshot at the top of prepareUpgrade() (or at the start of runUpdate()) would cap the footprint.
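The best-effort cleanup can be a single `fs.rmSync(..., { force: true })` guarded so it can never abort an upgrade attempt. A minimal standalone sketch (the function name and warn-and-continue policy are assumptions, not the repo's code):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";
import * as os from "node:os";

// Best-effort removal of a stale handoff snapshot; never throws.
// Intended call site: the top of prepareUpgrade() or runUpdate().
function clearStaleSnapshot(snapshotPath: string): void {
  try {
    // force: true makes a missing file a no-op instead of an ENOENT error.
    fs.rmSync(snapshotPath, { force: true });
  } catch (err) {
    // Permission/I-O failures are logged, not fatal: an upgrade should not
    // abort because a diagnostic leftover could not be removed.
    console.warn(`could not remove stale snapshot ${snapshotPath}:`, err);
  }
}

const stale = path.join(os.tmpdir(), "pty-daemon-handoff-demo.json");
fs.writeFileSync(stale, "{}");
clearStaleSnapshot(stale); // removes the leftover from a prior failed handoff
clearStaleSnapshot(stale); // second call is a no-op, not an error
```

This caps the footprint at one leftover file per socket path while still preserving the most recent failure for diagnostics.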
Prompt To Fix With AI
Path: packages/pty-daemon/src/Server/Server.ts
Line: 212-222
How can I resolve this? If you propose a fix, please make it concise.
Actionable comments posted: 9
🧹 Nitpick comments (1)
packages/host-service/src/trpc/router/terminal/terminal.daemon.test.ts (1)
119-140: ⚡ Quick win: block `ensure()` in this test so it proves the bootstrap ordering.

`ensureMock` resolves immediately here, so this only checks org-id wiring. If `terminal.daemon.update()` stopped awaiting bootstrap, the test would still pass. Holding `ensureMock` unresolved until after asserting `updateMock` has not run would cover the intended contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/host-service/src/trpc/router/terminal/terminal.daemon.test.ts` around lines 119 - 140, Replace the immediate-resolving ensureMock with a deferred (pending) promise so the test proves bootstrap ordering: create a deferred promise and set ensureMock to return that promise, start the call to terminal.daemon.update() without awaiting it (or call and keep its promise), assert updateMock has not been called yet, then resolve the deferred promise so ensure() completes, await the update call result, and finally assert updateMock was called with "00000000-0000-4000-8000-000000000000" and the returned value matches; reference ensureMock, updateMock, getSupervisor, makeCaller, and terminal.daemon.update in the change.
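The deferred-promise pattern this prompt describes can be sketched standalone. The mock names mirror the review's `ensureMock`/`updateMock`, but this is not the repo's test code, just an illustration of the ordering proof:

```typescript
// A deferred exposes its resolve function so the test controls when it settles.
function createDeferred<T>() {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((r) => { resolve = r; });
  return { promise, resolve };
}

async function demo(): Promise<string[]> {
  const order: string[] = [];
  const ensureGate = createDeferred<void>();
  const ensure = async () => { await ensureGate.promise; order.push("ensure"); };
  const update = async () => { order.push("update"); };

  // System under test: update() must await bootstrap (ensure()) first.
  const call = (async () => { await ensure(); await update(); })();

  // ensure() is still pending, so update must not have run yet.
  if (order.includes("update")) throw new Error("update ran before bootstrap");

  ensureGate.resolve(); // release bootstrap
  await call;
  return order;
}

demo().then((order) => console.log(order.join(","))); // "ensure,update"
```

With the immediately-resolving mock, the "has not run yet" assertion would be vacuous; the gate is what makes the test fail if the awaited ordering regresses.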
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/desktop/src/main/pty-daemon/index.ts`:
- Around line 50-58: The current parsing for --buffer-bytes (the branch where
arg.startsWith("--buffer-bytes=")) uses Number.parseInt(raw) which accepts
prefixes like "64kb" or "1e6"; update the validation to require the entire token
to be an integer string (e.g., validate raw against /^\d+$/) before converting,
then parse (Number.parseInt or Number) and keep the existing checks
(Number.isFinite(parsed) && parsed > 0) and assignment to args.bufferBytes to
reject partially numeric values.
- Around line 153-157: The call to clearSnapshot(snapshotPath) after await
server.listenWithRetry() can throw non-ENOENT filesystem errors and must not
cause the daemon to exit; wrap the clearSnapshot(...) invocation in a try/catch
inside main (the block following server.listenWithRetry()), catch any error that
is not an ENOENT, log it as a warning (use the same logger used for log(...) or
processLogger), and swallow the error so startup continues; retain existing
behavior for ENOENT but ensure main().catch() no longer receives cleanup errors
from clearSnapshot.
In `@packages/host-service/src/daemon/DaemonSupervisor.ts`:
- Around line 129-141: The current Map updateInFlight only deduplicates update()
calls; restart() and stop() can race with runUpdate() handoff and must be
serialized per orgId. Fix by using the same per-org lock for all lifecycle
mutations: either (A) extend updateInFlight to represent any in-flight lifecycle
operation and have restart(orgId)/stop(orgId) await the existing Promise before
proceeding, or (B) introduce a shared per-org lock (e.g., lifecycleInFlight Map
keyed by orgId) and make update(), restart(), and stop() acquire that lock
(await existing entry or set a new Promise) so only one mutation runs at a time;
ensure runUpdate() clears the lock on completion or error and that
restart()/stop() either await or cancel the in-flight update before sending
signals (refer to updateInFlight, update(), restart(), stop(), and runUpdate()).
In `@packages/pty-daemon/src/main.ts`:
- Around line 119-137: The snapshot read is happening outside the handoff
try/catch so read failures never send an "upgrade-nak"; wrap or move the call to
readSnapshot(snapshotPath) into the same error-handling flow that calls
server.adoptSnapshot(snapshot) (or add a surrounding try/catch) so any error
from readSnapshot or adoptSnapshot is caught, logged via log(...), and results
in sending the HandoffMessage { type: "upgrade-nak", reason: `adopt failed:
${err.message}` } followed by the existing delayed process.exit; reference
readSnapshot, Server, adoptSnapshot and the existing NAK creation/processing
block to implement this change.
In `@packages/pty-daemon/src/Pty/Pty.ts`:
- Around line 46-59: getMasterFd() currently throws unconditionally if the
private _fd is missing which will break every spawn() call when node-pty
changes; change the implementation to provide a safe feature-detection/fallback:
add a non-throwing probe (e.g. hasMasterFd() or have getMasterFd() return number
| undefined) that checks (this.term as any)._fd for a valid integer and returns
undefined instead of throwing, then update spawn() to call that probe and only
attempt fd-handoff when a valid fd is present (or fall back to the existing
non-handoff code path), and ensure all callers of getMasterFd()/hasMasterFd()
handle the undefined case gracefully rather than assuming _fd exists.
- Around line 177-178: The adopted shell can exit before onExit is registered
and writes can target destroyed streams; add an exitFired boolean and an
exitResult (e.g., { code, signal }) on the Pty instance, set them when the
underlying reader/writer emit 'close'/'exit' (where reader/writer are created in
the constructor), update onExit to immediately invoke the subscriber with
exitResult if exitFired is already true (replaying the notification), and guard
the write(...) method to no-op (or fail fast) when exitFired is true so no
writes go to destroyed streams. Ensure the state is set exactly once when the
streams close and that all places that register exit listeners (onExit) check
exitFired and use the cached exitResult.
In `@packages/pty-daemon/src/Server/Server.ts`:
- Around line 83-98: listenWithRetry is retrying by calling listen(), but
listen() always unlinks this.opts.socketPath which can remove a live
predecessor's socket during a handoff; change listenWithRetry to call a new
bind-only path that skips stale-socket cleanup on retries. Add a new method
(e.g., bindOnly or listenBind) alongside listen that performs the bind without
unlinking this.opts.socketPath or calling unlinkSync, and have listenWithRetry
call that bind-only method inside the retry loop while preserving the original
listen() behavior for the non-retry path; ensure errors other than EADDRINUSE
are rethrown and lastErr is propagated as before.
In `@packages/pty-daemon/src/SessionStore/snapshot.ts`:
- Around line 106-126: The isHandoffSnapshot type guard is too permissive:
tighten it to fully validate each session's SessionMeta shape and ensure numeric
fields are sane before accepting a snapshot. Update isHandoffSnapshot to check
session.id is a non-empty string, session.pid and session.fdIndex are safe
integers within expected ranges (e.g., Number.isSafeInteger and pid > 0, fdIndex
>= 0 and within a reasonable max), and explicitly validate required keys and
types on session.meta to match the SessionMeta interface (check all required
properties and their types). This prevents readSnapshot() from returning
corrupted data that later causes adoptSnapshot() to fail; reference the
isHandoffSnapshot function, HandoffSnapshot type, SessionMeta shape, and the
downstream readSnapshot/adoptSnapshot flow when making the changes.
In `@packages/pty-daemon/test/handoff.test.ts`:
- Around line 132-137: After sending SIGTERM to the successor
(process.kill(successorPid, "SIGTERM")) in the after() teardown, wait for the
PID to actually disappear before finishing the test: implement an async wait
loop that repeatedly checks process existence (e.g., using
process.kill(successorPid, 0) inside try/catch) with a short delay, a total
timeout, and a fallback (send SIGKILL if the process still exists after the
timeout); make the after() hook async and await this wait routine so inherited
fds/sockets are not leaked into later tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d6670384-f3ec-408c-aebe-bfab9db8596e
📒 Files selected for processing (28)
- apps/desktop/package.json
- apps/desktop/scripts/check-pty-daemon-bundle.ts
- apps/desktop/src/main/pty-daemon/index.ts
- apps/desktop/src/renderer/routes/_authenticated/settings/terminal/components/TerminalSettings/components/V2SessionsSection/V2SessionsSection.tsx
- packages/host-service/src/daemon/DaemonSupervisor.node-test.ts
- packages/host-service/src/daemon/DaemonSupervisor.test.ts
- packages/host-service/src/daemon/DaemonSupervisor.ts
- packages/host-service/src/daemon/manifest.test.ts
- packages/host-service/src/daemon/manifest.ts
- packages/host-service/src/terminal/DaemonClient/DaemonClient.ts
- packages/host-service/src/trpc/router/terminal/terminal.daemon.test.ts
- packages/host-service/src/trpc/router/terminal/terminal.ts
- packages/pty-daemon/src/Pty/Pty.ts
- packages/pty-daemon/src/Pty/index.ts
- packages/pty-daemon/src/Server/Server.ts
- packages/pty-daemon/src/SessionStore/SessionStore.test.ts
- packages/pty-daemon/src/SessionStore/index.ts
- packages/pty-daemon/src/SessionStore/snapshot.test.ts
- packages/pty-daemon/src/SessionStore/snapshot.ts
- packages/pty-daemon/src/handlers/handlers.test.ts
- packages/pty-daemon/src/index.ts
- packages/pty-daemon/src/main.ts
- packages/pty-daemon/src/protocol/handoff.test.ts
- packages/pty-daemon/src/protocol/handoff.ts
- packages/pty-daemon/src/protocol/index.ts
- packages/pty-daemon/src/protocol/messages.ts
- packages/pty-daemon/test/handoff.test.ts
- packages/pty-daemon/test/integration.test.ts
```ts
} else if (arg.startsWith("--buffer-bytes=")) {
  const raw = arg.slice("--buffer-bytes=".length);
  const parsed = Number.parseInt(raw, 10);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    throw new Error(
      `--buffer-bytes must be a positive integer, got: ${raw}`,
    );
  }
  args.bufferBytes = parsed;
```
Reject partially numeric --buffer-bytes values.
Number.parseInt() will accept prefixes here, so inputs like --buffer-bytes=64kb or --buffer-bytes=1e6 are treated as valid and silently configure the wrong buffer size. Validate the whole token before converting it.
Suggested fix
```diff
 } else if (arg.startsWith("--buffer-bytes=")) {
   const raw = arg.slice("--buffer-bytes=".length);
-  const parsed = Number.parseInt(raw, 10);
-  if (!Number.isFinite(parsed) || parsed <= 0) {
+  if (!/^[1-9]\d*$/.test(raw)) {
     throw new Error(
       `--buffer-bytes must be a positive integer, got: ${raw}`,
     );
   }
+  const parsed = Number(raw);
   args.bufferBytes = parsed;
 }
```
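The pitfall is easy to demonstrate: `Number.parseInt` stops at the first non-digit, so partially numeric tokens parse "successfully". A standalone sketch of the strict check (`parseBufferBytes` is a hypothetical helper for illustration, not the actual flag parser):

```typescript
// Strict positive-integer parsing: the whole token must be digits,
// with no leading zero, before conversion.
function parseBufferBytes(raw: string): number {
  if (!/^[1-9]\d*$/.test(raw)) {
    throw new Error(`--buffer-bytes must be a positive integer, got: ${raw}`);
  }
  return Number(raw);
}

// parseInt silently truncates at the first non-digit character:
console.log(Number.parseInt("64kb", 10)); // 64  — accepted, wrong meaning
console.log(Number.parseInt("1e6", 10));  // 1   — accepted, wrong meaning
console.log(parseBufferBytes("4096"));    // 4096
```

With the regex gate, `64kb` and `1e6` fail loudly at startup instead of silently configuring a tiny replay buffer.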
```ts
await server.listenWithRetry();
log(`bound and listening`);

clearSnapshot(snapshotPath);
wireShutdown(server);
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
#!/bin/bash
# Inspect clearSnapshot/readSnapshot implementations and callers.
fd 'snapshot\.(ts|js)$' packages apps | xargs -r rg -n -C3 '\b(clearSnapshot|readSnapshot)\b'

# Get the full clearSnapshot implementation and surrounding context.
cat -n packages/pty-daemon/src/SessionStore/snapshot.ts | sed -n '98,120p'

# Check if the file in the review exists and view lines 153-157.
if [ -f apps/desktop/src/main/pty-daemon/index.ts ]; then
  cat -n apps/desktop/src/main/pty-daemon/index.ts | sed -n '150,160p'
else
  echo "File not found at apps/desktop/src/main/pty-daemon/index.ts"
  fd 'index\.ts$' apps/desktop | head -20
fi

# Search all usages of clearSnapshot to understand call patterns.
rg -n 'clearSnapshot' apps desktop --type ts --type tsx

# View the main function to understand error handling context.
cat -n apps/desktop/src/main/pty-daemon/index.ts | sed -n '1,50p'

# Look for the main function definition and catch handler.
rg -n 'main|catch' apps/desktop/src/main/pty-daemon/index.ts | head -20

# View the broader context of main to see the try/catch structure.
cat -n apps/desktop/src/main/pty-daemon/index.ts | sed -n '67,170p'

# View the top-level catch handler.
cat -n apps/desktop/src/main/pty-daemon/index.ts | sed -n '175,185p'
```
Wrap post-bind snapshot cleanup to prevent fatal exit on cleanup errors.
The clearSnapshot() call at line 156 is not wrapped in error handling. While clearSnapshot() has internal try/catch logic for ENOENT, it will still throw on other filesystem errors (permissions, I/O failures). Since the socket is already successfully bound and listening (line 153) and sessions have been adopted (line 122), a cleanup failure here causes the daemon to exit fatally via main().catch() despite a successful handoff. Wrap this call in try/catch, log any non-ENOENT errors as warnings, and continue startup.
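A sketch of the suggested wrap, with a stub `clearSnapshot` that simulates a non-ENOENT failure (the stub, local error type, and log wording are assumptions, not the repo's code):

```typescript
type ErrnoLike = Error & { code?: string };

// Stub standing in for the real clearSnapshot in snapshot.ts: simulate a
// non-ENOENT filesystem failure such as EACCES.
function clearSnapshot(snapshotPath: string): void {
  const err: ErrnoLike = new Error(
    `EACCES: permission denied, unlink '${snapshotPath}'`,
  );
  err.code = "EACCES";
  throw err;
}

// Post-bind cleanup must be non-fatal: the socket is already bound and
// sessions adopted, so a cleanup error is a warning, never an exit.
let startupContinued = false;
try {
  clearSnapshot("/tmp/pty-daemon-handoff-demo.json");
} catch (err) {
  const code = (err as ErrnoLike).code;
  if (code !== "ENOENT") {
    console.warn("snapshot cleanup failed, continuing startup:", code);
  }
}
startupContinued = true; // main() proceeds; main().catch() never sees cleanup errors
```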
```ts
/**
 * In-flight `update()` promises per orgId. Both auto-update (on adopt
 * with version drift) and manual update via the renderer hit the same
 * supervisor.update() entry point — without this guard, two concurrent
 * calls would both try to handoff a daemon that's already mid-handoff.
 * The second caller returns the cached promise and gets the same result.
 */
private readonly updateInFlight = new Map<
  string,
  Promise<
    { ok: true; successorPid: number } | { ok: false; reason: string }
  >
>();
```
Serialize restart() / stop() with handoff too.
updateInFlight only dedupes update() callers. It does not stop a concurrent restart() or stop() from killing the predecessor while runUpdate() is mid-handoff, which can turn the “preserve sessions” path into a destructive race or leave supervisor state half-swapped. Please put all per-org lifecycle mutations behind the same lock, or have restart() / stop() await or cancel the in-flight update before sending signals.
Also applies to: 199-215
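Option (B) can be as small as a promise-chaining map: each lifecycle mutation for an org runs only after the previous one for the same org settles. A standalone sketch (the class and method names are hypothetical):

```typescript
// Per-key async lock: run(key, fn) queues fn behind the last operation
// registered for that key, whether it resolved or rejected.
class PerKeyLock {
  private tails = new Map<string, Promise<unknown>>();

  run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    const next = prev.then(fn, fn); // run after predecessor, even if it failed
    // Store a non-rejecting tail so one failure doesn't poison the chain.
    this.tails.set(key, next.catch(() => undefined));
    return next;
  }
}

const lock = new PerKeyLock();
const trace: string[] = [];
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// update() and stop() on the same org never interleave:
lock.run("org-1", async () => {
  trace.push("update:start");
  await sleep(20); // simulate the handoff window
  trace.push("update:end");
});
lock.run("org-1", async () => {
  trace.push("stop");
});
```

Here `stop` cannot fire mid-handoff: it only starts once the queued `update` has fully settled. The real supervisor would additionally evict drained entries from the map to avoid unbounded growth.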
```ts
getMasterFd(): number {
  // node-pty 1.1.x exposes the master fd as the private property `_fd`.
  // Pinned to "1.1.0" in package.json so a future bump can't break this
  // silently — assert here so a missing/changed field surfaces at the
  // first spawn, not when the user clicks "Update" months later.
  const fd = (this.term as unknown as { _fd?: unknown })._fd;
  if (typeof fd !== "number" || !Number.isInteger(fd) || fd < 0) {
    throw new Error(
      `node-pty master fd unavailable (got ${typeof fd}: ${fd}). ` +
      `Phase 2 fd-handoff depends on node-pty's private _fd property — ` +
      `pin node-pty to 1.1.x or update Pty.ts to match the new shape.`,
    );
  }
  return fd;
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
#!/bin/bash
set -euo pipefail
version="$(
python - <<'PY'
import json, os, subprocess, sys
paths = ["package.json"]
try:
    out = subprocess.check_output(["fd", "-a", "^package\\.json$"], text=True)
    paths.extend([p for p in out.splitlines() if p])
except Exception:
    pass
seen = set()
for path in paths:
    if path in seen or not os.path.exists(path):
        continue
    seen.add(path)
    try:
        with open(path) as f:
            pkg = json.load(f)
    except Exception:
        continue
    deps = {}
    for key in ("dependencies", "devDependencies", "peerDependencies", "optionalDependencies"):
        deps.update(pkg.get(key, {}))
    if "node-pty" in deps:
        print(deps["node-pty"].lstrip("^~"))
        sys.exit(0)
raise SystemExit("node-pty dependency not found")
PY
)"
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
curl -Ls "https://registry.npmjs.org/node-pty/-/node-pty-${version}.tgz" | tar -xz -C "$tmp"
echo "node-pty version: $version"
echo "Occurrences of _fd in the published package:"
rg -n "_fd" "$tmp/package" || true
echo "Platform-specific files mentioning Windows/ConPTY/winpty:"
rg -n "Windows|ConPTY|conpty|winpty" "$tmp/package" -g '!**/*.map' || true

# Check for getMasterFd usage and spawn context.
rg -n "getMasterFd|fd-handoff|Phase 2" packages/pty-daemon/src --type ts
# Check the spawn() implementation to see if platform guards exist.
rg -n "spawn\(" packages/pty-daemon/src/Pty/Pty.ts -A 3 -B 1 --type ts
# Look for any platform checks in the codebase.
rg -n "process\.platform|darwin|win32|linux" packages/pty-daemon/src --type ts

# Check the PR context for supportedPlatforms or similar.
fd -e ts -e tsx -e json packages/pty-daemon | xargs rg -l "platform|darwin|arm64" | head -10

# Get context around the getMasterFd() call at line 141 in spawn().
sed -n '95,155p' packages/pty-daemon/src/Pty/Pty.ts

# Check for feature flags or conditions around fd-handoff.
rg -n "handoff|upgrade|PHASE|feature" packages/pty-daemon/src/Pty/Pty.ts packages/pty-daemon/src/Server/Server.ts -B 2 -A 2

# Check platform detection and supported platforms.
rg -n "process\.platform|darwin|win32|linux" packages/pty-daemon/src --type ts -B 2 -A 2 | head -50
```
The unconditional getMasterFd() assert can fail if node-pty removes or changes the _fd property.
The private _fd field is currently available across all node-pty 1.1.0 implementations (Unix, Windows ConPTY, and winpty), so the platform concern in the original comment was incorrect. However, spawn() calls getMasterFd() unconditionally at line 141 with no fallback. Since the PR pins node-pty to "1.1.0" specifically to work around this private-API dependency, a future minor version bump that changes _fd's shape or removes it would break every normal session open, not just handoff scenarios. Consider adding an optional fallback path or explicit feature detection.
```ts
this.reader = fs.createReadStream("", { fd, autoClose: false });
this.writer = fs.createWriteStream("", { fd, autoClose: false });
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
find . -type f -name "Pty.ts" | head -20
wc -l ./packages/pty-daemon/src/Pty/Pty.ts
cat -n ./packages/pty-daemon/src/Pty/Pty.ts
```
Cache adopted-session exit state to replay notifications to late subscribers and prevent post-exit writes.
If the adopted shell exits before the server registers onExit, the notification is dropped permanently. In that same window, write() can still target destroyed streams. Store the exit result and replay it to late-arriving subscribers; guard writes with an exitFired check to make them harmless after termination.
Suggested fix

```diff
 class AdoptedPty implements Pty {
   readonly pid: number;
   meta: SessionMeta;
   private readonly fd: number;
   private readonly reader: fs.ReadStream;
   private readonly writer: fs.WriteStream;
   private exitFired = false;
+  private exitInfo: { code: number | null; signal: number | null } | null =
+    null;
   private livenessTimer: NodeJS.Timeout | null = null;
   private exitCallbacks: PtyOnExit[] = [];
   constructor(fd: number, pid: number, meta: SessionMeta) {
@@
     const onExit = (info: { code: number | null; signal: number | null }) => {
       if (this.exitFired) return;
       this.exitFired = true;
+      this.exitInfo = info;
       if (this.livenessTimer) clearInterval(this.livenessTimer);
@@
     };
     this.reader.on("end", () => onExit({ code: null, signal: null }));
     this.reader.on("error", () => onExit({ code: null, signal: null }));
+    this.writer.on("error", () => onExit({ code: null, signal: null }));
     this.livenessTimer = setInterval(() => {
       if (!isPidAlive(this.pid)) onExit({ code: null, signal: null });
     }, 1000);
@@
   write(data: Buffer): void {
+    if (this.exitFired) return;
     this.writer.write(data);
   }
@@
   onExit(cb: PtyOnExit): void {
+    if (this.exitInfo) {
+      cb(this.exitInfo);
+      return;
+    }
     this.exitCallbacks.push(cb);
   }
 }
```
```ts
async listenWithRetry(timeoutMs = 5_000): Promise<void> {
  const start = Date.now();
  let lastErr: unknown = null;
  while (Date.now() - start < timeoutMs) {
    try {
      await this.listen();
      return;
    } catch (err) {
      lastErr = err;
      const code = (err as NodeJS.ErrnoException).code;
      if (code !== "EADDRINUSE") throw err;
      await new Promise((r) => setTimeout(r, 50));
    }
  }
  throw lastErr ?? new Error("listenWithRetry timed out");
}
```
Don't retry handoff binds through listen().
listen() always unlinks this.opts.socketPath first. In the handoff path that means a retry can remove the predecessor's still-live socket entry, let the successor bind early, and then have the predecessor's later unlinkSync() delete the successor's new socket path out from under it. The retry loop needs a bind-only path that skips stale-socket cleanup.
One way to split the behaviors

```diff
 async listen(): Promise<void> {
   const dir = path.dirname(this.opts.socketPath);
   fs.mkdirSync(dir, { recursive: true });
-  // Stale-socket cleanup: remove any prior socket file at this path.
-  try {
-    fs.unlinkSync(this.opts.socketPath);
-  } catch (err) {
-    if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
-  }
-  await new Promise<void>((resolve, reject) => {
-    this.server.once("error", reject);
-    this.server.listen(this.opts.socketPath, () => {
-      this.server.off("error", reject);
-      resolve();
-    });
-  });
+  try {
+    fs.unlinkSync(this.opts.socketPath);
+  } catch (err) {
+    if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
+  }
+  await this.bindSocket();
   fs.chmodSync(this.opts.socketPath, 0o600);
 }

+private async bindSocket(): Promise<void> {
+  await new Promise<void>((resolve, reject) => {
+    this.server.once("error", reject);
+    this.server.listen(this.opts.socketPath, () => {
+      this.server.off("error", reject);
+      resolve();
+    });
+  });
+}
+
 async listenWithRetry(timeoutMs = 5_000): Promise<void> {
   const start = Date.now();
   let lastErr: unknown = null;
   while (Date.now() - start < timeoutMs) {
     try {
-      await this.listen();
+      await this.bindSocket();
+      fs.chmodSync(this.opts.socketPath, 0o600);
       return;
     } catch (err) {
       lastErr = err;
       const code = (err as NodeJS.ErrnoException).code;
       if (code !== "EADDRINUSE") throw err;
```
```ts
// Reap the successor for the after() hook.
try {
  process.kill(successorPid, "SIGTERM");
} catch {
  // already gone
}
```
Wait for the successor to actually exit during teardown.
This only sends SIGTERM and returns. If the successor exits asynchronously, it can keep the socket or inherited fds around long enough to bleed into later tests in the same worker. Please wait until the PID disappears before leaving the test.
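The wait loop can be sketched standalone; here a throwaway `node -e` child stands in for the successor (the helper name and timings are illustrative, not the test's actual code):

```typescript
import { spawn } from "node:child_process";

// Poll until the PID is gone, escalating to SIGKILL at the deadline.
async function waitForExit(pid: number, timeoutMs = 2000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      process.kill(pid, 0); // signal 0 = existence check, sends nothing
    } catch {
      return; // ESRCH: process is gone
    }
    if (Date.now() > deadline) {
      try {
        process.kill(pid, "SIGKILL"); // fallback for a stuck process
      } catch {
        // raced with exit; already gone
      }
      return;
    }
    await new Promise((r) => setTimeout(r, 25));
  }
}

// Usage: the after() hook becomes async and awaits the reap, so inherited
// fds/sockets can't bleed into later tests in the same worker.
const child = spawn(process.execPath, ["-e", "setTimeout(() => {}, 60_000)"]);
const reaped = (async () => {
  process.kill(child.pid!, "SIGTERM");
  await waitForExit(child.pid!);
})();
```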
7 issues found across 28 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/desktop/scripts/check-pty-daemon-bundle.ts">
<violation number="1" location="apps/desktop/scripts/check-pty-daemon-bundle.ts:64">
P2: Use a proper file-path conversion for `import.meta.url` instead of `.pathname`; the current code can resolve an invalid path on Windows.
(Based on your team's feedback about using cross-platform path utilities instead of string/path parsing shortcuts.) [FEEDBACK_USED]</violation>
</file>
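As background, the kind of marker check this script implements can be sketched as below (the marker list is taken from the PR description; the real script's structure may differ):

```typescript
// Hypothetical sketch of a DCE-regression check: count required markers in the
// compiled bundle text and report any that fall below their minimum count.
const REQUIRED_MARKERS: Array<[marker: string, min: number]> = [
  ["runHandoffReceiver", 1],
  ["--handoff", 2],
  ["upgrade-ack", 1],
  ["adoptSnapshot", 1],
  ["adoptFromFd", 1],
];

function missingMarkers(bundleSource: string): string[] {
  return REQUIRED_MARKERS.filter(([marker, min]) => {
    const count = bundleSource.split(marker).length - 1; // occurrences of marker
    return count < min;
  }).map(([marker]) => marker);
}

// A bundle where the receiver branch was dead-code-eliminated fails the check:
console.log(missingMarkers('spawn(bin, ["--handoff"]); send("upgrade-ack");'));
```

Failing fast at build time like this is what turns a silent bundler regression into a hard CI error.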
<file name="packages/pty-daemon/src/main.ts">
<violation number="1" location="packages/pty-daemon/src/main.ts:8">
P2: Header comment documents env-var usage (`SUPERSET_PTY_DAEMON_HANDOFF`, `SUPERSET_PTY_DAEMON_SNAPSHOT`, `SUPERSET_PTY_DAEMON_SOCKET`) but the implementation uses argv flags (`--handoff`, `--snapshot=`, `--socket=`). The code itself explains env vars don't survive bundler DCE. Update the comment to match the actual invocation.</violation>
</file>
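A rough sketch of the argv-based invocation the violation refers to (the flag names come from the diff; the parsing details are assumptions):

```typescript
// Hypothetical parser for the daemon's invocation flags. Unlike an inlined
// `process.env.X` reference, which bundlers constant-fold at build time,
// process.argv is fully dynamic, so the handoff branch survives DCE.
interface DaemonArgs {
  handoff: boolean;
  snapshot?: string;
  socket?: string;
}

function parseDaemonArgs(argv: string[]): DaemonArgs {
  const valueOf = (prefix: string) =>
    argv.find((arg) => arg.startsWith(prefix))?.slice(prefix.length);
  return {
    handoff: argv.includes("--handoff"),
    snapshot: valueOf("--snapshot="),
    socket: valueOf("--socket="),
  };
}

console.log(JSON.stringify(parseDaemonArgs(["--handoff", "--snapshot=/tmp/d.snap"])));
// → {"handoff":true,"snapshot":"/tmp/d.snap"}
```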
<file name="packages/pty-daemon/src/Pty/Pty.ts">
<violation number="1" location="packages/pty-daemon/src/Pty/Pty.ts:107">
P2: The `throw` for "not a directory" is inside the try block so it's immediately caught by the sibling catch, producing a confusing wrapped message like `"spawn: cwd not accessible: /path (spawn: cwd is not a directory: /path)"`. Move the directory check after the try/catch or re-throw before entering the catch path.</violation>
<violation number="2" location="packages/pty-daemon/src/Pty/Pty.ts:178">
P1: Missing `'error'` listener on `this.writer`. If a write fails (e.g., shell just exited), the stream emits `'error'` with no handler, crashing the daemon with an uncaught exception. Add an error listener that triggers the `onExit` path (similar to the reader).</violation>
</file>
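The missing-listener fix in violation 2 can be sketched as follows (the wiring is an assumption based on the finding, not the repo's actual code):

```typescript
import { Writable } from "node:stream";

// Sketch: without an 'error' listener, a failed write (e.g. the shell just
// exited) makes the stream emit an unhandled 'error' and crashes the daemon.
// With one, the failure is routed into the session's exit path instead.
function guardWriter(writer: Writable, onWriterError: (reason: string) => void): void {
  writer.on("error", (err) => onWriterError(`writer error: ${err.message}`));
}

// Demo: a stream whose writes always fail, as when the PTY is already closed.
const failing = new Writable({
  write(_chunk, _encoding, callback) {
    callback(new Error("EIO"));
  },
});
guardWriter(failing, (reason) => console.log(reason)); // prints "writer error: EIO"
failing.write("x");
```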
<file name="packages/pty-daemon/src/Server/Server.ts">
<violation number="1" location="packages/pty-daemon/src/Server/Server.ts:200">
P1: The successor inherits `SUPERSET_PTY_DAEMON_VERSION` from the predecessor's environment, causing it to report the old version. This makes the supervisor think the upgrade didn't take effect (`updatePending` stays true), and on the next host-service restart the auto-update fires again — repeating indefinitely.
Strip or delete this env var before spawning the successor so it falls back to `readPackageVersion()` on the new binary.</violation>
<violation number="2" location="packages/pty-daemon/src/Server/Server.ts:501">
P2: Missing `disconnect` listener in `waitForHandoffAck`: if the child's IPC channel closes without the process exiting (e.g., child calls `process.disconnect()` before sending ack), the function waits the full 5s timeout instead of settling immediately. Consider adding a `disconnect` handler to settle with a failure.</violation>
</file>
<file name="apps/desktop/src/main/pty-daemon/index.ts">
<violation number="1" location="apps/desktop/src/main/pty-daemon/index.ts:116">
P1: `readSnapshot()` is outside the `try/catch` that sends `upgrade-nak`. If the snapshot is missing or corrupt, the error propagates to `main().catch()` without notifying the predecessor, which then burns the full 5s ack timeout instead of getting an immediate failure reason. Move this call inside the try block that sends the nak.</violation>
</file>
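The nak-on-failure contract this finding asks for can be sketched like so (`readSnapshot` and `adopt` are stand-ins for the real helpers; message shapes follow the PR's control protocol):

```typescript
type ControlMsg =
  | { type: "upgrade-ack"; successorPid: number }
  | { type: "upgrade-nak"; reason: string };

// Hypothetical successor entry point. The key point from the finding:
// readSnapshot runs INSIDE the try, so a missing or corrupt snapshot yields
// an immediate nak instead of letting the predecessor burn the ack timeout.
function receiveHandoff(
  readSnapshot: () => unknown,
  adopt: (snapshot: unknown) => void,
  send: (msg: ControlMsg) => void,
): boolean {
  try {
    const snapshot = readSnapshot(); // may throw on missing/corrupt file
    adopt(snapshot);
    send({ type: "upgrade-ack", successorPid: process.pid });
    return true;
  } catch (err) {
    send({ type: "upgrade-nak", reason: err instanceof Error ? err.message : String(err) });
    return false;
  }
}

// Demo: a failing readSnapshot produces a nak rather than a silent hang.
const sent: ControlMsg[] = [];
receiveHandoff(
  () => {
    throw new Error("snapshot missing");
  },
  () => {},
  (msg) => sent.push(msg),
);
console.log(JSON.stringify(sent[0])); // → {"type":"upgrade-nak","reason":"snapshot missing"}
```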
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
      reason: `successor exited before ack (code=${code} signal=${signal})`,
    });
  };
  child.on("message", onMessage);
P2: Missing disconnect listener in waitForHandoffAck: if the child's IPC channel closes without the process exiting (e.g., child calls process.disconnect() before sending ack), the function waits the full 5s timeout instead of settling immediately. Consider adding a disconnect handler to settle with a failure.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/pty-daemon/src/Server/Server.ts, line 501:
<comment>Missing `disconnect` listener in `waitForHandoffAck`: if the child's IPC channel closes without the process exiting (e.g., child calls `process.disconnect()` before sending ack), the function waits the full 5s timeout instead of settling immediately. Consider adding a `disconnect` handler to settle with a failure.</comment>
<file context>
@@ -253,6 +459,56 @@ export class Server {
+ reason: `successor exited before ack (code=${code} signal=${signal})`,
+ });
+ };
+ child.on("message", onMessage);
+ child.on("exit", onExit);
+ const timer = setTimeout(() => {
</file context>
child.on("message", onMessage);
child.on("disconnect", () => {
  settle({ ok: false, reason: "successor IPC channel disconnected before ack" });
});
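Putting the suggestion in context, a settle-once ack wait can be sketched as follows (names follow the review thread; the details are assumptions, not the repo's code). Message, exit, disconnect, and a timeout race; the first event wins and all listeners are removed so nothing double-settles:

```typescript
import type { ChildProcess } from "node:child_process";
import { EventEmitter } from "node:events";

type AckResult = { ok: true } | { ok: false; reason: string };

function waitForHandoffAck(child: ChildProcess, timeoutMs = 5000): Promise<AckResult> {
  return new Promise((resolve) => {
    let settled = false;
    const settle = (result: AckResult) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      child.off("message", onMessage);
      child.off("exit", onExit);
      child.off("disconnect", onDisconnect);
      resolve(result);
    };
    const onMessage = (msg: unknown) => {
      const m = msg as { type?: string; reason?: string };
      if (m?.type === "upgrade-ack") settle({ ok: true });
      else if (m?.type === "upgrade-nak") settle({ ok: false, reason: m.reason ?? "nak" });
    };
    const onExit = (code: number | null, signal: string | null) =>
      settle({ ok: false, reason: `successor exited before ack (code=${code} signal=${signal})` });
    const onDisconnect = () =>
      settle({ ok: false, reason: "successor IPC channel disconnected before ack" });
    child.on("message", onMessage);
    child.on("exit", onExit);
    child.on("disconnect", onDisconnect);
    const timer = setTimeout(() => settle({ ok: false, reason: "ack timeout" }), timeoutMs);
  });
}

// Demo with a bare EventEmitter standing in for the child process:
const fake = new EventEmitter() as unknown as ChildProcess;
const pending = waitForHandoffAck(fake, 1000);
fake.emit("disconnect");
pending.then((r) => console.log(JSON.stringify(r)));
// → {"ok":false,"reason":"successor IPC channel disconnected before ack"}
```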
Net: 212 deletions, 87 insertions across 10 files. No behavior changes; all 453 unit + 12 daemon-integration + 7 adoption-e2e tests still pass.

What was trimmed:
- Call-site comments that re-explained their helper's docstring (waitForPidExit, probeDaemonVersionWithRetry, replayOnAdoption JSDoc).
- Wishful comments that didn't match the code (Server.ts: "log child stderr separately" before code that just logs the exit code).
- Multi-paragraph history-of-the-test in terminal.adoption.node-test.ts — the new assertion stands on its own; the obsolete-approach postmortem was clutter.
- "Auto-update fires on every adopt with version drift / heavy path" said three different ways across the describe block, test header, and inline.
- The stale "scriptPath points to a missing file" comment in the auto-update-fail integration test (the test uses a runUpdate override, not scriptPath injection — the comment lied).

What was extracted:
- seedPredecessor + getInstancePid helpers in DaemonSupervisor.test.ts: removed two ~20-line `(sup as unknown as ...).instances.set(...)` blocks per test; both auto-update-failure tests now read as 8-10 lines of the actual assertion they make.

What was kept:
- Real why-comments: env-strip rationale (1 line), readPackageVersion in handoff mode (1 line), writer-error listener (1 line), atomic-rename writeSnapshot, snapshot file-format header.
- Critical-contract assertions in the auto-update-failure test.
Three places hand-edited together for every daemon version bump:
- packages/pty-daemon/package.json#version
- packages/pty-daemon/src/index.ts#DAEMON_PACKAGE_VERSION
(inlined into the desktop bundle by electron-vite — package.json
can't be readFileSync'd at runtime in a bundled binary)
- packages/host-service/src/daemon/expected-version.ts#EXPECTED_DAEMON_VERSION
(what host-service compares running daemon against; drift here =
false updatePending or false "running == expected")
The 0.1.0 → 0.2.0 bump in this PR was nearly the exact mistake this
test catches: hand-edit one, ship without the others, the bundle
silently reports the wrong version. Verified the test fails when any
of the three sources drifts.
Also tightened expected-version.ts's lockstep comment now that the
test replaces the "rely on PR review" hand-wave.
Three hand-edited copies of "what version is the daemon" collapse to
one: pty-daemon/package.json. The other two now derive at compile time
via JSON imports.
packages/pty-daemon/src/index.ts
DAEMON_PACKAGE_VERSION = packageJson.version (was: "0.2.0" literal)
packages/host-service/src/daemon/expected-version.ts
EXPECTED_DAEMON_VERSION = ptyDaemonPackageJson.version (was: literal)
packages/pty-daemon/src/main.ts
DAEMON_VERSION = packageJson.version (replaces readPackageVersion()
helper — 11 lines + fileURLToPath/fs/path imports gone)
Bundlers (Bun for the standalone, electron-vite for desktop) inline
the JSON at build time, so the runtime cost is identical to the
literals. Verified: all 453 host-service unit + 12 daemon integration
+ 7 adoption-e2e tests still pass; daemon bundle still reports 0.2.0
end-to-end.
Net effect for future bumps: edit pty-daemon/package.json, rebuild,
done. The lockstep test (added in the prior commit) is dropped because
drift is now structurally impossible — the constants don't exist as
independent values to drift from.
Tradeoff acknowledged: every daemon-version bump now forces an
auto-update for users on the older version (no more "ship a quiet
daemon patch"). Discussed and accepted — daemon bumps are expected to
be rare and significant in this codebase, so the simpler model wins.
Required pty-daemon/package.json#exports to add "./package.json":
"./package.json" so host-service can import it (Node's exports field
is deny-by-default for subpaths).
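The shape of the change can be sketched like this (the object literal below is a stub for the real JSON import, which bundlers inline at build time):

```typescript
// Stand-in for:
//   import packageJson from "../package.json" with { type: "json" };
// Bun and electron-vite inline the JSON at build time, so reading .version
// costs the same as a string literal. This object is a hypothetical stub.
const packageJson = { name: "@superset/pty-daemon", version: "0.2.0" } as const;

// Every version constant now derives from one source instead of being a
// hand-edited literal, so drift between them is structurally impossible.
const DAEMON_PACKAGE_VERSION: string = packageJson.version;
const EXPECTED_DAEMON_VERSION: string = packageJson.version;

console.log(DAEMON_PACKAGE_VERSION === EXPECTED_DAEMON_VERSION); // → true
```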
Phase 1 (PR #3896) and Phase 2 (PR #3971) are both shipped — moved the planning docs from apps/desktop/plans/ into apps/desktop/plans/done/ per repo convention. Three plans that were untracked in the working worktree (Phase 1 host-service-migration, implementation-report, version-detection proposal) are committed at the same time so future readers can trace the design journey.

Refreshed two long-lived reference docs to match shipped reality:
- packages/host-service/DAEMON_SUPERVISION.md
  * EXPECTED_DAEMON_VERSION is no longer hand-edited — it derives from pty-daemon/package.json#version at compile time.
  * The "Phase 2 deferred — daemon upgrades currently kill sessions" section is replaced with the Phase-2-shipped walkthrough (handoff flow, restoreOnFailure contract, the argv-not-env mode signal, bundle-marker check script).
  * "Bumping the daemon version" instructions now describe the single-source-of-truth path.
- packages/pty-daemon/README.md
  * Header references both Phase 1 and Phase 2 plans.
  * Layout diagram and test-suite descriptions match what's actually in test/ today (byte-fidelity, handoff, signal-recovery, no-encoding-hops).
  * "Out of scope" lists the real outstanding gap (since-byte-N cursor) instead of Phase 2.

Excluded **/plans from biome's lint/format scope. plans/ holds docs plus occasional scratch code (the pty-handoff-experiment harness uses patterns biome flags); production-quality lint expectations don't apply there.
…pty-daemon README (#3851)

* docs(desktop): survey of PTY-survival architectures across host-service upgrades

  Exhaustive survey of options for keeping node-pty shells alive when the host-service binary upgrades. Covers serialize+replay (VS Code style), SCM_RIGHTS fd-passing (with node-pty and direct forkpty), long-lived pty-daemon, tmux-as-backend, exec() in-place, OS supervisor, CRIU, and mosh-style sync. Includes findings from a working PoC of the SCM_RIGHTS path at ~/workplace/pty-handoff-poc/, a comparison table, and a phased recommendation (visual-replay floor now, daemon + fd-passing later). No code changes; doc only.

* docs(desktop): pty-daemon implementation plan + Phase 0 reliability harness

  Adds a concrete implementation plan for Architecture E (pty-daemon + fd-handoff for daemon upgrades) selected by the prior survey, plus the working Phase 0 reliability harness whose results justify the choice of runtime and primitive.
  - 20260429-pty-daemon-implementation.md: scoped Phase 1 (~2 weeks) + Phase 2 (~1 week) plan. Locked decisions (Node + node-pty runtime, Unix-socket protocol with versioned handshake, ring buffer on daemon, fd inheritance for handoff). Package layout, manifest design, host-service integration points, telemetry, open decisions.
  - 20260428-pty-handoff-survival-architectures.md: revised survey with empirical Phase 0 results (Go + creack/pty: 20k handoffs, 100% survival, 0 byte loss on macOS arm64) and a Phase 0 follow-up that disproves the original C1 conclusion (node-pty + macOS works for handoff; the real failure mode is the master-fd refcount, not the spawn helper). Recommendation now defaults to a Node daemon.
  - pty-handoff-experiment/: Go harness (creack/pty + UnixRights) and Node test scripts (node-pty + stdio fd inheritance) used to validate the primitive on macOS arm64.
* docs(desktop): pty-daemon plan — explicit v1-bloat avoidance

  Adds a "Lessons from v1 (do not repeat)" section calling out the specific responsibilities that bloated v1's daemon to ~1330 LOC and mapping each to its v2 home (host-service) so they don't creep back in. Concrete additions:
  - Locked-decisions table reinforced: daemon is "stateless from the client's perspective"; ~500–700 LOC budget; in-memory ring buffer only (no SQLite-backed history).
  - Itemized v1-vs-v2 anti-pattern table: HistoryWriter, cold restore, killed-session tombstones, workspace auth in daemon, port scanner, analytics, EventEmitter subscriber model, request dedup, priority semaphore, ANSI parsing, sticky cross-call state, deferred cleanup.
  - "Where the v1 daemon's bloat lands in v2": mapping each removed concern to the upgrade-frequent layer (host-service) where state belongs.
  - Patterns from v1 worth keeping: per-session snapshot on attach, resize bounds, signal abstraction, AbortSignal cancellation, simple FIFO concurrency cap.
  - References to v1 daemon files (daemon-manager.ts: 1009 LOC, history-manager.ts: 245 LOC, priority-semaphore.ts: 76 LOC) so reviewers can see what we're leaving behind.

* docs(desktop): listener-fd handoff spike for Phase 2

  Validates that a parent process can dup an AF_UNIX listening fd into a child via the stdio array, that the child can accept() on it via server.listen({ fd }), and that after the parent exits the kernel keeps the in-kernel socket alive via the child's fd refcount. PASSes on macOS arm64 / Node 24. Forced two architectural revisions to the Phase 2 plan:
  1. The old daemon spawns the successor — the supervisor never holds the listener fd. Avoids the daemon→supervisor fd-transfer problem entirely.
  2. Don't call server.close() on the handoff path. Node's net.Server.close() unlinks the path's directory entry; even though the successor holds a dup'd fd, clients see ENOENT on connect. process.exit(0) skips Node's close handlers and preserves the path.
  The spike led to dropping listener inheritance from D1 entirely (PTY fds are the only thing that genuinely cannot blink); see the updated plan.

* docs(desktop): Phase 2 audit + implementation plan

  Audit (20260501-...-audit.md): surveys post-Phase-1 elastic-lens for fd-handoff readiness. Confirms no critical blockers — Phase 0 already proved master-fd inheritance, and the supervisor's restart() is a clean integration point.
  Plan (20260501-...-implementation.md): bakes in 6 decisions from a /decide walkthrough, sketches the daemon-spawns-successor architecture that the listener-handoff spike forced, and sequences 10 steps across 9 commits. D1 ("what gets inherited") was revised after the spike + user pushback to PTY master fds only. host-service is the only client of the daemon socket and already reconnects after host-service-restart adoptions — adding listener inheritance just to avoid a ~100ms reconnect blip wasn't worth the foot-guns.

* docs(daemon): archive shipped plans + refresh post-Phase-2 references

  Phase 1 (PR #3896) and Phase 2 (PR #3971) are both shipped — moved the planning docs from apps/desktop/plans/ into apps/desktop/plans/done/ per repo convention. Three plans that were untracked in the working worktree (Phase 1 host-service-migration, implementation-report, version-detection proposal) are committed at the same time so future readers can trace the design journey.
  Refreshed two long-lived reference docs to match shipped reality:
  - packages/host-service/DAEMON_SUPERVISION.md: EXPECTED_DAEMON_VERSION is no longer hand-edited — it derives from pty-daemon/package.json#version at compile time; the "Phase 2 deferred — daemon upgrades currently kill sessions" section is replaced with the Phase-2-shipped walkthrough (handoff flow, restoreOnFailure contract, the argv-not-env mode signal, bundle-marker check script); "Bumping the daemon version" instructions now describe the single-source-of-truth path.
  - packages/pty-daemon/README.md: header references both Phase 1 and Phase 2 plans; layout diagram and test-suite descriptions match what's actually in test/ today (byte-fidelity, handoff, signal-recovery, no-encoding-hops); "Out of scope" lists the real outstanding gap (since-byte-N cursor) instead of Phase 2.
  Excluded **/plans from biome's lint/format scope. plans/ holds docs plus occasional scratch code (the pty-handoff-experiment harness uses patterns biome flags); production-quality lint expectations don't apply there.
Recorded as integrated via -s ours after batch PRs #455-#464. Taken via individual PRs:
- PR 1 (#455): v2 polish first-half safe set (9 commits)
- PR 2 (#456): v2/host-service polish middle batch (12 commits)
- PR 3 (#457): sidebar polish + jwt API (5 commits)
- PR 4 (#458): host-service tRPC retry/cache/timeout (3 commits)
- PR 5 (#459): v2 diff pane / file pane polish (2 commits)
- PR 7 (#462): host-service v2 canonical workspace.create + attachment store (PR1 superset-sh#3893 + PR2 superset-sh#3916)
- PR 11 (#463): agents API + onboarding (7 commits + 1 cleanup)
- PR 12 (#464): v1→v2 import flow rewrite (11 commits + 2 follow-ups)
- PR 13 (#460): host-service shell env probe + typo (2 commits)
- PR 16 (#461): marketplace 19 themes (1 commit)

Skipped / deferred (recorded as integrated for behind=0):
- PR 6: CLI v1 launch (superset-sh#3898 + 30+ CLI/SDK followups) — defer to dedicated migration
- PR 9: v2 PR3 (superset-sh#3940) + revert (superset-sh#4017) — net-zero pair
- PR 10: pty-daemon (superset-sh#3896, superset-sh#3971, superset-sh#4054) — fork keeps its terminal-host
- PR 14: Slack MCP-v2 (superset-sh#4197, superset-sh#4208) — depends on mcp-v2/sdk divergence
- PR 15: onboarding remaining (superset-sh#4115, superset-sh#4125, superset-sh#4214, superset-sh#4213, superset-sh#4222, superset-sh#4225) — depends on fork's deleted setup pages

Behind: 0 after this merge.
Summary
Phase 2 of the pty-daemon migration: PTY sessions now survive daemon-binary upgrades. Pre-Phase-2, "Restart and update" SIGTERMs the daemon and all shells die in the gap. After this PR, the user clicks Update (or the upgrade happens automatically) and the new daemon binary takes over via fd inheritance — original shell PIDs survive untouched.
Phase 1 (PR #3896) made the daemon survive host-service restarts. Phase 2 makes it survive its own binary swap.
Manually verified in dev today: opened 3 terminal panes, clicked Update three times in succession (PID chain 62492 → 51479 → 9048 → 62648); all 3 sessions were preserved at every step with the same shell PIDs.

How it works
The kernel's fd refcount holds the master fds alive across the predecessor's exit (stdio inheritance is a dup, not a transfer). Shells never see SIGHUP.
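The dup-not-transfer mechanics can be demonstrated with a plain file fd standing in for a PTY master fd (the path and fd slot below are illustrative):

```typescript
import { spawnSync } from "node:child_process";
import { closeSync, openSync, readFileSync, unlinkSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical demo of the primitive: the parent places an open fd in the
// child's stdio array; the kernel dups it for the child, so the description
// outlives the parent's copy.
const demoPath = join(tmpdir(), `fd-handoff-demo-${process.pid}.txt`);
const fd = openSync(demoPath, "w");

// The child sees our fd as its fd 3 and writes through the dup.
spawnSync(process.execPath, ["-e", 'require("fs").writeSync(3, "adopted")'], {
  stdio: ["ignore", "ignore", "ignore", fd],
});

closeSync(fd); // parent's copy is gone; the dup already carried the write
const adopted = readFileSync(demoPath, "utf8");
console.log(adopted); // → adopted
unlinkSync(demoPath);
```

In the real handoff the inherited fds are PTY masters rather than files, which is why the shell on the slave side never notices the swap.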
Auto-update on host-service upgrade
When host-service boots and adopts a daemon whose runningVersion < EXPECTED_DAEMON_VERSION (i.e. host-service was just upgraded but the detached daemon is still on the old binary), the supervisor automatically calls update() in the background. Sessions survive on success. On failure, the old daemon stays running and the user retries via the Update button — we never silently force-destroy sessions.

Concurrency is coalesced via a per-org updateInFlight map: a manual Update click and the auto-update can race; the second caller observes the same outcome via the cached promise.

Decisions (full /decide walkthrough in apps/desktop/plans/20260501-pty-daemon-phase2-implementation.md; among them: the successor signals adoption with upgrade-ack over IPC, then the predecessor exits)

Commits (17)
Notable bugs caught + fixed during dev verification
Bundler DCE eliminated runHandoffReceiver (commit 0ecb2b550). Both Bun's bundler AND electron-vite's esbuild statically inline process.env.X references at build time, constant-fold the unset-at-build-time comparison, and dead-code-eliminate the entire receiver branch. The test suite ran source-direct or against the Bun-built package bundle; neither covered the desktop shim's electron-vite output where the failure mode lived. Fix: the handoff signal moves from an env var to a --handoff argv flag (fully dynamic, so it can't be statically analyzed away). The desktop shim was a separate copy of the pre-Phase-2 main() that needed to mirror the package's main; it's now in sync.

Regression-armored (commit a2d43c236): apps/desktop/scripts/check-pty-daemon-bundle.ts runs after compile:app and greps the desktop bundle for runHandoffReceiver, --handoff (≥2), upgrade-ack, adoptSnapshot, adoptFromFd. The build fails fast if any marker is missing.

Test plan
- bun test passes in packages/pty-daemon (34 tests, 6 files) and packages/host-service (450+ tests, 50 files)
- node --experimental-strip-types --test test/handoff.test.ts in packages/pty-daemon (real cross-process daemon spawn; asserts the shell PID survives)
- node --experimental-strip-types --test src/daemon/DaemonSupervisor.node-test.ts in packages/host-service (9 real-spawn tests including auto-update)
- bun run check:pty-daemon-bundle in apps/desktop after compile:app (5/5 markers present)
- bun dev, open several v2 terminal panes, Settings → Manage daemon → Update daemon. Sessions stay alive; toast confirms.

Out of scope
- apps/desktop/plans/pty-handoff-experiment/ exists for future stress runs).
- No EXPECTED_DAEMON_VERSION bump. This PR adds handoff plumbing but doesn't change daemon wire behavior, so no version bump. Bump when the next change actually changes the daemon's protocol surface.
- apps/desktop/plans/pty-handoff-experiment/; ship validation lives there.

Documentation trail
- apps/desktop/plans/20260501-pty-daemon-phase2-audit.md — current-state audit before design (on the design-doc branch)
- apps/desktop/plans/20260501-pty-daemon-phase2-implementation.md — decision log + step sequencing (on the design-doc branch)
- apps/desktop/plans/pty-handoff-experiment/listener-handoff/ — Phase 2 spike that forced the D1 revision (on the design-doc branch)
- packages/host-service/DAEMON_SUPERVISION.md — Phase 1 reference, unchanged here (will need a follow-up edit to mention Phase 2)

Summary by cubic
Preserve PTY sessions across daemon binary upgrades using fd-handoff. Also fixes the post-upgrade version-reporting bug that kept “Update available” after successful updates.
Refactors
- EXPECTED_DAEMON_VERSION and DAEMON_PACKAGE_VERSION now derive from @superset/pty-daemon/package.json; package exports add ./package.json; bumped @superset/pty-daemon to 0.2.0.
- Handoff control channel reuses the length-prefixed framing (encodeFrame/FrameDecoder); snapshots are .snap files; snapshot helpers (clearSnapshot, readSnapshot, writeSnapshot) are exported for the desktop shim.
- Switched the bundle-check script to fileURLToPath for Windows path compatibility.

Bug Fixes
- The daemon ignores SUPERSET_PTY_DAEMON_VERSION in handoff mode and reads its own bundle version; the supervisor retries the version probe after predecessor exit to cover the bind gap.
- Wrapped readSnapshot() in the successor's try/catch, replying upgrade-nak on failure; unlink the snapshot on a failed handoff; add an error listener to the AdoptedPty writer; clearer spawn error by moving the cwd-not-directory check outside the catch.
- host-service strips SUPERSET_PTY_DAEMON_VERSION from the successor's env during handoff, waits for the predecessor PID to fully exit before probing the new socket version, and now bails with ok:false on timeout instead of probing anyway.
- host-service: prevent duplicate terminal output on daemon swap by skipping ring-buffer replay on WS reconnect, keyed off the first received byte (not just socket open) and plumbed as replayOnAdoption.

Written for commit 8416696. Summary will update on new commits.
Summary by CodeRabbit