feat(ipc): extract socket watchdog into shared @vellumai/ipc-server-utils + wire into assistant & skill servers#29825
Conversation
…tils Generalizes the resilience pattern shipped in #29771 (gateway IPC socket watchdog) into a reusable package and wires it into the assistant + skill IPC servers, which suffer from the same Bug 3 failure mode: when the listening socket file is removed from disk (e.g. by a tmpfs sweep or rogue cleanup of /run/*), already-connected clients keep working but new connect() calls hit ENOENT, silently fail-closing whatever the consumer gates on the IPC. For gateway threshold reads this manifested as the 'silent Strict when Relaxed' bug fixed in #29771. The new package lives at packages/ipc-server-utils/ and exposes: - SocketWatchdog: stat-based polling rebind with shutdown-race + restart-race guards (snapshots the consumer's server reference before await listen(), re-checks after, discards the new listener and unlinks the path it recreated if state moved). - ensureSocketDir: shared helper for creating /run/* with mode 0o700. Gateway server delegates to SocketWatchdog (445→352 LoC; integration test slimmed to 189 LoC since unit-level concerns now live in the package). Assistant server gains AssistantIpcServerOptions { watchdogIntervalMs } + legacyServers Set for graceful drain. Same shape for skill-server (SkillIpcServerOptions). Both extract createListeningServer() so the watchdog can build a fresh server on rebind. Tests: 9/9 watchdog unit + 12/12 skill-server + 61/61 IPC tests in isolation. Gateway/assistant typecheck + lint + lint:unused all clean. Knip configured to ignore the new internal package, matching existing @vellumai/* convention. Refs #29771.
There was a problem hiding this comment.
🔴 @vellumai/ipc-server-utils missing from bundledDependencies breaks npm pack
The new @vellumai/ipc-server-utils dependency uses a file: specifier (file:../packages/ipc-server-utils) in assistant/package.json:47 but is not listed in the bundledDependencies array. Every other @vellumai/* file-specifier dependency is in bundledDependencies. The prepack script (scripts/prepack-bundled-deps.mjs:25-34) explicitly validates this invariant — it iterates all file: dependencies, checks membership in bundledDependencies, and calls process.exit(1) if any are missing. This will cause the assistant's prepack step and consequently npm pack / npm publish to fail.
(Refers to lines 76-85)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 51125189c8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "@vellumai/credential-storage": "file:../packages/credential-storage", | ||
| "@vellumai/egress-proxy": "file:../packages/egress-proxy", | ||
| "@vellumai/gateway-client": "file:../packages/gateway-client", | ||
| "@vellumai/ipc-server-utils": "file:../packages/ipc-server-utils", |
There was a problem hiding this comment.
Copy new local package into assistant Docker build context
Adding @vellumai/ipc-server-utils as a file:../packages/... dependency here requires /app/packages/ipc-server-utils to exist before bun install --frozen-lockfile runs in the assistant image build. assistant/Dockerfile currently copies only ces-client, service-contracts, credential-storage, egress-proxy, gateway-client, skill-host-contracts, slack-text, and twilio-client (lines 20–27), so Docker builds that install assistant dependencies will fail to resolve this local package.
Useful? React with 👍 / 👎.
| "dependencies": { | ||
| "@vellumai/assistant-client": "file:../packages/assistant-client", | ||
| "@vellumai/ces-client": "file:../packages/ces-client", | ||
| "@vellumai/ipc-server-utils": "file:../packages/ipc-server-utils", |
There was a problem hiding this comment.
Copy new local package into gateway Docker build context
This new file:../packages/ipc-server-utils dependency is not matched by a corresponding COPY packages/ipc-server-utils ... in gateway/Dockerfile before bun install --frozen-lockfile. The gateway Docker build currently copies only assistant-client, ces-client, service-contracts, slack-text, and twilio-client (lines 12–16), so dependency installation in containerized builds will fail with a missing local package path.
Useful? React with 👍 / 👎.
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
Devin caught: scripts/prepack-bundled-deps.mjs validates that every 'file:' specifier in dependencies is also listed in bundledDependencies and exits 1 otherwise (because npm pack does not follow the symlink-based node_modules layout that bun creates for file: deps). Without this entry the prepack hook would fail on 'npm pack' / 'npm publish'. Verified locally: prepack now materializes 9 bundled packages including the new one and exits 0.
…texts Codex caught: both Dockerfiles fail to build the new file: dependency because /app/packages/ipc-server-utils doesn't exist before 'bun install --frozen-lockfile' runs. Mirrors the existing pattern for slack-text and twilio-client (devDeps-only packages, no inner install needed at runtime since they have no runtime dependencies).
|
Pushed Both cycle-1 P1s now addressed:
|
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
Summary
Generalizes the resilience pattern from #29771 (gateway IPC socket watchdog) into a reusable package and wires it into the assistant and skill IPC servers, which share the same Bug 3 failure mode.
When the listening socket file is removed from disk (e.g. by a tmpfs sweep or a rogue cleanup of
/run/*), already-connected clients keep working — but newconnect()calls hitENOENT, silently fail-closing whatever the consumer gates on the IPC. For gateway threshold reads, this is the "silent Strict when Relaxed" bug fixed in #29771; for the assistant server, it would silently break CLI ↔ daemon RPCs; for the skill IPC server, skills running across daemon restarts would lose their connection path.This PR is the second half of that fix: same blast radius, same recovery, now shared.
What's new
packages/ipc-server-utils/— new internal package, mirrors the shape ofslack-textetc.SocketWatchdog— stat-based polling rebind with both race guards Codex caught on the gateway-only version:getServer()beforeawait listen(), re-checks after. If the reference moved (null = stop, different = restart/concurrent rebind), closenewServerandunlink()the path it just recreated.rebindIfMissing()in.catch(err => log.error(...)).ensureSocketDir(socketPath)— exported standalone (not a method) because both the gateway and assistant servers also call it during their initialstart().mode: 0o700only applies to dirs the helper creates.getServer,createServer,onRebind) so consumers can implement their own legacy-server tracking — keeps the watchdog stateless about consumer internals.Gateway — delegates to
SocketWatchdog(445 → 352LoC). Integration test slimmed (304 → 189LoC) since unit-level concerns now live in the package.Assistant —
AssistantIpcServerandSkillIpcServereach gain an options object ({ watchdogIntervalMs?: number }) and alegacyServers: Set<Server>for graceful client drain. Both extract acreateListeningServer()factory so the watchdog can build a fresh server on rebind.Bug 3 surface coverage
/run/gateway-ipc/gateway.sock/run/assistant-ipc/assistant.sock/workspace/assistant-skill.sock(overridable)Out of scope (different failure modes — TCP, single-shot bootstrap, HTTP-only, or external-supervisor-respawned):
credential-executor/src/managed-main.ts— bootstrap-and-die, accepts one connection then unlinks itself by designskills/meet-join/daemon/audio-ingest.ts— TCP, port 0oauth2.ts,mcp-oauth-provider.ts,outbound-proxy/server.ts,cli/login.ts)skills/meet-join/bot/src/native-messaging/socket-server.ts— Chrome respawns the NMH botVerification
packages/ipc-server-utils: 9/9 unit tests pass, typecheck clean.gateway: 9/9 IPC tests (7 watchdog integration + 2 multi-client) pass, typecheck + lint + lint:unused clean.assistant: 12/12 skill-server tests + all 61 IPC tests pass when run in isolation, typecheck + lint + lint:unused clean.BROWSER_OPERATION_META, not introduced here.Notes for reviewers
mode: 0o700onmkdirSynconly applies to directories the helper creates — existing/run/*mounts keep their permissions.socket-path.tsmodules ingateway/src/ipc/andassistant/src/ipc/are not consolidated in this PR (they have differentgetWorkspaceDirimport paths and the assistant version has an extragetAssistantSocketPathshorthand). Clean follow-up opportunity.Refs #29771.