fix(gateway): re-bind IPC socket when its path is unlinked#29771
Conversation
The gateway IPC server bound `{workspaceDir}/gateway.sock` (or the
short-path fallback under `/run/gateway-ipc/`) on startup and never
re-checked the path. When the path entry was removed from disk
(observed in the sandbox after a disk-full cleanup swept `/run/*`),
the listening fd in the daemon stayed alive thanks to the kernel
keeping the listener inode, but every new `connect()` from a one-shot
client failed with ENOENT.
The blast radius was silent and broad:
- `ipcCallPersistent` (used by `classify_risk`) kept working — the
persistent connection survived the unlink.
- `ipcCall` one-shots (used by `get_global_thresholds` and
`get_conversation_threshold`) failed every call.
- `gateway-threshold-reader.ts` caught the failure and fail-closed to
the `none` threshold (= Strict), so every tool that needed
threshold-based auto-approve started prompting.
- Bash silently escaped via the `platform_auto_approve` carve-out;
`file_read`, `recall`, etc. did not. Result: a Relaxed threshold
silently behaved like Strict for read-only tools, with no surfaced
error other than ~10 ENOENT WARNs/sec on a hot path.
Fix: add a small watchdog timer (default 5s, set
`watchdogIntervalMs: 0` to disable) that `stat()`s the listening path
and atomically re-binds when missing. Existing connected sockets keep
their accept loop alive via a `legacyServers` set that gets closed
only after their in-flight traffic drains. The hot-path
`classify_risk` connection is unaffected (it never disconnected to
begin with), and one-shot RPCs start working again on the next
watchdog tick instead of needing a daemon restart.
The watchdog also makes `start()` idempotent against transient
`mkdir` failures and re-uses the existing socket cleanup path.
Tests: 7 new unit tests in `ipc-server-watchdog.test.ts` covering
`rebindIfMissing()` no-op + recovery cases, in-flight client
survival, periodic auto-rebind, and timer cleanup on `stop()`.
Existing multi-client suite still passes.
|
@codex full review please |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ac454a77e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const previous = this.server; | ||
| this.server = newServer; | ||
| if (previous) { |
There was a problem hiding this comment.
Abort rebind when shutdown happens mid-listen
If stop() is called while rebindIfMissing() is awaiting newServer.listen(...), shutdown clears this.server and returns, but this code then unconditionally installs newServer as the active server afterward. That resurrects the IPC listener after shutdown (and can keep the process alive or accept new IPC calls unexpectedly). This race is reachable whenever the watchdog fires during teardown and the socket path is missing; add a post-listen guard (e.g., stopped flag/generation check) and close newServer instead of assigning it when shutdown has already started.
Useful? React with 👍 / 👎.
Addresses Codex P1 + Devin findings on PR #29771. Codex P1 — shutdown-mid-rebind race: If stop() ran while rebindIfMissing() was awaiting newServer.listen(), shutdown cleared this.server and returned, but rebindIfMissing then unconditionally installed newServer afterward — resurrecting the IPC listener after shutdown (kept the process alive, accepted new IPC calls). Reachable whenever the watchdog fires during teardown and the socket path is missing. Fix: snapshot this.server as initialServer before the await; after listen() resolves, check this.server === initialServer. If the reference moved (stop() nulled it, restart re-set it, concurrent rebind beat us), close newServer and unlink the path it just recreated, then return false. Devin — unhandled rejection on watchdog tick: setInterval callback called this.rebindIfMissing() without catching the returned promise. ensureSocketDir() / createListening Server() execute before the inner try/catch, so a sync throw (e.g. EACCES on a read-only fs) became an unhandled rejection every 5s. Fix: wrap the timer call in .catch() that logs via log.error. Tests: + rebindIfMissing aborts cleanly when shutdown happens mid-listen (clears this.server while listen is in flight, asserts result is false, path is unlinked, this.server stays null) + watchdog timer catches synchronous rebind errors so unhandled rejections do not escape (replaces ensureSocketDir to throw, asserts process.unhandledRejection sees no events) 10/10 pass on ipc-server-watchdog + ipc-server-multi-client. bunx tsc --noEmit clean. bun run lint shows 0 errors (7 unrelated pre-existing warnings).
|
Fixed both findings in 4f7bfcf: @chatgpt-codex-connector P1 (shutdown-mid-rebind race): Snapshot @devin-ai-integration (unhandled rejection on watchdog tick): Wrapped the Regression coverage:
10/10 pass on ipc-server-watchdog + ipc-server-multi-client. tsc + lint clean. |
Addresses Devin 🔴 finding on PR #29772. The comment narrated history ("the previous implementation emitted one WARN per failure") and referenced PR #29771, both of which violate assistant/AGENTS.md: Comments should describe the current state of the codebase, not narrate its history. Avoid phrases like 'no longer does X', 'previously used Y', or 'was removed in PR Z'. Rewrote the block to describe what the helper does and why, without referencing the old per-call WARN behavior or any specific PR. 16/16 pass on gateway-threshold-reader. tsc clean.
|
@codex review |
|
Codex Review: Didn't find any major issues. Another round soon, please! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
#29772) * fix(threshold-reader): coalesce IPC failure WARNs and surface recovery When the gateway IPC threshold endpoints are unreachable (see PR #29771 for the precipitating bug — socket file unlinked on disk), `getAutoApproveThreshold` falls back to "none" (Strict) on every call. Each fallback emitted one WARN at the call site, producing ~10 WARN/sec on the hot path. The actual signal — "thresholds are broken; we're behaving like Strict for everything other than bash" — drowned in its own log spam and never reached a structured event that could trip a dashboard. Coalesce per-op (`global_thresholds`, `conversation_threshold`) log emissions: - The first failure in a streak still WARNs immediately so failures aren't lost. - Subsequent failures within the WARN window (default 30s, injectable for tests) are counted but not logged. - When the IPC starts working again, an INFO records how long the streak lasted (`streakDurationMs`) and how many calls were swallowed (`swallowedFailures`). That recovery event is the cue dashboards should alert on. - Each WARN/INFO carries a stable `event` field (`ipc_threshold_failure` / `ipc_threshold_recovered`) plus `op` / `consecutiveFailures` so log-based monitors can latch. - Conversation-threshold and global-threshold streaks are tracked independently. "global is broken but conversation works" can happen because they share an IPC connection but not a code path. Tests: 6 new tests in `gateway-threshold-reader.test.ts` covering the first-failure-WARNs case, mid-streak suppression, fresh WARN after the cadence elapses, recovery INFO with swallowed counter, per-op streak independence, and that a successful round-trip returning `null` (no override) clears a streak. All 16 tests in the file pass; broader 32-test gateway/risk suite unaffected. Note on the pino mock: bun's `mock.module("../util/logger.js")` does not intercept transitive imports (see comments in `stt-hints.test.ts`, `avatar-e2e.test.ts`). The working pattern is to mock pino itself; this is documented inline in the test file. * fix(threshold-reader): rewrite helper doc to current-state only Addresses Devin 🔴 finding on PR #29772. The comment narrated history ("the previous implementation emitted one WARN per failure") and referenced PR #29771, both of which violate assistant/AGENTS.md: Comments should describe the current state of the codebase, not narrate its history. Avoid phrases like 'no longer does X', 'previously used Y', or 'was removed in PR Z'. Rewrote the block to describe what the helper does and why, without referencing the old per-call WARN behavior or any specific PR. 16/16 pass on gateway-threshold-reader. tsc clean. --------- Co-authored-by: credence-the-bot <credence-the-bot@users.noreply.github.com>
…tils + wire into assistant & skill servers (#29825) * feat(ipc): extract socket watchdog into shared @vellumai/ipc-server-utils Generalizes the resilience pattern shipped in #29771 (gateway IPC socket watchdog) into a reusable package and wires it into the assistant + skill IPC servers, which suffer from the same Bug 3 failure mode: when the listening socket file is removed from disk (e.g. by a tmpfs sweep or rogue cleanup of /run/*), already-connected clients keep working but new connect() calls hit ENOENT, silently fail-closing whatever the consumer gates on the IPC. For gateway threshold reads this manifested as the 'silent Strict when Relaxed' bug fixed in #29771. The new package lives at packages/ipc-server-utils/ and exposes: - SocketWatchdog: stat-based polling rebind with shutdown-race + restart-race guards (snapshots the consumer's server reference before await listen(), re-checks after, discards the new listener and unlinks the path it recreated if state moved). - ensureSocketDir: shared helper for creating /run/* with mode 0o700. Gateway server delegates to SocketWatchdog (445→352 LoC; integration test slimmed to 189 LoC since unit-level concerns now live in the package). Assistant server gains AssistantIpcServerOptions { watchdogIntervalMs } + legacyServers Set for graceful drain. Same shape for skill-server (SkillIpcServerOptions). Both extract createListeningServer() so the watchdog can build a fresh server on rebind. Tests: 9/9 watchdog unit + 12/12 skill-server + 61/61 IPC tests in isolation. Gateway/assistant typecheck + lint + lint:unused all clean. Knip configured to ignore the new internal package, matching existing @vellumai/* convention. Refs #29771. * fix(assistant): add @vellumai/ipc-server-utils to bundledDependencies Devin caught: scripts/prepack-bundled-deps.mjs validates that every 'file:' specifier in dependencies is also listed in bundledDependencies and exits 1 otherwise (because npm pack does not follow the symlink-based node_modules layout that bun creates for file: deps). Without this entry the prepack hook would fail on 'npm pack' / 'npm publish'. Verified locally: prepack now materializes 9 bundled packages including the new one and exits 0. * fix(docker): copy ipc-server-utils into assistant + gateway build contexts Codex caught: both Dockerfiles fail to build the new file: dependency because /app/packages/ipc-server-utils doesn't exist before 'bun install --frozen-lockfile' runs. Mirrors the existing pattern for slack-text and twilio-client (devDeps-only packages, no inner install needed at runtime since they have no runtime dependencies). --------- Co-authored-by: credence-the-bot <credence@vellum.ai>
Summary
The gateway IPC server bound
{workspaceDir}/gateway.sockon startup and never re-checked the path on disk. When the path entry was removed (observed in this sandbox after a disk-full cleanup swept/run/*), the listening fd in the daemon stayed alive — the kernel keeps the listener inode independently of the path — but every newconnect()from a one-shot client failed withENOENT.This shipped a silent, broad failure mode for permission auto-approve.
Discovery / impact
While debugging "Risk Tolerance = Relaxed silently behaves like Strict for read-only tools," I found:
ipcCallPersistent(used byclassify_risk) kept working — the persistent connection survived the unlink, kernel-side.ipcCallone-shots (used byget_global_thresholdsandget_conversation_threshold) failed every call withENOENT.gateway-threshold-reader.tscaught the failure and fail-closed to thresholdnone(= Strict), so every tool that relied on threshold-based auto-approve started prompting.platform_auto_approvecarve-out inassistant/src/tools/permission-checker.ts:207-227.file_read,recall, etc. had no carve-out → prompted on every call./workspace/data/logs/vellum.logbefore discovery.start()time), but the bug recurred on the next sweep.Fix
Add a small watchdog (default 5s, set
watchdogIntervalMs: 0to disable) onGatewayIpcServerthat:stat()s the listening path on each tick.legacyServersset and gracefullyclose()s it. Already-connected sockets keep their accept loop alive until they drain — the persistentclassify_riskchannel never blinks.connect()s start working again on the next watchdog tick instead of needing a daemon restart.Also extracts
start()helpers (ensureSocketDir,createListeningServer,handleConnection) and exposesrebindIfMissing()publicly so tests can drive the recovery deterministically without sleeping.Tests
7 new unit tests in
gateway/src/__tests__/ipc-server-watchdog.test.ts:rebindIfMissingno-op when the path existsrebindIfMissingno-op beforestart()rebindIfMissingrecreates the path after manualunlink()stop()cancels the timer and cleans up the pathbun test src/__tests__/ipc-server: 8 pass / 0 fail (7 watchdog + 1 multi-client). ESLint clean.bunx tsc --noEmitclean.Risk
unref()on the timer handle prevents it from holding the event loop alive (matters for tests; the daemon's main loop already keeps it alive).Follow-ups (separate PRs)
gateway-threshold-readerENOENT log spam to one warning per N seconds while broken, and emit a structured event so the failure mode surfaces in dashboards.platform_auto_approvecarve-out log loudly when it covers for a broken threshold IPC, so "bash works but reads don't" stops being a silent symptom.Verification on the live failure
After landing the watchdog fix conceptually (container restarted to recover the socket in this session),
recallandfile_readround-tripped silently against the gateway IPC again, and the threshold-reader stopped logging ENOENT.