Skip to content

fix(gateway): re-bind IPC socket when its path is unlinked#29771

Merged
noanflaherty merged 2 commits into
mainfrom
credence/gateway-ipc-socket-watchdog
May 6, 2026
Merged

fix(gateway): re-bind IPC socket when its path is unlinked#29771
noanflaherty merged 2 commits into
mainfrom
credence/gateway-ipc-socket-watchdog

Conversation

@credence-the-bot
Copy link
Copy Markdown
Contributor

@credence-the-bot credence-the-bot Bot commented May 6, 2026

Summary

The gateway IPC server bound {workspaceDir}/gateway.sock on startup and never re-checked the path on disk. When the path entry was removed (observed in this sandbox after a disk-full cleanup swept /run/*), the listening fd in the daemon stayed alive — the kernel keeps the listener inode independently of the path — but every new connect() from a one-shot client failed with ENOENT.

This shipped a silent, broad failure mode for permission auto-approve.

Discovery / impact

While debugging "Risk Tolerance = Relaxed silently behaves like Strict for read-only tools," I found:

  • ipcCallPersistent (used by classify_risk) kept working — the persistent connection survived the unlink, kernel-side.
  • ipcCall one-shots (used by get_global_thresholds and get_conversation_threshold) failed every call with ENOENT.
  • gateway-threshold-reader.ts caught the failure and fail-closed to threshold none (= Strict), so every tool that relied on threshold-based auto-approve started prompting.
  • Bash silently escaped via the platform_auto_approve carve-out in assistant/src/tools/permission-checker.ts:207-227. file_read, recall, etc. had no carve-out → prompted on every call.
  • Symptom on the user side: the threshold UI said "Relaxed", but only bash auto-approved. Other low-risk reads kept asking.
  • The only on-host signal was ~10 ENOENT WARN/sec on the threshold-reader hot path. 41,774 of those WARN lines accumulated in /workspace/data/logs/vellum.log before discovery.
  • A daemon restart fixed it (re-creates the socket at start() time), but the bug recurred on the next sweep.

Fix

Add a small watchdog (default 5s, set watchdogIntervalMs: 0 to disable) on GatewayIpcServer that:

  1. stat()s the listening path on each tick.
  2. If the path is missing, atomically binds a fresh listener at the same path.
  3. Moves the previous listener into a legacyServers set and gracefully close()s it. Already-connected sockets keep their accept loop alive until they drain — the persistent classify_risk channel never blinks.
  4. New connect()s start working again on the next watchdog tick instead of needing a daemon restart.

Also extracts start() helpers (ensureSocketDir, createListeningServer, handleConnection) and exposes rebindIfMissing() publicly so tests can drive the recovery deterministically without sleeping.

Tests

7 new unit tests in gateway/src/__tests__/ipc-server-watchdog.test.ts:

  • rebindIfMissing no-op when the path exists
  • rebindIfMissing no-op before start()
  • rebindIfMissing recreates the path after manual unlink()
  • Fresh client connect + RPC round-trip works after re-bind
  • A pre-existing connected client survives the unlink and keeps round-tripping
  • Periodic watchdog re-binds without manual intervention (25ms interval, deadline 1s)
  • stop() cancels the timer and cleans up the path

bun test src/__tests__/ipc-server: 8 pass / 0 fail (7 watchdog + 1 multi-client). ESLint clean. bunx tsc --noEmit clean.

Risk

  • The fix only kicks in when the path is missing — no behavior change in the healthy steady state.
  • unref() on the timer handle prevents it from holding the event loop alive (matters for tests; the daemon's main loop already keeps it alive).
  • The legacy-server set is bounded by how often the path goes missing — in practice this is "once per disk sweep," not a leak vector.

Follow-ups (separate PRs)

  • PR B (planned): coalesce the gateway-threshold-reader ENOENT log spam to one warning per N seconds while broken, and emit a structured event so the failure mode surfaces in dashboards.
  • PR C (TBD): make the bash platform_auto_approve carve-out log loudly when it covers for a broken threshold IPC, so "bash works but reads don't" stops being a silent symptom.

Verification on the live failure

After landing the watchdog fix conceptually (container restarted to recover the socket in this session), recall and file_read round-tripped silently against the gateway IPC again, and the threshold-reader stopped logging ENOENT.


Open in Devin Review

The gateway IPC server bound `{workspaceDir}/gateway.sock` (or the
short-path fallback under `/run/gateway-ipc/`) on startup and never
re-checked the path. When the path entry was removed from disk
(observed in the sandbox after a disk-full cleanup swept `/run/*`),
the listening fd in the daemon stayed alive thanks to the kernel
keeping the listener inode, but every new `connect()` from a one-shot
client failed with ENOENT.

The blast radius was silent and broad:

- `ipcCallPersistent` (used by `classify_risk`) kept working — the
  persistent connection survived the unlink.
- `ipcCall` one-shots (used by `get_global_thresholds` and
  `get_conversation_threshold`) failed every call.
- `gateway-threshold-reader.ts` caught the failure and fail-closed to
  the `none` threshold (= Strict), so every tool that needed
  threshold-based auto-approve started prompting.
- Bash silently escaped via the `platform_auto_approve` carve-out;
  `file_read`, `recall`, etc. did not. Result: a Relaxed threshold
  silently behaved like Strict for read-only tools, with no surfaced
  error other than ~10 ENOENT WARNs/sec on a hot path.

Fix: add a small watchdog timer (default 5s, set
`watchdogIntervalMs: 0` to disable) that `stat()`s the listening path
and atomically re-binds when missing. Existing connected sockets keep
their accept loop alive via a `legacyServers` set that gets closed
only after their in-flight traffic drains. The hot-path
`classify_risk` connection is unaffected (it never disconnected to
begin with), and one-shot RPCs start working again on the next
watchdog tick instead of needing a daemon restart.

The watchdog also makes `start()` idempotent against transient
`mkdir` failures and re-uses the existing socket cleanup path.

Tests: 7 new unit tests in `ipc-server-watchdog.test.ts` covering
`rebindIfMissing()` no-op + recovery cases, in-flight client
survival, periodic auto-rebind, and timer cleanup on `stop()`.
Existing multi-client suite still passes.
@credence-the-bot
Copy link
Copy Markdown
Contributor Author

@codex full review please

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ac454a77e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread gateway/src/ipc/server.ts Outdated
Comment on lines +247 to +249
const previous = this.server;
this.server = newServer;
if (previous) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort rebind when shutdown happens mid-listen

If stop() is called while rebindIfMissing() is awaiting newServer.listen(...), shutdown clears this.server and returns, but this code then unconditionally installs newServer as the active server afterward. That resurrects the IPC listener after shutdown (and can keep the process alive or accept new IPC calls unexpectedly). This race is reachable whenever the watchdog fires during teardown and the socket path is missing; add a post-listen guard (e.g., stopped flag/generation check) and close newServer instead of assigning it when shutdown has already started.

Useful? React with 👍 / 👎.

devin-ai-integration[bot]

This comment was marked as resolved.

Addresses Codex P1 + Devin findings on PR #29771.

Codex P1 — shutdown-mid-rebind race:
  If stop() ran while rebindIfMissing() was awaiting newServer.listen(),
  shutdown cleared this.server and returned, but rebindIfMissing then
  unconditionally installed newServer afterward — resurrecting the IPC
  listener after shutdown (kept the process alive, accepted new IPC
  calls). Reachable whenever the watchdog fires during teardown and
  the socket path is missing.

  Fix: snapshot this.server as initialServer before the await; after
  listen() resolves, check this.server === initialServer. If the
  reference moved (stop() nulled it, restart re-set it, concurrent
  rebind beat us), close newServer and unlink the path it just
  recreated, then return false.

Devin — unhandled rejection on watchdog tick:
  setInterval callback called this.rebindIfMissing() without
  catching the returned promise. ensureSocketDir() / createListening
  Server() execute before the inner try/catch, so a sync throw
  (e.g. EACCES on a read-only fs) became an unhandled rejection
  every 5s.

  Fix: wrap the timer call in .catch() that logs via log.error.

Tests:
  + rebindIfMissing aborts cleanly when shutdown happens mid-listen
    (clears this.server while listen is in flight, asserts result is
    false, path is unlinked, this.server stays null)
  + watchdog timer catches synchronous rebind errors so unhandled
    rejections do not escape (replaces ensureSocketDir to throw,
    asserts process.unhandledRejection sees no events)

10/10 pass on ipc-server-watchdog + ipc-server-multi-client.
bunx tsc --noEmit clean. bun run lint shows 0 errors (7 unrelated
pre-existing warnings).
@credence-the-bot
Copy link
Copy Markdown
Contributor Author

Fixed both findings in 4f7bfcf:

@chatgpt-codex-connector P1 (shutdown-mid-rebind race): Snapshot this.server as initialServer before the listen() await. After listen resolves, check this.server === initialServer — if the reference moved (stop() nulled it, restart, concurrent rebind), close newServer + unlink the path it recreated, return false. The new server never gets installed and never resurrects the listener after shutdown.

@devin-ai-integration (unhandled rejection on watchdog tick): Wrapped the setInterval callback's rebindIfMissing() call in .catch() that logs via log.error. Sync throws from ensureSocketDir() / createListeningServer() no longer escape as unhandled rejections.

Regression coverage:

  • New test: rebindIfMissing aborts cleanly when shutdown happens mid-listen — kicks off rebind, nulls this.server mid-listen, asserts result is false + path unlinked + server stays null.
  • New test: watchdog timer catches synchronous rebind errors so unhandled rejections don't escape — replaces ensureSocketDir with a thrower, drives the timer for ~5 ticks, asserts process saw zero unhandledRejection events.

10/10 pass on ipc-server-watchdog + ipc-server-multi-client. tsc + lint clean.

credence-the-bot Bot pushed a commit that referenced this pull request May 6, 2026
Addresses Devin 🔴 finding on PR #29772.

The comment narrated history ("the previous implementation emitted
one WARN per failure") and referenced PR #29771, both of which
violate assistant/AGENTS.md:

  Comments should describe the current state of the codebase, not
  narrate its history. Avoid phrases like 'no longer does X',
  'previously used Y', or 'was removed in PR Z'.

Rewrote the block to describe what the helper does and why, without
referencing the old per-call WARN behavior or any specific PR.

16/16 pass on gateway-threshold-reader. tsc clean.
@credence-the-bot
Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Another round soon, please!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

noanflaherty pushed a commit that referenced this pull request May 6, 2026
#29772)

* fix(threshold-reader): coalesce IPC failure WARNs and surface recovery

When the gateway IPC threshold endpoints are unreachable (see PR #29771
for the precipitating bug — socket file unlinked on disk),
`getAutoApproveThreshold` falls back to "none" (Strict) on every
call. Each fallback emitted one WARN at the call site, producing
~10 WARN/sec on the hot path. The actual signal — "thresholds are
broken; we're behaving like Strict for everything other than bash"
— drowned in its own log spam and never reached a structured event
that could trip a dashboard.

Coalesce per-op (`global_thresholds`, `conversation_threshold`)
log emissions:

- The first failure in a streak still WARNs immediately so failures
  aren't lost.
- Subsequent failures within the WARN window (default 30s,
  injectable for tests) are counted but not logged.
- When the IPC starts working again, an INFO records how long the
  streak lasted (`streakDurationMs`) and how many calls were
  swallowed (`swallowedFailures`). That recovery event is the cue
  dashboards should alert on.
- Each WARN/INFO carries a stable `event` field
  (`ipc_threshold_failure` / `ipc_threshold_recovered`) plus
  `op` / `consecutiveFailures` so log-based monitors can latch.
- Conversation-threshold and global-threshold streaks are tracked
  independently. "global is broken but conversation works" can
  happen because they share an IPC connection but not a code path.

Tests: 6 new tests in `gateway-threshold-reader.test.ts` covering
the first-failure-WARNs case, mid-streak suppression, fresh WARN
after the cadence elapses, recovery INFO with swallowed counter,
per-op streak independence, and that a successful round-trip
returning `null` (no override) clears a streak. All 16 tests in
the file pass; broader 32-test gateway/risk suite unaffected.

Note on the pino mock: bun's `mock.module("../util/logger.js")`
does not intercept transitive imports (see comments in
`stt-hints.test.ts`, `avatar-e2e.test.ts`). The working pattern is
to mock pino itself; this is documented inline in the test file.

* fix(threshold-reader): rewrite helper doc to current-state only

Addresses Devin 🔴 finding on PR #29772.

The comment narrated history ("the previous implementation emitted
one WARN per failure") and referenced PR #29771, both of which
violate assistant/AGENTS.md:

  Comments should describe the current state of the codebase, not
  narrate its history. Avoid phrases like 'no longer does X',
  'previously used Y', or 'was removed in PR Z'.

Rewrote the block to describe what the helper does and why, without
referencing the old per-call WARN behavior or any specific PR.

16/16 pass on gateway-threshold-reader. tsc clean.

---------

Co-authored-by: credence-the-bot <credence-the-bot@users.noreply.github.com>
@noanflaherty noanflaherty merged commit 05b8eae into main May 6, 2026
14 checks passed
@noanflaherty noanflaherty deleted the credence/gateway-ipc-socket-watchdog branch May 6, 2026 09:59
noanflaherty pushed a commit that referenced this pull request May 6, 2026
…tils + wire into assistant & skill servers (#29825)

* feat(ipc): extract socket watchdog into shared @vellumai/ipc-server-utils

Generalizes the resilience pattern shipped in #29771 (gateway IPC socket
watchdog) into a reusable package and wires it into the assistant + skill
IPC servers, which suffer from the same Bug 3 failure mode: when the
listening socket file is removed from disk (e.g. by a tmpfs sweep or
rogue cleanup of /run/*), already-connected clients keep working but new
connect() calls hit ENOENT, silently fail-closing whatever the consumer
gates on the IPC. For gateway threshold reads this manifested as the
'silent Strict when Relaxed' bug fixed in #29771.

The new package lives at packages/ipc-server-utils/ and exposes:
  - SocketWatchdog: stat-based polling rebind with shutdown-race + restart-race
    guards (snapshots the consumer's server reference before await listen(),
    re-checks after, discards the new listener and unlinks the path it
    recreated if state moved).
  - ensureSocketDir: shared helper for creating /run/* with mode 0o700.

Gateway server delegates to SocketWatchdog (445→352 LoC; integration test
slimmed to 189 LoC since unit-level concerns now live in the package).

Assistant server gains AssistantIpcServerOptions { watchdogIntervalMs }
+ legacyServers Set for graceful drain. Same shape for skill-server
(SkillIpcServerOptions). Both extract createListeningServer() so the
watchdog can build a fresh server on rebind.

Tests: 9/9 watchdog unit + 12/12 skill-server + 61/61 IPC tests in
isolation. Gateway/assistant typecheck + lint + lint:unused all clean.
Knip configured to ignore the new internal package, matching existing
@vellumai/* convention.

Refs #29771.

* fix(assistant): add @vellumai/ipc-server-utils to bundledDependencies

Devin caught: scripts/prepack-bundled-deps.mjs validates that every
'file:' specifier in dependencies is also listed in bundledDependencies
and exits 1 otherwise (because npm pack does not follow the symlink-based
node_modules layout that bun creates for file: deps). Without this entry
the prepack hook would fail on 'npm pack' / 'npm publish'.

Verified locally: prepack now materializes 9 bundled packages including
the new one and exits 0.

* fix(docker): copy ipc-server-utils into assistant + gateway build contexts

Codex caught: both Dockerfiles fail to build the new file: dependency
because /app/packages/ipc-server-utils doesn't exist before
'bun install --frozen-lockfile' runs. Mirrors the existing pattern for
slack-text and twilio-client (devDeps-only packages, no inner install
needed at runtime since they have no runtime dependencies).

---------

Co-authored-by: credence-the-bot <credence@vellum.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant