Skip to content

feat(ipc): extract socket watchdog into shared @vellumai/ipc-server-utils + wire into assistant & skill servers#29825

Merged
noanflaherty merged 3 commits into
mainfrom
credence/ipc-socket-watchdog-shared
May 6, 2026
Merged

feat(ipc): extract socket watchdog into shared @vellumai/ipc-server-utils + wire into assistant & skill servers#29825
noanflaherty merged 3 commits into
mainfrom
credence/ipc-socket-watchdog-shared

Conversation

@credence-the-bot
Copy link
Copy Markdown
Contributor

@credence-the-bot credence-the-bot Bot commented May 6, 2026

Summary

Generalizes the resilience pattern from #29771 (gateway IPC socket watchdog) into a reusable package and wires it into the assistant and skill IPC servers, which share the same Bug 3 failure mode.

When the listening socket file is removed from disk (e.g. by a tmpfs sweep or a rogue cleanup of /run/*), already-connected clients keep working — but new connect() calls hit ENOENT, silently fail-closing whatever the consumer gates on the IPC. For gateway threshold reads, this is the "silent Strict when Relaxed" bug fixed in #29771; for the assistant server, it would silently break CLI ↔ daemon RPCs; for the skill IPC server, skills running across daemon restarts would lose their connection path.

This PR is the second half of that fix: same blast radius, same recovery, now shared.

What's new

packages/ipc-server-utils/ — new internal package, mirrors the shape of slack-text etc.

  • SocketWatchdog — stat-based polling rebind with both race guards Codex caught on the gateway-only version:
    • Shutdown-mid-rebind race: snapshots getServer() before await listen(), re-checks after. If the reference moved (null = stop, different = restart/concurrent rebind), close newServer and unlink() the path it just recreated.
    • Unhandled rejection in tick: timer callback wraps rebindIfMissing() in .catch(err => log.error(...)).
  • ensureSocketDir(socketPath) — exported standalone (not a method) because both the gateway and assistant servers also call it during their initial start(). mode: 0o700 only applies to dirs the helper creates.
  • Callback-based API (getServer, createServer, onRebind) so consumers can implement their own legacy-server tracking — keeps the watchdog stateless about consumer internals.

Gateway — delegates to SocketWatchdog (445 → 352 LoC). Integration test slimmed (304 → 189 LoC) since unit-level concerns now live in the package.

AssistantAssistantIpcServer and SkillIpcServer each gain an options object ({ watchdogIntervalMs?: number }) and a legacyServers: Set<Server> for graceful client drain. Both extract a createListeningServer() factory so the watchdog can build a fresh server on rebind.

Bug 3 surface coverage

Server Path Before After
Gateway /run/gateway-ipc/gateway.sock #29771 shared
Assistant /run/assistant-ipc/assistant.sock ❌ vulnerable
Skill /workspace/assistant-skill.sock (overridable) ❌ vulnerable

Out of scope (different failure modes — TCP, single-shot bootstrap, HTTP-only, or external-supervisor-respawned):

  • credential-executor/src/managed-main.ts — bootstrap-and-die, accepts one connection then unlinks itself by design
  • skills/meet-join/daemon/audio-ingest.ts — TCP, port 0
  • HTTP servers (oauth2.ts, mcp-oauth-provider.ts, outbound-proxy/server.ts, cli/login.ts)
  • skills/meet-join/bot/src/native-messaging/socket-server.ts — Chrome respawns the NMH bot

Verification

  • packages/ipc-server-utils: 9/9 unit tests pass, typecheck clean.
  • gateway: 9/9 IPC tests (7 watchdog integration + 2 multi-client) pass, typecheck + lint + lint:unused clean.
  • assistant: 12/12 skill-server tests + all 61 IPC tests pass when run in isolation, typecheck + lint + lint:unused clean.
    • 7 IPC tests are unstable when run together as a batch on main as well — pre-existing module-cache issue around BROWSER_OPERATION_META, not introduced here.

Notes for reviewers

  • The watchdog's callback-based API was deliberate: passing the server directly would force the watchdog to know about consumer-specific cleanup (legacy server sets, in-flight client drain). With callbacks, gateway/assistant/skill each track their own legacy servers in whatever way fits their lifecycle.
  • mode: 0o700 on mkdirSync only applies to directories the helper creates — existing /run/* mounts keep their permissions.
  • The duplicated socket-path.ts modules in gateway/src/ipc/ and assistant/src/ipc/ are not consolidated in this PR (they have different getWorkspaceDir import paths and the assistant version has an extra getAssistantSocketPath shorthand). Clean follow-up opportunity.

Refs #29771.


Open in Devin Review

…tils

Generalizes the resilience pattern shipped in #29771 (gateway IPC socket
watchdog) into a reusable package and wires it into the assistant + skill
IPC servers, which suffer from the same Bug 3 failure mode: when the
listening socket file is removed from disk (e.g. by a tmpfs sweep or
rogue cleanup of /run/*), already-connected clients keep working but new
connect() calls hit ENOENT, silently fail-closing whatever the consumer
gates on the IPC. For gateway threshold reads this manifested as the
'silent Strict when Relaxed' bug fixed in #29771.

The new package lives at packages/ipc-server-utils/ and exposes:
  - SocketWatchdog: stat-based polling rebind with shutdown-race + restart-race
    guards (snapshots the consumer's server reference before await listen(),
    re-checks after, discards the new listener and unlinks the path it
    recreated if state moved).
  - ensureSocketDir: shared helper for creating /run/* with mode 0o700.

Gateway server delegates to SocketWatchdog (445→352 LoC; integration test
slimmed to 189 LoC since unit-level concerns now live in the package).

Assistant server gains AssistantIpcServerOptions { watchdogIntervalMs }
+ legacyServers Set for graceful drain. Same shape for skill-server
(SkillIpcServerOptions). Both extract createListeningServer() so the
watchdog can build a fresh server on rebind.

Tests: 9/9 watchdog unit + 12/12 skill-server + 61/61 IPC tests in
isolation. Gateway/assistant typecheck + lint + lint:unused all clean.
Knip configured to ignore the new internal package, matching existing
@vellumai/* convention.

Refs #29771.
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread assistant/package.json
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @vellumai/ipc-server-utils missing from bundledDependencies breaks npm pack

The new @vellumai/ipc-server-utils dependency uses a file: specifier (file:../packages/ipc-server-utils) in assistant/package.json:47 but is not listed in the bundledDependencies array. Every other @vellumai/* file-specifier dependency is in bundledDependencies. The prepack script (scripts/prepack-bundled-deps.mjs:25-34) explicitly validates this invariant — it iterates all file: dependencies, checks membership in bundledDependencies, and calls process.exit(1) if any are missing. This will cause the assistant's prepack step and consequently npm pack / npm publish to fail.

(Refers to lines 76-85)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51125189c8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread assistant/package.json
"@vellumai/credential-storage": "file:../packages/credential-storage",
"@vellumai/egress-proxy": "file:../packages/egress-proxy",
"@vellumai/gateway-client": "file:../packages/gateway-client",
"@vellumai/ipc-server-utils": "file:../packages/ipc-server-utils",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Copy new local package into assistant Docker build context

Adding @vellumai/ipc-server-utils as a file:../packages/... dependency here requires /app/packages/ipc-server-utils to exist before bun install --frozen-lockfile runs in the assistant image build. assistant/Dockerfile currently copies only ces-client, service-contracts, credential-storage, egress-proxy, gateway-client, skill-host-contracts, slack-text, and twilio-client (lines 20–27), so Docker builds that install assistant dependencies will fail to resolve this local package.

Useful? React with 👍 / 👎.

Comment thread gateway/package.json
"dependencies": {
"@vellumai/assistant-client": "file:../packages/assistant-client",
"@vellumai/ces-client": "file:../packages/ces-client",
"@vellumai/ipc-server-utils": "file:../packages/ipc-server-utils",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Copy new local package into gateway Docker build context

This new file:../packages/ipc-server-utils dependency is not matched by a corresponding COPY packages/ipc-server-utils ... in gateway/Dockerfile before bun install --frozen-lockfile. The gateway Docker build currently copies only assistant-client, ces-client, service-contracts, slack-text, and twilio-client (lines 12–16), so dependency installation in containerized builds will fail with a missing local package path.

Useful? React with 👍 / 👎.

@socket-security
Copy link
Copy Markdown

socket-security Bot commented May 6, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedtypescript-eslint@​8.58.0 ⏵ 8.57.01001007498100
Addedstemmer@​2.0.19810010082100
Addedrrule@​2.8.19910010082100
Addedtar-stream@​3.1.710010010091100
Updatedyaml@​2.8.4 ⏵ 2.8.29999 -110092100
Addeduuid@​11.1.01009910094100
Addedtldts@​7.0.2510010010095100

View full report

Devin caught: scripts/prepack-bundled-deps.mjs validates that every
'file:' specifier in dependencies is also listed in bundledDependencies
and exits 1 otherwise (because npm pack does not follow the symlink-based
node_modules layout that bun creates for file: deps). Without this entry
the prepack hook would fail on 'npm pack' / 'npm publish'.

Verified locally: prepack now materializes 9 bundled packages including
the new one and exits 0.
@credence-the-bot
Copy link
Copy Markdown
Contributor Author

Pushed d956bbc70 to address Devin's P1 finding: added @vellumai/ipc-server-utils to assistant/package.json bundledDependencies. Verified scripts/prepack-bundled-deps.mjs exits 0 locally (materializes 9 bundled packages including the new one).

@codex review
@devin review

…texts

Codex caught: both Dockerfiles fail to build the new file: dependency
because /app/packages/ipc-server-utils doesn't exist before
'bun install --frozen-lockfile' runs. Mirrors the existing pattern for
slack-text and twilio-client (devDeps-only packages, no inner install
needed at runtime since they have no runtime dependencies).
@credence-the-bot
Copy link
Copy Markdown
Contributor Author

Pushed 5c9b35c94 to address Codex's two P1 findings: added COPY packages/ipc-server-utils ./packages/ipc-server-utils to both assistant/Dockerfile and gateway/Dockerfile (mirrors the existing pattern for slack-text and twilio-client — devDeps-only packages don't need an inner bun install since they have no runtime dependencies).

Both cycle-1 P1s now addressed:

  • Devin: bundledDependencies (d956bbc70)
  • Codex: assistant + gateway Dockerfile COPY (5c9b35c94)

@codex review
@devin review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@noanflaherty noanflaherty merged commit aa797e5 into main May 6, 2026
14 checks passed
@noanflaherty noanflaherty deleted the credence/ipc-socket-watchdog-shared branch May 6, 2026 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant