Skip to content

QVAC-12232 fix: handle unhandled worker process termination in Bare runtime#1480

Merged
simon-iribarren merged 7 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-12232-unhandled-worker-termination
Apr 16, 2026
Merged

QVAC-12232 fix: handle unhandled worker process termination in Bare runtime#1480
simon-iribarren merged 7 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-12232-unhandled-worker-termination

Conversation

@simon-iribarren

@simon-iribarren simon-iribarren commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

  • Worker process only handles SIGTERM/SIGINT for cleanup — uncaught exceptions, unhandled rejections, parent crashes, and SIGKILL all leave resources locked
  • Corestore directories and RAG instances remain locked after ungraceful termination, causing "resource busy" / "File descriptor could not be locked" errors on restart
  • unloadAllModels skips FilesystemDL.close() that the single-model unloadModel path performs, leaking file handles during shutdown
  • Registry client's getRegistryClient() fails instantly on fd-lock contention with no retry — any transient lock (e.g., previous process still releasing) causes immediate ModelLoadFailedError
  • QVACRegistryClient creates Corestore with wait: false (default), using tryLock instead of waitForLock — concurrent SDK instances on the same machine collide on ~/.qvac/registry-corestore/<key>

Related: Slack thread from Zbigniew documenting the fd-lock error chain in production

📝 How does it solve it?

A. Crash handlers (worker-core.ts)

  • Register uncaughtException and unhandledRejection handlers on bare-process to trigger full cleanup on unexpected crashes
  • Non-graceful shutdowns exit with code 1

B. IPC disconnect detection (create-server.ts, worker-core.ts)

  • Listen for IPC socket close event in desktop mode — fires when parent process (Electron) crashes or is killed
  • Triggers cleanup via callback (avoids circular imports)

C. Stale lock file (server/utils/worker-lock.ts)

  • New PID-based lock file at ~/.qvac/.worker.lock written on startup, removed on graceful shutdown
  • On startup, detects stale locks from dead PIDs (handles SIGKILL, OOM, power loss)
  • Uses process.kill(pid, 0) for cross-platform liveness check

D. Fix unloadAllModels (model-registry.ts)

  • Now calls loader.close() before model.unload(), matching unloadModel behavior
  • Per-model try/catch so one failing unload doesn't block others

E. Registry client fd-lock retry (registry-client.ts)

  • Retry getRegistryClient() up to 3 times with exponential backoff (500ms, 1000ms, 2000ms) on "File descriptor could not be locked" errors
  • Only fd-lock errors trigger retry; other errors propagate immediately

F. Corestore wait option plumbing (qvac-lib-registry-server/client)

  • QVACRegistryClient now accepts corestoreOpts and passes them to the Corestore constructor
  • Type definition updated with corestoreOpts?: { wait?: boolean }
  • SDK will pass corestoreOpts: { wait: true } once @qvac/registry-client dependency is bumped (TODO in code)

🧪 How was it tested?

Manual verification — no existing test infrastructure covers worker lifecycle. Suggested e2e tests as follow-up:

  1. Stale lock recovery: create fake lock with dead PID → start worker → verify stale detection log and clean startup
  2. Uncaught exception cleanup: inject setTimeout(() => { throw ... }, 3000) in worker → verify cleanup runs and lock file removed
  3. SIGKILL recovery: start worker → kill -9 → verify lock file persists → restart → verify stale detection and no "resource busy"
  4. IPC disconnect: start desktop worker → kill parent → verify worker exits cleanly
  5. fd-lock retry: simulate contention → verify retry logs and eventual success

Files changed

File Change
packages/sdk/server/utils/worker-lock.ts New — PID lock file: acquire, release, stale detection
packages/sdk/server/worker-core.ts Add crash/rejection handlers, lock lifecycle, IPC disconnect callback, expanded shutdown reasons
packages/sdk/server/rpc/create-server.ts Add onDisconnect callback option to createIPCClient, socket close listener
packages/sdk/server/bare/registry/model-registry.ts Fix unloadAllModels to close loaders, add per-model error handling
packages/sdk/server/bare/registry/registry-client.ts Add retry with exponential backoff on fd-lock errors
packages/qvac-lib-registry-server/client/lib/client.js Plumb corestoreOpts through to new Corestore()
packages/qvac-lib-registry-server/client/index.d.ts Add corestoreOpts to QVACRegistryClientOptions type

Follow-up needed

  • Bump @qvac/registry-client → release with corestoreOpts support
  • Bump SDK's @qvac/registry-client dep → enable corestoreOpts: { wait: true }

Comment thread packages/sdk/server/bare/registry/registry-client.ts
opaninakuffo
opaninakuffo previously approved these changes Apr 13, 2026
Comment thread packages/qvac-lib-registry-server/client/lib/client.js Outdated
Comment thread packages/sdk/server/bare/registry/model-registry.ts Outdated
Comment thread packages/sdk/server/utils/worker-lock.ts Outdated
- Remove unused isStaleWorkerLock export from worker-lock.ts
- Use optional chaining in unloadAllModels (model-registry.ts)
Introduces server/utils/qvac-paths.ts with getQvacPath(...subPaths)
as the single place that resolves paths under ~/.qvac. Updated
worker-lock, cache, config-registry, and hyperdrive to use it instead
of manually constructing path.join(HOME_DIR, ".qvac", ...) each time.
@simon-iribarren

Copy link
Copy Markdown
Contributor Author

/review

@github-actions

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (2/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@simon-iribarren simon-iribarren merged commit a7def65 into tetherto:main Apr 16, 2026
22 of 27 checks passed
simon-iribarren added a commit that referenced this pull request Apr 22, 2026
Bumps @qvac/registry-client from 0.4.0 to 0.4.1 to release the
corestoreOpts constructor option added by #1480, unblocking the SDK
follow-up that enables corestoreOpts: { wait: true }.
simon-iribarren added a commit to simon-iribarren/qvac that referenced this pull request Apr 22, 2026
…VAC-12232)

Bumps @qvac/registry-client to ^0.4.1 and enables waitForLock semantics
on the Corestore backing the registry client. This prevents tryLock
collisions when multiple SDK instances on the same machine contend over
~/.qvac/registry-corestore/<key>, which was the motivating failure mode
for tetherto#1480.

Blocked on: release PR tetherto#1698 (@qvac/registry-client v0.4.1
publish). bun.lock update + review-ready flip will follow once v0.4.1
is on npm.
Victor-Rodzko pushed a commit that referenced this pull request Apr 30, 2026
The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by #1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.
simon-iribarren added a commit that referenced this pull request May 1, 2026
#1835)

The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by #1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
simon-iribarren added a commit that referenced this pull request May 1, 2026
#1842)

The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by #1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.

Co-authored-by: Simon Iribarren <simon.ig13@gmail.com>
Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
Bumps @qvac/registry-client from 0.4.0 to 0.4.1 to release the
corestoreOpts constructor option added by #1480, unblocking the SDK
follow-up that enables corestoreOpts: { wait: true }.
Proletter pushed a commit that referenced this pull request May 24, 2026
…untime (#1480)

* fix: handle unhandled worker process termination in Bare runtime (QVAC-12232)

- Add uncaughtException/unhandledRejection handlers to trigger cleanup on crash
- Add IPC socket close detection to handle parent process termination
- Add PID-based lock file (~/.qvac/.worker.lock) with stale detection on startup
- Fix unloadAllModels to close FilesystemDL loaders (aligned with single-model path)
- Use exit code 1 for crash shutdowns, 0 for graceful

Made-with: Cursor

* fix: add fd-lock retry with backoff and corestoreOpts plumbing for registry client

- Add retry with exponential backoff (3 attempts, 500/1000/2000ms) on fd-lock errors in getRegistryClient()
- Plumb corestoreOpts through QVACRegistryClient constructor to Corestore (registry-server)
- Add corestoreOpts to QVACRegistryClientOptions type definition
- SDK will pass wait: true once @qvac/registry-client dep is bumped

* fix: make getRegistryClient single-flight to prevent concurrent fd-lock contention

* fix: address review comments on worker termination PR

- Remove unused isStaleWorkerLock export from worker-lock.ts
- Use optional chaining in unloadAllModels (model-registry.ts)

* fix: instantiate Corestore in constructor, remove _corestoreOpts property

* fix: remove homeDir param from worker-lock, resolve via getEnv internally

* chore: add getQvacPath utility, consolidate ~/.qvac path resolution

Introduces server/utils/qvac-paths.ts with getQvacPath(...subPaths)
as the single place that resolves paths under ~/.qvac. Updated
worker-lock, cache, config-registry, and hyperdrive to use it instead
of manually constructing path.join(HOME_DIR, ".qvac", ...) each time.
Proletter pushed a commit that referenced this pull request May 24, 2026
#1835)

The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by #1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants