QVAC-18197 fix: avoid blocking flock on contended registry corestore by simon-iribarren · Pull Request #1835 · tetherto/qvac

simon-iribarren · 2026-04-30T14:50:17Z

🎯 What problem does this PR solve?

The SDK bare worker leaks indefinitely as an orphaned OS process when started while another SDK process on the same host holds the registry corestore lock at ~/.qvac/registry-corestore/<key>/. Surfaces three ways:

CI: no-lingering-bare-{sigterm,close,ipc-disconnect} (packages/sdk/tests-qvac) fail in mixed-suite runs while passing in isolation — eroding signal value of the lifecycle tests.
Production: Workbench Desktop + a CLI session, two Electron windows, dev server + smoke harness — anything running two SDKs against the same ~/.qvac/ accumulates orphaned bare processes silently. Each leak holds a full inference runtime in memory.
The user/app sees Cleanup completed successfully and has no programmatic way to detect or recover.

Asana: QVAC-18197 (severity: medium).

Root cause

packages/sdk/server/bare/registry/registry-client.ts:40 passes corestoreOpts: { wait: true } to QVACRegistryClient. That flag is forwarded to the underlying Corestore and ultimately makes fd-lock issue a blocking flock(LOCK_EX) on a libuv worker thread. There is no JS-cancellable handle for that thread.

Failure chain when contended:

await client.ready() never resolves → registryClient stays null.
SIGTERM (or RPC close, or IPC disconnect) fires shutdownBareDirectWorker.
runCleanup calls closeRegistryClient, which early-returns because registryClient === null — logs Cleanup completed successfully.
process.exit(0) is called, but Bare cannot terminate while the libuv thread is blocked in flock(2) keeping a native handle open.
OS process wedges indefinitely.

The same wedge breaks close()-mode tests via a secondary effect: the parent's net.Server.close(callback) only fires when all IPC connections drain, and the wedged child never closes its IPC socket.

Why we can't just revert `wait: true`

It was deliberately introduced by #1480 (QVAC-12232) to tolerate tryLock collisions during another SDK's startup/shutdown — the inverse bug that produced instant ModelLoadFailedError on launch. We need to keep the contention tolerance.

📝 How does it solve it?

A. registry-client.ts — switch to tryLock with a JS-bounded retry budget

corestoreOpts: { wait: false } so each attempt is a non-blocking flock(LOCK_EX | LOCK_NB) that surfaces EBUSY immediately as "File descriptor could not be locked".
The existing retry loop already handles that error; widened the budget so the contention-tolerance QVAC-12232 fix: handle unhandled worker process termination in Bare runtime #1480 wanted is preserved:
- MAX_RETRIES = 8, BASE_DELAY_MS = 250, capped by a MAX_TOTAL_WAIT_MS = 10_000 deadline (was 3 × 500/1000/2000).
- Every retry step is a fresh, cancellable JS operation — shutdown remains instantaneous at every point.
When the budget is exhausted, the underlying error propagates to the caller. closeRegistryClient then takes its existing registryClient === null early-return path, and process.exit() terminates the worker cleanly.

B. worker-core.ts — SIGKILL safety net (defense in depth)

Before process.exit() in shutdownBareDirectWorker, arm a 3-second unrefed setTimeout that calls process.kill(process.pid, 'SIGKILL') if process.exit hasn't terminated us by then.
Unrefed so it doesn't keep the loop alive on the happy path; uses the same Bare-friendly unref() pattern as tests-qvac/tests/utils/no-lingering-bare-consumer.ts.
Anti-fix to avoid: Promise.race on the in-flight client.ready() does not work — the libuv thread blocked on flock isn't cancellable from JS, the native handle stays open, process.exit still wedges. The fix has to live at the flock layer.

🧪 How was it tested?

bun run lint (eslint + tsc) green in packages/sdk.
prettier --check green on the two modified files.
no-lingering-bare-{sigterm,close,ipc-disconnect} (packages/sdk/tests-qvac/tests/no-lingering-bare-tests.ts) directly exercise the regression — they fail before this change in mixed-suite runs and pass after. CI on this PR will confirm.

Manual repro (per Asana ticket)

node -e "import('@qvac/sdk').then(m => m.modelRegistryList()).then(()=>setTimeout(()=>{},300_000))" — note bare child PID from ~/.qvac/.worker.lock.
In a second terminal, run the same command; note the new bare child PID.
kill -TERM <second host PID>.
Before: second bare child stays alive indefinitely (>60 s), requires SIGKILL from outside.
After: second bare child exits within ~1 s; the JS-bounded retry budget surfaces the contention as a normal init error, cleanup runs to completion, process.exit(0) terminates the worker.

Files changed

File	Change
`packages/sdk/server/bare/registry/registry-client.ts`	`corestoreOpts: { wait: false }`; widened retry to 8 attempts / 10 s deadline; expanded comment with the rationale and link to QVAC-18197
`packages/sdk/server/worker-core.ts`	New `scheduleForceExit()` helper armed before `process.exit` in `shutdownBareDirectWorker` (3 s unrefed SIGKILL safety net)

The bare worker leaks indefinitely when started while another SDK process holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }` issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()` never resolves (cleanup early-returns with `registryClient = null`) and `process.exit()` cannot terminate Bare while the native handle is held. The OS process wedges forever, breaking the three `no-lingering-bare-*` e2e tests in mixed-suite runs. `wait: true` was deliberately added by tetherto#1480 (QVAC-12232) to tolerate transient lock contention during another SDK's startup/shutdown; reverting to the bare default would re-introduce that bug. Instead, switch to `wait: false` (tryLock) and provide an equivalent JS-bounded retry budget in the existing retry loop: - 8 attempts, 250 ms base backoff, capped by a 10 s deadline - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS immediately, so shutdown remains cancellable at every point - exhausted budget propagates the underlying error, hitting the existing `closeRegistryClient` early-return on `null` and letting `process.exit()` terminate the worker cleanly As defense in depth, arm a 3-second SIGKILL safety net in `shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`, so any future blocking-handle bug can't survive shutdown. Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}` e2e tests, which now pass in mixed-suite runs.

github-actions · 2026-04-30T18:38:52Z