QVAC-18197 fix: avoid blocking flock on contended registry corestore #1835
Merged
simon-iribarren merged 6 commits on May 1, 2026
The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.
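
The cleanup half of this failure chain can be sketched as below. This is a minimal illustration rather than the SDK's code: `RegistryClientLike`, `initRegistryClient`, and the `"skipped"`/`"closed"` return values are hypothetical names introduced here, and only the early return on `registryClient === null` mirrors the description above.

```typescript
// Hypothetical stand-in for the registry client; only ready()/close() matter.
interface RegistryClientLike {
  ready(): Promise<void>;
  close(): Promise<void>;
}

// Module-level handle, assigned only after ready() has resolved.
let registryClient: RegistryClientLike | null = null;

async function initRegistryClient(client: RegistryClientLike): Promise<void> {
  // With a blocking lock underneath, this await may never complete,
  // so the assignment on the next line is never reached.
  await client.ready();
  registryClient = client;
}

// Returns what cleanup actually did so the early-return path is observable.
async function closeRegistryClient(): Promise<"skipped" | "closed"> {
  if (registryClient === null) return "skipped"; // nothing to close
  await registryClient.close();
  registryClient = null;
  return "closed";
}
```

In this sketch, a client whose `ready()` never settles leaves the handle at `null`, so cleanup reports success without touching the pending operation.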
`wait: true` was deliberately added by tetherto#1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:
- 8 attempts, 250 ms base backoff, capped by a 10 s deadline
- each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
immediately, so shutdown remains cancellable at every point
- exhausted budget propagates the underlying error, hitting the
existing `closeRegistryClient` early-return on `null` and letting
`process.exit()` terminate the worker cleanly
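
A bounded retry loop of this shape might look like the sketch below. The three constants come from this PR. Everything else is an assumption made for illustration: the `openWithRetry` name, the injectable `sleep`/`now` helpers, matching contention by error message, and the doubling backoff schedule (the description only gives the base delay).

```typescript
// Values taken from the PR description.
const MAX_RETRIES = 8;
const BASE_DELAY_MS = 250;
const MAX_TOTAL_WAIT_MS = 10_000;

// Assumption: lock contention is recognised by this error message.
const LOCK_ERROR = "File descriptor could not be locked";

// Injectable clock and sleep, so the loop can be exercised without waiting.
interface RetryDeps {
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

const defaultDeps: RetryDeps = {
  sleep: (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => Date.now(),
};

function isLockContention(err: unknown): boolean {
  return err instanceof Error && err.message.includes(LOCK_ERROR);
}

// Calls `open` (one non-blocking lock attempt per call) until it succeeds,
// the attempt budget is spent, or the deadline has passed.
async function openWithRetry<T>(
  open: () => Promise<T>,
  deps: RetryDeps = defaultDeps,
): Promise<T> {
  const deadline = deps.now() + MAX_TOTAL_WAIT_MS;
  for (let attempt = 1; ; attempt++) {
    try {
      return await open();
    } catch (err) {
      // Unrelated errors and an exhausted attempt budget propagate as-is.
      if (!isLockContention(err) || attempt >= MAX_RETRIES) throw err;
      const remaining = deadline - deps.now();
      if (remaining <= 0) throw err;
      // Assumed schedule: 250, 500, 1000, ... ms, clipped to the time left.
      const delay = BASE_DELAY_MS * 2 ** (attempt - 1);
      await deps.sleep(Math.min(delay, remaining));
    }
  }
}
```

Because control returns to JS between attempts, a shutdown handler can run during any of the sleeps. Under the doubling schedule assumed here, the 10 s deadline is reached before the eighth attempt; a flatter schedule would use the full attempt budget instead.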
As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.
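
A safety net along these lines could be sketched as follows. The 3-second delay, the unrefed timer, and the `SIGKILL` to the worker's own PID come from the description; the `kill` parameter and the `shutdownSketch` wrapper are hypothetical, added only so the sketch can be exercised without killing a real process. It assumes Node-style timers that expose `unref()`.

```typescript
const FORCE_EXIT_DELAY_MS = 3_000;

// Arms an unrefed timer that SIGKILLs the current process if it is still
// alive after `delayMs`. `kill` is injectable purely for illustration.
function scheduleForceExit(
  delayMs: number = FORCE_EXIT_DELAY_MS,
  kill: () => void = () => process.kill(process.pid, "SIGKILL"),
): ReturnType<typeof setTimeout> {
  const timer = setTimeout(kill, delayMs);
  // unref() so the safety net itself never keeps the event loop alive.
  timer.unref();
  return timer;
}

// Hypothetical shutdown tail: arm the timer first, then attempt a normal exit.
function shutdownSketch(exitCode: number): void {
  scheduleForceExit();
  process.exit(exitCode);
}
```

Arming the timer before `process.exit` matters: if the exit call wedges, nothing scheduled after it would ever run.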
Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.
opaninakuffo approved these changes on May 1, 2026
NamelsKing approved these changes on May 1, 2026
Proletter pushed a commit that referenced this pull request on May 24, 2026
Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
🎯 What problem does this PR solve?
The SDK bare worker leaks indefinitely as an orphaned OS process when started while another SDK process on the same host holds the registry corestore lock at `~/.qvac/registry-corestore/<key>/`. It surfaces in three ways:
- `no-lingering-bare-{sigterm,close,ipc-disconnect}` (`packages/sdk/tests-qvac`) fail in mixed-suite runs while passing in isolation, eroding the signal value of the lifecycle tests.
- `~/.qvac/` accumulates orphaned bare processes silently. Each leak holds a full inference runtime in memory.
- … `Cleanup completed successfully` and has no programmatic way to detect or recover.
Asana: QVAC-18197 (severity: medium).
Root cause
`packages/sdk/server/bare/registry/registry-client.ts:40` passes `corestoreOpts: { wait: true }` to `QVACRegistryClient`. That flag is forwarded to the underlying `Corestore` and ultimately makes `fd-lock` issue a blocking `flock(LOCK_EX)` on a libuv worker thread. There is no JS-cancellable handle for that thread.
Failure chain when contended:
1. `await client.ready()` never resolves → `registryClient` stays `null`.
2. `shutdownBareDirectWorker.runCleanup` calls `closeRegistryClient`, which early-returns because `registryClient === null` and logs `Cleanup completed successfully`.
3. `process.exit(0)` is called, but Bare cannot terminate while the libuv thread is blocked in `flock(2)`, keeping a native handle open.
The same wedge breaks `close()`-mode tests via a secondary effect: the parent's `net.Server.close(callback)` only fires when all IPC connections drain, and the wedged child never closes its IPC socket.
Why we can't just revert `wait: true`
It was deliberately introduced by #1480 (QVAC-12232) to tolerate `tryLock` collisions during another SDK's startup/shutdown, the inverse bug that produced an instant `ModelLoadFailedError` on launch. We need to keep the contention tolerance.
📝 How does it solve it?
A. `registry-client.ts`: switch to `tryLock` with a JS-bounded retry budget
- `corestoreOpts: { wait: false }` so each attempt is a non-blocking `flock(LOCK_EX | LOCK_NB)` that surfaces `EBUSY` immediately as `"File descriptor could not be locked"`.
- `MAX_RETRIES = 8`, `BASE_DELAY_MS = 250`, capped by a `MAX_TOTAL_WAIT_MS = 10_000` deadline (was 3 × 500/1000/2000).
- When the budget is exhausted, the underlying error propagates; `closeRegistryClient` then takes its existing `registryClient === null` early-return path, and `process.exit()` terminates the worker cleanly.
B. `worker-core.ts`: SIGKILL safety net (defense in depth)
- Before `process.exit()` in `shutdownBareDirectWorker`, arm a 3-second unrefed `setTimeout` that calls `process.kill(process.pid, 'SIGKILL')` if `process.exit` hasn't terminated us by then.
- Same `unref()` pattern as `tests-qvac/tests/utils/no-lingering-bare-consumer.ts`.
`Promise.race` on the in-flight `client.ready()` does not work: the libuv thread blocked on flock isn't cancellable from JS, the native handle stays open, and `process.exit` still wedges. The fix has to live at the flock layer.
🧪 How was it tested?
- `bun run lint` (eslint + tsc) green in `packages/sdk`.
- `prettier --check` green on the two modified files.
- `no-lingering-bare-{sigterm,close,ipc-disconnect}` (`packages/sdk/tests-qvac/tests/no-lingering-bare-tests.ts`) directly exercise the regression: they fail before this change in mixed-suite runs and pass after. CI on this PR will confirm.
Manual repro (per Asana ticket)
1. `node -e "import('@qvac/sdk').then(m => m.modelRegistryList()).then(()=>setTimeout(()=>{},300_000))"`, then note the bare child PID from `~/.qvac/.worker.lock`.
2. `kill -TERM <second host PID>`.
3. `process.exit(0)` terminates the worker.
Files changed
- `packages/sdk/server/bare/registry/registry-client.ts`: `corestoreOpts: { wait: false }`; widened retry to 8 attempts / 10 s deadline; expanded comment with the rationale and link to QVAC-18197
- `packages/sdk/server/worker-core.ts`: `scheduleForceExit()` helper armed before `process.exit` in `shutdownBareDirectWorker` (3 s unrefed SIGKILL safety net)