Skip to content

QVAC-18197 fix: avoid blocking flock on contended registry corestore#1835

Merged
simon-iribarren merged 6 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-18197-bare-worker-leak
May 1, 2026
Merged

QVAC-18197 fix: avoid blocking flock on contended registry corestore#1835
simon-iribarren merged 6 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-18197-bare-worker-leak

Conversation

@simon-iribarren

Copy link
Copy Markdown
Contributor

🎯 What problem does this PR solve?

The SDK bare worker leaks indefinitely as an orphaned OS process when started while another SDK process on the same host holds the registry corestore lock at ~/.qvac/registry-corestore/<key>/. Surfaces three ways:

  • CI: no-lingering-bare-{sigterm,close,ipc-disconnect} (packages/sdk/tests-qvac) fail in mixed-suite runs while passing in isolation — eroding signal value of the lifecycle tests.
  • Production: Workbench Desktop + a CLI session, two Electron windows, dev server + smoke harness — anything running two SDKs against the same ~/.qvac/ accumulates orphaned bare processes silently. Each leak holds a full inference runtime in memory.
  • The user/app sees Cleanup completed successfully and has no programmatic way to detect or recover.

Asana: QVAC-18197 (severity: medium).

Root cause

packages/sdk/server/bare/registry/registry-client.ts:40 passes corestoreOpts: { wait: true } to QVACRegistryClient. That flag is forwarded to the underlying Corestore and ultimately makes fd-lock issue a blocking flock(LOCK_EX) on a libuv worker thread. There is no JS-cancellable handle for that thread.

Failure chain when contended:

  1. await client.ready() never resolves → registryClient stays null.
  2. SIGTERM (or RPC close, or IPC disconnect) fires shutdownBareDirectWorker.
  3. runCleanup calls closeRegistryClient, which early-returns because registryClient === null — logs Cleanup completed successfully.
  4. process.exit(0) is called, but Bare cannot terminate while the libuv thread is blocked in flock(2) keeping a native handle open.
  5. OS process wedges indefinitely.

The same wedge breaks close()-mode tests via a secondary effect: the parent's net.Server.close(callback) only fires when all IPC connections drain, and the wedged child never closes its IPC socket.

Why we can't just revert wait: true

It was deliberately introduced by #1480 (QVAC-12232) to tolerate tryLock collisions during another SDK's startup/shutdown — the inverse bug that produced instant ModelLoadFailedError on launch. We need to keep the contention tolerance.

📝 How does it solve it?

A. registry-client.ts — switch to tryLock with a JS-bounded retry budget

  • corestoreOpts: { wait: false } so each attempt is a non-blocking flock(LOCK_EX | LOCK_NB) that surfaces EBUSY immediately as "File descriptor could not be locked".
  • The existing retry loop already handles that error; widened the budget so the contention-tolerance QVAC-12232 fix: handle unhandled worker process termination in Bare runtime #1480 wanted is preserved:
    • MAX_RETRIES = 8, BASE_DELAY_MS = 250, capped by a MAX_TOTAL_WAIT_MS = 10_000 deadline (was 3 × 500/1000/2000).
    • Every retry step is a fresh, cancellable JS operation — shutdown remains instantaneous at every point.
  • When the budget is exhausted, the underlying error propagates to the caller. closeRegistryClient then takes its existing registryClient === null early-return path, and process.exit() terminates the worker cleanly.

B. worker-core.ts — SIGKILL safety net (defense in depth)

  • Before process.exit() in shutdownBareDirectWorker, arm a 3-second unrefed setTimeout that calls process.kill(process.pid, 'SIGKILL') if process.exit hasn't terminated us by then.
  • Unrefed so it doesn't keep the loop alive on the happy path; uses the same Bare-friendly unref() pattern as tests-qvac/tests/utils/no-lingering-bare-consumer.ts.
  • Anti-fix to avoid: Promise.race on the in-flight client.ready() does not work — the libuv thread blocked on flock isn't cancellable from JS, the native handle stays open, process.exit still wedges. The fix has to live at the flock layer.

🧪 How was it tested?

  • bun run lint (eslint + tsc) green in packages/sdk.
  • prettier --check green on the two modified files.
  • no-lingering-bare-{sigterm,close,ipc-disconnect} (packages/sdk/tests-qvac/tests/no-lingering-bare-tests.ts) directly exercise the regression — they fail before this change in mixed-suite runs and pass after. CI on this PR will confirm.

Manual repro (per Asana ticket)

  1. node -e "import('@qvac/sdk').then(m => m.modelRegistryList()).then(()=>setTimeout(()=>{},300_000))" — note bare child PID from ~/.qvac/.worker.lock.
  2. In a second terminal, run the same command; note the new bare child PID.
  3. kill -TERM <second host PID>.
  4. Before: second bare child stays alive indefinitely (>60 s), requires SIGKILL from outside.
  5. After: second bare child exits within ~1 s; the JS-bounded retry budget surfaces the contention as a normal init error, cleanup runs to completion, process.exit(0) terminates the worker.

Files changed

File Change
packages/sdk/server/bare/registry/registry-client.ts corestoreOpts: { wait: false }; widened retry to 8 attempts / 10 s deadline; expanded comment with the rationale and link to QVAC-18197
packages/sdk/server/worker-core.ts New scheduleForceExit() helper armed before process.exit in shutdownBareDirectWorker (3 s unrefed SIGKILL safety net)

The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by tetherto#1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.
@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — ios⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — android⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

The test job did not produce a results artifact. Check the run for job-level failures.

@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — windows — ✅ all tests passed (89/89, 715s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — linux — ✅ all tests passed (89/89, 484s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

@github-actions

github-actions Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

QVAC E2E — macos — ✅ all tests passed (89/89, 306s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts

@NamelsKing

Copy link
Copy Markdown
Contributor

/review

@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@simon-iribarren

Copy link
Copy Markdown
Contributor Author

/review

@simon-iribarren simon-iribarren merged commit 8bb6b0e into tetherto:main May 1, 2026
12 of 13 checks passed
Proletter pushed a commit that referenced this pull request May 24, 2026
#1835)

The bare worker leaks indefinitely when started while another SDK process
holds the registry corestore lock. Root cause: `corestoreOpts: { wait: true }`
issues a blocking `flock(LOCK_EX)` on a libuv worker thread that JS cannot
cancel, so when SIGTERM/IPC-disconnect arrives, the in-flight `client.ready()`
never resolves (cleanup early-returns with `registryClient = null`) and
`process.exit()` cannot terminate Bare while the native handle is held.
The OS process wedges forever, breaking the three `no-lingering-bare-*`
e2e tests in mixed-suite runs.

`wait: true` was deliberately added by #1480 (QVAC-12232) to tolerate
transient lock contention during another SDK's startup/shutdown; reverting
to the bare default would re-introduce that bug. Instead, switch to
`wait: false` (tryLock) and provide an equivalent JS-bounded retry budget
in the existing retry loop:

  - 8 attempts, 250 ms base backoff, capped by a 10 s deadline
  - each step is a fresh non-blocking syscall — `EBUSY` surfaces to JS
    immediately, so shutdown remains cancellable at every point
  - exhausted budget propagates the underlying error, hitting the
    existing `closeRegistryClient` early-return on `null` and letting
    `process.exit()` terminate the worker cleanly

As defense in depth, arm a 3-second SIGKILL safety net in
`shutdownBareDirectWorker` (unrefed timer) before calling `process.exit`,
so any future blocking-handle bug can't survive shutdown.

Covered by existing `no-lingering-bare-{sigterm,close,ipc-disconnect}`
e2e tests, which now pass in mixed-suite runs.

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] tier1 verify

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants