fix(ingestion): optional per-parse timeout to prevent worker deadlocks #1509
giulioleone097 wants to merge 11 commits into
Adds an opt-in `GITNEXUS_PARSE_TIMEOUT_MICROS` env var that wires `Parser.setTimeoutMicros` around every call routed through `parseSourceSafe`. When the limit is hit, the wrapper converts tree-sitter's `null` return into a catchable `ParseTimeoutError`, which the existing callers in `call-processor.ts` already handle by skipping the file. When unset (the default), behaviour is identical to previous versions.

Why
---
On large repositories the worker-pool idle timeout (default 30s) can fire while a worker is blocked inside a sync `parser.parse()` on a pathological file. The replacement path then calls `worker.terminate()` while the native parser is still running; on macOS this races the tree-sitter binding and the process aborts with:

    libc++abi: terminating due to uncaught exception of type Napi::Error

The crash is unrecoverable from JavaScript because the throw originates in C++ during teardown and bypasses `try/catch`. Setting a per-parse timeout slightly below the worker idle timeout lets tree-sitter abort cooperatively before the pool tries to terminate the worker, so the race window never opens.

Reproduction
------------

    npx gitnexus analyze   # 30s default worker timeout
    # → "Worker N parse job idle timeout. Splitting into …"
    # → "libc++abi: terminating due to uncaught exception of type Napi::Error"

Workaround that confirmed the trigger:

    npx gitnexus analyze --worker-timeout 300   # succeeds (~10× slower)

Scope
-----
This change removes the dominant trigger of the race but does NOT fix the underlying race in `worker-pool.replaceWorker`. A complete fix would still want to either (a) make the post-terminate path tolerant of in-flight native work, or (b) drive this same timeout automatically from `--worker-timeout`. Keeping the env var opt-in here so the change is minimal and bisectable; happy to follow up with the automatic wiring if the direction is acceptable.

Tests
-----
`test/unit/safe-parse.test.ts` gains three cases covering:
- no-op behaviour when the env var is unset (backward compatibility),
- `ParseTimeoutError` thrown when the limit is exceeded,
- timeout is reset after each call so a reused parser is unaffected.
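The three behaviours listed above (no-op when unset, `ParseTimeoutError` on timeout, reset after each call) can be sketched roughly as follows. This is not the code in `safe-parse.ts`, which is not shown here: `parseWithTimeout`, `ParserLike`, `Tree` and `FakeParser` are hypothetical names introduced for illustration, and the fake parser only models the "returns `null` on timeout" behaviour that the description attributes to tree-sitter.

```typescript
// Stand-ins for the tree-sitter types; only the members this sketch touches.
interface Tree { rootType: string }
interface ParserLike {
  setTimeoutMicros(micros: number): void;
  parse(source: string): Tree | null; // null models a timed-out parse
}

class ParseTimeoutError extends Error {
  timeoutMicros: number;
  constructor(timeoutMicros: number) {
    super("parse exceeded " + timeoutMicros + " microseconds");
    this.name = "ParseTimeoutError";
    this.timeoutMicros = timeoutMicros;
  }
}

// Hypothetical wrapper. timeoutMicros <= 0 means "no limit" (the default).
function parseWithTimeout(
  parser: ParserLike,
  source: string,
  timeoutMicros: number,
): Tree | null {
  if (timeoutMicros <= 0) return parser.parse(source); // legacy path, untouched
  parser.setTimeoutMicros(timeoutMicros);
  try {
    const tree = parser.parse(source);
    if (tree === null) throw new ParseTimeoutError(timeoutMicros);
    return tree;
  } finally {
    parser.setTimeoutMicros(0); // reset so a reused parser is unaffected
  }
}

// Fake parser (assumption, for demonstration): every parse "costs" a fixed time.
class FakeParser implements ParserLike {
  limit = 0;
  costMicros: number;
  constructor(costMicros: number) {
    this.costMicros = costMicros;
  }
  setTimeoutMicros(micros: number): void {
    this.limit = micros;
  }
  parse(_source: string): Tree | null {
    const timedOut = this.limit > 0 && this.costMicros > this.limit;
    return timedOut ? null : { rootType: "program" };
  }
}
```

Resetting the limit in `finally` rather than after the `parse` call means the reset also runs when the parser throws for a reason other than a timeout.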
@giulio-leone is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it.
Claude finished @magyargergo's task in 9m 9s
Production-readiness review: PR #1509
Review bar for this PR
Current PR state
Branch hygiene assessment
Merge-from-main commit present but harmless and merge-safe. Commit
Understanding of the change
The PR wraps every
The fix is correct in concept. The lifecycle of
Findings
Finding 1: Uncaught
Finding 2: Silent data-loss across four production processors
Finding 3: User-facing env var is undocumented
Finding 4: CI not yet complete; test results unverified
Finding 5: Secondary language-capture re-parses uncaught in the sequential path
Finding 6:
Finding 7: Confusing
PR-specific assessment sections
Runtime correctness
The core
Worker-pool/deadlock mitigation
The fix correctly targets the production deadlock. The primary parse in
Parser state lifecycle
Observability and incomplete-index risk
Critically incomplete. The worker path (
Env var validation/documentation
Tests and CI
Three new test cases: no-op when unset, timeout throw, reset after timeout. All cover the wrapper's own behavior. Missing: invalid env var returns 0; reset after parser throw (not just timeout); TypeScript grammar-mismatch secondary re-parse under timeout; Go range-binding path; caller catch behavior; timeout observability. Tests rely on
Hidden Unicode / hygiene
Branch hygiene
Merge-from-main commit
Final verdict: not production-ready
This PR has three independent blockers that must be fixed before merge:
The underlying design is sound and the core
CI Report
❌ Some checks failed
Pipeline Status
Test Results
✅ All 8611 tests passed; 3 test(s) skipped
Code Coverage
Tests
@giulioleone097 can you please look into Claude's findings?
Summary
Adds an opt-in `GITNEXUS_PARSE_TIMEOUT_MICROS` env var that applies `Parser.setTimeoutMicros` around every call routed through `parseSourceSafe`. When the limit is hit, the wrapper converts tree-sitter's `null` return into a catchable `ParseTimeoutError`, which existing callers in `call-processor.ts` already handle by skipping the file.
When unset (default), behaviour is identical to previous versions.
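How the env var might be read is not shown in this conversation. A minimal sketch, assuming that an unset, empty, or invalid value disables the timeout by yielding 0 (the review in this thread lists "invalid env var returns 0" as a case that still needs a test): `readParseTimeoutMicros` is a hypothetical helper name, and it takes the environment as a parameter so it can be exercised without touching `process.env`.

```typescript
// Hypothetical helper: returns the timeout in microseconds, or 0 for "disabled".
function readParseTimeoutMicros(env: Record<string, string | undefined>): number {
  const raw = env["GITNEXUS_PARSE_TIMEOUT_MICROS"];
  if (raw === undefined || raw.trim() === "") return 0; // unset: legacy behaviour
  const value = Number(raw);
  // Assumption: anything that is not a positive whole number disables the timeout.
  const isWholeNumber = isFinite(value) && Math.floor(value) === value;
  return isWholeNumber && value > 0 ? value : 0;
}

// 25 seconds expressed in microseconds, the value suggested later in this description.
const twentyFiveSeconds = readParseTimeoutMicros({
  GITNEXUS_PARSE_TIMEOUT_MICROS: String(25 * 1000 * 1000),
});
```

In real use the argument would be `process.env`. Treating bad input as "disabled" keeps a typo from aborting an index run, at the cost of silently losing the protection, so a warning on invalid input may be worth adding.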
Why
On large repositories the worker-pool idle timeout (default 30s) can fire while a worker is blocked inside a sync `parser.parse()` on a pathological file. The replacement path then calls `worker.terminate()` while the native parser is still running; on macOS this races the tree-sitter binding and the process aborts with:

    libc++abi: terminating due to uncaught exception of type Napi::Error

The crash is unrecoverable from JavaScript because the throw originates in C++ during teardown and bypasses `try/catch`. Setting a per-parse timeout slightly below the worker idle timeout lets tree-sitter abort cooperatively before the pool tries to terminate the worker, so the race window never opens.
Repro
Reproduced on a ~8k-file polyglot Nx monorepo (TS + .NET + Python + others), gitnexus 1.6.4, Node 25.9, macOS 25.4.
Workaround that confirmed the trigger by avoiding the split/terminate path entirely:

    npx gitnexus analyze --worker-timeout 300   # succeeds (~10× slower)

With this change, exporting `GITNEXUS_PARSE_TIMEOUT_MICROS=$((25 * 1000 * 1000))` (i.e. 25s, below the 30s default worker timeout) avoids the race without forcing users to disable the idle timeout entirely.
Scope
This removes the dominant trigger but does not fix the underlying race in `worker-pool.replaceWorker` (cancelling a worker mid-`parser.parse` is inherently fragile with the current node-tree-sitter binding). Two natural follow-ups, both out of scope here to keep the diff minimal and bisectable:
- Make the post-`terminate()` path tolerant of in-flight native work (or drain it cooperatively).
- Derive `GITNEXUS_PARSE_TIMEOUT_MICROS` from `--worker-timeout` so users get the protection by default. Happy to send a follow-up PR if you'd prefer that wiring instead of an env-only opt-in.

Test plan
- `npx vitest run test/unit/safe-parse.test.ts` → 10/10 pass (7 pre-existing + 3 new)
- `npx tsc --noEmit` → no errors
- `npx eslint gitnexus/src/core/tree-sitter/safe-parse.ts gitnexus/test/unit/safe-parse.test.ts` → clean
- `gitnexus analyze` on the same monorepo with `GITNEXUS_PARSE_TIMEOUT_MICROS=25000000` and the default 30s worker timeout (no `--worker-timeout` override): index completed, no Napi::Error.

New test cases cover:
- no-op behaviour when the env var is unset (backward compatibility),
- `ParseTimeoutError` thrown when the limit is exceeded,
- timeout is reset after each call so a reused parser is unaffected.

Risk / rollback
- `process.env` read is added).
- `ParseTimeoutError` is thrown from `parseSourceSafe`; all current call sites already wrap the call in `try { … } catch { continue; }`.
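The caller-side pattern described above, where a failed parse skips the file and the loop continues, might look like the sketch below. The actual call sites are not shown in this conversation. `indexFiles`, `SourceFile` and `IndexResult` are hypothetical names, and recording the skipped paths is an assumption added here for illustration, prompted by the review's concern about silent data loss; it is not something the PR claims to do.

```typescript
interface SourceFile { path: string; source: string }
interface IndexResult {
  parsed: string[];
  skipped: { path: string; reason: string }[];
}

// `parse` stands in for parseSourceSafe: it either returns a tree or throws.
function indexFiles(
  files: SourceFile[],
  parse: (source: string) => unknown,
): IndexResult {
  const result: IndexResult = { parsed: [], skipped: [] };
  for (const file of files) {
    try {
      parse(file.source);
    } catch (err) {
      // Skip the file, but keep a record so the gap in the index is visible.
      const reason = err instanceof Error ? err.name : String(err);
      result.skipped.push({ path: file.path, reason });
      continue;
    }
    result.parsed.push(file.path);
  }
  return result;
}

// Demonstration parser (assumption): the source "slow" simulates a timed-out parse.
function demoParse(source: string): number {
  if (source === "slow") {
    const err = new Error("parse timed out");
    err.name = "ParseTimeoutError";
    throw err;
  }
  return source.length;
}

const demoResult = indexFiles(
  [
    { path: "a.ts", source: "const a = 1;" },
    { path: "huge.ts", source: "slow" },
    { path: "b.ts", source: "const b = 2;" },
  ],
  demoParse,
);
```

With a bare `catch { continue; }`, the run would produce the same `parsed` list but leave no trace of `huge.ts` having been dropped.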