Skip to content

fix(ingestion): optional per-parse timeout to prevent worker deadlocks#1509

Closed
giulioleone097 wants to merge 11 commits into
abhigyanpatwari:mainfrom
giulioleone097:fix/parse-timeout-prevents-worker-deadlock
Closed

fix(ingestion): optional per-parse timeout to prevent worker deadlocks#1509
giulioleone097 wants to merge 11 commits into
abhigyanpatwari:mainfrom
giulioleone097:fix/parse-timeout-prevents-worker-deadlock

Conversation

@giulioleone097

Copy link
Copy Markdown

Summary

Adds an opt-in GITNEXUS_PARSE_TIMEOUT_MICROS env var that applies Parser.setTimeoutMicros around every call routed through parseSourceSafe. When the limit is hit, the wrapper converts tree-sitter's null return into a catchable ParseTimeoutError — which existing callers in call-processor.ts already handle by skipping the file.

When unset (default), behaviour is identical to previous versions.

Why

On large repositories the worker-pool idle timeout (default 30s) can fire while a worker is blocked inside a sync parser.parse() on a pathological file. The replacement path then calls worker.terminate() while the native parser is still running; on macOS this races the tree-sitter binding and the process aborts with:

libc++abi: terminating due to uncaught exception of type Napi::Error

The crash is unrecoverable from JavaScript because the throw originates in C++ during teardown and bypasses try/catch. Setting a per-parse timeout slightly below the worker idle timeout lets tree-sitter abort cooperatively before the pool tries to terminate the worker, so the race window never opens.

Repro

Reproduced on a ~8k-file polyglot Nx monorepo (TS + .NET + Python + others), gitnexus 1.6.4, Node 25.9, macOS 25.4.

npx gitnexus analyze
# → Worker N parse job idle timeout. Splitting into X/Y item jobs.
# → libc++abi: terminating due to uncaught exception of type Napi::Error
# (no .gitnexus index produced)

Workaround that confirmed the trigger by avoiding the split/terminate path entirely:

npx gitnexus analyze --worker-timeout 300
# succeeds in ~617s (vs ~90s pre-1.6.0); index complete (94,901 nodes / 220,850 edges)

With this change, exporting GITNEXUS_PARSE_TIMEOUT_MICROS=$((25 * 1000 * 1000)) (i.e. 25s, below the 30s default worker timeout) avoids the race without forcing users to disable the idle timeout entirely.

Scope

This removes the dominant trigger but does not fix the underlying race in worker-pool.replaceWorker (cancelling a worker mid-parser.parse is inherently fragile with the current node-tree-sitter binding). Two natural follow-ups, both out of scope here to keep the diff minimal and bisectable:

  • (a) Make the post-terminate() path tolerant of in-flight native work (or drain it cooperatively).
  • (b) Auto-derive GITNEXUS_PARSE_TIMEOUT_MICROS from --worker-timeout so users get the protection by default. Happy to send a follow-up PR if you'd prefer that wiring instead of an env-only opt-in.

Test plan

  • npx vitest run test/unit/safe-parse.test.ts → 10/10 pass (7 pre-existing + 3 new)
  • npx tsc --noEmit → no errors
  • npx eslint gitnexus/src/core/tree-sitter/safe-parse.ts gitnexus/test/unit/safe-parse.test.ts → clean
  • Manual: re-ran gitnexus analyze on the same monorepo with GITNEXUS_PARSE_TIMEOUT_MICROS=25000000 and the default 30s worker timeout (no --worker-timeout override) — index completed, no Napi::Error.

New test cases cover:

  • no-op behaviour when the env var is unset (backward compatibility),
  • ParseTimeoutError thrown when the limit is exceeded,
  • timeout reset after each call so a reused parser is unaffected.

Risk / rollback

  • Default code path is untouched when the env var is absent (a single process.env read is added).
  • ParseTimeoutError is thrown from parseSourceSafe; all current call sites already wrap the call in try { … } catch { continue; }.
  • To roll back: revert this commit; no schema/data migration involved.

Adds an opt-in `GITNEXUS_PARSE_TIMEOUT_MICROS` env var that wires
`Parser.setTimeoutMicros` around every call routed through
`parseSourceSafe`. When the limit is hit, the wrapper converts
tree-sitter's `null` return into a catchable `ParseTimeoutError` — which
the existing callers in `call-processor.ts` already handle by skipping
the file.

When unset (default) behaviour is identical to previous versions.

Why
---

On large repositories the worker-pool idle timeout (default 30s) can
fire while a worker is blocked inside a sync `parser.parse()` on a
pathological file. The replacement path then calls `worker.terminate()`
while the native parser is still running; on macOS this races the
tree-sitter binding and the process aborts with:

  libc++abi: terminating due to uncaught exception of type Napi::Error

The crash is unrecoverable from JavaScript because the throw originates
in C++ during teardown and bypasses `try/catch`. Setting a per-parse
timeout slightly below the worker idle timeout lets tree-sitter abort
cooperatively before the pool tries to terminate the worker, so the
race window never opens.

Reproduction
------------

  npx gitnexus analyze            # 30s default worker timeout
  # → "Worker N parse job idle timeout. Splitting into …"
  # → "libc++abi: terminating due to uncaught exception of type Napi::Error"

Workaround that confirmed the trigger:

  npx gitnexus analyze --worker-timeout 300    # succeeds (~10× slower)

Scope
-----

This change removes the dominant trigger of the race but does NOT fix
the underlying race in `worker-pool.replaceWorker`. A complete fix
would still want to either (a) make the post-terminate path tolerant of
in-flight native work, or (b) drive this same timeout automatically
from `--worker-timeout`. Keeping the env var opt-in here so the change
is minimal and bisectable; happy to follow up with the automatic wiring
if the direction is acceptable.

Tests
-----

`test/unit/safe-parse.test.ts` gains three cases covering:
  - no-op behaviour when the env var is unset (backward compatibility),
  - `ParseTimeoutError` thrown when the limit is exceeded,
  - timeout is reset after each call so a reused parser is unaffected.
@vercel

vercel Bot commented May 11, 2026

Copy link
Copy Markdown

@giulio-leone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 9m 9s —— View job


Production-readiness review — PR #1509


Review bar for this PR

  • Default behavior (GITNEXUS_PARSE_TIMEOUT_MICROS unset) is byte-for-byte identical to previous versions — no new exceptions, no null leaks, no altered parse output.
  • ParseTimeoutError is catchable and handled gracefully in every call path that uses parseSourceSafe, including all secondary/re-parse call sites, not just the primary parse in the worker.
  • Any file skipped due to a timeout is observable (logged with file path and configured timeout value) — timeouts must not be indistinguishable from generic parse errors.
  • Parser timeout state is fully restored in finally after every outcome (success, timeout null-return, native throw).
  • No other code owns a parser timeout, so resetting to 0 is safe; this invariant is documented or tested.
  • The user-facing env var is documented where other GITNEXUS_* env vars are documented (README, gitnexus/README, or equivalent troubleshooting section), with microsecond units, relationship to --worker-timeout, and consequences of too-small values stated.
  • Required CI checks (quality + tests) pass on the PR head SHA.
  • Hidden/bidirectional Unicode in changed files is confirmed harmless before approval.

Current PR state

  • Open. Head SHA ae93706. Two commits: ae93706 (the fix) + 73cddc7 (merge-from-main).
  • CI: Gitleaks ✅, Dependency Review ✅, CodeQL ✅, PR Autofix ✅. CI (quality + tests) and Docker Build & Push are still in_progress at review time. Test results cannot be verified.
  • Checks tab rendered as zero by GitHub UI per the trigger comment — confirmed incomplete/unreliable rendering; actual status obtained via CI API above.
  • No reviews, no assignees, no labels.
  • Vercel deploy-authorization bot comment is present; irrelevant to merge decision.

Branch hygiene assessment

Merge-from-main commit present but harmless and merge-safe.

Commit 73cddc7 merges a large PHP-migration PR (#1497) and several other recent main commits into the PR branch. git show --stat 73cddc7 shows no changes to safe-parse.ts or safe-parse.test.ts — the merge brings in only unrelated source, fixture, and doc files. The PR's net diff is confined exactly to the two stated files. Bisectability is not affected. No rebase or split is required for hygiene reasons, though maintainers may prefer a rebase before merge for a cleaner linear history.


Understanding of the change

parseSourceSafe is the single chokepoint through which every tree-sitter parse in the entire ingestion pipeline flows. It is called:

  • In the parallel worker (parse-worker.ts:1420) — the primary parse path for each file in a sub-batch, executed inside worker_threads.
  • In sequential fallback processors (parsing-processor.ts, call-processor.ts, heritage-processor.ts, import-processor.ts) — run in the main thread when the worker pool fails or for secondary passes.
  • In language-captures helpers (typescript/captures.ts, java/captures.ts, php/captures.ts, go/captures.ts, csharp/captures.ts, c/captures.ts, php/namespace-siblings.ts, csharp/namespace-siblings.ts) — secondary re-parses on AST cache miss or grammar mismatch; called from within the sequential processors.
  • In the scope-resolution pipeline (go/range-binding.tsscope-resolver.tsscope-resolution/pipeline/run.ts) — runs in the main thread after all worker processing is complete.
  • In group extractors (tree-sitter-scanner.ts, include-extractor.ts, grpc-extractor.ts, http-route-extractor.ts, thrift-extractor.ts) — used by the grouping/contract-extraction pipeline.
  • In embeddings (embeddings/ast-utils.ts) — used when generating vector embeddings.

The PR wraps every parseSourceSafe call in a setTimeoutMicros/finally-clear lifecycle and converts a timeout-induced null return into a ParseTimeoutError. The intent is to let tree-sitter abort cooperatively before the worker-pool idle timer fires and calls worker.terminate() on a live native parse, which races the NAPI binding and crashes the process.

The fix is correct in concept. The lifecycle of applyAndClearTimeout is sound. The problem is that the new exception type propagates to callers that have no catch block and that callers which do catch it log nothing specific about the timeout, producing invisible data loss.


Findings


Finding 1 — Uncaught ParseTimeoutError in the scope-resolution main-thread pipeline

  • Severity: blocker
  • File: gitnexus/src/core/ingestion/languages/go/range-binding.ts:24 / gitnexus/src/core/ingestion/scope-resolution/pipeline/run.ts:249-253
  • What is wrong: populateGoRangeBindings calls parseSourceSafe(parser, sourceText, …) on a cache miss with no surrounding try/catch. Its call site in run.ts is also unwrapped:
    // run.ts:249-253 — no try/catch
    if (provider.populateRangeBindings !== undefined) {
      provider.populateRangeBindings(parsedFiles, indexes, { fileContents, treeCache });
    }
    With GITNEXUS_PARSE_TIMEOUT_MICROS set, if a Go file hits a tree-cache miss during scope resolution and the secondary parse exceeds the timeout, ParseTimeoutError propagates uncaught through run.ts into the main-thread analysis pipeline. This is unrecoverable in-process (unlike the worker crash which worker-pool.ts catches via the 'error' event) — it would surface as an unhandled exception and abort the entire gitnexus analyze run.
  • Why it matters in production: The deadlock mitigation is supposed to make analysis more resilient, not introduce a new crash vector in the main thread for Go repositories.
  • Recommended fix: Wrap the parseSourceSafe call in range-binding.ts in try { … } catch { /* skip this file's range bindings */ }, and add a warning log so the skip is visible. Also add a guard at the run.ts call site or document the invariant that populateRangeBindings must be internally safe.
  • Blocks merge: yes — introduces a new crash path in the main analysis thread when the env var is set on any Go repository with a cold range-binding tree cache. Fix this →

Finding 2 — Silent data-loss across four production processors

  • Severity: blocker
  • Files:
    • gitnexus/src/core/ingestion/call-processor.ts:836-838 and :3424-3426
    • gitnexus/src/core/ingestion/heritage-processor.ts:231-234 and :426-428
    • gitnexus/src/core/ingestion/import-processor.ts:314-316
    • gitnexus/src/core/ingestion/parsing-processor.ts:419-421 (partially)
  • What is wrong: All five catch blocks catch ParseTimeoutError identically to every other parse error. The call-processor, heritage-processor, and import-processor catch blocks are completely silent (continue with no logging). The parsing-processor logs "Skipping unparseable file: ${file.path}" — indistinguishable from a syntax error or an unreadable file. No processor logs the configured timeout value, the file path under a ParseTimeoutError-specific message, or a running count of skips.
    // call-processor.ts:836 — silent:
    } catch (parseError) { continue; }
    
    // parsing-processor.ts:420 — wrong label:
    } catch (parseError) {
      logger.warn(`Skipping unparseable file: ${file.path}`);
      continue;
    }
  • Why it matters in production: A user who sets GITNEXUS_PARSE_TIMEOUT_MICROS=25000000 on a large monorepo gets a completed index with no indication of which files were silently excluded. GitNexus is an agent context tool — silently incomplete indexes corrupt downstream agent reasoning. This violates DoD §2.8 ("no silent catches that swallow diagnostics") and the PR-specific DoD bar set in this review.
  • Recommended fix: Add ParseTimeoutError-specific handling in at minimum parsing-processor.ts and call-processor.ts: if (parseError instanceof ParseTimeoutError) { logger.warn(Parse timeout for ${file.path}: ${parseError.message}); }. A final summary count of timeout-skipped files would be ideal.
  • Blocks merge: yes — DoD §2.8 violation; invisible index degradation when the opt-in is used as directed. Fix this →

Finding 3 — User-facing env var is undocumented

  • Severity: blocker (DoD §2.4)
  • Files: root README.md:230, gitnexus/README.md:348-359 (existing env var documentation sections — GITNEXUS_PARSE_TIMEOUT_MICROS is absent from both)
  • What is wrong: GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS and GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES are both documented in the README troubleshooting section with example values and consequences. GITNEXUS_PARSE_TIMEOUT_MICROS is not mentioned anywhere outside the PR diff. The PR description itself tells users to run export GITNEXUS_PARSE_TIMEOUT_MICROS=$((25 * 1000 * 1000)) and explains that the value must be below the worker idle timeout — but a user reading only the docs would never know this option exists.
  • Why it matters in production: The env var addresses a real crash that users experience. Without docs, the fix is effectively a hidden escape hatch discoverable only from git history. Users who find the crash independently cannot find the mitigation. Users who misconfigure it (value > worker timeout) get no benefit and no warning.
  • Recommended fix: Add GITNEXUS_PARSE_TIMEOUT_MICROS to the existing env var troubleshooting section in gitnexus/README.md alongside GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS. State: unit is microseconds, value must be less than the worker idle timeout (--worker-timeout default = 30s = 30,000,000µs), values near zero silently skip most files, unset disables the feature.
  • Blocks merge: yes — DoD §2.4: "If user-visible behavior… changes, the relevant docs, examples, help text, or migration notes are updated in the same change." Fix this →

Finding 4 — CI not yet complete; test results unverified

  • Severity: blocker (per review instructions)
  • What is wrong: The CI workflow (which runs ci-quality.yml = prettier + eslint + tsc, and ci-tests.yml = vitest with coverage across platforms) is in_progress at review time. PR Autofix concluded success, which implies the PR is already formatted cleanly, but the unit tests and typecheck have not been confirmed passing on the PR head SHA ae93706.
  • Why it matters: Per the review instructions and DoD §4.2, tsc --noEmit and npm test must pass before merge can be considered.
  • Blocks merge: yes — verify CI before any approval.

Finding 5 — Secondary language-capture re-parses uncaught in the sequential path

  • Severity: major
  • Files: gitnexus/src/core/ingestion/languages/typescript/captures.ts:138, java/captures.ts:59, php/captures.ts:67, go/captures.ts:24, csharp/captures.ts:90, c/captures.ts:22, php/namespace-siblings.ts:50, csharp/namespace-siblings.ts:60
  • What is wrong: These functions call parseSourceSafe on an AST cache miss (or grammar mismatch, in the TypeScript case). They are called from within the sequential processor loops (call-processor.ts, heritage-processor.ts, import-processor.ts) after the primary-parse try/catch block. If the timeout fires during one of these secondary re-parses, ParseTimeoutError propagates through the sequential processor's match-handling loop, which has no outer catch. The sequential processor then propagates it to parsing-processor.ts:802 which does have a catch and logs "Worker pool parsing stopped; continuing with sequential parser." That outer catch would actually handle it, but the processing of the current batch in the sequential loop is aborted entirely (not just the one file).
  • Why it matters: Secondary re-parses represent legitimate, non-pathological code paths (e.g., TypeScript TSX grammar mismatch on a real .tsx file). Aborting an entire sequential batch because of a secondary re-parse timeout is disproportionate when the intent is to skip only the pathological file.
  • Recommended fix: Wrap each secondary parse in the language captures files in a function-level try/catch that returns an empty result (same pattern used by tree-sitter-scanner.ts:157-161).
  • Blocks merge: maybe — lower priority than F1/F2/F3 because the outer catch in parsing-processor.ts:802 does eventually catch it, but it aborts more than intended.

Finding 6 — process.env read on every parseSourceSafe call

  • Severity: minor
  • File: gitnexus/src/core/tree-sitter/safe-parse.ts:40-45 (readParseTimeoutMicros)
  • What is wrong: readParseTimeoutMicros() reads and parses process.env.GITNEXUS_PARSE_TIMEOUT_MICROS on every invocation of parseSourceSafe. On a 100,000-file repo, this is ~100,000 process.env lookups plus Number() conversions per phase. Env vars are constants; this value should be read once at module load and cached.
  • Recommended fix: const PARSE_TIMEOUT_MICROS = readParseTimeoutMicros(); at module scope; remove the function call inside parseSourceSafe. (Also removes the edge case where the value could change mid-run if something modifies process.env in tests.)
  • Blocks merge: no

Finding 7 — Confusing ParseTimeoutError(0, …) when parse returns null with no timeout configured

  • Severity: minor
  • File: gitnexus/src/core/tree-sitter/safe-parse.ts:88
  • What is wrong: if (!tree) throw new ParseTimeoutError(timeoutMicros, sourceText.length) fires whenever parser.parse() returns null, even when timeoutMicros = 0 (feature disabled). In that case the error message reads tree-sitter parse exceeded GITNEXUS_PARSE_TIMEOUT_MICROS=0 — internally contradictory. Pre-PR, a null return from parse would have propagated silently (as undefined tree) or caused a later NPE; the new behavior is actually more explicit. But the error class name and message are wrong for the zero-timeout case.
  • Recommended fix: Guard: if (!tree) { if (timeoutMicros > 0) throw new ParseTimeoutError(…); throw new Error(\tree-sitter returned null for ${…}`); }Or just: only do the null check whentimeoutMicros > 0(safe because without a timeout set, tree-sitter should never returnnull`).
  • Blocks merge: no

PR-specific assessment sections

Runtime correctness

The core applyAndClearTimeout logic is correct. setTimeoutMicros is called before run(), cleared in finally, covering success, timeout-null, and native-throw cases. oldTree and options are forwarded unchanged through both parse paths. The if (!tree) guard correctly converts the timeout-null into a catchable exception. The default path (env var unset) has zero behavioral change except the new if (!tree) guard on a null that should never appear without a timeout — effectively a no-op in normal operation.

Worker-pool/deadlock mitigation

The fix correctly targets the production deadlock. The primary parse in parse-worker.ts:1419-1428 is wrapped in a try/catch that catches ParseTimeoutError and skips the file with a logger.warn that does include the error message (so the timeout is visible in the worker path). The worker idle timer still runs, but if GITNEXUS_PARSE_TIMEOUT_MICROS < worker idle timeout, tree-sitter aborts before the idle timer fires and calls worker.terminate(), eliminating the race. This wiring is correct. The mitigation requires the user to set the env var below the worker idle timeout — this dependency is explained in the PR description but not in docs (see F3).

Parser state lifecycle

setTimeoutMicros(0) in finally correctly resets state. Searching the entire gitnexus/src/ tree for setTimeoutMicros finds it only in safe-parse.ts — no other code owns parser timeout state. Resetting to 0 is therefore safe. The test at safe-parse.test.ts:96-103 verifies parser reuse after timeout is clean. ✅

Observability and incomplete-index risk

Critically incomplete. The worker path (parse-worker.ts:1424) logs the ParseTimeoutError message — observable. The sequential processors (call, heritage, import) swallow it silently. parsing-processor.ts logs "Skipping unparseable file" — wrong label for a timeout. No processor counts or summarizes timeout skips. A completed index can be missing arbitrary files with no user-facing signal. Blocker (Finding 2).

Env var validation/documentation

readParseTimeoutMicros() handles all edge cases correctly: unset → 0, empty string → 0 (via !raw), whitespace → Number(" ") = 0 → disabled, non-numeric → NaN → disabled, Infinity → fails isFinite → disabled, negative → fails > 0 → disabled, decimal → Math.floor truncates. The implementation is correct. Documentation is entirely absent (Finding 3).

Tests and CI

Three new test cases: no-op when unset, timeout throw, reset after timeout. All cover the wrapper's own behavior. Missing: invalid env var returns 0; reset after parser throw (not just timeout); TypeScript grammar-mismatch secondary re-parse under timeout; Go range-binding path; caller catch behavior; timeout observability. Tests rely on GITNEXUS_PARSE_TIMEOUT_MICROS=1 with 64KB source — deterministic in practice since 1µs is below any real parse latency. CI is still in_progress; test pass/fail unverified.

Hidden Unicode / hygiene

grep -nP "[\x{202A}-\x{202E}\x{2066}-\x{2069}\x{200B}\x{200C}\x{200D}\x{FEFF}]" on both changed files: no matches. Non-ASCII present: "…" (U+2026, horizontal ellipsis) and "—" (U+2014, em-dash) in comments in safe-parse.ts; "漢" (CJK) in the pre-existing safe-parse.test.ts Unicode test fixture. All are visible, context-appropriate, and not bidi controls. GitHub's hidden/bidirectional Unicode warning was a false positive triggered by the em-dash or ellipsis characters. No blocker here.

Branch hygiene

Merge-from-main commit 73cddc7 is clean. Net diff is exactly 2 files as stated. No unrelated churn in the PR's own changes.


Final verdict

not production-ready

This PR has three independent blockers that must be fixed before merge:

  1. Finding 1 introduces a new crash vector in the main-thread scope-resolution pipeline for Go repositories (go/range-binding.ts secondary parse propagates uncaught through run.ts) — the opposite of the PR's stated goal.
  2. Finding 2 makes timeout-skipped files completely invisible in the call, heritage, and import processors, producing silently incomplete indexes with no actionable signal — a DoD §2.8 violation.
  3. Finding 3 ships a user-facing env var with no documentation alongside the existing GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS docs — a DoD §2.4 violation.
  4. Finding 4: CI is still running; test results unverified.

The underlying design is sound and the core applyAndClearTimeout lifecycle is correct. The PR is close — Findings 1 and 2 require targeted catch additions in ~6 files, Finding 3 requires a README paragraph. A revised PR addressing these four points would be appropriate to approve.

@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

CI Report

Some checks failed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
❌ Tests failure unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
8614 8611 0 3 547s

✅ All 8611 tests passed

3 test(s) skipped — expand for details
  • Java > generates skill files
  • Java > context files updated
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 77.67% 27160/34967 N/A% 🟢 ███████████████░░░░░
Branches 66.1% 17123/25904 N/A% 🟢 █████████████░░░░░░░
Functions 82.4% 2739/3324 N/A% 🟢 ████████████████░░░░
Lines 80.83% 24546/30364 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@magyargergo

Copy link
Copy Markdown
Collaborator

@giulioleone097 can you please look into claude's findings?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants