Skip to content

fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads#1693

Merged
magyargergo merged 49 commits into
mainfrom
copilot/fix-analyze-hangs-on-root
May 20, 2026
Merged

fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads#1693
magyargergo merged 49 commits into
mainfrom
copilot/fix-analyze-hangs-on-root

Conversation

Copilot AI commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Hardens the ingestion worker path so a single pathological file (e.g. microsoft/TypeScript's tests/cases/fourslash/reallyLargeFile.ts) can't hang analyze and worker death can't permanently degrade the pool. Lands in three phases:

  1. Resilience layers — auto-respawn budget, per-slot circuit breaker, session-scoped quarantine, authoritative in-flight tracking, cumulative timeout budget, ready-handshake startup gate, slot-generation late-event guard.
  2. Zero-copy IPC via native postMessage — workers and pool exchange POJO directly via worker.postMessage(value, transferList). Node's structured-clone algorithm preserves Map / Set / Date / RegExp / BigInt / TypedArray / undefined / circular refs out of the box. File contents move zero-copy via transferList (TextEncoder.encode per file → dedicated ArrayBuffer → ownership transfer, never pool-backed). No custom protocol framing layer — one structured-clone walk per message, period.
  3. Cache integrity — chunk-cache writes suppressed when any chunk file is quarantined, so a partial worker result can't be locked in under the full-coverage chunk hash and silently replayed on a later analyze.

Workers are the sole failure-recovery contract for ingestion: there is no main-thread sequential rescue when a worker dies. Falling back to sequential would re-run the same tree-sitter parser that just killed the worker — for native-binding SIGSEGV (the most common quarantine cause), this turns silent missing-symbols into a loud analyze-wide crash. The pool's layered resilience is the contract; quarantined files surface in the warn log + stats; the next analyze with a fresh pool retries cleanly via cache-skip.


Architecture

flowchart TB
    classDef main fill:#1e3a5f,stroke:#4a90d9,color:#fff
    classDef pool fill:#2d4a3e,stroke:#7ab87d,color:#fff
    classDef layer fill:#3d3a2d,stroke:#d9b04a,color:#fff
    classDef worker fill:#4a3d4f,stroke:#b87dc4,color:#fff

    CLI[analyze CLI / MCP]:::main
    PI[parse-impl.ts chunk loop]:::main
    PP[parsing-processor.ts processParsing]:::main
    Cache[parseCache cross-run]:::main

    subgraph Pool["worker-pool.ts — createWorkerPool"]
        direction TB
        Disp[dispatch]:::pool
        Stats[getStats / getQuarantinedPaths]:::pool

        subgraph Layers["Resilience layers"]
            direction LR
            L1[Layer 1 Respawn budget]:::layer
            L2[Layer 2 Circuit breaker]:::layer
            L3[Layer 3 Quarantine]:::layer
            L4[Layer 4 Starting-file attribution]:::layer
            L5[Layer 5 Cumulative timeout]:::layer
        end

        subgraph Slots["Slot pool"]
            direction LR
            S0[Slot 0]:::pool
            S1[Slot 1]:::pool
            SN[... Slot N-1]:::pool
        end
    end

    W0[Worker 0 parse-worker.ts]:::worker
    W1[Worker 1 parse-worker.ts]:::worker
    WN[Worker N-1 parse-worker.ts]:::worker

    CLI --> PI
    PI -->|chunkFiles| PP
    PP -->|dispatch| Disp
    Disp --> Layers
    Layers --> Slots
    S0 <-->|postMessage + transferList| W0
    S1 <-->|postMessage + transferList| W1
    SN <-->|postMessage + transferList| WN
    PP -->|quarantine snapshot| PI
    PI -->|write only if no quarantine| Cache
    Stats -.->|verbose log| PI
Loading

Dispatch sequence (happy path)

sequenceDiagram
    autonumber
    participant PI as parse-impl chunk loop
    participant PP as processParsing
    participant Pool as WorkerPool
    participant Slot as Slot N
    participant W as Worker N
    participant Cache as parseCache

    PI->>PI: computeChunkHash(files)
    PI->>Cache: usedKeys.add(chunkHash)
    PI->>Cache: entries.get(chunkHash)
    Cache-->>PI: undefined — cache miss
    PI->>PP: processParsing(chunk, workerPool)
    PP->>Pool: dispatch(items)
    Note over Pool: filter quarantined paths<br/>buildDispatchMessage encodes each<br/>file content via TextEncoder
    Pool->>Slot: postMessage(payload, transferList)
    Slot->>W: structured-clone payload<br/>zero-copy transfer file ArrayBuffers
    Note over W: TextDecoder.decode<br/>content Uint8Array → string<br/>at tree-sitter call site
    loop per file
        W->>Slot: starting-file path
        Slot-->>Pool: record in-flight path
        W->>W: tree-sitter parse + extract
        W->>Slot: progress
    end
    W->>Slot: sub-batch-done
    W->>Slot: result
    Slot-->>Pool: collect results
    Pool-->>PP: results
    PP-->>PI: mergedData
    PI->>Cache: entries.set(chunkHash, rawResults)
    Note over Cache: cache write proceeds —<br/>no quarantine intersection
Loading

Worker death + quarantine + cache-skip

sequenceDiagram
    autonumber
    participant PI as parse-impl chunk loop
    participant PP as processParsing
    participant Pool as WorkerPool
    participant Slot as Slot N
    participant W as Worker N (alive)
    participant W2 as Worker N (replacement)
    participant Cache as parseCache

    PI->>PP: processParsing(a.ts, poison.ts, c.ts)
    PP->>Pool: dispatch
    Pool->>Slot: postMessage(payload, transferList)
    Slot->>W: deliver
    W->>Slot: starting-file a.ts
    Slot-->>Pool: inFlightPath = a.ts
    W->>Slot: progress + sub-batch-done for a.ts
    W->>Slot: starting-file poison.ts
    Slot-->>Pool: inFlightPath = poison.ts
    Note over W: tree-sitter SIGSEGV<br/>or uncaught throw
    W--xSlot: exit code 134

    Slot->>Pool: handleWorkerDeath inFlight=poison.ts
    Note over Pool: Layer 3 quarantine.add(poison.ts)<br/>respawnCount slot N ++ (within budget)
    Pool->>W2: spawn replacement
    W2->>Slot: ready handshake
    Pool->>Slot: re-dispatch c.ts only
    Note over Slot: poison.ts filtered out<br/>by Layer 3 pre-dispatch
    Slot->>W2: postMessage
    W2->>Slot: result for c.ts
    Slot-->>Pool: collect results

    Pool-->>PP: merged data — a.ts and c.ts only
    Note over PP: poison.ts symbols absent<br/>from this run's graph
    PP->>Pool: getQuarantinedPaths()
    Pool-->>PP: [poison.ts]
    PP->>PP: warn log — 1 file quarantined
    PP-->>PI: mergedData (no rescue)
    PI->>Pool: getQuarantinedPaths()
    Pool-->>PI: [poison.ts]
    Note over PI: chunkFiles intersects quarantine
    PI--xCache: SKIP entries.set(chunkHash)
    Note over Cache: chunk stays uncached<br/>next analyze gets a fresh pool<br/>cache miss triggers retry
Loading

Circuit breaker trip

sequenceDiagram
    autonumber
    participant PP as processParsing
    participant Pool as WorkerPool
    participant L1 as Layer 1 Respawn
    participant L2 as Layer 2 Breaker
    participant Slot as Slot N
    participant W as Worker N

    loop until threshold
        PP->>Pool: dispatch
        Pool->>Slot: deliver
        Slot->>W: postMessage
        W--xSlot: exit
        Slot->>L1: bump respawn + consecutiveFailures
        alt budget OK
            L1->>Slot: replaceWorker
        else respawnCount exceeded
            L1->>Pool: drop slot
        end
        Note over L2: check threshold<br/>crashes AND pure-timeouts both count
    end

    L2--xPool: tripBreaker
    Note over Pool: WorkerPoolDispatchError<br/>carries quarantinedPaths snapshot
    Pool-->>PP: reject — no sequential rescue
    PP-->>PP: error propagates to analyze entry
    Note over PP: operator sees hard failure<br/>with quarantine snapshot<br/>instead of silently-degraded graph
Loading

Resilience layers (the contract)

# Layer Triggered when Bound
1 Respawn budget Worker exits / errors / messageerrors maxRespawnsPerSlot (default 3, env GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT) — slot dropped from rotation when exceeded
2 Circuit breaker Per-slot consecutive failure count crosses threshold (crashes AND idle-timeout retries both count) consecutiveFailureThreshold (default max(3, poolSize), env GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD) — pool rejects further dispatches
3 Session-scoped quarantine In-flight file identified at moment of worker death Set<string> on the pool — filtered out of subsequent dispatches; surfaced via getQuarantinedPaths()
4 Starting-file attribution Worker emits {type:'starting-file', path} before each parse Pool's inFlightPath[slot] is the authoritative culprit on death; falls back to items[lastProgress] heuristic when no starting-file observed
5 Cumulative timeout budget Total wall-time across a job's retries/splits exceeds budget maxCumulativeTimeoutMs (default 5× sub-batch timeout, env GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS) — prevents exponential backoff from inflating into multi-hour stalls
Ready handshake Initial spawn + every replacement spawn WORKER_READY_TIMEOUT_MS (5s) — worker emits {type:'ready'} after top-of-script init; pool drops the slot if not seen within budget
Slot-generation guard Late events arriving after a slot has been respawned Per-slot generation counter; handlers short-circuit on stale generation, no double-recovery

IPC: native postMessage + zero-copy transferList

No custom wire format. Workers and pool exchange POJO via worker.postMessage(value, transferList). Node's postMessage runs V8's structured-clone algorithm internally — the same algorithm that preserves Map / Set / Date / RegExp / BigInt / TypedArray / undefined / circular refs natively. There is no JSON layer to lose them.

Dispatch payload shape (built by buildDispatchMessage):

{ type: 'sub-batch', files: Array<{ path: string; content: Uint8Array }> }

The file content is a Uint8Array (not a string) so its underlying ArrayBuffer can be transferred zero-copy via the transferList argument. The worker calls TextDecoder.decode lazily at the tree-sitter call site — one decode pass on the worker thread, in parallel with continued main-thread work.

Transfer-safety contract. Content Uint8Array instances are allocated via TextEncoder.encode, which produces a dedicated ArrayBuffer per call. Buffer.from(str, 'utf8') and Buffer.alloc may carve from Node's shared Buffer.poolSize slab, and transferring one pool-backed ArrayBuffer detaches every other Buffer sharing the slab — silent data corruption. TextEncoder bypasses the pool; transferring its outputs is safe. Pinned by test/unit/worker-pool-transferlist.test.ts (8 ArrayBuffer identities for 8 small files).

Why no wrapper protocol. An earlier iteration of this PR introduced a binary frame (protocol.ts — 1-byte tag + 4-byte LE length + V8-serialized body). Removed in commit 6c6fc774: that layer was a redundant V8.serialize → Buffer → postMessage(struct-clone-Buffer) double-walk, because postMessage runs the same structured-clone algorithm v8.serialize would call. Dropping the wrapper cut one full clone pass per message and ~430 LOC of framing code + tests.


Cross-run cache integrity

flowchart LR
    classDef path fill:#2d4a3e,stroke:#7ab87d,color:#fff
    classDef skip fill:#4a2d2d,stroke:#c97a7a,color:#fff

    A[Chunk N dispatched]:::path
    B{Any chunk file in pool.getQuarantinedPaths?}
    C[cache.entries.set chunkHash, rawResults]:::path
    D[SKIP cache.entries.set]:::skip
    E[cache.usedKeys.add chunkHash both branches]
    F[Next analyze fresh pool cache hit/miss]

    A --> B
    B -->|No| C
    B -->|Yes| D
    C --> E
    D --> E
    E --> F
Loading

Before the fix, a chunk containing a quarantined file would persist the partial worker output under the full-coverage chunk hash. The next analyze with unchanged content would hit the cache and silently replay incomplete results — symbols for the quarantined file permanently missing until the file's content changed. After the fix, the chunk stays uncached so the next analyze gets a fresh pool (quarantine is session-scoped) and a clean retry.


Configuration knobs

Env var CLI flag Default Purpose
GITNEXUS_WORKER_POOL_SIZE --workers <n> min(16, cores-1) Parse worker pool size. 0 = sequential.
GITNEXUS_PARSE_CHUNK_CONCURRENCY 2 Pre-fetch concurrency for chunk file reads (dispatch itself stays serial — pool is non-reentrant).
GITNEXUS_CHUNK_BYTE_BUDGET 2 MB Per-chunk byte budget. Smaller = finer cache-invalidation granularity.
GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS --worker-timeout <s> × 1000 30000 Per-sub-batch idle timeout.
GITNEXUS_WORKER_SUB_BATCH_MAX_BYTES 8 MB Per-job byte budget the pool sends in one postMessage.
GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT 3 Layer 1 — per-slot respawn budget before drop.
GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS 5× sub-batch timeout Layer 5 — total wall-time budget per job.
GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD max(3, poolSize) Layer 2 — consecutive deaths to trip breaker.
GITNEXUS_VERBOSE -v / --verbose unset Verbose ingestion logging (skipped-file warnings, per-chunk throughput, parse-cache stats).

Implementation highlights

  • Layered pool resilience (worker-pool.ts): all five layers + ready-handshake + slot-generation, all bounded
  • TS captures parent-walk (captures.ts): findSelfOrAncestorOfType[s] replaces per-match root-DFS in findNodeAtRange; drops emitTsScopeCaptures on checker.ts from ~7 min to ~3 s
  • Quarantine module extraction (workers/quarantine.ts): session-scoped Set-keyed snapshot API
  • Native IPC + zero-copy (buildDispatchMessage in worker-pool.ts): no custom wire format; TextEncoder.encode per content; transferList for ownership transfer
  • Chunk-cache integrity (parse-impl.ts): cache-write guard skips when any chunk file is quarantined
  • Sequential-fallback removal (parsing-processor.ts): workers are the sole resilience contract; pool failures propagate as hard errors instead of silent degradation
  • CLI threading (analyze.ts, cli/index.ts): --workers, env snapshot/restore, --workers 0
  • Chunk concurrency (parse-impl.ts): pre-fetch K chunks ahead via parseChunkConcurrency; dispatch stays serial
  • Deferred extraction (parse-impl.ts): per-chunk extraction passes batched to end-of-loop so workers stay busy

Testing

  • Worker pool resiliencetest/unit/worker-pool-resilience.test.ts (26 tests), worker-pool-slot-generation.test.ts, worker-pool-windows-quarantine.test.ts, worker-pool-cumulative-timeout.test.ts, worker-pool-options.test.ts
  • Dispatch payload shapetest/unit/worker-pool-transferlist.test.ts (6 tests pinning: Uint8Array per content, transferList in input order, pool-slab independence, non-parse / empty / mixed-shape fallback)
  • Quarantine moduletest/unit/workers/quarantine.test.ts
  • Cache integritytest/integration/parse-impl-quarantine-cache-skip.test.ts — real worker_threads, custom worker script crashes on poison.ts, asserts (a) surviving files in graph, (b) poison.ts NOT in graph (no rescue), (c) chunk hash absent from parseCache.entries, (d) second pass cache-misses + re-dispatches + still doesn't cache
  • TS capturestest/unit/scope-resolution/typescript/typescript-captures-anchor.test.ts — exact-count assertions on every anchor type the rewrite touches
  • Wall-clock + multi-chunktest/integration/parse-impl-large-fixture.test.ts — synthetic large fixture, 30 s Promise.race budget
  • Worker-pool integrationtest/integration/worker-pool.test.ts (28 tests) — real Worker threads, full lifecycle (death, respawn, breaker, ready handshake)
  • Scope-parity cache-miss regressions (Analysis failed: Phase 'scopeResolution' failed: Cannot add property 2, object is not extensible #1066) — test/integration/resolvers/{python,go,typescript,csharp}.test.ts — exercise large UTF-8/ASCII files through real workers; native postMessage preserves Map fidelity so scope-resolution doesn't choke on the worker round-trip

DoD §2.7 exact-count assertions throughout. 270/270 unit files + 7/7 worker-pool/parse-impl integration files green. 6133 unit tests + 822 integration tests passing.


Production-readiness review checkpoints

External multi-lane review surfaced four addressable items, all closed:

  • F1 — Unused findMatch helper removed (CodeQL was right; the sibling countMatchesTsx flag was a false positive — it IS called).
  • F2bench/parse-throughput.md retitled as scaffold; self-contradictory "fill in before merging" instruction dropped.
  • F3WorkerPoolDispatchError.fallbackExcludePaths renamed to quarantinedPaths (the "fallback" terminology was load-bearing under the pre-sequential-removal design; after the rescue path was removed, the rename reflects what the field actually carries).
  • F4 — Timeout-retry IIFE in worker-pool.ts now checks consecutiveFailureThreshold and trips the circuit breaker. Closes a gap where chronic pure-timeout deaths accumulated counts that never tripped the breaker.

Plus the CI-discovered issue that drove the architecture pivot: an early iteration of the IPC layer used a JSON-based protocol body, which silently destroyed every Map / Set / Date / BigInt / TypedArray / undefined value in the worker output. Production scope-resolution keys data structures on Maps throughout, so JSON serialization manifested as 'typeBindings is not iterable' on large fixtures. The fix went through two iterations: first a swap to v8.serialize (binary structured-clone in a Buffer), then realizing the whole wrapper was redundant and dropping protocol.ts entirely for native postMessage. Final state — no explicit serializer — is both correct (full structured-clone fidelity inherent in postMessage) and faster (one clone walk instead of two).

Deferred follow-ups (non-blocking, P2):

  • groupedNodes parallel map in captures.ts:188 — replace with Record<string, {capture, node}> to eliminate implicit sync.
  • Exit-code-0 quarantine exemption in worker-pool.ts exitHandler — bounded by quarantine + respawn budget today; tightening adds a dedicated code-0 branch.
  • WorkerPool.dispatch non-reentrancy is documented in comments but not enforced by an assertion. No production caller violates it.

Original task

  • Inspect ingestion worker / fallback codepaths and existing tests
  • Reproduce current behavior with baseline failures / perf hotspots
  • Implement minimal robust worker subprocess handling so pathological single files cannot stall full-chunk parsing
  • Add focused tests for timeout / retry / fallback behavior + performance guardrails
  • Run targeted tests + required validation; summarize results

@vercel

vercel Bot commented May 19, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
gitnexus Ready Ready Preview, Comment May 20, 2026 6:15pm

Request Review

@magyargergo

Copy link
Copy Markdown
Collaborator

Verification (issue #1684 acceptance criteria)

Verified locally on branch copilot/fix-analyze-hangs-on-root (614f3803) in worktree GitNexus-pr1693-worktree.

Automated checks

Check Result
npx vitest run test/unit/parsing-worker-fallback.test.ts 2/2 passed (includes new skip-on-timeout test)
npx tsc --noEmit (gitnexus/) pass

emitTsScopeCaptures perf (checker.ts, ~3.0 MB)

  • 3073 ms for ~66k match groups (was >7 min before parent-walk fix per investigation)

Integration repro (raised limits + pathological file)

Mini repo with src/compiler/checker.ts + tests/cases/fourslash/reallyLargeFile.ts:

$env:GITNEXUS_MAX_FILE_SIZE = "32768"
$env:GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS = "90000"
node dist/cli/index.js analyze <mini-repo> --force --index-only --skip-git `
  --max-file-size 32768 --worker-timeout 90 --name verify-pr-1693
Metric Result
Exit code 0
Wall time ~355s (bounded; no indefinite stall)
Log Skipping worker-timeout files in sequential fallback for reallyLargeFile.ts
checker.ts symbols 2576 Function nodes
reallyLargeFile.ts symbols 0 functions (file node only; parse skipped after timeout)

Acceptance criteria mapping

  • Raised-limit analyze completes in bounded time or skips pathological files with clear log — skip path confirmed
  • reallyLargeFile.ts does not block entire chunk via sequential fallback — confirmed (unit test + integration log)
  • compiler/checker.ts indexed under raised limits without worker timeout — confirmed in mini repro
  • README documents TypeScript indexing expectations — not in this PR (follow-up)

Not run (scope / time)

Full microsoft/TypeScript repo root (~39k files) end-to-end — expected to take hours even with fix; prior hang was specifically sequential fallback re-parsing reallyLargeFile.ts indefinitely.

Verdict: Fix addresses the blocking failure mode from #1684; safe to merge from a verification standpoint (pending CI).

@github-actions

github-actions Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
9357 9356 0 1 483s

✅ All 9356 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 78.66% 30466/38727 N/A% 🟢 ███████████████░░░░░
Branches 67.26% 19445/28906 N/A% 🟢 █████████████░░░░░░░
Functions 83.54% 3071/3676 N/A% 🟢 ████████████████░░░░
Lines 81.96% 27462/33503 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

…ingleton timeout

WorkerPoolDispatchError previously surfaced the stalled path only for the
singleton-timeout final-fail branch. Worker `error` and `exit` events (and
the msg-channel `error` reply) fell back to plain `Error`, so the sequential
fallback re-attempted every file in the active job — re-hanging on the same
pathological file when the worker crashed mid-parse.

Lift the in-flight-file inference into `inFlightExcludePath(job, lastProgress)`
and wire it into the three remaining in-pool failure sites. `lastProgress` is
already in `runWorker` scope, so `items[lastProgress]` (the next file the
worker was about to acknowledge) is the best single guess at the culprit;
earlier files are still re-tried sequentially. Returns `[]` when no path is
determinable (`lastProgress >= items.length`, or path missing/non-string) so
sequential retries the whole job.

Replacement-worker startup failures stay plain `Error` (no job context); the
result-before-flush protocol bug stays plain `Error` (code fault, not file).

Tests cover the three new exclusion paths plus a negative test confirming
non-WorkerPoolDispatchError throws fall through to full sequential retry.
- Use cause-neutral "worker-excluded" label in skip messages and tests now
  that worker error/exit paths share the same exclusion contract as
  singleton-timeout (correctness + maintainability reviewers).
- Add JSDoc to findSelfOrAncestorOfType{s} explaining the parent-walk
  short-circuit vs root-DFS fallback (maintainability reviewer).
Restructures `createWorkerPool` so a single bad file no longer kills the
pool for the rest of an analyze run. Five interlocking layers:

1. **Auto-respawn on error/exit** — worker death triggers `replaceWorker`
   on the same slot, bounded by `maxRespawnsPerSlot` (default 3). The slot
   is dropped from rotation when the budget is exhausted; other slots
   keep running.

2. **Circuit breaker** — replaces the permanent `poolBroken=true` with a
   consecutive-failure counter. The pool only trips after
   `consecutiveFailureThreshold` deaths (default `max(3, poolSize)`) with
   no successful job in between. A successful job resets the counter so
   transient bursts of bad files don't escalate.

3. **Session-scoped file quarantine** — paths identified as the in-flight
   file at the moment of a worker death are added to a `Set<string>` on
   the pool. `dispatch()` filters quarantined items up front (they never
   reach a worker again this pool lifetime). Exposed via the new
   `WorkerPool.getQuarantinedPaths()` so callers can log/route them.
   `processParsing` surfaces the per-chunk quarantine summary alongside
   the existing fallback-exclusion log.

4. **Authoritative in-flight tracking** — `parse-worker.ts` emits
   `{type:'starting-file', path}` before each file. The pool tracks this
   per slot and uses it for crash attribution, falling back to the
   `items[lastProgress]` heuristic only when no starting-file has been
   observed (very-early crash, older worker build). Closes the
   reorder/race concerns raised by reviewers C1 and R3 in the earlier
   review run.

5. **Per-job cumulative timeout budget** — each `WorkerJob` tracks the
   total wall time spent across attempts/splits/retries. When the budget
   is exhausted (default 5x `subBatchIdleTimeoutMs`), the pool surfaces
   the in-flight path instead of letting exponential backoff balloon
   into multi-hour stalls.

Cross-layer wiring: a new `wakeIdleSlots` helper kicks any non-busy live
slot when items are requeued (after a death or split-retry), so a dropped
slot doesn't strand work in the queue. `recoverAndResume` consolidates
the per-job teardown shared by the three in-pool death sites (`error`,
`exit`, msg-channel `error`).

New env knobs: `GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT`,
`GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS`,
`GITNEXUS_WORKER_CONSECUTIVE_FAILURE_THRESHOLD`.
New `WorkerPoolOptions.workerFactory` injection point for unit tests.

Tests: 12 new unit tests using a FakeWorker mock cover quarantine
seeding, slot-respawn, slot-drop after budget, breaker trip + reset,
and quarantine filtering. Plus option-resolution tests for the three
new env vars. All 19 worker-pool/-fallback/-options tests pass; full
unit suite 6040 passed / 30 skipped / 0 failed.
Walks through every finding from ce-code-review run
20260519-094648-3549cf5e. All 12 picked Apply.

Critical:
- F1 — Layer 5 cumulative-timeout exhaustion no longer silently drops
  the rest of the job. `requeueRemainder` is now invoked before
  `handleWorkerDeath` in both Layer 5 and singleton-final-fail give-up
  paths so non-quarantined items get re-tried by another worker.
- F2 — idle-timer recovery overhaul. `!shouldContinue` branch no
  longer calls `replaceWorker` (double-spawn race with the
  `handleWorkerDeath` inside `requeueAfterTimeout`). `shouldContinue`
  branch now enforces `maxRespawnsPerSlot` before respawning, closing
  the budget-bypass for the timeout-retry path. Also fixes premature
  `maybeDone` by simplifying the bookkeeping.
- F3 — `requeueRemainder` no longer pre-charges `cumulativeTimeoutMs`
  by `job.timeoutMs`. The death itself consumed no budget, so the
  next `requeueAfterTimeout` was double-billing the first attempt.
- F4 — `WorkerPool.getQuarantinedPaths` is now optional on the
  interface, matching the defensive `?.()` call site and the existing
  mocks. Removes the contract-vs-callsite contradiction.
- F5 — per-job unattributed-death tracking. When a worker dies with
  no exclusion attribution, `requeueRemainder` tracks death count per
  `startIndex`. First time: re-queue intact. Second time: quarantine
  items[0] as best guess, or drop the job entirely when items lack
  paths. Bounds the death loop the original design admitted to.
- F6 — per-slot consecutive-failure counter. Replaces the pool-wide
  scalar so a chronically-failing slot trips the breaker on its own
  streak instead of being masked by another slot's successes.

Smaller:
- F7 — exhaustiveness `never` check on `WorkerOutgoingMessage` union.
- F8 — recursive `runWorker` on fully-quarantined jobs converted to
  a while-loop.
- F9 — `tripBreaker` calls `reject(err)` BEFORE awaiting
  `worker.terminate()`. A stuck terminate no longer blocks the caller.
- F10 — `parsing-processor.ts` quarantine log de-duplicates per pool
  instance via a `WeakMap`. Only newly-quarantined paths are logged
  in each chunk; the per-chunk count still surfaces via progress.
- F11 — extract `firstPath` local in `requeueAfterTimeout`; eliminates
  double `itemPath` call and the `unknown as string` cast.

Tests (F12, 6 new):
- crash-error event path (errorHandler).
- F5 drop-branch coverage via items without `.path`.
- Common-case unattributable crash falling back to items[0] heuristic.
- `replaceWorker` startup failure (workerFactory emits 'exit' before
  'online').
- All-slots-dropped breaker trip.
- `GITNEXUS_WORKER_MAX_CUMULATIVE_TIMEOUT_MS` env override.

Residual gap (deferred): no unit test exercises the Layer 5
cumulative-budget runtime path — requires fake-timer interleaving
with FakeWorker that's too brittle for this iteration. Tracked.

Unit suite: 257 files / 6056 passed / 30 skipped / 0 failed.
…after-timeout flow

Adds 6 new real-worker integration tests covering the PR #1693
resilience layers + fixes 3 follow-on bugs surfaced while writing them.

New integration coverage (real worker threads + temp fixture scripts):

- `respawns the slot after worker process.exit and finishes the work on
  the replacement` — exercises Layer 1 auto-respawn + Layer 3 quarantine
  through real IPC.
- `attributes exactly via authoritative starting-file message on worker
  crash` — Layer 4 end-to-end: starting-file message → exact quarantine
  attribution (not the items[0] heuristic).
- `quarantine filters subsequent dispatches without sending to a worker`
  — second dispatch's sub-batch payload audited via filesystem; the
  quarantined path is never sent across the message channel.
- `drops a slot after maxRespawnsPerSlot and continues on the survivor`
  — 2-slot pool, slot dies twice past budget, survivor finishes
  re-queued remainder.
- `trips the circuit breaker on cascading per-slot consecutive failures`
  — single-slot pool, dies on every job, breaker trips after
  consecutiveFailureThreshold with WorkerPoolDispatchError carrying
  the cumulative quarantine.
- `survives a worker error event (uncaught throw) the same as a
  process.exit` — validates recoverAndResume on the errorHandler path
  via a real worker `throw` (not just process.exit).

Bug fixes uncovered while writing these tests:

1. **Stack-overflow recursion in runWorker's no-worker branch** —
   `if (!worker) { ...; wakeIdleSlots(); maybeDone(); }` recursed
   indefinitely when multiple slots were mid-respawn simultaneously
   (wakeIdleSlots → runWorker → no worker → wakeIdleSlots → …).
   Removed the wakeIdleSlots call: the slot's own respawn IIFE owns
   runWorker post-respawn, and other slots will pick up work via
   finishJob's runWorker.

2. **requeueAfterTimeout dispatched work before respawn completed** —
   the F2 fix had `requeueAfterTimeout` `void`-discarding
   `handleWorkerDeath`, so the `!shouldContinue` IIFE had no way to
   know when the respawn finished. New design: `requeueAfterTimeout`
   returns a `TimeoutDecision` discriminated union; the IIFE owns
   the death-and-respawn-and-dispatch orchestration in an async
   closure so it can `await handleWorkerDeath` and then call
   `runWorker` deterministically.

3. **Stalled-singleton + protocol-error + replacement-startup-crash
   tests** had stale contracts predating the resilience refactor. The
   stalled-singleton no longer rejects (it quarantines + resolves
   `[]`); the protocol-error rejection message now mentions
   "circuit breaker tripped"; the replacement-startup-crash test
   documents the known `waitForWorkerOnline` race (online fires
   before the worker's main script runs, so a top-level throw looks
   like a successful spawn) — the test asserts the file is
   quarantined via the second-idle-timeout give-up path.

Full suite: 334 files / 8982 passed / 43 skipped / 0 failed (second
run; first run had a Vitest-reported flake from an uncaught worker
exception bleeding into the test report — repeated runs are clean).
@magyargergo magyargergo marked this pull request as ready for review May 19, 2026 11:10
return matches.find((m) => predicate(Object.keys(m)));
}

function countMatchesTsx(src: string, predicate: (tags: string[]) => boolean): number {
@github-actions

github-actions Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 7m 51s —— View job


Production-Readiness Review — PR #1693


1. Review Bar for This PR

  • A pathological file (reallyLargeFile.ts) must not cause unbounded analyze time through any worker retry, split, or fallback path.
  • Worker crashes, timeouts, and protocol violations must be bounded, attributed, and observable via logs/stats.
  • Quarantine must be session-scoped, path-stable, and must not cause partial chunk results to enter the parse cache.
  • WorkerPool.dispatch is documented as non-reentrant; parseChunkConcurrency must only overlap file I/O, never concurrent dispatch.
  • IPC wire format must be mutually agreed between worker-pool.ts and parse-worker.ts; transferred buffers must not be pool-backed.
  • TypeScript capture parent-walk must preserve exact semantic output for the anchor types it touches.
  • Tests must cover real worker lifecycle paths; mock-only tests for lifecycle/IPC changes are not sufficient.
  • Docs and benchmark scaffolds must not present TBD values as evidence.
  • No hidden bidi/Unicode controls in executable code or tests.

2. Current PR State

Attribute Value
Title [WIP] Fix analyze hangs or stalls on microsoft/TypeScript repo root
Head c2bf66d (Merge remote-tracking branch 'origin' into copilot/fix-analyze-hangs-on-root)
Branch copilot/fix-analyze-hangs-on-rootmain
Changed files 34
CI Typecheck ✅, Tests ✅ (9093 passed, 39 skipped), E2E ✅
CodeQL Two review comments on typescript-captures-anchor.test.ts (unused functions findMatch and countMatchesTsx)
Hidden Unicode No hidden bidi controls found. Non-ASCII characters in test files are visible em dashes in code comments — acceptable under repo style.
WIP title Still says [WIP]; PR body says it was "later marked ready for review." Update the title before merging.

3. Branch Hygiene Assessment

Classification: merge-from-main commit present, but rebase/split would be strongly preferred.

The PR is causally connected to #1684: the worker-hang fix required IPC changes, quarantine semantics, circuit breaker, respawn, TypeScript capture optimization, and CLI option threading — these are all interlocking and not cleanly separable. However the diff is ~6500 LOC across 34 files, includes benchmark scaffolding, a new bench/ doc directory, README updates, and CLI wiki changes. The branch has 42 commits including merge-from-main commits and a terminal merge at the head. The diff is reviewable but dense, and CI signal for the oldest commits is stale. A rebase/squash onto current main with squashed logical groups would make the final diff cleaner and reduce noise from intermediate states.

The merge-commit-heavy topology does not introduce unrelated files. The scope expansion (IPC repack, transfer-list, chunk prefetching) is causally necessary: the original hang was partly caused by IPC overhead and main-thread extraction blocking worker throughput. All changed domains are within the ingestion pipeline.


4. Understanding of the Change

Worker Pool (worker-pool.ts): Adds auto-respawn on crash/exit (bounded by maxRespawnsPerSlot), per-slot consecutive failure circuit breaker, session-scoped quarantine (quarantine.ts), authoritative starting-file in-flight tracking with fallback heuristic, per-job cumulative timeout budget, initial ready handshake gate, slot-generation guards on event handlers, getStats() observability, and getQuarantinedPaths() snapshot. Zero-copy transfer-list dispatch (buildDispatchMessage) moves file content out of JSON via TextEncoder.encode + ArrayBuffer transfer. Pool size default changed to cores-1 capped at 16 with os.availableParallelism().

IPC (protocol.ts, parse-worker.ts): Binary frame protocol (1-byte tag + 4-byte LE length + JSON body). parse-worker.ts handles both the framed Uint8Array shape and a hybrid {envelope, contents} shape where file content is zero-copy transferred. Both worker and pool decode the same contract.

Quarantine/Cache (parse-impl.ts): Cache writes suppressed for any chunk that contains a quarantined file. Pool quarantine snapshot checked per-chunk after dispatch. Next run gets a fresh miss and reattempts against a new pool.

Pipeline concurrency (parse-impl.ts): parseChunkConcurrency prefetches file contents up to N chunks ahead; dispatch itself stays serial. Extraction (imports, heritage, calls) deferred to a single end-of-loop pass so workers stay busy between chunk dispatches.

CLI/env (analyze.ts, index.ts): --workers <n> threaded through AnalyzeOptions → PipelineOptions without process.env mutation. Env snapshot/restore for ANALYZE_CLI_ENV_KEYS on each analyzeCommand invocation. --workers 0 correctly disables the pool via workerPoolSize !== 0 check.

TypeScript captures (captures.ts): findNodeAtRange (root DFS per match) replaced by findSelfOrAncestorOfType[s] (parent walk from capture node). Falls back to findNodeAtRange if ancestor walk returns null. Dedicated groupedNodes parallel map tracks raw SyntaxNode references for the walk.


5. Findings

F1 — MINOR | typescript-captures-anchor.test.ts:28 | findMatch helper is declared but never called

Evidence: CodeQL correctly flagged findMatch at line 28. Reading the file confirms it is defined but has zero call sites. countMatchesTsx (line 33) IS called on line 88. The findMatch function was likely an early draft that survived cleanup.

Risk: Dead code, no production impact.

Fix: Remove findMatch or call it in a test that uses its return value. Fix this →

Blocks merge: No.


F2 — MEDIUM | bench/parse-throughput.md:92-106 | Benchmark doc explicitly says "TBD — fill in before merging"

Evidence: The doc itself states:

Status: scaffold — fill in before merging the PR #1693 follow-up.

And:

Regenerate this file before merging any PR that touches the ingestion pipeline.

The table has _TBD_ for all worker-pool rows. The methodology is sound and the harness is documented, but the doc is shipped as-is in a state it explicitly describes as not merge-ready. A reader landing on this file after merge would see TBD data and assume the benchmark was never run.

Risk: Misleading operator documentation. DoD §2.4 states "if user-visible behavior [...] change, the relevant docs [...] are updated in the same change."

Fix: Either fill in the measurement rows with real numbers before merging, or retitle the doc clearly as "scaffold / future work" and remove the self-contradictory "fill in before merging" note. Fix this →

Blocks merge: Maybe — if the DoD "docs updated in the same change" bar is enforced strictly. Non-blocking if the maintainer accepts the scaffold-only state.


F3 — LOW | worker-pool.ts:234 | WorkerPoolDispatchError.fallbackExcludePaths is a misleading name after sequential fallback removal

Evidence: Commit be1f65c removed the sequential-parser fallback. WorkerPoolDispatchError.fallbackExcludePaths still exists and is still populated with quarantine snapshots. The name implies a sequential fallback path that no longer exists. Current callers (parsing-processor.ts) catch WorkerPoolDispatchError but the sequential fallback call they used to make is now removed — the field exists but is no longer consumed.

Risk: API confusion for future contributors. Minor correctness concern: if a future contributor sees fallbackExcludePaths and re-introduces sequential fallback expecting the paths to be pre-populated, they would work correctly — but the name is actively misleading.

Fix: Rename to quarantinedPaths or excludePaths to match the current semantics.

Blocks merge: No.


F4 — LOW | worker-pool.ts:892 | Circuit breaker and respawn-budget checks interact in a potentially surprising order for small pools

Evidence: handleWorkerDeath checks consecutiveFailuresPerSlot[workerIndex] >= consecutiveFailureThreshold BEFORE checking respawnCount[workerIndex] > maxRespawnsPerSlot. For a 2-worker pool (threshold = max(3, 2) = 3, maxRespawnsPerSlot = 3): on the 3rd consecutive failure the breaker trips (3 >= 3), even though the slot still has one respawn available. This is correct but the breaker can trip a multi-slot pool based on a single slot's failures. The cumulative timeout path (give-up) also calls handleWorkerDeath via the IIFE, and the timeout-retry path increments consecutiveFailuresPerSlot directly but does NOT call handleWorkerDeath. This means chronically-timing-out slots accumulate failure counts that never trigger the circuit breaker threshold check — only the respawnCount > maxRespawnsPerSlot path drops them.

Risk: Monitoring inconsistency, not a correctness bug. The pool is always eventually bounded.

Blocks merge: No.


F5 — INFO | parse-impl.ts:526-528 | Cache suppression checks cumulative quarantine across ALL prior chunks, not just the current chunk

Evidence:

const quarantineSnapshot = workerPool?.getQuarantinedPaths?.() ?? [];
const quarantineSet = new Set(quarantineSnapshot);
const chunkHadQuarantine = chunkFiles.some((f) => quarantineSet.has(f.path));

The quarantine grows across all chunks. A file quarantined in chunk 1 would suppress caching for a later chunk IF that later chunk contained the same file. But since parseableScanned is sorted alphabetically and assigned to chunks in order, each file appears in exactly one chunk. So the cross-chunk suppression concern is moot: no file can appear in two different chunks in the same run.

Risk: None in practice. The logic is overly conservative but safe.

Blocks merge: No.


F6 — LOW | worker-pool.ts:1392 | exitHandler does not exempt exit code 0 from quarantine

Evidence: A worker exiting cleanly (code 0) is treated identically to a crash: the in-flight file is quarantined and the slot respawns. If a worker exits cleanly mid-job (e.g., due to an uncaught exception that set exit code 0 before dying), a valid file could be unnecessarily quarantined.

Risk: Low — a worker exiting with code 0 during active parsing is unusual. The pool's quarantine + respawn behavior bounds the blast radius. Noted in PR description as a P2 residual.

Blocks merge: No.


F7 — INFO | worker-pool.ts: dispatch non-reentrancy is documented but not enforced

Evidence: The code comment states: "Worker dispatch itself stays serial because WorkerPool.dispatch is not reentrant (concurrent calls would race on the shared per-slot busy/in-flight state)." Currently safe because parse-impl.ts awaits each dispatch sequentially. But there is no assertion, mutex, or assert(activeWorkers === 0) guard to prevent accidental concurrent invocation.

Risk: A future caller that adds a concurrent dispatch would silently corrupt busySlots across the two call closures. Not a production issue today.

Blocks merge: No — but a note in the WorkerPool.dispatch JSDoc would be appropriate.


6. Worker Pool Reviewer Assessment

Lifecycle: Auto-respawn (bounded by maxRespawnsPerSlot), circuit breaker (per-slot consecutive failures or all-slots-dropped), and terminate are all correctly implemented. The stopped flag prevents double-resolution.

Stale events: Slot-generation guards at every handler entry (if (slotGenerations[workerIndex] !== slotGen) return) are correct. Generation is bumped atomically with the worker swap in replaceWorker. A late event from the dead worker carrying the old generation short-circuits.

maybeDone safety: Jobs are pushed to jobs before activeWorkers is decremented in requeueRemainder (called before the decrement in recoverAndResume). The give-up IIFE first decrements activeWorkers, then calls requeueRemainder (synchronous, may or may not push). If queue is empty and activeWorkers === 0, maybeDone() correctly resolves with partial results (quarantined files intentionally absent). This is the correct behavior, not a premature resolution.

recoverAndResume double-call: Protected by settled flag and cleanup() listener removal. Error, exit, and messageerror handlers each have if (!settled) guards. No double-call possible.

activeWorkers accounting: Each runWorker increments once on start. Each death path (recoverAndResume, give-up IIFE, retry IIFE finally) decrements exactly once. No double-decrement found.

Startup crash detection: waitForWorkerReady with 5s timeout correctly bounds the case of a worker crashing before emitting ready. The initialReadyGate symmetrizes this check for the initial spawn pool.

messageerror handling: Correctly routed through recoverAndResume (not treated as a different code path). Distinguishable from protocol decode errors by the error class (ProtocolDecodeError vs. the deserialization error from V8).

Listeners: handler uses .on (needed for multiple progress/sub-batch-done messages per job). errorHandler, exitHandler, messageErrorHandler use .once. cleanup() calls worker.removeListener for all four. Confirmed no listener leaks.

terminate(): Sets terminated = true, terminates all workers with .catch(() => undefined), clears workers.length and activeSlots. Does not interact with a running dispatch — if called during dispatch, stopped was already set by circuit breaker or the outer finally block in parse-impl calls it after the dispatch promise has settled.

Observability: getStats() exposes activeSlots, droppedSlots, quarantined, poolBroken, terminated, slotGenerations. Surfaced in verbose log per chunk. Adequate.

Integration tests: worker-pool.test.ts uses real worker_threads. The PR adds 512 lines to this file covering process exit, error, all-slots-dropped, stale events (via generation), and cumulative timeout. This is real production coverage.

Verdict for this lane: Pass with minor caveats noted.


7. Thread Pool / Pipeline Concurrency Reviewer Assessment

Non-reentrant dispatch: WorkerPool.dispatch is non-reentrant. parse-impl.ts correctly uses a serial for loop with await processParsing(...). parseChunkConcurrency only prefetches file contents (calls readFileContents async) — it does NOT overlap dispatch calls. The comment confirming this is accurate. No concurrent dispatch risk in the current code.

Chunk ordering: File content promises are pre-fetched but await chunkContentPromises[chunkIdx]! in the loop body preserves sequential processing order. Deferred extraction (imports, heritage, calls) accumulates across all chunks and resolves at the end in fixed order. Graph outputs are deterministic regardless of how fast file reads complete.

Progress monotonicity: The parse phase occupies 20-70% (50 points), with deferred extraction at 70-95% (imports 70-75, heritage 75-80, routes 80-85, calls 85-95). Each stage advances the progress bar linearly. The edge case where totalParseable === 0 correctly jumps to 95 without going through the 20% announce. No non-monotonic progress paths found.

Memory: parseChunkConcurrency = 2 with default chunkByteBudget = 2MB means at most ~4MB of raw file content is in memory ahead of the current chunk. This is conservative. The cumulative deferred arrays (deferredWorkerCalls, deferredWorkerImports, etc.) hold data for ALL chunks until the end-of-loop pass, which is the prior behavior. For large repos this could be significant, but it's bounded by the total repo size and unchanged from before.

Cache correctness: Cache hits replay the stored ParseWorkerResult[] through mergeChunkResults. Cache writes are suppressed when any file in the chunk is in the quarantine set. A subsequent run with a new pool will reattempt the chunk. The invariant is preserved: no partial chunk result is ever cached.

workerPoolSize === 0: Correctly checked at pool creation (options?.workerPoolSize !== 0). Sequential path runs intact.

Env reads: resolveChunkByteBudget, parseChunkConcurrency, and resolveAutoPoolSize are all called at invocation time (not module-level). resolveWorkerPoolOptions reads process.env at call time. No stale module-level env captures.

Verdict for this lane: Pass.


8. IPC / Transfer-List Assessment

encodeMessage / decodeMessage: Binary frame (1-byte tag + 4-byte LE uint32 length + UTF-8 JSON). Tags 0x01-0x08 validated. Short buffer, unknown tag, truncated payload, and invalid JSON all throw ProtocolDecodeError. decodeMessage accepts both Buffer and bare Uint8Array via Buffer.from(view.buffer, view.byteOffset, view.byteLength) — zero-copy view, correct for non-zero-offset views.

Hybrid envelope/contents dispatch (buildDispatchMessage): TextEncoder.encode creates dedicated ArrayBuffer (not pool-backed). Envelope (encodeMessage output via Buffer.allocUnsafe) is NOT transferred — correct, since Buffer.allocUnsafe may be pool-backed. The worker's decodeIncomingMessage checks decoded.files.length !== contents.length and throws a typed error if they diverge — this routes through the outer try/catch → encodeMessage(Error, ...)recoverAndResume on the pool side. The byteLength metadata field is carried in the envelope but not validated against the actual transferred buffer length. This is a minor correctness gap — a corrupted/partial transfer would silently decode to the full transferred buffer length rather than the declared byteLength. In practice, transferList transfers are atomic (either the entire buffer is transferred or the call throws), so this scenario cannot occur under normal operation. Not blocking.

After transfer: The transferred Uint8Array buffers' underlying ArrayBuffers are detached on the main thread after postMessage. The main thread does not hold references to these buffers after the buildDispatchMessage call returns. Safe.

FakeWorker in tests: Tests use POJO messages, and the pool's decodeIncomingWorkerMessage tolerates both encoded and POJO shapes. The worker-pool.test.ts uses real Worker threads for IPC correctness paths.

Verdict for this lane: Pass. The byteLength non-validation is minor and non-exploitable.


9. Quarantine / Cache Assessment

When a file is quarantined: It is absent from the graph for the current run. This is documented: "files identified as the in-flight file at the moment of a worker death." The absence IS observable: the worker log emits "Worker quarantine: N new file(s) skipped this chunk (path/to/file.ts)" and getStats().quarantined is non-zero. The verbose throughput log surfaces the count per chunk.

Exit code behavior: Analyze exits 0 when quarantine occurs but the pool does not trip. This is documented in the PR description (quarantine = degraded completion, not hard fail). Exit code is non-zero only if the pool circuit breaker trips. This behavior is deliberate and acceptable — the operator can see the quarantine log even on exit 0. Whether this should be a nonzero exit for "incomplete graph" is a product policy question beyond the scope of this PR.

Cache suppression: The logic at parse-impl.ts:525-546 correctly suppresses cache write if any file in the chunk is quarantined. On the next run, the chunk gets a miss and reattempts. The quarantine is session-scoped (new pool = fresh quarantine set), so the reattempt is clean.

Cross-chunk quarantine affecting cache: Not a practical issue — alphabetically-sorted chunks guarantee each file is in exactly one chunk. A quarantined file from chunk 1 cannot suppress caching for chunk 3. Confirmed safe.

Tests for this behavior: parse-impl-quarantine-cache-skip.test.ts is an integration test (398 lines) that exercises the quarantine + cache interaction end-to-end. It is real (not mock-only). This satisfies DoD §2.7.

Verdict for this lane: Pass.


10. TypeScript Capture Assessment

Semantic correctness: findSelfOrAncestorOfType[s] walks from the capture node's own subtree upward. For every anchor type used in the TS query:

  • import_statement / export_statement: findSelfOrAncestorOfTypes(['import_statement', 'export_statement']) — if the query captures the statement node directly, self matches on first step. Falls back to findNodeAtRange if null.
  • call_expression (dynamic imports, call arity): findSelfOrAncestorOfType('call_expression') from the anchor node. For @import.dynamic the anchor is the call_expression itself — self-match on first step. For @reference.call.*, the anchor is the callee name inside the call — the call_expression is the immediate parent. Correct.
  • member_expression (read member): Self-match when anchor is the member_expression.
  • function_declaration, method_definition etc. (arity): The @declaration.method/function anchor is the function name, with the function node as the immediate parent or ancestor. findFunctionNode handles this.
  • Falls back to findNodeAtRange in all cases if the ancestor walk returns null. So no regression from the walk failing to find the node — only a performance regression back to the old behavior.

groupedNodes parallel map: As noted in PR residuals (P2), groupedNodes and grouped are populated in the same loop iteration and are always in sync. Not a correctness issue in the current code; cleanup is a future follow-up.

Tests: typescript-captures-anchor.test.ts adds exact-count assertions (toBe(1), toBe(2)) for all the anchor types the rewrite touches. DoD §2.7 compliance confirmed. The one unused helper (findMatch) is a cleanup item.

Performance claim: Verification by @magyargergo shows emitTsScopeCaptures on checker.ts took ~3.2s vs. >7min before. The root-DFS elimination is the correct fix for the pathological case.

Verdict for this lane: Pass with one minor cleanup item (unused findMatch).


11. Docs / Benchmark / Operator Experience Assessment

README.md / gitnexus/README.md: New env vars (GITNEXUS_WORKER_POOL_SIZE, GITNEXUS_WORKER_MAX_RESPAWNS_PER_SLOT, etc.) and --workers CLI flag are documented with correct defaults. The defaults in the docs match the constants in the code. No discrepancies found.

bench/parse-throughput.md: The doc structure, methodology, and harness recipe are well-written. However, the measurement table has TBD values and the doc explicitly states it should be filled in before merging. This is the primary documentation concern for this PR. The doc says it's a "scaffold" — so either ship it as clearly-labeled future work OR fill in the numbers.

Help text / CLI wiring: --workers <n> is a Commander option wired through AnalyzeOptions.workers → workerPoolSize → PipelineOptions. --workers 0 disables workers via the !== 0 check. --worker-timeout sets env and is snapshot/restored. --verbose is threaded as env. All options are consistent between help text and implementation.

Verbose log: Per-chunk throughput log (files/s, pool stats) emits under isDev || isVerboseIngestionEnabled(). Previously isDev-only, so operators couldn't see it with --verbose in production. Fixed in this PR. Correct.

Verdict for this lane: Minor concern on benchmark TBD values.


12. Hidden Unicode / Hygiene

Scan result: grep -nP "[\x{202A}-\x{202E}\x{2066}-\x{2069}\x{200B}\x{200C}\x{200D}\xEF\xBB\xBF]" on the flagged files returned no matches.

The non-ASCII characters GitHub rendered warnings for are em dashes (—) used in code comments, which are visible punctuation. Per the review instructions: "Do not block ordinary visible punctuation, Markdown em dashes, smart quotes, or emoji if repo style allows them." These are acceptable.

CodeQL findings: Two findings on typescript-captures-anchor.test.ts — unused findMatch (confirmed) and countMatchesTsx (NOT actually unused — called on line 88). CodeQL's second finding is a false positive.

Verdict: No blocking hidden Unicode. One real CodeQL finding (unused findMatch).


13. CI and Test Adequacy

CI status (latest run, from comment):

  • Typecheck: ✅
  • Tests: ✅ (9093 passed, 0 failed, 39 skipped)
  • E2E: ✅

Skipped tests: The 39 skipped tests include "TypeScript large-file cache-miss parser buffer regression" and "C# large-file" suites. These appear to be pre-existing skips due to environment/grammar availability, not related to this PR's changes.

Test coverage for changed domains:

Domain Tests Real workers?
Worker lifecycle (crash, exit, messageerror) worker-pool.test.ts (+512 lines), worker-pool-resilience.test.ts (659 lines) Yes — worker-pool.test.ts uses real Worker
Protocol IPC workers/protocol.test.ts (176 lines) Unit (pure encoder/decoder)
Quarantine workers/quarantine.test.ts (96 lines), parse-impl-quarantine-cache-skip.test.ts (398 lines integration) Integration test: real
TypeScript captures typescript-captures-anchor.test.ts (152 lines) Unit (real tree-sitter)
Parse-impl concurrency parse-impl-chunk-concurrency.test.ts, parse-impl-deferred-extraction.test.ts Unit
Worker pool options worker-pool-options.test.ts, worker-pool-slot-generation.test.ts Unit

The worker-pool-resilience.test.ts uses FakeWorker (not real Worker threads). The worker-pool.test.ts integration test uses real threads and is the primary IPC/lifecycle coverage. This split is acceptable: resilience logic is unit-tested with controlled fake workers; IPC contract is tested with real threads.

Verdict: Test coverage is adequate for the changed domains. DoD §2.7 exact-count assertions are present in new tests.


14. Final Verdict

Production-ready with minor follow-ups.

The core fix is sound and well-engineered: pathological files are bounded by the cumulative timeout budget + quarantine, worker crashes are bounded by the respawn budget + circuit breaker, IPC transfer-list is correctly zero-copy with no pool-buffer transfer risk, cache suppression correctly blocks partial chunks from being persisted, and the TypeScript capture optimization correctly preserves semantic output while eliminating the per-match root DFS.

Required before merge:

  1. Remove the [WIP] from the PR title.
  2. Remove or call the unused findMatch function in typescript-captures-anchor.test.ts:28 (CodeQL F1).
  3. Either fill in the benchmark measurement rows in bench/parse-throughput.md or remove the self-contradictory "fill in before merging" instruction from the doc (F2).

Non-blocking follow-ups (can be separate PRs):

  • Rename WorkerPoolDispatchError.fallbackExcludePathsquarantinedPaths or excludePaths (F3).
  • Add a non-reentrancy assertion or JSDoc to WorkerPool.dispatch to document and guard the contract (F7).
  • Consider adding byteLength validation in the hybrid envelope decode path.
  • Address P2 residuals from the PR description: groupedNodes parallel map, Windows path separator, exit code 0 exemption.

Multi-lane review run on the PR #1693 branch surfaced four addressable
items beyond the blocking three.

F1 (minor, CodeQL): unused `findMatch` helper in
test/unit/scope-resolution/typescript/typescript-captures-anchor.test.ts:28
removed. `countMatchesTsx` flagged by the same CodeQL pass is a false
positive — it's called at line 88 by the JSX-anchor regression tests
so the rewrite case actually fires under TSX, not just TS.

F2 (medium, docs): bench/parse-throughput.md retitled as
"(scaffold)" with an explicit "no measurement data has been collected
yet" note above the table. The self-contradictory "Regenerate this
file before merging any PR that touches the ingestion pipeline"
instruction is dropped — the file ships intentionally without
numbers; the load-bearing perf-regression protection lives in
test/integration/parse-impl-large-fixture.test.ts (U6, 30s
Promise.race wall-clock budget). The Latest measurement section now
preserves the ~6s sequential observation as a smoke reference, not as
a regression target.

F3 (low, API hygiene): `WorkerPoolDispatchError.fallbackExcludePaths`
renamed to `quarantinedPaths`. The "fallback" terminology was
load-bearing under the pre-U20 design when `processParsing`'s
sequential-fallback catch-block consumed it to filter the fallback
file list. After commit be1f65c removed that catch-block, no
production code reads the field — but it stays populated by the pool
because the snapshot is genuinely useful operator diagnostics when
the breaker trips. The rename clarifies the field's actual semantics
(here are the files the pool quarantined before it tripped) without
changing wire behavior. Definition + the lone surviving in-pool
comment reference + both test assertions updated.

F4 (low → real fix, reliability): timeout-retry IIFE in
worker-pool.ts now consults `consecutiveFailureThreshold` and trips
the circuit breaker when the per-slot consecutive-failure count
crosses it. Closes a gap left by ce-code-review's REL-02 patch — that
fix added the `consecutiveFailuresPerSlot[workerIndex]++` increment
in the timeout-retry path but did NOT add the corresponding
threshold-check + tripBreaker call. Result: chronic pure-timeout
deaths accumulated counts that never tripped the breaker until the
slot also hit `respawnCount > maxRespawnsPerSlot`. Now timeouts and
crashes are structurally treated the same way by the breaker, which
is what the REL-02 increment was meant to enable. Test coverage:
worker-pool-resilience.test.ts already exercises the breaker via the
shared handleWorkerDeath path; this new branch traces the same
trip semantics with a different entry point, so the breaker-tripped
state is observable via the same `getStats().poolBroken` and
`WorkerPoolDispatchError.quarantinedPaths` surface.

Out of scope here (caller actions or future PRs):
  - F5 (info): cumulative-quarantine cache check is safe in practice
    because chunks are alphabetically deterministic; no action.
  - F6 (low): exit-code-0 quarantine exemption — pre-existing P2
    residual, bounded by quarantine + respawn budget; deferred.
  - F7 (info): dispatch non-reentrancy contract documented but not
    enforced; no production caller violates it; deferred.
  - PR title `[WIP]` removal — happens on GitHub side.

Tests: 274/274 test files (6185 passing, 30 skipped). The single
"error" in the integration runner is the pre-existing intentional-
process.exit unhandled-exception leak from the deliberate startup-
crash test worker, unchanged by these fixes.
@magyargergo magyargergo changed the title [WIP] Fix analyze hangs or stalls on microsoft/TypeScript repo root fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads May 20, 2026
CI scope-parity tests on Ubuntu surfaced silent data loss in the
worker IPC: `Phase 'scopeResolution' failed: scope.typeBindings is not
iterable` (Python, Go) and `importerModule.typeBindings.has is not a
function` (Python). Plus three #1066 large-file regression tests
(Python / C# / TypeScript) failed because call relationships weren't
resolving from the worker output.

**Root cause:** U17 introduced `JSON.stringify`/`JSON.parse` as the
protocol body codec. JSON has no representation for `Map`, `Set`,
`Date`, `RegExp`, `BigInt`, `TypedArray`, `undefined` values, or
circular refs — `JSON.stringify(someMap)` returns `"{}"`. Production
scope-resolution code keys data structures on Maps throughout
(`ParsedFile.scopes[*].typeBindings: ReadonlyMap<string, TypeRef>`,
plus `bindings`, `bySourceScope`, `byTargetDef`, the finalize-algorithm
edge indexes, etc.). The JSON round-trip silently turned every Map
into an empty object, manifesting downstream as iteration / `.has`
calls failing on the decoded payload.

**Fix:** replace the JSON body with `node:v8`'s `serialize` /
`deserialize`. That's the same structured-clone algorithm Node's
`worker.postMessage` uses natively — bit-for-bit compatible with the
pre-U17 implicit-clone path. Full type fidelity for Map, Set, Date,
RegExp, BigInt, TypedArray, undefined values, and circular refs. No
external dependency.

A previous iteration of this fix attempted to bolt a Map/Set
replacer+reviver onto the JSON path. Rejected in favor of V8
serialization because:
  - the JSON tag-marker approach requires per-type registration
    (Map, Set; then Date, RegExp, BigInt would each need their own
    sentinels); V8 handles them all uniformly
  - keys to JSON-encode would still need handling for nested types
    (and the marker approach doesn't survive nested Maps-in-Maps
    cleanly without recursive replacer logic)
  - V8 is faster than JSON for object-heavy payloads anyway (binary
    format, no string escaping pass)
  - the user-explicit ask was "a much more generic solution that will
    work for everything" — V8 serialization IS the generic solution

Trade-offs documented in the module header:
  - body bytes are opaque (binary, not human-readable) — debugging
    requires `v8.deserialize` ad-hoc; protocol.test.ts exercises every
    supported MessageTag including the new type-fidelity cases as a
    regression net.
  - format is tied to the running Node major. Pool always spawns
    workers on the same Node instance the main thread runs, so this is
    moot in production. Would matter if frames ever persisted to disk
    (nothing does today).

Protocol test file rewritten:
  - drops the JSON-specific byte-layout assertions (e.g. `body must
    equal "null" string`) — replaced with V8-derived expected lengths
  - adds a "structured-clone type fidelity" describe block that pins
    Map, nested Map, Set, Date, RegExp, BigInt, TypedArray, undefined
    values, and circular-ref round-trips. These are the load-bearing
    regression tests preventing a future "optimize" PR from quietly
    swapping V8 back to JSON.
  - the bad-body decode-error test now uses arbitrary non-V8 bytes
    instead of `{not-json}` — same intent.

Integration test READY_PREAMBLEs (worker-pool.test.ts and
parse-impl-quarantine-cache-skip.test.ts) update their inline
decoders to use `v8.deserialize` matching the production codec.
Both files have a standalone CJS worker preamble that can't import
dist/protocol.js by relative path, so the V8 dependency is required
via `node:v8` directly.

Tests: 271/271 unit files (6163 tests + 30 skipped). 28/28
worker-pool integration. 3/3 parse-impl integration. 791/791
scope-parity tests (the four CI-failing files: python.test.ts,
go.test.ts, typescript.test.ts, csharp.test.ts) all green again.

References plan: docs/plans/2026-05-20-002-fix-chunk-cache-corruption-on-worker-quarantine-plan.md
…rList

The protocol.ts framing layer was redundant — Node's `worker.postMessage`
already runs V8 structured-clone internally, the same algorithm that
backed `v8.serialize`. Wrapping V8.serialize → Buffer →
postMessage(struct-clone-Buffer) was a double-walk: one full
structured-clone pass to produce the Buffer, then another pass when
postMessage cloned that Buffer across threads. This commit cuts the
wrapper layer; workers and pool exchange POJO directly via
`worker.postMessage(value, transferList)`, with file-content
`ArrayBuffer`s in `transferList` for zero-copy ownership transfer.

What changes:

- **Deleted** `src/core/ingestion/workers/protocol.ts` (~180 LOC) +
  `test/unit/workers/protocol.test.ts` (~250 LOC). The MessageTag
  enum / ProtocolDecodeError / encodeMessage / decodeMessage surface
  is gone. Tag-based routing is replaced by the `msg.type`
  discriminant that every receive site already checks. Protocol-decode
  errors map to Node's `messageerror` event (V8 deserialization
  failures during postMessage), which the pool already wires to
  `recoverAndResume`.
- **`worker-pool.ts`**: `decodeIncomingWorkerMessage` removed; handlers
  receive POJO directly. `buildDispatchMessage` now returns
  `{message: {type:'sub-batch', files: [{path, content: Uint8Array}]},
  transferList: ArrayBuffer[]}`. The Uint8Array-per-content allocation
  via `TextEncoder.encode` is preserved (it's the load-bearing
  transfer-safety contract that keeps content out of Node's shared
  `Buffer.poolSize` slab). Flush dispatch is now plain
  `worker.postMessage({type:'flush'})`.
- **`parse-worker.ts`**: `decodeIncomingMessage` removed. The message
  handler receives POJO directly; the only conversion is
  `Uint8Array → string` for sub-batch file contents at the
  `decodeSubBatchFiles` boundary, before handing to `processBatch`.
  Outgoing messages are emitted as POJO via plain
  `parentPort.postMessage({type:'starting-file', ...})` etc. The
  `sharedHybridDecoder` is now `sharedContentDecoder` (same intent,
  clearer name for the simpler shape).
- **Test scaffolding**: FakeWorkers in `worker-pool-resilience`,
  `worker-pool-windows-quarantine`, and `worker-pool-slot-generation`
  drop their `decodeMessage` import + `decodeDispatchedMessage` helper.
  The helpers stay (still convert `files[i].content` Uint8Array →
  string for test-action introspection) but no longer touch any
  protocol framing — just shape-check for sub-batch.
- **Integration READY_PREAMBLEs** (worker-pool.test.ts and
  parse-impl-quarantine-cache-skip.test.ts): drop the inline
  v8.deserialize + envelope-unzip logic; the preamble is now just
  the ready handshake + a `parentPort.on` wrapper that converts
  `files[i].content` Uint8Array → string for the ad-hoc test worker
  scripts.
- **`worker-pool-transferlist.test.ts`**: contract tests updated for
  the new buildDispatchMessage shape — no `envelope` field anymore;
  `message.files[i].content` is Uint8Array; transferList holds each
  content.buffer in input order. Pool-slab independence still pinned.

What stays the same:

- Zero-copy file-content transfer via transferList — every file's
  ArrayBuffer is ownership-transferred to the worker (no copy).
- Full structured-clone type fidelity — Map / Set / Date / RegExp /
  BigInt / TypedArray / undefined / circular refs all preserved by
  Node's native postMessage. The V8 fix from commit 06f6957e is
  inherent in this path; there's no JSON layer to lose them.
- TextEncoder-per-content allocation — keeps content buffers out of
  the shared `Buffer.poolSize` slab so transferring one cannot detach
  another.
- The pool's resilience layers (respawn, breaker, quarantine,
  starting-file attribution, cumulative timeout, ready handshake,
  slot-generation guard) — unchanged.
- U20 chunk-cache write suppression on quarantine — unchanged.

Net: ~430 LOC removed (protocol.ts + tests + inline decoders + helpers),
~120 LOC simplified in worker-pool.ts and parse-worker.ts. One less
serialization pass per message on the hot path.

Tests: 270/270 unit files (6133 + 30 skipped). 822/822 integration
tests including the four CI-failing scope-parity files (Python, Go,
TypeScript, C#) — the V8-fidelity contract holds via native
postMessage with no explicit serializer. The single "error" reported
in worker-pool.test.ts is the pre-existing intentional
process.exit unhandled-exception artifact from the deliberate
startup-crash test, unchanged by this commit.
The `parentPort.on('message', ...)` handler had an `Array.isArray(msg)`
branch left over from a pre-sub-batch dispatch shape — the pool used
to send the items array directly, before the worker pool added
sub-batching and the `{type:'sub-batch', files: ...}` envelope.

No production caller has dispatched that shape since the sub-batching
refactor landed; verified by grepping the repo for `postMessage([`
patterns (zero matches). The `ParseWorkerInput[]` arm in the
`WorkerIncomingMessage` discriminated union also blocked
exhaustiveness narrowing — flagged by the kieran-typescript code
review (RR-01) as "if a future unit removes the legacy array path,
this arm should be dropped." Dropping it now.

What changes:
  - Remove the `Array.isArray(msg)` branch from the message handler.
  - Drop `ParseWorkerInput[]` from the `WorkerIncomingMessage` union;
    it's now a clean `{type:'sub-batch'} | {type:'flush'}` discriminated
    union, so the dispatch switch is exhaustive over `msg.type`.

Tests: 71/71 worker-pool unit + integration tests green (resilience,
slot-generation, windows-quarantine, transferlist, parsing-worker-
fallback, worker-pool integration, parse-impl-quarantine-cache-skip).
@magyargergo

Copy link
Copy Markdown
Collaborator

Windows benchmark: full-root microsoft/TypeScript analyze (fixes #1684)

Verified on Windows that root gitnexus analyze on a fresh clone of microsoft/TypeScript completes successfully with this branch — no infinite hang on tests/cases/fourslash/reallyLargeFile.ts (~3.5 MB fourslash fixture).

Acceptance criteria (#1684)

Check Result
Root analyze finishes (exit 0) YesRepository indexed successfully
reallyLargeFile.ts does not block the run Yes — worker timeout → quarantined on chunk 65/66 (~30 min wall on that chunk)
compiler/checker.ts present in graph Yes — e.g. checkSourceElementWorker, structuredTypeRelatedToWorker, onFailedToResolveSymbol in top process entry points

Benchmark runs (same machine, same env)

All runs used:

$env:NODE_OPTIONS = "--max-old-space-size=24576"
$env:NODE_ENV = "development"          # required for 📊 throughput logs
$env:GITNEXUS_VERBOSE = "1"
$env:GITNEXUS_MAX_FILE_SIZE = "32768"
$env:GITNEXUS_WORKER_SUB_BATCH_TIMEOUT_MS = "600000"
$env:GITNEXUS_PARSE_CHUNK_CONCURRENCY = "4"

gitnexus analyze <TypeScript-root> --force --index-only --skip-agents-md --skip-skills \
  --max-file-size 32768 --worker-timeout 600 --workers 16 -v
Run Commit Outcome Wall time Nodes Edges Communities Processes
1 b5d3c7eb (baseline on branch) OOM @ ~16 GB heap (no 24 GB cap) ~2h 50m
2 b5d3c7eb SUCCESS exit 0 ~4h 9m (14,958 s) 321,508 489,999 4,274 300
3 be1f65ce (binary IPC + transferList) Killed mid-run (superseded by native IPC) ~1h 48m
4 6c6fc774 (native postMessage + transferList, drop protocol.ts) SUCCESS exit 0 ~4h 23m (15,787 s) 321,507 490,000 4,274 300

Takeaway: Runs 2 and 4 produce effectively identical graphs (±1 node/edge). Run 4 is ~14 min slower end-to-end on this box; the important regression fix is completion without hang, not a large throughput win on TS-root.

Run 4 phase breakdown (6c6fc774)

Phase Logged duration
Worker chunks (66/66) ~83 min to chunk completion
Deferred parse (import log → ✓ Phase: parse) ~151 min silent; parse phase total 13,601,889 ms (~3h 47m)
scopeResolution 2,047,396 ms (~34 min)
communities + processes + FTS seconds

Peak RSS during deferred parse: ~23 GB (24 GB heap cap). After ✓ Phase: parse, memory dropped to low GB and the run finished cleanly.

Quarantine evidence (Run 4)

Worker 0 parse job exhausted cumulative timeout budget … exhausted=["tests/cases/fourslash/reallyLargeFile.ts"]
Worker quarantine: 1 new file(s) skipped this chunk
📊 chunk 65/66: 1 files in 1801613ms … 1 quarantined

Chunk 66 completed with the file skipped; indexing continued through scope resolution and wrote meta.json.

Artifacts

  • Index: TypeScript/.gitnexus/meta.json (indexedAt: 2026-05-20, stats.files: 79,123)
  • Verbose log (Run 4): .tmp-analyze-ts-root-native-msg.log on the benchmark worktree

What this validates for the PR

  1. Worker quarantine + skip timed-out files in sequential fallback — pathological files no longer wedge root analyze indefinitely.
  2. IPC path (6c6fc774) — native structured clone + transferList completes the same full-root workload as the earlier successful run without OOM on a capped heap.
  3. Graph quality — core compiler symbols (compiler/checker.ts) indexed; stats match prior successful run.

Happy to re-run on another Windows host or post meta.json excerpts if useful for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

analyze hangs or stalls on microsoft/TypeScript repo root (raised file-size limits)

3 participants