Skip to content

fix(analyze): prevent cache-hit native workers from aborting#1751

Merged
magyargergo merged 3 commits into
abhigyanpatwari:mainfrom
ChamHerry:codex/analyze-native-worker-cache-hit
May 21, 2026
Merged

fix(analyze): prevent cache-hit native workers from aborting#1751
magyargergo merged 3 commits into
abhigyanpatwari:mainfrom
ChamHerry:codex/analyze-native-worker-cache-hit

Conversation

@ChamHerry

Copy link
Copy Markdown
Contributor

Summary

Fixes an intermittent gitnexus analyze abort seen on warm parse-cache runs, where the analyzer could still start parse workers even when every chunk was a cache hit.

This PR keeps the fix intentionally narrow:

  • delays parse worker pool creation until the first cache miss actually needs workers
  • falls back to sequential parsing when the initial worker ready handshake drops every slot
  • captures/tees respawned analyzer stdout/stderr so native aborts are classified separately from heap OOM
  • preserves single-line progress rendering when respawn output is piped for diagnostics
  • routes large-file skip warnings through the progress logger while analyze progress is active
  • adds targeted tests for cache-hit no-worker behavior, initial worker readiness fallback, respawn crash classification, progress terminal bridging, and warning routing

Root cause

Warm-cache analysis was replaying cached chunk results, but runChunkedParseAndResolve had already created a worker pool before the chunk cache check. That meant a cache-hit-only run could still execute the parse-worker top-level module and load tree-sitter / N-API native bindings. On newer runtimes (observed with Node 25), that native path can abort the process before any cache miss requires parsing.

A second issue made diagnosis misleading: the heap-respawn path inherited child stdio, so the parent could not inspect child stderr/stdout after abnormal exits. Native SIGABRT / status 134 failures could be reported as heap OOM even when stderr contained a native binding crash signature.

Validation

  • GitNexus impact analysis before editing:
    • ensureHeap → LOW
    • runChunkedParseAndResolve → LOW
    • createWorkerPool → MEDIUM
    • WorkerPool → MEDIUM
    • walkRepositoryPaths → LOW
  • GitNexus detect_changes on staged linked worktree → risk medium, 9 changed files, affected analyze execution flows only
  • ./node_modules/.bin/vitest run test/unit/analyze-heap-respawn.test.ts test/unit/analyze-respawn-progress-terminal.test.ts test/unit/parse-impl-worker-lazy-cache.test.ts test/integration/filesystem-walker.test.ts test/integration/worker-pool.test.ts
  • npx tsc --noEmit
  • npm run build
  • NODE_OPTIONS='--max-old-space-size=8192' npm test → 355 files / 9140 tests passed, 1 skipped

Notes

This branch was created from the current upstream main and excludes unrelated local i18n, doctor, and docs/todo work.

@vercel

vercel Bot commented May 21, 2026

Copy link
Copy Markdown

Someone is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

Delay parse worker startup until a cache miss requires it, fall back to sequential parsing when initial worker readiness fails, and preserve analyzer diagnostics/progress when heap respawn captures child output.

Constraint: Node 25 and tree-sitter/N-API worker initialization can abort before ready, while warm-cache analysis should not start workers at all.

Rejected: Treating status-134/SIGABRT as heap OOM unconditionally | native worker aborts require distinct recovery guidance and stderr/stdout evidence.

Rejected: cli-progress noTTYOutput for respawn progress | it appends newline frames instead of preserving one-line redraw UX.

Confidence: high

Scope-risk: moderate

Directive: Keep parse-worker creation behind confirmed cache misses and preserve TTY-style progress when respawn pipes stderr for crash classification.

Tested: GitNexus impact analysis for ensureHeap, runChunkedParseAndResolve, createWorkerPool, WorkerPool, walkRepositoryPaths; GitNexus detect_changes scoped to staged worktree; targeted vitest for analyze respawn, parse lazy cache, filesystem walker, worker pool; npx tsc --noEmit; npm run build; NODE_OPTIONS='--max-old-space-size=8192' npm test.

Not-tested: Windows terminal rendering and published npm package install path.
@ChamHerry ChamHerry marked this pull request as ready for review May 21, 2026 11:46
@ChamHerry ChamHerry force-pushed the codex/analyze-native-worker-cache-hit branch from d617e17 to c84c903 Compare May 21, 2026 11:46
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
9418 9417 0 1 478s

✅ All 9417 tests passed

1 test(s) skipped — expand for details
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 78.14% 31053/39737 N/A% 🟢 ███████████████░░░░░
Branches 66.67% 19749/29621 N/A% 🟢 █████████████░░░░░░░
Functions 82.16% 3164/3851 N/A% 🟢 ████████████████░░░░
Lines 81.51% 28010/34363 N/A% 🟢 ████████████████░░░░

📋 View full run · Generated by CI

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 6m 13s —— View job


PR #1751 Production-Readiness Review

Todo List

  • Read repo docs (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Check CI status and workflow run details
  • Read all 9 changed files
  • Read related context files (parse-worker.ts, pipeline.ts, cache, tree-sitter, worker-pool internals)
  • Run hidden Unicode / bidi control scan
  • Reconstruct old vs new execution paths
  • Assess cache-hit worker avoidance (Finding A)
  • Assess partial-cache-miss behavior (Finding B)
  • Assess fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads #1693 quarantine/timeout compatibility (Finding C)
  • Assess initial readiness fallback scope (Finding D)
  • Assess respawn output capture and classification (Findings E/F)
  • Assess progress terminal bridge (Finding G)
  • Assess warning routing (Finding H)
  • Assess tests quality and coverage (Finding I)
  • Compile and post full review

Review bar for this PR

  • All-cache-hit analyze must never touch WorkerPool, parse-worker.js, or any N-API native binding — proven by test assertions that fail if a worker is spawned.
  • Partial-cache-miss must create workers exactly once, use them across remaining misses, and terminate them reliably in all paths (success, throw, SIGINT).
  • PR fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads #1693's WorkerPoolDispatchError / quarantine / circuit-breaker / in-flight semantics must be untouched by the lazy-pool refactor; broad sequential fallback must not be reintroduced.
  • Respawned-child output must be bounded, teed, and correctly classify native SIGABRT/Napi::Error aborts separately from heap OOM without losing exit-status fidelity.
  • CI must be green on Linux, macOS, and Windows (typecheck, lint, format, unit, integration, scope-parity, CodeQL, Gitleaks, Dependency Review).

Problem being solved

Claimed bug: On warm parse-cache runs, gitnexus analyze could still abort intermittently even when every chunk was a cache hit. The cause was that PR-prior code constructed WorkerPool unconditionally before the chunk-cache check. Worker construction loads parse-worker.js top-level code, which initializes tree-sitter / N-API native bindings. On Node 25 (and observed on non-LTS runtimes generally), this native-binding load path can SIGABRT before any parse work begins.

Secondary bug: The heap-respawn path used execFileSync(..., { stdio: 'inherit' }), which forwarded child stdout/stderr to the terminal live but gave the parent no way to inspect child output after a nonzero exit. A native crash producing libc++abi: terminating due to uncaught exception of type Napi::Error on stderr would appear as an empty-output exit to the parent classifier, causing it to emit heap-OOM guidance even when the root cause was a native binding crash unrelated to heap size.

Expected behavior: An all-cache-hit warm run should replay cached parse results, produce the same graph, and exit cleanly without instantiating any worker or loading any native binding.


Current PR state

Field Value
Title fix(analyze): prevent cache-hit native workers from aborting
State Open, ready for review
Base ← Head abhigyanpatwari/GitNexus:mainChamHerry:codex/analyze-native-worker-cache-hit
Head SHA c84c903 (1 commit total; verified locally)
Changed files 9 TypeScript files — 4 src, 5 test — all within gitnexus/
CI visibility Complete — all 21 CI jobs verified individually (see below)

Merge status and mergeability

mergeable

All required CI jobs are complete and green. CodeQL, Gitleaks, Dependency Review, PR Autofix passed. Docker Build & Push is still in progress at review time — this is a CLI-only change with no image-production impact; Vercel authorization for NexusCore is similarly non-blocking. 9416/9417 tests passed; 1 skipped test is a documented Ruby closure-param inference limitation unrelated to this PR.


Repository history considered


Branch hygiene assessment

clean feature/fix PR

One commit (c84c903). All 9 changed files are in gitnexus/src/ (analyze.ts, filesystem-walker.ts, parse-impl.ts, worker-pool.ts) and gitnexus/test/ (5 test files). No web UI, i18n, doctor command, release bump, generated artifact, or workflow churn. Scope matches the PR title exactly.


Understanding of the change

cache/parse pipeline (parse-impl.ts): workerPool creation moves from module-entry unconditional to a getOrCreateWorkerPool() closure that is only invoked inside the cache-miss else-branch. The shouldUseWorkers predicate is evaluated eagerly (based on file count/byte totals), but no pool object is ever constructed until a real miss is dispatched. Cache-hit chunks call mergeChunkResults directly. Pool is terminated in a finally block regardless of success/failure. workerPoolDisabled prevents re-creation after a startup failure. Quarantine-aware cache-write guard (added in #1693) is preserved.

worker-pool readiness/fallback (worker-pool.ts + parse-impl.ts): New WorkerPoolInitializationError subclass of WorkerPoolDispatchError is thrown when activeSlots.size === 0 at first dispatch. The message now includes initialReadinessFailures (per-slot error reasons accumulated during the startup phase). parse-impl.ts catches only WorkerPoolInitializationError; all other WorkerPoolDispatchError subclasses are re-thrown unchanged, preserving #1693 circuit-breaker behavior.

heap respawn/native abort diagnostics (analyze.ts): execFileSync replaced with spawn(..., { stdio: ['inherit', 'pipe', 'pipe'] }). stdout/stderr are teed through realStdoutWrite/realStderrWrite (captured at module load before any monkey-patching) and accumulated in a bounded 1MB ring buffer per stream. New childProcessLikelyNativeAbort() inspects the captured output for napi::error, libc++abi: terminating, abort trap, native stack/worker/binding. Classifier order: OOM check first; if OOM signatures found → heap guidance; else if native signatures found → native-worker guidance. Separate RespawnExit interface propagates status, signal, stdout, stderr, and message cleanly.

progress rendering (analyze.ts): New createAnsiPipeTerminal() constructs a CliProgressTerminal that writes ANSI escape sequences to the piped stderr stream. Activated only when GITNEXUS_RESPAWN_PROGRESS_TTY=1 and the child's stderr is not a TTY — meaning the respawned child is running with a piped stderr, but the parent confirmed it was a real TTY when it set the env var. This preserves single-line overwrite UX in the piped child while still capturing stderr tail for crash classification.

filesystem warnings (filesystem-walker.ts): warnLargeFileSkip() now checks GITNEXUS_ANALYZE_PROGRESS_ACTIVE env var at call time. If set, routes through console.warn (which analyze.ts redirects to bar.log() for the lifetime of the progress bar) instead of logger.warn (pino NDJSON). This prevents raw pino JSON lines from corrupting the single-line progress display.

tests: Five new/heavily modified test files. parse-impl-worker-lazy-cache.test.ts writes marker files at worker top-level and asserts existsSync(markerPath) === false for all-cache-hit runs — a hard negative assertion that would fail if worker construction was moved even one line earlier. analyze-heap-respawn.test.ts tests all OOM and native-abort classification branches including mixed cases. analyze-respawn-progress-terminal.test.ts verifies the ANSI shim is injected and emits correct escape sequences. filesystem-walker.test.ts adds the GITNEXUS_ANALYZE_PROGRESS_ACTIVE routing regression test.


Findings

Finding 1 — Cache-hit no-worker behavior

Risk: Worker pool constructed before cache check, causing native abort on warm runs.
Evidence found: Verified in parse-impl.ts:505getOrCreateWorkerPool() is called only inside the else branch (cache miss, line 484+). The cache-hit path at lines 458–483 calls only mergeChunkResults and onProgress; it never calls getOrCreateWorkerPool(). The parse-impl-worker-lazy-cache.test.ts test (does not spawn a parse worker when every chunk is served from parse cache) writes a top-level marker in the worker script and asserts fs.existsSync(markerPath) === false after the run. This is a genuine hard negative — it would fail immediately if createWorkerPool() was called even once.
Recommended fix: N/A — verified correct.
Blocks merge: No.

Finding 2 — Partial-cache-miss behavior

Risk: Lazy creation might not cache-miss chunks correctly or might re-create the pool per miss.
Evidence found: getOrCreateWorkerPool() returns the existing workerPool when non-null, creating it only once. workerPoolDisabled guards against re-creation after failure. Cleanup: finally { await workerPool?.terminate(); } at line 858 in parse-impl.ts. Cache writes on misses are guarded by the quarantine check (line 566). Test spawns the parse worker lazily on the first cache miss and stores raw results asserts the marker exists, the cache entry is stored, and the graph node is populated.
Recommended fix: N/A — verified correct.
Blocks merge: No.

Finding 3 — #1693 quarantine/timeout compatibility

Risk: Lazy pool creation might bypass quarantine, circuit-breaker, or dispatch-error semantics.
Evidence found: In parse-impl.ts:520, catch (err) { if (!(err instanceof WorkerPoolInitializationError)) throw err; } re-throws all WorkerPoolDispatchError (circuit-breaker, dispatch error) and everything else except the specific initialization failure. WorkerPoolInitializationError extends WorkerPoolDispatchError, so the quarantine snapshot at dispatch time still applies. Quarantine cache-write guard at line 566 (workerPool?.getQuarantinedPaths?.()) is unmodified. Mid-run worker death still goes through the existing slot-failure / respawn / circuit-breaker path.
Recommended fix: N/A — #1693 semantics preserved.
Blocks merge: No.

Finding 4 — Initial readiness fallback scope

Risk: Fallback could be broader than initial-readiness failure, silently masking mid-run crashes.
Evidence found: WorkerPoolInitializationError is thrown only when activeSlots.size === 0 at the point of first dispatch (worker-pool.ts:689). This is strictly the "all workers exited before completing the ready handshake" case. The catch at parse-impl.ts:520 checks !(err instanceof WorkerPoolInitializationError) — any other dispatch error (quarantine full, circuit-breaker tripped, dispatch timeout) is re-thrown. On activation, workerPoolDisabled = true prevents re-creation for the rest of the run. The warning log includes readinessFailures details. Test falls back to sequential parsing when initial workers exit before ready uses a worker that calls process.exit(1) before the ready message; result usedWorkerPool === false is asserted.
Recommended fix: N/A — fallback is appropriately narrow.
Blocks merge: No.

Finding 5 — Respawn output capture and status propagation

Risk: Unbounded output buffering, lost output, double-resolution, incorrect exit code.
Evidence found: appendOutputTail bounds each stream to RESPAWN_OUTPUT_TAIL_CHARS (1MB) using a ring-buffer-style tail.slice(-N). The settled flag at line 207 prevents double-resolution if both error and close fire. Both error and close events are handled. Exit status is propagated at lines 362–364: process.exitCode = status (falling back to 1 when status is null). realStdoutWrite / realStderrWrite are captured at module load before any monkey-patching, so output is teed reliably even when console.* is redirected.
Recommended fix: N/A — bounded, correct.
Blocks merge: No.

Finding 6 — Native abort vs heap OOM classifier

Risk: Wrong recovery guidance; heap OOM classified as native abort or vice versa.
Evidence found: Classifier order: childProcessLikelyOom() first (lines 343–351), then childProcessLikelyNativeAbort() (lines 352–360). childProcessLikelyOom returns true when: (a) stderr/stdout/message contains a heap OOM phrase, OR (b) no child output at all AND status is 134 or signal is SIGABRT. Condition (b) is the ambiguous fallback that was the prior behavior for empty-output exits. childProcessLikelyNativeAbort looks for napi::error, libc++abi: terminating, abort trap, native stack/worker/binding in any field. The key ordering: if the child produced native-crash output AND heap-OOM text (pathological mixed case), OOM guidance wins. The test does not print heap guidance when a SIGABRT child emitted a native N-API crash confirms that status 134 + native stderr → native-worker guidance, not heap guidance. The heap OOM guidance itself includes "If this persists, it may be a native crash unrelated to heap size" as a hedge.

Minor gap (non-blocking): A native crash that produces zero stderr output (hard abort before any libc output) will be classified as OOM (condition b), not native abort. This was the prior behavior and is explicitly acknowledged in the user-facing message. No test covers this mixed case, but the classification improvement for crashes WITH output is materially better than before.
Recommended fix: Add a test case where status=134, stderr is empty, to document the OOM-fallback behavior explicitly. Not blocking.
Blocks merge: No.

Finding 7 — Progress terminal bridge and ANSI correctness

Risk: Corrupt output, broken progress display on non-TTY or Windows.
Evidence found: createAnsiPipeTerminal uses only escaped ANSI sequences (\x1B[...], \r, \n). No literal invisible characters — hidden Unicode scan returned no output. Width calculation falls back to 80 when COLUMNS is unset. isTTY: () => true signals to cli-progress that it should use TTY rendering mode. The GITNEXUS_RESPAWN_PROGRESS_TTY guard ensures the terminal shim is only activated when the parent confirmed TTY on respawn. Windows CI (tests / windows-latest) passed.

Minor issue (non-blocking): The write() method in createAnsiPipeTerminal (analyze.ts:156–159) truncates at s.slice(0, width). This naive byte-count truncation can cut through a multi-byte UTF-8 character or split an incomplete ANSI escape sequence, potentially leaving a dangling \x1B[ on the terminal. In practice, cli-progress progress bar strings are ASCII-only and short enough that this edge case is unlikely to trigger. However, it is a latent correctness issue for progress bar labels containing non-ASCII content.
Recommended fix: Use a Unicode-aware width function (e.g. strip ANSI codes before measuring, then truncate with grapheme awareness) or add a follow-up issue. Not blocking given the observed behavior.
Blocks merge: No.

Finding 8 — GITNEXUS_ANALYZE_PROGRESS_ACTIVE lifecycle

Risk: Env var leaks into child processes or persists across programmatic re-invocations.
Evidence found: GITNEXUS_ANALYZE_PROGRESS_ACTIVE is in ANALYZE_CLI_ENV_KEYS (analyze.ts:386). snapshotAnalyzeEnv() saves it at line 509. restoreAnalyzeEnv(envSnap) is in the finally block at line 512–513 — this runs on success, error, and SIGINT (SIGINT handler calls closeLbug() then exits, but the finally still fires for the async path). Set at line 779 inside the try block. Respawned child inherits it only if present in childEnv — the env snapshot/restore mechanism ensures it is absent from the re-exec env when not set by the parent.
Recommended fix: N/A — correct lifecycle.
Blocks merge: No.

Finding 9 — Warning routing during active progress

Risk: Large-file skip warnings swallowed, duplicated, or corrupted in progress output.
Evidence found: warnLargeFileSkip() in filesystem-walker.ts:28–40 checks GITNEXUS_ANALYZE_PROGRESS_ACTIVE === '1' and branches to console.warn (captured by barLog) vs logger.warn. The test routes large-file notices through console.warn while analyze progress is active in filesystem-walker.test.ts:442–462 asserts console.warn was called with the skip summary and that no pino records contain "Skipped" — confirming no duplication.
Recommended fix: N/A — routing is clean.
Blocks merge: No.

Finding 10 — Hidden Unicode / bidi controls

Risk: Invisible controls in executable code or tests could mislead reviewers.
Evidence found: Ran ripgrep pattern scan on all 9 changed files for U+202A–U+202E, U+2066–U+2069, U+200B–U+200D, U+FEFF. Zero matches. All ANSI sequences in the code are proper \x1B escape literals. \u2591 (░, block character) used in the cli-progress bar preset is visible Unicode consistent with existing repo style.
Blocks merge: No.


PR-specific assessment sections

Cache-hit no-worker behavior: ✅ Confirmed. getOrCreateWorkerPool() is strictly inside the cache-miss else-branch. Test would fail if this invariant broke.

Partial-cache-miss behavior: ✅ Confirmed. Pool created once, reused, terminated in finally. Quarantine guard on cache writes preserved.

#1693 worker-pool/quarantine compatibility: ✅ Confirmed. WorkerPoolInitializationError catch is narrow; all other dispatch errors re-thrown. Quarantine, circuit-breaker, in-flight tracking are untouched. No broad sequential fallback introduced.

Respawn diagnostics and native-abort classification: ✅ Materially improved. Output is now available for classification. Native crashes with output correctly route to native-worker guidance. Silent-status-134 ambiguity acknowledged in user message. Minor gap: mixed/silent cases follow prior behavior.

Progress and terminal behavior: ✅ with minor follow-up. ANSI shim works and Windows CI passes. s.slice(0, width) truncation is a latent cosmetic issue for non-ASCII label content.

Warning routing: ✅ Confirmed correct. No duplication, no suppression, test proves routing branch.

Tests and CI: ✅ All 21 CI jobs green. 9416/9417 tests pass. New tests would fail against the old implementation (marker file test is a hard negative; classifier tests mock spawn and assert exact guidance text). Integration tests use real worker threads.

Branch hygiene and scope: ✅ 1 commit, 4 src + 5 test files, all within gitnexus/. Zero unrelated churn.


Back-and-forth avoided by verifying

  • Confirmed head SHA c84c903 matches expected 1-commit branch.
  • Confirmed getOrCreateWorkerPool() is only invoked in the cache-miss branch (not before the cache check).
  • Confirmed WorkerPoolInitializationError catch uses !(err instanceof WorkerPoolInitializationError) — only the initial-readiness failure falls back; all other dispatch errors re-throw.
  • Confirmed workerPool?.terminate() in the finally block at line 858 runs regardless of path.
  • Confirmed GITNEXUS_ANALYZE_PROGRESS_ACTIVE is in the snapshot/restore key list.
  • Confirmed hidden Unicode scan returned zero results.
  • Confirmed CI: all 21 jobs in run 26223968010 successful (typecheck, lint, format, ubuntu/macos/windows tests, scope-parity all languages, CodeQL, Gitleaks, Dependency Review, CI Gate, PR Autofix).
  • Confirmed childProcessLikelyNativeAbort and childProcessLikelyOom test ordering in analyze-heap-respawn.test.ts covers the discriminating case.

Open questions that remain only if unavoidable

None. All behavior is verifiable from the code, tests, and CI.


Final verdict

production-ready with minor follow-ups

The core fix is correct and thoroughly proven: all-cache-hit runs never touch the worker pool; partial-cache-miss runs create the pool lazily, preserve #1693 quarantine/circuit-breaker semantics, and clean it up reliably. The respawn diagnostic improvement (piped capture + native-abort classifier) is materially better than the previous execFileSync approach and handles the motivating libc++abi Napi::Error case correctly. Progress rendering and warning routing are correctly scoped with proper env-var lifecycle management. CI is green on all three platforms with full typecheck, lint, scope-parity, and security checks passing.

The two non-blocking follow-ups to track are: (1) the s.slice(0, width) ANSI-unaware truncation in createAnsiPipeTerminal.write() could corrupt progress output for non-ASCII bar labels — Fix this →; (2) a test explicitly documenting the silent-status-134 OOM-fallback classification would close the remaining ambiguity gap. Neither issue is blocking.

Docker PR builds run gitnexus prepare under QEMU for linux/arm64, where the fixed 120s TypeScript timeout can kill otherwise healthy builds. Increase the default timeout and allow GITNEXUS_BUILD_TIMEOUT_MS to tune slower environments without changing the build steps.

Constraint: PR abhigyanpatwari#1751 Docker Build & Push gitnexus failed with spawnSync /bin/sh ETIMEDOUT while running node_modules/.bin/tsc in scripts/build.js.\nRejected: Rerunning CI only | the failure was the build script's deterministic timeout boundary under arm64 emulation, not a code assertion.\nConfidence: high\nScope-risk: narrow\nDirective: Keep build timeout changes in scripts/build.js configurable; do not hide real compiler failures, only allow slower successful compiles to finish.\nTested: GitNexus impact for gitnexus/scripts/build.js reported LOW; gitnexus detect_changes reported 1 changed file, 0 affected processes, low risk; git diff --check; gitnexus npm run build.\nNot-tested: GitHub Docker arm64 build rerun before pushing; local Docker multi-platform build under QEMU.
@magyargergo

Copy link
Copy Markdown
Collaborator

@ChamHerry Can you please look into Claude's findings? 🙏

Preserve complete ANSI escape sequences and grapheme boundaries when the respawn progress terminal shim truncates wrapped output, so the shim does not emit dangling escape bytes or split surrogate pairs while keeping raw writes untouched.

Constraint: Claude review on PR abhigyanpatwari#1751 flagged `s.slice(0, width)` in createAnsiPipeTerminal.write() as a latent terminal-corruption risk.
Rejected: Adding a display-width dependency | a local helper is sufficient for this narrow respawn terminal shim and avoids new dependency churn.
Rejected: Changing silent status-134 classification | current tests already document the output-less 134 fallback as heap guidance.
Confidence: high
Scope-risk: narrow
Directive: Keep respawn terminal writes ANSI-aware and preserve rawWrite bypass semantics for callers that intentionally write control sequences.
Tested: GitNexus impact for createAnsiPipeTerminal reported LOW; GitNexus detect_changes reported 2 changed files, 3 affected processes, medium risk; targeted vitest for analyze respawn progress and heap respawn; gitnexus npx tsc --noEmit; prettier check for changed files; eslint for changed files.
Not-tested: Full npm test suite; manual terminal rendering on Windows.
@ChamHerry

Copy link
Copy Markdown
Contributor Author

Thanks — I looked into both Claude follow-ups and pushed 6fba199c.

What changed:

  • Fixed the respawn progress terminal truncation path: createAnsiPipeTerminal.write() now truncates by visible columns while preserving complete ANSI escape sequences and grapheme boundaries, and raw writes still bypass truncation.
  • Added a regression test covering ANSI escape preservation, surrogate-pair-safe truncation, and rawWrite bypass.

On the silent status=134 point: this branch already had coverage for that behavior in analyze-heap-respawn.test.ts (prints heap guidance when child exits 134 without output) plus the native N-API stderr case, so I did not change the classifier.

Local verification:

  • vitest run test/unit/analyze-respawn-progress-terminal.test.ts test/unit/analyze-heap-respawn.test.ts
  • npx tsc --noEmit
  • npx prettier --check gitnexus/src/cli/analyze.ts gitnexus/test/unit/analyze-respawn-progress-terminal.test.ts
  • npx eslint gitnexus/src/cli/analyze.ts gitnexus/test/unit/analyze-respawn-progress-terminal.test.ts
  • GitNexus impact/detect_changes reviewed (detect_changes: medium risk, 2 changed files, 3 affected processes).

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @magyargergo's task in 10m 5s —— View job


PR #1751 Production-Readiness Review

Todo List

  • Read repo docs (DoD.md, AGENTS.md, GUARDRAILS.md, CONTRIBUTING.md, TESTING.md, ARCHITECTURE.md)
  • Check CI status and workflow run details (21 jobs, all green)
  • Read all 10 changed files
  • Read related context files (parsing-processor.ts, worker-pool.ts, parse-impl.ts internals)
  • Hidden Unicode / bidi control scan — zero matches
  • Reconstruct old vs new execution paths for all scenarios
  • Assess all findings A–L
  • Compile full review

Review bar for this PR

  • All-cache-hit analyze must not instantiate WorkerPool, not execute parse-worker top-level code, not load N-API native bindings — proven by a hard negative test that would fail on any regression.
  • First cache miss must create the worker pool lazily, exactly once, and terminate it in all paths (success, throw, SIGINT).
  • WorkerPoolInitializationError fallback must be strictly pre-dispatch; all other WorkerPoolDispatchError subclasses (circuit breaker, quarantine, timeout) must re-throw unchanged from fix(workers): resilient + zero-copy ingestion worker pool — prevent analyze hangs on TS-root-scale loads #1693.
  • Respawned child output must be bounded, teed, classified accurately (native abort vs heap OOM), and exit status must propagate non-zero reliably.
  • Full CI must be green on the current head across Linux, macOS, Windows, typecheck, lint, format, scope-parity, CodeQL, Gitleaks, and Dependency Review.

Problem being solved

Bug: gitnexus analyze could abort intermittently on warm parse-cache runs. Root cause: runChunkedParseAndResolve constructed WorkerPool unconditionally before the chunk-cache lookup. On an all-cache-hit run, worker construction still loaded parse-worker.js top-level code and N-API native bindings. On Node 25 and some non-LTS runtimes, this native init path can SIGABRT before any parse work begins.

Secondary bug: The heap-respawn path used execFileSync({ stdio: 'inherit' }), giving the parent process no access to child stderr/stdout after a nonzero exit. Native SIGABRT/status-134 failures with libc++abi: Napi::Error in stderr appeared as empty-output exits, producing misleading heap-OOM guidance.

Expected: Warm-cache all-hit runs replay cached parse output without touching any native binding. Native crash diagnostics are classified separately from heap OOM.


Current PR state

Field Value
Title fix(analyze): prevent cache-hit native workers from aborting
State Open, ready for review
Head SHA 6fba199 (3 commits total: c84c903, acc4c1b, 6fba199)
Base abhigyanpatwari/GitNexus:main
Changed files 10 (4 src + 5 test + 1 build script)
Force-push Yes: from d617e17c84c903 on initial PR submission

The initial PR visible head was c84c903. Since then, two additional commits were pushed: acc4c1b (ci(docker): tolerate slower arm64 TypeScript builds — build script timeout config) and 6fba199 (fix(analyze): truncate respawn progress safely — ANSI-aware truncation in createAnsiPipeTerminal.write()).


Merge status and mergeability

mergeable

All 6 CI workflow runs (CI, Docker Build & Push, CodeQL, Gitleaks, Dependency Review, PR Autofix) completed with success at 2026-05-21T13:04:21Z. Run 26227708423 (CI) contains 21 individual jobs — ubuntu coverage, macos-latest, windows-latest, quality/typecheck, quality/lint, quality/format, quality/typecheck-web, all scope-parity language jobs, CI Gate — all success. One test skipped (documented Ruby closure-param inference limitation, unrelated to this PR). Caveat: I could not independently verify that these CI runs were triggered on 6fba199 vs acc4c1b (run timestamps are 57 seconds before ChamHerry's comment posting 6fba199). Based on timing and the acc4c1b being a CI-targeted commit, the runs are consistent with the full 3-commit branch, but this cannot be confirmed with certainty from the available tooling.


Repository history considered

PR #1693 (fix(workers): resilient + zero-copy ingestion worker pool): The critical antecedent. Established WorkerPoolDispatchError, quarantine, circuit-breaker, in-flight tracking, starting-file authoritative attribution, ready-handshake gate, zero-copy postMessage, and cache-write suppression on quarantined chunks. Explicitly removed broad sequential fallback from the dispatch try-block. This PR's changes are verified to not regress any of these guarantees — see Finding C and D.

Issue #1684 (TypeScript-root hang, closed by #1693): No regression. The WorkerPoolInitializationError fallback is bounded to the pre-dispatch window, cannot mask mid-run quarantine or circuit-breaker behavior.

Issue #1665 / PR #1666 (Napi::Error on C/C++ repos, PR closed without merge): #1666 routed C/C++ chunks through sequential parsing — the rejected broad fallback. This PR avoids the same trap: WorkerPoolInitializationError is thrown only when activeSlots.size === 0 at first dispatch (all workers failed the ready handshake before any work started). Mid-run native crashes still route through #1693 quarantine/circuit-breaker, not this fallback.

Issue #1726 / #1741 (analyze stuck at 49% on large repos): This PR does not touch the cross-file type propagation or scope-resolution phase. Not regressed — dispatch logic for mid-run failures is unchanged.


Branch hygiene assessment

clean feature/fix PR

Three commits, all purposeful: c84c903 is the core analyze/worker/cache fix (9 files); acc4c1b is a build script timeout improvement for arm64 Docker CI (1 file, separate concern but legitimate and well-scoped); 6fba199 implements the ANSI-aware truncation follow-up that the prior Claude review requested (2 files). No web UI, i18n, release bumps, generated artifacts, or unrelated workflow churn. All changed files are within gitnexus/.


Understanding of the change

cache/parse pipeline (parse-impl.ts): workerPool construction moves from unconditional (before cache lookup) to a getOrCreateWorkerPool() closure invoked only inside the cache-miss else branch (line 505). shouldUseWorkers is evaluated eagerly (based on file count/byte totals) but the pool object is never constructed until a real miss arrives. Pool is terminated in finally (line 858). workerPoolDisabled prevents re-creation after failure. Quarantine-aware cache-write guard from #1693 (line 565-585) is preserved.

worker-pool readiness/fallback (worker-pool.ts + parse-impl.ts): New WorkerPoolInitializationError extends WorkerPoolDispatchError (line 238). Initial spawn loop unchanged; initialReadyGate = Promise.allSettled(...) runs all slot readiness checks in parallel. Failed slots are dropped from activeSlots; the gate settles before first dispatch. When activeSlots.size === 0 at dispatch entry (line 691), WorkerPoolInitializationError is thrown with readinessFailures diagnostics. parse-impl.ts catches only this subclass (line 520), terminates the failed pool, sets workerPoolDisabled = true, clears workerPool to undefined, and re-invokes sequential processParsing. All other WorkerPoolDispatchError subclasses re-throw unchanged.

heap respawn/native abort diagnostics (analyze.ts): execFileSync replaced with async spawn({ stdio: ['inherit', 'pipe', 'pipe'] }). stdout/stderr teed through realStdoutWrite/realStderrWrite (captured at module load) and accumulated in bounded 1MB ring buffers via appendOutputTail. settled flag prevents double-resolution. New childProcessLikelyNativeAbort() inspects accumulated output for napi::error, libc++abi: terminating, abort trap, native stack/worker/binding. Classifier order: OOM first (returns true when heap signatures or empty output + status 134/SIGABRT); native abort second (requires output with native signature). Exit status propagation: line 459 extracts numeric status or falls back to 1, sets process.exitCode.

progress rendering (analyze.ts): createAnsiPipeTerminal() now uses truncateAnsiToColumns() (implemented in commit 6fba199) — ANSI-escape-aware truncation with Intl.Segmenter grapheme boundaries and wide-codepoint accounting. Activated only when GITNEXUS_RESPAWN_PROGRESS_TTY=1 and process.stderr.isTTY !== true. Raw writes bypass truncation. Replaces the prior s.slice(0, width) which could split ANSI sequences or surrogate pairs.

filesystem warnings (filesystem-walker.ts): warnLargeFileSkip() checks GITNEXUS_ANALYZE_PROGRESS_ACTIVE === '1' at call time and routes through console.warn (captured by bar.log()) instead of logger.warn (pino NDJSON) to prevent raw JSON lines corrupting the single-line progress display.

tests: parse-impl-worker-lazy-cache.test.ts — hard negative (marker file proves no worker spawn on cache hit), lazy creation on first miss, initial-readiness fallback (worker exits before ready → usedWorkerPool === false, no cache write). analyze-heap-respawn.test.ts — all OOM/native-abort classification branches including the discriminating case (status 134 + native stderr → native guidance, not OOM guidance). analyze-respawn-progress-terminal.test.ts — ANSI shim injection, ANSI-safe truncation, surrogate-pair-safe truncation, raw write bypass. filesystem-walker.test.tsGITNEXUS_ANALYZE_PROGRESS_ACTIVE routing regression test. worker-pool.test.ts — minor targeted fix.


Findings

Finding 1 — Cache-hit no-worker behavior

Risk: Worker pool created before cache check, causing native abort on warm runs.
Evidence found: getOrCreateWorkerPool() at parse-impl.ts:505 is inside the else (cache miss) branch; the cache-hit if block at lines 458–483 calls only mergeChunkResults and onProgress. Test parse-impl-worker-lazy-cache.test.ts:121 writes a marker file at worker top-level, asserts fs.existsSync(markerPath) === false after an all-cache-hit run. This is a genuine hard negative: the marker is created at const { parentPort } = require('node:worker_threads') (top of worker script), so even constructing the Worker object (which loads the script) would fail the assertion.
Recommended fix: N/A — verified correct.
Blocks merge: No.

Finding 2 — Partial-cache-miss behavior

Risk: Lazy creation might not aggregate cache-hit chunks correctly, or pool might be created more than once.
Evidence found: getOrCreateWorkerPool() closure returns existing workerPool when non-null. workerPoolDisabled guards against re-creation after failure. finally { await workerPool?.terminate(); } at parse-impl.ts:858 runs regardless of path. Cache writes gated at line 565: only when rawResults.length > 0 AND no quarantined files in chunk. Test spawns the parse worker lazily on the first cache miss verifies marker exists, cache entry stored, graph node populated.
Recommended fix: N/A — verified correct.
Blocks merge: No.

Finding 3 — #1693 worker-pool/quarantine compatibility

Risk: Lazy pool creation might bypass quarantine, circuit-breaker, or cache-skip semantics.
Evidence found: parse-impl.ts:520 catches only WorkerPoolInitializationError and re-throws everything else. WorkerPoolInitializationError extends WorkerPoolDispatchError, so circuit-breaker errors, timeout errors, and dispatch errors all propagate unchanged. Quarantine snapshot at line 566 (workerPool?.getQuarantinedPaths?.()) is unmodified. processParsingWithWorkers in parsing-processor.ts:885-902 explicitly documents: "there is no sequential-parser fallback for either partial quarantine or full pool failure — the operator must see a clear hard signal." WorkerPoolInitializationError is a new class covering only the pre-dispatch startup window, not mid-run failures.
Recommended fix: N/A — #1693 semantics preserved.
Blocks merge: No.

Finding 4 — Sequential fallback scope

Risk: Fallback might reintroduce the broad sequential behavior #1693 intentionally removed.
Evidence found: The WorkerPoolInitializationError is thrown at worker-pool.ts:696 only when activeSlots.size === 0 after the initial ready gate settles. This is strictly the "all workers exited before completing the ready handshake" window — no work has been dispatched. The parse-impl.ts:520 catch explicitly re-throws anything that isn't WorkerPoolInitializationError. Mid-run crashes, quarantine trips, circuit-breaker fires, and timeout escalations are unaffected. workerPoolDisabled = true after fallback prevents any subsequent pool creation for the rest of the run.
Recommended fix: N/A — fallback is narrowly scoped.
Blocks merge: No.

Finding 5 — Initial readiness gate correctness

Risk: Gate might hang, mishandle some-slot failure, or leak worker threads.
Evidence found: waitForWorkerReady() at worker-pool.ts:440 has a WORKER_READY_TIMEOUT_MS (5s) timeout, cleans up all listeners in a single cleanup() closure, handles error, exit, messageerror events. initialReadyGate = Promise.allSettled(...) at line 648 runs all slot checks in parallel; failed slots are removed from activeSlots and their workers terminated (await w.terminate().catch(() => undefined) at line 662). Some-slot failure leaves remaining healthy slots usable — WorkerPoolInitializationError is thrown only when activeSlots.size === 0 at first dispatch. Test forces process.exit(1) before {type:'ready'}usedWorkerPool === false confirmed. Replacement worker ready path (replaceWorker at line 758) is unchanged.
Recommended fix: N/A — gate is correct.
Blocks merge: No.

Finding 6 — Respawn output capture and exit status

Risk: Unbounded buffers, lost exit code, race between error and close, double-resolution.
Evidence found: appendOutputTail at analyze.ts:272 bounds each stream to 1MB via next.slice(-RESPAWN_OUTPUT_TAIL_CHARS). settled flag at line 301 prevents double-resolution. Both error and close events handled (lines 321-338). Stdin inherited ('inherit'), stdout/stderr piped — no interactive input issues. Exit code propagation at line 459: typeof childExit.status === 'number' && childExit.status !== 0 ? childExit.status : 1. Signal-only exits (SIGABRT with null status) → fallback to 1. Test prints heap guidance when child exits 134 without output confirms process.exitCode === 134.
Recommended fix: N/A — correct.
Blocks merge: No.

Finding 7 — OOM vs native-abort classification

Risk: Wrong guidance; misclassification of native crashes as OOM or vice versa.
Evidence found: childProcessLikelyOom at line 351: checks heap-OOM phrases first across all fields; if any output exists but no OOM phrase → returns false; only falls back to status 134/SIGABRT classification when no child output at all. childProcessLikelyNativeAbort at line 385: requires output to contain strong native-crash signatures. Classifier order in ensureHeap: OOM checked first, native abort second. Test does not print heap guidance when a SIGABRT child emitted a native N-API crash — status 134 + libc++abi: terminating due to uncaught exception of type Napi::Error in stderr → native-worker guidance, not OOM. Test prints heap guidance when child exits 134 without output — status 134 + empty output → OOM (acknowledged silent-exit fallback).

Minor observation (non-blocking): childProcessLikelyNativeAbort matches "native binding" or "native worker" in any output field. If an analyze run fails with an unrelated error but had previously logged a warning like Loading native binding for Python… to stdout, the parent would classify it as a native abort and show gitnexus analyze --workers 0 guidance. This is guidance-only — no behavioral impact — and the classifier is gated on a nonzero exit, so it only fires on failures.
Recommended fix: Consider requiring a native abort signal or a stronger phrase combination before printing native-abort guidance. Non-blocking given current test coverage.
Blocks merge: No.

Finding 8 — Progress terminal bridge and ANSI correctness (post-6fba199)

Risk: Multibyte/ANSI corruption in progress output, especially on piped stderr.
Evidence found: Commit 6fba199 replaced s.slice(0, width) with truncateAnsiToColumns(s, maxColumns) — a full ANSI-escape-preserving, grapheme-boundary-safe truncator using Intl.Segmenter and wide-codepoint accounting. readAnsiEscapeAt uses a sticky regex. Raw writes (rawWrite = true) bypass truncation entirely. Width fallback to 80 when COLUMNS unset. Test truncates wrapped progress writes without splitting ANSI escapes or surrogate pairs proves: 3-column limit with ab\x1B[31mcdab\x1B[31mc (escape preserved, visible char truncated); 4-column limit with abc😀defabc (surrogate pair not split). The previous follow-up from the prior review is now fully addressed.
Recommended fix: N/A — correct after 6fba199.
Blocks merge: No.

Finding 9 — Warning routing during active progress

Risk: Large-file skip warnings swallowed, duplicated, or corrupted in progress output.
Evidence found: warnLargeFileSkip() in filesystem-walker.ts:28-40 branches on GITNEXUS_ANALYZE_PROGRESS_ACTIVE === '1'console.warn (routed to barLog) vs logger.warn (pino). Test routes large-file notices through console.warn while analyze progress is activeconsole.warn spy sees the message, pino capture has zero records with "Skipped". No duplication or swallowing.
Recommended fix: N/A — routing is clean.
Blocks merge: No.

Finding 10 — GITNEXUS_ANALYZE_PROGRESS_ACTIVE lifecycle

Risk: Env var leaks across programmatic analyzeCommand invocations or into child processes.
Evidence found: GITNEXUS_ANALYZE_PROGRESS_ACTIVE is in ANALYZE_CLI_ENV_KEYS at analyze.ts:482. snapshotAnalyzeEnv() at line 605 captures pre-call state. restoreAnalyzeEnv(envSnap) in the finally at line 608-609 runs on success, error, and async-path SIGINT. The env is set at line 875 inside analyzeCommandImpl, well after snapshotting. Early-return validation paths (lines 625-707) never reach line 875, so snapshot/restore is a no-op for those exits. process.exit() paths (SIGINT, alreadyUpToDate) bypass finally — intentional per comment at line 600-601 (process termination makes restoration moot). The child childEnv spreads process.env BEFORE analyzeCommandImpl sets the flag, so the respawned child does not inherit GITNEXUS_ANALYZE_PROGRESS_ACTIVE = '1' from the parent.
Recommended fix: N/A — correct lifecycle.
Blocks merge: No.

Finding 11 — Hidden Unicode / bidi controls

Risk: Misleading code or tests via invisible control characters.
Evidence found: Grep for U+202A–U+202E, U+2066–U+2069, U+200B–U+200C, U+200D, U+FEFF across all 10 changed files returned zero matches. Non-ASCII characters in the implementation are: \u2588/\u2591 (visible block characters used in the cli-progress bar preset — intentional, consistent with existing repo style), \u2014 (em dash in a comment string), \u2591 in bar preset. All ANSI sequences are proper \x1B escape literals.
Recommended fix: N/A — clean.
Blocks merge: No.

Finding 12 — workerPool?.terminate() in finally without .catch()

Risk: If terminate() throws, the original error is masked.
Evidence found: parse-impl.ts:858: await workerPool?.terminate() — no .catch(). The WorkerPoolInitializationError catch block correctly uses await failedPool?.terminate().catch(() => undefined). The outer finally is inconsistent. However, Worker.terminate() from Node.js worker_threads is extremely unlikely to throw synchronously; the risk is theoretical.
Recommended fix: await workerPool?.terminate().catch(() => undefined) for consistency with the fallback catch. Fix this →
Blocks merge: No — pre-existing pattern, Worker.terminate() contract is reliable.

Finding 13 — build.js commit (acc4c1b) bundled in PR

Risk: Unrelated churn.
Evidence found: acc4c1b adds a configurable build timeout (GITNEXUS_BUILD_TIMEOUT_MS, default 5 minutes) to fix flaky arm64 Docker CI builds. This is a legitimate CI improvement — no analyze/worker/cache logic is touched. It passes lint, typecheck, and builds cleanly.
Blocks merge: No.


PR-specific assessment sections

Cache-hit no-worker behavior: ✅ Confirmed. getOrCreateWorkerPool() is strictly inside the cache-miss branch. The marker-file test would fail on even a single premature new Worker() call.

Partial-cache-miss behavior: ✅ Confirmed. Pool created once, reused across misses, terminated in finally. Cache aggregation is deterministic (alphabetical chunk sort preserved from #1693).

#1693 worker-pool/quarantine compatibility: ✅ Confirmed. WorkerPoolInitializationError catch re-throws all other WorkerPoolDispatchError subclasses. parsing-processor.ts explicitly documents that no broad sequential fallback exists. Quarantine snapshot on cache writes is unchanged.

Sequential fallback scope: ✅ Confirmed. Fallback activates only when activeSlots.size === 0 after the initial ready gate — the "all workers dead before first dispatch" window. Not reachable from mid-run quarantine, circuit-breaker, or timeout escalations.

Initial readiness gate: ✅ Confirmed. 5s timeout per slot, parallel settlement, failed slots dropped and terminated, healthy slots remain usable. Replacement worker path unchanged.

Respawn diagnostics and exit status: ✅ Confirmed. Bounded output, teed correctly, settled flag, status propagation including signal-only exits.

Native-abort vs heap-OOM classification: ✅ Materially improved. Status-134 with native stderr → native-worker guidance. Status-134 with no output → OOM fallback (acknowledged). One minor concern: "native binding" in ordinary warning output could misclassify; non-blocking.

Progress and terminal behavior: ✅ Confirmed post-6fba199. ANSI-safe and grapheme-safe truncation implemented and tested. Windows CI passed.

Warning routing: ✅ Confirmed. No duplication, no suppression. Test proves routing.

Hidden Unicode/hygiene: ✅ Clean — zero bidi/hidden controls.

Tests and CI: ✅ All 21 CI jobs green. New tests include hard negatives. Integration tests use real worker threads. Minor CI-HEAD-SHA timing uncertainty noted but consistent with full 3-commit branch.

Branch hygiene and scope: ✅ Clean. acc4c1b is a minor, purposeful CI improvement — not harmful and appropriately scoped.


Back-and-forth avoided by verifying

  • Confirmed getOrCreateWorkerPool() is exclusively inside the cache-miss else branch (parse-impl.ts:484+), not before the cache lookup.
  • Confirmed WorkerPoolInitializationError catch uses !(err instanceof WorkerPoolInitializationError) — only the startup failure falls back; all dispatch errors re-throw.
  • Confirmed workerPool = undefined is set explicitly before failedPool.terminate() in the catch block, preventing the outer finally from double-terminating.
  • Confirmed processParsingWithWorkers in parsing-processor.ts does not modify graph before calling dispatch — the graph is clean when the fallback sequential parse runs.
  • Confirmed GITNEXUS_ANALYZE_PROGRESS_ACTIVE is in ANALYZE_CLI_ENV_KEYS and properly snapshotted/restored.
  • Confirmed initialReadyGate = Promise.allSettled(...) — some-slot failure doesn't kill the pool; only all-slots-failed throws WorkerPoolInitializationError.
  • Confirmed childProcessLikelyNativeAbort requires output with native signatures — not triggered by silent exits.
  • Confirmed truncateAnsiToColumns is now ANSI-escape-preserving and grapheme-safe (commit 6fba199 addressed prior review's follow-up).
  • Confirmed zero bidi/hidden Unicode controls in all 10 changed files.
  • Confirmed all 21 CI jobs in run 26227708423 are success.

Open questions

None. All behavior is verifiable from the code, tests, CI, and history.


Final verdict

production-ready with minor follow-ups

The core fix is correct and thoroughly proven: all-cache-hit runs never touch the worker pool or native bindings; partial-cache-miss runs create the pool lazily, preserve #1693's quarantine/circuit-breaker/cache-integrity contract, and clean up reliably. The respawn diagnostic improvement accurately classifies the motivating libc++abi Napi::Error crash case. Progress rendering is now fully ANSI-safe and grapheme-safe after commit 6fba199 addressed the prior review's follow-up. Warning routing and env scoping are correct. CI is green on all 21 jobs across three platforms.

Two non-blocking items remain: (1) workerPool?.terminate() in the finally block at parse-impl.ts:858 lacks .catch(), creating a theoretical masking risk inconsistent with the failedPool.terminate().catch() pattern used in the same file — Fix this →; (2) childProcessLikelyNativeAbort matches "native binding" in ordinary warning output, which could produce misleading guidance on unrelated failures. Neither is blocking.

@magyargergo magyargergo merged commit 2a3d140 into abhigyanpatwari:main May 21, 2026
30 of 31 checks passed
@magyargergo

Copy link
Copy Markdown
Collaborator

Thank you for this!

ChamHerry pushed a commit to ChamHerry/GitNexus that referenced this pull request May 21, 2026
…#1751

问题描述(What): 修复 workerPool.terminate() 缺少 .catch() 和原生 abort 分类器误匹配问题
根本原因(Why): Claude 审查指出 finally 块中 terminate 无错误处理可能掩码原始异常,以及 native binding/worker 关键词会误匹配普通日志输出
修复内容(How): 1. terminate() 添加 .catch(() => undefined) 保持一致性 2. 将松散匹配改为组合短语策略(需同时出现 native+binding+crash/fail/error 或 native+worker+abort/crash/terminate)3. 新增误匹配回归测试和正确匹配保留测试
技术细节(Details): parse-impl.ts:858 添加 .catch(), analyze.ts:397-404 改用组合短语匹配, 新增 2 个测试用例
影响范围(Impact): analyze respawn 分类逻辑, parse worker pool 清理路径
@ChamHerry ChamHerry deleted the codex/analyze-native-worker-cache-hit branch May 21, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants