fix(analyze): prevent cache-hit native workers from aborting #1751
Someone is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
Delay parse worker startup until a cache miss requires it, fall back to sequential parsing when initial worker readiness fails, and preserve analyzer diagnostics/progress when heap respawn captures child output.

Constraint: Node 25 and tree-sitter/N-API worker initialization can abort before ready, while warm-cache analysis should not start workers at all.
Rejected: Treating status-134/SIGABRT as heap OOM unconditionally | native worker aborts require distinct recovery guidance and stderr/stdout evidence.
Rejected: cli-progress noTTYOutput for respawn progress | it appends newline frames instead of preserving one-line redraw UX.
Confidence: high
Scope-risk: moderate
Directive: Keep parse-worker creation behind confirmed cache misses and preserve TTY-style progress when respawn pipes stderr for crash classification.
Tested: GitNexus impact analysis for ensureHeap, runChunkedParseAndResolve, createWorkerPool, WorkerPool, walkRepositoryPaths; GitNexus detect_changes scoped to staged worktree; targeted vitest for analyze respawn, parse lazy cache, filesystem walker, worker pool; npx tsc --noEmit; npm run build; NODE_OPTIONS='--max-old-space-size=8192' npm test.
Not-tested: Windows terminal rendering and published npm package install path.
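For readers following the commit message above, here is a minimal sketch of the lazy-creation idea: the pool is only created on the first confirmed cache miss, and a failure at initial readiness switches the rest of the run to sequential parsing. Every name here (`ParsePool`, `ParseDeps`, `parseChunks`) is hypothetical, and the code is synchronous for brevity; it is not the actual `runChunkedParseAndResolve` implementation.

```typescript
// Hypothetical types for illustration only; not the GitNexus API.
interface ParsePool {
  parse(chunk: string): string;
  terminate(): void;
}

interface ParseDeps {
  cache: Record<string, string>; // chunk id -> cached parse result
  createPool: () => ParsePool; // assumed to throw when workers never become ready
  parseSequentially: (chunk: string) => string;
}

function parseChunks(chunks: string[], deps: ParseDeps): string[] {
  let pool: ParsePool | undefined;
  let poolUnavailable = false;
  const results: string[] = [];
  try {
    for (const chunk of chunks) {
      if (Object.prototype.hasOwnProperty.call(deps.cache, chunk)) {
        // Cache hit: replay the stored result without touching any worker.
        results.push(deps.cache[chunk]);
        continue;
      }
      // First confirmed cache miss: only now attempt to start workers.
      if (pool === undefined && !poolUnavailable) {
        try {
          pool = deps.createPool();
        } catch {
          // Initial readiness failed: use sequential parsing for this run.
          poolUnavailable = true;
        }
      }
      // Errors thrown by pool.parse() mid-run are deliberately not caught here.
      const parsed = pool !== undefined ? pool.parse(chunk) : deps.parseSequentially(chunk);
      deps.cache[chunk] = parsed;
      results.push(parsed);
    }
  } finally {
    if (pool !== undefined) pool.terminate();
  }
  return results;
}
```

In this sketch a run where every chunk is cached never calls `createPool`, and the fallback is scoped to the initial creation attempt only, so a failure after the pool is running still surfaces as an error.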
d617e17 to c84c903
CI Report: ✅ All checks passed
Test Results: ✅ All 9417 tests passed, 1 test skipped
|
Claude finished @magyargergo's task in 6m 13s
PR #1751 Production-Readiness Review
Review bar for this PR
Problem being solved
Claimed bug: On warm parse-cache runs,
Secondary bug: The heap-respawn path used
Expected behavior: An all-cache-hit warm run should replay cached parse results, produce the same graph, and exit cleanly without instantiating any worker or loading any native binding.

Current PR state

Merge status and mergeability
mergeable
All required CI jobs are complete and green. CodeQL, Gitleaks, Dependency Review, and PR Autofix passed. Docker Build & Push is still in progress at review time; this is a CLI-only change with no image-production impact, and Vercel authorization for NexusCore is similarly non-blocking. 9416/9417 tests passed; the 1 skipped test is a documented Ruby closure-param inference limitation unrelated to this PR.

Repository history considered

Branch hygiene assessment
clean feature/fix PR
One commit (

Understanding of the change
cache/parse pipeline (
worker-pool readiness/fallback (
heap respawn/native abort diagnostics (
progress rendering (
filesystem warnings (
tests: Five new/heavily modified test files.

Findings

Finding 1: Cache-hit no-worker behavior
Risk: Worker pool constructed before cache check, causing native abort on warm runs.

Finding 2: Partial-cache-miss behavior
Risk: Lazy creation might not cache-miss chunks correctly or might re-create the pool per miss.

Finding 3: #1693 quarantine/timeout compatibility
Risk: Lazy pool creation might bypass quarantine, circuit-breaker, or dispatch-error semantics.

Finding 4: Initial readiness fallback scope
Risk: Fallback could be broader than initial-readiness failure, silently masking mid-run crashes.

Finding 5: Respawn output capture and status propagation
Risk: Unbounded output buffering, lost output, double-resolution, incorrect exit code.

Finding 6: Native abort vs heap OOM classifier
Risk: Wrong recovery guidance; heap OOM classified as native abort or vice versa.
Minor gap (non-blocking): A native crash that produces zero stderr output (hard abort before any libc output) will be classified as OOM (condition b), not native abort. This was the prior behavior and is explicitly acknowledged in the user-facing message. No test covers this mixed case, but the classification improvement for crashes WITH output is materially better than before.

Finding 7: Progress terminal bridge and ANSI correctness
Risk: Corrupt output, broken progress display on non-TTY or Windows.
Minor issue (non-blocking): The

Finding 8:
|
Docker PR builds run gitnexus prepare under QEMU for linux/arm64, where the fixed 120s TypeScript timeout can kill otherwise healthy builds. Increase the default timeout and allow GITNEXUS_BUILD_TIMEOUT_MS to tune slower environments without changing the build steps.

Constraint: PR abhigyanpatwari#1751 Docker Build & Push gitnexus failed with spawnSync /bin/sh ETIMEDOUT while running node_modules/.bin/tsc in scripts/build.js.
Rejected: Rerunning CI only | the failure was the build script's deterministic timeout boundary under arm64 emulation, not a code assertion.
Confidence: high
Scope-risk: narrow
Directive: Keep build timeout changes in scripts/build.js configurable; do not hide real compiler failures, only allow slower successful compiles to finish.
Tested: GitNexus impact for gitnexus/scripts/build.js reported LOW; gitnexus detect_changes reported 1 changed file, 0 affected processes, low risk; git diff --check; gitnexus npm run build.
Not-tested: GitHub Docker arm64 build rerun before pushing; local Docker multi-platform build under QEMU.
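One way the environment override described above could be read is sketched below. Only `GITNEXUS_BUILD_TIMEOUT_MS` comes from the commit message; the helper name `resolveBuildTimeoutMs` and the default of 600000 ms are assumptions, since the message does not state the new default.

```typescript
// Assumed default for illustration; the commit message does not give the real value.
const DEFAULT_BUILD_TIMEOUT_MS = 600000;

// Hypothetical helper: read the override, falling back to the default
// when the variable is unset, empty, non-numeric, or not positive.
function resolveBuildTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["GITNEXUS_BUILD_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_BUILD_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  if (!isFinite(parsed) || parsed <= 0) {
    return DEFAULT_BUILD_TIMEOUT_MS;
  }
  return Math.floor(parsed);
}
```

The result would typically be passed as the `timeout` option of Node's `child_process.spawnSync`, for example `resolveBuildTimeoutMs(process.env)`. A compiler failure still exits non-zero as before; the larger limit only lets a slow but healthy compile finish.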
|
@ChamHerry Can you please look into Claude's findings? 🙏 |
Preserve complete ANSI escape sequences and grapheme boundaries when the respawn progress terminal shim truncates wrapped output, so the shim does not emit dangling escape bytes or split surrogate pairs while keeping raw writes untouched.

Constraint: Claude review on PR abhigyanpatwari#1751 flagged `s.slice(0, width)` in createAnsiPipeTerminal.write() as a latent terminal-corruption risk.
Rejected: Adding a display-width dependency | a local helper is sufficient for this narrow respawn terminal shim and avoids new dependency churn.
Rejected: Changing silent status-134 classification | current tests already document the output-less 134 fallback as heap guidance.
Confidence: high
Scope-risk: narrow
Directive: Keep respawn terminal writes ANSI-aware and preserve rawWrite bypass semantics for callers that intentionally write control sequences.
Tested: GitNexus impact for createAnsiPipeTerminal reported LOW; GitNexus detect_changes reported 2 changed files, 3 affected processes, medium risk; targeted vitest for analyze respawn progress and heap respawn; gitnexus npx tsc --noEmit; prettier check for changed files; eslint for changed files.
Not-tested: Full npm test suite; manual terminal rendering on Windows.
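To illustrate why `s.slice(0, width)` is risky and what an ANSI-aware alternative might look like, here is a small sketch. The function name `truncateAnsi` is hypothetical and this is not the helper added in the commit. It copies CSI escape sequences whole, counts each code point as one column, and never splits a surrogate pair; it does not attempt full grapheme-cluster or wide-character width handling.

```typescript
// Hypothetical helper: keep at most `width` visible code points, copy CSI
// escape sequences (ESC [ ... final byte) intact, and never split a surrogate pair.
function truncateAnsi(s: string, width: number): string {
  let out = "";
  let visible = 0;
  let i = 0;
  while (i < s.length) {
    if (s[i] === "\x1b" && s[i + 1] === "[") {
      let j = i + 2;
      // Skip parameter and intermediate bytes (0x20-0x3f).
      while (j < s.length && s.charCodeAt(j) >= 0x20 && s.charCodeAt(j) <= 0x3f) {
        j++;
      }
      // The next byte is treated as the final byte; include it when present.
      const end = j < s.length ? j + 1 : s.length;
      out += s.slice(i, end); // escape sequences take no columns
      i = end;
      continue;
    }
    const code = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    const isPair = code >= 0xd800 && code <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    const len = isPair ? 2 : 1;
    if (visible < width) {
      out += s.slice(i, i + len);
      visible += 1;
    }
    // Past the width limit, visible text is dropped but later escapes are kept.
    i += len;
  }
  return out;
}
```

Escape sequences that follow the cut point are still emitted in this sketch, so a trailing colour reset such as `\x1b[0m` survives truncation. Whether that trade-off suits the real shim is a design choice this example does not settle.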
|
Thanks. I looked into both Claude follow-ups and pushed
What changed:
On the silent
Local verification:
|
Claude finished @magyargergo's task in 10m 5s
PR #1751 Production-Readiness Review
Review bar for this PR
Problem being solved
Bug:
Secondary bug: The heap-respawn path used
Expected: Warm-cache all-hit runs replay cached parse output without touching any native binding. Native crash diagnostics are classified separately from heap OOM.

Current PR state
The initial PR visible head was

Merge status and mergeability
mergeable
All 6 CI workflow runs (CI, Docker Build & Push, CodeQL, Gitleaks, Dependency Review, PR Autofix) completed with

Repository history considered
PR #1693 (
Issue #1684 (TypeScript-root hang, closed by #1693): No regression. The
Issue #1665 / PR #1666 (Napi::Error on C/C++ repos, PR closed without merge): #1666 routed C/C++ chunks through sequential parsing, the rejected broad fallback. This PR avoids the same trap:
Issue #1726 / #1741 (analyze stuck at 49% on large repos): This PR does not touch the cross-file type propagation or scope-resolution phase. Not regressed: dispatch logic for mid-run failures is unchanged.

Branch hygiene assessment
clean feature/fix PR
Three commits, all purposeful:

Understanding of the change
cache/parse pipeline (
worker-pool readiness/fallback (
heap respawn/native abort diagnostics (
progress rendering (
filesystem warnings (
tests:

Findings

Finding 1: Cache-hit no-worker behavior
Risk: Worker pool created before cache check, causing native abort on warm runs.

Finding 2: Partial-cache-miss behavior
Risk: Lazy creation might not aggregate cache-hit chunks correctly, or pool might be created more than once.

Finding 3: #1693 worker-pool/quarantine compatibility
Risk: Lazy pool creation might bypass quarantine, circuit-breaker, or cache-skip semantics.

Finding 4: Sequential fallback scope
Risk: Fallback might reintroduce the broad sequential behavior #1693 intentionally removed.

Finding 5: Initial readiness gate correctness
Risk: Gate might hang, mishandle some-slot failure, or leak worker threads.

Finding 6: Respawn output capture and exit status
Risk: Unbounded buffers, lost exit code, race between

Finding 7: OOM vs native-abort classification
Risk: Wrong guidance; misclassification of native crashes as OOM or vice versa.
Minor observation (non-blocking):

Finding 8: Progress terminal bridge and ANSI correctness (post-6fba199)
Risk: Multibyte/ANSI corruption in progress output, especially on piped stderr.

Finding 9: Warning routing during active progress
Risk: Large-file skip warnings swallowed, duplicated, or corrupted in progress output.

Finding 10:
|
Thank you for this! |
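…#1751

Problem (What): Fix the missing .catch() on workerPool.terminate() and the false-positive matches in the native abort classifier.
Root cause (Why): Claude's review pointed out that terminate in the finally block had no error handling and could mask the original exception, and that the native binding/worker keywords could falsely match ordinary log output.
Fix (How):
1. Add .catch(() => undefined) to terminate() for consistency.
2. Replace the loose matching with a combined-phrase strategy (requires native + binding + crash/fail/error, or native + worker + abort/crash/terminate, to appear together).
3. Add a false-match regression test and a test confirming that correct matches are retained.
Details: parse-impl.ts:858 adds .catch(); analyze.ts:397-404 switches to combined-phrase matching; 2 new test cases.
Impact: analyze respawn classification logic, parse worker pool cleanup path.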
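The combined-phrase strategy in item 2 can be sketched roughly as follows. The function name `looksLikeNativeAbort` is hypothetical, and evaluating each output line separately is an assumption; the commit message gives only the keyword combinations, not how analyze.ts scopes the match.

```typescript
// Hypothetical classifier sketch: a line counts as native-abort evidence only when
// "native" co-occurs with "binding" plus a failure word, or with "worker" plus an
// abort word. Per-line matching is an assumption, not confirmed by the source.
function looksLikeNativeAbort(output: string): boolean {
  const lines = output.toLowerCase().split(/\r?\n/);
  for (const line of lines) {
    const has = (word: string): boolean => line.indexOf(word) !== -1;
    if (!has("native")) {
      continue;
    }
    const bindingFailure = has("binding") && (has("crash") || has("fail") || has("error"));
    const workerAbort = has("worker") && (has("abort") || has("crash") || has("terminate"));
    if (bindingFailure || workerAbort) {
      return true;
    }
  }
  return false;
}
```

Under this sketch an ordinary log line such as "loading native binding for tree-sitter" no longer matches, while "native binding crashed during init" still does.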
Summary
Fixes an intermittent `gitnexus analyze` abort seen on warm parse-cache runs, where the analyzer could still start parse workers even when every chunk was a cache hit.

This PR keeps the fix intentionally narrow:

Root cause
Warm-cache analysis was replaying cached chunk results, but `runChunkedParseAndResolve` had already created a worker pool before the chunk cache check. That meant a cache-hit-only run could still execute the parse-worker top-level module and load tree-sitter / N-API native bindings. On newer runtimes (observed with Node 25), that native path can abort the process before any cache miss requires parsing.

A second issue made diagnosis misleading: the heap-respawn path inherited child stdio, so the parent could not inspect child stderr/stdout after abnormal exits. Native `SIGABRT` / status 134 failures could be reported as heap OOM even when stderr contained a native binding crash signature.

Validation
- `ensureHeap` → LOW
- `runChunkedParseAndResolve` → LOW
- `createWorkerPool` → MEDIUM
- `WorkerPool` → MEDIUM
- `walkRepositoryPaths` → LOW
- `detect_changes` on staged linked worktree → risk `medium`, 9 changed files, affected analyze execution flows only
- `./node_modules/.bin/vitest run test/unit/analyze-heap-respawn.test.ts test/unit/analyze-respawn-progress-terminal.test.ts test/unit/parse-impl-worker-lazy-cache.test.ts test/integration/filesystem-walker.test.ts test/integration/worker-pool.test.ts`
- `npx tsc --noEmit`
- `npm run build`
- `NODE_OPTIONS='--max-old-space-size=8192' npm test` → 355 files / 9140 tests passed, 1 skipped

Notes
This branch was created from the current upstream `main` and excludes unrelated local i18n, doctor, and docs/todo work.