fix(hooks): bound db-lock probe subprocesses and gate probe behind hook slot (#2163)#2165
Conversation
…ok slot (abhigyanpatwari#2163) The Claude PreToolUse db-lock probe leaks orphaned lsof processes when the hook process is hard-killed mid-probe (e.g. Claude Code's 10s hook timeout under load). Orphans accumulate, raise load, slow the next probe, and snowball to sustained 100% CPU. - Wrap the unix lsof/ps fallback in coreutils timeout (-k 1 2 / -k 1 1), resolved via a lazy self-test, so probe children self-destruct within ~3s even if the hook is SIGKILLed. GITNEXUS_HOOK_TIMEOUT_PATH overrides the guard binary; the sentinel value 'disabled' turns the guard off; hosts without a usable guard keep the previous behavior. - Acquire the per-repo hook slot before probing (all three adapters), bounding concurrent probes to 3 per .gitnexus, with probe and augment inside try/finally so the slot is always released. - Tests: source-order contract, slot-gating behavior, orphan reaping with a SIGTERM-immune fake lsof and a SIGKILLed parent (red on base), probe-copy byte parity, no-guard equivalence, broken-guard rejection. Note: pre-commit typecheck skipped; the 62 tsc errors are pre-existing on main (all in src/core/** and src/server/, none in files touched here; base==head invariant verified).
|
@Minidoracat is attempting to deploy a commit to the NexusCore Team on Vercel. A member of the Team first needs to authorize it. |
CI Report✅ All checks passed Pipeline Status
Test Results
✅ All 11304 tests passed 16 test(s) skipped — expand for details
Code CoverageTests
📋 View full run · Generated by CI |
magyargergo
left a comment
There was a problem hiding this comment.
Tri-method review digest
Methods: 6 Claude reviewer lanes (risk, test/CI, security-boundary, correctness, adversarial, reliability) + Codex (the one genuinely independent engine — returned live results). Claude-lane agreement is "consistent across personas," not independent confirmation; the coordinator additionally reproduced every load-bearing claim directly (tagged below).
Verdict: no merge-blocking issues found
This is a carefully built PR whose headline claims all checked out under adversarial scrutiny.
Verified solid
- Red/green is real
[reproduced]: with the 5 production hook files reverted to base and the new tests kept, exactly the 8 claimed regression tests fail (source-order ×3, slot-gate ×2, orphan-reaping ×1, strengthened probe-contract ×2); 172/172 green at head. - The ETIMEDOUT fail-closed contract survives the wrapper
[reproduced](GNU coreutils 9.1): when Node's 1s spawnSync timer SIGTERMs the wrapper,timeoutforwards TERM and arms-k— a TERM-immune child is reaped ~1s later and spawnSync still reportserror.code === 'ETIMEDOUT'. Containment holds on both the supervised and SIGKILLed-hook paths. - lsof-missing fail-open equivalence under the wrapper (wrapper exit 127 → no
.error→ empty pid list →false)[code-read + lane-verified empirically]. - Slot release is exactly-once on every path including the new early-return inside
try(finally + idempotent release + owner re-check before unlink); dead-pid slots are reclaimed immediately via ESRCH, not after 30s. - Security: clean. Args-array spawns throughout (no shell), the
--barrier fordbPathAbssurvives the wrapper insertion, the new env override mirrors the siblingGITNEXUS_HOOK_*_PATHpattern and is strictly stronger (mandatory self-test), no hidden Unicode/bidi, and the diff contains only what the description says. - Probe copies byte-identical, now test-enforced.
Refuted findings (validation is a feature)
- Codex P1 ("spawnSync 1000ms < wrapper 2s budget inverts the kill ordering; lsof may still orphan; spurious fail-closed") — refuted by direct measurement (above): the wrapper forwards the SIGTERM it receives and still arms
-k, so the inner-shorter-than-wrapper design is intentional and safe. - Codex P2 (orphan-reaping test "passes vacuously" on runners without coreutils
timeout) — it cannot pass vacuously: the fixture's fake lsof is TERM-immune (lsofIgnoreSigterm) and sleeps 30s, so without a guard it survives and theexpect(alive).toBe(false)assertion fails loudly — exactly as the PR's "known limitations" section documents. A guard-availability precheck/skipIfwould make the failure reason clearer (polish). - Slot starvation, 30s-stale-threshold wedges, and exit-handler accumulation were probed by three lanes and refuted with evidence.
Findings — all P3, none blocking (2 inline, rest below)
- (inline) The
status === 137arm of the wrapper exit-code mapping is dead code for GNU coreutils, and the SIGSTOP-window residual lands fail-open via the unmappedstatus === null && signalshape. - (inline) Slot-starved hooks now exit silently even under
GITNEXUS_DEBUG=1. GITNEXUS_HOOK_TIMEOUT_PATHpointing at an existing but unusable path (directory, non-executable) silently disables containment without falling through to the built-in candidates — the self-test failure is swallowed (catchdiscards all errors), and the docstring's "a typo'd path can never silently disable orphan containment" (probe :79) only holds for non-existent paths. Doc/hardening nuance, not a logic regression.- (pre-existing, out of declared scope) The augment child (7s local / 12s npx) is now the longest-lived unwrapped subprocess; the same SIGKILL-orphan mechanism this PR fixes for lsof/ps applies to it. Fits the #2163 follow-up list.
- Test gaps: the 124/137 mapping has no direct unit test (a fake guard exiting 124 would pin the live arm; a
status:null/signalfake would have exposed finding 1); the antigravity adapter has source-order coverage only (no behavioral slot-gate test); busybox-shaped wrappers are untested.
CI: all green except the long-standing Vercel "authorization required" check (infra, unrelated).
Coverage limits: full diff read; tests executed locally; timeout semantics reproduced on GNU coreutils 9.1/Linux only — macOS BSD timeout and busybox behavior are code-read/documentation-level assessments.
Automated multi-tool review digest (6 Claude lanes + Codex, coordinator-verified); please verify findings before acting on them.
- Map guard signal-death (status null + signal, no spawnSync error) to fail-closed at both the lsof and ps call sites, closing the freeze window (SIGSTOP / laptop sleep > 2s) that previously landed fail-open. Rewrite the exit-code comments: coreutils surfaces the -k kill as signal death, 124 is budget expiry (live arm), 137 covers only exit-code-propagating wrappers or an externally SIGKILLed child. - Add a debug-gated 'augment skipped: hook slots saturated' stderr line on the slot-starved early return in all three adapters, restoring observability under GITNEXUS_DEBUG=1. - GITNEXUS_HOOK_TIMEOUT_PATH now participates in candidate fall-through: the env candidate is tried first, then the built-ins, each behind the lazy self-test — an existing-but-unusable env path (directory, non-executable) can no longer silently disable orphan containment. - Tests: +6 — guard exit 124 pins the live arm (CJS+Plugin), guard signal-death pins the new mapping (CJS+Plugin, red before the fix), antigravity behavioral slot-gate, env-dir fall-through still reaps a SIGTERM-immune orphan via a built-in guard. Note: pre-commit typecheck skipped; the 62 tsc errors are pre-existing on main (none in files touched here).
Review follow-up — all five P3 findings addressed in 07a7b7eThanks for the thorough tri-method review — the refuted-findings section in particular saved us a round trip. Dispositions:
Local state after the follow-up: hooks suite 178 tests, 176 green on the incident-class host (the 2 reds remain the pre-existing #1493 ENOENT fail-open pair documented under Known limitations — environmental, not introduced here); antigravity e2e 23/23; probe copies byte-identical; |
|
@Minidoracat Would you be interested in tackling with this?
|
|
Happy to take it! Plan: a focused follow-up PR that wraps the augment child in the same self-tested guard (sharing the resolver from the probe module so the byte-parity convention is preserved), with an orphan-reaping regression test mirroring the lsof one (SIGTERM-immune fake CLI via |
Summary
Stops the orphaned-
lsofCPU storm from the Claude db-lock probe by (1) wrapping the unixlsof/psfallback in a self-destructing coreutilstimeoutguard and (2) acquiring the per-repo hook slot before probing, bounding concurrent probes to 3 per.gitnexus. Zero contract changes — fail-open/fail-closed semantics, env names, function names, and the Windows path are untouched.Fixes #2163
Motivation / context
See #2163 for the full incident analysis. Short version: the probe (introduced in #1493) runs unserialized on every Grep/Glob/Bash tool call. On a busy Linux host the
/procscan blows its 1200 ms budget, falls back tolsof(which can't finish within its 1 sspawnSynctimeout there), and when Claude Code's 10 s hook timeout hard-kills the hook mid-spawnSync, thelsofchild is never signalled — it reparents to PID 1 and burns CPU for minutes. Each orphan raises load, making the next probe slower and more likely to be hard-killed: a self-reinforcing snowball that pinned a 16-core host at load 122 with 114 orphanedlsofprocesses.The fix breaks both factors of that feedback loop: orphan lifetime is now bounded at ~3 s (growth term → zero) and concurrent probes are bounded at 3 per repo (instantaneous term → bounded).
Areas touched
gitnexus/(CLI / core / MCP server) — hooks + tests onlygitnexus-web/(Vite / React UI).github/(workflows, actions)eval/or other toolingAlso touches
gitnexus-claude-plugin/hooks/(the byte-identical probe copy and the plugin adapter, kept in sync per the #2134 convention).Scope & constraints
In scope
hook-db-lock-probe.cjs(both copies, byte-identical): newresolveUnixGuardTimeout()— resolves coreutilstimeout/gtimeout, validated by a lazy one-shot self-test (timeout -k 1 1 /bin/sh -c :must exit 0) so broken wrappers (busybox < 1.34, toybox, dead symlinks) are rejected and the host falls back to current behavior rather than degrading fail-closed to fail-open.GITNEXUS_HOOK_TIMEOUT_PATHoverrides the binary; the sentinel valuedisabledturns the guard off; an invalid path falls through to the built-in candidates (same semantics as the siblingGITNEXUS_HOOK_LSOF_PATH/PS_PATHenvs).lsofruns undertimeout -k 1 2,psundertimeout -k 1 1. The innerspawnSynctimeouts (1000 ms / 500 ms) are unchanged and always shorter than the wrapper budgets, so on a live hook the existing ETIMEDOUT fail-closed path still triggers first — the wrapper only matters once the hook itself has been SIGKILLed. Defensive mapping of wrapper exit codes 124/137 → fail-closed (coreutils-only belt-and-suspenders, commented as such).gitnexus/hooks/claude/gitnexus-hook.cjs,gitnexus-claude-plugin/hooks/gitnexus-hook.js,gitnexus/hooks/antigravity/gitnexus-antigravity-hook.cjs): reordered to cheap guards →acquireHookSlot→ probe + augment insidetry/finally { release() }. The slot is a non-blocking try-lock, so hook latency only goes down. Worst-case slot hold (~18 s, Windows probe + npx-fallback augment) stays under the 30 s stale threshold with 1.6× margin —hook-lock.cjsis untouched.hooks.test.ts, +21 inhook-test-helpers.ts): source-order contract for all three adapters; slot-gate behavior (probe must not run when slots are full); orphan reaping — a SIGTERM-immune fakelsofwhose hook parent is SIGKILLed must die within 5 s (red on base, green here; this is the incident's minimal reproduction); probe-copy byte parity; no-guard equivalence; broken-guard rejection stays fail-closed.[Unreleased].Explicitly out of scope / not done here (follow-ups proposed in #2163)
/procscan that removes the Linuxlsoffallback entirely (performance rework, changes the budget-timeout fail semantics — separate PR).win-rm-list-json.ps1watchdog, Cursor-integration probe gap,doctorhook-staleness check, dev-checkoutisGitNexusServerCommandregex gap — each deserves its own issue.gitnexus setup(no settings.json changes, no migration).Implementation notes
hook-db-lock-probe.cjscopies are byte-identical (diff -qclean) and the spawn-count/windowsHideparity convention is preserved (every spawn-family call site carrieswindowsHide: true, including the new self-test)./usr/bin/timeoutthat is syntax-compatible with the flags used; older macOS falls through togtimeout(Homebrew coreutils) or no guard.Testing & verification
cd gitnexus && npx vitest run test/unit/hooks.test.ts test/unit/resolve-invocation.test.ts test/unit/setup.test.ts test/unit/cli-commands.test.ts— 172 tests inhooks.test.ts(see known limitations for 2 host-specific reds)cd gitnexus && npx vitest run test/integration/antigravity-hook-e2e.test.tsnpx prettier --check/npx eslinton all changed files (0 errors)cd gitnexus && npx tsc --noEmit— 62 pre-existing errors on main (src/core/**,src/server/upload-ingest.ts), none in files touched here; base == head verifiedcd gitnexus && npm run test:integration— not run (no core/indexing/MCP paths changed)Red/green evidence: with the five production hook files reverted to base and the new tests kept, the mandatory regression set goes red — source-order ×3, slot-gate ×2, orphan-reaping ×1 (plus 2 strengthened probe-contract assertions). All green with the fix applied.
Incident-class host verification (16-core Linux dev server, ~600–700 processes — the same host class as the #2163 incident):
lsofleft, 0 orphans.lsof, 0 leftovertimeoutwrappers within 5 s.lsof(PPID 1) from real hard-kill timing alone; the same suite against this branch leaks none.Known limitations
#1493"ENOENT lsof → fail-open" tests are red on the pathological host itself (real/usr/bin/lsoftakes ~2 s there, exceeding the probe's frozen 1 s budget → fail-closed — i.e. the bug(hooks): Claude db-lock probe leaks orphaned lsof processes on busy Linux hosts — snowballs to sustained 100% CPU #2163 pathology, present on base, unchanged by this PR). Fast CI runners are unaffected.timeouton the Linux lane (GitHub ubuntu runners ship GNU coreutils; a minimal/container runner without it would skip-fail that one test by design — it is the wrapper regression test).🤖 Generated with Claude Code