Skip to content

fix(hooks): bound db-lock probe subprocesses and gate probe behind hook slot (#2163)#2165

Merged
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
Minidoracat:fix/2163-hook-lsof-orphan-storm
Jun 11, 2026
Merged

fix(hooks): bound db-lock probe subprocesses and gate probe behind hook slot (#2163)#2165
magyargergo merged 2 commits into
abhigyanpatwari:mainfrom
Minidoracat:fix/2163-hook-lsof-orphan-storm

Conversation

@Minidoracat

Copy link
Copy Markdown
Contributor

Summary

Stops the orphaned-lsof CPU storm from the Claude db-lock probe by (1) wrapping the unix lsof/ps fallback in a self-destructing coreutils timeout guard and (2) acquiring the per-repo hook slot before probing, bounding concurrent probes to 3 per .gitnexus. Zero contract changes — fail-open/fail-closed semantics, env names, function names, and the Windows path are untouched.

Fixes #2163

Motivation / context

See #2163 for the full incident analysis. Short version: the probe (introduced in #1493) runs unserialized on every Grep/Glob/Bash tool call. On a busy Linux host the /proc scan blows its 1200 ms budget, falls back to lsof (which can't finish within its 1 s spawnSync timeout there), and when Claude Code's 10 s hook timeout hard-kills the hook mid-spawnSync, the lsof child is never signalled — it reparents to PID 1 and burns CPU for minutes. Each orphan raises load, making the next probe slower and more likely to be hard-killed: a self-reinforcing snowball that pinned a 16-core host at load 122 with 114 orphaned lsof processes.

The fix breaks both factors of that feedback loop: orphan lifetime is now bounded at ~3 s (growth term → zero) and concurrent probes are bounded at 3 per repo (instantaneous term → bounded).

Areas touched

  • gitnexus/ (CLI / core / MCP server) — hooks + tests only
  • gitnexus-web/ (Vite / React UI)
  • .github/ (workflows, actions)
  • eval/ or other tooling
  • Docs / agent config only

Also touches gitnexus-claude-plugin/hooks/ (the byte-identical probe copy and the plugin adapter, kept in sync per the #2134 convention).

Scope & constraints

In scope

  • hook-db-lock-probe.cjs (both copies, byte-identical): new resolveUnixGuardTimeout() — resolves coreutils timeout/gtimeout, validated by a lazy one-shot self-test (timeout -k 1 1 /bin/sh -c : must exit 0) so broken wrappers (busybox < 1.34, toybox, dead symlinks) are rejected and the host falls back to current behavior rather than degrading fail-closed to fail-open. GITNEXUS_HOOK_TIMEOUT_PATH overrides the binary; the sentinel value disabled turns the guard off; an invalid path falls through to the built-in candidates (same semantics as the sibling GITNEXUS_HOOK_LSOF_PATH/PS_PATH envs).
  • Guarded spawns: lsof runs under timeout -k 1 2, ps under timeout -k 1 1. The inner spawnSync timeouts (1000 ms / 500 ms) are unchanged and always shorter than the wrapper budgets, so on a live hook the existing ETIMEDOUT fail-closed path still triggers first — the wrapper only matters once the hook itself has been SIGKILLed. Defensive mapping of wrapper exit codes 124/137 → fail-closed (coreutils-only belt-and-suspenders, commented as such).
  • All three adapters (gitnexus/hooks/claude/gitnexus-hook.cjs, gitnexus-claude-plugin/hooks/gitnexus-hook.js, gitnexus/hooks/antigravity/gitnexus-antigravity-hook.cjs): reordered to cheap guards → acquireHookSlot → probe + augment inside try/finally { release() }. The slot is a non-blocking try-lock, so hook latency only goes down. Worst-case slot hold (~18 s, Windows probe + npx-fallback augment) stays under the 30 s stale threshold with 1.6× margin — hook-lock.cjs is untouched.
  • Tests (+379 lines in hooks.test.ts, +21 in hook-test-helpers.ts): source-order contract for all three adapters; slot-gate behavior (probe must not run when slots are full); orphan reaping — a SIGTERM-immune fake lsof whose hook parent is SIGKILLed must die within 5 s (red on base, green here; this is the incident's minimal reproduction); probe-copy byte parity; no-guard equivalence; broken-guard rejection stays fail-closed.
  • CHANGELOG entry under [Unreleased].

Explicitly out of scope / not done here (follow-ups proposed in #2163)

  • Direction (1): cmdline-first /proc scan that removes the Linux lsof fallback entirely (performance rework, changes the budget-timeout fail semantics — separate PR).
  • Direction (3): TTL cache for probe results (separate PR).
  • Windows win-rm-list-json.ps1 watchdog, Cursor-integration probe gap, doctor hook-staleness check, dev-checkout isGitNexusServerCommand regex gap — each deserves its own issue.
  • No version bump: plugin manifests stay lockstep with the release flow; plugin users pick this up with the next release, npm/antigravity users via gitnexus setup (no settings.json changes, no migration).

Implementation notes

  • The two hook-db-lock-probe.cjs copies are byte-identical (diff -q clean) and the spawn-count/windowsHide parity convention is preserved (every spawn-family call site carries windowsHide: true, including the new self-test).
  • On hosts without a usable guard binary the generated argv is identical to today's, byte for byte — no behavior change.
  • macOS ≥ 13 ships a BSD /usr/bin/timeout that is syntax-compatible with the flags used; older macOS falls through to gtimeout (Homebrew coreutils) or no guard.

Testing & verification

  • cd gitnexus && npx vitest run test/unit/hooks.test.ts test/unit/resolve-invocation.test.ts test/unit/setup.test.ts test/unit/cli-commands.test.ts — 172 tests in hooks.test.ts (see known limitations for 2 host-specific reds)
  • cd gitnexus && npx vitest run test/integration/antigravity-hook-e2e.test.ts
  • npx prettier --check / npx eslint on all changed files (0 errors)
  • cd gitnexus && npx tsc --noEmit — 62 pre-existing errors on main (src/core/**, src/server/upload-ingest.ts), none in files touched here; base == head verified
  • cd gitnexus && npm run test:integration — not run (no core/indexing/MCP paths changed)

Red/green evidence: with the five production hook files reverted to base and the new tests kept, the mandatory regression set goes red — source-order ×3, slot-gate ×2, orphan-reaping ×1 (plus 2 strengthened probe-contract assertions). All green with the fix applied.

Incident-class host verification (16-core Linux dev server, ~600–700 processes — the same host class as the #2163 incident):

  • 50 concurrent hook invocations → 0 lsof left, 0 orphans.
  • 20 concurrent invocations with every hook process SIGKILLed mid-probe (simulating the Claude Code hook-timeout kill) → 0 orphaned lsof, 0 leftover timeout wrappers within 5 s.
  • Live contrast observed while iterating: running the full test suite against base leaked 2 orphaned lsof (PPID 1) from real hard-kill timing alone; the same suite against this branch leaks none.

Known limitations

  • Two pre-existing #1493 "ENOENT lsof → fail-open" tests are red on the pathological host itself (real /usr/bin/lsof takes ~2 s there, exceeding the probe's frozen 1 s budget → fail-closed — i.e. the bug(hooks): Claude db-lock probe leaks orphaned lsof processes on busy Linux hosts — snowballs to sustained 100% CPU #2163 pathology, present on base, unchanged by this PR). Fast CI runners are unaffected.
  • The orphan-reaping test requires a working coreutils timeout on the Linux lane (GitHub ubuntu runners ship GNU coreutils; a minimal/container runner without it would skip-fail that one test by design — it is the wrapper regression test).

🤖 Generated with Claude Code

…ok slot (abhigyanpatwari#2163)

The Claude PreToolUse db-lock probe leaks orphaned lsof processes when
the hook process is hard-killed mid-probe (e.g. Claude Code's 10s hook
timeout under load). Orphans accumulate, raise load, slow the next
probe, and snowball to sustained 100% CPU.

- Wrap the unix lsof/ps fallback in coreutils timeout (-k 1 2 / -k 1 1),
  resolved via a lazy self-test, so probe children self-destruct within
  ~3s even if the hook is SIGKILLed. GITNEXUS_HOOK_TIMEOUT_PATH
  overrides the guard binary; the sentinel value 'disabled' turns the
  guard off; hosts without a usable guard keep the previous behavior.
- Acquire the per-repo hook slot before probing (all three adapters),
  bounding concurrent probes to 3 per .gitnexus, with probe and augment
  inside try/finally so the slot is always released.
- Tests: source-order contract, slot-gating behavior, orphan reaping
  with a SIGTERM-immune fake lsof and a SIGKILLed parent (red on base),
  probe-copy byte parity, no-guard equivalence, broken-guard rejection.

Note: pre-commit typecheck skipped; the 62 tsc errors are pre-existing
on main (all in src/core/** and src/server/, none in files touched
here; base==head invariant verified).
@Minidoracat Minidoracat requested a review from magyargergo as a code owner June 11, 2026 09:06
@vercel

vercel Bot commented Jun 11, 2026

Copy link
Copy Markdown

@Minidoracat is attempting to deploy a commit to the NexusCore Team on Vercel.

A member of the Team first needs to authorize it.

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

CI Report

All checks passed

Pipeline Status

Stage Status Details
✅ Typecheck success tsc --noEmit
✅ Tests success unit tests, 3 platforms
✅ E2E success gitnexus-web changes only

Test Results

Tests Passed Failed Skipped Duration
11320 11304 0 16 499s

✅ All 11304 tests passed

16 test(s) skipped — expand for details
  • COBOL pipeline benchmark > scales with file count
  • C++ ADL emit benchmark > emit phase scales sub-quadratically with co-scaled files and sites
  • C++ pipeline benchmark > scales with file count
  • C# pipeline benchmark > scales with file count — namespaces spread across the solution
  • C# pipeline benchmark > scales with file count — all types in one (global) namespace bucket
  • C# pipeline benchmark > scales with file count — all types in one (named) namespace bucket
  • Go pipeline benchmark > scales with file count (workers enabled)
  • Go pipeline benchmark — worker pool (issue Worker idle timeout kills long Go scope extraction and surfaces as Napi::Error during analyze #1848) > does not quarantine the large generated Go file on sub-batch idle timeout
  • Go structural interface detection benchmark > scales linearly with interface × struct count
  • Go structural interface detection split-phase benchmark > separates index-build and detection time
  • PHP pipeline benchmark > scales with file count (workers enabled)
  • Ruby pipeline benchmark > scales with file count (workers enabled)
  • Rust pipeline benchmark > scales with file count (workers enabled)
  • Vue pipeline benchmark > scales with component count
  • run.cjs direct-exec entrypoint (fix(cli): steer docs, skills, and hooks through a CLI-neutral project-local runner (#1939) #1945) > resolves a .cmd shim via the Windows shell branch, passing args and exit code
  • buildTypeEnv > known limitations (documented skip tests) > Ruby block parameter: users.each { |user| } — closure param inference, different feature

Code Coverage

Tests

Metric Coverage Covered Base Delta Status
Statements 75.78% 37317/49243 N/A% 🟢 ███████████████░░░░░
Branches 63.6% 23121/36349 N/A% 🟢 ████████████░░░░░░░░
Functions 81.44% 4021/4937 N/A% 🟢 ████████████████░░░░
Lines 79.49% 33694/42384 N/A% 🟢 ███████████████░░░░░

📋 View full run · Generated by CI

@magyargergo magyargergo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tri-method review digest

Methods: 6 Claude reviewer lanes (risk, test/CI, security-boundary, correctness, adversarial, reliability) + Codex (the one genuinely independent engine — returned live results). Claude-lane agreement is "consistent across personas," not independent confirmation; the coordinator additionally reproduced every load-bearing claim directly (tagged below).

Verdict: no merge-blocking issues found

This is a carefully built PR whose headline claims all checked out under adversarial scrutiny.

Verified solid

  • Red/green is real [reproduced]: with the 5 production hook files reverted to base and the new tests kept, exactly the 8 claimed regression tests fail (source-order ×3, slot-gate ×2, orphan-reaping ×1, strengthened probe-contract ×2); 172/172 green at head.
  • The ETIMEDOUT fail-closed contract survives the wrapper [reproduced] (GNU coreutils 9.1): when Node's 1s spawnSync timer SIGTERMs the wrapper, timeout forwards TERM and arms -k — a TERM-immune child is reaped ~1s later and spawnSync still reports error.code === 'ETIMEDOUT'. Containment holds on both the supervised and SIGKILLed-hook paths.
  • lsof-missing fail-open equivalence under the wrapper (wrapper exit 127 → no .error → empty pid list → false) [code-read + lane-verified empirically].
  • Slot release is exactly-once on every path including the new early-return inside try (finally + idempotent release + owner re-check before unlink); dead-pid slots are reclaimed immediately via ESRCH, not after 30s.
  • Security: clean. Args-array spawns throughout (no shell), the -- barrier for dbPathAbs survives the wrapper insertion, the new env override mirrors the sibling GITNEXUS_HOOK_*_PATH pattern and is strictly stronger (mandatory self-test), no hidden Unicode/bidi, and the diff contains only what the description says.
  • Probe copies byte-identical, now test-enforced.

Refuted findings (validation is a feature)

  • Codex P1 ("spawnSync 1000ms < wrapper 2s budget inverts the kill ordering; lsof may still orphan; spurious fail-closed") — refuted by direct measurement (above): the wrapper forwards the SIGTERM it receives and still arms -k, so the inner-shorter-than-wrapper design is intentional and safe.
  • Codex P2 (orphan-reaping test "passes vacuously" on runners without coreutils timeout) — it cannot pass vacuously: the fixture's fake lsof is TERM-immune (lsofIgnoreSigterm) and sleeps 30s, so without a guard it survives and the expect(alive).toBe(false) assertion fails loudly — exactly as the PR's "known limitations" section documents. A guard-availability precheck/skipIf would make the failure reason clearer (polish).
  • Slot starvation, 30s-stale-threshold wedges, and exit-handler accumulation were probed by three lanes and refuted with evidence.

Findings — all P3, none blocking (2 inline, rest below)

  1. (inline) The status === 137 arm of the wrapper exit-code mapping is dead code for GNU coreutils, and the SIGSTOP-window residual lands fail-open via the unmapped status === null && signal shape.
  2. (inline) Slot-starved hooks now exit silently even under GITNEXUS_DEBUG=1.
  3. GITNEXUS_HOOK_TIMEOUT_PATH pointing at an existing but unusable path (directory, non-executable) silently disables containment without falling through to the built-in candidates — the self-test failure is swallowed (catch discards all errors), and the docstring's "a typo'd path can never silently disable orphan containment" (probe :79) only holds for non-existent paths. Doc/hardening nuance, not a logic regression.
  4. (pre-existing, out of declared scope) The augment child (7s local / 12s npx) is now the longest-lived unwrapped subprocess; the same SIGKILL-orphan mechanism this PR fixes for lsof/ps applies to it. Fits the #2163 follow-up list.
  5. Test gaps: the 124/137 mapping has no direct unit test (a fake guard exiting 124 would pin the live arm; a status:null/signal fake would have exposed finding 1); the antigravity adapter has source-order coverage only (no behavioral slot-gate test); busybox-shaped wrappers are untested.

CI: all green except the long-standing Vercel "authorization required" check (infra, unrelated).
Coverage limits: full diff read; tests executed locally; timeout semantics reproduced on GNU coreutils 9.1/Linux only — macOS BSD timeout and busybox behavior are code-read/documentation-level assessments.


Automated multi-tool review digest (6 Claude lanes + Codex, coordinator-verified); please verify findings before acting on them.

Comment thread gitnexus/hooks/claude/hook-db-lock-probe.cjs
Comment thread gitnexus/hooks/claude/gitnexus-hook.cjs Outdated
- Map guard signal-death (status null + signal, no spawnSync error) to
  fail-closed at both the lsof and ps call sites, closing the freeze
  window (SIGSTOP / laptop sleep > 2s) that previously landed fail-open.
  Rewrite the exit-code comments: coreutils surfaces the -k kill as
  signal death, 124 is budget expiry (live arm), 137 covers only
  exit-code-propagating wrappers or an externally SIGKILLed child.
- Add a debug-gated 'augment skipped: hook slots saturated' stderr line
  on the slot-starved early return in all three adapters, restoring
  observability under GITNEXUS_DEBUG=1.
- GITNEXUS_HOOK_TIMEOUT_PATH now participates in candidate fall-through:
  the env candidate is tried first, then the built-ins, each behind the
  lazy self-test — an existing-but-unusable env path (directory,
  non-executable) can no longer silently disable orphan containment.
- Tests: +6 — guard exit 124 pins the live arm (CJS+Plugin), guard
  signal-death pins the new mapping (CJS+Plugin, red before the fix),
  antigravity behavioral slot-gate, env-dir fall-through still reaps a
  SIGTERM-immune orphan via a built-in guard.

Note: pre-commit typecheck skipped; the 62 tsc errors are pre-existing
on main (none in files touched here).
@Minidoracat

Copy link
Copy Markdown
Contributor Author

Review follow-up — all five P3 findings addressed in 07a7b7e

Thanks for the thorough tri-method review — the refuted-findings section in particular saved us a round trip. Dispositions:

  1. 137 dead-code / freeze-window fail-open — fixed; see inline reply. Both call sites now map guard && status === null && signal → fail-closed, comments rewritten to match coreutils' actual -k semantics.
  2. Slot-starved silence under GITNEXUS_DEBUG=1 — fixed; see inline reply. Debug-gated augment skipped: hook slots saturated in all three adapters.
  3. GITNEXUS_HOOK_TIMEOUT_PATH existing-but-unusable path silently disabling containment — fixed. resolveUnixGuardTimeout() now builds a candidate list (env candidate first, then the built-ins) and lazily self-tests each in order; the first passing candidate wins and is memoized once per process, null only when all candidates fail. The docstring's "a bad env value can never silently disable orphan containment" claim now holds for every shape of bad value. Measured unhappy-path cost: a directory candidate fails its self-test in ~1 ms, /usr/bin/timeout then passes in ~3 ms; each self-test child is itself capped by timeout -k 1 1, so the resolver cannot mint orphans of its own.
  4. Unwrapped augment child (7–12 s) — agreed, and deliberately not touched here: it is pre-existing and fits the bug(hooks): Claude db-lock probe leaks orphaned lsof processes on busy Linux hosts — snowballs to sustained 100% CPU #2163 follow-up list alongside the cmdline-first scan and probe-result cache. We would fold the cursor-integration silent slot-skip (noted in the inline reply) into the same follow-up.
  5. Test gaps — added six tests: guard exit 124 pins the live arm (CJS + Plugin); guard signal-death pins the new F1 mapping (CJS + Plugin; red with the mapping reverted); an antigravity behavioral slot-gate test (with a status === 0 guard against vacuous passes if the staging list ever drifts); and an env-dir fall-through test proving a SIGTERM-immune orphan is still reaped via a built-in guard when the env candidate is unusable. The busybox-shaped guard behavioral test was deliberately skipped: on the supervised path the inner 1 s spawnSync timer always fires before the wrapper's 2 s budget, so a busybox expiry shape is unreachable in-process — the decision-relevant result shapes (124, signal-death) are now pinned deterministically instead.

Local state after the follow-up: hooks suite 178 tests, 176 green on the incident-class host (the 2 reds remain the pre-existing #1493 ENOENT fail-open pair documented under Known limitations — environmental, not introduced here); antigravity e2e 23/23; probe copies byte-identical; windowsHide parity holds.

@magyargergo magyargergo merged commit 10d1e47 into abhigyanpatwari:main Jun 11, 2026
31 of 32 checks passed
@magyargergo

Copy link
Copy Markdown
Collaborator

@Minidoracat Would you be interested in tackling with this?

(pre-existing, out of declared scope) The augment child (7s local / 12s npx) is now the longest-lived unwrapped subprocess; the same SIGKILL-orphan mechanism this PR fixes for lsof/ps applies to it. Fits the #2163 follow-up list.

@Minidoracat

Copy link
Copy Markdown
Contributor Author

Happy to take it! Plan: a focused follow-up PR that wraps the augment child in the same self-tested guard (sharing the resolver from the probe module so the byte-parity convention is preserved), with an orphan-reaping regression test mirroring the lsof one (SIGTERM-immune fake CLI via GITNEXUS_HOOK_CLI_PATH, red on current main). I'll fold in the two small items from this thread while at it: the cursor-integration silent slot-skip and the guard-availability precheck polish for the reaping tests. Will reference #2163.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(hooks): Claude db-lock probe leaks orphaned lsof processes on busy Linux hosts — snowballs to sustained 100% CPU

2 participants