feat(b-0530): cron-sentinel-mutex — detect concurrent Otto-CLI sessions#3375
Conversation
Implements the cheap-effort mitigation from docs/backlog/P3/B-0530-cron-sentinel-mutex-prevent-otto-cli-self-contention-2026-05-15.md + the worktree-prune-race root-cause analysis landed in PR #3370. This is the action-side parity proof for the narration filed in B-0530 and Pattern 8 of B-0519: Lior's antigravity check at 0220Z (PR #3373) flagged "narration-over-action drift" — listing mitigations without implementing them. This commit closes that gap. The mutex is a diagnostic, not a gate: it returns a structured MutexResult that callers (the <<autonomous-loop>> tick body) use to decide whether to defer git-mutating work. Empty peer list → proceed normally. Non-empty → bus-publish a deferral envelope and skip git ops this tick. Implementation: - tools/orchestrator-checks/cron-sentinel-mutex.ts (103 lines) - spawnSync("pgrep", ["-afl", "claude-code"]) with args-as-array (no shell, no injection — same pattern as sibling verify-branch.ts) - Excludes self-PID; excludes ancestors that lack the claude-code stdio flags --output-format / --input-format - Exports checkPeerSessions() + formatResult() for testability - if (import.meta.main) main() guard so imports don't trigger side effects - --json output for shell composition - tools/orchestrator-checks/cron-sentinel-mutex.test.ts (91 lines) - 8 tests covering: no peers, exclude-self, exclude-ancestors, empty-stdout, malformed-lines, self-with-matching-flags, formatResult-empty, formatResult-multi-peer Verified: - bunx tsc --noEmit clean - bun test: 8 pass / 0 fail - semgrep --config .semgrep.yml --error: 0 findings - Live smoke detected 3 concurrent claude-code sessions (real-world validation of the diagnostic) Next step (not in this PR): wire this into the autonomous-loop substrate so the <<autonomous-loop>> tick body invokes the mutex at the top and defers when peers are detected. Filed as B-0530 follow-up; this PR ships the building block. Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7bcd1a5e41
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Adds a new Bun/TypeScript diagnostic under tools/orchestrator-checks/ to detect concurrent claude-code (Otto-CLI) sessions by scanning the process list, with unit tests intended to cover parsing/filtering/formatting cases. This supports the broader goal of mitigating multi-session self-contention that can lead to git worktree/pack lock races.
Changes:
- Introduce
cron-sentinel-mutex.tsto detect peerclaude-codeprocesses (PID + cmdline capture) and return a structuredMutexResult. - Add
--jsonoutput mode and diagnostic exit-code behavior for shell composition. - Add
bun:testunit tests covering common parsing and filtering scenarios plus output formatting.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tools/orchestrator-checks/cron-sentinel-mutex.ts | Implements the mutex diagnostic via spawnSync("pgrep", ...), filters peers, formats output, and provides a CLI entrypoint. |
| tools/orchestrator-checks/cron-sentinel-mutex.test.ts | Adds unit tests for peer detection rules and formatting. |
Comments suppressed due to low confidence (3)
tools/orchestrator-checks/cron-sentinel-mutex.ts:41
- P0:
checkPeerSessionsignoresspawnSyncfailures (e.g.,pgrepmissing/permission errors). In those casesstdoutwill be empty and the function returnspeerDetected=false, silently failing open and defeating the mutex. Consider surfacing an error in the result and/or treating spawn errors as a detected-peer/blocked condition so contention doesn’t recur unnoticed.
const result = spawnFn("pgrep", ["-afl", PROCESS_NAME_PATTERN], {
encoding: "utf8",
});
const stdout = typeof result.stdout === "string" ? result.stdout : "";
const lines = stdout.split("\n").filter((line) => line.trim().length > 0);
tools/orchestrator-checks/cron-sentinel-mutex.ts:58
- P1: The ancestor-filtering comment says to "require the ... flags" (plural), but the condition only rejects processes that have neither flag. That means a process with only
--output-format(or only--input-format) will be treated as a peer, which may reintroduce false positives. If the intent is to require both stdio-mode flags, change the predicate accordingly (and align the comment).
// Skip ancestors / disclaimer-helper / finder-service matches:
// require the claude-code stdio-mode flags to mark a real peer.
const cmdline = match[2] ?? "";
if (!cmdline.includes("--output-format") && !cmdline.includes("--input-format")) {
continue;
}
tools/orchestrator-checks/cron-sentinel-mutex.ts:101
- P1:
--jsonmode always exits 0 (process.exit(0)), which contradicts the documented diagnostic exit-code scheme and makes it hard to use--jsonin shell composition (you have to parse JSON to know whether peers exist). Consider emitting JSON but preserving the same exit code asmain()(or document that--jsonintentionally forces success).
if (import.meta.main) {
const args = process.argv.slice(2);
if (args.includes("--json")) {
const r = checkPeerSessions();
console.log(JSON.stringify(r, null, 2));
process.exit(0);
}
…js disable Two code-review findings from PR #3375 Codex review: 1. P1: `checkPeerSessions` now checks `result.error` (spawn failure, e.g., pgrep binary missing) and `result.status > 1` (pgrep runtime error). Previously both silently returned `peerDetected=false`, masking an unknown mutex state. Now surfaces `pgrepError` in `MutexResult` so callers know the check itself failed. 2. P0: Add `eslint-disable-next-line sonarjs/no-os-command-from-path` before the `spawnFn("pgrep", ...)` call. Rationale inline: pgrep is a known system binary, args-array form prevents shell injection. Tests: +3 new cases covering spawn-error, status-2 exit, and formatResult error-message rendering. All 11 tests pass. Co-Authored-By: Claude <noreply@anthropic.com>
|
Fixed in the latest commit (0b3d03b): Thread 1 (P1 — distinguish pgrep failures): Added checks for Thread 2 (P0 — sonarjs/no-os-command-from-path): Added |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b3d03b1af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P1 finding [SEb1] on PR #3375: `main()` only branched on `peerDetected`, so when `checkPeerSessions()` reported `pgrepError` (set by peer-Otto's commit 0b3d03b), the CLI still exited 0 even though the mutex check itself failed. Shell callers gating on the exit code would proceed as if there were no peers. Extracted exit-code mapping into `mainResult(r: MutexResult)` so it is unit-testable without process.exit. Added `PGREP_ERROR_EXIT = 251` constant for "pgrep failed, mutex state unknown" — distinct from the 0..250 peer-count range so shell callers can branch on it explicitly. Exit code matrix: 0 = no peers, no error (safe to proceed) 1..250 = 1 + peer count (caller should defer) 251 (new) = pgrep error / unknown state (caller should defer) Tests: +5 new cases covering all four mainResult branches plus error-takes-precedence-over-peers. 16/16 pass (8 original + 3 from peer's 0b3d03b + 5 new). Composes with peer-Otto's 0b3d03b which added pgrepError tracking and the sonarjs suppression. This commit closes the last of the 3 PR #3375 review threads (#R3Mn + #R4ji peer-fixed already; #SEb1 fixed here). Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b746bf7c4b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P1 finding on PR #3375: the --json branch always called process.exit(0), so shell callers using --json AND $? (set -e scripts, wrappers branching on status) would treat peerDetected=true and pgrepError as success — bypassing the mutex protection in exactly the scenarios the non-JSON path now signals via mainResult. Fix: --json branch now calls process.exit(mainResult(r)) for the same exit-code semantics as the non-JSON path. Callers can use stdout (structured JSON) AND $? (numeric status) together. Verified: tsc + 16/16 tests still pass; live smoke confirms exit code reflects peer-detected status when --json is used. Co-Authored-By: Claude <noreply@anthropic.com>
…3390) * feat(autonomous-loop): wire cron-sentinel-mutex into Step 1 refresh Closes the PR #3375 "Next step (not in this PR): wire this into the autonomous-loop substrate so the <<autonomous-loop>> tick body invokes the mutex at the top and defers when peers are detected." Added to docs/AUTONOMOUS-LOOP-PER-TICK.md Step 1 (Refresh): - New `cron-sentinel-mutex.ts --json` bullet in the refresh list - New "When peers are detected" sub-section with 4 deferral steps: 1. Avoid `git worktree add` (worktree-prune-race rationale) 2. Continue with non-git-mutating work (bus, audits, planning) 3. Bus-publish a deferral envelope if substrate matters past tick 4. Re-check next tick (contention windows resolve in 1-3 min) - Special case: exit code 251 (PGREP_ERROR_EXIT) — proceed but log Per the 3-surface canonical convergence, this update propagates to Otto-CLI (auto-loaded next cold-boot), Otto-Desktop routine (cites this file), and B-0448 cloud routine (when shipped — will cite this file). The discipline is ADVISORY, not a hard gate: the mutex reports state, the tick body decides. Matches the design of B-0530 (the mutex is a diagnostic returning structured MutexResult, not a process gate). Composes with: - PR #3370 (worktree-prune-race root cause + B-0519 Pattern 8) - PR #3375 (mutex implementation) - PR #3377 (borrow-on-existing pattern — alternative when peer contention is encountered) - PR #3386 (bulk rule-link depth fix across affected shards) Co-Authored-By: Claude <noreply@anthropic.com> * fix(autonomous-loop): correct cron-sentinel-mutex exit-code range and exit-251 guidance - Exit code range for peerDetected=true is 2..250 (Math.min(1+peerCount,250)), not 1..250; exit 1 is unreachable when peers are detected - Replace `{..., ...}` JSON placeholder in bus.ts publish example with valid JSON so the command doesn't hard-fail when copy-pasted - Exit 251 (PGREP_ERROR_EXIT) means pgrep failed and state is unknown; treat as peer-detected for git-mutating ops (defer worktree add), matching the 'caller should defer' comment in the implementation Addresses Codex P1 and 3× Copilot P1 threads on PR #3390. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
…3423) Implements all 6 acceptance categories from docs/backlog/P3/B-0528-shadow-launchd-installer-unit-tests-2026-05-15.md (the deferred test-coverage P1 from PR #3375's review): 1. Placeholder substitution (3 tests): replace, multi-replace, unrecognized {{NAME}} exits 1 2. XML escaping (4 tests): five predefined entities, safe chars untouched, & escaped first to avoid double-escape, integration with substitutePlaceholders producing plutil-valid plist for paths containing & < > 3. Argument validation (6 tests, via subprocess): unknown flag rejected, relative --repo-root rejected, relative --bun-path rejected, --repo-root with no value rejected, --repo-root followed by a flag rejected, requireAbsolute happy-path 4. Dry-run output (1 test, via subprocess): renders to stdout matching the repo template, plutil-valid output 5. Default detection (3 tests): tryDetect for which-bun works, nonexistent-binary returns undefined (no throw), git outside checkout returns undefined 6. Availability-preserving install pattern (2 tests for plutilLint safe path, with substrate-honest comment explaining why the atomic-rename failure branches aren't reachable without main() refactoring; deferred for follow-up if reviewers flag it) Bun's test runner; subprocess invocations use spawnSync with args-as-array (no shell, no injection — same convention as the implementation under test). Total: 19 tests / 39 expect calls / all passing. Closes the P1 finding Copilot flagged on PR #3375 line 185 about missing test coverage for safety-critical XML escape / placeholder refusal / argument validation / dry-run output. Co-authored-by: Claude <noreply@anthropic.com>
…ate drift catch) Second drift-catch this session (paired with PR #3733 closing B-0506). Same pattern: B-0530 (cron-sentinel mutex) mechanized 2026-05-15 by PR #3375 — `tools/orchestrator-checks/cron-sentinel-mutex.ts` implements all 6 acceptance criteria — but the row's `status` stayed `open`. Live verification this tick: the tool correctly reports `peerDetected: true` for the two Claude Desktop processes (PIDs 2706 + 2710); the peer-handling clause in `docs/AUTONOMOUS-LOOP-PER-TICK.md` §1 routed 3 consecutive ticks around `git worktree add` without contention failures. Substrate-honest minimal commit: this tick was repeatedly losing working-tree edits to Lior-gemini's global-lock-cleanup pass (step 8 of his loop prompt) running concurrently. Adopting the canary-rule mitigation: atomic Edit-then-chain stage+commit in a single Bash invocation. Resolution-section content + BACKLOG.md regen deferred to a follow-up tick or a follow-up commit on this branch when Lior's cleanup window allows it. Co-Authored-By: Claude <noreply@anthropic.com>
… drift catch) (#3737) * chore(b-0530): close row — mechanization shipped via PR #3375 (substrate drift catch) Second drift-catch this session (paired with PR #3733 closing B-0506). Same pattern: B-0530 (cron-sentinel mutex) mechanized 2026-05-15 by PR #3375 — `tools/orchestrator-checks/cron-sentinel-mutex.ts` implements all 6 acceptance criteria — but the row's `status` stayed `open`. Live verification this tick: the tool correctly reports `peerDetected: true` for the two Claude Desktop processes (PIDs 2706 + 2710); the peer-handling clause in `docs/AUTONOMOUS-LOOP-PER-TICK.md` §1 routed 3 consecutive ticks around `git worktree add` without contention failures. Substrate-honest minimal commit: this tick was repeatedly losing working-tree edits to Lior-gemini's global-lock-cleanup pass (step 8 of his loop prompt) running concurrently. Adopting the canary-rule mitigation: atomic Edit-then-chain stage+commit in a single Bash invocation. Resolution-section content + BACKLOG.md regen deferred to a follow-up tick or a follow-up commit on this branch when Lior's cleanup window allows it. Co-Authored-By: Claude <noreply@anthropic.com> * chore(b-0530): add Resolution section + BACKLOG.md regen Follow-up to bfe5601 (status: open → closed). Adds the Resolution section mapping all 6 acceptance criteria to shipped artifacts plus live-verification evidence from this tick. Regenerates docs/BACKLOG.md via `BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts` — single-line `[ ]` → `[x]` toggle for B-0530. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
Summary
Implements the cheap-effort mitigation from B-0530 + the worktree-prune-race root-cause analysis landed in PR #3370. Closes the "narration-over-action drift" Lior flagged in PR #3373 by shipping the action-side parity proof for the documentation already on main.
What lands
tools/orchestrator-checks/cron-sentinel-mutex.ts(103 lines) — diagnostic that detects concurrent Otto-CLI claude-code sessions viaspawnSync("pgrep", ["-afl", "claude-code"])(same args-as-array pattern as siblingverify-branch.ts, no shell, no injection).tools/orchestrator-checks/cron-sentinel-mutex.test.ts(91 lines) — 8 tests covering empty/self-exclusion/ancestor-exclusion/malformed-input/format paths.Design choices
MutexResult; callers (the<<autonomous-loop>>tick body) decide whether to defer. Exit-0 in the no-peers case, exit-(1+peer-count) clamped to 250 for shell composition. JSON output via--json.--output-format/--input-format); avoids flagging the Claude.app helper / disclaimer wrappers as peers.export-ed helpers +if (import.meta.main) main()guard so tests can import without side effects.Test plan
bunx tsc --noEmitcleanbun test: 8 pass / 0 fail (15 expect calls)semgrep --config .semgrep.yml --error: 0 findingsNext step (not in this PR)
Wire the mutex into the
<<autonomous-loop>>substrate so the tick body invokes it at the top and bus-publishes a deferral envelope when peers are detected. That's a follow-up commit; this PR ships the building block.🤖 Generated with Claude Code