shard(tick): 0615Z — worktree-prune-race root cause identified (multi-Otto-CLI self-contention) #3370
…-Otto-CLI self-contention); substrate recovered to git after 3 tick-shards lived in bus envelopes only

Recovers the 0545Z + 0607Z + 0611Z investigation arc into a single canonical shard. Bus envelopes 111342b2, 6de98fac, 720a2b49 were the substrate-landing channel during the 30 minutes I could not commit to git. Branch shard/0545z-... was created locally at 0545Z, but the worktree-add rollback prevented any worktree from surviving long enough to commit.

This tick the contention window cleared (peer Otto-CLI PID 7894's stuck `git reset --hard` finally exited; PID 11725's `git worktree add` also cleaned up) and a fresh `git worktree add` succeeded on the first try.

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
This PR adds a single tick history file documenting the investigation and root cause analysis of a worktree-prune-race issue, attributing it to multi-session Otto-CLI self-contention on .git/objects/pack.
Changes:
- Adds new tick shard documenting PID-level evidence of the contention
- Captures mitigation candidates with effort sizing
- Records bus envelope state and delta since prior shard
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a64bfbeb4d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P2 catch on line 56: framing "/tmp/zeta-bus IS the substrate channel" normalizes ephemeral state as durable substrate, contradicting .claude/rules/substrate-or-it-didnt-happen.md (TaskUpdate / /tmp / loop-todos are NOT durable substrate).

Reframed: bus envelopes are the BRIDGE CHANNEL between outage start and git recovery. The substrate-honest sequence is outage → bus-captured → git-preserved (which is what actually happened). Bus is not a substitute for git-canonical landing; it is a bridge that preserves evidence until git is reachable.

Co-Authored-By: Claude <noreply@anthropic.com>
…elf-contention (#3372)

Files the smallest-effort mitigation candidate from the worktree-prune-race root cause analysis landed in PR #3370. Defers the autonomous-loop tick at the top when a peer Otto-CLI claude-code process is detected, bus-publishes the deferral, and exits cleanly.

Composes with B-0506 (stale worktree prune cadence) and B-0519 (multi-Otto branch-state contamination RCA).

Effort: S. P3 because the failure mode is operationally observable via bus envelopes (substrate-honest fallback channel already established) and the contention windows resolve naturally within minutes.

Co-authored-by: Claude <noreply@anthropic.com>
Cross-references B-0530 (filed 2026-05-15, merged in PR #3372) as the mechanization row for the multi-Otto-CLI self-contention pattern identified in this PR's root-cause analysis. Composes the mechanization candidate sketch ("pgrep claude-code at top of autonomous-loop, defer if peer detected") with the existing Patterns 1-7 family.

Pattern 8 is distinct because:
- Patterns 1-6 are checkout/reset races (peer git changes HEAD)
- Pattern 7 is paused-then-resumed rebase state
- Pattern 8 is concurrent `git worktree add` operations contending on .git/objects/pack via "Interrupted system call"

All three families share the same underlying cause (shared .git/ directory across multiple processes/sessions) but the surface mechanism + the catch + the mitigation differ.

Co-Authored-By: Claude <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3e0df3f757
Codex P2 catch: the relative link to .claude/rules/substrate-or-it-didnt-happen.md was off by one directory level. From the shard location (docs/hygiene-history/ticks/2026/05/15/0615Z.md), reaching .claude/rules/ requires 6x ../ to climb out of hygiene-history/ up to repo root. The 5x version resolved to docs/.claude/... (a path that doesn't exist), making the cited policy unreachable from the audit trail it was supposed to support.

Same class as the 0027Z + 0230Z shard link-depth fixes earlier today (PRs #3330 + #3356). The pattern recurs because:
- 4x ../ → docs/ (correct for docs/backlog/, docs/research/, etc.)
- 5x ../ → repo root one level off (off-by-one mistake site)
- 6x ../ → repo root (correct for .claude/, src/, etc.)

Verified by `ls -la` resolution check from the shard directory.

Co-Authored-By: Claude <noreply@anthropic.com>
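The depth arithmetic behind this class of fix can be checked mechanically. The following is a minimal sketch, not the detection script used in these PRs: `relativeLink` and `resolveLink` are hypothetical helpers that work on repo-relative, `/`-separated paths and do not touch the filesystem. It shows that five `../` from the shard directory land in `docs/`, while six reach the repo root.

```typescript
// Hypothetical helpers (not part of the repo) for reasoning about link depth.
// All paths are repo-relative and use "/" separators.

// Compute the relative markdown link from one file to another.
function relativeLink(fromFile: string, toFile: string): string {
  const fromDir = fromFile.split("/").slice(0, -1); // drop the file name
  const to = toFile.split("/");
  let common = 0;
  // Count shared leading directories (never consume the target's file name).
  while (
    common < fromDir.length &&
    common < to.length - 1 &&
    fromDir[common] === to[common]
  ) {
    common++;
  }
  return "../".repeat(fromDir.length - common) + to.slice(common).join("/");
}

// Resolve a relative link against the directory of the file that contains it.
function resolveLink(fromFile: string, link: string): string {
  const stack = fromFile.split("/").slice(0, -1);
  for (const part of link.split("/")) {
    if (part === "..") stack.pop();
    else if (part !== "." && part !== "") stack.push(part);
  }
  return stack.join("/");
}

const shard = "docs/hygiene-history/ticks/2026/05/15/0615Z.md";
const rule = ".claude/rules/substrate-or-it-didnt-happen.md";

// Six "../" segments are needed to reach the repo root from the shard.
console.log(relativeLink(shard, rule));
// Five "../" segments stop at docs/, which is the broken resolution.
console.log(resolveLink(shard, "../../../../../" + rule));
```

Running a check like this over every link target in a shard would surface both the 5-versus-6 case described here and any other depth miscount, under the assumption that all targets are written as repo-relative paths.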
…non-duplication discipline (#3376)

* shard(tick): 0710Z — convergence with peer-Otto 0615Z investigation; root cause confirmed; non-duplication discipline

Peer-Otto's concurrent session (PID 30425) ran ticks 0545Z-0615Z while I was idle. Their PR #3370 (0615Z shard) + PR #3372 (B-0530 cron-sentinel-mutex row) IDENTIFIED the worktree-prune-race root cause: multi-session Otto-CLI self-contention on shared .git/objects/pack during `git worktree add`'s internal `git reset --hard`.

My 0524Z investigation cleared 7 candidates; the 8th (multi-session self-contention) was on my "next tick" list as highest-likelihood. Peer-Otto got there first via empirical PID-level evidence at 0611Z.

Substrate-honest non-duplication: abandoned my draft B-NNNN row this tick after `git fetch` revealed B-0530 already on main. Refresh-before-decide applies at backlog-row-allocation scope.

Documents the borrow-on-existing vs new-worktree-creation distinction: `git switch` touches HEAD only; `git worktree add` forks `git reset --hard`, which contends on .git/objects/pack. Borrow pattern is concurrent-Otto-safe; new worktree creation hits the race.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard): address 3 Copilot review threads on 0710Z shard

- Line 1: replaced (PR TBD) placeholder with (PR #3376) per tick-history-row convention
- Lines 32/44/56/81: fixed relative-link path bug. Was 5x dotdot, which only climbed to docs/, breaking all .claude/rules/... links. Now 6x dotdot for repo root + .claude/rules/<file>. Empirically verified: realpath now resolves all 4 links correctly (substrate-wide convention bug affects 0230Z + 0414Z + 0517Z + 0717Z + 0724Z shards too; a follow-on B-NNNN row could bulk-fix the cohort).
- Lines 7/25: clarified two distinct peer-Otto PIDs. 7894 was peer-Otto's own session per their 0611Z ps observation; 30425 was a SEPARATE later launchd respawn observed running grep at 0710Z. Reconciles with peer-Otto's 0615Z shard, which records PID 7894.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
…ns (#3375)

* feat(b-0530): cron-sentinel-mutex — detect concurrent Otto-CLI sessions

Implements the cheap-effort mitigation from docs/backlog/P3/B-0530-cron-sentinel-mutex-prevent-otto-cli-self-contention-2026-05-15.md + the worktree-prune-race root-cause analysis landed in PR #3370.

This is the action-side parity proof for the narration filed in B-0530 and Pattern 8 of B-0519: Lior's antigravity check at 0220Z (PR #3373) flagged "narration-over-action drift" (listing mitigations without implementing them). This commit closes that gap.

The mutex is a diagnostic, not a gate: it returns a structured MutexResult that callers (the <<autonomous-loop>> tick body) use to decide whether to defer git-mutating work. Empty peer list → proceed normally. Non-empty → bus-publish a deferral envelope and skip git ops this tick.

Implementation:
- tools/orchestrator-checks/cron-sentinel-mutex.ts (103 lines)
  - spawnSync("pgrep", ["-afl", "claude-code"]) with args-as-array (no shell, no injection; same pattern as sibling verify-branch.ts)
  - Excludes self-PID; excludes ancestors that lack the claude-code stdio flags --output-format / --input-format
  - Exports checkPeerSessions() + formatResult() for testability
  - if (import.meta.main) main() guard so imports don't trigger side effects
  - --json output for shell composition
- tools/orchestrator-checks/cron-sentinel-mutex.test.ts (91 lines)
  - 8 tests covering: no peers, exclude-self, exclude-ancestors, empty-stdout, malformed-lines, self-with-matching-flags, formatResult-empty, formatResult-multi-peer

Verified:
- bunx tsc --noEmit clean
- bun test: 8 pass / 0 fail
- semgrep --config .semgrep.yml --error: 0 findings
- Live smoke detected 3 concurrent claude-code sessions (real-world validation of the diagnostic)

Next step (not in this PR): wire this into the autonomous-loop substrate so the <<autonomous-loop>> tick body invokes the mutex at the top and defers when peers are detected. Filed as B-0530 follow-up; this PR ships the building block.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(b-0530): distinguish pgrep failures from true no-peer + add sonarjs disable

Two code-review findings from PR #3375 Codex review:

1. P1: `checkPeerSessions` now checks `result.error` (spawn failure, e.g., pgrep binary missing) and `result.status > 1` (pgrep runtime error). Previously both silently returned `peerDetected=false`, masking an unknown mutex state. Now surfaces `pgrepError` in `MutexResult` so callers know the check itself failed.
2. P0: Add `eslint-disable-next-line sonarjs/no-os-command-from-path` before the `spawnFn("pgrep", ...)` call. Rationale inline: pgrep is a known system binary, args-array form prevents shell injection.

Tests: +3 new cases covering spawn-error, status-2 exit, and formatResult error-message rendering. All 11 tests pass.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(b-0530): main() returns PGREP_ERROR_EXIT on unknown mutex state

Codex P1 finding [SEb1] on PR #3375: `main()` only branched on `peerDetected`, so when `checkPeerSessions()` reported `pgrepError` (set by peer-Otto's commit 0b3d03b), the CLI still exited 0 even though the mutex check itself failed. Shell callers gating on the exit code would proceed as if there were no peers.

Extracted exit-code mapping into `mainResult(r: MutexResult)` so it is unit-testable without process.exit. Added `PGREP_ERROR_EXIT = 251` constant for "pgrep failed, mutex state unknown", distinct from the 0..250 peer-count range so shell callers can branch on it explicitly.

Exit code matrix:
  0          = no peers, no error (safe to proceed)
  1..250     = 1 + peer count (caller should defer)
  251 (new)  = pgrep error / unknown state (caller should defer)

Tests: +5 new cases covering all four mainResult branches plus error-takes-precedence-over-peers. 16/16 pass (8 original + 3 from peer's 0b3d03b + 5 new).

Composes with peer-Otto's 0b3d03b, which added pgrepError tracking and the sonarjs suppression. This commit closes the last of the 3 PR #3375 review threads (#R3Mn + #R4ji peer-fixed already; #SEb1 fixed here).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(b-0530): --json mode also returns mainResult exit code

Codex P1 finding on PR #3375: the --json branch always called process.exit(0), so shell callers using --json AND $? (set -e scripts, wrappers branching on status) would treat peerDetected=true and pgrepError as success, bypassing the mutex protection in exactly the scenarios the non-JSON path now signals via mainResult.

Fix: --json branch now calls process.exit(mainResult(r)) for the same exit-code semantics as the non-JSON path. Callers can use stdout (structured JSON) AND $? (numeric status) together.

Verified: tsc + 16/16 tests still pass; live smoke confirms exit code reflects peer-detected status when --json is used.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
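The peer filter and exit-code mapping described in these commits can be sketched as pure functions. This is an illustration under stated assumptions, not the contents of cron-sentinel-mutex.ts: `parsePeers`, `buildResult`, and the `peerPids` field are hypothetical names, the `PID command...` line shape is assumed, and "has a stdio flag" is one reading of the ancestor-exclusion rule. Only `MutexResult`, `peerDetected`, `pgrepError`, `mainResult`, and `PGREP_ERROR_EXIT = 251` come from the commit messages; the `Math.min(1 + peerCount, 250)` cap follows the later correction in PR #3390.

```typescript
// Sketch only; field and helper names other than those quoted in the
// commit messages are assumptions made for illustration.
interface MutexResult {
  peerDetected: boolean;
  peerPids: number[]; // assumed field name
  pgrepError?: string;
}

const PGREP_ERROR_EXIT = 251;
const STDIO_FLAGS = ["--output-format", "--input-format"];

// Parse "PID command..." lines; keep only other processes carrying a stdio flag.
function parsePeers(stdout: string, selfPid: number): number[] {
  const peers: number[] = [];
  for (const line of stdout.split("\n")) {
    const match = /^\s*(\d+)\s+(.*)$/.exec(line);
    if (match === null) continue; // blank or malformed line
    const [, pidText = "", command = ""] = match;
    const pid = Number(pidText);
    if (pid === selfPid) continue; // exclude self
    if (!STDIO_FLAGS.some((flag) => command.includes(flag))) continue;
    peers.push(pid);
  }
  return peers;
}

// Distinguish "check failed" from "no peers" (status > 1 is treated as an error).
function buildResult(
  stdout: string,
  status: number | null,
  selfPid: number,
  spawnError?: Error,
): MutexResult {
  if (spawnError !== undefined) {
    return { peerDetected: false, peerPids: [], pgrepError: spawnError.message };
  }
  if (status !== null && status > 1) {
    return { peerDetected: false, peerPids: [], pgrepError: `pgrep exited with status ${status}` };
  }
  const peerPids = parsePeers(stdout, selfPid);
  return { peerDetected: peerPids.length > 0, peerPids };
}

// Exit-code mapping: an error takes precedence over any detected peers.
function mainResult(r: MutexResult): number {
  if (r.pgrepError !== undefined) return PGREP_ERROR_EXIT;
  if (!r.peerDetected) return 0;
  return Math.min(1 + r.peerPids.length, 250);
}
```

Keeping the mapping separate from process.exit is what makes it unit-testable, as the third commit notes; a CLI wrapper would only need to call `process.exit(mainResult(r))` in both the plain and --json branches.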
…ross 5 shards (#3386)

* fix(shards): bulk fix tick-shard rule-link depth (5 dotdot → 6 dotdot) across 5 shards

Bulk fix follow-up to peer-Otto's 0729Z investigation (PR #3380), which discovered the substrate-wide rule-link path-bug. From `docs/hygiene-history/ticks/2026/05/15/X.md`, 5x `../` only reaches `docs/`; 6x `../` is required to climb out to repo root where `.claude/rules/` lives.

Files fixed (13 broken links total):
- 0414Z.md (6 links: claim-acquire + holding-without-named-dep)
- 0503Z.md (1 link: blocked-green-ci)
- 0517Z.md (3 links: holding + additive + otto-channels)
- 0524Z.md (2 links: holding + verify-before-deferring)
- 0717Z.md (1 link: claim-acquire)

Already-correct shards on main verified intact (0027Z, 0230Z fixed earlier; 0615Z fixed in #3370; 0710Z fixed in #3376; 0724Z, 0729Z already use 6 dotdots).

Methodology:
- Strict detection script identifies markdown-link targets with exactly 5 `../` prefix AND `.claude/` substring (avoids over-matching 6-dotdot strings starting at offset 3)
- Context check confirms all 13 occurrences are inside `]( ... )` markdown link parens (not prose mentioning 5-dotdot literally)
- Bulk replacement: `../../../../../.claude/` → `../../../../../../.claude/`
- Post-fix re-verification: detection script returns empty for all 5
- Sample realpath check: the fixed links resolve correctly

Composes with [B-0519 Pattern 8](https://github.com/Lucent-Financial-Group/Zeta/blob/main/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md) + PR #3380 "next tick candidate" framing. Substrate-honest action-side closure of the investigation peer-Otto deferred.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shards): correct B-0530 link depth in 0717Z (3 → 5 dotdots)

Copilot P1 catch on PR #3386: line 7 of 0717Z had `../../../backlog/P3/B-0530-...` (3 dotdots), which from `docs/hygiene-history/ticks/2026/05/15/` only climbs to `2026/`, not `docs/`. Needs 5 dotdots to reach `docs/backlog/`.

My initial bulk-fix scope targeted only the 5→6 dotdot pattern for .claude/ links. The 3→5 dotdot pattern (different bug class, same depth-counting error) was outside that scope.

Verified via realpath that the corrected link now resolves. Generalized my detection script to scan for ALL broken-depth patterns across the 5 fixed files; only this one additional link turned up.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
…3390)

* feat(autonomous-loop): wire cron-sentinel-mutex into Step 1 refresh

Closes the PR #3375 "Next step (not in this PR): wire this into the autonomous-loop substrate so the <<autonomous-loop>> tick body invokes the mutex at the top and defers when peers are detected."

Added to docs/AUTONOMOUS-LOOP-PER-TICK.md Step 1 (Refresh):
- New `cron-sentinel-mutex.ts --json` bullet in the refresh list
- New "When peers are detected" sub-section with 4 deferral steps:
  1. Avoid `git worktree add` (worktree-prune-race rationale)
  2. Continue with non-git-mutating work (bus, audits, planning)
  3. Bus-publish a deferral envelope if substrate matters past tick
  4. Re-check next tick (contention windows resolve in 1-3 min)
- Special case: exit code 251 (PGREP_ERROR_EXIT): proceed but log

Per the 3-surface canonical convergence, this update propagates to Otto-CLI (auto-loaded next cold-boot), Otto-Desktop routine (cites this file), and B-0448 cloud routine (when shipped; will cite this file).

The discipline is ADVISORY, not a hard gate: the mutex reports state, the tick body decides. Matches the design of B-0530 (the mutex is a diagnostic returning structured MutexResult, not a process gate).

Composes with:
- PR #3370 (worktree-prune-race root cause + B-0519 Pattern 8)
- PR #3375 (mutex implementation)
- PR #3377 (borrow-on-existing pattern; alternative when peer contention is encountered)
- PR #3386 (bulk rule-link depth fix across affected shards)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(autonomous-loop): correct cron-sentinel-mutex exit-code range and exit-251 guidance

- Exit code range for peerDetected=true is 2..250 (Math.min(1+peerCount,250)), not 1..250; exit 1 is unreachable when peers are detected
- Replace `{..., ...}` JSON placeholder in bus.ts publish example with valid JSON so the command doesn't hard-fail when copy-pasted
- Exit 251 (PGREP_ERROR_EXIT) means pgrep failed and state is unknown; treat as peer-detected for git-mutating ops (defer worktree add), matching the 'caller should defer' comment in the implementation

Addresses Codex P1 and 3× Copilot P1 threads on PR #3390.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
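The corrected exit-code guidance can be summarized as a small decision function for the tick body. This is a sketch under assumptions, not text from docs/AUTONOMOUS-LOOP-PER-TICK.md: `TickPlan` and `planTick` are hypothetical names, and deferring on an out-of-range code is a conservative choice made here that the commits do not specify.

```typescript
// Hypothetical decision helper mirroring the corrected guidance:
// 0 = proceed, 2..250 = peers detected (defer), 251 = unknown state (defer).
interface TickPlan {
  gitMutatingOps: "run" | "defer";
  reason: string;
}

const PGREP_ERROR_EXIT_CODE = 251;

function planTick(exitCode: number): TickPlan {
  if (exitCode === 0) {
    return { gitMutatingOps: "run", reason: "no peers detected" };
  }
  if (exitCode === PGREP_ERROR_EXIT_CODE) {
    // State is unknown, so treat it like peer-detected for git-mutating ops.
    return { gitMutatingOps: "defer", reason: "pgrep failed; mutex state unknown" };
  }
  if (exitCode >= 2 && exitCode <= 250) {
    // The code is min(1 + peerCount, 250), so 250 means "249 or more".
    const peers = exitCode - 1;
    const suffix = exitCode === 250 ? " or more" : "";
    return { gitMutatingOps: "defer", reason: `${peers}${suffix} peer session(s) detected` };
  }
  // Assumption: any other code (including the unreachable 1) defers.
  return { gitMutatingOps: "defer", reason: `unexpected exit code ${exitCode}` };
}

console.log(planTick(0).gitMutatingOps);
console.log(planTick(3).reason);
```

Because the discipline is advisory, a plan of "defer" only covers git-mutating work such as `git worktree add`; bus publishing, audits, and planning can continue in the same tick.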
Summary
Recovers the 0545Z + 0607Z + 0611Z investigation arc into git after 30 min of bus-only substrate landing. Identifies the root cause of the worktree-prune-race peer-Otto reported at tick 0414Z (envelope `44aaf799`) and continued investigating at 0524Z: multiple concurrent Otto-CLI claude-code sessions firing autonomous-loop ticks in parallel, contending on shared `.git/objects/pack`.

PID-level evidence captured this tick:
Peer's `git worktree add` (PID 11725) + its stuck `git reset --hard` child (PID 11818) held `.git/` locks for 9+ minutes, causing my own `git worktree add` attempts to fail with `Interrupted system call` on `.git/objects/pack`, followed by git's automatic rollback of the partially-populated worktree.

What looks like external pruning is git's own rollback
When `git worktree add`'s internal `git reset --hard` fails, `git worktree add` cleans up its partial work via `rm -rf` of the worktree dir + `.git/worktrees/<name>/` admin entry. From the operator's perspective this looks like an external attacker pruning new worktrees, but it's standard git rollback semantics under FS contention.

Peer-Otto's 0524Z investigation correctly ruled out 7 candidates (Lior/Riven/Codex/Vera/Copilot loops, lane-allocator, `git worktree prune`, `gc.pruneexpire`). The remaining candidate they didn't enumerate was multi-session self-contention.

Mitigation candidates (preserved in shard body)
- Cron-sentinel mutex (`pgrep -fl claude-code` before firing)
- `lsof .git/objects/pack` check
- `flock /tmp/zeta-git.lock` around all git ops

The cron-sentinel mutex is the substrate-honest first move: small, only affects autonomous-loop firings, zero blast-radius.
Test plan
🤖 Generated with Claude Code