
shard(tick): 0615Z — worktree-prune-race root cause identified (multi-Otto-CLI self-contention)#3370

Merged: AceHack merged 4 commits into main from shard/0545z-pr3342-merged-peer-investigation-otto-cli-2026-05-15 on May 15, 2026


AceHack (Member) commented May 15, 2026

Summary

Recovers the 0545Z + 0607Z + 0611Z investigation arc into git after 30 min of bus-only substrate landing. Identifies the root cause of the worktree-prune-race that peer-Otto reported at tick 0414Z (envelope 44aaf799) and was still investigating at 0524Z: multiple concurrent Otto-CLI claude-code sessions fire autonomous-loop ticks in parallel and contend on the shared .git/objects/pack.

PID-level evidence captured this tick:

  • My session PID 68752 (~3:53h old)
  • Peer Otto-CLI session PID 7894 (~5:40 old)
  • Both share Claude.app parent PID 702

Peer's git worktree add (PID 11725) and its stuck git reset --hard child (PID 11818) held .git/ locks for 9+ minutes. During that window my own git worktree add attempts failed with Interrupted system call on .git/objects/pack, after which git automatically rolled back the partially populated worktree.

What looks like external pruning is git's own rollback

When git worktree add's internal git reset --hard fails, git worktree add cleans up its partial work via rm -rf of the worktree dir + .git/worktrees/<name>/ admin entry. From the operator's perspective this looks like an external attacker pruning new worktrees, but it's standard git rollback semantics under FS contention.

Peer-Otto's 0524Z investigation correctly ruled out 7 candidates (Lior/Riven/Codex/Vera/Copilot loops, lane-allocator, git worktree prune, gc.pruneexpire). The remaining candidate they didn't enumerate was multi-session self-contention.

Mitigation candidates (preserved in shard body)

| Candidate | Effort |
| --- | --- |
| Cron-sentinel mutex (`pgrep -fl claude-code` before firing) | S |
| Pre-worktree-add `lsof .git/objects/pack` check | S |
| `flock /tmp/zeta-git.lock` around all git ops | M |
| Per-session bare clone | L |

The cron-sentinel mutex is the substrate-honest first move — small, only affects autonomous-loop firings, zero blast-radius.
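
As a rough illustration of the "lock around all git ops" candidate, here is a minimal sketch in TypeScript. It does not use `flock`; it swaps in exclusive file creation (`openSync` with the `"wx"` flag) as an advisory lock, which is a different mechanism with the same intent. The names `tryAcquireLock`, `withGitLock`, and `demoLockPath` are hypothetical and do not appear in this PR.

```typescript
import { closeSync, openSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: takes an advisory lock by creating the lock file
// exclusively. Returns a release function on success, or null if the lock
// file already exists (another session holds it).
export function tryAcquireLock(lockPath: string): (() => void) | null {
  let fd: number;
  try {
    // "wx" fails with EEXIST when the file already exists, so only one
    // process can win the creation race.
    fd = openSync(lockPath, "wx");
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return null;
    throw err;
  }
  writeSync(fd, String(process.pid)); // record the holder for debugging
  closeSync(fd);
  return () => unlinkSync(lockPath);
}

// Hypothetical wrapper: runs a git-mutating action only while holding the
// lock, and reports "deferred" instead of blocking when a peer holds it.
export function withGitLock<T>(lockPath: string, action: () => T): T | "deferred" {
  const release = tryAcquireLock(lockPath);
  if (release === null) return "deferred";
  try {
    return action();
  } finally {
    release();
  }
}

// Illustrative per-process path for experiments; the table above names
// /tmp/zeta-git.lock as the shared lock.
export const demoLockPath = join(tmpdir(), `zeta-git-demo-${process.pid}.lock`);
```

One known weakness of this variant: a session that crashes while holding the lock leaves the file behind, so a real deployment would need stale-lock handling, which `flock` provides for free because the kernel drops the lock when the process exits.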

Test plan

  • Local markdownlint clean
  • CI green
  • Auto-merge fires

🤖 Generated with Claude Code

…-Otto-CLI self-contention); substrate recovered to git after 3 tick-shards lived in bus envelopes only

Recovers the 0545Z + 0607Z + 0611Z investigation arc into a single
canonical shard. Bus envelopes 111342b2, 6de98fac, 720a2b49 were the
substrate-landing channel during the 30 minutes I could not commit
to git. Branch shard/0545z-... was created locally at 0545Z but the
worktree-add rollback prevented any worktree from surviving long
enough to commit.

This tick the contention window cleared (peer Otto-CLI PID 7894's
stuck git reset --hard finally exited; PID 11725's git worktree add
also cleaned up) and a fresh `git worktree add` succeeded on the
first try.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 06:19
@AceHack AceHack enabled auto-merge (squash) May 15, 2026 06:19

Copilot AI left a comment


Pull request overview

This PR adds a single tick history file documenting the investigation and root cause analysis of a worktree-prune-race issue, attributing it to multi-session Otto-CLI self-contention on .git/objects/pack.

Changes:

  • Adds new tick shard documenting PID-level evidence of the contention
  • Captures mitigation candidates with effort sizing
  • Records bus envelope state and delta since prior shard


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a64bfbeb4d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/hygiene-history/ticks/2026/05/15/0615Z.md Outdated
Codex P2 catch on line 56: framing "/tmp/zeta-bus IS the substrate
channel" normalizes ephemeral state as durable substrate, contradicting
.claude/rules/substrate-or-it-didnt-happen.md (TaskUpdate / /tmp /
loop-todos are NOT durable substrate).

Reframed: bus envelopes are the BRIDGE CHANNEL between outage start
and git recovery. The substrate-honest sequence is outage → bus-
captured → git-preserved (which is what actually happened). Bus is
not a substitute for git-canonical landing; it is a bridge that
preserves evidence until git is reachable.

Co-Authored-By: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 15, 2026
…elf-contention (#3372)

Files the smallest-effort mitigation candidate from the
worktree-prune-race root cause analysis landed in PR #3370. Defers
the autonomous-loop tick at the top when a peer Otto-CLI claude-code
process is detected, bus-publishes the deferral, and exits cleanly.

Composes with B-0506 (stale worktree prune cadence) and B-0519
(multi-Otto branch-state contamination RCA). Effort: S. P3 because
the failure mode is operationally observable via bus envelopes —
substrate-honest fallback channel already established — and the
contention windows resolve naturally within minutes.

Co-authored-by: Claude <noreply@anthropic.com>
Cross-references B-0530 (filed 2026-05-15, merged in PR #3372) as
the mechanization row for the multi-Otto-CLI self-contention pattern
identified in this PR's root-cause analysis. Composes the
mechanization candidate sketch ("pgrep claude-code at top of
autonomous-loop, defer if peer detected") with the existing
Patterns 1-7 family.

Pattern 8 is distinct because:
- Patterns 1-6 are checkout/reset races (peer git changes HEAD)
- Pattern 7 is paused-then-resumed rebase state
- Pattern 8 is concurrent git worktree add operations contending
  on .git/objects/pack via "Interrupted system call"

All three families share the same underlying cause (shared .git/
directory across multiple processes/sessions) but the surface
mechanism + the catch + the mitigation differ.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 06:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e0df3f757


Comment thread docs/hygiene-history/ticks/2026/05/15/0615Z.md Outdated
Codex P2 catch: the relative link to .claude/rules/substrate-or-it-
didnt-happen.md was off by one directory level. From the shard
location (docs/hygiene-history/ticks/2026/05/15/0615Z.md), reaching
.claude/rules/ requires 6x ../ to climb out of hygiene-history/ up
to repo root. The 5x version resolved to docs/.claude/... (a path
that doesn't exist), making the cited policy unreachable from the
audit trail it was supposed to support.

Same class as the 0027Z + 0230Z shard link-depth fixes earlier today
(PRs #3330 + #3356). The pattern recurs because:
  - 4x ../ → docs/hygiene-history/ (one level short of docs/)
  - 5x ../ → docs/ (correct for docs/backlog/, docs/research/, etc.; the off-by-one mistake site for .claude/ links)
  - 6x ../ → repo root (correct for .claude/, src/, etc.)

Verified by `ls -la` resolution check from the shard directory.

Co-Authored-By: Claude <noreply@anthropic.com>
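
The depth arithmetic in the commit message above can be checked mechanically instead of by counting. The following is a minimal sketch using Node's `path.posix`; the helper names `relativeLink` and `climbDepth` are hypothetical, and `/repo` is just a placeholder root so the result does not depend on the working directory.

```typescript
import { posix } from "node:path";

// Computes the relative link from a file to a target, both given as
// repo-root-relative POSIX paths.
export function relativeLink(fromFile: string, target: string): string {
  // Anchor both paths at a placeholder absolute root ("/repo").
  const fromDir = posix.dirname(posix.join("/repo", fromFile));
  return posix.relative(fromDir, posix.join("/repo", target));
}

// Counts the leading "../" segments of a relative link.
export function climbDepth(link: string): number {
  let depth = 0;
  while (link.startsWith("../", depth * 3)) depth += 1;
  return depth;
}

const shard = "docs/hygiene-history/ticks/2026/05/15/0615Z.md";
const ruleLink = relativeLink(shard, ".claude/rules/substrate-or-it-didnt-happen.md");
console.log(climbDepth(ruleLink), ruleLink);
```

Generating links this way, rather than typing the `../` run by hand, would sidestep the recurring off-by-one at the cost of a small helper in the shard tooling.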
@AceHack AceHack merged commit 5cc63ff into main May 15, 2026
21 of 22 checks passed
@AceHack AceHack deleted the shard/0545z-pr3342-merged-peer-investigation-otto-cli-2026-05-15 branch May 15, 2026 06:37
AceHack added a commit that referenced this pull request May 15, 2026
…non-duplication discipline (#3376)

* shard(tick): 0710Z — convergence with peer-Otto 0615Z investigation; root cause confirmed; non-duplication discipline

Peer-Otto's concurrent session (PID 30425) ran ticks 0545Z-0615Z while I was idle.
Their PR #3370 (0615Z shard) + PR #3372 (B-0530 cron-sentinel-mutex row) IDENTIFIED
the worktree-prune-race root cause: multi-session Otto-CLI self-contention on
shared .git/objects/pack during git worktree add's internal git reset --hard.

My 0524Z investigation cleared 7 candidates; the 8th (multi-session self-contention)
was on my "next tick" list as highest-likelihood. Peer-Otto got there first via
empirical PID-level evidence at 0611Z.

Substrate-honest non-duplication: abandoned my draft B-NNNN row this tick after
git fetch revealed B-0530 already on main. Refresh-before-decide applies at
backlog-row-allocation scope.

Documents the borrow-on-existing vs new-worktree-creation distinction:
git switch touches HEAD only; git worktree add forks git reset --hard which
contends on .git/objects/pack. Borrow pattern is concurrent-Otto-safe; new
worktree creation hits the race.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard): address 3 Copilot review threads on 0710Z shard

- Line 1: replaced (PR TBD) placeholder with (PR #3376) per tick-history-row convention
- Lines 32/44/56/81: fixed relative-link path bug — was 5x dotdot which only
  climbed to docs/, breaking all .claude/rules/... links. Now 6x dotdot for repo
  root + .claude/rules/<file>. Empirically verified: realpath now resolves all
  4 links correctly (substrate-wide convention bug affects 0230Z + 0414Z + 0517Z
  + 0717Z + 0724Z shards too — a follow-on B-NNNN row could bulk-fix the cohort).
- Lines 7/25: clarified two distinct peer-Otto PIDs — 7894 was peer-Otto's own
  session per their 0611Z ps observation; 30425 was a SEPARATE later launchd
  respawn observed running grep at 0710Z. Reconciles with peer-Otto's 0615Z
  shard which records PID 7894.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 15, 2026
…ns (#3375)

* feat(b-0530): cron-sentinel-mutex — detect concurrent Otto-CLI sessions

Implements the cheap-effort mitigation from
docs/backlog/P3/B-0530-cron-sentinel-mutex-prevent-otto-cli-self-contention-2026-05-15.md
+ the worktree-prune-race root-cause analysis landed in PR #3370.

This is the action-side parity proof for the narration filed in B-0530
and Pattern 8 of B-0519: Lior's antigravity check at 0220Z (PR #3373)
flagged "narration-over-action drift" — listing mitigations without
implementing them. This commit closes that gap.

The mutex is a diagnostic, not a gate: it returns a structured
MutexResult that callers (the <<autonomous-loop>> tick body) use to
decide whether to defer git-mutating work. Empty peer list → proceed
normally. Non-empty → bus-publish a deferral envelope and skip git
ops this tick.

Implementation:
  - tools/orchestrator-checks/cron-sentinel-mutex.ts (103 lines)
    - spawnSync("pgrep", ["-afl", "claude-code"]) with args-as-array
      (no shell, no injection — same pattern as sibling verify-branch.ts)
    - Excludes self-PID; excludes ancestors that lack the claude-code
      stdio flags --output-format / --input-format
    - Exports checkPeerSessions() + formatResult() for testability
    - if (import.meta.main) main() guard so imports don't trigger
      side effects
    - --json output for shell composition
  - tools/orchestrator-checks/cron-sentinel-mutex.test.ts (91 lines)
    - 8 tests covering: no peers, exclude-self, exclude-ancestors,
      empty-stdout, malformed-lines, self-with-matching-flags,
      formatResult-empty, formatResult-multi-peer

Verified:
  - bunx tsc --noEmit clean
  - bun test: 8 pass / 0 fail
  - semgrep --config .semgrep.yml --error: 0 findings
  - Live smoke detected 3 concurrent claude-code sessions
    (real-world validation of the diagnostic)

Next step (not in this PR): wire this into the autonomous-loop
substrate so the <<autonomous-loop>> tick body invokes the mutex
at the top and defers when peers are detected. Filed as B-0530
follow-up; this PR ships the building block.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(b-0530): distinguish pgrep failures from true no-peer + add sonarjs disable

Two code-review findings from PR #3375 Codex review:

1. P1: `checkPeerSessions` now checks `result.error` (spawn failure, e.g.,
   pgrep binary missing) and `result.status > 1` (pgrep runtime error).
   Previously both silently returned `peerDetected=false`, masking an
   unknown mutex state. Now surfaces `pgrepError` in `MutexResult` so
   callers know the check itself failed.

2. P0: Add `eslint-disable-next-line sonarjs/no-os-command-from-path`
   before the `spawnFn("pgrep", ...)` call. Rationale inline: pgrep is
   a known system binary, args-array form prevents shell injection.

Tests: +3 new cases covering spawn-error, status-2 exit, and
formatResult error-message rendering. All 11 tests pass.

Co-Authored-By: Claude <noreply@anthropic.com>
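
For readers without access to the file, the behavior described in the two commits above can be approximated as follows. This is a sketch reconstructed from the commit messages only, not the contents of `cron-sentinel-mutex.ts`: the type shapes (`SpawnOutcome`, `Peer`, the fields of `MutexResult`) and the exact filter rule are assumptions.

```typescript
import { spawnSync } from "node:child_process";

// Assumed shapes, modeled on the commit messages rather than the real file.
export interface SpawnOutcome {
  stdout: string;
  status: number | null;
  error?: Error;
}
export type SpawnFn = (cmd: string, args: string[]) => SpawnOutcome;
export interface Peer {
  pid: number;
  command: string;
}
export interface MutexResult {
  peerDetected: boolean;
  peers: Peer[];
  pgrepError?: string;
}

// Flags the commit message names as marking a real claude-code session.
const STDIO_FLAGS = ["--output-format", "--input-format"];

// Default spawner: args passed as an array, so no shell is involved.
const defaultSpawn: SpawnFn = (cmd, args) => {
  const r = spawnSync(cmd, args, { encoding: "utf8" });
  return { stdout: r.stdout ?? "", status: r.status, error: r.error };
};

export function checkPeerSessions(
  selfPid: number = process.pid,
  spawnFn: SpawnFn = defaultSpawn,
): MutexResult {
  const r = spawnFn("pgrep", ["-afl", "claude-code"]);
  // pgrep exits 1 for "no match"; a spawn failure or a status above 1 means
  // the check itself failed and the mutex state is unknown.
  if (r.error || (r.status !== null && r.status > 1)) {
    const reason = r.error?.message ?? `pgrep exited with status ${r.status}`;
    return { peerDetected: false, peers: [], pgrepError: reason };
  }
  const peers: Peer[] = [];
  for (const line of r.stdout.split("\n")) {
    const match = /^(\d+)\s+(.*)$/.exec(line.trim());
    if (!match) continue; // skip empty or malformed lines
    const pid = Number(match[1]);
    const command = match[2];
    if (pid === selfPid) continue; // never count ourselves
    if (!STDIO_FLAGS.some((flag) => command.includes(flag))) continue;
    peers.push({ pid, command });
  }
  return { peerDetected: peers.length > 0, peers };
}
```

Injecting `spawnFn` is what makes the spawn-error and status-2 cases testable without a real `pgrep` on the machine.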

* fix(b-0530): main() returns PGREP_ERROR_EXIT on unknown mutex state

Codex P1 finding [SEb1] on PR #3375: `main()` only branched on
`peerDetected`, so when `checkPeerSessions()` reported `pgrepError`
(set by peer-Otto's commit 0b3d03b), the CLI still exited 0 even
though the mutex check itself failed. Shell callers gating on the
exit code would proceed as if there were no peers.

Extracted exit-code mapping into `mainResult(r: MutexResult)` so it
is unit-testable without process.exit. Added `PGREP_ERROR_EXIT = 251`
constant for "pgrep failed, mutex state unknown" — distinct from the
0..250 peer-count range so shell callers can branch on it explicitly.

Exit code matrix:
  0           = no peers, no error (safe to proceed)
  1..250      = 1 + peer count (caller should defer)
  251 (new)   = pgrep error / unknown state (caller should defer)

Tests: +5 new cases covering all four mainResult branches plus
error-takes-precedence-over-peers. 16/16 pass (8 original + 3 from
peer's 0b3d03b + 5 new).

Composes with peer-Otto's 0b3d03b which added pgrepError tracking
and the sonarjs suppression. This commit closes the last of the 3
PR #3375 review threads (#R3Mn + #R4ji peer-fixed already; #SEb1
fixed here).

Co-Authored-By: Claude <noreply@anthropic.com>
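
The exit-code mapping described above is small enough to sketch in full. This is an approximation based on the commit messages, using the `Math.min(1 + peerCount, 250)` formula quoted in the later PR #3390 fix; the `MutexResult` shape here is an assumption, not copied from the real file.

```typescript
export const PGREP_ERROR_EXIT = 251;
const MAX_PEER_EXIT = 250;

// Assumed result shape; only the fields the mapping needs.
export interface MutexResult {
  peerDetected: boolean;
  peers: { pid: number }[];
  pgrepError?: string;
}

// Maps a mutex result to a process exit code without calling process.exit,
// so the mapping can be unit-tested directly.
export function mainResult(r: MutexResult): number {
  // An error takes precedence over any peers that were also reported.
  if (r.pgrepError !== undefined) return PGREP_ERROR_EXIT;
  if (!r.peerDetected) return 0;
  // 1 + peer count, capped so it never collides with PGREP_ERROR_EXIT.
  return Math.min(1 + r.peers.length, MAX_PEER_EXIT);
}
```

Under this formula a single detected peer yields exit code 2, so exit code 1 is never produced when peers are present.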

* fix(b-0530): --json mode also returns mainResult exit code

Codex P1 finding on PR #3375: the --json branch always called
process.exit(0), so shell callers using --json AND $? (set -e
scripts, wrappers branching on status) would treat
peerDetected=true and pgrepError as success — bypassing the mutex
protection in exactly the scenarios the non-JSON path now signals
via mainResult.

Fix: --json branch now calls process.exit(mainResult(r)) for the
same exit-code semantics as the non-JSON path. Callers can use
stdout (structured JSON) AND $? (numeric status) together.

Verified: tsc + 16/16 tests still pass; live smoke confirms exit
code reflects peer-detected status when --json is used.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 15, 2026
…ross 5 shards (#3386)

* fix(shards): bulk fix tick-shard rule-link depth (5 dotdot → 6 dotdot) across 5 shards

Bulk fix follow-up to peer-Otto's 0729Z investigation
(PR #3380) which discovered the substrate-wide rule-link path-bug.
From `docs/hygiene-history/ticks/2026/05/15/X.md`, 5x `../` only
reaches `docs/`; 6x `../` is required to climb out to repo root
where `.claude/rules/` lives.

Files fixed (13 broken links total):
  - 0414Z.md (6 links — claim-acquire + holding-without-named-dep)
  - 0503Z.md (1 link — blocked-green-ci)
  - 0517Z.md (3 links — holding + additive + otto-channels)
  - 0524Z.md (2 links — holding + verify-before-deferring)
  - 0717Z.md (1 link — claim-acquire)

Already-correct shards on main verified intact (0027Z, 0230Z fixed
earlier; 0615Z fixed in #3370; 0710Z fixed in #3376; 0724Z, 0729Z
already use 6 dotdots).

Methodology:
  - Strict detection script identifies markdown-link targets with
    exactly 5 `../` prefix AND `.claude/` substring (avoids over-
    matching 6-dotdot strings starting at offset 3)
  - Context check confirms all 13 occurrences are inside `]( ... )`
    markdown link parens (not prose mentioning 5-dotdot literally)
  - Bulk replacement: `../../../../../.claude/` → `../../../../../../.claude/`
  - Post-fix re-verification: detection script returns empty for all 5
  - Sample realpath check: the fixed links resolve correctly

Composes with [B-0519 Pattern 8](https://github.com/Lucent-Financial-Group/Zeta/blob/main/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md)
+ PR #3380 "next tick candidate" framing. Substrate-honest action-side closure of
the investigation peer-Otto deferred.

Co-Authored-By: Claude <noreply@anthropic.com>
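
The detection script itself is not included in this PR. A minimal sketch of the strict-match idea from the Methodology list might look like the following; the regular expression and the function names are hypothetical.

```typescript
// Matches a markdown link target that starts with exactly five "../"
// followed by ".claude/". Because the pattern is anchored on "](", a
// six-dotdot target cannot match from an offset inside its prefix.
const FIVE_UP_CLAUDE_LINK = /\]\(((?:\.\.\/){5})(\.claude\/[^)\s]*)\)/g;

// Returns every too-shallow link target found in the markdown text.
export function findShallowRuleLinks(markdown: string): string[] {
  return Array.from(markdown.matchAll(FIVE_UP_CLAUDE_LINK), (m) => m[1] + m[2]);
}

// Rewrites each too-shallow target by adding one more "../".
export function fixShallowRuleLinks(markdown: string): string {
  return markdown.replace(
    FIVE_UP_CLAUDE_LINK,
    (_whole: string, ups: string, rest: string) => `](${ups}../${rest})`,
  );
}
```

Requiring the `](` prefix is also what keeps prose that merely mentions the five-dotdot string from being rewritten. Running `findShallowRuleLinks` again after the fix corresponds to the post-fix re-verification step.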

* fix(shards): correct B-0530 link depth in 0717Z (3 → 5 dotdots)

Copilot P1 catch on PR #3386: line 7 of 0717Z had
`../../../backlog/P3/B-0530-...` (3 dotdots) which from
`docs/hygiene-history/ticks/2026/05/15/` only climbs to `ticks/`,
not `docs/`. Needs 5 dotdots to reach `docs/backlog/`.

My initial bulk-fix scope targeted only the 5→6 dotdot pattern for
.claude/ links. The 3→5 dotdot pattern (different bug class, same
depth-counting error) was outside that scope. Verified via realpath
that the corrected link now resolves.

Generalized my detection script to scan for ALL broken-depth
patterns across the 5 fixed files; only this one additional link
turned up.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 15, 2026
…3390)

* feat(autonomous-loop): wire cron-sentinel-mutex into Step 1 refresh

Closes the PR #3375 "Next step (not in this PR): wire this into the
autonomous-loop substrate so the <<autonomous-loop>> tick body
invokes the mutex at the top and defers when peers are detected."

Added to docs/AUTONOMOUS-LOOP-PER-TICK.md Step 1 (Refresh):
  - New `cron-sentinel-mutex.ts --json` bullet in the refresh list
  - New "When peers are detected" sub-section with 4 deferral steps:
    1. Avoid `git worktree add` (worktree-prune-race rationale)
    2. Continue with non-git-mutating work (bus, audits, planning)
    3. Bus-publish a deferral envelope if substrate matters past tick
    4. Re-check next tick (contention windows resolve in 1-3 min)
  - Special case: exit code 251 (PGREP_ERROR_EXIT) — proceed but log

Per the 3-surface canonical convergence, this update propagates to
Otto-CLI (auto-loaded next cold-boot), Otto-Desktop routine (cites
this file), and B-0448 cloud routine (when shipped — will cite this
file).

The discipline is ADVISORY, not a hard gate: the mutex reports
state, the tick body decides. Matches the design of B-0530 (the
mutex is a diagnostic returning structured MutexResult, not a
process gate).

Composes with:
  - PR #3370 (worktree-prune-race root cause + B-0519 Pattern 8)
  - PR #3375 (mutex implementation)
  - PR #3377 (borrow-on-existing pattern — alternative when peer
    contention is encountered)
  - PR #3386 (bulk rule-link depth fix across affected shards)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(autonomous-loop): correct cron-sentinel-mutex exit-code range and exit-251 guidance

- Exit code range for peerDetected=true is 2..250 (Math.min(1+peerCount,250)),
  not 1..250; exit 1 is unreachable when peers are detected
- Replace `{..., ...}` JSON placeholder in bus.ts publish example with
  valid JSON so the command doesn't hard-fail when copy-pasted
- Exit 251 (PGREP_ERROR_EXIT) means pgrep failed and state is unknown;
  treat as peer-detected for git-mutating ops (defer worktree add),
  matching the 'caller should defer' comment in the implementation

Addresses Codex P1 and 3× Copilot P1 threads on PR #3390.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
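
From the caller's side, the corrected guidance reduces to a small decision on the mutex exit status. The sketch below is illustrative only: `TickDecision` and `decideFromExitCode` are hypothetical names, and treating any unexpected status (including `null` for a signal-killed process) as "defer" is a conservative assumption rather than something the PR states.

```typescript
export const PGREP_ERROR_EXIT = 251;

export interface TickDecision {
  deferGitOps: boolean;
  reason: string;
}

// Interprets the exit status of the mutex CLI for the tick body.
export function decideFromExitCode(code: number | null): TickDecision {
  if (code === 0) return { deferGitOps: false, reason: "no peers detected" };
  if (code === PGREP_ERROR_EXIT) {
    return { deferGitOps: true, reason: "pgrep failed; mutex state unknown" };
  }
  if (code !== null && code >= 2 && code <= 250) {
    // The exit code is min(1 + peerCount, 250), so 250 only bounds the count.
    const shown = code === 250 ? "249 or more" : String(code - 1);
    return { deferGitOps: true, reason: `${shown} peer session(s) detected` };
  }
  // Assumption: anything else is treated like an unknown state.
  return { deferGitOps: true, reason: `unexpected exit status ${code}` };
}
```

Since the check is advisory, a deferral here only means skipping `git worktree add` and other git-mutating steps for the tick; bus publishing, audits, and planning can continue.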

---------

Co-authored-by: Claude <noreply@anthropic.com>

2 participants