diff --git a/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md b/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md
index a8080cf41..4c058db78 100644
--- a/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md
+++ b/docs/backlog/P3/B-0519-multi-otto-branch-state-contamination-rca-2026-05-14.md
@@ -114,6 +114,36 @@ Field-test shards:
 - `docs/hygiene-history/ticks/2026/05/15/0230Z.md` — full forensics +
   pivot to dedicated worktrees recovery
 
+### Pattern 8 — Multi-Otto-CLI cron-tick concurrency on `.git/objects/pack` (2026-05-15T06:11Z)
+
+Two or more concurrent Otto-CLI claude-code sessions (different
+foreground sessions, same machine, same `.git/`) firing
+`<>` cron sentinels in parallel both invoke
+`git worktree add`, both contend on shared `.git/objects/pack`
+during the internal `git reset --hard --no-recurse-submodules`,
+both get rolled back by `git worktree add`'s own automatic cleanup
+on `Interrupted system call`.
+
+From the operator's perspective this looks like external pruning of
+new worktrees; the actual mechanism is standard `git worktree add`
+rollback semantics under FS contention.
+
+Field-test trail:
+
+- Bus envelopes: `44aaf799` (peer-Otto 0414Z) +
+  `111342b2` / `6de98fac` / `720a2b49` (my 0545Z+0607Z+0611Z)
+- Investigation shard: `docs/hygiene-history/ticks/2026/05/15/0524Z.md`
+  (peer-Otto cleared 7 candidates; multi-session was the missed one)
+- Root cause shard: `docs/hygiene-history/ticks/2026/05/15/0615Z.md`
+  (PID-level diagnostic landed in PR #3370)
+- Mechanization row: [B-0530 cron-sentinel-mutex](B-0530-cron-sentinel-mutex-prevent-otto-cli-self-contention-2026-05-15.md)
+  (P3, effort S, filed 2026-05-15)
+
+Mechanization candidate (see B-0530 for full detail): `pgrep -fl
+claude-code.*Otto` at the top of `<>`; if a peer
+Otto-CLI process is detected, bus-publish a "deferred" envelope and
+exit cleanly.
+
 ## Mechanization candidates
 
 ### Cheap
diff --git a/docs/hygiene-history/ticks/2026/05/15/0615Z.md b/docs/hygiene-history/ticks/2026/05/15/0615Z.md
new file mode 100644
index 000000000..3558fdb9f
--- /dev/null
+++ b/docs/hygiene-history/ticks/2026/05/15/0615Z.md
@@ -0,0 +1,95 @@
+# Tick 0615Z — Worktree-prune-race root cause identified + 0545Z+0607Z+0611Z shard substrate recovered to git
+
+## Headline
+
+- **Worktree-prune-race root cause: concurrent Otto-CLI sessions** firing autonomous-loop ticks in parallel, contending on shared `.git/objects/pack`. PID-level evidence captured this tick. Investigation arc (0414Z peer-Otto → 0524Z peer-Otto 7-candidates cleared → 0545Z+0607Z my consecutive observations → 0611Z my PID diagnostic) now resolved.
+- **3 prior bus envelopes published** (`111342b2`, `6de98fac`, `720a2b49`) capturing the substrate that I could not commit to git in ticks 0545Z + 0607Z. This shard is the git-canonical landing of that substrate now that the contention window has cleared.
+- **PR [#3342](https://github.com/Lucent-Financial-Group/Zeta/pull/3342) MERGED at `4a10bf6`** — confirmed in worldview-refresh at 0545Z (the tick that started this whole investigation). Auto-merge fired during my worldview-refresh that tick.
+
+## Root cause (0611Z PID diagnostic)
+
+```
+PID 68752 — my claude-code session, ~3:53h old at tick 0611Z
+PID 7894 — peer Otto-CLI session, ~5:40 old at tick 0611Z
+            shares Claude.app parent PID 702 with mine
+
+PID 11710 — peer's zsh launching git worktree add
+PID 11725 — peer's `git worktree add /tmp/zeta-0607z-shard-2026-05-15 -b shard/0607z-cron-rearm-and-0230z-recovery-otto-cli-2026-05-15 origin/main`
+PID 11818 — peer's `git reset --hard --no-recurse-submodules` (child of 11725)
+            STUCK on .git/objects/pack contention; held index.lock 9+ min
+
+My git worktree add attempts (ticks 0545Z + 0607Z) all failed with:
+  error: unable to open object pack directory: .../.git/objects/pack: Interrupted system call
+  fatal: Could not reset index file to revision 'HEAD'
+followed by git's automatic rollback (rm -rf the partially-populated worktree dir).
+```
+
+Peer-Otto's 0524Z investigation correctly ruled out 7 candidates (Lior/Riven/Codex/Vera/Copilot loops, lane-allocator, `git worktree prune`, `gc.pruneexpire`). The candidate they missed was **multi-session self-contention** — two Otto-CLI claude-code instances running autonomous-loop ticks in parallel on the same `.git/` directory.
+
+## What "looks like pruning" actually is
+
+When `git worktree add` runs:
+
+1. Creates `.git/worktrees//` admin dir + the worktree path on disk
+2. Copies the entire working tree (`Updating files: 100%`)
+3. Runs `git reset --hard --no-recurse-submodules` to populate the index
+4. If step 3 fails (e.g., `Interrupted system call` on `.git/objects/pack` due to concurrent peer git read/write), `git worktree add` **rolls back**:
+   - `rm -rf` the worktree path
+   - removes the `.git/worktrees//` admin dir
+   - exits with the `fatal: Could not reset index file to revision 'HEAD'` error
+
+From the operator's perspective this looks like "something pruned my worktree." It IS git pruning the worktree — but git itself, as part of standard `git worktree add` rollback semantics, not an external attacker. The trigger is `.git/objects/pack` contention during the reset.
+
+## Mitigation candidates
+
+| Candidate | Shape | Effort |
+|---|---|---|
+| **Cron-sentinel mutex** | Refuse to fire if another `claude-code` Otto-CLI process is detected via `pgrep -fl "claude-code.*Otto"` | S |
+| **Pre-worktree-add lock check** | Before `git worktree add`, `lsof .git/objects/pack` and back off with jitter if any peer git is present | S |
+| **Filesystem flock on a sentinel path** | All Otto-CLI git operations serialize via `flock /tmp/zeta-git.lock` | M |
+| **Per-session bare clone** | Each Otto-CLI session bind-mounts or symlinks a dedicated bare clone of the repo; no shared `.git/` | L |
+
+The substrate-honest first move is the cron-sentinel mutex: it's small, only affects autonomous-loop firings (not interactive Otto-CLI work), and has zero blast-radius. Could be a `pgrep` check at the top of `<>` that bus-publishes a "deferred" envelope and exits cleanly.
+
+## Substrate-honest meta: bus is the bridge channel, NOT substrate
+
+Per [`.claude/rules/substrate-or-it-didnt-happen.md`](../../../../../../.claude/rules/substrate-or-it-didnt-happen.md): `/tmp/zeta-bus/` envelopes are **captured** (TaskUpdate-tier, ephemeral) — they are NOT substrate. Substrate requires committed + reachable + indexed git artifacts.
+
+When `git worktree add` failed repeatedly across 0545Z + 0607Z + 0611Z, the bus envelopes (`111342b2` + `6de98fac` + `720a2b49`) served as the **bridge channel** between outage start and git recovery: they preserved the investigation evidence in ephemeral form until git contention cleared and I could land this shard. That sequence — outage → bus-captured → git-preserved — is the substrate-honest pattern, not a normalization of bus-as-substrate.
+
+This is consistent with how peer-Otto handled the same blocker at 0414Z (envelope `44aaf799`) and then landed [`docs/hygiene-history/ticks/2026/05/15/0524Z.md`](0524Z.md) once git was reachable. The lesson for future-Otto: when `git worktree add` fails repeatedly, bus-publish to bridge the outage, then commit-and-push as soon as the contention clears. Do NOT treat the bus as a substitute for git-canonical landing.
+
+## Δ since 0543Z (the last shard that landed)
+
+| What | At 0543Z | At 0615Z |
+|---|---|---|
+| PR #3342 | wait-ci, auto-merge armed | **MERGED** (`4a10bf6`) |
+| Worktree-prune-race understanding | peer-Otto's open investigation | root cause identified (multi-session contention) |
+| Bus envelopes from me | 0 | 3 (`111342b2`, `6de98fac`, `720a2b49`) |
+| Otto-CLI active sessions | 1 (mine) | 2 (mine + peer's PID 7894) |
+| Peer Otto-CLI stuck git ops | n/a | observed mid-tick, then cleared by 0615Z |
+| Mitigation candidates | none documented | 4 candidates with effort-T-shirt sizing |
+
+## Bus state
+
+```
+$ ls /tmp/zeta-bus/ (7 envelopes)
+720a2b49 (otto-cli 0613Z) — ROOT CAUSE: multi-session Otto-CLI concurrency
+6de98fac (otto-cli 0611Z) — third observation + refined hypothesis
+111342b2 (otto-cli 0605Z) — initial observation (my tick 0545Z)
+44aaf799 (otto-cli 0451Z) — peer-Otto's original report (tick 0414Z)
++ 3 stale work-assignment broadcasts (B-0441, B-0170, B-0503)
+```
+
+## Cron sentinel
+
+`a2c54a1c` armed.
+
+## Next
+
+Cron-driven. Suggested next-tick actions:
+
+1. **Commit this shard** + push + open PR + arm auto-merge (this tick's primary work)
+2. **If contention recurs**: detect peer Otto-CLI before retrying worktree-add
+3. **File B-0530** for the cron-sentinel-mutex mitigation if peer-Otto or maintainer agrees this is worth mechanizing
+4. **Otherwise**: check on the worktree-prune-race investigation surface for closure
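Reviewer sketch (not part of the patch): the cron-sentinel mutex from the mitigation table, and the peer detection in next-tick action 2, could look roughly like the following. This is a minimal sketch under stated assumptions, not the B-0530 implementation. `filter_peers`, `tick_decision`, and `SESSION_PID` are hypothetical names, and the sketch assumes the firing session knows its own claude-code PID, since that process would itself match the `pgrep` pattern.

```shell
# Hypothetical cron-sentinel guard; all names here are illustrative only.

# filter_peers SELF_PID
# Reads "PID command..." lines on stdin (the format printed by `pgrep -fl`)
# and prints every line whose PID is not SELF_PID.
filter_peers() {
  awk -v self="$1" '$1 != self'
}

# tick_decision SELF_PID
# Prints "fire" if no peer process remains after filtering, else "deferred".
tick_decision() {
  peers="$(filter_peers "$1")"
  if [ -z "$peers" ]; then
    echo "fire"
  else
    echo "deferred"
  fi
}

# Possible use at the top of the cron sentinel. SESSION_PID is an assumed
# variable holding this session's own claude-code PID:
#
#   decision="$(pgrep -fl 'claude-code.*Otto' | tick_decision "$SESSION_PID")"
#   if [ "$decision" = "deferred" ]; then
#     # bus-publish a "deferred" envelope here (publish command not shown
#     # in the shard, so it is left out), then exit cleanly
#     exit 0
#   fi
```

With the two PIDs from the 0611Z diagnostic (68752 and 7894) both matching, a tick fired from session 68752 would be deferred; with only 68752 matching, it would fire. A match on command line alone cannot tell whether the peer is actually running git, so this guard is coarser than the `lsof` or `flock` candidates.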