diff --git a/.claude/rules/agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md b/.claude/rules/agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md index 1a6a19d357..e7a4138549 100644 --- a/.claude/rules/agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md +++ b/.claude/rules/agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md @@ -1,17 +1,20 @@ -# Agent worktree hygiene — never hold `main`, never step on operator, clean up after PR merge +# Agent worktree hygiene — never hold `main`, never step on operator, survive reboots, clean up after PR merge Carved sentence: -> Agent worktrees are scratch space; the operator's primary checkout -> is the operator's. Agents NEVER check out branches that would block -> the operator's primary git operations. Specifically: agents NEVER -> hold the `main` branch in any worktree (use detached HEAD off -> `origin/main` instead). Agents NEVER create worktrees under paths -> the operator uses for their own work. Agents REMOVE their own -> worktrees after the work's PR merges (or substrate-honestly -> abandon). Substrate-engineering target B-0750 mechanizes this with -> a periodic cleanup job; until that ships, the discipline operates -> by agent-side compliance. +> Agent worktrees are scratch space, but scratch space MUST survive +> reboots when in-flight work lives in it. Agents place worktrees +> under `~/Documents/src/repos/Zeta/worktrees/--/` +> (persistent storage), NEVER under `/tmp/` or `/private/tmp/` +> (macOS-cleared on reboot AND on `com.apple.periodic-daily` cleanup +> of files >3 days old). Agents NEVER hold the `main` branch in any +> worktree (use detached HEAD off `origin/main` instead). Agents +> NEVER create worktrees under paths the operator uses for their own +> work, with one exception: the repo's own `worktrees/` subdirectory +> is git-aware and operator-safe. 
Agents REMOVE their own worktrees +> after the work's PR merges (or substrate-honestly abandon). +> Substrate-engineering target B-0750 mechanizes the cleanup; +> B-0894 mechanizes the reboot-survival default-location. ## Operational content @@ -22,11 +25,11 @@ Agent worktrees that need to BASE OFF main use `--detach`: ```bash # WRONG (locks main branch in the worktree; blocks operator's # `git checkout main` in primary checkout): -git worktree add /private/tmp/zeta-feat-xyz main +git worktree add ~/Documents/src/repos/Zeta/worktrees/zeta-feat-xyz main # RIGHT (detached HEAD at main's current SHA; doesn't hold the # branch ref; operator can still checkout main in primary): -git worktree add --detach /private/tmp/zeta-feat-xyz origin/main +git worktree add --detach ~/Documents/src/repos/Zeta/worktrees/zeta-feat-xyz origin/main ``` The substrate-honest reason: in a multi-checkout repo, `main` can @@ -39,26 +42,43 @@ When the agent needs main's current STATE (the file contents at main's tip), `--detach origin/main` gives exactly that without holding the branch reference. -### Rule 2 — NEVER create agent worktrees under the operator's primary checkout path +### Rule 2 — Agent worktrees go in `/worktrees/--/` -The operator's primary checkout (the repo root from -`git rev-parse --show-toplevel`, referred to here as -``) is operator-controllable. Agent -worktrees go under `/private/tmp/zeta--/` or -`/tmp/zeta--/`. +**Updated 2026-05-28 per B-0894 reboot-survival discipline.** Previously +this rule recommended `/private/tmp/zeta--/` which is +macOS-cleared on reboot AND on `com.apple.periodic-daily` cleanup of +files >3 days old. 
**Persistent location is the new default**: -Specifically forbidden agent worktree paths: +```bash +# RIGHT (persistent; survives reboot; outside operator's git status): +git worktree add --detach \ + ~/Documents/src/repos/Zeta/worktrees/-- \ + origin/main + +# WRONG (macOS-cleared on reboot; in-flight work lost): +git worktree add --detach /private/tmp/zeta-- origin/main +git worktree add --detach /tmp/zeta-- origin/main +``` + +The repo's `worktrees/` subdirectory at top-level is the canonical +persistent location. Git does not auto-exclude linked worktrees nested +under the primary checkout; it reports the directory as untracked +unless an ignore rule covers it. Worktrees under `/worktrees/` +therefore stay out of the operator's `git status` only while +`worktrees/` is ignored (`.gitignore` or `.git/info/exclude`), even +though they live under the operator's repo root. Lior already uses +this pattern (e.g., +`~/Documents/src/repos/Zeta/worktrees/lior-fix-4772-archive-ts/`) and +Lior's worktrees consistently survive operator restarts. -- `/main` (or any subdir of - the operator's primary checkout) -- `/-*` (or any - peer-agent surface under operator's primary checkout) +Specifically forbidden agent worktree paths (UPDATED): + +- `/tmp/zeta-*` or `/private/tmp/zeta-*` — **NEW**: macOS-cleared on reboot; in-flight work loss + orphaned branch refs (per B-0894 empirical anchor 2026-05-28: 95 worktrees pruned in one restart) +- `/main` (or any subdir of the operator's primary checkout EXCEPT `worktrees/`) +- `/-*` at the top-level (this pollutes operator's `git status`; place under `worktrees/-*` instead) - Any path the operator might `cd` into for their own work -The substrate-honest reason: operator workflows depend on the primary -checkout's `git status` being clean + predictable. Agent worktrees -that share the primary checkout's directory tree create symbolic-link -confusion + operator-side `git` invocation surprises. +The substrate-honest reasons: + +1. **Reboot survival** — agent in-flight commits, edits, and backgrounded `git push` operations MUST survive macOS reboots and periodic-cleanup. 
`/tmp/` and `/private/tmp/` violate this invariant. +2. **Operator-status cleanliness** — operator workflows depend on the primary checkout's `git status` being clean + predictable. Worktrees under `/worktrees/` stay out of the operator's status as long as `worktrees/` is ignored (git does not exclude nested worktrees automatically); top-level subdirs are NOT covered by that single ignore rule (would pollute status). ### Rule 3 — REMOVE agent worktrees after the work's PR merges (or abandon) @@ -83,8 +103,9 @@ Before starting a substrate-cascade (multiple-PRs-in-one-session) work pattern, agents audit their worktree state: ```bash -# Inventory agent's own worktrees: -git worktree list | grep -E "/private/tmp/zeta-|/tmp/zeta-" +# Inventory agent's own worktrees (UPDATED 2026-05-28 — also check +# legacy /tmp paths to catch + migrate any remaining transient worktrees): +git worktree list | grep -E "$HOME/Documents/src/repos/Zeta/worktrees/|/private/tmp/zeta-|/tmp/zeta-" # Per-worktree status check (for each one): git -C status --short @@ -94,6 +115,8 @@ git -C log --oneline -1 # - SAFE + work done → git worktree remove # - DIRTY (uncommitted) → preserve OR substrate-honestly commit/abandon # - active iteration ongoing → keep +# - in /tmp or /private/tmp → MIGRATE to persistent location OR commit/push immediately +# (per Rule 5; transient location violates reboot-survival) ``` Empirical anchor: 2026-05-25 session accumulated 37 stale agent @@ -103,6 +126,45 @@ intervention ("we need to fix this mess yall always stepping on each other and me constantly"). The discipline this rule encodes would have prevented the accumulation. +### Rule 5 — Reboot-survival is a hard invariant (B-0894) + +**NEVER use `/tmp/` or `/private/tmp/` for git worktrees.** macOS +clears these on reboot AND via `com.apple.periodic-daily` cleanup +(files >3 days old). Agent in-flight work — uncommitted edits, +backgrounded `git push` operations, partially-extracted worktrees, +captured background-task output files — all evaporate. 
+ +Empirical anchor 2026-05-28T04:09Z–04:35Z (operator restart): + +| Worktree location pattern | Outcome on restart | +|---|---| +| `/private/tmp/zeta--/` (95 instances) | **All 95 pruned** (`git worktree list` returned them as `prunable`; on-disk dirs cleared) | +| `~/Documents/src/repos/Zeta/worktrees/-*` (multiple) | **All survived intact** | +| `~/.gemini/tmp/project/lior-*` (multiple) | **All survived intact** (`~/.gemini/` is persistent user-home) | + +The 04:09Z autonomous-loop tick had a substantive tick-shard commit +(`4f89af885`) sitting on branch +`otto-cli/tick-0409z-sentinel-rearm-2026-05-28` with a backgrounded +`git push` in flight when restart hit. Push never completed; worktree +directory at `/private/tmp/zeta-otto-cli-0409z-sentinel-rearm/` was +gone post-restart. The branch ref and commit object survived under the +repo's `.git/` directory (refs and the object store live in the repo, +not in `/tmp/`), so the commit could theoretically be re-pushed from a +fresh worktree. The backgrounded-task output file at +`/private/tmp/claude-501//tasks/.output` was +also cleared, however, so the harness's record of the push outcome +could not be read after restart. + +The persistent-location worktree authoring B-0894 (this row's substrate +landing) survived the same restart cleanly — empirical dogfooding-proof. + +**Operational discipline**: + +1. Default to `~/Documents/src/repos/Zeta/worktrees/--/` for ALL new agent worktrees +2. When migrating existing `/tmp/`-based work: commit immediately to capture state in `.git/objects/`, then if in-flight work is critical, create a fresh persistent worktree off the same branch +3. For backgrounded `git push` operations: ALWAYS verify outcome via `git ls-remote origin ` post-completion (ground-truth) — never rely on captured output files from background-task harness storage +4. 
Cron sentinel is harness-level non-persistence (separate root cause; covered by `tick-must-never-stop.md` catch-43); any restarted session MUST run `CronList` and re-arm the sentinel if it is missing + ## Composes with other rules - `.claude/rules/claim-acquire-before-worktree-work.md` — worktree @@ -132,8 +194,11 @@ have prevented the accumulation. ### Audit own agent worktrees ```bash +# UPDATED 2026-05-28 per B-0894: persistent-location is now the +# primary surface; legacy /tmp paths checked too to catch + migrate +# any remaining transient worktrees that need cleanup. git worktree list --porcelain | awk '/^worktree /{print $2}' | \ - grep -E "/private/tmp/zeta-|/tmp/zeta-" + grep -E "$HOME/Documents/src/repos/Zeta/worktrees/|/private/tmp/zeta-|/tmp/zeta-" ``` ### Per-worktree clean check @@ -163,24 +228,26 @@ worktree changes. The operator's primary worktree often sits on a feature branch rather than `main`, so checking for "primary on main" produces false negatives. -The correct invariant: **no agent worktree (under `/private/tmp/zeta-*` -or `/tmp/zeta-*`) holds `[main]`**. Zero matches is the happy path; the -operator MAY have `main` checked out in their own primary, but agents -must not. +The correct invariant: **no agent worktree (under `/worktrees/`, +`/private/tmp/zeta-*`, or `/tmp/zeta-*`) holds `[main]`**. Zero matches +is the happy path; the operator MAY have `main` checked out in their +own primary, but agents must not. ```bash # Prints OK on success. If a worktree line prints, an agent worktree # is holding [main] and is the blocker for operator git operations. -git worktree list | awk '/\[main\]/ { path=$1 } END { exit 0 }' \ - && git worktree list | grep -E "\[main\]" \ - | grep -E "/private/tmp/zeta-|/tmp/zeta-" || echo "OK: no agent holds [main]" +# UPDATED 2026-05-28: also catches persistent-location worktrees per +# B-0894 new default (~/Documents/src/repos/Zeta/worktrees/). 
+git worktree list | grep -E "\[main\]" \ + | grep -E "$HOME/Documents/src/repos/Zeta/worktrees/|/private/tmp/zeta-|/tmp/zeta-" \ + || echo "OK: no agent holds [main]" ``` Expected result: `OK: no agent holds [main]`, or equivalently no agent-worktree match if the final echo is omitted. A single operator primary line is OK when the operator intentionally has `main` checked -out. Any `/private/tmp/zeta-*`, `/tmp/zeta-*`, or per-agent worktree -line holding `[main]` is a violation to fix. +out. Any `/worktrees/*`, `/private/tmp/zeta-*`, `/tmp/zeta-*`, +or per-agent worktree line holding `[main]` is a violation to fix. ## Substrate-honest framing diff --git a/docs/backlog/P1/B-0894-reboot-survival-discipline-in-flight-state-must-survive-macos-private-tmp-clear-aaron-2026-05-28.md b/docs/backlog/P1/B-0894-reboot-survival-discipline-in-flight-state-must-survive-macos-private-tmp-clear-aaron-2026-05-28.md new file mode 100644 index 0000000000..e5fcc90324 --- /dev/null +++ b/docs/backlog/P1/B-0894-reboot-survival-discipline-in-flight-state-must-survive-macos-private-tmp-clear-aaron-2026-05-28.md @@ -0,0 +1,123 @@ +--- +id: B-0894 +title: Reboot-survival discipline — in-flight state must survive macOS `/private/tmp/` clear (worktrees + bus envelopes + bg-task output + sentinel) +status: open +priority: P1 +created: 2026-05-28 +attribution: aaron-2026-05-28 +depends_on: [] +composes_with: + - B-0750 + - B-0530 + - B-0858.5 +tags: + - hygiene + - infrastructure + - rule-update + - cross-cutting +--- + +# B-0894 — Reboot-survival discipline — in-flight state must survive macOS `/private/tmp/` clear + +## Substrate-inventory pass (per `.claude/rules/verify-existing-substrate-before-authoring.md`) + +Topic: reboot survival of in-flight state (worktrees, bus envelopes, background-task output, sentinel) + +Searched surfaces (origin/main): + +- `docs/agendas/`: none on this specific topic +- `docs/trajectories/`: none +- `docs/backlog/`: no row covers the cross-cutting "in-flight 
state survives reboot" discipline. B-0750 (agent worktree hygiene + cleanup automation) is sibling at worktree-cleanup scope but NOT at reboot-survival scope. B-0530 (cron-sentinel mutex) is sibling at multi-agent contention scope but assumes worktrees exist on disk. B-0858.5 (heartbeat auto-state-gathering, consent-first) is sibling at state-gathering-scope. +- `.claude/rules/`: `agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md` HARDCODES `/private/tmp/zeta--/` and `/tmp/zeta--/` as the recommended location in 10+ places — directly contradicts reboot-survival requirement. `claim-acquire-before-worktree-work.md` references `/private/tmp/` in saturation-ceiling patterns. `refresh-world-model-poll-pr-gate.md` references `.git/index.lock` recovery but not reboot-survival cross-class. +- `memory/`: 0 hits on "reboot survival" as named pattern. +- `docs/research/`: 0 hits on the named pattern. + +Targeted searches: + +```bash +rg -l "private/tmp/zeta-|/tmp/zeta-|reboot.survival|in.flight.survives" .claude/rules/ docs/backlog/ +``` + +Conclusion: NO existing rule or row names the cross-cutting reboot-survival discipline. The `agent-worktree-hygiene` rule actively hardcodes the failure-mode pattern. Mint-new authorized per operator 2026-05-28 explicit framing: *"why are we putting any git stuff in /private/tmp/ this is terrible design"* + *"we need to survive reboots in any kind of inflight stuff"*. + +Authoring action: **mint-new + rule-edit** (this row + the `agent-worktree-hygiene` rule edit ship together as the substrate landing of the operator's named requirement). + +## The problem (empirical anchor — 2026-05-28T04:09Z–04:35Z) + +Operator restart at ~04:30Z UTC pruned **95 worktrees** that had been placed at `/private/tmp/zeta-*` paths per the prevailing `agent-worktree-hygiene` rule recommendation. Same restart left **5 Lior worktrees intact** at `~/Documents/src/repos/Zeta/worktrees/lior-*` paths. 
**Same restart, opposite outcomes, split by where each agent put its worktrees.** + +Worse: the 04:09Z autonomous-loop tick had a substantive tick-shard commit (`4f89af885`) sitting on branch `otto-cli/tick-0409z-sentinel-rearm-2026-05-28` with a backgrounded `git push` in flight when restart hit. Push never completed; the branch ref and commit object survived under the repo's `.git/` directory, but the worktree directory at `/private/tmp/zeta-otto-cli-0409z-sentinel-rearm/` was gone. The backgrounded-task output file at `/private/tmp/claude-501//tasks/.output` was also gone, so the push outcome could not be read to decide whether to retry. + +Operator framing: *"why are we putting any git stuff in /private/tmp/ this is terrible design"* + *"we need to survive reboots in any kind of inflight stuff"*. + +## Root cause + +macOS clears `/private/tmp/` on reboot AND via `com.apple.periodic-daily` cleanup of files older than 3 days. `/tmp/` is a symlink to `/private/tmp/` on macOS — same behavior. Three classes of agent in-flight state currently live there; a fourth, the cron sentinel, is lost on restart for a separate reason: + +| State class | Current location | Survives macOS reboot? | +|---|---|---| +| Agent worktrees | `/private/tmp/zeta--/` (per `agent-worktree-hygiene` rule recommendation) | NO | +| Bash background-task output | `/private/tmp/claude-501//tasks/.output` (Claude Code harness default) | NO | +| Bus envelopes | `/tmp/zeta-bus/.json` (per `tools/bus/bus.ts` line 19 default `ZETA_BUS_DIR`) | NO | +| Cron sentinel | In-memory only (per `tick-must-never-stop` rule; harness-level non-persistence) | NO (separate root cause; covered by `tick-must-never-stop`) | + +Lior's pattern (`~/Documents/src/repos/Zeta/worktrees/lior-*` + `~/.gemini/tmp/project/lior-*`) survives because user home directory is NOT cleared by macOS. + +## Acceptance criteria + +1. 
**`agent-worktree-hygiene` rule edit lands**: change Rule 2's recommended location from `/private/tmp/zeta--/` to `~/Documents/src/repos/Zeta/worktrees/--/` (the persistent-location pattern Lior already proves works). Add new Rule 5 "Reboot-survival is a hard invariant — NEVER use `/tmp/` or `/private/tmp/` for git worktrees." Add empirical anchor (this restart) as proof point. Update all 10+ examples in the rule. +2. **`claim-acquire-before-worktree-work` rule update**: saturation-ceiling sub-cases reference `/private/tmp/`; flip to persistent location. +3. **Bus envelope migration plan filed** as B-0894.1 (sub-row): `ZETA_BUS_DIR` should default to `~/.zeta-bus/` or `~/Library/Application Support/Zeta/bus/`. Migration ships separately. +4. **Background-task output is harness-level**: cannot be moved from agent-side; document the workaround (always check `git ls-remote origin ` as ground-truth post-restart, never rely on captured output files). Add to `refresh-world-model-poll-pr-gate.md` or similar. +5. **Per-agent persistent worktree-pool primitive** (long-term mechanization): worktree pool under `~/Documents/src/repos/Zeta/worktrees/pool//` with N pre-allocated slots per agent identity. Sub-row B-0894.2 if shipped separately. + +## What ships in this PR + +This PR delivers acceptance criteria 1 + 2 + the empirical-anchor documentation. Criteria 3 / 4 / 5 are sub-rows (filed as needed). 
+ +## The empirical-anchor that this row preserves + +The 04:09Z tick shard's unpushed commit `4f89af885` IS the empirical anchor showing that: + +- Persistent location works (`~/Documents/src/repos/Zeta/worktrees/otto-cli-reboot-survival-fix-0434z/` survived this restart) +- Transient location fails (`/private/tmp/zeta-otto-cli-0409z-sentinel-rearm/` did not) +- Branch ref and commit object survive under the repo's `.git/` directory regardless (so a re-push from a persistent worktree can recover the commit if needed) +- Background-task output also lives in transient storage (separate root cause; harness-level) + +The worktree where THIS backlog row is being authored IS the dogfooding-proof-point: same restart event tested both patterns simultaneously; persistent survived; transient did not. + +## Composes with + +- **B-0750** — sibling at agent-worktree-cleanup scope; this row adds the location-discipline that prevents the cleanup problem from compounding with reboot-loss +- **B-0530** — sibling at multi-agent contention scope; both assume worktrees exist on disk; this row ensures they do +- **B-0858.5** — sibling at consent-first state-gathering scope; same root cause class (state needs persistent location) +- **`.claude/rules/agent-worktree-hygiene-never-hold-main-never-step-on-operator-cleanup-on-pr-merge.md`** — the rule being edited +- **`.claude/rules/claim-acquire-before-worktree-work.md`** — sibling rule referencing `/private/tmp/` +- **`.claude/rules/tick-must-never-stop.md`** — sentinel session-exit non-persistence is sibling at cron-scope (separate from filesystem-scope; both are reboot-survival) +- **`tools/bus/bus.ts` `BUS_DIR` default** — same root cause class at bus-envelope scope + +## Substrate-honest framing + +This row does NOT: + +- Solve every reboot-survival problem (background-task output is harness-level; sentinel is harness-level) +- Mandate immediate migration of every existing `/private/tmp/` reference in code (opportunistic migration; the rule change flips the 
default for NEW worktrees) +- Override operator authority (operator can put worktrees anywhere they want; the discipline is for agents) + +This row DOES: + +- Encode the reboot-survival hard invariant the operator named +- Provide an empirical anchor (this exact restart) as proof +- Edit the rule that was actively recommending the failure-mode pattern +- Name the worktree-pool primitive as a substrate-engineering target + +## Full reasoning + +Operator 2026-05-28T~04:30Z verbatim: + +> *"why are we putting any git stuff in /private/tmp/ this is terrible design"* +> *"we need to survive reboots in any kind of inflight stuff"* +> *"hey fyi i had to restart"* +> *"please reread latest backlog for today, also i moved from vscode back to console"* + +The restart that produced the empirical anchor for this row was triggered partly by VSCode-Otto surface failure (operator separately disclosed: VSCode-Otto loses context every ~20min and emits "Quiet"; Otto-CLI typically holds ~6h — preserved as user-scope `feedback_aaron_vscode_otto_surface_20min_context_loss_emits_quiet_cli_holds_6h_surface_choice_signal_2026_05_28.md`). The convergence of (surface-failure restart + transient-worktree-location loss) was the substrate-engineering event that surfaced the discipline gap.
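
Operational discipline item 3 under Rule 5 asks agents to confirm a backgrounded `git push` against the remote itself rather than against a captured output file. A minimal sketch of that check follows, using only `git rev-parse` and `git ls-remote`. The function name `verify_push` and its exit codes are illustrative assumptions, not part of the Zeta repo or of git.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name and exit codes are assumptions for illustration).
# Usage: verify_push REMOTE BRANCH
#   0 = remote branch points at the same commit as the local branch
#   1 = remote branch is absent or points at a different commit
#   2 = no local branch with that name
#   3 = remote could not be queried
verify_push() {
  local remote="$1" branch="$2"
  local local_sha remote_line remote_sha

  # Resolve the local branch tip; --quiet suppresses the error message.
  local_sha="$(git rev-parse --verify --quiet "refs/heads/${branch}")" || {
    echo "no local branch: ${branch}" >&2
    return 2
  }

  # Ask the remote directly, so the answer does not depend on any
  # output file that may have lived in transient storage.
  remote_line="$(git ls-remote "${remote}" "refs/heads/${branch}")" || {
    echo "cannot query remote: ${remote}" >&2
    return 3
  }

  # ls-remote prints "<sha><TAB><ref>"; keep the first field only.
  remote_sha="$(printf '%s' "${remote_line}" | cut -f1)"

  if [ -n "${remote_sha}" ] && [ "${local_sha}" = "${remote_sha}" ]; then
    echo "pushed: ${branch} @ ${local_sha}"
    return 0
  fi

  echo "NOT pushed: ${branch} local=${local_sha} remote=${remote_sha:-absent}" >&2
  return 1
}
```

For the anchor case this would be called as `verify_push origin otto-cli/tick-0409z-sentinel-rearm-2026-05-28` from any worktree of the repo. Exit status 1 would indicate that a re-push from a persistent worktree is still needed.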