diff --git a/.claude/rules/backlog-item-start-gate.md b/.claude/rules/backlog-item-start-gate.md index aa23227f6..b664ace4e 100644 --- a/.claude/rules/backlog-item-start-gate.md +++ b/.claude/rules/backlog-item-start-gate.md @@ -14,7 +14,19 @@ complete a checklist directly on the row body: 1. **Prior-art-search** across the existing axes: wake-time-substrate + skill-router + orthogonal-axes + Otto-364 + PR #1701 + decision-archaeology + lost-files - canonical at `tools/hygiene/LOST-FILES-LOCATIONS.md`. + canonical at `tools/hygiene/LOST-FILES-LOCATIONS.md` + **the + curated external prior-art surface at `references/upstreams/`** + (humans who've solved similar problems; mix of cutting-edge + and tried-and-true; prefer explicit-target subtree searches + like `rg "pattern" references/upstreams/postgres/` for focused + prior-art research; `rg` from repo root is also safe — it + respects `.gitignore` and `references/upstreams/*` is gitignored + — but explicit-target is recommended because it documents + which upstream(s) you actually consulted on the backlog row; + see `.claude/rules/references-upstreams-not-our-code-search-excludes.md` + for the full two-modes table and the plain-grep caveats). + The watchlist + category index lives at `docs/UPSTREAM-LIST.md`; + the synthesis notes at `references/notes/`. Log surfaces searched, queries used, results found on the row. 2. **Dependency-restructure**: diff --git a/.claude/rules/references-upstreams-not-our-code-search-excludes.md b/.claude/rules/references-upstreams-not-our-code-search-excludes.md new file mode 100644 index 000000000..1b7b2542f --- /dev/null +++ b/.claude/rules/references-upstreams-not-our-code-search-excludes.md @@ -0,0 +1,284 @@ +# `references/upstreams/` is NOT our code — search/scan operations must exclude it + +Carved sentence: + +> `references/upstreams/` is OTHER PEOPLE'S CODE that we mirror for +> study (regeneratable mirror state, gitignored, never hand-edited). 
+> Any search operation that walks the file tree (`find`, `grep -r`, +> `xargs grep`, recursive-file-walk scripts) MUST exclude +> `references/upstreams/` — otherwise scans run for hours, surface +> false-positives from unrelated upstream code, and pollute results +> with patterns that match in protobuf docs, gRPC tests, Redis +> manifests, etc. **Default to ripgrep** (`rg`) which respects +> `.gitignore` automatically. For plain `grep -r`, use +> `--exclude-dir=upstreams` (basename glob, NOT a path) or an +> explicit allowlist (`memory/ docs/ .claude/ tools/`). For +> `find`, use `-not -path './references/upstreams/*'` (the `find` +> command does NOT support `--exclude-dir`). + +## Operational content + +Per `references/README.md`: + +- `references/upstreams/` is "Disposable mirror state — cloned + upstream repositories used as read-only references. **Gitignored; + regeneratable via script; never hand-edited.**" +- `.gitignore` line: `references/upstreams/*` (with carve-outs for + the directory's own `.gitignore` + `README.md`) +- The mirror state regenerates from `references/reference-sources.json` + via the sync script + +Git operations naturally skip the tree (gitignored). But plain +`find` / `grep -r` / `xargs grep` / custom file-walk scripts do +NOT respect gitignore — they walk the entire filesystem. + +## The failure mode this rule prevents + +The authoring agent 2026-05-15T~13:00Z spawned this search: + +```bash +find . -type f -name "*.md" 2>/dev/null | xargs grep -l \ + "lock.free.*weight.free\|weight.free.*lock.free\|scale.free.*lock.free\|DV2.0\|5 always.active\|root discipline" 2>/dev/null +``` + +`find` quickly listed all `.md` files including the upstream tree. +`xargs grep` then recursed through gigabytes of upstream content +(protobuf docs, gRPC tests, Redis manifests, etc.). The process was +still running 2+ hours later when the human maintainer noticed "11 +shells running is impressive otto" — the shell count surfaced the +runaway grep. 
+ +Result: no useful output (the patterns don't appear in upstream code), +hours of CPU and IO wasted, multiple monitor processes lingering, +genuine pollution of the agent's working state. + +The substrate-honest fix is to encode the discipline so future agents +(and any other Zeta AI) don't hit the same trap. + +## Operational discipline + +When searching the repo for content: + +### Prefer ripgrep (respects gitignore by default) + +```bash +rg "pattern" --type md # markdown only +rg "pattern" docs/ memory/ # explicit allowlist +``` + +### Plain `grep -r` needs explicit excludes (with caveat) + +```bash +# GNU grep --exclude-dir takes a BASENAME glob, not a path — +# so --exclude-dir=upstreams excludes any directory named +# 'upstreams' anywhere in the tree (currently only references/upstreams/). +# If a second 'upstreams/' ever appears that we DO want to search, +# this approach overreaches and we need the explicit-allowlist +# approach below instead. +grep -rn "pattern" \ + --exclude-dir=upstreams \ + --exclude-dir=node_modules \ + --exclude-dir=.git \ + --exclude-dir=bin --exclude-dir=obj \ + memory/ docs/ .claude/ tools/ +``` + +**Caveat**: GNU `grep`'s `--exclude-dir=GLOB` matches directory +*names* (basename), NOT slash-delimited paths. So +`--exclude-dir=references/upstreams` does NOT work (silently +matches nothing). Use the basename `upstreams` instead, OR use +explicit-allowlist sub-paths (`memory/ docs/ .claude/ tools/`) +which sidestep the issue entirely. + +**Better**: just use `rg` — it respects `.gitignore` by default +and `references/upstreams/*` is already gitignored. + +### `find | xargs grep` is the worst trap + +The `find` step is fast because it only lists paths; the `xargs grep` step +then has to read everything on that list. Even with `-l` (list-only), grep +opens each file and scans until first match — and on a tree of +gigabytes of upstream content, that's hours. + +Fix: filter the `find` output BEFORE passing to xargs: + +```bash +find .
-type f -name "*.md" \ + -not -path "./references/upstreams/*" \ + -not -path "./node_modules/*" \ + -not -path "./.git/*" \ + 2>/dev/null | xargs grep -l "pattern" 2>/dev/null +``` + +### Explicit allowlists beat exclude lists + +When you know which directories to search, name them. Don't start +from `.` and try to exclude — you'll always miss something. + +**Repo-native content lives in**: + +- `memory/` — agent memory + persona folders + conversations +- `docs/` — governance, research, hygiene-history, history +- `.claude/` — rules, skills, agents, commands, hooks, settings +- `tools/` — TS scripts (per Rule 0; no `.sh` except install-graph) +- Top-level files (CLAUDE.md, README, GOVERNANCE.md, AGENTS.md, + ALIGNMENT.md, MANIFESTO.md target locations) + +**Non-repo content** (skip in searches by default): + +- `references/upstreams/` — gitignored mirror state of OTHER repos +- `references//` — legacy imports (e.g., `tla-book/`) + with their own file layouts; check `references/README.md` for + current state +- `node_modules/` — npm dependencies +- `bin/`, `obj/` — .NET build outputs +- `target/` — Rust build outputs +- `.git/` — git internals + +## When `references/upstreams/` IS the right search target + +Not rare — actually a **first-class workflow** during backlog +research. Per the human maintainer 2026-05-15: *"when doing +backlog items this is a good place to know about humans whoved +solved similar issues i've been gathering their githubs so we +can learn when doing our backlog itmes. some of these are very +cutting edge and some are tried and true been around for years."* + +`references/upstreams/` is the curated **prior-art surface** — +humans who've solved similar problems, mirrored as read-only +references. When starting a backlog item, consulting the relevant +upstream(s) is encouraged and composes with +`.claude/rules/backlog-item-start-gate.md` (prior-art-search step). 
+ +**The two modes are not in tension:** + +| Mode | Pattern | Treatment | +|---|---|---| +| **Backlog prior-art research** (explicit-target) | `rg "pattern" references/upstreams/postgres/` | Encouraged; one of the curated prior-art surfaces; log queries on the backlog row | +| **Unconstrained repo scan with plain `grep -r`** or `find . \| xargs grep` | (`grep -rn "pattern" .`) | MUST pass `--exclude-dir=upstreams` (for `find`, `-not -path './references/upstreams/*'`); otherwise runaway-scan failure mode | +| **Unconstrained repo scan with ripgrep** | `rg "pattern" .` | Safe-by-default — ripgrep respects `.gitignore`, and `references/upstreams/*` is already gitignored | + +Other legitimate explicit-target reasons: + +- Verifying that an upstream actually contains a feature we + attribute to it (e.g., "does Spanner actually do X?") +- Auditing for license-text or attribution requirements when + taking an upstream excerpt into `references/notes/` + +**Discovery surfaces for upstream prior-art:** + +- `docs/UPSTREAM-LIST.md` — curated watchlist + category index +- `references/notes/` — synthesis notes ("what matters from each + upstream"); start here before grepping the mirror +- `references/reference-sources.json` — full source list + +**Refresh the mirror on demand:** + +```bash +tools/setup/common/sync-upstreams.sh # refresh all +tools/setup/common/sync-upstreams.sh --name foo,bar # subset +tools/setup/common/sync-upstreams.sh --prune # drop stale +``` + +The script reads `references/reference-sources.json`, shallow-clones +or fast-fetches each upstream into `references/upstreams//`, +and resets-hard to match `origin/` byte-for-byte. Safe to +re-run; `ls-remote` short-circuits when local HEAD already matches. + +In all cases, **EXPLICITLY target the subtree** when grepping +the mirror: + +```bash +rg "pattern" references/upstreams/spanner/ +``` + +— never start from `.` and let the recursion drift.
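
For the `find`-based row of the two-modes table, here is a minimal sketch, assuming a POSIX `find` and `grep`. The `-not -path` filter shown earlier in this rule only discards paths after `find` has already walked the mirror, whereas `-prune` stops the descent itself. The helper name `search_md` and its arguments are hypothetical and not part of the repo.

```bash
# Hypothetical helper (name and arguments are illustrative, not an existing tool).
# Prints the markdown files under a root that contain a pattern, without ever
# descending into the upstream mirror, node_modules, or git internals.
# Pass the root without a trailing slash so the -path tests line up.
search_md() {
  pattern=$1
  root=${2:-.}
  find "$root" \
    \( -path "$root/references/upstreams" \
       -o -path "$root/node_modules" \
       -o -path "$root/.git" \) -prune \
    -o -type f -name '*.md' -exec grep -l -e "$pattern" {} +
}
```

Usage would be something like `search_md "root discipline" .`. Using `-exec ... {} +` instead of `xargs` also sidesteps filenames with spaces. Note that `find` exits non-zero when `grep` finds nothing, which matters for callers running under `set -e`.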
+ +## Why this rule auto-loads + +Per `.claude/rules/wake-time-substrate.md`: load-bearing search- +hygiene knowledge needs wake-time landing. Without this rule: + +- A future agent cold-booting may spawn the same kind of runaway + search (the failure mode is operationally tempting because + `find | xargs grep` is the obvious one-liner for the search + problem) +- Other Zeta agents searching the repo for the first time inherit + the gap by default +- The 2-hour-grep evidence (this rule's authoring trigger) shows + the failure mode is real and recurring + +## Composes with other rules + +- `.claude/rules/rule-0-no-sh-files.md` — TS-first discipline; + search wrappers should be `.ts` not `.sh` (compose: when + authoring a search tool, use ripgrep via Bun child_process and + encode the exclusion list there) +- `.claude/rules/wake-time-substrate.md` — load-bearing search- + hygiene needs auto-loaded landing +- `.claude/rules/refresh-world-model-poll-pr-gate.md` — prefer + scripted commands (like ripgrep wrappers) over ad-hoc bash; + same shape as `gh|jq` chains +- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md` + — runaway searches LOOK like work-in-progress but produce no + output; same shape as Standing-by failure mode (operational + noise without operational signal) + +## Composes with substrate + +- `references/README.md` — canonical definition of what + `references/upstreams/` is +- `references/reference-sources.json` — the upstream watchlist + the mirror tree regenerates from +- `.gitignore` — line `references/upstreams/*` is the existing + git-level enforcement +- The substrate-honest failure-mode anchor: the 2-hour-grep + evidence on 2026-05-15 that authored this rule + +## Substrate-honest framing + +This rule does NOT prevent the failure mode at the tool level. +Plain `grep -r` and `find` will still walk `references/upstreams/` +if invoked without exclusions. 
The rule encodes the DISCIPLINE +that the agent applies; mechanizing it further would require: + +- A shell function or alias that wraps grep with the standard + exclusions +- A `.ripgreprc` file at repo root (ripgrep already respects + gitignore so this is mostly redundant) +- A pre-Bash hook that catches the `find | xargs grep` pattern + and warns or rewrites it + +Those are future-substrate options. This rule is the discipline- +level landing that catches the failure mode at cold-boot until +mechanization lands. + +## Full reasoning + +The human maintainer 2026-05-15T~15:25Z, after observing the +runaway grep process: *"references/upstreams/ in code we ignore +this folder everywhere casue its not our code but other githubs +we reference for ideas"* + +The 2-hour-grep evidence trail: + +1. An agent authored a manifesto-search bash one-liner using + `find | xargs grep -l` with no exclusion of `references/upstreams/` +2. find quickly returned all matching paths (including the upstream + tree); xargs grep recursed +3. The grep was still running 2 hours later when the human + maintainer's "11 shells running" observation surfaced it +4. The cause: plain `grep` doesn't respect gitignore; only ripgrep + does +5. Cleanup: explicit `pgrep | xargs kill` of the runaway grep + processes; 11 shells → 0 +6. Substrate-landing: this rule encodes the discipline so the + failure mode doesn't recur + +The substrate-honest meta-note: the failure mode was rooted in +agent-side ignorance of the `references/upstreams/` convention. +The maintainer caught it via shell-count observation, taught the +discipline, and the rule now lands so future agents don't repeat +the trap. That's the bandwidth-engineering pattern operating at +discipline-level: one piece of teaching → durable substrate that +serves future cold-boots without re-teaching. 
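
The first mechanization option listed under "Substrate-honest framing" (a shell function that wraps grep with the standard exclusions) could look roughly like the sketch below. It assumes GNU `grep`, whose `--exclude-dir` takes a basename glob, and the function name `repo_grep` is hypothetical. Nothing here is an existing repo tool.

```bash
# Hypothetical wrapper (the name repo_grep is illustrative, not an existing tool).
# Runs a recursive grep with the standard exclusions named in this rule.
# Assumes GNU grep: --exclude-dir matches directory basenames, not paths.
repo_grep() {
  pattern=$1
  shift
  # Default to the current directory when no search roots are given.
  [ "$#" -gt 0 ] || set -- .
  grep -rn \
    --exclude-dir=upstreams \
    --exclude-dir=node_modules \
    --exclude-dir=.git \
    --exclude-dir=bin --exclude-dir=obj \
    -e "$pattern" "$@"
}
```

A call such as `repo_grep "pattern" memory/ docs/ .claude/ tools/` combines the exclusions with the explicit allowlist. The basename caveat from the rule still applies: any directory named `upstreams`, `bin`, or `obj` is skipped wherever it appears.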
diff --git a/CLAUDE.md b/CLAUDE.md index 0eb131949..beee18fa2 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -45,5 +45,16 @@ See [`docs/CONFLICT-RESOLUTION.md`](docs/CONFLICT-RESOLUTION.md). On deadlock, t - **Result-over-exception** — errors surface as `Result<_, DbspError>`; no exceptions on hot paths. - **Memory fast-path** — read `~/.claude/projects//memory/CURRENT-*.md` before raw `feedback_*.md` logs; CURRENT files win on conflict with older raw memories. +- **`references/upstreams/` — explicit-target searches ONLY (curated prior-art surface, NOT our code).** + Mirror state of OTHER repos (protobuf, gRPC, Redis, etc.); gitignored; gigabytes; the only + folder where a naive plain `grep -r` or `find | xargs grep` from `.` becomes a 2-hour runaway. + BUT also the curated prior-art surface for backlog-item research — humans who've solved similar + problems, cutting-edge + tried-and-true. Two modes: **explicit-target encouraged** + (`rg "pattern" references/upstreams/postgres/` during backlog research; check + `docs/UPSTREAM-LIST.md` + `references/notes/` first); **unconstrained scan needs the right tool** + — `rg "pattern" .` is safe-by-default (ripgrep respects gitignore), but plain `grep -r` needs + `--exclude-dir=upstreams` (basename, NOT a path) or an explicit allowlist + (`memory/ docs/ .claude/ tools/`). Refresh the mirror on demand: `tools/setup/common/sync-upstreams.sh`. + Full: `.claude/rules/references-upstreams-not-our-code-search-excludes.md`. - **Thoughts free, actions razored** — journal to `memory/` freely; CLAUDE.md additions are razored (cooling-period required, disposition-shaping bar). Full: `memory/feedback_thoughts_free_actions_razored_*`.