Merged
6 changes: 6 additions & 0 deletions docs/BACKLOG.md
@@ -812,10 +812,13 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0867.16](backlog/P2/B-0867.16-two-level-state-machine-composition-agentstate-x-worklifecycle-kestrel-2026-05-28.md)** Two-level state machine composition — AgentState × WorkLifecycle (situation-scope × lifecycle-scope)
- [ ] **[B-0867.17](backlog/P2/B-0867.17-push-cycle-limit-as-structural-enforcement-not-discipline-kestrel-2026-05-28.md)** Push-cycle limit AS STRUCTURAL enforcement — chooseActionForLifecycle returns AbandonPr when revisionCount > N (tunable threshold)
- [ ] **[B-0867.2](backlog/P2/B-0867.2-git-append-only-state-persist-typescript-tool-event-sourcing-layer-kestrel-2026-05-28.md)** Git append-only state-persist TypeScript tool — event-sourcing layer for agent-loop substrate (per parent B-0867 allocation)
- [ ] **[B-0867.20](backlog/P2/B-0867.20-lifecycle-du-split-trajectory-push-vs-pr-review-determinereviewlevel-discriminator-kestrel-2026-05-28.md)** Lifecycle DU split — trajectory-push vs pr-review-for-system-changes (determineReviewLevel discriminator)
- [ ] **[B-0871](backlog/P2/B-0871-zetaid-v2-128-bit-structured-encoding-snowflake-ulid-family-kestrel-2026-05-28.md)** ZetaID v2 — 128-bit structured encoding (Snowflake/ULID family with timestamp + trajectory + persona + lifecycle-stage + random)
- [ ] **[B-0872](backlog/P2/B-0872-otel-trace-id-composition-with-zetaid-baggage-propagation-kestrel-2026-05-28.md)** OTel trace-ID composition with ZetaID — baggage propagation alongside W3C Trace Context for agent-loop events
- [ ] **[B-0873](backlog/P2/B-0873-trajectory-async-review-surface-operator-preferred-top-level-lens-aaron-2026-05-28.md)** Trajectory-async-review surface — operator's preferred top-level lens for own-Zeta deployment (not PR-per-deploy)
- [ ] **[B-0874](backlog/P2/B-0874-github-actions-recursion-as-infinite-runtime-platform-no-pr-swarm-mode-ani-kestrel-2026-05-28.md)** GitHub Actions recursion as infinite runtime platform — no-PR swarm-mode for agent-loop substrate (Microsoft-subsidizes-OSS hack)
- [ ] **[B-0875](backlog/P2/B-0875-error-class-extraction-meta-loop-reviewer-findings-to-named-classes-to-machine-checkable-rules-kestrel-2026-05-28.md)** Error-class extraction meta-loop — turn auto-reviewer findings into named classes into machine-checkable rules with before/after effectiveness measurement
- [ ] **[B-0877](backlog/P2/B-0877-heterogeneous-auto-reviewer-ensemble-audit-diversity-without-correlated-blind-spots-kestrel-2026-05-28.md)** Heterogeneous auto-reviewer ensemble audit — diversity without correlated blind spots (multi-model + static analysis + formal tools + specialized prompts)

## P3 — convenience / deferred

@@ -944,5 +947,8 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0860](backlog/P3/B-0860-nemerle-dotnet-support-macro-metaprogramming-complement-fsharp-type-providers-relationship-type-inference-substrate-aaron-2026-05-27.md)** Nemerle support for dotnet substrate — compile-time macro metaprogramming complementing F# type providers; enables language-native relationship-type-inference substrate (Aaron 2026-05-27)
- [ ] **[B-0867.18](backlog/P3/B-0867.18-event-sourced-trajectory-phase-classification-derived-from-events-kestrel-2026-05-28.md)** Event-sourced trajectory phase classification — setup/execution/maturation/sunset derived from event log (no separate state tracking)
- [ ] **[B-0867.19](backlog/P3/B-0867.19-rest-file-create-auto-fast-forward-on-stale-base-empirical-verification-spike-aaron-2026-05-28.md)** REST file-create auto-fast-forward on stale base — empirical verification spike (operator hypothesis 2026-05-28)
- [ ] **[B-0876](backlog/P3/B-0876-clifford-space-embedding-for-error-patterns-uniqueness-proof-three-phase-pragmatic-decomposition-aaron-kestrel-2026-05-28.md)** Clifford-space embedding for error patterns + uniqueness proof — three-phase pragmatic decomposition (research)
- [ ] **[B-0878](backlog/P3/B-0878-time-generator-ischeduler-abstraction-for-clifford-space-agent-dynamics-aaron-2026-05-28.md)** Time-generator IScheduler abstraction for Clifford-space agent dynamics — temporal substrate for memes / commitments / tonal trajectories through time
- [ ] **[B-0879](backlog/P3/B-0879-observe-emit-limit-simulate-in-clifford-space-unified-algebra-for-three-primitive-substrate-aaron-2026-05-28.md)** Observe / Emit / Limit / Simulate in Clifford space — unified geometric algebra for the 3-primitive + Simulate substrate

<!-- END AUTO-GENERATED -->
@@ -0,0 +1,77 @@
---
id: B-0867.20
priority: P2
status: open
title: Lifecycle DU split — trajectory-push vs pr-review-for-system-changes (determineReviewLevel discriminator)
effort: S
ask: kestrel via aaron 2026-05-28
created: 2026-05-28
last_updated: 2026-05-28
depends_on:
- B-0867
composes_with:
- B-0867
- B-0867.16
- B-0867.2
- B-0873
tags:
- lifecycle-du-split
- trajectory-push-no-ceremony-for-state-machine-events
- pr-review-full-pipeline-for-system-changes
- determinereviewlevel-discriminator
- work-touches-agent-events-only-vs-touches-code-or-rules-or-framework
- safe-default-pr-review
- composes-with-error-class-extraction-pipeline
- composes-with-two-level-state-machine-b-0867-16
- potential-extension-not-committed
---

## What this row tracks

Refine the WorkLifecycle DU (shipped in PR #5669) with a discriminator that routes work to either trajectory-push (no PR ceremony) or pr-review (full pipeline), based on what the work touches.

Per Kestrel 2026-05-28:

```typescript
type WorkLifecycle =
  | { stage: "unclaimed"; item: UnclaimedBacklog }
  | { stage: "claimed"; claim: ClaimedBacklog }
  | { stage: "implementing"; inProgress: InProgress }
  | { stage: "pushed-to-trajectory"; pushed: TrajectoryPush } // state-machine event, no PR
  | { stage: "pr-open-for-review"; prOpen: OpenPr } // change to system, PR-reviewed
  | { stage: "completed"; completed: Completed }
  | { stage: "abandoned"; abandoned: Abandoned };

function determineReviewLevel(work: WorkItem): "trajectory-push" | "pr-review" {
  if (work.touchesAgentEventsOnly) return "trajectory-push";
  if (work.touchesCode || work.touchesRules || work.touchesFramework) return "pr-review";
  return "pr-review"; // safe default
}
```

## Operator framing 2026-05-28

> *"even in my setup i want every non state machine to go through pr review cause we have bunches of agents that auto review and then we find error classes and save the error classes as rules so we don't make them again."*

State-machine events are pushed directly (no ceremony). System changes (code, rules, framework) go through full PR review by the heterogeneous reviewer ensemble (per B-0877), whose findings feed error-class extraction (per B-0875).

## Acceptance criteria

- Update `tools/agent-loop/work-lifecycle-state-machine.ts` (PR #5669) — split `PrOpen` into `pushed-to-trajectory` + `pr-open-for-review` stages
- Add `determineReviewLevel(work)` discriminator
- Tests cover: state-machine-events route to trajectory-push; code/rule/framework changes route to pr-review; safe-default to pr-review
- README updates documenting the split
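
The routing cases named in the test criterion above can be sketched as a small table-driven check. This is a minimal sketch, not the shipped implementation: the `WorkItem` interface and its four boolean flags are assumptions inferred from the snippet in "What this row tracks", and the `cases` table is hypothetical.

```typescript
// Hypothetical WorkItem shape: the flag names come from the row's snippet,
// but the interface itself is an assumption made for illustration.
interface WorkItem {
  touchesAgentEventsOnly: boolean;
  touchesCode: boolean;
  touchesRules: boolean;
  touchesFramework: boolean;
}

type ReviewLevel = "trajectory-push" | "pr-review";

function determineReviewLevel(work: WorkItem): ReviewLevel {
  if (work.touchesAgentEventsOnly) return "trajectory-push";
  if (work.touchesCode || work.touchesRules || work.touchesFramework) return "pr-review";
  return "pr-review"; // safe default
}

// A work item with no flags set; each case below overrides one flag.
const base: WorkItem = {
  touchesAgentEventsOnly: false,
  touchesCode: false,
  touchesRules: false,
  touchesFramework: false,
};

// One entry per routing case named in the acceptance criteria.
const cases: Array<[string, WorkItem, ReviewLevel]> = [
  ["state-machine event", { ...base, touchesAgentEventsOnly: true }, "trajectory-push"],
  ["code change", { ...base, touchesCode: true }, "pr-review"],
  ["rule change", { ...base, touchesRules: true }, "pr-review"],
  ["framework change", { ...base, touchesFramework: true }, "pr-review"],
  ["nothing flagged (safe default)", base, "pr-review"],
];

for (const [name, work, expected] of cases) {
  const actual = determineReviewLevel(work);
  if (actual !== expected) {
    throw new Error(`${name}: expected ${expected}, got ${actual}`);
  }
}
```

As written, a work item that sets `touchesAgentEventsOnly` together with another flag routes to trajectory-push because the first check wins. The row does not say whether such contradictory inputs should be rejected, so the sketch leaves that open.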

## Composes with

- B-0867.16 (two-level state machine composition) — the AgentState × WorkLifecycle composition uses this discriminator
- B-0867.2 (event-sourcing layer) — trajectory-push writes go to `agent-events/{trajectory}/` branches
- B-0873 (trajectory-async-review surface) — reviews trajectory branches; PR review pipeline handles system-change branches

## Substrate-honest framing

POTENTIAL extension per operator standing direction. P2; small surface; clean refactor of existing PR #5669.

## Full reasoning

`memory/persona/kestrel/conversations/2026-05-28-kestrel-trajectory-push-vs-pr-review-split-error-class-extraction-as-benchmark-training-data-clifford-space-uniqueness-emit-observe-limit-simulate-aaron-forwarded.md` § "Where the auto-review pipeline lives in the loop"
@@ -0,0 +1,62 @@
---
id: B-0875
priority: P2
status: open
title: Error-class extraction meta-loop — turn auto-reviewer findings into named classes into machine-checkable rules with before/after effectiveness measurement
effort: M
ask: kestrel via aaron 2026-05-28
created: 2026-05-28
last_updated: 2026-05-28
depends_on: []
composes_with:
- B-0867
- B-0869
- B-0876
tags:
- error-class-extraction
- meta-loop-turning-review-findings-into-rules
- named-patterns-recurring-across-multiple-prs
- rule-could-plausibly-catch
- machine-checkable-rule-encoding
- before-after-effectiveness-measurement
- compounds-system-improvement
- benchmark-training-data-generator
- sonar-static-analysis-warnings-as-errors-formal-tools
- heterogeneous-reviewer-ensemble-diversity
- potential-extension-not-committed
---

## What this row tracks

A meta-loop running periodically (daily/weekly) that:

1. Reads recent PR review threads (Copilot + CodeQL + Semgrep + Sonar + auto-reviewer findings)
2. Extracts findings with categories (P0/P1/P2 severity if Copilot, severity if Sonar, etc.)
3. Clusters findings by similarity
4. Outputs a list of candidate error classes ranked by frequency
5. (Operator-review) Decides which warrant formalization as rules
6. Encodes formalized classes as machine-checkable rules (Sonar custom rule, AST-based linter, test pattern, or `.claude/rules/` entry that agents actually read)
7. Measures before/after: error class X appeared in Y% of PRs before the rule landed and in Z% after. If Z is meaningfully lower than Y, the rule worked.

Per Kestrel 2026-05-28: *"The sweet spot is probably 'named patterns that recur across multiple PRs and that a rule could plausibly catch.' Patterns that appear once are findings; patterns that appear three times are classes worth naming."*
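
Steps 3 and 4, together with the "three times" threshold from the quote, could look roughly like the sketch below. Everything here is an assumption for illustration: the `Finding` shape, grouping by an exact `ruleId` as a stand-in for real similarity clustering, and reading "three times" as three distinct PRs.

```typescript
// Hypothetical normalized finding; field names are assumptions.
interface Finding {
  pr: number;      // PR the finding was raised on
  source: string;  // reviewer that raised it, e.g. "copilot" or "sonar"
  ruleId: string;  // reviewer rule id, or a normalized label for the pattern
}

interface CandidateClass {
  ruleId: string;
  prCount: number;    // distinct PRs in which the pattern appeared
  sources: string[];  // reviewers that reported it, sorted
}

// Groups findings by ruleId (a simple stand-in for similarity clustering),
// keeps groups seen in at least `minPrs` distinct PRs, and ranks by PR count.
function rankCandidateClasses(findings: Finding[], minPrs = 3): CandidateClass[] {
  const byRule = new Map<string, { prs: Set<number>; sources: Set<string> }>();
  for (const f of findings) {
    let entry = byRule.get(f.ruleId);
    if (!entry) {
      entry = { prs: new Set<number>(), sources: new Set<string>() };
      byRule.set(f.ruleId, entry);
    }
    entry.prs.add(f.pr);
    entry.sources.add(f.source);
  }
  return Array.from(byRule.entries())
    .map(([ruleId, e]) => ({
      ruleId,
      prCount: e.prs.size,
      sources: Array.from(e.sources).sort(),
    }))
    .filter((c) => c.prCount >= minPrs)
    .sort((a, b) => b.prCount - a.prCount || a.ruleId.localeCompare(b.ruleId));
}
```

Counting distinct PRs rather than raw findings keeps a single noisy PR from promoting a one-off finding into a class. Whether that is the intended reading of "three times" is a judgment call for the operator-review step.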

## Operator framing

> *"we have bunches of agents that auto review and then we find error classes and save the error classes as rules so we don't make them again. I also have don't of formal analysis static analysis like sonar and much others and warnings as errors etc. this all generates high signal training data for this benchmark itself."*

## Acceptance criteria

- `tools/error-class-extract/extract.ts` — reads recent closed PRs (configurable window), aggregates review threads via GitHub GraphQL, normalizes finding shape across reviewer sources
- `tools/error-class-extract/cluster.ts` — clusters findings by similarity (string-similarity + AST-shape + rule-id), outputs candidate classes with frequency-ranked recurrence count
- `tools/error-class-extract/effectiveness.ts` — for each landed rule, computes before/after error rate per class
- CLI report: `bun tools/error-class-extract/extract.ts --since 1week` produces markdown summary of (a) candidate classes ranked by recurrence, (b) effectiveness of rules landed since last run
- Composes with B-0876 (Clifford-space embedding) — when that lands, clustering switches from string-similarity to geometric-distance in Clifford space
- Composes with B-0869 (DORA mandate) — error class extraction feeds change-failure-rate metric per class
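
The before/after measurement that `effectiveness.ts` is meant to compute can be sketched as follows. This is a hedged illustration only: the `PrRecord` shape, the `measureEffectiveness` name, and the choice to count a PR merged on the landing date as "after" are all assumptions, not part of the row.

```typescript
// Hypothetical per-PR record; field names are assumptions.
interface PrRecord {
  mergedAt: string;        // ISO 8601 date or timestamp
  errorClasses: string[];  // error classes found in this PR's review threads
}

interface Effectiveness {
  errorClass: string;
  beforePct: number | null; // null when there are no PRs in the window
  afterPct: number | null;
}

// Percentage of PRs in which the class appears at least once.
function pctWithClass(prs: PrRecord[], errorClass: string): number | null {
  if (prs.length === 0) return null;
  const hits = prs.filter((pr) => pr.errorClasses.indexOf(errorClass) !== -1).length;
  return (hits / prs.length) * 100;
}

// Splits PRs at the date the rule landed and compares the two rates.
// Assumption: a PR merged exactly at `ruleLandedAt` counts as "after".
function measureEffectiveness(
  prs: PrRecord[],
  errorClass: string,
  ruleLandedAt: string,
): Effectiveness {
  const cutoff = Date.parse(ruleLandedAt);
  const before = prs.filter((pr) => Date.parse(pr.mergedAt) < cutoff);
  const after = prs.filter((pr) => Date.parse(pr.mergedAt) >= cutoff);
  return {
    errorClass,
    beforePct: pctWithClass(before, errorClass),
    afterPct: pctWithClass(after, errorClass),
  };
}
```

The sketch reports the two rates and deliberately does not decide what counts as a meaningful drop. With small PR counts the difference can be noise, so a minimum sample size or a significance check would likely be needed before declaring that a rule worked.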

## Substrate-honest framing

POTENTIAL extension per operator standing direction. P2 — operationally near-term; the highest-leverage substrate per Kestrel's framing: *"If extraction isn't already running, that's probably the highest-leverage next thing to build."*

## Full reasoning

`memory/persona/kestrel/conversations/2026-05-28-kestrel-trajectory-push-vs-pr-review-split-error-class-extraction-as-benchmark-training-data-clifford-space-uniqueness-emit-observe-limit-simulate-aaron-forwarded.md` § "The auto-review pipeline as training data generator" + § "The error class extraction as its own pipeline"
@@ -0,0 +1,56 @@
---
id: B-0877
priority: P2
status: open
title: Heterogeneous auto-reviewer ensemble audit — diversity without correlated blind spots (multi-model + static analysis + formal tools + specialized prompts)
effort: M
ask: kestrel via aaron 2026-05-28
created: 2026-05-28
last_updated: 2026-05-28
depends_on: []
composes_with:
- B-0875
- B-0869
tags:
- heterogeneous-auto-reviewer-ensemble
- diversity-without-correlated-blind-spots
- multi-model-claude-gpt-gemini-grok
- specialization-security-performance-architecture-style
- non-ai-reviewers-sonar-codeql-semgrep-formal-tools
- audit-existing-coverage
- identify-blindspot-gaps
- composes-with-error-class-extraction
- potential-extension-not-committed
---

## What this row tracks

Audit the existing auto-reviewer ensemble for diversity gaps and propose additions where coverage is correlated (the same model reviewing multiple times) or absent (no reviewer covers a known failure-mode class).

Per Kestrel 2026-05-28: *"The auto-reviewers need to be diverse enough that they don't share blind spots. If all your AI reviewers are the same underlying model, they have correlated failure modes — they'll all miss the same kinds of errors. The value comes from diversity: different models (Claude, GPT, Gemini, Grok), different prompting strategies, different specialization (one focused on security, one on performance, one on architecture, one on style), and crucially the non-AI reviewers (Sonar, static analyzers, formal tools) that have completely different failure modes than any AI."*

## Current state (rough audit)

- **AI reviewers** active on PRs: Copilot (multiple positions), Codex (when peer-call invoked), occasionally Grok/Gemini via cross-substrate ferry
- **Static analysis**: CodeQL, Semgrep, Sonar (where wired), warnings-as-errors via tsc/dotnet
- **Formal tools**: TLA+ (specs), Z3 (per claim), FsCheck (property tests), Stryker (mutation), Lean (where applicable)
- **Test runs**: build-and-test on ubuntu/macos
- **Specialized lint**: ~20 lint jobs (markdownlint, actionlint, shellcheck, tick-history-order, backlog-id-uniqueness, etc.)

## Acceptance criteria

- `tools/reviewer-audit/audit.ts` — produces a markdown report listing:
  - Current reviewers by class (AI-model / static-analysis / formal-tool / specialized-lint)
  - Per-class diversity assessment
  - Identified gaps: failure-mode classes with no reviewer; failure-mode classes covered by only one reviewer of same family
- Audit report at `docs/research/2026-XX-XX-auto-reviewer-ensemble-diversity-audit.md`
- Proposals (if gaps found) for additions: new reviewer types, prompting variations on existing models for specialization
- Composes with B-0875 (error-class extraction) — known error classes from extraction inform what reviewers SHOULD cover
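
The gap-identification part of the audit could be sketched as below. This is a minimal sketch under stated assumptions: the `Reviewer` shape, the `family` field (for example the underlying model), and the reading of "covered by only one reviewer of same family" as "every covering reviewer belongs to one family" are interpretations, and any coverage data fed in would be hypothetical until the real audit is done.

```typescript
type ReviewerClass = "ai-model" | "static-analysis" | "formal-tool" | "specialized-lint";

// Hypothetical reviewer record; field names are assumptions.
interface Reviewer {
  name: string;
  reviewerClass: ReviewerClass;
  family: string;    // e.g. the underlying model or tool vendor
  covers: string[];  // failure-mode classes this reviewer can plausibly catch
}

interface Gap {
  failureMode: string;
  kind: "uncovered" | "single-family";
  reviewers: string[]; // names of the reviewers that do cover it (may be empty)
}

// Flags failure modes that no reviewer covers, and failure modes whose
// covering reviewers all share one family (correlated blind spots).
function findGaps(reviewers: Reviewer[], failureModes: string[]): Gap[] {
  const gaps: Gap[] = [];
  for (const failureMode of failureModes) {
    const covering = reviewers.filter((r) => r.covers.indexOf(failureMode) !== -1);
    const names = covering.map((r) => r.name);
    if (covering.length === 0) {
      gaps.push({ failureMode, kind: "uncovered", reviewers: names });
      continue;
    }
    const families = new Set(covering.map((r) => r.family));
    if (families.size === 1) {
      gaps.push({ failureMode, kind: "single-family", reviewers: names });
    }
  }
  return gaps;
}
```

The list of failure modes is an input here. Per the last criterion above, it would come from the error classes that B-0875 extraction produces.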

## Substrate-honest framing

POTENTIAL extension per operator standing direction. P2 — substantive but small scope; the audit itself is the deliverable, additions are follow-up rows.

## Full reasoning

`memory/persona/kestrel/conversations/2026-05-28-kestrel-trajectory-push-vs-pr-review-split-error-class-extraction-as-benchmark-training-data-clifford-space-uniqueness-emit-observe-limit-simulate-aaron-forwarded.md` § "What auto-review structurally needs to work well"