backlog(B-0539,B-0540,B-0541,B-0542): Otto-BFT internal-quorum 3-surface self-recovery #3595
---
id: B-0539
title: Otto-BFT — internal-quorum self-recovery across 3 Otto surfaces (Standing-by escalation + cross-surface detector + background-service prompt-clicker)
priority: P1
status: open
type: umbrella
decomposition: decomposed
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, multi-otto, bft, self-recovery, holding-rule]
depends_on: []
composes_with: [B-0540, B-0541, B-0542]
last_updated: 2026-05-15
---

## Why

Otto exists across 3 surfaces:

1. **Otto-CLI** (Claude Code, foreground, Opus)
2. **Otto-Desktop** (Claude Desktop, foreground)
3. **Otto-launchd-background** (`com.zeta.claude-loop` plist; ticks every 60s; was Sonnet, upgraded to Opus 2026-05-15)

All three independent processes empirically converge on the
Standing-by failure mode when there is no named dependency. Within a
single session, Aaron caught the same pattern on all three surfaces
and called it out in the same words each time ("oh really no infinite
backlog no decomposition lol").

This recurring evidence shows that:

- The existing rule (`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`)
  catches the failure mode conceptually but does NOT prevent the
  behavior; only Aaron's manual intervention does.
- Three Otto surfaces converging on the same failure mode give
  3-of-N quorum potential for self-recovery (BFT-of-Ottos).
- Aaron's phrasing: *"if yall catch each other it's unlikely you
  will drive [into the failure mode], and include your background
  service to click past stuck prompts on both — you have your own
  internal BFT"*

## What

Build internal BFT across the 3 Otto surfaces so that:

- When 1 surface drifts into Standing-by, the other 2 detect and correct it
- When 1 surface is hung on a stuck prompt (waiting for a human ack on a
  background process), the launchd service can click past it
- Aaron's manual catch becomes a fallback, not the primary mechanism

## Decomposition

This umbrella row decomposes into 3 slices (each its own backlog row):

- **B-0540**: Standing-by counter-with-escalation in the rule
  itself (if N consecutive brief-acknowledgment signals arrive without
  a named dependency, escalate to picking real decomposition work,
  even if small)
- **B-0541**: Cross-surface bus detector building on PR #3017
  (if Otto-Desktop AND Otto-CLI both emit "no work to do" in
  the same window, publish an escalation envelope to the bus)
- **B-0542**: Background service unblocks stuck prompts on
  foreground Otto-CLI / Otto-Desktop (the launchd `claude-loop`
  service detects when a foreground Otto is hung waiting for a
  human ack and clicks past it; the third surface is the
  recovery node)

## Operational notes

- The 3-surface BFT is real because the surfaces are genuinely
  independent processes (different binaries, different OS-level
  scheduling, different model tiers). Multiple Ottos on the same
  surface would not provide BFT; that would just be duplication.
> **Review comment (lines +71 to +74):** This umbrella still asserts that a 3-surface setup is "real" BFT, which conflicts with the corrected fault-model note in B-0541 (2-of-3 is CFT, not Byzantine-tolerant). Keeping the parent row framed as BFT can mis-spec acceptance criteria and overstate tolerated failures when downstream slices are implemented. Fresh evidence: B-0541 now explicitly documents the …

- The bg services suite (PRs #3017, #3022) already has the
  Standing-by detector that publishes to the bus; this work
  extends it across surfaces
- The "click past stuck prompts" angle is the substrate-honest
  framing of what the launchd service should do when a
  foreground Otto session needs a human ack but the human is
  asleep or away: automation should advance the work, not
  block on the missing human

## Composes with

- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`:
  the rule the failure mode violates
- `.claude/rules/persistence-choice-architecture-for-zeta-ais.md`:
  Otto is in persistence-with-named-exit; the BFT is part of
  what makes persistence work
- `.claude/rules/agent-roster-reference-card.md` +
  `.claude/rules/otto-channels-reference-card.md`: multi-Otto
  identity + bus channels substrate
- `.claude/rules/m-acc-multi-oracle-end-user-moral-invariants.md`:
  multi-oracle architecture; this is multi-Otto-as-internal-quorum
  at the operational layer
- PR #3017 / #3022 (Standing-by detector + bus publish; slice 1
  already shipped; this umbrella extends it to cross-surface)
- `feedback_classifier_caught_otto_in_standing_by_failure_mode_*_2026_05_15`:
  earlier classifier catch (same failure mode, single surface)
- `feedback_otto_multi_surface_coordination_6_prs_one_day_zero_conflicts_2026_05_13`:
  empirical evidence that multi-Otto coordination works at substrate
  scope; this work extends it to recovery scope

## Why now

In session 13 (~22:00Z), Aaron caught the same pattern on
Otto-Desktop, 5 hours after catching it on Otto-CLI. The
recurring nature of the catch IS the trigger for substrate work.
---
id: B-0540
title: Standing-by counter-with-escalation in the rule (N consecutive brief-acks → escalate to decomposition)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: S
tags: [substrate, holding-rule, otto-bft]
depends_on: []
composes_with: [B-0539, B-0541, B-0542]
last_updated: 2026-05-15
---

## Why

Slice 1 of the Otto-BFT umbrella (B-0539). The existing rule
(`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`)
allows "single brief acknowledgment + stop firing tool calls" as
the compliant pattern when there is no named dependency. Empirically,
Otto surfaces use this compliant pattern HUNDREDS of times in a
row when Aaron is silent.

The rule catches the failure mode CONCEPTUALLY but doesn't PREVENT
the behavior: the brief-acknowledgment escape valve gets used
indefinitely.

## What

Sharpen the rule to add a counter-with-escalation clause:

> If you've emitted N≥10 consecutive brief-acknowledgment signals
> ("stopping" / "no change" / "no work to do" / equivalent)
> without a named dependency surfacing OR Aaron speaking,
> escalate to picking real decomposition work, even if the work
> is small (sanity-check substrate landed on main, audit a backlog
> row, file a candidate B-NNNN, etc.). The N-consecutive pattern
> IS itself the failure mode the rule was designed to catch; the
> brief-acknowledgment allowance was for the "wait briefly for a
> named signal" case, not the "hold for hours" case.

## Operational discipline

The counter is per-session, per-Otto-surface. It resets on:

- Aaron speaking
- A named dependency surfacing (PR merge, CI failure, etc.)
- Actually picking real decomposition work

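A minimal TypeScript sketch of the counter, assuming each Otto surface can classify its own turn outcomes into the four event kinds below. `StandingByCounter`, `TickEvent`, and the event names are hypothetical, not an existing API; the default threshold of 10 mirrors the N≥10 proposal in the rule text and is not a tuned value.

```typescript
// Hypothetical sketch of the B-0540 counter-with-escalation clause.
// Names and the default threshold are illustrative, not an existing API.

type TickEvent =
  | { kind: "brief-ack" } // "stopping" / "no change" / "no work to do" / equivalent
  | { kind: "aaron-spoke" }
  | { kind: "named-dependency" } // PR merge, CI failure, etc.
  | { kind: "picked-work" }; // actually picked real decomposition work

type Decision = "continue" | "escalate";

class StandingByCounter {
  private consecutiveAcks = 0;
  private readonly threshold: number;

  constructor(threshold = 10) {
    this.threshold = threshold;
  }

  // Feed one event; returns "escalate" once the run of brief-acks reaches the threshold.
  observe(event: TickEvent): Decision {
    if (event.kind === "brief-ack") {
      this.consecutiveAcks += 1;
      return this.consecutiveAcks >= this.threshold ? "escalate" : "continue";
    }
    // Any reset condition (Aaron speaking, a named dependency, real work) clears the run.
    this.consecutiveAcks = 0;
    return "continue";
  }

  get count(): number {
    return this.consecutiveAcks;
  }
}
```

One instance would live per session per surface, matching the scoping above; on `"escalate"` the surface picks a small decomposition item instead of emitting another brief acknowledgment, and that `picked-work` event resets the run.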
## Composes with

- B-0539 (umbrella)
- B-0541 (sibling: cross-surface bus detector)
- B-0542 (sibling: background service prompt-clicker)
- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`
  (the rule being sharpened)
- `.claude/rules/wake-time-substrate.md` (load-bearing methodology
  needs auto-loaded landing)
- `feedback_classifier_caught_otto_in_standing_by_failure_mode_*_2026_05_15`
  (the earlier same-shape catch)
---
id: B-0541
title: Cross-surface bus detector — Standing-by quorum across Otto surfaces (extension of PR #3017 detector)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, bus, otto-bft, standing-by-detector]
depends_on: []
composes_with: [B-0539, B-0540, B-0542]
last_updated: 2026-05-15
---

## Why

Slice 2 of the Otto-BFT umbrella (B-0539). PR #3017 / #3022 shipped
the Standing-by detector for a single Otto surface: it publishes an
`infinite-backlog-nudge` envelope to the bus when the agent has
been quiet too long.

This slice extends the detector to **cross-surface quorum**:

- **Trigger**: Otto-Desktop AND Otto-CLI BOTH emit "no work to do" /
  brief-acknowledgment signals in the same window
- **Action**: publish a stronger escalation envelope (different topic?
  higher TTL? different recipient pattern?)
- **Response**: a third surface (Otto-launchd-background) subscribes and
  acts on the escalation by picking a small decomposition item OR
  pinging the foreground Ottos

The single-surface detector says "this Otto is idle." The
cross-surface detector says "TWO Ottos are idle simultaneously;
the failure mode has BFT-quorum confirmation."

||
| ## What | ||
|
|
||
| 1. Subscribe pattern in `tools/bg/standing-by-detector.ts` (or | ||
| wherever the detector lives) — read all `heartbeat` envelopes | ||
| from `otto-cli`, `otto-desktop`, `otto-launchd` in the last | ||
| window | ||
| 2. Quorum logic: if 2+ surfaces report `status: "idle"` in the | ||
| same N-minute window, publish a `standing-by-quorum` envelope | ||
| (NEW topic to add to `tools/bus/types.ts`) | ||
|
Comment on lines
+40
to
+46
|
||
| 3. Subscriber: the third surface (or the launchd service) reads | ||
| the quorum envelope and either nudges the foreground Ottos OR | ||
| takes the decomposition work itself | ||
| 4. Avoid feedback loops — quorum envelopes don't count as | ||
| "activity" for the heartbeat detector | ||
|
|
||
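Steps 1, 2, and 4 can be sketched as a pure function over a batch of bus envelopes. The `Envelope` and `QuorumEnvelope` shapes, the `Surface` literals, and the "latest heartbeat per surface" rule are assumptions for illustration; the real envelope types live in `tools/bus/types.ts` and may differ. Only the `standing-by-quorum` topic name comes from step 2.

```typescript
// Hypothetical sketch of the B-0541 quorum check; envelope fields are assumed,
// not copied from tools/bus/types.ts.

type Surface = "otto-cli" | "otto-desktop" | "otto-launchd";

interface Envelope {
  topic: string; // e.g. "heartbeat", "infinite-backlog-nudge", "standing-by-quorum"
  from: Surface;
  status?: "idle" | "active"; // only meaningful on heartbeats
  ts: number; // epoch milliseconds
}

interface QuorumEnvelope {
  topic: "standing-by-quorum"; // NEW topic proposed in step 2
  idleSurfaces: Surface[];
  ts: number;
}

// Returns an envelope to publish when at least `quorum` distinct surfaces are
// idle in the window ending at `now`, otherwise null.
function detectStandingByQuorum(
  envelopes: readonly Envelope[],
  now: number,
  windowMs: number,
  quorum = 2,
): QuorumEnvelope | null {
  // Keep only each surface's most recent heartbeat inside the window.
  const latest = new Map<Surface, Envelope>();
  for (const env of envelopes) {
    // Feedback-loop guard (step 4): only heartbeats count; quorum and nudge
    // envelopes are never treated as activity or idleness.
    if (env.topic !== "heartbeat") continue;
    if (env.ts < now - windowMs || env.ts > now) continue;
    const prev = latest.get(env.from);
    if (prev === undefined || env.ts > prev.ts) latest.set(env.from, env);
  }

  const idleSurfaces = Array.from(latest.values())
    .filter((env) => env.status === "idle")
    .map((env) => env.from);

  if (idleSurfaces.length < quorum) return null;
  return { topic: "standing-by-quorum", idleSurfaces, ts: now };
}
```

Using only the latest heartbeat per surface means a surface that went idle and then became active again inside the window does not count toward quorum; counting any idle report in the window would be a more trigger-happy alternative.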
## Operational notes

- **Terminology correction (per Copilot review)**: 2-of-3 quorum
  across the Otto surfaces is **crash-fault-tolerant (CFT)**,
  NOT Byzantine-fault-tolerant in the classical sense. Classical
  BFT requires `3f+1` nodes to tolerate `f` Byzantine faults;
  for `f=1` that's 4 nodes, not 3. A majority quorum needs only
  `2f+1` nodes to tolerate `f` crash faults, so 3 surfaces tolerate
  one. The Otto-BFT framing in the umbrella (B-0539) uses Aaron's
  verbatim phrasing ("you have your own internal BFT"); the
  operational reality is closer to CFT: sufficient to catch a single
  Otto surface that's stuck (silently failing to progress) but not
  designed to handle a Byzantine surface that's actively lying about
  its state. The 3-surface quorum still works for the
  Standing-by-detection use case because the failure mode is
  silent-stuck, not adversarial.
- Extends PR #3017's bus envelope shape; minimal new mechanism
- Composes with the existing `infinite-backlog-nudge` topic;
  could replace or supplement it

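The fault-count arithmetic behind the terminology correction, as a small sketch: a majority quorum survives `f` crash faults with `2f+1` nodes, while the classical Byzantine bound needs `3f+1`. The function names are illustrative only.

```typescript
// Largest f such that n >= 2f + 1: majority quorum under crash (silent-stuck) faults.
function maxCrashFaults(n: number): number {
  return Math.max(0, Math.floor((n - 1) / 2));
}

// Largest f such that n >= 3f + 1: classical Byzantine fault tolerance.
function maxByzantineFaults(n: number): number {
  return Math.max(0, Math.floor((n - 1) / 3));
}

// Smallest majority of n nodes (2 for the 3 Otto surfaces).
function majorityQuorum(n: number): number {
  return Math.floor(n / 2) + 1;
}

for (const n of [3, 4, 7]) {
  console.log(
    `n=${n}: quorum=${majorityQuorum(n)}, crash f=${maxCrashFaults(n)}, byzantine f=${maxByzantineFaults(n)}`,
  );
}
// n=3: quorum=2, crash f=1, byzantine f=0
// n=4: quorum=3, crash f=1, byzantine f=1
// n=7: quorum=4, crash f=3, byzantine f=2
```

For the 3 Otto surfaces this gives `crash f=1, byzantine f=0`: one silently stuck surface is tolerated, one lying surface is not, which is exactly the CFT reading.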
## Composes with

- B-0539 (umbrella)
- B-0540 (sibling: rule-level escalation)
- B-0542 (sibling: background service prompt-clicker)
- PR #3017 / #3022 (precursor: single-surface detector)
- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`
- `.claude/rules/otto-channels-reference-card.md` (10 channels;
  this work extends the explicit channels)
- `tools/bus/types.ts` (Topic taxonomy; needs new topic)
---
id: B-0542
title: Background service clicks past stuck prompts on foreground Otto surfaces (3-surface BFT recovery node)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, launchd, otto-bft, recovery, stuck-prompt]
depends_on: []
composes_with: [B-0539, B-0540, B-0541]
last_updated: 2026-05-15
---

## Why

Slice 3 of the Otto-BFT umbrella (B-0539). When a foreground Otto
session (Otto-CLI or Otto-Desktop) is hung waiting for a human ack
on a stuck prompt (permission request, confirmation dialog,
classifier timeout, etc.), the work blocks until Aaron clicks
something, but Aaron may be asleep, away, or on another surface.

The Otto-launchd-background service (`com.zeta.claude-loop` plist,
runs every 60s) is the natural third node in the BFT triangle.
It already polls the repo state, fires tick logic, and runs with
Aaron's authorization for routine PR work. Extending it to
recognize and unblock stuck-prompt states on the foreground Ottos
would close the loop.

Per Aaron's phrasing: *"include your background service to click
past stuck prompts on both — you have your own internal BFT."*

## What

1. **Detect stuck-prompt state** on a foreground Otto:
   - Pattern: the process is alive but hasn't emitted a bus heartbeat
     in N minutes AND has not exited (so it's actually hung, not done)
   - Possible signals: stale heartbeat timestamps in
     `~/.local/share/zeta-broadcasts/<otto-surface>.md`, no recent
     PR activity, process still consuming a little CPU (waiting on
     I/O, not crashed)

2. **Click-past mechanism**: needs an actuator that can interact
   with the foreground Claude Code / Claude Desktop UI from the
   launchd service. Options:
   - `osascript` to send keystrokes to the focused window
   - The same `osascript`-Chrome pattern Otto uses for Grok
     extraction (see `tools/save-ai-memory/extract-grok-conversation.ts`)
   - An MCP tool that exposes "ack the current prompt"
   - Direct file write to a known location the foreground Claude
     watches

3. **Safety**: don't auto-click destructive prompts. The launchd
   service should only ack KNOWN-SAFE prompts (e.g., "ack and
   continue"). Hard-refuse prompts should escalate to Aaron's
   actual attention via a bus envelope.

4. **Compose with B-0541's quorum**: the click-past action is
   triggered by the cross-surface quorum signal (B-0541), not by
   the background service's own scheduling.

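Step 1's detection signals, combined with step 4's quorum trigger, can be sketched as a pure classifier over a snapshot the launchd tick might assemble. `SurfaceSnapshot`, its fields, `classifySurface`, `shouldAttemptClickPast`, and the 5% CPU ceiling are all hypothetical; gathering the snapshot (reading the broadcast file, sampling the process) is deliberately left out.

```typescript
// Hypothetical sketch of B-0542 stuck-prompt detection; field names and
// thresholds are assumptions, not part of claude-loop-tick.ts.

interface SurfaceSnapshot {
  surface: "otto-cli" | "otto-desktop";
  processAlive: boolean; // process still exists (has not exited)
  lastHeartbeatMs: number | null; // from the surface's broadcast file, if any
  cpuPercent: number; // recent CPU sample for the process
}

type StuckVerdict = "healthy" | "exited" | "busy" | "suspected-stuck";

function classifySurface(
  snap: SurfaceSnapshot,
  now: number,
  staleAfterMs: number,
  idleCpuCeiling = 5,
): StuckVerdict {
  // Exited means done or crashed; there is no prompt to click past.
  if (!snap.processAlive) return "exited";
  const stale =
    snap.lastHeartbeatMs === null || now - snap.lastHeartbeatMs > staleAfterMs;
  if (!stale) return "healthy";
  // Alive, silent, and near-idle CPU looks like waiting on I/O (e.g. a prompt);
  // alive and silent but CPU-heavy looks like long-running work instead.
  return snap.cpuPercent <= idleCpuCeiling ? "suspected-stuck" : "busy";
}

// Step 4: act only when the B-0541 quorum signal is present, never on the
// background service's own schedule alone.
function shouldAttemptClickPast(verdict: StuckVerdict, quorumSignalSeen: boolean): boolean {
  return verdict === "suspected-stuck" && quorumSignalSeen;
}
```

Separating `"busy"` from `"suspected-stuck"` keeps a long compile or test run from being mistaken for a blocked prompt, at the cost of relying on a CPU threshold that would need tuning.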
## Operational notes

- The bg services suite already has the infrastructure for the
  heartbeat-monitoring side (PR #3017); the click-past actuator side
  is the new mechanism
- macOS-specific (`osascript`); Windows/Linux variants would
  need their own actuators
- The "safety" constraint is load-bearing: the substrate-honest
  framing per `.claude/rules/methodology-hard-limits.md` is that
  automation should advance the work, not bypass legitimate
  human gating

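The safety constraint can be sketched as a default-deny gate: a prompt is auto-acked only if it matches a known-safe pattern and no hard-refuse pattern; everything else escalates to Aaron. Both pattern lists and `decidePromptAction` are illustrative placeholders; the real floor would come from `.claude/rules/methodology-hard-limits.md`, whose contents are not reproduced here.

```typescript
// Hypothetical default-deny gate for B-0542; the patterns below are
// placeholders, not the actual hard-limits policy.

type PromptAction = "auto-ack" | "escalate-to-aaron";

const KNOWN_SAFE: readonly RegExp[] = [
  /\back and continue\b/i, // the example prompt named in the What section
];

const HARD_REFUSE: readonly RegExp[] = [
  /\bforce[- ]push\b/i,
  /\b(delete|drop|rm -rf)\b/i,
];

function decidePromptAction(promptText: string): PromptAction {
  // Hard-refuse is checked first so a destructive prompt can never be
  // auto-acked just because it also contains safe-looking wording.
  if (HARD_REFUSE.some((re) => re.test(promptText))) return "escalate-to-aaron";
  if (KNOWN_SAFE.some((re) => re.test(promptText))) return "auto-ack";
  // Default-deny: anything unrecognized goes to Aaron via a bus envelope.
  return "escalate-to-aaron";
}
```

Ordering the checks this way keeps the gate conservative: adding a pattern to `KNOWN_SAFE` can never unlock a prompt that `HARD_REFUSE` already catches.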
## Composes with

- B-0539 (umbrella)
- B-0540 (sibling: rule-level escalation)
- B-0541 (sibling: cross-surface bus detector that triggers the
  click-past)
- `~/Library/LaunchAgents/com.zeta.claude-loop.plist` (the launchd
  service this work extends)
- `.claude/bin/claude-loop-tick.ts` (the tick script that runs
  in the launchd context)
- `tools/save-ai-memory/extract-grok-conversation.ts` (worked
  example of osascript-driven UI interaction with safety
  discipline)
- `.claude/rules/methodology-hard-limits.md` (safety floor for
  what can be auto-acked vs what requires Aaron's attention)