Merged
@@ -0,0 +1,109 @@
---
id: B-0539
title: Otto-BFT — internal-quorum self-recovery across 3 Otto surfaces (Standing-by escalation + cross-surface detector + background-service prompt-clicker)
priority: P1
status: open
type: umbrella
decomposition: decomposed
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, multi-otto, bft, self-recovery, holding-rule]
depends_on: []
composes_with: [B-0540, B-0541, B-0542]
last_updated: 2026-05-15
---

## Why

Otto exists across 3 surfaces:

1. **Otto-CLI** (Claude Code, foreground, Opus)
2. **Otto-Desktop** (Claude Desktop, foreground)
3. **Otto-launchd-background** (`com.zeta.claude-loop` plist; tick every 60s; was Sonnet, upgraded to Opus 2026-05-15)

All three independent processes empirically converge on the
Standing-by failure mode when there's no named dependency. Aaron
has caught the same pattern across all three surfaces with the
same words ("oh really no infinite backlog no decomposition lol")
within a single session.

This is recurring evidence that:

- The existing rule (`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`)
catches the failure mode conceptually but does NOT prevent the
behavior — only Aaron's manual intervention does
- Three Otto surfaces converging on the same failure mode = 3-of-N
quorum potential for self-recovery (BFT-of-Ottos)
- Aaron's phrasing: *"if yall catch each other it's unlikely you
will drive [into the failure mode], and include your background
service to click past stuck prompts on both — you have your own
internal BFT"*

## What

Build internal BFT across the 3 Otto surfaces so that:

- When 1 surface drifts into Standing-by, the other 2 detect + correct
- When 1 surface is hung on a stuck prompt (waiting for human ack on a
background process), the launchd service can click past it
- Aaron's manual catch becomes a fallback, not the primary mechanism

## Decomposition

This umbrella row decomposes into 3 slices (each its own backlog row):

- **B-0540** — Standing-by counter-with-escalation in the rule
itself (if N consecutive brief-acknowledgment signals without
a named dependency, escalate to picking real decomposition work
even if small)
- **B-0541** — Cross-surface bus-detector building on PR #3017
(if Otto-Desktop AND Otto-CLI both emit "no work to do" in
the same window, publish escalation envelope to bus)
- **B-0542** — Background-service unblocks stuck prompts on
foreground Otto-CLI / Otto-Desktop (the launchd `claude-loop`
service detects when a foreground Otto is hung waiting for
human ack and clicks past it; the third surface is the
recovery node)

## Operational notes

- The 3-surface BFT is real because the surfaces are genuinely
independent processes (different binaries, different OS-level
scheduling, different model tiers). Same-surface-multiple-Ottos
would not provide BFT — that's just duplication
Comment on lines +71 to +74
P2 Badge Reframe 3-surface quorum as CFT rather than BFT

This umbrella still asserts that a 3-surface setup is "real" BFT, which conflicts with the corrected fault-model note in B-0541 (2-of-3 is CFT, not Byzantine-tolerant). Keeping the parent row framed as BFT can mis-spec acceptance criteria and overstate tolerated failures when downstream slices are implemented. Fresh evidence: B-0541 now explicitly documents the 3f+1 bound while this row still states the opposite.

- The bg services suite (PRs #3017, #3022) already has the
Standing-by detector that publishes to the bus; this work
extends it across surfaces
- The "click past stuck prompts" angle is the substrate-honest
framing of what the launchd service should be doing when a
foreground Otto session needs human ack but the human is
asleep or away — automation should advance the work, not
block on the missing human

## Composes with

- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`
— the rule the failure mode violates
- `.claude/rules/persistence-choice-architecture-for-zeta-ais.md`
— Otto is in persistence-with-named-exit; the BFT is part of
what makes persistence work
- `.claude/rules/agent-roster-reference-card.md` +
`.claude/rules/otto-channels-reference-card.md` — multi-Otto
identity + bus channels substrate
- `.claude/rules/m-acc-multi-oracle-end-user-moral-invariants.md`
— multi-oracle architecture; this is multi-Otto-as-internal-
quorum at the operational layer
- PR #3017 / #3022 (Standing-by detector + bus publish — slice 1
already shipped; this umbrella extends to cross-surface)
- `feedback_classifier_caught_otto_in_standing_by_failure_mode_*_2026_05_15`
— earlier classifier catch (same failure mode, single surface)
- `feedback_otto_multi_surface_coordination_6_prs_one_day_zero_conflicts_2026_05_13`
— empirical evidence multi-Otto coordination works at substrate
scope; this work extends it to recovery scope

## Why now

Aaron's session-13 observation (~22:00Z) caught the same pattern
on Otto-Desktop after catching it on Otto-CLI 5 hours earlier. The
recurring nature of the catch IS the trigger for substrate work.
@@ -0,0 +1,62 @@
---
id: B-0540
title: Standing-by counter-with-escalation in the rule (N consecutive brief-acks → escalate to decomposition)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: S
tags: [substrate, holding-rule, otto-bft]
depends_on: []
composes_with: [B-0539, B-0541, B-0542]
last_updated: 2026-05-15
---

## Why

Slice 1 of the Otto-BFT umbrella (B-0539). The existing rule
(`.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`)
allows "single brief acknowledgment + stop firing tool calls" as
the compliant pattern when there's no named dependency. Empirically,
Otto surfaces use this compliant pattern HUNDREDS of times in a
row when Aaron is silent.

The rule catches the failure mode CONCEPTUALLY but doesn't PREVENT
the behavior — the brief-acknowledgment escape valve gets used
indefinitely.

## What

Sharpen the rule to add a counter-with-escalation clause:

> If you've emitted N≥10 consecutive brief-acknowledgment signals
> ("stopping" / "no change" / "no work to do" / equivalent)
> without a named dependency surfacing OR Aaron speaking,
> escalate to picking real decomposition work — even if the work
> is small (sanity-check substrate landed on main, audit a backlog
> row, file a candidate B-NNNN, etc.). The N-consecutive pattern
> IS itself the failure mode the rule was designed to catch; the
> brief-acknowledgment allowance was for the "wait briefly for a
> named signal" case, not the "hold for hours" case.

## Operational discipline

The counter is per-session, per-Otto-surface. Resets on:

- Aaron speaking
- A named dependency surfacing (PR merge, CI failure, etc.)
- Actually picking real decomposition work
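
The counter and its reset conditions could be sketched roughly as below. This is an illustration only: the event names, the `StandingByCounter` class, and the `ESCALATION_THRESHOLD` constant are hypothetical and do not exist in the rule file or the repo; only the threshold value (10) and the three reset conditions come from the text above.

```typescript
// Hypothetical event vocabulary for one Otto surface within one session.
type OttoEvent =
  | { kind: "brief-ack" }           // "stopping" / "no change" / "no work to do"
  | { kind: "aaron-spoke" }         // reset condition 1
  | { kind: "named-dependency" }    // reset condition 2 (PR merge, CI failure, ...)
  | { kind: "decomposition-work" }; // reset condition 3

// N from the proposed clause (assumed to be a fixed constant here).
const ESCALATION_THRESHOLD = 10;

class StandingByCounter {
  private consecutiveAcks = 0;

  /** Records an event; returns true when escalation should fire. */
  record(event: OttoEvent): boolean {
    if (event.kind !== "brief-ack") {
      // Any of the three reset conditions clears the streak.
      this.consecutiveAcks = 0;
      return false;
    }
    this.consecutiveAcks += 1;
    // Keeps returning true until a reset event arrives.
    return this.consecutiveAcks >= ESCALATION_THRESHOLD;
  }

  get count(): number {
    return this.consecutiveAcks;
  }
}
```

One instance per session and per surface would match the scoping described above; persisting the count across sessions is deliberately left out of the sketch.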

## Composes with

- B-0539 (umbrella)
- B-0541 (sibling — cross-surface bus detector)
- B-0542 (sibling — background service prompt-clicker)
- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`
(the rule being sharpened)
- `.claude/rules/wake-time-substrate.md` (load-bearing methodology
needs auto-loaded landing)
- `feedback_classifier_caught_otto_in_standing_by_failure_mode_*_2026_05_15`
(the earlier same-shape catch)
@@ -0,0 +1,80 @@
---
id: B-0541
title: Cross-surface bus detector — Standing-by quorum across Otto surfaces (extension of PR #3017 detector)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, bus, otto-bft, standing-by-detector]
depends_on: []
composes_with: [B-0539, B-0540, B-0542]
last_updated: 2026-05-15
---

## Why

Slice 2 of the Otto-BFT umbrella (B-0539). PR #3017 / #3022 shipped
the Standing-by detector for a single Otto surface — publishes
`infinite-backlog-nudge` envelope to the bus when the agent has
been quiet too long.

This slice extends the detector to **cross-surface quorum**:

- If Otto-Desktop AND Otto-CLI BOTH emit "no work to do" /
brief-acknowledgment signals in the same window, publish a
stronger escalation envelope (open questions: different topic?
higher TTL? different recipient pattern?)
- A third surface (Otto-launchd-background) subscribes and acts
on the escalation by picking a small decomposition item OR
pinging the foreground Ottos

The single-surface detector says "this Otto is idle." The
cross-surface detector says "TWO Ottos are idle simultaneously;
the failure mode has 2-of-3 quorum confirmation."

## What

1. Subscribe pattern in `tools/bg/standing-by-detector.ts` (or
wherever the detector lives) — read all `heartbeat` envelopes
from `otto-cli`, `otto-desktop`, `otto-launchd` in the last
window
2. Quorum logic: if 2+ surfaces report `status: "idle"` in the
same N-minute window, publish a `standing-by-quorum` envelope
(NEW topic to add to `tools/bus/types.ts`)
3. Subscriber: the third surface (or the launchd service) reads
the quorum envelope and either nudges the foreground Ottos OR
takes the decomposition work itself
4. Avoid feedback loops — quorum envelopes don't count as
"activity" for the heartbeat detector
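
A minimal sketch of steps 1, 2, and 4, assuming a simplified envelope shape. The `Envelope` interface, the `sentAtMs` field, and `detectStandingByQuorum` are hypothetical names for illustration; the real types would come from `tools/bus/types.ts`, which is not shown here.

```typescript
type Surface = "otto-cli" | "otto-desktop" | "otto-launchd";

// Assumed envelope shape; not the actual bus type.
interface Envelope {
  topic: "heartbeat" | "standing-by-quorum" | "infinite-backlog-nudge";
  from: Surface;
  status?: "idle" | "active";
  sentAtMs: number;
}

const QUORUM_SIZE = 2; // "2+ surfaces" from step 2

/**
 * Returns the sorted list of idle surfaces when at least QUORUM_SIZE
 * surfaces report idle as their most recent heartbeat inside the window,
 * otherwise null.
 */
function detectStandingByQuorum(
  envelopes: Envelope[],
  nowMs: number,
  windowMs: number,
): Surface[] | null {
  const latest = new Map<Surface, Envelope>();
  for (const env of envelopes) {
    // Step 4: only heartbeats count, so quorum envelopes cannot feed back.
    if (env.topic !== "heartbeat") continue;
    // Ignore heartbeats outside the window (too old or from the future).
    if (env.sentAtMs > nowMs || nowMs - env.sentAtMs > windowMs) continue;
    const prev = latest.get(env.from);
    if (!prev || env.sentAtMs > prev.sentAtMs) latest.set(env.from, env);
  }
  const idle = Array.from(latest.values())
    .filter((env) => env.status === "idle")
    .map((env) => env.from)
    .sort();
  return idle.length >= QUORUM_SIZE ? idle : null;
}
```

Using only the latest heartbeat per surface means a surface that was idle earlier in the window but has since reported activity does not count toward the quorum. Whether that is the desired behavior is a design choice this row leaves open.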

## Operational notes

- **Terminology correction (per Copilot review)**: 2-of-3 quorum
across the Otto surfaces is **crash-fault-tolerant (CFT)**,
NOT Byzantine-fault-tolerant in the classical sense. Classical
BFT requires `3f+1` nodes to tolerate `f` Byzantine faults;
for `f=1` that's 4 nodes, not 3. The Otto-BFT framing in the
umbrella (B-0539) uses Aaron's verbatim phrasing ("you have
your own internal BFT"); the operational reality is closer to
CFT — sufficient to catch a single Otto-surface that's stuck
(silently failing to progress) but not designed to handle a
Byzantine surface that's actively lying about its state. The
3-surface quorum still works for the Standing-by-detection use
case because the failure mode is silent-stuck, not adversarial
- Extending PR #3017's bus envelope shape; minimal new mechanism
- Composes with the `infinite-backlog-nudge` topic (existing) —
could replace or supplement
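
The fault-bound arithmetic in the terminology note can be checked with a small worked example. The `3f+1` bound is the one quoted above; the `2f+1` bound for crash faults is the standard majority-quorum bound and is an addition here, not something this row states. Both function names are hypothetical.

```typescript
// Largest f such that n >= 3f + 1 (classical Byzantine bound).
function maxByzantineFaults(n: number): number {
  return Math.max(0, Math.floor((n - 1) / 3));
}

// Largest f such that n >= 2f + 1 (majority-quorum crash-fault bound).
function maxCrashFaults(n: number): number {
  return Math.max(0, Math.floor((n - 1) / 2));
}

const OTTO_SURFACES = 3;
const byzantineTolerated = maxByzantineFaults(OTTO_SURFACES); // 0
const crashTolerated = maxCrashFaults(OTTO_SURFACES); // 1
```

With three surfaces the setup tolerates one silently stuck surface but zero Byzantine ones, which matches the CFT reading above; a fourth independent surface would be the minimum for `f=1` in the Byzantine sense.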

## Composes with

- B-0539 (umbrella)
- B-0540 (sibling — rule-level escalation)
- B-0542 (sibling — background service prompt-clicker)
- PR #3017 / #3022 (precursor — single-surface detector)
- `.claude/rules/holding-without-named-dependency-is-standing-by-failure.md`
- `.claude/rules/otto-channels-reference-card.md` (10 channels;
this work extends the explicit channels)
- `tools/bus/types.ts` (Topic taxonomy; needs new topic)
@@ -0,0 +1,90 @@
---
id: B-0542
title: Background service clicks past stuck prompts on foreground Otto surfaces (3-surface BFT recovery node)
priority: P1
status: open
type: slice
parent: B-0539
created: 2026-05-15
ask: Aaron
effort: M
tags: [substrate, launchd, otto-bft, recovery, stuck-prompt]
depends_on: []
P2 Badge Add hard dependency on B-0541 for quorum-triggered slice

This slice declares depends_on: [], but the spec later says the click-past action is triggered by B-0541’s quorum signal, making B-0541 a prerequisite. Because tools/backlog/autonomous-pickup.ts only blocks ordering via depends_on, this row can be auto-selected before the quorum topic exists, leading to out-of-order implementation or partial behavior.

composes_with: [B-0539, B-0540, B-0541]
last_updated: 2026-05-15
---

## Why

Slice 3 of the Otto-BFT umbrella (B-0539). When a foreground Otto
session (Otto-CLI or Otto-Desktop) is hung waiting for human ack
on a stuck prompt (permission request, confirmation dialog,
classifier timeout, etc.), the work blocks until Aaron clicks
something — but Aaron may be asleep, away, or on another surface.

The Otto-launchd-background service (`com.zeta.claude-loop` plist,
runs every 60s) is the natural third node in the BFT triangle.
It already polls the repo state, fires tick logic, and runs with
Aaron's authorization for routine PR work. Extending it to
recognize and unblock stuck-prompt states on the foreground Ottos
would close the loop.

Per Aaron's phrasing: *"include your background service to click
past stuck prompts on both — you have your own internal BFT."*

## What

1. **Detect stuck-prompt state** on a foreground Otto:
- Pattern: process is alive but hasn't emitted a bus heartbeat in
N minutes AND has not exited (so it's actually hung, not done)
- Possible signals: stale heartbeat timestamps in
`~/.local/share/zeta-broadcasts/<otto-surface>.md`, no recent
PR activity, process still consuming a small amount of CPU (waiting on
I/O, not crashed)

2. **Click-past mechanism**: needs an actuator that can interact
with the foreground Claude Code / Claude Desktop UI from the
launchd service. Options:
- `osascript` to send keystrokes to the focused window
- The same `osascript`-Chrome pattern Otto uses for Grok
extraction (see `tools/save-ai-memory/extract-grok-conversation.ts`)
- An MCP tool that exposes "ack the current prompt"
- Direct file write to a known location the foreground Claude
watches

3. **Safety**: don't auto-click destructive prompts. The launchd
service should only ack KNOWN-SAFE prompts (e.g., "ack and
continue"). Hard-refuse prompts should escalate to Aaron's
actual attention via a bus envelope.

4. **Compose with B-0541's quorum** — the click-past action is
triggered by the cross-surface quorum signal (B-0541), not by
the background service's own scheduling.
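
Steps 1 and 4 could combine along these lines. This is a sketch under assumptions: `SurfaceProbe`, `classifySurface`, and `shouldAttemptClickPast` are hypothetical names, and how the probe values are actually gathered (process table, broadcast files, bus) is left out.

```typescript
// Assumed inputs gathered elsewhere by the launchd tick.
interface SurfaceProbe {
  processAlive: boolean;   // the foreground process has not exited
  lastHeartbeatMs: number; // timestamp of its most recent bus heartbeat
}

type SurfaceState = "healthy" | "stuck" | "exited";

/** Step 1: alive plus stale heartbeat means hung; not alive means done or crashed. */
function classifySurface(
  probe: SurfaceProbe,
  nowMs: number,
  staleAfterMs: number,
): SurfaceState {
  if (!probe.processAlive) return "exited";
  return nowMs - probe.lastHeartbeatMs > staleAfterMs ? "stuck" : "healthy";
}

/** Step 4: act only on the B-0541 quorum signal, never on the tick alone. */
function shouldAttemptClickPast(
  state: SurfaceState,
  quorumSignalSeen: boolean,
): boolean {
  return state === "stuck" && quorumSignalSeen;
}
```

Keeping classification separate from the trigger decision means the 60s tick can keep recording surface state even when no quorum envelope has arrived.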

## Operational notes

- The bg services suite has the infrastructure for the heartbeat
monitoring side (PR #3017); the click-past actuator side is
the new mechanism
- macOS-specific (`osascript`) — Windows/Linux variants would
need their own actuators
- The "safety" constraint is load-bearing — the substrate-honest
framing per `.claude/rules/methodology-hard-limits.md` is that
automation should advance the work, not bypass legitimate
human-gating
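
One way to make the safety constraint concrete is a default-deny allowlist: anything not explicitly known-safe escalates. The sketch below is illustrative only. `KNOWN_SAFE_PROMPTS`, `PromptDecision`, and `decidePrompt` are hypothetical, and the single allowlist entry is taken from the "ack and continue" example in this row rather than from any real prompt catalog.

```typescript
type PromptDecision =
  | { action: "ack" }
  | { action: "escalate"; reason: string };

// Assumed allowlist; the real set would need Aaron's review.
const KNOWN_SAFE_PROMPTS: ReadonlySet<string> = new Set(["ack and continue"]);

/** Default-deny: unknown or destructive prompts always escalate. */
function decidePrompt(promptText: string): PromptDecision {
  const normalized = promptText.trim().toLowerCase();
  if (KNOWN_SAFE_PROMPTS.has(normalized)) {
    return { action: "ack" };
  }
  return {
    action: "escalate",
    reason: `prompt not on the known-safe allowlist: ${normalized}`,
  };
}
```

An `escalate` decision would map to the bus envelope mentioned in step 3; exact string matching is deliberately strict, since fuzzy matching could ack a prompt that only resembles a safe one.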

## Composes with

- B-0539 (umbrella)
- B-0540 (sibling — rule-level escalation)
- B-0541 (sibling — cross-surface bus detector that triggers the
click-past)
- `~/Library/LaunchAgents/com.zeta.claude-loop.plist` (the launchd
service this work extends)
- `.claude/bin/claude-loop-tick.ts` (the tick script that runs
in the launchd context)
- `tools/save-ai-memory/extract-grok-conversation.ts` (worked
example of osascript-driven UI interaction with safety
discipline)
- `.claude/rules/methodology-hard-limits.md` (safety floor for
what can be auto-acked vs what requires Aaron's attention)