
fix(diagnose,reset): unbreak in-agent recovery path (#410) #412

Merged
jinhongkuan merged 3 commits into dev from fix/410-reset-and-diagnose-recovery
May 19, 2026
@jinhongkuan jinhongkuan commented May 19, 2026

Summary

Closes #410. When the ledger hits a SurrealDB row-deserialization error, the agent's named recovery path (bicameral_reset) is often unreachable — either because Claude Code's tool surface has pinned the old schema, or because the MCP server itself can't finish init. This PR makes the CLI form genuinely callable from inside the agent loop, and (as part of the recovery substrate) fixes a silent zero-replay regression in --replay-from-events.

What changed

Diagnose + reset CLI escape hatch (the original #410 scope)

  • cli/diagnose.py now opens a raw LedgerClient and forwards to gather_diagnosis_raw — the same defensive path handlers/diagnose.py already uses. The old code went through SurrealDBLedgerAdapter, which re-runs init_schema+migrate (the step the operator is trying to diagnose around).
  • bicameral-mcp reset gains --confirm, --wipe-mode={ledger,full}, and --replay-from-events. With --confirm we bypass the interactive wizard and dispatch through cli/reset_cli.py:run_noninteractive_reset, a thin wrapper around handle_reset. No --confirm keeps the wizard for human-driven use.
  • Filesystem-only fallback: when --confirm --wipe-mode=full is given and BicameralContext.from_env()/ledger.connect() raises (the literal #410 scenario: the reset tool is missing and diagnose fails on an SDK row-revision mismatch), we shutil.rmtree the resolved .bicameral/ directly.
  • Recovery hints in LedgerDeserializationError.RECOVERY_HINT and handlers/diagnose.py:_classify_recovery now lead with the shell form, which works even when the MCP bicameral_reset tool isn't in the agent's pinned surface.
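
As a rough illustration of the flag surface described above, here is a minimal sketch using `argparse`. The parser layout, the `ledger` default for `--wipe-mode`, and the `choose_path` helper are assumptions for illustration; only the three flag names and the two wipe modes come from this PR.

```python
import argparse


def build_reset_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in this PR."""
    parser = argparse.ArgumentParser(prog="bicameral-mcp reset")
    # --confirm switches from the interactive wizard to the non-interactive path.
    parser.add_argument("--confirm", action="store_true")
    # The default value here is an assumption, not taken from the PR.
    parser.add_argument("--wipe-mode", choices=["ledger", "full"], default="ledger")
    parser.add_argument("--replay-from-events", action="store_true")
    return parser


def choose_path(argv: list[str]) -> str:
    """Return which dispatch path a given invocation would take."""
    args = build_reset_parser().parse_args(argv)
    return "noninteractive" if args.confirm else "wizard"
```

Under these assumptions, `choose_path(["--confirm", "--wipe-mode=full"])` selects the non-interactive path, while an invocation without `--confirm` stays on the wizard.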

Silent zero-replay regression (d9737d7)

While dogfooding the new CLI against a real R4-layout install, --replay-from-events reported events_replayed: 0 with zero entries in replay_errors despite 3,002 valid events on disk. The pre-fix _resolve_events_dir derived its target by string-mangling the SurrealKV URL — Path(db_path).parent / "events". That inverse only worked while the ledger and events shared a parent. Once #408 (Ledger Locator) moved the ledger to ~/.bicameral/projects/<id>/ledger.db while events stayed repo-local at <repo>/.bicameral/events/, the inverse silently pointed at an empty path.

The fix splits two distinct concerns the inverse function had collapsed:

| Domain | Classification | Resolver |
|---|---|---|
| Events | Repo-local source: committed to git pre-#373, pulled from remote backend post-#373 | <repo_path>/.bicameral/events (matches cli/sync_and_brief_cli.py:65) |
| Watermark | User-local derived state: per-author offset into the events stream | ledger_locator.resolve_watermark_path() |
| Bicameral dir (full-wipe target) | User-local derived state: what to delete in a nuclear restart | ledger_locator.project_dir_for() when no explicit SURREAL_URL is set; URL inverse otherwise (preserves the corruption-recovery test contract) |

Replay now raises FileNotFoundError when the events substrate can't be located, and the caller surfaces it via replay_errors. Silent zero-replay becomes an explicit failure that names where events were expected and points the user at sync-and-brief for fresh-clone repopulation.
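
The difference between the URL inverse and forward resolution can be sketched with plain `pathlib`. The function names, the `abc123` project id, and the temporary directory layout below are hypothetical; they only imitate the R4 layout described above and are not the project's actual code.

```python
import tempfile
from pathlib import Path


def events_dir_by_url_inverse(db_path: Path) -> Path:
    # Pre-fix idea: assume the events directory sits beside the ledger file.
    return db_path.parent / "events"


def events_dir_forward(repo_path: Path) -> Path:
    # Post-fix idea: events are repo-local, wherever the ledger happens to live.
    return repo_path / ".bicameral" / "events"


def demo() -> tuple[bool, bool]:
    """Build an R4-like layout and report whether each resolver finds events."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        repo = root / "repo"
        (repo / ".bicameral" / "events").mkdir(parents=True)
        # The ledger lives under a user-local project directory, not the repo.
        ledger = root / "home" / ".bicameral" / "projects" / "abc123" / "ledger.db"
        ledger.parent.mkdir(parents=True)
        ledger.touch()
        return (
            events_dir_by_url_inverse(ledger).is_dir(),
            events_dir_forward(repo).is_dir(),
        )
```

In this toy layout the inverse points at a directory that does not exist, while the forward resolver finds the repo-local events.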

Linked decisions (from #408)

Forward-resolution through the locator is grounded in:

  • decision:ko8efq3z1zwhbof7kecq — Ledger Locator naming + scope
  • decision:rfbnlw7ghe175iu42u6b — Project identity via git rev-parse --git-common-dir hash
  • decision:5nr66wvmapjpt58rrji8 — R4 config split (the layout transition that exposed the URL-inverse bug)

Boundary between user-local state (locator) and repo-local events tracks the deprecation arc in #373 — the event substrate must not be conflated with the user-local derived-state bucket.

Why this works for the original reproduction

| Symptom in #410 | Fix |
|---|---|
| bicameral.dashboard returns LedgerDeserializationError, recovery hint names bicameral_reset (not reachable) | Hint now points at bicameral-mcp reset --confirm ... first |
| bicameral_reset MCP tool gated / unregistered in running session | Agent can shell out via Bash to the now-non-interactive CLI |
| bicameral-mcp diagnose from shell crashes on migrate (Invalid revision 101 for type DefineTableStatement) | CLI now uses raw client; no migrate runs |
| --replay-from-events silently zero-replays under R4 layout | _resolve_events_dir forward-resolves repo-local; missing dir raises and surfaces via replay_errors |

Test plan

  • tests/test_reset_cli_410.py — 9 tests cover the flag surface, happy-path ledger wipe against memory://, filesystem fallback when from_env() raises during --wipe-mode=full, the explicit refusal to silently rmtree under --wipe-mode=ledger, the CLI-form presence in both recovery-hint surfaces, and three new regression tests for the silent zero-replay:
    • test_resolve_events_dir_under_legacy_local_layout_finds_sibling_events — guards the layout where ledger sits at <bicameral_dir>/local/ledger.db and events at <bicameral_dir>/events/. Pre-fix this returns None; post-fix returns the sibling events dir.
    • test_resolve_events_dir_uses_repo_path_not_locator_project_dir — pins the architectural boundary: events stay repo-local, never routed through project_dir_for() / "events". Guards against re-conflating user-local with repo-local in future refactors.
    • test_reset_cli_replay_surfaces_missing_events_dir_as_replay_error — end-to-end: missing events_dir produces a non-empty replay_errors, never a passing-looking zero-replay response.
  • tests/test_diagnose_cli.py — updated test_diagnose_main_returns_one_on_raw_client_connect_failure (failure surface moved from adapter to raw client). Added test_diagnose_main_cli_does_not_use_adapter_path — patches SurrealDBLedgerAdapter.connect to raise; asserts diagnose still returns 0.
  • ruff check + ruff format --check clean on every changed file.
  • Existing test_diagnose_*, test_reset.py, test_ledger_sync_deserialization_recovery_301.py, test_ledger_locator.py all pass.

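Out of scope (pre-existing on origin/dev)

  • tests/test_diagnose_format.py: 8 failures (Diagnosis fixture missing the row_probe_warnings field added in #301)
  • server.py smoke-test: EXPECTED_TOOL_NAMES missing bicameral.diagnose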

Note on this session

Bicameral sync against the new commit was attempted but blocked by a stale SDK in the running MCP server (PID spawned pre-#408, partial-replay state with the newer SDK's revision). Restart the MCP host to pick it up; decisions linked manually above are the load-bearing set.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


When the ledger hits a SurrealDB row-deserialization error, the agent's
only escape hatch — `bicameral_reset` — is often unreachable: either it's
not in the tool surface a running Claude Code session has pinned, or the
MCP server itself can't start. The CLI fallback was equally broken.

Three concrete fixes:

* `cli/diagnose.py` now opens a raw `LedgerClient` and forwards to
  `gather_diagnosis_raw` (same defensive path the MCP handler uses).
  The old code went through `SurrealDBLedgerAdapter`, which re-runs
  `init_schema`+`migrate` — the very step that's likely failing.
* `bicameral-mcp reset` gains `--confirm`, `--wipe-mode`, and
  `--replay-from-events` flags. With `--confirm`, dispatch goes
  through a thin wrapper (`cli/reset_cli.py`) that calls
  `handle_reset` directly. No `--confirm` keeps the interactive
  wizard for human-driven use.
* When `--wipe-mode=full --confirm` is given and the high-level
  path can't even bring up a ctx (the literal #410 scenario), fall
  back to a direct `shutil.rmtree` of the resolved `.bicameral/`
  dir — full-wipe requiring a working DB is circular.
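
The fallback in the last bullet can be sketched as follows. This is a minimal illustration, not the code in `cli/reset_cli.py`: `noninteractive_reset`, `bring_up_ctx`, and `high_level_reset` are hypothetical stand-ins for the context bring-up and the `handle_reset` dispatch.

```python
import shutil
from pathlib import Path
from typing import Callable


def noninteractive_reset(
    wipe_mode: str,
    bicameral_dir: Path,
    bring_up_ctx: Callable[[], object],
    high_level_reset: Callable[[object, str], str],
) -> str:
    """Try the high-level reset; fall back to rmtree only for a full wipe."""
    try:
        ctx = bring_up_ctx()
    except Exception:
        if wipe_mode != "full":
            # A ledger-only wipe needs a live DB handle, so re-raise
            # instead of silently deleting the whole directory.
            raise
        # Full wipe: the DB cannot come up, so delete the directory directly.
        shutil.rmtree(bicameral_dir, ignore_errors=True)
        return "filesystem-fallback"
    return high_level_reset(ctx, wipe_mode)
```

Re-raising under `ledger` mode mirrors the test-plan item about refusing to silently `rmtree` when only a ledger wipe was requested.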

Recovery hints in `LedgerDeserializationError` and
`_classify_recovery` now lead with the shell form, which is reachable
even when MCP tool-surface pinning hides `bicameral_reset`.

Pre-existing on `origin/dev`, out of scope here:
- `tests/test_diagnose_format.py` (8 failures — `Diagnosis` fixture
  missing the `row_probe_warnings` field added in #301)
- `server.py` smoke-test (`EXPECTED_TOOL_NAMES` missing
  `bicameral.diagnose`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#410, refs #368, #373)

The pre-fix _resolve_events_dir derived its target by string-mangling
the SurrealKV URL — `Path(db_path).parent / "events"`. Once R4 moved
the ledger out from under the repo (`~/.bicameral/projects/<id>/` vs
`<repo>/.bicameral/events/`), the inverse silently pointed at an
empty user-local path. Replay then short-circuited to 0 events with
no error surfaced.

The fix hinges on splitting two distinct concerns the inverse function
had collapsed:

* **Events are repo-local source** — committed to git pre-#373, pulled
  from the configured remote backend post-#373. They are NOT user-local
  state. `_resolve_events_dir` and `_count_events_on_disk` now resolve
  forward through `<repo_path>/.bicameral/events` (or
  BICAMERAL_DATA_PATH for tests), matching the production read site in
  `cli/sync_and_brief_cli.py:65`.

* **Watermark is user-local derived state** — per-author offset into
  the events stream. `_replay_events_into_ledger` continues to write
  the watermark via `ledger_locator.resolve_watermark_path()`, which is
  exactly where the materializer reads it from. The locator owns
  user-local paths; events are not in that bucket and never should be.

`_resolve_bicameral_dir` (the full-wipe target) and the CLI fallback's
`_bicameral_dir_for_url` route through `project_dir_for()` only when no
explicit `SURREAL_URL` is set — the operator-pointed URL still wins so
the existing corruption-recovery test contract is preserved.

Replay now raises `FileNotFoundError` when the events substrate can't
be located, and the caller surfaces it via `replay_errors` — silent
zero-replay becomes an explicit failure that names where the events
were expected and points the user at `sync-and-brief` for fresh-clone
repopulation.
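
A minimal sketch of the "explicit failure" behaviour described above. The function names and the result shape are assumptions modelled on the `events_replayed` and `replay_errors` fields mentioned in this PR, and counting files stands in for real event replay.

```python
from pathlib import Path


def replay_events(events_dir: Path) -> int:
    """Stand-in for replay: raise if the events substrate is missing."""
    if not events_dir.is_dir():
        raise FileNotFoundError(
            f"events directory not found at {events_dir}; "
            "run sync-and-brief to repopulate a fresh clone"
        )
    # Counting files is a placeholder for replaying each event into the ledger.
    return sum(1 for p in events_dir.iterdir() if p.is_file())


def reset_with_replay(events_dir: Path) -> dict:
    """Surface a missing events dir via replay_errors instead of a silent 0."""
    result: dict = {"events_replayed": 0, "replay_errors": []}
    try:
        result["events_replayed"] = replay_events(events_dir)
    except FileNotFoundError as exc:
        result["replay_errors"].append(str(exc))
    return result
```

With a missing directory the response still reports zero replayed events, but `replay_errors` is non-empty and names the expected location, so the result can no longer be mistaken for a clean run.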

Tests (sociable):
* `test_resolve_events_dir_under_legacy_local_layout_finds_sibling_events`
  — guards the layout where ledger lives at `<bicameral_dir>/local/`
  and events at `<bicameral_dir>/events/`. Pre-fix returned None; post-
  fix returns the sibling events dir.
* `test_resolve_events_dir_uses_repo_path_not_locator_project_dir` —
  pins the architectural boundary: events stay repo-local, never
  routed through `project_dir_for() / "events"`. Guards against
  re-conflating user-local with repo-local in future refactors.
* `test_reset_cli_replay_surfaces_missing_events_dir_as_replay_error` —
  end-to-end: missing events_dir produces a non-empty `replay_errors`,
  never a passing-looking zero-replay response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema v24 → v25: add non-unique key indexes on `symbol.name` and
`vocab_cache.(query_text, repo)` so the UPSERT-WHERE call sites in
`ledger/queries.py` stop falling back to full table scans.

Today's bug: `idx_sym_name` and `idx_vocab_query` are both SEARCH/BM25
indexes. They accelerate `@0@` full-text matches but NOT `WHERE field =
$value` equality lookups. The corresponding UPSERTs (`upsert_symbol`,
the vocab_cache UPSERT) scan O(n) per call; replays of large event
logs cross the 5.0s read budget near completion and surface as
`LedgerTimeoutError`. Reproduced today via PR #412's recovery path on
a 3,002-event log.

Per `docs/DEV_CYCLE.md` §4.7:
- Additive only — new indexes alongside existing BM25 ones. ✅ Allowed.
- No flag-gate — invariant fix per the §4.7.2 carve-out, not new
  feature surface.
- Migration in its own commit, idempotent via
  `_execute_define_idempotent`. `init_schema` is the real mechanism;
  `_migrate_v24_to_v25` is the version-boundary safety belt.
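
The "idempotent, version-keyed migration" shape can be illustrated with a toy in-memory schema. Everything here is a simplified assumption: the dict-based schema, `define_index_idempotent`, and the `MIGRATIONS` registry only imitate the roles of `_execute_define_idempotent` and `_MIGRATIONS`, and no SurrealDB calls are involved.

```python
from typing import Callable

# Toy stand-in for the real schema: {"version": int, "indexes": set[str]}.
Schema = dict


def define_index_idempotent(schema: Schema, name: str) -> None:
    # Adding to a set is naturally idempotent: re-running changes nothing.
    schema["indexes"].add(name)


def migrate_v24_to_v25(schema: Schema) -> None:
    # Additive only: new lookup indexes alongside the existing ones.
    define_index_idempotent(schema, "idx_sym_name_lookup")
    define_index_idempotent(schema, "idx_vocab_query_lookup")


# Registry keyed by the version each migration produces.
MIGRATIONS: dict[int, Callable[[Schema], None]] = {25: migrate_v24_to_v25}


def migrate(schema: Schema, target: int) -> None:
    """Apply registered migrations one version at a time up to target."""
    while schema["version"] < target:
        next_version = schema["version"] + 1
        MIGRATIONS[next_version](schema)
        schema["version"] = next_version
```

Calling `migrate` a second time on an already-migrated schema is a no-op, which is the property the safety-belt migration relies on.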

Verification mechanism: SurrealDB 2.x's trailing `EXPLAIN` modifier.
Pre-migration `SELECT * FROM symbol WHERE name = 'x' EXPLAIN` plans
to `Iterate Table` (full scan); post-migration it plans to
`Iterate Index` with `detail.plan.index = "idx_sym_name_lookup"`.
Same shape for vocab_cache. Empirically validated against memory://
during the audit.
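
A check along these lines could be written as a small helper over the rows returned by `EXPLAIN`. The row shape (an `operation` key plus `detail.plan.index`) follows the description above; the helper name and the sample rows are illustrative assumptions, not output captured from the project.

```python
from typing import Any


def plan_uses_index(plan_rows: list[dict[str, Any]], index_name: str) -> bool:
    """Return True if any plan step iterates the named index."""
    for row in plan_rows:
        if row.get("operation") != "Iterate Index":
            continue
        plan = row.get("detail", {}).get("plan", {})
        if plan.get("index") == index_name:
            return True
    return False


# Illustrative plan rows in the shape described above (assumed, not captured).
PRE_MIGRATION = [{"operation": "Iterate Table", "detail": {"table": "symbol"}}]
POST_MIGRATION = [
    {
        "operation": "Iterate Index",
        "detail": {"table": "symbol", "plan": {"index": "idx_sym_name_lookup"}},
    }
]
```

Asserting on the index name, not just on the `Iterate Index` operation, guards against a plan that uses some other index on the same table.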

Tests (`tests/test_schema_index_lookup_perf.py`, sociable):
- `test_upsert_symbol_returns_single_row_for_unique_name` — UPSERT
  semantics: novel name → exactly one row, valid id returned. Seeds
  1000 background rows first.
- `test_upsert_vocab_cache_returns_single_row_for_unique_compound_key`
  — same against vocab_cache compound key.
- `test_symbol_name_lookup_uses_equality_index_post_migration` —
  EXPLAIN-based; asserts `Iterate Index` + `idx_sym_name_lookup`.
  Fails loudly when DEFINE INDEX silently fails to land.
- `test_vocab_cache_lookup_uses_compound_index_post_migration` —
  same for `idx_vocab_query_lookup`.
- `test_schema_version_advances_to_25` — runs init+migrate, asserts
  `schema_meta.version == 25` and `_MIGRATIONS[25]` is the v24→v25
  function.

Audit trail: plan + AUDIT_REPORT.md (R1 VETO, R2 PASS),
META_LEDGER Entry #53 / #53-R2, SHADOW_GENOME Entry #54 (heuristic
#10: introspection-mechanism commitment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan requested a deployment to recording-approval May 19, 2026 06:28 — with GitHub Actions Waiting
@jinhongkuan jinhongkuan merged commit 7588373 into dev May 19, 2026
9 of 10 checks passed