fix(diagnose,reset): unbreak in-agent recovery path (#410)#412
Merged
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
When the ledger hits a SurrealDB row-deserialization error, the agent's only escape hatch — `bicameral_reset` — is often unreachable: either it's not in the tool surface a running Claude Code session has pinned, or the MCP server itself can't start. The CLI fallback was equally broken. Three concrete fixes: * `cli/diagnose.py` now opens a raw `LedgerClient` and forwards to `gather_diagnosis_raw` (same defensive path the MCP handler uses). The old code went through `SurrealDBLedgerAdapter`, which re-runs `init_schema`+`migrate` — the very step that's likely failing. * `bicameral-mcp reset` gains `--confirm`, `--wipe-mode`, and `--replay-from-events` flags. With `--confirm`, dispatch goes through a thin wrapper (`cli/reset_cli.py`) that calls `handle_reset` directly. No `--confirm` keeps the interactive wizard for human-driven use. * When `--wipe-mode=full --confirm` is given and the high-level path can't even bring up a ctx (the literal #410 scenario), fall back to a direct `shutil.rmtree` of the resolved `.bicameral/` dir — full-wipe requiring a working DB is circular. Recovery hints in `LedgerDeserializationError` and `_classify_recovery` now lead with the shell form, which is reachable even when MCP tool-surface pinning hides `bicameral_reset`. Pre-existing on `origin/dev`, out of scope here: - `tests/test_diagnose_format.py` (8 failures — `Diagnosis` fixture missing the `row_probe_warnings` field added in #301) - `server.py` smoke-test (`EXPECTED_TOOL_NAMES` missing `bicameral.diagnose`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a83b7f1 to
b99289e
Compare
#410, refs #368, #373) The pre-fix _resolve_events_dir derived its target by string-mangling the SurrealKV URL — `Path(db_path).parent / "events"`. Once R4 moved the ledger out from under the repo (`~/.bicameral/projects/<id>/` vs `<repo>/.bicameral/events/`), the inverse silently pointed at an empty user-local path. Replay then short-circuited to 0 events with no error surfaced. The fix turns on splitting two distinct concerns the inverse function had collapsed: * **Events are repo-local source** — committed to git pre-#373, pulled from the configured remote backend post-#373. They are NOT user-local state. `_resolve_events_dir` and `_count_events_on_disk` now resolve forward through `<repo_path>/.bicameral/events` (or BICAMERAL_DATA_PATH for tests), matching the production read site in `cli/sync_and_brief_cli.py:65`. * **Watermark is user-local derived state** — per-author offset into the events stream. `_replay_events_into_ledger` continues to write the watermark via `ledger_locator.resolve_watermark_path()`, which is exactly where the materializer reads it from. The locator owns user-local paths; events are not in that bucket and never should be. `_resolve_bicameral_dir` (the full-wipe target) and the CLI fallback's `_bicameral_dir_for_url` route through `project_dir_for()` only when no explicit `SURREAL_URL` is set — the operator-pointed URL still wins so the existing corruption-recovery test contract is preserved. Replay now raises `FileNotFoundError` when the events substrate can't be located, and the caller surfaces it via `replay_errors` — silent zero-replay becomes an explicit failure that names where the events were expected and points the user at `sync-and-brief` for fresh-clone repopulation. Tests (sociable): * `test_resolve_events_dir_under_legacy_local_layout_finds_sibling_events` — guards the layout where ledger lives at `<bicameral_dir>/local/` and events at `<bicameral_dir>/events/`. Pre-fix returned None; post- fix returns the sibling events dir. * `test_resolve_events_dir_uses_repo_path_not_locator_project_dir` — pins the architectural boundary: events stay repo-local, never routed through `project_dir_for() / "events"`. Guards against re-conflating user-local with repo-local in future refactors. * `test_reset_cli_replay_surfaces_missing_events_dir_as_replay_error` — end-to-end: missing events_dir produces a non-empty `replay_errors`, never a passing-looking zero-replay response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b99289e to
d9737d7
Compare
Schema v24 → v25: add non-unique key indexes on `symbol.name` and `vocab_cache.(query_text, repo)` so the UPSERT-WHERE call sites in `ledger/queries.py` stop falling back to full table scans. Today's bug: `idx_sym_name` and `idx_vocab_query` are both SEARCH/BM25 indexes — they accelerate `@0@` semantic matches but NOT `WHERE field = $value` equality lookups. The corresponding UPSERTs (`upsert_symbol`, the vocab_cache UPSERT) scan O(n) per call; replays of large event logs cross the 5.0s read budget near completion and surface as `LedgerTimeoutError`. Reproduced today via PR #412's recovery path on a 3,002-event log. Per `docs/DEV_CYCLE.md` §4.7: - Additive only — new indexes alongside existing BM25 ones. ✅ Allowed. - No flag-gate — invariant fix per the §4.7.2 carve-out, not new feature surface. - Migration in its own commit, idempotent via `_execute_define_idempotent`. `init_schema` is the real mechanism; `_migrate_v24_to_v25` is the version-boundary safety belt. Verification mechanism: SurrealDB 2.x's trailing `EXPLAIN` modifier. Pre-migration `SELECT * FROM symbol WHERE name = 'x' EXPLAIN` plans to `Iterate Table` (full scan); post-migration it plans to `Iterate Index` with `detail.plan.index = "idx_sym_name_lookup"`. Same shape for vocab_cache. Empirically validated against memory:// during the audit. Tests (`tests/test_schema_index_lookup_perf.py`, sociable): - `test_upsert_symbol_returns_single_row_for_unique_name` — UPSERT semantics: novel name → exactly one row, valid id returned. Seeds 1000 background rows first. - `test_upsert_vocab_cache_returns_single_row_for_unique_compound_key` — same against vocab_cache compound key. - `test_symbol_name_lookup_uses_equality_index_post_migration` — EXPLAIN-based; asserts `Iterate Index` + `idx_sym_name_lookup`. Fails loudly when DEFINE INDEX silently fails to land. - `test_vocab_cache_lookup_uses_compound_index_post_migration` — same for `idx_vocab_query_lookup`. - `test_schema_version_advances_to_25` — runs init+migrate, asserts `schema_meta.version == 25` and `_MIGRATIONS[25]` is the v24→v25 function. Audit trail: plan + AUDIT_REPORT.md (R1 VETO, R2 PASS), META_LEDGER Entry #53 / #53-R2, SHADOW_GENOME Entry #54 (heuristic #10: introspection-mechanism commitment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #410. When the ledger hits a SurrealDB row-deserialization error, the agent's named recovery path (
bicameral_reset) is often unreachable — either because Claude Code's tool surface has pinned the old schema, or because the MCP server itself can't finish init. This PR makes the CLI form genuinely callable from inside the agent loop, and (as part of the recovery substrate) fixes a silent zero-replay regression in--replay-from-events.What changed
Diagnose + reset CLI escape hatch (the original #410 scope)
cli/diagnose.pynow opens a rawLedgerClientand forwards togather_diagnosis_raw— the same defensive pathhandlers/diagnose.pyalready uses. The old code went throughSurrealDBLedgerAdapter, which re-runsinit_schema+migrate(the step the operator is trying to diagnose around).bicameral-mcp resetgains--confirm,--wipe-mode={ledger,full}, and--replay-from-events. With--confirmwe bypass the interactive wizard and dispatch throughcli/reset_cli.py:run_noninteractive_reset, a thin wrapper aroundhandle_reset. No--confirmkeeps the wizard for human-driven use.--confirm --wipe-mode=fullis given andBicameralContext.from_env()/ledger.connect()raises (the literal reset tool missing + diagnose fails on SDK row-revision mismatch #410 scenario), weshutil.rmtreethe resolved.bicameral/directly.LedgerDeserializationError.RECOVERY_HINTandhandlers/diagnose.py:_classify_recoverynow lead with the shell form, which works even when the MCPbicameral_resettool isn't in the agent's pinned surface.Silent zero-replay regression (
d9737d7)While dogfooding the new CLI against a real R4-layout install,
--replay-from-eventsreportedevents_replayed: 0with zero entries inreplay_errorsdespite 3,002 valid events on disk. The pre-fix_resolve_events_dirderived its target by string-mangling the SurrealKV URL —Path(db_path).parent / "events". That inverse only worked while the ledger and events shared a parent. Once #408 (Ledger Locator) moved the ledger to~/.bicameral/projects/<id>/ledger.dbwhile events stayed repo-local at<repo>/.bicameral/events/, the inverse silently pointed at an empty path.The fix splits two distinct concerns the inverse function had collapsed:
<repo_path>/.bicameral/events(matchescli/sync_and_brief_cli.py:65)ledger_locator.resolve_watermark_path()ledger_locator.project_dir_for()when no explicitSURREAL_URLis set; URL inverse otherwise (preserves the corruption-recovery test contract)Replay now raises
FileNotFoundErrorwhen the events substrate can't be located, and the caller surfaces it viareplay_errors. Silent zero-replay becomes an explicit failure that names where events were expected and points the user atsync-and-brieffor fresh-clone repopulation.Linked decisions (from #408)
Forward-resolution through the locator is grounded in:
decision:ko8efq3z1zwhbof7kecq— Ledger Locator naming + scopedecision:rfbnlw7ghe175iu42u6b— Project identity viagit rev-parse --git-common-dirhashdecision:5nr66wvmapjpt58rrji8— R4 config split (the layout transition that exposed the URL-inverse bug)Boundary between user-local state (locator) and repo-local events tracks the deprecation arc in #373 — the event substrate must not be conflated with the user-local derived-state bucket.
Why this works for the original reproduction
bicameral.dashboardreturnsLedgerDeserializationError, recovery hint namesbicameral_reset(not reachable)bicameral-mcp reset --confirm ...firstbicameral_resetMCP tool gated / unregistered in running sessionbicameral-mcp diagnosefrom shell crashes on migrate (Invalid revision 101 for type DefineTableStatement)--replay-from-eventssilently zero-replays under R4 layout_resolve_events_dirforward-resolves repo-local; missing dir raises and surfaces viareplay_errorsTest plan
tests/test_reset_cli_410.py— 9 tests cover the flag surface, happy-path ledger wipe againstmemory://, filesystem fallback whenfrom_env()raises during--wipe-mode=full, the explicit refusal to silently rmtree under--wipe-mode=ledger, the CLI-form presence in both recovery-hint surfaces, and three new regression tests for the silent zero-replay:test_resolve_events_dir_under_legacy_local_layout_finds_sibling_events— guards the layout where ledger sits at<bicameral_dir>/local/ledger.dband events at<bicameral_dir>/events/. Pre-fix this returns None; post-fix returns the sibling events dir.test_resolve_events_dir_uses_repo_path_not_locator_project_dir— pins the architectural boundary: events stay repo-local, never routed throughproject_dir_for() / "events". Guards against re-conflating user-local with repo-local in future refactors.test_reset_cli_replay_surfaces_missing_events_dir_as_replay_error— end-to-end: missingevents_dirproduces a non-emptyreplay_errors, never a passing-looking zero-replay response.tests/test_diagnose_cli.py— updatedtest_diagnose_main_returns_one_on_raw_client_connect_failure(failure surface moved from adapter to raw client). Addedtest_diagnose_main_cli_does_not_use_adapter_path— patchesSurrealDBLedgerAdapter.connectto raise; asserts diagnose still returns 0.ruff check+ruff format --checkclean on every changed file.test_diagnose_*,test_reset.py,test_ledger_sync_deserialization_recovery_301.py,test_ledger_locator.pyall pass.Out of scope (pre-existing on
origin/dev)tests/test_diagnose_format.py— 8 failures from a fixture missingrow_probe_warnings(field added in fix(ledger): ledger_sync deserialization error — Invalid revision3for type Value #301, fixture never updated). Confirmed onorigin/devindependent of this PR.server.py --smoke-test—EXPECTED_TOOL_NAMESconstant is missingbicameral.diagnose. Same one-liner drift onorigin/dev. Worth a tiny follow-up.Note on this session
Bicameral sync against the new commit was attempted but blocked by a stale SDK in the running MCP server (PID spawned pre-#408, partial-replay state with the newer SDK's revision). Restart the MCP host to pick it up; decisions linked manually above are the load-bearing set.
🤖 Generated with Claude Code