fix(ledger): route ledger_sync deserialization warnings to wipe-and-replay recovery (#301) by jinhongkuan · Pull Request #403 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-05-17T11:00:26Z

Why

#301: `bicameral.link_commit` fails with ``SurrealDB rejected query: Versioned error: A deserialization error occured: Invalid revision `3` for type `Value` ``, and the agent's natural recovery instinct (run `bicameral.diagnose`) returns `recovery_path: clean` / `next_action: "Ledger is at expected schema v17. No remediation needed."`. The user is stuck.

v0.15.0 already added a row-level probe (cli/_diagnose_gather.py::_probe_row_deserialization) that catches this failure mode and writes the warning into Diagnosis.row_probe_warnings + the suggestions list. But the probe's findings stopped there — handlers/diagnose.py::_classify_recovery only inspects schema_meta.version, so the recovery_path enum (which the agent's skill text branches on) stays clean. And handlers/sync_middleware.py::ensure_ledger_synced swallows the underlying LedgerError at DEBUG, so the agent never even sees the error message from the sync attempt.

Net effect on v0.15.0: the probe runs but its verdict doesn't reach the agent's decision surface.

What

New LedgerDeserializationError (subclass of LedgerError) — ledger.client.query / ledger.client.execute raise it instead of the generic class when SurrealDB returns a record-format mismatch. The exception message embeds the recovery command (bicameral_reset(wipe_mode='ledger', replay_from_events=True, confirm=True)), so the agent sees the wipe-and-replay instruction inside the MCP error envelope.
_classify_recovery now consults Diagnosis.row_probe_warnings before the schema-version checks. Non-empty warnings route to reset_rebuild (when .bicameral/events/*.jsonl is present next to the ledger) or reset_destructive (no events on disk) with a next_action that quotes the exact bicameral_reset(...) call.
ensure_ledger_synced re-raises LedgerDeserializationError instead of swallowing it at DEBUG. The broad `except Exception` is still in place for transient catch-up failures (the original best-effort contract); only deserialization errors break out, because they're the one class of failure the agent must surface to the user.
Version bump → 0.15.1 (pyproject.toml, RECOMMENDED_VERSION).
CHANGELOG entry with explicit notes for v0.14.x → v0.15.1 upgraders: the SurrealDB SDK pin is unchanged, so persisted rows still need wipe-and-replay. What's new is the visibility of the recovery path.
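
The first item above can be sketched as follows. This is a minimal illustration, not the code in ledger/client.py: the class names and `RECOVERY_HINT` come from this PR, while the marker tuple, the body of `_is_deserialization_error`, the exact hint wording, and the helper `wrap_db_error` are assumptions made for the example.

```python
class LedgerError(Exception):
    """Stand-in for the project's base ledger exception."""


class LedgerDeserializationError(LedgerError):
    """Raised when SurrealDB reports a record-format mismatch."""

    # Assumed wording; the PR only states that the hint mentions
    # bicameral_reset and replay_from_events=True.
    RECOVERY_HINT = (
        "Run bicameral_reset(wipe_mode='ledger', replay_from_events=True, "
        "confirm=True) to wipe the ledger and replay from events."
    )

    def __init__(self, detail: str) -> None:
        # Embed the recovery command so it travels inside the error message.
        super().__init__(f"{detail} | {self.RECOVERY_HINT}")


# Hypothetical markers, taken from the error text quoted in #301.
_MARKERS = ("invalid revision", "deserialization error")


def _is_deserialization_error(message: str) -> bool:
    """Return True when the message looks like a row-format mismatch."""
    lowered = message.lower()
    return any(marker in lowered for marker in _MARKERS)


def wrap_db_error(exc: Exception) -> LedgerError:
    """Hypothetical helper: choose the exception class for a raw DB error."""
    text = str(exc)
    if _is_deserialization_error(text):
        return LedgerDeserializationError(text)
    return LedgerError(text)
```

Because the new class subclasses `LedgerError`, handler blocks that already catch `LedgerError` keep working without changes.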

Out of scope

No SurrealDB SDK bump. The SDK pin (surrealdb==2.0.0) stays — bumping it would require a separate schema + format-migration story.
No automatic ledger wipe. Recovery remains a deliberate operator action (bicameral_reset with confirm=True). The hotfix only makes the recovery path discoverable.
No retroactive repair of ledger_sync rows persisted by prior versions. Affected users still run the reset command.
cli/_diagnose_gather.py::_probe_row_deserialization is untouched — it already exists from v0.15.0 (commit 72bbd20).

Acceptance

CI green.
tests/test_ledger_sync_deserialization_recovery_301.py — 13 sociable tests covering (a) classification of Invalid revision / deserialization error substrings, (b) _classify_recovery routing on row_probe_warnings (reset_rebuild w/ events, reset_destructive w/o), (c) ensure_ledger_synced re-raises the new class but still swallows unrelated RuntimeErrors, (d) the new exception is a subclass of LedgerError so existing handler blocks still catch it.
LedgerDeserializationError.RECOVERY_HINT mentions bicameral_reset and replay_from_events=True.
handlers/diagnose.py::_classify_recovery returns reset_rebuild or reset_destructive (not clean) when diagnosis.row_probe_warnings is non-empty, even if schema_recorded == schema_expected.
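
The last acceptance item can be illustrated with a simplified routing function. This is a sketch under stated assumptions: the `Diagnosis` dataclass below is a reduced stand-in, the `has_events` field is a hypothetical flag for whether `.bicameral/events/*.jsonl` exists, the `table: detail` warning format is inferred from the `split(":", 1)` call in the diff, and the schema-mismatch branches are deliberately elided.

```python
from dataclasses import dataclass, field
from typing import Literal

RecoveryPath = Literal["clean", "reset_rebuild", "reset_destructive"]


@dataclass
class Diagnosis:
    """Reduced stand-in; only the fields this sketch needs."""

    schema_recorded: int
    schema_expected: int
    has_events: bool  # assumed flag: event files exist next to the ledger
    row_probe_warnings: list[str] = field(default_factory=list)


def classify_recovery(d: Diagnosis) -> tuple[RecoveryPath, str]:
    # Row-probe warnings are checked before the schema-version comparison.
    if d.row_probe_warnings:
        path: RecoveryPath = "reset_rebuild" if d.has_events else "reset_destructive"
        tables = ", ".join(sorted({w.split(":", 1)[0] for w in d.row_probe_warnings}))
        return path, (
            f"Row-level deserialization warnings on {tables}. Run "
            f"bicameral_reset(wipe_mode='ledger', "
            f"replay_from_events={d.has_events}, confirm=True)."
        )
    if d.schema_recorded == d.schema_expected:
        return "clean", (
            f"Ledger is at expected schema v{d.schema_expected}. "
            "No remediation needed."
        )
    # The real function handles schema mismatches; this sketch does not.
    raise NotImplementedError("schema-mismatch branches are elided in this sketch")
```

With matching schema versions and a non-empty warning list, the function returns a reset path rather than `clean`, which is the behavior the acceptance criterion asks for.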

Closes #301

🤖 Generated with Claude Code

Summary by CodeRabbit

v0.15.1 Release Notes

Bug Fixes
- Enhanced ledger synchronization error detection and recovery for row format mismatches
- Automated recovery mechanism now properly routes to wipe-and-replay recovery when needed
Tests
- Added comprehensive test coverage for ledger deserialization recovery flows

…eplay recovery (#301)

v0.15.0 added the row-level probe (cli/_diagnose_gather.py::_probe_row_deserialization) but its findings stopped at the suggestions list: _classify_recovery still inspected only schema_meta.version, so an agent that ran diagnose after a link_commit failure saw recovery_path=clean / "No remediation needed" while the ledger was actually unreadable.

This wires the probe through:

- New LedgerDeserializationError (subclass of LedgerError) is raised from ledger.client.query/execute when SurrealDB returns "Invalid revision `N` for type `Value`" or a "deserialization error" wrapper. The exception message embeds the recovery command so the agent sees the wipe-and-replay instruction inside the MCP error envelope.
- handlers/diagnose.py::_classify_recovery consults row_probe_warnings before the schema-version checks and routes to reset_rebuild / reset_destructive with a quoted bicameral_reset(...) next_action.
- handlers/sync_middleware.py::ensure_ledger_synced re-raises LedgerDeserializationError instead of swallowing it at DEBUG. The broad except Exception still catches transient catch-up failures.

The SurrealDB SDK pin is unchanged: v0.14.x users hit by #301 still need to wipe and replay; this PR makes the recovery path discoverable instead of leaving them with a bare LedgerError.

Bumps version → 0.15.1 and RECOMMENDED_VERSION → 0.15.1.

Closes #301

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-17T11:00:44Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d5bd869-ce80-46e3-8b2d-b14fa678bcdc

📥 Commits

Reviewing files that changed from the base of the PR and between aebb92f and cb01c5c.

📒 Files selected for processing (1)

handlers/diagnose.py

📝 Walkthrough

This PR fixes issue #301 by introducing a specialized LedgerDeserializationError to detect SurrealDB row-format mismatches in ledger queries, updating recovery classification to prioritize row probe warnings, and ensuring the error propagates through middleware instead of being suppressed.

Changes

Deserialization Error Detection & Recovery Routing

Layer / File(s)	Summary
Ledger deserialization error type and detection `ledger/client.py`	Introduces `LedgerDeserializationError` subclass with embedded `RECOVERY_HINT`, adds `_is_deserialization_error()` classifier that detects SurrealDB row-format mismatch signatures, and updates `LedgerClient.query()` and `execute()` methods to raise this specific error instead of generic `LedgerError` on deserialization failures.
Recovery classification based on row probe warnings `handlers/diagnose.py`	Updates `_classify_recovery` to check `Diagnosis.row_probe_warnings` before schema-version comparisons; when row warnings are present, returns `reset_destructive` (no events) or `reset_rebuild` (events exist) with actionable `next_action` recovery command, preventing deserialization mismatches from being misclassified as `clean`.
Middleware error propagation `handlers/sync_middleware.py`	Adds import for `LedgerDeserializationError` and introduces dedicated handler in `ensure_ledger_synced` that catches this error, logs a warning, and re-raises to surface it to the MCP transport layer; other exceptions remain swallowed with debug logging.
Test coverage for deserialization recovery flow `tests/test_ledger_sync_deserialization_recovery_301.py`	Comprehensive test suite covering error classification (signature detection, message recovery hints), recovery routing (row-warning prioritization, events-file-based path selection), and middleware behavior (re-raising deserialization errors vs swallowing others).
Version and documentation updates `CHANGELOG.md`, `RECOMMENDED_VERSION`, `pyproject.toml`	Adds v0.15.1 hotfix changelog entry documenting the ledger deserialization error fix and recovery guidance; bumps version from 0.15.0 to 0.15.1.

Sequence Diagram

sequenceDiagram
  participant Agent as Agent/MCP
  participant Middleware as ensure_ledger_synced
  participant Ledger as LedgerClient
  participant Diagnose as _classify_recovery
  Agent->>Middleware: sync request
  Middleware->>Ledger: query/execute (HEAD catch-up)
  Ledger-->>Middleware: SurrealError (deserialization)
  Middleware->>Middleware: detect LedgerDeserializationError
  Middleware-->>Agent: re-raise to transport
  Agent->>Diagnose: call diagnose()
  Diagnose->>Diagnose: check row_probe_warnings
  Diagnose-->>Agent: recovery_path: reset_rebuild/destructive
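
The middleware half of the diagram can be sketched as below. This is an assumption-laden illustration rather than the code in handlers/sync_middleware.py: the real `ensure_ledger_synced` signature is not shown in this PR, so the `catch_up` callable parameter, the logger name, and the log messages are hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("sync_middleware_sketch")  # hypothetical logger name


class LedgerError(Exception):
    """Stand-in for the project's base ledger exception."""


class LedgerDeserializationError(LedgerError):
    """Stand-in for the exception introduced by this PR."""


async def ensure_ledger_synced(catch_up: Callable[[], Awaitable[None]]) -> None:
    """Best-effort catch-up that surfaces only deserialization failures."""
    try:
        await catch_up()
    except LedgerDeserializationError:
        # The one failure class the agent must see: log it and re-raise.
        logger.warning("ledger rows unreadable; surfacing to caller")
        raise
    except Exception as exc:
        # Original best-effort contract: transient failures stay quiet.
        logger.debug("ledger catch-up failed (ignored): %s", exc)
```

The order of the `except` clauses matters: the specific handler has to come before the broad one, otherwise `except Exception` would swallow the deserialization error as well.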

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

BicameralAI/bicameral-mcp#298: Introduces the base _classify_recovery recovery-path classification system in handlers/diagnose.py that this PR extends with row-probe-warnings prioritization.

Poem

🐰 A ledger once broken by revision mismatch,
Now detects its own wounds with a catch!
Row warnings take flight, before schemas align—
Recovery paths bloom, /bicameral-sync shines. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly describes the main change: routing ledger sync deserialization warnings to wipe-and-replay recovery, which is the core objective of this hotfix.
Linked Issues check	✅ Passed	The pull request successfully implements all coding requirements from issue `#301`: detects row-level deserialization failures via LedgerDeserializationError, routes them to wipe-and-replay recovery in _classify_recovery, and re-raises the error in sync middleware to surface it to the agent.
Out of Scope Changes check	✅ Passed	All changes are directly related to issue `#301` objectives: detecting and routing deserialization failures. No out-of-scope modifications to unrelated systems or features were introduced.


… no-redef `path` is annotated at line 133 (the new #301 row_probe_warnings branch) and also at line 143 (existing schema-newer-than-binary branch). Same scope → same name → mypy no-redef. Drop the later annotation; type is unchanged because the literal still narrows to `RecoveryPath`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
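
The no-redef fix described in this commit message can be shown in isolation. The function below is a hypothetical reduction, not the code in handlers/diagnose.py: the `classify` name, its parameters, and the `"upgrade_binary"` value are invented for the example.

```python
from typing import Literal

# "upgrade_binary" is a made-up value standing in for the existing branch.
RecoveryPath = Literal["clean", "reset_rebuild", "reset_destructive", "upgrade_binary"]


def classify(row_warnings: list[str], has_events: bool, schema_newer: bool) -> RecoveryPath:
    if row_warnings:
        # First (and only) annotated declaration of `path` in this scope.
        path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
        return path
    if schema_newer:
        # Writing `path: RecoveryPath = ...` again here would trigger mypy's
        # no-redef error; a plain assignment reuses the earlier declaration.
        path = "upgrade_binary"
        return path
    return "clean"
```

Dropping the second annotation leaves the runtime behavior untouched, since annotations on local variables have no effect at execution time.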

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/test_ledger_sync_deserialization_recovery_301.py (1)
74-108: 💤 Low value

Consider adding schema initialization for guideline alignment.

The coding guideline states: "For ledger query tests, never MagicMock the client; use the real LedgerClient(url="memory://", ...) + init_schema + migrate". These tests correctly use the real client but skip schema initialization. While not strictly necessary for narrow seam tests that patch _db.query (the schema is never queried), adding init_schema and migrate would improve consistency with the guideline and make the test setup more realistic.
♻️ Example for test_query_raises_deserialization_error_when_surrealdb_complains
 async def test_query_raises_deserialization_error_when_surrealdb_complains():
     """The classifier triggers on a real LedgerClient.query() path.
 
     Narrow seam: we patch the surrealdb-py async call so it raises
     ``SurrealError("Invalid revision ...")`` — this is the documented failure
     mode for SurrealKV record-format drift and cannot be triggered naturally
     against ``memory://``.
     """
     client = LedgerClient(url="memory://", ns="t301_q", db="ledger_test")
     await client.connect()
+    await init_schema(client)
+    await migrate(client)
     try:
Apply the same pattern to test_query_with_non_deserialization_error_still_raises_plain_ledger_error.
As per coding guidelines: "For ledger query tests, never MagicMock the client; use the real LedgerClient(url="memory://", ...) + init_schema + migrate".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_ledger_sync_deserialization_recovery_301.py` around lines 74 -
108, Add schema initialization to both tests by calling the real LedgerClient's
init_schema and migrate before exercising the patched query: after await
client.connect() invoke await client.init_schema() and await client.migrate() in
test_query_raises_deserialization_error_when_surrealdb_complains and
test_query_with_non_deserialization_error_still_raises_plain_ledger_error so the
in-memory client is set up per guidelines while keeping the existing patching of
client._db.query and existing asserts intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@handlers/diagnose.py`:
- Around line 136-140: The next_action string currently instructs users to "wipe
and replay from .bicameral/events/" even when has_events is False; update the
conditional message construction around next_action (the code that formats the
string with replay_from_events={has_events}) so when has_events is False it does
not mention replaying from .bicameral/events (e.g., change the tail to "wipe
only (no events to replay)" or similar), otherwise keep the existing "wipe and
replay from .bicameral/events/" wording when has_events is True.
- Around line 133-134: Rename the variable currently assigned as path in the
branch that handles row_warnings to warning_path (e.g., change "path:
RecoveryPath = ..." to "warning_path: RecoveryPath = ...") and update any
subsequent uses/return in that branch to return warning_path instead of path so
it no longer collides with the other branch's path variable; locate this in the
function handling recovery paths where has_events, row_warnings, tables are
computed and adjust the corresponding return statement(s) to reference
warning_path.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0ecdd36-bc22-49e2-89a6-fe9f79b17ea8

📥 Commits

Reviewing files that changed from the base of the PR and between 6963cb0 and aebb92f.

📒 Files selected for processing (7)

CHANGELOG.md
RECOMMENDED_VERSION
handlers/diagnose.py
handlers/sync_middleware.py
ledger/client.py
pyproject.toml
tests/test_ledger_sync_deserialization_recovery_301.py

coderabbitai · 2026-05-17T11:06:04Z

+        path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
+        tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify duplicate typed declarations are removed (expect 1 match after fix).
rg -nP '\bpath:\s*RecoveryPath\b|\bwarning_path:\s*RecoveryPath\b' handlers/diagnose.py

Repository: BicameralAI/bicameral-mcp

Length of output: 245

🏁 Script executed:

sed -n '120,150p' handlers/diagnose.py | cat -n

Repository: BicameralAI/bicameral-mcp

Length of output: 1932

🏁 Script executed:

sed -n '133,141p' handlers/diagnose.py | cat -n

Repository: BicameralAI/bicameral-mcp

Length of output: 628

Resolve path redefinition to unblock mypy.

The typed assignment on line 133 conflicts with the typed assignment on line 143 (no-redef), causing CI failure. Both are separate conditional branches that independently define the same variable name. The proposed fix correctly renames line 133's variable to warning_path and updates the corresponding return statement, eliminating the redefinition while preserving the distinct logic for each recovery path.

🛠️ Proposed fix

-        path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
+        warning_path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
         tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))
-        return path, (
+        return warning_path, (
             f"Row-level deserialization warnings on {tables} — likely a "
             "SurrealDB embedded-SDK record-format mismatch. Run "
             f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
             "confirm=True)` to wipe and replay from .bicameral/events/."
         )

🧰 Tools

🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt

[error] mypy . failed with 1 error (no-redef). Checked 134 source files.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@handlers/diagnose.py` around lines 133 - 134, Rename the variable currently assigned as path in the branch that handles row_warnings to warning_path (e.g., change "path: RecoveryPath = ..." to "warning_path: RecoveryPath = ...") and update any subsequent uses/return in that branch to return warning_path instead of path so it no longer collides with the other branch's path variable; locate this in the function handling recovery paths where has_events, row_warnings, tables are computed and adjust the corresponding return statement(s) to reference warning_path.

coderabbitai · 2026-05-17T11:06:05Z

+            f"Row-level deserialization warnings on {tables} — likely a "
+            "SurrealDB embedded-SDK record-format mismatch. Run "
+            f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
+            "confirm=True)` to wipe and replay from .bicameral/events/."
+        )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make next_action text consistent with destructive recovery.

When has_events is False, the command correctly uses replay_from_events=False, but the sentence still says "wipe and replay from .bicameral/events/". That instruction is contradictory for the destructive path.

✏️ Proposed fix

         path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
         tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))
+        replay_text = (
+            "to wipe and replay from .bicameral/events/."
+            if has_events
+            else "to wipe the ledger (no replayable .bicameral/events/*.jsonl found)."
+        )
         return path, (
             f"Row-level deserialization warnings on {tables} — likely a "
             "SurrealDB embedded-SDK record-format mismatch. Run "
             f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
-            "confirm=True)` to wipe and replay from .bicameral/events/."
+            f"confirm=True)` {replay_text}"
         )

🧰 Tools

🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt

[error] mypy . failed with 1 error (no-redef). Checked 134 source files.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@handlers/diagnose.py` around lines 136 - 140, The next_action string currently instructs users to "wipe and replay from .bicameral/events/" even when has_events is False; update the conditional message construction around next_action (the code that formats the string with replay_from_events={has_events}) so when has_events is False it does not mention replaying from .bicameral/events (e.g., change the tail to "wipe only (no events to replay)" or similar), otherwise keep the existing "wipe and replay from .bicameral/events/" wording when has_events is True.

jinhongkuan added labels May 17, 2026: flow:hotfix (Emergency fix targeting main directly; must be synced back to dev, DEV_CYCLE.md s10), P1 (High: ship this milestone; user-impacting bug or committed feature), fix (Bug fix or correctness repair), ledger (Decision ledger, persistence, or query surface)

jinhongkuan temporarily deployed to ci-test May 17, 2026 11:00 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 17, 2026 11:00 — with GitHub Actions Inactive

jinhongkuan requested a deployment to recording-approval May 17, 2026 11:00 — with GitHub Actions Waiting

jinhongkuan temporarily deployed to ci-test May 17, 2026 11:00 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 17, 2026 11:04 — with GitHub Actions Inactive

jinhongkuan requested a deployment to recording-approval May 17, 2026 11:04 — with GitHub Actions Waiting

jinhongkuan temporarily deployed to ci-test May 17, 2026 11:04 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 17, 2026 11:04 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 17, 2026

jinhongkuan merged commit e57f07a into main May 17, 2026
10 of 11 checks passed

jinhongkuan mentioned this pull request May 17, 2026

fix(ledger): ledger_sync deserialization error — Invalid revision 3 for type Value #301

Closed

jinhongkuan merged 2 commits into main from hotfix/0.15.1-301-deserialization-recovery-routing
