
fix(ledger): route ledger_sync deserialization warnings to wipe-and-replay recovery (#301) #403

Merged
jinhongkuan merged 2 commits into main from hotfix/0.15.1-301-deserialization-recovery-routing
May 17, 2026

Conversation

@jinhongkuan jinhongkuan commented May 17, 2026

Contributor

Why

#301: bicameral.link_commit fails with "SurrealDB rejected query: Versioned error: A deserialization error occured: Invalid revision `3` for type `Value`", and the agent's natural recovery instinct (run bicameral.diagnose) returns recovery_path: clean / next_action: "Ledger is at expected schema v17. No remediation needed." The user is stuck.

v0.15.0 already added a row-level probe (cli/_diagnose_gather.py::_probe_row_deserialization) that catches this failure mode and writes the warning into Diagnosis.row_probe_warnings + the suggestions list. But the probe's findings stopped there — handlers/diagnose.py::_classify_recovery only inspects schema_meta.version, so the recovery_path enum (which the agent's skill text branches on) stays clean. And handlers/sync_middleware.py::ensure_ledger_synced swallows the underlying LedgerError at DEBUG, so the agent never even sees the error message from the sync attempt.

Net effect on v0.15.0: the probe runs but its verdict doesn't reach the agent's decision surface.

What

  • New LedgerDeserializationError (subclass of LedgerError) — ledger.client.query / ledger.client.execute raise it instead of the generic class when SurrealDB returns a record-format mismatch. The exception message embeds the recovery command (bicameral_reset(wipe_mode='ledger', replay_from_events=True, confirm=True)), so the agent sees the wipe-and-replay instruction inside the MCP error envelope.
  • _classify_recovery now consults Diagnosis.row_probe_warnings before the schema-version checks. Non-empty warnings route to reset_rebuild (when .bicameral/events/*.jsonl is present next to the ledger) or reset_destructive (no events on disk) with a next_action that quotes the exact bicameral_reset(...) call.
  • ensure_ledger_synced re-raises LedgerDeserializationError instead of swallowing it at DEBUG. The broad `except Exception` is still in place for transient catch-up failures (the original best-effort contract); only deserialization errors break out, because they're the one class of failure the agent must surface to the user.
  • Version bump → 0.15.1 (pyproject.toml, RECOMMENDED_VERSION).
  • CHANGELOG entry with explicit notes for v0.14.x → v0.15.1 upgraders: the SurrealDB SDK pin is unchanged, so persisted rows still need wipe-and-replay. What's new is the visibility of the recovery path.
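The first bullet can be pictured with a minimal sketch. This is not the code in ledger/client.py: `LedgerError` is a local stand-in for the project's base class, the hint wording is invented for illustration (the PR only says it mentions bicameral_reset and replay_from_events=True), and the `classify` helper is hypothetical. The two marker substrings are the ones named in the acceptance notes.

```python
class LedgerError(Exception):
    """Stand-in for the project's base ledger exception."""


class LedgerDeserializationError(LedgerError):
    """Raised when SurrealDB reports a record-format mismatch."""

    # Illustrative wording only; the real RECOVERY_HINT may read differently.
    RECOVERY_HINT = (
        "Run bicameral_reset(wipe_mode='ledger', replay_from_events=True, "
        "confirm=True) to wipe and replay the ledger."
    )

    def __init__(self, detail: str) -> None:
        # Embed the recovery command in the message so it travels with the error.
        super().__init__(f"{detail} | {self.RECOVERY_HINT}")


# Substrings named in the PR's acceptance notes, matched case-insensitively here.
_MARKERS = ("invalid revision", "deserialization error")


def _is_deserialization_error(message: str) -> bool:
    lowered = message.lower()
    return any(marker in lowered for marker in _MARKERS)


def classify(exc: Exception) -> LedgerError:
    """Hypothetical helper: wrap a driver exception in the matching ledger class."""
    text = str(exc)
    if _is_deserialization_error(text):
        return LedgerDeserializationError(text)
    return LedgerError(text)
```

Because the new class subclasses `LedgerError`, any existing `except LedgerError` block keeps catching it; only callers that want the distinction need to name the subclass.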

Out of scope

  • No SurrealDB SDK bump. The SDK pin (surrealdb==2.0.0) stays — bumping it would require a separate schema + format-migration story.
  • No automatic ledger wipe. Recovery remains a deliberate operator action (bicameral_reset with confirm=True). The hotfix only makes the recovery path discoverable.
  • No retroactive repair of ledger_sync rows persisted by prior versions. Affected users still run the reset command.
  • cli/_diagnose_gather.py::_probe_row_deserialization is untouched — it already exists from v0.15.0 (commit 72bbd20).

Acceptance

  • CI green.
  • tests/test_ledger_sync_deserialization_recovery_301.py — 13 sociable tests covering (a) classification of Invalid revision / deserialization error substrings, (b) _classify_recovery routing on row_probe_warnings (reset_rebuild w/ events, reset_destructive w/o), (c) ensure_ledger_synced re-raises the new class but still swallows unrelated RuntimeErrors, (d) the new exception is a subclass of LedgerError so existing handler blocks still catch it.
  • LedgerDeserializationError.RECOVERY_HINT mentions bicameral_reset and replay_from_events=True.
  • handlers/diagnose.py::_classify_recovery returns reset_rebuild or reset_destructive (not clean) when diagnosis.row_probe_warnings is non-empty, even if schema_recorded == schema_expected.
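The routing rule in the last acceptance item can be sketched as follows. The `Diagnosis` dataclass here is a reduced stand-in, the `has_events` field is a hypothetical flag for "events/*.jsonl found on disk", and the remaining schema-version branches of the real handler are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    """Stand-in carrying only the fields this sketch needs."""

    schema_recorded: int
    schema_expected: int
    row_probe_warnings: list[str] = field(default_factory=list)
    has_events: bool = False  # hypothetical: True when event logs exist on disk


def classify_recovery(d: Diagnosis) -> tuple[str, str]:
    # Row-probe warnings are checked first, so a matching schema version
    # cannot mask rows that fail to deserialize.
    if d.row_probe_warnings:
        path = "reset_rebuild" if d.has_events else "reset_destructive"
        call = (
            "bicameral_reset(wipe_mode='ledger', "
            f"replay_from_events={d.has_events}, confirm=True)"
        )
        return path, f"Run {call}."
    if d.schema_recorded == d.schema_expected:
        return "clean", "No remediation needed."
    # The real handler has further schema-version branches, omitted here.
    raise NotImplementedError("schema-version branches omitted from this sketch")
```

The ordering is the whole point: with the warning check placed after the version comparison, a ledger at the expected schema version would still be reported as clean.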

Closes #301

🤖 Generated with Claude Code

Summary by CodeRabbit

v0.15.1 Release Notes

  • Bug Fixes

    • Enhanced ledger synchronization error detection and recovery for row format mismatches
    • Automated recovery mechanism now properly routes to wipe-and-replay recovery when needed
  • Tests

    • Added comprehensive test coverage for ledger deserialization recovery flows

Review Change Stack

…eplay recovery (#301)

v0.15.0 added the row-level probe (cli/_diagnose_gather.py::_probe_row_deserialization)
but its findings stopped at the suggestions list — _classify_recovery still
inspected only schema_meta.version, so an agent that ran diagnose after a
link_commit failure saw recovery_path=clean / "No remediation needed" while
the ledger was actually unreadable. This wires the probe through:

- New LedgerDeserializationError (subclass of LedgerError) is raised from
  ledger.client.query/execute when SurrealDB returns "Invalid revision \`N\`
  for type \`Value\`" or a "deserialization error" wrapper. The exception
  message embeds the recovery command so the agent sees the wipe-and-replay
  instruction inside the MCP error envelope.
- handlers/diagnose.py::_classify_recovery consults row_probe_warnings before
  the schema-version checks and routes to reset_rebuild / reset_destructive
  with a quoted bicameral_reset(...) next_action.
- handlers/sync_middleware.py::ensure_ledger_synced re-raises
  LedgerDeserializationError instead of swallowing it at DEBUG. The broad
  except Exception still catches transient catch-up failures.

The SurrealDB SDK pin is unchanged — v0.14.x users hit by #301 still need
to wipe and replay; this PR makes the recovery path discoverable instead
of leaving them with a bare LedgerError.

Bumps version → 0.15.1 and RECOMMENDED_VERSION → 0.15.1.

Closes #301

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan added labels May 17, 2026:

  • flow:hotfix: Emergency fix targeting main directly; must be synced back to dev (DEV_CYCLE.md s10)
  • P1: High: ship this milestone; user-impacting bug or committed feature
  • fix: Bug fix or correctness repair
  • ledger: Decision ledger, persistence, or query surface

@jinhongkuan jinhongkuan requested a deployment to recording-approval May 17, 2026 11:00 — with GitHub Actions Waiting
@coderabbitai

coderabbitai Bot commented May 17, 2026

Warning

Rate limit exceeded

@jinhongkuan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 54 minutes and 19 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d5bd869-ce80-46e3-8b2d-b14fa678bcdc

📥 Commits

Reviewing files that changed from the base of the PR and between aebb92f and cb01c5c.

📒 Files selected for processing (1)
  • handlers/diagnose.py
📝 Walkthrough

Walkthrough

This PR fixes issue #301 by introducing a specialized LedgerDeserializationError to detect SurrealDB row-format mismatches in ledger queries, updating recovery classification to prioritize row probe warnings, and ensuring the error propagates through middleware instead of being suppressed.

Changes

Deserialization Error Detection & Recovery Routing

| Layer / File(s) | Summary |
|---|---|
| Ledger deserialization error type and detection (ledger/client.py) | Introduces LedgerDeserializationError subclass with embedded RECOVERY_HINT, adds _is_deserialization_error() classifier that detects SurrealDB row-format mismatch signatures, and updates LedgerClient.query() and execute() methods to raise this specific error instead of generic LedgerError on deserialization failures. |
| Recovery classification based on row probe warnings (handlers/diagnose.py) | Updates _classify_recovery to check Diagnosis.row_probe_warnings before schema-version comparisons; when row warnings are present, returns reset_destructive (no events) or reset_rebuild (events exist) with actionable next_action recovery command, preventing deserialization mismatches from being misclassified as clean. |
| Middleware error propagation (handlers/sync_middleware.py) | Adds import for LedgerDeserializationError and introduces dedicated handler in ensure_ledger_synced that catches this error, logs a warning, and re-raises to surface it to the MCP transport layer; other exceptions remain swallowed with debug logging. |
| Test coverage for deserialization recovery flow (tests/test_ledger_sync_deserialization_recovery_301.py) | Comprehensive test suite covering error classification (signature detection, message recovery hints), recovery routing (row-warning prioritization, events-file-based path selection), and middleware behavior (re-raising deserialization errors vs swallowing others). |
| Version and documentation updates (CHANGELOG.md, RECOMMENDED_VERSION, pyproject.toml) | Adds v0.15.1 hotfix changelog entry documenting the ledger deserialization error fix and recovery guidance; bumps version from 0.15.0 to 0.15.1. |
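
The middleware row above describes a selective re-raise inside an otherwise best-effort handler. A minimal sketch of that pattern, assuming stand-in exception classes and a hypothetical `catch_up` callable parameter (the real ensure_ledger_synced signature is not shown in this PR page):

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("sync_middleware_sketch")


class LedgerError(Exception):
    """Stand-in for the project's base ledger exception."""


class LedgerDeserializationError(LedgerError):
    """Stand-in for the subclass introduced by this PR."""


async def ensure_ledger_synced(catch_up: Callable[[], Awaitable[None]]) -> None:
    """Best-effort catch-up that lets only deserialization errors escape."""
    try:
        await catch_up()
    except LedgerDeserializationError:
        # Listed before the broad handler; otherwise it would be swallowed too.
        logger.warning("ledger rows unreadable; surfacing the error to the caller")
        raise
    except Exception as exc:
        # Transient failures keep the original best-effort contract.
        logger.debug("ledger catch-up skipped: %s", exc)
```

Python evaluates `except` clauses top to bottom, so the specific clause has to precede `except Exception` for the re-raise to take effect.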

Sequence Diagram

sequenceDiagram
  participant Agent as Agent/MCP
  participant Middleware as ensure_ledger_synced
  participant Ledger as LedgerClient
  participant Diagnose as _classify_recovery
  Agent->>Middleware: sync request
  Middleware->>Ledger: query/execute (HEAD catch-up)
  Ledger-->>Middleware: SurrealError (deserialization)
  Middleware->>Middleware: detect LedgerDeserializationError
  Middleware-->>Agent: re-raise to transport
  Agent->>Diagnose: call diagnose()
  Diagnose->>Diagnose: check row_probe_warnings
  Diagnose-->>Agent: recovery_path: reset_rebuild/destructive

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • BicameralAI/bicameral-mcp#298: Introduces the base _classify_recovery recovery-path classification system in handlers/diagnose.py that this PR extends with row-probe-warnings prioritization.

Poem

🐰 A ledger once broken by revision mismatch,
Now detects its own wounds with a catch!
Row warnings take flight, before schemas align—
Recovery paths bloom, /bicameral-sync shines. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly describes the main change: routing ledger sync deserialization warnings to wipe-and-replay recovery, which is the core objective of this hotfix. |
| Linked Issues check | ✅ Passed | The pull request successfully implements all coding requirements from issue #301: detects row-level deserialization failures via LedgerDeserializationError, routes them to wipe-and-replay recovery in _classify_recovery, and re-raises the error in sync middleware to surface it to the agent. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to issue #301 objectives: detecting and routing deserialization failures. No out-of-scope modifications to unrelated systems or features were introduced. |

… no-redef

`path` is annotated at line 133 (the new #301 row_probe_warnings branch) and
also at line 143 (existing schema-newer-than-binary branch). Same scope →
same name → mypy no-redef. Drop the later annotation; type is unchanged
because the literal still narrows to `RecoveryPath`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
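
The no-redef fix described in this commit can be illustrated with a small stand-alone function. The `RecoveryPath` alias below is an assumption built from the three values named in this PR; the real alias in handlers/diagnose.py may list more members, and `pick_path` is a hypothetical name.

```python
from typing import Literal

# Assumed alias; only the three values mentioned in this PR are listed.
RecoveryPath = Literal["clean", "reset_rebuild", "reset_destructive"]


def pick_path(has_warnings: bool, has_events: bool) -> RecoveryPath:
    if has_warnings:
        # The only annotated declaration of `path` in this function scope.
        path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
        return path
    # Writing `path: RecoveryPath = "clean"` here would make mypy report
    # no-redef, since both branches share one function scope. A plain
    # assignment reuses the declared type and is still checked against it.
    path = "clean"
    return path
```

Dropping the later annotation, as this commit does, keeps a single variable name; renaming one of the two variables is the alternative route to the same result.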

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_ledger_sync_deserialization_recovery_301.py (1)

74-108: 💤 Low value

Consider adding schema initialization for guideline alignment.

The coding guideline states: "For ledger query tests, never MagicMock the client; use the real LedgerClient(url="memory://", ...) + init_schema + migrate". These tests correctly use the real client but skip schema initialization. While not strictly necessary for narrow seam tests that patch _db.query (the schema is never queried), adding init_schema and migrate would improve consistency with the guideline and make the test setup more realistic.

♻️ Example for test_query_raises_deserialization_error_when_surrealdb_complains
 async def test_query_raises_deserialization_error_when_surrealdb_complains():
     """The classifier triggers on a real LedgerClient.query() path.
 
     Narrow seam: we patch the surrealdb-py async call so it raises
     ``SurrealError("Invalid revision ...")`` — this is the documented failure
     mode for SurrealKV record-format drift and cannot be triggered naturally
     against ``memory://``.
     """
     client = LedgerClient(url="memory://", ns="t301_q", db="ledger_test")
     await client.connect()
+    await init_schema(client)
+    await migrate(client)
     try:

Apply the same pattern to test_query_with_non_deserialization_error_still_raises_plain_ledger_error.

As per coding guidelines: "For ledger query tests, never MagicMock the client; use the real LedgerClient(url="memory://", ...) + init_schema + migrate".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_ledger_sync_deserialization_recovery_301.py` around lines 74 -
108, Add schema initialization to both tests by calling the real LedgerClient's
init_schema and migrate before exercising the patched query: after await
client.connect() invoke await client.init_schema() and await client.migrate() in
test_query_raises_deserialization_error_when_surrealdb_complains and
test_query_with_non_deserialization_error_still_raises_plain_ledger_error so the
in-memory client is set up per guidelines while keeping the existing patching of
client._db.query and existing asserts intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@handlers/diagnose.py`:
- Around line 136-140: The next_action string currently instructs users to "wipe
and replay from .bicameral/events/" even when has_events is False; update the
conditional message construction around next_action (the code that formats the
string with replay_from_events={has_events}) so when has_events is False it does
not mention replaying from .bicameral/events (e.g., change the tail to "wipe
only (no events to replay)" or similar), otherwise keep the existing "wipe and
replay from .bicameral/events/" wording when has_events is True.
- Around line 133-134: Rename the variable currently assigned as path in the
branch that handles row_warnings to warning_path (e.g., change "path:
RecoveryPath = ..." to "warning_path: RecoveryPath = ...") and update any
subsequent uses/return in that branch to return warning_path instead of path so
it no longer collides with the other branch's path variable; locate this in the
function handling recovery paths where has_events, row_warnings, tables are
computed and adjust the corresponding return statement(s) to reference
warning_path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0ecdd36-bc22-49e2-89a6-fe9f79b17ea8

📥 Commits

Reviewing files that changed from the base of the PR and between 6963cb0 and aebb92f.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • RECOMMENDED_VERSION
  • handlers/diagnose.py
  • handlers/sync_middleware.py
  • ledger/client.py
  • pyproject.toml
  • tests/test_ledger_sync_deserialization_recovery_301.py

Comment thread handlers/diagnose.py
Comment on lines +133 to +134
path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify duplicate typed declarations are removed (expect 1 match after fix).
rg -nP '\bpath:\s*RecoveryPath\b|\bwarning_path:\s*RecoveryPath\b' handlers/diagnose.py

Repository: BicameralAI/bicameral-mcp

Length of output: 245


🏁 Script executed:

sed -n '120,150p' handlers/diagnose.py | cat -n

Repository: BicameralAI/bicameral-mcp

Length of output: 1932


🏁 Script executed:

sed -n '133,141p' handlers/diagnose.py | cat -n

Repository: BicameralAI/bicameral-mcp

Length of output: 628


Resolve path redefinition to unblock mypy.

The typed assignment on line 133 conflicts with the typed assignment on line 143 (no-redef), causing CI failure. Both are separate conditional branches that independently define the same variable name. The proposed fix correctly renames line 133's variable to warning_path and updates the corresponding return statement, eliminating the redefinition while preserving the distinct logic for each recovery path.

🛠️ Proposed fix
-        path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
+        warning_path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
         tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))
-        return path, (
+        return warning_path, (
             f"Row-level deserialization warnings on {tables} — likely a "
             "SurrealDB embedded-SDK record-format mismatch. Run "
             f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
             "confirm=True)` to wipe and replay from .bicameral/events/."
         )
🧰 Tools
🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt

[error] mypy . failed with 1 error (no-redef). Checked 134 source files.

Comment thread handlers/diagnose.py
Comment on lines +136 to +140
f"Row-level deserialization warnings on {tables} — likely a "
"SurrealDB embedded-SDK record-format mismatch. Run "
f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
"confirm=True)` to wipe and replay from .bicameral/events/."
)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make next_action text consistent with destructive recovery.

When has_events is False, the command correctly uses replay_from_events=False, but the sentence still says "wipe and replay from .bicameral/events/". That instruction is contradictory for the destructive path.

✏️ Proposed fix
         path: RecoveryPath = "reset_rebuild" if has_events else "reset_destructive"
         tables = ", ".join(sorted({w.split(":", 1)[0] for w in row_warnings}))
+        replay_text = (
+            "to wipe and replay from .bicameral/events/."
+            if has_events
+            else "to wipe the ledger (no replayable .bicameral/events/*.jsonl found)."
+        )
         return path, (
             f"Row-level deserialization warnings on {tables} — likely a "
             "SurrealDB embedded-SDK record-format mismatch. Run "
             f"`bicameral_reset(wipe_mode='ledger', replay_from_events={has_events}, "
-            "confirm=True)` to wipe and replay from .bicameral/events/."
+            f"confirm=True)` {replay_text}"
         )

@jinhongkuan jinhongkuan merged commit e57f07a into main May 17, 2026
10 of 11 checks passed

Labels

  • fix: Bug fix or correctness repair
  • flow:hotfix: Emergency fix targeting main directly; must be synced back to dev (DEV_CYCLE.md s10)
  • ledger: Decision ledger, persistence, or query surface
  • P1: High: ship this milestone; user-impacting bug or committed feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(ledger): ledger_sync deserialization error — Invalid revision 3 for type Value

1 participant