test(sync_middleware): sociable banner tests + decision_id alias fix by jinhongkuan · Pull Request #303 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-05-10T21:34:51Z

Summary

Rewrites the get_session_start_banner tests in tests/test_sync_middleware.py from solitary (MagicMock ctx + AsyncMock ledger returning hand-crafted dicts) to sociable — they now seed a real SurrealDBLedgerAdapter over memory:// and exercise the actual get_decisions_by_status SQL.
The first sociable run surfaced a latent UX bug: get_decisions_by_status selected decision_id from the decision table, but no such field is defined on that table (schema.py:114-154). Every banner row reached agents with decision_id: None. Fixed in ledger/adapter.py by aliasing type::string(id) AS decision_id — same pattern already used at queries.py:167, 228, 404, 512.
Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference for handlers + ledger: SimpleNamespace ctx + real adapter; narrow seams only when the collaborator genuinely can't run in tests. Includes a checklist for tests-only PRs.

Why this matters

AI-authored tests skew solitary because mocks are easy to make pass. A solitary banner test stayed green for months while the production query returned null IDs. This PR is the worked example: same scenarios, same handler, real substrate — the bug fell out on the first run.

The non-banner tests (ensure_ledger_synced SHA-cache logic, repo_write_barrier concurrency primitives) intentionally keep their narrow seam patches: link_commit has its own end-to-end coverage, and the barrier tests are about pure asyncio behavior. The new CLAUDE.md guidance flags these as legitimate exceptions.

Test plan

pytest tests/test_sync_middleware.py — 21/21 pass
pytest tests/test_phase2_ledger.py tests/test_phase3_integration.py tests/test_link_commit_grounding.py tests/test_codegenome_continuity_service.py — 50/50 pass (no regression from the SQL alias)
pytest tests/test_alpha_contract.py tests/test_alpha_flow.py — alpha_contract passes; two alpha_flow failures are pre-existing on main (verified by git stash + rerun) and unrelated to this PR
CI green on PR

Out of scope

Other solitary handler tests (test_ingest_size_limit.py, test_codegenome_adapter.py, etc.) — separate PRs, one at a time.
Refactoring ensure_ledger_synced tests to use a real git repo — current narrow-seam approach is the right granularity for SHA-cache logic.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed decision ID projection in session-start banner results to ensure proper status filtering and display.
Tests
- Refactored middleware tests from mock-based to real database environment for improved validation of decision retrieval, filtering, and banner generation.
Documentation
- Added testing standards for database-backed components, emphasizing real integration tests over mocks.

Two paired changes that together unblock any ledger whose schema_meta.version rolled past 5 with v3-era source_span rows still in the yields table — the dogfood crash that motivated this work. **Tolerant init_schema (Layer A).** \`_execute_define_idempotent\` now recognizes a small allowlist of SurrealDB error substrings as recoverable: existing "already exists" and "already contains" plus the new "expected a record<" (TYPE constraint mismatch when DEFINE TABLE/INDEX OVERWRITE re-validates legacy rows) and a defensive "but expected" for adjacent value-type errors. The patterns live in module-level RECOVERABLE_DEFINE_PATTERNS so the substring contract is testable. When a recoverable error fires, init logs a warning and emits a SCHEMA_DEFINE_SKIPPED audit event so the next migration's cleanup is visible in the audit log. **Re-runnable yields cleanup (Layer B).** The v4→v5 cleanup body is extracted into shared \`_clean_yields_legacy_rows\` + \`_dedupe_yields\` helpers. v4→v5 still calls them, so historical behaviour is unchanged. The new \`_migrate_v16_to_v17\` calls the same helpers, which means any DB that ran past v5 with the corruption still present gets the cleanup on the v16→v17 boundary instead of permanently — the migrate() loop only iterates \`range(current+1, SCHEMA_VERSION+1)\`, so a one-shot historical migration is permanently unreachable for users who upgraded later. \`SCHEMA_VERSION\` bumps to 17. SCHEMA_COMPATIBILITY[17] is a placeholder release-eng pins at PR merge. **Audit log.** New \`AuditEventType.SCHEMA_DEFINE_SKIPPED\` (warn-level) emitted by \`_execute_define_idempotent\` whenever a UNIQUE-violation or type-mismatch is swallowed — gives operators visibility into the in-flight cleanup without breaking init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…296) Mirrors the existing \`bicameral-mcp diagnose\` CLI as an MCP tool so agents can fire it from any tool-error envelope. Critical property: the handler opens a raw \`LedgerClient\` and never goes through \`adapter.connect()\` — which means it produces a useful diagnosis even when normal connect crashes inside \`init_schema\` / \`migrate\`, the exact failure mode that motivated #296. **Refactor.** \`cli/_diagnose_gather.py\` grows three raw-client variants (\`_read_bicameral_meta_raw\`, \`_read_schema_version_raw\`, \`_read_table_counts_raw\`) and a \`gather_diagnosis_raw(client, ledger_url)\` entry point. The existing adapter-flavoured \`gather_diagnosis(adapter)\` becomes a one-line wrapper, so the CLI keeps its current shape and the test suite (\`test_diagnose_*\`) is unchanged. **New handler + tool.** \`handlers/diagnose.handle_diagnose\` opens a raw client against the resolved ledger URL, calls \`gather_diagnosis_raw\`, and classifies the result into a \`recovery_path\`: - \`clean\` — schema matches binary, tables look sane. - \`fixable\` — schema behind binary; next normal connect migrates. - \`reset_rebuild\` — ledger broken AND \`.bicameral/events/\` has events on disk; \`bicameral_reset(replay_from_events=true, confirm=true)\` recovers without data loss. - \`reset_destructive\` — ledger broken AND no events; reset loses decision history, user must accept. \`next_action\` is a human-readable string the agent renders verbatim. Wired into \`server.py\` list_tools + dispatch as \`bicameral.diagnose\`. Pydantic contract \`DiagnoseResponse\` lives alongside \`ResetResponse\` in \`contracts.py\`. **SKILL contract.** \`skills/bicameral-diagnose/SKILL.md\` documents auto-fire from any schema/migration error envelope, the recovery-path → command matrix, and the read-only invariant (repair is CLI-driven, never an in-session agent action). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e (#296) \`bicameral_reset(replay_from_events=True, confirm=True)\` (only valid with \`wipe_mode="ledger"\`) wipes the materialized DB, resets the team-mode watermark, and replays every event in \`.bicameral/events/*.jsonl\` back through the same ingest path team mode uses. The event log is the canonical record (committed to git in team mode); replay is recovery, not destruction. Implementation reuses \`EventMaterializer.replay_new_events\` — the exact code team mode runs on every sync — so replay-vs-live divergence is impossible by construction. Determinism of the underlying canonical_id resolution is tracked separately under #296 (P0 follow-up: regression suite covering every event type the materializer dispatches on). **Behaviour.** - Dry run reports on-disk event count via \`events_replayed\` so the user sees scale before confirming. - Post-confirm response reports actual replayed count + any \`replay_errors\` (wipe still committed; replay errors don't destroy data, only block recovery). - \`replay_from_events=True\` + \`wipe_mode="full"\` is rejected with a clear error — full-wipe deletes the substrate we'd replay from. - Honours \`BICAMERAL_DATA_PATH\` symmetrically with \`adapters/ledger.py\`'s team-mode write path so the replay reads from the same directory the writers wrote to. **Skill.** \`skills/bicameral-reset/SKILL.md\` documents the new flag, the exclusivity rule with full-wipe, and the recommended pairing with \`bicameral_diagnose\`'s \`recovery_path="reset_rebuild"\`. **Contracts.** \`ResetResponse.events_replayed\` (default 0) carries the count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two CI test classes that lock down the safety contract for layers A and B. Both run on every PR via the existing \`test-mcp-regression.yml\` workflow. **Fixture-replay suite (\`test_legacy_ledger_fixtures.py\`).** Frozen DB shapes under \`tests/fixtures/legacy_ledgers/\`. Each fixture is a Python module exporting an async \`build(client)\` coroutine that mutates the client into a known-historical bad state. The parametrized test: 1. Builds the bad state in a \`memory://\` client. 2. Runs the production \`init_schema\` + \`migrate\` path. 3. Asserts schema reaches \`SCHEMA_VERSION\`, fixture-specific invariants hold, and a second \`init_schema\` + \`migrate\` is a no-op (idempotent). First fixture is the user's exact dogfood symptom: \`v3_yields_source_span\` — \`yields.in = source_span:*\` rows surviving past v5, schema_meta.version = 16 to force the v16→v17 boundary. Adding a new fixture later requires zero test code: register it in \`FIXTURES\` and the parametrized test runs against it. README under the fixtures dir documents the registration shape. **Error-format provocation suite (\`test_schema_recoverable_errors.py\`).** Three tests that lock down the SurrealDB error-string contract layer A depends on: - UNIQUE-violation produces a substring matching \`RECOVERABLE_DEFINE_PATTERNS\` (existing v4→v5 path). - Type-mismatch on \`DEFINE INDEX OVERWRITE\` against legacy rows produces a matching substring (the #296 scenario). - All entries in \`RECOVERABLE_DEFINE_PATTERNS\` are lowercase (the catch lower-cases the message before substring match). When a future \`surrealdb-py\` bump changes the format, these tests fail loudly with the exact message the maintainer needs to update the patterns list — turning a silent regression into a known-loud signal. **CI wiring.** \`test-mcp-regression.yml\` lists the two new files in the pytest invocation. \`test_ledger_bicameral_meta_migration.py\` had a hardcoded \`assert SCHEMA_VERSION == 16\` updated to \`assert SCHEMA_VERSION >= 16\` so future bumps land cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ruff I001 — alphabetical block. handlers.diagnose belongs after handlers.bind, not after handlers.ratify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`ruff format --check .` was failing on the five files added in this PR. No behavioural changes — purely whitespace/line-wrapping per the project's ruff config. 42-test smoke pass green after the reformat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mypy flagged DiagnoseResponse.recovery_path as expecting Literal['clean','fixable','reset_rebuild','reset_destructive'] while _classify_recovery returned plain str. Add a `RecoveryPath` type alias and use it on both the function signature and the intermediate `path` binding for the rec>exp branch. Pure typing — no runtime change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the dashboard image at the bottom of "How It Feels" with a three-beat demo video section (ingest -> preflight -> ratify async) referencing GitHub user-attachments URLs so videos render as inline players. Moves the "Star on GitHub" banner from the top header to a centered placement immediately after the demo, turning it into a post-demo conversion beat instead of a misaligned header element. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…itch Replaces the single-line MCP-server description with a position-take opener: paragraph 1 names the failure mode (agreements emerge mid-flight, never reach a doc); paragraph 2 introduces Bicameral MCP as a spec compliance layer that captures both formal source materials (transcripts, PRDs, Slack) and undiscussed mid-implementation decisions to be ratified async by the product owner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

triage→main: resilient ledger migration + bicameral_diagnose + reset --replay-from-events (#296) + README demo videos & opener rewrite (#299)

…from-events + README polish Bumps pyproject.toml 0.14.4 → 0.14.5 and adds the v0.14.5 CHANGELOG entry covering the #296 ledger-resilience track (bicameral_diagnose, reset --replay-from-events, v16→v17 yields cleanup) and the #299 README pass (opener rewrite, demo video section, star CTA relocation). Hotfix on the README opener: tightens "and most agreements emerge mid-flight, in corrections and side comments that never reach a doc" to "and requirement gaps surfaced mid-implementation are buried under thousands of lines of code". The new framing locates the failure mode in the engineering medium (code volume the human can't review) rather than the conversational one (undocumented agreements), which aligns the opener with the spec compliance layer positioning in paragraph 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

release: v0.14.5 (triage) — bicameral_diagnose tool + reset --replay-from-events + README polish

… + decision_id alias fix The banner tests previously used MagicMock for ctx and AsyncMock for ledger, returning hand-crafted dicts. They stayed green even when get_decisions_by_status silently returned decision_id=None for every row (the SQL selected an undefined field — see ledger.adapter:584). Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual get_decisions_by_status query. The first sociable run surfaced the latent bug, which is fixed in this commit by aliasing type::string(id) AS decision_id (matches the pattern at queries.py:167, 228, 404, 512). Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock primitives) are left as-is and now documented inline. Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference: SimpleNamespace ctx + real adapter for handler/ledger tests, narrow seams only when a collaborator can't be run in tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-10T21:35:22Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99fd32a2-b922-4912-b251-4388284aee9b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Introduces mandatory sociable testing policy via CLAUDE.md, fixes SurrealDB query aliasing in ledger adapter's get_decisions_by_status, and refactors sync_middleware tests from mock-based to real SurrealDB memory-backed end-to-end assertions.

Changes

Sociable Testing Refactor with Adapter Query Fix

Layer / File(s)	Summary
Testing Policy `CLAUDE.md`	Introduces mandatory "Sociable Testing for UX Paths" section defining rules: prefer real ledger adapter/client with seeded schema over `MagicMock`/`AsyncMock`, use `SimpleNamespace` for `ctx`, allow narrow patching only for hard-to-run boundaries, with validation checklist for tests-only PRs.
Adapter Query Fix `ledger/adapter.py`	`get_decisions_by_status` query now aliases SurrealDB record id as `decision_id` via `type::string(id) AS decision_id` instead of selecting non-existent field, ensuring proper decision-id values in banner results.
Test Infrastructure Setup `tests/test_sync_middleware.py`	Replaces mock-based substrate with real SurrealDB `memory://` backend: `_make_real_adapter()` connects and initializes schema, `_next_canonical()` generates unique ids, `_banner_ctx()` and `_ensure_ctx()` provide lightweight real-ledger context, helpers seed drifted/ungrounded/proposal decisions.
Banner Test Suite `tests/test_sync_middleware.py`	All banner tests refactored to seed real ledger decisions and assert against actual query behavior: drifted/ungrounded decision counts, truncation at top 10 with drifted prioritized, per-session firing, exception swallowing, stale vs. fresh proposal detection. New test validates all open statuses surface in banner items.
Ensure Ledger Synced Tests `tests/test_sync_middleware.py`	Tests for `ensure_ledger_synced` refactored to use `_ensure_ctx()` with minimal SHA-cache state; validates calling/skipping `handle_link_commit` based on HEAD advancement and swallowing exceptions from `handle_link_commit`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

BicameralAI/bicameral-mcp#56: Overlapping changes to ledger adapter's decision_id extraction and sync_middleware ledger interaction, moving tests to real SurrealDB and updating banner query behavior.

Poem

🐰 Hops through the test suite with joy!
Mock dittos fade, real queries stay,
The banner blooms with true SQLite dreams,
Each decision seeded, each assertion beams,
Sociable testing leads the way!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the two main changes: refactoring banner tests to use sociable testing approach and fixing a decision_id alias bug in the adapter.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/sociable-sync-middleware-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_sync_middleware.py`:
- Around line 184-203: The test
test_banner_queries_each_open_status_actually_surfaces currently only seeds
drifted and ungrounded so it cannot verify context_pending reaches the banner;
either (A) seed a context_pending row before calling get_session_start_banner by
calling the existing helper (e.g. _seed_context_pending(client,
description="c1")) and update the final assertion to expect
{"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by
changing the docstring and the final assertion to only assert the two seeded
statuses {"drifted","ungrounded"}; locate and modify the test function
test_banner_queries_each_open_status_actually_surfaces and the call to
get_session_start_banner to apply one of these fixes.
- Around line 18-29: The import block fails Ruff I001: reorder and format
imports into standard groups (stdlib, third-party, local) and alphabetize within
each group; specifically, group asyncio, datetime (UTC, datetime, timedelta),
pathlib.Path, types.SimpleNamespace, and unittest.mock (AsyncMock, MagicMock,
patch) as stdlib, then third-party pytest, and finally local imports from
handlers.sync_middleware (ensure_ledger_synced, get_session_start_banner) and
ledger modules (SurrealDBLedgerAdapter, LedgerClient, init_schema, migrate);
remove any unused imports and ensure the import style matches existing project
conventions so Ruff stops flagging I001.
- Around line 370-376: The test may short-circuit and never await
handle_link_commit because _LAST_SYNCED_SHA or the repo HEAD can equal the test
context's sha; update test_ensure_swallows_link_commit_exception to force the
"head advanced" branch by setting the module-level _LAST_SYNCED_SHA to a
different value than the repo HEAD (or otherwise mutate the test ctx so
ctx.repo.head.commit.hexsha[:7] != _LAST_SYNCED_SHA) before calling
ensure_ledger_synced; this guarantees ensure_ledger_synced will call and await
handlers.link_commit.handle_link_commit (the AsyncMock) so its RuntimeError
side_effect is exercised and swallowed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d1c2279-8485-44e9-ac44-726b6632f856

📥 Commits

Reviewing files that changed from the base of the PR and between b963dc8 and 0bc5356.

📒 Files selected for processing (3)

CLAUDE.md
ledger/adapter.py
tests/test_sync_middleware.py

coderabbitai · 2026-05-10T21:39:18Z

+import asyncio
+from datetime import UTC, datetime, timedelta
 from pathlib import Path
+from types import SimpleNamespace
 from unittest.mock import AsyncMock, MagicMock, patch

 import pytest

 from handlers.sync_middleware import ensure_ledger_synced, get_session_start_banner
+from ledger.adapter import SurrealDBLedgerAdapter
+from ledger.client import LedgerClient
+from ledger.schema import init_schema, migrate


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the import block so Ruff stops failing.

CI is already red on this block with I001, so this PR will not go green until the imports are re-sorted/formatted.

🧰 Tools

🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt

[error] 16-29: Ruff check failed (I001): Import block is un-sorted or un-formatted. Organize imports; run with ruff check . --fix.

🪛 GitHub Actions: Lint & Type Check / ruff + mypy

[error] 16-29: Ruff import sorting/formatting failed (I001). Import block is un-sorted or un-formatted. Found 1 error. Help: Organize imports. Fixable with ruff check . --fix.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_sync_middleware.py` around lines 18 - 29, The import block fails Ruff I001: reorder and format imports into standard groups (stdlib, third-party, local) and alphabetize within each group; specifically, group asyncio, datetime (UTC, datetime, timedelta), pathlib.Path, types.SimpleNamespace, and unittest.mock (AsyncMock, MagicMock, patch) as stdlib, then third-party pytest, and finally local imports from handlers.sync_middleware (ensure_ledger_synced, get_session_start_banner) and ledger modules (SurrealDBLedgerAdapter, LedgerClient, init_schema, migrate); remove any unused imports and ensure the import style matches existing project conventions so Ruff stops flagging I001.

coderabbitai · 2026-05-10T21:39:18Z

+async def test_banner_queries_each_open_status_actually_surfaces():
+    """The banner must surface decisions across ALL queried statuses.
+
+    Original test asserted ``get_decisions_by_status.assert_called_once_with(
+    ["drifted", "ungrounded", "context_pending"])`` against a mock — a
+    tautology mirroring the SQL string. The real behavior contract is:
+    rows with each of those statuses end up in the banner.
+    ``context_pending`` rows are routed through ``status='ungrounded'`` in
+    the production query path; this test pins the visible-to-agent shape.
+    """
+    adapter, client = await _make_real_adapter()
+    await _seed_drifted(client, description="d1")
+    await _seed_ungrounded(client, description="u1")
+    ctx = _banner_ctx(adapter)
+
+    banner = await get_session_start_banner(ctx)
+
+    assert banner is not None
+    assert {i["status"] for i in banner.items} == {"drifted", "ungrounded"}



⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Either exercise context_pending here or narrow the test’s claim.

This test only seeds drifted and ungrounded, so it never proves the third queried “open” status still reaches the banner. If the context_pending path regresses, this stays green.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_sync_middleware.py` around lines 184 - 203, The test test_banner_queries_each_open_status_actually_surfaces currently only seeds drifted and ungrounded so it cannot verify context_pending reaches the banner; either (A) seed a context_pending row before calling get_session_start_banner by calling the existing helper (e.g. _seed_context_pending(client, description="c1")) and update the final assertion to expect {"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by changing the docstring and the final assertion to only assert the two seeded statuses {"drifted","ungrounded"}; locate and modify the test function test_banner_queries_each_open_status_actually_surfaces and the call to get_session_start_banner to apply one of these fixes.

coderabbitai · 2026-05-10T21:39:18Z

 async def test_ensure_swallows_link_commit_exception():
-    ctx = _make_ctx()
+    ctx = _ensure_ctx()

    with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc:
        mock_lc.side_effect = RuntimeError("git not available")
        # Must not raise
        await ensure_ledger_synced(ctx)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Force the head-advanced branch in the swallow test.

As written, this can pass without ever awaiting handle_link_commit if _LAST_SYNCED_SHA or the real repo HEAD short-circuits the function first. That makes the swallow path a false positive.

Suggested fix

-async def test_ensure_swallows_link_commit_exception(): +async def test_ensure_swallows_link_commit_exception(monkeypatch): ctx = _ensure_ctx() + monkeypatch.setattr("handlers.sync_middleware._LAST_SYNCED_SHA", "") - with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc: + with ( + patch("handlers.link_commit._read_current_head_sha", return_value="new_sha"), + patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc, + ): mock_lc.side_effect = RuntimeError("git not available") # Must not raise await ensure_ledger_synced(ctx) + mock_lc.assert_awaited_once()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_sync_middleware.py` around lines 370 - 376, The test may short-circuit and never await handle_link_commit because _LAST_SYNCED_SHA or the repo HEAD can equal the test context's sha; update test_ensure_swallows_link_commit_exception to force the "head advanced" branch by setting the module-level _LAST_SYNCED_SHA to a different value than the repo HEAD (or otherwise mutate the test ctx so ctx.repo.head.commit.hexsha[:7] != _LAST_SYNCED_SHA) before calling ensure_ledger_synced; this guarantees ensure_ledger_synced will call and await handlers.link_commit.handle_link_commit (the AsyncMock) so its RuntimeError side_effect is exercised and swallowed.

…dleware-tests # Conflicts: # CHANGELOG.md # README.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gate (BicameralAI#58) Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement only** — no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize." Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to recommended): Q1 dataset size → 25 cases hand-curated, matches M2's 23 Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic Q4 ledger persistence → per-run temp + memory:// (per-case freshness) Three measurement axes (deliberately split for diagnosis) --------------------------------------------------------- overall recall surfaced / (surfaced + missed) gate ≥ 0.70 per-mode recall same, sliced by miss_mode gate ≥ 0.50 fire rate response.fired / total gate ≥ 0.60 Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them. Files ----- tests/fixtures/preflight_m6/dataset.py (412 LOC) 25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status. tests/eval/_preflight_m6_seeder.py (231 LOC) Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). Reset code-locator + ledger singletons before AND after so the next case starts clean. tests/eval_preflight_m6_recall.py (274 LOC) Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer. tests/eval_preflight_m6_summary.py (162 LOC) Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors. tests/test_preflight_m6_eval.py (267 LOC) 16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock — so any added/removed field on the response shape fails the test honestly. .github/workflows/test-mcp-regression.yml (+31 LOC) New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic. CHANGELOG.md (+2 lines) [Unreleased] / Added entry. Local verification ------------------ - 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.) - Dataset import + _validate_dataset() pass — 25 cases (8/8/9) - Runner --help renders cleanly - Summary renderer smoke-tested on synthetic JSON — per-mode table + missed-case detail render correctly with emoji gates - ruff check + ruff format --check + mypy all green on touched files What's NOT in this PR (intentionally — Phase B gating) ------------------------------------------------------ - Any runtime change to handle_preflight or _region_anchored_preflight - Skill changes (no agent-facing contract change in Phase A) - Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred) - Semantic search layer (Phase B candidate, deferred) - LLM reranker (Phase B candidate, deferred) - Real-corpus eval (synthetic first; corpus follow-up if needed) After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker. Refs BicameralAI#58. Plan: plan-58-preflight-decision-detection.md. Phase 2 spec signoff: BicameralAI#58 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jinhongkuan and others added 14 commits May 10, 2026 00:13

chore(lint): fix handlers.diagnose import order in server.py

fa4ca79

Ruff I001 — alphabetical block. handlers.diagnose belongs after handlers.bind, not after handlers.ratify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #298 from BicameralAI/triage-from-dev

6d96ac3

triage→main: resilient ledger migration + bicameral_diagnose + reset --replay-from-events (#296) + README demo videos & opener rewrite (#299)

Update recommended version to 0.14.5

489441c

Merge pull request #302 from BicameralAI/release/v0.14.5

b963dc8

release: v0.14.5 (triage) — bicameral_diagnose tool + reset --replay-from-events + README polish

jinhongkuan temporarily deployed to ci-test May 10, 2026 21:34 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to recording-approval May 10, 2026 21:34 — with GitHub Actions Failure

jinhongkuan temporarily deployed to ci-test May 10, 2026 21:34 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 10, 2026 21:34 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 10, 2026 21:34 — with GitHub Actions Inactive

jinhongkuan requested a review from silongtan May 10, 2026 21:38

jinhongkuan changed the base branch from main to dev May 10, 2026 21:38

coderabbitai Bot reviewed May 10, 2026

View reviewed changes

Merge remote-tracking branch 'origin/dev' into feat/sociable-sync-mid…

9e2ecb9

…dleware-tests # Conflicts: # CHANGELOG.md # README.md

jinhongkuan had a problem deploying to production May 10, 2026 21:41 — with GitHub Actions Failure

jinhongkuan temporarily deployed to ci-test May 10, 2026 21:41 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to recording-approval May 10, 2026 21:41 — with GitHub Actions Failure

jinhongkuan temporarily deployed to ci-test May 10, 2026 21:41 — with GitHub Actions Inactive

chore(test): satisfy ruff I001 import grouping in test_sync_middleware

b676b17

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jinhongkuan temporarily deployed to production May 10, 2026 22:08 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 10, 2026 22:08 — with GitHub Actions Inactive

jinhongkuan had a problem deploying to recording-approval May 10, 2026 22:08 — with GitHub Actions Failure

jinhongkuan merged commit e307635 into dev May 10, 2026
8 of 9 checks passed

This was referenced May 11, 2026

M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306) #58

Closed

feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58) #304

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(sync_middleware): sociable banner tests + decision_id alias fix#303

test(sync_middleware): sociable banner tests + decision_id alias fix#303
jinhongkuan merged 16 commits into
devfrom
feat/sociable-sync-middleware-tests

jinhongkuan commented May 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 10, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 10, 2026

Uh oh!

coderabbitai Bot May 10, 2026

Uh oh!

coderabbitai Bot May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jinhongkuan commented May 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Test plan

Out of scope

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jinhongkuan commented May 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 10, 2026 •

edited

Loading