Skip to content

test(sync_middleware): sociable banner tests + decision_id alias fix#303

Merged
jinhongkuan merged 16 commits into
devfrom
feat/sociable-sync-middleware-tests
May 10, 2026
Merged

test(sync_middleware): sociable banner tests + decision_id alias fix#303
jinhongkuan merged 16 commits into
devfrom
feat/sociable-sync-middleware-tests

Conversation

@jinhongkuan

@jinhongkuan jinhongkuan commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Rewrites the get_session_start_banner tests in tests/test_sync_middleware.py from solitary (MagicMock ctx + AsyncMock ledger returning hand-crafted dicts) to sociable — they now seed a real SurrealDBLedgerAdapter over memory:// and exercise the actual get_decisions_by_status SQL.
  • The first sociable run surfaced a latent UX bug: get_decisions_by_status selected decision_id from the decision table, but no such field is defined on that table (schema.py:114-154). Every banner row reached agents with decision_id: None. Fixed in ledger/adapter.py by aliasing type::string(id) AS decision_id — same pattern already used at queries.py:167, 228, 404, 512.
  • Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference for handlers + ledger: SimpleNamespace ctx + real adapter; narrow seams only when the collaborator genuinely can't run in tests. Includes a checklist for tests-only PRs.

Why this matters

AI-authored tests skew solitary because mocks are easy to make pass. A solitary banner test stayed green for months while the production query returned null IDs. This PR is the worked example: same scenarios, same handler, real substrate — the bug fell out on the first run.

The non-banner tests (ensure_ledger_synced SHA-cache logic, repo_write_barrier concurrency primitives) intentionally keep their narrow seam patches: link_commit has its own end-to-end coverage, and the barrier tests are about pure asyncio behavior. The new CLAUDE.md guidance flags these as legitimate exceptions.

Test plan

  • pytest tests/test_sync_middleware.py — 21/21 pass
  • pytest tests/test_phase2_ledger.py tests/test_phase3_integration.py tests/test_link_commit_grounding.py tests/test_codegenome_continuity_service.py — 50/50 pass (no regression from the SQL alias)
  • pytest tests/test_alpha_contract.py tests/test_alpha_flow.py — alpha_contract passes; two alpha_flow failures are pre-existing on main (verified by git stash + rerun) and unrelated to this PR
  • CI green on PR

Out of scope

  • Other solitary handler tests (test_ingest_size_limit.py, test_codegenome_adapter.py, etc.) — separate PRs, one at a time.
  • Refactoring ensure_ledger_synced tests to use a real git repo — current narrow-seam approach is the right granularity for SHA-cache logic.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed decision ID projection in session-start banner results to ensure proper status filtering and display.
  • Tests

    • Refactored middleware tests from mock-based to real database environment for improved validation of decision retrieval, filtering, and banner generation.
  • Documentation

    • Added testing standards for database-backed components, emphasizing real integration tests over mocks.

Review Change Stack

jinhongkuan and others added 14 commits May 10, 2026 00:13
Two paired changes that together unblock any ledger whose
schema_meta.version rolled past 5 with v3-era source_span rows still
in the yields table — the dogfood crash that motivated this work.

**Tolerant init_schema (Layer A).**

\`_execute_define_idempotent\` now recognizes a small allowlist of
SurrealDB error substrings as recoverable: existing "already exists"
and "already contains" plus the new "expected a record<" (TYPE
constraint mismatch when DEFINE TABLE/INDEX OVERWRITE re-validates
legacy rows) and a defensive "but expected" for adjacent value-type
errors. The patterns live in module-level RECOVERABLE_DEFINE_PATTERNS
so the substring contract is testable. When a recoverable error
fires, init logs a warning and emits a SCHEMA_DEFINE_SKIPPED audit
event so the next migration's cleanup is visible in the audit log.

**Re-runnable yields cleanup (Layer B).**

The v4→v5 cleanup body is extracted into shared
\`_clean_yields_legacy_rows\` + \`_dedupe_yields\` helpers. v4→v5 still
calls them, so historical behaviour is unchanged. The new
\`_migrate_v16_to_v17\` calls the same helpers, which means any DB
that ran past v5 with the corruption still present gets the cleanup
on the v16→v17 boundary instead of permanently — the migrate() loop
only iterates \`range(current+1, SCHEMA_VERSION+1)\`, so a one-shot
historical migration is permanently unreachable for users who
upgraded later.

\`SCHEMA_VERSION\` bumps to 17. SCHEMA_COMPATIBILITY[17] is a
placeholder release-eng pins at PR merge.

**Audit log.**

New \`AuditEventType.SCHEMA_DEFINE_SKIPPED\` (warn-level) emitted by
\`_execute_define_idempotent\` whenever a UNIQUE-violation or
type-mismatch is swallowed — gives operators visibility into the
in-flight cleanup without breaking init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…296)

Mirrors the existing \`bicameral-mcp diagnose\` CLI as an MCP tool so
agents can fire it from any tool-error envelope. Critical property:
the handler opens a raw \`LedgerClient\` and never goes through
\`adapter.connect()\` — which means it produces a useful diagnosis
even when normal connect crashes inside \`init_schema\` / \`migrate\`,
the exact failure mode that motivated #296.

**Refactor.**

\`cli/_diagnose_gather.py\` grows three raw-client variants
(\`_read_bicameral_meta_raw\`, \`_read_schema_version_raw\`,
\`_read_table_counts_raw\`) and a \`gather_diagnosis_raw(client,
ledger_url)\` entry point. The existing adapter-flavoured
\`gather_diagnosis(adapter)\` becomes a one-line wrapper, so the CLI
keeps its current shape and the test suite (\`test_diagnose_*\`) is
unchanged.

**New handler + tool.**

\`handlers/diagnose.handle_diagnose\` opens a raw client against the
resolved ledger URL, calls \`gather_diagnosis_raw\`, and classifies
the result into a \`recovery_path\`:

- \`clean\` — schema matches binary, tables look sane.
- \`fixable\` — schema behind binary; next normal connect migrates.
- \`reset_rebuild\` — ledger broken AND \`.bicameral/events/\` has
  events on disk; \`bicameral_reset(replay_from_events=true,
  confirm=true)\` recovers without data loss.
- \`reset_destructive\` — ledger broken AND no events; reset loses
  decision history, user must accept.

\`next_action\` is a human-readable string the agent renders verbatim.
Wired into \`server.py\` list_tools + dispatch as
\`bicameral.diagnose\`. Pydantic contract \`DiagnoseResponse\` lives
alongside \`ResetResponse\` in \`contracts.py\`.

**SKILL contract.**

\`skills/bicameral-diagnose/SKILL.md\` documents auto-fire from any
schema/migration error envelope, the recovery-path → command matrix,
and the read-only invariant (repair is CLI-driven, never an
in-session agent action).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e (#296)

\`bicameral_reset(replay_from_events=True, confirm=True)\` (only valid
with \`wipe_mode="ledger"\`) wipes the materialized DB, resets the
team-mode watermark, and replays every event in
\`.bicameral/events/*.jsonl\` back through the same ingest path team
mode uses. The event log is the canonical record (committed to git
in team mode); replay is recovery, not destruction.

Implementation reuses \`EventMaterializer.replay_new_events\` — the
exact code team mode runs on every sync — so replay-vs-live
divergence is impossible by construction. Determinism of the
underlying canonical_id resolution is tracked separately under #296
(P0 follow-up: regression suite covering every event type the
materializer dispatches on).

**Behaviour.**

- Dry run reports on-disk event count via \`events_replayed\` so the
  user sees scale before confirming.
- Post-confirm response reports actual replayed count + any
  \`replay_errors\` (wipe still committed; replay errors don't
  destroy data, only block recovery).
- \`replay_from_events=True\` + \`wipe_mode="full"\` is rejected with
  a clear error — full-wipe deletes the substrate we'd replay from.
- Honours \`BICAMERAL_DATA_PATH\` symmetrically with
  \`adapters/ledger.py\`'s team-mode write path so the replay reads
  from the same directory the writers wrote to.

**Skill.**

\`skills/bicameral-reset/SKILL.md\` documents the new flag, the
exclusivity rule with full-wipe, and the recommended pairing with
\`bicameral_diagnose\`'s \`recovery_path="reset_rebuild"\`.

**Contracts.**

\`ResetResponse.events_replayed\` (default 0) carries the count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI test classes that lock down the safety contract for layers A
and B. Both run on every PR via the existing
\`test-mcp-regression.yml\` workflow.

**Fixture-replay suite (\`test_legacy_ledger_fixtures.py\`).**

Frozen DB shapes under \`tests/fixtures/legacy_ledgers/\`. Each
fixture is a Python module exporting an async \`build(client)\`
coroutine that mutates the client into a known-historical bad state.
The parametrized test:

  1. Builds the bad state in a \`memory://\` client.
  2. Runs the production \`init_schema\` + \`migrate\` path.
  3. Asserts schema reaches \`SCHEMA_VERSION\`, fixture-specific
     invariants hold, and a second \`init_schema\` + \`migrate\` is
     a no-op (idempotent).

First fixture is the user's exact dogfood symptom:
\`v3_yields_source_span\` — \`yields.in = source_span:*\` rows
surviving past v5, schema_meta.version = 16 to force the v16→v17
boundary. Adding a new fixture later requires zero test code:
register it in \`FIXTURES\` and the parametrized test runs against
it. README under the fixtures dir documents the registration shape.

**Error-format provocation suite (\`test_schema_recoverable_errors.py\`).**

Three tests that lock down the SurrealDB error-string contract layer
A depends on:

  - UNIQUE-violation produces a substring matching
    \`RECOVERABLE_DEFINE_PATTERNS\` (existing v4→v5 path).
  - Type-mismatch on \`DEFINE INDEX OVERWRITE\` against legacy rows
    produces a matching substring (the #296 scenario).
  - All entries in \`RECOVERABLE_DEFINE_PATTERNS\` are lowercase
    (the catch lower-cases the message before substring match).

When a future \`surrealdb-py\` bump changes the format, these tests
fail loudly with the exact message the maintainer needs to update
the patterns list — turning a silent regression into a known-loud
signal.

**CI wiring.**

\`test-mcp-regression.yml\` lists the two new files in the pytest
invocation. \`test_ledger_bicameral_meta_migration.py\` had a
hardcoded \`assert SCHEMA_VERSION == 16\` updated to
\`assert SCHEMA_VERSION >= 16\` so future bumps land cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff I001 — alphabetical block. handlers.diagnose belongs after
handlers.bind, not after handlers.ratify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ruff format --check .` was failing on the five files added in this
PR. No behavioural changes — purely whitespace/line-wrapping per the
project's ruff config. 42-test smoke pass green after the reformat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mypy flagged DiagnoseResponse.recovery_path as expecting
Literal['clean','fixable','reset_rebuild','reset_destructive']
while _classify_recovery returned plain str. Add a `RecoveryPath`
type alias and use it on both the function signature and the
intermediate `path` binding for the rec>exp branch. Pure typing —
no runtime change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the dashboard image at the bottom of "How It Feels" with a
three-beat demo video section (ingest -> preflight -> ratify async)
referencing GitHub user-attachments URLs so videos render as inline
players. Moves the "Star on GitHub" banner from the top header to a
centered placement immediately after the demo, turning it into a
post-demo conversion beat instead of a misaligned header element.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itch

Replaces the single-line MCP-server description with a position-take
opener: paragraph 1 names the failure mode (agreements emerge mid-flight,
never reach a doc); paragraph 2 introduces Bicameral MCP as a spec
compliance layer that captures both formal source materials (transcripts,
PRDs, Slack) and undiscussed mid-implementation decisions to be ratified
async by the product owner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
triage→main: resilient ledger migration + bicameral_diagnose + reset --replay-from-events (#296) + README demo videos & opener rewrite (#299)
…from-events + README polish

Bumps pyproject.toml 0.14.4 → 0.14.5 and adds the v0.14.5 CHANGELOG
entry covering the #296 ledger-resilience track (bicameral_diagnose,
reset --replay-from-events, v16→v17 yields cleanup) and the #299
README pass (opener rewrite, demo video section, star CTA relocation).

Hotfix on the README opener: tightens "and most agreements emerge
mid-flight, in corrections and side comments that never reach a doc"
to "and requirement gaps surfaced mid-implementation are buried
under thousands of lines of code". The new framing locates the
failure mode in the engineering medium (code volume the human
can't review) rather than the conversational one (undocumented
agreements), which aligns the opener with the spec compliance
layer positioning in paragraph 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
release: v0.14.5 (triage) — bicameral_diagnose tool + reset --replay-from-events + README polish
… + decision_id alias fix

The banner tests previously used MagicMock for ctx and AsyncMock for ledger,
returning hand-crafted dicts. They stayed green even when get_decisions_by_status
silently returned decision_id=None for every row (the SQL selected an undefined
field — see ledger.adapter:584).

Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual
get_decisions_by_status query. The first sociable run surfaced the latent bug,
which is fixed in this commit by aliasing type::string(id) AS decision_id
(matches the pattern at queries.py:167, 228, 404, 512).

Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock
primitives) are left as-is and now documented inline.

Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying
the preference: SimpleNamespace ctx + real adapter for handler/ledger tests,
narrow seams only when a collaborator can't be run in tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 10, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99fd32a2-b922-4912-b251-4388284aee9b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces mandatory sociable testing policy via CLAUDE.md, fixes SurrealDB query aliasing in ledger adapter's get_decisions_by_status, and refactors sync_middleware tests from mock-based to real SurrealDB memory-backed end-to-end assertions.

Changes

Sociable Testing Refactor with Adapter Query Fix

Layer / File(s) Summary
Testing Policy
CLAUDE.md
Introduces mandatory "Sociable Testing for UX Paths" section defining rules: prefer real ledger adapter/client with seeded schema over MagicMock/AsyncMock, use SimpleNamespace for ctx, allow narrow patching only for hard-to-run boundaries, with validation checklist for tests-only PRs.
Adapter Query Fix
ledger/adapter.py
get_decisions_by_status query now aliases SurrealDB record id as decision_id via type::string(id) AS decision_id instead of selecting non-existent field, ensuring proper decision-id values in banner results.
Test Infrastructure Setup
tests/test_sync_middleware.py
Replaces mock-based substrate with real SurrealDB memory:// backend: _make_real_adapter() connects and initializes schema, _next_canonical() generates unique ids, _banner_ctx() and _ensure_ctx() provide lightweight real-ledger context, helpers seed drifted/ungrounded/proposal decisions.
Banner Test Suite
tests/test_sync_middleware.py
All banner tests refactored to seed real ledger decisions and assert against actual query behavior: drifted/ungrounded decision counts, truncation at top 10 with drifted prioritized, per-session firing, exception swallowing, stale vs. fresh proposal detection. New test validates all open statuses surface in banner items.
Ensure Ledger Synced Tests
tests/test_sync_middleware.py
Tests for ensure_ledger_synced refactored to use _ensure_ctx() with minimal SHA-cache state; validates calling/skipping handle_link_commit based on HEAD advancement and swallowing exceptions from handle_link_commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • BicameralAI/bicameral-mcp#56: Overlapping changes to ledger adapter's decision_id extraction and sync_middleware ledger interaction, moving tests to real SurrealDB and updating banner query behavior.

Poem

🐰 Hops through the test suite with joy!
Mock dittos fade, real queries stay,
The banner blooms with true SQLite dreams,
Each decision seeded, each assertion beams,
Sociable testing leads the way!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the two main changes: refactoring banner tests to use sociable testing approach and fixing a decision_id alias bug in the adapter.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sociable-sync-middleware-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jinhongkuan jinhongkuan requested a review from silongtan May 10, 2026 21:38
@jinhongkuan jinhongkuan changed the base branch from main to dev May 10, 2026 21:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_sync_middleware.py`:
- Around line 184-203: The test
test_banner_queries_each_open_status_actually_surfaces currently only seeds
drifted and ungrounded so it cannot verify context_pending reaches the banner;
either (A) seed a context_pending row before calling get_session_start_banner by
calling the existing helper (e.g. _seed_context_pending(client,
description="c1")) and update the final assertion to expect
{"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by
changing the docstring and the final assertion to only assert the two seeded
statuses {"drifted","ungrounded"}; locate and modify the test function
test_banner_queries_each_open_status_actually_surfaces and the call to
get_session_start_banner to apply one of these fixes.
- Around line 18-29: The import block fails Ruff I001: reorder and format
imports into standard groups (stdlib, third-party, local) and alphabetize within
each group; specifically, group asyncio, datetime (UTC, datetime, timedelta),
pathlib.Path, types.SimpleNamespace, and unittest.mock (AsyncMock, MagicMock,
patch) as stdlib, then third-party pytest, and finally local imports from
handlers.sync_middleware (ensure_ledger_synced, get_session_start_banner) and
ledger modules (SurrealDBLedgerAdapter, LedgerClient, init_schema, migrate);
remove any unused imports and ensure the import style matches existing project
conventions so Ruff stops flagging I001.
- Around line 370-376: The test may short-circuit and never await
handle_link_commit because _LAST_SYNCED_SHA or the repo HEAD can equal the test
context's sha; update test_ensure_swallows_link_commit_exception to force the
"head advanced" branch by setting the module-level _LAST_SYNCED_SHA to a
different value than the repo HEAD (or otherwise mutate the test ctx so
ctx.repo.head.commit.hexsha[:7] != _LAST_SYNCED_SHA) before calling
ensure_ledger_synced; this guarantees ensure_ledger_synced will call and await
handlers.link_commit.handle_link_commit (the AsyncMock) so its RuntimeError
side_effect is exercised and swallowed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d1c2279-8485-44e9-ac44-726b6632f856

📥 Commits

Reviewing files that changed from the base of the PR and between b963dc8 and 0bc5356.

📒 Files selected for processing (3)
  • CLAUDE.md
  • ledger/adapter.py
  • tests/test_sync_middleware.py

Comment on lines +18 to +29
import asyncio
from datetime import UTC, datetime, timedelta
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from handlers.sync_middleware import ensure_ledger_synced, get_session_start_banner
from ledger.adapter import SurrealDBLedgerAdapter
from ledger.client import LedgerClient
from ledger.schema import init_schema, migrate

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the import block so Ruff stops failing.

CI is already red on this block with I001, so this PR will not go green until the imports are re-sorted/formatted.

🧰 Tools
🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt

[error] 16-29: Ruff check failed (I001): Import block is un-sorted or un-formatted. Organize imports; run with ruff check . --fix.

🪛 GitHub Actions: Lint & Type Check / ruff + mypy

[error] 16-29: Ruff import sorting/formatting failed (I001). Import block is un-sorted or un-formatted. Found 1 error. Help: Organize imports. Fixable with ruff check . --fix.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_sync_middleware.py` around lines 18 - 29, The import block fails
Ruff I001: reorder and format imports into standard groups (stdlib, third-party,
local) and alphabetize within each group; specifically, group asyncio, datetime
(UTC, datetime, timedelta), pathlib.Path, types.SimpleNamespace, and
unittest.mock (AsyncMock, MagicMock, patch) as stdlib, then third-party pytest,
and finally local imports from handlers.sync_middleware (ensure_ledger_synced,
get_session_start_banner) and ledger modules (SurrealDBLedgerAdapter,
LedgerClient, init_schema, migrate); remove any unused imports and ensure the
import style matches existing project conventions so Ruff stops flagging I001.

Comment on lines +184 to 203
async def test_banner_queries_each_open_status_actually_surfaces():
"""The banner must surface decisions across ALL queried statuses.

Original test asserted ``get_decisions_by_status.assert_called_once_with(
["drifted", "ungrounded", "context_pending"])`` against a mock — a
tautology mirroring the SQL string. The real behavior contract is:
rows with each of those statuses end up in the banner.
``context_pending`` rows are routed through ``status='ungrounded'`` in
the production query path; this test pins the visible-to-agent shape.
"""
adapter, client = await _make_real_adapter()
await _seed_drifted(client, description="d1")
await _seed_ungrounded(client, description="u1")
ctx = _banner_ctx(adapter)

banner = await get_session_start_banner(ctx)

assert banner is not None
assert {i["status"] for i in banner.items} == {"drifted", "ungrounded"}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Either exercise context_pending here or narrow the test’s claim.

This test only seeds drifted and ungrounded, so it never proves the third queried “open” status still reaches the banner. If the context_pending path regresses, this stays green.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_sync_middleware.py` around lines 184 - 203, The test
test_banner_queries_each_open_status_actually_surfaces currently only seeds
drifted and ungrounded so it cannot verify context_pending reaches the banner;
either (A) seed a context_pending row before calling get_session_start_banner by
calling the existing helper (e.g. _seed_context_pending(client,
description="c1")) and update the final assertion to expect
{"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by
changing the docstring and the final assertion to only assert the two seeded
statuses {"drifted","ungrounded"}; locate and modify the test function
test_banner_queries_each_open_status_actually_surfaces and the call to
get_session_start_banner to apply one of these fixes.

Comment on lines 370 to 376
async def test_ensure_swallows_link_commit_exception():
ctx = _make_ctx()
ctx = _ensure_ctx()

with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc:
mock_lc.side_effect = RuntimeError("git not available")
# Must not raise
await ensure_ledger_synced(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Force the head-advanced branch in the swallow test.

As written, this can pass without ever awaiting handle_link_commit if _LAST_SYNCED_SHA or the real repo HEAD short-circuits the function first. That makes the swallow path a false positive.

Suggested fix
-async def test_ensure_swallows_link_commit_exception():
+async def test_ensure_swallows_link_commit_exception(monkeypatch):
     ctx = _ensure_ctx()
+    monkeypatch.setattr("handlers.sync_middleware._LAST_SYNCED_SHA", "")

-    with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc:
+    with (
+        patch("handlers.link_commit._read_current_head_sha", return_value="new_sha"),
+        patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc,
+    ):
         mock_lc.side_effect = RuntimeError("git not available")
         # Must not raise
         await ensure_ledger_synced(ctx)
+        mock_lc.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_sync_middleware.py` around lines 370 - 376, The test may
short-circuit and never await handle_link_commit because _LAST_SYNCED_SHA or the
repo HEAD can equal the test context's sha; update
test_ensure_swallows_link_commit_exception to force the "head advanced" branch
by setting the module-level _LAST_SYNCED_SHA to a different value than the repo
HEAD (or otherwise mutate the test ctx so ctx.repo.head.commit.hexsha[:7] !=
_LAST_SYNCED_SHA) before calling ensure_ledger_synced; this guarantees
ensure_ledger_synced will call and await handlers.link_commit.handle_link_commit
(the AsyncMock) so its RuntimeError side_effect is exercised and swallowed.

…dleware-tests

# Conflicts:
#	CHANGELOG.md
#	README.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan jinhongkuan had a problem deploying to recording-approval May 10, 2026 22:08 — with GitHub Actions Failure
@jinhongkuan jinhongkuan merged commit e307635 into dev May 10, 2026
8 of 9 checks passed
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…gate (BicameralAI#58)

Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement
only** — no runtime change to `handle_preflight` or any retrieval surface.
Recall regression risk = zero. The Phase B optimization choice (multi-hop
expansion / semantic search / LLM reranker) is gated on this PR's first
stable baseline, per the wiki's optimization principle: "identify the
specific scenario, then optimize."

Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to
recommended):

  Q1 dataset size       → 25 cases hand-curated, matches M2's 23
  Q2 miss-mode buckets  → three (vocabulary / unbound / transitive),
                          matches the issue body's framing
  Q3 fire-rate gate     → raw retrieval (`response.decisions`); fire is
                          downstream and a secondary diagnostic
  Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------

  overall recall    surfaced / (surfaced + missed)        gate ≥ 0.70
  per-mode recall   same, sliced by miss_mode             gate ≥ 0.50
  fire rate         response.fired / total                gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the
recall denominator but counted separately so reviewers can see them.

Files
-----

  tests/fixtures/preflight_m6/dataset.py       (412 LOC)
    25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes.
    Frozen dataclass; GENERATOR_VERSION constant invalidates downstream
    caches when bumped. Import-time _validate_dataset() fails loud on
    duplicate case_id, invalid miss_mode, transitive case without
    intended_file_path, unbound case with non-ungrounded status.

  tests/eval/_preflight_m6_seeder.py            (231 LOC)
    Per-case freshness: each call creates a new tempdir + memory:// ledger
    + git-initialized repo + writes a placeholder file (or the transitive
    case's intended + caller files). Calls the real handle_ingest +
    handle_bind so seeded rows have production shape (source_type,
    span, signoff, binds_to). Reset code-locator + ledger singletons
    before AND after so the next case starts clean.

  tests/eval_preflight_m6_recall.py             (274 LOC)
    Argparse runner, drives the seeder + handle_preflight, classifies
    outcomes, aggregates. JSON output + gate enforcement
    (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so
    existing CI patterns transfer.

  tests/eval_preflight_m6_summary.py            (162 LOC)
    Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode
    table + collapsible missed-case detail with topic + intended
    description. Fail-quiet on missing JSON / parse errors.

  tests/test_preflight_m6_eval.py               (267 LOC)
    16 sociable unit tests for the classifier + aggregator. Per the new
    CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace
    + real M6Case dataclasses, NEVER MagicMock — so any added/removed
    field on the response shape fails the test honestly.

  .github/workflows/test-mcp-regression.yml     (+31 LOC)
    New "M6 preflight recall eval (warn-only)" + summary steps after M2.
    No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic.

  CHANGELOG.md                                  (+2 lines)
    [Unreleased] / Added entry.

Local verification
------------------

  - 16/16 sociable unit tests pass on the classifier + aggregator
    (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator,
    test_per_miss_mode_breakdown, etc.)
  - Dataset import + _validate_dataset() pass — 25 cases (8/8/9)
  - Runner --help renders cleanly
  - Summary renderer smoke-tested on synthetic JSON — per-mode table +
    missed-case detail render correctly with emoji gates
  - ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR (intentionally — Phase B gating)
------------------------------------------------------

  - Any runtime change to handle_preflight or _region_anchored_preflight
  - Skill changes (no agent-facing contract change in Phase A)
  - Multi-hop / call-graph / inheritance graph expansion (Phase B
    candidate, deferred)
  - Semantic search layer (Phase B candidate, deferred)
  - LLM reranker (Phase B candidate, deferred)
  - Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode
from the per-mode breakdown and ship Phase B targeted to it. Cheap-first
ordering per the wiki: search_hint refinement → multi-hop graph → semantic
→ reranker.

Refs BicameralAI#58. Plan: plan-58-preflight-decision-detection.md.
Phase 2 spec signoff: BicameralAI#58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant