test(sync_middleware): sociable banner tests + decision_id alias fix#303
Conversation
Two paired changes that together unblock any ledger whose schema_meta.version rolled past 5 with v3-era source_span rows still in the yields table — the dogfood crash that motivated this work. **Tolerant init_schema (Layer A).** \`_execute_define_idempotent\` now recognizes a small allowlist of SurrealDB error substrings as recoverable: existing "already exists" and "already contains" plus the new "expected a record<" (TYPE constraint mismatch when DEFINE TABLE/INDEX OVERWRITE re-validates legacy rows) and a defensive "but expected" for adjacent value-type errors. The patterns live in module-level RECOVERABLE_DEFINE_PATTERNS so the substring contract is testable. When a recoverable error fires, init logs a warning and emits a SCHEMA_DEFINE_SKIPPED audit event so the next migration's cleanup is visible in the audit log. **Re-runnable yields cleanup (Layer B).** The v4→v5 cleanup body is extracted into shared \`_clean_yields_legacy_rows\` + \`_dedupe_yields\` helpers. v4→v5 still calls them, so historical behaviour is unchanged. The new \`_migrate_v16_to_v17\` calls the same helpers, which means any DB that ran past v5 with the corruption still present gets the cleanup on the v16→v17 boundary instead of permanently — the migrate() loop only iterates \`range(current+1, SCHEMA_VERSION+1)\`, so a one-shot historical migration is permanently unreachable for users who upgraded later. \`SCHEMA_VERSION\` bumps to 17. SCHEMA_COMPATIBILITY[17] is a placeholder release-eng pins at PR merge. **Audit log.** New \`AuditEventType.SCHEMA_DEFINE_SKIPPED\` (warn-level) emitted by \`_execute_define_idempotent\` whenever a UNIQUE-violation or type-mismatch is swallowed — gives operators visibility into the in-flight cleanup without breaking init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…296) Mirrors the existing \`bicameral-mcp diagnose\` CLI as an MCP tool so agents can fire it from any tool-error envelope. Critical property: the handler opens a raw \`LedgerClient\` and never goes through \`adapter.connect()\` — which means it produces a useful diagnosis even when normal connect crashes inside \`init_schema\` / \`migrate\`, the exact failure mode that motivated #296. **Refactor.** \`cli/_diagnose_gather.py\` grows three raw-client variants (\`_read_bicameral_meta_raw\`, \`_read_schema_version_raw\`, \`_read_table_counts_raw\`) and a \`gather_diagnosis_raw(client, ledger_url)\` entry point. The existing adapter-flavoured \`gather_diagnosis(adapter)\` becomes a one-line wrapper, so the CLI keeps its current shape and the test suite (\`test_diagnose_*\`) is unchanged. **New handler + tool.** \`handlers/diagnose.handle_diagnose\` opens a raw client against the resolved ledger URL, calls \`gather_diagnosis_raw\`, and classifies the result into a \`recovery_path\`: - \`clean\` — schema matches binary, tables look sane. - \`fixable\` — schema behind binary; next normal connect migrates. - \`reset_rebuild\` — ledger broken AND \`.bicameral/events/\` has events on disk; \`bicameral_reset(replay_from_events=true, confirm=true)\` recovers without data loss. - \`reset_destructive\` — ledger broken AND no events; reset loses decision history, user must accept. \`next_action\` is a human-readable string the agent renders verbatim. Wired into \`server.py\` list_tools + dispatch as \`bicameral.diagnose\`. Pydantic contract \`DiagnoseResponse\` lives alongside \`ResetResponse\` in \`contracts.py\`. **SKILL contract.** \`skills/bicameral-diagnose/SKILL.md\` documents auto-fire from any schema/migration error envelope, the recovery-path → command matrix, and the read-only invariant (repair is CLI-driven, never an in-session agent action). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e (#296) \`bicameral_reset(replay_from_events=True, confirm=True)\` (only valid with \`wipe_mode="ledger"\`) wipes the materialized DB, resets the team-mode watermark, and replays every event in \`.bicameral/events/*.jsonl\` back through the same ingest path team mode uses. The event log is the canonical record (committed to git in team mode); replay is recovery, not destruction. Implementation reuses \`EventMaterializer.replay_new_events\` — the exact code team mode runs on every sync — so replay-vs-live divergence is impossible by construction. Determinism of the underlying canonical_id resolution is tracked separately under #296 (P0 follow-up: regression suite covering every event type the materializer dispatches on). **Behaviour.** - Dry run reports on-disk event count via \`events_replayed\` so the user sees scale before confirming. - Post-confirm response reports actual replayed count + any \`replay_errors\` (wipe still committed; replay errors don't destroy data, only block recovery). - \`replay_from_events=True\` + \`wipe_mode="full"\` is rejected with a clear error — full-wipe deletes the substrate we'd replay from. - Honours \`BICAMERAL_DATA_PATH\` symmetrically with \`adapters/ledger.py\`'s team-mode write path so the replay reads from the same directory the writers wrote to. **Skill.** \`skills/bicameral-reset/SKILL.md\` documents the new flag, the exclusivity rule with full-wipe, and the recommended pairing with \`bicameral_diagnose\`'s \`recovery_path="reset_rebuild"\`. **Contracts.** \`ResetResponse.events_replayed\` (default 0) carries the count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI test classes that lock down the safety contract for layers A
and B. Both run on every PR via the existing
\`test-mcp-regression.yml\` workflow.
**Fixture-replay suite (\`test_legacy_ledger_fixtures.py\`).**
Frozen DB shapes under \`tests/fixtures/legacy_ledgers/\`. Each
fixture is a Python module exporting an async \`build(client)\`
coroutine that mutates the client into a known-historical bad state.
The parametrized test:
1. Builds the bad state in a \`memory://\` client.
2. Runs the production \`init_schema\` + \`migrate\` path.
3. Asserts schema reaches \`SCHEMA_VERSION\`, fixture-specific
invariants hold, and a second \`init_schema\` + \`migrate\` is
a no-op (idempotent).
First fixture is the user's exact dogfood symptom:
\`v3_yields_source_span\` — \`yields.in = source_span:*\` rows
surviving past v5, schema_meta.version = 16 to force the v16→v17
boundary. Adding a new fixture later requires zero test code:
register it in \`FIXTURES\` and the parametrized test runs against
it. README under the fixtures dir documents the registration shape.
**Error-format provocation suite (\`test_schema_recoverable_errors.py\`).**
Three tests that lock down the SurrealDB error-string contract layer
A depends on:
- UNIQUE-violation produces a substring matching
\`RECOVERABLE_DEFINE_PATTERNS\` (existing v4→v5 path).
- Type-mismatch on \`DEFINE INDEX OVERWRITE\` against legacy rows
produces a matching substring (the #296 scenario).
- All entries in \`RECOVERABLE_DEFINE_PATTERNS\` are lowercase
(the catch lower-cases the message before substring match).
When a future \`surrealdb-py\` bump changes the format, these tests
fail loudly with the exact message the maintainer needs to update
the patterns list — turning a silent regression into a known-loud
signal.
**CI wiring.**
\`test-mcp-regression.yml\` lists the two new files in the pytest
invocation. \`test_ledger_bicameral_meta_migration.py\` had a
hardcoded \`assert SCHEMA_VERSION == 16\` updated to
\`assert SCHEMA_VERSION >= 16\` so future bumps land cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ruff I001 — alphabetical block. handlers.diagnose belongs after handlers.bind, not after handlers.ratify. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ruff format --check .` was failing on the five files added in this PR. No behavioural changes — purely whitespace/line-wrapping per the project's ruff config. 42-test smoke pass green after the reformat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mypy flagged DiagnoseResponse.recovery_path as expecting Literal['clean','fixable','reset_rebuild','reset_destructive'] while _classify_recovery returned plain str. Add a `RecoveryPath` type alias and use it on both the function signature and the intermediate `path` binding for the rec>exp branch. Pure typing — no runtime change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the dashboard image at the bottom of "How It Feels" with a three-beat demo video section (ingest -> preflight -> ratify async) referencing GitHub user-attachments URLs so videos render as inline players. Moves the "Star on GitHub" banner from the top header to a centered placement immediately after the demo, turning it into a post-demo conversion beat instead of a misaligned header element. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itch Replaces the single-line MCP-server description with a position-take opener: paragraph 1 names the failure mode (agreements emerge mid-flight, never reach a doc); paragraph 2 introduces Bicameral MCP as a spec compliance layer that captures both formal source materials (transcripts, PRDs, Slack) and undiscussed mid-implementation decisions to be ratified async by the product owner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
triage→main: resilient ledger migration + bicameral_diagnose + reset --replay-from-events (#296) + README demo videos & opener rewrite (#299)
…from-events + README polish Bumps pyproject.toml 0.14.4 → 0.14.5 and adds the v0.14.5 CHANGELOG entry covering the #296 ledger-resilience track (bicameral_diagnose, reset --replay-from-events, v16→v17 yields cleanup) and the #299 README pass (opener rewrite, demo video section, star CTA relocation). Hotfix on the README opener: tightens "and most agreements emerge mid-flight, in corrections and side comments that never reach a doc" to "and requirement gaps surfaced mid-implementation are buried under thousands of lines of code". The new framing locates the failure mode in the engineering medium (code volume the human can't review) rather than the conversational one (undocumented agreements), which aligns the opener with the spec compliance layer positioning in paragraph 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
release: v0.14.5 (triage) — bicameral_diagnose tool + reset --replay-from-events + README polish
… + decision_id alias fix The banner tests previously used MagicMock for ctx and AsyncMock for ledger, returning hand-crafted dicts. They stayed green even when get_decisions_by_status silently returned decision_id=None for every row (the SQL selected an undefined field — see ledger.adapter:584). Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual get_decisions_by_status query. The first sociable run surfaced the latent bug, which is fixed in this commit by aliasing type::string(id) AS decision_id (matches the pattern at queries.py:167, 228, 404, 512). Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock primitives) are left as-is and now documented inline. Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference: SimpleNamespace ctx + real adapter for handler/ledger tests, narrow seams only when a collaborator can't be run in tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughIntroduces mandatory sociable testing policy via CLAUDE.md, fixes SurrealDB query aliasing in ledger adapter's ChangesSociable Testing Refactor with Adapter Query Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_sync_middleware.py`:
- Around line 184-203: The test
test_banner_queries_each_open_status_actually_surfaces currently only seeds
drifted and ungrounded so it cannot verify context_pending reaches the banner;
either (A) seed a context_pending row before calling get_session_start_banner by
calling the existing helper (e.g. _seed_context_pending(client,
description="c1")) and update the final assertion to expect
{"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by
changing the docstring and the final assertion to only assert the two seeded
statuses {"drifted","ungrounded"}; locate and modify the test function
test_banner_queries_each_open_status_actually_surfaces and the call to
get_session_start_banner to apply one of these fixes.
- Around line 18-29: The import block fails Ruff I001: reorder and format
imports into standard groups (stdlib, third-party, local) and alphabetize within
each group; specifically, group asyncio, datetime (UTC, datetime, timedelta),
pathlib.Path, types.SimpleNamespace, and unittest.mock (AsyncMock, MagicMock,
patch) as stdlib, then third-party pytest, and finally local imports from
handlers.sync_middleware (ensure_ledger_synced, get_session_start_banner) and
ledger modules (SurrealDBLedgerAdapter, LedgerClient, init_schema, migrate);
remove any unused imports and ensure the import style matches existing project
conventions so Ruff stops flagging I001.
- Around line 370-376: The test may short-circuit and never await
handle_link_commit because _LAST_SYNCED_SHA or the repo HEAD can equal the test
context's sha; update test_ensure_swallows_link_commit_exception to force the
"head advanced" branch by setting the module-level _LAST_SYNCED_SHA to a
different value than the repo HEAD (or otherwise mutate the test ctx so
ctx.repo.head.commit.hexsha[:7] != _LAST_SYNCED_SHA) before calling
ensure_ledger_synced; this guarantees ensure_ledger_synced will call and await
handlers.link_commit.handle_link_commit (the AsyncMock) so its RuntimeError
side_effect is exercised and swallowed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3d1c2279-8485-44e9-ac44-726b6632f856
📒 Files selected for processing (3)
CLAUDE.mdledger/adapter.pytests/test_sync_middleware.py
| import asyncio | ||
| from datetime import UTC, datetime, timedelta | ||
| from pathlib import Path | ||
| from types import SimpleNamespace | ||
| from unittest.mock import AsyncMock, MagicMock, patch | ||
|
|
||
| import pytest | ||
|
|
||
| from handlers.sync_middleware import ensure_ledger_synced, get_session_start_banner | ||
| from ledger.adapter import SurrealDBLedgerAdapter | ||
| from ledger.client import LedgerClient | ||
| from ledger.schema import init_schema, migrate |
There was a problem hiding this comment.
Fix the import block so Ruff stops failing.
CI is already red on this block with I001, so this PR will not go green until the imports are re-sorted/formatted.
🧰 Tools
🪛 GitHub Actions: Lint & Type Check / 0_ruff + mypy.txt
[error] 16-29: Ruff check failed (I001): Import block is un-sorted or un-formatted. Organize imports; run with ruff check . --fix.
🪛 GitHub Actions: Lint & Type Check / ruff + mypy
[error] 16-29: Ruff import sorting/formatting failed (I001). Import block is un-sorted or un-formatted. Found 1 error. Help: Organize imports. Fixable with ruff check . --fix.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_sync_middleware.py` around lines 18 - 29, The import block fails
Ruff I001: reorder and format imports into standard groups (stdlib, third-party,
local) and alphabetize within each group; specifically, group asyncio, datetime
(UTC, datetime, timedelta), pathlib.Path, types.SimpleNamespace, and
unittest.mock (AsyncMock, MagicMock, patch) as stdlib, then third-party pytest,
and finally local imports from handlers.sync_middleware (ensure_ledger_synced,
get_session_start_banner) and ledger modules (SurrealDBLedgerAdapter,
LedgerClient, init_schema, migrate); remove any unused imports and ensure the
import style matches existing project conventions so Ruff stops flagging I001.
| async def test_banner_queries_each_open_status_actually_surfaces(): | ||
| """The banner must surface decisions across ALL queried statuses. | ||
|
|
||
| Original test asserted ``get_decisions_by_status.assert_called_once_with( | ||
| ["drifted", "ungrounded", "context_pending"])`` against a mock — a | ||
| tautology mirroring the SQL string. The real behavior contract is: | ||
| rows with each of those statuses end up in the banner. | ||
| ``context_pending`` rows are routed through ``status='ungrounded'`` in | ||
| the production query path; this test pins the visible-to-agent shape. | ||
| """ | ||
| adapter, client = await _make_real_adapter() | ||
| await _seed_drifted(client, description="d1") | ||
| await _seed_ungrounded(client, description="u1") | ||
| ctx = _banner_ctx(adapter) | ||
|
|
||
| banner = await get_session_start_banner(ctx) | ||
|
|
||
| assert banner is not None | ||
| assert {i["status"] for i in banner.items} == {"drifted", "ungrounded"} | ||
|
|
There was a problem hiding this comment.
Either exercise context_pending here or narrow the test’s claim.
This test only seeds drifted and ungrounded, so it never proves the third queried “open” status still reaches the banner. If the context_pending path regresses, this stays green.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_sync_middleware.py` around lines 184 - 203, The test
test_banner_queries_each_open_status_actually_surfaces currently only seeds
drifted and ungrounded so it cannot verify context_pending reaches the banner;
either (A) seed a context_pending row before calling get_session_start_banner by
calling the existing helper (e.g. _seed_context_pending(client,
description="c1")) and update the final assertion to expect
{"drifted","ungrounded","context_pending"}, or (B) narrow the test claim by
changing the docstring and the final assertion to only assert the two seeded
statuses {"drifted","ungrounded"}; locate and modify the test function
test_banner_queries_each_open_status_actually_surfaces and the call to
get_session_start_banner to apply one of these fixes.
| async def test_ensure_swallows_link_commit_exception(): | ||
| ctx = _make_ctx() | ||
| ctx = _ensure_ctx() | ||
|
|
||
| with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc: | ||
| mock_lc.side_effect = RuntimeError("git not available") | ||
| # Must not raise | ||
| await ensure_ledger_synced(ctx) |
There was a problem hiding this comment.
Force the head-advanced branch in the swallow test.
As written, this can pass without ever awaiting handle_link_commit if _LAST_SYNCED_SHA or the real repo HEAD short-circuits the function first. That makes the swallow path a false positive.
Suggested fix
-async def test_ensure_swallows_link_commit_exception():
+async def test_ensure_swallows_link_commit_exception(monkeypatch):
ctx = _ensure_ctx()
+ monkeypatch.setattr("handlers.sync_middleware._LAST_SYNCED_SHA", "")
- with patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc:
+ with (
+ patch("handlers.link_commit._read_current_head_sha", return_value="new_sha"),
+ patch("handlers.link_commit.handle_link_commit", new_callable=AsyncMock) as mock_lc,
+ ):
mock_lc.side_effect = RuntimeError("git not available")
# Must not raise
await ensure_ledger_synced(ctx)
+ mock_lc.assert_awaited_once()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_sync_middleware.py` around lines 370 - 376, The test may
short-circuit and never await handle_link_commit because _LAST_SYNCED_SHA or the
repo HEAD can equal the test context's sha; update
test_ensure_swallows_link_commit_exception to force the "head advanced" branch
by setting the module-level _LAST_SYNCED_SHA to a different value than the repo
HEAD (or otherwise mutate the test ctx so ctx.repo.head.commit.hexsha[:7] !=
_LAST_SYNCED_SHA) before calling ensure_ledger_synced; this guarantees
ensure_ledger_synced will call and await handlers.link_commit.handle_link_commit
(the AsyncMock) so its RuntimeError side_effect is exercised and swallowed.
…dleware-tests # Conflicts: # CHANGELOG.md # README.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gate (BicameralAI#58) Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement only** — no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize." Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to recommended): Q1 dataset size → 25 cases hand-curated, matches M2's 23 Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic Q4 ledger persistence → per-run temp + memory:// (per-case freshness) Three measurement axes (deliberately split for diagnosis) --------------------------------------------------------- overall recall surfaced / (surfaced + missed) gate ≥ 0.70 per-mode recall same, sliced by miss_mode gate ≥ 0.50 fire rate response.fired / total gate ≥ 0.60 Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them. Files ----- tests/fixtures/preflight_m6/dataset.py (412 LOC) 25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status. tests/eval/_preflight_m6_seeder.py (231 LOC) Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). Reset code-locator + ledger singletons before AND after so the next case starts clean. tests/eval_preflight_m6_recall.py (274 LOC) Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer. tests/eval_preflight_m6_summary.py (162 LOC) Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors. tests/test_preflight_m6_eval.py (267 LOC) 16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock — so any added/removed field on the response shape fails the test honestly. .github/workflows/test-mcp-regression.yml (+31 LOC) New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic. CHANGELOG.md (+2 lines) [Unreleased] / Added entry. Local verification ------------------ - 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.) - Dataset import + _validate_dataset() pass — 25 cases (8/8/9) - Runner --help renders cleanly - Summary renderer smoke-tested on synthetic JSON — per-mode table + missed-case detail render correctly with emoji gates - ruff check + ruff format --check + mypy all green on touched files What's NOT in this PR (intentionally — Phase B gating) ------------------------------------------------------ - Any runtime change to handle_preflight or _region_anchored_preflight - Skill changes (no agent-facing contract change in Phase A) - Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred) - Semantic search layer (Phase B candidate, deferred) - LLM reranker (Phase B candidate, deferred) - Real-corpus eval (synthetic first; corpus follow-up if needed) After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker. Refs BicameralAI#58. Plan: plan-58-preflight-decision-detection.md. Phase 2 spec signoff: BicameralAI#58 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
get_session_start_bannertests intests/test_sync_middleware.pyfrom solitary (MagicMock ctx + AsyncMock ledger returning hand-crafted dicts) to sociable — they now seed a realSurrealDBLedgerAdapterovermemory://and exercise the actualget_decisions_by_statusSQL.get_decisions_by_statusselecteddecision_idfrom thedecisiontable, but no such field is defined on that table (schema.py:114-154). Every banner row reached agents withdecision_id: None. Fixed inledger/adapter.pyby aliasingtype::string(id) AS decision_id— same pattern already used atqueries.py:167, 228, 404, 512.pilot/mcp/CLAUDE.mdcodifying the preference for handlers + ledger:SimpleNamespacectx + real adapter; narrow seams only when the collaborator genuinely can't run in tests. Includes a checklist for tests-only PRs.Why this matters
AI-authored tests skew solitary because mocks are easy to make pass. A solitary banner test stayed green for months while the production query returned null IDs. This PR is the worked example: same scenarios, same handler, real substrate — the bug fell out on the first run.
The non-banner tests (
ensure_ledger_syncedSHA-cache logic,repo_write_barrierconcurrency primitives) intentionally keep their narrow seam patches: link_commit has its own end-to-end coverage, and the barrier tests are about pure asyncio behavior. The new CLAUDE.md guidance flags these as legitimate exceptions.Test plan
pytest tests/test_sync_middleware.py— 21/21 passpytest tests/test_phase2_ledger.py tests/test_phase3_integration.py tests/test_link_commit_grounding.py tests/test_codegenome_continuity_service.py— 50/50 pass (no regression from the SQL alias)pytest tests/test_alpha_contract.py tests/test_alpha_flow.py— alpha_contract passes; two alpha_flow failures are pre-existing onmain(verified bygit stash+ rerun) and unrelated to this PROut of scope
test_ingest_size_limit.py,test_codegenome_adapter.py, etc.) — separate PRs, one at a time.ensure_ledger_syncedtests to use a real git repo — current narrow-seam approach is the right granularity for SHA-cache logic.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Documentation