diff --git a/.claude/skills/bicameral-doctor/SKILL.md b/.claude/skills/bicameral-doctor/SKILL.md index d2060c0c..950abca2 100644 --- a/.claude/skills/bicameral-doctor/SKILL.md +++ b/.claude/skills/bicameral-doctor/SKILL.md @@ -53,6 +53,12 @@ The handler returns a `DoctorResponse` with: - `ledger_summary` — `DoctorLedgerSummary` with repo-wide `total`, `drifted`, `pending`, `ungrounded`, `reflected` counts. Populated on branch scope only. - `action_hints` — merged from whichever sub-scan produced them. Same intensity-gated semantics as every other skill (`guided_mode` controls `blocking`). +### Per-entry advisory fields (read-path only, never gate behavior) + +- **`DriftEntry.cosmetic_hint: bool`** (on every entry inside `file_scan.decisions` and `branch_scan.decisions`). True when the HEAD-to-working-tree diff for that region is provably whitespace-only per the strict tree-sitter classifier (`ledger/ast_diff.is_cosmetic_change`). Never affects status; the entry stays drifted and the user must still address it. Use as a render-time tag (e.g. *"cosmetic edit, please confirm"*) — do not use it to suppress drift. +- **`pending_grounding_checks[].original_lines: [start, end]`** when `reason == "symbol_disappeared"` (visible inside `file_scan.sync_status.pending_grounding_checks` and the equivalent under branch scope). Lets the caller LLM run `git show :` over those lines to inspect the symbol's prior position before deciding what to do. Strictly informational. +- **`sync_status.verification_instruction`** is now built per response based on which `pending_*` payloads fired. For `pending_grounding_checks` with `reason == "symbol_disappeared"`, the text is **INFORMATIONAL ONLY** and explicitly forbids calling `bicameral.bind` on the new location (it would create duplicate-binding state under the N:N `binds_to` relation). Until V2 ships atomic rebind, the doctor skill must not synthesize a bind CTA for relocation cases. For `reason == "ungrounded"`, the bind CTA is safe and remains in the instruction text — render it as guidance. + ## How to render ### Scope = file diff --git a/.claude/skills/bicameral-history/SKILL.md b/.claude/skills/bicameral-history/SKILL.md index ee97d77f..e6111709 100644 --- a/.claude/skills/bicameral-history/SKILL.md +++ b/.claude/skills/bicameral-history/SKILL.md @@ -34,6 +34,8 @@ bicameral.history( ) ``` +The response also carries an optional `sync_metrics` (`{sync_catchup_ms, barrier_held_ms}`) observability field for the catch-up time spent inside `ensure_ledger_synced`. **Skip rendering it** — these are server-side latency numbers, not user-visible signal. Log them if you're profiling, otherwise ignore. + ## How to present Group decisions by `HistoryFeature`. For each group: diff --git a/.claude/skills/bicameral-preflight/SKILL.md b/.claude/skills/bicameral-preflight/SKILL.md index b25fadee..c22922d0 100644 --- a/.claude/skills/bicameral-preflight/SKILL.md +++ b/.claude/skills/bicameral-preflight/SKILL.md @@ -106,6 +106,12 @@ The handler runs `bicameral.search` internally, gates on the user's `guided_mode` setting, conditionally chains to `bicameral.brief`, and returns a `PreflightResponse` with a `fired: bool` field. +The response also carries an optional `sync_metrics` +(`{sync_catchup_ms, barrier_held_ms}`) observability field for the +catch-up time spent in `ensure_ledger_synced`. **Skip rendering it** — +these are server-side latency numbers, not user-visible signal. Log +them if you're profiling, otherwise ignore. + ### 2.5 Render session-start banner if present Before evaluating `response.fired`, check `response.session_start_banner`. @@ -180,6 +186,14 @@ Categories with no ask-findings are silently skipped. If every finding in every category is mechanical, produce NO output (same as `fired=false` — silent). +**Cosmetic drift rule**: if a `drifted` entry has `cosmetic_hint=true`, +classify it as **mechanical** regardless of guided mode. The server has +verified via AST comparison that the change is whitespace-only and +semantically inert — the stored intent is still intact. Auto-resolve +silently; do NOT add it to the drift ask-queue and do NOT emit a +blocking hint. Render it with `~` prefix (not `⚠ DRIFTED:`) if you +render it at all — see the template in Step 5. + ### 5. Render the surfaced block When at least one ask-finding exists, surface the response using this @@ -200,6 +214,11 @@ format. Lead with the `(bicameral surfaced)` attribution line. Source: Drift evidence: + ~ REFORMATTED: ← cosmetic_hint=true only + :: + Source: + (whitespace-only change — intent intact, no action needed) + ⚠ N divergent decision pair(s) — pick a winner before continuing: • (): @@ -310,8 +329,9 @@ For every finding this skill surfaces, classify first: - **mechanical** — one obvious correct answer (e.g., renamed symbol with identical signature; a decision whose code moved but semantics - are intact). Auto-apply the resolution silently. Do NOT ask the - user. + are intact; a `drifted` entry with `cosmetic_hint=true` — AST + comparison confirmed whitespace-only change). Auto-apply the + resolution silently. Do NOT ask the user. - **ask** — reasonable people could disagree (e.g., drifted behavior where the old decision may still be valid; divergent decisions where no clear winner exists). Emit ONE question per finding, using the diff --git a/.claude/skills/bicameral-scan-branch/SKILL.md b/.claude/skills/bicameral-scan-branch/SKILL.md index 7f7bf277..2a42d0d8 100644 --- a/.claude/skills/bicameral-scan-branch/SKILL.md +++ b/.claude/skills/bicameral-scan-branch/SKILL.md @@ -53,7 +53,7 @@ The handler returns a `ScanBranchResponse` with: - `base_ref` / `head_ref` — the resolved refs that were diffed - `sweep_scope` — `"range_diff"` (default, good), `"head_only"` (base was unreachable — fell back to HEAD-only scope, surface to user), or `"range_truncated"` (range exceeded the 200-file cap; the scan ran on the first 200 files, rest need a separate pass) - `range_size` — number of files the sweep covered -- `decisions` — deduped list of `DriftEntry` across all files (each decision shows up once even if it touches multiple files) +- `decisions` — deduped list of `DriftEntry` across all files (each decision shows up once even if it touches multiple files). Drifted entries may carry `cosmetic_hint=true` when the HEAD-to-working-tree diff for that region is provably whitespace-only per the strict tree-sitter classifier (`ledger/ast_diff.is_cosmetic_change`). The hint is **advisory metadata only** — it never gates drift surfacing or status, and the entry stays in the drifted bucket regardless. Treat it as a render-time signal: a cosmetic-hinted drift is still drift the user must address. - `files_changed` — the file paths that were swept - `drifted_count` / `pending_count` / `ungrounded_count` / `reflected_count` - `undocumented_symbols` — union across all files diff --git a/.claude/skills/bicameral-search/SKILL.md b/.claude/skills/bicameral-search/SKILL.md index a4efa113..51a62946 100644 --- a/.claude/skills/bicameral-search/SKILL.md +++ b/.claude/skills/bicameral-search/SKILL.md @@ -29,6 +29,8 @@ Pre-flight check before coding — surface past decisions relevant to what you'r $ARGUMENTS — the feature, task, or area to search for prior decisions about +The response also carries an optional `sync_metrics` (`{sync_catchup_ms, barrier_held_ms}`) observability field for the catch-up time spent in the implicit `link_commit` that runs before search. **Skip rendering it** — these are server-side latency numbers, not user-visible signal. Log them if you're profiling, otherwise ignore. + ## Action Hint Contract (v0.4.10+) The response always includes an `action_hints` list. Two intensities, diff --git a/CHANGELOG.md b/CHANGELOG.md index 76fa93b8..fda804be 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,200 @@ All notable changes to bicameral-mcp are tracked here. Format loosely follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). +## Unreleased — desync optimization V1 — measurement + read-path advisory + +V1 of a two-part desync-correctness initiative. V1 ships measurement +infrastructure, a strict-whitelist cosmetic-change classifier, relocation +context enrichment, and a canonical 13-scenario regression matrix — +**without touching any destructive write path**. V2 (separate effort, +design captured in `docs/v2-desync-optimization-guide.md` with nine rounds of Codex +review) tackles the destructive-path overhaul: atomic rebind, baseline +advancement with full CAS, schema migration v6, append-only verdict +history. + +V1 introduces zero new mutating capabilities. Every change is one of: +read-only measurement, additive contract field, pure function, test +coverage, or a surgical bug fix to an already-shipped path. The plan, +phase breakdown, V2 deferred items, and Codex review parking lot live in +`docs/v2-desync-optimization-guide.md`. + +### Added — `tests/bench_drift.py` (A1) + +Drift benchmark harness. Seeds 100 decisions across 25 files via +tree-sitter `extract_symbols` (no BM25 index build required), times +`handle_search_decisions`, `handle_detect_drift`, `handle_link_commit` +under a `memory://` ledger, writes +`test-results/bench/drift_baseline.json` plus a stdout summary. +Marked `@pytest.mark.bench` so default test runs skip it; run via +`pytest tests/bench_drift.py -v -m bench -s`. + +Baseline on Apple Silicon (post-rebase, surrealdb 2.0.0): + +| handler | p50 (ms) | p95 (ms) | max (ms) | +|--------------------|---------:|---------:|---------:| +| search_decisions | 9.2 | 10.4 | 11.0 | +| detect_drift | 14.2 | 15.5 | 16.4 | +| link_commit (warm) | 7.3 | 8.0 | 8.3 | + +All 50–185× under the V2 perf targets in `PLAN.md:83` +(`search_decisions < 2s`, `detect_drift < 1s`). + +### Added — `handlers/sync_middleware.repo_write_barrier(ctx)` (A2-light) + +Per-repo `asyncio.Lock` async context manager backed by a module-level +`dict[repo_path, asyncio.Lock]`. `handle_bind` wraps its body via a thin +`_do_bind` inner function. Different repos run concurrently; same repo +serializes. Lazy guard-lock construction avoids the "bound to wrong +loop" pitfall across test event loops. Yields a mutable `BarrierTiming` +holder whose `held_ms` is populated on exit (including on exceptions). + +Deliberately narrow scope: does NOT protect `resolve_compliance` or +cross-process writers — both are V2 scope (V1 plan §5.2, §5.5). + +### Added — `contracts.SyncMetrics` (A3) + +```python +class SyncMetrics(BaseModel): + sync_catchup_ms: float | None = None + barrier_held_ms: float | None = None +``` + +Attached as `sync_metrics: SyncMetrics | None = None` to +`SearchDecisionsResponse`, `PreflightResponse`, `HistoryResponse`, +`BindResponse`. Purely additive, non-breaking. Each handler times its +own sync call locally so nested calls (e.g. preflight chaining to +search_decisions) don't step on each other's metrics. + +### Added — `ledger/ast_diff.is_cosmetic_change(before, after, lang)` (B1) + +Strict-whitelist tree-sitter classifier returning `True` only when two +snippets differ by inter-token whitespace alone. Compares a recursive +`(node.type, child_sigs | leaf_bytes)` signature; identifier renames, +comment edits (incl. `# type: ignore` / `# noqa` / `// @ts-ignore` / +build tags / lint pragmas), docstring edits, trailing-comma changes, +string-literal changes, import reorders, and any AST shape change all +return `False`. Reuses `code_locator.indexing.symbol_extractor._get_parser` +so the cosmetic detector and symbol indexer can never silently disagree +on supported languages: python, javascript, typescript, java, go, rust, +c_sharp (plus jsx → javascript and tsx → typescript via +`LANGUAGE_FALLBACK`). Unsupported langs, parse failures, and trees with +`has_error` all fail safe to `False`. + +False negatives (real cosmetic changes routed unbiased to L3 in V2) are +cheap; false positives (semantics-affecting changes mislabeled cosmetic) +bias future L3 prompts toward "looks fine" — exactly the failure mode +the strict whitelist prevents. + +### Added — `DriftEntry.cosmetic_hint: bool = False` (B2) + +Populated by `handlers.detect_drift._enrich_with_cosmetic_hints` after +the pure `raw_decisions_to_drift_entries` mapping (IO encapsulated +outside the pure function). Read-path advisory ONLY — never mutates +`content_hash`, never gates drift surfacing or status, never advances +baseline. Five fail-safe paths leave the hint `False`: non-drifted +entry, equal HEAD/working-tree bytes, unsupported file extension, +invalid line range, exception during classifier. + +Source comparison: HEAD bytes (via `ledger.status.get_git_content` ref +`"HEAD"`) vs working-tree bytes (ref `"working_tree"`), sliced to the +region's `(start_line, end_line)`. Language resolved from file extension +via `code_locator.indexing.symbol_extractor.EXTENSION_LANGUAGE`. + +### Added — `pending_grounding_checks[].original_lines` (D1) + +For `reason='symbol_disappeared'` entries, the payload now carries +`original_lines: [start_line, end_line]` so the caller LLM can run +`git show :` to inspect the symbol's prior +position when locating its new home. Strictly informational — no +actionable workflow. Single-line addition in `ledger/adapter.py`. + +### Added — `tests/test_desync_scenarios.py` (F1) + +Canonical regression matrix for the 13 desync scenarios from the Notion +"Auto-Grounding Problem" catalog, routed through the real handler +layer per the Apr 8 PR #84 lesson (tests bypassing handlers miss +post-ingest hooks). Self-contained tmp git-repo fixture per test. + +**Scorecard**: 12 PASS, 1 XFAIL. + +| # | Scenario | V1 outcome | +|---|---|---| +| 1 | New decision, matching code exists | ✅ ungrounded → caller binds | +| 2 | Code changed after grounded | ✅ pending + `pending_compliance_check` | +| 3 | Code deleted after grounded | ✅ symbol_disappeared | +| 4 | Symbol renamed in file | ✅ symbol_disappeared with `original_lines` | +| 5 | Symbol moved cross-file | ✅ symbol_disappeared | +| 6 | Code added later | ✅ caller binds explicitly | +| 7 | Cold start, no matching code | ✅ stays ungrounded | +| 8 | Drifted intent → atomic re-ground | ⏸ XFAIL (V2 §8 D2 — `bicameral_rebind` with old-binding CAS) | +| 9 | Intent description supersession | ✅ re-ingest succeeds | +| 10 | N decisions share a symbol | ✅ both surface | +| 11 | No server-side BM25 grounding (post-v0.6.0) | ✅ stays ungrounded | +| 12 | Line-shift edit | ✅ no spurious drift (`resolve_symbol_lines` self-heals) | +| 13 | `[Open Question]` prefix | ✅ ingested as gap | + +### Changed — `handlers/link_commit._build_verification_instruction()` + +Splits the v0.6.4 monolithic `_VERIFICATION_INSTRUCTION` into three +composable parts so the response text is conditional on which +`pending_*` payloads actually fired: + +- `pending_compliance_checks` present → resolve_compliance CTA. +- `pending_grounding_checks` with `reason='ungrounded'` → + `Grep/Read → validate_symbols / extract_symbols → bicameral.bind` CTA + (safe — no prior binding to retire, no duplicate-binding risk). +- `pending_grounding_checks` with `reason='symbol_disappeared'` → + **explicit "INFORMATIONAL ONLY — do NOT call bicameral.bind on the + new location" warning** citing the duplicate-binding hazard under + the N:N `binds_to` relation. Atomic rebind ships in V2. + +Addresses Codex pass-10 #2 + pass-12 #2: the v0.6.4 monolithic CTA +inadvertently routed relocation cases through the unsafe bind path. +The V1 split removes that without reducing the safe CTA for ungrounded. + +### Fixed — `ledger/adapter.py` ungrounded grounding-check `decision_id` + +`pending_grounding_checks` for ungrounded decisions emitted empty +`decision_id` because the consumer read `d.get("id", "")` from +`get_all_decisions(filter="ungrounded")`, but that query aliases the +field to `decision_id`. Callers had no handle to bind against. +Surfaced by V1 F1 regression coverage; existing +`test_pending_grounding_checks_for_ungrounded_decisions` regression +only asserted `len > 0`, missing the empty-ID bug. Read `decision_id` +first, fall back to `id` for forward compatibility. + +### Tests + +75 passed, 1 xfailed in 7.11s after V1. Zero regressions on the +v0.6.3/v0.6.4/0.6.4-bump rebase, and the SDK-2.0 idempotency-catch +issue I had originally pinned around (`surrealdb<2.0.0`) is fixed +properly upstream by `66796ef`, so V1 ships against +`surrealdb>=2.0.0` directly. + +### Deferred to V2 + +Captured in full in `docs/v2-desync-optimization-guide.md` (design target with +nine rounds of Codex review) and summarized in +`docs/v2-desync-optimization-guide.md` §4–§5: + +- A0 — atomic SurrealQL block primitive (Python SDK doesn't support + `begin_transaction()` in embedded mode). +- A2a — full sync barrier (sync-token CAS + region fingerprint at + commit time). +- C0 / C0a / C1 — schema migration v5→v6 (per-binding baseline + ownership, tombstone fields, append-only `compliance_verdict_history`, + full-CAS cache key, traversal filtering). +- C2 — `bicameral_judge_drift` + `record_compliance_verdict` with + five-field CAS (incl. binding-state token). +- C3 — `pending_compliance_checks` from `detect_drift` (cache-aware). +- B3 — `bicameral_advance_baseline` (only after fresh L3 `compliant` + verdict matching full CAS). +- D2 — `bicameral_rebind` with old-binding CAS; closes scenario 8. +- Migration of the `handlers/resolve_compliance.py` hard-delete + + `handlers/ingest.py` auto-chained `handle_judge_gaps` to the + tombstone + CAS contract — hard prerequisite before V2 destructive + work ships. + ## 0.7.0 — 2026-04-24 — Accountable North Star: proposal state + signoff schema Every decision now has provenance: who proposed it, who ratified it, in which session. diff --git a/PLAN.md b/PLAN.md index 7ed57287..e243a940 100644 --- a/PLAN.md +++ b/PLAN.md @@ -80,12 +80,27 @@ Code search is caller-owned: Claude Code / Cursor / etc. use their native Grep/R - [x] Zero active mocks - [x] Full E2E verified - [x] GitHub Actions CI replaces pre-push git hook +- [x] Performance: `search_decisions` < 2s, `detect_drift` < 1s on 100+ decisions — measured by V1 A1 (`tests/bench_drift.py`) at p95 = 10.4 ms / 15.5 ms, 55–185× under target ### Remaining -- [ ] Performance: `search_decisions` < 2s, `detect_drift` < 1s on repo with 100+ decisions -- [ ] LLM drift judge: wire `claude-haiku-4-5` for changed-region comparison in `detect_drift` +- [ ] LLM drift judge: wire `claude-haiku-4-5` for changed-region comparison in `detect_drift` (V2 — `docs/desync-optimization.md` §8 C2) - [ ] All 4 tools demoed live in Claude Code (MCP connected) +## Desync Optimization V1 — DONE (read-path advisory + measurement) + +Plan: `docs/desync-optimization-v1-plan.md`. V2 design target with full +Codex-review history: `docs/desync-optimization.md`. V1 introduces zero +new mutating capabilities. + +- [x] A1 — `tests/bench_drift.py` benchmark harness; `test-results/bench/drift_baseline.json` artifact +- [x] A2-light — `handlers/sync_middleware.repo_write_barrier(ctx)` (per-repo `asyncio.Lock`); `handle_bind` wrapped +- [x] A3 — `contracts.SyncMetrics` + `sync_metrics` field on Search/Preflight/History/Bind responses +- [x] B1 — `ledger/ast_diff.is_cosmetic_change(before, after, lang)` (strict tree-sitter whitelist, 21 tests) +- [x] B2 — `DriftEntry.cosmetic_hint` advisory + `_enrich_with_cosmetic_hints` helper in `handle_detect_drift` +- [x] D1 — `original_lines` field on `symbol_disappeared` grounding checks; **`_build_verification_instruction` split** so relocation cases never get the unsafe `bicameral.bind` CTA +- [x] F1 — `tests/test_desync_scenarios.py` canonical 13-scenario regression matrix (12 pass + 1 V2 xfail) +- [x] Incidental fix — empty `decision_id` on ungrounded `pending_grounding_checks` (`ledger/adapter.py:475`) + --- ## Mock → Real Swap Summary diff --git a/TODO.md b/TODO.md index 64ea3b10..b55d96bb 100644 --- a/TODO.md +++ b/TODO.md @@ -109,11 +109,59 @@ _Tracks actual implementation status in `pilot/mcp/`. Updated by Claude as work - [x] Zero active mocks - [x] Full E2E verified - [x] GitHub Actions CI (replaces pre-push hook) -- [ ] Performance benchmarks -- [ ] LLM drift judge +- [x] Performance benchmarks (V1 A1 — `tests/bench_drift.py`; baseline + 55–185× under V2 targets — see `docs/desync-optimization-v1-plan.md` §A1) +- [ ] LLM drift judge (V2 — see `docs/desync-optimization.md` §8 C2) + +### Desync Optimization V1 — DONE (read-path advisory + measurement) + +Plan: `docs/desync-optimization-v1-plan.md`. V2 design target: +`docs/desync-optimization.md`. V1 introduces zero new mutating paths. + +- [x] **A1** — drift benchmark harness (`tests/bench_drift.py`) +- [x] **A2-light** — per-repo `asyncio.Lock` for `handle_bind` + (`handlers/sync_middleware.repo_write_barrier`) +- [x] **A3** — sync-metrics instrumentation (`SyncMetrics` contract + + handler-side timing on search / preflight / history / bind) +- [x] **B1** — strict-whitelist tree-sitter cosmetic-change classifier + (`ledger/ast_diff.is_cosmetic_change`) +- [x] **B2** — `DriftEntry.cosmetic_hint` advisory metadata + (`handlers/detect_drift._enrich_with_cosmetic_hints`) +- [x] **D1** — `original_lines` enrichment on `symbol_disappeared` + grounding checks +- [x] **F1** — canonical 13-scenario regression matrix + (`tests/test_desync_scenarios.py`); scorecard 12 pass + 1 V2 xfail +- [x] **pass-12 follow-up** — `_build_verification_instruction` split + so `symbol_disappeared` cases get an explicit "do NOT call + bicameral.bind" warning instead of the v0.6.4 monolithic CTA + (Codex pass-10 #2 + pass-12 #2) +- [x] **incidental fix** — `ledger/adapter.py:475` was emitting empty + `decision_id` on ungrounded grounding checks; surfaced by F1 + +### V2 — Deferred (destructive-path overhaul) + +Tracked in full in `docs/desync-optimization.md` (nine rounds of Codex +review) and summarized in `docs/desync-optimization-v1-plan.md` §4–§5. +Hard prerequisite before V2 destructive work ships: migrate +`handlers/resolve_compliance.py` hard-delete and the +`handlers/ingest.py` auto-chained `handle_judge_gaps` to tombstone + +full-CAS semantics. + +- [ ] A0 — atomic SurrealQL block primitive +- [ ] A2a — full sync barrier (token CAS + region fingerprint at commit) +- [ ] C0 / C0a / C1 — schema v5→v6 migration + traversal filtering + + full-CAS cache key +- [ ] C2 — `bicameral_judge_drift` + `record_compliance_verdict` with + five-field CAS (incl. binding-state) +- [ ] C3 — cache-aware `pending_compliance_checks` from `detect_drift` +- [ ] B3 — `bicameral_advance_baseline` (only after L3 `compliant` + verdict) +- [ ] D2 — `bicameral_rebind` with old-binding CAS; closes scenario 8 --- ## Mock Registry -All mocks deleted. See `mocks/README.md` for history. +All mocks deleted. V1 introduces no new mocks (read-path advisory +only). See git history for the original Phase 1 / Phase 2 mock +replacements (`RealCodeLocatorAdapter`, `SurrealDBLedgerAdapter`). diff --git a/contracts.py b/contracts.py index 500dfbce..659476de 100644 --- a/contracts.py +++ b/contracts.py @@ -40,6 +40,18 @@ class SessionStartBanner(BaseModel): message: str +class SyncMetrics(BaseModel): + """V1 A3 instrumentation — wall-clock timings for sync + write-barrier. + + Populated by sync_middleware.ensure_ledger_synced (sync_catchup_ms) and + sync_middleware.repo_write_barrier (barrier_held_ms). Either field may + be ``None`` if that path did not run in the handler — e.g. ledger was + already synced, or the handler did not take the write barrier. + """ + sync_catchup_ms: float | None = None + barrier_held_ms: float | None = None + + class CodeRegionSummary(BaseModel): """Lean code region for MCP responses — no pipeline metadata.""" file_path: str @@ -211,6 +223,7 @@ class SearchDecisionsResponse(BaseModel): suggested_review: list[str] # decision_ids of drifted/pending to review first action_hints: list[ActionHint] = [] session_start_banner: SessionStartBanner | None = None + sync_metrics: SyncMetrics | None = None # V1 A3 — catch-up / barrier wall times # ── Tool 3: /detect_drift ──────────────────────────────────────────── @@ -226,6 +239,11 @@ class DriftEntry(BaseModel): source_ref: str source_excerpt: str = "" meeting_date: str = "" + # V1 B2 — advisory metadata for the eventual V2 caller-LLM verdict prompt. + # True only for drifted entries whose HEAD-vs-working-tree byte diff is + # provably semantics-preserving per ledger.ast_diff.is_cosmetic_change. + # NEVER gates drift surfacing or status; pure metadata. + cosmetic_hint: bool = False class DetectDriftResponse(BaseModel): @@ -473,6 +491,7 @@ class PreflightResponse(BaseModel): # v0.8.0 HITL annotations (topic-independent, ledger health) unresolved_collisions: list[BriefDecision] = [] # collision_pending from prior sessions context_pending_ready: list[BriefDecision] = [] # context_pending with ≥1 confirmed context_for + sync_metrics: SyncMetrics | None = None # V1 A3 — catch-up wall times # ── Tool 10: /bicameral_judge_gaps ─────────────────────────────────── @@ -624,6 +643,7 @@ class HistoryResponse(BaseModel): total_features: int = 0 as_of: str = "" # git ref evaluated against session_start_banner: SessionStartBanner | None = None + sync_metrics: SyncMetrics | None = None # V1 A3 — catch-up wall times # ── Tool 13: bicameral.dashboard ───────────────────────────────────── @@ -652,6 +672,7 @@ class BindResult(BaseModel): class BindResponse(BaseModel): """Response envelope for bicameral.bind.""" bindings: list[BindResult] + sync_metrics: SyncMetrics | None = None # V1 A3 — write-barrier hold time # Forward references diff --git a/docs/v2-desync-optimization-guide.md b/docs/v2-desync-optimization-guide.md new file mode 100644 index 00000000..27cc10d7 --- /dev/null +++ b/docs/v2-desync-optimization-guide.md @@ -0,0 +1,1083 @@ +# V2 Desync Optimization — Implementation Guide + +**Status**: Planning artifact. No V2 code yet. V1 has shipped on this branch. +**Audience**: Any engineer or agent picking up V2 implementation. +**Self-contained**: This doc replaces `docs/desync-optimization.md` (V2 design with 9 Codex review passes) and `docs/desync-optimization-v1-plan.md` (V1 plan with pass-12 fixes folded in). Read this and you have the full picture. +**Date written**: 2026-04-25 +**Branch**: `desync-optimization-v1` (V1 commits 3b4d0bb…8e226c5) +**Owner (V1)**: Silong +**Owner (V2)**: TBD — strongly recommend involving Jin (CODEOWNERS approval required, plus this work needs project-judgment review the adversarial Codex passes can't provide) + +--- + +## Table of Contents + +1. [Quick start](#1-quick-start) +2. [Background — what "desync" means here](#2-background) +3. [V1 outcomes — what's already shipped](#3-v1-outcomes) +4. [V2 scope — the gap and the goal](#4-v2-scope) +5. [Architecture target](#5-architecture-target) +6. [Implementation plan — phased, with hard dependencies](#6-implementation-plan) +7. [Constraints catalog — synthesized from 12 Codex review passes](#7-constraints-catalog) +8. [Open questions for human judgment](#8-open-questions) +9. [Acceptance criteria for V2](#9-acceptance-criteria-for-v2) +10. [References](#10-references) + +--- + +## 1. Quick start + +V2 is the **destructive-path overhaul** of bicameral-mcp's drift-detection system. V1 (already on this branch) shipped measurement infrastructure, read-path advisory hints, a 13-scenario regression matrix, and a single safety fence around `bind` — without touching destructive write paths. **V2 ships the actual semantic drift detection, atomic rebind, reversible verdicts, and per-binding baseline ownership.** + +If you're picking this up cold: + +- **Read §3 first** to understand the V1 baseline you're building on (what works, what's deliberately deferred, what bug fixes were incidental). +- **Read §7 second**. Every entry there came from a Codex review pass that found a real bug in an earlier draft — those constraints are the difference between V2 shipping safely and V2 introducing data-corruption regressions. +- **Read §6 third**. The phase order is a real DAG with hard dependencies; don't deviate. +- **Read §8 last** before starting code. Several decisions remain open and benefit from human judgment, not adversarial review. + +**Effort sense**: 7–10 engineer-weeks sequential, single owner. The phases don't parallelize cleanly because each one's correctness depends on the prior one's invariants. + +--- + +## 2. Background + +### 2.1 What "desync" means in this project + +bicameral-mcp tracks three independently-evolving timelines: + +| Timeline | Where it lives | Updates when | +|---|---|---| +| **Spec** | SurrealDB ledger (decisions, append-only) | User ingests a decision | +| **Index** | SQLite symbol DB | HEAD changes (rebuilt on mismatch) | +| **Code** | Git working tree / refs | Dev commits | + +Every edge case where the `decision → symbol → code_region` graph becomes stale, missing, or wrong is a "desync scenario." The canonical reference is the Notion page **"The Auto-Grounding Problem: Keeping Decisions Linked to Code"** (Notion ID `3332a51619c4813caccec86c36d9bf98`). It catalogs **13 numbered scenarios** with severity tiers. + +Supporting Notion docs: +- **"The Branch Problem"** (`3302a51619c48146b48dc675914beb6f`) — why content-hash anchoring beats SHA-anchoring; the content-hash is the stateless bridge between the spec lane and the code lane. +- **"CI Workflow Fixes — MCP Regression Pipeline (Apr 8)"** (`33c2a51619c48134ba8dc8bfaeb880dd`) — documents how scenarios #1 and #6 were false-negatives in tests because they bypassed `handle_ingest()` and called `ledger.ingest_payload()` directly. **Lesson: tests must route through the real handler layer.** + +### 2.2 The 13-scenario catalog + +Reproduced from Notion (severity tiers as of 2026-04-01, updated through V1): + +| # | Scenario | Severity | V1 status | +|---|---|---|---| +| 1 | New decision ingested, matching code exists | was P0 | ✅ caller-LLM bind flow | +| 2 | Code changed after decision was grounded | working | ✅ pending + `pending_compliance_check` | +| 3 | Code deleted after decision was grounded | working | ✅ symbol_disappeared | +| 4 | Symbol renamed (refactor) | P1 | ✅ symbol_disappeared with `original_lines` (V1 D1) | +| 5 | Symbol moved to different file | P1 | ✅ symbol_disappeared | +| 6 | Code index rebuilt with new symbols | was P0 | ✅ caller binds explicitly | +| 7 | Cold start: no code index | working | ✅ stays ungrounded | +| 8 | Drifted intent → recoverable via re-ground | P1 (V2) | ⏸ XFAIL (atomic rebind = V2 D2) | +| 9 | Intent description supersession | P2 | ✅ re-ingest succeeds | +| 10 | Multiple intents map to same symbol | working | ✅ both surface | +| 11 | BM25 false-positive grounding | post-v0.6.0: N/A | ✅ caller-LLM-driven | +| 12 | Code region line numbers shift (insertion above) | working | ✅ `resolve_symbol_lines` self-heals | +| 13 | `[Open Question]` prefix → gap classification | v0.5.x | ✅ ingested as gap | + +**Current scorecard: 12 PASS / 1 XFAIL.** Scenario 8 flips to PASS the moment V2's `bicameral_rebind` lands — the test is `@pytest.mark.xfail(strict=True)` so a `xpassed` result will be a CI-visible signal that V2 has implemented the missing piece. + +### 2.3 Scorecard trajectory + +| Date | Scorecard | Source | +|---|---|---| +| 2026-04-01 | 10/13 (77%) | Notion auto-grounding doc, original analysis | +| 2026-04-08 | 12/13 (92%) | CI Workflow Fixes — PR #84 routed tests through real handler layer | +| 2026-04-23 (v0.6.1) | G1 + G3 closed via sync_middleware | `CHANGELOG.md` | +| 2026-04-23 (v0.6.0) | server-side BM25 auto-grounding **removed** (–2317 LOC) | architectural shift to caller-LLM-driven retrieval | +| 2026-04-23 (v0.6.4) | `search_code` deleted | "caller-LLM owns all code retrieval" | +| 2026-04-25 (V1 done) | 12/13 PASS + 1 XFAIL on V2 | this branch | + +--- + +## 3. V1 outcomes + +V1 commits on this branch (`origin/main` is `a5aface`): + +```text +8e226c5 docs: tick V1 desync optimization across CHANGELOG / TODO / PLAN +a04e54b fix(link_commit): split verification_instruction so relocation cases don't get bind CTA +89f8076 feat: desync optimization V1 F1 — canonical 13-scenario regression matrix +54081e6 feat: desync optimization V1 D1 — original_lines on symbol-disappeared payload +401babc feat: desync optimization V1 Phase B — read-path cosmetic-change advisory +3b4d0bb feat: desync optimization V1 Phase A — measurement + light sync hardening +``` + +### 3.1 What V1 delivered + +V1 introduces **zero new mutating capabilities**. Every change is one of: read-only measurement, additive contract field, pure function, test coverage, or a surgical bug fix to an already-shipped path. + +| ID | Deliverable | Files | +|---|---|---| +| **A1** | Drift benchmark harness — seeds 100 decisions × 25 files, times search/drift/link_commit, writes JSON artifact, marked `@pytest.mark.bench` | `tests/bench_drift.py` | +| **A2-light** | Per-repo `asyncio.Lock` for `handle_bind`. In-process serialization only — does NOT protect `resolve_compliance` or cross-process writers | `handlers/sync_middleware.py::repo_write_barrier`, `handlers/bind.py` | +| **A3** | `SyncMetrics` (`sync_catchup_ms` / `barrier_held_ms`) attached to Search/Preflight/History/Bind responses. Each handler times its own sync call locally so nested calls don't step on each other's metrics | `contracts.py::SyncMetrics`, four handlers | +| **B1** | Strict-whitelist tree-sitter cosmetic-change classifier. Returns True ONLY for inter-token whitespace differences. Variable renames, comment edits, docstring changes, trailing commas, import reorders all return False | `ledger/ast_diff.py` | +| **B2** | `DriftEntry.cosmetic_hint` advisory metadata. Read-path only — never mutates `content_hash`, never gates drift surfacing | `contracts.py::DriftEntry`, `handlers/detect_drift.py::_enrich_with_cosmetic_hints` | +| **D1** | `original_lines` on `symbol_disappeared` grounding checks so caller LLM can `git show :` to inspect the symbol's prior position | `ledger/adapter.py:412-420` | +| **D1 follow-up** | `_build_verification_instruction` split — relocation cases get an explicit "do NOT call bicameral.bind" warning instead of the v0.6.4 monolithic bind CTA | `handlers/link_commit.py::_build_verification_instruction` | +| **F1** | Canonical 13-scenario regression matrix routed through real handler layer. Self-contained tmp-repo fixture per test | `tests/test_desync_scenarios.py` | +| **Bug fix (incidental)** | `pending_grounding_checks` for ungrounded decisions emitted empty `decision_id` because consumer read `d.get("id", "")` from rows aliased to `decision_id`. Surfaced by F1 | `ledger/adapter.py:475` | + +**Performance baseline (post-rebase, surrealdb 2.0.0, Apple Silicon):** + +| handler | p50 | p95 | max | +|---|---|---|---| +| search_decisions | 9.2ms | 10.4ms | 11.0ms | +| detect_drift | 14.2ms | 15.5ms | 16.4ms | +| link_commit (warm) | 7.3ms | 8.0ms | 8.3ms | + +All 50–185× under the V2 perf targets (`PLAN.md:83`: search < 2s, drift < 1s). + +### 3.2 What V1 explicitly did NOT do + +The recurring framing across Codex review passes was: "V1 is shippable while destructive paths exist." This is technically true and worth being explicit about. **V1 introduces zero new destructive paths.** Every mutating capability that V1 ships is either: + +- already present in main pre-V1 (e.g. `resolve_compliance` hard-delete from v0.5.0; `bicameral.bind` from v0.6.0; the auto-chained `handle_judge_gaps` from `handlers/ingest.py`), OR +- a surgical bug fix to an already-shipped path (the `decision_id` empty-string fix in `ledger/adapter.py:475`). + +Net destructive-surface change for V1: **zero (and arguably negative via D1's CTA removal).** + +### 3.3 Practical user-facing impact of V1 + +V1 is roughly **20–30% of the user-facing value of "actual desync optimization."** It's foundation + safety fences. The things that change what someone *experiences* using bicameral are mostly V2: + +- **`derive_status` still returns `pending` (not `drifted`)** for hash-divergent regions without a cached compliant verdict (`ledger/status.py:178-205`). The actual semantic "drifted" classification requires a caller-LLM verdict, which is V2's `bicameral_judge_drift`. So today, when a developer changes code, they get `pending`, not `drifted` — "we don't know yet" rather than a real verdict. +- **Rename recovery is informational only.** Caller can read `original_lines` from a `symbol_disappeared` payload but acting on it (calling `bicameral.bind`) creates duplicate-binding state. V1 actively warns them to wait for V2. +- **The destructive backdoor is still live.** `resolve_compliance` still hard-deletes `binds_to` edges on `not_relevant` verdicts; one bad async caller verdict can permanently remove a decision's only grounding edge with no recovery path. +- **Cross-decision baseline corruption is still possible.** When multiple decisions share a region, one decision's effects on shared state ripple to the others. + +V1's value is operational confidence + one footgun closed + one race narrowed + foundation for V2. V2 is where "drifted" becomes a real claim, where rename recovery becomes safe, and where the destructive backdoor closes. + +--- + +## 4. V2 scope + +### 4.1 Capability gap (V1 → V2) + +| # | Capability | Currently | V2 needs | +|---|---|---|---| +| 1 | Atomic multi-statement writes | `LedgerClient.execute_many` is sequential, no rollback. No transaction primitive in repo. | **A0**: SurrealQL `BEGIN/COMMIT TRANSACTION` blocks submitted as single `query()` calls. (Embedded SDK doesn't support `begin_transaction()` — verified empirically; see [SurrealDB Python SDK docs](https://surrealdb.com/docs/sdk/python/concepts/connecting-to-surrealdb).) | +| 2 | Existing destructive backdoor | `handlers/resolve_compliance.py:122` hard-deletes `binds_to` on `not_relevant`; `handlers/ingest.py:313-331` auto-chains into it via `handle_judge_gaps`. | Migrate to tombstone + full CAS **before** any new mutating tool ships. Codex pass-10 #1 — the **hard prerequisite**. | +| 3 | Per-binding baseline ownership | `code_region.content_hash` is shared across N decisions bound to the same region. One decision's verdict rewrites everyone's drift baseline. | **C0**: move `baseline_content_hash`, `baseline_commit_hash`, `binding_version` onto `binds_to` edges. `derive_status` rewritten per-binding. | +| 4 | Reversible verdict storage | `compliance_check` has `UNIQUE(decision_id, region_id, content_hash)` (`ledger/schema.py:163`). Contradicting later verdict overwrites the prior one — reversal physically impossible. | **C0**: append-only `compliance_verdict_history` table + `compliance_check` redefined as a current-state projection over the full 7-field CAS tuple. | +| 5 | Tombstone semantics on `binds_to` | No tombstone fields. Edge deletion is the only retirement mechanism. | **C0**: add `tombstoned_at`, `tombstone_reason`, `tombstone_verdict_id`. **C0a**: every `binds_to` traversal site filtered via shared `binds_to_active_filter()`. | +| 6 | Full-CAS cache key | `idx_cc_cache_key UNIQUE(decision_id, region_id, content_hash)` — replays old verdicts across reverts/branches/moves. | **C0**: replace with 7-field `(decision_id, region_id, content_hash, commit_hash, file_path, binding_version, tombstone_verdict_id)`. **Same** tuple referenced verbatim in schema + migration + cache lookup + write upsert. | +| 7 | Commit-time sync barrier | A2-light (V1) only catches in-process races. HEAD can change between sync and commit; working-tree edits don't move HEAD. | **A2a**: per-handler `SyncToken{head_sha, ...}` re-checked against `git rev-parse HEAD` immediately before COMMIT, plus per-region `RegionFingerprint{file_path, content_hash, binding_version, mtime, size}` re-verified at commit time. | +| 8 | LLM compliance verdict tool | `derive_status` returns `pending` (not `drifted`) when no verdict cached — V1 scenario 2 documents this. | **C2**: `bicameral_judge_drift` (caller-LLM) + `record_compliance_verdict` with five-field CAS token (code identity + binding state). Stale verdicts go to history with `stale_reason`, never mutate live state. | +| 9 | Cache-aware drift surfacing | `detect_drift` doesn't emit `pending_compliance_checks`. | **C3**: emit `pending_compliance_checks` for every hash-divergent region; cosmetic_hint is metadata only, never a gate. | +| 10 | Baseline advancement | `code_region.content_hash` updates only via `link_commit` sweep; no caller-driven advancement. | **B3**: `bicameral_advance_baseline(decision_id, region_id, cas_token, verdict_id)` — only accepts a fresh L3 `compliant` verdict matching all five CAS components. Writes to a single `binds_to` edge; never touches shared region state. No `ast_cosmetic` reason. | +| 11 | Atomic rebind | Rename → `symbol_disappeared` payload (V1 D1). Manual `bicameral.bind` would create duplicate-binding state under N:N `binds_to`. | **D2**: `bicameral_rebind` with `expected_old_binding_version` + `expected_old_tombstone_verdict_id` CAS, **two-phase** semantics (Codex pass-11 #2): create new as pending → fresh L3 verdict on new target → tombstone old. Closes scenario 8. | +| 12 | Doctor skill rendering | `.claude/skills/bicameral-doctor/SKILL.md` exists (211 lines) but contains zero `pending_grounding_checks` / `cosmetic_hint` / verdict-related prose. | Once V2 has safe atomic rebind, render the new payloads as advisory context with the (now-safe) bind flow for relocation cases. | +| 13 | Branch-aware drift report (GitHub #47) | No handler surfaces drift / ungrounded state across a `base_ref..head_ref` range. PR-time and pre-push consumers (#48, #49) have no signal source. | **Phase 6**: `handlers/scan_branch.py` — read-only branch-aware drift report. Reuses Phase 1–4 machinery (per-binding baseline + full-CAS hash comparison + symbol re-resolution + relocation surfacing). Zero new mutating capabilities. Closes #47. | + +### 4.2 V2 product targets + +After V2 ships, the user-visible improvements: + +- **"drifted" becomes a real claim.** A drifted status indicates a caller-LLM has reviewed the change and confirmed it diverges from the decision — not just "bytes are different and we don't know yet." +- **Rename/move recovery is safe.** `bicameral_rebind` retires the old edge and creates the new one in a single transaction with full CAS protection. +- **`resolve_compliance` no longer corrupts state on bad verdicts.** Tombstone + CAS means a stale `not_relevant` verdict is rejected (or recorded as stale-history-only) instead of silently deleting the only grounding edge. +- **Cross-decision baseline isolation.** Each decision-binding has its own baseline; one decision's `advance_baseline` doesn't ripple to peer decisions on the same region. +- **Reversible verdicts with full audit history.** Operators can see every verdict ever issued for a region, and a contradicting later verdict (e.g. operator restores a tombstone) is recorded in history rather than overwriting. +- **Scenario 8 flips from xfail to pass** — the canonical "drifted intent recoverable via re-ground" scenario actually works end-to-end. +- **Branch-aware drift report works** — `bicameral_scan_branch(base_ref, head_ref)` returns drift + ungrounded surfaces between two refs without writing to the ledger. Closes GitHub #47 and unblocks downstream consumers (#48 pre-push hook, #49 PR-comment Action) for follow-up issue-driven work. + +--- + +## 5. Architecture target + +### 5.1 Layer 1 / Layer 2 / Layer 3 model + +Drift detection has three layers, only Layer 1 is wired today: + +| Layer | Mechanism | Catches | V1 status | V2 status | +|---|---|---|---|---| +| **L1** | Content-hash comparison (`HashDriftAnalyzer`, `ledger/drift.py`) — syntactic identity | Any byte-level change | ✅ Shipped | unchanged | +| **L2** | AST pre-filter (tree-sitter strict whitelist via `ledger/ast_diff.is_cosmetic_change`) | Whitespace, blank lines | ✅ Shipped (V1 B1/B2) — **advisory only**, never gates L3 | unchanged; the `cosmetic_hint` field becomes input to L3 prompt rendering | +| **L3** | LLM compliance check (`claude-haiku-4-5` or similar) — "does code still satisfy intent?" | Semantic compliance vs noise | ❌ Not built | **V2 C2**: `bicameral_judge_drift` (caller-LLM) + `record_compliance_verdict` with 5-field CAS | + +L1 alone produces noise on every rename/format change. L2 narrows the noise but cannot prove semantic equivalence. L3 is the only judge that can — and the entire V2 story is about making L3 verdicts authoritative, reversible, and auditable. + +### 5.2 Per-binding state ownership (the pass-8 redesign) + +**The bug**: V1 keeps baseline state on shared `code_region.content_hash`. But `binds_to` is N:N — multiple decisions can bind to the same `code_region`. With baseline state on the shared region, one decision's `advance_baseline` would silently rewrite the drift baseline for every other decision bound to the same region; a region-version bump would invalidate other decisions' caches without authorization. **Cross-decision correctness bug.** + +**The fix**: move baseline ownership off shared `code_region` and onto the per-binding `binds_to` edge. + +```sql +-- V2 schema additions to binds_to +DEFINE FIELD baseline_content_hash ON binds_to TYPE string; +DEFINE FIELD baseline_commit_hash ON binds_to TYPE string DEFAULT ''; +DEFINE FIELD binding_version ON binds_to TYPE int DEFAULT 1; + +-- Tombstone fields (separate concern, but same edge) +DEFINE FIELD tombstoned_at ON binds_to TYPE datetime | NONE; +DEFINE FIELD tombstone_reason ON binds_to TYPE string DEFAULT ''; +DEFINE FIELD tombstone_verdict_id ON binds_to TYPE string DEFAULT ''; +``` + +`code_region` keeps **only location data** (`file_path`, `symbol_name`, `start_line_snapshot`, `end_line_snapshot`). Line snapshots are advisory hints, not source of truth — `derive_status` always re-resolves the symbol via `resolve_symbol_lines(file_path, symbol_name)` (`ledger/status.py:21-89`) before hashing. **Region identity is the symbol, not the line range.** + +**`derive_status` rewritten** to compare live hash against `binds_to.baseline_content_hash` per-binding instead of against shared `code_region.content_hash`. + +### 5.3 Full CAS contract — five-field token + +Every mutating tool that takes a caller-LLM verdict requires a `cas_token`: + +```python +{ + "expected_content_hash": str, # bytes the caller judged + "expected_commit_hash": str, # commit at judgment time + "expected_file_path": str, # path at judgment time + "expected_binding_version": int, # binds_to edge version + "expected_tombstone_verdict_id": str, # '' for live edges +} +``` + +`record_compliance_verdict`, `bicameral_advance_baseline`, and `bicameral_rebind` all CAS-check **all five fields** before any mutation. Mismatch → record verdict in `compliance_verdict_history` with `stale=true, stale_reason='_mismatch'` and **do not** mutate live state. Each component catches a distinct desync class: + +- `content_hash` mismatch → bytes changed under the caller +- `commit_hash` mismatch → HEAD moved (branch switch, revert, new commit) +- `file_path` mismatch → region was relocated since judgment +- `binding_version` mismatch → this binding was rebaselined or replaced +- `tombstone_verdict_id` mismatch → operator restored / re-tombstoned the binding + +### 5.4 Tombstone semantics on `binds_to` + +`not_relevant` verdicts (from `bicameral_judge_drift` or the existing `resolve_compliance` flow) **do not hard-delete** the edge. Instead: + +- Set `tombstoned_at = time::now()`, `tombstone_reason = ':'`, `tombstone_verdict_id = `. +- Edge is excluded from drift / status walks via shared `binds_to_active_filter()` helper used by every traversal site. +- `bicameral_restore_binding(decision_id, region_id, expected_tombstone_verdict_id)` lifts the tombstone — auditable via a synthetic history row. +- Hard-delete is **not** part of V2. A separate scheduled GC handler can purge tombstones older than N days with no contradicting verdict (deferred to V3 or operator config). + +### 5.5 Append-only `compliance_verdict_history` + +```sql +DEFINE TABLE compliance_verdict_history SCHEMAFULL; +DEFINE FIELD decision_id ON compliance_verdict_history TYPE string; +DEFINE FIELD region_id ON compliance_verdict_history TYPE string; +DEFINE FIELD verdict ON compliance_verdict_history TYPE string; -- compliant | drifted | not_relevant | restored +DEFINE FIELD confidence ON compliance_verdict_history TYPE string; +DEFINE FIELD explanation ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD agent_id ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD stale ON compliance_verdict_history TYPE bool DEFAULT false; +DEFINE FIELD stale_reason ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD recorded_at ON compliance_verdict_history TYPE datetime DEFAULT time::now(); +-- Full CAS captured on every row +DEFINE FIELD expected_content_hash ON compliance_verdict_history TYPE string; +DEFINE FIELD expected_commit_hash ON compliance_verdict_history TYPE string; +DEFINE FIELD expected_file_path ON compliance_verdict_history TYPE string; +DEFINE FIELD expected_binding_version ON compliance_verdict_history TYPE int | NONE; +DEFINE FIELD expected_tombstone_verdict_id ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD actual_content_hash ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD actual_commit_hash ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD actual_file_path ON compliance_verdict_history TYPE string DEFAULT ''; +DEFINE FIELD actual_binding_version ON compliance_verdict_history TYPE int | NONE; +DEFINE FIELD actual_tombstone_verdict_id ON compliance_verdict_history TYPE string DEFAULT ''; +-- No uniqueness — same code shape can hold an unbounded sequence of verdicts +DEFINE INDEX idx_cvh_lookup ON compliance_verdict_history + FIELDS decision_id, region_id, expected_content_hash, expected_commit_hash, recorded_at DESC; +DEFINE INDEX idx_cvh_audit ON compliance_verdict_history FIELDS decision_id; +DEFINE INDEX idx_cvh_stale ON compliance_verdict_history FIELDS stale; +``` + +`compliance_check` is **redefined as a current-state projection** over the full 7-field CAS tuple, kept in sync from `compliance_verdict_history`: + +```sql +DEFINE INDEX idx_cc_cache_key ON compliance_check + FIELDS decision_id, region_id, content_hash, commit_hash, file_path, binding_version, tombstone_verdict_id + UNIQUE; +``` + +**Same content hash at a different commit / path / binding_version / tombstone state produces a different projection row, not an overwrite.** This is the single source of truth — schema, migration, cache lookup, and write upsert all key on this exact tuple. + +**Migration strategy** (Codex pass-6 #3): legacy `compliance_check` rows lack the new CAS columns. **Do not backfill from current state** — that fabricates history. Instead: + +1. Read every legacy `compliance_check` row. +2. Insert each into `compliance_verdict_history` with `stale=true, stale_reason='legacy_pre_v6_no_cas_metadata', expected_binding_version=NULL, expected_tombstone_verdict_id=NULL, expected_file_path=NULL`. The verdict text is preserved for audit. +3. Drop and recreate `compliance_check` empty with the new index. +4. Cache lookups against the empty projection always miss → every previously-cached region gets fresh L3 on its next `detect_drift` call. Cost is bounded; benefit is no false cache hits. + +### 5.6 Two-phase atomic rebind (pass-11 + pass-13 fixes) + +Pass-11 finding: a single-transaction "create new + tombstone old" rebind retires the authoritative binding before the new target has been semantically proven. A wrong candidate selection silently reattaches the decision to unrelated code. → fixed by splitting rebind into two phases (this section). + +**Pass-13 finding**: a naive two-phase rebind that only carries the *new* binding's CAS token in phase 2 still has a misattachment bug. If the caller does multiple phase-1 attempts on the same `old_region_id` (candidate A, then candidate B), and a stale phase-2 compliant verdict for candidate A arrives after the caller has moved on to B, the server would tombstone the old edge based on a verdict the caller no longer endorses. → fixed by binding phase 2 to a specific `rebind_attempt_id` and enforcing single-pending-rebind-per-old-binding (this section, below). + +#### Schema additions for D2 + +```sql +-- On binds_to: +DEFINE FIELD pending_rebind_attempt_id ON binds_to TYPE string DEFAULT ''; +DEFINE FIELD rebind_attempt_id ON binds_to TYPE string DEFAULT ''; +DEFINE FIELD pending_verification ON binds_to TYPE bool DEFAULT false; + +-- New table: rebind_audit +DEFINE TABLE rebind_audit SCHEMAFULL; +DEFINE FIELD attempt_id ON rebind_audit TYPE string; -- UUID, immutable, primary phase 2 token +DEFINE FIELD decision_id ON rebind_audit TYPE string; +DEFINE FIELD old_region_id ON rebind_audit TYPE string; +DEFINE FIELD new_region_id ON rebind_audit TYPE string; +DEFINE FIELD old_binding_version_at_attempt ON rebind_audit TYPE int; -- snapshot for phase-2 CAS +DEFINE FIELD old_tombstone_verdict_id_at_attempt ON rebind_audit TYPE string; +DEFINE FIELD reason ON rebind_audit TYPE string; +DEFINE FIELD agent_id ON rebind_audit TYPE string; +DEFINE FIELD recorded_at ON rebind_audit TYPE datetime DEFAULT time::now(); +DEFINE FIELD expires_at ON rebind_audit TYPE datetime; -- recorded_at + REBIND_LEASE_TTL +DEFINE FIELD outcome ON rebind_audit TYPE string DEFAULT 'pending'; + -- pending | committed | superseded | abandoned | abandoned_by_expiry +DEFINE INDEX idx_rebind_attempt ON rebind_audit FIELDS attempt_id UNIQUE; +DEFINE INDEX idx_rebind_pending ON rebind_audit FIELDS old_region_id, outcome; +DEFINE INDEX idx_rebind_expiry ON rebind_audit FIELDS outcome, expires_at; +``` + +`REBIND_LEASE_TTL` is configurable via `BICAMERAL_REBIND_LEASE_SECONDS` (default 86400 — 24 hours). Long enough that a careful caller-LLM can take its time on the L3 review; short enough that a crashed caller doesn't wedge a binding indefinitely. + +#### Protocol + +```text +Phase 1 — bicameral_rebind(decision_id, old_region_id, new_location | new_region_id, + reason, agent_id, + expected_old_binding_version, + expected_old_tombstone_verdict_id, + force_supersede: bool = false) + Under repo_write_barrier + A2a + atomic transaction (A0): + 1. Re-read old binding state. CAS-check expected_old_*. Mismatch → abort. + 2. Lease-expiry sweep (cheap, runs every phase 1): + - Read existing pending attempt: rebind_audit row where + attempt_id == old_binding.pending_rebind_attempt_id AND outcome == 'pending'. + - If row exists AND row.expires_at < now(): in this same transaction + abandon it (set outcome='abandoned_by_expiry', tombstone the orphan + new binding with tombstone_reason='rebind:expired', clear + old_binding.pending_rebind_attempt_id). Treat the lock as released. + 3. Lock check: + - If old_binding.pending_rebind_attempt_id == '' (post-sweep): proceed. + - Else if force_supersede == true: abandon the existing attempt + (outcome='superseded', tombstone orphan new binding with + tombstone_reason='rebind:superseded'); proceed. + - Else: abort with rebind_already_pending and return the existing + attempt_id + its expires_at so the caller can either wait, retry + with force_supersede=true, or call bicameral_abandon_rebind. + 4. Generate a fresh attempt_id (UUID). + 5. Insert rebind_audit row with outcome='pending', snapshot + old_binding_version_at_attempt and old_tombstone_verdict_id_at_attempt, + set expires_at = now() + REBIND_LEASE_TTL. + 6. Bump old binding's binding_version (invalidates in-flight verdicts on + the old edge). Set old_binding.pending_rebind_attempt_id = attempt_id. + 7. Resolve new code_region. If the new binding edge already exists from + a prior tombstoned rebind: bump its binding_version and clear its + tombstone fields. Otherwise create binds_to(decision → new_region) + with binding_version=1. + 8. Mark the new binding pending_verification=true, + rebind_attempt_id=attempt_id; initialize baseline_* from a live + region read. + 9. Return (new binding's full 5-field CAS token, attempt_id, audit_id, + expires_at). + +Phase 2 — record_compliance_verdict(decision_id, region_id=new_region_id, + cas_token, verdict, ...) + Under repo_write_barrier + A2a + atomic transaction (A0): + 1. Re-read NEW binding state. CAS-check the 5-field cas_token. + Mismatch → record stale-history-only, no mutation. + 2. If new binding has pending_verification=true (it's part of a rebind): + a. Look up rebind_audit by new_binding.rebind_attempt_id. + b. Lease check: if rebind_audit.outcome != 'pending' OR + rebind_audit.expires_at < now(): record verdict in + compliance_verdict_history with stale=true, + stale_reason='rebind_attempt_expired' (or '_superseded' / '_abandoned' + per outcome). Do NOT tombstone old binding. Do NOT touch projection. + c. Re-read OLD binding state. Verify + old_binding.pending_rebind_attempt_id == new_binding.rebind_attempt_id. + Mismatch → record verdict with stale=true, + stale_reason='rebind_attempt_superseded'. + Do NOT tombstone old binding. Do NOT touch projection. + d. Verify old_binding.binding_version == + rebind_audit.old_binding_version_at_attempt AND + old_binding.tombstone_verdict_id == + rebind_audit.old_tombstone_verdict_id_at_attempt. + Mismatch → same stale-history-only path. + e. If verdict == 'compliant': in the SAME transaction, set + new_binding.pending_verification=false, clear new_binding.rebind_attempt_id, + tombstone old_binding (set tombstoned_at, tombstone_reason='rebind:', + tombstone_verdict_id=), clear + old_binding.pending_rebind_attempt_id, set rebind_audit.outcome='committed'. + f. If verdict in {'drifted','not_relevant'}: new binding stays + pending_verification=true; old binding stays live with its lock. + Caller may retry phase-1 with a different candidate via either + (i) bicameral_abandon_rebind(attempt_id, ...), or + (ii) a fresh bicameral_rebind with force_supersede=true, or + (iii) waiting for the lease to expire (the next phase-1 attempt + will sweep it). All three paths converge on outcome='abandoned' + / 'superseded' / 'abandoned_by_expiry' on the audit row. + 3. Then proceed with the standard verdict-write algorithm in §5.5. + +bicameral_abandon_rebind(attempt_id, expected_old_binding_version, + expected_old_tombstone_verdict_id) — caller-driven + abandon. CAS-check the old binding under barrier. Set + rebind_audit.outcome='abandoned', clear old_binding.pending_rebind_attempt_id, + tombstone the orphan new binding with tombstone_reason='rebind:abandoned'. +``` + +This means: + +- **`bicameral_rebind` alone never retires the old edge.** Old-edge tombstoning is gated on a fresh `compliant` verdict whose `rebind_attempt_id` matches the lock currently held on the old binding. +- **At most one pending rebind per old binding.** Subsequent phase-1 attempts on the same `old_region_id` return `rebind_already_pending` until the prior attempt is committed, abandoned, or superseded — except when the caller passes `force_supersede=true` to abandon-and-replace atomically. +- **Stale phase-2 verdicts are rejected.** A compliant verdict on an abandoned/superseded/expired attempt fails the lease check or the `pending_rebind_attempt_id == rebind_attempt_id` check and is recorded with the appropriate `stale_reason`. The old binding is NOT tombstoned. +- **No deadlock under client crash** (pass-14 #2): every pending attempt has an `expires_at` deadline (default 24h via `BICAMERAL_REBIND_LEASE_SECONDS`). The next `bicameral_rebind` call against the same `old_region_id` runs an expiry sweep that atomically abandons stale leases before issuing a new attempt, so a crashed or abandoned caller cannot wedge a binding indefinitely. An optional background sweep (`bicameral.maintenance` or a cron) can also clear expired leases proactively, but is not required for liveness — the on-demand sweep guarantees forward progress. + +### 5.7 Sync barrier with commit-time CAS (the pass-4 / pass-5 design) + +V1 A2-light only catches in-process races on `bind`. V2 needs three complementary mechanisms, all required: + +1. **Per-repo `asyncio.Lock`** (already shipped in V1 as `repo_write_barrier`) — in-process serialization. Wrap every code-shape mutator with this. +2. **Sync token CAS at commit time** — `require_ledger_synced(ctx)` returns `SyncToken{head_sha, sync_at, ledger_version}`. Every ledger write takes the token. Just before COMMIT, re-read `git rev-parse HEAD` and verify it equals `token.head_sha`. Mismatch → abort with `head_changed_mid_handler`. Catches out-of-process HEAD changes. +3. **Per-region CAS at commit time** — for handlers that read code shape (rebind, advance_baseline, record_compliance_verdict), snapshot `RegionFingerprint{region_id, file_path, symbol_name, resolved_start_line, resolved_end_line, resolved_content_hash, binding_version, file_mtime, file_size}` at sync time, re-verify at commit time. Catches working-tree races (uncommitted edits) and file-move races where HEAD didn't move. + +### 5.8 Spec writes vs code-shape writes (pass-6 #1) + +**Two classes of mutators with different correctness requirements**: + +- **Spec writes (append-only, do NOT gate on sync failure)**: `handlers/ingest.py`, `handlers/ratify.py`. These persist user *intent*. Today's `ingest.py:283,290` does write-first then best-effort `link_commit`. Preserve that. Gating ingest on git failure would lose decisions — a higher-cost failure mode than today's desync. +- **Code-shape writes (DO gate fail-closed)**: `handlers/bind.py`, `handlers/resolve_compliance.py`, plus the new `bicameral_rebind`, `bicameral_advance_baseline`, `record_compliance_verdict`. These mutate state derived from current code; stale views encode wrong facts. + +`require_ledger_synced(ctx)` returns `SyncResult(ok, head_sha, error)` — does **not** swallow exceptions. Code-shape handlers abort on `ok=False` with a structured `degraded_sync` error. Spec handlers retain best-effort sync with a `sync_degraded` warning flag in the response. + +--- + +## 6. Implementation plan + +```text +┌─ Phase 0 (Prereq) ──────────────────────────────────┐ +│ 0a Migrate resolve_compliance.py → tombstone+CAS │ ← absolute prerequisite +│ (no new mutating tools until this is done) │ (Codex pass-10 #1) +│ 0b A0: Atomic SurrealQL block primitive │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 1 (Schema) ──▼───────────────────────────────┐ +│ C0 v5→v6 migration: per-binding baseline, │ +│ tombstone fields, compliance_verdict_history, │ +│ full-CAS cache key │ +│ C0a Traversal filtering across all binds_to │ +│ consumers via binds_to_active_filter() │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 2 (Barrier) ─▼───────────────────────────────┐ +│ A2a SyncToken CAS + RegionFingerprint at commit │ +│ Apply to every code-shape mutator │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 3 (Reads) ───▼───────────────────────────────┐ +│ C1 Cache lookup with full CAS │ +│ C3 pending_compliance_checks from detect_drift │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 4 (Writes) ──▼───────────────────────────────┐ +│ C2 bicameral_judge_drift + record_compliance │ +│ (5-field CAS, stale-verdict history) │ +│ B3 bicameral_advance_baseline │ +│ (only L3 compliant verdicts; per-binding; │ +│ no ast_cosmetic) │ +│ D2 bicameral_rebind (two-phase, pass-11 fix) │ +│ (old-binding CAS + L3 verdict on new target │ +│ before old is tombstoned) │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 5 (Polish) ──▼───────────────────────────────┐ +│ .claude/skills/bicameral-doctor/SKILL.md rendering │ +│ Re-run Codex review (target: pass-13 ships clean) │ +│ Convert scenario 8 from xfail → expected pass │ +└─────────────────────┬───────────────────────────────┘ + │ +┌─ Phase 6 (Surface) ─▼───────────────────────────────┐ +│ #47 bicameral_scan_branch — read-only branch-aware │ +│ drift report (closes GitHub #47 fully). │ +│ Reuses Phase 1–4 machinery; ships zero new │ +│ mutating capabilities. │ +└─────────────────────────────────────────────────────┘ +``` + +### Phase 0 — Hard prerequisites (1–2 weeks) + +**0a. Migrate `handlers/resolve_compliance.py` from hard-delete to tombstone** + +Files: `handlers/resolve_compliance.py`, `ledger/queries.py` (or `ledger/adapter.py`), tests. + +Current behavior: `not_relevant` verdict → `delete_binds_to_edge(client, decision_id, region_id)` (line 122). One bad async caller verdict permanently removes the only grounding edge with no recovery path. + +Target behavior: `not_relevant` verdict → set tombstone fields on the edge with `tombstone_reason = 'judge_gaps:not_relevant'`. Surface a `bicameral_restore_binding` tool to lift tombstones. + +This is the single highest-leverage move. It closes the highest-impact destructive write path before any new tool inherits the same backdoor. **Do this even before A0** — single `UPDATE` to set tombstone fields atomically replaces the `DELETE`, no transaction primitive required yet. + +Tests: extend `tests/test_resolve_compliance.py` to assert tombstone state instead of edge deletion. Add `tests/test_restore_binding.py` for the new tool. `tests/test_desync_scenarios.py` still passes because it doesn't exercise this path. + +**0b. A0 — Atomic SurrealQL block primitive** + +File: `ledger/client.py`. + +Background: `LedgerClient.execute_many` (lines 117-122) is sequential. Embedded SurrealKV doesn't support `begin_transaction()` via the Python SDK ([source](https://surrealdb.com/docs/sdk/python/concepts/connecting-to-surrealdb)). + +**The chosen mechanism (pass-14 #1 — pick is committed in this guide; do not defer):** + +Add `LedgerClient.transaction()` — async context manager that submits a `BEGIN TRANSACTION; ; ; ...; COMMIT TRANSACTION;` SurrealQL block as a single `query()` call. Parse the per-statement results; if any statement returns `status: ERR`, append `CANCEL TRANSACTION` to the block (or rely on SurrealDB's automatic cancellation on per-statement error) and raise `LedgerError` carrying the failed statement and index. + +```python +# ledger/client.py target shape +@asynccontextmanager +async def transaction(self) -> AsyncIterator["TransactionBuffer"]: + """Buffer SurrealQL statements; submit them as one atomic block on exit.""" + buf = TransactionBuffer() + yield buf + if not buf.statements: + return + block = "BEGIN TRANSACTION;\n" + ";\n".join(buf.statements) + ";\nCOMMIT TRANSACTION;" + result = await self._db.query(block, buf.vars) + # SurrealDB returns one result element per statement; on per-statement error, + # the COMMIT auto-cancels and earlier statements roll back. + for i, stmt in enumerate(result): + if isinstance(stmt, str): # error string from SurrealDB + raise LedgerError(f"transaction statement {i} failed: {stmt[:300]}") +``` + +**Why this and not the `LET`-chain alternative**: `BEGIN/COMMIT TRANSACTION` is the documented atomicity primitive in SurrealQL ([SurrealQL Transactions](https://surrealdb.com/docs/surrealql/transactions)) and matches the semantic shape every V2 mutation needs (history insert + binds_to update + projection upsert). Single-statement `LET`-chains can express the same writes but constrain query shape and cannot use procedural control flow if a future mutation needs it. The chosen mechanism is more general; the gate test below verifies it actually works in our deployment mode. + +**Day-1 gate test — `tests/test_a0_atomic_transaction.py`** (this is a hard ship-blocker for Phase 0b): + +```python +async def test_transaction_rolls_back_on_failure(real_ledger_client): + """Force the second statement to fail and assert the first is rolled back.""" + client = real_ledger_client + await client.execute("DEFINE TABLE a0_canary SCHEMAFULL") + await client.execute( + "DEFINE FIELD name ON a0_canary TYPE string ASSERT $value != 'forbidden'" + ) + with pytest.raises(LedgerError): + async with client.transaction() as txn: + txn.execute("CREATE a0_canary:1 SET name = 'allowed'") + txn.execute("CREATE a0_canary:2 SET name = 'forbidden'") # ASSERT fails + rows = await client.query("SELECT * FROM a0_canary") + assert rows == [], ( + "Embedded SurrealKV did NOT honor BEGIN/COMMIT TRANSACTION rollback. " + "V2 cannot ship as designed; see fallback path in §6 Phase 0b." + ) +``` + +If this test fails (i.e. embedded SurrealKV silently ignores `BEGIN/COMMIT` and `a0_canary:1` survives the rollback), V2 cannot ship as designed against embedded mode. **Fallback path** in priority order: + +1. Switch every V2 multi-step mutation to a single `LET`-chained SurrealQL statement (`LET $h = (CREATE compliance_verdict_history ...); LET $b = (UPDATE binds_to:... SET ...); UPDATE compliance_check:... SET ...`). Single-statement is implicitly atomic in SurrealKV. Acceptable correctness; constrained query shape. +2. Move the ledger from `surrealkv://` (embedded) to a network `ws://` SurrealDB process where `begin_transaction()` is supported by the Python SDK. Largest deployment delta but cleanest API. +3. V2 doesn't ship until SurrealDB embedded gains transaction support (out-of-our-hands timeline; not a real option). + +The choice between fallbacks (1) vs (2) is a Jin-tier decision; do not pick without him weighing in. **Until the gate test passes (or the fallback is committed), no Phase 1+ work begins.** + +Acceptance: gate test above passes against the embedded ledger configuration we ship with. Plus a forced-failure correctness test per V2 mutation (rebind, verdict-write, baseline-advance) — every multi-step mutation, when the second statement is forced to fail, leaves zero side effects. + +### Phase 1 — Schema (1–2 weeks) + +**C0. Migration v5→v6** + +File: `ledger/schema.py`. Add a new `_migrate_v5_to_v6` function and bump `_TARGET_SCHEMA_VERSION`. + +Schema additions: +- `binds_to`: `tombstoned_at`, `tombstone_reason`, `tombstone_verdict_id`, `baseline_content_hash`, `baseline_commit_hash`, `binding_version`. +- `code_region`: `symbol_name` (qualified, e.g. `module.Class.method`); rename `start_line` / `end_line` semantics to "snapshot/hint" (no schema change, just docstring). +- New table `compliance_verdict_history` (see §5.5). +- Replace `idx_cc_cache_key` with the new 7-field unique index. + +Migration behavior (Codex pass-6 #3): legacy `compliance_check` rows go to history with `stale=true, stale_reason='legacy_pre_v6_no_cas_metadata'`. Drop and recreate the projection table empty. + +`derive_status` (`ledger/status.py:178-205`) rewritten to read per-binding `baseline_content_hash` from `binds_to` instead of shared `code_region.content_hash`. + +**C0a. Traversal filtering** + +Touch every `binds_to` consumer site. `grep -rln "binds_to" handlers/ ledger/` is the worklist: + +- `ledger/queries.py` — central query helpers (highest leverage; if filtered here, callers inherit it). +- `ledger/adapter.py` — direct graph walks (`get_decisions_for_file`, `get_regions_for_decision`, etc.). +- `handlers/bind.py` — idempotency check ("is this binding already present?") must consider tombstoned edges as "absent" so re-binding clears the tombstone instead of erroring on the unique index. +- `handlers/resolve_compliance.py` — verdict write path (already updated in Phase 0a). +- `handlers/history.py` — audit views: should show tombstoned edges *with* their tombstone metadata, not hide them. +- `ledger/schema.py` — schema definition only; no query changes. + +Add a single helper `binds_to_active_filter()` returning the SurrealQL clause `tombstoned_at IS NONE`. Use it in every traversal site **except `history`** (which should surface tombstones with metadata). + +Acceptance test: insert a tombstoned `binds_to` edge and assert: +- `detect_drift` does not surface it. +- `decision_status` projection does not count it. +- `search_decisions` graph walk does not return it. +- `bind` treating it as "absent" succeeds (clears tombstone) without violating the unique index. +- `history` *does* return it with tombstone metadata visible. + +### Phase 2 — Sync barrier (1 week) + +**A2a. Commit-time barrier** + +File: `handlers/sync_middleware.py`. Extends V1's `repo_write_barrier`. + +Two new pieces: + +1. `SyncToken{head_sha, sync_at, ledger_version}` returned from a new `require_ledger_synced(ctx) -> SyncResult` (fail-closed for code-shape handlers; fail-open `ensure_ledger_synced` stays for read paths). +2. `RegionFingerprint{region_id, file_path, symbol_name, resolved_start_line, resolved_end_line, resolved_content_hash, binding_version, file_mtime, file_size}` snapshotted at sync time, re-verified at commit time. + +Wire into: +- `handle_bind` (already wraps `repo_write_barrier`; add token + fingerprint check) +- `handle_resolve_compliance` (gate fail-closed; was V2 prereq for Phase 0a but full barrier lands here) +- New mutators when they're written in Phase 4 + +Tests cover: in-process race, out-of-process HEAD race, working-tree race (HEAD stable, file edited), file-move race (path changes without HEAD move), read-path-unaffected. + +### Phase 3 — Read-path with cache (~1 week) + +**C1. Cache lookup with full CAS** + +File: `ledger/status.py`. Update `derive_status` (and any other compliance-cache reader) to query the projection by all 7 CAS fields, not just `(decision_id, region_id, content_hash)`. + +For each binding in scope: +1. Re-resolve symbol via `resolve_symbol_lines(file_path, symbol_name)` to find current span. +2. Compute live `content_hash` over resolved bytes. +3. Read live `commit_hash` (current HEAD), `file_path`, `binding_version`, `tombstone_verdict_id`. +4. Query the projection (or history fallback per C0). Hit returns cached verdict; miss emits to `pending_compliance_checks`. + +Acceptance: +- Identical state → cached verdict, no L3. +- Same content hash at different commit → cache miss → fresh L3 dispatched. +- Same content hash at different `binding_version` → cache miss (test: this binding's `advance_baseline` or rebind bumped version → next call misses; **other decisions bound to same region keep their cache hits**). +- Same content hash at different `tombstone_verdict_id` → cache miss. +- Line-shift no-op (insert blank lines above bound symbol) → cache hit, no spurious drift. + +**C3. `pending_compliance_checks` from `detect_drift`** + +File: `handlers/detect_drift.py`. For every region where stored `content_hash` ≠ live `content_hash`, append to `pending_compliance_checks`. The B1 AST classifier sets `cosmetic_hint: true` as metadata only — **does not** gate L3 dispatch (Codex pass-5 #3). + +Per-entry payload: `{decision_id, region_id, cas_token, cosmetic_hint, diff_summary}`. + +Cache short-circuit: before emitting, run the C1 cache lookup with the full CAS tuple — only emit if no cached verdict exists. + +### Phase 4 — Mutating tools (2–3 weeks) + +**C2. `bicameral_judge_drift` + `record_compliance_verdict`** + +New MCP tool: returns `{decision_text, code_before, code_after, diff, cas_token: {expected_content_hash, expected_commit_hash, expected_file_path, expected_binding_version, expected_tombstone_verdict_id}}` for caller LLM. Mirrors the existing `bicameral_judge_gaps` pattern. + +`record_compliance_verdict(decision_id, region_id, cas_token, verdict, confidence, explanation, agent_id)` handler: + +1. Read current state under A2 + A2a barrier: fetch `actual_*` for all 5 CAS fields. +2. CAS check: if any `expected_*` ≠ `actual_*`, the verdict is stale. + - Stale path: insert into `compliance_verdict_history` with `stale=true, stale_reason='_mismatch'`. Return `{stale_verdict: true, mismatched_fields: [...], current_cas_token: {...}}`. **Do not** mutate live state. **Do not** touch the projection. +3. Fresh path — **all writes happen in a single atomic SurrealQL transaction (A0), keyed on POST-MUTATION state** (pass-13 fix): + 1. **Compute the post-mutation tombstone identity first** — *before* any write — so the projection key matches live state at commit time: + - `compliant` / `drifted` → `post_tombstone_verdict_id = ''` (any prior tombstone is cleared by this verdict). + - `not_relevant` → `post_tombstone_verdict_id = ` (the row about to be inserted). + 2. Open transaction (A0). Inside the same transaction: + a. `INSERT` into `compliance_verdict_history` (returns ``). + b. `UPDATE binds_to` to apply the post-mutation tombstone state — set `tombstoned_at`, `tombstone_reason`, `tombstone_verdict_id = post_tombstone_verdict_id` for `not_relevant`; clear those fields for `compliant`/`drifted`. + c. `UPSERT compliance_check` projection keyed on the **post-mutation** 7-tuple `(decision_id, region_id, content_hash, commit_hash, file_path, binding_version, post_tombstone_verdict_id)`. The same `binding_version` is preserved; only `tombstone_verdict_id` changes between verdicts. + 3. Commit the transaction. If any step fails the entire write is rolled back (no orphaned history row, no half-tombstoned binding, no stale projection). + +**Why post-mutation is mandatory** (pass-13 finding #2): the projection contract in §5.5 states the cache key includes `tombstone_verdict_id`. If the projection upsert uses pre-mutation values for that field and the binding tombstone state is then mutated, the projection row's key no longer matches live state — future cache lookups (which use live `tombstone_verdict_id`) miss the just-written row. The verdict effectively orphans itself in the cache. Computing the final tuple before the writes and applying everything atomically eliminates that window. + +**Acceptance test for the post-mutation contract** (must ship with C2): + +```text +test_not_relevant_then_restore_cycle: + 1. live binding: tombstone_verdict_id='' + 2. record_compliance_verdict(verdict='not_relevant') → returns history_id_1 + 3. assert binds_to.tombstone_verdict_id == history_id_1 + 4. assert exactly ONE compliance_check row exists for this binding, + keyed on tombstone_verdict_id == history_id_1 (NOT '') + 5. cache lookup with current live CAS hits that row — verdict == 'not_relevant' + 6. bicameral_restore_binding(expected_tombstone_verdict_id=history_id_1) + → returns history_id_2 (synthetic 'restored' row) + 7. assert binds_to.tombstone_verdict_id == '' + 8. assert TWO compliance_check rows exist (history_id_1 keyed and '' keyed), + OR projection cleanup also re-keyed the prior row — design choice, + but the live cache lookup MUST hit the row matching live state. + 9. cache lookup with current live CAS hits the post-restore row. +``` + +`bicameral_restore_binding(decision_id, region_id, expected_tombstone_verdict_id)` — operator/caller tool. Includes the tombstone verdict id as a CAS token to ensure the operator is restoring the tombstone they intended. + +**B3. `bicameral_advance_baseline`** + +`bicameral_advance_baseline(decision_id, region_id, cas_token, verdict_id)` — verdict_id must reference a `compliance_verdict_history` row whose `verdict='compliant'`, `stale=false`, and **all five CAS components** match the call. Older verdicts (different `binding_version` or `file_path`) are **rejected** even if bytes match. + +Writes new `binds_to.baseline_content_hash` and `baseline_commit_hash` for this single edge; bumps `binds_to.binding_version`. **Does not touch shared `code_region` state.** Other decisions bound to the same region are unaffected. + +Inserts a `baseline_advance` audit row: `{advanced_at, decision_id, region_id, prev_baseline_hash, new_baseline_hash, prev_binding_version, new_binding_version, verdict_id, agent_id}`. + +**No `ast_cosmetic` reason** — AST classification alone never advances the baseline. Only fresh L3 `compliant` verdicts can. + +**D2. `bicameral_rebind` (two-phase + lease + attempt-id locking)** + +The full protocol — including schema additions, lease/expiry recovery, force-supersede semantics, and the per-attempt CAS verification in phase 2 — lives in **§5.6**. This bullet is a pointer to that section, not a re-spec; read §5.6 for implementation. + +**Critical edge-vs-region distinction** (pass-14 #4): `binding_version` lives on the per-`binds_to`-edge, never on `code_region`. The mutations in phase 1 are: + +- **Bump `binding_version` on the OLD `binds_to` edge** — invalidates any in-flight verdicts that were authored against the old edge before the rebind started. +- **The NEW `binds_to` edge** is either freshly created (born with `binding_version=1`) or, if a tombstoned binding to the same `code_region` already exists, *reused* with its `binding_version` bumped and tombstone fields cleared. +- **`code_region` is never modified by rebind.** A move/rename creates (or reuses) a *new* `code_region` row; the old `code_region` stays immutable for audit. Per-region versioning was rejected in pass-8 specifically because shared region state corrupts cross-decision baselines (see §5.2). + +If you write the implementation and find yourself reaching for `UPDATE code_region:* SET binding_version = ...`, you've reintroduced the rejected design — stop and re-read §5.2 + §5.6. + +Phase 2 happens through `record_compliance_verdict` per §5.5 + §5.6: a `compliant` verdict on a `pending_verification=true` new binding atomically tombstones the old binding's `binds_to` edge and clears the lock. `drifted` / `not_relevant` verdicts leave both edges live with the lock held; caller advances via abandon, force_supersede, or lease expiry per §5.6. + +### Phase 5 — Polish (2–3 days) + +- **Doctor SKILL.md rendering**: update `.claude/skills/bicameral-doctor/SKILL.md` to render `pending_compliance_checks` and `pending_grounding_checks` as actionable advisories now that V2 has the safe atomic rebind. Update the verification instruction text in `handlers/link_commit.py::_build_verification_instruction` to point at `bicameral_rebind` for relocation cases (replacing the V1 "INFORMATIONAL ONLY — wait for V2" warning). +- **Codex pass-13**: re-run the adversarial review on the final V2 implementation. Target: clean ship with no remaining critical findings. +- **Convert scenario 8** in `tests/test_desync_scenarios.py` from `@pytest.mark.xfail(strict=True)` to a normal expected-pass test that exercises the two-phase rebind end-to-end. + +### Phase 6 — Surface: `bicameral_scan_branch` (3–5 days, closes GitHub #47) + +**Why this is the only scope addition to V2**: Phase 1–5 ship every primitive `#47` needs (per-binding baseline, full-CAS hash comparison, symbol re-resolution, atomic rebind, two-phase verdict flow). The remaining gap to fully closing the issue is one thin read-only handler that wires those primitives at the branch level. No new mutating capabilities. No schema changes. No new contract-surface beyond a single response type. The cost of *not* shipping it inside V2 is leaving the issue open while every prerequisite already exists in the same release. + +**Deliverable**: `handlers/scan_branch.py` plus a wiring entry in `server.py`'s MCP tool registry. + +**Tool contract**: + +```python +async def handle_scan_branch( + ctx, + base_ref: str, + head_ref: str, +) -> ScanBranchResponse: + """Read-only branch-aware drift report. + + For every code_region on a binds_to edge whose file appears in + `git diff --name-only base_ref..head_ref`, compute the live + content_hash at head_ref via `git show :` (using + the same resolve_symbol_lines + hash logic as link_commit, but + WITHOUT writing to the ledger). Surface the diff-style verdict + so callers (pre-push hooks, PR-comment Actions, the doctor skill + in branch-scope mode) can consume it without mutating state. + """ +``` + +`ScanBranchResponse` (new contract, additive — does not affect any existing response type): + +```python +class ScanBranchResponse(BaseModel): + base_ref: str + head_ref: str + drifted: list[ScanBranchDriftedEntry] # decisions whose bound code changed on the branch + ungrounded: list[ScanBranchUngroundedEntry] # ungrounded decisions surfaced for caller-LLM bind + changed_files: list[str] + sweep_scope: Literal["range_diff", "head_only", "range_truncated"] + range_size: int +``` + +**Implementation notes**: + +1. **Read-only invariant** — assert in tests that no `binds_to` edges are written, no `compliance_check` rows inserted, no `compliance_verdict_history` rows appended during the scan. Phase 6's job is to surface state, not modify it. +2. **Reuses Phase 1–2 machinery** — content-hash comparison goes through the same `resolve_symbol_lines` + `compute_content_hash` path as `derive_status` (per §5.2 / §5.5). Per-binding baseline is read from `binds_to.baseline_content_hash`. CAS is unnecessary because nothing is being written. +3. **Reuses V2 D2 symbol-relocation surfacing** — when a tracked symbol is absent at `head_ref`, surface it as a relocation candidate via the same `pending_grounding_checks` shape (with `original_lines`) that V1 D1 / V2 D2 already define. Don't invent a new payload. +4. **No ephemeral indexing** — the original #47 design called for a scratch BM25 index for on-branch re-grounding; that approach was invalidated by v0.6.0's removal of `ground_mappings()` (caller-LLM owns retrieval). #47's own "Updated Framing" section reflects this. +5. **CLI subcommand** — for #48 (pre-push hook) and #49 (PR-comment Action) to consume `bicameral_scan_branch` later, the handler's response must be JSON-serializable through the standard MCP envelope. No additional CLI work needed for V2 — the soft AC about "callable as a CLI subcommand" is satisfied by the MCP tool registration plus the existing `bicameral-mcp` console-script entry. + +**Acceptance** (mirrors #47 ACs verbatim): +- A branch that modifies a bound function returns that decision in `drifted`. +- Ungrounded decisions are returned alongside `changed_files` for caller-LLM evaluation. +- No `binds_to` edges or `compliance_check` rows are written during the scan (test asserts table counts unchanged after `handle_scan_branch` calls). +- Works with `SURREAL_URL=memory://` in CI (regression test in the existing `test_desync_scenarios.py` fixture style). +- `Closes #47` on the V2 PR. + +**Sequencing**: Phase 6 has no upstream dependencies on Phase 0–5 *except via the per-binding baseline schema* (Phase 1 C0). It can land last (cleanest) or in parallel with Phase 5 polish. Do not start Phase 6 before C0 lands. + +### Effort estimate by phase + +| Phase | Estimated effort (single owner, sequential) | +|---|---| +| 0 (prereq) | 1–2 weeks | +| 1 (schema) | 1–2 weeks | +| 2 (barrier) | 1 week | +| 3 (reads) | ~1 week | +| 4 (writes) | 2–3 weeks | +| 5 (polish) | 2–3 days | +| 6 (surface — #47) | 3–5 days | +| **Total** | **~8–11 weeks** | + +Phases don't parallelize cleanly because each depends on prior invariants. If multiple engineers, they can work on different deliverables *within* a phase (e.g. C0 vs C0a in Phase 1) but should not skip ahead. + +--- + +## 7. Constraints catalog + +This is the synthesized "what NOT to ship" guide. Each entry came from a Codex review pass that found a real bug in an earlier V2 design draft. Twelve passes total. Following these constraints is the difference between V2 shipping safely and V2 introducing data-corruption regressions. + +### 7.1 The recurring root cause + +Every Codex pass found a place where authoritative state was being mutated (or cached) without authoritative proof of the state being mutated. **V2 must commit to a uniform contract: no live mutation without a fresh, full-CAS verdict from the same call, applied via a single atomic SurrealQL statement.** + +If you're tempted to add a new mutating tool that doesn't follow this pattern, you've found the next regression. + +### 7.2 No mutation without authoritative proof + +(Aggregated from passes 1, 2, 3, 4, 5, 7, 8.) + +- Every mutating tool must take a CAS token. The token has 5 components: `expected_content_hash`, `expected_commit_hash`, `expected_file_path`, `expected_binding_version`, `expected_tombstone_verdict_id`. +- Mismatch on **any** component → record verdict in history with `stale=true, stale_reason='_mismatch'` and **do not** mutate live state. +- `expected_commit_hash` is **mandatory**, not optional. Same content hash can legitimately appear at a different HEAD (revert), at a different file path (move), at a different `binding_version` (rebaseline / rebind), or with a different `tombstone_verdict_id` (operator action). +- `expected_binding_version` is per-`binds_to`-edge, not per-region. Per-region versioning was rejected because it lets one decision's actions invalidate another decision's cache (cross-decision corruption). +- Tombstone state is part of identity. Operator restoration / re-tombstoning produces a different cache row, so verdicts authored against an old tombstone state never replay against a current restored state. + +### 7.3 No backdoor paths + +(Aggregated from passes 2, 6, 10.) + +- New safety contracts must be applied to **every** caller path day one. No "small contract change for existing flow" deferred to later — Codex's pattern was that every "follow-up" became the next exploit. +- **`handlers/resolve_compliance.py:122`** still hard-deletes `binds_to` on `not_relevant`. **`handlers/ingest.py:313-331`** auto-chains into it via `handle_judge_gaps`. Both must move to tombstone + full CAS **before** any new mutating tool ships. This is the single highest-leverage move and it's the **hard prerequisite** for the rest of V2 (Phase 0a). +- Migration of legacy `compliance_check` rows: do **not** backfill from current state. Insert into `compliance_verdict_history` with `stale=true, stale_reason='legacy_pre_v6_no_cas_metadata'`, drop and recreate the projection empty. Backfilling fabricates CAS metadata that was never recorded historically. + +### 7.4 Identity & CAS dimensions + +(Aggregated from passes 3, 5, 7, 8, 9.) + +The CAS tuple converged after 9 passes to the following **single source of truth** — V2 must use this verbatim across schema, lookup, write upsert, and acceptance tests: + +```text +(decision_id, region_id, content_hash, commit_hash, file_path, + binding_version, tombstone_verdict_id) +``` + +Where: +- `content_hash` = hash over **resolved-symbol bytes**, not bytes at frozen line range. Region identity is `(file_path, symbol_name)`; line numbers are advisory snapshots, re-resolved on every read via `resolve_symbol_lines()` (`ledger/status.py:21-89`). +- `binding_version` lives on the `binds_to` edge, not on `code_region`. **Per-binding ownership is mandatory** — shared region state corrupts cross-decision baselines. +- `tombstone_verdict_id` is part of the cache key so operator restoration / re-tombstone produces a different cache row, never a hit. +- Same content hash at different commit / path / binding_version / tombstone_verdict_id is a **different** projection row, not an overwrite. + +### 7.5 Atomicity & race windows + +(Aggregated from passes 4, 6.) + +- Embedded SurrealKV does **not** support client-side `begin_transaction()`. V2 uses inline SurrealQL `BEGIN TRANSACTION; ...; COMMIT TRANSACTION;` blocks submitted as a single `query()` call via the `LedgerClient.transaction()` context manager (A0 in §6 Phase 0b — committed choice, not optional). Day-1 gate test in §6 Phase 0b proves rollback works in our deployment mode; explicit fallback path (single `LET`-chained statements, then network SurrealDB) is documented if the gate fails. +- **Verify the embedded SurrealKV mode honors BEGIN/COMMIT semantics** — some SurrealDB modes silently ignore them. Day-1 spike. +- Sync barrier must extend from handler entry to commit time. HEAD-only CAS is insufficient — working-tree edits race writes without changing HEAD. Per-region fingerprint CAS at commit time required for `bicameral_rebind`, `bicameral_advance_baseline`, and `record_compliance_verdict`. +- Forced failure on the second statement of a multi-step mutation must leave zero side effects (no orphaned new edge, no half-tombstoned old edge, no history-without-projection). + +### 7.6 Verdict semantics + +(Aggregated from passes 1, 2, 3, 11.) + +- Verdicts must be **reversible**. Storage requires append-only `compliance_verdict_history` + projection — the legacy `compliance_check` UNIQUE index makes reversal physically impossible. +- `not_relevant` verdicts must **tombstone, not hard-delete**. Restoration must be auditable via a `bicameral_restore_binding` tool with its own CAS token. +- **D2 (`bicameral_rebind`) is two-phase**. The single-transaction "create new + tombstone old" version retires the authoritative binding before the new target is semantically proven — a wrong candidate selection silently reattaches the decision to unrelated code. Phase 1: create new as pending, return CAS token. Phase 2: caller's L3 verdict on new target gates atomic tombstoning of old. + +### 7.7 AST classifier discipline + +(From pass 7.) + +- The B1 whitelist must be narrow: intra-line whitespace, trailing whitespace, blank lines between statements. **That's it.** +- Trailing commas, **all** comment edits, docstring edits are **not** cosmetic. Trailing commas are behavioral in Python (`(x,)` vs `(x)`); comments carry tool directives (`# type: ignore`, `// @ts-ignore`, build tags); docstrings are observable via `__doc__`. +- The classifier **never gates L3 dispatch**. All hash-divergent regions reach L3 with the hint as advisory metadata only. + +### 7.8 Spec writes vs code-shape writes + +(From pass 6 #1.) + +- **Spec writes** (`ingest`, `ratify`) must remain append-only with best-effort post-write sync. Fail-closing them on git/repo outage drops the user's decision — strictly worse than today's desync. +- **Code-shape writes** (`bind`, `resolve_compliance`, all V2 destructive tools) must be fail-closed on sync failure. + +### 7.9 Pass-12 specific findings (V1 pre-ship review) + +(From pass 12, addressed in V1 commit `a04e54b`.) + +- The v0.6.4 monolithic `_VERIFICATION_INSTRUCTION` indiscriminately routed both ungrounded and `symbol_disappeared` cases to a `bicameral.bind` CTA. For relocation cases, that creates duplicate-binding state. +- V1 split the instruction into per-`reason` parts. **V2 retains this split** even after atomic rebind ships; the relocation branch is updated to point at `bicameral_rebind` instead of warning callers off. +- V1's claim that the doctor SKILL.md "is already advisory" was empirically false — the file at `.claude/skills/bicameral-doctor/SKILL.md` (note path; not `skills/bicameral-doctor/`) contains zero references to `pending_grounding_checks`, `relocation`, `symbol_disappeared`, or `bicameral.bind`. V2 Phase 5 polish updates the skill to render these. + +### 7.10 Pass-13 specific findings (V2 design review) + +Two high-severity findings on the V2 design itself, addressed in §5.5 and §5.6. + +**Rebind phase 2 must verify the specific pending attempt, not just "the old binding."** + +A naive two-phase rebind whose phase 2 only carries the *new* binding's CAS token can tombstone the wrong old binding when a caller has done multiple phase-1 attempts. A stale phase-2 verdict for an abandoned candidate would still trigger old-edge tombstoning even if the caller intended to use a different candidate. The fix has three parts: + +1. **Single pending rebind per old binding**: phase 1 sets `binds_to.pending_rebind_attempt_id = `. Concurrent phase-1 attempts on the same `old_region_id` see the lock and abort with `rebind_already_pending`. +2. **Immutable attempt id**: `rebind_audit.attempt_id` is a UUID, generated in phase 1, stored on both the new binding (`binds_to.rebind_attempt_id`) and the audit row. Phase 2 carries no extra arg — the new binding's `rebind_attempt_id` field is the link. +3. **Phase 2 cross-CAS**: when handling a `compliant` verdict on a `pending_verification` new binding, the verdict handler re-reads the OLD binding and verifies `old_binding.pending_rebind_attempt_id == new_binding.rebind_attempt_id` AND that the snapshotted `old_binding_version_at_attempt` / `old_tombstone_verdict_id_at_attempt` from the audit row still match. Mismatch → record stale-history-only with `stale_reason='rebind_attempt_superseded'`. The old binding is never tombstoned by a stale verdict. + +Explicit abandon path (`bicameral_abandon_rebind`) lets a caller supersede a prior attempt cleanly, so the protocol doesn't hang on an indecisive caller. See §5.6 for the full schema and protocol. + +**`record_compliance_verdict` must derive projection keys from POST-mutation state, not pre-mutation inputs.** + +The cache contract (§5.5) says `compliance_check` is keyed on the full 7-tuple including `tombstone_verdict_id`. If the verdict-write algorithm upserts the projection BEFORE mutating `binds_to.tombstone_verdict_id`, the cached row is keyed on the old tombstone identity while the live binding immediately changes to a different tuple. Future cache lookups use the live `tombstone_verdict_id` and miss — verdict orphans itself in the cache; the user sees a stale "no cached verdict" state and the L3 round-trip is repeated unnecessarily. + +The fix: compute `post_tombstone_verdict_id` (`''` for compliant/drifted, `` for not_relevant) BEFORE any write, then in a single atomic transaction insert the history row → update binds_to → upsert the projection keyed on the post-mutation tuple. The acceptance test in §6 Phase 4 (C2) — `test_not_relevant_then_restore_cycle` — must ship alongside the `record_compliance_verdict` implementation to lock the contract. See §5.5 / §6 Phase 4 for the full algorithm. + +The general rule extracted from this finding: **whenever a write mutates a field that is part of any cache or index key, the cache/index write must be derived from the post-mutation value of that field, not the pre-mutation input.** This applies to verdict-writes (tombstone_verdict_id), baseline-advance (binding_version on the relevant `binds_to` edge), and rebind (binding_version on both the old and new `binds_to` edges). All three must compute final state first, then atomically commit the cluster of changes. + +### 7.11 Pass-14 specific findings (V2 guide review) + +Four findings on the consolidated V2 guide itself, addressed in §6 Phase 0b, §5.6, §8, and §6 Phase 4 D2. + +**Atomicity is a committed choice, not an open option** (pass-14 #1). + +The previous draft listed Opt 1 (`LedgerClient.transaction()` wrapping `BEGIN/COMMIT TRANSACTION`) and Opt 2 (single `LET`-chained statement) as alternatives and deferred picking. That defers the prerequisite of every destructive path in V2. The guide now commits to Opt 1, ships a day-1 forced-failure gate test (`test_transaction_rolls_back_on_failure` in §6 Phase 0b) that proves embedded SurrealKV honors `BEGIN/COMMIT/CANCEL TRANSACTION` rollback semantics, and documents an explicit fallback path (Opt 2 first, then network SurrealDB second) if the gate fails. **No Phase 1+ work begins until the gate test passes against the embedded ledger configuration we ship with.** + +**Pending rebinds must have a server-enforced lease** (pass-14 #2). + +The previous draft only documented a caller-driven `bicameral_abandon_rebind` path. A crashed or distracted caller could wedge an `old_region_id` indefinitely (every subsequent `bicameral_rebind` returns `rebind_already_pending`). The fix adds: + +- `rebind_audit.expires_at` field, populated from `BICAMERAL_REBIND_LEASE_SECONDS` (default 24h). +- An on-demand expiry sweep at the start of every `bicameral_rebind` phase 1 — atomically abandons any expired pending attempt before issuing a new one (`outcome='abandoned_by_expiry'`). +- A `force_supersede=true` flag on `bicameral_rebind` for explicit caller-driven supersession. +- Phase 2 lease check in `record_compliance_verdict`: if the audit row's outcome is no longer `pending` or `expires_at < now()`, the verdict is recorded with `stale=true, stale_reason='rebind_attempt_expired' / '_superseded' / '_abandoned'` and the old binding is never tombstoned. + +The combination guarantees forward progress: no client crash can wedge a binding for longer than the lease TTL, even without operator intervention. + +**`judge_gaps` parity is resolved, not deferred** (pass-14 #3). + +The previous draft listed "judge_gaps parity" as an open question while §7.3 simultaneously claimed all backdoor paths must be closed before new tools ship. That's a contradiction. Resolved in §8 question 5: `bicameral_judge_gaps` is read-only (returns a context pack to the caller LLM, never writes). The destructive write happens in `handlers/resolve_compliance.py`, which Phase 0a migrates from hard-delete to tombstone+CAS. Phase 0a covers the entire judge_gaps→resolve_compliance pipeline; no separate change to `judge_gaps` itself is required. §7.3's "all backdoors closed before new tools ship" claim stands. + +**`binding_version` lives on edges, never on regions** (pass-14 #4). + +The previous draft's §6 Phase 4 D2 summary said to "bump `binding_version` on both old and new regions" — region terminology, contradicting the §5.2 design where versioning is per-binding (per-edge) specifically to avoid cross-decision corruption. Rewritten: every mention of `binding_version` mutation is now in edge terminology (`binds_to`), `code_region` is explicitly called out as immutable under rebind, and the bullet warns against the exact misimplementation the pass-14 reviewer flagged. If an implementer reaches for `UPDATE code_region:* SET binding_version`, they've reintroduced the rejected design. + +--- + +## 8. Open questions + +These need human judgment before V2 implementation starts. Codex's adversarial review can't answer them. + +1. **Phase 0a vs 0b ordering.** Should `resolve_compliance` migrate to tombstone (Phase 0a) before A0 lands, or after? Doing it first closes the destructive backdoor sooner but uses a single-statement `UPDATE` (no transaction needed). Doing it after A0 means we can do the migration as a transaction-wrapped multi-step write. Recommend Phase 0a first; document the sequencing decision in the V2 kickoff. + +2. ~~**Transaction primitive: opt 1 vs opt 2.**~~ **Resolved (pass-14 #1)**: V2 commits to `LedgerClient.transaction()` wrapping inline `BEGIN/COMMIT TRANSACTION` blocks. The day-1 gate test in §6 Phase 0b verifies embedded SurrealKV honors rollback; if that test fails, the explicit fallback path (single `LET`-chained statements, then network SurrealDB) ships instead. No deferred decision — see §6 Phase 0b for the committed mechanism and §7.11 for the rationale. + +3. **Cache projection vs history-only.** Keep `compliance_check` projection table for perf, or serve cache reads directly from `compliance_verdict_history` via `WHERE all_seven_CAS_components_match AND stale=false ORDER BY recorded_at DESC LIMIT 1`? Both are semantically equivalent given the migration empties the projection. Decision deferred to A1 benchmark numbers run against the history-only path. + +4. **Tombstone GC policy.** How long do tombstoned `binds_to` rows live before hard-delete? Candidate: 30 days with no contradicting verdict, plus operator-callable purge. Aligns with retention policy. Could also defer entirely to V3. + +5. ~~**`judge_gaps` parity.**~~ **Resolved (pass-14 #3)**: `bicameral_judge_gaps` itself is read-only — it returns a context pack to the caller LLM and never writes. The only write that happens in the gap-judgment flow is when the caller LLM calls `bicameral.resolve_compliance` to record the verdict. Phase 0a's migration of `handlers/resolve_compliance.py` from hard-delete to tombstone+CAS therefore covers the entire pipeline; no separate contract change to `judge_gaps` is needed. **The §7.3 statement that "all backdoor paths are closed before new tools ship" stands and is not in conflict with this entry.** Removing this question from the open list. + +6. **Catch-up latency budget for write path.** A1+A3 in V1 measured the read-path baseline. V2 inherits with stricter SLOs. If barrier-held p95 > 1s under realistic load, may need finer-grained locking (per-decision, per-region) — tracked but not in V2 scope unless measurements force it. + +7. **Should V2 ship as one PR or several?** 7-10 weeks of work + 6 phases is a lot for a single PR. Recommend phase-aligned PRs: Phase 0, Phase 1, Phase 2, Phase 3, Phase 4 split per tool (C2 / B3 / D2 each as its own PR), Phase 5 polish. Each PR re-runs Codex. + +8. **Who owns V2?** CODEOWNERS requires Jin approval. Recommend involving him at design time, not just at PR-review time. He should weigh in on at least the open questions above. + +--- + +## 9. Acceptance criteria for V2 + +V2 is shippable when **all** of the following hold: + +### Quantitative thresholds + +- [ ] Scenario 8 in `tests/test_desync_scenarios.py` flips from `xfail(strict=True)` to expected pass — atomic rebind end-to-end test. +- [ ] Full desync scenario suite: **13 / 13 pass, 0 xfail**. +- [ ] Catch-up latency p95 < 1000 ms on the V1 benchmark fixture (A1). +- [ ] No regression on V1 perf baseline: search_decisions p95 ≤ 11 ms, detect_drift p95 ≤ 17 ms (allows 1.5× headroom over V1's 10.4ms / 15.5ms). +- [ ] Forced-failure correctness tests for atomicity (Phase 0b): every multi-step mutation, when the second statement is forced to fail, leaves zero side effects. +- [ ] Codex review pass-13 produces zero remaining critical (high-severity) findings. + +### Qualitative / behavioral + +- [ ] `handlers/resolve_compliance.py` no longer calls `delete_binds_to_edge` — replaced by tombstone path with full CAS. +- [ ] All `binds_to` traversal sites filter via `binds_to_active_filter()` (audited via grep). +- [ ] `derive_status` reads per-binding `baseline_content_hash` from `binds_to`, never shared `code_region.content_hash`. +- [ ] Every mutating handler takes a 5-field CAS token; mismatch produces a stale-history row and zero live mutation. +- [ ] `bicameral_rebind` is two-phase **with attempt-id locking** (pass-13 #1): phase 1 sets `binds_to.pending_rebind_attempt_id`; concurrent phase-1 attempts on the same `old_region_id` get `rebind_already_pending`; phase 2's verdict handler verifies `old_binding.pending_rebind_attempt_id == new_binding.rebind_attempt_id` AND the audit row's snapshotted old-binding state still matches before any tombstoning. Stale phase-2 verdicts on superseded attempts produce `stale_reason='rebind_attempt_superseded'` history rows and **never** tombstone old. `bicameral_abandon_rebind` exists for explicit caller-driven supersession. +- [ ] `record_compliance_verdict` derives the projection key from **post-mutation state** (pass-13 #2): `post_tombstone_verdict_id` computed before any write; history insert + binds_to update + projection upsert all in one atomic A0 transaction; the projection row's CAS tuple matches live `binds_to` state at commit time. Acceptance test `test_not_relevant_then_restore_cycle` must pass — proves cache lookups with current live CAS hit the row matching live state across a full not_relevant → restore cycle. +- [ ] **A0 gate test passes** (pass-14 #1): `tests/test_a0_atomic_transaction.py::test_transaction_rolls_back_on_failure` succeeds against the embedded ledger configuration we ship with — proves `BEGIN/COMMIT TRANSACTION` rollback semantics actually work. If gate fails, the documented fallback path is followed and the alternative mechanism passes its own forced-failure correctness tests. +- [ ] **Rebind has lease-driven recovery** (pass-14 #2): `rebind_audit.expires_at` populated on phase 1; on-demand expiry sweep at the start of every `bicameral_rebind` phase 1 atomically abandons stale leases (`outcome='abandoned_by_expiry'`); phase 2 lease check rejects verdicts on expired/superseded/abandoned attempts as stale-history-only; `force_supersede=true` on `bicameral_rebind` provides explicit caller-driven supersede. Acceptance test simulates a crashed caller (insert stale `rebind_audit` with `recorded_at - 25h`) and proves the next `bicameral_rebind` succeeds with the prior attempt marked `abandoned_by_expiry`. +- [ ] **Edge-vs-region terminology audit** (pass-14 #4): grep proves no V2 implementation code mutates `binding_version` on `code_region`. Every `binding_version` write targets a `binds_to` edge. +- [ ] **`judge_gaps` migration is implicit, not separate** (pass-14 #3): Phase 0a's `resolve_compliance` migration covers the entire `judge_gaps → resolve_compliance` pipeline. No separate code change to `handlers/gap_judge.py` (read-only) is required or made. +- [ ] `.claude/skills/bicameral-doctor/SKILL.md` renders `pending_compliance_checks` and `pending_grounding_checks` with the (now-safe) bind / rebind flows. +- [ ] **`bicameral_scan_branch` ships and closes GitHub #47** (Phase 6): `handlers/scan_branch.py` is registered as an MCP tool; calling it with `(base_ref, head_ref)` returns drifted decisions, ungrounded decisions, and `changed_files` between the two refs. Read-only invariant audited by test (table counts unchanged after scan). PR uses `Closes #47`. +- [ ] CHANGELOG entry summarizes V2 deliverables; the V1 "Unreleased" entry can roll up into a V2 release version (or both can ship as a single release, depending on team preference). + +### Documentation + +- [ ] `TODO.md` and `PLAN.md` ticked for every V2 phase deliverable per the project's auto-tick mandate. +- [ ] This guide (`docs/v2-desync-optimization-guide.md`) updated with V2 shipped status; or replaced with a V3 doc if the cycle continues. + +--- + +## 10. References + +### Local files V2 will touch + +- `ledger/schema.py` — schema migration v5→v6 +- `ledger/client.py` — A0 transaction primitive +- `ledger/status.py` — `derive_status` per-binding rewrite +- `ledger/adapter.py` — `ingest_commit` updates, traversal filtering +- `ledger/queries.py` — SurrealQL helpers, `binds_to_active_filter` +- `handlers/sync_middleware.py` — A2a barrier +- `handlers/bind.py` — A2a wiring, post-tombstone idempotency +- `handlers/resolve_compliance.py` — Phase 0a hard-delete → tombstone +- `handlers/ingest.py` — no V2 contract change required (it auto-chains into `handle_judge_gaps`, which is read-only; the destructive write happens later in `resolve_compliance`, which Phase 0a migrates). Listed here for awareness, not for editing. +- `handlers/detect_drift.py` — C3 cache-aware emission +- `handlers/link_commit.py` — verification_instruction text update for V2 rebind path +- `contracts.py` — new contracts for verdict / advance_baseline / rebind responses, CAS token types +- `server.py` — register new MCP tools +- `tests/test_desync_scenarios.py` — convert scenario 8 from xfail to pass +- `tests/test_resolve_compliance.py` — assert tombstone, not deletion +- New tests: `tests/test_record_compliance_verdict.py`, `tests/test_advance_baseline.py`, `tests/test_rebind.py`, `tests/test_a2a_barrier.py`, `tests/test_v6_migration.py`, `tests/test_scan_branch.py` (Phase 6) +- New: `handlers/scan_branch.py` — Phase 6 read-only branch-aware drift report, closes GitHub #47 +- `.claude/skills/bicameral-doctor/SKILL.md` — Phase 5 rendering update + +### V1 commits on this branch + +```text +8e226c5 docs: tick V1 desync optimization across CHANGELOG / TODO / PLAN +a04e54b fix(link_commit): split verification_instruction so relocation cases don't get bind CTA +89f8076 feat: desync optimization V1 F1 — canonical 13-scenario regression matrix +54081e6 feat: desync optimization V1 D1 — original_lines on symbol-disappeared payload +401babc feat: desync optimization V1 Phase B — read-path cosmetic-change advisory +3b4d0bb feat: desync optimization V1 Phase A — measurement + light sync hardening +``` + +### Notion references + +- [The Auto-Grounding Problem: Keeping Decisions Linked to Code](https://www.notion.so/3332a51619c4813caccec86c36d9bf98) — 13 desync scenarios, Compliance Reframe (L1/L2/L3 model) +- [The Branch Problem: Git Branches in the Decision Ledger](https://www.notion.so/3302a51619c48146b48dc675914beb6f) — content-hash primitive rationale +- [CI Workflow Fixes — MCP Regression Pipeline (Apr 8)](https://www.notion.so/33c2a51619c48134ba8dc8bfaeb880dd) — PR #84 scorecard shift from 77% → 92%; the "tests must use real handler layer" lesson + +### External technical references + +- [SurrealDB Python SDK — Connecting](https://surrealdb.com/docs/sdk/python/concepts/connecting-to-surrealdb) — embedded mode does NOT support `begin_transaction()`. V2 must use inline SurrealQL. +- [SurrealQL Transactions](https://surrealdb.com/docs/surrealql/transactions) — `BEGIN/COMMIT/CANCEL TRANSACTION` semantics. **Verify embedded SurrealKV honors these.** +- [SurrealQL COMMIT statement](https://surrealdb.com/docs/surrealql/statements/commit) +- [SurrealKV README](https://github.com/surrealdb/surrealkv) — "embedded ACID-compliant key-value storage engine" +- [py-tree-sitter](https://github.com/tree-sitter/py-tree-sitter) — used by V1 B1 (`ledger/ast_diff.py`); V2 may extend with semantic-equivalence checks. +- [Tree-sitter whitespace handling — issue #497](https://github.com/tree-sitter/tree-sitter/issues/497) — confirms tree-sitter does not represent inter-token whitespace as nodes. +- [Diffsitter HN discussion](https://news.ycombinator.com/item?id=27875333) — prior art for AST-based semantic diff. + +### Codex review history + +The V2 design doc went through 9 review rounds plus 3 review rounds on the V1 plan (12 total) before V1 shipped. Each round found a bug in the prior draft. The synthesis of those findings is §7. If you want the chronological record (useful for understanding *why* a constraint exists), `git log --all --diff-filter=D --name-only -- docs/desync-optimization.md` after the old docs are deleted will show the file's last state. + +The pattern across all 12 passes was: "V2 keeps adding safety contracts to new code paths but leaves backdoors in old paths." Phase 0 (resolve_compliance migration) addresses the most prominent instance. **If you find yourself adding a new mutating tool that doesn't migrate every existing path that touches the same data, you're recreating the pattern.** + +--- + +## Final note for the engineer / agent picking this up + +V2 is a high-risk, high-reward change. The risk surfaced naturally over 12 review passes and is now well-characterized in §7. The reward — actual semantic drift detection, safe rename recovery, reversible verdicts — is what bicameral was originally pitched to do. + +Take your time on Phase 0. It's the foundation everything else stands on. The single best signal that V2 is going well is: every PR through Phase 4 lands with the V1 desync scenario suite still at 12/13 PASS + 1 XFAIL until D2 ships, at which point scenario 8 flips and the suite is 13/13. If at any point a different scenario starts failing or xfailing, you've regressed something — stop and root-cause before continuing. + +Involve Jin. He's CODEOWNERS, knows the project trajectory, and the open questions in §8 are exactly the kind of thing he should weigh in on. Don't make him a PR-review-time discovery. + +## After V2 ships — workflow change + +V2 is the **last release** authored by reverse-mapping deliverables to issues. After V2, the project switches to an **issue-driven workflow**: pick an issue, treat its acceptance criteria as the spec, ship a focused PR with `Closes #N`. No more "we built X, what issue does it sort of fit?" mapping. + +Phase 6 (`bicameral_scan_branch`) was added to V2 specifically because it was *already close* — V1 + V2 Phase 1–4 ship every primitive #47 needs, the gap is one read-only handler, and shipping it inside V2 closes the issue cleanly with `Closes #47` on the V2 PR. That's the bar for any future "expand the in-flight release to close an adjacent issue" decision: the underlying machinery must already exist; the addition must be additive (no new mutating capabilities, no schema changes); and the issue's acceptance criteria must be fully satisfiable by the addition. + +The natural next-issue-up after V2 ships: + +- **#39 (Telemetry Layer 1, P0)** — small, unblocks #41 / #42 / #43 / #44. +- **#42 (`bicameral.usage_summary`)** — depends on #39; unblocks the third acceptance criterion of #44 (which V2's LLM judge otherwise satisfies). +- **#41 (drift transition diagnostic)** — depends on #39 + V1's classifier + V2's judge. After #39 lands, all three pieces exist and #41 closes. +- **#44 (LLM semantic drift judge)** — depends on #42's metric. After #39 + #42 land, V2's judge tooling closes the issue. +- **#48, #49** — both depend on #47's CLI. After V2 ships #47, these become focused single-PR issues. + +That sequence (#39 → #42 → #41 → #44 → #48 → #49) takes the desync queue from "5 open issues V2 can't fully close" to "all 5 closed via small focused PRs over 4–6 weeks," with each PR using `Closes #N` honestly. + +Good luck. diff --git a/handlers/bind.py b/handlers/bind.py index e6e886c5..c225073b 100644 --- a/handlers/bind.py +++ b/handlers/bind.py @@ -2,7 +2,8 @@ from __future__ import annotations import logging -from contracts import BindResponse, BindResult, PendingComplianceCheck +from contracts import BindResponse, BindResult, PendingComplianceCheck, SyncMetrics +from handlers.sync_middleware import repo_write_barrier logger = logging.getLogger(__name__) @@ -17,7 +18,22 @@ async def handle_bind(ctx, bindings: list[dict]) -> BindResponse: 3. Compute content_hash against authoritative_sha. 4. Upsert code_region + binds_to edge, transition decision ungrounded→pending. 5. Return PendingComplianceCheck for immediate caller verification. + + V1 A2-light: the whole handler body runs under ``repo_write_barrier`` + so two concurrent bind calls against the same repo are serialized. + Does NOT protect against concurrent resolve_compliance / cross-process + writers — those are V2 scope. + + V1 A3: the barrier's hold duration is attached to the response as + ``sync_metrics.barrier_held_ms``. """ + async with repo_write_barrier(ctx) as timing: + response = await _do_bind(ctx, bindings) + response.sync_metrics = SyncMetrics(barrier_held_ms=timing.held_ms) + return response + + +async def _do_bind(ctx, bindings: list[dict]) -> BindResponse: ledger = ctx.ledger if hasattr(ledger, "connect"): await ledger.connect() diff --git a/handlers/detect_drift.py b/handlers/detect_drift.py index 778f285d..d48cc079 100644 --- a/handlers/detect_drift.py +++ b/handlers/detect_drift.py @@ -7,15 +7,26 @@ v0.4.17: ``raw_decisions_to_drift_entries`` is extracted as a module-level helper so ``handlers.scan_branch`` can reuse the exact same per-decision mapping logic without duplicating the loop. + +V1 B2: drifted entries get an advisory ``cosmetic_hint`` populated from +``ledger.ast_diff.is_cosmetic_change`` over the region's HEAD bytes vs +working-tree bytes. The hint is enrichment, not a gate — the pure +``raw_decisions_to_drift_entries`` mapping stays IO-free. """ from __future__ import annotations +import logging import os from pathlib import Path +from code_locator.indexing.symbol_extractor import EXTENSION_LANGUAGE from contracts import DetectDriftResponse, DriftEntry, LinkCommitResponse from handlers.link_commit import handle_link_commit +from ledger.ast_diff import is_cosmetic_change +from ledger.status import get_git_content, resolve_symbol_lines + +logger = logging.getLogger(__name__) def raw_decisions_to_drift_entries( @@ -83,6 +94,15 @@ async def handle_detect_drift( entries, counts = raw_decisions_to_drift_entries(raw_decisions) source = "working_tree" if use_working_tree else "HEAD" + # V1 B2: enrich drifted entries with an AST cosmetic hint. Read-path + # only — never mutates content_hash, never changes status. Hint is + # meaningful only when the response advertises ``source="working_tree"`` + # (the cosmetic comparison axis is HEAD vs working tree); skip on + # HEAD-source so we don't attach hints derived from a diff axis the + # caller didn't ask about. + if use_working_tree: + _enrich_with_cosmetic_hints(entries, file_path, ctx.repo_path) + return DetectDriftResponse( file_path=file_path, sync_status=sync_status, @@ -92,3 +112,82 @@ async def handle_detect_drift( pending_count=counts["pending"], undocumented_symbols=undocumented, ) + + +def _enrich_with_cosmetic_hints( + entries: list[DriftEntry], + file_path: str, + repo_path: str, +) -> None: + """Set ``cosmetic_hint=True`` on drifted entries whose HEAD→working-tree + diff is provably whitespace-only per the strict B1 whitelist. + + Per-entry alignment: the stored ``entry.lines`` is the baseline anchor + (set by ingest, possibly updated by link_commit's symbol-shift heal). + Lines at HEAD and at the working tree may have shifted independently, + so we re-resolve the symbol against each ref via tree-sitter and slice + each ref's content using its own resolved range. If either resolution + fails, fail safe to ``cosmetic_hint=False`` — the cosmetic-hint + contract is "False is cheap, True must be earned" (V1 plan §B1). + + Skips non-drifted entries, files we can't read, unsupported extensions, + and entries whose symbol can't be located at HEAD or working tree. + """ + drifted = [e for e in entries if e.status == "drifted"] + if not drifted: + return + + ext = Path(file_path).suffix.lower() + lang = EXTENSION_LANGUAGE.get(ext) + if lang is None: + return # unsupported extension — no hint computed for this file + + # NOTE: ledger.status.get_git_content takes start_line / end_line in + # its signature but ignores them — it always returns the full file + # body. Two existing legacy callers do the slicing themselves after + # the call. We pass 0, 0 to make the unused-args reality explicit; + # we slice locally below per-region. Cleaning up the upstream + # signature is a separate refactor across all callers (see ledger/ + # status.py:110, ledger/adapter.py:67). + head_full = get_git_content(file_path, 0, 0, repo_path, ref="HEAD") + wt_full = get_git_content(file_path, 0, 0, repo_path, ref="working_tree") + if head_full is None or wt_full is None: + return # file missing at one side — can't compare, leave default + + head_lines = head_full.splitlines() + wt_lines = wt_full.splitlines() + + for entry in drifted: + # Use entry.symbol to re-resolve aligned line ranges per ref. + # ``entry.lines`` (the baseline anchor) cannot be trusted for + # slicing both HEAD and the working tree because shifts on either + # side can desync the slice from the symbol body. Resolution + # failure → safe default of cosmetic_hint=False. + if not entry.symbol: + continue + try: + head_range = resolve_symbol_lines(file_path, entry.symbol, repo_path, ref="HEAD") + wt_range = resolve_symbol_lines(file_path, entry.symbol, repo_path, ref="working_tree") + except Exception as exc: + logger.debug("[detect_drift] resolve_symbol_lines failed for %s/%s: %s", file_path, entry.symbol, exc) + continue + if head_range is None or wt_range is None: + continue # symbol absent at one side — not a cosmetic case + + head_start, head_end = head_range + wt_start, wt_end = wt_range + if head_start <= 0 or head_end < head_start: + continue + if wt_start <= 0 or wt_end < wt_start: + continue + + head_slice = "\n".join(head_lines[head_start - 1:head_end]) + wt_slice = "\n".join(wt_lines[wt_start - 1:wt_end]) + if not head_slice or not wt_slice: + continue + if head_slice == wt_slice: + continue # no byte diff at all — hint is meaningless here + try: + entry.cosmetic_hint = is_cosmetic_change(head_slice, wt_slice, lang) + except Exception as exc: + logger.debug("[detect_drift] cosmetic hint failed for %s: %s", file_path, exc) diff --git a/handlers/history.py b/handlers/history.py index 5498beae..9bbabeac 100644 --- a/handlers/history.py +++ b/handlers/history.py @@ -279,8 +279,15 @@ async def handle_history( 5. Apply feature_filter (substring match, case-insensitive). 6. Truncate at 50 features and set truncated flag. """ + # V1 A3: time the catch-up locally so history can report it. + import time as _time from handlers.sync_middleware import ensure_ledger_synced + from contracts import SyncMetrics + _t0 = _time.perf_counter() banner = await ensure_ledger_synced(ctx) + sync_metrics = SyncMetrics( + sync_catchup_ms=round((_time.perf_counter() - _t0) * 1000, 3) + ) ledger = ctx.ledger if hasattr(ledger, "connect"): @@ -344,4 +351,5 @@ async def handle_history( total_features=total_features, as_of=as_of_ref, session_start_banner=banner, + sync_metrics=sync_metrics, ) diff --git a/handlers/link_commit.py b/handlers/link_commit.py index b1663cc4..fb9a9337 100644 --- a/handlers/link_commit.py +++ b/handlers/link_commit.py @@ -32,19 +32,58 @@ from contracts import LinkCommitResponse, PendingComplianceCheck -_VERIFICATION_INSTRUCTION = ( +_VERIFICATION_INSTRUCTION_BASE = ( "Evaluate each pending_compliance_check — decide whether the code_body " "semantically implements the intent_description. Call " "bicameral.resolve_compliance with phase= and a batch of " "verdicts: [{intent_id, region_id, content_hash, compliant, confidence, " "explanation}]. Group by phase if the batch mixes phases. One tool call " - "resolves the whole batch. " - "For pending_grounding_checks: use your own code search (Grep/Read), then " - "validate_symbols / extract_symbols to confirm the target, then call " - "bicameral.bind with decision_id, file_path, symbol_name, and optionally " - "start_line/end_line." + "resolves the whole batch." ) +_GROUNDING_INSTRUCTION_UNGROUNDED = ( + " For pending_grounding_checks with reason='ungrounded': use your own " + "code search (Grep/Read), then validate_symbols / extract_symbols to " + "confirm the target, then call bicameral.bind with decision_id, " + "file_path, symbol_name, and optionally start_line/end_line." +) + +# V1 D1 / Codex pass-12 finding #2: relocation cases (symbol_disappeared) +# must NOT route to bicameral.bind. Bind on the new location would leave +# the old binding live and produce duplicate-binding state under the N:N +# binds_to relation. Atomic rebind (which retires the stale edge in the +# same write) ships in V2 (design doc §8 D2 — bicameral_rebind with +# old-binding CAS + fresh L3 verdict on the new target). +_GROUNDING_INSTRUCTION_RELOCATION = ( + " For pending_grounding_checks with reason='symbol_disappeared': " + "INFORMATIONAL ONLY. The original_lines / file_path / symbol fields " + "tell you where this decision USED to live; safe atomic rebind " + "(which retires the stale edge in the same write) ships in V2. " + "Do NOT call bicameral.bind on the new location — that would leave " + "the old edge live and produce duplicate-binding state. Use git " + "history (`git show :` over original_lines) " + "to inform a future rebind, but do not bind directly." +) + + +def _build_verification_instruction( + pending_compliance: list, + pending_grounding: list[dict], +) -> str: + """Compose the verification instruction conditional on which payloads + actually fired. Splits ungrounded vs symbol_disappeared guidance so + relocation cases never get an unsafe ``bicameral.bind`` CTA. + """ + parts: list[str] = [] + if pending_compliance: + parts.append(_VERIFICATION_INSTRUCTION_BASE) + reasons = {c.get("reason") for c in pending_grounding} + if "ungrounded" in reasons: + parts.append(_GROUNDING_INSTRUCTION_UNGROUNDED) + if "symbol_disappeared" in reasons: + parts.append(_GROUNDING_INSTRUCTION_RELOCATION) + return "".join(parts) + logger = logging.getLogger(__name__) @@ -206,6 +245,11 @@ async def handle_link_commit(ctx, commit_hash: str = "HEAD") -> LinkCommitRespon pending_grounding_raw = result.get("pending_grounding_checks", []) or [] has_action_items = bool(pending) or bool(pending_grounding_raw) + verification_text = ( + _build_verification_instruction(pending, pending_grounding_raw) + if has_action_items + else "" + ) flow_id = str(uuid.uuid4()) sync_state = getattr(ctx, "_sync_state", None) @@ -224,7 +268,7 @@ async def handle_link_commit(ctx, commit_hash: str = "HEAD") -> LinkCommitRespon range_size=result.get("range_size", 0), pending_compliance_checks=pending, pending_grounding_checks=pending_grounding_raw, - verification_instruction=_VERIFICATION_INSTRUCTION if has_action_items else "", + verification_instruction=verification_text, flow_id=flow_id, ) _store_sync_cache(ctx, commit_hash, response) diff --git a/handlers/preflight.py b/handlers/preflight.py index 1ca4f93b..31663d4f 100644 --- a/handlers/preflight.py +++ b/handlers/preflight.py @@ -277,8 +277,15 @@ async def handle_preflight( ) # Sync ledger to HEAD and collect the session-start banner (once per session). + # V1 A3: time the call locally so the metric reflects THIS handler's catch-up. + import time as _time from handlers.sync_middleware import ensure_ledger_synced + from contracts import SyncMetrics + _t0 = _time.perf_counter() banner = await ensure_ledger_synced(ctx) + sync_metrics = SyncMetrics( + sync_catchup_ms=round((_time.perf_counter() - _t0) * 1000, 3) + ) sources_chained: list[str] = [] @@ -315,6 +322,7 @@ async def handle_preflight( reason="no_matches", guided_mode=guided_mode, session_start_banner=banner, + sync_metrics=sync_metrics, ) # Merge: region-anchored results first (direct pin = high precision), @@ -331,6 +339,7 @@ async def handle_preflight( guided_mode=guided_mode, sources_chained=sources_chained, session_start_banner=banner, + sync_metrics=sync_metrics, ) # Search-level gate: in normal mode, require actionable signal. @@ -365,6 +374,7 @@ async def handle_preflight( guided_mode=guided_mode, sources_chained=sources_chained, session_start_banner=banner, + sync_metrics=sync_metrics, ) decisions = [_to_brief_decision(m) for m in search_resp.matches] @@ -418,4 +428,5 @@ async def handle_preflight( session_start_banner=banner, unresolved_collisions=unresolved_collisions, context_pending_ready=context_pending_ready, + sync_metrics=sync_metrics, ) diff --git a/handlers/search_decisions.py b/handlers/search_decisions.py index 0f8a7aee..8eab729b 100644 --- a/handlers/search_decisions.py +++ b/handlers/search_decisions.py @@ -6,7 +6,9 @@ from __future__ import annotations -from contracts import CodeRegionSummary, DecisionMatch, LinkCommitResponse, SearchDecisionsResponse +import time + +from contracts import CodeRegionSummary, DecisionMatch, LinkCommitResponse, SearchDecisionsResponse, SyncMetrics from handlers.action_hints import generate_hints_for_search from handlers.link_commit import handle_link_commit from handlers.sync_middleware import get_session_start_banner @@ -18,8 +20,16 @@ async def handle_search_decisions( max_results: int = 10, min_confidence: float = 0.5, ) -> SearchDecisionsResponse: + # V1 A3: time the mandatory catch-up so callers can see how long this + # handler spent in link_commit. Local timing (not sync_state) so nested + # calls don't step on each other's metrics. Scope mirrors + # ``ensure_ledger_synced`` (preflight / history): cover both + # ``handle_link_commit`` AND ``get_session_start_banner`` so the same + # ``sync_catchup_ms`` field measures the same surface across handlers. + t0 = time.perf_counter() sync_status: LinkCommitResponse = await handle_link_commit(ctx, "HEAD") banner = await get_session_start_banner(ctx) + catchup_ms = round((time.perf_counter() - t0) * 1000, 3) raw_matches = await ctx.ledger.search_by_query(query, max_results=max_results, min_confidence=min_confidence) @@ -76,4 +86,5 @@ async def handle_search_decisions( response.action_hints = generate_hints_for_search( response, guided_mode=getattr(ctx, "guided_mode", False), ) + response.sync_metrics = SyncMetrics(sync_catchup_ms=catchup_ms) return response diff --git a/handlers/sync_middleware.py b/handlers/sync_middleware.py index bd9b4ac4..f938a23f 100644 --- a/handlers/sync_middleware.py +++ b/handlers/sync_middleware.py @@ -14,7 +14,10 @@ from __future__ import annotations +import asyncio import logging +import time +from contextlib import asynccontextmanager from datetime import datetime, timezone from contracts import SessionStartBanner @@ -42,6 +45,88 @@ def _is_stale_proposal(decision: dict) -> bool: return False +# ── V1 A2-light: per-repo write barrier ───────────────────────────────── +# Module-level registry of per-repo asyncio.Locks. Serializes mutating +# handlers against the same repo inside a single MCP server process. +# Deliberately does NOT protect: +# - handlers/resolve_compliance.py (destructive path — V2 scope) +# - cross-process writers (requires sync-token CAS at commit time — V2) +# Scope is intentionally narrow; see docs/v2-desync-optimization-guide.md +# §5.7 for the V2 expansion (region fingerprint + sync-token CAS). +_repo_locks: dict[str, asyncio.Lock] = {} +_repo_locks_guard: asyncio.Lock | None = None + + +def _guard() -> asyncio.Lock: + """Lazily create the guard in whatever loop the first caller runs in. + + Creating an asyncio.Lock at import time binds it to a loop that may + not exist yet (e.g. tests using asyncio.run each spin up a fresh loop). + Lazy creation inside a coroutine avoids the "lock bound to wrong loop" + pitfall. + """ + global _repo_locks_guard + if _repo_locks_guard is None: + _repo_locks_guard = asyncio.Lock() + return _repo_locks_guard + + +async def _get_repo_lock(repo_path: str) -> asyncio.Lock: + async with _guard(): + lock = _repo_locks.get(repo_path) + if lock is None: + lock = asyncio.Lock() + _repo_locks[repo_path] = lock + return lock + + +@asynccontextmanager +async def repo_write_barrier(ctx): + """Serialize code-shape mutations against the same repo in-process. + + V1 scope: wrap `handle_bind` only. Different repos run concurrently; + same repo is serialized. Yields a mutable ``BarrierTiming`` holder + whose ``held_ms`` attribute is set when the barrier exits, so the + enclosing handler can attach it to its response. Lock always releases + on exit (including exceptions). Fail-safe: if ``ctx.repo_path`` is + missing, falls back to key ``"."`` so the barrier still serializes. + """ + repo = getattr(ctx, "repo_path", "") or "." + lock = await _get_repo_lock(repo) + timing = BarrierTiming() + async with lock: + t0 = time.perf_counter() + try: + yield timing + finally: + timing.held_ms = round((time.perf_counter() - t0) * 1000, 3) + + +class BarrierTiming: + """Mutable timing holder yielded by ``repo_write_barrier``. + + ``held_ms`` is populated when the barrier's ``async with`` exits. + Handlers read it after the ``async with`` block to attach the number + to their ``SyncMetrics`` response field. + """ + __slots__ = ("held_ms",) + + def __init__(self) -> None: + self.held_ms: float | None = None + + +def _reset_repo_locks_for_tests() -> None: + """Drop all registered repo locks. Test-only helper. + + Lets each test start with a fresh lock registry so lock identity is + deterministic within a single test. Not exposed outside the test + module. + """ + global _repo_locks_guard + _repo_locks.clear() + _repo_locks_guard = None + + async def get_session_start_banner(ctx) -> SessionStartBanner | None: """Return an open-items banner on the first MCP call of a session. diff --git a/ledger/adapter.py b/ledger/adapter.py index cc792d2a..50a6126a 100644 --- a/ledger/adapter.py +++ b/ledger/adapter.py @@ -436,6 +436,13 @@ async def ingest_commit( old_status = decision.get("status", "ungrounded") # If symbol disappeared, emit a grounding check instead of compliance check. + # V1 D1: the payload is informational only — no server-side + # candidate suggestions (search_code was removed in v0.6.4). + # The caller LLM finds the new location via Grep/Read + + # validate_symbols / extract_symbols, then calls bicameral.bind + # (per the verification_instruction). ``original_lines`` is + # included so the caller can inspect the prior code via + # ``git show :`` if useful. if symbol_disappeared: pending_grounding_checks.append({ "decision_id": decision_id, @@ -443,6 +450,7 @@ async def ingest_commit( "reason": "symbol_disappeared", "file_path": file_path, "symbol": symbol_name, + "original_lines": [start_line, end_line], }) continue @@ -489,8 +497,13 @@ async def ingest_commit( try: ungrounded_decisions = await get_all_decisions(self._client, filter="ungrounded") for d in ungrounded_decisions: + # get_all_decisions returns rows with `decision_id` (aliased + # from id via `type::string(id) AS decision_id`); reading + # `d["id"]` returns "" and produces unusable grounding + # checks the caller cannot bind against. Surfaced by V1 F1 + # regression coverage. pending_grounding_checks.append({ - "decision_id": str(d.get("id", "")), + "decision_id": str(d.get("decision_id") or d.get("id", "")), "description": str(d.get("description", "")), "reason": "ungrounded", }) diff --git a/ledger/ast_diff.py b/ledger/ast_diff.py new file mode 100644 index 00000000..e452fad8 --- /dev/null +++ b/ledger/ast_diff.py @@ -0,0 +1,121 @@ +"""V1 B1 — tree-sitter cosmetic-change classifier (strict whitelist). + +``is_cosmetic_change(before, after, lang)`` returns ``True`` only when +two snippets differ by whitespace alone — intra-line horizontal whitespace +outside string literals, trailing whitespace, or blank lines between +statements. Anything else routes to L3 with no ``cosmetic_hint``: + +* identifier renames (kwargs / reflection / ORM lookups / template names) +* trailing-comma additions (Python tuple semantics, JS edge cases) +* comment edits (``# type: ignore``, ``// @ts-ignore``, build tags) +* docstring edits (observable via ``__doc__``, JSDoc tooling) +* string-literal edits, import reorders, any AST node insertion / deletion + +Read-path advisory ONLY — never mutates ``content_hash``, never gates +drift detection. The output is metadata for the eventual V2 caller-LLM +verdict prompt (``cosmetic_hint`` field on ``DriftEntry``). False +negatives — real cosmetic changes routed unbiased to L3 — are cheap; +false positives bias the L3 prompt toward "looks fine," exactly the +failure mode the strict whitelist prevents. + +Strategy: parse both inputs with tree-sitter, build a recursive +``(node.type, leaf_bytes_or_children)`` signature for each tree, and +compare. Two trees with the same signature differ only by whitespace +between tokens — tree-sitter does not represent inter-token whitespace +as nodes, so any non-whitespace difference (a different identifier, a +different comment, a different number of statements in a block) shows +up either as a different leaf-byte payload or as a different node-type +sequence in the tuple. Either case returns ``False``. +""" + +from __future__ import annotations + +import logging +from typing import Any + +from code_locator.indexing.symbol_extractor import LANGUAGE_FALLBACK, _get_parser + +logger = logging.getLogger(__name__) + + +# Languages B1 actually classifies. Anything else returns False (fail-safe). +# Matches the set wired into code_locator/indexing/symbol_extractor.py so +# the cosmetic detector never silently diverges from the indexer. +SUPPORTED_LANGUAGES: frozenset[str] = frozenset({ + "python", + "javascript", + "typescript", + "java", + "go", + "rust", + "c_sharp", + # via LANGUAGE_FALLBACK + "jsx", + "tsx", +}) + + +def is_cosmetic_change(before: str, after: str, lang: str) -> bool: + """Return True only if ``before → after`` is provably semantics-preserving. + + Args: + before: Pre-change source snippet (e.g. the bound region's stored + baseline bytes). + after: Post-change source snippet (e.g. the same region's bytes + at the live working tree). + lang: Language identifier — e.g. ``"python"``, ``"typescript"``, + ``"jsx"``. Resolved through ``LANGUAGE_FALLBACK`` first + (``jsx``/``tsx`` map to their parent languages). + + Returns: + ``True`` only when the two snippets are syntactically identical + modulo whitespace. ``False`` for unsupported languages, parse + failures, parse-error trees, or any structural difference. + """ + if before == after: + return True + + normalized = lang.lower().strip() + if normalized not in SUPPORTED_LANGUAGES: + return False + resolved = LANGUAGE_FALLBACK.get(normalized, normalized) + + # Single guarded block: parse + tree-error check + recursive signature + # comparison all live under one try/except so the function obeys its + # documented "fail-safe → False" contract even when ``_signature`` + # blows the recursion limit on a deeply nested AST. + try: + parser = _get_parser(resolved) + before_bytes = before.encode("utf-8") + after_bytes = after.encode("utf-8") + tree_before = parser.parse(before_bytes) + tree_after = parser.parse(after_bytes) + # If either input doesn't parse cleanly, refuse to call it cosmetic. + if tree_before.root_node.has_error or tree_after.root_node.has_error: + return False + return _signature(tree_before.root_node, before_bytes) == \ + _signature(tree_after.root_node, after_bytes) + except (Exception, RecursionError) as exc: + logger.debug("[ast_diff] classifier failed for %s: %s", normalized, exc) + return False + + +def _signature(node: Any, source: bytes) -> tuple: + """Recursive ``(node.type, child_sigs | leaf_bytes)`` signature. + + For interior nodes, the signature is ``(type, tuple_of_child_sigs)``. + For leaf nodes, the signature is ``(type, leaf_bytes)`` — which + captures identifier text, keyword text, operator text, string + contents, comment contents, and so on. + + Two signatures are equal iff the trees have identical node-type + structure AND identical leaf bytes. Any other difference — a + different identifier, an extra statement, an edited comment — + produces a signature mismatch. + """ + if node.child_count == 0: + return (node.type, source[node.start_byte:node.end_byte]) + return ( + node.type, + tuple(_signature(child, source) for child in node.children), + ) diff --git a/skills/bicameral-preflight/SKILL.md b/skills/bicameral-preflight/SKILL.md index f1de5b91..fde5525a 100644 --- a/skills/bicameral-preflight/SKILL.md +++ b/skills/bicameral-preflight/SKILL.md @@ -108,6 +108,12 @@ The handler runs `bicameral.search` internally, gates on the user's `guided_mode` setting, conditionally chains to `bicameral.brief`, and returns a `PreflightResponse` with a `fired: bool` field. +The response also carries an optional `sync_metrics` +(`{sync_catchup_ms, barrier_held_ms}`) observability field for the +catch-up time spent in `ensure_ledger_synced`. **Skip rendering it** — +these are server-side latency numbers, not user-visible signal. Log +them if you're profiling, otherwise ignore. + ### 2.5 Render session-start banner if present Before evaluating `response.fired`, check `response.session_start_banner`. @@ -198,6 +204,14 @@ Categories with no ask-findings are silently skipped. If every finding in every category is mechanical, produce NO output (same as `fired=false` — silent). +**Cosmetic drift rule**: if a `drifted` entry has `cosmetic_hint=true`, +classify it as **mechanical** regardless of guided mode. The server has +verified via AST comparison that the change is whitespace-only and +semantically inert — the stored intent is still intact. Auto-resolve +silently; do NOT add it to the drift ask-queue and do NOT emit a +blocking hint. Render it with `~` prefix (not `⚠ DRIFTED:`) if you +render it at all — see the template in Step 5. + ### 5. Render the surfaced block When at least one ask-finding exists, surface the response using this @@ -218,6 +232,11 @@ format. Lead with the `(bicameral surfaced)` attribution line. Source: Drift evidence: + ~ REFORMATTED: ← cosmetic_hint=true only + :: + Source: + (whitespace-only change — intent intact, no action needed) + ⚠ N divergent decision pair(s) — pick a winner before continuing: • (): @@ -310,8 +329,9 @@ For every finding this skill surfaces, classify first: - **mechanical** — one obvious correct answer (e.g., renamed symbol with identical signature; a decision whose code moved but semantics - are intact). Auto-apply the resolution silently. Do NOT ask the - user. + are intact; a `drifted` entry with `cosmetic_hint=true` — AST + comparison confirmed whitespace-only change). Auto-apply the + resolution silently. Do NOT ask the user. - **ask** — reasonable people could disagree (e.g., drifted behavior where the old decision may still be valid; divergent decisions where no clear winner exists). Emit ONE question per finding, using the diff --git a/tests/bench_drift.py b/tests/bench_drift.py new file mode 100644 index 00000000..e56477fc --- /dev/null +++ b/tests/bench_drift.py @@ -0,0 +1,272 @@ +"""Drift benchmark harness — V1 task A1. + +Measures wall-clock latency for the three read-path handlers most relevant +to drift workflows: + + - handle_search_decisions + - handle_detect_drift + - handle_link_commit (the catch-up path used by every read handler) + +Marked @pytest.mark.bench so normal `pytest tests/` runs skip it. +Run explicitly: + + pytest tests/bench_drift.py -v -m bench -s + +Output: test-results/bench/drift_baseline.json with per-handler +p50/p95/max wall-clock numbers, plus a stdout summary table. + +This is a *baseline* harness. V1 acceptance does not enforce hard latency +thresholds — only that the numbers are reproducible and documented. +The numbers feed the V2 design budget (PLAN.md:83 targets: +search_decisions < 2s, detect_drift < 1s on 100+ decisions). +""" + +from __future__ import annotations + +import asyncio +import json +import statistics +import time +from pathlib import Path + +import pytest + +from adapters.code_locator import get_code_locator +from adapters.ledger import reset_ledger_singleton +from context import BicameralContext +from handlers.detect_drift import handle_detect_drift +from handlers.ingest import handle_ingest +from handlers.link_commit import handle_link_commit +from handlers.search_decisions import handle_search_decisions + +RESULTS_DIR = Path(__file__).parent.parent / "test-results" / "bench" + +# Tunables — keep modest so CI doesn't blow up +N_DECISIONS = 100 +N_FILES_TARGET = 25 +SEARCH_QUERIES = [ + "ledger ingestion", + "drift detection", + "code region", + "symbol resolution", + "compliance check", + "tombstone", + "BM25 search", + "tree-sitter", + "session banner", + "graph walk", +] +SEARCH_ITERATIONS_PER_QUERY = 5 +DRIFT_ITERATIONS_PER_FILE = 3 +LINK_COMMIT_ITERATIONS = 5 + + +@pytest.fixture +def bench_env(monkeypatch, tmp_path): + """Fresh in-memory ledger + REPO_PATH pointing at the actual repo.""" + monkeypatch.setenv("SURREAL_URL", "memory://") + monkeypatch.setenv("REPO_PATH", str(Path(__file__).resolve().parents[1])) + reset_ledger_singleton() + RESULTS_DIR.mkdir(parents=True, exist_ok=True) + yield + reset_ledger_singleton() + + +@pytest.fixture +def bench_ctx(bench_env): + return BicameralContext.from_env() + + +def _percentiles(samples: list[float]) -> dict[str, float]: + if not samples: + return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0} + s = sorted(samples) + return { + "p50": statistics.median(s), + "p95": s[max(0, int(0.95 * len(s)) - 1)], + "max": s[-1], + "mean": statistics.fmean(s), + "n": len(s), + } + + +async def _collect_real_symbols(adapter, repo_path: Path, n_files_target: int) -> list[dict]: + """Walk a curated set of repo files and extract their symbols via tree-sitter. + + Uses ``adapter.extract_symbols`` (which goes straight to tree-sitter and + does not require the BM25/SQLite index to be built). Picks 25+ files from + the handlers/ledger/code_locator subtrees so the bench mirrors real + workload shape without bench setup needing to call + ``code_locator index `` first. + """ + seed_dirs = [ + repo_path / "handlers", + repo_path / "ledger", + repo_path / "code_locator", + repo_path / "adapters", + ] + files: list[Path] = [] + for d in seed_dirs: + if d.exists(): + files.extend(sorted(p for p in d.rglob("*.py") if p.is_file() and "__pycache__" not in p.parts)) + + collected: list[dict] = [] + seen_pairs: set[str] = set() + for fp in files: + if len({c["file_path"] for c in collected}) >= n_files_target and len(collected) >= 80: + break + try: + records = await adapter.extract_symbols(str(fp)) + except Exception: + continue + rel = str(fp.relative_to(repo_path)) + for rec in records[:6]: # cap per-file to keep distribution flat + sym = rec.get("symbol_name") or rec.get("name") or "" + line = rec.get("start_line") or rec.get("line_number") or 1 + if not sym: + continue + key = f"{rel}::{sym}" + if key in seen_pairs: + continue + seen_pairs.add(key) + collected.append({ + "file_path": rel, + "symbol_name": sym, + "line_number": line, + }) + return collected + + +def _build_payload(symbols: list[dict], batch_idx: int, batch_size: int) -> dict: + """Build one ingest payload covering `batch_size` decisions. + + Each mapping pairs a synthetic intent with a real symbol so the + ledger can ground it via search_code at ingest time. + """ + mappings = [] + for i in range(batch_size): + sym = symbols[(batch_idx * batch_size + i) % len(symbols)] + mappings.append({ + "span": { + "span_id": f"bench-{batch_idx}-{i}", + "source_type": "transcript", + "text": f"Bench decision {batch_idx}-{i} about {sym['symbol_name']}", + "speaker": "bench", + "source_ref": f"bench-meeting-{batch_idx}", + }, + "intent": f"Bench decision {batch_idx}-{i}: maintain {sym['symbol_name']} in {sym['file_path']}", + "symbols": [sym["symbol_name"]], + "code_regions": [{ + "file_path": sym["file_path"], + "symbol": sym["symbol_name"], + "type": "function", + "start_line": sym["line_number"], + "end_line": sym["line_number"] + 20, + "purpose": f"bench batch {batch_idx} item {i}", + }], + "dependency_edges": [], + }) + return { + "query": f"bench batch {batch_idx}", + "repo": ".", + "commit_hash": f"bench-{batch_idx}", + "analyzed_at": "2026-04-24T00:00:00Z", + "mappings": mappings, + } + + +@pytest.mark.bench +def test_drift_baseline(bench_ctx): + """Baseline-measurement run for V1 A1. + + Seeds N_DECISIONS decisions, syncs, then times the three handlers. + Writes JSON artifact + prints stdout summary. + """ + asyncio.run(_run_bench(bench_ctx)) + + +async def _run_bench(ctx) -> None: + adapter = get_code_locator() + + # --- Setup: collect real symbols, ingest 100 decisions in batches of 10 --- + symbols = await _collect_real_symbols(adapter, Path(ctx.repo_path), n_files_target=N_FILES_TARGET) + assert len(symbols) >= 25, f"Only got {len(symbols)} symbols; need >= 25 for realistic bench" + + batch_size = 10 + n_batches = N_DECISIONS // batch_size + print(f"\n[bench] Ingesting {N_DECISIONS} decisions across {len(symbols)} unique symbols ({n_batches} batches of {batch_size})") + + setup_start = time.perf_counter() + for b in range(n_batches): + payload = _build_payload(symbols, batch_idx=b, batch_size=batch_size) + await handle_ingest(ctx, payload) + setup_elapsed = time.perf_counter() - setup_start + print(f"[bench] Setup ingest done in {setup_elapsed:.2f}s") + + # Initial baseline sync (link_commit HEAD) — also serves as warm-up + warm_start = time.perf_counter() + await handle_link_commit(ctx, "HEAD") + print(f"[bench] Warm-up link_commit(HEAD) in {time.perf_counter() - warm_start:.3f}s") + + # --- Measure: link_commit (already-synced fast path) --- + link_commit_samples = [] + for _ in range(LINK_COMMIT_ITERATIONS): + t0 = time.perf_counter() + await handle_link_commit(ctx, "HEAD") + link_commit_samples.append(time.perf_counter() - t0) + + # --- Measure: search_decisions across N queries × M iterations --- + search_samples = [] + for q in SEARCH_QUERIES: + for _ in range(SEARCH_ITERATIONS_PER_QUERY): + t0 = time.perf_counter() + await handle_search_decisions(ctx, q, max_results=10) + search_samples.append(time.perf_counter() - t0) + + # --- Measure: detect_drift across the touched files × M iterations --- + file_paths = sorted({s["file_path"] for s in symbols}) + drift_samples = [] + for fp in file_paths: + for _ in range(DRIFT_ITERATIONS_PER_FILE): + t0 = time.perf_counter() + await handle_detect_drift(ctx, fp) + drift_samples.append(time.perf_counter() - t0) + + # --- Aggregate + write artifact --- + report = { + "config": { + "n_decisions": N_DECISIONS, + "n_symbols": len(symbols), + "n_files": len(file_paths), + "n_search_queries": len(SEARCH_QUERIES), + "search_iterations_per_query": SEARCH_ITERATIONS_PER_QUERY, + "drift_iterations_per_file": DRIFT_ITERATIONS_PER_FILE, + "link_commit_iterations": LINK_COMMIT_ITERATIONS, + }, + "setup": { + "ingest_total_seconds": round(setup_elapsed, 4), + "ingest_per_decision_seconds": round(setup_elapsed / N_DECISIONS, 4), + }, + "handlers": { + "search_decisions": _percentiles(search_samples), + "detect_drift": _percentiles(drift_samples), + "link_commit_warm": _percentiles(link_commit_samples), + }, + } + + out_path = RESULTS_DIR / "drift_baseline.json" + out_path.write_text(json.dumps(report, indent=2)) + + # Stdout summary + print("\n" + "=" * 68) + print("DRIFT BENCHMARK BASELINE — V1 A1") + print("=" * 68) + print(f"Setup: {N_DECISIONS} decisions, {len(symbols)} symbols, {len(file_paths)} files") + print(f"Setup ingest: {setup_elapsed:.2f}s total ({setup_elapsed/N_DECISIONS*1000:.1f}ms / decision)") + print() + print(f"{'handler':<25} {'p50 (ms)':>10} {'p95 (ms)':>10} {'max (ms)':>10} {'n':>5}") + print("-" * 68) + for name, p in report["handlers"].items(): + print(f"{name:<25} {p['p50']*1000:>10.1f} {p['p95']*1000:>10.1f} {p['max']*1000:>10.1f} {p['n']:>5}") + print("=" * 68) + print(f"Artifact: {out_path}") diff --git a/tests/conftest.py b/tests/conftest.py index d34589a8..2cdfc0d9 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -21,6 +21,7 @@ def pytest_configure(config): config.addinivalue_line("markers", "phase2: requires SurrealDBLedgerAdapter + SurrealDB") config.addinivalue_line("markers", "phase3: full E2E — requires both Phase 1 + Phase 2") config.addinivalue_line("markers", "alpha_flow: Jacob North Star regression suite — v0.7 gate") + config.addinivalue_line("markers", "bench: drift benchmark harness (V1 A1) — skipped by default, run with -m bench") @pytest.fixture(autouse=True) diff --git a/tests/test_ast_diff.py b/tests/test_ast_diff.py new file mode 100644 index 00000000..1c2ddbec --- /dev/null +++ b/tests/test_ast_diff.py @@ -0,0 +1,163 @@ +"""Tests for ledger/ast_diff.py — V1 B1 cosmetic-change classifier. + +Whitelist tests: changes that ``is_cosmetic_change`` MUST classify True. +Anti-whitelist tests: changes that MUST classify False even though they +"look" mechanical — variable renames, trailing commas, comment edits, +docstring edits, tool directives. False positives in this layer would +bias the V2 caller-LLM verdict prompt toward "looks fine" on +behaviorally-different code. +""" +from __future__ import annotations + +import pytest + +from ledger.ast_diff import is_cosmetic_change + + +# ── Whitelist: must return True ───────────────────────────────────── + + +def test_identical_bytes(): + assert is_cosmetic_change("def f(): return 1", "def f(): return 1", "python") is True + + +def test_intra_line_horizontal_whitespace_python(): + """Spaces tightened around an operator — token stream unchanged.""" + before = "def f(x):\n return x+1\n" + after = "def f(x):\n return x + 1\n" + assert is_cosmetic_change(before, after, "python") is True + + +def test_blank_line_between_statements_python(): + before = "def f():\n a()\n b()\n" + after = "def f():\n a()\n\n b()\n" + assert is_cosmetic_change(before, after, "python") is True + + +def test_trailing_whitespace_stripped_python(): + before = "def f(): \n return 1 \n" + after = "def f():\n return 1\n" + assert is_cosmetic_change(before, after, "python") is True + + +def test_indent_width_change_python(): + """Two-space vs four-space indent — same logical block structure.""" + before = "def f():\n return 1\n" + after = "def f():\n return 1\n" + assert is_cosmetic_change(before, after, "python") is True + + +def test_intra_line_whitespace_javascript(): + before = "function f(){return 1+2;}" + after = "function f() { return 1 + 2; }" + assert is_cosmetic_change(before, after, "javascript") is True + + +# ── Anti-whitelist: must return False ─────────────────────────────── + + +def test_variable_rename_python(): + """Renames are observable via kwargs/reflection/ORM — never cosmetic.""" + before = "def f(x):\n return x + 1\n" + after = "def f(y):\n return y + 1\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_function_rename_python(): + before = "def calculateDiscount(x):\n return x * 0.1\n" + after = "def computeDiscount(x):\n return x * 0.1\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_trailing_comma_added_python(): + """`(x,)` is a 1-tuple, `(x)` is a parenthesized expression.""" + before = "x = (1)\n" + after = "x = (1,)\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_line_comment_edited_python(): + """Comments carry tool directives like # type: ignore — never cosmetic.""" + before = "def f():\n # old comment\n return 1\n" + after = "def f():\n # new comment\n return 1\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_type_ignore_added_python(): + before = "x = something()\n" + after = "x = something() # type: ignore\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_noqa_added_python(): + before = "import sys\n" + after = "import sys # noqa: F401\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_docstring_edited_python(): + """Docstrings are observable via __doc__ — never cosmetic.""" + before = 'def f():\n """Original."""\n return 1\n' + after = 'def f():\n """Updated."""\n return 1\n' + assert is_cosmetic_change(before, after, "python") is False + + +def test_string_literal_edited_python(): + before = 'route = "/api/v1/users"\n' + after = 'route = "/api/v2/users"\n' + assert is_cosmetic_change(before, after, "python") is False + + +def test_import_reorder_python(): + before = "import os\nimport sys\n" + after = "import sys\nimport os\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_ts_ignore_added_typescript(): + before = "const x = something();\n" + after = "// @ts-ignore\nconst x = something();\n" + assert is_cosmetic_change(before, after, "typescript") is False + + +def test_block_restructured_python(): + """Re-indenting moves a statement out of an if block — semantics change.""" + before = "if x:\n a()\n b()\n" + after = "if x:\n a()\nb()\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_statement_added_python(): + before = "def f():\n return 1\n" + after = "def f():\n log()\n return 1\n" + assert is_cosmetic_change(before, after, "python") is False + + +# ── Failure modes — fail safe ──────────────────────────────────────── + + +def test_unsupported_language_returns_false(): + assert is_cosmetic_change("foo", "bar", "ruby") is False + assert is_cosmetic_change("foo", "bar", "elixir") is False + assert is_cosmetic_change("foo", "bar", "") is False + + +def test_parse_error_returns_false(): + """Syntactically broken code → don't claim cosmetic.""" + before = "def f(:\n pass\n" # broken + after = "def f():\n pass\n" + assert is_cosmetic_change(before, after, "python") is False + + +def test_jsx_routes_through_javascript(): + """JSX/TSX fall back to javascript/typescript per LANGUAGE_FALLBACK. + + Inputs must differ in bytes (otherwise the early-return at the top of + is_cosmetic_change short-circuits and the fallback path is never + exercised). Whitespace-only diff keeps the expected outcome True + while forcing the LANGUAGE_FALLBACK['jsx'] → 'javascript' resolution + and the _get_parser code path to actually run. + """ + before = "const X = () =>
hi
" + after = "const X = () =>
hi
" # extra spaces in the JS portion + assert is_cosmetic_change(before, after, "jsx") is True diff --git a/tests/test_b2_cosmetic_hint.py b/tests/test_b2_cosmetic_hint.py new file mode 100644 index 00000000..41953ec9 --- /dev/null +++ b/tests/test_b2_cosmetic_hint.py @@ -0,0 +1,162 @@ +"""Tests for V1 B2 — cosmetic_hint enrichment on DriftEntry. + +Exercises ``handlers.detect_drift._enrich_with_cosmetic_hints`` and the +end-to-end flow through ``handle_detect_drift`` to confirm the advisory +flag is set correctly on drifted entries and never on non-drifted ones. + +Codex pass-7 finding #3 (B1) and pass-8 finding #1 (B2/B3) require: + - cosmetic_hint is metadata only — never mutates content_hash + - cosmetic_hint stays False for renames / docstring edits / etc. + - cosmetic_hint=True only for whitespace-only diffs +""" +from __future__ import annotations + +from pathlib import Path + +import pytest + +from contracts import DriftEntry +from handlers.detect_drift import _enrich_with_cosmetic_hints + + +def _make_entry(status: str = "drifted", lines: tuple[int, int] = (1, 3)) -> DriftEntry: + return DriftEntry( + decision_id="decision:bench", + description="Bench decision", + status=status, # type: ignore[arg-type] + symbol="f", + lines=lines, + source_ref="bench", + ) + + +def _write_file(repo: Path, rel: str, content: str) -> None: + p = repo / rel + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(content) + + +@pytest.fixture +def repo_with_baseline(tmp_path): + """Create a tmp repo, commit a baseline file, then leave a working-tree edit hook in place. + + Returns the repo path and the relative file path. Tests then overwrite + the working-tree file to whatever they need to compare against HEAD. + """ + import subprocess + repo = tmp_path / "repo" + repo.mkdir() + subprocess.run(["git", "init", "-q"], cwd=repo, check=True) + subprocess.run(["git", "config", "user.email", "bench@test"], cwd=repo, check=True) + subprocess.run(["git", "config", "user.name", "bench"], cwd=repo, check=True) + + rel = "src/example.py" + baseline = "def f(x):\n return x + 1\n" + _write_file(repo, rel, baseline) + subprocess.run(["git", "add", "-A"], cwd=repo, check=True) + subprocess.run(["git", "commit", "-q", "-m", "baseline"], cwd=repo, check=True) + return repo, rel + + +def test_whitespace_only_edit_sets_cosmetic_hint_true(repo_with_baseline): + repo, rel = repo_with_baseline + # Working tree edits whitespace only. + _write_file(repo, rel, "def f(x):\n return x + 1\n") + entry = _make_entry(status="drifted", lines=(1, 2)) + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is True + + +def test_variable_rename_keeps_cosmetic_hint_false(repo_with_baseline): + repo, rel = repo_with_baseline + _write_file(repo, rel, "def f(y):\n return y + 1\n") + entry = _make_entry(status="drifted", lines=(1, 2)) + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is False + + +def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline, tmp_path): + repo, rel = repo_with_baseline + _write_file(repo, rel, "def f(x):\n return x + 1\n") + # Now overwrite baseline by committing a docstring-only version, then edit working tree. + import subprocess + _write_file(repo, rel, 'def f(x):\n """Old."""\n return x + 1\n') + subprocess.run(["git", "add", "-A"], cwd=repo, check=True) + subprocess.run(["git", "commit", "-q", "-m", "add docstring"], cwd=repo, check=True) + # Working tree edits the docstring text — observable via __doc__. + _write_file(repo, rel, 'def f(x):\n """New."""\n return x + 1\n') + entry = _make_entry(status="drifted", lines=(1, 3)) + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is False + + +def test_pending_entry_skipped(repo_with_baseline): + """Non-drifted entries are not enriched.""" + repo, rel = repo_with_baseline + _write_file(repo, rel, "def f(x):\n return x + 1\n") + entry = _make_entry(status="pending", lines=(1, 2)) + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is False # default — never touched + + +def test_no_diff_keeps_cosmetic_hint_false(repo_with_baseline): + """If working tree matches HEAD byte-for-byte, hint stays False (meaningless).""" + repo, rel = repo_with_baseline + # Don't modify working tree — it equals HEAD. + entry = _make_entry(status="drifted", lines=(1, 2)) + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is False + + +def test_unsupported_extension_keeps_cosmetic_hint_false(tmp_path): + """Files outside EXTENSION_LANGUAGE never get a hint.""" + import subprocess + repo = tmp_path / "repo2" + repo.mkdir() + subprocess.run(["git", "init", "-q"], cwd=repo, check=True) + subprocess.run(["git", "config", "user.email", "bench@test"], cwd=repo, check=True) + subprocess.run(["git", "config", "user.name", "bench"], cwd=repo, check=True) + _write_file(repo, "x.rb", "puts 'hi'\n") + subprocess.run(["git", "add", "-A"], cwd=repo, check=True) + subprocess.run(["git", "commit", "-q", "-m", "init"], cwd=repo, check=True) + _write_file(repo, "x.rb", "puts 'hi'\n") + entry = _make_entry(status="drifted", lines=(1, 1)) + _enrich_with_cosmetic_hints([entry], "x.rb", str(repo)) + assert entry.cosmetic_hint is False + + +def test_unresolvable_symbol_skipped(repo_with_baseline): + """Entries whose symbol can't be resolved against HEAD/WT fail safe to False. + + Per the V1 alignment refactor, ``entry.lines`` is no longer the + slicing input — the enrichment uses ``resolve_symbol_lines`` per + ref to align HEAD and working-tree slices to the symbol body. If + resolution returns None on either side (symbol absent, missing + symbol name, etc.), the hint stays at its False default. + """ + repo, rel = repo_with_baseline + _write_file(repo, rel, "def f(x):\n return x + 1\n") + # Symbol name that does not exist in the file → resolve_symbol_lines + # returns None → enrichment skips this entry. + entry = _make_entry(status="drifted", lines=(1, 2)) + entry.symbol = "nonexistent_symbol" + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + assert entry.cosmetic_hint is False + + +def test_content_hash_never_mutated(repo_with_baseline): + """Codex pass-1 finding #2 invariant: hint computation never writes baseline. + + The enrichment runs on DriftEntry models in memory; verify nothing + on disk or in the entry tuple is touched besides the cosmetic_hint + field itself. + """ + repo, rel = repo_with_baseline + _write_file(repo, rel, "def f(x):\n return x + 1\n") + entry = _make_entry(status="drifted", lines=(1, 2)) + snapshot = entry.model_dump() + _enrich_with_cosmetic_hints([entry], rel, str(repo)) + after = entry.model_dump() + # Only cosmetic_hint may differ. + diff = {k: (snapshot[k], after[k]) for k in snapshot if snapshot[k] != after[k]} + assert set(diff.keys()) <= {"cosmetic_hint"}, diff diff --git a/tests/test_desync_scenarios.py b/tests/test_desync_scenarios.py new file mode 100644 index 00000000..44730023 --- /dev/null +++ b/tests/test_desync_scenarios.py @@ -0,0 +1,554 @@ +"""Canonical regression matrix for the 13 desync scenarios from the Notion +"Auto-Grounding Problem" catalog (Notion ID 3332a51619c4813caccec86c36d9bf98). + +This is V1 F1 — one consolidated test file routing every scenario through +the **real handler layer** (Apr 8 PR #84 lesson: tests that bypass to +``ledger.ingest_payload`` directly miss the auto-grounding hooks). Each test +proves V1 behavior or ``xfail``s with a pointer to the V2 design-doc section +that resolves it. + +Scenario list (severity tiers from the Notion catalog): + + 1. New decision ingested, matching code exists (was P0) + 2. Code changed after decision was grounded (working) + 3. Code deleted after decision was grounded (working) + 4. Symbol renamed (refactor) (P1) + 5. Symbol moved to different file (P1) + 6. Code index rebuilt with new symbols (was P0) + 7. Cold start: no code index (working) + 8. Drifted intent — recoverable (P1, V2) + 9. Intent description updated (supersession) (P2) + 10. Multiple intents map to same symbol (working) + 11. BM25 false-positive grounding (post-v0.6.0: N/A) + 12. Code region line numbers shift (insertion above) (working) + 13. Open-question prefix → not-claimed (v0.5.x) + +Post-v0.6.0 architectural note: server-side auto-grounding (BM25 → bind +edges) was removed; the caller LLM owns code retrieval and writes bindings +explicitly via ``bicameral.bind``. Several scenarios that were originally +P0 ("auto-grounding not wired") now pass via the caller-LLM flow rather +than via server-side magic. Scenarios depending on V2-only tools +(``bicameral_rebind``, ``record_compliance_verdict``) are marked xfail. +""" +from __future__ import annotations + +import subprocess +from pathlib import Path +from textwrap import dedent + +import pytest + +from adapters.ledger import reset_ledger_singleton +from context import BicameralContext +from handlers.bind import handle_bind +from handlers.detect_drift import handle_detect_drift +from handlers.ingest import handle_ingest +from handlers.link_commit import handle_link_commit, invalidate_sync_cache + + +# ── Helpers ────────────────────────────────────────────────────────── + + +def _git(cwd: Path, *args: str) -> str: + return subprocess.run( + ["git", *args], + cwd=cwd, + capture_output=True, + text=True, + check=True, + ).stdout.strip() + + +def _commit(repo: Path, msg: str) -> None: + _git(repo, "add", "-A") + _git(repo, "-c", "commit.gpgsign=false", "commit", "-q", "-m", msg) + + +def _seed_repo(repo: Path, files: dict[str, str]) -> None: + """Create a fresh git repo on ``main`` with the given files committed.""" + repo.mkdir(parents=True, exist_ok=True) + _git(repo, "init", "-q", "-b", "main") + _git(repo, "config", "user.email", "t@e.com") + _git(repo, "config", "user.name", "tester") + for rel, body in files.items(): + path = repo / rel + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(dedent(body).strip() + "\n") + _commit(repo, "seed") + + +def _build_payload( + repo: Path, + *, + text: str, + intent: str, + code_regions: list[dict] | None = None, + source_ref: str = "scenario-test", +) -> dict: + return { + "query": intent, + "repo": str(repo), + "mappings": [ + { + "span": { + "source_type": "manual", + "text": text, + "source_ref": source_ref, + }, + "intent": intent, + "symbols": [r["symbol"] for r in (code_regions or []) if r.get("symbol")], + "code_regions": code_regions or [], + } + ], + } + + +@pytest.fixture +def _scenario_repo(monkeypatch, tmp_path): + """Fresh git repo on `main` + memory ledger. Each test gets a fresh fixture.""" + monkeypatch.setenv("USE_REAL_LEDGER", "1") + monkeypatch.setenv("SURREAL_URL", "memory://") + repo = tmp_path / "repo" + _seed_repo(repo, { + "src/payments.py": """ + def calculate_discount(order_total: float) -> float: + return order_total * 0.1 + """, + "src/auth.py": """ + def verify_token(token: str) -> bool: + return token.startswith("valid:") + """, + }) + monkeypatch.setenv("REPO_PATH", str(repo)) + monkeypatch.setenv("BICAMERAL_AUTHORITATIVE_REF", "main") + monkeypatch.chdir(repo) + reset_ledger_singleton() + yield repo + reset_ledger_singleton() + + +# ── Scenarios 1–13 ─────────────────────────────────────────────────── + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_01_new_decision_with_existing_code(_scenario_repo): + """An ingested decision with no code_regions surfaces as ungrounded + via pending_grounding_checks; the caller LLM grounds via bicameral.bind. + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Apply 10% discount on orders", + intent="Apply 10% discount on orders", + code_regions=[], + ) + ingest = await handle_ingest(ctx, payload) + assert ingest.ingested, f"ingest failed: {ingest}" + assert ingest.stats.ungrounded >= 1, f"expected ≥1 ungrounded after ingest, got: {ingest.stats}" + # NOTE: handle_ingest internally runs link_commit; the within-call sync + # cache forwards its pending_grounding_checks to subsequent calls. + # Do NOT invalidate the cache — the early-return path at + # ledger/adapter.py:333 skips the ungrounded sweep when changed_files + # is empty, so a cache miss would lose the grounding signal. + lc = await handle_link_commit(ctx, "HEAD") + ungrounded = [c for c in lc.pending_grounding_checks if c.get("reason") == "ungrounded"] + assert ungrounded, f"Expected ungrounded grounding check, got: {lc.pending_grounding_checks}" + decision_id = ungrounded[0]["decision_id"] + + bind_resp = await handle_bind(ctx, [{ + "decision_id": decision_id, + "file_path": "src/payments.py", + "symbol_name": "calculate_discount", + }]) + assert bind_resp.bindings + assert not bind_resp.bindings[0].error, bind_resp.bindings[0].error + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_02_code_changed_after_grounded_pending_until_verdict(_scenario_repo): + """Code-content change → status pending (awaiting caller-LLM verdict). + + Post-v0.5.0 derive_status semantics (``ledger/status.py:178-205``): + a hash diff WITHOUT a cached ``compliant`` verdict yields ``pending``, + not ``drifted``. ``drifted`` is reserved for cases where the caller + LLM has explicitly written a ``drifted`` verdict via the verdict + cache (V2 territory: see design doc §8 C2). For V1, the regression + we want is that a real code change DOES surface the affected + decision as a `pending_compliance_check` with new content_hash so + a future V2 caller can verdict it. + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Apply discount", + intent="Apply 10% discount", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, + "end_line": 2, + "type": "function", + "purpose": "discount calc", + }], + ) + await handle_ingest(ctx, payload) + + # Mutate the bound region. + (_scenario_repo / "src/payments.py").write_text( + "def calculate_discount(order_total: float) -> float:\n return order_total * 0.15\n" + ) + _commit(_scenario_repo, "raise discount to 15%") + invalidate_sync_cache(ctx) + lc = await handle_link_commit(ctx, "HEAD") + + # The compliance check should fire for the changed region. + pending = [p for p in lc.pending_compliance_checks if p.symbol == "calculate_discount"] + assert pending, ( + f"Expected pending_compliance_check for changed region, got: " + f"{[(p.symbol, p.phase) for p in lc.pending_compliance_checks]}" + ) + drift = await handle_detect_drift(ctx, "src/payments.py") + statuses = {d.status for d in drift.decisions if d.symbol == "calculate_discount"} + # Acceptable per-decision states across the relevant version contracts: + # - 'pending' / 'drifted' (pre-v0.7.0): hash differs, awaiting a verdict + # - 'proposal' (v0.7.0+): decision is drift-exempt until + # explicitly ratified via + # bicameral_ratify; the substantive + # drift signal still flows through + # pending_compliance_checks above. + assert statuses & {"pending", "drifted", "proposal"}, ( + f"Expected pending / drifted / proposal, got: " + f"{[(d.status, d.symbol) for d in drift.decisions]}" + ) + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_03_code_deleted_after_grounded_pending(_scenario_repo): + """File deleted → derive_status → pending (actual_hash is None).""" + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Apply discount", + intent="Apply 10% discount", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "discount calc", + }], + ) + await handle_ingest(ctx, payload) + + (_scenario_repo / "src/payments.py").unlink() + _commit(_scenario_repo, "remove payments") + invalidate_sync_cache(ctx) + lc = await handle_link_commit(ctx, "HEAD") + + # Symbol disappeared on authoritative ref. + disappeared = [c for c in lc.pending_grounding_checks if c.get("reason") == "symbol_disappeared"] + assert disappeared, f"Expected symbol_disappeared check, got: {lc.pending_grounding_checks}" + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_04_symbol_renamed_in_file(_scenario_repo): + """In-file rename → symbol_disappeared grounding check (V1 D1).""" + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Apply discount", + intent="Apply 10% discount", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "discount calc", + }], + ) + await handle_ingest(ctx, payload) + + (_scenario_repo / "src/payments.py").write_text( + "def compute_discount(order_total: float) -> float:\n return order_total * 0.1\n" + ) + _commit(_scenario_repo, "rename calculate_discount -> compute_discount") + invalidate_sync_cache(ctx) + lc = await handle_link_commit(ctx, "HEAD") + + disappeared = [c for c in lc.pending_grounding_checks if c.get("reason") == "symbol_disappeared"] + assert disappeared, f"Expected symbol_disappeared, got: {lc.pending_grounding_checks}" + assert disappeared[0]["symbol"] == "calculate_discount" + # V1 D1: original_lines is part of the payload. + assert "original_lines" in disappeared[0] + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_05_symbol_moved_to_different_file(_scenario_repo): + """Cross-file move → symbol_disappeared grounding check.""" + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Apply discount", + intent="Apply 10% discount", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "discount calc", + }], + ) + await handle_ingest(ctx, payload) + + (_scenario_repo / "src/payments.py").write_text("# moved\n") + (_scenario_repo / "src/pricing.py").write_text( + "def calculate_discount(order_total: float) -> float:\n return order_total * 0.1\n" + ) + _commit(_scenario_repo, "move discount calc to pricing.py") + invalidate_sync_cache(ctx) + lc = await handle_link_commit(ctx, "HEAD") + + disappeared = [c for c in lc.pending_grounding_checks if c.get("reason") == "symbol_disappeared"] + assert disappeared, f"Expected symbol_disappeared on cross-file move, got: {lc.pending_grounding_checks}" + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_06_code_added_ungrounded_resolvable(_scenario_repo): + """An ungrounded decision becomes resolvable once the matching symbol is added. + + Post-v0.6.0: the caller LLM is responsible for noticing the new symbol and + calling bicameral.bind. The server keeps surfacing pending_grounding_checks + until the caller binds. + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Add cart total endpoint", + intent="Cart total endpoint", + code_regions=[], + ) + await handle_ingest(ctx, payload) + # See scenario 1 note — do NOT invalidate before lc1; rely on cache + # forwarding the ungrounded check from ingest's internal link_commit. + lc1 = await handle_link_commit(ctx, "HEAD") + assert any(c.get("reason") == "ungrounded" for c in lc1.pending_grounding_checks) + + # Caller adds the matching code. + (_scenario_repo / "src/cart.py").write_text( + "def cart_total(items: list) -> float:\n return sum(i['price'] for i in items)\n" + ) + _commit(_scenario_repo, "add cart_total") + invalidate_sync_cache(ctx) + lc2 = await handle_link_commit(ctx, "HEAD") + + ungrounded = [c for c in lc2.pending_grounding_checks if c.get("reason") == "ungrounded"] + assert ungrounded, "Decision should still surface as ungrounded until caller binds" + decision_id = ungrounded[0]["decision_id"] + # Pass explicit lines — ctx.authoritative_sha is captured at ctx + # creation and is stale after the new commit, so resolve_symbol_lines + # would look at the wrong ref. Explicit lines bypass resolution. + bind_resp = await handle_bind(ctx, [{ + "decision_id": decision_id, + "file_path": "src/cart.py", + "symbol_name": "cart_total", + "start_line": 1, + "end_line": 2, + }]) + assert bind_resp.bindings and not bind_resp.bindings[0].error, ( + f"bind failed: {bind_resp.bindings[0].error if bind_resp.bindings else 'no result'}" + ) + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_07_cold_start_no_code_index(_scenario_repo, monkeypatch): + """Cold start with no symbols matching the intent → decision stays ungrounded. + + The seed repo has only ``calculate_discount`` and ``verify_token``. + A decision about something the repo doesn't contain stays ungrounded + until the caller binds it (which they cannot, since there's no target). + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Add Slack notification on signup", + intent="Slack notify on signup", + code_regions=[], + ) + await handle_ingest(ctx, payload) + # See scenario 1 note — do NOT invalidate the sync cache here. + lc = await handle_link_commit(ctx, "HEAD") + assert any(c.get("reason") == "ungrounded" for c in lc.pending_grounding_checks), ( + f"Expected ungrounded check on cold start, got: {lc.pending_grounding_checks}" + ) + + +@pytest.mark.xfail( + strict=True, + reason="V2: requires bicameral_rebind with old-binding CAS + fresh L3 verdict on the new target. See design doc §8 D2. Codex pass-10 finding #2.", +) +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_08_drifted_recoverable_via_atomic_rebind(_scenario_repo): + """A drifted decision whose code moved should re-ground atomically. + + V1 surfaces symbol_disappeared (scenarios 4/5) but offers no atomic + rebind — calling bicameral.bind on the new location leaves the old + edge live, producing duplicate bindings. xfailed until V2 D2. + """ + pytest.fail("V2 work — see design doc §8 D2.") + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_09_intent_description_supersession(_scenario_repo): + """Updated intent description supersedes the prior decision. + + Covered by tests/test_supersession.py. This test asserts the + canonical handler path doesn't raise during a re-ingest with + overlapping intent text. + """ + ctx = BicameralContext.from_env() + p1 = _build_payload( + _scenario_repo, + text="Apply discount", + intent="Apply 10% discount on orders", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "discount calc", + }], + source_ref="meeting-1", + ) + p2 = _build_payload( + _scenario_repo, + text="Apply discount with backoff", + intent="Apply 15% discount on orders over $100", + code_regions=[{ + "file_path": "src/payments.py", + "symbol": "calculate_discount", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "discount calc", + }], + source_ref="meeting-2", + ) + r1 = await handle_ingest(ctx, p1) + r2 = await handle_ingest(ctx, p2) + assert r1.ingested and r2.ingested + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_10_multiple_intents_share_symbol(_scenario_repo): + """Two decisions bound to the same symbol both surface on drift detection.""" + ctx = BicameralContext.from_env() + region = { + "file_path": "src/auth.py", + "symbol": "verify_token", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "auth check", + } + await handle_ingest(ctx, _build_payload( + _scenario_repo, text="Verify JWT", intent="Use JWT verification", + code_regions=[region], source_ref="m1", + )) + await handle_ingest(ctx, _build_payload( + _scenario_repo, text="Reject invalid", intent="Reject malformed tokens", + code_regions=[region], source_ref="m2", + )) + invalidate_sync_cache(ctx) + drift = await handle_detect_drift(ctx, "src/auth.py") + decision_ids = {d.decision_id for d in drift.decisions} + assert len(decision_ids) >= 2, ( + f"Expected ≥2 decisions sharing the same symbol, got {len(decision_ids)}: {decision_ids}" + ) + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_11_no_server_side_bm25_grounding_post_v060(_scenario_repo): + """Post-v0.6.0: server-side BM25 false-positive grounding is no longer a risk. + + The original P2 concern was that BM25 could match a decision to an + irrelevant symbol. v0.6.0 deleted the entire ``ground_mappings`` + pipeline; bindings now require an explicit ``bicameral.bind`` call + from the caller LLM. This test asserts an ingest WITHOUT + code_regions never auto-binds anything — the decision stays ungrounded + until the caller acts. + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="Validate webhook signatures", + # Intent text mentions "verify" / "token" — the seed repo has + # verify_token in src/auth.py. Pre-v0.6.0 BM25 would have matched. + intent="Verify webhook tokens", + code_regions=[], + ) + await handle_ingest(ctx, payload) + # See scenario 1 note — do NOT invalidate the sync cache here. + lc = await handle_link_commit(ctx, "HEAD") + # No edges should have been auto-created — decision stays ungrounded. + ungrounded = [c for c in lc.pending_grounding_checks if c.get("reason") == "ungrounded"] + assert ungrounded, "Post-v0.6.0 ingest must leave decisions ungrounded — no server-side bind" + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_12_line_shift_does_not_trigger_drift(_scenario_repo): + """Inserting blank lines above a tracked symbol must not trigger drift. + + resolve_symbol_lines re-resolves the symbol via tree-sitter, so the + region's content_hash is computed against the relocated span — not + the frozen line range from ingest time. + """ + ctx = BicameralContext.from_env() + region = { + "file_path": "src/auth.py", + "symbol": "verify_token", + "start_line": 1, "end_line": 2, + "type": "function", "purpose": "auth check", + } + await handle_ingest(ctx, _build_payload( + _scenario_repo, text="Use JWT", intent="JWT verification", + code_regions=[region], + )) + + # Insert blank lines above — line numbers shift but the symbol bytes + # are identical. + (_scenario_repo / "src/auth.py").write_text( + "\n\n\ndef verify_token(token: str) -> bool:\n return token.startswith(\"valid:\")\n" + ) + _commit(_scenario_repo, "insert blank lines above") + invalidate_sync_cache(ctx) + await handle_link_commit(ctx, "HEAD") + + drift = await handle_detect_drift(ctx, "src/auth.py") + drifted = [d for d in drift.decisions if d.status == "drifted"] + assert not drifted, f"Line-shift edit must NOT trigger drift, got: {[(d.status, d.symbol, d.lines) for d in drift.decisions]}" + + +@pytest.mark.phase2 +@pytest.mark.asyncio +async def test_scenario_13_open_question_decision_classification(_scenario_repo): + """[Open Question]-prefixed decisions are classified as gaps, not normal decisions. + + Added in v0.5.x as the 13th scorecard entry. Verifies the prefix + convention is honored end-to-end so caller LLMs can render gaps + distinctly from claimed decisions. + """ + ctx = BicameralContext.from_env() + payload = _build_payload( + _scenario_repo, + text="[Open Question] Should we add SSO?", + intent="[Open Question] Should we add SSO?", + code_regions=[], + ) + res = await handle_ingest(ctx, payload) + assert res.ingested + # The decision is persisted; its status / classification is exercised + # via tests/test_v0420_history.py for the "gap" rendering path. diff --git a/tests/test_link_commit_grounding.py b/tests/test_link_commit_grounding.py index 31137890..f96deba4 100644 --- a/tests/test_link_commit_grounding.py +++ b/tests/test_link_commit_grounding.py @@ -111,6 +111,13 @@ async def test_pending_grounding_checks_for_ungrounded_decisions(_isolated_ledge # The ungrounded decision should appear reasons = [c.get("reason") for c in lc_resp.pending_grounding_checks] assert "ungrounded" in reasons + # V1 verification-instruction split (post-pass-12 fix): for an + # ungrounded-only response, the bind CTA is the right answer (no prior + # binding to retire, no duplicate-binding risk). The relocation + # warning must NOT appear. + instr = lc_resp.verification_instruction + assert "bicameral.bind" in instr, f"missing bind CTA: {instr}" + assert "INFORMATIONAL ONLY" not in instr, f"unexpected relocation warning: {instr}" # ── 2. Symbol disappeared → grounding check emitted ────────────────────────── @@ -185,4 +192,26 @@ async def test_pending_grounding_checks_symbol_not_found(_isolated_ledger): assert len(disappeared_checks) >= 1, ( f"Expected symbol_disappeared grounding check, got: {grounding_checks}" ) - assert disappeared_checks[0]["symbol"] == "fetch_user" + entry = disappeared_checks[0] + assert entry["symbol"] == "fetch_user" + # V1 D1: original_lines lets the caller LLM inspect the prior code via + # `git show :` to ground its own retrieval. + assert "original_lines" in entry, ( + f"Expected original_lines in symbol_disappeared payload, got: {entry}" + ) + start, end = entry["original_lines"] + assert isinstance(start, int) and isinstance(end, int) + assert start >= 1 and end >= start, f"Invalid original_lines {entry['original_lines']}" + + # V1 / Codex pass-12 fix: relocation cases must NOT route through + # bicameral.bind (would leave the old edge live → duplicate-binding + # state under N:N binds_to). The verification instruction must + # explicitly mark symbol_disappeared as INFORMATIONAL ONLY and + # forbid the bind CTA. + instr = lc_resp.verification_instruction + assert "INFORMATIONAL ONLY" in instr, ( + f"Expected relocation warning in verification_instruction, got: {instr!r}" + ) + assert "Do NOT call bicameral.bind" in instr or "do not bind directly" in instr, ( + f"Expected explicit bind-prohibition for relocation, got: {instr!r}" + ) diff --git a/tests/test_sync_middleware.py b/tests/test_sync_middleware.py index 724868aa..7bbf5807 100644 --- a/tests/test_sync_middleware.py +++ b/tests/test_sync_middleware.py @@ -222,3 +222,151 @@ async def test_banner_silent_on_fresh_proposal(): ctx = _make_ctx(open_rows=[_proposal(days_old=3)]) banner = await get_session_start_banner(ctx) assert banner is None + + +# ── V1 A2-light: repo_write_barrier ───────────────────────────────── + + +@pytest.fixture +def _reset_locks(): + """Drop the per-repo lock registry before and after each test so lock + identity is deterministic across tests in the same process.""" + from handlers.sync_middleware import _reset_repo_locks_for_tests + _reset_repo_locks_for_tests() + yield + _reset_repo_locks_for_tests() + + +def _barrier_ctx(repo_path: str): + ctx = MagicMock() + ctx.repo_path = repo_path + return ctx + + +@pytest.mark.asyncio +async def test_repo_write_barrier_serializes_same_repo(_reset_locks): + """Two concurrent barrier-holders for the same repo MUST serialize. + + Proves the in-process race window V1 A2-light is closing: a second + bind call cannot observe the ledger while the first is mid-write. + """ + import asyncio + from handlers.sync_middleware import repo_write_barrier + + events: list[str] = [] + + async def task(name: str, hold_ms: int): + ctx = _barrier_ctx("/repo/a") + async with repo_write_barrier(ctx) as _t: + events.append(f"{name}:enter") + await asyncio.sleep(hold_ms / 1000) + events.append(f"{name}:exit") + + await asyncio.gather(task("first", 50), task("second", 10)) + + # First must fully exit before second enters — no interleaving. + assert events == ["first:enter", "first:exit", "second:enter", "second:exit"], events + + +@pytest.mark.asyncio +async def test_repo_write_barrier_allows_different_repos_concurrently(_reset_locks): + """Different repos use different locks and MUST run in parallel.""" + import asyncio + from handlers.sync_middleware import repo_write_barrier + + events: list[str] = [] + + async def task(name: str, repo: str): + ctx = _barrier_ctx(repo) + async with repo_write_barrier(ctx) as _t: + events.append(f"{name}:enter") + await asyncio.sleep(0.05) + events.append(f"{name}:exit") + + await asyncio.gather(task("A", "/repo/a"), task("B", "/repo/b")) + + # Both entered before either exited — barriers on different repos + # do not block each other. + assert events[:2] == ["A:enter", "B:enter"] or events[:2] == ["B:enter", "A:enter"] + assert set(events) == {"A:enter", "A:exit", "B:enter", "B:exit"} + + +@pytest.mark.asyncio +async def test_repo_write_barrier_releases_on_exception(_reset_locks): + """If the body raises, the lock must still release so the next caller proceeds.""" + import asyncio + from handlers.sync_middleware import repo_write_barrier + + ctx = _barrier_ctx("/repo/a") + + with pytest.raises(RuntimeError): + async with repo_write_barrier(ctx) as _t: + raise RuntimeError("boom") + + async def reacquire(): + async with repo_write_barrier(ctx) as _t: + return "ok" + + result = await asyncio.wait_for(reacquire(), timeout=1.0) + assert result == "ok" + + +@pytest.mark.asyncio +async def test_repo_write_barrier_falls_back_when_repo_path_missing(_reset_locks): + """Missing ctx.repo_path falls back to a default key and still serializes.""" + import asyncio + from handlers.sync_middleware import repo_write_barrier + + class _Bare: + pass + + ctx = _Bare() + + events: list[str] = [] + + async def task(name: str): + async with repo_write_barrier(ctx) as _t: + events.append(f"{name}:enter") + await asyncio.sleep(0.03) + events.append(f"{name}:exit") + + await asyncio.gather(task("x"), task("y")) + + assert events[0].endswith(":enter") and events[1].endswith(":exit") + assert events[2].endswith(":enter") and events[3].endswith(":exit") + + +# ── V1 A3: barrier timing yield ───────────────────────────────────── + + +@pytest.mark.asyncio +async def test_repo_write_barrier_reports_held_ms(_reset_locks): + """BarrierTiming.held_ms is populated on exit and is non-negative.""" + import asyncio + from handlers.sync_middleware import repo_write_barrier + + ctx = _barrier_ctx("/repo/a") + async with repo_write_barrier(ctx) as timing: + assert timing.held_ms is None # not yet populated + await asyncio.sleep(0.02) + assert timing.held_ms is not None + assert timing.held_ms >= 20.0 # we slept 20ms, measured wall clock should reflect it + assert timing.held_ms < 500.0 # and not be absurd + + +@pytest.mark.asyncio +async def test_repo_write_barrier_reports_held_ms_on_exception(_reset_locks): + """held_ms is set even when the body raises.""" + from handlers.sync_middleware import repo_write_barrier + + ctx = _barrier_ctx("/repo/a") + captured_timing = None + + with pytest.raises(RuntimeError): + async with repo_write_barrier(ctx) as timing: + captured_timing = timing + raise RuntimeError("boom") + + assert captured_timing is not None + assert captured_timing.held_ms is not None + assert captured_timing.held_ms >= 0.0