desync optimization V1 — read-path advisory + V2 implementation guide by silongtan · Pull Request #56 · BicameralAI/bicameral-mcp

silongtan · 2026-04-25T04:38:44Z

Summary

Ships V1 of a two-part desync-correctness initiative plus a consolidated V2 implementation guide for the destructive-path overhaul that's deliberately deferred. V1 introduces zero new mutating
capabilities; every change is read-only measurement, additive contract field, pure function, test coverage, or a surgical bug fix to an already-shipped path.

V1 implementation (6 commits, 3b4d0bb…8e226c5): drift benchmark harness, per-repo asyncio.Lock for bind, SyncMetrics instrumentation, strict-whitelist tree-sitter cosmetic-change classifier with
DriftEntry.cosmetic_hint advisory metadata, original_lines on symbol_disappeared grounding checks, canonical 13-scenario regression matrix (12 PASS / 1 XFAIL), and a _VERIFICATION_INSTRUCTION split that
removes the v0.6.4 unsafe-bind CTA for relocation cases.
V2 planning (1 commit, 1021822): single-source docs/v2-desync-optimization-guide.md (999 lines) consolidating the V2 design + V1 plan + 14 rounds of Codex review into one self-contained guide for the next
engineer/agent to pick up.

Net destructive-surface change for V1: zero (and arguably negative — V1 actively removes the v0.6.4 unsafe relocation CTA).

What V1 ships

ID	Deliverable	Files
A1	Drift benchmark harness — 100 decisions × 25 files, JSON artifact, `@pytest.mark.bench`	`tests/bench_drift.py`
A2-light	Per-repo `asyncio.Lock` for `handle_bind` (in-process only)	`handlers/sync_middleware.py::repo_write_barrier`, `handlers/bind.py`
A3	`SyncMetrics` (`sync_catchup_ms`, `barrier_held_ms`) on Search/Preflight/History/Bind responses	`contracts.py`, four handlers
B1	Strict-whitelist tree-sitter cosmetic-change classifier	`ledger/ast_diff.py`
B2	`DriftEntry.cosmetic_hint` advisory metadata	`contracts.py`, `handlers/detect_drift.py`
D1	`original_lines` on `symbol_disappeared` grounding checks; verification-instruction split for safe relocation routing	`ledger/adapter.py`, `handlers/link_commit.py`
F1	Canonical 13-scenario regression matrix routed through real handler layer	`tests/test_desync_scenarios.py`
Bug fix	Empty `decision_id` on ungrounded grounding checks (surfaced by F1)	`ledger/adapter.py:475`

Performance baseline (Apple Silicon, surrealdb 2.0.0):

search_decisions p95 = 10.4 ms (target <2 s — 185× under)
detect_drift p95 = 15.5 ms (target <1 s — 55× under)
link_commit (warm) p95 = 8.0 ms

Test plan

pytest tests/test_link_commit_grounding.py tests/test_desync_scenarios.py tests/test_b2_cosmetic_hint.py tests/test_ast_diff.py tests/test_phase3_integration.py tests/test_sync_middleware.py tests/test_bind.py → 75 passed, 1 xfailed in 7.68s
pytest tests/bench_drift.py -v -m bench -s → benchmark passes; baseline numbers documented in CHANGELOG
Test suite passes against surrealdb>=2.0.0 (the SDK 2.0 schema-migration crash I had originally pinned around is fixed properly upstream by 66796ef)
Zero regressions on the v0.6.3 → v0.6.4 → 0.6.4-bump → 66796ef → 8d44a16 → ce74461 → a5aface rebase
Reviewer check: scenario 8 in tests/test_desync_scenarios.py is @pytest.mark.xfail(strict=True) — it MUST stay xfailed until V2's bicameral_rebind ships. If it ever flips to xpassed, V2 has
leaked into V1 unintentionally.

V2 — explicitly deferred (not in this PR)

The full V2 plan lives in docs/v2-desync-optimization-guide.md:

Phase 0: migrate handlers/resolve_compliance.py from hard-delete to tombstone+CAS (closes the highest-impact destructive backdoor — Codex pass-10 refactor: port interfaces + source_span for drift pipeline #1, hard prerequisite); A0 atomic SurrealQL transaction
primitive with day-1 forced-failure gate test.
Phase 1: schema migration v5→v6 (per-binding baseline ownership, tombstone fields, append-only compliance_verdict_history, full-CAS cache key) + traversal filtering on every binds_to consumer.
Phase 2: full sync barrier (sync-token CAS + region fingerprint at commit time).
Phase 3: cache-aware pending_compliance_checks from detect_drift.
Phase 4: bicameral_judge_drift + record_compliance_verdict (5-field CAS), bicameral_advance_baseline, bicameral_rebind (two-phase + lease + attempt-id locking).
Phase 5: doctor SKILL rendering, Codex pass-15, scenario 8 xfail → expected pass.

Effort estimate: 7–10 engineer-weeks sequential. Phases don't parallelize cleanly because each one's correctness depends on the prior one's invariants.

The guide synthesizes 14 Codex review passes into a "what NOT to ship" constraints catalog (§7) — every entry came from a pass that found a real bug in an earlier draft. Strong recommendation to involve Jin at
design time before V2 implementation begins (CODEOWNERS approval required, plus this work needs project-judgment review the adversarial Codex passes can't provide).

Notes for reviewers

docs/v2-desync-optimization-guide.md is 999 lines but heavily structured; §1 has a recommended read order for fresh readers.
Two prior planning artifacts (docs/desync-optimization.md and docs/desync-optimization-v1-plan.md) were never tracked per project convention; their contents are absorbed into the consolidated guide.
CHANGELOG entry uses "Unreleased" heading — pick a version (v0.7.0?) at release time.
Scenario 8's xfail(strict=True) is intentional CI plumbing; do not remove without shipping V2 D2.

Summary by CodeRabbit

New Features
- Optional sync timing metrics added to responses and per-repo write-barrier timing capture
- Advisory cosmetic-change flag to mark whitespace-only edits (render-only; does not alter drift status)
- Grounding checks include prior code-context for relocation cases; verification guidance split by reason
Bug Fixes
- Fixed empty decision identifier emission for some ungrounded cases
- Verification instructions now conditionally route CTAs and forbid unsafe bind-on-relocate actions
Documentation
- Added V2 desync optimization guide and updated planning/changelog
Tests
- Added benchmark harness (opt-in) and expanded regression/test suites (13 scenarios + cosmetic/sync tests)

coderabbitai · 2026-04-25T04:38:56Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds Desync Optimization V1: read-path-only instrumentation and advisory changes — sync-metrics on responses, per-repo write barrier for bind timing, strict AST whitespace-only cosmetic classifier with drift-entry enrichment, conditional verification-instruction routing, grounding payload enrichment, a drift benchmark harness, and regression tests; destructive write-path work deferred to V2.

Changes

Cohort / File(s)	Summary
Docs & Planning `CHANGELOG.md`, `PLAN.md`, `TODO.md`, `docs/v2-desync-optimization-guide.md`	New V1 completion notes, V2 design/migration guide, updated TODOs and acceptance criteria; non-code documentation added.
Contracts / Models `contracts.py`	Added `SyncMetrics` model and optional `sync_metrics` on multiple response types; added advisory `DriftEntry.cosmetic_hint`.
AST Cosmetic Classifier & Tests `ledger/ast_diff.py`, `tests/test_ast_diff.py`	New strict tree-sitter-based `is_cosmetic_change()` with fail-safe parsing and comprehensive tests for cosmetic vs semantic edits.
Handlers — Sync Instrumentation `handlers/history.py`, `handlers/preflight.py`, `handlers/search_decisions.py`, `handlers/bind.py`	Measure ledger catch-up and barrier-held durations; attach `sync_metrics` to responses; `handle_bind` delegates to `_do_bind` and wraps execution with `repo_write_barrier`.
Handlers — Logic Refinements `handlers/link_commit.py`, `handlers/detect_drift.py`	`_build_verification_instruction` now composes conditional verification text (forbids `bicameral.bind` for `symbol_disappeared` relocations); `detect_drift` enriches `DriftEntry.cosmetic_hint` for whitespace-only edits (fail-safe).
Sync Middleware `handlers/sync_middleware.py`, `tests/test_sync_middleware.py`	Added `repo_write_barrier(ctx)` async context manager with per-repo locks and `BarrierTiming.held_ms`; tests for serialization, timing, and exception behavior.
Ledger Adapter `ledger/adapter.py`	`ingest_commit` emits `original_lines` for `symbol_disappeared` pending grounding checks and fixes `decision_id` extraction for ungrounded decisions.
Benchmarks `tests/bench_drift.py`, `tests/conftest.py`	New drift benchmark harness (pytest `bench` marker) measuring link_commit/search/detect_drift latencies and emitting JSON results.
Tests — Cosmetic Hint & Scenarios `tests/test_b2_cosmetic_hint.py`, `tests/test_desync_scenarios.py`, `tests/test_link_commit_grounding.py`	New tests validating cosmetic-hint enrichment, 13 desync regression scenarios, verification-instruction content, and `original_lines` expectations.
Skills / Docs `.claude/skills/*`, `skills/bicameral-preflight/SKILL.md`	Skill docs updated to document `sync_metrics`, `cosmetic_hint`, and informational-only verification instructions for relocations.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Handler as Handler (search/preflight/bind/detect_drift)
  participant Sync as repo_write_barrier (Lock)
  participant Ledger
  participant Repo as GitRepo
  participant AST as ledger/ast_diff

  Client->>Handler: request (search/preflight/bind/detect_drift)
  Handler->>Sync: enter repo_write_barrier(ctx)
  Sync-->>Handler: BarrierTiming (held_ms unset)
  Handler->>Ledger: ensure_ledger_synced(ctx)
  Ledger-->>Handler: synced
  alt detect_drift path
    Handler->>Repo: read file(s) (HEAD, working tree)
    Repo-->>Handler: file contents
    Handler->>AST: is_cosmetic_change(before, after, lang)
    AST-->>Handler: cosmetic? (True/False)
  end
  Handler->>Sync: exit repo_write_barrier (record held_ms)
  Handler-->>Client: Response (includes sync_metrics, cosmetic_hint, verification_instruction)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat(v0.4.20): unified compliance verification — cache + honest status #30 — Alters verification-instruction composition and LinkCommitResponse routing; directly related to instruction-splitting changes.
v0.4.11 — latent drift fix (range-diff sweep + distinct counters) #17 — Overlaps ledger/adapter.ingest_commit edits; related to grounding payloads and range handling.
feat: v0.5.1 — spec-gap wedge cuts + bicameral.history #36 — Modifies ledger/adapter.py and History/contract surface; related to SyncMetrics and grounding behavior.

Suggested reviewers

jinhongkuan

Poem

🐰
I hopped through locks and timed each beat,
Nibbled stray spaces till the diff looked neat,
I warned where binds must never roam,
Gave hints and metrics, then hopped home,
A tiny V1 hop toward a safer comb. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.37% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: delivering V1 of desync optimization (read-path advisory features) and adding V2 implementation guide, directly matching the PR's primary objectives.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch desync-optimization-v1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

handlers/sync_middleware.py (1)

56-80: Optional: _repo_locks grows without bound.

The registry never evicts entries, so a long-lived MCP server process that handles many distinct repo_path values will accumulate one asyncio.Lock per repo for the lifetime of the process. Per-process MCP servers typically see ≤1 repo, so this is unlikely to bite, but if you anticipate a multi-repo daemon mode (e.g. dashboard, CI worker), consider a WeakValueDictionary or an explicit eviction hook. Pure footprint issue — no correctness impact.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 56 - 80, _repo_locks currently
stores a strong reference for every repo and will grow without bound; change the
storage used by _repo_locks in handlers/sync_middleware.py (used by
_get_repo_lock and guarded by _guard) to a weak-value mapping or add an eviction
mechanism so locks for unused repo_path keys can be garbage-collected: replace
the plain dict _repo_locks with a weakref.WeakValueDictionary (or implement a
size/time-based eviction policy) and ensure the existing _guard()/async with
_guard() protection remains around accesses to the new mapping.

tests/bench_drift.py (2)

80-90: p95 is off-by-one when 0.95 * len(s) is not an integer.

s[max(0, int(0.95 * len(s)) - 1)] returns the nearest-rank p95 only when 0.95*N is integral (e.g., N=20, 100). For N=50 (the search-decisions sample size: 10 queries × 5 iterations) it yields index 46 (47th value) instead of the standard nearest-rank index 47 (48th value). The published baseline numbers in PLAN.md / V2 guide are derived from this — minor but worth tightening since they're cited as the V2 perf budget reference.

♻️ Proposed fix using `statistics.quantiles`

-def _percentiles(samples: list[float]) -> dict[str, float]:
-    if not samples:
-        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0}
-    s = sorted(samples)
-    return {
-        "p50": statistics.median(s),
-        "p95": s[max(0, int(0.95 * len(s)) - 1)],
-        "max": s[-1],
-        "mean": statistics.fmean(s),
-        "n": len(s),
-    }
+def _percentiles(samples: list[float]) -> dict[str, float]:
+    if not samples:
+        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0}
+    s = sorted(samples)
+    # Nearest-rank p95: smallest value with rank >= ceil(0.95 * N).
+    p95_idx = max(0, -(-len(s) * 95 // 100) - 1)
+    return {
+        "p50": statistics.median(s),
+        "p95": s[p95_idx],
+        "max": s[-1],
+        "mean": statistics.fmean(s),
+        "n": len(s),
+    }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 80 - 90, The p95 calculation in
_percentiles is off-by-one for non-integer 0.95*N; replace the current manual
index logic with a proper quantile computation (e.g., use statistics.quantiles
with n=100 and method="inclusive" or method="nearest" to get the 95th
percentile) on the sorted samples s, and return that value for "p95" while
keeping the other returns ("p50", "max", "mean", "n") unchanged; ensure the
empty-samples branch remains the same.

118-121: Log the swallowed exception.

A bare except Exception: continue here means a regression in extract_symbols (e.g., a tree-sitter parser change or a malformed file pattern) silently shrinks the corpus until the len(symbols) >= 25 assertion at line 193 trips far away from the cause. Logging at DEBUG keeps the bench quiet but recoverable.

-        try:
-            records = await adapter.extract_symbols(str(fp))
-        except Exception:
-            continue
+        try:
+            records = await adapter.extract_symbols(str(fp))
+        except Exception as exc:
+            print(f"[bench] extract_symbols failed for {fp}: {exc}")
+            continue

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 118 - 121, The try/except around
adapter.extract_symbols currently swallows all exceptions; change it to catch
Exception as e and log the error at DEBUG before continuing so failures in
extract_symbols are visible (e.g., logger.debug("extract_symbols failed for %s:
%s", fp, e, exc_info=True)) and then continue; keep the variable names (records,
adapter.extract_symbols, fp) intact and only add the debug log to avoid noisy
test output.

tests/test_b2_cosmetic_hint.py (1)

78-90: Dead write + unused fixture parameter.

The _write_file at line 80 is immediately overwritten at line 83 before any git operation, so it has no effect on the test. The tmp_path parameter is also unused (the working-tree path comes from the repo_with_baseline fixture). The test reads as if it's establishing a non-docstring baseline state, but it isn't.

♻️ Proposed cleanup

-def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline, tmp_path):
+def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline):
     repo, rel = repo_with_baseline
-    _write_file(repo, rel, "def f(x):\n    return x + 1\n")
-    # Now overwrite baseline by committing a docstring-only version, then edit working tree.
+    # Commit a docstring-bearing baseline, then edit only the docstring text in
+    # the working tree — the diff is observable via __doc__, so it must NOT be
+    # classified as cosmetic.
     import subprocess
     _write_file(repo, rel, 'def f(x):\n    """Old."""\n    return x + 1\n')
     subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
     subprocess.run(["git", "commit", "-q", "-m", "add docstring"], cwd=repo, check=True)
-    # Working tree edits the docstring text — observable via __doc__.
     _write_file(repo, rel, 'def f(x):\n    """New."""\n    return x + 1\n')
     entry = _make_entry(status="drifted", lines=(1, 3))
     _enrich_with_cosmetic_hints([entry], rel, str(repo))
     assert entry.cosmetic_hint is False

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_b2_cosmetic_hint.py` around lines 78 - 90, The test dead-writes a
file via _write_file immediately overwritten later and declares an unused
tmp_path fixture; remove the first _write_file call (the one creating "def
f(x):\n    return x + 1\n") and drop the tmp_path parameter from
test_docstring_edit_keeps_cosmetic_hint_false so the test only creates the
baseline commit with the docstring and then edits the working tree, keeping
references to _write_file, subprocess.run, _make_entry and
_enrich_with_cosmetic_hints intact.

TODO.md (1)

138-140: Line numbers in prose will drift.

ledger/adapter.py:475 is also referenced from PLAN.md line 102. The bug fix point will move on the next edit to that file and the docs go stale silently. Consider citing a stable anchor (function name) instead.

-- [x] **incidental fix** — `ledger/adapter.py:475` was emitting empty
-      `decision_id` on ungrounded grounding checks; surfaced by F1
+- [x] **incidental fix** — empty `decision_id` on ungrounded
+      `pending_grounding_checks` in `ledger/adapter.py`
+      (`SurrealDBLedgerAdapter._collect_pending_grounding_checks`); surfaced by F1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@TODO.md` around lines 138 - 140, Update the docs to stop citing a fragile
line number: replace the `ledger/adapter.py:475` reference in TODO.md and
PLAN.md with a stable symbol (the function or method that emits the decision_id
for grounding checks); locate the emitter by searching for the "decision_id"
token or the grounding-check function in ledger/adapter.py and reference it by
its fully-qualified function/method name (e.g., AdapterClass.method_name or
function_name) so future edits won't silently break the doc link.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/v2-desync-optimization-guide.md`:
- Around line 519-535: The current transaction() context manager builds a raw
BEGIN/COMMIT block and calls self._db.query(...) which is incompatible with
surrealdb-py's transaction API; refactor transaction() (the asynccontextmanager
that yields TransactionBuffer) to create a session via self._db.new_session(),
start a transaction with session.begin_transaction(), run each buffered
statement through the transaction object (txn.query or txn.execute equivalent)
so you get per-statement results, and then call txn.commit() in the try path or
txn.cancel() in except, converting any per-statement failures into LedgerError
(replace the current isinstance(stmt, str) check and the raw self._db.query
usage with transaction-aware calls and proper try/except around
txn.commit()/txn.cancel()).

---

Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 56-80: _repo_locks currently stores a strong reference for every
repo and will grow without bound; change the storage used by _repo_locks in
handlers/sync_middleware.py (used by _get_repo_lock and guarded by _guard) to a
weak-value mapping or add an eviction mechanism so locks for unused repo_path
keys can be garbage-collected: replace the plain dict _repo_locks with a
weakref.WeakValueDictionary (or implement a size/time-based eviction policy) and
ensure the existing _guard()/async with _guard() protection remains around
accesses to the new mapping.

In `@tests/bench_drift.py`:
- Around line 80-90: The p95 calculation in _percentiles is off-by-one for
non-integer 0.95*N; replace the current manual index logic with a proper
quantile computation (e.g., use statistics.quantiles with n=100 and
method="inclusive" or method="nearest" to get the 95th percentile) on the sorted
samples s, and return that value for "p95" while keeping the other returns
("p50", "max", "mean", "n") unchanged; ensure the empty-samples branch remains
the same.
- Around line 118-121: The try/except around adapter.extract_symbols currently
swallows all exceptions; change it to catch Exception as e and log the error at
DEBUG before continuing so failures in extract_symbols are visible (e.g.,
logger.debug("extract_symbols failed for %s: %s", fp, e, exc_info=True)) and
then continue; keep the variable names (records, adapter.extract_symbols, fp)
intact and only add the debug log to avoid noisy test output.

In `@tests/test_b2_cosmetic_hint.py`:
- Around line 78-90: The test dead-writes a file via _write_file immediately
overwritten later and declares an unused tmp_path fixture; remove the first
_write_file call (the one creating "def f(x):\n    return x + 1\n") and drop the
tmp_path parameter from test_docstring_edit_keeps_cosmetic_hint_false so the
test only creates the baseline commit with the docstring and then edits the
working tree, keeping references to _write_file, subprocess.run, _make_entry and
_enrich_with_cosmetic_hints intact.

In `@TODO.md`:
- Around line 138-140: Update the docs to stop citing a fragile line number:
replace the `ledger/adapter.py:475` reference in TODO.md and PLAN.md with a
stable symbol (the function or method that emits the decision_id for grounding
checks); locate the emitter by searching for the "decision_id" token or the
grounding-check function in ledger/adapter.py and reference it by its
fully-qualified function/method name (e.g., AdapterClass.method_name or
function_name) so future edits won't silently break the doc link.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf4c2513-a834-4520-80b2-fe791afc76a0

📥 Commits

Reviewing files that changed from the base of the PR and between 1021822 and 63fd689.

📒 Files selected for processing (21)

CHANGELOG.md
PLAN.md
TODO.md
contracts.py
docs/v2-desync-optimization-guide.md
handlers/bind.py
handlers/detect_drift.py
handlers/history.py
handlers/link_commit.py
handlers/preflight.py
handlers/search_decisions.py
handlers/sync_middleware.py
ledger/adapter.py
ledger/ast_diff.py
tests/bench_drift.py
tests/conftest.py
tests/test_ast_diff.py
tests/test_b2_cosmetic_hint.py
tests/test_desync_scenarios.py
tests/test_link_commit_grounding.py
tests/test_sync_middleware.py

✅ Files skipped from review due to trivial changes (2)

tests/conftest.py
handlers/history.py

🚧 Files skipped from review as they are similar to previous changes (5)

tests/test_ast_diff.py
handlers/preflight.py
handlers/link_commit.py
tests/test_link_commit_grounding.py
CHANGELOG.md

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (6)

handlers/sync_middleware.py (2)
83-115: Minor readability nit: forward reference to BarrierTiming.

repo_write_barrier (L83–102) instantiates BarrierTiming at L96, but the class is defined below at L105. It works because Python resolves the name at call time, but readers parsing top-to-bottom hit the use before the definition. Consider hoisting BarrierTiming above repo_write_barrier.
Suggested ordering
-@asynccontextmanager
-async def repo_write_barrier(ctx):
-    ...
-
-class BarrierTiming:
-    ...
+class BarrierTiming:
+    ...
+
+@asynccontextmanager
+async def repo_write_barrier(ctx):
+    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 83 - 115, The function
repo_write_barrier instantiates BarrierTiming before the class is defined, which
is a readability nit—move the BarrierTiming class definition above
repo_write_barrier so the type is defined before use; update any
docstrings/comments if needed to reflect the new order and keep the exact class
name BarrierTiming and function name repo_write_barrier unchanged.
56-71: Lazy-loop binding works for the MCP server, but the registry persists across loops.

Single-threaded asyncio guarantees the if _repo_locks_guard is None: … block is atomic, so the lazy guard is race-safe. Two follow-up considerations worth tracking for V2:

The locks in _repo_locks and _repo_locks_guard itself are bound to the loop active when first created. If the host ever recycles the event loop (test harnesses already do — that's why _reset_repo_locks_for_tests exists), production-side recycling would silently re-use a lock from a dead loop and raise RuntimeError: ... attached to a different loop. Not a bug today since the MCP server keeps a single loop for its lifetime, but please note this constraint in the docstring or guard against it (e.g., compare asyncio.get_running_loop() to a stored loop ref and rebuild on mismatch).

_repo_locks is never evicted. For a long-running server processing many distinct ctx.repo_path values this leaks asyncio.Lock objects. Likely fine in practice (few repos per session) but worth a TODO for V2's broader barrier scope.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 56 - 71, The current lazy guard and
per-repo locks (_repo_locks_guard and _repo_locks) are bound to whichever event
loop exists at first creation and persist across loops, risking RuntimeError if
the event loop is recycled and leaking locks for many ctx.repo_path values;
update _guard() to store the loop used when creating _repo_locks_guard (capture
asyncio.get_running_loop()), compare that stored loop to
asyncio.get_running_loop() on each call and recreate the guard (and clear or
rebuild _repo_locks) when they differ, and add a concise docstring note about
loop-binding; also add a TODO comment near _repo_locks about eviction and
reference the existing _reset_repo_locks_for_tests as a test helper for clearing
state.
tests/test_sync_middleware.py (1)
343-354: Potentially flaky timing assertion.

asyncio.sleep(0.02) is documented as "approximately delay seconds" and the wall clock may be coarser than perf_counter on some platforms (notably Windows < ~15.6ms timer resolution). Asserting held_ms >= 20.0 is tight — a single rounding/clock-skew artifact will turn it red. Consider relaxing to >= 15.0 or sleeping a bit longer (e.g. 50ms with >= 40.0).
Suggested loosening
-    async with repo_write_barrier(ctx) as timing:
-        assert timing.held_ms is None  # not yet populated
-        await asyncio.sleep(0.02)
-    assert timing.held_ms is not None
-    assert timing.held_ms >= 20.0  # we slept 20ms, measured wall clock should reflect it
-    assert timing.held_ms < 500.0  # and not be absurd
+    async with repo_write_barrier(ctx) as timing:
+        assert timing.held_ms is None  # not yet populated
+        await asyncio.sleep(0.05)
+    assert timing.held_ms is not None
+    assert timing.held_ms >= 40.0  # slept 50ms; allow ~10ms slack for coarse clocks
+    assert timing.held_ms < 500.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_sync_middleware.py` around lines 343 - 354, The timing assertion
in test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update
the test that uses repo_write_barrier and BarrierTiming.held_ms to either
increase the sleep to a longer, less flaky duration (e.g., await
asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g.,
keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms
comparison is robust across platforms.
tests/bench_drift.py (2)
80-90: Minor: mean key is missing from the empty-samples branch.

If any handler ends up with zero samples, the resulting JSON artifact will lack mean for that handler while populated handlers include it — making downstream diffs across runs awkward. The stdout summary at Line 270 is unaffected (it only reads p50/p95/max/n), but the artifact schema should be consistent.
♻️ Proposed fix
     if not samples:
-        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0}
+        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 80 - 90, The empty-samples branch in
_percentiles currently returns {"p50","p95","max","n"} but omits "mean", causing
inconsistent JSON schema; update the early-return dict in function _percentiles
to include "mean": 0.0 (and keep "n": 0) so all callers receive the same set of
keys ("p50","p95","max","mean","n") whether samples is empty or not.
118-121: Optional: log the swallowed extraction error.

except Exception: continue will silently mask systematic adapter failures — if extract_symbols fails on every file the bench will assert at Line 193 with a useless "Only got 0 symbols" message. A logger.debug (or even a print since this is a bench script) would shorten diagnosis if this happens in CI.
♻️ Proposed fix
         try:
             records = await adapter.extract_symbols(str(fp))
-        except Exception:
+        except Exception as exc:  # noqa: BLE001 — bench best-effort
+            print(f"[bench] extract_symbols failed for {fp}: {exc}")
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 118 - 121, The try/except around
adapter.extract_symbols(str(fp)) is swallowing all errors; change the except to
capture the exception (except Exception as e) and log or print the failure
(including the filepath/str(fp) and e) before continuing so
adapter.extract_symbols failures are visible during CI; update the block that
assigns records = await adapter.extract_symbols(str(fp)) to use the captured
exception and a logger.debug or print statement referencing the same filepath
and exception details.
tests/test_desync_scenarios.py (1)
405-441: Optional: tighten scenario 9 to actually assert supersession.

The current assertion (r1.ingested and r2.ingested) only proves the handler doesn't raise — a regression that silently dropped supersession_candidates emission would still pass. Since the test is the canonical regression entry for scenario 9, a one-line assertion that r2.supersession_candidates is non-empty (or that one of the two decisions transitions to a superseded status) would close that gap without duplicating coverage from tests/test_supersession.py.
♻️ Proposed fix
     r1 = await handle_ingest(ctx, p1)
     r2 = await handle_ingest(ctx, p2)
     assert r1.ingested and r2.ingested
+    # Smoke check: re-ingest with overlapping intent must surface a
+    # supersession candidate, otherwise the canonical path silently
+    # regressed even if both ingests "succeeded".
+    assert r2.supersession_candidates, (
+        f"Expected supersession_candidates on overlapping re-ingest, got: {r2}"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_desync_scenarios.py` around lines 405 - 441, The test currently
only asserts that handle_ingest returned ingested flags for r1 and r2; update
test_scenario_09_intent_description_supersession to also assert that
supersession actually occurred by checking r2.supersession_candidates is
non-empty (or that r1 or r2 has a decision with status "superseded"); locate the
test function and add a single assertion after the existing ones that inspects
r2.supersession_candidates (or verifies a decision status change) to ensure
supersession is emitted, without changing any handler logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PLAN.md`:
- Line 102: The checklist entry references a stale line number for the
incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.

---

Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 83-115: The function repo_write_barrier instantiates BarrierTiming
before the class is defined, which is a readability nit—move the BarrierTiming
class definition above repo_write_barrier so the type is defined before use;
update any docstrings/comments if needed to reflect the new order and keep the
exact class name BarrierTiming and function name repo_write_barrier unchanged.
- Around line 56-71: The current lazy guard and per-repo locks
(_repo_locks_guard and _repo_locks) are bound to whichever event loop exists at
first creation and persist across loops, risking RuntimeError if the event loop
is recycled and leaking locks for many ctx.repo_path values; update _guard() to
store the loop used when creating _repo_locks_guard (capture
asyncio.get_running_loop()), compare that stored loop to
asyncio.get_running_loop() on each call and recreate the guard (and clear or
rebuild _repo_locks) when they differ, and add a concise docstring note about
loop-binding; also add a TODO comment near _repo_locks about eviction and
reference the existing _reset_repo_locks_for_tests as a test helper for clearing
state.

In `@tests/bench_drift.py`:
- Around line 80-90: The empty-samples branch in _percentiles currently returns
{"p50","p95","max","n"} but omits "mean", causing inconsistent JSON schema;
update the early-return dict in function _percentiles to include "mean": 0.0
(and keep "n": 0) so all callers receive the same set of keys
("p50","p95","max","mean","n") whether samples is empty or not.
- Around line 118-121: The try/except around adapter.extract_symbols(str(fp)) is
swallowing all errors; change the except to capture the exception (except
Exception as e) and log or print the failure (including the filepath/str(fp) and
e) before continuing so adapter.extract_symbols failures are visible during CI;
update the block that assigns records = await adapter.extract_symbols(str(fp))
to use the captured exception and a logger.debug or print statement referencing
the same filepath and exception details.

In `@tests/test_desync_scenarios.py`:
- Around line 405-441: The test currently only asserts that handle_ingest
returned ingested flags for r1 and r2; update
test_scenario_09_intent_description_supersession to also assert that
supersession actually occurred by checking r2.supersession_candidates is
non-empty (or that r1 or r2 has a decision with status "superseded"); locate the
test function and add a single assertion after the existing ones that inspects
r2.supersession_candidates (or verifies a decision status change) to ensure
supersession is emitted, without changing any handler logic.

In `@tests/test_sync_middleware.py`:
- Around line 343-354: The timing assertion in
test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update
the test that uses repo_write_barrier and BarrierTiming.held_ms to either
increase the sleep to a longer, less flaky duration (e.g., await
asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g.,
keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms
comparison is robust across platforms.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d7efb60-551d-48a4-81d1-6b6e23ff4919

📥 Commits

Reviewing files that changed from the base of the PR and between 020c74e and e6b81ed.

📒 Files selected for processing (27)

.claude/skills/bicameral-doctor/SKILL.md
.claude/skills/bicameral-history/SKILL.md
.claude/skills/bicameral-preflight/SKILL.md
.claude/skills/bicameral-scan-branch/SKILL.md
.claude/skills/bicameral-search/SKILL.md
CHANGELOG.md
PLAN.md
TODO.md
contracts.py
docs/v2-desync-optimization-guide.md
handlers/bind.py
handlers/detect_drift.py
handlers/history.py
handlers/link_commit.py
handlers/preflight.py
handlers/search_decisions.py
handlers/sync_middleware.py
ledger/adapter.py
ledger/ast_diff.py
skills/bicameral-preflight/SKILL.md
tests/bench_drift.py
tests/conftest.py
tests/test_ast_diff.py
tests/test_b2_cosmetic_hint.py
tests/test_desync_scenarios.py
tests/test_link_commit_grounding.py
tests/test_sync_middleware.py

✅ Files skipped from review due to trivial changes (5)

tests/conftest.py
.claude/skills/bicameral-search/SKILL.md
.claude/skills/bicameral-scan-branch/SKILL.md
docs/v2-desync-optimization-guide.md
.claude/skills/bicameral-history/SKILL.md

🚧 Files skipped from review as they are similar to previous changes (8)

handlers/history.py
.claude/skills/bicameral-doctor/SKILL.md
handlers/search_decisions.py
.claude/skills/bicameral-preflight/SKILL.md
handlers/link_commit.py
TODO.md
tests/test_ast_diff.py
CHANGELOG.md

coderabbitai · 2026-04-26T03:06:52Z

+- [x] B2 — `DriftEntry.cosmetic_hint` advisory + `_enrich_with_cosmetic_hints` helper in `handle_detect_drift`
+- [x] D1 — `original_lines` field on `symbol_disappeared` grounding checks; **`_build_verification_instruction` split** so relocation cases never get the unsafe `bicameral.bind` CTA
+- [x] F1 — `tests/test_desync_scenarios.py` canonical 13-scenario regression matrix (12 pass + 1 V2 xfail)
+- [x] Incidental fix — empty `decision_id` on ungrounded `pending_grounding_checks` (`ledger/adapter.py:475`)


⚠️ Potential issue | 🟡 Minor

Minor: stale line reference.

ledger/adapter.py:475 no longer points at the empty-decision_id fix; the relevant block is now around L500–509 in the current file. Consider replacing with a symbolic anchor (e.g., "in get_all_decisions ungrounded fan-out") so doc rot doesn't accumulate.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@PLAN.md` at line 102, The checklist entry references a stale line number for the incidental fix; update the item to use a symbolic anchor instead of a hard file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks` (see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace "ledger/adapter.py:475" with a descriptive location referencing the function `get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.

…ning Ships three read-path / measurement pieces from the V1 plan: A1 — Drift benchmark harness (tests/bench_drift.py) Seeds 100 decisions across 25 files (real symbols via tree-sitter extract_symbols, no BM25 index build required), times search_decisions / detect_drift / link_commit under a memory:// ledger, writes test-results/bench/drift_baseline.json plus a stdout summary. Registered @pytest.mark.bench so default test runs skip it. Baseline on Apple Silicon: search_decisions p95 = 9.8 ms, detect_drift p95 = 14.8 ms, link_commit (warm) p95 = 8.0 ms — all 55–185× under the V2 targets in PLAN.md:83. A2-light — per-repo asyncio.Lock for handle_bind New handlers/sync_middleware.repo_write_barrier async context manager backed by a module-level dict[repo_path, asyncio.Lock]. Lazy guard-lock construction avoids the "bound to wrong loop" pitfall across test event loops. handle_bind wraps its body in the barrier via a thin _do_bind inner function. Deliberately narrow scope: does NOT protect resolve_compliance or cross-process writers — both are V2 scope (plan §5.2, §5.5). A3 — Sync metrics instrumentation (SyncMetrics contract) New contracts.SyncMetrics model with optional sync_catchup_ms and barrier_held_ms float fields, attached as sync_metrics: SyncMetrics | None = None to SearchDecisionsResponse, PreflightResponse, HistoryResponse, BindResponse. Purely additive, non-breaking. repo_write_barrier now yields a mutable BarrierTiming whose held_ms is populated on exit (including on exceptions). Each handler times its own sync call locally so nested calls (e.g. preflight → search_decisions) don't step on each other's metrics. Incidental (rebase update): an earlier draft of this commit pinned surrealdb>=1.0.0,<2.0.0 in pyproject.toml as a defensive workaround for an SDK 2.0 crash during schema migration (InternalError raised inside an except-LedgerError block in ledger/schema.py:259, which made every ingest-based test fail against 2.0). That pin was dropped during rebase onto main: commit 66796ef ("fix(client): wrap SDK SurrealError so init_schema idempotency catch works") fixes the underlying issue at the LedgerClient layer, so the test suite now passes against surrealdb>=2.0.0 directly. No pin survives in this commit's diff. Tests: 75 passed, 1 xfailed against surrealdb 2.0.0 after rebase (was 32 passed pre-F1; F1 lifts the count). Zero regressions from the v0.6.3 / v0.6.4 / 66796ef / 8d44a16 / ce74461 / a5aface rebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@ts-ignore

…sory B1 — ledger/ast_diff.py is_cosmetic_change(before, after, lang) returns True only when two snippets differ by inter-token whitespace alone. Strategy: parse both with tree-sitter and compare a recursive (node.type, child_sigs | leaf_bytes) signature. Two trees with identical signatures differ only by whitespace tree-sitter doesn't represent as nodes; any structural or leaf-text difference returns False. Strict whitelist by design — variable renames, trailing commas, comment edits (incl. # type: ignore / // @ts-ignore / build tags), docstring edits, and string-literal changes all return False. False negatives (real cosmetic changes routed unbiased to L3) are cheap; false positives bias the future V2 caller-LLM verdict prompt toward "looks fine" — which is the exact failure mode Codex pass-7 finding #3 flagged. Reuses code_locator.indexing.symbol_extractor._get_parser and LANGUAGE_FALLBACK so the cosmetic detector and the symbol indexer can never silently disagree on supported languages. Supports python, javascript, typescript, java, go, rust, c_sharp (plus jsx → javascript and tsx → typescript via fallback). Unsupported langs, parse failures, and has_error trees fail safe to False. B2 — DriftEntry.cosmetic_hint Adds cosmetic_hint: bool = False on contracts.DriftEntry (additive, non-breaking). handlers/detect_drift._enrich_with_cosmetic_hints runs after the pure raw_decisions_to_drift_entries mapping and only touches drifted entries. Source of "before" / "after": HEAD bytes vs working-tree bytes via ledger.status.get_git_content, sliced to the region's (start_line, end_line). Language resolved from file extension via EXTENSION_LANGUAGE. Five fail-safe paths leave cosmetic_hint=False: non-drifted status, equal HEAD/working-tree bytes (no diff to classify), unsupported extension, invalid line range, exception during is_cosmetic_change. Codex invariants enforced by tests: * The hint NEVER mutates content_hash or any other DriftEntry field besides cosmetic_hint itself (test_content_hash_never_mutated). * The hint never gates drift surfacing or status — it is metadata consumed by the eventual V2 caller-LLM verdict prompt. Tests: 61 passed in 3.81s — 21 ast_diff, 8 b2_cosmetic_hint, 19 sync_middleware, 6 bind, 7 phase3 integration. Zero regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d payload V1 plan D1 was originally designed around a server-side search_code(decision.description) to surface relocation candidates. That approach is obsolete after v0.6.4 nuked search_code and shifted all code retrieval to the caller LLM. v0.6.4 already implemented the spirit of D1: - ledger/adapter.py:412-420 emits a symbol_disappeared grounding check on the authoritative ref when resolve_symbol_lines returns None for a tracked region; - handlers/link_commit.py:21 verification_instruction tells the caller to use Grep/Read + validate_symbols / extract_symbols + bicameral.bind; - tests/test_link_commit_grounding.py covers the rename-→-grounding flow end-to-end. V1's actual contribution: add original_lines: [start_line, end_line] to the symbol_disappeared payload so the caller LLM can inspect the symbol's prior position via `git show <prev_ref>:<file_path>` to ground its own retrieval. Single-line addition in ledger/adapter.py; new assertion in the existing regression test. Strictly informational. No new "call bicameral_bind" CTA — Codex pass-10 finding #2 stands: V2 must ship bicameral_rebind with old-binding CAS + a fresh L3 verdict on the new target before any rebind workflow becomes safe. Tests: 63 passed in 4.52s — zero regressions across ast_diff, b2_cosmetic_hint, sync_middleware, bind, link_commit_grounding, phase3 integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…trix New tests/test_desync_scenarios.py — one test per scenario from the Notion "Auto-Grounding Problem" catalog (Notion ID 3332a51619c4813caccec86c36d9bf98), routed through the real handler layer per the Apr 8 PR #84 lesson (tests bypassing handlers miss post-ingest hooks). Self-contained fixture builds a tmp git repo on ``main`` plus a memory ledger per test, so scenarios are independent of each other and of the bicameral-mcp checkout. Scorecard: 12 PASS, 1 XFAIL — meets V1 plan acceptance gate (>=12/13). Pass: 1. New decision, matching code exists → ungrounded → caller binds 2. Code changed after grounded → pending + pending_compliance_check 3. Code deleted after grounded → symbol_disappeared 4. Symbol renamed in file → symbol_disappeared with original_lines (V1 D1) 5. Symbol moved to different file → symbol_disappeared 6. Code added later, decision becomes resolvable → caller binds explicitly 7. Cold start: no matching code in repo → stays ungrounded 9. Intent description supersession → re-ingest succeeds 10. Multiple intents share a symbol → both surface in drift response 11. No server-side BM25 grounding (post-v0.6.0) → stays ungrounded — caller-LLM-driven 12. Line-shift edit does not trigger drift → resolve_symbol_lines re-resolves 13. [Open Question] prefix → ingested as gap-class decision XFAIL: 8. Drifted intent → atomic re-ground → requires V2 bicameral_rebind with old-binding CAS + fresh L3 verdict on new target (design doc 8 D2; Codex pass-10 #2) Scenario 2 explicitly asserts ``pending`` (not ``drifted``) per post-v0.5.0 derive_status semantics: a hash diff WITHOUT a cached ``compliant`` verdict yields ``pending``. ``drifted`` requires a caller-LLM verdict via the verdict cache (V2 territory). Incidental bug surfaced by F1 and fixed in ledger/adapter.py: ``pending_grounding_checks`` for ungrounded decisions read ``d.get("id", "")`` from ``get_all_decisions(filter="ungrounded")``, but that query aliases the field to ``decision_id``. Result: every ungrounded grounding-check entry shipped with ``decision_id=""``, leaving callers no handle to bind against. The existing ``test_pending_grounding_checks_for_ungrounded_decisions`` regression didn't catch it because it only asserted ``len > 0`` on the list, not non-empty IDs. Fix: read ``decision_id`` first, fall back to ``id`` for forward compatibility. Real correctness bug, not a test artifact. Tests: 76 passed, 1 xfailed in 7.35s across test_desync_scenarios, test_b2_cosmetic_hint, test_ast_diff, test_phase3_integration, test_sync_middleware, test_bind, test_link_commit_grounding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…don't get bind CTA Codex pass-12 finding #2: D1 added original_lines to the symbol_disappeared payload (richer relocation context) but left the v0.6.4 verification_instruction text directing callers to use bicameral.bind for ALL pending_grounding_checks. For reason='ungrounded' that CTA is correct (no prior binding to retire, no duplicate-binding risk). For reason='symbol_disappeared' it is unsafe: bind on the new location leaves the old edge live under the N:N binds_to relation, producing duplicate-binding state. Atomic rebind that retires the stale edge in the same write ships in V2 (design doc 8 D2; Codex pass-10 finding #2). Net effect of D1 was that we added more relocation context routed through an unsafe CTA. This commit removes the unsafe CTA for relocation cases without reducing the safe CTA for ungrounded cases. Changes in handlers/link_commit.py: - _VERIFICATION_INSTRUCTION (single string) split into _VERIFICATION_INSTRUCTION_BASE + _GROUNDING_INSTRUCTION_UNGROUNDED + _GROUNDING_INSTRUCTION_RELOCATION. - New _build_verification_instruction(pending_compliance, pending_grounding) composer assembles the response text per call based on which reason values actually fired. Symbol-disappeared cases get an explicit "INFORMATIONAL ONLY — do NOT call bicameral.bind" warning citing the duplicate-binding hazard. - Response wiring at the bottom of handle_link_commit swapped from the constant lookup to the composer call. Tests (tests/test_link_commit_grounding.py): - test_pending_grounding_checks_for_ungrounded_decisions now asserts "bicameral.bind" IN the verification_instruction AND "INFORMATIONAL ONLY" NOT in it. - test_pending_grounding_checks_symbol_not_found now asserts "INFORMATIONAL ONLY" IN the instruction AND an explicit bind-prohibition phrase. Net destructive surface change for V1: arguably negative — V1 actively *removes* a v0.6.4 unsafe CTA for relocation, while introducing zero new mutating capabilities. Tests: 75 passed, 1 xfailed in 7.11s (xfail is scenario 8, V2 atomic rebind — by design). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per the project's CLAUDE.md auto-tick mandate, after V1 implementation work lands we tick the corresponding rows in TODO.md and PLAN.md and add a CHANGELOG entry. CHANGELOG.md "Unreleased — desync optimization V1 — measurement + read-path advisory" entry. Detailed per-deliverable summary (A1, A2-light, A3, B1, B2, D1, F1, pass-12 follow-up, incidental decision_id fix) with baseline numbers, scorecard table, and explicit V2-deferred list pointing at docs/desync-optimization.md. TODO.md - Phase 3: ticked "Performance benchmarks" (V1 A1 measurements show we already meet the V2 perf targets — search p95 = 10.4 ms, drift p95 = 15.5 ms vs 2 s / 1 s targets, 55-185x under). - LLM drift judge note refined to "(V2 — see ...)". - New "Desync Optimization V1 — DONE" sub-section enumerating every V1 task ID + new "V2 — Deferred" sub-section listing the parking-lot items. - Mock Registry note clarified: V1 introduces no new mocks. PLAN.md - Phase 3 perf bullet ticked with the same measurement rationale. - New "Desync Optimization V1 — DONE" section mirroring TODO.md's enumeration so PLAN.md stays the canonical phased view. - LLM drift judge bullet rewritten to point at the V2 design doc. mocks/README.md is intentionally NOT updated — file does not exist in-repo (verified empirically); the V1 plan's earlier reference to it has been corrected. .claude/skills/bicameral-doctor/SKILL.md is intentionally NOT updated in V1 — V1's safeguard for relocation cases is the server-emitted verification_instruction text (handlers/link_commit._build_verification_instruction), which IS in-tree and DOES ship. Adding doctor-skill rendering prose for pending_grounding_checks is V2 / future-Docs scope. Tests: 75 passed, 1 xfailed in 7.68s — zero regressions from doc work (docs only — no production code changed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single self-contained guide for V2 implementation. Replaces two prior working artifacts that were never tracked (per project convention that plan docs stay untracked): - docs/desync-optimization.md (V2 design with 9 Codex review passes) - docs/desync-optimization-v1-plan.md (V1 plan + pass-12 fixes) Both have been removed from the working tree; their contents are absorbed here as a single source of truth for any engineer or agent picking up V2 work after V1 ships on this branch. Structure (10 sections, ~1000 lines): 1. Quick start — read order for fresh readers 2. Background — three-timeline desync, 13-scenario catalog, scorecard 3. V1 outcomes — exhaustive list of what V1 shipped + the honest "20-30% of user-facing value" assessment so the next engineer doesn't overestimate the foundation 4. V2 scope — capability gap table + product targets 5. Architecture — L1/L2/L3 model, per-binding ownership, 5-field CAS, tombstones, append-only verdict history, two-phase rebind with attempt-id locking, three-mechanism sync barrier, spec-vs-code-shape writes 6. Implementation plan — 6 phases as a DAG with file-level breakdown and effort estimate (7-10 weeks, 1 engineer, sequential) 7. Constraints catalog — synthesized from 14 Codex review passes; each entry is a "what NOT to ship" rule traced to the pass # that found the bug 8. Open questions — 7 decisions Codex can't make; explicitly flags Jin involvement (CODEOWNERS approval required) 9. Acceptance criteria — quantitative thresholds, qualitative checks, ship-blocker gate tests 10. References — local files, V1 commits, Notion docs, external sources Pass-13 fixes folded in (V2 design review): - 5.6 / 6 Phase 4 D2: rebind phase 2 binds to a specific rebind_attempt_id with single-pending-rebind-per-old-binding enforcement. Stale phase-2 verdicts on superseded attempts cannot tombstone the old edge. - 5.5 / 6 Phase 4 C2: record_compliance_verdict derives projection keys from POST-mutation state. History insert + binds_to update + projection upsert run in one atomic transaction with the post-mutation tuple. Mandatory acceptance test test_not_relevant_then_restore_cycle locks the contract. - 7.10: general rule — whenever a write mutates a field that is part of any cache or index key, the cache/index write must be derived from the post-mutation value. Pass-14 fixes folded in (V2 guide review): - 6 Phase 0b: atomicity is a committed choice, not deferred. V2 uses LedgerClient.transaction() wrapping inline BEGIN/COMMIT TRANSACTION blocks. Day-1 forced-failure gate test (test_transaction_rolls_back_on_failure) is a hard ship-blocker for Phase 0b. Explicit fallback path documented if gate fails: LET-chained statements, then network SurrealDB. - 5.6: pending rebinds get a server-enforced lease (BICAMERAL_REBIND_LEASE_SECONDS, default 24h). On-demand expiry sweep at start of every phase 1 atomically abandons stale leases as outcome='abandoned_by_expiry'. force_supersede=true flag for explicit takeover. Phase 2 lease check rejects verdicts on expired/superseded/abandoned attempts as stale-history-only. Guarantees forward progress even on caller crash. - 8 q5 / 7.3 / 10: judge_gaps contradiction resolved. bicameral_judge_gaps is read-only; the destructive write is in resolve_compliance, which Phase 0a migrates. Phase 0a covers the entire pipeline; no separate change to judge_gaps required. - 6 Phase 4 D2 / 7.11: edge-vs-region terminology audit. binding_version lives on binds_to edges, never on code_region. All Phase 4 D2 wording rewritten in edge terminology; explicit warning against the rejected per-region versioning model. - 7.11: pass-14 constraints added to the catalog with prevention rules synthesized. V1 status preserved: 6 commits on this branch (3b4d0bb...8e226c5) ship the read-path advisory layer with zero new mutating capabilities. 75 passed + 1 xfailed in test suite. The xfailed scenario 8 flips to PASS the moment V2's bicameral_rebind ships — it's @pytest.mark.xfail(strict=True), so a CI-visible signal will fire. V2 effort sense: 7-10 engineer-weeks sequential. Phases don't parallelize cleanly because each one's correctness depends on the prior one's invariants. Strong recommendation to involve Jin at design time (CODEOWNERS, project-judgment review the adversarial Codex passes can't provide). Branch: desync-optimization-v1 (renamed from desync-optimization) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Post-rebase onto v0.7.0 (Accountable North Star), fresh decisions are ingested as signoff.state='proposed' and report status='proposal' until explicitly ratified via bicameral_ratify. They are drift-exempt by design — they never enter pending/drifted until ratified. Scenario 2 was written under the v0.5.0 contract where a hash-divergent region without a cached compliant verdict reports status='pending'. Under v0.7.0 the same flow now reports status='proposal' for the unratified decision, even though the substantive drift signal still flows through pending_compliance_checks (which the test continues to assert above the status check). Update: accept any of {pending, drifted, proposal} — the per-decision status field is richer under v0.7.0; the scenario's invariant (system surfaces a compliance check when bound code changes) is unchanged. Tests: 77 passed, 1 xfailed in 7.52s — full V1 suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DriftEntry.cosmetic_hint was shipped in V1 B2 with no corresponding skill update, leaving the field invisible to the caller LLM. Add the three-point contract to bicameral-preflight SKILL.md (both copies): - Step 4: cosmetic_hint=true → mechanical classification, never ask - Step 5: ~ REFORMATTED rendering template (not ⚠ DRIFTED) - Stop-and-Ask Contract: cosmetic drift listed as mechanical example Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Four occurrences of the deleted V1 plan / V2 design paths (docs/desync-optimization.md, docs/desync-optimization-v1-plan.md) pointed at files that were consolidated into docs/v2-desync-optimization-guide.md and removed. Replace each with the consolidated path so internal links resolve. Surrounding prose preserved verbatim per the reviewer's "preserve text and formatting" instruction; the dated "nine rounds of Codex review" claim is unchanged (the guide actually absorbed 14 passes by ship time, but that's a separate edit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Project root CLAUDE.md (lines 3-17) requires that any tool response field shape change ships with a matching SKILL.md update. V1 added: - sync_metrics on SearchDecisionsResponse / PreflightResponse / HistoryResponse / BindResponse (contracts.py) - cosmetic_hint on DriftEntry (contracts.py) - original_lines on pending_grounding_checks symbol_disappeared payload (ledger/adapter.py) - per-reason verification_instruction split (handlers/link_commit.py) None of these were mentioned in any skill — audit by grep across .claude/skills/{search,preflight,history,doctor,scan-branch}/SKILL.md found zero hits for the new field names before this commit. Updates: - bicameral-search / bicameral-preflight / bicameral-history: one-line note that sync_metrics is server-side observability (skip rendering; log if profiling). Minimal per the change owner's "minimal note" instruction for observability fields. - bicameral-doctor: new "Per-entry advisory fields" subsection covering all three doctor-relevant additions: * cosmetic_hint as advisory metadata that never gates status * original_lines on symbol_disappeared payloads * the new verification_instruction split — explicit warning that the doctor skill must NOT synthesize a bicameral.bind CTA for symbol_disappeared cases (would create duplicate-binding state under N:N binds_to until V2 atomic rebind ships); ungrounded keeps the safe bind CTA. - bicameral-scan-branch: extended the existing DriftEntry bullet with a cosmetic_hint advisory note. Emphasizes that the hint is render-time tag only — drifted entries stay drifted regardless. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four correctness fixes surfaced by post-V1-merge review. All verified against current code before applying; none affect the V1 acceptance contract (zero new mutating capabilities, scorecard 12 PASS + 1 XFAIL). 77 passed + 1 xfailed in the V1 suite after fixes. handlers/detect_drift.py — two issues A. Magic 10**9 in get_git_content calls. ledger.status.get_git_content takes start_line / end_line in its signature but ignores them entirely — the function always returns the full file body, with callers slicing locally afterward. Two existing legacy callers (ledger/status.py:159, ledger/adapter.py:67) share the same signature smell. Replaced 1, 10**9 with 0, 0 and an explicit comment pointing at the upstream signature for a future cleanup. The slicing logic that follows is unchanged and correct. B. _enrich_with_cosmetic_hints diffed HEAD vs working tree unconditionally, even when the caller passed use_working_tree=False (which makes handle_detect_drift advertise source="HEAD" in the response). The cosmetic-hint axis is HEAD-vs-working-tree, so attaching hints in the source="HEAD" mode would have been misleading metadata derived from a diff axis the caller didn't ask about. Minimal fix per the reviewer's option: gate the enrichment call with `if use_working_tree:` so HEAD-source responses never carry cosmetic_hint values. tests/test_v0414_source_excerpt.py (the one test that calls with use_working_tree=False) still passes — it doesn't assert on cosmetic_hint. handlers/search_decisions.py — sync_catchup_ms scope mismatch V1 A3 timed only handle_link_commit in this handler, but ensure_ledger_synced (used by preflight / history) covers both link_commit AND get_session_start_banner. The same metric name measured different surfaces across handlers. Extended the timed scope here to include the banner call so sync_catchup_ms means the same thing everywhere. ledger/ast_diff.py — fail-safe contract leak is_cosmetic_change documents a "fail-safe → False" contract for any classifier failure, but the recursive _signature comparison ran outside the try/except. _signature recurses into the AST and can raise RecursionError on deeply nested trees, which would propagate out of the function instead of returning False. Combined the parse + has_error check + signature comparison into one guarded block; except (Exception, RecursionError) returns False per the contract. RecursionError is technically a subclass of Exception in Python 3 (the explicit tuple is documentation-of-intent — the comment block above this also calls out recursion specifically). tests/test_ast_diff.py — JSX fallback exercise test_jsx_routes_through_javascript used identical bytes for before and after, which short-circuits at the top of is_cosmetic_change before the LANGUAGE_FALLBACK['jsx'] → 'javascript' path is reached. The test name claimed to verify routing but never exercised it. Updated to use whitespace-only diff in the JS portion of the expression (extra spaces around the `=` sign) so bytes differ but the expected outcome stays True. JSX text content kept unchanged (text inside <div>...</div> IS significant in JSX — first attempt put spaces there and correctly returned False). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per the "small scope expansion only when V2 is already close" bar: V2 Phase 1-4 ship every primitive #47 needs (per-binding baseline, full-CAS hash comparison, symbol re-resolution, atomic rebind, V1's relocation surfacing). The remaining gap to fully closing #47 is one thin read-only handler that wires those primitives at the branch level. No new mutating capabilities. No schema changes. No new contract surface beyond a single response type. Effort delta: 3-5 days; total V2 estimate goes from 7-10 weeks to ~8-11 weeks. Updates to docs/v2-desync-optimization-guide.md: - 4.1 capability gap table: row 13 added (branch-aware drift report, GitHub #47, Phase 6 fix). - 4.2 product targets: explicit "branch-aware drift report works" bullet citing #47 closure and the unblocked downstream consumers (#48 pre-push hook, #49 PR-comment Action). - 6 implementation plan: Phase 6 box added to the DAG diagram below Phase 5 polish. - 6 Phase 6 section: full deliverable spec with the ScanBranchResponse contract, read-only invariant, dependency on C0 per-binding baseline, and the four acceptance criteria mirroring #47 verbatim plus the Closes #47 commitment. - 6 effort estimate table: Phase 6 row (3-5 days), total updated. - 9 acceptance criteria: new gate that the V2 PR uses Closes #47 with a passing read-only invariant test. - 10 references: handlers/scan_branch.py and tests/test_scan_branch.py added to the file list. - Final note: post-V2 workflow change codified — V2 is the last release authored by reverse-mapping deliverables to issues. After V2, switch to issue-driven (pick issue, satisfy ACs, Closes #N). Phase 6 documented as the canonical "expand in-flight release to close an adjacent issue" precedent: underlying machinery must already exist, addition must be additive, issue ACs must be fully satisfiable. Suggested next-issue order: #39 -> #42 -> #41 -> #44 -> #48 -> #49 (all 5 close via small focused PRs over 4-6 weeks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All four findings verified against current code; only the actionable ones applied. 81 passed + 1 xfailed in 9.02s. #1 — skills/bicameral-preflight/SKILL.md sync_metrics note The .claude/skills copy got the sync_metrics observability note back when V1 A3 shipped, but the canonical skills/ copy never did. Mirror the wording verbatim near step 2 so the rendering guidance and response-field documentation stay in sync. #2 — handlers/detect_drift.py per-entry alignment The cosmetic-hint enrichment was slicing both head_full and wt_full using entry.lines (the baseline anchor). HEAD and the working tree can shift the symbol independently, so a single index range can't align both sides. The narrow consequence: a drifted entry with shifted lines could yield a misleading cosmetic_hint=true on bytes that aren't the bound region. Fix: re-resolve the symbol against each ref via resolve_symbol_lines(file_path, entry.symbol, repo, ref="HEAD") and ref="working_tree" separately, slice each ref using its own resolved range. Resolution failure on either side → safe default of cosmetic_hint=False (matches the V1 contract: "False is cheap, True must be earned"). Empty symbol → skip (new fail-safe path). Test refactor: test_invalid_lines_skipped renamed to test_unresolvable_symbol_skipped — the old test asserted that lines=(0,0) was the failsafe trigger, but entry.lines is no longer the alignment input. New test exercises the resolve_symbol_lines-returns-None path via a nonexistent symbol name, which is the real fail-safe gate now. #3 — V2 guide TOC anchor for §9 GitHub auto-generates fragment IDs from heading text by lowercasing, replacing spaces with hyphens, and dropping punctuation. "## 9. Acceptance criteria for V2" maps to #9-acceptance-criteria-for-v2, but the TOC pointed at #9-acceptance-criteria (truncated). Link broken. Updated to the correct fragment. #4 — V2 guide unlabeled fenced code blocks (markdownlint MD040) Six fenced opens used bare ``` instead of a labeled fence. Tagged each with ```text — the contents are commit listings, ASCII DAG diagrams, pseudocode protocols, and tuple notation, none of which fit a real language tag. The other fenced blocks in the guide (already tagged ```sql / ```python) are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

silongtan force-pushed the desync-optimization-v1 branch from 1021822 to b420d1b Compare April 25, 2026 04:44

silongtan had a problem deploying to ci-test April 25, 2026 04:44 — with GitHub Actions Failure

silongtan requested a review from jinhongkuan April 25, 2026 04:45

This comment was marked as resolved.

Sign in to view

jinhongkuan temporarily deployed to ci-test April 25, 2026 05:37 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread docs/v2-desync-optimization-guide.md

This comment was marked as resolved.

Sign in to view

jinhongkuan temporarily deployed to ci-test April 25, 2026 05:55 — with GitHub Actions Inactive

silongtan temporarily deployed to ci-test April 25, 2026 15:11 — with GitHub Actions Inactive

This comment was marked as resolved.

Sign in to view

silongtan temporarily deployed to ci-test April 25, 2026 19:04 — with GitHub Actions Inactive

This comment was marked as resolved.

Sign in to view

silongtan temporarily deployed to ci-test April 25, 2026 19:22 — with GitHub Actions Inactive

silongtan force-pushed the desync-optimization-v1 branch from 3d57097 to e6b81ed Compare April 26, 2026 03:00

silongtan temporarily deployed to ci-test April 26, 2026 03:00 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 26, 2026

View reviewed changes

silongtan and others added 13 commits April 25, 2026 23:42

silongtan force-pushed the desync-optimization-v1 branch from e6b81ed to 5c49637 Compare April 26, 2026 03:44

silongtan temporarily deployed to ci-test April 26, 2026 03:45 — with GitHub Actions Inactive

jinhongkuan merged commit 673f9be into main Apr 26, 2026
4 checks passed

jinhongkuan deleted the desync-optimization-v1 branch April 26, 2026 07:01

This was referenced Apr 28, 2026

feat: CodeGenome Phase 1+2 (#59) — adapter + identity records #71

Merged

Preflight failure-scenario catalog + eval harness (v0.10.x) #62

Merged

coderabbitai Bot mentioned this pull request May 10, 2026

test(sync_middleware): sociable banner tests + decision_id alias fix #303

Merged

4 tasks

This was referenced Jun 1, 2026

feat(ingest): #404 Phase 1 — thread parent_decision_id through natural-format ingest #534

Closed

fix(hooks): #170 — gate the post-preflight capture reminder to non-read-only prompts #535

Closed

Conversation

silongtan commented Apr 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What V1 ships

Test plan

V2 — explicitly deferred (not in this PR)

Notes for reviewers

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

silongtan commented Apr 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading