Skip to content

desync optimization V1 — read-path advisory + V2 implementation guide#56

Merged
jinhongkuan merged 14 commits into
mainfrom
desync-optimization-v1
Apr 26, 2026
Merged

desync optimization V1 — read-path advisory + V2 implementation guide#56
jinhongkuan merged 14 commits into
mainfrom
desync-optimization-v1

Conversation

@silongtan

@silongtan silongtan commented Apr 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

Ships V1 of a two-part desync-correctness initiative plus a consolidated V2 implementation guide for the destructive-path overhaul that's deliberately deferred. V1 introduces zero new mutating
capabilities; every change is read-only measurement, additive contract field, pure function, test coverage, or a surgical bug fix to an already-shipped path.

  • V1 implementation (6 commits, 3b4d0bb…8e226c5): drift benchmark harness, per-repo asyncio.Lock for bind, SyncMetrics instrumentation, strict-whitelist tree-sitter cosmetic-change classifier with
    DriftEntry.cosmetic_hint advisory metadata, original_lines on symbol_disappeared grounding checks, canonical 13-scenario regression matrix (12 PASS / 1 XFAIL), and a _VERIFICATION_INSTRUCTION split that
    removes the v0.6.4 unsafe-bind CTA for relocation cases.
  • V2 planning (1 commit, 1021822): single-source docs/v2-desync-optimization-guide.md (999 lines) consolidating the V2 design + V1 plan + 14 rounds of Codex review into one self-contained guide for the next
    engineer/agent to pick up.

Net destructive-surface change for V1: zero (and arguably negative — V1 actively removes the v0.6.4 unsafe relocation CTA).

What V1 ships

ID Deliverable Files
A1 Drift benchmark harness — 100 decisions × 25 files, JSON artifact, @pytest.mark.bench tests/bench_drift.py
A2-light Per-repo asyncio.Lock for handle_bind (in-process only) handlers/sync_middleware.py::repo_write_barrier, handlers/bind.py
A3 SyncMetrics (sync_catchup_ms, barrier_held_ms) on Search/Preflight/History/Bind responses contracts.py, four handlers
B1 Strict-whitelist tree-sitter cosmetic-change classifier ledger/ast_diff.py
B2 DriftEntry.cosmetic_hint advisory metadata contracts.py, handlers/detect_drift.py
D1 original_lines on symbol_disappeared grounding checks; verification-instruction split for safe relocation routing ledger/adapter.py, handlers/link_commit.py
F1 Canonical 13-scenario regression matrix routed through real handler layer tests/test_desync_scenarios.py
Bug fix Empty decision_id on ungrounded grounding checks (surfaced by F1) ledger/adapter.py:475

Performance baseline (Apple Silicon, surrealdb 2.0.0):

  • search_decisions p95 = 10.4 ms (target <2 s — 185× under)
  • detect_drift p95 = 15.5 ms (target <1 s — 55× under)
  • link_commit (warm) p95 = 8.0 ms

Test plan

  • pytest tests/test_link_commit_grounding.py tests/test_desync_scenarios.py tests/test_b2_cosmetic_hint.py tests/test_ast_diff.py tests/test_phase3_integration.py tests/test_sync_middleware.py tests/test_bind.py75 passed, 1 xfailed in 7.68s
  • pytest tests/bench_drift.py -v -m bench -s → benchmark passes; baseline numbers documented in CHANGELOG
  • Test suite passes against surrealdb>=2.0.0 (the SDK 2.0 schema-migration crash I had originally pinned around is fixed properly upstream by 66796ef)
  • Zero regressions on the v0.6.3 → v0.6.4 → 0.6.4-bump → 66796ef8d44a16ce74461a5aface rebase
  • Reviewer check: scenario 8 in tests/test_desync_scenarios.py is @pytest.mark.xfail(strict=True) — it MUST stay xfailed until V2's bicameral_rebind ships. If it ever flips to xpassed, V2 has
    leaked into V1 unintentionally.

V2 — explicitly deferred (not in this PR)

The full V2 plan lives in docs/v2-desync-optimization-guide.md:

  • Phase 0: migrate handlers/resolve_compliance.py from hard-delete to tombstone+CAS (closes the highest-impact destructive backdoor — Codex pass-10 refactor: port interfaces + source_span for drift pipeline #1, hard prerequisite); A0 atomic SurrealQL transaction
    primitive with day-1 forced-failure gate test.
  • Phase 1: schema migration v5→v6 (per-binding baseline ownership, tombstone fields, append-only compliance_verdict_history, full-CAS cache key) + traversal filtering on every binds_to consumer.
  • Phase 2: full sync barrier (sync-token CAS + region fingerprint at commit time).
  • Phase 3: cache-aware pending_compliance_checks from detect_drift.
  • Phase 4: bicameral_judge_drift + record_compliance_verdict (5-field CAS), bicameral_advance_baseline, bicameral_rebind (two-phase + lease + attempt-id locking).
  • Phase 5: doctor SKILL rendering, Codex pass-15, scenario 8 xfail → expected pass.

Effort estimate: 7–10 engineer-weeks sequential. Phases don't parallelize cleanly because each one's correctness depends on the prior one's invariants.

The guide synthesizes 14 Codex review passes into a "what NOT to ship" constraints catalog (§7) — every entry came from a pass that found a real bug in an earlier draft. Strong recommendation to involve Jin at
design time before V2 implementation begins (CODEOWNERS approval required, plus this work needs project-judgment review the adversarial Codex passes can't provide).

Notes for reviewers

  • docs/v2-desync-optimization-guide.md is 999 lines but heavily structured; §1 has a recommended read order for fresh readers.
  • Two prior planning artifacts (docs/desync-optimization.md and docs/desync-optimization-v1-plan.md) were never tracked per project convention; their contents are absorbed into the consolidated guide.
  • CHANGELOG entry uses "Unreleased" heading — pick a version (v0.7.0?) at release time.
  • Scenario 8's xfail(strict=True) is intentional CI plumbing; do not remove without shipping V2 D2.

Summary by CodeRabbit

  • New Features

    • Optional sync timing metrics added to responses and per-repo write-barrier timing capture
    • Advisory cosmetic-change flag to mark whitespace-only edits (render-only; does not alter drift status)
    • Grounding checks include prior code-context for relocation cases; verification guidance split by reason
  • Bug Fixes

    • Fixed empty decision identifier emission for some ungrounded cases
    • Verification instructions now conditionally route CTAs and forbid unsafe bind-on-relocate actions
  • Documentation

    • Added V2 desync optimization guide and updated planning/changelog
  • Tests

    • Added benchmark harness (opt-in) and expanded regression/test suites (13 scenarios + cosmetic/sync tests)

@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Desync Optimization V1: read-path-only instrumentation and advisory changes — sync-metrics on responses, per-repo write barrier for bind timing, strict AST whitespace-only cosmetic classifier with drift-entry enrichment, conditional verification-instruction routing, grounding payload enrichment, a drift benchmark harness, and regression tests; destructive write-path work deferred to V2.

Changes

Cohort / File(s) Summary
Docs & Planning
CHANGELOG.md, PLAN.md, TODO.md, docs/v2-desync-optimization-guide.md
New V1 completion notes, V2 design/migration guide, updated TODOs and acceptance criteria; non-code documentation added.
Contracts / Models
contracts.py
Added SyncMetrics model and optional sync_metrics on multiple response types; added advisory DriftEntry.cosmetic_hint.
AST Cosmetic Classifier & Tests
ledger/ast_diff.py, tests/test_ast_diff.py
New strict tree-sitter-based is_cosmetic_change() with fail-safe parsing and comprehensive tests for cosmetic vs semantic edits.
Handlers — Sync Instrumentation
handlers/history.py, handlers/preflight.py, handlers/search_decisions.py, handlers/bind.py
Measure ledger catch-up and barrier-held durations; attach sync_metrics to responses; handle_bind delegates to _do_bind and wraps execution with repo_write_barrier.
Handlers — Logic Refinements
handlers/link_commit.py, handlers/detect_drift.py
_build_verification_instruction now composes conditional verification text (forbids bicameral.bind for symbol_disappeared relocations); detect_drift enriches DriftEntry.cosmetic_hint for whitespace-only edits (fail-safe).
Sync Middleware
handlers/sync_middleware.py, tests/test_sync_middleware.py
Added repo_write_barrier(ctx) async context manager with per-repo locks and BarrierTiming.held_ms; tests for serialization, timing, and exception behavior.
Ledger Adapter
ledger/adapter.py
ingest_commit emits original_lines for symbol_disappeared pending grounding checks and fixes decision_id extraction for ungrounded decisions.
Benchmarks
tests/bench_drift.py, tests/conftest.py
New drift benchmark harness (pytest bench marker) measuring link_commit/search/detect_drift latencies and emitting JSON results.
Tests — Cosmetic Hint & Scenarios
tests/test_b2_cosmetic_hint.py, tests/test_desync_scenarios.py, tests/test_link_commit_grounding.py
New tests validating cosmetic-hint enrichment, 13 desync regression scenarios, verification-instruction content, and original_lines expectations.
Skills / Docs
.claude/skills/*, skills/bicameral-preflight/SKILL.md
Skill docs updated to document sync_metrics, cosmetic_hint, and informational-only verification instructions for relocations.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Handler as Handler (search/preflight/bind/detect_drift)
  participant Sync as repo_write_barrier (Lock)
  participant Ledger
  participant Repo as GitRepo
  participant AST as ledger/ast_diff

  Client->>Handler: request (search/preflight/bind/detect_drift)
  Handler->>Sync: enter repo_write_barrier(ctx)
  Sync-->>Handler: BarrierTiming (held_ms unset)
  Handler->>Ledger: ensure_ledger_synced(ctx)
  Ledger-->>Handler: synced
  alt detect_drift path
    Handler->>Repo: read file(s) (HEAD, working tree)
    Repo-->>Handler: file contents
    Handler->>AST: is_cosmetic_change(before, after, lang)
    AST-->>Handler: cosmetic? (True/False)
  end
  Handler->>Sync: exit repo_write_barrier (record held_ms)
  Handler-->>Client: Response (includes sync_metrics, cosmetic_hint, verification_instruction)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jinhongkuan

Poem

🐰
I hopped through locks and timed each beat,
Nibbled stray spaces till the diff looked neat,
I warned where binds must never roam,
Gave hints and metrics, then hopped home,
A tiny V1 hop toward a safer comb. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.37% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: delivering V1 of desync optimization (read-path advisory features) and adding V2 implementation guide, directly matching the PR's primary objectives.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch desync-optimization-v1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
handlers/sync_middleware.py (1)

56-80: Optional: _repo_locks grows without bound.

The registry never evicts entries, so a long-lived MCP server process that handles many distinct repo_path values will accumulate one asyncio.Lock per repo for the lifetime of the process. Per-process MCP servers typically see ≤1 repo, so this is unlikely to bite, but if you anticipate a multi-repo daemon mode (e.g. dashboard, CI worker), consider a WeakValueDictionary or an explicit eviction hook. Pure footprint issue — no correctness impact.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 56 - 80, _repo_locks currently
stores a strong reference for every repo and will grow without bound; change the
storage used by _repo_locks in handlers/sync_middleware.py (used by
_get_repo_lock and guarded by _guard) to a weak-value mapping or add an eviction
mechanism so locks for unused repo_path keys can be garbage-collected: replace
the plain dict _repo_locks with a weakref.WeakValueDictionary (or implement a
size/time-based eviction policy) and ensure the existing _guard()/async with
_guard() protection remains around accesses to the new mapping.
tests/bench_drift.py (2)

80-90: p95 is off-by-one when 0.95 * len(s) is not an integer.

s[max(0, int(0.95 * len(s)) - 1)] returns the nearest-rank p95 only when 0.95*N is integral (e.g., N=20, 100). For N=50 (the search-decisions sample size: 10 queries × 5 iterations) it yields index 46 (47th value) instead of the standard nearest-rank index 47 (48th value). The published baseline numbers in PLAN.md / V2 guide are derived from this — minor but worth tightening since they're cited as the V2 perf budget reference.

♻️ Proposed fix using `statistics.quantiles`
-def _percentiles(samples: list[float]) -> dict[str, float]:
-    if not samples:
-        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0}
-    s = sorted(samples)
-    return {
-        "p50": statistics.median(s),
-        "p95": s[max(0, int(0.95 * len(s)) - 1)],
-        "max": s[-1],
-        "mean": statistics.fmean(s),
-        "n": len(s),
-    }
+def _percentiles(samples: list[float]) -> dict[str, float]:
+    if not samples:
+        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0}
+    s = sorted(samples)
+    # Nearest-rank p95: smallest value with rank >= ceil(0.95 * N).
+    p95_idx = max(0, -(-len(s) * 95 // 100) - 1)
+    return {
+        "p50": statistics.median(s),
+        "p95": s[p95_idx],
+        "max": s[-1],
+        "mean": statistics.fmean(s),
+        "n": len(s),
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 80 - 90, The p95 calculation in
_percentiles is off-by-one for non-integer 0.95*N; replace the current manual
index logic with a proper quantile computation (e.g., use statistics.quantiles
with n=100 and method="inclusive" or method="nearest" to get the 95th
percentile) on the sorted samples s, and return that value for "p95" while
keeping the other returns ("p50", "max", "mean", "n") unchanged; ensure the
empty-samples branch remains the same.

118-121: Log the swallowed exception.

A bare except Exception: continue here means a regression in extract_symbols (e.g., a tree-sitter parser change or a malformed file pattern) silently shrinks the corpus until the len(symbols) >= 25 assertion at line 193 trips far away from the cause. Logging at DEBUG keeps the bench quiet but recoverable.

-        try:
-            records = await adapter.extract_symbols(str(fp))
-        except Exception:
-            continue
+        try:
+            records = await adapter.extract_symbols(str(fp))
+        except Exception as exc:
+            print(f"[bench] extract_symbols failed for {fp}: {exc}")
+            continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 118 - 121, The try/except around
adapter.extract_symbols currently swallows all exceptions; change it to catch
Exception as e and log the error at DEBUG before continuing so failures in
extract_symbols are visible (e.g., logger.debug("extract_symbols failed for %s:
%s", fp, e, exc_info=True)) and then continue; keep the variable names (records,
adapter.extract_symbols, fp) intact and only add the debug log to avoid noisy
test output.
tests/test_b2_cosmetic_hint.py (1)

78-90: Dead write + unused fixture parameter.

The _write_file at line 80 is immediately overwritten at line 83 before any git operation, so it has no effect on the test. The tmp_path parameter is also unused (the working-tree path comes from the repo_with_baseline fixture). The test reads as if it's establishing a non-docstring baseline state, but it isn't.

♻️ Proposed cleanup
-def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline, tmp_path):
+def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline):
     repo, rel = repo_with_baseline
-    _write_file(repo, rel, "def f(x):\n    return x + 1\n")
-    # Now overwrite baseline by committing a docstring-only version, then edit working tree.
+    # Commit a docstring-bearing baseline, then edit only the docstring text in
+    # the working tree — the diff is observable via __doc__, so it must NOT be
+    # classified as cosmetic.
     import subprocess
     _write_file(repo, rel, 'def f(x):\n    """Old."""\n    return x + 1\n')
     subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
     subprocess.run(["git", "commit", "-q", "-m", "add docstring"], cwd=repo, check=True)
-    # Working tree edits the docstring text — observable via __doc__.
     _write_file(repo, rel, 'def f(x):\n    """New."""\n    return x + 1\n')
     entry = _make_entry(status="drifted", lines=(1, 3))
     _enrich_with_cosmetic_hints([entry], rel, str(repo))
     assert entry.cosmetic_hint is False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_b2_cosmetic_hint.py` around lines 78 - 90, The test dead-writes a
file via _write_file immediately overwritten later and declares an unused
tmp_path fixture; remove the first _write_file call (the one creating "def
f(x):\n    return x + 1\n") and drop the tmp_path parameter from
test_docstring_edit_keeps_cosmetic_hint_false so the test only creates the
baseline commit with the docstring and then edits the working tree, keeping
references to _write_file, subprocess.run, _make_entry and
_enrich_with_cosmetic_hints intact.
TODO.md (1)

138-140: Line numbers in prose will drift.

ledger/adapter.py:475 is also referenced from PLAN.md line 102. The bug fix point will move on the next edit to that file and the docs go stale silently. Consider citing a stable anchor (function name) instead.

-- [x] **incidental fix** — `ledger/adapter.py:475` was emitting empty
-      `decision_id` on ungrounded grounding checks; surfaced by F1
+- [x] **incidental fix** — empty `decision_id` on ungrounded
+      `pending_grounding_checks` in `ledger/adapter.py`
+      (`SurrealDBLedgerAdapter._collect_pending_grounding_checks`); surfaced by F1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TODO.md` around lines 138 - 140, Update the docs to stop citing a fragile
line number: replace the `ledger/adapter.py:475` reference in TODO.md and
PLAN.md with a stable symbol (the function or method that emits the decision_id
for grounding checks); locate the emitter by searching for the "decision_id"
token or the grounding-check function in ledger/adapter.py and reference it by
its fully-qualified function/method name (e.g., AdapterClass.method_name or
function_name) so future edits won't silently break the doc link.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/v2-desync-optimization-guide.md`:
- Around line 519-535: The current transaction() context manager builds a raw
BEGIN/COMMIT block and calls self._db.query(...) which is incompatible with
surrealdb-py's transaction API; refactor transaction() (the asynccontextmanager
that yields TransactionBuffer) to create a session via self._db.new_session(),
start a transaction with session.begin_transaction(), run each buffered
statement through the transaction object (txn.query or txn.execute equivalent)
so you get per-statement results, and then call txn.commit() in the try path or
txn.cancel() in except, converting any per-statement failures into LedgerError
(replace the current isinstance(stmt, str) check and the raw self._db.query
usage with transaction-aware calls and proper try/except around
txn.commit()/txn.cancel()).

---

Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 56-80: _repo_locks currently stores a strong reference for every
repo and will grow without bound; change the storage used by _repo_locks in
handlers/sync_middleware.py (used by _get_repo_lock and guarded by _guard) to a
weak-value mapping or add an eviction mechanism so locks for unused repo_path
keys can be garbage-collected: replace the plain dict _repo_locks with a
weakref.WeakValueDictionary (or implement a size/time-based eviction policy) and
ensure the existing _guard()/async with _guard() protection remains around
accesses to the new mapping.

In `@tests/bench_drift.py`:
- Around line 80-90: The p95 calculation in _percentiles is off-by-one for
non-integer 0.95*N; replace the current manual index logic with a proper
quantile computation (e.g., use statistics.quantiles with n=100 and
method="inclusive" or method="nearest" to get the 95th percentile) on the sorted
samples s, and return that value for "p95" while keeping the other returns
("p50", "max", "mean", "n") unchanged; ensure the empty-samples branch remains
the same.
- Around line 118-121: The try/except around adapter.extract_symbols currently
swallows all exceptions; change it to catch Exception as e and log the error at
DEBUG before continuing so failures in extract_symbols are visible (e.g.,
logger.debug("extract_symbols failed for %s: %s", fp, e, exc_info=True)) and
then continue; keep the variable names (records, adapter.extract_symbols, fp)
intact and only add the debug log to avoid noisy test output.

In `@tests/test_b2_cosmetic_hint.py`:
- Around line 78-90: The test dead-writes a file via _write_file immediately
overwritten later and declares an unused tmp_path fixture; remove the first
_write_file call (the one creating "def f(x):\n    return x + 1\n") and drop the
tmp_path parameter from test_docstring_edit_keeps_cosmetic_hint_false so the
test only creates the baseline commit with the docstring and then edits the
working tree, keeping references to _write_file, subprocess.run, _make_entry and
_enrich_with_cosmetic_hints intact.

In `@TODO.md`:
- Around line 138-140: Update the docs to stop citing a fragile line number:
replace the `ledger/adapter.py:475` reference in TODO.md and PLAN.md with a
stable symbol (the function or method that emits the decision_id for grounding
checks); locate the emitter by searching for the "decision_id" token or the
grounding-check function in ledger/adapter.py and reference it by its
fully-qualified function/method name (e.g., AdapterClass.method_name or
function_name) so future edits won't silently break the doc link.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bf4c2513-a834-4520-80b2-fe791afc76a0

📥 Commits

Reviewing files that changed from the base of the PR and between 1021822 and 63fd689.

📒 Files selected for processing (21)
  • CHANGELOG.md
  • PLAN.md
  • TODO.md
  • contracts.py
  • docs/v2-desync-optimization-guide.md
  • handlers/bind.py
  • handlers/detect_drift.py
  • handlers/history.py
  • handlers/link_commit.py
  • handlers/preflight.py
  • handlers/search_decisions.py
  • handlers/sync_middleware.py
  • ledger/adapter.py
  • ledger/ast_diff.py
  • tests/bench_drift.py
  • tests/conftest.py
  • tests/test_ast_diff.py
  • tests/test_b2_cosmetic_hint.py
  • tests/test_desync_scenarios.py
  • tests/test_link_commit_grounding.py
  • tests/test_sync_middleware.py
✅ Files skipped from review due to trivial changes (2)
  • tests/conftest.py
  • handlers/history.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/test_ast_diff.py
  • handlers/preflight.py
  • handlers/link_commit.py
  • tests/test_link_commit_grounding.py
  • CHANGELOG.md

Comment thread docs/v2-desync-optimization-guide.md
devin-ai-integration[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
handlers/sync_middleware.py (2)

83-115: Minor readability nit: forward reference to BarrierTiming.

repo_write_barrier (L83–102) instantiates BarrierTiming at L96, but the class is defined below at L105. It works because Python resolves the name at call time, but readers parsing top-to-bottom hit the use before the definition. Consider hoisting BarrierTiming above repo_write_barrier.

Suggested ordering
-@asynccontextmanager
-async def repo_write_barrier(ctx):
-    ...
-
-class BarrierTiming:
-    ...
+class BarrierTiming:
+    ...
+
+@asynccontextmanager
+async def repo_write_barrier(ctx):
+    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 83 - 115, The function
repo_write_barrier instantiates BarrierTiming before the class is defined, which
is a readability nit—move the BarrierTiming class definition above
repo_write_barrier so the type is defined before use; update any
docstrings/comments if needed to reflect the new order and keep the exact class
name BarrierTiming and function name repo_write_barrier unchanged.

56-71: Lazy-loop binding works for the MCP server, but the registry persists across loops.

Single-threaded asyncio guarantees the if _repo_locks_guard is None: … block is atomic, so the lazy guard is race-safe. Two follow-up considerations worth tracking for V2:

  1. The locks in _repo_locks and _repo_locks_guard itself are bound to the loop active when first created. If the host ever recycles the event loop (test harnesses already do — that's why _reset_repo_locks_for_tests exists), production-side recycling would silently re-use a lock from a dead loop and raise RuntimeError: ... attached to a different loop. Not a bug today since the MCP server keeps a single loop for its lifetime, but please note this constraint in the docstring or guard against it (e.g., compare asyncio.get_running_loop() to a stored loop ref and rebuild on mismatch).
  2. _repo_locks is never evicted. For a long-running server processing many distinct ctx.repo_path values this leaks asyncio.Lock objects. Likely fine in practice (few repos per session) but worth a TODO for V2's broader barrier scope.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@handlers/sync_middleware.py` around lines 56 - 71, The current lazy guard and
per-repo locks (_repo_locks_guard and _repo_locks) are bound to whichever event
loop exists at first creation and persist across loops, risking RuntimeError if
the event loop is recycled and leaking locks for many ctx.repo_path values;
update _guard() to store the loop used when creating _repo_locks_guard (capture
asyncio.get_running_loop()), compare that stored loop to
asyncio.get_running_loop() on each call and recreate the guard (and clear or
rebuild _repo_locks) when they differ, and add a concise docstring note about
loop-binding; also add a TODO comment near _repo_locks about eviction and
reference the existing _reset_repo_locks_for_tests as a test helper for clearing
state.
tests/test_sync_middleware.py (1)

343-354: Potentially flaky timing assertion.

asyncio.sleep(0.02) is documented as "approximately delay seconds" and the wall clock may be coarser than perf_counter on some platforms (notably Windows < ~15.6ms timer resolution). Asserting held_ms >= 20.0 is tight — a single rounding/clock-skew artifact will turn it red. Consider relaxing to >= 15.0 or sleeping a bit longer (e.g. 50ms with >= 40.0).

Suggested loosening
-    async with repo_write_barrier(ctx) as timing:
-        assert timing.held_ms is None  # not yet populated
-        await asyncio.sleep(0.02)
-    assert timing.held_ms is not None
-    assert timing.held_ms >= 20.0  # we slept 20ms, measured wall clock should reflect it
-    assert timing.held_ms < 500.0  # and not be absurd
+    async with repo_write_barrier(ctx) as timing:
+        assert timing.held_ms is None  # not yet populated
+        await asyncio.sleep(0.05)
+    assert timing.held_ms is not None
+    assert timing.held_ms >= 40.0  # slept 50ms; allow ~10ms slack for coarse clocks
+    assert timing.held_ms < 500.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_sync_middleware.py` around lines 343 - 354, The timing assertion
in test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update
the test that uses repo_write_barrier and BarrierTiming.held_ms to either
increase the sleep to a longer, less flaky duration (e.g., await
asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g.,
keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms
comparison is robust across platforms.
tests/bench_drift.py (2)

80-90: Minor: mean key is missing from the empty-samples branch.

If any handler ends up with zero samples, the resulting JSON artifact will lack mean for that handler while populated handlers include it — making downstream diffs across runs awkward. The stdout summary at Line 270 is unaffected (it only reads p50/p95/max/n), but the artifact schema should be consistent.

♻️ Proposed fix
     if not samples:
-        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0}
+        return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 80 - 90, The empty-samples branch in
_percentiles currently returns {"p50","p95","max","n"} but omits "mean", causing
inconsistent JSON schema; update the early-return dict in function _percentiles
to include "mean": 0.0 (and keep "n": 0) so all callers receive the same set of
keys ("p50","p95","max","mean","n") whether samples is empty or not.

118-121: Optional: log the swallowed extraction error.

except Exception: continue will silently mask systematic adapter failures — if extract_symbols fails on every file the bench will assert at Line 193 with a useless "Only got 0 symbols" message. A logger.debug (or even a print since this is a bench script) would shorten diagnosis if this happens in CI.

♻️ Proposed fix
         try:
             records = await adapter.extract_symbols(str(fp))
-        except Exception:
+        except Exception as exc:  # noqa: BLE001 — bench best-effort
+            print(f"[bench] extract_symbols failed for {fp}: {exc}")
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/bench_drift.py` around lines 118 - 121, The try/except around
adapter.extract_symbols(str(fp)) is swallowing all errors; change the except to
capture the exception (except Exception as e) and log or print the failure
(including the filepath/str(fp) and e) before continuing so
adapter.extract_symbols failures are visible during CI; update the block that
assigns records = await adapter.extract_symbols(str(fp)) to use the captured
exception and a logger.debug or print statement referencing the same filepath
and exception details.
tests/test_desync_scenarios.py (1)

405-441: Optional: tighten scenario 9 to actually assert supersession.

The current assertion (r1.ingested and r2.ingested) only proves the handler doesn't raise — a regression that silently dropped supersession_candidates emission would still pass. Since the test is the canonical regression entry for scenario 9, a one-line assertion that r2.supersession_candidates is non-empty (or that one of the two decisions transitions to a superseded status) would close that gap without duplicating coverage from tests/test_supersession.py.

♻️ Proposed fix
     r1 = await handle_ingest(ctx, p1)
     r2 = await handle_ingest(ctx, p2)
     assert r1.ingested and r2.ingested
+    # Smoke check: re-ingest with overlapping intent must surface a
+    # supersession candidate, otherwise the canonical path silently
+    # regressed even if both ingests "succeeded".
+    assert r2.supersession_candidates, (
+        f"Expected supersession_candidates on overlapping re-ingest, got: {r2}"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_desync_scenarios.py` around lines 405 - 441, The test currently
only asserts that handle_ingest returned ingested flags for r1 and r2; update
test_scenario_09_intent_description_supersession to also assert that
supersession actually occurred by checking r2.supersession_candidates is
non-empty (or that r1 or r2 has a decision with status "superseded"); locate the
test function and add a single assertion after the existing ones that inspects
r2.supersession_candidates (or verifies a decision status change) to ensure
supersession is emitted, without changing any handler logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PLAN.md`:
- Line 102: The checklist entry references a stale line number for the
incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.

---

Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 83-115: The function repo_write_barrier instantiates BarrierTiming
before the class is defined, which is a readability nit—move the BarrierTiming
class definition above repo_write_barrier so the type is defined before use;
update any docstrings/comments if needed to reflect the new order and keep the
exact class name BarrierTiming and function name repo_write_barrier unchanged.
- Around line 56-71: The current lazy guard and per-repo locks
(_repo_locks_guard and _repo_locks) are bound to whichever event loop exists at
first creation and persist across loops, risking RuntimeError if the event loop
is recycled and leaking locks for many ctx.repo_path values; update _guard() to
store the loop used when creating _repo_locks_guard (capture
asyncio.get_running_loop()), compare that stored loop to
asyncio.get_running_loop() on each call and recreate the guard (and clear or
rebuild _repo_locks) when they differ, and add a concise docstring note about
loop-binding; also add a TODO comment near _repo_locks about eviction and
reference the existing _reset_repo_locks_for_tests as a test helper for clearing
state.

In `@tests/bench_drift.py`:
- Around line 80-90: The empty-samples branch in _percentiles currently returns
{"p50","p95","max","n"} but omits "mean", causing inconsistent JSON schema;
update the early-return dict in function _percentiles to include "mean": 0.0
(and keep "n": 0) so all callers receive the same set of keys
("p50","p95","max","mean","n") whether samples is empty or not.
- Around line 118-121: The try/except around adapter.extract_symbols(str(fp)) is
swallowing all errors; change the except to capture the exception (except
Exception as e) and log or print the failure (including the filepath/str(fp) and
e) before continuing so adapter.extract_symbols failures are visible during CI;
update the block that assigns records = await adapter.extract_symbols(str(fp))
to use the captured exception and a logger.debug or print statement referencing
the same filepath and exception details.

In `@tests/test_desync_scenarios.py`:
- Around line 405-441: The test currently only asserts that handle_ingest
returned ingested flags for r1 and r2; update
test_scenario_09_intent_description_supersession to also assert that
supersession actually occurred by checking r2.supersession_candidates is
non-empty (or that r1 or r2 has a decision with status "superseded"); locate the
test function and add a single assertion after the existing ones that inspects
r2.supersession_candidates (or verifies a decision status change) to ensure
supersession is emitted, without changing any handler logic.

In `@tests/test_sync_middleware.py`:
- Around line 343-354: The timing assertion in
test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update
the test that uses repo_write_barrier and BarrierTiming.held_ms to either
increase the sleep to a longer, less flaky duration (e.g., await
asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g.,
keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms
comparison is robust across platforms.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d7efb60-551d-48a4-81d1-6b6e23ff4919

📥 Commits

Reviewing files that changed from the base of the PR and between 020c74e and e6b81ed.

📒 Files selected for processing (27)
  • .claude/skills/bicameral-doctor/SKILL.md
  • .claude/skills/bicameral-history/SKILL.md
  • .claude/skills/bicameral-preflight/SKILL.md
  • .claude/skills/bicameral-scan-branch/SKILL.md
  • .claude/skills/bicameral-search/SKILL.md
  • CHANGELOG.md
  • PLAN.md
  • TODO.md
  • contracts.py
  • docs/v2-desync-optimization-guide.md
  • handlers/bind.py
  • handlers/detect_drift.py
  • handlers/history.py
  • handlers/link_commit.py
  • handlers/preflight.py
  • handlers/search_decisions.py
  • handlers/sync_middleware.py
  • ledger/adapter.py
  • ledger/ast_diff.py
  • skills/bicameral-preflight/SKILL.md
  • tests/bench_drift.py
  • tests/conftest.py
  • tests/test_ast_diff.py
  • tests/test_b2_cosmetic_hint.py
  • tests/test_desync_scenarios.py
  • tests/test_link_commit_grounding.py
  • tests/test_sync_middleware.py
✅ Files skipped from review due to trivial changes (5)
  • tests/conftest.py
  • .claude/skills/bicameral-search/SKILL.md
  • .claude/skills/bicameral-scan-branch/SKILL.md
  • docs/v2-desync-optimization-guide.md
  • .claude/skills/bicameral-history/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (8)
  • handlers/history.py
  • .claude/skills/bicameral-doctor/SKILL.md
  • handlers/search_decisions.py
  • .claude/skills/bicameral-preflight/SKILL.md
  • handlers/link_commit.py
  • TODO.md
  • tests/test_ast_diff.py
  • CHANGELOG.md

Comment thread PLAN.md
- [x] B2 — `DriftEntry.cosmetic_hint` advisory + `_enrich_with_cosmetic_hints` helper in `handle_detect_drift`
- [x] D1 — `original_lines` field on `symbol_disappeared` grounding checks; **`_build_verification_instruction` split** so relocation cases never get the unsafe `bicameral.bind` CTA
- [x] F1 — `tests/test_desync_scenarios.py` canonical 13-scenario regression matrix (12 pass + 1 V2 xfail)
- [x] Incidental fix — empty `decision_id` on ungrounded `pending_grounding_checks` (`ledger/adapter.py:475`)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor: stale line reference.

ledger/adapter.py:475 no longer points at the empty-decision_id fix; the relevant block is now around L500–509 in the current file. Consider replacing with a symbolic anchor (e.g., "in get_all_decisions ungrounded fan-out") so doc rot doesn't accumulate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PLAN.md` at line 102, The checklist entry references a stale line number for
the incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.

silongtan and others added 13 commits April 25, 2026 23:42
…ning

Ships three read-path / measurement pieces from the V1 plan:

A1 — Drift benchmark harness (tests/bench_drift.py)
  Seeds 100 decisions across 25 files (real symbols via tree-sitter
  extract_symbols, no BM25 index build required), times search_decisions
  / detect_drift / link_commit under a memory:// ledger, writes
  test-results/bench/drift_baseline.json plus a stdout summary.
  Registered @pytest.mark.bench so default test runs skip it.
  Baseline on Apple Silicon: search_decisions p95 = 9.8 ms,
  detect_drift p95 = 14.8 ms, link_commit (warm) p95 = 8.0 ms — all
  55–185× under the V2 targets in PLAN.md:83.

A2-light — per-repo asyncio.Lock for handle_bind
  New handlers/sync_middleware.repo_write_barrier async context manager
  backed by a module-level dict[repo_path, asyncio.Lock]. Lazy guard-lock
  construction avoids the "bound to wrong loop" pitfall across test event
  loops. handle_bind wraps its body in the barrier via a thin _do_bind
  inner function. Deliberately narrow scope: does NOT protect
  resolve_compliance or cross-process writers — both are V2 scope
  (plan §5.2, §5.5).

A3 — Sync metrics instrumentation (SyncMetrics contract)
  New contracts.SyncMetrics model with optional sync_catchup_ms and
  barrier_held_ms float fields, attached as
  sync_metrics: SyncMetrics | None = None to SearchDecisionsResponse,
  PreflightResponse, HistoryResponse, BindResponse. Purely additive,
  non-breaking. repo_write_barrier now yields a mutable BarrierTiming
  whose held_ms is populated on exit (including on exceptions).
  Each handler times its own sync call locally so nested calls
  (e.g. preflight → search_decisions) don't step on each other's metrics.

Incidental (rebase update): an earlier draft of this commit pinned
surrealdb>=1.0.0,<2.0.0 in pyproject.toml as a defensive workaround
for an SDK 2.0 crash during schema migration (InternalError raised
inside an except-LedgerError block in ledger/schema.py:259, which
made every ingest-based test fail against 2.0). That pin was dropped
during rebase onto main: commit 66796ef ("fix(client): wrap SDK
SurrealError so init_schema idempotency catch works") fixes the
underlying issue at the LedgerClient layer, so the test suite now
passes against surrealdb>=2.0.0 directly. No pin survives in this
commit's diff.

Tests: 75 passed, 1 xfailed against surrealdb 2.0.0 after rebase
(was 32 passed pre-F1; F1 lifts the count). Zero regressions from
the v0.6.3 / v0.6.4 / 66796ef / 8d44a16 / ce74461 / a5aface rebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sory

B1 — ledger/ast_diff.py
  is_cosmetic_change(before, after, lang) returns True only when two
  snippets differ by inter-token whitespace alone. Strategy: parse both
  with tree-sitter and compare a recursive (node.type, child_sigs |
  leaf_bytes) signature. Two trees with identical signatures differ
  only by whitespace tree-sitter doesn't represent as nodes; any
  structural or leaf-text difference returns False.

  Strict whitelist by design — variable renames, trailing commas,
  comment edits (incl. # type: ignore / // @ts-ignore / build tags),
  docstring edits, and string-literal changes all return False. False
  negatives (real cosmetic changes routed unbiased to L3) are cheap;
  false positives bias the future V2 caller-LLM verdict prompt toward
  "looks fine" — which is the exact failure mode Codex pass-7 finding
  #3 flagged.

  Reuses code_locator.indexing.symbol_extractor._get_parser and
  LANGUAGE_FALLBACK so the cosmetic detector and the symbol indexer
  can never silently disagree on supported languages. Supports
  python, javascript, typescript, java, go, rust, c_sharp (plus
  jsx → javascript and tsx → typescript via fallback). Unsupported
  langs, parse failures, and has_error trees fail safe to False.

B2 — DriftEntry.cosmetic_hint
  Adds cosmetic_hint: bool = False on contracts.DriftEntry (additive,
  non-breaking). handlers/detect_drift._enrich_with_cosmetic_hints
  runs after the pure raw_decisions_to_drift_entries mapping and only
  touches drifted entries.

  Source of "before" / "after": HEAD bytes vs working-tree bytes via
  ledger.status.get_git_content, sliced to the region's
  (start_line, end_line). Language resolved from file extension via
  EXTENSION_LANGUAGE.

  Five fail-safe paths leave cosmetic_hint=False: non-drifted status,
  equal HEAD/working-tree bytes (no diff to classify), unsupported
  extension, invalid line range, exception during is_cosmetic_change.

  Codex invariants enforced by tests:
    * The hint NEVER mutates content_hash or any other DriftEntry
      field besides cosmetic_hint itself (test_content_hash_never_mutated).
    * The hint never gates drift surfacing or status — it is metadata
      consumed by the eventual V2 caller-LLM verdict prompt.

Tests: 61 passed in 3.81s — 21 ast_diff, 8 b2_cosmetic_hint, 19
sync_middleware, 6 bind, 7 phase3 integration. Zero regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d payload

V1 plan D1 was originally designed around a server-side
search_code(decision.description) to surface relocation candidates.
That approach is obsolete after v0.6.4 nuked search_code and shifted
all code retrieval to the caller LLM. v0.6.4 already implemented the
spirit of D1:
  - ledger/adapter.py:412-420 emits a symbol_disappeared grounding
    check on the authoritative ref when resolve_symbol_lines returns
    None for a tracked region;
  - handlers/link_commit.py:21 verification_instruction tells the
    caller to use Grep/Read + validate_symbols / extract_symbols +
    bicameral.bind;
  - tests/test_link_commit_grounding.py covers the rename-→-grounding
    flow end-to-end.

V1's actual contribution: add original_lines: [start_line, end_line]
to the symbol_disappeared payload so the caller LLM can inspect the
symbol's prior position via `git show <prev_ref>:<file_path>` to
ground its own retrieval. Single-line addition in ledger/adapter.py;
new assertion in the existing regression test.

Strictly informational. No new "call bicameral_bind" CTA — Codex
pass-10 finding #2 stands: V2 must ship bicameral_rebind with
old-binding CAS + a fresh L3 verdict on the new target before any
rebind workflow becomes safe.

Tests: 63 passed in 4.52s — zero regressions across ast_diff,
b2_cosmetic_hint, sync_middleware, bind, link_commit_grounding,
phase3 integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trix

New tests/test_desync_scenarios.py — one test per scenario from the
Notion "Auto-Grounding Problem" catalog (Notion ID
3332a51619c4813caccec86c36d9bf98), routed through the real handler
layer per the Apr 8 PR #84 lesson (tests bypassing handlers miss
post-ingest hooks). Self-contained fixture builds a tmp git repo on
``main`` plus a memory ledger per test, so scenarios are independent
of each other and of the bicameral-mcp checkout.

Scorecard: 12 PASS, 1 XFAIL — meets V1 plan acceptance gate (>=12/13).

Pass:
  1. New decision, matching code exists           → ungrounded → caller binds
  2. Code changed after grounded                   → pending + pending_compliance_check
  3. Code deleted after grounded                   → symbol_disappeared
  4. Symbol renamed in file                        → symbol_disappeared with original_lines (V1 D1)
  5. Symbol moved to different file                → symbol_disappeared
  6. Code added later, decision becomes resolvable → caller binds explicitly
  7. Cold start: no matching code in repo          → stays ungrounded
  9. Intent description supersession               → re-ingest succeeds
 10. Multiple intents share a symbol               → both surface in drift response
 11. No server-side BM25 grounding (post-v0.6.0)   → stays ungrounded — caller-LLM-driven
 12. Line-shift edit does not trigger drift        → resolve_symbol_lines re-resolves
 13. [Open Question] prefix                        → ingested as gap-class decision

XFAIL:
  8. Drifted intent → atomic re-ground             → requires V2 bicameral_rebind
                                                     with old-binding CAS + fresh
                                                     L3 verdict on new target
                                                     (design doc 8 D2; Codex pass-10 #2)

Scenario 2 explicitly asserts ``pending`` (not ``drifted``) per
post-v0.5.0 derive_status semantics: a hash diff WITHOUT a cached
``compliant`` verdict yields ``pending``. ``drifted`` requires a
caller-LLM verdict via the verdict cache (V2 territory).

Incidental bug surfaced by F1 and fixed in ledger/adapter.py:
``pending_grounding_checks`` for ungrounded decisions read
``d.get("id", "")`` from ``get_all_decisions(filter="ungrounded")``,
but that query aliases the field to ``decision_id``. Result: every
ungrounded grounding-check entry shipped with ``decision_id=""``,
leaving callers no handle to bind against. The existing
``test_pending_grounding_checks_for_ungrounded_decisions`` regression
didn't catch it because it only asserted ``len > 0`` on the list, not
non-empty IDs. Fix: read ``decision_id`` first, fall back to ``id``
for forward compatibility. Real correctness bug, not a test artifact.

Tests: 76 passed, 1 xfailed in 7.35s across test_desync_scenarios,
test_b2_cosmetic_hint, test_ast_diff, test_phase3_integration,
test_sync_middleware, test_bind, test_link_commit_grounding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…don't get bind CTA

Codex pass-12 finding #2: D1 added original_lines to the
symbol_disappeared payload (richer relocation context) but left the
v0.6.4 verification_instruction text directing callers to use
bicameral.bind for ALL pending_grounding_checks. For
reason='ungrounded' that CTA is correct (no prior binding to retire,
no duplicate-binding risk). For reason='symbol_disappeared' it is
unsafe: bind on the new location leaves the old edge live under the
N:N binds_to relation, producing duplicate-binding state. Atomic
rebind that retires the stale edge in the same write ships in V2
(design doc 8 D2; Codex pass-10 finding #2).

Net effect of D1 was that we added more relocation context routed
through an unsafe CTA. This commit removes the unsafe CTA for
relocation cases without reducing the safe CTA for ungrounded cases.

Changes in handlers/link_commit.py:
  - _VERIFICATION_INSTRUCTION (single string) split into
    _VERIFICATION_INSTRUCTION_BASE +
    _GROUNDING_INSTRUCTION_UNGROUNDED +
    _GROUNDING_INSTRUCTION_RELOCATION.
  - New _build_verification_instruction(pending_compliance,
    pending_grounding) composer assembles the response text per call
    based on which reason values actually fired. Symbol-disappeared
    cases get an explicit "INFORMATIONAL ONLY — do NOT call
    bicameral.bind" warning citing the duplicate-binding hazard.
  - Response wiring at the bottom of handle_link_commit swapped from
    the constant lookup to the composer call.

Tests (tests/test_link_commit_grounding.py):
  - test_pending_grounding_checks_for_ungrounded_decisions now
    asserts "bicameral.bind" IN the verification_instruction AND
    "INFORMATIONAL ONLY" NOT in it.
  - test_pending_grounding_checks_symbol_not_found now asserts
    "INFORMATIONAL ONLY" IN the instruction AND an explicit
    bind-prohibition phrase.

Net destructive surface change for V1: arguably negative — V1 actively
*removes* a v0.6.4 unsafe CTA for relocation, while introducing zero
new mutating capabilities.

Tests: 75 passed, 1 xfailed in 7.11s (xfail is scenario 8, V2 atomic
rebind — by design).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the project's CLAUDE.md auto-tick mandate, after V1 implementation
work lands we tick the corresponding rows in TODO.md and PLAN.md and
add a CHANGELOG entry.

CHANGELOG.md
  "Unreleased — desync optimization V1 — measurement + read-path
  advisory" entry. Detailed per-deliverable summary (A1, A2-light,
  A3, B1, B2, D1, F1, pass-12 follow-up, incidental decision_id fix)
  with baseline numbers, scorecard table, and explicit V2-deferred
  list pointing at docs/desync-optimization.md.

TODO.md
  - Phase 3: ticked "Performance benchmarks" (V1 A1 measurements show
    we already meet the V2 perf targets — search p95 = 10.4 ms,
    drift p95 = 15.5 ms vs 2 s / 1 s targets, 55-185x under).
  - LLM drift judge note refined to "(V2 — see ...)".
  - New "Desync Optimization V1 — DONE" sub-section enumerating every
    V1 task ID + new "V2 — Deferred" sub-section listing the
    parking-lot items.
  - Mock Registry note clarified: V1 introduces no new mocks.

PLAN.md
  - Phase 3 perf bullet ticked with the same measurement rationale.
  - New "Desync Optimization V1 — DONE" section mirroring TODO.md's
    enumeration so PLAN.md stays the canonical phased view.
  - LLM drift judge bullet rewritten to point at the V2 design doc.

mocks/README.md is intentionally NOT updated — file does not exist
in-repo (verified empirically); the V1 plan's earlier reference to
it has been corrected.

.claude/skills/bicameral-doctor/SKILL.md is intentionally NOT updated
in V1 — V1's safeguard for relocation cases is the server-emitted
verification_instruction text (handlers/link_commit._build_verification_instruction),
which IS in-tree and DOES ship. Adding doctor-skill rendering prose
for pending_grounding_checks is V2 / future-Docs scope.

Tests: 75 passed, 1 xfailed in 7.68s — zero regressions from doc work
(docs only — no production code changed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single self-contained guide for V2 implementation. Replaces two prior
working artifacts that were never tracked (per project convention that
plan docs stay untracked):

  - docs/desync-optimization.md (V2 design with 9 Codex review passes)
  - docs/desync-optimization-v1-plan.md (V1 plan + pass-12 fixes)

Both have been removed from the working tree; their contents are
absorbed here as a single source of truth for any engineer or agent
picking up V2 work after V1 ships on this branch.

Structure (10 sections, ~1000 lines):

  1.  Quick start — read order for fresh readers
  2.  Background — three-timeline desync, 13-scenario catalog, scorecard
  3.  V1 outcomes — exhaustive list of what V1 shipped + the honest
      "20-30% of user-facing value" assessment so the next engineer
      doesn't overestimate the foundation
  4.  V2 scope — capability gap table + product targets
  5.  Architecture — L1/L2/L3 model, per-binding ownership, 5-field CAS,
      tombstones, append-only verdict history, two-phase rebind with
      attempt-id locking, three-mechanism sync barrier, spec-vs-code-shape
      writes
  6.  Implementation plan — 6 phases as a DAG with file-level breakdown
      and effort estimate (7-10 weeks, 1 engineer, sequential)
  7.  Constraints catalog — synthesized from 14 Codex review passes;
      each entry is a "what NOT to ship" rule traced to the pass # that
      found the bug
  8.  Open questions — 7 decisions Codex can't make; explicitly flags
      Jin involvement (CODEOWNERS approval required)
  9.  Acceptance criteria — quantitative thresholds, qualitative checks,
      ship-blocker gate tests
  10. References — local files, V1 commits, Notion docs, external sources

Pass-13 fixes folded in (V2 design review):

  - 5.6 / 6 Phase 4 D2: rebind phase 2 binds to a specific
    rebind_attempt_id with single-pending-rebind-per-old-binding
    enforcement. Stale phase-2 verdicts on superseded attempts cannot
    tombstone the old edge.
  - 5.5 / 6 Phase 4 C2: record_compliance_verdict derives projection
    keys from POST-mutation state. History insert + binds_to update +
    projection upsert run in one atomic transaction with the
    post-mutation tuple. Mandatory acceptance test
    test_not_relevant_then_restore_cycle locks the contract.
  - 7.10: general rule — whenever a write mutates a field that is part
    of any cache or index key, the cache/index write must be derived
    from the post-mutation value.

Pass-14 fixes folded in (V2 guide review):

  - 6 Phase 0b: atomicity is a committed choice, not deferred. V2
    uses LedgerClient.transaction() wrapping inline BEGIN/COMMIT
    TRANSACTION blocks. Day-1 forced-failure gate test
    (test_transaction_rolls_back_on_failure) is a hard ship-blocker
    for Phase 0b. Explicit fallback path documented if gate fails:
    LET-chained statements, then network SurrealDB.
  - 5.6: pending rebinds get a server-enforced lease
    (BICAMERAL_REBIND_LEASE_SECONDS, default 24h). On-demand expiry
    sweep at start of every phase 1 atomically abandons stale leases
    as outcome='abandoned_by_expiry'. force_supersede=true flag for
    explicit takeover. Phase 2 lease check rejects verdicts on
    expired/superseded/abandoned attempts as stale-history-only.
    Guarantees forward progress even on caller crash.
  - 8 q5 / 7.3 / 10: judge_gaps contradiction resolved.
    bicameral_judge_gaps is read-only; the destructive write is in
    resolve_compliance, which Phase 0a migrates. Phase 0a covers the
    entire pipeline; no separate change to judge_gaps required.
  - 6 Phase 4 D2 / 7.11: edge-vs-region terminology audit.
    binding_version lives on binds_to edges, never on code_region.
    All Phase 4 D2 wording rewritten in edge terminology; explicit
    warning against the rejected per-region versioning model.
  - 7.11: pass-14 constraints added to the catalog with
    prevention rules synthesized.

V1 status preserved: 6 commits on this branch (3b4d0bb...8e226c5)
ship the read-path advisory layer with zero new mutating capabilities.
75 passed + 1 xfailed in test suite. The xfailed scenario 8 flips to
PASS the moment V2's bicameral_rebind ships — it's
@pytest.mark.xfail(strict=True), so a CI-visible signal will fire.

V2 effort sense: 7-10 engineer-weeks sequential. Phases don't
parallelize cleanly because each one's correctness depends on the
prior one's invariants. Strong recommendation to involve Jin at
design time (CODEOWNERS, project-judgment review the adversarial
Codex passes can't provide).

Branch: desync-optimization-v1 (renamed from desync-optimization)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-rebase onto v0.7.0 (Accountable North Star), fresh decisions are
ingested as signoff.state='proposed' and report status='proposal' until
explicitly ratified via bicameral_ratify. They are drift-exempt by
design — they never enter pending/drifted until ratified.

Scenario 2 was written under the v0.5.0 contract where a hash-divergent
region without a cached compliant verdict reports status='pending'.
Under v0.7.0 the same flow now reports status='proposal' for the
unratified decision, even though the substantive drift signal still
flows through pending_compliance_checks (which the test continues to
assert above the status check).

Update: accept any of {pending, drifted, proposal} — the per-decision
status field is richer under v0.7.0; the scenario's invariant (system
surfaces a compliance check when bound code changes) is unchanged.

Tests: 77 passed, 1 xfailed in 7.52s — full V1 suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DriftEntry.cosmetic_hint was shipped in V1 B2 with no corresponding
skill update, leaving the field invisible to the caller LLM. Add the
three-point contract to bicameral-preflight SKILL.md (both copies):

- Step 4: cosmetic_hint=true → mechanical classification, never ask
- Step 5: ~ REFORMATTED rendering template (not ⚠ DRIFTED)
- Stop-and-Ask Contract: cosmetic drift listed as mechanical example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four occurrences of the deleted V1 plan / V2 design paths
(docs/desync-optimization.md, docs/desync-optimization-v1-plan.md)
pointed at files that were consolidated into
docs/v2-desync-optimization-guide.md and removed. Replace each with
the consolidated path so internal links resolve. Surrounding prose
preserved verbatim per the reviewer's "preserve text and formatting"
instruction; the dated "nine rounds of Codex review" claim is
unchanged (the guide actually absorbed 14 passes by ship time, but
that's a separate edit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project root CLAUDE.md (lines 3-17) requires that any tool response
field shape change ships with a matching SKILL.md update. V1 added:

  - sync_metrics on SearchDecisionsResponse / PreflightResponse /
    HistoryResponse / BindResponse (contracts.py)
  - cosmetic_hint on DriftEntry (contracts.py)
  - original_lines on pending_grounding_checks symbol_disappeared
    payload (ledger/adapter.py)
  - per-reason verification_instruction split
    (handlers/link_commit.py)

None of these were mentioned in any skill — audit by grep across
.claude/skills/{search,preflight,history,doctor,scan-branch}/SKILL.md
found zero hits for the new field names before this commit. Updates:

  - bicameral-search / bicameral-preflight / bicameral-history:
    one-line note that sync_metrics is server-side observability
    (skip rendering; log if profiling). Minimal per the change owner's
    "minimal note" instruction for observability fields.

  - bicameral-doctor: new "Per-entry advisory fields" subsection
    covering all three doctor-relevant additions:
      * cosmetic_hint as advisory metadata that never gates status
      * original_lines on symbol_disappeared payloads
      * the new verification_instruction split — explicit warning
        that the doctor skill must NOT synthesize a bicameral.bind
        CTA for symbol_disappeared cases (would create
        duplicate-binding state under N:N binds_to until V2 atomic
        rebind ships); ungrounded keeps the safe bind CTA.

  - bicameral-scan-branch: extended the existing DriftEntry bullet
    with a cosmetic_hint advisory note. Emphasizes that the hint is
    render-time tag only — drifted entries stay drifted regardless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four correctness fixes surfaced by post-V1-merge review. All verified
against current code before applying; none affect the V1 acceptance
contract (zero new mutating capabilities, scorecard 12 PASS + 1 XFAIL).
77 passed + 1 xfailed in the V1 suite after fixes.

handlers/detect_drift.py — two issues
  A. Magic 10**9 in get_git_content calls. ledger.status.get_git_content
     takes start_line / end_line in its signature but ignores them
     entirely — the function always returns the full file body, with
     callers slicing locally afterward. Two existing legacy callers
     (ledger/status.py:159, ledger/adapter.py:67) share the same
     signature smell. Replaced 1, 10**9 with 0, 0 and an explicit
     comment pointing at the upstream signature for a future cleanup.
     The slicing logic that follows is unchanged and correct.
  B. _enrich_with_cosmetic_hints diffed HEAD vs working tree
     unconditionally, even when the caller passed use_working_tree=False
     (which makes handle_detect_drift advertise source="HEAD" in the
     response). The cosmetic-hint axis is HEAD-vs-working-tree, so
     attaching hints in the source="HEAD" mode would have been
     misleading metadata derived from a diff axis the caller didn't ask
     about. Minimal fix per the reviewer's option: gate the enrichment
     call with `if use_working_tree:` so HEAD-source responses never
     carry cosmetic_hint values. tests/test_v0414_source_excerpt.py
     (the one test that calls with use_working_tree=False) still
     passes — it doesn't assert on cosmetic_hint.

handlers/search_decisions.py — sync_catchup_ms scope mismatch
  V1 A3 timed only handle_link_commit in this handler, but
  ensure_ledger_synced (used by preflight / history) covers both
  link_commit AND get_session_start_banner. The same metric name
  measured different surfaces across handlers. Extended the timed
  scope here to include the banner call so sync_catchup_ms means the
  same thing everywhere.

ledger/ast_diff.py — fail-safe contract leak
  is_cosmetic_change documents a "fail-safe → False" contract for
  any classifier failure, but the recursive _signature comparison ran
  outside the try/except. _signature recurses into the AST and can
  raise RecursionError on deeply nested trees, which would propagate
  out of the function instead of returning False. Combined the parse +
  has_error check + signature comparison into one guarded block;
  except (Exception, RecursionError) returns False per the contract.
  RecursionError is technically a subclass of Exception in Python 3
  (the explicit tuple is documentation-of-intent — the comment block
  above this also calls out recursion specifically).

tests/test_ast_diff.py — JSX fallback exercise
  test_jsx_routes_through_javascript used identical bytes for before
  and after, which short-circuits at the top of is_cosmetic_change
  before the LANGUAGE_FALLBACK['jsx'] → 'javascript' path is reached.
  The test name claimed to verify routing but never exercised it.
  Updated to use whitespace-only diff in the JS portion of the
  expression (extra spaces around the `=` sign) so bytes differ but
  the expected outcome stays True. JSX text content kept unchanged
  (text inside <div>...</div> IS significant in JSX — first attempt
  put spaces there and correctly returned False).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the "small scope expansion only when V2 is already close" bar:
V2 Phase 1-4 ship every primitive #47 needs (per-binding baseline,
full-CAS hash comparison, symbol re-resolution, atomic rebind, V1's
relocation surfacing). The remaining gap to fully closing #47 is one
thin read-only handler that wires those primitives at the branch
level. No new mutating capabilities. No schema changes. No new
contract surface beyond a single response type. Effort delta:
3-5 days; total V2 estimate goes from 7-10 weeks to ~8-11 weeks.

Updates to docs/v2-desync-optimization-guide.md:

- 4.1 capability gap table: row 13 added (branch-aware drift report,
  GitHub #47, Phase 6 fix).
- 4.2 product targets: explicit "branch-aware drift report works"
  bullet citing #47 closure and the unblocked downstream consumers
  (#48 pre-push hook, #49 PR-comment Action).
- 6 implementation plan: Phase 6 box added to the DAG diagram below
  Phase 5 polish.
- 6 Phase 6 section: full deliverable spec with the
  ScanBranchResponse contract, read-only invariant, dependency on C0
  per-binding baseline, and the four acceptance criteria mirroring
  #47 verbatim plus the Closes #47 commitment.
- 6 effort estimate table: Phase 6 row (3-5 days), total updated.
- 9 acceptance criteria: new gate that the V2 PR uses Closes #47
  with a passing read-only invariant test.
- 10 references: handlers/scan_branch.py and tests/test_scan_branch.py
  added to the file list.
- Final note: post-V2 workflow change codified — V2 is the last
  release authored by reverse-mapping deliverables to issues. After
  V2, switch to issue-driven (pick issue, satisfy ACs, Closes #N).
  Phase 6 documented as the canonical "expand in-flight release to
  close an adjacent issue" precedent: underlying machinery must
  already exist, addition must be additive, issue ACs must be fully
  satisfiable. Suggested next-issue order: #39 -> #42 -> #41 -> #44
  -> #48 -> #49 (all 5 close via small focused PRs over 4-6 weeks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All four findings verified against current code; only the actionable
ones applied. 81 passed + 1 xfailed in 9.02s.

#1 — skills/bicameral-preflight/SKILL.md sync_metrics note
  The .claude/skills copy got the sync_metrics observability note
  back when V1 A3 shipped, but the canonical skills/ copy never
  did. Mirror the wording verbatim near step 2 so the rendering
  guidance and response-field documentation stay in sync.

#2 — handlers/detect_drift.py per-entry alignment
  The cosmetic-hint enrichment was slicing both head_full and
  wt_full using entry.lines (the baseline anchor). HEAD and the
  working tree can shift the symbol independently, so a single
  index range can't align both sides. The narrow consequence: a
  drifted entry with shifted lines could yield a misleading
  cosmetic_hint=true on bytes that aren't the bound region.

  Fix: re-resolve the symbol against each ref via
  resolve_symbol_lines(file_path, entry.symbol, repo, ref="HEAD")
  and ref="working_tree" separately, slice each ref using its own
  resolved range. Resolution failure on either side → safe default
  of cosmetic_hint=False (matches the V1 contract: "False is cheap,
  True must be earned"). Empty symbol → skip (new fail-safe path).

  Test refactor: test_invalid_lines_skipped renamed to
  test_unresolvable_symbol_skipped — the old test asserted that
  lines=(0,0) was the failsafe trigger, but entry.lines is no
  longer the alignment input. New test exercises the
  resolve_symbol_lines-returns-None path via a nonexistent symbol
  name, which is the real fail-safe gate now.

#3 — V2 guide TOC anchor for §9
  GitHub auto-generates fragment IDs from heading text by
  lowercasing, replacing spaces with hyphens, and dropping
  punctuation. "## 9. Acceptance criteria for V2" maps to
  #9-acceptance-criteria-for-v2, but the TOC pointed at
  #9-acceptance-criteria (truncated). Link broken.
  Updated to the correct fragment.

#4 — V2 guide unlabeled fenced code blocks (markdownlint MD040)
  Six fenced opens used bare ``` instead of a labeled fence.
  Tagged each with ```text — the contents are commit listings,
  ASCII DAG diagrams, pseudocode protocols, and tuple notation,
  none of which fit a real language tag. The other fenced blocks
  in the guide (already tagged ```sql / ```python) are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants