desync optimization V1 — read-path advisory + V2 implementation guide#56
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds Desync Optimization V1: read-path-only instrumentation and advisory changes — sync-metrics on responses, per-repo write barrier for bind timing, strict AST whitespace-only cosmetic classifier with drift-entry enrichment, conditional verification-instruction routing, grounding payload enrichment, a drift benchmark harness, and regression tests; destructive write-path work deferred to V2. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Handler as Handler (search/preflight/bind/detect_drift)
participant Sync as repo_write_barrier (Lock)
participant Ledger
participant Repo as GitRepo
participant AST as ledger/ast_diff
Client->>Handler: request (search/preflight/bind/detect_drift)
Handler->>Sync: enter repo_write_barrier(ctx)
Sync-->>Handler: BarrierTiming (held_ms unset)
Handler->>Ledger: ensure_ledger_synced(ctx)
Ledger-->>Handler: synced
alt detect_drift path
Handler->>Repo: read file(s) (HEAD, working tree)
Repo-->>Handler: file contents
Handler->>AST: is_cosmetic_change(before, after, lang)
AST-->>Handler: cosmetic? (True/False)
end
Handler->>Sync: exit repo_write_barrier (record held_ms)
Handler-->>Client: Response (includes sync_metrics, cosmetic_hint, verification_instruction)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
1021822 to
b420d1b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
handlers/sync_middleware.py (1)
56-80: Optional:_repo_locksgrows without bound.The registry never evicts entries, so a long-lived MCP server process that handles many distinct
repo_pathvalues will accumulate oneasyncio.Lockper repo for the lifetime of the process. Per-process MCP servers typically see ≤1 repo, so this is unlikely to bite, but if you anticipate a multi-repo daemon mode (e.g. dashboard, CI worker), consider aWeakValueDictionaryor an explicit eviction hook. Pure footprint issue — no correctness impact.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@handlers/sync_middleware.py` around lines 56 - 80, _repo_locks currently stores a strong reference for every repo and will grow without bound; change the storage used by _repo_locks in handlers/sync_middleware.py (used by _get_repo_lock and guarded by _guard) to a weak-value mapping or add an eviction mechanism so locks for unused repo_path keys can be garbage-collected: replace the plain dict _repo_locks with a weakref.WeakValueDictionary (or implement a size/time-based eviction policy) and ensure the existing _guard()/async with _guard() protection remains around accesses to the new mapping.tests/bench_drift.py (2)
80-90:p95is off-by-one when0.95 * len(s)is not an integer.
s[max(0, int(0.95 * len(s)) - 1)]returns the nearest-rank p95 only when0.95*Nis integral (e.g., N=20, 100). For N=50 (the search-decisions sample size: 10 queries × 5 iterations) it yields index 46 (47th value) instead of the standard nearest-rank index 47 (48th value). The published baseline numbers inPLAN.md/ V2 guide are derived from this — minor but worth tightening since they're cited as the V2 perf budget reference.♻️ Proposed fix using `statistics.quantiles`
-def _percentiles(samples: list[float]) -> dict[str, float]: - if not samples: - return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0} - s = sorted(samples) - return { - "p50": statistics.median(s), - "p95": s[max(0, int(0.95 * len(s)) - 1)], - "max": s[-1], - "mean": statistics.fmean(s), - "n": len(s), - } +def _percentiles(samples: list[float]) -> dict[str, float]: + if not samples: + return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0} + s = sorted(samples) + # Nearest-rank p95: smallest value with rank >= ceil(0.95 * N). + p95_idx = max(0, -(-len(s) * 95 // 100) - 1) + return { + "p50": statistics.median(s), + "p95": s[p95_idx], + "max": s[-1], + "mean": statistics.fmean(s), + "n": len(s), + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bench_drift.py` around lines 80 - 90, The p95 calculation in _percentiles is off-by-one for non-integer 0.95*N; replace the current manual index logic with a proper quantile computation (e.g., use statistics.quantiles with n=100 and method="inclusive" or method="nearest" to get the 95th percentile) on the sorted samples s, and return that value for "p95" while keeping the other returns ("p50", "max", "mean", "n") unchanged; ensure the empty-samples branch remains the same.
118-121: Log the swallowed exception.A bare
except Exception: continuehere means a regression inextract_symbols(e.g., a tree-sitter parser change or a malformed file pattern) silently shrinks the corpus until thelen(symbols) >= 25assertion at line 193 trips far away from the cause. Logging at DEBUG keeps the bench quiet but recoverable.- try: - records = await adapter.extract_symbols(str(fp)) - except Exception: - continue + try: + records = await adapter.extract_symbols(str(fp)) + except Exception as exc: + print(f"[bench] extract_symbols failed for {fp}: {exc}") + continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bench_drift.py` around lines 118 - 121, The try/except around adapter.extract_symbols currently swallows all exceptions; change it to catch Exception as e and log the error at DEBUG before continuing so failures in extract_symbols are visible (e.g., logger.debug("extract_symbols failed for %s: %s", fp, e, exc_info=True)) and then continue; keep the variable names (records, adapter.extract_symbols, fp) intact and only add the debug log to avoid noisy test output.tests/test_b2_cosmetic_hint.py (1)
78-90: Dead write + unused fixture parameter.The
_write_fileat line 80 is immediately overwritten at line 83 before any git operation, so it has no effect on the test. Thetmp_pathparameter is also unused (the working-tree path comes from therepo_with_baselinefixture). The test reads as if it's establishing a non-docstring baseline state, but it isn't.♻️ Proposed cleanup
-def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline, tmp_path): +def test_docstring_edit_keeps_cosmetic_hint_false(repo_with_baseline): repo, rel = repo_with_baseline - _write_file(repo, rel, "def f(x):\n return x + 1\n") - # Now overwrite baseline by committing a docstring-only version, then edit working tree. + # Commit a docstring-bearing baseline, then edit only the docstring text in + # the working tree — the diff is observable via __doc__, so it must NOT be + # classified as cosmetic. import subprocess _write_file(repo, rel, 'def f(x):\n """Old."""\n return x + 1\n') subprocess.run(["git", "add", "-A"], cwd=repo, check=True) subprocess.run(["git", "commit", "-q", "-m", "add docstring"], cwd=repo, check=True) - # Working tree edits the docstring text — observable via __doc__. _write_file(repo, rel, 'def f(x):\n """New."""\n return x + 1\n') entry = _make_entry(status="drifted", lines=(1, 3)) _enrich_with_cosmetic_hints([entry], rel, str(repo)) assert entry.cosmetic_hint is False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_b2_cosmetic_hint.py` around lines 78 - 90, The test dead-writes a file via _write_file immediately overwritten later and declares an unused tmp_path fixture; remove the first _write_file call (the one creating "def f(x):\n return x + 1\n") and drop the tmp_path parameter from test_docstring_edit_keeps_cosmetic_hint_false so the test only creates the baseline commit with the docstring and then edits the working tree, keeping references to _write_file, subprocess.run, _make_entry and _enrich_with_cosmetic_hints intact.TODO.md (1)
138-140: Line numbers in prose will drift.
ledger/adapter.py:475is also referenced fromPLAN.mdline 102. The bug fix point will move on the next edit to that file and the docs go stale silently. Consider citing a stable anchor (function name) instead.-- [x] **incidental fix** — `ledger/adapter.py:475` was emitting empty - `decision_id` on ungrounded grounding checks; surfaced by F1 +- [x] **incidental fix** — empty `decision_id` on ungrounded + `pending_grounding_checks` in `ledger/adapter.py` + (`SurrealDBLedgerAdapter._collect_pending_grounding_checks`); surfaced by F1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@TODO.md` around lines 138 - 140, Update the docs to stop citing a fragile line number: replace the `ledger/adapter.py:475` reference in TODO.md and PLAN.md with a stable symbol (the function or method that emits the decision_id for grounding checks); locate the emitter by searching for the "decision_id" token or the grounding-check function in ledger/adapter.py and reference it by its fully-qualified function/method name (e.g., AdapterClass.method_name or function_name) so future edits won't silently break the doc link.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/v2-desync-optimization-guide.md`:
- Around line 519-535: The current transaction() context manager builds a raw
BEGIN/COMMIT block and calls self._db.query(...) which is incompatible with
surrealdb-py's transaction API; refactor transaction() (the asynccontextmanager
that yields TransactionBuffer) to create a session via self._db.new_session(),
start a transaction with session.begin_transaction(), run each buffered
statement through the transaction object (txn.query or txn.execute equivalent)
so you get per-statement results, and then call txn.commit() in the try path or
txn.cancel() in except, converting any per-statement failures into LedgerError
(replace the current isinstance(stmt, str) check and the raw self._db.query
usage with transaction-aware calls and proper try/except around
txn.commit()/txn.cancel()).
---
Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 56-80: _repo_locks currently stores a strong reference for every
repo and will grow without bound; change the storage used by _repo_locks in
handlers/sync_middleware.py (used by _get_repo_lock and guarded by _guard) to a
weak-value mapping or add an eviction mechanism so locks for unused repo_path
keys can be garbage-collected: replace the plain dict _repo_locks with a
weakref.WeakValueDictionary (or implement a size/time-based eviction policy) and
ensure the existing _guard()/async with _guard() protection remains around
accesses to the new mapping.
In `@tests/bench_drift.py`:
- Around line 80-90: The p95 calculation in _percentiles is off-by-one for
non-integer 0.95*N; replace the current manual index logic with a proper
quantile computation (e.g., use statistics.quantiles with n=100 and
method="inclusive" or method="nearest" to get the 95th percentile) on the sorted
samples s, and return that value for "p95" while keeping the other returns
("p50", "max", "mean", "n") unchanged; ensure the empty-samples branch remains
the same.
- Around line 118-121: The try/except around adapter.extract_symbols currently
swallows all exceptions; change it to catch Exception as e and log the error at
DEBUG before continuing so failures in extract_symbols are visible (e.g.,
logger.debug("extract_symbols failed for %s: %s", fp, e, exc_info=True)) and
then continue; keep the variable names (records, adapter.extract_symbols, fp)
intact and only add the debug log to avoid noisy test output.
In `@tests/test_b2_cosmetic_hint.py`:
- Around line 78-90: The test dead-writes a file via _write_file immediately
overwritten later and declares an unused tmp_path fixture; remove the first
_write_file call (the one creating "def f(x):\n return x + 1\n") and drop the
tmp_path parameter from test_docstring_edit_keeps_cosmetic_hint_false so the
test only creates the baseline commit with the docstring and then edits the
working tree, keeping references to _write_file, subprocess.run, _make_entry and
_enrich_with_cosmetic_hints intact.
In `@TODO.md`:
- Around line 138-140: Update the docs to stop citing a fragile line number:
replace the `ledger/adapter.py:475` reference in TODO.md and PLAN.md with a
stable symbol (the function or method that emits the decision_id for grounding
checks); locate the emitter by searching for the "decision_id" token or the
grounding-check function in ledger/adapter.py and reference it by its
fully-qualified function/method name (e.g., AdapterClass.method_name or
function_name) so future edits won't silently break the doc link.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf4c2513-a834-4520-80b2-fe791afc76a0
📒 Files selected for processing (21)
CHANGELOG.mdPLAN.mdTODO.mdcontracts.pydocs/v2-desync-optimization-guide.mdhandlers/bind.pyhandlers/detect_drift.pyhandlers/history.pyhandlers/link_commit.pyhandlers/preflight.pyhandlers/search_decisions.pyhandlers/sync_middleware.pyledger/adapter.pyledger/ast_diff.pytests/bench_drift.pytests/conftest.pytests/test_ast_diff.pytests/test_b2_cosmetic_hint.pytests/test_desync_scenarios.pytests/test_link_commit_grounding.pytests/test_sync_middleware.py
✅ Files skipped from review due to trivial changes (2)
- tests/conftest.py
- handlers/history.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/test_ast_diff.py
- handlers/preflight.py
- handlers/link_commit.py
- tests/test_link_commit_grounding.py
- CHANGELOG.md
3d57097 to
e6b81ed
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
handlers/sync_middleware.py (2)
83-115: Minor readability nit: forward reference toBarrierTiming.
repo_write_barrier(L83–102) instantiatesBarrierTimingat L96, but the class is defined below at L105. It works because Python resolves the name at call time, but readers parsing top-to-bottom hit the use before the definition. Consider hoistingBarrierTimingaboverepo_write_barrier.Suggested ordering
-@asynccontextmanager -async def repo_write_barrier(ctx): - ... - -class BarrierTiming: - ... +class BarrierTiming: + ... + +@asynccontextmanager +async def repo_write_barrier(ctx): + ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@handlers/sync_middleware.py` around lines 83 - 115, The function repo_write_barrier instantiates BarrierTiming before the class is defined, which is a readability nit—move the BarrierTiming class definition above repo_write_barrier so the type is defined before use; update any docstrings/comments if needed to reflect the new order and keep the exact class name BarrierTiming and function name repo_write_barrier unchanged.
56-71: Lazy-loop binding works for the MCP server, but the registry persists across loops.Single-threaded asyncio guarantees the
if _repo_locks_guard is None: …block is atomic, so the lazy guard is race-safe. Two follow-up considerations worth tracking for V2:
- The locks in
_repo_locksand_repo_locks_guarditself are bound to the loop active when first created. If the host ever recycles the event loop (test harnesses already do — that's why_reset_repo_locks_for_testsexists), production-side recycling would silently re-use a lock from a dead loop and raiseRuntimeError: ... attached to a different loop. Not a bug today since the MCP server keeps a single loop for its lifetime, but please note this constraint in the docstring or guard against it (e.g., compareasyncio.get_running_loop()to a stored loop ref and rebuild on mismatch)._repo_locksis never evicted. For a long-running server processing many distinctctx.repo_pathvalues this leaksasyncio.Lockobjects. Likely fine in practice (few repos per session) but worth a TODO for V2's broader barrier scope.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@handlers/sync_middleware.py` around lines 56 - 71, The current lazy guard and per-repo locks (_repo_locks_guard and _repo_locks) are bound to whichever event loop exists at first creation and persist across loops, risking RuntimeError if the event loop is recycled and leaking locks for many ctx.repo_path values; update _guard() to store the loop used when creating _repo_locks_guard (capture asyncio.get_running_loop()), compare that stored loop to asyncio.get_running_loop() on each call and recreate the guard (and clear or rebuild _repo_locks) when they differ, and add a concise docstring note about loop-binding; also add a TODO comment near _repo_locks about eviction and reference the existing _reset_repo_locks_for_tests as a test helper for clearing state.tests/test_sync_middleware.py (1)
343-354: Potentially flaky timing assertion.
asyncio.sleep(0.02)is documented as "approximately delay seconds" and the wall clock may be coarser than perf_counter on some platforms (notably Windows < ~15.6ms timer resolution). Assertingheld_ms >= 20.0is tight — a single rounding/clock-skew artifact will turn it red. Consider relaxing to>= 15.0or sleeping a bit longer (e.g. 50ms with>= 40.0).Suggested loosening
- async with repo_write_barrier(ctx) as timing: - assert timing.held_ms is None # not yet populated - await asyncio.sleep(0.02) - assert timing.held_ms is not None - assert timing.held_ms >= 20.0 # we slept 20ms, measured wall clock should reflect it - assert timing.held_ms < 500.0 # and not be absurd + async with repo_write_barrier(ctx) as timing: + assert timing.held_ms is None # not yet populated + await asyncio.sleep(0.05) + assert timing.held_ms is not None + assert timing.held_ms >= 40.0 # slept 50ms; allow ~10ms slack for coarse clocks + assert timing.held_ms < 500.0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_sync_middleware.py` around lines 343 - 354, The timing assertion in test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update the test that uses repo_write_barrier and BarrierTiming.held_ms to either increase the sleep to a longer, less flaky duration (e.g., await asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g., keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms comparison is robust across platforms.tests/bench_drift.py (2)
80-90: Minor:meankey is missing from the empty-samples branch.If any handler ends up with zero samples, the resulting JSON artifact will lack
meanfor that handler while populated handlers include it — making downstream diffs across runs awkward. The stdout summary at Line 270 is unaffected (it only readsp50/p95/max/n), but the artifact schema should be consistent.♻️ Proposed fix
if not samples: - return {"p50": 0.0, "p95": 0.0, "max": 0.0, "n": 0} + return {"p50": 0.0, "p95": 0.0, "max": 0.0, "mean": 0.0, "n": 0}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bench_drift.py` around lines 80 - 90, The empty-samples branch in _percentiles currently returns {"p50","p95","max","n"} but omits "mean", causing inconsistent JSON schema; update the early-return dict in function _percentiles to include "mean": 0.0 (and keep "n": 0) so all callers receive the same set of keys ("p50","p95","max","mean","n") whether samples is empty or not.
118-121: Optional: log the swallowed extraction error.
except Exception: continuewill silently mask systematic adapter failures — ifextract_symbolsfails on every file the bench will assert at Line 193 with a useless "Only got 0 symbols" message. Alogger.debug(or even a♻️ Proposed fix
try: records = await adapter.extract_symbols(str(fp)) - except Exception: + except Exception as exc: # noqa: BLE001 — bench best-effort + print(f"[bench] extract_symbols failed for {fp}: {exc}") continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/bench_drift.py` around lines 118 - 121, The try/except around adapter.extract_symbols(str(fp)) is swallowing all errors; change the except to capture the exception (except Exception as e) and log or print the failure (including the filepath/str(fp) and e) before continuing so adapter.extract_symbols failures are visible during CI; update the block that assigns records = await adapter.extract_symbols(str(fp)) to use the captured exception and a logger.debug or print statement referencing the same filepath and exception details.tests/test_desync_scenarios.py (1)
405-441: Optional: tighten scenario 9 to actually assert supersession.The current assertion (
r1.ingested and r2.ingested) only proves the handler doesn't raise — a regression that silently droppedsupersession_candidatesemission would still pass. Since the test is the canonical regression entry for scenario 9, a one-line assertion thatr2.supersession_candidatesis non-empty (or that one of the two decisions transitions to asupersededstatus) would close that gap without duplicating coverage fromtests/test_supersession.py.♻️ Proposed fix
r1 = await handle_ingest(ctx, p1) r2 = await handle_ingest(ctx, p2) assert r1.ingested and r2.ingested + # Smoke check: re-ingest with overlapping intent must surface a + # supersession candidate, otherwise the canonical path silently + # regressed even if both ingests "succeeded". + assert r2.supersession_candidates, ( + f"Expected supersession_candidates on overlapping re-ingest, got: {r2}" + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_desync_scenarios.py` around lines 405 - 441, The test currently only asserts that handle_ingest returned ingested flags for r1 and r2; update test_scenario_09_intent_description_supersession to also assert that supersession actually occurred by checking r2.supersession_candidates is non-empty (or that r1 or r2 has a decision with status "superseded"); locate the test function and add a single assertion after the existing ones that inspects r2.supersession_candidates (or verifies a decision status change) to ensure supersession is emitted, without changing any handler logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@PLAN.md`:
- Line 102: The checklist entry references a stale line number for the
incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.
---
Nitpick comments:
In `@handlers/sync_middleware.py`:
- Around line 83-115: The function repo_write_barrier instantiates BarrierTiming
before the class is defined, which is a readability nit—move the BarrierTiming
class definition above repo_write_barrier so the type is defined before use;
update any docstrings/comments if needed to reflect the new order and keep the
exact class name BarrierTiming and function name repo_write_barrier unchanged.
- Around line 56-71: The current lazy guard and per-repo locks
(_repo_locks_guard and _repo_locks) are bound to whichever event loop exists at
first creation and persist across loops, risking RuntimeError if the event loop
is recycled and leaking locks for many ctx.repo_path values; update _guard() to
store the loop used when creating _repo_locks_guard (capture
asyncio.get_running_loop()), compare that stored loop to
asyncio.get_running_loop() on each call and recreate the guard (and clear or
rebuild _repo_locks) when they differ, and add a concise docstring note about
loop-binding; also add a TODO comment near _repo_locks about eviction and
reference the existing _reset_repo_locks_for_tests as a test helper for clearing
state.
In `@tests/bench_drift.py`:
- Around line 80-90: The empty-samples branch in _percentiles currently returns
{"p50","p95","max","n"} but omits "mean", causing inconsistent JSON schema;
update the early-return dict in function _percentiles to include "mean": 0.0
(and keep "n": 0) so all callers receive the same set of keys
("p50","p95","max","mean","n") whether samples is empty or not.
- Around line 118-121: The try/except around adapter.extract_symbols(str(fp)) is
swallowing all errors; change the except to capture the exception (except
Exception as e) and log or print the failure (including the filepath/str(fp) and
e) before continuing so adapter.extract_symbols failures are visible during CI;
update the block that assigns records = await adapter.extract_symbols(str(fp))
to use the captured exception and a logger.debug or print statement referencing
the same filepath and exception details.
In `@tests/test_desync_scenarios.py`:
- Around line 405-441: The test currently only asserts that handle_ingest
returned ingested flags for r1 and r2; update
test_scenario_09_intent_description_supersession to also assert that
supersession actually occurred by checking r2.supersession_candidates is
non-empty (or that r1 or r2 has a decision with status "superseded"); locate the
test function and add a single assertion after the existing ones that inspects
r2.supersession_candidates (or verifies a decision status change) to ensure
supersession is emitted, without changing any handler logic.
In `@tests/test_sync_middleware.py`:
- Around line 343-354: The timing assertion in
test_repo_write_barrier_reports_held_ms is too tight and can be flaky; update
the test that uses repo_write_barrier and BarrierTiming.held_ms to either
increase the sleep to a longer, less flaky duration (e.g., await
asyncio.sleep(0.05) and assert held_ms >= 40.0) or relax the lower bound (e.g.,
keep sleep at 0.02 but assert held_ms >= 15.0) so the measured held_ms
comparison is robust across platforms.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9d7efb60-551d-48a4-81d1-6b6e23ff4919
📒 Files selected for processing (27)
.claude/skills/bicameral-doctor/SKILL.md.claude/skills/bicameral-history/SKILL.md.claude/skills/bicameral-preflight/SKILL.md.claude/skills/bicameral-scan-branch/SKILL.md.claude/skills/bicameral-search/SKILL.mdCHANGELOG.mdPLAN.mdTODO.mdcontracts.pydocs/v2-desync-optimization-guide.mdhandlers/bind.pyhandlers/detect_drift.pyhandlers/history.pyhandlers/link_commit.pyhandlers/preflight.pyhandlers/search_decisions.pyhandlers/sync_middleware.pyledger/adapter.pyledger/ast_diff.pyskills/bicameral-preflight/SKILL.mdtests/bench_drift.pytests/conftest.pytests/test_ast_diff.pytests/test_b2_cosmetic_hint.pytests/test_desync_scenarios.pytests/test_link_commit_grounding.pytests/test_sync_middleware.py
✅ Files skipped from review due to trivial changes (5)
- tests/conftest.py
- .claude/skills/bicameral-search/SKILL.md
- .claude/skills/bicameral-scan-branch/SKILL.md
- docs/v2-desync-optimization-guide.md
- .claude/skills/bicameral-history/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (8)
- handlers/history.py
- .claude/skills/bicameral-doctor/SKILL.md
- handlers/search_decisions.py
- .claude/skills/bicameral-preflight/SKILL.md
- handlers/link_commit.py
- TODO.md
- tests/test_ast_diff.py
- CHANGELOG.md
| - [x] B2 — `DriftEntry.cosmetic_hint` advisory + `_enrich_with_cosmetic_hints` helper in `handle_detect_drift` | ||
| - [x] D1 — `original_lines` field on `symbol_disappeared` grounding checks; **`_build_verification_instruction` split** so relocation cases never get the unsafe `bicameral.bind` CTA | ||
| - [x] F1 — `tests/test_desync_scenarios.py` canonical 13-scenario regression matrix (12 pass + 1 V2 xfail) | ||
| - [x] Incidental fix — empty `decision_id` on ungrounded `pending_grounding_checks` (`ledger/adapter.py:475`) |
There was a problem hiding this comment.
Minor: stale line reference.
ledger/adapter.py:475 no longer points at the empty-decision_id fix; the relevant block is now around L500–509 in the current file. Consider replacing with a symbolic anchor (e.g., "in get_all_decisions ungrounded fan-out") so doc rot doesn't accumulate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@PLAN.md` at line 102, The checklist entry references a stale line number for
the incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.
…ning Ships three read-path / measurement pieces from the V1 plan: A1 — Drift benchmark harness (tests/bench_drift.py) Seeds 100 decisions across 25 files (real symbols via tree-sitter extract_symbols, no BM25 index build required), times search_decisions / detect_drift / link_commit under a memory:// ledger, writes test-results/bench/drift_baseline.json plus a stdout summary. Registered @pytest.mark.bench so default test runs skip it. Baseline on Apple Silicon: search_decisions p95 = 9.8 ms, detect_drift p95 = 14.8 ms, link_commit (warm) p95 = 8.0 ms — all 55–185× under the V2 targets in PLAN.md:83. A2-light — per-repo asyncio.Lock for handle_bind New handlers/sync_middleware.repo_write_barrier async context manager backed by a module-level dict[repo_path, asyncio.Lock]. Lazy guard-lock construction avoids the "bound to wrong loop" pitfall across test event loops. handle_bind wraps its body in the barrier via a thin _do_bind inner function. Deliberately narrow scope: does NOT protect resolve_compliance or cross-process writers — both are V2 scope (plan §5.2, §5.5). A3 — Sync metrics instrumentation (SyncMetrics contract) New contracts.SyncMetrics model with optional sync_catchup_ms and barrier_held_ms float fields, attached as sync_metrics: SyncMetrics | None = None to SearchDecisionsResponse, PreflightResponse, HistoryResponse, BindResponse. Purely additive, non-breaking. repo_write_barrier now yields a mutable BarrierTiming whose held_ms is populated on exit (including on exceptions). Each handler times its own sync call locally so nested calls (e.g. preflight → search_decisions) don't step on each other's metrics. Incidental (rebase update): an earlier draft of this commit pinned surrealdb>=1.0.0,<2.0.0 in pyproject.toml as a defensive workaround for an SDK 2.0 crash during schema migration (InternalError raised inside an except-LedgerError block in ledger/schema.py:259, which made every ingest-based test fail against 2.0). That pin was dropped during rebase onto main: commit 66796ef ("fix(client): wrap SDK SurrealError so init_schema idempotency catch works") fixes the underlying issue at the LedgerClient layer, so the test suite now passes against surrealdb>=2.0.0 directly. No pin survives in this commit's diff. Tests: 75 passed, 1 xfailed against surrealdb 2.0.0 after rebase (was 32 passed pre-F1; F1 lifts the count). Zero regressions from the v0.6.3 / v0.6.4 / 66796ef / 8d44a16 / ce74461 / a5aface rebase. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sory B1 — ledger/ast_diff.py is_cosmetic_change(before, after, lang) returns True only when two snippets differ by inter-token whitespace alone. Strategy: parse both with tree-sitter and compare a recursive (node.type, child_sigs | leaf_bytes) signature. Two trees with identical signatures differ only by whitespace tree-sitter doesn't represent as nodes; any structural or leaf-text difference returns False. Strict whitelist by design — variable renames, trailing commas, comment edits (incl. # type: ignore / // @ts-ignore / build tags), docstring edits, and string-literal changes all return False. False negatives (real cosmetic changes routed unbiased to L3) are cheap; false positives bias the future V2 caller-LLM verdict prompt toward "looks fine" — which is the exact failure mode Codex pass-7 finding #3 flagged. Reuses code_locator.indexing.symbol_extractor._get_parser and LANGUAGE_FALLBACK so the cosmetic detector and the symbol indexer can never silently disagree on supported languages. Supports python, javascript, typescript, java, go, rust, c_sharp (plus jsx → javascript and tsx → typescript via fallback). Unsupported langs, parse failures, and has_error trees fail safe to False. B2 — DriftEntry.cosmetic_hint Adds cosmetic_hint: bool = False on contracts.DriftEntry (additive, non-breaking). handlers/detect_drift._enrich_with_cosmetic_hints runs after the pure raw_decisions_to_drift_entries mapping and only touches drifted entries. Source of "before" / "after": HEAD bytes vs working-tree bytes via ledger.status.get_git_content, sliced to the region's (start_line, end_line). Language resolved from file extension via EXTENSION_LANGUAGE. Five fail-safe paths leave cosmetic_hint=False: non-drifted status, equal HEAD/working-tree bytes (no diff to classify), unsupported extension, invalid line range, exception during is_cosmetic_change. Codex invariants enforced by tests: * The hint NEVER mutates content_hash or any other DriftEntry field besides cosmetic_hint itself (test_content_hash_never_mutated). * The hint never gates drift surfacing or status — it is metadata consumed by the eventual V2 caller-LLM verdict prompt. Tests: 61 passed in 3.81s — 21 ast_diff, 8 b2_cosmetic_hint, 19 sync_middleware, 6 bind, 7 phase3 integration. Zero regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d payload
V1 plan D1 was originally designed around a server-side
search_code(decision.description) to surface relocation candidates.
That approach is obsolete after v0.6.4 nuked search_code and shifted
all code retrieval to the caller LLM. v0.6.4 already implemented the
spirit of D1:
- ledger/adapter.py:412-420 emits a symbol_disappeared grounding
check on the authoritative ref when resolve_symbol_lines returns
None for a tracked region;
- handlers/link_commit.py:21 verification_instruction tells the
caller to use Grep/Read + validate_symbols / extract_symbols +
bicameral.bind;
- tests/test_link_commit_grounding.py covers the rename-→-grounding
flow end-to-end.
V1's actual contribution: add original_lines: [start_line, end_line]
to the symbol_disappeared payload so the caller LLM can inspect the
symbol's prior position via `git show <prev_ref>:<file_path>` to
ground its own retrieval. Single-line addition in ledger/adapter.py;
new assertion in the existing regression test.
Strictly informational. No new "call bicameral_bind" CTA — Codex
pass-10 finding #2 stands: V2 must ship bicameral_rebind with
old-binding CAS + a fresh L3 verdict on the new target before any
rebind workflow becomes safe.
Tests: 63 passed in 4.52s — zero regressions across ast_diff,
b2_cosmetic_hint, sync_middleware, bind, link_commit_grounding,
phase3 integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…trix New tests/test_desync_scenarios.py — one test per scenario from the Notion "Auto-Grounding Problem" catalog (Notion ID 3332a51619c4813caccec86c36d9bf98), routed through the real handler layer per the Apr 8 PR #84 lesson (tests bypassing handlers miss post-ingest hooks). Self-contained fixture builds a tmp git repo on ``main`` plus a memory ledger per test, so scenarios are independent of each other and of the bicameral-mcp checkout. Scorecard: 12 PASS, 1 XFAIL — meets V1 plan acceptance gate (>=12/13). Pass: 1. New decision, matching code exists → ungrounded → caller binds 2. Code changed after grounded → pending + pending_compliance_check 3. Code deleted after grounded → symbol_disappeared 4. Symbol renamed in file → symbol_disappeared with original_lines (V1 D1) 5. Symbol moved to different file → symbol_disappeared 6. Code added later, decision becomes resolvable → caller binds explicitly 7. Cold start: no matching code in repo → stays ungrounded 9. Intent description supersession → re-ingest succeeds 10. Multiple intents share a symbol → both surface in drift response 11. No server-side BM25 grounding (post-v0.6.0) → stays ungrounded — caller-LLM-driven 12. Line-shift edit does not trigger drift → resolve_symbol_lines re-resolves 13. [Open Question] prefix → ingested as gap-class decision XFAIL: 8. Drifted intent → atomic re-ground → requires V2 bicameral_rebind with old-binding CAS + fresh L3 verdict on new target (design doc 8 D2; Codex pass-10 #2) Scenario 2 explicitly asserts ``pending`` (not ``drifted``) per post-v0.5.0 derive_status semantics: a hash diff WITHOUT a cached ``compliant`` verdict yields ``pending``. ``drifted`` requires a caller-LLM verdict via the verdict cache (V2 territory). Incidental bug surfaced by F1 and fixed in ledger/adapter.py: ``pending_grounding_checks`` for ungrounded decisions read ``d.get("id", "")`` from ``get_all_decisions(filter="ungrounded")``, but that query aliases the field to ``decision_id``. Result: every ungrounded grounding-check entry shipped with ``decision_id=""``, leaving callers no handle to bind against. The existing ``test_pending_grounding_checks_for_ungrounded_decisions`` regression didn't catch it because it only asserted ``len > 0`` on the list, not non-empty IDs. Fix: read ``decision_id`` first, fall back to ``id`` for forward compatibility. Real correctness bug, not a test artifact. Tests: 76 passed, 1 xfailed in 7.35s across test_desync_scenarios, test_b2_cosmetic_hint, test_ast_diff, test_phase3_integration, test_sync_middleware, test_bind, test_link_commit_grounding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…don't get bind CTA Codex pass-12 finding #2: D1 added original_lines to the symbol_disappeared payload (richer relocation context) but left the v0.6.4 verification_instruction text directing callers to use bicameral.bind for ALL pending_grounding_checks. For reason='ungrounded' that CTA is correct (no prior binding to retire, no duplicate-binding risk). For reason='symbol_disappeared' it is unsafe: bind on the new location leaves the old edge live under the N:N binds_to relation, producing duplicate-binding state. Atomic rebind that retires the stale edge in the same write ships in V2 (design doc 8 D2; Codex pass-10 finding #2). Net effect of D1 was that we added more relocation context routed through an unsafe CTA. This commit removes the unsafe CTA for relocation cases without reducing the safe CTA for ungrounded cases. Changes in handlers/link_commit.py: - _VERIFICATION_INSTRUCTION (single string) split into _VERIFICATION_INSTRUCTION_BASE + _GROUNDING_INSTRUCTION_UNGROUNDED + _GROUNDING_INSTRUCTION_RELOCATION. - New _build_verification_instruction(pending_compliance, pending_grounding) composer assembles the response text per call based on which reason values actually fired. Symbol-disappeared cases get an explicit "INFORMATIONAL ONLY — do NOT call bicameral.bind" warning citing the duplicate-binding hazard. - Response wiring at the bottom of handle_link_commit swapped from the constant lookup to the composer call. Tests (tests/test_link_commit_grounding.py): - test_pending_grounding_checks_for_ungrounded_decisions now asserts "bicameral.bind" IN the verification_instruction AND "INFORMATIONAL ONLY" NOT in it. - test_pending_grounding_checks_symbol_not_found now asserts "INFORMATIONAL ONLY" IN the instruction AND an explicit bind-prohibition phrase. Net destructive surface change for V1: arguably negative — V1 actively *removes* a v0.6.4 unsafe CTA for relocation, while introducing zero new mutating capabilities. Tests: 75 passed, 1 xfailed in 7.11s (xfail is scenario 8, V2 atomic rebind — by design). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the project's CLAUDE.md auto-tick mandate, after V1 implementation
work lands we tick the corresponding rows in TODO.md and PLAN.md and
add a CHANGELOG entry.
CHANGELOG.md
"Unreleased — desync optimization V1 — measurement + read-path
advisory" entry. Detailed per-deliverable summary (A1, A2-light,
A3, B1, B2, D1, F1, pass-12 follow-up, incidental decision_id fix)
with baseline numbers, scorecard table, and explicit V2-deferred
list pointing at docs/desync-optimization.md.
TODO.md
- Phase 3: ticked "Performance benchmarks" (V1 A1 measurements show
we already meet the V2 perf targets — search p95 = 10.4 ms,
drift p95 = 15.5 ms vs 2 s / 1 s targets, 55-185x under).
- LLM drift judge note refined to "(V2 — see ...)".
- New "Desync Optimization V1 — DONE" sub-section enumerating every
V1 task ID + new "V2 — Deferred" sub-section listing the
parking-lot items.
- Mock Registry note clarified: V1 introduces no new mocks.
PLAN.md
- Phase 3 perf bullet ticked with the same measurement rationale.
- New "Desync Optimization V1 — DONE" section mirroring TODO.md's
enumeration so PLAN.md stays the canonical phased view.
- LLM drift judge bullet rewritten to point at the V2 design doc.
mocks/README.md is intentionally NOT updated — file does not exist
in-repo (verified empirically); the V1 plan's earlier reference to
it has been corrected.
.claude/skills/bicameral-doctor/SKILL.md is intentionally NOT updated
in V1 — V1's safeguard for relocation cases is the server-emitted
verification_instruction text (handlers/link_commit._build_verification_instruction),
which IS in-tree and DOES ship. Adding doctor-skill rendering prose
for pending_grounding_checks is V2 / future-Docs scope.
Tests: 75 passed, 1 xfailed in 7.68s — zero regressions from doc work
(docs only — no production code changed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single self-contained guide for V2 implementation. Replaces two prior
working artifacts that were never tracked (per project convention that
plan docs stay untracked):
- docs/desync-optimization.md (V2 design with 9 Codex review passes)
- docs/desync-optimization-v1-plan.md (V1 plan + pass-12 fixes)
Both have been removed from the working tree; their contents are
absorbed here as a single source of truth for any engineer or agent
picking up V2 work after V1 ships on this branch.
Structure (10 sections, ~1000 lines):
1. Quick start — read order for fresh readers
2. Background — three-timeline desync, 13-scenario catalog, scorecard
3. V1 outcomes — exhaustive list of what V1 shipped + the honest
"20-30% of user-facing value" assessment so the next engineer
doesn't overestimate the foundation
4. V2 scope — capability gap table + product targets
5. Architecture — L1/L2/L3 model, per-binding ownership, 5-field CAS,
tombstones, append-only verdict history, two-phase rebind with
attempt-id locking, three-mechanism sync barrier, spec-vs-code-shape
writes
6. Implementation plan — 6 phases as a DAG with file-level breakdown
and effort estimate (7-10 weeks, 1 engineer, sequential)
7. Constraints catalog — synthesized from 14 Codex review passes;
each entry is a "what NOT to ship" rule traced to the pass # that
found the bug
8. Open questions — 7 decisions Codex can't make; explicitly flags
Jin involvement (CODEOWNERS approval required)
9. Acceptance criteria — quantitative thresholds, qualitative checks,
ship-blocker gate tests
10. References — local files, V1 commits, Notion docs, external sources
Pass-13 fixes folded in (V2 design review):
- 5.6 / 6 Phase 4 D2: rebind phase 2 binds to a specific
rebind_attempt_id with single-pending-rebind-per-old-binding
enforcement. Stale phase-2 verdicts on superseded attempts cannot
tombstone the old edge.
- 5.5 / 6 Phase 4 C2: record_compliance_verdict derives projection
keys from POST-mutation state. History insert + binds_to update +
projection upsert run in one atomic transaction with the
post-mutation tuple. Mandatory acceptance test
test_not_relevant_then_restore_cycle locks the contract.
- 7.10: general rule — whenever a write mutates a field that is part
of any cache or index key, the cache/index write must be derived
from the post-mutation value.
Pass-14 fixes folded in (V2 guide review):
- 6 Phase 0b: atomicity is a committed choice, not deferred. V2
uses LedgerClient.transaction() wrapping inline BEGIN/COMMIT
TRANSACTION blocks. Day-1 forced-failure gate test
(test_transaction_rolls_back_on_failure) is a hard ship-blocker
for Phase 0b. Explicit fallback path documented if gate fails:
LET-chained statements, then network SurrealDB.
- 5.6: pending rebinds get a server-enforced lease
(BICAMERAL_REBIND_LEASE_SECONDS, default 24h). On-demand expiry
sweep at start of every phase 1 atomically abandons stale leases
as outcome='abandoned_by_expiry'. force_supersede=true flag for
explicit takeover. Phase 2 lease check rejects verdicts on
expired/superseded/abandoned attempts as stale-history-only.
Guarantees forward progress even on caller crash.
- 8 q5 / 7.3 / 10: judge_gaps contradiction resolved.
bicameral_judge_gaps is read-only; the destructive write is in
resolve_compliance, which Phase 0a migrates. Phase 0a covers the
entire pipeline; no separate change to judge_gaps required.
- 6 Phase 4 D2 / 7.11: edge-vs-region terminology audit.
binding_version lives on binds_to edges, never on code_region.
All Phase 4 D2 wording rewritten in edge terminology; explicit
warning against the rejected per-region versioning model.
- 7.11: pass-14 constraints added to the catalog with
prevention rules synthesized.
V1 status preserved: 6 commits on this branch (3b4d0bb...8e226c5)
ship the read-path advisory layer with zero new mutating capabilities.
75 passed + 1 xfailed in test suite. The xfailed scenario 8 flips to
PASS the moment V2's bicameral_rebind ships — it's
@pytest.mark.xfail(strict=True), so a CI-visible signal will fire.
V2 effort sense: 7-10 engineer-weeks sequential. Phases don't
parallelize cleanly because each one's correctness depends on the
prior one's invariants. Strong recommendation to involve Jin at
design time (CODEOWNERS, project-judgment review the adversarial
Codex passes can't provide).
Branch: desync-optimization-v1 (renamed from desync-optimization)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-rebase onto v0.7.0 (Accountable North Star), fresh decisions are
ingested as signoff.state='proposed' and report status='proposal' until
explicitly ratified via bicameral_ratify. They are drift-exempt by
design — they never enter pending/drifted until ratified.
Scenario 2 was written under the v0.5.0 contract where a hash-divergent
region without a cached compliant verdict reports status='pending'.
Under v0.7.0 the same flow now reports status='proposal' for the
unratified decision, even though the substantive drift signal still
flows through pending_compliance_checks (which the test continues to
assert above the status check).
Update: accept any of {pending, drifted, proposal} — the per-decision
status field is richer under v0.7.0; the scenario's invariant (system
surfaces a compliance check when bound code changes) is unchanged.
Tests: 77 passed, 1 xfailed in 7.52s — full V1 suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DriftEntry.cosmetic_hint was shipped in V1 B2 with no corresponding skill update, leaving the field invisible to the caller LLM. Add the three-point contract to bicameral-preflight SKILL.md (both copies): - Step 4: cosmetic_hint=true → mechanical classification, never ask - Step 5: ~ REFORMATTED rendering template (not ⚠ DRIFTED) - Stop-and-Ask Contract: cosmetic drift listed as mechanical example Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four occurrences of the deleted V1 plan / V2 design paths (docs/desync-optimization.md, docs/desync-optimization-v1-plan.md) pointed at files that were consolidated into docs/v2-desync-optimization-guide.md and removed. Replace each with the consolidated path so internal links resolve. Surrounding prose preserved verbatim per the reviewer's "preserve text and formatting" instruction; the dated "nine rounds of Codex review" claim is unchanged (the guide actually absorbed 14 passes by ship time, but that's a separate edit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project root CLAUDE.md (lines 3-17) requires that any tool response
field shape change ships with a matching SKILL.md update. V1 added:
- sync_metrics on SearchDecisionsResponse / PreflightResponse /
HistoryResponse / BindResponse (contracts.py)
- cosmetic_hint on DriftEntry (contracts.py)
- original_lines on pending_grounding_checks symbol_disappeared
payload (ledger/adapter.py)
- per-reason verification_instruction split
(handlers/link_commit.py)
None of these were mentioned in any skill — audit by grep across
.claude/skills/{search,preflight,history,doctor,scan-branch}/SKILL.md
found zero hits for the new field names before this commit. Updates:
- bicameral-search / bicameral-preflight / bicameral-history:
one-line note that sync_metrics is server-side observability
(skip rendering; log if profiling). Minimal per the change owner's
"minimal note" instruction for observability fields.
- bicameral-doctor: new "Per-entry advisory fields" subsection
covering all three doctor-relevant additions:
* cosmetic_hint as advisory metadata that never gates status
* original_lines on symbol_disappeared payloads
* the new verification_instruction split — explicit warning
that the doctor skill must NOT synthesize a bicameral.bind
CTA for symbol_disappeared cases (would create
duplicate-binding state under N:N binds_to until V2 atomic
rebind ships); ungrounded keeps the safe bind CTA.
- bicameral-scan-branch: extended the existing DriftEntry bullet
with a cosmetic_hint advisory note. Emphasizes that the hint is
render-time tag only — drifted entries stay drifted regardless.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four correctness fixes surfaced by post-V1-merge review. All verified
against current code before applying; none affect the V1 acceptance
contract (zero new mutating capabilities, scorecard 12 PASS + 1 XFAIL).
77 passed + 1 xfailed in the V1 suite after fixes.
handlers/detect_drift.py — two issues
A. Magic 10**9 in get_git_content calls. ledger.status.get_git_content
takes start_line / end_line in its signature but ignores them
entirely — the function always returns the full file body, with
callers slicing locally afterward. Two existing legacy callers
(ledger/status.py:159, ledger/adapter.py:67) share the same
signature smell. Replaced 1, 10**9 with 0, 0 and an explicit
comment pointing at the upstream signature for a future cleanup.
The slicing logic that follows is unchanged and correct.
B. _enrich_with_cosmetic_hints diffed HEAD vs working tree
unconditionally, even when the caller passed use_working_tree=False
(which makes handle_detect_drift advertise source="HEAD" in the
response). The cosmetic-hint axis is HEAD-vs-working-tree, so
attaching hints in the source="HEAD" mode would have been
misleading metadata derived from a diff axis the caller didn't ask
about. Minimal fix per the reviewer's option: gate the enrichment
call with `if use_working_tree:` so HEAD-source responses never
carry cosmetic_hint values. tests/test_v0414_source_excerpt.py
(the one test that calls with use_working_tree=False) still
passes — it doesn't assert on cosmetic_hint.
handlers/search_decisions.py — sync_catchup_ms scope mismatch
V1 A3 timed only handle_link_commit in this handler, but
ensure_ledger_synced (used by preflight / history) covers both
link_commit AND get_session_start_banner. The same metric name
measured different surfaces across handlers. Extended the timed
scope here to include the banner call so sync_catchup_ms means the
same thing everywhere.
ledger/ast_diff.py — fail-safe contract leak
is_cosmetic_change documents a "fail-safe → False" contract for
any classifier failure, but the recursive _signature comparison ran
outside the try/except. _signature recurses into the AST and can
raise RecursionError on deeply nested trees, which would propagate
out of the function instead of returning False. Combined the parse +
has_error check + signature comparison into one guarded block;
except (Exception, RecursionError) returns False per the contract.
RecursionError is technically a subclass of Exception in Python 3
(the explicit tuple is documentation-of-intent — the comment block
above this also calls out recursion specifically).
tests/test_ast_diff.py — JSX fallback exercise
test_jsx_routes_through_javascript used identical bytes for before
and after, which short-circuits at the top of is_cosmetic_change
before the LANGUAGE_FALLBACK['jsx'] → 'javascript' path is reached.
The test name claimed to verify routing but never exercised it.
Updated to use whitespace-only diff in the JS portion of the
expression (extra spaces around the `=` sign) so bytes differ but
the expected outcome stays True. JSX text content kept unchanged
(text inside <div>...</div> IS significant in JSX — first attempt
put spaces there and correctly returned False).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the "small scope expansion only when V2 is already close" bar: V2 Phase 1-4 ship every primitive #47 needs (per-binding baseline, full-CAS hash comparison, symbol re-resolution, atomic rebind, V1's relocation surfacing). The remaining gap to fully closing #47 is one thin read-only handler that wires those primitives at the branch level. No new mutating capabilities. No schema changes. No new contract surface beyond a single response type. Effort delta: 3-5 days; total V2 estimate goes from 7-10 weeks to ~8-11 weeks. Updates to docs/v2-desync-optimization-guide.md: - 4.1 capability gap table: row 13 added (branch-aware drift report, GitHub #47, Phase 6 fix). - 4.2 product targets: explicit "branch-aware drift report works" bullet citing #47 closure and the unblocked downstream consumers (#48 pre-push hook, #49 PR-comment Action). - 6 implementation plan: Phase 6 box added to the DAG diagram below Phase 5 polish. - 6 Phase 6 section: full deliverable spec with the ScanBranchResponse contract, read-only invariant, dependency on C0 per-binding baseline, and the four acceptance criteria mirroring #47 verbatim plus the Closes #47 commitment. - 6 effort estimate table: Phase 6 row (3-5 days), total updated. - 9 acceptance criteria: new gate that the V2 PR uses Closes #47 with a passing read-only invariant test. - 10 references: handlers/scan_branch.py and tests/test_scan_branch.py added to the file list. - Final note: post-V2 workflow change codified — V2 is the last release authored by reverse-mapping deliverables to issues. After V2, switch to issue-driven (pick issue, satisfy ACs, Closes #N). Phase 6 documented as the canonical "expand in-flight release to close an adjacent issue" precedent: underlying machinery must already exist, addition must be additive, issue ACs must be fully satisfiable. Suggested next-issue order: #39 -> #42 -> #41 -> #44 -> #48 -> #49 (all 5 close via small focused PRs over 4-6 weeks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All four findings verified against current code; only the actionable ones applied. 81 passed + 1 xfailed in 9.02s. #1 — skills/bicameral-preflight/SKILL.md sync_metrics note The .claude/skills copy got the sync_metrics observability note back when V1 A3 shipped, but the canonical skills/ copy never did. Mirror the wording verbatim near step 2 so the rendering guidance and response-field documentation stay in sync. #2 — handlers/detect_drift.py per-entry alignment The cosmetic-hint enrichment was slicing both head_full and wt_full using entry.lines (the baseline anchor). HEAD and the working tree can shift the symbol independently, so a single index range can't align both sides. The narrow consequence: a drifted entry with shifted lines could yield a misleading cosmetic_hint=true on bytes that aren't the bound region. Fix: re-resolve the symbol against each ref via resolve_symbol_lines(file_path, entry.symbol, repo, ref="HEAD") and ref="working_tree" separately, slice each ref using its own resolved range. Resolution failure on either side → safe default of cosmetic_hint=False (matches the V1 contract: "False is cheap, True must be earned"). Empty symbol → skip (new fail-safe path). Test refactor: test_invalid_lines_skipped renamed to test_unresolvable_symbol_skipped — the old test asserted that lines=(0,0) was the failsafe trigger, but entry.lines is no longer the alignment input. New test exercises the resolve_symbol_lines-returns-None path via a nonexistent symbol name, which is the real fail-safe gate now. #3 — V2 guide TOC anchor for §9 GitHub auto-generates fragment IDs from heading text by lowercasing, replacing spaces with hyphens, and dropping punctuation. "## 9. Acceptance criteria for V2" maps to #9-acceptance-criteria-for-v2, but the TOC pointed at #9-acceptance-criteria (truncated). Link broken. Updated to the correct fragment. #4 — V2 guide unlabeled fenced code blocks (markdownlint MD040) Six fenced opens used bare ``` instead of a labeled fence. Tagged each with ```text — the contents are commit listings, ASCII DAG diagrams, pseudocode protocols, and tuple notation, none of which fit a real language tag. The other fenced blocks in the guide (already tagged ```sql / ```python) are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e6b81ed to
5c49637
Compare
Summary
Ships V1 of a two-part desync-correctness initiative plus a consolidated V2 implementation guide for the destructive-path overhaul that's deliberately deferred. V1 introduces zero new mutating
capabilities; every change is read-only measurement, additive contract field, pure function, test coverage, or a surgical bug fix to an already-shipped path.
asyncio.Lockforbind,SyncMetricsinstrumentation, strict-whitelist tree-sitter cosmetic-change classifier withDriftEntry.cosmetic_hintadvisory metadata,original_linesonsymbol_disappearedgrounding checks, canonical 13-scenario regression matrix (12 PASS / 1 XFAIL), and a_VERIFICATION_INSTRUCTIONsplit thatremoves the v0.6.4 unsafe-bind CTA for relocation cases.
docs/v2-desync-optimization-guide.md(999 lines) consolidating the V2 design + V1 plan + 14 rounds of Codex review into one self-contained guide for the nextengineer/agent to pick up.
Net destructive-surface change for V1: zero (and arguably negative — V1 actively removes the v0.6.4 unsafe relocation CTA).
What V1 ships
@pytest.mark.benchtests/bench_drift.pyasyncio.Lockforhandle_bind(in-process only)handlers/sync_middleware.py::repo_write_barrier,handlers/bind.pySyncMetrics(sync_catchup_ms,barrier_held_ms) on Search/Preflight/History/Bind responsescontracts.py, four handlersledger/ast_diff.pyDriftEntry.cosmetic_hintadvisory metadatacontracts.py,handlers/detect_drift.pyoriginal_linesonsymbol_disappearedgrounding checks; verification-instruction split for safe relocation routingledger/adapter.py,handlers/link_commit.pytests/test_desync_scenarios.pydecision_idon ungrounded grounding checks (surfaced by F1)ledger/adapter.py:475Performance baseline (Apple Silicon, surrealdb 2.0.0):
search_decisionsp95 = 10.4 ms (target <2 s — 185× under)detect_driftp95 = 15.5 ms (target <1 s — 55× under)link_commit(warm) p95 = 8.0 msTest plan
pytest tests/test_link_commit_grounding.py tests/test_desync_scenarios.py tests/test_b2_cosmetic_hint.py tests/test_ast_diff.py tests/test_phase3_integration.py tests/test_sync_middleware.py tests/test_bind.py→ 75 passed, 1 xfailed in 7.68spytest tests/bench_drift.py -v -m bench -s→ benchmark passes; baseline numbers documented in CHANGELOGsurrealdb>=2.0.0(the SDK 2.0 schema-migration crash I had originally pinned around is fixed properly upstream by66796ef)tests/test_desync_scenarios.pyis@pytest.mark.xfail(strict=True)— it MUST stay xfailed until V2'sbicameral_rebindships. If it ever flips toxpassed, V2 hasleaked into V1 unintentionally.
V2 — explicitly deferred (not in this PR)
The full V2 plan lives in
docs/v2-desync-optimization-guide.md:handlers/resolve_compliance.pyfrom hard-delete to tombstone+CAS (closes the highest-impact destructive backdoor — Codex pass-10 refactor: port interfaces + source_span for drift pipeline #1, hard prerequisite); A0 atomic SurrealQL transactionprimitive with day-1 forced-failure gate test.
compliance_verdict_history, full-CAS cache key) + traversal filtering on everybinds_toconsumer.pending_compliance_checksfromdetect_drift.bicameral_judge_drift+record_compliance_verdict(5-field CAS),bicameral_advance_baseline,bicameral_rebind(two-phase + lease + attempt-id locking).Effort estimate: 7–10 engineer-weeks sequential. Phases don't parallelize cleanly because each one's correctness depends on the prior one's invariants.
The guide synthesizes 14 Codex review passes into a "what NOT to ship" constraints catalog (§7) — every entry came from a pass that found a real bug in an earlier draft. Strong recommendation to involve Jin at
design time before V2 implementation begins (CODEOWNERS approval required, plus this work needs project-judgment review the adversarial Codex passes can't provide).
Notes for reviewers
docs/v2-desync-optimization-guide.mdis 999 lines but heavily structured; §1 has a recommended read order for fresh readers.docs/desync-optimization.mdanddocs/desync-optimization-v1-plan.md) were never tracked per project convention; their contents are absorbed into the consolidated guide.v0.7.0?) at release time.xfail(strict=True)is intentional CI plumbing; do not remove without shipping V2 D2.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests