Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
14 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .claude/skills/bicameral-doctor/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,12 @@ The handler returns a `DoctorResponse` with:
- `ledger_summary` — `DoctorLedgerSummary` with repo-wide `total`, `drifted`, `pending`, `ungrounded`, `reflected` counts. Populated on branch scope only.
- `action_hints` — merged from whichever sub-scan produced them. Same intensity-gated semantics as every other skill (`guided_mode` controls `blocking`).

### Per-entry advisory fields (read-path only, never gate behavior)

- **`DriftEntry.cosmetic_hint: bool`** (on every entry inside `file_scan.decisions` and `branch_scan.decisions`). True when the HEAD-to-working-tree diff for that region is provably whitespace-only per the strict tree-sitter classifier (`ledger/ast_diff.is_cosmetic_change`). Never affects status; the entry stays drifted and the user must still address it. Use as a render-time tag (e.g. *"cosmetic edit, please confirm"*) — do not use it to suppress drift.
- **`pending_grounding_checks[].original_lines: [start, end]`** when `reason == "symbol_disappeared"` (visible inside `file_scan.sync_status.pending_grounding_checks` and the equivalent under branch scope). Lets the caller LLM run `git show <prev_ref>:<file_path>` over those lines to inspect the symbol's prior position before deciding what to do. Strictly informational.
- **`sync_status.verification_instruction`** is now built per response based on which `pending_*` payloads fired. For `pending_grounding_checks` with `reason == "symbol_disappeared"`, the text is **INFORMATIONAL ONLY** and explicitly forbids calling `bicameral.bind` on the new location (it would create duplicate-binding state under the N:N `binds_to` relation). Until V2 ships atomic rebind, the doctor skill must not synthesize a bind CTA for relocation cases. For `reason == "ungrounded"`, the bind CTA is safe and remains in the instruction text — render it as guidance.

## How to render

### Scope = file
Expand Down
2 changes: 2 additions & 0 deletions .claude/skills/bicameral-history/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,8 @@ bicameral.history(
)
```

The response also carries an optional `sync_metrics` (`{sync_catchup_ms, barrier_held_ms}`) observability field for the catch-up time spent inside `ensure_ledger_synced`. **Skip rendering it** — these are server-side latency numbers, not user-visible signal. Log them if you're profiling, otherwise ignore.

## How to present

Group decisions by `HistoryFeature`. For each group:
Expand Down
24 changes: 22 additions & 2 deletions .claude/skills/bicameral-preflight/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,12 @@ The handler runs `bicameral.search` internally, gates on the user's
`guided_mode` setting, conditionally chains to `bicameral.brief`, and
returns a `PreflightResponse` with a `fired: bool` field.

The response also carries an optional `sync_metrics`
(`{sync_catchup_ms, barrier_held_ms}`) observability field for the
catch-up time spent in `ensure_ledger_synced`. **Skip rendering it** —
these are server-side latency numbers, not user-visible signal. Log
them if you're profiling, otherwise ignore.
Comment thread
coderabbitai[bot] marked this conversation as resolved.

### 2.5 Render session-start banner if present

Before evaluating `response.fired`, check `response.session_start_banner`.
Expand Down Expand Up @@ -180,6 +186,14 @@ Categories with no ask-findings are silently skipped. If every
finding in every category is mechanical, produce NO output (same as
`fired=false` — silent).

**Cosmetic drift rule**: if a `drifted` entry has `cosmetic_hint=true`,
classify it as **mechanical** regardless of guided mode. The server has
verified via AST comparison that the change is whitespace-only and
semantically inert — the stored intent is still intact. Auto-resolve
silently; do NOT add it to the drift ask-queue and do NOT emit a
blocking hint. Render it with `~` prefix (not `⚠ DRIFTED:`) if you
render it at all — see the template in Step 5.

### 5. Render the surfaced block

When at least one ask-finding exists, surface the response using this
Expand All @@ -200,6 +214,11 @@ format. Lead with the `(bicameral surfaced)` attribution line.
Source: <source_ref>
Drift evidence: <drift_evidence verbatim>

~ REFORMATTED: <decision description> ← cosmetic_hint=true only
<file_path>:<symbol>:<lines>
Source: <source_ref>
(whitespace-only change — intent intact, no action needed)

⚠ N divergent decision pair(s) — pick a winner before continuing:
• <symbol> (<file_path>): <summary>

Expand Down Expand Up @@ -310,8 +329,9 @@ For every finding this skill surfaces, classify first:

- **mechanical** — one obvious correct answer (e.g., renamed symbol
with identical signature; a decision whose code moved but semantics
are intact). Auto-apply the resolution silently. Do NOT ask the
user.
are intact; a `drifted` entry with `cosmetic_hint=true` — AST
comparison confirmed whitespace-only change). Auto-apply the
resolution silently. Do NOT ask the user.
- **ask** — reasonable people could disagree (e.g., drifted behavior
where the old decision may still be valid; divergent decisions where
no clear winner exists). Emit ONE question per finding, using the
Expand Down
2 changes: 1 addition & 1 deletion .claude/skills/bicameral-scan-branch/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ The handler returns a `ScanBranchResponse` with:
- `base_ref` / `head_ref` — the resolved refs that were diffed
- `sweep_scope` — `"range_diff"` (default, good), `"head_only"` (base was unreachable — fell back to HEAD-only scope, surface to user), or `"range_truncated"` (range exceeded the 200-file cap; the scan ran on the first 200 files, rest need a separate pass)
- `range_size` — number of files the sweep covered
- `decisions` — deduped list of `DriftEntry` across all files (each decision shows up once even if it touches multiple files)
- `decisions` — deduped list of `DriftEntry` across all files (each decision shows up once even if it touches multiple files). Drifted entries may carry `cosmetic_hint=true` when the HEAD-to-working-tree diff for that region is provably whitespace-only per the strict tree-sitter classifier (`ledger/ast_diff.is_cosmetic_change`). The hint is **advisory metadata only** — it never gates drift surfacing or status, and the entry stays in the drifted bucket regardless. Treat it as a render-time signal: a cosmetic-hinted drift is still drift the user must address.
- `files_changed` — the file paths that were swept
- `drifted_count` / `pending_count` / `ungrounded_count` / `reflected_count`
- `undocumented_symbols` — union across all files
Expand Down
2 changes: 2 additions & 0 deletions .claude/skills/bicameral-search/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,8 @@ Pre-flight check before coding — surface past decisions relevant to what you'r

$ARGUMENTS — the feature, task, or area to search for prior decisions about

The response also carries an optional `sync_metrics` (`{sync_catchup_ms, barrier_held_ms}`) observability field for the catch-up time spent in the implicit `link_commit` that runs before search. **Skip rendering it** — these are server-side latency numbers, not user-visible signal. Log them if you're profiling, otherwise ignore.

## Action Hint Contract (v0.4.10+)

The response always includes an `action_hints` list. Two intensities,
Expand Down
194 changes: 194 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,200 @@
All notable changes to bicameral-mcp are tracked here. Format loosely follows
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## Unreleased — desync optimization V1 — measurement + read-path advisory

V1 of a two-part desync-correctness initiative. V1 ships measurement
infrastructure, a strict-whitelist cosmetic-change classifier, relocation
context enrichment, and a canonical 13-scenario regression matrix —
**without touching any destructive write path**. V2 (separate effort,
design captured in `docs/v2-desync-optimization-guide.md` with nine rounds of Codex
review) tackles the destructive-path overhaul: atomic rebind, baseline
advancement with full CAS, schema migration v6, append-only verdict
history.

V1 introduces zero new mutating capabilities. Every change is one of:
read-only measurement, additive contract field, pure function, test
coverage, or a surgical bug fix to an already-shipped path. The plan,
phase breakdown, V2 deferred items, and Codex review parking lot live in
`docs/v2-desync-optimization-guide.md`.

### Added — `tests/bench_drift.py` (A1)

Drift benchmark harness. Seeds 100 decisions across 25 files via
tree-sitter `extract_symbols` (no BM25 index build required), times
`handle_search_decisions`, `handle_detect_drift`, `handle_link_commit`
under a `memory://` ledger, writes
`test-results/bench/drift_baseline.json` plus a stdout summary.
Marked `@pytest.mark.bench` so default test runs skip it; run via
`pytest tests/bench_drift.py -v -m bench -s`.

Baseline on Apple Silicon (post-rebase, surrealdb 2.0.0):

| handler | p50 (ms) | p95 (ms) | max (ms) |
|--------------------|---------:|---------:|---------:|
| search_decisions | 9.2 | 10.4 | 11.0 |
| detect_drift | 14.2 | 15.5 | 16.4 |
| link_commit (warm) | 7.3 | 8.0 | 8.3 |

All 50–185× under the V2 perf targets in `PLAN.md:83`
(`search_decisions < 2s`, `detect_drift < 1s`).

### Added — `handlers/sync_middleware.repo_write_barrier(ctx)` (A2-light)

Per-repo `asyncio.Lock` async context manager backed by a module-level
`dict[repo_path, asyncio.Lock]`. `handle_bind` wraps its body via a thin
`_do_bind` inner function. Different repos run concurrently; same repo
serializes. Lazy guard-lock construction avoids the "bound to wrong
loop" pitfall across test event loops. Yields a mutable `BarrierTiming`
holder whose `held_ms` is populated on exit (including on exceptions).

Deliberately narrow scope: does NOT protect `resolve_compliance` or
cross-process writers — both are V2 scope (V1 plan §5.2, §5.5).

### Added — `contracts.SyncMetrics` (A3)

```python
class SyncMetrics(BaseModel):
sync_catchup_ms: float | None = None
barrier_held_ms: float | None = None
```

Attached as `sync_metrics: SyncMetrics | None = None` to
`SearchDecisionsResponse`, `PreflightResponse`, `HistoryResponse`,
`BindResponse`. Purely additive, non-breaking. Each handler times its
own sync call locally so nested calls (e.g. preflight chaining to
search_decisions) don't step on each other's metrics.

### Added — `ledger/ast_diff.is_cosmetic_change(before, after, lang)` (B1)

Strict-whitelist tree-sitter classifier returning `True` only when two
snippets differ by inter-token whitespace alone. Compares a recursive
`(node.type, child_sigs | leaf_bytes)` signature; identifier renames,
comment edits (incl. `# type: ignore` / `# noqa` / `// @ts-ignore` /
build tags / lint pragmas), docstring edits, trailing-comma changes,
string-literal changes, import reorders, and any AST shape change all
return `False`. Reuses `code_locator.indexing.symbol_extractor._get_parser`
so the cosmetic detector and symbol indexer can never silently disagree
on supported languages: python, javascript, typescript, java, go, rust,
c_sharp (plus jsx → javascript and tsx → typescript via
`LANGUAGE_FALLBACK`). Unsupported langs, parse failures, and trees with
`has_error` all fail safe to `False`.

False negatives (real cosmetic changes routed unbiased to L3 in V2) are
cheap; false positives (semantics-affecting changes mislabeled cosmetic)
bias future L3 prompts toward "looks fine" — exactly the failure mode
the strict whitelist prevents.

### Added — `DriftEntry.cosmetic_hint: bool = False` (B2)

Populated by `handlers.detect_drift._enrich_with_cosmetic_hints` after
the pure `raw_decisions_to_drift_entries` mapping (IO encapsulated
outside the pure function). Read-path advisory ONLY — never mutates
`content_hash`, never gates drift surfacing or status, never advances
baseline. Five fail-safe paths leave the hint `False`: non-drifted
entry, equal HEAD/working-tree bytes, unsupported file extension,
invalid line range, exception during classifier.

Source comparison: HEAD bytes (via `ledger.status.get_git_content` ref
`"HEAD"`) vs working-tree bytes (ref `"working_tree"`), sliced to the
region's `(start_line, end_line)`. Language resolved from file extension
via `code_locator.indexing.symbol_extractor.EXTENSION_LANGUAGE`.

### Added — `pending_grounding_checks[].original_lines` (D1)

For `reason='symbol_disappeared'` entries, the payload now carries
`original_lines: [start_line, end_line]` so the caller LLM can run
`git show <prev_ref>:<file_path>` to inspect the symbol's prior
position when locating its new home. Strictly informational — no
actionable workflow. Single-line addition in `ledger/adapter.py`.

### Added — `tests/test_desync_scenarios.py` (F1)

Canonical regression matrix for the 13 desync scenarios from the Notion
"Auto-Grounding Problem" catalog, routed through the real handler
layer per the Apr 8 PR #84 lesson (tests bypassing handlers miss
post-ingest hooks). Self-contained tmp git-repo fixture per test.

**Scorecard**: 12 PASS, 1 XFAIL.

| # | Scenario | V1 outcome |
|---|---|---|
| 1 | New decision, matching code exists | ✅ ungrounded → caller binds |
| 2 | Code changed after grounded | ✅ pending + `pending_compliance_check` |
| 3 | Code deleted after grounded | ✅ symbol_disappeared |
| 4 | Symbol renamed in file | ✅ symbol_disappeared with `original_lines` |
| 5 | Symbol moved cross-file | ✅ symbol_disappeared |
| 6 | Code added later | ✅ caller binds explicitly |
| 7 | Cold start, no matching code | ✅ stays ungrounded |
| 8 | Drifted intent → atomic re-ground | ⏸ XFAIL (V2 §8 D2 — `bicameral_rebind` with old-binding CAS) |
| 9 | Intent description supersession | ✅ re-ingest succeeds |
| 10 | N decisions share a symbol | ✅ both surface |
| 11 | No server-side BM25 grounding (post-v0.6.0) | ✅ stays ungrounded |
| 12 | Line-shift edit | ✅ no spurious drift (`resolve_symbol_lines` self-heals) |
| 13 | `[Open Question]` prefix | ✅ ingested as gap |

### Changed — `handlers/link_commit._build_verification_instruction()`

Splits the v0.6.4 monolithic `_VERIFICATION_INSTRUCTION` into three
composable parts so the response text is conditional on which
`pending_*` payloads actually fired:

- `pending_compliance_checks` present → resolve_compliance CTA.
- `pending_grounding_checks` with `reason='ungrounded'` →
`Grep/Read → validate_symbols / extract_symbols → bicameral.bind` CTA
(safe — no prior binding to retire, no duplicate-binding risk).
- `pending_grounding_checks` with `reason='symbol_disappeared'` →
**explicit "INFORMATIONAL ONLY — do NOT call bicameral.bind on the
new location" warning** citing the duplicate-binding hazard under
the N:N `binds_to` relation. Atomic rebind ships in V2.

Addresses Codex pass-10 #2 + pass-12 #2: the v0.6.4 monolithic CTA
inadvertently routed relocation cases through the unsafe bind path.
The V1 split removes that without reducing the safe CTA for ungrounded.

### Fixed — `ledger/adapter.py` ungrounded grounding-check `decision_id`

`pending_grounding_checks` for ungrounded decisions emitted empty
`decision_id` because the consumer read `d.get("id", "")` from
`get_all_decisions(filter="ungrounded")`, but that query aliases the
field to `decision_id`. Callers had no handle to bind against.
Surfaced by V1 F1 regression coverage; existing
`test_pending_grounding_checks_for_ungrounded_decisions` regression
only asserted `len > 0`, missing the empty-ID bug. Read `decision_id`
first, fall back to `id` for forward compatibility.

### Tests

75 passed, 1 xfailed in 7.11s after V1. Zero regressions on the
v0.6.3/v0.6.4/0.6.4-bump rebase, and the SDK-2.0 idempotency-catch
issue I had originally pinned around (`surrealdb<2.0.0`) is fixed
properly upstream by `66796ef`, so V1 ships against
`surrealdb>=2.0.0` directly.

### Deferred to V2

Captured in full in `docs/v2-desync-optimization-guide.md` (design target with
nine rounds of Codex review) and summarized in
`docs/v2-desync-optimization-guide.md` §4–§5:

- A0 — atomic SurrealQL block primitive (Python SDK doesn't support
`begin_transaction()` in embedded mode).
- A2a — full sync barrier (sync-token CAS + region fingerprint at
commit time).
- C0 / C0a / C1 — schema migration v5→v6 (per-binding baseline
ownership, tombstone fields, append-only `compliance_verdict_history`,
full-CAS cache key, traversal filtering).
- C2 — `bicameral_judge_drift` + `record_compliance_verdict` with
five-field CAS (incl. binding-state token).
- C3 — `pending_compliance_checks` from `detect_drift` (cache-aware).
- B3 — `bicameral_advance_baseline` (only after fresh L3 `compliant`
verdict matching full CAS).
- D2 — `bicameral_rebind` with old-binding CAS; closes scenario 8.
- Migration of the `handlers/resolve_compliance.py` hard-delete +
`handlers/ingest.py` auto-chained `handle_judge_gaps` to the
tombstone + CAS contract — hard prerequisite before V2 destructive
work ships.

## 0.7.0 — 2026-04-24 — Accountable North Star: proposal state + signoff schema

Every decision now has provenance: who proposed it, who ratified it, in which session.
Expand Down
19 changes: 17 additions & 2 deletions PLAN.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,12 +80,27 @@ Code search is caller-owned: Claude Code / Cursor / etc. use their native Grep/R
- [x] Zero active mocks
- [x] Full E2E verified
- [x] GitHub Actions CI replaces pre-push git hook
- [x] Performance: `search_decisions` < 2s, `detect_drift` < 1s on 100+ decisions — measured by V1 A1 (`tests/bench_drift.py`) at p95 = 10.4 ms / 15.5 ms, 55–185× under target

### Remaining
- [ ] Performance: `search_decisions` < 2s, `detect_drift` < 1s on repo with 100+ decisions
- [ ] LLM drift judge: wire `claude-haiku-4-5` for changed-region comparison in `detect_drift`
- [ ] LLM drift judge: wire `claude-haiku-4-5` for changed-region comparison in `detect_drift` (V2 — `docs/desync-optimization.md` §8 C2)
- [ ] All 4 tools demoed live in Claude Code (MCP connected)

## Desync Optimization V1 — DONE (read-path advisory + measurement)

Plan: `docs/desync-optimization-v1-plan.md`. V2 design target with full
Codex-review history: `docs/desync-optimization.md`. V1 introduces zero
new mutating capabilities.

- [x] A1 — `tests/bench_drift.py` benchmark harness; `test-results/bench/drift_baseline.json` artifact
- [x] A2-light — `handlers/sync_middleware.repo_write_barrier(ctx)` (per-repo `asyncio.Lock`); `handle_bind` wrapped
- [x] A3 — `contracts.SyncMetrics` + `sync_metrics` field on Search/Preflight/History/Bind responses
- [x] B1 — `ledger/ast_diff.is_cosmetic_change(before, after, lang)` (strict tree-sitter whitelist, 21 tests)
- [x] B2 — `DriftEntry.cosmetic_hint` advisory + `_enrich_with_cosmetic_hints` helper in `handle_detect_drift`
- [x] D1 — `original_lines` field on `symbol_disappeared` grounding checks; **`_build_verification_instruction` split** so relocation cases never get the unsafe `bicameral.bind` CTA
- [x] F1 — `tests/test_desync_scenarios.py` canonical 13-scenario regression matrix (12 pass + 1 V2 xfail)
- [x] Incidental fix — empty `decision_id` on ungrounded `pending_grounding_checks` (`ledger/adapter.py:475`)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Minor: stale line reference.

ledger/adapter.py:475 no longer points at the empty-decision_id fix; the relevant block is now around L500–509 in the current file. Consider replacing with a symbolic anchor (e.g., "in get_all_decisions ungrounded fan-out") so doc rot doesn't accumulate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PLAN.md` at line 102, The checklist entry references a stale line number for
the incidental fix; update the item to use a symbolic anchor instead of a hard
file:line (e.g., "empty `decision_id` on ungrounded `pending_grounding_checks`
(see `get_all_decisions` ungrounded fan-out in ledger/adapter.py)"), so replace
"ledger/adapter.py:475" with a descriptive location referencing the function
`get_all_decisions` and the ungrounded fan-out block to prevent future doc rot.


---

## Mock → Real Swap Summary
Expand Down
Loading
Loading