release: v0.15.0 — PII archive, hard-delete remove_decision, schema v17→v24 chain by jinhongkuan · Pull Request #388 · BicameralAI/bicameral-mcp

jinhongkuan · 2026-05-16T05:52:02Z

Summary

Cumulative release draining the dev → main backlog accumulated since v0.14.7. Lands:

#221 PII archive (operator-erasable PII surface, content-addressed)
Hard-delete bicameral.remove_decision (breaking — see below; closes decision:i4wafafzowm3ai5eyhgs)
Schema chain v17 → v24 (all additive, non-destructive)
#87 ("Preflight: broaden dedup cache key to include file_paths + ledger revision (M7a/b/c)"): constant-time revision counter + preflight dedup telemetry
#278 ("[v0-productization §4] Dashboard scope expansion — source view, decisions↔sources, remove flows, SurrealQL admin"), Phases 1+2+3+4: dashboard source view, remove flows, raw SurrealQL admin panel, dogfood instrumentation
#243 ("feat(preflight): eliminate silent graph-expansion fallbacks (#173 follow-up)") + #380 ("perf(server): move code-locator index init off the MCP stdio handshake path"): code-locator singleton + eager init off the MCP stdio handshake
#344 ("fix(ingest): auto-capture unreliable for planning/brainstorming workflows — SessionEnd hook too narrowly scoped"), #279 ("[v0-productization §3] Pull-based meeting ingestion — sync-and-brief CLI + SessionStart hook + source pullers"): LocalDirectorySourceAdapter, sync-and-brief team-mode
#340 ("fix(ledger): decisions ingested without decision_level — no auto-classification pipeline"): auto-classify decision_level on ingest
#224 — operator-configurable query timeout
#215 — MCP transport trust-boundary declaration
Triage releases v0.14.5–v0.14.7 carried forward
Three new architectural decisions ratified for doctrine PR follow-up (expand-only schema, flag-gating, §10.5.1 triage amendment)

Full breakdown in CHANGELOG.md's ## v0.15.0 section.

Linked issues

This release closes/refs (non-exhaustive — see commit Closes #N / Fixes #N keywords in the full log for the canonical set, all of which auto-close on merge):

Closes #87, BicameralAI/bicameral-daemon#36, #209, BicameralAI/bicameral-daemon#9, BicameralAI/bicameral-daemon#23, BicameralAI/bicameral-daemon#22, #243, #272, #278, #279, #280, #281, #288, #301, #308, BicameralAI/bicameral-daemon#4, #332, #334, BicameralAI/bicameral-daemon#32, BicameralAI/bicameral-daemon#31, #340, #341, #342, #343, #344, #358, #362, #364, #380, #386
Refs BicameralAI/bicameral-daemon#37, #232, #357 (subtasks of the test-infrastructure track land here; parent stays open)
Refs BicameralAI/bicameral-daemon#2 (Ledger Locator RFC landed; full implementation deferred to v0.16.x)

Linked decisions

Closes decision:i4wafafzowm3ai5eyhgs — Default bicameral.remove_decision to hard delete; eliminate soft-delete tombstone state. Implementation in PR #386 (merged to dev).
Refs decision:cp25jfz1nt6h3u2gjzmu — Schema migrations must be expand-only (doctrine; companion PR amends DEV_CYCLE.md prospectively).
Refs decision:adklplvfhthkdch05pe9 — New-schema-dependent code must be feature-flag gated (doctrine; companion PR).
Refs decision:0ok1249n2tdrfud2a5j9 — DEV_CYCLE.md §10.5.1 (triage eligibility) amendment (doctrine; companion PR).

Plan / Audit / Seal

Plan: the dev → main release pattern in DEV_CYCLE.md §4.1 (release PR) and §6 (release cycle). 154 commits from origin/main..origin/dev (24 fix / 26 feat / 50 merge / remainder docs+chore+test+style).
Audit: schema migration chain v17 → v24 reviewed — every step is additive (new DEFINEs, new fields with defaults, new EVENTs). No REMOVE / DROP operations in the migration path. The idx_input_span_dedup change (v24) uses OVERWRITE and extends the field set, which is monotonically weaker than the prior index — every row valid before is still valid after.
Risk: L2 — schema-touching, tool-contract change (bicameral.remove_decision), but the migration path has been exercised end-to-end against the prod ledger (the schema migration that fixes the dashboard /history 500 was applied to my local prod DB during testing, verified working).
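The audit claim about idx_input_span_dedup can be illustrated with a small sketch. It assumes the index is a uniqueness constraint and that the pre-v24 key was (source_type, source_ref, text); only source_type, source_ref and archive_key are named in this PR, so the `text` column and the sample rows are hypothetical.

```python
from collections import Counter

# Assumed key sets: only source_type, source_ref and archive_key are named
# in the PR text; including `text` in the old key is a guess.
OLD_KEY = ("source_type", "source_ref", "text")
NEW_KEY = OLD_KEY + ("archive_key",)


def colliding_keys(rows, key_fields):
    """Return the key tuples shared by more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical rows: two archived spans (empty text) plus one plain span,
# all in the same (source_type, source_ref) bucket.
rows = [
    {"source_type": "transcript", "source_ref": "mtg-1", "text": "", "archive_key": "k1"},
    {"source_type": "transcript", "source_ref": "mtg-1", "text": "", "archive_key": "k2"},
    {"source_type": "transcript", "source_ref": "mtg-1", "text": "ship friday", "archive_key": ""},
]

old_collisions = colliding_keys(rows, OLD_KEY)  # the two archived rows clash
new_collisions = colliding_keys(rows, NEW_KEY)  # archive_key separates them
```

Under these assumptions, any set of rows that is collision-free on the narrower key stays collision-free on the wider one, because adding a column can only split key groups, never merge them.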

Breaking changes (operator-facing)

bicameral.remove_decision response shape changed. Dropped signoff and projected_status. Added event_logged, removed_at, previous_state, reason. The decision row + all references are now physically removed instead of flipped to signoff.state="removed". Callers consuming the response should check the new top-level fields. Idempotent on missing decisions (was_new=False, no raise) — the matching event in the journal is the canonical record of any prior removal.
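For callers adapting to the new shape, a minimal sketch of a response check might look like the following. The field names come from the paragraph above; the helper name, the value types and the idea that `was_new` is a top-level key are assumptions.

```python
# Field names are taken from the breaking-change note; everything else
# (helper name, return values, payload types) is hypothetical.
DROPPED_FIELDS = ("signoff", "projected_status")
ADDED_FIELDS = ("event_logged", "removed_at", "previous_state", "reason")


def summarize_removal(response):
    """Classify a remove_decision response as 'noop' or 'removed'."""
    legacy = [name for name in DROPPED_FIELDS if name in response]
    if legacy:
        raise ValueError(f"pre-v0.15.0 response shape, found: {legacy}")
    # Assumed: a missing decision reports was_new=False and does not raise.
    if not response.get("was_new", True):
        return "noop"
    missing = [name for name in ADDED_FIELDS if name not in response]
    if missing:
        raise ValueError(f"missing v0.15.0 fields: {missing}")
    return "removed"
```

A caller that previously read `signoff.state` would switch to the return value here, or to `previous_state` and `removed_at` directly.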

Schema migrations

Auto-applied on first connect. Non-destructive. Operators upgrading from v0.14.x see one-time migration log entries; no data loss.

| Version | Migration | Source |
| --- | --- | --- |
| v17 → v18 | `decision.updated_at` + `idx_decision_updated_at` | #87 precondition |
| v18 → v19 | `bicameral_meta.decision_revision` + `DEFINE EVENT decision_revision_bump` | #87 Phase 6 |
| v19 → v20 | PII archive schema slot (`input_span.archive_key`) | BicameralAI/bicameral-daemon#23 Phase A |
| v20 → v21 | (PII archive metadata field) | BicameralAI/bicameral-daemon#23 Phase A |
| v21 → v22 | ASSERT `text != '' OR archive_key != ''` on `input_span.text` | BicameralAI/bicameral-daemon#23 Phase B-1 |
| v22 → v23 | Backfill `decision.decision_level` for legacy rows | #340 prereq |
| v23 → v24 | `idx_input_span_dedup` extended with `archive_key` | dashboard `/history` collision fix |

Test plan

Per-feature suites run green throughout dev; see merged PRs, among others:
- #371: v22→v23 migration: backfill decision_level for legacy decisions
- #356: feat(pii-archive): #221 Phase B-1 — ingest cutover + read-path centralization (NOT closure)
- #322: perf(skill): add tier-2 semantic relevance gate to preflight (#300)
- #323: feat(timeout): #224 ledger-query timeout with Claude-hooks context surfacing
- #324: docs(security): #215 Track 1 — declare MCP-transport trust boundary (SOC2-01)
- #294: feat(preflight): eliminate silent graph-expansion fallbacks (#243)
- #299: docs(README): demo video section + relocate star CTA mid-doc
- #347: feat(sources): #344 LocalDirectorySourceAdapter — capture decisions beyond the IDE
- #345: ci: add workflow_dispatch trigger to schema persistence tests
- #361: infra(pre-commit): #357 sub-task 3 — local ruff enforcement at commit time
- #363: fix(e2e): #362 — reclassify Flow 3 'no cc rows + no verdicts' as advisory
- #364: infra(symlinks): #357 sub-task 4 — Windows symlink materialization gate
- #365: test(preflight): #357 backfill — de-mock test_preflight_dedup_v2.py + decrement trap cap 8→5
- #381: fix(setup): auto-detect nightly channel from .dev install version
- #382: chore(nightly): bump RECOMMENDED_NIGHTLY_VERSION to 2026.5.16.dev024452
- #383: fix(schema): backfill bicameral_meta.decision_revision in v18→v19 + v22→v23
- #385: fix(server): move code-locator init off MCP stdio handshake (#380)
- #386: feat(remove_decision): hard-delete by default + v24 input_span dedup index
SURREAL_URL=memory:// pytest tests/test_phase2_ledger.py tests/test_phase3_integration.py tests/test_remove_decision.py tests/test_input_span_safe_upsert.py tests/test_remove_source.py tests/test_dogfood_label_propagation.py tests/test_pii_archive_schema_migration_b1.py tests/test_history_erasure_propagation.py tests/test_schema_recoverable_errors.py -q — green on a current dev checkout.
Manual end-to-end against the live MCP server: hard-deleted three lifecycle-test tombstones; dashboard /history returns 200 with the full 15-decision ledger.
Schema migration applied against a real persistent surrealkv:// ledger (the author's ~/.bicameral/ledger.db) and verified the v24 index distinguishes two archive-keyed rows in the same (source_type, source_ref) bucket.
Tier 2 CI gates (DEV_CYCLE.md §4.5.2) — gates run on this PR; release blocked if any hard-gate fails.
Manual smoke after merge — pipx install bicameral-mcp==0.15.0, run bicameral-mcp setup, ingest a sample transcript, observe the dashboard.

Post-merge tasks

Tag v0.15.0 and push the tag.
Run pipx upgrade bicameral-mcp on design-partner machines (or wait for their next bicameral.update).
Land the doctrine-amendment PR onto dev (expand-only / flag-gate / §10.5.1 rewrite) so v0.15.1+ ships under the new rule.

🤖 Generated with Claude Code

…st (#192)

Single env var now owns the entire telemetry-flag namespace. Three accepted forms: bool (`0`/`off`/`false`/`no` → all off; `1`/`on`/`true`/`yes` → relay only), csv (`relay,preflight,raw`), and unset (default → relay only).

New `telemetry_flags.py` module owns parsing; `consent.telemetry_allowed()` and `preflight_telemetry.{telemetry_enabled, raw_capture_enabled}` delegate to a frozen `TelemetryFlags` cached once per process.

Backwards-compat preserved on three axes:

1. Legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1` and `BICAMERAL_PREFLIGHT_TELEMETRY_RAW=1` continue to work as additive overlays — first read of either emits a one-line stderr deprecation warning per process. Removed in v1.x.
2. `BICAMERAL_TELEMETRY=1` semantics unchanged (relay only — does NOT auto-enable preflight).
3. Non-canonical truthy values (`enabled`, `t`, `active`, etc. — used in pre-#192 deployments) map to relay-only with a stderr warning pointing at the canonical form. Caught by Codex review as a P2 finding; preserves the pre-#192 contract that any non-OFF value enabled relay.

Semantics:

- CSV form is explicit — what's listed is on, what's not is off (so `BICAMERAL_TELEMETRY=preflight,raw` turns OFF the default-on relay, documented in the setup wizard).
- `raw` always implies `preflight` (raw is a mode of preflight events; defensive double-check in `raw_capture_enabled()`).
- Process-cached parsing via `lru_cache`; tests use `_reset_for_tests()` via an autouse fixture in `tests/conftest.py` so monkeypatched env vars take effect cross-test.

35 fixtures in `tests/test_telemetry_flags.py` cover all forms + integration with the existing call sites + the legacy-truthy preservation case. 87/87 green across all 7 telemetry-touching test files (including 52 regression tests for #39 / #101 / #112 behaviors).

Closes #192. Unblocks #65 phase 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
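The parsing rules in the #192 commit message can be sketched roughly as below. This is not the contents of `telemetry_flags.py`: the function name, the handling of empty strings and of csv lists containing unknown tokens are assumptions, and the legacy-variable overlay and stderr warnings are left out.

```python
from dataclasses import dataclass

_OFF = {"0", "off", "false", "no"}
_ON = {"1", "on", "true", "yes"}
_KNOWN = {"relay", "preflight", "raw"}


@dataclass(frozen=True)
class TelemetryFlags:
    relay: bool
    preflight: bool
    raw: bool


def parse_flags(value):
    """Parse a BICAMERAL_TELEMETRY-style value; None means the var is unset."""
    if value is None:
        return TelemetryFlags(relay=True, preflight=False, raw=False)
    text = value.strip().lower()
    if text in _OFF:
        return TelemetryFlags(relay=False, preflight=False, raw=False)
    if text in _ON:
        return TelemetryFlags(relay=True, preflight=False, raw=False)
    tokens = {t.strip() for t in text.split(",") if t.strip()}
    if tokens and tokens <= _KNOWN:
        raw = "raw" in tokens
        # csv is explicit: listed flags are on, the rest off; raw implies preflight.
        return TelemetryFlags(
            relay="relay" in tokens,
            preflight=raw or "preflight" in tokens,
            raw=raw,
        )
    # Assumed fallback: any other non-OFF value behaves as relay-only.
    return TelemetryFlags(relay=True, preflight=False, raw=False)
```

Note that `parse_flags("preflight,raw")` turns relay off, which matches the "what's listed is on" rule stated in the commit message.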

…METRY (#192 follow-up)

Six doc references still spelled the legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1` shape after #192 consolidated the flag namespace. Updated each to lead with the canonical csv form (`BICAMERAL_TELEMETRY=preflight` / `=preflight,raw`) and note the legacy var is still honored via the deprecation overlay:

- server.py — preflight_id schema description (agent-visible)
- contracts.py — preflight_id field comment
- preflight_telemetry.py — module docstring (Default mode + Raw mode lines)
- handlers/record_bypass.py — module docstring (telemetry_disabled reason)
- skills/bicameral-preflight/SKILL.md — bypass-write contract (agent-visible)
- docs/semantic-drift-governance.md — record_bypass return-value spec

No behavior change. Tests unchanged: 66/66 green across telemetry_flags, consent_notice, preflight_telemetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves #250 base-branch drift after dev advanced 50 commits since the PR was opened. Two conflicts:

1. handlers/record_bypass.py (modify/delete) — dev's #244 v1 revert (commit d1e3914) deleted the entire HITL bypass + decision_level surface from v0 scope. My #192 follow-up touched its module docstring; resolution is to accept the deletion (the doc patch is moot once the file is gone).
2. skills/bicameral-preflight/SKILL.md (content) — same #244 revert deleted the §5.4-bypass-semantics block I patched for canonical env-var phrasing. Accepted dev's deletion of the block; the remaining §5.4 telemetry-attribution + §5.5 confirm-finding sections are untouched and still carry the canonical `BICAMERAL_TELEMETRY=preflight` form via the merged v1 of §5.4-telemetry-note.

The other four files I patched in the doc-followup commit (server.py, contracts.py, preflight_telemetry.py, docs/semantic-drift-governance.md) auto-merged cleanly. My canonical `BICAMERAL_TELEMETRY=preflight` references survive verbatim.

Telemetry tests post-merge: 66/66 green (test_telemetry_flags + test_consent_notice + test_preflight_telemetry).

Note: docs/semantic-drift-governance.md still describes record_bypass return values that no longer have a handler. dev kept the file unchanged through the v1 revert; whether the governance lifecycle doc should be deleted, marked v1-deferred, or kept as forward-looking architecture is a separate triage call (not in #250 scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ne (#280)

PR #285's first CI run produced a clean baseline:

23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000 ✓ all gates pass

That's ~7-13 pp of headroom on every gate (≥ 0.85 / ≥ 0.80 / ≤ 0.30). Locking the baseline in before drift sets in.

Two changes to .github/workflows/test-mcp-regression.yml:

1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero on breach instead of warning to step output.
2. Removed `continue-on-error: true` from the eval step. The step now fails CI when the gate breaches.

The metrics-summary step keeps `continue-on-error: true` so a renderer bug never masks the eval result — and the `always()` guard means the breach summary is still rendered inline when the eval fails.

After this lands, PRs that touch the bind handler / bind skill / fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 / abort_rate ≤ 0.30, OR deliberately re-record the cache by setting BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change.

Aligns with Jin's "deliberate not drift" framing — same path the M1 eval *should* have taken (M1 has been warn-only forever; M2 is being flipped while the baseline is fresh, days after the eval shipped).

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
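The difference between warn and hard gate modes described in this commit can be sketched as follows. The thresholds are the ones quoted above; the function names and the metrics dictionary layout are hypothetical and are not taken from the eval runner.

```python
# Thresholds quoted in the commit message; the data layout is an assumption.
GATES = {
    "precision": (">=", 0.85),
    "recall": (">=", 0.80),
    "abort_rate": ("<=", 0.30),
}


def breached_gates(metrics):
    """Return the names of gates that the given metrics fail."""
    breaches = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            breaches.append(name)
    return breaches


def exit_code(metrics, gate_mode):
    """Hard mode fails the run on any breach; warn mode always exits 0."""
    return 1 if breached_gates(metrics) and gate_mode == "hard" else 0


baseline = {"precision": 0.913, "recall": 0.913, "abort_rate": 0.0}
regressed = {"precision": 0.913, "recall": 0.75, "abort_rate": 0.0}
```

With these inputs the baseline passes in both modes, while the regressed run only fails CI once the mode is `hard`.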

…uts don't fail M2 hard-gate (#288)

The hard-gate flip (commit 6605f24) surfaced an existing flakiness on the first CI run: a single httpx.ReadTimeout on one of the up-to-8 tool-use turns crashed the whole eval run, failing the MCP regression suite. Previously masked by `--gate-mode warn` + `continue-on-error: true`, both removed by the gate-flip.

Three surgical fixes:

1. tests/eval/_bind_judge.py — _call_messages_api now retries 3× with exponential backoff (2s/8s/32s) on:
   - httpx.TimeoutException (read/connect/pool)
   - httpx.NetworkError, httpx.RemoteProtocolError
   - HTTP 429 (rate limit) + 5xx (server-side transient)

   After exhausting retries, raises RuntimeError with a bounded message. Terminal 4xx (auth, malformed payload) still fails fast — those aren't transient.
2. tests/eval_grounding_recall.py — per-case catch broadened from `except RuntimeError` to `except Exception`, and a single failing case now records an `eval_error` outcome row instead of crashing the whole eval. Aggregate gate is still applied: if N cases err hard enough that recall < 0.80 across 23 cases, the eval fails CI correctly. With our 0.913 baseline, ~5 cases would have to err before the gate breaches.
3. tests/eval_grounding_recall_summary.py — eval_error added to the outcome-breakdown table; missed-cases list surfaces the error msg inline (rather than rendering "—::—" for the absent binding).

Local verification:

- retry loop smoke-tested: 3× ReadTimeout → bounded RuntimeError; 503/503/200 → recovers and returns the 200 response.
- ruff check + format + mypy all green.
- test_m2_grounding_log + test_bind_m2_telemetry: 11 passed, 3 skipped.

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
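The retry policy from fix 1 can be sketched without httpx as below. The two exception classes are stand-ins for the httpx error types and terminal 4xx responses, `send` is a hypothetical callable returning `(status, body)`, and the sketch assumes one initial attempt followed by three delayed retries, which the commit message does not state explicitly.

```python
import time


class TransientError(Exception):
    """Stand-in for timeout and network errors (httpx types in the source)."""


class TerminalError(Exception):
    """Stand-in for non-retryable 4xx responses."""


def call_with_retry(send, delays=(2.0, 8.0, 32.0), sleep=time.sleep):
    """Call send() until it succeeds, retrying transient failures with backoff."""
    last_problem = "no attempt made"
    for attempt in range(len(delays) + 1):
        try:
            status, body = send()
        except TransientError as exc:
            last_problem = f"{type(exc).__name__}: {exc}"
        else:
            if status == 429 or status >= 500:
                last_problem = f"HTTP {status}"  # transient, retry
            elif status >= 400:
                raise TerminalError(f"HTTP {status}")  # fail fast
            else:
                return body
        if attempt < len(delays):
            sleep(delays[attempt])
    # Bound the message so a huge error body cannot flood the log.
    raise RuntimeError(f"gave up after {len(delays) + 1} attempts: {last_problem[:200]}")
```

Passing `sleep` as a parameter keeps the loop testable without real waits, for example `call_with_retry(fn, sleep=recorded.append)`.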

ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280)

…iscussion (#280)

Jin's PR-#288 followup: aggregate metrics tell us *whether* the agent is grounding well; categorized failure modes tell PMs *which kinds of decisions* it struggles with. New deterministic post-hoc classifier runs over the existing per-case rows; renderer adds a "Failure modes" section to the GitHub step summary ranked by count, with up to 2 example cases per category and a documented PM-actionable next step.

Categories
----------

- correct — agent got it right, no action
- wrong_module — same-name disambiguation failed
- wrong_intent — similar-intent miss
- cross_language_confusion — Python ↔ TS runtime mistake
- wrong_symbol_in_right_file — sub-region disambiguation gap
- hallucinated_symbol — handler reject path caught a fake symbol
- span_mismatch — handler reject path caught hallucinated lines
- aborted_correctly — agent recognized a behavioral / unbindable decision (only meaningful once §B fixture lands; the §B "ungroundable behavioral cases" piece was deferred per Jin's plan recommendation — design partners are better authors for those via #280 friction reports)
- aborted_incorrectly — agent over-cautious on a bindable case
- eval_error — infra (API timeout / network)

Each category carries a documented next step (FAILURE_MODE_NEXT_STEPS constant, kept in sync with the renderer's _FAILURE_MODE_HINTS table). PM-readable, not engineering jargon.

Files
-----

tests/eval_grounding_recall.py
  + classify_failure_mode(row) -> str
  + FAILURE_MODE_NEXT_STEPS dict (PM-actionable next steps)
  + failure_mode field embedded on every per-case row (success + eval_error paths both populate it)

tests/eval_grounding_recall_summary.py
  + _render_failure_modes(rows) helper
  + new "Failure modes (top categories — PM-actionable)" section between the gate-breach line and the existing miss list. Ranked by miss count, eval_error always last (infra noise), capped at top 3 categories with up to 2 example cases each. Examples surface case_id + reasoning/abort_reason/error_msg excerpt (truncated to 110 chars, pipe-escaped for table safety).

tests/test_grounding_failure_mode.py (new)
  13 table-driven tests across all 10 categories + 3 invariants: unknown-outcome falls into 'uncategorized', taxonomy documentation completeness, handler-reject-priority over case_type. Pure unit tests — no API, no ledger.

CHANGELOG.md +2

Cache key unchanged
-------------------

failure_mode is computed at the row level from existing fields; doesn't touch the bind judge's cache key (model | skill | repo | decision). So the existing 0.913/0.913 baseline cache stays valid; CI runs after this PR will hit the cache and produce identical numbers — only the renderer output is enriched.

Local verification
------------------

- 13 passed on tests/test_grounding_failure_mode.py
- 24 passed, 3 skipped across the M2-related test files (test_m2_grounding_log + test_bind_m2_telemetry + test_grounding_failure_mode)
- ruff check + ruff format --check + mypy all green on touched files
- Renderer smoke-tested on a synthetic input with 6 misses across 4 categories — section ranks correctly, examples populate, hints land in the right column

Out of scope (intentionally deferred)
-------------------------------------

§B from the plan: 4-5 deliberately ungroundable behavioral cases (`expected_outcome="abort"`) that materially measure Jin's "behavioral decisions" pattern. Recommended deferral — design partners are better authors for those via #280 friction reports rather than engineering inventing them. Once §B lands, `aborted_correctly` will start firing for real (it can fire today only for rows that carry `expected_outcome="abort"`, which no current row does).

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
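The ranking rule for the "Failure modes" section (by miss count, eval_error last, top 3 categories, up to 2 examples each) might be sketched like this. It is not `_render_failure_modes`: the function name and row layout are hypothetical, and treating `correct` and `aborted_correctly` as non-misses is an assumption.

```python
from collections import Counter

# Assumed: these two categories are not counted as misses.
NON_MISS = {"correct", "aborted_correctly"}


def top_failure_modes(rows, max_categories=3, max_examples=2):
    """Return [(mode, miss_count, example_case_ids)] in display order."""
    misses = [row for row in rows if row["failure_mode"] not in NON_MISS]
    counts = Counter(row["failure_mode"] for row in misses)
    # eval_error sorts after everything else, then higher counts first,
    # then name for a stable tie-break.
    ordered = sorted(counts, key=lambda mode: (mode == "eval_error", -counts[mode], mode))
    result = []
    for mode in ordered[:max_categories]:
        examples = [row["case_id"] for row in misses if row["failure_mode"] == mode]
        result.append((mode, counts[mode], examples[:max_examples]))
    return result
```

Because eval_error is forced to the end before the cap is applied, it only shows up in this sketch when fewer than three other categories have misses.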

feat(eval): M2 failure-mode enumeration for cross-functional design discussion (#280)

refactor(telemetry): consolidate BICAMERAL_TELEMETRY env-var namespace (#192)

PR #174 closed the recall ceiling but introduced two silent fallback paths in `_region_anchored_preflight`: when `ctx.code_graph` was absent OR when the expander raised, the response shape was byte-identical to "expansion ran and matched zero" — caller couldn't tell recall was degraded.

Three additive signals now surface every fallback (per Phase 2 spec posted on #243, all four open questions defaulted to recommended):

1. Response field — `sources_chained` includes `"graph_unavailable"`. Additive (never replaces existing `"region"` / `"graph"` tags). Bare tag — granular reason flows through telemetry, not the response shape, per signoff Q2.
2. Log level — exception case bumped from `logger.debug` → `logger.warning` with stable `[preflight:fallback]` substring + exception type for grep-friendly production logs.
3. Telemetry counter — new `preflight_telemetry.write_fallback_event(reason, session_id)` modeled on `write_ingest_refusal_event` (#216). Emits a `graph_expansion_fallback` row to the existing `~/.bicameral/preflight_events.jsonl` substrate. Reasons are a controlled enum: `"absent"`, `"missing_method"`, `"exception:<type>"`. Gated on `BICAMERAL_TELEMETRY=preflight`.

The fallback case classifier in `_region_anchored_preflight` distinguishes three reasons (was conflated into a single `if expander is not None:` skip in the pre-#243 code):

- `code_graph is None` → "absent"
- `code_graph` set but no `expand_file_paths_via_graph` → "missing_method"
- expander raised → "exception:<type>"

Skill update (`skills/bicameral-preflight/SKILL.md`) renders a one-line recall-degraded note to the agent when the tag is present:

> Note: structural-neighbor lookup was unavailable this call —
> recall may be reduced until the symbol index is rebuilt. Decisions
> bound to files that import these may not have surfaced.

Treats `"graph_unavailable"` as advisory: doesn't block the preflight surface; direct-pin matches are unaffected.

Tests
-----

4 new cases in `tests/test_preflight_graph_expansion.py`:

- test_preflight_fallback_absent_code_graph_tags_graph_unavailable — ctx with code_graph=None → response carries the tag, telemetry counter reason="absent"
- test_preflight_fallback_expander_raises_warns_and_tags — stub expander raises RuntimeError → response carries the tag, `caplog` captures WARN-level log with `[preflight:fallback]` substring, telemetry counter reason="exception:RuntimeError"
- test_preflight_successful_expansion_does_not_tag_graph_unavailable — regression guard: clean expansion path must NOT carry the tag (no false alarms)
- test_preflight_empty_file_paths_does_not_tag_graph_unavailable — empty file_paths short-circuits before expansion check; the "expansion was never attempted" case is distinguishable from "attempted-and-fell-back"

Existing tests use containment assertions (`"region" in sources_chained`) not exact list equality, so additive `"graph_unavailable"` doesn't break them.

What's NOT in this PR
---------------------

Piece B (eager symbol-index initialization at server startup) is the follow-up commit on this branch. Lands separately so the response-shape change can ship without the adapter-lifecycle change. After both pieces land, the telemetry counter shipped here gives ongoing visibility into how often fallback engages in production.

Refs #243 (parent #173 / PR #174). Plan signoff via #243 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
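The three-way fallback classification described in this commit could look roughly like the sketch below. The reason strings and the `expand_file_paths_via_graph` attribute name come from the commit message; the helper names, the expander's call signature and its return type are assumptions.

```python
def expand_or_classify(code_graph, file_paths):
    """Return (expanded_paths, reason); reason is None when expansion ran."""
    if code_graph is None:
        return [], "absent"
    expander = getattr(code_graph, "expand_file_paths_via_graph", None)
    if expander is None:
        return [], "missing_method"
    try:
        # Assumed signature: takes a list of paths, returns an iterable of paths.
        return list(expander(file_paths)), None
    except Exception as exc:
        return [], f"exception:{type(exc).__name__}"


def tag_sources(sources_chained, reason):
    """Append the bare advisory tag without replacing existing tags."""
    if reason is None:
        return list(sources_chained)
    return list(sources_chained) + ["graph_unavailable"]
```

Keeping the granular reason out of `tag_sources` mirrors the stated design: the response carries only the bare tag, while the reason would go to telemetry.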

…Piece B)

Pre-fix, the code-locator adapter had two cooperating problems that made silent fallback the default:

1. `get_code_locator()` returned a FRESH `RealCodeLocatorAdapter` per call. Caching was absent.
2. `_ensure_initialized()` was lazy — first tool call paid the index-build cost AND could race the index check on concurrent dispatch (e.g. preflight + bind landing in parallel after server boot).

Together: every silent fallback in the production runtime was "hot" because the adapter was being rebuilt + rechecked on every call. Piece A (#283 commit 3c9730f) made the fallback loud at the response layer; Piece B closes the upstream cause.

Three changes
-------------

adapters/code_locator.py
- Singleton-by-REPO_PATH cache via `_INSTANCE_CACHE: dict[str, RealCodeLocatorAdapter]`. Path resolved through `Path.resolve()` so symlink + relative-path callers cache-hit consistently. Multi-repo correctness preserved (any test that swaps REPO_PATH mid-process gets a fresh adapter for the new path).
- New `reset_code_locator_cache()` test-only hook, mirroring `adapters.ledger.reset_ledger_singleton`.
- New `async def RealCodeLocatorAdapter.initialize()` — wraps sync `_ensure_initialized()` in `loop.run_in_executor(None, ...)` so the cold-init path doesn't block the event loop. Idempotent on already-initialized adapters.

server.py
- `serve_stdio()` calls `await get_code_locator().initialize()` between the dashboard sidecar start and the consent-notice block.
- **Fail-loud per #243 phase-2 signoff Q3** — explicit `except RuntimeError as exc:` re-raises after printing an actionable stderr message (`"Run: python -m code_locator index <repo>"`). The outer try/finally still runs the `SERVER_SHUTDOWN` audit emit, so operators get a clean event AND a clear actionable error. No more silent degradation.

tests/test_preflight_graph_expansion.py — 4 new tests
- test_get_code_locator_returns_same_instance_per_repo_path (singleton + reset behavior across two REPO_PATHs)
- test_initialize_succeeds_when_index_present (idempotent on already-initialized adapter)
- test_initialize_fails_loudly_when_index_empty (RuntimeError from `_ensure_initialized` propagates through the async wrapper — doesn't get swallowed)
- test_serve_stdio_refuses_boot_on_empty_index (boot-path level: with everything else stubbed healthy, an empty index aborts `serve_stdio()` with the expected RuntimeError)

Local smoke tests
-----------------

- Singleton + reset_code_locator_cache: 4 assertions pass (cache hit on same path, distinct instance on new path, fresh after reset, second call after reset stays cached)
- Async `initialize()`: re-raises RuntimeError on stubbed `_ensure_initialized` failure; idempotent no-op on already-initialized adapter
- ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR
---------------------

Nothing — Piece A (commit 3c9730f) and Piece B (this commit) together close #243's full scope. PR will open with both pieces. Telemetry counter shipped in Piece A gives ongoing production visibility into how often fallback engages post-merge.

Refs #243 (parent #173 / PR #174). Plan signoff via #243 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
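A minimal sketch of the two adapter changes (per-path singleton cache and an async wrapper around the blocking init) is shown below. `SketchLocator` is a stand-in for `RealCodeLocatorAdapter`, and this `get_code_locator` takes the repo path as an argument for self-containment, whereas the commit message shows the real function being called without arguments.

```python
import asyncio
from pathlib import Path


class SketchLocator:
    """Stand-in adapter; the real index build is replaced by a counter."""

    def __init__(self, repo_path):
        self.repo_path = repo_path
        self._initialized = False
        self.init_calls = 0

    def _ensure_initialized(self):
        if self._initialized:
            return
        self.init_calls += 1  # placeholder for the blocking index build
        self._initialized = True

    async def initialize(self):
        """Run the blocking init in the default executor; no-op if already done."""
        if self._initialized:
            return
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._ensure_initialized)


_INSTANCE_CACHE = {}


def get_code_locator(repo_path):
    # Resolve first so relative and symlinked spellings share one entry.
    key = str(Path(repo_path).resolve())
    if key not in _INSTANCE_CACHE:
        _INSTANCE_CACHE[key] = SketchLocator(key)
    return _INSTANCE_CACHE[key]


def reset_code_locator_cache():
    """Test-only hook: drop every cached adapter."""
    _INSTANCE_CACHE.clear()
```

A boot path would then `await get_code_locator(path).initialize()` once, so later tool calls reuse the same already-initialized instance.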

…backs feat(preflight): eliminate silent graph-expansion fallbacks (#243)

Replaces the dashboard image at the bottom of "How It Feels" with a three-beat demo video section (ingest -> preflight -> ratify async) referencing GitHub user-attachments URLs so videos render as inline players. Moves the "Star on GitHub" banner from the top header to a centered placement immediately after the demo, turning it into a post-demo conversion beat instead of a misaligned header element. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…itch Replaces the single-line MCP-server description with a position-take opener: paragraph 1 names the failure mode (agreements emerge mid-flight, never reach a doc); paragraph 2 introduces Bicameral MCP as a spec compliance layer that captures both formal source materials (transcripts, PRDs, Slack) and undiscussed mid-implementation decisions to be ratified async by the product owner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(README): demo video section + relocate star CTA mid-doc

… + decision_id alias fix

The banner tests previously used MagicMock for ctx and AsyncMock for ledger, returning hand-crafted dicts. They stayed green even when get_decisions_by_status silently returned decision_id=None for every row (the SQL selected an undefined field — see ledger.adapter:584).

Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual get_decisions_by_status query. The first sociable run surfaced the latent bug, which is fixed in this commit by aliasing type::string(id) AS decision_id (matches the pattern at queries.py:167, 228, 404, 512).

Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock primitives) are left as-is and now documented inline.

Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference: SimpleNamespace ctx + real adapter for handler/ledger tests, narrow seams only when a collaborator can't be run in tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dleware-tests # Conflicts: # CHANGELOG.md # README.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-tests test(sync_middleware): sociable banner tests + decision_id alias fix

…gate (#58)

Sibling of the M2 grounding-recall eval (#284). Phase A is **measurement only** — no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize."

Per the Phase 2 spec posted on #58 (all four open questions defaulted to recommended):

- Q1 dataset size → 25 cases hand-curated, matches M2's 23
- Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing
- Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic
- Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------

- overall recall: surfaced / (surfaced + missed); gate ≥ 0.70
- per-mode recall: same, sliced by miss_mode; gate ≥ 0.50
- fire rate: response.fired / total; gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them.

Files
-----

tests/fixtures/preflight_m6/dataset.py (412 LOC)
  25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status.

tests/eval/_preflight_m6_seeder.py (231 LOC)
  Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). Reset code-locator + ledger singletons before AND after so the next case starts clean.

tests/eval_preflight_m6_recall.py (274 LOC)
  Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer.

tests/eval_preflight_m6_summary.py (162 LOC)
  Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors.

tests/test_preflight_m6_eval.py (267 LOC)
  16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock — so any added/removed field on the response shape fails the test honestly.

.github/workflows/test-mcp-regression.yml (+31 LOC)
  New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic.

CHANGELOG.md (+2 lines)
  [Unreleased] / Added entry.

Local verification
------------------

- 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.)
- Dataset import + _validate_dataset() pass — 25 cases (8/8/9)
- Runner --help renders cleanly
- Summary renderer smoke-tested on synthetic JSON — per-mode table + missed-case detail render correctly with emoji gates
- ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR (intentionally — Phase B gating)
------------------------------------------------------

- Any runtime change to handle_preflight or _region_anchored_preflight
- Skill changes (no agent-facing contract change in Phase A)
- Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred)
- Semantic search layer (Phase B candidate, deferred)
- LLM reranker (Phase B candidate, deferred)
- Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker.

Refs #58. Plan: plan-58-preflight-decision-detection.md. Phase 2 spec signoff: #58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
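The three measurement axes from the M6 commit message can be sketched as a small aggregator. The formulas follow the text above; the row layout (`miss_mode`, `outcome`, `fired`), the choice to count errored rows in the fire-rate denominator and the 0.0 result for an empty denominator are assumptions, not the eval's actual behavior.

```python
from collections import defaultdict


def _recall(surfaced, missed):
    scored = surfaced + missed  # errors never enter the denominator
    return surfaced / scored if scored else 0.0


def aggregate(rows):
    """rows: dicts with miss_mode, outcome in {surfaced, missed, error}, fired."""
    totals = defaultdict(lambda: {"surfaced": 0, "missed": 0, "error": 0})
    for row in rows:
        totals[row["miss_mode"]][row["outcome"]] += 1
    surfaced = sum(t["surfaced"] for t in totals.values())
    missed = sum(t["missed"] for t in totals.values())
    fired = sum(1 for row in rows if row.get("fired"))
    return {
        "overall_recall": _recall(surfaced, missed),
        "per_mode_recall": {
            mode: _recall(t["surfaced"], t["missed"]) for mode, t in totals.items()
        },
        "errors": sum(t["error"] for t in totals.values()),
        "fire_rate": fired / len(rows) if rows else 0.0,
    }
```

Keeping errors as a separate count means a seeder failure lowers neither recall nor the per-mode numbers, but still shows up for reviewers.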

feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58)

…#58 followup)

PR #304's first CI baseline produced overall recall 0.000 with 14/25 cases erroring. Root cause: the M6 seeder runs 25 cases back-to-back in a single process, and the LLM-08 ingest rate limiter (#216, burst=10 / refill=1.0/s) refuses cases 12+ with `_IngestRefused("rate_limit_exceeded")`. Math: 10 initial tokens + ~1 refill while seeding the first 11 cases = 11 cases through, then 14 cases (U4-U8 + all 9 T*) erred.

The rate limiter is for production agent-loop safety, not eval throughput. There's already a documented env var to disable it (see `handlers.ingest._check_rate_limit` docstring): ``BICAMERAL_INGEST_RATE_LIMIT_DISABLE`` truthy → bucket check is short-circuited. Setting it in the seeder's per-case env setup (saved + restored like `REPO_PATH` and `SURREAL_URL`) is the documented path.

Symptom before this fix (post-#304 CI on dev):

    M6 preflight retrieval recall eval — 25 cases
    overall recall : 0.000   errors: 14
    transitive_relevance : 0/9 surfaced, 9 errors   ← all rate-limited
    unbound_decision     : 0/8 surfaced, 5 errors   ← last 5 rate-limited
    vocabulary_mismatch  : 0/8 surfaced, 0 errors   ← first 8, ran clean

Expected after this fix: vocabulary_mismatch stays 0/8 surfaced (that's the honest BM25-can't-bridge-vocab baseline the eval was designed to surface). transitive_relevance + unbound_decision should produce non-zero recall once the seeder doesn't trip the rate limiter.

Belt-and-suspenders alternatives considered:
- clear the `_RATE_LIMIT_REGISTRY` dict between cases: works but reaches into private state and skips the env-var contract
- sleep between cases to allow refill: works but slow + hides the fact that the rate limiter isn't appropriate for evals
- lower burst/refill via `.bicameral/config.yaml` in the synthetic repo: works but requires every Phase B eval surface to re-author the same config

The env-var path is the documented API and one line.

Smoke verification
------------------
- 16/16 sociable unit tests pass on the classifier + aggregator
- ruff check + format + mypy all green on the touched file

Refs #58 (Phase A baseline). Followup to PR #304.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
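
The "11 cases through, 14 refused" arithmetic above can be reproduced with a small token-bucket simulation. This is a sketch under stated assumptions, not the project's limiter: `TokenBucket`, `simulate`, `rate_limit_disabled`, and the 0.125 s cost per seeded case are hypothetical. Only burst=10, refill=1.0/s, and the `BICAMERAL_INGEST_RATE_LIMIT_DISABLE` name come from the commit message.

```python
import os
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical stand-in for the ingest limiter (burst=10, refill=1.0/s)."""

    burst: float = 10.0
    refill_per_s: float = 1.0
    tokens: float = 10.0
    last: float = 0.0

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(cases: int, accepted_cost_s: float) -> tuple:
    """Return (accepted, refused) for cases seeded back-to-back."""
    bucket = TokenBucket()
    now = 0.0
    accepted = 0
    for _ in range(cases):
        if bucket.try_acquire(now):
            accepted += 1
            # Assumption: a seeded case takes wall time, a refusal returns at once.
            now += accepted_cost_s
    return accepted, cases - accepted


@contextmanager
def rate_limit_disabled():
    """Set the documented env var for one case, then restore the prior value."""
    key = "BICAMERAL_INGEST_RATE_LIMIT_DISABLE"
    saved = os.environ.get(key)
    os.environ[key] = "1"
    try:
        yield
    finally:
        if saved is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved


if __name__ == "__main__":
    print(simulate(25, 0.125))  # → (11, 14)
```

With these assumed timings the bucket admits the first 11 cases and refuses the remaining 14, matching the baseline run. The context manager mirrors the save-and-restore handling the message describes for `REPO_PATH` and `SURREAL_URL`.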

Replaces the existing assets/bicameral-hero.png with a new visual that illustrates the product as a double-entry ledger for AI-assisted product development — PM and Dev agents each running a Bicameral MCP server, both synced through a shared Team Ledger, with a live Decision Ledger panel showing mixed signoff/code states (including ratified-but-not-reflected, reflected-but-not-ratified, and drifted rows) and the three core pillars (decisions first-class, two-sided ledger, escalation over recommendation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ks to canonical skills/ The .claude/skills/bicameral-*/SKILL.md files were tracked duplicates of skills/bicameral-*/SKILL.md that drifted independently. PRs frequently touched skills/ but not the mirror, so the mirror lagged ~3 feature commits (3c9730f, d1e3914, 79b872b) and 2+ weeks behind canonical. Beyond stale duplicates: the drift was bidirectional. 7 skills existed only in canonical (.claude/skills/-missing → never resolved as slash commands) and 7 only in the mirror (no canonical source → became de-facto canonical despite CLAUDE.md saying otherwise). claude-mem auto-writes into CLAUDE.md files also drifted (ingest and preflight CLAUDE.md had different "Recent Activity" entries between the two paths). This change: 1. Canonicalizes the 7 mirror-only skills via git mv into skills/ (bicameral-{brief, context-sentry, doctor, guided, scan-branch, search, status}). 2. Replaces every .claude/skills/bicameral-X with a symlink to ../../skills/bicameral-X (22 symlinks total). Claude Code's slash-command resolver follows the symlinks transparently — confirmed in-vivo during implementation when the resolver re-indexed and surfaced all 22 skills after the swap. 3. Repoints tests/CI/docs at canonical skills/ paths (tests/_extract_headless.py SKILL_MD_PATH; tests/regen_extraction_fixtures.py docstring; tests/eval_decision_relevance.py docstring; tests/e2e/README.md; .github/workflows/test-mcp-regression.yml comment; README.md slash-command row; docs/DEV_CYCLE.md canonical-source note; docs/v2-desync-optimization-guide.md doctor SKILL.md references). 4. Updates CLAUDE.md to describe the symlink layout (drop "stale duplicates slated for deletion" wording) and adds a Windows note: contributors on Windows must set core.symlinks=true (or use WSL) so the mode-120000 entries materialize as symlinks rather than text files containing the target path. 5. Ticks off TODO.md:169 — the unresolved decision is now made. Refs: TODO.md:169 (now ticked), CLAUDE.md "Canonical Skill Source". 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore(skills): replace .claude/skills/bicameral-* mirrors with symlinks to canonical skills/

…precondition) Phase 4 of #87 broadens the preflight dedup cache key from `topic` alone to `(topic_norm, file_paths_hash, ledger_revision)` so a same-topic call within the 5-min window correctly invalidates when underlying ledger state changed (M7a/b/c xfailed cases). `ledger_revision` derives from `MAX(updated_at)` over the decision table — this PR is the schema half of that contract; the handler-side broadening lands as a follow-up on branch `87-preflight-dedup-key`. Per Kevin's signoff (B2 approach, gh issue #87 comment thread): additive schema bump, L1 risk, no tool contract change, falls back gracefully. The field is `option<datetime>` rather than non-optional `datetime` because DEFINE FIELD against existing rows leaves them as NONE until the migration backfill runs — same precedent as v8→v9 (`decision_level` is `option<string>` for identical reasons). Phase 4's MAX query can COALESCE(updated_at, created_at) if it wants strict-non-NULL semantics. Schema changes (ledger/schema.py): - SCHEMA_VERSION 17 → 18 + compatibility-map entry - DEFINE FIELD updated_at ON decision TYPE option<datetime> DEFAULT time::now() - DEFINE INDEX idx_decision_updated_at ON decision FIELDS updated_at - _migrate_v17_to_v18: idempotent DEFINE + backfill UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE Call-site audits (7 UPDATEs now carry `, updated_at = time::now()`): - ledger/queries.py:602 upsert_decision canonical-dedup UPDATE path - ledger/queries.py:1072 update_decision_status - ledger/queries.py:1163 update_decision_level - ledger/adapter.py:1394 apply_ratify (signoff write) - ledger/adapter.py:1428 apply_supersede (old decision signoff-freeze) - handlers/resolve_collision.py:99 link_parent (cross-level parent link) - handlers/resolve_collision.py:128 collision_pending clear (proposed signoff) CREATE in queries.py:638 needs no edit — the DEFAULT picks up time::now() on INSERT automatically. 
Tests (tests/test_v18_decision_updated_at.py, 11 tests, all passing): - Schema version advanced to v18 - CREATE populates updated_at via DEFAULT - Each of the 7 UPDATE call sites bumps updated_at (one test each) - Index supports ORDER BY updated_at DESC - Migration backfill: pre-v18 rows with NONE → created_at Sociable substrate over memory:// per CLAUDE.md guidance — real SurrealDBLedgerAdapter + real LedgerClient, no mocks. The drift this guards against is the kind solitary tests miss: a mock would happily return whatever updated_at the test expects; only a real ledger UPDATE proves the SQL actually carries the new column. Regression check passes: tests/test_v15_migration.py, test_schema_persistence.py, test_schema_recoverable_errors.py, test_sync_middleware.py, test_codegenome_continuity_service.py, test_compliance_check_schema.py, test_ledger_bicameral_meta_migration.py — 50/50 pass. The single test_alpha_flow.py failure (test_code_edit_without_rebind_marks_drifted) reproduces on origin/dev without this PR's changes — pre-existing, not introduced here. Refs #87 (Phase 4 precondition per spec signoff). Out of scope: dedup key broadening itself (#87 Phase 4), telemetry (#87 Phase 5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
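
The broadened key `(topic_norm, file_paths_hash, ledger_revision)` might be built roughly as follows. This is an illustrative sketch only: `DedupKey`, `build_dedup_key`, the casefold-and-collapse normalization, and the SHA-256 over sorted paths are assumptions, since the commit message names the three components but not how they are computed.

```python
import hashlib
from typing import Iterable, NamedTuple


class DedupKey(NamedTuple):
    topic_norm: str
    file_paths_hash: str
    ledger_revision: str


def build_dedup_key(topic: str, file_paths: Iterable[str], ledger_revision: str) -> DedupKey:
    # Assumed normalization: case-insensitive, whitespace-collapsed topic.
    topic_norm = " ".join(topic.casefold().split())
    # Assumed hashing: order-independent digest over the distinct paths.
    joined = "\n".join(sorted(set(file_paths)))
    file_paths_hash = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    # ledger_revision is treated as an opaque marker (e.g. a MAX(updated_at) string).
    return DedupKey(topic_norm, file_paths_hash, ledger_revision)


if __name__ == "__main__":
    before = build_dedup_key("Auth  Flow", ["b.py", "a.py"], "2026-05-15T10:00:00Z")
    after = build_dedup_key("auth flow", ["a.py", "b.py"], "2026-05-15T10:03:00Z")
    # Same topic and files, but the ledger moved on, so the cache entry misses.
    print(before == after)  # → False
```

Because the revision is part of the key, a same-topic call inside the 5-minute window stops hitting the cache as soon as any decision row is updated.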

…d_at (#308 CI fix)

CI surfaced two issues on PR #308:

1. Ruff I001: tests/test_v18_decision_updated_at.py import block was not alphabetically sorted within the `from ledger.queries import (...)` group. Auto-fixed.
2. tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span] blew up on the v17→v18 backfill:

       SurrealDB rejected query: Found NONE for field `created_at`, with record `decision:dec_1`, but expected a datetime
       SQL: UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE

The v3 fixture creates `decision:dec_1` via raw CREATE without setting `created_at`. Once init_schema applies `DEFINE FIELD created_at ON decision TYPE datetime`, ANY UPDATE on that row re-validates the row and trips the type assertion, even one that doesn't touch created_at. The earlier draft used `SET updated_at = created_at` which read the corrupt field directly; even after switching to time::now() in the SET clause, the implicit re-validation on UPDATE still failed.

## Fix

Switch the backfill from a single bulk UPDATE to a per-row loop with try/except, mirroring `_clean_yields_legacy_rows` (which uses the same tolerance pattern for v3-era stale yields edges):

```python
ids = await client.query("SELECT id FROM decision WHERE updated_at IS NONE")
for row in ids:
    try:
        await client.execute(f"UPDATE {row['id']} SET updated_at = time::now()")
        healed += 1
    except Exception:
        skipped += 1  # row has other corrupt non-optional fields
        logger.warning(...)
```

Rows that fail stay with `updated_at=NONE` and MAX(updated_at) skips them. Harmless for the dedup-cache marker (#87) since the marker only needs monotonicity, not coverage: the new decisions created post-v18 get DEFAULT time::now() and dominate MAX(). The SELECT itself reads only `id`, so it doesn't trip the type assertion on `created_at`. The WHERE clause on `updated_at IS NONE` is safe because `updated_at` is `option<datetime>` (intentionally optional, same precedent as v8→v9 `decision_level`).

## Files

- ledger/schema.py: _migrate_v17_to_v18: per-row UPDATE with try/except; emits healed/skipped counts to the logger
- tests/test_v18_decision_updated_at.py
  - Import sort fix (ruff I001)
  - test_v18_migration_backfills_legacy_rows_with_none_updated_at: call _migrate_v17_to_v18 directly instead of inlining the (now multi-statement) backfill body
  - test_v18_migration_backfill_tolerates_legacy_rows_with_none_created_at (NEW): inspects the migration source to guard against future drafts that reintroduce a created_at reference in the SET clause

## Verification

- tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span] PASS
- tests/test_v18_decision_updated_at.py: 13/13 PASS (12 originals + 1 new regression guard)
- 94/94 in the broader schema/migration/dedup cluster
- `python3 -m ruff check`: clean on all touched files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's ruff job runs BOTH `ruff check` AND `ruff format --check`. The former was clean after the import-sort fix, but the latter flagged ledger/schema.py and tests/test_v18_decision_updated_at.py for reformatting. Applied `ruff format` in place — pure whitespace / line-length normalization, no semantic change. Verified: `ruff format --check` clean on both files locally; 14/14 v18 + legacy-fixture tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nnel-autodetect fix(setup): auto-detect nightly channel from .dev install version

Surfaces the setup-wizard nightly-channel auto-detect fix (PR #381) to design partners. Without it, anyone who installed via `pipx install --pip-args=--pre bicameral-mcp` ran `bicameral-mcp setup` into a config hardcoded to `channel: stable`, so `bicameral.update` silently never offered the nightly upgrade path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…026.5.16.dev024452 chore(nightly): bump RECOMMENDED_NIGHTLY_VERSION to 2026.5.16.dev024452

…22→v23 The v18→v19 migration only seeded the bicameral_meta singleton when the row was absent. When _write_wire_format_sentinel had already written a row (v16's adapter.connect path), the seed branch was skipped and decision_revision stayed NONE because SurrealDB v2's DEFAULT 0 does not backfill existing rows. Every subsequent decision UPDATE then blew up the decision_revision_bump trigger with "Cannot perform addition with 'NONE' and '1'", which _migrate_v22_to_v23's per-row try/except silently swallowed — so the decision_level classification migration "succeeded" while skipping every legacy row. Fix in two places: - _migrate_v18_to_v19: UPDATE existing rows to 0 when the field is NONE (root cause; prevents recurrence for any DB upgrading from <v19). - _migrate_v22_to_v23: same backfill at the top as defense-in-depth so the per-row UPDATEs below land their classifications instead of silently failing. SCHEMA_VERSION stays at 23 — the buggy nightly (dev15124) was only ever downloaded internally, so no forward-fix migration is needed. Tests: - test_migrate_v18_to_v19_backfills_decision_revision_on_preexisting_row: asserts a sentinel row with NONE decision_revision is rescued, and a real decision CREATE bumps the counter (trigger contract intact). - test_v23_classifies_when_decision_revision_was_none: asserts v22→v23 classifies legacy rows when entering with the broken counter state (no silent skips). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a ## Linked decisions section to the §4.3 PR body template, parallel to ## Linked issues, and codifies the rule: every PR authored by a BicameralAI org member references at least one decision:<surrealdb-id> so reviewers can verify the change is grounded in an explicit decision rather than ambient assumption. External contributors are exempt — bicameral access is org-internal, and gating community PRs on internal tooling is the wrong tradeoff. The reviewing maintainer shepherds the decision ingest on the contributor's behalf at merge time. This is a doc-only rule; CI enforcement (lint that an org-member PR body contains a decision:<id> token) can follow as a separate PR if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI ruff format --check caught three files that were lint-clean but not format-clean. No semantic change — line breaks collapse to match the project's max-line-length policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d-decision docs(dev-cycle): require linked bicameral decision on org-member PRs

…-backfill fix(schema): backfill bicameral_meta.decision_revision in v18→v19 + v22→v23

Pre-fix, serve_stdio awaited get_code_locator().initialize() inline before opening the MCP stdio transport. On a 150MB+ symbol-index DB the cold path took ~45s (sqlite-vec open + tree-sitter load + BM25 pickle load), blowing past Claude Code's 30s MCP initialize timeout on real-world repos — the server "started" but the JSON-RPC handshake never landed and the client gave up. Fix: - ``RealCodeLocatorAdapter.initialize_in_background()`` — schedules ``_ensure_initialized`` in the default executor via an asyncio Task, returns immediately. A done-callback prints the bare error to stderr on failure so the operator still sees the actionable "Run: python -m code_locator index <repo_path>" hint that #243 wrote. - ``_ensure_initialized`` now serializes its body via a threading.Lock. Sync callers from worker threads (the ``asyncio.to_thread(ctx.code_graph.<method>, ...)`` pattern every tool handler already uses) block on the lock until the background Task finishes, then see the post-init state and proceed. No callsite needs to know about the background Task. - ``_run_init_body`` extracted from ``_ensure_initialized`` so tests can monkey-patch the slow body without bypassing the lock/state machine — the lock + Task glue is what's under test. - ``wait_until_ready()`` — optional async gate for callers that want to explicitly await readiness from an async context and surface a structured error to the MCP client on failure. - ``server.py:serve_stdio`` — replaces ``await get_code_locator().initialize()`` with ``get_code_locator().initialize_in_background()`` (synchronous, no await). Stderr message rewritten to reflect the new contract. Trade-off: #243's "server refuses to boot when index is empty" becomes "first code-locator tool call fails loudly when index is empty." Operator still sees the failure on stderr at boot via the done-callback. The fail-loud contract from #243 phase-2 signoff Q3 is preserved, just relocated from boot-time to first-tool-call-time. 
Measured: JSON-RPC ``initialize`` reply now lands in ~16ms on this repo's own 150MB code-graph.db (was ~45s). Closes #380 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
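
The lock-plus-background-Task arrangement described above can be sketched in a few lines. This is a toy model, not the real `RealCodeLocatorAdapter`: the class name `BackgroundInitLocator`, the `search` method, the `init_calls` counter, and the 0.2 s sleep standing in for the slow index load are all hypothetical, and only the method names `initialize_in_background`, `_ensure_initialized`, `_run_init_body`, and `wait_until_ready` are taken from the commit message.

```python
import asyncio
import sys
import threading
import time
from typing import Optional


class BackgroundInitLocator:
    def __init__(self, init_seconds: float = 0.2) -> None:
        self._lock = threading.Lock()
        self._ready = False
        self._init_seconds = init_seconds
        self._task: Optional[asyncio.Task] = None
        self.init_calls = 0

    def _run_init_body(self) -> None:
        # Stand-in for the slow cold path (index open, parser load, and so on).
        time.sleep(self._init_seconds)
        self.init_calls += 1

    def _ensure_initialized(self) -> None:
        # The lock serializes init, so late callers block and then see ready state.
        with self._lock:
            if not self._ready:
                self._run_init_body()
                self._ready = True

    def initialize_in_background(self) -> None:
        # Synchronous: schedules the work on a worker thread and returns at once.
        self._task = asyncio.create_task(asyncio.to_thread(self._ensure_initialized))
        self._task.add_done_callback(self._report_failure)

    @staticmethod
    def _report_failure(task: asyncio.Task) -> None:
        if not task.cancelled() and task.exception() is not None:
            print(f"code-locator init failed: {task.exception()}", file=sys.stderr)

    async def wait_until_ready(self) -> None:
        if self._task is not None:
            await self._task

    def search(self, query: str) -> str:
        # Sync entry point, meant to be called through asyncio.to_thread.
        self._ensure_initialized()
        return f"results for {query}"


async def main() -> tuple:
    locator = BackgroundInitLocator()
    start = time.perf_counter()
    locator.initialize_in_background()
    handshake_s = time.perf_counter() - start  # the handshake path no longer waits
    result = await asyncio.to_thread(locator.search, "foo")
    await locator.wait_until_ready()
    return handshake_s, result, locator.init_calls


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Whichever thread takes the lock first runs the init body and the other simply observes the ready flag, so the body runs once and no call site has to know about the background Task.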

…-handshake fix(server): move code-locator init off MCP stdio handshake (#380)

…fe upsert The v23 dedup index `idx_input_span_dedup` was UNIQUE on `(source_type, source_ref, text)`. Phase B-1 (#221) introduced archive_key and writes text='' for archive-keyed rows, so two distinct archive_keys sharing (source_type, source_ref) collided on the empty-text slot. The collision surfaced as a 500 on the dashboard's /history endpoint once any second archive-keyed write to the same source bucket landed (transitively via ensure_ledger_synced → link_commit → ingest paths). Changes: - Schema v23→v24: extend idx_input_span_dedup with archive_key as a 4th field. Non-destructive — adding a discriminator can only relax uniqueness, so all rows valid under the old index remain valid. Migration uses DEFINE INDEX OVERWRITE via _execute_define_idempotent (re-runnable). init_schema's OVERWRITE pass keeps the in-source DEFINE in sync on every connect. - upsert_input_span: refactored into a thin retry wrapper around _upsert_input_span_once. The wrapper retries on the SurrealDB v2 MVCC "failed to commit transaction" string (bounded, no backoff — the conflicting writer has already committed by the time we see the error). The inner body now catches unique-index "already contains" violations on both the archive-keyed and legacy text-only paths, re-SELECTing to return the winning row's id instead of crashing. - 6 new sociable tests pin: archive_key coexistence under v24, idempotent same-key dedup, concurrent-same-key race convergence, legacy text-path race safety, v24 migration idempotency, and a fixture that pins the v2 MVCC error substring so a future surrealdb-py bump breaks loudly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
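
The retry-wrapper shape described above might look roughly like this. It is a synchronous sketch against an in-memory stand-in rather than the real async handler: `FakeStore`, its `stale_reads` and `transient_failures` knobs, `_upsert_once`, and the use of `RuntimeError` are hypothetical. The two error substrings and the 4-field key come from the commit message, and the attempt count of 10 follows the later contention fix in this PR.

```python
from typing import Dict, Optional, Tuple

MVCC_MARKER = "failed to commit transaction"
UNIQUE_MARKER = "already contains"
MAX_ATTEMPTS = 10

# (source_type, source_ref, text, archive_key), mirroring the v24 index fields.
Key = Tuple[str, str, str, str]


class FakeStore:
    """Hypothetical in-memory store that can inject conflicts and stale reads."""

    def __init__(self, transient_failures: int = 0, stale_reads: int = 0) -> None:
        self.rows: Dict[Key, str] = {}
        self.transient_failures = transient_failures
        self.stale_reads = stale_reads

    def select(self, key: Key) -> Optional[str]:
        if self.stale_reads > 0:
            self.stale_reads -= 1
            return None  # simulates reading before the winning writer committed
        return self.rows.get(key)

    def create(self, key: Key) -> str:
        if self.transient_failures > 0:
            self.transient_failures -= 1
            raise RuntimeError(f"write conflict: {MVCC_MARKER}")
        if key in self.rows:
            raise RuntimeError(f"index idx_input_span_dedup {UNIQUE_MARKER} {key}")
        row_id = f"input_span:{len(self.rows) + 1}"
        self.rows[key] = row_id
        return row_id


def _upsert_once(store: FakeStore, key: Key) -> str:
    existing = store.select(key)
    if existing is not None:
        return existing
    try:
        return store.create(key)
    except RuntimeError as exc:
        if UNIQUE_MARKER in str(exc):
            # Lost the race: re-SELECT and return the winner's id.
            winner = store.select(key)
            if winner is not None:
                return winner
        raise


def upsert_input_span(store: FakeStore, key: Key) -> str:
    last_exc: Optional[RuntimeError] = None
    for _ in range(MAX_ATTEMPTS):  # bounded, no backoff
        try:
            return _upsert_once(store, key)
        except RuntimeError as exc:
            if MVCC_MARKER not in str(exc):
                raise
            last_exc = exc
    assert last_exc is not None
    raise last_exc
```

Two distinct archive keys that share `(source_type, source_ref, '')` get separate rows under the 4-field key, while a repeated or racing write for the same key converges on the first row's id.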

…bstone Implements decision:i4wafafzowm3ai5eyhgs (ratified 2026-05-15). bicameral.remove_decision now physically removes the decision row plus every reference to it (binds_to / yields / supersedes / context_for / about edges + the compliance_check verdict cache for the decision) and orphans child decisions cleanly by NULLing parent_decision_id. The decision_removed.completed event captures a full pre-deletion snapshot in the journal so the action is recoverable from the event log alone — the "soft audit trail" that replaces the tombstone row model. Motivation: the soft-delete model was intended as a negative-signal mechanism (rows with signoff.state="removed" warn future agents away from re-introducing the same wrong decision). In practice the dominant call shape is janitorial — test pollution, accidentally-ingested rows, retracted ideas with no learning value — where tombstones become friction that surfaces in preflight, occupies dashboard slots, and gets re-bound by drift sweeps. Supersession remains the right tool when a persistent negative signal is actually wanted. Contract changes: - RemoveDecisionResponse: drops `signoff` and `projected_status` (the row is gone — there's no signoff dict to return and the projected status is meaningless). Promotes the relevant fields to top level: was_new, event_logged, removed_at, previous_state, reason. - Idempotency: missing decision_id returns was_new=False without raising. The matching event in the journal is the canonical record of any prior removal. Trade-off: typos for never-existed ids look like idempotency, but the SKILL.md flow (read history first, then call) catches that. - server.py tool description updated to match. - skills/remove-decision/SKILL.md rewritten end-to-end; .claude/skills copy synced. Out of scope (separate decisions): - handlers/remove_source.py cascade still soft-deletes yielded decisions. That's a different tool's contract; touching it should be its own decision. 
- dashboard.html "already-removed" button-disable guard remains as defensive dead code — cosmetic-only and out of scope. Tests: - tests/test_remove_decision.py rewritten as sociable (real SurrealDBLedgerAdapter over memory://) per pilot/mcp/CLAUDE.md. 9 tests covering: reason validation, missing-id idempotency, full edge+cache cleanup, child orphan, second-call no-op, event emission/skipping, and idempotent no-event. - tests/test_dogfood_label_propagation.py: removed the obsolete monkeypatches for handlers.remove_decision.project/update_decision_status (functions no longer imported by the new handler). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
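
The hard-delete contract (cascade, child orphaning, journal snapshot, idempotent second call) can be illustrated with a toy in-memory ledger. This is a sketch only: `ToyLedger`, the dict-based rows and edges, and the reading that `was_new=True` means "this call performed the removal" are assumptions. The response field names and the `decision_removed.completed` event name are taken from the commit message.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class RemoveDecisionResponse:
    was_new: bool
    event_logged: bool
    removed_at: Optional[str] = None
    previous_state: Optional[str] = None
    reason: str = ""


@dataclass
class ToyLedger:
    decisions: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Dict[str, str]] = field(default_factory=list)
    journal: List[Dict[str, Any]] = field(default_factory=list)


def remove_decision(ledger: ToyLedger, decision_id: str, reason: str) -> RemoveDecisionResponse:
    if not reason.strip():
        raise ValueError("reason is required")
    row = ledger.decisions.get(decision_id)
    if row is None:
        # Idempotent path: nothing to delete, nothing new to journal.
        return RemoveDecisionResponse(was_new=False, event_logged=False, reason=reason)

    touching = [e for e in ledger.edges if decision_id in (e["in"], e["out"])]
    removed_at = datetime.now(timezone.utc).isoformat()
    # Journal a full pre-deletion snapshot before anything is removed.
    ledger.journal.append(
        {
            "type": "decision_removed.completed",
            "decision_id": decision_id,
            "snapshot": {"decision": dict(row), "edges": touching},
            "reason": reason,
            "removed_at": removed_at,
        }
    )
    ledger.edges = [e for e in ledger.edges if e not in touching]
    for child in ledger.decisions.values():
        if child.get("parent_decision_id") == decision_id:
            child["parent_decision_id"] = None  # orphan children cleanly
    del ledger.decisions[decision_id]
    return RemoveDecisionResponse(True, True, removed_at, row.get("state"), reason)
```

A second call with the same id returns `was_new=False` and appends nothing, so the single journal entry stays the canonical record of the removal.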

…-delete work claude-mem regenerated the recent-activity tables for handlers/ and tests/ after today's remove_decision hard-delete implementation, and seeded new context files at skills/remove-decision/ and .claude/skills/remove-decision/ where the skill was edited. Purely auto-generated context — no code change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-suite contention The safe-upsert retry loop landed at 5 attempts (ebcfeb4). Running the full regression batch surfaced a flake in test_concurrent_same_archive_key_race — three concurrent writers for the same archive_key occasionally exceed 5 MVCC retries when the test suite holds dozens of memory:// SurrealDB instances in the same process. Each retry's SELECT short-circuits the moment the winning writer commits, so the cost is one RTT per attempt — trivial. 10 absorbs the variance with massive headroom for production usage (where contention storms of this shape can't happen — one DB per process). A proper fix (per-key write queue instead of optimistic retry) is tracked separately as a follow-up issue. Also includes claude-mem auto-generated activity-log refreshes from this session (no code change in those files). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-handshake feat(remove_decision): hard-delete by default + v24 input_span dedup index

…17→v24 chain Drains the dev → main backlog accumulated since v0.14.7. Cumulative release; non-destructive schema migration chain (v17→v24) applied automatically on first connect. Breaking change: bicameral.remove_decision contract is now hard-delete by default (decision:i4wafafzowm3ai5eyhgs). Highlights: - PII archive (#221 Phase A + B-1) — operator-erasable PII surface keyed by content-hash; ingest writes verbatim text to the archive and leaves input_span.text='' with the v22 ASSERT enforcing exactly-one-of. - Hard-delete remove_decision — soft-delete tombstone retired; full pre-deletion snapshot lives in the event journal. - Constant-time revision counter (#87 Phase 6) — bicameral_meta.decision_revision auto-bumped by DEFINE EVENT; replaces O(N) MAX(updated_at) scan in preflight dedup. - bicameral.admin/query (#278 Phase 3), dashboard source view (#278 Phase 1), LocalDirectorySourceAdapter (#344), sync-and-brief team-mode (#279). - Code-locator singleton + eager startup init (#243, #380) — index work moves off the per-call hot path and off the MCP stdio handshake. - Schema v17→v24 chain — all additive, non-destructive. Three architectural decisions ratified for the doctrine follow-up PR: expand-only schema rule, feature-flag gating for new-schema-dependent code, DEV_CYCLE.md §10.5.1 amendment for triage eligibility. Closes decision:i4wafafzowm3ai5eyhgs. See CHANGELOG.md for the full Added / Changed / Fixed / Schema-migrations / Doctrine / Removed breakdown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-16T05:52:11Z

Important

Review skipped

Too many files!

This PR contains 218 files, which is 68 over the limit of 150.

To get a review, narrow the scope:
• coderabbit review --type committed # exclude uncommitted changes
• coderabbit review --dir # limit to a subdirectory
• coderabbit review --base # compare against a closer base

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6a5204c-8b7d-47e3-ad0c-b7f7b7e447ed

📥 Commits

Reviewing files that changed from the base of the PR and between 083c1d4 and 74377d3.

📒 Files selected for processing (218)

.claude/hooks/pre_tool_use_timeout_context.py
.claude/hooks/session_start_timeout_posture.py
.claude/skills/bicameral-bind
.claude/skills/bicameral-brief
.claude/skills/bicameral-capture-corrections
.claude/skills/bicameral-capture-corrections/SKILL.md
.claude/skills/bicameral-config
.claude/skills/bicameral-context-sentry
.claude/skills/bicameral-dashboard
.claude/skills/bicameral-dashboard/SKILL.md
.claude/skills/bicameral-diagnose
.claude/skills/bicameral-doctor
.claude/skills/bicameral-guided
.claude/skills/bicameral-history
.claude/skills/bicameral-history/SKILL.md
.claude/skills/bicameral-ingest
.claude/skills/bicameral-ingest/SKILL.md
.claude/skills/bicameral-judge-gaps
.claude/skills/bicameral-judge-gaps/SKILL.md
.claude/skills/bicameral-output-formats
.claude/skills/bicameral-preflight
.claude/skills/bicameral-preflight/CLAUDE.md
.claude/skills/bicameral-preflight/SKILL.md
.claude/skills/bicameral-report-bug
.claude/skills/bicameral-reset
.claude/skills/bicameral-reset/SKILL.md
.claude/skills/bicameral-resolve-collision
.claude/skills/bicameral-resolve-collision/SKILL.md
.claude/skills/bicameral-scan-branch
.claude/skills/bicameral-search
.claude/skills/bicameral-status
.claude/skills/bicameral-sync
.claude/skills/bicameral-update
.claude/skills/remove-decision/CLAUDE.md
.claude/skills/remove-decision/SKILL.md
.github/workflows/lint-and-typecheck.yml
.github/workflows/perf-gate.yml
.github/workflows/preflight-eval.yml
.github/workflows/test-mcp-regression.yml
.github/workflows/test-schema-persistence.yml
.gitignore
.pre-commit-config.yaml
CHANGELOG.md
CLAUDE.md
README.md
RECOMMENDED_NIGHTLY_VERSION
RECOMMENDED_VERSION
SECURITY.md
TODO.md
adapters/code_locator.py
adapters/ledger.py
assets/dashboard.html
cli/_diagnose_gather.py
cli/_ledger_io_engine.py
cli/brief_renderer.py
cli/diagnose.py
cli/ledger_export_cli.py
cli/ledger_import_cli.py
cli/ledger_io.py
cli/sync_and_brief_cli.py
consent.py
context.py
contracts.py
dashboard/admin.py
dashboard/server.py
docs/DEV_CYCLE.md
docs/META_LEDGER.md
docs/SHADOW_GENOME.md
docs/governance/compliance-stance-matrix.md
docs/governance/doctrine-deterministic-governance.md
docs/ideation-team-server-tier-v1-2026-05-14.md
docs/ledger-sociable-test-audit.md
docs/policies/acceptable-use.md
docs/policies/claude-hooks-mcp-integration.md
docs/policies/gdpr-art-17-erasure-roadmap.md
docs/policies/host-trust-model.md
docs/policies/ledger-export.md
docs/policies/notifications-roadmap.md
docs/policies/query-timeouts.md
docs/policies/sources-config.md
docs/policies/threat-model-and-trust-boundary.md
docs/preflight-failure-scenarios.md
docs/research-brief-compliance-audit-2026-05-06.md
docs/research-brief-r1-limitations-remediation-2026-05-14.md
docs/research-brief-team-server-tier-v1-2026-05-14.md
docs/semantic-drift-governance.md
docs/v0-productization-design-partner-dogfood.md
docs/v2-desync-optimization-guide.md
events/dogfood.py
events/sources/__init__.py
events/sources/granola.py
events/sources/local_directory.py
governance-gates.yaml
handlers/bind.py
handlers/history.py
handlers/ingest.py
handlers/link_commit.py
handlers/preflight.py
handlers/remove_decision.py
handlers/remove_source.py
handlers/resolve_collision.py
handlers/search_decisions.py
handlers/update.py
ledger/CLAUDE.md
ledger/adapter.py
ledger/client.py
ledger/queries.py
ledger/schema.py
ledger/timeout_telemetry.py
notifications/__init__.py
notifications/channel.py
notifications/contracts.py
notifications/stderr.py
pii_archive/__init__.py
pii_archive/contracts.py
pii_archive/store.py
preflight_telemetry.py
pyproject.toml
pytest.ini
scripts/audit_sociable_coverage.py
scripts/hooks/preflight_intent.py
scripts/lint_skill_governance.py
server.py
setup_wizard.py
skills/admin-surrealql/SKILL.md
skills/bicameral-brief/SKILL.md
skills/bicameral-context-sentry/CLAUDE.md
skills/bicameral-context-sentry/SKILL.md
skills/bicameral-doctor/SKILL.md
skills/bicameral-guided/SKILL.md
skills/bicameral-preflight/SKILL.md
skills/bicameral-scan-branch/SKILL.md
skills/bicameral-search/SKILL.md
skills/bicameral-status/SKILL.md
skills/bicameral-sync-and-brief/SKILL.md
skills/bicameral-update/SKILL.md
skills/remove-decision/CLAUDE.md
skills/remove-decision/SKILL.md
skills/remove-source/SKILL.md
telemetry_flags.py
tests/_extract_headless.py
tests/_replay_helpers.py
tests/conftest.py
tests/e2e/README.md
tests/e2e/run_e2e_flows.py
tests/eval/__init__.py
tests/eval/_bind_judge.py
tests/eval/_preflight_eval_seed.py
tests/eval/_preflight_m6_seeder.py
tests/eval/preflight_dataset.jsonl
tests/eval/run_preflight_eval.py
tests/eval_decision_relevance.py
tests/eval_grounding_recall.py
tests/eval_grounding_recall_summary.py
tests/eval_preflight_m6_recall.py
tests/eval_preflight_m6_summary.py
tests/fixtures/preflight_m6/__init__.py
tests/fixtures/preflight_m6/dataset.py
tests/fixtures/skill_lint/clean_skill/SKILL.md
tests/fixtures/skill_lint/flagged_skill/SKILL.md
tests/fixtures/skill_lint/registered_skill/SKILL.md
tests/perf/__init__.py
tests/perf/conftest.py
tests/perf/test_ledger_revision_perf.py
tests/regen_extraction_fixtures.py
tests/test_admin_surrealql_route.py
tests/test_brief_renderer.py
tests/test_claude_hooks_timeout_context.py
tests/test_codelocator_background_init.py
tests/test_compliance_policy_docs.py
tests/test_consent_notice.py
tests/test_dashboard_admin_panel.py
tests/test_dashboard_remove_flows.py
tests/test_dashboard_source_view.py
tests/test_diagnose_allowlist.py
tests/test_diagnose_cli.py
tests/test_dogfood_label_propagation.py
tests/test_grounding_failure_mode.py
tests/test_history_erasure_propagation.py
tests/test_history_input_span_id.py
tests/test_input_span_safe_upsert.py
tests/test_ledger_bicameral_meta_migration.py
tests/test_ledger_export_cli.py
tests/test_ledger_import_cli.py
tests/test_ledger_io_canonical_record.py
tests/test_ledger_io_export.py
tests/test_ledger_io_import.py
tests/test_ledger_mock_regression.py
tests/test_notifications_unit.py
tests/test_pii_archive_schema_migration.py
tests/test_pii_archive_schema_migration_b1.py
tests/test_pii_archive_unit.py
tests/test_preflight_dedup_telemetry.py
tests/test_preflight_dedup_v2.py
tests/test_preflight_graph_expansion.py
tests/test_preflight_hitl.py
tests/test_preflight_id_plumbing.py
tests/test_preflight_m6_eval.py
tests/test_query_timeout_handler_routing.py
tests/test_query_timeout_unit.py
tests/test_remove_decision.py
tests/test_remove_source.py
tests/test_replay_determinism.py
tests/test_replay_helpers_unit.py
tests/test_resolve_span_text_unit.py
tests/test_sessionstart_hook_install.py
tests/test_setup_wizard_channel_autodetect.py
tests/test_skill_governance_lint.py
tests/test_skills_symlink_integrity.py
tests/test_sources_granola_unit.py
tests/test_sources_local_directory_unit.py
tests/test_sync_and_brief_cli.py
tests/test_sync_and_brief_team_mode.py
tests/test_sync_middleware.py
tests/test_telemetry_flags.py
tests/test_v18_decision_updated_at.py
tests/test_v19_revision_counter.py
tests/test_v23_decision_level_backfill.py


… (triage eligibility)

Encodes the three architectural decisions ratified 2026-05-15 (decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9, decision:0ok1249n2tdrfud2a5j9):

§4.7: new subsection enforcing two complementary rules for any PR that touches ledger/schema.py or its _MIGRATIONS registry:

§4.7.1: schema migrations must be expand-only. Destructive operations (REMOVE / DROP / breaking ALTER / tightening ASSERT) live in their own commits and ship in a later release after the prior reader surface is validated as gone from prod. Includes an allowed/forbidden table for reviewer ease. CI lint planned via scripts/lint_schema_destructive.py.

§4.7.2: code paths that depend on new schema must be feature-flag gated and default OFF in prod (env var or .bicameral/config.yaml setting). Schema ships immediately; flag flips later in a separate release. If the experiment is killed, the flag never flips on and a follow-up cleanup migration drops the slot. Exception: invariant bugfixes (e.g. fixing a unique-index collision that breaks the dashboard for everyone) don't need flag-gating; that's not feature surface.

§4.7.3: concrete PR-review checklist for schema-touching PRs.

§10.5.1: triage eligibility rule rewritten. Previously: "schema-migrating changes are not triage-eligible" (blanket). Now: schema migrations CAN ride a triage release if they comply with §4.7 (expand-only AND feature code is flag-gated). The blanket ban is replaced by enumerated exclusions (destructive schema, flag-flip releases, breaking public-API changes, multi-PR epics, v1 patches).

Motivation: the prior rule was correct under the implicit assumption that schema and feature ship together, in which case you can't ship one without the other. Once §4.7 decouples them, schema can drain to main on every triage instead of accumulating on dev waiting for a "real" release. The current v18→v24 backlog (drained by the v0.15.0 release PR #388) is the symptom the prior rule produced; this amendment prevents recurrence.

Refs decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9, decision:0ok1249n2tdrfud2a5j9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
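
For illustration only, the two §4.7 rules described above might look roughly like the sketch below. It is not the planned scripts/lint_schema_destructive.py, which this page only names. The pattern list, the function names, and the BICAMERAL_EXAMPLE_FEATURE flag are all hypothetical. The semicolon split is deliberately naive and would misfire on semicolons inside string literals.

```python
import os
import re

# Hypothetical patterns for statements that §4.7.1 treats as destructive.
# The real lint is only planned, so this list is an assumption.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"^\s*REMOVE\s+(TABLE|FIELD|INDEX)\b", re.IGNORECASE),
    re.compile(r"^\s*DROP\b", re.IGNORECASE),
]


def find_destructive_statements(migration_sql: str) -> list[str]:
    """Return the statements in a migration that match a destructive pattern."""
    hits = []
    # Naive split: assumes no semicolons inside string literals.
    for statement in migration_sql.split(";"):
        statement = statement.strip()
        if any(p.search(statement) for p in DESTRUCTIVE_PATTERNS):
            hits.append(statement)
    return hits


def feature_enabled(flag_name: str, environ=None) -> bool:
    """Default-OFF flag check: only an explicit truthy value enables the feature."""
    env = os.environ if environ is None else environ
    return env.get(flag_name, "").strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    migration = (
        "DEFINE FIELD updated_at ON decision TYPE option<datetime>; "
        "REMOVE FIELD legacy ON decision;"
    )
    print(find_destructive_statements(migration))
    # BICAMERAL_EXAMPLE_FEATURE is a made-up flag name for this sketch.
    print(feature_enabled("BICAMERAL_EXAMPLE_FEATURE", {}))
```

Under these assumptions, an expand-only migration yields an empty list from the lint, and a feature path stays dark until an operator sets the flag explicitly.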

# Conflicts: # CHANGELOG.md # RECOMMENDED_VERSION # pyproject.toml

silongtan and others added 30 commits May 6, 2026 23:52

Merge pull request #288 from BicameralAI/280-m2-gate-flip-hard

9575df5

ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280)

Merge pull request #293 from BicameralAI/280-failure-mode-enumeration

28b7530

feat(eval): M2 failure-mode enumeration for cross-functional design discussion (#280)

Merge remote-tracking branch 'origin/dev' into 192-telemetry-csv-flag

526035e

Merge pull request #250 from BicameralAI/192-telemetry-csv-flag

ece365b

refactor(telemetry): consolidate BICAMERAL_TELEMETRY env-var namespace (#192)

Merge pull request #294 from BicameralAI/243-preflight-eliminate-fall…

119cd89

…backs feat(preflight): eliminate silent graph-expansion fallbacks (#243)

Merge pull request #299 from BicameralAI/docs/readme-demo-videos

4299752

docs(README): demo video section + relocate star CTA mid-doc

Merge remote-tracking branch 'origin/dev' into feat/sociable-sync-mid…

9e2ecb9

…dleware-tests # Conflicts: # CHANGELOG.md # README.md

chore(test): satisfy ruff I001 import grouping in test_sync_middleware

b676b17

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #303 from BicameralAI/feat/sociable-sync-middlewar…

e307635

…e-tests test(sync_middleware): sociable banner tests + decision_id alias fix

Merge pull request #304 from BicameralAI/58-preflight-m6-recall-eval

61a0e66

feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58)

Merge pull request #305 from BicameralAI/58-m6-seeder-fix

14188f8

Merge pull request #307 from BicameralAI/skills/symlink-canonical-mirror

4ea2338

chore(skills): replace .claude/skills/bicameral-* mirrors with symlinks to canonical skills/

jinhongkuan and others added 16 commits May 15, 2026 19:44

Merge pull request #381 from BicameralAI/fix/setup-wizard-nightly-cha…

772ff88

…nnel-autodetect fix(setup): auto-detect nightly channel from .dev install version

Merge pull request #382 from BicameralAI/chore/bump-nightly-pointer-2…

a02826e

…026.5.16.dev024452 chore(nightly): bump RECOMMENDED_NIGHTLY_VERSION to 2026.5.16.dev024452

style: ruff format the schema fix files

8c4bd47

CI ruff format --check caught three files that were lint-clean but not format-clean. No semantic change — line breaks collapse to match the project's max-line-length policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #384 from BicameralAI/docs/dev-cycle-require-linke…

27a6500

…d-decision docs(dev-cycle): require linked bicameral decision on org-member PRs

Merge pull request #383 from BicameralAI/fix/schema-decision-revision…

2fbfa32

…-backfill fix(schema): backfill bicameral_meta.decision_revision in v18→v19 + v22→v23

Merge pull request #385 from BicameralAI/fix/380-codelocator-init-off…

dac8f0f

…-handshake fix(server): move code-locator init off MCP stdio handshake (#380)

Merge pull request #386 from BicameralAI/fix/380-codelocator-init-off…

9596352

…-handshake feat(remove_decision): hard-delete by default + v24 input_span dedup index

jinhongkuan added the flow:release label (Release PR, BicameralAI/dev → BicameralAI/main, that promotes integrated work to a tagged release) May 16, 2026

jinhongkuan mentioned this pull request May 16, 2026

docs(dev-cycle): §4.7 expand-only + flag-gate, amend §10.5.1 triage eligibility #389

Merged

3 tasks

Merge remote-tracking branch 'origin/main' into release/v0.15.0

74377d3

# Conflicts: # CHANGELOG.md # RECOMMENDED_VERSION # pyproject.toml

jinhongkuan temporarily deployed to ci-test May 16, 2026 06:53 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to production May 16, 2026 06:53 — with GitHub Actions Inactive

jinhongkuan temporarily deployed to ci-test May 16, 2026 06:53 — with GitHub Actions Inactive

jinhongkuan requested a deployment to recording-approval May 16, 2026 06:53 — with GitHub Actions Waiting

jinhongkuan enabled auto-merge (rebase) May 16, 2026 07:02

jinhongkuan merged commit 6963cb0 into main May 16, 2026
10 of 11 checks passed

jinhongkuan deleted the release/v0.15.0 branch May 16, 2026 07:08

jinhongkuan mentioned this pull request May 17, 2026

fix(skill): bicameral-preflight does not auto-fire on /qor-plan with a GitHub issue URL #402

Closed

4 tasks

jinhongkuan merged 156 commits into main from release/v0.15.0
