release: v0.15.0 — PII archive, hard-delete remove_decision, schema v17→v24 chain #388
…st (#192) Single env var now owns the entire telemetry-flag namespace. Three accepted forms: bool (`0`/`off`/`false`/`no` → all off; `1`/`on`/`true`/`yes` → relay only), csv (`relay,preflight,raw`), and unset (default → relay only). New `telemetry_flags.py` module owns parsing; `consent.telemetry_allowed()` and `preflight_telemetry.{telemetry_enabled, raw_capture_enabled}` delegate to a frozen `TelemetryFlags` cached once per process. Backwards-compat preserved on three axes: 1. Legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1` and `BICAMERAL_PREFLIGHT_TELEMETRY_RAW=1` continue to work as additive overlays — first read of either emits a one-line stderr deprecation warning per process. Removed in v1.x. 2. `BICAMERAL_TELEMETRY=1` semantics unchanged (relay only — does NOT auto-enable preflight). 3. Non-canonical truthy values (`enabled`, `t`, `active`, etc. — used in pre-#192 deployments) map to relay-only with a stderr warning pointing at the canonical form. Caught by Codex review as a P2 finding; preserves the pre-#192 contract that any non-OFF value enabled relay. Semantics: - CSV form is explicit — what's listed is on, what's not is off (so `BICAMERAL_TELEMETRY=preflight,raw` turns OFF the default-on relay, documented in the setup wizard). - `raw` always implies `preflight` (raw is a mode of preflight events; defensive double-check in `raw_capture_enabled()`). - Process-cached parsing via `lru_cache`; tests use `_reset_for_tests()` via an autouse fixture in `tests/conftest.py` so monkeypatched env vars take effect cross-test. 35 fixtures in `tests/test_telemetry_flags.py` cover all forms + integration with the existing call sites + the legacy-truthy preservation case. 87/87 green across all 7 telemetry-touching test files (including 52 regression tests for #39 / #101 / #112 behaviors). Closes #192. Unblocks #65 phase 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
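The three accepted forms described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `telemetry_flags.py`: the `parse_flags` helper and the field names on `TelemetryFlags` are assumptions, the legacy overlay variables and stderr warnings are left out, and treating an empty string like unset, or a CSV containing unknown tokens like a non-canonical truthy value, is a guess rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

_OFF = {"0", "off", "false", "no"}
_ON = {"1", "on", "true", "yes"}
_KNOWN = {"relay", "preflight", "raw"}


@dataclass(frozen=True)
class TelemetryFlags:
    """Hypothetical shape; the real field names are not shown in the source."""

    relay: bool
    preflight: bool
    raw: bool


def parse_flags(value: Optional[str]) -> TelemetryFlags:
    """Parse a BICAMERAL_TELEMETRY value into a frozen flag set (sketch)."""
    # Unset defaults to relay only; an empty string is treated the same (assumption).
    if value is None or not value.strip():
        return TelemetryFlags(relay=True, preflight=False, raw=False)
    text = value.strip().lower()
    if text in _OFF:
        return TelemetryFlags(relay=False, preflight=False, raw=False)
    if text in _ON:
        # Bool-truthy means relay only; it does not enable preflight.
        return TelemetryFlags(relay=True, preflight=False, raw=False)
    tokens = {tok.strip() for tok in text.split(",") if tok.strip()}
    if tokens and tokens <= _KNOWN:
        # CSV form is explicit: what is listed is on, everything else is off.
        raw = "raw" in tokens
        return TelemetryFlags(
            relay="relay" in tokens,
            preflight="preflight" in tokens or raw,  # raw implies preflight
            raw=raw,
        )
    # Non-canonical truthy values (e.g. "enabled") keep the relay-only contract.
    return TelemetryFlags(relay=True, preflight=False, raw=False)
```

Under these assumptions, `parse_flags("preflight,raw")` turns relay off, which matches the CSV semantics the commit calls out.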
…METRY (#192 follow-up)

Six doc references still spelled the legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1` shape after #192 consolidated the flag namespace. Updated each to lead with the canonical csv form (`BICAMERAL_TELEMETRY=preflight` / `=preflight,raw`) and note the legacy var is still honored via the deprecation overlay:

- server.py — preflight_id schema description (agent-visible)
- contracts.py — preflight_id field comment
- preflight_telemetry.py — module docstring (Default mode + Raw mode lines)
- handlers/record_bypass.py — module docstring (telemetry_disabled reason)
- skills/bicameral-preflight/SKILL.md — bypass-write contract (agent-visible)
- docs/semantic-drift-governance.md — record_bypass return-value spec

No behavior change. Tests unchanged: 66/66 green across telemetry_flags, consent_notice, preflight_telemetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves #250 base-branch drift after dev advanced 50 commits since the PR was opened. Two conflicts: 1. handlers/record_bypass.py (modify/delete) — dev's #244 v1 revert (commit d1e3914) deleted the entire HITL bypass + decision_level surface from v0 scope. My #192 follow-up touched its module docstring; resolution is to accept the deletion (the doc patch is moot once the file is gone). 2. skills/bicameral-preflight/SKILL.md (content) — same #244 revert deleted the §5.4-bypass-semantics block I patched for canonical env-var phrasing. Accepted dev's deletion of the block; the remaining §5.4 telemetry-attribution + §5.5 confirm-finding sections are untouched and still carry the canonical `BICAMERAL_TELEMETRY=preflight` form via the merged v1 of §5.4-telemetry-note. The other four files I patched in the doc-followup commit (server.py, contracts.py, preflight_telemetry.py, docs/semantic-drift-governance.md) auto-merged cleanly. My canonical `BICAMERAL_TELEMETRY=preflight` references survive verbatim. Telemetry tests post-merge: 66/66 green (test_telemetry_flags + test_consent_notice + test_preflight_telemetry). Note: docs/semantic-drift-governance.md still describes record_bypass return values that no longer have a handler. dev kept the file unchanged through the v1 revert; whether the governance lifecycle doc should be deleted, marked v1-deferred, or kept as forward-looking architecture is a separate triage call (not in #250 scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ne (#280) PR #285's first CI run produced a clean baseline: 23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000 ✓ all gates pass That's ~7-13 pp of headroom on every gate (≥ 0.85 / ≥ 0.80 / ≤ 0.30). Locking the baseline in before drift sets in. Two changes to .github/workflows/test-mcp-regression.yml: 1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero on breach instead of warning to step output. 2. Removed `continue-on-error: true` from the eval step. The step now fails CI when the gate breaches. The metrics-summary step keeps `continue-on-error: true` so a renderer bug never masks the eval result — and the `always()` guard means the breach summary is still rendered inline when the eval fails. After this lands, PRs that touch the bind handler / bind skill / fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 / abort_rate ≤ 0.30, OR deliberately re-record the cache by setting BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change. Aligns with Jin's "deliberate not drift" framing — same path the M1 eval *should* have taken (M1 has been warn-only forever; M2 is being flipped while the baseline is fresh, days after the eval shipped). Refs #280. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uts don't fail M2 hard-gate (#288)

The hard-gate flip (commit 6605f24) surfaced an existing flakiness on the first CI run: a single httpx.ReadTimeout on one of the up-to-8 tool-use turns crashed the whole eval run, failing the MCP regression suite. Previously masked by `--gate-mode warn` + `continue-on-error: true`, both removed by the gate-flip.

Three surgical fixes:

1. tests/eval/_bind_judge.py — _call_messages_api now retries 3× with exponential backoff (2s/8s/32s) on:
   - httpx.TimeoutException (read/connect/pool)
   - httpx.NetworkError, httpx.RemoteProtocolError
   - HTTP 429 (rate limit) + 5xx (server-side transient)
   After exhausting retries, raises RuntimeError with a bounded message. Terminal 4xx (auth, malformed payload) still fails fast — those aren't transient.
2. tests/eval_grounding_recall.py — per-case catch broadened from `except RuntimeError` to `except Exception`, and a single failing case now records an `eval_error` outcome row instead of crashing the whole eval. Aggregate gate is still applied: if N cases err hard enough that recall < 0.80 across 23 cases, the eval fails CI correctly. With our 0.913 baseline, ~5 cases would have to err before the gate breaches.
3. tests/eval_grounding_recall_summary.py — eval_error added to the outcome-breakdown table; missed-cases list surfaces the error msg inline (rather than rendering "—::—" for the absent binding).

Local verification:
- retry loop smoke-tested: 3× ReadTimeout → bounded RuntimeError; 503/503/200 → recovers and returns the 200 response.
- ruff check + format + mypy all green.
- test_m2_grounding_log + test_bind_m2_telemetry: 11 passed, 3 skipped.

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
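The retry policy in the first fix can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `tests/eval/_bind_judge.py`: `TransientError` is a hypothetical stand-in for the httpx timeout/network errors and the 429/5xx responses, `call_with_retry` is an invented name, and the sketch assumes one initial attempt followed by up to three retries (the commit does not say whether the first attempt counts toward the three).

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429, 5xx)."""


def call_with_retry(
    fn: Callable[[], T],
    delays: Sequence[float] = (2.0, 8.0, 32.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with the given backoff delays."""
    last_exc: Optional[BaseException] = None
    for attempt in range(len(delays) + 1):
        try:
            return fn()
        except TransientError as exc:
            last_exc = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    # Bounded message: only the exception type, never the full payload.
    raise RuntimeError(
        f"gave up after {len(delays)} retries: {type(last_exc).__name__}"
    ) from last_exc
```

Any exception that is not a `TransientError` propagates on the first occurrence, which mirrors the fail-fast treatment of terminal 4xx responses. Passing `sleep` as a parameter keeps the loop testable without real waits.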
ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280)
…iscussion (#280)

Jin's PR-#288 followup: aggregate metrics tell us *whether* the agent is grounding well; categorized failure modes tell PMs *which kinds of decisions* it struggles with. New deterministic post-hoc classifier runs over the existing per-case rows; renderer adds a "Failure modes" section to the GitHub step summary ranked by count, with up to 2 example cases per category and a documented PM-actionable next step.

Categories
----------
correct                    — agent got it right, no action
wrong_module               — same-name disambiguation failed
wrong_intent               — similar-intent miss
cross_language_confusion   — Python ↔ TS runtime mistake
wrong_symbol_in_right_file — sub-region disambiguation gap
hallucinated_symbol        — handler reject path caught a fake symbol
span_mismatch              — handler reject path caught hallucinated lines
aborted_correctly          — agent recognized a behavioral / unbindable decision (only meaningful once §B fixture lands; the §B "ungroundable behavioral cases" piece was deferred per Jin's plan recommendation — design partners are better authors for those via #280 friction reports)
aborted_incorrectly        — agent over-cautious on a bindable case
eval_error                 — infra (API timeout / network)

Each category carries a documented next step (FAILURE_MODE_NEXT_STEPS constant, kept in sync with the renderer's _FAILURE_MODE_HINTS table). PM-readable, not engineering jargon.

Files
-----
tests/eval_grounding_recall.py
  + classify_failure_mode(row) -> str
  + FAILURE_MODE_NEXT_STEPS dict (PM-actionable next steps)
  + failure_mode field embedded on every per-case row (success + eval_error paths both populate it)
tests/eval_grounding_recall_summary.py
  + _render_failure_modes(rows) helper
  + new "Failure modes (top categories — PM-actionable)" section between the gate-breach line and the existing miss list. Ranked by miss count, eval_error always last (infra noise), capped at top 3 categories with up to 2 example cases each. Examples surface case_id + reasoning/abort_reason/error_msg excerpt (truncated to 110 chars, pipe-escaped for table safety).
tests/test_grounding_failure_mode.py (new)
  13 table-driven tests across all 10 categories + 3 invariants: unknown-outcome falls into 'uncategorized', taxonomy documentation completeness, handler-reject-priority over case_type. Pure unit tests — no API, no ledger.
CHANGELOG.md +2

Cache key unchanged
-------------------
failure_mode is computed at the row level from existing fields; doesn't touch the bind judge's cache key (model | skill | repo | decision). So the existing 0.913/0.913 baseline cache stays valid; CI runs after this PR will hit the cache and produce identical numbers — only the renderer output is enriched.

Local verification
------------------
- 13 passed on tests/test_grounding_failure_mode.py
- 24 passed, 3 skipped across the M2-related test files (test_m2_grounding_log + test_bind_m2_telemetry + test_grounding_failure_mode)
- ruff check + ruff format --check + mypy all green on touched files
- Renderer smoke-tested on a synthetic input with 6 misses across 4 categories — section ranks correctly, examples populate, hints land in the right column

Out of scope (intentionally deferred)
-------------------------------------
§B from the plan: 4-5 deliberately ungroundable behavioral cases (`expected_outcome="abort"`) that materially measure Jin's "behavioral decisions" pattern. Recommended deferral — design partners are better authors for those via #280 friction reports rather than engineering inventing them. Once §B lands, `aborted_correctly` will start firing for real (it can fire today only for rows that carry `expected_outcome="abort"`, which no current row does).

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
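The ranking and excerpt rules for the "Failure modes" section can be sketched as below. This is a hedged illustration rather than the body of `_render_failure_modes`: the helper names `rank_failure_modes` and `excerpt` are hypothetical, treating `correct` and `aborted_correctly` as non-misses is an assumption, and so is truncating before escaping.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Categories assumed not to count as misses (assumption, not from the source).
_NON_MISS = {"correct", "aborted_correctly"}


def rank_failure_modes(
    modes: Iterable[str], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank miss categories by count, keeping eval_error last, capped at top_n."""
    counts = Counter(m for m in modes if m not in _NON_MISS)
    ranked = sorted(
        counts.items(),
        # eval_error sorts after everything else; then by descending count.
        key=lambda kv: (kv[0] == "eval_error", -kv[1], kv[0]),
    )
    return ranked[:top_n]


def excerpt(text: str, limit: int = 110) -> str:
    """Collapse whitespace, truncate to limit chars, then escape table pipes."""
    flat = " ".join(text.split())[:limit]
    return flat.replace("|", "\\|")
```

Truncating first means a cut never lands in the middle of an escape sequence, at the cost of the escaped string occasionally exceeding the limit by a few characters.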
feat(eval): M2 failure-mode enumeration for cross-functional design discussion (#280)
refactor(telemetry): consolidate BICAMERAL_TELEMETRY env-var namespace (#192)
PR #174 closed the recall ceiling but introduced two silent fallback paths in `_region_anchored_preflight`: when `ctx.code_graph` was absent OR when the expander raised, the response shape was byte-identical to "expansion ran and matched zero" — caller couldn't tell recall was degraded.

Three additive signals now surface every fallback (per Phase 2 spec posted on #243, all four open questions defaulted to recommended):

1. Response field — `sources_chained` includes `"graph_unavailable"`. Additive (never replaces existing `"region"` / `"graph"` tags). Bare tag — granular reason flows through telemetry, not the response shape, per signoff Q2.
2. Log level — exception case bumped from `logger.debug` → `logger.warning` with stable `[preflight:fallback]` substring + exception type for grep-friendly production logs.
3. Telemetry counter — new `preflight_telemetry.write_fallback_event(reason, session_id)` modeled on `write_ingest_refusal_event` (#216). Emits a `graph_expansion_fallback` row to the existing `~/.bicameral/preflight_events.jsonl` substrate. Reasons are a controlled enum: `"absent"`, `"missing_method"`, `"exception:<type>"`. Gated on `BICAMERAL_TELEMETRY=preflight`.

The fallback case classifier in `_region_anchored_preflight` distinguishes three reasons (was conflated into a single `if expander is not None:` skip in the pre-#243 code):

- `code_graph is None` → "absent"
- `code_graph` set but no `expand_file_paths_via_graph` → "missing_method"
- expander raised → "exception:<type>"

Skill update (`skills/bicameral-preflight/SKILL.md`) renders a one-line recall-degraded note to the agent when the tag is present:

> Note: structural-neighbor lookup was unavailable this call —
> recall may be reduced until the symbol index is rebuilt. Decisions
> bound to files that import these may not have surfaced.

Treats `"graph_unavailable"` as advisory: doesn't block the preflight surface; direct-pin matches are unaffected.
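A rough sketch of the three-way fallback classification described above. It is not the code in `_region_anchored_preflight`: the helper names `expand_with_fallback` and `tag_sources` are hypothetical, and the expander is assumed to take a list of paths and return an iterable of paths.

```python
from typing import Any, List, Optional, Sequence, Tuple


def expand_with_fallback(
    code_graph: Any, file_paths: Sequence[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (paths, fallback_reason); reason is None when no fallback happened."""
    if not file_paths:
        # Nothing to expand: expansion was never attempted, so no fallback.
        return [], None
    if code_graph is None:
        return list(file_paths), "absent"
    expander = getattr(code_graph, "expand_file_paths_via_graph", None)
    if expander is None:
        return list(file_paths), "missing_method"
    try:
        return list(expander(file_paths)), None
    except Exception as exc:  # any expander failure degrades to direct pins
        return list(file_paths), f"exception:{type(exc).__name__}"


def tag_sources(tags: Sequence[str], reason: Optional[str]) -> List[str]:
    """Append the bare advisory tag; existing tags are never replaced."""
    result = list(tags)
    if reason is not None:
        result.append("graph_unavailable")
    return result
```

The granular reason string would go to telemetry and the log line, while the response only ever carries the bare `"graph_unavailable"` tag.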
Tests
-----
4 new cases in `tests/test_preflight_graph_expansion.py`:

- test_preflight_fallback_absent_code_graph_tags_graph_unavailable — ctx with code_graph=None → response carries the tag, telemetry counter reason="absent"
- test_preflight_fallback_expander_raises_warns_and_tags — stub expander raises RuntimeError → response carries the tag, `caplog` captures WARN-level log with `[preflight:fallback]` substring, telemetry counter reason="exception:RuntimeError"
- test_preflight_successful_expansion_does_not_tag_graph_unavailable — regression guard: clean expansion path must NOT carry the tag (no false alarms)
- test_preflight_empty_file_paths_does_not_tag_graph_unavailable — empty file_paths short-circuits before expansion check; the "expansion was never attempted" case is distinguishable from "attempted-and-fell-back"

Existing tests use containment assertions (`"region" in sources_chained`) not exact list equality, so additive `"graph_unavailable"` doesn't break them.

What's NOT in this PR
---------------------
Piece B (eager symbol-index initialization at server startup) is the follow-up commit on this branch. Lands separately so the response-shape change can ship without the adapter-lifecycle change. After both pieces land, the telemetry counter shipped here gives ongoing visibility into how often fallback engages in production.

Refs #243 (parent #173 / PR #174). Plan signoff via #243 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Piece B)
Pre-fix, the code-locator adapter had two cooperating problems that
made silent fallback the default:
1. `get_code_locator()` returned a FRESH `RealCodeLocatorAdapter`
per call. Caching was absent.
2. `_ensure_initialized()` was lazy — first tool call paid the
index-build cost AND could race the index check on concurrent
dispatch (e.g. preflight + bind landing in parallel after
server boot).
Together: every silent fallback in the production runtime was
"hot" because the adapter was being rebuilt + rechecked on every
call. Piece A (#283 commit 3c9730f) made the fallback loud at the
response layer; Piece B closes the upstream cause.
Three changes
-------------
adapters/code_locator.py
- Singleton-by-REPO_PATH cache via `_INSTANCE_CACHE: dict[str,
RealCodeLocatorAdapter]`. Path resolved through `Path.resolve()`
so symlink + relative-path callers cache-hit consistently.
Multi-repo correctness preserved (any test that swaps REPO_PATH
mid-process gets a fresh adapter for the new path).
- New `reset_code_locator_cache()` test-only hook, mirroring
`adapters.ledger.reset_ledger_singleton`.
- New `async def RealCodeLocatorAdapter.initialize()` — wraps
sync `_ensure_initialized()` in `loop.run_in_executor(None, ...)`
so the cold-init path doesn't block the event loop. Idempotent
on already-initialized adapters.
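A minimal sketch of the cache and the async wrapper described for adapters/code_locator.py. `LocatorAdapter` is a hypothetical stand-in for `RealCodeLocatorAdapter`, the index build is reduced to a counter, and `get_code_locator` takes the repo path as an argument here purely for illustration (the real function's signature is not shown in this commit).

```python
import asyncio
from pathlib import Path
from typing import Dict


class LocatorAdapter:
    """Hypothetical stand-in for RealCodeLocatorAdapter."""

    def __init__(self, repo_path: str) -> None:
        self.repo_path = repo_path
        self._initialized = False
        self.init_calls = 0  # counts real index builds, for demonstration only

    def _ensure_initialized(self) -> None:
        if self._initialized:
            return  # idempotent on already-initialized adapters
        self.init_calls += 1
        self._initialized = True

    async def initialize(self) -> None:
        # Run the blocking init in the default executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._ensure_initialized)


_INSTANCE_CACHE: Dict[str, LocatorAdapter] = {}


def get_code_locator(repo_path: str) -> LocatorAdapter:
    """Return one adapter per resolved repo path."""
    key = str(Path(repo_path).resolve())  # symlink and relative callers share a key
    adapter = _INSTANCE_CACHE.get(key)
    if adapter is None:
        adapter = _INSTANCE_CACHE[key] = LocatorAdapter(key)
    return adapter


def reset_code_locator_cache() -> None:
    """Test-only hook: drop every cached adapter."""
    _INSTANCE_CACHE.clear()
```

An exception raised inside `_ensure_initialized` would surface from the `await` unchanged, since `run_in_executor` propagates the worker's exception to the awaiting coroutine.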
server.py
- `serve_stdio()` calls `await get_code_locator().initialize()`
between the dashboard sidecar start and the consent-notice block.
- **Fail-loud per #243 phase-2 signoff Q3** — explicit `except
RuntimeError as exc:` re-raises after printing an actionable
stderr message (`"Run: python -m code_locator index <repo>"`).
The outer try/finally still runs the `SERVER_SHUTDOWN` audit
emit, so operators get a clean event AND a clear actionable
error. No more silent degradation.
tests/test_preflight_graph_expansion.py — 4 new tests
- test_get_code_locator_returns_same_instance_per_repo_path
(singleton + reset behavior across two REPO_PATHs)
- test_initialize_succeeds_when_index_present
(idempotent on already-initialized adapter)
- test_initialize_fails_loudly_when_index_empty
(RuntimeError from `_ensure_initialized` propagates through the
async wrapper — doesn't get swallowed)
- test_serve_stdio_refuses_boot_on_empty_index
(boot-path level: with everything else stubbed healthy, an
empty index aborts `serve_stdio()` with the expected
RuntimeError)
Local smoke tests
-----------------
- Singleton + reset_code_locator_cache: 4 assertions pass
(cache hit on same path, distinct instance on new path, fresh
after reset, second call after reset stays cached)
- Async `initialize()`: re-raises RuntimeError on stubbed
`_ensure_initialized` failure; idempotent no-op on
already-initialized adapter
- ruff check + ruff format --check + mypy all green on touched files
What's NOT in this PR
---------------------
Nothing — Piece A (commit 3c9730f) and Piece B (this commit) together
close #243's full scope. PR will open with both pieces. Telemetry
counter shipped in Piece A gives ongoing production visibility into
how often fallback engages post-merge.
Refs #243 (parent #173 / PR #174). Plan signoff via
#243 (comment).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…backs feat(preflight): eliminate silent graph-expansion fallbacks (#243)
Replaces the dashboard image at the bottom of "How It Feels" with a three-beat demo video section (ingest -> preflight -> ratify async) referencing GitHub user-attachments URLs so videos render as inline players. Moves the "Star on GitHub" banner from the top header to a centered placement immediately after the demo, turning it into a post-demo conversion beat instead of a misaligned header element. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itch Replaces the single-line MCP-server description with a position-take opener: paragraph 1 names the failure mode (agreements emerge mid-flight, never reach a doc); paragraph 2 introduces Bicameral MCP as a spec compliance layer that captures both formal source materials (transcripts, PRDs, Slack) and undiscussed mid-implementation decisions to be ratified async by the product owner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs(README): demo video section + relocate star CTA mid-doc
… + decision_id alias fix The banner tests previously used MagicMock for ctx and AsyncMock for ledger, returning hand-crafted dicts. They stayed green even when get_decisions_by_status silently returned decision_id=None for every row (the SQL selected an undefined field — see ledger.adapter:584). Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual get_decisions_by_status query. The first sociable run surfaced the latent bug, which is fixed in this commit by aliasing type::string(id) AS decision_id (matches the pattern at queries.py:167, 228, 404, 512). Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock primitives) are left as-is and now documented inline. Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying the preference: SimpleNamespace ctx + real adapter for handler/ledger tests, narrow seams only when a collaborator can't be run in tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dleware-tests # Conflicts: # CHANGELOG.md # README.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-tests test(sync_middleware): sociable banner tests + decision_id alias fix
…gate (#58) Sibling of the M2 grounding-recall eval (#284). Phase A is **measurement only** — no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize." Per the Phase 2 spec posted on #58 (all four open questions defaulted to recommended): Q1 dataset size → 25 cases hand-curated, matches M2's 23 Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic Q4 ledger persistence → per-run temp + memory:// (per-case freshness) Three measurement axes (deliberately split for diagnosis) --------------------------------------------------------- overall recall surfaced / (surfaced + missed) gate ≥ 0.70 per-mode recall same, sliced by miss_mode gate ≥ 0.50 fire rate response.fired / total gate ≥ 0.60 Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them. Files ----- tests/fixtures/preflight_m6/dataset.py (412 LOC) 25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status. tests/eval/_preflight_m6_seeder.py (231 LOC) Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). 
Reset code-locator + ledger singletons before AND after so the next case starts clean. tests/eval_preflight_m6_recall.py (274 LOC) Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer. tests/eval_preflight_m6_summary.py (162 LOC) Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors. tests/test_preflight_m6_eval.py (267 LOC) 16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock — so any added/removed field on the response shape fails the test honestly. .github/workflows/test-mcp-regression.yml (+31 LOC) New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic. CHANGELOG.md (+2 lines) [Unreleased] / Added entry. Local verification ------------------ - 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.) 
- Dataset import + _validate_dataset() pass — 25 cases (8/8/9) - Runner --help renders cleanly - Summary renderer smoke-tested on synthetic JSON — per-mode table + missed-case detail render correctly with emoji gates - ruff check + ruff format --check + mypy all green on touched files What's NOT in this PR (intentionally — Phase B gating) ------------------------------------------------------ - Any runtime change to handle_preflight or _region_anchored_preflight - Skill changes (no agent-facing contract change in Phase A) - Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred) - Semantic search layer (Phase B candidate, deferred) - LLM reranker (Phase B candidate, deferred) - Real-corpus eval (synthetic first; corpus follow-up if needed) After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker. Refs #58. Plan: plan-58-preflight-decision-detection.md. Phase 2 spec signoff: #58 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
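The three measurement axes can be sketched as a small aggregator. This is an illustration of the stated formulas, not the code in `tests/eval_preflight_m6_recall.py`: the `(miss_mode, outcome, fired)` row shape and the `aggregate` name are assumptions, and so is counting error rows in the fire-rate denominator.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, bool]  # (miss_mode, outcome, fired); hypothetical shape


def _recall(surfaced: int, missed: int) -> Optional[float]:
    total = surfaced + missed
    return surfaced / total if total else None


def aggregate(rows: Iterable[Row]) -> Dict[str, object]:
    """Compute overall recall, per-mode recall, error count, and fire rate."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"surfaced": 0, "missed": 0, "error": 0}
    )
    fired = 0
    total = 0
    for mode, outcome, did_fire in rows:
        counts[mode][outcome] += 1
        fired += int(did_fire)
        total += 1
    surfaced = sum(c["surfaced"] for c in counts.values())
    missed = sum(c["missed"] for c in counts.values())
    return {
        # Errors are excluded from the recall denominator but reported separately.
        "overall_recall": _recall(surfaced, missed),
        "per_mode_recall": {
            mode: _recall(c["surfaced"], c["missed"]) for mode, c in counts.items()
        },
        "errors": sum(c["error"] for c in counts.values()),
        "fire_rate": fired / total if total else None,
    }
```

A mode whose cases all errored yields `None` rather than 0.0, so an infrastructure failure is not reported as an agent miss.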
feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58)
…#58 followup)

PR #304's first CI baseline produced overall recall 0.000 with 14/25 cases erroring — root cause: the M6 seeder runs 25 cases back-to-back in a single process, and the LLM-08 ingest rate limiter (#216, burst=10 / refill=1.0/s) refuses cases 12+ with `_IngestRefused("rate_limit_exceeded")`. Math: 10 initial tokens + ~1 refill while seeding the first 11 cases = 11 cases through, then 14 cases (U4-U8 + all 9 T*) erred.

The rate limiter is for production agent-loop safety, not eval throughput. There's already a documented env var to disable it (see `handlers.ingest._check_rate_limit` docstring): ``BICAMERAL_INGEST_RATE_LIMIT_DISABLE`` truthy → bucket check is short-circuited. Setting it in the seeder's per-case env setup (saved + restored like `REPO_PATH` and `SURREAL_URL`) is the documented path.

Symptom before this fix (post-#304 CI on dev):

    M6 preflight retrieval recall eval — 25 cases
    overall recall : 0.000   errors: 14
    transitive_relevance : 0/9 surfaced, 9 errors  ← all rate-limited
    unbound_decision     : 0/8 surfaced, 5 errors  ← last 5 rate-limited
    vocabulary_mismatch  : 0/8 surfaced, 0 errors  ← first 8, ran clean

Expected after this fix: vocabulary_mismatch stays 0/8 surfaced (that's the honest BM25-can't-bridge-vocab baseline the eval was designed to surface). transitive_relevance + unbound_decision should produce non-zero recall once the seeder doesn't trip the rate limiter.

Belt-and-suspenders alternatives considered:

- clear the `_RATE_LIMIT_REGISTRY` dict between cases — works but reaches into private state and skips the env-var contract
- sleep between cases to allow refill — works but slow + hides the fact that the rate limiter isn't appropriate for evals
- lower burst/refill via `.bicameral/config.yaml` in the synthetic repo — works but requires every Phase B eval surface to re-author the same config

The env-var path is the documented API and one line.
Smoke verification ------------------ - 16/16 sociable unit tests pass on the classifier + aggregator - ruff check + format + mypy all green on the touched file Refs #58 (Phase A baseline). Followup to PR #304. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
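The save-and-restore pattern for the seeder's per-case environment might look like the sketch below. The `patched_env` helper is hypothetical and the seeder's actual mechanism is not shown in the commit; only the variable name `BICAMERAL_INGEST_RATE_LIMIT_DISABLE` comes from the source.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


@contextmanager
def patched_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Temporarily set environment variables, restoring prior values on exit."""
    saved: Dict[str, Optional[str]] = {k: os.environ.get(k) for k in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)  # variable did not exist before
            else:
                os.environ[key] = old
```

A seeder could then wrap each case in `with patched_env({"BICAMERAL_INGEST_RATE_LIMIT_DISABLE": "1"}):` so the override never leaks into later tests, even when the case raises.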
Replaces the existing assets/bicameral-hero.png with a new visual that illustrates the product as a double-entry ledger for AI-assisted product development — PM and Dev agents each running a Bicameral MCP server, both synced through a shared Team Ledger, with a live Decision Ledger panel showing mixed signoff/code states (including ratified-but-not- reflected, reflected-but-not-ratified, and drifted rows) and the three core pillars (decisions first-class, two-sided ledger, escalation over recommendation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ks to canonical skills/ The .claude/skills/bicameral-*/SKILL.md files were tracked duplicates of skills/bicameral-*/SKILL.md that drifted independently. PRs frequently touched skills/ but not the mirror, so the mirror lagged ~3 feature commits (3c9730f, d1e3914, 79b872b) and 2+ weeks behind canonical. Beyond stale duplicates: the drift was bidirectional. 7 skills existed only in canonical (.claude/skills/-missing → never resolved as slash commands) and 7 only in the mirror (no canonical source → became de-facto canonical despite CLAUDE.md saying otherwise). claude-mem auto-writes into CLAUDE.md files also drifted (ingest and preflight CLAUDE.md had different "Recent Activity" entries between the two paths). This change: 1. Canonicalizes the 7 mirror-only skills via git mv into skills/ (bicameral-{brief, context-sentry, doctor, guided, scan-branch, search, status}). 2. Replaces every .claude/skills/bicameral-X with a symlink to ../../skills/bicameral-X (22 symlinks total). Claude Code's slash-command resolver follows the symlinks transparently — confirmed in-vivo during implementation when the resolver re-indexed and surfaced all 22 skills after the swap. 3. Repoints tests/CI/docs at canonical skills/ paths (tests/_extract_headless.py SKILL_MD_PATH; tests/regen_extraction_fixtures.py docstring; tests/eval_decision_relevance.py docstring; tests/e2e/README.md; .github/workflows/test-mcp-regression.yml comment; README.md slash-command row; docs/DEV_CYCLE.md canonical-source note; docs/v2-desync-optimization-guide.md doctor SKILL.md references). 4. Updates CLAUDE.md to describe the symlink layout (drop "stale duplicates slated for deletion" wording) and adds a Windows note: contributors on Windows must set core.symlinks=true (or use WSL) so the mode-120000 entries materialize as symlinks rather than text files containing the target path. 5. Ticks off TODO.md:169 — the unresolved decision is now made. Refs: TODO.md:169 (now ticked), CLAUDE.md "Canonical Skill Source". 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore(skills): replace .claude/skills/bicameral-* mirrors with symlinks to canonical skills/
…precondition) Phase 4 of #87 broadens the preflight dedup cache key from `topic` alone to `(topic_norm, file_paths_hash, ledger_revision)` so a same-topic call within the 5-min window correctly invalidates when underlying ledger state changed (M7a/b/c xfailed cases). `ledger_revision` derives from `MAX(updated_at)` over the decision table — this PR is the schema half of that contract; the handler-side broadening lands as a follow-up on branch `87-preflight-dedup-key`. Per Kevin's signoff (B2 approach, gh issue #87 comment thread): additive schema bump, L1 risk, no tool contract change, falls back gracefully. The field is `option<datetime>` rather than non-optional `datetime` because DEFINE FIELD against existing rows leaves them as NONE until the migration backfill runs — same precedent as v8→v9 (`decision_level` is `option<string>` for identical reasons). Phase 4's MAX query can COALESCE(updated_at, created_at) if it wants strict-non-NULL semantics. Schema changes (ledger/schema.py): - SCHEMA_VERSION 17 → 18 + compatibility-map entry - DEFINE FIELD updated_at ON decision TYPE option<datetime> DEFAULT time::now() - DEFINE INDEX idx_decision_updated_at ON decision FIELDS updated_at - _migrate_v17_to_v18: idempotent DEFINE + backfill UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE Call-site audits (7 UPDATEs now carry `, updated_at = time::now()`): - ledger/queries.py:602 upsert_decision canonical-dedup UPDATE path - ledger/queries.py:1072 update_decision_status - ledger/queries.py:1163 update_decision_level - ledger/adapter.py:1394 apply_ratify (signoff write) - ledger/adapter.py:1428 apply_supersede (old decision signoff-freeze) - handlers/resolve_collision.py:99 link_parent (cross-level parent link) - handlers/resolve_collision.py:128 collision_pending clear (proposed signoff) CREATE in queries.py:638 needs no edit — the DEFAULT picks up time::now() on INSERT automatically. 
Tests (tests/test_v18_decision_updated_at.py, 11 tests, all passing): - Schema version advanced to v18 - CREATE populates updated_at via DEFAULT - Each of the 7 UPDATE call sites bumps updated_at (one test each) - Index supports ORDER BY updated_at DESC - Migration backfill: pre-v18 rows with NONE → created_at Sociable substrate over memory:// per CLAUDE.md guidance — real SurrealDBLedgerAdapter + real LedgerClient, no mocks. The drift this guards against is the kind solitary tests miss: a mock would happily return whatever updated_at the test expects; only a real ledger UPDATE proves the SQL actually carries the new column. Regression check passes: tests/test_v15_migration.py, test_schema_persistence.py, test_schema_recoverable_errors.py, test_sync_middleware.py, test_codegenome_continuity_service.py, test_compliance_check_schema.py, test_ledger_bicameral_meta_migration.py — 50/50 pass. The single test_alpha_flow.py failure (test_code_edit_without_rebind_marks_drifted) reproduces on origin/dev without this PR's changes — pre-existing, not introduced here. Refs #87 (Phase 4 precondition per spec signoff). Out of scope: dedup key broadening itself (#87 Phase 4), telemetry (#87 Phase 5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
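For orientation, the broadened dedup key that this schema change prepares for could be built along these lines. The handler-side change lands separately, so everything here is an assumption: the `dedup_key` name, the whitespace/case normalization, the SHA-256 hash over sorted paths, and passing `ledger_revision` in as an opaque string derived from `MAX(updated_at)`.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def dedup_key(
    topic: str, file_paths: Iterable[str], ledger_revision: Optional[str]
) -> Tuple[str, str, Optional[str]]:
    """Build a (topic_norm, file_paths_hash, ledger_revision) cache key (sketch)."""
    # Normalize case and whitespace so trivially different topics collide.
    topic_norm = " ".join(topic.lower().split())
    # Sort and de-duplicate so path order does not change the hash.
    joined = "\n".join(sorted(set(file_paths)))
    file_paths_hash = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return (topic_norm, file_paths_hash, ledger_revision)
```

Because the revision is part of the key, any write that bumps `updated_at` produces a different key and the cached preflight result is no longer reused.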
…d_at (#308 CI fix)

CI surfaced two issues on PR #308:

1. Ruff I001: tests/test_v18_decision_updated_at.py import block was not alphabetically sorted within the `from ledger.queries import (...)` group. Auto-fixed.
2. tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span] blew up on the v17→v18 backfill:

       SurrealDB rejected query: Found NONE for field `created_at`, with record `decision:dec_1`, but expected a datetime
       SQL: UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE

   The v3 fixture creates `decision:dec_1` via raw CREATE without setting `created_at`. Once init_schema applies `DEFINE FIELD created_at ON decision TYPE datetime`, ANY UPDATE on that row re-validates the row and trips the type assertion, even one that doesn't touch created_at. The earlier draft used `SET updated_at = created_at` which read the corrupt field directly; even after switching to time::now() in the SET clause, the implicit re-validation on UPDATE still failed.

## Fix

Switch the backfill from a single bulk UPDATE to a per-row loop with try/except, mirroring `_clean_yields_legacy_rows` (which uses the same tolerance pattern for v3-era stale yields edges):

```python
ids = await client.query("SELECT id FROM decision WHERE updated_at IS NONE")
for row in ids:
    try:
        await client.execute(f"UPDATE {row['id']} SET updated_at = time::now()")
        healed += 1
    except Exception:
        skipped += 1  # row has other corrupt non-optional fields
        logger.warning(...)
```

Rows that fail stay with `updated_at=NONE` and MAX(updated_at) skips them. Harmless for the dedup-cache marker (#87) since the marker only needs monotonicity, not coverage: the new decisions created post-v18 get DEFAULT time::now() and dominate MAX().

The SELECT itself reads only `id`, so it doesn't trip the type assertion on `created_at`. The WHERE clause on `updated_at IS NONE` is safe because `updated_at` is `option<datetime>` (intentionally optional, same precedent as v8→v9 `decision_level`).

## Files

- ledger/schema.py: _migrate_v17_to_v18 uses a per-row UPDATE with try/except; emits healed/skipped counts to the logger
- tests/test_v18_decision_updated_at.py
  - Import sort fix (ruff I001)
  - test_v18_migration_backfills_legacy_rows_with_none_updated_at: call _migrate_v17_to_v18 directly instead of inlining the (now multi-statement) backfill body
  - test_v18_migration_backfill_tolerates_legacy_rows_with_none_created_at (NEW): inspects the migration source to guard against future drafts that reintroduce a created_at reference in the SET clause

## Verification

- tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span] PASS
- tests/test_v18_decision_updated_at.py: 13/13 PASS (12 originals + 1 new regression guard)
- 94/94 in the broader schema/migration/dedup cluster
- `python3 -m ruff check`: clean on all touched files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's ruff job runs BOTH `ruff check` AND `ruff format --check`. The former was clean after the import-sort fix, but the latter flagged ledger/schema.py and tests/test_v18_decision_updated_at.py for reformatting. Applied `ruff format` in place — pure whitespace / line-length normalization, no semantic change. Verified: `ruff format --check` clean on both files locally; 14/14 v18 + legacy-fixture tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nnel-autodetect fix(setup): auto-detect nightly channel from .dev install version
Surfaces the setup-wizard nightly-channel auto-detect fix (PR #381) to design partners. Without it, anyone who installed via `pipx install --pip-args=--pre bicameral-mcp` ran `bicameral-mcp setup` into a config hardcoded to `channel: stable`, so `bicameral.update` silently never offered the nightly upgrade path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…026.5.16.dev024452 chore(nightly): bump RECOMMENDED_NIGHTLY_VERSION to 2026.5.16.dev024452
…22→v23

The v18→v19 migration only seeded the bicameral_meta singleton when the row was absent. When _write_wire_format_sentinel had already written a row (v16's adapter.connect path), the seed branch was skipped and decision_revision stayed NONE because SurrealDB v2's DEFAULT 0 does not backfill existing rows. Every subsequent decision UPDATE then blew up the decision_revision_bump trigger with "Cannot perform addition with 'NONE' and '1'", which _migrate_v22_to_v23's per-row try/except silently swallowed, so the decision_level classification migration "succeeded" while skipping every legacy row.

Fix in two places:
- _migrate_v18_to_v19: UPDATE existing rows to 0 when the field is NONE (root cause; prevents recurrence for any DB upgrading from <v19).
- _migrate_v22_to_v23: same backfill at the top as defense-in-depth so the per-row UPDATEs below land their classifications instead of silently failing.

SCHEMA_VERSION stays at 23: the buggy nightly (dev15124) was only ever downloaded internally, so no forward-fix migration is needed.

Tests:
- test_migrate_v18_to_v19_backfills_decision_revision_on_preexisting_row: asserts a sentinel row with NONE decision_revision is rescued, and a real decision CREATE bumps the counter (trigger contract intact).
- test_v23_classifies_when_decision_revision_was_none: asserts v22→v23 classifies legacy rows when entering with the broken counter state (no silent skips).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a ## Linked decisions section to the §4.3 PR body template, parallel to ## Linked issues, and codifies the rule: every PR authored by a BicameralAI org member references at least one decision:<surrealdb-id> so reviewers can verify the change is grounded in an explicit decision rather than ambient assumption. External contributors are exempt — bicameral access is org-internal, and gating community PRs on internal tooling is the wrong tradeoff. The reviewing maintainer shepherds the decision ingest on the contributor's behalf at merge time. This is a doc-only rule; CI enforcement (lint that an org-member PR body contains a decision:<id> token) can follow as a separate PR if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI ruff format --check caught three files that were lint-clean but not format-clean. No semantic change — line breaks collapse to match the project's max-line-length policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d-decision docs(dev-cycle): require linked bicameral decision on org-member PRs
…-backfill fix(schema): backfill bicameral_meta.decision_revision in v18→v19 + v22→v23
Pre-fix, serve_stdio awaited get_code_locator().initialize() inline before opening the MCP stdio transport. On a 150MB+ symbol-index DB the cold path took ~45s (sqlite-vec open + tree-sitter load + BM25 pickle load), blowing past Claude Code's 30s MCP initialize timeout on real-world repos: the server "started" but the JSON-RPC handshake never landed and the client gave up.

Fix:
- ``RealCodeLocatorAdapter.initialize_in_background()``: schedules ``_ensure_initialized`` in the default executor via an asyncio Task, returns immediately. A done-callback prints the bare error to stderr on failure so the operator still sees the actionable "Run: python -m code_locator index <repo_path>" hint that #243 wrote.
- ``_ensure_initialized`` now serializes its body via a threading.Lock. Sync callers from worker threads (the ``asyncio.to_thread(ctx.code_graph.<method>, ...)`` pattern every tool handler already uses) block on the lock until the background Task finishes, then see the post-init state and proceed. No callsite needs to know about the background Task.
- ``_run_init_body`` extracted from ``_ensure_initialized`` so tests can monkey-patch the slow body without bypassing the lock/state machine; the lock + Task glue is what's under test.
- ``wait_until_ready()``: optional async gate for callers that want to explicitly await readiness from an async context and surface a structured error to the MCP client on failure.
- ``server.py:serve_stdio``: replaces ``await get_code_locator().initialize()`` with ``get_code_locator().initialize_in_background()`` (synchronous, no await). Stderr message rewritten to reflect the new contract.

Trade-off: #243's "server refuses to boot when index is empty" becomes "first code-locator tool call fails loudly when index is empty." Operator still sees the failure on stderr at boot via the done-callback. The fail-loud contract from #243 phase-2 signoff Q3 is preserved, just relocated from boot-time to first-tool-call-time.

Measured: JSON-RPC ``initialize`` reply now lands in ~16ms on this repo's own 150MB code-graph.db (was ~45s).

Closes #380

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
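The lock-plus-background-future pattern this commit describes can be sketched in isolation. The class below is a hypothetical stand-in, not `RealCodeLocatorAdapter`: the method names echo the commit message, but the bodies, the `slow_init` callable, and the `demo` coroutine are assumptions made for illustration.

```python
import asyncio
import sys
import threading
import time


class BackgroundInitializer:
    """Hypothetical stand-in for an adapter with a slow, one-time init."""

    def __init__(self, slow_init):
        self._slow_init = slow_init
        self._lock = threading.Lock()
        self._ready = False
        self._future = None

    def ensure_initialized(self):
        # Sync callers (e.g. worker threads) block on the lock until init is done.
        with self._lock:
            if not self._ready:
                self._slow_init()
                self._ready = True

    def initialize_in_background(self):
        # Must be called from a running event loop; returns without waiting.
        loop = asyncio.get_running_loop()
        self._future = loop.run_in_executor(None, self.ensure_initialized)
        self._future.add_done_callback(self._report_failure)
        return self._future

    @staticmethod
    def _report_failure(future):
        # Surface init errors on stderr so the operator sees them at boot.
        if not future.cancelled() and future.exception() is not None:
            print(f"init failed: {future.exception()}", file=sys.stderr)

    async def wait_until_ready(self):
        # Optional async gate: re-raises the init error, if any.
        if self._future is not None:
            await self._future


async def demo():
    calls = []

    def slow_init():
        time.sleep(0.2)  # stands in for the slow index load
        calls.append("init")

    init = BackgroundInitializer(slow_init)
    started = time.monotonic()
    init.initialize_in_background()
    scheduling_latency = time.monotonic() - started
    # A tool handler running in a worker thread blocks on the lock, then proceeds.
    await asyncio.to_thread(init.ensure_initialized)
    await init.wait_until_ready()
    return scheduling_latency, calls
```

Whichever thread wins the lock runs the slow body exactly once; the other sees the ready flag and returns, so callers never need to know the background work exists.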
…-handshake fix(server): move code-locator init off MCP stdio handshake (#380)
…fe upsert

The v23 dedup index `idx_input_span_dedup` was UNIQUE on `(source_type, source_ref, text)`. Phase B-1 (#221) introduced archive_key and writes text='' for archive-keyed rows, so two distinct archive_keys sharing (source_type, source_ref) collided on the empty-text slot. The collision surfaced as a 500 on the dashboard's /history endpoint once any second archive-keyed write to the same source bucket landed (transitively via ensure_ledger_synced → link_commit → ingest paths).

Changes:
- Schema v23→v24: extend idx_input_span_dedup with archive_key as a 4th field. Non-destructive: adding a discriminator can only relax uniqueness, so all rows valid under the old index remain valid. Migration uses DEFINE INDEX OVERWRITE via _execute_define_idempotent (re-runnable). init_schema's OVERWRITE pass keeps the in-source DEFINE in sync on every connect.
- upsert_input_span: refactored into a thin retry wrapper around _upsert_input_span_once. The wrapper retries on the SurrealDB v2 MVCC "failed to commit transaction" string (bounded, no backoff, because the conflicting writer has already committed by the time we see the error). The inner body now catches unique-index "already contains" violations on both the archive-keyed and legacy text-only paths, re-SELECTing to return the winning row's id instead of crashing.
- 6 new sociable tests pin: archive_key coexistence under v24, idempotent same-key dedup, concurrent-same-key race convergence, legacy text-path race safety, v24 migration idempotency, and a fixture that pins the v2 MVCC error substring so a future surrealdb-py bump breaks loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
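The claim that adding a discriminator "can only relax uniqueness" can be checked with plain tuples, independent of SurrealDB. In this sketch the field lists mirror the v23 and v24 index definitions from the commit message; the helper `unique_violations` and the sample row values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def unique_violations(rows: List[Row], fields: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return key tuples that occur more than once under a unique index on `fields`."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


V23_FIELDS = ("source_type", "source_ref", "text")
V24_FIELDS = V23_FIELDS + ("archive_key",)

# Two archive-keyed rows in the same source bucket: text is '' for both.
rows = [
    {"source_type": "transcript", "source_ref": "call-1", "text": "", "archive_key": "hash-a"},
    {"source_type": "transcript", "source_ref": "call-1", "text": "", "archive_key": "hash-b"},
]

v23_collisions = unique_violations(rows, V23_FIELDS)
v24_collisions = unique_violations(rows, V24_FIELDS)
```

Any two rows that differ on the three-field key still differ on the four-field key, so a dataset that was valid under the old index cannot become invalid under the new one.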
…bstone

Implements decision:i4wafafzowm3ai5eyhgs (ratified 2026-05-15).

bicameral.remove_decision now physically removes the decision row plus every reference to it (binds_to / yields / supersedes / context_for / about edges + the compliance_check verdict cache for the decision) and orphans child decisions cleanly by NULLing parent_decision_id. The decision_removed.completed event captures a full pre-deletion snapshot in the journal so the action is recoverable from the event log alone; this is the "soft audit trail" that replaces the tombstone row model.

Motivation: the soft-delete model was intended as a negative-signal mechanism (rows with signoff.state="removed" warn future agents away from re-introducing the same wrong decision). In practice the dominant call shape is janitorial (test pollution, accidentally-ingested rows, retracted ideas with no learning value), where tombstones become friction that surfaces in preflight, occupies dashboard slots, and gets re-bound by drift sweeps. Supersession remains the right tool when a persistent negative signal is actually wanted.

Contract changes:
- RemoveDecisionResponse: drops `signoff` and `projected_status` (the row is gone, so there's no signoff dict to return and the projected status is meaningless). Promotes the relevant fields to top level: was_new, event_logged, removed_at, previous_state, reason.
- Idempotency: missing decision_id returns was_new=False without raising. The matching event in the journal is the canonical record of any prior removal. Trade-off: typos for never-existed ids look like idempotency, but the SKILL.md flow (read history first, then call) catches that.
- server.py tool description updated to match.
- skills/remove-decision/SKILL.md rewritten end-to-end; .claude/skills copy synced.

Out of scope (separate decisions):
- handlers/remove_source.py cascade still soft-deletes yielded decisions. That's a different tool's contract; touching it should be its own decision.
- dashboard.html "already-removed" button-disable guard remains as defensive dead code; it is cosmetic-only and out of scope.

Tests:
- tests/test_remove_decision.py rewritten as sociable (real SurrealDBLedgerAdapter over memory://) per pilot/mcp/CLAUDE.md. 9 tests covering: reason validation, missing-id idempotency, full edge+cache cleanup, child orphan, second-call no-op, event emission/skipping, and idempotent no-event.
- tests/test_dogfood_label_propagation.py: removed the obsolete monkeypatches for handlers.remove_decision.project/update_decision_status (functions no longer imported by the new handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
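The hard-delete contract above (snapshot to the journal, remove edges, orphan children, idempotent on a missing id) can be illustrated against an in-memory stand-in. This is a sketch under stated assumptions, not the real handler: `InMemoryLedger`, the `src`/`dst` edge keys, and the `signoff_state` field are hypothetical, and only the response field names and the `decision_removed.completed` event name come from the commit message.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class InMemoryLedger:
    """Hypothetical stand-in for the ledger: rows, edges, and an event journal."""
    decisions: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Dict[str, str]] = field(default_factory=list)
    journal: List[Dict[str, Any]] = field(default_factory=list)


def remove_decision(ledger: InMemoryLedger, decision_id: str, reason: str) -> Dict[str, Any]:
    if not reason.strip():
        raise ValueError("reason must be non-empty")
    row = ledger.decisions.get(decision_id)
    if row is None:
        # Idempotent path: nothing to delete, nothing journaled, no raise.
        return {"was_new": False, "event_logged": False, "removed_at": None,
                "previous_state": None, "reason": reason}
    touching = [e for e in ledger.edges if decision_id in (e["src"], e["dst"])]
    removed_at = datetime.now(timezone.utc).isoformat()
    # Journal the full snapshot first so the removal is recoverable from the log.
    ledger.journal.append({
        "type": "decision_removed.completed",
        "snapshot": {"decision": dict(row), "edges": touching},
        "reason": reason,
        "removed_at": removed_at,
    })
    ledger.edges = [e for e in ledger.edges if decision_id not in (e["src"], e["dst"])]
    # Orphan children instead of cascading the delete.
    for other in ledger.decisions.values():
        if other.get("parent_decision_id") == decision_id:
            other["parent_decision_id"] = None
    del ledger.decisions[decision_id]
    return {"was_new": True, "event_logged": True, "removed_at": removed_at,
            "previous_state": row.get("signoff_state"), "reason": reason}


ledger = InMemoryLedger(
    decisions={
        "decision:parent": {"signoff_state": "ratified"},
        "decision:child": {"parent_decision_id": "decision:parent"},
    },
    edges=[{"kind": "binds_to", "src": "decision:parent", "dst": "code_region:1"}],
)
first = remove_decision(ledger, "decision:parent", "test pollution")
second = remove_decision(ledger, "decision:parent", "test pollution")
```

The second call finds no row and returns `was_new=False` without writing another event, so the single journal entry from the first call remains the record of the removal.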
…-delete work claude-mem regenerated the recent-activity tables for handlers/ and tests/ after today's remove_decision hard-delete implementation, and seeded new context files at skills/remove-decision/ and .claude/skills/remove-decision/ where the skill was edited. Purely auto-generated context — no code change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-suite contention

The safe-upsert retry loop landed at 5 attempts (ebcfeb4). Running the full regression batch surfaced a flake in test_concurrent_same_archive_key_race: three concurrent writers for the same archive_key occasionally exceed 5 MVCC retries when the test suite holds dozens of memory:// SurrealDB instances in the same process.

Each retry's SELECT short-circuits the moment the winning writer commits, so the cost is a trivial one RTT per attempt. Raising the limit to 10 absorbs the variance with massive headroom for production usage (where contention storms of this shape can't happen, since there is one DB per process). A proper fix (per-key write queue instead of optimistic retry) is tracked separately as a follow-up issue.

Also includes claude-mem auto-generated activity-log refreshes from this session (no code change in those files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
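A bounded, no-backoff retry of the kind described here might look like the sketch below. The error substring and the attempt count of 10 come from the commit messages; the function name `retry_on_mvcc_conflict`, the case-insensitive match, and the `flaky_write` coroutine are assumptions, and the real wrapper around `_upsert_input_span_once` may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")

# Substring quoted in the commit message for SurrealDB v2 MVCC conflicts.
MVCC_CONFLICT_MARKER = "failed to commit transaction"


async def retry_on_mvcc_conflict(
    op: Callable[[], Awaitable[T]], max_attempts: int = 10
) -> T:
    """Retry `op` immediately on MVCC conflicts; re-raise anything else."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return await op()
        except Exception as exc:
            if MVCC_CONFLICT_MARKER not in str(exc).lower():
                raise  # not a conflict: do not mask real failures
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} MVCC conflicts") from last_error


attempts = {"n": 0}


async def flaky_write() -> str:
    # Hypothetical writer that loses the race three times, then succeeds.
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise RuntimeError("Failed to commit transaction due to a read or write conflict")
    return "input_span:winner"


result = asyncio.run(retry_on_mvcc_conflict(flaky_write))
```

Skipping backoff is reasonable only because, as the commit notes, the conflicting writer has already committed by the time the error is seen; a per-key write queue would remove the need for retries altogether.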
…-handshake feat(remove_decision): hard-delete by default + v24 input_span dedup index
…17→v24 chain

Drains the dev → main backlog accumulated since v0.14.7. Cumulative release; non-destructive schema migration chain (v17→v24) applied automatically on first connect. Breaking change: bicameral.remove_decision contract is now hard-delete by default (decision:i4wafafzowm3ai5eyhgs).

Highlights:
- PII archive (#221 Phase A + B-1): operator-erasable PII surface keyed by content-hash; ingest writes verbatim text to the archive and leaves input_span.text='' with the v22 ASSERT enforcing exactly-one-of.
- Hard-delete remove_decision: soft-delete tombstone retired; full pre-deletion snapshot lives in the event journal.
- Constant-time revision counter (#87 Phase 6): bicameral_meta.decision_revision auto-bumped by DEFINE EVENT; replaces O(N) MAX(updated_at) scan in preflight dedup.
- bicameral.admin/query (#278 Phase 3), dashboard source view (#278 Phase 1), LocalDirectorySourceAdapter (#344), sync-and-brief team-mode (#279).
- Code-locator singleton + eager startup init (#243, #380): index work moves off the per-call hot path and off the MCP stdio handshake.
- Schema v17→v24 chain: all additive, non-destructive.

Three architectural decisions ratified for the doctrine follow-up PR: expand-only schema rule, feature-flag gating for new-schema-dependent code, DEV_CYCLE.md §10.5.1 amendment for triage eligibility.

Closes decision:i4wafafzowm3ai5eyhgs.

See CHANGELOG.md for the full Added / Changed / Fixed / Schema-migrations / Doctrine / Removed breakdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review skipped: too many files. This PR contains 218 files, which is 68 over the limit of 150.
… (triage eligibility)

Encodes the three architectural decisions ratified 2026-05-15 (decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9, decision:0ok1249n2tdrfud2a5j9):

§4.7: new subsection enforcing two complementary rules for any PR that touches ledger/schema.py or its _MIGRATIONS registry:

- §4.7.1: schema migrations must be expand-only. Destructive operations (REMOVE / DROP / breaking ALTER / tightening ASSERT) live in their own commits and ship in a later release after the prior reader surface is validated as gone from prod. Includes an allowed/forbidden table for reviewer ease. CI lint planned via scripts/lint_schema_destructive.py.
- §4.7.2: code paths that depend on new schema must be feature-flag gated and default OFF in prod (env var or .bicameral/config.yaml setting). Schema ships immediately; flag flips later in a separate release. If the experiment is killed, the flag never flips on and a follow-up cleanup migration drops the slot. Exception: invariant bugfixes (e.g. fixing a unique-index collision that breaks the dashboard for everyone) don't need flag-gating, since that's not feature surface.
- §4.7.3: concrete PR-review checklist for schema-touching PRs.

§10.5.1: triage eligibility rule rewritten. Previously: "schema-migrating changes are not triage-eligible" (blanket). Now: schema migrations CAN ride a triage release if they comply with §4.7 (expand-only AND feature code is flag-gated). The blanket ban is replaced by enumerated exclusions (destructive schema, flag-flip releases, breaking public-API changes, multi-PR epics, v1 patches).

Motivation: the prior rule was correct under the implicit assumption that schema and feature ship together, in which case you can't ship one without the other. Once §4.7 decouples them, schema can drain to main on every triage instead of accumulating on dev waiting for a "real" release. The current v18→v24 backlog (drained by the v0.15.0 release PR #388) is the symptom the prior rule produced; this amendment prevents recurrence.

Refs decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9, decision:0ok1249n2tdrfud2a5j9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
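The commit mentions a planned CI lint at scripts/lint_schema_destructive.py without describing it. The following is one way such a check could start, offered purely as an illustrative sketch: the regex, the function name `find_destructive_statements`, and the sample schema text are all assumptions, and a line-based pattern like this can flag REMOVE / DROP statements but cannot judge "breaking ALTER" or "tightening ASSERT", which would need human review.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: flags REMOVE <kind> and DROP <something> statements.
DESTRUCTIVE = re.compile(
    r"\b(REMOVE\s+(?:TABLE|FIELD|INDEX|EVENT)|DROP\s+\w+)\b", re.IGNORECASE
)


def find_destructive_statements(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, statement_prefix) for each destructive-looking line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = DESTRUCTIVE.search(line)
        if match:
            # Collapse internal whitespace so reports are uniform.
            hits.append((lineno, " ".join(match.group(1).upper().split())))
    return hits


sample_schema = (
    "DEFINE FIELD updated_at ON decision TYPE option<datetime>;\n"
    "remove  field legacy_flag ON decision;\n"
    "DEFINE INDEX idx_decision_updated_at ON decision FIELDS updated_at;\n"
)
findings = find_destructive_statements(sample_schema)
```

A CI job built on this idea would fail the build when `findings` is non-empty for a migration commit that is not explicitly marked as a destructive follow-up.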
# Conflicts:
#	CHANGELOG.md
#	RECOMMENDED_VERSION
#	pyproject.toml
## Summary

Cumulative release draining the `dev → main` backlog accumulated since v0.14.7. Lands:

- … `bicameral.remove_decision` (breaking, see below; closes `decision:i4wafafzowm3ai5eyhgs`)
- … `v17 → v24` (all additive, non-destructive)
- … `LocalDirectorySourceAdapter`, `sync-and-brief` team-mode
- … `decision_level` on ingest

Full breakdown in `CHANGELOG.md`'s `## v0.15.0` section.

## Linked issues
This release closes/refs (non-exhaustive; see commit `Closes #N` / `Fixes #N` keywords in the full log for the canonical set, all of which auto-close on merge):

- Closes #87, BicameralAI/bicameral-daemon#36, #209, BicameralAI/bicameral-daemon#9, BicameralAI/bicameral-daemon#23, BicameralAI/bicameral-daemon#22, #243, #272, #278, #279, #280, #281, #288, #301, #308, BicameralAI/bicameral-daemon#4, #332, #334, BicameralAI/bicameral-daemon#32, BicameralAI/bicameral-daemon#31, #340, #341, #342, #343, #344, #358, #362, #364, #380, #386
- Refs BicameralAI/bicameral-daemon#37, #232, #357 (subtasks of the test-infrastructure track land here; parent stays open)
- Refs BicameralAI/bicameral-daemon#2 (Ledger Locator RFC landed; full implementation deferred to v0.16.x)
## Linked decisions

- Closes decision:i4wafafzowm3ai5eyhgs: Default `bicameral.remove_decision` to hard delete; eliminate soft-delete tombstone state. Implementation in PR #386 (merged to dev).
- Refs decision:cp25jfz1nt6h3u2gjzmu: Schema migrations must be expand-only (doctrine; companion PR amends DEV_CYCLE.md prospectively).
- Refs decision:adklplvfhthkdch05pe9: New-schema-dependent code must be feature-flag gated (doctrine; companion PR).
- Refs decision:0ok1249n2tdrfud2a5j9: DEV_CYCLE.md §10.5.1 (triage eligibility) amendment (doctrine; companion PR).
## Plan / Audit / Seal

- … `origin/main..origin/dev` (24 fix / 26 feat / 50 merge / remainder docs+chore+test+style).
- … `v17 → v24` reviewed: every step is additive (new DEFINEs, new fields with defaults, new EVENTs). No `REMOVE` / `DROP` operations in the migration path. The `idx_input_span_dedup` change (v24) uses `OVERWRITE` and extends the field set, which is monotonically weaker than the prior index: every row valid before is still valid after.
- … `bicameral.remove_decision`), but the migration path has been exercised end-to-end against the prod ledger (the schema migration that fixes the dashboard `/history` 500 was applied to my local prod DB during testing, verified working).

## Breaking changes (operator-facing)
`bicameral.remove_decision` response shape changed. Dropped `signoff` and `projected_status`. Added `event_logged`, `removed_at`, `previous_state`, `reason`. The decision row + all references are now physically removed instead of flipped to `signoff.state="removed"`. Callers consuming the response should check the new top-level fields. Idempotent on missing decisions (`was_new=False`, no raise): the matching event in the journal is the canonical record of any prior removal.

## Schema migrations
Auto-applied on first connect. Non-destructive. Operators upgrading from v0.14.x see one-time migration log entries; no data loss.
- `decision.updated_at` + `idx_decision_updated_at`
- `bicameral_meta.decision_revision` + `DEFINE EVENT decision_revision_bump`
- … `input_span.archive_key`)
- … `text != '' OR archive_key != ''` on `input_span.text`
- … `decision.decision_level` for legacy rows
- `idx_input_span_dedup` extended with `archive_key`; `/history` collision fix

## Test plan
- `SURREAL_URL=memory:// pytest tests/test_phase2_ledger.py tests/test_phase3_integration.py tests/test_remove_decision.py tests/test_input_span_safe_upsert.py tests/test_remove_source.py tests/test_dogfood_label_propagation.py tests/test_pii_archive_schema_migration_b1.py tests/test_history_erasure_propagation.py tests/test_schema_recoverable_errors.py -q`: green on a current dev checkout.
- … `/history` returns 200 with the full 15-decision ledger.
- … `surrealkv://` ledger (the author's `~/.bicameral/ledger.db`) and verified the v24 index distinguishes two archive-keyed rows in the same `(source_type, source_ref)` bucket.
- … `pipx install bicameral-mcp==0.15.0`, run `bicameral-mcp setup`, ingest a sample transcript, observe the dashboard.

## Post-merge tasks
- … `v0.15.0` and push the tag.
- … `pipx upgrade bicameral-mcp` on design-partner machines (or wait for their next `bicameral.update`).

🤖 Generated with Claude Code