release: v0.15.0 — PII archive, hard-delete remove_decision, schema v17→v24 chain#388

Merged
jinhongkuan merged 156 commits into main from release/v0.15.0 on May 16, 2026
@jinhongkuan

Contributor

Summary

Cumulative release draining the dev → main backlog accumulated since v0.14.7. Lands:

Full breakdown in CHANGELOG.md's ## v0.15.0 section.

Linked issues

This release closes/refs (non-exhaustive — see commit Closes #N / Fixes #N keywords in the full log for the canonical set, all of which auto-close on merge):

Closes #87, BicameralAI/bicameral-daemon#36, #209, BicameralAI/bicameral-daemon#9, BicameralAI/bicameral-daemon#23, BicameralAI/bicameral-daemon#22, #243, #272, #278, #279, #280, #281, #288, #301, #308, BicameralAI/bicameral-daemon#4, #332, #334, BicameralAI/bicameral-daemon#32, BicameralAI/bicameral-daemon#31, #340, #341, #342, #343, #344, #358, #362, #364, #380, #386
Refs BicameralAI/bicameral-daemon#37, #232, #357 (subtasks of the test-infrastructure track land here; parent stays open)
Refs BicameralAI/bicameral-daemon#2 (Ledger Locator RFC landed; full implementation deferred to v0.16.x)

Linked decisions

Closes decision:i4wafafzowm3ai5eyhgs — Default bicameral.remove_decision to hard delete; eliminate soft-delete tombstone state. Implementation in PR #386 (merged to dev).
Refs decision:cp25jfz1nt6h3u2gjzmu — Schema migrations must be expand-only (doctrine; companion PR amends DEV_CYCLE.md prospectively).
Refs decision:adklplvfhthkdch05pe9 — New-schema-dependent code must be feature-flag gated (doctrine; companion PR).
Refs decision:0ok1249n2tdrfud2a5j9 — DEV_CYCLE.md §10.5.1 (triage eligibility) amendment (doctrine; companion PR).

Plan / Audit / Seal

  • Plan: the dev → main release pattern in DEV_CYCLE.md §4.1 (release PR) and §6 (release cycle). 154 commits from origin/main..origin/dev (24 fix / 26 feat / 50 merge / remainder docs+chore+test+style).
  • Audit: schema migration chain v17 → v24 reviewed — every step is additive (new DEFINEs, new fields with defaults, new EVENTs). No REMOVE / DROP operations in the migration path. The idx_input_span_dedup change (v24) uses OVERWRITE and extends the field set, which is monotonically weaker than the prior index — every row valid before is still valid after.
  • Risk: L2 — schema-touching, tool-contract change (bicameral.remove_decision), but the migration path has been exercised end-to-end against the prod ledger (the schema migration that fixes the dashboard /history 500 was applied to my local prod DB during testing, verified working).
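The audit's claim that extending the index field set is "monotonically weaker" can be checked with a small sketch. It assumes idx_input_span_dedup is a uniqueness constraint; the field names other than archive_key and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def violates_unique(rows: Iterable[Mapping[str, str]], fields: Sequence[str]) -> bool:
    """Return True if two rows share the same values on every indexed field."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return any(n > 1 for n in counts.values())


# Hypothetical field names; only archive_key is named in this PR.
OLD_FIELDS = ("source_ref", "text")
NEW_FIELDS = OLD_FIELDS + ("archive_key",)

# Two archived spans with empty text collide on the old key only.
rows = [
    {"source_ref": "s1", "text": "", "archive_key": "k1"},
    {"source_ref": "s1", "text": "", "archive_key": "k2"},
]
```

Any row set that is unique on OLD_FIELDS is also unique on NEW_FIELDS, because adding a column can only split groups, never merge them. The reverse does not hold, as the sample rows show.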

Breaking changes (operator-facing)

bicameral.remove_decision response shape changed. Dropped signoff and projected_status. Added event_logged, removed_at, previous_state, reason. The decision row + all references are now physically removed instead of flipped to signoff.state="removed". Callers consuming the response should check the new top-level fields. Idempotent on missing decisions (was_new=False, no raise) — the matching event in the journal is the canonical record of any prior removal.
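A caller-side sketch of reading the new response shape. Only the field names come from the description above; the value types, the sample payload, and the `summarize_removal` helper are assumptions for illustration.

```python
from typing import Any, Mapping


def summarize_removal(response: Mapping[str, Any]) -> str:
    """Summarize a bicameral.remove_decision response (hypothetical helper)."""
    # Assumption: was_new is False when the decision was already gone.
    if not response.get("was_new", True):
        return "already removed (see journal event)"
    return (
        f"removed at {response['removed_at']} "
        f"(was {response['previous_state']}, reason: {response['reason']})"
    )
```

Callers that previously read `signoff` or `projected_status` would switch to the top-level fields in the same way.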

Schema migrations

Auto-applied on first connect. Non-destructive. Operators upgrading from v0.14.x see one-time migration log entries; no data loss.

| Version | Migration | Source |
|---|---|---|
| v17 → v18 | decision.updated_at + idx_decision_updated_at | #87 precondition |
| v18 → v19 | bicameral_meta.decision_revision + DEFINE EVENT decision_revision_bump | #87 Phase 6 |
| v19 → v20 | PII archive schema slot (input_span.archive_key) | BicameralAI/bicameral-daemon#23 Phase A |
| v20 → v21 | (PII archive metadata field) | BicameralAI/bicameral-daemon#23 Phase A |
| v21 → v22 | ASSERT text != '' OR archive_key != '' on input_span.text | BicameralAI/bicameral-daemon#23 Phase B-1 |
| v22 → v23 | Backfill decision.decision_level for legacy rows | #340 prereq |
| v23 → v24 | idx_input_span_dedup extended with archive_key | dashboard /history collision fix |

Test plan

Post-merge tasks

  • Tag v0.15.0 and push the tag.
  • Run pipx upgrade bicameral-mcp on design-partner machines (or wait for their next bicameral.update).
  • Land the doctrine-amendment PR onto dev (expand-only / flag-gate / §10.5.1 rewrite) so v0.15.1+ ships under the new rule.

🤖 Generated with Claude Code

silongtan and others added 30 commits May 6, 2026 23:52
…st (#192)

Single env var now owns the entire telemetry-flag namespace. Three accepted
forms: bool (`0`/`off`/`false`/`no` → all off; `1`/`on`/`true`/`yes` → relay
only), csv (`relay,preflight,raw`), and unset (default → relay only).

New `telemetry_flags.py` module owns parsing; `consent.telemetry_allowed()`
and `preflight_telemetry.{telemetry_enabled, raw_capture_enabled}` delegate
to a frozen `TelemetryFlags` cached once per process.

Backwards-compat preserved on three axes:
  1. Legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1` and
     `BICAMERAL_PREFLIGHT_TELEMETRY_RAW=1` continue to work as additive
     overlays — first read of either emits a one-line stderr deprecation
     warning per process. Removed in v1.x.
  2. `BICAMERAL_TELEMETRY=1` semantics unchanged (relay only — does NOT
     auto-enable preflight).
  3. Non-canonical truthy values (`enabled`, `t`, `active`, etc. — used in
     pre-#192 deployments) map to relay-only with a stderr warning pointing
     at the canonical form. Caught by Codex review as a P2 finding;
     preserves the pre-#192 contract that any non-OFF value enabled relay.

Semantics:
- CSV form is explicit — what's listed is on, what's not is off
  (so `BICAMERAL_TELEMETRY=preflight,raw` turns OFF the default-on relay,
  documented in the setup wizard).
- `raw` always implies `preflight` (raw is a mode of preflight events;
  defensive double-check in `raw_capture_enabled()`).
- Process-cached parsing via `lru_cache`; tests use `_reset_for_tests()`
  via an autouse fixture in `tests/conftest.py` so monkeypatched env vars
  take effect cross-test.
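
A minimal sketch of the parsing rules above, not the actual `telemetry_flags.py`. The legacy overlays are omitted, and the handling of an empty string or a csv containing unknown names (both treated as non-canonical here) is an assumption.

```python
import os
import sys
from dataclasses import dataclass
from functools import lru_cache
from typing import Optional

_OFF = {"0", "off", "false", "no"}
_ON = {"1", "on", "true", "yes"}
_KNOWN = {"relay", "preflight", "raw"}


@dataclass(frozen=True)
class TelemetryFlags:
    relay: bool = True
    preflight: bool = False
    raw: bool = False


def parse_flags(value: Optional[str]) -> TelemetryFlags:
    """Parse one BICAMERAL_TELEMETRY value (legacy overlays omitted)."""
    if value is None:
        return TelemetryFlags()  # unset: relay only
    token = value.strip().lower()
    if token in _OFF:
        return TelemetryFlags(relay=False)
    if token in _ON:
        return TelemetryFlags()  # bool-true form: relay only
    names = {part.strip() for part in token.split(",") if part.strip()}
    if names and names <= _KNOWN:
        raw = "raw" in names
        # CSV is explicit: unlisted flags are off; raw always implies preflight.
        return TelemetryFlags("relay" in names, raw or "preflight" in names, raw)
    # Non-canonical truthy value: any non-OFF value enables relay.
    print(f"warning: non-canonical BICAMERAL_TELEMETRY={value!r}", file=sys.stderr)
    return TelemetryFlags()


@lru_cache(maxsize=1)
def current_flags() -> TelemetryFlags:
    """Process-cached view of the environment; tests clear it via cache_clear()."""
    return parse_flags(os.environ.get("BICAMERAL_TELEMETRY"))
```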

35 fixtures in `tests/test_telemetry_flags.py` cover all forms + integration
with the existing call sites + the legacy-truthy preservation case. 87/87
green across all 7 telemetry-touching test files (including 52 regression
tests for #39 / #101 / #112 behaviors).

Closes #192. Unblocks #65 phase 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…METRY (#192 follow-up)

Six doc references still spelled the legacy `BICAMERAL_PREFLIGHT_TELEMETRY=1`
shape after #192 consolidated the flag namespace. Updated each to lead with
the canonical csv form (`BICAMERAL_TELEMETRY=preflight` / `=preflight,raw`)
and note the legacy var is still honored via the deprecation overlay:

  - server.py — preflight_id schema description (agent-visible)
  - contracts.py — preflight_id field comment
  - preflight_telemetry.py — module docstring (Default mode + Raw mode lines)
  - handlers/record_bypass.py — module docstring (telemetry_disabled reason)
  - skills/bicameral-preflight/SKILL.md — bypass-write contract (agent-visible)
  - docs/semantic-drift-governance.md — record_bypass return-value spec

No behavior change. Tests unchanged: 66/66 green across telemetry_flags,
consent_notice, preflight_telemetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves #250 base-branch drift after dev advanced 50 commits since the
PR was opened. Two conflicts:

1. handlers/record_bypass.py (modify/delete) — dev's #244 v1 revert
   (commit d1e3914) deleted the entire HITL bypass + decision_level
   surface from v0 scope. My #192 follow-up touched its module
   docstring; resolution is to accept the deletion (the doc patch is
   moot once the file is gone).

2. skills/bicameral-preflight/SKILL.md (content) — same #244 revert
   deleted the §5.4-bypass-semantics block I patched for canonical
   env-var phrasing. Accepted dev's deletion of the block; the
   remaining §5.4 telemetry-attribution + §5.5 confirm-finding
   sections are untouched and still carry the canonical
   `BICAMERAL_TELEMETRY=preflight` form via the merged v1 of
   §5.4-telemetry-note.

The other four files I patched in the doc-followup commit
(server.py, contracts.py, preflight_telemetry.py,
docs/semantic-drift-governance.md) auto-merged cleanly. My canonical
`BICAMERAL_TELEMETRY=preflight` references survive verbatim.

Telemetry tests post-merge: 66/66 green
(test_telemetry_flags + test_consent_notice + test_preflight_telemetry).

Note: docs/semantic-drift-governance.md still describes record_bypass
return values that no longer have a handler. dev kept the file
unchanged through the v1 revert; whether the governance lifecycle doc
should be deleted, marked v1-deferred, or kept as forward-looking
architecture is a separate triage call (not in #250 scope).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ne (#280)

PR #285's first CI run produced a clean baseline:

  23 cases / precision 0.913 / recall 0.913 / abort_rate 0.000
  ✓ all gates pass

That's ~7-13 pp of headroom on every gate (≥ 0.85 / ≥ 0.80 / ≤ 0.30).
Locking the baseline in before drift sets in.

Two changes to .github/workflows/test-mcp-regression.yml:

  1. `--gate-mode warn` → `--gate-mode hard`. Runner exits non-zero
     on breach instead of warning to step output.

  2. Removed `continue-on-error: true` from the eval step. The step
     now fails CI when the gate breaches. The metrics-summary step
     keeps `continue-on-error: true` so a renderer bug never masks
     the eval result — and the `always()` guard means the breach
     summary is still rendered inline when the eval fails.

After this lands, PRs that touch the bind handler / bind skill /
fixture / dataset must EITHER keep recall ≥ 0.80 / precision ≥ 0.85 /
abort_rate ≤ 0.30, OR deliberately re-record the cache by setting
BICAMERAL_GROUNDING_EVAL_RECORD=1 after a skill-prompt change.
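
A sketch of how the three thresholds and `--gate-mode` could map to an exit code. The thresholds are the ones quoted above; the `Metrics` type and function names are hypothetical, not the runner's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float
    abort_rate: float


def gate_breaches(m: Metrics) -> List[str]:
    """List every gate the metrics fail (empty list means all gates pass)."""
    breaches = []
    if m.precision < 0.85:
        breaches.append(f"precision {m.precision:.3f} < 0.85")
    if m.recall < 0.80:
        breaches.append(f"recall {m.recall:.3f} < 0.80")
    if m.abort_rate > 0.30:
        breaches.append(f"abort_rate {m.abort_rate:.3f} > 0.30")
    return breaches


def exit_code(m: Metrics, gate_mode: str) -> int:
    """warn: report breaches but exit 0; hard: exit non-zero on any breach."""
    breaches = gate_breaches(m)
    for line in breaches:
        print(f"gate breach: {line}")
    return 1 if breaches and gate_mode == "hard" else 0
```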

Aligns with Jin's "deliberate not drift" framing — same path the M1
eval *should* have taken (M1 has been warn-only forever; M2 is being
flipped while the baseline is fresh, days after the eval shipped).

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uts don't fail M2 hard-gate (#288)

The hard-gate flip (commit 6605f24) surfaced an existing flakiness on
the first CI run: a single httpx.ReadTimeout on one of the up-to-8
tool-use turns crashed the whole eval run, failing the MCP regression
suite. Previously masked by `--gate-mode warn` + `continue-on-error: true`,
both removed by the gate-flip.

Three surgical fixes:

  1. tests/eval/_bind_judge.py — _call_messages_api now retries 3× with
     exponential backoff (2s/8s/32s) on:
       - httpx.TimeoutException (read/connect/pool)
       - httpx.NetworkError, httpx.RemoteProtocolError
       - HTTP 429 (rate limit) + 5xx (server-side transient)
     After exhausting retries, raises RuntimeError with a bounded message.
     Terminal 4xx (auth, malformed payload) still fails fast — those
     aren't transient.

  2. tests/eval_grounding_recall.py — per-case catch broadened from
     `except RuntimeError` to `except Exception`, and a single failing
     case now records an `eval_error` outcome row instead of crashing
     the whole eval. Aggregate gate is still applied: if N cases err
     hard enough that recall < 0.80 across 23 cases, the eval fails CI
     correctly. With our 0.913 baseline, ~5 cases would have to err
     before the gate breaches.

  3. tests/eval_grounding_recall_summary.py — eval_error added to the
     outcome-breakdown table; missed-cases list surfaces the error msg
     inline (rather than rendering "—::—" for the absent binding).
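
The retry policy in fix 1 can be sketched as below. This is a generic illustration, not `_call_messages_api` itself: `TransientError` stands in for the httpx timeout/network exceptions and the 429/5xx cases, and reading "retries 3×" as one initial attempt plus up to three retries is an assumption.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for timeouts, network errors, HTTP 429 and 5xx responses."""


def call_with_retries(
    call: Callable[[], T],
    delays: Sequence[float] = (2.0, 8.0, 32.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on TransientError, wait and retry once per entry in delays."""
    failures = 0
    while True:
        try:
            return call()
        except TransientError as exc:
            if failures == len(delays):
                # Bounded message so a large error body cannot flood CI logs.
                raise RuntimeError(f"gave up after {failures} retries: {exc}"[:200]) from exc
            sleep(delays[failures])
            failures += 1
```

Any exception that is not a `TransientError` (for example a terminal 4xx) propagates on the first attempt, which matches the fail-fast behavior described above. The injectable `sleep` keeps tests from waiting on real delays.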

Local verification:
  - retry loop smoke-tested: 3× ReadTimeout → bounded RuntimeError;
    503/503/200 → recovers and returns the 200 response.
  - ruff check + format + mypy all green.
  - test_m2_grounding_log + test_bind_m2_telemetry: 11 passed, 3 skipped.

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280)
…iscussion (#280)

Jin's PR-#288 followup: aggregate metrics tell us *whether* the agent is
grounding well; categorized failure modes tell PMs *which kinds of
decisions* it struggles with. New deterministic post-hoc classifier
runs over the existing per-case rows; renderer adds a "Failure modes"
section to the GitHub step summary ranked by count, with up to 2
example cases per category and a documented PM-actionable next step.

Categories
----------

  correct                       — agent got it right, no action
  wrong_module                  — same-name disambiguation failed
  wrong_intent                  — similar-intent miss
  cross_language_confusion      — Python ↔ TS runtime mistake
  wrong_symbol_in_right_file    — sub-region disambiguation gap
  hallucinated_symbol           — handler reject path caught a fake symbol
  span_mismatch                 — handler reject path caught hallucinated lines
  aborted_correctly             — agent recognized a behavioral / unbindable
                                  decision (only meaningful once §B fixture
                                  lands; the §B "ungroundable behavioral
                                  cases" piece was deferred per Jin's plan
                                  recommendation — design partners are
                                  better authors for those via #280
                                  friction reports)
  aborted_incorrectly           — agent over-cautious on a bindable case
  eval_error                    — infra (API timeout / network)

Each category carries a documented next step (FAILURE_MODE_NEXT_STEPS
constant, kept in sync with the renderer's _FAILURE_MODE_HINTS table).
PM-readable, not engineering jargon.

Files
-----

  tests/eval_grounding_recall.py
    + classify_failure_mode(row) -> str
    + FAILURE_MODE_NEXT_STEPS dict (PM-actionable next steps)
    + failure_mode field embedded on every per-case row (success +
      eval_error paths both populate it)

  tests/eval_grounding_recall_summary.py
    + _render_failure_modes(rows) helper
    + new "Failure modes (top categories — PM-actionable)" section
      between the gate-breach line and the existing miss list. Ranked
      by miss count, eval_error always last (infra noise), capped at
      top 3 categories with up to 2 example cases each. Examples
      surface case_id + reasoning/abort_reason/error_msg excerpt
      (truncated to 110 chars, pipe-escaped for table safety).

  tests/test_grounding_failure_mode.py (new)
    13 table-driven tests across all 10 categories + 3 invariants:
    unknown-outcome falls into 'uncategorized', taxonomy
    documentation completeness, handler-reject-priority over
    case_type. Pure unit tests — no API, no ledger.

  CHANGELOG.md +2
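
A sketch of the ranking rule the renderer applies (ranked by miss count, `eval_error` last, top 3 categories, up to 2 examples each). The `detail` key is a hypothetical stand-in for the reasoning/abort_reason/error_msg excerpt, and treating `aborted_correctly` as a non-miss is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence, Tuple

NON_MISS = {"correct", "aborted_correctly"}


def rank_failure_modes(
    rows: Sequence[Mapping[str, str]],
    top_n: int = 3,
    examples_per_mode: int = 2,
    excerpt_len: int = 110,
) -> List[Tuple[str, int, List[Tuple[str, str]]]]:
    """Group missed rows by failure_mode and rank them for a markdown table."""
    buckets: Dict[str, list] = defaultdict(list)
    for row in rows:
        if row["failure_mode"] not in NON_MISS:
            buckets[row["failure_mode"]].append(row)
    # Highest count first; eval_error sorts last regardless of its count.
    ordered = sorted(buckets, key=lambda m: (m == "eval_error", -len(buckets[m]), m))
    ranked = []
    for mode in ordered[:top_n]:
        examples = [
            # Truncate, then escape pipes so the excerpt is safe in a table cell.
            (r["case_id"], r["detail"][:excerpt_len].replace("|", "\\|"))
            for r in buckets[mode][:examples_per_mode]
        ]
        ranked.append((mode, len(buckets[mode]), examples))
    return ranked
```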

Cache key unchanged
-------------------

failure_mode is computed at the row level from existing fields; doesn't
touch the bind judge's cache key (model | skill | repo | decision). So
the existing 0.913/0.913 baseline cache stays valid; CI runs after this
PR will hit the cache and produce identical numbers — only the renderer
output is enriched.

Local verification
------------------

  - 13 passed on tests/test_grounding_failure_mode.py
  - 24 passed, 3 skipped across the M2-related test files
    (test_m2_grounding_log + test_bind_m2_telemetry + test_grounding_failure_mode)
  - ruff check + ruff format --check + mypy all green on touched files
  - Renderer smoke-tested on a synthetic input with 6 misses across 4
    categories — section ranks correctly, examples populate, hints land
    in the right column

Out of scope (intentionally deferred)
-------------------------------------

§B from the plan: 4-5 deliberately ungroundable behavioral cases
(`expected_outcome="abort"`) that materially measure Jin's "behavioral
decisions" pattern. Recommended deferral — design partners are better
authors for those via #280 friction reports rather than engineering
inventing them. Once §B lands, `aborted_correctly` will start firing
for real (it can fire today only for rows that carry
`expected_outcome="abort"`, which no current row does).

Refs #280 #288.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(eval): M2 failure-mode enumeration for cross-functional design discussion (#280)
refactor(telemetry): consolidate BICAMERAL_TELEMETRY env-var namespace (#192)
PR #174 closed the recall ceiling but introduced two silent fallback
paths in `_region_anchored_preflight`: when `ctx.code_graph` was
absent OR when the expander raised, the response shape was byte-
identical to "expansion ran and matched zero" — caller couldn't tell
recall was degraded.

Three additive signals now surface every fallback (per Phase 2 spec
posted on #243, all four open questions defaulted to recommended):

  1. Response field — `sources_chained` includes `"graph_unavailable"`.
     Additive (never replaces existing `"region"` / `"graph"` tags).
     Bare tag — granular reason flows through telemetry, not the
     response shape, per signoff Q2.

  2. Log level — exception case bumped from `logger.debug` →
     `logger.warning` with stable `[preflight:fallback]` substring +
     exception type for grep-friendly production logs.

  3. Telemetry counter — new `preflight_telemetry.write_fallback_event(
     reason, session_id)` modeled on `write_ingest_refusal_event`
     (#216). Emits a `graph_expansion_fallback` row to the existing
     `~/.bicameral/preflight_events.jsonl` substrate. Reasons are a
     controlled enum: `"absent"`, `"missing_method"`,
     `"exception:<type>"`. Gated on `BICAMERAL_TELEMETRY=preflight`.

The fallback case classifier in `_region_anchored_preflight`
distinguishes three reasons (was conflated into a single `if expander
is not None:` skip in the pre-#243 code):

  - `code_graph is None`                                      → "absent"
  - `code_graph` set but no `expand_file_paths_via_graph`     → "missing_method"
  - expander raised                                            → "exception:<type>"
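
The three-way classification can be sketched as follows. This is an illustration rather than the body of `_region_anchored_preflight`; the helper name and the assumed call signature of `expand_file_paths_via_graph` (a list of paths in, an iterable of paths out) are hypothetical.

```python
from typing import Any, List, Optional, Sequence, Tuple


def expand_with_fallback_reason(
    code_graph: Any, file_paths: Sequence[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (paths, fallback_reason); the reason is None when expansion ran."""
    if code_graph is None:
        return list(file_paths), "absent"
    expander = getattr(code_graph, "expand_file_paths_via_graph", None)
    if expander is None:
        return list(file_paths), "missing_method"
    try:
        return list(expander(file_paths)), None
    except Exception as exc:
        # Keep the direct-pin paths and record the exception type for telemetry.
        return list(file_paths), f"exception:{type(exc).__name__}"
```

A caller would append `"graph_unavailable"` to `sources_chained` whenever the returned reason is not None, and pass the reason itself to the telemetry counter.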

Skill update (`skills/bicameral-preflight/SKILL.md`) renders a one-
line recall-degraded note to the agent when the tag is present:

  > Note: structural-neighbor lookup was unavailable this call —
  > recall may be reduced until the symbol index is rebuilt. Decisions
  > bound to files that import these may not have surfaced.

Treats `"graph_unavailable"` as advisory: doesn't block the preflight
surface; direct-pin matches are unaffected.

Tests
-----

4 new cases in `tests/test_preflight_graph_expansion.py`:

  - test_preflight_fallback_absent_code_graph_tags_graph_unavailable
    — ctx with code_graph=None → response carries the tag,
    telemetry counter reason="absent"
  - test_preflight_fallback_expander_raises_warns_and_tags
    — stub expander raises RuntimeError → response carries the tag,
    `caplog` captures WARN-level log with `[preflight:fallback]`
    substring, telemetry counter reason="exception:RuntimeError"
  - test_preflight_successful_expansion_does_not_tag_graph_unavailable
    — regression guard: clean expansion path must NOT carry the tag
    (no false alarms)
  - test_preflight_empty_file_paths_does_not_tag_graph_unavailable
    — empty file_paths short-circuits before expansion check; the
    "expansion was never attempted" case is distinguishable from
    "attempted-and-fell-back"

Existing tests use containment assertions (`"region" in
sources_chained`) not exact list equality, so additive `"graph_
unavailable"` doesn't break them.

What's NOT in this PR
---------------------

Piece B (eager symbol-index initialization at server startup) is the
follow-up commit on this branch. Lands separately so the response-
shape change can ship without the adapter-lifecycle change. After
both pieces land, the telemetry counter shipped here gives ongoing
visibility into how often fallback engages in production.

Refs #243 (parent #173 / PR #174). Plan signoff via
#243 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Piece B)

Pre-fix, the code-locator adapter had two cooperating problems that
made silent fallback the default:

  1. `get_code_locator()` returned a FRESH `RealCodeLocatorAdapter`
     per call. Caching was absent.
  2. `_ensure_initialized()` was lazy — first tool call paid the
     index-build cost AND could race the index check on concurrent
     dispatch (e.g. preflight + bind landing in parallel after
     server boot).

Together: every silent fallback in the production runtime was
"hot" because the adapter was being rebuilt + rechecked on every
call. Piece A (#283 commit 3c9730f) made the fallback loud at the
response layer; Piece B closes the upstream cause.

Three changes
-------------

  adapters/code_locator.py
    - Singleton-by-REPO_PATH cache via `_INSTANCE_CACHE: dict[str,
      RealCodeLocatorAdapter]`. Path resolved through `Path.resolve()`
      so symlink + relative-path callers cache-hit consistently.
      Multi-repo correctness preserved (any test that swaps REPO_PATH
      mid-process gets a fresh adapter for the new path).
    - New `reset_code_locator_cache()` test-only hook, mirroring
      `adapters.ledger.reset_ledger_singleton`.
    - New `async def RealCodeLocatorAdapter.initialize()` — wraps
      sync `_ensure_initialized()` in `loop.run_in_executor(None, ...)`
      so the cold-init path doesn't block the event loop. Idempotent
      on already-initialized adapters.

  server.py
    - `serve_stdio()` calls `await get_code_locator().initialize()`
      between the dashboard sidecar start and the consent-notice block.
    - **Fail-loud per #243 phase-2 signoff Q3** — explicit `except
      RuntimeError as exc:` re-raises after printing an actionable
      stderr message (`"Run: python -m code_locator index <repo>"`).
      The outer try/finally still runs the `SERVER_SHUTDOWN` audit
      emit, so operators get a clean event AND a clear actionable
      error. No more silent degradation.

  tests/test_preflight_graph_expansion.py — 4 new tests
    - test_get_code_locator_returns_same_instance_per_repo_path
      (singleton + reset behavior across two REPO_PATHs)
    - test_initialize_succeeds_when_index_present
      (idempotent on already-initialized adapter)
    - test_initialize_fails_loudly_when_index_empty
      (RuntimeError from `_ensure_initialized` propagates through the
      async wrapper — doesn't get swallowed)
    - test_serve_stdio_refuses_boot_on_empty_index
      (boot-path level: with everything else stubbed healthy, an
      empty index aborts `serve_stdio()` with the expected
      RuntimeError)
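
A compact sketch of the adapter-side changes: a per-path singleton cache, a reset hook, and an async `initialize()` that pushes the blocking init onto the default executor. `LocatorAdapter` and its fake index check are stand-ins for `RealCodeLocatorAdapter`, and passing `repo_path` as an argument (instead of reading REPO_PATH) is an assumption made to keep the example self-contained.

```python
import asyncio
from pathlib import Path
from typing import Dict


class LocatorAdapter:
    """Stand-in for RealCodeLocatorAdapter; the index check is faked."""

    def __init__(self, repo_path: str, index_rows: int = 1) -> None:
        self.repo_path = repo_path
        self.index_rows = index_rows
        self.initialized = False

    def _ensure_initialized(self) -> None:
        if self.initialized:
            return  # idempotent on an already-initialized adapter
        if self.index_rows == 0:
            raise RuntimeError("empty index. Run: python -m code_locator index <repo>")
        self.initialized = True

    async def initialize(self) -> None:
        # Run the blocking init in the default executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, self._ensure_initialized)


_INSTANCE_CACHE: Dict[str, LocatorAdapter] = {}


def get_code_locator(repo_path: str) -> LocatorAdapter:
    # Resolve first so relative and symlinked spellings share one cache entry.
    key = str(Path(repo_path).resolve())
    if key not in _INSTANCE_CACHE:
        _INSTANCE_CACHE[key] = LocatorAdapter(key)
    return _INSTANCE_CACHE[key]


def reset_code_locator_cache() -> None:
    """Test-only hook: drop every cached adapter."""
    _INSTANCE_CACHE.clear()
```

An exception raised inside the executor surfaces at the `await`, so an empty index still aborts startup instead of being swallowed.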

Local smoke tests
-----------------

  - Singleton + reset_code_locator_cache: 4 assertions pass
    (cache hit on same path, distinct instance on new path, fresh
    after reset, second call after reset stays cached)
  - Async `initialize()`: re-raises RuntimeError on stubbed
    `_ensure_initialized` failure; idempotent no-op on
    already-initialized adapter

  - ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR
---------------------

Nothing — Piece A (commit 3c9730f) and Piece B (this commit) together
close #243's full scope. PR will open with both pieces. Telemetry
counter shipped in Piece A gives ongoing production visibility into
how often fallback engages post-merge.

Refs #243 (parent #173 / PR #174). Plan signoff via
#243 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…backs

feat(preflight): eliminate silent graph-expansion fallbacks (#243)
Replaces the dashboard image at the bottom of "How It Feels" with a
three-beat demo video section (ingest -> preflight -> ratify async)
referencing GitHub user-attachments URLs so videos render as inline
players. Moves the "Star on GitHub" banner from the top header to a
centered placement immediately after the demo, turning it into a
post-demo conversion beat instead of a misaligned header element.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itch

Replaces the single-line MCP-server description with a position-take
opener: paragraph 1 names the failure mode (agreements emerge mid-flight,
never reach a doc); paragraph 2 introduces Bicameral MCP as a spec
compliance layer that captures both formal source materials (transcripts,
PRDs, Slack) and undiscussed mid-implementation decisions to be ratified
async by the product owner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs(README): demo video section + relocate star CTA mid-doc
… + decision_id alias fix

The banner tests previously used MagicMock for ctx and AsyncMock for ledger,
returning hand-crafted dicts. They stayed green even when get_decisions_by_status
silently returned decision_id=None for every row (the SQL selected an undefined
field — see ledger.adapter:584).

Refactor to seed a real SurrealDBLedgerAdapter over memory:// and run the actual
get_decisions_by_status query. The first sociable run surfaced the latent bug,
which is fixed in this commit by aliasing type::string(id) AS decision_id
(matches the pattern at queries.py:167, 228, 404, 512).

Tests that legitimately need narrow seams (handle_link_commit, asyncio.Lock
primitives) are left as-is and now documented inline.

Adds a "Sociable Testing for UX Paths" section to pilot/mcp/CLAUDE.md codifying
the preference: SimpleNamespace ctx + real adapter for handler/ledger tests,
narrow seams only when a collaborator can't be run in tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dleware-tests

# Conflicts:
#	CHANGELOG.md
#	README.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-tests

test(sync_middleware): sociable banner tests + decision_id alias fix
…gate (#58)

Sibling of the M2 grounding-recall eval (#284). Phase A is **measurement
only** — no runtime change to `handle_preflight` or any retrieval surface.
Recall regression risk = zero. The Phase B optimization choice (multi-hop
expansion / semantic search / LLM reranker) is gated on this PR's first
stable baseline, per the wiki's optimization principle: "identify the
specific scenario, then optimize."

Per the Phase 2 spec posted on #58 (all four open questions defaulted to
recommended):

  Q1 dataset size       → 25 cases hand-curated, matches M2's 23
  Q2 miss-mode buckets  → three (vocabulary / unbound / transitive),
                          matches the issue body's framing
  Q3 fire-rate gate     → raw retrieval (`response.decisions`); fire is
                          downstream and a secondary diagnostic
  Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------

  overall recall    surfaced / (surfaced + missed)        gate ≥ 0.70
  per-mode recall   same, sliced by miss_mode             gate ≥ 0.50
  fire rate         response.fired / total                gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the
recall denominator but counted separately so reviewers can see them.
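
A sketch of the aggregation across the three axes. The row keys (`miss_mode`, `outcome`, `fired`) and the outcome labels are hypothetical; counting errored cases in the fire-rate denominator and returning 0.0 for an empty denominator are also assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence


def recall(rows: Sequence[Mapping[str, Any]]) -> float:
    surfaced = sum(r["outcome"] == "surfaced" for r in rows)
    missed = sum(r["outcome"] == "missed" for r in rows)
    # Errored rows appear in neither count, so they drop out of the denominator.
    return surfaced / (surfaced + missed) if surfaced + missed else 0.0


def aggregate(rows: Sequence[Mapping[str, Any]]) -> Dict[str, Any]:
    by_mode: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        by_mode[row["miss_mode"]].append(row)
    return {
        "overall_recall": recall(rows),
        "per_mode_recall": {mode: recall(rs) for mode, rs in by_mode.items()},
        "fire_rate": sum(bool(r["fired"]) for r in rows) / len(rows) if rows else 0.0,
        "errors": sum(r["outcome"] == "error" for r in rows),
    }
```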

Files
-----

  tests/fixtures/preflight_m6/dataset.py       (412 LOC)
    25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes.
    Frozen dataclass; GENERATOR_VERSION constant invalidates downstream
    caches when bumped. Import-time _validate_dataset() fails loud on
    duplicate case_id, invalid miss_mode, transitive case without
    intended_file_path, unbound case with non-ungrounded status.

  tests/eval/_preflight_m6_seeder.py            (231 LOC)
    Per-case freshness: each call creates a new tempdir + memory:// ledger
    + git-initialized repo + writes a placeholder file (or the transitive
    case's intended + caller files). Calls the real handle_ingest +
    handle_bind so seeded rows have production shape (source_type,
    span, signoff, binds_to). Reset code-locator + ledger singletons
    before AND after so the next case starts clean.

  tests/eval_preflight_m6_recall.py             (274 LOC)
    Argparse runner, drives the seeder + handle_preflight, classifies
    outcomes, aggregates. JSON output + gate enforcement
    (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so
    existing CI patterns transfer.

  tests/eval_preflight_m6_summary.py            (162 LOC)
    Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode
    table + collapsible missed-case detail with topic + intended
    description. Fail-quiet on missing JSON / parse errors.

  tests/test_preflight_m6_eval.py               (267 LOC)
    16 sociable unit tests for the classifier + aggregator. Per the new
    CLAUDE.md "Sociable Testing for UX Paths" rule (#303): SimpleNamespace
    + real M6Case dataclasses, NEVER MagicMock — so any added/removed
    field on the response shape fails the test honestly.

  .github/workflows/test-mcp-regression.yml     (+31 LOC)
    New "M6 preflight recall eval (warn-only)" + summary steps after M2.
    No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic.

  CHANGELOG.md                                  (+2 lines)
    [Unreleased] / Added entry.
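
A sketch of the dataset's fail-loud validation, under stated assumptions: the `status` field name and its default are hypothetical, and the miss-mode names are taken from the CI summary labels rather than from `dataset.py`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MISS_MODES = {"vocabulary_mismatch", "unbound_decision", "transitive_relevance"}


@dataclass(frozen=True)
class M6Case:
    case_id: str
    miss_mode: str
    status: str = "grounded"  # hypothetical field name and default
    intended_file_path: Optional[str] = None


def validate_dataset(cases: Sequence[M6Case]) -> None:
    """Raise ValueError on the first structural problem in the dataset."""
    seen = set()
    for case in cases:
        if case.case_id in seen:
            raise ValueError(f"duplicate case_id: {case.case_id}")
        seen.add(case.case_id)
        if case.miss_mode not in MISS_MODES:
            raise ValueError(f"{case.case_id}: invalid miss_mode {case.miss_mode!r}")
        if case.miss_mode == "transitive_relevance" and not case.intended_file_path:
            raise ValueError(f"{case.case_id}: transitive case needs intended_file_path")
        if case.miss_mode == "unbound_decision" and case.status != "ungrounded":
            raise ValueError(f"{case.case_id}: unbound case must be ungrounded")
```

Calling such a validator at module import time means a malformed fixture fails every test that imports the dataset, instead of skewing the recall numbers quietly.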

Local verification
------------------

  - 16/16 sociable unit tests pass on the classifier + aggregator
    (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator,
    test_per_miss_mode_breakdown, etc.)
  - Dataset import + _validate_dataset() pass — 25 cases (8/8/9)
  - Runner --help renders cleanly
  - Summary renderer smoke-tested on synthetic JSON — per-mode table +
    missed-case detail render correctly with emoji gates
  - ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR (intentionally — Phase B gating)
------------------------------------------------------

  - Any runtime change to handle_preflight or _region_anchored_preflight
  - Skill changes (no agent-facing contract change in Phase A)
  - Multi-hop / call-graph / inheritance graph expansion (Phase B
    candidate, deferred)
  - Semantic search layer (Phase B candidate, deferred)
  - LLM reranker (Phase B candidate, deferred)
  - Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode
from the per-mode breakdown and ship Phase B targeted to it. Cheap-first
ordering per the wiki: search_hint refinement → multi-hop graph → semantic
→ reranker.

Refs #58. Plan: plan-58-preflight-decision-detection.md.
Phase 2 spec signoff: #58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58)
…#58 followup)

PR #304's first CI baseline produced overall recall 0.000 with 14/25
cases erroring — root cause: the M6 seeder runs 25 cases back-to-back
in a single process, and the LLM-08 ingest rate limiter (#216, burst=10
/ refill=1.0/s) refuses cases 12+ with `_IngestRefused("rate_limit_
exceeded")`. Math: 10 initial tokens + ~1 refill while seeding the first
11 cases = 11 cases through, then 14 cases (U4-U8 + all 9 T*) erred.
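
The arithmetic can be reproduced with a toy token bucket. This is a simulation sketch, not the limiter in `handlers.ingest`; the 0.12 s cost per seeded case and the assumption that refused cases return almost instantly are illustrative guesses.

```python
from typing import List


class TokenBucket:
    def __init__(self, burst: float, refill_per_s: float) -> None:
        self.burst = burst
        self.refill_per_s = refill_per_s
        self.tokens = burst
        self.last = 0.0

    def try_acquire(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(cases: int = 25, seconds_per_seeded_case: float = 0.12) -> List[bool]:
    bucket = TokenBucket(burst=10, refill_per_s=1.0)
    now = 0.0
    results = []
    for _ in range(cases):
        ok = bucket.try_acquire(now)
        results.append(ok)
        if ok:
            # Only seeded cases take time; a refused case fails fast.
            now += seconds_per_seeded_case
    return results


passed = simulate()
print(sum(passed), "cases through,", passed.count(False), "refused")
# prints "11 cases through, 14 refused"
```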

The rate limiter is for production agent-loop safety, not eval
throughput. There's already a documented env var to disable it
(see `handlers.ingest._check_rate_limit` docstring):
``BICAMERAL_INGEST_RATE_LIMIT_DISABLE`` truthy → bucket check is
short-circuited. Setting it in the seeder's per-case env setup (saved
+ restored like `REPO_PATH` and `SURREAL_URL`) is the documented path.
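
A sketch of the save-and-restore pattern for per-case environment overrides. The `temporary_env` helper is hypothetical (the seeder's actual mechanism is not shown here), and using `"1"` as the truthy value is an assumption.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional


@contextmanager
def temporary_env(overrides: Dict[str, str]) -> Iterator[None]:
    """Apply env overrides for the duration of the block, then restore."""
    saved: Dict[str, Optional[str]] = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)  # was unset before: unset it again
            else:
                os.environ[key] = old
```

Usage would look like `with temporary_env({"BICAMERAL_INGEST_RATE_LIMIT_DISABLE": "1"}): ...` around a single case's seeding, so the override never leaks into the next case or into unrelated tests.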

Symptom before this fix (post-#304 CI on dev):
  M6 preflight retrieval recall eval — 25 cases
    overall recall : 0.000   errors: 14
    transitive_relevance   : 0/9 surfaced, 9 errors  ← all rate-limited
    unbound_decision       : 0/8 surfaced, 5 errors  ← last 5 rate-limited
    vocabulary_mismatch    : 0/8 surfaced, 0 errors  ← first 8, ran clean

Expected after this fix: vocabulary_mismatch stays 0/8 surfaced (that's
the honest BM25-can't-bridge-vocab baseline the eval was designed to
surface). transitive_relevance + unbound_decision should produce
non-zero recall once the seeder doesn't trip the rate limiter.

Belt-and-suspenders alternatives considered:
  - clear the `_RATE_LIMIT_REGISTRY` dict between cases — works but
    reaches into private state and skips the env-var contract
  - sleep between cases to allow refill — works but slow + hides the
    fact that the rate limiter isn't appropriate for evals
  - lower burst/refill via `.bicameral/config.yaml` in the synthetic
    repo — works but requires every Phase B eval surface to re-author
    the same config

The env-var path is the documented API and one line.

Smoke verification
------------------

  - 16/16 sociable unit tests pass on the classifier + aggregator
  - ruff check + format + mypy all green on the touched file

Refs #58 (Phase A baseline). Followup to PR #304.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the existing assets/bicameral-hero.png with a new visual that
illustrates the product as a double-entry ledger for AI-assisted product
development — PM and Dev agents each running a Bicameral MCP server,
both synced through a shared Team Ledger, with a live Decision Ledger
panel showing mixed signoff/code states (including ratified-but-not-
reflected, reflected-but-not-ratified, and drifted rows) and the three
core pillars (decisions first-class, two-sided ledger, escalation over
recommendation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ks to canonical skills/

The .claude/skills/bicameral-*/SKILL.md files were tracked duplicates of
skills/bicameral-*/SKILL.md that drifted independently. PRs frequently
touched skills/ but not the mirror, so the mirror lagged ~3 feature
commits (3c9730f, d1e3914, 79b872b) and 2+ weeks behind canonical.

Beyond stale duplicates: the drift was bidirectional. 7 skills existed
only in canonical (.claude/skills/-missing → never resolved as slash
commands) and 7 only in the mirror (no canonical source → became de-facto
canonical despite CLAUDE.md saying otherwise). claude-mem auto-writes
into CLAUDE.md files also drifted (ingest and preflight CLAUDE.md had
different "Recent Activity" entries between the two paths).

This change:

1. Canonicalizes the 7 mirror-only skills via git mv into skills/
   (bicameral-{brief, context-sentry, doctor, guided, scan-branch,
   search, status}).
2. Replaces every .claude/skills/bicameral-X with a symlink to
   ../../skills/bicameral-X (22 symlinks total). Claude Code's
   slash-command resolver follows the symlinks transparently — confirmed
   in-vivo during implementation when the resolver re-indexed and
   surfaced all 22 skills after the swap.
3. Repoints tests/CI/docs at canonical skills/ paths
   (tests/_extract_headless.py SKILL_MD_PATH; tests/regen_extraction_fixtures.py
   docstring; tests/eval_decision_relevance.py docstring; tests/e2e/README.md;
   .github/workflows/test-mcp-regression.yml comment; README.md slash-command
   row; docs/DEV_CYCLE.md canonical-source note; docs/v2-desync-optimization-guide.md
   doctor SKILL.md references).
4. Updates CLAUDE.md to describe the symlink layout (drop "stale duplicates
   slated for deletion" wording) and adds a Windows note: contributors on
   Windows must set core.symlinks=true (or use WSL) so the mode-120000
   entries materialize as symlinks rather than text files containing the
   target path.
5. Ticks off TODO.md:169 — the unresolved decision is now made.
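
The mirror-to-symlink swap in step 2 could be scripted roughly as follows. This is a sketch under stated assumptions: `link_skills` is a hypothetical helper, not a script shipped in this PR, and it refuses to overwrite a real directory so mirror-only content is never lost.

```python
import os
from pathlib import Path


def link_skills(repo_root: Path, prefix: str = "bicameral-") -> list[str]:
    """Point .claude/skills/<name> at ../../skills/<name> for each skill."""
    canonical = repo_root / "skills"
    mirror = repo_root / ".claude" / "skills"
    mirror.mkdir(parents=True, exist_ok=True)
    linked = []
    for skill_dir in sorted(canonical.glob(f"{prefix}*")):
        if not skill_dir.is_dir():
            continue
        link = mirror / skill_dir.name
        if link.is_symlink():
            link.unlink()  # re-running is safe: stale links are replaced
        elif link.exists():
            raise FileExistsError(f"{link} is a real path; move it into skills/ first")
        # Relative target so the link resolves wherever the repo is cloned.
        target = os.path.join("..", "..", "skills", skill_dir.name)
        os.symlink(target, link, target_is_directory=True)
        linked.append(skill_dir.name)
    return linked
```

On Windows the `os.symlink` call needs symlink privileges, which is the same constraint the CLAUDE.md note in step 4 describes for checkouts.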

Refs: TODO.md:169 (now ticked), CLAUDE.md "Canonical Skill Source".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore(skills): replace .claude/skills/bicameral-* mirrors with symlinks to canonical skills/
…precondition)

Phase 4 of #87 broadens the preflight dedup cache key from `topic` alone
to `(topic_norm, file_paths_hash, ledger_revision)` so a same-topic call
within the 5-min window correctly invalidates when underlying ledger state
changed (M7a/b/c xfailed cases). `ledger_revision` derives from
`MAX(updated_at)` over the decision table — this PR is the schema half
of that contract; the handler-side broadening lands as a follow-up on
branch `87-preflight-dedup-key`.
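
For orientation, the broadened key could be assembled along these lines. This is an illustrative sketch, not the handler code: the normalization rule, the SHA-256 hash, and the function name `preflight_cache_key` are all assumptions.

```python
import hashlib
from datetime import datetime, timezone
from typing import Iterable, Optional


def preflight_cache_key(
    topic: str,
    file_paths: Iterable[str],
    ledger_revision: Optional[datetime],
) -> tuple[str, str, str]:
    """Build a (topic_norm, file_paths_hash, ledger_revision) cache key."""
    # ASSUMED normalization: lowercase and collapse whitespace.
    topic_norm = " ".join(topic.lower().split())
    # Sort and dedupe so path order does not change the key.
    joined = "\n".join(sorted(set(file_paths)))
    file_paths_hash = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    # An empty ledger has no MAX(updated_at); use a fixed marker.
    revision = ledger_revision.isoformat() if ledger_revision else "none"
    return topic_norm, file_paths_hash, revision
```

Any write that advances `ledger_revision` changes the third component, so a same-topic call inside the dedup window no longer hits the stale entry.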

Per Kevin's signoff (B2 approach, gh issue #87 comment thread): additive
schema bump, L1 risk, no tool contract change, falls back gracefully.
The field is `option<datetime>` rather than non-optional `datetime` because
DEFINE FIELD against existing rows leaves them as NONE until the migration
backfill runs — same precedent as v8→v9 (`decision_level` is `option<string>`
for identical reasons). Phase 4's MAX query can COALESCE(updated_at,
created_at) if it wants strict-non-NULL semantics.

Schema changes (ledger/schema.py):
- SCHEMA_VERSION 17 → 18 + compatibility-map entry
- DEFINE FIELD updated_at ON decision TYPE option<datetime> DEFAULT time::now()
- DEFINE INDEX idx_decision_updated_at ON decision FIELDS updated_at
- _migrate_v17_to_v18: idempotent DEFINE + backfill
  UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE

Call-site audits (7 UPDATEs now carry `, updated_at = time::now()`):
- ledger/queries.py:602  upsert_decision canonical-dedup UPDATE path
- ledger/queries.py:1072 update_decision_status
- ledger/queries.py:1163 update_decision_level
- ledger/adapter.py:1394 apply_ratify (signoff write)
- ledger/adapter.py:1428 apply_supersede (old decision signoff-freeze)
- handlers/resolve_collision.py:99  link_parent (cross-level parent link)
- handlers/resolve_collision.py:128 collision_pending clear (proposed signoff)

CREATE in queries.py:638 needs no edit — the DEFAULT picks up time::now()
on INSERT automatically.

Tests (tests/test_v18_decision_updated_at.py, 11 tests, all passing):
- Schema version advanced to v18
- CREATE populates updated_at via DEFAULT
- Each of the 7 UPDATE call sites bumps updated_at (one test each)
- Index supports ORDER BY updated_at DESC
- Migration backfill: pre-v18 rows with NONE → created_at

Sociable substrate over memory:// per CLAUDE.md guidance — real
SurrealDBLedgerAdapter + real LedgerClient, no mocks. The drift this
guards against is the kind solitary tests miss: a mock would happily
return whatever updated_at the test expects; only a real ledger UPDATE
proves the SQL actually carries the new column.

Regression check passes: tests/test_v15_migration.py, test_schema_persistence.py,
test_schema_recoverable_errors.py, test_sync_middleware.py,
test_codegenome_continuity_service.py, test_compliance_check_schema.py,
test_ledger_bicameral_meta_migration.py — 50/50 pass. The single
test_alpha_flow.py failure (test_code_edit_without_rebind_marks_drifted)
reproduces on origin/dev without this PR's changes — pre-existing, not
introduced here.

Refs #87 (Phase 4 precondition per spec signoff). Out of scope: dedup
key broadening itself (#87 Phase 4), telemetry (#87 Phase 5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d_at (#308 CI fix)

CI surfaced two issues on PR #308:

1. Ruff I001 — tests/test_v18_decision_updated_at.py import block was
   not alphabetically sorted within the `from ledger.queries import (...)`
   group. Auto-fixed.

2. tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span]
   blew up on the v17→v18 backfill:

     SurrealDB rejected query: Found NONE for field `created_at`,
     with record `decision:dec_1`, but expected a datetime
     SQL: UPDATE decision SET updated_at = created_at WHERE updated_at IS NONE

   The v3 fixture creates `decision:dec_1` via raw CREATE without
   setting `created_at`. Once init_schema applies
   `DEFINE FIELD created_at ON decision TYPE datetime`, ANY UPDATE on
   that row re-validates the row and trips the type assertion — even
   one that doesn't touch created_at. The earlier draft used
   `SET updated_at = created_at` which read the corrupt field directly;
   even after switching to time::now() in the SET clause, the implicit
   re-validation on UPDATE still failed.

## Fix

Switch the backfill from a single bulk UPDATE to a per-row loop with
try/except, mirroring `_clean_yields_legacy_rows` (which uses the same
tolerance pattern for v3-era stale yields edges):

```python
healed = skipped = 0  # tallies reported to the logger after the loop
ids = await client.query("SELECT id FROM decision WHERE updated_at IS NONE")
for row in ids:
    try:
        await client.execute(f"UPDATE {row['id']} SET updated_at = time::now()")
        healed += 1
    except Exception:
        skipped += 1  # row has other corrupt non-optional fields
        logger.warning(...)
```

Rows that fail stay with `updated_at=NONE` and MAX(updated_at) skips
them. Harmless for the dedup-cache marker (#87) since the marker only
needs monotonicity, not coverage — the new decisions created post-v18
get DEFAULT time::now() and dominate MAX().

The SELECT itself reads only `id`, so it doesn't trip the type
assertion on `created_at`. The WHERE clause on `updated_at IS NONE` is
safe because `updated_at` is `option<datetime>` (intentionally
optional — same precedent as v8→v9 `decision_level`).

## Files

- ledger/schema.py — _migrate_v17_to_v18: per-row UPDATE with
  try/except; emits healed/skipped counts to the logger
- tests/test_v18_decision_updated_at.py
  - Import sort fix (ruff I001)
  - test_v18_migration_backfills_legacy_rows_with_none_updated_at:
    call _migrate_v17_to_v18 directly instead of inlining the (now
    multi-statement) backfill body
  - test_v18_migration_backfill_tolerates_legacy_rows_with_none_created_at
    (NEW): inspects the migration source to guard against future
    drafts that reintroduce a created_at reference in the SET clause

## Verification

- tests/test_legacy_ledger_fixtures.py::test_legacy_ledger_fixture_reaches_clean_state[v3_yields_source_span] PASS
- tests/test_v18_decision_updated_at.py — 13/13 PASS (12 originals + 1 new regression guard)
- 94/94 in the broader schema/migration/dedup cluster
- `python3 -m ruff check` — clean on all touched files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's ruff job runs BOTH `ruff check` AND `ruff format --check`. The
former was clean after the import-sort fix, but the latter flagged
ledger/schema.py and tests/test_v18_decision_updated_at.py for
reformatting. Applied `ruff format` in place — pure whitespace /
line-length normalization, no semantic change.

Verified: `ruff format --check` clean on both files locally; 14/14
v18 + legacy-fixture tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jinhongkuan and others added 16 commits May 15, 2026 19:44
…nnel-autodetect

fix(setup): auto-detect nightly channel from .dev install version
Surfaces the setup-wizard nightly-channel auto-detect fix (PR #381) to
design partners. Without it, anyone who installed via
`pipx install --pip-args=--pre bicameral-mcp` ran `bicameral-mcp setup`
into a config hardcoded to `channel: stable`, so `bicameral.update`
silently never offered the nightly upgrade path.
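
The detection rule amounts to a check on the installed version string. A minimal sketch, assuming the `.dev` marker is the only signal; `detect_channel` is a hypothetical name, not the wizard's function.

```python
def detect_channel(installed_version: str) -> str:
    """Pick the update channel from the installed package version."""
    # Pre-release builds such as 2026.5.16.dev024452 carry a ".dev" segment.
    return "nightly" if ".dev" in installed_version else "stable"
```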

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…026.5.16.dev024452

chore(nightly): bump RECOMMENDED_NIGHTLY_VERSION to 2026.5.16.dev024452
…22→v23

The v18→v19 migration only seeded the bicameral_meta singleton when the
row was absent. When _write_wire_format_sentinel had already written a
row (v16's adapter.connect path), the seed branch was skipped and
decision_revision stayed NONE because SurrealDB v2's DEFAULT 0 does not
backfill existing rows. Every subsequent decision UPDATE then blew up
the decision_revision_bump trigger with "Cannot perform addition with
'NONE' and '1'", which _migrate_v22_to_v23's per-row try/except silently
swallowed — so the decision_level classification migration "succeeded"
while skipping every legacy row.

Fix in two places:
- _migrate_v18_to_v19: UPDATE existing rows to 0 when the field is NONE
  (root cause; prevents recurrence for any DB upgrading from <v19).
- _migrate_v22_to_v23: same backfill at the top as defense-in-depth so
  the per-row UPDATEs below land their classifications instead of
  silently failing.
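
The failure mode is easy to reproduce in miniature. The sketch below is an in-memory analogy, not the SurrealDB migration: `bump_revision` stands in for the decision_revision_bump trigger, Python's `None + 1` TypeError stands in for the "Cannot perform addition" error, and the `"classified"` value is a placeholder.

```python
def bump_revision(meta: dict) -> None:
    """Stand-in for the trigger: adding 1 to a missing counter fails."""
    meta["decision_revision"] = meta["decision_revision"] + 1


def classify_all(meta: dict, decisions: list, backfill_first: bool) -> tuple:
    """Return (classified, skipped) for a per-row tolerant migration."""
    if backfill_first and meta["decision_revision"] is None:
        meta["decision_revision"] = 0  # the fix: heal the counter up front
    classified = skipped = 0
    for row in decisions:
        try:
            bump_revision(meta)                    # fires on every UPDATE
            row["decision_level"] = "classified"   # placeholder value
            classified += 1
        except TypeError:
            skipped += 1  # tolerated per row, so a systemic failure is hidden
    return classified, skipped
```

Without the up-front backfill every row lands in `skipped` while the migration still returns normally, which is the "succeeded while skipping every legacy row" behavior described above.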

SCHEMA_VERSION stays at 23 — the buggy nightly (dev15124) was only ever
downloaded internally, so no forward-fix migration is needed.

Tests:
- test_migrate_v18_to_v19_backfills_decision_revision_on_preexisting_row:
  asserts a sentinel row with NONE decision_revision is rescued, and a
  real decision CREATE bumps the counter (trigger contract intact).
- test_v23_classifies_when_decision_revision_was_none: asserts v22→v23
  classifies legacy rows when entering with the broken counter state
  (no silent skips).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a ## Linked decisions section to the §4.3 PR body template,
parallel to ## Linked issues, and codifies the rule: every PR
authored by a BicameralAI org member references at least one
decision:<surrealdb-id> so reviewers can verify the change is
grounded in an explicit decision rather than ambient assumption.

External contributors are exempt — bicameral access is org-internal,
and gating community PRs on internal tooling is the wrong tradeoff.
The reviewing maintainer shepherds the decision ingest on the
contributor's behalf at merge time.

This is a doc-only rule; CI enforcement (lint that an org-member PR
body contains a decision:<id> token) can follow as a separate PR if
needed.
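
If that lint does follow, its core check might look like the sketch below. The regex and the `has_linked_decision` name are assumptions; the only grounded parts are the `decision:<id>` token shape and the external-contributor exemption.

```python
import re

# ASSUMED id shape: lowercase alphanumerics, as in the ids quoted in this PR.
DECISION_TOKEN = re.compile(r"\bdecision:[a-z0-9]+\b")


def has_linked_decision(pr_body: str, author_is_org_member: bool) -> bool:
    """True when the PR body satisfies the linked-decision rule."""
    if not author_is_org_member:
        return True  # external contributors are exempt
    return DECISION_TOKEN.search(pr_body) is not None
```

The unfilled template placeholder `decision:<surrealdb-id>` does not match, so a PR that leaves the template untouched would still be flagged.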

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI ruff format --check caught three files that were lint-clean but
not format-clean. No semantic change — line breaks collapse to
match the project's max-line-length policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d-decision

docs(dev-cycle): require linked bicameral decision on org-member PRs
…-backfill

fix(schema): backfill bicameral_meta.decision_revision in v18→v19 + v22→v23
Pre-fix, serve_stdio awaited get_code_locator().initialize() inline
before opening the MCP stdio transport. On a 150MB+ symbol-index DB
the cold path took ~45s (sqlite-vec open + tree-sitter load + BM25
pickle load), blowing past Claude Code's 30s MCP initialize timeout
on real-world repos — the server "started" but the JSON-RPC handshake
never landed and the client gave up.

Fix:

- ``RealCodeLocatorAdapter.initialize_in_background()`` — schedules
  ``_ensure_initialized`` in the default executor via an asyncio Task,
  returns immediately. A done-callback prints the bare error to stderr
  on failure so the operator still sees the actionable "Run: python -m
  code_locator index <repo_path>" hint that #243 wrote.
- ``_ensure_initialized`` now serializes its body via a
  threading.Lock. Sync callers from worker threads (the
  ``asyncio.to_thread(ctx.code_graph.<method>, ...)`` pattern every
  tool handler already uses) block on the lock until the background
  Task finishes, then see the post-init state and proceed. No callsite
  needs to know about the background Task.
- ``_run_init_body`` extracted from ``_ensure_initialized`` so tests
  can monkey-patch the slow body without bypassing the lock/state
  machine — the lock + Task glue is what's under test.
- ``wait_until_ready()`` — optional async gate for callers that want
  to explicitly await readiness from an async context and surface a
  structured error to the MCP client on failure.
- ``server.py:serve_stdio`` — replaces ``await
  get_code_locator().initialize()`` with
  ``get_code_locator().initialize_in_background()`` (synchronous, no
  await). Stderr message rewritten to reflect the new contract.
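
The lock-plus-Task arrangement can be sketched in isolation as follows. This is a simplified model, not `RealCodeLocatorAdapter`: the class name, the `lookup` method, the `init_calls` counter, and the sleep standing in for the index load are all assumptions.

```python
import asyncio
import sys
import threading
import time
from typing import Optional


class BackgroundInitLocator:
    def __init__(self, init_seconds: float = 0.05) -> None:
        self._lock = threading.Lock()
        self._ready = False
        self._task: Optional[asyncio.Task] = None
        self._init_seconds = init_seconds
        self.init_calls = 0

    def _run_init_body(self) -> None:
        time.sleep(self._init_seconds)  # stands in for the slow index load
        self.init_calls += 1

    def _ensure_initialized(self) -> None:
        with self._lock:  # worker threads block here until init finishes
            if not self._ready:
                self._run_init_body()
                self._ready = True

    def initialize_in_background(self) -> None:
        # Call from inside a running event loop; returns without awaiting.
        loop = asyncio.get_running_loop()
        self._task = loop.create_task(asyncio.to_thread(self._ensure_initialized))
        self._task.add_done_callback(self._report_failure)

    @staticmethod
    def _report_failure(task: asyncio.Task) -> None:
        if not task.cancelled() and task.exception() is not None:
            print(task.exception(), file=sys.stderr)

    async def wait_until_ready(self) -> None:
        if self._task is not None:
            await self._task

    def lookup(self, name: str) -> str:
        self._ensure_initialized()  # no-op once the background init is done
        return f"symbol:{name}"


async def demo() -> tuple:
    locator = BackgroundInitLocator()
    locator.initialize_in_background()
    result = await asyncio.to_thread(locator.lookup, "foo")
    await locator.wait_until_ready()
    return result, locator.init_calls
```

Whichever thread reaches the lock first runs the slow body; the other sees the ready flag and returns, so the init body runs once regardless of scheduling order.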

Trade-off: #243's "server refuses to boot when index is empty"
becomes "first code-locator tool call fails loudly when index is
empty." Operator still sees the failure on stderr at boot via the
done-callback. The fail-loud contract from #243 phase-2 signoff Q3
is preserved, just relocated from boot-time to first-tool-call-time.

Measured: JSON-RPC ``initialize`` reply now lands in ~16ms on this
repo's own 150MB code-graph.db (was ~45s).

Closes #380

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-handshake

fix(server): move code-locator init off MCP stdio handshake (#380)
…fe upsert

The v23 dedup index `idx_input_span_dedup` was UNIQUE on
`(source_type, source_ref, text)`. Phase B-1 (#221) introduced archive_key
and writes text='' for archive-keyed rows, so two distinct archive_keys
sharing (source_type, source_ref) collided on the empty-text slot. The
collision surfaced as a 500 on the dashboard's /history endpoint once any
second archive-keyed write to the same source bucket landed (transitively
via ensure_ledger_synced → link_commit → ingest paths).
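
The collision can be shown with plain tuples. In this sketch the sample rows and the `duplicates` helper are invented for illustration; the first three columns model the v23 key and the fourth adds archive_key.

```python
from collections import Counter

rows = [
    # (source_type, source_ref, text, archive_key), sample values only
    ("transcript", "meeting-1", "", "hash-a"),
    ("transcript", "meeting-1", "", "hash-b"),
    ("transcript", "meeting-1", "legacy verbatim text", None),
]


def duplicates(rows: list, width: int) -> list:
    """Keys that appear more than once when indexing the first `width` fields."""
    counts = Counter(row[:width] for row in rows)
    return [key for key, n in counts.items() if n > 1]


v23_collisions = duplicates(rows, 3)  # unique on (source_type, source_ref, text)
v24_collisions = duplicates(rows, 4)  # same key with archive_key appended
```

Appending a field can only split groups of equal keys, never merge them, so rows that were unique under three fields stay unique under four.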

Changes:
- Schema v23→v24: extend idx_input_span_dedup with archive_key as a 4th
  field. Non-destructive — adding a discriminator can only relax
  uniqueness, so all rows valid under the old index remain valid.
  Migration uses DEFINE INDEX OVERWRITE via _execute_define_idempotent
  (re-runnable). init_schema's OVERWRITE pass keeps the in-source DEFINE
  in sync on every connect.
- upsert_input_span: refactored into a thin retry wrapper around
  _upsert_input_span_once. The wrapper retries on the SurrealDB v2 MVCC
  "failed to commit transaction" string (bounded, no backoff — the
  conflicting writer has already committed by the time we see the error).
  The inner body now catches unique-index "already contains" violations
  on both the archive-keyed and legacy text-only paths, re-SELECTing to
  return the winning row's id instead of crashing.
- 6 new sociable tests pin: archive_key coexistence under v24, idempotent
  same-key dedup, concurrent-same-key race convergence, legacy text-path
  race safety, v24 migration idempotency, and a fixture that pins the v2
  MVCC error substring so a future surrealdb-py bump breaks loudly.
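
The wrapper-plus-inner-body split can be sketched against an in-memory stand-in. Everything here is hypothetical except the two error substrings quoted above: `FakeStore`, the function names, and the attempt bound are illustrative, and the real code talks to SurrealDB asynchronously.

```python
MVCC_CONFLICT = "failed to commit transaction"  # substring from the text
UNIQUE_VIOLATION = "already contains"           # substring from the text
MAX_ATTEMPTS = 5                                # illustrative bound


class FakeStore:
    """In-memory stand-in for the ledger; not the real client."""

    def __init__(self, conflicts_before_success: int = 0) -> None:
        self.rows = {}
        self._conflicts = conflicts_before_success

    def select_id(self, archive_key: str):
        return self.rows.get(archive_key)

    def create(self, archive_key: str) -> str:
        if self._conflicts > 0:
            self._conflicts -= 1
            raise RuntimeError(MVCC_CONFLICT)
        if archive_key in self.rows:
            raise RuntimeError(f"index {UNIQUE_VIOLATION} this key")
        self.rows[archive_key] = f"input_span:{len(self.rows) + 1}"
        return self.rows[archive_key]


def upsert_once(store: FakeStore, archive_key: str) -> str:
    existing = store.select_id(archive_key)
    if existing is not None:
        return existing
    try:
        return store.create(archive_key)
    except RuntimeError as exc:
        if UNIQUE_VIOLATION in str(exc):
            # Another writer won the race: return the winning row's id.
            return store.select_id(archive_key)
        raise


def upsert_with_retry(store: FakeStore, archive_key: str, attempts: int = MAX_ATTEMPTS) -> str:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return upsert_once(store, archive_key)
        except RuntimeError as exc:
            # Retry only on the MVCC conflict, immediately and without backoff.
            if MVCC_CONFLICT not in str(exc) or attempt == attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Matching on error substrings is brittle by nature, which is why the commit pins the substring with a fixture test.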

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bstone

Implements decision:i4wafafzowm3ai5eyhgs (ratified 2026-05-15).

bicameral.remove_decision now physically removes the decision row plus
every reference to it (binds_to / yields / supersedes / context_for /
about edges + the compliance_check verdict cache for the decision) and
orphans child decisions cleanly by NULLing parent_decision_id. The
decision_removed.completed event captures a full pre-deletion snapshot
in the journal so the action is recoverable from the event log alone —
the "soft audit trail" that replaces the tombstone row model.

Motivation: the soft-delete model was intended as a negative-signal
mechanism (rows with signoff.state="removed" warn future agents away
from re-introducing the same wrong decision). In practice the dominant
call shape is janitorial — test pollution, accidentally-ingested rows,
retracted ideas with no learning value — where tombstones become
friction that surfaces in preflight, occupies dashboard slots, and gets
re-bound by drift sweeps. Supersession remains the right tool when a
persistent negative signal is actually wanted.

Contract changes:
- RemoveDecisionResponse: drops `signoff` and `projected_status` (the
  row is gone — there's no signoff dict to return and the projected
  status is meaningless). Promotes the relevant fields to top level:
  was_new, event_logged, removed_at, previous_state, reason.
- Idempotency: missing decision_id returns was_new=False without
  raising. The matching event in the journal is the canonical record
  of any prior removal. Trade-off: typos for never-existed ids look
  like idempotency, but the SKILL.md flow (read history first, then
  call) catches that.
- server.py tool description updated to match.
- skills/remove-decision/SKILL.md rewritten end-to-end; .claude/skills
  copy synced.
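
The new contract can be modeled on plain dicts. This is a sketch of the described behavior, not the handler: the ledger layout, the `state` field, and `was_new=True` on a successful removal are assumptions, while the response field names and the event name come from the text above.

```python
from datetime import datetime, timezone


def remove_decision(ledger: dict, journal: list, decision_id: str, reason: str) -> dict:
    if not reason.strip():
        raise ValueError("reason is required")
    decision = ledger["decisions"].get(decision_id)
    if decision is None:
        # Idempotent path: nothing to delete and no new event.
        return {"was_new": False, "event_logged": False, "removed_at": None,
                "previous_state": None, "reason": reason}
    edges = [e for e in ledger["edges"] if decision_id in (e["in"], e["out"])]
    removed_at = datetime.now(timezone.utc).isoformat()
    # Snapshot first, so the removal is recoverable from the journal alone.
    journal.append({"type": "decision_removed.completed", "removed_at": removed_at,
                    "reason": reason,
                    "snapshot": {"decision": dict(decision), "edges": edges}})
    ledger["edges"] = [e for e in ledger["edges"] if e not in edges]
    ledger["compliance_cache"].pop(decision_id, None)
    for child in ledger["decisions"].values():
        if child.get("parent_decision_id") == decision_id:
            child["parent_decision_id"] = None  # orphan children cleanly
    del ledger["decisions"][decision_id]
    return {"was_new": True, "event_logged": True, "removed_at": removed_at,
            "previous_state": decision.get("state"), "reason": reason}
```

A second call with the same id takes the idempotent path and leaves the journal untouched, mirroring the second-call no-op case the tests describe.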

Out of scope (separate decisions):
- handlers/remove_source.py cascade still soft-deletes yielded
  decisions. That's a different tool's contract; touching it should be
  its own decision.
- dashboard.html "already-removed" button-disable guard remains as
  defensive dead code — cosmetic-only and out of scope.

Tests:
- tests/test_remove_decision.py rewritten as sociable (real
  SurrealDBLedgerAdapter over memory://) per pilot/mcp/CLAUDE.md. 9
  tests covering: reason validation, missing-id idempotency, full
  edge+cache cleanup, child orphan, second-call no-op, event
  emission/skipping, and idempotent no-event.
- tests/test_dogfood_label_propagation.py: removed the obsolete
  monkeypatches for handlers.remove_decision.project/update_decision_status
  (functions no longer imported by the new handler).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-delete work

claude-mem regenerated the recent-activity tables for handlers/ and tests/
after today's remove_decision hard-delete implementation, and seeded new
context files at skills/remove-decision/ and .claude/skills/remove-decision/
where the skill was edited. Purely auto-generated context — no code change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-suite contention

The safe-upsert retry loop landed at 5 attempts (ebcfeb4). Running the
full regression batch surfaced a flake in
test_concurrent_same_archive_key_race — three concurrent writers for the
same archive_key occasionally exceed 5 MVCC retries when the test suite
holds dozens of memory:// SurrealDB instances in the same process. Each
retry's SELECT short-circuits the moment the winning writer commits, so
the cost is one RTT per attempt, which is trivial. A cap of 10 attempts
absorbs the variance with ample headroom for production usage (where
contention storms of this shape can't happen: one DB per process).

A proper fix (per-key write queue instead of optimistic retry) is
tracked separately as a follow-up issue.

Also includes claude-mem auto-generated activity-log refreshes from
this session (no code change in those files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-handshake

feat(remove_decision): hard-delete by default + v24 input_span dedup index
…17→v24 chain

Drains the dev → main backlog accumulated since v0.14.7. Cumulative release;
non-destructive schema migration chain (v17→v24) applied automatically on
first connect. Breaking change: bicameral.remove_decision contract is now
hard-delete by default (decision:i4wafafzowm3ai5eyhgs).

Highlights:
- PII archive (#221 Phase A + B-1) — operator-erasable PII surface keyed by
  content-hash; ingest writes verbatim text to the archive and leaves
  input_span.text='' with the v22 ASSERT enforcing exactly-one-of.
- Hard-delete remove_decision — soft-delete tombstone retired; full
  pre-deletion snapshot lives in the event journal.
- Constant-time revision counter (#87 Phase 6) — bicameral_meta.decision_revision
  auto-bumped by DEFINE EVENT; replaces O(N) MAX(updated_at) scan in
  preflight dedup.
- bicameral.admin/query (#278 Phase 3), dashboard source view (#278 Phase 1),
  LocalDirectorySourceAdapter (#344), sync-and-brief team-mode (#279).
- Code-locator singleton + eager startup init (#243, #380) — index work moves
  off the per-call hot path and off the MCP stdio handshake.
- Schema v17→v24 chain — all additive, non-destructive.
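
To illustrate the revision-counter highlight, the sketch below contrasts the two ways of deriving a ledger revision. It is an in-memory analogy only: `MiniLedger` is hypothetical, and the increment inside `write` stands in for the DEFINE EVENT bump.

```python
class MiniLedger:
    def __init__(self) -> None:
        self.updated_at = {}        # decision id -> logical timestamp
        self.decision_revision = 0  # single counter, like bicameral_meta

    def write(self, decision_id: str, timestamp: int) -> None:
        self.updated_at[decision_id] = timestamp
        self.decision_revision += 1  # stand-in for the event-driven bump

    def revision_by_scan(self):
        # O(N): visits every decision, like MAX(updated_at).
        return max(self.updated_at.values(), default=None)

    def revision_by_counter(self) -> int:
        # O(1): reads one field regardless of ledger size.
        return self.decision_revision
```

Both values change on every write, which is all the dedup cache needs; only the counter avoids touching every row.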

Three architectural decisions ratified for the doctrine follow-up PR:
expand-only schema rule, feature-flag gating for new-schema-dependent code,
DEV_CYCLE.md §10.5.1 amendment for triage eligibility.

Closes decision:i4wafafzowm3ai5eyhgs.

See CHANGELOG.md for the full Added / Changed / Fixed / Schema-migrations /
Doctrine / Removed breakdown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jinhongkuan added the flow:release label (Release PR (BicameralAI/dev → BicameralAI/main) that promotes integrated work to a tagged release) on May 16, 2026
@coderabbitai

coderabbitai Bot commented May 16, 2026


Important

Review skipped

Too many files!

This PR contains 218 files, which is 68 over the limit of 150.

To get a review, narrow the scope:
• coderabbit review --type committed # exclude uncommitted changes
• coderabbit review --dir # limit to a subdirectory
• coderabbit review --base # compare against a closer base

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a6a5204c-8b7d-47e3-ad0c-b7f7b7e447ed

📥 Commits

Reviewing files that changed from the base of the PR and between 083c1d4 and 74377d3.

📒 Files selected for processing (218)
  • .claude/hooks/pre_tool_use_timeout_context.py
  • .claude/hooks/session_start_timeout_posture.py
  • .claude/skills/bicameral-bind
  • .claude/skills/bicameral-brief
  • .claude/skills/bicameral-capture-corrections
  • .claude/skills/bicameral-capture-corrections/SKILL.md
  • .claude/skills/bicameral-config
  • .claude/skills/bicameral-context-sentry
  • .claude/skills/bicameral-dashboard
  • .claude/skills/bicameral-dashboard/SKILL.md
  • .claude/skills/bicameral-diagnose
  • .claude/skills/bicameral-doctor
  • .claude/skills/bicameral-guided
  • .claude/skills/bicameral-history
  • .claude/skills/bicameral-history/SKILL.md
  • .claude/skills/bicameral-ingest
  • .claude/skills/bicameral-ingest/SKILL.md
  • .claude/skills/bicameral-judge-gaps
  • .claude/skills/bicameral-judge-gaps/SKILL.md
  • .claude/skills/bicameral-output-formats
  • .claude/skills/bicameral-preflight
  • .claude/skills/bicameral-preflight/CLAUDE.md
  • .claude/skills/bicameral-preflight/SKILL.md
  • .claude/skills/bicameral-report-bug
  • .claude/skills/bicameral-reset
  • .claude/skills/bicameral-reset/SKILL.md
  • .claude/skills/bicameral-resolve-collision
  • .claude/skills/bicameral-resolve-collision/SKILL.md
  • .claude/skills/bicameral-scan-branch
  • .claude/skills/bicameral-search
  • .claude/skills/bicameral-status
  • .claude/skills/bicameral-sync
  • .claude/skills/bicameral-update
  • .claude/skills/remove-decision/CLAUDE.md
  • .claude/skills/remove-decision/SKILL.md
  • .github/workflows/lint-and-typecheck.yml
  • .github/workflows/perf-gate.yml
  • .github/workflows/preflight-eval.yml
  • .github/workflows/test-mcp-regression.yml
  • .github/workflows/test-schema-persistence.yml
  • .gitignore
  • .pre-commit-config.yaml
  • CHANGELOG.md
  • CLAUDE.md
  • README.md
  • RECOMMENDED_NIGHTLY_VERSION
  • RECOMMENDED_VERSION
  • SECURITY.md
  • TODO.md
  • adapters/code_locator.py
  • adapters/ledger.py
  • assets/dashboard.html
  • cli/_diagnose_gather.py
  • cli/_ledger_io_engine.py
  • cli/brief_renderer.py
  • cli/diagnose.py
  • cli/ledger_export_cli.py
  • cli/ledger_import_cli.py
  • cli/ledger_io.py
  • cli/sync_and_brief_cli.py
  • consent.py
  • context.py
  • contracts.py
  • dashboard/admin.py
  • dashboard/server.py
  • docs/DEV_CYCLE.md
  • docs/META_LEDGER.md
  • docs/SHADOW_GENOME.md
  • docs/governance/compliance-stance-matrix.md
  • docs/governance/doctrine-deterministic-governance.md
  • docs/ideation-team-server-tier-v1-2026-05-14.md
  • docs/ledger-sociable-test-audit.md
  • docs/policies/acceptable-use.md
  • docs/policies/claude-hooks-mcp-integration.md
  • docs/policies/gdpr-art-17-erasure-roadmap.md
  • docs/policies/host-trust-model.md
  • docs/policies/ledger-export.md
  • docs/policies/notifications-roadmap.md
  • docs/policies/query-timeouts.md
  • docs/policies/sources-config.md
  • docs/policies/threat-model-and-trust-boundary.md
  • docs/preflight-failure-scenarios.md
  • docs/research-brief-compliance-audit-2026-05-06.md
  • docs/research-brief-r1-limitations-remediation-2026-05-14.md
  • docs/research-brief-team-server-tier-v1-2026-05-14.md
  • docs/semantic-drift-governance.md
  • docs/v0-productization-design-partner-dogfood.md
  • docs/v2-desync-optimization-guide.md
  • events/dogfood.py
  • events/sources/__init__.py
  • events/sources/granola.py
  • events/sources/local_directory.py
  • governance-gates.yaml
  • handlers/bind.py
  • handlers/history.py
  • handlers/ingest.py
  • handlers/link_commit.py
  • handlers/preflight.py
  • handlers/remove_decision.py
  • handlers/remove_source.py
  • handlers/resolve_collision.py
  • handlers/search_decisions.py
  • handlers/update.py
  • ledger/CLAUDE.md
  • ledger/adapter.py
  • ledger/client.py
  • ledger/queries.py
  • ledger/schema.py
  • ledger/timeout_telemetry.py
  • notifications/__init__.py
  • notifications/channel.py
  • notifications/contracts.py
  • notifications/stderr.py
  • pii_archive/__init__.py
  • pii_archive/contracts.py
  • pii_archive/store.py
  • preflight_telemetry.py
  • pyproject.toml
  • pytest.ini
  • scripts/audit_sociable_coverage.py
  • scripts/hooks/preflight_intent.py
  • scripts/lint_skill_governance.py
  • server.py
  • setup_wizard.py
  • skills/admin-surrealql/SKILL.md
  • skills/bicameral-brief/SKILL.md
  • skills/bicameral-context-sentry/CLAUDE.md
  • skills/bicameral-context-sentry/SKILL.md
  • skills/bicameral-doctor/SKILL.md
  • skills/bicameral-guided/SKILL.md
  • skills/bicameral-preflight/SKILL.md
  • skills/bicameral-scan-branch/SKILL.md
  • skills/bicameral-search/SKILL.md
  • skills/bicameral-status/SKILL.md
  • skills/bicameral-sync-and-brief/SKILL.md
  • skills/bicameral-update/SKILL.md
  • skills/remove-decision/CLAUDE.md
  • skills/remove-decision/SKILL.md
  • skills/remove-source/SKILL.md
  • telemetry_flags.py
  • tests/_extract_headless.py
  • tests/_replay_helpers.py
  • tests/conftest.py
  • tests/e2e/README.md
  • tests/e2e/run_e2e_flows.py
  • tests/eval/__init__.py
  • tests/eval/_bind_judge.py
  • tests/eval/_preflight_eval_seed.py
  • tests/eval/_preflight_m6_seeder.py
  • tests/eval/preflight_dataset.jsonl
  • tests/eval/run_preflight_eval.py
  • tests/eval_decision_relevance.py
  • tests/eval_grounding_recall.py
  • tests/eval_grounding_recall_summary.py
  • tests/eval_preflight_m6_recall.py
  • tests/eval_preflight_m6_summary.py
  • tests/fixtures/preflight_m6/__init__.py
  • tests/fixtures/preflight_m6/dataset.py
  • tests/fixtures/skill_lint/clean_skill/SKILL.md
  • tests/fixtures/skill_lint/flagged_skill/SKILL.md
  • tests/fixtures/skill_lint/registered_skill/SKILL.md
  • tests/perf/__init__.py
  • tests/perf/conftest.py
  • tests/perf/test_ledger_revision_perf.py
  • tests/regen_extraction_fixtures.py
  • tests/test_admin_surrealql_route.py
  • tests/test_brief_renderer.py
  • tests/test_claude_hooks_timeout_context.py
  • tests/test_codelocator_background_init.py
  • tests/test_compliance_policy_docs.py
  • tests/test_consent_notice.py
  • tests/test_dashboard_admin_panel.py
  • tests/test_dashboard_remove_flows.py
  • tests/test_dashboard_source_view.py
  • tests/test_diagnose_allowlist.py
  • tests/test_diagnose_cli.py
  • tests/test_dogfood_label_propagation.py
  • tests/test_grounding_failure_mode.py
  • tests/test_history_erasure_propagation.py
  • tests/test_history_input_span_id.py
  • tests/test_input_span_safe_upsert.py
  • tests/test_ledger_bicameral_meta_migration.py
  • tests/test_ledger_export_cli.py
  • tests/test_ledger_import_cli.py
  • tests/test_ledger_io_canonical_record.py
  • tests/test_ledger_io_export.py
  • tests/test_ledger_io_import.py
  • tests/test_ledger_mock_regression.py
  • tests/test_notifications_unit.py
  • tests/test_pii_archive_schema_migration.py
  • tests/test_pii_archive_schema_migration_b1.py
  • tests/test_pii_archive_unit.py
  • tests/test_preflight_dedup_telemetry.py
  • tests/test_preflight_dedup_v2.py
  • tests/test_preflight_graph_expansion.py
  • tests/test_preflight_hitl.py
  • tests/test_preflight_id_plumbing.py
  • tests/test_preflight_m6_eval.py
  • tests/test_query_timeout_handler_routing.py
  • tests/test_query_timeout_unit.py
  • tests/test_remove_decision.py
  • tests/test_remove_source.py
  • tests/test_replay_determinism.py
  • tests/test_replay_helpers_unit.py
  • tests/test_resolve_span_text_unit.py
  • tests/test_sessionstart_hook_install.py
  • tests/test_setup_wizard_channel_autodetect.py
  • tests/test_skill_governance_lint.py
  • tests/test_skills_symlink_integrity.py
  • tests/test_sources_granola_unit.py
  • tests/test_sources_local_directory_unit.py
  • tests/test_sync_and_brief_cli.py
  • tests/test_sync_and_brief_team_mode.py
  • tests/test_sync_middleware.py
  • tests/test_telemetry_flags.py
  • tests/test_v18_decision_updated_at.py
  • tests/test_v19_revision_counter.py
  • tests/test_v23_decision_level_backfill.py

jinhongkuan pushed a commit that referenced this pull request May 16, 2026
… (triage eligibility)

Encodes the three architectural decisions ratified 2026-05-15
(decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9,
decision:0ok1249n2tdrfud2a5j9):

§4.7 — new subsection enforcing two complementary rules for any PR
that touches ledger/schema.py or its _MIGRATIONS registry:

  §4.7.1 — schema migrations must be expand-only. Destructive operations
  (REMOVE / DROP / breaking ALTER / tightening ASSERT) live in their own
  commits and ship in a later release after the prior reader surface is
  validated as gone from prod. Includes an allowed/forbidden table for
  reviewer ease. CI lint planned via scripts/lint_schema_destructive.py.
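
  The commit names scripts/lint_schema_destructive.py only as planned, so the
  following is a minimal sketch of what such a check might look like, not the
  actual script. The pattern list and the function name
  `find_destructive_statements` are assumptions made for illustration. It
  covers only the syntactically obvious cases (REMOVE / DROP); a breaking
  ALTER or a tightening ASSERT needs semantic judgment that a line-level
  regex cannot provide.

```python
import re

# Hypothetical patterns: the source does not specify the real lint's rules.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"^\s*REMOVE\s+(TABLE|FIELD|INDEX)\b", re.IGNORECASE),
    re.compile(r"^\s*DROP\b", re.IGNORECASE),
]


def find_destructive_statements(migration_sql: str) -> list[tuple[int, str]]:
    """Return (line_number, statement) pairs that look destructive."""
    hits = []
    for lineno, line in enumerate(migration_sql.splitlines(), start=1):
        if any(p.search(line) for p in DESTRUCTIVE_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Illustrative migration text, not taken from ledger/schema.py.
    sample = (
        "DEFINE FIELD updated_at ON decision TYPE datetime;\n"
        "REMOVE FIELD legacy_flag ON decision;\n"
    )
    for lineno, stmt in find_destructive_statements(sample):
        print(f"line {lineno}: destructive statement: {stmt}")
```

  A CI job could fail the build whenever the returned list is non-empty for a
  migration that also contains expand-type statements.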

  §4.7.2 — code paths that depend on new schema must be feature-flag
  gated and default OFF in prod (env var or .bicameral/config.yaml
  setting). Schema ships immediately; flag flips later in a separate
  release. If the experiment is killed, the flag never flips on and a
  follow-up cleanup migration drops the slot. Exception: invariant
  bugfixes (e.g. fixing a unique-index collision that breaks the
  dashboard for everyone) don't need flag-gating — that's not feature
  surface.
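
  A minimal sketch of the default-OFF gating described above, assuming an
  environment-variable convention. The `BICAMERAL_FLAG_` prefix, the
  `flag_enabled` helper, and the `revision_counter` flag name are all
  hypothetical; the source says only that the flag is an env var or a
  .bicameral/config.yaml setting.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env=None) -> bool:
    """Return True only when the flag is explicitly set; unset means OFF."""
    env = os.environ if env is None else env
    # Hypothetical naming convention, not confirmed by the source.
    raw = env.get(f"BICAMERAL_FLAG_{name.upper()}", "")
    return raw.strip().lower() in _TRUTHY


def render_decision(decision: dict, env=None) -> dict:
    """Build a view of a decision; the new-schema field is flag-gated."""
    view = {"id": decision["id"], "text": decision["text"]}
    if flag_enabled("revision_counter", env):
        # This path reads the new schema slot and stays dark by default.
        view["revision"] = decision.get("revision", 0)
    return view


if __name__ == "__main__":
    record = {"id": "d1", "text": "example", "revision": 3}
    print(render_decision(record, env={}))
    print(render_decision(record, env={"BICAMERAL_FLAG_REVISION_COUNTER": "on"}))
```

  Because the gate defaults to OFF, the schema slot can ship in one release
  and the reader path can be switched on in a later one without a code change.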

  §4.7.3 — concrete PR-review checklist for schema-touching PRs.

§10.5.1 — triage eligibility rule rewritten. Previously: "schema-migrating
changes are not triage-eligible" (blanket). Now: schema migrations CAN ride
a triage release if they comply with §4.7 (expand-only AND feature code is
flag-gated). The blanket ban is replaced by enumerated exclusions
(destructive schema, flag-flip releases, breaking public-API changes,
multi-PR epics, v1 patches).
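
The rewritten rule can be read as a small predicate. The sketch below is an
illustration only: the `Change` fields and the `triage_eligible` function are
hypothetical names, and the invariant-bugfix exception from §4.7.2 is not
modeled.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical description of a candidate change for a triage release."""
    touches_schema: bool = False
    expand_only: bool = True
    feature_flag_gated: bool = True
    destructive_schema: bool = False
    flips_flag: bool = False
    breaks_public_api: bool = False
    multi_pr_epic: bool = False
    v1_patch: bool = False


def triage_eligible(change: Change) -> bool:
    """Apply the enumerated exclusions, then the §4.7 compliance check."""
    excluded = (
        change.destructive_schema
        or change.flips_flag
        or change.breaks_public_api
        or change.multi_pr_epic
        or change.v1_patch
    )
    if excluded:
        return False
    if change.touches_schema:
        # Schema may ride a triage release only when it complies with §4.7.
        return change.expand_only and change.feature_flag_gated
    return True


if __name__ == "__main__":
    print(triage_eligible(Change(touches_schema=True)))
    print(triage_eligible(Change(touches_schema=True, expand_only=False)))
```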

Motivation: the prior rule was correct under the implicit assumption that
schema and feature ship together, in which case neither can ship without
the other. Once §4.7 decouples them, schema can drain to main on every
triage release instead of accumulating on dev waiting for a "real" release.
The current v18→v24 backlog (drained by the v0.15.0 release PR #388) is the
symptom the prior rule produced; this amendment prevents recurrence.

Refs decision:cp25jfz1nt6h3u2gjzmu, decision:adklplvfhthkdch05pe9, decision:0ok1249n2tdrfud2a5j9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	RECOMMENDED_VERSION
#	pyproject.toml
@jinhongkuan jinhongkuan requested a deployment to recording-approval May 16, 2026 06:53 — with GitHub Actions Waiting
@jinhongkuan jinhongkuan enabled auto-merge (rebase) May 16, 2026 07:02
@jinhongkuan jinhongkuan merged commit 6963cb0 into main May 16, 2026
10 of 11 checks passed
@jinhongkuan jinhongkuan deleted the release/v0.15.0 branch May 16, 2026 07:08
Labels

flow:release Release PR (BicameralAI/dev → BicameralAI/main) that promotes integrated work to a tagged release

Development

Successfully merging this pull request may close these issues.

Preflight: broaden dedup cache key to include file_paths + ledger revision (M7a/b/c)

3 participants