Merged
63 changes: 63 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,69 @@
All notable changes to bicameral-mcp are tracked here. Format loosely follows
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## 0.4.23 — 2026-04-21 — caller-LLM-driven retrieval + search_hint recall booster

Addresses the BM25 vocab-mismatch problem that surfaced after v0.4.20
made grounding status honest. Decisions whose natural-language
description doesn't lexically overlap with the real code identifier
vocabulary were getting bound to whatever file incidentally shared a
keyword — "email dispatch" binding to a React toast reducer's
`dispatch`, "active subscriber" binding to an unrelated `AcquisitionFunnel.tsx`
`ActiveUser` component. Under v0.4.19's silent auto-promotion nobody
saw this; under v0.4.20's honest PENDING projection users saw garbage
bindings and had no way to correct them.

Two changes, both within the existing deterministic-retrieval moat:

### Changed — caller-LLM retrieval is now the default (Lever 1)

- `skills/bicameral-ingest/SKILL.md` restructured. Step 2 is now
*"Resolve code regions via the MCP retrieval tools"* — caller LLM is
instructed to use `validate_symbols` + `search_code` + `get_neighbors`
to build explicit `code_regions` from codebase evidence *before*
ingesting. Step 3 now leads with the internal format (with explicit
regions) as the preferred shape. Natural format remains supported
as the fallback for abstract decisions with no resolvable code surface.
- No server-side code changes. The server already accepted internal-
format ingest payloads; this flips the skill's default guidance from
*"use natural format, let BM25 handle it"* to *"resolve explicitly,
fall back to BM25 only when necessary."*

### Added — `search_hint` recall booster (Lever 2)

- `IngestMapping.search_hint: str` and `IngestDecision.search_hint: str`
— optional caller-supplied field carrying synonyms / domain vocab /
likely identifier names that the decision's description wouldn't
contain literally. Used only when the mapping falls through to
server-side auto-grounding.
- `adapters.code_locator.ground_mappings` concatenates
`description + " " + search_hint` as the BM25 query when the hint is
non-empty. Strictly additive: omitted hint = pre-v0.4.23 behavior.
- `search_hint` is query-only metadata. It is never stored on
`intent.description` and never surfaces in briefs, status responses,
or the gap-judge context pack. Humans see the clean decision text;
BM25 sees the widened query.
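
The query-widening rule can be sketched as a small pure function. It mirrors the logic in the `adapters/code_locator.py` hunk of this change; the function name `build_bm25_query` is hypothetical and exists only for illustration.

```python
from typing import Optional


def build_bm25_query(description: str, search_hint: Optional[str] = "") -> str:
    """Return the BM25 query: the description, widened by the hint if present."""
    # None and whitespace-only hints are treated as "no hint".
    hint = (search_hint or "").strip()
    if hint:
        return f"{description} {hint}"
    # No hint: the query is the bare description, as before v0.4.23.
    return description


widened = build_bm25_query(
    "email dispatch filters on subscription status",
    "resolveMemberStatus isActiveSubscriber",
)
unchanged = build_bm25_query("ship by Q3")
print(widened)
print(unchanged)
```

Because the hint only ever extends the query string, the stored description is untouched regardless of what the caller supplies.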

### Guarantee preserved — no server-side LLM

Retrieval remains deterministic at runtime. The caller LLM does the
expensive lookup at ingest time (when it has your full codebase
context), writes explicit `code_regions`, and the server's BM25 fallback
is only consulted for truly abstract decisions. This keeps the tech
moat (*deterministic, provider-agnostic retrieval*) intact while
fixing the quality complaint.

### Upgrade notes

- **Existing bindings from pre-v0.4.23 ingests are unchanged.** If you
have false-positive bindings from BM25 auto-grounding (e.g., dispatch
intents bound to `use-toast.ts`), they persist in the graph. To clean
them up today: `bicameral.reset` → re-ingest under the new skill
defaults. A targeted edge-pruning path is tracked for a future release.
- **No schema change**, no migration, no behavior shift for running
callers — the skill update only changes the default path the caller
LLM takes when the bicameral-ingest skill is invoked.

## 0.4.22 — 2026-04-20 — hotfix: init_schema idempotent against existing persistent DB

**Hotfix for v0.4.20 regression on persistent DBs.** Phase 1b made
2 changes: 1 addition & 1 deletion RECOMMENDED_VERSION
@@ -1 +1 @@
-0.4.22
+0.4.23
21 changes: 16 additions & 5 deletions adapters/code_locator.py
@@ -404,30 +404,41 @@ def ground_mappings(self, mappings: list[dict]) -> tuple[list[dict], int]:
resolved.append(mapping)
continue

# v0.4.23 (Lever 2): widen the BM25 query with the caller-LLM's
# search_hint — synonyms, domain vocab, likely identifier names.
# Strictly additive: if no hint is provided, behavior is
# identical to pre-0.4.23. Hint is NOT stored as part of the
# intent's description; it's query-only metadata.
search_hint = (mapping.get("search_hint") or "").strip()
if search_hint:
bm25_query = f"{description} {search_hint}"
else:
bm25_query = description

# FC-1 guard: refuse to ground queries that degenerate to <2 corpus
# tokens. Witnessed in Accountable (2026-04-13): "GitHub Discussions
# vs Slack" left only ``slack`` after stopword filtering, BM25
# ranked every slack-mentioning file, and the #2 hit was anchored
# to ``log-error-to-slack/index.ts:getFeatureName`` by tiebreak.
# Under-specified queries belong as ungrounded open questions.
try:
-corpus_token_count = self._search_tool.bm25.count_corpus_tokens(description)
+corpus_token_count = self._search_tool.bm25.count_corpus_tokens(bm25_query)
except Exception as exc:
-logger.debug("[ground] FC-1 token count failed for '%s': %s", description[:60], exc)
+logger.debug("[ground] FC-1 token count failed for '%s': %s", bm25_query[:60], exc)
corpus_token_count = 2 # fail-open: do not block grounding on detector failure
if corpus_token_count < 2:
logger.info(
"[ground] FC-1 skip: %d corpus tokens in %r — leaving ungrounded",
-corpus_token_count, description[:60],
+corpus_token_count, bm25_query[:60],
)
resolved.append(mapping)
continue

# Run BM25 search once and reuse across tiers.
try:
-hits = self.search_code(description)
+hits = self.search_code(bm25_query)
except Exception as exc:
-logger.warning("[ground] BM25 search failed for '%s': %s", description[:60], exc)
+logger.warning("[ground] BM25 search failed for '%s': %s", bm25_query[:60], exc)
hits = []

code_regions: list[dict] = []
14 changes: 14 additions & 0 deletions contracts.py
@@ -394,6 +394,13 @@ class IngestMapping(BaseModel):
span: IngestSpan = IngestSpan()
symbols: list[str] = []
code_regions: list[IngestCodeRegion] = []
# v0.4.23 (Lever 2): optional additional BM25 search terms — synonyms,
# related domain vocab, likely code identifiers — used to widen recall
# when the server falls through to auto-grounding. Only consulted when
# ``code_regions`` is empty. The caller LLM supplies these at ingest
# time via the bicameral-ingest skill. The stored ``intent.description``
# is never polluted with these terms; they're query-only metadata.
search_hint: str = ""


class IngestDecision(BaseModel):
@@ -417,6 +424,13 @@ class IngestDecision(BaseModel):
# the conclusion), and the ingest path will skip creating a
# placeholder source_span row.
source_excerpt: str = ""
# v0.4.23 (Lever 2): BM25 recall booster for the auto-grounding
# fallback path. Caller LLM supplies synonyms, domain vocab, likely
# identifier names that the decision's description wouldn't contain
# literally ("subscription check" → "resolveMemberStatus
# isActiveSubscriber dispatch_reminders dispatch_interventions"). Only
# consulted when no explicit code_regions are resolved.
search_hint: str = ""


class IngestActionItem(BaseModel):
2 changes: 2 additions & 0 deletions handlers/ingest.py
@@ -56,6 +56,8 @@ def _normalize_payload(payload: dict) -> dict:
},
"symbols": [],
"code_regions": [],
# v0.4.23: propagate search_hint from natural format → internal.
"search_hint": d.search_hint,
})

for a in validated.action_items:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "bicameral-mcp"
-version = "0.4.22"
+version = "0.4.23"
description = "Decision ledger MCP server — ingests meeting transcripts, maps decisions to code, tracks drift"
readme = "README.md"
requires-python = ">=3.10"
122 changes: 85 additions & 37 deletions skills/bicameral-ingest/SKILL.md
@@ -158,21 +158,85 @@ Keep the business driver attached to each decision's description so the gap judg

→ **Extract: 1 decision** — "Add PII redaction to the audit log (driver: GDPR self-assessment data-minimization check, next month deadline)." The key-rotation line is security hygiene with no business driver named — reject it. A PM reviewing the ledger can act on the GDPR item; they can't act on key rotation.

### 2. Validate relevance against the codebase
### 2. Resolve code regions via the MCP retrieval tools (v0.4.23+ default)

**This is where grounding quality is won or lost.** Server-side BM25 is a fallback
for *abstract* decisions with no identifiable code anchor. For every decision
that touches concrete code, **you** (the caller LLM) should resolve explicit
`code_regions` using the MCP retrieval tools before ingesting. You have full
codebase context; BM25 has a bag of tokens. Use your advantage.

**Procedure per decision**:

1. **Generate symbol hypotheses** from the decision text. If a decision says
*"all email dispatch functions filter via a single source-of-truth check,"*
your hypotheses are `dispatchReminders`, `dispatchInterventions`,
`dispatchNudge`, `resolveMemberStatus`, `isActiveSubscriber` — not just
the literal word "dispatch."
2. **Call `validate_symbols`** with the hypotheses. Keep symbols that actually
exist in the index; drop the rest.
3. **Call `search_code`** with the validated symbol_ids (not the raw decision
text — seeded graph traversal is strictly better than keyword BM25 for
finding the real regions). Take the top hits that look relevant.
4. **Call `get_neighbors`** on the top hit if you're unsure of scope — surfaces
callers/callees so you can tell whether the decision is local to one
function or spans a call tree.
5. **Build explicit `code_regions`** — `{file_path, symbol, start_line, end_line, type}` —
from the validated tool output. Prefer function-level pins over file-level;
bind to the tightest region that still covers the decision's surface area.
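
The procedure above can be sketched as a small driver. This is a minimal sketch under loud assumptions: the real MCP tools' signatures and return shapes are not shown in this document, so `validate_symbols` and `search_code` are injected as plain callables, and the index contents, file paths, and line numbers are made up. Step 4 (`get_neighbors`) is left out because it is only needed when scope is unclear.

```python
from typing import Callable, Dict, List, Tuple

Hit = Dict[str, object]
REGION_KEYS = ("file_path", "symbol", "start_line", "end_line", "type")


def resolve_code_regions(
    hypotheses: List[str],
    validate_symbols: Callable[[List[str]], List[str]],
    search_code: Callable[[List[str]], List[Hit]],
    max_hits: int = 3,
) -> Tuple[List[str], List[Hit]]:
    """Validate symbol hypotheses, search with the survivors, build regions."""
    valid = validate_symbols(hypotheses)      # step 2: keep names found in the index
    if not valid:
        return [], []
    hits = search_code(valid)[:max_hits]      # step 3: search on symbols, not raw text
    # Step 5: keep only the fields an explicit code region needs.
    regions = [{key: hit[key] for key in REGION_KEYS} for hit in hits]
    return valid, regions


# Stand-in index with invented paths and line numbers, for illustration only.
FAKE_INDEX: Dict[str, Hit] = {
    "dispatchReminders": {
        "file_path": "reminders.ts", "symbol": "dispatchReminders",
        "start_line": 10, "end_line": 48, "type": "function", "score": 0.9,
    },
    "resolveMemberStatus": {
        "file_path": "members.ts", "symbol": "resolveMemberStatus",
        "start_line": 5, "end_line": 30, "type": "function", "score": 0.8,
    },
}


def fake_validate_symbols(names: List[str]) -> List[str]:
    return [name for name in names if name in FAKE_INDEX]


def fake_search_code(symbols: List[str]) -> List[Hit]:
    return [FAKE_INDEX[symbol] for symbol in symbols]


valid, regions = resolve_code_regions(
    ["dispatchReminders", "dispatchNudge", "resolveMemberStatus"],
    fake_validate_symbols,
    fake_search_code,
)
print(valid)
print(regions)
```

Injecting the tools keeps the sketch runnable offline; in a real session the caller LLM issues the tool calls itself and judges relevance before building the regions.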

**Grounding quality: filter out false positives before ingesting**. If
`search_code` returns a hit that keyword-matches but doesn't actually implement
anything related to the decision, drop it. Example: a decision about email
dispatch should NOT bind to a React `dispatch` reducer just because the word
appears. Ingesting garbage bindings means every edit to that unrelated file
triggers a drift alarm later — noise that drowns out real signal.
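
One mechanical approximation of this filter is to keep only hits whose symbol is one you validated. The sketch below assumes hits are dicts with a `symbol` key, which may not match the real `search_code` output, and the `reminders.ts` path is invented. It does not replace reading the hit and judging whether it implements the decision.

```python
from typing import Dict, Iterable, List


def drop_keyword_only_hits(hits: List[Dict], validated_symbols: Iterable[str]) -> List[Dict]:
    """Keep hits anchored on a validated symbol; drop incidental keyword matches."""
    wanted = set(validated_symbols)
    return [hit for hit in hits if hit["symbol"] in wanted]


hits = [
    {"file_path": "reminders.ts", "symbol": "dispatchReminders"},
    # Shares the word "dispatch" but is a React toast reducer, not email code.
    {"file_path": "use-toast.ts", "symbol": "dispatch"},
]
kept = drop_keyword_only_hits(hits, ["dispatchReminders", "resolveMemberStatus"])
print([hit["file_path"] for hit in kept])
```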

**Skip decisions that don't bind to real code**. If after this procedure the
decision has zero concrete regions AND names no valid symbols, it's either
(a) strategic (drop it) or (b) a genuine "pending" decision for code that
doesn't exist yet. For the pending case, ingest it with empty `code_regions`
but include a `search_hint` (see Step 3) so the server's future re-grounding
sweeps have something to work with.
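
The skip rule reduces to a three-way triage. In this sketch the function name and the `code_not_built_yet` flag are hypothetical; the flag stands for the caller's own judgment that the decision describes code that should exist but does not yet.

```python
from typing import List


def triage_decision(
    code_regions: List[dict],
    valid_symbols: List[str],
    code_not_built_yet: bool,
) -> str:
    """Return 'ingest', 'pending', or 'drop' for one candidate decision."""
    # Any concrete region or validated symbol means the decision touches real code.
    if code_regions or valid_symbols:
        return "ingest"
    # Nothing to bind to: pending if the code is still to be built, else strategic.
    return "pending" if code_not_built_yet else "drop"


print(triage_decision([{"file_path": "session.ts"}], ["SessionCache"], False))
print(triage_decision([], [], True))
print(triage_decision([], [], False))
```

A `pending` result is the case where the mapping is ingested with empty `code_regions` and a `search_hint`.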

For each candidate decision, use the code locator tools to check whether it touches real code:

- Call `search_code` with a query derived from the decision text. If results come back with relevant hits, the decision is groundable.
- If the decision mentions specific symbols (functions, classes, modules), call `validate_symbols` with those names to confirm they exist.
- If a decision returns **zero relevant code hits** and names **no valid symbols**, it is likely strategic — drop it unless it describes something that *should* be built but doesn't exist yet (a genuine "pending" decision).
### 3. Ingest the filtered set

This step is a lightweight filter, not an exhaustive audit. Spend ~1 search per candidate decision.
Call `bicameral.ingest` using the **internal format** (preferred as of
v0.4.23) with the `code_regions` you resolved in Step 2. Natural
format remains supported as a fallback for truly abstract decisions with
no resolvable code surface.

### 3. Ingest the filtered set
**Internal format** (preferred v0.4.23+) — use this when you resolved
`code_regions` in Step 2:

Call `bicameral.ingest` with a `payload` using the **natural format** (preferred). Only include decisions that passed the relevance filter from step 2.
```
payload: {
query: "<topic / feature area — drives the auto-brief>",
mappings: [
{
intent: "Cache user sessions in Redis for horizontal scaling",
span: {
text: "<source excerpt>",
source_type: "transcript",
source_ref: "sprint-14-planning",
meeting_date: "2026-04-15",
speakers: ["Ian", "Brian"]
},
symbols: ["SessionCache", "RedisClient"],
code_regions: [
{ file_path: "src/lib/session.ts", symbol: "SessionCache",
start_line: 42, end_line: 89, type: "class" },
{ file_path: "src/lib/redis.ts", symbol: "RedisClient",
start_line: 1, end_line: 34, type: "class" }
],
search_hint: "SessionCache RedisClient session-cache horizontal scaling"
}
]
}
```
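
A payload like the one above can be assembled programmatically. The helper below is a hypothetical client-side sketch: `make_mapping` and its sanity checks are assumptions for illustration and say nothing about the validation the server actually performs.

```python
from typing import Dict, List

REQUIRED_REGION_KEYS = ("file_path", "symbol", "start_line", "end_line", "type")


def make_mapping(intent: str, span: Dict, code_regions: List[Dict],
                 search_hint: str = "") -> Dict:
    """Build one internal-format mapping after basic checks on each region."""
    for region in code_regions:
        missing = [key for key in REQUIRED_REGION_KEYS if key not in region]
        if missing:
            raise ValueError(f"region is missing keys: {missing}")
        if region["start_line"] > region["end_line"]:
            raise ValueError(f"start_line after end_line in {region['file_path']}")
    return {
        "intent": intent,
        "span": span,
        # De-duplicate symbols while preserving first-seen order.
        "symbols": list(dict.fromkeys(region["symbol"] for region in code_regions)),
        "code_regions": list(code_regions),
        "search_hint": search_hint,
    }


mapping = make_mapping(
    intent="Cache user sessions in Redis for horizontal scaling",
    span={"text": "<source excerpt>", "source_type": "transcript",
          "source_ref": "sprint-14-planning", "meeting_date": "2026-04-15"},
    code_regions=[{"file_path": "src/lib/session.ts", "symbol": "SessionCache",
                   "start_line": 42, "end_line": 89, "type": "class"}],
    search_hint="SessionCache RedisClient session-cache horizontal scaling",
)
payload = {"query": "session caching", "mappings": [mapping]}
print(payload["mappings"][0]["symbols"])
```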

**Natural format** — canonical fields (use this shape):
**Natural format** (fallback) — use when a decision is truly abstract
and has no resolvable code surface:

```
payload: {
@@ -184,10 +248,12 @@ payload: {
decisions: [
{
description: "Cache user sessions in Redis for horizontal scaling",
id: "sprint-14-planning#session-cache" # optional stable id
id: "sprint-14-planning#session-cache", # optional stable id
search_hint: "SessionCache RedisClient session cache horizontal scaling"
},
{
description: "Apply 10% discount on orders ≥ $100"
description: "Apply 10% discount on orders ≥ $100",
search_hint: "calculateDiscount order_total applyDiscount PricingService"
}
],
action_items: [
@@ -198,37 +264,19 @@ payload: {

**Field rules** — get these right or decisions evaporate:

- **`mappings[].code_regions`** is the whole game from v0.4.23+. When you pass explicit regions, server BM25 does not run for that mapping — grounding is exactly what you resolved. No false positives from vocab mismatch.
- **`search_hint`** is the fallback recall booster. When server BM25 *does* run (you didn't resolve `code_regions`), the server uses `description + " " + search_hint` as the BM25 query. Put 3-5 likely identifier names or domain synonyms here: the vocabulary your codebase uses that the decision's natural-language description wouldn't contain literally. Example: a decision about "subscription status source-of-truth" won't mention `resolveMemberStatus` or `isActiveSubscriber`, but BM25 needs those tokens to find the right dispatch functions. `search_hint` is query-only: it's never stored as part of the intent's description and never appears in briefs.
- **`decisions[].description`** is the canonical text field. `title` is accepted as a synonym for back-compat; `text` is tolerated as an alias (v0.4.16+). At least one of the three must be non-empty or the decision is silently dropped.
- **`action_items[].action`** is the canonical text field. `text` is tolerated as an alias (v0.4.16+). `owner` defaults to `"unassigned"`. `due` is an optional ISO date.
- **`query`** is load-bearing: it's the topic the post-ingest auto-brief and gap-judge chain fire on. If you omit it, the handler falls through to the longest decision description as a topic guess — usable but less focused. **When fanning out from the boundary-detection flow (step 0), always pass each segment's title as `query`.**
- **`participants`** on the payload populates `span.speakers` for every decision. Put the meeting attendees here, not on individual decisions.
- **`participants`** (natural format) or **`span.speakers`** (internal format) records the meeting attendees.
- Do NOT include `open_questions` unless they have direct implementation implications — they're accepted as `list[str]` but clutter the ledger with non-code entries.

**Internal format** — only if you already have pre-resolved code regions from `search_code` / `validate_symbols`:

```
payload: {
query: "...",
mappings: [
{
intent: "Cache user sessions in Redis",
span: {
text: "<source excerpt>",
source_type: "transcript",
source_ref: "sprint-14-planning",
meeting_date: "2026-04-15"
},
symbols: ["SessionCache"],
code_regions: [
{ file_path: "src/lib/session.ts", symbol: "SessionCache",
start_line: 42, end_line: 89, type: "class" }
]
}
]
}
```
**When to choose which format**:

Use the natural format in the common case. Fall through to internal format only when you already have verified file/line pins — otherwise you'll bypass auto-grounding and the server can't map decisions to code on its own.
- **Internal format, v0.4.23+ default.** You resolved `code_regions` via Step 2. Ingest with explicit pins. The ledger is a trustworthy drift anchor — editing those pinned files fires real drift alarms; editing unrelated files fires nothing. This is the posture we want for real branches.
- **Natural format + `search_hint`, fallback.** The decision is abstract ("ship by Q3," "SOC2-compliant session storage") or points at code that doesn't exist yet. Server BM25 tries with the widened query; if it produces zero hits the intent stays ungrounded (honest). If BM25 produces a false-positive binding, you'll catch it at the first `bicameral.doctor` or via a pending_compliance_check verdict.
- **Natural format WITHOUT `search_hint`, legacy.** Works, but this is how the 2026-04-20 Accountable dispatcher ingest ended up with "all dispatch functions" bound to `use-toast.ts:dispatch`. You almost always want at least the hint.

### 4. Report results
