Merged
11 changes: 11 additions & 0 deletions .github/workflows/test-mcp-regression.yml
@@ -148,6 +148,17 @@ jobs:
--gate-mode warn
-o test-results/m2-grounding-recall.json

# ── Surface M2 metrics on the GitHub run summary (#280 PR-3) ───
# Reads test-results/m2-grounding-recall.json and renders a markdown
# table to $GITHUB_STEP_SUMMARY so reviewers can read precision /
# recall / abort-rate without downloading the artifact. always()
# guard means the summary appears even when the eval step above
# exited non-zero (warn-only currently masks that).
- name: M2 metrics summary
  if: always() && matrix.os == 'ubuntu-latest'
  continue-on-error: true
  run: python tests/eval_grounding_recall_summary.py test-results/m2-grounding-recall.json >> "$GITHUB_STEP_SUMMARY"

# ── Generate rich E2E report from artifacts ────────────────────
# Ubuntu-only: the script consumes the medusa adversarial corpus
# (cloned only on Ubuntu above) plus the Phase 3 E2E artifacts
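
The summary step above pipes a script's stdout into `$GITHUB_STEP_SUMMARY`. The following is a minimal sketch of what such a renderer could look like, not the actual `tests/eval_grounding_recall_summary.py`. The JSON keys (`precision`, `recall`, `abort_rate`) and the function names are assumptions made for illustration; the real eval output may be shaped differently.

```python
import json


def render_summary(results: dict) -> str:
    """Render headline metrics as a GitHub-flavored markdown table.

    The key names below are hypothetical; adjust them to the real JSON.
    """
    rows = [
        ("precision", results.get("precision")),
        ("recall", results.get("recall")),
        ("abort rate", results.get("abort_rate")),
    ]
    lines = ["## M2 grounding recall", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in rows:
        # A missing metric renders as "n/a" rather than crashing the step.
        cell = "n/a" if value is None else f"{value:.2f}"
        lines.append(f"| {name} | {cell} |")
    return "\n".join(lines) + "\n"


def main(path: str) -> None:
    """Read the eval JSON at ``path`` and print the markdown to stdout."""
    with open(path, encoding="utf-8") as fh:
        print(render_summary(json.load(fh)), end="")
```

Because the workflow redirects stdout with `>>`, the script only has to print markdown; tolerating missing keys matters because the step runs under `always()` and may see partial output from a failed eval.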
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,8 @@ All notable changes to bicameral-mcp are tracked here. Format loosely follows

- **`tests/eval_grounding_recall.py` — M2 grounding-recall eval harness (#280 PR-2).** Synthetic-fixture benchmark that drives the `bicameral-bind` skill end-to-end and measures three axes: precision (of bindings the agent committed, what fraction were correct), recall (of ground-truth bindings, how many the agent got right), and abort rate (first-class signal because the bind skill makes "abort on weak evidence" an explicit contract). Dataset at `tests/fixtures/grounding_recall/dataset.py` with 23 cases across same-name-different-module (5), similar-intent-different-symbol (10), and cross-language (8) — fixture repo at `tests/fixtures/grounding_recall/repo/`. Headless caller-LLM driver at `tests/eval/_bind_judge.py` (modeled on `_skill_judge.py`) drives a multi-turn `read_file` / `validate_symbols` / `submit_binding` tool-use loop with response caching at `tests/eval/fixtures/bind_judge/` keyed on SHA(model | skill | repo | decision). Default gates: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30 per #280 acceptance. New CI step is **warn-only initially** (`continue-on-error: true`, mirrors the M1 step) — gather a baseline first, ratchet to `--gate-mode hard` once the signal is stable.

- **M2 grounding-precision telemetry (#280 PR-3).** Three PostHog events now emit from the bind / ratification surfaces: `m2_grounding_attempt` (per `handle_bind` invocation, with `success` and `handler_rejected` diagnostics — the latter trips when #280 PR-1's reject path fires); `m2_grounding_ratified_correct` (per `handle_resolve_compliance` verdict where caller said `compliant`); `m2_grounding_ratified_incorrect` (per `drifted` or `not_relevant` verdict — both signal the original bind was wrong). New `m2_grounding_log.py` module owns the contract: writes a JSONL row to `~/.bicameral/m2_grounding.jsonl` (local mirror, 10 MB rotation, 3 backups) AND fires `telemetry.send_event` to the relay. Privacy-preserving — `decision_id` lives in the local mirror only; the PostHog payload carries only the controlled `decision_source` enum (`transcript` / `spec` / `chat` / `manual` / `document`) and numeric/bool diagnostics. M2 metrics surface on the GitHub Actions run page via a new `tests/eval_grounding_recall_summary.py` step that renders the PR-2 eval JSON to `$GITHUB_STEP_SUMMARY` (precision / recall / abort rate, per-case-type recall breakdown, miss-list) — no operator dashboard panel (per Jin: "the dashboard is for users").
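
The three axes and default gates described in the PR-2 entry can be sketched as follows. This is an illustrative reduction, not the harness in `tests/eval_grounding_recall.py`: `CaseResult`, `score`, and `gate_failures` are hypothetical names, and the sketch assumes each case has exactly one ground-truth binding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    expected: str             # ground-truth symbol for the case
    committed: Optional[str]  # symbol the agent bound, or None if it aborted


def score(cases: List[CaseResult]) -> dict:
    """Compute precision, recall, and abort rate over a list of cases."""
    total = len(cases)
    committed = [c for c in cases if c.committed is not None]
    correct = sum(1 for c in committed if c.committed == c.expected)
    return {
        # Of the bindings the agent committed, the fraction that were correct.
        # Reported as 0.0 when nothing was committed (a sketch-level choice).
        "precision": correct / len(committed) if committed else 0.0,
        # Of the ground-truth bindings, the fraction the agent got right.
        "recall": correct / total if total else 0.0,
        # Fraction of cases where the agent declined to bind.
        "abort_rate": (total - len(committed)) / total if total else 0.0,
    }


def gate_failures(
    metrics: dict,
    *,
    min_recall: float = 0.80,
    min_precision: float = 0.85,
    max_abort_rate: float = 0.30,
) -> List[str]:
    """Return the names of the gates that the metrics fail."""
    failures = []
    if metrics["recall"] < min_recall:
        failures.append("recall")
    if metrics["precision"] < min_precision:
        failures.append("precision")
    if metrics["abort_rate"] > max_abort_rate:
        failures.append("abort_rate")
    return failures
```

In a warn-only mode the failure list would be logged; in a hard mode a non-empty list would presumably translate into a non-zero exit code.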

### Changed

- **`code_locator/tools/validate_symbols.py`: dropped unused `self._db` field.** The retention comment ("Retained so `code_locator.adapter.ground_mappings()` can reach `db.lookup_by_file()`") referenced a path deleted in v0.6.0; the field had zero readers.
73 changes: 73 additions & 0 deletions handlers/bind.py
@@ -22,6 +22,35 @@ def _spans_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start <= b_end and b_start <= a_end


def _emit_m2_attempt(
    *,
    decision_id: str,
    decision_source: str | None,
    success: bool,
    handler_rejected: bool,
) -> None:
    """Fire-and-forget M2 grounding-attempt event (#280 PR-3).

    Wraps ``m2_grounding_log.record_attempt`` in try/except so a telemetry
    failure never breaks bind. Skips the call entirely when ``decision_id``
    is empty (API misuse). Unknown decision IDs are handled elsewhere;
    neither case is a representative grounding attempt, and both would
    skew the precision metric.
    """
    if not decision_id:
        return
    try:
        from m2_grounding_log import record_attempt

        record_attempt(
            decision_id=decision_id,
            decision_source=decision_source,
            success=success,
            handler_rejected=handler_rejected,
        )
    except Exception as exc:
        logger.debug("[bind] m2 telemetry emit failed (non-fatal): %s", exc)
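
The changelog describes the `m2_grounding_log` contract as a rotating local JSONL mirror (10 MB, 3 backups) plus a relay event that omits `decision_id`. A minimal sketch of that split, under stated assumptions, might look like the block below. The `mirror` and `send_event` parameters, `make_mirror_logger`, and `relay_payload` are hypothetical and injected for testability; the real `record_attempt` takes only the keyword arguments shown in the diff, and the `"unknown"` fallback for a missing source is assumed from the handler comments.

```python
import json
import logging
import time
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Controlled enum named in the changelog entry.
ALLOWED_SOURCES = {"transcript", "spec", "chat", "manual", "document"}


def make_mirror_logger(path: Path) -> logging.Logger:
    """Return a logger that appends raw lines to a rotating JSONL file."""
    logger = logging.getLogger(f"m2_mirror:{path}")
    if not logger.handlers:
        path.parent.mkdir(parents=True, exist_ok=True)
        handler = RotatingFileHandler(
            path, maxBytes=10 * 1024 * 1024, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep mirror rows out of the root logger
    return logger


def relay_payload(*, decision_source, success, handler_rejected) -> dict:
    """Build the relay payload: enum and bool diagnostics only, no IDs."""
    source = decision_source if decision_source in ALLOWED_SOURCES else "unknown"
    return {
        "decision_source": source,
        "success": bool(success),
        "handler_rejected": bool(handler_rejected),
    }


def record_attempt(
    mirror, send_event, *, decision_id, decision_source, success, handler_rejected
) -> None:
    """Write the full row locally, then relay the privacy-reduced payload."""
    payload = relay_payload(
        decision_source=decision_source,
        success=success,
        handler_rejected=handler_rejected,
    )
    row = {
        "ts": time.time(),
        "event": "m2_grounding_attempt",
        "decision_id": decision_id,  # local mirror only
        **payload,
    }
    mirror.info(json.dumps(row))
    send_event("m2_grounding_attempt", payload)
```

Building the relay payload first and deriving the local row from it keeps the two outputs from drifting apart: anything relayed is also mirrored, while `decision_id` is added only on the local side.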


async def handle_bind(
ctx,
bindings: list[dict],
@@ -121,6 +150,14 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
)
continue

# #280 PR-3: resolve decision_source once for telemetry. Cheap query
# (single-field SELECT). Best-effort: on lookup failure we still bind
# and fall back to None (unknown source).
try:
    decision_source = await ledger.get_decision_source(decision_id)
except Exception:
    decision_source = None

if start_line is None or end_line is None:
from ledger.status import resolve_symbol_lines

@@ -134,6 +171,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
error=f"symbol '{symbol_name}' not found in {file_path} at {authoritative_sha}",
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=False,
handler_rejected=True,
)
continue
start_line, end_line = resolved
else:
@@ -149,6 +192,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
error=f"file '{file_path}' does not exist at {authoritative_sha} — only bind to existing code, never hypothetical files",
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=False,
handler_rejected=True,
)
continue

# #280 — caller-supplied line range cannot bypass symbol
@@ -167,6 +216,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
error=f"symbol '{symbol_name}' not found in {file_path} at {authoritative_sha} — caller-supplied line range cannot bypass symbol verification (#280)",
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=False,
handler_rejected=True,
)
continue
resolved_start, resolved_end = resolved
if not _spans_overlap(start_line, end_line, resolved_start, resolved_end):
@@ -178,6 +233,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
error=f"symbol '{symbol_name}' resolves at lines {resolved_start}-{resolved_end} but caller supplied {start_line}-{end_line} — span mismatch (#280)",
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=False,
handler_rejected=True,
)
continue

try:
@@ -201,6 +262,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
error=str(exc),
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=False,
handler_rejected=False, # ledger error, not a #280 reject
)
continue

region_id = bind_result["region_id"]
@@ -290,6 +357,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
pending_compliance_check=pending_check,
)
)
_emit_m2_attempt(
decision_id=decision_id,
decision_source=decision_source,
success=True,
handler_rejected=False,
)

try:
from dashboard.server import notify_dashboard
40 changes: 40 additions & 0 deletions handlers/resolve_compliance.py
@@ -37,6 +37,7 @@
decision_exists,
delete_binds_to_edge,
get_canonical_id,
get_decision_source,
get_region_descriptor,
project_decision_status,
promote_ephemeral_verdict,
@@ -49,6 +50,31 @@
logger = logging.getLogger(__name__)


def _emit_m2_ratification(
    *,
    decision_id: str,
    decision_source: str | None,
    verdict: str,
    confidence: str | None,
) -> None:
    """Fire-and-forget M2 ratification event (#280 PR-3).

    Wraps ``m2_grounding_log.record_ratification`` in try/except so a
    telemetry failure never breaks ratification.
    """
    try:
        from m2_grounding_log import record_ratification

        record_ratification(
            decision_id=decision_id,
            decision_source=decision_source,
            verdict=verdict,
            confidence=confidence,
        )
    except Exception as exc:
        logger.debug("[resolve_compliance] m2 telemetry emit failed (non-fatal): %s", exc)
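
Per the changelog, a `compliant` verdict maps to `m2_grounding_ratified_correct`, while `drifted` and `not_relevant` both map to `m2_grounding_ratified_incorrect`. A small sketch of that mapping is below. The helper names are hypothetical, and the precision formula (correct over correct plus incorrect) is an assumption about how the two event counts would be combined, not something the diff states.

```python
from collections import Counter
from typing import Iterable, Optional

# Verdict-to-event mapping as described in the changelog entry.
_VERDICT_EVENTS = {
    "compliant": "m2_grounding_ratified_correct",
    "drifted": "m2_grounding_ratified_incorrect",
    "not_relevant": "m2_grounding_ratified_incorrect",
}


def event_for_verdict(verdict: str) -> Optional[str]:
    """Return the event name for a verdict, or None for unmapped verdicts."""
    return _VERDICT_EVENTS.get(verdict)


def ratified_precision(verdicts: Iterable[str]) -> Optional[float]:
    """Assumed metric: correct / (correct + incorrect), or None if no data."""
    counts = Counter(event_for_verdict(v) for v in verdicts)
    correct = counts["m2_grounding_ratified_correct"]
    incorrect = counts["m2_grounding_ratified_incorrect"]
    judged = correct + incorrect
    return correct / judged if judged else None
```

Returning `None` for an unmapped verdict lets a caller skip emission instead of guessing, which fits the fire-and-forget posture of the wrapper above.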


_VALID_PHASES = {"ingest", "drift", "regrounding", "supersession", "divergence"}


@@ -210,6 +236,20 @@ async def handle_resolve_compliance(
)
)

# #280 PR-3: M2 grounding-precision ratification telemetry.
# Best-effort source lookup (single-field query). On failure, fall
# back to None (unknown source) rather than blocking the verdict write.
try:
    decision_source = await get_decision_source(client, v.decision_id)
except Exception:
    decision_source = None
_emit_m2_ratification(
    decision_id=v.decision_id,
    decision_source=decision_source,
    verdict=v.verdict,
    confidence=v.confidence,
)

# Sync code_region.content_hash to the verdict hash for every accepted verdict.
# project_decision_status looks up verdicts by (decision_id, region_id,
# code_region.content_hash). When link_commit ran on a non-authoritative branch
10 changes: 10 additions & 0 deletions ledger/adapter.py
@@ -21,6 +21,7 @@
get_all_decisions,
get_compliance_verdict,
get_decision_level,
get_decision_source,
get_decisions_for_file,
get_decisions_for_files,
get_pending_decisions_with_regions,
@@ -269,6 +270,15 @@ async def get_decision_level(self, decision_id: str) -> str | None:
        await self._ensure_connected()
        return await get_decision_level(self._client, decision_id)

    async def get_decision_source(self, decision_id: str) -> str | None:
        """Return the decision's ``source_type`` or ``None`` if unset.

        Used by the M2 grounding-precision telemetry (#280 PR-3) to segment
        events by decision provenance (controlled enum, safe to relay).
        """
        await self._ensure_connected()
        return await get_decision_source(self._client, decision_id)

    async def bind_decision(
        self,
        decision_id: str,
19 changes: 19 additions & 0 deletions ledger/queries.py
@@ -873,6 +873,25 @@ async def get_decision_level(client: LedgerClient, decision_id: str) -> str | No
    return str(val) if val else None


async def get_decision_source(client: LedgerClient, decision_id: str) -> str | None:
    """Return ``decision.source_type`` (a controlled enum like
    ``"transcript"`` / ``"spec"`` / ``"chat"`` / ``"manual"`` /
    ``"document"``), or ``None`` if the row doesn't exist or the field
    is unset.

    Used by the M2 grounding-precision telemetry (#280 PR-3) to segment
    ``m2_grounding_*`` events by decision provenance. Safe to relay to
    PostHog: the source_type value space is a fixed enum from the
    ingest contract, not user content.
    """
    rows = await client.query(
        f"SELECT source_type FROM {decision_id} LIMIT 1",
    )
    if not rows:
        return None
    val = rows[0].get("source_type")
    return str(val) if val else None
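
The query above interpolates `decision_id` directly into the statement, which is safe only if the caller always passes a well-formed record ID. One possible guard, sketched under the assumption that IDs have a `table:key` shape with alphanumeric or underscore keys, is shown below. Neither `safe_record_id` nor that ID format is confirmed by the diff; a real ledger may allow other key characters, or offer parameter binding that would be preferable.

```python
import re

# Assumed record-ID shape: "<table>:<key>", both parts word characters only.
_RECORD_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*:[A-Za-z0-9_]+")


def safe_record_id(value: str, *, table: str) -> str:
    """Return ``value`` unchanged if it is a plain record ID in ``table``.

    Raises ValueError otherwise, so nothing unexpected reaches the query text.
    """
    if not _RECORD_ID.fullmatch(value) or not value.startswith(f"{table}:"):
        raise ValueError(f"not a valid {table} record id: {value!r}")
    return value


def build_source_query(decision_id: str) -> str:
    """Build the single-field lookup only after the ID has been validated."""
    record = safe_record_id(decision_id, table="decision")
    return f"SELECT source_type FROM {record} LIMIT 1"
```

Because the telemetry call sites already wrap the lookup in try/except, a `ValueError` raised here would degrade to a missing source rather than a failed bind.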


async def region_exists(client: LedgerClient, region_id: str) -> bool:
    """Return True iff a code_region row exists with the given record id."""
    rows = await client.query(f"SELECT id FROM {region_id} LIMIT 1")