BicameralAI · jinhongkuan · May 9, 2026
@@ -148,6 +148,17 @@ jobs:
           --gate-mode warn
           -o test-results/m2-grounding-recall.json
 
+      # ── Surface M2 metrics on the GitHub run summary (#280 PR-3) ───
+      # Reads test-results/m2-grounding-recall.json and renders a markdown
+      # table to $GITHUB_STEP_SUMMARY so reviewers can read precision /
+      # recall / abort-rate without downloading the artifact. always()
+      # guard means the summary appears even when the eval step above
+      # exited non-zero (warn-only currently masks that).
+      - name: M2 metrics summary
+        if: always() && matrix.os == 'ubuntu-latest'
+        continue-on-error: true
+        run: python tests/eval_grounding_recall_summary.py test-results/m2-grounding-recall.json >> "$GITHUB_STEP_SUMMARY"
+
       # ── Generate rich E2E report from artifacts ────────────────────
       # Ubuntu-only: the script consumes the medusa adversarial corpus
       # (cloned only on Ubuntu above) plus the Phase 3 E2E artifacts
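
Reviewer note on the `M2 metrics summary` step: the script it calls is not part of this diff, so here is a minimal sketch of how an eval JSON could be rendered into a markdown table for `$GITHUB_STEP_SUMMARY`. The JSON keys (`precision`, `recall`, `abort_rate`) and the function name `render_summary` are assumptions for illustration; the real `tests/eval_grounding_recall_summary.py` may be structured differently.

```python
def render_summary(results: dict) -> str:
    """Render headline M2 metrics as a GitHub-flavoured markdown table.

    The key names used here are assumed, not taken from the real eval output.
    """
    lines = [
        "### M2 grounding recall",
        "",
        "| metric | value |",
        "| --- | --- |",
    ]
    for key in ("precision", "recall", "abort_rate"):
        value = results.get(key)
        # Missing or non-numeric metrics render as "n/a" so a partial
        # results file still produces a readable summary.
        cell = f"{value:.2f}" if isinstance(value, (int, float)) else "n/a"
        lines.append(f"| {key} | {cell} |")
    return "\n".join(lines) + "\n"
```

A script built around this would load the file with `json.load` and write the returned string to stdout, which the workflow step then appends to `$GITHUB_STEP_SUMMARY` via `>>`.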

@@ -15,6 +15,8 @@ All notable changes to bicameral-mcp are tracked here. Format loosely follows
 
 - **`tests/eval_grounding_recall.py` — M2 grounding-recall eval harness (#280 PR-2).** Synthetic-fixture benchmark that drives the `bicameral-bind` skill end-to-end and measures three axes: precision (of bindings the agent committed, what fraction were correct), recall (of ground-truth bindings, how many the agent got right), and abort rate (first-class signal because the bind skill makes "abort on weak evidence" an explicit contract). Dataset at `tests/fixtures/grounding_recall/dataset.py` with 23 cases across same-name-different-module (5), similar-intent-different-symbol (10), and cross-language (8) — fixture repo at `tests/fixtures/grounding_recall/repo/`. Headless caller-LLM driver at `tests/eval/_bind_judge.py` (modeled on `_skill_judge.py`) drives a multi-turn `read_file` / `validate_symbols` / `submit_binding` tool-use loop with response caching at `tests/eval/fixtures/bind_judge/` keyed on SHA(model | skill | repo | decision). Default gates: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30 per #280 acceptance. New CI step is **warn-only initially** (`continue-on-error: true`, mirrors the M1 step) — gather a baseline first, ratchet to `--gate-mode hard` once the signal is stable.
 
+- **M2 grounding-precision telemetry (#280 PR-3).** Three PostHog events now emit from the bind / ratification surfaces: `m2_grounding_attempt` (per `handle_bind` invocation, with `success` and `handler_rejected` diagnostics — the latter trips when #280 PR-1's reject path fires); `m2_grounding_ratified_correct` (per `handle_resolve_compliance` verdict where caller said `compliant`); `m2_grounding_ratified_incorrect` (per `drifted` or `not_relevant` verdict — both signal the original bind was wrong). New `m2_grounding_log.py` module owns the contract: writes a JSONL row to `~/.bicameral/m2_grounding.jsonl` (local mirror, 10 MB rotation, 3 backups) AND fires `telemetry.send_event` to the relay. Privacy-preserving — `decision_id` lives in the local mirror only; the PostHog payload carries only the controlled `decision_source` enum (`transcript` / `spec` / `chat` / `manual` / `document`) and numeric/bool diagnostics. M2 metrics surface on the GitHub Actions run page via a new `tests/eval_grounding_recall_summary.py` step that renders the PR-2 eval JSON to `$GITHUB_STEP_SUMMARY` (precision / recall / abort rate, per-case-type recall breakdown, miss-list) — no operator dashboard panel (per Jin: "the dashboard is for users").
+
 ### Changed
 
 - **`code_locator/tools/validate_symbols.py`: dropped unused `self._db` field.** The retention comment ("Retained so `code_locator.adapter.ground_mappings()` can reach `db.lookup_by_file()`") referenced a path deleted in v0.6.0; the field had zero readers.
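
Reviewer note on the PR-3 changelog entry: the privacy split it describes (full row in the local JSONL mirror, enum and numeric/bool fields only in the relay payload) could look roughly like the sketch below. `m2_grounding_log.py` itself is not shown in this diff, so `make_local_logger`, `build_relay_payload`, `record_attempt_sketch`, and the `send` callback are hypothetical names, and mapping a missing source to `"unknown"` is an assumption. The rotation values (10 MB, 3 backups) and the enum come from the entry.

```python
import json
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

# Enum values copied from the changelog entry; anything else is not relayed as-is.
ALLOWED_SOURCES = {"transcript", "spec", "chat", "manual", "document"}


def make_local_logger(path: Path) -> logging.Logger:
    """Build a JSONL mirror logger with 10 MB rotation and 3 backups."""
    logger = logging.getLogger("m2_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep rows out of the root logger's handlers
    if not logger.handlers:
        handler = RotatingFileHandler(
            path, maxBytes=10 * 1024 * 1024, backupCount=3, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def build_relay_payload(row: dict) -> dict:
    """Keep only the controlled enum plus bool/numeric diagnostics."""
    source = row.get("decision_source")
    # Assumption: a missing or unexpected source is relayed as "unknown".
    payload = {"decision_source": source if source in ALLOWED_SOURCES else "unknown"}
    for key, value in row.items():
        if key in ("decision_id", "decision_source"):
            continue  # decision_id stays in the local mirror only
        if isinstance(value, (bool, int, float)):
            payload[key] = value
    return payload


def record_attempt_sketch(
    logger, send, *, decision_id, decision_source, success, handler_rejected
):
    """Write the full row locally, then relay the stripped payload."""
    row = {
        "decision_id": decision_id,
        "decision_source": decision_source,
        "success": success,
        "handler_rejected": handler_rejected,
    }
    logger.info(json.dumps(row))
    send("m2_grounding_attempt", build_relay_payload(row))
```

Building the relay payload from an allow-list of value types, rather than deleting known-sensitive keys, means a field added to the row later is not relayed unless it is a plain bool or number.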

@@ -22,6 +22,35 @@ def _spans_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
     return a_start <= b_end and b_start <= a_end
 
 
+def _emit_m2_attempt(
+    *,
+    decision_id: str,
+    decision_source: str | None,
+    success: bool,
+    handler_rejected: bool,
+) -> None:
+    """Fire-and-forget M2 grounding-attempt event (#280 PR-3).
+
+    Wraps ``m2_grounding_log.record_attempt`` in try/except so a telemetry
+    failure never breaks bind. Returns early when ``decision_id`` is empty
+    (API misuse); unknown ids are handled elsewhere. Neither is a
+    representative grounding attempt, and both would skew precision.
+    """
+    if not decision_id:
+        return
+    try:
+        from m2_grounding_log import record_attempt
+
+        record_attempt(
+            decision_id=decision_id,
+            decision_source=decision_source,
+            success=success,
+            handler_rejected=handler_rejected,
+        )
+    except Exception as exc:
+        logger.debug("[bind] m2 telemetry emit failed (non-fatal): %s", exc)
+
+
 async def handle_bind(
     ctx,
     bindings: list[dict],
@@ -121,6 +150,14 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
             )
             continue
 
+        # #280 PR-3: resolve decision_source once for telemetry. Cheap query
+        # (single-field SELECT). Best-effort: on lookup failure we still bind
+        # and record the source as unknown (None).
+        try:
+            decision_source = await ledger.get_decision_source(decision_id)
+        except Exception:
+            decision_source = None
+
         if start_line is None or end_line is None:
             from ledger.status import resolve_symbol_lines
 
@@ -134,6 +171,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                         error=f"symbol '{symbol_name}' not found in {file_path} at {authoritative_sha}",
                     )
                 )
+                _emit_m2_attempt(
+                    decision_id=decision_id,
+                    decision_source=decision_source,
+                    success=False,
+                    handler_rejected=True,
+                )
                 continue
             start_line, end_line = resolved
         else:
@@ -149,6 +192,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                         error=f"file '{file_path}' does not exist at {authoritative_sha} — only bind to existing code, never hypothetical files",
                     )
                 )
+                _emit_m2_attempt(
+                    decision_id=decision_id,
+                    decision_source=decision_source,
+                    success=False,
+                    handler_rejected=True,
+                )
                 continue
 
             # #280 — caller-supplied line range cannot bypass symbol
@@ -167,6 +216,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                         error=f"symbol '{symbol_name}' not found in {file_path} at {authoritative_sha} — caller-supplied line range cannot bypass symbol verification (#280)",
                     )
                 )
+                _emit_m2_attempt(
+                    decision_id=decision_id,
+                    decision_source=decision_source,
+                    success=False,
+                    handler_rejected=True,
+                )
                 continue
             resolved_start, resolved_end = resolved
             if not _spans_overlap(start_line, end_line, resolved_start, resolved_end):
@@ -178,6 +233,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                         error=f"symbol '{symbol_name}' resolves at lines {resolved_start}-{resolved_end} but caller supplied {start_line}-{end_line} — span mismatch (#280)",
                     )
                 )
+                _emit_m2_attempt(
+                    decision_id=decision_id,
+                    decision_source=decision_source,
+                    success=False,
+                    handler_rejected=True,
+                )
                 continue
 
         try:
@@ -201,6 +262,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                     error=str(exc),
                 )
             )
+            _emit_m2_attempt(
+                decision_id=decision_id,
+                decision_source=decision_source,
+                success=False,
+                handler_rejected=False,  # ledger error, not a #280 reject
+            )
             continue
 
         region_id = bind_result["region_id"]
@@ -290,6 +357,12 @@ async def _do_bind(ctx, bindings: list[dict]) -> BindResponse:
                 pending_compliance_check=pending_check,
             )
         )
+        _emit_m2_attempt(
+            decision_id=decision_id,
+            decision_source=decision_source,
+            success=True,
+            handler_rejected=False,
+        )
 
     try:
         from dashboard.server import notify_dashboard

@@ -37,6 +37,7 @@
     decision_exists,
     delete_binds_to_edge,
     get_canonical_id,
+    get_decision_source,
     get_region_descriptor,
     project_decision_status,
     promote_ephemeral_verdict,
@@ -49,6 +50,31 @@
 logger = logging.getLogger(__name__)
 
 
+def _emit_m2_ratification(
+    *,
+    decision_id: str,
+    decision_source: str | None,
+    verdict: str,
+    confidence: str | None,
+) -> None:
+    """Fire-and-forget M2 ratification event (#280 PR-3).
+
+    Wraps ``m2_grounding_log.record_ratification`` in try/except so a
+    telemetry failure never breaks ratification.
+    """
+    try:
+        from m2_grounding_log import record_ratification
+
+        record_ratification(
+            decision_id=decision_id,
+            decision_source=decision_source,
+            verdict=verdict,
+            confidence=confidence,
+        )
+    except Exception as exc:
+        logger.debug("[resolve_compliance] m2 telemetry emit failed (non-fatal): %s", exc)
+
+
 _VALID_PHASES = {"ingest", "drift", "regrounding", "supersession", "divergence"}
 
 
@@ -210,6 +236,20 @@ async def handle_resolve_compliance(
             )
         )
 
+        # #280 PR-3: M2 grounding-precision ratification telemetry.
+        # Best-effort source lookup (single-field query). On failure, record
+        # the source as unknown (None) rather than blocking the verdict write.
+        try:
+            decision_source = await get_decision_source(client, v.decision_id)
+        except Exception:
+            decision_source = None
+        _emit_m2_ratification(
+            decision_id=v.decision_id,
+            decision_source=decision_source,
+            verdict=v.verdict,
+            confidence=v.confidence,
+        )
+
     # Sync code_region.content_hash to the verdict hash for every accepted verdict.
     # project_decision_status looks up verdicts by (decision_id, region_id,
     # code_region.content_hash). When link_commit ran on a non-authoritative branch
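
Reviewer note on the ratification events: per the changelog, a `compliant` verdict emits `m2_grounding_ratified_correct`, while `drifted` and `not_relevant` both emit `m2_grounding_ratified_incorrect`. The sketch below spells out that mapping and one way the two counts could be folded into a precision figure. `record_ratification` is not shown in this diff, so the names `event_for_verdict` and `ratified_precision` are hypothetical, and the formula correct / (correct + incorrect) is an assumption about how the events would be aggregated.

```python
from __future__ import annotations

from collections import Counter

# Verdict-to-event mapping as described in the changelog entry.
_VERDICT_TO_EVENT = {
    "compliant": "m2_grounding_ratified_correct",
    "drifted": "m2_grounding_ratified_incorrect",
    "not_relevant": "m2_grounding_ratified_incorrect",
}


def event_for_verdict(verdict: str) -> str | None:
    """Return the event name for a verdict, or None for unrecognised verdicts."""
    return _VERDICT_TO_EVENT.get(verdict)


def ratified_precision(verdicts: list[str]) -> float | None:
    """Fraction of ratified bindings judged correct; None when nothing was ratified."""
    counts = Counter(event_for_verdict(v) for v in verdicts)
    correct = counts["m2_grounding_ratified_correct"]
    incorrect = counts["m2_grounding_ratified_incorrect"]
    total = correct + incorrect
    # Unrecognised verdicts map to None and are left out of the denominator.
    return correct / total if total else None
```

Returning `None` for an empty window keeps "no ratifications yet" distinct from a measured precision of zero.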

@@ -21,6 +21,7 @@
     get_all_decisions,
     get_compliance_verdict,
     get_decision_level,
+    get_decision_source,
     get_decisions_for_file,
     get_decisions_for_files,
     get_pending_decisions_with_regions,
@@ -269,6 +270,15 @@ async def get_decision_level(self, decision_id: str) -> str | None:
         await self._ensure_connected()
         return await get_decision_level(self._client, decision_id)
 
+    async def get_decision_source(self, decision_id: str) -> str | None:
+        """Return the decision's ``source_type`` or ``None`` if unset.
+
+        Used by the M2 grounding-precision telemetry (#280 PR-3) to segment
+        events by decision provenance (controlled enum, safe to relay).
+        """
+        await self._ensure_connected()
+        return await get_decision_source(self._client, decision_id)
+
     async def bind_decision(
         self,
         decision_id: str,

@@ -873,6 +873,25 @@ async def get_decision_level(client: LedgerClient, decision_id: str) -> str | No
     return str(val) if val else None
 
 
+async def get_decision_source(client: LedgerClient, decision_id: str) -> str | None:
+    """Return ``decision.source_type`` (a controlled enum like
+    ``"transcript"`` / ``"spec"`` / ``"chat"`` / ``"manual"`` /
+    ``"document"``) or ``None`` if the row is missing or the field unset.
+
+    Used by the M2 grounding-precision telemetry (#280 PR-3) to segment
+    ``m2_grounding_*`` events by decision provenance. Safe to relay to
+    PostHog: the source_type value space is a fixed enum from the
+    ingest contract, not user content.
+    """
+    rows = await client.query(
+        f"SELECT source_type FROM {decision_id} LIMIT 1",
+    )
+    if not rows:
+        return None
+    val = rows[0].get("source_type")
+    return str(val) if val else None
+
+
 async def region_exists(client: LedgerClient, region_id: str) -> bool:
     """Return True iff a code_region row exists with the given record id."""
     rows = await client.query(f"SELECT id FROM {region_id} LIMIT 1")