BicameralAI · silongtan · May 8, 2026
@@ -129,6 +129,25 @@ jobs:
           --skill-variant from-skill-md
           -o test-results/m1-adversarial.json
 
+      # ── M2 grounding-recall eval (warn-only, #280 PR-2) ────────────
+      # Drives the bicameral-bind skill against tests/fixtures/grounding_recall/,
+      # a synthetic fixture with 23 decisions across same-name-different-module,
+      # similar-intent, and cross-language cases. Cache hits at
+      # tests/eval/fixtures/bind_judge/ keep CI cost ~$0 unless the dataset,
+      # fixture repo, or skill changes. Warn-only initially, per #280's
+      # gating-is-observability framing: we ship the measurement, observe the
+      # baseline, then ratchet to --gate-mode hard once the signal is stable.
+      - name: M2 grounding-recall eval (warn-only)
+        if: matrix.os == 'ubuntu-latest'
+        continue-on-error: true
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          BICAMERAL_GROUNDING_EVAL_MODEL: claude-haiku-4-5-20251001
+        run: >
+          python tests/eval_grounding_recall.py
+          --gate-mode warn
+          -o test-results/m2-grounding-recall.json
+
       # ── Generate rich E2E report from artifacts ────────────────────
       # Ubuntu-only: the script consumes the medusa adversarial corpus
       # (cloned only on Ubuntu above) plus the Phase 3 E2E artifacts

@@ -13,6 +13,8 @@ All notable changes to bicameral-mcp are tracked here. Format loosely follows
 
 - **`skills/bicameral-bind/SKILL.md` (#280).** New skill that extracts the bind contract out of `skills/bicameral-ingest/SKILL.md` §2 and tightens it from advisory to mandatory: the agent must Read at least one candidate file end-to-end, confirm the symbol via `validate_symbols`, and abort on weak evidence. Documents the handler-side rejection contract added in this release. `bicameral-ingest` §2 now points at the new skill instead of duplicating the verification rules.
 
+- **`tests/eval_grounding_recall.py` — M2 grounding-recall eval harness (#280 PR-2).** Synthetic-fixture benchmark that drives the `bicameral-bind` skill end-to-end and measures three axes: precision (of bindings the agent committed, what fraction were correct), recall (of ground-truth bindings, how many the agent got right), and abort rate (first-class signal because the bind skill makes "abort on weak evidence" an explicit contract). Dataset at `tests/fixtures/grounding_recall/dataset.py` with 23 cases across same-name-different-module (5), similar-intent-different-symbol (10), and cross-language (8) — fixture repo at `tests/fixtures/grounding_recall/repo/`. Headless caller-LLM driver at `tests/eval/_bind_judge.py` (modeled on `_skill_judge.py`) drives a multi-turn `read_file` / `validate_symbols` / `submit_binding` tool-use loop with response caching at `tests/eval/fixtures/bind_judge/` keyed on SHA(model | skill | repo | decision). Default gates: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30 per #280 acceptance. New CI step is **warn-only initially** (`continue-on-error: true`, mirrors the M1 step) — gather a baseline first, ratchet to `--gate-mode hard` once the signal is stable.
+
 ### Changed
 
 - **`code_locator/tools/validate_symbols.py`: dropped unused `self._db` field.** The retention comment ("Retained so `code_locator.adapter.ground_mappings()` can reach `db.lookup_by_file()`") referenced a path deleted in v0.6.0; the field had zero readers.
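
The three axes and default gates described in the M2 changelog entry can be sketched as follows. This is a minimal illustration, not the code in `tests/eval_grounding_recall.py`: the `CaseOutcome` record, the `score` and `failed_gates` helpers, and the sample counts are all hypothetical, and the sketch assumes exactly one ground-truth binding per dataset case, with abort rate taken as the fraction of cases where the agent aborted. Only the threshold values (recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30) come from the entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseOutcome:
    """Hypothetical per-case record; not the harness's real schema."""
    committed: bool  # agent submitted a binding instead of aborting
    correct: bool    # submitted binding matches the ground truth


# Default gates quoted in the changelog entry.
GATES = {"recall": 0.80, "precision": 0.85, "abort_rate": 0.30}


def score(outcomes):
    """Compute precision, recall, and abort rate over all cases."""
    total = len(outcomes)
    committed = sum(o.committed for o in outcomes)
    correct = sum(o.committed and o.correct for o in outcomes)
    return {
        # Of the bindings the agent committed, what fraction were correct.
        "precision": correct / committed if committed else 0.0,
        # Of the ground-truth bindings (assumed one per case), how many it got right.
        "recall": correct / total if total else 0.0,
        # Fraction of cases where the agent aborted (assumed definition).
        "abort_rate": (total - committed) / total if total else 0.0,
    }


def failed_gates(metrics, gates=GATES):
    """Return the names of the gates the metrics violate."""
    failures = []
    if metrics["recall"] < gates["recall"]:
        failures.append("recall")
    if metrics["precision"] < gates["precision"]:
        failures.append("precision")
    if metrics["abort_rate"] > gates["abort_rate"]:
        failures.append("abort_rate")
    return failures


# Made-up run over 23 cases: 18 correct binds, 2 wrong binds, 3 aborts.
outcomes = (
    [CaseOutcome(committed=True, correct=True)] * 18
    + [CaseOutcome(committed=True, correct=False)] * 2
    + [CaseOutcome(committed=False, correct=False)] * 3
)
metrics = score(outcomes)
failures = failed_gates(metrics)
print(metrics, failures)
```

In this made-up run precision is 18/20 and the abort rate is 3/23, both inside their gates, while recall is 18/23 and falls just under 0.80. Under a warn-style mode that miss would presumably be reported without failing the job, which matches the "gather a baseline first" intent of the entry.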