Merged
19 changes: 19 additions & 0 deletions .github/workflows/test-mcp-regression.yml
@@ -129,6 +129,25 @@ jobs:
--skill-variant from-skill-md
-o test-results/m1-adversarial.json

# ── M2 grounding-recall eval (warn-only, #280 PR-2) ────────────
# Drives the bicameral-bind skill against tests/fixtures/grounding_recall/
# — synthetic fixture with 23 decisions across same-name-different-module,
# similar-intent, and cross-language cases. Cache hits at
# tests/eval/fixtures/bind_judge/ keep CI cost ~$0 unless the dataset,
# fixture repo, or skill change. Warn-only initially per #280's gating-
# is-observability framing — we ship the measurement, observe the
# baseline, then ratchet to --gate-mode hard once the signal is stable.
- name: M2 grounding-recall eval (warn-only)
  if: matrix.os == 'ubuntu-latest'
  continue-on-error: true
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    BICAMERAL_GROUNDING_EVAL_MODEL: claude-haiku-4-5-20251001
  run: >
    python tests/eval_grounding_recall.py
    --gate-mode warn
    -o test-results/m2-grounding-recall.json

# ── Generate rich E2E report from artifacts ────────────────────
# Ubuntu-only: the script consumes the medusa adversarial corpus
# (cloned only on Ubuntu above) plus the Phase 3 E2E artifacts
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,8 @@ All notable changes to bicameral-mcp are tracked here. Format loosely follows

- **`skills/bicameral-bind/SKILL.md` (#280).** New skill that extracts the bind contract out of `skills/bicameral-ingest/SKILL.md` §2 and tightens it from advisory to mandatory: the agent must Read at least one candidate file end-to-end, confirm the symbol via `validate_symbols`, and abort on weak evidence. Documents the handler-side rejection contract added in this release. `bicameral-ingest` §2 now points at the new skill instead of duplicating the verification rules.

- **`tests/eval_grounding_recall.py` — M2 grounding-recall eval harness (#280 PR-2).** Synthetic-fixture benchmark that drives the `bicameral-bind` skill end-to-end and measures three axes: precision (of bindings the agent committed, what fraction were correct), recall (of ground-truth bindings, how many the agent got right), and abort rate (first-class signal because the bind skill makes "abort on weak evidence" an explicit contract). Dataset at `tests/fixtures/grounding_recall/dataset.py` with 23 cases across same-name-different-module (5), similar-intent-different-symbol (10), and cross-language (8) — fixture repo at `tests/fixtures/grounding_recall/repo/`. Headless caller-LLM driver at `tests/eval/_bind_judge.py` (modeled on `_skill_judge.py`) drives a multi-turn `read_file` / `validate_symbols` / `submit_binding` tool-use loop with response caching at `tests/eval/fixtures/bind_judge/` keyed on SHA(model | skill | repo | decision). Default gates: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30 per #280 acceptance. New CI step is **warn-only initially** (`continue-on-error: true`, mirrors the M1 step) — gather a baseline first, ratchet to `--gate-mode hard` once the signal is stable.
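
  To make the three axes and the default gates concrete, here is a minimal sketch of how they could be computed. It is not the code in `tests/eval_grounding_recall.py`: the `CaseResult` shape, the `score` and `failed_gates` helpers, and the assumption that each case has exactly one ground-truth binding are all hypothetical. Only the three thresholds come from the entry above.

  ```python
  from dataclasses import dataclass
  from typing import Dict, List, Optional

  # Default thresholds quoted in the changelog entry above.
  GATES = {"recall_min": 0.80, "precision_min": 0.85, "abort_rate_max": 0.30}


  @dataclass
  class CaseResult:
      """Hypothetical per-case outcome; the real harness may record more."""
      expected: str             # ground-truth symbol for the decision
      submitted: Optional[str]  # symbol the agent bound, or None if it aborted


  def score(results: List[CaseResult]) -> Dict[str, float]:
      """Compute precision, recall, and abort rate over all cases.

      Assumes one ground-truth binding per case, so recall's denominator
      is simply the number of cases.
      """
      total = len(results)
      committed = [r for r in results if r.submitted is not None]
      correct = sum(1 for r in committed if r.submitted == r.expected)
      aborted = total - len(committed)
      return {
          "precision": correct / len(committed) if committed else 0.0,
          "recall": correct / total if total else 0.0,
          "abort_rate": aborted / total if total else 0.0,
      }


  def failed_gates(metrics: Dict[str, float],
                   gates: Dict[str, float] = GATES) -> List[str]:
      """Return the names of the metrics that miss their threshold."""
      failures = []
      if metrics["recall"] < gates["recall_min"]:
          failures.append("recall")
      if metrics["precision"] < gates["precision_min"]:
          failures.append("precision")
      if metrics["abort_rate"] > gates["abort_rate_max"]:
          failures.append("abort_rate")
      return failures


  if __name__ == "__main__":
      # Illustrative run over 23 cases: 18 correct, 2 wrong, 3 aborted.
      demo = (
          [CaseResult("pkg.a.parse", "pkg.a.parse")] * 18
          + [CaseResult("pkg.a.parse", "pkg.b.parse")] * 2
          + [CaseResult("pkg.a.parse", None)] * 3
      )
      metrics = score(demo)
      print(metrics, failed_gates(metrics))
  ```

  In the illustrative run, precision is 18/20 = 0.90 and the abort rate is 3/23 ≈ 0.13, both inside their gates, while recall is 18/23 ≈ 0.78 and misses the 0.80 threshold. An abort lowers recall but not precision, which is why the abort rate is tracked as its own axis. In warn mode a miss like this would be reported without failing the job.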

### Changed

- **`code_locator/tools/validate_symbols.py`: dropped unused `self._db` field.** The retention comment ("Retained so `code_locator.adapter.ground_mappings()` can reach `db.lookup_by_file()`") referenced a path deleted in v0.6.0; the field had zero readers.