feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2) by silongtan · Pull Request #284 · BicameralAI/bicameral-mcp

silongtan · 2026-05-08T19:52:42Z

Summary

PR-2 of 3 for #280. Ships the synthetic-recall benchmark that measures caller-LLM grounding precision against the post-PR-1 handler. Companion to #283 (PR-1 — handler reject path), which landed in dev as 6c4a1c5.

This PR is a measurement layer only — no runtime contract change. Warn-only CI initially per #280's gating-is-observability framing; we ship the eval, observe the baseline, then ratchet to a hard gate.

What's in this PR

Three measurement axes (deliberately split for diagnosis)

Precision = correct / (correct + wrong_symbol + wrong_file)
Recall = correct / total_rows (aborts AND wrong bindings count against)
Abort rate = aborted / total_rows — first-class signal because the bind skill makes "abort on weak evidence" an explicit contract

The split tells us why we're missing: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the PR-1 handler now rejects. Without the split it's a single number; with it we know which knob to turn.
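The three formulas above can be sketched as a small aggregator. This is a minimal illustration, not the code in `tests/eval_grounding_recall.py`: the `AxisReport` and `aggregate` names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
from collections import Counter
from dataclasses import dataclass

# Outcome labels follow the classification named in this PR.
OUTCOMES = ("correct", "wrong_symbol", "wrong_file", "aborted")


@dataclass(frozen=True)
class AxisReport:
    precision: float
    recall: float
    abort_rate: float


def aggregate(outcomes: list[str]) -> AxisReport:
    """Compute the three axes from per-case outcome labels."""
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Precision only looks at cases where the agent committed to a binding.
    committed = counts["correct"] + counts["wrong_symbol"] + counts["wrong_file"]
    # Assumed convention: an empty denominator yields 0.0 instead of raising.
    precision = counts["correct"] / committed if committed else 0.0
    recall = counts["correct"] / total if total else 0.0
    abort_rate = counts["aborted"] / total if total else 0.0
    return AxisReport(precision, recall, abort_rate)


# An over-cautious agent: never wrong, but it aborts on 4 of 10 cases.
cautious = aggregate(["correct"] * 6 + ["aborted"] * 4)
# A hallucinating agent: always commits, but 4 of 10 bindings are wrong.
eager = aggregate(["correct"] * 6 + ["wrong_symbol"] * 3 + ["wrong_file"])
```

Both agents get 6 of 10 cases right, so recall alone cannot tell them apart. Precision and abort rate are what distinguish the over-cautious agent from the hallucinating one.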

Files

| File | LOC | Role |
| --- | --- | --- |
| `tests/fixtures/grounding_recall/dataset.py` | 230 | 23 `GroundingCase` rows: 5 case-A (same-name-different-module), 10 case-B (similar-intent), 8 case-C (cross-language). `GENERATOR_VERSION` invalidates cache when bumped. Import-time `_validate_dataset()` fails loud on shape regressions. |
| `tests/fixtures/grounding_recall/repo/` | 15 files / ~625 | Hand-crafted fixture repo with intended + plausible-distractor symbols. Each function body is short but real enough that the agent must read carefully to disambiguate (e.g. three `process_order` definitions in three modules, each with different semantics). |
| `tests/eval/_bind_judge.py` | 466 | Headless caller-LLM driver, modeled on `tests/eval/_skill_judge.py`. Multi-turn tool-use loop with 3 tools: `read_file`, `validate_symbols`, `submit_binding`. Cap at 8 turns. Response cache at `tests/eval/fixtures/bind_judge/` keyed on `SHA(model \| bind_skill \| repo \| decision)`. |
| `tests/eval_grounding_recall.py` | 256 | Argparse runner, modeled on `tests/eval_decision_relevance.py`. Loads dataset, drives `_bind_judge` per case, classifies outcome, aggregates, emits JSON, optional gate enforcement (`--gate-mode warn\|hard`). |
| `.github/workflows/test-mcp-regression.yml` | +19 | New "M2 grounding-recall eval (warn-only)" step: Ubuntu-only, `continue-on-error: true`, mirrors the M1 step shape. |
| `CHANGELOG.md` | +2 | Unreleased entry. |
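For readers unfamiliar with the import-time validation pattern mentioned for `dataset.py`, here is a minimal sketch. The field names on `GroundingCase`, the single-letter case types, and the two sample rows are assumptions for illustration; the three checks mirror the ones the commit message lists (duplicate ids, invalid `case_type`, distractor equal to intended).

```python
from dataclasses import dataclass

# Hypothetical labels; the PR names the three groups case-A, case-B, and case-C.
VALID_CASE_TYPES = frozenset({"A", "B", "C"})


@dataclass(frozen=True)
class GroundingCase:
    """Illustrative row shape; field names are assumptions, not the real schema."""

    case_id: str
    case_type: str
    decision: str
    intended: str
    distractors: tuple[str, ...]


def validate_dataset(cases: list[GroundingCase]) -> None:
    """Raise ValueError on the first shape regression found."""
    seen: set[str] = set()
    for case in cases:
        if case.case_id in seen:
            raise ValueError(f"duplicate case id: {case.case_id}")
        seen.add(case.case_id)
        if case.case_type not in VALID_CASE_TYPES:
            raise ValueError(f"{case.case_id}: invalid case_type {case.case_type!r}")
        if case.intended in case.distractors:
            raise ValueError(f"{case.case_id}: distractor equals intended symbol")


DATASET = [
    GroundingCase(
        case_id="a-01",
        case_type="A",
        decision="Customer checkout retries are capped.",
        intended="checkout/orders.py:process_order",
        distractors=("admin/orders.py:process_order", "billing/refunds.py:process_order"),
    ),
    GroundingCase(
        case_id="a-02",
        case_type="A",
        decision="Finance-flagged orders are replayed manually.",
        intended="admin/orders.py:process_order",
        distractors=("checkout/orders.py:process_order",),
    ),
]

# Module-level call: a malformed row fails at import, before any eval runs.
validate_dataset(DATASET)
```

Running the check at module level means a bad dataset edit surfaces as an import error in every consumer, rather than as a confusing metric shift later.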

Default gates (per #280 acceptance)

Recall ≥ 0.80
Precision ≥ 0.85
Abort rate ≤ 0.30
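
A sketch of how these thresholds could map to an exit code under the two gate modes. The thresholds are the ones listed above; the function names, the message format, and the choice of exit code 1 for a hard-mode breach are assumptions rather than the runner's actual behavior.

```python
import sys

# Thresholds copied from the default gates above.
GATES = {
    "recall": (">=", 0.80),
    "precision": (">=", 0.85),
    "abort_rate": ("<=", 0.30),
}


def find_breaches(metrics: dict[str, float]) -> list[str]:
    """Return one human-readable message per gate that the metrics violate."""
    breaches = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            breaches.append(f"{name}={value:.2f} violates {op} {threshold:.2f}")
    return breaches


def exit_code(metrics: dict[str, float], gate_mode: str) -> int:
    """Report breaches in both modes; only 'hard' turns them into a failure."""
    breaches = find_breaches(metrics)
    for line in breaches:
        print(f"  gate breach: {line}", file=sys.stderr)
    return 1 if breaches and gate_mode == "hard" else 0
```

In warn mode the breach is still printed, which is what lets a baseline be observed in CI logs before the gate is made blocking.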

Cache strategy

Cache hits at tests/eval/fixtures/bind_judge/ keep CI cost ~$0 unless one of these changes:

BICAMERAL_GROUNDING_EVAL_MODEL
skills/bicameral-bind/SKILL.md SHA
The fixture repo SHA
The case's decision text

Same pattern as the existing M1 _skill_judge.py.
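
To make the invalidation rule concrete, here is a rough sketch of a cache key built from the four inputs listed above. The PR only says `SHA(model | bind_skill | repo | decision)`; the use of SHA-256, the `|` join, and the way the fixture repo is hashed are all assumptions, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def repo_digest(repo_root: Path) -> str:
    """Hash every file's relative path and contents, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in repo_root.rglob("*") if p.is_file()):
        h.update(path.relative_to(repo_root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def cache_key(model: str, skill_text: str, repo_sha: str, decision: str) -> str:
    """Combine the four inputs so that changing any one yields a new key."""
    parts = [model, sha256_hex(skill_text.encode()), repo_sha, decision]
    return sha256_hex("|".join(parts).encode())


# Hypothetical inputs; repo_sha would come from repo_digest(fixture_repo_path).
key = cache_key("some-model", "skill body", "abc123", "Cap checkout retries at 3.")
```

A cached response would then be looked up by `key`, so an unchanged model, skill file, fixture repo, and decision text never trigger a new API call.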

Local verification

✅ Dataset imports clean (23 cases, _validate_dataset() passes)
✅ _bind_judge symbol indexer resolves all 11 spot-checked intended symbols (including Class.method form like CheckoutRetryGuard.check_cap, TenantCheckoutRateLimiter.check)
✅ eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures (0 cases, gate breaches reported, exits 0 in warn mode as designed)
⏳ Live eval will run once this PR opens (CI provides ANTHROPIC_API_KEY)

What's NOT in this PR (per `plan-280-grounding-precision-fix.md`)

PR-3 — PostHog m2_grounding_* events + dashboard panel (separate surface, not blocking the eval)
Friction capture (≥ 5 design-partner cases) — not engineering scope
Hard-gate flip — defer until baseline is stable; same path the M1 eval has been on

Refs

Refs #280. Built on #283's PR-1 (handler reject path) which is in dev as 6c4a1c5.

🤖 Generated with Claude Code

…#280 PR-2)

Synthetic-fixture benchmark that drives the bicameral-bind skill end-to-end against 23 cases across three failure modes: same-name-different-module, similar-intent-different-symbol, and cross-language.

Measures three axes, deliberately split for diagnosis:

- precision = correct / (correct + wrong_symbol + wrong_file)
- recall = correct / total_rows
- abort_rate = aborted / total_rows

The split matters: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the #280 PR-1 handler would now reject (handler_rejected outcome would surface as precision drag).

Files

tests/fixtures/grounding_recall/dataset.py (230 LOC)
23 GroundingCase rows: 5 case-A (process_order × 3 modules, cancel_order × 2 modules), 10 case-B (rate-limit/throttle/retry/auth/metrics intent disambiguation), 8 case-C (Python ↔ TS pairs). GENERATOR_VERSION constant invalidates the cache when bumped. Import-time _validate_dataset() fails loud on duplicate ids, invalid case_type, distractor === intended, etc.

tests/fixtures/grounding_recall/repo/ (15 files / ~625 LOC)
Hand-crafted fixture repo with intended + distractor symbols. Each function/class body is short but real enough that the agent can actually distinguish behavior from keyword overlap (e.g. checkout/orders.py:process_order = customer flow w/ retry cap; admin/orders.py:process_order = manual replay of finance-flagged orders; billing/refunds.py:process_order = bulk-refund pipeline).

tests/eval/_bind_judge.py (466 LOC)
Headless caller-LLM driver, modeled on tests/eval/_skill_judge.py. Multi-turn tool-use loop with 3 tools exposed: read_file, validate_symbols, submit_binding. Cap at 8 turns. Cache at tests/eval/fixtures/bind_judge/ keyed on SHA(model | bind_skill | repo | decision). Cache hits keep CI cost ~$0 unless dataset, fixture repo, or skill change.

tests/eval_grounding_recall.py (256 LOC)
Argparse runner, modeled on tests/eval_decision_relevance.py. Loads dataset, drives _bind_judge per case, classifies outcome (correct / wrong_symbol / wrong_file / aborted), aggregates, emits JSON report, optional gate enforcement (--gate-mode warn|hard).

.github/workflows/test-mcp-regression.yml (+19 LOC)
New "M2 grounding-recall eval (warn-only)" step. Ubuntu-only, continue-on-error: true, mirrors the M1 step shape. ANTHROPIC_API_KEY from secrets, model env var, output to test-results/m2-grounding-recall.json.

CHANGELOG.md (+2 lines)

Default gates per #280 acceptance: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30. Ship warn-only first to record a post-PR-1 baseline, then ratchet to --gate-mode hard once the signal is stable. Same path the M1 eval has been on.

Out of scope for PR-2 (per plan-280-grounding-precision-fix.md):

- PR-3 ships PostHog m2_grounding_* events + dashboard panel
- Friction capture (≥ 5 design-partner cases) is not engineering scope

Local verification

- dataset.py imports clean (23 cases, _validate_dataset() passes)
- _bind_judge symbol indexer resolves all 11 spot-checked intended symbols including Class.method form
- eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures (0 cases, gate breaches reported, exit 0 in warn mode as designed)

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-08T19:52:50Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.


Five lint-side findings on the initial PR-2 commit, none of them runtime; fixing in place rather than amending the prior commit:

- tests/eval/_bind_judge.py B007: add `# noqa: B007` to the `for turn in range(...)` loop. The loop variable IS used after the loop for telemetry (judgment_payload["turns"] = turn); suppression is more honest than renaming to `_turn` and losing the post-loop reference.
- tests/eval/_bind_judge.py mypy: type-annotate `chosen_model: str` and tighten the `os.getenv` fallback chain so mypy can resolve `str | None` → `str`. Construct BindJudgment field-by-field instead of `**judgment_payload` so the dataclass field types are enforced (3× errors in the cached + write paths).
- tests/eval_grounding_recall.py I001 + E402: per-line `# noqa: E402, I001` on the two local imports that must follow the sys.path inserts. Same shape `eval_decision_relevance.py` uses for its single post-path import.
- tests/eval_grounding_recall.py F541: drop the f-prefix on `print(f" ✓ all gates pass")` (no placeholders).
- tests/fixtures/grounding_recall/repo/src/checkout/orders.py B007: rename `for attempt in range(3):` → `for _attempt in range(3):` (loop body doesn't reference the counter).

Plus `ruff format` reflowed 4 files (line wrapping, parens, exponent spacing), with no semantic changes.

Local verification: ruff check + ruff format --check + mypy all green on the PR-2 surface (15 fixture files + 2 eval files).

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
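
The first B007 item is easier to follow with the loop shape in view. The sketch below is a stand-in, not the `_bind_judge.py` loop: `run_turns` and the `step` callback are hypothetical, and only the 8-turn cap and the post-loop read of `turn` come from the text above.

```python
from typing import Callable

MAX_TURNS = 8  # cap stated in the PR description


def run_turns(step: Callable[[], bool], max_turns: int = MAX_TURNS) -> dict:
    """Call step() until it reports a submitted binding or the cap is hit."""
    submitted = False
    turn = 0  # pre-bound so the post-loop read is safe even if max_turns == 0
    # The body never reads `turn`, which is what B007 flags; the value is
    # still needed after the loop, so the name is kept and the lint silenced.
    for turn in range(1, max_turns + 1):  # noqa: B007
        submitted = step()
        if submitted:
            break
    return {"turns": turn, "submitted": submitted}


# A stub agent that submits on its third call.
calls = iter([False, False, True])
result = run_turns(lambda: next(calls))
```

Renaming the variable to `_turn` would satisfy the linter but make the telemetry line read a name that conventionally signals "unused", which is the trade-off the commit message describes.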

Triages 25 dev commits onto main (already on dev as of merge time):

• #289: team-mode remote event-log adapter (#277)
• #285, #284, #283: M2 grounding telemetry, eval harness, precision fix (#280)
• #275: README/SECURITY surface
• plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block, inserted v0.14.2's release entry from main below it, then renamed [Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode section + extended setup-writes table from #289; main was missing both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gate (BicameralAI#58)

Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement only**: no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize."

Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to recommended):

Q1 dataset size → 25 cases hand-curated, matches M2's 23
Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing
Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic
Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------

overall recall: surfaced / (surfaced + missed), gate ≥ 0.70
per-mode recall: same, sliced by miss_mode, gate ≥ 0.50
fire rate: response.fired / total, gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them.

Files
-----

tests/fixtures/preflight_m6/dataset.py (412 LOC)
25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status.

tests/eval/_preflight_m6_seeder.py (231 LOC)
Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). Reset code-locator + ledger singletons before AND after so the next case starts clean.

tests/eval_preflight_m6_recall.py (274 LOC)
Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer.

tests/eval_preflight_m6_summary.py (162 LOC)
Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors.

tests/test_preflight_m6_eval.py (267 LOC)
16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock, so any added/removed field on the response shape fails the test honestly.

.github/workflows/test-mcp-regression.yml (+31 LOC)
New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed; preflight retrieval is deterministic.

CHANGELOG.md (+2 lines)
[Unreleased] / Added entry.

Local verification
------------------

- 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.)
- Dataset import + _validate_dataset() pass: 25 cases (8/8/9)
- Runner --help renders cleanly
- Summary renderer smoke-tested on synthetic JSON: per-mode table + missed-case detail render correctly with emoji gates
- ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR (intentionally, Phase B gating)
------------------------------------------------------

- Any runtime change to handle_preflight or _region_anchored_preflight
- Skill changes (no agent-facing contract change in Phase A)
- Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred)
- Semantic search layer (Phase B candidate, deferred)
- LLM reranker (Phase B candidate, deferred)
- Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker.

Refs BicameralAI#58. Plan: plan-58-preflight-decision-detection.md. Phase 2 spec signoff: BicameralAI#58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
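
The M6 commit message above says seeder errors are excluded from the recall denominator and that recall is also sliced by miss mode. A minimal sketch of that arithmetic follows; the row shape `(miss_mode, outcome)`, the outcome labels, and the function name are assumptions, and the fire-rate axis is left out for brevity.

```python
from collections import Counter, defaultdict


def aggregate_recall(rows: list[tuple[str, str]]) -> dict:
    """rows are (miss_mode, outcome) with outcome in surfaced/missed/error."""
    overall: Counter = Counter()
    per_mode: dict[str, Counter] = defaultdict(Counter)
    for miss_mode, outcome in rows:
        overall[outcome] += 1
        per_mode[miss_mode][outcome] += 1

    def recall(counts: Counter) -> float:
        # Errors never enter the denominator: they are infra failures,
        # not retrieval misses.
        denom = counts["surfaced"] + counts["missed"]
        return counts["surfaced"] / denom if denom else 0.0

    return {
        "overall_recall": recall(overall),
        "errors": overall["error"],  # reported separately for reviewers
        "per_mode_recall": {mode: recall(c) for mode, c in per_mode.items()},
    }


report = aggregate_recall(
    [
        ("vocabulary", "surfaced"),
        ("vocabulary", "missed"),
        ("unbound", "surfaced"),
        ("transitive", "error"),
    ]
)
```

With these four rows the one seeder error leaves overall recall at 2 of 3 scored cases rather than 2 of 4, which is the distinction the exclusion rule is meant to preserve.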

silongtan temporarily deployed to ci-test May 8, 2026 19:52 — with GitHub Actions Inactive

silongtan temporarily deployed to ci-test May 8, 2026 20:26 — with GitHub Actions Inactive

silongtan merged commit f8cd9ee into dev May 8, 2026
6 checks passed

This was referenced May 9, 2026

feat(telemetry): M2 grounding-precision events + dashboard panel (#280 PR-3) #285

Merged

ci(m2): flip M2 grounding-recall gate warn → hard after stable baseline (#280) #288

Merged

jinhongkuan mentioned this pull request May 9, 2026

triage: dev → main · v0.14.3 (#277 team-mode + #280 grounding precision) #290

Merged

5 tasks

This was referenced May 11, 2026

M6 preflight handler retrieval: by-design split (handler structural, skill-layer covered by #306) #58

Closed

feat(eval): M6 preflight retrieval recall eval — Phase A measurement gate (#58) #304

Merged

silongtan merged 2 commits into dev from 280-grounding-eval-harness
