feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2) #284
Merged
…#280 PR-2)

Synthetic-fixture benchmark that drives the bicameral-bind skill end-to-end against 23 cases across three failure modes: same-name-different-module, similar-intent-different-symbol, and cross-language.

Measures three axes deliberately split for diagnosis:
- precision = correct / (correct + wrong_symbol + wrong_file)
- recall = correct / total_rows
- abort_rate = aborted / total_rows

The split matters: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the #280 PR-1 handler would now reject (handler_rejected outcome would surface as precision drag).

Files

tests/fixtures/grounding_recall/dataset.py (230 LOC)
  23 GroundingCase rows: 5 case-A (process_order × 3 modules, cancel_order × 2 modules), 10 case-B (rate-limit/throttle/retry/auth/metrics intent disambiguation), 8 case-C (Python ↔ TS pairs). GENERATOR_VERSION constant invalidates the cache when bumped. Import-time _validate_dataset() fails loud on duplicate ids, invalid case_type, distractor === intended, etc.

tests/fixtures/grounding_recall/repo/ (15 files / ~625 LOC)
  Hand-crafted fixture repo with intended + distractor symbols. Each function/class body is short but real enough that the agent can actually distinguish behavior from keyword overlap (e.g. checkout/orders.py:process_order = customer flow w/ retry cap; admin/orders.py:process_order = manual replay of finance-flagged orders; billing/refunds.py:process_order = bulk-refund pipeline).

tests/eval/_bind_judge.py (466 LOC)
  Headless caller-LLM driver, modeled on tests/eval/_skill_judge.py. Multi-turn tool-use loop with 3 tools exposed: read_file, validate_symbols, submit_binding. Cap at 8 turns. Cache at tests/eval/fixtures/bind_judge/ keyed on SHA(model | bind_skill | repo | decision). Cache hits keep CI cost ~$0 unless dataset, fixture repo, or skill change.

tests/eval_grounding_recall.py (256 LOC)
  Argparse runner, modeled on tests/eval_decision_relevance.py. Loads dataset, drives _bind_judge per case, classifies outcome (correct / wrong_symbol / wrong_file / aborted), aggregates, emits JSON report, optional gate enforcement (--gate-mode warn|hard).

.github/workflows/test-mcp-regression.yml (+19 LOC)
  New "M2 grounding-recall eval (warn-only)" step. Ubuntu-only, continue-on-error: true, mirrors the M1 step shape. ANTHROPIC_API_KEY from secrets, model env var, output to test-results/m2-grounding-recall.json.

CHANGELOG.md (+2 lines)

Default gates per #280 acceptance: recall ≥ 0.80, precision ≥ 0.85, abort_rate ≤ 0.30. Ship warn-only first to record a post-PR-1 baseline, then ratchet to --gate-mode hard once the signal is stable. Same path the M1 eval has been on.

Out of scope for PR-2 (per plan-280-grounding-precision-fix.md):
- PR-3 ships PostHog m2_grounding_* events + dashboard panel
- Friction capture (≥ 5 design-partner cases) is not engineering scope

Local verification
- dataset.py imports clean (23 cases, _validate_dataset() passes)
- _bind_judge symbol indexer resolves all 11 spot-checked intended symbols including Class.method form
- eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures (0 cases, gate breaches reported, exit 0 in warn mode as designed)

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
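The cache key described in the commit message, SHA(model | bind_skill | repo | decision), could be derived roughly as follows. This is a minimal sketch, not the code in `_bind_judge.py`: the function name `cache_key`, its parameter names, and the choice of SHA-256 are assumptions (the commit only says "SHA").

```python
import hashlib


def cache_key(model: str, bind_skill_text: str, repo_digest: str, decision: str) -> str:
    """Hypothetical key builder: one digest over the four cache-relevant inputs.

    Assumption: SHA-256 with a '|' separator after each component. The real
    harness may hash different representations of the skill and fixture repo.
    """
    digest = hashlib.sha256()
    for part in (model, bind_skill_text, repo_digest, decision):
        digest.update(part.encode("utf-8"))
        digest.update(b"|")  # separator keeps adjacent components from merging
    return digest.hexdigest()
```

Because every component feeds the digest, changing the model, the skill text, the fixture repo, or a case's decision produces a new key and therefore a cache miss, while an unchanged case keeps hitting the stored response.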
Five lint-side findings on the initial PR-2 commit, none of them
runtime — fixing in place rather than amending the prior commit:
- tests/eval/_bind_judge.py B007: add `# noqa: B007` to the
`for turn in range(...)` loop. The loop variable IS used after
the loop for telemetry (judgment_payload["turns"] = turn);
suppression is more honest than renaming to `_turn` and losing
the post-loop reference.
- tests/eval/_bind_judge.py mypy: type-annotate `chosen_model: str`
and tighten the `os.getenv` fallback chain so mypy can resolve
`str | None` → `str`. Construct BindJudgment field-by-field
instead of `**judgment_payload` so the dataclass field types
are enforced (3× errors in the cached + write paths).
- tests/eval_grounding_recall.py I001 + E402: per-line
`# noqa: E402, I001` on the two local imports that must follow
the sys.path inserts. Same shape `eval_decision_relevance.py`
uses for its single post-path import.
- tests/eval_grounding_recall.py F541: drop the f-prefix on
`print(f" ✓ all gates pass")` (no placeholders).
- tests/fixtures/grounding_recall/repo/src/checkout/orders.py B007:
rename `for attempt in range(3):` → `for _attempt in range(3):`
(loop body doesn't reference the counter).
Plus `ruff format` reflowed 4 files (line wrapping, parens, exponent
spacing) — no semantic changes.
Local verification: ruff check + ruff format --check + mypy all
green on the PR-2 surface (15 fixture files + 2 eval files).
Refs #280.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
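Two of the lint fixes above can be illustrated in isolation. The sketch below is an assumption-laden reconstruction, not the code in `_bind_judge.py`: `resolve_model`, `run_turns`, `DEFAULT_MODEL`, and the `replies` parameter are hypothetical, and the use of `BICAMERAL_GROUNDING_EVAL_MODEL` as the fallback variable is inferred from the PR description.

```python
from __future__ import annotations

import os

DEFAULT_MODEL = "placeholder-model-id"  # hypothetical default, not from the PR


def resolve_model(explicit: str | None = None) -> str:
    # Ending the `or` chain in a plain str lets mypy narrow `str | None` to `str`.
    chosen_model: str = (
        explicit or os.getenv("BICAMERAL_GROUNDING_EVAL_MODEL") or DEFAULT_MODEL
    )
    return chosen_model


def run_turns(replies: list[str], max_turns: int = 8) -> dict[str, int]:
    pending = iter(replies)
    turn = 0
    # The body never reads `turn`, which is what B007 flags; the value is
    # still needed after the loop, so the check is suppressed, not renamed.
    for turn in range(1, max_turns + 1):  # noqa: B007
        if next(pending, "submit_binding") == "submit_binding":
            break
    return {"turns": turn}
```

Renaming the variable to `_turn` would silence B007 as well, but the post-loop read would then reference a name that signals "unused", which is the trade-off the commit message describes.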
jinhongkuan pushed a commit that referenced this pull request on May 9, 2026
Triages 25 dev commits onto main (already on dev as of merge time):
• #289: team-mode remote event-log adapter (#277)
• #285, #284, #283: M2 grounding telemetry, eval harness, precision fix (#280)
• #275: README/SECURITY surface
• plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block, inserted v0.14.2's release entry from main below it, then renamed [Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode section + extended setup-writes table from #289; main was missing both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request on May 21, 2026
…gate (BicameralAI#58)

Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement only**: no runtime change to `handle_preflight` or any retrieval surface. Recall regression risk = zero. The Phase B optimization choice (multi-hop expansion / semantic search / LLM reranker) is gated on this PR's first stable baseline, per the wiki's optimization principle: "identify the specific scenario, then optimize."

Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to recommended):
  Q1 dataset size → 25 cases hand-curated, matches M2's 23
  Q2 miss-mode buckets → three (vocabulary / unbound / transitive), matches the issue body's framing
  Q3 fire-rate gate → raw retrieval (`response.decisions`); fire is downstream and a secondary diagnostic
  Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------
  overall recall    surfaced / (surfaced + missed)    gate ≥ 0.70
  per-mode recall   same, sliced by miss_mode         gate ≥ 0.50
  fire rate         response.fired / total            gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the recall denominator but counted separately so reviewers can see them.

Files
-----
tests/fixtures/preflight_m6/dataset.py (412 LOC)
  25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes. Frozen dataclass; GENERATOR_VERSION constant invalidates downstream caches when bumped. Import-time _validate_dataset() fails loud on duplicate case_id, invalid miss_mode, transitive case without intended_file_path, unbound case with non-ungrounded status.

tests/eval/_preflight_m6_seeder.py (231 LOC)
  Per-case freshness: each call creates a new tempdir + memory:// ledger + git-initialized repo + writes a placeholder file (or the transitive case's intended + caller files). Calls the real handle_ingest + handle_bind so seeded rows have production shape (source_type, span, signoff, binds_to). Reset code-locator + ledger singletons before AND after so the next case starts clean.

tests/eval_preflight_m6_recall.py (274 LOC)
  Argparse runner, drives the seeder + handle_preflight, classifies outcomes, aggregates. JSON output + gate enforcement (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so existing CI patterns transfer.

tests/eval_preflight_m6_summary.py (162 LOC)
  Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode table + collapsible missed-case detail with topic + intended description. Fail-quiet on missing JSON / parse errors.

tests/test_preflight_m6_eval.py (267 LOC)
  16 sociable unit tests for the classifier + aggregator. Per the new CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace + real M6Case dataclasses, NEVER MagicMock, so any added/removed field on the response shape fails the test honestly.

.github/workflows/test-mcp-regression.yml (+31 LOC)
  New "M6 preflight recall eval (warn-only)" + summary steps after M2. No ANTHROPIC_API_KEY needed; preflight retrieval is deterministic.

CHANGELOG.md (+2 lines)
  [Unreleased] / Added entry.

Local verification
------------------
- 16/16 sociable unit tests pass on the classifier + aggregator (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator, test_per_miss_mode_breakdown, etc.)
- Dataset import + _validate_dataset() pass: 25 cases (8/8/9)
- Runner --help renders cleanly
- Summary renderer smoke-tested on synthetic JSON: per-mode table + missed-case detail render correctly with emoji gates
- ruff check + ruff format --check + mypy all green on touched files

What's NOT in this PR (intentionally: Phase B gating)
------------------------------------------------------
- Any runtime change to handle_preflight or _region_anchored_preflight
- Skill changes (no agent-facing contract change in Phase A)
- Multi-hop / call-graph / inheritance graph expansion (Phase B candidate, deferred)
- Semantic search layer (Phase B candidate, deferred)
- LLM reranker (Phase B candidate, deferred)
- Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode from the per-mode breakdown and ship Phase B targeted to it. Cheap-first ordering per the wiki: search_hint refinement → multi-hop graph → semantic → reranker.

Refs BicameralAI#58.
Plan: plan-58-preflight-decision-detection.md.
Phase 2 spec signoff: BicameralAI#58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
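The recall definition in the M6 commit message above, where seeder errors are left out of the denominator but still counted, can be sketched as below. The names `recall` and `aggregate` and the `(miss_mode, outcome)` row shape are hypothetical and chosen for illustration; the real aggregator works on M6Case rows and preflight responses, and fire rate is omitted here.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from typing import Optional


def recall(counts: Counter) -> Optional[float]:
    """surfaced / (surfaced + missed); None when nothing was scored."""
    scored = counts["surfaced"] + counts["missed"]
    return counts["surfaced"] / scored if scored else None


def aggregate(rows: list[tuple[str, str]]) -> dict:
    """rows are (miss_mode, outcome), outcome in {'surfaced', 'missed', 'error'}."""
    overall: Counter = Counter()
    per_mode: dict[str, Counter] = defaultdict(Counter)
    for miss_mode, outcome in rows:
        overall[outcome] += 1
        per_mode[miss_mode][outcome] += 1
    return {
        "overall_recall": recall(overall),
        # Errors never enter the recall denominator but stay visible.
        "errors": overall["error"],
        "per_mode_recall": {mode: recall(c) for mode, c in per_mode.items()},
    }
```

Returning `None` for a mode with no scored rows is a design choice made for this sketch: it keeps an all-error bucket from reading as a recall of zero.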
Summary
PR-2 of 3 for #280. Ships the synthetic-recall benchmark that measures caller-LLM grounding precision against the post-PR-1 handler. Companion to #283 (PR-1, the handler reject path), which landed in `dev` as `6c4a1c5`.

This PR is a measurement layer only: no runtime contract change. Warn-only CI initially per #280's gating-is-observability framing; we ship the eval, observe the baseline, then ratchet to a hard gate.
What's in this PR

Three measurement axes (deliberately split for diagnosis)

- precision = `correct / (correct + wrong_symbol + wrong_file)`
- recall = `correct / total_rows` (aborts AND wrong bindings count against)
- abort_rate = `aborted / total_rows`: a first-class signal because the bind skill makes "abort on weak evidence" an explicit contract

The split tells us why we're missing: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the PR-1 handler now rejects. Without the split it's a single number; with it we know which knob to turn.

Files

- `tests/fixtures/grounding_recall/dataset.py`: 23 `GroundingCase` rows: 5 case-A (same-name-different-module), 10 case-B (similar-intent), 8 case-C (cross-language). `GENERATOR_VERSION` invalidates cache when bumped. Import-time `_validate_dataset()` fails loud on shape regressions.
- `tests/fixtures/grounding_recall/repo/`: hand-crafted fixture repo with intended + distractor symbols (e.g. `process_order` definitions in three modules, each with different semantics).
- `tests/eval/_bind_judge.py`: headless caller-LLM driver, modeled on `tests/eval/_skill_judge.py`. Multi-turn tool-use loop with 3 tools: `read_file`, `validate_symbols`, `submit_binding`. Cap at 8 turns. Response cache at `tests/eval/fixtures/bind_judge/` keyed on `SHA(model | bind_skill | repo | decision)`.
- `tests/eval_grounding_recall.py`: argparse runner, modeled on `tests/eval_decision_relevance.py`. Loads dataset, drives `_bind_judge` per case, classifies outcome, aggregates, emits JSON, optional gate enforcement (`--gate-mode warn|hard`).
- `.github/workflows/test-mcp-regression.yml`: new "M2 grounding-recall eval (warn-only)" step. Ubuntu-only, `continue-on-error: true`, mirrors the M1 step shape.
- `CHANGELOG.md`: +2 lines.

Default gates (per #280 acceptance)

- recall ≥ 0.80
- precision ≥ 0.85
- abort_rate ≤ 0.30
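The three axes and the default gates above can be worked through in a few lines. This is a minimal sketch under stated assumptions, not the runner's code: `Binding`, `classify`, `summarize`, and `gate_breaches` are hypothetical names, and checking the file before the symbol is an assumed precedence.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:  # hypothetical shape of an intended or submitted binding
    file_path: str
    symbol: str


def classify(intended: Binding, submitted: Optional[Binding]) -> str:
    if submitted is None:
        return "aborted"  # the agent declined to bind
    if submitted.file_path != intended.file_path:
        return "wrong_file"  # assumption: file mismatch outranks symbol mismatch
    if submitted.symbol != intended.symbol:
        return "wrong_symbol"
    return "correct"


def summarize(outcomes: list[str]) -> dict[str, float]:
    counts = Counter(outcomes)
    total = len(outcomes)
    attempted = counts["correct"] + counts["wrong_symbol"] + counts["wrong_file"]
    return {
        "precision": counts["correct"] / attempted if attempted else 0.0,
        "recall": counts["correct"] / total if total else 0.0,
        "abort_rate": counts["aborted"] / total if total else 0.0,
    }


def gate_breaches(metrics: dict[str, float]) -> list[str]:
    breaches = []
    if metrics["recall"] < 0.80:
        breaches.append("recall")
    if metrics["precision"] < 0.85:
        breaches.append("precision")
    if metrics["abort_rate"] > 0.30:
        breaches.append("abort_rate")
    return breaches
```

With 8 correct bindings, 1 wrong symbol, and 1 abort out of 10 rows, precision is 8/9 and recall is 8/10, so both gates pass. An over-cautious agent with 5 correct and 5 aborts would keep precision at 1.0 while breaching recall and abort_rate, which is the diagnostic value of splitting the axes.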
Cache strategy

Cache hits at `tests/eval/fixtures/bind_judge/` keep CI cost ~$0 unless one of these changes:

- the model (`BICAMERAL_GROUNDING_EVAL_MODEL`)
- the `skills/bicameral-bind/SKILL.md` SHA
- the dataset or the fixture repo

Same pattern as the existing M1 `_skill_judge.py`.

Local verification

- `dataset.py` imports clean (23 cases, `_validate_dataset()` passes)
- `_bind_judge` symbol indexer resolves all 11 spot-checked intended symbols (including `Class.method` form like `CheckoutRetryGuard.check_cap`, `TenantCheckoutRateLimiter.check`)
- `eval_grounding_recall.py` CLI runs offline with `--skip-missing-fixtures` (0 cases, gate breaches reported, exits 0 in warn mode as designed)

What's NOT in this PR (per `plan-280-grounding-precision-fix.md`)

- PR-3 ships PostHog `m2_grounding_*` events + dashboard panel (separate surface, not blocking the eval)

Refs

Refs #280. Built on #283's PR-1 (handler reject path) which is in `dev` as `6c4a1c5`.

🤖 Generated with Claude Code