feat(eval): M2 grounding-recall harness for caller-LLM bind precision (#280 PR-2)#284

Merged
silongtan merged 2 commits into dev from 280-grounding-eval-harness
May 8, 2026

Conversation

@silongtan

Collaborator

Summary

PR-2 of 3 for #280. Ships the synthetic-recall benchmark that measures caller-LLM grounding precision against the post-PR-1 handler. Companion to #283 (PR-1 — handler reject path), which landed in dev as 6c4a1c5.

This PR is a measurement layer only — no runtime contract change. Warn-only CI initially per #280's gating-is-observability framing; we ship the eval, observe the baseline, then ratchet to a hard gate.

What's in this PR

Three measurement axes (deliberately split for diagnosis)

  • Precision = correct / (correct + wrong_symbol + wrong_file)
  • Recall = correct / total_rows (aborts AND wrong bindings count against)
  • Abort rate = aborted / total_rows — first-class signal because the bind skill makes "abort on weak evidence" an explicit contract

The split tells us why we're missing: high-precision-low-recall = agent over-cautious; low-precision-high-recall = hallucinations the PR-1 handler now rejects. Without the split it's a single number; with it we know which knob to turn.
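
A minimal sketch of how the three axes could be computed from per-case outcome labels. The `score` function name and the list-of-strings input are illustrative assumptions, not the harness's actual API; the four labels are the outcomes the runner classifies.

```python
from collections import Counter

# Outcome labels named in this PR; the function and input shape are hypothetical.
OUTCOMES = ("correct", "wrong_symbol", "wrong_file", "aborted")


def score(outcomes):
    """Compute precision, recall, and abort rate from per-case outcome labels."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    correct = counts["correct"]
    # Precision only looks at cases where the agent actually submitted a binding.
    attempted = correct + counts["wrong_symbol"] + counts["wrong_file"]
    return {
        "precision": correct / attempted if attempted else 0.0,
        "recall": correct / total if total else 0.0,
        "abort_rate": counts["aborted"] / total if total else 0.0,
    }


# An over-cautious agent: never wrong, but aborts on 13 of 23 cases.
over_cautious = ["correct"] * 10 + ["aborted"] * 13
print(score(over_cautious))
```

Under these definitions an abort lowers recall only, while a wrong binding lowers both precision and recall, which is what lets the split separate over-caution from hallucination.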

Files

| File | LOC | Role |
| --- | --- | --- |
| tests/fixtures/grounding_recall/dataset.py | 230 | 23 GroundingCase rows: 5 case-A (same-name-different-module), 10 case-B (similar-intent), 8 case-C (cross-language). GENERATOR_VERSION invalidates cache when bumped. Import-time _validate_dataset() fails loud on shape regressions. |
| tests/fixtures/grounding_recall/repo/ | 15 files / ~625 | Hand-crafted fixture repo with intended + plausible-distractor symbols. Each function body is short but real enough that the agent must read carefully to disambiguate (e.g. three process_order definitions in three modules, each with different semantics). |
| tests/eval/_bind_judge.py | 466 | Headless caller-LLM driver, modeled on tests/eval/_skill_judge.py. Multi-turn tool-use loop with 3 tools: read_file, validate_symbols, submit_binding. Cap at 8 turns. Response cache at tests/eval/fixtures/bind_judge/ keyed on SHA(model \| bind_skill \| repo \| decision). |
| tests/eval_grounding_recall.py | 256 | Argparse runner, modeled on tests/eval_decision_relevance.py. Loads dataset, drives _bind_judge per case, classifies outcome, aggregates, emits JSON, optional gate enforcement (--gate-mode warn\|hard). |
| .github/workflows/test-mcp-regression.yml | +19 | New "M2 grounding-recall eval (warn-only)" step: Ubuntu-only, continue-on-error: true, mirrors the M1 step shape. |
| CHANGELOG.md | +2 | Unreleased entry. |
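
To illustrate what "fails loud on shape regressions" can look like, here is a hedged sketch of an import-time validator. The `GroundingCase` field names, the frozen dataclass, and the case-type codes `"A"`/`"B"`/`"C"` are assumptions for illustration; only the checks themselves (duplicate ids, invalid case_type, distractor equal to intended) come from the commit message.

```python
from dataclasses import dataclass

# Hypothetical case-type codes; the real dataset may spell them differently.
VALID_CASE_TYPES = {"A", "B", "C"}


@dataclass(frozen=True)
class GroundingCase:
    # Field names are illustrative assumptions, not the real dataset schema.
    case_id: str
    case_type: str
    decision: str
    intended: str  # e.g. "checkout/orders.py:process_order"
    distractors: tuple


def validate_dataset(cases):
    """Raise ValueError on the first malformed row so a bad dataset fails at import."""
    seen = set()
    for case in cases:
        if case.case_id in seen:
            raise ValueError(f"duplicate case id: {case.case_id}")
        seen.add(case.case_id)
        if case.case_type not in VALID_CASE_TYPES:
            raise ValueError(f"{case.case_id}: invalid case_type {case.case_type!r}")
        if case.intended in case.distractors:
            raise ValueError(f"{case.case_id}: distractor equals intended symbol")


CASES = [
    GroundingCase(
        case_id="a-01",
        case_type="A",
        decision="Cap customer checkout retries",
        intended="checkout/orders.py:process_order",
        distractors=("admin/orders.py:process_order",),
    ),
]
validate_dataset(CASES)  # runs at import time in this sketch
```

Running the check at module import means a malformed row stops the eval before any model call is made.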

Default gates (per #280 acceptance)

  • Recall ≥ 0.80
  • Precision ≥ 0.85
  • Abort rate ≤ 0.30
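
A rough sketch of how these thresholds might be enforced under the two gate modes. The thresholds and the `warn`/`hard` mode names come from this PR; the `GATES` table, `gate_breaches`, and `exit_code` are hypothetical names, and the real runner may report breaches differently.

```python
# Thresholds from the PR description; the data structure itself is an assumption.
GATES = {
    "recall": (">=", 0.80),
    "precision": (">=", 0.85),
    "abort_rate": ("<=", 0.30),
}


def gate_breaches(metrics):
    """Return one human-readable message per breached gate."""
    breaches = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            breaches.append(f"{name}={value:.2f} violates {op} {threshold:.2f}")
    return breaches


def exit_code(metrics, gate_mode):
    """Warn mode always exits 0; hard mode exits 1 when any gate is breached."""
    breaches = gate_breaches(metrics)
    for message in breaches:
        print("gate breach:", message)
    return 1 if breaches and gate_mode == "hard" else 0


baseline = {"recall": 0.50, "precision": 0.90, "abort_rate": 0.40}
print(exit_code(baseline, "warn"))
print(exit_code(baseline, "hard"))
```

Keeping breach detection separate from the exit-code decision means the warn-only and hard modes report identical messages and differ only in whether CI fails.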

Cache strategy

Cache hits at tests/eval/fixtures/bind_judge/ keep CI cost ~$0 unless one of these changes:

  • BICAMERAL_GROUNDING_EVAL_MODEL
  • skills/bicameral-bind/SKILL.md SHA
  • The fixture repo SHA
  • The case's decision text

Same pattern as the existing M1 _skill_judge.py.
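
One way the cache key could be derived from the four inputs above, as a sketch. SHA-256, the NUL separator, and the function names are assumptions; the PR only states that the key is a SHA over model, bind skill, repo, and decision.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(model: str, skill_text: str, repo_sha: str, decision: str) -> str:
    """Combine the four cache inputs into one key.

    A NUL separator (an assumption of this sketch) keeps field boundaries
    unambiguous, so ("ab", "c") and ("a", "bc") cannot collide.
    """
    fields = (model, sha256_hex(skill_text), repo_sha, decision)
    return sha256_hex("\0".join(fields))


key = cache_key(
    model="example-model",          # placeholder value, not a real model id
    skill_text="# bind skill body",  # contents of SKILL.md in the real harness
    repo_sha="0" * 40,               # placeholder fixture-repo digest
    decision="Cap customer checkout retries",
)
print(key)
```

Because every input feeds the digest, editing the skill file or a single decision's text produces a new key for the affected cases, and only those cases trigger a live model call.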

Local verification

  • ✅ Dataset imports clean (23 cases, _validate_dataset() passes)
  • _bind_judge symbol indexer resolves all 11 spot-checked intended symbols (including Class.method form like CheckoutRetryGuard.check_cap, TenantCheckoutRateLimiter.check)
  • eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures (0 cases, gate breaches reported, exits 0 in warn mode as designed)
  • ⏳ Live eval will run once this PR opens (CI provides ANTHROPIC_API_KEY)

What's NOT in this PR (per plan-280-grounding-precision-fix.md)

  • PR-3 — PostHog m2_grounding_* events + dashboard panel (separate surface, not blocking the eval)
  • Friction capture (≥ 5 design-partner cases) — not engineering scope
  • Hard-gate flip — defer until baseline is stable; same path the M1 eval has been on

Refs

Refs #280. Built on #283's PR-1 (handler reject path) which is in dev as 6c4a1c5.

🤖 Generated with Claude Code

…#280 PR-2)

Synthetic-fixture benchmark that drives the bicameral-bind skill end-to-end
against 23 cases across three failure modes — same-name-different-module,
similar-intent-different-symbol, and cross-language. Measures three axes
deliberately split for diagnosis:

  - precision  = correct / (correct + wrong_symbol + wrong_file)
  - recall     = correct / total_rows
  - abort_rate = aborted / total_rows

The split matters: high-precision-low-recall = agent over-cautious; low-
precision-high-recall = hallucinations the #280 PR-1 handler would now
reject (handler_rejected outcome would surface as precision drag).

Files

  tests/fixtures/grounding_recall/dataset.py          230 LOC
    23 GroundingCase rows: 5 case-A (process_order × 3 modules,
    cancel_order × 2 modules), 10 case-B (rate-limit/throttle/retry/
    auth/metrics intent disambiguation), 8 case-C (Python ↔ TS pairs).
    GENERATOR_VERSION constant invalidates the cache when bumped.
    Import-time _validate_dataset() fails loud on duplicate ids,
    invalid case_type, distractor == intended, etc.

  tests/fixtures/grounding_recall/repo/               15 files / ~625 LOC
    Hand-crafted fixture repo with intended + distractor symbols.
    Each function/class body is short but real enough that the agent
    can actually distinguish behavior from keyword overlap (e.g.
    checkout/orders.py:process_order = customer flow w/ retry cap;
    admin/orders.py:process_order = manual replay of finance-flagged
    orders; billing/refunds.py:process_order = bulk-refund pipeline).

  tests/eval/_bind_judge.py                           466 LOC
    Headless caller-LLM driver — modeled on tests/eval/_skill_judge.py.
    Multi-turn tool-use loop with 3 tools exposed: read_file,
    validate_symbols, submit_binding. Cap at 8 turns. Cache at
    tests/eval/fixtures/bind_judge/ keyed on
    SHA(model | bind_skill | repo | decision). Cache hits keep CI
    cost ~$0 unless dataset, fixture repo, or skill change.

  tests/eval_grounding_recall.py                      256 LOC
    Argparse runner — modeled on tests/eval_decision_relevance.py.
    Loads dataset, drives _bind_judge per case, classifies outcome
    (correct / wrong_symbol / wrong_file / aborted), aggregates,
    emits JSON report, optional gate enforcement (--gate-mode warn|hard).

  .github/workflows/test-mcp-regression.yml           +19 LOC
    New "M2 grounding-recall eval (warn-only)" step. Ubuntu-only,
    continue-on-error: true, mirrors the M1 step shape. ANTHROPIC_API_KEY
    from secrets, model env var, output to test-results/m2-grounding-recall.json.

  CHANGELOG.md                                        +2 lines
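
The four-way outcome classification listed for tests/eval_grounding_recall.py above could be sketched as follows. The `classify` signature, the `(file, symbol)` tuple shape, and the choice to check the file before the symbol are assumptions of this sketch, not the runner's confirmed behavior.

```python
def classify(intended_file, intended_symbol, submitted):
    """Map one judged case to correct / wrong_symbol / wrong_file / aborted.

    `submitted` is a hypothetical (file_path, symbol) tuple, or None when the
    agent declined to bind.
    """
    if submitted is None:
        return "aborted"
    file_path, symbol = submitted
    # Assumption: a wrong file takes precedence, so a same-name symbol in the
    # wrong module is reported as wrong_file rather than correct.
    if file_path != intended_file:
        return "wrong_file"
    if symbol != intended_symbol:
        return "wrong_symbol"
    return "correct"


# Case-A style example: same symbol name, different module.
print(classify("checkout/orders.py", "process_order",
               ("admin/orders.py", "process_order")))
```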

Default gates per #280 acceptance: recall ≥ 0.80, precision ≥ 0.85,
abort_rate ≤ 0.30. Ship warn-only first to record a post-PR-1 baseline,
then ratchet to --gate-mode hard once the signal is stable. Same path
the M1 eval has been on.

Out of scope for PR-2 (per plan-280-grounding-precision-fix.md):

  - PR-3 ships PostHog m2_grounding_* events + dashboard panel
  - Friction capture (≥ 5 design-partner cases) is not engineering scope

Local verification

  - dataset.py imports clean (23 cases, _validate_dataset() passes)
  - _bind_judge symbol indexer resolves all 11 spot-checked intended
    symbols including Class.method form
  - eval_grounding_recall.py CLI runs offline with --skip-missing-fixtures
    (0 cases, gate breaches reported, exit 0 in warn mode as designed)

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 8, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Five lint-side findings on the initial PR-2 commit, none of them
runtime — fixing in place rather than amending the prior commit:

  - tests/eval/_bind_judge.py B007: add `# noqa: B007` to the
    `for turn in range(...)` loop. The loop variable IS used after
    the loop for telemetry (judgment_payload["turns"] = turn);
    suppression is more honest than renaming to `_turn` and losing
    the post-loop reference.

  - tests/eval/_bind_judge.py mypy: type-annotate `chosen_model: str`
    and tighten the `os.getenv` fallback chain so mypy can resolve
    `str | None` → `str`. Construct BindJudgment field-by-field
    instead of `**judgment_payload` so the dataclass field types
    are enforced (3× errors in the cached + write paths).

  - tests/eval_grounding_recall.py I001 + E402: per-line
    `# noqa: E402, I001` on the two local imports that must follow
    the sys.path inserts. Same shape `eval_decision_relevance.py`
    uses for its single post-path import.

  - tests/eval_grounding_recall.py F541: drop the f-prefix on
    `print(f"  ✓ all gates pass")` (no placeholders).

  - tests/fixtures/grounding_recall/repo/src/checkout/orders.py B007:
    rename `for attempt in range(3):` → `for _attempt in range(3):`
    (loop body doesn't reference the counter).

Plus `ruff format` reflowed 4 files (line wrapping, parens, exponent
spacing) — no semantic changes.

Local verification: ruff check + ruff format --check + mypy all
green on the PR-2 surface (15 fixture files + 2 eval files).
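
For readers unfamiliar with the B007 case described above, a small sketch of the pattern: the loop body never reads the counter, but the code after the loop does. The `run_turns` helper and the zero-argument `step` callable are hypothetical stand-ins for the real tool-use loop; only the 8-turn cap and the `judgment_payload["turns"] = turn` line come from this PR.

```python
MAX_TURNS = 8  # turn cap stated in the PR


def run_turns(step):
    """Call `step()` until it returns a binding or the turn cap is reached.

    `step` is a hypothetical zero-argument callable standing in for one
    model round-trip; it returns a binding, or None to keep going.
    """
    judgment_payload = {"binding": None}
    for turn in range(MAX_TURNS):  # noqa: B007
        binding = step()
        if binding is not None:
            judgment_payload["binding"] = binding
            break
    # The counter is read only here, after the loop, which is why renaming it
    # to `_turn` would be misleading. It holds the 0-based index of the last
    # turn that ran.
    judgment_payload["turns"] = turn
    return judgment_payload


answers = iter([None, None, ("checkout/orders.py", "process_order")])
print(run_turns(lambda: next(answers)))
```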

Refs #280.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@silongtan silongtan merged commit f8cd9ee into dev May 8, 2026
6 checks passed
jinhongkuan pushed a commit that referenced this pull request May 9, 2026
Triages 25 dev commits onto main (already on dev as of merge time):
  • #289 — team-mode remote event-log adapter (#277)
  • #285, #284, #283 — M2 grounding telemetry, eval harness, precision fix (#280)
  • #275 — README/SECURITY surface
  • plus assorted fixes flowing through dev

Resolved conflicts in CHANGELOG.md (kept dev's [Unreleased] block,
inserted v0.14.2's release entry from main below it, then renamed
[Unreleased] → v0.14.3) and README.md (kept dev's Solo-vs-Team mode
section + extended setup-writes table from #289 — main was missing
both because PR #289 hadn't backflowed yet).

pyproject.toml: 0.14.2 → 0.14.3
RECOMMENDED_VERSION: 0.14.1 → 0.14.3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Knapp-Kevin pushed a commit to Knapp-Kevin/bicameral-mcp that referenced this pull request May 21, 2026
…gate (BicameralAI#58)

Sibling of the M2 grounding-recall eval (BicameralAI#284). Phase A is **measurement
only** — no runtime change to `handle_preflight` or any retrieval surface.
Recall regression risk = zero. The Phase B optimization choice (multi-hop
expansion / semantic search / LLM reranker) is gated on this PR's first
stable baseline, per the wiki's optimization principle: "identify the
specific scenario, then optimize."

Per the Phase 2 spec posted on BicameralAI#58 (all four open questions defaulted to
recommended):

  Q1 dataset size       → 25 cases hand-curated, matches M2's 23
  Q2 miss-mode buckets  → three (vocabulary / unbound / transitive),
                          matches the issue body's framing
  Q3 fire-rate gate     → raw retrieval (`response.decisions`); fire is
                          downstream and a secondary diagnostic
  Q4 ledger persistence → per-run temp + memory:// (per-case freshness)

Three measurement axes (deliberately split for diagnosis)
---------------------------------------------------------

  overall recall    surfaced / (surfaced + missed)        gate ≥ 0.70
  per-mode recall   same, sliced by miss_mode             gate ≥ 0.50
  fire rate         response.fired / total                gate ≥ 0.60

Errors (seeder infra failures, not agent misses) are excluded from the
recall denominator but counted separately so reviewers can see them.
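
The recall arithmetic described above, with errors kept out of the denominator, might look like this sketch. The `aggregate` name, the `(miss_mode, outcome)` input pairs, and the returned dictionary keys are assumptions; fire rate is left out because it needs the preflight response itself.

```python
from collections import defaultdict

OUTCOME_KEYS = ("surfaced", "missed", "error")


def _recall(tally):
    """surfaced / (surfaced + missed); None when every case errored."""
    denominator = tally["surfaced"] + tally["missed"]
    return tally["surfaced"] / denominator if denominator else None


def aggregate(results):
    """Aggregate (miss_mode, outcome) pairs into overall and per-mode recall."""
    tallies = defaultdict(lambda: dict.fromkeys(OUTCOME_KEYS, 0))
    for miss_mode, outcome in results:
        if outcome not in OUTCOME_KEYS:
            raise ValueError(f"unknown outcome: {outcome!r}")
        tallies[miss_mode][outcome] += 1
    overall = {key: sum(t[key] for t in tallies.values()) for key in OUTCOME_KEYS}
    return {
        "overall_recall": _recall(overall),
        "errors": overall["error"],  # reported separately, never in the denominator
        "per_mode_recall": {mode: _recall(t) for mode, t in tallies.items()},
    }


report = aggregate([
    ("vocabulary", "surfaced"),
    ("vocabulary", "missed"),
    ("unbound", "surfaced"),
    ("transitive", "error"),
])
print(report)
```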

Files
-----

  tests/fixtures/preflight_m6/dataset.py       (412 LOC)
    25 hand-curated M6Case rows, 8 + 8 + 9 across the three modes.
    Frozen dataclass; GENERATOR_VERSION constant invalidates downstream
    caches when bumped. Import-time _validate_dataset() fails loud on
    duplicate case_id, invalid miss_mode, transitive case without
    intended_file_path, unbound case with non-ungrounded status.

  tests/eval/_preflight_m6_seeder.py            (231 LOC)
    Per-case freshness: each call creates a new tempdir + memory:// ledger
    + git-initialized repo + writes a placeholder file (or the transitive
    case's intended + caller files). Calls the real handle_ingest +
    handle_bind so seeded rows have production shape (source_type,
    span, signoff, binds_to). Reset code-locator + ledger singletons
    before AND after so the next case starts clean.

  tests/eval_preflight_m6_recall.py             (274 LOC)
    Argparse runner, drives the seeder + handle_preflight, classifies
    outcomes, aggregates. JSON output + gate enforcement
    (--gate-mode warn|hard). Mirrors eval_grounding_recall.py shape so
    existing CI patterns transfer.

  tests/eval_preflight_m6_summary.py            (162 LOC)
    Markdown step-summary renderer for $GITHUB_STEP_SUMMARY. Per-mode
    table + collapsible missed-case detail with topic + intended
    description. Fail-quiet on missing JSON / parse errors.

  tests/test_preflight_m6_eval.py               (267 LOC)
    16 sociable unit tests for the classifier + aggregator. Per the new
    CLAUDE.md "Sociable Testing for UX Paths" rule (BicameralAI#303): SimpleNamespace
    + real M6Case dataclasses, NEVER MagicMock — so any added/removed
    field on the response shape fails the test honestly.

  .github/workflows/test-mcp-regression.yml     (+31 LOC)
    New "M6 preflight recall eval (warn-only)" + summary steps after M2.
    No ANTHROPIC_API_KEY needed — preflight retrieval is deterministic.

  CHANGELOG.md                                  (+2 lines)
    [Unreleased] / Added entry.
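
The reasoning behind the "never MagicMock" rule for tests/test_preflight_m6_eval.py can be shown in a few lines. `surfaced_ids` and the `decision_id` field are hypothetical; `response.decisions` is the retrieval surface named above. A `MagicMock` accepts any attribute and iterates as empty by default, so a renamed field goes unnoticed, whereas a `SimpleNamespace` raises.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def surfaced_ids(response):
    """Hypothetical classifier helper: collect ids from response.decisions."""
    return [decision.decision_id for decision in response.decisions]


# Sociable stand-in: only the attributes we explicitly set exist.
good = SimpleNamespace(decisions=[SimpleNamespace(decision_id="d-1")])
print(surfaced_ids(good))

# If the response shape changes (here, `decisions` is renamed), the
# SimpleNamespace stand-in fails loudly...
stale = SimpleNamespace(results=[SimpleNamespace(decision_id="d-1")])
try:
    surfaced_ids(stale)
except AttributeError as exc:
    print("shape drift caught:", exc)

# ...while a MagicMock silently yields an empty result, hiding the drift.
print(surfaced_ids(MagicMock()))
```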

Local verification
------------------

  - 16/16 sociable unit tests pass on the classifier + aggregator
    (test_aggregate_basic_recall_math, test_errors_excluded_from_recall_denominator,
    test_per_miss_mode_breakdown, etc.)
  - Dataset import + _validate_dataset() pass — 25 cases (8/8/9)
  - Runner --help renders cleanly
  - Summary renderer smoke-tested on synthetic JSON — per-mode table +
    missed-case detail render correctly with emoji gates
  - ruff check + ruff format --check + mypy all green on touched files
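
A hedged sketch of a fail-quiet step-summary renderer in the spirit of tests/eval_preflight_m6_summary.py. The JSON key `per_mode_recall`, the function names, and the table columns are assumptions; the 0.50 per-mode gate and the `GITHUB_STEP_SUMMARY` target come from this commit message.

```python
import json
import os
from pathlib import Path

PER_MODE_GATE = 0.50  # per-mode recall gate from the commit message


def render_summary(report_path):
    """Render a per-mode Markdown table; return "" on any read or parse problem."""
    try:
        report = json.loads(Path(report_path).read_text(encoding="utf-8"))
        rows = [
            f"| {mode} | {recall:.2f} | {'✅' if recall >= PER_MODE_GATE else '❌'} |"
            for mode, recall in sorted(report["per_mode_recall"].items())
        ]
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Fail quiet: a missing or malformed report must not break the CI step.
        return ""
    header = ["| miss mode | recall | gate |", "| --- | --- | --- |"]
    return "\n".join(header + rows) + "\n"


def publish(report_path):
    """Append the table to the GitHub Actions job summary file, if one is set."""
    markdown = render_summary(report_path)
    target = os.environ.get("GITHUB_STEP_SUMMARY")
    if markdown and target:
        with open(target, "a", encoding="utf-8") as handle:
            handle.write(markdown)


print(repr(render_summary("does-not-exist.json")))
```

Catching the error at the renderer boundary keeps the summary step purely informational: the eval step itself still owns the pass/fail decision.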

What's NOT in this PR (intentionally — Phase B gating)
------------------------------------------------------

  - Any runtime change to handle_preflight or _region_anchored_preflight
  - Skill changes (no agent-facing contract change in Phase A)
  - Multi-hop / call-graph / inheritance graph expansion (Phase B
    candidate, deferred)
  - Semantic search layer (Phase B candidate, deferred)
  - LLM reranker (Phase B candidate, deferred)
  - Real-corpus eval (synthetic first; corpus follow-up if needed)

After this PR's first CI baseline lands, we pick the dominant miss-mode
from the per-mode breakdown and ship Phase B targeted to it. Cheap-first
ordering per the wiki: search_hint refinement → multi-hop graph → semantic
→ reranker.

Refs BicameralAI#58. Plan: plan-58-preflight-decision-detection.md.
Phase 2 spec signoff: BicameralAI#58 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>