Silong/fc2 eval multi region by silongtan · Pull Request #22 · BicameralAI/bicameral-mcp

silongtan · 2026-04-16T02:16:56Z

Summary by CodeRabbit

New Features
- Introduced file-based recall metrics and file cardinality tracking to improve code location evaluation accuracy.
- Enhanced reporting with multi-region decision analysis, including file-pattern coverage metrics and detailed diagnostics in verbose output.
Tests
- Added multi-region decision test fixtures with comprehensive evaluation scenarios.

5 cross-cutting decisions (bicameral-mcp workflows spanning 3+ files) added to ground truth fixtures with `multi_region: True` flag.

New metrics in eval_code_locator.py:
- recall@files: fraction of expected file patterns covered by grounded regions
- file-cardinality distribution: histogram of distinct files per grounding
- multi-region breakdown in summary output (all vs multi-region-only)

Baseline on bicameral monorepo: 0% multi-region recall@files (expected, because NL descriptions compete with demo/experiment files in the full index).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
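As a rough illustration of the two metrics described in this commit, here is a minimal sketch. It assumes expected file patterns are shell-style globs matched with `fnmatch`; the PR does not say how patterns are matched, and the function names and file paths below are hypothetical, not taken from eval_code_locator.py.

```python
from collections import Counter
from fnmatch import fnmatch


def recall_at_files(expected_patterns, grounded_files):
    """Fraction of expected file patterns matched by at least one grounded file."""
    if not expected_patterns:
        return 0.0
    covered = sum(
        1
        for pattern in expected_patterns
        if any(fnmatch(path, pattern) for path in grounded_files)
    )
    return covered / len(expected_patterns)


def file_cardinality_distribution(groundings):
    """Histogram mapping 'distinct files in a grounding' to 'number of groundings'."""
    return Counter(len(set(files)) for files in groundings)


# Hypothetical fixture data, for illustration only.
expected = ["handlers/ingest.py", "ledger/*.py", "code_locator/*.py"]
grounded = ["handlers/ingest.py", "ledger/adapter.py", "demo/experiment.py"]

score = recall_at_files(expected, grounded)  # two of three patterns are covered
histogram = file_cardinality_distribution([grounded, ["a.py"], ["a.py", "a.py"]])
print(score, dict(histogram))
```

A grounding that lands only in demo or experiment files scores 0 under this definition, which is the failure mode the baseline note describes.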

get_adapter() now checks .bicameral/local/code-graph.db first (team mode), falling back to .bicameral/code-graph.db (solo mode). Previously the eval always used the solo path, loading a near-empty BM25 index (1 doc) when the repo was configured in team mode (17k docs in local/).

Multi-region recall@files baseline remains 0%. This is a real quality gap (regex-token seeding matches wrong symbols from NL descriptions), not a tooling bug. Documents the F1 follow-up (LLM-extracted symbol seeding) as the fix path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
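The path fallback described above can be sketched as follows. This is only an illustration of the lookup order: the helper name `resolve_code_graph_db` is hypothetical, and the real get_adapter() presumably does more than choose a path.

```python
import tempfile
from pathlib import Path


def resolve_code_graph_db(repo_root):
    """Prefer the team-mode DB if it exists, otherwise use the solo-mode path."""
    root = Path(repo_root)
    team_db = root / ".bicameral" / "local" / "code-graph.db"
    solo_db = root / ".bicameral" / "code-graph.db"
    return team_db if team_db.exists() else solo_db


# Demonstrate both branches in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    solo_choice = resolve_code_graph_db(tmp)  # no team DB yet, so solo path
    team_file = Path(tmp) / ".bicameral" / "local" / "code-graph.db"
    team_file.parent.mkdir(parents=True)
    team_file.touch()
    team_choice = resolve_code_graph_db(tmp)  # team DB now exists

print(solo_choice.parts[-2:], team_choice.parts[-3:])
```

Note that the sketch returns the solo path even when that file is also missing, leaving the caller to handle an absent index.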

coderabbitai · 2026-04-16T02:17:08Z

📝 Walkthrough

The changes update the code locator evaluation system to support multi-region decisions by adding new metrics (recall_at_files, file_cardinality, multi_region), updating database path selection logic with fallback handling, and extending test fixtures with five multi-region decision cases for comprehensive testing.

Changes

Cohort / File(s)	Summary
Evaluation Metrics & Logic `tests/eval_code_locator.py`	Updates `get_adapter()` with fallback DB path selection (`./.bicameral/local/code-graph.db` → `./.bicameral/code-graph.db`). Adds per-decision metrics during evaluation: `recall_at_files`, `file_cardinality`, and `multi_region`. Extends verbose output to annotate multi-region decisions and display new metrics. Aggregate reporting now includes `avg_recall_at_files`, `multi_region_count`, `multi_region_recall_at_files`, and `file_cardinality_distribution`.
Test Fixtures `tests/fixtures/expected/decisions.py`	Introduces `BICAMERAL_MULTI_REGION` fixture list with five FC-2 multi-region decision entries (ingest/grounding, multi-channel retrieval with RRF fusion, drift detection, team dual-write/event materialization, coverage-tier threshold broadening). Updates `ALL_DECISIONS` to include new multi-region fixtures and adds `MULTI_REGION` as filtered subset.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

decision grounding reuse + coverage loop #5: The evaluation changes directly extend the grounding-focused functionality introduced in PR #5, adding region-grounded metrics and multi-region decision support.

Suggested reviewers

jinhongkuan

Poem

🐰 Hops through regions, metrics bloom,
Recall at files clears the gloom,
Multi-region dreams take flight,
Fixtures shine with test-case light,
Fallback paths, so smooth and spry,
Evaluation reaches for the sky! 🌟

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Silong/fc2 eval multi region' is vague and lacks clarity. It uses branch-naming conventions as a title and doesn't clearly convey the main purpose of the changes (adding multi-region evaluation metrics and fixtures).	Consider revising the title to be more descriptive, e.g., 'Add multi-region evaluation metrics and fixtures' or 'Implement multi-region decision support in code locator evaluation'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/fixtures/expected/decisions.py (1)
499-509: ⚠️ Potential issue | 🟠 Major

Keep the FC-2 fixtures out of the global registry.

tests/eval_code_locator.py only repo-filters when more than one repo is passed. With this addition, any single-repo run against a non-bicameral checkout will now execute these bicameral-only cases and skew the reported metrics. Prefer wiring BICAMERAL_MULTI_REGION through an explicit FC-2 subset, or fix the caller to always repo-filter.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/expected/decisions.py` around lines 499 - 509, ALL_DECISIONS
currently includes BICAMERAL_MULTI_REGION which causes bicameral-only FC-2 cases
to be run for single-repo test runs; remove BICAMERAL_MULTI_REGION from the
ALL_DECISIONS tuple and instead expose it only via an explicit FC-2 subset
(e.g., create FC2_BICAMERAL_DECISIONS or add BICAMERAL_MULTI_REGION to an
existing FC2_DECISIONS collection) so that callers can opt-in to bicameral
cases; update any test harness assembly that builds the FC-2 test set to include
this new subset when appropriate (refer to the ALL_DECISIONS and
BICAMERAL_MULTI_REGION symbols to locate the change).
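The second option in this comment (always repo-filter in the caller) could look roughly like the sketch below. It assumes each fixture entry carries a `repo` key; the PR does not show the fixture schema, so the key name, the helper `select_decisions`, and the sample entries are all hypothetical.

```python
def select_decisions(all_decisions, repos):
    """Keep only decisions whose repo is among the requested repos.

    The filter runs even when a single repo is passed, so bicameral-only
    cases never execute against an unrelated checkout.
    """
    wanted = set(repos)
    return [d for d in all_decisions if d.get("repo") in wanted]


# Hypothetical fixture entries, for illustration only.
decisions = [
    {"repo": "bicameral", "description": "drift detection", "multi_region": True},
    {"repo": "other-repo", "description": "unrelated single-file case"},
]

single_repo_run = select_decisions(decisions, ["other-repo"])
print([d["description"] for d in single_repo_run])
```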



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1327be0c-935b-44ad-b600-1afd00a3d248

📥 Commits

Reviewing files that changed from the base of the PR and between cf37eaf and 2e9c47f.

📒 Files selected for processing (2)

tests/eval_code_locator.py
tests/fixtures/expected/decisions.py

coderabbitai · 2026-04-16T02:21:34Z

+    # recall@files (multi-region only)
+    mr_results = [r for r in results if r.get("multi_region")]
+    mr_count = len(mr_results)
+    mr_recall_at_files = (
+        sum(r.get("recall_at_files", 0) for r in mr_results) / mr_count
+        if mr_count else 0
+    )


⚠️ Potential issue | 🟠 Major

Count errored multi-region cases in these aggregates.

mr_results is filtered on r.get("multi_region"), but the exception branch only appends precision/recall/mrr. If ground_mappings() raises on an FC-2 case, it drops out of multi_region_count and multi_region_recall_at_files instead of contributing zero.

Proposed fix

     except Exception as e:
         results.append({
             "description": d["description"][:80],
             "query": query,
             "error": str(e),
             "precision": 0,
             "recall": 0,
             "mrr": 0,
+            "recall_at_files": 0,
+            "file_cardinality": 0,
+            "multi_region": d.get("multi_region", False),
         })
         continue

Also applies to: 261-263

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/eval_code_locator.py` around lines 231 - 237, the multi-region aggregates drop errored FC-2 cases because mr_results is built only from entries where r.get("multi_region") is truthy and the exception branch doesn't append a multi_region entry. Update the exception handling in the block that builds results (the except branch that currently appends only precision/recall/mrr) so that, when ground_mappings() raises, the appended result dict also carries "multi_region" copied from the decision (d.get("multi_region", False)) and zeros for "recall_at_files" and "file_cardinality". That way mr_results, mr_count and mr_recall_at_files correctly include errored multi-region cases. Apply the same change to the analogous aggregation block referenced in the comment for lines 261-263.
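To see why the missing flag matters, the sketch below re-implements the aggregate from the diff as a standalone function (the function name and sample result dicts are hypothetical) and compares an untagged error entry with a tagged one.

```python
def multi_region_recall_at_files(results):
    """Mean recall_at_files over results flagged multi_region (0 if none)."""
    mr_results = [r for r in results if r.get("multi_region")]
    mr_count = len(mr_results)
    return (
        sum(r.get("recall_at_files", 0) for r in mr_results) / mr_count
        if mr_count
        else 0
    )


# One successful multi-region case and one case where grounding raised.
ok = {"multi_region": True, "recall_at_files": 1.0}
error_untagged = {"error": "boom", "precision": 0, "recall": 0, "mrr": 0}
error_tagged = {**error_untagged, "recall_at_files": 0, "multi_region": True}

before = multi_region_recall_at_files([ok, error_untagged])  # error silently dropped
after = multi_region_recall_at_files([ok, error_tagged])  # error counted as zero
print(before, after)
```

Without the flag, the failed case vanishes from the denominator and the multi-region average looks better than it is.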

…hook

Phase 0a: Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator + _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.

Phase 0: Promote cli/branch_scan.py:_invoke_link_commit to shared cli/_link_commit_runner.py module. Pure refactor under existing test_branch_scan_cli.py coverage.

Phase 1: Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC): JSON-to-stdout default, --quiet flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py: subparser registration in _register_subparsers + dispatch branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests): argparse defaults, output shape, --quiet, no-ledger graceful skip, handler-exception graceful skip.

Phase 2: Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to ${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a one-line summary on stderr, always exits 0. `>` truncates the file on each run so successful commits auto-clear stale errors. F-2 remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests): smoke that walks every bicameral-mcp <cmd> in installed hooks and asserts CLI registration + dispatch coverage. Original #124 bug class is now caught at PR time.

Phase 3: CHANGELOG [Unreleased] Fixed entry.

Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format + mypy clean. Manual smoke: link_commit --help renders.

Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation sealed at META_LEDGER #22 (chain e83d674c).

Closes #124.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
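The Phase 1 behaviour (JSON to stdout by default, a --quiet flag, and an exit code of 0 even when the handler fails) might be sketched as below. This is not the code in cli/link_commit_cli.py: the function names, the handler callable, and the shape of the skip payload are assumptions made for illustration.

```python
import argparse
import io
import json
import sys


def build_parser():
    """Register a link_commit subcommand with a --quiet flag."""
    parser = argparse.ArgumentParser(prog="bicameral-mcp")
    sub = parser.add_subparsers(dest="command")
    link = sub.add_parser("link_commit")
    link.add_argument("--quiet", action="store_true")
    return parser


def run_link_commit(handler, quiet=False, out=None):
    """Run the handler, print JSON unless quiet, and always return 0."""
    out = out if out is not None else sys.stdout
    try:
        payload = handler()
    except Exception as exc:  # graceful skip: a hook must never block the commit
        payload = {"skipped": True, "reason": str(exc)}  # hypothetical shape
    if not quiet:
        print(json.dumps(payload), file=out)
    return 0


def failing_handler():
    raise RuntimeError("no ledger")


args = build_parser().parse_args(["link_commit", "--quiet"])
buffer = io.StringIO()
exit_code = run_link_commit(failing_handler, quiet=False, out=buffer)
print(args.command, args.quiet, exit_code, buffer.getvalue().strip())
```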

Reality matches Promise. All 8 files (5 new + 4 modified - 1 plan) land per v2 plan; 9 new tests + 11 regression = 20 passed, 1 skipped; ruff/format/mypy clean; manual smoke confirms link_commit subcommand registers and renders correctly.

Plan: plan-124-post-commit-hook-fix.md (v2 PASS @ 44c6568)
Audit: META_LEDGER #21 (chain hash 86225d49)
Implementation: META_LEDGER #22 (chain hash e83d674c)
Merkle seal: 950f362cb700da5a4db85c545f6b55bb725502a5744bfbb2c2eb3a9c9728661a

Closes #124 silent-failure regression.

Defense-in-depth: the fix itself, the cli_main decomposition (so the next subcommand addition doesn't hit the same wall), the hook-command-registration smoke test (catches this bug class at PR time), and the loud-but-non-blocking hook (next regression surfaces immediately).

Capability shortfalls: gate artifacts, reliability sweep, version bump all skipped (qor/ runtime helpers absent on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hook

Phase 0a: Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator + _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.

Phase 0: Promote cli/branch_scan.py:_invoke_link_commit to shared cli/_link_commit_runner.py module. Pure refactor under existing test_branch_scan_cli.py coverage.

Phase 1: Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC): JSON-to-stdout default, --quiet flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py: subparser registration in _register_subparsers + dispatch branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests): argparse defaults, output shape, --quiet, no-ledger graceful skip, handler-exception graceful skip.

Phase 2: Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to ${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a one-line summary on stderr, always exits 0. `>` truncates the file on each run so successful commits auto-clear stale errors. F-2 remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests): smoke that walks every bicameral-mcp <cmd> in installed hooks and asserts CLI registration + dispatch coverage. Original #124 bug class is now caught at PR time.

Phase 3: CHANGELOG [Unreleased] Fixed entry.

Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format + mypy clean. Manual smoke: link_commit --help renders.

Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation sealed at META_LEDGER #22 (chain e83d674c).

Closes #124.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

(cherry picked from commit 431e202)

Adaptation: server.py: kept dev's _register_subparsers / _dispatch helper extraction (#124 phase 0a refactor) so test_hook_command_registration.py introspection works; omitted dev's branch-scan subparser registration and the setup --with-push-hook flag (both are #48 prerequisites missing on triage)

Adaptation: server.py:_dispatch: kept dev's setup → branch-scan → link_commit dispatch chain shape; dropped branch-scan dispatch case (cli/branch_scan.py is a missing prerequisite from #48 on triage); kept link_commit dispatch case (the actual #124 fix)

Adaptation: tests/test_hook_command_registration.py: dropped _GIT_PRE_PUSH_HOOK import + the test_pre_push_hook_command_is_registered test (pre-push hook is from #48, not on triage); test_all_hook_commands_have_dispatch_branches scoped to _GIT_POST_COMMIT_HOOK only; test_post_commit_hook_command_is_registered (the canonical #124 regression test) is preserved

Skip: cli/branch_scan.py: kept triage's prior absence of this file (added by #48); the cherry-pick wanted to refactor it

Skip: docs/META_LEDGER.md: kept triage's HEAD chain state; e6d4b8f's META_LEDGER #21/#23 entries are dev's chain, not triage's

Skip: CHANGELOG.md: kept triage's HEAD; v0.X.Y triage release narrative goes in PR #128 per DEV_CYCLE §10.5.4
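The idea behind the hook-command-registration smoke test can be sketched as follows. The hook text, the regular expression, and the list of registered subcommands are all hypothetical stand-ins; the real test inspects _GIT_POST_COMMIT_HOOK and the parser built in server.py, neither of which is shown in this thread.

```python
import argparse
import re

# Hypothetical hook body, loosely modelled on the behaviour described above.
HOOK_SCRIPT = """#!/bin/sh
bicameral-mcp link_commit --quiet 2>"${HOME}/.bicameral/hook-errors.log" || true
exit 0
"""


def hook_commands(hook_text):
    """Collect every subcommand that follows 'bicameral-mcp' in a hook script."""
    return set(re.findall(r"bicameral-mcp\s+([A-Za-z_][\w-]*)", hook_text))


def registered_commands():
    """Build a stand-in parser and return the names of its subcommands."""
    parser = argparse.ArgumentParser(prog="bicameral-mcp")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("setup")
    sub.add_parser("link_commit")
    return set(sub.choices)


# Any command a hook invokes but the CLI does not register is a latent failure.
missing = hook_commands(HOOK_SCRIPT) - registered_commands()
print(sorted(missing))
```

An empty `missing` set means every command the hook calls is registered, which is the property the smoke test asserts.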

silongtan and others added 2 commits April 15, 2026 21:53

coderabbitai Bot reviewed Apr 16, 2026

silongtan merged commit 220ef86 into main Apr 16, 2026
1 check passed

silongtan deleted the silong/fc2-eval-multi-region branch April 18, 2026 22:37

Knapp-Kevin mentioned this pull request Apr 30, 2026

fix(#124): register link_commit CLI subcommand + harden post-commit hook #127

Merged

6 tasks
