Silong/fc2 eval multi region#22
Conversation
5 cross-cutting decisions (bicameral-mcp workflows spanning 3+ files) added to ground truth fixtures with `multi_region: True` flag. New metrics in eval_code_locator.py: - recall@files: fraction of expected file patterns covered by grounded regions - file-cardinality distribution: histogram of distinct files per grounding - multi-region breakdown in summary output (all vs multi-region-only) Baseline on bicameral monorepo: 0% multi-region recall@files (expected — NL descriptions compete with demo/experiment files in the full index). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
get_adapter() now checks .bicameral/local/code-graph.db first (team mode), falling back to .bicameral/code-graph.db (solo mode). Previously the eval always used the solo path, loading a near-empty BM25 index (1 doc) when the repo was configured in team mode (17k docs in local/). Multi-region recall@files baseline remains 0% — this is a real quality gap (regex-token seeding matches wrong symbols from NL descriptions), not a tooling bug. Documents the F1 follow-up (LLM-extracted symbol seeding) as the fix path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe changes update the code locator evaluation system to support multi-region decisions by adding new metrics ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/fixtures/expected/decisions.py (1)
499-509:⚠️ Potential issue | 🟠 MajorKeep the FC-2 fixtures out of the global registry.
tests/eval_code_locator.pyonly repo-filters when more than one repo is passed. With this addition, any single-repo run against a non-bicameral checkout will now execute these bicameral-only cases and skew the reported metrics. Prefer wiringBICAMERAL_MULTI_REGIONthrough an explicit FC-2 subset, or fix the caller to always repo-filter.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/fixtures/expected/decisions.py` around lines 499 - 509, ALL_DECISIONS currently includes BICAMERAL_MULTI_REGION which causes bicameral-only FC-2 cases to be run for single-repo test runs; remove BICAMERAL_MULTI_REGION from the ALL_DECISIONS tuple and instead expose it only via an explicit FC-2 subset (e.g., create FC2_BICAMERAL_DECISIONS or add BICAMERAL_MULTI_REGION to an existing FC2_DECISIONS collection) so that callers can opt-in to bicameral cases; update any test harness assembly that builds the FC-2 test set to include this new subset when appropriate (refer to the ALL_DECISIONS and BICAMERAL_MULTI_REGION symbols to locate the change).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/eval_code_locator.py`:
- Around line 231-237: The multi-region aggregates drop errored FC-2 cases
because mr_results is built only from entries where r.get("multi_region") is
truthy and the exception branch doesn't append a multi_region entry; update the
exception handling in the block that builds results (the code that currently
appends precision/recall/mrr on success) to also append a result dict with
"multi_region": True and zeros for "precision", "recall", "mrr" and
"recall_at_files" when ground_mappings() raises, so mr_results, mr_count and
mr_recall_at_files correctly include errored multi-region cases; apply the same
change to the analogous aggregation block that starts around where the other
multi-region aggregates are computed (the block referenced in the comment for
lines 261-263).
---
Outside diff comments:
In `@tests/fixtures/expected/decisions.py`:
- Around line 499-509: ALL_DECISIONS currently includes BICAMERAL_MULTI_REGION
which causes bicameral-only FC-2 cases to be run for single-repo test runs;
remove BICAMERAL_MULTI_REGION from the ALL_DECISIONS tuple and instead expose it
only via an explicit FC-2 subset (e.g., create FC2_BICAMERAL_DECISIONS or add
BICAMERAL_MULTI_REGION to an existing FC2_DECISIONS collection) so that callers
can opt-in to bicameral cases; update any test harness assembly that builds the
FC-2 test set to include this new subset when appropriate (refer to the
ALL_DECISIONS and BICAMERAL_MULTI_REGION symbols to locate the change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1327be0c-935b-44ad-b600-1afd00a3d248
📒 Files selected for processing (2)
tests/eval_code_locator.pytests/fixtures/expected/decisions.py
| # recall@files (multi-region only) | ||
| mr_results = [r for r in results if r.get("multi_region")] | ||
| mr_count = len(mr_results) | ||
| mr_recall_at_files = ( | ||
| sum(r.get("recall_at_files", 0) for r in mr_results) / mr_count | ||
| if mr_count else 0 | ||
| ) |
There was a problem hiding this comment.
Count errored multi-region cases in these aggregates.
mr_results is filtered on r.get("multi_region"), but the exception branch only appends precision/recall/mrr. If ground_mappings() raises on an FC-2 case, it drops out of multi_region_count and multi_region_recall_at_files instead of contributing zero.
Proposed fix
except Exception as e:
results.append({
"description": d["description"][:80],
"query": query,
"error": str(e),
"precision": 0, "recall": 0, "mrr": 0,
+ "recall_at_files": 0,
+ "file_cardinality": 0,
+ "multi_region": d.get("multi_region", False),
})
continueAlso applies to: 261-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/eval_code_locator.py` around lines 231 - 237, The multi-region
aggregates drop errored FC-2 cases because mr_results is built only from entries
where r.get("multi_region") is truthy and the exception branch doesn't append a
multi_region entry; update the exception handling in the block that builds
results (the code that currently appends precision/recall/mrr on success) to
also append a result dict with "multi_region": True and zeros for "precision",
"recall", "mrr" and "recall_at_files" when ground_mappings() raises, so
mr_results, mr_count and mr_recall_at_files correctly include errored
multi-region cases; apply the same change to the analogous aggregation block
that starts around where the other multi-region aggregates are computed (the
block referenced in the comment for lines 261-263).
…hook
Phase 0a — Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator
+ _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.
Phase 0 — Promote cli/branch_scan.py:_invoke_link_commit to shared
cli/_link_commit_runner.py module. Pure refactor under existing
test_branch_scan_cli.py coverage.
Phase 1 — Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC) — JSON-to-stdout default, --quiet
flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py — subparser registration in _register_subparsers + dispatch
branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests) — argparse defaults, output
shape, --quiet, no-ledger graceful skip, handler-exception graceful
skip.
Phase 2 — Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to
${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a
one-line summary on stderr, always exits 0. > truncates the file
on each run so successful commits auto-clear stale errors. F-2
remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests) — smoke that
walks every bicameral-mcp <cmd> in installed hooks and asserts
CLI registration + dispatch coverage. Original #124 bug class is
now caught at PR time.
Phase 3 — CHANGELOG [Unreleased] Fixed entry.
Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format
+ mypy clean. Manual smoke: link_commit --help renders.
Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation
sealed at META_LEDGER #22 (chain e83d674c).
Closes #124.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reality matches Promise. All 8 files (5 new + 4 modified - 1 plan) land per v2 plan; 9 new tests + 11 regression = 20 passed, 1 skipped; ruff/format/mypy clean; manual smoke confirms link_commit subcommand registers and renders correctly. Plan: plan-124-post-commit-hook-fix.md (v2 PASS @ 44c6568) Audit: META_LEDGER #21 (chain hash 86225d49) Implementation: META_LEDGER #22 (chain hash e83d674c) Merkle seal: 950f362cb700da5a4db85c545f6b55bb725502a5744bfbb2c2eb3a9c9728661a Closes #124 silent-failure regression. Defense-in-depth: the fix itself, the cli_main decomposition (so the next subcommand addition doesn't hit the same wall), the hook-command-registration smoke test (catches this bug class at PR time), and the loud-but-non-blocking hook (next regression surfaces immediately). Capability shortfalls: gate artifacts, reliability sweep, version bump all skipped (qor/ runtime helpers absent on this branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hook
Phase 0a — Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator
+ _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.
Phase 0 — Promote cli/branch_scan.py:_invoke_link_commit to shared
cli/_link_commit_runner.py module. Pure refactor under existing
test_branch_scan_cli.py coverage.
Phase 1 — Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC) — JSON-to-stdout default, --quiet
flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py — subparser registration in _register_subparsers + dispatch
branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests) — argparse defaults, output
shape, --quiet, no-ledger graceful skip, handler-exception graceful
skip.
Phase 2 — Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to
${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a
one-line summary on stderr, always exits 0. > truncates the file
on each run so successful commits auto-clear stale errors. F-2
remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests) — smoke that
walks every bicameral-mcp <cmd> in installed hooks and asserts
CLI registration + dispatch coverage. Original #124 bug class is
now caught at PR time.
Phase 3 — CHANGELOG [Unreleased] Fixed entry.
Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format
+ mypy clean. Manual smoke: link_commit --help renders.
Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation
sealed at META_LEDGER #22 (chain e83d674c).
Closes #124.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 431e202)
Adaptation: server.py — kept dev's _register_subparsers / _dispatch helper extraction (#124 phase 0a refactor) so test_hook_command_registration.py introspection works; omitted dev's branch-scan subparser registration and the setup --with-push-hook flag (both are #48 prerequisites missing on triage)
Adaptation: server.py:_dispatch — kept dev's setup → branch-scan → link_commit dispatch chain shape; dropped branch-scan dispatch case (cli/branch_scan.py is a missing prerequisite from #48 on triage); kept link_commit dispatch case (the actual #124 fix)
Adaptation: tests/test_hook_command_registration.py — dropped _GIT_PRE_PUSH_HOOK import + the test_pre_push_hook_command_is_registered test (pre-push hook is from #48, not on triage); test_all_hook_commands_have_dispatch_branches scoped to _GIT_POST_COMMIT_HOOK only; test_post_commit_hook_command_is_registered (the canonical #124 regression test) is preserved
Skip: cli/branch_scan.py — kept triage's prior absence of this file (added by #48); the cherry-pick wanted to refactor it
Skip: docs/META_LEDGER.md — kept triage's HEAD chain state; e6d4b8f's META_LEDGER #21/#23 entries are dev's chain, not triage's
Skip: CHANGELOG.md — kept triage's HEAD; v0.X.Y triage release narrative goes in PR #128 per DEV_CYCLE §10.5.4
…hook
Phase 0a — Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator
+ _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.
Phase 0 — Promote cli/branch_scan.py:_invoke_link_commit to shared
cli/_link_commit_runner.py module. Pure refactor under existing
test_branch_scan_cli.py coverage.
Phase 1 — Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC) — JSON-to-stdout default, --quiet
flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py — subparser registration in _register_subparsers + dispatch
branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests) — argparse defaults, output
shape, --quiet, no-ledger graceful skip, handler-exception graceful
skip.
Phase 2 — Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to
${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a
one-line summary on stderr, always exits 0. > truncates the file
on each run so successful commits auto-clear stale errors. F-2
remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests) — smoke that
walks every bicameral-mcp <cmd> in installed hooks and asserts
CLI registration + dispatch coverage. Original #124 bug class is
now caught at PR time.
Phase 3 — CHANGELOG [Unreleased] Fixed entry.
Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format
+ mypy clean. Manual smoke: link_commit --help renders.
Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation
sealed at META_LEDGER #22 (chain e83d674c).
Closes #124.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 431e202)
Adaptation: server.py — kept dev's _register_subparsers / _dispatch helper extraction (#124 phase 0a refactor) so test_hook_command_registration.py introspection works; omitted dev's branch-scan subparser registration and the setup --with-push-hook flag (both are #48 prerequisites missing on triage)
Adaptation: server.py:_dispatch — kept dev's setup → branch-scan → link_commit dispatch chain shape; dropped branch-scan dispatch case (cli/branch_scan.py is a missing prerequisite from #48 on triage); kept link_commit dispatch case (the actual #124 fix)
Adaptation: tests/test_hook_command_registration.py — dropped _GIT_PRE_PUSH_HOOK import + the test_pre_push_hook_command_is_registered test (pre-push hook is from #48, not on triage); test_all_hook_commands_have_dispatch_branches scoped to _GIT_POST_COMMIT_HOOK only; test_post_commit_hook_command_is_registered (the canonical #124 regression test) is preserved
Skip: cli/branch_scan.py — kept triage's prior absence of this file (added by #48); the cherry-pick wanted to refactor it
Skip: docs/META_LEDGER.md — kept triage's HEAD chain state; e6d4b8f's META_LEDGER #21/#23 entries are dev's chain, not triage's
Skip: CHANGELOG.md — kept triage's HEAD; v0.X.Y triage release narrative goes in PR #128 per DEV_CYCLE §10.5.4
Summary by CodeRabbit
New Features
Tests