Silong/fc2 eval multi region #22

Merged
silongtan merged 2 commits into main from silong/fc2-eval-multi-region
Apr 16, 2026
@silongtan silongtan commented Apr 16, 2026

Collaborator

Summary by CodeRabbit

  • New Features

    • Introduced file-based recall metrics and file cardinality tracking to improve code location evaluation accuracy.
    • Enhanced reporting with multi-region decision analysis, including file-pattern coverage metrics and detailed diagnostics in verbose output.
  • Tests

    • Added multi-region decision test fixtures with comprehensive evaluation scenarios.

silongtan and others added 2 commits April 15, 2026 21:53
5 cross-cutting decisions (bicameral-mcp workflows spanning 3+ files)
added to ground truth fixtures with `multi_region: True` flag.

New metrics in eval_code_locator.py:
- recall@files: fraction of expected file patterns covered by grounded regions
- file-cardinality distribution: histogram of distinct files per grounding
- multi-region breakdown in summary output (all vs multi-region-only)
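
As a rough illustration of the two new metrics, here is a minimal sketch. It assumes that expected file patterns are glob-style strings matched with `fnmatch`; the function names, patterns, and file paths below are hypothetical and are not taken from `eval_code_locator.py`.

```python
from collections import Counter
from fnmatch import fnmatch


def recall_at_files(expected_patterns, grounded_files):
    """Fraction of expected file patterns matched by at least one grounded file."""
    if not expected_patterns:
        return 0.0
    covered = sum(
        1
        for pattern in expected_patterns
        if any(fnmatch(path, pattern) for path in grounded_files)
    )
    return covered / len(expected_patterns)


def file_cardinality(grounded_files):
    """Number of distinct files touched by one grounding."""
    return len(set(grounded_files))


def cardinality_distribution(per_decision_files):
    """Histogram mapping distinct-file count to number of decisions."""
    return dict(Counter(file_cardinality(files) for files in per_decision_files))


# Hypothetical decision: three expected patterns, regions grounded in two files.
expected = ["*/ingest.py", "*/grounding.py", "*/ledger/*.py"]
grounded = ["pkg/ingest.py", "pkg/ingest.py", "pkg/ledger/store.py"]

print(recall_at_files(expected, grounded))   # 2 of 3 patterns covered
print(file_cardinality(grounded))            # 2 distinct files
```

A decision grounded in a single file can still score well on region-level recall, so the cardinality histogram is what shows whether groundings actually span several files.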

Baseline on bicameral monorepo: 0% multi-region recall@files (expected —
NL descriptions compete with demo/experiment files in the full index).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

get_adapter() now checks .bicameral/local/code-graph.db first (team mode),
falling back to .bicameral/code-graph.db (solo mode). Previously the eval
always used the solo path, loading a near-empty BM25 index (1 doc) when
the repo was configured in team mode (17k docs in local/).
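
The lookup order described above could be sketched roughly as follows. The helper name `resolve_code_graph_db` is hypothetical and the real `get_adapter()` may be structured differently; only the two paths come from this commit message.

```python
import tempfile
from pathlib import Path


def resolve_code_graph_db(repo_root):
    """Prefer the team-mode DB; fall back to the solo-mode DB (hypothetical helper)."""
    root = Path(repo_root)
    team_db = root / ".bicameral" / "local" / "code-graph.db"
    solo_db = root / ".bicameral" / "code-graph.db"
    return team_db if team_db.exists() else solo_db


with tempfile.TemporaryDirectory() as tmp:
    # No team-mode DB yet, so the solo path is chosen.
    solo_choice = resolve_code_graph_db(tmp)

    # Once the team-mode DB exists, it takes priority.
    team_db = Path(tmp) / ".bicameral" / "local" / "code-graph.db"
    team_db.parent.mkdir(parents=True)
    team_db.touch()
    team_choice = resolve_code_graph_db(tmp)

print(solo_choice.name, team_choice.parent.name)
```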

Multi-region recall@files baseline remains 0% — this is a real quality gap
(regex-token seeding matches wrong symbols from NL descriptions), not a
tooling bug. Documents the F1 follow-up (LLM-extracted symbol seeding)
as the fix path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

The changes update the code locator evaluation system to support multi-region decisions by adding new metrics (recall_at_files, file_cardinality, multi_region), updating database path selection logic with fallback handling, and extending test fixtures with five multi-region decision cases for comprehensive testing.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Evaluation Metrics & Logic: tests/eval_code_locator.py | Updates get_adapter() with fallback DB path selection (./.bicameral/local/code-graph.db → ./.bicameral/code-graph.db). Adds per-decision metrics during evaluation: recall_at_files, file_cardinality, and multi_region. Extends verbose output to annotate multi-region decisions and display new metrics. Aggregate reporting now includes avg_recall_at_files, multi_region_count, multi_region_recall_at_files, and file_cardinality_distribution. |
| Test Fixtures: tests/fixtures/expected/decisions.py | Introduces BICAMERAL_MULTI_REGION fixture list with five FC-2 multi-region decision entries (ingest/grounding, multi-channel retrieval with RRF fusion, drift detection, team dual-write/event materialization, coverage-tier threshold broadening). Updates ALL_DECISIONS to include new multi-region fixtures and adds MULTI_REGION as filtered subset. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • jinhongkuan

Poem

🐰 Hops through regions, metrics bloom,
Recall at files clears the gloom,
Multi-region dreams take flight,
Fixtures shine with test-case light,
Fallback paths, so smooth and spry,
Evaluation reaches for the sky! 🌟

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Silong/fc2 eval multi region' is vague and lacks clarity. It uses branch-naming conventions as a title and doesn't clearly convey the main purpose of the changes (adding multi-region evaluation metrics and fixtures). | Consider revising the title to be more descriptive, e.g., 'Add multi-region evaluation metrics and fixtures' or 'Implement multi-region decision support in code locator evaluation'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fixtures/expected/decisions.py (1)

499-509: ⚠️ Potential issue | 🟠 Major

Keep the FC-2 fixtures out of the global registry.

tests/eval_code_locator.py only repo-filters when more than one repo is passed. With this addition, any single-repo run against a non-bicameral checkout will now execute these bicameral-only cases and skew the reported metrics. Prefer wiring BICAMERAL_MULTI_REGION through an explicit FC-2 subset, or fix the caller to always repo-filter.
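
One way to read the "always repo-filter" option is sketched below. It assumes each decision dict carries a `repo` key; that key name and the helper `select_decisions` are hypothetical, not taken from the eval script.

```python
def select_decisions(all_decisions, repos):
    """Keep only decisions that belong to one of the repos under evaluation.

    Filtering unconditionally (even for a single repo) keeps repo-specific
    fixtures from leaking into runs against other checkouts.
    """
    wanted = set(repos)
    return [d for d in all_decisions if d.get("repo") in wanted]


# Hypothetical fixtures: one bicameral-only multi-region case, one from another repo.
decisions = [
    {"description": "ingest + grounding flow", "repo": "bicameral", "multi_region": True},
    {"description": "rate limiter", "repo": "other-repo"},
]

single_repo_run = select_decisions(decisions, ["other-repo"])
print([d["description"] for d in single_repo_run])
```

With this shape, a single-repo run against a non-bicameral checkout never sees the FC-2 cases, whether or not they stay in the global registry.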

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/expected/decisions.py` around lines 499 - 509, ALL_DECISIONS
currently includes BICAMERAL_MULTI_REGION which causes bicameral-only FC-2 cases
to be run for single-repo test runs; remove BICAMERAL_MULTI_REGION from the
ALL_DECISIONS tuple and instead expose it only via an explicit FC-2 subset
(e.g., create FC2_BICAMERAL_DECISIONS or add BICAMERAL_MULTI_REGION to an
existing FC2_DECISIONS collection) so that callers can opt-in to bicameral
cases; update any test harness assembly that builds the FC-2 test set to include
this new subset when appropriate (refer to the ALL_DECISIONS and
BICAMERAL_MULTI_REGION symbols to locate the change).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1327be0c-935b-44ad-b600-1afd00a3d248

📥 Commits

Reviewing files that changed from the base of the PR and between cf37eaf and 2e9c47f.

📒 Files selected for processing (2)
  • tests/eval_code_locator.py
  • tests/fixtures/expected/decisions.py

Comment on lines +231 to +237
# recall@files (multi-region only)
mr_results = [r for r in results if r.get("multi_region")]
mr_count = len(mr_results)
mr_recall_at_files = (
    sum(r.get("recall_at_files", 0) for r in mr_results) / mr_count
    if mr_count else 0
)


⚠️ Potential issue | 🟠 Major

Count errored multi-region cases in these aggregates.

mr_results is filtered on r.get("multi_region"), but the exception branch only appends precision/recall/mrr. If ground_mappings() raises on an FC-2 case, it drops out of multi_region_count and multi_region_recall_at_files instead of contributing zero.

Proposed fix
         except Exception as e:
             results.append({
                 "description": d["description"][:80],
                 "query": query,
                 "error": str(e),
                 "precision": 0, "recall": 0, "mrr": 0,
+                "recall_at_files": 0,
+                "file_cardinality": 0,
+                "multi_region": d.get("multi_region", False),
             })
             continue

Also applies to: 261-263
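
To make the effect concrete, here is a small worked example that reuses the aggregation from the commented lines. The wrapper function and the sample result dicts are hypothetical; only the aggregation expression and the added keys come from this review.

```python
def multi_region_aggregates(results):
    """Same aggregation as the commented block, wrapped for illustration."""
    mr_results = [r for r in results if r.get("multi_region")]
    mr_count = len(mr_results)
    mr_recall_at_files = (
        sum(r.get("recall_at_files", 0) for r in mr_results) / mr_count
        if mr_count else 0
    )
    return mr_count, mr_recall_at_files


# One successful multi-region case and one that raised during grounding.
ok = {"multi_region": True, "recall_at_files": 1.0}
errored_before = {"error": "boom", "precision": 0, "recall": 0, "mrr": 0}
errored_after = {
    **errored_before,
    "recall_at_files": 0,
    "file_cardinality": 0,
    "multi_region": True,  # stands in for d.get("multi_region", False) on an FC-2 case
}

print(multi_region_aggregates([ok, errored_before]))  # (1, 1.0): the error is silently dropped
print(multi_region_aggregates([ok, errored_after]))   # (2, 0.5): the error counts as a miss
```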

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval_code_locator.py` around lines 231 - 237, The multi-region
aggregates drop errored FC-2 cases because mr_results is built only from entries
where r.get("multi_region") is truthy and the exception branch doesn't set
"multi_region" on the result it appends; update the exception branch in the
block that builds results (the code that currently appends zeros for
precision/recall/mrr when ground_mappings() raises) to also include
"recall_at_files": 0, "file_cardinality": 0 and "multi_region":
d.get("multi_region", False), so mr_results, mr_count and mr_recall_at_files
correctly include errored multi-region cases; check whether the same change is
needed in the analogous block around lines 261-263.

@silongtan silongtan merged commit 220ef86 into main Apr 16, 2026
1 check passed
@silongtan silongtan deleted the silong/fc2-eval-multi-region branch April 18, 2026 22:37
jinhongkuan pushed a commit that referenced this pull request Apr 30, 2026
…hook

Phase 0a — Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator
+ _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.

Phase 0 — Promote cli/branch_scan.py:_invoke_link_commit to shared
cli/_link_commit_runner.py module. Pure refactor under existing
test_branch_scan_cli.py coverage.

Phase 1 — Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC) — JSON-to-stdout default, --quiet
  flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py — subparser registration in _register_subparsers + dispatch
  branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests) — argparse defaults, output
  shape, --quiet, no-ledger graceful skip, handler-exception graceful
  skip.
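
The registration/dispatch split and the "always exits 0" behaviour described in Phases 0a and 1 might look roughly like this. It is a sketch under assumptions: the handler `_run_link_commit`, the `handler` parameter, and the JSON payload shape are hypothetical, and the real `server.py` is not shown in this thread.

```python
import argparse
import json


def _run_link_commit():
    """Hypothetical stand-in for the real link_commit handler."""
    return {"status": "ok"}


def _register_subparsers(subparsers):
    """Declare every CLI subcommand in one place."""
    link = subparsers.add_parser("link_commit")
    link.add_argument("--quiet", action="store_true")


def _dispatch(args, handler=_run_link_commit):
    """Route the parsed command; never fail the calling git hook."""
    if args.command == "link_commit":
        try:
            payload = handler()
        except Exception as exc:  # graceful skip on handler error
            payload = {"status": "skipped", "reason": str(exc)}
        if not args.quiet:
            print(json.dumps(payload))
    return 0


def cli_main(argv=None, handler=_run_link_commit):
    """Thin orchestrator: build the parser, register, dispatch."""
    parser = argparse.ArgumentParser(prog="bicameral-mcp")
    subparsers = parser.add_subparsers(dest="command")
    _register_subparsers(subparsers)
    return _dispatch(parser.parse_args(argv), handler)
```

Keeping registration and dispatch in separate functions is also what lets a smoke test introspect both sides and flag a hook command that is registered but never dispatched, or the reverse.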

Phase 2 — Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to
  ${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a
  one-line summary on stderr, always exits 0. > truncates the file
  on each run so successful commits auto-clear stale errors. F-2
  remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests) — smoke that
  walks every bicameral-mcp <cmd> in installed hooks and asserts
  CLI registration + dispatch coverage. Original #124 bug class is
  now caught at PR time.

Phase 3 — CHANGELOG [Unreleased] Fixed entry.

Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format
+ mypy clean. Manual smoke: link_commit --help renders.

Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation
sealed at META_LEDGER #22 (chain e83d674c).

Closes #124.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jinhongkuan pushed a commit that referenced this pull request Apr 30, 2026
Reality matches Promise. All 8 files (5 new + 4 modified - 1 plan)
land per v2 plan; 9 new tests + 11 regression = 20 passed, 1 skipped;
ruff/format/mypy clean; manual smoke confirms link_commit subcommand
registers and renders correctly.

Plan: plan-124-post-commit-hook-fix.md (v2 PASS @ 44c6568)
Audit: META_LEDGER #21 (chain hash 86225d49)
Implementation: META_LEDGER #22 (chain hash e83d674c)
Merkle seal: 950f362cb700da5a4db85c545f6b55bb725502a5744bfbb2c2eb3a9c9728661a

Closes #124 silent-failure regression. Defense-in-depth: the fix
itself, the cli_main decomposition (so the next subcommand addition
doesn't hit the same wall), the hook-command-registration smoke test
(catches this bug class at PR time), and the loud-but-non-blocking
hook (next regression surfaces immediately).

Capability shortfalls: gate artifacts, reliability sweep, version
bump all skipped (qor/ runtime helpers absent on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jinhongkuan pushed a commit that referenced this pull request Apr 30, 2026
…hook

Phase 0a — Decompose server.py:cli_main (92 LOC → 15 LOC orchestrator
+ _register_subparsers (16 LOC) + _dispatch (29 LOC)). Razor-compliant.

Phase 0 — Promote cli/branch_scan.py:_invoke_link_commit to shared
cli/_link_commit_runner.py module. Pure refactor under existing
test_branch_scan_cli.py coverage.

Phase 1 — Register link_commit CLI subcommand:
- cli/link_commit_cli.py (29 LOC) — JSON-to-stdout default, --quiet
  flag, always exits 0 (graceful skip on no-ledger or handler error).
- server.py — subparser registration in _register_subparsers + dispatch
  branch in _dispatch.
- tests/test_link_commit_cli.py (6 tests) — argparse defaults, output
  shape, --quiet, no-ledger graceful skip, handler-exception graceful
  skip.

Phase 2 — Harden post-commit hook:
- setup_wizard.py:_GIT_POST_COMMIT_HOOK now writes stderr to
  ${HOME}/.bicameral/hook-errors.log (was /dev/null), surfaces a
  one-line summary on stderr, always exits 0. > truncates the file
  on each run so successful commits auto-clear stale errors. F-2
  remediation per audit v2.
- tests/test_hook_command_registration.py (3 tests) — smoke that
  walks every bicameral-mcp <cmd> in installed hooks and asserts
  CLI registration + dispatch coverage. Original #124 bug class is
  now caught at PR time.

Phase 3 — CHANGELOG [Unreleased] Fixed entry.

Validation: 20 passed, 1 skipped (Windows chmod). ruff check + format
+ mypy clean. Manual smoke: link_commit --help renders.

Plan v2 PASS at META_LEDGER #21 (chain 86225d49). Implementation
sealed at META_LEDGER #22 (chain e83d674c).

Closes #124.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 431e202)

Adaptation: server.py — kept dev's _register_subparsers / _dispatch helper extraction (#124 phase 0a refactor) so test_hook_command_registration.py introspection works; omitted dev's branch-scan subparser registration and the setup --with-push-hook flag (both are #48 prerequisites missing on triage)
Adaptation: server.py:_dispatch — kept dev's setup → branch-scan → link_commit dispatch chain shape; dropped branch-scan dispatch case (cli/branch_scan.py is a missing prerequisite from #48 on triage); kept link_commit dispatch case (the actual #124 fix)
Adaptation: tests/test_hook_command_registration.py — dropped _GIT_PRE_PUSH_HOOK import + the test_pre_push_hook_command_is_registered test (pre-push hook is from #48, not on triage); test_all_hook_commands_have_dispatch_branches scoped to _GIT_POST_COMMIT_HOOK only; test_post_commit_hook_command_is_registered (the canonical #124 regression test) is preserved
Skip: cli/branch_scan.py — kept triage's prior absence of this file (added by #48); the cherry-pick wanted to refactor it
Skip: docs/META_LEDGER.md — kept triage's HEAD chain state; e6d4b8f's META_LEDGER #21/#23 entries are dev's chain, not triage's
Skip: CHANGELOG.md — kept triage's HEAD; v0.X.Y triage release narrative goes in PR #128 per DEV_CYCLE §10.5.4