fix: structural grounding pipeline fixes — stage reorder, eval matching, raised caps #27
Conversation
…ng, raised caps Three bugs fixed: 1. Stage reorder: fuzzy-direct-lookup is now primary (Stage 1), file-level retrieval is fallback (Stage 2). Previously Stage 1 was dead code because RRF scores (~0.036) never passed score thresholds (0.5/0.3/0.1). 2. Eval bare-name matching: ground truth uses bare names but pipeline emits qualified names (ClassName.method). Eval now matches both forms. 3. Raised max_symbols from 3/5/5 to 5/8/10 across coverage tiers. Bonus: fixed self._bm25 AttributeError (FC-1 guard was silently broken). Eval results: Recall 13.9% → 20.0%, MRR 0.592 → 0.605, Saleor recall 5% → 31.8%. 48/48 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uzzy Two fixes from Codex review feedback: 1. P1 #1: Filter weak BM25 scores (< 0.1) when no fuzzy matches exist. Prevents false groundings for under-specified queries that pass the FC-1 guard but have only noise-level BM25 hits. 2. P1 #2: Stage 2 now runs as enrichment after fuzzy direct lookup, not just as fallback. When fuzzy finds symbols from a subset of files, file-level retrieval fills remaining budget with additional files — preserving multi-file feature discovery. 49/49 tests pass. Eval metrics stable: MRR@5=0.605, Recall=20.0%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe PR restructures code locator's grounding strategy by replacing BM25 score thresholds with a Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller
participant Locator as Code Locator
participant FuzzyMatcher as Fuzzy Matcher
participant FusedRetrieval as Fused Retrieval
participant FileRanker as File Ranker
Caller->>Locator: ground_mappings(mappings)
loop For each tier (max_files, fuzzy_thresh, max_sym)
Locator->>Locator: Extract matched_ids from fuzzy match
alt matched_ids present
Note over Locator,FuzzyMatcher: Stage 1: Direct Fuzzy Resolution
Locator->>FuzzyMatcher: Direct symbol-to-region lookup
FuzzyMatcher-->>Locator: code_regions (fuzzy-matched)
else No matched_ids
Note over Locator,FusedRetrieval: Stage 1 Skipped (No fuzzy hits)
end
alt code_regions size < max_symbols
Note over Locator,FileRanker: Stage 2: File-Level Fused Retrieval
Locator->>FileRanker: Rank files (limited to max_files)
FileRanker-->>Locator: Top ranked files
Locator->>FusedRetrieval: Per-file symbol extraction (remaining budget)
FusedRetrieval-->>Locator: Additional code_regions
else Budget met
Note over Locator: Stage 2 Skipped (sufficient symbols)
end
Locator->>Locator: Return final code_regions
end
Locator-->>Caller: Grounded mappings
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_fc2_multi_region_grounding.py (1)
213-230:⚠️ Potential issue | 🟡 MinorMake the FC-2 tests exercise Stage 2 instead of Stage 1 only.
These assertions can pass from fuzzy direct lookup alone: Line 217 seeds one fuzzy symbol per file, and Line 257 seeds enough IDs to hit
max_symbolsbefore file-level retrieval matters. Seed fewer fuzzy IDs and assert the seeded fused call happened so regressions in Stage 2 are caught.🧪 Proposed test adjustment
- adapter, _ = _adapter_with_fakes( + adapter, call_log = _adapter_with_fakes( bm25_hits=bm25_hits, fused_hits=fused_hits, symbols_by_file=symbols, - fuzzy_symbol_ids=[1, 3, 5], # one seed from each file + fuzzy_symbol_ids=[1], # Stage 1 covers one file; Stage 2 must add the others ) @@ distinct_files = {r["file_path"] for r in regions} + assert [symbol_ids for _, symbol_ids in call_log if symbol_ids is not None] == [[1]] assert len(distinct_files) >= 2, ( f"FC-2 regression: multi-file feature collapsed to {len(distinct_files)} " f"file(s). Expected ≥2. Regions: {regions}" @@ adapter, _ = _adapter_with_fakes( bm25_hits=bm25_hits, fused_hits=bm25_hits, symbols_by_file=symbols, - fuzzy_symbol_ids=list(range(10)), + fuzzy_symbol_ids=[1], )Also applies to: 253-269
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_fc2_multi_region_grounding.py` around lines 213 - 230, The test currently seeds too many fuzzy_symbol_ids (e.g., [1,3,5]) so adapter._ground_single can pass via Stage 1 fuzzy lookups; reduce the seeded fuzzy_symbol_ids (e.g., to a single ID) in the _adapter_with_fakes call so Stage 2 (file-level fused retrieval) must run, and make the parallel change in the other block (lines ~253-269) as well; after calling adapter._ground_single, add an assertion that fused retrieval was used (e.g., assert the provided fused_hits fixture was consulted or a fused-call side-effect is present) so regressions in Stage 2 are detected. Ensure you update references to fuzzy_symbol_ids, fused_hits, adapter._ground_single, max_symbols and fuzzy_threshold accordingly.
🧹 Nitpick comments (1)
adapters/code_locator.py (1)
282-296: Use the full remaining symbol budget when splitting across files.
remaining_budget // len(qualifying_files)can leave capacity unused, e.g. 5 remaining slots across 2 files returns at most 4 regions even when both files have enough symbols. That weakens the newly raised caps.♻️ Proposed budget allocation tweak
- per_file_budget = max(1, remaining_budget // len(qualifying_files)) + base_budget, extra_budget = divmod(remaining_budget, len(qualifying_files)) try: - for fp in qualifying_files: + for idx, fp in enumerate(qualifying_files): file_symbols = db.lookup_by_file(fp) if not file_symbols: continue @@ ranked = sorted( file_symbols, key=lambda r: (r["id"] not in relevant_ids, -matched_scores.get(r["id"], 0)), ) - for row in ranked[:per_file_budget]: + file_budget = max(1, base_budget + (1 if idx < extra_budget else 0)) + for row in ranked[:file_budget]: code_regions.append({🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapters/code_locator.py` around lines 282 - 296, The per-file allocation using per_file_budget = max(1, remaining_budget // len(qualifying_files)) can leave unused slots when remaining_budget % len(qualifying_files) != 0; change the allocation loop so you compute base, remainder = divmod(remaining_budget, len(qualifying_files)) and give base + 1 to the first `remainder` files and base to the rest (still min 1 as needed), then proceed to fetch `file_symbols` via db.lookup_by_file(fp), compute `relevant_ids`, build `ranked`, and take up to the per-file quota for each file (ensuring you never exceed `max_symbols` or double-count `code_regions`).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@adapters/code_locator.py`:
- Around line 260-274: The fused search fallback leaves matched_ids truthy so
the low-score gate in the loop (if not matched_ids and h.get("score", 0) < 0.1)
is skipped; fix by clearing the matched_ids flag when you fall back to
BM25-only: inside the except block that sets fused = hits (around the
search_code call and fused assignment), also set matched_ids = [] (or None) so
the subsequent loop treating fused as BM25-only will run the low-score filter;
reference symbols: search_code, fused, hits, matched_ids, and the loop that
inspects h.get("score", 0).
In `@tests/eval_code_locator.py`:
- Line 280: The default value for the "--top-k" CLI argument was changed from 3
to 5, which will break regression comparisons and existing MRR/hit-rate gates;
revert the default in the parser.add_argument call for "--top-k" back to 3 so
existing tests and thresholds remain stable, ensuring the signature
parser.add_argument("--top-k", type=int, default=3, help="Top-K for
precision/MRR") is restored and run any CI tests that depend on default
MRR/hit-rate thresholds to confirm no regressions.
---
Outside diff comments:
In `@tests/test_fc2_multi_region_grounding.py`:
- Around line 213-230: The test currently seeds too many fuzzy_symbol_ids (e.g.,
[1,3,5]) so adapter._ground_single can pass via Stage 1 fuzzy lookups; reduce
the seeded fuzzy_symbol_ids (e.g., to a single ID) in the _adapter_with_fakes
call so Stage 2 (file-level fused retrieval) must run, and make the parallel
change in the other block (lines ~253-269) as well; after calling
adapter._ground_single, add an assertion that fused retrieval was used (e.g.,
assert the provided fused_hits fixture was consulted or a fused-call side-effect
is present) so regressions in Stage 2 are detected. Ensure you update references
to fuzzy_symbol_ids, fused_hits, adapter._ground_single, max_symbols and
fuzzy_threshold accordingly.
---
Nitpick comments:
In `@adapters/code_locator.py`:
- Around line 282-296: The per-file allocation using per_file_budget = max(1,
remaining_budget // len(qualifying_files)) can leave unused slots when
remaining_budget % len(qualifying_files) != 0; change the allocation loop so you
compute base, remainder = divmod(remaining_budget, len(qualifying_files)) and
give base + 1 to the first `remainder` files and base to the rest (still min 1
as needed), then proceed to fetch `file_symbols` via db.lookup_by_file(fp),
compute `relevant_ids`, build `ranked`, and take up to the per-file quota for
each file (ensuring you never exceed `max_symbols` or double-count
`code_regions`).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7fb7fbe4-115a-47bc-b3c3-1c840f740075
📒 Files selected for processing (4)
adapters/code_locator.pytests/eval_code_locator.pytests/test_coverage_loop.pytests/test_fc2_multi_region_grounding.py
| try: | ||
| if matched_ids: | ||
| fused = self.search_code(description, symbol_ids=sorted(matched_ids)) | ||
| else: | ||
| fused = hits | ||
| except Exception as exc: | ||
| logger.debug("[ground] fused search failed for '%s': %s — falling back to BM25-only", description[:60], exc) | ||
| fused = hits | ||
|
|
||
| covered_files = {r["file_path"] for r in code_regions} | ||
| qualifying_files: list[str] = [] | ||
| seen_files: set[str] = set(covered_files) | ||
| for h in fused: | ||
| if not matched_ids and h.get("score", 0) < 0.1: | ||
| continue |
There was a problem hiding this comment.
Apply the low-score gate when fused search falls back to BM25 hits.
If matched_ids is truthy but seeded search_code(..., symbol_ids=...) fails, Line 267 falls back to BM25-only hits; Line 273 then skips the < 0.1 gate because matched_ids is still truthy. That can reintroduce the low-score noise this PR is trying to suppress.
🐛 Proposed fix
- try:
+ bm25_only_fallback = False
+ try:
if matched_ids:
fused = self.search_code(description, symbol_ids=sorted(matched_ids))
else:
fused = hits
+ bm25_only_fallback = True
except Exception as exc:
logger.debug("[ground] fused search failed for '%s': %s — falling back to BM25-only", description[:60], exc)
fused = hits
+ bm25_only_fallback = True
covered_files = {r["file_path"] for r in code_regions}
qualifying_files: list[str] = []
seen_files: set[str] = set(covered_files)
for h in fused:
- if not matched_ids and h.get("score", 0) < 0.1:
+ if bm25_only_fallback and h.get("score", 0) < 0.1:
continue🧰 Tools
🪛 Ruff (0.15.10)
[warning] 265-265: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@adapters/code_locator.py` around lines 260 - 274, The fused search fallback
leaves matched_ids truthy so the low-score gate in the loop (if not matched_ids
and h.get("score", 0) < 0.1) is skipped; fix by clearing the matched_ids flag
when you fall back to BM25-only: inside the except block that sets fused = hits
(around the search_code call and fused assignment), also set matched_ids = []
(or None) so the subsequent loop treating fused as BM25-only will run the
low-score filter; reference symbols: search_code, fused, hits, matched_ids, and
the loop that inspects h.get("score", 0).
| parser.add_argument("--multi-repo", type=str, default=None, | ||
| help='JSON map of repo_name→path, e.g. \'{"medusa":"/path/to/medusa"}\'') | ||
| parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR") | ||
| parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR") |
There was a problem hiding this comment.
Keep the default Top-K stable for regression comparisons.
Changing the default from 3 to 5 makes default MRR/hit-rate runs incomparable with prior baselines and can loosen existing --min-mrr gates unless every caller updates thresholds intentionally.
🧪 Proposed fix
- parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR")
+ parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR") | |
| parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/eval_code_locator.py` at line 280, The default value for the "--top-k"
CLI argument was changed from 3 to 5, which will break regression comparisons
and existing MRR/hit-rate gates; revert the default in the parser.add_argument
call for "--top-k" back to 3 so existing tests and thresholds remain stable,
ensuring the signature parser.add_argument("--top-k", type=int, default=3,
help="Top-K for precision/MRR") is restored and run any CI tests that depend on
default MRR/hit-rate thresholds to confirm no regressions.
QorLogic SDLC governance trail for the Priority C v0 implementation that landed in commits 1-4 of this PR. Includes: - docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) — research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION (treating v0 agent-fetches-only code state as product principle); v2 added playbook substrate; v3 narrowed to Slack-first + team-server + CocoIndex-conditional after operator dialogue clarified "no managed backend" = "no human-ops-tax architecture," not "no backend." - plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan with five phases (Phase 5 deferred per "if we can manage it" feasibility caveat). - docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the framing-error pattern AND the "anti-goals must be parsed by their load-bearing keyword" lesson; symmetric to v0-code-as-principle. - docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL). Predecessor: efd0304b (#135-triage seal on dev). Implement chain: 211ffb9e. Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2 - docs/SYSTEM_STATE.md — Priority C v0 section appended; documents schema additions, architectural properties achieved, audit advisory disposition, Phase 5 deferred state, and the qor-logic-internal steps skipped for downstream-project rationale. - .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking advisories all addressed at implement-time. Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136) explicitly deferred per plan slip-independence design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
primary path.
Metrics (cumulative with prior fixes)
Test plan
Branch silong/p0-pascal-ngram-recall (additive feature):
Title: feat: PascalCase ngram candidate generation for grounding
Body:
Summary
Neutral on current eval set (most expected PascalCase symbols don't exist in test repos). Enables correct matching for repos that do have multi-word PascalCase class
names.
Test plan
Summary by CodeRabbit
Release Notes