Skip to content

fix: structural grounding pipeline fixes — stage reorder, eval matching, raised caps #27

Merged
silongtan merged 2 commits into
mainfrom
silong/p0-grounding-structural-fixes
Apr 18, 2026
Merged

fix: structural grounding pipeline fixes — stage reorder, eval matching, raised caps #27
silongtan merged 2 commits into
mainfrom
silong/p0-grounding-structural-fixes

Conversation

@silongtan

@silongtan silongtan commented Apr 18, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix Stage 1 dead code: RRF scores (~0.036) never passed score thresholds (0.5/0.3/0.1). Replaced with rank-based file selection, moved fuzzy-direct-lookup to
    primary path.
  • Fix eval bare-name matching: ground truth uses bare names but pipeline emits qualified names (ClassName.method). Eval now matches both.
  • Raise max_symbols from 3/5/5 to 5/8/10 across coverage tiers.
  • Fix self._bm25 AttributeError (FC-1 guard was silently broken).
  • Address Codex review: BM25 score gate (< 0.1) for no-fuzzy path + file enrichment after fuzzy direct.

Metrics (cumulative with prior fixes)

Metric Before After
MRR 0.215 0.605
Recall 4.3% 20.0%
Hit Rate 35.6% ~75%

Test plan

  • 49/49 unit + integration tests pass
  • Eval harness across 3 repos (medusa, saleor, vendure)
  • Codex review clean (2 P2s deferred — theoretical edge cases, no observed regression)

Branch silong/p0-pascal-ngram-recall (additive feature):

Title: feat: PascalCase ngram candidate generation for grounding

Body:

Summary

  • Adds _pascal_ngrams() to generate PascalCase bigrams/trigrams from adjacent description words (e.g., "payment provider" → "PaymentProvider")
  • Uses 3+ char tokens so short words like "job", "bus", "app" participate
  • 10 unit tests for the generator

Neutral on current eval set (most expected PascalCase symbols don't exist in test repos). Enables correct matching for repos that do have multi-word PascalCase class
names.

Test plan

  • 44/44 tests pass
  • Eval: MRR@5=0.605, Recall=20.0% (unchanged)
  • Codex review: 2 P2s (deferred, no regression observed)

Summary by CodeRabbit

Release Notes

  • Improvements
    • Enhanced code location retrieval accuracy and symbol matching capabilities.
    • Refined file and symbol selection strategy for more relevant results.
    • Updated evaluation metrics to better reflect system performance.

silongtan and others added 2 commits April 17, 2026 23:15
…ng, raised caps

Three bugs fixed:
1. Stage reorder: fuzzy-direct-lookup is now primary (Stage 1), file-level
   retrieval is fallback (Stage 2). Previously Stage 1 was dead code because
   RRF scores (~0.036) never passed score thresholds (0.5/0.3/0.1).
2. Eval bare-name matching: ground truth uses bare names but pipeline emits
   qualified names (ClassName.method). Eval now matches both forms.
3. Raised max_symbols from 3/5/5 to 5/8/10 across coverage tiers.

Bonus: fixed self._bm25 AttributeError (FC-1 guard was silently broken).

Eval results: Recall 13.9% → 20.0%, MRR 0.592 → 0.605, Saleor recall 5% → 31.8%.
48/48 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uzzy

Two fixes from Codex review feedback:

1. P1 #1: Filter weak BM25 scores (< 0.1) when no fuzzy matches exist.
   Prevents false groundings for under-specified queries that pass
   the FC-1 guard but have only noise-level BM25 hits.

2. P1 #2: Stage 2 now runs as enrichment after fuzzy direct lookup,
   not just as fallback. When fuzzy finds symbols from a subset of
   files, file-level retrieval fills remaining budget with additional
   files — preserving multi-file feature discovery.

49/49 tests pass. Eval metrics stable: MRR@5=0.605, Recall=20.0%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The PR restructures code locator's grounding strategy by replacing BM25 score thresholds with a max_files-based file cap. Stage 1 now performs direct fuzzy symbol-to-region resolution when matches exist; Stage 2 executes file-level fused retrieval if symbol budget permits. Tests and evaluation logic are updated accordingly.

Changes

Cohort / File(s) Summary
Core Grounding Logic
adapters/code_locator.py
Replaced BM25 threshold tiers with max_files caps; restructured two-stage grounding where Stage 1 is direct fuzzy resolution (when matched IDs exist) and Stage 2 is rank-based file-level fused retrieval with remaining symbol budget allocation. Updated _ground_single signature from bm25_threshold: float to max_files: int and adjusted tokenization call from self._bm25 to self._search_tool.bm25.
Coverage Loop Tests
tests/test_coverage_loop.py
Updated tier progression values to track max_files [2, 4, 6] instead of BM25 thresholds [0.5, 0.3, 0.1]. All _ground_single calls now pass max_files parameter instead of bm25_threshold. Added count_corpus_tokens.return_value = 5 mock configuration and refactored test assertions to emphasize file-count limiting and fuzzy fallback gating.
Evaluation Metrics
tests/eval_code_locator.py
Extended symbol relevance matching to treat "bare" symbol suffixes (text after final .) as valid matches; broadened ground-truth symbol hits to include both full and bare symbol forms for recall calculation. Increased default CLI --top-k from 3 to 5.
Multi-Region Grounding Tests
tests/test_fc2_multi_region_grounding.py
Updated fuzzy validation stub to include match_score field. Replaced graph-channel test with fuzzy direct-lookup priority assertion. Modified test fixtures to reduce fuzzy seed IDs and re-order expected fused ranking. Updated all _ground_single invocations to use max_files=4 instead of bm25_threshold=0.3.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant Locator as Code Locator
    participant FuzzyMatcher as Fuzzy Matcher
    participant FusedRetrieval as Fused Retrieval
    participant FileRanker as File Ranker
    
    Caller->>Locator: ground_mappings(mappings)
    loop For each tier (max_files, fuzzy_thresh, max_sym)
        Locator->>Locator: Extract matched_ids from fuzzy match
        alt matched_ids present
            Note over Locator,FuzzyMatcher: Stage 1: Direct Fuzzy Resolution
            Locator->>FuzzyMatcher: Direct symbol-to-region lookup
            FuzzyMatcher-->>Locator: code_regions (fuzzy-matched)
        else No matched_ids
            Note over Locator,FusedRetrieval: Stage 1 Skipped (No fuzzy hits)
        end
        
        alt code_regions size < max_symbols
            Note over Locator,FileRanker: Stage 2: File-Level Fused Retrieval
            Locator->>FileRanker: Rank files (limited to max_files)
            FileRanker-->>Locator: Top ranked files
            Locator->>FusedRetrieval: Per-file symbol extraction (remaining budget)
            FusedRetrieval-->>Locator: Additional code_regions
        else Budget met
            Note over Locator: Stage 2 Skipped (sufficient symbols)
        end
        
        Locator->>Locator: Return final code_regions
    end
    Locator-->>Caller: Grounded mappings
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jinhongkuan

Poem

🐰 A fuzzy hop, then files align,
Two stages dance in perfect time,
The max\_files cap, a budget fair,
Ground symbols now with greatest care! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 74.19% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main structural changes: stage reordering in the grounding pipeline, evaluation matching updates, and increased file/symbol caps.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch silong/p0-grounding-structural-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_fc2_multi_region_grounding.py (1)

213-230: ⚠️ Potential issue | 🟡 Minor

Make the FC-2 tests exercise Stage 2 instead of Stage 1 only.

These assertions can pass from fuzzy direct lookup alone: Line 217 seeds one fuzzy symbol per file, and Line 257 seeds enough IDs to hit max_symbols before file-level retrieval matters. Seed fewer fuzzy IDs and assert the seeded fused call happened so regressions in Stage 2 are caught.

🧪 Proposed test adjustment
-    adapter, _ = _adapter_with_fakes(
+    adapter, call_log = _adapter_with_fakes(
         bm25_hits=bm25_hits,
         fused_hits=fused_hits,
         symbols_by_file=symbols,
-        fuzzy_symbol_ids=[1, 3, 5],  # one seed from each file
+        fuzzy_symbol_ids=[1],  # Stage 1 covers one file; Stage 2 must add the others
     )
@@
     distinct_files = {r["file_path"] for r in regions}
+    assert [symbol_ids for _, symbol_ids in call_log if symbol_ids is not None] == [[1]]
     assert len(distinct_files) >= 2, (
         f"FC-2 regression: multi-file feature collapsed to {len(distinct_files)} "
         f"file(s). Expected ≥2. Regions: {regions}"
@@
     adapter, _ = _adapter_with_fakes(
         bm25_hits=bm25_hits,
         fused_hits=bm25_hits,
         symbols_by_file=symbols,
-        fuzzy_symbol_ids=list(range(10)),
+        fuzzy_symbol_ids=[1],
     )

Also applies to: 253-269

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_fc2_multi_region_grounding.py` around lines 213 - 230, The test
currently seeds too many fuzzy_symbol_ids (e.g., [1,3,5]) so
adapter._ground_single can pass via Stage 1 fuzzy lookups; reduce the seeded
fuzzy_symbol_ids (e.g., to a single ID) in the _adapter_with_fakes call so Stage
2 (file-level fused retrieval) must run, and make the parallel change in the
other block (lines ~253-269) as well; after calling adapter._ground_single, add
an assertion that fused retrieval was used (e.g., assert the provided fused_hits
fixture was consulted or a fused-call side-effect is present) so regressions in
Stage 2 are detected. Ensure you update references to fuzzy_symbol_ids,
fused_hits, adapter._ground_single, max_symbols and fuzzy_threshold accordingly.
🧹 Nitpick comments (1)
adapters/code_locator.py (1)

282-296: Use the full remaining symbol budget when splitting across files.

remaining_budget // len(qualifying_files) can leave capacity unused, e.g. 5 remaining slots across 2 files returns at most 4 regions even when both files have enough symbols. That weakens the newly raised caps.

♻️ Proposed budget allocation tweak
-                per_file_budget = max(1, remaining_budget // len(qualifying_files))
+                base_budget, extra_budget = divmod(remaining_budget, len(qualifying_files))
                 try:
-                    for fp in qualifying_files:
+                    for idx, fp in enumerate(qualifying_files):
                         file_symbols = db.lookup_by_file(fp)
                         if not file_symbols:
                             continue
@@
                         ranked = sorted(
                             file_symbols,
                             key=lambda r: (r["id"] not in relevant_ids, -matched_scores.get(r["id"], 0)),
                         )
-                        for row in ranked[:per_file_budget]:
+                        file_budget = max(1, base_budget + (1 if idx < extra_budget else 0))
+                        for row in ranked[:file_budget]:
                             code_regions.append({
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/code_locator.py` around lines 282 - 296, The per-file allocation
using per_file_budget = max(1, remaining_budget // len(qualifying_files)) can
leave unused slots when remaining_budget % len(qualifying_files) != 0; change
the allocation loop so you compute base, remainder = divmod(remaining_budget,
len(qualifying_files)) and give base + 1 to the first `remainder` files and base
to the rest (still min 1 as needed), then proceed to fetch `file_symbols` via
db.lookup_by_file(fp), compute `relevant_ids`, build `ranked`, and take up to
the per-file quota for each file (ensuring you never exceed `max_symbols` or
double-count `code_regions`).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapters/code_locator.py`:
- Around line 260-274: The fused search fallback leaves matched_ids truthy so
the low-score gate in the loop (if not matched_ids and h.get("score", 0) < 0.1)
is skipped; fix by clearing the matched_ids flag when you fall back to
BM25-only: inside the except block that sets fused = hits (around the
search_code call and fused assignment), also set matched_ids = [] (or None) so
the subsequent loop treating fused as BM25-only will run the low-score filter;
reference symbols: search_code, fused, hits, matched_ids, and the loop that
inspects h.get("score", 0).

In `@tests/eval_code_locator.py`:
- Line 280: The default value for the "--top-k" CLI argument was changed from 3
to 5, which will break regression comparisons and existing MRR/hit-rate gates;
revert the default in the parser.add_argument call for "--top-k" back to 3 so
existing tests and thresholds remain stable, ensuring the signature
parser.add_argument("--top-k", type=int, default=3, help="Top-K for
precision/MRR") is restored and run any CI tests that depend on default
MRR/hit-rate thresholds to confirm no regressions.

---

Outside diff comments:
In `@tests/test_fc2_multi_region_grounding.py`:
- Around line 213-230: The test currently seeds too many fuzzy_symbol_ids (e.g.,
[1,3,5]) so adapter._ground_single can pass via Stage 1 fuzzy lookups; reduce
the seeded fuzzy_symbol_ids (e.g., to a single ID) in the _adapter_with_fakes
call so Stage 2 (file-level fused retrieval) must run, and make the parallel
change in the other block (lines ~253-269) as well; after calling
adapter._ground_single, add an assertion that fused retrieval was used (e.g.,
assert the provided fused_hits fixture was consulted or a fused-call side-effect
is present) so regressions in Stage 2 are detected. Ensure you update references
to fuzzy_symbol_ids, fused_hits, adapter._ground_single, max_symbols and
fuzzy_threshold accordingly.

---

Nitpick comments:
In `@adapters/code_locator.py`:
- Around line 282-296: The per-file allocation using per_file_budget = max(1,
remaining_budget // len(qualifying_files)) can leave unused slots when
remaining_budget % len(qualifying_files) != 0; change the allocation loop so you
compute base, remainder = divmod(remaining_budget, len(qualifying_files)) and
give base + 1 to the first `remainder` files and base to the rest (still min 1
as needed), then proceed to fetch `file_symbols` via db.lookup_by_file(fp),
compute `relevant_ids`, build `ranked`, and take up to the per-file quota for
each file (ensuring you never exceed `max_symbols` or double-count
`code_regions`).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7fb7fbe4-115a-47bc-b3c3-1c840f740075

📥 Commits

Reviewing files that changed from the base of the PR and between f649280 and b7c86c6.

📒 Files selected for processing (4)
  • adapters/code_locator.py
  • tests/eval_code_locator.py
  • tests/test_coverage_loop.py
  • tests/test_fc2_multi_region_grounding.py

Comment thread adapters/code_locator.py
Comment on lines +260 to +274
try:
if matched_ids:
fused = self.search_code(description, symbol_ids=sorted(matched_ids))
else:
fused = hits
except Exception as exc:
logger.debug("[ground] fused search failed for '%s': %s — falling back to BM25-only", description[:60], exc)
fused = hits

covered_files = {r["file_path"] for r in code_regions}
qualifying_files: list[str] = []
seen_files: set[str] = set(covered_files)
for h in fused:
if not matched_ids and h.get("score", 0) < 0.1:
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Apply the low-score gate when fused search falls back to BM25 hits.

If matched_ids is truthy but seeded search_code(..., symbol_ids=...) fails, Line 267 falls back to BM25-only hits; Line 273 then skips the < 0.1 gate because matched_ids is still truthy. That can reintroduce the low-score noise this PR is trying to suppress.

🐛 Proposed fix
-            try:
+            bm25_only_fallback = False
+            try:
                 if matched_ids:
                     fused = self.search_code(description, symbol_ids=sorted(matched_ids))
                 else:
                     fused = hits
+                    bm25_only_fallback = True
             except Exception as exc:
                 logger.debug("[ground] fused search failed for '%s': %s — falling back to BM25-only", description[:60], exc)
                 fused = hits
+                bm25_only_fallback = True
 
             covered_files = {r["file_path"] for r in code_regions}
             qualifying_files: list[str] = []
             seen_files: set[str] = set(covered_files)
             for h in fused:
-                if not matched_ids and h.get("score", 0) < 0.1:
+                if bm25_only_fallback and h.get("score", 0) < 0.1:
                     continue
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 265-265: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/code_locator.py` around lines 260 - 274, The fused search fallback
leaves matched_ids truthy so the low-score gate in the loop (if not matched_ids
and h.get("score", 0) < 0.1) is skipped; fix by clearing the matched_ids flag
when you fall back to BM25-only: inside the except block that sets fused = hits
(around the search_code call and fused assignment), also set matched_ids = []
(or None) so the subsequent loop treating fused as BM25-only will run the
low-score filter; reference symbols: search_code, fused, hits, matched_ids, and
the loop that inspects h.get("score", 0).

parser.add_argument("--multi-repo", type=str, default=None,
help='JSON map of repo_name→path, e.g. \'{"medusa":"/path/to/medusa"}\'')
parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR")
parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Keep the default Top-K stable for regression comparisons.

Changing the default from 3 to 5 makes default MRR/hit-rate runs incomparable with prior baselines and can loosen existing --min-mrr gates unless every caller updates thresholds intentionally.

🧪 Proposed fix
-    parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR")
+    parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
parser.add_argument("--top-k", type=int, default=5, help="Top-K for precision/MRR")
parser.add_argument("--top-k", type=int, default=3, help="Top-K for precision/MRR")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval_code_locator.py` at line 280, The default value for the "--top-k"
CLI argument was changed from 3 to 5, which will break regression comparisons
and existing MRR/hit-rate gates; revert the default in the parser.add_argument
call for "--top-k" back to 3 so existing tests and thresholds remain stable,
ensuring the signature parser.add_argument("--top-k", type=int, default=3,
help="Top-K for precision/MRR") is restored and run any CI tests that depend on
default MRR/hit-rate thresholds to confirm no regressions.

@silongtan silongtan merged commit f06b5df into main Apr 18, 2026
1 check passed
Knapp-Kevin added a commit that referenced this pull request May 3, 2026
QorLogic SDLC governance trail for the Priority C v0 implementation
that landed in commits 1-4 of this PR.

Includes:

- docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) —
  research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION
  (treating v0 agent-fetches-only code state as product principle); v2
  added playbook substrate; v3 narrowed to Slack-first + team-server
  + CocoIndex-conditional after operator dialogue clarified "no managed
  backend" = "no human-ops-tax architecture," not "no backend."
- plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan
  with five phases (Phase 5 deferred per "if we can manage it"
  feasibility caveat).
- docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the
  framing-error pattern AND the "anti-goals must be parsed by their
  load-bearing keyword" lesson; symmetric to v0-code-as-principle.
- docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL).
  Predecessor: efd0304b (#135-triage seal on dev).
  Implement chain: 211ffb9e.
  Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2
- docs/SYSTEM_STATE.md — Priority C v0 section appended; documents
  schema additions, architectural properties achieved, audit advisory
  disposition, Phase 5 deferred state, and the qor-logic-internal
  steps skipped for downstream-project rationale.
- .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking
  advisories all addressed at implement-time.

Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136)
explicitly deferred per plan slip-independence design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant