grounding accuracy MRR@3 0.59 → 0.79, recall 14% → 30% #28
Conversation
Two-track token extraction prevents NL-word pollution in graph seeds (e.g. "void" → Void, "return" → Return) while preserving domain words (e.g. "checkout" → Checkout). Case-form bigrams bridge vocabulary gap between NL descriptions and code identifiers. Pipeline changes: - Three-track token extraction: identifiers, domain words, case-form bigrams - Keyword blocklist filters programming words from fuzzy matching - Coverage tiers widened: (2,80,5)→(3,75,5) at Tier 0 - RRF k lowered 60→40 for sharper ranking - Symbol type priority as Stage 2 tiebreaker Ground truth corrections: - Remove 5 phantom Medusa symbols (CartCompletionStrategy, etc.) - Fix PluginManager→PluginsManager for Saleor - Fix on_commit→fulfillment_created (Django builtin → actual symbol) - Fix overly-specific Vendure file patterns Results: recall 13.9%→25.5%, variance 0.360→0.156, 49/49 tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval improvements (25.5% → 29.9% recall): - Case-insensitive symbol matching in recall computation - All-component extraction from qualified names (EventBusService.emit now adds both EventBusService and emit to found_symbols) - Case-insensitive _is_relevant() for MRR consistency Pipeline preparation: - Configurable fuzzy scorer via config (WRatio/token_set_ratio/partial_ratio) - BM25 k1/b params wired through config → bm25s constructor - Enables future grid search tuning without code changes 49/49 tests pass. MRR@3 stable at 0.786. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughConfiguration and algorithm tuning across code locator's grounding and retrieval systems: adjusted coverage-tier limits (max_files increased 2→3, 4→5, 6→7) and fuzzy thresholds, added symbol-type-priority ranking, expanded tokenization with regex-based identifier detection, exposed BM25 hyperparameters (k1, b) and RRF parameter (rrf_k) as configurable options, added pluggable fuzzy scorer selection, and made test symbol matching case-insensitive with corresponding fixture and assertion updates. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (6)
tests/eval_code_locator.py (1)
129-133: Minor consistency nit: denominator could use lowercased set for symmetry.If
expected_symbolsever contains case-variant duplicates (e.g.,{"Foo", "foo"}) — unlikely with the current fixtures — the numerator collapses after.lower()while the denominator uses the raw set cardinality, slightly underreporting recall. Usinglen(expected_lower)would make the metric consistent with the case-insensitive intent. Not a bug with the current fixture contents; flagging only for robustness.♻️ Optional tweak
- recall = matched_count / len(expected_symbols) if expected_symbols else 0 + recall = matched_count / len(expected_lower) if expected_lower else 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval_code_locator.py` around lines 129 - 133, The recall calculation uses len(expected_symbols) as the denominator while the numerator is derived from expected_lower (case-folded), which miscounts when expected_symbols has case-variant duplicates; update the recall computation to use len(expected_lower) (i.e., recall = matched_count / len(expected_lower) if expected_lower else 0) so the denominator matches the case-insensitive sets used to compute matched_count (references: expected_lower, found_lower, matched_count, recall).adapters/code_locator.py (3)
152-155: Annotate_SYMBOL_TYPE_PRIORITYasClassVar(RUF012).Ruff flags this as a mutable class attribute. The mapping is effectively constant; annotating with
ClassVardocuments that intent and silences the lint without behavior change.♻️ Proposed annotation
+from typing import ClassVar @@ - _SYMBOL_TYPE_PRIORITY = { + _SYMBOL_TYPE_PRIORITY: ClassVar[dict[str, int]] = { "class": 0, "interface": 1, "type_alias": 2, "function": 3, "method": 4, "variable": 5, }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapters/code_locator.py` around lines 152 - 155, Annotate the class-level mapping _SYMBOL_TYPE_PRIORITY with ClassVar to mark it as an immutable class attribute: add "from typing import ClassVar" to imports (or include ClassVar in the existing typing import) and change the declaration to use ClassVar[dict[str, int]] (or ClassVar[Mapping[str, int]] if Mapping is preferred) for the variable type; keep the existing mapping values and name _SYMBOL_TYPE_PRIORITY unchanged so the lint (RUF012) is satisfied without behavioral change.
208-210: Dead helper —_type_priorityis defined but never called.The sort lambda at lines 351-355 inlines
self._SYMBOL_TYPE_PRIORITY.get(r["type"], 3)directly (which is correct, since the row is already in hand — calling this helper would add an unnecessarydb.lookup_by_idper symbol). Either remove the helper or route the sort key through it; leaving it as-is invites future callers to pick the DB-fetching version by mistake.♻️ Suggested removal
- def _type_priority(self, sid: int, db) -> int: - row = db.lookup_by_id(sid) - return self._SYMBOL_TYPE_PRIORITY.get(row["type"], 3) if row else 3 -🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapters/code_locator.py` around lines 208 - 210, _remove the dead helper _type_priority which performs an unnecessary db lookup; delete the method definition and any unused imports so callers use the inline expression self._SYMBOL_TYPE_PRIORITY.get(r["type"], 3) (or, if you intended a reusable helper, replace _type_priority(sid, db) with a non-DB version like _type_priority_from_row(row) that reads row["type"]). Ensure no remaining references to _type_priority exist and adjust tests/usages accordingly.
234-267: Tokenization looks correct — one small heads-up on acronym handling.Three-track extraction with de-dup is clean. One heuristic wrinkle:
str.capitalize()lowercases the tail, so acronyms from the description —"JWT permissions"— produce"JwtPermissions"/"jwt_permissions"rather than"JWTPermissions". That's still a plausible symbol candidate but won't match true all-caps acronym-prefixed class names likeJWTHandler. If grounding acronyms becomes a recall gap in follow-up evals, consider preserving the original casing of already-upper tokens (e.g.,w if w.isupper() else w.capitalize()). Not blocking.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapters/code_locator.py` around lines 234 - 267, The pascal-case construction in the case_forms loop lowercases acronym tails via str.capitalize(), causing all-caps tokens like "JWT" to become "JwtPermissions" and miss true symbols like "JWTPermissions"; update the loop in adapters/code_locator.py (the nl_words → case_forms section) to preserve already-uppercase words when building pascal (e.g., use the original token if token.isupper() else token.capitalize()) and similarly ensure snake uses lowercased tokens for separation; keep the other heuristics and the length check intact.code_locator/retrieval/bm25s_client.py (1)
68-68: Protocol and concrete-class signatures now diverge — consider keeping them in sync.
BM25Search.index(self, repo_path, output_dir)(percode_locator/retrieval/bm25_protocol.py) no longer matches this concrete implementation, which now also acceptssymbol_db,k1, andb. Because all extras have defaults, callers using the protocol type still work, so this is cosmetic — but documenting the tuning knobs at the protocol level would help future BM25 backends (e.g. the mentioned "zoekt") honor them.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@code_locator/retrieval/bm25s_client.py` at line 68, The BM25 protocol signature and docs need to match the concrete implementation: update BM25Search.index in bm25_protocol.py to accept the same optional parameters (symbol_db=None, k1: float = 1.5, b: float = 0.75) and add a short docstring describing each tuning knob so other backends (e.g., zoekt) can implement them consistently; ensure the parameter names and default values exactly match the concrete method index in bm25s_client.py.code_locator/tools/validate_symbols.py (1)
80-85: Hoist_SCORERSto module or class scope.The dict is rebuilt on every call to
_fuzzy_match. Lifting it to a module-level constant (or class attribute) keeps the scorer resolution centralized and avoids the per-call dict allocation. Functionally fine as-is.♻️ Proposed refactor
from rapidfuzz import fuzz +_SCORERS = { + "WRatio": fuzz.WRatio, + "token_set_ratio": fuzz.token_set_ratio, + "partial_ratio": fuzz.partial_ratio, +} + # JSON Schema for tool parameter validation @@ - _SCORERS = { - "WRatio": fuzz.WRatio, - "token_set_ratio": fuzz.token_set_ratio, - "partial_ratio": fuzz.partial_ratio, - } - scorer = _SCORERS.get(self.config.fuzzy_scorer, fuzz.WRatio) + scorer = _SCORERS.get(self.config.fuzzy_scorer, fuzz.WRatio)One silent-fallback concern worth considering: an unknown/typo'd
fuzzy_scorervalue in config currently silently reverts toWRatio. Consider logging a warning so a misconfigured env var (CODE_LOCATOR_FUZZY_SCORER=token_sort_ratio) doesn't quietly run with the wrong scorer.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@code_locator/tools/validate_symbols.py` around lines 80 - 85, Hoist the _SCORERS dict out of _fuzzy_match into module scope (e.g. TOP_LEVEL _SCORERS constant) or a class attribute so it isn’t recreated per call, then have _fuzzy_match look up scorer = _SCORERS.get(self.config.fuzzy_scorer, fuzz.WRatio); additionally, detect when the get() falls back (i.e. self.config.fuzzy_scorer not a key) and emit a warning via the existing logger (or logging.getLogger(__name__)) indicating the unknown fuzzy_scorer and that WRatio is being used as a fallback; keep symbol names _SCORERS, _fuzzy_match, and self.config.fuzzy_scorer exactly as in the diff so the changes are easy to locate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@adapters/code_locator.py`:
- Around line 152-155: Annotate the class-level mapping _SYMBOL_TYPE_PRIORITY
with ClassVar to mark it as an immutable class attribute: add "from typing
import ClassVar" to imports (or include ClassVar in the existing typing import)
and change the declaration to use ClassVar[dict[str, int]] (or
ClassVar[Mapping[str, int]] if Mapping is preferred) for the variable type; keep
the existing mapping values and name _SYMBOL_TYPE_PRIORITY unchanged so the lint
(RUF012) is satisfied without behavioral change.
- Around line 208-210: _remove the dead helper _type_priority which performs an
unnecessary db lookup; delete the method definition and any unused imports so
callers use the inline expression self._SYMBOL_TYPE_PRIORITY.get(r["type"], 3)
(or, if you intended a reusable helper, replace _type_priority(sid, db) with a
non-DB version like _type_priority_from_row(row) that reads row["type"]). Ensure
no remaining references to _type_priority exist and adjust tests/usages
accordingly.
- Around line 234-267: The pascal-case construction in the case_forms loop
lowercases acronym tails via str.capitalize(), causing all-caps tokens like
"JWT" to become "JwtPermissions" and miss true symbols like "JWTPermissions";
update the loop in adapters/code_locator.py (the nl_words → case_forms section)
to preserve already-uppercase words when building pascal (e.g., use the original
token if token.isupper() else token.capitalize()) and similarly ensure snake
uses lowercased tokens for separation; keep the other heuristics and the length
check intact.
In `@code_locator/retrieval/bm25s_client.py`:
- Line 68: The BM25 protocol signature and docs need to match the concrete
implementation: update BM25Search.index in bm25_protocol.py to accept the same
optional parameters (symbol_db=None, k1: float = 1.5, b: float = 0.75) and add a
short docstring describing each tuning knob so other backends (e.g., zoekt) can
implement them consistently; ensure the parameter names and default values
exactly match the concrete method index in bm25s_client.py.
In `@code_locator/tools/validate_symbols.py`:
- Around line 80-85: Hoist the _SCORERS dict out of _fuzzy_match into module
scope (e.g. TOP_LEVEL _SCORERS constant) or a class attribute so it isn’t
recreated per call, then have _fuzzy_match look up scorer =
_SCORERS.get(self.config.fuzzy_scorer, fuzz.WRatio); additionally, detect when
the get() falls back (i.e. self.config.fuzzy_scorer not a key) and emit a
warning via the existing logger (or logging.getLogger(__name__)) indicating the
unknown fuzzy_scorer and that WRatio is being used as a fallback; keep symbol
names _SCORERS, _fuzzy_match, and self.config.fuzzy_scorer exactly as in the
diff so the changes are easy to locate.
In `@tests/eval_code_locator.py`:
- Around line 129-133: The recall calculation uses len(expected_symbols) as the
denominator while the numerator is derived from expected_lower (case-folded),
which miscounts when expected_symbols has case-variant duplicates; update the
recall computation to use len(expected_lower) (i.e., recall = matched_count /
len(expected_lower) if expected_lower else 0) so the denominator matches the
case-insensitive sets used to compute matched_count (references: expected_lower,
found_lower, matched_count, recall).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6bb0a65c-4ad5-4cb6-a0b7-7154001e85d6
📒 Files selected for processing (8)
adapters/code_locator.pycode_locator/config.pycode_locator/retrieval/bm25s_client.pycode_locator/tools/validate_symbols.pycode_locator_runtime.pytests/eval_code_locator.pytests/fixtures/expected/decisions.pytests/test_coverage_loop.py
Three-round audit cycle (VETO -> VETO -> PASS) for Notion ingest + cache contract migration. Plan ships across five phases: - Phase 0 — cache contract migration (schema v1->v2, schema_version table, callable migration dispatch, upsert_canonical_extraction) - Phase 0.5 — worker-task lifecycle pattern + Slack reference wiring (closes the v0 dormant-Slack-worker gap) - Phase 1 — Notion API client + property serializer (internal- integration auth, no OAuth router) - Phase 2 — Notion ingest worker (per-database watermark, peer- authored team_event) - Phase 3 — Notion task registration on lifespan META_LEDGER entries #29-#33 capture: round-1 VETO (4 missing/ undeclared symbols), round-2 VETO (1 wrong-call-shape for decrypt_token), round-3 PASS, IMPLEMENT, and SUBSTANTIATION. SHADOW_GENOME #7 addendum extends the PARALLEL_STRUCTURE_ASSUMED detection heuristic with three new in-sketch checks: signature, type-boundary, helper-symmetry. The two VETOs in this session are the empirical justification. SYSTEM_STATE.md adds the Priority C v1 section: schema state (v2), architectural properties achieved, audit cycle outcomes, implementation deviations from plan. Merkle seal: SHA256(content_hash + previous_hash) = dcb619104e6d88b97a04689093b80b9f03825f9a24bac3c3b9ab3d0107ff24d7 (content_hash 9f003c40..., previous_hash 6f4f8f8f... = Priority C v0 SEAL at Entry #28). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QorLogic SDLC governance trail for the Priority C v0 implementation that landed in commits 1-4 of this PR. Includes: - docs/research-brief-priority-c-selective-ingest-2026-05-02.md (v3) — research substrate. v1 was rejected for INVARIANT_FROM_IMPLEMENTATION (treating v0 agent-fetches-only code state as product principle); v2 added playbook substrate; v3 narrowed to Slack-first + team-server + CocoIndex-conditional after operator dialogue clarified "no managed backend" = "no human-ops-tax architecture," not "no backend." - plan-priority-c-team-server-slack-v0.md (437 LOC) — the L3 plan with five phases (Phase 5 deferred per "if we can manage it" feasibility caveat). - docs/SHADOW_GENOME.md Failure Entry #6 + addendum — captures the framing-error pattern AND the "anti-goals must be parsed by their load-bearing keyword" lesson; symmetric to v0-code-as-principle. - docs/META_LEDGER.md Entries #27 (IMPLEMENT) + #28 (SEAL). Predecessor: efd0304b (#135-triage seal on dev). Implement chain: 211ffb9e. Substantiation seal: 6f4f8f8f1d63ad82b952a3c6aff270d30584e08b0572077ff685e84ce453f6c2 - docs/SYSTEM_STATE.md — Priority C v0 section appended; documents schema additions, architectural properties achieved, audit advisory disposition, Phase 5 deferred state, and the qor-logic-internal steps skipped for downstream-project rationale. - .agent/staging/AUDIT_REPORT.md — PASS verdict, three non-blocking advisories all addressed at implement-time. Verdict: REALITY = PROMISE for Phases 1-4. Phase 5 (CocoIndex #136) explicitly deferred per plan slip-independence design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three-round audit cycle (VETO -> VETO -> PASS) for Notion ingest + cache contract migration. Plan ships across five phases: - Phase 0 — cache contract migration (schema v1->v2, schema_version table, callable migration dispatch, upsert_canonical_extraction) - Phase 0.5 — worker-task lifecycle pattern + Slack reference wiring (closes the v0 dormant-Slack-worker gap) - Phase 1 — Notion API client + property serializer (internal- integration auth, no OAuth router) - Phase 2 — Notion ingest worker (per-database watermark, peer- authored team_event) - Phase 3 — Notion task registration on lifespan META_LEDGER entries #29-#33 capture: round-1 VETO (4 missing/ undeclared symbols), round-2 VETO (1 wrong-call-shape for decrypt_token), round-3 PASS, IMPLEMENT, and SUBSTANTIATION. SHADOW_GENOME #7 addendum extends the PARALLEL_STRUCTURE_ASSUMED detection heuristic with three new in-sketch checks: signature, type-boundary, helper-symmetry. The two VETOs in this session are the empirical justification. SYSTEM_STATE.md adds the Priority C v1 section: schema state (v2), architectural properties achieved, audit cycle outcomes, implementation deviations from plan. Merkle seal: SHA256(content_hash + previous_hash) = dcb619104e6d88b97a04689093b80b9f03825f9a24bac3c3b9ab3d0107ff24d7 (content_hash 9f003c40..., previous_hash 6f4f8f8f... = Priority C v0 SEAL at Entry #28). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Changes
Pipeline improvements (
adapters/code_locator.py)"state"→State graph seed pollution while preserving "checkout"→Checkout
Eval metric improvements (
tests/eval_code_locator.py)Ground truth corrections (
tests/fixtures/expected/decisions.py)Config
Test plan
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes