Enhanced the hybrid search strategy with tsvector keyword matching #539
Conversation
Walkthrough

Expands the database schema and search capabilities with hybrid ts_vector + embedding functions, indexes, and policies. Updates server search to call the new Postgres hybrid search RPCs, removes the Python-side keyword merge, and adjusts reranking to fetch larger candidate pools with top_k limiting. The reset script now drops the new hybrid functions. Tests updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Server as Server (HybridSearchStrategy)
    participant DB as Postgres
    Client->>Server: search_documents_hybrid(query, match_count, filter_metadata)
    Server->>DB: RPC hybrid_search_archon_crawled_pages(query_text, match_count, filter, source_filter)
    DB-->>Server: rows[id, url, chunk_number, content, metadata, source_id, similarity, match_type]
    Server->>Server: Normalize results, log counts and match_type distribution
    Server-->>Client: List[dict]
    note over DB,Server: Keyword and vector merging now occurs in SQL via full outer join
```
```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant RAG as RagService
    participant Search as HybridSearchStrategy
    participant Reranker as RerankingStrategy
    Client->>RAG: query(match_count, rerank=True)
    RAG->>RAG: search_match_count = match_count * 5
    RAG->>Search: search_documents_hybrid(query, search_match_count, filters)
    Search-->>RAG: candidates (N ~= 5x)
    RAG->>Reranker: rerank_results(query, candidates, top_k=match_count)
    alt Rerank success
        Reranker-->>RAG: top_k results
    else Rerank failure
        RAG->>RAG: Fallback trim to match_count
    end
    RAG-->>Client: final results (<= match_count)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/search/rag_service.py (2)
143-147: Preserve stack traces for document search failures.

```diff
-        except Exception as e:
-            logger.error(f"Document search failed: {e}")
+        except Exception as e:
+            logger.error(f"Document search failed: {e}", exc_info=True)
```
399-402: Preserve stack traces for code example pipeline failures.

```diff
-        except Exception as e:
-            logger.error(f"Code example search failed: {e}")
+        except Exception as e:
+            logger.error(f"Code example search failed: {e}", exc_info=True)
```
🧹 Nitpick comments (8)
migration/add_hybrid_search_tsvector.sql (4)
29-33: Trigram indexes are heavy; consider need and creation strategy.

You’re creating GIN trigram indexes on large text columns, but the new hybrid functions don’t use trigram operators (% or ILIKE). If they’re only “nice to have,” consider deferring creation or adding them in a separate opt-in migration. For large tables, prefer CONCURRENTLY (outside a transaction) to avoid long locks.
93-99: Prefer websearch_to_tsquery and guard empty queries.

plainto_tsquery is strict and may underperform for natural queries. Also, when query_text is blank/whitespace, the text path should short-circuit to avoid scanning.
Suggested minimal change:
```diff
-    ts_rank_cd(cp.content_search_vector, plainto_tsquery('english', query_text)) AS text_sim
+    ts_rank(cp.content_search_vector, websearch_to_tsquery('english', query_text)) AS text_sim
     FROM archon_crawled_pages cp
     WHERE cp.metadata @> filter
       AND (source_filter IS NULL OR cp.source_id = source_filter)
-      AND cp.content_search_vector @@ plainto_tsquery('english', query_text)
+      AND btrim(query_text) <> ''
+      AND cp.content_search_vector @@ websearch_to_tsquery('english', query_text)
```
111-117: Similarity mixing is on different scales; expose both scores or normalize.

COALESCE(v.vector_sim, t.text_sim) mixes (1 − cosine distance) with ts_rank values — these are not comparable. This will bias ordering toward whichever scale is larger (likely vector).
Options (keep API stable):
- Add columns vector_similarity and text_rank while retaining similarity as-is for backward compatibility.
- Or normalize text_rank (e.g., ts_rank with normalization or percentile within the CTE) and compute a weighted final score.
If you want, I can propose a backward-compatible patch that adds extra columns and preserves similarity.
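To make the normalization option concrete, here is an application-side sketch: min-max scale each score family within the candidate set, then blend with explicit weights. This is illustrative Python, not the actual SQL patch; the `vector_sim`/`text_rank` keys and the 0.7/0.3 weights are assumptions.

```python
def combine_scores(rows, vector_weight=0.7, text_weight=0.3):
    """Min-max normalize each score family, then blend with explicit weights.

    Each row is a dict with optional 'vector_sim' and 'text_rank' values;
    a missing score contributes 0 to the blended similarity.
    """
    def normalizer(values):
        present = [v for v in values if v is not None]
        if not present:
            return lambda v: 0.0
        lo, hi = min(present), max(present)
        span = hi - lo
        # All-equal scores normalize to 1.0; otherwise scale into [0, 1].
        return lambda v: 0.0 if v is None else (1.0 if span == 0 else (v - lo) / span)

    norm_vec = normalizer([r.get("vector_sim") for r in rows])
    norm_txt = normalizer([r.get("text_rank") for r in rows])
    for r in rows:
        r["similarity"] = (vector_weight * norm_vec(r.get("vector_sim"))
                           + text_weight * norm_txt(r.get("text_rank")))
    return sorted(rows, key=lambda r: r["similarity"], reverse=True)
```

Because each family is scaled within the same candidate pool, neither score's raw magnitude dominates; the weights become the only tuning knob.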
62-65: Overfetch split is fixed 50/50; consider tunable distribution.

Fetching match_count from both vector and text paths may double work and then discard half. Consider parameters (e.g., vector_ratio, text_ratio) or dynamic allocation (more vector for longer queries, more text for short/keyword-y queries).
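A minimal sketch of the dynamic-allocation idea; the function name, word-count thresholds, and ratios are illustrative assumptions, not part of the migration:

```python
def split_candidate_budget(match_count, query, vector_ratio=0.5):
    """Split the per-path fetch budget between vector and text search.

    Short, keyword-y queries favor the full-text path; longer
    natural-language queries favor the embedding path.
    """
    words = len(query.split())
    if words <= 2:
        vector_ratio = 0.3   # keyword-y: lean on full-text matching
    elif words >= 8:
        vector_ratio = 0.7   # natural language: lean on embeddings
    vector_n = max(1, round(match_count * vector_ratio))
    text_n = max(1, match_count - vector_n)
    return vector_n, text_n
```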
migration/complete_setup.sql (4)
221-223: Indexes for content_search_vector and trigram: consider usage.

GIN on content_search_vector is necessary; the trigram index is optional unless you plan fuzzy/ILIKE queries. Consider deferring trigram to reduce disk/maintenance costs.
234-251: tsvector on code content+summary is solid; indexes align.

Looks good. One micro-nit: ensure summary is NOT NULL in the schema or keep the COALESCE as you did.
330-515: Hybrid functions duplicate those in add_hybrid_search_tsvector.sql; watch score mixing and query ergonomics.
- Same comments as the dedicated migration: prefer websearch_to_tsquery, guard empty query_text, and avoid mixing score scales without normalization or exposing both scores.
- Duplication is fine for “complete setup,” but keep them in sync with migrations to avoid drift.
If you want, I can send a single patch that applies the websearch_to_tsquery and empty-query guard to both functions here.
516-519: Comment says “configurable weighting” but no weights exist.

Either add vector_weight/text_weight params or adjust the comment to avoid implying tunable weights.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- migration/RESET_DB.sql (1 hunks)
- migration/add_hybrid_search_tsvector.sql (1 hunks)
- migration/complete_setup.sql (6 hunks)
- python/src/server/services/search/hybrid_search_strategy.py (6 hunks)
- python/src/server/services/search/rag_service.py (4 hunks)
- python/tests/test_rag_simple.py (0 hunks)
- python/tests/test_rag_strategies.py (0 hunks)
💤 Files with no reviewable changes (2)
- python/tests/test_rag_strategies.py
- python/tests/test_rag_simple.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
python/src/server/services/search/rag_service.py
python/src/server/services/search/hybrid_search_strategy.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.
Applied to files:
migration/complete_setup.sql
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.
Applied to files:
migration/complete_setup.sql
🧬 Code graph analysis (2)
python/src/server/services/search/rag_service.py (2)
python/src/server/services/search/reranking_strategy.py (1)
rerank_results (139-194)
python/src/server/services/search/hybrid_search_strategy.py (1)
search_code_examples_hybrid (108-194)
python/src/server/services/search/hybrid_search_strategy.py (2)
python/src/server/config/logfire_config.py (2)
safe_span (150-171)
set_attribute (177-179)
python/src/server/services/embeddings/embedding_service.py (1)
create_embedding(71-128)
🔇 Additional comments (4)
migration/add_hybrid_search_tsvector.sql (1)
9-11: Verify vector extension availability in environments running partial migrations.

This migration references vector(1536) in function signatures but doesn’t CREATE EXTENSION vector. If add_hybrid_search_tsvector.sql can run independently of the complete setup, ensure vector is installed earlier in your chain, or add a defensive check here.
migration/RESET_DB.sql (1)
136-139: Drops match the new function signatures. LGTM.

The DROP FUNCTION signatures (vector, text, int, jsonb, text) correctly match the new hybrid_search_* functions (typmod on vector is ignored here). Safe with CASCADE.
migration/complete_setup.sql (2)
18-18: Extension addition is correct. LGTM.
207-207: Generated tsvector column on content: good choice.

This keeps vectors in sync and avoids trigger maintenance.
```python
filter_json = filter_metadata or {}
source_filter = filter_json.pop("source", None) if "source" in filter_json else None
```
Avoid mutating caller’s filter_metadata (in-place pop).
filter_json = filter_metadata or {} aliases the original dict; pop mutates caller state. Copy before mutation.
```diff
-        filter_json = filter_metadata or {}
-        source_filter = filter_json.pop("source", None) if "source" in filter_json else None
+        filter_json = dict(filter_metadata) if filter_metadata else {}
+        source_filter = filter_json.pop("source", None)
```
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 54
to 56, the code aliases filter_metadata with filter_json = filter_metadata or {}
and then calls pop, which mutates the caller's dict; instead, create a shallow
copy before any mutation (e.g., filter_json = dict(filter_metadata) if
filter_metadata else {}) and then safely pop "source" from that copy so the
original filter_metadata remains unmodified.
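A minimal standalone repro of the pitfall (the helper names are hypothetical, for illustration only):

```python
def extract_source(filter_metadata):
    # Buggy: `or {}` aliases the caller's dict, so pop mutates caller state.
    filter_json = filter_metadata or {}
    return filter_json.pop("source", None), filter_json

def extract_source_safe(filter_metadata):
    # Fixed: shallow-copy before mutating, leaving the caller's dict intact.
    filter_json = dict(filter_metadata) if filter_metadata else {}
    return filter_json.pop("source", None), filter_json

caller = {"source": "docs", "lang": "en"}
extract_source(caller)
assert "source" not in caller          # the buggy version mutated the caller

caller = {"source": "docs", "lang": "en"}
src, rest = extract_source_safe(caller)
assert caller == {"source": "docs", "lang": "en"}  # caller unchanged
assert src == "docs" and rest == {"lang": "en"}
```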
```python
if not response.data:
    logger.debug("No results from hybrid search")
    return []
```
🛠️ Refactor suggestion
Log with stack trace for easier debugging.
On the exception path below:

```diff
-        except Exception as e:
-            logger.error(f"Hybrid document search failed: {e}")
+        except Exception as e:
+            logger.error(f"Hybrid document search failed: {e}", exc_info=True)
```
🤖 Prompt for AI Agents
python/src/server/services/search/hybrid_search_strategy.py lines 69-71: the
current debug log for "No results from hybrid search" should include additional
context (e.g., query/params) to aid investigation, and the exception path below
must log the full stack trace; update the no-results log to include relevant
request details and change the exception handling to use logger.exception(...)
or logger.error(..., exc_info=True) so the stack trace is captured in logs.
```python
filter_json = filter_metadata or {}
# Use source_id parameter if provided, otherwise check filter_metadata
final_source_filter = source_id
if not final_source_filter and "source" in filter_json:
    final_source_filter = filter_json.pop("source")
```
Same in-place mutation bug for code path.
Copy the dict before popping “source”.
```diff
-        filter_json = filter_metadata or {}
+        filter_json = dict(filter_metadata) if filter_metadata else {}
         # Use source_id parameter if provided, otherwise check filter_metadata
         final_source_filter = source_id
         if not final_source_filter and "source" in filter_json:
             final_source_filter = filter_json.pop("source")
```
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 138
to 143, the code currently assigns filter_json = filter_metadata or {} and then
pops "source" from it, which mutates the original filter_metadata; fix this by
making a shallow copy of filter_metadata before any mutation (e.g., filter_json
= (filter_metadata or {}).copy() or filter_json = dict(filter_metadata) if
filter_metadata else {}), then pop "source" from that copy and use the copied
dict for subsequent operations so the original input is not modified.
```python
# Call the hybrid search PostgreSQL function
response = self.supabase_client.rpc(
    "hybrid_search_archon_code_examples",
    {
        "query_embedding": query_embedding,
        "query_text": query,
        "match_count": match_count,
        "filter": filter_json,
        "source_filter": final_source_filter,
    },
).execute()
```
🛠️ Refactor suggestion
Supabase RPC calls should be retried with backoff.
Per guidelines, wrap RPC calls with limited retries and context-rich errors. I can add a small helper (tenacity-free) that retries on transient errors (HTTP 5xx/connection issues). Want me to draft it?
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
144-155, the Supabase RPC call is made without retries; implement a small retry
wrapper for self.supabase_client.rpc that retries on transient failures (HTTP
5xx responses and network/connection exceptions) with a limited number of
attempts (e.g., 3-5), exponential backoff with jitter between attempts, and a
short delay on each retry; ensure the wrapper re-raises a context-rich exception
when exhausted (including RPC name, args, attempt count and last error) and use
it to call "hybrid_search_archon_code_examples" instead of calling
rpc(...).execute() directly.
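A tenacity-free sketch of such a helper. The exception types to treat as transient depend on the Supabase client in use; ConnectionError/TimeoutError here are stand-ins, and the helper name is an assumption:

```python
import random
import time

def rpc_with_retry(call, *, rpc_name="rpc", attempts=3, base_delay=0.5, sleep=time.sleep):
    """Invoke `call()` with exponential backoff + jitter on transient errors.

    Re-raises a context-rich error (RPC name, attempt count, last error)
    once all attempts are exhausted.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except (ConnectionError, TimeoutError) as e:  # stand-in transient errors
            last_err = e
            if attempt == attempts:
                break
            # Exponential backoff with up to 25% jitter.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + 0.25 * random.random())
            sleep(delay)
    raise RuntimeError(
        f"{rpc_name} failed after {attempts} attempts: {last_err}"
    ) from last_err
```

Usage would wrap the existing call, e.g. `rpc_with_retry(lambda: self.supabase_client.rpc(...).execute(), rpc_name="hybrid_search_archon_code_examples")`.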
```python
if not response.data:
    logger.debug("No results from hybrid code search")
    return []
```
🛠️ Refactor suggestion
Add stack traces on errors; optional: guard blank query_text.
- Add exc_info=True to error logs.
- Optional: if not query.strip(): return [] early to avoid RPC with empty text.
```diff
-        except Exception as e:
-            logger.error(f"Hybrid code example search failed: {e}")
+        except Exception as e:
+            logger.error(f"Hybrid code example search failed: {e}", exc_info=True)
```

Also applies to: 189-194
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
156-158 (and similarly update the block at 189-194), the error logs lack stack
traces and the function may call RPCs with empty query text; update the
exception logging calls to include exc_info=True so stack traces are captured
(e.g., logger.error(..., exc_info=True)), and add an early guard that returns an
empty list when the incoming query_text is empty or only whitespace (e.g., if
not query_text.strip(): return []) to avoid unnecessary RPCs.
```python
# If reranking is enabled, fetch more candidates for the reranker to evaluate
# This allows the reranker to see a broader set of results
search_match_count = match_count
if use_reranking and self.reranking_strategy:
    # Fetch 5x the requested amount when reranking is enabled
    # The reranker will select the best from this larger pool
    search_match_count = match_count * 5
    logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")
```
🛠️ Refactor suggestion
Make rerank candidate pool configurable and capped.
Multiplying by 5 unconditionally can be expensive on large datasets. Read a multiplier and cap from settings to avoid runaway queries.
```diff
-        search_match_count = match_count
-        if use_reranking and self.reranking_strategy:
-            # Fetch 5x the requested amount when reranking is enabled
-            # The reranker will select the best from this larger pool
-            search_match_count = match_count * 5
-            logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")
+        search_match_count = match_count
+        if use_reranking and self.reranking_strategy:
+            # Fetch more candidates for reranking, with config + cap
+            try:
+                multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+            except Exception:
+                multiplier, cap = 5, 200
+            multiplier = max(1, multiplier)
+            search_match_count = min(match_count * multiplier, cap)
+            logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")
```
```diff
             # Pass top_k to limit results to the originally requested count
             formatted_results = await self.reranking_strategy.rerank_results(
-                query, formatted_results, content_key="content"
+                query, formatted_results, content_key="content", top_k=match_count
             )
             reranking_applied = True
-            logger.debug(f"Reranking applied to {len(formatted_results)} results")
+            logger.debug(f"Reranking applied: {search_match_count} candidates -> {len(formatted_results)} final results")
         except Exception as e:
             logger.warning(f"Reranking failed: {e}")
             reranking_applied = False
+            # If reranking fails but we fetched extra results, trim to requested count
+            if len(formatted_results) > match_count:
+                formatted_results = formatted_results[:match_count]
```
🛠️ Refactor suggestion
Preserve stack traces on reranking failures.
Add exc_info=True to aid debugging and align with logging guidelines.
```diff
-        except Exception as e:
-            logger.warning(f"Reranking failed: {e}")
+        except Exception as e:
+            logger.warning(f"Reranking failed: {e}", exc_info=True)
```
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 246 to 258, the
logger.warning call inside the reranking exception handler currently logs only
the exception message and loses the stack trace; update the call to include
exc_info=True (e.g., logger.warning(f"Reranking failed: {e}", exc_info=True)) so
the full traceback is preserved for debugging while keeping the existing logic
that sets reranking_applied=False and trims formatted_results to match_count
when necessary.
```python
# If reranking is enabled, fetch more candidates
search_match_count = match_count
if use_reranking and self.reranking_strategy:
    search_match_count = match_count * 5
    logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
```
🛠️ Refactor suggestion
Use the same configurable rerank-pool logic for code search.
Mirror the multiplier/cap approach here for consistency.
```diff
-        search_match_count = match_count
-        if use_reranking and self.reranking_strategy:
-            search_match_count = match_count * 5
-            logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
+        search_match_count = match_count
+        if use_reranking and self.reranking_strategy:
+            try:
+                multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+            except Exception:
+                multiplier, cap = 5, 200
+            multiplier = max(1, multiplier)
+            search_match_count = min(match_count * multiplier, cap)
+            logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
```
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 329 to 334, the
code for increasing search_match_count when reranking is enabled simply
multiplies match_count by 5; replace that with the same configurable rerank-pool
logic used elsewhere (use the configured multiplier and cap rather than a
hardcoded 5). Read the rerank multiplier and max/cap from the existing reranking
configuration (e.g., self.reranking_strategy or global config used elsewhere),
compute search_match_count = min(match_count * rerank_multiplier,
rerank_max_pool), and keep the debug log showing the final computed candidate
count.
```diff
                     results = await self.reranking_strategy.rerank_results(
-                        query, results, content_key="content"
+                        query, results, content_key="content", top_k=match_count
                     )
                     logger.debug(f"Code reranking applied: {search_match_count} candidates -> {len(results)} final results")
                 except Exception as e:
                     logger.warning(f"Code reranking failed: {e}")
                     # If reranking fails but we fetched extra results, trim to requested count
                     if len(results) > match_count:
                         results = results[:match_count]
```
🛠️ Refactor suggestion
Preserve stack traces on code reranking failures.
```diff
-        except Exception as e:
-            logger.warning(f"Code reranking failed: {e}")
+        except Exception as e:
+            logger.warning(f"Code reranking failed: {e}", exc_info=True)
```
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 358 to 366, the
except block logs reranking failures with logger.warning(f"Code reranking
failed: {e}") which loses the stack trace; change the logging to preserve the
exception info (e.g., use logger.exception("Code reranking failed") or
logger.warning("Code reranking failed", exc_info=True)) so the traceback is
recorded, and keep the existing trimming logic for results as-is.
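The pattern the review comments converge on — over-fetch a larger candidate pool, rerank it, and fall back to a simple trim when the reranker fails — can be sketched in isolation. All names below (`rerank_with_fallback`, `failing_rerank`) are hypothetical stand-ins for illustration, not the project's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def rerank_with_fallback(query, results, match_count, rerank):
    """Rerank an over-fetched candidate pool; trim to match_count on failure.

    `rerank` is a hypothetical callable standing in for the reranking strategy.
    """
    try:
        return rerank(query, results, top_k=match_count)
    except Exception:
        # exc_info=True preserves the traceback, as the review suggests
        logger.warning("Code reranking failed", exc_info=True)
        # Fallback: trim the over-fetched pool to the requested count
        return results[:match_count]

def failing_rerank(query, results, top_k):
    raise RuntimeError("model unavailable")

# 5x over-fetch, mirroring search_match_count = match_count * 5
candidates = [{"content": f"chunk {i}"} for i in range(25)]
final = rerank_with_fallback("query", candidates, 5, failing_rerank)
print(len(final))  # → 5
```

Even when the reranker throws, the caller still gets at most `match_count` results, which is the invariant both diff variants above protect.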
Error: SQL query ran into an upstream timeout
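Timeouts like the one reported above are commonly handled client-side by bounding the database call; a minimal sketch using asyncio, assuming the RPC is awaitable (the actual Supabase client API may differ, and `slow_rpc` is a stand-in):

```python
import asyncio

async def slow_rpc():
    # Stand-in for a long-running hybrid search RPC
    await asyncio.sleep(2)
    return ["row"]

async def call_with_timeout(coro, seconds):
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # Fall back to an empty result instead of failing the whole request
        return []

print(asyncio.run(call_with_timeout(slow_rpc(), 0.1)))  # → []
```

Server-side, Postgres `statement_timeout` limits would surface as errors like the one above rather than hanging the connection.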
@coleam00: After doing this huge WordPress developer docs crawl, your script is running into some serious issues.
@ujconsulting - this seems unrelated to this PR? |
…539) * fix: add domain prefix to log event names across git, isolation, and adapters packages Log events must follow the {domain}.{action}_{state} format per CLAUDE.md convention. Previously, many events were missing the domain prefix (e.g., 'checkout_failed' instead of 'git.checkout_failed'), making log filtering and observability harder. - packages/git/src/branch.ts: 11 events prefixed with git. - packages/git/src/repo.ts: 8 events prefixed with git. - packages/git/src/worktree.ts: 6 events prefixed with git. - packages/git/src/git.test.ts: 9 test assertions updated to match new event names - packages/isolation/src/resolver.ts: 6 events changed from isolation_ to isolation. notation - packages/isolation/src/providers/worktree.ts: 24 events prefixed with isolation. - packages/adapters/src/forge/github/adapter.ts: 3 cleanup events prefixed with isolation. * Revert "fix: add domain prefix to log event names across git, isolation, and adapters packages" This reverts commit 0120a26861a17135237361adf050f1ba356dff06.

Pull Request
Summary
Enhanced the hybrid search strategy with tsvector keyword matching
Changes Made
Type of Change
Affected Services
Testing
Test Evidence
Manually tested by crawling fresh documentation (https://ai.pydantic.dev/llms-full.txt) and then running some RAG queries in Claude Desktop with LOG_LEVEL set to DEBUG, so I could confirm that with the hybrid search strategy enabled I'm getting chunks back from both semantic and keyword search. All chunks are also sent into the reranker when the reranking strategy is enabled.
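The merge this PR moves into SQL — a full outer join of vector matches and ts_vector keyword matches, each row labelled with a `match_type` — behaves roughly like the following in-memory sketch. Field names and the ranking rule here are illustrative assumptions, not the actual schema of the new Postgres functions:

```python
def hybrid_merge(vector_hits, keyword_hits, match_count):
    """Full-outer-join-style merge of vector and keyword result sets,
    keyed by row id, labelling each row with its match_type."""
    merged = {}
    for row in vector_hits:
        merged[row["id"]] = {**row, "match_type": "vector"}
    for row in keyword_hits:
        if row["id"] in merged:
            merged[row["id"]]["match_type"] = "both"
        else:
            merged[row["id"]] = {**row, "match_type": "keyword"}
    # Rows matched by both signals first, then by descending similarity
    ranked = sorted(
        merged.values(),
        key=lambda r: (r["match_type"] != "both", -r.get("similarity", 0.0)),
    )
    return ranked[:match_count]

vec = [{"id": 1, "similarity": 0.9}, {"id": 2, "similarity": 0.7}]
kw = [{"id": 2, "similarity": 0.5}, {"id": 3, "similarity": 0.4}]
print([(r["id"], r["match_type"]) for r in hybrid_merge(vec, kw, 3)])
# → [(2, 'both'), (1, 'vector'), (3, 'keyword')]
```

Doing this join in SQL rather than Python means only one RPC round-trip per query and lets the database's indexes do the keyword matching, which is what the DEBUG logs above confirm.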
Checklist
Breaking Changes
This PR will require a DB migration for the hybrid search strategy to work.
Additional Notes
The current setup to detect the need for a DB migration is specific to #472. We will need a more generic implementation or potentially create a process to automatically run migrations whenever Archon is spun up.
Summary by CodeRabbit
New Features
Refactor
Tests