
Enhanced the hybrid search strategy with tsvector keyword matching #539

Merged
Wirasm merged 1 commit into main from hybrid-rag-enhancement on Sep 10, 2025

Conversation

coleam00 (Owner) commented Aug 30, 2025

Pull Request

Summary

Enhanced the hybrid search strategy with tsvector keyword matching

Changes Made

  • Added a new SQL match function for hybrid search - semantic similarity + keyword search with tsvector (rough sketch below)
  • Added a migration script and updated complete_setup.sql and RESET_DB.sql
  • Updated the hybrid search RAG strategy to use the new hybrid search SQL function instead of the old naive keyword search strategy
  • Updated the main RAG coordinator to pass all semantic + keyword search results to the reranker instead of filtering before the reranker
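
For illustration, the keyword half of the new SQL match function could look roughly like the sketch below. This is a minimal sketch only: the content_search_vector column name follows the review discussion further down, while the index name and the example query string are made up here, and the actual migration may differ in details.

-- Generated column keeps the tsvector in sync with content (no trigger maintenance needed)
ALTER TABLE archon_crawled_pages
    ADD COLUMN content_search_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

-- GIN index so the @@ full-text match below can use an index scan (index name illustrative)
CREATE INDEX idx_archon_crawled_pages_content_search
    ON archon_crawled_pages USING GIN (content_search_vector);

-- Keyword side of the hybrid match: rank rows whose tsvector matches the query
SELECT id, url, chunk_number, content,
       ts_rank_cd(content_search_vector, plainto_tsquery('english', 'pydantic agents')) AS text_sim
FROM archon_crawled_pages
WHERE content_search_vector @@ plainto_tsquery('english', 'pydantic agents');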

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

make test-be

Manually tested by crawling fresh documentation (https://ai.pydantic.dev/llms-full.txt) and then running RAG queries in Claude Desktop with LOG_LEVEL set to DEBUG. This confirmed that, with the hybrid search strategy enabled, chunks come back from both semantic and keyword search, and that all chunks are passed to the reranker when the reranking strategy is enabled.

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • [!] I have added tests that prove my fix/feature works (No, using existing tests for the RAG strategies)
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

This PR will require a DB migration for the hybrid search strategy to work.

Additional Notes

The current setup to detect the need for a DB migration is specific to #472. We will need a more generic implementation or potentially create a process to automatically run migrations whenever Archon is spun up.

Summary by CodeRabbit

  • New Features

    • Introduces hybrid search combining semantic and full‑text matching for documents and code examples, with fuzzy matching and faster results via new indexes.
    • Adds a Knowledge Base (sources, crawled pages, code examples) and seeded settings/prompts to support search and RAG workflows.
    • Improves result quality when reranking by fetching a larger candidate pool while keeping final output size unchanged.
  • Refactor

    • Streamlines search to a single database-powered query for more consistent relevance and performance.
  • Tests

    • Updates tests to align with the new hybrid search flow and remove obsolete merge logic checks.

coderabbitai Bot commented Aug 30, 2025

Walkthrough

Expands the database schema and search capabilities with hybrid tsvector + embedding functions, indexes, and policies. Updates server search to call the new Postgres hybrid search RPCs, removes the Python-side keyword merge, and adjusts reranking to fetch larger candidate pools with top_k limiting. The reset script now drops the new hybrid functions. Tests are updated accordingly.

Changes

Cohort / File(s) — Summary

  • DB reset updates (migration/RESET_DB.sql): Adds DROP FUNCTION ... CASCADE for hybrid_search_archon_crawled_pages(vector, text, int, jsonb, text) and hybrid_search_archon_code_examples(vector, text, int, jsonb, text), plus a comment.
  • Hybrid search migrations (migration/add_hybrid_search_tsvector.sql): Adds generated tsvector columns, GIN indexes, and PL/pgSQL functions hybrid_search_archon_crawled_pages and hybrid_search_archon_code_examples combining vector similarity and full-text ranking; supports optional JSONB filter and source filter.
  • Comprehensive setup/seed (migration/complete_setup.sql): Introduces knowledge base tables, projects/tasks module with RLS, prompts, settings, indexes, triggers, and functions including match_archon_* and hybrid_search_archon_*; seeds initial data and comments.
  • Server hybrid strategy refactor (python/src/server/services/search/hybrid_search_strategy.py): Replaces Python-side merge and keyword search with single DB-backed hybrid search RPCs; removes _merge_search_results and keyword flow; normalizes results; updates method signature to drop query_embedding; logging and filter handling updated.
  • RAG pipeline candidate expansion and top_k (python/src/server/services/search/rag_service.py, python/src/server/services/search/reranking_strategy.py): Expands initial candidate pool to 5x when reranking; adds top_k to rerank_results; trims to requested count on failure; consistent behavior for documents and code examples.
  • Tests adjusted for new flow (python/tests/test_rag_simple.py, python/tests/test_rag_strategies.py): Removes tests for _merge_search_results and related assertions; retains initialization checks for search_documents_hybrid.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Server as Server (HybridSearchStrategy)
  participant DB as Postgres

  Client->>Server: search_documents_hybrid(query, match_count, filter_metadata)
  Server->>DB: RPC hybrid_search_archon_crawled_pages(query_text, match_count, filter, source_filter)
  DB-->>Server: rows[id, url, chunk_number, content, metadata, source_id, similarity, match_type]
  Server->>Server: Normalize results, log counts and match_type distribution
  Server-->>Client: List[dict]
  note over DB,Server: Keyword and vector merging now occurs in SQL via full outer join
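
As a rough illustration of that SQL-side merge (shape only; the real function also applies the JSONB metadata and source filters, returns more columns, and runs inside a PL/pgSQL function whose parameters include query_embedding, query_text, and match_count):

WITH vector_results AS (
    SELECT cp.id, 1 - (cp.embedding <=> query_embedding) AS vector_sim
    FROM archon_crawled_pages cp
    ORDER BY cp.embedding <=> query_embedding
    LIMIT match_count
),
text_results AS (
    SELECT cp.id,
           ts_rank_cd(cp.content_search_vector, plainto_tsquery('english', query_text)) AS text_sim
    FROM archon_crawled_pages cp
    WHERE cp.content_search_vector @@ plainto_tsquery('english', query_text)
    ORDER BY text_sim DESC
    LIMIT match_count
)
-- Full outer join keeps rows found by either path; COALESCE picks whichever score exists
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.vector_sim, t.text_sim) AS similarity,
       CASE WHEN v.id IS NOT NULL AND t.id IS NOT NULL THEN 'both'
            WHEN v.id IS NOT NULL THEN 'vector'
            ELSE 'text' END AS match_type
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY similarity DESC
LIMIT match_count;
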
sequenceDiagram
  autonumber
  participant Client
  participant RAG as RagService
  participant Search as HybridSearchStrategy
  participant Reranker as RerankingStrategy

  Client->>RAG: query(match_count, rerank=True)
  RAG->>RAG: search_match_count = match_count * 5
  RAG->>Search: search_documents_hybrid(query, search_match_count, filters)
  Search-->>RAG: candidates (N ~= 5x)
  RAG->>Reranker: rerank_results(query, candidates, top_k=match_count)
  alt Rerank success
    Reranker-->>RAG: top_k results
  else Rerank failure
    RAG->>RAG: Fallback trim to match_count
  end
  RAG-->>Client: final results (<= match_count)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • tazmon95
  • leex279

Poem

Thump-thump go my paws on the query trail,
Vectors meet words where results prevail.
GIN trees whisper, trigrams align,
Postgres hums a hybrid design.
Fivefold fetch, then tidy the heap—
Carrot-ranked answers, crisp and deep. 🥕


coderabbitai Bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/search/rag_service.py (2)

143-147: Preserve stack traces for document search failures.

-            except Exception as e:
-                logger.error(f"Document search failed: {e}")
+            except Exception as e:
+                logger.error(f"Document search failed: {e}", exc_info=True)

399-402: Preserve stack traces for code example pipeline failures.

-            except Exception as e:
-                logger.error(f"Code example search failed: {e}")
+            except Exception as e:
+                logger.error(f"Code example search failed: {e}", exc_info=True)
🧹 Nitpick comments (8)
migration/add_hybrid_search_tsvector.sql (4)

29-33: Trigram indexes are heavy; consider need and creation strategy.

You’re creating GIN trigram indexes on large text columns, but the new hybrid functions don’t use trigram operators (% or ILIKE). If they’re only “nice to have,” consider deferring creation or adding them in a separate opt-in migration. For large tables, prefer CONCURRENTLY (outside transaction) to avoid long locks.
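
For example, if the trigram indexes are kept for future fuzzy/ILIKE queries, they could be built without long locks roughly like this (index name is illustrative):

-- Requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- CONCURRENTLY must run outside an explicit transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_archon_crawled_pages_content_trgm
    ON archon_crawled_pages USING GIN (content gin_trgm_ops);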


93-99: Prefer websearch_to_tsquery and guard empty queries.

plainto_tsquery is strict and may underperform for natural queries. Also, when query_text is blank/whitespace, the text path should short-circuit to avoid scanning.

Suggested minimal change:

-            ts_rank_cd(cp.content_search_vector, plainto_tsquery('english', query_text)) AS text_sim
+            ts_rank(cp.content_search_vector, websearch_to_tsquery('english', query_text)) AS text_sim
         FROM archon_crawled_pages cp
         WHERE cp.metadata @> filter
             AND (source_filter IS NULL OR cp.source_id = source_filter)
-            AND cp.content_search_vector @@ plainto_tsquery('english', query_text)
+            AND btrim(query_text) <> ''
+            AND cp.content_search_vector @@ websearch_to_tsquery('english', query_text)

111-117: Similarity mixing is on different scales; expose both scores or normalize.

COALESCE(v.vector_sim, t.text_sim) mixes (1 − cosine distance) with ts_rank values—these are not comparable. This will bias ordering toward whichever scale is larger (likely vector).

Options (keep API stable):

  • Add columns vector_similarity and text_rank while retaining similarity as-is for backward compatibility.
  • Or normalize text_rank (e.g., ts_rank with normalization or percentile within the CTE) and compute a weighted final score.

If you want, I can propose a backward-compatible patch that adds extra columns and preserves similarity.
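
One possible backward-compatible shape, assuming the same vector_results/text_results CTE structure sketched after the first sequence diagram but with text_rank computed as ts_rank(..., 32) so it falls into 0..1 (the 0.7/0.3 weights are purely illustrative):

-- Expose both raw scores and keep `similarity` as the blended, order-defining score
SELECT COALESCE(v.id, t.id)              AS id,
       COALESCE(v.vector_sim, 0)         AS vector_similarity,
       COALESCE(t.text_rank, 0)          AS text_rank,
       0.7 * COALESCE(v.vector_sim, 0)
     + 0.3 * COALESCE(t.text_rank, 0)    AS similarity
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY similarity DESC;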


62-65: Overfetch split is fixed 50/50; consider tunable distribution.

Fetching match_count from both vector and text paths may double work and then discard half. Consider parameters (e.g., vector_ratio, text_ratio) or dynamic allocation (more vector for longer queries, more text for short/keyword-y queries).
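
A tunable split could be as small as deriving per-path limits from a ratio parameter; the names below are hypothetical additions to the function signature, not anything in this PR:

-- e.g. add `vector_ratio double precision DEFAULT 0.5` to the function parameters
-- and declare vector_limit/text_limit as int in the PL/pgSQL body
vector_limit := GREATEST(1, CEIL(match_count * vector_ratio));
text_limit   := GREATEST(1, CEIL(match_count * (1.0 - vector_ratio)));
-- then use LIMIT vector_limit in the vector CTE and LIMIT text_limit in the text CTE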

migration/complete_setup.sql (4)

221-223: Indexes for content_search_vector and trigram: consider usage.

GIN on content_search_vector is necessary; trigram index is optional unless you plan fuzzy/ILIKE queries. Consider deferring trigram to reduce disk/maintenance costs.


234-251: tsvector on code content+summary is solid; indexes align.

Looks good. One micro-nit: ensure summary is NOT NULL in schema or keep COALESCE as you did.


330-515: Hybrid functions duplicate those in add_hybrid_search_tsvector.sql; watch score mixing and query ergonomics.

  • Same comments as the dedicated migration: prefer websearch_to_tsquery, guard empty query_text, and avoid mixing score scales without normalization or exposing both scores.
  • Duplication is fine for “complete setup,” but keep them in sync with migrations to avoid drift.

If you want, I can send a single patch that applies the websearch_to_tsquery and empty-query guard to both functions here.


516-519: Comment says “configurable weighting” but no weights exist.

Either add vector_weight/text_weight params or adjust the comment to avoid implying tunable weights.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 763e5b8 and 636a3a4.

📒 Files selected for processing (7)
  • migration/RESET_DB.sql (1 hunks)
  • migration/add_hybrid_search_tsvector.sql (1 hunks)
  • migration/complete_setup.sql (6 hunks)
  • python/src/server/services/search/hybrid_search_strategy.py (6 hunks)
  • python/src/server/services/search/rag_service.py (4 hunks)
  • python/tests/test_rag_simple.py (0 hunks)
  • python/tests/test_rag_strategies.py (0 hunks)
💤 Files with no reviewable changes (2)
  • python/tests/test_rag_strategies.py
  • python/tests/test_rag_simple.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.

Applied to files:

  • migration/complete_setup.sql
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.

Applied to files:

  • migration/complete_setup.sql
🧬 Code graph analysis (2)
python/src/server/services/search/rag_service.py (2)
python/src/server/services/search/reranking_strategy.py (1)
  • rerank_results (139-194)
python/src/server/services/search/hybrid_search_strategy.py (1)
  • search_code_examples_hybrid (108-194)
python/src/server/services/search/hybrid_search_strategy.py (2)
python/src/server/config/logfire_config.py (2)
  • safe_span (150-171)
  • set_attribute (177-179)
python/src/server/services/embeddings/embedding_service.py (1)
  • create_embedding (71-128)
🔇 Additional comments (4)
migration/add_hybrid_search_tsvector.sql (1)

9-11: Verify vector extension availability in environments running partial migrations.

This migration references vector(1536) in function signatures but doesn’t CREATE EXTENSION vector. If add_hybrid_search_tsvector.sql can run independently of complete setup, ensure vector is installed earlier in your chain, or add a defensive check here.
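
A defensive guard at the top of the migration would cover this, assuming the migration user is allowed to create extensions in the target database:

-- No-op if already installed; fails fast here rather than mid-migration
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;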

migration/RESET_DB.sql (1)

136-139: Drops match the new function signatures. LGTM.

The DROP FUNCTION signatures (vector, text, int, jsonb, text) correctly match the new hybrid_search_* functions (typmod on vector is ignored here). Safe with CASCADE.

migration/complete_setup.sql (2)

18-18: Extension addition is correct. LGTM.


207-207: Generated tsvector column on content: good choice.

This keeps vectors in-sync and avoids trigger maintenance.

Comment on lines +54 to +56
filter_json = filter_metadata or {}
source_filter = filter_json.pop("source", None) if "source" in filter_json else None


⚠️ Potential issue

Avoid mutating caller’s filter_metadata (in-place pop).

filter_json = filter_metadata or {} aliases the original dict; pop mutates caller state. Copy before mutation.

-                filter_json = filter_metadata or {}
-                source_filter = filter_json.pop("source", None) if "source" in filter_json else None
+                filter_json = dict(filter_metadata) if filter_metadata else {}
+                source_filter = filter_json.pop("source", None)
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 54
to 56, the code aliases filter_metadata with filter_json = filter_metadata or {}
and then calls pop, which mutates the caller's dict; instead, create a shallow
copy before any mutation (e.g., filter_json = dict(filter_metadata) if
filter_metadata else {}) and then safely pop "source" from that copy so the
original filter_metadata remains unmodified.

Comment on lines +69 to +71
if not response.data:
logger.debug("No results from hybrid search")
return []

🛠️ Refactor suggestion

Log with stack trace for easier debugging.

-                    logger.debug("No results from hybrid search")
+                    logger.debug("No results from hybrid search")

And on the exception path below:

-            except Exception as e:
-                logger.error(f"Hybrid document search failed: {e}")
+            except Exception as e:
+                logger.error(f"Hybrid document search failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
python/src/server/services/search/hybrid_search_strategy.py lines 69-71: the
current debug log for "No results from hybrid search" should include additional
context (e.g., query/params) to aid investigation, and the exception path below
must log the full stack trace; update the no-results log to include relevant
request details and change the exception handling to use logger.exception(...)
or logger.error(..., exc_info=True) so the stack trace is captured in logs.

Comment on lines +138 to +143
filter_json = filter_metadata or {}
# Use source_id parameter if provided, otherwise check filter_metadata
final_source_filter = source_id
if not final_source_filter and "source" in filter_json:
final_source_filter = filter_json.pop("source")


⚠️ Potential issue

Same in-place mutation bug for code path.

Copy the dict before popping “source”.

-                filter_json = filter_metadata or {}
+                filter_json = dict(filter_metadata) if filter_metadata else {}
                 # Use source_id parameter if provided, otherwise check filter_metadata
                 final_source_filter = source_id
-                if not final_source_filter and "source" in filter_json:
-                    final_source_filter = filter_json.pop("source")
+                if not final_source_filter and "source" in filter_json:
+                    final_source_filter = filter_json.pop("source")
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 138
to 143, the code currently assigns filter_json = filter_metadata or {} and then
pops "source" from it, which mutates the original filter_metadata; fix this by
making a shallow copy of filter_metadata before any mutation (e.g., filter_json
= (filter_metadata or {}).copy() or filter_json = dict(filter_metadata) if
filter_metadata else {}), then pop "source" from that copy and use the copied
dict for subsequent operations so the original input is not modified.

Comment on lines +144 to +155
# Call the hybrid search PostgreSQL function
response = self.supabase_client.rpc(
"hybrid_search_archon_code_examples",
{
"query_embedding": query_embedding,
"query_text": query,
"match_count": match_count,
"filter": filter_json,
"source_filter": final_source_filter,
},
).execute()


🛠️ Refactor suggestion

Supabase RPC calls should be retried with backoff.

Per guidelines, wrap RPC calls with limited retries and context-rich errors. I can add a small helper (tenacity-free) that retries on transient errors (HTTP 5xx/connection issues). Want me to draft it?

🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
144-155, the Supabase RPC call is made without retries; implement a small retry
wrapper for self.supabase_client.rpc that retries on transient failures (HTTP
5xx responses and network/connection exceptions) with a limited number of
attempts (e.g., 3-5), exponential backoff with jitter between attempts, and a
short delay on each retry; ensure the wrapper re-raises a context-rich exception
when exhausted (including RPC name, args, attempt count and last error) and use
it to call "hybrid_search_archon_code_examples" instead of calling
rpc(...).execute() directly.
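
A minimal sketch of such a helper (tenacity-free, as suggested; which exceptions count as transient depends on the Supabase/postgrest client in use, so the broad except below is an assumption that would be narrowed in practice):

import logging
import random
import time

logger = logging.getLogger(__name__)


def rpc_with_retry(client, fn_name: str, params: dict, attempts: int = 3, base_delay: float = 0.5):
    """Call a Supabase RPC with limited retries, exponential backoff, and jitter (sketch only)."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return client.rpc(fn_name, params).execute()
        except Exception as e:  # narrow to connection/5xx errors in real code
            last_error = e
            if attempt == attempts:
                break
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            logger.warning(
                f"RPC {fn_name} failed (attempt {attempt}/{attempts}): {e}; retrying in {delay:.2f}s",
                exc_info=True,
            )
            time.sleep(delay)
    raise RuntimeError(f"RPC {fn_name} failed after {attempts} attempts (param keys: {sorted(params)})") from last_error


# Hypothetical usage inside the strategy:
# response = rpc_with_retry(self.supabase_client, "hybrid_search_archon_code_examples", {...})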

Comment on lines +156 to +158
if not response.data:
logger.debug("No results from hybrid code search")
return []

🛠️ Refactor suggestion

Add stack traces on errors; optional: guard blank query_text.

  • Add exc_info=True to error logs.
  • Optional: if not query.strip(): return [] early to avoid RPC with empty text.
-            except Exception as e:
-                logger.error(f"Hybrid code example search failed: {e}")
+            except Exception as e:
+                logger.error(f"Hybrid code example search failed: {e}", exc_info=True)

Also applies to: 189-194

🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
156-158 (and similarly update the block at 189-194), the error logs lack stack
traces and the function may call RPCs with empty query text; update the
exception logging calls to include exc_info=True so stack traces are captured
(e.g., logger.error(..., exc_info=True)), and add an early guard that returns an
empty list when the incoming query_text is empty or only whitespace (e.g., if
not query_text.strip(): return []) to avoid unnecessary RPCs.

Comment on lines +207 to +215
# If reranking is enabled, fetch more candidates for the reranker to evaluate
# This allows the reranker to see a broader set of results
search_match_count = match_count
if use_reranking and self.reranking_strategy:
# Fetch 5x the requested amount when reranking is enabled
# The reranker will select the best from this larger pool
search_match_count = match_count * 5
logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")


🛠️ Refactor suggestion

Make rerank candidate pool configurable and capped.

Multiplying by 5 unconditionally can be expensive on large datasets. Read a multiplier and cap from settings to avoid runaway queries.

-                search_match_count = match_count
-                if use_reranking and self.reranking_strategy:
-                    # Fetch 5x the requested amount when reranking is enabled
-                    # The reranker will select the best from this larger pool
-                    search_match_count = match_count * 5
-                    logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")
+                search_match_count = match_count
+                if use_reranking and self.reranking_strategy:
+                    # Fetch more candidates for reranking, with config + cap
+                    try:
+                        multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                        cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+                    except Exception:
+                        multiplier, cap = 5, 200
+                    multiplier = max(1, multiplier)
+                    search_match_count = min(match_count * multiplier, cap)
+                    logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")

Comment on lines +246 to 258
# Pass top_k to limit results to the originally requested count
formatted_results = await self.reranking_strategy.rerank_results(
query, formatted_results, content_key="content"
query, formatted_results, content_key="content", top_k=match_count
)
reranking_applied = True
logger.debug(f"Reranking applied to {len(formatted_results)} results")
logger.debug(f"Reranking applied: {search_match_count} candidates -> {len(formatted_results)} final results")
except Exception as e:
logger.warning(f"Reranking failed: {e}")
reranking_applied = False
# If reranking fails but we fetched extra results, trim to requested count
if len(formatted_results) > match_count:
formatted_results = formatted_results[:match_count]


🛠️ Refactor suggestion

Preserve stack traces on reranking failures.

Add exc_info=True to aid debugging and align with logging guidelines.

-                    except Exception as e:
-                        logger.warning(f"Reranking failed: {e}")
+                    except Exception as e:
+                        logger.warning(f"Reranking failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 246 to 258, the
logger.warning call inside the reranking exception handler currently logs only
the exception message and loses the stack trace; update the call to include
exc_info=True (e.g., logger.warning(f"Reranking failed: {e}", exc_info=True)) so
the full traceback is preserved for debugging while keeping the existing logic
that sets reranking_applied=False and trims formatted_results to match_count
when necessary.

Comment on lines +329 to +334
# If reranking is enabled, fetch more candidates
search_match_count = match_count
if use_reranking and self.reranking_strategy:
search_match_count = match_count * 5
logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")


🛠️ Refactor suggestion

Use the same configurable rerank-pool logic for code search.

Mirror the multiplier/cap approach here for consistency.

-                search_match_count = match_count
-                if use_reranking and self.reranking_strategy:
-                    search_match_count = match_count * 5
-                    logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
+                search_match_count = match_count
+                if use_reranking and self.reranking_strategy:
+                    try:
+                        multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                        cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+                    except Exception:
+                        multiplier, cap = 5, 200
+                    multiplier = max(1, multiplier)
+                    search_match_count = min(match_count * multiplier, cap)
+                    logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 329 to 334, the
code for increasing search_match_count when reranking is enabled simply
multiplies match_count by 5; replace that with the same configurable rerank-pool
logic used elsewhere (use the configured multiplier and cap rather than a
hardcoded 5). Read the rerank multiplier and max/cap from the existing reranking
configuration (e.g., self.reranking_strategy or global config used elsewhere),
compute search_match_count = min(match_count * rerank_multiplier,
rerank_max_pool), and keep the debug log showing the final computed candidate
count.

Comment on lines 358 to +366
results = await self.reranking_strategy.rerank_results(
query, results, content_key="content"
query, results, content_key="content", top_k=match_count
)
logger.debug(f"Code reranking applied: {search_match_count} candidates -> {len(results)} final results")
except Exception as e:
logger.warning(f"Code reranking failed: {e}")
# If reranking fails but we fetched extra results, trim to requested count
if len(results) > match_count:
results = results[:match_count]

🛠️ Refactor suggestion

Preserve stack traces on code reranking failures.

-                    except Exception as e:
-                        logger.warning(f"Code reranking failed: {e}")
+                    except Exception as e:
+                        logger.warning(f"Code reranking failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 358 to 366, the
except block logs reranking failures with logger.warning(f"Code reranking
failed: {e}") which loses the stack trace; change the logging to preserve the
exception info (e.g., use logger.exception("Code reranking failed") or
logger.warning("Code reranking failed", exc_info=True)) so the traceback is
recorded, and keep the existing trimming logic for results as-is.

ujconsulting commented Sep 1, 2025

Error: SQL query ran into an upstream timeout.
Maybe it has to run in batches; see the attached example:

add_hybrid_search_tsvector_optimized.txt

@ujconsulting

@coleam00: After doing a huge WordPress developer documentation crawl, your script runs into some serious issues.
Claude Code generated this optimized update script, which runs in batches and, after each run, checks how far it got and what is still open in case it needs more than one run. Sharing it with you in case it's helpful:
fix_crawled_pages_simple.txt

coleam00 (Owner, Author) commented Sep 6, 2025

@ujconsulting - this seems unrelated to this PR?

Wirasm merged commit 926b6f5 into main Sep 10, 2025
44 checks passed
Wirasm deleted the hybrid-rag-enhancement branch September 10, 2025 11:23
coderabbitai Bot mentioned this pull request Sep 22, 2025 (24 tasks)
coderabbitai Bot mentioned this pull request Oct 7, 2025 (20 tasks)
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…539)

* fix: add domain prefix to log event names across git, isolation, and adapters packages

Log events must follow the {domain}.{action}_{state} format per CLAUDE.md convention.
Previously, many events were missing the domain prefix (e.g., 'checkout_failed'
instead of 'git.checkout_failed'), making log filtering and observability harder.

- packages/git/src/branch.ts: 11 events prefixed with git.
- packages/git/src/repo.ts: 8 events prefixed with git.
- packages/git/src/worktree.ts: 6 events prefixed with git.
- packages/git/src/git.test.ts: 9 test assertions updated to match new event names
- packages/isolation/src/resolver.ts: 6 events changed from isolation_ to isolation. notation
- packages/isolation/src/providers/worktree.ts: 24 events prefixed with isolation.
- packages/adapters/src/forge/github/adapter.ts: 3 cleanup events prefixed with isolation.

* Revert "fix: add domain prefix to log event names across git, isolation, and adapters packages"

This reverts commit 0120a26861a17135237361adf050f1ba356dff06.
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…oleam00#539)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…oleam00#539)
