
Enhanced the hybrid search strategy with tsvector keyword matching #539

Merged
Wirasm merged 1 commit into main from hybrid-rag-enhancement on Sep 10, 2025

Conversation

coleam00 (Owner) commented Aug 30, 2025

Pull Request

Summary

Enhanced the hybrid search strategy with tsvector keyword matching

Changes Made

  • Added a new SQL match function for hybrid search - semantic similarity + keyword search with tsvector (rough sketch below)
  • Added a migration script and updated complete_setup.sql and RESET_DB.sql
  • Updated the hybrid search RAG strategy to use the new hybrid search SQL function instead of the old naive keyword search strategy
  • Updated the main RAG coordinator to pass all semantic + keyword search results to the reranker instead of filtering before the reranker
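
For illustration, the keyword half of the new SQL match function could look roughly like the sketch below. This is a minimal sketch only: the content_search_vector column name follows the review discussion further down, while the index name and the example query string are made up here, and the actual migration may differ in details.

-- Generated column keeps the tsvector in sync with content (no trigger maintenance needed)
ALTER TABLE archon_crawled_pages
    ADD COLUMN content_search_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

-- GIN index so the @@ full-text match below can use an index scan (index name illustrative)
CREATE INDEX idx_archon_crawled_pages_content_search
    ON archon_crawled_pages USING GIN (content_search_vector);

-- Keyword side of the hybrid match: rank rows whose tsvector matches the query
SELECT id, url, chunk_number, content,
       ts_rank_cd(content_search_vector, plainto_tsquery('english', 'pydantic agents')) AS text_sim
FROM archon_crawled_pages
WHERE content_search_vector @@ plainto_tsquery('english', 'pydantic agents');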

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

make test-be

Manually tested by crawling fresh documentation (https://ai.pydantic.dev/llms-full.txt) and then running RAG queries in Claude Desktop with LOG_LEVEL set to DEBUG. This confirmed that, with the hybrid search strategy enabled, chunks come back from both semantic and keyword search, and that all chunks are passed to the reranker when the reranking strategy is enabled.

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • [!] I have added tests that prove my fix/feature works (No, using existing tests for the RAG strategies)
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

This PR will require a DB migration for the hybrid search strategy to work.

Additional Notes

The current setup to detect the need for a DB migration is specific to #472. We will need a more generic implementation or potentially create a process to automatically run migrations whenever Archon is spun up.

Summary by CodeRabbit

  • New Features

    • Introduces hybrid search combining semantic and full‑text matching for documents and code examples, with fuzzy matching and faster results via new indexes.
    • Adds a Knowledge Base (sources, crawled pages, code examples) and seeded settings/prompts to support search and RAG workflows.
    • Improves result quality when reranking by fetching a larger candidate pool while keeping final output size unchanged.
  • Refactor

    • Streamlines search to a single database-powered query for more consistent relevance and performance.
  • Tests

    • Updates tests to align with the new hybrid search flow and remove obsolete merge logic checks.

coderabbitai Bot commented Aug 30, 2025

Walkthrough

Expands the database schema and search capabilities with hybrid tsvector + embedding functions, indexes, and policies. Updates server search to call the new Postgres hybrid search RPCs, removes the Python-side keyword merge, and adjusts reranking to fetch larger candidate pools with top_k limiting. The reset script now drops the new hybrid functions. Tests are updated accordingly.

Changes

Cohort / File(s) — Summary

  • DB reset updates (migration/RESET_DB.sql): Adds DROP FUNCTION ... CASCADE for hybrid_search_archon_crawled_pages(vector, text, int, jsonb, text) and hybrid_search_archon_code_examples(vector, text, int, jsonb, text), plus a comment.
  • Hybrid search migrations (migration/add_hybrid_search_tsvector.sql): Adds generated tsvector columns, GIN indexes, and PL/pgSQL functions hybrid_search_archon_crawled_pages and hybrid_search_archon_code_examples combining vector similarity and full-text ranking; supports optional JSONB filter and source filter.
  • Comprehensive setup/seed (migration/complete_setup.sql): Introduces knowledge base tables, projects/tasks module with RLS, prompts, settings, indexes, triggers, and functions including match_archon_* and hybrid_search_archon_*; seeds initial data and comments.
  • Server hybrid strategy refactor (python/src/server/services/search/hybrid_search_strategy.py): Replaces Python-side merge and keyword search with single DB-backed hybrid search RPCs; removes _merge_search_results and keyword flow; normalizes results; updates method signature to drop query_embedding; logging and filter handling updated.
  • RAG pipeline candidate expansion and top_k (python/src/server/services/search/rag_service.py, python/src/server/services/search/reranking_strategy.py): Expands initial candidate pool to 5x when reranking; adds top_k to rerank_results; trims to requested count on failure; consistent behavior for documents and code examples.
  • Tests adjusted for new flow (python/tests/test_rag_simple.py, python/tests/test_rag_strategies.py): Removes tests for _merge_search_results and related assertions; retains initialization checks for search_documents_hybrid.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Server as Server (HybridSearchStrategy)
  participant DB as Postgres

  Client->>Server: search_documents_hybrid(query, match_count, filter_metadata)
  Server->>DB: RPC hybrid_search_archon_crawled_pages(query_text, match_count, filter, source_filter)
  DB-->>Server: rows[id, url, chunk_number, content, metadata, source_id, similarity, match_type]
  Server->>Server: Normalize results, log counts and match_type distribution
  Server-->>Client: List[dict]
  note over DB,Server: Keyword and vector merging now occurs in SQL via full outer join
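
As a rough illustration of that SQL-side merge (shape only; the real function also applies the JSONB metadata and source filters, returns more columns, and runs inside a PL/pgSQL function whose parameters include query_embedding, query_text, and match_count):

WITH vector_results AS (
    SELECT cp.id, 1 - (cp.embedding <=> query_embedding) AS vector_sim
    FROM archon_crawled_pages cp
    ORDER BY cp.embedding <=> query_embedding
    LIMIT match_count
),
text_results AS (
    SELECT cp.id,
           ts_rank_cd(cp.content_search_vector, plainto_tsquery('english', query_text)) AS text_sim
    FROM archon_crawled_pages cp
    WHERE cp.content_search_vector @@ plainto_tsquery('english', query_text)
    ORDER BY text_sim DESC
    LIMIT match_count
)
-- Full outer join keeps rows found by either path; COALESCE picks whichever score exists
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.vector_sim, t.text_sim) AS similarity,
       CASE WHEN v.id IS NOT NULL AND t.id IS NOT NULL THEN 'both'
            WHEN v.id IS NOT NULL THEN 'vector'
            ELSE 'text' END AS match_type
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY similarity DESC
LIMIT match_count;
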
sequenceDiagram
  autonumber
  participant Client
  participant RAG as RagService
  participant Search as HybridSearchStrategy
  participant Reranker as RerankingStrategy

  Client->>RAG: query(match_count, rerank=True)
  RAG->>RAG: search_match_count = match_count * 5
  RAG->>Search: search_documents_hybrid(query, search_match_count, filters)
  Search-->>RAG: candidates (N ~= 5x)
  RAG->>Reranker: rerank_results(query, candidates, top_k=match_count)
  alt Rerank success
    Reranker-->>RAG: top_k results
  else Rerank failure
    RAG->>RAG: Fallback trim to match_count
  end
  RAG-->>Client: final results (<= match_count)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • tazmon95
  • leex279

Poem

Thump-thump go my paws on the query trail,
Vectors meet words where results prevail.
GIN trees whisper, trigrams align,
Postgres hums a hybrid design.
Fivefold fetch, then tidy the heap—
Carrot-ranked answers, crisp and deep. 🥕


coderabbitai Bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/search/rag_service.py (2)

143-147: Preserve stack traces for document search failures.

-            except Exception as e:
-                logger.error(f"Document search failed: {e}")
+            except Exception as e:
+                logger.error(f"Document search failed: {e}", exc_info=True)

399-402: Preserve stack traces for code example pipeline failures.

-            except Exception as e:
-                logger.error(f"Code example search failed: {e}")
+            except Exception as e:
+                logger.error(f"Code example search failed: {e}", exc_info=True)
🧹 Nitpick comments (8)
migration/add_hybrid_search_tsvector.sql (4)

29-33: Trigram indexes are heavy; consider need and creation strategy.

You’re creating GIN trigram indexes on large text columns, but the new hybrid functions don’t use trigram operators (% or ILIKE). If they’re only “nice to have,” consider deferring creation or adding them in a separate opt-in migration. For large tables, prefer CONCURRENTLY (outside transaction) to avoid long locks.
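
For example, if the trigram indexes are kept for future fuzzy/ILIKE queries, they could be built without long locks roughly like this (index name is illustrative):

-- Requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- CONCURRENTLY must run outside an explicit transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_archon_crawled_pages_content_trgm
    ON archon_crawled_pages USING GIN (content gin_trgm_ops);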


93-99: Prefer websearch_to_tsquery and guard empty queries.

plainto_tsquery is strict and may underperform for natural queries. Also, when query_text is blank/whitespace, the text path should short-circuit to avoid scanning.

Suggested minimal change:

-            ts_rank_cd(cp.content_search_vector, plainto_tsquery('english', query_text)) AS text_sim
+            ts_rank(cp.content_search_vector, websearch_to_tsquery('english', query_text)) AS text_sim
         FROM archon_crawled_pages cp
         WHERE cp.metadata @> filter
             AND (source_filter IS NULL OR cp.source_id = source_filter)
-            AND cp.content_search_vector @@ plainto_tsquery('english', query_text)
+            AND btrim(query_text) <> ''
+            AND cp.content_search_vector @@ websearch_to_tsquery('english', query_text)

111-117: Similarity mixing is on different scales; expose both scores or normalize.

COALESCE(v.vector_sim, t.text_sim) mixes (1 − cosine distance) with ts_rank values—these are not comparable. This will bias ordering toward whichever scale is larger (likely vector).

Options (keep API stable):

  • Add columns vector_similarity and text_rank while retaining similarity as-is for backward compatibility.
  • Or normalize text_rank (e.g., ts_rank with normalization or percentile within the CTE) and compute a weighted final score.

If you want, I can propose a backward-compatible patch that adds extra columns and preserves similarity.
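
One possible backward-compatible shape, assuming the same vector_results/text_results CTE structure sketched after the first sequence diagram but with text_rank computed as ts_rank(..., 32) so it falls into 0..1 (the 0.7/0.3 weights are purely illustrative):

-- Expose both raw scores and keep `similarity` as the blended, order-defining score
SELECT COALESCE(v.id, t.id)              AS id,
       COALESCE(v.vector_sim, 0)         AS vector_similarity,
       COALESCE(t.text_rank, 0)          AS text_rank,
       0.7 * COALESCE(v.vector_sim, 0)
     + 0.3 * COALESCE(t.text_rank, 0)    AS similarity
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY similarity DESC;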


62-65: Overfetch split is fixed 50/50; consider tunable distribution.

Fetching match_count from both vector and text paths may double work and then discard half. Consider parameters (e.g., vector_ratio, text_ratio) or dynamic allocation (more vector for longer queries, more text for short/keyword-y queries).
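
A tunable split could be as small as deriving per-path limits from a ratio parameter; the names below are hypothetical additions to the function signature, not anything in this PR:

-- e.g. add `vector_ratio double precision DEFAULT 0.5` to the function parameters
-- and declare vector_limit/text_limit as int in the PL/pgSQL body
vector_limit := GREATEST(1, CEIL(match_count * vector_ratio));
text_limit   := GREATEST(1, CEIL(match_count * (1.0 - vector_ratio)));
-- then use LIMIT vector_limit in the vector CTE and LIMIT text_limit in the text CTE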

migration/complete_setup.sql (4)

221-223: Indexes for content_search_vector and trigram: consider usage.

GIN on content_search_vector is necessary; trigram index is optional unless you plan fuzzy/ILIKE queries. Consider deferring trigram to reduce disk/maintenance costs.


234-251: tsvector on code content+summary is solid; indexes align.

Looks good. One micro-nit: ensure summary is NOT NULL in schema or keep COALESCE as you did.


330-515: Hybrid functions duplicate those in add_hybrid_search_tsvector.sql; watch score mixing and query ergonomics.

  • Same comments as the dedicated migration: prefer websearch_to_tsquery, guard empty query_text, and avoid mixing score scales without normalization or exposing both scores.
  • Duplication is fine for “complete setup,” but keep them in sync with migrations to avoid drift.

If you want, I can send a single patch that applies the websearch_to_tsquery and empty-query guard to both functions here.


516-519: Comment says “configurable weighting” but no weights exist.

Either add vector_weight/text_weight params or adjust the comment to avoid implying tunable weights.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 763e5b8 and 636a3a4.

📒 Files selected for processing (7)
  • migration/RESET_DB.sql (1 hunks)
  • migration/add_hybrid_search_tsvector.sql (1 hunks)
  • migration/complete_setup.sql (6 hunks)
  • python/src/server/services/search/hybrid_search_strategy.py (6 hunks)
  • python/src/server/services/search/rag_service.py (4 hunks)
  • python/tests/test_rag_simple.py (0 hunks)
  • python/tests/test_rag_strategies.py (0 hunks)
💤 Files with no reviewable changes (2)
  • python/tests/test_rag_strategies.py
  • python/tests/test_rag_simple.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/search/rag_service.py
  • python/src/server/services/search/hybrid_search_strategy.py
🧠 Learnings (2)
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.

Applied to files:

  • migration/complete_setup.sql
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.

Applied to files:

  • migration/complete_setup.sql
🧬 Code graph analysis (2)
python/src/server/services/search/rag_service.py (2)
python/src/server/services/search/reranking_strategy.py (1)
  • rerank_results (139-194)
python/src/server/services/search/hybrid_search_strategy.py (1)
  • search_code_examples_hybrid (108-194)
python/src/server/services/search/hybrid_search_strategy.py (2)
python/src/server/config/logfire_config.py (2)
  • safe_span (150-171)
  • set_attribute (177-179)
python/src/server/services/embeddings/embedding_service.py (1)
  • create_embedding (71-128)
🔇 Additional comments (4)
migration/add_hybrid_search_tsvector.sql (1)

9-11: Verify vector extension availability in environments running partial migrations.

This migration references vector(1536) in function signatures but doesn’t CREATE EXTENSION vector. If add_hybrid_search_tsvector.sql can run independently of complete setup, ensure vector is installed earlier in your chain, or add a defensive check here.
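
A defensive guard at the top of the migration would cover this, assuming the migration user is allowed to create extensions in the target database:

-- No-op if already installed; fails fast here rather than mid-migration
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;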

migration/RESET_DB.sql (1)

136-139: Drops match the new function signatures. LGTM.

The DROP FUNCTION signatures (vector, text, int, jsonb, text) correctly match the new hybrid_search_* functions (typmod on vector is ignored here). Safe with CASCADE.

migration/complete_setup.sql (2)

18-18: Extension addition is correct. LGTM.


207-207: Generated tsvector column on content: good choice.

This keeps vectors in-sync and avoids trigger maintenance.

Comment on lines +54 to +56
filter_json = filter_metadata or {}
source_filter = filter_json.pop("source", None) if "source" in filter_json else None


⚠️ Potential issue

Avoid mutating caller’s filter_metadata (in-place pop).

filter_json = filter_metadata or {} aliases the original dict; pop mutates caller state. Copy before mutation.

-                filter_json = filter_metadata or {}
-                source_filter = filter_json.pop("source", None) if "source" in filter_json else None
+                filter_json = dict(filter_metadata) if filter_metadata else {}
+                source_filter = filter_json.pop("source", None)
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 54
to 56, the code aliases filter_metadata with filter_json = filter_metadata or {}
and then calls pop, which mutates the caller's dict; instead, create a shallow
copy before any mutation (e.g., filter_json = dict(filter_metadata) if
filter_metadata else {}) and then safely pop "source" from that copy so the
original filter_metadata remains unmodified.

Comment on lines +69 to +71
if not response.data:
logger.debug("No results from hybrid search")
return []

🛠️ Refactor suggestion

Log with stack trace for easier debugging.

-                    logger.debug("No results from hybrid search")
+                    logger.debug("No results from hybrid search")

And on the exception path below:

-            except Exception as e:
-                logger.error(f"Hybrid document search failed: {e}")
+            except Exception as e:
+                logger.error(f"Hybrid document search failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
python/src/server/services/search/hybrid_search_strategy.py lines 69-71: the
current debug log for "No results from hybrid search" should include additional
context (e.g., query/params) to aid investigation, and the exception path below
must log the full stack trace; update the no-results log to include relevant
request details and change the exception handling to use logger.exception(...)
or logger.error(..., exc_info=True) so the stack trace is captured in logs.

Comment on lines +138 to +143
filter_json = filter_metadata or {}
# Use source_id parameter if provided, otherwise check filter_metadata
final_source_filter = source_id
if not final_source_filter and "source" in filter_json:
final_source_filter = filter_json.pop("source")


⚠️ Potential issue

Same in-place mutation bug for code path.

Copy the dict before popping “source”.

-                filter_json = filter_metadata or {}
+                filter_json = dict(filter_metadata) if filter_metadata else {}
                 # Use source_id parameter if provided, otherwise check filter_metadata
                 final_source_filter = source_id
-                if not final_source_filter and "source" in filter_json:
-                    final_source_filter = filter_json.pop("source")
+                if not final_source_filter and "source" in filter_json:
+                    final_source_filter = filter_json.pop("source")
🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines 138
to 143, the code currently assigns filter_json = filter_metadata or {} and then
pops "source" from it, which mutates the original filter_metadata; fix this by
making a shallow copy of filter_metadata before any mutation (e.g., filter_json
= (filter_metadata or {}).copy() or filter_json = dict(filter_metadata) if
filter_metadata else {}), then pop "source" from that copy and use the copied
dict for subsequent operations so the original input is not modified.

Comment on lines +144 to +155
# Call the hybrid search PostgreSQL function
response = self.supabase_client.rpc(
"hybrid_search_archon_code_examples",
{
"query_embedding": query_embedding,
"query_text": query,
"match_count": match_count,
"filter": filter_json,
"source_filter": final_source_filter,
},
).execute()


🛠️ Refactor suggestion

Supabase RPC calls should be retried with backoff.

Per guidelines, wrap RPC calls with limited retries and context-rich errors. I can add a small helper (tenacity-free) that retries on transient errors (HTTP 5xx/connection issues). Want me to draft it?

🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
144-155, the Supabase RPC call is made without retries; implement a small retry
wrapper for self.supabase_client.rpc that retries on transient failures (HTTP
5xx responses and network/connection exceptions) with a limited number of
attempts (e.g., 3-5), exponential backoff with jitter between attempts, and a
short delay on each retry; ensure the wrapper re-raises a context-rich exception
when exhausted (including RPC name, args, attempt count and last error) and use
it to call "hybrid_search_archon_code_examples" instead of calling
rpc(...).execute() directly.
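
A minimal sketch of such a helper (tenacity-free, as suggested; which exceptions count as transient depends on the Supabase/postgrest client in use, so the broad except below is an assumption that would be narrowed in practice):

import logging
import random
import time

logger = logging.getLogger(__name__)


def rpc_with_retry(client, fn_name: str, params: dict, attempts: int = 3, base_delay: float = 0.5):
    """Call a Supabase RPC with limited retries, exponential backoff, and jitter (sketch only)."""
    last_error: Exception | None = None
    for attempt in range(1, attempts + 1):
        try:
            return client.rpc(fn_name, params).execute()
        except Exception as e:  # narrow to connection/5xx errors in real code
            last_error = e
            if attempt == attempts:
                break
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            logger.warning(
                f"RPC {fn_name} failed (attempt {attempt}/{attempts}): {e}; retrying in {delay:.2f}s",
                exc_info=True,
            )
            time.sleep(delay)
    raise RuntimeError(f"RPC {fn_name} failed after {attempts} attempts (param keys: {sorted(params)})") from last_error


# Hypothetical usage inside the strategy:
# response = rpc_with_retry(self.supabase_client, "hybrid_search_archon_code_examples", {...})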

Comment on lines +156 to +158
if not response.data:
logger.debug("No results from hybrid code search")
return []

🛠️ Refactor suggestion

Add stack traces on errors; optional: guard blank query_text.

  • Add exc_info=True to error logs.
  • Optional: if not query.strip(): return [] early to avoid RPC with empty text.
-            except Exception as e:
-                logger.error(f"Hybrid code example search failed: {e}")
+            except Exception as e:
+                logger.error(f"Hybrid code example search failed: {e}", exc_info=True)

Also applies to: 189-194

🤖 Prompt for AI Agents
In python/src/server/services/search/hybrid_search_strategy.py around lines
156-158 (and similarly update the block at 189-194), the error logs lack stack
traces and the function may call RPCs with empty query text; update the
exception logging calls to include exc_info=True so stack traces are captured
(e.g., logger.error(..., exc_info=True)), and add an early guard that returns an
empty list when the incoming query_text is empty or only whitespace (e.g., if
not query_text.strip(): return []) to avoid unnecessary RPCs.

Comment on lines +207 to +215
# If reranking is enabled, fetch more candidates for the reranker to evaluate
# This allows the reranker to see a broader set of results
search_match_count = match_count
if use_reranking and self.reranking_strategy:
# Fetch 5x the requested amount when reranking is enabled
# The reranker will select the best from this larger pool
search_match_count = match_count * 5
logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")


🛠️ Refactor suggestion

Make rerank candidate pool configurable and capped.

Multiplying by 5 unconditionally can be expensive on large datasets. Read a multiplier and cap from settings to avoid runaway queries.

-                search_match_count = match_count
-                if use_reranking and self.reranking_strategy:
-                    # Fetch 5x the requested amount when reranking is enabled
-                    # The reranker will select the best from this larger pool
-                    search_match_count = match_count * 5
-                    logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")
+                search_match_count = match_count
+                if use_reranking and self.reranking_strategy:
+                    # Fetch more candidates for reranking, with config + cap
+                    try:
+                        multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                        cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+                    except Exception:
+                        multiplier, cap = 5, 200
+                    multiplier = max(1, multiplier)
+                    search_match_count = min(match_count * multiplier, cap)
+                    logger.debug(f"Reranking enabled - fetching {search_match_count} candidates for {match_count} final results")

Comment on lines +246 to 258
# Pass top_k to limit results to the originally requested count
formatted_results = await self.reranking_strategy.rerank_results(
query, formatted_results, content_key="content"
query, formatted_results, content_key="content", top_k=match_count
)
reranking_applied = True
logger.debug(f"Reranking applied to {len(formatted_results)} results")
logger.debug(f"Reranking applied: {search_match_count} candidates -> {len(formatted_results)} final results")
except Exception as e:
logger.warning(f"Reranking failed: {e}")
reranking_applied = False
# If reranking fails but we fetched extra results, trim to requested count
if len(formatted_results) > match_count:
formatted_results = formatted_results[:match_count]


🛠️ Refactor suggestion

Preserve stack traces on reranking failures.

Add exc_info=True to aid debugging and align with logging guidelines.

-                    except Exception as e:
-                        logger.warning(f"Reranking failed: {e}")
+                    except Exception as e:
+                        logger.warning(f"Reranking failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 246 to 258, the
logger.warning call inside the reranking exception handler currently logs only
the exception message and loses the stack trace; update the call to include
exc_info=True (e.g., logger.warning(f"Reranking failed: {e}", exc_info=True)) so
the full traceback is preserved for debugging while keeping the existing logic
that sets reranking_applied=False and trims formatted_results to match_count
when necessary.

Comment on lines +329 to +334
# If reranking is enabled, fetch more candidates
search_match_count = match_count
if use_reranking and self.reranking_strategy:
search_match_count = match_count * 5
logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")


🛠️ Refactor suggestion

Use the same configurable rerank-pool logic for code search.

Mirror the multiplier/cap approach here for consistency.

-                search_match_count = match_count
-                if use_reranking and self.reranking_strategy:
-                    search_match_count = match_count * 5
-                    logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
+                search_match_count = match_count
+                if use_reranking and self.reranking_strategy:
+                    try:
+                        multiplier = int(self.get_setting("RERANK_CANDIDATE_MULTIPLIER", "5"))
+                        cap = int(self.get_setting("RERANK_CANDIDATE_CAP", "200"))
+                    except Exception:
+                        multiplier, cap = 5, 200
+                    multiplier = max(1, multiplier)
+                    search_match_count = min(match_count * multiplier, cap)
+                    logger.debug(f"Reranking enabled for code search - fetching {search_match_count} candidates")
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 329 to 334, the
code for increasing search_match_count when reranking is enabled simply
multiplies match_count by 5; replace that with the same configurable rerank-pool
logic used elsewhere (use the configured multiplier and cap rather than a
hardcoded 5). Read the rerank multiplier and max/cap from the existing reranking
configuration (e.g., self.reranking_strategy or global config used elsewhere),
compute search_match_count = min(match_count * rerank_multiplier,
rerank_max_pool), and keep the debug log showing the final computed candidate
count.

Comment on lines 358 to +366
results = await self.reranking_strategy.rerank_results(
query, results, content_key="content"
query, results, content_key="content", top_k=match_count
)
logger.debug(f"Code reranking applied: {search_match_count} candidates -> {len(results)} final results")
except Exception as e:
logger.warning(f"Code reranking failed: {e}")
# If reranking fails but we fetched extra results, trim to requested count
if len(results) > match_count:
results = results[:match_count]

🛠️ Refactor suggestion

Preserve stack traces on code reranking failures.

-                    except Exception as e:
-                        logger.warning(f"Code reranking failed: {e}")
+                    except Exception as e:
+                        logger.warning(f"Code reranking failed: {e}", exc_info=True)
🤖 Prompt for AI Agents
In python/src/server/services/search/rag_service.py around lines 358 to 366, the
except block logs reranking failures with logger.warning(f"Code reranking
failed: {e}") which loses the stack trace; change the logging to preserve the
exception info (e.g., use logger.exception("Code reranking failed") or
logger.warning("Code reranking failed", exc_info=True)) so the traceback is
recorded, and keep the existing trimming logic for results as-is.

ujconsulting commented Sep 1, 2025

Error: SQL query ran into an upstream timeout.
Maybe it has to run in batches; see the attached example:

add_hybrid_search_tsvector_optimized.txt

@ujconsulting

@coleam00: After doing a huge WordPress developer documentation crawl, your script runs into some serious issues.
Claude Code generated this optimized update script, which runs in batches and, after each run, checks how far it got and what is still open in case it needs more than one run. Sharing it with you in case it's helpful:
fix_crawled_pages_simple.txt

coleam00 (Owner, Author) commented Sep 6, 2025

@ujconsulting - this seems unrelated to this PR?

Wirasm merged commit 926b6f5 into main Sep 10, 2025
44 checks passed
Wirasm deleted the hybrid-rag-enhancement branch September 10, 2025 11:23
coderabbitai Bot mentioned this pull request Sep 22, 2025 (24 tasks)
coderabbitai Bot mentioned this pull request Oct 7, 2025 (20 tasks)
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…539)

* fix: add domain prefix to log event names across git, isolation, and adapters packages

Log events must follow the {domain}.{action}_{state} format per CLAUDE.md convention.
Previously, many events were missing the domain prefix (e.g., 'checkout_failed'
instead of 'git.checkout_failed'), making log filtering and observability harder.

- packages/git/src/branch.ts: 11 events prefixed with git.
- packages/git/src/repo.ts: 8 events prefixed with git.
- packages/git/src/worktree.ts: 6 events prefixed with git.
- packages/git/src/git.test.ts: 9 test assertions updated to match new event names
- packages/isolation/src/resolver.ts: 6 events changed from isolation_ to isolation. notation
- packages/isolation/src/providers/worktree.ts: 24 events prefixed with isolation.
- packages/adapters/src/forge/github/adapter.ts: 3 cleanup events prefixed with isolation.

* Revert "fix: add domain prefix to log event names across git, isolation, and adapters packages"

This reverts commit 0120a26861a17135237361adf050f1ba356dff06.
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…oleam00#539)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…oleam00#539)
