Fix: Update hybrid search functions for multi-dimensional vector fields #681

Merged
tazmon95 merged 1 commit into main from
fix/multi-dimensional-vector-hybrid-search
Sep 18, 2025

Conversation

tazmon95 (Collaborator) commented Sep 16, 2025

Summary

Fixes #675 - Updates hybrid search functions in migration scripts to properly handle multi-dimensional vector fields instead of referencing non-existent cp.embedding and ce.embedding columns.

Problem

The hybrid search functions in both migration scripts were referencing columns that don't exist:

  • Referenced: cp.embedding and ce.embedding
  • Should use: embedding_384, embedding_768, embedding_1024, embedding_1536, embedding_3072

This caused RAG queries to fail with "column does not exist" database errors.

Solution

  • Created new multi-dimensional hybrid search functions that dynamically select the correct embedding column based on dimension
  • Maintained backward compatibility through legacy wrapper functions
  • Applied the same pattern used in existing match functions
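
The column selection described above can be illustrated outside the database. This is a minimal Python sketch, not the migration's PL/pgSQL; the helper name `embedding_column_for` is hypothetical, while the supported dimensions and the `embedding_<dimension>` column names come from this PR.

```python
# Dimensions listed in this PR; each has a matching embedding_<dimension> column.
SUPPORTED_DIMENSIONS = (384, 768, 1024, 1536, 3072)


def embedding_column_for(dimension: int) -> str:
    """Return the embedding column name for a supported dimension.

    Hypothetical helper that mirrors the dimension-to-column mapping
    performed inside the SQL functions.
    """
    if dimension not in SUPPORTED_DIMENSIONS:
        # Same failure mode as the SQL functions: reject unknown dimensions.
        raise ValueError(f"Unsupported embedding dimension: {dimension}")
    return f"embedding_{dimension}"


if __name__ == "__main__":
    for dim in SUPPORTED_DIMENSIONS:
        print(dim, "->", embedding_column_for(dim))
```

Restricting the input to a fixed allow-list means the column name is never built from arbitrary caller input.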

Changes

  • migration/complete_setup.sql: Updated hybrid search functions to use multi-dimensional approach
  • migration/add_hybrid_search_tsvector.sql: Applied identical fixes for consistency

Testing

✅ Verified RAG queries work without database errors
✅ Confirmed hybrid search mode is active
✅ No "column does not exist" errors in logs

Test Plan

  1. Apply migration script to database
  2. Run RAG query via API: POST /api/rag/query
  3. Verify no column errors in server logs
  4. Test with different embedding providers (OpenAI, Google, Ollama)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Hybrid search now supports multiple embedding dimensions/models, selectable per request.
    • Code example results now include a summary field.
  • Compatibility
    • Existing search integrations continue working unchanged with the default embedding dimension.
  • Improvements
    • Relevance boosted by combining vector and full‑text signals in a unified search.
    • Supports optional metadata and source filters for more precise results.
  • Documentation
    • Updated guidance to reflect multi‑dimensional search and backward‑compatible endpoints.

Fixes critical bug where hybrid search functions referenced non-existent
cp.embedding and ce.embedding columns instead of dimension-specific columns.

Changes:
- Add new multi-dimensional hybrid search functions with dynamic column selection
- Maintain backward compatibility with existing legacy functions
- Support all embedding dimensions: 384, 768, 1024, 1536, 3072
- Proper error handling for unsupported dimensions

Resolves: #675 - RAG queries now work with multi-dimensional embeddings

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai Bot commented Sep 16, 2025

Walkthrough

Introduces multi-dimensional hybrid search SQL functions with dynamic selection of embedding columns by dimension. Adds backward-compatible wrappers fixed to 1536D. Applies the same updates in add_hybrid_search_tsvector.sql and complete_setup.sql, replacing prior single-column references and addressing the non-existent cp.embedding issue.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Hybrid search: multi-dimension functions (migration/add_hybrid_search_tsvector.sql, migration/complete_setup.sql) | Added hybrid_search_archon_crawled_pages_multi and hybrid_search_archon_code_examples_multi accepting generic VECTOR plus embedding_dimension; dynamic SQL selects embedding column (embedding_384/768/1024/1536/3072); computes vector and text similarities; validates supported dimensions. |
| Backward-compatibility wrappers (1536D) (migration/add_hybrid_search_tsvector.sql, migration/complete_setup.sql) | Added/retained wrappers hybrid_search_archon_crawled_pages(vector(1536), ...) and hybrid_search_archon_code_examples(vector(1536), ...) delegating to multi variants with embedding_dimension = 1536; signatures unchanged; comments updated. |
| Bug fix: invalid column reference (migration/add_hybrid_search_tsvector.sql, migration/complete_setup.sql) | Replaced references to non-existent cp.embedding with dimension-specific columns via dynamic selection; ensures WHERE/ORDER BY use the correct embedding column. |
| Documentation/comments (migration/...) | Added COMMENTS describing multi-dimensional hybrid search and legacy wrappers; notes on configurable dimensions and error on unsupported values. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant DB as Postgres
  participant HF as hybrid_search_*_multi

  Client->>DB: SELECT * FROM hybrid_search_*_multi(query_embedding, embedding_dimension, query_text, match_count, filter, source_filter)
  activate DB
  DB->>HF: Invoke function
  activate HF
  HF->>HF: Map embedding_dimension → embedding column (e.g., embedding_1536)
  HF->>DB: Dynamic SQL: vector similarity + full-text tsquery, apply filters
  DB-->>HF: Result rows (id, url, chunk_number, content, ... , similarity, match_type)
  deactivate HF
  DB-->>Client: Top N matches
  deactivate DB
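
The call in the first diagram could be assembled on the client roughly as follows. This is a sketch under stated assumptions: the argument order is taken from the diagram, but the `%s` placeholder style (as used by psycopg), the `jsonb` cast on `filter`, the default `match_count`, and the helper name `build_multi_search_call` are all assumptions rather than something this PR shows.

```python
import json
from typing import Any, Optional, Sequence, Tuple


def build_multi_search_call(
    table_kind: str,
    query_embedding: Sequence[float],
    query_text: str,
    match_count: int = 10,  # assumed default, not taken from the PR
    metadata_filter: Optional[dict] = None,
    source_filter: Optional[str] = None,
) -> Tuple[str, Tuple[Any, ...]]:
    """Build SQL text and parameters for a hybrid_search_*_multi call.

    Hypothetical helper: nothing is executed, it only returns the pieces
    a database driver would need.
    """
    if table_kind not in ("crawled_pages", "code_examples"):
        raise ValueError(f"Unknown table kind: {table_kind}")
    function_name = f"hybrid_search_archon_{table_kind}_multi"
    # Argument order follows the sequence diagram: query_embedding,
    # embedding_dimension, query_text, match_count, filter, source_filter.
    sql = f"SELECT * FROM {function_name}(%s::vector, %s, %s, %s, %s::jsonb, %s)"
    # pgvector accepts the text form '[x1,x2,...]' for vector values.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    params = (
        vector_literal,
        len(query_embedding),  # embedding_dimension derived from the vector
        query_text,
        match_count,
        json.dumps(metadata_filter or {}),
        source_filter,
    )
    return sql, params


if __name__ == "__main__":
    sql, params = build_multi_search_call("crawled_pages", [0.1, 0.2, 0.3], "vector index")
    print(sql)
    print(params)
```

Deriving `embedding_dimension` from the vector's length keeps the two arguments from drifting apart on the caller's side.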
sequenceDiagram
  autonumber
  actor Client
  participant DB as Postgres
  participant Legacy as hybrid_search_* (1536D wrapper)
  participant Multi as hybrid_search_*_multi

  Client->>DB: SELECT * FROM hybrid_search_*(query_embedding_1536, query_text, ...)
  activate DB
  DB->>Legacy: Call legacy wrapper
  activate Legacy
  Legacy->>Multi: Delegate with embedding_dimension = 1536
  activate Multi
  Multi->>DB: Execute dynamic query using embedding_1536
  DB-->>Multi: Rows
  deactivate Multi
  Legacy-->>DB: Return rows
  deactivate Legacy
  DB-->>Client: Rows
  deactivate DB

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my ears at vectors wide,
From 384 to 3k stride.
No phantom columns now to chase—
The right one picked for every case.
With hops through text and cosine bright,
I fetch the best, both code and byte. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "Fix: Update hybrid search functions for multi-dimensional vector fields" succinctly and accurately summarizes the primary change, namely updating hybrid search functions to use dimension-specific embedding columns and add multi-dimensional support in migration scripts; it directly reflects the edits in migration/complete_setup.sql and migration/add_hybrid_search_tsvector.sql. The phrasing is concise and specific, without noisy extras or misleading information, so a reviewer scanning history will understand the main intent. |
| Linked Issues Check | ✅ Passed | The changes implement the requested fix from linked issue [#675] by replacing references to non-existent cp.embedding/ce.embedding with dimension-specific columns and adding hybrid_search_archon_*_multi functions that select the appropriate embedding column based on embedding_dimension while retaining 1536D legacy wrappers for backward compatibility. Both migration files cited in the issue were updated and the PR's testing notes indicate RAG queries no longer produce "column does not exist" errors, demonstrating the primary objective is met. Accordingly, the code changes satisfy the linked issue's requirements. |
| Out of Scope Changes Check | ✅ Passed | All modifications reported in the raw summary are confined to migration SQL files (migration/complete_setup.sql and migration/add_hybrid_search_tsvector.sql) and specifically target embedding-column selection and compatibility wrappers; there is no indication of unrelated service code, schema changes outside embeddings, or extraneous feature work. The scope of edits aligns with the linked issue and PR objectives. Therefore no out-of-scope changes were detected. |
| Description Check | ✅ Passed | The PR description provides a clear Summary, Problem statement, Solution, a Changes section, Testing notes, and a Test Plan, covering the core information required by the repository template and making the intent and scope of the fix easy to evaluate. It does not populate the formal Type of Change and Affected Services checkboxes, the checklist, or include explicit Test Evidence commands/log outputs. Because the essential details required to review the change are present, the description is considered mostly complete and acceptable. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

@tazmon95 tazmon95 requested a review from leex279 September 16, 2025 23:46
@tazmon95 tazmon95 added this to Archon Sep 16, 2025
@tazmon95 tazmon95 moved this to In review in Archon Sep 16, 2025
@tazmon95 tazmon95 added the bug label Sep 16, 2025
@tazmon95 tazmon95 self-assigned this Sep 16, 2025

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
migration/add_hybrid_search_tsvector.sql (4)

66-75: Guard against embedding dimension mismatches.

Add a runtime check so a 768D vector isn’t used against a 1536D column (currently would error at execution).

Apply this diff near the start of the function body:

 BEGIN
+    -- Validate query vector dimension
+    IF vector_dims(query_embedding) != embedding_dimension THEN
+        RAISE EXCEPTION 'Query embedding dimension (%) does not match embedding_dimension (%)',
+            vector_dims(query_embedding), embedding_dimension;
+    END IF;
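
The same guard can also be applied on the caller's side before the query is sent. This is a small Python sketch, assuming the embedding arrives as a plain sequence of floats; the function name `check_query_dimension` is hypothetical, and the error message copies the wording of the suggested SQL exception.

```python
from typing import Sequence


def check_query_dimension(query_embedding: Sequence[float], embedding_dimension: int) -> None:
    """Raise if the query vector length differs from the declared dimension.

    Hypothetical client-side counterpart of the vector_dims() check
    suggested for the SQL function.
    """
    actual = len(query_embedding)
    if actual != embedding_dimension:
        raise ValueError(
            f"Query embedding dimension ({actual}) does not match "
            f"embedding_dimension ({embedding_dimension})"
        )


if __name__ == "__main__":
    check_query_dimension([0.0] * 768, 768)  # passes silently
    try:
        check_query_dimension([0.0] * 768, 1536)
    except ValueError as exc:
        print(exc)
```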

100-115: Text-search rank vs. vector similarity scale mismatch.

ts_rank_cd and cosine similarity are on different scales; ordering the combined set by a single similarity can bias results. Optional: blend scores with weights.

Example change inside combined_results:

-            COALESCE(v.vector_sim, t.text_sim, 0)::float8 AS similarity,
+            (0.7 * COALESCE(v.vector_sim, 0) + 0.3 * COALESCE(t.text_sim, 0))::float8 AS similarity,

Consider making weights configurable later.
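
To make the difference between the two scoring expressions concrete, here is a Python sketch of both. It assumes a missing match is represented as `None`, as a SQL NULL would be; the function names are hypothetical, and the 0.7/0.3 weights are the ones from the suggested diff.

```python
from typing import Optional


def coalesce_score(vector_sim: Optional[float], text_sim: Optional[float]) -> float:
    """Mirror COALESCE(v.vector_sim, t.text_sim, 0): first non-null value wins."""
    if vector_sim is not None:
        return vector_sim
    if text_sim is not None:
        return text_sim
    return 0.0


def blended_score(
    vector_sim: Optional[float],
    text_sim: Optional[float],
    vector_weight: float = 0.7,
    text_weight: float = 0.3,
) -> float:
    """Mirror the suggested weighted blend; a missing signal counts as 0."""
    v = vector_sim if vector_sim is not None else 0.0
    t = text_sim if text_sim is not None else 0.0
    return vector_weight * v + text_weight * t


if __name__ == "__main__":
    # A row matched by both branches: COALESCE ignores the text rank entirely.
    print(coalesce_score(0.8, 0.1), blended_score(0.8, 0.1))
    # A text-only row: COALESCE ranks it by the raw ts_rank_cd value.
    print(coalesce_score(None, 0.05), blended_score(None, 0.05))
```

With COALESCE, a text-only row is ranked directly against cosine similarities using its raw text rank, which is the scale mismatch the comment describes. The blend keeps both signals but still adds values from different scales, so the weights act as a tuning knob rather than a normalization.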


171-209: Repeat the dimension guard in code_examples.

Mirror the earlier dimension check to prevent mismatches here as well.

 BEGIN
+    IF vector_dims(query_embedding) != embedding_dimension THEN
+        RAISE EXCEPTION 'Query embedding dimension (%) does not match embedding_dimension (%)',
+            vector_dims(query_embedding), embedding_dimension;
+    END IF;

252-276: Same optional score fusion note for code_examples.

Consider weighted blending as above.

migration/complete_setup.sql (3)

477-514: Deduplicate CASE via helper function and add dimension check.

You already define get_embedding_column_name(dimension) and can also validate dimension early.

 BEGIN
-    -- Determine which embedding column to use based on dimension
-    CASE embedding_dimension
-        WHEN 384 THEN embedding_column := 'embedding_384';
-        WHEN 768 THEN embedding_column := 'embedding_768';
-        WHEN 1024 THEN embedding_column := 'embedding_1024';
-        WHEN 1536 THEN embedding_column := 'embedding_1536';
-        WHEN 3072 THEN embedding_column := 'embedding_3072';
-        ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', embedding_dimension;
-    END CASE;
+    IF vector_dims(query_embedding) != embedding_dimension THEN
+        RAISE EXCEPTION 'Query embedding dimension (%) does not match embedding_dimension (%)',
+            vector_dims(query_embedding), embedding_dimension;
+    END IF;
+    embedding_column := get_embedding_column_name(embedding_dimension);

610-647: Apply the same helper+dimension guard to code_examples.

Reduce duplication and add safety.

 BEGIN
-    -- Determine which embedding column to use based on dimension
-    CASE embedding_dimension
-        WHEN 384 THEN embedding_column := 'embedding_384';
-        WHEN 768 THEN embedding_column := 'embedding_768';
-        WHEN 1024 THEN embedding_column := 'embedding_1024';
-        WHEN 1536 THEN embedding_column := 'embedding_1536';
-        WHEN 3072 THEN embedding_column := 'embedding_3072';
-        ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', embedding_dimension;
-    END CASE;
+    IF vector_dims(query_embedding) != embedding_dimension THEN
+        RAISE EXCEPTION 'Query embedding dimension (%) does not match embedding_dimension (%)',
+            vector_dims(query_embedding), embedding_dimension;
+    END IF;
+    embedding_column := get_embedding_column_name(embedding_dimension);

227-234: Performance caveat for 3072D embeddings.

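pgvector's ivfflat and HNSW indexes support at most 2,000 dimensions for the vector type, so no index can be created for the 3072D column; expect sequential scans and sorting. Consider a config flag to disallow 3072D at scale, or set ivfflat.probes higher for the other dimensions to improve recall.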

Run EXPLAIN ANALYZE comparing 1536D vs 3072D to validate impact on your dataset size.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2ec7df and 35ec9ff.

📒 Files selected for processing (2)
  • migration/add_hybrid_search_tsvector.sql (10 hunks)
  • migration/complete_setup.sql (9 hunks)
🔇 Additional comments (14)
migration/add_hybrid_search_tsvector.sql (7)

38-75: Good fix: dynamic embedding column selection and safe SQL construction.

  • Correctly maps dimension → column and uses %I for identifiers.
  • Addresses the original cp/ce.embedding bug.

Please run EXPLAIN on a 3072D query to confirm acceptable performance without an index.


80-98: Cosine distance to similarity conversion is correct (pgvector <=>).

Using 1 - (col <=> $1) yields cosine similarity when <=> is cosine distance. Keep as-is.

Confirm your pgvector version uses <=> for cosine distance; if not, we should switch operators.
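
The conversion can be checked numerically without a database. This pure-Python sketch computes cosine distance the way pgvector's `<=>` operator defines it (one minus cosine similarity) and then applies the `1 - distance` step; both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_from_distance(distance: float) -> float:
    """Same conversion as 1 - (col <=> $1) in the SQL."""
    return 1.0 - distance


if __name__ == "__main__":
    same = similarity_from_distance(cosine_distance([1.0, 2.0], [2.0, 4.0]))
    orthogonal = similarity_from_distance(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(same, orthogonal)
```

Parallel vectors yield a similarity close to 1 and orthogonal vectors close to 0, so ordering by descending similarity matches ordering by ascending distance.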


146-170: Legacy wrapper: compatibility preserved.

Wrapper cleanly delegates to 1536D. No issues.


214-233: OK: vector branch mirrors crawled_pages.

Identifier formatting and null filter are correct.


235-251: OK: full‑text search joins summary + content.

content_search_vector already indexes both; good.


283-307: Legacy wrapper: compatibility preserved.

Looks good.


316-321: Helpful function comments.

Clear docstrings for future maintainers.

migration/complete_setup.sql (7)

519-579: Vector branch and dynamic SQL look correct; placeholders align.

No correctness issues spotted.


580-582: Parameter order verification.

USING query_embedding, max_vector_results, max_text_results, filter, source_filter, query_text matches placeholders $1..$6.
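
A check like this can be automated for dynamic SQL strings. This is a rough Python sketch, assuming the query text is available as a string; the helper name and the sample query are hypothetical, and a plain regex scan would be misled by `$n` sequences inside string literals.

```python
import re
from typing import Sequence


def placeholders_match_args(sql: str, args: Sequence[str]) -> bool:
    """Return True if the SQL uses exactly $1..$N for N USING arguments."""
    used = {int(n) for n in re.findall(r"\$(\d+)", sql)}
    return used == set(range(1, len(args) + 1))


if __name__ == "__main__":
    # Argument order as listed in the review comment.
    using_args = [
        "query_embedding", "max_vector_results", "max_text_results",
        "filter", "source_filter", "query_text",
    ]
    # Hypothetical fragment, only to exercise the check.
    sample_sql = (
        "SELECT id, 1 - (embedding_1536 <=> $1) AS vector_sim FROM t "
        "WHERE metadata @> $4 AND ($5 IS NULL OR source_id = $5) "
        "AND to_tsvector(content) @@ plainto_tsquery($6) LIMIT $2 OFFSET $3"
    )
    print(placeholders_match_args(sample_sql, using_args))
```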


585-609: Legacy wrapper: OK.

Back-compat maintained.


653-716: OK: dynamic SQL mirrors crawled_pages; identifiers escaped.

Looks good.


717-719: Placeholder ordering verified.

USING argument list matches $1..$6.


722-746: Legacy wrapper: OK.

Back-compat maintained.


749-752: Doc comments helpful; keep consistent across migrations.

Good documentation.

@tazmon95 tazmon95 merged commit 85bd6bc into main Sep 18, 2025
8 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in Archon Sep 18, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Sep 22, 2025
24 tasks
leonj1 pushed a commit to leonj1/Archon that referenced this pull request Oct 13, 2025
@Wirasm Wirasm deleted the fix/multi-dimensional-vector-hybrid-search branch April 6, 2026 07:38
coleam00 added a commit that referenced this pull request Apr 7, 2026
…l-nav

fix: replace sidebar navigation with top-level tabs
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…top-level-nav

fix: replace sidebar navigation with top-level tabs
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…top-level-nav

fix: replace sidebar navigation with top-level tabs

Projects

Status: Done (In Stable)
