feat: add step-by-step migration strategy for multi-dimensional embeddings#683

Closed
Wirasm wants to merge 1 commit into main from feat/multi-dimension-embedding-migration-workaround

Conversation

Collaborator

@Wirasm Wirasm commented Sep 17, 2025

Pull Request

Summary

Adds an alternative migration strategy for users on Supabase free tier who encounter timeouts when running the full migration/upgrade_database.sql script. This PR provides a step-by-step approach that breaks the migration into smaller, manageable chunks.

Changes Made

  • Added MIGRATION_GUIDE.md with detailed instructions for handling Supabase SQL editor timeouts
  • Created 4 step-by-step SQL scripts that can be run individually:
    • step1_add_columns.sql - Adds new columns for multi-dimensional embeddings (~5 seconds)
    • step2_migrate_data.sql - Migrates existing data to new columns (~10 seconds)
    • step3_create_functions.sql - Creates search functions (~5 seconds)
    • step4_create_indexes_optional.sql - Creates vector indexes (may timeout - optional)
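For orientation, the Step 1 pattern can be sketched as below. This is a hedged illustration rather than the script's exact contents; the table and column names (archon_crawled_pages, embedding_1536, embedding_model, embedding_dimension) follow the file summaries elsewhere in this PR.

```sql
-- Sketch of the Step 1 approach: idempotent, transactional column additions.
-- Adding nullable columns rewrites no data, so this completes in seconds.
BEGIN;

ALTER TABLE archon_crawled_pages
  ADD COLUMN IF NOT EXISTS embedding_1536 VECTOR(1536),
  ADD COLUMN IF NOT EXISTS embedding_model TEXT,
  ADD COLUMN IF NOT EXISTS embedding_dimension INTEGER;

COMMIT;
```

Because every statement uses IF NOT EXISTS, the step stays safe to re-run if the SQL editor disconnects partway through.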

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

Migration scripts have been tested on Supabase free tier instances with large datasets (>10k documents) where the original upgrade_database.sql would timeout.

Checklist

  • My code follows the service architecture patterns
  • I have verified no regressions in existing features
  • I have updated relevant documentation

Breaking Changes

None - this is an alternative migration path that doesn't change the existing upgrade_database.sql approach.

Additional Notes

This migration strategy is specifically designed for:

  • Users on Supabase free tier with timeout limitations
  • Large datasets where vector index creation exceeds memory limits
  • Users who prefer a step-by-step approach to monitor migration progress

The migration can be run via:

  1. Supabase SQL editor (step by step)
  2. Direct database connection (psql, TablePlus, etc.)
  3. Supabase CLI

If Step 4 (index creation) times out, the system will still work using brute-force search, which is acceptable for smaller datasets (<10k documents).
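When the indexes are absent, pgvector falls back to a sequential scan. As a hedged sketch (assuming the embedding_1536 column created in the earlier steps), a brute-force cosine search is just an ORDER BY over the distance operator:

```sql
-- Without an IVFFLAT index, this ORDER BY triggers a full-table scan,
-- which pgvector handles correctly (just more slowly on large tables).
-- $1 stands for the query embedding parameter.
SELECT id, url, 1 - (embedding_1536 <=> $1) AS similarity
FROM archon_crawled_pages
WHERE embedding_1536 IS NOT NULL
ORDER BY embedding_1536 <=> $1
LIMIT 10;
```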

Summary by CodeRabbit

  • New Features

    • Added support for multiple embedding dimensions (including 1536 and others) with backward-compatible vector search.
    • Enhanced similarity search with flexible filtering and improved performance via optional indexes.
    • Safer, staged database migration with options for direct connection, CLI, or GUI tools.
    • Option to skip vector indexes for small datasets.
  • Documentation

    • Introduced a comprehensive migration guide with step-by-step instructions, verification steps, troubleshooting for timeouts/memory/permissions, and post-migration checks.

…dings

Adds alternative migration approach for users on Supabase free tier who encounter timeouts when running the full upgrade_database.sql script. Breaks migration into 4 manageable steps to avoid memory/timeout issues when creating vector indexes on large datasets.

coderabbitai Bot commented Sep 17, 2025

Walkthrough

Introduces a new migration guide and four SQL scripts enabling a staged database migration for multi-dimensional embeddings. Adds columns, migrates existing data, creates search functions with legacy wrappers, and optionally builds vector and metadata indexes. Includes guidance for execution paths and verification.

Changes

Cohort / File(s) — Summary

  • Documentation: Migration Guide (migration/MIGRATION_GUIDE.md)
    Adds a migration guide detailing four-step execution, alternative execution methods (psql, GUI tools, Supabase CLI), verification SQL, troubleshooting, and post-migration checks.
  • Schema Expansion, Step 1 (migration/step1_add_columns.sql)
    Adds embedding columns for 384/768/1024/1536/3072-dimension vectors and metadata fields (llm_chat_model, embedding_model, embedding_dimension) to archon_crawled_pages and archon_code_examples. Transactional and idempotent.
  • Data Migration & Cleanup, Step 2 (migration/step2_migrate_data.sql)
    Copies 1536-dimension embeddings from the legacy embedding column to embedding_1536, sets embedding_dimension and a default embedding_model, then drops the legacy column and related indexes if present. Transactional with safeguards.
  • Search Functions, Step 3 (migration/step3_create_functions.sql)
    Adds helper functions to detect an embedding's dimension and map it to a column, plus dimension-parameterized search functions and 1536-dimension compatibility wrappers for crawled pages and code examples. Transactional.
  • Optional Indexes, Step 4 (migration/step4_create_indexes_optional.sql)
    Creates IVFFLAT vector indexes for multiple dimensions on both tables and B-tree indexes on embedding metadata; adjusts session memory/timeout. Idempotent, sequential index creation.
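The optional Step 4 indexes follow the standard pgvector IVFFLAT form. A hedged sketch of one such index (the name and options mirror the suggestions quoted in the review comments):

```sql
-- One of the optional Step 4 indexes; IF NOT EXISTS keeps it idempotent.
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_1536
ON archon_crawled_pages USING ivfflat (embedding_1536 vector_cosine_ops)
WITH (lists = 100);

-- Fresh statistics help the planner use the new index effectively.
ANALYZE archon_crawled_pages;
```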

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Admin as Admin/Operator
  participant DB as PostgreSQL
  participant T1 as archon_crawled_pages
  participant T2 as archon_code_examples

  rect rgb(232,240,254)
  note over Admin,DB: Migration Flow (multi-path)
  Admin->>DB: Step 1: add embedding_* and metadata columns
  Admin->>DB: Step 2: migrate legacy embedding -> embedding_1536<br/>drop old column and indexes
  Admin->>DB: Step 3: create helper + multi search + legacy wrappers
  alt Optional
    Admin->>DB: Step 4: create IVFFLAT and B-tree indexes
  else Skipped or partial
    note over DB: Falls back to brute-force scan when IVFFLAT absent
  end
  end

  rect rgb(240,255,240)
  participant Client as App/Service
  note over Client,DB: Query Flow (search)
  Client->>DB: SELECT FROM match_archon_*_multi(query_embedding, dim, ...)
  DB->>DB: get_embedding_column_name(dim)
  DB->>T1: Scan/vector search on chosen embedding column (or T2)
  T1-->>DB: Top-k rows with distance
  DB-->>Client: Rows with similarity and metadata
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibbled through columns, five vectors wide,
Hopped past old indexes set aside.
New functions burrow, seeking near,
Optional IVFFLAT—no fear!
If timeouts loom, I pause my run—
Still find the carrots, one by one. 🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check — ✅ Passed. The title "feat: add step-by-step migration strategy for multi-dimensional embeddings" is concise, accurately describes the primary change (a stepwise migration strategy for multi-dimensional embeddings), and maps directly to the added MIGRATION_GUIDE.md and four migration scripts, so a reviewer scanning history will understand the main intent.
  • Description Check — ✅ Passed. The PR description follows the repository template and includes a Summary, explicit Changes Made, Type of Change with checked boxes, Affected Services, Testing notes, Checklist, Breaking Changes, and Additional Notes describing workflows and when Step 4 may be skipped; it sufficiently documents the four-step migration and target users. Concrete Test Evidence (specific commands and sample outputs or logs) would make verification easier, but the description is otherwise complete and informative for reviewers.
  • Docstring Coverage — ✅ Passed. No functions found in the changes; docstring coverage check skipped.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (17)
migration/step1_add_columns.sql (2)

11-21: Constrain allowed embedding_dimension values.

A CHECK guards accidental writes (e.g., 512) and helps query planners.

Append these constraints after the ALTER (either table-by-table or once per table):

 ALTER TABLE archon_crawled_pages
@@
   ADD COLUMN IF NOT EXISTS embedding_dimension INTEGER;
+
+-- Constrain dimension to known values
+ALTER TABLE archon_crawled_pages
+  ADD CONSTRAINT archon_crawled_pages_embedding_dimension_chk
+  CHECK (embedding_dimension IN (384, 768, 1024, 1536, 3072));

Repeat for archon_code_examples:

 ALTER TABLE archon_code_examples
@@
   ADD COLUMN IF NOT EXISTS embedding_dimension INTEGER;
+
+ALTER TABLE archon_code_examples
+  ADD CONSTRAINT archon_code_examples_embedding_dimension_chk
+  CHECK (embedding_dimension IN (384, 768, 1024, 1536, 3072));

6-8: maintenance_work_mem isn’t needed for ADD COLUMN.

Harmless but not used here; consider moving this to the indexing step only to avoid confusion.

migration/step2_migrate_data.sql (2)

16-20: Filter by schema in information_schema lookups.

Avoid false positives if similarly named tables exist in other schemas.

Apply:

-    WHERE table_name = 'archon_crawled_pages'
+    WHERE table_schema = 'public' AND table_name = 'archon_crawled_pages'
@@
-    WHERE table_name = 'archon_code_examples'
+    WHERE table_schema = 'public' AND table_name = 'archon_code_examples'

Also applies to: 41-45


29-34: Don’t guess the embedding_model.

Setting 'text-embedding-3-small' may be incorrect for legacy data (e.g., ada-002). Prefer leaving as-is or marking as 'legacy-1536' for later curation.

Apply:

-                embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
+                embedding_model = COALESCE(embedding_model, 'legacy-1536')

(Repeat for both tables.)

Also applies to: 53-58

migration/step3_create_functions.sql (4)

16-29: Use the helper to avoid duplicate CASE logic.

get_embedding_column_name exists; leverage it as single source of truth.

Apply:

 CREATE OR REPLACE FUNCTION get_embedding_column_name(dimension INTEGER)
 RETURNS TEXT AS $$
 BEGIN
-    CASE dimension
-        WHEN 384 THEN RETURN 'embedding_384';
-        WHEN 768 THEN RETURN 'embedding_768';
-        WHEN 1024 THEN RETURN 'embedding_1024';
-        WHEN 1536 THEN RETURN 'embedding_1536';
-        WHEN 3072 THEN RETURN 'embedding_3072';
-        ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', dimension;
-    END CASE;
+    CASE dimension
+        WHEN 384 THEN RETURN 'embedding_384';
+        WHEN 768 THEN RETURN 'embedding_768';
+        WHEN 1024 THEN RETURN 'embedding_1024';
+        WHEN 1536 THEN RETURN 'embedding_1536';
+        WHEN 3072 THEN RETURN 'embedding_3072';
+        ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', dimension;
+    END CASE;
 END;
 $$ LANGUAGE plpgsql IMMUTABLE;

(Keep function, then use it below.)


53-61: De-duplicate CASE blocks inside search functions.

Resolve column with the helper to reduce divergence risk.

Apply in both functions:

-  CASE embedding_dimension
-    WHEN 384 THEN embedding_column := 'embedding_384';
-    WHEN 768 THEN embedding_column := 'embedding_768';
-    WHEN 1024 THEN embedding_column := 'embedding_1024';
-    WHEN 1536 THEN embedding_column := 'embedding_1536';
-    WHEN 3072 THEN embedding_column := 'embedding_3072';
-    ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', embedding_dimension;
-  END CASE;
+  embedding_column := get_embedding_column_name(embedding_dimension);

Also applies to: 101-109


63-75: Compute distance once; order by the alias.

Saves recomputation and can help planners. Also guard against dimension mismatch.

Apply to both functions:

-  sql_query := format('
-    SELECT id, url, chunk_number, content, metadata, source_id,
-           1 - (%I <=> $1) AS similarity
+  -- Optional: verify query embedding dimension matches requested dimension
+  IF detect_embedding_dimension(query_embedding) <> embedding_dimension THEN
+    RAISE EXCEPTION 'Query embedding dimension (%) does not match requested dimension (%)',
+      detect_embedding_dimension(query_embedding), embedding_dimension;
+  END IF;
+
+  sql_query := format('
+    SELECT id, url, chunk_number, content, metadata, source_id,
+           (%I <=> $1) AS distance,
+           1 - (%I <=> $1) AS similarity
     FROM archon_crawled_pages
     WHERE (%I IS NOT NULL)
       AND metadata @> $3
       AND ($4 IS NULL OR source_id = $4)
-    ORDER BY %I <=> $1
+    ORDER BY distance
     LIMIT $2',
-    embedding_column, embedding_column, embedding_column);
+    embedding_column, embedding_column, embedding_column);

And for code examples:

-    SELECT id, url, chunk_number, content, summary, metadata, source_id,
-           1 - (%I <=> $1) AS similarity
+    SELECT id, url, chunk_number, content, summary, metadata, source_id,
+           (%I <=> $1) AS distance,
+           1 - (%I <=> $1) AS similarity
@@
-    ORDER BY %I <=> $1
+    ORDER BY distance
@@
-    embedding_column, embedding_column, embedding_column);
+    embedding_column, embedding_column, embedding_column);

Also applies to: 111-123


39-46: Prefer DOUBLE PRECISION over FLOAT for clarity.

Postgres maps float ambiguously; be explicit.

-  similarity FLOAT
+  similarity DOUBLE PRECISION

(Apply in all RETURNS TABLE signatures.)

Also applies to: 86-94

migration/step4_create_indexes_optional.sql (3)

16-55: Consider adding optional 3072D indexes (commented) for users of text-embedding-3-large.

Keeps parity with available columns; users can uncomment when needed.

Append after Index 8:

+-- (Optional) Index 9 of 10
+-- CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_3072
+-- ON archon_crawled_pages USING ivfflat (embedding_3072 vector_cosine_ops)
+-- WITH (lists = 100);
+
+-- (Optional) Index 10 of 10
+-- CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_3072
+-- ON archon_code_examples USING ivfflat (embedding_3072 vector_cosine_ops)
+-- WITH (lists = 100);

56-63: Add GIN indexes on metadata to accelerate JSONB filter.

metadata @> filter benefits heavily from GIN.

Add:

 CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_model ON archon_crawled_pages (embedding_model);
@@
 CREATE INDEX IF NOT EXISTS idx_archon_code_examples_llm_chat_model ON archon_code_examples (llm_chat_model);
+
+-- JSONB metadata indexes
+CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_metadata_gin ON archon_crawled_pages USING GIN (metadata);
+CREATE INDEX IF NOT EXISTS idx_archon_code_examples_metadata_gin ON archon_code_examples USING GIN (metadata);

10-12: Analyze after building IVFFLAT indexes.

IVFFLAT needs ANALYZE for good recall/perf; also consider CONCURRENTLY in production.

Apply:

 SET statement_timeout = '10min';
@@
 RESET maintenance_work_mem;
 RESET statement_timeout;
+
+-- Gather stats for planner / IVFFLAT
+ANALYZE archon_crawled_pages;
+ANALYZE archon_code_examples;

Note: If minimizing write locks is critical, consider CREATE INDEX CONCURRENTLY outside transactions (may run longer on free tier).

Also applies to: 64-66

migration/MIGRATION_GUIDE.md (6)

20-28: Explicitly require pgvector extension.

Avoids confusion when VECTOR type/functions are missing.

Add after “Direct Database Connection” section intro:

+> Prerequisite: Ensure pgvector is installed
+>
+> ```sql
+> CREATE EXTENSION IF NOT EXISTS vector;
+> ```

63-66: Clarify skipping indexes and performance caveat.

Mention 3072D specifically won’t be indexed by default; add ANALYZE note.

Add:

 If you have a small dataset (<10,000 documents), you can skip Step 4 entirely. The system will use brute-force search which is fast enough for small datasets.
+
+Note: 3072‑dimension embeddings (e.g., text‑embedding‑3‑large) are not indexed by default in Step 4. Expect brute‑force for 3072D unless you add those indexes later.

69-82: Verification query may count metadata indexes too; that’s OK—clarify expected ranges.

Minor doc tweak so users aren’t surprised by counts > 8.

- - `index_count`: 8+ (or 0 if you skipped Step 4)
+ - `index_count`: 8–14 (0 if you skipped Step 4), depending on which vector/metadata indexes you created

109-122: Recommend ANALYZE after creating IVFFLAT indexes.

Improves recall/performance immediately after indexing.

Add:

 2. **Test the system**:
@@
 3. **Monitor performance**:
    - If searches are slow without indexes, create them via direct connection
    - Consider using smaller embedding dimensions (384 or 768) for faster performance
+   - After creating IVFFLAT indexes, run:
+     ```sql
+     ANALYZE archon_crawled_pages;
+     ANALYZE archon_code_examples;
+     ```

91-99: Timeout guidance: prefer smaller batches and direct connection; avoid transaction pooler for long DDL.

Add a note about connecting via the Session pooler or direct port to support SET and long‑running ops.

 - Use direct database connection
 - Increase `statement_timeout` setting
+ - Ensure you connect via the Session pooler (not Transaction pooler) or directly to Postgres for long‑running DDL and `SET` commands

52-58: Minor: “db push” wording can confuse; clarify both workflows.

Clarify difference between pushing migration files vs. executing ad‑hoc SQL.

 # Login and link project
@@
-# Run migration
-supabase db push migration/upgrade_database_with_memory_fix.sql
+# Option A: apply local migrations in supabase/migrations
+# supabase db push
+# Option B: run a specific SQL file
+# supabase db query -f migration/upgrade_database_with_memory_fix.sql
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2ec7df and 1a0a850.

📒 Files selected for processing (5)
  • migration/MIGRATION_GUIDE.md (1 hunks)
  • migration/step1_add_columns.sql (1 hunks)
  • migration/step2_migrate_data.sql (1 hunks)
  • migration/step3_create_functions.sql (1 hunks)
  • migration/step4_create_indexes_optional.sql (1 hunks)

Comment on lines +59 to +61
# Run migration
supabase db push migration/upgrade_database_with_memory_fix.sql

⚠️ Potential issue

Supabase CLI command is incorrect for running a single SQL file.

Use db query -f (or psql -f) to execute a script; db push applies migration files from the migrations dir.

Apply:

-# Run migration
-supabase db push migration/upgrade_database_with_memory_fix.sql
+# Run a specific SQL script
+supabase db query -f migration/upgrade_database_with_memory_fix.sql
+
+# Or with psql
+# psql "$CONNECTION_STRING" -f migration/upgrade_database_with_memory_fix.sql

Committable suggestion skipped: line range outside the PR's diff.


Comment on lines +6 to +10
-- Increase memory for this session
SET maintenance_work_mem = '256MB';

BEGIN;


⚠️ Potential issue

Ensure pgvector extension is installed before using VECTOR.

Without CREATE EXTENSION vector, the type VECTOR(n) and ops won’t exist on some setups (esp. fresh DBs or new environments).

Apply this diff near the top (before BEGIN;):

 -- Increase memory for this session
-SET maintenance_work_mem = '256MB';
+SET maintenance_work_mem = '256MB';
+
+-- Ensure pgvector is available
+CREATE EXTENSION IF NOT EXISTS vector;

Comment on lines +21 to +38
IF crawled_pages_count > 0 THEN
    -- Detect dimension
    SELECT vector_dims(embedding) INTO dimension_detected
    FROM archon_crawled_pages
    WHERE embedding IS NOT NULL
    LIMIT 1;

    IF dimension_detected = 1536 THEN
        UPDATE archon_crawled_pages
        SET embedding_1536 = embedding,
            embedding_dimension = 1536,
            embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
        WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
    END IF;

    -- Drop old column
    ALTER TABLE archon_crawled_pages DROP COLUMN IF EXISTS embedding;
END IF;

⚠️ Potential issue

Data‑loss risk: non‑1536 embeddings are dropped.

If the old embedding column isn’t 1536D (e.g., 384/768/1024/3072), the script skips copying but still drops the column, losing data.

Replace the DO block with dimension‑aware copy and safe drop:

 DO $$
 DECLARE
-    crawled_pages_count INTEGER;
-    code_examples_count INTEGER;
-    dimension_detected INTEGER;
+    crawled_pages_count INTEGER;
+    code_examples_count INTEGER;
+    dimension_detected INTEGER;
 BEGIN
@@
-    IF crawled_pages_count > 0 THEN
-        -- Detect dimension
-        SELECT vector_dims(embedding) INTO dimension_detected
-        FROM archon_crawled_pages
-        WHERE embedding IS NOT NULL
-        LIMIT 1;
-
-        IF dimension_detected = 1536 THEN
-            UPDATE archon_crawled_pages
-            SET embedding_1536 = embedding,
-                embedding_dimension = 1536,
-                embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
-            WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
-        END IF;
-
-        -- Drop old column
-        ALTER TABLE archon_crawled_pages DROP COLUMN IF EXISTS embedding;
-    END IF;
+    IF crawled_pages_count > 0 THEN
+        SELECT vector_dims(embedding) INTO dimension_detected
+        FROM archon_crawled_pages
+        WHERE embedding IS NOT NULL
+        LIMIT 1;
+
+        IF dimension_detected IS NOT NULL THEN
+            CASE dimension_detected
+                WHEN 384 THEN
+                    UPDATE archon_crawled_pages
+                    SET embedding_384 = embedding,
+                        embedding_dimension = 384
+                    WHERE embedding IS NOT NULL AND embedding_384 IS NULL;
+                WHEN 768 THEN
+                    UPDATE archon_crawled_pages
+                    SET embedding_768 = embedding,
+                        embedding_dimension = 768
+                    WHERE embedding IS NOT NULL AND embedding_768 IS NULL;
+                WHEN 1024 THEN
+                    UPDATE archon_crawled_pages
+                    SET embedding_1024 = embedding,
+                        embedding_dimension = 1024
+                    WHERE embedding IS NOT NULL AND embedding_1024 IS NULL;
+                WHEN 1536 THEN
+                    UPDATE archon_crawled_pages
+                    SET embedding_1536 = embedding,
+                        embedding_dimension = 1536
+                    WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
+                WHEN 3072 THEN
+                    UPDATE archon_crawled_pages
+                    SET embedding_3072 = embedding,
+                        embedding_dimension = 3072
+                    WHERE embedding IS NOT NULL AND embedding_3072 IS NULL;
+                ELSE
+                    RAISE NOTICE 'Unsupported embedding dimension % in archon_crawled_pages; keeping old column', dimension_detected;
+            END CASE;
+        END IF;
+
+        -- Only drop after successful copy of a supported dimension
+        IF dimension_detected IN (384,768,1024,1536,3072) THEN
+            ALTER TABLE archon_crawled_pages DROP COLUMN IF EXISTS embedding;
+        END IF;
+    END IF;
@@
-    IF code_examples_count > 0 THEN
-        SELECT vector_dims(embedding) INTO dimension_detected
-        FROM archon_code_examples
-        WHERE embedding IS NOT NULL
-        LIMIT 1;
-
-        IF dimension_detected = 1536 THEN
-            UPDATE archon_code_examples
-            SET embedding_1536 = embedding,
-                embedding_dimension = 1536,
-                embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
-            WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
-        END IF;
-
-        ALTER TABLE archon_code_examples DROP COLUMN IF EXISTS embedding;
-    END IF;
+    IF code_examples_count > 0 THEN
+        SELECT vector_dims(embedding) INTO dimension_detected
+        FROM archon_code_examples
+        WHERE embedding IS NOT NULL
+        LIMIT 1;
+
+        IF dimension_detected IS NOT NULL THEN
+            CASE dimension_detected
+                WHEN 384 THEN
+                    UPDATE archon_code_examples
+                    SET embedding_384 = embedding,
+                        embedding_dimension = 384
+                    WHERE embedding IS NOT NULL AND embedding_384 IS NULL;
+                WHEN 768 THEN
+                    UPDATE archon_code_examples
+                    SET embedding_768 = embedding,
+                        embedding_dimension = 768
+                    WHERE embedding IS NOT NULL AND embedding_768 IS NULL;
+                WHEN 1024 THEN
+                    UPDATE archon_code_examples
+                    SET embedding_1024 = embedding,
+                        embedding_dimension = 1024
+                    WHERE embedding IS NOT NULL AND embedding_1024 IS NULL;
+                WHEN 1536 THEN
+                    UPDATE archon_code_examples
+                    SET embedding_1536 = embedding,
+                        embedding_dimension = 1536
+                    WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
+                WHEN 3072 THEN
+                    UPDATE archon_code_examples
+                    SET embedding_3072 = embedding,
+                        embedding_dimension = 3072
+                    WHERE embedding IS NOT NULL AND embedding_3072 IS NULL;
+                ELSE
+                    RAISE NOTICE 'Unsupported embedding dimension % in archon_code_examples; keeping old column', dimension_detected;
+            END CASE;
+        END IF;
+
+        IF dimension_detected IN (384,768,1024,1536,3072) THEN
+            ALTER TABLE archon_code_examples DROP COLUMN IF EXISTS embedding;
+        END IF;
+    END IF;
 END $$;

@coleam00 coleam00 mentioned this pull request Sep 20, 2025
24 tasks
@coleam00
Owner

Thanks for this Rasmus! I actually took this split and incorporated it into my latest PR that redoes a lot of our migrations/ folder setup - #718. I'll be closing this PR because of that but I did use this work directly.

@coleam00 coleam00 closed this Sep 20, 2025
@Wirasm Wirasm deleted the feat/multi-dimension-embedding-migration-workaround branch April 6, 2026 07:37
coleam00 added a commit that referenced this pull request Apr 7, 2026
…at-state

fix: welcoming empty chat state and suppress Disconnected/No project
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…empty-chat-state

fix: welcoming empty chat state and suppress Disconnected/No project
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…empty-chat-state

fix: welcoming empty chat state and suppress Disconnected/No project