
Feat/data ingestion system #3

Merged
SanghunYun95 merged 6 commits into main from feat/data-ingestion-system
Feb 26, 2026
Conversation

@SanghunYun95 (Owner) commented Feb 26, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved stability via URL safety validation and timeouts
    • Improved database error handling
  • New Features

    • Added retry logic for embedding generation
    • Automatic removal of Project Gutenberg boilerplate
    • Optimized data ingestion performance
  • Improvements

    • Switched to a local embedding model
    • Updated the database schema
    • Added dependency packages

… system

- Added download_books.py to scrape Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
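The resumable ingestion noted above hinges on chunk IDs being stable across runs. A minimal sketch using deterministic UUID v5 generation; the namespace and key format here are assumptions for illustration, not necessarily what ingest_data.py actually uses:

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.NAMESPACE_URL

def chunk_id(book_title: str, chunk_index: int) -> uuid.UUID:
    """Derive a stable chunk ID so re-running ingestion produces identical
    UUIDs, letting the resume logic skip chunks that already exist in the DB."""
    return uuid.uuid5(CHUNK_NAMESPACE, f"{book_title}:{chunk_index}")
```

Because uuid5 is a pure function of (namespace, name), the same book title and index always map to the same ID, unlike uuid4, which would create duplicates on every re-run.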
@coderabbitai

coderabbitai bot commented Feb 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b76c5f and 2daad6f.

📒 Files selected for processing (3)
  • download_books.py
  • scripts/check_db.py
  • scripts/ingest_data.py

📝 Walkthrough

Overview

This PR switches from Google Gemini-based embeddings to local FastEmbedEmbeddings (all-MiniLM-L6-v2), updates the database schema to support 384-dimension vectors, and hardens document ingestion and URL handling.
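The 384-dimension vector validation can be sketched as follows; the constant and function names are illustrative assumptions, not the PR's actual code:

```python
EXPECTED_DIM = 384  # output size of all-MiniLM-L6-v2

def validate_embedding(vector: list[float]) -> list[float]:
    """Reject vectors whose dimensionality doesn't match the vector(384) schema,
    so a mismatched model can't silently write unusable rows."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM}-dim embedding, got {len(vector)}")
    return vector
```

A check like this fails fast at ingestion time instead of surfacing later as a pgvector casting error.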

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Embedding service migration<br>app/services/embedding.py | Switched from Gemini-based embeddings to the local FastEmbedEmbeddings model; updated the model to sentence-transformers/all-MiniLM-L6-v2; added 384-dimension vector validation. |
| LLM configuration<br>app/services/llm.py | Updated the model reference in comments to gemini-2.5-flash; no runtime behavior change. |
| Dependency updates<br>requirements.txt | Added langchain-community and fastembed dependencies. |
| URL and network robustness<br>download_books.py | Replaced manual URL concatenation with urllib.parse.urljoin; added a 20-second timeout to all urlopen calls; added URL scheme validation (http/https) to harden against SSRF. |
| Database check improvements<br>scripts/check_db.py | Added detection of the missing-table scenario; when the table does not exist, prints migration guidance and exits cleanly. |
| Ingestion script hardening<br>scripts/ingest_all_data.py | Forces UTF-8 standard output to avoid Windows cp949 encoding errors; added deterministic sorting of the file list. |
| Document ingestion engine rewrite<br>scripts/ingest_data.py | Added an IngestionError exception class; switched to deterministic UUID v5 generation; added a Project Gutenberg boilerplate-stripping function; added an embedding-generation wrapper with retry logic; batch failure tracking and consolidated error reporting; added a limit parameter. |
| Database schema migrations<br>supabase/migrations/20260223065008_initialize_pgvector.sql, supabase/migrations/20260225112500_update_vector_dimension.sql, supabase/migrations/20260226140500_update_vector_to_mini_lm.sql | Updated the match_documents default (null → 10); downgraded the vector dimension from 3072 to 384; added match_count input validation (range [1, 200]); recreated the index and function. |
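The Project Gutenberg boilerplate removal listed for scripts/ingest_data.py can be sketched as below; the marker regexes and function behavior are assumptions, since the PR's actual implementation is not shown here:

```python
import re

# Hypothetical marker patterns; Gutenberg texts bracket the body with
# "*** START OF ... ***" and "*** END OF ... ***" lines.
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.I)
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.I)

def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the body between the START and END markers, if both are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text  # fall back to the original text when markers are absent
```

Stripping the license header and footer keeps them out of the chunked corpus, so they are never embedded or returned by vector search.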

Sequence Diagram

sequenceDiagram
    participant User
    participant Ingest as ingest_data.py
    participant Embed as EmbeddingService
    participant DB as Supabase DB

    User->>Ingest: ingest_document(text, philosopher, school, book_title)
    Ingest->>Ingest: strip_gutenberg_boilerplate(text)
    Ingest->>Ingest: chunk the text
    Ingest->>Ingest: check existing chunks (by title)

    loop for each chunk
        Ingest->>Embed: generate_embedding_with_retry(chunk_text)
        Embed->>Embed: load FastEmbedEmbeddings model
        Embed->>Embed: generate embedding (384 dimensions)
        alt success
            Embed-->>Ingest: embedding vector
        else failure (max_retries exceeded)
            Embed-->>Ingest: exception
            Ingest->>Ingest: accumulate failed_batches
        end
    end

    alt failed batches present
        Ingest->>Ingest: raise IngestionError
    else all batches succeeded
        Ingest->>DB: store embeddings + metadata
        DB->>DB: vector search available via match_documents
    end
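The generate_embedding_with_retry step in the diagram can be sketched as a generic backoff wrapper; the signature and delay policy here are assumptions, not the PR's exact code:

```python
import time

def generate_embedding_with_retry(embed_fn, text, max_retries=3, base_delay=1.0):
    """Call embed_fn(text), retrying with exponential backoff; re-raise the last
    error once max_retries is exhausted so the caller can record a failed batch."""
    for attempt in range(max_retries):
        try:
            return embed_fn(text)
        except Exception:
            if attempt == max_retries - 1:
                raise  # let the ingestion loop add this chunk to failed_batches
            time.sleep(base_delay * 2 ** attempt)
```

On the final failure the exception propagates, matching the diagram's "accumulate failed_batches" branch rather than swallowing errors silently.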

Estimated code review difficulty

🎯 4 (Complex) | ⏱️ ~45 min

Related PRs

🐰 Switched to a local model,
trimmed the vectors down to 384,
a fast and safe ingestion engine, done!
Retry, then verify,
the database is ready at last ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main change, improvements to the data ingestion system, and adequately represents the broad changes across files. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
download_books.py (1)

1-6: ⚠️ Potential issue | 🔴 Critical

Missing urllib.parse import causes a runtime error

The code uses urllib.parse.urljoin (lines 101, 123, 174), but the urllib.parse module is never imported. In Python 3, urllib.parse must be imported explicitly. As written, the code raises AttributeError: module 'urllib' has no attribute 'parse'.

🐛 Add the missing import
 import os
 import re
 import time
 import urllib.request
 import urllib.error
+import urllib.parse
 from html.parser import HTMLParser
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@download_books.py` around lines 1 - 6, Add the missing import for
urllib.parse so calls to urllib.parse.urljoin succeed; specifically, update the
module imports at the top (where urllib.request and urllib.error are imported)
to also import urllib.parse (or `from urllib.parse import urljoin`) so
references to urllib.parse.urljoin used in the code resolve without raising
AttributeError.
supabase/migrations/20260225112500_update_vector_dimension.sql (1)

12-35: ⚠️ Potential issue | 🟡 Minor

match_count handling is inconsistent with the other migrations.

In this migration, match_count defaults to null and has no bounds validation, whereas the 20260223 and 20260226 migrations use a default of 10 and validate the [1, 200] range.

Since migrations apply sequentially, the final state follows the 20260226 logic, but the intermediate state's null default can cause unexpected behavior.

♻️ Suggested fix for consistency
 create or replace function match_documents (
   query_embedding vector(3072),
-  match_count int DEFAULT null,
+  match_count int DEFAULT 10,
   filter jsonb DEFAULT '{}'
 ) returns table (
   id uuid,
   content text,
   metadata jsonb,
   similarity float
 )
 language plpgsql
 as $$
 begin
+  if match_count < 1 then
+    match_count := 1;
+  elsif match_count > 200 then
+    match_count := 200;
+  end if;
+
   return query
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/migrations/20260225112500_update_vector_dimension.sql` around lines
12 - 35, The match_documents function sets match_count DEFAULT null and lacks
boundary validation, which is inconsistent with other migrations; update the
function signature and body for match_documents to use match_count DEFAULT 10
and add validation enforcing 1 <= match_count <= 200 (returning an error or
normalizing to the bounds) before running the RETURN QUERY; reference the
function name match_documents and the parameter match_count when applying this
change so intermediate migration behavior matches others.
🧹 Nitpick comments (6)
scripts/check_db.py (1)

19-19: Use a bare raise instead of raise e.

raise e re-raises the exception from the current frame, which can obscure the original stack trace. Use a bare raise to preserve the original traceback.

♻️ Suggested fix
     else:
         print(f"Database connection or query error: {e}")
-        raise e
+        raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/check_db.py` at line 19, The exception handler in scripts/check_db.py
currently re-raises using "raise e", which loses the original traceback; locate
the exception handling block that uses "raise e" (the re-raise in the try/except
around the DB check) and replace "raise e" with a bare "raise" so the original
traceback is preserved when propagating the exception from the function/method
handling the DB check.
download_books.py (1)

173-174: The same scheme validation is recommended for the next_page URL

parser.next_page is also parsed from external HTML, so a malicious URL could be injected. Adding HTTP/HTTPS scheme validation is the safer option.

🛡️ Add scheme validation
         if parser.next_page:
             current_url = urllib.parse.urljoin(base_url, parser.next_page)
+            if not current_url.startswith(('https://', 'http://')):
+                print(f"Invalid next page URL scheme: {current_url}")
+                current_url = None
         else:
             current_url = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@download_books.py` around lines 173 - 174, parser.next_page is parsed from
external HTML and should be validated before use; update the logic that sets
current_url (where you currently call urllib.parse.urljoin(base_url,
parser.next_page)) to first parse parser.next_page via urllib.parse.urlparse and
ensure the scheme is 'http' or 'https' (reject or skip otherwise), then join
only if valid; refer to parser.next_page, current_url, base_url and
urllib.parse.urljoin/urlparse when making the change.
requirements.txt (1)

11-12: Keep dependency version pinning consistent.

No versions are specified for langchain-community and fastembed. That said, the current pattern in requirements.txt leaves most packages (fastapi, uvicorn, supabase, langchain, etc.) unpinned, with pydantic>=2.7.0 as the exception.

If pinning for reproducibility, apply version constraints consistently across all dependencies rather than pinning only the newly added packages, or otherwise keep the existing pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 11 - 12, requirements.txt currently lists
langchain-community and fastembed without version constraints which breaks
consistency; update the file to follow the project's dependency versioning
policy by either adding explicit version pins for both langchain-community and
fastembed (choose compatible versions) or removing version pins across all
packages to match the existing pattern—ensure you change the entries
"langchain-community" and "fastembed" accordingly so they follow the same
pinning scheme used by other dependencies.
scripts/ingest_data.py (3)

5-5: Remove the unused import hashlib

After the switch to uuid5-based ID generation, hashlib is no longer used.

🧹 Suggested fix
-import hashlib
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ingest_data.py` at line 5, Remove the unused import hashlib from the
top of scripts/ingest_data.py; since ID generation now uses uuid.uuid5 (or the
uuid5-based function in this module), delete the hashlib import line and any
dead references to it (none expected), leaving only the necessary uuid-related
imports and functions (e.g., the uuid5-based ID generator) to eliminate the
unused-import warning.

113-132: Optional improvement to the existing-chunk check

The current implementation fetches every document for a given philosopher and then filters in Python. This can be inefficient for large datasets.

If Supabase supports nested JSONB queries, server-side filtering is worth considering:

♻️ Optional improvement: server-side filtering
# If compound JSONB filtering is available
res = supabase_client.table('documents') \
    .select('metadata') \
    .eq("metadata->>'scholar'", philosopher) \
    .eq("metadata->'book_info'->>'title'", book_info.get('title')) \
    .execute()

Note that Supabase's JSONB query support should be verified first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ingest_data.py` around lines 113 - 132, The current resume logic
populates existing_indices by fetching all documents for a philosopher and
filtering in Python, which is inefficient for large datasets; update the
supabase_client query in the block that calls
supabase_client.table('documents').select('metadata').eq("metadata->>'scholar'",
philosopher).execute() to add a server-side JSONB filter for the book title
(e.g., an additional .eq("metadata->'book_info'->>'title'",
book_info.get('title')) if Supabase JSONB querying is supported) so only
matching rows are returned and existing_indices is built from that smaller
result set; keep the existing exception handling path if the JSONB operator is
not supported.
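Whichever filtering path is used, the resume step itself reduces to skipping indices that are already stored. A minimal sketch, assuming existing_indices is a set of chunk indices already present in the database (the helper name is hypothetical):

```python
def select_missing_chunks(chunks, existing_indices):
    """Yield (index, chunk) pairs that are not yet in the database,
    so ingestion only embeds and inserts the missing chunks."""
    for i, chunk in enumerate(chunks):
        if i not in existing_indices:
            yield i, chunk
```

Keeping this as a pure function makes the resume behavior easy to unit-test independently of Supabase.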

136-137: The batch-failure tracking and error reporting logic is sound.

Tracking failed batches and ultimately raising a RuntimeError guarantees ingestion integrity.

Static analysis raised a TRY003 warning (long message specified outside the exception class). Optionally, a custom exception class can be defined.

♻️ Optional improvement: use a custom exception class
class IngestionError(Exception):
    """Raised when data ingestion fails."""
    def __init__(self, failed_batches):
        self.failed_batches = failed_batches
        super().__init__(f"Ingestion incomplete. Failed batches: {failed_batches}")
-    if failed_batches:
-        raise RuntimeError(f"Ingestion incomplete. Failed batches: {failed_batches}")
+    if failed_batches:
+        raise IngestionError(failed_batches)

Also applies to: 194-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/ingest_data.py` around lines 136 - 137, Declare a small custom
exception class (e.g., IngestionError) that accepts the failed_batches list and
constructs its long message inside its __init__, then replace raising
RuntimeError with raising IngestionError(failed_batches); update the places
using the failed_batches variable (the failed_batches declaration and the raise
site currently around the later block that checks it) to raise the new
IngestionError so the long message is created inside the exception class rather
than outside (reference symbols: failed_batches and IngestionError).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@download_books.py`:
- Around line 121-128: The parsed txt_url (parser.txt_url) is taken from
external HTML and passed to urllib.parse.urljoin(base_url, txt_url) then
fetched, which allows absolute URLs with dangerous schemes (e.g., file://,
ftp://) to bypass base_url; to fix, parse txt_url with urllib.parse.urlparse
before joining and allow only safe or relative schemes—e.g., permit scheme == ""
(relative) or scheme in ("http","https")—and skip/log and continue for any other
scheme; apply this check just before the urljoin/Request/urlopen block that uses
txt_url and ensure only validated URLs are passed to
urllib.request.Request/urllib.request.urlopen.

In `@supabase/migrations/20260225112500_update_vector_dimension.sql`:
- Line 5: The migration changes the recreated function
match_documents(vector(3072), int, jsonb) to set match_count DEFAULT NULL and
removes the 1..200 range validation, which deviates from migration 20260223;
restore the original behavior by setting the parameter default back to DEFAULT
10 (or the intended non-null default) and reintroduce the range check logic that
enforces match_count between 1 and 200 (e.g., validate and clamp or raise an
error) inside the function body so LIMIT uses a safe integer and the same
validation policy is preserved across migrations.

In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql`:
- Around line 8-20: The current migration DO block truncates the documents table
whenever the embedding column exists, which risks data loss; update the guard in
the DO $$...$$ block that references table "documents", column "embedding" and
the TRUNCATE statement to require an explicit safety check before truncation
(e.g., verify a runtime flag/environment setting or a custom Postgres GUC like
"migration.safe_to_truncate" is true, check that the row count is zero or prompt
for a confirmation marker, and RAISE EXCEPTION or skip truncation if not set),
and add a clear RAISE NOTICE explaining why truncation was skipped so CI/CD or
operators can see the decision rather than silently truncating data.

---

Outside diff comments:
In `@download_books.py`:
- Around line 1-6: Add the missing import for urllib.parse so calls to
urllib.parse.urljoin succeed; specifically, update the module imports at the top
(where urllib.request and urllib.error are imported) to also import urllib.parse
(or `from urllib.parse import urljoin`) so references to urllib.parse.urljoin
used in the code resolve without raising AttributeError.

In `@supabase/migrations/20260225112500_update_vector_dimension.sql`:
- Around line 12-35: The match_documents function sets match_count DEFAULT null
and lacks boundary validation, which is inconsistent with other migrations;
update the function signature and body for match_documents to use match_count
DEFAULT 10 and add validation enforcing 1 <= match_count <= 200 (returning an
error or normalizing to the bounds) before running the RETURN QUERY; reference
the function name match_documents and the parameter match_count when applying
this change so intermediate migration behavior matches others.

---

Nitpick comments:
In `@download_books.py`:
- Around line 173-174: parser.next_page is parsed from external HTML and should
be validated before use; update the logic that sets current_url (where you
currently call urllib.parse.urljoin(base_url, parser.next_page)) to first parse
parser.next_page via urllib.parse.urlparse and ensure the scheme is 'http' or
'https' (reject or skip otherwise), then join only if valid; refer to
parser.next_page, current_url, base_url and urllib.parse.urljoin/urlparse when
making the change.

In `@requirements.txt`:
- Around line 11-12: requirements.txt currently lists langchain-community and
fastembed without version constraints which breaks consistency; update the file
to follow the project's dependency versioning policy by either adding explicit
version pins for both langchain-community and fastembed (choose compatible
versions) or removing version pins across all packages to match the existing
pattern—ensure you change the entries "langchain-community" and "fastembed"
accordingly so they follow the same pinning scheme used by other dependencies.

In `@scripts/check_db.py`:
- Line 19: The exception handler in scripts/check_db.py currently re-raises
using "raise e", which loses the original traceback; locate the exception
handling block that uses "raise e" (the re-raise in the try/except around the DB
check) and replace "raise e" with a bare "raise" so the original traceback is
preserved when propagating the exception from the function/method handling the
DB check.

In `@scripts/ingest_data.py`:
- Line 5: Remove the unused import hashlib from the top of
scripts/ingest_data.py; since ID generation now uses uuid.uuid5 (or the
uuid5-based function in this module), delete the hashlib import line and any
dead references to it (none expected), leaving only the necessary uuid-related
imports and functions (e.g., the uuid5-based ID generator) to eliminate the
unused-import warning.
- Around line 113-132: The current resume logic populates existing_indices by
fetching all documents for a philosopher and filtering in Python, which is
inefficient for large datasets; update the supabase_client query in the block
that calls
supabase_client.table('documents').select('metadata').eq("metadata->>'scholar'",
philosopher).execute() to add a server-side JSONB filter for the book title
(e.g., an additional .eq("metadata->'book_info'->>'title'",
book_info.get('title')) if Supabase JSONB querying is supported) so only
matching rows are returned and existing_indices is built from that smaller
result set; keep the existing exception handling path if the JSONB operator is
not supported.
- Around line 136-137: Declare a small custom exception class (e.g.,
IngestionError) that accepts the failed_batches list and constructs its long
message inside its __init__, then replace raising RuntimeError with raising
IngestionError(failed_batches); update the places using the failed_batches
variable (the failed_batches declaration and the raise site currently around the
later block that checks it) to raise the new IngestionError so the long message
is created inside the exception class rather than outside (reference symbols:
failed_batches and IngestionError).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78c7aac and 2b76c5f.

⛔ Files ignored due to path filters (1)
  • push-error.log is excluded by !**/*.log
📒 Files selected for processing (11)
  • app/services/embedding.py
  • app/services/llm.py
  • download_books.py
  • requirements.txt
  • scripts/check_db.py
  • scripts/ingest_all_data.py
  • scripts/ingest_data.py
  • supabase/migrations/20260223065008_initialize_pgvector.sql
  • supabase/migrations/20260225112500_update_vector_dimension.sql
  • supabase/migrations/20260226140500_update_vector_to_mini_lm.sql
  • verify_and_clear.py

Comment thread: supabase/migrations/20260225112500_update_vector_dimension.sql

-- 1. Drop the existing function
DROP FUNCTION IF EXISTS match_documents(vector(1536), int, jsonb);
DROP FUNCTION IF EXISTS match_documents(vector(3072), int, jsonb);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all migration files and examine match_documents signatures

echo "=== Finding migration files ==="
git ls-files supabase/migrations/*.sql | sort

echo ""
echo "=== Migration files with match_documents ==="
fd -e sql . supabase/migrations | sort | while read f; do
  if rg -q "match_documents" "$f"; then
    echo "--- $f ---"
    cat -n "$f"
  fi
done

Repository: SanghunYun95/philo-rag

Length of output: 6754


match_count default and range check removed - confirm the behavior change

The DROP statement signatures exactly match the 20260223 migration, so they are correct. However, recreating the function introduces two problems:

  1. match_count DEFAULT null: the previous migration (20260223) used DEFAULT 10, but this migration changes it to DEFAULT null. A null value in a SQL LIMIT clause can cause unexpected behavior.

  2. Range check removed: the previous function clamped match_count to the 1-200 range, but the recreated function drops that validation entirely.

Confirm whether these changes are intentional. If not, either preserve the 20260223 behavior or apply a consistent policy to the 20260225141500 migration as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/migrations/20260225112500_update_vector_dimension.sql` at line 5,
The migration changes the recreated function match_documents(vector(3072), int,
jsonb) to set match_count DEFAULT NULL and removes the 1..200 range validation,
which deviates from migration 20260223; restore the original behavior by setting
the parameter default back to DEFAULT 10 (or the intended non-null default) and
reintroduce the range check logic that enforces match_count between 1 and 200
(e.g., validate and clamp or raise an error) inside the function body so LIMIT
uses a safe integer and the same validation policy is preserved across
migrations.

Comment on lines 8 to +20
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
TRUNCATE TABLE documents;
DO $$
BEGIN
    -- This is a guard to prevent accidental truncation in production CI/CD.
    -- In a real scenario, you'd check a configuration or role here.
    -- For now, we explicitly execute it but wrap it to highlight the danger.
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name='documents' AND column_name='embedding'
    ) THEN
        TRUNCATE TABLE documents;
    END IF;
END $$;

⚠️ Potential issue | 🟠 Major

⚠️ Risk of production data loss - strengthen the safety guard.

The current guard only checks that the column exists and then unconditionally runs TRUNCATE. If executed in production by mistake, all document data is deleted.

Consider a safer approach:

  1. Block production execution via an environment variable or configuration
  2. Require explicit confirmation when data is present
  3. Manage production migrations separately in the CI/CD pipeline

🛡️ Safer guard suggestion
 DO $$
+DECLARE
+    row_count INTEGER;
 BEGIN
-    -- This is a guard to prevent accidental truncation in production CI/CD.
-    -- In a real scenario, you'd check a configuration or role here.
-    -- For now, we explicitly execute it but wrap it to highlight the danger.
     IF EXISTS (
         SELECT 1 FROM information_schema.columns 
         WHERE table_name='documents' AND column_name='embedding'
     ) THEN
+        SELECT COUNT(*) INTO row_count FROM documents;
+        IF row_count > 0 THEN
+            RAISE NOTICE 'WARNING: Truncating % existing documents', row_count;
+            -- Consider: RAISE EXCEPTION for production environments
+        END IF;
         TRUNCATE TABLE documents;
     END IF;
 END $$;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
TRUNCATE TABLE documents;
DO $$
BEGIN
    -- This is a guard to prevent accidental truncation in production CI/CD.
    -- In a real scenario, you'd check a configuration or role here.
    -- For now, we explicitly execute it but wrap it to highlight the danger.
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name='documents' AND column_name='embedding'
    ) THEN
        TRUNCATE TABLE documents;
    END IF;
END $$;

After:
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
DO $$
DECLARE
    row_count INTEGER;
BEGIN
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name='documents' AND column_name='embedding'
    ) THEN
        SELECT COUNT(*) INTO row_count FROM documents;
        IF row_count > 0 THEN
            RAISE NOTICE 'WARNING: Truncating % existing documents', row_count;
            -- Consider: RAISE EXCEPTION for production environments
        END IF;
        TRUNCATE TABLE documents;
    END IF;
END $$;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql` around lines
8 - 20, The current migration DO block truncates the documents table whenever
the embedding column exists, which risks data loss; update the guard in the DO
$$...$$ block that references table "documents", column "embedding" and the
TRUNCATE statement to require an explicit safety check before truncation (e.g.,
verify a runtime flag/environment setting or a custom Postgres GUC like
"migration.safe_to_truncate" is true, check that the row count is zero or prompt
for a confirmation marker, and RAISE EXCEPTION or skip truncation if not set),
and add a clear RAISE NOTICE explaining why truncation was skipped so CI/CD or
operators can see the decision rather than silently truncating data.

@SanghunYun95 SanghunYun95 merged commit 4fee493 into main Feb 26, 2026
1 check was pending