Feat/data ingestion system#4

Merged
SanghunYun95 merged 6 commits into main from feat/data-ingestion-system
Feb 26, 2026
Conversation

SanghunYun95 (Owner) commented Feb 26, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added similarity scores to search results
    • Automatic removal of Project Gutenberg boilerplate
    • Batch-failure tracking and detailed error reporting
  • Bug Fixes

    • Improved download reliability (added timeouts and URL validation)
    • Hardened database migration error handling
    • Ensured a consistent data-processing order
  • Improvements

    • Updated embedding technology
    • Added dependency (langchain-community)

… system

- Added download_books.py to scrape Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

Summary

This pull request implements several system updates, including embedding validation, URL security handling, a data-ingestion redesign, and improvements to database migrations and error handling. The vector dimension was updated from 1536 to 3072, and finally to 384.

Changes

Area / File(s) | Summary
Embedding service
app/services/embedding.py
Added embedding-length validation (length must be 384). Raises ValueError on a length mismatch, otherwise returns the embedding.
LLM service
app/services/llm.py
Updated the model-version comment to gemini-2.5-flash. No behavioral change.
Book download and URL handling
download_books.py
Standardized URL construction using urllib.parse, added a 20-second timeout to all urlopen calls, and strengthened SSRF protection with http/https scheme validation.
Dependencies
requirements.txt
Added langchain-community.
Database check
scripts/check_db.py
Changed the row-presence message to "found at least 1 row". Improved exception handling for a missing table (detects code 42P01).
Data ingestion driver
scripts/ingest_all_data.py
Ingests TXT files in sorted order to guarantee a deterministic processing order. Changed the progress separator to a plain string.
Ingestion logic
scripts/ingest_data.py
Added an IngestionError exception class. Implemented deterministic UUID generation based on UUID v5. Added a Gutenberg boilerplate-removal function. Implemented a retry mechanism. Checks for existing chunks, tracks batch failures, and tightens metadata filtering.
Database migration
supabase/migrations/20260223065008_initialize_pgvector.sql
Changed the match_documents default from null to 10. Added clamping logic limiting match_count to the 1-200 range.
Vector dimension update
supabase/migrations/20260225112500_update_vector_dimension.sql
Changed the embedding dimension from vector(1536) to vector(3072). Updated the documents table and the match_documents function signature.
Vector update to MiniLM
supabase/migrations/20260226140500_update_vector_to_mini_lm.sql
Added a conditional TRUNCATE TABLE. Extended the match_documents return schema to id, content, metadata, similarity. Kept the match_count default of 10 and added range clamping.
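The deterministic UUID v5 generation noted for scripts/ingest_data.py can be sketched as follows. The namespace and key format here are illustrative assumptions; the script's actual values may differ.

```python
import uuid

# Assumed namespace; the script may use its own uuid.uuid5-derived namespace.
NAMESPACE = uuid.NAMESPACE_URL

def chunk_id(source_path: str, chunk_index: int) -> str:
    """Derive a deterministic UUID for a chunk so re-running ingestion
    upserts the same rows instead of inserting duplicates."""
    return str(uuid.uuid5(NAMESPACE, f"{source_path}#{chunk_index}"))
```

Because the same inputs always yield the same ID, ingestion becomes resumable: a rerun can skip or upsert chunks whose IDs already exist.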

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Possibly related PRs

🐰 Vectors dance and embeddings are validated,
Chunks are sorted and errors are tracked!
Dimensions shift and the data flows,
Books full of philosophy shine all the brighter. ✨📚

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the PR's main changes, capturing the core addition of the data-ingestion system.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/data-ingestion-system

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@download_books.py`:
- Around line 124-128: The current URL creation and SSRF checks are
insufficient: replace urljoin(base_url, txt_url) with urljoin(book_url, txt_url)
(and urljoin(current_url, parser.next_page) for pagination) so relative paths
resolve against the actual page, and strengthen the SSRF guard in the code that
currently only checks txt_url.startswith(('http://','https://')) by validating
the parsed URL's scheme and hostname against an allowlist and rejecting
private/internal IPs and loopback addresses; use urllib.parse.urlparse on the
resolved URL to get netloc, resolve the hostname to an IP and check with the
ipaddress module to deny 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
and link-local 169.254.0.0/16, and log/skip the URL (return downloaded_count)
when it fails these checks (update the validation around txt_url,
parser.next_page and any other places that fetch external resources).

In `@requirements.txt`:
- Line 11: Pin LangChain-related dependencies in requirements.txt: replace the
unversioned entries for langchain, langchain-google-genai, and
langchain-community with the recommended, compatible versions so installs are
reproducible; specifically set langchain-core to >=1.2.5,<2.0.0,
langchain-classic to >=1.0.0,<2.0.0, langchain-community to ==0.4.1, and
langchain-google-genai to >=4.2.1,<5.0.0 (update the existing lines for
"langchain", "langchain-google-genai", and "langchain-community" accordingly).

In `@scripts/check_db.py`:
- Around line 15-16: The script prints a message when the 'documents' table is
missing but calls sys.exit(0), which incorrectly signals success; update the
exit behavior in scripts/check_db.py where the missing-table branch uses
sys.exit(0) (the print + sys.exit call) to instead exit with a non-zero code
(e.g., sys.exit(1) or raise SystemExit(1)) so CI/deploy will treat the check as
failed; make this change in the block that detects the 'documents' table absence
(the print("Table 'documents' does not exist yet. Please run migrations.") +
sys.exit(...) sequence).

In `@scripts/ingest_data.py`:
- Around line 142-143: Currently only upsert exceptions are added to
failed_batches, so partial failures (chunk-processing failures) are missed;
when catching chunk-processing exceptions in the processing loop (the chunk
handling around the current except block), also append the batch and chunk
identifiers (e.g., batch_id, chunk_index) plus the exception info to
failed_batches, and make the final aggregation (the upsert/commit success
check) consult failed_batches; relevant symbols: failed_batches, the
chunk-processing loop (the try/except block where exceptions occur), and the
upsert exception-handling block (the code that currently records only upsert
failures); apply the same failure logging/aggregation policy to all of them.

In `@supabase/migrations/20260223065008_initialize_pgvector.sql`:
- Around line 28-32: In the current clamp logic, if match_count is NULL every
branch is skipped and the query ends up with LIMIT NULL; fix this by
initializing match_count before the comparison, e.g., insert match_count :=
COALESCE(match_count, 10); or replace NULL with 10 immediately before the
existing range-check block so the value is clamped to 1~200 (reference
symbols: match_count and the existing if ... elsif ... end if block).

In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql`:
- Line 29: The match_count parameter can be NULL and bypass your clamp/LIMIT
logic; normalize it at the top of the function with COALESCE so downstream
comparisons and the "LIMIT match_count" use a non-NULL value. For example,
immediately after function entry assign match_count := COALESCE(match_count, 10)
(or COALESCE(match_count, <existing default>)) and update the clamp/comparison
block referenced around lines 40-44 to use that normalized match_count variable.
- Around line 9-20: The DO $$ block currently performs a destructive TRUNCATE
TABLE documents when only the embedding column exists; change it to fail-closed
by requiring an explicit allow flag before running TRUNCATE: add a guard that
reads a trusted indicator (e.g., a session/config setting or a control column)
and RAISE EXCEPTION unless that flag is explicitly true, then only execute
TRUNCATE TABLE documents when the flag is set and the embedding column exists;
update the block that contains TRUNCATE TABLE documents to check this explicit
allow variable (or current_setting) and abort otherwise.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fee493 and 2daad6f.

📒 Files selected for processing (10)
  • app/services/embedding.py
  • app/services/llm.py
  • download_books.py
  • requirements.txt
  • scripts/check_db.py
  • scripts/ingest_all_data.py
  • scripts/ingest_data.py
  • supabase/migrations/20260223065008_initialize_pgvector.sql
  • supabase/migrations/20260225112500_update_vector_dimension.sql
  • supabase/migrations/20260226140500_update_vector_to_mini_lm.sql

Comment thread download_books.py
Comment on lines +124 to +128
        txt_url = urllib.parse.urljoin(base_url, txt_url)
        # Verify valid scheme to prevent SSRF
        if not txt_url.startswith(('http://', 'https://')):
            print(f"Skipping invalid URL scheme: {txt_url}")
            return downloaded_count

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n download_books.py | head -200

Repository: SanghunYun95/philo-rag

Length of output: 7730


🏁 Script executed:

wc -l download_books.py

Repository: SanghunYun95/philo-rag

Length of output: 87


🏁 Script executed:

sed -n '100,190p' download_books.py

Repository: SanghunYun95/philo-rag

Length of output: 3162


🏁 Script executed:

sed -n '1,50p' download_books.py

Repository: SanghunYun95/philo-rag

Length of output: 1586


🏁 Script executed:

rg -n "base_url|book_url|ALLOWED_HOSTS" download_books.py

Repository: SanghunYun95/philo-rag

Length of output: 530


Scheme checking alone is insufficient SSRF protection, and urljoin uses the wrong base URL.

Issues:

  1. The http:// / https:// checks at lines 126-128 and 180-182 cannot block internal-IP attacks such as https://169.254.169.254/... or https://127.0.0.1/....

  2. Line 124 uses urljoin(base_url, txt_url), but txt_url is a relative path parsed from the book page's HTML, so it must be joined against book_url. The current code can produce incorrect URLs.

  3. Line 179 likewise should use urljoin(current_url, parser.next_page) instead of urljoin(base_url, parser.next_page).

Suggested fix:

Add a host allowlist and fix the urljoin base URLs
+ALLOWED_HOSTS = {"www.gutenberg.org", "gutenberg.org"}
+
+def is_allowed_url(url: str) -> bool:
+    parsed = urllib.parse.urlparse(url)
+    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS
+
 def download_book(book, data_dir, base_url, target_count, downloaded_count):
     href = book['href']
     if not href.startswith('/ebooks/'):
@@ 
     txt_url = parser.txt_url
     if txt_url:
-        txt_url = urllib.parse.urljoin(base_url, txt_url)
-        # Verify valid scheme to prevent SSRF
-        if not txt_url.startswith(('http://', 'https://')):
-            print(f"Skipping invalid URL scheme: {txt_url}")
+        txt_url = urllib.parse.urljoin(book_url, txt_url)
+        if not is_allowed_url(txt_url):
+            print(f"Skipping disallowed URL: {txt_url}")
             return downloaded_count
@@ 
     if parser.next_page:
-        current_url = urllib.parse.urljoin(base_url, parser.next_page)
-        if not current_url.startswith(('http://', 'https://')):
-            print(f"Invalid next page URL scheme: {current_url}")
+        current_url = urllib.parse.urljoin(current_url, parser.next_page)
+        if not is_allowed_url(current_url):
+            print(f"Invalid next page URL: {current_url}")
             current_url = None
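The reviewer's stronger guard, combining the host allowlist with resolution-time checks via the ipaddress module, could look like the sketch below. The allowlist contents and the function name are assumptions for illustration, not the repository's actual code.

```python
import ipaddress
import socket
import urllib.parse

# Assumed allowlist; adjust to whichever hosts the scraper should reach.
ALLOWED_HOSTS = {"www.gutenberg.org", "gutenberg.org"}

def is_safe_url(url: str) -> bool:
    """Accept only http(s) URLs on the allowlist whose host does not
    resolve to a loopback, private, or link-local address."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        return False
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    try:
        # Reject hosts that resolve to internal address ranges (SSRF guard).
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_loopback or ip.is_private or ip.is_link_local:
                return False
    except (socket.gaierror, ValueError):
        return False
    return True
```

Resolving the hostname closes the gap where an allowlisted-looking URL still points at an internal address (e.g., via DNS tricks), which a pure string check cannot catch.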

Comment thread requirements.txt
pydantic>=2.7.0
pydantic-settings
python-dotenv
langchain-community

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "requirements.txt" -type f

Repository: SanghunYun95/philo-rag

Length of output: 84


🌐 Web query:

langchain langchain-community langchain-google-genai compatible version combinations official

💡 Result:

Officially supported/compatible version combinations are the ones that satisfy each package’s declared dependency ranges (i.e., let pip’s resolver pick versions that meet them).

Current (LangChain 1.x) compatibility (Python)

  • langchain-community==0.4.1 requires:
    • langchain-core>=1.0.1,<2.0.0
    • langchain-classic>=1.0.0,<2.0.0 [2]
  • langchain-google-genai==4.2.1 requires:
    • langchain-core>=1.2.5,<2.0.0 (and google-genai>=1.56.0,<2.0.0) [3]
  • LangChain 1.0 GA announcement (context: “LangChain 1.0”) notes legacy functionality moved to langchain-classic. [1]
  • langchain-classic==1.0.1 requires:
    • langchain-core>=1.2.5,<2.0.0 [4]

So, a safe “works together” set is:

langchain-core>=1.2.5,<2.0.0
langchain-classic>=1.0.0,<2.0.0
langchain-community==0.4.1
langchain-google-genai>=4.2.1,<5.0.0

Because both langchain-community and langchain-google-genai are aligned on langchain-core<2.0.0 (and langchain-google-genai specifically needs langchain-core>=1.2.5), any combination within those bounds is “officially compatible” per their published metadata. [2][3][4]


🏁 Script executed:

cat -n ./requirements.txt

Repository: SanghunYun95/philo-rag

Length of output: 319


LangChain-related dependencies need version pinning.

langchain, langchain-google-genai, and langchain-community (lines 4, 5, 11) in the current requirements.txt are all unversioned, so incompatible version combinations can be installed depending on when installation runs. For reproducible deployments, apply the officially recommended compatible versions:

langchain-core>=1.2.5,<2.0.0
langchain-classic>=1.0.0,<2.0.0
langchain-community==0.4.1
langchain-google-genai>=4.2.1,<5.0.0

Comment thread scripts/check_db.py
Comment on lines +15 to +16
        print("Table 'documents' does not exist yet. Please run migrations.")
        sys.exit(0)

⚠️ Potential issue | 🟠 Major

Exiting with the success code (0) when the table is missing lets the check pass as a false positive.

At lines 15-16, the script prints guidance about unapplied migrations but returns sys.exit(0), so a situation that should fail in CI/deploy is treated as success. For a check script, a non-zero exit code is the safe choice.

Suggested fix
-        print("Table 'documents' does not exist yet. Please run migrations.")
-        sys.exit(0)
+        print("Table 'documents' does not exist yet. Please run migrations.")
+        sys.exit(1)
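A minimal sketch of the exit-code policy, including the 42P01 (undefined table) detection mentioned in the walkthrough. The function name and call shape are illustrative, not the script's actual code.

```python
UNDEFINED_TABLE = "42P01"  # PostgreSQL SQLSTATE for a missing relation

def exit_code_for(pgcode, rows_found):
    """Map a DB check result to a process exit code. A missing table or an
    empty table is a failure (non-zero) so CI/deploy treats it as such."""
    if pgcode == UNDEFINED_TABLE:
        print("Table 'documents' does not exist yet. Please run migrations.")
        return 1
    return 0 if rows_found else 1

# In the script: sys.exit(exit_code_for(err_code, row_count > 0))
```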

Comment thread scripts/ingest_data.py
Comment on lines +142 to +143
failed_batches = []


⚠️ Potential issue | 🟠 Major

Aggregating only batch upsert failures misses partial failures.

Lines 200-203 add only upsert exceptions to failed_batches. Chunk-processing failures (lines 189-190) are not aggregated, so a run can finish without raising even when some chunks were lost.

🛠️ Suggested fix
     BATCH_SIZE = 100
     failed_batches = []
@@
                 except Exception as exc:
                     print(f"Chunk {idx} completely failed: {exc}")
+                    failed_batches.append(
+                        (i // BATCH_SIZE + 1, f"chunk {idx} failed: {exc}")
+                    )
@@
     if failed_batches:
         raise IngestionError(failed_batches)

Also applies to: 200-203
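The aggregation policy this comment suggests can be sketched end to end. `upsert` and the per-chunk processing below are stand-ins for the script's real logic; only the failure-tracking shape is the point.

```python
class IngestionError(Exception):
    """Raised at the end of a run if any batch or chunk failed."""

def ingest(chunks, upsert, batch_size=100):
    # Track both chunk-level and batch-level failures in one list,
    # then fail loudly once the whole run has been attempted.
    failed_batches = []
    for start in range(0, len(chunks), batch_size):
        batch_no = start // batch_size + 1
        batch = []
        for idx, chunk in enumerate(chunks[start:start + batch_size], start):
            try:
                batch.append(chunk.upper())  # stand-in for real chunk processing
            except Exception as exc:
                failed_batches.append((batch_no, f"chunk {idx} failed: {exc}"))
        try:
            upsert(batch)
        except Exception as exc:
            failed_batches.append((batch_no, f"upsert failed: {exc}"))
    if failed_batches:
        raise IngestionError(failed_batches)
```

Deferring the raise until after the loop lets one run report every failed batch and chunk at once instead of stopping at the first error.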


Comment on lines +28 to +32
  if match_count < 1 then
    match_count := 1;
  elsif match_count > 200 then
    match_count := 200;
  end if;

⚠️ Potential issue | 🟠 Major

The NULL-input case must be guarded as well, since the clamp never applies to it.

At lines 28-32, when match_count is NULL no branch is taken and the query runs with LIMIT NULL. This migration also executes on the regeneration path, so add a COALESCE(match_count, 10) pre-step for the same protection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@supabase/migrations/20260223065008_initialize_pgvector.sql` around lines 28 -
32, 현재 clamp 로직에서 match_count가 NULL이면 분기문을 모두 건너뛰어 LIMIT NULL이 되는 문제가 있습니다; fix는
비교 전에 match_count를 기본값으로 초기화하도록 match_count := COALESCE(match_count, 10); 를
삽입하거나 기존 범위검사 블록 바로 앞에서 NULL을 10으로 대체해 1~200으로 clamp 되도록 처리하십시오 (참조 심볼:
match_count 및 기존 if ... elsif ... end if 블록).

Comment on lines +9 to +20
DO $$
BEGIN
-- This is a guard to prevent accidental truncation in production CI/CD.
-- In a real scenario, you'd check a configuration or role here.
-- For now, we explicitly execute it but wrap it to highlight the danger.
IF EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_name='documents' AND column_name='embedding'
) THEN
TRUNCATE TABLE documents;
END IF;
END $$;

⚠️ Potential issue | 🔴 Critical

TRUNCATE runs with no production safeguard, risking loss of all data.

The condition at lines 14-19 only checks that the embedding column exists, so in practice TRUNCATE executes in most environments. Make the destructive operation fail-closed: abort unless an explicit allow flag is set.

Suggested fix
 DO $$
 BEGIN
-    -- This is a guard to prevent accidental truncation in production CI/CD.
-    -- In a real scenario, you'd check a configuration or role here.
-    -- For now, we explicitly execute it but wrap it to highlight the danger.
-    IF EXISTS (
-        SELECT 1 FROM information_schema.columns 
-        WHERE table_name='documents' AND column_name='embedding'
-    ) THEN
-        TRUNCATE TABLE documents;
-    END IF;
+    IF current_setting('app.allow_destructive_migrations', true) = 'true' THEN
+        TRUNCATE TABLE public.documents;
+    ELSE
+        RAISE EXCEPTION 'Refusing to truncate public.documents without app.allow_destructive_migrations=true';
+    END IF;
 END $$;

create or replace function match_documents (
query_embedding vector(384),
match_count int DEFAULT null,
match_count int DEFAULT 10,

⚠️ Potential issue | 🟠 Major

A NULL match_count bypasses the clamp, producing an unbounded query.

The comparisons at lines 40-44 do not fire on NULL, so LIMIT match_count effectively becomes LIMIT NULL (no limit). Independently of the default at line 29, normalize NULL with COALESCE at the start of the function.

Suggested fix
 begin
+  match_count := COALESCE(match_count, 10);
+
   if match_count < 1 then
     match_count := 1;
   elsif match_count > 200 then
     match_count := 200;
   end if;

Also applies to: 40-44


@SanghunYun95 SanghunYun95 merged commit 964b044 into main Feb 26, 2026
1 check passed
coderabbitai bot mentioned this pull request Feb 26, 2026