Conversation
… system — Added download_books.py to scrape the Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to the data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added an HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
Caution
Review failed — the pull request is closed.
ℹ️ Recent review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
📝 Walkthrough
Overview
This PR migrates Google Gemini-based embeddings to local FastEmbedEmbeddings (all-MiniLM-L6-v2), updates the database schema to support 384-dimension vectors, and hardens document ingestion and URL handling.
Changes
Sequence diagram

```mermaid
sequenceDiagram
    participant User as User
    participant Ingest as ingest_data.py
    participant Embed as EmbeddingService
    participant DB as Supabase DB
    User->>Ingest: ingest_document(text, philosopher, school, book_title)
    Ingest->>Ingest: strip_gutenberg_boilerplate(text)
    Ingest->>Ingest: Chunk the text
    Ingest->>Ingest: Check for existing chunks (by title)
    loop For each chunk
        Ingest->>Embed: generate_embedding_with_retry(chunk_text)
        Embed->>Embed: Load FastEmbedEmbeddings model
        Embed->>Embed: Generate embedding (384 dimensions)
        alt Success
            Embed-->>Ingest: Embedding vector
        else Failure (max_retries exceeded)
            Embed-->>Ingest: Exception
            Ingest->>Ingest: Accumulate into failed_batches
        end
    end
    alt Any batch failed
        Ingest->>Ingest: Raise IngestionError
    else All batches succeeded
        Ingest->>DB: Store embeddings + metadata
        DB->>DB: Vector search available via match_documents function
    end
```
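The retry flow in the diagram can be sketched roughly as follows. The function name mirrors `generate_embedding_with_retry` from this PR, but the body, the `max_retries`/`delay` parameters, and the injected `embed_fn` are illustrative assumptions, not the repository's actual implementation.

```python
import time

def generate_embedding_with_retry(embed_fn, text, max_retries=3, delay=0.0):
    """Call embed_fn(text), retrying up to max_retries times before giving up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            # e.g. a 384-dimension vector from FastEmbedEmbeddings (all-MiniLM-L6-v2)
            return embed_fn(text)
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay)  # simple fixed back-off between attempts
    # Retries exhausted: propagate, so the caller can add this chunk to failed_batches
    raise last_error
```

On success the embedding vector is returned immediately; after `max_retries` failures the last exception propagates, matching the `failed_batches` accumulation shown in the diagram.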
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
download_books.py (1)
1-6: ⚠️ Potential issue | 🔴 Critical
Missing urllib.parse import causes a runtime error
The code calls urllib.parse.urljoin (lines 101, 123, 174), but the urllib.parse module is never imported. In Python 3, urllib.parse must be imported explicitly. As written, the code raises AttributeError: module 'urllib' has no attribute 'parse'.
🐛 Add the missing import
```diff
 import os
 import re
 import time
 import urllib.request
 import urllib.error
+import urllib.parse
 from html.parser import HTMLParser
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@download_books.py` around lines 1 - 6, Add the missing import for urllib.parse so calls to urllib.parse.urljoin succeed; specifically, update the module imports at the top (where urllib.request and urllib.error are imported) to also import urllib.parse (or `from urllib.parse import urljoin`) so references to urllib.parse.urljoin used in the code resolve without raising AttributeError.

supabase/migrations/20260225112500_update_vector_dimension.sql (1)
12-35: ⚠️ Potential issue | 🟡 Minor
match_count handling is inconsistent with the other migrations.
In this migration, match_count defaults to null and has no bounds validation, whereas the 20260223 and 20260226 migrations use a default of 10 and enforce the [1, 200] range. Because migrations apply sequentially, the final state follows the 20260226 logic, but the intermediate state's null default can cause unexpected behavior.
♻️ Suggested fix for consistency
```diff
 create or replace function match_documents (
   query_embedding vector(3072),
-  match_count int DEFAULT null,
+  match_count int DEFAULT 10,
   filter jsonb DEFAULT '{}'
 ) returns table (
   id uuid,
   content text,
   metadata jsonb,
   similarity float
 )
 language plpgsql
 as $$
 begin
+  if match_count < 1 then
+    match_count := 1;
+  elsif match_count > 200 then
+    match_count := 200;
+  end if;
+
   return query
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@supabase/migrations/20260225112500_update_vector_dimension.sql` around lines 12 - 35, The match_documents function sets match_count DEFAULT null and lacks boundary validation, which is inconsistent with other migrations; update the function signature and body for match_documents to use match_count DEFAULT 10 and add validation enforcing 1 <= match_count <= 200 (returning an error or normalizing to the bounds) before running the RETURN QUERY; reference the function name match_documents and the parameter match_count when applying this change so intermediate migration behavior matches others.
🧹 Nitpick comments (6)
scripts/check_db.py (1)
19: Use a bare raise instead of raise e.
raise e re-raises the exception in a way that obscures the original stack trace. Use a bare raise to preserve the original traceback.
♻️ Suggested fix
```diff
 else:
     print(f"Database connection or query error: {e}")
-    raise e
+    raise
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/check_db.py` at line 19, The exception handler in scripts/check_db.py currently re-raises using "raise e", which loses the original traceback; locate the exception handling block that uses "raise e" (the re-raise in the try/except around the DB check) and replace "raise e" with a bare "raise" so the original traceback is preserved when propagating the exception from the function/method handling the DB check.

download_books.py (1)
173-174: Apply the same scheme validation to the next_page URL
parser.next_page is also parsed from external HTML, so a malicious URL could be injected there as well. Adding HTTP/HTTPS scheme validation is safer.
🛡️ Add scheme validation
```diff
 if parser.next_page:
     current_url = urllib.parse.urljoin(base_url, parser.next_page)
+    if not current_url.startswith(('https://', 'http://')):
+        print(f"Invalid next page URL scheme: {current_url}")
+        current_url = None
 else:
     current_url = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@download_books.py` around lines 173 - 174, parser.next_page is parsed from external HTML and should be validated before use; update the logic that sets current_url (where you currently call urllib.parse.urljoin(base_url, parser.next_page)) to first parse parser.next_page via urllib.parse.urlparse and ensure the scheme is 'http' or 'https' (reject or skip otherwise), then join only if valid; refer to parser.next_page, current_url, base_url and urllib.parse.urljoin/urlparse when making the change.

requirements.txt (1)
11-12: Keep dependency version pinning consistent.
langchain-community and fastembed are listed without version constraints. That said, looking at the current requirements.txt pattern, most packages (fastapi, uvicorn, supabase, langchain, etc.) are unpinned apart from pydantic>=2.7.0.
If you want to pin versions for reproducibility, apply version constraints consistently across all dependencies rather than selectively pinning only the newly added packages, or keep the existing pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` around lines 11 - 12, requirements.txt currently lists langchain-community and fastembed without version constraints which breaks consistency; update the file to follow the project's dependency versioning policy by either adding explicit version pins for both langchain-community and fastembed (choose compatible versions) or removing version pins across all packages to match the existing pattern—ensure you change the entries "langchain-community" and "fastembed" accordingly so they follow the same pinning scheme used by other dependencies.

scripts/ingest_data.py (3)
5: Remove the unused hashlib import
After switching to uuid5-based ID generation, hashlib is no longer used.
🧹 Suggested fix
```diff
-import hashlib
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/ingest_data.py` at line 5, Remove the unused import hashlib from the top of scripts/ingest_data.py; since ID generation now uses uuid.uuid5 (or the uuid5-based function in this module), delete the hashlib import line and any dead references to it (none expected), leaving only the necessary uuid-related imports and functions (e.g., the uuid5-based ID generator) to eliminate the unused-import warning.
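For context on the uuid5-based ID generation mentioned above: deterministic IDs can be derived as in the sketch below. The namespace and the `title:index` key format are assumptions for illustration and may differ from what `ingest_data.py` actually does.

```python
import uuid

def chunk_id(book_title: str, chunk_index: int) -> str:
    """Derive a deterministic chunk ID: the same (title, index) always maps to the same UUID."""
    # NAMESPACE_URL and the "title:index" key are illustrative choices, not the repo's scheme.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{book_title}:{chunk_index}"))

print(chunk_id("Meditations", 0) == chunk_id("Meditations", 0))  # → True
```

Determinism is what makes resumable ingestion possible: re-running the script regenerates the same IDs, so already-ingested chunks can be detected and skipped.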
113-132: Optional improvement to the existing-chunk check
The current implementation fetches all documents for a given philosopher and then filters in Python. This can be inefficient for large datasets.
If Supabase supports nested JSONB queries, consider server-side filtering:
♻️ Optional improvement: server-side filtering
```python
# If compound JSONB filtering is available
res = supabase_client.table('documents') \
    .select('metadata') \
    .eq("metadata->>'scholar'", philosopher) \
    .eq("metadata->'book_info'->>'title'", book_info.get('title')) \
    .execute()
```
However, verify that Supabase supports this JSONB query first.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/ingest_data.py` around lines 113 - 132, The current resume logic populates existing_indices by fetching all documents for a philosopher and filtering in Python, which is inefficient for large datasets; update the supabase_client query in the block that calls supabase_client.table('documents').select('metadata').eq("metadata->>'scholar'", philosopher).execute() to add a server-side JSONB filter for the book title (e.g., an additional .eq("metadata->'book_info'->>'title'", book_info.get('title')) if Supabase JSONB querying is supported) so only matching rows are returned and existing_indices is built from that smaller result set; keep the existing exception handling path if the JSONB operator is not supported.
136-137: The batch-failure tracking and error-reporting logic looks good. Tracking failed batches and ultimately raising a RuntimeError guarantees ingestion integrity.
Static analysis flagged a TRY003 warning (long message specified outside the exception class). Optionally, you can define a custom exception class.
♻️ Optional improvement: use a custom exception class
```python
class IngestionError(Exception):
    """Raised when data ingestion fails."""
    def __init__(self, failed_batches):
        self.failed_batches = failed_batches
        super().__init__(f"Ingestion incomplete. Failed batches: {failed_batches}")
```
```diff
-    if failed_batches:
-        raise RuntimeError(f"Ingestion incomplete. Failed batches: {failed_batches}")
+    if failed_batches:
+        raise IngestionError(failed_batches)
```
Also applies to: 194-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/ingest_data.py` around lines 136 - 137, Declare a small custom exception class (e.g., IngestionError) that accepts the failed_batches list and constructs its long message inside its __init__, then replace raising RuntimeError with raising IngestionError(failed_batches); update the places using the failed_batches variable (the failed_batches declaration and the raise site currently around the later block that checks it) to raise the new IngestionError so the long message is created inside the exception class rather than outside (reference symbols: failed_batches and IngestionError).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@download_books.py`:
- Around line 121-128: The parsed txt_url (parser.txt_url) is taken from
external HTML and passed to urllib.parse.urljoin(base_url, txt_url) then
fetched, which allows absolute URLs with dangerous schemes (e.g., file://,
ftp://) to bypass base_url; to fix, parse txt_url with urllib.parse.urlparse
before joining and allow only safe or relative schemes—e.g., permit scheme == ""
(relative) or scheme in ("http","https")—and skip/log and continue for any other
scheme; apply this check just before the urljoin/Request/urlopen block that uses
txt_url and ensure only validated URLs are passed to
urllib.request.Request/urllib.request.urlopen.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql`:
- Line 5: The migration changes the recreated function
match_documents(vector(3072), int, jsonb) to set match_count DEFAULT NULL and
removes the 1..200 range validation, which deviates from migration 20260223;
restore the original behavior by setting the parameter default back to DEFAULT
10 (or the intended non-null default) and reintroduce the range check logic that
enforces match_count between 1 and 200 (e.g., validate and clamp or raise an
error) inside the function body so LIMIT uses a safe integer and the same
validation policy is preserved across migrations.
In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql`:
- Around line 8-20: The current migration DO block truncates the documents table
whenever the embedding column exists, which risks data loss; update the guard in
the DO $$...$$ block that references table "documents", column "embedding" and
the TRUNCATE statement to require an explicit safety check before truncation
(e.g., verify a runtime flag/environment setting or a custom Postgres GUC like
"migration.safe_to_truncate" is true, check that the row count is zero or prompt
for a confirmation marker, and RAISE EXCEPTION or skip truncation if not set),
and add a clear RAISE NOTICE explaining why truncation was skipped so CI/CD or
operators can see the decision rather than silently truncating data.
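The scheme checks recommended above for `txt_url` and `next_page` can be sketched as one small helper; the name `safe_join` is hypothetical, not from the repository:

```python
import urllib.parse

def safe_join(base_url: str, candidate: str):
    """Join candidate onto base_url only if it is relative or uses http(s)."""
    scheme = urllib.parse.urlparse(candidate).scheme
    if scheme not in ("", "http", "https"):
        return None  # reject file://, ftp://, javascript:, etc.
    return urllib.parse.urljoin(base_url, candidate)

print(safe_join("https://www.gutenberg.org/", "file:///etc/passwd"))  # → None
```

Callers would skip (and optionally log) any link for which the helper returns `None`, so only validated URLs ever reach `urllib.request.urlopen`.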
---
Outside diff comments:
In `@download_books.py`:
- Around line 1-6: Add the missing import for urllib.parse so calls to
urllib.parse.urljoin succeed; specifically, update the module imports at the top
(where urllib.request and urllib.error are imported) to also import urllib.parse
(or `from urllib.parse import urljoin`) so references to urllib.parse.urljoin
used in the code resolve without raising AttributeError.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql`:
- Around line 12-35: The match_documents function sets match_count DEFAULT null
and lacks boundary validation, which is inconsistent with other migrations;
update the function signature and body for match_documents to use match_count
DEFAULT 10 and add validation enforcing 1 <= match_count <= 200 (returning an
error or normalizing to the bounds) before running the RETURN QUERY; reference
the function name match_documents and the parameter match_count when applying
this change so intermediate migration behavior matches others.
---
Nitpick comments:
In `@download_books.py`:
- Around line 173-174: parser.next_page is parsed from external HTML and should
be validated before use; update the logic that sets current_url (where you
currently call urllib.parse.urljoin(base_url, parser.next_page)) to first parse
parser.next_page via urllib.parse.urlparse and ensure the scheme is 'http' or
'https' (reject or skip otherwise), then join only if valid; refer to
parser.next_page, current_url, base_url and urllib.parse.urljoin/urlparse when
making the change.
In `@requirements.txt`:
- Around line 11-12: requirements.txt currently lists langchain-community and
fastembed without version constraints which breaks consistency; update the file
to follow the project's dependency versioning policy by either adding explicit
version pins for both langchain-community and fastembed (choose compatible
versions) or removing version pins across all packages to match the existing
pattern—ensure you change the entries "langchain-community" and "fastembed"
accordingly so they follow the same pinning scheme used by other dependencies.
In `@scripts/check_db.py`:
- Line 19: The exception handler in scripts/check_db.py currently re-raises
using "raise e", which loses the original traceback; locate the exception
handling block that uses "raise e" (the re-raise in the try/except around the DB
check) and replace "raise e" with a bare "raise" so the original traceback is
preserved when propagating the exception from the function/method handling the
DB check.
In `@scripts/ingest_data.py`:
- Line 5: Remove the unused import hashlib from the top of
scripts/ingest_data.py; since ID generation now uses uuid.uuid5 (or the
uuid5-based function in this module), delete the hashlib import line and any
dead references to it (none expected), leaving only the necessary uuid-related
imports and functions (e.g., the uuid5-based ID generator) to eliminate the
unused-import warning.
- Around line 113-132: The current resume logic populates existing_indices by
fetching all documents for a philosopher and filtering in Python, which is
inefficient for large datasets; update the supabase_client query in the block
that calls
supabase_client.table('documents').select('metadata').eq("metadata->>'scholar'",
philosopher).execute() to add a server-side JSONB filter for the book title
(e.g., an additional .eq("metadata->'book_info'->>'title'",
book_info.get('title')) if Supabase JSONB querying is supported) so only
matching rows are returned and existing_indices is built from that smaller
result set; keep the existing exception handling path if the JSONB operator is
not supported.
- Around line 136-137: Declare a small custom exception class (e.g.,
IngestionError) that accepts the failed_batches list and constructs its long
message inside its __init__, then replace raising RuntimeError with raising
IngestionError(failed_batches); update the places using the failed_batches
variable (the failed_batches declaration and the raise site currently around the
later block that checks it) to raise the new IngestionError so the long message
is created inside the exception class rather than outside (reference symbols:
failed_batches and IngestionError).
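As a standalone illustration of the bare-`raise` recommendation for `scripts/check_db.py` (a sketch, not repository code): with a bare `raise`, the propagated traceback's innermost frame still points at the original raise site.

```python
import traceback

def query_db():
    # Stand-in for the real DB call; always fails in this sketch.
    raise ConnectionError("connection refused")

def check_db():
    try:
        query_db()
    except ConnectionError as e:
        print(f"Database connection or query error: {e}")
        raise  # bare raise: re-raises the active exception as-is

try:
    check_db()
except ConnectionError as exc:
    frames = traceback.extract_tb(exc.__traceback__)
    print(frames[-1].name)  # → query_db (the original raise site)
```

With `raise e` the traceback instead gains an extra frame at the re-raise line, cluttering the report with the handler's location.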
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
push-error.log is excluded by !**/*.log
📒 Files selected for processing (11)
- app/services/embedding.py
- app/services/llm.py
- download_books.py
- requirements.txt
- scripts/check_db.py
- scripts/ingest_all_data.py
- scripts/ingest_data.py
- supabase/migrations/20260223065008_initialize_pgvector.sql
- supabase/migrations/20260225112500_update_vector_dimension.sql
- supabase/migrations/20260226140500_update_vector_to_mini_lm.sql
- verify_and_clear.py
```sql
-- 1. Drop the existing function
DROP FUNCTION IF EXISTS match_documents(vector(1536), int, jsonb);
DROP FUNCTION IF EXISTS match_documents(vector(3072), int, jsonb);
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find all migration files and examine match_documents signatures
echo "=== Finding migration files ==="
git ls-files supabase/migrations/*.sql | sort
echo ""
echo "=== Migration files with match_documents ==="
fd -e sql . supabase/migrations | sort | while read f; do
  if rg -q "match_documents" "$f"; then
    echo "--- $f ---"
    cat -n "$f"
  fi
done
```
Repository: SanghunYun95/philo-rag
Length of output: 6754
match_count default and range check removed — confirm this behavior change
The DROP statement signatures exactly match the 20260223 migration, so they are correct. However, recreating the function introduces two issues:
- match_count DEFAULT null: the previous migration (20260223) used DEFAULT 10, but this migration changes it to DEFAULT null. A null value in a SQL LIMIT clause can cause unexpected behavior.
- Range check removed: the previous function validated that match_count stayed within the 1-200 range, but the recreated function drops that validation entirely.
Confirm whether these changes are intentional. If not, either preserve the 20260223 behavior or apply a consistent policy to the 20260225141500 migration as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql` at line 5,
The migration changes the recreated function match_documents(vector(3072), int,
jsonb) to set match_count DEFAULT NULL and removes the 1..200 range validation,
which deviates from migration 20260223; restore the original behavior by setting
the parameter default back to DEFAULT 10 (or the intended non-null default) and
reintroduce the range check logic that enforces match_count between 1 and 200
(e.g., validate and clamp or raise an error) inside the function body so LIMIT
uses a safe integer and the same validation policy is preserved across
migrations.
```sql
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
TRUNCATE TABLE documents;
DO $$
BEGIN
  -- This is a guard to prevent accidental truncation in production CI/CD.
  -- In a real scenario, you'd check a configuration or role here.
  -- For now, we explicitly execute it but wrap it to highlight the danger.
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name='documents' AND column_name='embedding'
  ) THEN
    TRUNCATE TABLE documents;
  END IF;
END $$;
```
The current guard only checks that the column exists and then unconditionally runs TRUNCATE. If this runs in production by mistake, all document data is deleted.
Consider safer approaches:
- Block production execution via an environment variable or configuration
- Require explicit confirmation when data exists
- Manage production migrations separately in the CI/CD pipeline
🛡️ Safer guard suggestion
```diff
 DO $$
+DECLARE
+  row_count INTEGER;
 BEGIN
-  -- This is a guard to prevent accidental truncation in production CI/CD.
-  -- In a real scenario, you'd check a configuration or role here.
-  -- For now, we explicitly execute it but wrap it to highlight the danger.
   IF EXISTS (
     SELECT 1 FROM information_schema.columns
     WHERE table_name='documents' AND column_name='embedding'
   ) THEN
+    SELECT COUNT(*) INTO row_count FROM documents;
+    IF row_count > 0 THEN
+      RAISE NOTICE 'WARNING: Truncating % existing documents', row_count;
+      -- Consider: RAISE EXCEPTION for production environments
+    END IF;
     TRUNCATE TABLE documents;
   END IF;
 END $$;
```
END $$;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

```sql
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
TRUNCATE TABLE documents;
DO $$
BEGIN
  -- This is a guard to prevent accidental truncation in production CI/CD.
  -- In a real scenario, you'd check a configuration or role here.
  -- For now, we explicitly execute it but wrap it to highlight the danger.
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name='documents' AND column_name='embedding'
  ) THEN
    TRUNCATE TABLE documents;
  END IF;
END $$;
```

Suggested:

```sql
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
DO $$
DECLARE
  row_count INTEGER;
BEGIN
  IF EXISTS (
    SELECT 1 FROM information_schema.columns
    WHERE table_name='documents' AND column_name='embedding'
  ) THEN
    SELECT COUNT(*) INTO row_count FROM documents;
    IF row_count > 0 THEN
      RAISE NOTICE 'WARNING: Truncating % existing documents', row_count;
      -- Consider: RAISE EXCEPTION for production environments
    END IF;
    TRUNCATE TABLE documents;
  END IF;
END $$;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql` around lines
8 - 20, The current migration DO block truncates the documents table whenever
the embedding column exists, which risks data loss; update the guard in the DO
$$...$$ block that references table "documents", column "embedding" and the
TRUNCATE statement to require an explicit safety check before truncation (e.g.,
verify a runtime flag/environment setting or a custom Postgres GUC like
"migration.safe_to_truncate" is true, check that the row count is zero or prompt
for a confirmation marker, and RAISE EXCEPTION or skip truncation if not set),
and add a clear RAISE NOTICE explaining why truncation was skipped so CI/CD or
operators can see the decision rather than silently truncating data.
Summary by CodeRabbit
Release notes
Bug fixes
New features
Improvements