… system
- Added download_books.py to scrape the Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to the data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added an HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
📝 Walkthrough

Summary

This pull request implements several system updates, including embedding validation, secure URL handling, a data-ingestion redesign, database migrations, and improved error handling. The vector dimension was updated from 1536 to 3072, and finally to 384.

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes

Possibly related PRs

Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@download_books.py`:
- Around line 124-128: The current URL creation and SSRF checks are
insufficient: replace urljoin(base_url, txt_url) with urljoin(book_url, txt_url)
(and urljoin(current_url, parser.next_page) for pagination) so relative paths
resolve against the actual page, and strengthen the SSRF guard in the code that
currently only checks txt_url.startswith(('http://','https://')) by validating
the parsed URL's scheme and hostname against an allowlist and rejecting
private/internal IPs and loopback addresses; use urllib.parse.urlparse on the
resolved URL to get netloc, resolve the hostname to an IP and check with the
ipaddress module to deny 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16,
and link-local 169.254.0.0/16, and log/skip the URL (return downloaded_count)
when it fails these checks (update the validation around txt_url,
parser.next_page and any other places that fetch external resources).
In `@requirements.txt`:
- Line 11: Pin LangChain-related dependencies in requirements.txt: replace the
unversioned entries for langchain, langchain-google-genai, and
langchain-community with the recommended, compatible versions so installs are
reproducible; specifically set langchain-core to >=1.2.5,<2.0.0,
langchain-classic to >=1.0.0,<2.0.0, langchain-community to ==0.4.1, and
langchain-google-genai to >=4.2.1,<5.0.0 (update the existing lines for
"langchain", "langchain-google-genai", and "langchain-community" accordingly).
In `@scripts/check_db.py`:
- Around line 15-16: The script prints a message when the 'documents' table is
missing but calls sys.exit(0), which incorrectly signals success; update the
exit behavior in scripts/check_db.py where the missing-table branch uses
sys.exit(0) (the print + sys.exit call) to instead exit with a non-zero code
(e.g., sys.exit(1) or raise SystemExit(1)) so CI/deploy will treat the check as
failed; make this change in the block that detects the 'documents' table absence
(the print("Table 'documents' does not exist yet. Please run migrations.") +
sys.exit(...) sequence).
In `@scripts/ingest_data.py`:
- Around line 142-143: Currently only upsert exceptions are added to failed_batches, so partial failures (chunk-processing failures) are missed; when catching chunk-processing exceptions in the processing loop (the chunk-handling code around the current except block), also append the batch and chunk identifiers (e.g., batch_id, chunk_index) plus the exception details to failed_batches, and make the final aggregation (the upsert/commit success check) consult failed_batches; relevant symbols: failed_batches, the chunk-processing loop (the try/except block where the exceptions occur), and the upsert exception handler (the code that currently records only upsert failures) — apply the same failure logging/aggregation policy to all of them.
In `@supabase/migrations/20260223065008_initialize_pgvector.sql`:
- Around line 28-32: In the current clamp logic, when match_count is NULL every branch is skipped and the query runs with LIMIT NULL; fix this by initializing match_count to a default before the comparisons — insert match_count := COALESCE(match_count, 10); — or replace NULL with 10 immediately before the existing range-check block so the value is clamped to 1-200 (reference symbols: match_count and the existing if ... elsif ... end if block).
In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql`:
- Line 29: The match_count parameter can be NULL and bypass your clamp/LIMIT
logic; normalize it at the top of the function with COALESCE so downstream
comparisons and the "LIMIT match_count" use a non-NULL value. For example,
immediately after function entry assign match_count := COALESCE(match_count, 10)
(or COALESCE(match_count, <existing default>)) and update the clamp/comparison
block referenced around lines 40-44 to use that normalized match_count variable.
- Around line 9-20: The DO $$ block currently performs a destructive TRUNCATE
TABLE documents when only the embedding column exists; change it to fail-closed
by requiring an explicit allow flag before running TRUNCATE: add a guard that
reads a trusted indicator (e.g., a session/config setting or a control column)
and RAISE EXCEPTION unless that flag is explicitly true, then only execute
TRUNCATE TABLE documents when the flag is set and the embedding column exists;
update the block that contains TRUNCATE TABLE documents to check this explicit
allow variable (or current_setting) and abort otherwise.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
app/services/embedding.py
app/services/llm.py
download_books.py
requirements.txt
scripts/check_db.py
scripts/ingest_all_data.py
scripts/ingest_data.py
supabase/migrations/20260223065008_initialize_pgvector.sql
supabase/migrations/20260225112500_update_vector_dimension.sql
supabase/migrations/20260226140500_update_vector_to_mini_lm.sql
```python
txt_url = urllib.parse.urljoin(base_url, txt_url)
# Verify valid scheme to prevent SSRF
if not txt_url.startswith(('http://', 'https://')):
    print(f"Skipping invalid URL scheme: {txt_url}")
    return downloaded_count
```
Scheme checks alone are not enough to prevent SSRF, and urljoin uses the wrong base URL.

Issues:

- The `http://`/`https://` checks at lines 126-128 and 180-182 cannot block internal-IP targets such as `https://169.254.169.254/...` or `https://127.0.0.1/...`.
- Line 124 uses `urljoin(base_url, txt_url)`, but `txt_url` is a relative path parsed from the book page's HTML, so it must be joined against `book_url`. The current code can produce incorrect URLs.
- Line 179 likewise should use `urljoin(current_url, parser.next_page)` instead of `urljoin(base_url, parser.next_page)`.

Suggested fix: add a host allowlist and correct the urljoin base URLs.

```diff
+ALLOWED_HOSTS = {"www.gutenberg.org", "gutenberg.org"}
+
+def is_allowed_url(url: str) -> bool:
+    parsed = urllib.parse.urlparse(url)
+    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS
+
 def download_book(book, data_dir, base_url, target_count, downloaded_count):
     href = book['href']
     if not href.startswith('/ebooks/'):
@@
     txt_url = parser.txt_url
     if txt_url:
-        txt_url = urllib.parse.urljoin(base_url, txt_url)
-        # Verify valid scheme to prevent SSRF
-        if not txt_url.startswith(('http://', 'https://')):
-            print(f"Skipping invalid URL scheme: {txt_url}")
+        txt_url = urllib.parse.urljoin(book_url, txt_url)
+        if not is_allowed_url(txt_url):
+            print(f"Skipping disallowed URL: {txt_url}")
             return downloaded_count
@@
     if parser.next_page:
-        current_url = urllib.parse.urljoin(base_url, parser.next_page)
-        if not current_url.startswith(('http://', 'https://')):
-            print(f"Invalid next page URL scheme: {current_url}")
+        current_url = urllib.parse.urljoin(current_url, parser.next_page)
+        if not is_allowed_url(current_url):
+            print(f"Invalid next page URL: {current_url}")
             current_url = None
```
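The review prompt also asks for ipaddress-based rejection of private, loopback, and link-local targets, which a hostname allowlist alone does not demonstrate. A minimal sketch of that check — the `ALLOWED_HOSTS` set and the `is_safe_url` name are illustrative, not taken from the PR:

```python
import ipaddress
import socket
import urllib.parse

# Hypothetical allowlist for this scraper; adjust to the hosts you trust.
ALLOWED_HOSTS = {"www.gutenberg.org", "gutenberg.org"}

def is_safe_url(url: str) -> bool:
    """Accept only http(s) URLs on allowlisted hosts that do not resolve
    to a private, loopback, or link-local address."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        return False
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    try:
        # Resolve every address the host maps to (IPv4 and IPv6).
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True
```

Note that resolving at validation time still leaves a DNS-rebinding window; for stricter protection, connect to the resolved IP directly and set the Host header yourself.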
```text
pydantic>=2.7.0
pydantic-settings
python-dotenv
langchain-community
```
LangChain-related dependencies need version pins.

In the current requirements.txt, langchain, langchain-google-genai, and langchain-community (lines 4, 5, 11) are all unversioned, so incompatible version combinations can be installed depending on when the install runs. For reproducible deployments, apply the officially recommended compatible versions:

```text
langchain-core>=1.2.5,<2.0.0
langchain-classic>=1.0.0,<2.0.0
langchain-community==0.4.1
langchain-google-genai>=4.2.1,<5.0.0
```
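To keep such pins from silently regressing later, a small pre-install check can flag unversioned entries. A sketch — the `unpinned` helper and its regex are illustrative, not part of this PR — that treats any requirement line without a version-comparison operator as unpinned:

```python
import re

def unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that carry no version specifier."""
    bad = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Pinned lines contain a PEP 440 comparison operator.
        if not re.search(r"(==|>=|<=|~=|!=|<|>)", line):
            bad.append(line)
    return bad

sample = "pydantic>=2.7.0\nlangchain\nlangchain-community\n"
print(unpinned(sample))  # ['langchain', 'langchain-community']
```

Running this in CI before `pip install` makes the "reproducible installs" requirement enforceable rather than a one-time fix.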
```python
print("Table 'documents' does not exist yet. Please run migrations.")
sys.exit(0)
```
Exiting with a success code (0) when the table is missing lets the check pass as a false positive.

At lines 15-16 the script reports the un-migrated state but calls sys.exit(0), so a situation that should fail the CI/deploy step is treated as success. For a check script, a non-zero exit code is the safe choice.

Suggested fix:

```diff
- print("Table 'documents' does not exist yet. Please run migrations.")
- sys.exit(0)
+ print("Table 'documents' does not exist yet. Please run migrations.")
+ sys.exit(1)
```
```python
failed_batches = []
```
Aggregating only batch-upsert failures misses partial failures.

Lines 200-203 add only upsert exceptions to failed_batches, while chunk-processing failures (lines 189-190) are never aggregated, so the run can finish without raising even when some chunks were lost.

🛠️ Suggested fix:

```diff
 BATCH_SIZE = 100
 failed_batches = []
@@
     except Exception as exc:
         print(f"Chunk {idx} completely failed: {exc}")
+        failed_batches.append(
+            (i // BATCH_SIZE + 1, f"chunk {idx} failed: {exc}")
+        )
@@
 if failed_batches:
     raise IngestionError(failed_batches)
```

Also applies to: 200-203
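The aggregation policy described above — one failed_batches list fed by both chunk-level and upsert-level exceptions, consulted at the end — can be sketched in isolation (the function names and tuple shape here are hypothetical, not the PR's actual ingest_data.py code):

```python
def ingest_chunks(chunks, embed, upsert, batch_size=100):
    """Embed chunks in batches; record every failure in failed_batches."""
    failed_batches = []
    for start in range(0, len(chunks), batch_size):
        batch_id = start // batch_size + 1
        rows = []
        for idx, chunk in enumerate(chunks[start:start + batch_size], start=start):
            try:
                rows.append((idx, embed(chunk)))
            except Exception as exc:
                # Chunk-level failure: aggregate it, don't just print it.
                failed_batches.append((batch_id, f"chunk {idx} failed: {exc}"))
        try:
            upsert(rows)
        except Exception as exc:
            # Batch-level (upsert) failure.
            failed_batches.append((batch_id, f"upsert failed: {exc}"))
    return failed_batches  # caller raises if this is non-empty
```

A caller would then raise (e.g. the review's IngestionError) whenever the returned list is non-empty, so partial chunk loss can no longer end the run silently.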
```sql
if match_count < 1 then
    match_count := 1;
elsif match_count > 200 then
    match_count := 200;
end if;
```
The NULL-input case must be guarded as well, since the clamp never fires on it.

At lines 28-32, when match_count is NULL neither branch is taken, so the query runs with LIMIT NULL. This migration also executes on the regeneration path, so add a COALESCE(match_count, 10) pre-step to protect it the same way.
```sql
DO $$
BEGIN
    -- This is a guard to prevent accidental truncation in production CI/CD.
    -- In a real scenario, you'd check a configuration or role here.
    -- For now, we explicitly execute it but wrap it to highlight the danger.
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name='documents' AND column_name='embedding'
    ) THEN
        TRUNCATE TABLE documents;
    END IF;
END $$;
```
TRUNCATE runs without any operational safeguard, so the risk of total data loss is high.

The condition at lines 14-19 only checks that the embedding column exists, which means TRUNCATE executes in virtually every environment. Make the destructive operation fail-closed: abort unless an explicit allow flag is set.

Suggested fix:

```diff
 DO $$
 BEGIN
-    -- This is a guard to prevent accidental truncation in production CI/CD.
-    -- In a real scenario, you'd check a configuration or role here.
-    -- For now, we explicitly execute it but wrap it to highlight the danger.
-    IF EXISTS (
-        SELECT 1 FROM information_schema.columns
-        WHERE table_name='documents' AND column_name='embedding'
-    ) THEN
-        TRUNCATE TABLE documents;
-    END IF;
+    IF current_setting('app.allow_destructive_migrations', true) = 'true' THEN
+        TRUNCATE TABLE public.documents;
+    ELSE
+        RAISE EXCEPTION 'Refusing to truncate public.documents without app.allow_destructive_migrations=true';
+    END IF;
 END $$;
```
```diff
 create or replace function match_documents (
     query_embedding vector(384),
-    match_count int DEFAULT null,
+    match_count int DEFAULT 10,
```
A NULL match_count bypasses the clamp and allows an unbounded query.

The comparisons at lines 40-44 never fire on NULL, so LIMIT match_count effectively becomes LIMIT NULL (no limit). Independent of the default at line 29, normalize NULL with COALESCE at the start of the function.

Suggested fix:

```diff
 begin
+    match_count := COALESCE(match_count, 10);
+
     if match_count < 1 then
         match_count := 1;
     elsif match_count > 200 then
         match_count := 200;
     end if;
```

Also applies to: 40-44
Summary by CodeRabbit

Release Notes

New Features

Bug Fixes

Improvements