Merged
75 commits
5a5407a
feat: integrate korean book metadata and UI citations
SanghunYun95 Mar 2, 2026
8a01e1d
fix: apply coderabbit review suggestions
SanghunYun95 Mar 2, 2026
133442a
fix(backend): apply coderabbit review feedback for db and mapping scr…
SanghunYun95 Mar 2, 2026
43d1722
fix(backend): address additional coderabbit PR inline comments
SanghunYun95 Mar 2, 2026
0dd84a4
refactor(backend): use shared env parser and HTTPS for API
SanghunYun95 Mar 3, 2026
3057ad7
fix(backend): allow key rotation for all errors in book mapping
SanghunYun95 Mar 3, 2026
fc24774
feat: implement dynamic chat title and dynamic philosopher highlighting
SanghunYun95 Mar 3, 2026
cdbc817
fix: apply CodeRabbit PR review feedback
SanghunYun95 Mar 3, 2026
6c7566d
fix(pr): address CodeRabbit review feedback on backend tools and DB s…
SanghunYun95 Mar 3, 2026
78fc51a
chore: resolve merge conflicts
SanghunYun95 Mar 3, 2026
9de894d
fix(pr): address additional CodeRabbit comments
SanghunYun95 Mar 3, 2026
3d773d7
style: update welcome messages and input placeholder to be more gener…
SanghunYun95 Mar 3, 2026
4335bee
fix(pr): address additional CodeRabbit feedback for title truncation …
SanghunYun95 Mar 3, 2026
7298aac
UI: Remove redundant buttons (useful, copy, regenerate) from MessageList
SanghunYun95 Mar 3, 2026
30dd215
Merge branch 'main' into feat/book-metadata
SanghunYun95 Mar 3, 2026
ce91d6a
Refactor: apply CodeRabbit review suggestions
SanghunYun95 Mar 3, 2026
0bd1fcd
docs: rewrite README for interviewers
SanghunYun95 Mar 3, 2026
1196e30
docs, refactor: refine README and MessageList observer logic per PR c…
SanghunYun95 Mar 3, 2026
1b31b83
refactor: resolve observer unmount leak, Biome formatting, exhaustive…
SanghunYun95 Mar 3, 2026
e1ec3fc
fix: clear visibleMessages on unmount & use targeted eslint disable
SanghunYun95 Mar 3, 2026
36bd572
docs, refactor: disable philosopher filtering & update README examples
SanghunYun95 Mar 3, 2026
f13f327
refactor: apply PR refinements for mapping script and observers
SanghunYun95 Mar 3, 2026
1a9358b
Merge origin/main into feat/book-metadata (Resolve conflicts)
SanghunYun95 Mar 3, 2026
5d2841d
Fix: apply CodeRabbit feedback for React hooks and Tailwind
SanghunYun95 Mar 3, 2026
2584e3b
Feat: support multiple GEMINI_API_KEYS via comma-separated env var fo…
SanghunYun95 Mar 4, 2026
2395400
Fix: apply PR CodeRabbit round 8 feedback and add favicon
SanghunYun95 Mar 4, 2026
a0f719c
Fix: resolve conflicts and apply PR CodeRabbit round 9 feedback
SanghunYun95 Mar 4, 2026
789bdf4
Fix: apply PR CodeRabbit round 10 feedback
SanghunYun95 Mar 4, 2026
4c33094
Fix: apply PR CodeRabbit round 11 feedback
SanghunYun95 Mar 4, 2026
c9b0b91
Fix: apply PR CodeRabbit round 12 feedback
SanghunYun95 Mar 4, 2026
f24b224
fix(backend): preload models on startup and use async invokes to prev…
SanghunYun95 Mar 4, 2026
622a663
test: update mocks for refactored async llm/embedding functions
SanghunYun95 Mar 4, 2026
9eedd78
fix(pr): address lint, magic numbers, and use favicon for logo
SanghunYun95 Mar 4, 2026
4d878c2
fix(pr): resolve conflicts and add sizes prop to next/image
SanghunYun95 Mar 4, 2026
8495460
fix(backend): load models in background to prevent startup timeout on…
SanghunYun95 Mar 5, 2026
110049b
fix(backend): resolve conflict and apply PR feedback (timeouts, track…
SanghunYun95 Mar 5, 2026
105a59c
fix(backend): add graceful teardown for preload task on shutdown
SanghunYun95 Mar 5, 2026
7d918eb
feat(backend): add /ready endpoint and handle CancelledError in preload
SanghunYun95 Mar 5, 2026
382f90e
fix(backend): handle CancelledError properly in /ready readiness probe
SanghunYun95 Mar 5, 2026
1987897
fix(backend): lazy load ML models in chat routes to avoid Uvicorn sta…
SanghunYun95 Mar 5, 2026
f11491c
fix(backend): add error logging to /ready endpoint for better observa…
SanghunYun95 Mar 5, 2026
cad791b
refactor(backend): use else block for successful return in readiness …
SanghunYun95 Mar 5, 2026
e94fbe2
refactor(backend): use logger.warning in /ready, catch Exception in l…
SanghunYun95 Mar 5, 2026
359511c
Merge branch 'main' into feat/book-metadata and apply lifespan except…
SanghunYun95 Mar 5, 2026
f187cb1
fix: handle zero-chunk LLM responses, add prompt injection defense, a…
SanghunYun95 Mar 5, 2026
95be5fa
fix(backend): use HuggingFace Inference API for embeddings to resolve…
SanghunYun95 Mar 6, 2026
2c0f465
fix(backend): address CodeRabbit PR feedback for llm.py cleanup, chat…
SanghunYun95 Mar 6, 2026
bfe167c
fix: resolve merge conflicts and restore PR feedback fixes
SanghunYun95 Mar 6, 2026
9b00c6c
fix(backend): increase timeouts and add timing logs to debug latency
SanghunYun95 Mar 6, 2026
bba2528
fix: resolve merge conflicts and apply coderabbit feedback (timeout, …
SanghunYun95 Mar 6, 2026
ad7e026
refactor(backend): extract timeout constant and add semaphore for DB RPC
SanghunYun95 Mar 6, 2026
ff949ea
feat: migrate LLM service from Google Gemini to OpenAI (gpt-4o-mini)
SanghunYun95 Mar 6, 2026
d406f0b
fix: address PR feedback for chat timeouts and dependencies
SanghunYun95 Mar 6, 2026
e60cc1a
Merge main and address PR comments on logger formatting
SanghunYun95 Mar 6, 2026
1dc51b4
chore: update agent skills
SanghunYun95 Mar 25, 2026
13a7538
Merge branch 'main' into feat/migrate-to-openai
SanghunYun95 Mar 25, 2026
da4d56d
feat: add keep-alive GitHub Action
SanghunYun95 Mar 25, 2026
52b0474
feat: add keep-alive github action
SanghunYun95 Mar 25, 2026
98c26b9
chore: refactor keep-alive action based on review comments
SanghunYun95 Mar 25, 2026
2e2d75d
fix: increase keep-alive timeout to 120s and improve robustness
SanghunYun95 Mar 25, 2026
d77ef69
chore: resolve merge conflict in keep-alive workflow
SanghunYun95 Mar 25, 2026
a4f65e0
fix: adjust keep-alive endpoints for Render (GET required) and Supaba…
SanghunYun95 Mar 25, 2026
24c81d9
refactor: improve curl error handling in keep-alive action as suggest…
SanghunYun95 Mar 25, 2026
1ff42a1
merge: resolve conflict in keep-alive workflow by keeping fixed logic
SanghunYun95 Mar 25, 2026
c507e59
fix: chat input Enter behavior and remove keep-alive CronJob
SanghunYun95 Mar 26, 2026
37d655f
chore: resolve conflict by removing keep-alive cronjob (migrated to C…
SanghunYun95 Mar 26, 2026
19dba21
feat: optimize Philo-RAG data pipeline with 101 books and 31.8% effic…
SanghunYun95 Mar 28, 2026
5298f64
refactor: address CodeRabbit review comments (BOM removal, error hand…
SanghunYun95 Mar 28, 2026
14f1890
refactor: implement atomic failure handling in update_metadata.py
SanghunYun95 Mar 28, 2026
d5fa21d
refactor: improve metadata update atomicity using batch upsert
SanghunYun95 Mar 28, 2026
549a06e
Refactor: Update JSONB path syntax and optimize metadata update query
SanghunYun95 Mar 28, 2026
67f412f
feat: migrate infrastructure to GCP Cloud Run and Firebase Hosting
SanghunYun95 Mar 28, 2026
81a2a2c
fix: sync package-lock.json with package.json to fix build failure
SanghunYun95 Mar 28, 2026
2f1db46
Refactoring Philo-RAG: Robustness and Security improvements as per PR…
SanghunYun95 Mar 28, 2026
c0921d7
Merge main into feature/migrate-to-gcp and resolve conflicts in favor…
SanghunYun95 Mar 28, 2026
6 changes: 3 additions & 3 deletions .agent/documents/bmad.md
@@ -6,7 +6,7 @@

## 1. Core Philosophy

- **Docs-as-Code:** Every feature starts from a story file in `documents/stories/`.
- **Docs-as-Code:** Every feature starts from a story file in `.agent/documents/stories/`.
- **Behavior-driven:** Features are defined around user behavior and expected results (Acceptance Criteria).
- **Model-based:** Complex logic is expressed as structured models (Mermaid diagrams, JSON schemas, etc.) rather than free text.
- **Context Integrity:** Documents are split into story-sized units so the AI focuses only on the information it needs.
@@ -18,7 +18,7 @@
### 📋 [Analysis Phase] - Business Analyst (Analyst)

- **Goal:** Turn ambiguous requirements into clear stories.
- **Deliverable:** `documents/stories/ID.story_name.md` (includes Gherkin-style behavior definitions)
- **Deliverable:** `.agent/documents/stories/ID.story_name.md` (includes Gherkin-style behavior definitions)
- **Guideline:** Focus on business logic of the form "when the user does X, the result should be Y."

### 📐 [Architecture Phase] - System Architect (Architect)
@@ -46,7 +46,7 @@
### ① Planning and design scenario

**Instruction:** "Use the BMAD skill to create a story file for the 'Shared System backend core module for an AI-based Contract Lifecycle Management (CLM) platform'."
**AI action:** Creates `documents/stories/001.clm-shared-system-core-module.md`, then requests approval.
**AI action:** Creates `.agent/documents/stories/001.clm-shared-system-core-module.md`, then requests approval.

### ② Prompt examples

2 changes: 1 addition & 1 deletion .agent/documents/improvement_plan.md
@@ -42,4 +42,4 @@ To secure LLM-based services, the following strategies are established
---

> [!TIP]
> Maintain the **security Markdown guidelines** as a separate document and keep them in sync with the validation logic in `back-end`.
> Maintain the **security Markdown guidelines** as a separate document and keep them in sync with the validation logic in `backend`.
7 changes: 3 additions & 4 deletions .agent/documents/stories/001.advanced_rag_system.md
@@ -18,7 +18,7 @@
- **I want to** see numeric scores for how accurate (Faithfulness) and how relevant (Relevance) the AI's answers are.
- **Acceptance Criteria:**
- After an answer is generated, Faithfulness and Answer Relevance scores are computed with RAGAS.
- When an answer scores below a certain threshold, log it and run an improvement process.
- When an answer scores **below 0.7 on Faithfulness or below 0.7 on Answer Relevance**, log it and run an improvement process that retries retrieval and generation up to 2 times.
- Evaluation results can be reviewed on a dashboard or in a log file.
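
The revised acceptance criterion (0.7 thresholds, at most 2 retries) can be sketched as a small quality gate. This is only an illustration: `generate` and `evaluate` are hypothetical callables standing in for the retrieval/generation step and the RAGAS scoring step, and `Scores` is an assumed container, not a type from the repository or from RAGAS.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

FAITHFULNESS_MIN = 0.7      # threshold from the acceptance criteria
ANSWER_RELEVANCE_MIN = 0.7  # threshold from the acceptance criteria
MAX_RETRIES = 2             # retry budget from the acceptance criteria


@dataclass
class Scores:
    """Hypothetical container for the two evaluation scores."""
    faithfulness: float
    answer_relevance: float


def passes(scores: Scores) -> bool:
    """An answer passes only if both scores reach their thresholds."""
    return (scores.faithfulness >= FAITHFULNESS_MIN
            and scores.answer_relevance >= ANSWER_RELEVANCE_MIN)


def answer_with_quality_gate(
    question: str,
    generate: Callable[[str, int], str],     # hypothetical: (question, retry index) -> answer
    evaluate: Callable[[str, str], Scores],  # hypothetical: (question, answer) -> scores
    log: Callable[[str], None] = print,
) -> Tuple[str, Scores, int]:
    """Generate an answer, then re-retrieve and regenerate up to MAX_RETRIES times."""
    retries = 0
    answer = generate(question, retries)
    scores = evaluate(question, answer)
    while not passes(scores):
        # Every failing answer is logged, including the final one.
        log(f"low-quality answer (retry={retries}): {scores}")
        if retries == MAX_RETRIES:
            break
        retries += 1
        answer = generate(question, retries)
        scores = evaluate(question, answer)
    return answer, scores, retries
```

The function returns the last answer even when every attempt fails; whether to surface that answer or a fallback message is a product decision the story does not specify.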

## 3. Architecture Notes
@@ -44,9 +44,8 @@ graph TD
## 4. Security

### Prompt Injection Prevention (Anti-Injection)
- Add a `Strict Instruction` to the system prompt (already implemented: `llm.py: get_rag_prompt`).
- Add input validation (sanitization) logic.
- Use the `Post-Prompting` technique to re-emphasize the core instructions after the user input.
- `Strict Instruction` is reflected in the system prompt (basic stage implemented: `llm.py: get_rag_prompt`).
- **Not implemented / planned:** input validation (sanitization) and logic that re-emphasizes the core instructions using the `Post-Prompting` technique.

---
> [!NOTE]
18 changes: 13 additions & 5 deletions .agent/rules/security_guideline.md
@@ -19,19 +19,27 @@

## 2. Input Sanitization

- **Markdown/Script Injection:** Remove dangerous HTML tags such as `<script>` and `<iframe>` from user input.
- **Length Limiting:** Limit input length to prevent denial-of-service (DoS) attacks through excessively long input.
- **Markdown/Script Injection:** Remove dangerous HTML tags such as `<script>` and `<iframe>` from user input using regular expressions.
- **Length Limiting:**
- **Maximum input length:** 2,000 characters or 2,048 tokens (whichever limit is lower applies).
- **When the limit is exceeded:** Immediately return an error message to the user (HTTP 400 - Bad Request).
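
A minimal sketch of the two rules above, assuming a plain helper that a request handler would call before building the prompt. The names `sanitize_user_input` and `InputTooLongError` are hypothetical, only the 2,000-character limit is checked (the token limit would need the model's tokenizer), and regex-based tag stripping is a baseline rather than a complete HTML sanitizer.

```python
import re

MAX_INPUT_CHARS = 2000  # character limit from the guideline

# Matches a full <script>...</script> or <iframe>...</iframe> element,
# or a stray opening/closing tag of either kind.
_DANGEROUS_TAGS = re.compile(
    r"<\s*(script|iframe)\b[^>]*>.*?<\s*/\s*\1\s*>"
    r"|<\s*/?\s*(?:script|iframe)\b[^>]*>",
    re.IGNORECASE | re.DOTALL,
)


class InputTooLongError(ValueError):
    """Hypothetical error type; a handler would translate it to HTTP 400."""
    status_code = 400


def sanitize_user_input(text: str) -> str:
    """Reject over-long input, then strip script/iframe tags."""
    if len(text) > MAX_INPUT_CHARS:
        raise InputTooLongError(
            f"input is {len(text)} characters; the limit is {MAX_INPUT_CHARS}"
        )
    return _DANGEROUS_TAGS.sub("", text)
```

Checking the length before stripping means the limit applies to what the user actually sent, which keeps the DoS guard cheap.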

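## 3. Output Content Security

Before showing LLM-generated output to the user, check the following.
- **PII filtering:** Filter the output so that resident registration numbers, email addresses, and similar data are not exposed.
- **Harmful Content:** Use a separate small model or a filtering library to check whether the output contains hate speech, dangerous information, or similar content.
- **PII filtering:**
- **Targets:** Email addresses, resident registration numbers, phone numbers.
- **Detection rule:** Detect with the `Presidio` library and standard regular expressions (regex), then replace matches with `[MASK]`.
- **Harmful Content:** Use a separate small model or a filtering library to check whether the output contains hate speech, dangerous information, or similar content; on detection, stop the response and return HTTP 403 - Forbidden.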
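
The regex half of the PII rule could look like the sketch below. It deliberately leaves out Presidio and covers only the three listed targets; the patterns are illustrative assumptions (for example, the phone pattern only handles Korean mobile numbers starting with `01`), not the project's actual rules.

```python
import re

# Illustrative patterns only; real-world formats vary more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
RRN_RE = re.compile(r"\b\d{6}-\d{7}\b")                      # resident registration number
PHONE_RE = re.compile(r"\b01[016789]-?\d{3,4}-?\d{4}\b")     # Korean mobile number

MASK = "[MASK]"


def mask_pii(text: str) -> str:
    """Replace every email, resident registration number, and phone number with [MASK]."""
    for pattern in (EMAIL_RE, RRN_RE, PHONE_RE):
        text = pattern.sub(MASK, text)
    return text
```

For example, `mask_pii("Contact kim@example.com or 010-1234-5678")` yields `"Contact [MASK] or [MASK]"`.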

## 4. API Security and Infrastructure

- **Rate Limiting:** Limit the number of API calls per IP and per account to block runaway costs and attacks.
- **Rate Limiting:**
- **Quotas:** 60 calls per minute per IP and 1,000 calls per day per account.
- **Enforcement:** When a quota is exceeded, return HTTP 429 - Too Many Requests with a `Retry-After` header.
- **API Key Management:** Manage keys through environment variables (`.env`) and never expose them in the code repository.
- **Logging Policy:**
- Log only anonymized incident IDs and rule IDs; never write raw data (especially PII) to logs.
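
One way to enforce the per-IP quota is an in-memory sliding window, sketched below. `SlidingWindowLimiter` is a hypothetical class, it holds state in process memory (so it would not work across multiple Cloud Run instances without a shared store), and it returns a status code plus headers instead of being wired into any web framework.

```python
import math
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within the last `window_seconds`."""

    def __init__(
        self,
        limit: int = 60,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, key: str) -> Tuple[int, Dict[str, str]]:
        """Return (status_code, headers) for one call made by `key`."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # Seconds until the oldest recorded call leaves the window.
            retry_after = max(1, math.ceil(self.window - (now - hits[0])))
            return 429, {"Retry-After": str(retry_after)}
        hits.append(now)
        return 200, {}
```

The daily per-account quota can reuse the same class with `limit=1000` and `window_seconds=86400`, keyed by account ID instead of IP.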

---

34 changes: 26 additions & 8 deletions .github/workflows/deploy.yml
@@ -5,42 +5,59 @@ on:
    branches:
      - main

# Prevent overlapping deployments but allow queuing
concurrency:
  group: "deploy-main"
  cancel-in-progress: false

env:
  PROJECT_ID: 'vigilant-shift-490601-t5' # Using provided project ID
  REGION: 'asia-northeast3' # Seoul region for better latency in KR
  PROJECT_ID: 'vigilant-shift-490601-t5'
  REGION: 'asia-northeast3'
  SERVICE_NAME: 'philo-rag-backend'
  GAR_HOST: 'asia-northeast3-docker.pkg.dev'
  IMAGE_NAME: 'philo-rag-backend'
  # Constructs the full image URI for centralization
  IMAGE_URI: 'asia-northeast3-docker.pkg.dev/vigilant-shift-490601-t5/cloud-run-source-deploy/philo-rag-backend:latest'

jobs:
  deploy-backend:
    name: Build & Deploy Backend to Cloud Run
    runs-on: ubuntu-latest
    permissions:
      contents: 'read'
      id-token: 'write'
    outputs:
      backend_url: ${{ steps.deploy.outputs.url }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Google Auth
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
          # Use Workload Identity Federation for better security
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT_EMAIL }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v2

      - name: Authorize Docker
        run: gcloud auth configure-docker asia-northeast3-docker.pkg.dev
        run: gcloud auth configure-docker ${{ env.GAR_HOST }}

      - name: Build and Push Container
        working-directory: ./backend
        run: |-
          docker build -t "asia-northeast3-docker.pkg.dev/${{ env.PROJECT_ID }}/cloud-run-source-deploy/${{ env.SERVICE_NAME }}:${{ github.sha }}" .
          docker push "asia-northeast3-docker.pkg.dev/${{ env.PROJECT_ID }}/cloud-run-source-deploy/${{ env.SERVICE_NAME }}:${{ github.sha }}"
          docker build -t "${{ env.IMAGE_URI }}" .
          docker push "${{ env.IMAGE_URI }}"

      - name: Deploy to Cloud Run
        id: deploy
        uses: google-github-actions/deploy-cloudrun@v2
        with:
          service: ${{ env.SERVICE_NAME }}
          region: ${{ env.REGION }}
          image: asia-northeast3-docker.pkg.dev/${{ env.PROJECT_ID }}/cloud-run-source-deploy/${{ env.SERVICE_NAME }}:${{ github.sha }}
          image: ${{ env.IMAGE_URI }}
          env_vars: |-
            OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }}
            SUPABASE_URL=${{ secrets.SUPABASE_URL }}
@@ -68,7 +85,8 @@ jobs:
      - name: Build Frontend
        working-directory: ./frontend
        env:
          NEXT_PUBLIC_API_BASE_URL: ${{ secrets.NEXT_PUBLIC_API_BASE_URL }}
          # Injects the actual deployed backend URL into the frontend build
          NEXT_PUBLIC_API_BASE_URL: ${{ needs.deploy-backend.outputs.backend_url }}
        run: npm run build

      - name: Deploy to Firebase Hosting
4 changes: 1 addition & 3 deletions README.md
@@ -1,4 +1,4 @@
# Philo-RAG (Conversations with Philosophers)
# Philo-RAG (Conversations with Philosophers)
> ⚠️ **Notice (Cold Start)**
> The backend server for this project is deployed on a free cloud instance. The server goes dormant after a period without requests, so **the first request (cold start) may take about 1 minute before the backend responds**.

@@ -221,5 +221,3 @@ npm run dev
```
Open `http://localhost:3000` to start using the system.

Open `http://localhost:3000` to start using the system.

18 changes: 14 additions & 4 deletions backend/Dockerfile
@@ -17,12 +17,22 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .
# Setup model cache directory for pre-loading
RUN mkdir -p /app/model_cache && chmod 777 /app/model_cache
ENV HF_HOME=/app/model_cache

# Pre-load the embedding model to reduce cold start latency
RUN python -c "from langchain_huggingface import HuggingFaceEmbeddings; HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', model_kwargs={'device': 'cpu'})"

# Create a non-privileged user for security
RUN adduser --disabled-password --gecos "" appuser \
    && chown -R appuser:appuser /app

# Switch to non-privileged user
USER appuser

# Expose the port
EXPOSE 8080

# Command to run the application
# We use 0.0.0.0 for Cloud Run
# Command to run the application (0.0.0.0 for Cloud Run)
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080", "--proxy-headers"]
8 changes: 7 additions & 1 deletion backend/download_books.py
@@ -181,7 +181,13 @@ def main():
    os.makedirs(data_dir, exist_ok=True)

    downloaded_count = 0
    target_count = int(os.getenv("TARGET_COUNT", "300"))
    raw_target_count = os.getenv("TARGET_COUNT", "300")
    try:
        target_count = int(raw_target_count)
    except ValueError as exc:
        raise ValueError(f"TARGET_COUNT must be an integer, got '{raw_target_count}'") from exc
    if target_count <= 0:
        raise ValueError(f"TARGET_COUNT must be greater than 0, got {target_count}")

    current_url = shelf_url
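
The validation added in this hunk could also be factored into a reusable helper, sketched below. `read_positive_int_env` is a hypothetical name and is not part of `download_books.py`; it applies the same two checks (integer, greater than 0) to any environment variable.

```python
import os


def read_positive_int_env(name: str, default: str) -> int:
    """Read an environment variable and require a positive integer."""
    raw = os.getenv(name, default)
    try:
        value = int(raw)
    except ValueError as exc:
        # Chain the original error so the traceback shows the bad literal.
        raise ValueError(f"{name} must be an integer, got '{raw}'") from exc
    if value <= 0:
        raise ValueError(f"{name} must be greater than 0, got {value}")
    return value
```

With this helper, the call site would reduce to `target_count = read_positive_int_env("TARGET_COUNT", "300")`.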

2 changes: 1 addition & 1 deletion backend/scripts/check_db.py
@@ -2,7 +2,7 @@
import sys

# Ensure we can import app modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from app.services.database import get_client

38 changes: 24 additions & 14 deletions backend/scripts/ingest_data.py
@@ -8,14 +8,14 @@
from typing import List, Dict

# Ensure we can import app modules
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from app.core.config import settings
from app.services.embedding import embedding_service
from app.services.database import get_client
from langchain.text_splitter import RecursiveCharacterTextSplitter

supabase_client = get_client()
# Supabase client will be initialized in ingest_document to avoid issues during import

class IngestionError(Exception):
    """Raised when data ingestion fails."""
@@ -49,23 +49,32 @@ def generate_deterministic_uuid(seed_text: str) -> str:
    """Generates a consistent UUID based on the input text to ensure idempotency."""
    return str(uuid.uuid5(UUID_NAMESPACE, seed_text))

import re

def strip_gutenberg_boilerplate(text: str) -> str:
    """Removes Project Gutenberg START and END identifiers from the text."""
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
    # Robust START and END markers derived from standard Gutenberg patterns
    start_marker_regex = r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*"
    end_marker_regex = r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*"

    start_idx = text.upper().find(start_marker)
    if start_idx != -1:
        # Move past the marker line
        newline_idx = text.find("\n", start_idx)
        if newline_idx != -1:
            text = text[newline_idx+1:]
    # Find START marker
    start_match = re.search(start_marker_regex, text, re.IGNORECASE | re.DOTALL)
    if start_match:
        text = text[start_match.end():]

    end_idx = text.upper().find(end_marker)
    if end_idx != -1:
        text = text[:end_idx]
    # Find END marker
    # Use standard marker or common fallback license starters
    end_match = re.search(end_marker_regex, text, re.IGNORECASE | re.DOTALL)
    if end_match:
        text = text[:end_match.start()]
    else:
        # Fallback to searching for the full license block if marker is missing
        license_starter = r"THE FULL PROJECT GUTENBERG™ LICENSE"
        license_match = re.search(license_starter, text, re.IGNORECASE)
        if license_match:
            text = text[:license_match.start()]
Comment on lines +72 to +75

⚠️ Potential issue | 🟡 Minor

The fallback license pattern is too strict.

The current fallback only matches the exact phrase containing `™`, so footer removal can fail on spelling variants (`(TM)`, `-TM`, or no symbol at all). Relaxing the pattern around line 72 would be safer.

🔧 Suggested fix
-        license_starter = r"THE FULL PROJECT GUTENBERG™ LICENSE"
+        license_starter = r"THE FULL PROJECT GUTENBERG(?:\s*\(TM\)|-TM|™)? LICENSE"
         license_match = re.search(license_starter, text, re.IGNORECASE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/ingest_data.py` around lines 72 - 75, The fallback license
pattern using the literal string in license_starter is too strict and misses
common variations like "(TM)", "-TM", "TM" or absence of the trademark symbol;
update the regex used in license_starter (and the subsequent license_match
logic) to match a more flexible pattern that looks for "THE FULL PROJECT
GUTENBERG" followed by optional whitespace and an optional trademark token
(e.g., optionally match ™ or (TM) or -TM or TM, with optional punctuation) using
re.IGNORECASE so text = text[:license_match.start()] still trims the footer when
any of these variants are present.
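
The effect of the suggested pattern can be checked in isolation. The sketch below wraps the regex from the suggestion in a hypothetical helper, `trim_license_footer`, which is not a function in `ingest_data.py`.

```python
import re

# Pattern from the review suggestion: the trademark token is optional.
LICENSE_STARTER = re.compile(
    r"THE FULL PROJECT GUTENBERG(?:\s*\(TM\)|-TM|™)? LICENSE",
    re.IGNORECASE,
)


def trim_license_footer(text: str) -> str:
    """Cut the text at the first license heading, if one is present."""
    match = LICENSE_STARTER.search(text)
    return text[:match.start()] if match else text


if __name__ == "__main__":
    for heading in (
        "THE FULL PROJECT GUTENBERG™ LICENSE",
        "THE FULL PROJECT GUTENBERG(TM) LICENSE",
        "The Full Project Gutenberg-tm License",
        "THE FULL PROJECT GUTENBERG LICENSE",
    ):
        sample = "Body.\n" + heading + "\nLegal text"
        print(repr(trim_license_footer(sample)))
```

Note that the suggested pattern does not cover a bare `TM` separated by a space (`GUTENBERG TM LICENSE`), which the prompt text lists as a possible variant; covering it would need one more alternative in the group.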


    return text
    return text.strip()
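
For reference, the START/END handling in the new version can be exercised on a small sample. This sketch reuses the two marker regexes from the diff but omits the license fallback branch; `strip_markers` is a hypothetical stand-in for `strip_gutenberg_boilerplate`, and the sample text is invented.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
START_RE = re.compile(r"\*\*\* START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", FLAGS)
END_RE = re.compile(r"\*\*\* END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", FLAGS)


def strip_markers(text: str) -> str:
    """Keep only the text between the START and END marker lines."""
    start = START_RE.search(text)
    if start:
        # Drop everything up to and including the closing '***' of the marker.
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


if __name__ == "__main__":
    sample = (
        "Produced by volunteers.\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK ETHICS ***\n"
        "Book body.\n"
        "*** END OF THIS PROJECT GUTENBERG EBOOK ETHICS ***\n"
        "License text."
    )
    print(strip_markers(sample))
```

Because the regex consumes the marker through its closing `***`, the title on the marker line is removed together with the marker, which the old `find`-based version achieved by skipping to the next newline.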

def generate_embedding_with_retry(text: str, max_retries: int = 3):
    """Wrapper to handle rate limiting and retries for the embedding API."""
@@ -87,6 +96,7 @@ def ingest_document(text: str, philosopher: str, school: str, book_title: str, l
    Chunks text, fetches metadata, generates embeddings via multiprocessing,
    and upserts to Supabase in batches with idempotency.
    """
    supabase_client = get_client() # Lazy initialization
    print(f"Starting ingestion for {philosopher} - {book_title}")

    # 1. Fetch metadata
17 changes: 11 additions & 6 deletions backend/scripts/update_metadata.py
@@ -100,14 +100,19 @@ def update_metadata():
# Prepare batch data for atomicity (at book level)
batch_data = []
for doc in res.data:
    # Merge new fields into nested book_info to maintain structure
    new_metadata = doc["metadata"].copy()
    book_info = new_metadata.get("book_info", {}).copy()
    book_info.update({
        "kr_title": meta["kr_title"],
        "cover_url": meta["thumbnail"] or book_info.get("cover_url", ""),
        "link": meta["link"] or book_info.get("link", "")
    })
    new_metadata["book_info"] = book_info

    batch_data.append({
        "id": doc["id"],
        "metadata": {
            **doc["metadata"],
            "kr_title": meta["kr_title"],
            "thumbnail": meta["thumbnail"],
            "link": meta["link"]
        }
        "metadata": new_metadata
    })

if batch_data:
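
The nested merge introduced in this hunk can be isolated as a pure function, which makes the fallback behavior easy to see: an empty `thumbnail` or `link` keeps the existing value. `merge_book_info` is a hypothetical helper, and the sample dictionaries in the usage lines are invented for illustration.

```python
from typing import Any, Dict


def merge_book_info(metadata: Dict[str, Any], meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` with Korean book fields merged into `book_info`."""
    new_metadata = dict(metadata)                          # shallow copy of the top level
    book_info = dict(new_metadata.get("book_info") or {})  # copy the nested dict as well
    book_info.update({
        "kr_title": meta["kr_title"],
        # Empty strings are falsy, so existing values win over blank updates.
        "cover_url": meta["thumbnail"] or book_info.get("cover_url", ""),
        "link": meta["link"] or book_info.get("link", ""),
    })
    new_metadata["book_info"] = book_info
    return new_metadata


if __name__ == "__main__":
    existing = {"philosopher": "Kant", "book_info": {"cover_url": "old.jpg", "link": "old-link"}}
    update = {"kr_title": "순수이성비판", "thumbnail": "", "link": "https://example.com/new"}
    print(merge_book_info(existing, update))
```

Copying both levels keeps the original row untouched, so a failed batch upsert leaves the in-memory documents in their previous state.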