feat: implement Gutenberg books scraping and resumable data ingestion…#1
SanghunYun95 merged 1 commit into main
… system

- Added download_books.py to scrape Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
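The resumable ingestion described in the commit message could look roughly like the sketch below. It is an illustration only, not the code in ingest_data.py: the `Chunk` tuple shape, the `chunk_key` format, and the `pending_chunks` helper are all hypothetical, and the set of already-stored keys is assumed to have been fetched from the database beforehand.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical chunk shape: (title, chunk_index, text).
Chunk = Tuple[str, int, str]


def chunk_key(title: str, chunk_index: int) -> str:
    """Build a stable key that identifies one chunk of one book."""
    return f"{title}:{chunk_index}"


def pending_chunks(chunks: Iterable[Chunk], existing_keys: Set[str]) -> Iterator[Chunk]:
    """Yield only the chunks whose key is not already stored."""
    for title, index, text in chunks:
        if chunk_key(title, index) not in existing_keys:
            yield (title, index, text)


if __name__ == "__main__":
    # Assumption: these keys were read from the database before the run.
    stored = {"Meditations:0", "Meditations:1"}
    chunks = [("Meditations", i, f"chunk {i}") for i in range(4)]
    todo = list(pending_chunks(chunks, stored))
    print([index for _, index, _ in todo])  # prints [2, 3]
```

Because the key depends only on the title and the chunk index, an interrupted run can be restarted and will pick up at the first chunk that was never written, provided the chunking itself is unchanged between runs.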
Caution: Review failed
The pull request is closed.
Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Files selected for processing (112)
Walkthrough
The pull request migrates embeddings from local SentenceTransformer models to the Gemini API with 3072-dimensional vectors, upgrades the LLM to gemini-2.5-flash, adds nine public domain texts for ingestion, and introduces a complete data ingestion pipeline with batch processing, deterministic IDs, retry logic, and HNSW vector indexing.
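As a rough illustration of the "deterministic IDs" and "retry logic" mentioned in the walkthrough, the sketch below derives a UUID from the title and chunk index with `uuid.uuid5` and wraps an arbitrary embedding call in exponential backoff. The namespace URL, the `deterministic_chunk_id` and `call_with_retry` names, and the backoff parameters are assumptions made for the example; the PR's own implementation may differ.

```python
import time
import uuid
from typing import Callable, List

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/philosophy-chunks")


def deterministic_chunk_id(title: str, chunk_index: int) -> str:
    """Return the same UUID every time for the same (title, chunk_index)."""
    return str(uuid.uuid5(NAMESPACE, f"{title}:{chunk_index}"))


def call_with_retry(
    embed: Callable[[str], List[float]],
    text: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call embed(text), retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed(text)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Stable IDs are what make an upsert idempotent: re-ingesting the same chunk overwrites its existing row instead of creating a duplicate. The `sleep` parameter is injectable only so the backoff can be exercised in tests without waiting.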
Sequence Diagram(s)
sequenceDiagram
participant Script as ingest_all_data.py
participant Reader as File Reader
participant Ingester as ingest_document()
participant Chunker as Text Chunker
participant Embedder as Gemini API
participant Batch as Batch Processor
participant DB as Supabase
Script->>Reader: Read .txt files from data/
Reader-->>Script: File content
Script->>Ingester: Call ingest_document(text, philosopher, school, title)
Ingester->>Chunker: Split text into chunks
Chunker-->>Ingester: Chunks with indices
loop Per chunk with retries
Ingester->>Embedder: generate_embedding_with_retry(chunk)
Embedder-->>Ingester: 3072-dim embedding vector
end
Ingester->>Batch: Collect enriched chunks (Title, Author injected)
Batch->>Batch: Generate deterministic UUIDs per chunk
Batch->>DB: Upsert batch of 100 chunks with embeddings
DB-->>Batch: Confirm insertion
Batch-->>Ingester: Batch complete
Ingester-->>Script: Processing result
Script->>Script: Aggregate results and report
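The batching step in the diagram (upserting 100 chunks at a time) can be sketched as follows. This is a minimal example under stated assumptions: the `batched` and `upsert_in_batches` helpers are hypothetical, and `upsert` stands in for whatever Supabase client call the project actually uses, so no real database API appears here.

```python
from typing import Callable, Dict, Iterator, List, Sequence

# Hypothetical row shape: one dict per enriched chunk.
Row = Dict[str, object]


def batched(rows: Sequence[Row], size: int = 100) -> Iterator[List[Row]]:
    """Split rows into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def upsert_in_batches(
    rows: Sequence[Row],
    upsert: Callable[[List[Row]], None],
    size: int = 100,
) -> int:
    """Send rows to `upsert` one batch at a time; return the number of rows sent."""
    written = 0
    for batch in batched(rows, size):
        upsert(batch)  # assumption: raises on failure
        written += len(batch)
    return written


if __name__ == "__main__":
    sent: List[int] = []
    rows = [{"id": str(i), "content": f"chunk {i}"} for i in range(250)]
    total = upsert_in_batches(rows, lambda batch: sent.append(len(batch)))
    print(total, sent)  # prints 250 [100, 100, 50]
```

Batching keeps each request to a bounded size, and combined with deterministic IDs a failed batch can simply be re-sent without creating duplicate rows.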
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes