feat: implement Gutenberg books scraping and resumable data ingestion… by SanghunYun95 · Pull Request #1 · SanghunYun95/philo-rag

SanghunYun95 · 2026-02-25T13:14:38Z

… system

- Added download_books.py to scrape Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
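The "resumable ingestion" item can be illustrated with a minimal sketch. It is not the code in ingest_data.py: the `(title, index)` key shape, the `pending_chunks` and `ingest` helpers, and the in-memory `stored` set standing in for the database are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Set, Tuple

ChunkKey = Tuple[str, int]  # (book title, chunk index); a hypothetical key shape


def pending_chunks(
    title: str, chunks: Iterable[str], stored: Set[ChunkKey]
) -> Iterator[Tuple[int, str]]:
    """Yield (index, text) for every chunk that is not stored yet."""
    for index, text in enumerate(chunks):
        if (title, index) not in stored:
            yield index, text


def ingest(title: str, chunks: Iterable[str], stored: Set[ChunkKey]) -> int:
    """Store the missing chunks and return how many were written."""
    written = 0
    for index, _text in pending_chunks(title, chunks, stored):
        # A real run would embed the text and write it to the database here.
        stored.add((title, index))
        written += 1
    return written


# Simulate a run that was interrupted after the first two chunks.
stored: Set[ChunkKey] = {("Apology by Plato", 0), ("Apology by Plato", 1)}
chunks = ["part one", "part two", "part three", "part four"]
first = ingest("Apology by Plato", chunks, stored)   # writes the two missing chunks
second = ingest("Apology by Plato", chunks, stored)  # nothing left to do
```

Because the skip check runs per chunk, restarting after a crash only pays for the chunks that were never written.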

Summary by CodeRabbit

Release Notes

New Features
- Added book download and data ingestion pipeline for classic literature texts
- Added database progress tracking utility
- Included 7 classic literary texts in the data library
Updates
- Upgraded AI embedding and language model services to Gemini for improved performance
- Optimized database schema for better search and retrieval efficiency


coderabbitai · 2026-02-25T13:15:01Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc5746 and c508cf2.

📒 Files selected for processing (112)

app/services/embedding.py
app/services/llm.py
check_progress.py
data/A Budget of Paradoxes Volume I by Augustus De Morgan.txt
data/A Pickle for the Knowing Ones by Timothy Dexter.txt
data/A Treatise of Human Nature by David Hume.txt
data/A Vindication of the Rights of Woman by Mary Wollstonecraft.txt
data/Also sprach Zarathustra Ein Buch für Alle und Keinen German by Friedrich Wilhelm Nietzsche.txt
data/An Enquiry Concerning Human Understanding by David Hume.txt
data/An Essay Concerning Humane Understanding Volume 1 by John Locke.txt
data/Apology Crito and Phaedo of Socrates by Plato.txt
data/Apology by Plato.txt
data/As a man thinketh by James Allen.txt
data/Bacons Essays and Wisdom of the Ancients by Francis Bacon.txt
data/Beyond Good and Evil by Friedrich Wilhelm Nietzsche.txt
data/Ciceros Tusculan Disputations by Marcus Tullius Cicero.txt
data/De Officiis Latin by Marcus Tullius Cicero.txt
data/Democracy and Education An Introduction to the Philosophy of Education by John Dewey.txt
data/Democracy in America Volume 2 by Alexis de Tocqueville.txt
data/Demonology and Devil-lore by Moncure Daniel Conway.txt
data/Discourse on the Method of Rightly Conducting Ones Reason and of Seeking Truth in the Sciences by René Descartes.txt
data/Ecce Homo by Friedrich Wilhelm Nietzsche.txt
data/Essays by Ralph Waldo Emerson by Ralph Waldo Emerson.txt
data/Essays of Schopenhauer by Arthur Schopenhauer.txt
data/Ethics by Benedictus de Spinoza.txt
data/Etiquette by Emily Post.txt
data/Euthyphro by Plato.txt
data/Fundamental Principles of the Metaphysic of Morals by Immanuel Kant.txt
data/Goethes Theory of Colours by Johann Wolfgang von Goethe.txt
data/Gorgias by Plato.txt
data/How We Think by John Dewey.txt
data/Human All Too Human A Book for Free Spirits by Friedrich Wilhelm Nietzsche.txt
data/Isis unveiled Volume 1 of 2 Science A master-key to mysteries of ancient and modern science and theology by H P Blavatsky.txt
data/Laws by Plato.txt
data/Leviathan by Thomas Hobbes.txt
data/Meditations by Emperor of Rome Marcus Aurelius.txt
data/Nature by Ralph Waldo Emerson.txt
data/On Heroes Hero-Worship and the Heroic in History by Thomas Carlyle.txt
data/On Liberty by John Stuart Mill.txt
data/On War by Carl von Clausewitz.txt
data/On the Duty of Civil Disobedience by Henry David Thoreau.txt
data/On the Nature of Things by Titus Lucretius Carus.txt
data/Pascals Pensées by Blaise Pascal.txt
data/Perpetual Peace A Philosophical Essay by Immanuel Kant.txt
data/Phaedo by Plato.txt
data/Phaedrus by Plato.txt
data/Plato and the Other Companions of Sokrates 3rd ed Volume 1 by George Grote.txt
data/Plutarchs Morals by Plutarch.txt
data/Politics A Treatise on Government by Aristotle.txt
data/Pragmatism A New Name for Some Old Ways of Thinking by William James.txt
data/Psyche The Cult of Souls and Belief in Immortality among the Greeks by Erwin Rohde.txt
data/Psychology of the Unconscious by C G Jung.txt
data/Reflections or Sentences and Moral Maxims by François duc de La Rochefoucauld.txt
data/Revelations of Divine Love by of Norwich Julian.txt
data/Roman Stoicism by Edward Vernon Arnold.txt
data/Second Treatise of Government by John Locke.txt
data/Siddhartha by Hermann Hesse.txt
data/Sun Tzŭ on the Art of War The Oldest Military Treatise in the World by active 6th century BC Sunzi.txt
data/Symposium by Plato.txt
data/The Anatomy of Melancholy by Robert Burton.txt
data/The Antichrist by Friedrich Wilhelm Nietzsche.txt
data/The Birth of Tragedy or Hellenism and Pessimism by Friedrich Wilhelm Nietzsche.txt
data/The Case of Wagner Nietzsche Contra Wagner and Selected Aphorisms by Friedrich Wilhelm Nietzsche.txt
data/The City of God Volume I by Saint of Hippo Augustine.txt
data/The City of God Volume II by Saint of Hippo Augustine.txt
data/The Communist Manifesto by Karl Marx and Friedrich Engels.txt
data/The Confessions of St Augustine by Saint of Hippo Augustine.txt
data/The Consolation of Philosophy by Boethius.txt
data/The Critique of Pure Reason by Immanuel Kant.txt
data/The Enchiridion by Epictetus.txt
data/The Essays of Arthur Schopenhauer Studies in Pessimism by Arthur Schopenhauer.txt
data/The Essays of Arthur Schopenhauer the Wisdom of Life by Arthur Schopenhauer.txt
data/The Ethics of Aristotle by Aristotle.txt
data/The Genealogy of Morals by Friedrich Wilhelm Nietzsche.txt
data/The Grand Inquisitor by Fyodor Dostoyevsky.txt
data/The Kama Sutra of Vatsyayana by Vatsyayana.txt
data/The Man Who Was Thursday A Nightmare by G K Chesterton.txt
data/The Marriage of Heaven and Hell by William Blake.txt
data/The Meditations of the Emperor Marcus Aurelius Antoninus by Emperor of Rome Marcus Aurelius.txt
data/The Poetics of Aristotle by Aristotle.txt
data/The Prince by Niccolò Machiavelli.txt
data/The Principles of Psychology Volume 1 of 2 by William James.txt
data/The Problems of Philosophy by Bertrand Russell.txt
data/The Prophet by Kahlil Gibran.txt
data/The Republic by Plato.txt
data/The Republic of Plato by Plato.txt
data/The Secret Doctrine Vol 1 of 4 by H P Blavatsky.txt
data/The Secret Doctrine Vol 2 of 4 by H P Blavatsky.txt
data/The Song Celestial Or Bhagavad-Gîtâ from the Mahâbhârata.txt
data/The Twilight of the Idols or How to Philosophize with the Hammer The Antichrist by Friedrich Wilhelm Nietzsche.txt
data/The Will to Believe and Other Essays in Popular Philosophy by William James.txt
data/The World as Will and Idea Vol 1 of 3 by Arthur Schopenhauer.txt
data/The history of magic including a clear and precise exposition of its procedure its rites and its mysteries by Éliphas Lévi.txt
data/The social contract discourses by Jean-Jacques Rousseau.txt
data/The symbolism of Freemasonry Illustrating and explaining its science and philosophy its legends myths and symbols by Albert Gallatin Mackey.txt
data/Thus Spake Zarathustra A Book for All and None by Friedrich Wilhelm Nietzsche.txt
data/Utilitarianism by John Stuart Mill.txt
data/Utopia by Saint Thomas More.txt
data/Walden and On The Duty Of Civil Disobedience by Henry David Thoreau.txt
data/What Is Art by graf Leo Tolstoy.txt
data/新序 Chinese by Xiang Liu.txt
data/日知錄 Chinese by Yanwu Gu.txt
data/韓詩外傳 Complete Chinese by active 150 BC Ying Han.txt
download_books.py
requirements.txt
scripts/check_db.py
scripts/ingest_all_data.py
scripts/ingest_data.py
supabase/migrations/20260223065008_initialize_pgvector.sql
supabase/migrations/20260225112500_update_vector_dimension.sql
supabase/migrations/20260225141500_add_hnsw_index.sql
verify_and_clear.py

📝 Walkthrough

Walkthrough

The pull request migrates embeddings from local SentenceTransformer models to Gemini API with 3072-dimensional vectors, upgrades the LLM model to gemini-2.5-flash, adds nine public domain texts for ingestion, and introduces a complete data ingestion pipeline with batch processing, deterministic IDs, retry logic, and HNSW vector indexing.
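One common way to get the "deterministic IDs" mentioned in the walkthrough is a name-based UUID (version 5). The sketch below assumes that approach. The namespace value, the `chunk_uuid` helper, and the `title#index` naming scheme are hypothetical and are not taken from the PR.

```python
import uuid

# Hypothetical namespace for this sketch; it only needs to stay constant
# across runs so that the same input always maps to the same UUID.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "philo-rag.example")


def chunk_uuid(title: str, chunk_index: int) -> str:
    """Return the same UUID every time for the same title and chunk index."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{title}#{chunk_index}"))


a = chunk_uuid("The Republic by Plato", 0)
b = chunk_uuid("The Republic by Plato", 0)
c = chunk_uuid("The Republic by Plato", 1)
```

With stable IDs, re-ingesting a book overwrites its existing rows on upsert instead of creating duplicates.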

Changes

| Cohort / File(s) | Summary |
|---|---|
| Embedding & LLM Services: `app/services/embedding.py`, `app/services/llm.py` | Replaced local SentenceTransformer embeddings with GoogleGenerativeAIEmbeddings (gemini-embedding-001, 3072 dims). Updated LLM model from gemini-1.5-pro to gemini-2.5-flash. |
| Data Files: `data/*.txt` | Added nine public domain texts: A Pickle for the Knowing Ones, As a man thinketh, Euthyphro, On the Duty of Civil Disobedience, The Communist Manifesto, The Enchiridion, The Grand Inquisitor, The Marriage of Heaven and Hell, and others (1104–1791 lines each). |
| Data Ingestion Pipeline: `download_books.py`, `scripts/ingest_all_data.py`, `scripts/ingest_data.py` | Introduced book scraper (download_books.py), batch ingestion orchestrator (ingest_all_data.py), and enhanced ingestion with deterministic UUIDs, retry logic, batch processing, idempotency checks, and --limit flag support. |
| Database Infrastructure: `supabase/migrations/*.sql`, `verify_and_clear.py` | Updated pgvector dimension from 1536 to 3072 across schema and match_documents function. Added HNSW index for optimized vector search. Included verification script for new vector dimensions. |
| Dependencies & Utilities: `requirements.txt`, `check_progress.py`, `scripts/check_db.py` | Loosened version constraints and added langchain-google-genai. Added database and progress check utilities. |
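The "retry logic" around embedding calls could look roughly like the following. This is a generic exponential-backoff sketch, not the PR's `generate_embedding_with_retry`. The `with_retry` helper, its parameters, and the `flaky_embed` stand-in for the Gemini call are assumptions.

```python
import time
from typing import Callable, List, Optional


def with_retry(
    fn: Callable[[], List[float]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    assert last_error is not None
    raise last_error


# Hypothetical stand-in for the embedding API: fails twice, then succeeds.
calls = {"count": 0}


def flaky_embed() -> List[float]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return [0.0] * 3072  # same length as the vectors described above


delays: List[float] = []  # record delays instead of actually sleeping
vector = with_retry(flaky_embed, max_attempts=5, sleep=delays.append)
```

Injecting `sleep` keeps the helper testable without real waiting. In production the default `time.sleep` would apply.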

Sequence Diagram(s)

sequenceDiagram
    participant Script as ingest_all_data.py
    participant Reader as File Reader
    participant Ingester as ingest_document()
    participant Chunker as Text Chunker
    participant Embedder as Gemini API
    participant Batch as Batch Processor
    participant DB as Supabase

    Script->>Reader: Read .txt files from data/
    Reader-->>Script: File content
    Script->>Ingester: Call ingest_document(text, philosopher, school, title)
    Ingester->>Chunker: Split text into chunks
    Chunker-->>Ingester: Chunks with indices
    loop Per chunk with retries
        Ingester->>Embedder: generate_embedding_with_retry(chunk)
        Embedder-->>Ingester: 3072-dim embedding vector
    end
    Ingester->>Batch: Collect enriched chunks (Title, Author injected)
    Batch->>Batch: Generate deterministic UUIDs per chunk
    Batch->>DB: Upsert batch of 100 chunks with embeddings
    DB-->>Batch: Confirm insertion
    Batch-->>Ingester: Batch complete
    Ingester-->>Script: Processing result
    Script->>Script: Aggregate results and report
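The batch step in the diagram (upserting 100 chunks at a time) can be sketched as below. `InMemoryTable` is a hypothetical stand-in for the Supabase table, and the row shape is illustrative. Only the batch size of 100 comes from the diagram.

```python
from typing import Dict, Iterator, List, Sequence

BATCH_SIZE = 100  # batch size named in the sequence diagram


def batched(rows: Sequence[dict], size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Split rows into consecutive lists of at most `size` items."""
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


class InMemoryTable:
    """Stand-in for the database table; upsert overwrites rows with the same id."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self.calls = 0

    def upsert(self, batch: List[dict]) -> None:
        self.calls += 1
        for row in batch:
            self.rows[row["id"]] = row


table = InMemoryTable()
rows = [{"id": f"chunk-{i}", "content": f"text {i}"} for i in range(250)]

for batch in batched(rows):
    table.upsert(batch)

# Running the same ingestion again leaves the row count unchanged,
# because upsert replaces rows that share an id.
for batch in batched(rows):
    table.upsert(batch)
```

Upserting in batches bounds the size of each request, and keying rows by ID makes a repeated run harmless.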

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Whiskers twitching with delight,
Gemini embeddings, 3072-bright!
Ancient texts now vectorized,
Batch by batch, optimized,
HNSW hops through the night! ✨


SanghunYun95 merged commit 7ae8331 into main Feb 25, 2026
1 check was pending

This was referenced Feb 26, 2026

Feat/data ingestion system #2

Merged

Feat/data ingestion system #3

Merged

Feat/data ingestion system #4

Merged

Feat/data ingestion system #5

Merged

Feat/rag chat and error handling #9

Merged

SanghunYun95 merged 1 commit into main from feat/data-ingestion-system
