feat: implement Gutenberg books scraping and resumable data ingestion… #1

Merged
SanghunYun95 merged 1 commit into main from feat/data-ingestion-system
Feb 25, 2026

@SanghunYun95 SanghunYun95 commented Feb 25, 2026

… system

  • Added download_books.py to scrape Philosophy & Ethics bookshelf
  • Downloaded 100 philosophy books to data directory
  • Implemented resumable ingestion in ingest_data.py to skip existing chunks
  • Updated vector dimension logic and added HNSW index migration
  • Added check_progress.py and verify_and_clear.py scripts

Summary by CodeRabbit

Release Notes

  • New Features

    • Added book download and data ingestion pipeline for classic literature texts
    • Added database progress tracking utility
    • Included 7 classic literary texts in the data library
  • Updates

    • Upgraded AI embedding and language model services to Gemini for improved performance
    • Optimized database schema for better search and retrieval efficiency

@SanghunYun95 SanghunYun95 merged commit 7ae8331 into main Feb 25, 2026
1 check was pending

coderabbitai bot commented Feb 25, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc5746 and c508cf2.

📒 Files selected for processing (112)
  • app/services/embedding.py
  • app/services/llm.py
  • check_progress.py
  • data/A Budget of Paradoxes Volume I by Augustus De Morgan.txt
  • data/A Pickle for the Knowing Ones by Timothy Dexter.txt
  • data/A Treatise of Human Nature by David Hume.txt
  • data/A Vindication of the Rights of Woman by Mary Wollstonecraft.txt
  • data/Also sprach Zarathustra Ein Buch für Alle und Keinen German by Friedrich Wilhelm Nietzsche.txt
  • data/An Enquiry Concerning Human Understanding by David Hume.txt
  • data/An Essay Concerning Humane Understanding Volume 1 by John Locke.txt
  • data/Apology Crito and Phaedo of Socrates by Plato.txt
  • data/Apology by Plato.txt
  • data/As a man thinketh by James Allen.txt
  • data/Bacons Essays and Wisdom of the Ancients by Francis Bacon.txt
  • data/Beyond Good and Evil by Friedrich Wilhelm Nietzsche.txt
  • data/Ciceros Tusculan Disputations by Marcus Tullius Cicero.txt
  • data/De Officiis Latin by Marcus Tullius Cicero.txt
  • data/Democracy and Education An Introduction to the Philosophy of Education by John Dewey.txt
  • data/Democracy in America Volume 2 by Alexis de Tocqueville.txt
  • data/Demonology and Devil-lore by Moncure Daniel Conway.txt
  • data/Discourse on the Method of Rightly Conducting Ones Reason and of Seeking Truth in the Sciences by René Descartes.txt
  • data/Ecce Homo by Friedrich Wilhelm Nietzsche.txt
  • data/Essays by Ralph Waldo Emerson by Ralph Waldo Emerson.txt
  • data/Essays of Schopenhauer by Arthur Schopenhauer.txt
  • data/Ethics by Benedictus de Spinoza.txt
  • data/Etiquette by Emily Post.txt
  • data/Euthyphro by Plato.txt
  • data/Fundamental Principles of the Metaphysic of Morals by Immanuel Kant.txt
  • data/Goethes Theory of Colours by Johann Wolfgang von Goethe.txt
  • data/Gorgias by Plato.txt
  • data/How We Think by John Dewey.txt
  • data/Human All Too Human A Book for Free Spirits by Friedrich Wilhelm Nietzsche.txt
  • data/Isis unveiled Volume 1 of 2 Science A master-key to mysteries of ancient and modern science and theology by H P Blavatsky.txt
  • data/Laws by Plato.txt
  • data/Leviathan by Thomas Hobbes.txt
  • data/Meditations by Emperor of Rome Marcus Aurelius.txt
  • data/Nature by Ralph Waldo Emerson.txt
  • data/On Heroes Hero-Worship and the Heroic in History by Thomas Carlyle.txt
  • data/On Liberty by John Stuart Mill.txt
  • data/On War by Carl von Clausewitz.txt
  • data/On the Duty of Civil Disobedience by Henry David Thoreau.txt
  • data/On the Nature of Things by Titus Lucretius Carus.txt
  • data/Pascals Pensées by Blaise Pascal.txt
  • data/Perpetual Peace A Philosophical Essay by Immanuel Kant.txt
  • data/Phaedo by Plato.txt
  • data/Phaedrus by Plato.txt
  • data/Plato and the Other Companions of Sokrates 3rd ed Volume 1 by George Grote.txt
  • data/Plutarchs Morals by Plutarch.txt
  • data/Politics A Treatise on Government by Aristotle.txt
  • data/Pragmatism A New Name for Some Old Ways of Thinking by William James.txt
  • data/Psyche The Cult of Souls and Belief in Immortality among the Greeks by Erwin Rohde.txt
  • data/Psychology of the Unconscious by C G Jung.txt
  • data/Reflections or Sentences and Moral Maxims by François duc de La Rochefoucauld.txt
  • data/Revelations of Divine Love by of Norwich Julian.txt
  • data/Roman Stoicism by Edward Vernon Arnold.txt
  • data/Second Treatise of Government by John Locke.txt
  • data/Siddhartha by Hermann Hesse.txt
  • data/Sun Tzŭ on the Art of War The Oldest Military Treatise in the World by active 6th century BC Sunzi.txt
  • data/Symposium by Plato.txt
  • data/The Anatomy of Melancholy by Robert Burton.txt
  • data/The Antichrist by Friedrich Wilhelm Nietzsche.txt
  • data/The Birth of Tragedy or Hellenism and Pessimism by Friedrich Wilhelm Nietzsche.txt
  • data/The Case of Wagner Nietzsche Contra Wagner and Selected Aphorisms by Friedrich Wilhelm Nietzsche.txt
  • data/The City of God Volume I by Saint of Hippo Augustine.txt
  • data/The City of God Volume II by Saint of Hippo Augustine.txt
  • data/The Communist Manifesto by Karl Marx and Friedrich Engels.txt
  • data/The Confessions of St Augustine by Saint of Hippo Augustine.txt
  • data/The Consolation of Philosophy by Boethius.txt
  • data/The Critique of Pure Reason by Immanuel Kant.txt
  • data/The Enchiridion by Epictetus.txt
  • data/The Essays of Arthur Schopenhauer Studies in Pessimism by Arthur Schopenhauer.txt
  • data/The Essays of Arthur Schopenhauer the Wisdom of Life by Arthur Schopenhauer.txt
  • data/The Ethics of Aristotle by Aristotle.txt
  • data/The Genealogy of Morals by Friedrich Wilhelm Nietzsche.txt
  • data/The Grand Inquisitor by Fyodor Dostoyevsky.txt
  • data/The Kama Sutra of Vatsyayana by Vatsyayana.txt
  • data/The Man Who Was Thursday A Nightmare by G K Chesterton.txt
  • data/The Marriage of Heaven and Hell by William Blake.txt
  • data/The Meditations of the Emperor Marcus Aurelius Antoninus by Emperor of Rome Marcus Aurelius.txt
  • data/The Poetics of Aristotle by Aristotle.txt
  • data/The Prince by Niccolò Machiavelli.txt
  • data/The Principles of Psychology Volume 1 of 2 by William James.txt
  • data/The Problems of Philosophy by Bertrand Russell.txt
  • data/The Prophet by Kahlil Gibran.txt
  • data/The Republic by Plato.txt
  • data/The Republic of Plato by Plato.txt
  • data/The Secret Doctrine Vol 1 of 4 by H P Blavatsky.txt
  • data/The Secret Doctrine Vol 2 of 4 by H P Blavatsky.txt
  • data/The Song Celestial Or Bhagavad-Gîtâ from the Mahâbhârata.txt
  • data/The Twilight of the Idols or How to Philosophize with the Hammer The Antichrist by Friedrich Wilhelm Nietzsche.txt
  • data/The Will to Believe and Other Essays in Popular Philosophy by William James.txt
  • data/The World as Will and Idea Vol 1 of 3 by Arthur Schopenhauer.txt
  • data/The history of magic including a clear and precise exposition of its procedure its rites and its mysteries by Éliphas Lévi.txt
  • data/The social contract discourses by Jean-Jacques Rousseau.txt
  • data/The symbolism of Freemasonry Illustrating and explaining its science and philosophy its legends myths and symbols by Albert Gallatin Mackey.txt
  • data/Thus Spake Zarathustra A Book for All and None by Friedrich Wilhelm Nietzsche.txt
  • data/Utilitarianism by John Stuart Mill.txt
  • data/Utopia by Saint Thomas More.txt
  • data/Walden and On The Duty Of Civil Disobedience by Henry David Thoreau.txt
  • data/What Is Art by graf Leo Tolstoy.txt
  • data/新序 Chinese by Xiang Liu.txt
  • data/日知錄 Chinese by Yanwu Gu.txt
  • data/韓詩外傳 Complete Chinese by active 150 BC Ying Han.txt
  • download_books.py
  • requirements.txt
  • scripts/check_db.py
  • scripts/ingest_all_data.py
  • scripts/ingest_data.py
  • supabase/migrations/20260223065008_initialize_pgvector.sql
  • supabase/migrations/20260225112500_update_vector_dimension.sql
  • supabase/migrations/20260225141500_add_hnsw_index.sql
  • verify_and_clear.py

📝 Walkthrough

The pull request migrates embeddings from local SentenceTransformer models to the Gemini API with 3072-dimensional vectors, upgrades the LLM to gemini-2.5-flash, adds nine public domain texts for ingestion, and introduces a complete data ingestion pipeline with batch processing, deterministic IDs, retry logic, and HNSW vector indexing.
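
As an illustration of what "deterministic IDs" can mean here, the following is a minimal sketch and not the code from scripts/ingest_data.py. It assumes each chunk is identified by its book title and chunk index; the namespace URL and the `chunk_id` helper are hypothetical names chosen for the example.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes
# between runs. The URL below is a placeholder, not a value from the PR.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/philosophy-chunks")


def chunk_id(title: str, chunk_index: int) -> str:
    """Return the same UUID every time for the same (title, chunk_index) pair."""
    # uuid5 hashes the namespace and name, so no randomness is involved.
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{title}:{chunk_index}"))


if __name__ == "__main__":
    first = chunk_id("Euthyphro", 0)
    again = chunk_id("Euthyphro", 0)
    print(first == again)  # re-running ingestion yields the same ID
```

Because the ID depends only on the inputs, re-ingesting a book produces the same primary keys, which is what lets an upsert overwrite rows instead of duplicating them.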

Changes

| Cohort / File(s) | Summary |
|---|---|
| Embedding & LLM Services: app/services/embedding.py, app/services/llm.py | Replaced local SentenceTransformer embeddings with GoogleGenerativeAIEmbeddings (gemini-embedding-001, 3072 dims). Updated LLM model from gemini-1.5-pro to gemini-2.5-flash. |
| Data Files: data/*.txt | Added nine public domain texts: A Pickle for the Knowing Ones, As a man thinketh, Euthyphro, On the Duty of Civil Disobedience, The Communist Manifesto, The Enchiridion, The Grand Inquisitor, The Marriage of Heaven and Hell, and others (1104–1791 lines each). |
| Data Ingestion Pipeline: download_books.py, scripts/ingest_all_data.py, scripts/ingest_data.py | Introduced book scraper (download_books.py), batch ingestion orchestrator (ingest_all_data.py), and enhanced ingestion with deterministic UUIDs, retry logic, batch processing, idempotency checks, and --limit flag support. |
| Database Infrastructure: supabase/migrations/*.sql, verify_and_clear.py | Updated pgvector dimension from 1536 to 3072 across schema and match_documents function. Added HNSW index for optimized vector search. Included verification script for new vector dimensions. |
| Dependencies & Utilities: requirements.txt, check_progress.py, scripts/check_db.py | Loosened version constraints and added langchain-google-genai. Added database and progress check utilities. |
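
The retry logic mentioned for the ingestion pipeline could look roughly like the sketch below. This is an assumption about the general pattern (exponential backoff around a flaky call), not the PR's implementation; `call_with_retry` and its parameters are hypothetical, and the embedding call is passed in as a plain callable so the example stays independent of any API client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap the embedding request in a zero-argument function, for example `call_with_retry(lambda: embed(chunk))`, where `embed` stands in for whatever client call the project uses. Injecting `sleep` keeps the helper testable without real waiting.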

Sequence Diagram(s)

sequenceDiagram
    participant Script as ingest_all_data.py
    participant Reader as File Reader
    participant Ingester as ingest_document()
    participant Chunker as Text Chunker
    participant Embedder as Gemini API
    participant Batch as Batch Processor
    participant DB as Supabase

    Script->>Reader: Read .txt files from data/
    Reader-->>Script: File content
    Script->>Ingester: Call ingest_document(text, philosopher, school, title)
    Ingester->>Chunker: Split text into chunks
    Chunker-->>Ingester: Chunks with indices
    loop Per chunk with retries
        Ingester->>Embedder: generate_embedding_with_retry(chunk)
        Embedder-->>Ingester: 3072-dim embedding vector
    end
    Ingester->>Batch: Collect enriched chunks (Title, Author injected)
    Batch->>Batch: Generate deterministic UUIDs per chunk
    Batch->>DB: Upsert batch of 100 chunks with embeddings
    DB-->>Batch: Confirm insertion
    Batch-->>Ingester: Batch complete
    Ingester-->>Script: Processing result
    Script->>Script: Aggregate results and report
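
The batching and skip-existing steps in the diagram can be sketched as follows. This is a hedged illustration only: it assumes each row is a dict with an "id" key and that the set of IDs already stored has been fetched beforehand; `pending_batches` is a hypothetical helper, and the actual database upsert is left out.

```python
from typing import Any, Dict, Iterable, Iterator, List, Set


def pending_batches(
    rows: Iterable[Dict[str, Any]],
    existing_ids: Set[str],
    batch_size: int = 100,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size rows whose IDs are not stored yet."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        if row["id"] in existing_ids:
            continue  # already ingested on a previous run, so skip it
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


if __name__ == "__main__":
    rows = [{"id": str(i)} for i in range(250)]
    for batch in pending_batches(rows, existing_ids={"0", "1"}):
        print(len(batch))
```

In a resumable pipeline the same filter would ideally run before embeddings are requested, so an interrupted run does not pay for chunks it has already stored.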

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Whiskers twitching with delight,
Gemini embeddings, 3072-bright!
Ancient texts now vectorized,
Batch by batch, optimized,
HNSW hops through the night!
