Conversation
… system

- Added download_books.py to scrape Philosophy & Ethics bookshelf
- Downloaded 100 philosophy books to data directory
- Implemented resumable ingestion in ingest_data.py to skip existing chunks
- Updated vector dimension logic and added HNSW index migration
- Added check_progress.py and verify_and_clear.py scripts
📝 Walkthrough

Replaces the embedding library, swapping SentenceTransformer for FastEmbed, and reduces the vector dimension from 3072 to 384. Updates the database schema and verification scripts, and adds UTF-8 encoding settings.

Changes
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 12
🧹 Nitpick comments (4)
app/services/embedding.py (1)
16-22: Add a dimension guard before returning embeddings

The `generate_embedding()` method returns provider output unchecked. While the model (all-MiniLM-L6-v2) has a fixed 384-dimension architecture and the DB schema is correctly aligned via the latest migration, adding a dimension check provides defensive programming against accidental model changes without coordinated schema updates. Currently, a mismatch would fail at the Supabase RPC call and be caught by the existing exception handler, but failing earlier in `generate_embedding()` with an explicit dimension error improves debuggability.

💡 Proposed fix
```diff
 def generate_embedding(self, text: str) -> list[float]:
     """
     Generates a vector embedding for the given text using the FastEmbed model.
     Returns a list of 384 floats.
     """
     # The embed_query method returns a list of floats
-    return self.embeddings.embed_query(text)
+    embedding = self.embeddings.embed_query(text)
+    if len(embedding) != 384:
+        raise ValueError(
+            f"Unexpected embedding dimension: {len(embedding)} (expected 384)"
+        )
+    return embedding
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/services/embedding.py` around lines 16 - 22, Add a defensive dimension guard in generate_embedding: after calling self.embeddings.embed_query(text) (which currently returns the provider output), verify the resulting list length equals 384 and if not raise a ValueError with a clear message (including actual length and expected 384) before returning; keep the return type as list[float] and reference the embed_query call and generate_embedding method when making the change.

scripts/ingest_all_data.py (1)
21-24: Use deterministic file ordering for reproducible ingest runs.
`os.listdir()` ordering is filesystem-dependent, so run order and progress can vary between environments.

♻️ Proposed fix
```diff
-    files = [f for f in os.listdir(data_dir) if f.endswith(".txt")]
+    files = sorted(f for f in os.listdir(data_dir) if f.endswith(".txt"))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/ingest_all_data.py` around lines 21 - 24, The current file discovery uses os.listdir(data_dir) which is non-deterministic; ensure a stable ingest order by sorting the filenames before processing (e.g., replace the comprehension assigned to files or sort files before enumerate). Update the code that builds files (and/or right before for idx, filename in enumerate(files)) so it uses a deterministic ordering (e.g., sorted(...) or files.sort()) to guarantee reproducible runs across environments.

scripts/ingest_data.py (1)
40-43: Use a stronger deterministic UUID strategy for chunk IDs.

Using MD5 for identity keys is collision-prone compared with built-in namespace UUIDs and can risk unintended overwrites in long-lived datasets.
🔐 Proposed fix
```diff
-import hashlib
+UUID_NAMESPACE = uuid.UUID("6f0bdf73-9cc8-4e34-a302-a12037f0ac6d")

 def generate_deterministic_uuid(seed_text: str) -> str:
     """Generates a consistent UUID based on the input text to ensure idempotency."""
-    hash_obj = hashlib.md5(seed_text.encode('utf-8'))
-    return str(uuid.UUID(hash_obj.hexdigest()))
+    return str(uuid.uuid5(UUID_NAMESPACE, seed_text))
```

🤖 Prompt for AI Agents
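As a quick sanity check of the UUIDv5 approach, a minimal sketch using the namespace constant from the proposed fix (the chunk-key strings are illustrative only):

```python
import uuid

# Namespace from the proposed fix; any fixed UUID works, but it must never
# change once IDs have been persisted, or idempotent upserts will break.
UUID_NAMESPACE = uuid.UUID("6f0bdf73-9cc8-4e34-a302-a12037f0ac6d")

def generate_deterministic_uuid(seed_text: str) -> str:
    """Derives a stable UUIDv5 from the input text for idempotent upserts."""
    return str(uuid.uuid5(UUID_NAMESPACE, seed_text))

# Same input always yields the same ID; different inputs diverge.
a = generate_deterministic_uuid("book-1:chunk-0")
b = generate_deterministic_uuid("book-1:chunk-0")
c = generate_deterministic_uuid("book-1:chunk-1")
print(a == b, a == c)  # → True False
```

Unlike the MD5 construction, the result is a proper version-5 UUID, so downstream tooling that validates UUID versions keeps working.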
Verify each finding against the current code and only fix it if needed. In `@scripts/ingest_data.py` around lines 40 - 43, The current generate_deterministic_uuid uses MD5 hexdigest which is collision-prone; replace it with a namespaced UUID5 approach to produce a deterministic, stronger ID: inside generate_deterministic_uuid(seed_text: str) call uuid.uuid5 with a stable namespace (e.g., uuid.NAMESPACE_URL or a dedicated uuid.UUID constant) and seed_text as the name, then return its string form; ensure imports remain correct and the function signature stays the same so chunk IDs remain deterministic but use UUIDv5 instead of MD5-derived UUIDs.

download_books.py (1)
121-128: Use `urllib.parse.urljoin` for more robust URL construction.

While the current manual string concatenation works in this limited context, using `urljoin` is a better practice and handles edge cases more reliably. However, the proposed fix references `book_url`, which is not in scope at lines 121-128; it's a local variable from line 101. A corrected approach would be:

```python
from urllib.parse import urljoin

txt_url = parser.txt_url
if txt_url:
    txt_url = urljoin(base_url, txt_url)
```

This handles protocol-relative URLs (`//example.com`), relative paths, and absolute URLs uniformly. The "Also applies to 131-132" reference is unclear, since those lines use the URL after construction; the issue itself is confined to the URL construction logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@download_books.py` around lines 121 - 128, Replace the manual URL construction for parser.txt_url with urllib.parse.urljoin: import urljoin from urllib.parse, then set txt_url = urljoin(base_url, parser.txt_url) (handling the case parser.txt_url may be falsy before calling); this will correctly resolve protocol-relative, relative and absolute URLs and remove the custom startswith logic currently surrounding txt_url. Ensure the import exists at top and update any subsequent uses that assume the old construction.
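A quick demonstration of the resolution rules `urljoin` gives for free (the CDN and mirror hosts below are made-up examples, not real Gutenberg endpoints):

```python
from urllib.parse import urljoin

base_url = "https://www.gutenberg.org"

# Relative path: joined onto the base host.
print(urljoin(base_url, "/ebooks/1497.txt.utf-8"))
# Protocol-relative URL: inherits the base scheme.
print(urljoin(base_url, "//cdn.example.org/1497.txt"))
# Absolute URL: passed through untouched.
print(urljoin(base_url, "https://mirror.example.org/1497.txt"))
```

All three cases that the current `startswith` ladder handles by hand fall out of a single call.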
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/services/llm.py`:
- Around line 11-13: Update the stale comment above the llm instantiation to
reflect the actual model in use: replace the mention of "gemini-1.5-pro or
gemini-pro" with a concise note stating that ChatGoogleGenerativeAI is
configured to use "gemini-2.5-flash" (refer to the llm variable and the
ChatGoogleGenerativeAI(model="gemini-2.5-flash") instantiation) so the comment
accurately matches the code.
In `@download_books.py`:
- Around line 11-12: The two blocking network calls using
urllib.request.urlopen(req) should include a timeout to avoid hanging; update
both occurrences (the call that reads response.read().decode(...) and the other
urlopen around lines ~132-133) to pass a timeout argument (e.g.,
urllib.request.urlopen(req, timeout=10)) or, better, use a shared
constant/parameter (e.g., TIMEOUT = 10 or function parameter timeout) and pass
that value to urllib.request.urlopen so timeouts are configurable and consistent
across the module.
In `@requirements.txt`:
- Around line 1-11: requirements.txt is missing the packages required by
app/services/embedding.py which imports FastEmbedEmbeddings from
langchain_community; add the two runtime dependencies "langchain-community" and
"fastembed" to requirements.txt (optionally with compatible version pins), so
FastEmbedEmbeddings can be imported at runtime without ImportError.
In `@scripts/check_db.py`:
- Around line 6-10: The message "Found N rows" is misleading because the call
supabase_client.table('documents').select('id').limit(1).execute() only ever
returns 0 or 1; update the check to either (a) change the output to indicate
presence/absence (e.g., "Table has at least one row" or "Table is empty") when
using limit(1) or (b) if you need the actual row count use the Supabase count
feature (e.g., call select with count='exact' or remove limit and fetch count)
and then print the true total; modify the print statement that currently uses
len(res.data) accordingly (locate the
supabase_client.table(...).select(...).limit(1).execute() call and the
subsequent prints).
- Around line 12-14: Update the broad except block in scripts/check_db.py (the
except Exception as e: handler) to distinguish expected "table missing" errors
from real failures: inspect the exception (by type/message) and if it indicates
a missing table (e.g., message contains "does not exist" or a specific SDK error
code for table-not-found) print the user-facing message and exit(0); for
credential, network, or other unexpected exceptions (e.g., missing credentials,
connection errors, ClientError/RuntimeError) re-raise the exception so the
process fails loudly. Keep the same log text for the expected path but make sure
to exit with success for that case and not swallow other errors.
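The error-classification split suggested for scripts/check_db.py could be sketched like this; `handle_check_db_error` is a hypothetical helper name, and the "does not exist" marker is an assumption about how the Supabase SDK surfaces missing relations (verify against the actual error messages):

```python
def handle_check_db_error(exc: Exception) -> bool:
    """Returns True when the error means 'table missing' (an expected state);
    re-raises anything else so credential/network failures fail loudly."""
    # Assumption: missing relations surface with a Postgres-style
    # "does not exist" message; adjust to the SDK's real error shape.
    if "does not exist" in str(exc).lower():
        print("Table 'documents' does not exist yet. Run migrations first.")
        return True
    raise exc

# Expected path: a missing table is reported, not raised.
handle_check_db_error(RuntimeError('relation "documents" does not exist'))
```

The caller can then `exit(0)` on a `True` return while every other exception still propagates.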
In `@scripts/ingest_all_data.py`:
- Line 42: The print statement printing a separator uses an unnecessary f-string
prefix; in the print call that currently reads a formatted string (the separator
line), remove the leading "f" so it becomes a plain string literal—i.e., update
the print call (the separator print) to not use f-strings to resolve the lint
warning.
In `@scripts/ingest_data.py`:
- Around line 77-86: The ingestion currently chunks raw Gutenberg texts with
RecursiveCharacterTextSplitter (text_splitter and split_text) which indexes
boilerplate; add a helper (e.g., strip_gutenberg_boilerplate) that scans the
input string `text` for the start marker "*** START OF THE PROJECT GUTENBERG
EBOOK" and end marker "*** END OF THE PROJECT GUTENBERG EBOOK"
(case-insensitive), removes everything before the start and after the end (or
returns original text if markers not found), and call this helper to replace
`text` immediately before invoking text_splitter.split_text so chunks excludes
catalog/license boilerplate.
- Around line 165-171: The current try/except around
supabase_client.table('documents').upsert(batch_data).execute() swallows errors
and prints them, causing silent partial ingestion; change this to surface
failures by either re-raising the caught exception or collecting failed batches
and raising an aggregated exception at the end of the ingest function (the block
handling batch_data/upsert should not just print and continue). Specifically,
update the except block in the batch upsert logic so it logs the error and then
either raise the same exception (raise) or append batch identifiers to a
failures list and, after processing all batches in the function (e.g., in the
same function that iterates over batch_data), raise an exception summarizing
failed batches so callers can detect incomplete ingestion.
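The boilerplate-stripping helper suggested for scripts/ingest_data.py above could look roughly like this; marker spellings vary slightly between Gutenberg files, so treat the regex patterns as assumptions to verify against the downloaded texts:

```python
import re

# Assumed Gutenberg markers; some files use minor spelling variants.
START_RE = re.compile(
    r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK", re.IGNORECASE)

def strip_gutenberg_boilerplate(text: str) -> str:
    """Keeps only the body between the START and END markers.
    Returns the text unchanged if the markers are absent or out of order."""
    start_match = START_RE.search(text)
    end_match = END_RE.search(text)
    if not start_match or not end_match or end_match.start() <= start_match.end():
        return text
    return text[start_match.end():end_match.start()].strip()

sample = (
    "Gutenberg catalog header...\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n"
    "Book I. I went down yesterday to the Piraeus...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE REPUBLIC ***\n"
    "License boilerplate..."
)
print(strip_gutenberg_boilerplate(sample))
```

Calling this on `text` immediately before `text_splitter.split_text` keeps catalog and license boilerplate out of the index while leaving marker-less files untouched.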
In `@supabase/migrations/20260223065008_initialize_pgvector.sql`:
- Around line 17-18: Change the match_count column to use a bounded default and
enforce limits: set its DEFAULT to 10 instead of null and add a check constraint
to clamp values to the 1–200 range (so queries using LIMIT match_count won't
return all rows or allow extreme values); update the column definition for
match_count in the migration (and any other occurrence at lines referenced) to
include DEFAULT 10 and a CHECK constraint that ensures match_count BETWEEN 1 AND
200.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql`:
- Around line 7-9: The ALTER COLUMN TYPE on documents.embedding will fail if
existing rows store vectors of a different dimension; instead either (A) ensure
the table is empty before running (e.g., run a check and abort or TRUNCATE
documents), or (B, safer) add a new column embedding_3072, migrate each row by
re-embedding or casting logic, then drop the old embedding and rename
embedding_3072 to embedding; in both cases explicitly DROP any indexes on
documents.embedding before changing columns and CREATE the appropriate indexes
afterward to avoid corruption; update the migration to implement one of these
flows and include a runtime safeguard that aborts if incompatible-size vectors
are detected.
- Around line 1-2: This migration is redundant or has a signature mismatch: it
tries to ALTER the documents table and recreate match_documents but uses DROP
FUNCTION match_documents(vector(1536), int, jsonb) which doesn't match the
already-created signature vector(3072) (so the DROP silently no-ops); either
remove this migration entirely (consolidate into the earlier migration that
already set vector(3072)) or update this file to correctly target the existing
objects by changing the DROP FUNCTION and any signatures to match vector(3072)
and avoid re-applying identical ALTERs; reference match_documents and the
documents table in your fix and ensure the DROP/CREATE signatures exactly match
the already-created vector dimension.
In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql`:
- Around line 8-10: The migration currently runs TRUNCATE TABLE documents
unconditionally which will irreversibly delete production data; update the
migration (the statement containing "TRUNCATE TABLE documents") to require an
explicit environment guard or approval before executing in production — for
example, add a conditional check that reads an env var (e.g., ALLOW_TRUNCATE or
MIGRATION_APPROVED) or fail-fast unless a production-approval flag is set, or
alternatively convert the file into a documented, manual production-only
migration with clear instructions and a safety check so automated deployments
cannot run TRUNCATE on production without explicit consent.
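The aggregated-failure pattern suggested for the batch upsert in scripts/ingest_data.py can be sketched as follows; `ingest_batches` and `flaky_upsert` are hypothetical names, with the `upsert` callable standing in for the real `supabase_client.table('documents').upsert(batch).execute()` call:

```python
def ingest_batches(batches, upsert):
    """Runs upsert(batch) for every batch, collecting failures instead of
    swallowing them, then raises one summary error if anything failed."""
    failures = []
    for idx, batch in enumerate(batches):
        try:
            upsert(batch)
        except Exception as exc:
            print(f"Batch {idx} failed: {exc}")
            failures.append(idx)
    if failures:
        raise RuntimeError(f"Ingestion incomplete; failed batches: {failures}")

def flaky_upsert(batch):
    # Stand-in for the Supabase upsert; fails on a marker value.
    if "bad" in batch:
        raise ValueError("simulated upsert error")

try:
    ingest_batches([["ok"], ["bad"], ["ok"]], flaky_upsert)
except RuntimeError as e:
    print(e)  # → Ingestion incomplete; failed batches: [1]
```

Every batch still gets attempted, but callers can no longer mistake a partial ingest for a successful one.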
---
Nitpick comments:
In `@app/services/embedding.py`:
- Around line 16-22: Add a defensive dimension guard in generate_embedding:
after calling self.embeddings.embed_query(text) (which currently returns the
provider output), verify the resulting list length equals 384 and if not raise a
ValueError with a clear message (including actual length and expected 384)
before returning; keep the return type as list[float] and reference the
embed_query call and generate_embedding method when making the change.
In `@download_books.py`:
- Around line 121-128: Replace the manual URL construction for parser.txt_url
with urllib.parse.urljoin: import urljoin from urllib.parse, then set txt_url =
urljoin(base_url, parser.txt_url) (handling the case parser.txt_url may be falsy
before calling); this will correctly resolve protocol-relative, relative and
absolute URLs and remove the custom startswith logic currently surrounding
txt_url. Ensure the import exists at top and update any subsequent uses that
assume the old construction.
In `@scripts/ingest_all_data.py`:
- Around line 21-24: The current file discovery uses os.listdir(data_dir) which
is non-deterministic; ensure a stable ingest order by sorting the filenames
before processing (e.g., replace the comprehension assigned to files or sort
files before enumerate). Update the code that builds files (and/or right before
for idx, filename in enumerate(files)) so it uses a deterministic ordering
(e.g., sorted(...) or files.sort()) to guarantee reproducible runs across
environments.
In `@scripts/ingest_data.py`:
- Around line 40-43: The current generate_deterministic_uuid uses MD5 hexdigest
which is collision-prone; replace it with a namespaced UUID5 approach to produce
a deterministic, stronger ID: inside generate_deterministic_uuid(seed_text: str)
call uuid.uuid5 with a stable namespace (e.g., uuid.NAMESPACE_URL or a dedicated
uuid.UUID constant) and seed_text as the name, then return its string form;
ensure imports remain correct and the function signature stays the same so chunk
IDs remain deterministic but use UUIDv5 instead of MD5-derived UUIDs.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`push-error.log` is excluded by `!**/*.log`
📒 Files selected for processing (113)
app/services/embedding.py
app/services/llm.py
check_progress.py
data/A Budget of Paradoxes Volume I by Augustus De Morgan.txt
data/A Pickle for the Knowing Ones by Timothy Dexter.txt
data/A Treatise of Human Nature by David Hume.txt
data/A Vindication of the Rights of Woman by Mary Wollstonecraft.txt
data/Also sprach Zarathustra Ein Buch für Alle und Keinen German by Friedrich Wilhelm Nietzsche.txt
data/An Enquiry Concerning Human Understanding by David Hume.txt
data/An Essay Concerning Humane Understanding Volume 1 by John Locke.txt
data/Apology Crito and Phaedo of Socrates by Plato.txt
data/Apology by Plato.txt
data/As a man thinketh by James Allen.txt
data/Bacons Essays and Wisdom of the Ancients by Francis Bacon.txt
data/Beyond Good and Evil by Friedrich Wilhelm Nietzsche.txt
data/Ciceros Tusculan Disputations by Marcus Tullius Cicero.txt
data/De Officiis Latin by Marcus Tullius Cicero.txt
data/Democracy and Education An Introduction to the Philosophy of Education by John Dewey.txt
data/Democracy in America Volume 2 by Alexis de Tocqueville.txt
data/Demonology and Devil-lore by Moncure Daniel Conway.txt
data/Discourse on the Method of Rightly Conducting Ones Reason and of Seeking Truth in the Sciences by René Descartes.txt
data/Ecce Homo by Friedrich Wilhelm Nietzsche.txt
data/Essays by Ralph Waldo Emerson by Ralph Waldo Emerson.txt
data/Essays of Schopenhauer by Arthur Schopenhauer.txt
data/Ethics by Benedictus de Spinoza.txt
data/Etiquette by Emily Post.txt
data/Euthyphro by Plato.txt
data/Fundamental Principles of the Metaphysic of Morals by Immanuel Kant.txt
data/Goethes Theory of Colours by Johann Wolfgang von Goethe.txt
data/Gorgias by Plato.txt
data/How We Think by John Dewey.txt
data/Human All Too Human A Book for Free Spirits by Friedrich Wilhelm Nietzsche.txt
data/Isis unveiled Volume 1 of 2 Science A master-key to mysteries of ancient and modern science and theology by H P Blavatsky.txt
data/Laws by Plato.txt
data/Leviathan by Thomas Hobbes.txt
data/Meditations by Emperor of Rome Marcus Aurelius.txt
data/Nature by Ralph Waldo Emerson.txt
data/On Heroes Hero-Worship and the Heroic in History by Thomas Carlyle.txt
data/On Liberty by John Stuart Mill.txt
data/On War by Carl von Clausewitz.txt
data/On the Duty of Civil Disobedience by Henry David Thoreau.txt
data/On the Nature of Things by Titus Lucretius Carus.txt
data/Pascals Pensées by Blaise Pascal.txt
data/Perpetual Peace A Philosophical Essay by Immanuel Kant.txt
data/Phaedo by Plato.txt
data/Phaedrus by Plato.txt
data/Plato and the Other Companions of Sokrates 3rd ed Volume 1 by George Grote.txt
data/Plutarchs Morals by Plutarch.txt
data/Politics A Treatise on Government by Aristotle.txt
data/Pragmatism A New Name for Some Old Ways of Thinking by William James.txt
data/Psyche The Cult of Souls and Belief in Immortality among the Greeks by Erwin Rohde.txt
data/Psychology of the Unconscious by C G Jung.txt
data/Reflections or Sentences and Moral Maxims by François duc de La Rochefoucauld.txt
data/Revelations of Divine Love by of Norwich Julian.txt
data/Roman Stoicism by Edward Vernon Arnold.txt
data/Second Treatise of Government by John Locke.txt
data/Siddhartha by Hermann Hesse.txt
data/Sun Tzŭ on the Art of War The Oldest Military Treatise in the World by active 6th century BC Sunzi.txt
data/Symposium by Plato.txt
data/The Anatomy of Melancholy by Robert Burton.txt
data/The Antichrist by Friedrich Wilhelm Nietzsche.txt
data/The Birth of Tragedy or Hellenism and Pessimism by Friedrich Wilhelm Nietzsche.txt
data/The Case of Wagner Nietzsche Contra Wagner and Selected Aphorisms by Friedrich Wilhelm Nietzsche.txt
data/The City of God Volume I by Saint of Hippo Augustine.txt
data/The City of God Volume II by Saint of Hippo Augustine.txt
data/The Communist Manifesto by Karl Marx and Friedrich Engels.txt
data/The Confessions of St Augustine by Saint of Hippo Augustine.txt
data/The Consolation of Philosophy by Boethius.txt
data/The Critique of Pure Reason by Immanuel Kant.txt
data/The Enchiridion by Epictetus.txt
data/The Essays of Arthur Schopenhauer Studies in Pessimism by Arthur Schopenhauer.txt
data/The Essays of Arthur Schopenhauer the Wisdom of Life by Arthur Schopenhauer.txt
data/The Ethics of Aristotle by Aristotle.txt
data/The Genealogy of Morals by Friedrich Wilhelm Nietzsche.txt
data/The Grand Inquisitor by Fyodor Dostoyevsky.txt
data/The Kama Sutra of Vatsyayana by Vatsyayana.txt
data/The Man Who Was Thursday A Nightmare by G K Chesterton.txt
data/The Marriage of Heaven and Hell by William Blake.txt
data/The Meditations of the Emperor Marcus Aurelius Antoninus by Emperor of Rome Marcus Aurelius.txt
data/The Poetics of Aristotle by Aristotle.txt
data/The Prince by Niccolò Machiavelli.txt
data/The Principles of Psychology Volume 1 of 2 by William James.txt
data/The Problems of Philosophy by Bertrand Russell.txt
data/The Prophet by Kahlil Gibran.txt
data/The Republic by Plato.txt
data/The Republic of Plato by Plato.txt
data/The Secret Doctrine Vol 1 of 4 by H P Blavatsky.txt
data/The Secret Doctrine Vol 2 of 4 by H P Blavatsky.txt
data/The Song Celestial Or Bhagavad-Gîtâ from the Mahâbhârata.txt
data/The Twilight of the Idols or How to Philosophize with the Hammer The Antichrist by Friedrich Wilhelm Nietzsche.txt
data/The Will to Believe and Other Essays in Popular Philosophy by William James.txt
data/The World as Will and Idea Vol 1 of 3 by Arthur Schopenhauer.txt
data/The history of magic including a clear and precise exposition of its procedure its rites and its mysteries by Éliphas Lévi.txt
data/The social contract discourses by Jean-Jacques Rousseau.txt
data/The symbolism of Freemasonry Illustrating and explaining its science and philosophy its legends myths and symbols by Albert Gallatin Mackey.txt
data/Thus Spake Zarathustra A Book for All and None by Friedrich Wilhelm Nietzsche.txt
data/Utilitarianism by John Stuart Mill.txt
data/Utopia by Saint Thomas More.txt
data/Walden and On The Duty Of Civil Disobedience by Henry David Thoreau.txt
data/What Is Art by graf Leo Tolstoy.txt
data/新序 Chinese by Xiang Liu.txt
data/日知錄 Chinese by Yanwu Gu.txt
data/韓詩外傳 Complete Chinese by active 150 BC Ying Han.txt
download_books.py
requirements.txt
scripts/check_db.py
scripts/ingest_all_data.py
scripts/ingest_data.py
supabase/migrations/20260223065008_initialize_pgvector.sql
supabase/migrations/20260225112500_update_vector_dimension.sql
supabase/migrations/20260225141500_add_hnsw_index.sql
supabase/migrations/20260226140500_update_vector_to_mini_lm.sql
verify_and_clear.py
```diff
 # We use gemini-1.5-pro or gemini-pro depending on availability and needs
 llm = ChatGoogleGenerativeAI(
-    model="gemini-1.5-pro",
+    model="gemini-2.5-flash",
```
🧩 Analysis chain
🏁 Script executed:
cat -n app/services/llm.py | head -20

Repository: SanghunYun95/philo-rag
Length of output: 988
🌐 Web query:
Google Gemini API available models 2026
💡 Result:
As of Feb 19, 2026, the Gemini API exposes its currently-available models via GET https://generativelanguage.googleapis.com/v1beta/models (use this to see the exact, up-to-the-minute list for your project/region). [1]
Main Gemini text/multimodal models
- `gemini-3.1-pro-preview` (also: `gemini-3.1-pro-preview-customtools`) [3]
- `gemini-3-flash-preview` [2][4]
- `gemini-2.5-pro` (stable) [4]
- `gemini-2.5-flash` (stable) [5]
- `gemini-2.5-flash-lite` (stable) [6]
Realtime (Live API)
- `gemini-2.5-flash-native-audio-preview-12-2025` (audio+text out; Live API) [7]
Image generation / editing
- `gemini-2.5-flash-image` (a.k.a. Nano Banana; stable) [8]
- Nano Banana Pro (preview) and Gemini 3 Pro Image (preview) are also listed as available on the Models page. [2]
Imagen 4 (text-to-image)
- `imagen-4.0-generate-001`, `imagen-4.0-ultra-generate-001`, `imagen-4.0-fast-generate-001` [9]
Other model categories listed as available (see Models page)
- Veo (video generation), Lyria (music), Gemini Embeddings, Computer Use (UI automation), Gemini Deep Research, and Gemini Robotics (preview/experimental vary by model). [2]
Update stale comment to reflect the actual model being used.
The comment on line 11 references "gemini-1.5-pro or gemini-pro" but the code uses "gemini-2.5-flash". Update the comment to accurately describe the current model.
Proposed fix
```diff
 # Configure LangChain model
-# We use gemini-1.5-pro or gemini-pro depending on availability and needs
+# We use gemini-2.5-flash for fast, cost-effective responses
 llm = ChatGoogleGenerativeAI(
     model="gemini-2.5-flash",
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-# We use gemini-1.5-pro or gemini-pro depending on availability and needs
-llm = ChatGoogleGenerativeAI(
-    model="gemini-1.5-pro",
-    model="gemini-2.5-flash",
+# Configure LangChain model
+# We use gemini-2.5-flash for fast, cost-effective responses
+llm = ChatGoogleGenerativeAI(
+    model="gemini-2.5-flash",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/services/llm.py` around lines 11 - 13, Update the stale comment above the
llm instantiation to reflect the actual model in use: replace the mention of
"gemini-1.5-pro or gemini-pro" with a concise note stating that
ChatGoogleGenerativeAI is configured to use "gemini-2.5-flash" (refer to the llm
variable and the ChatGoogleGenerativeAI(model="gemini-2.5-flash") instantiation)
so the comment accurately matches the code.
```python
with urllib.request.urlopen(req) as response:
    return response.read().decode('utf-8', errors='replace')
```
❓ Verification inconclusive
Script executed:
find . -name "download_books.py" -type f

Repository: SanghunYun95/philo-rag
Exit code: 0
stdout:
./download_books.py
Script executed:
cat -n download_books.py

Repository: SanghunYun95/philo-rag
Exit code: 0
stdout:
1 import os
2 import re
3 import time
4 import urllib.request
5 import urllib.error
6 from html.parser import HTMLParser
7
8 def get_html(url):
9 req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
10 try:
11 with urllib.request.urlopen(req) as response:
12 return response.read().decode('utf-8', errors='replace')
13 except Exception as e:
14 print(f"Error fetching {url}: {e}")
15 return ""
16
17 class LinkParser(HTMLParser):
18 def __init__(self):
19 super().__init__()
20 self.in_booklink = False
21 self.in_link = False
22 self.in_title = False
23 self.in_subtitle = False
24
25 self.current_href = None
26 self.current_title = ""
27 self.current_subtitle = ""
28
29 self.books = []
30 self.next_page = None
31
32 def handle_starttag(self, tag, attrs):
33 attr_dict = dict(attrs)
34
35 if tag == 'li' and attr_dict.get('class') == 'booklink':
36 self.in_booklink = True
37 self.current_href = None
38 self.current_title = ""
39 self.current_subtitle = ""
40
41 elif tag == 'a' and self.in_booklink and attr_dict.get('class') == 'link':
42 self.in_link = True
43 self.current_href = attr_dict.get('href')
44
45 elif tag == 'span' and self.in_link:
46 cls = attr_dict.get('class')
47 if cls == 'title':
48 self.in_title = True
49 elif cls == 'subtitle':
50 self.in_subtitle = True
51
52 elif tag == 'a' and attr_dict.get('title') == 'Go to the next page of results.':
53 self.next_page = attr_dict.get('href')
54
55 def handle_endtag(self, tag):
56 if tag == 'span':
57 self.in_title = False
58 self.in_subtitle = False
59 elif tag == 'a' and self.in_link:
60 self.in_link = False
61 if self.current_href and self.current_title:
62 self.books.append({
63 'href': self.current_href,
64 'title': self.current_title.strip(),
65 'author': self.current_subtitle.strip()
66 })
67 elif tag == 'li' and self.in_booklink:
68 self.in_booklink = False
69
70 def handle_data(self, data):
71 if self.in_title:
72 self.current_title += data
73 elif self.in_subtitle:
74 self.current_subtitle += data
75
76 class BookPageParser(HTMLParser):
77 def __init__(self):
78 super().__init__()
79 self.txt_url = None
80
81 def handle_starttag(self, tag, attrs):
82 if self.txt_url:
83 return
84 attr_dict = dict(attrs)
85 if tag == 'a':
86 t = attr_dict.get('type', '')
87 href = attr_dict.get('href', '')
88 if 'text/plain' in t.lower() or href.endswith('.txt') or href.endswith('.txt.utf-8'):
89 self.txt_url = href
90
91 def sanitize_filename(filename):
92 sanitized = re.sub(r'[^\w\s-]', '', filename)
93 sanitized = re.sub(r'\s+', ' ', sanitized).strip()
94 return sanitized
95
96 def download_book(book, data_dir, base_url, target_count, downloaded_count):
97 href = book['href']
98 if not href.startswith('/ebooks/'):
99 return downloaded_count
100
101 book_url = base_url + href
102 title = book['title'].replace('\n', ' ').replace('\r', '')
103 author = book['author'].replace('\n', ' ').replace('\r', '')
104
105 full_title = f"{title} by {author}" if author else title
106 safe_title = sanitize_filename(full_title)
107
108 file_path = os.path.join(data_dir, f"{safe_title}.txt")
109
110 if os.path.exists(file_path):
111 print(f"Already exists: {safe_title}")
112 return downloaded_count + 1
113
114 html = get_html(book_url)
115 if not html:
116 return downloaded_count
117
118 parser = BookPageParser()
119 parser.feed(html)
120
121 txt_url = parser.txt_url
122 if txt_url:
123 if not txt_url.startswith('http'):
124 if txt_url.startswith('//'):
125 txt_url = 'https:' + txt_url
126 else:
127 txt_url = base_url + txt_url
128
129 print(f"Downloading [{downloaded_count+1}/{target_count}]: {safe_title}")
130 try:
131 req = urllib.request.Request(txt_url, headers={'User-Agent': 'Mozilla/5.0'})
132 with urllib.request.urlopen(req) as resp:
133 text = resp.read()
134 text = text.decode('utf-8', errors='replace')
135 with open(file_path, 'w', encoding='utf-8') as f:
136 f.write(text)
137 time.sleep(1)
138 return downloaded_count + 1
139 except Exception as e:
140 print(f"Error downloading text for {safe_title}: {e}")
141 else:
142 print(f"No plain text found for: {safe_title}")
143
144 return downloaded_count
145
146 def main():
147 base_url = "https://www.gutenberg.org"
148 shelf_url = "https://www.gutenberg.org/ebooks/bookshelf/691"
149
150 data_dir = os.path.join(os.getcwd(), 'data')
151 os.makedirs(data_dir, exist_ok=True)
152
153 downloaded_count = 0
154 target_count = 100
155
156 current_url = shelf_url
157
158 while downloaded_count < target_count and current_url:
159 print(f"Fetching shelf page: {current_url}")
160 html = get_html(current_url)
161 if not html:
162 break
163
164 parser = LinkParser()
165 parser.feed(html)
166
167 books = parser.books
168 if not books:
169 print("No books found. Layout might have changed.")
170 break
171
172 for book in books:
173 if downloaded_count >= target_count:
174 break
175 downloaded_count = download_book(book, data_dir, base_url, target_count, downloaded_count)
176
177 if parser.next_page:
178 current_url = base_url + parser.next_page
179 else:
180 current_url = None
181
182 print(f"Finished. Downloaded (or already had) {downloaded_count} books.")
183
184 if __name__ == '__main__':
185 main()
Add network timeouts to prevent indefinite hangs on urlopen calls.
Both urlopen calls on lines 11-12 and 132-133 can block indefinitely on network issues, stalling the entire download process. Add the timeout parameter to prevent this:
Proposed fix
```diff
-        with urllib.request.urlopen(req) as response:
+        with urllib.request.urlopen(req, timeout=20) as response:
```

```diff
-        with urllib.request.urlopen(req) as resp:
+        with urllib.request.urlopen(req, timeout=20) as resp:
```

📝 Committable suggestion
+ with urllib.request.urlopen(req, timeout=20) as resp:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-with urllib.request.urlopen(req) as response:
-    return response.read().decode('utf-8', errors='replace')
+with urllib.request.urlopen(req, timeout=20) as response:
+    return response.read().decode('utf-8', errors='replace')
```
🧰 Tools
🪛 Ruff (0.15.2)
[error] 11-11: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@download_books.py` around lines 11 - 12, The two blocking network calls using
urllib.request.urlopen(req) should include a timeout to avoid hanging; update
both occurrences (the call that reads response.read().decode(...) and the other
urlopen around lines ~132-133) to pass a timeout argument (e.g.,
urllib.request.urlopen(req, timeout=10)) or, better, use a shared
constant/parameter (e.g., TIMEOUT = 10 or function parameter timeout) and pass
that value to urllib.request.urlopen so timeouts are configurable and consistent
across the module.
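To make the suggestion concrete, here is a minimal sketch of a fetch helper with the timeout applied. The name `get_html` matches the reviewed script, but the exact signature and error handling shown here are illustrative assumptions, not the repository's actual code.

```python
import urllib.error
import urllib.request
from typing import Optional


def get_html(url: str, timeout: float = 20.0) -> Optional[str]:
    """Fetch a URL with a bounded wait, returning decoded text or None on failure."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        # timeout bounds the connection attempt and each blocking read
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError) as e:
        print(f"Request failed for {url}: {e}")
        return None
```

Returning `None` instead of raising keeps the caller's `if not html: break` pattern working while still bounding the wait.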
fastapi
uvicorn
supabase
langchain
langchain-google-genai
google-generativeai
sse-starlette
pydantic>=2.7.0
pydantic-settings
python-dotenv
sentence-transformers
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Confirm new embedding implementation dependency usage.
rg -n "FastEmbedEmbeddings|langchain_community" app/services/embedding.py
# Confirm required packages are (not) declared.
rg -n "^(langchain-community|fastembed)([<>=!~].*)?$" requirements.txt || true

Repository: SanghunYun95/philo-rag
Length of output: 190
Missing runtime dependencies for FastEmbedEmbeddings
app/services/embedding.py imports FastEmbedEmbeddings from langchain_community, but requirements.txt is missing both langchain-community and fastembed. This will cause an ImportError at runtime.
Add the missing packages:
fastapi
uvicorn
supabase
langchain
+langchain-community
+fastembed
langchain-google-genai
google-generativeai
sse-starlette
pydantic>=2.7.0
pydantic-settings
python-dotenv
sentence-transformers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@requirements.txt` around lines 1 - 11, requirements.txt is missing the
packages required by app/services/embedding.py which imports FastEmbedEmbeddings
from langchain_community; add the two runtime dependencies "langchain-community"
and "fastembed" to requirements.txt (optionally with compatible version pins),
so FastEmbedEmbeddings can be imported at runtime without ImportError.
res = supabase_client.table('documents').select('id').limit(1).execute()
print('Table exists')
if res.data:
    print(f'Found {len(res.data)} rows')
else:
Found N rows is misleading with limit(1)
Line 6 limits the result to one row, so Line 9 can only ever report a count of at most 1, never the actual table size.
💡 Proposed fix
- res = supabase_client.table('documents').select('id').limit(1).execute()
+ res = supabase_client.table('documents').select('id').limit(1).execute()
@@
- if res.data:
- print(f'Found {len(res.data)} rows')
+ if res.data:
+ print('Found at least 1 row')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/check_db.py` around lines 6 - 10, The message "Found N rows" is
misleading because the call
supabase_client.table('documents').select('id').limit(1).execute() only ever
returns 0 or 1; update the check to either (a) change the output to indicate
presence/absence (e.g., "Table has at least one row" or "Table is empty") when
using limit(1) or (b) if you need the actual row count use the Supabase count
feature (e.g., call select with count='exact' or remove limit and fetch count)
and then print the true total; modify the print statement that currently uses
len(res.data) accordingly (locate the
supabase_client.table(...).select(...).limit(1).execute() call and the
subsequent prints).
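As a concrete illustration of the distinction, the sketch below stubs the response shape: supabase-py can report an exact total via `count='exact'`, exposed as `.count` on the result, independent of any `limit()` applied to the returned rows. `QueryResult` and `describe_documents` here are hypothetical stand-ins for illustration, not the real client classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryResult:
    """Stand-in for a supabase-py response: .data holds returned rows,
    .count holds the exact total when count='exact' was requested."""
    data: List[dict]
    count: Optional[int] = None


def describe_documents(res: QueryResult) -> str:
    # Report the true table size from .count, not len(res.data),
    # which is capped by any limit() applied to the query.
    if res.count:
        return f"Found {res.count} rows"
    return "Table is empty"
```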
except Exception as e:
    print(f'Error or Table does not exist: {e}')
    # No exit(1) because we want to know if it's just not there
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check where check_db.py is invoked
rg -n "scripts/check_db\.py|check_db\.py" -S --type py --type sh --type yaml --type json --type text
# Also check for any imports or references
rg -n "check_db" -S

Repository: SanghunYun95/philo-rag
Length of output: 94
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for references to check_db.py more broadly
rg -n "check_db" -S
# List Python files to understand repo structure
fd ".*\.py$" | head -20

Repository: SanghunYun95/philo-rag
Length of output: 48
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# First check if the file exists
ls -la scripts/check_db.py 2>/dev/null || echo "File not found at scripts/check_db.py"
# List all files in scripts directory
ls -la scripts/ 2>/dev/null || echo "scripts directory not found"
# Try broader search for check_db
find . -name "*check_db*" 2>/dev/null || echo "No check_db files found"
# List all Python files in the repository
find . -name "*.py" -type f | head -30

Repository: SanghunYun95/philo-rag
Length of output: 735
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Read the check_db.py file
cat -n scripts/check_db.py

Repository: SanghunYun95/philo-rag
Length of output: 637
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for all invocations of check_db.py in the repository
rg -n "check_db" -S -A 2 -B 2
# Check for script execution patterns in common places
rg -n "subprocess|call|run|exec|os\.system" -S app/ | head -20

Repository: SanghunYun95/philo-rag
Length of output: 48
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the database service to understand what exceptions can be raised
cat -n app/services/database.py | head -50
# Search for any CI/CD configuration that might invoke check_db
find . -name "*.yml" -o -name "*.yaml" -o -name "Makefile" -o -name ".github" | head -20
# Look for any documentation about running this script
find . -name "README*" -o -name "*.md" | head -10

Repository: SanghunYun95/philo-rag
Length of output: 856
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check for any CI/CD files that might use check_db.py
find . -path "./.git" -prune -o -type f \( -name "*.yml" -o -name "*.yaml" -o -name "Dockerfile" -o -name "Makefile" -o -name ".gitignore" \) -print
# Check main.py to see if this script is part of a startup sequence
cat -n app/main.py | head -40
# Check ingest scripts since they're in the same directory
head -30 scripts/ingest_data.py

Repository: SanghunYun95/philo-rag
Length of output: 1846
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check ingest_all_data.py which might call check_db
cat -n scripts/ingest_all_data.py
# Check for any README files that might document the scripts
cat -n README.md 2>/dev/null || cat -n readme.md 2>/dev/null || echo "No README found"
# Search for any shell scripts that might call check_db.py
find . -name "*.sh" -type f

Repository: SanghunYun95/philo-rag
Length of output: 2811
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check Supabase Python SDK documentation by looking at what exceptions it defines
# Let's also verify if this script has any exit code expectations
cat -n scripts/check_db.py
# Check if there are any test files or CI files that might show expected behavior
find . -name "*test*" -o -name ".github" -o -name "pytest.ini" | head -20
# See if the script is called from anywhere with exit code checks
rg -n "check_db|scripts/check" -S

Repository: SanghunYun95/philo-rag
Length of output: 637
Don't swallow unexpected failures with a blanket exception handler
Line 12 catches every exception without distinguishing "table missing" (expected) from real failures like missing credentials, network errors, or SDK errors. Combined with the intentional absence of exit(1) on line 14, this can hide critical issues silently.
Differentiate between expected and unexpected failures: keep the "table missing" path non-fatal, but re-raise credential, network, or runtime errors.
💡 Proposed approach
Check the error message to detect "table missing" errors, print and exit 0 for those, and re-raise other exceptions.
except Exception as e:
- print(f'Error or Table does not exist: {e}')
- # No exit(1) because we want to know if it's just not there
+ msg = str(e).lower()
+ if "does not exist" in msg or ('relation' in msg and 'documents' in msg):
+ print(f'Error or Table does not exist: {e}')
+ sys.exit(0)
+ raise🧰 Tools
🪛 Ruff (0.15.2)
[warning] 12-12: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/check_db.py` around lines 12 - 14, Update the broad except block in
scripts/check_db.py (the except Exception as e: handler) to distinguish expected
"table missing" errors from real failures: inspect the exception (by
type/message) and if it indicates a missing table (e.g., message contains "does
not exist" or a specific SDK error code for table-not-found) print the
user-facing message and exit(0); for credential, network, or other unexpected
exceptions (e.g., missing credentials, connection errors,
ClientError/RuntimeError) re-raise the exception so the process fails loudly.
Keep the same log text for the expected path but make sure to exit with success
for that case and not swallow other errors.
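One way to express the proposed classification is a small predicate that the except block can consult; the helper below is a hypothetical sketch, and the message patterns are assumptions about typical Postgres/PostgREST error text, not an exhaustive list.

```python
def is_missing_table_error(exc: Exception, table: str = "documents") -> bool:
    """Heuristic from the review: treat 'relation ... does not exist'
    messages as the expected missing-table case; anything else
    (credentials, network, SDK failures) should be re-raised by the caller."""
    msg = str(exc).lower()
    return "does not exist" in msg or ("relation" in msg and table in msg)
```

The except block would then print and `sys.exit(0)` when this returns True, and bare `raise` otherwise.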
if batch_data:
    try:
        # 'upsert' overwrites based on the unique UUID primary key constraint
        supabase_client.table('documents').upsert(batch_data).execute()
        print(f"✅ Successfully upserted {len(batch_data)} chunks to Supabase.")
    except Exception as e:
        print(f"❌ Error upserting batch: {e}")
🧩 Analysis chain
🏁 Script executed:
fd -type f -name "ingest_data.py" | head -20

Repository: SanghunYun95/philo-rag
Length of output: 236

🏁 Script executed:
wc -l scripts/ingest_data.py 2>/dev/null || echo "File not found at that path"

Repository: SanghunYun95/philo-rag
Length of output: 92

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '155,180p'

Repository: SanghunYun95/philo-rag
Length of output: 1617

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '130,175p'

Repository: SanghunYun95/philo-rag
Length of output: 2652

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '80,140p'

Repository: SanghunYun95/philo-rag
Length of output: 3188

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '1,50p'

Repository: SanghunYun95/philo-rag
Length of output: 2189

🏁 Script executed:
grep -n "def ingest\|^def " scripts/ingest_data.py | head -20

Repository: SanghunYun95/philo-rag
Length of output: 360

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '60,172p'

Repository: SanghunYun95/philo-rag
Length of output: 5743

🏁 Script executed:
cat -n scripts/ingest_data.py | sed -n '172,186p'

Repository: SanghunYun95/philo-rag
Length of output: 970
Track and propagate batch upsert failures to prevent silent partial ingestion.
Current code catches upsert exceptions and continues silently, leaving the function to complete normally even when batches fail. This masks incomplete ingestion as a successful run.
Proposed fix
+ failed_batches = []
for i in range(0, len(chunks), BATCH_SIZE):
...
if batch_data:
try:
supabase_client.table('documents').upsert(batch_data).execute()
print(f"✅ Successfully upserted {len(batch_data)} chunks to Supabase.")
except Exception as e:
print(f"❌ Error upserting batch: {e}")
+ failed_batches.append((i // BATCH_SIZE + 1, str(e)))
+
+ if failed_batches:
+ raise RuntimeError(f"Ingestion incomplete. Failed batches: {failed_batches}")🧰 Tools
🪛 Ruff (0.15.2)
[warning] 170-170: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/ingest_data.py` around lines 165 - 171, The current try/except around
supabase_client.table('documents').upsert(batch_data).execute() swallows errors
and prints them, causing silent partial ingestion; change this to surface
failures by either re-raising the caught exception or collecting failed batches
and raising an aggregated exception at the end of the ingest function (the block
handling batch_data/upsert should not just print and continue). Specifically,
update the except block in the batch upsert logic so it logs the error and then
either raise the same exception (raise) or append batch identifiers to a
failures list and, after processing all batches in the function (e.g., in the
same function that iterates over batch_data), raise an exception summarizing
failed batches so callers can detect incomplete ingestion.
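The aggregation pattern the fix describes can be sketched independently of Supabase; `upsert_in_batches` and its `upsert_fn` callback are hypothetical names used for illustration, not the repository's actual function.

```python
from typing import Callable, List, Sequence, Tuple


def upsert_in_batches(
    rows: Sequence[dict],
    batch_size: int,
    upsert_fn: Callable[[Sequence[dict]], None],
) -> None:
    """Run upsert_fn over fixed-size batches, collecting failures instead
    of silently continuing, then fail loudly if any batch was lost."""
    failed: List[Tuple[int, str]] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            upsert_fn(batch)
        except Exception as e:  # aggregated and re-raised below
            failed.append((start // batch_size + 1, str(e)))
    if failed:
        raise RuntimeError(f"Ingestion incomplete. Failed batches: {failed}")
```

Collecting failures rather than raising immediately lets the run ingest every healthy batch while still surfacing a non-zero exit for the caller.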
match_count int DEFAULT null,
filter jsonb DEFAULT '{}'
🧩 Analysis chain
🏁 Script executed:
cat -n supabase/migrations/20260223065008_initialize_pgvector.sql

Repository: SanghunYun95/philo-rag
Length of output: 1437
🌐 Web query:
PostgreSQL LIMIT NULL behavior unbounded results
💡 Result:
In PostgreSQL, LIMIT NULL means no limit (i.e., it’s treated the same as LIMIT ALL or omitting the LIMIT clause), so the query returns an unbounded number of rows (all rows produced by the query). The same rule applies to OFFSET NULL, which is treated as OFFSET 0. [1], [2]
Example
SELECT * FROM t LIMIT NULL; -- same as: SELECT * FROM t;

Sources: PostgreSQL docs for SELECT / LIMIT semantics. [1], [2]
Set a bounded default for match_count to prevent unbounded vector scans.
When match_count is null, LIMIT match_count returns all rows, causing full table scans on the embeddings index. This is risky for latency and cost in production.
📉 Recommended fix
- match_count int DEFAULT null,
+ match_count int DEFAULT 10,

- limit match_count;
+ limit greatest(1, least(coalesce(match_count, 10), 200));

A default of 10 is typical for vector similarity retrieval. Clamping to 1–200 prevents empty results and resource exhaustion.
Also applies to: 37-37
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260223065008_initialize_pgvector.sql` around lines 17 -
18, Change the match_count column to use a bounded default and enforce limits:
set its DEFAULT to 10 instead of null and add a check constraint to clamp values
to the 1–200 range (so queries using LIMIT match_count won't return all rows or
allow extreme values); update the column definition for match_count in the
migration (and any other occurrence at lines referenced) to include DEFAULT 10
and a CHECK constraint that ensures match_count BETWEEN 1 AND 200.
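The clamp expression can be mirrored in Python to make its behavior easy to check; `effective_limit` is an illustrative helper, not code from the repository.

```python
from typing import Optional


def effective_limit(match_count: Optional[int], default: int = 10,
                    lo: int = 1, hi: int = 200) -> int:
    """Python mirror of the proposed SQL clamp:
    greatest(lo, least(coalesce(match_count, default), hi))."""
    value = default if match_count is None else match_count
    return max(lo, min(value, hi))
```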
-- Update the documents table and match_documents function to use 3072 dimensions
-- for gemini-embedding-001
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the migration sequence and final vector dimension
echo "=== Migration files in order ==="
fd -t f '.sql$' supabase/migrations --exec basename {} \; | sort
echo ""
echo "=== Content of migrations with vector dimensions ==="
fd -t f '.sql$' supabase/migrations --exec echo "File: {}" \; -exec cat {} \; | head -200
echo ""
echo "=== Extract vector dimension references ==="
fd -t f '.sql$' supabase/migrations -exec rg -i 'vector\(\d+\)' {} + 2>/dev/null

Repository: SanghunYun95/philo-rag
Length of output: 521
🏁 Script executed:
#!/bin/bash
echo "=== Migration 1: initialize_pgvector ==="
cat supabase/migrations/20260223065008_initialize_pgvector.sql
echo ""
echo "=== Migration 2: update_vector_dimension ==="
cat supabase/migrations/20260225112500_update_vector_dimension.sql
echo ""
echo "=== Migration 3: add_hnsw_index ==="
cat supabase/migrations/20260225141500_add_hnsw_index.sql
echo ""
echo "=== Migration 4: update_vector_to_mini_lm ==="
cat supabase/migrations/20260226140500_update_vector_to_mini_lm.sql

Repository: SanghunYun95/philo-rag
Length of output: 4501
Remove redundant migration or fix function signature mismatch.
Migration 1 already establishes the documents table and match_documents function with vector(3072) dimensions for gemini-embedding-001. Migration 2 (the one under review) is functionally redundant—it attempts to alter the column and recreate the function to the same 3072 dimension that already exists. Additionally, the DROP statement references match_documents(vector(1536), int, jsonb), but the function was created with a vector(3072) signature in Migration 1, so the DROP will silently fail (IF EXISTS clause masks the issue).
The final migration (20260226140500_update_vector_to_mini_lm.sql) explicitly TRUNCATEs the documents table before changing dimensions to 384, confirming that data is intentionally cleared during model transitions—no re-embedding scenario applies here.
Either consolidate this migration into Migration 1 or correct the function signature in the DROP statement to match what was actually created.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql` around lines
1 - 2, This migration is redundant or has a signature mismatch: it tries to
ALTER the documents table and recreate match_documents but uses DROP FUNCTION
match_documents(vector(1536), int, jsonb) which doesn't match the
already-created signature vector(3072) (so the DROP silently no-ops); either
remove this migration entirely (consolidate into the earlier migration that
already set vector(3072)) or update this file to correctly target the existing
objects by changing the DROP FUNCTION and any signatures to match vector(3072)
and avoid re-applying identical ALTERs; reference match_documents and the
documents table in your fix and ensure the DROP/CREATE signatures exactly match
the already-created vector dimension.
-- 2. Alter the table column
ALTER TABLE documents
ALTER COLUMN embedding TYPE vector(3072);
Risk: ALTER COLUMN TYPE will fail if existing embeddings have a different dimension.
pgvector cannot automatically cast a vector(1536) value to vector(3072). If the documents table contains any rows with existing embeddings, this migration will error out.
Consider one of these approaches:
- Truncate or delete existing documents before altering (if data loss is acceptable)
- Add a new column, migrate data with re-embedding, then drop the old column
- Ensure this migration only runs on empty tables
Additionally, any existing indexes on documents.embedding should be explicitly dropped before the ALTER COLUMN and recreated afterward to avoid index corruption or stale references.
Proposed safeguard (if data can be cleared)
-- 1. Drop the existing function
DROP FUNCTION IF EXISTS match_documents(vector(1536), int, jsonb);
+-- 1.5. Drop any existing indexes on embedding column
+DROP INDEX IF EXISTS documents_embedding_idx;
+
+-- 1.6. Clear existing embeddings (requires re-ingestion)
+TRUNCATE TABLE documents;
+
-- 2. Alter the table column
ALTER TABLE documents
ALTER COLUMN embedding TYPE vector(3072);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260225112500_update_vector_dimension.sql` around lines
7 - 9, The ALTER COLUMN TYPE on documents.embedding will fail if existing rows
store vectors of a different dimension; instead either (A) ensure the table is
empty before running (e.g., run a check and abort or TRUNCATE documents), or (B,
safer) add a new column embedding_3072, migrate each row by re-embedding or
casting logic, then drop the old embedding and rename embedding_3072 to
embedding; in both cases explicitly DROP any indexes on documents.embedding
before changing columns and CREATE the appropriate indexes afterward to avoid
corruption; update the migration to implement one of these flows and include a
runtime safeguard that aborts if incompatible-size vectors are detected.
-- 2. Clear existing incompatible 3072-dimension vectors to avoid casting errors
TRUNCATE TABLE documents;
🧩 Analysis chain
🏁 Script executed:
cat -n supabase/migrations/20260226140500_update_vector_to_mini_lm.sql

Repository: SanghunYun95/philo-rag
Length of output: 1600

🏁 Script executed:
# Look for migration configuration or deployment files
find . -type f \( -name "*.json" -o -name "*.yaml" -o -name "*.yml" -o -name "*.toml" -o -name "*.md" \) | grep -iE "(migration|deploy|config|readme)" | head -20

Repository: SanghunYun95/philo-rag
Length of output: 88

🏁 Script executed:
# Check if there's a migrations directory with other files
ls -la supabase/migrations/ 2>/dev/null | head -30

Repository: SanghunYun95/philo-rag
Length of output: 533

🏁 Script executed:
# Check supabase config
cat supabase/config.toml

Repository: SanghunYun95/philo-rag
Length of output: 14313

🏁 Script executed:
# Look for README or documentation in supabase directory
find supabase -maxdepth 2 -type f \( -name "README*" -o -name "*.md" \) 2>/dev/null

Repository: SanghunYun95/philo-rag
Length of output: 48

🏁 Script executed:
# Check root project files for context
ls -la | grep -iE "^-.*\.(md|txt|yaml|yml|json)$"

Repository: SanghunYun95/philo-rag
Length of output: 131
Add environment guard or explicit production migration plan for TRUNCATE TABLE documents.
This migration irreversibly removes all documents during deployment. While TRUNCATE is technically necessary to alter the embedding column type from vector(3072) to vector(384), the migration file contains no environment checks. If deployed to production without explicit intent, all indexed documents will be deleted.
Either add a conditional guard (e.g., check environment variable) or document this as a breaking change requiring explicit approval for production deployments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@supabase/migrations/20260226140500_update_vector_to_mini_lm.sql` around lines
8 - 10, The migration currently runs TRUNCATE TABLE documents unconditionally
which will irreversibly delete production data; update the migration (the
statement containing "TRUNCATE TABLE documents") to require an explicit
environment guard or approval before executing in production — for example, add
a conditional check that reads an env var (e.g., ALLOW_TRUNCATE or
MIGRATION_APPROVED) or fail-fast unless a production-approval flag is set, or
alternatively convert the file into a documented, manual production-only
migration with clear instructions and a safety check so automated deployments
cannot run TRUNCATE on production without explicit consent.
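A guard like the one suggested could be sketched as follows in a migration wrapper script; `ALLOW_TRUNCATE` and the helper itself are hypothetical, since Supabase SQL migrations have no built-in environment check and such a gate would have to live in the tooling that applies them.

```python
import os


def assert_truncate_allowed(env_var: str = "ALLOW_TRUNCATE") -> None:
    """Fail fast unless the operator has explicitly opted in to a
    destructive step such as TRUNCATE TABLE documents."""
    if os.environ.get(env_var) != "1":
        raise RuntimeError(
            "Refusing destructive migration step: set "
            f"{env_var}=1 to confirm data loss is intended."
        )
```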
♻️ Duplicate comments (1)
requirements.txt (1)
11-11:⚠️ Potential issue | 🔴 Critical
fastembed alone does not complete the runtime dependencies
If FastEmbedEmbeddings is imported from langchain_community, the langchain-community package must also be declared in requirements. As it stands, this can raise an ImportError at runtime.

🔧 Suggested fix

fastapi
uvicorn
supabase
langchain
+langchain-community
langchain-google-genai
google-generativeai
sse-starlette
pydantic>=2.7.0
pydantic-settings
python-dotenv
fastembed

#!/bin/bash
set -euo pipefail
# 1) Confirm actual import usage
rg -n "from\s+langchain_community|import\s+langchain_community|FastEmbedEmbeddings" app || true
# 2) Confirm the requirements declaration
rg -n "^(langchain-community|fastembed)([<>=!~].*)?$" requirements.txt || true

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@requirements.txt` at line 11, the project uses FastEmbedEmbeddings from langchain_community, so declaring only fastembed causes an ImportError at runtime; add the langchain-community package to requirements, specifying a version (or range) compatible with fastembed so the FastEmbedEmbeddings import works, and after the change confirm that FastEmbedEmbeddings and langchain_community are actually used (e.g., their import locations).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@requirements.txt`:
- Line 11: The project uses FastEmbedEmbeddings from langchain_community, so declaring
only fastembed causes a runtime ImportError; add the langchain-community package to
requirements, specifying a version (or range) compatible with fastembed so the
FastEmbedEmbeddings import works, and after the change confirm that FastEmbedEmbeddings
and langchain_community are actually used (e.g., their import locations).
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
requirements.txt
scripts/ingest_all_data.py
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/ingest_all_data.py
Summary by CodeRabbit
Release Notes
Bug Fixes
Chores