feat(db): Database Abstraction Layer with Repository Pattern#918
feat(db): Database Abstraction Layer with Repository Pattern#918jlacerte wants to merge 5 commits intocoleam00:mainfrom
Conversation
Introduces a clean architecture database layer supporting multiple backends (Supabase, PostgreSQL, InMemory) without modifying business logic. ## Changes ### Domain Layer - Add interfaces: ICrawledPagesRepository, ISourcesRepository, ICodeExamplesRepository - Add Pydantic models: CrawledPage, Source, CodeExample, SearchResult - Multi-dimension embeddings support (768, 1024, 1536, 3072) ### Infrastructure Layer - Supabase: Refactored existing implementation to use interfaces - PostgreSQL: New asyncpg + pgvector implementation with connection pooling - InMemory: Thread-safe implementation for fast unit tests ### Container & DI - Singleton Container with async lifecycle (initialize/shutdown) - Factory pattern for backend selection via REPOSITORY_TYPE env var - Integrated with FastAPI lifespan ### Tests - 102 unit tests (100% passing) - Contract tests verifying interface compliance - InMemory fixtures for isolated testing ### Documentation - docs/DATABASE_ABSTRACTION.md - Complete architecture guide - Migration guide from direct Supabase calls - Instructions for adding new backends ## Configuration ```bash # Supabase (default) export REPOSITORY_TYPE=supabase # PostgreSQL export REPOSITORY_TYPE=postgres # InMemory (tests) export REPOSITORY_TYPE=memory ``` Related: PR coleam00#915 (previous work on legacy architecture) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdds a Database Abstraction Layer: async domain models and interfaces, in-memory/Postgres/Supabase repository implementations, a repository factory and DI Container with lifecycle/health, server wiring in main, extensive unit/contract tests and fixtures, vector utilities, and supporting documentation. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Cont as DI Container
participant Fact as Repository Factory
participant Env as Config/Env
participant Pool as Postgres Pool
participant Repo as Repository
Note over App,Cont: Startup
App->>Cont: initialize()
activate Cont
Cont->>Env: read REPOSITORY_TYPE
Env-->>Cont: storage_type
alt storage_type == "postgres"
Cont->>Pool: initialize_postgres_pool()
Pool-->>Cont: pool ready
end
Cont-->>App: initialized
deactivate Cont
Note over App,Repo: Resolve repo
App->>Cont: get_crawled_pages_repository()
activate Cont
Cont->>Fact: get_crawled_pages_repository()
activate Fact
Fact->>Env: check REPOSITORY_TYPE
alt "memory"
Fact-->>Cont: InMemoryCrawledPagesRepository
else "postgres"
Fact->>Pool: use pool
Fact-->>Cont: PostgresCrawledPagesRepository(pool)
else "supabase"
Fact-->>Cont: SupabaseCrawledPagesRepository(client)
end
deactivate Fact
Cont-->>App: repository (cached)
deactivate Cont
Note over App,Repo: Normal ops
App->>Repo: insert(...) / search_similar(...)
Repo-->>App: result(s)
Note over App,Cont: Shutdown
App->>Cont: shutdown()
activate Cont
Cont->>Pool: close_postgres_pool()
Pool-->>Cont: closed
Cont->>Cont: clear cached repos
deactivate Cont
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/tests/conftest.py (1)
34-34: Duplicate import.
patchis already imported fromunittest.mockat line 4. This duplicate import is unnecessary.🔎 Proposed fix
# Apply global patches immediately -from unittest.mock import patch _global_patches = [
🧹 Nitpick comments (35)
docs/DATABASE_ABSTRACTION.md (1)
23-57: Add language identifiers to code blocks.Lines 23 and 527 contain fenced code blocks without language specifications. While the ASCII diagrams render correctly, adding language identifiers improves markdown linting compliance.
🔎 Proposed fix for missing code block language identifiers
-``` +```text +------------------------------------------------------------------+ | APPLICATION LAYER | | FastAPI endpoints, MCP tools, Services | | ... -``` +```textAnd similarly for line 527:
-``` +```text python/src/server/ +-- domain/ | +-- __init__.py | ... -``` +```textAlso applies to: 527-555
docs/REFACTORING_STATUS_REPORT_V2.md (3)
52-52: Add language identifiers to fenced code blocks.Multiple code blocks lack language specifications, violating markdown linting rules (MD040). This affects lines 52, 89, 187, and 429.
🔎 Proposed fix example
For line 52 (diagram):
-``` +```text +-----------------------------------------------------------+For lines 89, 187 (config/tables):
-``` +```yaml REPOSITORY_TYPE env varFor line 429 (Python code):
-``` +```python python/src/server/Also applies to: 89-89, 187-187, 429-429
412-412: Use a proper heading instead of emphasis.Line 412 uses emphasis (
**...**) for what should be a section heading. This violates the MD036 rule.-**Phases 0-7: COMPLETEES A 100%** +## Phases 0-7: Complétées à 100%
1-485: Consider aligning documentation language with the PR.This refactoring status report is written entirely in French, while DATABASE_ABSTRACTION.md and the broader PR are in English. This creates inconsistency in the documentation set.
Recommendation: Either translate this report to English for consistency, or clarify if it's intended as a French-language artifact for a French-speaking audience. If translation is planned, I can assist.
Would you like me to help translate this status report to English or clarify the intended audience for this document?
python/src/server/domain/interfaces/sources_repository.py (1)
121-130: Consider using an enum for status values.The status parameter accepts a bare string with documented values (
pending,crawling,completed,failed). ASourceStatusenum would provide compile-time safety and auto-completion. This is optional as the current approach is flexible.python/tests/unit/repositories/test_sources_repository.py (1)
122-138: Potential flakiness with timestamp ordering test.Using
asyncio.sleep(0.01)to ensure different timestamps could be flaky on slow systems or under load. Consider using explicit timestamps if the repository implementation supports it, or increase the sleep duration for more reliability.python/src/server/infrastructure/memory/vector_utils.py (1)
57-77: Consider adding empty vector validation for consistency.While mathematically valid,
euclidean_distanceallows empty vectors (returns 0.0) butcosine_similarityraisesValueError. For API consistency, consider adding the same empty vector check here.🔎 Proposed validation addition
def euclidean_distance(vec_a: Sequence[float], vec_b: Sequence[float]) -> float: """ Calculate Euclidean distance between two vectors. Args: vec_a: First vector vec_b: Second vector Returns: Euclidean distance (lower = more similar) Raises: ValueError: If vectors have different lengths """ if len(vec_a) != len(vec_b): raise ValueError( f"Vectors must have same length: {len(vec_a)} != {len(vec_b)}" ) + + if len(vec_a) == 0: + raise ValueError("Vectors cannot be empty") sum_squared_diff = sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)) return math.sqrt(sum_squared_diff)python/src/server/infrastructure/memory/crawled_pages_repository.py (3)
17-18: Document rationale for low similarity threshold.The
SIMILARITY_THRESHOLD = 0.05is very permissive (5% similarity). Consider documenting why such a low threshold is appropriate for in-memory testing, or making it configurable if different test scenarios need different thresholds.
38-52: Consider reusing domain model's get_embedding method.The
_get_embeddinghelper duplicates logic fromCrawledPage.get_embedding()method (seen in domain model). Consider calling the domain model method directly to avoid duplication.🔎 Proposed refactor to reuse domain method
def _get_embedding(self, page: CrawledPage) -> list[float] | None: """Get the active embedding from a page based on its dimension.""" - if page.embedding_dimension == 768: - return page.embedding_768 - elif page.embedding_dimension == 1024: - return page.embedding_1024 - elif page.embedding_dimension == 1536: - return page.embedding_1536 - elif page.embedding_dimension == 3072: - return page.embedding_3072 - # Fallback: try to find any available embedding - return ( - page.embedding_1536 or page.embedding_768 or - page.embedding_1024 or page.embedding_3072 - ) + return page.get_embedding()
130-178: Handle unsupported embedding dimensions explicitly.The
insertmethod silently ignores embeddings with unsupported dimensions (not 768/1024/1536/3072). Consider logging a warning or raising an error to catch configuration issues early.🔎 Proposed validation for embedding dimension
# Determine which embedding column to use embedding_768 = None embedding_1024 = None embedding_1536 = None embedding_3072 = None if page.embedding and page.embedding_dimension: if page.embedding_dimension == 768: embedding_768 = page.embedding elif page.embedding_dimension == 1024: embedding_1024 = page.embedding elif page.embedding_dimension == 1536: embedding_1536 = page.embedding elif page.embedding_dimension == 3072: embedding_3072 = page.embedding + else: + raise ValueError( + f"Unsupported embedding dimension: {page.embedding_dimension}. " + f"Supported dimensions: 768, 1024, 1536, 3072" + )python/src/server/domain/models/source.py (1)
34-35: Consider adding URL validation.The
urlfield accepts any string without format validation. While domain models are often kept lean, adding basic URL validation could prevent invalid data from entering the system early.🔎 Optional: Add URL validation
from pydantic import BaseModel, Field +from pydantic import HttpUrl class Source(BaseModel): source_id: str = Field(..., description="Unique source identifier") - url: str = Field(..., description="Base URL of the source") + url: HttpUrl = Field(..., description="Base URL of the source")Alternatively, you can use a custom validator if you need more flexible URL formats:
from pydantic import field_validator class Source(BaseModel): # ... fields ... @field_validator('url') @classmethod def validate_url(cls, v: str) -> str: if not v.startswith(('http://', 'https://')): raise ValueError('URL must start with http:// or https://') return vpython/src/server/infrastructure/supabase/crawled_pages_repository.py (3)
173-184: Duplicated embedding column selection logic.The dimension-to-column mapping appears in both
insertandinsert_batch. Consider extracting a helper method.🔎 Proposed helper
def _get_embedding_column(self, dimension: int | None) -> str | None: """Map embedding dimension to column name.""" return {768: "embedding_768", 1024: "embedding_1024",Also applies to: 222-233
303-310: Count query fetches all columns unnecessarily.Using
select("*", count="exact")retrieves all rows. For a pure count, select a minimal column to reduce payload.🔎 Suggested optimization
- query = self.client.table(self.table_name).select("*", count="exact") + query = self.client.table(self.table_name).select("id", count="exact")
42-44: Movejsonimport to module level for consistency.The lazy import inside
_row_to_modelworks but differs from other repository files that importjsonat the top.python/src/server/infrastructure/memory/sources_repository.py (1)
100-115:updatemutates the caller'supdatesdict.Line 109 modifies the dict passed by the caller by adding
updated_at. This side effect could be unexpected. Consider working on a copy.🔎 Suggested fix
async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None: """Update an existing source.""" with self._lock: source = self._sources.get(source_id) if not source: return None # Create updated source source_dict = source.model_dump() - source_dict.update(updates) - source_dict["updated_at"] = datetime.now(timezone.utc) + source_dict.update({**updates, "updated_at": datetime.now(timezone.utc)}) updated_source = Source(**source_dict) self._sources[source_id] = updated_source return updated_sourcepython/src/server/infrastructure/repository_factory.py (1)
146-266: Significant code duplication across factory functions.
get_crawled_pages_repository,get_sources_repository, andget_code_examples_repositoryshare identical storage-type switching logic. Consider a generic factory helper or table-driven approach.🔎 Conceptual refactor using a registry
_REPOSITORY_REGISTRY = { "crawled_pages": { "supabase": ("supabase.crawled_pages_repository", "SupabaseCrawledPagesRepository"), "postgres": ("postgres", "PostgresCrawledPagesRepository"), "memory": ("memory", "InMemoryCrawledPagesRepository"), }, # ... similar for sources, code_examples } def _create_repository(repo_type: str, storage_type: str): module_path, class_name = _REPOSITORY_REGISTRY[repo_type][storage_type] # dynamic import and instantiationpython/src/server/infrastructure/postgres/sources_repository.py (1)
278-280: Fragile parsing of DELETE result.Parsing
result.split()[-1]relies on asyncpg's"DELETE N"format. Consider a more defensive approach or document this assumption.🔎 Safer parsing
- # Parse "DELETE X" to get count - deleted_count = int(result.split()[-1]) + # asyncpg returns "DELETE N" where N is affected rows + try: + deleted_count = int(result.split()[-1]) + except (ValueError, IndexError): + self._logger.warning(f"Unexpected DELETE result format: {result}") + deleted_count = 0python/src/server/infrastructure/supabase/sources_repository.py (2)
158-180:updatemutates the caller'supdatesdict.Line 162 adds
updated_atdirectly to the passed dict, which could surprise callers. Consider working on a copy for consistency with expected behavior.🔎 Suggested fix
async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None: """Update an existing source.""" try: # Add updated_at timestamp - updates["updated_at"] = datetime.now(timezone.utc).isoformat() + updates = {**updates, "updated_at": datetime.now(timezone.utc).isoformat()} response = ( self.client.table(self.table_name) .update(updates)
250-256: Count query fetches all columns unnecessarily.Same optimization opportunity as in
SupabaseCrawledPagesRepository—select a minimal column instead of"*".python/src/server/infrastructure/supabase/code_examples_repository.py (3)
39-62: Consider moving the JSON import to module level.The
import jsoninside_row_to_modelworks but is unconventional. Moving it to the module's import section improves clarity and avoids repeated import lookups.🔎 Suggested fix
Add to the imports at the top of the file:
from typing import Any +import json from supabase import ClientThen update
_row_to_model:def _row_to_model(self, row: dict[str, Any]) -> CodeExample: """Convert a database row to a CodeExample model.""" metadata = row.get("metadata", {}) if isinstance(metadata, str): - import json metadata = json.loads(metadata)
169-259: Extract duplicate embedding column selection logic to a helper method.The embedding column determination logic (lines 172-183 and 221-232) is duplicated. This violates DRY and makes adding new embedding dimensions error-prone.
🔎 Suggested refactor
+ def _get_embedding_column(self, dimension: int | None) -> str | None: + """Get the embedding column name for a given dimension.""" + if dimension == 768: + return "embedding_768" + elif dimension == 1024: + return "embedding_1024" + elif dimension == 1536: + return "embedding_1536" + elif dimension == 3072: + return "embedding_3072" + return None + async def insert(self, example: CodeExampleCreate) -> CodeExample: """Insert a new code example.""" try: # Determine embedding column - embedding_column = None - if example.embedding and example.embedding_dimension: - dim = example.embedding_dimension - if dim == 768: - embedding_column = "embedding_768" - elif dim == 1024: - embedding_column = "embedding_1024" - elif dim == 1536: - embedding_column = "embedding_1536" - elif dim == 3072: - embedding_column = "embedding_3072" + embedding_column = ( + self._get_embedding_column(example.embedding_dimension) + if example.embedding and example.embedding_dimension + else None + )
299-313: Optimize count query by selecting minimal columns.
select("*", count="exact")fetches all columns when only the count is needed. Select a single column (likeid) for better performance.🔎 Suggested fix
async def count(self, source_id: str | None = None) -> int: """Count code examples in the repository.""" try: - query = self.client.table(self.table_name).select("*", count="exact") + query = self.client.table(self.table_name).select("id", count="exact") if source_id: query = query.eq("source_id", source_id) response = query.execute() return response.count if response.count else 0python/src/server/container.py (3)
54-68: Singleton reset may leave stale state on existing references.
Container.reset()clears_instancebut doesn't reset_init_doneon existing objects. If any code holds a reference to the old container, creating a newContainer()returns a fresh instance, but the old reference remains in a partially valid state.For test isolation, also resetting instance attributes in
reset()would be safer:🔎 Suggested fix
@classmethod def reset(cls) -> None: """ Reset the container singleton. For testing purposes only. """ + if cls._instance is not None: + cls._instance._initialized = False + cls._instance._storage_type = None + # Remove _init_done to allow re-initialization + if hasattr(cls._instance, "_init_done"): + delattr(cls._instance, "_init_done") cls._instance = None logger.info("Container singleton reset")
113-115: Preserve full stack trace in error logging.Per coding guidelines, error logging should use
exc_info=Trueto preserve the full stack trace.🔎 Suggested fix
except Exception as e: - logger.error(f"Container initialization failed: {e}") + logger.error(f"Container initialization failed: {e}", exc_info=True) raise
144-146: Addexc_info=Trueto shutdown error logging.Same as initialization - preserve the stack trace for debugging.
🔎 Suggested fix
except Exception as e: - logger.error(f"Container shutdown error: {e}") + logger.error(f"Container shutdown error: {e}", exc_info=True) raisepython/src/server/infrastructure/memory/code_examples_repository.py (2)
180-186:insert_batchis not atomic - each insert acquires lock separately.
insert_batchcallsself.insert()in a loop, acquiring and releasing the lock for each item. This differs from the behavior of Supabase and PostgreSQL implementations which perform batch operations atomically. For test fidelity, consider making this atomic:🔎 Suggested fix
async def insert_batch(self, examples: list[CodeExampleCreate]) -> list[CodeExample]: """Insert multiple code examples in a batch.""" - results = [] - for example in examples: - created = await self.insert(example) - results.append(created) - return results + with self._lock: + results = [] + for example in examples: + example_id = str(uuid.uuid4()) + now = datetime.now(timezone.utc) + # ... (inline the insert logic here) + # Or refactor to have _insert_unlocked helper + return resultsAlternatively, create an
_insert_unlockedhelper that bothinsertandinsert_batchcan use.
243-247:set_examplesilently ignores examples without an ID.If
example.idisNone, the method does nothing without feedback. Consider raising aValueErroror auto-generating an ID to avoid silent failures in test setup.🔎 Suggested fix
def set_example(self, example: CodeExample) -> None: """Directly set an example (for test setup).""" with self._lock: - if example.id: - self._examples[example.id] = example + if not example.id: + raise ValueError("Cannot set example without an ID") + self._examples[example.id] = examplepython/src/server/infrastructure/postgres/crawled_pages_repository.py (3)
340-345: Fragile parsing of DELETE result status.Parsing
result.split()[-1]assumes PostgreSQL always returns"DELETE N"format. While this is the documented behavior, adding error handling makes the code more robust:🔎 Suggested fix
- # Parse "DELETE X" to get count - deleted_count = int(result.split()[-1]) + # Parse "DELETE X" to get count + try: + deleted_count = int(result.split()[-1]) + except (IndexError, ValueError): + self._logger.warning(f"Could not parse delete result: {result}") + deleted_count = 0
90-92: Addexc_info=Trueto all error logging calls.Per coding guidelines, error logging should preserve full stack traces. This pattern should be applied to all error handlers in this file (lines 91, 110, 129, 207, 264, 329, 348, 367, 386, 410).
🔎 Example fix
except Exception as e: - self._logger.error(f"get_by_id failed: {e}", id=id) + self._logger.error(f"get_by_id failed: {e}", id=id, exc_info=True) raise
267-330: Batch insert correctly uses transactions but has duplicate column-building logic.The transaction usage ensures atomicity (good). However, the embedding column determination and query building (lines 278-318) duplicates the logic in
insert(). Consider extracting a helper method to reduce duplication.python/src/server/infrastructure/postgres/code_examples_repository.py (2)
342-347: Same fragile DELETE result parsing as crawled_pages repository.Apply the same defensive parsing pattern suggested for
PostgresCrawledPagesRepository.delete_by_url.
90-92: Addexc_info=Trueto error logging calls throughout this file.Same pattern as
PostgresCrawledPagesRepository- all error handlers should preserve stack traces per coding guidelines.python/tests/unit/repositories/test_repository_contract.py (2)
25-37: Contract classes import ABC but don't inherit from it.
ABCis imported (line 10) but the contract classes don't inherit from it. While this works (Python doesn't enforce abstract fixture implementations), inheriting from ABC would make the intent clearer:🔎 Suggested fix
-class CrawledPagesRepositoryContract: +class CrawledPagesRepositoryContract(ABC): """ Contract tests for ICrawledPagesRepository implementations.Apply similarly to
SourcesRepositoryContractandCodeExamplesRepositoryContract.
163-222: Consider adding contract tests forsearch_similarand batch operations.The contract tests cover basic CRUD but omit
search_similar(a core feature for vector search) and batch insert operations. Adding these would ensure consistent behavior across all backends:🔎 Example test for search_similar
@pytest.mark.asyncio async def test_search_similar_returns_results(self, repo): """search_similar should find semantically similar examples.""" embedding = [0.1] * 768 # Example embedding example = CodeExampleCreate( source_id="test_source", page_url="https://test.com/page", code="print('hello')", language="python", embedding=embedding, embedding_dimension=768, ) await repo.insert(example) results = await repo.search_similar(embedding, match_count=5) assert len(results) >= 1 assert results[0].similarity > 0.5python/tests/unit/test_container.py (1)
66-94: Consider adding test for initialization failure handling.The tests verify successful initialization and idempotent calls, but don't test error handling when initialization fails (e.g., invalid configuration). A test with a mock that raises during repository access would validate the error propagation.
| Usage: | ||
| from server.domain import ICrawledPagesRepository, CrawledPage | ||
| from server.domain import ISourcesRepository, Source | ||
| from server.domain import ICodeExamplesRepository, CodeExample |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the actual file content first
echo "=== File content ==="
cat -n python/src/server/domain/__init__.py | head -20
echo ""
echo "=== Import patterns in Python files ==="
rg -n "from (src\.)?server\.domain import" --type=py -A 1 | head -30
echo ""
echo "=== Also check for absolute imports without src ==="
rg -n "from server\.domain import" --type=py | head -20Repository: coleam00/Archon
Length of output: 1738
🏁 Script executed:
#!/bin/bash
# Search for actual import statements in Python files (not in docstrings)
echo "=== Actual imports of domain module in Python code ==="
rg "from server\.domain import|from \..*domain import|from archon\.server\.domain" --type=py -B 2 | grep -v "^--$" | head -40
echo ""
echo "=== Check the project root __init__.py structure ==="
find python/src -name "__init__.py" -type f | head -10
echo ""
echo "=== Check if there's a src/__init__.py ==="
cat python/src/__init__.py 2>/dev/null || echo "No src/__init__.py found"
echo ""
echo "=== Check server __init__.py ==="
head -5 python/src/server/__init__.py 2>/dev/null || echo "No server/__init__.py"Repository: coleam00/Archon
Length of output: 1244
🏁 Script executed:
#!/bin/bash
# Search for actual usage of domain classes in code (not docstrings)
echo "=== Files that import CrawledPage, Source, etc. ==="
rg "CrawledPage|ISourcesRepository|CodeExample" --type=py -l | head -20
echo ""
echo "=== Check import statements in those files ==="
rg "from.*domain|import.*CrawledPage" --type=py -B 1 -A 1 | grep -v "^--$" | head -50
echo ""
echo "=== Check pyproject.toml or setup.py for path config ==="
find . -maxdepth 2 -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | xargs cat 2>/dev/null | head -40
echo ""
echo "=== Check conftest.py for path modifications ==="
find . -name "conftest.py" -exec cat -n {} +Repository: coleam00/Archon
Length of output: 19801
Minor: Update import paths in docstring to match actual usage.
The usage examples show from server.domain import ... but all actual code in the codebase uses from src.server.domain import .... Update the docstring to reflect the correct import paths.
🔎 Proposed fix
Usage:
- from server.domain import ICrawledPagesRepository, CrawledPage
- from server.domain import ISourcesRepository, Source
- from server.domain import ICodeExamplesRepository, CodeExample
+ from src.server.domain import ICrawledPagesRepository, CrawledPage
+ from src.server.domain import ISourcesRepository, Source
+ from src.server.domain import ICodeExamplesRepository, CodeExample📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Usage: | |
| from server.domain import ICrawledPagesRepository, CrawledPage | |
| from server.domain import ISourcesRepository, Source | |
| from server.domain import ICodeExamplesRepository, CodeExample | |
| Usage: | |
| from src.server.domain import ICrawledPagesRepository, CrawledPage | |
| from src.server.domain import ISourcesRepository, Source | |
| from src.server.domain import ICodeExamplesRepository, CodeExample |
🤖 Prompt for AI Agents
In python/src/server/domain/__init__.py around lines 8 to 11, the docstring
usage examples incorrectly show imports as "from server.domain import ..." but
the project uses "from src.server.domain import ..."; update those example lines
to use "from src.server.domain import ICrawledPagesRepository, CrawledPage",
"from src.server.domain import ISourcesRepository, Source", and "from
src.server.domain import ICodeExamplesRepository, CodeExample" so the docstring
matches actual import paths used across the codebase.
| def __init__(self, pool: Pool, table_name: str = "archon_code_examples"): | ||
| self.pool = pool | ||
| self.table_name = table_name | ||
| self._logger = logger.bind(repository="PostgresCodeExamplesRepository") |
There was a problem hiding this comment.
Same SQL injection concern as PostgresCrawledPagesRepository.
The table_name parameter is interpolated directly into SQL queries. Apply the same validation mitigation suggested for PostgresCrawledPagesRepository.
| host = host or os.getenv("POSTGRES_HOST", "localhost") | ||
| port = port or int(os.getenv("POSTGRES_PORT", "5432")) | ||
| database = database or os.getenv("POSTGRES_DB", "archon") | ||
| user = user or os.getenv("POSTGRES_USER", "postgres") | ||
| password = password or os.getenv("POSTGRES_PASSWORD", "") |
There was a problem hiding this comment.
Empty password default may cause confusing connection failures.
Line 90 defaults password to an empty string when not provided. This could lead to authentication failures that are harder to debug than a missing configuration error.
As per coding guidelines: "Fail fast and loud for service startup failures, missing configuration."
🔎 Suggested fix: Require password or fail explicitly
# Get configuration from environment with fallbacks
host = host or os.getenv("POSTGRES_HOST", "localhost")
port = port or int(os.getenv("POSTGRES_PORT", "5432"))
database = database or os.getenv("POSTGRES_DB", "archon")
user = user or os.getenv("POSTGRES_USER", "postgres")
- password = password or os.getenv("POSTGRES_PASSWORD", "")
+ password_env = os.getenv("POSTGRES_PASSWORD")
+ if password is None and password_env is None:
+ raise ValueError(
+ "PostgreSQL password must be provided via argument or POSTGRES_PASSWORD environment variable"
+ )
+ password = password or password_envAlternatively, if empty passwords are valid for local development:
- password = password or os.getenv("POSTGRES_PASSWORD", "")
+ password = password if password is not None else os.getenv("POSTGRES_PASSWORD", "")
+ if password is None:
+ logger.warning(
+ "PostgreSQL password not provided - connection may fail. "
+ "Set POSTGRES_PASSWORD environment variable."
+ )
+ password = ""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| host = host or os.getenv("POSTGRES_HOST", "localhost") | |
| port = port or int(os.getenv("POSTGRES_PORT", "5432")) | |
| database = database or os.getenv("POSTGRES_DB", "archon") | |
| user = user or os.getenv("POSTGRES_USER", "postgres") | |
| password = password or os.getenv("POSTGRES_PASSWORD", "") | |
| host = host or os.getenv("POSTGRES_HOST", "localhost") | |
| port = port or int(os.getenv("POSTGRES_PORT", "5432")) | |
| database = database or os.getenv("POSTGRES_DB", "archon") | |
| user = user or os.getenv("POSTGRES_USER", "postgres") | |
| password_env = os.getenv("POSTGRES_PASSWORD") | |
| if password is None and password_env is None: | |
| raise ValueError( | |
| "PostgreSQL password must be provided via argument or POSTGRES_PASSWORD environment variable" | |
| ) | |
| password = password or password_env |
🤖 Prompt for AI Agents
python/src/server/infrastructure/postgres/connection.py lines 86-90: the code
currently defaults password to an empty string which can mask missing
configuration and produce confusing auth failures; change the logic to treat a
missing password as a configuration error by not defaulting to "" (use None or
no default), read os.getenv("POSTGRES_PASSWORD") and if the result is None or
empty when not explicitly allowed, raise a clear RuntimeError (or ValueError)
with a message like "POSTGRES_PASSWORD is required" so startup fails fast; if
you need to allow empty passwords for local dev, gate that behind a explicit
flag (e.g., ALLOW_EMPTY_POSTGRES_PASSWORD=true) and only permit empty when that
flag is set.
| def __init__(self, pool: Pool, table_name: str = "archon_crawled_pages"): | ||
| self.pool = pool | ||
| self.table_name = table_name | ||
| self._logger = logger.bind(repository="PostgresCrawledPagesRepository") |
There was a problem hiding this comment.
SQL injection risk with f-string table name interpolation.
The table_name parameter is interpolated directly into SQL queries (e.g., line 81, 100, 119, etc.). While the default is safe, if a caller passes untrusted input, this becomes an injection vector.
Consider validating the table name format or using an allow-list:
🔎 Suggested mitigation
+import re
+
+VALID_TABLE_NAME = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$')
+
class PostgresCrawledPagesRepository(ICrawledPagesRepository):
...
def __init__(self, pool: Pool, table_name: str = "archon_crawled_pages"):
+ if not VALID_TABLE_NAME.match(table_name):
+ raise ValueError(f"Invalid table name: {table_name}")
self.pool = pool
self.table_name = table_nameCommittable suggestion skipped: line range outside the PR's diff.
| global _crawled_pages_repository | ||
|
|
||
| if _crawled_pages_repository is None: | ||
| storage_type = _get_storage_type() | ||
|
|
||
| if storage_type == "supabase": | ||
| from .supabase.crawled_pages_repository import SupabaseCrawledPagesRepository | ||
|
|
||
| client = _get_supabase_client() | ||
| _crawled_pages_repository = SupabaseCrawledPagesRepository(client) | ||
| logger.info("Created SupabaseCrawledPagesRepository") | ||
|
|
||
| elif storage_type == "postgres": | ||
| from .postgres import PostgresCrawledPagesRepository | ||
|
|
||
| pool = _get_postgres_pool() | ||
| _crawled_pages_repository = PostgresCrawledPagesRepository(pool) | ||
| logger.info("Created PostgresCrawledPagesRepository") | ||
|
|
||
| elif storage_type == "memory": | ||
| from .memory import InMemoryCrawledPagesRepository | ||
|
|
||
| _crawled_pages_repository = InMemoryCrawledPagesRepository() | ||
| logger.info("Created InMemoryCrawledPagesRepository") | ||
|
|
||
| else: | ||
| raise ValueError(f"Unsupported storage type: {storage_type}") | ||
|
|
||
| return _crawled_pages_repository |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the full repository_factory.py file to understand the context
wc -l python/src/server/infrastructure/repository_factory.pyRepository: coleam00/Archon
Length of output: 117
🏁 Script executed:
# Read the full file
cat -n python/src/server/infrastructure/repository_factory.pyRepository: coleam00/Archon
Length of output: 13150
🏁 Script executed:
# Search for how get_crawled_pages_repository is called
rg "get_crawled_pages_repository" --type py -B 2 -A 2Repository: coleam00/Archon
Length of output: 4595
🏁 Script executed:
# Check for threading/locking patterns in the codebase
rg "Lock|lock|RLock|threading|asyncio|semaphore" python/src/server/infrastructure/ --type py | head -20Repository: coleam00/Archon
Length of output: 1886
🏁 Script executed:
# Look for FastAPI app initialization and lifespan setup
rg "lifespan|startup|on_event|@app" python/src/server/ --type py | head -30Repository: coleam00/Archon
Length of output: 1973
🏁 Script executed:
# Check if other global singletons use locking
rg "^_[a-z_]+_repository\s*=" python/src/server/infrastructure/repository_factory.pyRepository: coleam00/Archon
Length of output: 41
🏁 Script executed:
# Check the lifespan and container initialization in detail
rg "async def lifespan" -A 40 python/src/server/main.pyRepository: coleam00/Archon
Length of output: 1563
🏁 Script executed:
# Check how container.initialize() uses the factory
rg "async def initialize" -A 30 python/src/server/container.pyRepository: coleam00/Archon
Length of output: 1194
🏁 Script executed:
# Check if any code calls these getters from async contexts or tests
rg "get_crawled_pages_repository\(\)|get_sources_repository\(\)|get_code_examples_repository\(\)" -B 3 -A 3 python/src/server/ --type py | head -60Repository: coleam00/Archon
Length of output: 3964
🏁 Script executed:
# Check if there's any async context where getters might be called concurrently
rg "asyncio|gather|create_task" python/src/server/ --type py | grep -E "(factory|container|repository)" | head -20Repository: coleam00/Archon
Length of output: 41
🏁 Script executed:
# Look at how tests use these factories
cat python/tests/unit/test_container.py | head -100Repository: coleam00/Archon
Length of output: 2716
Synchronous getters lack thread-safety despite async-only usage pattern.
The lazy initialization pattern in get_crawled_pages_repository(), get_sources_repository(), and get_code_examples_repository() is not thread-safe. While the current architecture mitigates this risk—FastAPI's lifespan initializes repositories sequentially before accepting requests—adding a lock would prevent potential issues if these functions are ever called from actual threads. Note that the memory-based repositories already implement threading.Lock for thread-safety internally, suggesting this pattern is valued in the codebase.
🤖 Prompt for AI Agents
In python/src/server/infrastructure/repository_factory.py around lines 162 to
190, the lazy initialization in get_crawled_pages_repository (and similarly in
get_sources_repository and get_code_examples_repository) is not protected by a
lock; add a module-level threading.Lock (e.g., _repo_init_lock) and use it to
guard the check-and-create sequence (acquire the lock, re-check if the global is
None, create and assign the repository, then release via a context manager) so
initialization becomes thread-safe without changing the existing initialization
logic or semantics.
| # Initialize DI container (handles PostgreSQL pool if REPOSITORY_TYPE=postgres) | ||
| try: | ||
| await container.initialize() | ||
| api_logger.info(f"✅ Container initialized (storage: {container.storage_type})") | ||
| except Exception as e: | ||
| api_logger.warning(f"Could not initialize container: {e}") |
There was a problem hiding this comment.
Container initialization failure should crash when explicitly configured.
Per coding guidelines, database connection failures should "crash with clear errors." Currently, if REPOSITORY_TYPE=postgres is set but the PostgreSQL pool fails to initialize, the app continues with only a warning. This could lead to silent failures where health checks pass but all repository operations fail.
Consider re-raising the exception when a non-default backend is explicitly configured:
🔎 Proposed fix
# Initialize DI container (handles PostgreSQL pool if REPOSITORY_TYPE=postgres)
try:
await container.initialize()
api_logger.info(f"✅ Container initialized (storage: {container.storage_type})")
except Exception as e:
- api_logger.warning(f"Could not initialize container: {e}")
+ # Fail fast if user explicitly configured a non-default backend
+ if container.storage_type != "supabase":
+ api_logger.error(f"❌ Failed to initialize container (storage: {container.storage_type}): {e}")
+ raise
+ api_logger.warning(f"Could not initialize container: {e}")🤖 Prompt for AI Agents
In python/src/server/main.py around lines 118-123, the try/except around
container.initialize() currently only logs a warning on exception; change this
so that when a non-default backend is explicitly configured (e.g.
REPOSITORY_TYPE == "postgres" or container.storage_type == "postgres") the
exception is not swallowed: log the error with full exception details and then
re-raise the exception (or call sys.exit(1)) so the process crashes with a clear
non-zero exit; otherwise (default in-memory backend) you may continue to warn
and proceed.
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
docs/VALIDATION_MANIFEST.md (4)
125-128: Add language specifier to fenced code block.Code blocks should declare their language for syntax highlighting consistency.
🔎 Proposed fix
-``` +```bash ========================== 102 passed in 1.08s ============================= -``` +```
188-190: Add language specifier to fenced code block.Code blocks should declare their language for syntax highlighting consistency.
🔎 Proposed fix
-``` +```bash d8c45e8 feat(db): Add database abstraction layer with Repository Pattern -``` +```
299-301: Convert emphasis to proper heading syntax.Section headers should use markdown heading syntax (##, ###) rather than bold emphasis.
🔎 Proposed fix
--- -**Conclusion** +## Conclusion
1-316: Consider translating or documenting language choice for the project.The entire manifest is written in French, which may not align with the primary project documentation language. Ensure this choice is intentional and document the language strategy for the project's docs.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/VALIDATION_MANIFEST.md
🧰 Additional context used
🪛 LanguageTool
docs/VALIDATION_MANIFEST.md
[grammar] ~6-~6: Ensure spelling is correct
Context: ...refactorisation Date: 2026-01-02 Projet: Database Abstraction Layer - Archon ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~8-~8: Ensure spelling is correct
Context: ...action Layer - Archon V2 PR: #918 Branche: feature/db-abstraction-v2 --- ##...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~123-~123: Ensure spelling is correct
Context: ...` | MODIFIED | fixtures ajoutees | ### Resultats des tests ``` ========================...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~177-~177: Ensure spelling is correct
Context: ... | --- ## 7. Verification Git/PR ### Branche | Element | Status | |---------|--------...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~181-~181: Ensure spelling is correct
Context: ...ement | Status | |---------|--------| | Branche | feature/db-abstraction-v2 | | Base ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~211-~211: Ensure spelling is correct
Context: ...w | Complete | --- ## 8. Verification Qualite Code ### TODOs/FIXMEs ```bash grep -r...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~220-~220: Ensure spelling is correct
Context: ...# Resultat: AUCUN ``` Status: Code propre, aucun TODO laisse ### Type Hints Tou...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~239-~239: Ensure spelling is correct
Context: ...st Finale ### Code - [x] Domain Layer complet (interfaces + models) - [x] Infrastruct...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~267-~267: Ensure spelling is correct
Context: ... cree - [x] PR #915 ferme (legacy) ### Qualite - [x] Aucun TODO/FIXME - [x] Type hints co...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~306-~306: Ensure spelling is correct
Context: ... 11/11 phases terminees - 102/102 tests passent - 41 fichiers, +7,513 lignes - 0 TODOs res...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~307-~307: Ensure spelling is correct
Context: ...102 tests passent - 41 fichiers, +7,513 lignes - 0 TODOs restants - PR #918 en review Au...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~311-~311: Ensure spelling is correct
Context: ...DOs restants - PR #918 en review Aucun element n'a ete oublie. Le travail est pret pou...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~311-~311: Ensure spelling is correct
Context: ... element n'a ete oublie. Le travail est pret pour review et merge. --- **Manifeste...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
docs/VALIDATION_MANIFEST.md
125-125: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
188-188: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
301-301: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
Fixes for 8 actionable items from CodeRabbit review: 1. Remove duplicate import in conftest.py (line 34) 2. Add exc_info=True to container error logging 3. Fix Container.reset() to clean up _init_done attribute 4. Optimize Supabase count queries (select "id" instead of "*") 5. Fix mutation of updates dict in sources repositories 6. Add defensive parsing for DELETE result in Postgres repos 7. Add exc_info=True to all Postgres repository error logs All 102 tests passing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
♻️ Duplicate comments (2)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (1)
137-144: filter_metadata is silently ignored when source_id is provided.This issue was already flagged in a previous review. When
source_idis set, line 140 assigns an empty dict torpc_params["filter"], discarding anyfilter_metadatathe caller provided.python/src/server/infrastructure/postgres/crawled_pages_repository.py (1)
35-38: SQL injection risk with table_name parameter.This issue was already flagged in a previous review. The
table_nameparameter is interpolated directly into SQL queries throughout the file, creating a potential SQL injection vector if untrusted input is passed.
🟡 Minor comments (13)
python/src/server/infrastructure/supabase/crawled_pages_repository.py-96-98 (1)
96-98: Add exc_info=True to error logging.Same as the previous comment—all error logs should include
exc_info=Truefor full stack traces.🔎 Proposed fix
except Exception as e: - self._logger.error(f"find_by_url failed: {e}", url=url) + self._logger.error(f"find_by_url failed: {e}", url=url, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-209-211 (1)
209-211: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"insert failed: {e}", url=page.url) + self._logger.error(f"insert failed: {e}", url=page.url, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-312-314 (1)
312-314: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"count failed: {e}", source_id=source_id) + self._logger.error(f"count failed: {e}", source_id=source_id, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/sources_repository.py-70-72 (1)
70-72: Add exc_info=True to all error logging statements.All error logging throughout this file should include
exc_info=Trueto preserve full stack traces, per coding guidelines forpython/src/**/*.py.Example fix pattern:
- self._logger.error(f"operation failed: {e}", ...) + self._logger.error(f"operation failed: {e}", ..., exc_info=True)As per coding guidelines.
Also applies to: 89-91, 105-107, 123-125, 154-156, 178-180, 204-206, 220-222, 243-245, 258-260
python/src/server/infrastructure/supabase/crawled_pages_repository.py-258-260 (1)
258-260: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"insert_batch failed: {e}", count=len(pages)) + self._logger.error(f"insert_batch failed: {e}", count=len(pages), exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-331-333 (1)
331-333: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"list_unique_urls failed: {e}", source_id=source_id) + self._logger.error(f"list_unique_urls failed: {e}", source_id=source_id, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/postgres/crawled_pages_repository.py-91-91 (1)
91-91: Add exc_info=True to error logging statements.Most error logging in this file is missing
exc_info=True, which is required per coding guidelines. Line 207 correctly includes it, showing the right pattern to follow.Apply this pattern to all error logs:
- self._logger.error(f"operation failed: {e}", ...) + self._logger.error(f"operation failed: {e}", ..., exc_info=True)As per coding guidelines.
Also applies to: 110-110, 129-129, 264-264, 329-329, 352-352, 370-370, 390-390, 414-414
python/src/server/infrastructure/supabase/crawled_pages_repository.py-166-168 (1)
166-168: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"search_similar failed: {e}") + self._logger.error(f"search_similar failed: {e}", exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-79-81 (1)
79-81: Add exc_info=True to error logging.Per coding guidelines, error logging in
python/src/**/*.pymust preserve full stack traces usingexc_info=True.🔎 Proposed fix
except Exception as e: - self._logger.error(f"get_by_id failed: {e}", id=id) + self._logger.error(f"get_by_id failed: {e}", id=id, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/postgres/crawled_pages_repository.py-364-364 (1)
364-364: Add defensive parsing for DELETE result.Line 364 in
delete_by_source()parses the DELETE result without error handling, whiledelete_by_url()(lines 342-346) uses a try/except block with a fallback. This inconsistency could causedelete_by_source()to fail if the result format is unexpected.🔎 Proposed fix
result = await conn.execute( f"DELETE FROM {self.table_name} WHERE source_id = $1", source_id ) - deleted_count = int(result.split()[-1]) + # Parse "DELETE X" to get count (asyncpg returns "DELETE N") + try: + deleted_count = int(result.split()[-1]) + except (ValueError, IndexError): + self._logger.warning(f"Could not parse delete result: {result}") + deleted_count = 0 self._logger.info(f"delete_by_source deleted {deleted_count} pages", source_id=source_id)python/src/server/infrastructure/supabase/crawled_pages_repository.py-114-116 (1)
114-116: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"find_by_source failed: {e}", source_id=source_id) + self._logger.error(f"find_by_source failed: {e}", source_id=source_id, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-277-279 (1)
277-279: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"delete_by_url failed: {e}", url=url) + self._logger.error(f"delete_by_url failed: {e}", url=url, exc_info=True) raiseAs per coding guidelines.
python/src/server/infrastructure/supabase/crawled_pages_repository.py-296-298 (1)
296-298: Add exc_info=True to error logging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"delete_by_source failed: {e}", source_id=source_id) + self._logger.error(f"delete_by_source failed: {e}", source_id=source_id, exc_info=True) raiseAs per coding guidelines.
🧹 Nitpick comments (4)
python/src/server/infrastructure/supabase/code_examples_repository.py (1)
173-183: Consider extracting duplicate embedding column logic.The embedding column determination logic is duplicated between
insertandinsert_batch. Consider extracting to a helper method for maintainability.🔎 Proposed refactor
Add a helper method to the class:
def _get_embedding_column(self, dimension: int | None) -> str | None: """Determine the embedding column name based on dimension.""" if dimension is None: return None return { }.get(dimension)Then simplify both methods:
- # Determine embedding column - embedding_column = None - if example.embedding and example.embedding_dimension: - dim = example.embedding_dimension - if dim == 768: - embedding_column = "embedding_768" - elif dim == 1024: - embedding_column = "embedding_1024" - elif dim == 1536: - embedding_column = "embedding_1536" - elif dim == 3072: - embedding_column = "embedding_3072" + embedding_column = self._get_embedding_column(example.embedding_dimension)Also applies to: 220-232
python/src/server/infrastructure/supabase/crawled_pages_repository.py (2)
42-44: Consider moving json import to the top of the file.While inline imports are functionally safe, moving the json import to the module-level imports (line 7) would be more conventional and slightly more efficient if this method is called frequently.
🔎 Proposed refactor
At the top of the file:
from typing import Any +import json from supabase import ClientIn the method:
metadata = row.get("metadata", {}) if isinstance(metadata, str): - import json metadata = json.loads(metadata)
222-234: Extract duplicated embedding column determination logic.The embedding column determination logic (lines 173-185 in
insert()and 222-234 ininsert_batch()) is identical. Consider extracting this to a helper method to improve maintainability.🔎 Proposed refactor
+ def _get_embedding_column(self, dimension: int | None) -> str | None: + """Determine embedding column name from dimension.""" + if not dimension: + return None + + if dimension == 768: + return "embedding_768" + elif dimension == 1024: + return "embedding_1024" + elif dimension == 1536: + return "embedding_1536" + elif dimension == 3072: + return "embedding_3072" + return None + async def insert(self, page: CrawledPageCreate) -> CrawledPage: """Insert a new crawled page chunk.""" try: # Determine embedding column - embedding_column = None - if page.embedding and page.embedding_dimension: - dim = page.embedding_dimension - if dim == 768: - embedding_column = "embedding_768" - elif dim == 1024: - embedding_column = "embedding_1024" - elif dim == 1536: - embedding_column = "embedding_1536" - elif dim == 3072: - embedding_column = "embedding_3072" + embedding_column = self._get_embedding_column(page.embedding_dimension) if page.embedding else NoneApply the same pattern in
insert_batch().python/src/server/infrastructure/supabase/sources_repository.py (1)
38-40: Consider moving json import to the top of the file.Same suggestion as in
crawled_pages_repository.py—moving the json import to module-level would be more conventional.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
python/src/server/container.pypython/src/server/infrastructure/memory/sources_repository.pypython/src/server/infrastructure/postgres/code_examples_repository.pypython/src/server/infrastructure/postgres/crawled_pages_repository.pypython/src/server/infrastructure/postgres/sources_repository.pypython/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/tests/conftest.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/server/infrastructure/postgres/code_examples_repository.py
🧰 Additional context used
📓 Path-based instructions (4)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/conftest.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/conftest.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/container.pypython/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/postgres/sources_repository.pypython/src/server/infrastructure/postgres/crawled_pages_repository.pypython/src/server/infrastructure/memory/sources_repository.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
python/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/container.pypython/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/postgres/sources_repository.pypython/src/server/infrastructure/postgres/crawled_pages_repository.pypython/src/server/infrastructure/memory/sources_repository.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/container.pypython/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/postgres/sources_repository.pypython/src/server/infrastructure/postgres/crawled_pages_repository.pypython/src/server/infrastructure/memory/sources_repository.py
🧬 Code graph analysis (4)
python/tests/conftest.py (5)
python/src/server/infrastructure/memory/crawled_pages_repository.py (2)
InMemoryCrawledPagesRepository(21-240)clear(232-235)python/src/server/infrastructure/memory/sources_repository.py (1)
clear(163-166)python/src/server/infrastructure/memory/code_examples_repository.py (1)
clear(233-236)python/src/server/infrastructure/repository_factory.py (1)
reset_repositories_sync(291-306)python/src/server/container.py (2)
Container(38-221)reset(208-221)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (3)
python/src/server/domain/interfaces/crawled_pages_repository.py (11)
ICrawledPagesRepository(15-180)get_by_id(36-46)find_by_url(49-59)find_by_source(62-72)search_similar(75-97)insert(100-113)insert_batch(116-128)count(157-167)delete_by_url(131-141)delete_by_source(144-154)list_unique_urls(170-180)python/src/server/domain/models/crawled_page.py (3)
CrawledPage(26-80)CrawledPageCreate(83-100)CrawledPageMetadata(14-23)python/src/server/domain/models/search_result.py (1)
SearchResult(14-28)
python/src/server/infrastructure/postgres/sources_repository.py (4)
python/src/server/container.py (1)
sources_repository(159-166)python/src/server/domain/interfaces/sources_repository.py (11)
ISourcesRepository(14-155)get_by_id(29-39)get_by_url(42-52)list_all(55-62)search(65-75)create(78-91)update(94-105)update_counts(108-119)update_status(122-130)delete(133-145)count(148-155)python/src/server/domain/models/source.py (2)
Source(14-43)SourceCreate(46-57)python/src/server/infrastructure/postgres/connection.py (1)
pool(42-49)
python/src/server/infrastructure/postgres/crawled_pages_repository.py (3)
python/src/server/domain/interfaces/crawled_pages_repository.py (11)
ICrawledPagesRepository(15-180)get_by_id(36-46)find_by_url(49-59)find_by_source(62-72)search_similar(75-97)insert(100-113)insert_batch(116-128)count(157-167)delete_by_url(131-141)delete_by_source(144-154)list_unique_urls(170-180)python/src/server/domain/models/crawled_page.py (3)
CrawledPage(26-80)CrawledPageCreate(83-100)CrawledPageMetadata(14-23)python/src/server/domain/models/search_result.py (1)
SearchResult(14-28)
🔇 Additional comments (12)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (2)
1-21: LGTM!Module structure, imports, and similarity threshold constant are appropriate.
22-38: LGTM!Class initialization and logger binding follow best practices.
python/src/server/infrastructure/memory/sources_repository.py (1)
1-176: LGTM! Well-structured in-memory repository implementation.The implementation demonstrates solid practices:
- Thread-safe operations with proper lock usage
- Timezone-aware datetime handling
- Immutable updates through model reconstruction
- Clean test helper methods for test setup/teardown
python/src/server/infrastructure/supabase/sources_repository.py (1)
158-162: Excellent defensive copy to prevent mutation.The explicit copy (
{**updates, ...}) with a clear comment demonstrates good defensive programming and prevents unintended side effects on the caller's dictionary.python/src/server/infrastructure/postgres/crawled_pages_repository.py (2)
47-56: LGTM! Robust embedding parsing.The
parse_embeddinghelper handles multiple formats (string JSON, iterables, None) defensively, which is appropriate for the different ways pgvector might return embedding data.
207-207: Good! Error logging includes exc_info=True.This is the correct pattern per coding guidelines. However, note that this is inconsistent with other error logging in the file (lines 91, 110, 129, 264, 329, 352, 370, 390, 414) which are missing
exc_info=True.python/tests/conftest.py (3)
197-221: LGTM! Clean repository fixtures with proper cleanup.The memory repository fixtures follow pytest best practices:
- Clear initialization
- Yield for test usage
- Cleanup with
clear()to ensure test isolation
224-247: LGTM! Proper environment variable isolation.The fixture correctly:
- Stores the original
REPOSITORY_TYPEvalue- Sets it to "memory" for the test
- Restores the original value (or removes if not set)
- Resets repository singletons before and after
This ensures clean test isolation.
250-278: LGTM! Proper container lifecycle management for tests.The fixture provides excellent test isolation by:
- Resetting the Container singleton
- Setting memory repository type
- Resetting repository singletons
- Importing the container after environment setup
- Cleaning up all state after the test
This ensures each test using this fixture gets a fresh, isolated container.
python/src/server/container.py (1)
82-146: LGTM! Well-structured lifecycle management.The initialization and shutdown methods follow best practices:
- Idempotent operations with proper state guards
- Early validation through eager repository pre-warming (lines 106-108)
- Full stack traces preserved with
exc_info=True(lines 114, 145)- Fail-fast behavior for startup failures
python/src/server/infrastructure/postgres/sources_repository.py (2)
269-295: Good defensive parsing for DELETE results.The defensive parsing (lines 279-284) properly handles edge cases where the asyncpg DELETE result format might vary, with appropriate fallback behavior and warning logs.
36-53: LGTM! Robust type handling.The
_row_to_model()method correctly handles metadata as both dict and string (lines 39-40), which provides resilience against schema variations.
| async def get_by_id(self, source_id: str) -> Source | None: | ||
| """Get a source by its ID.""" | ||
| try: | ||
| async with self.pool.acquire() as conn: | ||
| row = await conn.fetchrow( | ||
| f"SELECT * FROM {self.table_name} WHERE source_id = $1", | ||
| source_id | ||
| ) | ||
|
|
||
| if not row: | ||
| return None | ||
|
|
||
| return self._row_to_model(row) | ||
|
|
||
| except Exception as e: | ||
| self._logger.error(f"get_by_id failed: {e}", source_id=source_id) | ||
| raise |
There was a problem hiding this comment.
Add exc_info=True to all error logs for complete stack traces.
Most error logs are missing exc_info=True, which violates the coding guidelines requirement to "preserve full stack traces with exc_info=True in Python logging." Only list_all() (line 102) and count() (line 308) include it.
Missing in:
- Line 70:
get_by_id - Line 88:
get_by_url - Line 123:
search - Line 159:
create - Line 200:
update - Line 243:
update_counts - Line 266:
update_status - Line 294:
delete
Based on coding guidelines requiring full stack traces for debugging and operational visibility.
🔎 Proposed fix to add exc_info=True to all error logs
except Exception as e:
- self._logger.error(f"get_by_id failed: {e}", source_id=source_id)
+ self._logger.error(f"get_by_id failed: {e}", source_id=source_id, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"get_by_url failed: {e}", url=url)
+ self._logger.error(f"get_by_url failed: {e}", url=url, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"search failed: {e}", query=query)
+ self._logger.error(f"search failed: {e}", query=query, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"create failed: {e}", source_id=source.source_id)
+ self._logger.error(f"create failed: {e}", source_id=source.source_id, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"update failed: {e}", source_id=source_id)
+ self._logger.error(f"update failed: {e}", source_id=source_id, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"update_counts failed: {e}", source_id=source_id)
+ self._logger.error(f"update_counts failed: {e}", source_id=source_id, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"update_status failed: {e}", source_id=source_id)
+ self._logger.error(f"update_status failed: {e}", source_id=source_id, exc_info=True)
raise
except Exception as e:
- self._logger.error(f"delete failed: {e}", source_id=source_id)
+ self._logger.error(f"delete failed: {e}", source_id=source_id, exc_info=True)
raiseAlso applies to: 73-89, 105-124, 126-160, 162-201, 203-244, 246-267, 269-295
🤖 Prompt for AI Agents
In python/src/server/infrastructure/postgres/sources_repository.py (lines 55-71,
73-89, 105-124, 126-160, 162-201, 203-244, 246-267, 269-295) the logger.error
calls in the exception handlers do not include exc_info=True; update each
exception handler so the existing self._logger.error(...) calls include
exc_info=True (preserve the current message and keyword args such as
source_id/url/etc.), and keep the subsequent raise so the exception is re-raised
unmodified.
| async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None: | ||
| """Update an existing source.""" | ||
| try: | ||
| # Add updated_at timestamp | ||
| updates["updated_at"] = datetime.now(timezone.utc) | ||
|
|
||
| # Build SET clause dynamically | ||
| set_clauses = [] | ||
| values = [] | ||
| param_idx = 1 | ||
|
|
||
| for key, value in updates.items(): | ||
| if key == "metadata" and isinstance(value, dict): | ||
| value = json.dumps(value) | ||
| set_clauses.append(f"{key} = ${param_idx}") | ||
| values.append(value) | ||
| param_idx += 1 | ||
|
|
||
| values.append(source_id) | ||
|
|
||
| query = f""" | ||
| UPDATE {self.table_name} | ||
| SET {", ".join(set_clauses)} | ||
| WHERE source_id = ${param_idx} | ||
| RETURNING * | ||
| """ | ||
|
|
||
| async with self.pool.acquire() as conn: | ||
| row = await conn.fetchrow(query, *values) | ||
|
|
||
| if not row: | ||
| return None | ||
|
|
||
| self._logger.info(f"Updated source: {source_id}", updates=list(updates.keys())) | ||
|
|
||
| return self._row_to_model(row) | ||
|
|
||
| except Exception as e: | ||
| self._logger.error(f"update failed: {e}", source_id=source_id) | ||
| raise |
There was a problem hiding this comment.
Fix input dict mutation in update() method.
Line 166 mutates the input updates dict by adding updated_at, which can cause unexpected side effects for callers. The PR summary indicates this was addressed ("fixed mutation of updates dict in sources repositories"), but the mutation remains.
🔎 Proposed fix to avoid mutating the input dict
async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None:
"""Update an existing source."""
try:
- # Add updated_at timestamp
- updates["updated_at"] = datetime.now(timezone.utc)
-
# Build SET clause dynamically
set_clauses = []
values = []
param_idx = 1
+ # Add updated_at timestamp without mutating input
+ all_updates = {**updates, "updated_at": datetime.now(timezone.utc)}
+
- for key, value in updates.items():
+ for key, value in all_updates.items():
if key == "metadata" and isinstance(value, dict):
value = json.dumps(value)
set_clauses.append(f"{key} = ${param_idx}")
values.append(value)
param_idx += 1🤖 Prompt for AI Agents
In python/src/server/infrastructure/postgres/sources_repository.py around lines
162 to 201, the update() method currently mutates the caller-supplied updates
dict by setting updates["updated_at"]; to fix, create a new local dict (shallow
copy) from the incoming updates and add the updated_at key to that copy, then
use the copy for serialization and query construction (do not modify the
original updates), and ensure any metadata dict is JSON-serialized from the copy
as before.
Addresses CodeRabbit review feedback - ensures full stack traces are captured in Supabase repository error logs for better debugging. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (3)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (1)
137-144:filter_metadatais silently ignored whensource_idis provided.When
source_idis set,filteris assigned an empty dict, discarding anyfilter_metadatathe caller provided. Callers might expect both filters to apply.🔎 Suggested fix to preserve filter_metadata
# Add source filter if source_id: rpc_params["source_filter"] = source_id - rpc_params["filter"] = {} - elif filter_metadata: - rpc_params["filter"] = filter_metadata - else: - rpc_params["filter"] = {} + rpc_params["filter"] = filter_metadata or {}python/src/server/infrastructure/supabase/code_examples_repository.py (2)
39-62: Add error handling for metadata JSON parsing.If
metadatais a malformed JSON string,json.loads()will raise aJSONDecodeErrorwithout context about which row failed. Per coding guidelines, this should either fail fast with context or skip corrupted data.🔎 Proposed fix with context-rich error handling
metadata = row.get("metadata", {}) if isinstance(metadata, str): import json - metadata = json.loads(metadata) + try: + metadata = json.loads(metadata) + except json.JSONDecodeError as e: + self._logger.error( + f"Failed to parse metadata JSON for code example: {e}", + id=row.get("id"), + exc_info=True + ) + raise ValueError(f"Invalid metadata JSON for code example {row.get('id')}") from e
144-154: Post-query language filtering may return fewer results thanmatch_count.When
languageis specified, filtering happens after fetchingmatch_countrows (lines 151-152). If several rows don't match the language, fewer thanmatch_countresults are returned. Consider documenting this limitation or updating the RPC to support language filtering server-side for consistent behavior across backends.
🧹 Nitpick comments (3)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (1)
173-184: Consider extracting duplicated embedding column selection logic.The embedding dimension-to-column mapping logic is duplicated between
insertandinsert_batch. Extracting this to a helper method would improve maintainability and reduce the risk of divergence.🔎 Proposed helper method
+ def _get_embedding_column(self, dimension: int | None) -> str | None: + """Map embedding dimension to column name.""" + if dimension is None: + return None + return { + 768: "embedding_768", + 1024: "embedding_1024", + 1536: "embedding_1536", + 3072: "embedding_3072", + }.get(dimension) + async def insert(self, page: CrawledPageCreate) -> CrawledPage: """Insert a new crawled page chunk.""" try: # Determine embedding column - embedding_column = None - if page.embedding and page.embedding_dimension: - dim = page.embedding_dimension - if dim == 768: - embedding_column = "embedding_768" - elif dim == 1024: - embedding_column = "embedding_1024" - elif dim == 1536: - embedding_column = "embedding_1536" - elif dim == 3072: - embedding_column = "embedding_3072" + embedding_column = self._get_embedding_column(page.embedding_dimension) if page.embedding else NoneAlso applies to: 222-233
python/src/server/infrastructure/supabase/code_examples_repository.py (2)
165-167: Add search parameters to error context.While
exc_info=Trueis now included, the error log lacks context about the search parameters that failed, which would aid debugging.🔎 Proposed fix
except Exception as e: - self._logger.error(f"search_similar failed: {e}", exc_info=True) + self._logger.error( + f"search_similar failed: {e}", + embedding_dimension=len(embedding), + match_count=match_count, + source_id=source_id, + language=language, + exc_info=True + ) raise
172-183: Consider extracting duplicated embedding column selection logic.Same duplication pattern as in
crawled_pages_repository.py. Extracting to a shared helper (or base class method) would improve maintainability.Also applies to: 221-232
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
python/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/sources_repository.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
python/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.py
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
🔇 Additional comments (4)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (1)
64-81: LGTM on error handling and logging patterns.All exception handlers correctly include
exc_info=Trueand relevant context (IDs, URLs, counts). The implementation follows the coding guidelines for preserving stack traces and including contextual information.Also applies to: 83-98, 100-116, 170-211, 213-260, 262-279, 281-298, 300-314, 316-333
python/src/server/infrastructure/supabase/code_examples_repository.py (1)
64-81: LGTM on error handling patterns.All exception handlers now correctly include
exc_info=Trueand provide relevant context. This addresses the previous review feedback about preserving stack traces.Also applies to: 83-98, 100-115, 169-210, 212-259, 261-278, 280-297, 299-313, 315-336
python/src/server/infrastructure/supabase/sources_repository.py (2)
55-72: LGTM on CRUD operations and error handling.The implementation correctly includes
exc_info=Truein all error handlers and provides relevant context (source_id, url, query). The update methods properly handle partial updates and timestamp management.Also applies to: 74-91, 93-107, 127-156, 158-180, 182-206, 208-222, 224-245
247-260: Fix count query to use correct column selection pattern.The
count()method uses.select("id", count="exact"), but the table usessource_idas its identifier, notid. This will cause the query to fail. Use.select("*", count="exact")to match the pattern used elsewhere in the codebase (e.g.,database_metrics_service.py,knowledge_item_service.py).⛔ Skipped due to learnings
Learnt from: stevepresley Repo: coleam00/Archon PR: 573 File: python/src/server/api_routes/projects_api.py:0-0 Timestamp: 2025-09-04T18:53:35.908Z Learning: Server-side counting optimization pattern: When counting rows by status/category in Supabase, use separate queries with count="exact" parameter instead of fetching all rows. Example: supabase_client.table("table").select("id", count="exact").eq("field", "value").execute() and read res.count. This reduces complexity from O(n) to O(1) and eliminates unnecessary data transfer.
| def _row_to_model(self, row: dict[str, Any]) -> Source: | ||
| """Convert a database row to a Source model.""" | ||
| metadata = row.get("metadata", {}) | ||
| if isinstance(metadata, str): | ||
| import json | ||
| metadata = json.loads(metadata) | ||
|
|
||
| return Source( | ||
| source_id=row["source_id"], | ||
| url=row.get("url", ""), | ||
| title=row.get("title"), | ||
| description=row.get("description"), | ||
| metadata=metadata, | ||
| pages_count=row.get("pages_count", 0), | ||
| chunks_count=row.get("chunks_count", 0), | ||
| status=row.get("status", "pending"), | ||
| created_at=row.get("created_at"), | ||
| updated_at=row.get("updated_at"), | ||
| ) |
There was a problem hiding this comment.
Add error handling for metadata JSON parsing.
If metadata is a malformed JSON string, json.loads() will raise a JSONDecodeError without context. Per coding guidelines, this should fail fast with context or skip corrupted data.
🔎 Proposed fix with context-rich error handling
def _row_to_model(self, row: dict[str, Any]) -> Source:
"""Convert a database row to a Source model."""
metadata = row.get("metadata", {})
if isinstance(metadata, str):
import json
- metadata = json.loads(metadata)
+ try:
+ metadata = json.loads(metadata)
+ except json.JSONDecodeError as e:
+ self._logger.error(
+ f"Failed to parse metadata JSON for source: {e}",
+ source_id=row.get("source_id"),
+ exc_info=True
+ )
+ raise ValueError(f"Invalid metadata JSON for source {row.get('source_id')}") from e🤖 Prompt for AI Agents
In python/src/server/infrastructure/supabase/sources_repository.py around lines
35 to 53, the code calls json.loads(metadata) without handling malformed JSON;
wrap the json.loads call in a try/except that catches json.JSONDecodeError and
raise a new ValueError (or a custom exception) that includes the source_id
(row.get("source_id")), the original error message, and the raw metadata string
(use "from e" to preserve traceback) so the function fails fast with context for
debugging.
| async def search(self, query: str) -> list[Source]: | ||
| """Search sources by title or description.""" | ||
| try: | ||
| # Use ilike for case-insensitive search | ||
| response = ( | ||
| self.client.table(self.table_name) | ||
| .select("*") | ||
| .or_(f"title.ilike.%{query}%,description.ilike.%{query}%") | ||
| .order("created_at", desc=True) | ||
| .execute() | ||
| ) | ||
|
|
||
| return [self._row_to_model(row) for row in response.data] | ||
|
|
||
| except Exception as e: | ||
| self._logger.error(f"search failed: {e}", query=query, exc_info=True) | ||
| raise |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Search for similar .or_() patterns and how they're used
rg "\.or_\(" --type=py -A2 -B2Repository: coleam00/Archon
Length of output: 4897
🏁 Script executed:
# Look for any sanitization or escaping utilities in the codebase
rg -i "sanitiz|escap|inject|quote" --type=py | head -30Repository: coleam00/Archon
Length of output: 3563
🏁 Script executed:
# Check if there are any input validation functions for queries
rg -n "def.*valid|def.*sanitiz|def.*escap" --type=pyRepository: coleam00/Archon
Length of output: 7435
🏁 Script executed:
# Look at the full sources_repository.py file to understand context
wc -l python/src/server/infrastructure/supabase/sources_repository.pyRepository: coleam00/Archon
Length of output: 126
🌐 Web query:
Supabase PostgREST client SQL injection protection escaping parameters
💡 Result:
Short answer
- The Supabase REST client (PostgREST) does not let you inject raw SQL through its HTTP endpoints; PostgREST escapes literal values when building queries and you should avoid constructing dynamic SQL on the server/client side. [1][2]
Recommended protections
- Use the REST/JS client or RPC (stored procedures) with typed parameters instead of concatenating strings into SQL. RPC parameters are passed as data, not raw SQL. [2]
- If you must run SQL from a server driver (pg, psycopg, asyncpg, etc.), always use parameterized queries / prepared statements (driver placeholders like $1) rather than string interpolation. [4]
- For any dynamic SQL generated inside PL/pgSQL, use quote_literal() and quote_ident() (or proper parameterization) and strongly validate/whitelist identifiers. [3]
- Avoid building SQL with user-controlled identifiers; if you must accept names (table/column), whitelist or use quote_identifier helpers from libraries (or pg_escape). [3][5]
Practical checklist
- Don’t send raw SQL over HTTP to PostgREST — use filters, RPC, or policy-based endpoints. [2]
- Use client/driver parameter placeholders ($1, $2) on server-side DB connections. [4]
- Validate and whitelist any user-supplied identifier names. [3][5]
- If you audit PostgREST behavior: the project escapes literals internally, but you shouldn’t rely on client-side escaping—design for parameterization. [1]
If you want, tell me which language/driver or whether you’re using the JS Supabase client vs a server DB driver and I’ll give concrete code examples.
Sources
- PostgREST discussion on escaping literals. [1]
- Supabase community answer re: direct SQL and REST clients. [2]
- PostgreSQL functions quote_literal / quote_identifier and best-practices. [3]
- Supabase docs on prepared statements / driver behavior. [4]
- pg_escape / identifier-quoting libraries for languages (example). [5]
🏁 Script executed:
# Check if there are any LIKE escaping utilities
rg -i "like.*escap|escap.*like" --type=pyRepository: coleam00/Archon
Length of output: 41
🏁 Script executed:
# Look at the full sources_repository.py file structure
sed -n '1,50p' python/src/server/infrastructure/supabase/sources_repository.py
sed -n '100,150p' python/src/server/infrastructure/supabase/sources_repository.pyRepository: coleam00/Archon
Length of output: 3370
🌐 Web query:
PostgreSQL LIKE operator escape percent underscore Supabase
💡 Result:
Short answer
- In PostgreSQL LIKE patterns, % and _ are wildcards. Precede them with an escape character to match them literally (default escape is backslash). Use ESCAPE to choose a different escape char. [1]
Examples
- SQL literal (escape % with backslash): SELECT * FROM t WHERE col LIKE '%100%%'; -- matches strings that end with "100%". In SQL string literals you may need to double backslashes depending on string-escape settings, e.g. '%100\%%'. [1][2]
- Use a custom escape char: SELECT * FROM t WHERE col LIKE '%50!%%' ESCAPE '!'; -- here ! escapes %. [1]
- To match a literal underscore: WHERE col LIKE 'file_name' (or 'file\_name' in some string contexts). [1][2]
Supabase / client notes
- Supabase uses Postgres (PostgREST), so the same escaping rules apply. When sending patterns from a client:
- Use parameterized queries or the client library methods (e.g., .rpc / SQL or filter/ilike) so you avoid manual quoting issues.
- If you build the pattern as a string, escape backslashes for the client language (e.g. in JS string literals use '\%') and URL-encode the pattern when calling the REST endpoint. [1]
Sources
- PostgreSQL pattern matching / LIKE (escape and ESCAPE clause). [1]
- Example usage and notes on backslash escaping in string literals. [2]
Escape LIKE wildcard characters (% and _) in the search query.
The query parameter is directly interpolated into the ILIKE pattern without escaping PostgreSQL's wildcard metacharacters. In PostgreSQL, % matches any characters and _ matches a single character. User input like "50%" will be interpreted as a wildcard pattern matching "50anything" rather than the literal string "50%".
To fix, escape backslashes in the search term before interpolation:
escaped_query = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
response = (
self.client.table(self.table_name)
.select("*")
.or_(f"title.ilike.%{escaped_query}%,description.ilike.%{escaped_query}%")
.order("created_at", desc=True)
.execute()
)This same pattern appears in knowledge_summary_service.py, knowledge_item_service.py, and task_service.py—apply the same fix across all search methods.
🤖 Prompt for AI Agents
In python/src/server/infrastructure/supabase/sources_repository.py around lines
109-125, the search method interpolates the raw query into the ILIKE pattern
allowing PostgreSQL wildcards (%) and (_) to change semantics; escape
backslashes, percent signs and underscores in the input before building the
ILIKE pattern (e.g. replace "\" with "\\", then "%" with "\%", then "_" with
"\_"), use the escaped string in the .or_() interpolation, and keep using ILIKE
for case-insensitive search; apply the identical escaping fix to the search
implementations in knowledge_summary_service.py, knowledge_item_service.py, and
task_service.py.
1. Preserve filter_metadata when source_id is provided (crawled_pages) 2. Add JSON error handling in _row_to_model (all 3 repos) 3. Document post-query language filtering limitation (code_examples) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
python/src/server/infrastructure/supabase/sources_repository.py (1)
117-133: LIKE wildcard escaping still missing (duplicate issue).The
queryparameter is directly interpolated into the ILIKE pattern without escaping PostgreSQL wildcard characters (%and_). This remains unfixed from the previous review. User input like "50%" will match "50anything" rather than the literal string "50%".🔎 Proposed fix
async def search(self, query: str) -> list[Source]: """Search sources by title or description.""" try: + # Escape LIKE wildcards for literal matching + escaped_query = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_") # Use ilike for case-insensitive search response = ( self.client.table(self.table_name) .select("*") - .or_(f"title.ilike.%{query}%,description.ilike.%{query}%") + .or_(f"title.ilike.%{escaped_query}%,description.ilike.%{escaped_query}%") .order("created_at", desc=True) .execute() )
🧹 Nitpick comments (2)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (1)
174-215: Consider extracting embedding column selection logic.The embedding column selection pattern (lines 178-188) is duplicated in both
insertandinsert_batchmethods, and similar logic appears across all three repository files. Consider extracting this into a private helper method to reduce duplication and improve maintainability.🔎 Proposed refactor
+ def _get_embedding_column(self, dimension: int | None) -> str | None: + """Get the appropriate embedding column name for the given dimension.""" + if dimension == 768: + return "embedding_768" + elif dimension == 1024: + return "embedding_1024" + elif dimension == 1536: + return "embedding_1536" + elif dimension == 3072: + return "embedding_3072" + elif dimension is not None: + self._logger.warning(f"Unsupported embedding dimension: {dimension}") + return None + async def insert(self, page: CrawledPageCreate) -> CrawledPage: """Insert a new crawled page chunk.""" try: # Determine embedding column - embedding_column = None - if page.embedding and page.embedding_dimension: - dim = page.embedding_dimension - if dim == 768: - embedding_column = "embedding_768" - elif dim == 1024: - embedding_column = "embedding_1024" - elif dim == 1536: - embedding_column = "embedding_1536" - elif dim == 3072: - embedding_column = "embedding_3072" + embedding_column = self._get_embedding_column(page.embedding_dimension) if page.embedding else NoneApply the same pattern in
insert_batchand across the other repository files.python/src/server/infrastructure/supabase/code_examples_repository.py (1)
183-224: Same embedding column selection duplication as in crawled_pages_repository.The embedding column selection logic (lines 186-197) is duplicated here and in
insert_batch. Consider extracting this into a shared helper method as suggested for the crawled_pages_repository.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
python/src/server/infrastructure/supabase/code_examples_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/sources_repository.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
python/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/code_examples_repository.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/code_examples_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/server/infrastructure/supabase/sources_repository.pypython/src/server/infrastructure/supabase/crawled_pages_repository.pypython/src/server/infrastructure/supabase/code_examples_repository.py
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
python/src/server/infrastructure/supabase/code_examples_repository.py
🔇 Additional comments (5)
python/src/server/infrastructure/supabase/crawled_pages_repository.py (2)
40-70: JSON error handling properly implemented.The metadata JSON parsing now includes proper error handling with context logging and graceful fallback to empty metadata. This addresses the previous review concern while maintaining service availability.
145-148: Filter metadata is now correctly preserved.The previous issue where
filter_metadatawas silently ignored whensource_idwas provided has been fixed. The filter is now correctly set regardless of whether a source_id is present.python/src/server/infrastructure/supabase/sources_repository.py (1)
36-61: JSON error handling properly implemented.The metadata JSON parsing now includes proper error handling with context logging and graceful fallback to empty metadata, consistent with the other repository implementations.
python/src/server/infrastructure/supabase/code_examples_repository.py (2)
40-70: JSON error handling properly implemented.The metadata JSON parsing includes proper error handling with context logging and graceful fallback to empty metadata, consistent with the other repository implementations.
125-181: Post-query language filtering limitation is now documented.The Note in the docstring (lines 137-142) clearly documents that language filtering happens post-query, which may result in fewer than
match_countresults. This addresses the previous review concern about inconsistent behavior across backends.
|
🔄 This repository is being replaced by a new version of Archon. The original Python/MCP codebase is being archived to the This PR is being closed as part of the migration. Thank you for your contribution! |
Summary
This PR introduces a database abstraction layer using the Repository Pattern, enabling Archon to support multiple database backends without modifying business logic.
Key Features
Architecture
Configuration
Files Changed
Test Plan
Breaking Changes
None. Existing Supabase users continue working without any changes. The abstraction layer is additive.
Related
docs/DATABASE_ABSTRACTION.md🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.