feat(db): Database Abstraction Layer with Repository Pattern by jlacerte · Pull Request #918 · coleam00/Archon

jlacerte · 2026-01-02T00:03:30Z

Summary

This PR introduces a database abstraction layer using the Repository Pattern, enabling Archon to support multiple database backends without modifying business logic.

Key Features

3 Backend Support: Supabase (existing), PostgreSQL (new), InMemory (tests)
Clean Architecture: Domain interfaces + Infrastructure implementations
Dependency Injection: Container singleton with async lifecycle
Multi-dimension Embeddings: 768, 1024, 1536, 3072 support
102 Unit Tests: All passing (0.91s execution)

Architecture

Application Layer (FastAPI, MCP)
         |
    Container (DI)
         |
   Domain Layer (Interfaces + Models)
         |
Infrastructure Layer (Supabase | PostgreSQL | InMemory)

Configuration

# Supabase (default - no change for existing users)
export REPOSITORY_TYPE=supabase

# PostgreSQL (self-hosted alternative)
export REPOSITORY_TYPE=postgres
export POSTGRES_HOST=localhost

# InMemory (fast unit tests)
export REPOSITORY_TYPE=memory

Files Changed

Category	Files	Lines
Domain (interfaces + models)	10	~460
Infrastructure (3 backends)	12	~1,650
Container & Factory	2	~380
Tests	9	~1,200
Documentation	2	~700
Total	41	~7,500

Test Plan

102 unit tests pass locally
InMemory backend tested extensively
Container lifecycle tested (init/shutdown)
Contract tests verify interface compliance
Integration tests with real PostgreSQL (CI)
Integration tests with real Supabase (CI)

Breaking Changes

None. Existing Supabase users continue working without any changes. The abstraction layer is additive.

Summary by CodeRabbit

New Features
- Runtime-selectable storage backends (Supabase, PostgreSQL, InMemory) with a DI container, lifecycle management and health checks; unified repository APIs for pages, sources and code examples.
- Multi-dimension embedding support and vector similarity search for pages and code examples.
Documentation
- Detailed database abstraction guide, refactoring status report, validation manifest, migration and troubleshooting guides.
Tests
- Extensive unit and contract test suite covering repositories, vector utilities and container behavior.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Introduces a clean architecture database layer supporting multiple backends (Supabase, PostgreSQL, InMemory) without modifying business logic. ## Changes ### Domain Layer - Add interfaces: ICrawledPagesRepository, ISourcesRepository, ICodeExamplesRepository - Add Pydantic models: CrawledPage, Source, CodeExample, SearchResult - Multi-dimension embeddings support (768, 1024, 1536, 3072) ### Infrastructure Layer - Supabase: Refactored existing implementation to use interfaces - PostgreSQL: New asyncpg + pgvector implementation with connection pooling - InMemory: Thread-safe implementation for fast unit tests ### Container & DI - Singleton Container with async lifecycle (initialize/shutdown) - Factory pattern for backend selection via REPOSITORY_TYPE env var - Integrated with FastAPI lifespan ### Tests - 102 unit tests (100% passing) - Contract tests verifying interface compliance - InMemory fixtures for isolated testing ### Documentation - docs/DATABASE_ABSTRACTION.md - Complete architecture guide - Migration guide from direct Supabase calls - Instructions for adding new backends ## Configuration ```bash # Supabase (default) export REPOSITORY_TYPE=supabase # PostgreSQL export REPOSITORY_TYPE=postgres # InMemory (tests) export REPOSITORY_TYPE=memory ``` Related: PR coleam00#915 (previous work on legacy architecture) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai · 2026-01-02T00:03:44Z

📝 Walkthrough

Walkthrough

Adds a Database Abstraction Layer: async domain models and interfaces, in-memory/Postgres/Supabase repository implementations, a repository factory and DI Container with lifecycle/health, server wiring in main, extensive unit/contract tests and fixtures, vector utilities, and supporting documentation.

Changes

Cohort / File(s)	Summary
Documentation `docs/DATABASE_ABSTRACTION.md`, `docs/REFACTORING_STATUS_REPORT_V2.md`, `docs/VALIDATION_MANIFEST.md`	New design, status, validation, migration and rollout docs for the DAL and refactor.
Domain package & models `python/src/server/domain/__init__.py`, `python/src/server/domain/models/*`	New domain API: Pydantic models (`CrawledPage`, `Source`, `CodeExample*`, `SearchResult`) with multi-dimension embedding fields and `get_embedding` helpers.
Domain interfaces `python/src/server/domain/interfaces/*`	New async repository/service interfaces: `ICrawledPagesRepository`, `ISourcesRepository`, `ICodeExamplesRepository`, `IEmbeddingService`.
Infrastructure facade & factory `python/src/server/infrastructure/__init__.py`, `python/src/server/infrastructure/repository_factory.py`	Repository factory and facade: env-driven backend selection (memory/postgres/supabase), lazy singletons, Postgres pool init/close, overrides and test reset helpers.
In-memory implementations & utils `python/src/server/infrastructure/memory/*`	Thread-safe in-memory repos (`InMemory*`), `vector_utils.py` (cosine/euclidean/normalize), similarity search, test helpers and public exports.
Postgres implementations & connection `python/src/server/infrastructure/postgres/*`	`PostgresConnectionManager` and Postgres-backed repositories using asyncpg/pgvector, dynamic embedding-column selection, similarity queries, transactions, and row→model mapping.
Supabase implementations `python/src/server/infrastructure/supabase/*`	Supabase-backed repositories with RPC-based vector search, metadata handling, embedding-dimension mapping, and row→model conversion.
DI Container & server integration `python/src/server/container.py`, `python/src/server/main.py`	`Container` singleton with initialize/shutdown/health_check, repository accessors wired via `REPOSITORY_TYPE`, lifecycle calls integrated into server startup/shutdown, and global helper getters.
Tests & fixtures `python/tests/conftest.py`, `python/tests/unit/`, `python/tests/integration/`	Extensive unit and contract tests for container, repositories and vector utils; fixtures to switch to memory backend and reset state; integration package markers.
Test package initializers `python/tests/unit/__init__.py`, `python/tests/integration/__init__.py`, `python/tests/unit/repositories/__init__.py`	Package markers for test organization.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Cont as DI Container
    participant Fact as Repository Factory
    participant Env as Config/Env
    participant Pool as Postgres Pool
    participant Repo as Repository

    Note over App,Cont: Startup
    App->>Cont: initialize()
    activate Cont
    Cont->>Env: read REPOSITORY_TYPE
    Env-->>Cont: storage_type
    alt storage_type == "postgres"
        Cont->>Pool: initialize_postgres_pool()
        Pool-->>Cont: pool ready
    end
    Cont-->>App: initialized
    deactivate Cont

    Note over App,Repo: Resolve repo
    App->>Cont: get_crawled_pages_repository()
    activate Cont
    Cont->>Fact: get_crawled_pages_repository()
    activate Fact
    Fact->>Env: check REPOSITORY_TYPE
    alt "memory"
        Fact-->>Cont: InMemoryCrawledPagesRepository
    else "postgres"
        Fact->>Pool: use pool
        Fact-->>Cont: PostgresCrawledPagesRepository(pool)
    else "supabase"
        Fact-->>Cont: SupabaseCrawledPagesRepository(client)
    end
    deactivate Fact
    Cont-->>App: repository (cached)
    deactivate Cont

    Note over App,Repo: Normal ops
    App->>Repo: insert(...) / search_similar(...)
    Repo-->>App: result(s)

    Note over App,Cont: Shutdown
    App->>Cont: shutdown()
    activate Cont
    Cont->>Pool: close_postgres_pool()
    Pool-->>Cont: closed
    Cont->>Cont: clear cached repos
    deactivate Cont

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

RAG Enhancements (Page Level Retrieval) #767: Overlaps page-level storage and page_id handling; touches CrawledPage models and repository APIs (strong code-level connection).

Suggested reviewers

coleam00
leex279
tazmon95

Poem

🐰 I hopped into code with a twitchy nose,
Buried vectors, nests of repos in tidy rows.
Container hums, backends snug in bed,
Tests hop by moonlight, keeping bugs well-fed.
Cheers — the data burrow springs ahead!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: introducing a database abstraction layer using the Repository Pattern. It directly reflects the core purpose of the changeset.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering Summary, Key Features, Architecture, Configuration, Files Changed, Test Plan, Breaking Changes, and Related sections. It addresses the required template sections with substantive detail about the abstraction layer implementation.
Docstring Coverage	✅ Passed	Docstring coverage is 94.62% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

python/tests/conftest.py (1)
34-34: Duplicate import.

patch is already imported from unittest.mock at line 4. This duplicate import is unnecessary.
🔎 Proposed fix
 # Apply global patches immediately
-from unittest.mock import patch
 _global_patches = [

🧹 Nitpick comments (35)

docs/DATABASE_ABSTRACTION.md (1)
23-57: Add language identifiers to code blocks.

Lines 23 and 527 contain fenced code blocks without language specifications. While the ASCII diagrams render correctly, adding language identifiers improves markdown linting compliance.
🔎 Proposed fix for missing code block language identifiers
-```
+```text
 +------------------------------------------------------------------+
 |                        APPLICATION LAYER                          |
 |  FastAPI endpoints, MCP tools, Services                          |
 |  ...
-```
+```text
And similarly for line 527:
-```
+```text
 python/src/server/
 +-- domain/
 |   +-- __init__.py
 |   ...
-```
+```text
Also applies to: 527-555
docs/REFACTORING_STATUS_REPORT_V2.md (3)
52-52: Add language identifiers to fenced code blocks.

Multiple code blocks lack language specifications, violating markdown linting rules (MD040). This affects lines 52, 89, 187, and 429.
🔎 Proposed fix example

For line 52 (diagram):
-```
+```text
 +-----------------------------------------------------------+
For lines 89, 187 (config/tables):
-```
+```yaml
 REPOSITORY_TYPE env var
For line 429 (Python code):
-```
+```python
 python/src/server/
Also applies to: 89-89, 187-187, 429-429

412-412: Use a proper heading instead of emphasis.

Line 412 uses emphasis (**...**) for what should be a section heading. This violates the MD036 rule.
-**Phases 0-7: COMPLETEES A 100%**
+## Phases 0-7: Complétées à 100%
1-485: Consider aligning documentation language with the PR.

This refactoring status report is written entirely in French, while DATABASE_ABSTRACTION.md and the broader PR are in English. This creates inconsistency in the documentation set.

Recommendation: Either translate this report to English for consistency, or clarify if it's intended as a French-language artifact for a French-speaking audience. If translation is planned, I can assist.

Would you like me to help translate this status report to English or clarify the intended audience for this document?
python/src/server/domain/interfaces/sources_repository.py (1)

121-130: Consider using an enum for status values.

The status parameter accepts a bare string with documented values (pending, crawling, completed, failed). A SourceStatus enum would provide compile-time safety and auto-completion. This is optional as the current approach is flexible.

python/tests/unit/repositories/test_sources_repository.py (1)

122-138: Potential flakiness with timestamp ordering test.

Using asyncio.sleep(0.01) to ensure different timestamps could be flaky on slow systems or under load. Consider using explicit timestamps if the repository implementation supports it, or increase the sleep duration for more reliability.
python/src/server/infrastructure/memory/vector_utils.py (1)
57-77: Consider adding empty vector validation for consistency.

While mathematically valid, euclidean_distance allows empty vectors (returns 0.0) but cosine_similarity raises ValueError. For API consistency, consider adding the same empty vector check here.
🔎 Proposed validation addition
 def euclidean_distance(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
     """
     Calculate Euclidean distance between two vectors.
 
     Args:
         vec_a: First vector
         vec_b: Second vector
 
     Returns:
         Euclidean distance (lower = more similar)
 
     Raises:
         ValueError: If vectors have different lengths
     """
     if len(vec_a) != len(vec_b):
         raise ValueError(
             f"Vectors must have same length: {len(vec_a)} != {len(vec_b)}"
         )
+
+    if len(vec_a) == 0:
+        raise ValueError("Vectors cannot be empty")
 
     sum_squared_diff = sum((a - b) ** 2 for a, b in zip(vec_a, vec_b))
     return math.sqrt(sum_squared_diff)
python/src/server/infrastructure/memory/crawled_pages_repository.py (3)
17-18: Document rationale for low similarity threshold.

The SIMILARITY_THRESHOLD = 0.05 is very permissive (5% similarity). Consider documenting why such a low threshold is appropriate for in-memory testing, or making it configurable if different test scenarios need different thresholds.

38-52: Consider reusing domain model's get_embedding method.

The _get_embedding helper duplicates logic from CrawledPage.get_embedding() method (seen in domain model). Consider calling the domain model method directly to avoid duplication.
🔎 Proposed refactor to reuse domain method
 def _get_embedding(self, page: CrawledPage) -> list[float] | None:
     """Get the active embedding from a page based on its dimension."""
-    if page.embedding_dimension == 768:
-        return page.embedding_768
-    elif page.embedding_dimension == 1024:
-        return page.embedding_1024
-    elif page.embedding_dimension == 1536:
-        return page.embedding_1536
-    elif page.embedding_dimension == 3072:
-        return page.embedding_3072
-    # Fallback: try to find any available embedding
-    return (
-        page.embedding_1536 or page.embedding_768 or
-        page.embedding_1024 or page.embedding_3072
-    )
+    return page.get_embedding()
130-178: Handle unsupported embedding dimensions explicitly.

The insert method silently ignores embeddings with unsupported dimensions (not 768/1024/1536/3072). Consider logging a warning or raising an error to catch configuration issues early.
🔎 Proposed validation for embedding dimension
 # Determine which embedding column to use
 embedding_768 = None
 embedding_1024 = None
 embedding_1536 = None
 embedding_3072 = None

 if page.embedding and page.embedding_dimension:
     if page.embedding_dimension == 768:
         embedding_768 = page.embedding
     elif page.embedding_dimension == 1024:
         embedding_1024 = page.embedding
     elif page.embedding_dimension == 1536:
         embedding_1536 = page.embedding
     elif page.embedding_dimension == 3072:
         embedding_3072 = page.embedding
+    else:
+        raise ValueError(
+            f"Unsupported embedding dimension: {page.embedding_dimension}. "
+            f"Supported dimensions: 768, 1024, 1536, 3072"
+        )
python/src/server/domain/models/source.py (1)
34-35: Consider adding URL validation.

The url field accepts any string without format validation. While domain models are often kept lean, adding basic URL validation could prevent invalid data from entering the system early.
🔎 Optional: Add URL validation
 from pydantic import BaseModel, Field
+from pydantic import HttpUrl

 class Source(BaseModel):
     source_id: str = Field(..., description="Unique source identifier")
-    url: str = Field(..., description="Base URL of the source")
+    url: HttpUrl = Field(..., description="Base URL of the source")
Alternatively, you can use a custom validator if you need more flexible URL formats:
from pydantic import field_validator

class Source(BaseModel):
    # ... fields ...
    
    @field_validator('url')
    @classmethod
    def validate_url(cls, v: str) -> str:
        if not v.startswith(('http://', 'https://')):
            raise ValueError('URL must start with http:// or https://')
        return v
python/src/server/infrastructure/supabase/crawled_pages_repository.py (3)
173-184: Duplicated embedding column selection logic.

The dimension-to-column mapping appears in both insert and insert_batch. Consider extracting a helper method.
🔎 Proposed helper
def _get_embedding_column(self, dimension: int | None) -> str | None:
    """Map embedding dimension to column name."""
    return {768: "embedding_768", 1024: "embedding_1024", 
Also applies to: 222-233

303-310: Count query fetches all columns unnecessarily.

Using select("*", count="exact") retrieves all rows. For a pure count, select a minimal column to reduce payload.
🔎 Suggested optimization
-            query = self.client.table(self.table_name).select("*", count="exact")
+            query = self.client.table(self.table_name).select("id", count="exact")
42-44: Move json import to module level for consistency.

The lazy import inside _row_to_model works but differs from other repository files that import json at the top.
python/src/server/infrastructure/memory/sources_repository.py (1)
100-115: update mutates the caller's updates dict.

Line 109 modifies the dict passed by the caller by adding updated_at. This side effect could be unexpected. Consider working on a copy.
🔎 Suggested fix
     async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None:
         """Update an existing source."""
         with self._lock:
             source = self._sources.get(source_id)
             if not source:
                 return None

             # Create updated source
             source_dict = source.model_dump()
-            source_dict.update(updates)
-            source_dict["updated_at"] = datetime.now(timezone.utc)
+            source_dict.update({**updates, "updated_at": datetime.now(timezone.utc)})

             updated_source = Source(**source_dict)
             self._sources[source_id] = updated_source

             return updated_source
python/src/server/infrastructure/repository_factory.py (1)
146-266: Significant code duplication across factory functions.

get_crawled_pages_repository, get_sources_repository, and get_code_examples_repository share identical storage-type switching logic. Consider a generic factory helper or table-driven approach.
🔎 Conceptual refactor using a registry
_REPOSITORY_REGISTRY = {
    "crawled_pages": {
        "supabase": ("supabase.crawled_pages_repository", "SupabaseCrawledPagesRepository"),
        "postgres": ("postgres", "PostgresCrawledPagesRepository"),
        "memory": ("memory", "InMemoryCrawledPagesRepository"),
    },
    # ... similar for sources, code_examples
}

def _create_repository(repo_type: str, storage_type: str):
    module_path, class_name = _REPOSITORY_REGISTRY[repo_type][storage_type]
    # dynamic import and instantiation
python/src/server/infrastructure/postgres/sources_repository.py (1)
278-280: Fragile parsing of DELETE result.

Parsing result.split()[-1] relies on asyncpg's "DELETE N" format. Consider a more defensive approach or document this assumption.
🔎 Safer parsing
-                # Parse "DELETE X" to get count
-                deleted_count = int(result.split()[-1])
+                # asyncpg returns "DELETE N" where N is affected rows
+                try:
+                    deleted_count = int(result.split()[-1])
+                except (ValueError, IndexError):
+                    self._logger.warning(f"Unexpected DELETE result format: {result}")
+                    deleted_count = 0
python/src/server/infrastructure/supabase/sources_repository.py (2)
158-180: update mutates the caller's updates dict.

Line 162 adds updated_at directly to the passed dict, which could surprise callers. Consider working on a copy for consistency with expected behavior.
🔎 Suggested fix
     async def update(self, source_id: str, updates: dict[str, Any]) -> Source | None:
         """Update an existing source."""
         try:
             # Add updated_at timestamp
-            updates["updated_at"] = datetime.now(timezone.utc).isoformat()
+            updates = {**updates, "updated_at": datetime.now(timezone.utc).isoformat()}

             response = (
                 self.client.table(self.table_name)
                 .update(updates)
250-256: Count query fetches all columns unnecessarily.

Same optimization opportunity as in SupabaseCrawledPagesRepository—select a minimal column instead of "*".
python/src/server/infrastructure/supabase/code_examples_repository.py (3)
39-62: Consider moving the JSON import to module level.

The import json inside _row_to_model works but is unconventional. Moving it to the module's import section improves clarity and avoids repeated import lookups.
🔎 Suggested fix

Add to the imports at the top of the file:
 from typing import Any
+import json

 from supabase import Client
Then update _row_to_model:
     def _row_to_model(self, row: dict[str, Any]) -> CodeExample:
         """Convert a database row to a CodeExample model."""
         metadata = row.get("metadata", {})
         if isinstance(metadata, str):
-            import json
             metadata = json.loads(metadata)
169-259: Extract duplicate embedding column selection logic to a helper method.

The embedding column determination logic (lines 172-183 and 221-232) is duplicated. This violates DRY and makes adding new embedding dimensions error-prone.
🔎 Suggested refactor
+    def _get_embedding_column(self, dimension: int | None) -> str | None:
+        """Get the embedding column name for a given dimension."""
+        if dimension == 768:
+            return "embedding_768"
+        elif dimension == 1024:
+            return "embedding_1024"
+        elif dimension == 1536:
+            return "embedding_1536"
+        elif dimension == 3072:
+            return "embedding_3072"
+        return None
+
     async def insert(self, example: CodeExampleCreate) -> CodeExample:
         """Insert a new code example."""
         try:
             # Determine embedding column
-            embedding_column = None
-            if example.embedding and example.embedding_dimension:
-                dim = example.embedding_dimension
-                if dim == 768:
-                    embedding_column = "embedding_768"
-                elif dim == 1024:
-                    embedding_column = "embedding_1024"
-                elif dim == 1536:
-                    embedding_column = "embedding_1536"
-                elif dim == 3072:
-                    embedding_column = "embedding_3072"
+            embedding_column = (
+                self._get_embedding_column(example.embedding_dimension)
+                if example.embedding and example.embedding_dimension
+                else None
+            )
299-313: Optimize count query by selecting minimal columns.

select("*", count="exact") fetches all columns when only the count is needed. Select a single column (like id) for better performance.
🔎 Suggested fix
     async def count(self, source_id: str | None = None) -> int:
         """Count code examples in the repository."""
         try:
-            query = self.client.table(self.table_name).select("*", count="exact")
+            query = self.client.table(self.table_name).select("id", count="exact")

             if source_id:
                 query = query.eq("source_id", source_id)

             response = query.execute()

             return response.count if response.count else 0
python/src/server/container.py (3)
54-68: Singleton reset may leave stale state on existing references.

Container.reset() clears _instance but doesn't reset _init_done on existing objects. If any code holds a reference to the old container, creating a new Container() returns a fresh instance, but the old reference remains in a partially valid state.

For test isolation, also resetting instance attributes in reset() would be safer:
🔎 Suggested fix
     @classmethod
     def reset(cls) -> None:
         """
         Reset the container singleton.

         For testing purposes only.
         """
+        if cls._instance is not None:
+            cls._instance._initialized = False
+            cls._instance._storage_type = None
+            # Remove _init_done to allow re-initialization
+            if hasattr(cls._instance, "_init_done"):
+                delattr(cls._instance, "_init_done")
         cls._instance = None
         logger.info("Container singleton reset")
113-115: Preserve full stack trace in error logging.

Per coding guidelines, error logging should use exc_info=True to preserve the full stack trace.
🔎 Suggested fix
         except Exception as e:
-            logger.error(f"Container initialization failed: {e}")
+            logger.error(f"Container initialization failed: {e}", exc_info=True)
             raise
144-146: Add exc_info=True to shutdown error logging.

Same as initialization - preserve the stack trace for debugging.
🔎 Suggested fix
         except Exception as e:
-            logger.error(f"Container shutdown error: {e}")
+            logger.error(f"Container shutdown error: {e}", exc_info=True)
             raise
python/src/server/infrastructure/memory/code_examples_repository.py (2)
180-186: insert_batch is not atomic - each insert acquires lock separately.

insert_batch calls self.insert() in a loop, acquiring and releasing the lock for each item. This differs from the behavior of Supabase and PostgreSQL implementations which perform batch operations atomically. For test fidelity, consider making this atomic:
🔎 Suggested fix
     async def insert_batch(self, examples: list[CodeExampleCreate]) -> list[CodeExample]:
         """Insert multiple code examples in a batch."""
-        results = []
-        for example in examples:
-            created = await self.insert(example)
-            results.append(created)
-        return results
+        with self._lock:
+            results = []
+            for example in examples:
+                example_id = str(uuid.uuid4())
+                now = datetime.now(timezone.utc)
+                # ... (inline the insert logic here)
+                # Or refactor to have _insert_unlocked helper
+            return results
Alternatively, create an _insert_unlocked helper that both insert and insert_batch can use.
243-247: set_example silently ignores examples without an ID.

If example.id is None, the method does nothing without feedback. Consider raising a ValueError or auto-generating an ID to avoid silent failures in test setup.
🔎 Suggested fix
     def set_example(self, example: CodeExample) -> None:
         """Directly set an example (for test setup)."""
         with self._lock:
-            if example.id:
-                self._examples[example.id] = example
+            if not example.id:
+                raise ValueError("Cannot set example without an ID")
+            self._examples[example.id] = example
python/src/server/infrastructure/postgres/crawled_pages_repository.py (3)
340-345: Fragile parsing of DELETE result status.

Parsing result.split()[-1] assumes PostgreSQL always returns "DELETE N" format. While this is the documented behavior, adding error handling makes the code more robust:
🔎 Suggested fix
-                # Parse "DELETE X" to get count
-                deleted_count = int(result.split()[-1])
+                # Parse "DELETE X" to get count
+                try:
+                    deleted_count = int(result.split()[-1])
+                except (IndexError, ValueError):
+                    self._logger.warning(f"Could not parse delete result: {result}")
+                    deleted_count = 0
90-92: Add exc_info=True to all error logging calls.

Per coding guidelines, error logging should preserve full stack traces. This pattern should be applied to all error handlers in this file (lines 91, 110, 129, 207, 264, 329, 348, 367, 386, 410).
🔎 Example fix
         except Exception as e:
-            self._logger.error(f"get_by_id failed: {e}", id=id)
+            self._logger.error(f"get_by_id failed: {e}", id=id, exc_info=True)
             raise
267-330: Batch insert correctly uses transactions but has duplicate column-building logic.

The transaction usage ensures atomicity (good). However, the embedding column determination and query building (lines 278-318) duplicates the logic in insert(). Consider extracting a helper method to reduce duplication.
python/src/server/infrastructure/postgres/code_examples_repository.py (2)

342-347: Same fragile DELETE result parsing as crawled_pages repository.

Apply the same defensive parsing pattern suggested for PostgresCrawledPagesRepository.delete_by_url.

90-92: Add exc_info=True to error logging calls throughout this file.

Same pattern as PostgresCrawledPagesRepository - all error handlers should preserve stack traces per coding guidelines.
python/tests/unit/repositories/test_repository_contract.py (2)
25-37: Contract classes import ABC but don't inherit from it.

ABC is imported (line 10) but the contract classes don't inherit from it. While this works (Python doesn't enforce abstract fixture implementations), inheriting from ABC would make the intent clearer:
🔎 Suggested fix
-class CrawledPagesRepositoryContract:
+class CrawledPagesRepositoryContract(ABC):
     """
     Contract tests for ICrawledPagesRepository implementations.
Apply similarly to SourcesRepositoryContract and CodeExamplesRepositoryContract.
163-222: Consider adding contract tests for search_similar and batch operations.

The contract tests cover basic CRUD but omit search_similar (a core feature for vector search) and batch insert operations. Adding these would ensure consistent behavior across all backends:
🔎 Example test for search_similar
@pytest.mark.asyncio
async def test_search_similar_returns_results(self, repo):
    """search_similar should find semantically similar examples."""
    embedding = [0.1] * 768  # Example embedding
    example = CodeExampleCreate(
        source_id="test_source",
        page_url="https://test.com/page",
        code="print('hello')",
        language="python",
        embedding=embedding,
        embedding_dimension=768,
    )
    await repo.insert(example)
    
    results = await repo.search_similar(embedding, match_count=5)
    
    assert len(results) >= 1
    assert results[0].similarity > 0.5
python/tests/unit/test_container.py (1)

66-94: Consider adding test for initialization failure handling.

The tests verify successful initialization and idempotent calls, but don't test error handling when initialization fails (e.g., invalid configuration). A test with a mock that raises during repository access would validate the error propagation.

coderabbitai · 2026-01-02T00:08:33Z

+Usage:
+    from server.domain import ICrawledPagesRepository, CrawledPage
+    from server.domain import ISourcesRepository, Source
+    from server.domain import ICodeExamplesRepository, CodeExample


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check the actual file content first echo "=== File content ===" cat -n python/src/server/domain/__init__.py | head -20 echo "" echo "=== Import patterns in Python files ===" rg -n "from (src\.)?server\.domain import" --type=py -A 1 | head -30 echo "" echo "=== Also check for absolute imports without src ===" rg -n "from server\.domain import" --type=py | head -20

Repository: coleam00/Archon

Length of output: 1738

🏁 Script executed:

#!/bin/bash # Search for actual import statements in Python files (not in docstrings) echo "=== Actual imports of domain module in Python code ===" rg "from server\.domain import|from \..*domain import|from archon\.server\.domain" --type=py -B 2 | grep -v "^--$" | head -40 echo "" echo "=== Check the project root __init__.py structure ===" find python/src -name "__init__.py" -type f | head -10 echo "" echo "=== Check if there's a src/__init__.py ===" cat python/src/__init__.py 2>/dev/null || echo "No src/__init__.py found" echo "" echo "=== Check server __init__.py ===" head -5 python/src/server/__init__.py 2>/dev/null || echo "No server/__init__.py"

Repository: coleam00/Archon

Length of output: 1244

🏁 Script executed:

#!/bin/bash # Search for actual usage of domain classes in code (not docstrings) echo "=== Files that import CrawledPage, Source, etc. ===" rg "CrawledPage|ISourcesRepository|CodeExample" --type=py -l | head -20 echo "" echo "=== Check import statements in those files ===" rg "from.*domain|import.*CrawledPage" --type=py -B 1 -A 1 | grep -v "^--$" | head -50 echo "" echo "=== Check pyproject.toml or setup.py for path config ===" find . -maxdepth 2 -name "pyproject.toml" -o -name "setup.py" -o -name "setup.cfg" | xargs cat 2>/dev/null | head -40 echo "" echo "=== Check conftest.py for path modifications ===" find . -name "conftest.py" -exec cat -n {} +

Repository: coleam00/Archon

Length of output: 19801

Minor: Update import paths in docstring to match actual usage.

The usage examples show from server.domain import ... but all actual code in the codebase uses from src.server.domain import .... Update the docstring to reflect the correct import paths.

🔎 Proposed fix

Usage: - from server.domain import ICrawledPagesRepository, CrawledPage - from server.domain import ISourcesRepository, Source - from server.domain import ICodeExamplesRepository, CodeExample + from src.server.domain import ICrawledPagesRepository, CrawledPage + from src.server.domain import ISourcesRepository, Source + from src.server.domain import ICodeExamplesRepository, CodeExample

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Usage:

from server.domain import ICrawledPagesRepository, CrawledPage

from server.domain import ISourcesRepository, Source

from server.domain import ICodeExamplesRepository, CodeExample

Usage:

from src.server.domain import ICrawledPagesRepository, CrawledPage

from src.server.domain import ISourcesRepository, Source

from src.server.domain import ICodeExamplesRepository, CodeExample

🤖 Prompt for AI Agents

In python/src/server/domain/__init__.py around lines 8 to 11, the docstring usage examples incorrectly show imports as "from server.domain import ..." but the project uses "from src.server.domain import ..."; update those example lines to use "from src.server.domain import ICrawledPagesRepository, CrawledPage", "from src.server.domain import ISourcesRepository, Source", and "from src.server.domain import ICodeExamplesRepository, CodeExample" so the docstring matches actual import paths used across the codebase.

coderabbitai · 2026-01-02T00:08:33Z

+    def __init__(self, pool: Pool, table_name: str = "archon_code_examples"):
+        self.pool = pool
+        self.table_name = table_name
+        self._logger = logger.bind(repository="PostgresCodeExamplesRepository")


⚠️ Potential issue | 🟠 Major

Same SQL injection concern as PostgresCrawledPagesRepository.

The table_name parameter is interpolated directly into SQL queries. Apply the same validation mitigation suggested for PostgresCrawledPagesRepository.

coderabbitai · 2026-01-02T00:08:33Z

+        host = host or os.getenv("POSTGRES_HOST", "localhost")
+        port = port or int(os.getenv("POSTGRES_PORT", "5432"))
+        database = database or os.getenv("POSTGRES_DB", "archon")
+        user = user or os.getenv("POSTGRES_USER", "postgres")
+        password = password or os.getenv("POSTGRES_PASSWORD", "")


⚠️ Potential issue | 🟠 Major

Empty password default may cause confusing connection failures.

Line 90 defaults password to an empty string when not provided. This could lead to authentication failures that are harder to debug than a missing configuration error.

As per coding guidelines: "Fail fast and loud for service startup failures, missing configuration."

🔎 Suggested fix: Require password or fail explicitly

# Get configuration from environment with fallbacks host = host or os.getenv("POSTGRES_HOST", "localhost") port = port or int(os.getenv("POSTGRES_PORT", "5432")) database = database or os.getenv("POSTGRES_DB", "archon") user = user or os.getenv("POSTGRES_USER", "postgres") - password = password or os.getenv("POSTGRES_PASSWORD", "") + password_env = os.getenv("POSTGRES_PASSWORD") + if password is None and password_env is None: + raise ValueError( + "PostgreSQL password must be provided via argument or POSTGRES_PASSWORD environment variable" + ) + password = password or password_env

Alternatively, if empty passwords are valid for local development:

- password = password or os.getenv("POSTGRES_PASSWORD", "") + password = password if password is not None else os.getenv("POSTGRES_PASSWORD", "") + if password is None: + logger.warning( + "PostgreSQL password not provided - connection may fail. " + "Set POSTGRES_PASSWORD environment variable." + ) + password = ""

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

host = host or os.getenv("POSTGRES_HOST", "localhost")

port = port or int(os.getenv("POSTGRES_PORT", "5432"))

database = database or os.getenv("POSTGRES_DB", "archon")

user = user or os.getenv("POSTGRES_USER", "postgres")

password = password or os.getenv("POSTGRES_PASSWORD", "")

host = host or os.getenv("POSTGRES_HOST", "localhost")

port = port or int(os.getenv("POSTGRES_PORT", "5432"))

database = database or os.getenv("POSTGRES_DB", "archon")

user = user or os.getenv("POSTGRES_USER", "postgres")

password_env = os.getenv("POSTGRES_PASSWORD")

if password is None and password_env is None:

raise ValueError(

"PostgreSQL password must be provided via argument or POSTGRES_PASSWORD environment variable"

)

password = password or password_env

🤖 Prompt for AI Agents

python/src/server/infrastructure/postgres/connection.py lines 86-90: the code currently defaults password to an empty string which can mask missing configuration and produce confusing auth failures; change the logic to treat a missing password as a configuration error by not defaulting to "" (use None or no default), read os.getenv("POSTGRES_PASSWORD") and if the result is None or empty when not explicitly allowed, raise a clear RuntimeError (or ValueError) with a message like "POSTGRES_PASSWORD is required" so startup fails fast; if you need to allow empty passwords for local dev, gate that behind a explicit flag (e.g., ALLOW_EMPTY_POSTGRES_PASSWORD=true) and only permit empty when that flag is set.

coderabbitai · 2026-01-02T00:08:33Z

+    def __init__(self, pool: Pool, table_name: str = "archon_crawled_pages"):
+        self.pool = pool
+        self.table_name = table_name
+        self._logger = logger.bind(repository="PostgresCrawledPagesRepository")


⚠️ Potential issue | 🟠 Major

SQL injection risk with f-string table name interpolation.

The table_name parameter is interpolated directly into SQL queries (e.g., line 81, 100, 119, etc.). While the default is safe, if a caller passes untrusted input, this becomes an injection vector.

Consider validating the table name format or using an allow-list:

🔎 Suggested mitigation

+import re + +VALID_TABLE_NAME = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$') + class PostgresCrawledPagesRepository(ICrawledPagesRepository): ... def __init__(self, pool: Pool, table_name: str = "archon_crawled_pages"): + if not VALID_TABLE_NAME.match(table_name): + raise ValueError(f"Invalid table name: {table_name}") self.pool = pool self.table_name = table_name

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2026-01-02T00:08:33Z

+    global _crawled_pages_repository
+
+    if _crawled_pages_repository is None:
+        storage_type = _get_storage_type()
+
+        if storage_type == "supabase":
+            from .supabase.crawled_pages_repository import SupabaseCrawledPagesRepository
+
+            client = _get_supabase_client()
+            _crawled_pages_repository = SupabaseCrawledPagesRepository(client)
+            logger.info("Created SupabaseCrawledPagesRepository")
+
+        elif storage_type == "postgres":
+            from .postgres import PostgresCrawledPagesRepository
+
+            pool = _get_postgres_pool()
+            _crawled_pages_repository = PostgresCrawledPagesRepository(pool)
+            logger.info("Created PostgresCrawledPagesRepository")
+
+        elif storage_type == "memory":
+            from .memory import InMemoryCrawledPagesRepository
+
+            _crawled_pages_repository = InMemoryCrawledPagesRepository()
+            logger.info("Created InMemoryCrawledPagesRepository")
+
+        else:
+            raise ValueError(f"Unsupported storage type: {storage_type}")
+
+    return _crawled_pages_repository


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the full repository_factory.py file to understand the context wc -l python/src/server/infrastructure/repository_factory.py

Repository: coleam00/Archon

Length of output: 117

🏁 Script executed:

# Read the full file cat -n python/src/server/infrastructure/repository_factory.py

Repository: coleam00/Archon

Length of output: 13150

🏁 Script executed:

# Search for how get_crawled_pages_repository is called rg "get_crawled_pages_repository" --type py -B 2 -A 2

Repository: coleam00/Archon

Length of output: 4595

🏁 Script executed:

# Check for threading/locking patterns in the codebase rg "Lock|lock|RLock|threading|asyncio|semaphore" python/src/server/infrastructure/ --type py | head -20

Repository: coleam00/Archon

Length of output: 1886

🏁 Script executed:

# Look for FastAPI app initialization and lifespan setup rg "lifespan|startup|on_event|@app" python/src/server/ --type py | head -30

Repository: coleam00/Archon

Length of output: 1973

🏁 Script executed:

# Check if other global singletons use locking rg "^_[a-z_]+_repository\s*=" python/src/server/infrastructure/repository_factory.py

Repository: coleam00/Archon

Length of output: 41

🏁 Script executed:

# Check the lifespan and container initialization in detail rg "async def lifespan" -A 40 python/src/server/main.py

Repository: coleam00/Archon

Length of output: 1563

🏁 Script executed:

# Check how container.initialize() uses the factory rg "async def initialize" -A 30 python/src/server/container.py

Repository: coleam00/Archon

Length of output: 1194

🏁 Script executed:

# Check if any code calls these getters from async contexts or tests rg "get_crawled_pages_repository|get_sources_repository|get_code_examples_repository" -B 3 -A 3 python/src/server/ --type py | head -60

Repository: coleam00/Archon

Length of output: 3964

🏁 Script executed:

# Check if there's any async context where getters might be called concurrently rg "asyncio|gather|create_task" python/src/server/ --type py | grep -E "(factory|container|repository)" | head -20

Repository: coleam00/Archon

Length of output: 41

🏁 Script executed:

# Look at how tests use these factories cat python/tests/unit/test_container.py | head -100

Repository: coleam00/Archon

Length of output: 2716

Synchronous getters lack thread-safety despite async-only usage pattern.

The lazy initialization pattern in get_crawled_pages_repository(), get_sources_repository(), and get_code_examples_repository() is not thread-safe. While the current architecture mitigates this risk—FastAPI's lifespan initializes repositories sequentially before accepting requests—adding a lock would prevent potential issues if these functions are ever called from actual threads. Note that the memory-based repositories already implement threading.Lock for thread-safety internally, suggesting this pattern is valued in the codebase.

🤖 Prompt for AI Agents

In python/src/server/infrastructure/repository_factory.py around lines 162 to 190, the lazy initialization in get_crawled_pages_repository (and similarly in get_sources_repository and get_code_examples_repository) is not protected by a lock; add a module-level threading.Lock (e.g., _repo_init_lock) and use it to guard the check-and-create sequence (acquire the lock, re-check if the global is None, create and assign the repository, then release via a context manager) so initialization becomes thread-safe without changing the existing initialization logic or semantics.

coderabbitai · 2026-01-02T00:08:34Z

+        # Initialize DI container (handles PostgreSQL pool if REPOSITORY_TYPE=postgres)
+        try:
+            await container.initialize()
+            api_logger.info(f"✅ Container initialized (storage: {container.storage_type})")
+        except Exception as e:
+            api_logger.warning(f"Could not initialize container: {e}")


⚠️ Potential issue | 🟠 Major

Container initialization failure should crash when explicitly configured.

Per coding guidelines, database connection failures should "crash with clear errors." Currently, if REPOSITORY_TYPE=postgres is set but the PostgreSQL pool fails to initialize, the app continues with only a warning. This could lead to silent failures where health checks pass but all repository operations fail.

Consider re-raising the exception when a non-default backend is explicitly configured:

🔎 Proposed fix

# Initialize DI container (handles PostgreSQL pool if REPOSITORY_TYPE=postgres) try: await container.initialize() api_logger.info(f"✅ Container initialized (storage: {container.storage_type})") except Exception as e: - api_logger.warning(f"Could not initialize container: {e}") + # Fail fast if user explicitly configured a non-default backend + if container.storage_type != "supabase": + api_logger.error(f"❌ Failed to initialize container (storage: {container.storage_type}): {e}") + raise + api_logger.warning(f"Could not initialize container: {e}")

🤖 Prompt for AI Agents

In python/src/server/main.py around lines 118-123, the try/except around container.initialize() currently only logs a warning on exception; change this so that when a non-default backend is explicitly configured (e.g. REPOSITORY_TYPE == "postgres" or container.storage_type == "postgres") the exception is not swallowed: log the error with full exception details and then re-raise the exception (or call sys.exit(1)) so the process crashes with a clear non-zero exit; otherwise (default in-memory backend) you may continue to warn and proceed.

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

docs/VALIDATION_MANIFEST.md (4)
125-128: Add language specifier to fenced code block.

Code blocks should declare their language for syntax highlighting consistency.
🔎 Proposed fix
-```
+```bash
 ========================== 102 passed in 1.08s =============================
-```
+```
188-190: Add language specifier to fenced code block.

Code blocks should declare their language for syntax highlighting consistency.
🔎 Proposed fix
-```
+```bash
 d8c45e8 feat(db): Add database abstraction layer with Repository Pattern
-```
+```
299-301: Convert emphasis to proper heading syntax.

Section headers should use markdown heading syntax (##, ###) rather than bold emphasis.
🔎 Proposed fix
 ---
 
-**Conclusion**
+## Conclusion
1-316: Consider translating or documenting language choice for the project.

The entire manifest is written in French, which may not align with the primary project documentation language. Ensure this choice is intentional and document the language strategy for the project's docs.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8c45e8 and b73a185.

📒 Files selected for processing (1)

docs/VALIDATION_MANIFEST.md

🧰 Additional context used

🪛 LanguageTool

docs/VALIDATION_MANIFEST.md

[grammar] ~6-~6: Ensure spelling is correct
Context: ...refactorisation Date: 2026-01-02 Projet: Database Abstraction Layer - Archon ...