feat(db-refactor): Complete Database Layer Abstraction with Traefik Network #915
jlacerte wants to merge 24 commits into coleam00:main from
Conversation
Provides project overview, common commands, architecture documentation, and development notes for Claude Code to understand and work with the Archon codebase effectively. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Phase 1 - Domain Layer:
- Add domain models: SitePage, SitePageMetadata, SearchResult
- Add interfaces: ISitePagesRepository (8 methods), IEmbeddingService (2 methods)
- Clean module exports via __init__.py

Phase 2 - Infrastructure Layer:
- Add SupabaseSitePagesRepository with mappers
- Add InMemorySitePagesRepository for testing
- Add OpenAIEmbeddingService wrapper

Tests:
- Unit tests for domain models and interfaces (37 tests)
- Unit tests for mappers and in-memory repository (20 tests)
- Integration test script for manual validation (10 scenarios)
- All 57 unit tests passing

Validation (Phase 2.5):
- All imports working correctly
- No circular dependencies
- All interface implementations verified
- Database schema coherence confirmed

Part of the database layer refactoring project.

Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Add archon/container.py with dependency injection pattern:
- Factory functions: get_repository(), get_embedding_service()
- Configuration: configure(repository_type, embedding_type)
- Singleton pattern with lazy initialization
- Support for Supabase (production) and Memory (tests)
- Support for OpenAI (production) and Mock (tests)
- Override functions for testing: override_repository(), override_embedding_service()
- Reset function for test isolation

Add tests/test_container.py with comprehensive test coverage:
- Configuration tests (4 tests)
- Singleton behavior tests (3 tests)
- Override functionality tests (3 tests)
- Error handling tests (2 tests)
- All 12 tests passing

Update docs/MIGRATION_MANIFEST.md:
- Mark P3-01 as VERIFIED
- Update Phase 3 progress: 1/15 blocks complete
- Update global progress: 49% (17/35 blocks)
- Add entry to the change log

Part of Phase 3 migration - Database Layer Refactoring. Breaking change: None (new functionality).

Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
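A minimal usage sketch of the container this commit describes; the function names come from the commit message, while the import paths and the exact call sites are assumptions:

```python
# Hypothetical usage of the DI container added in archon/container.py.
from archon.container import (
    configure,
    get_repository,
    get_embedding_service,
    override_repository,
    reset,
)

# Production: choose backends once, then resolve lazily created singletons.
configure(repository_type="supabase", embedding_type="openai")
repository = get_repository()         # created on first call
embeddings = get_embedding_service()  # same instance on every later call

# Tests: inject a fake, then reset for isolation between tests.
fake_repository = ...  # e.g. an InMemorySitePagesRepository instance
override_repository(fake_repository)
reset()
```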
Migration Strategy: Dual-mode with backward compatibility

Changes:
- agent_tools.py: Add optional repository/embedding_service parameters to all functions
  * get_embedding(): Add embedding_service param with fallback to legacy client
  * retrieve_relevant_documentation_tool(): Add repository param with fallback
  * list_documentation_pages_tool(): Add repository param with fallback
  * get_page_content_tool(): Add repository param with fallback
- All functions maintain backward compatibility with the Supabase client
- Prefer the new repository/service when provided; fall back to legacy otherwise

Testing:
- Created tests/test_agent_tools_migration.py with 15 comprehensive tests
- All tests pass (90 passed, 29 skipped)
- Tests validate both the new repository pattern and the legacy Supabase paths

Infrastructure Fix:
- Fixed a floating-point precision issue in InMemorySitePagesRepository
- Added similarity clipping to the [0.0, 1.0] range to avoid validation errors

Documentation:
- Updated MIGRATION_MANIFEST.md: P3-03 (a-h) marked as VERIFIED
- Progress: 51% complete (18/35 blocks verified)

Phase 3 Migration - Step 2/13 Complete

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
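A sketch of the dual-mode signature this strategy implies, shown for get_embedding(); the parameter names follow the commit message, while the interface method name and the legacy client call are assumptions:

```python
# Hypothetical dual-mode function: prefer the injected abstraction,
# fall back to the legacy global client when none is provided.
from typing import Any, Optional

from archon.domain import IEmbeddingService


async def get_embedding(
    text: str,
    embedding_client: Any = None,  # legacy AsyncOpenAI client (fallback path)
    embedding_service: Optional[IEmbeddingService] = None,  # new abstraction
) -> list[float]:
    if embedding_service is not None:
        # New path: backend-agnostic domain interface.
        return await embedding_service.get_embedding(text)
    # Legacy path, kept so existing Supabase-era callers keep working.
    response = await embedding_client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model name
        input=text,
    )
    return response.data[0].embedding
```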
…ttern (P3-04)

Migration completed with dual-mode strategy (backward compatible):

Functions migrated:
- get_embedding(): Add optional embedding_service parameter
- insert_chunk(): Add optional repository parameter, use repository.insert()
- clear_existing_records(): Add optional repository parameter, use repository.delete_by_source()
- process_chunk(): Pass embedding_service to get_embedding()
- process_and_store_document(): Pass repository and embedding_service
- crawl_parallel_with_requests(): Accept and propagate DI parameters
- main_with_requests(): Accept and propagate DI parameters
- start_crawl_with_requests(): Accept and propagate DI parameters

Key changes:
- All async repository/service methods called with await (not run_in_executor)
- Maintained backward compatibility with fallback to global clients
- SitePage.id set to None (assigned by the database)
- clear_existing_records() is now async so it can await repository.delete_by_source()

Tests:
- Created tests/test_crawl_migration.py with 6 validation tests
- All tests pass: 6/6 ✓
- Tests verify:
  - Functions accept optional DI parameters
  - Functions work with injected dependencies
  - Backward compatibility maintained
  - Signature correctness

Blocks verified:
- P3-04a: ✓ get_clients() injection via optional parameters
- P3-04b: ✓ supabase.table().insert() -> repository.insert()
- P3-04c: ✓ supabase.table().delete() -> repository.delete_by_source()

Dependencies installed:
- html2text==2025.4.15
- crawl4ai==0.7.7

Progress: 54% complete (19/35 blocks verified)

Part of Phase 3 migration (P3-04).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…05, P3-06)

Migrate the Streamlit pages database.py and documentation.py to the new repository pattern with a dual-mode strategy.

**Changes P3-05 - database.py:**
- Add optional parameter `repository: Optional[ISitePagesRepository]`
- Migrate count() to repository.count() via asyncio.run()
- Keep Supabase for the "delete all" admin operation (not covered by the repository)
- Dual mode: use the repository when provided, otherwise fall back to Supabase

**Changes P3-06 - documentation.py:**
- Add optional parameter `repository: Optional[ISitePagesRepository]`
- Migrate count(source) to repository.count(source="pydantic_ai_docs")
- Keep Supabase for the UI sample display (not a business operation)
- Dual mode: use the repository when provided, otherwise fall back to Supabase

**Tests:**
- Add tests/test_streamlit_migration.py (10 tests)
- Validate signatures, imports, type hints
- All tests pass: 106 passed, 29 skipped

**Documentation:**
- Update docs/MIGRATION_MANIFEST.md
  - P3-05: [v] VERIFIED
  - P3-06: [v] VERIFIED
- Global progress: 60% (21/35 blocks)

Part of Phase 3 migration. Breaking change: None (backward compatible).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
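A sketch of the sync-to-async bridge used here: Streamlit page code runs synchronously, so the async repository call is driven with asyncio.run(). Only repository.count() is taken from the commit message; the function and names are illustrative:

```python
# Hypothetical Streamlit-page helper bridging sync UI code to the
# async repository (repository.count() is assumed to be a coroutine).
import asyncio
from typing import Optional

from archon.domain import ISitePagesRepository


def get_row_count(repository: Optional[ISitePagesRepository] = None) -> int:
    if repository is not None:
        # Dual mode, new path: run the coroutine to completion synchronously.
        return asyncio.run(repository.count())
    # Dual mode, legacy path: direct Supabase query (elided in this sketch).
    raise NotImplementedError("legacy Supabase fallback elided")
```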
Detailed migration report covering:
- Streamlit pages database.py and documentation.py migration
- Strategy explanation (dual-mode with fallback)
- Async/sync adaptation with asyncio.run()
- UI-specific operations kept with Supabase
- Test results (106 passed)
- Problems encountered and solutions
- Next steps

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…epository pattern

CONTEXT: This commit completes the Phase 3 migration by refactoring all Pydantic AI agents to use the repository pattern via the dependency injection container.

CHANGES:

P3-07: archon/archon_graph.py (LangGraph orchestration)
- Remove: 'from supabase import Client'
- Add: 'from archon.container import get_repository, get_embedding_service'
- Replace: 'embedding_client, supabase = get_clients()' with 'repository = get_repository()' and 'embedding_service = get_embedding_service()'
- Update all PydanticAIDeps, ToolsRefinerDeps, AgentRefinerDeps instantiations to use repository/embedding_service instead of supabase/embedding_client
- Update the list_documentation_pages_tool call to pass the repository parameter

P3-08: archon/pydantic_ai_coder.py (Main coding agent)
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update PydanticAIDeps dataclass:
  - 'repository: ISitePagesRepository' (was 'supabase: Client')
  - 'embedding_service: IEmbeddingService' (was 'embedding_client: AsyncOpenAI')
- Update all 3 tools to use ctx.deps.repository and ctx.deps.embedding_service

P3-09: archon/advisor_agent.py (Simple cleanup)
- Remove unused import: 'from supabase import Client'

P3-10: archon/refiner_agents/tools_refiner_agent.py
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update ToolsRefinerDeps dataclass to use repository/embedding_service
- Update all 3 tools to use the new dependencies

P3-11: archon/refiner_agents/agent_refiner_agent.py
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update AgentRefinerDeps dataclass to use repository/embedding_service
- Update all 3 tools to use the new dependencies

P3-12: archon/refiner_agents/prompt_refiner_agent.py (Simple cleanup)
- Remove unused import: 'from supabase import Client'

TESTING:
- Added tests/test_agents_migration.py with 15 comprehensive tests
- All tests validate:
  * Correct imports of domain interfaces
  * Removal of Supabase Client imports
  * Deps dataclasses use interfaces
  * Tools use the new dependency names
  * Container integration in archon_graph.py

TEST RESULTS:
- 121 tests passed (was 106, added 15 new tests)
- 29 skipped (integration tests requiring Supabase)
- 0 failures
- All existing tests still pass (backward compatibility maintained)

ARCHITECTURE: The migration maintains backward compatibility because:
1. agent_tools.py already supports dual mode (legacy + new)
2. archon_graph.py is the single point of dependency injection
3. All agents receive their dependencies from the graph

NOTES:
- A pre-existing bug with the OpenAIModel.__init__ signature is not addressed (it was already broken and is unrelated to this migration)
- Migration manifest blocks P3-07 through P3-12 are now complete

Part of Phase 3 migration. Breaking change: None (fully backward compatible via dual mode in agent_tools.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
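The core of the P3-08 change, written out; the field and interface names are taken from the commit message, the rest is illustrative:

```python
# Before/after of the PydanticAIDeps dependency container (P3-08).
from dataclasses import dataclass

from archon.domain import IEmbeddingService, ISitePagesRepository


@dataclass
class PydanticAIDeps:
    # Was:
    #   supabase: Client                 (concrete Supabase client)
    #   embedding_client: AsyncOpenAI    (concrete OpenAI client)
    repository: ISitePagesRepository      # any backend behind the interface
    embedding_service: IEmbeddingService  # any embedding provider


# Inside a tool, dependencies are reached through the run context, e.g.:
#   embedding = await ctx.deps.embedding_service.get_embedding(query)
#   docs = await ctx.deps.repository.search(embedding)
```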
Phase 4 validation results:

- P4-01: Remove unused Supabase imports from agent_tools.py
  * Removed the 'from supabase import Client' import
  * Changed Client type annotations to Any (legacy fallback)
  * Only utils.py retains the Supabase import (justified - get_clients legacy)
- P4-02: Full test suite verification
  * 121 tests passing
  * 29 tests skipped (integration tests requiring Supabase)
  * All migration tests passing
- P4-03: Import validation
  * Domain imports: OK
  * Container imports: OK
  * Infrastructure imports: OK
- P4-04: Update MIGRATION_MANIFEST.md
  * Phase 4 blocks marked VERIFIED
  * Progress: 89% core (31/35), 84% global (31/37)

Files modified:
- archon/agent_tools.py: Remove Supabase import, update type hints
- docs/MIGRATION_MANIFEST.md: Mark P4-01 to P4-04 as VERIFIED

Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Extract individual client factories from get_clients() to break the dependency cycle with container.py and improve modularity.

Changes:
- Extract get_openai_client() from get_clients() (lines 383-400)
- Extract get_supabase_client() from get_clients() (lines 402-419)
- Refactor get_clients() to use the extracted factories (lines 421-433)
- Mark get_clients() as DEPRECATED in its docstring
- Update container.py to use get_openai_client() and get_supabase_client()

Benefits:
- The container no longer unpacks unused tuple values
- Individual factories can be used independently
- Maintains 100% backward compatibility (get_clients() behavior unchanged)
- Reduces coupling between utils.py and container.py

Tests: pytest tests/ → 121/121 passed

Part of Phase 3 migration (P3-02a-d). Breaking change: None.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
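A sketch of the resulting utils.py shape; the factory names and the deprecation are from the commit message, the bodies are elided:

```python
# Individual client factories extracted from get_clients() (illustrative).
def get_openai_client():
    """Build (or return a cached) OpenAI client."""
    ...


def get_supabase_client():
    """Build (or return a cached) Supabase client."""
    ...


def get_clients():
    """DEPRECATED: prefer the individual factories.

    Kept with unchanged behavior for backward compatibility; callers that
    need only one client no longer have to unpack an unused tuple value.
    """
    return get_openai_client(), get_supabase_client()
```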
FINAL BLOCK OF PHASE 3! 🎉
Created a business logic services layer between agents and repositories:
NEW FILES:
- archon/services/__init__.py - Services package initialization
- archon/services/documentation_service.py - DocumentationService class
- tests/test_services.py - 14 comprehensive tests
SERVICES LAYER:
DocumentationService encapsulates documentation operations:
- search_documentation(query, limit, source) - Semantic search
- get_page_content(url) - Concatenates all page chunks
- list_available_pages(source) - Lists available URLs
- get_page_metadata(url) - Retrieves page metadata
- count_pages(source) - Counts pages/chunks
CONTAINER INTEGRATION:
- Added get_documentation_service() factory to container.py
- Service auto-wires repository and embedding service dependencies
ARCHITECTURE:
Agents (pydantic_ai_coder, etc.)
↓
Services (DocumentationService)
↓
Repositories (ISitePagesRepository)
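A sketch of the service skeleton this layer implies; the method names are listed above under SERVICES LAYER, while the constructor wiring and the repository method names are assumptions:

```python
# Hypothetical skeleton of DocumentationService (illustrative signatures).
from archon.domain import IEmbeddingService, ISitePagesRepository


class DocumentationService:
    """Business-logic layer between agents and repositories."""

    def __init__(self, repository: ISitePagesRepository, embedding_service: IEmbeddingService):
        self._repository = repository
        self._embedding_service = embedding_service

    async def search_documentation(self, query: str, limit: int = 5, source: str | None = None):
        # Embed the query, then delegate the vector search to the repository.
        embedding = await self._embedding_service.get_embedding(query)
        return await self._repository.search(embedding, limit=limit, source=source)

    async def count_pages(self, source: str | None = None) -> int:
        return await self._repository.count(source=source)
```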
TESTING:
- 14/14 new service tests pass
- 135/135 total tests pass
- 29 skipped (Supabase integration - expected)
- Zero failures!
MIGRATION MANIFEST:
- P3-13 marked as VERIFIED
- Phase 3 COMPLETE: 13/13 blocks verified
- Core project: 100% (33/33 blocks)
- Global: 94% (33/35 blocks including optional)
Part of Phase 3 migration - Services Layer.
Breaking change: None (additive only).
🎉 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Comprehensive report documenting:
- P3-13 Services Layer implementation
- 100% Phase 3 completion (13/13 blocks)
- 135/135 tests passing
- Architecture improvements
- Zero breaking changes

Phase 3 is COMPLETE! 🎉
The pydantic-ai library changed its API between v0.0.22 and v1.0.x. The old pattern `OpenAIModel(model, base_url=..., api_key=...)` was replaced with `OpenAIModel(model, provider=OpenAIProvider(base_url=..., api_key=...))`.

This fix updates all 6 agent files to use the new API pattern:
- archon/pydantic_ai_coder.py
- archon/advisor_agent.py
- archon/archon_graph.py (2 model instances)
- archon/refiner_agents/prompt_refiner_agent.py
- archon/refiner_agents/tools_refiner_agent.py
- archon/refiner_agents/agent_refiner_agent.py

Includes bug report documentation in docs/BUG_REPORT_PYDANTIC_AI_API.md

Fixes TypeError: OpenAIChatModel.__init__() got unexpected keyword argument 'base_url'

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
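The same change as a side-by-side snippet (the model name and credentials are placeholders):

```python
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

base_url = "https://api.openai.com/v1"  # placeholder
api_key = "sk-..."                      # placeholder

# Old (pydantic-ai v0.0.22) - raises TypeError on v1.x:
# model = OpenAIModel("gpt-4o", base_url=base_url, api_key=api_key)

# New (pydantic-ai v1.x) - connection settings move into the provider:
model = OpenAIModel("gpt-4o", provider=OpenAIProvider(base_url=base_url, api_key=api_key))
```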
Create a specialized agent to implement alternative database backends:
- PostgreSQL direct (asyncpg + pgvector)
- SQLAlchemy (multi-DB portability)
- SQLite (local development)

Files added:
- .claude/agents/db-backend-agent.md - Agent definition and templates
- docs/CONTEXT_DB_BACKEND_AGENT.md - Full context for agent sessions

This agent leverages the completed Repository Pattern refactoring to easily add new storage backends.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough

Adds a new Domain layer (models + async interfaces), a Services layer, multiple infrastructure backends (Postgres asyncpg+pgvector, Supabase, in-memory, OpenAI embedding), a DI container with async repository support and test overrides, migration/schema utilities, agent and tool changes to use repository+embedding abstractions, tests, staging config/Docker, and extensive documentation and migration manifests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Agent as Agent / Tool
participant Container as DI Container
participant Repo as ISitePagesRepository
participant Embed as IEmbeddingService
participant DB as Backend (Postgres / Supabase)
Agent->>Container: get_repository_async()
activate Container
Container->>Repo: create/init backend instance (async)
note right of Container: lazy async init / singleton
Container-->>Agent: repository instance
deactivate Container
Agent->>Container: get_embedding_service()
activate Container
Container->>Embed: create/init embedding service
Container-->>Agent: embedding_service
deactivate Container
Agent->>Embed: get_embedding(text)
activate Embed
Embed->>DB: request embedding (OpenAI API or mock)
DB-->>Embed: embedding vector
Embed-->>Agent: List[float]
deactivate Embed
Agent->>Repo: search_similar(embedding, limit)
activate Repo
Repo->>DB: vector similarity query (pgvector / RPC)
DB-->>Repo: rows + similarity
Repo-->>Agent: List[SearchResult]
deactivate Repo
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (7)
python/src/core/domain/interfaces/embedding_service.py (1)
1-77: Consider documenting the exception contract more explicitly. The interface docstrings reference `EmbeddingError` (lines 35, 53), but this exception type is not defined or imported. Consider either:

- Defining a concrete `EmbeddingError` exception class in the domain layer for implementers to use
- Updating docstrings to specify a more general exception type (e.g., "Exception: If embedding generation fails")
This improves clarity for implementers about the expected exception contract.
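A minimal sketch of the first option, assuming a domain-level exceptions module (the class body and fields are illustrative):

```python
# Hypothetical domain exception backing the documented contract.
class EmbeddingError(Exception):
    """Raised by IEmbeddingService implementations when embedding generation fails."""

    def __init__(self, message: str, *, text_preview: str = "", provider: str = ""):
        super().__init__(message)
        # Carry context so callers can log what was being attempted.
        self.text_preview = text_preview  # first characters of the failing input
        self.provider = provider          # e.g. "openai", "mock"
```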
python/tests/core/test_container.py (1)
158-171: Consider using the public property instead of the private attribute. Line 171 accesses `service._dimension` (a private attribute). Using the public `dimension` property would reduce coupling to implementation details:

```diff
- assert service._dimension == 768
+ assert service.dimension == 768
```

This makes the test more resilient to internal refactoring.
python/src/core/domain/models/page_metadata.py (1)
109-121: Word count calculation uses a naive split. `len(markdown.split())` works for basic cases but counts markdown syntax elements (link URLs, code fences, heading markers) as words. This is acceptable for approximate statistics but worth noting.

python/src/core/infrastructure/memory/mock_embedding_service.py (1)
69-92: Unused variable `hash_bytes` on line 71. The initial `hash_bytes` computation is never used since the loop generates fresh hashes with iteration indices. This is harmless dead code.

🔎 Remove unused variable

```diff
 def _generate_embedding(self, text: str) -> list[float]:
     """
     Generate a deterministic embedding from text.

     Uses SHA-256 hash extended to fill the dimension.
     Values are normalized to [-1, 1] range.
     """
     # Create hash of text
     text_bytes = text.encode("utf-8")
-    hash_bytes = hashlib.sha256(text_bytes).digest()

     # Extend hash to fill dimension
     embedding = []
     iteration = 0
```

python/src/core/infrastructure/supabase/page_repository.py (3)
139-142: Consider chaining the original exception for better debugging. When raising `RuntimeError`, the original exception context is lost. Chain it for better traceability.

🔎 Proposed fix

```diff
 if response.data and len(response.data) > 0:
     return response.data[0].get("id", "")
-raise RuntimeError("Failed to insert chunk - no ID returned")
+raise RuntimeError("Failed to insert chunk - no ID returned") from None
```

Or if you want to preserve any underlying exception:

```python
try:
    response = self._client.table(self.CHUNKS_TABLE).insert(data).execute()
    if response.data and len(response.data) > 0:
        return response.data[0].get("id", "")
    raise RuntimeError("Failed to insert chunk - no ID returned")
except Exception as e:
    raise RuntimeError(f"Failed to insert chunk: {e}") from e
```
203-211: Optimize `count_chunks` to avoid fetching all columns. Selecting `"*"` retrieves all columns when only the count is needed. This is inefficient for tables with large content/embedding columns.

🔎 Proposed fix

```diff
 async def count_chunks(self, source_id: str | None = None) -> int:
     """Count chunks, optionally filtered by source."""
-    query = self._client.table(self.CHUNKS_TABLE).select("*", count="exact")
+    query = self._client.table(self.CHUNKS_TABLE).select("id", count="exact", head=True)
     if source_id:
         query = query.eq("source_id", source_id)
     response = query.execute()
     return response.count or 0
```

Using `head=True` avoids returning row data entirely when only the count is needed.
100-105: Clarify the `source_filter` and `filter` parameter logic. The current logic sets `filter` to `metadata_filter or {}` regardless of whether `source_filter` is provided. This appears correct, but the conditional structure is slightly confusing: the `filter` assignment is duplicated.

🔎 Simplify the parameter building

```diff
 # Build RPC parameters
 rpc_params: dict[str, Any] = {
     "query_embedding": embedding,
     "match_count": match_count,
+    "filter": metadata_filter or {},
 }

 # Add source filter
 if source_filter:
     rpc_params["source_filter"] = source_filter
-    rpc_params["filter"] = metadata_filter or {}
-else:
-    rpc_params["filter"] = metadata_filter or {}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- python/src/core/__init__.py
- python/src/core/container.py
- python/src/core/domain/__init__.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/domain/models/__init__.py
- python/src/core/domain/models/crawled_page.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/domain/models/search_result.py
- python/src/core/infrastructure/__init__.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/openai/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/infrastructure/supabase/page_repository.py
- python/src/server/services/search/rag_service.py
- python/src/server/utils/__init__.py
- python/tests/core/__init__.py
- python/tests/core/test_container.py
- python/tests/core/test_domain_models.py
- python/tests/core/test_repositories.py
🧰 Additional context used
📓 Path-based instructions (5)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/domain/models/search_result.py
- python/src/core/domain/models/crawled_page.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/server/utils/__init__.py
- python/src/core/domain/models/__init__.py
- python/src/server/services/search/rag_service.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/domain/__init__.py
- python/src/core/infrastructure/openai/__init__.py
- python/src/core/container.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/__init__.py
- python/src/core/infrastructure/__init__.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/infrastructure/supabase/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/domain/models/search_result.py
- python/src/core/domain/models/crawled_page.py
- python/tests/core/test_domain_models.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/server/utils/__init__.py
- python/src/core/domain/models/__init__.py
- python/src/server/services/search/rag_service.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/domain/__init__.py
- python/src/core/infrastructure/openai/__init__.py
- python/tests/core/test_container.py
- python/tests/core/test_repositories.py
- python/tests/core/__init__.py
- python/src/core/container.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/__init__.py
- python/src/core/infrastructure/__init__.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/infrastructure/supabase/page_repository.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
- python/tests/core/test_domain_models.py
- python/tests/core/test_container.py
- python/tests/core/test_repositories.py
- python/tests/core/__init__.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
- python/src/server/utils/__init__.py
- python/src/server/services/search/rag_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/server/services/**/*.py: For batch processing and background tasks, complete what you can and report detailed failures for each item rather than crashing
Never accept corrupted data - skip failed items entirely rather than storing corrupted data in batch operations
Files:
python/src/server/services/search/rag_service.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: zaksnet
Repo: coleam00/Archon PR: 473
File: archon-ui-main/src/components/agent-chat/ArchonChatPanel.tsx:26-31
Timestamp: 2025-09-10T11:35:20.655Z
Learning: When reviewing large refactoring PRs like "Clean multi provider feature", focus feedback on the core objectives (provider management, API keys, model configuration, service registry) rather than unrelated UI improvements in components that were only incidentally modified during the refactor.
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
Applied to files:
python/src/server/utils/__init__.py
🧬 Code graph analysis (14)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)

python/src/core/domain/models/crawled_page.py (1)
- python/src/core/domain/models/page_metadata.py (1)
  - to_insert_dict (62-83)

python/tests/core/test_domain_models.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)

python/src/core/infrastructure/supabase/__init__.py (1)
- python/src/core/infrastructure/supabase/page_repository.py (1)
  - SupabasePageRepository (38-393)

python/src/server/utils/__init__.py (1)
- python/src/core/container.py (1)
  - get_page_repository (53-98)

python/src/core/domain/models/__init__.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/server/services/search/rag_service.py (1)
- python/src/core/container.py (1)
  - get_page_repository (53-98)

python/tests/core/test_container.py (2)
- python/src/core/container.py (7)
  - reset (150-162), get_repository_type (43-50), configure (165-182), get_page_repository (53-98), override_repository (128-136), get_embedding_service (101-125), override_embedding_service (139-147)
- python/src/core/infrastructure/memory/mock_embedding_service.py (1)
  - MockEmbeddingService (13-92)

python/tests/core/test_repositories.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/container.py (4)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/infrastructure/memory/page_repository.py (1)
  - InMemoryPageRepository (18-335)
- python/src/core/infrastructure/memory/mock_embedding_service.py (1)
  - MockEmbeddingService (13-92)

python/src/core/domain/interfaces/__init__.py (2)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)

python/src/core/domain/interfaces/page_repository.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/domain/models/page_metadata.py (1)
- python/src/core/domain/models/crawled_page.py (1)
  - to_insert_dict (79-115)

python/src/core/infrastructure/supabase/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)
🔇 Additional comments (31)
python/tests/core/__init__.py (1)
1-1: LGTM! Standard test package initializer with an appropriate docstring.

python/src/core/infrastructure/__init__.py (1)

1-13: LGTM! Clean module documentation with an appropriate lazy-import strategy to avoid loading backend dependencies unnecessarily.
python/src/core/domain/models/search_result.py (3)
1-50: LGTM! Well-structured Pydantic model with clear field definitions, appropriate defaults, and good documentation. The `model_config` settings are correctly configured for Pydantic v2.

51-75: LGTM! The `from_rpc_result` factory correctly handles the integer-to-string ID conversion and safely handles `None` values.

100-110: LGTM! Good use of `model_copy` for immutability, returning a new instance with the updated rerank score.

python/src/server/utils/__init__.py (2)

23-26: LGTM! Clean integration of the repository pattern import from the core module. The relative import path correctly navigates to the core package.

91-92: LGTM! Properly exposed in `__all__` for backward-compatible public API access.

python/src/core/infrastructure/openai/__init__.py (1)

1-8: LGTM! Clean placeholder module for the future OpenAI embedding service implementation, following the established lazy-import pattern.
python/src/core/infrastructure/memory/__init__.py (1)
1-14: LGTM! Well-structured module with clear documentation. The eager imports are appropriate for lightweight test implementations that have no external dependencies.
python/src/core/domain/interfaces/__init__.py (1)
1-15: LGTM! Clean interface aggregation module that properly re-exports domain contracts. The docstring accurately describes the purpose of enabling dependency inversion and testability.
python/src/core/infrastructure/supabase/__init__.py (1)
1-10: LGTM! Clean module initialization with clear documentation and proper public API exposure.
python/tests/core/test_domain_models.py (1)
1-309: LGTM! Excellent test coverage. The test suite comprehensively validates domain models, including:
- Field initialization and defaults
- Multi-dimensional embedding support
- Dictionary conversion for database insertion
- RPC result mapping with type normalization (e.g., integer ID to string)
- Edge cases (empty content, boundary similarity scores)
The test structure is clear and follows pytest best practices.
python/src/core/domain/models/__init__.py (1)
1-16: LGTM! Clean public API surface for domain models with clear documentation emphasizing separation from infrastructure concerns.
python/tests/core/test_container.py (1)
1-206: LGTM! Comprehensive container tests. The test suite thoroughly validates:
- Environment-driven repository selection
- Singleton behavior and reset semantics
- Override mechanisms for testing
- Error handling for invalid configurations
- Supabase credential requirements
Test organization is clear and follows good pytest practices.
python/src/core/domain/__init__.py (1)
1-19: LGTM! Clear domain layer public API with well-documented separation of concerns. The explicit note about no infrastructure dependencies reinforces clean architecture principles.
python/tests/core/test_repositories.py (1)
1-586: LGTM! Excellent repository test coverage. This comprehensive test suite validates all repository operations:
- Chunk operations: Insert, batch insert, retrieve, delete (by URL/source), count
- Page operations: Upsert (insert/update), batch upsert, retrieve (by URL/ID), list with filters, delete
- Vector search: Exact match, ranking, source filtering, match limits, edge cases (no embeddings, empty repo)
- Utilities: Unique URLs listing, health check, clear/reset
The tests effectively exercise the InMemoryPageRepository implementation and demonstrate the expected behavior for the IPageRepository interface.
python/src/core/__init__.py (1)
1-51: LGTM! Excellent public API design. The core module provides a clean, unified interface with:
- Clear architectural documentation explaining the layers (domain, interfaces, infrastructure, container)
- Proper separation of concerns between models, interfaces, and DI functions
- Helpful usage examples in the docstring
This design enables easy switching between backends while maintaining a stable public API.
python/src/server/services/search/rag_service.py (2)
39-50: LGTM! Clean backward-compatible repository integration. The constructor properly initializes the repository with graceful fallback to direct Supabase queries. The try/except pattern ensures existing deployments without repository configuration continue to work.

220-275: Solid dual-path lookup with proper fallback logic. The repository-first approach with Supabase fallback maintains backward compatibility while enabling the new abstraction. Exception handling at lines 230-232 appropriately logs and falls back rather than failing the entire operation.

One minor observation: the repository path (lines 264-275) and the Supabase path (lines 253-263) produce slightly different result structures. The repository path doesn't check for `None` on `page_metadata.word_count` like the Supabase path does with `.get("word_count", 0)`. Since `PageMetadata.word_count` defaults to `0` in the model definition, this should be fine.

python/src/core/domain/models/page_metadata.py (2)

14-61: Well-structured domain model following Pydantic v2 conventions. The model correctly uses `ConfigDict` with `from_attributes=True` for ORM compatibility and `populate_by_name=True` for flexible field mapping. Field definitions with proper defaults and descriptions are clean.

62-83: Consider including `id` in the insert dict for upsert scenarios. The `to_insert_dict()` method excludes `id`, which is correct for inserts. However, if this is used with upsert operations where the record already exists, the `id` might be needed. The Supabase implementation uses `on_conflict="url"`, which handles this, but verify this works for all use cases.

python/src/core/infrastructure/memory/mock_embedding_service.py (1)

40-50: LGTM! Clean implementation of the IEmbeddingService interface. The async methods correctly implement the interface contract with deterministic, reproducible embeddings suitable for unit testing.
python/src/core/domain/models/crawled_page.py (2)
62-77: LGTM! The multi-dimensional embedding selection logic is correct. The method properly checks both the dimension configuration AND the presence of the corresponding embedding field before returning it, and falls back to the generic `embedding` field when no dimension-specific match exists.

79-115: Well-implemented insertion helper with flexible embedding-column support. The method correctly builds the insert payload, conditionally including optional fields and routing embeddings to the appropriate column based on the `embedding_column` parameter.

python/src/core/infrastructure/memory/page_repository.py (3)
51-109: Solid vector search implementation with proper edge-case handling. The cosine similarity computation correctly handles zero-norm vectors (lines 62-63, 86-87) and clips results to [0, 1] for consistency. The filtering and sorting logic is clean.
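For reference, a minimal self-contained version of the pattern being praised (zero-norm guard plus clipping); this is illustrative, not the repository's actual code:

```python
# Cosine similarity with a zero-norm guard and [0, 1] clipping.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero-norm vector is treated as completely dissimilar
    dot = sum(x * y for x, y in zip(a, b))
    # Raw cosine lies in [-1, 1]; clip so similarity stays in [0, 1] and
    # floating-point drift cannot push it past the bounds.
    return max(0.0, min(1.0, dot / (norm_a * norm_b)))
```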
143-168: The safe deletion pattern avoids dict mutation during iteration. Building the `to_delete` list before iterating deletions is the correct approach to avoid a `RuntimeError` from modifying a dict during iteration.

204-228: Correct upsert semantics with URL-based deduplication. The logic properly checks for existing pages by URL, preserves `created_at` on updates, and maintains the `_url_to_page_id` index. The timestamp handling is clean.

python/src/core/container.py (2)

53-98: Clean lazy initialization with environment-based selection. The singleton pattern with test-override precedence is well-structured, and lazy imports avoid circular dependency issues.

Minor consideration: in multi-threaded scenarios, concurrent calls to `get_page_repository()` before singleton initialization could create multiple instances. For async Python servers this is typically not an issue, since initialization usually happens at startup. If thread safety becomes a concern, consider a lock.

101-125: MockEmbeddingService as the default is intentional for safety. The docstring correctly documents that production systems should use `override_embedding_service()` with a real provider. This prevents accidental API calls in tests/dev.

python/src/core/domain/interfaces/page_repository.py (1)

15-263: Excellent interface design with clear contracts. The `IPageRepository` interface provides a comprehensive, well-documented contract for all page and chunk operations. The separation of chunk operations (lines 27-134) and page operations (lines 136-236) is logical and follows the domain model structure. All methods are properly async, return types are explicit, and docstrings clearly specify arguments, return values, and semantics (e.g., upsert behavior).
python/src/core/infrastructure/supabase/page_repository.py (1)
81-127: The overall search implementation is sound. The RPC-based vector search with similarity-threshold filtering and proper error handling (returning empty results on failure) provides resilience. The `SearchResult.from_rpc_result()` integration is clean.
Actionable comments posted: 3
♻️ Duplicate comments (1)
python/src/core/infrastructure/supabase/page_repository.py (1)
122-125: Replace `print()` with proper logging. This is a duplicate of a previous review comment. Using `print()` for error logging bypasses structured logging. Per coding guidelines, errors should be logged with context and stack traces using `logger.error()` with `exc_info=True`.
🧹 Nitpick comments (2)
python/src/core/infrastructure/memory/page_repository.py (1)
298-302: Consider explicit behavior for a missing page_id. The method silently skips updates when `page_id` is not found. While this may be intentional for a test implementation, consider whether callers might benefit from knowing when an update is skipped (e.g., logging a warning or raising a KeyError).

python/src/core/infrastructure/supabase/page_repository.py (1)

265-267: Consider adding error handling and logging. This method silently fails if the update doesn't succeed. Per coding guidelines, errors should be logged with context even when not crashing. Consider logging if the update fails or returns no matching rows.
🔎 Proposed enhancement
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 async def update_chunk_count(self, page_id: str, chunk_count: int) -> None:
     """Update the chunk_count for a page."""
-    self._client.table(self.PAGES_TABLE).update({"chunk_count": chunk_count}).eq("id", page_id).execute()
+    try:
+        response = (
+            self._client.table(self.PAGES_TABLE)
+            .update({"chunk_count": chunk_count})
+            .eq("id", page_id)
+            .execute()
+        )
+        if not response.data:
+            logger.warning(
+                f"update_chunk_count: No matching page found. page_id: {page_id}"
+            )
+    except Exception as e:
+        logger.error(
+            f"Failed to update chunk_count for page_id: {page_id}",
+            exc_info=True
+        )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- python/src/core/container.py
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
🚧 Files skipped from review as they are similar to previous changes (2)
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/container.py
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: zaksnet
Repo: coleam00/Archon PR: 473
File: archon-ui-main/src/components/agent-chat/ArchonChatPanel.tsx:26-31
Timestamp: 2025-09-10T11:35:20.655Z
Learning: When reviewing large refactoring PRs like "Clean multi provider feature", focus feedback on the core objectives (provider management, API keys, model configuration, service registry) rather than unrelated UI improvements in components that were only incidentally modified during the refactor.
🧬 Code graph analysis (2)
python/src/core/infrastructure/memory/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (2)
  - CrawledPage (14-115), get_active_embedding (62-77)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/infrastructure/supabase/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)
🔇 Additional comments (8)
python/src/core/infrastructure/memory/page_repository.py (8)
1-16: Clean module structure and imports. The imports are well-organized and the module docstring clearly explains the purpose of this in-memory implementation for testing.

18-45: LGTM - Clean initialization and test utilities. The three-dictionary structure (_chunks, _pages, _url_to_page_id) provides efficient lookups by different keys, and the clear() helper method is useful for test isolation.

51-109: Solid vector similarity implementation. The cosine similarity calculation correctly handles edge cases (zero-norm vectors), and the clipping to [0, 1] is appropriate for treating dissimilar vectors as having zero similarity in search contexts.

111-141: Correct chunk insertion implementation. The `embedding_column` parameter in `insert_chunks_batch` is appropriately accepted for interface compliance even though the in-memory implementation doesn't need column-specific storage. The use of `get_active_embedding()` ensures the correct embedding is stored.

143-188: LGTM - Chunk operations implemented correctly. The delete operations properly avoid modifying dictionaries during iteration, and `get_chunk_by_id` correctly reconstructs the `CrawledPage` model with proper handling of optional fields.

194-226: Excellent upsert implementation. The upsert logic correctly handles both insert and update cases, properly preserving `created_at` timestamps on updates while updating `updated_at`. The URL-to-ID mapping maintenance ensures efficient lookups.

228-283: LGTM - Page retrieval and listing logic is sound. The delegation pattern in `get_page_by_url` is clean, and all methods properly reconstruct `PageMetadata` instances with appropriate defaults for optional fields.

308-321: LGTM - Utility operations are appropriate. The `list_unique_urls` method correctly iterates chunks (as documented) to gather URLs, and `health_check` appropriately returns `True` for an in-memory implementation with no external dependencies.
```python
async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
    """Get a chunk by ID."""
    response = self._client.table(self.CHUNKS_TABLE).select("*").eq("id", chunk_id).single().execute()

    if not response.data:
        return None

    return self._row_to_crawled_page(response.data)
```
🧩 Analysis chain
🌐 Web query:
Supabase Python client single() method exception behavior
💡 Result:
Short answer: supabase-py .single() requires the query to return exactly one row. If the result is zero or >1 rows, PostgREST returns an error (PGRST116) and the client surfaces that error in the response (data will be None and error populated). Use .limit(1).single() when you expect exactly one row, or use .maybe_single() if you want "zero allowed" (it returns null data rather than the PGRST116 error). [1][2][3]
Sources:
- Supabase Python .single() docs. [1]
- Supabase Python .maybe_single() docs. [2]
- PostgREST / Supabase issue showing PGRST116 ("JSON object requested, multiple (or no) rows returned"). [3]
🏁 Script executed:
```shell
# First, check the file size and content around the specified lines
wc -l python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 121

🏁 Script executed:

```shell
# Read the specific lines mentioned in the review
sed -n '165,180p' python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 657

🏁 Script executed:

```shell
# Check lines 223 and 232 as well
sed -n '220,235p' python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 643

🏁 Script executed:

```shell
# Search for all uses of .single() in the file
rg '\.single\(\)' python/src/core/infrastructure/supabase/page_repository.py -n -A 3
```

Repository: coleam00/Archon
Length of output: 591
Check response.error and log failures instead of silently returning None.
The Supabase .single() method does not raise exceptions. Instead, when zero or multiple records are returned, PostgREST returns an error (PGRST116) in the response object with `data=None` and `error` populated. The current code at lines 170, 223, and 232 only checks `if not response.data: return None`, which silently treats all error cases identically without logging.
Log the error with context (attempted operation, ID being queried, error details) before returning None, following the guideline to log detailed errors for each failure instead of silently crashing or ignoring errors.
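A sketch of that pattern, assuming the supabase-py response exposes `.data` and `.error` as described in the web-query result above:

```python
# Illustrative error-aware variant of get_chunk_by_id (method body sketch;
# attribute names follow the supabase-py response shape described above).
import logging

logger = logging.getLogger(__name__)


async def get_chunk_by_id(self, chunk_id: str):
    response = self._client.table(self.CHUNKS_TABLE).select("*").eq("id", chunk_id).single().execute()

    error = getattr(response, "error", None)
    if error is not None:
        # e.g. PGRST116: log the operation, the ID, and the error details.
        logger.error("get_chunk_by_id failed for chunk_id=%s: %s", chunk_id, error)
        return None
    if not response.data:
        return None
    return self._row_to_crawled_page(response.data)
```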
🤖 Prompt for AI Agents
In python/src/core/infrastructure/supabase/page_repository.py around lines 168
to 175, the method get_chunk_by_id checks only response.data and returns None
silently on failure; update it to check response.error and, when present, log a
clear message containing the attempted operation ("get_chunk_by_id"), the
chunk_id being queried, and the response.error details before returning None;
apply the same pattern for the other similar checks noted (around lines 223 and
232) so failures are recorded with context instead of being silently ignored.
Actionable comments posted: 0
🧹 Nitpick comments (4)
python/src/core/container.py (3)
34-95: Consider adding thread synchronization for singleton initialization. The singleton pattern using global variables without locking can lead to race conditions in multi-threaded environments (e.g., FastAPI/Flask with multiple workers). If two threads call `get_page_repository()` simultaneously before `_repository_instance` is set, both might proceed past the `if _repository_instance is not None` check (line 74) and create multiple instances.

🔎 Proposed fix using threading.Lock

```diff
 import os
+import threading
 from typing import TYPE_CHECKING

 if TYPE_CHECKING:
     from .domain.interfaces import IEmbeddingService, IPageRepository

 # Singleton instances
 _repository_instance: "IPageRepository | None" = None
 _embedding_service_instance: "IEmbeddingService | None" = None

 # Override instances for testing
 _repository_override: "IPageRepository | None" = None
 _embedding_service_override: "IEmbeddingService | None" = None

+# Lock for thread-safe singleton initialization
+_lock = threading.Lock()

 def get_repository_type() -> str:
     ...

 def get_page_repository() -> "IPageRepository":
     """
     Get the page repository instance.
     ...
     """
     global _repository_instance

     # Check for test override first
     if _repository_override is not None:
         return _repository_override

     # Return cached instance if available
     if _repository_instance is not None:
         return _repository_instance

-    # Create new instance based on type
-    repo_type = get_repository_type()
-
-    if repo_type == "supabase":
-        from .infrastructure.supabase.page_repository import SupabasePageRepository
-
-        _repository_instance = SupabasePageRepository()
-    elif repo_type == "postgres":
-        from .infrastructure.postgres.page_repository import PostgresPageRepository
-
-        _repository_instance = PostgresPageRepository()
-    elif repo_type == "memory":
-        from .infrastructure.memory.page_repository import InMemoryPageRepository
-
-        _repository_instance = InMemoryPageRepository()
-    else:
-        raise ValueError(f"Unknown repository type: {repo_type}. Valid options: 'supabase', 'postgres', 'memory'")
-
-    return _repository_instance
+    # Double-checked locking pattern
+    with _lock:
+        # Check again after acquiring lock
+        if _repository_instance is not None:
+            return _repository_instance
+
+        # Create new instance based on type
+        repo_type = get_repository_type()
+
+        if repo_type == "supabase":
+            from .infrastructure.supabase.page_repository import SupabasePageRepository
+
+            _repository_instance = SupabasePageRepository()
+        elif repo_type == "postgres":
+            from .infrastructure.postgres.page_repository import PostgresPageRepository
+
+            _repository_instance = PostgresPageRepository()
+        elif repo_type == "memory":
+            from .infrastructure.memory.page_repository import InMemoryPageRepository
+
+            _repository_instance = InMemoryPageRepository()
+        else:
+            raise ValueError(f"Unknown repository type: {repo_type}. Valid options: 'supabase', 'postgres', 'memory'")
+
+        return _repository_instance
```

Apply similar locking to `get_embedding_service()` (lines 98-122).
166-183: Consider validating `repository_type` in `configure()` for earlier error detection. Currently, `configure()` accepts any string for `repository_type` without validation (line 178). Invalid values won't be caught until `get_page_repository()` is called later, potentially making debugging harder. Consider adding validation to fail fast with a clear error message at configuration time.

🔎 Proposed validation

```diff
+_VALID_REPOSITORY_TYPES = {"supabase", "postgres", "memory"}
+
 def configure(
     repository_type: str | None = None,
     force_reinit: bool = False,
 ) -> None:
     """
     Configure the container.

     Args:
         repository_type: Override repository type (sets env var)
         force_reinit: Force reinitialization of instances
+
+    Raises:
+        ValueError: If repository_type is invalid
     """
     if repository_type:
+        if repository_type.lower() not in _VALID_REPOSITORY_TYPES:
+            raise ValueError(
+                f"Invalid repository type: {repository_type}. "
+                f"Valid options: {', '.join(sorted(_VALID_REPOSITORY_TYPES))}"
+            )
         os.environ["REPOSITORY_TYPE"] = repository_type

     if force_reinit:
         global _repository_instance, _embedding_service_instance
         _repository_instance = None
         _embedding_service_instance = None
```
98-122: Consider environment-based configuration for the embedding service (consistency with the repository pattern). The embedding service is hardcoded to `MockEmbeddingService` (line 121) and requires manual `override_embedding_service()` calls for production use, while the repository supports environment-based configuration via `REPOSITORY_TYPE`. This creates an inconsistency in the DI container's API design.

For consistency, consider adding an `EMBEDDING_SERVICE_TYPE` environment variable (e.g., "mock", "openai", "local") similar to the repository pattern, allowing declarative configuration rather than requiring programmatic overrides in production code.
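A sketch of what that suggestion could look like, assuming an `EMBEDDING_SERVICE_TYPE` variable and the container's lazy-import style; the OpenAI service module is a placeholder (the PR's openai infrastructure package is still empty), so every name below is illustrative:

```python
# Hypothetical env-driven embedding service factory, mirroring the
# repository selection pattern described in this review.
import os


def get_embedding_service():
    service_type = os.getenv("EMBEDDING_SERVICE_TYPE", "mock").lower()
    if service_type == "mock":
        from .infrastructure.memory.mock_embedding_service import MockEmbeddingService

        return MockEmbeddingService()
    if service_type == "openai":
        # Assumed module: the openai infrastructure package is currently a placeholder.
        from .infrastructure.openai.embedding_service import OpenAIEmbeddingService

        return OpenAIEmbeddingService()
    raise ValueError(f"Unknown embedding service type: {service_type}. Valid options: 'mock', 'openai'")
```

python/src/core/domain/models/search_result.py (1)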
37-49: Consider adding validation for the similarity range. The `similarity` field documentation indicates it should be in the range 0-1, but there's no Field validator enforcing this constraint. While the factory methods may always provide valid values, adding a validator would make the model more robust.

🔎 Optional: Add a Field validator for the similarity range

```diff
- similarity: float = Field(default=0.0, description="Vector similarity score (0-1)")
+ similarity: float = Field(default=0.0, ge=0.0, le=1.0, description="Vector similarity score (0-1)")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- python/src/core/container.py
- python/src/core/domain/models/search_result.py
- python/tests/core/test_domain_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/tests/core/test_domain_models.py
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/domain/models/search_result.py
- python/src/core/container.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/domain/models/search_result.py
python/src/core/container.py
🔇 Additional comments (5)
python/src/core/container.py (1)
147-163: Good fix: reset() now clears environment variable.

The addition of `os.environ.pop("REPOSITORY_TYPE", None)` on line 163 properly addresses the test pollution issue raised in the previous review. This ensures that tests calling configure() won't leave environment state that affects subsequent tests.

python/src/core/domain/models/search_result.py (4)

1-10: LGTM! Clean imports and clear documentation.

The module docstring clearly explains the purpose, and imports are minimal and appropriate for a Pydantic model.

51-75: LGTM! Consistent None handling and proper type conversions.

The method correctly handles None values for the id field and provides sensible defaults for missing fields. The conversion from numeric id to string is well-documented.

77-102: LGTM! The None handling issue has been resolved.

The previously flagged inconsistency in None handling for the id field has been fixed. The method now correctly preserves None for missing IDs (lines 90-91), consistent with from_rpc_result. The implementation correctly mirrors the RPC factory method while accepting similarity as an explicit parameter.

104-114: LGTM! Proper use of Pydantic's immutable update pattern.

The method correctly uses model_copy with the update parameter (Pydantic v2 API) to return a new instance with the updated rerank_score, following immutable design principles.
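For reference, the immutable-update idiom being praised looks roughly like this (a sketch, not the model's full field list):

```python
from pydantic import BaseModel


class SearchResult(BaseModel):
    content: str
    rerank_score: float | None = None

    def with_rerank_score(self, score: float) -> "SearchResult":
        # model_copy(update=...) returns a new instance; the original is untouched
        return self.model_copy(update={"rerank_score": score})


original = SearchResult(content="hello")
updated = original.with_rerank_score(0.95)
assert original.rerank_score is None and updated.rerank_score == 0.95
```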
Actionable comments posted: 1
🧹 Nitpick comments (5)
python/tests/core/test_domain_models.py (3)
12-110: Consider adding test coverage for get_active_embedding() method.

The CrawledPage model has a get_active_embedding() method (per the relevant snippet) that returns the appropriate embedding based on embedding_dimension. This behavior isn't directly tested, though it's implicitly used by to_insert_dict.

🔎 Example test for get_active_embedding

```python
def test_get_active_embedding_returns_correct_dimension(self):
    """Test get_active_embedding returns embedding matching dimension."""
    chunk = CrawledPage(
        source_id="src-123",
        url="https://example.com/page",
        chunk_number=0,
        content="Content",
        metadata={},
        embedding_768=[0.1] * 768,
        embedding_1536=[0.3] * 1536,
        embedding_dimension=768,
    )
    active = chunk.get_active_embedding()
    assert active is not None
    assert len(active) == 768
    assert active == chunk.embedding_768
```
112-170: Consider adding test for from_crawl_result() factory method.

The PageMetadata model has a from_crawl_result() classmethod that constructs a PageMetadata from crawl results, computing word_count, char_count, and building the metadata dict. This factory method isn't tested.

🔎 Example test for from_crawl_result

```python
def test_from_crawl_result(self):
    """Test creating PageMetadata from crawl result."""
    page = PageMetadata.from_crawl_result(
        url="https://example.com/page",
        markdown="This is some markdown content.",
        source_id="src-123",
        knowledge_type="documentation",
        tags=["python", "api"],
        crawl_type="url",
    )
    assert page.url == "https://example.com/page"
    assert page.source_id == "src-123"
    assert page.full_content == "This is some markdown content."
    assert page.word_count > 0
    assert page.char_count > 0
    assert page.metadata["knowledge_type"] == "documentation"
    assert page.metadata["crawl_type"] == "url"
    assert page.metadata["tags"] == ["python", "api"]
```
172-293: Consider adding test for with_rerank_score() method.

The SearchResult model has a with_rerank_score() method that returns a copy with the rerank score set. This method isn't tested.

🔎 Example test for with_rerank_score

```python
def test_with_rerank_score(self):
    """Test creating a copy with rerank score."""
    original = SearchResult(
        id="result-123",
        content="Content",
        url="https://example.com",
        similarity=0.9,
    )
    updated = original.with_rerank_score(0.95)

    # Original should be unchanged
    assert original.rerank_score is None
    # New instance should have the score
    assert updated.rerank_score == 0.95
    # Other fields should be preserved
    assert updated.id == "result-123"
    assert updated.similarity == 0.9
```

python/tests/core/test_container.py (2)
195-201: Simplify environment variable clearing.

The pattern of setting environment variables to empty strings in patch.dict and then immediately popping them is unnecessarily convoluted. You can simplify this by removing the variables directly within the patch.dict context.

🔎 Proposed simplification

```diff
 # Clear Supabase env vars
-with patch.dict(
-    os.environ,
-    {"SUPABASE_URL": "", "SUPABASE_SERVICE_KEY": ""},
-    clear=False,
-):
+with patch.dict(os.environ, clear=False):
     os.environ.pop("SUPABASE_URL", None)
     os.environ.pop("SUPABASE_SERVICE_KEY", None)

     with pytest.raises(ValueError) as exc_info:
```
187-206: Consider adding positive test case for Supabase configuration.

The current test verifies that missing Supabase credentials raise an error. Consider adding a companion test that verifies successful Supabase repository instantiation when valid credentials are provided. This would provide more complete coverage of the Supabase configuration path.

Example test structure:

```python
def test_supabase_with_credentials_succeeds(self):
    """Test that Supabase with credentials succeeds."""
    container.configure(repository_type="supabase")

    with patch.dict(
        os.environ,
        {
            "SUPABASE_URL": "https://example.supabase.co",
            "SUPABASE_SERVICE_KEY": "test-key-123",
        },
    ):
        repo = container.get_page_repository()

        from src.core.infrastructure.supabase.page_repository import (
            SupabasePageRepository,
        )

        assert isinstance(repo, SupabasePageRepository)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
python/src/core/domain/interfaces/__init__.py
python/src/core/domain/interfaces/embedding_service.py
python/tests/core/test_container.py
python/tests/core/test_domain_models.py
python/tests/core/test_repositories.py
✅ Files skipped from review due to trivial changes (1)
- python/tests/core/test_repositories.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/core/domain/interfaces/embedding_service.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/domain/interfaces/__init__.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/domain/interfaces/__init__.py
python/tests/core/test_domain_models.py
python/tests/core/test_container.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/test_domain_models.py
python/tests/core/test_container.py
🧬 Code graph analysis (3)
python/src/core/domain/interfaces/__init__.py (2)
python/src/core/domain/interfaces/embedding_service.py (2)
- EmbeddingError (11-14)
- IEmbeddingService (17-83)
python/src/core/domain/interfaces/page_repository.py (1)
- IPageRepository (15-263)
python/tests/core/test_domain_models.py (3)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
python/src/core/domain/models/search_result.py (3)
- SearchResult (13-114)
- from_rpc_result (52-75)
- from_postgres_row (78-102)
python/tests/core/test_container.py (3)
python/src/core/infrastructure/memory/page_repository.py (2)
- clear (41-45)
- InMemoryPageRepository (18-321)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- MockEmbeddingService (13-91)
python/src/core/domain/interfaces/embedding_service.py (1)
- dimension (65-72)
🔇 Additional comments (7)
python/tests/core/test_domain_models.py (2)
1-9: LGTM!

Imports and module docstring are correct. The models are properly imported from the domain models package.

295-334: LGTM!

Edge case tests cover important boundary conditions: empty content and similarity bounds (0.0 and 1.0). These are valuable for ensuring model robustness.

python/src/core/domain/interfaces/__init__.py (1)

1-16: LGTM!

Clean package initialization that properly re-exports the domain interfaces and exceptions. The __all__ list correctly matches the imports, and the docstring clearly explains the architectural purpose of these abstractions for dependency inversion and testability.

python/tests/core/test_container.py (4)

68-102: LGTM!

The repository instantiation tests properly verify memory repository creation, singleton behavior, and reset functionality. The assertions correctly validate instance types and object identity.

104-136: LGTM!

The repository override tests correctly verify that custom repositories can be injected and that the reset mechanism properly clears overrides. Identity assertions appropriately use the `is` operator.

138-172: LGTM!

The embedding service tests comprehensively cover instantiation, singleton behavior, and override functionality. The custom dimension verification in the override test (line 171) adds good specificity.

174-185: LGTM!

The invalid configuration test properly verifies that an unknown repository type raises a ValueError with an appropriate error message. The use of pytest.raises is correct.
```python
with patch.dict(os.environ, {}, clear=True):
    # Remove REPOSITORY_TYPE if present
    os.environ.pop("REPOSITORY_TYPE", None)
```
Remove redundant environment variable pop.
The patch.dict(os.environ, {}, clear=True) on line 28 already replaces the environment with an empty dictionary, so the pop on line 30 will never find REPOSITORY_TYPE to remove. This line is redundant and can be deleted.
🔎 Proposed fix
```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_default_repository_type(self):
    """Test default repository type is supabase."""
    with patch.dict(os.environ, {}, clear=True):
        repo_type = container.get_repository_type()
        assert repo_type == "supabase"
```
🤖 Prompt for AI Agents
In python/tests/core/test_container.py around lines 28 to 30, the call to
os.environ.pop("REPOSITORY_TYPE", None) is redundant because
patch.dict(os.environ, {}, clear=True) already clears the environment; remove
the pop line to avoid dead code and keep the test setup concise.
Actionable comments posted: 0
♻️ Duplicate comments (1)
python/tests/core/test_container.py (1)
28-30: Remove redundant environment variable pop.

The patch.dict(os.environ, {}, clear=True) already clears the environment, making the subsequent pop operation redundant.

🔎 Proposed fix

```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```
🧹 Nitpick comments (1)
python/tests/core/test_container.py (1)
199-215: Simplify the credential clearing pattern.

Setting environment variables to empty strings in patch.dict and then immediately popping them is redundant. You can simplify this by either excluding the keys from the patch dict entirely or relying solely on the pop operations.

🔎 Proposed fix (clearer approach using only pop)

```diff
 def test_supabase_without_credentials_raises(self):
     """Test that Supabase without credentials raises error."""
     container.configure(repository_type="supabase")

-    # Clear Supabase env vars
-    with patch.dict(
-        os.environ,
-        {"SUPABASE_URL": "", "SUPABASE_SERVICE_KEY": ""},
-        clear=False,
-    ):
+    # Clear Supabase env vars while preserving other environment variables
+    with patch.dict(os.environ, clear=False):
         os.environ.pop("SUPABASE_URL", None)
         os.environ.pop("SUPABASE_SERVICE_KEY", None)

         with pytest.raises(ValueError) as exc_info:
             container.get_page_repository()

         assert "SUPABASE_URL" in str(exc_info.value)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
python/src/core/container.py
python/tests/core/test_container.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/core/container.py
🧰 Additional context used
📓 Path-based instructions (2)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/test_container.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/test_container.py
🧬 Code graph analysis (1)
python/tests/core/test_container.py (3)
python/src/core/container.py (7)
- reset (150-166)
- get_repository_type (43-50)
- configure (169-186)
- get_page_repository (53-98)
- override_repository (128-136)
- get_embedding_service (101-125)
- override_embedding_service (139-147)
python/src/core/infrastructure/memory/page_repository.py (2)
- clear (41-45)
- InMemoryPageRepository (18-321)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- MockEmbeddingService (13-91)
🔇 Additional comments (6)
python/tests/core/test_container.py (6)
15-20: LGTM! Excellent test isolation pattern.

The autouse fixture ensures clean container state before and after each test, preventing test pollution. A sketch of the pattern follows.
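The pattern in question looks roughly like this (a sketch, assuming the container module exposes reset() as shown in the code graph above):

```python
import pytest

from src.core import container


@pytest.fixture(autouse=True)
def reset_container():
    # Runs around every test in the module: clean state in, clean state out
    container.reset()
    yield
    container.reset()
```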
36-65: LGTM! Configuration tests cover all scenarios.

The tests properly validate environment variable handling, function-based configuration, and override behavior.

71-101: LGTM! Repository instantiation tests are thorough.

The tests properly validate singleton behavior, instance types, and reset semantics.

107-135: LGTM! Override tests validate test-time behavior.

The tests properly verify that custom repository instances can be injected for testing and that reset clears the override.

141-171: LGTM! Embedding service tests mirror repository patterns.

The tests validate singleton behavior, override mechanism, and custom configuration (dimension parameter).

177-193: LGTM! Error handling tests are well-structured.

The tests properly validate error conditions using pytest.raises and verify error message content for both invalid types and unimplemented backends.
Actionable comments posted: 2
♻️ Duplicate comments (1)
python/tests/core/test_container.py (1)
28-30: Remove redundant environment variable pop.

The patch.dict(os.environ, {}, clear=True) on line 28 already replaces the environment with an empty dictionary, so the pop on line 30 will never find REPOSITORY_TYPE to remove. This line is redundant and can be deleted.

🔎 Proposed fix

```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```
🧹 Nitpick comments (5)
python/src/core/container.py (2)
98-122: Consider documenting production embedding service setup more prominently.

The current implementation always defaults to MockEmbeddingService, with a comment indicating production systems should use override_embedding_service(). This is fine for the abstraction layer PR, but consider adding a TODO or follow-up to implement production embedding service selection (e.g., via an environment variable similar to REPOSITORY_TYPE).
166-183: Clarify force_reinit behavior with overrides.

When force_reinit=True, the function clears singleton instances but not overrides. If an override is set, get_page_repository() will still return the override (lines 70-71) rather than creating a new instance based on the updated repository_type. This may be intentional for testing, but consider documenting this behavior in the docstring to avoid confusion.

🔎 Suggested docstring update

```diff
 def configure(
     repository_type: str | None = None,
     force_reinit: bool = False,
 ) -> None:
     """
     Configure the container.

     Args:
         repository_type: Override repository type (sets env var)
-        force_reinit: Force reinitialization of instances
+        force_reinit: Force reinitialization of singleton instances.
+            Note: Does not clear test overrides; use reset() for full cleanup.
     """
```

python/src/core/infrastructure/postgres/page_repository.py (1)
212-231: Consider batch insertion for better performance.

The current implementation inserts chunks one-by-one, acquiring a connection for each. For large batches, this could be optimized using executemany or COPY. However, the current approach provides better error isolation (one failed insert doesn't block others) and aligns with the resilient error-handling pattern. A sketch of the executemany variant follows.
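If batch throughput ever becomes a concern, a minimal sketch of an executemany-based variant (table and column names here are illustrative, not taken from the actual schema; note that asyncpg's executemany runs as a single implicit transaction, so per-row error isolation is lost):

```python
import asyncpg


async def insert_chunks_batch_fast(pool: asyncpg.Pool, rows: list[tuple]) -> int:
    """All-or-nothing batch insert; trades per-row error isolation for speed."""
    query = """
        INSERT INTO crawled_page_chunks (url, chunk_number, content, source_id)
        VALUES ($1, $2, $3, $4)
    """
    async with pool.acquire() as conn:
        # executemany prepares the statement once and streams every row through it
        await conn.executemany(query, rows)
    return len(rows)
```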
python/tests/core/infrastructure/postgres/test_page_repository.py (2)

120-139: Consider breaking long lines for readability.

Lines 126-127 exceed 120 characters. Per coding guidelines, Python backend code should use 120 character line length.

🔎 Proposed fix

```diff
 @pytest.mark.asyncio
 async def test_search_similar_filters_below_threshold(self):
     """Test search_similar filters results below similarity threshold."""
     mock_conn = AsyncMock()
     mock_conn.fetch = AsyncMock(
         return_value=[
-            {"id": 1, "url": "https://example.com", "content": "test", "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.03},  # Below threshold
-            {"id": 2, "url": "https://example.com/2", "content": "test2", "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.10},  # Above threshold
+            # Below threshold
+            {
+                "id": 1, "url": "https://example.com", "content": "test",
+                "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.03,
+            },
+            # Above threshold
+            {
+                "id": 2, "url": "https://example.com/2", "content": "test2",
+                "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.10,
+            },
         ]
     )
```
461-479: Consider breaking long lines for readability.

Lines 467-468 exceed 120 characters. Per coding guidelines, Python backend code should use 120 character line length.

🔎 Proposed fix

```diff
 @pytest.mark.asyncio
 async def test_list_pages_by_source(self):
     """Test list_pages_by_source returns list of pages."""
     mock_conn = AsyncMock()
     mock_conn.fetch = AsyncMock(
         return_value=[
-            {"id": "page-1", "source_id": "src-1", "url": "https://example.com/1", "full_content": "Content 1", "section_title": None, "section_order": 0, "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {}},
-            {"id": "page-2", "source_id": "src-1", "url": "https://example.com/2", "full_content": "Content 2", "section_title": None, "section_order": 1, "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {}},
+            {
+                "id": "page-1", "source_id": "src-1", "url": "https://example.com/1",
+                "full_content": "Content 1", "section_title": None, "section_order": 0,
+                "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {},
+            },
+            {
+                "id": "page-2", "source_id": "src-1", "url": "https://example.com/2",
+                "full_content": "Content 2", "section_title": None, "section_order": 1,
+                "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {},
+            },
         ]
     )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.env.example
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/__init__.py
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
✅ Files skipped from review due to trivial changes (1)
- python/tests/core/infrastructure/postgres/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
🧬 Code graph analysis (2)
python/src/core/infrastructure/postgres/__init__.py (1)
python/src/core/infrastructure/postgres/page_repository.py (1)
- PostgresPageRepository (54-506)
python/src/core/infrastructure/postgres/page_repository.py (4)
python/src/core/domain/interfaces/page_repository.py (1)
- IPageRepository (15-263)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
python/src/core/domain/models/search_result.py (2)
- SearchResult (13-114)
- from_postgres_row (78-102)
🔇 Additional comments (16)
.env.example (1)
26-48: LGTM!

The database backend configuration section is well-structured with clear documentation explaining the three supported backends and both connection options for PostgreSQL. The commented-out examples with sensible defaults provide a good template for users.

python/src/core/infrastructure/postgres/__init__.py (1)

1-10: LGTM!

Clean package initialization with appropriate docstring explaining the backend's purpose. The export aligns with the container's import expectations.

python/tests/core/test_container.py (3)

15-20: LGTM!

The autouse fixture properly resets container state before and after each test, preventing test pollution across the suite.

68-101: LGTM!

Good coverage of repository instantiation, singleton behavior, and reset semantics. The tests properly verify that the container returns the correct implementation type and maintains singleton identity.

174-200: LGTM!

Error handling tests properly verify that invalid repository types and missing PostgreSQL credentials raise appropriate ValueError exceptions with informative messages.

python/src/core/container.py (3)

1-26: LGTM!

Excellent module docstring with clear usage examples for both production and testing scenarios. The configuration options are well-documented.

53-95: LGTM!

The repository factory properly handles test overrides with precedence, uses lazy imports to avoid circular dependencies, and provides clear error messages for invalid types. The singleton pattern with override support enables clean testing.

147-163: LGTM!

The reset function properly clears all state including the REPOSITORY_TYPE environment variable, which addresses test pollution concerns. This is the correct cleanup behavior for test teardown.

python/src/core/infrastructure/postgres/page_repository.py (4)

1-18: LGTM!

Clean module structure with appropriate imports and logging setup. The docstring accurately describes the implementation's purpose and dependencies.

21-51: LGTM!

The connection string builder properly supports both URL-based and individual parameter configurations, with clear error messages that include the required environment variables when credentials are missing. This aligns with the coding guidelines for error context.

168-177: LGTM!

Error handling follows the coding guidelines: logs detailed context (match_count, source_filter) with exc_info=True for full stack trace, uses specific exception handling, and returns a safe fallback (empty list) to maintain resilience.

460-468: LGTM!

Simple and effective health check that returns a boolean without exposing internal errors. The silent exception handling is appropriate here since the caller only needs to know if the connection is healthy.
python/tests/core/infrastructure/postgres/test_page_repository.py (4)
1-18: LGTM!

Clean test module structure with appropriate imports. Testing the internal _get_connection_string helper is valuable for verifying environment configuration handling.

75-84: LGTM!

The create_mock_pool helper provides a clean way to mock asyncpg's pool with proper async context manager support, enabling isolated unit tests without a real database. A sketch of the idiom follows.
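For readers who haven't seen the idiom, a helper along these lines (a sketch; the test file's actual create_mock_pool may differ) makes `async with pool.acquire() as conn:` work against a mock:

```python
from unittest.mock import AsyncMock, MagicMock


def create_mock_pool(mock_conn: AsyncMock) -> MagicMock:
    """Build a fake asyncpg pool whose acquire() yields mock_conn."""
    pool = MagicMock()
    ctx = MagicMock()
    ctx.__aenter__ = AsyncMock(return_value=mock_conn)
    ctx.__aexit__ = AsyncMock(return_value=False)
    pool.acquire.return_value = ctx
    return pool
```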
141-159: LGTM!

Good test for error resilience. The `# noqa: B901` comment correctly suppresses the linter warning for the intentional unreachable yield after raise, which is needed to make the async context manager syntactically valid.

510-583: LGTM!

Utility operation tests provide good coverage for list_unique_urls (with and without source filter) and health_check (success and failure paths). The error simulation pattern is consistent with other tests in this file.
```python
async with pool.acquire() as conn:
    row = await conn.fetchrow(query, int(chunk_id))

if not row:
    return None

return self._row_to_crawled_page(dict(row))
```
Handle non-integer chunk IDs gracefully.
Line 276 calls int(chunk_id) which will raise ValueError if chunk_id is not a valid integer string (e.g., UUID). Consider catching this or validating input, especially if the ID format might vary between backends.
🔎 Proposed fix
```diff
 async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
     """Get a chunk by ID."""
     pool = await self._ensure_pool()
+
+    try:
+        chunk_id_int = int(chunk_id)
+    except ValueError:
+        logger.warning("Invalid chunk_id format: %s (expected integer)", chunk_id)
+        return None
     query = f"""
         SELECT * FROM {self.CHUNKS_TABLE}
         WHERE id = $1
     """
     async with pool.acquire() as conn:
-        row = await conn.fetchrow(query, int(chunk_id))
+        row = await conn.fetchrow(query, chunk_id_int)
     if not row:
         return None
     return self._row_to_crawled_page(dict(row))
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 274
to 281, the code unconditionally calls int(chunk_id) which will raise ValueError
for non-integer IDs (e.g., UUIDs); change the logic to validate/normalize
chunk_id before using it: if chunk_id is numeric (use str.isdigit() or a try/int
conversion), convert to int and pass it to the query, otherwise pass the
chunk_id as-is (or transform to the expected DB type), and wrap the conversion
in a try/except that either logs the invalid format and returns None or raises a
clear, typed exception—ensure the DB query uses the correctly-typed parameter
and avoid letting ValueError propagate uncaught.
Actionable comments posted: 2
🧹 Nitpick comments (2)
python/tests/core/infrastructure/postgres/test_integration.py (2)
382-383: Consider using fixture-based cleanup for consistency.

The manual cleanup of other_source_id is inconsistent with the fixture-based cleanup pattern used elsewhere. If this cleanup fails, it will cause the test to fail unexpectedly.

🔎 Suggested approach

Either wrap in try-except:

```diff
 # Cleanup other source
-await repo.delete_chunks_by_source(other_source_id)
+try:
+    await repo.delete_chunks_by_source(other_source_id)
+except Exception as e:
+    import logging
+    logging.getLogger(__name__).warning(
+        "Failed to cleanup other_source_id=%s: %s",
+        other_source_id,
+        e,
+    )
```

Or rely on database isolation: since you're using unique source IDs, you could let the test leave the data and trust that other tests won't conflict.
462-470: Strengthen test assertions for multi-dimension embeddings.

The test only validates that results is a list without checking if the inserted chunk was actually found. This provides minimal validation of the multi-dimension embedding functionality. Consider one of these approaches:

Option 1: Assert chunk is found (if RPC supports all dimensions)

```diff
 # Search with same dimension
 results = await repo.search_similar(
     embedding=embedding,
     match_count=5,
     source_filter=test_source_id,
 )
-# Should find the chunk (may not find if RPC doesn't support this dimension)
-# This validates the search doesn't error out
-assert isinstance(results, list)
+# Should find the inserted chunk
+assert len(results) > 0, f"No results found for {dimension}D embedding"
+assert any(
+    result.source_id == test_source_id and result.content == f"Content for {dimension}D embedding"
+    for result in results
+), f"Inserted chunk not found in results for {dimension}D embedding"
```

Option 2: Skip unsupported dimensions

If some dimensions aren't supported by the RPC function, use pytest.mark.skipif or query the database schema to determine supported dimensions, then only test those.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
python/tests/core/infrastructure/postgres/test_integration.py
🧰 Additional context used
📓 Path-based instructions (2)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/infrastructure/postgres/test_integration.py
🧠 Learnings (1)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/integration/**/*.py : Mark integration tests with `pytest.mark.integration` decorator and place in `tests/integration/` directory
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
🧬 Code graph analysis (1)
python/tests/core/infrastructure/postgres/test_integration.py (3)
python/src/core/infrastructure/postgres/page_repository.py (1)
- PostgresPageRepository (54-506)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
🔇 Additional comments (1)
python/tests/core/infrastructure/postgres/test_integration.py (1)
66-186: LGTM! Well-structured integration tests.

The connection and chunk operation tests are well-designed with proper test isolation using unique source IDs, comprehensive coverage of CRUD operations, and appropriate assertions.
```python
pytestmark = pytest.mark.skipif(
    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
)
```
Missing pytest.mark.integration decorator per project standards.

Based on learnings, integration tests should be marked with @pytest.mark.integration in addition to (or instead of) the skipif marker. This allows proper test categorization and filtering in CI/CD pipelines.

🔎 Suggested fix (replaces the existing pytestmark so only one assignment remains)

```diff
 # Skip all tests in this module if integration testing is not enabled
-pytestmark = pytest.mark.skipif(
-    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
-    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
-)
+# Mark all tests as integration tests
+pytestmark = [
+    pytest.mark.skipif(
+        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
+        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
+    ),
+    pytest.mark.integration,
+]
```

Based on learnings, integration tests should use the pytest.mark.integration decorator and be placed in the tests/integration/ directory. Consider relocating this file to python/tests/integration/core/infrastructure/postgres/test_integration.py.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
pytestmark = [
    pytest.mark.skipif(
        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
    ),
    pytest.mark.integration,
]
```
🤖 Prompt for AI Agents
In python/tests/core/infrastructure/postgres/test_integration.py around lines
28-31, the module only uses a skipif marker and is missing the required
pytest.mark.integration designation; add the integration marker so CI can
categorize these tests (either by updating the module-level pytestmark to
combine markers, e.g. include pytest.mark.integration alongside the existing
skipif, or by annotating the test class/functions with @pytest.mark.integration)
and optionally move the file to
python/tests/integration/core/infrastructure/postgres/test_integration.py to
follow the project integration test directory convention.
```python
try:
    await repository.delete_chunks_by_source(test_source_id)
    await repository.delete_pages_by_source(test_source_id)
except Exception:
    pass  # Ignore cleanup errors
```
Improve error handling in cleanup to preserve diagnostics.
The bare except Exception: pass suppresses all cleanup errors without logging, violating the coding guidelines requirement to "preserve full stack traces" and "include context about what was being attempted."
🔎 Suggested fix with logging
```diff
 # Cleanup all test data
 try:
     await repository.delete_chunks_by_source(test_source_id)
     await repository.delete_pages_by_source(test_source_id)
-except Exception:
-    pass  # Ignore cleanup errors
+except Exception as e:
+    # Log but don't fail the test on cleanup errors
+    import logging
+    logging.getLogger(__name__).warning(
+        "Failed to cleanup test data for source_id=%s: %s",
+        test_source_id,
+        e,
+        exc_info=True,
+    )
```

As per coding guidelines, error messages must include context and preserve stack traces.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
try:
    await repository.delete_chunks_by_source(test_source_id)
    await repository.delete_pages_by_source(test_source_id)
except Exception as e:
    # Log but don't fail the test on cleanup errors
    import logging
    logging.getLogger(__name__).warning(
        "Failed to cleanup test data for source_id=%s: %s",
        test_source_id,
        e,
        exc_info=True,
    )
```
🤖 Prompt for AI Agents
In python/tests/core/infrastructure/postgres/test_integration.py around lines
59-63, the cleanup block currently swallows all exceptions with `except
Exception: pass`; change it to catch the exception as a variable and log it with
context and full stack trace (e.g., `except Exception as e:
logger.exception("Failed cleaning up pages/chunks for source %s",
test_source_id, exc_info=True)`), ensuring a suitable logger is available or
imported in the test; do not remove the cleanup attempt—just log the failure
with the source id and stack trace so diagnostics are preserved.
Actionable comments posted: 3
♻️ Duplicate comments (3)
python/tests/core/infrastructure/postgres/test_integration.py (2)
28-31: Add pytest.mark.integration marker per project standards.

Based on learnings, integration tests should be marked with pytest.mark.integration in addition to the skipif marker for proper test categorization.

🔎 Suggested fix

```diff
-pytestmark = pytest.mark.skipif(
-    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
-    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
-)
+pytestmark = [
+    pytest.mark.skipif(
+        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
+        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
+    ),
+    pytest.mark.integration,
+]
```
72-83: Log cleanup errors with context and stack trace.

The bare `except Exception: pass` violates coding guidelines requiring error context and stack traces. As per coding guidelines, preserve full stack traces with exc_info=True.

🔎 Suggested fix

```diff
 # Cleanup all test data
 try:
     await repository.delete_chunks_by_source(test_source_id)
     await repository.delete_pages_by_source(test_source_id)
     # Also delete the source
     async with pool.acquire() as conn:
         await conn.execute(
             "DELETE FROM archon_sources WHERE source_id = $1",
             test_source_id,
         )
-except Exception:
-    pass  # Ignore cleanup errors
+except Exception as e:
+    # Log but don't fail the test on cleanup errors
+    logger.warning(
+        "Failed to cleanup test data for source_id=%s: %s",
+        test_source_id,
+        e,
+        exc_info=True,
+    )
```

Based on learnings, use logger.exception() or log with exc_info=True to preserve stack traces.

python/src/core/infrastructure/postgres/page_repository.py (1)
275-290: Handle non-integer chunk IDs gracefully.

Line 285 calls int(chunk_id), which will raise ValueError for non-integer IDs. Consider validating the input or catching the exception to provide a clearer error message.

🔎 Suggested fix

```diff
 async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
     """Get a chunk by ID."""
     pool = await self._ensure_pool()
+
+    try:
+        chunk_id_int = int(chunk_id)
+    except ValueError:
+        logger.warning("Invalid chunk_id format: %s (expected integer)", chunk_id, exc_info=True)
+        return None
     query = f"""
         SELECT * FROM {self.CHUNKS_TABLE}
         WHERE id = $1
     """
     async with pool.acquire() as conn:
-        row = await conn.fetchrow(query, int(chunk_id))
+        row = await conn.fetchrow(query, chunk_id_int)
```
🧹 Nitpick comments (2)
python/tests/core/infrastructure/postgres/test_integration.py (1)
416-422: Consider adding error handling to manual cleanup.

The manual cleanup at the end of this test lacks error handling, unlike the fixture cleanup. If this cleanup fails, it could leave test data behind.

🔎 Optional improvement

```diff
 # Cleanup other source (chunks and source entry)
-await repo.delete_chunks_by_source(other_source_id)
-async with pool.acquire() as conn:
-    await conn.execute(
-        "DELETE FROM archon_sources WHERE source_id = $1",
-        other_source_id,
-    )
+try:
+    await repo.delete_chunks_by_source(other_source_id)
+    async with pool.acquire() as conn:
+        await conn.execute(
+            "DELETE FROM archon_sources WHERE source_id = $1",
+            other_source_id,
+        )
+except Exception as e:
+    logger.warning(
+        "Failed to cleanup other_source_id=%s: %s",
+        other_source_id,
+        e,
+        exc_info=True,
+    )
```

python/src/core/infrastructure/postgres/page_repository.py (1)
181-190: Consider differentiating critical vs. recoverable errors in search.

Returning an empty list for all exceptions provides resilience but may mask critical issues like connection failures. While the logging is good (includes context and exc_info=True), consider whether connection/pool errors should propagate rather than return []. One hedged sketch of that split follows.
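One way to draw that line, assuming asyncpg is the driver (the _run_search helper here is hypothetical):

```python
import logging

import asyncpg

logger = logging.getLogger(__name__)


async def search_similar_safe(repo, embedding: list[float], match_count: int) -> list:
    try:
        return await repo._run_search(embedding, match_count)  # hypothetical helper
    except asyncpg.PostgresError as e:
        # Server-side query failure: log with context and degrade to empty results
        logger.error("search_similar failed (match_count=%d): %s", match_count, e, exc_info=True)
        return []
    except (OSError, asyncpg.exceptions.InterfaceError):
        # Connection/pool failure: propagate so callers can react (retry, alert, fail fast)
        raise
```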
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/test_integration.py
🧰 Additional context used
📓 Path-based instructions (3)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
🧠 Learnings (7)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/integration/**/*.py : Mark integration tests with `pytest.mark.integration` decorator and place in `tests/integration/` directory
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/**/*.py : Mark unit tests with `pytest.mark.unit` decorator
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
🧬 Code graph analysis (1)
python/src/core/domain/models/crawled_page.py (1)
python/src/core/domain/models/page_metadata.py (1)
- to_insert_dict (62-83)
🔇 Additional comments (5)
python/src/core/domain/models/crawled_page.py (1)
14-60: LGTM - Well-structured domain model.

The CrawledPage model is well-designed with:
- Clear field definitions and proper Optional typing
- Support for multi-dimensional embeddings (768, 1024, 1536, 3072)
- Good docstrings explaining the purpose and relationships
- Appropriate Pydantic configuration for database ORM usage
python/tests/core/infrastructure/postgres/test_integration.py (1)
86-509: LGTM - Comprehensive integration test coverage.

The integration tests provide excellent coverage:
- Connection health checks
- Chunk CRUD operations (insert, batch insert, delete, retrieve, count)
- Page operations (upsert, update, list, chunk count updates)
- Vector similarity search with source filtering and ordering verification
- Multi-dimensional embedding support (768, 1024, 1536, 3072) via parametrized tests
The test isolation using unique test_source_id per test is a good practice.

python/src/core/infrastructure/postgres/page_repository.py (3)
89-127: LGTM - Codec registration properly implemented.

The pool initialization correctly uses the init callback to register type codecs (vector and jsonb) on every pooled connection. This ensures all connections have the necessary codecs before use, addressing the previous review concern about connection-specific codec registration.
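For readers unfamiliar with the pattern, an init callback on asyncpg's pool looks roughly like this (a sketch; the jsonb codec shown is a documented asyncpg idiom, while the PR's actual vector codec setup may differ):

```python
import json

import asyncpg


async def _init_connection(conn: asyncpg.Connection) -> None:
    # Runs once for each connection the pool creates, before it is handed out
    await conn.set_type_codec(
        "jsonb",
        encoder=json.dumps,
        decoder=json.loads,
        schema="pg_catalog",
    )


async def create_pool(dsn: str) -> asyncpg.Pool:
    # Every pooled connection passes through _init_connection first
    return await asyncpg.create_pool(dsn=dsn, init=_init_connection)
```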
481-513: LGTM - Robust data mapping helpers.

The helper methods handle database row conversion safely:
- Proper None handling with .get() and default values
- Type conversions for IDs using str()
- Appropriate defaults for required fields
54-476: Well-structured repository implementation.

The PostgresPageRepository is well-organized with:
- Clear separation between chunk operations, page operations, and utilities
- Proper connection pooling with asyncpg
- Consistent async/await patterns throughout
- Comprehensive implementation of the IPageRepository interface
The noted issues with error logging context in batch operations should be addressed, but the overall architecture is sound.
```python
# Set embedding in appropriate column
embedding = self.get_active_embedding()
if embedding:
    if embedding_column and embedding_column.startswith("embedding_"):
        data[embedding_column] = embedding
    elif self.embedding_dimension:
        # Use dimension-specific column when dimension is known
        data[f"embedding_{self.embedding_dimension}"] = embedding
    # Note: Generic 'embedding' column does not exist in schema
    # Only dimension-specific columns are supported
```
🧩 Analysis chain

Verification performed against the repository: inspected python/src/core/domain/models/crawled_page.py, searched for CrawledPage instantiations in src/ and tests/, checked for @validator/@field_validator declarations on the model, reviewed to_insert_dict() call sites, and compared the Supabase and Postgres insert_chunks_batch implementations. No validator requires embedding_dimension when an embedding is set, and no code path infers the dimension from the vector length.
Add validation or warning for embeddings without dimension metadata.
The to_insert_dict() method silently drops embeddings when embedding_dimension is None and no explicit embedding_column is provided. This occurs because the schema only supports dimension-specific columns (e.g., embedding_768, embedding_1536), not a generic embedding column. While this is by design, the Pydantic model allows creating CrawledPage instances with embeddings but without dimension metadata, creating a mismatch.
Consider either:
- Adding a field validator in __init__ to require embedding_dimension when embedding is set, or
- Logging a warning in to_insert_dict() when an embedding is silently dropped due to missing dimension metadata.

A sketch of the validator option follows.
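A minimal sketch of the validator option, using Pydantic v2's model_validator (the field names mirror the CrawledPage model in this PR, but treat the snippet as illustrative):

```python
from pydantic import BaseModel, model_validator


class CrawledPage(BaseModel):
    content: str
    embedding_768: list[float] | None = None
    embedding_1536: list[float] | None = None
    embedding_dimension: int | None = None

    @model_validator(mode="after")
    def require_dimension_with_embedding(self) -> "CrawledPage":
        has_embedding = self.embedding_768 is not None or self.embedding_1536 is not None
        if has_embedding and self.embedding_dimension is None:
            # Fail fast instead of letting to_insert_dict() silently drop the vector
            raise ValueError("embedding_dimension must be set when an embedding is provided")
        return self
```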
🤖 Prompt for AI Agents
In python/src/core/domain/models/crawled_page.py around lines 107-117, the code
silently drops an embedding when embedding_dimension is None and no
embedding_column is provided; add a fix so embeddings are not lost: implement a
Pydantic validator (or __post_init__) on the model to raise a clear validation
error when embedding is set but embedding_dimension is missing, or if you prefer
runtime tolerance, add a logging warning inside to_insert_dict() right before
the embedding is skipped that warns the embedding will be dropped due to missing
dimension metadata (include page id/context and suggest setting
embedding_dimension or providing embedding_column).
```python
async def insert_chunks_batch(
    self,
    chunks: list[CrawledPage],
    embedding_column: str | None = None,
) -> int:
    """Insert multiple chunks in a batch."""
    if not chunks:
        return 0

    count = 0

    # Insert each chunk (each insert_chunk call manages its own connection)
    for chunk in chunks:
        try:
            await self.insert_chunk(chunk)
            count += 1
        except Exception as e:
            logger.error("Failed to insert chunk: %s", e)

    return count
```
Add context to batch insert error logging.
The error logging at line 238 lacks context about which chunk failed. As per coding guidelines, error messages must include relevant IDs/URLs and context about what was being attempted.
🔎 Suggested fix

```diff
 # Insert each chunk (each insert_chunk call manages its own connection)
 for chunk in chunks:
     try:
         await self.insert_chunk(chunk)
         count += 1
     except Exception as e:
-        logger.error("Failed to insert chunk: %s", e)
+        logger.error(
+            "Failed to insert chunk (url=%s, source_id=%s, chunk_number=%d): %s",
+            chunk.url,
+            chunk.source_id,
+            chunk.chunk_number,
+            e,
+            exc_info=True,
+        )
```

Based on coding guidelines, include context about what was being attempted and preserve full stack traces.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 221
to 240, the batch insert error log does not include which chunk failed or the
stack trace; update the except block to log contextual identifiers from the
CrawledPage (e.g., chunk.id and/or chunk.url or other unique fields) and
preserve the full traceback (use logger.exception or logger.error(...,
exc_info=True)) so each failure message clearly states which chunk failed and
includes the stack trace for debugging.
```python
async def upsert_pages_batch(self, pages: list[PageMetadata]) -> dict[str, str]:
    """Upsert multiple pages and return url->id mapping."""
    if not pages:
        return {}

    result = {}
    for page in pages:
        try:
            page_id = await self.upsert_page(page)
            result[page.url] = page_id
        except Exception as e:
            logger.error("Failed to upsert page %s: %s", page.url, e)

    return result
```
Add context to batch upsert error logging.
Similar to insert_chunks_batch, the error logging at line 354 lacks sufficient context. Include the page URL and source_id to help debugging.
🔎 Suggested fix

```diff
 result = {}
 for page in pages:
     try:
         page_id = await self.upsert_page(page)
         result[page.url] = page_id
     except Exception as e:
-        logger.error("Failed to upsert page %s: %s", page.url, e)
+        logger.error(
+            "Failed to upsert page (url=%s, source_id=%s): %s",
+            page.url,
+            page.source_id,
+            e,
+            exc_info=True,
+        )
```

Based on coding guidelines, preserve full stack traces with exc_info=True.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 343
to 356, the batch upsert error log only includes the page URL and exception
message; update the exception handling to log both page.url and page.source_id
(or page.metadata.source_id) and pass exc_info=True so the full stack trace is
preserved; ensure the log message is descriptive (e.g., "Failed to upsert page
%s (source_id=%s)") and include both values when calling logger.error.
- CODE_CHANGES_SUMMARY.md: Detail async repository initialization fix for archon_graph.py
- STAGING_VALIDATION_INDEX.md: Document staging validation results and deployment readiness
- docs/RECHERCHE_MIGRATION_SUPABASE_POSTGRES.md: Strategic research for Supabase to PostgreSQL migration

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…upabase deprecation

Phase 4 - Final documentation:
- Add comprehensive docs/ARCHITECTURE.md with:
  - ASCII diagrams of Domain/Infrastructure/Application layers
  - Complete Container DI guide with examples
  - Test examples with InMemoryRepository
  - Guide for adding new database backends (MongoDB example)
  - FAQ and troubleshooting
- Update README.md with:
  - New 'Database Layer Architecture' section
  - Quick start guide for PostgreSQL
  - Links to architecture and migration documentation

Phase 5 - Supabase cleanup documentation:
- Add docs/SUPABASE_DEPRECATION_GUIDE.md with:
  - Complete inventory of Supabase legacy code
  - Line-by-line analysis of code to clean
  - Migration plan in 4 phases (6-month timeline)
  - Cleanup checklist and risk assessment
  - Estimate: ~585 lines of code to remove

IMPORTANT: This is DOCUMENTATION ONLY. No Supabase code was deleted. The cleanup guide is for future reference when the user decides to proceed. All docstrings verified - already complete and well-structured.

Part of Sequence 4 (P4-06 + P5-04) completion.
Complete report for Phase 4 & 5 documentation tasks:
- ARCHITECTURE.md: 800+ lines of comprehensive architecture docs
- SUPABASE_DEPRECATION_GUIDE.md: 1000+ lines deprecation guide
- README.md: New Database Layer Architecture section
- All docstrings verified and complete

Total: 4/4 tasks completed, ~2200 lines of documentation added. No production code modified (documentation only).
Force-pushed 6b37422 to 3bbfc86.
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
archon/crawl_pydantic_ai_docs.py (1)
283-332: Fix type mismatch and move import to module level.

Two issues identified:

Critical: Type mismatch for metadata - The `SitePage` domain model expects `metadata: SitePageMetadata` (see `archon/domain/models/site_page.py`), but the code passes a plain dict created in `process_chunk` (lines 266-271). This will cause Pydantic validation errors at runtime.

Major: Import inside function - Line 299 imports `SitePage` inside the function, which is a code smell suggesting potential circular import issues.

🔎 Proposed fixes

Fix 1: Move import to module level

```diff
 # Import domain interfaces for dependency injection
-from archon.domain import ISitePagesRepository, IEmbeddingService
+from archon.domain import ISitePagesRepository, IEmbeddingService, SitePage, SitePageMetadata
```

Fix 2: Update process_chunk to return SitePageMetadata

In the `process_chunk` function (lines 265-271):

```diff
 # Create metadata
-metadata = {
-    "source": "pydantic_ai_docs",
-    "chunk_size": len(chunk),
-    "crawled_at": datetime.now(timezone.utc).isoformat(),
-    "url_path": urlparse(url).path
-}
+metadata = SitePageMetadata(
+    source="pydantic_ai_docs",
+    chunk_size=len(chunk),
+    crawled_at=datetime.now(timezone.utc).isoformat(),
+    url_path=urlparse(url).path
+)
```

Fix 3: Remove the now-redundant import

In the `insert_chunk` function (line 299):

```diff
 # Use injected repository if provided (new pattern)
 if repository is not None:
-    from archon.domain import SitePage
     page = SitePage(
```
🟠 Major comments (22)
requirements.txt-79-79 (1)
79-79: Upgrade MCP to ≥1.23.0 to resolve three HIGH-severity vulnerabilities.

The MCP package v1.2.1 is vulnerable to:
- CVE-2025-53366: FastMCP validation DoS via malformed requests (fixed 1.9.4)
- CVE-2025-66416: DNS rebinding bypass in unauthenticated HTTP servers (fixed 1.23.0)
- CVE-2025-53365: ClosedResourceError DoS in Streamable HTTP Transport (fixed 1.10.0)
Upgrade to mcp ≥1.23.0 to address all three. Alternatively, apply transport-level mitigations (strict request validation, DNS rebinding protection settings, process supervision).
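A minimal sketch of the bump, assuming the constraint currently reads `mcp>=1.0.0` as it does in requirements-base.txt:

```diff
-mcp>=1.0.0
+mcp>=1.23.0
```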
migrate_schema.py-34-34 (1)
34-34: CASCADE deletion could have unintended consequences.

Using `DROP TABLE ... CASCADE` will automatically drop all dependent objects (views, foreign keys, etc.) without explicit confirmation. This could lead to unintended data loss beyond the table itself.

🔎 Safer approach with explicit dependency check

```diff
+# Check for dependencies before dropping
+print("\nChecking for table dependencies...")
+deps = await conn.fetch("""
+    SELECT dependent_ns.nspname as dependent_schema,
+           dependent_view.relname as dependent_view
+    FROM pg_depend
+    JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
+    JOIN pg_class as dependent_view ON pg_rewrite.ev_class = dependent_view.oid
+    JOIN pg_class as source_table ON pg_depend.refobjid = source_table.oid
+    JOIN pg_namespace dependent_ns ON dependent_view.relnamespace = dependent_ns.oid
+    WHERE source_table.relname = 'site_pages'
+""")
+
+if deps:
+    print(f"\nWARNING: {len(deps)} dependent objects found:")
+    for dep in deps:
+        print(f"  - {dep['dependent_schema']}.{dep['dependent_view']}")
+    response = input("\nDrop dependencies? (yes/no): ")
+    if response.lower() != "yes":
+        print("\nMigration cancelled.")
+        return False
+
 print("\nDropping existing table...")
-await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")
+await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-27-30 (1)
27-30: User input blocks automation.

Using `input()` for confirmation prevents the script from running in CI/CD pipelines or automated environments.

🔎 Add command-line flag for non-interactive mode

```diff
+import sys
+
-async def migrate():
+async def migrate(force: bool = False):
     """Migrate the id column from UUID to SERIAL."""
     # ... connection code ...

     if count > 0:
         print(f"\nWARNING: Table has {count} records.")
         print("They will be DELETED during migration!")
-        response = input("\nContinue? (yes/no): ")
-        if response.lower() != "yes":
-            print("\nMigration cancelled.")
-            return False
+        if not force:
+            response = input("\nContinue? (yes/no): ")
+            if response.lower() != "yes":
+                print("\nMigration cancelled.")
+                return False
+        else:
+            print("\n[FORCE MODE] Proceeding with migration...")
```

Then update main:

```diff
 if __name__ == "__main__":
     print("PostgreSQL Schema Migration Tool")
     print("This will convert the id column from UUID to SERIAL")
-    asyncio.run(migrate())
+    force = "--force" in sys.argv or "-f" in sys.argv
+    asyncio.run(migrate(force=force))
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-8-14 (1)
8-14: Missing connection timeout and error handling.

The database connection lacks timeout configuration, which could cause the script to hang indefinitely if the database is unreachable.

🔎 Add timeout and connection retry logic

```diff
+import os
+
 async def migrate():
     """Migrate the id column from UUID to SERIAL."""
-    conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
-    )
+    try:
+        conn = await asyncio.wait_for(
+            asyncpg.connect(
+                host=os.getenv("POSTGRES_HOST", "localhost"),
+                port=int(os.getenv("POSTGRES_PORT", "5432")),
+                user=os.getenv("POSTGRES_USER", "postgres"),
+                password=os.getenv("POSTGRES_PASSWORD"),
+                database=os.getenv("POSTGRES_DB", "mydb"),
+                timeout=10
+            ),
+            timeout=15
+        )
+    except asyncio.TimeoutError:
+        print("\n[ERROR] Connection timeout - ensure PostgreSQL is running")
+        return False
+    except Exception as e:
+        print(f"\n[ERROR] Connection failed: {e}")
+        return False
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-73-77 (1)
73-77: Broad exception catching loses error context.

Catching bare `Exception` makes debugging difficult and could mask unexpected errors like programming mistakes.

🔎 Catch specific exceptions

```diff
-except Exception as e:
-    print(f"\n[ERROR] Migration failed: {e}")
-    import traceback
-    traceback.print_exc()
-    return False
+except asyncpg.PostgresError as e:
+    print(f"\n[ERROR] Database error during migration: {e}")
+    print(f"SQLSTATE: {e.sqlstate}")
+    import traceback
+    traceback.print_exc()
+    return False
+except Exception as e:
+    print(f"\n[ERROR] Unexpected error: {e}")
+    import traceback
+    traceback.print_exc()
+    raise  # Re-raise unexpected errors
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-16-79 (1)
16-79: Missing transaction management and rollback capability.

The migration performs multiple DDL operations without transaction management. If any step fails midway, the database could be left in an inconsistent state with no way to roll back.
🔎 Add transaction wrapper for atomicity
```diff
 try:
+    async with conn.transaction():
         print("\nMigration: UUID -> SERIAL (INTEGER)")
         print("=" * 60)

         # Check if table has data
         count = await conn.fetchval("SELECT COUNT(*) FROM site_pages")
         print(f"Current records: {count}")

         if count > 0:
             print(f"\nWARNING: Table has {count} records.")
             print("They will be DELETED during migration!")
             response = input("\nContinue? (yes/no): ")
             if response.lower() != "yes":
                 print("\nMigration cancelled.")
                 return False

         # Drop and recreate table with correct schema
         print("\nDropping existing table...")
         await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")

         print("Creating table with SERIAL id...")
         await conn.execute("""
             CREATE TABLE site_pages (
                 id SERIAL PRIMARY KEY,
                 url TEXT NOT NULL,
                 chunk_number INTEGER DEFAULT 0,
                 title TEXT,
                 summary TEXT,
                 content TEXT,
                 metadata JSONB DEFAULT '{}',
                 embedding vector(1536),
                 created_at TIMESTAMPTZ DEFAULT NOW()
             )
         """)

         # Create indexes
         print("Creating embedding index (ivfflat)...")
         await conn.execute("""
             CREATE INDEX site_pages_embedding_idx
             ON site_pages USING ivfflat (embedding vector_cosine_ops)
             WITH (lists = 100)
         """)

         print("Creating url index...")
         await conn.execute("CREATE INDEX site_pages_url_idx ON site_pages (url)")

         print("Creating metadata->source index...")
         await conn.execute("""
             CREATE INDEX site_pages_metadata_source_idx
             ON site_pages ((metadata->>'source'))
         """)

         print("\n[SUCCESS] Migration completed!")
         print("Schema is now compatible with domain model.")
-        return True
+
+    return True
 except Exception as e:
     print(f"\n[ERROR] Migration failed: {e}")
+    print("Transaction rolled back - database unchanged")
     import traceback
     traceback.print_exc()
     return False
```

docs/CONTEXT_DB_STAGING_AGENT.md-26-27 (1)
26-27: Wrap bare URLs in Markdown link syntax.

Several lines contain bare URLs that should be formatted as Markdown links or code blocks:

- Lines 26–27: `http://localhost:8502`, `http://localhost:8101/health`
- Lines 131–132: `http://localhost:8502`, `http://localhost:8101/health`
- Lines 140–141: `http://localhost:8501`, `http://localhost:8100/health`

```diff
-- [ ] http://localhost:8502 charge l'UI Streamlit
+- [ ] [`http://localhost:8502`](http://localhost:8502) charge l'UI Streamlit
```

Apply the same pattern to all bare URLs.
Also applies to: 131-132, 140-141
.claude/agents/db-layer-refactoring-analyst.md-35-35 (1)
35-35: Replace emphasis with proper Markdown headings.

Multiple subsections use bold text (e.g., `**Direct Client Usage**`, `**Query Patterns**`) instead of heading syntax. This breaks document structure and reduces navigability. Consistent heading levels also improve readability for both humans and automated tools.

🔎 Example fixes (apply to all occurrences)

```diff
-**Direct Client Usage**
+#### Direct Client Usage
```

Apply the same pattern to lines 35, 42, 50, 200, 205, 210, 215, 220, 225. Use `####` (h4) for subsections within named sections, or `###` (h3) if these are primary section headings at the top level.

Also applies to: 42-42, 50-50, 200-200, 205-205, 210-210, 215-215, 220-220, 225-225
.claude/agents/db-layer-refactoring-analyst.md-69-69 (1)
69-69: Specify language identifiers for all fenced code blocks.

Code blocks at lines 69, 80, and 167 lack language specifiers (e.g., `yaml`, `python`), which prevents syntax highlighting and reduces readability.

🔎 Proposed fixes

Add a language to each opening fence: line 69 (the `UI/API Layer → Service Layer → [?Data Access?] → Supabase` flow), line 80 (the `src/` directory structure), and line 167 (the `modules/` structure), or annotate them as plain text.

Also applies to: 80-80, 167-167

docs/CONTEXT_DB_STAGING_AGENT.md-33-33 (1)

33-33: Specify language identifiers for code blocks.

Lines 33, 86–93, and 95–106 contain fenced code blocks without language specifiers: mark the production/staging comparison at line 33 as `bash` (or plain text), and the `if __name__ == "__main__":` blocks at lines 86–93 and 95–106 as `python`.

Also applies to: 86-93, 95-106
docs/CONTEXT_DB_STAGING_AGENT.md-1-186 (1)
1-186: Clarify intended language: use English or French consistently, not mixed.

This document mixes French and English throughout (French is predominant). This reduces clarity, searchability, and long-term maintainability. Choose one language and standardize across the codebase.
Is this file intended for a French-speaking team, or is the French content an artifact of AI-assisted generation that should be converted to English? If English is the standard, convert the entire document. If French is intentional, ensure other project documentation (README.md, CONTRIBUTING.md, etc.) also use French consistently.
STAGING_VALIDATION_INDEX.md-10-10 (1)
10-10: Wrap bare URLs in Markdown link syntax.

Line 10 contains a bare URL (`http://localhost:8502`) that should be formatted as a Markdown link or placed in a code block.

```diff
-- **Access Staging:** http://localhost:8502
+- **Access Staging:** [`http://localhost:8502`](http://localhost:8502)
```

.claude/agents/db-layer-refactoring-analyst.md-2-3 (1)
2-3: Simplify the agent description: it's overly complex with nested XML examples.

The description field contains embedded example blocks that may not parse correctly. Consider moving detailed examples into the main body or creating a separate examples file, and keep the description brief and self-contained.
🔎 Suggested simplification
Replace the complex nested examples in lines 2–3 with a concise one-liner:
```diff
-description: Use this agent when you need to analyze and plan database layer refactoring, particularly for decoupling ORM/BaaS dependencies (like Supabase, Firebase, Prisma), creating abstraction layers, or migrating to a multi-modular architecture. This agent specializes in Repository Pattern implementation, database abstraction strategies, and incremental migration planning.\n\nExamples:\n\n<example>\nContext: User wants to decouple Supabase from their codebase\nuser: "Our codebase is tightly coupled to Supabase and we want to be database-agnostic"\nassistant: "I'll use the db-layer-refactoring-analyst agent to map all Supabase dependencies and design an abstraction strategy."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User needs to create a repository layer\nuser: "We have database calls scattered throughout our services and want to centralize them"\nassistant: "Let me launch the db-layer-refactoring-analyst to analyze your current data access patterns and design a proper repository architecture."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User is planning a database migration\nuser: "We're considering moving from Supabase to a self-hosted PostgreSQL with SQLAlchemy"\nassistant: "I'll use the db-layer-refactoring-analyst to create a migration roadmap that minimizes risk and maintains functionality throughout."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User wants to modularize their data layer\nuser: "Our monolithic database module needs to be split into domain-specific modules"\nassistant: "Let me analyze the codebase with the db-layer-refactoring-analyst to identify domain boundaries and design a multi-modular data architecture."\n<Task tool call to db-layer-refactoring-analyst>\n</example>
+description: "Use this agent when you need to analyze and plan database layer refactoring, particularly for decoupling ORM/BaaS dependencies (like Supabase, Firebase, Prisma), creating abstraction layers, or migrating to a multi-modular architecture."
```

STAGING_VALIDATION_INDEX.md-216-225 (1)
216-225: Add container reset() fixture to main conftest.py to ensure global state cleanup between test runs.

The lazy singleton pattern in `archon/archon_graph.py` (lines 69–77) does create testing and concurrency challenges:

- Test isolation gap: `tests/test_container.py` correctly calls `reset()` in `setup_method()`, but `tests/conftest.py` and `tests/integration/conftest.py` lack universal container state cleanup. This means other test suites do not automatically reset the container's global `_repository_instance` between test runs, risking state leakage across tests.
- Concurrency risk: The async initialization in `get_repository_instance()` lacks synchronization primitives. Multiple concurrent async contexts could attempt initialization simultaneously, potentially bypassing the `if repository is None` check.
- Architectural redundancy: `archon_graph.py` maintains a separate global `repository` variable that wraps `container.get_repository_async()`, creating an orphaned mutable state outside the container's reset mechanism.

Add `reset()` to the module-level fixtures in `conftest.py` (e.g., as a session-scoped autouse fixture or function-scoped setup), or refactor `archon_graph.py` to call `container.reset()` where needed and rely directly on the container's `get_repository_async()` instead of maintaining a separate global variable. A fixture sketch follows this comment.
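A minimal sketch of the suggested autouse fixture (e.g., in `tests/conftest.py`), assuming `archon.container.reset()` clears the singleton state as described above:

```python
# tests/conftest.py — reset DI container state around every test.
import pytest

from archon import container

@pytest.fixture(autouse=True)
def reset_container():
    """Ensure no repository/embedding singletons leak between tests."""
    container.reset()
    yield
    container.reset()
```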
check_db_schema.py-8-14 (1)

8-14: Security: Use environment variables for database credentials.

Hardcoded database credentials pose a security risk. Consider using environment variables or a configuration file:

🔎 Proposed fix
```diff
+import os
+
 async def check_schema():
     """Check the current schema of site_pages table."""
     conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
+        host=os.getenv("POSTGRES_HOST", "localhost"),
+        port=int(os.getenv("POSTGRES_PORT", "5432")),
+        user=os.getenv("POSTGRES_USER", "postgres"),
+        password=os.getenv("POSTGRES_PASSWORD"),
+        database=os.getenv("POSTGRES_DB", "mydb")
     )
```

Committable suggestion skipped: line range outside the PR's diff.
check_db_schema.py-52-58 (1)
52-58: Security: Use environment variables for database credentials.

Same security concern as in `check_schema()`. Extract credentials to environment variables for consistency and security.

🔎 Proposed fix

```diff
+import os
+
 async def migrate_to_serial():
     """Migrate the id column from UUID to SERIAL."""
     conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
+        host=os.getenv("POSTGRES_HOST", "localhost"),
+        port=int(os.getenv("POSTGRES_PORT", "5432")),
+        user=os.getenv("POSTGRES_USER", "postgres"),
+        password=os.getenv("POSTGRES_PASSWORD"),
+        database=os.getenv("POSTGRES_DB", "mydb")
     )
```

archon/services/documentation_service.py-138-145 (1)
138-145: Potential `TypeError` if chunk content is `None`.

Line 142 joins `chunk.content` for all chunks, but `SitePage.content` is `Optional[str]`. If any chunk has `None` content, the join will fail with `TypeError: sequence item N: expected str instance, NoneType found`.

🔎 Suggested fix

```diff
 # Sort by chunk_number to ensure correct order
 sorted_chunks = sorted(chunks, key=lambda c: c.chunk_number)

 # Concatenate content
-full_content = "\n\n".join(chunk.content for chunk in sorted_chunks)
+full_content = "\n\n".join(
+    chunk.content for chunk in sorted_chunks if chunk.content
+)
```

archon/archon_graph.py-67-77 (1)
67-77: Potential race condition in lazy repository initialization.

If `get_repository_instance()` is called concurrently before the repository is initialized, multiple instances could be created. The second assignment would overwrite the first, but this wastes resources and could cause subtle issues with connection pools.

🔎 Suggested fix using asyncio.Lock

```diff
+import asyncio
+
 # Initialize repository and embedding service via container
 # Repository will be lazy-initialized on first use (supports async backends like PostgreSQL)
 repository = None
+_repository_lock = asyncio.Lock()
 embedding_service = get_embedding_service()

 async def get_repository_instance():
     """Get or create repository instance (lazy initialization for async backends)."""
     global repository
     if repository is None:
-        repository = await get_repository_async()
+        async with _repository_lock:
+            if repository is None:  # Double-check after acquiring lock
+                repository = await get_repository_async()
     return repository
```

CODE_CHANGES_SUMMARY.md-118-134 (1)
118-134: Add synchronization to prevent race conditions in lazy initialization.

The implementation lacks protection against concurrent access. If multiple async tasks call `get_repository_instance()` simultaneously before initialization completes, both will observe `repository is None` and both will call `await get_repository_async()`, leading to duplicate initialization. Add an `asyncio.Lock` to guard the initialization:

Example fix

```python
_init_lock = asyncio.Lock()

async def get_repository_instance():
    """Get or create repository instance (lazy initialization for async backends)."""
    global repository
    if repository is None:
        async with _init_lock:
            if repository is None:  # Double-check after acquiring lock
                repository = await get_repository_async()
    return repository
```

archon/container.py-129-172 (1)
129-172: Add lifecycle management for async resources.

The container creates async resources (PostgreSQL connection pool) but doesn't provide a way to properly close/cleanup these resources. This can lead to resource leaks and warnings about unclosed connections.

🔎 Recommended addition

```python
async def close_async_resources() -> None:
    """
    Close async resources (connection pools, etc.).
    Should be called during application shutdown.

    Example:
        >>> await close_async_resources()
    """
    global _repository_instance
    if _repository_instance is not None:
        repo_type = _config["repository_type"]
        if repo_type == "postgres":
            logger.info("Closing PostgreSQL connection pool")
            await _repository_instance.close()
        _repository_instance = None
```

Usage in application shutdown:

```python
# In your main application
async def shutdown():
    from archon.container import close_async_resources
    await close_async_resources()
```

archon/infrastructure/postgres/site_pages_repository.py-122-124 (1)
122-124: Use `logger.exception` instead of `logger.error` in exception handlers.

When logging exceptions, `logger.exception()` automatically includes the stack trace, making debugging significantly easier than `logger.error()`.

🔎 Proposed refactor

```diff
 except Exception as e:
-    logger.error(f"get_by_id(id={id}) -> ERROR: {e}")
+    logger.exception(f"get_by_id(id={id}) -> ERROR: {e}")
     raise
```

Apply this pattern to all 8 exception handlers in the file (lines 123, 154, 218, 254, 306, 365, 396, 438).
Also applies to: 153-155, 217-219, 253-255, 305-307, 364-366, 395-397, 437-439
archon/infrastructure/postgres/site_pages_repository.py-33-42 (1)
33-42: Validate table_name to prevent SQL injection.

While `table_name` defaults to `"site_pages"` and is typically not user-controlled, accepting it as a parameter without validation creates a potential SQL injection vector. Static analysis correctly flags this concern across multiple query constructions.

🔎 Recommended fix: Add validation

```diff
 def __init__(self, pool: Pool, table_name: str = "site_pages"):
     """
     Initialize the repository with a connection pool.

     Args:
         pool: asyncpg connection pool
         table_name: Name of the table to use
     """
+    # Validate table_name to prevent SQL injection
+    if not table_name.replace('_', '').isalnum():
+        raise ValueError(f"Invalid table name: {table_name}")
     self.pool = pool
     self.table_name = table_name
```
🟡 Minor comments (9)
.claude/agents/mcp-server-refactoring-analyst.md-325-342 (1)
325-342: Phase numbering gap: jumps from Phase 4 to Phase 6.

The roadmap goes from "Phase 4: Testing & Validation" (line 325) to "Phase 6: Production Considerations" (line 342), skipping Phase 5. This appears to be a typo or oversight.
Suggested fix
```diff
-### Phase 6: Production Considerations
+### Phase 5: Production Considerations
```

Or add the missing Phase 5 content if it was intended.
docs/CONTEXT_DB_TEST_RUNNER_AGENT.md-42-44 (1)
42-44: Avoid documenting default credentials in version-controlled files.

Documenting
`User: postgres` and `Password: postgres` poses a risk if these match actual deployed credentials. Even for development, consider using environment variables or referencing a secure credentials store instead.
| Host | `localhost` | | Port | `5432` | -| User | `postgres` | -| Password | `postgres` | +| User | `${POSTGRES_USER}` (from environment) | +| Password | `${POSTGRES_PASSWORD}` (from environment) | | Database | `mydb` |.claude/agents/db-test-runner-agent.md-82-90 (1)
82-90: Hardcoded credentials in documentation.

The documentation embeds PostgreSQL credentials directly in the configuration section. While this is documentation rather than live code, it sets a poor example and could lead to credential leakage if developers copy-paste without modification.
🔎 Recommended approach
Consider using placeholder values or referencing environment variables instead:
-User: postgres -Password: postgres +User: ${POSTGRES_USER} +Password: ${POSTGRES_PASSWORD}Or use explicit placeholders:
-User: postgres -Password: postgres +User: <your-postgres-user> +Password: <your-postgres-password>archon/infrastructure/postgres/connection.py-98-111 (1)
98-111: Return type annotation contradicts behavior.

The function is annotated to return
`Optional[Pool]` but always raises `RuntimeError` when `_pool` is None, meaning it never actually returns `None`. The annotation should be `Pool` to match the actual behavior.

🔎 Proposed fix
-def get_pool() -> Optional[Pool]: +def get_pool() -> Pool: """ Get the current connection pool. Returns: - The global pool instance, or None if not created + The global pool instance Raises: RuntimeError: If pool has not been created """.claude/agents/db-refactor-test-phase-agent.md-82-88 (1)
82-88: Avoid exposing plaintext credentials in documentation.

Hardcoded credentials (
`User: postgres`, `Password: postgres`) in shared documentation could pose a security risk if this pattern is copied to other environments. Consider using environment variable references instead.

🔎 Suggested fix
PostgreSQL Docker: Container: mg_postgres Host: localhost Port: 5432 - User: postgres - Password: postgres + User: ${POSTGRES_TEST_USER:-postgres} + Password: ${POSTGRES_TEST_PASSWORD} Status: Running.claude/agents/db-refactor-test-phase-agent.md-62-64 (1)
62-64: Hardcoded Windows paths reduce portability.

These absolute Windows paths (
`D:\archon\archon\docs\...`) will not work on Linux/macOS or in CI environments. Consider using relative paths from the repository root.

🔎 Suggested fix
-1. **Plan Phase 0 Tests**: `D:\archon\archon\docs\PLAN_PHASE0_TESTS.md` ← PRINCIPAL -2. **Migration Manifest**: `D:\archon\archon\docs\MIGRATION_MANIFEST.md` -3. **Plan Global**: `D:\archon\archon\docs\PLAN_REFACTORISATION_DATABASE_LAYER.md` +1. **Plan Phase 0 Tests**: `docs/PLAN_PHASE0_TESTS.md` ← PRINCIPAL +2. **Migration Manifest**: `docs/MIGRATION_MANIFEST.md` +3. **Plan Global**: `docs/PLAN_REFACTORISATION_DATABASE_LAYER.md`archon/agent_tools.py-234-234 (1)
234-234: Fix typo in comment.

"coule" should be "couple".
🔎 Proposed fix
- # Join everything together but limit the characters in case the page is massive (there are a coule big ones) + # Join everything together but limit the characters in case the page is massive (there are a couple big ones)archon/infrastructure/postgres/site_pages_repository.py-390-393 (1)
390-393: Fragile parsing of DELETE command result.

Parsing the DELETE result with
`.split()[-1]` is brittle and depends on the exact format of the command tag string. Consider using asyncpg's built-in parsing or a more robust approach.

🔎 Recommended fix
- # Parse "DELETE X" to get count - deleted_count = int(result.split()[-1]) + # Extract count from command tag (format: "DELETE n") + try: + deleted_count = int(result.split()[-1]) if result else 0 + except (ValueError, IndexError): + logger.warning(f"Could not parse DELETE result: {result}") + deleted_count = 0Alternatively, track the count with a SELECT query before deletion if accuracy is critical.
archon/container.py-91-115 (1)
91-115: Remove dead code: unused `postgres_config`.

The
`postgres_config` dictionary is built but never used before the RuntimeError is raised. This is confusing dead code that should be removed.

🔎 Proposed fix
elif repo_type == "postgres": - # PostgreSQL direct with asyncpg + pgvector - import os - from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool - - # Get PostgreSQL configuration from environment - postgres_config = { - "host": os.environ.get("POSTGRES_HOST", "localhost"), - "port": int(os.environ.get("POSTGRES_PORT", "5432")), - "database": os.environ.get("POSTGRES_DB", "archon"), - "user": os.environ.get("POSTGRES_USER", "postgres"), - "password": os.environ.get("POSTGRES_PASSWORD", ""), - } - - # Create pool and repository synchronously - # Note: Pool creation must be done in an async context - # So we raise an error with instructions + # PostgreSQL requires async initialization raise RuntimeError( "PostgreSQL repository requires async initialization. " "Use get_repository_async() instead, or initialize manually:\n\n" " from archon.infrastructure.postgres import PostgresSitePagesRepository\n" " repo = await PostgresSitePagesRepository.create(**config)\n" " from archon.container import override_repository\n" " override_repository(repo)\n" )
```diff
@@ -0,0 +1,448 @@
+# Livrable: Backend PostgreSQL pour Archon
+
+**Date de Livraison:** 2025-11-30
```
Critical: Fix future dates in documentation.
The delivery date and generation date are set to 2025-11-30, which is in the future. This should be corrected to 2024-11-30 to reflect the actual delivery date.
🔎 Proposed fix
-**Date de Livraison:** 2025-11-30
+**Date de Livraison:** 2024-11-30

And at the end of the file:
-*Généré le: 2025-11-30*
+*Généré le: 2024-11-30*

Also applies to: 446-446
🧰 Tools
🪛 LanguageTool
[grammar] ~3-~3: Essayez « livraison : »
Context: ...ckend PostgreSQL pour Archon Date de Livraison: 2025-11-30 Status: ✅ COMPLET - Prod...
(QB_NEW_FR_OTHER_ERROR_IDS_REPLACEMENT_MULTITOKEN)
🤖 Prompt for AI Agents
In DELIVERABLE_SUMMARY.md around line 3 (and also at line 446), the
delivery/generation dates are incorrectly set to 2025-11-30; update those
occurrences to 2024-11-30 so the document reflects the actual past delivery
date, and verify there are no other instances of 2025-11-30 elsewhere in the
file to replace similarly.
```bash
cd D:\archon\archon
python run_staging.py
```
Replace hard-coded Windows paths with repository-relative paths or environment variables.
This is a critical blocker for cross-platform development. Multiple sections reference hard-coded Windows paths:
- Line 47: `cd D:\archon\archon\`
- Line 65: `D:\archon\archon\...`
- Lines 77–81: Multiple references to `D:\archon\archon\`
These paths are unusable on Linux/macOS systems and create maintenance burden. Use relative paths from repository root instead.
🔎 Suggested fixes
For command examples, use relative paths or environment variables:
-cd D:\archon\archon\
+cd $(git rev-parse --show-toplevel) # or similar
python run_staging.py

For file references in tables:
-| Config env | `D:\archon\archon\.env.staging` | Variables d'environnement avec API keys |
+| Config env | `.env.staging` (repo root) | Variables d'environnement avec API keys |

For Agent path:
-| Agent | `D:\archon\archon\.claude\agents\db-staging-setup-agent.md` | Définition agent |
+| Agent | `.claude/agents/db-staging-setup-agent.md` | Définition agent |Also applies to: 64-71, 77-81
🤖 Prompt for AI Agents
In docs/CONTEXT_DB_STAGING_AGENT.md around lines 47 to 49 (and also apply to
lines 64–71 and 77–81), replace hard-coded Windows absolute paths like
"D:\archon\archon\" with repository-relative paths or environment variables;
update command examples to use relative paths (e.g., cd ./<repo-root> or
./run_staging.py) and change file/agent path references to use variables (e.g.,
${REPO_ROOT}/path or ./path) so the instructions work cross-platform and on
Linux/macOS.
```python
conn = await asyncpg.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="postgres",
    database="mydb"
)
```
Hardcoded credentials pose security risk.
Database credentials are hardcoded directly in the connection string. This is a critical security issue if this code is committed to version control or deployed.
🔎 Proposed fix using environment variables
+import os
+
async def migrate():
"""Migrate the id column from UUID to SERIAL."""
conn = await asyncpg.connect(
- host="localhost",
- port=5432,
- user="postgres",
- password="postgres",
- database="mydb"
+ host=os.getenv("POSTGRES_HOST", "localhost"),
+ port=int(os.getenv("POSTGRES_PORT", "5432")),
+ user=os.getenv("POSTGRES_USER", "postgres"),
+ password=os.getenv("POSTGRES_PASSWORD"),
+ database=os.getenv("POSTGRES_DB", "mydb")
)

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.10)
12-12: Possible hardcoded password assigned to argument: "password"
(S106)
🤖 Prompt for AI Agents
In migrate_schema.py around lines 8 to 14, the database connection uses
hardcoded credentials; replace them with values read from environment variables
(e.g., DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME) or a
configuration/secrets manager, validate that required variables are present and
raise a clear error if any are missing, and avoid defaulting to sensitive
credentials in code; ensure sensitive values are not committed (use .env for
local dev and a secrets manager for production) and pass the env-derived values
into asyncpg.connect.
```
# Archon Base Requirements
# Core dependencies shared across all environments
# Other requirements files import this with: -r requirements-base.txt

# Core AI/LLM Frameworks
pydantic-ai>=1.0.15
langgraph>=0.2.0

# LLM Providers
openai>=1.50.0
anthropic>=0.69.0
litellm>=1.50.0

# Web Framework
streamlit>=1.40.0
fastapi>=0.115.0
uvicorn>=0.34.0

# HTTP/Networking
aiohttp>=3.0.0
httpx>=0.27.0
requests>=2.30.0

# Utilities
python-dotenv>=1.0.0
pyyaml>=6.0.0
tenacity>=9.0.0
tqdm>=4.60.0

# Telemetry
logfire>=3.0.0
opentelemetry-api>=1.20.0
opentelemetry-sdk>=1.20.0

# MCP
mcp>=1.0.0
```
🧩 Analysis chain
🌐 Web query:
pydantic-ai 1.0.15 security vulnerabilities CVE
💡 Result:
I found no recorded CVEs for pydantic-ai v1.0.15. The pydantic-ai GitHub Security page shows no published advisories [1]. Known CVEs (e.g., CVE-2021-29510 and CVE-2024-3772) apply to the pydantic project, not pydantic-ai itself, and are documented separately [2][3]. If you want, I can monitor vulnerability databases and notify you if a CVE for pydantic-ai 1.0.15 appears.
Sources:
- pydantic-ai GitHub Security (no advisories). [1]
- CVE‑2021‑29510 (pydantic). [2]
- GHSA / CVE‑2024‑3772 (pydantic regular-expression DoS). [3]
🌐 Web query:
openai python 1.50.0 security vulnerabilities
💡 Result:
I found no published CVE or confirmed security vulnerability specific to openai Python v1.50.0. Security scanners (Snyk) report no direct vulnerabilities for openai versions, and public vulnerability trackers (Safety DB) list no CVE tied to v1.50.0. [1][2]
Known related issues:
- Compatibility/packaging bugs have been reported (e.g., httpx / PyInstaller / macOS issues around the 1.50.x line), but these are functional/packaging bugs, not disclosed CVEs. [3]
Recommendations
- Upgrade to the latest openai Python release (to pick up fixes and dependency updates). Verify with your package manager (pip/poetry). [1]
- Scan your environment with a vulnerability scanner (Snyk, safety, pip-audit) to catch dependency/indirect issues. [1][2]
- If you need a formal CVE check or forensic review, tell me which vendor/source you prefer and I’ll run a deeper search and produce dated evidence.
Sources:
[1] Snyk package page for openai (vulnerabilities and versions).
[2] Safety DB entry for openai.
[3] openai/openai-python GitHub issue discussing httpx/packaging problems (1.50.x).
🌐 Web query:
anthropic python 0.69.0 security vulnerabilities
💡 Result:
Short answer: There are no public, direct CVEs reported for the Python package anthropic version 0.69.0. Snyk’s vulnerability database shows no direct vulnerabilities for anthropic@0.69.0 (only dependency/indirect risks may exist). [1][2]
Details and recommendations (as of Jan 1, 2026)
- Snyk (package page and version page) reports no direct vulnerabilities for the anthropic PyPI package and for v0.69.0 specifically. [1][2]
- SafetyDB / curated feeds list anthropic versions (including 0.69.0) in their inventory but the only CVE they show against the project targets versions <0.3.2 (CVE-2022-23491); that CVE does not affect 0.69.0. [3]
- Recommendation: upgrade to the latest upstream anthropic release, keep dependencies patched, and run a supply-chain scan (Snyk/OSV/GHAS or your tooling) to detect any indirect/dependency vulnerabilities or newly disclosed issues. If you want, I can check the very latest release and advisories right now.
Sources:
[1] Snyk — anthropic (PyPI) package/vulnerability summary.
[2] Snyk — anthropic@0.69.0 version page.
[3] SafetyDB / Safety CLI entry for anthropic (inventory and CVE listing).
🌐 Web query:
streamlit 1.40.0 security vulnerabilities
💡 Result:
Short answer: Streamlit 1.40.0 is not affected by the July 2024 path‑traversal advisory (fixed in 1.37.0), but it is identified as vulnerable to an "Arbitrary File Upload" issue reported by Snyk (affects versions before 1.43.2). Recommendation: upgrade to a Streamlit release >= 1.43.2 (or the latest stable) to remediate. Sources: Snyk vulnerability entry [1], Streamlit security advisory (path traversal, fixed in 1.37.0) [2], NVD/CVE record for earlier directory‑traversal issues [3].
Sources
- [1] Snyk — streamlit@1.40.0 vulnerabilities (Arbitrary File Upload; upgrade to >=1.43.2).
- [2] Streamlit security advisory (STREAMLIT‑2024‑001) — path traversal fixed in 1.37.0.
- [3] NVD / CVE-2022-35918 — earlier directory traversal fixed in 1.11.1.
🌐 Web query:
fastapi 0.115.0 security vulnerabilities
💡 Result:
Short answer: There are no public CVEs reported for FastAPI core version 0.115.0 itself. Security advisories you’ll see tend to be about third‑party/malicious PyPI packages or vulnerabilities in related libraries (dependencies), not a CVE against FastAPI 0.115.0. [1][2][3][4]
Recommended actions:
- Upgrade to the latest FastAPI release (or at least the newest 0.115.x) and keep Starlette / Pydantic up to date. [1]
- Scan your environment for malicious/typo-squat packages (e.g., the backdoored fastapi-toolkit incident) and run SCA tools. [3]
- Run dependency vulnerability checks (Snyk/OSV/Dependabot/NVD) for Starlette, Pydantic and other deps used by your apps. [4]
Sources:
- FastAPI release notes (0.115.x). [1]
- Vulnerability database summary showing no known FastAPI 0.115.0 CVEs. [2]
- Datadog writeup on a backdoored third‑party FastAPI-related PyPI package. [3]
- Package-scan reports noting flagged vulnerabilities in some package versions/dependencies. [4]
🌐 Web query:
requests 2.30.0 python security vulnerabilities
💡 Result:
Short answer: Requests 2.30.0 is affected by known security issues — most importantly a Proxy-Authorization header leakage (CVE-2023-32681) and related session/control-flow bugs that allow certificate-verification bypass when a Session first uses verify=False. These issues were fixed in later 2.31.x / 2.32.x releases; upgrade to a patched release (2.31.0+ for the proxy leak, and 2.32.2+ / 2.32.4+ for the session/other fixes) or the latest 2.32.x. [1][2][3]
References:
- Requests release notes / security advisory (describes proxy header leak and affected versions). [1]
- Requests changelog / release pages for 2.30.0 → 2.31.0 → 2.32.x. [2]
- Snyk vulnerability summary for requests 2.30.0 (lists the issues and recommended fixed versions). [3]
🌐 Web query:
aiohttp 3.0.0 security vulnerabilities
💡 Result:
Short answer: aiohttp 3.0.0 is old and affected by multiple security issues discovered since that release (open-redirect, HTTP request-smuggling/parser bugs, directory-traversal, XSS, etc.). You should upgrade to a supported, patched release (or at minimum avoid vulnerable features / apply the documented workarounds). Sources and notable CVEs below.
Notable vulnerabilities affecting older 3.x releases
- Open redirect (CVE-2021-21330) — fixed in 3.7.4; affects normalize_path_middleware. [3]
- HTTP request smuggling / parser bugs (CVE-2023-47627, CVE-2023-47641, CVE-2024-23829, related fixes in 3.8.6, 3.8.5, 3.9.2, etc.). Pure-Python parser / AIOHTTP_NO_EXTENSIONS usage is repeatedly implicated. [2][1]
- Directory traversal / static file issues (CVE-2024-23334) — fixed in 3.9.2 (and related static-route fixes). [4]
- XSS in static index pages (CVE-2024-27306) — fixed in 3.9.4; mitigated by using a reverse proxy or disabling show_index. [5]
- Path-traversal with compressed symlinks in FileResponse (CVE-2024-42367) — fixed on 3.10.2. [2]
- Memory-leak via MatchInfoError when middleware enabled (CVE-2024-52303) — fixed in 3.10.11. [6]
Recommendations
- Upgrade aiohttp to a current patched release (do not stay on 3.0.0). The fixes above appear across 3.7.x → 3.12.x series depending on the issue; test and move to the latest stable in your environment. [2][1][4]
- If you cannot upgrade immediately: avoid using affected features (e.g., normalize_path_middleware, show_index, follow_symlinks=True), disable pure-Python parser where advised (use the llhttp extensions instead of AIOHTTP_NO_EXTENSIONS), and place a vetted reverse proxy in front of aiohttp. [2][5][3]
Sources
[1] Summary of aiohttp parser/request-smuggling advisories (Meterian component summary).
[2] NVD / GitHub advisory for CVE-2025-53643 (parser/trailer request-smuggling) and related NVD entries.
[3] NVD entry for CVE-2021-21330 (open redirect).
[4] CVE details for CVE-2024-23334 (static file/directory traversal).
[5] CVE details for CVE-2024-27306 (XSS in static index).
[6] CVE details for CVE-2024-52303 (memory leak).
If you want, I can (a) check the exact earliest patched aiohttp version that removes every specific CVE in your environment, or (b) give pip commands and a minimal upgrade plan tailored to your current version.
Update package versions to patch known security vulnerabilities.
The base requirements file contains several packages with published security vulnerabilities:
- `aiohttp>=3.0.0` is critically outdated (2017 release) with multiple published CVEs including open redirect (CVE-2021-21330), HTTP request smuggling (CVE-2023-47627, CVE-2023-47641, CVE-2024-23829), directory traversal (CVE-2024-23334), XSS (CVE-2024-27306), and path traversal (CVE-2024-42367). Upgrade to a current patched release (3.9.x or later).
- `streamlit>=1.40.0` is affected by an arbitrary file upload vulnerability. Upgrade to >=1.43.2.
- `requests>=2.30.0` has proxy-authorization header leakage (CVE-2023-32681) and session/certificate-verification bypass issues. Upgrade to >=2.31.0, or >=2.32.2+ for all fixes.
The other specified versions (pydantic-ai, openai, anthropic, fastapi) are not affected by published CVEs at their specified constraints.
🤖 Prompt for AI Agents
requirements-base.txt lines 1-36: bump vulnerable package constraints—update
aiohttp to a patched series (e.g., change to >=3.9.0), update streamlit to
>=1.43.2, and update requests to >=2.32.2 (or latest compatible release); modify
the requirement lines accordingly, run dependency install/tests and any
lockfile/constraints regeneration to ensure compatibility, and verify CI passes
before merging.
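A sketch of those bumps against the requirements-base.txt shown above; the floors follow the minimum patched versions named in this review (note that the later aiohttp CVEs, CVE-2024-42367 and CVE-2024-52303, need 3.10.2 and 3.10.11 respectively, so a 3.10.x floor is the safer choice):

```diff
-streamlit>=1.40.0
+streamlit>=1.43.2
-aiohttp>=3.0.0
+aiohttp>=3.10.11
-requests>=2.30.0
+requests>=2.32.2
```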
- Add explicit exception chaining (from e) to all repository methods
- Wrap exceptions in RuntimeError with descriptive messages
- Improves stack trace debugging and error context
- Addresses CodeRabbit review feedback

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 4
🧹 Nitpick comments (4)
archon/infrastructure/postgres/site_pages_repository.py (4)
122-124: Prefer `logging.exception` over `logging.error` in exception handlers.

Throughout this file, exception handlers use `logging.error` to log errors. Using `logging.exception` instead would automatically include the full traceback in the logs, significantly improving debuggability without requiring manual traceback formatting.

This applies to all exception handlers in the file (lines 123, 154, 220, 256, 308, 367, 398, 440).
🔎 Example fix for this method
```diff
 except Exception as e:
-    logger.error(f"get_by_id(id={id}) -> ERROR: {e}")
+    logger.exception(f"get_by_id(id={id}) -> ERROR: {e}")
     raise RuntimeError(f"Failed to get page by id {id}") from e
```

Based on static analysis hints.
208-211: Consider logging when similarity clipping occurs.

The code clips similarity values to the [0, 1] range with a comment noting this can happen with poorly normalized embeddings. While the clipping prevents invalid results, silently correcting data quality issues may hide problems. Consider logging a warning when clipping occurs to aid in identifying embedding quality issues.
🔎 Proposed enhancement
```diff
 for row in rows:
     page = self._row_to_site_page(row)
-    # Clip similarity to valid range [0, 1]
-    # Note: Can be negative with poorly normalized embeddings
-    similarity = max(0.0, min(1.0, float(row["similarity"])))
+    raw_similarity = float(row["similarity"])
+    # Clip similarity to valid range [0, 1]
+    # Note: Can be negative with poorly normalized embeddings
+    if raw_similarity < 0.0 or raw_similarity > 1.0:
+        logger.warning(
+            f"Clipping similarity value {raw_similarity} to [0, 1] range. "
+            "This may indicate poorly normalized embeddings."
+        )
+    similarity = max(0.0, min(1.0, raw_similarity))
     results.append(SearchResult(page=page, similarity=similarity))
```
333-359: Optimize batch insert with bulk operations.

The current implementation uses individual INSERT statements in a loop, which requires N database roundtrips even within the transaction. For large batches, this can significantly impact performance. Consider using `executemany` or PostgreSQL's `COPY` command for bulk inserts.

🔎 Proposed optimization using executemany

```diff
 async with self.pool.acquire() as conn:
     # Use a transaction for batch insert
     async with conn.transaction():
-        inserted = []
-        for page in pages:
-            # Prepare embedding
-            embedding_str = None
-            if page.embedding:
-                embedding_str = str(page.embedding)
-
-            row = await conn.fetchrow(
-                f"""
-                INSERT INTO {self.table_name}
-                (url, chunk_number, title, summary, content, metadata, embedding)
-                VALUES ($1, $2, $3, $4, $5, $6, $7::vector)
-                RETURNING *
-                """,
-                page.url,
-                page.chunk_number,
-                page.title,
-                page.summary,
-                page.content,
-                page.metadata.model_dump_json() if page.metadata else "{}",
-                embedding_str,
-            )
-            inserted.append(self._row_to_site_page(row))
+        # Prepare all values for bulk insert
+        values = [
+            (
+                page.url,
+                page.chunk_number,
+                page.title,
+                page.summary,
+                page.content,
+                page.metadata.model_dump_json() if page.metadata else "{}",
+                str(page.embedding) if page.embedding else None,
+            )
+            for page in pages
+        ]
+
+        # Bulk insert and return all rows
+        rows = await conn.fetch(
+            f"""
+            INSERT INTO {self.table_name}
+            (url, chunk_number, title, summary, content, metadata, embedding)
+            SELECT * FROM UNNEST($1::text[], $2::int[], $3::text[], $4::text[],
+                                 $5::text[], $6::jsonb[], $7::vector[])
+            RETURNING *
+            """,
+            [v[0] for v in values],  # urls
+            [v[1] for v in values],  # chunk_numbers
+            [v[2] for v in values],  # titles
+            [v[3] for v in values],  # summaries
+            [v[4] for v in values],  # contents
+            [v[5] for v in values],  # metadata
+            [v[6] for v in values],  # embeddings
+        )
+        inserted = [self._row_to_site_page(row) for row in rows]
```

Note: This uses PostgreSQL's `UNNEST` function to handle arrays. Alternatively, you could use `executemany`, which is simpler but doesn't return the inserted rows in a single call.
392-394: Consider the fragility of DELETE result parsing.

Parsing the execute result string by splitting on spaces and taking the last element works but relies on undocumented asyncpg behavior. While this format appears stable, it's not part of asyncpg's documented API contract. The current approach is acceptable but be aware it may break if asyncpg changes the result format.
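A defensive variant, assuming only that asyncpg's DELETE command tag ends in the row count (the "DELETE <n>" format); the helper name is illustrative:

```python
def parse_delete_count(command_tag: str | None) -> int:
    """Best-effort parse of an asyncpg DELETE command tag like 'DELETE 3'."""
    if not command_tag:
        return 0
    try:
        return int(command_tag.rsplit(" ", 1)[-1])
    except ValueError:
        # Unexpected tag format; treat the count as unknown rather than crash.
        return 0
```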
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
🧰 Additional context used
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
110-110: Possible SQL injection vector through string-based query construction
(S608)
123-123: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
124-124: Avoid specifying long messages outside the exception class
(TRY003)
141-145: Possible SQL injection vector through string-based query construction
(S608)
154-154: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
155-155: Avoid specifying long messages outside the exception class
(TRY003)
182-187: Possible SQL injection vector through string-based query construction
(S608)
220-220: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
239-243: Possible SQL injection vector through string-based query construction
(S608)
248-248: Possible SQL injection vector through string-based query construction
(S608)
256-256: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
257-257: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
285-290: Possible SQL injection vector through string-based query construction
(S608)
308-308: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
309-309: Avoid specifying long messages outside the exception class
(TRY003)
325-325: Avoid specifying long messages outside the exception class
(TRY003)
344-349: Possible SQL injection vector through string-based query construction
(S608)
367-367: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
368-368: Avoid specifying long messages outside the exception class
(TRY003)
385-388: Possible SQL injection vector through string-based query construction
(S608)
398-398: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
399-399: Avoid specifying long messages outside the exception class
(TRY003)
414-414: Possible SQL injection vector through string-based query construction
(S608)
440-440: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
441-441: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
60-60: Consider moving this statement to an else block
(TRY300)
63-63: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
89-89: Consider moving this statement to an else block
(TRY300)
92-92: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
138-138: Consider moving this statement to an else block
(TRY300)
141-141: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
169-169: Consider moving this statement to an else block
(TRY300)
172-172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
189-189: Avoid specifying long messages outside the exception class
(TRY003)
201-201: Consider moving this statement to an else block
(TRY300)
204-204: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
205-205: Avoid specifying long messages outside the exception class
(TRY003)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Consider moving this statement to an else block
(TRY300)
239-239: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
266-266: Consider moving this statement to an else block
(TRY300)
269-269: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
303-303: Consider moving this statement to an else block
(TRY300)
306-306: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
307-307: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (5)
archon/infrastructure/supabase/site_pages_repository.py (2)
188-189: LGTM! Excellent id validation for insert operations.

The validation ensuring `id is None` before insertion prevents accidental overwrites and correctly enforces database-generated ids. Clear error messages and consistent across both single and batch insert methods.

Also applies to: 220-221
161-161: The PostgreSQL JSONB operators (`metadata->>source`) are properly supported in Supabase Python client v2.11.0. This is a standard, documented pattern for querying JSONB fields via the `.eq()` method, as confirmed by official Supabase documentation. The code at lines 161, 258, and 293 uses these operators correctly.

archon/infrastructure/postgres/site_pages_repository.py (3)
78-88: LGTM! The connection pool creation is well-structured with appropriate parameters and informative logging.
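Roughly what that pool setup looks like with asyncpg; the connection details and size limits below are assumptions, not the repository's values:

```python
import asyncpg

async def make_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        host="localhost",
        port=5432,
        database="archon",
        user="postgres",
        password="",
        min_size=2,   # keep a couple of warm connections ready
        max_size=10,  # bound concurrent connections to the database
    )
```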
259-310: LGTM! The insert implementation correctly validates inputs, handles embeddings for pgvector, and uses parameterized queries for safety.
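A sketch of that shape of insert, assuming an illustrative schema (table and column names are not the actual ones):

```python
import asyncpg

async def insert_page(pool: asyncpg.Pool, url: str, content: str,
                      embedding: list[float]) -> int:
    # Values travel as bound parameters ($1..$3), never as SQL text.
    query = """
        INSERT INTO site_pages (url, content, embedding)
        VALUES ($1, $2, $3::vector)
        RETURNING id
    """
    # pgvector accepts its text form, e.g. '[0.1,0.2,...]', cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    async with pool.acquire() as conn:
        return await conn.fetchval(query, url, content, vector_literal)
```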
443-479: LGTM! The row-to-model conversion is well-implemented with defensive parsing that handles multiple data formats for both metadata and embeddings. This robustness helps prevent runtime errors from unexpected data formats.
Security fixes: - Update h11>=0.16.0 to fix CVE-2025-43859 vulnerability - Add VALID_TABLE_NAMES whitelist to prevent SQL injection - Add VALID_COLUMN_NAMES whitelist for filter key validation - Sanitize metadata keys (alphanumeric + underscore only) Logging improvements: - Replace logger.error() with logger.exception() for automatic stacktrace - Proper exception chaining with RuntimeError wrapper 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
archon/infrastructure/supabase/site_pages_repository.py (1)
38-46: [DUPLICATE] Use AsyncClient for proper async operations. All methods in this class are declared async def per the interface contract, but the Supabase client operations remain synchronous (using Client instead of AsyncClient). This blocks the event loop when these methods are called from async contexts.
This issue was previously flagged and remains unresolved. Since supabase-py provides AsyncClient and acreate_client, the repository should be updated to use the async client and await all operations.
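A sketch of the suggested migration, per the supabase-py async API the comment references (the repository wiring is illustrative):

```python
from supabase import AsyncClient, acreate_client

async def make_async_client(url: str, key: str) -> AsyncClient:
    # acreate_client builds the non-blocking client the comment asks for.
    return await acreate_client(url, key)

# Repository methods would then await their calls, for example:
#   response = await self.client.table("site_pages").select("*").eq("id", page_id).execute()
```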
🧹 Nitpick comments (1)
archon/infrastructure/postgres/site_pages_repository.py (1)
175-239: Consider validating all filter keys in search_similar(). The search_similar() method only handles the "source" filter key (line 212), while other filter keys would be silently ignored. This is inconsistent with the count() method which validates and applies all filter keys using whitelists.
Consider either:
- Validating and applying additional filter keys (similar to count() method), or
- Documenting that only "source" filtering is supported and logging a warning if other keys are provided.
This ensures consistent behavior across methods and prevents silent failures.
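The second option might look like this; the helper name and filter shape are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def extract_source_filter(filters: dict | None) -> str | None:
    """Return the supported 'source' filter, warning about anything else."""
    if not filters:
        return None
    unsupported = set(filters) - {"source"}
    if unsupported:
        logger.warning(
            "search_similar() only supports the 'source' filter; ignoring: %s",
            sorted(unsupported),
        )
    return filters.get("source")
```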
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
- requirements.txt
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
🧬 Code graph analysis (1)
archon/infrastructure/postgres/site_pages_repository.py (3)
- archon/domain/interfaces/site_pages_repository.py (1): ISitePagesRepository (14-197)
- archon/domain/models/site_page.py (2): SitePage (32-83), SitePageMetadata (13-29)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
🪛 OSV Scanner (2.3.1)
requirements.txt
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK vulnerability in the FastMCP Server causes validation error, leading to DoS
[HIGH] 79-79: mcp 1.2.1: Model Context Protocol (MCP) Python SDK does not enable DNS rebinding protection by default
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK has Unhandled Exception in Streamable HTTP Transport, Leading to Denial of Service
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
55-58: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Possible SQL injection vector through string-based query construction
(S608)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
159-163: Possible SQL injection vector through string-based query construction
(S608)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
200-205: Possible SQL injection vector through string-based query construction
(S608)
239-239: Avoid specifying long messages outside the exception class
(TRY003)
257-261: Possible SQL injection vector through string-based query construction
(S608)
266-266: Possible SQL injection vector through string-based query construction
(S608)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
291-291: Avoid specifying long messages outside the exception class
(TRY003)
303-308: Possible SQL injection vector through string-based query construction
(S608)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
343-343: Avoid specifying long messages outside the exception class
(TRY003)
362-367: Possible SQL injection vector through string-based query construction
(S608)
386-386: Avoid specifying long messages outside the exception class
(TRY003)
403-406: Possible SQL injection vector through string-based query construction
(S608)
417-417: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Possible SQL injection vector through string-based query construction
(S608)
466-466: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
69-69: Consider moving this statement to an else block
(TRY300)
73-73: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Consider moving this statement to an else block
(TRY300)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
147-147: Consider moving this statement to an else block
(TRY300)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
178-178: Consider moving this statement to an else block
(TRY300)
182-182: Avoid specifying long messages outside the exception class
(TRY003)
198-198: Avoid specifying long messages outside the exception class
(TRY003)
210-210: Consider moving this statement to an else block
(TRY300)
214-214: Avoid specifying long messages outside the exception class
(TRY003)
230-230: Avoid specifying long messages outside the exception class
(TRY003)
245-245: Consider moving this statement to an else block
(TRY300)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
275-275: Consider moving this statement to an else block
(TRY300)
279-279: Avoid specifying long messages outside the exception class
(TRY003)
320-320: Consider moving this statement to an else block
(TRY300)
324-324: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
archon/infrastructure/supabase/site_pages_repository.py (2)
17-24: Security whitelists properly implemented. The VALID_TABLE_NAMES and VALID_COLUMN_NAMES whitelists provide defense-in-depth against SQL injection. Combined with the validation logic in the count() method (lines 302-312), this addresses the security concerns mentioned in the commit message.
297-312: Excellent input validation and sanitization. The filter handling properly validates both metadata keys (alphanumeric + underscore check) and column names (whitelist check), with appropriate warning logs for invalid inputs. This aligns with the security improvements described in the commit message.
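For context, a minimal sketch of this style of validation; the whitelist contents and regex are illustrative:

```python
import re

VALID_COLUMN_NAMES = {"url", "chunk_number"}      # illustrative whitelist
METADATA_KEY_RE = re.compile(r"^[A-Za-z0-9_]+$")  # alphanumeric + underscore

def is_valid_filter_key(key: str) -> bool:
    # Known columns pass outright; metadata keys must match the safe pattern.
    return key in VALID_COLUMN_NAMES or bool(METADATA_KEY_RE.match(key))
```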
archon/infrastructure/postgres/site_pages_repository.py (3)
42-60: Excellent security improvement: table_name validation. The constructor now validates table_name against the VALID_TABLE_NAMES whitelist, preventing SQL injection even if this constraint changes in the future. This resolves the major security concern from the previous review.
Based on past review feedback.
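The shape of that constructor guard, sketched with an illustrative whitelist:

```python
VALID_TABLE_NAMES = {"site_pages"}  # illustrative; the real set may differ

class PostgresSitePagesRepository:
    def __init__(self, pool, table_name: str = "site_pages") -> None:
        if table_name not in VALID_TABLE_NAMES:
            raise ValueError(f"Invalid table name: {table_name!r}")
        self._pool = pool
        # Safe to interpolate into SQL later: the name passed a fixed whitelist.
        self._table = table_name
```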
436-466: Excellent security improvement: filter key validation. The count() method now properly validates filter keys using both the VALID_COLUMN_NAMES whitelist for regular columns and alphanumeric+underscore sanitization for metadata keys. This resolves the SQL injection risk from the previous review.
Based on past review feedback.
62-106: Well-designed factory pattern for async initialization. The create() class method provides a clean async factory for creating repository instances with connection pooling. This aligns well with the async repository access pattern described in the PR objectives.
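Roughly the pattern being described: a classmethod owns the awaitable pool creation so __init__ stays synchronous. Parameter handling here is illustrative:

```python
import asyncpg

class PostgresSitePagesRepository:
    def __init__(self, pool: asyncpg.Pool) -> None:
        self._pool = pool  # __init__ takes a ready pool; no I/O here

    @classmethod
    async def create(cls, **config) -> "PostgresSitePagesRepository":
        # The async factory performs the awaitable setup.
        pool = await asyncpg.create_pool(**config)
        return cls(pool)

# Usage: repo = await PostgresSitePagesRepository.create(host="localhost", database="archon")
```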
requirements.txt (2)
44-44: Security fix properly applied. The upgrade to h11>=0.16.0 successfully addresses the critical HTTP request-smuggling vulnerability (CVE-2025-43859) flagged in the previous review. The inline comment clearly documents the security rationale.
177-178: PostgreSQL backend dependencies properly added. The asyncpg and pgvector dependencies align with the PostgreSQL repository implementation (PostgresSitePagesRepository) and support the database abstraction layer objectives described in the PR.
lxml==5.3.0
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mcp==1.2.1
🧩 Analysis chain
🌐 Web query:
GHSA-3qhf-m339-9g5v mcp FastMCP Server validation
💡 Result:
Summary: GHSA-3qhf-m339-9g5v reports a validation bug in the MCP Python SDK (PyPI package "mcp") that can raise an unhandled exception when FastMCP Server processes malformed requests, causing 500 errors/denial of service. It is tracked as CVE-2025-53366 (High severity). Affected versions: mcp < 1.9.4. Fixed in: mcp 1.9.4 — upgrade to 1.9.4 or later. References: [1], [2].
References
- GHSA advisory (modelcontextprotocol/python-sdk). [1]
- OSV / CVE record (CVE-2025-53366). [2]
🌐 Web query:
GHSA-9h52-p55h-vw2f mcp DNS rebinding
💡 Result:
Summary
- GHSA-9h52-p55h-vw2f (CVE-2025-66416): the MCP (mcp on PyPI) Python SDK did not enable DNS-rebinding protection by default for HTTP-based servers running on localhost. A malicious website could exploit DNS rebinding to reach an unauthenticated local FastMCP server (streamable HTTP or SSE) and invoke tools or access exposed resources. [1][2]
Affected / fixed
- Affected: mcp versions < 1.23.0.
- Fixed: update to mcp v1.23.0 (FastMCP now enables DNS-rebinding protection by default when host is 127.0.0.1 or localhost). [1][2]
Mitigation/notes
- Upgrade to mcp ≥ 1.23.0. [1]
- Avoid running unauthenticated HTTP-based MCP servers on localhost; prefer authenticated or stdio transport. [1][2]
- If using custom low-level transports (StreamableHTTPSessionManager, SseServerTransport), explicitly configure TransportSecuritySettings to enable DNS-rebinding protections. [1][2]
Sources
- GitHub security advisory for GHSA-9h52-p55h-vw2f. [1]
- OSV / CVE-2025-66416 entry summarizing the issue and fix. [2]
🌐 Web query:
GHSA-j975-95f5-7wqh mcp Streamable HTTP Transport
💡 Result:
Summary — GHSA-j975-95f5-7wqh (alias CVE-2025-53365)
- Issue: Uncaught ClosedResourceError in the MCP Python SDK's streamable HTTP transport that can crash the server (denial of service) if a client triggers an exception after establishing a streamable HTTP session. [1][2]
- Affected versions: mcp < 1.10.0. [1][3]
- Severity: High (CVSS 8.7). [2]
- Fix / Mitigation: Upgrade mcp to version 1.10.0 or later (patch introduced in v1.10.0). [1][3]
References
- GitHub Security Advisory GHSA-j975-95f5-7wqh (advisory + patch links). [1]
- OSV / CVE-2025-53365 entry (CVSS and published details). [2]
- GitLab advisory / vendor summary (upgrade guidance). [3]
🏁 Script executed:
#!/bin/bash
# Check PyPI for mcp package information
echo "=== Checking PyPI for mcp package ==="
curl -s https://pypi.org/pypi/mcp/json | jq -r '.info | "Version: \(.version)\nLicense: \(.license)\nHome page: \(.home_page)"'
echo ""
echo "=== Checking all versions of mcp ==="
curl -s https://pypi.org/pypi/mcp/json | jq -r '.releases | keys[]' | sort -V | tail -20
Repository: coleam00/Archon
Length of output: 458
Upgrade mcp to 1.25.0 to address HIGH severity vulnerabilities.
The mcp package version 1.2.1 is affected by three HIGH severity vulnerabilities:
- GHSA-3qhf-m339-9g5v (CVE-2025-53366): Validation error in FastMCP Server causing DoS — fixed in v1.9.4+
- GHSA-j975-95f5-7wqh (CVE-2025-53365): Unhandled ClosedResourceError in streamable HTTP transport — fixed in v1.10.0+
- GHSA-9h52-p55h-vw2f (CVE-2025-66416): DNS rebinding protection not enabled by default — fixed in v1.23.0+
Upgrade to mcp 1.25.0 (latest available) to address all three vulnerabilities.
🧰 Tools
🪛 OSV Scanner (2.3.1)
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK vulnerability in the FastMCP Server causes validation error, leading to DoS
[HIGH] 79-79: mcp 1.2.1: Model Context Protocol (MCP) Python SDK does not enable DNS rebinding protection by default
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK has Unhandled Exception in Streamable HTTP Transport, Leading to Denial of Service
🤖 Prompt for AI Agents
In requirements.txt at line 79, the pinned mcp==1.2.1 is vulnerable; update the
version to mcp==1.25.0 to incorporate fixes for GHSA-3qhf-m339-9g5v,
GHSA-j975-95f5-7wqh, and GHSA-9h52-p55h-vw2f; after updating the requirements
file, regenerate any lock/constraints files used by the project (pip-compile /
pip freeze / poetry lock), run CI tests and a dependency audit (pip-audit or
Snyk) to confirm no remaining vulnerabilities, and ensure the change does not
break compatibility with the codebase.
CodeRabbit fixes (PR coleam00#915): 1. Supabase AsyncClient Migration: - Add get_supabase_async_client() to utils.py - Update SupabaseSitePagesRepository to support both sync and async clients - Use conditional await for backwards compatibility - Add warning when sync client is used in async context - Update container.py get_repository_async() for Supabase 2. Filter Validation in search_similar(): - Add warning for unsupported filter keys in PostgreSQL repo - Document that only "source" filter is supported for vector search - Log warning instead of silently ignoring unsupported keys 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- archon/container.py
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
- utils/utils.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
🧬 Code graph analysis (2)
archon/infrastructure/postgres/site_pages_repository.py (4)
- archon/domain/interfaces/site_pages_repository.py (1): ISitePagesRepository (14-197)
- archon/domain/models/site_page.py (2): SitePage (32-83), SitePageMetadata (13-29)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
- archon/infrastructure/postgres/connection.py (1): create_pool (18-77)
archon/infrastructure/supabase/site_pages_repository.py (4)
- archon/domain/interfaces/site_pages_repository.py (9): ISitePagesRepository (14-197), get_by_id (26-41), find_by_url (44-62), search_similar (65-93), list_unique_urls (96-110), insert (113-137), insert_batch (140-160), delete_by_source (163-179), count (182-197)
- archon/domain/models/site_page.py (1): SitePage (32-83)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
- archon/infrastructure/supabase/mappers.py (3): dict_to_site_page (14-63), site_page_to_dict (66-108), dict_to_search_result (111-145)
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
55-58: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Possible SQL injection vector through string-based query construction
(S608)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
159-163: Possible SQL injection vector through string-based query construction
(S608)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
202-207: Possible SQL injection vector through string-based query construction
(S608)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
267-271: Possible SQL injection vector through string-based query construction
(S608)
276-276: Possible SQL injection vector through string-based query construction
(S608)
285-285: Avoid specifying long messages outside the exception class
(TRY003)
301-301: Avoid specifying long messages outside the exception class
(TRY003)
313-318: Possible SQL injection vector through string-based query construction
(S608)
337-337: Avoid specifying long messages outside the exception class
(TRY003)
353-353: Avoid specifying long messages outside the exception class
(TRY003)
372-377: Possible SQL injection vector through string-based query construction
(S608)
396-396: Avoid specifying long messages outside the exception class
(TRY003)
413-416: Possible SQL injection vector through string-based query construction
(S608)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
442-442: Possible SQL injection vector through string-based query construction
(S608)
476-476: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
85-85: Consider moving this statement to an else block
(TRY300)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
114-114: Consider moving this statement to an else block
(TRY300)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
164-164: Consider moving this statement to an else block
(TRY300)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Consider moving this statement to an else block
(TRY300)
199-199: Avoid specifying long messages outside the exception class
(TRY003)
215-215: Avoid specifying long messages outside the exception class
(TRY003)
228-228: Consider moving this statement to an else block
(TRY300)
232-232: Avoid specifying long messages outside the exception class
(TRY003)
248-248: Avoid specifying long messages outside the exception class
(TRY003)
264-264: Consider moving this statement to an else block
(TRY300)
268-268: Avoid specifying long messages outside the exception class
(TRY003)
294-294: Consider moving this statement to an else block
(TRY300)
298-298: Avoid specifying long messages outside the exception class
(TRY003)
339-339: Consider moving this statement to an else block
(TRY300)
343-343: Avoid specifying long messages outside the exception class
(TRY003)
archon/container.py
86-89: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Local variable postgres_config is assigned to but never used
Remove assignment to unused variable postgres_config
(F841)
109-116: Avoid specifying long messages outside the exception class
(TRY003)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
159-162: Avoid specifying long messages outside the exception class
(TRY003)
214-217: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
archon/infrastructure/supabase/site_pages_repository.py (2)
40-61: Good hybrid approach for async/sync client support. The implementation correctly handles both AsyncClient and Client with conditional await and appropriate warnings. This addresses the previous review concern while maintaining backward compatibility.
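A sketch of the conditional-await idea; the helper is an assumption, not the actual code:

```python
import inspect
import logging

logger = logging.getLogger(__name__)

async def execute(query):
    """Run a Supabase query builder against either client flavor."""
    result = query.execute()
    if inspect.isawaitable(result):  # AsyncClient: execute() returns a coroutine
        return await result
    logger.warning("Sync Supabase client used in async context; this blocks the event loop")
    return result                    # Client: already a response object
```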
144-156: The match_site_pages RPC function is safe from SQL injection and does not require input validation as described. The filter parameter is typed as JSONB and used with PostgreSQL's parameterized @> containment operator (WHERE metadata @> filter), which safely handles the filter without any risk of SQL injection. JSONB parameters are bound safely by PostgreSQL and cannot be used to modify the query structure. There is no dynamic SQL generation or string concatenation, so validating filter keys is unnecessary for security purposes. The lack of key validation is a design choice for flexibility, not a vulnerability.
Likely an incorrect or invalid review comment.
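The safety argument in miniature, sketched with asyncpg and illustrative identifiers: the filter travels as a bound JSONB parameter, so it can never alter the query text:

```python
import json

async def count_matching(pool, filters: dict) -> int:
    # metadata @> $1 is containment over a bound parameter, not string-built SQL.
    query = "SELECT count(*) FROM site_pages WHERE metadata @> $1::jsonb"
    async with pool.acquire() as conn:
        return await conn.fetchval(query, json.dumps(filters))
```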
archon/container.py (3)
130-189: LGTM! Clean async repository initialization. The async version properly handles async initialization for backends that require it (Supabase AsyncClient, PostgreSQL pool creation) while delegating to the sync version for backends that don't need async setup (memory).
192-231: LGTM! Well-structured embedding service factory. The embedding service initialization properly handles both production (OpenAI) and test (mock) scenarios with clear error messages when dependencies are unavailable.
234-290: LGTM! Useful testing utilities. The service factory, reset, and override functions provide clean testing support and dependency injection capabilities.
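The override/reset helpers likely reduce to something like this sketch (module-level state shown for illustration only):

```python
_repository = None  # module-level singleton slot

def override_repository(repo) -> None:
    """Inject a test double (e.g. the in-memory repository)."""
    global _repository
    _repository = repo

def reset() -> None:
    """Clear cached instances so each test starts from a clean container."""
    global _repository
    _repository = None
```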
archon/infrastructure/postgres/site_pages_repository.py (4)
42-60: LGTM! Proper table name validation implemented. The table name validation against the VALID_TABLE_NAMES whitelist addresses the security concern from previous reviews, preventing SQL injection through table name interpolation.
175-249: LGTM! Robust vector similarity search with proper filter validation. The implementation correctly (see the sketch after this list):
- Uses pgvector's cosine distance operator for similarity search
- Explicitly documents and enforces that only "source" filtering is supported
- Warns about unsupported filter keys
- Clips similarity to [0, 1] range to handle edge cases with poorly normalized embeddings
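A compact sketch of such a query, with illustrative identifiers:

```python
async def search_similar(pool, embedding: list[float], limit: int = 5) -> list[dict]:
    # <=> is pgvector's cosine distance; similarity = 1 - distance.
    query = """
        SELECT id, url, content, 1 - (embedding <=> $1::vector) AS similarity
        FROM site_pages
        ORDER BY embedding <=> $1::vector
        LIMIT $2
    """
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    async with pool.acquire() as conn:
        rows = await conn.fetch(query, vector_literal, limit)
    # Clip so poorly normalized embeddings cannot yield out-of-range scores.
    return [dict(r) | {"similarity": max(0.0, min(1.0, r["similarity"]))} for r in rows]
```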
429-476: LGTM! Comprehensive filter validation implemented. The filter validation properly addresses the SQL injection concern from previous reviews:
- Metadata keys are validated using an alphanumeric + underscore pattern
- Column names are validated against the VALID_COLUMN_NAMES whitelist
- Invalid keys are logged and skipped rather than causing errors
- All values are properly parameterized
478-514: LGTM! Robust row-to-model conversion with defensive parsing. The helper method handles multiple data formats gracefully (see the sketch after this list):
- Parses metadata from both JSON string and dict formats
- Handles pgvector embedding in both JSON array format and alternative formats
- Proper error resilience for data format variations
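A sketch of that defensive parsing, with assumed column shapes:

```python
import json

def parse_metadata(raw) -> dict:
    # Metadata may arrive as a JSON string or already as a dict.
    if isinstance(raw, str):
        return json.loads(raw)
    return raw or {}

def parse_embedding(raw) -> list[float] | None:
    # Embeddings may arrive as pgvector text ('[0.1,0.2]') or a sequence.
    if raw is None:
        return None
    if isinstance(raw, str):
        return [float(x) for x in raw.strip("[]").split(",") if x]
    return [float(x) for x in raw]
```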
elif repo_type == "postgres":
    # PostgreSQL direct with asyncpg + pgvector
    from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool

    # Get PostgreSQL configuration from environment
    postgres_config = {
        "host": os.environ.get("POSTGRES_HOST", "localhost"),
        "port": int(os.environ.get("POSTGRES_PORT", "5432")),
        "database": os.environ.get("POSTGRES_DB", "archon"),
        "user": os.environ.get("POSTGRES_USER", "postgres"),
        "password": os.environ.get("POSTGRES_PASSWORD", ""),
    }

    # Create pool and repository synchronously
    # Note: Pool creation must be done in an async context
    # So we raise an error with instructions
    raise RuntimeError(
        "PostgreSQL repository requires async initialization. "
        "Use get_repository_async() instead, or initialize manually:\n\n"
        "    from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
        "    repo = await PostgresSitePagesRepository.create(**config)\n"
        "    from archon.container import override_repository\n"
        "    override_repository(repo)\n"
    )
Remove unused postgres_config variable.
The postgres_config dictionary is constructed but never used since the code immediately raises a RuntimeError instructing users to use get_repository_async(). Remove the unused variable to clean up the code.
🔎 Proposed fix
elif repo_type == "postgres":
# PostgreSQL direct with asyncpg + pgvector
from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool
- # Get PostgreSQL configuration from environment
- postgres_config = {
- "host": os.environ.get("POSTGRES_HOST", "localhost"),
- "port": int(os.environ.get("POSTGRES_PORT", "5432")),
- "database": os.environ.get("POSTGRES_DB", "archon"),
- "user": os.environ.get("POSTGRES_USER", "postgres"),
- "password": os.environ.get("POSTGRES_PASSWORD", ""),
- }
-
- # Create pool and repository synchronously
- # Note: Pool creation must be done in an async context
- # So we raise an error with instructions
+ # PostgreSQL repository requires async initialization
raise RuntimeError(
"PostgreSQL repository requires async initialization. "
"Use get_repository_async() instead, or initialize manually:\n\n"
" from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
" repo = await PostgresSitePagesRepository.create(**config)\n"
" from archon.container import override_repository\n"
" override_repository(repo)\n"
)
Based on static analysis hints.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif repo_type == "postgres":
    # PostgreSQL direct with asyncpg + pgvector
    from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool

    # PostgreSQL repository requires async initialization
    raise RuntimeError(
        "PostgreSQL repository requires async initialization. "
        "Use get_repository_async() instead, or initialize manually:\n\n"
        "    from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
        "    repo = await PostgresSitePagesRepository.create(**config)\n"
        "    from archon.container import override_repository\n"
        "    override_repository(repo)\n"
    )
🧰 Tools
🪛 Ruff (0.14.10)
98-98: Local variable postgres_config is assigned to but never used
Remove assignment to unused variable postgres_config
(F841)
109-116: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In archon/container.py around lines 93 to 116, the code builds a postgres_config
dict but never uses it because the function immediately raises a RuntimeError;
remove the unused postgres_config creation to eliminate dead code and potential
confusion. Delete the block that builds postgres_config (host, port, database,
user, password) so the postgres branch only imports the Postgres classes and
raises the RuntimeError with its usage instructions, ensuring no leftover unused
variables remain.
- Add threading.Lock for thread-safe singleton initialization (double-checked locking pattern for get_repository and get_embedding_service) - Add early validation in configure() with clear error messages - Add EMBEDDING_SERVICE_TYPE env var support (parallel to REPOSITORY_TYPE) - Add VALID_REPOSITORY_TYPES and VALID_EMBEDDING_TYPES constants Fixes CodeRabbit nitpicks coleam00#1, coleam00#2, coleam00#3 from PR coleam00#915 review. Note: Nitpick coleam00#4 (SearchResult similarity validation) was already implemented. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces a clean architecture database layer supporting multiple backends (Supabase, PostgreSQL, InMemory) without modifying business logic. ## Changes ### Domain Layer - Add interfaces: ICrawledPagesRepository, ISourcesRepository, ICodeExamplesRepository - Add Pydantic models: CrawledPage, Source, CodeExample, SearchResult - Multi-dimension embeddings support (768, 1024, 1536, 3072) ### Infrastructure Layer - Supabase: Refactored existing implementation to use interfaces - PostgreSQL: New asyncpg + pgvector implementation with connection pooling - InMemory: Thread-safe implementation for fast unit tests ### Container & DI - Singleton Container with async lifecycle (initialize/shutdown) - Factory pattern for backend selection via REPOSITORY_TYPE env var - Integrated with FastAPI lifespan ### Tests - 102 unit tests (100% passing) - Contract tests verifying interface compliance - InMemory fixtures for isolated testing ### Documentation - docs/DATABASE_ABSTRACTION.md - Complete architecture guide - Migration guide from direct Supabase calls - Instructions for adding new backends ## Configuration ```bash # Supabase (default) export REPOSITORY_TYPE=supabase # PostgreSQL export REPOSITORY_TYPE=postgres # InMemory (tests) export REPOSITORY_TYPE=memory ``` Related: PR coleam00#915 (previous work on legacy architecture) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Closing in favor of #918, which is based on the current v2 architecture. This PR was developed against the legacy codebase structure that has since been completely refactored. PR #918 implements the same Repository Pattern concept, aligned with the new architecture. Thank you for the reviews and feedback on this draft - the learnings were incorporated into the new implementation.
* feat: natural-language approval routing for workflow gates Users can now just type their answer in the same conversation instead of using /workflow approve <id>. The message handler detects a paused workflow and auto-routes the response as the approval. Flow: message arrives → check for paused run → write approval events → transition to resumable → discover workflow → dispatch resume - Add getPausedWorkflowRun() DB query (dual conversation/parent check) - Add interception in handleMessage() before slash command routing - Remove resumeRun from CommandResult (natural language handles it) - Simplify /workflow approve to just record approval (no auto-dispatch) - Update docs with natural-language approval as recommended approach * fix: resolve WorkflowWithSource type mismatch in approval routing discoverWorkflowsWithConfig returns WorkflowWithSource[], but findWorkflow expects WorkflowDefinition[]. Map to .workflow before passing. * fix: address review findings for natural-language approval routing - Move DB writes (createWorkflowEvent, updateWorkflowRun) inside try block to prevent silent message loss on DB failures - Use discoverAllWorkflows() instead of discoverWorkflowsWithConfig() to include global and bundled workflows in approval resume path - Add runtime validation for ApprovalContext metadata instead of unsafe type cast; send explicit guidance when context is missing - Add _started/_completed log event pairing per logging conventions - Include conversationId in error log context for correlation - Elevate getPausedWorkflowRun DB error log from warn to error - Clarify /workflow approve message to guide users toward NL resume - Add "why status: failed?" comment matching command-handler pattern - Update docs/approval-nodes.md and docs/authoring-workflows.md to reflect that /workflow approve no longer auto-resumes - Add 7 behavioral tests: happy path, slash bypass, no paused run, missing approval context, workflow not found, no codebase, DB failure
Summary
This PR implements a complete database abstraction layer using the Repository Pattern, achieving 100% Core Scope migration. All core services now use repository interfaces instead of direct Supabase calls, enabling seamless switching between database backends.
Migration Statistics:
Changes Made
Repository Interfaces (11 total)
- IPageRepository: Crawled pages and chunks (16 methods)
- ISourceRepository: Knowledge sources (12 methods)
- ICodeExampleRepository: Code snippets (9 methods)
- IProjectRepository: Projects (10 methods)
- ITaskRepository: Tasks (12 methods)
- ISettingsRepository: Application settings (5 methods)
- IVersionRepository: Document versions (6 methods)
- IProjectSourceRepository: Project-source links (5 methods)
- IPromptRepository: AI prompts (2 methods)
- IMigrationRepository: Migration tracking (3 methods)
- IEmbeddingService: Embedding generation (3 methods)
Services Migrated
- knowledge_service.py - RAG search operations
- knowledge_item_service.py - Source CRUD operations
- knowledge_summary_service.py - Summary statistics
- project_service.py - Project management
- task_service.py - Task management
- settings_service.py - Application settings
- version_service.py - Version history
- prompt_service.py - AI prompts
- migration_service.py - Database migrations
- rag_tools.py (MCP) - RAG tool implementations
- project_tools.py (MCP) - Project tool implementations
- task_tools.py (MCP) - Task tool implementations
Architecture
Benefits
- REPOSITORY_TYPE env var switches backends without code changes
Test Plan
Configuration
Set the REPOSITORY_TYPE environment variable:
- supabase (default) - Use Supabase backend
- postgres - Use direct PostgreSQL connection
- memory - Use in-memory storage (testing only)