feat(db-refactor): Complete Database Layer Abstraction with Traefik Network #915
jlacerte wants to merge 24 commits into coleam00:main from
Conversation
Provides project overview, common commands, architecture documentation, and development notes for Claude Code to understand and work with the Archon codebase effectively. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Phase 1 - Domain Layer:
- Add domain models: SitePage, SitePageMetadata, SearchResult
- Add interfaces: ISitePagesRepository (8 methods), IEmbeddingService (2 methods)
- Clean module exports via __init__.py

Phase 2 - Infrastructure Layer:
- Add SupabaseSitePagesRepository with mappers
- Add InMemorySitePagesRepository for testing
- Add OpenAIEmbeddingService wrapper

Tests:
- Unit tests for domain models and interfaces (37 tests)
- Unit tests for mappers and in-memory repository (20 tests)
- Integration test script for manual validation (10 scenarios)
- All 57 unit tests passing

Validation (Phase 2.5):
- All imports working correctly
- No circular dependencies
- All interface implementations verified
- Database schema coherence confirmed

Part of the database layer refactoring project.

Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Add archon/container.py with dependency injection pattern:
- Factory functions: get_repository(), get_embedding_service()
- Configuration: configure(repository_type, embedding_type)
- Singleton pattern with lazy initialization
- Support for Supabase (production) and Memory (tests)
- Support for OpenAI (production) and Mock (tests)
- Override functions for testing: override_repository(), override_embedding_service()
- Reset function for test isolation

Add tests/test_container.py with comprehensive test coverage:
- Configuration tests (4 tests)
- Singleton behavior tests (3 tests)
- Override functionality tests (3 tests)
- Error handling tests (2 tests)
- All 12 tests passing

Update docs/MIGRATION_MANIFEST.md:
- Mark P3-01 as VERIFIED
- Update Phase 3 progress: 1/15 blocks complete
- Update global progress: 49% (17/35 blocks)
- Add entry to the change log

Part of Phase 3 migration - Database Layer Refactoring. Breaking change: None (new functionality).

Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
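A minimal usage sketch of the container this commit describes; the function names come from the commit message, while the import paths and the exact call sites are assumptions:

```python
# Hypothetical usage of the DI container added in archon/container.py.
from archon.container import (
    configure,
    get_repository,
    get_embedding_service,
    override_repository,
    reset,
)

# Production: choose backends once, then resolve lazily created singletons.
configure(repository_type="supabase", embedding_type="openai")
repository = get_repository()         # created on first call
embeddings = get_embedding_service()  # same instance on every later call

# Tests: inject a fake, then reset for isolation between tests.
fake_repository = ...  # e.g. an InMemorySitePagesRepository instance
override_repository(fake_repository)
reset()
```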
Migration Strategy: Dual-mode with backward compatibility

Changes:
- agent_tools.py: Add optional repository/embedding_service parameters to all functions
  * get_embedding(): Add embedding_service param with fallback to legacy client
  * retrieve_relevant_documentation_tool(): Add repository param with fallback
  * list_documentation_pages_tool(): Add repository param with fallback
  * get_page_content_tool(): Add repository param with fallback
- All functions maintain backward compatibility with the Supabase client
- Prefer the new repository/service when provided; fall back to legacy otherwise

Testing:
- Created tests/test_agent_tools_migration.py with 15 comprehensive tests
- All tests pass (90 passed, 29 skipped)
- Tests validate both the new repository pattern and the legacy Supabase paths

Infrastructure Fix:
- Fixed a floating-point precision issue in InMemorySitePagesRepository
- Added similarity clipping to the [0.0, 1.0] range to avoid validation errors

Documentation:
- Updated MIGRATION_MANIFEST.md: P3-03 (a-h) marked as VERIFIED
- Progress: 51% complete (18/35 blocks verified)

Phase 3 Migration - Step 2/13 Complete

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
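A sketch of the dual-mode signature this strategy implies, shown for get_embedding(); the parameter names follow the commit message, while the interface method name and the legacy client call are assumptions:

```python
# Hypothetical dual-mode function: prefer the injected abstraction,
# fall back to the legacy global client when none is provided.
from typing import Any, Optional

from archon.domain import IEmbeddingService


async def get_embedding(
    text: str,
    embedding_client: Any = None,  # legacy AsyncOpenAI client (fallback path)
    embedding_service: Optional[IEmbeddingService] = None,  # new abstraction
) -> list[float]:
    if embedding_service is not None:
        # New path: backend-agnostic domain interface.
        return await embedding_service.get_embedding(text)
    # Legacy path, kept so existing Supabase-era callers keep working.
    response = await embedding_client.embeddings.create(
        model="text-embedding-3-small",  # placeholder model name
        input=text,
    )
    return response.data[0].embedding
```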
…ttern (P3-04)

Migration completed with dual-mode strategy (backward compatible):

Functions migrated:
- get_embedding(): Add optional embedding_service parameter
- insert_chunk(): Add optional repository parameter, use repository.insert()
- clear_existing_records(): Add optional repository parameter, use repository.delete_by_source()
- process_chunk(): Pass embedding_service to get_embedding()
- process_and_store_document(): Pass repository and embedding_service
- crawl_parallel_with_requests(): Accept and propagate DI parameters
- main_with_requests(): Accept and propagate DI parameters
- start_crawl_with_requests(): Accept and propagate DI parameters

Key changes:
- All async repository/service methods called with await (not run_in_executor)
- Maintained backward compatibility with fallback to global clients
- SitePage.id set to None (assigned by the database)
- clear_existing_records() is now async so it can await repository.delete_by_source()

Tests:
- Created tests/test_crawl_migration.py with 6 validation tests
- All tests pass: 6/6 ✓
- Tests verify:
  - Functions accept optional DI parameters
  - Functions work with injected dependencies
  - Backward compatibility maintained
  - Signature correctness

Blocks verified:
- P3-04a: ✓ get_clients() injection via optional parameters
- P3-04b: ✓ supabase.table().insert() -> repository.insert()
- P3-04c: ✓ supabase.table().delete() -> repository.delete_by_source()

Dependencies installed:
- html2text==2025.4.15
- crawl4ai==0.7.7

Progress: 54% complete (19/35 blocks verified)

Part of Phase 3 migration (P3-04).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…05, P3-06)

Migrate the Streamlit pages database.py and documentation.py to the new repository pattern with a dual-mode strategy.

**Changes P3-05 - database.py:**
- Add optional parameter `repository: Optional[ISitePagesRepository]`
- Migrate count() to repository.count() via asyncio.run()
- Keep Supabase for the "delete all" admin operation (not covered by the repository)
- Dual mode: use the repository when provided, otherwise fall back to Supabase

**Changes P3-06 - documentation.py:**
- Add optional parameter `repository: Optional[ISitePagesRepository]`
- Migrate count(source) to repository.count(source="pydantic_ai_docs")
- Keep Supabase for the UI sample display (not a business operation)
- Dual mode: use the repository when provided, otherwise fall back to Supabase

**Tests:**
- Add tests/test_streamlit_migration.py (10 tests)
- Validate signatures, imports, type hints
- All tests pass: 106 passed, 29 skipped

**Documentation:**
- Update docs/MIGRATION_MANIFEST.md
  - P3-05: [v] VERIFIED
  - P3-06: [v] VERIFIED
- Global progress: 60% (21/35 blocks)

Part of Phase 3 migration. Breaking change: None (backward compatible).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
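A sketch of the sync-to-async bridge used here: Streamlit page code runs synchronously, so the async repository call is driven with asyncio.run(). Only repository.count() is taken from the commit message; the function and names are illustrative:

```python
# Hypothetical Streamlit-page helper bridging sync UI code to the
# async repository (repository.count() is assumed to be a coroutine).
import asyncio
from typing import Optional

from archon.domain import ISitePagesRepository


def get_row_count(repository: Optional[ISitePagesRepository] = None) -> int:
    if repository is not None:
        # Dual mode, new path: run the coroutine to completion synchronously.
        return asyncio.run(repository.count())
    # Dual mode, legacy path: direct Supabase query (elided in this sketch).
    raise NotImplementedError("legacy Supabase fallback elided")
```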
Detailed migration report covering:
- Streamlit pages database.py and documentation.py migration
- Strategy explanation (dual-mode with fallback)
- Async/sync adaptation with asyncio.run()
- UI-specific operations kept with Supabase
- Test results (106 passed)
- Problems encountered and solutions
- Next steps

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…epository pattern

CONTEXT: This commit completes the Phase 3 migration by refactoring all Pydantic AI agents to use the repository pattern via the dependency injection container.

CHANGES:

P3-07: archon/archon_graph.py (LangGraph orchestration)
- Remove: 'from supabase import Client'
- Add: 'from archon.container import get_repository, get_embedding_service'
- Replace: 'embedding_client, supabase = get_clients()' with 'repository = get_repository()' and 'embedding_service = get_embedding_service()'
- Update all PydanticAIDeps, ToolsRefinerDeps, AgentRefinerDeps instantiations to use repository/embedding_service instead of supabase/embedding_client
- Update the list_documentation_pages_tool call to pass the repository parameter

P3-08: archon/pydantic_ai_coder.py (Main coding agent)
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update PydanticAIDeps dataclass:
  - 'repository: ISitePagesRepository' (was 'supabase: Client')
  - 'embedding_service: IEmbeddingService' (was 'embedding_client: AsyncOpenAI')
- Update all 3 tools to use ctx.deps.repository and ctx.deps.embedding_service

P3-09: archon/advisor_agent.py (Simple cleanup)
- Remove unused import: 'from supabase import Client'

P3-10: archon/refiner_agents/tools_refiner_agent.py
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update ToolsRefinerDeps dataclass to use repository/embedding_service
- Update all 3 tools to use the new dependencies

P3-11: archon/refiner_agents/agent_refiner_agent.py
- Remove: 'from supabase import Client'
- Add: 'from archon.domain import ISitePagesRepository, IEmbeddingService'
- Update AgentRefinerDeps dataclass to use repository/embedding_service
- Update all 3 tools to use the new dependencies

P3-12: archon/refiner_agents/prompt_refiner_agent.py (Simple cleanup)
- Remove unused import: 'from supabase import Client'

TESTING:
- Added tests/test_agents_migration.py with 15 comprehensive tests
- All tests validate:
  * Correct imports of domain interfaces
  * Removal of Supabase Client imports
  * Deps dataclasses use interfaces
  * Tools use the new dependency names
  * Container integration in archon_graph.py

TEST RESULTS:
- 121 tests passed (was 106, added 15 new tests)
- 29 skipped (integration tests requiring Supabase)
- 0 failures
- All existing tests still pass (backward compatibility maintained)

ARCHITECTURE: The migration maintains backward compatibility because:
1. agent_tools.py already supports dual mode (legacy + new)
2. archon_graph.py is the single point of dependency injection
3. All agents receive their dependencies from the graph

NOTES:
- A pre-existing bug with the OpenAIModel.__init__ signature is not addressed (it was already broken and is unrelated to this migration)
- Migration manifest blocks P3-07 through P3-12 are now complete

Part of Phase 3 migration. Breaking change: None (fully backward compatible via dual mode in agent_tools.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
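The core of the P3-08 change, written out; the field and interface names are taken from the commit message, the rest is illustrative:

```python
# Before/after of the PydanticAIDeps dependency container (P3-08).
from dataclasses import dataclass

from archon.domain import IEmbeddingService, ISitePagesRepository


@dataclass
class PydanticAIDeps:
    # Was:
    #   supabase: Client                 (concrete Supabase client)
    #   embedding_client: AsyncOpenAI    (concrete OpenAI client)
    repository: ISitePagesRepository      # any backend behind the interface
    embedding_service: IEmbeddingService  # any embedding provider


# Inside a tool, dependencies are reached through the run context, e.g.:
#   embedding = await ctx.deps.embedding_service.get_embedding(query)
#   docs = await ctx.deps.repository.search(embedding)
```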
Phase 4 validation results:

- P4-01: Remove unused Supabase imports from agent_tools.py
  * Removed the 'from supabase import Client' import
  * Changed Client type annotations to Any (legacy fallback)
  * Only utils.py retains the Supabase import (justified - get_clients legacy)
- P4-02: Full test suite verification
  * 121 tests passing
  * 29 tests skipped (integration tests requiring Supabase)
  * All migration tests passing
- P4-03: Import validation
  * Domain imports: OK
  * Container imports: OK
  * Infrastructure imports: OK
- P4-04: Update MIGRATION_MANIFEST.md
  * Phase 4 blocks marked VERIFIED
  * Progress: 89% core (31/35), 84% global (31/37)

Files modified:
- archon/agent_tools.py: Remove Supabase import, update type hints
- docs/MIGRATION_MANIFEST.md: Mark P4-01 to P4-04 as VERIFIED

Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Extract individual client factories from get_clients() to break the dependency cycle with container.py and improve modularity.

Changes:
- Extract get_openai_client() from get_clients() (lines 383-400)
- Extract get_supabase_client() from get_clients() (lines 402-419)
- Refactor get_clients() to use the extracted factories (lines 421-433)
- Mark get_clients() as DEPRECATED in its docstring
- Update container.py to use get_openai_client() and get_supabase_client()

Benefits:
- The container no longer unpacks unused tuple values
- Individual factories can be used independently
- Maintains 100% backward compatibility (get_clients() behavior unchanged)
- Reduces coupling between utils.py and container.py

Tests: pytest tests/ → 121/121 passed

Part of Phase 3 migration (P3-02a-d). Breaking change: None.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
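A sketch of the resulting utils.py shape; the factory names and the deprecation are from the commit message, the bodies are elided:

```python
# Individual client factories extracted from get_clients() (illustrative).
def get_openai_client():
    """Build (or return a cached) OpenAI client."""
    ...


def get_supabase_client():
    """Build (or return a cached) Supabase client."""
    ...


def get_clients():
    """DEPRECATED: prefer the individual factories.

    Kept with unchanged behavior for backward compatibility; callers that
    need only one client no longer have to unpack an unused tuple value.
    """
    return get_openai_client(), get_supabase_client()
```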
FINAL BLOCK OF PHASE 3! 🎉
Created a business logic services layer between agents and repositories:
NEW FILES:
- archon/services/__init__.py - Services package initialization
- archon/services/documentation_service.py - DocumentationService class
- tests/test_services.py - 14 comprehensive tests
SERVICES LAYER:
DocumentationService encapsulates documentation operations:
- search_documentation(query, limit, source) - Semantic search
- get_page_content(url) - Concatenates all page chunks
- list_available_pages(source) - Lists available URLs
- get_page_metadata(url) - Retrieves page metadata
- count_pages(source) - Counts pages/chunks
CONTAINER INTEGRATION:
- Added get_documentation_service() factory to container.py
- Service auto-wires repository and embedding service dependencies
ARCHITECTURE:
Agents (pydantic_ai_coder, etc.)
↓
Services (DocumentationService)
↓
Repositories (ISitePagesRepository)
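A sketch of the service skeleton this layer implies; the method names are listed above under SERVICES LAYER, while the constructor wiring and the repository method names are assumptions:

```python
# Hypothetical skeleton of DocumentationService (illustrative signatures).
from archon.domain import IEmbeddingService, ISitePagesRepository


class DocumentationService:
    """Business-logic layer between agents and repositories."""

    def __init__(self, repository: ISitePagesRepository, embedding_service: IEmbeddingService):
        self._repository = repository
        self._embedding_service = embedding_service

    async def search_documentation(self, query: str, limit: int = 5, source: str | None = None):
        # Embed the query, then delegate the vector search to the repository.
        embedding = await self._embedding_service.get_embedding(query)
        return await self._repository.search(embedding, limit=limit, source=source)

    async def count_pages(self, source: str | None = None) -> int:
        return await self._repository.count(source=source)
```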
TESTING:
- 14/14 new service tests pass
- 135/135 total tests pass
- 29 skipped (Supabase integration - expected)
- Zero failures!
MIGRATION MANIFEST:
- P3-13 marked as VERIFIED
- Phase 3 COMPLETE: 13/13 blocks verified
- Core project: 100% (33/33 blocks)
- Global: 94% (33/35 blocks including optional)
Part of Phase 3 migration - Services Layer.
Breaking change: None (additive only).
🎉 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Comprehensive report documenting:
- P3-13 Services Layer implementation
- 100% Phase 3 completion (13/13 blocks)
- 135/135 tests passing
- Architecture improvements
- Zero breaking changes

Phase 3 is COMPLETE! 🎉
The pydantic-ai library changed its API between v0.0.22 and v1.0.x. The old pattern `OpenAIModel(model, base_url=..., api_key=...)` was replaced with `OpenAIModel(model, provider=OpenAIProvider(base_url=..., api_key=...))`.

This fix updates all 6 agent files to use the new API pattern:
- archon/pydantic_ai_coder.py
- archon/advisor_agent.py
- archon/archon_graph.py (2 model instances)
- archon/refiner_agents/prompt_refiner_agent.py
- archon/refiner_agents/tools_refiner_agent.py
- archon/refiner_agents/agent_refiner_agent.py

Includes bug report documentation in docs/BUG_REPORT_PYDANTIC_AI_API.md

Fixes TypeError: OpenAIChatModel.__init__() got unexpected keyword argument 'base_url'

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
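The same change as a side-by-side snippet (the model name and credentials are placeholders):

```python
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

base_url = "https://api.openai.com/v1"  # placeholder
api_key = "sk-..."                      # placeholder

# Old (pydantic-ai v0.0.22) - raises TypeError on v1.x:
# model = OpenAIModel("gpt-4o", base_url=base_url, api_key=api_key)

# New (pydantic-ai v1.x) - connection settings move into the provider:
model = OpenAIModel("gpt-4o", provider=OpenAIProvider(base_url=base_url, api_key=api_key))
```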
Create a specialized agent to implement alternative database backends:
- PostgreSQL direct (asyncpg + pgvector)
- SQLAlchemy (multi-DB portability)
- SQLite (local development)

Files added:
- .claude/agents/db-backend-agent.md - Agent definition and templates
- docs/CONTEXT_DB_BACKEND_AGENT.md - Full context for agent sessions

This agent leverages the completed Repository Pattern refactoring to easily add new storage backends.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough

Adds a new Domain layer (models + async interfaces), a Services layer, multiple infrastructure backends (Postgres asyncpg+pgvector, Supabase, in-memory, OpenAI embedding), a DI container with async repository support and test overrides, migration/schema utilities, agent and tool changes to use repository+embedding abstractions, tests, staging config/Docker, and extensive documentation and migration manifests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Agent as Agent / Tool
participant Container as DI Container
participant Repo as ISitePagesRepository
participant Embed as IEmbeddingService
participant DB as Backend (Postgres / Supabase)
Agent->>Container: get_repository_async()
activate Container
Container->>Repo: create/init backend instance (async)
note right of Container: lazy async init / singleton
Container-->>Agent: repository instance
deactivate Container
Agent->>Container: get_embedding_service()
activate Container
Container->>Embed: create/init embedding service
Container-->>Agent: embedding_service
deactivate Container
Agent->>Embed: get_embedding(text)
activate Embed
Embed->>DB: request embedding (OpenAI API or mock)
DB-->>Embed: embedding vector
Embed-->>Agent: List[float]
deactivate Embed
Agent->>Repo: search_similar(embedding, limit)
activate Repo
Repo->>DB: vector similarity query (pgvector / RPC)
DB-->>Repo: rows + similarity
Repo-->>Agent: List[SearchResult]
deactivate Repo
```

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (7)
python/src/core/domain/interfaces/embedding_service.py (1)
1-77: Consider documenting the exception contract more explicitly. The interface docstrings reference `EmbeddingError` (lines 35, 53), but this exception type is not defined or imported. Consider either:

- Defining a concrete `EmbeddingError` exception class in the domain layer for implementers to use
- Updating docstrings to specify a more general exception type (e.g., "Exception: If embedding generation fails")
This improves clarity for implementers about the expected exception contract.
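A minimal sketch of the first option, assuming a domain-level exceptions module (the class body and fields are illustrative):

```python
# Hypothetical domain exception backing the documented contract.
class EmbeddingError(Exception):
    """Raised by IEmbeddingService implementations when embedding generation fails."""

    def __init__(self, message: str, *, text_preview: str = "", provider: str = ""):
        super().__init__(message)
        # Carry context so callers can log what was being attempted.
        self.text_preview = text_preview  # first characters of the failing input
        self.provider = provider          # e.g. "openai", "mock"
```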
python/tests/core/test_container.py (1)
158-171: Consider using the public property instead of the private attribute. Line 171 accesses `service._dimension` (a private attribute). Using the public `dimension` property would reduce coupling to implementation details:

```diff
- assert service._dimension == 768
+ assert service.dimension == 768
```

This makes the test more resilient to internal refactoring.
python/src/core/domain/models/page_metadata.py (1)
109-121: Word count calculation uses a naive split. `len(markdown.split())` works for basic cases but counts markdown syntax elements (link URLs, code fences, heading markers) as words. This is acceptable for approximate statistics but worth noting.

python/src/core/infrastructure/memory/mock_embedding_service.py (1)
69-92: Unused variable `hash_bytes` on line 71. The initial `hash_bytes` computation is never used since the loop generates fresh hashes with iteration indices. This is harmless dead code.

🔎 Remove unused variable

```diff
 def _generate_embedding(self, text: str) -> list[float]:
     """
     Generate a deterministic embedding from text.

     Uses SHA-256 hash extended to fill the dimension.
     Values are normalized to [-1, 1] range.
     """
     # Create hash of text
     text_bytes = text.encode("utf-8")
-    hash_bytes = hashlib.sha256(text_bytes).digest()

     # Extend hash to fill dimension
     embedding = []
     iteration = 0
```

python/src/core/infrastructure/supabase/page_repository.py (3)
139-142: Consider chaining the original exception for better debugging. When raising `RuntimeError`, the original exception context is lost. Chain it for better traceability.

🔎 Proposed fix

```diff
 if response.data and len(response.data) > 0:
     return response.data[0].get("id", "")
-raise RuntimeError("Failed to insert chunk - no ID returned")
+raise RuntimeError("Failed to insert chunk - no ID returned") from None
```

Or if you want to preserve any underlying exception:

```python
try:
    response = self._client.table(self.CHUNKS_TABLE).insert(data).execute()
    if response.data and len(response.data) > 0:
        return response.data[0].get("id", "")
    raise RuntimeError("Failed to insert chunk - no ID returned")
except Exception as e:
    raise RuntimeError(f"Failed to insert chunk: {e}") from e
```
203-211: Optimize `count_chunks` to avoid fetching all columns. Selecting `"*"` retrieves all columns when only the count is needed. This is inefficient for tables with large content/embedding columns.

🔎 Proposed fix

```diff
 async def count_chunks(self, source_id: str | None = None) -> int:
     """Count chunks, optionally filtered by source."""
-    query = self._client.table(self.CHUNKS_TABLE).select("*", count="exact")
+    query = self._client.table(self.CHUNKS_TABLE).select("id", count="exact", head=True)
     if source_id:
         query = query.eq("source_id", source_id)
     response = query.execute()
     return response.count or 0
```

Using `head=True` avoids returning row data entirely when only the count is needed.
100-105: Clarify the `source_filter` and `filter` parameter logic. The current logic sets `filter` to `metadata_filter or {}` regardless of whether `source_filter` is provided. This appears correct, but the conditional structure is slightly confusing: the `filter` assignment is duplicated.

🔎 Simplify the parameter building

```diff
 # Build RPC parameters
 rpc_params: dict[str, Any] = {
     "query_embedding": embedding,
     "match_count": match_count,
+    "filter": metadata_filter or {},
 }

 # Add source filter
 if source_filter:
     rpc_params["source_filter"] = source_filter
-    rpc_params["filter"] = metadata_filter or {}
-else:
-    rpc_params["filter"] = metadata_filter or {}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- python/src/core/__init__.py
- python/src/core/container.py
- python/src/core/domain/__init__.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/domain/models/__init__.py
- python/src/core/domain/models/crawled_page.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/domain/models/search_result.py
- python/src/core/infrastructure/__init__.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/openai/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/infrastructure/supabase/page_repository.py
- python/src/server/services/search/rag_service.py
- python/src/server/utils/__init__.py
- python/tests/core/__init__.py
- python/tests/core/test_container.py
- python/tests/core/test_domain_models.py
- python/tests/core/test_repositories.py
🧰 Additional context used
📓 Path-based instructions (5)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/domain/models/search_result.py
- python/src/core/domain/models/crawled_page.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/server/utils/__init__.py
- python/src/core/domain/models/__init__.py
- python/src/server/services/search/rag_service.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/domain/__init__.py
- python/src/core/infrastructure/openai/__init__.py
- python/src/core/container.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/__init__.py
- python/src/core/infrastructure/__init__.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/infrastructure/supabase/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/domain/models/search_result.py
- python/src/core/domain/models/crawled_page.py
- python/tests/core/test_domain_models.py
- python/src/core/infrastructure/supabase/__init__.py
- python/src/core/domain/interfaces/embedding_service.py
- python/src/server/utils/__init__.py
- python/src/core/domain/models/__init__.py
- python/src/server/services/search/rag_service.py
- python/src/core/infrastructure/memory/__init__.py
- python/src/core/domain/__init__.py
- python/src/core/infrastructure/openai/__init__.py
- python/tests/core/test_container.py
- python/tests/core/test_repositories.py
- python/tests/core/__init__.py
- python/src/core/container.py
- python/src/core/domain/interfaces/__init__.py
- python/src/core/infrastructure/postgres/__init__.py
- python/src/core/domain/interfaces/page_repository.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/__init__.py
- python/src/core/infrastructure/__init__.py
- python/src/core/domain/models/page_metadata.py
- python/src/core/infrastructure/supabase/page_repository.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
- python/tests/core/test_domain_models.py
- python/tests/core/test_container.py
- python/tests/core/test_repositories.py
- python/tests/core/__init__.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
- python/src/server/utils/__init__.py
- python/src/server/services/search/rag_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/server/services/**/*.py: For batch processing and background tasks, complete what you can and report detailed failures for each item rather than crashing
Never accept corrupted data - skip failed items entirely rather than storing corrupted data in batch operations
Files:
python/src/server/services/search/rag_service.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: zaksnet
Repo: coleam00/Archon PR: 473
File: archon-ui-main/src/components/agent-chat/ArchonChatPanel.tsx:26-31
Timestamp: 2025-09-10T11:35:20.655Z
Learning: When reviewing large refactoring PRs like "Clean multi provider feature", focus feedback on the core objectives (provider management, API keys, model configuration, service registry) rather than unrelated UI improvements in components that were only incidentally modified during the refactor.
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
Applied to files:
python/src/server/utils/__init__.py
🧬 Code graph analysis (14)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)

python/src/core/domain/models/crawled_page.py (1)
- python/src/core/domain/models/page_metadata.py (1)
  - to_insert_dict (62-83)

python/tests/core/test_domain_models.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)

python/src/core/infrastructure/supabase/__init__.py (1)
- python/src/core/infrastructure/supabase/page_repository.py (1)
  - SupabasePageRepository (38-393)

python/src/server/utils/__init__.py (1)
- python/src/core/container.py (1)
  - get_page_repository (53-98)

python/src/core/domain/models/__init__.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/server/services/search/rag_service.py (1)
- python/src/core/container.py (1)
  - get_page_repository (53-98)

python/tests/core/test_container.py (2)
- python/src/core/container.py (7)
  - reset (150-162), get_repository_type (43-50), configure (165-182), get_page_repository (53-98), override_repository (128-136), get_embedding_service (101-125), override_embedding_service (139-147)
- python/src/core/infrastructure/memory/mock_embedding_service.py (1)
  - MockEmbeddingService (13-92)

python/tests/core/test_repositories.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/container.py (4)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/infrastructure/memory/page_repository.py (1)
  - InMemoryPageRepository (18-335)
- python/src/core/infrastructure/memory/mock_embedding_service.py (1)
  - MockEmbeddingService (13-92)

python/src/core/domain/interfaces/__init__.py (2)
- python/src/core/domain/interfaces/embedding_service.py (1)
  - IEmbeddingService (11-77)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)

python/src/core/domain/interfaces/page_repository.py (3)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/domain/models/page_metadata.py (1)
- python/src/core/domain/models/crawled_page.py (1)
  - to_insert_dict (79-115)

python/src/core/infrastructure/supabase/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)
🔇 Additional comments (31)
python/tests/core/__init__.py (1)
1-1: LGTM! Standard test package initializer with an appropriate docstring.

python/src/core/infrastructure/__init__.py (1)

1-13: LGTM! Clean module documentation with an appropriate lazy-import strategy to avoid loading backend dependencies unnecessarily.
python/src/core/domain/models/search_result.py (3)
1-50: LGTM! Well-structured Pydantic model with clear field definitions, appropriate defaults, and good documentation. The `model_config` settings are correctly configured for Pydantic v2.

51-75: LGTM! The `from_rpc_result` factory correctly handles the integer-to-string ID conversion and safely handles `None` values.

100-110: LGTM! Good use of `model_copy` for immutability, returning a new instance with the updated rerank score.

python/src/server/utils/__init__.py (2)

23-26: LGTM! Clean integration of the repository pattern import from the core module. The relative import path correctly navigates to the core package.

91-92: LGTM! Properly exposed in `__all__` for backward-compatible public API access.

python/src/core/infrastructure/openai/__init__.py (1)

1-8: LGTM! Clean placeholder module for the future OpenAI embedding service implementation, following the established lazy-import pattern.
python/src/core/infrastructure/memory/__init__.py (1)
1-14: LGTM! Well-structured module with clear documentation. The eager imports are appropriate for lightweight test implementations that have no external dependencies.
python/src/core/domain/interfaces/__init__.py (1)
1-15: LGTM! Clean interface aggregation module that properly re-exports domain contracts. The docstring accurately describes the purpose of enabling dependency inversion and testability.
python/src/core/infrastructure/supabase/__init__.py (1)
1-10: LGTM! Clean module initialization with clear documentation and proper public API exposure.
python/tests/core/test_domain_models.py (1)
1-309: LGTM! Excellent test coverage. The test suite comprehensively validates domain models, including:
- Field initialization and defaults
- Multi-dimensional embedding support
- Dictionary conversion for database insertion
- RPC result mapping with type normalization (e.g., integer ID to string)
- Edge cases (empty content, boundary similarity scores)
The test structure is clear and follows pytest best practices.
python/src/core/domain/models/__init__.py (1)
1-16: LGTM! Clean public API surface for domain models with clear documentation emphasizing separation from infrastructure concerns.
python/tests/core/test_container.py (1)
1-206: LGTM! Comprehensive container tests. The test suite thoroughly validates:
- Environment-driven repository selection
- Singleton behavior and reset semantics
- Override mechanisms for testing
- Error handling for invalid configurations
- Supabase credential requirements
Test organization is clear and follows good pytest practices.
python/src/core/domain/__init__.py (1)
1-19: LGTM! Clear domain layer public API with well-documented separation of concerns. The explicit note about no infrastructure dependencies reinforces clean architecture principles.
python/tests/core/test_repositories.py (1)
1-586: LGTM! Excellent repository test coverage. This comprehensive test suite validates all repository operations:
- Chunk operations: Insert, batch insert, retrieve, delete (by URL/source), count
- Page operations: Upsert (insert/update), batch upsert, retrieve (by URL/ID), list with filters, delete
- Vector search: Exact match, ranking, source filtering, match limits, edge cases (no embeddings, empty repo)
- Utilities: Unique URLs listing, health check, clear/reset
The tests effectively exercise the InMemoryPageRepository implementation and demonstrate the expected behavior for the IPageRepository interface.
python/src/core/__init__.py (1)
1-51: LGTM! Excellent public API design. The core module provides a clean, unified interface with:
- Clear architectural documentation explaining the layers (domain, interfaces, infrastructure, container)
- Proper separation of concerns between models, interfaces, and DI functions
- Helpful usage examples in the docstring
This design enables easy switching between backends while maintaining a stable public API.
python/src/server/services/search/rag_service.py (2)
39-50: LGTM! Clean backward-compatible repository integration. The constructor properly initializes the repository with graceful fallback to direct Supabase queries. The try/except pattern ensures existing deployments without repository configuration continue to work.

220-275: Solid dual-path lookup with proper fallback logic. The repository-first approach with Supabase fallback maintains backward compatibility while enabling the new abstraction. Exception handling at lines 230-232 appropriately logs and falls back rather than failing the entire operation.

One minor observation: the repository path (lines 264-275) and the Supabase path (lines 253-263) produce slightly different result structures. The repository path doesn't check for `None` on `page_metadata.word_count` like the Supabase path does with `.get("word_count", 0)`. Since `PageMetadata.word_count` defaults to `0` in the model definition, this should be fine.

python/src/core/domain/models/page_metadata.py (2)

14-61: Well-structured domain model following Pydantic v2 conventions. The model correctly uses `ConfigDict` with `from_attributes=True` for ORM compatibility and `populate_by_name=True` for flexible field mapping. Field definitions with proper defaults and descriptions are clean.

62-83: Consider including `id` in the insert dict for upsert scenarios. The `to_insert_dict()` method excludes `id`, which is correct for inserts. However, if this is used with upsert operations where the record already exists, the `id` might be needed. The Supabase implementation uses `on_conflict="url"`, which handles this, but verify this works for all use cases.

python/src/core/infrastructure/memory/mock_embedding_service.py (1)

40-50: LGTM! Clean implementation of the IEmbeddingService interface. The async methods correctly implement the interface contract with deterministic, reproducible embeddings suitable for unit testing.
python/src/core/domain/models/crawled_page.py (2)
62-77: LGTM! The multi-dimensional embedding selection logic is correct. The method properly checks both the dimension configuration AND the presence of the corresponding embedding field before returning it, and falls back to the generic `embedding` field when no dimension-specific match exists.

79-115: Well-implemented insertion helper with flexible embedding-column support. The method correctly builds the insert payload, conditionally including optional fields and routing embeddings to the appropriate column based on the `embedding_column` parameter.

python/src/core/infrastructure/memory/page_repository.py (3)
51-109: Solid vector search implementation with proper edge-case handling. The cosine similarity computation correctly handles zero-norm vectors (lines 62-63, 86-87) and clips results to [0, 1] for consistency. The filtering and sorting logic is clean.
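For reference, a minimal self-contained version of the pattern being praised (zero-norm guard plus clipping); this is illustrative, not the repository's actual code:

```python
# Cosine similarity with a zero-norm guard and [0, 1] clipping.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero-norm vector is treated as completely dissimilar
    dot = sum(x * y for x, y in zip(a, b))
    # Raw cosine lies in [-1, 1]; clip so similarity stays in [0, 1] and
    # floating-point drift cannot push it past the bounds.
    return max(0.0, min(1.0, dot / (norm_a * norm_b)))
```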
143-168: The safe deletion pattern avoids dict mutation during iteration. Building the `to_delete` list before iterating deletions is the correct approach to avoid a `RuntimeError` from modifying a dict during iteration.

204-228: Correct upsert semantics with URL-based deduplication. The logic properly checks for existing pages by URL, preserves `created_at` on updates, and maintains the `_url_to_page_id` index. The timestamp handling is clean.

python/src/core/container.py (2)

53-98: Clean lazy initialization with environment-based selection. The singleton pattern with test-override precedence is well-structured, and lazy imports avoid circular dependency issues.

Minor consideration: in multi-threaded scenarios, concurrent calls to `get_page_repository()` before singleton initialization could create multiple instances. For async Python servers this is typically not an issue, since initialization usually happens at startup. If thread safety becomes a concern, consider a lock.

101-125: MockEmbeddingService as the default is intentional for safety. The docstring correctly documents that production systems should use `override_embedding_service()` with a real provider. This prevents accidental API calls in tests/dev.

python/src/core/domain/interfaces/page_repository.py (1)

15-263: Excellent interface design with clear contracts. The `IPageRepository` interface provides a comprehensive, well-documented contract for all page and chunk operations. The separation of chunk operations (lines 27-134) and page operations (lines 136-236) is logical and follows the domain model structure. All methods are properly async, return types are explicit, and docstrings clearly specify arguments, return values, and semantics (e.g., upsert behavior).
python/src/core/infrastructure/supabase/page_repository.py (1)
81-127: The overall search implementation is sound. The RPC-based vector search with similarity-threshold filtering and proper error handling (returning empty results on failure) provides resilience. The `SearchResult.from_rpc_result()` integration is clean.
Actionable comments posted: 3
♻️ Duplicate comments (1)
python/src/core/infrastructure/supabase/page_repository.py (1)
122-125: Replace `print()` with proper logging. This is a duplicate of a previous review comment. Using `print()` for error logging bypasses structured logging. Per coding guidelines, errors should be logged with context and stack traces using `logger.error()` with `exc_info=True`.
🧹 Nitpick comments (2)
python/src/core/infrastructure/memory/page_repository.py (1)
298-302: Consider explicit behavior for a missing page_id. The method silently skips updates when `page_id` is not found. While this may be intentional for a test implementation, consider whether callers might benefit from knowing when an update is skipped (e.g., logging a warning or raising a KeyError).

python/src/core/infrastructure/supabase/page_repository.py (1)

265-267: Consider adding error handling and logging. This method silently fails if the update doesn't succeed. Per coding guidelines, errors should be logged with context even when not crashing. Consider logging if the update fails or returns no matching rows.
🔎 Proposed enhancement
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 async def update_chunk_count(self, page_id: str, chunk_count: int) -> None:
     """Update the chunk_count for a page."""
-    self._client.table(self.PAGES_TABLE).update({"chunk_count": chunk_count}).eq("id", page_id).execute()
+    try:
+        response = (
+            self._client.table(self.PAGES_TABLE)
+            .update({"chunk_count": chunk_count})
+            .eq("id", page_id)
+            .execute()
+        )
+        if not response.data:
+            logger.warning(
+                f"update_chunk_count: No matching page found. page_id: {page_id}"
+            )
+    except Exception as e:
+        logger.error(
+            f"Failed to update chunk_count for page_id: {page_id}",
+            exc_info=True
+        )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- python/src/core/container.py
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
🚧 Files skipped from review as they are similar to previous changes (2)
- python/src/core/infrastructure/memory/mock_embedding_service.py
- python/src/core/container.py
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
- python/src/core/infrastructure/memory/page_repository.py
- python/src/core/infrastructure/supabase/page_repository.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: zaksnet
Repo: coleam00/Archon PR: 473
File: archon-ui-main/src/components/agent-chat/ArchonChatPanel.tsx:26-31
Timestamp: 2025-09-10T11:35:20.655Z
Learning: When reviewing large refactoring PRs like "Clean multi provider feature", focus feedback on the core objectives (provider management, API keys, model configuration, service registry) rather than unrelated UI improvements in components that were only incidentally modified during the refactor.
🧬 Code graph analysis (2)
python/src/core/infrastructure/memory/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (2)
  - CrawledPage (14-115), get_active_embedding (62-77)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (1)
  - SearchResult (13-110)

python/src/core/infrastructure/supabase/page_repository.py (4)
- python/src/core/domain/interfaces/page_repository.py (1)
  - IPageRepository (15-263)
- python/src/core/domain/models/crawled_page.py (1)
  - CrawledPage (14-115)
- python/src/core/domain/models/page_metadata.py (1)
  - PageMetadata (14-121)
- python/src/core/domain/models/search_result.py (2)
  - SearchResult (13-110), from_rpc_result (52-75)
🔇 Additional comments (8)
python/src/core/infrastructure/memory/page_repository.py (8)
1-16: Clean module structure and imports. The imports are well-organized and the module docstring clearly explains the purpose of this in-memory implementation for testing.

18-45: LGTM - Clean initialization and test utilities. The three-dictionary structure (_chunks, _pages, _url_to_page_id) provides efficient lookups by different keys, and the clear() helper method is useful for test isolation.

51-109: Solid vector similarity implementation. The cosine similarity calculation correctly handles edge cases (zero-norm vectors), and the clipping to [0, 1] is appropriate for treating dissimilar vectors as having zero similarity in search contexts.

111-141: Correct chunk insertion implementation. The `embedding_column` parameter in `insert_chunks_batch` is appropriately accepted for interface compliance even though the in-memory implementation doesn't need column-specific storage. The use of `get_active_embedding()` ensures the correct embedding is stored.

143-188: LGTM - Chunk operations implemented correctly. The delete operations properly avoid modifying dictionaries during iteration, and `get_chunk_by_id` correctly reconstructs the `CrawledPage` model with proper handling of optional fields.

194-226: Excellent upsert implementation. The upsert logic correctly handles both insert and update cases, properly preserving `created_at` timestamps on updates while updating `updated_at`. The URL-to-ID mapping maintenance ensures efficient lookups.

228-283: LGTM - Page retrieval and listing logic is sound. The delegation pattern in `get_page_by_url` is clean, and all methods properly reconstruct `PageMetadata` instances with appropriate defaults for optional fields.

308-321: LGTM - Utility operations are appropriate. The `list_unique_urls` method correctly iterates chunks (as documented) to gather URLs, and `health_check` appropriately returns `True` for an in-memory implementation with no external dependencies.
```python
async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
    """Get a chunk by ID."""
    response = self._client.table(self.CHUNKS_TABLE).select("*").eq("id", chunk_id).single().execute()

    if not response.data:
        return None

    return self._row_to_crawled_page(response.data)
```
🧩 Analysis chain
🌐 Web query:
Supabase Python client single() method exception behavior
💡 Result:
Short answer: supabase-py .single() requires the query to return exactly one row. If the result is zero or >1 rows, PostgREST returns an error (PGRST116) and the client surfaces that error in the response (data will be None and error populated). Use .limit(1).single() when you expect exactly one row, or use .maybe_single() if you want "zero allowed" (it returns null data rather than the PGRST116 error). [1][2][3]
Sources:
- Supabase Python .single() docs. [1]
- Supabase Python .maybe_single() docs. [2]
- PostgREST / Supabase issue showing PGRST116 ("JSON object requested, multiple (or no) rows returned"). [3]
🏁 Script executed:
```shell
# First, check the file size and content around the specified lines
wc -l python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 121

🏁 Script executed:

```shell
# Read the specific lines mentioned in the review
sed -n '165,180p' python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 657

🏁 Script executed:

```shell
# Check lines 223 and 232 as well
sed -n '220,235p' python/src/core/infrastructure/supabase/page_repository.py
```

Repository: coleam00/Archon
Length of output: 643

🏁 Script executed:

```shell
# Search for all uses of .single() in the file
rg '\.single\(\)' python/src/core/infrastructure/supabase/page_repository.py -n -A 3
```

Repository: coleam00/Archon
Length of output: 591
Check response.error and log failures instead of silently returning None.
The Supabase .single() method does not raise exceptions. Instead, when zero or multiple records are returned, PostgREST returns an error (PGRST116) in the response object with `data=None` and `error` populated. The current code at lines 170, 223, and 232 only checks `if not response.data: return None`, which silently treats all error cases identically without logging.
Log the error with context (attempted operation, ID being queried, error details) before returning None, following the guideline to log detailed errors for each failure instead of silently crashing or ignoring errors.
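A sketch of that pattern, assuming the supabase-py response exposes `.data` and `.error` as described in the web-query result above:

```python
# Illustrative error-aware variant of get_chunk_by_id (method body sketch;
# attribute names follow the supabase-py response shape described above).
import logging

logger = logging.getLogger(__name__)


async def get_chunk_by_id(self, chunk_id: str):
    response = self._client.table(self.CHUNKS_TABLE).select("*").eq("id", chunk_id).single().execute()

    error = getattr(response, "error", None)
    if error is not None:
        # e.g. PGRST116: log the operation, the ID, and the error details.
        logger.error("get_chunk_by_id failed for chunk_id=%s: %s", chunk_id, error)
        return None
    if not response.data:
        return None
    return self._row_to_crawled_page(response.data)
```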
🤖 Prompt for AI Agents
In python/src/core/infrastructure/supabase/page_repository.py around lines 168
to 175, the method get_chunk_by_id checks only response.data and returns None
silently on failure; update it to check response.error and, when present, log a
clear message containing the attempted operation ("get_chunk_by_id"), the
chunk_id being queried, and the response.error details before returning None;
apply the same pattern for the other similar checks noted (around lines 223 and
232) so failures are recorded with context instead of being silently ignored.
Actionable comments posted: 0
🧹 Nitpick comments (4)
python/src/core/container.py (3)
34-95: Consider adding thread synchronization for singleton initialization. The singleton pattern using global variables without locking can lead to race conditions in multi-threaded environments (e.g., FastAPI/Flask with multiple workers). If two threads call `get_page_repository()` simultaneously before `_repository_instance` is set, both might proceed past the `if _repository_instance is not None` check (line 74) and create multiple instances.

🔎 Proposed fix using threading.Lock

```diff
 import os
+import threading
 from typing import TYPE_CHECKING

 if TYPE_CHECKING:
     from .domain.interfaces import IEmbeddingService, IPageRepository

 # Singleton instances
 _repository_instance: "IPageRepository | None" = None
 _embedding_service_instance: "IEmbeddingService | None" = None

 # Override instances for testing
 _repository_override: "IPageRepository | None" = None
 _embedding_service_override: "IEmbeddingService | None" = None

+# Lock for thread-safe singleton initialization
+_lock = threading.Lock()

 def get_repository_type() -> str:
     ...

 def get_page_repository() -> "IPageRepository":
     """
     Get the page repository instance.
     ...
     """
     global _repository_instance

     # Check for test override first
     if _repository_override is not None:
         return _repository_override

     # Return cached instance if available
     if _repository_instance is not None:
         return _repository_instance

-    # Create new instance based on type
-    repo_type = get_repository_type()
-
-    if repo_type == "supabase":
-        from .infrastructure.supabase.page_repository import SupabasePageRepository
-
-        _repository_instance = SupabasePageRepository()
-    elif repo_type == "postgres":
-        from .infrastructure.postgres.page_repository import PostgresPageRepository
-
-        _repository_instance = PostgresPageRepository()
-    elif repo_type == "memory":
-        from .infrastructure.memory.page_repository import InMemoryPageRepository
-
-        _repository_instance = InMemoryPageRepository()
-    else:
-        raise ValueError(f"Unknown repository type: {repo_type}. Valid options: 'supabase', 'postgres', 'memory'")
-
-    return _repository_instance
+    # Double-checked locking pattern
+    with _lock:
+        # Check again after acquiring lock
+        if _repository_instance is not None:
+            return _repository_instance
+
+        # Create new instance based on type
+        repo_type = get_repository_type()
+
+        if repo_type == "supabase":
+            from .infrastructure.supabase.page_repository import SupabasePageRepository
+
+            _repository_instance = SupabasePageRepository()
+        elif repo_type == "postgres":
+            from .infrastructure.postgres.page_repository import PostgresPageRepository
+
+            _repository_instance = PostgresPageRepository()
+        elif repo_type == "memory":
+            from .infrastructure.memory.page_repository import InMemoryPageRepository
+
+            _repository_instance = InMemoryPageRepository()
+        else:
+            raise ValueError(f"Unknown repository type: {repo_type}. Valid options: 'supabase', 'postgres', 'memory'")
+
+        return _repository_instance
```

Apply similar locking to `get_embedding_service()` (lines 98-122).
166-183: Consider validating `repository_type` in `configure()` for earlier error detection. Currently, `configure()` accepts any string for `repository_type` without validation (line 178). Invalid values won't be caught until `get_page_repository()` is called later, potentially making debugging harder. Consider adding validation to fail fast with a clear error message at configuration time.

🔎 Proposed validation

```diff
+_VALID_REPOSITORY_TYPES = {"supabase", "postgres", "memory"}
+
 def configure(
     repository_type: str | None = None,
     force_reinit: bool = False,
 ) -> None:
     """
     Configure the container.

     Args:
         repository_type: Override repository type (sets env var)
         force_reinit: Force reinitialization of instances
+
+    Raises:
+        ValueError: If repository_type is invalid
     """
     if repository_type:
+        if repository_type.lower() not in _VALID_REPOSITORY_TYPES:
+            raise ValueError(
+                f"Invalid repository type: {repository_type}. "
+                f"Valid options: {', '.join(sorted(_VALID_REPOSITORY_TYPES))}"
+            )
         os.environ["REPOSITORY_TYPE"] = repository_type

     if force_reinit:
         global _repository_instance, _embedding_service_instance
         _repository_instance = None
         _embedding_service_instance = None
```
98-122: Consider environment-based configuration for the embedding service (consistency with the repository pattern). The embedding service is hardcoded to `MockEmbeddingService` (line 121) and requires manual `override_embedding_service()` calls for production use, while the repository supports environment-based configuration via `REPOSITORY_TYPE`. This creates an inconsistency in the DI container's API design.

For consistency, consider adding an `EMBEDDING_SERVICE_TYPE` environment variable (e.g., "mock", "openai", "local") similar to the repository pattern, allowing declarative configuration rather than requiring programmatic overrides in production code.
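A sketch of what that suggestion could look like, assuming an `EMBEDDING_SERVICE_TYPE` variable and the container's lazy-import style; the OpenAI service module is a placeholder (the PR's openai infrastructure package is still empty), so every name below is illustrative:

```python
# Hypothetical env-driven embedding service factory, mirroring the
# repository selection pattern described in this review.
import os


def get_embedding_service():
    service_type = os.getenv("EMBEDDING_SERVICE_TYPE", "mock").lower()
    if service_type == "mock":
        from .infrastructure.memory.mock_embedding_service import MockEmbeddingService

        return MockEmbeddingService()
    if service_type == "openai":
        # Assumed module: the openai infrastructure package is currently a placeholder.
        from .infrastructure.openai.embedding_service import OpenAIEmbeddingService

        return OpenAIEmbeddingService()
    raise ValueError(f"Unknown embedding service type: {service_type}. Valid options: 'mock', 'openai'")
```

python/src/core/domain/models/search_result.py (1)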
37-49: Consider adding validation for the similarity range. The `similarity` field documentation indicates it should be in the range 0-1, but there's no Field validator enforcing this constraint. While the factory methods may always provide valid values, adding a validator would make the model more robust.

🔎 Optional: Add a Field validator for the similarity range

```diff
- similarity: float = Field(default=0.0, description="Vector similarity score (0-1)")
+ similarity: float = Field(default=0.0, ge=0.0, le=1.0, description="Vector similarity score (0-1)")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- python/src/core/container.py
- python/src/core/domain/models/search_result.py
- python/tests/core/test_domain_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/tests/core/test_domain_models.py
🧰 Additional context used
📓 Path-based instructions (2)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
- python/src/core/domain/models/search_result.py
- python/src/core/container.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/domain/models/search_result.py
python/src/core/container.py
🔇 Additional comments (5)
python/src/core/container.py (1)
147-163: Good fix: reset() now clears environment variable.

The addition of `os.environ.pop("REPOSITORY_TYPE", None)` on line 163 properly addresses the test pollution issue raised in the previous review. This ensures that tests calling configure() won't leave environment state that affects subsequent tests.

python/src/core/domain/models/search_result.py (4)

1-10: LGTM! Clean imports and clear documentation.

The module docstring clearly explains the purpose, and imports are minimal and appropriate for a Pydantic model.

51-75: LGTM! Consistent None handling and proper type conversions.

The method correctly handles None values for the id field and provides sensible defaults for missing fields. The conversion from numeric id to string is well-documented.

77-102: LGTM! The None handling issue has been resolved.

The previously flagged inconsistency in None handling for the id field has been fixed. The method now correctly preserves None for missing IDs (lines 90-91), consistent with from_rpc_result. The implementation correctly mirrors the RPC factory method while accepting similarity as an explicit parameter.

104-114: LGTM! Proper use of Pydantic's immutable update pattern.

The method correctly uses model_copy with the update parameter (Pydantic v2 API) to return a new instance with the updated rerank_score, following immutable design principles.
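For reference, the immutable-update idiom being praised looks roughly like this (a sketch, not the model's full field list):

```python
from pydantic import BaseModel


class SearchResult(BaseModel):
    content: str
    rerank_score: float | None = None

    def with_rerank_score(self, score: float) -> "SearchResult":
        # model_copy(update=...) returns a new instance; the original is untouched
        return self.model_copy(update={"rerank_score": score})


original = SearchResult(content="hello")
updated = original.with_rerank_score(0.95)
assert original.rerank_score is None and updated.rerank_score == 0.95
```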
Actionable comments posted: 1
🧹 Nitpick comments (5)
python/tests/core/test_domain_models.py (3)
12-110: Consider adding test coverage for get_active_embedding() method.

The CrawledPage model has a get_active_embedding() method (per the relevant snippet) that returns the appropriate embedding based on embedding_dimension. This behavior isn't directly tested, though it's implicitly used by to_insert_dict.

🔎 Example test for get_active_embedding

```python
def test_get_active_embedding_returns_correct_dimension(self):
    """Test get_active_embedding returns embedding matching dimension."""
    chunk = CrawledPage(
        source_id="src-123",
        url="https://example.com/page",
        chunk_number=0,
        content="Content",
        metadata={},
        embedding_768=[0.1] * 768,
        embedding_1536=[0.3] * 1536,
        embedding_dimension=768,
    )
    active = chunk.get_active_embedding()
    assert active is not None
    assert len(active) == 768
    assert active == chunk.embedding_768
```
112-170: Consider adding test for from_crawl_result() factory method.

The PageMetadata model has a from_crawl_result() classmethod that constructs a PageMetadata from crawl results, computing word_count, char_count, and building the metadata dict. This factory method isn't tested.

🔎 Example test for from_crawl_result

```python
def test_from_crawl_result(self):
    """Test creating PageMetadata from crawl result."""
    page = PageMetadata.from_crawl_result(
        url="https://example.com/page",
        markdown="This is some markdown content.",
        source_id="src-123",
        knowledge_type="documentation",
        tags=["python", "api"],
        crawl_type="url",
    )
    assert page.url == "https://example.com/page"
    assert page.source_id == "src-123"
    assert page.full_content == "This is some markdown content."
    assert page.word_count > 0
    assert page.char_count > 0
    assert page.metadata["knowledge_type"] == "documentation"
    assert page.metadata["crawl_type"] == "url"
    assert page.metadata["tags"] == ["python", "api"]
```
172-293: Consider adding test for with_rerank_score() method.

The SearchResult model has a with_rerank_score() method that returns a copy with the rerank score set. This method isn't tested.

🔎 Example test for with_rerank_score

```python
def test_with_rerank_score(self):
    """Test creating a copy with rerank score."""
    original = SearchResult(
        id="result-123",
        content="Content",
        url="https://example.com",
        similarity=0.9,
    )
    updated = original.with_rerank_score(0.95)

    # Original should be unchanged
    assert original.rerank_score is None
    # New instance should have the score
    assert updated.rerank_score == 0.95
    # Other fields should be preserved
    assert updated.id == "result-123"
    assert updated.similarity == 0.9
```

python/tests/core/test_container.py (2)
195-201: Simplify environment variable clearing.

The pattern of setting environment variables to empty strings in patch.dict and then immediately popping them is unnecessarily convoluted. You can simplify this by removing the variables directly within the patch.dict context.

🔎 Proposed simplification

```diff
 # Clear Supabase env vars
-with patch.dict(
-    os.environ,
-    {"SUPABASE_URL": "", "SUPABASE_SERVICE_KEY": ""},
-    clear=False,
-):
+with patch.dict(os.environ, clear=False):
     os.environ.pop("SUPABASE_URL", None)
     os.environ.pop("SUPABASE_SERVICE_KEY", None)

     with pytest.raises(ValueError) as exc_info:
```
187-206: Consider adding positive test case for Supabase configuration.

The current test verifies that missing Supabase credentials raise an error. Consider adding a companion test that verifies successful Supabase repository instantiation when valid credentials are provided. This would provide more complete coverage of the Supabase configuration path.

Example test structure:

```python
def test_supabase_with_credentials_succeeds(self):
    """Test that Supabase with credentials succeeds."""
    container.configure(repository_type="supabase")

    with patch.dict(
        os.environ,
        {
            "SUPABASE_URL": "https://example.supabase.co",
            "SUPABASE_SERVICE_KEY": "test-key-123",
        },
    ):
        repo = container.get_page_repository()

        from src.core.infrastructure.supabase.page_repository import (
            SupabasePageRepository,
        )

        assert isinstance(repo, SupabasePageRepository)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
python/src/core/domain/interfaces/__init__.py
python/src/core/domain/interfaces/embedding_service.py
python/tests/core/test_container.py
python/tests/core/test_domain_models.py
python/tests/core/test_repositories.py
✅ Files skipped from review due to trivial changes (1)
- python/tests/core/test_repositories.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/core/domain/interfaces/embedding_service.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/domain/interfaces/__init__.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/domain/interfaces/__init__.py
python/tests/core/test_domain_models.py
python/tests/core/test_container.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/test_domain_models.py
python/tests/core/test_container.py
🧬 Code graph analysis (3)
python/src/core/domain/interfaces/__init__.py (2)
python/src/core/domain/interfaces/embedding_service.py (2)
- EmbeddingError (11-14)
- IEmbeddingService (17-83)
python/src/core/domain/interfaces/page_repository.py (1)
- IPageRepository (15-263)
python/tests/core/test_domain_models.py (3)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
python/src/core/domain/models/search_result.py (3)
- SearchResult (13-114)
- from_rpc_result (52-75)
- from_postgres_row (78-102)
python/tests/core/test_container.py (3)
python/src/core/infrastructure/memory/page_repository.py (2)
- clear (41-45)
- InMemoryPageRepository (18-321)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- MockEmbeddingService (13-91)
python/src/core/domain/interfaces/embedding_service.py (1)
- dimension (65-72)
🔇 Additional comments (7)
python/tests/core/test_domain_models.py (2)
1-9: LGTM!

Imports and module docstring are correct. The models are properly imported from the domain models package.

295-334: LGTM!

Edge case tests cover important boundary conditions: empty content and similarity bounds (0.0 and 1.0). These are valuable for ensuring model robustness.

python/src/core/domain/interfaces/__init__.py (1)

1-16: LGTM!

Clean package initialization that properly re-exports the domain interfaces and exceptions. The __all__ list correctly matches the imports, and the docstring clearly explains the architectural purpose of these abstractions for dependency inversion and testability.

python/tests/core/test_container.py (4)

68-102: LGTM!

The repository instantiation tests properly verify memory repository creation, singleton behavior, and reset functionality. The assertions correctly validate instance types and object identity.

104-136: LGTM!

The repository override tests correctly verify that custom repositories can be injected and that the reset mechanism properly clears overrides. Identity assertions appropriately use the `is` operator.

138-172: LGTM!

The embedding service tests comprehensively cover instantiation, singleton behavior, and override functionality. The custom dimension verification in the override test (line 171) adds good specificity.

174-185: LGTM!

The invalid configuration test properly verifies that an unknown repository type raises a ValueError with an appropriate error message. The use of pytest.raises is correct.
```python
with patch.dict(os.environ, {}, clear=True):
    # Remove REPOSITORY_TYPE if present
    os.environ.pop("REPOSITORY_TYPE", None)
```
Remove redundant environment variable pop.
The patch.dict(os.environ, {}, clear=True) on line 28 already replaces the environment with an empty dictionary, so the pop on line 30 will never find REPOSITORY_TYPE to remove. This line is redundant and can be deleted.
🔎 Proposed fix
```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_default_repository_type(self):
    """Test default repository type is supabase."""
    with patch.dict(os.environ, {}, clear=True):
        repo_type = container.get_repository_type()
        assert repo_type == "supabase"
```
🤖 Prompt for AI Agents
In python/tests/core/test_container.py around lines 28 to 30, the call to
os.environ.pop("REPOSITORY_TYPE", None) is redundant because
patch.dict(os.environ, {}, clear=True) already clears the environment; remove
the pop line to avoid dead code and keep the test setup concise.
Actionable comments posted: 0
♻️ Duplicate comments (1)
python/tests/core/test_container.py (1)
28-30: Remove redundant environment variable pop.

The patch.dict(os.environ, {}, clear=True) already clears the environment, making the subsequent pop operation redundant.

🔎 Proposed fix

```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```
🧹 Nitpick comments (1)
python/tests/core/test_container.py (1)
199-215: Simplify the credential clearing pattern.

Setting environment variables to empty strings in patch.dict and then immediately popping them is redundant. You can simplify this by either excluding the keys from the patch dict entirely or relying solely on the pop operations.

🔎 Proposed fix (clearer approach using only pop)

```diff
 def test_supabase_without_credentials_raises(self):
     """Test that Supabase without credentials raises error."""
     container.configure(repository_type="supabase")

-    # Clear Supabase env vars
-    with patch.dict(
-        os.environ,
-        {"SUPABASE_URL": "", "SUPABASE_SERVICE_KEY": ""},
-        clear=False,
-    ):
+    # Clear Supabase env vars while preserving other environment variables
+    with patch.dict(os.environ, clear=False):
         os.environ.pop("SUPABASE_URL", None)
         os.environ.pop("SUPABASE_SERVICE_KEY", None)

         with pytest.raises(ValueError) as exc_info:
             container.get_page_repository()

         assert "SUPABASE_URL" in str(exc_info.value)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
python/src/core/container.py
python/tests/core/test_container.py
🚧 Files skipped from review as they are similar to previous changes (1)
- python/src/core/container.py
🧰 Additional context used
📓 Path-based instructions (2)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/test_container.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/test_container.py
🧬 Code graph analysis (1)
python/tests/core/test_container.py (3)
python/src/core/container.py (7)
- reset (150-166)
- get_repository_type (43-50)
- configure (169-186)
- get_page_repository (53-98)
- override_repository (128-136)
- get_embedding_service (101-125)
- override_embedding_service (139-147)
python/src/core/infrastructure/memory/page_repository.py (2)
- clear (41-45)
- InMemoryPageRepository (18-321)
python/src/core/infrastructure/memory/mock_embedding_service.py (1)
- MockEmbeddingService (13-91)
🔇 Additional comments (6)
python/tests/core/test_container.py (6)
15-20: LGTM! Excellent test isolation pattern.

The autouse fixture ensures clean container state before and after each test, preventing test pollution. A sketch of the pattern follows.
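The pattern in question looks roughly like this (a sketch, assuming the container module exposes reset() as shown in the code graph above):

```python
import pytest

from src.core import container


@pytest.fixture(autouse=True)
def reset_container():
    # Runs around every test in the module: clean state in, clean state out
    container.reset()
    yield
    container.reset()
```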
36-65: LGTM! Configuration tests cover all scenarios.

The tests properly validate environment variable handling, function-based configuration, and override behavior.

71-101: LGTM! Repository instantiation tests are thorough.

The tests properly validate singleton behavior, instance types, and reset semantics.

107-135: LGTM! Override tests validate test-time behavior.

The tests properly verify that custom repository instances can be injected for testing and that reset clears the override.

141-171: LGTM! Embedding service tests mirror repository patterns.

The tests validate singleton behavior, override mechanism, and custom configuration (dimension parameter).

177-193: LGTM! Error handling tests are well-structured.

The tests properly validate error conditions using pytest.raises and verify error message content for both invalid types and unimplemented backends.
Actionable comments posted: 2
♻️ Duplicate comments (1)
python/tests/core/test_container.py (1)
28-30: Remove redundant environment variable pop.

The patch.dict(os.environ, {}, clear=True) on line 28 already replaces the environment with an empty dictionary, so the pop on line 30 will never find REPOSITORY_TYPE to remove. This line is redundant and can be deleted.

🔎 Proposed fix

```diff
 def test_default_repository_type(self):
     """Test default repository type is supabase."""
     with patch.dict(os.environ, {}, clear=True):
-        # Remove REPOSITORY_TYPE if present
-        os.environ.pop("REPOSITORY_TYPE", None)
-
         repo_type = container.get_repository_type()
         assert repo_type == "supabase"
```
🧹 Nitpick comments (5)
python/src/core/container.py (2)
98-122: Consider documenting production embedding service setup more prominently.

The current implementation always defaults to MockEmbeddingService, with a comment indicating production systems should use override_embedding_service(). This is fine for the abstraction layer PR, but consider adding a TODO or follow-up to implement production embedding service selection (e.g., via an environment variable similar to REPOSITORY_TYPE).
166-183: Clarify force_reinit behavior with overrides.

When force_reinit=True, the function clears singleton instances but not overrides. If an override is set, get_page_repository() will still return the override (lines 70-71) rather than creating a new instance based on the updated repository_type. This may be intentional for testing, but consider documenting this behavior in the docstring to avoid confusion.

🔎 Suggested docstring update

```diff
 def configure(
     repository_type: str | None = None,
     force_reinit: bool = False,
 ) -> None:
     """
     Configure the container.

     Args:
         repository_type: Override repository type (sets env var)
-        force_reinit: Force reinitialization of instances
+        force_reinit: Force reinitialization of singleton instances.
+            Note: Does not clear test overrides; use reset() for full cleanup.
     """
```

python/src/core/infrastructure/postgres/page_repository.py (1)
212-231: Consider batch insertion for better performance.

The current implementation inserts chunks one-by-one, acquiring a connection for each. For large batches, this could be optimized using executemany or COPY. However, the current approach provides better error isolation (one failed insert doesn't block others) and aligns with the resilient error-handling pattern. A sketch of the executemany variant follows.
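If batch throughput ever becomes a concern, a minimal sketch of an executemany-based variant (table and column names here are illustrative, not taken from the actual schema; note that asyncpg's executemany runs as a single implicit transaction, so per-row error isolation is lost):

```python
import asyncpg


async def insert_chunks_batch_fast(pool: asyncpg.Pool, rows: list[tuple]) -> int:
    """All-or-nothing batch insert; trades per-row error isolation for speed."""
    query = """
        INSERT INTO crawled_page_chunks (url, chunk_number, content, source_id)
        VALUES ($1, $2, $3, $4)
    """
    async with pool.acquire() as conn:
        # executemany prepares the statement once and streams every row through it
        await conn.executemany(query, rows)
    return len(rows)
```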
python/tests/core/infrastructure/postgres/test_page_repository.py (2)

120-139: Consider breaking long lines for readability.

Lines 126-127 exceed 120 characters. Per coding guidelines, Python backend code should use 120 character line length.

🔎 Proposed fix

```diff
 @pytest.mark.asyncio
 async def test_search_similar_filters_below_threshold(self):
     """Test search_similar filters results below similarity threshold."""
     mock_conn = AsyncMock()
     mock_conn.fetch = AsyncMock(
         return_value=[
-            {"id": 1, "url": "https://example.com", "content": "test", "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.03},  # Below threshold
-            {"id": 2, "url": "https://example.com/2", "content": "test2", "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.10},  # Above threshold
+            # Below threshold
+            {
+                "id": 1, "url": "https://example.com", "content": "test",
+                "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.03,
+            },
+            # Above threshold
+            {
+                "id": 2, "url": "https://example.com/2", "content": "test2",
+                "chunk_number": 0, "metadata": {}, "source_id": "src-1", "similarity": 0.10,
+            },
         ]
     )
```
461-479: Consider breaking long lines for readability.

Lines 467-468 exceed 120 characters. Per coding guidelines, Python backend code should use 120 character line length.

🔎 Proposed fix

```diff
 @pytest.mark.asyncio
 async def test_list_pages_by_source(self):
     """Test list_pages_by_source returns list of pages."""
     mock_conn = AsyncMock()
     mock_conn.fetch = AsyncMock(
         return_value=[
-            {"id": "page-1", "source_id": "src-1", "url": "https://example.com/1", "full_content": "Content 1", "section_title": None, "section_order": 0, "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {}},
-            {"id": "page-2", "source_id": "src-1", "url": "https://example.com/2", "full_content": "Content 2", "section_title": None, "section_order": 1, "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {}},
+            {
+                "id": "page-1", "source_id": "src-1", "url": "https://example.com/1",
+                "full_content": "Content 1", "section_title": None, "section_order": 0,
+                "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {},
+            },
+            {
+                "id": "page-2", "source_id": "src-1", "url": "https://example.com/2",
+                "full_content": "Content 2", "section_title": None, "section_order": 1,
+                "word_count": 2, "char_count": 9, "chunk_count": 1, "metadata": {},
+            },
         ]
     )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
.env.example
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/__init__.py
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
✅ Files skipped from review due to trivial changes (1)
- python/tests/core/infrastructure/postgres/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/core/container.py
python/src/core/infrastructure/postgres/__init__.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_page_repository.py
python/tests/core/test_container.py
🧬 Code graph analysis (2)
python/src/core/infrastructure/postgres/__init__.py (1)
python/src/core/infrastructure/postgres/page_repository.py (1)
- PostgresPageRepository (54-506)
python/src/core/infrastructure/postgres/page_repository.py (4)
python/src/core/domain/interfaces/page_repository.py (1)
- IPageRepository (15-263)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
python/src/core/domain/models/search_result.py (2)
- SearchResult (13-114)
- from_postgres_row (78-102)
🔇 Additional comments (16)
.env.example (1)
26-48: LGTM!

The database backend configuration section is well-structured with clear documentation explaining the three supported backends and both connection options for PostgreSQL. The commented-out examples with sensible defaults provide a good template for users.

python/src/core/infrastructure/postgres/__init__.py (1)

1-10: LGTM!

Clean package initialization with appropriate docstring explaining the backend's purpose. The export aligns with the container's import expectations.

python/tests/core/test_container.py (3)

15-20: LGTM!

The autouse fixture properly resets container state before and after each test, preventing test pollution across the suite.

68-101: LGTM!

Good coverage of repository instantiation, singleton behavior, and reset semantics. The tests properly verify that the container returns the correct implementation type and maintains singleton identity.

174-200: LGTM!

Error handling tests properly verify that invalid repository types and missing PostgreSQL credentials raise appropriate ValueError exceptions with informative messages.

python/src/core/container.py (3)

1-26: LGTM!

Excellent module docstring with clear usage examples for both production and testing scenarios. The configuration options are well-documented.

53-95: LGTM!

The repository factory properly handles test overrides with precedence, uses lazy imports to avoid circular dependencies, and provides clear error messages for invalid types. The singleton pattern with override support enables clean testing.

147-163: LGTM!

The reset function properly clears all state including the REPOSITORY_TYPE environment variable, which addresses test pollution concerns. This is the correct cleanup behavior for test teardown.

python/src/core/infrastructure/postgres/page_repository.py (4)

1-18: LGTM!

Clean module structure with appropriate imports and logging setup. The docstring accurately describes the implementation's purpose and dependencies.

21-51: LGTM!

The connection string builder properly supports both URL-based and individual parameter configurations, with clear error messages that include the required environment variables when credentials are missing. This aligns with the coding guidelines for error context.

168-177: LGTM!

Error handling follows the coding guidelines: logs detailed context (match_count, source_filter) with exc_info=True for full stack trace, uses specific exception handling, and returns a safe fallback (empty list) to maintain resilience.

460-468: LGTM!

Simple and effective health check that returns a boolean without exposing internal errors. The silent exception handling is appropriate here since the caller only needs to know if the connection is healthy.
python/tests/core/infrastructure/postgres/test_page_repository.py (4)
1-18: LGTM!

Clean test module structure with appropriate imports. Testing the internal _get_connection_string helper is valuable for verifying environment configuration handling.

75-84: LGTM!

The create_mock_pool helper provides a clean way to mock asyncpg's pool with proper async context manager support, enabling isolated unit tests without a real database. A sketch of the idiom follows.
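For readers who haven't seen the idiom, a helper along these lines (a sketch; the test file's actual create_mock_pool may differ) makes `async with pool.acquire() as conn:` work against a mock:

```python
from unittest.mock import AsyncMock, MagicMock


def create_mock_pool(mock_conn: AsyncMock) -> MagicMock:
    """Build a fake asyncpg pool whose acquire() yields mock_conn."""
    pool = MagicMock()
    ctx = MagicMock()
    ctx.__aenter__ = AsyncMock(return_value=mock_conn)
    ctx.__aexit__ = AsyncMock(return_value=False)
    pool.acquire.return_value = ctx
    return pool
```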
141-159: LGTM!

Good test for error resilience. The `# noqa: B901` comment correctly suppresses the linter warning for the intentional unreachable yield after raise, which is needed to make the async context manager syntactically valid.

510-583: LGTM!

Utility operation tests provide good coverage for list_unique_urls (with and without source filter) and health_check (success and failure paths). The error simulation pattern is consistent with other tests in this file.
```python
async with pool.acquire() as conn:
    row = await conn.fetchrow(query, int(chunk_id))

if not row:
    return None

return self._row_to_crawled_page(dict(row))
```
Handle non-integer chunk IDs gracefully.
Line 276 calls int(chunk_id) which will raise ValueError if chunk_id is not a valid integer string (e.g., UUID). Consider catching this or validating input, especially if the ID format might vary between backends.
🔎 Proposed fix
```diff
 async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
     """Get a chunk by ID."""
     pool = await self._ensure_pool()
+
+    try:
+        chunk_id_int = int(chunk_id)
+    except ValueError:
+        logger.warning("Invalid chunk_id format: %s (expected integer)", chunk_id)
+        return None
     query = f"""
         SELECT * FROM {self.CHUNKS_TABLE}
         WHERE id = $1
     """
     async with pool.acquire() as conn:
-        row = await conn.fetchrow(query, int(chunk_id))
+        row = await conn.fetchrow(query, chunk_id_int)
     if not row:
         return None
     return self._row_to_crawled_page(dict(row))
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 274
to 281, the code unconditionally calls int(chunk_id) which will raise ValueError
for non-integer IDs (e.g., UUIDs); change the logic to validate/normalize
chunk_id before using it: if chunk_id is numeric (use str.isdigit() or a try/int
conversion), convert to int and pass it to the query, otherwise pass the
chunk_id as-is (or transform to the expected DB type), and wrap the conversion
in a try/except that either logs the invalid format and returns None or raises a
clear, typed exception—ensure the DB query uses the correctly-typed parameter
and avoid letting ValueError propagate uncaught.
Actionable comments posted: 2
🧹 Nitpick comments (2)
python/tests/core/infrastructure/postgres/test_integration.py (2)
382-383: Consider using fixture-based cleanup for consistency.

The manual cleanup of other_source_id is inconsistent with the fixture-based cleanup pattern used elsewhere. If this cleanup fails, it will cause the test to fail unexpectedly.

🔎 Suggested approach

Either wrap in try-except:

```diff
 # Cleanup other source
-await repo.delete_chunks_by_source(other_source_id)
+try:
+    await repo.delete_chunks_by_source(other_source_id)
+except Exception as e:
+    import logging
+    logging.getLogger(__name__).warning(
+        "Failed to cleanup other_source_id=%s: %s",
+        other_source_id,
+        e,
+    )
```

Or rely on database isolation: since you're using unique source IDs, you could let the test leave the data and trust that other tests won't conflict.
462-470: Strengthen test assertions for multi-dimension embeddings.

The test only validates that results is a list without checking if the inserted chunk was actually found. This provides minimal validation of the multi-dimension embedding functionality. Consider one of these approaches:

Option 1: Assert chunk is found (if RPC supports all dimensions)

```diff
 # Search with same dimension
 results = await repo.search_similar(
     embedding=embedding,
     match_count=5,
     source_filter=test_source_id,
 )
-# Should find the chunk (may not find if RPC doesn't support this dimension)
-# This validates the search doesn't error out
-assert isinstance(results, list)
+# Should find the inserted chunk
+assert len(results) > 0, f"No results found for {dimension}D embedding"
+assert any(
+    result.source_id == test_source_id and result.content == f"Content for {dimension}D embedding"
+    for result in results
+), f"Inserted chunk not found in results for {dimension}D embedding"
```

Option 2: Skip unsupported dimensions

If some dimensions aren't supported by the RPC function, use pytest.mark.skipif or query the database schema to determine supported dimensions, then only test those.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
python/tests/core/infrastructure/postgres/test_integration.py
🧰 Additional context used
📓 Path-based instructions (2)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/infrastructure/postgres/test_integration.py
🧠 Learnings (1)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/integration/**/*.py : Mark integration tests with `pytest.mark.integration` decorator and place in `tests/integration/` directory
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
🧬 Code graph analysis (1)
python/tests/core/infrastructure/postgres/test_integration.py (3)
python/src/core/infrastructure/postgres/page_repository.py (1)
- PostgresPageRepository (54-506)
python/src/core/domain/models/crawled_page.py (1)
- CrawledPage (14-115)
python/src/core/domain/models/page_metadata.py (1)
- PageMetadata (14-121)
🔇 Additional comments (1)
python/tests/core/infrastructure/postgres/test_integration.py (1)
66-186: LGTM! Well-structured integration tests.

The connection and chunk operation tests are well-designed with proper test isolation using unique source IDs, comprehensive coverage of CRUD operations, and appropriate assertions.
```python
pytestmark = pytest.mark.skipif(
    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
)
```
Missing pytest.mark.integration decorator per project standards.

Based on learnings, integration tests should be marked with @pytest.mark.integration in addition to (or instead of) the skipif marker. This allows proper test categorization and filtering in CI/CD pipelines.

🔎 Suggested fix (replaces the existing pytestmark so only one assignment remains)

```diff
 # Skip all tests in this module if integration testing is not enabled
-pytestmark = pytest.mark.skipif(
-    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
-    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
-)
+# Mark all tests as integration tests
+pytestmark = [
+    pytest.mark.skipif(
+        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
+        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
+    ),
+    pytest.mark.integration,
+]
```

Based on learnings, integration tests should use the pytest.mark.integration decorator and be placed in the tests/integration/ directory. Consider relocating this file to python/tests/integration/core/infrastructure/postgres/test_integration.py.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
pytestmark = [
    pytest.mark.skipif(
        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
    ),
    pytest.mark.integration,
]
```
🤖 Prompt for AI Agents
In python/tests/core/infrastructure/postgres/test_integration.py around lines
28-31, the module only uses a skipif marker and is missing the required
pytest.mark.integration designation; add the integration marker so CI can
categorize these tests (either by updating the module-level pytestmark to
combine markers, e.g. include pytest.mark.integration alongside the existing
skipif, or by annotating the test class/functions with @pytest.mark.integration)
and optionally move the file to
python/tests/integration/core/infrastructure/postgres/test_integration.py to
follow the project integration test directory convention.
```python
try:
    await repository.delete_chunks_by_source(test_source_id)
    await repository.delete_pages_by_source(test_source_id)
except Exception:
    pass  # Ignore cleanup errors
```
Improve error handling in cleanup to preserve diagnostics.
The bare except Exception: pass suppresses all cleanup errors without logging, violating the coding guidelines requirement to "preserve full stack traces" and "include context about what was being attempted."
🔎 Suggested fix with logging
```diff
 # Cleanup all test data
 try:
     await repository.delete_chunks_by_source(test_source_id)
     await repository.delete_pages_by_source(test_source_id)
-except Exception:
-    pass  # Ignore cleanup errors
+except Exception as e:
+    # Log but don't fail the test on cleanup errors
+    import logging
+    logging.getLogger(__name__).warning(
+        "Failed to cleanup test data for source_id=%s: %s",
+        test_source_id,
+        e,
+        exc_info=True,
+    )
```

As per coding guidelines, error messages must include context and preserve stack traces.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
try:
    await repository.delete_chunks_by_source(test_source_id)
    await repository.delete_pages_by_source(test_source_id)
except Exception as e:
    # Log but don't fail the test on cleanup errors
    import logging
    logging.getLogger(__name__).warning(
        "Failed to cleanup test data for source_id=%s: %s",
        test_source_id,
        e,
        exc_info=True,
    )
```
🤖 Prompt for AI Agents
In python/tests/core/infrastructure/postgres/test_integration.py around lines
59-63, the cleanup block currently swallows all exceptions with `except
Exception: pass`; change it to catch the exception as a variable and log it with
context and full stack trace (e.g., `except Exception as e:
logger.exception("Failed cleaning up pages/chunks for source %s",
test_source_id, exc_info=True)`), ensuring a suitable logger is available or
imported in the test; do not remove the cleanup attempt—just log the failure
with the source id and stack trace so diagnostics are preserved.
Actionable comments posted: 3
♻️ Duplicate comments (3)
python/tests/core/infrastructure/postgres/test_integration.py (2)
28-31: Add pytest.mark.integration marker per project standards.

Based on learnings, integration tests should be marked with pytest.mark.integration in addition to the skipif marker for proper test categorization.

🔎 Suggested fix

```diff
-pytestmark = pytest.mark.skipif(
-    os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
-    reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
-)
+pytestmark = [
+    pytest.mark.skipif(
+        os.getenv("POSTGRES_INTEGRATION_TEST", "").lower() != "true",
+        reason="PostgreSQL integration tests disabled. Set POSTGRES_INTEGRATION_TEST=true to enable.",
+    ),
+    pytest.mark.integration,
+]
```
72-83: Log cleanup errors with context and stack trace.

The bare `except Exception: pass` violates coding guidelines requiring error context and stack traces. As per coding guidelines, preserve full stack traces with exc_info=True.

🔎 Suggested fix

```diff
 # Cleanup all test data
 try:
     await repository.delete_chunks_by_source(test_source_id)
     await repository.delete_pages_by_source(test_source_id)
     # Also delete the source
     async with pool.acquire() as conn:
         await conn.execute(
             "DELETE FROM archon_sources WHERE source_id = $1",
             test_source_id,
         )
-except Exception:
-    pass  # Ignore cleanup errors
+except Exception as e:
+    # Log but don't fail the test on cleanup errors
+    logger.warning(
+        "Failed to cleanup test data for source_id=%s: %s",
+        test_source_id,
+        e,
+        exc_info=True,
+    )
```

Based on learnings, use logger.exception() or log with exc_info=True to preserve stack traces.

python/src/core/infrastructure/postgres/page_repository.py (1)
275-290: Handle non-integer chunk IDs gracefully.

Line 285 calls int(chunk_id), which will raise ValueError for non-integer IDs. Consider validating the input or catching the exception to provide a clearer error message.

🔎 Suggested fix

```diff
 async def get_chunk_by_id(self, chunk_id: str) -> CrawledPage | None:
     """Get a chunk by ID."""
     pool = await self._ensure_pool()
+
+    try:
+        chunk_id_int = int(chunk_id)
+    except ValueError:
+        logger.warning("Invalid chunk_id format: %s (expected integer)", chunk_id, exc_info=True)
+        return None
     query = f"""
         SELECT * FROM {self.CHUNKS_TABLE}
         WHERE id = $1
     """
     async with pool.acquire() as conn:
-        row = await conn.fetchrow(query, int(chunk_id))
+        row = await conn.fetchrow(query, chunk_id_int)
```
🧹 Nitpick comments (2)
python/tests/core/infrastructure/postgres/test_integration.py (1)
416-422: Consider adding error handling to manual cleanup.

The manual cleanup at the end of this test lacks error handling, unlike the fixture cleanup. If this cleanup fails, it could leave test data behind.

🔎 Optional improvement

```diff
 # Cleanup other source (chunks and source entry)
-await repo.delete_chunks_by_source(other_source_id)
-async with pool.acquire() as conn:
-    await conn.execute(
-        "DELETE FROM archon_sources WHERE source_id = $1",
-        other_source_id,
-    )
+try:
+    await repo.delete_chunks_by_source(other_source_id)
+    async with pool.acquire() as conn:
+        await conn.execute(
+            "DELETE FROM archon_sources WHERE source_id = $1",
+            other_source_id,
+        )
+except Exception as e:
+    logger.warning(
+        "Failed to cleanup other_source_id=%s: %s",
+        other_source_id,
+        e,
+        exc_info=True,
+    )
```

python/src/core/infrastructure/postgres/page_repository.py (1)
181-190: Consider differentiating critical vs. recoverable errors in search.

Returning an empty list for all exceptions provides resilience but may mask critical issues like connection failures. While the logging is good (includes context and exc_info=True), consider whether connection/pool errors should propagate rather than return []. One hedged sketch of that split follows.
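One way to draw that line, assuming asyncpg is the driver (the _run_search helper here is hypothetical):

```python
import logging

import asyncpg

logger = logging.getLogger(__name__)


async def search_similar_safe(repo, embedding: list[float], match_count: int) -> list:
    try:
        return await repo._run_search(embedding, match_count)  # hypothetical helper
    except asyncpg.PostgresError as e:
        # Server-side query failure: log with context and degrade to empty results
        logger.error("search_similar failed (match_count=%d): %s", match_count, e, exc_info=True)
        return []
    except (OSError, asyncpg.exceptions.InterfaceError):
        # Connection/pool failure: propagate so callers can react (retry, alert, fail fast)
        raise
```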
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
python/tests/core/infrastructure/postgres/test_integration.py
🧰 Additional context used
📓 Path-based instructions (3)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use Pytest for testing with async support in Python backend
Use Pytest for testing Python code with async support
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/tests/core/infrastructure/postgres/test_integration.py
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/core/domain/models/crawled_page.py
python/src/core/infrastructure/postgres/page_repository.py
🧠 Learnings (7)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/integration/**/*.py : Mark integration tests with `pytest.mark.integration` decorator and place in `tests/integration/` directory
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/tests/**/*.py : Mark unit tests with `pytest.mark.unit` decorator
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
python/tests/core/infrastructure/postgres/test_integration.py
🧬 Code graph analysis (1)
python/src/core/domain/models/crawled_page.py (1)
python/src/core/domain/models/page_metadata.py (1)
- to_insert_dict (62-83)
🔇 Additional comments (5)
python/src/core/domain/models/crawled_page.py (1)
14-60: LGTM - Well-structured domain model.

The CrawledPage model is well-designed with:
- Clear field definitions and proper Optional typing
- Support for multi-dimensional embeddings (768, 1024, 1536, 3072)
- Good docstrings explaining the purpose and relationships
- Appropriate Pydantic configuration for database ORM usage
python/tests/core/infrastructure/postgres/test_integration.py (1)
86-509: LGTM - Comprehensive integration test coverage.

The integration tests provide excellent coverage:
- Connection health checks
- Chunk CRUD operations (insert, batch insert, delete, retrieve, count)
- Page operations (upsert, update, list, chunk count updates)
- Vector similarity search with source filtering and ordering verification
- Multi-dimensional embedding support (768, 1024, 1536, 3072) via parametrized tests
The test isolation using unique test_source_id per test is a good practice.

python/src/core/infrastructure/postgres/page_repository.py (3)
89-127: LGTM - Codec registration properly implemented.

The pool initialization correctly uses the init callback to register type codecs (vector and jsonb) on every pooled connection. This ensures all connections have the necessary codecs before use, addressing the previous review concern about connection-specific codec registration.
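For readers unfamiliar with the pattern, an init callback on asyncpg's pool looks roughly like this (a sketch; the jsonb codec shown is a documented asyncpg idiom, while the PR's actual vector codec setup may differ):

```python
import json

import asyncpg


async def _init_connection(conn: asyncpg.Connection) -> None:
    # Runs once for each connection the pool creates, before it is handed out
    await conn.set_type_codec(
        "jsonb",
        encoder=json.dumps,
        decoder=json.loads,
        schema="pg_catalog",
    )


async def create_pool(dsn: str) -> asyncpg.Pool:
    # Every pooled connection passes through _init_connection first
    return await asyncpg.create_pool(dsn=dsn, init=_init_connection)
```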
481-513: LGTM - Robust data mapping helpers.

The helper methods handle database row conversion safely:
- Proper None handling with .get() and default values
- Type conversions for IDs using str()
- Appropriate defaults for required fields
54-476: Well-structured repository implementation.

The PostgresPageRepository is well-organized with:
- Clear separation between chunk operations, page operations, and utilities
- Proper connection pooling with asyncpg
- Consistent async/await patterns throughout
- Comprehensive implementation of the IPageRepository interface
The noted issues with error logging context in batch operations should be addressed, but the overall architecture is sound.
```python
# Set embedding in appropriate column
embedding = self.get_active_embedding()
if embedding:
    if embedding_column and embedding_column.startswith("embedding_"):
        data[embedding_column] = embedding
    elif self.embedding_dimension:
        # Use dimension-specific column when dimension is known
        data[f"embedding_{self.embedding_dimension}"] = embedding
    # Note: Generic 'embedding' column does not exist in schema
    # Only dimension-specific columns are supported
```
🧩 Analysis chain

Verification performed against the repository: inspected python/src/core/domain/models/crawled_page.py, searched for CrawledPage instantiations in src/ and tests/, checked for @validator/@field_validator declarations on the model, reviewed to_insert_dict() call sites, and compared the Supabase and Postgres insert_chunks_batch implementations. No validator requires embedding_dimension when an embedding is set, and no code path infers the dimension from the vector length.
Add validation or warning for embeddings without dimension metadata.
The to_insert_dict() method silently drops embeddings when embedding_dimension is None and no explicit embedding_column is provided. This occurs because the schema only supports dimension-specific columns (e.g., embedding_768, embedding_1536), not a generic embedding column. While this is by design, the Pydantic model allows creating CrawledPage instances with embeddings but without dimension metadata, creating a mismatch.
Consider either:
- Adding a field validator in __init__ to require embedding_dimension when embedding is set, or
- Logging a warning in to_insert_dict() when an embedding is silently dropped due to missing dimension metadata.

A sketch of the validator option follows.
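A minimal sketch of the validator option, using Pydantic v2's model_validator (the field names mirror the CrawledPage model in this PR, but treat the snippet as illustrative):

```python
from pydantic import BaseModel, model_validator


class CrawledPage(BaseModel):
    content: str
    embedding_768: list[float] | None = None
    embedding_1536: list[float] | None = None
    embedding_dimension: int | None = None

    @model_validator(mode="after")
    def require_dimension_with_embedding(self) -> "CrawledPage":
        has_embedding = self.embedding_768 is not None or self.embedding_1536 is not None
        if has_embedding and self.embedding_dimension is None:
            # Fail fast instead of letting to_insert_dict() silently drop the vector
            raise ValueError("embedding_dimension must be set when an embedding is provided")
        return self
```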
🤖 Prompt for AI Agents
In python/src/core/domain/models/crawled_page.py around lines 107-117, the code
silently drops an embedding when embedding_dimension is None and no
embedding_column is provided; add a fix so embeddings are not lost: implement a
Pydantic validator (or __post_init__) on the model to raise a clear validation
error when embedding is set but embedding_dimension is missing, or if you prefer
runtime tolerance, add a logging warning inside to_insert_dict() right before
the embedding is skipped that warns the embedding will be dropped due to missing
dimension metadata (include page id/context and suggest setting
embedding_dimension or providing embedding_column).
```python
async def insert_chunks_batch(
    self,
    chunks: list[CrawledPage],
    embedding_column: str | None = None,
) -> int:
    """Insert multiple chunks in a batch."""
    if not chunks:
        return 0

    count = 0

    # Insert each chunk (each insert_chunk call manages its own connection)
    for chunk in chunks:
        try:
            await self.insert_chunk(chunk)
            count += 1
        except Exception as e:
            logger.error("Failed to insert chunk: %s", e)

    return count
```
Add context to batch insert error logging.
The error logging at line 238 lacks context about which chunk failed. As per coding guidelines, error messages must include relevant IDs/URLs and context about what was being attempted.
🔎 Suggested fix

```diff
 # Insert each chunk (each insert_chunk call manages its own connection)
 for chunk in chunks:
     try:
         await self.insert_chunk(chunk)
         count += 1
     except Exception as e:
-        logger.error("Failed to insert chunk: %s", e)
+        logger.error(
+            "Failed to insert chunk (url=%s, source_id=%s, chunk_number=%d): %s",
+            chunk.url,
+            chunk.source_id,
+            chunk.chunk_number,
+            e,
+            exc_info=True,
+        )
```

Based on coding guidelines, include context about what was being attempted and preserve full stack traces.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 221
to 240, the batch insert error log does not include which chunk failed or the
stack trace; update the except block to log contextual identifiers from the
CrawledPage (e.g., chunk.id and/or chunk.url or other unique fields) and
preserve the full traceback (use logger.exception or logger.error(...,
exc_info=True)) so each failure message clearly states which chunk failed and
includes the stack trace for debugging.
```python
async def upsert_pages_batch(self, pages: list[PageMetadata]) -> dict[str, str]:
    """Upsert multiple pages and return url->id mapping."""
    if not pages:
        return {}

    result = {}
    for page in pages:
        try:
            page_id = await self.upsert_page(page)
            result[page.url] = page_id
        except Exception as e:
            logger.error("Failed to upsert page %s: %s", page.url, e)

    return result
```
Add context to batch upsert error logging.
Similar to insert_chunks_batch, the error logging at line 354 lacks sufficient context. Include the page URL and source_id to help debugging.
🔎 Suggested fix

```diff
 result = {}
 for page in pages:
     try:
         page_id = await self.upsert_page(page)
         result[page.url] = page_id
     except Exception as e:
-        logger.error("Failed to upsert page %s: %s", page.url, e)
+        logger.error(
+            "Failed to upsert page (url=%s, source_id=%s): %s",
+            page.url,
+            page.source_id,
+            e,
+            exc_info=True,
+        )
```

Based on coding guidelines, preserve full stack traces with exc_info=True.
🤖 Prompt for AI Agents
In python/src/core/infrastructure/postgres/page_repository.py around lines 343
to 356, the batch upsert error log only includes the page URL and exception
message; update the exception handling to log both page.url and page.source_id
(or page.metadata.source_id) and pass exc_info=True so the full stack trace is
preserved; ensure the log message is descriptive (e.g., "Failed to upsert page
%s (source_id=%s)") and include both values when calling logger.error.
- CODE_CHANGES_SUMMARY.md: Detail async repository initialization fix for archon_graph.py
- STAGING_VALIDATION_INDEX.md: Document staging validation results and deployment readiness
- docs/RECHERCHE_MIGRATION_SUPABASE_POSTGRES.md: Strategic research for Supabase to PostgreSQL migration

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…upabase deprecation

Phase 4 - Final documentation:
- Add comprehensive docs/ARCHITECTURE.md with:
  - ASCII diagrams of Domain/Infrastructure/Application layers
  - Complete Container DI guide with examples
  - Test examples with InMemoryRepository
  - Guide for adding new database backends (MongoDB example)
  - FAQ and troubleshooting
- Update README.md with:
  - New 'Database Layer Architecture' section
  - Quick start guide for PostgreSQL
  - Links to architecture and migration documentation

Phase 5 - Supabase cleanup documentation:
- Add docs/SUPABASE_DEPRECATION_GUIDE.md with:
  - Complete inventory of Supabase legacy code
  - Line-by-line analysis of code to clean
  - Migration plan in 4 phases (6-month timeline)
  - Cleanup checklist and risk assessment
  - Estimate: ~585 lines of code to remove

IMPORTANT: This is DOCUMENTATION ONLY. No Supabase code was deleted. The cleanup guide is for future reference when the user decides to proceed. All docstrings verified - already complete and well-structured.

Part of Sequence 4 (P4-06 + P5-04) completion.
Complete report for Phase 4 & 5 documentation tasks:
- ARCHITECTURE.md: 800+ lines of comprehensive architecture docs
- SUPABASE_DEPRECATION_GUIDE.md: 1000+ lines deprecation guide
- README.md: New Database Layer Architecture section
- All docstrings verified and complete

Total: 4/4 tasks completed, ~2200 lines of documentation added. No production code modified (documentation only).
Force-pushed 6b37422 to 3bbfc86.
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
archon/crawl_pydantic_ai_docs.py (1)
283-332: Fix type mismatch and move import to module level.

Two issues identified:

Critical: Type mismatch for metadata - The `SitePage` domain model expects `metadata: SitePageMetadata` (see `archon/domain/models/site_page.py`), but the code passes a plain dict created in `process_chunk` (lines 266-271). This will cause Pydantic validation errors at runtime.

Major: Import inside function - Line 299 imports `SitePage` inside the function, which is a code smell suggesting potential circular import issues.

🔎 Proposed fixes

Fix 1: Move import to module level

```diff
 # Import domain interfaces for dependency injection
-from archon.domain import ISitePagesRepository, IEmbeddingService
+from archon.domain import ISitePagesRepository, IEmbeddingService, SitePage, SitePageMetadata
```

Fix 2: Update process_chunk to return SitePageMetadata

In the `process_chunk` function (lines 265-271):

```diff
 # Create metadata
-metadata = {
-    "source": "pydantic_ai_docs",
-    "chunk_size": len(chunk),
-    "crawled_at": datetime.now(timezone.utc).isoformat(),
-    "url_path": urlparse(url).path
-}
+metadata = SitePageMetadata(
+    source="pydantic_ai_docs",
+    chunk_size=len(chunk),
+    crawled_at=datetime.now(timezone.utc).isoformat(),
+    url_path=urlparse(url).path
+)
```

Fix 3: Remove the now-redundant import

In the `insert_chunk` function (line 299):

```diff
 # Use injected repository if provided (new pattern)
 if repository is not None:
-    from archon.domain import SitePage
     page = SitePage(
```
🟠 Major comments (22)
requirements.txt-79-79 (1)
79-79: Upgrade MCP to ≥1.23.0 to resolve three HIGH-severity vulnerabilities.

The MCP package v1.2.1 is vulnerable to:
- CVE-2025-53366: FastMCP validation DoS via malformed requests (fixed 1.9.4)
- CVE-2025-66416: DNS rebinding bypass in unauthenticated HTTP servers (fixed 1.23.0)
- CVE-2025-53365: ClosedResourceError DoS in Streamable HTTP Transport (fixed 1.10.0)
Upgrade to mcp ≥1.23.0 to address all three. Alternatively, apply transport-level mitigations (strict request validation, DNS rebinding protection settings, process supervision).
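A minimal sketch of the bump, assuming the constraint currently reads `mcp>=1.0.0` as it does in requirements-base.txt:

```diff
-mcp>=1.0.0
+mcp>=1.23.0
```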
migrate_schema.py-34-34 (1)
34-34: CASCADE deletion could have unintended consequences.

Using `DROP TABLE ... CASCADE` will automatically drop all dependent objects (views, foreign keys, etc.) without explicit confirmation. This could lead to unintended data loss beyond the table itself.

🔎 Safer approach with explicit dependency check

```diff
+# Check for dependencies before dropping
+print("\nChecking for table dependencies...")
+deps = await conn.fetch("""
+    SELECT dependent_ns.nspname as dependent_schema,
+           dependent_view.relname as dependent_view
+    FROM pg_depend
+    JOIN pg_rewrite ON pg_depend.objid = pg_rewrite.oid
+    JOIN pg_class as dependent_view ON pg_rewrite.ev_class = dependent_view.oid
+    JOIN pg_class as source_table ON pg_depend.refobjid = source_table.oid
+    JOIN pg_namespace dependent_ns ON dependent_view.relnamespace = dependent_ns.oid
+    WHERE source_table.relname = 'site_pages'
+""")
+
+if deps:
+    print(f"\nWARNING: {len(deps)} dependent objects found:")
+    for dep in deps:
+        print(f"  - {dep['dependent_schema']}.{dep['dependent_view']}")
+    response = input("\nDrop dependencies? (yes/no): ")
+    if response.lower() != "yes":
+        print("\nMigration cancelled.")
+        return False
+
 print("\nDropping existing table...")
-await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")
+await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-27-30 (1)
27-30: User input blocks automation.

Using `input()` for confirmation prevents the script from running in CI/CD pipelines or automated environments.

🔎 Add command-line flag for non-interactive mode

```diff
+import sys
+
-async def migrate():
+async def migrate(force: bool = False):
     """Migrate the id column from UUID to SERIAL."""
     # ... connection code ...

     if count > 0:
         print(f"\nWARNING: Table has {count} records.")
         print("They will be DELETED during migration!")
-        response = input("\nContinue? (yes/no): ")
-        if response.lower() != "yes":
-            print("\nMigration cancelled.")
-            return False
+        if not force:
+            response = input("\nContinue? (yes/no): ")
+            if response.lower() != "yes":
+                print("\nMigration cancelled.")
+                return False
+        else:
+            print("\n[FORCE MODE] Proceeding with migration...")
```

Then update main:

```diff
 if __name__ == "__main__":
     print("PostgreSQL Schema Migration Tool")
     print("This will convert the id column from UUID to SERIAL")
-    asyncio.run(migrate())
+    force = "--force" in sys.argv or "-f" in sys.argv
+    asyncio.run(migrate(force=force))
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-8-14 (1)
8-14: Missing connection timeout and error handling.

The database connection lacks timeout configuration, which could cause the script to hang indefinitely if the database is unreachable.

🔎 Add timeout and connection retry logic

```diff
+import os
+
 async def migrate():
     """Migrate the id column from UUID to SERIAL."""
-    conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
-    )
+    try:
+        conn = await asyncio.wait_for(
+            asyncpg.connect(
+                host=os.getenv("POSTGRES_HOST", "localhost"),
+                port=int(os.getenv("POSTGRES_PORT", "5432")),
+                user=os.getenv("POSTGRES_USER", "postgres"),
+                password=os.getenv("POSTGRES_PASSWORD"),
+                database=os.getenv("POSTGRES_DB", "mydb"),
+                timeout=10
+            ),
+            timeout=15
+        )
+    except asyncio.TimeoutError:
+        print("\n[ERROR] Connection timeout - ensure PostgreSQL is running")
+        return False
+    except Exception as e:
+        print(f"\n[ERROR] Connection failed: {e}")
+        return False
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-73-77 (1)
73-77: Broad exception catching loses error context.

Catching bare `Exception` makes debugging difficult and could mask unexpected errors like programming mistakes.

🔎 Catch specific exceptions

```diff
-except Exception as e:
-    print(f"\n[ERROR] Migration failed: {e}")
-    import traceback
-    traceback.print_exc()
-    return False
+except asyncpg.PostgresError as e:
+    print(f"\n[ERROR] Database error during migration: {e}")
+    print(f"SQLSTATE: {e.sqlstate}")
+    import traceback
+    traceback.print_exc()
+    return False
+except Exception as e:
+    print(f"\n[ERROR] Unexpected error: {e}")
+    import traceback
+    traceback.print_exc()
+    raise  # Re-raise unexpected errors
```

Committable suggestion skipped: line range outside the PR's diff.
migrate_schema.py-16-79 (1)
16-79: Missing transaction management and rollback capability.

The migration performs multiple DDL operations without transaction management. If any step fails midway, the database could be left in an inconsistent state with no way to roll back.
🔎 Add transaction wrapper for atomicity
```diff
 try:
+    async with conn.transaction():
         print("\nMigration: UUID -> SERIAL (INTEGER)")
         print("=" * 60)

         # Check if table has data
         count = await conn.fetchval("SELECT COUNT(*) FROM site_pages")
         print(f"Current records: {count}")

         if count > 0:
             print(f"\nWARNING: Table has {count} records.")
             print("They will be DELETED during migration!")
             response = input("\nContinue? (yes/no): ")
             if response.lower() != "yes":
                 print("\nMigration cancelled.")
                 return False

         # Drop and recreate table with correct schema
         print("\nDropping existing table...")
         await conn.execute("DROP TABLE IF EXISTS site_pages CASCADE")

         print("Creating table with SERIAL id...")
         await conn.execute("""
             CREATE TABLE site_pages (
                 id SERIAL PRIMARY KEY,
                 url TEXT NOT NULL,
                 chunk_number INTEGER DEFAULT 0,
                 title TEXT,
                 summary TEXT,
                 content TEXT,
                 metadata JSONB DEFAULT '{}',
                 embedding vector(1536),
                 created_at TIMESTAMPTZ DEFAULT NOW()
             )
         """)

         # Create indexes
         print("Creating embedding index (ivfflat)...")
         await conn.execute("""
             CREATE INDEX site_pages_embedding_idx
             ON site_pages USING ivfflat (embedding vector_cosine_ops)
             WITH (lists = 100)
         """)

         print("Creating url index...")
         await conn.execute("CREATE INDEX site_pages_url_idx ON site_pages (url)")

         print("Creating metadata->source index...")
         await conn.execute("""
             CREATE INDEX site_pages_metadata_source_idx
             ON site_pages ((metadata->>'source'))
         """)

         print("\n[SUCCESS] Migration completed!")
         print("Schema is now compatible with domain model.")
-        return True
+
+    return True
 except Exception as e:
     print(f"\n[ERROR] Migration failed: {e}")
+    print("Transaction rolled back - database unchanged")
     import traceback
     traceback.print_exc()
     return False
```

docs/CONTEXT_DB_STAGING_AGENT.md-26-27 (1)
26-27: Wrap bare URLs in Markdown link syntax.

Several lines contain bare URLs that should be formatted as Markdown links or code blocks:

- Lines 26–27: `http://localhost:8502`, `http://localhost:8101/health`
- Lines 131–132: `http://localhost:8502`, `http://localhost:8101/health`
- Lines 140–141: `http://localhost:8501`, `http://localhost:8100/health`

```diff
-- [ ] http://localhost:8502 charge l'UI Streamlit
+- [ ] [`http://localhost:8502`](http://localhost:8502) charge l'UI Streamlit
```

Apply the same pattern to all bare URLs.
Also applies to: 131-132, 140-141
.claude/agents/db-layer-refactoring-analyst.md-35-35 (1)
35-35: Replace emphasis with proper Markdown headings.

Multiple subsections use bold text (e.g., `**Direct Client Usage**`, `**Query Patterns**`) instead of heading syntax. This breaks document structure and reduces navigability. Consistent heading levels also improve readability for both humans and automated tools.

🔎 Example fixes (apply to all occurrences)

```diff
-**Direct Client Usage**
+#### Direct Client Usage
```

Apply the same pattern to lines 35, 42, 50, 200, 205, 210, 215, 220, 225. Use `####` (h4) for subsections within named sections, or `###` (h3) if these are primary section headings at the top level.

Also applies to: 42-42, 50-50, 200-200, 205-205, 210-210, 215-215, 220-220, 225-225
.claude/agents/db-layer-refactoring-analyst.md-69-69 (1)
69-69: Specify language identifiers for all fenced code blocks.

Code blocks at lines 69, 80, and 167 lack language specifiers (e.g., `yaml`, `python`), which prevents syntax highlighting and reduces readability.

🔎 Proposed fixes

Add a language to each opening fence: line 69 (the `UI/API Layer → Service Layer → [?Data Access?] → Supabase` flow), line 80 (the `src/` directory structure), and line 167 (the `modules/` structure), or annotate them as plain text.

Also applies to: 80-80, 167-167

docs/CONTEXT_DB_STAGING_AGENT.md-33-33 (1)

33-33: Specify language identifiers for code blocks.

Lines 33, 86–93, and 95–106 contain fenced code blocks without language specifiers: mark the production/staging comparison at line 33 as `bash` (or plain text), and the `if __name__ == "__main__":` blocks at lines 86–93 and 95–106 as `python`.

Also applies to: 86-93, 95-106
docs/CONTEXT_DB_STAGING_AGENT.md-1-186 (1)
1-186: Clarify intended language: use English or French consistently, not mixed.

This document mixes French and English throughout (French is predominant). This reduces clarity, searchability, and long-term maintainability. Choose one language and standardize across the codebase.
Is this file intended for a French-speaking team, or is the French content an artifact of AI-assisted generation that should be converted to English? If English is the standard, convert the entire document. If French is intentional, ensure other project documentation (README.md, CONTRIBUTING.md, etc.) also use French consistently.
STAGING_VALIDATION_INDEX.md-10-10 (1)
10-10: Wrap bare URLs in Markdown link syntax.

Line 10 contains a bare URL (`http://localhost:8502`) that should be formatted as a Markdown link or placed in a code block.

```diff
-- **Access Staging:** http://localhost:8502
+- **Access Staging:** [`http://localhost:8502`](http://localhost:8502)
```

.claude/agents/db-layer-refactoring-analyst.md-2-3 (1)
2-3: Simplify the agent description: it's overly complex with nested XML examples.

The description field contains embedded example blocks that may not parse correctly. Consider moving detailed examples into the main body or creating a separate examples file, and keep the description brief and self-contained.
🔎 Suggested simplification
Replace the complex nested examples in lines 2–3 with a concise one-liner:
```diff
-description: Use this agent when you need to analyze and plan database layer refactoring, particularly for decoupling ORM/BaaS dependencies (like Supabase, Firebase, Prisma), creating abstraction layers, or migrating to a multi-modular architecture. This agent specializes in Repository Pattern implementation, database abstraction strategies, and incremental migration planning.\n\nExamples:\n\n<example>\nContext: User wants to decouple Supabase from their codebase\nuser: "Our codebase is tightly coupled to Supabase and we want to be database-agnostic"\nassistant: "I'll use the db-layer-refactoring-analyst agent to map all Supabase dependencies and design an abstraction strategy."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User needs to create a repository layer\nuser: "We have database calls scattered throughout our services and want to centralize them"\nassistant: "Let me launch the db-layer-refactoring-analyst to analyze your current data access patterns and design a proper repository architecture."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User is planning a database migration\nuser: "We're considering moving from Supabase to a self-hosted PostgreSQL with SQLAlchemy"\nassistant: "I'll use the db-layer-refactoring-analyst to create a migration roadmap that minimizes risk and maintains functionality throughout."\n<Task tool call to db-layer-refactoring-analyst>\n</example>\n\n<example>\nContext: User wants to modularize their data layer\nuser: "Our monolithic database module needs to be split into domain-specific modules"\nassistant: "Let me analyze the codebase with the db-layer-refactoring-analyst to identify domain boundaries and design a multi-modular data architecture."\n<Task tool call to db-layer-refactoring-analyst>\n</example>
+description: "Use this agent when you need to analyze and plan database layer refactoring, particularly for decoupling ORM/BaaS dependencies (like Supabase, Firebase, Prisma), creating abstraction layers, or migrating to a multi-modular architecture."
```

STAGING_VALIDATION_INDEX.md-216-225 (1)
216-225: Add container reset() fixture to main conftest.py to ensure global state cleanup between test runs.

The lazy singleton pattern in `archon/archon_graph.py` (lines 69–77) does create testing and concurrency challenges:

- Test isolation gap: `tests/test_container.py` correctly calls `reset()` in `setup_method()`, but `tests/conftest.py` and `tests/integration/conftest.py` lack universal container state cleanup. This means other test suites do not automatically reset the container's global `_repository_instance` between test runs, risking state leakage across tests.
- Concurrency risk: The async initialization in `get_repository_instance()` lacks synchronization primitives. Multiple concurrent async contexts could attempt initialization simultaneously, potentially bypassing the `if repository is None` check.
- Architectural redundancy: `archon_graph.py` maintains a separate global `repository` variable that wraps `container.get_repository_async()`, creating an orphaned mutable state outside the container's reset mechanism.

Add `reset()` to the module-level fixtures in `conftest.py` (e.g., as a session-scoped autouse fixture or function-scoped setup), or refactor `archon_graph.py` to call `container.reset()` where needed and rely directly on the container's `get_repository_async()` instead of maintaining a separate global variable. A fixture sketch follows this comment.
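A minimal sketch of the suggested autouse fixture (e.g., in `tests/conftest.py`), assuming `archon.container.reset()` clears the singleton state as described above:

```python
# tests/conftest.py — reset DI container state around every test.
import pytest

from archon import container

@pytest.fixture(autouse=True)
def reset_container():
    """Ensure no repository/embedding singletons leak between tests."""
    container.reset()
    yield
    container.reset()
```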
check_db_schema.py-8-14 (1)

8-14: Security: Use environment variables for database credentials.

Hardcoded database credentials pose a security risk. Consider using environment variables or a configuration file:

🔎 Proposed fix
```diff
+import os
+
 async def check_schema():
     """Check the current schema of site_pages table."""
     conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
+        host=os.getenv("POSTGRES_HOST", "localhost"),
+        port=int(os.getenv("POSTGRES_PORT", "5432")),
+        user=os.getenv("POSTGRES_USER", "postgres"),
+        password=os.getenv("POSTGRES_PASSWORD"),
+        database=os.getenv("POSTGRES_DB", "mydb")
     )
```

Committable suggestion skipped: line range outside the PR's diff.
check_db_schema.py-52-58 (1)
52-58: Security: Use environment variables for database credentials.

Same security concern as in `check_schema()`. Extract credentials to environment variables for consistency and security.

🔎 Proposed fix

```diff
+import os
+
 async def migrate_to_serial():
     """Migrate the id column from UUID to SERIAL."""
     conn = await asyncpg.connect(
-        host="localhost",
-        port=5432,
-        user="postgres",
-        password="postgres",
-        database="mydb"
+        host=os.getenv("POSTGRES_HOST", "localhost"),
+        port=int(os.getenv("POSTGRES_PORT", "5432")),
+        user=os.getenv("POSTGRES_USER", "postgres"),
+        password=os.getenv("POSTGRES_PASSWORD"),
+        database=os.getenv("POSTGRES_DB", "mydb")
     )
```

archon/services/documentation_service.py-138-145 (1)
138-145: Potential `TypeError` if chunk content is `None`.

Line 142 joins `chunk.content` for all chunks, but `SitePage.content` is `Optional[str]`. If any chunk has `None` content, the join will fail with `TypeError: sequence item N: expected str instance, NoneType found`.

🔎 Suggested fix

```diff
 # Sort by chunk_number to ensure correct order
 sorted_chunks = sorted(chunks, key=lambda c: c.chunk_number)

 # Concatenate content
-full_content = "\n\n".join(chunk.content for chunk in sorted_chunks)
+full_content = "\n\n".join(
+    chunk.content for chunk in sorted_chunks if chunk.content
+)
```

archon/archon_graph.py-67-77 (1)
67-77: Potential race condition in lazy repository initialization.

If `get_repository_instance()` is called concurrently before the repository is initialized, multiple instances could be created. The second assignment would overwrite the first, but this wastes resources and could cause subtle issues with connection pools.

🔎 Suggested fix using asyncio.Lock

```diff
+import asyncio
+
 # Initialize repository and embedding service via container
 # Repository will be lazy-initialized on first use (supports async backends like PostgreSQL)
 repository = None
+_repository_lock = asyncio.Lock()
 embedding_service = get_embedding_service()

 async def get_repository_instance():
     """Get or create repository instance (lazy initialization for async backends)."""
     global repository
     if repository is None:
-        repository = await get_repository_async()
+        async with _repository_lock:
+            if repository is None:  # Double-check after acquiring lock
+                repository = await get_repository_async()
     return repository
```

CODE_CHANGES_SUMMARY.md-118-134 (1)
118-134: Add synchronization to prevent race conditions in lazy initialization.

The implementation lacks protection against concurrent access. If multiple async tasks call `get_repository_instance()` simultaneously before initialization completes, both will observe `repository is None` and both will call `await get_repository_async()`, leading to duplicate initialization. Add an `asyncio.Lock` to guard the initialization:

Example fix

```python
_init_lock = asyncio.Lock()

async def get_repository_instance():
    """Get or create repository instance (lazy initialization for async backends)."""
    global repository
    if repository is None:
        async with _init_lock:
            if repository is None:  # Double-check after acquiring lock
                repository = await get_repository_async()
    return repository
```

archon/container.py-129-172 (1)
129-172: Add lifecycle management for async resources.

The container creates async resources (PostgreSQL connection pool) but doesn't provide a way to properly close/cleanup these resources. This can lead to resource leaks and warnings about unclosed connections.

🔎 Recommended addition

```python
async def close_async_resources() -> None:
    """
    Close async resources (connection pools, etc.).
    Should be called during application shutdown.

    Example:
        >>> await close_async_resources()
    """
    global _repository_instance
    if _repository_instance is not None:
        repo_type = _config["repository_type"]
        if repo_type == "postgres":
            logger.info("Closing PostgreSQL connection pool")
            await _repository_instance.close()
        _repository_instance = None
```

Usage in application shutdown:

```python
# In your main application
async def shutdown():
    from archon.container import close_async_resources
    await close_async_resources()
```

archon/infrastructure/postgres/site_pages_repository.py-122-124 (1)
122-124: Use `logger.exception` instead of `logger.error` in exception handlers.

When logging exceptions, `logger.exception()` automatically includes the stack trace, making debugging significantly easier than `logger.error()`.

🔎 Proposed refactor

```diff
 except Exception as e:
-    logger.error(f"get_by_id(id={id}) -> ERROR: {e}")
+    logger.exception(f"get_by_id(id={id}) -> ERROR: {e}")
     raise
```

Apply this pattern to all 8 exception handlers in the file (lines 123, 154, 218, 254, 306, 365, 396, 438).
Also applies to: 153-155, 217-219, 253-255, 305-307, 364-366, 395-397, 437-439
archon/infrastructure/postgres/site_pages_repository.py-33-42 (1)
33-42: Validate table_name to prevent SQL injection.

While `table_name` defaults to `"site_pages"` and is typically not user-controlled, accepting it as a parameter without validation creates a potential SQL injection vector. Static analysis correctly flags this concern across multiple query constructions.

🔎 Recommended fix: Add validation

```diff
 def __init__(self, pool: Pool, table_name: str = "site_pages"):
     """
     Initialize the repository with a connection pool.

     Args:
         pool: asyncpg connection pool
         table_name: Name of the table to use
     """
+    # Validate table_name to prevent SQL injection
+    if not table_name.replace('_', '').isalnum():
+        raise ValueError(f"Invalid table name: {table_name}")
     self.pool = pool
     self.table_name = table_name
```
🟡 Minor comments (9)
.claude/agents/mcp-server-refactoring-analyst.md-325-342 (1)
325-342: Phase numbering gap: jumps from Phase 4 to Phase 6.

The roadmap goes from "Phase 4: Testing & Validation" (line 325) to "Phase 6: Production Considerations" (line 342), skipping Phase 5. This appears to be a typo or oversight.
Suggested fix
```diff
-### Phase 6: Production Considerations
+### Phase 5: Production Considerations
```

Or add the missing Phase 5 content if it was intended.
docs/CONTEXT_DB_TEST_RUNNER_AGENT.md-42-44 (1)
42-44: Avoid documenting default credentials in version-controlled files.

Documenting
`User: postgres` and `Password: postgres` poses a risk if these match actual deployed credentials. Even for development, consider using environment variables or referencing a secure credentials store instead.
| Host | `localhost` | | Port | `5432` | -| User | `postgres` | -| Password | `postgres` | +| User | `${POSTGRES_USER}` (from environment) | +| Password | `${POSTGRES_PASSWORD}` (from environment) | | Database | `mydb` |.claude/agents/db-test-runner-agent.md-82-90 (1)
82-90: Hardcoded credentials in documentation.

The documentation embeds PostgreSQL credentials directly in the configuration section. While this is documentation rather than live code, it sets a poor example and could lead to credential leakage if developers copy-paste without modification.
🔎 Recommended approach
Consider using placeholder values or referencing environment variables instead:
-User: postgres -Password: postgres +User: ${POSTGRES_USER} +Password: ${POSTGRES_PASSWORD}Or use explicit placeholders:
-User: postgres -Password: postgres +User: <your-postgres-user> +Password: <your-postgres-password>archon/infrastructure/postgres/connection.py-98-111 (1)
98-111: Return type annotation contradicts behavior.

The function is annotated to return
`Optional[Pool]` but always raises `RuntimeError` when `_pool` is None, meaning it never actually returns `None`. The annotation should be `Pool` to match the actual behavior.

🔎 Proposed fix
-def get_pool() -> Optional[Pool]: +def get_pool() -> Pool: """ Get the current connection pool. Returns: - The global pool instance, or None if not created + The global pool instance Raises: RuntimeError: If pool has not been created """.claude/agents/db-refactor-test-phase-agent.md-82-88 (1)
82-88: Avoid exposing plaintext credentials in documentation.

Hardcoded credentials (
`User: postgres`, `Password: postgres`) in shared documentation could pose a security risk if this pattern is copied to other environments. Consider using environment variable references instead.

🔎 Suggested fix
PostgreSQL Docker: Container: mg_postgres Host: localhost Port: 5432 - User: postgres - Password: postgres + User: ${POSTGRES_TEST_USER:-postgres} + Password: ${POSTGRES_TEST_PASSWORD} Status: Running.claude/agents/db-refactor-test-phase-agent.md-62-64 (1)
62-64: Hardcoded Windows paths reduce portability.

These absolute Windows paths (
`D:\archon\archon\docs\...`) will not work on Linux/macOS or in CI environments. Consider using relative paths from the repository root.

🔎 Suggested fix
-1. **Plan Phase 0 Tests**: `D:\archon\archon\docs\PLAN_PHASE0_TESTS.md` ← PRINCIPAL -2. **Migration Manifest**: `D:\archon\archon\docs\MIGRATION_MANIFEST.md` -3. **Plan Global**: `D:\archon\archon\docs\PLAN_REFACTORISATION_DATABASE_LAYER.md` +1. **Plan Phase 0 Tests**: `docs/PLAN_PHASE0_TESTS.md` ← PRINCIPAL +2. **Migration Manifest**: `docs/MIGRATION_MANIFEST.md` +3. **Plan Global**: `docs/PLAN_REFACTORISATION_DATABASE_LAYER.md`archon/agent_tools.py-234-234 (1)
234-234: Fix typo in comment.

"coule" should be "couple".
🔎 Proposed fix
- # Join everything together but limit the characters in case the page is massive (there are a coule big ones) + # Join everything together but limit the characters in case the page is massive (there are a couple big ones)archon/infrastructure/postgres/site_pages_repository.py-390-393 (1)
390-393: Fragile parsing of DELETE command result.

Parsing the DELETE result with
`.split()[-1]` is brittle and depends on the exact format of the command tag string. Consider using asyncpg's built-in parsing or a more robust approach.

🔎 Recommended fix
- # Parse "DELETE X" to get count - deleted_count = int(result.split()[-1]) + # Extract count from command tag (format: "DELETE n") + try: + deleted_count = int(result.split()[-1]) if result else 0 + except (ValueError, IndexError): + logger.warning(f"Could not parse DELETE result: {result}") + deleted_count = 0Alternatively, track the count with a SELECT query before deletion if accuracy is critical.
archon/container.py-91-115 (1)
91-115: Remove dead code: unused `postgres_config`.

The
`postgres_config` dictionary is built but never used before the RuntimeError is raised. This is confusing dead code that should be removed.

🔎 Proposed fix
elif repo_type == "postgres": - # PostgreSQL direct with asyncpg + pgvector - import os - from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool - - # Get PostgreSQL configuration from environment - postgres_config = { - "host": os.environ.get("POSTGRES_HOST", "localhost"), - "port": int(os.environ.get("POSTGRES_PORT", "5432")), - "database": os.environ.get("POSTGRES_DB", "archon"), - "user": os.environ.get("POSTGRES_USER", "postgres"), - "password": os.environ.get("POSTGRES_PASSWORD", ""), - } - - # Create pool and repository synchronously - # Note: Pool creation must be done in an async context - # So we raise an error with instructions + # PostgreSQL requires async initialization raise RuntimeError( "PostgreSQL repository requires async initialization. " "Use get_repository_async() instead, or initialize manually:\n\n" " from archon.infrastructure.postgres import PostgresSitePagesRepository\n" " repo = await PostgresSitePagesRepository.create(**config)\n" " from archon.container import override_repository\n" " override_repository(repo)\n" )
```diff
@@ -0,0 +1,448 @@
+# Livrable: Backend PostgreSQL pour Archon
+
+**Date de Livraison:** 2025-11-30
```
Critical: Fix future dates in documentation.
The delivery date and generation date are set to 2025-11-30, which is in the future. This should be corrected to 2024-11-30 to reflect the actual delivery date.
🔎 Proposed fix
-**Date de Livraison:** 2025-11-30
+**Date de Livraison:** 2024-11-30

And at the end of the file:
-*Généré le: 2025-11-30*
+*Généré le: 2024-11-30*

Also applies to: 446-446
🧰 Tools
🪛 LanguageTool
[grammar] ~3-~3: Essayez « livraison : »
Context: ...ckend PostgreSQL pour Archon Date de Livraison: 2025-11-30 Status: ✅ COMPLET - Prod...
(QB_NEW_FR_OTHER_ERROR_IDS_REPLACEMENT_MULTITOKEN)
🤖 Prompt for AI Agents
In DELIVERABLE_SUMMARY.md around line 3 (and also at line 446), the
delivery/generation dates are incorrectly set to 2025-11-30; update those
occurrences to 2024-11-30 so the document reflects the actual past delivery
date, and verify there are no other instances of 2025-11-30 elsewhere in the
file to replace similarly.
```bash
cd D:\archon\archon
python run_staging.py
```
Replace hard-coded Windows paths with repository-relative paths or environment variables.
This is a critical blocker for cross-platform development. Multiple sections reference hard-coded Windows paths:
- Line 47: `cd D:\archon\archon\`
- Line 65: `D:\archon\archon\...`
- Lines 77–81: Multiple references to `D:\archon\archon\`
These paths are unusable on Linux/macOS systems and create maintenance burden. Use relative paths from repository root instead.
🔎 Suggested fixes
For command examples, use relative paths or environment variables:
-cd D:\archon\archon\
+cd $(git rev-parse --show-toplevel) # or similar
python run_staging.py

For file references in tables:
-| Config env | `D:\archon\archon\.env.staging` | Variables d'environnement avec API keys |
+| Config env | `.env.staging` (repo root) | Variables d'environnement avec API keys |

For Agent path:
-| Agent | `D:\archon\archon\.claude\agents\db-staging-setup-agent.md` | Définition agent |
+| Agent | `.claude/agents/db-staging-setup-agent.md` | Définition agent |Also applies to: 64-71, 77-81
🤖 Prompt for AI Agents
In docs/CONTEXT_DB_STAGING_AGENT.md around lines 47 to 49 (and also apply to
lines 64–71 and 77–81), replace hard-coded Windows absolute paths like
"D:\archon\archon\" with repository-relative paths or environment variables;
update command examples to use relative paths (e.g., cd ./<repo-root> or
./run_staging.py) and change file/agent path references to use variables (e.g.,
${REPO_ROOT}/path or ./path) so the instructions work cross-platform and on
Linux/macOS.
```python
conn = await asyncpg.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="postgres",
    database="mydb"
)
```
Hardcoded credentials pose security risk.
Database credentials are hardcoded directly in the connection string. This is a critical security issue if this code is committed to version control or deployed.
🔎 Proposed fix using environment variables
+import os
+
async def migrate():
"""Migrate the id column from UUID to SERIAL."""
conn = await asyncpg.connect(
- host="localhost",
- port=5432,
- user="postgres",
- password="postgres",
- database="mydb"
+ host=os.getenv("POSTGRES_HOST", "localhost"),
+ port=int(os.getenv("POSTGRES_PORT", "5432")),
+ user=os.getenv("POSTGRES_USER", "postgres"),
+ password=os.getenv("POSTGRES_PASSWORD"),
+ database=os.getenv("POSTGRES_DB", "mydb")
)

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.10)
12-12: Possible hardcoded password assigned to argument: "password"
(S106)
🤖 Prompt for AI Agents
In migrate_schema.py around lines 8 to 14, the database connection uses
hardcoded credentials; replace them with values read from environment variables
(e.g., DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME) or a
configuration/secrets manager, validate that required variables are present and
raise a clear error if any are missing, and avoid defaulting to sensitive
credentials in code; ensure sensitive values are not committed (use .env for
local dev and a secrets manager for production) and pass the env-derived values
into asyncpg.connect.
```
# Archon Base Requirements
# Core dependencies shared across all environments
# Other requirements files import this with: -r requirements-base.txt

# Core AI/LLM Frameworks
pydantic-ai>=1.0.15
langgraph>=0.2.0

# LLM Providers
openai>=1.50.0
anthropic>=0.69.0
litellm>=1.50.0

# Web Framework
streamlit>=1.40.0
fastapi>=0.115.0
uvicorn>=0.34.0

# HTTP/Networking
aiohttp>=3.0.0
httpx>=0.27.0
requests>=2.30.0

# Utilities
python-dotenv>=1.0.0
pyyaml>=6.0.0
tenacity>=9.0.0
tqdm>=4.60.0

# Telemetry
logfire>=3.0.0
opentelemetry-api>=1.20.0
opentelemetry-sdk>=1.20.0

# MCP
mcp>=1.0.0
```
🧩 Analysis chain
🌐 Web query:
pydantic-ai 1.0.15 security vulnerabilities CVE
💡 Result:
I found no recorded CVEs for pydantic-ai v1.0.15. The pydantic-ai GitHub Security page shows no published advisories [1]. Known CVEs (e.g., CVE-2021-29510 and CVE-2024-3772) apply to the pydantic project, not pydantic-ai itself, and are documented separately [2][3]. If you want, I can monitor vulnerability databases and notify you if a CVE for pydantic-ai 1.0.15 appears.
Sources:
- pydantic-ai GitHub Security (no advisories). [1]
- CVE‑2021‑29510 (pydantic). [2]
- GHSA / CVE‑2024‑3772 (pydantic regular-expression DoS). [3]
🌐 Web query:
openai python 1.50.0 security vulnerabilities
💡 Result:
I found no published CVE or confirmed security vulnerability specific to openai Python v1.50.0. Security scanners (Snyk) report no direct vulnerabilities for openai versions, and public vulnerability trackers (Safety DB) list no CVE tied to v1.50.0. [1][2]
Known related issues:
- Compatibility/packaging bugs have been reported (e.g., httpx / PyInstaller / macOS issues around the 1.50.x line), but these are functional/packaging bugs, not disclosed CVEs. [3]
Recommendations
- Upgrade to the latest openai Python release (to pick up fixes and dependency updates). Verify with your package manager (pip/poetry). [1]
- Scan your environment with a vulnerability scanner (Snyk, safety, pip-audit) to catch dependency/indirect issues. [1][2]
- If you need a formal CVE check or forensic review, tell me which vendor/source you prefer and I’ll run a deeper search and produce dated evidence.
Sources:
[1] Snyk package page for openai (vulnerabilities and versions).
[2] Safety DB entry for openai.
[3] openai/openai-python GitHub issue discussing httpx/packaging problems (1.50.x).
🌐 Web query:
anthropic python 0.69.0 security vulnerabilities
💡 Result:
Short answer: There are no public, direct CVEs reported for the Python package anthropic version 0.69.0. Snyk’s vulnerability database shows no direct vulnerabilities for anthropic@0.69.0 (only dependency/indirect risks may exist). [1][2]
Details and recommendations (as of Jan 1, 2026)
- Snyk (package page and version page) reports no direct vulnerabilities for the anthropic PyPI package and for v0.69.0 specifically. [1][2]
- SafetyDB / curated feeds list anthropic versions (including 0.69.0) in their inventory but the only CVE they show against the project targets versions <0.3.2 (CVE-2022-23491); that CVE does not affect 0.69.0. [3]
- Recommendation: upgrade to the latest upstream anthropic release, keep dependencies patched, and run a supply-chain scan (Snyk/OSV/GHAS or your tooling) to detect any indirect/dependency vulnerabilities or newly disclosed issues. If you want, I can check the very latest release and advisories right now.
Sources:
[1] Snyk — anthropic (PyPI) package/vulnerability summary.
[2] Snyk — anthropic@0.69.0 version page.
[3] SafetyDB / Safety CLI entry for anthropic (inventory and CVE listing).
🌐 Web query:
streamlit 1.40.0 security vulnerabilities
💡 Result:
Short answer: Streamlit 1.40.0 is not affected by the July 2024 path‑traversal advisory (fixed in 1.37.0), but it is identified as vulnerable to an "Arbitrary File Upload" issue reported by Snyk (affects versions before 1.43.2). Recommendation: upgrade to a Streamlit release >= 1.43.2 (or the latest stable) to remediate. Sources: Snyk vulnerability entry [1], Streamlit security advisory (path traversal, fixed in 1.37.0) [2], NVD/CVE record for earlier directory‑traversal issues [3].
Sources
- [1] Snyk — streamlit@1.40.0 vulnerabilities (Arbitrary File Upload; upgrade to >=1.43.2).
- [2] Streamlit security advisory (STREAMLIT‑2024‑001) — path traversal fixed in 1.37.0.
- [3] NVD / CVE-2022-35918 — earlier directory traversal fixed in 1.11.1.
🌐 Web query:
fastapi 0.115.0 security vulnerabilities
💡 Result:
Short answer: There are no public CVEs reported for FastAPI core version 0.115.0 itself. Security advisories you’ll see tend to be about third‑party/malicious PyPI packages or vulnerabilities in related libraries (dependencies), not a CVE against FastAPI 0.115.0. [1][2][3][4]
Recommended actions:
- Upgrade to the latest FastAPI release (or at least the newest 0.115.x) and keep Starlette / Pydantic up to date. [1]
- Scan your environment for malicious/typo-squat packages (e.g., the backdoored fastapi-toolkit incident) and run SCA tools. [3]
- Run dependency vulnerability checks (Snyk/OSV/Dependabot/NVD) for Starlette, Pydantic and other deps used by your apps. [4]
Sources:
- FastAPI release notes (0.115.x). [1]
- Vulnerability database summary showing no known FastAPI 0.115.0 CVEs. [2]
- Datadog writeup on a backdoored third‑party FastAPI-related PyPI package. [3]
- Package-scan reports noting flagged vulnerabilities in some package versions/dependencies. [4]
🌐 Web query:
requests 2.30.0 python security vulnerabilities
💡 Result:
Short answer: Requests 2.30.0 is affected by known security issues — most importantly a Proxy-Authorization header leakage (CVE-2023-32681) and related session/control-flow bugs that allow certificate-verification bypass when a Session first uses verify=False. These issues were fixed in later 2.31.x / 2.32.x releases; upgrade to a patched release (2.31.0+ for the proxy leak, and 2.32.2+ / 2.32.4+ for the session/other fixes) or the latest 2.32.x. [1][2][3]
References:
- Requests release notes / security advisory (describes proxy header leak and affected versions). [1]
- Requests changelog / release pages for 2.30.0 → 2.31.0 → 2.32.x. [2]
- Snyk vulnerability summary for requests 2.30.0 (lists the issues and recommended fixed versions). [3]
🌐 Web query:
aiohttp 3.0.0 security vulnerabilities
💡 Result:
Short answer: aiohttp 3.0.0 is old and affected by multiple security issues discovered since that release (open-redirect, HTTP request-smuggling/parser bugs, directory-traversal, XSS, etc.). You should upgrade to a supported, patched release (or at minimum avoid vulnerable features / apply the documented workarounds). Sources and notable CVEs below.
Notable vulnerabilities affecting older 3.x releases
- Open redirect (CVE-2021-21330) — fixed in 3.7.4; affects normalize_path_middleware. [3]
- HTTP request smuggling / parser bugs (CVE-2023-47627, CVE-2023-47641, CVE-2024-23829, related fixes in 3.8.6, 3.8.5, 3.9.2, etc.). Pure-Python parser / AIOHTTP_NO_EXTENSIONS usage is repeatedly implicated. [2][1]
- Directory traversal / static file issues (CVE-2024-23334) — fixed in 3.9.2 (and related static-route fixes). [4]
- XSS in static index pages (CVE-2024-27306) — fixed in 3.9.4; mitigated by using a reverse proxy or disabling show_index. [5]
- Path-traversal with compressed symlinks in FileResponse (CVE-2024-42367) — fixed on 3.10.2. [2]
- Memory-leak via MatchInfoError when middleware enabled (CVE-2024-52303) — fixed in 3.10.11. [6]
Recommendations
- Upgrade aiohttp to a current patched release (do not stay on 3.0.0). The fixes above appear across 3.7.x → 3.12.x series depending on the issue; test and move to the latest stable in your environment. [2][1][4]
- If you cannot upgrade immediately: avoid using affected features (e.g., normalize_path_middleware, show_index, follow_symlinks=True), disable pure-Python parser where advised (use the llhttp extensions instead of AIOHTTP_NO_EXTENSIONS), and place a vetted reverse proxy in front of aiohttp. [2][5][3]
Sources
[1] Summary of aiohttp parser/request-smuggling advisories (Meterian component summary).
[2] NVD / GitHub advisory for CVE-2025-53643 (parser/trailer request-smuggling) and related NVD entries.
[3] NVD entry for CVE-2021-21330 (open redirect).
[4] CVE details for CVE-2024-23334 (static file/directory traversal).
[5] CVE details for CVE-2024-27306 (XSS in static index).
[6] CVE details for CVE-2024-52303 (memory leak).
If you want, I can (a) check the exact earliest patched aiohttp version that removes every specific CVE in your environment, or (b) give pip commands and a minimal upgrade plan tailored to your current version.
Update package versions to patch known security vulnerabilities.
The base requirements file contains several packages with published security vulnerabilities:
- `aiohttp>=3.0.0` is critically outdated (2017 release) with multiple published CVEs including open redirect (CVE-2021-21330), HTTP request smuggling (CVE-2023-47627, CVE-2023-47641, CVE-2024-23829), directory traversal (CVE-2024-23334), XSS (CVE-2024-27306), and path traversal (CVE-2024-42367). Upgrade to a current patched release (3.9.x or later).
- `streamlit>=1.40.0` is affected by an arbitrary file upload vulnerability. Upgrade to >=1.43.2.
- `requests>=2.30.0` has proxy-authorization header leakage (CVE-2023-32681) and session/certificate-verification bypass issues. Upgrade to >=2.31.0, or >=2.32.2+ for all fixes.
The other specified versions (pydantic-ai, openai, anthropic, fastapi) are not affected by published CVEs at their specified constraints.
🤖 Prompt for AI Agents
requirements-base.txt lines 1-36: bump vulnerable package constraints—update
aiohttp to a patched series (e.g., change to >=3.9.0), update streamlit to
>=1.43.2, and update requests to >=2.32.2 (or latest compatible release); modify
the requirement lines accordingly, run dependency install/tests and any
lockfile/constraints regeneration to ensure compatibility, and verify CI passes
before merging.
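A sketch of those bumps against the requirements-base.txt shown above; the floors follow the minimum patched versions named in this review (note that the later aiohttp CVEs, CVE-2024-42367 and CVE-2024-52303, need 3.10.2 and 3.10.11 respectively, so a 3.10.x floor is the safer choice):

```diff
-streamlit>=1.40.0
+streamlit>=1.43.2
-aiohttp>=3.0.0
+aiohttp>=3.10.11
-requests>=2.30.0
+requests>=2.32.2
```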
- Add explicit exception chaining (from e) to all repository methods
- Wrap exceptions in RuntimeError with descriptive messages
- Improves stack trace debugging and error context
- Addresses CodeRabbit review feedback

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 4
🧹 Nitpick comments (4)
archon/infrastructure/postgres/site_pages_repository.py (4)
122-124: Prefer `logging.exception` over `logging.error` in exception handlers.

Throughout this file, exception handlers use `logging.error` to log errors. Using `logging.exception` instead would automatically include the full traceback in the logs, significantly improving debuggability without requiring manual traceback formatting.

This applies to all exception handlers in the file (lines 123, 154, 220, 256, 308, 367, 398, 440).
🔎 Example fix for this method
```diff
 except Exception as e:
-    logger.error(f"get_by_id(id={id}) -> ERROR: {e}")
+    logger.exception(f"get_by_id(id={id}) -> ERROR: {e}")
     raise RuntimeError(f"Failed to get page by id {id}") from e
```

Based on static analysis hints.
208-211: Consider logging when similarity clipping occurs.

The code clips similarity values to the [0, 1] range with a comment noting this can happen with poorly normalized embeddings. While the clipping prevents invalid results, silently correcting data quality issues may hide problems. Consider logging a warning when clipping occurs to aid in identifying embedding quality issues.
🔎 Proposed enhancement
```diff
 for row in rows:
     page = self._row_to_site_page(row)
-    # Clip similarity to valid range [0, 1]
-    # Note: Can be negative with poorly normalized embeddings
-    similarity = max(0.0, min(1.0, float(row["similarity"])))
+    raw_similarity = float(row["similarity"])
+    # Clip similarity to valid range [0, 1]
+    # Note: Can be negative with poorly normalized embeddings
+    if raw_similarity < 0.0 or raw_similarity > 1.0:
+        logger.warning(
+            f"Clipping similarity value {raw_similarity} to [0, 1] range. "
+            "This may indicate poorly normalized embeddings."
+        )
+    similarity = max(0.0, min(1.0, raw_similarity))
     results.append(SearchResult(page=page, similarity=similarity))
```
333-359: Optimize batch insert with bulk operations.

The current implementation uses individual INSERT statements in a loop, which requires N database roundtrips even within the transaction. For large batches, this can significantly impact performance. Consider using `executemany` or PostgreSQL's `COPY` command for bulk inserts.

🔎 Proposed optimization using executemany

```diff
 async with self.pool.acquire() as conn:
     # Use a transaction for batch insert
     async with conn.transaction():
-        inserted = []
-        for page in pages:
-            # Prepare embedding
-            embedding_str = None
-            if page.embedding:
-                embedding_str = str(page.embedding)
-
-            row = await conn.fetchrow(
-                f"""
-                INSERT INTO {self.table_name}
-                (url, chunk_number, title, summary, content, metadata, embedding)
-                VALUES ($1, $2, $3, $4, $5, $6, $7::vector)
-                RETURNING *
-                """,
-                page.url,
-                page.chunk_number,
-                page.title,
-                page.summary,
-                page.content,
-                page.metadata.model_dump_json() if page.metadata else "{}",
-                embedding_str,
-            )
-            inserted.append(self._row_to_site_page(row))
+        # Prepare all values for bulk insert
+        values = [
+            (
+                page.url,
+                page.chunk_number,
+                page.title,
+                page.summary,
+                page.content,
+                page.metadata.model_dump_json() if page.metadata else "{}",
+                str(page.embedding) if page.embedding else None,
+            )
+            for page in pages
+        ]
+
+        # Bulk insert and return all rows
+        rows = await conn.fetch(
+            f"""
+            INSERT INTO {self.table_name}
+            (url, chunk_number, title, summary, content, metadata, embedding)
+            SELECT * FROM UNNEST($1::text[], $2::int[], $3::text[], $4::text[],
+                                 $5::text[], $6::jsonb[], $7::vector[])
+            RETURNING *
+            """,
+            [v[0] for v in values],  # urls
+            [v[1] for v in values],  # chunk_numbers
+            [v[2] for v in values],  # titles
+            [v[3] for v in values],  # summaries
+            [v[4] for v in values],  # contents
+            [v[5] for v in values],  # metadata
+            [v[6] for v in values],  # embeddings
+        )
+        inserted = [self._row_to_site_page(row) for row in rows]
```

Note: This uses PostgreSQL's `UNNEST` function to handle arrays. Alternatively, you could use `executemany`, which is simpler but doesn't return the inserted rows in a single call.
392-394: Consider the fragility of DELETE result parsing.

Parsing the execute result string by splitting on spaces and taking the last element works but relies on undocumented asyncpg behavior. While this format appears stable, it's not part of asyncpg's documented API contract. The current approach is acceptable but be aware it may break if asyncpg changes the result format.
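A defensive variant, assuming only that asyncpg's DELETE command tag ends in the row count (the "DELETE <n>" format); the helper name is illustrative:

```python
def parse_delete_count(command_tag: str | None) -> int:
    """Best-effort parse of an asyncpg DELETE command tag like 'DELETE 3'."""
    if not command_tag:
        return 0
    try:
        return int(command_tag.rsplit(" ", 1)[-1])
    except ValueError:
        # Unexpected tag format; treat the count as unknown rather than crash.
        return 0
```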
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
🧰 Additional context used
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
110-110: Possible SQL injection vector through string-based query construction
(S608)
123-123: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
124-124: Avoid specifying long messages outside the exception class
(TRY003)
141-145: Possible SQL injection vector through string-based query construction
(S608)
154-154: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
155-155: Avoid specifying long messages outside the exception class
(TRY003)
182-187: Possible SQL injection vector through string-based query construction
(S608)
220-220: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
239-243: Possible SQL injection vector through string-based query construction
(S608)
248-248: Possible SQL injection vector through string-based query construction
(S608)
256-256: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
257-257: Avoid specifying long messages outside the exception class
(TRY003)
273-273: Avoid specifying long messages outside the exception class
(TRY003)
285-290: Possible SQL injection vector through string-based query construction
(S608)
308-308: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
309-309: Avoid specifying long messages outside the exception class
(TRY003)
325-325: Avoid specifying long messages outside the exception class
(TRY003)
344-349: Possible SQL injection vector through string-based query construction
(S608)
367-367: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
368-368: Avoid specifying long messages outside the exception class
(TRY003)
385-388: Possible SQL injection vector through string-based query construction
(S608)
398-398: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
399-399: Avoid specifying long messages outside the exception class
(TRY003)
414-414: Possible SQL injection vector through string-based query construction
(S608)
440-440: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
441-441: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
60-60: Consider moving this statement to an else block
(TRY300)
63-63: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
89-89: Consider moving this statement to an else block
(TRY300)
92-92: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
138-138: Consider moving this statement to an else block
(TRY300)
141-141: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
169-169: Consider moving this statement to an else block
(TRY300)
172-172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
189-189: Avoid specifying long messages outside the exception class
(TRY003)
201-201: Consider moving this statement to an else block
(TRY300)
204-204: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
205-205: Avoid specifying long messages outside the exception class
(TRY003)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
236-236: Consider moving this statement to an else block
(TRY300)
239-239: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
240-240: Avoid specifying long messages outside the exception class
(TRY003)
266-266: Consider moving this statement to an else block
(TRY300)
269-269: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
303-303: Consider moving this statement to an else block
(TRY300)
306-306: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
307-307: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (5)
archon/infrastructure/supabase/site_pages_repository.py (2)
188-189: LGTM! Excellent id validation for insert operations.

The validation ensuring `id is None` before insertion prevents accidental overwrites and correctly enforces database-generated ids. Clear error messages and consistent across both single and batch insert methods.

Also applies to: 220-221
161-161: The PostgreSQL JSONB operators (`metadata->>source`) are properly supported in Supabase Python client v2.11.0. This is a standard, documented pattern for querying JSONB fields via the `.eq()` method, as confirmed by official Supabase documentation. The code at lines 161, 258, and 293 uses these operators correctly.

archon/infrastructure/postgres/site_pages_repository.py (3)
78-88: LGTM! The connection pool creation is well-structured with appropriate parameters and informative logging.
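Roughly what that pool setup looks like with asyncpg; the connection details and size limits below are assumptions, not the repository's values:

```python
import asyncpg

async def make_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        host="localhost",
        port=5432,
        database="archon",
        user="postgres",
        password="",
        min_size=2,   # keep a couple of warm connections ready
        max_size=10,  # bound concurrent connections to the database
    )
```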
259-310: LGTM! The insert implementation correctly validates inputs, handles embeddings for pgvector, and uses parameterized queries for safety.
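A sketch of that shape of insert, assuming an illustrative schema (table and column names are not the actual ones):

```python
import asyncpg

async def insert_page(pool: asyncpg.Pool, url: str, content: str,
                      embedding: list[float]) -> int:
    # Values travel as bound parameters ($1..$3), never as SQL text.
    query = """
        INSERT INTO site_pages (url, content, embedding)
        VALUES ($1, $2, $3::vector)
        RETURNING id
    """
    # pgvector accepts its text form, e.g. '[0.1,0.2,...]', cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    async with pool.acquire() as conn:
        return await conn.fetchval(query, url, content, vector_literal)
```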
443-479: LGTM! The row-to-model conversion is well-implemented with defensive parsing that handles multiple data formats for both metadata and embeddings. This robustness helps prevent runtime errors from unexpected data formats.
Security fixes: - Update h11>=0.16.0 to fix CVE-2025-43859 vulnerability - Add VALID_TABLE_NAMES whitelist to prevent SQL injection - Add VALID_COLUMN_NAMES whitelist for filter key validation - Sanitize metadata keys (alphanumeric + underscore only) Logging improvements: - Replace logger.error() with logger.exception() for automatic stacktrace - Proper exception chaining with RuntimeError wrapper 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
archon/infrastructure/supabase/site_pages_repository.py (1)
38-46: [DUPLICATE] Use AsyncClient for proper async operations. All methods in this class are declared async def per the interface contract, but the Supabase client operations remain synchronous (using Client instead of AsyncClient). This blocks the event loop when these methods are called from async contexts.
This issue was previously flagged and remains unresolved. Since supabase-py provides AsyncClient and acreate_client, the repository should be updated to use the async client and await all operations.
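A sketch of the suggested migration, per the supabase-py async API the comment references (the repository wiring is illustrative):

```python
from supabase import AsyncClient, acreate_client

async def make_async_client(url: str, key: str) -> AsyncClient:
    # acreate_client builds the non-blocking client the comment asks for.
    return await acreate_client(url, key)

# Repository methods would then await their calls, for example:
#   response = await self.client.table("site_pages").select("*").eq("id", page_id).execute()
```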
🧹 Nitpick comments (1)
archon/infrastructure/postgres/site_pages_repository.py (1)
175-239: Consider validating all filter keys in search_similar(). The search_similar() method only handles the "source" filter key (line 212), while other filter keys would be silently ignored. This is inconsistent with the count() method which validates and applies all filter keys using whitelists.
Consider either:
- Validating and applying additional filter keys (similar to count() method), or
- Documenting that only "source" filtering is supported and logging a warning if other keys are provided.
This ensures consistent behavior across methods and prevents silent failures.
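The second option might look like this; the helper name and filter shape are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def extract_source_filter(filters: dict | None) -> str | None:
    """Return the supported 'source' filter, warning about anything else."""
    if not filters:
        return None
    unsupported = set(filters) - {"source"}
    if unsupported:
        logger.warning(
            "search_similar() only supports the 'source' filter; ignoring: %s",
            sorted(unsupported),
        )
    return filters.get("source")
```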
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
- requirements.txt
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
🧬 Code graph analysis (1)
archon/infrastructure/postgres/site_pages_repository.py (3)
- archon/domain/interfaces/site_pages_repository.py (1): ISitePagesRepository (14-197)
- archon/domain/models/site_page.py (2): SitePage (32-83), SitePageMetadata (13-29)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
🪛 OSV Scanner (2.3.1)
requirements.txt
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK vulnerability in the FastMCP Server causes validation error, leading to DoS
[HIGH] 79-79: mcp 1.2.1: Model Context Protocol (MCP) Python SDK does not enable DNS rebinding protection by default
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK has Unhandled Exception in Streamable HTTP Transport, Leading to Denial of Service
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
55-58: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Possible SQL injection vector through string-based query construction
(S608)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
159-163: Possible SQL injection vector through string-based query construction
(S608)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
200-205: Possible SQL injection vector through string-based query construction
(S608)
239-239: Avoid specifying long messages outside the exception class
(TRY003)
257-261: Possible SQL injection vector through string-based query construction
(S608)
266-266: Possible SQL injection vector through string-based query construction
(S608)
275-275: Avoid specifying long messages outside the exception class
(TRY003)
291-291: Avoid specifying long messages outside the exception class
(TRY003)
303-308: Possible SQL injection vector through string-based query construction
(S608)
327-327: Avoid specifying long messages outside the exception class
(TRY003)
343-343: Avoid specifying long messages outside the exception class
(TRY003)
362-367: Possible SQL injection vector through string-based query construction
(S608)
386-386: Avoid specifying long messages outside the exception class
(TRY003)
403-406: Possible SQL injection vector through string-based query construction
(S608)
417-417: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Possible SQL injection vector through string-based query construction
(S608)
466-466: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
69-69: Consider moving this statement to an else block
(TRY300)
73-73: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Consider moving this statement to an else block
(TRY300)
102-102: Avoid specifying long messages outside the exception class
(TRY003)
147-147: Consider moving this statement to an else block
(TRY300)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
178-178: Consider moving this statement to an else block
(TRY300)
182-182: Avoid specifying long messages outside the exception class
(TRY003)
198-198: Avoid specifying long messages outside the exception class
(TRY003)
210-210: Consider moving this statement to an else block
(TRY300)
214-214: Avoid specifying long messages outside the exception class
(TRY003)
230-230: Avoid specifying long messages outside the exception class
(TRY003)
245-245: Consider moving this statement to an else block
(TRY300)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
275-275: Consider moving this statement to an else block
(TRY300)
279-279: Avoid specifying long messages outside the exception class
(TRY003)
320-320: Consider moving this statement to an else block
(TRY300)
324-324: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
archon/infrastructure/supabase/site_pages_repository.py (2)
17-24: Security whitelists properly implemented. The VALID_TABLE_NAMES and VALID_COLUMN_NAMES whitelists provide defense-in-depth against SQL injection. Combined with the validation logic in the count() method (lines 302-312), this addresses the security concerns mentioned in the commit message.
297-312: Excellent input validation and sanitization. The filter handling properly validates both metadata keys (alphanumeric + underscore check) and column names (whitelist check), with appropriate warning logs for invalid inputs. This aligns with the security improvements described in the commit message.
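For context, a minimal sketch of this style of validation; the whitelist contents and regex are illustrative:

```python
import re

VALID_COLUMN_NAMES = {"url", "chunk_number"}      # illustrative whitelist
METADATA_KEY_RE = re.compile(r"^[A-Za-z0-9_]+$")  # alphanumeric + underscore

def is_valid_filter_key(key: str) -> bool:
    # Known columns pass outright; metadata keys must match the safe pattern.
    return key in VALID_COLUMN_NAMES or bool(METADATA_KEY_RE.match(key))
```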
archon/infrastructure/postgres/site_pages_repository.py (3)
42-60: Excellent security improvement: table_name validation. The constructor now validates table_name against the VALID_TABLE_NAMES whitelist, preventing SQL injection even if this constraint changes in the future. This resolves the major security concern from the previous review.
Based on past review feedback.
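The shape of that constructor guard, sketched with an illustrative whitelist:

```python
VALID_TABLE_NAMES = {"site_pages"}  # illustrative; the real set may differ

class PostgresSitePagesRepository:
    def __init__(self, pool, table_name: str = "site_pages") -> None:
        if table_name not in VALID_TABLE_NAMES:
            raise ValueError(f"Invalid table name: {table_name!r}")
        self._pool = pool
        # Safe to interpolate into SQL later: the name passed a fixed whitelist.
        self._table = table_name
```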
436-466: Excellent security improvement: filter key validation. The count() method now properly validates filter keys using both the VALID_COLUMN_NAMES whitelist for regular columns and alphanumeric+underscore sanitization for metadata keys. This resolves the SQL injection risk from the previous review.
Based on past review feedback.
62-106: Well-designed factory pattern for async initialization. The create() class method provides a clean async factory for creating repository instances with connection pooling. This aligns well with the async repository access pattern described in the PR objectives.
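Roughly the pattern being described: a classmethod owns the awaitable pool creation so __init__ stays synchronous. Parameter handling here is illustrative:

```python
import asyncpg

class PostgresSitePagesRepository:
    def __init__(self, pool: asyncpg.Pool) -> None:
        self._pool = pool  # __init__ takes a ready pool; no I/O here

    @classmethod
    async def create(cls, **config) -> "PostgresSitePagesRepository":
        # The async factory performs the awaitable setup.
        pool = await asyncpg.create_pool(**config)
        return cls(pool)

# Usage: repo = await PostgresSitePagesRepository.create(host="localhost", database="archon")
```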
requirements.txt (2)
44-44: Security fix properly applied. The upgrade to h11>=0.16.0 successfully addresses the critical HTTP request-smuggling vulnerability (CVE-2025-43859) flagged in the previous review. The inline comment clearly documents the security rationale.
177-178: PostgreSQL backend dependencies properly added. The asyncpg and pgvector dependencies align with the PostgreSQL repository implementation (PostgresSitePagesRepository) and support the database abstraction layer objectives described in the PR.
lxml==5.3.0
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mcp==1.2.1
🧩 Analysis chain
🌐 Web query:
GHSA-3qhf-m339-9g5v mcp FastMCP Server validation
💡 Result:
Summary: GHSA-3qhf-m339-9g5v reports a validation bug in the MCP Python SDK (PyPI package "mcp") that can raise an unhandled exception when FastMCP Server processes malformed requests, causing 500 errors/denial of service. It is tracked as CVE-2025-53366 (High severity). Affected versions: mcp < 1.9.4. Fixed in: mcp 1.9.4 — upgrade to 1.9.4 or later. References: [1], [2].
References
- GHSA advisory (modelcontextprotocol/python-sdk). [1]
- OSV / CVE record (CVE-2025-53366). [2]
🌐 Web query:
GHSA-9h52-p55h-vw2f mcp DNS rebinding
💡 Result:
Summary
- GHSA-9h52-p55h-vw2f (CVE-2025-66416): the MCP (mcp on PyPI) Python SDK did not enable DNS-rebinding protection by default for HTTP-based servers running on localhost. A malicious website could exploit DNS rebinding to reach an unauthenticated local FastMCP server (streamable HTTP or SSE) and invoke tools or access exposed resources. [1][2]
Affected / fixed
- Affected: mcp versions < 1.23.0.
- Fixed: update to mcp v1.23.0 (FastMCP now enables DNS-rebinding protection by default when host is 127.0.0.1 or localhost). [1][2]
Mitigation/notes
- Upgrade to mcp ≥ 1.23.0. [1]
- Avoid running unauthenticated HTTP-based MCP servers on localhost; prefer authenticated or stdio transport. [1][2]
- If using custom low-level transports (StreamableHTTPSessionManager, SseServerTransport), explicitly configure TransportSecuritySettings to enable DNS-rebinding protections. [1][2]
Sources
- GitHub security advisory for GHSA-9h52-p55h-vw2f. [1]
- OSV / CVE-2025-66416 entry summarizing the issue and fix. [2]
🌐 Web query:
GHSA-j975-95f5-7wqh mcp Streamable HTTP Transport
💡 Result:
Summary — GHSA-j975-95f5-7wqh (alias CVE-2025-53365)
- Issue: Uncaught ClosedResourceError in the MCP Python SDK's streamable HTTP transport that can crash the server (denial of service) if a client triggers an exception after establishing a streamable HTTP session. [1][2]
- Affected versions: mcp < 1.10.0. [1][3]
- Severity: High (CVSS 8.7). [2]
- Fix / Mitigation: Upgrade mcp to version 1.10.0 or later (patch introduced in v1.10.0). [1][3]
References
- GitHub Security Advisory GHSA-j975-95f5-7wqh (advisory + patch links). [1]
- OSV / CVE-2025-53365 entry (CVSS and published details). [2]
- GitLab advisory / vendor summary (upgrade guidance). [3]
🏁 Script executed:
#!/bin/bash
# Check PyPI for mcp package information
echo "=== Checking PyPI for mcp package ==="
curl -s https://pypi.org/pypi/mcp/json | jq -r '.info | "Version: \(.version)\nLicense: \(.license)\nHome page: \(.home_page)"'
echo ""
echo "=== Checking all versions of mcp ==="
curl -s https://pypi.org/pypi/mcp/json | jq -r '.releases | keys[]' | sort -V | tail -20
Repository: coleam00/Archon
Length of output: 458
Upgrade mcp to 1.25.0 to address HIGH severity vulnerabilities.
The mcp package version 1.2.1 is affected by three HIGH severity vulnerabilities:
- GHSA-3qhf-m339-9g5v (CVE-2025-53366): Validation error in FastMCP Server causing DoS — fixed in v1.9.4+
- GHSA-j975-95f5-7wqh (CVE-2025-53365): Unhandled ClosedResourceError in streamable HTTP transport — fixed in v1.10.0+
- GHSA-9h52-p55h-vw2f (CVE-2025-66416): DNS rebinding protection not enabled by default — fixed in v1.23.0+
Upgrade to mcp 1.25.0 (latest available) to address all three vulnerabilities.
🧰 Tools
🪛 OSV Scanner (2.3.1)
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK vulnerability in the FastMCP Server causes validation error, leading to DoS
[HIGH] 79-79: mcp 1.2.1: Model Context Protocol (MCP) Python SDK does not enable DNS rebinding protection by default
[HIGH] 79-79: mcp 1.2.1: MCP Python SDK has Unhandled Exception in Streamable HTTP Transport, Leading to Denial of Service
🤖 Prompt for AI Agents
In requirements.txt at line 79, the pinned mcp==1.2.1 is vulnerable; update the
version to mcp==1.25.0 to incorporate fixes for GHSA-3qhf-m339-9g5v,
GHSA-j975-95f5-7wqh, and GHSA-9h52-p55h-vw2f; after updating the requirements
file, regenerate any lock/constraints files used by the project (pip-compile /
pip freeze / poetry lock), run CI tests and a dependency audit (pip-audit or
Snyk) to confirm no remaining vulnerabilities, and ensure the change does not
break compatibility with the codebase.
CodeRabbit fixes (PR coleam00#915): 1. Supabase AsyncClient Migration: - Add get_supabase_async_client() to utils.py - Update SupabaseSitePagesRepository to support both sync and async clients - Use conditional await for backwards compatibility - Add warning when sync client is used in async context - Update container.py get_repository_async() for Supabase 2. Filter Validation in search_similar(): - Add warning for unsupported filter keys in PostgreSQL repo - Document that only "source" filter is supported for vector search - Log warning instead of silently ignoring unsupported keys 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- archon/container.py
- archon/infrastructure/postgres/site_pages_repository.py
- archon/infrastructure/supabase/site_pages_repository.py
- utils/utils.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use `logger.exception()` in except blocks to automatically capture full stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to python/src/**/*.py : Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Do not silently catch exceptions; always log with `logger.exception()` or re-raise the exception
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to python/**/*.py : Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
📚 Learning: 2025-11-29T19:25:01.813Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: python/src/agent_work_orders/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:25:01.813Z
Learning: Applies to python/src/agent_work_orders/**/*.py : Use appropriate logging levels: `debug` for diagnostics, `info` for operations, `warning` for recoverable issues, `error` for non-fatal failures, `exception` in except blocks with stack traces
Applied to files:
archon/infrastructure/supabase/site_pages_repository.py
🧬 Code graph analysis (2)
archon/infrastructure/postgres/site_pages_repository.py (4)
- archon/domain/interfaces/site_pages_repository.py (1): ISitePagesRepository (14-197)
- archon/domain/models/site_page.py (2): SitePage (32-83), SitePageMetadata (13-29)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
- archon/infrastructure/postgres/connection.py (1): create_pool (18-77)
archon/infrastructure/supabase/site_pages_repository.py (4)
- archon/domain/interfaces/site_pages_repository.py (9): ISitePagesRepository (14-197), get_by_id (26-41), find_by_url (44-62), search_similar (65-93), list_unique_urls (96-110), insert (113-137), insert_batch (140-160), delete_by_source (163-179), count (182-197)
- archon/domain/models/site_page.py (1): SitePage (32-83)
- archon/domain/models/search_result.py (1): SearchResult (11-49)
- archon/infrastructure/supabase/mappers.py (3): dict_to_site_page (14-63), site_page_to_dict (66-108), dict_to_search_result (111-145)
🪛 Ruff (0.14.10)
archon/infrastructure/postgres/site_pages_repository.py
55-58: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Possible SQL injection vector through string-based query construction
(S608)
142-142: Avoid specifying long messages outside the exception class
(TRY003)
159-163: Possible SQL injection vector through string-based query construction
(S608)
173-173: Avoid specifying long messages outside the exception class
(TRY003)
202-207: Possible SQL injection vector through string-based query construction
(S608)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
267-271: Possible SQL injection vector through string-based query construction
(S608)
276-276: Possible SQL injection vector through string-based query construction
(S608)
285-285: Avoid specifying long messages outside the exception class
(TRY003)
301-301: Avoid specifying long messages outside the exception class
(TRY003)
313-318: Possible SQL injection vector through string-based query construction
(S608)
337-337: Avoid specifying long messages outside the exception class
(TRY003)
353-353: Avoid specifying long messages outside the exception class
(TRY003)
372-377: Possible SQL injection vector through string-based query construction
(S608)
396-396: Avoid specifying long messages outside the exception class
(TRY003)
413-416: Possible SQL injection vector through string-based query construction
(S608)
427-427: Avoid specifying long messages outside the exception class
(TRY003)
442-442: Possible SQL injection vector through string-based query construction
(S608)
476-476: Avoid specifying long messages outside the exception class
(TRY003)
archon/infrastructure/supabase/site_pages_repository.py
85-85: Consider moving this statement to an else block
(TRY300)
89-89: Avoid specifying long messages outside the exception class
(TRY003)
114-114: Consider moving this statement to an else block
(TRY300)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
164-164: Consider moving this statement to an else block
(TRY300)
168-168: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Consider moving this statement to an else block
(TRY300)
199-199: Avoid specifying long messages outside the exception class
(TRY003)
215-215: Avoid specifying long messages outside the exception class
(TRY003)
228-228: Consider moving this statement to an else block
(TRY300)
232-232: Avoid specifying long messages outside the exception class
(TRY003)
248-248: Avoid specifying long messages outside the exception class
(TRY003)
264-264: Consider moving this statement to an else block
(TRY300)
268-268: Avoid specifying long messages outside the exception class
(TRY003)
294-294: Consider moving this statement to an else block
(TRY300)
298-298: Avoid specifying long messages outside the exception class
(TRY003)
339-339: Consider moving this statement to an else block
(TRY300)
343-343: Avoid specifying long messages outside the exception class
(TRY003)
archon/container.py
86-89: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Local variable postgres_config is assigned to but never used
Remove assignment to unused variable postgres_config
(F841)
109-116: Avoid specifying long messages outside the exception class
(TRY003)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
159-162: Avoid specifying long messages outside the exception class
(TRY003)
214-217: Avoid specifying long messages outside the exception class
(TRY003)
229-229: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
archon/infrastructure/supabase/site_pages_repository.py (2)
40-61: Good hybrid approach for async/sync client support. The implementation correctly handles both AsyncClient and Client with conditional await and appropriate warnings. This addresses the previous review concern while maintaining backward compatibility.
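A sketch of the conditional-await idea; the helper is an assumption, not the actual code:

```python
import inspect
import logging

logger = logging.getLogger(__name__)

async def execute(query):
    """Run a Supabase query builder against either client flavor."""
    result = query.execute()
    if inspect.isawaitable(result):  # AsyncClient: execute() returns a coroutine
        return await result
    logger.warning("Sync Supabase client used in async context; this blocks the event loop")
    return result                    # Client: already a response object
```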
144-156: The match_site_pages RPC function is safe from SQL injection and does not require input validation as described. The filter parameter is typed as JSONB and used with PostgreSQL's parameterized @> containment operator (WHERE metadata @> filter), which safely handles the filter without any risk of SQL injection. JSONB parameters are bound safely by PostgreSQL and cannot be used to modify the query structure. There is no dynamic SQL generation or string concatenation, so validating filter keys is unnecessary for security purposes. The lack of key validation is a design choice for flexibility, not a vulnerability.
Likely an incorrect or invalid review comment.
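The safety argument in miniature, sketched with asyncpg and illustrative identifiers: the filter travels as a bound JSONB parameter, so it can never alter the query text:

```python
import json

async def count_matching(pool, filters: dict) -> int:
    # metadata @> $1 is containment over a bound parameter, not string-built SQL.
    query = "SELECT count(*) FROM site_pages WHERE metadata @> $1::jsonb"
    async with pool.acquire() as conn:
        return await conn.fetchval(query, json.dumps(filters))
```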
archon/container.py (3)
130-189: LGTM! Clean async repository initialization. The async version properly handles async initialization for backends that require it (Supabase AsyncClient, PostgreSQL pool creation) while delegating to the sync version for backends that don't need async setup (memory).
192-231: LGTM! Well-structured embedding service factory. The embedding service initialization properly handles both production (OpenAI) and test (mock) scenarios with clear error messages when dependencies are unavailable.
234-290: LGTM! Useful testing utilities. The service factory, reset, and override functions provide clean testing support and dependency injection capabilities.
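The override/reset helpers likely reduce to something like this sketch (module-level state shown for illustration only):

```python
_repository = None  # module-level singleton slot

def override_repository(repo) -> None:
    """Inject a test double (e.g. the in-memory repository)."""
    global _repository
    _repository = repo

def reset() -> None:
    """Clear cached instances so each test starts from a clean container."""
    global _repository
    _repository = None
```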
archon/infrastructure/postgres/site_pages_repository.py (4)
42-60: LGTM! Proper table name validation implemented. The table name validation against the VALID_TABLE_NAMES whitelist addresses the security concern from previous reviews, preventing SQL injection through table name interpolation.
175-249: LGTM! Robust vector similarity search with proper filter validation. The implementation correctly (see the sketch after this list):
- Uses pgvector's cosine distance operator for similarity search
- Explicitly documents and enforces that only "source" filtering is supported
- Warns about unsupported filter keys
- Clips similarity to [0, 1] range to handle edge cases with poorly normalized embeddings
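A compact sketch of such a query, with illustrative identifiers:

```python
async def search_similar(pool, embedding: list[float], limit: int = 5) -> list[dict]:
    # <=> is pgvector's cosine distance; similarity = 1 - distance.
    query = """
        SELECT id, url, content, 1 - (embedding <=> $1::vector) AS similarity
        FROM site_pages
        ORDER BY embedding <=> $1::vector
        LIMIT $2
    """
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    async with pool.acquire() as conn:
        rows = await conn.fetch(query, vector_literal, limit)
    # Clip so poorly normalized embeddings cannot yield out-of-range scores.
    return [dict(r) | {"similarity": max(0.0, min(1.0, r["similarity"]))} for r in rows]
```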
429-476: LGTM! Comprehensive filter validation implemented. The filter validation properly addresses the SQL injection concern from previous reviews:
- Metadata keys are validated using an alphanumeric + underscore pattern
- Column names are validated against the VALID_COLUMN_NAMES whitelist
- Invalid keys are logged and skipped rather than causing errors
- All values are properly parameterized
478-514: LGTM! Robust row-to-model conversion with defensive parsing. The helper method handles multiple data formats gracefully (see the sketch after this list):
- Parses metadata from both JSON string and dict formats
- Handles pgvector embedding in both JSON array format and alternative formats
- Proper error resilience for data format variations
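A sketch of that defensive parsing, with assumed column shapes:

```python
import json

def parse_metadata(raw) -> dict:
    # Metadata may arrive as a JSON string or already as a dict.
    if isinstance(raw, str):
        return json.loads(raw)
    return raw or {}

def parse_embedding(raw) -> list[float] | None:
    # Embeddings may arrive as pgvector text ('[0.1,0.2]') or a sequence.
    if raw is None:
        return None
    if isinstance(raw, str):
        return [float(x) for x in raw.strip("[]").split(",") if x]
    return [float(x) for x in raw]
```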
elif repo_type == "postgres":
    # PostgreSQL direct with asyncpg + pgvector
    from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool

    # Get PostgreSQL configuration from environment
    postgres_config = {
        "host": os.environ.get("POSTGRES_HOST", "localhost"),
        "port": int(os.environ.get("POSTGRES_PORT", "5432")),
        "database": os.environ.get("POSTGRES_DB", "archon"),
        "user": os.environ.get("POSTGRES_USER", "postgres"),
        "password": os.environ.get("POSTGRES_PASSWORD", ""),
    }

    # Create pool and repository synchronously
    # Note: Pool creation must be done in an async context
    # So we raise an error with instructions
    raise RuntimeError(
        "PostgreSQL repository requires async initialization. "
        "Use get_repository_async() instead, or initialize manually:\n\n"
        "    from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
        "    repo = await PostgresSitePagesRepository.create(**config)\n"
        "    from archon.container import override_repository\n"
        "    override_repository(repo)\n"
    )
Remove unused postgres_config variable.
The postgres_config dictionary is constructed but never used since the code immediately raises a RuntimeError instructing users to use get_repository_async(). Remove the unused variable to clean up the code.
🔎 Proposed fix
elif repo_type == "postgres":
# PostgreSQL direct with asyncpg + pgvector
from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool
- # Get PostgreSQL configuration from environment
- postgres_config = {
- "host": os.environ.get("POSTGRES_HOST", "localhost"),
- "port": int(os.environ.get("POSTGRES_PORT", "5432")),
- "database": os.environ.get("POSTGRES_DB", "archon"),
- "user": os.environ.get("POSTGRES_USER", "postgres"),
- "password": os.environ.get("POSTGRES_PASSWORD", ""),
- }
-
- # Create pool and repository synchronously
- # Note: Pool creation must be done in an async context
- # So we raise an error with instructions
+ # PostgreSQL repository requires async initialization
raise RuntimeError(
"PostgreSQL repository requires async initialization. "
"Use get_repository_async() instead, or initialize manually:\n\n"
" from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
" repo = await PostgresSitePagesRepository.create(**config)\n"
" from archon.container import override_repository\n"
" override_repository(repo)\n"
)
Based on static analysis hints.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif repo_type == "postgres":
    # PostgreSQL direct with asyncpg + pgvector
    from archon.infrastructure.postgres import PostgresSitePagesRepository, create_pool

    # PostgreSQL repository requires async initialization
    raise RuntimeError(
        "PostgreSQL repository requires async initialization. "
        "Use get_repository_async() instead, or initialize manually:\n\n"
        "    from archon.infrastructure.postgres import PostgresSitePagesRepository\n"
        "    repo = await PostgresSitePagesRepository.create(**config)\n"
        "    from archon.container import override_repository\n"
        "    override_repository(repo)\n"
    )
🧰 Tools
🪛 Ruff (0.14.10)
98-98: Local variable postgres_config is assigned to but never used
Remove assignment to unused variable postgres_config
(F841)
109-116: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In archon/container.py around lines 93 to 116, the code builds a postgres_config
dict but never uses it because the function immediately raises a RuntimeError;
remove the unused postgres_config creation to eliminate dead code and potential
confusion. Delete the block that builds postgres_config (host, port, database,
user, password) so the postgres branch only imports the Postgres classes and
raises the RuntimeError with its usage instructions, ensuring no leftover unused
variables remain.
- Add threading.Lock for thread-safe singleton initialization (double-checked locking pattern for get_repository and get_embedding_service) - Add early validation in configure() with clear error messages - Add EMBEDDING_SERVICE_TYPE env var support (parallel to REPOSITORY_TYPE) - Add VALID_REPOSITORY_TYPES and VALID_EMBEDDING_TYPES constants Fixes CodeRabbit nitpicks coleam00#1, coleam00#2, coleam00#3 from PR coleam00#915 review. Note: Nitpick coleam00#4 (SearchResult similarity validation) was already implemented. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces a clean architecture database layer supporting multiple backends (Supabase, PostgreSQL, InMemory) without modifying business logic. ## Changes ### Domain Layer - Add interfaces: ICrawledPagesRepository, ISourcesRepository, ICodeExamplesRepository - Add Pydantic models: CrawledPage, Source, CodeExample, SearchResult - Multi-dimension embeddings support (768, 1024, 1536, 3072) ### Infrastructure Layer - Supabase: Refactored existing implementation to use interfaces - PostgreSQL: New asyncpg + pgvector implementation with connection pooling - InMemory: Thread-safe implementation for fast unit tests ### Container & DI - Singleton Container with async lifecycle (initialize/shutdown) - Factory pattern for backend selection via REPOSITORY_TYPE env var - Integrated with FastAPI lifespan ### Tests - 102 unit tests (100% passing) - Contract tests verifying interface compliance - InMemory fixtures for isolated testing ### Documentation - docs/DATABASE_ABSTRACTION.md - Complete architecture guide - Migration guide from direct Supabase calls - Instructions for adding new backends ## Configuration ```bash # Supabase (default) export REPOSITORY_TYPE=supabase # PostgreSQL export REPOSITORY_TYPE=postgres # InMemory (tests) export REPOSITORY_TYPE=memory ``` Related: PR coleam00#915 (previous work on legacy architecture) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Closing in favor of #918, which is based on the current v2 architecture. This PR was developed against the legacy codebase structure that has since been completely refactored. PR #918 implements the same Repository Pattern concept, aligned with the new architecture. Thank you for the reviews and feedback on this draft - the learnings were incorporated into the new implementation.
* feat: natural-language approval routing for workflow gates Users can now just type their answer in the same conversation instead of using /workflow approve <id>. The message handler detects a paused workflow and auto-routes the response as the approval. Flow: message arrives → check for paused run → write approval events → transition to resumable → discover workflow → dispatch resume - Add getPausedWorkflowRun() DB query (dual conversation/parent check) - Add interception in handleMessage() before slash command routing - Remove resumeRun from CommandResult (natural language handles it) - Simplify /workflow approve to just record approval (no auto-dispatch) - Update docs with natural-language approval as recommended approach * fix: resolve WorkflowWithSource type mismatch in approval routing discoverWorkflowsWithConfig returns WorkflowWithSource[], but findWorkflow expects WorkflowDefinition[]. Map to .workflow before passing. * fix: address review findings for natural-language approval routing - Move DB writes (createWorkflowEvent, updateWorkflowRun) inside try block to prevent silent message loss on DB failures - Use discoverAllWorkflows() instead of discoverWorkflowsWithConfig() to include global and bundled workflows in approval resume path - Add runtime validation for ApprovalContext metadata instead of unsafe type cast; send explicit guidance when context is missing - Add _started/_completed log event pairing per logging conventions - Include conversationId in error log context for correlation - Elevate getPausedWorkflowRun DB error log from warn to error - Clarify /workflow approve message to guide users toward NL resume - Add "why status: failed?" comment matching command-handler pattern - Update docs/approval-nodes.md and docs/authoring-workflows.md to reflect that /workflow approve no longer auto-resumes - Add 7 behavioral tests: happy path, slash bypass, no paused run, missing approval context, workflow not found, no codebase, DB failure
Summary
This PR implements a complete database abstraction layer using the Repository Pattern, achieving 100% Core Scope migration. All core services now use repository interfaces instead of direct Supabase calls, enabling seamless switching between database backends.
Migration Statistics:
Changes Made
Repository Interfaces (11 total)
- IPageRepository: Crawled pages and chunks (16 methods)
- ISourceRepository: Knowledge sources (12 methods)
- ICodeExampleRepository: Code snippets (9 methods)
- IProjectRepository: Projects (10 methods)
- ITaskRepository: Tasks (12 methods)
- ISettingsRepository: Application settings (5 methods)
- IVersionRepository: Document versions (6 methods)
- IProjectSourceRepository: Project-source links (5 methods)
- IPromptRepository: AI prompts (2 methods)
- IMigrationRepository: Migration tracking (3 methods)
- IEmbeddingService: Embedding generation (3 methods)
Services Migrated
- knowledge_service.py - RAG search operations
- knowledge_item_service.py - Source CRUD operations
- knowledge_summary_service.py - Summary statistics
- project_service.py - Project management
- task_service.py - Task management
- settings_service.py - Application settings
- version_service.py - Version history
- prompt_service.py - AI prompts
- migration_service.py - Database migrations
- rag_tools.py (MCP) - RAG tool implementations
- project_tools.py (MCP) - Project tool implementations
- task_tools.py (MCP) - Task tool implementations
Architecture
Benefits
- REPOSITORY_TYPE env var switches backends without code changes
Test Plan
Configuration
Set the REPOSITORY_TYPE environment variable:
- supabase (default) - Use Supabase backend
- postgres - Use direct PostgreSQL connection
- memory - Use in-memory storage (testing only)