Walkthrough

Adds persistent page metadata and section parsing, links chunks to pages, exposes a Pages API and RAG return_mode ("pages"/"chunks"), adds page storage operations and an LLMS-full parser, propagates page_id through crawling/storage, extracts HTML titles, and updates source upsert behavior and tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant KnowledgeAPI as /api/knowledge
    participant RAGService
    participant VectorStore
    participant PagesTable as archon_page_metadata
    Client->>KnowledgeAPI: POST /api/knowledge { query, return_mode="pages" }
    KnowledgeAPI->>RAGService: perform_rag_query(query, return_mode)
    RAGService->>VectorStore: vector/hybrid search → chunk results
    VectorStore-->>RAGService: chunk_results (may include page_id/url)
    alt return_mode == "pages" and page_ids present
        RAGService->>PagesTable: SELECT metadata WHERE id IN (page_ids)
        PagesTable-->>RAGService: page rows
        RAGService-->>KnowledgeAPI: grouped page-level results + actual_return_mode="pages"
    else
        RAGService-->>KnowledgeAPI: chunk-level results + actual_return_mode="chunks"
    end
    KnowledgeAPI-->>Client: { success: true, results, actual_return_mode }
```
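To make the pages mode concrete, here is a hypothetical client call matching the flow above; the endpoint path, port, and response fields are assumptions read off the diagram, not taken from the code:

```python
import httpx

# Hypothetical request illustrating return_mode="pages" (path and port are assumed).
response = httpx.post(
    "http://localhost:8181/api/knowledge/rag/query",
    json={"query": "how are crawled pages chunked?", "return_mode": "pages"},
)
data = response.json()
# Per the diagram, the response carries a success flag, page-level results, and the
# mode actually used ("pages", or "chunks" when page_ids are missing and it falls back).
print(data["success"], data.get("actual_return_mode"), len(data.get("results", [])))
```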
```mermaid
sequenceDiagram
    autonumber
    participant Crawler
    participant CrawlService
    participant PageStore as PageStorageOperations
    participant DocStoreOps as DocumentStorageOperations
    participant ChunkStore as DocumentStorageService
    participant Pages as archon_page_metadata
    Crawler->>CrawlService: emit crawl_results (html/text/url)
    alt llms-full content
        CrawlService->>PageStore: store_llms_full_sections(base_url, content, source_id)
        PageStore->>Pages: INSERT section pages
        Pages-->>PageStore: url→page_id map
    else normal pages
        CrawlService->>PageStore: store_pages(crawl_results, source_id)
        PageStore->>Pages: INSERT pages
        Pages-->>PageStore: url→page_id map
    end
    PageStore-->>CrawlService: url→page_id mapping
    CrawlService->>DocStoreOps: process_and_store_documents(..., url_to_page_id)
    DocStoreOps->>ChunkStore: add_documents_to_supabase(..., url_to_page_id)
    ChunkStore-->>DocStoreOps: stored chunk rows
    DocStoreOps->>PageStore: update_page_chunk_count(page_id, count)
```
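In code terms, the glue between the two storage steps is the {url: page_id} mapping returned by the page store. A rough sketch of the sequence, with signatures simplified from the diagram (the real interfaces differ):

```python
# Simplified, hypothetical orchestration sketch; names follow the diagram above.
async def store_results(crawl_results, source_id, page_store, doc_store_ops):
    # 1. Persist page rows first and receive {url: page_id} back
    url_to_page_id: dict[str, str] = await page_store.store_pages(crawl_results, source_id)

    # 2. Chunk and store documents, passing the mapping so each chunk row can carry its page FK
    chunk_counts = await doc_store_ops.process_and_store_documents(
        crawl_results, source_id, url_to_page_id=url_to_page_id
    )  # return shape (chunks per url) is assumed here

    # 3. Write per-page chunk counts back onto the page rows
    for url, page_id in url_to_page_id.items():
        await page_store.update_page_chunk_count(page_id, chunk_counts.get(url, 0))
```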
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 7
♻️ Duplicate comments (2)
python/src/server/services/crawling/strategies/batch.py (1)
238-248: Duplicate code: extract title parsing to shared helper.

This is identical to the title extraction in single_page.py (lines 182-192) and recursive.py (lines 280-290). See the refactor suggestion in single_page.py to consolidate this logic into a shared helper function.

python/src/server/services/crawling/strategies/recursive.py (1)

280-290: Duplicate code: extract title parsing to shared helper.

This is the third occurrence of identical title extraction logic (also in single_page.py lines 182-192 and batch.py lines 238-248). See the refactor suggestion in single_page.py to consolidate this into a shared helper.
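A minimal sketch of the shared helper being suggested, assuming the three strategies all pull the first <title> tag out of the raw HTML and fall back to "Untitled" (helper name and location are hypothetical):

```python
# e.g. python/src/server/services/crawling/helpers/title_extractor.py (module name assumed)
import re


def extract_html_title(html: str, fallback: str = "Untitled") -> str:
    """Return the text of the first <title> tag, or the fallback when none is present."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not match:
        return fallback
    title = match.group(1).strip()
    return title or fallback
```

Each strategy could then call extract_html_title(result.html) in place of the repeated inline parsing.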
🧹 Nitpick comments (3)
python/src/server/api_routes/knowledge_api.py (1)
87-119: Consider flattening the nested exception handling.

The nested try-catch structure is functional but could be simplified for better readability. The current implementation correctly handles provider-specific errors, but the nesting adds cognitive complexity.
Optional simplification:
```python
try:
    # Test API key with minimal embedding request using provider-scoped configuration
    from ..services.embeddings.embedding_service import create_embedding

    test_result = await create_embedding(text="test", provider=provider)
    if not test_result:
        logger.error(
            f"❌ {provider.title()} API key validation failed - no embedding returned"
        )
        raise HTTPException(
            status_code=401,
            detail={
                "error": f"Invalid {provider.title()} API key",
                "message": f"Please verify your {provider.title()} API key in Settings.",
                "error_type": "authentication_failed",
                "provider": provider,
            },
        )
except HTTPException:
    raise
except Exception as e:
    logger.error(
        f"❌ {provider.title()} API key validation failed: {e}",
        exc_info=True,
    )
    raise HTTPException(
        status_code=401,
        detail={
            "error": f"Invalid {provider.title()} API key",
            "message": f"Please verify your {provider.title()} API key in Settings. Error: {str(e)[:100]}",
            "error_type": "authentication_failed",
            "provider": provider,
        },
    )
```

python/src/mcp_server/features/rag/rag_tools.py (1)
238-242: Consider using elif for mutual exclusivity in input validation.

The validation correctly enforces "either page_id or url" but could be slightly clearer.
Optional refinement:
```diff
-    if not page_id and not url:
-        return json.dumps(
-            {"success": False, "error": "Must provide either page_id or url"},
-            indent=2
-        )
+    if not page_id and not url:
+        return json.dumps(
+            {"success": False, "error": "Must provide either page_id or url"},
+            indent=2
+        )
+    elif page_id and url:
+        return json.dumps(
+            {"success": False, "error": "Provide either page_id or url, not both"},
+            indent=2
+        )
```

python/src/server/services/crawling/page_storage_operations.py (1)
206-224: Consider more specific exception handling.

While using logger.warning for this non-critical update is appropriate, the bare Exception catch should be replaced with more specific exception types (e.g., database-related exceptions).

```diff
-        except Exception as e:
-            logger.warning(
-                f"Error updating chunk_count for page {page_id}: {e}", exc_info=True
-            )
+        except (SupabaseException, DatabaseError) as e:
+            logger.warning(
+                f"Error updating chunk_count for page {page_id}: {e}", exc_info=True
+            )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- migration/0.1.0/010_add_page_metadata_table.sql (1 hunks)
- python/src/mcp_server/features/rag/rag_tools.py (4 hunks)
- python/src/server/api_routes/knowledge_api.py (5 hunks)
- python/src/server/api_routes/pages_api.py (1 hunks)
- python/src/server/main.py (2 hunks)
- python/src/server/services/crawling/crawling_service.py (3 hunks)
- python/src/server/services/crawling/document_storage_operations.py (4 hunks)
- python/src/server/services/crawling/helpers/llms_full_parser.py (1 hunks)
- python/src/server/services/crawling/page_storage_operations.py (1 hunks)
- python/src/server/services/crawling/strategies/batch.py (1 hunks)
- python/src/server/services/crawling/strategies/recursive.py (1 hunks)
- python/src/server/services/crawling/strategies/single_page.py (1 hunks)
- python/src/server/services/search/rag_service.py (3 hunks)
- python/src/server/services/storage/document_storage_service.py (3 hunks)
- python/tests/server/services/test_llms_full_parser.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
- python/src/server/main.py
- python/src/server/services/crawling/document_storage_operations.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/services/search/rag_service.py
- python/src/server/api_routes/pages_api.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
- python/src/server/services/storage/document_storage_service.py
- python/src/server/api_routes/knowledge_api.py
- python/src/server/services/crawling/strategies/single_page.py
- python/src/server/services/crawling/crawling_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
- python/src/server/services/crawling/document_storage_operations.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/services/search/rag_service.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
- python/src/server/services/storage/document_storage_service.py
- python/src/server/services/crawling/strategies/single_page.py
- python/src/server/services/crawling/crawling_service.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
- python/src/server/main.py
- python/src/server/services/crawling/document_storage_operations.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/services/search/rag_service.py
- python/src/server/api_routes/pages_api.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
- python/tests/server/services/test_llms_full_parser.py
- python/src/server/services/storage/document_storage_service.py
- python/src/server/api_routes/knowledge_api.py
- python/src/server/services/crawling/strategies/single_page.py
- python/src/mcp_server/features/rag/rag_tools.py
- python/src/server/services/crawling/crawling_service.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use MyPy for type checking to ensure type safety
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
- python/src/server/main.py
- python/src/server/services/crawling/document_storage_operations.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/services/search/rag_service.py
- python/src/server/api_routes/pages_api.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
- python/src/server/services/storage/document_storage_service.py
- python/src/server/api_routes/knowledge_api.py
- python/src/server/services/crawling/strategies/single_page.py
- python/src/mcp_server/features/rag/rag_tools.py
- python/src/server/services/crawling/crawling_service.py
python/src/server/main.py
📄 CodeRabbit inference engine (CLAUDE.md)
Register FastAPI exception handlers in the main application (search for @app.exception_handler)
Files:
python/src/server/main.py
{python/src/server/exceptions.py,python/src/server/main.py}
📄 CodeRabbit inference engine (CLAUDE.md)
Define custom exceptions in exceptions.py and ensure handlers are wired in main.py
Files:
python/src/server/main.py
python/src/server/api_routes/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Place HTTP route handlers in the API routes directory
Files:
- python/src/server/api_routes/pages_api.py
- python/src/server/api_routes/knowledge_api.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Write backend tests with Pytest (including async support)
Files:
python/tests/server/services/test_llms_full_parser.py
python/src/mcp_server/features/*/*_tools.py
📄 CodeRabbit inference engine (CLAUDE.md)
Name and locate MCP tools as [feature]_tools.py under each feature
Name MCP tools as find_[resource] for list/search/get and manage_[resource] for create/update/delete (with action parameter)
Files:
python/src/mcp_server/features/rag/rag_tools.py
🧬 Code graph analysis (7)
python/src/server/services/crawling/document_storage_operations.py (2)
python/src/server/services/crawling/page_storage_operations.py (2)
store_llms_full_sections (115-204), store_pages (33-113)
python/src/server/services/crawling/helpers/llms_full_parser.py (1)
parse_llms_full_sections(75-189)
python/src/server/services/crawling/page_storage_operations.py (2)
python/src/server/config/logfire_config.py (2)
get_logger (137-147), safe_logfire_info (224-236)
python/src/server/services/crawling/helpers/llms_full_parser.py (1)
parse_llms_full_sections(75-189)
python/src/server/services/search/rag_service.py (2)
python/src/server/api_routes/knowledge_api.py (1)
perform_rag_query (1110-1143)
python/src/server/config/logfire_config.py (1)
set_attribute(178-180)
python/src/server/api_routes/pages_api.py (2)
python/src/server/config/logfire_config.py (2)
get_logger (137-147), safe_logfire_error (239-251)
python/src/server/services/client_manager.py (1)
get_supabase_client(15-43)
python/tests/server/services/test_llms_full_parser.py (1)
python/src/server/services/crawling/helpers/llms_full_parser.py (3)
create_section_slug (23-56), create_section_url (59-72), parse_llms_full_sections (75-189)
python/src/server/api_routes/knowledge_api.py (4)
python/src/server/services/embeddings/embedding_service.py (1)
create_embedding (236-293)
python/src/server/services/search/rag_service.py (2)
RAGService (31-479), perform_rag_query (245-366)
python/src/server/services/client_manager.py (1)
get_supabase_client (15-43)
python/src/server/services/crawling/crawling_service.py (2)
get_active_orchestration (47-51), unregister_orchestration (61-65)
python/src/server/services/crawling/crawling_service.py (1)
python/src/server/services/crawling/page_storage_operations.py (1)
PageStorageOperations(16-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (17)
python/src/server/services/crawling/strategies/single_page.py (1)
194-202: LGTM! Title field properly integrated into crawl result.

The title extraction enhances page metadata for downstream RAG and storage operations. The fallback to "Untitled" ensures consistent behavior.
python/src/server/services/crawling/strategies/batch.py (1)
250-255: LGTM! Title field properly added to batch results.

The title inclusion aligns with the page-level metadata enhancement across crawling strategies.
python/src/server/main.py (2)
28-28: LGTM! Pages API router imported correctly.
187-187: LGTM! Pages API router registered properly.

The new pages router is correctly registered with FastAPI, enabling page retrieval endpoints introduced in this PR.
python/src/server/api_routes/knowledge_api.py (3)
180-180: LGTM! Return mode field properly typed and documented.

The new return_mode field with default "chunks" and comment clarifying valid values follows Pydantic best practices.

1120-1132: LGTM! Return mode correctly propagated through RAG pipeline.

The return_mode parameter is properly passed to RAGService.perform_rag_query and the success flag is correctly added to the result before returning.

1295-1313: LGTM! Orchestration service cancellation properly handled.

The orchestration service is correctly retrieved, cancelled, and unregistered. The sequential steps ensure proper cleanup of the crawl task.
python/src/server/services/crawling/strategies/recursive.py (1)
292-297: LGTM! Title field properly integrated into recursive crawl results.

python/src/server/services/crawling/crawling_service.py (3)
22-22: LGTM! PageStorageOperations imported correctly.
102-102: LGTM! Page storage operations properly initialized.

The PageStorageOperations instance is correctly instantiated with the Supabase client, following the same pattern as doc_storage_ops.

427-437: LGTM! Document storage called with explicit url_to_page_id parameter.

Passing url_to_page_id=None is correct at this stage in the orchestration flow. The parameter will be populated by page storage operations before document storage in the complete implementation.

python/src/server/services/storage/document_storage_service.py (2)

16-29: LGTM! Page ID linkage parameter properly added.

The url_to_page_id parameter is correctly typed as dict[str, str] | None and positioned appropriately in the function signature to enable FK linkage between pages and chunks.

403-417: LGTM! Page ID correctly retrieved and included in chunk records.

The page_id is properly fetched from the url_to_page_id mapping and included in the batch data payload, establishing the FK relationship in the database. The conditional check ensures graceful handling when page_id is unavailable.
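Reduced to a sketch, the linkage described here is a dictionary lookup when each chunk record is assembled; the surrounding field names below are illustrative, only url_to_page_id and page_id come from the review:

```python
# Illustrative only: how a chunk record can pick up its page FK from the mapping.
def build_chunk_record(chunk: dict, url_to_page_id: dict[str, str] | None) -> dict:
    record = {
        "url": chunk["url"],
        "content": chunk["content"],
        # ... embedding, source_id, metadata, etc.
    }
    page_id = (url_to_page_id or {}).get(chunk["url"])
    if page_id:  # only set the FK when a page row actually exists
        record["page_id"] = page_id
    return record
```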
python/src/mcp_server/features/rag/rag_tools.py (4)
79-85: LGTM! Return mode parameter properly integrated.

The return_mode parameter with default "pages" follows the established pattern and is correctly documented in the docstring.

117-138: LGTM! Return mode correctly propagated to backend and response.

The return_mode is properly included in the request payload (lines 117-121) and the backend's return_mode is correctly surfaced in the response (line 133).

217-279: LGTM! New rag_read_full_page tool properly implemented.

The tool provides a clean interface for retrieving complete page content after RAG search. The dual lookup by page_id or URL offers flexibility, and error handling follows the established pattern.
248-254: No changes needed for URL encoding
httpx automatically encodes query parameters when using the params dict (e.g. url=https%3A%2F%2F…), as confirmed by testing.
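For reference, the httpx behavior being relied on looks like this; the endpoint below is just an example, not the real route:

```python
import httpx

request = httpx.Request(
    "GET",
    "http://archon-server/api/pages/by-url",  # example URL, path assumed
    params={"url": "https://docs.example.com/guide"},
)
# The query value is percent-encoded by httpx, e.g. url=https%3A%2F%2Fdocs.example.com%2Fguide
print(request.url)
```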
```python
@router.get("/pages/{page_id}")
async def get_page_by_id(page_id: str):
    """
    Get a single page by its ID.

    Args:
        page_id: The UUID of the page

    Returns:
        PageResponse with complete page data
    """
    try:
        client = get_supabase_client()

        # Query by ID
        result = client.table("archon_page_metadata").select("*").eq("id", page_id).single().execute()

        if not result.data:
            raise HTTPException(status_code=404, detail=f"Page not found: {page_id}")

        return PageResponse(**result.data)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Error getting page {page_id}: {e}", exc_info=True)
        safe_logfire_error(f"Failed to get page | page_id={page_id} | error={str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
```
Remove redundant check and improve exception handling.
Two issues:
- Redundant None check (lines 106-107): The .single() method raises an exception when no results are found, so checking if not result.data is redundant and will never execute. The Supabase client raises an exception before returning None.
- Bare Exception catching (line 113): Should use specific exception types as per coding guidelines.
```diff
+from supabase import SupabaseException
+
 @router.get("/pages/{page_id}")
 async def get_page_by_id(page_id: str):
     """
     Get a single page by its ID.

     Args:
         page_id: The UUID of the page

     Returns:
         PageResponse with complete page data
     """
     try:
         client = get_supabase_client()

         # Query by ID
         result = client.table("archon_page_metadata").select("*").eq("id", page_id).single().execute()

-        if not result.data:
-            raise HTTPException(status_code=404, detail=f"Page not found: {page_id}")
-
         return PageResponse(**result.data)

     except HTTPException:
         raise
-    except Exception as e:
+    except SupabaseException as e:
+        # .single() raises when no results found - convert to 404
+        if "not found" in str(e).lower() or "no rows" in str(e).lower():
+            raise HTTPException(status_code=404, detail=f"Page not found: {page_id}") from e
         logger.error(f"Error getting page {page_id}: {e}", exc_info=True)
         safe_logfire_error(f"Failed to get page | page_id={page_id} | error={str(e)}")
         raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
+    except ValueError as e:
+        logger.error(f"Invalid page data for {page_id}: {e}", exc_info=True)
+        raise HTTPException(status_code=500, detail=f"Invalid page data: {str(e)}") from e
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In python/src/server/api_routes/pages_api.py around lines 89 to 116, remove the
redundant check for result.data after .single() (the Supabase client will raise
when no row is found) and replace the broad "except Exception" with targeted
exception handling: catch the Supabase/Postgrest client-specific exception(s)
(e.g., PostgrestError or the client-specific error type you use) to turn
not-found errors into a 404 HTTPException and other client errors into a 500,
and keep the existing except HTTPException: raise passthrough; avoid catching
bare Exception so only expected DB/client errors are handled and rewrapped into
HTTPException with a clear message.
```python
    async def store_pages(
        self,
        crawl_results: list[dict],
        source_id: str,
        request: dict[str, Any],
        crawl_type: str,
    ) -> dict[str, str]:
        """
        Store pages in archon_page_metadata table from regular crawl results.

        Args:
            crawl_results: List of crawled documents with url, markdown, title, etc.
            source_id: The source ID these pages belong to
            request: The original crawl request with knowledge_type, tags, etc.
            crawl_type: Type of crawl performed (sitemap, url, link_collection, etc.)

        Returns:
            {url: page_id} mapping for FK references in chunks
        """
        safe_logfire_info(
            f"store_pages called | source_id={source_id} | crawl_type={crawl_type} | num_results={len(crawl_results)}"
        )

        url_to_page_id: dict[str, str] = {}
        pages_to_insert: list[dict[str, Any]] = []

        for doc in crawl_results:
            url = doc.get("url", "").strip()
            markdown = doc.get("markdown", "").strip()

            # Skip documents with empty content or missing URLs
            if not url or not markdown:
                continue

            # Prepare page record
            word_count = len(markdown.split())
            char_count = len(markdown)

            page_record = {
                "source_id": source_id,
                "url": url,
                "full_content": markdown,
                "section_title": None,  # Regular page, not a section
                "section_order": 0,
                "word_count": word_count,
                "char_count": char_count,
                "chunk_count": 0,  # Will be updated after chunking
                "metadata": {
                    "knowledge_type": request.get("knowledge_type", "documentation"),
                    "crawl_type": crawl_type,
                    "page_type": "documentation",
                    "tags": request.get("tags", []),
                },
            }
            pages_to_insert.append(page_record)

        # Batch insert pages
        if pages_to_insert:
            try:
                safe_logfire_info(
                    f"Inserting {len(pages_to_insert)} pages into archon_page_metadata table"
                )
                result = (
                    self.supabase_client.table("archon_page_metadata")
                    .insert(pages_to_insert)
                    .execute()
                )

                # Build url → page_id mapping
                for page in result.data:
                    url_to_page_id[page["url"]] = page["id"]

                safe_logfire_info(
                    f"Successfully stored {len(url_to_page_id)} pages in archon_page_metadata"
                )

            except Exception as e:
                logger.error(f"Error inserting pages into archon_page_metadata: {e}", exc_info=True)
                # Don't raise - allow chunking to continue even if page storage fails

        return url_to_page_id
```
Enhance batch operation error reporting and exception specificity.
The method has several issues related to the coding guidelines:
- Bare Exception catching (line 109): The guidelines require "Use specific exception types; avoid catching bare Exception". Consider catching more specific exceptions like SupabaseException or database-related errors.
- Missing detailed failure reporting: The guidelines state "For batch operations, report both success count and a detailed failure list". Currently, if the batch insert fails, you lose visibility into which specific pages failed. The current implementation is all-or-nothing.
- Potential data loss: If the batch insert partially succeeds (some pages inserted, some failed), the url_to_page_id mapping will be incomplete, but downstream code won't know which pages are missing.
Consider this refactored approach:
```diff
-        # Batch insert pages
-        if pages_to_insert:
-            try:
-                safe_logfire_info(
-                    f"Inserting {len(pages_to_insert)} pages into archon_page_metadata table"
-                )
-                result = (
-                    self.supabase_client.table("archon_page_metadata")
-                    .insert(pages_to_insert)
-                    .execute()
-                )
-
-                # Build url → page_id mapping
-                for page in result.data:
-                    url_to_page_id[page["url"]] = page["id"]
-
-                safe_logfire_info(
-                    f"Successfully stored {len(url_to_page_id)} pages in archon_page_metadata"
-                )
-
-            except Exception as e:
-                logger.error(f"Error inserting pages into archon_page_metadata: {e}", exc_info=True)
-                # Don't raise - allow chunking to continue even if page storage fails
+        # Batch insert pages
+        if pages_to_insert:
+            failed_pages = []
+            try:
+                safe_logfire_info(
+                    f"Inserting {len(pages_to_insert)} pages into archon_page_metadata table"
+                )
+                result = (
+                    self.supabase_client.table("archon_page_metadata")
+                    .insert(pages_to_insert)
+                    .execute()
+                )
+
+                # Build url → page_id mapping
+                for page in result.data:
+                    url_to_page_id[page["url"]] = page["id"]
+
+                safe_logfire_info(
+                    f"Successfully stored {len(url_to_page_id)} pages | failed={len(failed_pages)} | total={len(pages_to_insert)}"
+                )
+
+            except (SupabaseException, DatabaseError) as e:
+                # Track which pages failed
+                failed_pages = [p["url"] for p in pages_to_insert]
+                logger.error(
+                    f"Error inserting pages into archon_page_metadata | failed_count={len(failed_pages)} | error={e}",
+                    exc_info=True,
+                    extra={"failed_urls": failed_pages[:10]}  # Log first 10 for debugging
+                )
+                # Don't raise - allow chunking to continue even if page storage fails
```

Note: You'll need to import the appropriate exception types from your Supabase/database client library.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    async def store_pages(
        self,
        crawl_results: list[dict],
        source_id: str,
        request: dict[str, Any],
        crawl_type: str,
    ) -> dict[str, str]:
        """
        Store pages in archon_page_metadata table from regular crawl results.

        Args:
            crawl_results: List of crawled documents with url, markdown, title, etc.
            source_id: The source ID these pages belong to
            request: The original crawl request with knowledge_type, tags, etc.
            crawl_type: Type of crawl performed (sitemap, url, link_collection, etc.)

        Returns:
            {url: page_id} mapping for FK references in chunks
        """
        safe_logfire_info(
            f"store_pages called | source_id={source_id} | crawl_type={crawl_type} | num_results={len(crawl_results)}"
        )

        url_to_page_id: dict[str, str] = {}
        pages_to_insert: list[dict[str, Any]] = []

        for doc in crawl_results:
            url = doc.get("url", "").strip()
            markdown = doc.get("markdown", "").strip()

            # Skip documents with empty content or missing URLs
            if not url or not markdown:
                continue

            # Prepare page record
            word_count = len(markdown.split())
            char_count = len(markdown)

            page_record = {
                "source_id": source_id,
                "url": url,
                "full_content": markdown,
                "section_title": None,  # Regular page, not a section
                "section_order": 0,
                "word_count": word_count,
                "char_count": char_count,
                "chunk_count": 0,  # Will be updated after chunking
                "metadata": {
                    "knowledge_type": request.get("knowledge_type", "documentation"),
                    "crawl_type": crawl_type,
                    "page_type": "documentation",
                    "tags": request.get("tags", []),
                },
            }
            pages_to_insert.append(page_record)

        # Batch insert pages
        if pages_to_insert:
            failed_pages: list[str] = []
            try:
                safe_logfire_info(
                    f"Inserting {len(pages_to_insert)} pages into archon_page_metadata table"
                )
                result = (
                    self.supabase_client.table("archon_page_metadata")
                    .insert(pages_to_insert)
                    .execute()
                )

                # Build url → page_id mapping
                for page in result.data:
                    url_to_page_id[page["url"]] = page["id"]

                safe_logfire_info(
                    f"Successfully stored {len(url_to_page_id)} pages | failed={len(failed_pages)} | total={len(pages_to_insert)}"
                )

            except (SupabaseException, DatabaseError) as e:
                # Track which pages failed
                failed_pages = [p["url"] for p in pages_to_insert]
                logger.error(
                    f"Error inserting pages into archon_page_metadata | "
                    f"failed_count={len(failed_pages)} | error={e}",
                    exc_info=True,
                    extra={"failed_urls": failed_pages[:10]},  # Log first 10 for debugging
                )
                # Don't raise - allow chunking to continue even if page storage fails

        return url_to_page_id
```
🤖 Prompt for AI Agents
In python/src/server/services/crawling/page_storage_operations.py around lines
33 to 113, the batch insert catches a bare Exception and provides no granular
failure reporting or recovery, which can hide partial successes and missing page
IDs; change the try/except to catch the specific Supabase/database exception(s)
(import them from the client lib), after a failed batch insert log the exception
with response/error payload, attempt a fallback by inserting pages individually
(or in smaller batches) to collect which page records succeeded vs failed,
populate url_to_page_id only for successful inserts, and return both the success
count and a failure list (or at minimum log the failed URLs and their errors) so
downstream code can know which pages are missing; ensure all logs include the
exception details and relevant page URLs/indices.
Actionable comments posted: 0
🧹 Nitpick comments (3)
python/tests/server/services/test_llms_full_parser.py (1)
126-126: Clarify comment wording.

The comment "Should only have 2 sections (empty one skipped)" is slightly misleading. The standalone # on line 118 is not treated as a section boundary (it doesn't match the "# " pattern with a space), so there's no empty section being skipped—it's just content within Section 1.

Consider rewording for clarity:

```diff
-        # Should only have 2 sections (empty one skipped)
+        # Should have 2 sections (standalone "#" is content, not a section boundary)
```

python/src/mcp_server/features/rag/rag_tools.py (1)
296-358: Add validation for mutually exclusive parameters.

The docstring states "Provide EITHER page_id OR url, not both," but the validation only checks if neither is provided (lines 317-321). When both are provided, the code proceeds with page_id taking precedence (line 327), which may not be the intended behavior.
Add validation for the mutually exclusive case:
```diff
 async def rag_read_full_page(
     ctx: Context, page_id: str | None = None, url: str | None = None
 ) -> str:
     """..."""
     try:
-        if not page_id and not url:
+        if not page_id and not url:
             return json.dumps(
                 {"success": False, "error": "Must provide either page_id or url"},
                 indent=2
             )
+        if page_id and url:
+            return json.dumps(
+                {"success": False, "error": "Provide either page_id or url, not both"},
+                indent=2
+            )
```

python/src/server/services/search/rag_service.py (1)
245-364: LGTM! Clean return_mode integration with smart fallback.

The extension of perform_rag_query to support return_mode is well-implemented. The fallback logic (lines 334-345) gracefully handles pre-migration data without page_ids by returning chunks instead of pages, ensuring backward compatibility.

The response propagates actual_return_mode (line 357) and sets the span attribute (line 362), but the log message on line 365 uses the original return_mode parameter instead of actual_return_mode. This could be slightly misleading when fallback occurs:

```diff
-logger.info(f"RAG query completed - {len(formatted_results)} {return_mode} found")
+logger.info(f"RAG query completed - {len(formatted_results)} {actual_return_mode} found")
```

This ensures the log accurately reflects whether pages or chunks were actually returned.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- CLAUDE.md (1 hunks)
- python/src/mcp_server/features/rag/rag_tools.py (4 hunks)
- python/src/server/services/search/rag_service.py (3 hunks)
- python/src/server/services/source_management_service.py (4 hunks)
- python/tests/server/services/test_llms_full_parser.py (1 hunks)
- python/tests/test_source_race_condition.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)
Files:
- python/tests/test_source_race_condition.py
- python/tests/server/services/test_llms_full_parser.py
- python/src/mcp_server/features/rag/rag_tools.py
- python/src/server/services/source_management_service.py
- python/src/server/services/search/rag_service.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Write backend tests with Pytest (including async support)
Files:
- python/tests/test_source_race_condition.py
- python/tests/server/services/test_llms_full_parser.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use MyPy for type checking to ensure type safety
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- python/src/mcp_server/features/rag/rag_tools.py
- python/src/server/services/source_management_service.py
- python/src/server/services/search/rag_service.py
python/src/mcp_server/features/*/*_tools.py
📄 CodeRabbit inference engine (CLAUDE.md)
Name and locate MCP tools as [feature]_tools.py under each feature
Name MCP tools as find_[resource] for list/search/get and manage_[resource] for create/update/delete (with action parameter)
Files:
python/src/mcp_server/features/rag/rag_tools.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs
Files:
- python/src/server/services/source_management_service.py
- python/src/server/services/search/rag_service.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)
Files:
- python/src/server/services/source_management_service.py
- python/src/server/services/search/rag_service.py
🧬 Code graph analysis (3)
python/tests/server/services/test_llms_full_parser.py (1)
python/src/server/services/crawling/helpers/llms_full_parser.py (3)
create_section_slug (23-56), create_section_url (59-72), parse_llms_full_sections (75-189)
python/src/server/services/source_management_service.py (1)
python/src/server/services/llm_provider_service.py (2)
extract_message_text (885-922), get_llm_client (313-548)
python/src/server/services/search/rag_service.py (2)
python/src/server/api_routes/knowledge_api.py (1)
perform_rag_query (1110-1143)
python/src/server/config/logfire_config.py (1)
set_attribute(178-180)
🔇 Additional comments (4)
python/tests/server/services/test_llms_full_parser.py (1)
1-179: LGTM! Comprehensive test coverage with resolved anchor format.

The test file provides excellent coverage of the LLMs-full parser functionality. The previous review concern about anchor format (expecting #section-{order}-{slug} instead of just #{slug}) has been correctly addressed throughout the test assertions.

Test coverage includes:
- Slug generation and URL creation ✓
- Single and multiple H1 sections ✓
- Edge cases: no H1s, H2 handling, empty sections, consecutive H1s ✓
- Word count calculation and empty content handling ✓
python/src/mcp_server/features/rag/rag_tools.py (2)
79-152: LGTM! Clear extension of search with return_mode support.

The addition of the return_mode parameter with a default of "pages" aligns well with the PR objectives to move toward page-level retrieval. The documentation clearly explains both modes and recommends pages for better context.

The request/response flow correctly propagates return_mode through the backend call and includes it in the response for transparency.

217-294: LGTM! Well-structured page listing tool.

The new rag_list_pages_for_source tool provides a clean interface for browsing documentation structure. The docstring includes a helpful example workflow, and error handling is comprehensive.

python/src/server/services/search/rag_service.py (1)
175-243: LGTM! Solid page-grouping implementation with thoughtful scoring.

The _group_chunks_by_pages method effectively aggregates chunk results into page-level responses. The aggregate scoring logic (average similarity with match-count boost capped at 20%) provides a reasonable signal for page relevance; a standalone sketch of this calculation follows the list of strengths below. The fallback from page_id to URL for grouping ensures compatibility with different data sources.

Key strengths:
- Handles both page_id and URL-based grouping
- Fetches complete page metadata from archon_page_metadata
- Includes preview text from best matching chunk
- Robust handling of missing page records with .maybe_single()
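The aggregate score reduces to a small calculation (the formula is taken from the context lines in the diff further down; shown standalone here for clarity):

```python
def aggregate_page_score(total_similarity: float, chunk_matches: int) -> float:
    """Average chunk similarity, boosted by 2% per matching chunk, capped at +20%."""
    avg_similarity = total_similarity / chunk_matches
    match_boost = min(0.2, chunk_matches * 0.02)
    return avg_similarity * (1 + match_boost)


# Example: 5 matching chunks averaging 0.70 similarity -> 0.70 * 1.10 = 0.77
```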
Actionable comments posted: 2
♻️ Duplicate comments (5)
python/src/server/services/crawling/page_storage_operations.py (2)
118-123: Remove the bare Exception handler in store_pages.

Catching Exception here still violates the service guidelines and swallows unexpected failures, leaving callers with an empty url_to_page_id and no signal that page storage failed. Drop this block (or re-raise) and rely on the specific APIError handling so upstream code can react appropriately.

```diff
-        except Exception as e:
-            safe_logfire_error(
-                f"Unexpected error upserting pages | source_id={source_id} | attempted={len(pages_to_insert)} | error={str(e)}"
-            )
-            logger.error(f"Unexpected error upserting pages for source {source_id}: {e}", exc_info=True)
-            # Don't raise - allow chunking to continue
+        except Exception:
+            raise
```
217-223: Avoid swallowing unexpected errors when storing llms-full sections.

Same issue as above: the bare Exception handler hides real failures and leaves downstream processes thinking the upsert succeeded. Remove it (or re-raise) so only the specific database error path is tolerated.

```diff
-        except Exception as e:
-            safe_logfire_error(
-                f"Unexpected error upserting sections | base_url={base_url} | attempted={len(pages_to_insert)} | error={str(e)}"
-            )
-            logger.error(f"Unexpected error upserting sections for {base_url}: {e}", exc_info=True)
-            # Don't raise - allow process to continue
+        except Exception:
+            raise
```

python/src/server/api_routes/pages_api.py (3)
130-133: Use specific Supabase exceptions in list_pages.

The broad except Exception was already flagged—please catch APIError (and other expected types) explicitly and let unexpected errors bubble. This keeps to the service guidelines and preserves stack traces.

```diff
-    except Exception as e:
+    except APIError as e:
         logger.error(f"Error listing pages for source {source_id}: {e}", exc_info=True)
         safe_logfire_error(f"Failed to list pages | source_id={source_id} | error={str(e)}")
         raise HTTPException(status_code=500, detail=f"Failed to list pages: {str(e)}") from e
+    except ValueError as e:
+        logger.error(f"Invalid page data for source {source_id}: {e}", exc_info=True)
+        raise HTTPException(status_code=500, detail=f"Invalid page data: {str(e)}") from e
```
155-167: Return 404 for missing pages and remove the dead if not result.data check.

Postgrest's .single() raises APIError (code PGRST116) when no row exists, so the if not result.data branch never runs. Worse, the exception is caught by the bare handler and exposed as a 500. Catch APIError, translate the not-found case to a 404, and drop the redundant check.

```diff
-        result = client.table("archon_page_metadata").select("*").eq("url", url).single().execute()
-
-        if not result.data:
-            raise HTTPException(status_code=404, detail=f"Page not found for URL: {url}")
+        result = client.table("archon_page_metadata").select("*").eq("url", url).single().execute()

         # Handle large pages
         page_data = _handle_large_page_content(result.data.copy())

         return PageResponse(**page_data)

     except HTTPException:
         raise
-    except Exception as e:
+    except APIError as e:
+        if e.code == "PGRST116" and "no rows" in (e.message or "").lower():
+            raise HTTPException(status_code=404, detail=f"Page not found for URL: {url}") from e
+        logger.error(f"Error getting page by URL {url}: {e}", exc_info=True)
+        safe_logfire_error(f"Failed to get page by URL | url={url} | error={str(e)}")
+        raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
+    except ValueError as e:
         logger.error(f"Error getting page by URL {url}: {e}", exc_info=True)
         safe_logfire_error(f"Failed to get page by URL | url={url} | error={str(e)}")
         raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
```
187-199: Fix not-found handling for /pages/{page_id} and drop the redundant null check.

Same issue as the URL lookup: .single() raises APIError when no page exists, so the if not result.data branch never executes. The caller currently receives a 500 instead of a 404. Catch APIError, translate the not-found case, and avoid the bare Exception.

```diff
-        result = client.table("archon_page_metadata").select("*").eq("id", page_id).single().execute()
-
-        if not result.data:
-            raise HTTPException(status_code=404, detail=f"Page not found: {page_id}")
+        result = client.table("archon_page_metadata").select("*").eq("id", page_id).single().execute()

         # Handle large pages
         page_data = _handle_large_page_content(result.data.copy())

         return PageResponse(**page_data)

     except HTTPException:
         raise
-    except Exception as e:
+    except APIError as e:
+        if e.code == "PGRST116" and "no rows" in (e.message or "").lower():
+            raise HTTPException(status_code=404, detail=f"Page not found: {page_id}") from e
+        logger.error(f"Error getting page {page_id}: {e}", exc_info=True)
+        safe_logfire_error(f"Failed to get page | page_id={page_id} | error={str(e)}")
+        raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
+    except ValueError as e:
         logger.error(f"Error getting page {page_id}: {e}", exc_info=True)
         safe_logfire_error(f"Failed to get page | page_id={page_id} | error={str(e)}")
         raise HTTPException(status_code=500, detail=f"Failed to get page: {str(e)}") from e
```
🧹 Nitpick comments (1)
python/src/server/services/search/rag_service.py (1)
197-197: Consider using highest-similarity chunk instead of first chunk.

The current implementation uses the first chunk's content as best_chunk_content, which may not be the most representative. Consider selecting the chunk with the highest similarity score instead.

Apply this diff:

```diff
+                    "best_similarity": result.get("similarity_score", 0.0),
                     "best_chunk_content": result.get("content", ""),
```

Then update the accumulation logic:

```diff
                 page_groups[group_key]["chunk_matches"] += 1
                 page_groups[group_key]["total_similarity"] += result.get("similarity_score", 0.0)
+                # Track the best chunk by similarity
+                current_similarity = result.get("similarity_score", 0.0)
+                if current_similarity > page_groups[group_key]["best_similarity"]:
+                    page_groups[group_key]["best_similarity"] = current_similarity
+                    page_groups[group_key]["best_chunk_content"] = result.get("content", "")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- python/src/server/api_routes/pages_api.py (1 hunks)
- python/src/server/services/crawling/page_storage_operations.py (1 hunks)
- python/src/server/services/search/rag_service.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs
Files:
- python/src/server/services/search/rag_service.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/api_routes/pages_api.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)
Files:
- python/src/server/services/search/rag_service.py
- python/src/server/services/crawling/page_storage_operations.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)
Files:
- python/src/server/services/search/rag_service.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/api_routes/pages_api.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use MyPy for type checking to ensure type safety
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- python/src/server/services/search/rag_service.py
- python/src/server/services/crawling/page_storage_operations.py
- python/src/server/api_routes/pages_api.py
python/src/server/api_routes/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Place HTTP route handlers in the API routes directory
Files:
python/src/server/api_routes/pages_api.py
🧬 Code graph analysis (3)
python/src/server/services/search/rag_service.py (2)
python/src/server/api_routes/knowledge_api.py (1)
perform_rag_query (1110-1143)
python/src/server/config/logfire_config.py (1)
set_attribute(178-180)
python/src/server/services/crawling/page_storage_operations.py (2)
python/src/server/config/logfire_config.py (3)
get_logger (137-147), safe_logfire_error (239-251), safe_logfire_info (224-236)
python/src/server/services/crawling/helpers/llms_full_parser.py (1)
parse_llms_full_sections(75-189)
python/src/server/api_routes/pages_api.py (2)
python/src/server/config/logfire_config.py (2)
get_logger (137-147), safe_logfire_error (239-251)
python/src/server/services/client_manager.py (1)
get_supabase_client(15-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (2)
python/src/server/services/search/rag_service.py (2)
329-344: LGTM! Page grouping logic with graceful fallback.

The implementation correctly handles the page-level return mode with a graceful fallback to chunks when page_ids are not present. The logging provides good visibility for the fallback behavior.
245-245: LGTM! Return mode parameter added correctly.

The return_mode parameter is added with an appropriate default value of "chunks" to maintain backward compatibility.
```python
        except Exception as e:
            logger.warning(
                f"Unexpected error updating chunk_count for page {page_id}: {e}", exc_info=True
```
Let truly unexpected update failures surface.
update_page_chunk_count should not catch bare Exception; re-raise after logging so programming/config bugs don’t get silently muted.
```diff
-        except Exception as e:
-            logger.warning(
-                f"Unexpected error updating chunk_count for page {page_id}: {e}", exc_info=True
-            )
+        except Exception:
+            raise
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        except Exception:
            raise
```
🤖 Prompt for AI Agents
In python/src/server/services/crawling/page_storage_operations.py around lines
245 to 247, the except block catches a bare Exception and only logs it which
masks programming/configuration errors; modify the handler to re-raise the
exception after logging (e.g., log with exc_info=True and then call raise) or
remove the broad except entirely so unexpected failures propagate, ensuring only
expected/handled errors are swallowed if explicitly intended.
```python
            # Query page by page_id if available, otherwise by URL
            if data["page_id"]:
                page_info = (
                    self.supabase_client.table("archon_page_metadata")
                    .select("id, url, section_title, word_count")
                    .eq("id", data["page_id"])
                    .maybe_single()
                    .execute()
                )
            else:
                # Regular pages - exact URL match
                page_info = (
                    self.supabase_client.table("archon_page_metadata")
                    .select("id, url, section_title, word_count")
                    .eq("url", data["url"])
                    .maybe_single()
                    .execute()
                )
```
Refactor to use a single bulk query and add error handling.
The current implementation queries the database once per page in a loop, creating an N+1 query problem that degrades performance when many pages are involved. Additionally, database queries lack error handling, which violates the coding guidelines requiring proper error handling and logging for database operations.
Apply this diff to fetch all page metadata in a single bulk query with error handling:
- page_results = []
- for group_key, data in page_groups.items():
+ # Collect all page_ids and URLs for bulk query
+ page_ids_to_fetch = [data["page_id"] for data in page_groups.values() if data["page_id"]]
+ urls_to_fetch = [data["url"] for data in page_groups.values() if not data["page_id"] and data["url"]]
+
+ # Fetch all page metadata in bulk
+ page_metadata_map = {}
+ try:
+ if page_ids_to_fetch:
+ page_results_by_id = (
+ self.supabase_client.table("archon_page_metadata")
+ .select("id, url, section_title, word_count")
+ .in_("id", page_ids_to_fetch)
+ .execute()
+ )
+ for page in page_results_by_id.data or []:
+ page_metadata_map[page["id"]] = page
+
+ if urls_to_fetch:
+ page_results_by_url = (
+ self.supabase_client.table("archon_page_metadata")
+ .select("id, url, section_title, word_count")
+ .in_("url", urls_to_fetch)
+ .execute()
+ )
+ for page in page_results_by_url.data or []:
+ page_metadata_map[page["url"]] = page
+ except Exception as e:
+ logger.error(f"Failed to fetch page metadata | page_ids={len(page_ids_to_fetch)} | urls={len(urls_to_fetch)} | error={str(e)}", exc_info=True)
+ return []
+
+ page_results = []
+ for group_key, data in page_groups.items():
avg_similarity = data["total_similarity"] / data["chunk_matches"]
match_boost = min(0.2, data["chunk_matches"] * 0.02)
aggregate_score = avg_similarity * (1 + match_boost)
- # Query page by page_id if available, otherwise by URL
- if data["page_id"]:
- page_info = (
- self.supabase_client.table("archon_page_metadata")
- .select("id, url, section_title, word_count")
- .eq("id", data["page_id"])
- .maybe_single()
- .execute()
- )
- else:
- # Regular pages - exact URL match
- page_info = (
- self.supabase_client.table("archon_page_metadata")
- .select("id, url, section_title, word_count")
- .eq("url", data["url"])
- .maybe_single()
- .execute()
- )
-
- if page_info and page_info.data is not None:
+ # Look up page metadata from bulk query results
+ lookup_key = data["page_id"] if data["page_id"] else data["url"]
+ page_info_data = page_metadata_map.get(lookup_key)
+
+ if page_info_data:
page_results.append({
- "page_id": page_info.data["id"],
- "url": page_info.data["url"],
- "section_title": page_info.data.get("section_title"),
- "word_count": page_info.data.get("word_count", 0),
+ "page_id": page_info_data["id"],
+ "url": page_info_data["url"],
+ "section_title": page_info_data.get("section_title"),
+ "word_count": page_info_data.get("word_count", 0),
"chunk_matches": data["chunk_matches"],
"aggregate_similarity": aggregate_score,
"average_similarity": avg_similarity,
"source_id": data["source_id"],
})
+ else:
+ logger.warning(f"Page metadata not found | group_key={group_key} | page_id={data['page_id']} | url={data['url']}")As per coding guidelines.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-        # Query page by page_id if available, otherwise by URL
-        if data["page_id"]:
-            page_info = (
-                self.supabase_client.table("archon_page_metadata")
-                .select("id, url, section_title, word_count")
-                .eq("id", data["page_id"])
-                .maybe_single()
-                .execute()
-            )
-        else:
-            # Regular pages - exact URL match
-            page_info = (
-                self.supabase_client.table("archon_page_metadata")
-                .select("id, url, section_title, word_count")
-                .eq("url", data["url"])
-                .maybe_single()
-                .execute()
-            )
+        # Collect all page_ids and URLs for bulk query
+        page_ids_to_fetch = [data["page_id"] for data in page_groups.values() if data["page_id"]]
+        urls_to_fetch = [data["url"] for data in page_groups.values() if not data["page_id"] and data["url"]]
+
+        # Fetch all page metadata in bulk
+        page_metadata_map = {}
+        try:
+            if page_ids_to_fetch:
+                page_results_by_id = (
+                    self.supabase_client.table("archon_page_metadata")
+                    .select("id, url, section_title, word_count")
+                    .in_("id", page_ids_to_fetch)
+                    .execute()
+                )
+                for page in page_results_by_id.data or []:
+                    page_metadata_map[page["id"]] = page
+
+            if urls_to_fetch:
+                page_results_by_url = (
+                    self.supabase_client.table("archon_page_metadata")
+                    .select("id, url, section_title, word_count")
+                    .in_("url", urls_to_fetch)
+                    .execute()
+                )
+                for page in page_results_by_url.data or []:
+                    page_metadata_map[page["url"]] = page
+        except Exception as e:
+            logger.error(
+                f"Failed to fetch page metadata | page_ids={len(page_ids_to_fetch)} | "
+                f"urls={len(urls_to_fetch)} | error={e}",
+                exc_info=True
+            )
+            return []
+
+        page_results = []
+        for group_key, data in page_groups.items():
+            avg_similarity = data["total_similarity"] / data["chunk_matches"]
+            match_boost = min(0.2, data["chunk_matches"] * 0.02)
+            aggregate_score = avg_similarity * (1 + match_boost)
+
+            # Look up page metadata from bulk query results
+            lookup_key = data["page_id"] if data["page_id"] else data["url"]
+            page_info_data = page_metadata_map.get(lookup_key)
+
+            if page_info_data:
+                page_results.append({
+                    "page_id": page_info_data["id"],
+                    "url": page_info_data["url"],
+                    "section_title": page_info_data.get("section_title"),
+                    "word_count": page_info_data.get("word_count", 0),
+                    "chunk_matches": data["chunk_matches"],
+                    "aggregate_similarity": aggregate_score,
+                    "average_similarity": avg_similarity,
+                    "source_id": data["source_id"],
+                })
+            else:
+                logger.warning(
+                    f"Page metadata not found | group_key={group_key} | "
+                    f"page_id={data['page_id']} | url={data['url']}"
+                )
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- python/src/server/services/crawling/helpers/llms_full_parser.py (1 hunks)
- python/src/server/services/storage/base_storage_service.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs
Files:
- python/src/server/services/storage/base_storage_service.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)
Files:
- python/src/server/services/storage/base_storage_service.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)
Files:
- python/src/server/services/storage/base_storage_service.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use MyPy for type checking to ensure type safety
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- python/src/server/services/storage/base_storage_service.py
- python/src/server/services/crawling/helpers/llms_full_parser.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
Actionable comments posted: 0
🧹 Nitpick comments (1)
python/tests/server/services/test_llms_full_parser.py (1)
1-195: Well-structured test suite with good edge case coverage. The test suite effectively covers the core functionality of the LLMs-full.txt parser:
- Slug generation with punctuation/spacing edge cases
- URL construction with section ordering
- Single and multiple section parsing
- Edge cases: no H1 headers, H2 boundaries, empty sections, consecutive H1s, whitespace handling
The tests are clear, well-named, and follow pytest conventions. The anchor format (`#section-{order}-{slug}`) correctly matches the parser implementation.

Optional enhancements:

1. **Test code block handling**: The parser ignores H1 headers inside code blocks. Consider adding:

````python
def test_h1_inside_code_block_ignored():
    """Test that H1 inside code blocks are not section boundaries"""
    content = """# Real Section

Some content here.

```python
# This is not a section
def foo():
    pass
```

More content in the real section.
"""
    sections = parse_llms_full_sections(content, "https://example.com/llms-full.txt")
    assert len(sections) == 1
    assert sections[0].section_title == "# Real Section"
````

2. **Test code block split merging**: The parser merges sections split by unclosed code blocks (parser lines 200-234). Consider testing this behavior.

3. **Test small section combining**: The parser combines consecutive sections <200 chars (parser lines 237-259). While `test_empty_sections_skipped` and `test_consecutive_h1_headers` work around this by using longer content, an explicit test would verify the combining logic.

4. **Parameterize slug tests**: Lines 14-21 could use `pytest.mark.parametrize` for conciseness:

```python
@pytest.mark.parametrize("input_heading,expected_slug", [
    ("# Core Concepts", "core-concepts"),
    ("# Getting Started!", "getting-started"),
    ("# API Reference (v2)", "api-reference-v2"),
    # ...
])
def test_create_section_slug(input_heading, expected_slug):
    assert create_section_slug(input_heading) == expected_slug
```
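A sketch of what the combining test from item 3 might look like, assuming the under-200-character merge behavior described above really does collapse the two sections into one (the section contents are made up):

```python
def test_consecutive_small_sections_combined():
    """Two consecutive tiny H1 sections should be merged into a single section."""
    content = """# Install

Run the installer.

# Verify

Check the reported version.
"""
    sections = parse_llms_full_sections(content, "https://example.com/llms-full.txt")

    # Both sections are far below the ~200-character threshold the parser reportedly
    # uses, so the expectation is one combined section rather than two.
    assert len(sections) == 1
```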
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
python/tests/server/services/test_llms_full_parser.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)
Files:
python/tests/server/services/test_llms_full_parser.py
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Write backend tests with Pytest (including async support)
Files:
python/tests/server/services/test_llms_full_parser.py
🧬 Code graph analysis (1)
python/tests/server/services/test_llms_full_parser.py (1)
python/src/server/services/crawling/helpers/llms_full_parser.py (3)
- create_section_slug (23-56)
- create_section_url (59-72)
- parse_llms_full_sections (75-262)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
* Initial commit for RAG by document
* Phase 2
* Adding migrations
* Fixing page IDs for chunk metadata
* Fixing unit tests, adding tool to list pages for source
* Fixing page storage upsert issues
* Max file length for retrieval
* Fixing title issue
* Fixing tests
…ts (#767) The /api/health endpoint reported concurrency.active: 0 while workflows were running because background workflows bypass the ConversationLockManager. Now the endpoint queries running workflows from the DB and merges them with lock manager stats, deduplicating shared conversation IDs. Changes: - Added getRunningWorkflows() DB function returning running workflow details - Enriched /api/health to combine lock manager + DB workflow sources - Updated and expanded health endpoint tests for merge/dedup behavior Fixes #767 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ts (coleam00#767) The /api/health endpoint reported concurrency.active: 0 while workflows were running because background workflows bypass the ConversationLockManager. Now the endpoint queries running workflows from the DB and merges them with lock manager stats, deduplicating shared conversation IDs. Changes: - Added getRunningWorkflows() DB function returning running workflow details - Enriched /api/health to combine lock manager + DB workflow sources - Updated and expanded health endpoint tests for merge/dedup behavior Fixes coleam00#767 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull Request
Summary
Moved from chunk-level to page-level retrieval to fit better with how RAG works best with AI coding assistants and code documentation.
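As a rough illustration of what a page-level result looks like after this change, here is a sketch assembled from the field names used in the rag_service code quoted in the review comments above; all values are made up.

```python
# Illustrative values only; field names mirror the dict built in the reviewed
# rag_service code, actual contents depend on the crawl and the query.
page_result = {
    "page_id": "2b9f0c7e-0000-0000-0000-000000000000",  # row id in archon_page_metadata
    "url": "https://example.com/docs/getting-started",
    "section_title": "Getting Started",
    "word_count": 1840,
    "chunk_matches": 4,                 # matching chunks grouped onto this page
    "aggregate_similarity": 0.87,       # average similarity with a small match-count boost
    "average_similarity": 0.81,
    "source_id": "example.com",
}
```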
Changes Made
Type of Change
Affected Services
Testing
New backend tests for RAG (MCP and API).
Test Evidence
All unit tests, including the new ones, are passing.
Checklist
Breaking Changes
Old documentation will still work, but for optimal results everyone should recrawl.
Summary by CodeRabbit
New Features
Documentation
Tests