Fix backend linting issues (148 auto-fixable errors) #470
Conversation
Applied safe auto-fixes for:

- W293, W292: Fixed whitespace issues in blank lines and at EOF
- F401: Removed unused imports
- UP035: Updated deprecated typing imports (Dict, List to dict, list)
- SIM108: Simplified if-else blocks to ternary operators
- C408: Simplified unnecessary dict() calls

The remaining 44 errors require manual review (mostly F841 unused variables that may have side effects from function calls).

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Walkthrough

Refactors imports (typing to collections.abc), applies whitespace/newline fixes, and cleans unused imports across multiple modules and tests. Introduces progress callbacks in single-page markdown crawling and expands document processing to include crawl_type-driven per-chunk metadata and updated storage/summary flows. Removes a side-effect Socket.IO handlers import from server startup.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant SinglePage as SinglePageCrawlStrategy
    participant Fetcher as HTTP/Browser Fetcher
    participant Parser as Markdown Generator
    rect rgba(230,245,255,0.5)
        note over Caller,SinglePage: crawl_markdown_file with progress_callback
        Caller->>SinglePage: crawl_markdown_file(url, transform_url_func, progress_callback)
        SinglePage-->>Caller: progress_callback(start_progress)
    end
    SinglePage->>Fetcher: fetch url (wait_until: domcontentloaded, timeouts, selectors)
    Fetcher-->>SinglePage: HTML/content
    SinglePage->>Parser: generate markdown/content
    Parser-->>SinglePage: chunks/metadata
    rect rgba(230,245,255,0.5)
        SinglePage-->>Caller: progress_callback(end_progress)
        SinglePage-->>Caller: results
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Orchestrator as Crawl Orchestrator
    participant DSO as DocumentStorageOperations
    participant Store as add_documents_to_supabase
    participant Sources as Source Records
    Orchestrator->>DSO: process_and_store_documents(crawl_results, request, crawl_type, original_source_id, ...)
    loop per chunk
        DSO->>DSO: build per-chunk metadata (url, title, description, crawl_type, ...)
    end
    DSO->>Store: add_documents_to_supabase(contents, metadatas, urls, chunk_numbers, url_to_full_document, batch_size=25, enable_parallel_batches=True)
    alt group by source_id
        DSO->>Sources: update_source_info(..., summary, word_count, content, knowledge_type, tags, update_frequency, original_url)
        opt fallback
            DSO->>Sources: upsert into archon_sources with metadata flags
        end
    end
    DSO-->>Orchestrator: result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
python/src/server/services/storage/storage_services.py (2)
57-71: Harden WebSocket progress sends to not fail the upload on transient WS errors

Per coding guidelines, a single WebSocket failure should not crash/abort the server or the job. Wrap `websocket.send_json` in its own try/except and log with stack trace to avoid turning UI disconnects into upload failures.

```diff
-        async def report_progress(message: str, percentage: int, batch_info: dict = None):
+        async def report_progress(message: str, percentage: float, batch_info: dict | None = None):
             if websocket:
-                data = {
-                    "type": "upload_progress",
-                    "filename": filename,
-                    "progress": percentage,
-                    "message": message,
-                }
-                if batch_info:
-                    data.update(batch_info)
-                await websocket.send_json(data)
+                data = {
+                    "type": "upload_progress",
+                    "filename": filename,
+                    "progress": percentage,
+                    "message": message,
+                }
+                if batch_info:
+                    data.update(batch_info)
+                try:
+                    await websocket.send_json(data)
+                except Exception as ws_err:
+                    # Do not fail the upload on WS send errors; just log with traceback.
+                    logger.error(
+                        "WebSocket send failed during document upload | filename=%s | error=%s",
+                        filename,
+                        str(ws_err),
+                        exc_info=True,
+                    )
             if progress_callback:
                 await progress_callback(message, percentage, batch_info)
```
181-193: Preserve stack traces and add context on errors

Logging the exception without `exc_info=True` drops the traceback. Also, include contextual fields (filename/source_id) in the error response to aid debugging.

```diff
-            span.set_attribute("success", False)
-            span.set_attribute("error", str(e))
-            logger.error(f"Error uploading document: {e}")
+            span.set_attribute("success", False)
+            span.set_attribute("error", str(e))
+            logger.error(
+                "Error uploading document | filename=%s | source_id=%s | error=%s",
+                filename,
+                source_id,
+                str(e),
+                exc_info=True,
+            )
@@
-            return False, {"error": f"Error uploading document: {str(e)}"}
+            return False, {
+                "error": f"Error uploading document: {str(e)}",
+                "filename": filename,
+                "source_id": source_id,
+            }
```

python/src/server/api_routes/mcp_api.py (5)
615-621: Fix breaking use of unsupported keyword args in logging (debug/error)

`logging.Logger` methods don’t accept arbitrary keywords like `count=` or `error=`. This will raise TypeError at runtime.
Apply:
```diff
-        logs = mcp_manager.get_logs(limit)
-        api_logger.debug("MCP server logs retrieved", count=len(logs))
+        logs = mcp_manager.get_logs(limit)
+        api_logger.debug("MCP server logs retrieved - count=%d", len(logs))
         safe_set_attribute(span, "log_count", len(logs))
         return {"logs": logs}
     except Exception as e:
-        api_logger.error("MCP server logs API failed", error=str(e))
+        api_logger.error("MCP server logs API failed - error=%s", str(e))
         safe_set_attribute(span, "error", str(e))
         raise HTTPException(status_code=500, detail=str(e))
```
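The failure mode is easy to reproduce in isolation. A minimal sketch (logger name and values are illustrative); this assumes `api_logger` wraps a stdlib `logging.Logger` — a structured logger such as structlog or logfire would accept keyword fields:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("demo")

# Stdlib Logger methods only accept exc_info/stack_info/stacklevel/extra
# as keywords, so arbitrary fields like count= raise TypeError at runtime.
try:
    logger.debug("MCP server logs retrieved", count=3)
except TypeError as exc:
    print(f"TypeError: {exc}")

# %-style lazy formatting is the supported way to attach values; the
# string is only interpolated if the record is actually emitted.
logger.debug("MCP server logs retrieved - count=%d", 3)

# To attach structured fields, stdlib logging offers extra= instead:
logger.debug("MCP server logs retrieved", extra={"log_count": 3})
```

The `extra=` form puts the field on the LogRecord, so a custom formatter or handler can still render it structurally.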
637-640: Same logging kwargs issue in clear_logs error path.

Apply:

```diff
-    except Exception as e:
-        api_logger.error("MCP server clear logs API failed", error=str(e))
+    except Exception as e:
+        api_logger.error("MCP server clear logs API failed - error=%s", str(e))
         safe_set_attribute(span, "success", False)
         safe_set_attribute(span, "error", str(e))
         raise HTTPException(status_code=500, detail=str(e))
```
701-704: Same logging kwargs issue in get_mcp_config error path.

Apply:

```diff
-    except Exception as e:
-        api_logger.error("Failed to get MCP configuration", error=str(e))
+    except Exception as e:
+        api_logger.error("Failed to get MCP configuration - error=%s", str(e))
         safe_set_attribute(span, "error", str(e))
         raise HTTPException(status_code=500, detail={"error": str(e)})
```
825-835: Same logging kwargs issue in get_mcp_tools (debug error branch).

Apply:

```diff
-    except Exception as e:
-        api_logger.error("Failed to debug MCP server tools", error=str(e))
+    except Exception as e:
+        api_logger.error("Failed to debug MCP server tools - error=%s", str(e))
         return {
             "tools": [],
             "count": 0,
             "server_running": is_running,
             "source": "debug_error",
             "message": f"Debug failed: {str(e)}",
         }
```
837-847: Same logging kwargs issue in get_mcp_tools (outer error path).

Apply:

```diff
-    except Exception as e:
-        api_logger.error("Failed to get MCP tools", error=str(e))
+    except Exception as e:
+        api_logger.error("Failed to get MCP tools - error=%s", str(e))
         safe_set_attribute(span, "error", str(e))
         safe_set_attribute(span, "source", "general_error")
         return {
```
🧹 Nitpick comments (20)
python/src/server/services/storage/storage_services.py (1)
78-81: Type mismatch on progress percentage (float vs int) — make the API consistent

`report_progress` took `percentage: int`, but you pass a float value (`10 + float(pct) * 0.2`). Align the signature and usage to avoid type-checker noise and inconsistent payloads.

```diff
-            progress_callback=lambda msg, pct: report_progress(
-                f"Chunking: {msg}", 10 + float(pct) * 0.2
-            ),
+            progress_callback=lambda msg, pct: report_progress(
+                f"Chunking: {msg}", 10.0 + float(pct) * 0.2
+            ),
```

Additionally, keep the updated `report_progress(..., percentage: float, ...)` from the previous diff.

python/src/server/api_routes/knowledge_api.py (2)
21-35: Deduplicate repeated imports to reduce noise and satisfy linters

`get_crawler`, `RAGService`, `DocumentStorageService`, and `get_supabase_client` are imported twice. Remove the duplicates to keep the module tidy and avoid future merge conflicts.

```diff
-from ..services.crawler_manager import get_crawler
-
-# Import unified logging
-from ..config.logfire_config import get_logger, safe_logfire_error, safe_logfire_info
-from ..services.crawler_manager import get_crawler
-from ..services.search.rag_service import RAGService
-from ..services.storage import DocumentStorageService
-from ..utils import get_supabase_client
+from ..services.crawler_manager import get_crawler
+
+# Import unified logging
+from ..config.logfire_config import get_logger, safe_logfire_error, safe_logfire_info
```
909-919: Prefer timezone-aware timestamps for emitted events

`datetime.utcnow().isoformat()` produces naive timestamps. Consider `datetime.now(timezone.utc).isoformat()` to ensure consumers treat them as UTC.

```diff
-                "timestamp": datetime.utcnow().isoformat(),
+                "timestamp": datetime.now(timezone.utc).isoformat(),
```

You’ll need `from datetime import datetime, timezone` at the top.
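For illustration, the difference between the two forms (note that `datetime.utcnow()` is also deprecated as of Python 3.12):

```python
from datetime import datetime, timezone

naive = datetime.utcnow().isoformat()
aware = datetime.now(timezone.utc).isoformat()

# The naive form carries no offset, so a consumer cannot tell whether it
# is UTC or local time; the aware form ends with "+00:00".
print(naive)
print(aware)

assert datetime.fromisoformat(naive).tzinfo is None
assert datetime.fromisoformat(aware).tzinfo is not None
assert aware.endswith("+00:00")
```

Round-tripping through `fromisoformat` preserves the tzinfo only for the aware form, which is what downstream event consumers need.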
64-79: Optional: add jitter and clamp negative attempts in backoff.

Helps reduce thundering herd and guards against accidental negative attempts. Out of scope for this lint-only PR, but worth a follow-up.
Apply this minimal diff inside get_polling_interval:
```diff
-    # Exponential backoff: 1s, 2s, 4s, 5s, 5s, ...
-    interval = min(base_interval * (2**attempt), max_interval)
-    return float(interval)
+    # Exponential backoff: 1s, 2s, 4s, 5s, 5s, ...
+    attempt = max(0, attempt)
+    interval = min(base_interval * (2**attempt), max_interval)
+    # Optional jitter to prevent thundering herd; set MCP_POLLING_JITTER_PCT (0.0–1.0)
+    jitter_pct = float(os.getenv("MCP_POLLING_JITTER_PCT", "0.0"))
+    if jitter_pct > 0.0:
+        import random
+        jitter = (random.random() * 2 - 1) * (interval * jitter_pct)
+        interval = min(max_interval, max(base_interval, interval + jitter))
+    return float(interval)
```
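A standalone sketch of the capped exponential backoff with symmetric jitter described above; the function name, defaults, and `MCP_POLLING_JITTER_PCT` env var mirror the diff but are illustrative:

```python
import os
import random

def polling_interval(attempt: int, base: float = 1.0, cap: float = 5.0) -> float:
    """Capped exponential backoff: 1s, 2s, 4s, 5s, 5s, ..."""
    attempt = max(0, attempt)  # guard against accidental negative attempts
    interval = min(base * (2 ** attempt), cap)
    # Symmetric jitter of +/- jitter_pct, clamped back into [base, cap].
    jitter_pct = float(os.getenv("MCP_POLLING_JITTER_PCT", "0.0"))
    if jitter_pct > 0.0:
        interval += (random.random() * 2 - 1) * interval * jitter_pct
        interval = min(cap, max(base, interval))
    return float(interval)

# [1.0, 2.0, 4.0, 5.0, 5.0, 5.0] when no jitter is configured
print([polling_interval(n) for n in range(6)])
```

Clamping after adding jitter keeps the interval inside the configured bounds even at the extremes of the random draw.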
7-10: Modernize typing to Python 3.12 style (PEP 585/604).

Consistent with repo guideline (Ruff UP035), prefer builtin generics and unions. This is cosmetic and safe.
Apply:
```diff
-from typing import Any, Dict, Optional
+from typing import Any, Optional
```

And update occurrences:

```diff
-    error_response: Dict[str, Any] = {
+    error_response: dict[str, Any] = {
```

```diff
-    details: Dict[str, Any] = {"exception_type": type(exception).__name__, "exception_message": str(exception)}
+    details: dict[str, Any] = {"exception_type": type(exception).__name__, "exception_message": str(exception)}
```
71-90: Make HTTP error-body parsing robust to non-dict “detail”.

`body.get("detail", {}).get("error")` breaks if `detail` is a string. Fall back safely.
Targeted change inside from_http_error:
```diff
-            body = response.json()
-            if isinstance(body, dict):
-                # Look for common error fields
-                error_message = (
-                    body.get("detail", {}).get("error")
-                    or body.get("error")
-                    or body.get("message")
-                    or body.get("detail")
-                )
+            body = response.json()
+            if isinstance(body, dict):
+                # Look for common error fields
+                detail = body.get("detail")
+                detail_error = detail.get("error") if isinstance(detail, dict) else None
+                error_message = (
+                    detail_error
+                    or body.get("error")
+                    or body.get("message")
+                    or (detail if isinstance(detail, str) else None)
+                )
```
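A minimal reproduction of why the chained `.get` breaks (the body shapes are illustrative; FastAPI commonly returns a plain-string `detail`):

```python
# FastAPI error bodies often look like {"detail": "Not found"}, with a
# plain string rather than a dict under "detail".
body = {"detail": "Not found"}

try:
    body.get("detail", {}).get("error")  # str has no .get()
except AttributeError as exc:
    print(f"AttributeError: {exc}")

# Checking the field's type first handles both shapes safely:
detail = body.get("detail")
detail_error = detail.get("error") if isinstance(detail, dict) else None
message = detail_error or (detail if isinstance(detail, str) else None)
print(message)  # Not found

# The structured shape still works:
dict_body = {"detail": {"error": "boom"}}
detail = dict_body.get("detail")
assert (detail.get("error") if isinstance(detail, dict) else None) == "boom"
```

Note the `.get("detail", {})` default only helps when the key is absent, not when it is present with a non-dict value.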
720-721: Remove unused local variable supabase_client (ruff F841).

Assigned but never used; safe to drop.
Apply:
```diff
-        supabase_client = get_supabase_client()
         config_json = config.model_dump_json()
```
464-466: Nit: StopIteration except is unreachable with next(..., None).

You pass a default to next(), so StopIteration won’t be raised. Either remove the except or drop the default.
Apply one of:
```diff
-                except StopIteration:
-                    break
```

or

```diff
-                log_line = await asyncio.get_event_loop().run_in_executor(
-                    None, next, log_generator, None
-                )
+                log_line = await asyncio.get_event_loop().run_in_executor(
+                    None, next, log_generator
+                )
```
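The behavior difference in one place (illustrative generator):

```python
def lines():
    yield "log line 1"

gen = lines()
assert next(gen, None) == "log line 1"
# With a default, exhaustion returns the default; StopIteration is never
# raised, so an `except StopIteration` around this call is dead code.
assert next(gen, None) is None

gen2 = lines()
next(gen2)
try:
    next(gen2)  # no default: exhaustion raises StopIteration
    raised = False
except StopIteration:
    raised = True
print(raised)  # True
```

Either variant in the diff is therefore internally consistent; mixing the default with the `except` branch is what leaves unreachable code.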
71-79: Optional: extract container name to a class-level constant.

Avoid magic string duplication and ease future changes.
Example:
```diff
 class MCPServerManager:
     """Manages the MCP Docker container lifecycle."""

     def __init__(self):
-        self.container_name = None  # Will be resolved dynamically
+        self.container_name = "archon-mcp"
```

And replace string literals accordingly.
python/src/mcp_server/features/projects/project_tools.py (1)
33-38: Prefer PEP 604 unions over Optional for Python 3.12.

For consistency with other modules (e.g., ServerResponse uses str | None), switch Optional[str] to str | None and drop the import.
Apply:
```diff
-from typing import Optional
+# Optional no longer needed if using PEP 604 unions
@@
-    github_repo: Optional[str] = None,
+    github_repo: str | None = None,
@@
-    title: Optional[str] = None,
-    description: Optional[str] = None,
-    github_repo: Optional[str] = None,
+    title: str | None = None,
+    description: str | None = None,
+    github_repo: str | None = None,
```

And remove the now-unused Optional import at the top.
Also applies to: 280-286
python/src/server/services/crawling/document_storage_operations.py (9)
8-9: Modernize type hints (PEP 585) and reduce imports

Prefer builtin generics over typing aliases in Python 3.12. Keep Callable from collections.abc; only import Any from typing.
Apply:
```diff
-from typing import Dict, Any, List, Optional
-from collections.abc import Callable
+from typing import Any
+from collections.abc import Callable
```
57-59: Don’t re-instantiate DocumentStorageService; reuse the one created in init

Avoid duplicate instances; there may be configuration/state you want to keep consistent.

```diff
-        # Initialize storage service for chunking
-        storage_service = DocumentStorageService(self.supabase_client)
+        # Initialize storage service for chunking
+        storage_service = self.doc_storage_service
```
91-99: Skip empty chunks to avoid storing zero-length content

Guard against chunkers that may emit blanks after splitting.

```diff
-            for i, chunk in enumerate(chunks):
+            for i, chunk in enumerate(chunks):
+                if not chunk or not chunk.strip():
+                    continue
```
100-115: Standardize default knowledge_type (‘documentation’ vs ‘technical’)

Two different defaults will fragment analytics/filters. Pick one; suggestion: use 'documentation' in both places to match per-chunk metadata.
Update update_source_info call:
```diff
-                knowledge_type=request.get('knowledge_type', 'technical'),
+                knowledge_type=request.get('knowledge_type', 'documentation'),
```

Also applies to: 228-231
204-210: Build combined_content efficiently and without leading space

Use join + slice for clarity and fewer copies.

```diff
-                combined_content = ''
-                for chunk in source_contents[:3]:  # First 3 chunks for this source
-                    if len(combined_content) + len(chunk) < 15000:
-                        combined_content += ' ' + chunk
-                    else:
-                        break
+                combined_content = " ".join(source_contents[:3])[:15000]
```
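A side-by-side sketch of the two approaches (sample contents are illustrative):

```python
source_contents = ["first chunk", "second chunk", "third chunk", "never used"]

# Loop version: leading space from ' ' + chunk, plus a new string copy
# on every += concatenation.
combined = ""
for chunk in source_contents[:3]:
    if len(combined) + len(chunk) < 15000:
        combined += " " + chunk
    else:
        break

# join + slice: one allocation, no leading space, hard cap at 15000 chars.
combined2 = " ".join(source_contents[:3])[:15000]

print(repr(combined))   # note the leading space
print(repr(combined2))
assert combined == " " + combined2
```

One behavioral difference worth noting: the loop stops at a chunk boundary when the budget would be exceeded, while the slice may truncate mid-chunk; for a summary-input cap that is usually acceptable.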
212-217: Run synchronous LLM summary generation off the event loop

extract_source_summary performs network I/O synchronously; shift it to a worker thread to avoid blocking FastAPI’s loop.

```diff
-                try:
-                    summary = extract_source_summary(source_id, combined_content)
+                try:
+                    summary = await asyncio.to_thread(extract_source_summary, source_id, combined_content)
```
221-233: Consider offloading update_source_info to a thread as well

Supabase client calls are synchronous; making them run via asyncio.to_thread prevents blocking the event loop under load. Keep as-is if this runs in a dedicated worker, otherwise consider:
Example:
```python
await asyncio.to_thread(
    update_source_info,
    client=self.supabase_client,
    source_id=source_id,
    summary=summary,
    word_count=source_id_word_counts[source_id],
    content=combined_content,
    knowledge_type=request.get('knowledge_type', 'documentation'),
    tags=request.get('tags', []),
    update_frequency=0,
    original_url=request.get('url'),
)
```
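A self-contained sketch of the pattern; `blocking_update` is a stand-in for a synchronous Supabase/HTTP call, not the real `update_source_info`:

```python
import asyncio
import time

def blocking_update(duration: float) -> str:
    time.sleep(duration)  # stands in for synchronous network I/O
    return "updated"

async def main() -> None:
    start = time.monotonic()
    # Offloaded to a worker thread, the blocking call no longer stalls
    # the event loop: the concurrent sleep overlaps with it.
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_update, 0.2),
        asyncio.sleep(0.2),
    )
    elapsed = time.monotonic() - start
    print(result, f"{elapsed:.2f}s")  # roughly 0.2s total, not 0.4s

asyncio.run(main())
```

`asyncio.to_thread` forwards both positional and keyword arguments to the function, which is why the kwargs-heavy `update_source_info` call above works unchanged.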
34-42: Tighten function signature types and clean up imports

The proposed refactor tightens the signature to use built-in generics and explicit callback contracts, and you can safely apply it now; no existing callsite passes custom callbacks, so the defaults remain compatible.

• Remove unused typing imports (List, Dict) in python/src/server/services/crawling/document_storage_operations.py
• Update the signature to use `list[dict[str, Any]]`, `dict[str, Any]`, and explicit callback types
• Only one callsite found (crawling_service.py:383) passing the four required args, so `progress_callback=None` and `cancellation_check=None` remain valid

```diff
--- a/python/src/server/services/crawling/document_storage_operations.py
+++ b/python/src/server/services/crawling/document_storage_operations.py
@@
-    from typing import Any, Callable, Dict, List, Optional
+    from typing import Any, Callable, Optional

     async def process_and_store_documents(
         self,
-        crawl_results: List[Dict],
-        request: Dict[str, Any],
+        crawl_results: list[dict[str, Any]],
+        request: dict[str, Any],
         crawl_type: str,
         original_source_id: str,
-        progress_callback: Optional[Callable] = None,
-        cancellation_check: Optional[Callable] = None
-    ) -> Dict[str, Any]:
+        progress_callback: Callable[..., None] | None = None,
+        cancellation_check: Callable[..., bool] | None = None,
+    ) -> dict[str, Any]:
```
138-151: Optional: Short-circuit when there’s no content

The current add_documents_to_supabase implementation safely handles empty inputs:
- If contents (and thus urls) is empty, it skips both the delete and insert loops, and simply returns without error.
- Only a trivial safe_span is opened, but no Supabase calls are made.
That said, you can still add a guard to avoid even the span/credential calls:
```diff
-        await add_documents_to_supabase(
+        if all_contents:
+            await add_documents_to_supabase(
                 client=self.supabase_client,
                 urls=all_urls,
                 chunk_numbers=all_chunk_numbers,
                 contents=all_contents,
                 metadatas=all_metadatas,
                 url_to_full_document=url_to_full_document,
                 batch_size=25,
                 progress_callback=progress_callback,
                 enable_parallel_batches=True,
                 provider=None,
                 cancellation_check=cancellation_check,
-        )
+            )
```

This is purely an optimization to eliminate unnecessary overhead and is not strictly required to prevent errors.
python/tests/test_url_handler.py (1)
1-125: LGTM: whitespace normalization; tests remain behaviorally identical

Minor nit: since URLHandler methods are static, you could call them via the class to avoid per-test instantiation, but current style is fine.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (27)
- python/src/mcp_server/features/projects/project_tools.py (1 hunks)
- python/src/mcp_server/utils/__init__.py (1 hunks)
- python/src/mcp_server/utils/error_handling.py (1 hunks)
- python/src/mcp_server/utils/http_client.py (2 hunks)
- python/src/mcp_server/utils/timeout_config.py (1 hunks)
- python/src/server/api_routes/knowledge_api.py (2 hunks)
- python/src/server/api_routes/mcp_api.py (1 hunks)
- python/src/server/config/config.py (2 hunks)
- python/src/server/main.py (0 hunks)
- python/src/server/services/crawling/crawling_service.py (2 hunks)
- python/src/server/services/crawling/document_storage_operations.py (11 hunks)
- python/src/server/services/crawling/helpers/__init__.py (1 hunks)
- python/src/server/services/crawling/helpers/site_config.py (3 hunks)
- python/src/server/services/crawling/helpers/url_handler.py (6 hunks)
- python/src/server/services/crawling/strategies/__init__.py (1 hunks)
- python/src/server/services/crawling/strategies/batch.py (1 hunks)
- python/src/server/services/crawling/strategies/recursive.py (1 hunks)
- python/src/server/services/crawling/strategies/single_page.py (10 hunks)
- python/src/server/services/crawling/strategies/sitemap.py (2 hunks)
- python/src/server/services/projects/task_service.py (1 hunks)
- python/src/server/services/storage/storage_services.py (1 hunks)
- python/tests/mcp_server/features/projects/test_project_tools.py (0 hunks)
- python/tests/mcp_server/features/tasks/test_task_tools.py (1 hunks)
- python/tests/mcp_server/utils/test_error_handling.py (0 hunks)
- python/tests/mcp_server/utils/test_timeout_config.py (0 hunks)
- python/tests/test_supabase_validation.py (1 hunks)
- python/tests/test_url_handler.py (7 hunks)
💤 Files with no reviewable changes (4)
- python/tests/mcp_server/utils/test_error_handling.py
- python/tests/mcp_server/features/projects/test_project_tools.py
- python/src/server/main.py
- python/tests/mcp_server/utils/test_timeout_config.py
🧰 Additional context used
📓 Path-based instructions (7)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
python/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Place backend tests under python/tests/
🧠 Learnings (1)
📚 Learning: 2025-08-21T11:22:33.541Z
Learnt from: CR
PR: coleam00/Archon#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-21T11:22:33.541Z
Learning: Applies to python/src/{server,mcp,agents}/**/*.py : Never signal failure by returning None/null; raise a descriptive exception instead
Applied to files:
python/src/mcp_server/utils/error_handling.py
🧬 Code graph analysis (3)
python/tests/test_url_handler.py (1)
python/src/server/services/crawling/helpers/url_handler.py (4)
- is_binary_file (52-97)
- URLHandler (14-127)
- is_txt (35-49)
- transform_github_url (100-127)
python/src/server/services/crawling/document_storage_operations.py (2)
python/src/server/config/logfire_config.py (2)
- safe_logfire_info (223-235)
- safe_logfire_error (238-250)

python/src/server/services/source_management_service.py (1)

- extract_source_summary (35-139)
python/src/server/services/crawling/strategies/single_page.py (3)
python/src/server/services/crawling/strategies/batch.py (1)
- report_progress (123-127)

python/src/server/services/crawling/strategies/recursive.py (1)

- report_progress (129-134)

python/src/server/services/storage/storage_services.py (1)

- report_progress (58-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (76)
python/src/mcp_server/utils/__init__.py (1)
21-21: Trailing newline/formatting-only change looks good

No behavioral impact; exports remain unchanged.
python/src/server/services/storage/storage_services.py (1)
49-49: Whitespace-only touch — OK

No functional change.
python/tests/mcp_server/features/tasks/test_task_tools.py (1)
177-177: Whitespace-only normalization — OK

No behavioral change; tests remain clear and deterministic.
python/src/server/api_routes/knowledge_api.py (1)
520-520: Whitespace-only addition — OK

No functional impact.
python/tests/test_supabase_validation.py (1)
8-8: Unused import removal — good hygiene

Removing `MagicMock` keeps tests lean and avoids Ruff F401.

python/src/mcp_server/utils/timeout_config.py (1)
79-79: LGTM: explicit float cast on backoff interval.

No behavior change; keeps typing precise for `httpx.Timeout` usage.
python/src/mcp_server/utils/error_handling.py (1)
166-166: LGTM: trailing newline only.

No functional changes; safe.
python/src/server/services/crawling/strategies/__init__.py (1)
17-17: LGTM: EOF newline fix.

No runtime implications; keeps diffs clean.
python/src/server/api_routes/mcp_api.py (1)
69-69: LGTM: whitespace-only change.

No behavior impact.
python/src/mcp_server/features/projects/project_tools.py (1)
11-11: LGTM: removed unused Any import.

Keeps imports lean; no behavior changes.
python/src/mcp_server/utils/http_client.py (2)
8-9: Import migration from `typing` to `collections.abc` is correct.

The move of `AsyncIterator` from `typing` to `collections.abc` follows Python best practices for generic types and aligns with the broader pattern in this PR across multiple crawling modules.
39-39: LGTM! Added trailing newline for formatting consistency.

python/src/server/config/config.py (1)
104-128: Whitespace-only formatting adjustments are appropriate. The added blank lines improve readability without affecting the security validation logic for local hosts and private IP addresses.
python/src/server/services/projects/task_service.py (3)
21-21: Added blank line for consistent formatting.
25-25: Added blank line for consistent formatting.
32-32: Added blank line for consistent formatting.
python/src/server/services/crawling/helpers/url_handler.py (1)
16-16: Whitespace formatting improvements enhance code readability. The added blank lines appropriately separate logical blocks within static methods without changing any functionality.
Also applies to: 33-33, 50-50, 66-66, 86-86, 92-92, 98-98, 118-118, 125-127
python/src/server/services/crawling/strategies/batch.py (1)
7-8: Import migration from `typing` to `collections.abc` is correct. Moving `Callable` from `typing` to `collections.abc` follows Python best practices and maintains consistency with the broader pattern across crawling modules in this PR.
python/src/server/services/crawling/strategies/recursive.py (1)
7-8: LGTM: Import modernization successfully completed. The migration from `typing` to `collections.abc` for the `Callable` import follows Python best practices and aligns with the project's modernization effort. The import change is non-breaking and maintains the same runtime behavior.
python/src/server/services/crawling/strategies/sitemap.py (7)
18-18: LGTM: Minor formatting cleanup. The whitespace adjustments improve code consistency without affecting functionality.
30-30: LGTM: Whitespace cleanup maintained. Consistent formatting improvements throughout the method.
34-34: LGTM: Formatting consistency maintained. The whitespace adjustments continue to improve code readability.
38-38: LGTM: Exception handling formatting improved. The whitespace adjustments around exception blocks enhance readability.
43-43: LGTM: Consistent formatting applied. The formatting improvements are consistent throughout the method.
48-48: LGTM: Exception handling formatting completed. The whitespace adjustments complete the formatting improvements for this exception block.
54-55: LGTM: Method completion formatting and EOF correction. The whitespace adjustments and trailing newline correction complete the formatting improvements for this file.
python/src/server/services/crawling/helpers/site_config.py (11)
15-15: LGTM: Class docstring formatting improved. The whitespace adjustment improves consistency with the rest of the codebase.
20-20: LGTM: CODE_BLOCK_SELECTORS formatting maintained. The formatting improvements continue consistently through the configuration data.
22-22: LGTM: Continued formatting consistency. The whitespace adjustments maintain readability throughout the selector list.
27-27: LGTM: Selector grouping formatting improved. The formatting adjustments help maintain logical grouping of related selectors.
32-32: LGTM: Framework-specific selector formatting. The whitespace adjustments maintain consistency in the framework-specific selector sections.
35-35: LGTM: Highlight.js section formatting. The formatting improvements continue through the highlight.js selector section.
41-41: LGTM: Shiki section formatting maintained. The whitespace adjustments maintain consistency in the Shiki framework section.
47-47: LGTM: Generic patterns section completed. The formatting improvements complete the CODE_BLOCK_SELECTORS section.
72-72: LGTM: Method separator formatting. The whitespace adjustment improves separation between static methods.
75-75: LGTM: Method completion formatting. The formatting adjustment maintains consistency in method structure.
98-98: LGTM: File completion and EOF correction. The formatting improvement and trailing newline addition complete the file formatting cleanup.
python/src/server/services/crawling/crawling_service.py (2)
11-12: LGTM: Import modernization successfully completed. The migration of `Callable` and `Awaitable` from `typing` to `collections.abc` follows Python 3.9+ best practices and maintains backward compatibility. This change is part of the broader effort to modernize typing imports across the codebase.
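A sketch of what the migrated annotations look like in use (the function and callback type are hypothetical, not the service's real signatures): `Callable` and `Awaitable` from `collections.abc` compose the same async-callback type the `typing` imports did:

```python
import asyncio
from collections.abc import Awaitable, Callable  # previously imported from typing

# An async progress callback type: invoked with (message, percent) and awaited.
ProgressCallback = Callable[[str, int], Awaitable[None]]


async def run_with_progress(steps: list[str], callback: ProgressCallback) -> None:
    for i, step in enumerate(steps, start=1):
        await callback(step, int(i * 100 / len(steps)))


events: list[tuple[str, int]] = []


async def record(message: str, percent: int) -> None:
    events.append((message, percent))


asyncio.run(run_with_progress(["fetch", "parse"], record))
print(events)  # [('fetch', 50), ('parse', 100)]
```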
562-562: LGTM: Minor formatting improvement. The whitespace adjustment improves code readability without affecting functionality.
python/src/server/services/crawling/strategies/single_page.py (33)
8-9: LGTM: Import modernization completed successfully. The migration from `typing.Callable` to `collections.abc.Callable` follows Python best practices and aligns with the project-wide import modernization effort.
19-19: LGTM: Class docstring formatting improved. The whitespace adjustment maintains consistency with the codebase formatting standards.
30-30: LGTM: Method separator formatting maintained. The formatting improvement enhances code readability.
34-34: LGTM: Method implementation formatting improved. The whitespace adjustment maintains consistency within the method structure.
55-55: LGTM: Method completion formatting maintained. The formatting adjustment completes the consistency improvements for this method.
78-78: LGTM: Variable initialization formatting improved. The whitespace adjustment maintains consistency in variable initialization blocks.
80-80: LGTM: Loop initialization formatting maintained. The formatting improvement continues the consistency throughout the retry logic.
89-89: LGTM: Error handling formatting improved. The whitespace adjustment enhances readability of the error handling block.
92-92: LGTM: Cache mode configuration formatting maintained. The formatting improvement maintains consistency in the cache mode selection logic.
95-95: LGTM: Documentation site detection formatting improved. The whitespace adjustment maintains consistency in the conditional logic flow.
100-100: LGTM: Enhanced documentation site handling with expanded framework support. The improvements to documentation site handling are well-implemented:
- Expanded framework support: Added support for vitepress, gitbook, mkdocs, docsify, copilotkit, and milkdown
- Optimized configuration: Enhanced crawl settings with appropriate timeouts, delays, and content loading strategies
- Better wait selectors: Framework-specific selectors ensure proper content loading
The configuration changes strike a good balance between performance and completeness.
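The framework-specific wait-selector idea can be sketched like this — the framework keys and CSS selectors below are illustrative placeholders, not the values actually shipped in `site_config.py`:

```python
# Hypothetical mapping: wait for a framework's main content container
# before extracting markdown, falling back to <body> for unknown sites.
WAIT_SELECTORS: dict[str, str] = {
    "vitepress": ".content-container",  # placeholder selector
    "mkdocs": ".md-content",            # placeholder selector
    "docsify": "#main",                 # placeholder selector
}


def wait_selector_for(url: str, default: str = "body") -> str:
    lowered = url.lower()
    for framework, selector in WAIT_SELECTORS.items():
        if framework in lowered:
            return selector
    return default


print(wait_selector_for("https://docs.example.com/mkdocs-site/"))  # .md-content
print(wait_selector_for("https://plain.example.com/"))             # body
```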
135-135: LGTM: Configuration completion formatting maintained. The formatting adjustment completes the crawl configuration setup consistently.
138-138: LGTM: Logging configuration formatting improved. The whitespace adjustment maintains consistency in the logging statements.
147-147: LGTM: Exception handling formatting maintained. The formatting improvement enhances readability of the exception handling logic.
151-151: LGTM: Error condition handling formatting improved. The whitespace adjustment maintains consistency in the error condition checking.
156-156: LGTM: Retry logic formatting maintained. The formatting improvement continues the consistency through the retry logic.
161-161: LGTM: Content validation formatting improved. The whitespace adjustment maintains consistency in the content validation logic.
165-165: LGTM: Retry continuation formatting maintained. The formatting improvement enhances readability of the retry flow control.
171-171: LGTM: Debug logging formatting improved. The whitespace adjustment maintains consistency in the debug logging section.
173-173: LGTM: Result logging formatting maintained. The formatting improvement continues the consistency in the logging statements.
177-177: LGTM: Conditional logging formatting improved. The whitespace adjustment maintains consistency in the conditional logging blocks.
180-180: LGTM: Debug logging continuation formatting maintained. The formatting improvement completes the consistency in the debug logging section.
190-190: LGTM: Success return formatting improved. The whitespace adjustment maintains consistency in the return statement structure.
198-198: LGTM: Exception handling completion formatting maintained. The formatting improvement enhances readability of the exception handling completion.
202-202: LGTM: Retry backoff formatting improved. The whitespace adjustment maintains consistency in the retry backoff logic.
208-208: LGTM: Method completion formatting maintained. The formatting improvement completes the consistency for this method.
235-235: LGTM: Progress reporting implementation with proper formatting. The addition of progress reporting to `crawl_markdown_file` is well-implemented:
- Consistent API: Follows the same progress callback pattern used in other strategies
- Proper error handling: Maintains existing error handling while adding progress updates
- Clear progress flow: Reports start and completion states appropriately
The progress reporting integration aligns well with the broader crawling service architecture.
241-241: LGTM: Progress helper implementation formatting maintained. The formatting of the progress helper function maintains consistency with similar implementations in other strategy files.
244-244: LGTM: Progress reporting usage formatting improved. The whitespace adjustment maintains consistency in the progress reporting calls.
250-250: LGTM: Configuration setup formatting maintained. The formatting improvement enhances readability of the crawl configuration setup.
253-253: LGTM: Success logging formatting improved. The whitespace adjustment maintains consistency in the success logging statements.
257-257: LGTM: Progress completion formatting maintained. The formatting improvement completes the consistency in the progress reporting flow.
265-265: LGTM: Method completion and error handling formatting improved. The whitespace adjustment and method completion maintain consistency with the overall formatting improvements throughout the file.
python/src/server/services/crawling/document_storage_operations.py (3)
271-301: LGTM: returning the result prevents silent drops. Good to expose the number of code examples stored. No issues spotted.
238-251: The script will print out the path to `document_storage_operations.py`, its import statements, and the first 200 lines so we can locate `update_source_info` usage and confirm whether it uses `word_count` or `total_word_count` fields.
71-73: Ignore boolean-check refactor; `cancellation_check` raises exceptions. The `cancellation_check` callback (e.g. `check_upload_cancellation` in python/src/server/api_routes/knowledge_api.py:590–594) is defined to raise `asyncio.CancelledError` when a cancellation is detected, not to return a boolean. The existing calls:
- python/src/server/services/crawling/document_storage_operations.py lines 71–73
- python/src/server/services/crawling/document_storage_operations.py lines 93–95
are correct: invoking the callback will either raise to abort processing or return `None` to continue. No boolean check or additional handling is needed.
Likely an incorrect or invalid review comment.
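That raise-to-abort contract can be sketched as follows (the helper names are illustrative; the real callback in the codebase is `check_upload_cancellation`):

```python
import asyncio


def make_cancellation_check(is_cancelled):
    """Build a callback that raises instead of returning a boolean."""

    async def cancellation_check() -> None:
        if is_cancelled():
            raise asyncio.CancelledError("operation cancelled")

    return cancellation_check


async def process(chunks, cancellation_check):
    stored = []
    for chunk in chunks:
        # No boolean test: the await either raises (abort) or returns None (continue).
        await cancellation_check()
        stored.append(chunk)
    return stored


print(asyncio.run(process(["a", "b"], make_cancellation_check(lambda: False))))  # ['a', 'b']
```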
python/src/server/services/crawling/helpers/__init__.py (1)
10-13: LGTM: trailing newline and export list intact. Exports remain stable; formatting improvement only.
```python
# Log chunking results
safe_logfire_info(f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents)/len(crawl_results):.1f}")
```
Guard against division by zero in avg_chunks_per_doc logging
When `crawl_results` is empty, this will raise `ZeroDivisionError`.
```diff
- safe_logfire_info(f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents)/len(crawl_results):.1f}")
+ doc_count = len(crawl_results)
+ avg = (len(all_contents) / doc_count) if doc_count else 0.0
+ safe_logfire_info(
+     f"Document storage | documents={doc_count} | chunks={len(all_contents)} | avg_chunks_per_doc={avg:.1f}"
+ )
```
+ )📝 Committable suggestion
```python
safe_logfire_info(f"url_to_full_document keys: {list(url_to_full_document.keys())[:5]}")
# Log chunking results
doc_count = len(crawl_results)
avg = (len(all_contents) / doc_count) if doc_count else 0.0
safe_logfire_info(
    f"Document storage | documents={doc_count} | chunks={len(all_contents)} | avg_chunks_per_doc={avg:.1f}"
)
```
🤖 Prompt for AI Agents
In python/src/server/services/crawling/document_storage_operations.py around
lines 133 to 137, the logging expression computes avg_chunks_per_doc by dividing
len(all_contents) by len(crawl_results) which will raise a ZeroDivisionError
when crawl_results is empty; fix by computing a safe denominator (e.g., num_docs
= len(crawl_results)) and only perform the division if num_docs > 0, otherwise
use 0 or "N/A" for the average; update the safe_logfire_info call to include the
computed average variable (or the fallback) instead of performing the raw
division inline.
* fix: stop copying .archon defaults into target repo during setup The setup skill was copying all bundled default commands (~26 files) and workflows (~8 files) into the target repo's .archon/ directory. This caused the first PR in a repo to include all the Archon scaffolding mixed in with actual feature changes, making PRs noisy and hard to review. Defaults are already loaded at runtime from the Archon installation — the physical copy was only for inspection. Remove the copy step and instead tell users where to find defaults and how to override a specific one by copying just that file into their repo with the same filename. Also add .claude/skills/ to .gitignore in the target repo so the skill copy doesn't pollute PRs either. Closes #462 * remove .gitignore modification from setup skill Don't modify the user's .gitignore — they may have other skills they want tracked in git. * make skill copy optional and don't touch .gitignore Let the user choose whether to copy the Archon skill into their target repo. Never modify their .gitignore — that's their call.
Summary
Changes Applied
- W293, W292: fixed whitespace in blank lines and at EOF
- F401: removed unused imports
- UP035: updated deprecated typing imports (`Dict`, `List` → `dict`, `list`)
- SIM108: simplified if-else blocks to ternary operators
- C408: simplified unnecessary `dict()` calls
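As a before/after sketch of the UP035 fix (the function is illustrative, not taken from the codebase):

```python
# Before (flagged by Ruff UP035/UP006):
#     from typing import Dict, List
#     def tally(items: List[str]) -> Dict[str, int]: ...
#
# After — builtin generics, valid on Python 3.9+ with no typing import:
def tally(items: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts


print(tally(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```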
44 errors remain that require manual review, primarily F841 unused variables whose assigned function calls may have side effects.
These were intentionally not auto-fixed to avoid breaking functionality.
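Why F841 needs human eyes, in a contrived sketch (all names hypothetical): the unused binding can hide a call with side effects, so the safe manual fix drops only the assignment, not the call:

```python
audit_log: list[str] = []


def register_source(name: str) -> dict:
    audit_log.append(name)  # side effect an auto-fixer must not delete
    return {"name": name, "id": len(audit_log)}


def setup_sources() -> None:
    result = register_source("docs")  # F841: `result` is never used
    # Safe manual fix for the line above: keep the call, drop only the binding.
    register_source("wiki")


setup_sources()
print(audit_log)  # ['docs', 'wiki']
```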
Test Plan
🤖 Generated with Claude Code