
Fix race condition in concurrent crawling with unique source IDs#472

Merged
Wirasm merged 19 commits into main from fix/source-id-architecture-refactor
Aug 29, 2025

Conversation

@Wirasm (Collaborator) commented Aug 25, 2025

Pull Request

Summary

Fixes a critical race condition where concurrent crawls to the same domain (especially GitHub) would conflict due to using domain names as source IDs. Implements a clean architecture that separates unique identification from display concerns.

Changes Made

  • Added unique hash-based source ID generation using SHA256 (16 characters) to prevent conflicts
  • Introduced three-field architecture for sources:
    • source_id: Unique hash for database primary key
    • source_url: Original URL for tracking
    • source_display_name: Human-friendly name for UI display
  • Added URLHandler methods for ID generation and display name extraction
  • Updated crawling pipeline to use new ID generation throughout
  • Created database migration to add new columns with backward compatibility
  • Optimized to use display name as title, eliminating unnecessary AI calls
  • Added comprehensive test suite with 16 tests validating the fix
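The ID scheme described above can be sketched as follows (a minimal approximation; the actual URLHandler method may canonicalize the URL before hashing):

```python
import hashlib

def generate_unique_source_id(url: str) -> str:
    """Return a stable 16-character ID derived from the full URL.

    Sketch of the approach above: hashing the whole URL, not just the
    domain, gives concurrent crawls of different repos distinct IDs.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

id1 = generate_unique_source_id("https://github.com/owner1/repo1")
id2 = generate_unique_source_id("https://github.com/owner2/repo2")
assert id1 != id2 and len(id1) == 16  # no more shared "github.com" key
```

Both URLs previously collapsed to the same `github.com` key; with this scheme each gets its own 64-bit identifier.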

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

# Ran comprehensive test suite
uv run pytest python/tests/test_source_id_refactor.py -v
# Result: 16 tests passed

# Verified in production database
# - Crawled multiple GitHub repos concurrently
# - Confirmed unique source_ids generated
# - Verified source_url and source_display_name populated correctly

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

None - the migration provides backward compatibility by:

  • Adding new columns without removing existing ones
  • Backfilling null values with source_id for compatibility
  • Maintaining existing source_id format for old records

Additional Notes

Problem Solved

Previously, concurrent crawls to github.com/owner1/repo1 and github.com/owner2/repo2 would both try to use github.com as the source_id, causing foreign key violations and data corruption.

Solution Benefits

  1. No more race conditions: Each URL gets a unique hash-based ID
  2. Better UI display: Friendly names like "GitHub - microsoft/typescript"
  3. Performance improvement: Skip AI title generation when display name available
  4. Clean separation of concerns: ID for uniqueness, URL for tracking, display name for UI
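A rough sketch of the display-name idea behind benefit 2 (hypothetical code; the real URLHandler.extract_display_name covers more patterns such as docs sites and API subdomains):

```python
from urllib.parse import urlparse

def extract_display_name(url: str) -> str:
    """GitHub repos become 'GitHub - owner/repo'; everything else falls
    back to the capitalized second-level domain. Simplified sketch."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    parts = [p for p in parsed.path.split("/") if p]
    if host == "github.com" and len(parts) >= 2:
        return f"GitHub - {parts[0]}/{parts[1].removesuffix('.git')}"
    return host.split(".")[0].capitalize()

print(extract_display_name("https://github.com/microsoft/typescript"))
# → GitHub - microsoft/typescript
```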

Migration Instructions

Run the migration script after deployment:

-- migration/add_source_url_display_name.sql
ALTER TABLE archon_sources 
ADD COLUMN IF NOT EXISTS source_url TEXT,
ADD COLUMN IF NOT EXISTS source_display_name TEXT;

UPDATE archon_sources 
SET source_url = source_id,
    source_display_name = source_id
WHERE source_url IS NULL;

Fixes the race condition issue reported in multiple crawling scenarios.

Summary by CodeRabbit

  • New Features

    • Sources now store original URL and a human-friendly display name; source IDs are stable, unique hashes.
    • UI displays a migration banner when a backend migration is required; health hook polls status.
  • Performance

    • Indexes and batched uploads improve lookup and storage throughput.
  • Improvements

    • URL/display-name preserved through processing, title generation, and code extraction; clearer startup/migration guidance and improved metrics/logging.
  • Tests

    • Expanded coverage for ID hashing, display-name extraction, canonicalization, concurrency, code extraction, and storage metrics.
  • Chores

    • Database schema extended and existing sources backfilled.

Wirasm added 3 commits August 25, 2025 11:57
- Add unique hash-based source_id generation to prevent conflicts
- Separate source identification from display with three fields:
  - source_id: 16-char SHA256 hash for unique identification
  - source_url: Original URL for tracking
  - source_display_name: Human-friendly name for UI
- Add comprehensive test suite validating the fix
- Migrate existing data with backward compatibility
- Pass source_display_name to title generation function
- Use display name in AI prompt instead of hash-based source_id
- Results in more specific, meaningful titles for each source
- Use source_display_name directly as title to avoid unnecessary AI calls
- More efficient and predictable than AI-generated titles
- Keep AI generation only as fallback for backward compatibility
@coderabbitai Bot commented Aug 25, 2025

Walkthrough

Adds deterministic 16‑char hashed source IDs and human-friendly source_display_name; threads source_url and source_display_name from crawler → document storage → code extraction; updates DB schema with new columns, indexes, and backfill; extends URL helpers; adds UI migration banner and health-layer migration checks; and introduces extensive tests for canonicalization, propagation, and concurrency.

Changes

Cohort / File(s) Summary
DB migrations
migration/add_source_url_display_name.sql, migration/complete_setup.sql
Add source_url and source_display_name TEXT columns to archon_sources, create indexes (idx_archon_sources_url, idx_archon_sources_display_name), add column comments, and backfill NULLs from source_id.
Crawling orchestration
python/src/server/services/crawling/crawling_service.py
Generate source_id via URLHandler.generate_unique_source_id(url) and source_display_name via URLHandler.extract_display_name(url); log and pass source_url and source_display_name into document storage; forward storage_results["source_id"] downstream.
Document storage pipeline
python/src/server/services/crawling/document_storage_operations.py
process_and_store_documents and _create_source_records accept/propagate source_url and source_display_name; reuse self.doc_storage_service; use per-document doc_url as canonical keys; batch add documents (size 25); aggregate content for source summaries, run async thread-based summarization with fallbacks; call update_source_info and fallback upsert including new fields; ensure source records exist before storing; propagate source_id into code extraction.
URL utilities
python/src/server/services/crawling/helpers/url_handler.py
Add generate_unique_source_id(url) (SHA‑256, first 16 hex) and extract_display_name(url) with specialized formatting for GitHub/docs/API patterns; ensure transform_github_url returns original URL when unmatched; expand binary extension list and normalize strings.
Source management
python/src/server/services/source_management_service.py
generate_source_title_and_metadata gains source_display_name; update_source_info accepts optional source_url and source_display_name; when provided use display name as title; include update_frequency, preserve original_url, and conditionally include new fields in insert/update payloads.
Code extraction
python/src/server/services/crawling/code_extraction_service.py
extract_and_store_code_examples and _extract_code_blocks_from_documents accept source_id and attach it to extracted code blocks; remove per-document urlparse-derived IDs and the urlparse import.
Health & main
python/src/server/main.py, python/src/server/api_routes/knowledge_api.py
Add _check_database_schema() and lightweight cache to detect missing schema (source_url/source_display_name); health endpoint returns migration_required payload and migration_instructions when schema invalid; knowledge_health integrates the schema check early.
UI: migration banner
archon-ui-main/src/components/ui/MigrationBanner.tsx, archon-ui-main/src/hooks/useMigrationStatus.ts, archon-ui-main/src/App.tsx, archon-ui-main/src/components/BackendStartupError.tsx
Add MigrationBanner component and useMigrationStatus hook; surface banner in App when health indicates migration_required; adjust backend startup error messaging to list common issues and add instructions.
Tests
python/tests/test_source_id_refactor.py, python/tests/test_code_extraction_source_id.py, python/tests/test_source_url_shadowing.py, python/tests/test_document_storage_metrics.py, python/tests/test_url_canonicalization.py, python/tests/test_async_source_summary.py, python/tests/test_source_race_condition.py
Add tests for deterministic/unique 16‑char hash IDs and canonicalization, display-name extraction, propagation of source_id into code extraction (no domain IDs), preservation of source_url (not shadowed), document storage metrics, async/threaded source summarization with fallbacks, and race-condition handling in source creation; validate absence of urlparse import where expected.
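A condensed version of the determinism/uniqueness checks listed above (assuming the hash scheme described in the walkthrough; the real suite lives in python/tests/test_source_id_refactor.py):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def make_id(url: str) -> str:
    # Assumed stand-in for URLHandler.generate_unique_source_id.
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def test_ids_are_deterministic_and_unique():
    urls = [f"https://github.com/org/repo{i}" for i in range(100)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        ids = list(pool.map(make_id, urls))
    assert len(set(ids)) == len(urls)         # unique per URL
    assert ids == [make_id(u) for u in urls]  # deterministic under concurrency
```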

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CrawlSvc as CrawlService
  participant URLHandler as URLHandler
  participant DocStorage as DocumentStorageOps
  participant CodeExtract as CodeExtractionSvc
  participant SourceMgmt as SourceMgmtSvc
  participant DB as DB (archon_sources)

  User->>CrawlSvc: start_crawl(url)
  CrawlSvc->>URLHandler: generate_unique_source_id(url)
  URLHandler-->>CrawlSvc: source_id (16‑hex)
  CrawlSvc->>URLHandler: extract_display_name(url)
  URLHandler-->>CrawlSvc: source_display_name
  CrawlSvc->>DocStorage: process_and_store_documents(..., original_source_id=source_id, source_url=url, source_display_name)
  DocStorage->>DocStorage: chunk docs, build url_to_full_document (keys = doc_url)
  DocStorage->>CodeExtract: extract_and_store_code_examples(crawl_results, url_to_full_document, source_id)
  CodeExtract-->>DocStorage: code examples stored (blocks tagged with source_id)
  DocStorage->>SourceMgmt: update_source_info(source_id, summary, word_count, original_url, source_url, source_display_name)
  alt existing source
    SourceMgmt->>DB: UPDATE archon_sources (include source_url/display_name)
  else new/fallback
    SourceMgmt->>DB: INSERT/UPSERT archon_sources (include source_url/display_name)
  end
  DB-->>SourceMgmt: OK
  SourceMgmt-->>DocStorage: confirmed
  DocStorage-->>CrawlSvc: completed
  CrawlSvc-->>User: crawl complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • coleam00

Poem

I hop through links and hash each one,
Sixteen hex carrots fresh in the sun.
I name the source so humans may see,
Backfill the past, keep IDs conflict‑free.
Indexes set — tests nibble, pass, and run. 🥕✨


@coderabbitai Bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/source_management_service.py (1)

8-16: NameError risk: os is used before import.

_get_model_choice uses os.getenv but os isn’t imported at module scope. This will crash early.

 from typing import Any
+import os
🧹 Nitpick comments (19)
migration/add_source_url_display_name.sql (2)

17-19: Index choices are fine; consider a case-insensitive index for display lookups.

The plain btree indexes will help equality filters. If the UI often performs case-insensitive searches (ILIKE) on source_display_name, add a functional index to avoid sequential scans.

Apply this additional DDL (outside the changed hunk) if case-insensitive lookups are common:

CREATE INDEX IF NOT EXISTS idx_archon_sources_display_name_lower
  ON archon_sources (lower(source_display_name));

35-36: 16-hex hash is probably sufficient; consider 20+ hex chars or per-URL uniqueness guard for extra safety.

64-bit IDs have extremely low collision probability, but widening to 80–96 bits (20–24 hex chars) further reduces risk with negligible storage cost. Alternatively, enforce uniqueness on source_url to protect against theoretical collisions and accidental canonicalization drift.

Would you like me to propose a safe widening plan (forward-compatible code + non-blocking online migration)?
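The collision claim above can be sanity-checked with the birthday-bound approximation p ≈ 1 − exp(−n²/2N), where N = 2^bits:

```python
import math

def collision_probability(n_sources: int, bits: int) -> float:
    # Birthday bound: p ≈ 1 - exp(-n^2 / (2 * 2^bits)).
    return 1.0 - math.exp(-(n_sources ** 2) / (2.0 * 2.0 ** bits))

# 16 hex chars = 64 bits; even a million sources collide with
# probability on the order of 1e-8:
print(collision_probability(1_000_000, 64))
# 24 hex chars = 96 bits, the widening suggested above:
print(collision_probability(1_000_000, 96))
```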

migration/complete_setup.sql (1)

190-193: Tighten the column comment to reflect non-URL sources.

source_id is a 16-char SHA-256 prefix for URL sources, but can be file_* for file sources. Clarify to avoid future confusion.

Apply:

-COMMENT ON COLUMN archon_sources.source_id IS 'Unique hash identifier for the source (16-char SHA256 hash of URL)';
+COMMENT ON COLUMN archon_sources.source_id IS 'Identifier: 16-char SHA-256 prefix for URL sources, or file_* for file uploads';
python/src/server/services/crawling/helpers/url_handler.py (2)

244-247: Consistent brand capitalization: GitHub.

Minor UX polish: use “GitHub API,” not “Github API.”

-                    return "Github API"
+                    return "GitHub API"

100-106: Consider treating .svg as crawlable text.

SVG is XML; excluding it as binary skips potentially useful diagrams/code examples. If bandwidth isn’t a concern, allow SVG.

-                ".svg",

If you prefer keeping it excluded globally, consider a site-config override to include SVG only for documentation domains.

python/src/server/services/source_management_service.py (6)

18-33: Preserve stack traces for diagnostics.

Log warnings about model selection with full context.

-    except Exception as e:
-        logger.warning(f"Error getting model choice: {e}, using default")
+    except Exception as e:
+        logger.warning(f"Error getting model choice: {e}, using default", exc_info=True)

43-51: Docstring still mentions “domain”; update to new ID semantics.

Align docs with the 16-char SHA-256 prefix and file_* IDs to reduce confusion.

-        source_id: The source ID (domain)
+        source_id: The source ID (16-char SHA-256 prefix for URL sources, or file_* for file uploads)

101-104: Add exc_info to error logs around LLM calls.

This improves debuggability in production without changing behavior.

-            search_logger.error(f"Failed to create LLM client fallback: {e}")
+            search_logger.error(f"Failed to create LLM client fallback: {e}", exc_info=True)
@@
-            search_logger.error(f"Empty or invalid response from LLM for {source_id}")
+            search_logger.error(f"Empty or invalid response from LLM for {source_id}", exc_info=True)
@@
-        search_logger.error(
-            f"Error generating summary with LLM for {source_id}: {e}. Using default summary."
-        )
+        search_logger.error(
+            f"Error generating summary with LLM for {source_id}: {e}. Using default summary.",
+            exc_info=True,
+        )

Also applies to: 118-125, 135-139


148-149: Default title should prefer display name when provided.

Avoid surfacing the hash as a title if content is short and a display name exists.

-    # Default title is the source ID
-    title = source_id
+    # Default title prefers human-readable display name
+    title = source_display_name or source_id

Also applies to: 163-167


365-366: Bubble up full stack traces on failure.

Include exc_info to aid postmortems; re-raise is good.

-        search_logger.error(f"Error updating source {source_id}: {e}")
+        search_logger.error(f"Error updating source {source_id}: {e}", exc_info=True)

572-585: Consider retries with backoff for external LLM calls.

Guidelines recommend retries for dependency calls. A lightweight tenacity retry on client.chat.completions.create would improve resilience without complicating the flow. Happy to draft this if desired.

python/src/server/services/crawling/crawling_service.py (2)

307-313: Sanitize URLs in logs; IDs and display names look good.

Generate IDs/display names here (nice), but avoid logging raw query strings/tokens. Log a redacted URL (strip ?query and #fragment).

-            safe_logfire_info(
-                f"Generated unique source_id '{original_source_id}' and display name '{source_display_name}' from URL '{url}'"
-            )
+            redacted = url.split("?", 1)[0].split("#", 1)[0]
+            safe_logfire_info(
+                f"Generated unique source_id '{original_source_id}' and display name '{source_display_name}' from URL '{redacted}'"
+            )

446-456: Include display name in completion payload for immediate UI updates.

Frontends can show the friendly name without extra fetch.

             await complete_crawl_progress(
                 task_id,
                 {
                     "chunks_stored": storage_results["chunk_count"],
                     "code_examples_found": code_examples_count,
                     "processed_pages": len(crawl_results),
                     "total_pages": len(crawl_results),
                     "sourceId": storage_results.get("source_id", ""),
+                    "sourceDisplayName": source_display_name,
                     "log": "Crawl completed successfully!",
                 },
             )
python/tests/test_source_id_refactor.py (2)

8-11: Remove unused imports to satisfy Ruff and keep tests clean.

pytest, Mock, patch, and hashlib are not used.

-import pytest
-from unittest.mock import Mock, patch
-import hashlib
 from concurrent.futures import ThreadPoolExecutor
 import time

204-220: Edge-case handling looks solid; consider one more case to avoid false API positives.

Current URLHandler logic treats any path containing "/api" as an API. This can misclassify paths like "/apiary". Consider tightening the check (e.g., segment match r'(^|/)api(/|$)') and adding a test like "https://example.com/apiary" → "Example".

Would you like me to propose a patch for URLHandler to use a segment-aware check and add the corresponding test?
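The segment-aware check suggested here could look like:

```python
import re

# Match "api" only as a complete path segment, so paths like
# "/apiary" are not misclassified as API endpoints.
API_SEGMENT = re.compile(r"(^|/)api(/|$)")

def looks_like_api_path(path: str) -> bool:
    return bool(API_SEGMENT.search(path))

assert looks_like_api_path("/api/v2/users")
assert looks_like_api_path("/v1/api")
assert not looks_like_api_path("/apiary")
```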

python/src/server/services/crawling/document_storage_operations.py (4)

67-68: Reuse the initialized DocumentStorageService instead of creating a new instance.

self.doc_storage_service is already constructed in init. Reuse it to keep configuration consistent and reduce allocations.

-        # Initialize storage service for chunking
-        storage_service = DocumentStorageService(self.supabase_client)
+        # Reuse initialized storage service for chunking
+        storage_service = self.doc_storage_service
@@
-            chunks = storage_service.smart_chunk_text(markdown_content, chunk_size=5000)
+            chunks = storage_service.smart_chunk_text(markdown_content, chunk_size=5000)

Also applies to: 93-94


10-18: Remove unused imports (Ruff/Mypy hygiene).

urlparse, generate_code_summaries_batch, and add_code_examples_to_supabase are not used in this module.

-from urllib.parse import urlparse
@@
-from ..storage.code_storage_service import (
-    generate_code_summaries_batch,
-    add_code_examples_to_supabase,
-)
+# (removed unused code_storage_service imports)

110-123: Optional: include display metadata per chunk for better traceability.

If downstream analytics benefit from it, consider adding source_display_name to each chunk’s metadata (it’s currently only stored at the source level). This helps with ad-hoc queries without joining back to sources.

Example:

                 metadata = {
                     "url": source_url,
                     "title": doc.get("title", ""),
                     "description": doc.get("description", ""),
                     "source_id": source_id,
+                    "source_display_name": source_display_name,
                     "knowledge_type": request.get("knowledge_type", "documentation"),
                     "crawl_type": crawl_type,
                     "word_count": word_count,

220-224: Nit: avoid leading space and repeated string concatenation when building combined content.

Use a list-append + join pattern; it’s cleaner and avoids a leading space for the first chunk.

-            combined_content = ""
-            for chunk in source_contents[:3]:  # First 3 chunks for this source
-                if len(combined_content) + len(chunk) < 15000:
-                    combined_content += " " + chunk
-                else:
-                    break
+            combined_parts = []
+            total_len = 0
+            for chunk in source_contents[:3]:  # First 3 chunks for this source
+                if total_len + len(chunk) < 15000:
+                    combined_parts.append(chunk)
+                    total_len += len(chunk)
+                else:
+                    break
+            combined_content = " ".join(combined_parts)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6a1b030 and 40abaf9.

📒 Files selected for processing (7)
  • migration/add_source_url_display_name.sql (1 hunks)
  • migration/complete_setup.sql (2 hunks)
  • python/src/server/services/crawling/crawling_service.py (3 hunks)
  • python/src/server/services/crawling/document_storage_operations.py (5 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
  • python/src/server/services/source_management_service.py (6 hunks)
  • python/tests/test_source_id_refactor.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/tests/test_source_id_refactor.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/tests/test_source_id_refactor.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_source_id_refactor.py
🧬 Code graph analysis (3)
python/tests/test_source_id_refactor.py (1)
python/src/server/services/crawling/helpers/url_handler.py (3)
  • URLHandler (16-323)
  • generate_unique_source_id (186-210)
  • extract_display_name (213-323)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • generate_unique_source_id (186-210)
  • extract_display_name (213-323)
python/src/server/config/logfire_config.py (1)
  • safe_logfire_info (223-235)
python/src/server/services/crawling/document_storage_operations.py (4)
python/src/server/services/storage/code_storage_service.py (1)
  • add_code_examples_to_supabase (740-969)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (223-235)
  • safe_logfire_error (238-250)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (35-139)
  • update_source_info (249-366)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (10)
migration/add_source_url_display_name.sql (3)

13-15: Columns addition is sound and backward-compatible.

Using ADD COLUMN IF NOT EXISTS keeps the migration idempotent, and TEXT fits the intended variability of URLs and names. LGTM.


25-33: Backfill is correct and selective.

COALESCE with a WHERE filter avoids unnecessary writes and preserves existing non-null data. LGTM.


17-19: No action needed: complete_setup.sql duplication of DDL is intentional
The complete_setup.sql file serves as the full, from-scratch schema definition and must include every column, index, and comment that individual migrations introduce. Duplication between it and incremental migration scripts (like add_source_url_display_name.sql) is by design—so new environments can be bootstrapped in one step—while existing databases follow the ordered migrations.

• File migration/complete_setup.sql (lines 171–185): defines source_url, source_display_name, and their indexes in the initial schema
• File migration/add_source_url_display_name.sql (lines 14–19): incrementally adds the same columns and indexes for already-running databases

Since this pattern is intentional and guarded with IF NOT EXISTS, no changes are required here.

Likely an incorrect or invalid review comment.

migration/complete_setup.sql (1)

185-187: Indexes align with usage.

Creating indexes on source_url and source_display_name here keeps initial setup and migrations consistent. LGTM.

python/src/server/services/source_management_service.py (1)

314-323: Conditional inclusion of new fields is correct; LGTM.

Passing source_url and source_display_name only when provided keeps updates minimal and inserts clean.

Also applies to: 347-362

python/src/server/services/crawling/crawling_service.py (1)

385-393: Propagation of source_url and source_display_name is correct.

These args ensure end-to-end availability for storage and UI. LGTM.

python/tests/test_source_id_refactor.py (2)

141-156: LGTM on GitHub display-name coverage.

Good coverage for owner/repo, owner-only, and root cases, including .git stripping.


90-115: Concurrency simulation is concise and adequate.

ThreadPool-based stress ensures IDs remain unique without race conditions.
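
The referenced concurrency simulation has roughly this shape (a self-contained sketch, assuming the hash-based ID scheme described in this PR, not the project's actual test code):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def generate_id(url: str) -> str:
    # Deterministic SHA256-based ID: no shared mutable state, so it is
    # inherently safe to call from many threads at once
    return hashlib.sha256(url.lower().encode()).hexdigest()[:16]

urls = [f"https://example.com/page/{i}" for i in range(1000)]
with ThreadPoolExecutor(max_workers=16) as pool:
    ids = list(pool.map(generate_id, urls))

# Every distinct URL must map to a distinct ID, even under concurrency
assert len(set(ids)) == len(urls)
```

Because the ID is a pure function of the URL, the threads cannot race on shared state; the stress test mainly guards against accidental reintroduction of mutable globals.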

python/src/server/services/crawling/document_storage_operations.py (2)

237-255: Source creation flow is robust and in line with guidelines.

Creating/updating source records before chunk storage prevents FK issues; logging and fallback upsert paths are clear. Passing source_url and source_display_name into update_source_info aligns with the 3-field architecture.


293-315: Good verification step before bulk storage.

Explicit DB existence checks with contextual error messages catch schema/migration issues early.

Comment on lines +173 to 175
source_url TEXT,
source_display_name TEXT,
summary TEXT,

🛠️ Refactor suggestion

Schema extension is correct; add an updated_at trigger for archon_sources.

You already use update_updated_at_column() for other tables; archon_sources lacks that trigger. Today, the app attempts to set updated_at to "now()" (string), which will store a literal string unless a trigger updates it.

Add this DDL (outside this hunk):

CREATE OR REPLACE TRIGGER update_archon_sources_updated_at
  BEFORE UPDATE ON archon_sources
  FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

Then remove manual "updated_at": "now()" writes in app code (see review in source_management_service.py).

🤖 Prompt for AI Agents
In migration/complete_setup.sql around lines 173 to 175, the archon_sources
table is missing the update_updated_at trigger; add a CREATE OR REPLACE TRIGGER
update_archon_sources_updated_at BEFORE UPDATE ON archon_sources FOR EACH ROW
EXECUTE FUNCTION update_updated_at_column(); statement (place it after the table
definitions / alongside other trigger creations, outside the shown hunk), and
then remove the manual application code that sets updated_at to the literal
"now()" in source_management_service.py so the trigger can correctly set the
timestamp.

Comment thread python/src/server/services/crawling/document_storage_operations.py
Comment thread python/src/server/services/crawling/helpers/url_handler.py
Comment on lines +301 to +306
update_data = {
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
"updated_at": "now()",
}

⚠️ Potential issue

Do not set updated_at to the string "now()"; rely on a DB trigger.

PostgREST updates treat "now()" as a string literal, corrupting the timestamp. Add a trigger in SQL (see migration comment) and stop setting this field in code.

             update_data = {
                 "summary": summary,
                 "total_word_count": word_count,
                 "metadata": metadata,
-                "updated_at": "now()",
             }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
update_data = {
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
"updated_at": "now()",
}
update_data = {
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
}
🤖 Prompt for AI Agents
In python/src/server/services/source_management_service.py around lines 301 to
306, the update_data currently sets "updated_at" to the string "now()"; remove
the "updated_at" key from update_data so the application does not send a literal
to PostgREST, and rely on the database trigger added in the migration to
populate updated_at automatically; ensure any tests or callers expecting
updated_at are updated to not set it and that the migration creating the trigger
has been applied.

Comment thread python/tests/test_source_id_refactor.py
@Wirasm
Collaborator Author

Wirasm commented Aug 25, 2025

@claude-review

@claude

claude Bot commented Aug 25, 2025

Claude finished @Wirasm's task


Summary

This PR successfully addresses a critical race condition in concurrent crawling by implementing a three-field architecture: unique hash-based source_id (for database uniqueness), source_url (for tracking), and source_display_name (for UI display). The implementation is comprehensive with proper database migrations, test coverage, and maintains backward compatibility.

Previous Review Comments

  • CodeRabbit provided detailed feedback with 6 actionable comments covering parameter shadowing, division by zero risks, URL canonicalization improvements, and minor code quality enhancements
  • First comprehensive human review - addressing both the architectural solution and implementation details

Issues Found

Total: 3 critical, 4 important, 6 minor

🔴 Critical (Must Fix)

  • Parameter Shadowing Bug - python/src/server/services/crawling/document_storage_operations.py:91
    Problem: source_url = doc.get("url", "") overwrites the function parameter, causing incorrect source record creation
    Fix: Rename the local variable to doc_url and preserve the original source_url parameter

  • Division by Zero Risk - python/src/server/services/crawling/document_storage_operations.py:148
    Problem: len(all_contents) / len(crawl_results) will crash if crawl_results is empty
    Fix: Add zero-check and track only processed documents: avg = len(all_contents) / max(1, processed_docs)

  • Incorrect Test Assertion - python/tests/test_source_id_refactor.py:182
    Problem: Test expects "Example" but URLHandler returns "Example API" for /api paths
    Fix: Update test case to expect "Example API" for "https://example.com/api/v2/"
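
The display-name expectation above can be illustrated with a simplified sketch of the extraction logic (a hypothetical reduction, not the actual `URLHandler` implementation, which also special-cases hosts such as GitHub):

```python
from urllib.parse import urlparse

def extract_display_name(url: str) -> str:
    """Simplified sketch: derive a human-friendly name from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    parts = host.split(".")
    # Drop a leading "api." subdomain so "api.openai.com" yields "Openai"
    if parts and parts[0] == "api":
        parts = parts[1:]
    base = parts[0].capitalize() if parts else url
    # Flag API endpoints detected via subdomain or path segment
    if host.startswith("api.") or "/api" in parsed.path:
        return f"{base} API"
    return base

print(extract_display_name("https://example.com/api/v2/"))  # Example API
```

Under this logic, any URL with "/api" in its path picks up the " API" suffix, which is why the test must expect "Example API" rather than "Example".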

🟡 Important (Should Fix)

  • Missing OS Import - python/src/server/services/source_management_service.py:75
    Problem: os.getenv() is used without importing os module
    Fix: Add import os at the top of the file

  • URL Canonicalization Needed - python/src/server/services/crawling/helpers/url_handler.py:201
    Problem: Simple lowercasing creates duplicate IDs for equivalent URLs (trailing slashes, query params, fragments)
    Fix: Implement proper URL canonicalization before hashing (normalize paths, sort query params, remove tracking parameters)

  • Sensitive Data in Logs - python/src/server/services/crawling/helpers/url_handler.py:207
    Problem: Error logs may expose API keys or tokens in URL query parameters
    Fix: Redact query parameters in error logs, only log domain and path

  • Manual updated_at Timestamp - python/src/server/services/source_management_service.py:306
    Problem: Setting "updated_at": "now()" creates a string literal, not a timestamp
    Fix: Remove manual timestamp and add a database trigger for automatic updates
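
The canonicalization fix suggested above could look like the following sketch (helper names are assumptions; the real `URLHandler` API may differ): normalize the URL before hashing so equivalent spellings collapse to one ID.

```python
import hashlib
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}

def canonicalize_url(url: str) -> str:
    p = urlparse(url.strip().lower())
    # Drop the default port, collapse trailing slashes, sort remaining
    # query parameters, and discard tracking parameters and fragments
    default_port = ":443" if p.scheme == "https" else ":80"
    netloc = p.netloc.removesuffix(default_port)
    path = p.path.rstrip("/") or "/"
    query = urlencode(sorted((k, v) for k, v in parse_qsl(p.query) if k not in TRACKING_PARAMS))
    return urlunparse((p.scheme, netloc, path, "", query, ""))

def generate_unique_source_id(url: str) -> str:
    # 16 hex chars = 64 bits, the ID length chosen in this PR
    return hashlib.sha256(canonicalize_url(url).encode()).hexdigest()[:16]

# Equivalent URL spellings now collapse to a single source ID
assert generate_unique_source_id("https://example.com/") == generate_unique_source_id("https://EXAMPLE.com:443")
```

Sorting query parameters also makes `?a=1&b=2` and `?b=2&a=1` equivalent, closing another duplicate-ID path that plain lowercasing misses.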

🟢 Minor (Consider)

  • Unused Imports - Multiple files have unused imports that should be removed for code cleanliness
  • Error Logging Enhancement - Add exc_info=True to error logs for better debugging
  • SVG File Exclusion - Consider allowing SVG files as they contain text content that could be valuable
  • GitHub Capitalization - Use "GitHub" instead of "Github" for brand consistency
  • Missing Database Trigger - Add update_updated_at_column() trigger for archon_sources table
  • Batch Size Optimization - Document storage uses batch size of 25, may benefit from tuning based on load testing

Security Assessment

No security issues found. The implementation properly:

  • Uses SHA256 hashing for ID generation (cryptographically secure)
  • Validates input data through Pydantic models
  • Prevents SQL injection through parameterized queries
  • Handles malformed URLs gracefully with fallbacks

Minor recommendation: Implement query parameter redaction in error logs to prevent accidental token exposure.
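
That redaction recommendation can be sketched in a few lines (an illustrative helper, not existing project code):

```python
from urllib.parse import urlparse

def redact_url_for_logging(url: str) -> str:
    """Keep scheme, host, and path only, so tokens in query strings never reach logs."""
    p = urlparse(url)
    base = f"{p.scheme}://{p.netloc}{p.path}"
    return base + ("?<redacted>" if p.query else "")

print(redact_url_for_logging("https://api.example.com/v1/data?api_key=secret123"))
# https://api.example.com/v1/data?<redacted>
```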

Performance Considerations

  • Hash collision probability: 64-bit IDs (16 hex chars) provide ~18 quintillion unique values - collision risk is negligible for practical use
  • Database indexes: Proper indexes added for new columns to maintain query performance
  • Memory usage: Chunking with 5000-character limit prevents memory issues
  • Batch processing: 25-item batch size for database inserts is reasonable but could be tuned

No significant performance concerns identified.
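
The collision figures can be checked with the birthday approximation, P(collision) ≈ 1 − exp(−n(n−1) / (2 · 2^64)):

```python
import math

ID_BITS = 64  # 16 hex characters
SPACE = 2 ** ID_BITS

def collision_probability(n: int) -> float:
    # Birthday approximation: 1 - exp(-n(n-1) / (2 * SPACE));
    # expm1 keeps precision for the tiny probabilities involved
    return -math.expm1(-n * (n - 1) / (2 * SPACE))

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} sources -> P(collision) ~ {collision_probability(n):.2e}")
```

Even at 100 million sources the probability stays well below one percent, which supports the "negligible for practical use" claim.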

Good Practices Observed

  • Comprehensive test coverage: 16 tests covering deterministic ID generation, concurrency safety, and edge cases
  • Backward compatibility: Migration preserves existing data while adding new columns
  • Error handling: Multiple fallback strategies for source creation failures
  • Separation of concerns: Clean separation between ID generation, URL tracking, and display logic
  • Alpha development approach: Prioritizes functionality and rapid iteration over production patterns
  • Detailed logging: Good use of structured logging for debugging and monitoring

Questionable Practices

  • Simple URL normalization: Current approach only lowercases URLs, missing canonical form normalization that could prevent duplicate IDs
  • 16-character hash length: While statistically safe, 20-24 characters would provide additional collision safety margin
  • Mixed error handling strategies: Some functions use try/catch with fallbacks, others fail fast - should be more consistent per CLAUDE.md guidelines
  • Manual timestamp management: Application-level timestamp setting instead of database triggers

Test Coverage

Current Coverage: Excellent (16 comprehensive tests)

Test Highlights:

  1. Deterministic ID Generation - Verifies same URLs always produce same IDs
  2. Concurrency Safety - ThreadPool simulation ensures no race conditions
  3. Display Name Extraction - Covers GitHub, docs sites, APIs, and edge cases
  4. URL Normalization - Tests case-insensitive ID generation
  5. Integration Flow - End-to-end crawling simulation

Missing Tests:

  1. URL Canonicalization Edge Cases

    • What to test: Equivalent URLs with different formats (trailing slashes, query param order, default ports)
    • Why important: Ensures no duplicate IDs for semantically identical URLs
    • Suggested test: Test that https://example.com/, https://example.com, and https://example.com:443/ generate the same ID
  2. Database Migration Rollback

    • What to test: Migration rollback and data integrity
    • Why important: Ensures safe deployment and rollback procedures
    • Suggested test: Apply migration, rollback, verify no data corruption

Recommendations

Merge Decision:

  • Requires fixes before merging

Priority Actions:

  1. Fix parameter shadowing bug in document_storage_operations.py (critical for data integrity)
  2. Add zero-division protection in progress logging (prevents runtime crashes)
  3. Fix test assertion for API endpoint display names (test suite must pass)
  4. Add missing import os statement (prevents import errors)

Rationale:
While this PR successfully solves the race condition problem with a solid architectural approach, the critical bugs identified could cause data corruption or runtime failures. The parameter shadowing issue is particularly concerning as it could lead to incorrect source records being created. These issues should be resolved before merging to maintain the alpha project's principle of "detailed errors over graceful failures" - we want these problems to surface and be fixed rather than cause silent data corruption.

The architectural solution itself is excellent and aligns well with the alpha development principles of breaking things to improve them and removing backward compatibility concerns.


Review based on Archon V2 Alpha guidelines and CLAUDE.md principles


Wirasm added a commit that referenced this pull request Aug 25, 2025
- Add missing os import to prevent NameError crash
- Remove unused imports (pytest, Mock, patch, hashlib, urlparse, etc.)
- Fix GitHub API capitalization consistency
- Reuse existing DocumentStorageService instance
- Update test expectations to match corrected capitalization

Addresses CodeRabbit review feedback on PR #472
@Wirasm Wirasm force-pushed the fix/source-id-architecture-refactor branch from 41675d9 to 56220bd Compare August 25, 2025 10:54

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/source_management_service.py (2)

67-74: Add retry with exponential backoff on external LLM calls.

External API calls must use retry with backoff and fail with contextual errors (guidelines). Wrap the chat completion in a small retry loop.

-        # Call the OpenAI API to generate the summary
-        response = client.chat.completions.create(
-            model=model_choice,
-            messages=[
-                {
-                    "role": "system",
-                    "content": "You are a helpful assistant that provides concise library/tool/framework summaries.",
-                },
-                {"role": "user", "content": prompt},
-            ],
-        )
+        # Call the OpenAI API to generate the summary with simple retry/backoff
+        response = None
+        for attempt in range(3):
+            try:
+                response = client.chat.completions.create(
+                    model=model_choice,
+                    messages=[
+                        {
+                            "role": "system",
+                            "content": "You are a helpful assistant that provides concise library/tool/framework summaries.",
+                        },
+                        {"role": "user", "content": prompt},
+                    ],
+                )
+                break
+            except Exception as api_err:
+                wait = 2 ** attempt
+                search_logger.warning(
+                    f"LLM summary attempt {attempt + 1}/3 failed for {source_id}; retrying in {wait}s: {api_err}"
+                )
+                time.sleep(wait)
+        if response is None:
+            search_logger.error(f"LLM summary failed after retries for {source_id}")
+            return default_summary

Note: add import time at top if not already present.

Also applies to: 106-116, 118-127, 136-140


196-204: Apply the same retry/backoff to title generation.

Mirror the retry around the chat call in generate_source_title_and_metadata.

-            response = client.chat.completions.create(
-                model=model_choice,
-                messages=[
-                    {
-                        "role": "system",
-                        "content": "You are a helpful assistant that generates concise titles.",
-                    },
-                    {"role": "user", "content": prompt},
-                ],
-            )
+            response = None
+            for attempt in range(3):
+                try:
+                    response = client.chat.completions.create(
+                        model=model_choice,
+                        messages=[
+                            {"role": "system", "content": "You are a helpful assistant that generates concise titles."},
+                            {"role": "user", "content": prompt},
+                        ],
+                    )
+                    break
+                except Exception as api_err:
+                    wait = 2 ** attempt
+                    search_logger.warning(
+                        f"LLM title attempt {attempt + 1}/3 failed for {source_id}; retrying in {wait}s: {api_err}"
+                    )
+                    time.sleep(wait)
+            if response is None:
+                raise RuntimeError("LLM title generation failed after retries")

Also applies to: 218-227, 229-235

♻️ Duplicate comments (3)
python/tests/test_source_id_refactor.py (1)

176-181: Fix expected display name for path-based API URLs ("Example API" vs "Example").

extract_display_name appends " API" when "/api" appears in the path. For "https://example.com/api/v2/" the expected value should be "Example API", not "Example".

Apply this diff:

         test_cases = [
             ("https://api.github.com/", "GitHub API"),
             ("https://api.openai.com/v1/", "Openai API"),
-            ("https://example.com/api/v2/", "Example"),
+            ("https://example.com/api/v2/", "Example API"),
         ]
python/src/server/services/source_management_service.py (1)

302-307: Do not set updated_at to the string literal "now()"; rely on DB default/trigger.

PostgREST/Supabase will store "now()" as a string, corrupting your timestamp. Use a DB trigger/default instead and remove this field from application updates.

Apply this diff:

             update_data = {
                 "summary": summary,
                 "total_word_count": word_count,
                 "metadata": metadata,
-                "updated_at": "now()",
             }
python/src/server/services/crawling/document_storage_operations.py (1)

64-71: Avoid ZeroDivisionError and report accurate averages.

len(all_contents)/len(crawl_results) can divide by zero and overstates averages when documents are skipped. Track processed docs.

         all_metadatas = []
         source_word_counts = {}
         url_to_full_document = {}
+        processed_docs = 0
@@
-            if not markdown_content:
+            if not markdown_content:
                 continue
+            processed_docs += 1
@@
-        safe_logfire_info(
-            f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents) / len(crawl_results):.1f}"
-        )
+        avg_chunks = (len(all_contents) / processed_docs) if processed_docs else 0.0
+        safe_logfire_info(
+            f"Document storage | documents={processed_docs} | chunks={len(all_contents)} | avg_chunks_per_doc={avg_chunks:.1f}"
+        )

Also applies to: 141-143

🧹 Nitpick comments (9)
python/tests/test_source_id_refactor.py (1)

71-86: Normalization tests only cover casing; AI summary mentions scheme normalization.

The current tests verify case-only normalization. The AI summary claims "URL normalization yields identical IDs for variations (case, scheme, host casing)," but generate_unique_source_id only lowercases the URL. http vs https (and trailing slashes, default ports) will produce different IDs today. Either:

  • add tests to assert the current behavior (distinct IDs for different schemes), or
  • enhance normalization to canonicalize scheme/ports/trailing slashes and then add tests accordingly.
python/src/server/services/source_management_service.py (5)

365-367: Preserve stack traces in logs for diagnostics.

Include exc_info=True so logs retain full tracebacks per guidelines.

-        search_logger.error(f"Error updating source {source_id}: {e}")
+        search_logger.error(f"Error updating source {source_id}: {e}", exc_info=True)

170-176: Redundant imports inside function.

import os is already at module top; remove the inner import to satisfy Ruff.

-                import os
-
                 import openai

45-51: Docstring references to "domain" are outdated post-refactor.

source_id is now a 16-char hash; docstrings should reflect that to avoid confusion in prompts/logs.

-        source_id: The source ID (domain)
+        source_id: The unique source identifier (16-char hash or file_* id)

Also applies to: 154-161


235-237: Include exception context in title-generation error logs.

Add exc_info for easier debugging.

-            search_logger.error(f"Error generating title for {source_id}: {e}")
+            search_logger.error(f"Error generating title for {source_id}: {e}", exc_info=True)

59-74: Consider using source_display_name in the summary prompt as context.

extract_source_summary currently references source_id in the prompt, which will be a hash. Passing a human-friendly name (when available) will improve summaries. This requires threading a display name through callers; optional but recommended.

python/src/server/services/crawling/document_storage_operations.py (3)

223-229: Avoid blocking the event loop with synchronous LLM calls.

extract_source_summary performs a network call synchronously. Consider offloading to a thread or using an async client to prevent event loop stalls.


111-112: Default knowledge_type inconsistency ("documentation" vs "technical").

Chunk metadata defaults to "documentation" while update_source_info uses "technical". Unify defaults to avoid mixed data.

-                    "knowledge_type": request.get("knowledge_type", "documentation"),
+                    "knowledge_type": request.get("knowledge_type", "technical"),

Also applies to: 243-246


256-277: Prefer display name as title in fallback upsert path.

When the primary update fails, you upsert with title=source_id, which is a hash and not user-friendly. Use source_display_name when available.

-                    fallback_data = {
+                    fallback_data = {
                         "source_id": source_id,
-                        "title": source_id,  # Use source_id as title fallback
+                        "title": source_display_name or source_id,  # Prefer display name if available
                         "summary": summary,
                         "total_word_count": source_id_word_counts[source_id],
                         "metadata": {
                             "knowledge_type": request.get("knowledge_type", "technical"),
                             "tags": request.get("tags", []),
                             "auto_generated": True,
                             "fallback_creation": True,
                             "original_url": request.get("url"),
                         },
                     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 40abaf9 and 56220bd.

📒 Files selected for processing (4)
  • python/src/server/services/crawling/document_storage_operations.py (4 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
  • python/src/server/services/source_management_service.py (7 hunks)
  • python/tests/test_source_id_refactor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (7)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/tests/test_source_id_refactor.py
  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/tests/test_source_id_refactor.py
  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_source_id_refactor.py
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/source_management_service.py
  • python/src/server/services/crawling/document_storage_operations.py
🧬 Code graph analysis (2)
python/tests/test_source_id_refactor.py (1)
python/src/server/services/crawling/helpers/url_handler.py (3)
  • URLHandler (16-323)
  • generate_unique_source_id (186-210)
  • extract_display_name (213-323)
python/src/server/services/crawling/document_storage_operations.py (3)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (223-235)
  • safe_logfire_error (238-250)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (36-140)
  • update_source_info (250-367)
🔇 Additional comments (1)
python/tests/test_source_id_refactor.py (1)

138-149: Solid coverage and relevant assertions.

Good breadth across uniqueness, determinism, GitHub-specific formatting, docs/APIs, hash distribution, and integration flow. Nice use of ThreadPoolExecutor to simulate concurrency.

Also applies to: 154-171, 186-200, 218-229, 232-274, 276-309

Comment thread python/src/server/services/crawling/document_storage_operations.py Outdated
- Truncate display names to 100 chars when used as titles
- Document hash collision probability (negligible for <1M sources)

Simple, pragmatic fixes per KISS principle

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/source_management_service.py (1)

289-296: Incorrect source_type inference after hashed IDs; infer from URL(s) instead.

With 16-char hashed source_ids, startswith("file_") will never be true. Derive source_type from source_url or original_url to avoid mislabeling all sources as url.

-            # Update metadata while preserving title
-            source_type = "file" if source_id.startswith("file_") else "url"
-            metadata = {
+            # Update metadata while preserving title
+            # IDs are opaque (hashed); infer source_type from URL(s)
+            if source_url or original_url:
+                _u = source_url or original_url
+                source_type = "url" if _u.startswith(("http://", "https://")) else "file"
+            else:
+                source_type = "unknown"
+            metadata = {
                 "knowledge_type": knowledge_type,
                 "tags": tags or [],
                 "source_type": source_type,
                 "auto_generated": False,  # Mark as not auto-generated since we're preserving
                 "update_frequency": update_frequency,
             }
♻️ Duplicate comments (1)
python/src/server/services/source_management_service.py (1)

302-307: Do not set updated_at to the string literal "now()"; rely on DB defaults/triggers.

PostgREST treats "now()" as a string and will corrupt the timestamp. Let the database set updated_at.

             update_data = {
                 "summary": summary,
                 "total_word_count": word_count,
                 "metadata": metadata,
-                "updated_at": "now()",
             }
🧹 Nitpick comments (4)
python/src/server/services/source_management_service.py (4)

8-8: Remove redundant in-function import os statements; keep the single top-level import.

There are additional import os statements inside functions that are unnecessary now that Line 8 imports it at module scope.

Outside the selected range, remove the inner imports:

-            import os
             import openai
-                import os
                 import openai

208-214: Sanitize source_display_name before injecting into the LLM prompt.

Prevent prompt injection/formatting issues by stripping newlines and constraining length before interpolation.

-            # Use display name if available for better context
-            source_context = source_display_name if source_display_name else source_id
+            # Use display name if available for better context; sanitize to avoid prompt injection
+            source_context = (source_display_name or source_id)
+            source_context = source_context.replace("\n", " ").replace("\r", " ").strip()[:100]

315-321: Remove unused result variable from the update call.

The assignment is unused and can be dropped.

-            result = (
-                client.table("archon_sources")
-                .update(update_data)
-                .eq("source_id", source_id)
-                .execute()
-            )
+            client.table("archon_sources").update(update_data).eq("source_id", source_id).execute()

326-335: Good: prefer source_display_name as title; also sanitize and ensure fallback.

This aligns with the PR goal to avoid unnecessary AI title generation. Minor hardening: ensure the title is single-line and non-empty.

-            if source_display_name:
-                # Use the display name directly as the title (truncated to prevent DB issues)
-                title = source_display_name[:100].strip()
+            if source_display_name:
+                # Sanitize to a single line and truncate
+                sanitized = source_display_name.replace("\n", " ").replace("\r", " ").strip()
+                title = (sanitized[:100] if sanitized else source_id)
                 metadata = {
                     "knowledge_type": knowledge_type,
                     "tags": tags or [],
                     "auto_generated": False,
                 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 56220bd and b9d52fb.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
  • python/src/server/services/source_management_service.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/services/crawling/helpers/url_handler.py
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/source_management_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/source_management_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/source_management_service.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/source_management_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/source_management_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/source_management_service.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (2)
python/src/server/services/source_management_service.py (2)

348-354: LGTM: Insert payload constructed via insert_data improves readability and safety.

This refactor centralizes the insert columns and makes conditional fields straightforward.


357-363: Verified: Database schema and code paths updated for source_url & source_display_name

All migration files include the new columns, comments, and indexes, and the service code now conditionally handles these fields in both create and update flows:

  • migration/complete_setup.sql and migration/add_source_url_display_name.sql define source_url and source_display_name (with TEXT types), add indexes (idx_archon_sources_url, idx_archon_sources_display_name), and include column comments.
  • In source_management_service.py, new sources are inserted with source_url and source_display_name when provided (lines 357–363), and existing sources are updated via the update_source_info utility, whose signature now includes optional source_url and source_display_name.
  • Call sites in document_storage_operations.py and within the service itself propagate source_url and source_display_name into update_source_info (e.g., manual‐refresh update in document storage and the scheduled update flow).

With the schema and all relevant code paths aligned, this change is safe to deploy.

Comment on lines +149 to 151
source_display_name: str | None = None,
) -> tuple[str, dict[str, Any]]:
"""

🛠️ Refactor suggestion

Type correctness for provider; document the new argument.

provider: str = None violates strict typing (mypy) under Python 3.12. Use an optional type and update the docstring to include provider and source_display_name.

Apply:

-def generate_source_title_and_metadata(
+def generate_source_title_and_metadata(
     source_id: str,
     content: str,
     knowledge_type: str = "technical",
     tags: list[str] | None = None,
-    provider: str = None,
+    provider: str | None = None,
     source_display_name: str | None = None,
 ) -> tuple[str, dict[str, Any]]:

Also update the Args section (outside this hunk) to add:

  • provider: Optional provider override
  • source_display_name: Optional human-friendly name to guide title generation

For consistency, make the same typing fix in extract_source_summary’s signature (provider: str | None = None).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
source_display_name: str | None = None,
) -> tuple[str, dict[str, Any]]:
"""
def generate_source_title_and_metadata(
source_id: str,
content: str,
knowledge_type: str = "technical",
tags: list[str] | None = None,
provider: str | None = None,
source_display_name: str | None = None,
) -> tuple[str, dict[str, Any]]:
🤖 Prompt for AI Agents
In python/src/server/services/source_management_service.py around lines 149 to
151, the function signature uses `provider: str = None` which breaks strict
typing under Python 3.12 and the docstring lacks the new `provider` and
`source_display_name` params; change the annotation to an explicit optional type
(e.g., `provider: str | None = None` or `Optional[str]`) and do the same change
in `extract_source_summary`'s signature, then update the function docstring Args
section to include `provider: Optional provider override` and
`source_display_name: Optional human-friendly name to guide title generation` so
type hints and documentation stay consistent.

Comment thread python/src/server/services/source_management_service.py
…names

- Fixed critical bug where code extraction was using old domain-based source_ids
- Updated code extraction service to accept source_id as parameter instead of extracting from URL
- Added special handling for llms.txt and sitemap.xml files in display names
- Added comprehensive tests for source_id handling in code extraction
- Removed unused urlparse import from code_extraction_service.py

This fixes the foreign key constraint errors that were preventing code examples
from being stored after the source_id architecture refactor.

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
python/src/server/services/crawling/helpers/url_handler.py (1)

185-215: Canonicalize URLs before hashing; redact sensitive data in error logs.

Lowercasing alone will produce different IDs for equivalent URLs (trailing slash, fragments, default ports, reordered/tracking query params). This can reintroduce collisions/duplicates across “equivalent” URLs and undermine the PR’s objective. Also, logging the raw URL can leak tokens.

     @staticmethod
     def generate_unique_source_id(url: str) -> str:
         """
-        Generate a unique source ID from URL using hash.
+        Generate a deterministic source ID from a canonicalized URL using SHA-256.
@@
-        This creates a 16-character hash that is guaranteed to be unique
-        for each URL, solving race condition issues when multiple crawls
-        target the same domain.
+        This creates a 16-character hash that is extremely unlikely to collide
+        for distinct canonical URLs, solving race conditions when multiple crawls
+        target the same domain.
@@
-        try:
-            # Normalize URL for consistent hashing
-            normalized_url = url.strip().lower()
-            # Generate SHA256 hash and take first 16 characters
-            # 16 hex chars = 64 bits = ~18 quintillion unique values
-            return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()[:16]
-        except Exception as e:
-            logger.error(f"Error generating unique source ID for {url}: {e}")
-            # Fallback: use a hash of the error message + url to still get something unique
-            fallback = f"error_{url}_{str(e)}"
-            return hashlib.sha256(fallback.encode("utf-8")).hexdigest()[:16]
+        try:
+            # Canonicalize deterministically: lower-case scheme/host, drop default ports,
+            # drop fragment, normalize path (strip trailing slash except root),
+            # sort query params and drop common tracking params.
+            from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode
+
+            parsed = urlparse(url.strip())
+            scheme = (parsed.scheme or "").lower()
+            netloc = (parsed.netloc or "").lower()
+            if netloc.endswith(":80") and scheme == "http":
+                netloc = netloc[:-3]
+            if netloc.endswith(":443") and scheme == "https":
+                netloc = netloc[:-4]
+
+            path = parsed.path or ""
+            if path.endswith("/") and path != "/":
+                path = path.rstrip("/")
+
+            tracking = {"utm_source","utm_medium","utm_campaign","utm_term","utm_content","gclid","fbclid"}
+            query_items = [(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True) if k not in tracking]
+            query = urlencode(sorted(query_items))
+
+            canonical = urlunparse((scheme, netloc, path, "", query, ""))  # fragment dropped
+            return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
+        except Exception as e:
+            # Avoid logging raw query params/tokens; include stack trace
+            try:
+                redacted = url.split("?", 1)[0]
+            except Exception:
+                redacted = "<unparseable-url>"
+            logger.error(f"Error generating unique source ID for {redacted}: {e}", exc_info=True)
+            # Fallback: hash a safe concatenation
+            fallback = f"error_{redacted}_{str(e)}"
+            return hashlib.sha256(fallback.encode("utf-8")).hexdigest()[:16]

Optionally consider 24 hex chars (96 bits) to further reduce birthday-bound collision risk, while keeping IDs compact. Happy to wire tests if you want.
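The canonicalization suggested above can be sanity-checked in isolation. The sketch below (a standalone adaptation of the suggested diff, not the actual `URLHandler` method) shows that equivalent URLs — differing only in case, default port, trailing slash, fragment, or tracking params — collapse to the same 16-character ID:

```python
import hashlib
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "gclid", "fbclid"}

def canonical_source_id(url: str, length: int = 16) -> str:
    # Canonicalize: lower-case scheme/host, drop default ports and the fragment,
    # strip trailing slash (except root), sort query params, drop tracking params.
    parsed = urlparse(url.strip())
    scheme = (parsed.scheme or "").lower()
    netloc = (parsed.netloc or "").lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    if scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]
    path = parsed.path or ""
    if path.endswith("/") and path != "/":
        path = path.rstrip("/")
    query_items = [
        (k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True) if k not in TRACKING
    ]
    query = urlencode(sorted(query_items))
    canonical = urlunparse((scheme, netloc, path, "", query, ""))  # fragment dropped
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]

# Equivalent URLs collapse to one ID; distinct paths still get distinct IDs.
assert canonical_source_id("https://GitHub.com:443/org/repo/?utm_source=x#readme") == \
    canonical_source_id("https://github.com/org/repo")
assert canonical_source_id("https://github.com/org/repo") != \
    canonical_source_id("https://github.com/org/other")
```

Passing `length=24` gives the 96-bit variant mentioned above without changing anything else.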

python/src/server/services/crawling/document_storage_operations.py (2)

78-86: Bug: local variable source_url shadows function parameter; corrupts source records and logs.

The loop assigns source_url = doc.get("url", ""), overwriting the function parameter meant to carry the original crawl URL. This flows into _create_source_records and can store an arbitrary doc URL as the source_url.

-            source_url = doc.get("url", "")
-            markdown_content = doc.get("markdown", "")
+            doc_url = doc.get("url", "")
+            markdown_content = doc.get("markdown", "")
@@
-            url_to_full_document[source_url] = markdown_content
+            url_to_full_document[doc_url] = markdown_content
@@
-            source_id = original_source_id
-            safe_logfire_info(f"Using original source_id '{source_id}' for URL '{source_url}'")
+            source_id = original_source_id
+            safe_logfire_info(f"Using original source_id '{source_id}' for URL '{doc_url}'")
@@
-                all_urls.append(source_url)
+                all_urls.append(doc_url)
@@
-                    "url": source_url,
+                    "url": doc_url,
@@
-            await self._create_source_records(
-                all_metadatas, all_contents, source_word_counts, request,
-                source_url, source_display_name
-            )
+            await self._create_source_records(
+                all_metadatas, all_contents, source_word_counts, request,
+                source_url, source_display_name
+            )

Also applies to: 90-103, 104-122, 131-137, 311-338


64-71: Fix avg_chunks_per_doc and avoid ZeroDivisionError; log processed count.

Average uses total requested docs (crawl_results) and will divide by zero when no docs are processed. Track processed docs instead.

-        all_metadatas = []
+        all_metadatas = []
         source_word_counts = {}
         url_to_full_document = {}
+        processed_docs = 0
@@
-            if not markdown_content:
+            if not markdown_content:
                 continue
+            processed_docs += 1
@@
-        safe_logfire_info(
-            f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents) / len(crawl_results):.1f}"
-        )
+        avg_chunks = (len(all_contents) / processed_docs) if processed_docs else 0.0
+        safe_logfire_info(
+            f"Document storage | documents={processed_docs} | chunks={len(all_contents)} | avg_chunks_per_doc={avg_chunks:.1f}"
+        )

Also applies to: 140-143

🧹 Nitpick comments (11)
python/src/server/services/crawling/helpers/url_handler.py (4)

31-31: Make sitemap/txt checks robust and case-insensitive.

Parse the URL and lower-case the path to avoid false negatives (e.g., uppercase extensions, query strings). Also prefer avoiding repeated parsing work in hot paths.

-            return url.endswith("sitemap.xml") or "sitemap" in urlparse(url).path
+            parsed = urlparse(url)
+            path = (parsed.path or "").lower()
+            return path.endswith("/sitemap.xml") or "sitemap" in path
@@
-            return url.endswith(".txt")
+            parsed = urlparse(url)
+            path = (parsed.path or "").lower()
+            return path.endswith(".txt")

Also applies to: 48-48


165-184: Redact query strings in logs for GitHub URL transformation.

Avoid leaking tokens/params in logs; log a redacted URL (no query/fragment). This also keeps logs cleaner.

-            logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
+            redacted_in = url.split("?", 1)[0]
+            redacted_out = raw_url.split("?", 1)[0]
+            logger.info(f"Transformed GitHub file URL to raw: {redacted_in} -> {redacted_out}")
@@
-            logger.warning(
-                f"GitHub directory URL detected: {url} - consider using specific file URLs or GitHub API"
-            )
+            redacted = url.split("?", 1)[0]
+            logger.warning(
+                f"GitHub directory URL detected: {redacted} - consider using specific file URLs or GitHub API"
+            )

360-363: Redact URL and include stack trace on display-name extraction failures.

Follow logging guidelines: avoid leaking tokens and preserve stack traces.

-            logger.warning(f"Error extracting display name for {url}: {e}, using URL")
+            redacted = url.split("?", 1)[0]
+            logger.warning(f"Error extracting display name for {redacted}: {e}, using URL", exc_info=True)

268-279: Polish display-name casing for special files.

Use conventional casing (“LLMs.txt”, “Sitemap.xml”) for nicer UI output.

-                        return f"{base_name} - Llms.Txt"
+                        return f"{base_name} - LLMs.txt"
@@
-                        return f"{base_name} - Sitemap.Xml"
+                        return f"{base_name} - Sitemap.xml"
@@
-                    return f"{formatted} - Sitemap.Xml"
+                    return f"{formatted} - Sitemap.xml"
@@
-                    return f"{formatted} - Llms.Txt"
+                    return f"{formatted} - LLMs.txt"

Also applies to: 320-341

python/src/server/services/crawling/crawling_service.py (3)

259-259: Redact query strings in info logs to avoid token leakage.

Several info logs include the full URL; redact queries/fragments in logs while keeping IDs and context.

-        safe_logfire_info(f"Starting background crawl orchestration | url={url}")
+        safe_logfire_info(f"Starting background crawl orchestration | url={url.split('?', 1)[0]}")
@@
-            safe_logfire_info(f"Starting async crawl orchestration | url={url} | task_id={task_id}")
+            safe_logfire_info(f"Starting async crawl orchestration | url={url.split('?', 1)[0]} | task_id={task_id}")
@@
-            safe_logfire_info(
-                f"Generated unique source_id '{original_source_id}' and display name '{source_display_name}' from URL '{url}'"
-            )
+            safe_logfire_info(
+                f"Generated unique source_id '{original_source_id}' and display name '{source_display_name}' from URL '{url.split('?', 1)[0]}'"
+            )

Also applies to: 305-305, 311-312


395-401: Sanity check: validate source_id round-trip.

Ensure the source_id returned from storage matches the one derived at the start; log a warning if not.

             storage_results = await self.doc_storage_ops.process_and_store_documents(
                 crawl_results,
                 request,
                 crawl_type,
                 original_source_id,
                 doc_storage_callback,
                 self._check_cancellation,
                 source_url=url,
                 source_display_name=source_display_name,
             )
+            if storage_results.get("source_id") != original_source_id:
+                safe_logfire_error(
+                    f"source_id mismatch after storage | expected={original_source_id} | actual={storage_results.get('source_id','')}"
+                )

483-486: Preserve stack trace on orchestration errors.

safe_logfire_error records the message, but we also want a full traceback for postmortems.

-        except Exception as e:
-            safe_logfire_error(f"Async crawl orchestration failed | error={str(e)}")
+        except Exception as e:
+            safe_logfire_error(f"Async crawl orchestration failed | error={str(e)}")
+            # Also emit stack trace via standard logger
+            logger.exception("Async crawl orchestration failed")
python/tests/test_code_extraction_source_id.py (1)

40-46: Fix mocked storage payload shape; avoid referencing late-populated list.

The mocked _prepare_code_examples_for_storage currently returns a list (not the dict the service expects) and closes over extracted_blocks before it’s populated, risking None/KeyErrors if the implementation changes. Either remove the mock or return a dict with required keys.

-        code_service._generate_code_summaries = AsyncMock(return_value=[{"summary": "Test code"}])
-        code_service._prepare_code_examples_for_storage = Mock(return_value=[
-            {"source_id": extracted_blocks[0]["source_id"] if extracted_blocks else None}
-        ])
+        code_service._generate_code_summaries = AsyncMock(return_value=[{"summary": "Test code"}])
+        # Return a minimal-but-correct storage payload shape
+        code_service._prepare_code_examples_for_storage = Mock(return_value={
+            "urls": [],
+            "chunk_numbers": [],
+            "examples": [],
+            "summaries": [{"summary": "Test code"}],
+            "metadatas": [{"source_id": "placeholder"}],
+        })

Alternative: drop the _prepare_code_examples_for_storage mock entirely—since _store_code_examples is already mocked, you can exercise the real preparation logic safely.

Also applies to: 59-70

python/src/server/services/crawling/code_extraction_service.py (2)

339-342: Include stack traces on errors in extraction/storage.

Add a standard logger with exception stack traces alongside safe_logfire_error for better diagnostics.

-from ...config.logfire_config import safe_logfire_error, safe_logfire_info
+from ...config.logfire_config import safe_logfire_error, safe_logfire_info, get_logger
+logger = get_logger(__name__)
@@
-            safe_logfire_error(
-                    f"Error processing code from document | url={doc.get('url')} | error={str(e)}"
-                )
+            safe_logfire_error(
+                f"Error processing code from document | url={doc.get('url')} | error={str(e)}"
+            )
+            logger.exception("Error processing code from document")
@@
-        except Exception as e:
-            safe_logfire_error(f"Error storing code examples | error={str(e)}")
+        except Exception as e:
+            safe_logfire_error(f"Error storing code examples | error={str(e)}")
+            logger.exception("Error storing code examples")
             return 0

Also applies to: 1526-1529, 11-11


346-357: Docstring drift: remove unused parameter from _extract_html_code_blocks docs.

The signature no longer has a min_length argument; update the docstring to match.

-        Args:
-            content: The content to search for HTML code patterns
-            min_length: Minimum length for code blocks
+        Args:
+            content: The content to search for HTML code patterns
python/src/server/services/crawling/document_storage_operations.py (1)

138-139: Redact URL keys when logging url_to_full_document to avoid leaking tokens.

Log redacted keys (no query/fragment) to keep PII/credentials out of logs.

-        safe_logfire_info(f"url_to_full_document keys: {list(url_to_full_document.keys())[:5]}")
+        keys = list(url_to_full_document.keys())[:5]
+        redacted_keys = [k.split('?', 1)[0] for k in keys]
+        safe_logfire_info(f"url_to_full_document keys: {redacted_keys}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between b9d52fb and 5e603ea.

📒 Files selected for processing (6)
  • python/src/server/services/crawling/code_extraction_service.py (5 hunks)
  • python/src/server/services/crawling/crawling_service.py (4 hunks)
  • python/src/server/services/crawling/document_storage_operations.py (4 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (2 hunks)
  • python/tests/test_code_extraction_source_id.py (1 hunks)
  • python/tests/test_source_id_refactor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/tests/test_source_id_refactor.py
🧰 Additional context used
📓 Path-based instructions (7)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/tests/test_code_extraction_source_id.py
  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/tests/test_code_extraction_source_id.py
  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_code_extraction_source_id.py
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/code_extraction_service.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/crawling/helpers/url_handler.py
🧬 Code graph analysis (3)
python/tests/test_code_extraction_source_id.py (2)
python/src/server/services/crawling/code_extraction_service.py (5)
  • _extract_code_blocks_from_documents (202-344)
  • _generate_code_summaries (1341-1406)
  • _prepare_code_examples_for_storage (1408-1455)
  • _store_code_examples (1457-1528)
  • extract_and_store_code_examples (134-200)
python/src/server/services/crawling/document_storage_operations.py (1)
  • extract_and_store_code_examples (311-338)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • generate_unique_source_id (186-214)
  • extract_display_name (217-363)
python/src/server/config/logfire_config.py (1)
  • safe_logfire_info (223-235)
python/src/server/services/crawling/document_storage_operations.py (3)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/services/crawling/code_extraction_service.py (2)
  • CodeExtractionService (19-1528)
  • extract_and_store_code_examples (134-200)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (36-140)
  • update_source_info (250-367)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (2)
python/src/server/services/crawling/code_extraction_service.py (1)

134-142: LGTM: source_id is correctly threaded through extraction and metadata.

The new source_id parameter is accepted, propagated into block metadata, and persisted through storage. This aligns with the PR objective and the tests.

Also applies to: 165-168, 314-321, 1408-1455

python/src/server/services/crawling/document_storage_operations.py (1)

232-249: Good: source records are updated before chunk storage and include the new fields.

Creating/updating source records up front (with summary, knowledge_type, tags, original_url, source_url, source_display_name) aligns with the new schema and avoids FKs on chunk inserts.

- Fixed variable shadowing in document_storage_operations.py where source_url parameter
  was being overwritten by document URLs, causing incorrect source_url in database
- Fixed source_type determination to use actual URLs instead of hash-based source_id
- Added comprehensive tests for source URL preservation
- Ensure source_type is correctly set to "file" for file uploads, "url" for web crawls

The variable shadowing bug was causing sitemap sources to have the wrong source_url
(last crawled page instead of sitemap URL). The source_type bug would mark all
sources as "url" even for file uploads due to hash-based IDs not starting with "file_".

Co-Authored-By: Claude <noreply@anthropic.com>
@Wirasm
Collaborator Author

Wirasm commented Aug 25, 2025

@coleam00 @leex279 @tazmon95 This change will require any existing user pulling the new main to run the migration script, and any new user to run the updated complete_setup.sql


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (6)
python/src/server/services/source_management_service.py (4)

36-38: Fix Optional typing for provider to satisfy mypy/3.12

Annotate provider as optional; provider: str = None violates strict typing. Keep signatures consistent across the module.

-def extract_source_summary(
-    source_id: str, content: str, max_length: int = 500, provider: str = None
-) -> str:
+def extract_source_summary(
+    source_id: str, content: str, max_length: int = 500, provider: str | None = None
+) -> str:

333-365: Good: AI source_type is overridden by URL inference

This addresses the earlier correctness concern with AI metadata carrying an incorrect source_type.


149-151: Make provider optional type; keep signature consistent with docstring

Same Optional typing fix as above for generate_source_title_and_metadata.

 def generate_source_title_and_metadata(
     source_id: str,
     content: str,
     knowledge_type: str = "technical",
     tags: list[str] | None = None,
-    provider: str = None,
+    provider: str | None = None,
     source_display_name: str | None = None,
 ) -> tuple[str, dict[str, Any]]:

309-314: Do not set updated_at to string "now()" — rely on DB trigger

This writes a literal string to the column and corrupts timestamps via PostgREST. Remove it and rely on your migration trigger.

             update_data = {
                 "summary": summary,
                 "total_word_count": word_count,
                 "metadata": metadata,
-                "updated_at": "now()",
             }

If you need to force a write to bump updated_at, touch a benign field or use a SQL function on the server side. I can help produce the trigger if needed.

python/src/server/services/crawling/document_storage_operations.py (2)

78-86: Correct fix: avoid shadowing the source_url parameter

Renaming the local variable to doc_url prevents clobbering the function argument. This unblocks correct propagation and logging.


141-143: Division by zero and incorrect average when no docs are processed

len(all_contents) / len(crawl_results) can raise ZeroDivisionError and uses requested docs rather than processed docs. Track processed docs or derive from url_to_full_document.

Apply this focused diff for the logging lines:

-        safe_logfire_info(
-            f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents) / len(crawl_results):.1f}"
-        )
+        processed_docs = len(url_to_full_document)
+        avg_chunks = (len(all_contents) / processed_docs) if processed_docs else 0.0
+        safe_logfire_info(
+            f"Document storage | documents={processed_docs} | chunks={len(all_contents)} | avg_chunks_per_doc={avg_chunks:.1f}"
+        )

If you prefer an explicit counter, add the following outside this range:

@@
-        url_to_full_document = {}
+        url_to_full_document = {}
+        processed_docs = 0
@@
-            url_to_full_document[doc_url] = markdown_content
+            url_to_full_document[doc_url] = markdown_content
+            processed_docs += 1
🧹 Nitpick comments (5)
python/src/server/services/source_management_service.py (4)

152-162: Docstring missing new parameters

Please document provider and source_display_name to match the signature.

     """
     Generate a user-friendly title and metadata for a source based on its content.
 
     Args:
         source_id: The source ID (domain)
         content: Sample content from the source
         knowledge_type: Type of knowledge (default: "technical")
         tags: Optional list of tags
+        provider: Optional provider override for LLM selection
+        source_display_name: Optional human-friendly name to guide title generation

238-245: Clarify “caller-determined” source_type; ensure it never leaks as the default

You default source_type to "url" here and then rely on callers to override. That’s fine, but it can be easy to forget. Consider setting it only at the call site (or set to None here) to avoid accidental writes with a wrong default. If you keep the default, add a brief inline comment stating which call sites override it (e.g., update_source_info).


266-275: Docstring for update_source_info lacks source_url and source_display_name

Add these to the Args section to keep docs and signature aligned.

     Args:
         client: Supabase client
         source_id: The source ID (domain)
         summary: Summary of the source
         word_count: Total word count for the source
         content: Sample content for title generation
         knowledge_type: Type of knowledge
         tags: List of tags
         update_frequency: Update frequency in days
+        original_url: The crawl entry URL (kept for historical context)
+        source_url: The canonical source URL for this source_id
+        source_display_name: Human-friendly name for UI and title generation

289-296: Deduplicate source_type inference logic

You compute source_type from source_url/original_url in two places. Consider extracting a tiny helper to avoid drift.

@@
-            # Determine source_type based on source_url or original_url
-            if source_url and source_url.startswith("file://"):
-                source_type = "file"
-            elif original_url and original_url.startswith("file://"):
-                source_type = "file"
-            else:
-                source_type = "url"
+            def _infer_source_type(source_url: str | None, original_url: str | None) -> str:
+                for u in (source_url, original_url):
+                    if u and u.startswith("file://"):
+                        return "file"
+                return "url"
+            source_type = _infer_source_type(source_url, original_url)
@@
-                # Determine source_type based on source_url or original_url
-                if source_url and source_url.startswith("file://"):
-                    source_type = "file"
-                elif original_url and original_url.startswith("file://"):
-                    source_type = "file"
-                else:
-                    source_type = "url"
+                source_type = _infer_source_type(source_url, original_url)

Also applies to: 338-345

python/src/server/services/crawling/document_storage_operations.py (1)

232-236: Consider retries around DB writes to harden against transient failures

Supabase/PostgREST can intermittently fail. A light exponential backoff around update_source_info and the fallback upsert would improve resilience. Optional but recommended.

Also applies to: 250-286

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5e603ea and 76bf0f0.

📒 Files selected for processing (3)
  • python/src/server/services/crawling/document_storage_operations.py (4 hunks)
  • python/src/server/services/source_management_service.py (9 hunks)
  • python/tests/test_source_url_shadowing.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/tests/test_source_url_shadowing.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/tests/test_source_url_shadowing.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_source_url_shadowing.py
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
🧬 Code graph analysis (2)
python/tests/test_source_url_shadowing.py (1)
python/src/server/services/crawling/document_storage_operations.py (2)
  • _create_source_records (170-309)
  • process_and_store_documents (34-168)
python/src/server/services/crawling/document_storage_operations.py (4)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/services/crawling/code_extraction_service.py (2)
  • CodeExtractionService (19-1528)
  • extract_and_store_code_examples (134-200)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (223-235)
  • safe_logfire_error (238-250)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (36-140)
  • update_source_info (250-392)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (5)
python/src/server/services/source_management_service.py (1)

208-213: Nice: using source_display_name for better title prompts

Using a human-friendly name when available should materially improve title quality. Good call.

python/tests/test_source_url_shadowing.py (1)

27-37: Good harnessing of _create_source_records to capture arguments

Overriding the coroutine directly keeps the flow realistic while letting you assert the source_url propagation. Nice.

python/src/server/services/crawling/document_storage_operations.py (3)

100-117: Per-chunk metadata now correctly uses doc_url

This ensures each chunk is tied to its originating document. Matches the intent of the PR tests.


170-178: API surface improvement: propagate source_url/source_display_name

Extending the internal _create_source_records to carry the source context is aligned with the PR goal. Good change.


311-319: Explicit source_id threading into code extraction: good

Passing source_id through ensures code examples are correctly scoped and prevents cross-source contamination.

Also applies to: 334-336

Comment thread python/src/server/services/crawling/document_storage_operations.py
Comment thread python/src/server/services/crawling/document_storage_operations.py Outdated
Comment thread python/src/server/services/source_management_service.py Outdated
Comment on lines +39 to +41
with patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase') as mock_add:
mock_add.return_value = None


⚠️ Potential issue

Patch awaited function with AsyncMock to avoid TypeError

add_documents_to_supabase is awaited; patching it with a regular Mock causes TypeError: object MagicMock can't be used in 'await' expression. Use AsyncMock.

-        with patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase') as mock_add:
-            mock_add.return_value = None
+        with patch(
+            'src.server.services.crawling.document_storage_operations.add_documents_to_supabase',
+            new_callable=AsyncMock,
+        ) as mock_add:
+            mock_add.return_value = None

Optionally assert it was awaited once to harden the test:

mock_add.assert_awaited_once()
🤖 Prompt for AI Agents
In python/tests/test_source_url_shadowing.py around lines 39 to 41, the test
patches add_documents_to_supabase with a regular Mock which is awaited in the
code under test causing a TypeError; replace the patch with an AsyncMock so it
can be awaited (e.g., patch(..., new_callable=AsyncMock) or set mock_add =
AsyncMock()), and optionally add mock_add.assert_awaited_once() to verify it was
awaited.


doc_storage._create_source_records = mock_create_source_records

with patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase'):

⚠️ Potential issue

Same AsyncMock fix for the second test

Ensure the awaited function is patched as async in both tests.

-        with patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase'):
+        with patch(
+            'src.server.services.crawling.document_storage_operations.add_documents_to_supabase',
+            new_callable=AsyncMock,
+        ):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase'):
with patch(
'src.server.services.crawling.document_storage_operations.add_documents_to_supabase',
new_callable=AsyncMock,
):
🤖 Prompt for AI Agents
python/tests/test_source_url_shadowing.py around line 107: the test patches
add_documents_to_supabase with a regular patch but the function is awaited in
the code, so change the patch to provide an AsyncMock (e.g.,
patch('src.server.services.crawling.document_storage_operations.add_documents_to_supabase',
new_callable=AsyncMock) and ensure AsyncMock is imported from unittest.mock at
the top of the test file); update the second test to use the same
AsyncMock-style patch so the awaited call is properly mocked.

Wirasm added 3 commits August 25, 2025 15:28
- Implement proper URL canonicalization to prevent duplicate sources
  - Remove trailing slashes (except root)
  - Remove URL fragments
  - Remove tracking parameters (utm_*, gclid, fbclid, etc.)
  - Sort query parameters for consistency
  - Remove default ports (80 for HTTP, 443 for HTTPS)
  - Normalize scheme and domain to lowercase
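The canonicalization rules listed above could be sketched with urllib.parse; this is an illustrative approximation under assumed rules, not the actual URLHandler implementation:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}

def canonicalize_url(url: str) -> str:
    """Normalize a URL so trivial variations map to one source (sketch)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Strip default ports (80 for HTTP, 443 for HTTPS)
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname.lower()
    # Drop tracking params, sort the rest for consistency
    query_pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(query_pairs))
    # Remove trailing slash (except root) and drop the fragment
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, query, ""))
```

Feeding canonicalized URLs into the hash-based ID generator is what prevents the duplicate-source constraint violations the commit mentions.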

- Fix avg_chunks_per_doc calculation to avoid division by zero
  - Track processed_docs count separately from total crawl_results
  - Handle all-empty document sets gracefully
  - Show processed/total in logs for better visibility

- Add comprehensive tests for both fixes
  - 10 test cases for URL canonicalization edge cases
  - 4 test cases for document metrics calculation

This prevents database constraint violations when crawling the same
content with URL variations and provides accurate metrics in logs.
- Run extract_source_summary in thread pool using asyncio.to_thread
- Prevents blocking the async event loop during AI summary generation
- Preserves exact error handling and fallback behavior
- Variables (source_id, combined_content) properly passed to thread

Added comprehensive tests verifying:
- Function runs in thread without blocking
- Error handling works correctly with fallback
- Multiple sources can be processed
- Thread safety with variable passing
- Run update_source_info in thread pool using asyncio.to_thread
- Prevents blocking the async event loop during database operations
- Preserves exact error handling and fallback behavior
- All kwargs properly passed to thread execution

Added comprehensive tests verifying:
- Function runs in thread without blocking
- Error handling triggers fallback correctly
- All kwargs are preserved when passed to thread
- Existing extract_source_summary tests still pass
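The to_thread offloading pattern these commits describe can be sketched as follows; extract_source_summary is stubbed here and the surrounding names are assumptions:

```python
import asyncio

def extract_source_summary(source_id: str, content: str) -> str:
    """Stand-in for the synchronous, potentially slow AI summary call."""
    return f"Summary of {source_id}: {content[:40]}"

async def create_source_record(source_id: str, combined_content: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays
    # responsive; arguments are passed positionally into the thread.
    try:
        summary = await asyncio.to_thread(
            extract_source_summary, source_id, combined_content
        )
    except Exception:
        # Fallback mirrors the preserved error handling in the commit
        summary = f"Content from {source_id}"
    return summary

print(asyncio.run(create_source_record("abc123", "Example docs content")))
```

asyncio.to_thread forwards its positional and keyword arguments to the target function, which is why the commits can note that source_id and combined_content are "properly passed to thread" without any extra plumbing.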
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
python/src/server/services/crawling/document_storage_operations.py (2)

145-148: Accurate averages and div-by-zero guard — nice fix

Switching to processed_docs fixes both accuracy and ZeroDivisionError risks flagged earlier. LGTM.


245-257: Offload DB upsert/update to thread to keep async loop responsive

update_source_info performs network I/O; run it in a worker thread. This aligns with how extract_source_summary is offloaded.

-            try:
-                update_source_info(
+            try:
+                await asyncio.to_thread(
+                    update_source_info,
                     client=self.supabase_client,
                     source_id=source_id,
                     summary=summary,
                     word_count=source_id_word_counts[source_id],
                     content=combined_content,
-                    knowledge_type=request.get("knowledge_type", "technical"),
+                    knowledge_type=request.get("knowledge_type", "documentation"),
                     tags=request.get("tags", []),
                     update_frequency=0,  # Set to 0 since we're using manual refresh
                     original_url=request.get("url"),  # Store the original crawl URL
                     source_url=source_url,
                     source_display_name=source_display_name,
-                )
+                )
🧹 Nitpick comments (8)
python/src/server/services/crawling/document_storage_operations.py (5)

91-93: Avoid blocking the event loop during chunking

smart_chunk_text is synchronous; offload to a worker thread to prevent event-loop starvation on large documents.

-            chunks = storage_service.smart_chunk_text(markdown_content, chunk_size=5000)
+            chunks = await asyncio.to_thread(
+                storage_service.smart_chunk_text, markdown_content, chunk_size=5000
+            )

75-79: Support both sync and async cancellation checks

cancellation_check may be a coroutine. Detect and await it when needed to avoid unhandled coroutine warnings or no-ops.

-            if cancellation_check:
-                cancellation_check()
+            if cancellation_check:
+                if asyncio.iscoroutinefunction(cancellation_check):
+                    await cancellation_check()
+                else:
+                    cancellation_check()
@@
-                if cancellation_check and i % 10 == 0:  # Check every 10 chunks
-                    cancellation_check()
+                if cancellation_check and i % 10 == 0:  # Check every 10 chunks
+                    if asyncio.iscoroutinefunction(cancellation_check):
+                        await cancellation_check()
+                    else:
+                        cancellation_check()

Also applies to: 100-103


234-236: Preserve stack traces in logs; include structured context

Logging only str(e) loses the traceback. If safe_logfire_error supports it, pass exc_info=True and include structured context (source_id, url, counts).

Example (adjust to your logging helper signature):

safe_logfire_error(
    f"Failed to generate AI summary for '{source_id}'",
    exc_info=True,
)
# Similarly for source upsert/update failures and double-failure raise

Also applies to: 260-262, 289-295


297-317: Reduce DB round-trips: verify sources in bulk

Querying per source_id is N round-trips. Use a single in_ filter and compute missing IDs.

-        if unique_source_ids:
-            for source_id in unique_source_ids:
-                try:
-                    source_check = (
-                        self.supabase_client.table("archon_sources")
-                        .select("source_id")
-                        .eq("source_id", source_id)
-                        .execute()
-                    )
-                    if not source_check.data:
-                        raise Exception(
-                            f"Source record verification failed - '{source_id}' does not exist in sources table"
-                        )
-                    safe_logfire_info(f"Source record verified for '{source_id}'")
-                except Exception as e:
-                    safe_logfire_error(f"Source verification failed for '{source_id}': {str(e)}")
-                    raise
-            safe_logfire_info(
-                f"All {len(unique_source_ids)} source records verified - proceeding with document storage"
-            )
+        if unique_source_ids:
+            resp = (
+                self.supabase_client.table("archon_sources")
+                .select("source_id")
+                .in_("source_id", list(unique_source_ids))
+                .execute()
+            )
+            existing = {row["source_id"] for row in (resp.data or [])}
+            missing = set(unique_source_ids) - existing
+            if missing:
+                raise Exception(f"Source record verification failed - missing: {sorted(missing)}")
+            safe_logfire_info(
+                f"All {len(unique_source_ids)} source records verified - proceeding with document storage"
+            )

109-121: Unify Knowledge Type Default Across the Pipeline

There are scattered literal defaults for knowledge_type—some components default to "documentation", others to "technical". This inconsistency makes it harder to reason about analytics, filters, and test expectations. To clean this up, I recommend introducing a single constant and using it everywhere:

• Define in a central module (e.g. python/src/server/constants.py):

DEFAULT_KNOWLEDGE_TYPE = "documentation"  # or "technical", per team consensus

• Replace all literal defaults and metadata assignments with this constant:
– Pydantic model in api_routes/knowledge_api.py (lines ~64, 74)
– MCP client in services/mcp_service_client.py (line ~61)
– Source management service in services/source_management_service.py (lines ~146, 256, 576)
– Crawling storage operations in services/crawling/document_storage_operations.py (lines ~115, 251, 272)
– Storage services in services/storage/storage_services.py (lines ~27, 212)
– Knowledge item service in services/knowledge/knowledge_item_service.py (lines ~156, 374)

• In each processing method, compute once at entry:

kt = request.get("knowledge_type", DEFAULT_KNOWLEDGE_TYPE)

then pass knowledge_type=kt and set metadata["knowledge_type"] = kt.

• Update all tests to reference the constant or to expect the unified default:
python/tests/test_source_url_shadowing.py (line 62)
python/tests/test_async_source_summary.py (line 56)
python/tests/test_crawl_orchestration_isolated.py (lines 200, 266, 313, 429)

This refactor ensures one source of truth for the default and removes repeated string literals, reducing future drift.

python/tests/test_async_source_summary.py (3)

19-83: Solid non-blocking test; also assert new context propagation

Great coverage of asyncio.to_thread behavior. Add assertions to verify source_url and source_display_name are forwarded to update_source_info.

-            with patch('src.server.services.crawling.document_storage_operations.update_source_info'):
+            with patch('src.server.services.crawling.document_storage_operations.update_source_info') as mock_update:
@@
-                        assert total_time < 1.0, "Should complete in reasonable time"
+                        assert total_time < 1.0, "Should complete in reasonable time"
+
+                        # Ensure new context fields are propagated
+                        call_kwargs = mock_update.call_args.kwargs
+                        assert call_kwargs["source_url"] == "https://example.com"
+                        assert call_kwargs["source_display_name"] == "Example Site"

40-42: Remove unused mocks of smart_chunk_text

These tests call _create_source_records directly; chunking isn't invoked here. Cleaning these lines reduces noise.

-        # Mock the storage service
-        doc_storage.doc_storage_service.smart_chunk_text = Mock(
-            return_value=["chunk1", "chunk2"]
-        )

Apply similar removals in the other two tests.

Also applies to: 97-99, 151-153, 213-216


135-150: Test name/docstring imply concurrency that isn’t present

The current implementation awaits each to_thread call sequentially; sources aren't processed concurrently. Either rename the test or parallelize summary generation with gather + a semaphore.

Option A — rename:

-    async def test_multiple_sources_concurrent_summaries(self):
-        """Test that multiple source summaries are generated concurrently."""
+    async def test_multiple_sources_threaded_execution(self):
+        """Test that each source summary runs in a worker thread without blocking the loop."""

Option B — make code concurrent (then update test to assert overlaps):

  • In _create_source_records, build tasks per source_id with asyncio.to_thread and await asyncio.gather with a bounded concurrency to avoid API/db overload.

Also applies to: 182-193

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 698e3b9 and f5de76d.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/document_storage_operations.py (4 hunks)
  • python/tests/test_async_source_summary.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/tests/test_async_source_summary.py
  • python/src/server/services/crawling/document_storage_operations.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/tests/test_async_source_summary.py
  • python/src/server/services/crawling/document_storage_operations.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_async_source_summary.py
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/document_storage_operations.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/document_storage_operations.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/document_storage_operations.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/document_storage_operations.py
🧬 Code graph analysis (2)
python/tests/test_async_source_summary.py (1)
python/src/server/services/crawling/document_storage_operations.py (2)
  • DocumentStorageOperations (18-346)
  • _create_source_records (175-317)
python/src/server/services/crawling/document_storage_operations.py (3)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/services/crawling/code_extraction_service.py (1)
  • extract_and_store_code_examples (134-200)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (36-140)
  • update_source_info (250-392)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (5)
python/src/server/services/crawling/document_storage_operations.py (3)

135-141: Create source records before chunk upsert — good ordering

Creating/updating sources first prevents FK violations and aligns with the new 3-field source model. LGTM.


227-233: Async offload of AI summary generation — good use of asyncio.to_thread

This prevents blocking the event loop while keeping synchronous code. LGTM.


319-347: Code-extraction delegation matches new source_id threading

The pass-through includes source_id and progress window; consistent with CodeExtractionService expectations. LGTM.

python/tests/test_async_source_summary.py (2)

86-133: Error-path assertions look good

You verify error logging and fallback summary use before update_source_info. This exercises the resilience path effectively. LGTM.


195-247: Thread-safety/argument propagation test is focused and valuable

Capturing source_id and content length validates correct variable capture into to_thread calls. LGTM.

Comment on lines +79 to +87
            doc_url = doc.get("url", "")
            markdown_content = doc.get("markdown", "")

            if not markdown_content:
                continue

            # Increment processed document count
            processed_docs += 1


🛠️ Refactor suggestion

Guard against missing URL and empty content; don’t persist bad documents

Persisting chunks with an empty URL will corrupt mappings (url_to_full_document) and downstream code-extraction lookups. Skip such docs and log context.

-            doc_url = doc.get("url", "")
-            markdown_content = doc.get("markdown", "")
+            doc_url = doc.get("url", "")
+            markdown_content = doc.get("markdown", "")
+
+            # Validate required fields
+            if not doc_url:
+                safe_logfire_error("Skipping document with missing URL; cannot store or map content")
+                continue
-            if not markdown_content:
-                continue
+            if not markdown_content:
+                safe_logfire_error(f"Skipping URL '{doc_url}' due to empty markdown content")
+                continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
            doc_url = doc.get("url", "")
            markdown_content = doc.get("markdown", "")

            # Validate required fields
            if not doc_url:
                safe_logfire_error("Skipping document with missing URL; cannot store or map content")
                continue
            if not markdown_content:
                safe_logfire_error(f"Skipping URL '{doc_url}' due to empty markdown content")
                continue

            # Increment processed document count
            processed_docs += 1
🤖 Prompt for AI Agents
In python/src/server/services/crawling/document_storage_operations.py around
lines 79 to 87, the code currently skips documents with empty markdown but does
not guard against missing or empty doc["url"] which can lead to persisting
chunks with an empty URL and corrupt url_to_full_document mappings; update the
loop to skip any doc where doc.get("url") is falsy (empty or missing) before
processing, increment processed_docs only for valid docs, and add a warning log
(including whatever identifier or snippet of the doc is available, e.g., source
or title) when skipping a doc due to missing/empty URL or empty markdown so you
have context for debugging.

- Replace INSERT with UPSERT for new sources to prevent PRIMARY KEY violations
- Handles concurrent crawls attempting to create the same source
- Maintains existing UPDATE behavior for sources that already exist

Added comprehensive tests verifying:
- Concurrent source creation doesn't fail
- Upsert is used for new sources (not insert)
- Update is still used for existing sources
- Async concurrent operations work correctly
- Race conditions with delays are handled

This prevents database constraint errors when multiple crawls target
the same URL simultaneously.
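The INSERT-to-UPSERT change above can be illustrated with an in-memory stand-in for the sources table (the real code goes through the Supabase client; the dict-plus-lock here is only a model of insert-or-update semantics keyed on the primary key):

```python
import threading

# In-memory stand-in for the sources table, keyed on source_id (the PK).
table: dict[str, dict] = {}
lock = threading.Lock()


def upsert_source(row: dict) -> None:
    # Insert-or-update: a second concurrent writer merges into the existing
    # row instead of raising the PRIMARY KEY violation a plain INSERT would.
    with lock:
        table[row["source_id"]] = {**table.get(row["source_id"], {}), **row}


threads = [
    threading.Thread(target=upsert_source, args=({"source_id": "abc123", "title": f"t{i}"},))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All five concurrent writers succeed and exactly one row survives — the behavior the commit's tests assert for concurrent crawls of the same URL.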

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/source_management_service.py (1)

283-307: Preserve and merge existing metadata on UPDATE instead of clobbering.

The current update replaces metadata entirely, potentially dropping fields added elsewhere. Merge existing metadata and override only known keys.

-        existing_source = (
-            client.table("archon_sources").select("title").eq("source_id", source_id).execute()
-        )
+        existing_source = (
+            client.table("archon_sources").select("title,metadata").eq("source_id", source_id).execute()
+        )
@@
-            metadata = {
-                "knowledge_type": knowledge_type,
-                "tags": tags or [],
-                "source_type": source_type,
-                "auto_generated": False,  # Mark as not auto-generated since we're preserving
-                "update_frequency": update_frequency,
-            }
+            existing_md = existing_source.data[0].get("metadata", {}) if existing_source.data else {}
+            metadata = {
+                **(existing_md or {}),
+                "knowledge_type": knowledge_type,
+                "tags": (tags if tags is not None else existing_md.get("tags", [])) if isinstance(existing_md, dict) else (tags or []),
+                "source_type": source_type,
+                "auto_generated": False,
+                "update_frequency": update_frequency,
+            }
             search_logger.info(f"Updating existing source {source_id} metadata: knowledge_type={knowledge_type}")
             if original_url:
                 metadata["original_url"] = original_url
♻️ Duplicate comments (4)
python/src/server/services/source_management_service.py (3)

36-38: Use Optional typing for provider to satisfy mypy under Python 3.12.

provider: str = None violates strict typing.

-def extract_source_summary(
-    source_id: str, content: str, max_length: int = 500, provider: str = None
-) -> str:
+def extract_source_summary(
+    source_id: str, content: str, max_length: int = 500, provider: str | None = None
+) -> str:

143-151: Type correctness + docstring completeness for generate_source_title_and_metadata.

Mirror the Optional typing for provider and document both provider and source_display_name in Args.

-def generate_source_title_and_metadata(
+def generate_source_title_and_metadata(
     source_id: str,
     content: str,
     knowledge_type: str = "technical",
     tags: list[str] | None = None,
-    provider: str = None,
+    provider: str | None = None,
     source_display_name: str | None = None,
 ) -> tuple[str, dict[str, Any]]:
@@
-    Args:
+    Args:
         source_id: The source ID (domain)
         content: Sample content from the source
         knowledge_type: Type of knowledge (default: "technical")
         tags: Optional list of tags
+        provider: Optional provider override
+        source_display_name: Optional human-friendly name to guide title generation

309-314: Do not set updated_at to the string "now()"; rely on a DB trigger.

PostgREST treats "now()" as a literal; this will corrupt timestamps. Remove updated_at from update payload and let the DB trigger populate it.

             update_data = {
                 "summary": summary,
                 "total_word_count": word_count,
                 "metadata": metadata,
-                "updated_at": "now()",
             }
python/src/server/services/crawling/document_storage_operations.py (1)

79-87: Guard against missing URL; don’t persist chunks with empty doc_url.

Empty URLs will corrupt url_to_full_document and break downstream joins. Add a strict guard and log context before continuing.

-            doc_url = doc.get("url", "")
-            markdown_content = doc.get("markdown", "")
+            doc_url = doc.get("url", "")
+            markdown_content = doc.get("markdown", "")
+
+            # Validate required fields
+            if not doc_url:
+                safe_logfire_error("Skipping document with missing URL; cannot store or map content")
+                continue
             if not markdown_content:
                 continue
🧹 Nitpick comments (8)
python/tests/test_source_race_condition.py (1)

30-43: Also assert new columns are propagated in upserts.

Since this PR adds source_url and source_display_name, verifying they are present in upsert payloads would harden the regression surface for this class of bugs.

Apply:

-        def track_upsert(data):
-            """Track upsert calls."""
-            successful_creates.append(data["source_id"])
+        def track_upsert(data):
+            """Track upsert calls."""
+            successful_creates.append(data)  # capture full payload for validation
             return Mock(execute=mock_execute)
@@
-        assert len(successful_creates) == 5, "All 5 attempts should succeed"
-        assert all(sid == "test_source_123" for sid in successful_creates)
+        assert len(successful_creates) == 5, "All 5 attempts should succeed"
+        assert all(item["source_id"] == "test_source_123" for item in successful_creates)
+        assert all(item["source_url"] == "https://example.com" for item in successful_creates)
+        # Display name is accepted but title preservation happens server-side; we still verify propagation
+        assert all("source_display_name" in item for item in successful_creates)
python/tests/test_async_source_summary.py (3)

74-85: Relax timing threshold to avoid CI flakiness.

The function is async but still does real work and scheduling; <1.0s can intermittently fail under load.

-                        assert total_time < 1.0, "Should complete in reasonable time"
+                        assert total_time < 2.0, "Should complete in reasonable time"

136-141: Docstring overpromises concurrency; clarify behavior.

Summaries run in threads but sequentially per source (awaited in a loop). Adjust wording to reduce confusion.

-        """Test that multiple source summaries are generated concurrently."""
+        """Test that multiple source summaries are generated in threads without blocking the event loop (sequential per source)."""

298-304: Relax timing threshold to reduce false negatives.

Similar rationale as earlier test: <1.0s can be too strict for some runners.

-                        assert total_time < 1.0, "Should complete in reasonable time"
+                        assert total_time < 2.0, "Should complete in reasonable time"
python/src/server/services/source_management_service.py (3)

289-296: Normalize source_type inference via URL scheme; avoid ad hoc checks.

file:// is handled, but other schemes (s3://, gs://, data:, ftp://) will be misclassified as "url" or "file". Use urllib.parse to infer a stable source_type and keep logic in one helper.

Example:

+from urllib.parse import urlparse
+
+def _infer_source_type(source_url: str | None, original_url: str | None) -> str:
+    url = source_url or original_url or ""
+    scheme = urlparse(url).scheme.lower()
+    if scheme == "file":
+        return "file"
+    # treat known remote schemes as url
+    return "url"

@@
-            if source_url and source_url.startswith("file://"):
-                source_type = "file"
-            elif original_url and original_url.startswith("file://"):
-                source_type = "file"
-            else:
-                source_type = "url"
+            source_type = _infer_source_type(source_url, original_url)
@@
-                if source_url and source_url.startswith("file://"):
-                    source_type = "file"
-                elif original_url and original_url.startswith("file://"):
-                    source_type = "file"
-                else:
-                    source_type = "url"
+                source_type = _infer_source_type(source_url, original_url)
@@
-                if source_url and source_url.startswith("file://"):
-                    metadata["source_type"] = "file"
-                elif original_url and original_url.startswith("file://"):
-                    metadata["source_type"] = "file"
-                else:
-                    metadata["source_type"] = "url"
+                metadata["source_type"] = _infer_source_type(source_url, original_url)

Also applies to: 339-345, 359-364


100-116: Add retries with backoff for LLM calls; fail with contextual errors.

External API calls should use retries and surface context on failure per coding guidelines. Wrap chat completions with a bounded backoff (e.g., tenacity).

Illustrative change:

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, min=0.5, max=4),
    retry=retry_if_exception_type(Exception),
)
def _create_summary_with_llm(client, model, messages):
    return client.chat.completions.create(model=model, messages=messages)

# then use:
response = _create_summary_with_llm(client, model_choice, messages)

Log the source_id and content length in the exception path for debugging.

Also applies to: 218-227


266-275: Docstring should include new args: original_url, source_url, source_display_name.

Keep the function contract self-documenting.

     Args:
         client: Supabase client
         source_id: The source ID (domain)
         summary: Summary of the source
         word_count: Total word count for the source
         content: Sample content for title generation
         knowledge_type: Type of knowledge
         tags: List of tags
         update_frequency: Update frequency in days
+        original_url: Original crawl URL for provenance
+        source_url: Canonical URL of the source (used for type inference and UI)
+        source_display_name: Human-friendly name to store as title when available
python/src/server/services/crawling/document_storage_operations.py (1)

220-224: Avoid leading space when building combined_content.

Join chunks via list accumulation for cleaner content.

-            combined_content = ""
-            for chunk in source_contents[:3]:  # First 3 chunks for this source
-                if len(combined_content) + len(chunk) < 15000:
-                    combined_content += " " + chunk
-                else:
-                    break
+            parts = []
+            for chunk in source_contents[:3]:  # First 3 chunks for this source
+                if sum(len(p) for p in parts) + len(chunk) < 15000:
+                    parts.append(chunk)
+                else:
+                    break
+            combined_content = " ".join(parts)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f5de76d and 52187e2.

📒 Files selected for processing (4)
  • python/src/server/services/crawling/document_storage_operations.py (4 hunks)
  • python/src/server/services/source_management_service.py (8 hunks)
  • python/tests/test_async_source_summary.py (1 hunks)
  • python/tests/test_source_race_condition.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/tests/test_async_source_summary.py
  • python/tests/test_source_race_condition.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/tests/test_async_source_summary.py
  • python/tests/test_source_race_condition.py
  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Place backend tests under python/tests/

Files:

  • python/tests/test_async_source_summary.py
  • python/tests/test_source_race_condition.py
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/document_storage_operations.py
  • python/src/server/services/source_management_service.py
🧠 Learnings (1)
📚 Learning: 2025-08-20T19:38:04.050Z
Learnt from: Chillbruhhh
PR: coleam00/Archon#378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.050Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.

Applied to files:

  • python/src/server/services/source_management_service.py
🧬 Code graph analysis (3)
python/tests/test_async_source_summary.py (1)
python/src/server/services/crawling/document_storage_operations.py (2)
  • DocumentStorageOperations (18-348)
  • _create_source_records (175-319)
python/tests/test_source_race_condition.py (1)
python/src/server/services/source_management_service.py (1)
  • update_source_info (250-392)
python/src/server/services/crawling/document_storage_operations.py (4)
python/src/server/services/storage/storage_services.py (1)
  • DocumentStorageService (19-284)
python/src/server/services/crawling/code_extraction_service.py (2)
  • CodeExtractionService (19-1528)
  • extract_and_store_code_examples (134-200)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (223-235)
  • safe_logfire_error (238-250)
python/src/server/services/source_management_service.py (2)
  • extract_source_summary (36-140)
  • update_source_info (250-392)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (3)
python/tests/test_source_race_condition.py (1)

20-76: Good concurrency coverage for upsert path; assertions are meaningful.

The test validates concurrent upserts target the same source_id and avoid PK violations. Mock chaining is minimal and focused; assertions read well.

python/src/server/services/crawling/document_storage_operations.py (2)

145-148: Nice: accurate averages and processed count.

Division-by-zero risk is eliminated and the log now reports processed docs vs requested.


246-259: Good: offloading DB updates to a thread prevents event-loop stalls.

Using asyncio.to_thread for update_source_info aligns with the async guidelines.

Comment on lines +195 to +269
    def test_race_condition_with_delay(self):
        """Test race condition with simulated delay between check and create."""
        import time

        mock_client = Mock()

        # Track timing of operations
        check_times = []
        create_times = []
        source_created = threading.Event()

        def delayed_select(*args):
            """Return a mock that simulates SELECT with delay."""
            mock_select = Mock()

            def eq_mock(*args):
                mock_eq = Mock()
                mock_eq.execute = lambda: delayed_check()
                return mock_eq

            mock_select.eq = eq_mock
            return mock_select

        def delayed_check():
            """Simulate SELECT execution with delay."""
            check_times.append(time.time())
            result = Mock()
            # First thread doesn't see the source
            if not source_created.is_set():
                time.sleep(0.01)  # Small delay to let both threads check
                result.data = []
            else:
                # Subsequent checks would see it (but we use upsert so this doesn't matter)
                result.data = [{"source_id": "race_source", "title": "Existing", "metadata": {}}]
            return result

        def track_upsert(data):
            """Track upsert and set event."""
            create_times.append(time.time())
            source_created.set()
            return Mock(execute=Mock(return_value=Mock(data=[])))

        # Set up table mock to return our custom select mock
        mock_client.table.return_value.select = delayed_select
        mock_client.table.return_value.upsert = track_upsert

        errors = []

        def create_with_error_tracking(thread_id):
            try:
                update_source_info(
                    client=mock_client,
                    source_id="race_source",
                    summary="Race summary",
                    word_count=100,
                    content="Race content",
                    knowledge_type="documentation",
                    source_display_name="Race Display Name"  # Will be used as title
                )
            except Exception as e:
                errors.append((thread_id, str(e)))

        # Run 2 threads that will both check before either creates
        with ThreadPoolExecutor(max_workers=2) as executor:
            futures = [
                executor.submit(create_with_error_tracking, 1),
                executor.submit(create_with_error_tracking, 2)
            ]
            for future in futures:
                future.result()

        # Both should succeed with upsert (no errors)
        assert len(errors) == 0, f"No errors should occur with upsert: {errors}"
        assert len(check_times) == 2, "Both threads should check"
        assert len(create_times) == 2, "Both threads should attempt create/upsert"


🛠️ Refactor suggestion

Replace sleep-based race with a barrier for deterministic concurrency.

time.sleep(0.01) can flake on slow CI; a barrier ensures both threads hit SELECT before any proceeds to UPSERT.

Apply:

         source_created = threading.Event()
+        start_gate = threading.Barrier(2)
@@
         def delayed_check():
             """Simulate SELECT execution with delay."""
             check_times.append(time.time())
             result = Mock()
-            # First thread doesn't see the source
-            if not source_created.is_set():
-                time.sleep(0.01)  # Small delay to let both threads check
-                result.data = []
+            # Synchronize both threads so both perform the existence check before any upsert
+            try:
+                start_gate.wait(timeout=1.0)
+            except threading.BrokenBarrierError:
+                pass
+            if not source_created.is_set():
+                result.data = []
             else:
                 # Subsequent checks would see it (but we use upsert so this doesn't matter)
                 result.data = [{"source_id": "race_source", "title": "Existing", "metadata": {}}]
             return result
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def test_race_condition_with_delay(self):
        """Test race condition with simulated delay between check and create."""
        import time

        mock_client = Mock()

        # Track timing of operations
        check_times = []
        create_times = []
        source_created = threading.Event()
        start_gate = threading.Barrier(2)

        def delayed_select(*args):
            """Return a mock that simulates SELECT with delay."""
            mock_select = Mock()

            def eq_mock(*args):
                mock_eq = Mock()
                mock_eq.execute = lambda: delayed_check()
                return mock_eq

            mock_select.eq = eq_mock
            return mock_select

        def delayed_check():
            """Simulate SELECT execution with delay."""
            check_times.append(time.time())
            result = Mock()
            # Synchronize both threads so both perform the existence check before any upsert
            try:
                start_gate.wait(timeout=1.0)
            except threading.BrokenBarrierError:
                pass
            if not source_created.is_set():
                result.data = []
            else:
                # Subsequent checks would see it (but we use upsert so this doesn't matter)
                result.data = [{"source_id": "race_source", "title": "Existing", "metadata": {}}]
            return result

        def track_upsert(data):
            """Track upsert and set event."""
            create_times.append(time.time())
            source_created.set()
            return Mock(execute=Mock(return_value=Mock(data=[])))

        # Set up table mock to return our custom select mock
        mock_client.table.return_value.select = delayed_select
        mock_client.table.return_value.upsert = track_upsert

        errors = []

        def create_with_error_tracking(thread_id):
            try:
                update_source_info(
                    client=mock_client,
                    source_id="race_source",
                    summary="Race summary",
                    word_count=100,
                    content="Race content",
                    knowledge_type="documentation",
                    source_display_name="Race Display Name"  # Will be used as title
                )
            except Exception as e:
                errors.append((thread_id, str(e)))

        # Run 2 threads that will both check before either creates
        with ThreadPoolExecutor(max_workers=2) as executor:
            futures = [
                executor.submit(create_with_error_tracking, 1),
                executor.submit(create_with_error_tracking, 2)
            ]
            for future in futures:
                future.result()

        # Both should succeed with upsert (no errors)
        assert len(errors) == 0, f"No errors should occur with upsert: {errors}"
        assert len(check_times) == 2, "Both threads should check"
        assert len(create_times) == 2, "Both threads should attempt create/upsert"
🤖 Prompt for AI Agents
In python/tests/test_source_race_condition.py around lines 195 to 269, the test
uses time.sleep(0.01) to force a race which is flaky; replace the sleep-based
synchronization with a threading.Barrier to deterministically ensure both
threads reach the SELECT-check before either proceeds to UPSERT. Create a
barrier = threading.Barrier(2) in the test scope, have delayed_check append the
timestamp, call barrier.wait() instead of time.sleep, and then decide the mocked
result based on source_created.is_set(); ensure nested functions reference the
barrier and add any missing threading import if necessary.

Wirasm added 3 commits August 26, 2025 12:10
Add MigrationBanner component with clear user instructions for database schema updates. Add useMigrationStatus hook for periodic health check monitoring with graceful error handling.
Add migration status monitoring and banner display to App.tsx. Shows migration banner when database schema updates are required.
Add detailed Docker restart instructions and migration script guidance. Improves user experience when encountering startup failures.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
archon-ui-main/src/components/BackendStartupError.tsx (1)

43-48: Expand “Common issues” to include required env vars and clarify which SQL script to run

Good addition. Two tweaks will reduce support churn:

  • Call out missing SUPABASE_URL and SUPABASE_SERVICE_KEY explicitly (from team learnings).
  • Clarify which SQL script applies to new installs vs upgrades to avoid confusion with the new migration flow.

Apply this diff:

-                <ul className="text-yellow-200 text-sm mt-1 space-y-1 list-disc list-inside">
-                  <li>Using an ANON key instead of SERVICE key in your .env file</li>
-                  <li>Database not set up - run <code className="bg-yellow-800/50 px-1 rounded">migration/complete_setup.sql</code> in Supabase SQL Editor</li>
-                </ul>
+                <ul className="text-yellow-200 text-sm mt-1 space-y-1 list-disc list-inside">
+                  <li>Using an ANON key instead of SERVICE key in your .env file</li>
+                  <li>Missing required env vars in .env: <code className="bg-yellow-800/50 px-1 rounded">SUPABASE_URL</code> and <code className="bg-yellow-800/50 px-1 rounded">SUPABASE_SERVICE_KEY</code></li>
+                  <li>
+                    Database not set up — run
+                    {' '}
+                    <code className="bg-yellow-800/50 px-1 rounded">migration/complete_setup.sql</code>
+                    {' '}for new installs, or
+                    {' '}
+                    <code className="bg-yellow-800/50 px-1 rounded">migration/add_source_url_display_name.sql</code>
+                    {' '}when upgrading an existing install (per the migration banner/health check).
+                  </li>
+                </ul>
archon-ui-main/src/hooks/useMigrationStatus.ts (1)

16-41: Harden fetch handling: check response.ok and avoid state updates after unmount

Right now a non-2xx response still goes to response.json() and a late-resolving fetch can set state on an unmounted component. Add response.ok handling and an AbortController.

   useEffect(() => {
-    const checkMigrationStatus = async () => {
+    const controller = new AbortController();
+    const checkMigrationStatus = async () => {
       try {
-        const response = await fetch('/api/health');
-        const healthData = await response.json();
+        const response = await fetch('/api/health', { signal: controller.signal, cache: 'no-store' });
+        if (!response.ok) {
+          throw new Error(`HTTP ${response.status} ${response.statusText}`);
+        }
+        const healthData = await response.json();
         
         if (healthData.status === 'migration_required') {
           setStatus({
             migrationRequired: true,
             message: healthData.message,
             instructions: healthData.migration_instructions,
             loading: false,
           });
         } else {
           setStatus({
             migrationRequired: false,
             loading: false,
           });
         }
       } catch (error) {
-        console.error('Failed to check migration status:', error);
-        setStatus({
-          migrationRequired: false,
-          loading: false,
-        });
+        // Ignore aborts; log others with endpoint context
+        // @ts-expect-error name is present on DOMException
+        if (error?.name === 'AbortError') return;
+        console.error('Failed to check migration status (/api/health):', error);
+        setStatus((prev) => ({ ...prev, loading: false }));
       }
     };
 
     checkMigrationStatus();
     
     // Check periodically (every 30 seconds) to detect when migration is complete
-    const interval = setInterval(checkMigrationStatus, 30000);
+    const interval = setInterval(checkMigrationStatus, 30000);
     
-    return () => clearInterval(interval);
+    return () => {
+      clearInterval(interval);
+      controller.abort();
+    };
   }, []);
archon-ui-main/src/App.tsx (1)

84-91: Avoid flicker: don’t render banner while status is loading

Gate the banner on loading=false. Optional: persist dismissal in sessionStorage, but gating is the bigger win.

-            {migrationStatus.migrationRequired && !migrationBannerDismissed && (
+            {!migrationStatus.loading && migrationStatus.migrationRequired && !migrationBannerDismissed && (
               <MigrationBanner
                 message={migrationStatus.message || "Database migration required"}
                 instructions={migrationStatus.instructions || "Check migration folder"}
                 onDismiss={() => setMigrationBannerDismissed(true)}
               />
             )}
python/src/server/main.py (1)

282-309: Optional: avoid blocking the event loop in async function

Supabase Python client is sync; calling .execute() inside an async endpoint can block the loop. Given caching, the impact is small, but wrapping in a thread keeps /health responsive under load.

-        test_query = client.table('archon_sources').select('source_url, source_display_name').limit(1).execute()
+        # Run sync client call off the event loop
+        import asyncio
+        await asyncio.to_thread(
+            lambda: client.table('archon_sources').select('source_url, source_display_name').limit(1).execute()
+        )
python/src/server/api_routes/knowledge_api.py (2)

867-870: Decouple health check from main; avoid importing a private symbol and reduce circular-import risk

Importing _check_database_schema from ..main couples the router to app bootstrap and relies on a private name. Alias it to a public-facing name now, and consider moving the function into a lightweight module (e.g., server/utils/schema_health.py) to avoid circulars and to make it an explicit dependency.

Apply this minimal change now (keeps behavior, improves readability):

-from ..main import _check_database_schema
-
-schema_status = await _check_database_schema()
+from ..main import _check_database_schema as check_database_schema
+
+schema_status = await check_database_schema()

Follow-up (separate PR): extract _check_database_schema to server/utils/schema_health.py as check_database_schema() and import from there. I can draft that refactor if helpful.


874-874: Use timezone-aware UTC timestamps for consistency

Elsewhere in this file you use datetime.utcnow().isoformat(); here it’s datetime.now().isoformat() (naive). Standardize on UTC, timezone-aware.

-            "timestamp": datetime.now().isoformat(),
+            "timestamp": datetime.now(timezone.utc).isoformat(),

Add this import at the top of the file (outside this hunk):

from datetime import datetime, timezone

Optionally update the “healthy” payload’s timestamp similarly for consistency.
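The naive-vs-aware distinction the comment relies on is easy to verify directly; a naive timestamp carries no offset, while the timezone-aware form serializes with an explicit `+00:00`:

```python
from datetime import datetime, timezone

naive = datetime.now()              # no tzinfo; ambiguous across hosts and zones
aware = datetime.now(timezone.utc)  # explicit UTC, as the diff recommends

# The naive value has no tzinfo attached, so consumers can't tell its zone.
assert naive.tzinfo is None
assert aware.tzinfo is timezone.utc
# The aware ISO string carries its offset, e.g. 2025-08-26T10:15:00.123456+00:00
assert aware.isoformat().endswith("+00:00")
```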

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 52187e2 and 3eda01e.

📒 Files selected for processing (6)
  • archon-ui-main/src/App.tsx (3 hunks)
  • archon-ui-main/src/components/BackendStartupError.tsx (1 hunks)
  • archon-ui-main/src/components/ui/MigrationBanner.tsx (1 hunks)
  • archon-ui-main/src/hooks/useMigrationStatus.ts (1 hunks)
  • python/src/server/api_routes/knowledge_api.py (1 hunks)
  • python/src/server/main.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • archon-ui-main/src/components/BackendStartupError.tsx
  • archon-ui-main/src/components/ui/MigrationBanner.tsx
  • python/src/server/main.py
  • archon-ui-main/src/App.tsx
  • archon-ui-main/src/hooks/useMigrationStatus.ts
  • python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/components/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place reusable UI components in archon-ui-main/src/components/

Files:

  • archon-ui-main/src/components/BackendStartupError.tsx
  • archon-ui-main/src/components/ui/MigrationBanner.tsx
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/main.py
  • python/src/server/api_routes/knowledge_api.py
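The "retry with exponential backoff" guideline above can be sketched as follows; `fetch_with_backoff`, its parameters, and the delay schedule are illustrative assumptions, not code from this repo:

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=0.5):
    """Retry a flaky external call with exponential backoff plus jitter.

    `fetch` is any zero-argument callable; after the final attempt the last
    exception is re-raised with context, per the guideline above.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # narrow to the client's error type in real code
            if attempt == max_attempts:
                raise RuntimeError(
                    f"external call failed after {max_attempts} attempts"
                ) from exc
            # Delays double each attempt (0.5s, 1s, 2s, ...); jitter avoids
            # synchronized retries from many callers at once.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```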
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/main.py
  • python/src/server/api_routes/knowledge_api.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/main.py
  • python/src/server/api_routes/knowledge_api.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/main.py
  • python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/hooks/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place custom React hooks in archon-ui-main/src/hooks/

Files:

  • archon-ui-main/src/hooks/useMigrationStatus.ts
🧠 Learnings (2)
📚 Learning: 2025-08-21T11:22:33.541Z
Learnt from: CR
PR: coleam00/Archon#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-21T11:22:33.541Z
Learning: Applies to .env : Require SUPABASE_URL and SUPABASE_SERVICE_KEY in the .env file; optionally OPENAI_API_KEY, LOGFIRE_TOKEN, LOG_LEVEL

Applied to files:

  • archon-ui-main/src/components/BackendStartupError.tsx
📚 Learning: 2025-08-21T11:22:33.541Z
Learnt from: CR
PR: coleam00/Archon#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-21T11:22:33.541Z
Learning: Applies to archon-ui-main/src/hooks/** : Place custom React hooks in archon-ui-main/src/hooks/

Applied to files:

  • archon-ui-main/src/hooks/useMigrationStatus.ts
🧬 Code graph analysis (2)
archon-ui-main/src/App.tsx (2)
archon-ui-main/src/hooks/useMigrationStatus.ts (1)
  • useMigrationStatus (10-53)
archon-ui-main/src/components/ui/MigrationBanner.tsx (1)
  • MigrationBanner (11-62)
python/src/server/api_routes/knowledge_api.py (1)
python/src/server/main.py (1)
  • _check_database_schema (282-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (2)
python/src/server/api_routes/knowledge_api.py (2)

878-879: Add optional migration script fields; no UI changes needed

The UI hook useMigrationStatus only reads and displays the migration_instructions field—any additional fields in the JSON payload will be ignored by the existing code. You can safely extend the response without breaking the UI.

Proposed next steps:

  • In knowledge_api.py (around lines 878–879), return the new, optional fields alongside the existing instruction string:
    {
      "migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql",
      "sql_migration_script":      "migration/add_source_url_display_name.sql",
      "full_setup_script":         "migration/complete_setup.sql",
      "docs_url":                  "https://your-docs.example.com/migrations"
    }
  • Define these values in a central Python config or constants module (or via environment variables) rather than hardcoding paths in the handler.
  • No changes are required in the UI; useMigrationStatus continues to map only:
    setStatus({
      migrationRequired: true,
      message:             healthData.message,
      instructions:        healthData.migration_instructions,
      loading:             false,
    });
  • (Optional) If you maintain TypeScript types for health data, you may extend the HealthData interface to include the new optional properties for improved DX and auto-completion.

871-879: No changes needed to HTTP status for migration-required health check

The repository’s container healthchecks (in docker-compose.yml and Dockerfile.server) all target /health (and its alias /api/health), not the Knowledge API’s /api/knowledge/health endpoint. There are no Kubernetes readinessProbe or livenessProbe manifests in the codebase invoking /api/knowledge/health, and the UI specifically polls /api/knowledge/health (via useMigrationStatus.ts) to detect migrations. Adjusting its HTTP status from 200→503 would break the front-end migration flow without impacting any existing deployment probes.

Likely an incorrect or invalid review comment.

Comment on lines +5 to +9
interface MigrationBannerProps {
message: string;
instructions: string;
onDismiss?: () => void;
}

🛠️ Refactor suggestion

instructions prop is accepted but never rendered

This creates drift between backend-provided guidance and the hard-coded list. Use the instructions string when present and fall back to the default list.

Apply the follow-up diff below (second comment) to render instructions.

🤖 Prompt for AI Agents
In archon-ui-main/src/components/ui/MigrationBanner.tsx around lines 5 to 9, the
MigrationBanner component accepts an instructions prop but never renders it;
update the render logic to prefer the instructions prop when it is a non-empty
string and fall back to the existing hard-coded default list otherwise.
Specifically, replace the hard-coded instructions block with conditional
rendering: if props.instructions is present render it (e.g., as a paragraph or
sanitized text node), else render the original list items; ensure types remain
optional and that the DOM structure/spacing stays consistent.

Comment thread python/src/server/main.py
Comment on lines +249 to +261
# Check for required database schema
schema_status = await _check_database_schema()
if not schema_status["valid"]:
return {
"status": "migration_required",
"service": "archon-backend",
"timestamp": datetime.now().isoformat(),
"ready": False,
"migration_required": True,
"message": schema_status["message"],
"migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
}


🛠️ Refactor suggestion

Return backend-provided instructions and mark schema_valid=false on migration_required

Currently instructions are hard-coded and schema_valid isn’t surfaced in the degraded response. Use instructions from the checker and explicitly declare schema_valid: false for parity with the healthy path.

-    if not schema_status["valid"]:
+    if not schema_status["valid"]:
         return {
             "status": "migration_required",
             "service": "archon-backend", 
             "timestamp": datetime.now().isoformat(),
             "ready": False,
             "migration_required": True,
-            "message": schema_status["message"],
-            "migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
+            "message": schema_status["message"],
+            "migration_instructions": schema_status.get(
+                "instructions",
+                "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
+            ),
+            "schema_valid": False,
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Check for required database schema
schema_status = await _check_database_schema()
if not schema_status["valid"]:
return {
"status": "migration_required",
"service": "archon-backend",
"timestamp": datetime.now().isoformat(),
"ready": False,
"migration_required": True,
"message": schema_status["message"],
"migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
}
# Check for required database schema
schema_status = await _check_database_schema()
if not schema_status["valid"]:
return {
"status": "migration_required",
"service": "archon-backend",
"timestamp": datetime.now().isoformat(),
"ready": False,
"migration_required": True,
"message": schema_status["message"],
"migration_instructions": schema_status.get(
"instructions",
"Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
),
"schema_valid": False,
}
🤖 Prompt for AI Agents
In python/src/server/main.py around lines 249 to 261, the degraded response
hard-codes migration instructions and omits the schema_valid flag; update the
returned dict to set "schema_valid": False and replace the hard-coded
"migration_instructions" string with the instructions provided by the checker
(e.g. use schema_status["instructions"] or the appropriate field from
schema_status), ensuring the rest of the keys remain intact and the message
continues to use schema_status["message"].

Comment thread python/src/server/main.py
Comment on lines +318 to +345
# Check for missing columns first (more specific than table check)
missing_source_url = 'source_url' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
missing_source_display = 'source_display_name' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)

# Also check for PostgreSQL error code 42703 (undefined column)
is_column_error = '42703' in error_msg or 'column' in error_msg

if (missing_source_url or missing_source_display) and is_column_error:
result = {
"valid": False,
"message": "Database schema outdated - missing required columns from recent updates"
}
# Cache failed result with timestamp
_schema_check_cache["valid"] = False
_schema_check_cache["checked_at"] = current_time
_schema_check_cache["result"] = result
return result

# Check for table doesn't exist (less specific, only if column check didn't match)
# Look for relation/table errors specifically
if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
# Table doesn't exist - not a migration issue, it's a setup issue
return {"valid": True, "message": "Table doesn't exist - handled by startup error"}

# Other errors don't necessarily mean migration needed
result = {"valid": True, "message": f"Schema check inconclusive: {str(e)}"}
# Don't cache inconclusive results - allow retry
return result

⚠️ Potential issue

Treat missing tables as invalid schema and cache failures to avoid repeated exceptions

Returning valid=True for a missing table causes /health to report healthy despite a hard setup failure. Also, not caching this failure yields repeated exception work under polling. Return valid=False with targeted instructions and cache for 30s, same as the column-missing case.

-        if (missing_source_url or missing_source_display) and is_column_error:
+        if (missing_source_url or missing_source_display) and is_column_error:
             result = {
-                "valid": False, 
-                "message": "Database schema outdated - missing required columns from recent updates"
+                "valid": False, 
+                "message": "Database schema outdated - missing required columns from recent updates",
+                "instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
             }
             # Cache failed result with timestamp
             _schema_check_cache["valid"] = False
             _schema_check_cache["checked_at"] = current_time
             _schema_check_cache["result"] = result
             return result
         
         # Check for table doesn't exist (less specific, only if column check didn't match)
         # Look for relation/table errors specifically
-        if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
-            # Table doesn't exist - not a migration issue, it's a setup issue
-            return {"valid": True, "message": "Table doesn't exist - handled by startup error"}
+        if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
+            # Database not initialized - guide user to full setup
+            result = {
+                "valid": False,
+                "message": "Database not initialized - required tables are missing",
+                "instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/complete_setup.sql"
+            }
+            _schema_check_cache["valid"] = False
+            _schema_check_cache["checked_at"] = current_time
+            _schema_check_cache["result"] = result
+            return result
         
         # Other errors don't necessarily mean migration needed
         result = {"valid": True, "message": f"Schema check inconclusive: {str(e)}"}
         # Don't cache inconclusive results - allow retry
         return result
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Check for missing columns first (more specific than table check)
missing_source_url = 'source_url' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
missing_source_display = 'source_display_name' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
# Also check for PostgreSQL error code 42703 (undefined column)
is_column_error = '42703' in error_msg or 'column' in error_msg
if (missing_source_url or missing_source_display) and is_column_error:
result = {
"valid": False,
"message": "Database schema outdated - missing required columns from recent updates"
}
# Cache failed result with timestamp
_schema_check_cache["valid"] = False
_schema_check_cache["checked_at"] = current_time
_schema_check_cache["result"] = result
return result
# Check for table doesn't exist (less specific, only if column check didn't match)
# Look for relation/table errors specifically
if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
# Table doesn't exist - not a migration issue, it's a setup issue
return {"valid": True, "message": "Table doesn't exist - handled by startup error"}
# Other errors don't necessarily mean migration needed
result = {"valid": True, "message": f"Schema check inconclusive: {str(e)}"}
# Don't cache inconclusive results - allow retry
return result
# Check for missing columns first (more specific than table check)
missing_source_url = 'source_url' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
missing_source_display = 'source_display_name' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
# Also check for PostgreSQL error code 42703 (undefined column)
is_column_error = '42703' in error_msg or 'column' in error_msg
if (missing_source_url or missing_source_display) and is_column_error:
result = {
"valid": False,
"message": "Database schema outdated - missing required columns from recent updates",
"instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
}
# Cache failed result with timestamp
_schema_check_cache["valid"] = False
_schema_check_cache["checked_at"] = current_time
_schema_check_cache["result"] = result
return result
# Check for table doesn't exist (less specific, only if column check didn't match)
# Look for relation/table errors specifically
if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
# Database not initialized - guide user to full setup
result = {
"valid": False,
"message": "Database not initialized - required tables are missing",
"instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/complete_setup.sql"
}
# Cache failed result with timestamp
_schema_check_cache["valid"] = False
_schema_check_cache["checked_at"] = current_time
_schema_check_cache["result"] = result
return result
# Other errors don't necessarily mean migration needed
result = {"valid": True, "message": f"Schema check inconclusive: {str(e)}"}
# Don't cache inconclusive results - allow retry
return result
🤖 Prompt for AI Agents
In python/src/server/main.py around lines 318 to 345, the current logic treats
missing tables as valid and doesn't cache the failure; change it so that a
"relation/table does not exist" error returns valid=False with a clear message
instructing that the database needs setup/migrations, populate
_schema_check_cache["valid"]=False,
_schema_check_cache["checked_at"]=current_time and
_schema_check_cache["result"]=result (same caching behavior used for missing
columns) and then return result so the health check reports invalid and the
failure is cached for 30s to avoid repeated exceptions.
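The caching behavior discussed above (cache definitive failures for 30 seconds, never cache inconclusive results so polling can retry) can be sketched independently of Supabase. The cache-key names mirror the diff, but `probe` and its `cacheable` flag are illustrative assumptions:

```python
import time

_schema_check_cache = {"checked_at": 0.0, "result": None}
CACHE_TTL_SECONDS = 30


def check_schema(probe, now=time.monotonic):
    """Run `probe()` and cache its result for CACHE_TTL_SECONDS.

    `probe` returns a dict like {"valid": bool, "message": str, "cacheable": bool};
    inconclusive results (cacheable=False) are never stored, so the next poll
    re-runs the check instead of serving a stale verdict.
    """
    current_time = now()
    cached = _schema_check_cache["result"]
    if cached is not None and current_time - _schema_check_cache["checked_at"] < CACHE_TTL_SECONDS:
        return cached  # fresh enough; skip hitting the database again

    result = probe()
    if result.get("cacheable", True):
        _schema_check_cache["checked_at"] = current_time
        _schema_check_cache["result"] = result
    return result
```

Under 30-second polling from the UI hook, a cached failure means at most one database round-trip per TTL window rather than one per poll.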

Wirasm added 2 commits August 26, 2025 12:35
Clean up component API by removing instructions prop that was accepted but never rendered. Simplifies the interface and eliminates dead code while keeping the functional hardcoded migration steps.
Add schema_valid: false flag to health endpoint response when database schema migration is required. Improves API consistency without changing existing behavior.
@Wirasm
Collaborator Author

Wirasm commented Aug 26, 2025

Code Review Response

Thanks for the detailed feedback! We've addressed the issues pragmatically while keeping this PR focused on its core purpose (source ID architecture refactor).

Changes Made:

1. Removed unused prop from MigrationBanner ()

  • Issue: Component accepted prop but never rendered it
  • Fix: Removed the prop entirely from interface, component usage, and hook
  • Reasoning: Clean API is better than dead code. Migration steps are standardized ("run SQL, restart Docker") so hardcoded instructions work perfectly.

2. Added flag to health response ()

  • Issue: Migration response missing schema_valid flag for API consistency
  • Fix: Added to migration_required response
  • Reasoning: Minimal change for API consistency without scope creep.

Issues We Didn't Address (And Why):

Dynamic backend instruction rendering:

  • Why not: This would require HTML sanitization, markdown parsing, security review, and significant complexity for a use case that doesn't need it. Migration instructions are procedural ("do X, then Y") and don't benefit from being dynamic.
  • Current solution works: Users get clear, actionable steps that cover 99% of scenarios.

Missing table behavior changes:

  • Why not: This PR is about source ID refactoring, not improving setup flows. The current behavior works (users see backend errors and figure out they need ).
  • Better as separate PR: If we want to improve new user onboarding, that deserves its own focused PR with proper testing.

🎯 Our Approach:

We prioritized shipping working functionality over perfect abstractions. The migration banner successfully guides users through schema updates, which was the core goal. Additional polish can be added in future PRs if there's real user demand.

The changes maintain backward compatibility and don't introduce security risks or complexity.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
archon-ui-main/src/components/ui/MigrationBanner.tsx (1)

25-36: Render backend-provided instructions with a safe fallback to defaults

Hard-coded steps will drift from backend guidance (e.g., table-missing vs. column-missing flows). Prefer a backend-provided instructions prop (string or string[]) and fall back to default steps. This keeps UX accurate and reduces support load.

Apply this diff to render dynamic instructions while preserving the existing defaults:

           <div className="bg-red-100 dark:bg-red-900/40 border border-red-200 dark:border-red-800 rounded-lg p-3 mb-3">
             <p className="text-sm font-medium text-red-800 dark:text-red-300 mb-2">
               Follow these steps:
             </p>
-            <ol className="text-sm text-red-700 dark:text-red-400 space-y-1 list-decimal list-inside">
-              <li>Open your Supabase project dashboard</li>
-              <li>Navigate to the SQL Editor</li>
-              <li>Copy and run the migration script from: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">migration/add_source_url_display_name.sql</code></li>
-              <li>Restart Docker containers: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">docker compose down && docker compose up --build -d</code></li>
-              <li>If you used a profile, add it: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">--profile full</code></li>
-            </ol>
+            {instructions
+              ? (Array.isArray(instructions) ? (
+                  <ol className="text-sm text-red-700 dark:text-red-400 space-y-1 list-decimal list-inside">
+                    {instructions.map((step, i) => (
+                      <li key={i}>{step}</li>
+                    ))}
+                  </ol>
+                ) : (
+                  <p className="text-sm text-red-700 dark:text-red-400 whitespace-pre-line">
+                    {instructions}
+                  </p>
+                ))
+              : (
+                <ol className="text-sm text-red-700 dark:text-red-400 space-y-1 list-decimal list-inside">
+                  {DEFAULT_MIGRATION_STEPS.map((step, i) => (
+                    <li key={i}>{step}</li>
+                  ))}
+                </ol>
+              )}
           </div>

Add this helper constant near the top of the file (outside the component) to keep defaults maintainable:

// Place near imports
const DEFAULT_MIGRATION_STEPS = [
  'Open your Supabase project dashboard',
  'Navigate to the SQL Editor',
  'Copy and run the migration script from: migration/add_source_url_display_name.sql',
  'Restart Docker containers: docker compose down && docker compose up --build -d',
  'If you used a profile, add it: --profile full',
];
🧹 Nitpick comments (5)
archon-ui-main/src/components/ui/MigrationBanner.tsx (5)

15-21: A11y: make this an ARIA alert and label the heading

Screen readers should announce the banner immediately; also hide the decorative icon from AT.

-    <Card className="bg-red-50 border-red-200 dark:bg-red-900/20 dark:border-red-800 mb-6">
+    <Card
+      className="bg-red-50 border-red-200 dark:bg-red-900/20 dark:border-red-800 mb-6"
+      role="alert"
+      aria-labelledby="migration-banner-title"
+    >
       <div className="flex items-start gap-3 p-4">
-        <AlertTriangle className="w-6 h-6 text-red-500 flex-shrink-0 mt-0.5" />
+        <AlertTriangle
+          className="w-6 h-6 text-red-500 flex-shrink-0 mt-0.5"
+          aria-hidden="true"
+        />
         <div className="flex-1">
-          <h3 className="text-lg font-semibold text-red-800 dark:text-red-300 mb-2">
+          <h3 id="migration-banner-title" className="text-lg font-semibold text-red-800 dark:text-red-300 mb-2">
             Database Migration Required
           </h3>

22-24: Preserve line breaks in backend message

Messages from /api/health may include newlines; preserve them to improve readability.

-          <p className="text-red-700 dark:text-red-400 mb-3">
+          <p className="text-red-700 dark:text-red-400 mb-3 whitespace-pre-line">
             {message}
           </p>

49-53: Prevent accidental form submission: set button type="button"

If this sits inside a form someday, the default type="submit" could cause side effects.

-              <button
-                onClick={onDismiss}
+              <button
+                type="button"
+                onClick={onDismiss}
                 className="text-red-600 dark:text-red-400 hover:text-red-800 dark:hover:text-red-200 text-sm font-medium"
               >
                 Dismiss (temporarily)
               </button>

39-46: Consider making the Supabase dashboard URL configurable

Some deployments may use a different dashboard URL. Expose it via a prop or config and default to supabase.com/dashboard.

If you choose to make it a prop:

 interface MigrationBannerProps {
   message: string;
+  dashboardUrl?: string;
   instructions?: string | string[];
   onDismiss?: () => void;
 }
 
 export const MigrationBanner: React.FC<MigrationBannerProps> = ({
   message,
+  dashboardUrl = "https://supabase.com/dashboard",
   instructions,
   onDismiss
 }) => {
   …
-            <a
-              href="https://supabase.com/dashboard"
+            <a
+              href={dashboardUrl}
               target="_blank"
               rel="noopener noreferrer"
               className="inline-flex items-center gap-2 bg-red-600 hover:bg-red-700 text-white px-4 py-2 rounded-lg text-sm font-medium transition-colors"
             >

25-36: Tests: add coverage for dynamic instructions and dismiss behavior

Add RTL tests to ensure:

  • instructions string renders as a paragraph with preserved newlines,
  • instructions array maps to list items,
  • fallback defaults render when instructions absent,
  • onDismiss fires on click.

Example (place under archon-ui-main/src/components/ui/tests/MigrationBanner.test.tsx):

import { render, screen, fireEvent } from "@testing-library/react";
import { MigrationBanner } from "../../MigrationBanner";

test("renders string instructions with newlines", () => {
  render(<MigrationBanner message="m required" instructions={"line1\nline2"} />);
  expect(screen.getByText("line1")).toBeInTheDocument();
  expect(screen.getByText("line2")).toBeInTheDocument();
});

test("renders array instructions as list", () => {
  render(<MigrationBanner message="m required" instructions={["s1","s2"]} />);
  const items = screen.getAllByRole("listitem");
  expect(items.map(li => li.textContent)).toEqual(["s1","s2"]);
});

test("falls back to defaults when instructions missing", () => {
  render(<MigrationBanner message="m required" />);
  expect(screen.getByText(/Open your Supabase project dashboard/)).toBeInTheDocument();
});

test("dismiss button triggers handler", () => {
  const onDismiss = jest.fn();
  render(<MigrationBanner message="m required" onDismiss={onDismiss} />);
  fireEvent.click(screen.getByText(/Dismiss/));
  expect(onDismiss).toHaveBeenCalled();
});
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3eda01e and 75958f4.

📒 Files selected for processing (4)
  • archon-ui-main/src/App.tsx (3 hunks)
  • archon-ui-main/src/components/ui/MigrationBanner.tsx (1 hunks)
  • archon-ui-main/src/hooks/useMigrationStatus.ts (1 hunks)
  • python/src/server/main.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • archon-ui-main/src/App.tsx
  • python/src/server/main.py
  • archon-ui-main/src/hooks/useMigrationStatus.ts
🧰 Additional context used
📓 Path-based instructions (2)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • archon-ui-main/src/components/ui/MigrationBanner.tsx
archon-ui-main/src/components/**

📄 CodeRabbit inference engine (CLAUDE.md)

Place reusable UI components in archon-ui-main/src/components/

Files:

  • archon-ui-main/src/components/ui/MigrationBanner.tsx
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)

Comment on lines +5 to +8
interface MigrationBannerProps {
message: string;
onDismiss?: () => void;
}

🛠️ Refactor suggestion

Extend props to accept optional instructions (string or string[])

Expose backend guidance via an instructions prop and keep the API simple.

 interface MigrationBannerProps {
   message: string;
+  instructions?: string | string[];
   onDismiss?: () => void;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
interface MigrationBannerProps {
message: string;
onDismiss?: () => void;
}
interface MigrationBannerProps {
message: string;
instructions?: string | string[];
onDismiss?: () => void;
}
🤖 Prompt for AI Agents
In archon-ui-main/src/components/ui/MigrationBanner.tsx around lines 5 to 8, the
component props only include message and onDismiss; extend the interface to add
an optional instructions?: string | string[] prop, update the component
signature to accept it, and update rendering logic to handle no instructions, a
single string, or an array (e.g., map array to list items or join single string)
so backend guidance can be displayed while keeping the existing API
backwards-compatible and optional.

Comment on lines +10 to +13
export const MigrationBanner: React.FC<MigrationBannerProps> = ({
message,
onDismiss
}) => {

🛠️ Refactor suggestion

Plumb the new instructions prop through the component

Wire the prop so the rendering logic can use it.

 export const MigrationBanner: React.FC<MigrationBannerProps> = ({
   message,
-  onDismiss
+  instructions,
+  onDismiss
 }) => {
📝 Committable suggestion


Suggested change
export const MigrationBanner: React.FC<MigrationBannerProps> = ({
message,
onDismiss
}) => {
export const MigrationBanner: React.FC<MigrationBannerProps> = ({
message,
instructions,
onDismiss
}) => {
// ...rest of component
}
🤖 Prompt for AI Agents
In archon-ui-main/src/components/ui/MigrationBanner.tsx around lines 10 to 13,
the component currently destructures only message and onDismiss; update the
MigrationBannerProps type to include an instructions prop (preferably optional,
typed as string | ReactNode), add instructions to the destructuring list, and
wire it into the render output (conditionally render the instructions
block/element where the banner shows guidance, preserving styling and
accessibility). Ensure types and any consumers are updated to pass the new prop
where needed.

Comment on lines +25 to +36
<div className="bg-red-100 dark:bg-red-900/40 border border-red-200 dark:border-red-800 rounded-lg p-3 mb-3">
<p className="text-sm font-medium text-red-800 dark:text-red-300 mb-2">
Follow these steps:
</p>
<ol className="text-sm text-red-700 dark:text-red-400 space-y-1 list-decimal list-inside">
<li>Open your Supabase project dashboard</li>
<li>Navigate to the SQL Editor</li>
<li>Copy and run the migration script from: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">migration/add_source_url_display_name.sql</code></li>
<li>Restart Docker containers: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">docker compose down && docker compose up --build -d</code></li>
<li>If you used a profile, add it: <code className="bg-red-200 dark:bg-red-800 px-1 rounded">--profile full</code></li>
</ol>
</div>

💡 Verification agent

🧩 Analysis chain

Verify backend contract exposes instructions

Before wiring this, confirm the health endpoint or migration status hook surfaces instructions (string or list). If not, add it there.


🏁 Script executed:

#!/bin/bash
# Verify whether the UI hook and backend expose `instructions`.

echo "Searching UI for migration status and instructions…"
rg -n -C3 'useMigrationStatus|migration.*required|instructions' archon-ui-main/src || true

echo
echo "Searching backend for /api/health or migration instructions…"
rg -n -C3 -S 'api/health|/health|migration.*instruction|migrationRequired|instructions' --type=py || true

Length of output: 15320


Surface backend-provided migration instructions in the UI

The backend’s /health and /api/health endpoints already return a migration_instructions field when migration_required is true (see python/src/server/main.py lines 257–260 and knowledge_api.py lines 876–879). However, the front-end hook and MigrationBanner component aren’t consuming this field—they only use message and render hard-coded steps. To fully wire up dynamic instructions:

• In archon-ui-main/src/hooks/useMigrationStatus.ts
– Extend the MigrationStatus interface to include an instructions?: string (or string[]) field.
– In the if (healthData.status === 'migration_required') branch (around line 21), add instructions: healthData.migration_instructions to the object passed to setStatus.

• In archon-ui-main/src/components/ui/MigrationBanner.tsx
– Update MigrationBannerProps to accept an instructions: string[] prop (instead of, or in addition to, message).
– Replace the hard-coded <ol> block with something like:
  <ol className="…">
    {instructions.map((step, i) => (
      <li key={i}>{step}</li>
    ))}
  </ol>

• In archon-ui-main/src/App.tsx
– When rendering <MigrationBanner … /> (around lines 84–90), pass the new prop:
  <MigrationBanner
    message={migrationStatus.message || "Database migration required"}
    instructions={migrationStatus.instructions || []}
    onDismiss={() => setMigrationBannerDismissed(true)}
  />

This will ensure the UI surfaces the exact steps defined by the backend, keeping both sides in sync and avoiding duplicated hard-coded instructions.

🤖 Prompt for AI Agents
In archon-ui-main/src/components/ui/MigrationBanner.tsx around lines 25–36, the
component currently renders hard-coded migration steps instead of using
backend-provided instructions; update the component to accept an instructions
prop (string[]), remove the hard-coded <ol>, and render instructions.map((s,i)=>
<li key={i}>{s}</li>). Also update
archon-ui-main/src/hooks/useMigrationStatus.ts to extend the MigrationStatus
type with instructions?: string[] and set instructions:
healthData.migration_instructions in the migration_required branch, and finally
modify archon-ui-main/src/App.tsx where MigrationBanner is rendered (around
lines 84–90) to pass instructions={migrationStatus.instructions || []} alongside
the existing props.

@Wirasm
Collaborator Author

Wirasm commented Aug 29, 2025

This PR will also close #399 when merged.

Credit to @StreamDemon for their initial work and partial implementation in #399 that helped identify and address this race condition issue.
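The core fix in this PR — replacing domain names with a SHA256-derived ID so two concurrent crawls of the same domain never collide — can be sketched as follows. The real implementation lives in the Python `URLHandler`; the function names and the display-name heuristic here are illustrative, not the project's actual code.

```typescript
import { createHash } from "node:crypto";

// Sketch of unique source-ID generation as described in the PR summary:
// SHA256 of the full URL, truncated to 16 hex characters. Two different
// URLs on the same domain (e.g. two GitHub repos) now yield distinct IDs.
function generateSourceId(sourceUrl: string): string {
  return createHash("sha256").update(sourceUrl).digest("hex").slice(0, 16);
}

// Display name stays human-friendly and is kept separate from the unique
// ID (the three-field architecture: source_id / source_url / source_display_name).
function extractDisplayName(sourceUrl: string): string {
  const { hostname, pathname } = new URL(sourceUrl);
  const path = pathname.replace(/\/+$/, "");
  return path && path !== "/" ? `${hostname}${path}` : hostname;
}
```

Because the ID is a pure function of the URL, re-crawling the same source remains idempotent while distinct sources on one domain no longer race for the same primary key.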

@Wirasm Wirasm merged commit 3e204b0 into main Aug 29, 2025
7 checks passed
@Wirasm Wirasm deleted the fix/source-id-architecture-refactor branch August 29, 2025 11:54
@Wirasm Wirasm mentioned this pull request Aug 29, 2025
24 tasks
@coderabbitai coderabbitai Bot mentioned this pull request Oct 7, 2025
20 tasks
mionemedia added a commit to mionemedia/Archon that referenced this pull request Jan 16, 2026
* Add Supabase key validation and simplify frontend state management

- Add backend validation to detect and warn about anon vs service keys
- Prevent startup with incorrect Supabase key configuration
- Consolidate frontend state management following KISS principles
- Remove duplicate state tracking and sessionStorage polling
- Add clear error display when backend fails to start
- Improve .env.example documentation with detailed key selection guide
- Add comprehensive test coverage for validation logic
- Remove unused test results checking to eliminate 404 errors

The implementation now warns users about key misconfiguration while
maintaining backward compatibility. Frontend state is simplified with
MainLayout as the single source of truth for backend status.

* Fix critical issues from code review

- Use python-jose (already in dependencies) instead of PyJWT for JWT decoding
- Make unknown Supabase key roles fail fast per alpha principles
- Skip all JWT validations (not just signature) when checking role
- Update tests to expect failure for unknown roles

Fixes:
- No need to add PyJWT dependency - python-jose provides JWT functionality
- Unknown key types now raise ConfigurationError instead of warning
- JWT decode properly skips all validations to only check role claim

* Remove unnecessary startup delay script from frontend Dockerfile

- Rolled back to match main branch Dockerfile
- Removed 3-second sleep script that was added for backend readiness
- Container now runs npm directly without intermediate script
- Tested and verified all services start correctly without the delay

* Fix document deletion persistence issue (#278)

- Fixed projectService methods to include project_id parameter in API calls
- Updated deleteDocument() to use correct endpoint: /api/projects/{projectId}/docs/{docId}
- Updated getDocument() and updateDocument() to use correct endpoints with project_id
- Modified DocsTab component to call backend API when deleting documents
- Documents now properly persist deletion after page refresh

The issue was that document deletion was only happening in UI state and never
reached the backend. The service methods were using incorrect API endpoints
that didn't include the required project_id parameter.

* Add comprehensive test coverage for document CRUD operations

- Add Document interface for type safety
- Fix error messages to include projectId context
- Add unit tests for all projectService document methods
- Add integration tests for DocsTab deletion flow
- Update vitest config to include new test files

* MCP server consolidation and simplification

- Consolidated multiple MCP modules into unified project_module
- Removed redundant project, task, document, and version modules
- Identified critical issue with async project creation losing context
- Updated CLAUDE.md with project instructions

This commit captures the current state before refactoring to split
consolidated tools into separate operations for better clarity and
to solve the async project creation context issue.

* Improve MCP tool usability and documentation

- Fix parameter naming confusion in RAG tools (source → source_domain)
- Add clarification that source_domain expects domain names not IDs
- Improve manage_versions documentation with clear examples
- Add better error messages for validation failures
- Enhance manage_document with non-PRP examples
- Add comprehensive documentation to get_project_features
- Fix content parameter type in manage_versions to accept Any type

These changes address usability issues discovered during testing without
breaking existing functionality.

* Refactor MCP server structure and add separate project tools

- Rename src/mcp to src/mcp_server for clarity
- Update all internal imports to use new path
- Create features/projects directory for modular tool organization
- Add separate, simple project tools (create, list, get, delete, update)
- Keep consolidated tools for backward compatibility (via env var)
- Add USE_SEPARATE_PROJECT_TOOLS env var to toggle between approaches

The new separate tools:
- Solve the async project creation context loss issue
- Provide clearer, single-purpose interfaces
- Remove complex PRP examples for simplicity
- Handle project creation polling automatically

* Update bug_report.yml

Changing Archon Alpha to Beta in the issue template

* Update README.md

Added note in the README

* Fix project cards horizontal scrollbar visibility (#295)

Addresses issue #293 by replacing hide-scrollbar with scrollbar-thin
class to ensure users can see and interact with the horizontal scrollbar
when project cards overflow.

* Fix missing feature field in project tasks API response

Resolves issue #282 by adding feature field to task dictionary in
TaskService.list_tasks() method. The project tasks API endpoint was
excluding the feature field while individual task API included it,
causing frontend to default to 'General' instead of showing custom
feature values.

Changes:
- Add feature field to task response in list_tasks method
- Maintains compatibility with existing API consumers
- All 212 tests pass with this change

* Remove consolidated project module in favor of separated tools

The consolidated project module contained all project, task, document,
version, and feature management in a single 922-line file. This has been
replaced with focused, single-purpose tools in separate modules.

* Remove feature flags from Docker configuration

Removed USE_SEPARATE_PROJECT_AND_TASK_TOOLS and PROJECTS_ENABLED
environment variables as the separated tools are now the default.

* Add document and version management tools

Extract document management functionality into focused tools:
- create_document: Create new documents with metadata
- list_documents: List all documents in a project
- get_document: Retrieve specific document details
- update_document: Modify existing documents
- delete_document: Remove documents from projects

Extract version control functionality:
- create_version: Create immutable snapshots
- list_versions: View version history
- get_version: Retrieve specific version content
- restore_version: Rollback to previous versions

Includes improved documentation and error messages based on testing.

* Add task management tools with smart routing

Extract task functionality into focused tools:
- create_task: Create tasks with sources and code examples
- list_tasks: List tasks with project/status filtering
- get_task: Retrieve task details
- update_task: Modify task properties
- delete_task: Archive tasks (soft delete)

Preserves intelligent endpoint routing:
- Project-specific: /api/projects/{id}/tasks
- Status filtering: /api/tasks?status=X
- Assignee filtering: /api/tasks?assignee=X

* Add feature management tool for project capabilities

Extract get_project_features as a standalone tool with enhanced
documentation explaining feature structures and usage patterns.
Features track functional components like auth, api, and database.

* Update project tools to use simplified approach

Remove complex PRP validation logic and focus on core functionality.
Maintains backward compatibility with existing API endpoints.

* Register all separated tools in MCP server

Update MCP server to use the new modular tool structure:
- Projects and tasks from existing modules
- Documents and versions from new modules
- Feature management from standalone module

Remove all feature flag logic as separated tools are now default.

* Update MCP Dockerfile to support new module structure

Create documents directory and ensure all new modules are properly
included in the container build.

* Clean up unused imports in RAG module

Remove import of deleted project_module.

* Fix type errors and remove trailing whitespace

- Add explicit type annotations for params dictionaries to resolve mypy errors
- Remove trailing whitespace from blank lines (W293 ruff warnings)
- Ensure type safety in task_tools.py and document_tools.py

* Add comprehensive unit tests for MCP server features

- Create test structure mirroring features folder organization
- Add tests for document tools (create, list, update, delete)
- Add tests for version tools (create, list, restore, invalid field handling)
- Add tests for task tools (create with sources, list with filters, update, delete)
- Add tests for project tools (create with polling, list, get)
- Add tests for feature tools (get features with various structures)
- Mock HTTP client for all external API calls
- Test both success and error scenarios
- 100% test coverage for critical tool functions

* Updating the Logo for Archon

* Add Stage 1 workflow for external PR info collection

- Collects PR information without requiring secrets
- Triggers on pull_request events and @claude-review-ext comments
- Uploads PR details as artifact for secure processing

* Add Stage 2 secure review workflow for external PRs

- Runs after Stage 1 via workflow_run trigger
- Has access to repository secrets
- Downloads PR artifact and performs review
- Maintains security by never checking out fork code

* Add documentation for external PR review workflows

- Explains the two-stage security model
- Provides usage instructions for contributors and maintainers
- Includes troubleshooting and security considerations

* Fix base branch checkout in Stage 2 workflow

- Extract PR base branch from artifact instead of using workflow branch
- Add step to switch to correct base branch after downloading PR info
- Use PR base branch for diff generation instead of workflow branch

* Fix external PR workflow permissions and error handling

- Grant pull-requests write permission for comment posting
- Add try-catch error handling with continue-on-error
- Ensure workflow continues even if comment posting fails

* Simplify authorization for external PR workflows

- Move authorization check to Stage 1 job condition
- Remove complex authorization job from Stage 2
- Fix duplicate exec declaration error
- Add unauthorized user message handling in Stage 1
- Trust Stage 1's authorization in Stage 2

* Fix exec declaration error in Stage 2 workflow

- Replace async exec with execSync to avoid declaration issues
- Add proper error handling for artifact extraction
- Use childProcess module directly instead of destructuring

* Fix Claude Code Action authentication and context issues

- Remove invalid pr_number parameter
- Add explicit github_token to fix OIDC failure in workflow_run
- Add mode: review for proper review mode
- Create fake event.json to provide PR context
- Set environment variables to simulate PR event

* Remove invalid mode parameter and improve event context

- Remove invalid mode: review parameter
- Update event context to simulate issue_comment
- Add direct_prompt to guide Claude to review the diff
- Update instructions to use Read tool for pr-diff.patch

* Simplify fork PR review to single workflow with pull_request_target

- Remove complex two-stage workflow approach
- Use pull_request_target with security safeguards
- Add first-time contributor check and approval requirement
- Never checkout PR code - only analyze diff
- Mirror full review format from main claude-review workflow
- Manual trigger via @claude-review-fork for maintainers

* Fix Claude checkout issue for forked PRs

- Add environment overrides to prevent PR branch checkout
- Add explicit github_token for authentication
- Add direct_prompt to guide Claude to use diff file
- Override GITHUB_REF and GITHUB_SHA to stay on base branch

* Override event context to prevent PR checkout

- Set GITHUB_EVENT_NAME to workflow_dispatch to avoid PR detection
- Use override_prompt instead of direct_prompt for better control
- Create wrapper script for debugging
- Explicitly tell Claude not to checkout code

* fix: Restructure fork review workflow to avoid PR branch checkout

- Create isolated review context directory to prevent PR detection
- Move diff to changes.patch file in review-context directory
- Add explicit REVIEW_INSTRUCTIONS.md file for guidance
- Use standard 'prompt' parameter instead of 'override_prompt'
- This approach should prevent Claude Action from auto-detecting PR context

* Remove claude-review-fork.yml workflow

The Claude Code Action is not compatible with reviewing PRs from forks.
It always attempts to checkout the PR branch which fails for security reasons.

* Fix: Allow HTTP for local Supabase connections (#323)

- Modified validate_supabase_url() to allow HTTP for local development
- HTTP is now allowed for localhost, 127.0.0.1, host.docker.internal, and 0.0.0.0
- HTTPS is still required for production/non-local environments
- Fixes server startup failure when using local Supabase with Docker

* Update README.md (#332)

Makes it easier to copy, paste & run the command in one shot

* feat(mcp): Add robust error handling and timeout configuration

Critical improvements to MCP server reliability and client experience:

Error Handling:
- Created MCPErrorFormatter for consistent error responses across all tools
- Provides structured errors with type, message, details, and actionable suggestions
- Helps clients (like Claude Code) understand and handle failures gracefully
- Categorizes errors (connection_timeout, validation_error, etc.) for better debugging

Timeout Configuration:
- Centralized timeout config with environment variable support
- Different timeouts for regular operations vs polling operations
- Configurable via MCP_REQUEST_TIMEOUT, MCP_CONNECT_TIMEOUT, etc.
- Prevents indefinite hangs when services are unavailable
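The timeout-configuration pattern above — env var, fallback default, fail loudly on garbage — can be sketched like this. The variable names follow the commit message (`MCP_REQUEST_TIMEOUT` etc.); the default value and function name are illustrative assumptions, not the project's actual Python implementation.

```typescript
// Sketch of centralized, env-configurable timeouts as described above:
// read the variable, fall back to a default, and raise on invalid values
// rather than silently hanging or silently ignoring a typo.
function getTimeoutSeconds(
  env: Record<string, string | undefined>,
  name: string,
  defaultSeconds: number,
): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return defaultSeconds;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) {
    throw new Error(`${name} must be a positive number, got: ${raw}`);
  }
  return parsed;
}
```

Failing fast on a malformed value matches the later commit that adds "proper error handling for invalid env vars in get_max_polling_attempts".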

Module Registration:
- Distinguishes between ImportError (acceptable) and code errors (must fix)
- SyntaxError/NameError/AttributeError now halt execution immediately
- Prevents broken code from silently failing in production

Polling Safety:
- Fixed project creation polling with exponential backoff
- Handles API unavailability with proper error messages
- Maximum attempts configurable via MCP_MAX_POLLING_ATTEMPTS

Response Normalization:
- Fixed inconsistent response handling in list_tasks
- Validates and normalizes different API response formats
- Clear error messages when response format is unexpected

These changes address critical issues from PR review while maintaining
backward compatibility. All 20 existing tests pass.
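The exponential-backoff polling mentioned under "Polling Safety" can be sketched as a delay schedule. The base delay and cap below are illustrative assumptions; the commit only specifies that attempts are bounded (via `MCP_MAX_POLLING_ATTEMPTS`) and that the delay grows.

```typescript
// Sketch of exponential backoff for project-creation polling:
// delay doubles each attempt, capped so a long outage does not
// stretch individual waits indefinitely.
function backoffDelayMs(
  attempt: number, // 0-based attempt index
  baseMs = 500,
  capMs = 10_000,
): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

A polling loop would sleep `backoffDelayMs(i)` between attempts and surface a structured error once the configured maximum is exhausted.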

* refactor(mcp): Apply consistent error handling to all MCP tools

Comprehensive update to MCP server error handling:

Error Handling Improvements:
- Applied MCPErrorFormatter to all remaining MCP tool files
- Replaced all hardcoded timeout values with configurable timeout system
- Converted all simple string errors to structured error format
- Added proper httpx exception handling with detailed context

Tools Updated:
- document_tools.py: All 5 document management tools
- version_tools.py: All 4 version management tools
- feature_tools.py: Project features tool
- project_tools.py: Remaining 3 project tools (get, list, delete)
- task_tools.py: Remaining 4 task tools (get, list, update, delete)

Test Improvements:
- Removed backward compatibility checks from all tests
- Tests now enforce structured error format (dict not string)
- Any string error response is now considered a bug
- All 20 tests passing with new strict validation

This completes the error handling refactor for all MCP tools,
ensuring consistent client experience and better debugging.

* fix(mcp): Address all priority actions from PR review

Based on latest PR #306 review feedback:

Fixed Issues:
- Replaced last remaining basic error handling with MCPErrorFormatter
  in version_tools.py get_version function
- Added proper error handling for invalid env vars in get_max_polling_attempts
- Improved type hints with TaskUpdateFields TypedDict for better validation
- All tools now consistently use get_default_timeout() (verified with grep)

Test Improvements:
- Added comprehensive tests for MCPErrorFormatter utility (10 tests)
- Added tests for timeout_config utility (13 tests)
- All 43 MCP tests passing with new utilities
- Tests verify structured error format and timeout configuration

Type Safety:
- Created TaskUpdateFields TypedDict to specify exact allowed fields
- Documents valid statuses and assignees in type comments
- Improves IDE support and catches type errors at development time

This completes all priority actions from the review:
✅ Fixed inconsistent timeout usage (was already done)
✅ Fixed error handling inconsistency
✅ Improved type hints for update_fields
✅ Added tests for utility modules

* style: Apply linting fixes and formatting

Applied automated linting and formatting:
- Fixed missing newlines at end of files
- Adjusted line wrapping for better readability
- Fixed multi-line string formatting in tests
- No functional changes, only style improvements

All 43 tests still passing after formatting changes.

* Update docker-compose.yml

Adding host.docker.internal:host-gateway to Docker Compose for the server and agents.

* Update README.md (#410)

added star history at end of readme

* Important updates to CONTRIBUTING.md and README

* fix: Allow HTTP for all private network ranges in Supabase URLs (#417)

* fix: Allow HTTP for all private network ranges in Supabase URLs

- Extend HTTP support to all RFC 1918 private IP ranges
- Class A: 10.0.0.0 to 10.255.255.255 (10.0.0.0/8)
- Class B: 172.16.0.0 to 172.31.255.255 (172.16.0.0/12)
- Class C: 192.168.0.0 to 192.168.255.255 (192.168.0.0/16)
- Also includes link-local (169.254.0.0/16) addresses
- Uses Python's ipaddress module for robust IP validation
- Maintains HTTPS requirement for public/production URLs
- Backwards compatible with existing localhost exceptions

* security: Fix URL validation vulnerabilities

- Replace substring matching with exact hostname matching to prevent bypass attacks
- Exclude unspecified address (0.0.0.0) from allowed HTTP hosts
- Add support for .localhost domains per RFC 6761
- Improve error messages with hostname context for better debugging

Addresses security concerns raised in PR review regarding:
- Malicious domains like 'localhost.attacker.com' bypassing HTTPS requirements
- Unspecified address being incorrectly allowed as valid connection target
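The validation rule the two commits above converge on — HTTP allowed only for loopback, RFC 1918 private ranges, link-local, `.localhost` names, and `host.docker.internal`, with exact hostname matching and `0.0.0.0` excluded — can be sketched in one predicate. This is an illustrative TypeScript rendering; the project's actual implementation uses Python's `ipaddress` module.

```typescript
// Parse a dotted-quad IPv4 string to an unsigned 32-bit int, or -1 if invalid.
function ipToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => Number.isNaN(p) || p < 0 || p > 255)) {
    return -1;
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True if `ip` falls inside base/prefix (CIDR membership test).
function inCidr(ip: string, base: string, prefix: number): boolean {
  const ipInt = ipToInt(ip);
  const baseInt = ipToInt(base);
  if (ipInt < 0 || baseInt < 0) return false;
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return (ipInt & mask) === (baseInt & mask);
}

// HTTP is allowed only for local/private targets; everything else needs HTTPS.
// Exact hostname matching prevents bypasses like "localhost.attacker.com",
// and 0.0.0.0 matches none of the ranges below, so it stays rejected.
function isHttpAllowedHost(hostname: string): boolean {
  if (hostname === "localhost" || hostname.endsWith(".localhost")) return true;
  if (hostname === "host.docker.internal") return true;
  return (
    inCidr(hostname, "127.0.0.0", 8) ||    // loopback
    inCidr(hostname, "10.0.0.0", 8) ||     // RFC 1918 Class A
    inCidr(hostname, "172.16.0.0", 12) ||  // RFC 1918 Class B
    inCidr(hostname, "192.168.0.0", 16) || // RFC 1918 Class C
    inCidr(hostname, "169.254.0.0", 16)    // link-local
  );
}
```

Note how `endsWith(".localhost")` plus strict equality replaces the vulnerable substring check: a hostname must either *be* an allowed name or parse as an IP inside an allowed range.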

---------

Co-authored-by: tazmon95 <tazmon95@users.noreply.github.com>
Co-authored-by: root <root@supatest2.jtpa.net>

* Fix business document categorization bug

- Fixed missing knowledge_type and tags parameters in DocumentStorageService.upload_document()
- Added source_type='file' to document chunk metadata for proper categorization
- Enhanced source metadata creation to include source_type based on source_id pattern
- Fixed metadata spread order in knowledge_item_service to prevent source_type override
- Business documents now correctly show pink color theme and appear in Business Documents section

Fixes issue where business documents were incorrectly stored as technical knowledge
and appeared with blue color theme instead of pink.

* feat: Add Gemini CLI support to MCPPage and IDEGlobalRules

* fix(mcp): Fix update_task signature and MCP instructions

Resolves #420 - Tasks being duplicated instead of updated

Changes:
1. Fixed update_task function signature to use individual optional parameters
   - Changed from TypedDict to explicit parameters (title, status, etc.)
   - Consistent with update_project and update_document patterns
   - Builds update_fields dict internally from provided parameters

2. Updated MCP instructions with correct function names
   - Replaced non-existent manage_task with actual functions
   - Added complete function signatures for all tools
   - Improved workflow documentation with concrete examples

This fixes the issue where AI agents were confused by:
- Wrong function names in instructions (manage_task vs update_task)
- Inconsistent parameter patterns across update functions
- TypedDict magic that wasn't clearly documented

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* test(mcp): Update tests for new update_task signature

- Fixed test_update_task_status to use individual parameters
- Added test_update_task_no_fields for validation testing
- All MCP tests passing (44 tests)

* style(mcp): Clean up whitespace in MCP instructions

- Remove trailing whitespace
- Consistent formatting in instruction text

* Fix crawler timeout for JavaScript-heavy documentation sites

Remove wait_for='body' selector from documentation site crawling config.
The body element exists immediately in HTML, causing unnecessary timeouts
for JavaScript-rendered content. Now relies on domcontentloaded event
and delay_before_return_html for proper JavaScript execution.

* chore: Remove unused imports and fix exception chaining

- Remove unused asyncio imports from batch.py and recursive.py
- Add proper exception chaining with 'from e' to preserve stack traces

* fix: Apply URL transformation before crawling in recursive strategy

- Transform URLs to raw content (e.g., GitHub blob -> raw) before sending to crawler
- Maintain mapping dictionary to preserve original URLs in results
- Align progress callback signatures between batch and recursive strategies
- Add safety guards for missing links attribute
- Remove unused loop counter in batch strategy
- Optimize binary file checks to avoid duplicate calls

This ensures GitHub files are crawled as raw content instead of HTML pages,
fixing the issue where content extraction was degraded due to HTML wrapping.
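The transform-then-map approach above can be sketched like this (function names are illustrative, assuming the standard GitHub blob→raw rewrite):

```python
def to_raw_url(url: str) -> str:
    """Transform a GitHub blob URL into its raw-content equivalent."""
    if "github.com" in url and "/blob/" in url:
        return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
    return url

def prepare_urls(urls: list[str]) -> tuple[list[str], dict[str, str]]:
    """Return crawlable URLs plus a raw->original mapping, so results
    can still be reported under the URL the user submitted."""
    mapping: dict[str, str] = {}
    crawl_urls: list[str] = []
    for original in urls:
        raw = to_raw_url(original)
        mapping[raw] = original
        crawl_urls.append(raw)
    return crawl_urls, mapping
```

The crawler fetches the raw URLs (clean markdown/source, no HTML chrome), and the mapping restores the original blob URLs when storing results.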

* Improve development environment with Docker Compose profiles (#435)

* Add improved development environment with backend in Docker and frontend locally

- Created dev.bat script to run backend services in Docker and frontend locally
- Added docker-compose.backend.yml for backend-only Docker setup
- Updated package.json to run frontend on port 3737
- Fixed api.ts to use default port 8181 instead of throwing error
- Script automatically stops production containers to avoid port conflicts
- Provides instant HMR for frontend development

* Refactor development environment setup: replace dev.bat with Makefile for cross-platform support and enhanced commands

* Enhance development environment: add environment variable checks and update test commands for frontend and backend

* Improve development environment with Docker Compose profiles

This commit enhances the development workflow by replacing the separate
docker-compose.backend.yml file with Docker Compose profiles, fixing
critical service discovery issues, and adding comprehensive developer
tooling through an improved Makefile system.

Key improvements:
- Replace docker-compose.backend.yml with cleaner profile approach
- Fix service discovery by maintaining consistent container names
- Fix port mappings (3737:3737 instead of 3737:5173)
- Add make doctor for environment validation
- Fix port configuration and frontend HMR
- Improve error handling with .SHELLFLAGS in Makefile
- Add comprehensive port configuration via environment variables
- Simplify make dev-local to only run essential services
- Add logging directory creation for local development
- Document profile strategy in docker-compose.yml

These changes provide three flexible development modes:
- Hybrid mode (default): Backend in Docker, frontend local with HMR
- Docker mode: Everything in Docker for production-like testing
- Local mode: API server and UI run locally

Co-authored-by: Zak Stam <zaksnet@users.noreply.github.com>

* Fix make stop command to properly handle Docker Compose profiles

The stop command now explicitly specifies all profiles to ensure
all containers are stopped regardless of how they were started.

* Fix README to document correct make commands

- Changed 'make lint' to 'make lint-frontend' and 'make lint-backend'
- Removed non-existent 'make logs-server' command
- Added 'make watch-mcp' and 'make watch-agents' commands
- All documented make commands now match what's available in Makefile

* fix: Address critical issues from code review #435

- Create robust environment validation script (check-env.js) that properly parses .env files
- Fix Docker healthcheck port mismatch (5173 -> 3737)
- Remove hard-coded port flags from package.json to allow environment configuration
- Fix Docker detection logic using /.dockerenv instead of HOSTNAME
- Normalize container names to lowercase (archon-server, archon-mcp, etc.)
- Improve stop-local command with port-based fallback for process killing
- Fix API configuration fallback chain to include VITE_PORT
- Fix Makefile shell variable expansion using runtime evaluation
- Update .PHONY targets with comprehensive list
- Add --profile flags to Docker Compose commands in README
- Add VITE_ARCHON_SERVER_PORT to docker-compose.yml
- Add Node.js 18+ to prerequisites
- Use dynamic ports in Makefile help messages
- Add lint alias combining frontend and backend linting
- Update .env.example documentation
- Scope .gitignore logs entry to /logs/

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix container name resolution for MCP server

- Add dynamic container name resolution with three-tier strategy
- Support environment variables for custom container names
- Add service discovery labels to docker-compose services
- Update BackendStartupError with correct container name references

* Fix frontend test failures in API configuration tests

- Update environment variable names to use VITE_ prefix that matches production code
- Fix MCP client service tests to use singleton instance export
- Update default behavior tests to expect fallback to port 8181
- All 77 frontend tests now pass

* Fix make stop-local to avoid Docker daemon interference

Replace aggressive kill -9 with targeted process termination:
- Filter processes by command name (node/vite/python/uvicorn) before killing
- Use graceful SIGTERM instead of SIGKILL
- Add process verification to avoid killing Docker-related processes
- Improve logging with descriptive step messages

* refactor: Simplify development workflow based on comprehensive review

- Reduced Makefile from 344 lines (43 targets) to 83 lines (8 essential targets)
- Removed unnecessary environment variables (*_CONTAINER_NAME variables)
- Fixed Windows compatibility by removing Unix-specific commands
- Added security fixes to check-env.js (path validation)
- Simplified MCP container discovery to use fixed container names
- Fixed 'make stop' to properly handle Docker Compose profiles
- Updated documentation to reflect simplified workflow
- Restored original .env.example with comprehensive Supabase key documentation

This addresses all critical issues from code review:
- Cross-platform compatibility ✅
- Security vulnerabilities fixed ✅
- 81% reduction in complexity ✅
- Maintains all essential functionality ✅

All tests pass: Frontend (77/77), Backend (267/267)

* feat: Add granular test and lint commands to Makefile

- Split test command into test-fe and test-be for targeted testing
- Split lint command into lint-fe and lint-be for targeted linting
- Keep original test and lint commands that run both
- Update help text with new commands for better developer experience

* feat: Improve Docker Compose detection and prefer modern syntax

- Prefer 'docker compose' (plugin) over 'docker-compose' (standalone)
- Add better error handling in Makefile with proper exit on failures
- Add Node.js check before running environment scripts
- Pass environment variables correctly to frontend in hybrid mode
- Update all documentation to use modern 'docker compose' syntax
- Auto-detect which Docker Compose version is available

* docs: Update CONTRIBUTING.md to reflect simplified development workflow

- Add Node.js 18+ as prerequisite for hybrid development
- Mark Make as optional throughout the documentation
- Update all docker-compose commands to modern 'docker compose' syntax
- Add Make command alternatives for testing (make test, test-fe, test-be)
- Document make dev for hybrid development mode
- Remove linting requirements until codebase errors are resolved

* fix: Rename frontend service to archon-frontend for consistency

Aligns frontend service naming with other services (archon-server, archon-mcp, archon-agents) for better consistency in Docker image naming patterns.

---------

Co-authored-by: Zak Stam <zakscomputers@hotmail.com>
Co-authored-by: Zak Stam <zaksnet@users.noreply.github.com>

* Fix document creation and saving in the project management area

* Fix logging error in threading service

Fixed TypeError when passing custom fields to Python logger by using the 'extra' parameter instead of direct keyword arguments. This resolves embedding creation failures during crawl operations.

* Apply linting fixes for better code formatting

- Added trailing commas for multi-line function calls
- Improved line breaks for better readability

* Complete logging fixes for all statements in threading service

Applied the extra parameter pattern to all remaining logging statements (11 more) to ensure consistency and prevent runtime errors when any code path is executed. This completes the fix for the entire file.

* Remove deprecated PRP testing scripts and dead code

- Removed python/src/server/testing/ folder containing deprecated test utilities
- These PRP viewer testing tools were used during initial development
- No longer needed as functionality has been integrated into main codebase
- No dependencies or references found in production code

* Remove original_archon folder from main branch

- Removed the original_archon/ directory containing the legacy Archon v1-v6 iterations
- This was the original AI agent builder system before the pivot to the current architecture
- The folder has been preserved in the 'preserve-original-archon' branch for historical reference
- Reduces repository size by ~5.2MB and removes confusion about which codebase is active

* Fix proxy when specifying PROD=true

* Add a test

* Parse before passing to Vite

* PR Feedback

* Add titles to action buttons in DocumentCard, DraggableTaskCard, and TaskTableView components.

* feat(ui): add accessibility attributes to action buttons

Add type, aria-label, and aria-hidden attributes to action and icon buttons across task and document components to improve accessibility and assistive technology support.

* Fix Docker Compose default profile and error documentation

- Add 'default' profile to all services so 'docker compose up --build -d' works without --profile flag
- Update BackendStartupError.tsx to include '--profile full' in Docker command examples
- Update docker-compose.yml comments to document the new default behavior

This allows users to run either:
- docker compose up --build -d (uses default profile, starts all services)
- docker compose --profile full up --build -d (explicit profile, same result)
- docker compose --profile backend up --build -d (backend services only)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Use generic YOUR_PROFILE placeholder instead of hardcoded 'full' profile

This allows users to choose their preferred profile (backend, full, etc.) rather than
assuming they always want the full profile for error recovery.

* Remove profiles from all services to enable default startup

- Remove profile restrictions from all services so they start with 'docker compose up'
- All services now run by default without requiring --profile flags
- Profile functionality removed - users now use default behavior only
- This enables the requested 'docker compose up --build -d' workflow

* Update error modal to use 'full' profile with helpful note

- Change from generic YOUR_PROFILE to specific 'full' profile
- Add note explaining users can replace 'full' if needed
- Maintains clarity while providing flexibility for different profiles

* Merge UX improvements from PR #443

- Update error modal to show default 'docker compose up --build -d' command
- Add better organized note structure with bullet points
- Include profile-specific fallback example for existing users
- Update README Quick Start to show default command first
- Maintain backward compatibility guidance for profile users

* Fix critical token consumption issue in list endpoints (#488)

- Add include_content parameter to ProjectService.list_projects()
- Add exclude_large_fields parameter to TaskService.list_tasks()
- Add include_content parameter to DocumentService.list_documents()
- Update all MCP tools to use lightweight responses by default
- Fix critical N+1 query problem in ProjectService (was making separate query per project)
- Add response size monitoring and logging for validation
- Add comprehensive unit and integration tests

Results:
- Projects endpoint: 99.3% token reduction (27,055 -> 194 tokens)
- Tasks endpoint: 98.2% token reduction (12,750 -> 226 tokens)
- Documents endpoint: Returns metadata with content_size instead of full content
- Maintains full backward compatibility with default parameters
- Single query optimization eliminates N+1 performance issue
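The lightweight-by-default pattern can be sketched as below; the field names and helper are hypothetical stand-ins for the actual DTO shaping:

```python
LARGE_FIELDS = ("docs", "features", "data", "prd")  # illustrative heavy fields

def to_list_item(project: dict, include_content: bool = False) -> dict:
    """Default lightweight shape for list endpoints: metadata plus the
    size of heavy fields instead of the fields themselves."""
    if include_content:
        return project  # full payload: backward-compatible opt-in
    slim = {k: v for k, v in project.items() if k not in LARGE_FIELDS}
    slim["content_size"] = sum(len(str(project.get(f, ""))) for f in LARGE_FIELDS)
    return slim
```

Callers that need full content pass `include_content=True`; everyone else (notably MCP tools feeding LLM context windows) gets the slim shape, which is where the ~99% token reduction comes from.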

* fix: include_archived flag now works correctly in task listing

- Add include_archived parameter to TaskService.list_tasks()
- Service now conditionally applies archived filter based on parameter
- Add 'archived' field to task DTO for client visibility
- Update API endpoints to pass include_archived down to service
- Remove redundant client-side filtering in API layer
- Fix type hints in integration tests (dict[str, Any] | None)
- Use pytest.skip() instead of return for proper test reporting

These fixes address the functional bug identified by CodeRabbit where
archived tasks couldn't be retrieved even when explicitly requested.

* feat: disable agents service by default using Docker profiles

- Add 'agents' profile to archon-agents service
- Remove archon-agents as dependency from archon-mcp service
- Service now only starts with --profile agents flag
- Prevents startup issues while agents service is under development
- All core functionality continues to work without agents

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: address PR review feedback for agents service disabling

- Fix misleading profile documentation at top of docker-compose.yml
- Add AGENTS_ENABLED flag for cleaner agent service handling
- Make AGENTS_SERVICE_URL configurable via environment variable
- Prevent noisy connection errors when agents service isn't running

This provides a cleaner way to disable the agents service and allows
the application to skip agent wiring when AGENTS_ENABLED=false.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Add PRP story task template and reorganize PRP commands (#508)

* Reorganize PRP commands and add story task template

- Move PRP commands to dedicated subdirectories
- Add new agent definitions for codebase analysis and library research
- Create story task PRP template for user story implementation
- Rename prp-base.md to prp_base.md for consistency

* Update .claude/agents/library-researcher.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .claude/commands/prp-claude-code/prp-story-task-create.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update .claude/commands/prp-claude-code/prp-story-task-create.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update PRPs/templates/prp_story_task.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Update PRPs/templates/prp_story_task.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: resolve container startup race condition in agents service (#451) (#503)

* Add depends_on condition and environment variable

Update Vite configuration to enable allowed hosts

- Uncommented the allowedHosts configuration to allow for dynamic host settings based on environment variables.
- This change enhances flexibility for different deployment environments while maintaining the default localhost and specific domain access.

Needs testing to confirm proper functionality with various host configurations.

Remove personal domain from the configuration

* Enhance Vite configuration with dynamic allowed hosts support

- Added VITE_ALLOWED_HOSTS environment variable to .env.example and docker-compose.yml for flexible host configuration.
- Updated Vite config to dynamically set allowed hosts, incorporating defaults and custom values from the environment variable.
- This change improves deployment flexibility while maintaining security by defaulting to localhost and specific domains.

Needs testing to confirm proper functionality with various host configurations.

* refactor: remove unnecessary dependency on archon-agents in docker-compose.yml

- Removed the dependency condition for archon-agents from the archon-mcp service to streamline the startup process.
- This change simplifies the service configuration and reduces potential startup issues related to agent service health checks.

Needs testing to ensure that the application functions correctly without the archon-agents dependency.

---------

Co-authored-by: Julian Gegenhuber <office@salzkammercode.at>

* Fix race condition in concurrent crawling with unique source IDs (#472)

* Fix race condition in concurrent crawling with unique source IDs

- Add unique hash-based source_id generation to prevent conflicts
- Separate source identification from display with three fields:
  - source_id: 16-char SHA256 hash for unique identification
  - source_url: Original URL for tracking
  - source_display_name: Human-friendly name for UI
- Add comprehensive test suite validating the fix
- Migrate existing data with backward compatibility
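The three-field split can be sketched as follows — a minimal illustration of the ID and display-name generation, with function names assumed from the description above (the real `URLHandler` methods may differ):

```python
import hashlib
from urllib.parse import urlparse

def generate_source_id(url: str) -> str:
    """16 hex chars of SHA256 (64 bits): unique per URL, so two
    concurrent crawls of github.com pages no longer collide on the
    shared domain name. At 64 bits, the birthday-bound collision
    probability for a million sources is on the order of 1e-8."""
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def extract_display_name(url: str) -> str:
    """Human-friendly name for the UI, e.g. 'github.com/org/repo'."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    return f"{parsed.netloc}{path}" if path else parsed.netloc
```

Two different repos on the same domain now get two distinct `source_id` values, while `source_display_name` keeps the UI readable.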

* Fix title generation to use source_display_name for better AI context

- Pass source_display_name to title generation function
- Use display name in AI prompt instead of hash-based source_id
- Results in more specific, meaningful titles for each source

* Skip AI title generation when display name is available

- Use source_display_name directly as title to avoid unnecessary AI calls
- More efficient and predictable than AI-generated titles
- Keep AI generation only as fallback for backward compatibility

* Fix critical issues from code review

- Add missing os import to prevent NameError crash
- Remove unused imports (pytest, Mock, patch, hashlib, urlparse, etc.)
- Fix GitHub API capitalization consistency
- Reuse existing DocumentStorageService instance
- Update test expectations to match corrected capitalization

Addresses CodeRabbit review feedback on PR #472

* Add safety improvements from code review

- Truncate display names to 100 chars when used as titles
- Document hash collision probability (negligible for <1M sources)

Simple, pragmatic fixes per KISS principle

* Fix code extraction to use hash-based source_ids and improve display names

- Fixed critical bug where code extraction was using old domain-based source_ids
- Updated code extraction service to accept source_id as parameter instead of extracting from URL
- Added special handling for llms.txt and sitemap.xml files in display names
- Added comprehensive tests for source_id handling in code extraction
- Removed unused urlparse import from code_extraction_service.py

This fixes the foreign key constraint errors that were preventing code examples
from being stored after the source_id architecture refactor.

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix critical variable shadowing and source_type determination issues

- Fixed variable shadowing in document_storage_operations.py where source_url parameter
  was being overwritten by document URLs, causing incorrect source_url in database
- Fixed source_type determination to use actual URLs instead of hash-based source_id
- Added comprehensive tests for source URL preservation
- Ensure source_type is correctly set to "file" for file uploads, "url" for web crawls

The variable shadowing bug was causing sitemap sources to have the wrong source_url
(last crawled page instead of sitemap URL). The source_type bug would mark all
sources as "url" even for file uploads due to hash-based IDs not starting with "file_".

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix URL canonicalization and document metrics calculation

- Implement proper URL canonicalization to prevent duplicate sources
  - Remove trailing slashes (except root)
  - Remove URL fragments
  - Remove tracking parameters (utm_*, gclid, fbclid, etc.)
  - Sort query parameters for consistency
  - Remove default ports (80 for HTTP, 443 for HTTPS)
  - Normalize scheme and domain to lowercase

- Fix avg_chunks_per_doc calculation to avoid division by zero
  - Track processed_docs count separately from total crawl_results
  - Handle all-empty document sets gracefully
  - Show processed/total in logs for better visibility

- Add comprehensive tests for both fixes
  - 10 test cases for URL canonicalization edge cases
  - 4 test cases for document metrics calculation

This prevents database constraint violations when crawling the same
content with URL variations and provides accurate metrics in logs.
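The canonicalization rules listed above can be sketched in one function (a simplified illustration; the real tracking-parameter list is longer):

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

TRACKING_PARAMS = {"gclid", "fbclid"}  # plus any utm_* parameter

def canonicalize_url(url: str) -> str:
    """Normalize URL variants to one canonical form so the same
    content never produces two source rows."""
    p = urlparse(url)
    scheme = p.scheme.lower()
    netloc = (p.hostname or "").lower()
    # Keep the port only when it is not the scheme's default
    if p.port and (scheme, p.port) not in {("http", 80), ("https", 443)}:
        netloc = f"{netloc}:{p.port}"
    # Strip tracking params, sort the rest for stable ordering
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(p.query)
        if k not in TRACKING_PARAMS and not k.startswith("utm_")
    ))
    # Trailing slash removed except for the bare root path
    path = p.path.rstrip("/") or "/"
    # Fragment is always dropped
    return urlunparse((scheme, netloc, path, "", query, ""))
```

With this in place, `https://Example.com:443/docs/?utm_source=x` and `https://example.com/docs` hash to the same `source_id` instead of creating duplicate sources.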

* Fix synchronous extract_source_summary blocking async event loop

- Run extract_source_summary in thread pool using asyncio.to_thread
- Prevents blocking the async event loop during AI summary generation
- Preserves exact error handling and fallback behavior
- Variables (source_id, combined_content) properly passed to thread

Added comprehensive tests verifying:
- Function runs in thread without blocking
- Error handling works correctly with fallback
- Multiple sources can be processed
- Thread safety with variable passing
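The offloading pattern is simply `asyncio.to_thread`. A minimal sketch, with the blocking AI call simulated by a sleep (the real `extract_source_summary` makes an LLM request):

```python
import asyncio
import time

def extract_source_summary(source_id: str, content: str) -> str:
    """Stand-in for the synchronous summary call."""
    time.sleep(0.05)  # simulates the blocking LLM request
    return f"summary of {source_id}"

async def summarize(source_id: str, content: str) -> str:
    # Run the blocking call in a worker thread, keeping the event
    # loop free to serve progress polling and other requests.
    try:
        return await asyncio.to_thread(extract_source_summary, source_id, content)
    except Exception:
        # Fallback mirroring the preserved error-handling behavior
        return f"Content from {source_id}"

result = asyncio.run(summarize("abc123", "..."))
```

Arguments are passed positionally to `asyncio.to_thread`, which is why `source_id` and `combined_content` reach the thread unchanged.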

* Fix synchronous update_source_info blocking async event loop

- Run update_source_info in thread pool using asyncio.to_thread
- Prevents blocking the async event loop during database operations
- Preserves exact error handling and fallback behavior
- All kwargs properly passed to thread execution

Added comprehensive tests verifying:
- Function runs in thread without blocking
- Error handling triggers fallback correctly
- All kwargs are preserved when passed to thread
- Existing extract_source_summary tests still pass

* Fix race condition in source creation using upsert

- Replace INSERT with UPSERT for new sources to prevent PRIMARY KEY violations
- Handles concurrent crawls attempting to create the same source
- Maintains existing UPDATE behavior for sources that already exist

Added comprehensive tests verifying:
- Concurrent source creation doesn't fail
- Upsert is used for new sources (not insert)
- Update is still used for existing sources
- Async concurrent operations work correctly
- Race conditions with delays are handled

This prevents database constraint errors when multiple crawls target
the same URL simultaneously.
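The upsert idea can be illustrated with SQLite's `ON CONFLICT` clause (the project itself goes through Supabase's upsert API, but the semantics are the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE archon_sources ("
    "source_id TEXT PRIMARY KEY, source_url TEXT, summary TEXT)"
)

def upsert_source(conn, source_id, source_url, summary):
    # INSERT ... ON CONFLICT makes the write idempotent: a second
    # crawl racing on the same source updates the row instead of
    # raising a PRIMARY KEY violation.
    conn.execute(
        """INSERT INTO archon_sources (source_id, source_url, summary)
           VALUES (?, ?, ?)
           ON CONFLICT(source_id) DO UPDATE SET summary = excluded.summary""",
        (source_id, source_url, summary),
    )

# Two crawls racing on the same URL -> same source_id, no constraint error
upsert_source(db, "a1b2c3d4e5f60718", "https://example.com", "first")
upsert_source(db, "a1b2c3d4e5f60718", "https://example.com", "second")
```

The database serializes the two writes, so whichever crawl lands second simply updates the row the first one created.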

* Add migration detection UI components

Add MigrationBanner component with clear user instructions for database schema updates. Add useMigrationStatus hook for periodic health check monitoring with graceful error handling.

* Integrate migration banner into main app

Add migration status monitoring and banner display to App.tsx. Shows migration banner when database schema updates are required.

* Enhance backend startup error instructions

Add detailed Docker restart instructions and migration script guidance. Improves user experience when encountering startup failures.

* Add database schema caching to health endpoint

Implement smart caching for schema validation to prevent repeated database queries. Cache successful validations permanently and throttle failures to 30-second intervals. Replace debug prints with proper logging.
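The caching policy above (permanent positive cache, 30-second failure throttle) can be sketched as a small class; the name and constructor are illustrative, not the actual implementation:

```python
import time

class SchemaCheckCache:
    """Cache successful schema validations forever; re-check
    failures at most every `throttle` seconds."""

    def __init__(self, validate, throttle=30.0, clock=time.monotonic):
        self._validate = validate  # callable that queries the database
        self._throttle = throttle
        self._clock = clock
        self._valid = False
        self._last_failure = None

    def is_valid(self) -> bool:
        if self._valid:
            return True  # permanent positive cache: no DB query
        now = self._clock()
        if self._last_failure is not None and now - self._last_failure < self._throttle:
            return False  # throttled: don't hammer the database
        self._valid = self._validate()
        if not self._valid:
            self._last_failure = now
        return self._valid
```

Injecting the clock makes the throttle testable without sleeping, and a successful validation short-circuits every later health check.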

* Clean up knowledge API imports and logging

Remove duplicate import statements and redundant logging. Improves code clarity and reduces log noise.

* Remove unused instructions prop from MigrationBanner

Clean up component API by removing instructions prop that was accepted but never rendered. Simplifies the interface and eliminates dead code while keeping the functional hardcoded migration steps.

* Add schema_valid flag to migration_required health response

Add schema_valid: false flag to health endpoint response when database schema migration is required. Improves API consistency without changing existing behavior.

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Hotfix: resolve crawls hanging after embedding rate limiting

* Moving Dockerfiles to uv for package installation (#533)

* Moving Dockerfiles to uv for package installation

* Updating uv installation for CI

* Reduce the size of sentence-transformers by making it CPU-only; reranking is now included by default (#534)

* CI now fails when backend unit tests fail (#536)

* CI now fails when backend unit tests fail

* Fix a couple of unit tests

* Documentation improvements for MCP and README (#540)

* Spacing updates for Make installation in README

* add PRPs/completed/ to gitignore

* add archon-coderabbit-helper slash command

* refactor: Remove Socket.IO and implement HTTP polling architecture (#514)

* refactor: Remove Socket.IO and consolidate task status naming

Major refactoring to simplify the architecture:

1. Socket.IO Removal:
   - Removed all Socket.IO dependencies and code (~4,256 lines)
   - Replaced with HTTP polling for real-time updates
   - Added new polling hooks (usePolling, useDatabaseMutation, etc.)
   - Removed socket services and handlers

2. Status Consolidation:
   - Removed UI/DB status mapping layer
   - Using database values directly (todo, doing, review, done)
   - Removed obsolete status types and mapping functions
   - Updated all components to use database status values

3. Simplified Architecture:
   - Cleaner separation between frontend and backend
   - Reduced complexity in state management
   - More maintainable codebase

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Add loading states and error handling for UI operations

- Added loading overlay when dragging tasks between columns
- Added loading state when switching between projects
- Added proper error handling with toast notifications
- Removed remaining Socket.IO references
- Improved user feedback during async operations

* docs: Add comprehensive polling architecture documentation

Created developer guide explaining:
- Core polling components and hooks
- ETag caching implementation
- State management patterns
- Migration from Socket.IO
- Performance optimizations
- Developer guidelines and best practices

* fix: Correct method name for fetching tasks

- Fixed projectService.getTasks() to projectService.getTasksByProject()
- Ensures consistent naming throughout the codebase
- Resolves error when refreshing tasks after drag operations

* docs: Add comprehensive API naming conventions guide

Created naming standards documentation covering:
- Service method naming patterns
- API endpoint conventions
- Component and hook naming
- State variable naming
- Type definitions
- Common patterns and anti-patterns
- Migration notes from Socket.IO

* docs: Update CLAUDE.md with polling architecture and naming conventions

- Replaced Socket.IO references with HTTP polling architecture
- Added polling intervals and ETag caching documentation
- Added API naming conventions section
- Corrected task endpoint patterns (use getTasksByProject, not getTasks)
- Added state naming patterns and status values

* refactor: Remove Socket.IO and implement HTTP polling architecture

Complete removal of Socket.IO/WebSocket dependencies in favor of simple HTTP polling:

Frontend changes:
- Remove all WebSocket/Socket.IO references from KnowledgeBasePage
- Implement useCrawlProgressPolling hook for progress tracking
- Fix polling hook to prevent ERR_INSUFFICIENT_RESOURCES errors
- Add proper cleanup and state management for completed crawls
- Persist and restore active crawl progress across page refreshes
- Fix agent chat service to handle disabled agents gracefully

Backend changes:
- Remove python-socketio from requirements
- Convert ProgressTracker to in-memory state management
- Add /api/crawl-progress/{id} endpoint for polling
- Initialize ProgressTracker immediately when operations start
- Remove all Socket.IO event handlers and cleanup commented code
- Simplify agent_chat_api to basic REST endpoints

Bug fixes:
- Fix race condition where progress data wasn't available for polling
- Fix memory leaks from recreating polling callbacks
- Fix crawl progress URL mismatch between frontend and backend
- Add proper error filtering for expected 404s during initialization
- Stop polling when crawl operations complete

This change simplifies the architecture significantly and makes it more robust
by removing the complexity of WebSocket connections.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
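The client side of the polling-plus-ETag scheme can be sketched as follows. This is a language-neutral illustration in Python (the real hooks are TypeScript); the `fetch` callable stands in for an HTTP request carrying `If-None-Match`:

```python
class ETagPoller:
    """Remember the last ETag and only replace cached data when the
    server returns a fresh body (anything other than 304)."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(etag) -> (status, etag, body)
        self._etag = None
        self._data = None

    def poll(self):
        status, etag, body = self._fetch(self._etag)
        if status == 304:
            return self._data  # unchanged: server sent no body
        self._etag = etag
        self._data = body
        return self._data
```

On every interval the client sends its last ETag; the server answers 304 with an empty body when nothing changed, which is where the bandwidth savings of ETag polling come from.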

* Fix data consistency issue in crawl completion

- Modify add_documents_to_supabase to return actual chunks stored count
- Update crawl orchestration to validate chunks were actually saved to database
- Throw exception when chunks are processed but none stored (e.g., API key failures)
- Ensure UI shows error state instead of false success when storage fails
- Add proper error field to progress updates for frontend display

This prevents misleading "crawl completed" status when backend fails to store data.

* Consolidate API key access to unified LLM provider service pattern

- Fix credential service to properly store encrypted OpenAI API key from environment
- Remove direct environment variable access pattern from source management service
- Update both extract_source_summary and generate_source_title_and_metadata to async
- Convert all LLM operations to use get_llm_client() for multi-provider support
- Fix callers in document_storage_operations.py and storage_services.py to use await
- Improve title generation prompt with better context and examples for user-readable titles
- Consolidate on single pattern that supports OpenAI, Google, Ollama providers

This fixes embedding service failures while maintaining compatibility for future providers.

* Fix async/await consistency in source management services

- Make update_source_info async and await it properly
- Fix generate_source_title_and_metadata async calls
- Improve source title generation with URL-based detection
- Remove unnecessary threading wrapper for async operations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: correct API response handling in MCP project polling

- Fix polling logic to properly extract projects array from API response
- The API returns {projects: [...]} but polling was trying to iterate directly over the response
- This caused 'str' object has no attribute 'get' errors during project creation
- Update both create_project polling and list_projects response handling
- Verified all MCP tools now work correctly including create_project

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Optimize project switching performance and eliminate task jumping

- Replace race-condition-prone polling refetch with direct API calls for immediate task loading (100-200ms vs 1.5-2s)
- Add polling suppression during direct API calls to prevent task jumping from double setTasks() calls
- Clear stale tasks immediately on project switch to prevent wrong data visibility
- Maintain polling for background updates from agents/MCP while optimizing user-initiated actions

Performance improvements:
- Project switches now load tasks in 100-200ms instead of 1.5-2 seconds
- Eliminated visual task jumping during project transitions
- Clean separation: direct calls for user actions, polling for external updates

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Remove race condition anti-pattern and complete Socket.IO removal

Critical fixes addressing code review findings:

**Race Condition Resolution:**
- Remove fragile isLoadingDirectly flag that could permanently disable polling
- Remove competing polling onSuccess callback that caused task jumping
- Clean separation: direct API calls for user actions, polling for external updates only

**Socket.IO Removal:**
- Replace projectCreationProgressService with useProgressPolling HTTP polling
- Remove all Socket.IO dependencies and references
- Complete migration to HTTP-only architecture

**Performance Optimization:**
- Add ETag support to /projects/{project_id}/tasks endpoint for 70% bandwidth savings
- Remove competing TasksTab onRefresh system that caused multiple API calls
- Single source of truth: polling handles background updates, direct calls for immediate feedback

**Task Management Simplification:**
- Remove onRefresh calls from all TasksTab operations (create, update, delete, move)
- Operations now use optimistic updates with polling fallback
- Eliminates 3-way race condition between polling, direct calls, and onRefresh

Result: Fast project switching (100-200ms), no task jumping, clean polling architecture

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Remove remaining Socket.IO and WebSocket references

- Remove WebSocket URL configuration from api.ts
- Clean up WebSocket tests and mocks from test files
- Remove websocket parameter from embedding service
- Update MCP project tools tests to match new API response format
- Add example real test for usePolling hook
- Update vitest config to properly include test files

* Add comprehensive unit tests for polling architecture

- Add ETag utilities tests covering generation and checking logic
- Add progress API tests with 304 Not Modified support
- Add progress service tests for operation tracking
- Add projects API polling tests with ETag validation
- Fix projects API to properly handle ETag check independently of response object
- Test coverage for critical polling components following MCP test patterns

* Remove WebSocket functionality from service files

- Remove getWebSocketUrl imports that were causing runtime errors
- Replace WebSocket log streaming with deprecation warnings
- Remove unused WebSocket properties and methods
- Simplify disconnectLogs to no-op functions

These services now use HTTP polling exclusively as part of the
Socket.IO to polling migration.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix memory leaks in mutation hooks

- Add isMountedRef to track component mount status
- Guard all setState calls with mounted checks
- Prevent callbacks from firing after unmount
- Apply fix to useProjectMutation, useDatabaseMutation, and useAsyncMutation

Addresses Code Rabbit feedback about potential state updates after
component unmount. Simple pragmatic fix without over-engineering
request cancellation.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
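The mounted-guard idea can be sketched framework-free; in the real hooks this is a React ref flipped to false in the effect cleanup, and the names below are illustrative:

```typescript
// Minimal mounted-guard sketch. guard() wraps a state-setting callback so
// it becomes a no-op once unmount() has run, preventing setState-after-unmount.
function createMountGuard() {
  let mounted = true;
  return {
    unmount(): void { mounted = false; },
    guard<T extends unknown[]>(fn: (...args: T) => void) {
      return (...args: T): void => { if (mounted) fn(...args); };
    },
  };
}
```

This is the "simple pragmatic fix" tradeoff from the commit: late callbacks are silently dropped rather than the request being cancelled.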

* Document ETag implementation and limitations

- Add concise documentation explaining current ETag implementation
- Document that we use simple equality check, not full RFC 7232
- Clarify this works for our browser-to-API use case
- Note limitations for future CDN/proxy support

Addresses Code Rabbit feedback about RFC compliance by documenting
the known limitations of our simplified implementation.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
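The documented simplification can be sketched as follows; this is an illustrative stand-in for the real ETag utilities, not their actual code:

```typescript
import { createHash } from "node:crypto";

// Hash the serialized payload into an opaque quoted tag.
function generateEtag(payload: string): string {
  return `"${createHash("sha256").update(payload).digest("hex").slice(0, 16)}"`;
}

// Simple equality check, not full RFC 7232: no weak (W/) comparison,
// no "*" wildcard, no comma-separated If-None-Match lists. Adequate for
// a single browser-to-API hop; a CDN/proxy in front would need more.
function isNotModified(ifNoneMatch: string | undefined, etag: string): boolean {
  return ifNoneMatch !== undefined && ifNoneMatch === etag;
}
```

When `isNotModified` returns true the endpoint replies `304 Not Modified` with an empty body, which is where the bandwidth savings come from.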

* Remove all WebSocket event schemas and functionality

- Remove WebSocket event schemas from projectSchemas.ts
- Remove WebSocket event types from types/project.ts
- Remove WebSocket initialization and subscription methods from projectService.ts
- Remove all broadcast event calls throughout the service
- Clean up imports to remove unused types

Complete removal of WebSocket infrastructure in favor of HTTP polling.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix progress field naming inconsistency

- Change backend API to return 'progress' instead of 'percentage'
- Remove unnecessary mapping in frontend
- Use consistent 'progress' field name throughout
- Update all progress initialization to use 'progress' field

Simple consolidation to one field name instead of mapping between two.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix tasks polling data not updating UI

- Update tasks state when polling returns new data
- Keep UI in sync with server changes for selected project
- Tasks now live-update from external changes without project switching

The polling was fetching fresh data but never updating the UI state.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix incorrect project title in pin/unpin toast messages

- Use API response data.title instead of selectedProject?.title
- Shows correct project name when pinning/unpinning any project card
- Toast now accurately reflects which project was actually modified

The issue was the toast would show the wrong project name when pinning
a project that wasn't the currently selected one.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

* Remove over-engineered tempProjects logic

Removed all temporary project tracking during creation:
- Removed tempProjects state and allProjects combining
- Removed handleProjectCreationProgress function
- Removed progress polling for project creation
- Removed ProjectCreationProgressCard rendering
- Simplified createProject to just create and let polling pick it up

This fixes false 'creation failed' errors and simplifies the code significantly.
Project creation now shows a simple toast and relies on polling for updates.

* Optimize task count loading with parallel fetching

Changed loadTaskCountsForAllProjects to use Promise.allSettled for parallel API calls:
- All project task counts now fetched simultaneously instead of sequentially
- Better error isolation - one project failing doesn't affect others
- Significant performance improvement for users with multiple projects
- With 5 projects, total wall time drops from 5×API_TIME to roughly 1×API_TIME
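The parallel-fetch-with-isolation pattern looks roughly like this; `fetchTaskCount` is a hypothetical stand-in for the real API call:

```typescript
// Fetch all task counts concurrently; Promise.allSettled never rejects,
// so one failing project cannot take down the whole batch.
async function loadTaskCounts(
  projectIds: string[],
  fetchTaskCount: (id: string) => Promise<number>,
): Promise<Record<string, number>> {
  const results = await Promise.allSettled(projectIds.map((id) => fetchTaskCount(id)));
  const counts: Record<string, number> = {};
  results.forEach((result, i) => {
    // A rejected fetch falls back to 0 instead of failing the batch.
    counts[projectIds[i]] = result.status === "fulfilled" ? result.value : 0;
  });
  return counts;
}
```

`Promise.all` would have given the same parallelism but rejects on the first failure, losing the error isolation the commit calls out.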

* Fix TypeScript timer type for browser compatibility

Replace NodeJS.Timeout with ReturnType<typeof setInterval> in crawlProgressService.
This makes the timer type compatible across both Node.js and browser environments,
fixing TypeScript compilation errors in browser builds.
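The portable timer typing can be sketched as a small wrapper (illustrative, not the actual crawlProgressService code):

```typescript
// ReturnType<typeof setInterval> resolves to number under DOM typings and
// NodeJS.Timeout under Node typings, so the same code compiles in both builds.
class PollingTimer {
  private timer: ReturnType<typeof setInterval> | null = null;

  start(fn: () => void, intervalMs: number): void {
    this.stop(); // never leak a previous interval
    this.timer = setInterval(fn, intervalMs);
  }

  stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

Hard-coding `NodeJS.Timeout` instead ties the file to `@types/node` and breaks browser-only builds, which is exactly the compilation error the commit fixes.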

* Add explicit status mappings for crawl progress states

Map backend statuses to correct UI states:
- 'processing' → 'processing' (use existing UI state)
- 'queued' → 'starting' (pre-crawl state)
- 'cancelled' → 'cancelled' (use existing UI state)

This prevents incorrect UI states and gives users accurate feedback about
crawl operation status.
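A sketch of an explicit mapping table covering the three cases above; the full status set and the fallback choice are assumptions, not the actual service code:

```typescript
type UiCrawlStatus = "starting" | "processing" | "completed" | "failed" | "cancelled";

// Map backend statuses onto existing UI states explicitly instead of
// passing unknown strings through to the UI.
function mapCrawlStatus(backendStatus: string): UiCrawlStatus {
  switch (backendStatus) {
    case "queued": return "starting";      // pre-crawl state
    case "processing": return "processing"; // existing UI state
    case "cancelled": return "cancelled";   // existing UI state
    case "completed": return "completed";
    case "failed": return "failed";
    default: return "processing";           // unknown statuses: assume in progress
  }
}
```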

* Fix TypeScript timer types in pollingService for browser compatibility

Replace NodeJS.Timer with ReturnType<typeof setInterval> in both
TaskPollingService and ProjectPollingService classes. This ensures
compatibility across Node.js and browser environments.

* Remove unused pollingService.ts dead code

This file was created during Socket.IO removal but never actually used.
The application already uses usePolling hooks (useTaskPolling, useProjectPolling)
which have proper ETag support and visibility handling.

Removing dead code to reduce maintenance burden and confusion.

* Fix TypeScript timer type in progressService for browser compatibility

Replace NodeJS.Timer with ReturnType<typeof setInterval> to ensure
compatibility across Node.js and browser environments, consistent with
other timer type fixes throughout the codebase.

* Fix TypeScript timer type in projectCreationProgressService

Replace NodeJS.Timeout with ReturnType<typeof setInterval> in Map type
to ensure browser/DOM build compatibility.

* Add proper error handling to project creation progress polling

Stop infinite polling on fatal errors:
- 404 errors continue polling (resource might not exist yet)
- Other HTTP errors (500, 503, etc.) stop polling and report error
- Network/parsing errors stop polling and report error
- Clear feedback to callbacks on all error types

This prevents wasting resources polling forever on unrecoverable errors
and provides better user feedback when things go wrong.
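The error-classification rule above can be sketched as a pure decision function; the shape of the error object here is an assumption for illustration:

```typescript
type PollDecision = { continuePolling: boolean; reason: string };

// 404 → keep polling (resource may not exist yet); any other HTTP status,
// or a network/parsing error with no status at all, is fatal.
function classifyPollError(error: unknown): PollDecision {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: number }).status;
    if (status === 404) {
      return { continuePolling: true, reason: "resource may not exist yet" };
    }
    return { continuePolling: false, reason: `fatal HTTP ${status}` };
  }
  return { continuePolling: false, reason: "network or parsing error" };
}
```

Keeping the decision in one function makes the "stop on unrecoverable errors" policy testable independently of the polling loop.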

* Fix documentation accuracy in API conventions and architecture docs

- Fix API_NAMING_CONVENTIONS.md: Changed 'documents' to 'docs' and used
  distinct placeholders ({project_id} and {doc_id}) to match actual API routes
- Fix POLLING_ARCHITECTURE.md: Updated import path to use relative import
  (from ..utils.etag_utils) to match actual code structure
- ARCHITECTURE.md: List formatting was already correct, no changes needed

These changes ensure documentation accurately reflects the actual codebase.

* Fix type annotations in recursive crawling strategy

- Changed max_concurrent from invalid 'int = None' to 'int | None = None'
- Made progress_callback explicitly async: 'Callable[..., Awaitable[None]] | None'
- Added Awaitable import from typing
- Uses modern Python 3.10+ union syntax (project requires Python 3.12)

* Improve error logging in sitemap parsing

- Use logger.exception() instead of logger.error() for automatic stack traces
- Include sitemap URL in all error messages for better debugging
- Remove unused traceback import and manual traceback logging
- Now all exceptions show which sitemap failed with full stack trace

* Remove all Socket.IO remnants from task_service.py

Removed:
- Duplicate broadcast_task_update function definitions
- _broadcast_available flag (always False)
- All Socket.IO broadcast blocks in create_task, update_task, and archive_task
- Socket.IO related logging and error handling
- Unnecessary traceback import within Socket.IO error handler

Task updates are now handled exclusively via HTTP polling as intended.

* Complete WebSocket/Socket.IO cleanup across frontend and backend

- Remove socket.io-client dependency and all related packages
- Remove WebSocket proxy configuration from vite.config.ts
- Clean up WebSocket state management and deprecated methods from services
- Remove VITE_ENABLE_WEBSOCKET environment variable checks
- Update all comments to remove WebSocket/Socket.IO references
- Fix user-facing error messages that mentioned Socket.IO
- Preserve legitimate FastAPI WebSocket endpoints for MCP/test streaming

This completes the refactoring to HTTP polling, removing all Socket.IO
infrastructure while keeping necessary WebSocket functionality.

* Remove MCP log display functionality following KISS principles

- Remove all log display UI from MCPPage (saved ~100 lines)
- Remove log-related API endpoints and WebSocket streaming
- Keep internal log tracking for Docker container monitoring
- Simplify MCPPage to focus on server control and configuration
- Remove unused LogEntry types and streaming methods

Following early beta KISS principles - MCP logs are debug info that
developers can check via terminal/Docker if needed. UI now focuses
on essential functionality only.

* Add Claude Code command for analyzing CodeRabbit suggestions

- Create structured command for CodeRabbit review analysis
- Provides clear format for assessing validity and priority
- Generates 2-5 practical options with tradeoffs
- Emphasizes early beta context and KISS principles
- Includes effort estimation for each option

This command helps quickly triage CodeRabbit suggestions and decide
whether to address them based on project priorities and tradeoffs.

* Add in-flight guard to prevent overlapping fetches in crawl progress polling

Prevents race condition where slow responses could cause multiple concurrent
fetches for the same progressId. Simple boolean flag skips new fetches while
one is active and properly cleans up on stop/disconnect.

Co-Authored-By: Claude <noreply@anthropic.com>
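The in-flight guard reduces to a single boolean around the fetch; `fetchProgress` below is a hypothetical stand-in for the real request:

```typescript
// Returns a poll function that skips itself (resolving false) while a
// previous fetch for the same progressId is still in flight.
function createInFlightPoller(fetchProgress: () => Promise<void>) {
  let inFlight = false;
  return async function poll(): Promise<boolean> {
    if (inFlight) return false; // previous fetch still running: skip
    inFlight = true;
    try {
      await fetchProgress();
      return true;
    } finally {
      inFlight = false; // always clear, even when the fetch throws
    }
  };
}
```

The `finally` block is the cleanup-on-stop/disconnect part: the flag can never get stuck true and permanently suppress polling.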

* Remove unused progressService.ts dead code

File was completely unused with no imports or references anywhere in the
codebase. Other services (crawlProgressService, projectCreationProgressService)
handle their specific progress polling needs directly.

Co-Authored-By: Claude <noreply@anthropic.com>

* Remove unused project creation progress components

Both ProjectCreationProgressCard.tsx and projectCreationProgressService.ts
were dead code with no references. The service duplicated existing usePolling
functionality unnecessarily. Removed per KISS principles.

Co-Authored-By: Claude <noreply@anthropic.com>

* Update POLLING_ARCHITECTURE.md to reflect current state

Removed references to deleted files (progressService.ts,
projectCreationProgressService.ts, ProjectCreationProgressCard.tsx).
Updated to document what exists now rather than migration history.

Co-Authored-By: Claude <noreply@anthropic.com>

* Update API_NAMING_CONVENTIONS.md to reflect current state

Updated progress endpoints to match actual implementation.
Removed migration/historical references and anti-patterns section.
Focused on current best practices and architecture patterns.

Co-Authored-By: Claude <noreply@anthropic.com>

* Remove unused optimistic updates code and references

Deleted unused useOptimisticUpdates.ts hook that was never imported.
Removed optimistic upda…
coleam00 added a commit that referenced this pull request Apr 7, 2026
Expand documentation for canonicalRepoPath, IIsolationProvider interface,
and getWorktreePath with full path format examples and error contracts.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>