diff --git a/CrawlProcess.md b/CrawlProcess.md
new file mode 100644
index 0000000000..9123354e9d
--- /dev/null
+++ b/CrawlProcess.md
@@ -0,0 +1,1203 @@
+# Crawl Process Architecture Documentation
+
+## Table of Contents
+- [Overview](#overview)
+- [Architecture Flow](#architecture-flow)
+- [Core Components](#core-components)
+- [Crawl Strategies](#crawl-strategies)
+- [Process Flows](#process-flows)
+- [Code Examples](#code-examples)
+- [Configuration](#configuration)
+- [Progress Tracking](#progress-tracking)
+
+---
+
+## Overview
+
+The ArchonDM crawling system is a sophisticated web crawling architecture built on **Crawl4AI v0.6.2** that intelligently extracts, processes, and stores web content in a knowledge base. The system uses a strategy pattern to handle different types of content (single pages, batch URLs, recursive crawls, sitemaps) and provides real-time progress tracking via HTTP polling.
+
+### Key Features
+- **Intelligent URL Detection**: Automatically detects content type (sitemap, text file, markdown, link collection, binary)
+- **Multiple Crawl Strategies**: Single page, batch, recursive, and sitemap strategies
+- **Progress Tracking**: Real-time progress updates via HTTP polling
+- **Code Extraction**: Automatically extracts and indexes code examples
+- **Cancellation Support**: Graceful cancellation of long-running crawl operations
+- **Memory-Adaptive Processing**: Automatically throttles based on system memory
+- **Supabase Integration**: Stores documents with vector embeddings for RAG
+
+---
+
+## Architecture Flow
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                              CRAWL REQUEST                              │
+│                    POST /api/knowledge-items/crawl                      │
+└───────────────────────────────┬─────────────────────────────────────────┘
+                                │
+                                ▼
+                ┌───────────────────────────────┐
+                │  knowledge_api.py             │
+                │  - Validate URL & API key     │
+                │  - Create progress_id         │
+                │  - Initialize ProgressTracker │
+                └──────────────┬────────────────┘
+                               │
+                               ▼
+                ┌────────────────────────────────┐
+                │  CrawlerManager                │
+                │  - Singleton pattern           │
+                │  - Initializes AsyncWebCrawler │
+                │  - Configures Chromium browser │
+                └──────────────┬─────────────────┘
+                               │
+                               ▼
+                ┌──────────────────────────────────────┐
+                │  CrawlingService (Orchestrator)      │
+                │  - Analyzes URL type                 │
+                │  - Selects appropriate strategy      │
+                │  - Manages progress tracking         │
+                │  - Handles cancellation              │
+                └──────────────┬───────────────────────┘
+                               │
+                   ├───────────┬──────────┬──────────┐
+                   ▼           ▼          ▼          ▼
+        ┌──────────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐
+        │Single Page   │ │ Batch   │ │Recursive │ │Sitemap  │
+        │Strategy      │ │Strategy │ │Strategy  │ │Strategy │
+        └──────┬───────┘ └────┬────┘ └────┬─────┘ └────┬────┘
+               │              │           │            │
+               └──────────────┴───────────┴────────────┘
+                               │
+                               ▼
+                ┌──────────────────────────────────────┐
+                │  Crawl Results (Raw Content)         │
+                │  [{url, markdown, html, links}...]   │
+                │                                      │
+                │  • markdown: Crawl4AI parsed text    │
+                │  • html: Raw HTML for code extract   │
+                └──────────────┬───────────────────────┘
+                               │
+          ┌────────────────────┴─────────────────────────┐
+          │                                              │
+          ▼                                              ▼
+ ┌───────────────────────────────────┐ ┌────────────────────────────────┐
+ │   TEXT PROCESSING FLOW            │ │   CODE EXTRACTION FLOW         │
+ │   (DocumentStorageOperations)     │ │   (CodeExtractionService)      │
+ └───────────────────────────────────┘ └────────────────────────────────┘
+          │                                              │
+          ▼                                              ▼
+ ┌───────────────────────────────────┐ ┌────────────────────────────────┐
+ │ 1. PARSE & CHUNK MARKDOWN         │ │ 1. PARSE HTML CONTENT          │
+ │   • Split into 5000 char chunks   │ │   • Find <pre><code> blocks    │
+ │ • Smart chunking at boundaries │ │ • Parse markdown ```blocks │
+ │ • Preserve context │ │ • Detect language │
+ └────────────┬──────────────────────┘ └────────────┬───────────────────┘
+ │ │
+ ▼ ▼
+ ┌───────────────────────────────────┐ ┌────────────────────────────────┐
+ │ 2. CREATE SOURCE RECORDS │ │ 2. FILTER CODE BLOCKS │
+ │ • Generate unique source_id │ │ • Min length: 1000 chars │
+ │ • Extract display name │ │ • Remove duplicates │
+ │ • AI-generate summary │ │ • Validate syntax │
+ │ • Store in archon_sources │ └────────────┬───────────────────┘
+ └────────────┬──────────────────────┘ │
+ │ ▼
+ ▼ ┌────────────────────────────────┐
+ ┌───────────────────────────────────┐ │ 3. AI SUMMARY GENERATION │
+ │ 3. GENERATE EMBEDDINGS │ │ • ThreadingService (4-16 │
+ │ • Create vectors for each chunk│ │ workers, CPU-adaptive) │
+ │ • Batch processing (25/batch) │ │ • Rate limiting (200k TPM) │
+ │ • Provider: OpenAI/Google/ │ │ • Generate code summaries │
+ │ Ollama │ │ • Extract purpose & usage │
+ │ • Parallel processing enabled │ └────────────┬───────────────────┘
+ └────────────┬──────────────────────┘ │
+ │ ▼
+ ▼ ┌────────────────────────────────┐
+ ┌───────────────────────────────────┐ │ 4. STORE CODE EXAMPLES │
+ │ 4. STORE IN SUPABASE │ │ • Save to code_examples │
+ │ • archon_documents table │ │ table │
+ │ • Vector embeddings │ │ • Include: language, summary│
+ │ • Metadata: url, source_id, │ │ • Link to source_id │
+ │ tags, word_count │ │ • Enable code search │
+ │ • Enable semantic search │ └────────────┬───────────────────┘
+ └────────────┬──────────────────────┘ │
+ │ │
+ └───────────────────┬─────────────────────────┘
+ │
+ ▼
+ ┌─────────────────────┐
+ │ Crawl Complete │
+ │ Progress: 100% │
+ │ │
+ │ Results: │
+ │ • Documents stored │
+ │ • Code indexed │
+ │ • RAG ready │
+ └─────────────────────┘
+```
+
+---
+
+## Core Components
+
+### 1. CrawlerManager (`python/src/server/services/crawler_manager.py`)
+
+**Purpose**: Manages the global Crawl4AI crawler instance as a singleton.
+
+**Key Responsibilities**:
+- Initialize `AsyncWebCrawler` with optimized browser configuration
+- Configure Chromium with performance optimizations (disable images, GPU, etc.)
+- Provide thread-safe access to the crawler instance
+- Handle cleanup on shutdown
+
+**Code Location**: `python/src/server/services/crawler_manager.py`
+
+**Key Configuration**:
+```python
+browser_config = BrowserConfig(
+ headless=True,
+ verbose=False,
+ viewport_width=1920,
+ viewport_height=1080,
+ browser_type="chromium",
+ extra_args=[
+ "--disable-images", # Skip image loading
+ "--disable-gpu",
+ "--no-sandbox",
+ "--disable-setuid-sandbox",
+ # ... more optimizations
+ ]
+)
+```
+
+### 2. CrawlingService (`python/src/server/services/crawling/crawling_service.py`)
+
+**Purpose**: Main orchestrator that coordinates the entire crawl process.
+
+**Key Responsibilities**:
+- Analyze URL and determine crawl type
+- Select and execute appropriate crawl strategy
+- Track progress across all stages
+- Handle cancellation requests
+- Coordinate document storage and code extraction
+
+**Key Methods**:
+- `orchestrate_crawl()`: Main entry point for crawl operations
+- `_detect_crawl_type()`: Determines URL type (sitemap, text, markdown, etc.)
+- `cancel()`: Gracefully cancel ongoing operations
+- `is_cancelled()`: Check cancellation status
+
+### 3. URLHandler (`python/src/server/services/crawling/helpers/url_handler.py`)
+
+**Purpose**: Handles URL validation, transformation, and classification.
+
+**Key Methods**:
+- `is_sitemap(url)`: Detect sitemap.xml files
+- `is_txt(url)`: Detect .txt files
+- `is_markdown(url)`: Detect .md/.mdx/.markdown files
+- `is_binary_file(url)`: Skip binary files (zip, pdf, images, etc.)
+- `is_link_collection_file(url, content)`: Detect llms.txt-style link collections
+- `transform_github_url(url)`: Convert GitHub file URLs to raw.githubusercontent.com
+- `generate_unique_source_id(url)`: Generate hash-based unique IDs
+- `extract_display_name(url)`: Create human-readable names for sources
+- `extract_markdown_links(content, base_url)`: Extract links from markdown/text
+
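+Below is a minimal, illustrative sketch of how a few of these helpers could behave; the method names mirror the list above, but the bodies (extension list, hash length) are assumptions rather than the actual Archon implementation.
+
+```python
+# Hedged sketch of URLHandler-style helpers; extension list and hash length are assumed.
+import hashlib
+from urllib.parse import urlparse
+
+BINARY_EXTENSIONS = {".zip", ".pdf", ".png", ".jpg", ".gz", ".exe"}  # assumed subset
+
+def is_sitemap(url: str) -> bool:
+    # A URL whose path ends in sitemap.xml is treated as a sitemap
+    return urlparse(url).path.endswith("sitemap.xml")
+
+def is_binary_file(url: str) -> bool:
+    # Skip files whose extension marks them as binary content
+    path = urlparse(url).path.lower()
+    return any(path.endswith(ext) for ext in BINARY_EXTENSIONS)
+
+def generate_unique_source_id(url: str) -> str:
+    # Hash-based ID so the same URL always maps to the same source record
+    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
+```
+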
+### 4. DocumentStorageOperations (`python/src/server/services/crawling/document_storage_operations.py`)
+
+**Purpose**: Handles document processing, chunking, and storage.
+
+**Key Responsibilities**:
+- Chunk documents into 5000-character segments
+- Create/update source records in database
+- Generate embeddings for each chunk
+- Store documents in Supabase with metadata
+- Support parallel batch processing
+
+**Key Methods**:
+- `process_and_store_documents()`: Main document processing pipeline
+- `_create_source_records()`: Create source entries in database
+
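+The 5000-character chunking can be pictured with the sketch below; `chunk_markdown` is a hypothetical helper (not the real method) that backs up to the nearest paragraph break so chunks end at natural boundaries.
+
+```python
+# Illustrative 5000-character chunking with boundary-aware splitting (assumed logic).
+def chunk_markdown(text: str, chunk_size: int = 5000) -> list[str]:
+    chunks = []
+    start = 0
+    while start < len(text):
+        end = min(start + chunk_size, len(text))
+        if end < len(text):
+            # Prefer to split at a paragraph break instead of mid-sentence
+            break_at = text.rfind("\n\n", start, end)
+            if break_at > start:
+                end = break_at
+        chunks.append(text[start:end].strip())
+        start = end
+    return [chunk for chunk in chunks if chunk]
+```
+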
+### 5. CodeExtractionService (`python/src/server/services/crawling/code_extraction_service.py`)
+
+**Purpose**: Extract code examples from HTML and generate AI summaries.
+
+**Key Responsibilities**:
+- Extract code blocks from HTML content
+- Filter by minimum length (default: 1000 characters)
+- Generate AI-powered summaries for each code block
+- Store in `code_examples` table with metadata
+
+### 6. ProgressMapper (`python/src/server/services/crawling/progress_mapper.py`)
+
+**Purpose**: Maps sub-stage progress to overall progress percentages.
+
+**Stage Ranges**:
+```python
+STAGE_RANGES = {
+ "analyzing": (1, 3), # URL analysis
+ "crawling": (3, 15), # Web crawling
+ "processing": (15, 20), # Content processing
+ "source_creation": (20, 25), # Database operations
+ "document_storage": (25, 40), # Embeddings generation
+ "code_extraction": (40, 90), # Code extraction + summaries
+ "finalization": (90, 100), # Final cleanup
+}
+```
+
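+A sub-stage percentage is mapped linearly into its overall range. The sketch below assumes the `STAGE_RANGES` dict above is in scope; the method name is illustrative, not the mapper's actual API.
+
+```python
+# Sketch of mapping stage-local progress (0-100) onto the overall 0-100 scale.
+def map_progress(stage: str, stage_progress: float) -> int:
+    start, end = STAGE_RANGES[stage]
+    overall = start + (end - start) * (stage_progress / 100.0)
+    return max(0, min(100, round(overall)))
+
+# Example: 50% through code_extraction (40-90) reports 65% overall
+assert map_progress("code_extraction", 50) == 65
+```
+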
+---
+
+## Crawl Strategies
+
+### 1. Single Page Strategy (`strategies/single_page.py`)
+
+**When Used**: Single URL that's not a sitemap, link collection, or binary file.
+
+**Process**:
+1. Transform URL (e.g., GitHub → raw.githubusercontent.com)
+2. Detect if documentation site (special handling)
+3. Configure Crawl4AI with appropriate settings
+4. Crawl with retry logic (3 attempts with exponential backoff)
+5. Return markdown and HTML content
+
+**Configuration for Documentation Sites**:
+```python
+CrawlerRunConfig(
+ wait_for=wait_selector, # e.g., '.markdown, article'
+ wait_until='domcontentloaded',
+ page_timeout=30000,
+ delay_before_return_html=0.5,
+ scan_full_page=True,
+ remove_overlay_elements=True,
+ process_iframes=True
+)
+```
+
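+The retry behavior in step 4 can be sketched as below. This is not the strategy's actual code; it only shows the shape of "3 attempts with exponential backoff" around Crawl4AI's `arun()` call, and the delay values are assumptions.
+
+```python
+import asyncio
+
+# Hedged sketch of the retry loop; attempt count and backoff delays are assumptions.
+async def crawl_with_retries(crawler, url: str, config, max_attempts: int = 3):
+    delay = 1.0
+    for attempt in range(1, max_attempts + 1):
+        try:
+            result = await crawler.arun(url=url, config=config)
+            if result.success and result.markdown:
+                return {"url": url, "markdown": result.markdown, "html": result.html}
+        except Exception:
+            if attempt == max_attempts:
+                raise
+        # Exponential backoff before the next attempt: 1s, 2s, 4s...
+        await asyncio.sleep(delay)
+        delay *= 2
+    return None
+```
+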
+### 2. Batch Strategy (`strategies/batch.py`)
+
+**When Used**: Multiple URLs (from sitemap or link collection).
+
+**Process**:
+1. Load batch size and concurrency settings from database
+2. Initialize `MemoryAdaptiveDispatcher` for memory management
+3. Process URLs in batches (default: 50 URLs per batch)
+4. Use `crawler.arun_many()` for parallel crawling
+5. Stream results as they complete
+6. Report progress for each batch
+
+**Key Features**:
+- Parallel crawling with configurable concurrency
+- Memory-adaptive throttling
+- Streaming results for incremental progress
+- Cancellation support between batches
+
+### 3. Recursive Strategy (`strategies/recursive.py`)
+
+**When Used**: Crawl a website by following internal links to a specified depth.
+
+**Process**:
+1. Start with initial URL(s)
+2. For each depth level (default: 3):
+ - Crawl all URLs at current depth
+ - Extract internal links from results
+ - Filter out binary files and visited URLs
+ - Add new URLs to next depth queue
+3. Continue until max depth reached or no new URLs
+
+**Key Features**:
+- Depth-limited crawling
+- Automatic link extraction
+- De-duplication of URLs
+- Progress reporting per depth level
+- Cancellation support at batch and depth boundaries
+
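+A minimal sketch of the depth-limited loop described above is shown below; `crawl_batch` and `is_binary_file` stand in for the strategy's real helpers and are assumptions.
+
+```python
+# Illustrative depth-limited recursive crawl with de-duplication (assumed helpers).
+async def crawl_recursive(start_urls: list[str], max_depth: int = 3) -> list[dict]:
+    visited: set[str] = set()
+    results: list[dict] = []
+    current_level = [url for url in start_urls if not is_binary_file(url)]
+
+    for depth in range(max_depth):
+        current_level = [url for url in current_level if url not in visited]
+        if not current_level:
+            break  # no new URLs to crawl at this depth
+
+        batch_results = await crawl_batch(current_level)  # assumed batch helper
+        visited.update(current_level)
+        results.extend(batch_results)
+
+        # Collect internal links for the next depth level
+        next_level = []
+        for page in batch_results:
+            for link in page.get("links", []):
+                if link not in visited and not is_binary_file(link):
+                    next_level.append(link)
+        current_level = next_level
+
+    return results
+```
+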
+### 4. Sitemap Strategy (`strategies/sitemap.py`)
+
+**When Used**: URL ends with `sitemap.xml`.
+
+**Process**:
+1. Fetch sitemap XML file
+2. Parse with `ElementTree`
+3. Extract all `<loc>` elements
+4. Return list of URLs for batch crawling
+
+**Note**: After parsing, the batch strategy is used to crawl all URLs.
+
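+A minimal sketch of the parsing step, assuming a plain `requests` fetch and namespace-agnostic matching of `<loc>` tags (the actual strategy may differ in details):
+
+```python
+import requests
+import xml.etree.ElementTree as ElementTree
+
+def parse_sitemap(sitemap_url: str) -> list[str]:
+    # Fetch and parse the sitemap, then collect every <loc> URL
+    response = requests.get(sitemap_url, timeout=30)
+    response.raise_for_status()
+
+    root = ElementTree.fromstring(response.content)
+    return [
+        element.text.strip()
+        for element in root.iter()
+        if element.tag.endswith("loc") and element.text
+    ]
+```
+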
+---
+
+## Process Flows
+
+### Flow 1: Single Page Crawl
+
+```
+User Request → API Validation → Get Crawler → Single Page Strategy
+ ↓
+ Crawl with retries
+ ↓
+ Return {url, markdown, html}
+ ↓
+ Chunk content (5000 chars)
+ ↓
+ Create source record
+ ↓
+ Generate embeddings
+ ↓
+ Store in Supabase
+ ↓
+ Extract code examples
+ ↓
+ Complete (100%)
+```
+
+### Flow 2: Sitemap Crawl
+
+```
+User Request → API Validation → Detect sitemap.xml → Sitemap Strategy
+ ↓
+ Parse XML for URLs
+ ↓
+ Batch Strategy
+ ↓
+ Process in batches of 50 URLs
+ ↓
+ Parallel crawl with arun_many()
+ ↓
+ Stream results incrementally
+ ↓
+ DocumentStorageOperations
+ (same as single page)
+```
+
+### Flow 3: Recursive Crawl
+
+```
+User Request → API Validation → Detect normal webpage → Recursive Strategy
+ ↓
+ Crawl starting URLs
+ ↓
+ Extract internal links
+ ↓
+ ┌───────────────┴────────────┐
+ ▼ ▼
+ Depth 1 URLs Depth 2 URLs
+ ↓ ↓
+ Batch crawl Batch crawl
+ └────────────┬───────────────┘
+ ▼
+ Depth 3 URLs
+ ↓
+ Batch crawl
+ ↓
+ DocumentStorageOperations
+ (same as single page)
+```
+
+### Flow 4: Link Collection (llms.txt) Crawl
+
+```
+User Request → API Validation → Detect llms.txt → Single Page Strategy
+ ↓
+ Fetch text file
+ ↓
+ Extract URLs from content
+ ↓
+ Batch Strategy
+ ↓
+ Crawl all discovered URLs
+ ↓
+ DocumentStorageOperations
+ (same as single page)
+```
+
+---
+
+## Code Examples
+
+### Example 1: Starting a Crawl from API
+
+```python
+# python/src/server/api_routes/knowledge_api.py
+
+@router.post("/knowledge-items/crawl")
+async def crawl_knowledge_item(request: KnowledgeItemRequest):
+ # Generate unique progress ID
+ progress_id = str(uuid.uuid4())
+
+ # Initialize progress tracker
+ tracker = ProgressTracker(progress_id, operation_type="crawl")
+ await tracker.start({
+ "url": str(request.url),
+ "crawl_type": "normal",
+ "progress": 0,
+ "log": f"Starting crawl for {request.url}"
+ })
+
+ # Start background task
+ asyncio.create_task(_perform_crawl_with_progress(progress_id, request, tracker))
+
+ return {
+ "success": True,
+ "progressId": progress_id,
+ "message": "Crawling started",
+ "estimatedDuration": "3-5 minutes"
+ }
+```
+
+### Example 2: Orchestrating a Crawl
+
+```python
+# python/src/server/services/crawling/crawling_service.py
+
+async def orchestrate_crawl(self, request: dict[str, Any]) -> dict[str, Any]:
+ """Main orchestration method for crawling operations."""
+
+ # Step 1: Analyze URL
+ crawl_type = await self._detect_crawl_type(request['url'])
+
+ # Step 2: Execute appropriate strategy
+ if crawl_type == "single_page":
+ results = await self._crawl_single_page(request)
+ elif crawl_type == "sitemap":
+ results = await self._crawl_sitemap(request)
+ elif crawl_type == "recursive":
+ results = await self._crawl_recursive(request)
+ elif crawl_type == "text_file":
+ results = await self._crawl_text_file(request)
+
+ # Step 3: Process and store documents
+ storage_result = await self.doc_storage_ops.process_and_store_documents(
+ crawl_results=results,
+ request=request,
+ crawl_type=crawl_type,
+ original_source_id=source_id,
+ progress_callback=self._create_progress_callback("document_storage")
+ )
+
+ # Step 4: Extract code examples (if enabled)
+ if request.get("extract_code_examples", True):
+ code_count = await self.doc_storage_ops.extract_and_store_code_examples(
+ crawl_results=results,
+ url_to_full_document=storage_result["url_to_full_document"],
+ source_id=source_id,
+ progress_callback=self._create_progress_callback("code_extraction")
+ )
+
+ return {"success": True, "chunks_stored": storage_result["chunks_stored"]}
+```
+
+### Example 3: Batch Crawling with Progress
+
+```python
+# python/src/server/services/crawling/strategies/batch.py
+
+async def crawl_batch_with_progress(
+ self,
+ urls: list[str],
+ progress_callback: Callable[..., Awaitable[None]] | None = None
+) -> list[dict[str, Any]]:
+ """Batch crawl URLs with progress reporting."""
+
+ batch_size = 50
+ total_urls = len(urls)
+ successful_results = []
+
+ for i in range(0, total_urls, batch_size):
+ batch_urls = urls[i : i + batch_size]
+
+ # Report progress
+ progress_percentage = int((i / total_urls) * 100)
+ await progress_callback(
+ "crawling",
+ progress_percentage,
+ f"Processing batch {i + 1}-{min(i + batch_size, total_urls)} of {total_urls}",
+ total_pages=total_urls,
+ processed_pages=i
+ )
+
+ # Crawl batch in parallel
+ batch_results = await self.crawler.arun_many(
+ urls=batch_urls,
+ config=crawl_config,
+ dispatcher=dispatcher
+ )
+
+ # Stream results
+ async for result in batch_results:
+ if result.success and result.markdown:
+ successful_results.append({
+ "url": result.url,
+ "markdown": result.markdown,
+ "html": result.html
+ })
+
+ return successful_results
+```
+
+### Example 4: URL Type Detection
+
+```python
+# python/src/server/services/crawling/helpers/url_handler.py
+
+class URLHandler:
+ @staticmethod
+ def is_link_collection_file(url: str, content: Optional[str] = None) -> bool:
+ """Check if file is a link collection like llms.txt."""
+ parsed = urlparse(url)
+ filename = parsed.path.split('/')[-1].lower()
+
+ # Check filename patterns
+ link_collection_patterns = [
+ 'llms.txt', 'links.txt', 'resources.txt',
+ 'llms.md', 'links.md', 'resources.md'
+ ]
+
+ if filename in link_collection_patterns:
+ return True
+
+ # Content-based detection
+ if content and 'full' not in filename:
+ extracted_links = URLHandler.extract_markdown_links(content, url)
+ total_links = len(extracted_links)
+ content_length = len(content.strip())
+
+ if content_length > 0:
+ link_density = (total_links * 100) / content_length
+                # More than 2 links per 100 characters of content
+ if link_density > 2.0 and total_links > 3:
+ return True
+
+ return False
+```
+
+---
+
+## Configuration
+
+### Database Settings (Stored in Supabase)
+
+Configuration settings are stored in the `credentials` table under category `rag_strategy`:
+
+| Setting | Default | Description |
+|---------------------------|---------|------------------------------------------------|
+| `CRAWL_BATCH_SIZE` | 50 | Number of URLs to crawl per batch |
+| `CRAWL_MAX_CONCURRENT` | 10 | Max parallel crawls within a single operation |
+| `MEMORY_THRESHOLD_PERCENT`| 80 | Memory usage threshold for throttling (%) |
+| `DISPATCHER_CHECK_INTERVAL`| 0.5 | How often to check memory (seconds) |
+| `CRAWL_WAIT_STRATEGY` | domcontentloaded | Playwright wait strategy |
+| `CRAWL_PAGE_TIMEOUT` | 30000 | Page load timeout (milliseconds) |
+| `CRAWL_DELAY_BEFORE_HTML` | 1.0 | Delay after page load (seconds) |
+| `CODE_BLOCK_MIN_LENGTH` | 1000 | Minimum code block size for extraction |
+
+### Environment Variables
+
+```bash
+# Crawl4AI Configuration
+CONCURRENT_CRAWL_LIMIT=3 # Server-level crawl concurrency limit
+
+# Supabase
+SUPABASE_URL=https://your-project.supabase.co
+SUPABASE_SERVICE_KEY=your-service-key
+
+# Embedding Provider
+EMBEDDING_PROVIDER=openai # openai, google, or ollama
+OPENAI_API_KEY=your-key
+```
+
+### Browser Configuration
+
+Located in `CrawlerManager.initialize()`:
+
+```python
+browser_config = BrowserConfig(
+ headless=True,
+ verbose=False,
+ viewport_width=1920,
+ viewport_height=1080,
+ user_agent="Mozilla/5.0 ...",
+ browser_type="chromium",
+ extra_args=[
+ "--disable-images", # Skip images for speed
+ "--disable-gpu",
+ "--no-sandbox",
+ "--disable-setuid-sandbox",
+ "--disable-web-security",
+ # ... more performance optimizations
+ ]
+)
+```
+
+---
+
+## Progress Tracking
+
+### Progress Architecture
+
+The system uses HTTP polling for real-time progress updates:
+
+```
+Frontend (React) → Poll /api/progress/{progress_id} every 500ms
+ ↓
+ ProgressTracker (in-memory)
+ ↓
+ Crawling Service updates
+```
+
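+A client-side polling loop might look like the sketch below; the endpoint path comes from this document, while the response fields (`status`, `progress`, `log`) are assumptions.
+
+```python
+import asyncio
+import httpx
+
+# Illustrative polling loop against the progress endpoint (field names assumed).
+async def poll_progress(base_url: str, progress_id: str, interval: float = 0.5):
+    async with httpx.AsyncClient() as client:
+        while True:
+            response = await client.get(f"{base_url}/api/progress/{progress_id}")
+            response.raise_for_status()
+            data = response.json()
+            print(f"{data.get('progress', 0)}% - {data.get('log', '')}")
+            if data.get("status") in ("completed", "failed", "cancelled"):
+                return data
+            await asyncio.sleep(interval)
+```
+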
+### Progress Stages
+
+| Stage | Range | Description |
+|--------------------|----------|------------------------------------------|
+| `analyzing` | 1-3% | URL analysis and type detection |
+| `crawling` | 3-15% | Web crawling (varies by depth/count) |
+| `processing` | 15-20% | Content processing and chunking |
+| `source_creation` | 20-25% | Creating database source records |
+| `document_storage` | 25-40% | Generating embeddings and storing chunks |
+| `code_extraction` | 40-90% | Extracting code + AI summaries |
+| `finalization` | 90-100% | Final cleanup and completion |
+
+### Progress Update Example
+
+```python
+# In any crawl strategy
+async def report_progress(progress_val: int, message: str, **kwargs):
+ if progress_callback:
+ await progress_callback(
+ "crawling", # Current stage
+ progress_val, # 0-100 within stage
+ message, # User-friendly message
+ total_pages=total_urls, # Total URLs to crawl
+ processed_pages=current # URLs crawled so far
+ )
+```
+
+### Cancellation Support
+
+Users can cancel long-running crawls:
+
+```python
+# API endpoint
+@router.delete("/knowledge-items/crawl/{progress_id}")
+async def cancel_crawl(progress_id: str):
+ # Cancel the orchestration service
+ orchestration = get_active_orchestration(progress_id)
+ if orchestration:
+ orchestration.cancel()
+
+ # Cancel the task
+ task = active_crawl_tasks.get(progress_id)
+ if task:
+ task.cancel()
+```
+
+Each strategy checks for cancellation between batches/pages:
+
+```python
+if cancellation_check:
+ try:
+ cancellation_check()
+ except asyncio.CancelledError:
+ # Cleanup and exit gracefully
+ await report_progress(99, "Crawl cancelled", status="cancelled")
+ return partial_results
+```
+
+---
+
+---
+
+## Multithreading & Concurrent Crawl Management
+
+The ArchonDM crawling system implements a sophisticated multi-level concurrency architecture that allows efficient parallel processing of multiple websites while protecting server resources.
+
+### Concurrency Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│ LEVEL 1: Server-Level Concurrency │
+│ (Limits Total Number of Crawl Operations) │
+│ │
+│ User A: Crawl site1.com User B: Crawl site2.com User C: Queued │
+│ [RUNNING] [RUNNING] [WAITING] │
+│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │
+│ │ Progress: │ │ Progress: │ │ Semaphore │ │
+│ │ 45% │ │ 78% │ │ Limit: 3 │ │
+│ └──────┬──────┘ └──────┬──────┘ └──────────────┘ │
+│ │ │ │
+└──────────┼──────────────────────────┼────────────────────────────────────┘
+ │ │
+ ▼ ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│ LEVEL 2: Within-Crawl Parallel Page Processing │
+│ (Each crawl processes multiple pages) │
+│ │
+│ ┌─────────────────────────────────────────────────────────────┐ │
+│ │ Crawl site1.com - Batch Strategy │ │
+│ │ │ │
+│ │ Page 1 Page 2 Page 3 Page 4 Page 5 ... Page 10 │ │
+│ │ [DONE] [DONE] [CRAWL] [CRAWL] [WAIT] [WAIT] │ │
+│ │ │ │
+│ │ MemoryAdaptiveDispatcher: 10 concurrent pages │ │
+│ └─────────────────────────────────────────────────────────────┘ │
+└──────────────────────────────────────────────────────────────────────────┘
+ │
+ ▼
+┌──────────────────────────────────────────────────────────────────────────┐
+│ LEVEL 3: Adaptive Worker Pool (Threading Service) │
+│ (CPU-intensive operations like embeddings) │
+│ │
+│ Worker 1 Worker 2 Worker 3 Worker 4 ... Worker N │
+│ [Embed] [Embed] [Code AI] [IDLE] │
+│ │
+│ Dynamically adjusts based on CPU/Memory (4-16 workers) │
+└──────────────────────────────────────────────────────────────────────────┘
+```
+
+### Level 1: Server-Level Concurrency Control
+
+**Purpose**: Prevent server overload when multiple users initiate crawls simultaneously.
+
+**Implementation**: `python/src/server/api_routes/knowledge_api.py`
+
+**Key Code**:
+```python
+# Hardcoded limit to protect server resources
+CONCURRENT_CRAWL_LIMIT = 3 # Max simultaneous crawl operations
+
+# Create asyncio Semaphore
+crawl_semaphore = asyncio.Semaphore(CONCURRENT_CRAWL_LIMIT)
+
+# Track active crawl tasks
+active_crawl_tasks: dict[str, asyncio.Task] = {}
+
+async def _perform_crawl_with_progress(
+ progress_id: str, request: KnowledgeItemRequest, tracker
+):
+ """Perform crawl operation with semaphore protection."""
+ # Acquire semaphore - blocks if 3 crawls are already running
+ async with crawl_semaphore:
+ logger.info(f"Acquired crawl semaphore | progress_id={progress_id}")
+
+ # Get crawler and start operation
+ crawler = await get_crawler()
+ orchestration_service = CrawlingService(crawler, supabase_client)
+ orchestration_service.set_progress_id(progress_id)
+
+ # Orchestrate the crawl
+ result = await orchestration_service.orchestrate_crawl(request_dict)
+
+ # Store task for cancellation support
+ crawl_task = result.get("task")
+ if crawl_task:
+ active_crawl_tasks[progress_id] = crawl_task
+```
+
+**Behavior**:
+- **Max 3 concurrent crawl operations** (configurable via `CONCURRENT_CRAWL_LIMIT`)
+- 4th crawl request waits until one of the first 3 completes
+- Each crawl gets full access to within-crawl parallelism
+- Semaphore automatically releases when crawl completes (via context manager)
+
+### Level 2: Within-Crawl Parallel Page Processing
+
+**Purpose**: Process multiple pages in parallel within a single crawl operation.
+
+**Implementation**: Uses Crawl4AI's `MemoryAdaptiveDispatcher` in batch and recursive strategies.
+
+**Configuration** (from database `credentials` table):
+```python
+CRAWL_MAX_CONCURRENT = 10 # Max pages to crawl in parallel
+MEMORY_THRESHOLD_PERCENT = 80 # Throttle if memory exceeds 80%
+```
+
+**Example: Batch Strategy**:
+```python
+# python/src/server/services/crawling/strategies/batch.py
+
+async def crawl_batch_with_progress(
+ self,
+ urls: list[str],
+ max_concurrent: int = 10,
+ progress_callback: Callable | None = None
+) -> list[dict[str, Any]]:
+ """Crawl multiple URLs in parallel with memory management."""
+
+ # Load settings from database
+ settings = await credential_service.get_credentials_by_category("rag_strategy")
+ max_concurrent = int(settings.get("CRAWL_MAX_CONCURRENT", "10"))
+ memory_threshold = float(settings.get("MEMORY_THRESHOLD_PERCENT", "80"))
+
+ # Create memory-adaptive dispatcher
+ dispatcher = MemoryAdaptiveDispatcher(
+ memory_threshold_percent=memory_threshold,
+ check_interval=0.5, # Check memory every 0.5 seconds
+ max_session_permit=max_concurrent # Max parallel crawls
+ )
+
+ # Crawl in batches of 50 URLs
+ batch_size = 50
+ for i in range(0, len(urls), batch_size):
+ batch_urls = urls[i : i + batch_size]
+
+ # Crawl this batch with up to 10 pages in parallel
+ batch_results = await self.crawler.arun_many(
+ urls=batch_urls,
+ config=crawl_config,
+ dispatcher=dispatcher # Handles parallel execution
+ )
+
+ # Stream results as they complete
+ async for result in batch_results:
+ if result.success:
+ successful_results.append(result)
+
+ # Report progress
+ await progress_callback(
+ "crawling",
+ int((processed / total) * 100),
+ f"Crawled {processed}/{total} pages"
+ )
+
+ return successful_results
+```
+
+**Memory-Adaptive Behavior**:
+- Monitors system memory usage in real-time
+- Automatically reduces parallel crawls if memory exceeds threshold
+- Resumes full parallelism when memory drops below threshold
+- Prevents out-of-memory crashes during large crawls
+
+### Level 3: Adaptive Worker Pool (Threading Service)
+
+**Purpose**: Efficiently handle CPU-intensive operations (embeddings, AI summaries) and I/O-bound tasks.
+
+**Implementation**: `python/src/server/services/threading_service.py`
+
+**Key Features**:
+1. **Multiple Processing Modes**:
+ - `CPU_INTENSIVE`: AI summaries, embeddings (4-16 workers, uses CPU count)
+ - `IO_BOUND`: Database operations, file I/O (8-32 workers)
+ - `NETWORK_BOUND`: External API calls (4-16 workers)
+
+2. **Dynamic Worker Adjustment**:
+ - Monitors CPU and memory in real-time
+ - Reduces workers when resources are constrained
+ - Increases workers when resources are available
+
+3. **Rate Limiting**:
+ - Token bucket algorithm for API rate limiting
+ - Prevents API quota exhaustion
+ - Automatic backoff and retry
+
+**Example Usage**:
+```python
+# python/src/server/services/threading_service.py
+
+class ThreadingService:
+ """Main threading service for adaptive concurrency."""
+
+ async def batch_process(
+ self,
+ items: list[Any],
+ process_func: Callable,
+ mode: ProcessingMode = ProcessingMode.CPU_INTENSIVE,
+ progress_callback: Callable | None = None
+ ) -> list[Any]:
+ """Process items with adaptive worker pool."""
+
+ # Calculate optimal workers based on system load
+ optimal_workers = self.memory_dispatcher.calculate_optimal_workers(mode)
+ semaphore = asyncio.Semaphore(optimal_workers)
+
+ logger.info(
+ f"Starting adaptive processing: {len(items)} items, "
+ f"{optimal_workers} workers, mode={mode}"
+ )
+
+ async def process_single(item: Any, index: int) -> Any:
+ async with semaphore:
+ # For CPU-intensive, run in thread pool
+ if mode == ProcessingMode.CPU_INTENSIVE:
+ loop = asyncio.get_event_loop()
+ result = await loop.run_in_executor(None, process_func, item)
+ else:
+ # Run directly if async
+ result = await process_func(item)
+
+ # Report progress
+ if progress_callback:
+ await progress_callback({
+ "type": "worker_completed",
+ "completed": index + 1,
+ "total": len(items)
+ })
+
+ return result
+
+ # Execute all items with controlled concurrency
+ tasks = [process_single(item, idx) for idx, item in enumerate(items)]
+ results = await asyncio.gather(*tasks, return_exceptions=True)
+
+ return [r for r in results if not isinstance(r, Exception)]
+```
+
+**Worker Calculation Logic**:
+```python
+def calculate_optimal_workers(self, mode: ProcessingMode) -> int:
+ """Calculate optimal worker count based on system resources."""
+ metrics = self.get_system_metrics()
+
+ # Base worker count depends on mode
+ if mode == ProcessingMode.CPU_INTENSIVE:
+ base = min(4, psutil.cpu_count())
+ elif mode == ProcessingMode.IO_BOUND:
+ base = 8 # 2x for I/O operations
+ else:
+ base = 4
+
+ # Adjust based on system load
+ if metrics.memory_percent > 80:
+ workers = max(1, base // 2) # Reduce by 50%
+ elif metrics.cpu_percent > 90:
+ workers = max(1, base // 2) # Reduce by 50%
+ elif metrics.memory_percent < 50 and metrics.cpu_percent < 50:
+ workers = min(16, base * 2) # Increase up to 2x
+ else:
+ workers = base
+
+ return workers
+```
+
+### Complete Multi-Site Crawl Example
+
+Here's a complete example showing how to crawl multiple websites concurrently:
+
+```python
+"""
+Example: Crawl multiple documentation sites concurrently
+"""
+
+import asyncio
+from python.src.server.api_routes.knowledge_api import (
+    KnowledgeItemRequest,  # request model used below
+    crawl_knowledge_item,
+)
+from python.src.server.services.crawling import CrawlingService
+from python.src.server.services.crawler_manager import get_crawler
+from python.src.server.utils import get_supabase_client
+
+async def crawl_multiple_sites(site_urls: list[str]) -> dict:
+ """
+ Crawl multiple sites concurrently with automatic queuing.
+
+ Args:
+ site_urls: List of website URLs to crawl
+
+ Returns:
+ Dict with progress IDs and status for each site
+ """
+ results = {
+ "started": [],
+ "queued": [],
+ "total": len(site_urls)
+ }
+
+ async def crawl_single_site(url: str, index: int):
+ """Crawl a single site."""
+ try:
+ # Create crawl request
+ request = KnowledgeItemRequest(
+ url=url,
+ knowledge_type="documentation",
+ tags=["auto-crawl"],
+ max_depth=2,
+ extract_code_examples=True
+ )
+
+ # This will automatically queue if 3 crawls are running
+ response = await crawl_knowledge_item(request)
+
+ print(f"✅ Site {index + 1}/{len(site_urls)}: {url}")
+ print(f" Progress ID: {response['progressId']}")
+ print(f" Status: {response['message']}")
+
+ results["started"].append({
+ "url": url,
+ "progress_id": response["progressId"],
+ "index": index
+ })
+
+ except Exception as e:
+ print(f"❌ Site {index + 1}/{len(site_urls)} failed: {url}")
+ print(f" Error: {str(e)}")
+
+ # Start all crawls concurrently
+ # The semaphore will automatically queue excess requests
+ crawl_tasks = [
+ crawl_single_site(url, idx)
+ for idx, url in enumerate(site_urls)
+ ]
+
+ # Wait for all to start (some may be queued)
+ await asyncio.gather(*crawl_tasks, return_exceptions=True)
+
+ return results
+
+# Example usage
+async def main():
+ sites_to_crawl = [
+ "https://docs.python.org/3/",
+ "https://fastapi.tiangolo.com/",
+ "https://docs.pydantic.dev/",
+ "https://www.supabase.com/docs",
+ "https://platform.openai.com/docs",
+ ]
+
+ print(f"🚀 Starting crawl of {len(sites_to_crawl)} sites...")
+ print(f" Server limit: 3 concurrent crawls")
+ print(f" Sites 1-3 will start immediately")
+ print(f" Sites 4-5 will queue and start as others complete\n")
+
+ results = await crawl_multiple_sites(sites_to_crawl)
+
+ print("\n" + "="*60)
+ print(f"📊 Crawl Summary:")
+ print(f" Total sites: {results['total']}")
+ print(f" Started/Queued: {len(results['started'])}")
+ print("="*60)
+
+# Run the example
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Output**:
+```
+🚀 Starting crawl of 5 sites...
+ Server limit: 3 concurrent crawls
+ Sites 1-3 will start immediately
+ Sites 4-5 will queue and start as others complete
+
+✅ Site 1/5: https://docs.python.org/3/
+ Progress ID: a1b2c3d4-e5f6-7890-abcd-ef1234567890
+ Status: Crawling started
+
+✅ Site 2/5: https://fastapi.tiangolo.com/
+ Progress ID: b2c3d4e5-f6a7-8901-bcde-f12345678901
+ Status: Crawling started
+
+✅ Site 3/5: https://docs.pydantic.dev/
+ Progress ID: c3d4e5f6-a7b8-9012-cdef-123456789012
+ Status: Crawling started
+
+[Site 4 waits until one of the first 3 completes...]
+
+✅ Site 4/5: https://www.supabase.com/docs
+ Progress ID: d4e5f6a7-b8c9-0123-def1-234567890123
+ Status: Crawling started
+
+✅ Site 5/5: https://platform.openai.com/docs
+ Progress ID: e5f6a7b8-c9d0-1234-ef12-345678901234
+ Status: Crawling started
+
+============================================================
+📊 Crawl Summary:
+ Total sites: 5
+ Started/Queued: 5
+============================================================
+```
+
+### Monitoring Active Crawls
+
+You can monitor and manage active crawls:
+
+```python
+from python.src.server.api_routes.knowledge_api import CONCURRENT_CRAWL_LIMIT, active_crawl_tasks
+
+def get_active_crawls() -> dict:
+ """Get status of all active crawls."""
+ return {
+ "active_count": len(active_crawl_tasks),
+ "max_concurrent": CONCURRENT_CRAWL_LIMIT,
+ "available_slots": CONCURRENT_CRAWL_LIMIT - len(active_crawl_tasks),
+ "crawls": [
+ {
+ "progress_id": progress_id,
+ "task_name": task.get_name(),
+ "done": task.done()
+ }
+ for progress_id, task in active_crawl_tasks.items()
+ ]
+ }
+
+# Cancel a specific crawl
+async def cancel_crawl(progress_id: str):
+ """Cancel a running crawl operation."""
+ task = active_crawl_tasks.get(progress_id)
+ if task and not task.done():
+ task.cancel()
+ print(f"Cancelled crawl: {progress_id}")
+ else:
+ print(f"Crawl not found or already completed: {progress_id}")
+```
+
+### Configuration Summary
+
+| Level | Setting | Location | Default | Description |
+|-------|---------|----------|---------|-------------|
+| **Server** | `CONCURRENT_CRAWL_LIMIT` | `knowledge_api.py` | 3 | Max simultaneous crawl operations |
+| **Crawl** | `CRAWL_MAX_CONCURRENT` | Database (credentials) | 10 | Max pages crawled in parallel |
+| **Crawl** | `CRAWL_BATCH_SIZE` | Database (credentials) | 50 | URLs per batch |
+| **Crawl** | `MEMORY_THRESHOLD_PERCENT` | Database (credentials) | 80 | Memory throttle threshold |
+| **Threading** | `base_workers` | `ThreadingConfig` | 4 | Base worker count |
+| **Threading** | `max_workers` | `ThreadingConfig` | 16 | Maximum worker count |
+| **Threading** | `memory_threshold` | `ThreadingConfig` | 0.8 | Memory limit (80%) |
+| **Threading** | `cpu_threshold` | `ThreadingConfig` | 0.9 | CPU limit (90%) |
+
+### Performance Characteristics
+
+**Typical Performance** (on 8-core, 16GB RAM server):
+
+- **Single Site Crawl**: 50-100 pages in 2-3 minutes
+- **Concurrent 3 Sites**: 150-300 pages in 3-5 minutes (minimal slowdown)
+- **Memory Usage**: 2-4GB per active crawl operation
+- **CPU Usage**: 50-70% during active crawling, 80-90% during embedding generation
+
+**Scalability**:
+- Can handle 3 simultaneous large crawls (1000+ pages each)
+- Automatically throttles if memory exceeds 80%
+- Each crawl processes 10 pages in parallel by default
+- Threading service adjusts 4-16 workers based on system load
+
+---
+
+## Summary
+
+The ArchonDM crawl process is a robust, production-ready system that:
+
+1. **Intelligently analyzes URLs** to determine the best crawl strategy
+2. **Executes crawls efficiently** using multi-level parallel processing
+3. **Manages server resources** with adaptive concurrency control
+4. **Provides real-time feedback** via HTTP polling with granular progress updates
+5. **Stores documents** in Supabase with vector embeddings for RAG
+6. **Extracts code examples** automatically with AI-powered summaries
+7. **Supports cancellation** for long-running operations
+8. **Handles errors gracefully** with retry logic and detailed logging
+9. **Scales intelligently** based on CPU, memory, and system load
+
+### Key Files Reference
+
+- **Entry Point**: `python/src/server/api_routes/knowledge_api.py`
+- **Orchestrator**: `python/src/server/services/crawling/crawling_service.py`
+- **Crawler Manager**: `python/src/server/services/crawler_manager.py`
+- **Threading Service**: `python/src/server/services/threading_service.py`
+- **Strategies**: `python/src/server/services/crawling/strategies/*.py`
+- **URL Handling**: `python/src/server/services/crawling/helpers/url_handler.py`
+- **Document Storage**: `python/src/server/services/crawling/document_storage_operations.py`
+- **Progress Mapping**: `python/src/server/services/crawling/progress_mapper.py`
+
+### External Dependencies
+
+- **Crawl4AI v0.6.2**: Web crawling engine with Playwright
+- **Supabase**: Vector database for document storage
+- **OpenAI/Google/Ollama**: Embedding generation
+- **FastAPI**: REST API framework
+- **Pydantic**: Data validation
+- **psutil**: System resource monitoring
+- **asyncio**: Asynchronous I/O and concurrency control
+
diff --git a/PRPs/ai_docs/CODE_EXTRACTION.md b/PRPs/ai_docs/CODE_EXTRACTION.md
new file mode 100644
index 0000000000..301b6d3614
--- /dev/null
+++ b/PRPs/ai_docs/CODE_EXTRACTION.md
@@ -0,0 +1,1365 @@
+# Code Extraction and Parsing
+
+This document describes how Archon identifies, extracts, validates, and stores code examples from crawled websites and uploaded documents.
+
+## Overview
+
+The code extraction system is implemented in two primary service files:
+
+- **`python/src/server/services/crawling/code_extraction_service.py`** - Main extraction logic and validation
+- **`python/src/server/services/storage/code_storage_service.py`** - Storage, embedding generation, and AI summarization
+
+The system uses multiple extraction strategies based on content type, performs comprehensive quality validation, and stores code examples with AI-generated summaries and semantic embeddings for intelligent search.
+
+### ⚡ Key Concept: Separate Storage (with Intentional Duplication)
+
+**Code examples are stored in BOTH tables - this is intentional and beneficial:**
+
+| Aspect | Documents (Main Content) | Code Examples |
+|--------|-------------------------|---------------|
+| **Database Table** | `archon_crawled_pages` | `archon_code_examples` |
+| **Content** | Markdown chunks (INCLUDES code blocks with context) | Extracted code blocks ONLY |
+| **Code Included?** | ✅ Yes - as fenced blocks (` ```python...``` `) | ✅ Yes - extracted and isolated |
+| **Embedding Source** | Full markdown text (context + code + backticks) | Code + AI summary (no backticks, no context) |
+| **Metadata** | Document-level metadata | Code-specific metadata (language, example_name) |
+| **Search Endpoint** | `/search` (general) | `/search?search_code_examples=true` |
+| **Processing Order** | First (chunks, embeddings, storage) | Second (extraction, summaries, embeddings, storage) |
+| **Chunking Strategy** | Smart chunking that preserves code blocks intact | N/A - code extracted as complete units |
+
+**Important:** Code blocks appear in BOTH tables, but serve different purposes:
+
+#### In `archon_crawled_pages`:
+- **Purpose:** Preserve full context around code
+- **Format:** Markdown with backticks, explanatory text before/after
+- **Search Use:** "How do I..." / "What is the process for..."
+- **Example Result:** Tutorial-style content with code embedded
+
+#### In `archon_code_examples`:
+- **Purpose:** Isolated, searchable code examples
+- **Format:** Clean code with AI-generated summary
+- **Search Use:** "Show me a code example for..." / "Sample code to..."
+- **Example Result:** Copy-paste ready code snippets
+
+This duplication enables:
+- **Contextual learning** (documents) vs **quick reference** (code examples)
+- **Different embedding strategies** (natural language vs code semantics)
+- **Specialized search** for code-only queries
+- **Code-specific features** (language detection, validation, AI naming)
+- **Independent updates** (can re-extract code without re-crawling documents)
+
+## Extraction Strategies
+
+The system employs three distinct extraction strategies depending on the source type. **Importantly, both HTML and markdown versions of crawled pages are available**, but they are tried in priority order for optimal code extraction.
+
+### Extraction Priority Order
+
+For regular web pages:
+1. **HTML extraction (PRIMARY)** - Tries HTML patterns first to preserve code block structure and metadata
+2. **Markdown extraction (FALLBACK)** - Falls back to markdown triple-backtick extraction if HTML yields no results
+
+For uploaded files:
+1. **Text files (.txt, .md)** - Specialized text extraction (backticks, language labels, indented blocks)
+2. **PDF files** - Section-based code detection using code vs. prose scoring
+
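+The ordering for web pages reduces to a simple fallback, sketched below with placeholder function names (`extract_from_html` and `extract_from_markdown` are illustrative, not the service's real methods):
+
+```python
+# Sketch of the HTML-first, markdown-fallback ordering.
+def extract_code_blocks(html: str | None, markdown: str | None) -> list[dict]:
+    blocks: list[dict] = []
+    if html:
+        blocks = extract_from_html(html)          # try the HTML patterns first
+    if not blocks and markdown:
+        blocks = extract_from_markdown(markdown)  # fall back to fenced ``` blocks
+    return blocks
+```
+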
+### 1. HTML-Based Extraction (Primary Method)
+
+For web pages, the system **first attempts HTML extraction** by matching **30+ HTML patterns** used by popular documentation frameworks and syntax highlighters.
+
+#### Supported Frameworks and Patterns
+
+| Framework/Tool | HTML Pattern Example | Priority |
+|---------------|---------------------|----------|
+| **GitHub/GitLab** | `...` | High |
+| **Docusaurus** | `...` | High |
+| **VitePress** | `...` | High |
+| **Prism.js** | `` | High |
+| **highlight.js** | `` | Medium |
+| **Shiki** | `` | Medium |
+| **Monaco Editor** | `` | Medium |
+| **CodeMirror** | `` | Medium |
+| **Nextra** | `...` | Medium |
+| **Astro** | `` | Medium |
+| **Milkdown** | `` | Low |
+| **Generic** | `` | Low (Fallback) |
+
+**Pattern Matching Strategy:**
+- Patterns are ordered by specificity (most specific first)
+- Each pattern extracts:
+ - Code content
+ - Language identifier (if available)
+ - Context before and after the code block
+- Overlapping extractions are deduplicated by position
+
+**HTML Cleaning Process:**
+```python
+# Steps performed during extraction:
+1. Decode HTML entities: &lt; → <, &gt; → >, &amp; → &
+2. Remove syntax highlighting spans: <span class="...">...</span>
+3. Clean CodeMirror/Monaco nested div structures
+4. Preserve code formatting and indentation
+5. Fix spacing issues from tag removal
+```
+
+#### Example: Docusaurus Pattern
+
+```html
+<!-- Simplified illustration of Docusaurus/Prism output (token spans shown schematically) -->
+<div class="theme-code-block">
+  <pre class="prism-code language-python">
+    <code>
+      <span class="token keyword">def</span>
+      <span class="token function">hello</span>
+      <span class="token punctuation">(</span>
+      <span class="token punctuation">)</span>
+      <span class="token punctuation">:</span>
+    </code>
+  </pre>
+</div>
+```
+
+**Extracted Result:**
+```python
+{
+ "code": "def hello():",
+ "language": "python",
+ "context_before": "...",
+ "context_after": "..."
+}
+```
+
+### 2. Markdown Extraction (Fallback Method)
+
+**Only used when HTML extraction yields no results.** The system extracts **triple-backtick fenced code blocks** from the markdown-converted content:
+
+**Why HTML is preferred over Markdown:**
+- HTML preserves original code block structure (``, `` tags)
+- Language info is explicit in CSS classes (`language-python`, `hljs typescript`)
+- Framework-specific patterns can be detected (Docusaurus, VitePress, etc.)
+- Syntax highlighter HTML contains rich metadata
+- HTML → Markdown conversion may lose structural information
+
+**When Markdown is used:**
+- HTML extraction found no code blocks
+- Content is a raw markdown file (.md upload)
+- HTML content is not available
+- Simpler extraction is sufficient
+
+````markdown
+```python
+def hello():
+ print("Hello, World!")
+```
+````
+
+**Extraction Features:**
+- Detects language from fence (e.g., ` ```python`)
+- Extracts context (configurable window size, default: 1000 characters before/after)
+- Handles nested/corrupted markdown structures
+- Validates code block completeness
+
+**Corrupted Markdown Detection:**
+````markdown
+# Detects and handles cases like:
+```K`
+
+```
+````
+
+The system recognizes this as corrupted and extracts the inner content.
+
+### 3. Plain Text Extraction (Text Files and PDFs)
+
+For uploaded `.txt`, `.md` files and PDF-extracted text, the system uses three detection methods:
+
+#### Method 1: Triple Backticks
+````text
+Here's an example:
+
+```typescript
+const example = "code";
+```
+````
+
+#### Method 2: Language Labels
+```text
+TypeScript:
+ const example = "code";
+ console.log(example);
+
+Python example:
+ def hello():
+ print("Hello")
+```
+
+#### Method 3: Indented Blocks
+```text
+Function implementation:
+
+ def process_data(items):
+ result = []
+ for item in items:
+ result.append(transform(item))
+ return result
+```
+
+**PDF-Specific Extraction:**
+
+PDFs lose markdown formatting during text extraction, so the system:
+1. Splits content by double newlines and page breaks
+2. Analyzes each section for "code-like" characteristics
+3. Scores sections using code vs. prose indicators
+4. Extracts sections with high code scores
+
+**Code vs. Prose Scoring:**
+
+```python
+# Code indicators (with weights):
+- Python imports: from X import Y (weight: 3)
+- Function definitions: def func() (weight: 3)
+- Class definitions: class Name (weight: 3)
+- Method calls: obj.method() (weight: 2)
+- Assignments: x = [...] (weight: 2)
+
+# Prose indicators (with weights):
+- Articles: the, this, that (weight: 1)
+- Sentence endings: . [A-Z] (weight: 2)
+- Transition words: however, therefore (weight: 2)
+
+# Section is considered code if:
+code_score > prose_score AND code_score > 2
+```
+
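+A hedged sketch of this scoring, using a subset of the indicators listed above (the exact patterns and weights in the service may differ):
+
+```python
+import re
+
+CODE_PATTERNS = [
+    (r"\bfrom\s+\w+\s+import\b", 3),   # Python imports
+    (r"\bdef\s+\w+\s*\(", 3),          # function definitions
+    (r"\bclass\s+\w+", 3),             # class definitions
+    (r"\w+\.\w+\s*\(", 2),             # method calls
+    (r"\w+\s*=\s*\[", 2),              # assignments
+]
+PROSE_PATTERNS = [
+    (r"\b(the|this|that)\b", 1),
+    (r"[.!?]\s+[A-Z]", 2),
+    (r"\b(however|therefore)\b", 2),
+]
+
+def looks_like_code(section: str) -> bool:
+    code_score = sum(w * len(re.findall(p, section)) for p, w in CODE_PATTERNS)
+    prose_score = sum(
+        w * len(re.findall(p, section, re.IGNORECASE)) for p, w in PROSE_PATTERNS
+    )
+    return code_score > prose_score and code_score > 2
+```
+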
+## Language Detection
+
+The system detects programming languages using:
+
+### 1. Explicit Language Tags
+- HTML: `class="language-python"`
+- Markdown: ` ```python`
+- Text files: "Python:" or "TypeScript example:"
+
+### 2. Content-Based Detection
+
+When no explicit language is provided, the system analyzes patterns:
+
+```python
+LANGUAGE_PATTERNS = {
+ "python": [
+ r"\bdef\s+\w+\s*\(", # Function definitions
+ r"\bclass\s+\w+", # Class definitions
+ r"\bimport\s+\w+", # Imports
+ r"\bfrom\s+\w+\s+import", # From imports
+ ],
+ "javascript": [
+ r"\bfunction\s+\w+\s*\(", # Functions
+ r"\bconst\s+\w+\s*=", # Const declarations
+ r"\blet\s+\w+\s*=", # Let declarations
+ r"\bvar\s+\w+\s*=", # Var declarations
+ ],
+ "typescript": [
+ r"\binterface\s+\w+", # Interfaces
+ r":\s*\w+\[\]", # Type annotations
+ r"\btype\s+\w+\s*=", # Type aliases
+ ],
+ # ... more languages
+}
+```
+
+### 3. Language-Specific Minimum Indicators
+
+Each language has required indicators for validation:
+
+| Language | Required Indicators |
+|----------|-------------------|
+| TypeScript | `:`, `{`, `}`, `=>`, `interface`, `type` |
+| JavaScript | `function`, `{`, `}`, `=>`, `const`, `let` |
+| Python | `def`, `:`, `return`, `self`, `import`, `class` |
+| Java | `class`, `public`, `private`, `{`, `}`, `;` |
+| Rust | `fn`, `let`, `mut`, `impl`, `struct`, `->` |
+| Go | `func`, `type`, `struct`, `{`, `}`, `:=` |
+
+## Code Quality Validation
+
+After extraction, every code block undergoes rigorous validation through multiple filters:
+
+### 1. Length Validation
+
+**Dynamic Minimum Length:**
+
+The minimum length adapts based on language and context:
+
+```python
+# Base minimum lengths by language:
+- JSON/YAML/XML: 100 characters
+- HTML/CSS/SQL: 150 characters
+- Python/Go: 200 characters
+- JavaScript/TypeScript/Rust: 250 characters
+- Java/C++: 300 characters
+
+# Context-based adjustments:
+- Contains "example", "snippet", "demo": multiply by 0.7
+- Contains "implementation", "complete": multiply by 1.5
+- Contains "minimal", "simple", "basic": multiply by 0.8
+
+# Configurable bounds: 100-1000 characters
+```
+
+**Maximum Length:**
+- Default: 5,000 characters
+- Prevents extraction of entire files or corrupted content
+
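+Putting the base lengths and multipliers above together, the rule can be sketched as follows (the function itself is illustrative; the numbers are the ones listed in this document):
+
+```python
+BASE_MIN_LENGTH = {
+    "json": 100, "yaml": 100, "xml": 100,
+    "html": 150, "css": 150, "sql": 150,
+    "python": 200, "go": 200,
+    "javascript": 250, "typescript": 250, "rust": 250,
+    "java": 300, "cpp": 300,
+}
+
+def dynamic_min_length(language: str, context: str) -> int:
+    min_length = float(BASE_MIN_LENGTH.get(language.lower(), 250))
+    context_lower = context.lower()
+
+    if any(word in context_lower for word in ("example", "snippet", "demo")):
+        min_length *= 0.7
+    elif any(word in context_lower for word in ("implementation", "complete")):
+        min_length *= 1.5
+    elif any(word in context_lower for word in ("minimal", "simple", "basic")):
+        min_length *= 0.8
+
+    # Clamp to the configurable 100-1000 character bounds
+    return int(min(1000, max(100, min_length)))
+```
+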
+### 2. Code Indicator Validation
+
+Requires at least **3 code indicators** (configurable via `MIN_CODE_INDICATORS`):
+
+```python
+CODE_INDICATORS = {
+ "function_calls": r"\w+\s*\([^)]*\)",
+ "assignments": r"\w+\s*=\s*.+",
+ "control_flow": r"\b(if|for|while|switch|case|try|catch|except)\b",
+ "declarations": r"\b(var|let|const|def|class|function|interface|type|struct|enum)\b",
+ "imports": r"\b(import|from|require|include|using|use)\b",
+ "brackets": r"[\{\}\[\]]",
+ "operators": r"[\+\-\*\/\%\&\|\^<>=!]",
+ "method_chains": r"\.\w+",
+ "arrows": r"(=>|->)",
+ "keywords": r"\b(return|break|continue|yield|await|async)\b",
+}
+```
+
+**Validation Process:**
+1. Count matches for each indicator pattern
+2. Reject if fewer than minimum required (default: 3)
+3. Log which indicators were found for debugging
+
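+In sketch form, the check counts how many indicator categories match and rejects blocks below the threshold (assumes the `CODE_INDICATORS` dict above is in scope; the function name is illustrative):
+
+```python
+import re
+
+def has_enough_code_indicators(code: str, min_indicators: int = 3) -> bool:
+    matched = [
+        name for name, pattern in CODE_INDICATORS.items()
+        if re.search(pattern, code)
+    ]
+    return len(matched) >= min_indicators
+```
+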
+### 3. Prose Filtering
+
+Rejects blocks that appear to be documentation rather than code:
+
+**Prose Indicators:**
+```python
+PROSE_PATTERNS = [
+ r"\b(the|this|that|these|those|is|are|was|were|will|would|should|could|have|has|had)\b",
+ r"[.!?]\s+[A-Z]", # Sentence endings
+ r"\b(however|therefore|furthermore|moreover|nevertheless)\b",
+]
+```
+
+**Threshold:**
+- Default: 15% prose ratio (`MAX_PROSE_RATIO`)
+- If `prose_score / word_count > 0.15`, block is rejected
+
+**Example:**
+```text
+This is a description of how to use the API. You should first
+initialize the client, then make requests. The responses will
+be in JSON format.
+```
+This would be rejected (high prose ratio), while:
+```python
+# Initialize client
+client = APIClient()
+response = client.get("/users")
+data = response.json()
+```
+This would pass (low prose ratio, high code indicators).
+
+### 4. Diagram Filtering
+
+Filters out ASCII art diagrams and visual representations:
+
+**Diagram Indicators:**
+```python
+DIAGRAM_CHARS = [
+ "┌", "┐", "└", "┘", "│", "─", "├", "┤", "┬", "┴", "┼", # Box drawing
+ "+-+", "|_|", "___", "...", # ASCII art
+ "→", "←", "↑", "↓", "⟶", "⟵", # Arrows
+]
+```
+
+**Detection Logic:**
+- Count lines with >70% special characters
+- Count diagram indicator occurrences
+- Reject if `special_char_lines >= 3` OR `diagram_indicators >= 5` AND `code_patterns < 5`
+
+**Example Rejected:**
+```text
+┌─────────────┐
+│ Server │
+│ │
+└──────┬──────┘
+ │
+ ↓
+┌─────────────┐
+│ Client │
+└─────────────┘
+```
+
+### 5. Language-Specific Validation
+
+For detected languages, validates presence of language-specific patterns:
+
+```python
+# Example: Python validation
+if language == "python":
+ required_indicators = ["def", ":", "return", "self", "import", "class"]
+ found = sum(1 for indicator in required_indicators if indicator in code.lower())
+
+ if found < 2: # Need at least 2 language-specific indicators
+ reject_block()
+```
+
+### 6. Structure Validation
+
+Additional structural checks:
+
+- **Minimum lines:** At least 3 non-empty lines
+- **Line length:** Reject if >50% of lines exceed 300 characters
+- **Comment ratio:** Reject if >70% of lines are comments
+- **Bad patterns:** Reject if contains:
+  - Unescaped HTML entities (`&lt;`, `&amp;`)
+ - Excessive HTML tags
+ - Concatenated keywords without spaces (e.g., `fromimport`)
+
+## Code Cleaning Process
+
+Extracted code undergoes cleaning to remove HTML artifacts and fix formatting issues:
+
+### HTML Entity Decoding
+
+```python
+REPLACEMENTS = {
+    "&lt;": "<",
+    "&gt;": ">",
+    "&amp;": "&",
+    "&quot;": '"',
+    "&#39;": "'",
+    "&nbsp;": " ",
+    "&#x27;": "'",
+    "&#x2F;": "/",
+}
+```
+
+### HTML Tag Removal
+
+```python
+# Strategy depends on tag usage:
+
+# 1. Syntax highlighting (no spaces between tags)
+if "]*>", "", text)
+ text = re.sub(r"", "", text)
+
+# 2. Normal span usage
+else:
+ # Add space where needed to prevent concatenation
+ text = re.sub(r"(?=[A-Za-z0-9])", " ", text)
+ text = re.sub(r"]*>", "", text)
+```
+
+### Spacing Fixes
+
+Fixes common concatenation issues from tag removal:
+
+```python
+SPACING_FIXES = [
+ # Import statements
+ (r"(\b(?:from|import|as)\b)([A-Za-z])", r"\1 \2"),
+
+ # Function/class definitions
+ (r"(\b(?:def|class|async|await|return)\b)([A-Za-z])", r"\1 \2"),
+
+ # Control flow
+ (r"(\b(?:if|elif|else|for|while|try|except)\b)([A-Za-z])", r"\1 \2"),
+
+ # Operators (careful with negative numbers)
+ (r"([A-Za-z_)])(\+|-|\*|/|=|<|>|%)", r"\1 \2"),
+ (r"(\+|-|\*|/|=|<|>|%)([A-Za-z_(])", r"\1 \2"),
+]
+```
+
+### Language-Specific Fixes
+
+```python
+# Python-specific
+if language == "python":
+ # Fix: "fromxximportyy" → "from xx import yy"
+ code = re.sub(r"(\bfrom\b)(\w+)(\bimport\b)", r"\1 \2 \3", code)
+
+ # Add missing colons at line ends
+ code = re.sub(
+ r"(\b(?:def|class|if|elif|else|for|while)\b[^:]+)$",
+ r"\1:",
+ code,
+ flags=re.MULTILINE
+ )
+```
+
+### Whitespace Normalization
+
+```python
+# Process each line individually to preserve indentation
+lines = code.split("\n")
+cleaned_lines = []
+
+for line in lines:
+ stripped = line.lstrip()
+ indent = line[:len(line) - len(stripped)] # Preserve indentation
+
+ # Normalize internal spacing
+ cleaned = re.sub(r" {2,}", " ", stripped)
+ cleaned = cleaned.rstrip() # Remove trailing spaces
+
+ cleaned_lines.append(indent + cleaned)
+
+code = "\n".join(cleaned_lines).strip()
+```
+
+## Code Deduplication
+
+The system automatically deduplicates similar code variants (e.g., Python 3.9 vs 3.10 syntax):
+
+### Normalization for Comparison
+
+Before comparing code blocks, they are normalized:
+
+```python
+def _normalize_code_for_comparison(code: str) -> str:
+ # 1. Normalize whitespace
+ normalized = re.sub(r"\s+", " ", code.strip())
+
+ # 2. Normalize typing imports
+ normalized = re.sub(r"from typing_extensions import", "from typing import", normalized)
+ normalized = re.sub(r"Annotated\[\s*([^,\]]+)[^]]*\]", r"\1", normalized)
+
+ # 3. Normalize FastAPI parameters
+ normalized = re.sub(r":\s*Annotated\[[^\]]+\]\s*=", "=", normalized)
+
+ # 4. Normalize trailing commas
+ normalized = re.sub(r",\s*\)", ")", normalized)
+
+ return normalized
+```
+
+### Similarity Calculation
+
+```python
+# Uses Python's difflib.SequenceMatcher
+similarity = SequenceMatcher(None, normalized1, normalized2).ratio()
+
+# Threshold: 0.85 (85% similarity)
+if similarity >= 0.85:
+ # Blocks are considered duplicates
+```
+
+### Best Variant Selection
+
+When duplicates are found, the system selects the best variant:
+
+```python
+def score_block(block):
+ score = 0
+
+ # Prefer explicit language specification
+ if block.get("language") and block["language"] not in ["", "text", "plaintext"]:
+ score += 10
+
+ # Prefer longer, more comprehensive examples
+ score += len(block["code"]) * 0.01
+
+ # Prefer blocks with better context
+ context_len = len(block.get("context_before", "")) + len(block.get("context_after", ""))
+ score += context_len * 0.005
+
+ # Slight preference for modern syntax
+ if "python 3.10" in block.get("full_context", "").lower():
+ score += 5
+
+ return score
+
+best_block = max(similar_blocks, key=score_block)
+```
+
+**Metadata Tracking:**
+
+The best variant includes metadata about consolidation:
+
+```python
+{
+ "code": "...",
+ "language": "python",
+ "consolidated_variants": 3, # Number of similar variants found
+ "variant_languages": ["python", "python3"], # Languages from all variants
+}
+```
+
+## AI-Powered Summarization
+
+After extraction and validation, the system generates AI summaries for each code block:
+
+### Summary Generation Process
+
+```python
+async def _generate_code_example_summary_async(
+ code: str,
+ context_before: str,
+ context_after: str,
+ language: str = "",
+ provider: str = None
+) -> dict[str, str]:
+```
+
+**Prompt Template:**
+
+```xml
+<context_before>
+{last 500 chars of context before}
+</context_before>
+
+<code_example>
+{first 1500 chars of code}
+</code_example>
+
+<context_after>
+{first 500 chars of context after}
+</context_after>
+
+Based on the code example and its surrounding context, provide:
+1. A concise, action-oriented name (1-4 words) that describes what this code DOES
+ Good: "Parse JSON Response", "Validate Email", "Connect PostgreSQL"
+ Bad: "Function Example", "Code Snippet", "JavaScript Code"
+
+2. A summary (2-3 sentences) describing what the code demonstrates
+
+Format as JSON:
+{
+ "example_name": "Action-oriented name",
+ "summary": "Description of what the code demonstrates"
+}
+```
+
+**LLM Configuration:**
+
+- **Model:** Uses `MODEL_CHOICE` setting (default: `gpt-4o-mini`)
+- **Temperature:** 0.3 (consistent, focused responses)
+- **Max tokens:** 500
+- **Response format:** JSON object
+- **Provider:** Unified LLM provider service (supports OpenAI, Anthropic, etc.)
+
+### Batch Processing
+
+Summaries are generated in batches with rate limiting:
+
+```python
+async def generate_code_summaries_batch(
+ code_blocks: list[dict],
+ max_workers: int = 3, # Configurable via CODE_SUMMARY_MAX_WORKERS
+ progress_callback = None
+):
+    # Use a semaphore to limit concurrent summary requests
+    semaphore = asyncio.Semaphore(max_workers)
+
+    async def generate_single_summary_with_limit(block):
+        async with semaphore:
+            # Add a 500ms delay before each request to avoid rate limiting
+            await asyncio.sleep(0.5)
+            return await _generate_code_example_summary_async(
+                code=block["code"],
+                context_before=block.get("context_before", ""),
+                context_after=block.get("context_after", ""),
+                language=block.get("language", ""),
+            )
+
+    # Process all blocks concurrently, bounded by the semaphore
+    summaries = await asyncio.gather(
+        *[generate_single_summary_with_limit(block) for block in code_blocks],
+        return_exceptions=True
+    )
+```
+
+**Progress Reporting:**
+
+During batch processing, progress updates are sent:
+
+```python
+{
+ "status": "code_extraction",
+ "percentage": 45, # (completed / total) * 100
+ "log": "Generated 23/50 code summaries",
+ "completed_summaries": 23,
+ "total_summaries": 50
+}
+```
+
+### Fallback Handling
+
+If AI summarization fails or is disabled:
+
+```python
+{
+ "example_name": f"Code Example ({language})" if language else "Code Example",
+ "summary": "Code example for demonstration purposes."
+}
+```
+
+## Storage and Embedding
+
+Code examples are stored in the `archon_code_examples` table with semantic embeddings:
+
+### Embedding Generation
+
+**Combined Text for Embedding:**
+
+```python
+combined_text = f"{code}\n\nSummary: {summary}"
+```
+
+**Contextual Embeddings (Optional):**
+
+When `USE_CONTEXTUAL_EMBEDDINGS=true`, the system enriches embeddings with full document context:
+
+```python
+# Use LLM to create situating context
+situating_context = await generate_situating_context(
+ full_document=url_to_full_document[url],
+ chunk=combined_text
+)
+
+# Enhanced embedding input
+enhanced_text = f"{situating_context}\n\n{combined_text}"
+```
+
+**Embedding Dimensions:**
+
+The system supports multiple embedding dimensions:
+- **768** (e.g., `nomic-embed-text`, `all-mpnet-base-v2`)
+- **1024** (e.g., text-embedding-3-small with 1024 dims)
+- **1536** (e.g., text-embedding-ada-002, text-embedding-3-small)
+- **3072** (e.g., text-embedding-3-large)
+
+### Database Schema
+
+Each code example is stored with:
+
+```python
+{
+ "url": str, # Source URL
+ "chunk_number": int, # Sequential number for this URL
+ "content": str, # The actual code
+ "summary": str, # AI-generated summary
+ "metadata": { # JSON object
+ "chunk_index": int,
+ "url": str,
+ "source": str,
+ "source_id": str,
+ "language": str,
+ "char_count": int,
+ "word_count": int,
+ "example_name": str,
+ "title": str,
+ "contextual_embedding": bool, # If contextual embedding used
+ "consolidated_variants": int, # If deduplicated
+ "variant_languages": [str] # Languages from variants
+ },
+ "source_id": str, # Domain or identifier
+ "embedding_768": vector, # Embedding (dimension-specific column)
+ "embedding_1024": vector,
+ "embedding_1536": vector,
+ "embedding_3072": vector,
+ "llm_chat_model": str, # Model used for summaries/contextual
+ "embedding_model": str, # Model used for embeddings
+ "embedding_dimension": int, # Actual dimension used
+ "created_at": timestamp
+}
+```
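+
+Because the table carries one vector column per supported dimension, the storage code has to route each embedding to the matching column. A minimal sketch of that routing (the helper name is illustrative):
+
+```python
+SUPPORTED_DIMENSIONS = {768, 1024, 1536, 3072}
+
+def embedding_column(embedding: list[float]) -> str:
+    """Return the dimension-specific column name, e.g. 'embedding_1536'."""
+    dim = len(embedding)
+    if dim not in SUPPORTED_DIMENSIONS:
+        raise ValueError(f"Unsupported embedding dimension: {dim}")
+    return f"embedding_{dim}"
+
+record[embedding_column(embedding)] = embedding
+record["embedding_dimension"] = len(embedding)
+```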
+
+### Batch Insertion
+
+```python
+# Insert in batches of 20 (configurable)
+batch_size = 20
+max_retries = 3
+
+for i in range(0, total_items, batch_size):
+    batch_data = prepare_batch(i, batch_size)
+    retry_delay = 1.0
+
+    # Retry logic with exponential backoff
+    for retry in range(max_retries):
+        try:
+            client.table("archon_code_examples").insert(batch_data).execute()
+            break
+        except Exception:
+            if retry < max_retries - 1:
+                await asyncio.sleep(retry_delay)
+                retry_delay *= 2  # Exponential backoff
+            else:
+                # Final attempt failed: fall back to inserting records individually
+                for record in batch_data:
+                    client.table("archon_code_examples").insert(record).execute()
+```
+
+## Configuration Settings
+
+All code extraction behavior can be tuned via environment variables or the Settings UI:
+
+### Length Settings
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `MIN_CODE_BLOCK_LENGTH` | `250` | Minimum characters for code blocks |
+| `MAX_CODE_BLOCK_LENGTH` | `5000` | Maximum characters (prevents corruption) |
+
+### Validation Settings
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `MIN_CODE_INDICATORS` | `3` | Minimum required code indicators |
+| `ENABLE_PROSE_FILTERING` | `true` | Filter out documentation text |
+| `MAX_PROSE_RATIO` | `0.15` | Maximum allowed prose ratio (15%) |
+| `ENABLE_DIAGRAM_FILTERING` | `true` | Filter out ASCII art diagrams |
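+
+As a rough illustration of how the prose-ratio check (`MAX_PROSE_RATIO`) can work, a block is rejected when too many of its lines read like sentences rather than code. The real heuristic may weigh indicators differently; this is only a sketch:
+
+```python
+def looks_like_prose(text: str, max_prose_ratio: float = 0.15) -> bool:
+    """Sketch: flag blocks whose lines read like sentences rather than code."""
+    lines = [line.strip() for line in text.splitlines() if line.strip()]
+    if not lines:
+        return True
+    prose_lines = sum(
+        1 for line in lines
+        if line.endswith((".", "!", "?")) and not line.startswith(("#", "//", "*"))
+    )
+    return (prose_lines / len(lines)) > max_prose_ratio
+```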
+
+### Context Settings
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `CONTEXT_WINDOW_SIZE` | `1000` | Characters of context before/after |
+| `ENABLE_CONTEXTUAL_LENGTH` | `true` | Adjust min length by context |
+| `ENABLE_COMPLETE_BLOCK_DETECTION` | `true` | Find complete code blocks |
+| `ENABLE_LANGUAGE_SPECIFIC_PATTERNS` | `true` | Use language-specific validation |
+
+### AI Settings
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `ENABLE_CODE_SUMMARIES` | `true` | Generate AI summaries |
+| `CODE_SUMMARY_MAX_WORKERS` | `3` | Concurrent summary requests |
+| `MODEL_CHOICE` | `gpt-4o-mini` | Model for summaries |
+
+### Embedding Settings
+
+| Setting | Default | Description |
+|---------|---------|-------------|
+| `USE_CONTEXTUAL_EMBEDDINGS` | `false` | Enrich embeddings with document context |
+| `CONTEXTUAL_EMBEDDINGS_MAX_WORKERS` | `3` | Concurrent contextual requests |
+
+## Processing Pipeline: Documents vs Code Examples
+
+When a website is crawled, the content goes through **two separate pipelines**:
+
+### Pipeline 1: Document Storage (Happens First)
+
+```python
+# 1. Crawl website → Get HTML & Markdown
+crawl_results = [
+ {
+ "url": "https://fastapi.tiangolo.com/tutorial/",
+ "html": "...",
+ "markdown": "# Tutorial\n\nFastAPI is...",
+ "title": "Tutorial - FastAPI"
+ }
+]
+
+# 2. Chunk markdown into manageable pieces (INCLUDES code blocks!)
+# Smart chunking breaks at code block boundaries to keep code intact
+chunks = [
+ "FastAPI is a modern, fast web framework for building APIs with Python...",
+ "To create an API, first install FastAPI:\n\n```bash\npip install fastapi\n```\n\nThen create your first app...",
+ "Define routes using decorators:\n\n```python\n@app.get(\"/\")\ndef read_root():\n return {\"Hello\": \"World\"}\n```\n\nThis creates a GET endpoint..."
+]
+
+# 3. Create embeddings for each chunk (including embedded code)
+chunk_embeddings = create_embeddings_batch(chunks)
+
+# 4. Store in archon_crawled_pages table
+await add_documents_to_supabase(
+ contents=chunks,
+ embeddings=chunk_embeddings,
+ table="archon_crawled_pages"
+)
+```
+
+### Pipeline 2: Code Example Extraction (Happens Second)
+
+```python
+# 5. Extract code from HTML (using same crawl_results)
+code_blocks = extract_html_code_blocks(html_content)
+# Result: [{"code": "from fastapi import...", "language": "python", ...}]
+
+# 6. Generate AI summaries for each code block
+summaries = generate_code_summaries_batch(code_blocks)
+# Result: [{"example_name": "Create FastAPI App", "summary": "..."}]
+
+# 7. Create embeddings for code + summary
+code_embeddings = create_embeddings_batch([
+ f"{code}\n\nSummary: {summary}"
+ for code, summary in zip(code_blocks, summaries)
+])
+
+# 8. Store in archon_code_examples table (separate!)
+await add_code_examples_to_supabase(
+ code_examples=code_blocks,
+ summaries=summaries,
+ embeddings=code_embeddings,
+ table="archon_code_examples"
+)
+```
+
+### Why Store Code in Both Places?
+
+**Yes, code appears in BOTH tables - this is a feature, not redundancy!**
+
+1. **Different purposes:**
+ - **Documents:** Learning and context ("Here's how to use FastAPI [code] and why it works...")
+ - **Code examples:** Quick reference and copy-paste ("Show me the code for X")
+
+2. **Different search semantics:**
+ - **Documents:** Natural language queries about concepts
+ - **Code examples:** Code-focused queries for specific implementations
+
+3. **Different embeddings:**
+ - **Documents:** Embedded as markdown text (optimized for prose + code context)
+ - **Code examples:** Embedded as code + AI summary (optimized for code semantics)
+
+4. **Specialized features for code:**
+ - Language detection and validation
+ - AI-generated example names
+ - Code-specific metadata
+ - Quality filtering (prose detection, diagram filtering)
+
+5. **Smart chunking preserves code:**
+ - Chunks break at code block boundaries (` ``` `)
+ - Code blocks stay intact within document chunks
+ - Users get context around code when searching documents
+
+6. **Flexibility:**
+ - Can re-extract code without re-crawling entire site
+ - Can disable code extraction but keep document chunks
+ - Can search documents-only, code-only, or both
+
+### Search Behavior
+
+When users search the knowledge base:
+
+```python
+# Search documents only (default)
+results = search(query="how to use fastapi", search_code_examples=False)
+# Returns: Text chunks explaining FastAPI concepts
+
+# Search code examples only
+results = search(query="fastapi route example", search_code_examples=True)
+# Returns: Actual code blocks with @app.get() decorators
+
+# Search both (when enabled)
+results = search(query="fastapi tutorial", search_both=True)
+# Returns: Mixed results from both tables
+```
+
+## Complete Extraction Flow
+
+Here's a step-by-step example of the complete process:
+
+### Example: Crawling FastAPI Documentation
+
+**1. Crawl Page**
+```
+URL: https://fastapi.tiangolo.com/tutorial/first-steps/
+```
+
+**2. Detect Framework**
+```python
+# System detects: Docusaurus + Prism.js
+wait_selector = ".markdown, .theme-doc-markdown, article"
+```
+
+**3. Extract HTML Code Block**
+```html
+<div class="codeBlockContainer">
+  <pre class="prism-code language-python">
+    <code>
+      <span class="token keyword">from</span>
+      <span class="token plain"> fastapi </span>
+      <span class="token keyword">import</span>
+      <span class="token plain"> FastAPI</span>
+      ...
+    </code>
+  </pre>
+</div>
+```
+
+**4. Clean Code**
+```python
+# Before cleaning (span text concatenated without spaces):
+"fromfastapiimportFastAPI"
+
+# After cleaning:
+"from fastapi import FastAPI"
+```
+
+**5. Validate Quality**
+```python
+✓ Length: 342 characters (>250 minimum)
+✓ Code indicators: 8 found (function, import, =, (), return, {}, :, def)
+✓ Language indicators: 3 found (from, import, def)
+✓ Prose ratio: 0.05 (<0.15 threshold)
+✓ Structure: 12 non-empty lines
+```
+
+**6. Check for Duplicates**
+```python
+# Found similar block with 0.87 similarity (Python 3.9 syntax)
+# Selected current block (has Python 3.10+ syntax, longer)
+```
+
+**7. Generate AI Summary**
+```python
+# LLM Request
+{
+ "model": "gpt-4o-mini",
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant..."
+ },
+ {
+ "role": "user",
+ "content": "... ... "
+ }
+ ]
+}
+
+# Response
+{
+ "example_name": "Create Basic FastAPI App",
+ "summary": "Demonstrates how to create a minimal FastAPI application with a single GET endpoint. Shows the basic structure including importing FastAPI, creating an app instance, and defining a route with the @app.get() decorator."
+}
+```
+
+**8. Create Embedding**
+```python
+# Combined text for embedding
+combined = """from fastapi import FastAPI
+
+app = FastAPI()
+
+@app.get("/")
+def read_root():
+ return {"Hello": "World"}
+
+Summary: Demonstrates how to create a minimal FastAPI application..."""
+
+# Generate embedding (1536 dimensions)
+embedding = await create_embeddings_batch([combined])
+```
+
+**9. Store in Database**
+```python
+{
+ "url": "https://fastapi.tiangolo.com/tutorial/first-steps/",
+ "chunk_number": 0,
+ "content": "from fastapi import FastAPI\n\napp = FastAPI()...",
+ "summary": "Demonstrates how to create a minimal FastAPI application...",
+ "metadata": {
+ "language": "python",
+ "char_count": 342,
+ "word_count": 48,
+ "example_name": "Create Basic FastAPI App",
+ "title": "Create Basic FastAPI App",
+ "source_id": "fastapi.tiangolo.com",
+ "chunk_index": 0
+ },
+ "source_id": "fastapi.tiangolo.com",
+ "embedding_1536": [0.123, -0.456, 0.789, ...],
+ "llm_chat_model": "gpt-4o-mini",
+ "embedding_model": "text-embedding-3-small",
+ "embedding_dimension": 1536
+}
+```
+
+**10. Enable Semantic Search**
+
+User can now search:
+- "how to create a fastapi endpoint"
+- "basic fastapi hello world"
+- "fastapi route decorator example"
+
+The system will find this code example through semantic similarity matching.
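+
+A rough sketch of how such a search could be issued from Python follows; the RPC name below is hypothetical, and the real vector-search function is defined in `migration/complete_setup.sql`:
+
+```python
+query = "how to create a fastapi endpoint"
+query_embedding = (await create_embeddings_batch([query]))[0]
+
+results = client.rpc(
+    "match_archon_code_examples",   # hypothetical function name
+    {"query_embedding": query_embedding, "match_count": 5},
+).execute()
+
+for row in results.data:
+    print(row["metadata"]["example_name"], row["url"])
+```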
+
+## Performance Optimizations
+
+### Concurrent Processing
+
+- **HTML extraction:** Processes multiple patterns in parallel
+- **Code summaries:** Concurrent with rate limiting (default: 3 workers)
+- **Embeddings:** Batch processing with configurable batch size
+- **Storage:** Batch inserts (default: 20 records per batch)
+
+### Progress Reporting
+
+The system reports detailed progress at each phase:
+
+```python
+# Phase 1: Extraction (0-20%)
+{
+ "status": "code_extraction",
+ "progress": 15,
+ "log": "Extracted code from 3/20 documents (12 code blocks found)",
+ "completed_documents": 3,
+ "total_documents": 20,
+ "code_blocks_found": 12
+}
+
+# Phase 2: Summarization (20-90%)
+{
+ "status": "code_extraction",
+ "progress": 55,
+ "log": "Generated 7/12 code summaries",
+ "completed_summaries": 7,
+ "total_summaries": 12
+}
+
+# Phase 3: Storage (90-100%)
+{
+ "status": "code_extraction",
+ "progress": 95,
+ "log": "Stored batch 1/1 of code examples",
+ "batch_number": 1,
+ "total_batches": 1,
+ "examples_stored": 12
+}
+```
+
+### Caching
+
+- Settings are cached to avoid repeated database lookups
+- Language patterns are compiled once and reused
+- HTML entity replacements use pre-built mapping
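+
+A minimal sketch of these caching patterns (function names here are illustrative, not the actual service API):
+
+```python
+import re
+from functools import lru_cache
+
+# Compile language-detection patterns once at import time and reuse them
+PYTHON_PATTERN = re.compile(r"^\s*(def |class |import |from \w+ import )", re.MULTILINE)
+
+# Pre-built HTML entity mapping, applied in a single pass
+HTML_ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&", "&quot;": '"'}
+
+@lru_cache(maxsize=1)
+def get_code_extraction_settings() -> dict:
+    # Cached so repeated extractions do not hit the database for settings
+    return load_settings_from_database()
+```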
+
+## Error Handling
+
+The system follows Archon's beta development guidelines:
+
+### Fail Fast Scenarios
+
+- **Missing crawler instance:** Immediate error
+- **Invalid configuration:** Raises exception
+- **Corrupted data:** Skips with detailed error log
+
+### Complete with Logging
+
+- **Individual extraction failures:** Logs error, continues to next document
+- **Summary generation failures:** Uses fallback summary, continues
+- **Embedding failures:** Logs error, continues to next batch
+- **Storage failures:** Retries with exponential backoff, then individual inserts
+
+### Error Messages
+
+All errors include:
+- Operation being attempted
+- URL or document identifier
+- Specific error details
+- Stack trace (via `exc_info=True`)
+
+**Example Error Log:**
+
+```
+ERROR: Failed to extract code from document
+ URL: https://example.com/docs/api
+ Error: TimeoutError after 30 seconds
+ Attempt: 2/3
+ Stack trace: ...
+```
+
+## Related Files
+
+### Core Implementation
+
+- **`python/src/server/services/crawling/code_extraction_service.py`**
+ - Main extraction logic
+ - Quality validation
+ - HTML/markdown/text/PDF extraction strategies
+
+- **`python/src/server/services/storage/code_storage_service.py`**
+ - Markdown code block extraction
+ - AI summary generation
+ - Deduplication logic
+ - Embedding generation and storage
+
+### Supporting Services
+
+- **`python/src/server/services/embeddings/embedding_service.py`**
+ - Embedding generation
+ - Batch processing
+
+- **`python/src/server/services/embeddings/contextual_embedding_service.py`**
+ - Contextual embedding enrichment
+ - Situating context generation
+
+- **`python/src/server/services/llm_provider_service.py`**
+ - Unified LLM client
+ - Provider management (OpenAI, Anthropic, etc.)
+
+### Integration Points
+
+- **`python/src/server/services/crawling/orchestrator.py`**
+ - Orchestrates crawling and code extraction
+ - Progress tracking across all phases
+
+- **`python/src/server/services/storage/storage_services.py`**
+ - Document upload handling
+ - Code extraction trigger for uploaded files
+
+### Database
+
+- **`migration/complete_setup.sql`**
+ - `archon_code_examples` table schema
+ - Indexes for semantic search
+ - Vector search functions
+
+## Testing Code Extraction
+
+To test code extraction with different settings:
+
+```python
+# In Python REPL or test file
+from python.src.server.services.crawling.code_extraction_service import CodeExtractionService
+from python.src.server.config.supabase_config import get_supabase_client
+
+# Initialize
+client = get_supabase_client()
+service = CodeExtractionService(client)
+
+# Test HTML extraction
+html_content = """
+<pre><code class="language-python">
+def hello():
+    print("World")
+</code></pre>
+"""
+
+code_blocks = await service._extract_html_code_blocks(html_content)
+print(f"Found {len(code_blocks)} code blocks")
+
+# Test markdown extraction
+from python.src.server.services.storage.code_storage_service import extract_code_blocks
+
+md_content = """
+Here's an example:
+
+```python
+def hello():
+ print("World")
+```
+"""
+
+blocks = extract_code_blocks(md_content, min_length=50)
+print(f"Found {len(blocks)} markdown code blocks")
+```
+
+## Best Practices
+
+### For Documentation Sites
+
+1. **Use standard syntax highlighters** (Prism.js, highlight.js, Shiki)
+2. **Include language identifiers** in code blocks
+3. **Provide context** around code examples (explanatory text)
+4. **Keep examples focused** (roughly 250-5000 characters, matching the extractor's length limits)
+5. **Use semantic HTML** (`<pre><code>` structure)
+
+### For Code Authors
+
+1. **Add descriptive comments** near code examples
+2. **Use clear variable names** for better AI summarization
+3. **Structure examples** with clear beginning/end
+4. **Avoid very long examples** (>5000 characters)
+5. **Include language tags** in markdown fences
+
+### For Configuration
+
+1. **Start with defaults** - they work well for most cases
+2. **Increase `MIN_CODE_BLOCK_LENGTH`** if getting too many snippets
+3. **Decrease `MIN_CODE_INDICATORS`** for configuration files (JSON, YAML)
+4. **Enable `USE_CONTEXTUAL_EMBEDDINGS`** for better semantic search
+5. **Adjust `CODE_SUMMARY_MAX_WORKERS`** based on API rate limits
+
+## Troubleshooting
+
+### No Code Examples Extracted
+
+**Possible causes:**
+1. Code blocks are too short (< `MIN_CODE_BLOCK_LENGTH`)
+2. Content appears as prose (high prose ratio)
+3. HTML pattern not recognized
+4. Code lacks required indicators
+
+**Solutions:**
+- Check logs for validation failures
+- Lower `MIN_CODE_BLOCK_LENGTH` temporarily
+- Disable `ENABLE_PROSE_FILTERING` for testing
+- Add custom HTML pattern if needed
+
+### Extracting Non-Code Content
+
+**Possible causes:**
+1. `MIN_CODE_INDICATORS` too low
+2. Prose filtering disabled
+3. Diagram filtering disabled
+
+**Solutions:**
+- Increase `MIN_CODE_INDICATORS` to 4 or 5
+- Enable `ENABLE_PROSE_FILTERING`
+- Enable `ENABLE_DIAGRAM_FILTERING`
+- Check logs to see what passed validation
+
+### Too Many Duplicate Examples
+
+**Possible causes:**
+1. Similar examples from different pages
+2. Similarity threshold too high
+
+**Solutions:**
+- System already deduplicates at 85% similarity
+- Duplication across pages is intentional (different context)
+- Lower similarity threshold in code (requires code change)
+
+### AI Summaries Not Generating
+
+**Possible causes:**
+1. `ENABLE_CODE_SUMMARIES` is false
+2. LLM API key not configured
+3. Rate limiting issues
+
+**Solutions:**
+- Check `ENABLE_CODE_SUMMARIES` setting
+- Verify LLM provider credentials
+- Reduce `CODE_SUMMARY_MAX_WORKERS` to lower concurrent request volume
+- Check logs for API errors
+
+## Future Enhancements
+
+Potential improvements for code extraction:
+
+1. **Machine Learning Classification**
+ - Train model to classify code vs. prose
+ - Better language detection
+
+2. **Code Quality Scoring**
+ - Rank examples by completeness
+ - Prefer executable examples
+
+3. **Code Execution Validation**
+ - Test if code actually runs
+ - Verify imports are valid
+
+4. **Interactive Example Detection**
+ - Identify REPL/playground examples
+ - Extract input/output pairs
+
+5. **Code Relationship Mapping**
+ - Link related examples
+ - Track dependencies between code blocks
+
+6. **Custom Extraction Rules**
+ - User-defined HTML patterns
+ - Site-specific extraction logic
+ - Per-source configuration
+
+---
+
+**Document Version:** 1.0
+**Last Updated:** October 2025
+**Related Documentation:**
+- [Single Page Crawling Strategy](SINGLE_PAGE_CRAWLING.md)
+- [Architecture Overview](ARCHITECTURE.md)
+
diff --git a/PRPs/ai_docs/SINGLE_PAGE_CRAWLING.md b/PRPs/ai_docs/SINGLE_PAGE_CRAWLING.md
new file mode 100644
index 0000000000..a010aa582b
--- /dev/null
+++ b/PRPs/ai_docs/SINGLE_PAGE_CRAWLING.md
@@ -0,0 +1,339 @@
+# Single Page Crawling Strategy
+
+This document describes the `SinglePageCrawlStrategy` class implementation in `python/src/server/services/crawling/strategies/single_page.py`.
+
+## Overview
+
+The `SinglePageCrawlStrategy` class handles the crawling of individual web pages using the Crawl4AI library. It provides specialized configurations for different site types (documentation sites vs. regular sites) and implements robust retry logic with exponential backoff.
+
+## Class Structure
+
+### Initialization
+
+```python
+def __init__(self, crawler, markdown_generator):
+```
+
+**Parameters:**
+- `crawler` (AsyncWebCrawler): The Crawl4AI crawler instance for web crawling operations
+- `markdown_generator` (DefaultMarkdownGenerator): The markdown generator instance for converting HTML to markdown
+
+## Key Methods
+
+### 1. Documentation Site Detection
+
+```python
+def _get_wait_selector_for_docs(self, url: str) -> str:
+```
+
+Identifies the type of documentation framework used by a site based on URL patterns and returns appropriate CSS selectors to wait for content to load.
+
+**Supported Frameworks:**
+
+| Framework | URL Pattern | Wait Selector |
+|-----------|-------------|---------------|
+| Docusaurus | `docusaurus` | `.markdown, .theme-doc-markdown, article` |
+| VitePress | `vitepress` | `.VPDoc, .vp-doc, .content` |
+| GitBook | `gitbook` | `.markdown-section, .page-wrapper` |
+| MkDocs | `mkdocs` | `.md-content, article` |
+| Docsify | `docsify` | `#main, .markdown-section` |
+| CopilotKit | `copilotkit` | `div[class*="content"], div[class*="doc"], #__next` |
+| Milkdown | `milkdown` | `main, article, .prose, [class*="content"]` |
+| Generic | (fallback) | `body` |
+
+**Purpose:** JavaScript-heavy documentation sites need specific selectors to ensure content is fully loaded before extraction.
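+
+A simplified sketch of what this lookup amounts to (the real method may order or phrase the checks differently):
+
+```python
+def _get_wait_selector_for_docs(self, url: str) -> str:
+    url_lower = url.lower()
+    selectors = {
+        "docusaurus": ".markdown, .theme-doc-markdown, article",
+        "vitepress": ".VPDoc, .vp-doc, .content",
+        "gitbook": ".markdown-section, .page-wrapper",
+        "mkdocs": ".md-content, article",
+        "docsify": "#main, .markdown-section",
+        "copilotkit": 'div[class*="content"], div[class*="doc"], #__next',
+        "milkdown": 'main, article, .prose, [class*="content"]',
+    }
+    for framework, selector in selectors.items():
+        if framework in url_lower:
+            return selector
+    return "body"  # Generic fallback
+```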
+
+### 2. Main Crawling Method
+
+```python
+async def crawl_single_page(
+ self,
+ url: str,
+ transform_url_func: Callable[[str], str],
+ is_documentation_site_func: Callable[[str], bool],
+ retry_count: int = 3
+) -> dict[str, Any]:
+```
+
+The primary method for crawling individual web pages with sophisticated retry logic and content validation.
+
+#### Parameters
+
+- `url`: The web page URL to crawl
+- `transform_url_func`: Function to transform URLs (e.g., GitHub URLs to raw content)
+- `is_documentation_site_func`: Function to check if URL is a documentation site
+- `retry_count`: Number of retry attempts (default: 3)
+
+#### Retry Logic
+
+- Attempts crawling up to `retry_count` times
+- Uses **exponential backoff** between retries: `2^attempt` seconds
+ - 1st retry: wait 1 second
+ - 2nd retry: wait 2 seconds
+ - 3rd retry: wait 4 seconds
+- First attempt uses **cached content** (if available)
+- Subsequent attempts **bypass cache** for fresh content
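+
+In code, the retry loop boils down to something like the sketch below (`_build_config` and `_build_success_response` are illustrative helper names, not the actual method structure):
+
+```python
+import asyncio
+from crawl4ai import CacheMode
+
+for attempt in range(retry_count):
+    # First attempt may serve cached content; retries always fetch fresh
+    cache_mode = CacheMode.ENABLED if attempt == 0 else CacheMode.BYPASS
+    result = await self.crawler.arun(url=transformed_url, config=self._build_config(url, cache_mode))
+
+    if result.success and result.markdown and len(result.markdown) >= 50:
+        return self._build_success_response(result)
+
+    if attempt < retry_count - 1:
+        await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
+```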
+
+#### Configuration: Documentation Sites
+
+When `is_documentation_site_func(url)` returns `True`:
+
+```python
+CrawlerRunConfig(
+ cache_mode=cache_mode,
+ stream=True, # Enable streaming for parallel processing
+ markdown_generator=markdown_generator,
+ wait_for=wait_selector, # Framework-specific selector
+ wait_until='domcontentloaded', # Don't wait for full page load
+ page_timeout=30000, # 30 seconds
+ delay_before_return_html=0.5, # 500ms delay for JS rendering
+ wait_for_images=False, # Skip image loading
+ scan_full_page=True, # Trigger lazy loading
+ exclude_all_images=False, # Keep images in content
+ remove_overlay_elements=True, # Remove popups/modals
+ process_iframes=True # Extract iframe content
+)
+```
+
+#### Configuration: Regular Sites
+
+For non-documentation sites:
+
+```python
+CrawlerRunConfig(
+ cache_mode=cache_mode,
+ stream=True, # Enable streaming
+ markdown_generator=markdown_generator,
+ wait_until='domcontentloaded', # Faster than 'networkidle'
+ page_timeout=45000, # 45 seconds
+ delay_before_return_html=0.3, # 300ms delay
+ scan_full_page=True # Trigger lazy loading
+)
+```
+
+#### Content Validation
+
+The method validates crawled content before returning:
+
+1. **Success check**: `result.success` must be `True`
+2. **Content length**: Markdown must have at least 50 characters
+3. **Non-empty**: Markdown content must exist
+
+If validation fails, the method retries with exponential backoff.
+
+#### Debug Logging
+
+The method logs extensive debug information:
+
+- Markdown length and presence
+- Triple backtick count (code blocks)
+- Sample content for specific URLs (e.g., 'getting-started')
+- Configuration details (wait_until, page_timeout)
+
+#### Return Value
+
+**On Success:**
+```python
+{
+ "success": True,
+ "url": original_url, # Original URL (before transformation)
+ "markdown": result.markdown, # Extracted markdown content
+ "html": result.html, # Raw HTML (used for code extraction)
+ "title": result.title, # Page title or "Untitled"
+ "links": result.links, # Extracted links from page
+ "content_length": len(markdown) # Length of markdown content
+}
+```
+
+**On Failure:**
+```python
+{
+ "success": False,
+ "error": "Error message describing what went wrong"
+}
+```
+
+### 3. Markdown File Crawling
+
+```python
+async def crawl_markdown_file(
+ self,
+ url: str,
+ transform_url_func: Callable[[str], str],
+ progress_callback: Callable[..., Awaitable[None]] | None = None,
+ start_progress: int = 10,
+ end_progress: int = 20
+) -> list[dict[str, Any]]:
+```
+
+Handles direct crawling of `.txt` or `.md` files with progress reporting.
+
+#### Parameters
+
+- `url`: URL of the text/markdown file
+- `transform_url_func`: Function to transform URLs (e.g., GitHub URLs to raw content)
+- `progress_callback`: Optional callback for progress updates
+- `start_progress`: Starting progress percentage (default: 10)
+- `end_progress`: Ending progress percentage (default: 20)
+
+#### Features
+
+- **URL transformation**: Converts GitHub URLs to raw content URLs
+- **Progress reporting**: Reports progress at start and completion
+- **Simpler configuration**: No special wait conditions needed
+- **Single document**: Always returns a list with one document
+
+#### Configuration
+
+```python
+CrawlerRunConfig(
+ cache_mode=CacheMode.ENABLED,
+ stream=False # Streaming not needed for single files
+)
+```
+
+#### Progress Reporting
+
+The method reports progress via the optional callback:
+
+**Start:**
+```python
+progress_callback('crawling', start_progress,
+ f"Fetching text file: {url}",
+ total_pages=1,
+ processed_pages=0
+)
+```
+
+**Completion:**
+```python
+progress_callback('crawling', end_progress,
+ f"Text file crawled successfully: {original_url}",
+ total_pages=1,
+ processed_pages=1
+)
+```
+
+#### Return Value
+
+**On Success:**
+```python
+[{
+ 'url': original_url,
+ 'markdown': result.markdown,
+ 'html': result.html
+}]
+```
+
+**On Failure:**
+```python
+[] # Empty list
+```
+
+## Error Handling
+
+The strategy follows beta development guidelines with intelligent error handling:
+
+### Fail Fast Scenarios
+
+- **Missing crawler instance**: Returns error immediately without retrying
+- **Invalid configuration**: Raises exceptions for bad parameters
+
+### Complete with Detailed Errors
+
+- **Individual page failures**: Logs error, continues processing (in batch contexts)
+- **Timeout errors**: Catches `TimeoutError`, logs, and retries
+- **Crawl failures**: Logs detailed error messages with full context
+
+### Error Logging
+
+All errors are logged with:
+- Full stack traces (`traceback.format_exc()`)
+- URL being crawled
+- Attempt number
+- Specific error messages from Crawl4AI
+- Validation failure reasons
+
+### Error Messages
+
+Error messages include:
+- What operation was being attempted
+- Which URL failed
+- Why it failed (timeout, validation, exception)
+- Number of attempts made
+
+## Performance Optimizations
+
+The strategy includes several optimizations for speed and efficiency:
+
+1. **Streaming enabled**: Allows parallel processing of crawl results
+2. **Reduced delays**:
+ - Documentation sites: 0.5s (down from 2.0s)
+ - Regular sites: 0.3s (down from 1.0s)
+3. **Cache mode**: First attempts use cached content when available
+4. **domcontentloaded**: Doesn't wait for full page load (faster than 'networkidle')
+5. **Skip image waiting**: Documentation sites don't wait for images to load
+6. **Exponential backoff**: Prevents hammering failing servers
+
+## Usage Example
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+from python.src.server.services.crawling.strategies.single_page import SinglePageCrawlStrategy
+
+# Initialize dependencies
+crawler = AsyncWebCrawler()
+markdown_generator = DefaultMarkdownGenerator()
+
+# Create strategy
+strategy = SinglePageCrawlStrategy(crawler, markdown_generator)
+
+# Define helper functions
+def transform_url(url: str) -> str:
+ # Transform GitHub URLs to raw content
+ if 'github.com' in url and '/blob/' in url:
+ return url.replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/')
+ return url
+
+def is_documentation_site(url: str) -> bool:
+ doc_indicators = ['docs.', '/docs/', 'documentation', 'docusaurus', 'vitepress']
+ return any(indicator in url.lower() for indicator in doc_indicators)
+
+# Crawl a page
+result = await strategy.crawl_single_page(
+ url='https://docs.example.com/getting-started',
+ transform_url_func=transform_url,
+ is_documentation_site_func=is_documentation_site,
+ retry_count=3
+)
+
+if result['success']:
+ print(f"Successfully crawled: {result['title']}")
+ print(f"Content length: {result['content_length']} characters")
+ print(f"Found {len(result['links'])} links")
+else:
+ print(f"Failed to crawl: {result['error']}")
+```
+
+## Integration with Archon
+
+This strategy is used by Archon's crawling service to:
+
+1. **Crawl individual documentation pages** during sitemap/recursive crawls
+2. **Handle direct URL submissions** from users
+3. **Process text files** (llms.txt, README.md, etc.)
+4. **Extract code examples** from documentation sites
+
+The crawled content (markdown + HTML) is then:
+- Chunked into manageable pieces
+- Embedded using the embedding service
+- Stored in Supabase for RAG (Retrieval Augmented Generation)
+- Made searchable via the knowledge base
+
+## Related Files
+
+- `python/src/server/services/crawling/strategies/sitemap.py` - Sitemap crawling strategy
+- `python/src/server/services/crawling/strategies/recursive.py` - Recursive crawling strategy
+- `python/src/server/services/crawling/crawler_manager.py` - Crawler lifecycle management
+- `python/src/server/services/crawling/orchestrator.py` - Orchestrates all crawling strategies
+
diff --git a/docker-compose.unified.yml b/docker-compose.unified.yml
index e08238cf46..50d2572e68 100644
--- a/docker-compose.unified.yml
+++ b/docker-compose.unified.yml
@@ -169,6 +169,7 @@ services:
dockerfile: Dockerfile
target: development
container_name: archon-ui
+ restart: unless-stopped
ports:
- "${BIND_IP:-127.0.0.1}:${ARCHON_UI_PORT:-3737}:3737"
environment:
@@ -190,11 +191,11 @@ services:
- app-network
- proxy
healthcheck:
- test: ["CMD", "curl", "-f", "http://localhost:3737"]
+ test: ["CMD-SHELL", "curl -f http://localhost:3737 || exit 1"]
interval: 15s
- timeout: 15s
- retries: 5
- start_period: 60s
+ timeout: 10s
+ retries: 3
+ start_period: 90s
volumes:
# Development volumes for live editing
- ./archon-ui-main/src:/app/src
@@ -219,6 +220,7 @@ services:
VITE_API_URL: ${VITE_API_URL:-http://localhost:8181}
VITE_ARCHON_SERVER_PORT: ${VITE_ARCHON_SERVER_PORT:-}
container_name: archon-ui
+ restart: unless-stopped
ports:
- "${BIND_IP:-127.0.0.1}:${ARCHON_UI_PORT:-3737}:3737"
environment:
@@ -240,11 +242,11 @@ services:
- app-network
- proxy
healthcheck:
- test: ["CMD", "curl", "-f", "http://localhost:3737"]
+ test: ["CMD-SHELL", "curl -f http://localhost:3737 || exit 1"]
interval: 15s
- timeout: 15s
- retries: 5
- start_period: 60s
+ timeout: 10s
+ retries: 3
+ start_period: 90s
# No volumes in production
depends_on:
archon-server: