feat: Advanced Web Crawling with Domain Configuration by leex279 · Pull Request #548 · coleam00/Archon

leex279 · 2025-08-31T20:41:53Z

📋 Summary

This PR adds advanced web crawling with configurable domain filtering and URL pattern matching. Users can control which domains and URLs are crawled through domain allowlists, domain blocklists, and include/exclude URL patterns.

Closes #546

✨ Features

Advanced Domain Configuration UI

Collapsible Configuration Panel - Clean toggle interface for advanced options
Domain Allowlist - Specify exactly which domains to crawl
Domain Blocklist - Exclude specific domains from crawling scope
URL Pattern Matching - Include/exclude patterns with regex support
Real-time Validation - Validate domains and patterns before crawling
Dynamic Management - Add/remove rules with visual feedback
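
As a rough illustration of the "Real-time Validation" item, here is a minimal sketch of how comma-separated domain input could be normalized and checked before a crawl starts. The function names and the hostname regex are assumptions made for this example; the PR does not show its validation code, and this sketch rejects single-label hosts such as `localhost`.

```python
import re
from urllib.parse import urlsplit

# Hostname check: dot-separated labels of letters, digits, and hyphens,
# no leading or trailing hyphen in a label, ending in an alphabetic TLD.
_DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)(?:(?!-)[a-z0-9-]{1,63}(?<!-)\.)+[a-z]{2,63}$"
)


def normalize_domain(value: str) -> str:
    """Reduce input such as 'https://Docs.Example.com/guide' to a bare host."""
    value = value.strip().lower()
    if "://" in value:
        return urlsplit(value).hostname or ""
    return value.split("/", 1)[0]


def validate_domains(raw: str) -> tuple[list[str], list[str]]:
    """Split comma-separated input into (valid, invalid) domain lists."""
    valid, invalid = [], []
    for item in raw.split(","):
        if not item.strip():
            continue  # ignore empty entries such as a trailing comma
        domain = normalize_domain(item)
        if _DOMAIN_RE.match(domain):
            valid.append(domain)
        else:
            invalid.append(domain or item.strip())
    return valid, invalid


valid, invalid = validate_domains(
    "docs.example.com, https://API.example.com/v1, bad domain, -x.com"
)
```

Returning the invalid entries separately lets a UI highlight exactly which inputs need correcting rather than rejecting the whole field.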

Enhanced Crawling Engine

CrawlConfig Integration - Structured filtering configuration
Server-side Filtering - Backend application of domain rules
Pattern Matching - Support for complex URL filtering logic
Performance Optimization - Skip filtered content during discovery
Progress Integration - Status updates for filtering operations

API Enhancements

New /api/knowledge-items/crawl-v2 endpoint - Accepts a crawl_config parameter
CrawlConfig Interface - Type-safe domain filtering configuration
Validation Layer - Backend validation of filtering rules
Backwards Compatibility - Falls back to the regular crawl endpoint when no crawl_config is provided
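
The request shape and the fallback behaviour described above might look roughly like the following sketch. Only the `/api/knowledge-items/crawl-v2` path and the `crawl_config` parameter name come from this PR; the `CrawlConfig` field names, the `build_crawl_request` helper, and the path of the regular endpoint are assumptions for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CrawlConfig:
    # Field names are illustrative assumptions, not the PR's actual schema.
    allowed_domains: list[str] = field(default_factory=list)
    excluded_domains: list[str] = field(default_factory=list)
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True when no filtering rule of any kind has been set."""
        return not any(asdict(self).values())


CRAWL_V2 = "/api/knowledge-items/crawl-v2"  # named in the PR description
CRAWL_V1 = "/api/knowledge-items/crawl"     # assumed path of the regular endpoint


def build_crawl_request(url: str, config: Optional[CrawlConfig] = None):
    """Return (endpoint, JSON payload), falling back when no rules are set."""
    if config is None or config.is_empty():
        return CRAWL_V1, {"url": url}
    return CRAWL_V2, {"url": url, "crawl_config": asdict(config)}


plain = build_crawl_request("https://docs.example.com")
filtered = build_crawl_request(
    "https://docs.example.com",
    CrawlConfig(allowed_domains=["docs.example.com"]),
)
```

Treating an all-empty config the same as a missing one keeps existing callers on the regular endpoint unless they opt in to filtering.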

🔧 Technical Implementation

Frontend Components

Advanced Configuration Panel - Collapsible domain filtering UI
Domain Input Management - Add/remove domains with validation
Pattern Configuration - Include/exclude pattern inputs with regex support
Configuration State - Persistent filtering rules during crawl session

Backend Integration

CrawlConfig Processing - Parse and validate domain filtering rules
Crawling Service Enhancement - Apply filters during URL discovery
Performance Optimization - Early filtering reduces crawling overhead
Progress Reporting - Status updates include filtering information

🎯 Advanced Crawling Flow

URL Input - User enters website URL to crawl
Configuration - Configure domain filters and patterns in advanced panel
Validation - Backend validates URL and filtering rules
Enhanced Crawling - Apply filters during URL discovery process
Filtered Results - Only matching content gets processed and stored
Progress Tracking - Real-time updates with filtering status
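
The flow above applies filters while URLs are being discovered, so rejected links are never fetched. A minimal sketch of that idea, using a breadth-first traversal over a toy in-memory link graph, is shown below. The `discover` function, its parameters, and the sample URLs are hypothetical; the actual crawling service is not shown in this PR.

```python
from collections import deque
from typing import Callable, Iterable


def discover(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    allow: Callable[[str], bool],
) -> tuple[list[str], int]:
    """Breadth-first discovery that drops rejected URLs before queueing them.

    Returns (URLs visited in order, number of links skipped by the filter).
    The start URL itself is always visited.
    """
    seen = {start}
    queue = deque([start])
    visited, skipped = [], 0
    while queue:
        url = queue.popleft()
        visited.append(url)  # stand-in for fetch + process + store
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            if not allow(link):
                skipped += 1  # never queued, so never fetched
                continue
            queue.append(link)
    return visited, skipped


# Toy link graph standing in for real page fetches.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/api/",
        "https://community.example.com/",
        "https://docs.example.com/blog/",
    ],
    "https://docs.example.com/api/": [
        "https://docs.example.com/",
        "https://cdn.example.net/app.js",
    ],
}

visited, skipped = discover(
    "https://docs.example.com/",
    lambda u: LINKS.get(u, []),
    lambda u: u.startswith("https://docs.example.com/") and "/blog/" not in u,
)
```

The skipped count is the kind of figure a progress update could report alongside the number of pages processed.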

🔍 Use Cases

Multi-domain Documentation Sites

Crawl docs.example.com but exclude community.example.com
Include API reference pages but exclude marketing content
Focus on specific documentation sections
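
The docs.example.com versus community.example.com case can be sketched as a host check with an allowlist and a blocklist. This example assumes that a listed domain also covers its subdomains and that the blocklist takes precedence over the allowlist; the PR does not state either rule, and `domain_allowed` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def _matches(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it."""
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def domain_allowed(url: str, allowed: list[str], excluded: list[str]) -> bool:
    """Apply the blocklist first, then the allowlist if one is given."""
    host = (urlsplit(url).hostname or "").lower()
    if any(_matches(host, d) for d in excluded):
        return False
    if allowed:
        return any(_matches(host, d) for d in allowed)
    return True  # no allowlist means every non-blocked domain passes


allowed = ["example.com"]
excluded = ["community.example.com"]
docs_ok = domain_allowed("https://docs.example.com/guide", allowed, excluded)
community_ok = domain_allowed("https://community.example.com/t/1", allowed, excluded)
```

Matching on a leading dot (`.example.com`) rather than a plain suffix avoids accepting look-alike hosts such as `notexample.com`.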

Large Website Optimization

Exclude known irrelevant domains (ads, tracking, CDNs)
Target specific subdirectories with pattern matching
Reduce crawling time and improve content quality

Pattern-based Filtering

Include only /docs/ and /api/ paths
Exclude /blog/ and /news/ sections
Target specific file types or page structures
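
A sketch of the path rules listed above, using glob-style matching from Python's standard `fnmatch` module (the commit history mentions fnmatch, while the description mentions regex, so the real matching semantics may differ). The `path_allowed` helper and the specific patterns are assumptions for illustration. Note that in `fnmatch` a `*` also matches `/`.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def path_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Exclude patterns win; if include patterns exist, one must match."""
    path = urlsplit(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)
    return True


include = ["/docs/*", "/api/*"]
exclude = ["/blog/*", "/news/*", "*.pdf"]

results = {
    url: path_allowed(url, include, exclude)
    for url in (
        "https://example.com/docs/intro",
        "https://example.com/blog/2025/post",
        "https://example.com/pricing",
        "https://example.com/docs/manual.pdf",
    )
}
```

Here `/docs/intro` passes, `/blog/2025/post` and `/docs/manual.pdf` are rejected by exclude patterns, and `/pricing` is rejected because it matches no include pattern.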

🧪 Testing Features

✅ Domain filtering UI works correctly
✅ Pattern matching validates input
✅ Backend applies filters during crawling
✅ Configuration persists through crawl session
✅ Enhanced crawl endpoint accepts crawl_config
✅ Backwards compatibility with regular crawling
✅ Performance improvement with filtering enabled
✅ TypeScript compilation passes

🤖 Generated with Claude Code

Frontend:
- Add DocumentBrowser component with domain filtering and search
- Add advanced domain configuration UI to AddKnowledgeModal
- Add "Browse Documents" button to KnowledgeItemCard
- Support comma-separated domain/pattern input with badges
- Make modal scrollable and improve UX

Backend:
- Add CrawlConfig model with domain/pattern filtering options
- Implement domain filtering logic with fnmatch pattern matching
- Add /knowledge-items/{source_id}/chunks endpoint for chunk browsing
- Add /knowledge-items/crawl-v2 endpoint with domain filtering support
- Filter URLs during crawling based on allowed/excluded domains and patterns

Features:
- Whitelist specific domains to crawl
- Blacklist domains to exclude
- Include/exclude URL patterns using glob-style matching
- Browse and search document chunks with domain filtering
- Collapsible advanced configuration section

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add optional chaining for domains array mapping
- Add safety checks for filteredChunks and chunks arrays
- Remove unsafe HTML content replacement to prevent XSS
- Ensure component handles empty/undefined data gracefully

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Replace HTML option children with options prop
- Select component expects {value, label} objects array
- Fixes "Cannot read properties of undefined (reading 'map')" error

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Change from 'documents' to 'archon_crawled_pages' table
- Fixes 500 error when fetching document chunks
- Aligns with existing database schema naming convention

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Move DocumentBrowser from KnowledgeItemCard to KnowledgeBasePage level
- Add onBrowseDocuments callback prop to KnowledgeItemCard
- Fix modal rendering inside card container (z-index/stacking context issue)
- Now opens as full-screen modal like other modals (CodeViewer, EditModal)
- Fix table name in chunks API from 'documents' to 'archon_crawled_pages'

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add left sidebar with document list (like code examples list)
- Add right content area for selected document chunk
- Add click-to-select functionality for document chunks
- Auto-select first chunk when opening browser
- Match CodeViewer modal design pattern but in blue theme
- Show document preview in sidebar with domain badges
- Improve overall UX with familiar layout pattern

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add visible blue-themed scrollbars to document list sidebar
- Add matching scrollbars to document content area
- Include both Firefox (scrollbar-width/color) and WebKit scrollbar support
- Inject custom CSS for cross-browser scrollbar styling
- Improve visual feedback for scrollable content areas

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Remove redundant green "Browse" button
- Make orange document count badge clickable to open DocumentBrowser
- Update tooltip to indicate clickable behavior
- Add hover effect to document count badge
- Improve scrollbar visibility with overflow-y-scroll
- More intuitive UX: click the document count to browse documents

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Add click outside modal to close (like CodeViewer)
- Add proper flex layout constraints with min-h-0 for scrolling
- Force scrollbars with overflow-y-scroll instead of auto
- Add flex-shrink-0 to header sections to prevent compression
- Ensure proper height calculations for scrollable containers

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Remove overflow-hidden and max-h constraints that blocked scrolling
- Simplify flex layout using h-full instead of min-h-0 conflicts
- Use overflow-y-auto pattern like CodeViewer (working reference)
- Remove custom scrollbar styling that interfered with functionality
- Follow code reviewer recommendations for proper height inheritance

Fixes: Scrollbars now functional for both document list and content areas

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

Root cause: CSS height constraint conflict between flex-1 and h-full
- flex-1 = take remaining space after other flex items
- h-full = be 100% of parent height
- Together they create competing height calculations

Solution: Remove h-full from main content container
- Let flex-1 handle height calculation naturally
- Allows scrollable areas to establish proper heights
- Enables functional scrolling in both sidebar and content

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Use createPortal for proper modal rendering outside component tree
- Replace Card component with direct divs like CodeViewer
- Copy exact scrolling pattern: h-[85vh] + overflow-hidden + overflow-y-auto
- Use nested h-full + overflow-auto structure for content area
- Match CodeViewer styling but with blue theme instead of pink
- Add click-outside-to-close functionality
- Remove complex flex constraints that blocked scrolling

This replicates the proven working scrolling pattern from CodeViewerModal.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Remove all document upload functionality
- Remove DocumentBrowser component and integration
- Remove document chunks API endpoints
- Keep only advanced crawling with domain filtering
- Simplify modal to URL crawling with advanced config
- Focus on CrawlConfig with domain/pattern filtering
- Clean up imports and unused code

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2025-08-31T20:41:59Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Add missing lastActivityUpdate.delete() calls in 3 executor exit paths where the module-level Map entry was not being cleaned up:
1. Between-step cancellation (cancel_detected)
2. Parallel block failure
3. Sequential step failure

Without these, cancelled or failed workflow run IDs accumulate in the Map forever, a slow memory leak in long-running server processes. The success and finally-block cleanup paths already had this call; these 3 early-return error paths were missed when the throttle was introduced in PR #553.

Closes #548
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


leex279 and others added 13 commits August 30, 2025 21:03

leex279 closed this Sep 21, 2025

Wirasm deleted the feature/advanced-crawling-domain-filtering branch April 6, 2026 07:37

leex279 wants to merge 13 commits into main from feature/advanced-crawling-domain-filtering
