feat: Advanced Web Crawling with Domain Configuration #548
Closed
Frontend:
- Add DocumentBrowser component with domain filtering and search
- Add advanced domain configuration UI to AddKnowledgeModal
- Add "Browse Documents" button to KnowledgeItemCard
- Support comma-separated domain/pattern input with badges
- Make modal scrollable and improve UX
Backend:
- Add CrawlConfig model with domain/pattern filtering options
- Implement domain filtering logic with fnmatch pattern matching
- Add /knowledge-items/{source_id}/chunks endpoint for chunk browsing
- Add /knowledge-items/crawl-v2 endpoint with domain filtering support
- Filter URLs during crawling based on allowed/excluded domains and patterns
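The backend filtering described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `CrawlConfig` field names, the `is_url_allowed` helper, the subdomain-matching rule, and the "exclusions win" ordering are all assumptions. It uses `fnmatchcase` rather than `fnmatch` so that matching stays case-sensitive on every platform.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class CrawlConfig:
    """Hypothetical shape; field names are assumptions, not the PR's model."""
    allowed_domains: list[str] = field(default_factory=list)
    excluded_domains: list[str] = field(default_factory=list)
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)


def _domain_matches(host: str, domain: str) -> bool:
    # Match the domain itself or any subdomain of it (an assumed rule).
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def is_url_allowed(url: str, config: CrawlConfig) -> bool:
    """Return True if the URL passes every configured filter."""
    host = urlparse(url).hostname or ""

    # Excluded domains are checked first, so exclusions win over inclusions.
    if any(_domain_matches(host, d) for d in config.excluded_domains):
        return False
    # An empty allow-list means "no domain restriction".
    if config.allowed_domains and not any(
        _domain_matches(host, d) for d in config.allowed_domains
    ):
        return False
    # Glob patterns are matched against the full URL; "*" also spans "/".
    if any(fnmatchcase(url, p) for p in config.exclude_patterns):
        return False
    if config.include_patterns and not any(
        fnmatchcase(url, p) for p in config.include_patterns
    ):
        return False
    return True
```

A crawler could call `is_url_allowed(link, config)` on each discovered link before queueing it, which is one way to filter URLs during crawling rather than after.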
Features:
- Whitelist specific domains to crawl
- Blacklist domains to exclude
- Include/exclude URL patterns using glob-style matching
- Browse and search document chunks with domain filtering
- Collapsible advanced configuration section
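Browsing and searching chunks with a domain filter could look roughly like the sketch below. The chunk shape (dicts with `url` and `content` keys) and the `filter_chunks` helper are hypothetical; the commit message does not say where the filtering runs or how chunks are represented.

```python
from urllib.parse import urlparse


def filter_chunks(chunks, domain=None, query=None):
    """Filter chunk dicts by exact host and case-insensitive text search.

    Assumes each chunk is a dict with 'url' and 'content' keys.
    """
    query_lc = query.lower() if query else None
    result = []
    # Treat None as an empty list so callers need no extra guard.
    for chunk in chunks or []:
        host = (urlparse(chunk.get("url", "")).hostname or "").lower()
        if domain and host != domain.lower():
            continue
        if query_lc and query_lc not in chunk.get("content", "").lower():
            continue
        result.append(chunk)
    return result
```

Passing neither `domain` nor `query` returns every chunk unchanged, so the same helper can back an unfiltered listing.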
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add optional chaining for domains array mapping
- Add safety checks for filteredChunks and chunks arrays
- Remove unsafe HTML content replacement to prevent XSS
- Ensure component handles empty/undefined data gracefully
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace HTML option children with options prop
- Select component expects {value, label} objects array
- Fixes "Cannot read properties of undefined (reading 'map')" error
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Change from 'documents' to 'archon_crawled_pages' table
- Fixes 500 error when fetching document chunks
- Aligns with existing database schema naming convention
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move DocumentBrowser from KnowledgeItemCard to KnowledgeBasePage level
- Add onBrowseDocuments callback prop to KnowledgeItemCard
- Fix modal rendering inside card container (z-index/stacking context issue)
- Now opens as full-screen modal like other modals (CodeViewer, EditModal)
- Fix table name in chunks API from 'documents' to 'archon_crawled_pages'
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add left sidebar with document list (like code examples list)
- Add right content area for selected document chunk
- Add click-to-select functionality for document chunks
- Auto-select first chunk when opening browser
- Match CodeViewer modal design pattern but in blue theme
- Show document preview in sidebar with domain badges
- Improve overall UX with familiar layout pattern
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add visible blue-themed scrollbars to document list sidebar
- Add matching scrollbars to document content area
- Include both Firefox (scrollbar-width/color) and WebKit scrollbar support
- Inject custom CSS for cross-browser scrollbar styling
- Improve visual feedback for scrollable content areas
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove redundant green "Browse" button
- Make orange document count badge clickable to open DocumentBrowser
- Update tooltip to indicate clickable behavior
- Add hover effect to document count badge
- Improve scrollbar visibility with overflow-y-scroll
- More intuitive UX: click the document count to browse documents
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add click outside modal to close (like CodeViewer)
- Add proper flex layout constraints with min-h-0 for scrolling
- Force scrollbars with overflow-y-scroll instead of auto
- Add flex-shrink-0 to header sections to prevent compression
- Ensure proper height calculations for scrollable containers
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove overflow-hidden and max-h constraints that blocked scrolling
- Simplify flex layout using h-full instead of min-h-0 conflicts
- Use overflow-y-auto pattern like CodeViewer (working reference)
- Remove custom scrollbar styling that interfered with functionality
- Follow code reviewer recommendations for proper height inheritance
Fixes: Scrollbars now functional for both document list and content areas
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: CSS height constraint conflict between flex-1 and h-full
- flex-1 = take remaining space after other flex items
- h-full = be 100% of parent height
- Together they create competing height calculations
Solution: Remove h-full from main content container
- Let flex-1 handle height calculation naturally
- Allows scrollable areas to establish proper heights
- Enables functional scrolling in both sidebar and content
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Use createPortal for proper modal rendering outside component tree
- Replace Card component with direct divs like CodeViewer
- Copy exact scrolling pattern: h-[85vh] + overflow-hidden + overflow-y-auto
- Use nested h-full + overflow-auto structure for content area
- Match CodeViewer styling but with blue theme instead of pink
- Add click-outside-to-close functionality
- Remove complex flex constraints that blocked scrolling
This replicates the proven working scrolling pattern from CodeViewerModal.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove all document upload functionality
- Remove DocumentBrowser component and integration
- Remove document chunks API endpoints
- Keep only advanced crawling with domain filtering
- Simplify modal to URL crawling with advanced config
- Focus on CrawlConfig with domain/pattern filtering
- Clean up imports and unused code
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
coleam00 added a commit that referenced this pull request on Apr 7, 2026
Add missing lastActivityUpdate.delete() calls in 3 executor exit paths where the module-level Map entry was not being cleaned up:
1. Between-step cancellation (cancel_detected)
2. Parallel block failure
3. Sequential step failure
Without these, cancelled or failed workflow run IDs accumulate in the Map forever, a slow memory leak in long-running server processes. The success and finally-block cleanup paths already had this call; these 3 early-return error paths were missed when the throttle was introduced in PR #553.
Closes #548
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
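The leak described in this commit message can be illustrated with a Python analogue. The referenced code uses a module-level Map and explicit `delete()` calls on each exit path; the sketch below is not that code. It shows an alternative in which a single `finally` block covers every exit path, and `run_workflow`, `is_cancelled`, and the returned status strings are hypothetical names chosen for illustration.

```python
import time

# Module-level throttle state, standing in for the Map named in the commit.
last_activity_update: dict[str, float] = {}


def run_workflow(run_id, steps, is_cancelled=lambda: False):
    """Run steps in order, always removing the run's throttle entry on exit."""
    try:
        for step in steps:
            # Early return on cancellation between steps.
            if is_cancelled():
                return "cancelled"
            last_activity_update[run_id] = time.monotonic()
            step()
        return "completed"
    except Exception:
        # A failing step also leaves through the finally block below.
        return "failed"
    finally:
        # Runs on success, cancellation, and failure alike, so no run ID
        # is left behind in the module-level dict.
        last_activity_update.pop(run_id, None)
```

Centralising the cleanup this way avoids having to remember a removal call in each early-return branch, which is the kind of omission the commit message describes.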
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request on Apr 16, 2026
…leam00#557)
Add missing lastActivityUpdate.delete() calls in 3 executor exit paths where the module-level Map entry was not being cleaned up:
1. Between-step cancellation (cancel_detected)
2. Parallel block failure
3. Sequential step failure
Without these, cancelled or failed workflow run IDs accumulate in the Map forever, a slow memory leak in long-running server processes. The success and finally-block cleanup paths already had this call; these 3 early-return error paths were missed when the throttle was introduced in PR coleam00#553.
Closes coleam00#548
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request on Apr 26, 2026
…leam00#557)
Add missing lastActivityUpdate.delete() calls in 3 executor exit paths where the module-level Map entry was not being cleaned up:
1. Between-step cancellation (cancel_detected)
2. Parallel block failure
3. Sequential step failure
Without these, cancelled or failed workflow run IDs accumulate in the Map forever, a slow memory leak in long-running server processes. The success and finally-block cleanup paths already had this call; these 3 early-return error paths were missed when the throttle was introduced in PR coleam00#553.
Closes coleam00#548
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
📋 Summary
This PR adds advanced web crawling with domain filtering and glob-style URL pattern matching, so users can control which domains and URLs are crawled.
Closes #546
✨ Features
Advanced Domain Configuration UI
Enhanced Crawling Engine
API Enhancements
🔧 Technical Implementation
Frontend Components
Backend Integration
🎯 Advanced Crawling Flow
🔍 Use Cases
Multi-domain Documentation Sites
Large Website Optimization
Pattern-based Filtering
🧪 Testing Features
🤖 Generated with Claude Code