feat: Advanced Web Crawling with Domain Configuration #548

Closed
leex279 wants to merge 13 commits into main from feature/advanced-crawling-domain-filtering

Conversation

Collaborator

@leex279 leex279 commented Aug 31, 2025

📋 Summary

This PR adds advanced web crawling with configurable domain filtering and URL pattern matching. Users can control which domains and URLs are crawled through a domain allowlist, a domain blocklist, and include/exclude URL patterns.

Closes #546

✨ Features

Advanced Domain Configuration UI

  • Collapsible Configuration Panel - Clean toggle interface for advanced options
  • Domain Allowlist - Specify exactly which domains to crawl
  • Domain Blocklist - Exclude specific domains from crawling scope
  • URL Pattern Matching - Include/exclude patterns with regex support
  • Real-time Validation - Validate domains and patterns before crawling
  • Dynamic Management - Add/remove rules with visual feedback

Enhanced Crawling Engine

  • CrawlConfig Integration - Structured filtering configuration
  • Server-side Filtering - Backend application of domain rules
  • Pattern Matching - Support for complex URL filtering logic
  • Performance Optimization - Skip filtered content during discovery
  • Progress Integration - Status updates for filtering operations

API Enhancements

  • Enhanced /api/knowledge-items/crawl-v2 - Accepts crawl_config parameter
  • CrawlConfig Interface - Type-safe domain filtering configuration
  • Validation Layer - Backend validation of filtering rules
  • Backwards Compatibility - Falls back to regular crawl endpoint when no config
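
To make the endpoint selection concrete, here is a minimal sketch of how a client might build the request. Only the `/api/knowledge-items/crawl-v2` path and the `crawl_config` parameter name come from this PR. The `CrawlConfig` field names, the `url` body key, and the regular endpoint path are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CrawlConfig:
    # Hypothetical field names: the PR text does not list the actual schema.
    allowed_domains: list[str] = field(default_factory=list)
    excluded_domains: list[str] = field(default_factory=list)
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)

    def is_empty(self) -> bool:
        """True when no filtering rule has been set."""
        return not (
            self.allowed_domains
            or self.excluded_domains
            or self.include_patterns
            or self.exclude_patterns
        )


CRAWL_V2_ENDPOINT = "/api/knowledge-items/crawl-v2"  # named in this PR
REGULAR_ENDPOINT = "/api/knowledge-items/crawl"  # assumed path, not confirmed


def build_crawl_request(
    url: str, config: Optional[CrawlConfig] = None
) -> tuple[str, dict]:
    """Return (endpoint, JSON body), falling back to the regular crawl
    endpoint when there is no filtering configuration."""
    if config is None or config.is_empty():
        return REGULAR_ENDPOINT, {"url": url}
    return CRAWL_V2_ENDPOINT, {"url": url, "crawl_config": asdict(config)}
```

Treating an all-empty config the same as a missing one keeps unfiltered crawls on the existing code path, which is one way to read the backwards-compatibility bullet above.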

🔧 Technical Implementation

Frontend Components

  • Advanced Configuration Panel - Collapsible domain filtering UI
  • Domain Input Management - Add/remove domains with validation
  • Pattern Configuration - Include/exclude pattern inputs with regex support
  • Configuration State - Persistent filtering rules during crawl session

Backend Integration

  • CrawlConfig Processing - Parse and validate domain filtering rules
  • Crawling Service Enhancement - Apply filters during URL discovery
  • Performance Optimization - Early filtering reduces crawling overhead
  • Progress Reporting - Status updates include filtering information

🎯 Advanced Crawling Flow

  1. URL Input - User enters website URL to crawl
  2. Configuration - Configure domain filters and patterns in advanced panel
  3. Validation - Backend validates URL and filtering rules
  4. Enhanced Crawling - Apply filters during URL discovery process
  5. Filtered Results - Only matching content gets processed and stored
  6. Progress Tracking - Real-time updates with filtering status
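
Steps 4 and 5 of the flow can be sketched as a breadth-first discovery loop that checks each link before it is queued. This is an illustration only, not the crawling service from this PR: `discover`, `links_for`, `docs_only`, and the in-memory `LINKS` graph are hypothetical stand-ins, and the sketch assumes the start URL is always crawled.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urlparse


def discover(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    is_allowed: Callable[[str], bool],
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first discovery that drops filtered URLs before queuing them."""
    queue = deque([start_url])
    seen = {start_url}
    crawled: List[str] = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)  # stand-in for "process and store"
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            if is_allowed(link):  # early filtering: rejected links are never fetched
                queue.append(link)
    return crawled


# Tiny in-memory link graph standing in for real pages (hypothetical data).
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/api",
        "https://community.example.com/forum",
    ],
    "https://docs.example.com/api": [
        "https://docs.example.com/",
        "https://ads.example.net/pixel",
    ],
}


def links_for(url: str) -> List[str]:
    return LINKS.get(url, [])


def docs_only(url: str) -> bool:
    return urlparse(url).hostname == "docs.example.com"
```

Calling `discover("https://docs.example.com/", links_for, docs_only)` visits only the two `docs.example.com` pages. Filtering at queue time, not after fetching, is what would save the crawling overhead described under Backend Integration.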

🔍 Use Cases

Multi-domain Documentation Sites

  • Crawl docs.example.com but exclude community.example.com
  • Include API reference pages but exclude marketing content
  • Focus on specific documentation sections
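
A domain rule such as "crawl docs.example.com but exclude community.example.com" could be checked along these lines. The commit notes for this PR mention `fnmatch`, so the sketch uses glob-style matching on the hostname. The function name, the rule that the blocklist beats the allowlist, and the rule that an empty allowlist permits every host are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Sequence
from urllib.parse import urlparse


def domain_allowed(
    url: str,
    allowed_domains: Sequence[str],
    excluded_domains: Sequence[str],
) -> bool:
    """Glob-match the URL's hostname against blocklist, then allowlist."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # no hostname (e.g. mailto: links), nothing to crawl
    # Assumption: an excluded domain always wins over an allowed one.
    if any(fnmatchcase(host, d.lower()) for d in excluded_domains):
        return False
    # Assumption: an empty allowlist means "no restriction".
    if allowed_domains:
        return any(fnmatchcase(host, d.lower()) for d in allowed_domains)
    return True
```

With an allowlist of `["*.example.com"]` and a blocklist of `["community.example.com"]`, `docs.example.com` passes and `community.example.com` is rejected. Note that `*.example.com` does not match the bare `example.com`, so a real configuration might need both entries.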

Large Website Optimization

  • Exclude known irrelevant domains (ads, tracking, CDNs)
  • Target specific subdirectories with pattern matching
  • Reduce crawling time and improve content quality

Pattern-based Filtering

  • Include only /docs/ and /api/ paths
  • Exclude /blog/ and /news/ sections
  • Target specific file types or page structures
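
The path rules above might look like this in practice. The feature list mentions regex support while the commit notes describe glob-style matching with `fnmatch`. This sketch follows the commit notes and matches globs against the URL path only. The function name and the precedence rules (exclude wins, empty include list allows everything) are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Sequence
from urllib.parse import urlparse


def path_allowed(
    url: str,
    include_patterns: Sequence[str],
    exclude_patterns: Sequence[str],
) -> bool:
    """Glob-match the URL path against exclude patterns, then include patterns."""
    path = urlparse(url).path or "/"
    # Assumption: exclude patterns take priority over include patterns.
    if any(fnmatchcase(path, p) for p in exclude_patterns):
        return False
    # Assumption: with no include patterns, every non-excluded path is allowed.
    if include_patterns:
        return any(fnmatchcase(path, p) for p in include_patterns)
    return True


INCLUDE = ["/docs/*", "/api/*"]
EXCLUDE = ["/blog/*", "/news/*"]
```

With these lists, `https://example.com/docs/intro` is kept, while `/blog/post` and `/pricing` are dropped. In `fnmatch`, `*` also matches `/`, so `/docs/*` covers nested paths such as `/docs/a/b.html`, but it does not match `/docs` without the trailing slash.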

🧪 Testing Features

  • ✅ Domain filtering UI works correctly
  • ✅ Pattern matching validates input
  • ✅ Backend applies filters during crawling
  • ✅ Configuration persists through crawl session
  • ✅ Enhanced crawl endpoint accepts crawl_config
  • ✅ Backwards compatibility with regular crawling
  • ✅ Performance improvement with filtering enabled
  • ✅ TypeScript compilation passes

🤖 Generated with Claude Code

leex279 and others added 13 commits August 30, 2025 21:03
Frontend:
- Add DocumentBrowser component with domain filtering and search
- Add advanced domain configuration UI to AddKnowledgeModal
- Add "Browse Documents" button to KnowledgeItemCard
- Support comma-separated domain/pattern input with badges
- Make modal scrollable and improve UX

Backend:
- Add CrawlConfig model with domain/pattern filtering options
- Implement domain filtering logic with fnmatch pattern matching
- Add /knowledge-items/{source_id}/chunks endpoint for chunk browsing
- Add /knowledge-items/crawl-v2 endpoint with domain filtering support
- Filter URLs during crawling based on allowed/excluded domains and patterns

Features:
- Whitelist specific domains to crawl
- Blacklist domains to exclude
- Include/exclude URL patterns using glob-style matching
- Browse and search document chunks with domain filtering
- Collapsible advanced configuration section

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add optional chaining for domains array mapping
- Add safety checks for filteredChunks and chunks arrays
- Remove unsafe HTML content replacement to prevent XSS
- Ensure component handles empty/undefined data gracefully

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Replace HTML option children with options prop
- Select component expects {value, label} objects array
- Fixes "Cannot read properties of undefined (reading 'map')" error

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Change from 'documents' to 'archon_crawled_pages' table
- Fixes 500 error when fetching document chunks
- Aligns with existing database schema naming convention

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move DocumentBrowser from KnowledgeItemCard to KnowledgeBasePage level
- Add onBrowseDocuments callback prop to KnowledgeItemCard
- Fix modal rendering inside card container (z-index/stacking context issue)
- Now opens as full-screen modal like other modals (CodeViewer, EditModal)
- Fix table name in chunks API from 'documents' to 'archon_crawled_pages'

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add left sidebar with document list (like code examples list)
- Add right content area for selected document chunk
- Add click-to-select functionality for document chunks
- Auto-select first chunk when opening browser
- Match CodeViewer modal design pattern but in blue theme
- Show document preview in sidebar with domain badges
- Improve overall UX with familiar layout pattern

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add visible blue-themed scrollbars to document list sidebar
- Add matching scrollbars to document content area
- Include both Firefox (scrollbar-width/color) and WebKit scrollbar support
- Inject custom CSS for cross-browser scrollbar styling
- Improve visual feedback for scrollable content areas

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove redundant green "Browse" button
- Make orange document count badge clickable to open DocumentBrowser
- Update tooltip to indicate clickable behavior
- Add hover effect to document count badge
- Improve scrollbar visibility with overflow-y-scroll
- More intuitive UX - click the document count to browse documents

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add click outside modal to close (like CodeViewer)
- Add proper flex layout constraints with min-h-0 for scrolling
- Force scrollbars with overflow-y-scroll instead of auto
- Add flex-shrink-0 to header sections to prevent compression
- Ensure proper height calculations for scrollable containers

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove overflow-hidden and max-h constraints that blocked scrolling
- Simplify flex layout using h-full instead of min-h-0 conflicts
- Use overflow-y-auto pattern like CodeViewer (working reference)
- Remove custom scrollbar styling that interfered with functionality
- Follow code reviewer recommendations for proper height inheritance

Fixes: Scrollbars now functional for both document list and content areas

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Root cause: CSS height constraint conflict between flex-1 and h-full
- flex-1 = take remaining space after other flex items
- h-full = be 100% of parent height
- Together they create competing height calculations

Solution: Remove h-full from main content container
- Let flex-1 handle height calculation naturally
- Allows scrollable areas to establish proper heights
- Enables functional scrolling in both sidebar and content

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Use createPortal for proper modal rendering outside component tree
- Replace Card component with direct divs like CodeViewer
- Copy exact scrolling pattern: h-[85vh] + overflow-hidden + overflow-y-auto
- Use nested h-full + overflow-auto structure for content area
- Match CodeViewer styling but with blue theme instead of pink
- Add click-outside-to-close functionality
- Remove complex flex constraints that blocked scrolling

This replicates the proven working scrolling pattern from CodeViewerModal.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove all document upload functionality
- Remove DocumentBrowser component and integration
- Remove document chunks API endpoints
- Keep only advanced crawling with domain filtering
- Simplify modal to URL crawling with advanced config
- Focus on CrawlConfig with domain/pattern filtering
- Clean up imports and unused code

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai Bot commented Aug 31, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@leex279 leex279 closed this Sep 21, 2025
@Wirasm Wirasm deleted the feature/advanced-crawling-domain-filtering branch April 6, 2026 07:37
coleam00 added a commit that referenced this pull request Apr 7, 2026
Add missing lastActivityUpdate.delete() calls in 3 executor exit paths
where the module-level Map entry was not being cleaned up:

1. Between-step cancellation (cancel_detected)
2. Parallel block failure
3. Sequential step failure

Without these, cancelled or failed workflow run IDs accumulate in the
Map forever — a slow memory leak in long-running server processes.

The success and finally-block cleanup paths already had this call;
these 3 early-return error paths were missed when the throttle was
introduced in PR #553.

Closes #548

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…leam00#557)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…leam00#557)
