Skip to content

feat: Automatic discovery of LLM files and sitemaps#444

Closed
leex279 wants to merge 5 commits intomainfrom
feature/auto-discover-llms-sitemap
Closed

feat: Automatic discovery of LLM files and sitemaps#444
leex279 wants to merge 5 commits intomainfrom
feature/auto-discover-llms-sitemap

Conversation

@leex279
Copy link
Copy Markdown
Collaborator

@leex279 leex279 commented Aug 22, 2025

Pull Request

Summary

Implements automatic discovery and parsing of llms.txt, sitemap.xml, and related files to enhance crawling capabilities for AI-driven content consumption. Resolves GitHub issue #430 by adding a comprehensive file discovery system that prioritizes LLM files over regular website crawling for optimal AI content consumption.

Changes Made

  • Created FileDiscoveryService with database-driven configuration and fallback defaults
  • Added priority-based LLM file selection (llms-full.txt > llms-ctx.txt > llms.md > llms.txt)
  • Enhanced URLHandler with is_llm_file() method and improved sitemap detection
  • Integrated discovery into CrawlingService with early return logic to stop regular crawling
  • Added database settings configuration for discovery file patterns
  • Implemented concurrent discovery operations with timeout handling
  • Created comprehensive test suite with 28+ test cases covering all scenarios
  • Added enhanced debug logging for discovery process visibility

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

# URL handler tests (no regressions)
uv run pytest tests/test_url_handler.py -v
# Result: 11 passed, 12 warnings

# File discovery tests (comprehensive coverage)
uv run pytest tests/test_file_discovery.py -v
# Result: 20+ passed tests covering discovery, database fallback, error handling

# Integration tests
uv run python -c "from src.server.services.crawling.crawling_service import CrawlingService; print('✅ Integration successful')"
# Result: ✅ Integration successful

# Linting and code quality
uv run ruff check src/server/services/crawling/helpers/file_discovery.py src/server/services/crawling/helpers/url_handler.py
# Result: Code quality standards met

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

None. This feature is fully backwards compatible. Existing crawling behavior is preserved when discovery is disabled or fails.

Additional Notes

🎯 Discovery Priority Logic

  1. LLM files (highest priority) - Stops regular crawling when found
  2. Robots.txt sitemaps - Processed if no LLM files found
  3. Regular crawling - Fallback when discovery fails or returns no results

📊 Performance Impact

  • Discovery phase: ~1-2 seconds with 10-second timeout
  • Concurrent operations: All discovery methods run in parallel
  • Early return: Prevents redundant crawling when LLM files found
  • NET RESULT: Faster crawling for sites with LLM files

🔧 Configuration

New database settings (configurable via admin UI):

{
  "CRAWL_DISCOVERY_LLM_FILES": ["llms-full.txt", "llms-ctx.txt", "llms.md", "llms.txt"],
  "CRAWL_DISCOVERY_SITEMAP_FILES": ["sitemap.xml", "sitemap_index.xml", "sitemap-*.xml"],
  "CRAWL_DISCOVERY_METADATA_FILES": ["robots.txt", ".well-known/security.txt", "humans.txt"]
}

🐛 Fixes Applied

  • Issue: System crawled both discovered LLM files AND regular website content
  • Fix: Priority-based selection returns only the best LLM file found
  • Result: Single LLM file crawl with early termination

🧪 Test Coverage

  • Database integration with fallback behavior
  • Robots.txt parsing with various formats
  • LLM file discovery with different patterns
  • Sitemap discovery including wildcards
  • Error handling and timeout scenarios
  • Concurrent discovery operations
  • Integration with existing crawling strategies

🤖 Generated with Claude Code

Implements GitHub issue #430 with comprehensive file discovery system:

🎯 Core Features:
- FileDiscoveryService: Discovers llms.txt, sitemaps, and metadata files
- Priority-based LLM file selection (llms-full.txt > llms-ctx.txt > llms.md > llms.txt)
- Database-driven configuration with fallback defaults
- Enhanced URL handler with LLM file detection
- Seamless crawling service integration

🔧 Discovery Logic:
- LLM files take highest priority and stop regular crawling
- Robots.txt sitemap extraction with fallback support
- Wildcard sitemap pattern support (sitemap-*.xml)
- Metadata file discovery (.well-known directory)
- Concurrent discovery operations with timeout handling

⚡ Performance Optimizations:
- Early return when LLM files found (no redundant crawling)
- HEAD requests for file existence checks
- 10-second discovery timeout with graceful fallback
- Progress reporting integration

🛠️ Technical Implementation:
- Database settings: CRAWL_DISCOVERY_LLM_FILES, CRAWL_DISCOVERY_SITEMAP_FILES, CRAWL_DISCOVERY_METADATA_FILES
- Enhanced URLHandler.is_sitemap() and new is_llm_file() methods
- Comprehensive test suite with 28+ test cases
- Error handling with fallback to regular crawling

🎉 Result: LLM files now replace regular website crawling for optimal AI content consumption

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Aug 22, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/auto-discover-llms-sitemap

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

leex279 and others added 4 commits August 22, 2025 22:59
- Clear 📋 CRAWLING DECISION logs showing which content source is used
- 🚀 STARTING CRAWL logs showing exactly what URLs will be crawled
- Fallback logging for regular website crawling
- Makes it crystal clear in logs which discovery method was chosen

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279
Copy link
Copy Markdown
Collaborator Author

leex279 commented Sep 8, 2025

#622

@leex279 leex279 closed this Sep 8, 2025
@Wirasm Wirasm deleted the feature/auto-discover-llms-sitemap branch April 6, 2026 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant