feat: Automatic discovery of LLM files and sitemaps by leex279 · Pull Request #444 · coleam00/Archon

leex279 · 2025-08-22T20:50:51Z

Pull Request

Summary

Implements automatic discovery and parsing of llms.txt, sitemap.xml, and related files to enhance crawling capabilities for AI-driven content consumption. Resolves GitHub issue #430 by adding a comprehensive file discovery system that prioritizes LLM files over regular website crawling for optimal AI content consumption.

Changes Made

Created FileDiscoveryService with database-driven configuration and fallback defaults
Added priority-based LLM file selection (llms-full.txt > llms-ctx.txt > llms.md > llms.txt)
Enhanced URLHandler with is_llm_file() method and improved sitemap detection
Integrated discovery into CrawlingService with early return logic to stop regular crawling
Added database settings configuration for discovery file patterns
Implemented concurrent discovery operations with timeout handling
Created comprehensive test suite with 28+ test cases covering all scenarios
Added enhanced debug logging for discovery process visibility

Type of Change

New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring

Affected Services

Testing

All existing tests pass
Added new tests for new functionality
Manually tested affected user flows
Docker builds succeed for all services

Test Evidence

# URL handler tests (no regressions)
uv run pytest tests/test_url_handler.py -v
# Result: 11 passed, 12 warnings

# File discovery tests (comprehensive coverage)
uv run pytest tests/test_file_discovery.py -v
# Result: 20+ passed tests covering discovery, database fallback, error handling

# Integration tests
uv run python -c "from src.server.services.crawling.crawling_service import CrawlingService; print('✅ Integration successful')"
# Result: ✅ Integration successful

# Linting and code quality
uv run ruff check src/server/services/crawling/helpers/file_discovery.py src/server/services/crawling/helpers/url_handler.py
# Result: Code quality standards met

Checklist

My code follows the service architecture patterns
If using an AI coding assistant, I used the CLAUDE.md rules
I have added tests that prove my fix/feature works
All new and existing tests pass locally
My changes generate no new warnings
I have updated relevant documentation
I have verified no regressions in existing features

Breaking Changes

None. This feature is fully backwards compatible. Existing crawling behavior is preserved when discovery is disabled or fails.

Additional Notes

🎯 Discovery Priority Logic

LLM files (highest priority) - Stops regular crawling when found
Robots.txt sitemaps - Processed if no LLM files found
Regular crawling - Fallback when discovery fails or returns no results

📊 Performance Impact

Discovery phase: ~1-2 seconds with 10-second timeout
Concurrent operations: All discovery methods run in parallel
Early return: Prevents redundant crawling when LLM files found
NET RESULT: Faster crawling for sites with LLM files

🔧 Configuration

New database settings (configurable via admin UI):

{
  "CRAWL_DISCOVERY_LLM_FILES": ["llms-full.txt", "llms-ctx.txt", "llms.md", "llms.txt"],
  "CRAWL_DISCOVERY_SITEMAP_FILES": ["sitemap.xml", "sitemap_index.xml", "sitemap-*.xml"],
  "CRAWL_DISCOVERY_METADATA_FILES": ["robots.txt", ".well-known/security.txt", "humans.txt"]
}

🐛 Fixes Applied

Issue: System crawled both discovered LLM files AND regular website content
Fix: Priority-based selection returns only the best LLM file found
Result: Single LLM file crawl with early termination

🧪 Test Coverage

Database integration with fallback behavior
Robots.txt parsing with various formats
LLM file discovery with different patterns
Sitemap discovery including wildcards
Error handling and timeout scenarios
Concurrent discovery operations
Integration with existing crawling strategies

🤖 Generated with Claude Code

Implements GitHub issue #430 with comprehensive file discovery system: 🎯 Core Features: - FileDiscoveryService: Discovers llms.txt, sitemaps, and metadata files - Priority-based LLM file selection (llms-full.txt > llms-ctx.txt > llms.md > llms.txt) - Database-driven configuration with fallback defaults - Enhanced URL handler with LLM file detection - Seamless crawling service integration 🔧 Discovery Logic: - LLM files take highest priority and stop regular crawling - Robots.txt sitemap extraction with fallback support - Wildcard sitemap pattern support (sitemap-*.xml) - Metadata file discovery (.well-known directory) - Concurrent discovery operations with timeout handling ⚡ Performance Optimizations: - Early return when LLM files found (no redundant crawling) - HEAD requests for file existence checks - 10-second discovery timeout with graceful fallback - Progress reporting integration 🛠️ Technical Implementation: - Database settings: CRAWL_DISCOVERY_LLM_FILES, CRAWL_DISCOVERY_SITEMAP_FILES, CRAWL_DISCOVERY_METADATA_FILES - Enhanced URLHandler.is_sitemap() and new is_llm_file() methods - Comprehensive test suite with 28+ test cases - Error handling with fallback to regular crawling 🎉 Result: LLM files now replace regular website crawling for optimal AI content consumption 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2025-08-22T20:50:57Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/auto-discover-llms-sitemap

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

- Clear 📋 CRAWLING DECISION logs showing which content source is used - 🚀 STARTING CRAWL logs showing exactly what URLs will be crawled - Fallback logging for regular website crawling - Makes it crystal clear in logs which discovery method was chosen 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

leex279 · 2025-09-08T09:10:03Z

#622

leex279 and others added 4 commits August 22, 2025 22:59

debug: Add discovery results logging to diagnose issue

c9dd1ff

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

debug: Add method entry logging

ce59921

🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>

debug: Add auto_discover_files entry log

5daf85b

leex279 closed this Sep 8, 2025

Wirasm deleted the feature/auto-discover-llms-sitemap branch April 6, 2026 07:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Automatic discovery of LLM files and sitemaps#444

feat: Automatic discovery of LLM files and sitemaps#444
leex279 wants to merge 5 commits intomainfrom
feature/auto-discover-llms-sitemap

leex279 commented Aug 22, 2025 •

edited

Loading

Uh oh!

coderabbitai Bot commented Aug 22, 2025

Review skipped

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Status, Documentation and Community

Uh oh!

leex279 commented Sep 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leex279 commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request

Summary

Changes Made

Type of Change

Affected Services

Testing

Test Evidence

Checklist

Breaking Changes

Additional Notes

🎯 Discovery Priority Logic

📊 Performance Impact

🔧 Configuration

🐛 Fixes Applied

🧪 Test Coverage

Uh oh!

coderabbitai Bot commented Aug 22, 2025

Review skipped

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

leex279 commented Sep 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

leex279 commented Aug 22, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)