Skip to content

Implement robots.txt compliance for web crawler (Issue #275)#846

Closed
leex279 wants to merge 3 commits intomainfrom
feature/respect-robots-txt-275
Closed

Implement robots.txt compliance for web crawler (Issue #275)#846
leex279 wants to merge 3 commits intomainfrom
feature/respect-robots-txt-275

Conversation

@leex279
Copy link
Copy Markdown
Collaborator

@leex279 leex279 commented Nov 7, 2025

Summary

Implements robots.txt URL validation for Archon's web crawler to respect website Allow/Disallow rules.

  • RobotsChecker service - Fetches, parses, and caches robots.txt using Protego library
  • Proper bot identification - User-Agent: "Archon-Crawler/0.1.0 (+repo_url)"
  • URL validation - Validates URLs against Allow/Disallow rules at 4 critical points
  • RFC 9309 compliant - Follows official specification (404 = allow, 5xx = disallow)
  • Production logging - Clear INFO-level messages showing compliance

Verified Working ✅

Screenshot from live amazon.com crawl showing robots.txt validation:

  • INFO | Respecting robots.txt for https://www.amazon.com (cached for 24h)
  • INFO | URL blocked by robots.txt: https://www.amazon.com/gp/cart/...
  • INFO | Skipped (robots.txt): https://www.amazon.com/gp/cart/...

Multiple URLs correctly blocked based on Amazon's robots.txt Allow/Disallow rules.

What This PR Does ✅

  • URL Validation - Respects robots.txt Allow/Disallow rules
  • Proper Bot ID - User-Agent identifies Archon instead of spoofing Chrome
  • 24-hour caching - One robots.txt fetch per domain per day
  • Error handling - 404=allow, 5xx=disallow per RFC 9309
  • Thread-safe - Concurrent crawls properly deduplicated

What This PR Does NOT Do ❌

  • Crawl-delay enforcement NOT implemented
    • wait_if_needed_for_url() method exists but is not called
    • Enforcing 10s delays during validation would cause 540s for 54 sitemap URLs
    • Crawl4AI's arun_many() API has no per-URL hooks for delay enforcement
    • Crawl4AI's internal rate limiting and concurrency controls handle request pacing

Rationale: This PR focuses on ethical URL validation (respecting site owner's Allow/Disallow rules). Crawl delays are a performance optimization that Crawl4AI already handles via concurrency limits and internal rate limiting. Adding explicit delays would severely degrade performance without significant benefit.

Implementation

Core Service (robots_checker.py ~400 lines):

  • Async robots.txt fetching with shared httpx.AsyncClient singleton
  • Protego parser (40% faster than stdlib, wildcard support)
  • Manual TTL caching (24-hour default) with LRU eviction
  • Thread-safe with separate locks for cache and delay tracking
  • Public API: can_fetch(url), wait_if_needed_for_url(url) (available but not auto-called)

Integration Points (4 total):

  1. crawling_service.py - llms.txt linked files validation
  2. crawling_service.py - Extracted links validation
  3. crawling_service.py - Sitemap URLs validation
  4. recursive.py - URL queue validation (when adding discovered links)

Configuration (config.py):

  • get_crawler_config() with validation for numeric env vars
  • Raises ConfigurationError for invalid values (e.g., negative delays, zero cache size)
  • Type-safe returns (float for delay, int for sizes/TTLs)

User-Agent Update (crawler_manager.py):

  • Changed from Chrome spoofing to "Archon-Crawler/0.1.0 (+repo_url)"

Resource Management (knowledge_api.py):

  • Proper close() calls in finally blocks (currently no-op with shared client)

Discovery (discovery_service.py):

  • Removed robots.txt from discovery file list (used for validation, not content)

Dependencies

Added:

  • protego>=0.3.1 - Fast, RFC 9309 compliant robots.txt parser with wildcard support

Updated:

  • crawl4ai: 0.7.4 -> 0.7.6 - Latest bug fixes (unrelated to robots.txt, just keeping dependencies current)

Not Added:

  • asyncache (unmaintained, cachetools v6 compatibility risks)
  • cachetools (not needed with manual caching implementation)

Configuration

Environment variables (all optional with validated defaults):

ROBOTS_RESPECT=true                 # Enable/disable (default: true)
CRAWLER_USER_AGENT=...              # Custom User-Agent string
ROBOTS_DEFAULT_CRAWL_DELAY=10.0     # Default delay (available but not enforced, min: 0.0)
ROBOTS_CACHE_SIZE=1000              # Max domains to cache (min: 1)
ROBOTS_CACHE_TTL=86400              # Cache duration in seconds (min: 1, recommended max: 86400)

Testing

Validated locally and in production:

  • ✅ 21 comprehensive unit tests (all passing)
  • ✅ Zero regressions in existing test suite
  • ✅ Live amazon.com crawl shows proper URL validation
  • ✅ URLs blocked based on robots.txt Allow/Disallow rules
  • ✅ Clear logging of compliance activity
  • ✅ Config validation with proper error messages

Performance

  • First request per domain: ~100-200ms (robots.txt fetch)
  • Subsequent requests: <1ms (cache lookup)
  • Memory: ~5KB per domain (~5MB for 1000 cached domains)
  • Fast URL validation - No delays during filtering (prevents performance degradation)

Architecture

  • Shared httpx.AsyncClient singleton prevents connection leaks
  • Manual async caching with lock-based deduplication
  • Fail-open pattern for reliability (allow on errors)
  • Clean separation: URL validation (this PR) vs rate limiting (Crawl4AI)

Future Work

If crawl-delay enforcement becomes a requirement:

  1. Add delays in single_page.py strategy (straightforward)
  2. For batch/recursive: Would require either:
    • Crawl4AI API changes to support per-URL delay hooks
    • Custom HTTP client wrapper replacing Crawl4AI's arun_many
    • Token bucket rate limiter at higher level

Closes #275

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Implemented robots.txt compliance checking with per-domain caching and configurable crawl delays to respect website policies
    • Added configurable crawler settings for User-Agent generation and robots.txt respect behavior
  • Chores

    • Updated crawl4ai dependency to 0.7.6 and added protego support

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 7, 2025

Walkthrough

Adds robots.txt compliance: new RobotsChecker module (fetch/parse/cache/delay via Protego), integrates robots checking and per-domain delays into CrawlingService and recursive strategy, exposes crawler config via get_crawler_config(), updates dependencies (Protego, crawl4ai), and ensures crawler services are closed during request cleanup.

Changes

Cohort / File(s) Summary
Dependency manifest
python/pyproject.toml
Bump crawl4ai 0.7.4 → 0.7.6; add protego>=0.3.1.
Configuration
python/src/server/config/config.py
Add get_crawler_config() -> dict reading crawler env vars and returning user_agent, respect_robots, default_crawl_delay, robots_cache_size, robots_cache_ttl.
Crawler init
python/src/server/services/crawler_manager.py
Use get_crawler_config() and replace hardcoded BrowserConfig user-agent with configured crawler_config["user_agent"].
Robots enforcement engine
python/src/server/services/crawling/robots_checker.py
New RobotsChecker class: async fetch/parse robots.txt (Protego), per-domain TTL cache with eviction, per-domain locks, can_fetch(), get_crawl_delay(), wait_if_needed(), clear_cache(), close(), and RFC9309-aligned error semantics.
Crawling integration
python/src/server/services/crawling/crawling_service.py
Initialize optional self.robots_checker; add async _can_fetch_url(url); filter discovered URLs (llms.txt, extracted links, sitemaps, discovery flows) via robots checks and log/count filtered items; call wait_if_needed() before fetches; add close() method.
Discovery priority
python/src/server/services/crawling/discovery_service.py
Removed "robots.txt" from DiscoveryService.DISCOVERY_PRIORITY.
Recursive strategy
python/src/server/services/crawling/strategies/recursive.py
Add optional robots_checker parameter to crawl_recursive_with_progress and skip URLs disallowed by robots checker.
Resource cleanup
python/src/server/api_routes/knowledge_api.py
Await crawl_service.close() and orchestration_service.close() in finally blocks of async handlers to release resources.

Sequence Diagram(s)

sequenceDiagram
    participant CS as CrawlingService
    participant RC as RobotsChecker
    participant HTTP as HTTP Client
    participant Parser as Protego Parser
    note right of CS `#ddeeff`: Discovery (llms.txt, links, sitemap)

    CS->>RC: can_fetch(url)
    alt cache hit
        RC-->>CS: allowed / disallowed
    else cache miss
        RC->>HTTP: GET domain/robots.txt
        alt 200
            HTTP-->>RC: content
            RC->>Parser: parse(content)
            Parser-->>RC: rules
            RC-->>CS: allowed / disallowed
        else 404
            HTTP-->>RC: 404
            RC-->>CS: allowed (allow-all)
        else 5xx / timeout
            HTTP-->>RC: error
            RC-->>CS: disallowed (conservative)
        end
    end

    alt allowed
        CS->>RC: wait_if_needed(domain)
        RC-->>CS: await delay (async)
        CS->>CS: enqueue/fetch URL
    else disallowed
        CS-->>CS: drop URL, log/count filtered
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus review on:
    • python/src/server/services/crawling/robots_checker.py — async concurrency (per-domain locks), TTL eviction, HTTP error semantics, and shared httpx client lifecycle.
    • python/src/server/services/crawling/crawling_service.py and strategies/recursive.py — consistent application of robots checks, counters/logging, and correct use of wait_if_needed() before fetches.
    • Resource cleanup paths in python/src/server/api_routes/knowledge_api.py to ensure close() is always awaited and no client leaks occur.

Possibly related PRs

Suggested reviewers

  • coleam00

Poem

🐰 I nibble rules from robots.txt and hum,
I cache and wait so crawling stays quite calm,
Protego guides my courteous little hops,
Domains breathe easy as no rude bops,
A polite rabbit on the crawling run 🥕

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main implementation: adding robots.txt compliance for the web crawler, matching the core objective of respecting Allow/Disallow rules.
Description check ✅ Passed The PR description comprehensively covers summary, changes, type (new feature), affected services (Server), testing evidence, and implementation details. All major template sections are addressed with substantive content.
Linked Issues check ✅ Passed The PR fully addresses #275 requirements: implements robots.txt validation before crawling, replaces User-Agent spoofing with proper bot identification, validates URLs against Allow/Disallow rules at multiple integration points, and provides RFC 9309 compliant error handling (404=allow, 5xx=disallow).
Out of Scope Changes check ✅ Passed All changes directly support robots.txt compliance: RobotsChecker service, config validation for crawler settings, User-Agent updates, integration points for validation, dependency updates for robots.txt parsing (protego), and resource cleanup. No extraneous modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 96.30% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/respect-robots-txt-275

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 368cafb and ec6d8f7.

📒 Files selected for processing (1)
  • python/src/server/config/config.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/src/server/config/config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@leex279 leex279 force-pushed the feature/respect-robots-txt-275 branch 2 times, most recently from b9e1ffe to 14de810 Compare November 7, 2025 21:36
@leex279
Copy link
Copy Markdown
Collaborator Author

leex279 commented Nov 7, 2025

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 7, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@leex279 leex279 self-assigned this Nov 7, 2025
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)

876-914: Initial URL still bypasses robots.txt

We only filter derived links, so the very first URL handed to _crawl_by_url_type (including discovered targets) is still fetched even when robots.txt says “Disallow”. That defeats the primary objective of this PR because we hit disallowed pages before any filtering happens. Please short-circuit the crawl when the entrypoint is blocked.

 async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple:
     """
     Detect URL type and perform appropriate crawling.
 
     Returns:
         Tuple of (crawl_results, crawl_type)
     """
     crawl_results = []
     crawl_type = None
+
+    if self.robots_checker and not await self._can_fetch_url(url):
+        safe_logfire_info(f"Skipped (robots.txt): {url}")
+        return [], None
🧹 Nitpick comments (1)
python/pyproject.toml (1)

43-45: Consider pinning asyncache to a known stable version.

While protego>=0.3.1 and cachetools>=5.3.0 are well-maintained libraries with safe version constraints, asyncache>=0.3.1 is a less common dependency. If the feature was tested and validated against a specific asyncache version (e.g., during development), consider pinning it to that version (e.g., asyncache>=0.3.1,<0.4.0) to reduce drift and ensure reproducibility across deployments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33f1db3 and 14de810.

⛔ Files ignored due to path filters (1)
  • python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • python/pyproject.toml (2 hunks)
  • python/src/server/config/config.py (1 hunks)
  • python/src/server/services/crawler_manager.py (2 hunks)
  • python/src/server/services/crawling/crawling_service.py (7 hunks)
  • python/src/server/services/crawling/robots_checker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

Files:

  • python/src/server/services/crawler_manager.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/robots_checker.py
  • python/src/server/config/config.py
🧠 Learnings (1)
📚 Learning: 2025-08-30T16:33:09.452Z
Learnt from: coleam00
Repo: coleam00/Archon PR: 533
File: python/pyproject.toml:24-0
Timestamp: 2025-08-30T16:33:09.452Z
Learning: Crawl4AI includes Playwright as a dependency and automatically sets it up via the crawl4ai-setup command, so there's no need to explicitly add playwright to dependency groups when crawl4ai is already included.

Applied to files:

  • python/src/server/services/crawler_manager.py
  • python/pyproject.toml
🔇 Additional comments (1)
python/pyproject.toml (1)

125-128: Dependency grouping is consistent and well-organized.

The new dependencies are properly mirrored in both the server and all dependency groups, maintaining consistency and ensuring all runtime environments include the robots.txt compliance features.

Comment thread python/pyproject.toml Outdated
Comment thread python/src/server/services/crawling/crawling_service.py
Comment thread python/src/server/services/crawling/robots_checker.py
@leex279 leex279 force-pushed the feature/respect-robots-txt-275 branch 2 times, most recently from ad4695d to 79e6f22 Compare November 7, 2025 21:57
@leex279
Copy link
Copy Markdown
Collaborator Author

leex279 commented Nov 7, 2025

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 7, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
python/src/server/services/crawling/crawling_service.py (2)

951-963: Performance concern: Sequential delay enforcement during URL filtering

The current implementation enforces crawl delays during the URL filtering phase. For N same-domain URLs with a 10s delay, filtering alone takes N × 10s (e.g., 100 URLs = 1000s). This occurs before any actual crawling begins.

Consider refactoring to separate permission checking from delay enforcement:

  • Filtering phase: Quickly check can_fetch without waiting
  • Crawling phase: Enforce delays when actually fetching URLs

This would require updating _can_fetch_url to accept a check_only parameter or splitting it into two methods:

  • _is_fetch_allowed(url) - fast permission check
  • _enforce_crawl_delay(url) - delay enforcement

Then have the batch/recursive strategies call _enforce_crawl_delay before each fetch.


1035-1047: Code duplication: Extract URL filtering into helper method

The robots.txt filtering logic is duplicated in three places (llms.txt links, extracted links, sitemap URLs). Consider extracting this into a reusable helper:

async def _filter_urls_by_robots(self, urls: list[str], context: str = "") -> list[str]:
    """Filter URLs based on robots.txt rules.
    
    Args:
        urls: List of URLs to filter
        context: Description of URL source for logging
        
    Returns:
        List of allowed URLs
    """
    if not self.robots_checker:
        return urls
        
    original_count = len(urls)
    allowed_urls = []
    
    for url in urls:
        if await self._can_fetch_url(url):
            allowed_urls.append(url)
        else:
            logger.info(f"Skipped (robots.txt): {url}")
    
    filtered = original_count - len(allowed_urls)
    if filtered > 0:
        logger.info(f"Filtered out {filtered} URLs by robots.txt from {context}")
    
    return allowed_urls

This improves maintainability and reduces the chance of inconsistencies. Note that the same performance concern mentioned in the earlier comment applies here.

Also applies to: 1105-1117

python/src/server/services/crawling/robots_checker.py (3)

1-52: Terminology clarification: "Thread-safe" is misleading

The class docstring and method comments describe the implementation as "thread-safe," but the code uses asyncio.Lock, which provides coroutine-safe (not thread-safe) synchronization within a single event loop. If this code were used in a multi-threaded environment, there could be race conditions.

Given that this is likely running in a single-threaded async event loop (typical for FastAPI), the use of asyncio.Lock is correct. However, the terminology should be updated to avoid confusion:

 class RobotsChecker:
     """
-    Thread-safe robots.txt checker with caching and crawl delay enforcement.
+    Async-safe robots.txt checker with caching and crawl delay enforcement.

Similarly, update the docstring for _get_domain_lock:

-        Thread-safe lock creation for concurrent access control.
+        Async-safe lock creation for concurrent coroutine access control.

54-82: Consider making HTTP timeout configurable

The HTTP client timeout is hardcoded at 10 seconds. For slow networks or servers, this might be too aggressive. Consider adding a configuration parameter:

         # Shared HTTP client for fetching robots.txt
-        self._client = httpx.AsyncClient(timeout=10.0, follow_redirects=True)
+        timeout = config.get("robots_fetch_timeout", 10.0)
+        self._client = httpx.AsyncClient(timeout=timeout, follow_redirects=True)

And document it in get_crawler_config() in config.py.


165-211: Cache eviction is not true LRU

The PR description states "LRU eviction when exceeding max size," but the implementation at line 202 evicts based on expiration time, not access recency:

oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)

This evicts the entry that will expire soonest, which is not LRU (Least Recently Used). True LRU would track access timestamps and evict the entry that was accessed longest ago.

While the current expiration-based eviction is a reasonable strategy, consider either:

  1. Update the PR description to reflect the actual eviction strategy ("expiration-based eviction")
  2. Implement true LRU by tracking access times:
 @dataclass
 class CachedRobotsEntry:
     """Cache entry for robots.txt parser with TTL tracking."""
     parser: Protego
     expires_at: datetime
+    last_accessed: datetime

     # In get_robots_parser:
+    entry.last_accessed = datetime.now()  # Update on access
     
     # For eviction:
-    oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)
+    oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].last_accessed)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14de810 and 79e6f22.

⛔ Files ignored due to path filters (1)
  • python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • python/pyproject.toml (2 hunks)
  • python/src/server/config/config.py (1 hunks)
  • python/src/server/services/crawler_manager.py (2 hunks)
  • python/src/server/services/crawling/crawling_service.py (7 hunks)
  • python/src/server/services/crawling/robots_checker.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • python/pyproject.toml
  • python/src/server/config/config.py
  • python/src/server/services/crawler_manager.py
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/robots_checker.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (9)
python/src/server/services/crawling/crawling_service.py (3)

16-16: LGTM: Clean imports for robots.txt support

The import statements correctly bring in the configuration loader and the RobotsChecker service.

Also applies to: 32-32


138-140: LGTM: Conditional initialization is correct

The RobotsChecker is only instantiated when respect_robots is enabled, avoiding unnecessary overhead. The configuration is loaded per service instance, which is acceptable for typical usage patterns.


171-202: LGTM: Correctly implements robots.txt validation with delay enforcement

This method properly addresses the past review feedback by calling wait_if_needed_for_url after permission is granted. The fail-open pattern ensures crawling continues even if robots.txt checking fails, which is appropriate for this use case.

python/src/server/services/crawling/robots_checker.py (6)

84-95: LGTM: Correct domain key extraction

The implementation correctly extracts the domain key as scheme://netloc, which is appropriate for robots.txt scoping.


97-128: LGTM: Well-designed lock separation

The use of separate lock dictionaries for cache access and delay tracking is excellent design. This prevents deadlock scenarios where wait_if_needed calls get_crawl_delay, which calls get_robots_parser. The comment clearly explains the rationale.


130-163: LGTM: Clean implementation with appropriate error handling

The fail-open pattern ensures that robots.txt checking failures don't block crawling, which aligns with the fail-safe design mentioned in the PR objectives. The comment about Protego's argument order is helpful.


213-269: LGTM: RFC 9309 compliant error handling

The error handling correctly implements RFC 9309 guidelines:

  • 404 → allow all (no robots.txt)
  • 5xx/timeout → disallow all (conservative)
  • Other errors → allow all (fail open)

The use of follow_redirects=True is also correct per the RFC.


271-304: LGTM: Clean delay extraction with sensible defaults

The method correctly extracts the crawl delay from robots.txt and falls back to the configured default when not specified or on errors.


306-353: LGTM: Correct delay enforcement and addresses past feedback

The wait_if_needed method correctly enforces per-domain crawl delays using non-blocking asyncio.sleep. The addition of wait_if_needed_for_url (lines 340-353) directly addresses the past review comment requesting a URL-aware delay helper, eliminating the need for callers to access _get_domain_key.

The lock ensures concurrent requests to the same domain are properly serialized with appropriate delays between them.

Comment thread python/src/server/services/crawling/robots_checker.py
@leex279 leex279 force-pushed the feature/respect-robots-txt-275 branch 3 times, most recently from 56cd33b to 0e20a2c Compare November 7, 2025 22:34
@leex279 leex279 marked this pull request as ready for review November 7, 2025 22:38
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)

171-198: Crawl-delay enforcement is missing.

The _can_fetch_url method correctly validates robots.txt rules but never enforces crawl delays, contradicting the PR objectives which state "Implement per-site crawl delays based on Crawl-delay directives." While the docstring explains that delays during validation would be too slow, delays must still be enforced during actual crawling.

Per the past review comment, RobotsChecker.wait_if_needed_for_url() exists but is never invoked. The delay should be enforced after validation passes and before the actual HTTP request is made.

Based on learnings from past review, verify whether crawl delays are enforced elsewhere in the crawling strategies or if this is a gap in the implementation:

#!/bin/bash
# Search for wait_if_needed or crawl delay enforcement in crawling strategies
rg -n "wait_if_needed|crawl.*delay" python/src/server/services/crawling/ --type=py -C3
🧹 Nitpick comments (2)
python/src/server/services/crawling/robots_checker.py (2)

66-94: Consider defensive validation of config parameters.

While configuration validation should happen in get_crawler_config(), adding defensive checks here would prevent runtime failures from invalid config values (e.g., negative cache_ttl, zero cache_size).

Add basic validation:

     def __init__(self, config: dict):
         """
         Initialize the RobotsChecker.
 
         Args:
             config: Crawler configuration dict with keys:
                 - user_agent: User-Agent string for requests
                 - robots_cache_size: Maximum domains to cache (default: 1000)
                 - robots_cache_ttl: Cache TTL in seconds (default: 86400 = 24h)
                 - default_crawl_delay: Default delay between requests (default: 10.0)
         """
         self._config = config
 
+        # Defensive validation of critical config values
+        cache_size = config.get("robots_cache_size", 1000)
+        cache_ttl = config.get("robots_cache_ttl", 86400)
+        if cache_size <= 0 or cache_ttl <= 0:
+            logger.warning(
+                f"Invalid cache config: size={cache_size}, ttl={cache_ttl}. Using defaults."
+            )
+            cache_size = max(1, cache_size) if cache_size else 1000
+            cache_ttl = max(1, cache_ttl) if cache_ttl else 86400
+
         # Manual TTL cache for parsed robots.txt (domain -> CachedRobotsEntry)
         self._cache: Dict[str, CachedRobotsEntry] = {}
-        self._cache_ttl = timedelta(seconds=config.get("robots_cache_ttl", 86400))  # 24 hours
-        self._max_cache_size = config.get("robots_cache_size", 1000)
+        self._cache_ttl = timedelta(seconds=cache_ttl)
+        self._max_cache_size = cache_size

218-221: Clarify eviction comment for accuracy.

The comment says "oldest entry" but the code evicts the entry with the earliest expiration time (which was cached earliest). This is correct for TTL-based caching but could be more precise.

-            # Evict oldest entry if cache is full
+            # Evict entry with earliest expiration (was cached first) if cache is full
             if len(self._cache) >= self._max_cache_size:
                 oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)
                 del self._cache[oldest_domain]
-                logger.debug(f"Evicted oldest cache entry: {oldest_domain}")
+                logger.debug(f"Evicted cache entry with earliest expiration: {oldest_domain}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79e6f22 and 0e20a2c.

⛔ Files ignored due to path filters (1)
  • python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • python/pyproject.toml (2 hunks)
  • python/src/server/api_routes/knowledge_api.py (2 hunks)
  • python/src/server/config/config.py (1 hunks)
  • python/src/server/services/crawler_manager.py (2 hunks)
  • python/src/server/services/crawling/crawling_service.py (8 hunks)
  • python/src/server/services/crawling/discovery_service.py (0 hunks)
  • python/src/server/services/crawling/robots_checker.py (1 hunks)
💤 Files with no reviewable changes (1)
  • python/src/server/services/crawling/discovery_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/pyproject.toml
  • python/src/server/services/crawler_manager.py
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

Files:

  • python/src/server/api_routes/knowledge_api.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/config/config.py
  • python/src/server/services/crawling/robots_checker.py
🧬 Code graph analysis (3)
python/src/server/api_routes/knowledge_api.py (3)
python/src/server/services/crawling/crawling_service.py (1)
  • close (1149-1156)
python/src/server/services/crawling/robots_checker.py (1)
  • close (372-379)
python/src/server/services/provider_discovery_service.py (1)
  • close (93-97)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/config/config.py (1)
  • get_crawler_config (280-308)
python/src/server/services/crawling/robots_checker.py (3)
  • RobotsChecker (48-389)
  • can_fetch (147-180)
  • close (372-379)
python/src/server/services/crawling/robots_checker.py (1)
python/src/server/services/crawling/crawling_service.py (1)
  • close (1149-1156)
🔇 Additional comments (13)
python/src/server/api_routes/knowledge_api.py (2)

715-716: LGTM! Proper resource cleanup added.

The await crawl_service.close() call correctly ensures resource cleanup in the finally block, maintaining consistency with the service lifecycle pattern even though the current implementation is a no-op.


894-895: LGTM! Consistent resource cleanup pattern.

The await orchestration_service.close() call properly releases resources in the finally block, matching the cleanup pattern added for refresh operations.

python/src/server/services/crawling/crawling_service.py (5)

138-140: LGTM! Clean conditional initialization of robots checker.

The robots.txt checker is properly initialized based on configuration, with clean separation between config retrieval and conditional instantiation.


947-959: LGTM! Robots.txt filtering properly integrated for llms.txt links.

The filtering logic correctly validates URLs against robots.txt rules, with appropriate logging and metrics for filtered URLs.


1031-1043: LGTM! Consistent robots.txt filtering for extracted links.

The filtering implementation maintains consistency with the llms.txt filtering logic, with proper logging and metrics tracking.


1101-1113: LGTM! Robots.txt filtering applied to sitemap URLs.

The filtering is consistently implemented and properly positioned before the batch crawl operation begins.


1149-1156: LGTM! Close method properly documented as API compatibility.

The no-op close() method is correctly documented, explaining that the shared HTTP client pattern means no per-instance cleanup is needed.

python/src/server/services/crawling/robots_checker.py (6)

96-112: LGTM! Clean domain key extraction with proper error handling.

The method correctly extracts the domain key and provides clear error messages for malformed URLs.


114-145: LGTM! Well-designed lock management with deadlock prevention.

The separate lock dictionaries for cache and delay operations effectively prevent deadlock scenarios, with clear documentation of the design rationale.


147-180: LGTM! Robust can_fetch implementation with fail-open semantics.

The method correctly validates URLs against robots.txt rules with appropriate error handling and logging. The fail-open behavior on errors is the right choice for crawler availability.


230-286: LGTM! RFC 9309 compliant robots.txt fetching with proper error handling.

The fetch implementation correctly handles all error cases per RFC 9309 guidelines, with appropriate conservative behavior for server errors and fail-open for other failures.


323-370: LGTM! Clean delay enforcement implementation with URL convenience wrapper.

The wait_if_needed and wait_if_needed_for_url methods properly enforce crawl delays using non-blocking async sleep with per-domain locking. The URL wrapper addresses the past review comment about exposing a URL-aware delay helper.

Note: These methods are currently unused, which is already flagged in the crawling_service.py review comments.


372-389: LGTM! Clean resource management methods.

The close() method is properly documented as a no-op due to the shared client pattern, and clear_cache() correctly clears all cached state.

Comment thread python/src/server/config/config.py
Comment thread python/src/server/services/crawling/robots_checker.py
Adds robots.txt validation to respect website crawling policies.
Uses Protego library for parsing and enforces RFC 9309 standards.

Changes:
- RobotsChecker service with manual TTL caching and shared httpx client
- User-Agent: "Archon-Crawler/0.1.0 (+repo_url)"
- URL validation at 3 critical integration points
- Proper resource cleanup in API route finally blocks
- Removed robots.txt from discovery file list (used for validation, not content)
- Clean INFO-level logging: one line per domain showing compliance

Dependencies:
- Added protego>=0.3.1 (fast RFC 9309 compliant parser with wildcard support)
- crawl4ai updated 0.7.4 -> 0.7.6 (latest bug fixes, unrelated to robots.txt)
- Manual async caching (no asyncache - unmaintained with cachetools risks)

Key Features:
- 24-hour TTL cache per domain with LRU eviction
- Proper error handling (404=allow, 5xx=disallow per RFC 9309)
- Thread-safe with separate locks for cache and delay tracking
- Shared httpx.AsyncClient singleton prevents connection leaks
- close() called in finally blocks for proper cleanup
- Minimal logging: "Respecting robots.txt for {domain} (cached for 24h)"

Closes #275

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279 leex279 force-pushed the feature/respect-robots-txt-275 branch from 0e20a2c to 247c7ea Compare November 7, 2025 22:45
leex279 and others added 2 commits November 8, 2025 00:06
The recursive strategy was bypassing robots.txt validation by extracting
and queuing URLs internally. This adds validation when URLs are added to
the crawl queue.

- Pass robots_checker to recursive strategy
- Validate URLs before adding to next_level_urls
- Log skipped URLs with "Skipped (robots.txt)" message

This ensures all crawling paths (batch, recursive, sitemap) respect robots.txt.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Validates numeric environment variables with proper error handling:
- ROBOTS_DEFAULT_CRAWL_DELAY: Must be valid float >= 0.0
- ROBOTS_CACHE_SIZE: Must be valid int >= 1
- ROBOTS_CACHE_TTL: Must be valid int >= 1

Raises ConfigurationError with clear messages including:
- The env var name
- The invalid value
- The required range/constraints

Returns type-safe dict (float for delay, int for sizes/TTLs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 6, 2026

🔄 This repository is being replaced by a new version of Archon.

The original Python/MCP codebase is being archived to the archive/v1-python-mcp branch. The new Archon (TypeScript workflow engine for AI coding agents) is replacing it.

This PR is being closed as part of the migration. Thank you for your contribution!

@Wirasm Wirasm closed this Apr 6, 2026
@Wirasm Wirasm deleted the feature/respect-robots-txt-275 branch April 6, 2026 07:38
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…s (step 2.3) (#849)

* refactor(workflows): eliminate barrel exports — direct subpath imports (step 2.3) (#846)

Delete packages/workflows/src/index.ts and replace the `"."` barrel entry
in package.json exports with an explicit 17-entry subpath map. All 13 source
files and 12 test files across @archon/core, @archon/cli, and @archon/server
now import from specific subpaths (e.g. @archon/workflows/executor,
@archon/workflows/schemas/workflow) instead of the opaque barrel.

Test mock.module() calls are split per-subpath so spies target the same
module reference that the handler under test imports from. Updates CLAUDE.md
import convention docs to show direct-subpath-only examples.

Fixes #846

* style: auto-fix prettier formatting in auth-service/server.js

* docs: remove stale index.ts entry from workflows directory structure

packages/workflows/src/index.ts was deleted in this PR but the
CLAUDE.md directory tree still listed it. Remove the stale entry.

* fix: address review findings — trim unused exports, fix docs

- Remove 5 unused exports from package.json (executor-shared, dag-executor,
  model-validation, utils/variable-substitution, schemas barrel) — YAGNI
- Fix CLAUDE.md tree character (├── → └── on utils/, last child)
- Remove redundant barrel wrong-example block from CLAUDE.md (compiler
  already enforces this since there's no "." export entry)
- Clarify @archon/web import prohibition scope in CLAUDE.md
- Update docs/cli-developer-guide.md: stale discoverWorkflows →
  discoverWorkflowsWithConfig, barrel refs → direct subpaths
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…s (step 2.3) (coleam00#849)

* refactor(workflows): eliminate barrel exports — direct subpath imports (step 2.3) (coleam00#846)

Delete packages/workflows/src/index.ts and replace the `"."` barrel entry
in package.json exports with an explicit 17-entry subpath map. All 13 source
files and 12 test files across @archon/core, @archon/cli, and @archon/server
now import from specific subpaths (e.g. @archon/workflows/executor,
@archon/workflows/schemas/workflow) instead of the opaque barrel.

Test mock.module() calls are split per-subpath so spies target the same
module reference that the handler under test imports from. Updates CLAUDE.md
import convention docs to show direct-subpath-only examples.

Fixes coleam00#846

* style: auto-fix prettier formatting in auth-service/server.js

* docs: remove stale index.ts entry from workflows directory structure

packages/workflows/src/index.ts was deleted in this PR but the
CLAUDE.md directory tree still listed it. Remove the stale entry.

* fix: address review findings — trim unused exports, fix docs

- Remove 5 unused exports from package.json (executor-shared, dag-executor,
  model-validation, utils/variable-substitution, schemas barrel) — YAGNI
- Fix CLAUDE.md tree character (├── → └── on utils/, last child)
- Remove redundant barrel wrong-example block from CLAUDE.md (compiler
  already enforces this since there's no "." export entry)
- Clarify @archon/web import prohibition scope in CLAUDE.md
- Update docs/cli-developer-guide.md: stale discoverWorkflows →
  discoverWorkflowsWithConfig, barrel refs → direct subpaths
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…s (step 2.3) (coleam00#849)

* refactor(workflows): eliminate barrel exports — direct subpath imports (step 2.3) (coleam00#846)

Delete packages/workflows/src/index.ts and replace the `"."` barrel entry
in package.json exports with an explicit 17-entry subpath map. All 13 source
files and 12 test files across @archon/core, @archon/cli, and @archon/server
now import from specific subpaths (e.g. @archon/workflows/executor,
@archon/workflows/schemas/workflow) instead of the opaque barrel.

Test mock.module() calls are split per-subpath so spies target the same
module reference that the handler under test imports from. Updates CLAUDE.md
import convention docs to show direct-subpath-only examples.

Fixes coleam00#846

* style: auto-fix prettier formatting in auth-service/server.js

* docs: remove stale index.ts entry from workflows directory structure

packages/workflows/src/index.ts was deleted in this PR but the
CLAUDE.md directory tree still listed it. Remove the stale entry.

* fix: address review findings — trim unused exports, fix docs

- Remove 5 unused exports from package.json (executor-shared, dag-executor,
  model-validation, utils/variable-substitution, schemas barrel) — YAGNI
- Fix CLAUDE.md tree character (├── → └── on utils/, last child)
- Remove redundant barrel wrong-example block from CLAUDE.md (compiler
  already enforces this since there's no "." export entry)
- Clarify @archon/web import prohibition scope in CLAUDE.md
- Update docs/cli-developer-guide.md: stale discoverWorkflows →
  discoverWorkflowsWithConfig, barrel refs → direct subpaths
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 [Bug/Feature]: Not respecting robots.txt

2 participants