Implement robots.txt compliance for web crawler (Issue #275) #846
Conversation
Walkthrough

Adds robots.txt compliance: new RobotsChecker module (fetch/parse/cache/delay via Protego), integrates robots checking and per-domain delays into CrawlingService and the recursive strategy, exposes crawler config via get_crawler_config(), updates dependencies (Protego, crawl4ai), and ensures crawler services are closed during request cleanup.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CS as CrawlingService
    participant RC as RobotsChecker
    participant HTTP as HTTP Client
    participant Parser as Protego Parser
    Note right of CS: Discovery (llms.txt, links, sitemap)
    CS->>RC: can_fetch(url)
    alt cache hit
        RC-->>CS: allowed / disallowed
    else cache miss
        RC->>HTTP: GET domain/robots.txt
        alt 200
            HTTP-->>RC: content
            RC->>Parser: parse(content)
            Parser-->>RC: rules
            RC-->>CS: allowed / disallowed
        else 404
            HTTP-->>RC: 404
            RC-->>CS: allowed (allow-all)
        else 5xx / timeout
            HTTP-->>RC: error
            RC-->>CS: disallowed (conservative)
        end
    end
    alt allowed
        CS->>RC: wait_if_needed(domain)
        RC-->>CS: await delay (async)
        CS->>CS: enqueue/fetch URL
    else disallowed
        CS-->>CS: drop URL, log/count filtered
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Force-pushed from b9e1ffe to 14de810 (Compare)
@CodeRabbit review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)
876-914: Initial URL still bypasses robots.txt

We only filter derived links, so the very first URL handed to `_crawl_by_url_type` (including discovered targets) is still fetched even when robots.txt says "Disallow". That defeats the primary objective of this PR because we hit disallowed pages before any filtering happens. Please short-circuit the crawl when the entrypoint is blocked.

```diff
 async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple:
     """
     Detect URL type and perform appropriate crawling.

     Returns:
         Tuple of (crawl_results, crawl_type)
     """
     crawl_results = []
     crawl_type = None
+
+    if self.robots_checker and not await self._can_fetch_url(url):
+        safe_logfire_info(f"Skipped (robots.txt): {url}")
+        return [], None
```
🧹 Nitpick comments (1)
python/pyproject.toml (1)
43-45: Consider pinning asyncache to a known stable version.

While `protego>=0.3.1` and `cachetools>=5.3.0` are well-maintained libraries with safe version constraints, `asyncache>=0.3.1` is a less common dependency. If the feature was tested and validated against a specific asyncache version (e.g., during development), consider pinning it to that version (e.g., `asyncache>=0.3.1,<0.4.0`) to reduce drift and ensure reproducibility across deployments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`python/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `python/pyproject.toml` (2 hunks)
- `python/src/server/config/config.py` (1 hunks)
- `python/src/server/services/crawler_manager.py` (2 hunks)
- `python/src/server/services/crawling/crawling_service.py` (7 hunks)
- `python/src/server/services/crawling/robots_checker.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- `python/src/server/services/crawler_manager.py`
- `python/src/server/services/crawling/crawling_service.py`
- `python/src/server/services/crawling/robots_checker.py`
- `python/src/server/config/config.py`
🧠 Learnings (1)
📚 Learning: 2025-08-30T16:33:09.452Z
Learnt from: coleam00
Repo: coleam00/Archon PR: 533
File: python/pyproject.toml:24-0
Timestamp: 2025-08-30T16:33:09.452Z
Learning: Crawl4AI includes Playwright as a dependency and automatically sets it up via the crawl4ai-setup command, so there's no need to explicitly add playwright to dependency groups when crawl4ai is already included.
Applied to files:
- `python/src/server/services/crawler_manager.py`
- `python/pyproject.toml`
🔇 Additional comments (1)
python/pyproject.toml (1)
125-128: Dependency grouping is consistent and well-organized.

The new dependencies are properly mirrored in both the `server` and `all` dependency groups, maintaining consistency and ensuring all runtime environments include the robots.txt compliance features.
Force-pushed from ad4695d to 79e6f22 (Compare)
@CodeRabbit review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (5)
python/src/server/services/crawling/crawling_service.py (2)
951-963: Performance concern: Sequential delay enforcement during URL filtering

The current implementation enforces crawl delays during the URL filtering phase. For N same-domain URLs with a 10s delay, filtering alone takes N × 10s (e.g., 100 URLs = 1000s). This occurs before any actual crawling begins.
Consider refactoring to separate permission checking from delay enforcement:
- Filtering phase: quickly check `can_fetch` without waiting
- Crawling phase: enforce delays when actually fetching URLs

This would require updating `_can_fetch_url` to accept a `check_only` parameter or splitting it into two methods:

- `_is_fetch_allowed(url)` - fast permission check
- `_enforce_crawl_delay(url)` - delay enforcement

Then have the batch/recursive strategies call `_enforce_crawl_delay` before each fetch (see the sketch below).
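A minimal sketch of the suggested split, reusing the method names proposed above and the `robots_checker` attribute described elsewhere in this review. The mixin framing is hypothetical, not the PR's actual class layout.

```python
class RobotsFilterMixin:
    """Sketch: separate fast permission checks from crawl-delay enforcement."""

    robots_checker = None  # set to a RobotsChecker instance when compliance is enabled

    async def _is_fetch_allowed(self, url: str) -> bool:
        """Fast permission check only -- no crawl-delay waiting (use during filtering)."""
        if not self.robots_checker:
            return True
        return await self.robots_checker.can_fetch(url)

    async def _enforce_crawl_delay(self, url: str) -> None:
        """Wait out the per-domain crawl delay; call right before actually fetching `url`."""
        if self.robots_checker:
            await self.robots_checker.wait_if_needed_for_url(url)
```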
1035-1047: Code duplication: Extract URL filtering into helper method

The robots.txt filtering logic is duplicated in three places (llms.txt links, extracted links, sitemap URLs). Consider extracting this into a reusable helper:
```python
async def _filter_urls_by_robots(self, urls: list[str], context: str = "") -> list[str]:
    """Filter URLs based on robots.txt rules.

    Args:
        urls: List of URLs to filter
        context: Description of URL source for logging

    Returns:
        List of allowed URLs
    """
    if not self.robots_checker:
        return urls

    original_count = len(urls)
    allowed_urls = []
    for url in urls:
        if await self._can_fetch_url(url):
            allowed_urls.append(url)
        else:
            logger.info(f"Skipped (robots.txt): {url}")

    filtered = original_count - len(allowed_urls)
    if filtered > 0:
        logger.info(f"Filtered out {filtered} URLs by robots.txt from {context}")
    return allowed_urls
```

This improves maintainability and reduces the chance of inconsistencies. Note that the same performance concern mentioned in the earlier comment applies here.
Also applies to: 1105-1117
python/src/server/services/crawling/robots_checker.py (3)
1-52: Terminology clarification: "Thread-safe" is misleading

The class docstring and method comments describe the implementation as "thread-safe," but the code uses `asyncio.Lock`, which provides coroutine-safe (not thread-safe) synchronization within a single event loop. If this code were used in a multi-threaded environment, there could be race conditions.

Given that this is likely running in a single-threaded async event loop (typical for FastAPI), the use of `asyncio.Lock` is correct. However, the terminology should be updated to avoid confusion:

```diff
 class RobotsChecker:
     """
-    Thread-safe robots.txt checker with caching and crawl delay enforcement.
+    Async-safe robots.txt checker with caching and crawl delay enforcement.
```

Similarly, update the docstring for `_get_domain_lock`:

```diff
-    Thread-safe lock creation for concurrent access control.
+    Async-safe lock creation for concurrent coroutine access control.
```
54-82: Consider making HTTP timeout configurable

The HTTP client timeout is hardcoded at 10 seconds. For slow networks or servers, this might be too aggressive. Consider adding a configuration parameter:

```diff
 # Shared HTTP client for fetching robots.txt
-self._client = httpx.AsyncClient(timeout=10.0, follow_redirects=True)
+timeout = config.get("robots_fetch_timeout", 10.0)
+self._client = httpx.AsyncClient(timeout=timeout, follow_redirects=True)
```

And document it in `get_crawler_config()` in config.py.
165-211: Cache eviction is not true LRU

The PR description states "LRU eviction when exceeding max size," but the implementation at line 202 evicts based on expiration time, not access recency:

```python
oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)
```

This evicts the entry that will expire soonest, which is not LRU (Least Recently Used). True LRU would track access timestamps and evict the entry that was accessed longest ago.

While the current expiration-based eviction is a reasonable strategy, consider either:

- Update the PR description to reflect the actual eviction strategy ("expiration-based eviction")
- Implement true LRU by tracking access times:

```diff
 @dataclass
 class CachedRobotsEntry:
     """Cache entry for robots.txt parser with TTL tracking."""
     parser: Protego
     expires_at: datetime
+    last_accessed: datetime

 # In get_robots_parser:
+entry.last_accessed = datetime.now()  # Update on access

 # For eviction:
-oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)
+oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].last_accessed)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`python/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- `python/pyproject.toml` (2 hunks)
- `python/src/server/config/config.py` (1 hunks)
- `python/src/server/services/crawler_manager.py` (2 hunks)
- `python/src/server/services/crawling/crawling_service.py` (7 hunks)
- `python/src/server/services/crawling/robots_checker.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- python/pyproject.toml
- python/src/server/config/config.py
- python/src/server/services/crawler_manager.py
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- `python/src/server/services/crawling/crawling_service.py`
- `python/src/server/services/crawling/robots_checker.py`
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (9)
python/src/server/services/crawling/crawling_service.py (3)
16-16: LGTM: Clean imports for robots.txt support

The import statements correctly bring in the configuration loader and the RobotsChecker service.
Also applies to: 32-32
138-140: LGTM: Conditional initialization is correct

The RobotsChecker is only instantiated when `respect_robots` is enabled, avoiding unnecessary overhead. The configuration is loaded per service instance, which is acceptable for typical usage patterns.
171-202: LGTM: Correctly implements robots.txt validation with delay enforcement

This method properly addresses the past review feedback by calling `wait_if_needed_for_url` after permission is granted. The fail-open pattern ensures crawling continues even if robots.txt checking fails, which is appropriate for this use case.

python/src/server/services/crawling/robots_checker.py (6)
84-95: LGTM: Correct domain key extraction

The implementation correctly extracts the domain key as `scheme://netloc`, which is appropriate for robots.txt scoping.
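For reference, a minimal sketch of that scoping rule as a hypothetical standalone helper (not the module's exact code):

```python
from urllib.parse import urlparse

def domain_key(url: str) -> str:
    """Scope robots.txt rules per scheme://netloc, e.g. 'https://example.com'."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Cannot derive robots.txt domain key from malformed URL: {url!r}")
    return f"{parsed.scheme}://{parsed.netloc}"
```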
97-128: LGTM: Well-designed lock separation

The use of separate lock dictionaries for cache access and delay tracking is excellent design. This prevents deadlock scenarios where `wait_if_needed` calls `get_crawl_delay`, which calls `get_robots_parser`. The comment clearly explains the rationale.
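A rough sketch of that separation, assuming plain per-domain dictionaries of `asyncio.Lock` guarded by a creation lock; the class and method names are illustrative, not the module's actual internals.

```python
import asyncio

class RobotsLocks:
    """Two independent per-domain lock maps: cache refreshes and delay waits never share a lock."""

    def __init__(self) -> None:
        self._cache_locks: dict[str, asyncio.Lock] = {}
        self._delay_locks: dict[str, asyncio.Lock] = {}
        self._creation_lock = asyncio.Lock()  # guards the dicts themselves

    async def _get(self, registry: dict[str, asyncio.Lock], domain: str) -> asyncio.Lock:
        async with self._creation_lock:
            return registry.setdefault(domain, asyncio.Lock())

    async def cache_lock(self, domain: str) -> asyncio.Lock:
        """Lock held while fetching/parsing robots.txt for a domain."""
        return await self._get(self._cache_locks, domain)

    async def delay_lock(self, domain: str) -> asyncio.Lock:
        """Lock held while waiting out a domain's crawl delay."""
        return await self._get(self._delay_locks, domain)
```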
130-163: LGTM: Clean implementation with appropriate error handling

The fail-open pattern ensures that robots.txt checking failures don't block crawling, which aligns with the fail-safe design mentioned in the PR objectives. The comment about Protego's argument order is helpful.
213-269: LGTM: RFC 9309 compliant error handling

The error handling correctly implements RFC 9309 guidelines:
- 404 → allow all (no robots.txt)
- 5xx/timeout → disallow all (conservative)
- Other errors → allow all (fail open)
The use of `follow_redirects=True` is also correct per the RFC.
271-304: LGTM: Clean delay extraction with sensible defaults

The method correctly extracts the crawl delay from robots.txt and falls back to the configured default when not specified or on errors.
306-353: LGTM: Correct delay enforcement and addresses past feedback

The `wait_if_needed` method correctly enforces per-domain crawl delays using non-blocking `asyncio.sleep`. The addition of `wait_if_needed_for_url` (lines 340-353) directly addresses the past review comment requesting a URL-aware delay helper, eliminating the need for callers to access `_get_domain_key`.

The lock ensures concurrent requests to the same domain are properly serialized with appropriate delays between them.
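A minimal sketch of what such per-domain delay enforcement looks like, assuming a monotonic clock and the 10.0s default delay quoted elsewhere in this review; the class and field names are illustrative, not the module's actual code.

```python
import asyncio
import time

class DomainDelayer:
    """Serializes requests per domain and sleeps until the crawl delay has elapsed."""

    def __init__(self, default_delay: float = 10.0) -> None:
        self._default_delay = default_delay
        self._last_request: dict[str, float] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def wait_if_needed(self, domain: str, delay: float | None = None) -> None:
        delay = self._default_delay if delay is None else delay
        lock = self._locks.setdefault(domain, asyncio.Lock())
        async with lock:  # concurrent callers for the same domain queue up here
            now = time.monotonic()
            last = self._last_request.get(domain)
            if last is not None:
                remaining = delay - (now - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)  # non-blocking wait
            self._last_request[domain] = time.monotonic()
```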
Force-pushed from 56cd33b to 0e20a2c (Compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)
171-198: Crawl-delay enforcement is missing.

The `_can_fetch_url` method correctly validates robots.txt rules but never enforces crawl delays, contradicting the PR objectives which state "Implement per-site crawl delays based on Crawl-delay directives." While the docstring explains that delays during validation would be too slow, delays must still be enforced during actual crawling.

Per the past review comment, `RobotsChecker.wait_if_needed_for_url()` exists but is never invoked. The delay should be enforced after validation passes and before the actual HTTP request is made.

Based on learnings from past review, verify whether crawl delays are enforced elsewhere in the crawling strategies or if this is a gap in the implementation:
```bash
#!/bin/bash
# Search for wait_if_needed or crawl delay enforcement in crawling strategies
rg -n "wait_if_needed|crawl.*delay" python/src/server/services/crawling/ --type=py -C3
```
🧹 Nitpick comments (2)
python/src/server/services/crawling/robots_checker.py (2)
66-94: Consider defensive validation of config parameters.

While configuration validation should happen in `get_crawler_config()`, adding defensive checks here would prevent runtime failures from invalid config values (e.g., negative cache_ttl, zero cache_size).

Add basic validation:

```diff
 def __init__(self, config: dict):
     """
     Initialize the RobotsChecker.

     Args:
         config: Crawler configuration dict with keys:
             - user_agent: User-Agent string for requests
             - robots_cache_size: Maximum domains to cache (default: 1000)
             - robots_cache_ttl: Cache TTL in seconds (default: 86400 = 24h)
             - default_crawl_delay: Default delay between requests (default: 10.0)
     """
     self._config = config
+    # Defensive validation of critical config values
+    cache_size = config.get("robots_cache_size", 1000)
+    cache_ttl = config.get("robots_cache_ttl", 86400)
+    if cache_size <= 0 or cache_ttl <= 0:
+        logger.warning(
+            f"Invalid cache config: size={cache_size}, ttl={cache_ttl}. Using defaults."
+        )
+        cache_size = max(1, cache_size) if cache_size else 1000
+        cache_ttl = max(1, cache_ttl) if cache_ttl else 86400
+
     # Manual TTL cache for parsed robots.txt (domain -> CachedRobotsEntry)
     self._cache: Dict[str, CachedRobotsEntry] = {}
-    self._cache_ttl = timedelta(seconds=config.get("robots_cache_ttl", 86400))  # 24 hours
-    self._max_cache_size = config.get("robots_cache_size", 1000)
+    self._cache_ttl = timedelta(seconds=cache_ttl)
+    self._max_cache_size = cache_size
```
218-221: Clarify eviction comment for accuracy.

The comment says "oldest entry" but the code evicts the entry with the earliest expiration time (which was cached earliest). This is correct for TTL-based caching but could be more precise.

```diff
-# Evict oldest entry if cache is full
+# Evict entry with earliest expiration (was cached first) if cache is full
 if len(self._cache) >= self._max_cache_size:
     oldest_domain = min(self._cache.keys(), key=lambda k: self._cache[k].expires_at)
     del self._cache[oldest_domain]
-    logger.debug(f"Evicted oldest cache entry: {oldest_domain}")
+    logger.debug(f"Evicted cache entry with earliest expiration: {oldest_domain}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`python/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- `python/pyproject.toml` (2 hunks)
- `python/src/server/api_routes/knowledge_api.py` (2 hunks)
- `python/src/server/config/config.py` (1 hunks)
- `python/src/server/services/crawler_manager.py` (2 hunks)
- `python/src/server/services/crawling/crawling_service.py` (8 hunks)
- `python/src/server/services/crawling/discovery_service.py` (0 hunks)
- `python/src/server/services/crawling/robots_checker.py` (1 hunks)
💤 Files with no reviewable changes (1)
- python/src/server/services/crawling/discovery_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
- python/pyproject.toml
- python/src/server/services/crawler_manager.py
🧰 Additional context used
📓 Path-based instructions (1)
python/src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code
Files:
- `python/src/server/api_routes/knowledge_api.py`
- `python/src/server/services/crawling/crawling_service.py`
- `python/src/server/config/config.py`
- `python/src/server/services/crawling/robots_checker.py`
🧬 Code graph analysis (3)
python/src/server/api_routes/knowledge_api.py (3)

- python/src/server/services/crawling/crawling_service.py (1): `close` (1149-1156)
- python/src/server/services/crawling/robots_checker.py (1): `close` (372-379)
- python/src/server/services/provider_discovery_service.py (1): `close` (93-97)

python/src/server/services/crawling/crawling_service.py (2)

- python/src/server/config/config.py (1): `get_crawler_config` (280-308)
- python/src/server/services/crawling/robots_checker.py (3): `RobotsChecker` (48-389), `can_fetch` (147-180), `close` (372-379)

python/src/server/services/crawling/robots_checker.py (1)

- python/src/server/services/crawling/crawling_service.py (1): `close` (1149-1156)
🔇 Additional comments (13)
python/src/server/api_routes/knowledge_api.py (2)
715-716: LGTM! Proper resource cleanup added.

The `await crawl_service.close()` call correctly ensures resource cleanup in the finally block, maintaining consistency with the service lifecycle pattern even though the current implementation is a no-op.
894-895: LGTM! Consistent resource cleanup pattern.

The `await orchestration_service.close()` call properly releases resources in the finally block, matching the cleanup pattern added for refresh operations.

python/src/server/services/crawling/crawling_service.py (5)
138-140: LGTM! Clean conditional initialization of robots checker.

The robots.txt checker is properly initialized based on configuration, with clean separation between config retrieval and conditional instantiation.

947-959: LGTM! Robots.txt filtering properly integrated for llms.txt links.

The filtering logic correctly validates URLs against robots.txt rules, with appropriate logging and metrics for filtered URLs.

1031-1043: LGTM! Consistent robots.txt filtering for extracted links.

The filtering implementation maintains consistency with the llms.txt filtering logic, with proper logging and metrics tracking.

1101-1113: LGTM! Robots.txt filtering applied to sitemap URLs.

The filtering is consistently implemented and properly positioned before the batch crawl operation begins.

1149-1156: LGTM! Close method properly documented as API compatibility.

The no-op `close()` method is correctly documented, explaining that the shared HTTP client pattern means no per-instance cleanup is needed.

python/src/server/services/crawling/robots_checker.py (6)
96-112: LGTM! Clean domain key extraction with proper error handling.

The method correctly extracts the domain key and provides clear error messages for malformed URLs.

114-145: LGTM! Well-designed lock management with deadlock prevention.

The separate lock dictionaries for cache and delay operations effectively prevent deadlock scenarios, with clear documentation of the design rationale.

147-180: LGTM! Robust can_fetch implementation with fail-open semantics.

The method correctly validates URLs against robots.txt rules with appropriate error handling and logging. The fail-open behavior on errors is the right choice for crawler availability.

230-286: LGTM! RFC 9309 compliant robots.txt fetching with proper error handling.

The fetch implementation correctly handles all error cases per RFC 9309 guidelines, with appropriate conservative behavior for server errors and fail-open for other failures.
323-370: LGTM! Clean delay enforcement implementation with URL convenience wrapper.

The `wait_if_needed` and `wait_if_needed_for_url` methods properly enforce crawl delays using non-blocking async sleep with per-domain locking. The URL wrapper addresses the past review comment about exposing a URL-aware delay helper.

Note: These methods are currently unused, which is already flagged in the crawling_service.py review comments.
372-389: LGTM! Clean resource management methods.

The `close()` method is properly documented as a no-op due to the shared client pattern, and `clear_cache()` correctly clears all cached state.
Adds robots.txt validation to respect website crawling policies.
Uses Protego library for parsing and enforces RFC 9309 standards.
Changes:
- RobotsChecker service with manual TTL caching and shared httpx client
- User-Agent: "Archon-Crawler/0.1.0 (+repo_url)"
- URL validation at 3 critical integration points
- Proper resource cleanup in API route finally blocks
- Removed robots.txt from discovery file list (used for validation, not content)
- Clean INFO-level logging: one line per domain showing compliance
Dependencies:
- Added protego>=0.3.1 (fast RFC 9309 compliant parser with wildcard support)
- crawl4ai updated 0.7.4 -> 0.7.6 (latest bug fixes, unrelated to robots.txt)
- Manual async caching (no asyncache - unmaintained with cachetools risks)
Key Features:
- 24-hour TTL cache per domain with LRU eviction
- Proper error handling (404=allow, 5xx=disallow per RFC 9309)
- Thread-safe with separate locks for cache and delay tracking
- Shared httpx.AsyncClient singleton prevents connection leaks
- close() called in finally blocks for proper cleanup
- Minimal logging: "Respecting robots.txt for {domain} (cached for 24h)"
Closes #275
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from 0e20a2c to 247c7ea (Compare)
The recursive strategy was bypassing robots.txt validation by extracting and queuing URLs internally. This adds validation when URLs are added to the crawl queue.

- Pass robots_checker to recursive strategy
- Validate URLs before adding to next_level_urls
- Log skipped URLs with "Skipped (robots.txt)" message

This ensures all crawling paths (batch, recursive, sitemap) respect robots.txt.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Validates numeric environment variables with proper error handling:

- ROBOTS_DEFAULT_CRAWL_DELAY: Must be a valid float >= 0.0
- ROBOTS_CACHE_SIZE: Must be a valid int >= 1
- ROBOTS_CACHE_TTL: Must be a valid int >= 1

Raises ConfigurationError with clear messages including:

- The env var name
- The invalid value
- The required range/constraints

Returns a type-safe dict (float for delay, int for sizes/TTLs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
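A rough sketch of that validation, assuming `os.environ` access and a `ConfigurationError` exception type as named in this commit message; the defaults follow the values quoted elsewhere in the review (1000 cache entries, 86400s TTL, 10.0s delay), and the helper names are illustrative.

```python
import os

class ConfigurationError(Exception):
    """Raised when a crawler environment variable has an invalid value."""

def _env_float(name: str, default: float, minimum: float) -> float:
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise ConfigurationError(f"{name}={raw!r} is not a valid float") from exc
    if value < minimum:
        raise ConfigurationError(f"{name}={value} must be >= {minimum}")
    return value

def _env_int(name: str, default: int, minimum: int) -> int:
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ConfigurationError(f"{name}={raw!r} is not a valid integer") from exc
    if value < minimum:
        raise ConfigurationError(f"{name}={value} must be >= {minimum}")
    return value

def get_crawler_config() -> dict:
    """Return a type-safe crawler config dict built from validated env vars."""
    return {
        "default_crawl_delay": _env_float("ROBOTS_DEFAULT_CRAWL_DELAY", 10.0, 0.0),
        "robots_cache_size": _env_int("ROBOTS_CACHE_SIZE", 1000, 1),
        "robots_cache_ttl": _env_int("ROBOTS_CACHE_TTL", 86400, 1),
    }
```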
🔄 This repository is being replaced by a new version of Archon. The original Python/MCP codebase is being archived. This PR is being closed as part of the migration. Thank you for your contribution!
…s (step 2.3) (#849)

* refactor(workflows): eliminate barrel exports — direct subpath imports (step 2.3) (#846)

  Delete packages/workflows/src/index.ts and replace the `"."` barrel entry in package.json exports with an explicit 17-entry subpath map. All 13 source files and 12 test files across @archon/core, @archon/cli, and @archon/server now import from specific subpaths (e.g. @archon/workflows/executor, @archon/workflows/schemas/workflow) instead of the opaque barrel. Test mock.module() calls are split per-subpath so spies target the same module reference that the handler under test imports from. Updates CLAUDE.md import convention docs to show direct-subpath-only examples.

  Fixes #846

* style: auto-fix prettier formatting in auth-service/server.js

* docs: remove stale index.ts entry from workflows directory structure

  packages/workflows/src/index.ts was deleted in this PR but the CLAUDE.md directory tree still listed it. Remove the stale entry.

* fix: address review findings — trim unused exports, fix docs

  - Remove 5 unused exports from package.json (executor-shared, dag-executor, model-validation, utils/variable-substitution, schemas barrel) — YAGNI
  - Fix CLAUDE.md tree character (├── → └── on utils/, last child)
  - Remove redundant barrel wrong-example block from CLAUDE.md (compiler already enforces this since there's no "." export entry)
  - Clarify @archon/web import prohibition scope in CLAUDE.md
  - Update docs/cli-developer-guide.md: stale discoverWorkflows → discoverWorkflowsWithConfig, barrel refs → direct subpaths
Summary
Implements robots.txt URL validation for Archon's web crawler to respect website Allow/Disallow rules.
"Archon-Crawler/0.1.0 (+repo_url)"Verified Working ✅
Screenshot from live amazon.com crawl showing robots.txt validation:
```
INFO | Respecting robots.txt for https://www.amazon.com (cached for 24h)
INFO | URL blocked by robots.txt: https://www.amazon.com/gp/cart/...
INFO | Skipped (robots.txt): https://www.amazon.com/gp/cart/...
```

Multiple URLs correctly blocked based on Amazon's robots.txt Allow/Disallow rules.
What This PR Does ✅
What This PR Does NOT Do ❌
- `wait_if_needed_for_url()` method exists but is not called
- `arun_many()` API has no per-URL hooks for delay enforcement

Rationale: This PR focuses on ethical URL validation (respecting the site owner's Allow/Disallow rules). Crawl delays are a performance optimization that Crawl4AI already handles via concurrency limits and internal rate limiting. Adding explicit delays would severely degrade performance without significant benefit.
Implementation
Core Service (`robots_checker.py`, ~400 lines): `can_fetch(url)`, `wait_if_needed_for_url(url)` (available but not auto-called)

Integration Points (4 total):

- `crawling_service.py` - llms.txt linked files validation
- `crawling_service.py` - Extracted links validation
- `crawling_service.py` - Sitemap URLs validation
- `recursive.py` - URL queue validation (when adding discovered links)

Configuration (`config.py`): `get_crawler_config()` with validation for numeric env vars

User-Agent Update (`crawler_manager.py`): `"Archon-Crawler/0.1.0 (+repo_url)"`

Resource Management (`knowledge_api.py`): `close()` calls in finally blocks (currently no-op with shared client)

Discovery (`discovery_service.py`): removed robots.txt from the discovery file list (used for validation, not content)

Dependencies
Added:
- `protego>=0.3.1` - Fast, RFC 9309 compliant robots.txt parser with wildcard support

Updated:

- `crawl4ai: 0.7.4 -> 0.7.6` - Latest bug fixes (unrelated to robots.txt, just keeping dependencies current)

Not Added:
Configuration
Environment variables (all optional with validated defaults):
Testing
Validated locally and in production:
Performance
Architecture
Future Work
If crawl-delay enforcement becomes a requirement:
Closes #275
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores