
feat: Implement priority-based automatic discovery of llms.txt and sitemap.xml files #622

Merged
leex279 merged 23 commits into main from feature/automatic-discovery-llms-sitemap-430 on Oct 19, 2025
Conversation

@leex279 (Collaborator) commented Sep 8, 2025

Summary

Implements automatic discovery and parsing of llms.txt, sitemap.xml, and related files with intelligent priority-based selection to enhance crawling efficiency.

Key Features

  • Single-file priority selection: Selects the BEST file from all categories instead of crawling multiple redundant files (see the sketch after this list)
  • Smart priority order: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
  • Robots.txt priority: Sitemap declarations in robots.txt have highest priority (explicitly declared)
  • Efficient crawling: When discovery finds a file, crawls ONLY that file instead of the main URL
  • Fallback support: Checks subdirectories for llms files (/static/, /public/, etc.)
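
A minimal sketch of the selection loop (hypothetical names; the actual DiscoveryService also honors robots.txt sitemap declarations and subdirectory fallbacks):

import requests
from urllib.parse import urljoin

# Candidates in descending priority; the first one that exists wins.
PRIORITY_ORDER = ["llms-full.txt", "llms.txt", "llms.md", "llms.mdx", "sitemap.xml", "robots.txt"]

def discover_best_file(base_url: str) -> str | None:
    """Return the single best guidance file for base_url, or None if none exist."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    for name in PRIORITY_ORDER:
        candidate = urljoin(base, name)
        try:
            resp = requests.head(candidate, timeout=10, allow_redirects=True)
        except requests.RequestException:
            continue
        if resp.status_code == 200:
            return candidate
    return None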

Technical Implementation

  • DiscoveryService: New service implementing priority-based discovery with HTTP existence checking
  • Enhanced URLHandler: Added helper methods for discovery file type detection
  • Integrated orchestration: Modified CrawlingService to use discovered files instead of recursive crawling
  • Progress tracking: Added discovery stage to progress mapping with clear user feedback
  • Comprehensive tests: Full test coverage for priority logic and edge cases

Results

Before: Discovered 29 files and crawled ALL (1 main URL + 28 discovered files)
After: Discovers best file and crawls ONLY that (1 file total)

Example: https://ai.pydantic.dev/ now crawls only https://ai.pydantic.dev/llms-full.txt instead of recursively crawling 50+ pages plus multiple discovery files.

Example Output

Here's what the discovery process looks like in action:

2025-09-20 07:38:03 | src.server.services.crawling.discovery_service | INFO | No robots.txt found: HTTP 404
2025-09-20 07:38:03 | src.server.services.crawling.discovery_service | INFO | Discovery found best file: https://ai.pydantic.dev/llms-full.txt
2025-09-20 07:38:03 | src.server.services.crawling.strategies.single_page | INFO | Crawling markdown file: https://ai.pydantic.dev/llms-full.txt

[COMPLETE] ● Database backup created at: /root/.crawl4ai/crawl4ai.db.backup_20250920_073803
[INIT].... → Starting database migration...
[COMPLETE] ● Migration completed. 0 records processed.
[FETCH]... ↓ https://ai.pydantic.dev/llms-full.txt                | ✓ | ⏱: 1.14s
[SCRAPE].. ◆ https://ai.pydantic.dev/llms-full.txt                | ✓ | ⏱: 0.07s
[COMPLETE] ● https://ai.pydantic.dev/llms-full.txt                | ✓ | ⏱: 2.24s

Notice how the discovery service:

  1. First checks for robots.txt (not found)
  2. Discovers the best available file (llms-full.txt)
  3. Crawls ONLY that single file instead of the entire website

This results in dramatically faster crawling times and more focused content extraction.

Test plan

  • Unit tests pass for all discovery logic
  • Integration tests verify end-to-end discovery flow
  • Real-world testing with docs.anthropic.com and ai.pydantic.dev
  • Verified single-file selection works correctly
  • Confirmed no recursive crawling when discovery file found

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Automatic discovery of guidance files with prioritized selection, same-domain/subdomain link following, and finer discovery progress stage; UI shows discovered file and linked files with client-side URL validation.
    • Progress reporting accepts explicit start/end bounds and preserves progress when following discovery results.
  • Bug Fixes / Reliability

    • Discovery failures are non-fatal and fall back to regular crawling; safer hostname/IP validation and improved domain filtering.
  • Tests

    • Expanded unit and integration tests covering discovery, domain/subdomain matching, progress mapping, link-following, and URL handling.

…xml files

- Add DiscoveryService with single-file priority selection
  - Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
  - All files contain similar AI/crawling guidance, so only best one is needed
  - Robots.txt sitemap declarations have highest priority
  - Fallback to subdirectories for llms files

- Enhance URLHandler with discovery helper methods
  - Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
  - Follow existing patterns with proper error handling
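
Illustrative shapes of these helpers (a sketch only; the real implementations live in url_handler.py):

from urllib.parse import urlparse

def is_robots_txt(url: str) -> bool:
    # Root-only semantics: matches /robots.txt at the top level only.
    return urlparse(url).path.lower() == "/robots.txt"

def is_llms_variant(url: str) -> bool:
    filename = urlparse(url).path.rsplit("/", 1)[-1].lower()
    return filename in {"llms-full.txt", "llms.txt", "llms.md", "llms.mdx", "llms.markdown"}

def is_well_known_file(url: str) -> bool:
    return urlparse(url).path.lower().startswith("/.well-known/")

def get_base_url(url: str) -> str:
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}"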

- Integrate discovery into CrawlingService orchestration
  - When discovery finds file: crawl ONLY discovered file (not main URL)
  - When no discovery: crawl main URL normally
  - Fixes issue where both main URL + discovered file were crawled

- Add discovery stage to progress mapping
  - New "discovery" stage in progress flow
  - Clear progress messages for discovered files

- Comprehensive test coverage
  - Tests for priority-based selection logic
  - Tests for robots.txt priority and fallback behavior
  - Updated existing tests for new return formats

Resolves efficient crawling by selecting single best guidance file instead
of crawling redundant content from multiple similar files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (Bot) commented Sep 8, 2025

Walkthrough

Adds a new DiscoveryService and integrates discovery-driven crawling into CrawlingService, adds domain-filtering helpers and discovery-aware link crawling, introduces a "discovery" progress stage, new URLHandler helpers, frontend types/UI for discovered files, tests, and a tldextract dependency.

Changes

Cohort / File(s) Summary
Discovery Service
python/src/server/services/crawling/discovery_service.py
New DiscoveryService with `discover_files(base_url) -> str | None`; priority-based discovery with HTTP existence checking.
Crawling Service Integration
python/src/server/services/crawling/crawling_service.py
Wires DiscoveryService into crawl flow; adds get_root_domain(host) (uses tldextract), _is_same_domain and _is_same_domain_or_subdomain; discovery orchestration (skip already-discovery-targets, follow discovered file, mark discovery targets, set original_domain); discovery-aware link extraction, domain filtering, and extended progress updates; discovery failures fall back to regular crawling.
URL Handler Updates
python/src/server/services/crawling/helpers/url_handler.py
Type-hint modernizations and new static helpers: is_robots_txt(), is_llms_variant(), is_well_known_file(), get_base_url(); minor import/formatting tweaks.
Progress Tracking
python/src/server/services/crawling/progress_mapper.py, python/tests/progress_tracking/test_progress_mapper.py
Adds "discovery" stage (3–4%), shifts "crawling" to (4–15%); adds aliases code_storage/extracting -> code_extraction; adds "complete" alias treated like "completed"; updates/extends tests for new stage and aliases.
Strategy Validation
python/src/server/services/crawling/strategies/single_page.py
Validates start_progress and end_progress types/ranges/order for crawl_markdown_file, raising ValueError on invalid inputs.
Runtime Dependency
python/pyproject.toml
Adds tldextract>=5.0.0 to server/all dependencies.
Backend Tests
python/tests/test_discovery_service.py, python/tests/test_url_handler.py, python/tests/test_crawling_service_subdomain.py, python/tests/test_llms_txt_link_following.py
New/updated tests covering discovery priority/parsing, SSRF handling, URLHandler detection helpers, subdomain matching, llms.txt link-following and filtering, and discovery integration.
Frontend Progress UI
archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
Adds discovery UI: displays discovered_file (with optional type) and linked_files; integrates discovery into active progress rendering.
Frontend Types & Utilities
archon-ui-main/src/features/progress/types/progress.ts, archon-ui-main/src/features/progress/utils/urlValidation.ts
Adds discovery to ProgressStatus; expands CrawlType with llms/discovery variants; augments progress response types with discovered/linked fields; adds isValidHttpUrl() utility for safe link rendering.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Crawling as CrawlingService
    participant Discovery as DiscoveryService
    participant Strategy as Strategy (single_page)
    participant URLH as URLHandler
    participant HTTP as HTTP/Network

    User->>Crawling: crawl_markdown_file(url, start_progress, end_progress)
    Crawling->>Discovery: discover_files(base_url)
    par Discovery checks
        Discovery->>HTTP: GET/HEAD candidate URLs (llms, sitemap, robots, .well-known)
        HTTP-->>Discovery: responses (200/404/redirect)
        Discovery->>Discovery: validate hostname/IP, parse robots/meta/sitemap
    end
    alt discovery found
        Discovery-->>Crawling: discovered_file_url
        Crawling->>Crawling: _is_same_domain_or_subdomain(discovered_file_url, base)
        Crawling->>Strategy: crawl_markdown_file(discovered_file_url, progress window)
        Strategy->>HTTP: GET discovered_file_url
        HTTP-->>Strategy: content
        Strategy->>URLH: extract_markdown_links(content)
        URLH-->>Crawling: candidate_links[]
        Crawling->>Crawling: filter same-domain links
        loop linked files (same domain)
            Crawling->>Strategy: crawl_batch_with_progress(link, subrange)
        end
    else no discovery
        Crawling->>Strategy: crawl_markdown_file(original_url, progress window)
    end
    Crawling-->>User: crawl_results (discovered_file?, linked_files?)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • coleam00

Poem

🐰 I hopped through sitemaps, llms files in tow,
Found a single file where guidance would grow.
I filtered by domain, then followed the trail,
Progress numbers climbed as links told their tale,
Rabbit dances — discovery done, onward we go!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "feat: Implement priority-based automatic discovery of llms.txt and sitemap.xml files" is concise, specific, and directly relates to the core feature being implemented. The title accurately captures the main objective of the PR—adding a discovery mechanism to automatically detect and prioritize llms.txt and sitemap.xml files. While the PR also includes supporting changes like progress tracking updates and test coverage, the title focuses on the primary change, which aligns with the principle that titles don't need to cover every detail. The title is clear enough that a reviewer scanning the history would immediately understand the key contribution.
Description Check ✅ Passed The pull request description is comprehensive and covers the essential elements needed to understand the PR, though it deviates somewhat from the exact template structure. It includes a clear summary, key features, technical implementation details, measurable results, and test verification information organized into well-labeled sections. The description demonstrates the impact of the changes with concrete examples (e.g., ai.pydantic.dev crawl reduction). While the description doesn't strictly follow the template's checkbox format for "Type of Change" and "Affected Services" sections, and lacks formal "Test Evidence" code blocks, the critical content is present and informative. The testing information is provided as a "Test plan" section with checkmarks rather than following the exact template structure, and the description mentions unit tests (11/11 passing), integration tests (23/23 passing), and real-world verification. These are substantive pieces of information that communicate testing coverage, even if not in the precise template format.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 46ae553 and 35c9ea9.

📒 Files selected for processing (1)
  • python/tests/test_llms_txt_link_following.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/tests/test_llms_txt_link_following.py


…ration

- Fix discovery service tests to match new single-file return format
- Remove obsolete tests for removed discovery methods
- Update progress mapper tests for new discovery stage ranges
- Fix stage range expectations after adding discovery stage (2,3)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279 (Collaborator, Author) commented Sep 8, 2025

@coderabbitai review

@coderabbitai (Bot) commented Sep 8, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai (Bot) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
python/src/server/services/crawling/helpers/url_handler.py (1)

383-409: Over-broad 'full' exclusion will misclassify filenames (e.g., helpful.md).

Checking if 'full' is a substring will exclude unrelated names like helpful.md or meaningful-resources.txt.

Use a token-aware pattern:

-                if 'full' not in filename:
+                import re
+                # exclude only when 'full' is a standalone token (delimited by start/end, dot, dash, or underscore)
+                if not re.search(r'(^|[._-])full([._-]|$)', filename):

And similarly in the content-based branch:

-                if 'full' in filename:
+                import re
+                if re.search(r'(^|[._-])full([._-]|$)', filename):
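
A quick standalone check of the token-aware pattern ("fullstack.md" is a hypothetical name where the naive substring test and the token test disagree):

import re

TOKEN_FULL = re.compile(r'(^|[._-])full([._-]|$)')

assert TOKEN_FULL.search("llms-full.txt")        # genuine "full" variant
assert not TOKEN_FULL.search("fullstack.md")     # contains "full", but not as a standalone token
assert "full" in "fullstack.md"                  # the naive substring check would wrongly match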
python/src/server/services/crawling/crawling_service.py (2)

138-148: Prevent progress regressions: map crawling-stage progress via ProgressMapper.

_create_crawl_progress_callback writes raw progress for the “crawling” stage. If a prior update set a higher percentage, the UI can appear to go backwards. Map stage-local progress to overall progress, like you already do for document storage and code extraction.

Apply this diff:

-                # Update progress via tracker (stores in memory for HTTP polling)
-                await self.progress_tracker.update(
-                    status=base_status,
-                    progress=progress,
-                    log=message,
-                    **kwargs
-                )
+                # Update progress via tracker (stage-mapped to avoid regressions)
+                mapped = self.progress_mapper.map_progress(base_status, progress)
+                await self.progress_tracker.update(
+                    status=base_status,
+                    progress=mapped,
+                    log=message,
+                    **kwargs
+                )
                 safe_logfire_info(
-                    f"Updated crawl progress | progress_id={self.progress_id} | status={base_status} | progress={progress} | "
+                    f"Updated crawl progress | progress_id={self.progress_id} | status={base_status} | progress={mapped} | "
                     f"total_pages={kwargs.get('total_pages', 'N/A')} | processed_pages={kwargs.get('processed_pages', 'N/A')}"
                 )

645-686: Honor “crawl only the discovered file” by gating link-follow behavior for llms/link-collection files.

Current logic will batch-crawl all extracted links from an llms/markdown link collection, which contradicts the PR objective (“crawls only that file” when discovery succeeds). Gate link following behind a request flag and default it off for discovery targets.

Apply this diff to skip link crawling when the target came from discovery unless explicitly enabled:

-                if self.url_handler.is_link_collection_file(url, content):
+                if self.url_handler.is_link_collection_file(url, content):
+                    # If this file was selected by discovery and the caller prefers single-file mode,
+                    # do not follow links contained within the file.
+                    if request.get("is_discovery_target") and request.get("crawl_only_discovery_file", True):
+                        logger.info(f"Discovery-only mode: skipping link extraction for {url}")
+                        return crawl_results, "link_collection_text"
                     # Extract links from the content
                     extracted_links = self.url_handler.extract_markdown_links(content, url)
@@
-                    # Filter out binary files (PDFs, images, archives, etc.) to avoid wasteful crawling
+                    # Filter out binary files (PDFs, images, archives, etc.) to avoid wasteful crawling
                     if extracted_links:
                         original_count = len(extracted_links)
                         extracted_links = [link for link in extracted_links if not self.url_handler.is_binary_file(link)]
                         filtered_count = original_count - len(extracted_links)
                         if filtered_count > 0:
                             logger.info(f"Filtered out {filtered_count} binary files from {original_count} extracted links")
+                        # Deduplicate to reduce redundant work
+                        extracted_links = list(dict.fromkeys(extracted_links))

If you prefer, I can open a follow-up PR to add an integration test asserting that with auto_discovery=True and default flags, only the discovered file is crawled. Do you want me to draft that?

🧹 Nitpick comments (19)
python/src/server/services/crawling/helpers/url_handler.py (4)

307-315: Trim trailing quotes in _clean_url.

Markdown links with titles can leave a trailing quote on the URL; current rstrip doesn’t remove '"' or "'".

Apply:

-                cleaned = u.strip().rstrip('.,;:)]>')
+                cleaned = u.strip().rstrip('.,;:)]>\'"')

317-325: Use the compiled regex’ finditer for clarity.

Works either way, but calling finditer on the compiled pattern is clearer and avoids confusion with flags.

-            for match in re.finditer(combined_pattern, content):
+            for match in combined_pattern.finditer(content):

333-350: Normalize scheme/host to lowercase before de-dup to reduce false duplicates.

Same URL differing only by host case will currently appear twice.

-                if url.startswith(('http://', 'https://')):
-                    urls.append(url)
+                if url.startswith(('http://', 'https://')):
+                    try:
+                        p = urlparse(url)
+                        norm = f"{p.scheme.lower()}://{p.netloc.lower()}{p.path or ''}"
+                        if p.query:
+                            norm += f"?{p.query}"
+                        urls.append(norm)
+                    except Exception:
+                        urls.append(url)

631-651: is_well_known_file: path.count check is unnecessary.

A single startswith is sufficient; multiple occurrences are pathological. The current check works; this is just a simplification opportunity.

-            return path.startswith('/.well-known/') and path.count('/.well-known/') == 1
+            return path.startswith('/.well-known/')
python/tests/progress_tracking/test_progress_mapper.py (4)

26-28: Include 'discovery' in the sequential range validation.

This test skips the new stage, so it won’t catch ordering/regression issues around it.

-        crawl_stages = ["starting", "analyzing", "crawling", "processing",
+        crawl_stages = ["starting", "analyzing", "discovery", "crawling", "processing",
                        "source_creation", "document_storage", "code_extraction",
                        "finalization", "completed"]

51-53: Update comment to match new crawling range (3–5%).

Keep docs aligned with behavior.

-        # Should be halfway between crawling range (2-5%)
-        expected = 2 + (50 / 100) * (5 - 2)  # 3.5%, rounded to 4
+        # Should be halfway between crawling range (3-5%)
+        expected = 3 + (50 / 100) * (5 - 3)  # 4.0%, rounded to 4
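
For context, the mapping this test exercises is plain linear interpolation of stage-local progress into the stage's global range (a sketch, not the actual ProgressMapper API):

def map_progress(stage_start: float, stage_end: float, local_pct: float) -> int:
    """Map a stage-local 0-100% value into the stage's overall range."""
    local_pct = max(0.0, min(100.0, local_pct))  # clamp out-of-range input
    return round(stage_start + (local_pct / 100.0) * (stage_end - stage_start))

# Matches the arithmetic above: halfway through a 3-5% stage lands at 4%.
assert map_progress(3, 5, 50) == 4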

186-189: Add 'discovery' to the realistic sequence.

Covers the new stage and ensures monotonic mapping across it.

         stages = [
             ("starting", 0, 0),
             ("analyzing", 100, 2),
+            ("discovery", 100, 3),
             ("crawling", 100, 5),

88-95: Clamping test: also assert lower bound equals start of crawling (3) after discovery.

Good update; consider clarifying the upper-bound comment too (nit only).

-        # Test progress over 100
+        # Test progress over 100 (caps at end of crawling range)
python/src/server/services/crawling/discovery_service.py (3)

8-8: Add missing import for urlparse used in HTML parsing fix.

Apply this diff:

-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse

197-205: Consider HEAD with GET fallback and consistent, centralized timeouts.

Reduce bandwidth/latency and unify timeouts via constants (e.g., REQUEST_TIMEOUT=10). Not blocking.

I can send a small refactor that tries HEAD first (fallback to GET on 405/4xx) and hoists timeouts to class-level constants.


89-95: Validate meta-discovered URLs before returning.

As a safety net, check existence for the chosen HTML-derived URL (same as robots branch). Not blocking.

Happy to add a small check with _check_url_exists(best_file) here as well.

python/tests/test_url_handler.py (1)

154-185: Add llms-full.txt coverage.

The discovery logic prioritizes llms-full.txt, but this test suite doesn’t assert it as an llms variant.

Apply this diff:

         # All llms variants
+        assert handler.is_llms_variant("https://example.com/llms-full.txt") is True
         assert handler.is_llms_variant("https://example.com/llms.txt") is True
         assert handler.is_llms_variant("https://example.com/llms.md") is True
         assert handler.is_llms_variant("https://example.com/llms.mdx") is True
         assert handler.is_llms_variant("https://example.com/llms.markdown") is True
python/tests/test_discovery_service.py (3)

93-115: Consider asserting existence verification when robots declares a sitemap.

If we adopt existence checks, add a negative case where the declared sitemap 404s and discovery proceeds to next priority (e.g., llms.txt).

I can provide a test variant if you want it included in this PR.


166-185: Add a robots.txt test with relative, mixed-case Sitemap entries.

Ensures we support “Sitemap: /SiteMap.XML” and preserve URL case.

Proposed test snippet available on request.


205-229: HTML meta parsing test passes, but add a javascript: href negative to ensure filtering.

Prevents non-http(s) schemes from leaking into discovery results.

I can extend this test with a case that should be ignored.

python/src/server/services/crawling/crawling_service.py (4)

337-364: Discovery progress is fine; minor polish on message pluralization.

Message says “selected {file_count} best file” even when 0 or >1. Suggest pluralization for cleaner logs.

Apply this diff:

-                    file_count = len(discovered_urls)
-                    safe_logfire_info(f"Discovery selected {file_count} best file to crawl")
+                    file_count = len(discovered_urls)
+                    noun = "file" if file_count == 1 else "files"
+                    safe_logfire_info(f"Discovery selected {file_count} {noun} to crawl")

617-626: Tighten return type for clarity.

Annotate the concrete tuple element types to strengthen contracts and help static analysis.

Apply this diff:

-    async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple:
+    async def _crawl_by_url_type(self, url: str, request: dict[str, Any]) -> tuple[list[dict[str, Any]], str | None]:

703-719: Apply the same hygiene to sitemap URLs: filter binaries and deduplicate before batch crawl.

Some sitemaps include files that are effectively “binary” for our pipeline (e.g., PDFs). Also, deduping avoids wasted requests.

Apply this diff right after sitemap_urls = self.parse_sitemap(url):

             sitemap_urls = self.parse_sitemap(url)
 
             if sitemap_urls:
+                # Normalize sitemap URLs: remove binaries and dedupe
+                sitemap_urls = [u for u in sitemap_urls if not self.url_handler.is_binary_file(u)]
+                sitemap_urls = list(dict.fromkeys(sitemap_urls))
                 # Update progress before starting batch crawl

651-669: Unify logging approach (safe_logfire vs std logger).

This block uses logger.info while nearby code uses safe_logfire_info. Pick one for consistency (prefer the existing logfire wrappers if that’s your standard).

Also applies to: 670-686

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 012d2c5 and 43af7b7.

📒 Files selected for processing (7)
  • python/src/server/services/crawling/crawling_service.py (11 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (10 hunks)
  • python/src/server/services/crawling/progress_mapper.py (1 hunks)
  • python/tests/progress_tracking/test_progress_mapper.py (10 hunks)
  • python/tests/test_discovery_service.py (1 hunks)
  • python/tests/test_url_handler.py (1 hunks)
🔇 Additional comments (10)
python/src/server/services/crawling/progress_mapper.py (2)

18-20: Discovery stage insertion looks good.

Stage ordering and ranges are consistent with the new flow; crawling adjusted to (3, 5) is coherent.


25-32: Ignore this concern: no crawl code uses the "extracting" alias (all crawl progress uses "code_extraction"), and upload stages deliberately use the final "extracting" mapping.

Likely an incorrect or invalid review comment.

python/src/server/services/crawling/helpers/url_handler.py (2)

586-606: is_robots_txt: consider allowing query strings and uppercase safely (already OK) — LGTM.


652-673: get_base_url: LGTM.

Returns scheme+netloc and logs missing pieces; good defensive behavior.

python/tests/test_url_handler.py (3)

127-153: Robots tests look solid and explicit about root-only semantics.


186-212: .well-known tests clearly define leniency on path casing. LGTM.


213-243: Base URL extraction tests are comprehensive and well-targeted.

python/tests/test_discovery_service.py (2)

230-246: Priority constant assertions align with PR intent. LGTM.


10-41: Discovery happy-path coverage looks good and focused on single-file return semantics.

Also applies to: 58-86, 116-144, 146-165, 247-265

python/src/server/services/crawling/crawling_service.py (1)

20-20: DiscoveryService import looks good.

Import path and usage align with the new discovery integration.

Comment on lines +87 to 88
self.discovery_service = DiscoveryService()


💡 Verification agent

🧩 Analysis chain

Avoid blocking the event loop during discovery; wrap sync discovery or make it async.

If DiscoveryService.discover_files performs network I/O synchronously (e.g., via requests), calling it directly inside an async flow will block the event loop and throttle all concurrent crawls. Either expose an async variant or run the sync call in a thread.

Apply this diff to safely offload discovery to a thread without changing the service API:

-        self.discovery_service = DiscoveryService()
+        self.discovery_service = DiscoveryService()
-                    discovered_file = self.discovery_service.discover_files(url)
+                    # Offload potential sync I/O to a worker thread to avoid blocking the event loop.
+                    discovered_file = await asyncio.to_thread(self.discovery_service.discover_files, url)

Run this script to confirm whether DiscoveryService is sync and uses blocking I/O:

Also applies to: 344-355


🏁 Script executed:

#!/bin/bash
# Inspect discovery implementation for sync I/O usage
rg -nC2 -g '!**/dist/**' -g '!**/build/**' $'class\\s+DiscoveryService\\b|def\\s+discover_files\\b|import\\s+requests|import\\s+httpx|async\\s+def\\s+discover_files' python/src

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Locate DiscoveryService definition and inspect its discover_files method
rg -nC2 "class\s+DiscoveryService" python/src
rg -nC2 "def\s+discover_files" python/src

Length of output: 864


🏁 Script executed:

#!/bin/bash
# Search for sync HTTP calls within DiscoveryService
rg -nC3 -e "import requests" -e "requests\." -e "import httpx" -e "httpx\." -e "import aiohttp" -e "aiohttp\." python/src/server/services/crawling/discovery_service.py

Length of output: 4969


Offload blocking DiscoveryService.discover_files to a worker thread
DiscoveryService.discover_files makes multiple synchronous requests.get calls (e.g. URL checks, robots.txt, sitemaps, well-known files), which will block the async event loop when invoked directly; wrap each call site in await asyncio.to_thread(self.discovery_service.discover_files, url) to prevent throttling concurrent crawls (apply in python/src/server/services/crawling/crawling_service.py at lines 87–88 and 344–355).

🤖 Prompt for AI Agents
In python/src/server/services/crawling/crawling_service.py around lines 87-88
and 344-355, DiscoveryService.discover_files performs blocking synchronous
requests and must be offloaded to a worker thread; replace direct synchronous
calls to self.discovery_service.discover_files(...) with await
asyncio.to_thread(self.discovery_service.discover_files, url) (or pass the same
args/kwargs) at each call site and ensure asyncio is imported at the top of the
module so the event loop is not blocked during concurrent crawls.
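
A standalone sketch of the offloading pattern the review recommends (illustrative names, not the project's actual functions):

import asyncio
import requests

def check_url_sync(url: str) -> bool:
    # Blocking I/O: fine inside a worker thread, never directly on the event loop.
    try:
        return requests.head(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

async def check_url_without_blocking(url: str) -> bool:
    # asyncio.to_thread runs the sync call in the default thread pool,
    # so other coroutines keep making progress while the request is in flight.
    return await asyncio.to_thread(check_url_sync, url)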

- Fix progress regression: map crawl callback progress through ProgressMapper
  - Prevents UI progress bars from jumping backwards
  - Ensures consistent progress reporting across all stages

- Add same-domain filtering for discovered file link following
  - Discovery targets (llms.txt) can follow links but only to same domain
  - Prevents external crawling while preserving related AI guidance
  - Add _is_same_domain() method for domain comparison

- Fix filename filtering false positives with regex token matching
  - Replace substring 'full' check with token-aware regex pattern
  - Prevents excluding files like "helpful.md" or "meaningful.txt"
  - Only excludes actual "full" variants like "llms-full.txt"

- Add llms-full.txt to URLHandler detection patterns
  - Support for highest priority discovery file format
  - Ensures proper file type detection for link following logic

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279 (Collaborator, Author) commented Sep 8, 2025

@coderabbitai review

@coderabbitai (Bot) commented Sep 8, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/crawling_service.py (2)

716-731: Batch-crawling extracted links contradicts “single-file” behavior unless gated

Ensure the batch crawl only runs when follow_links_from_discovery=True or when not a discovery target.

Apply this diff (pairs with the previous change):

-                    if extracted_links:
+                    if extracted_links:
                         # Crawl the extracted links using batch crawling

No code change needed here if you apply the earlier gate that zeroes extracted_links. Otherwise, add an explicit guard:

-                    if extracted_links:
+                    if extracted_links and (not request.get("is_discovery_target") or request.get("follow_links_from_discovery", False)):

736-766: Offload blocking sitemap parsing onto a thread

The SitemapCrawlStrategy.parse_sitemap implementation uses requests.get(...) (blocking I/O) and will stall the event loop. In python/src/server/services/crawling/crawling_service.py replace:

-            sitemap_urls = self.parse_sitemap(url)
+            sitemap_urls = await asyncio.to_thread(self.parse_sitemap, url)
♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)

345-345: Do not block the event loop: offload synchronous discovery to a worker thread

DiscoveryService.discover_files(url) is likely synchronous and will block the async loop during network I/O; offload it.

Apply this diff:

-                    discovered_file = self.discovery_service.discover_files(url)
+                    # Offload potential sync I/O to avoid blocking the event loop.
+                    discovered_file = await asyncio.to_thread(self.discovery_service.discover_files, url)
🧹 Nitpick comments (1)
python/src/server/services/crawling/crawling_service.py (1)

591-612: Domain comparison should ignore port; consider subdomains as configurable

Comparing .netloc makes ports part of equality and will drop same-host different-port links. Use .hostname. Optionally, support subdomain inclusion.

Apply this diff:

-            from urllib.parse import urlparse
-            
-            url_domain = urlparse(url).netloc.lower()
-            base_netloc = urlparse(base_domain).netloc.lower()
-            
-            return url_domain == base_netloc
+            from urllib.parse import urlparse
+            u, b = urlparse(url), urlparse(base_domain)
+            url_host = (u.hostname or "").lower()
+            base_host = (b.hostname or "").lower()
+            return bool(url_host) and url_host == base_host

Optional: add a repo-wide setting (e.g., DISCOVERY_INCLUDE_SUBDOMAINS) to treat docs.example.com as same-domain as example.com.
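
The netloc-vs-hostname distinction is standard-library behavior worth seeing directly:

from urllib.parse import urlparse

u = urlparse("https://Example.com:3000/path")
print(u.netloc)    # Example.com:3000  (port kept, case preserved -> breaks equality)
print(u.hostname)  # example.com       (port stripped, lowercased)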

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43af7b7 and d2adc15.

📒 Files selected for processing (3)
  • python/src/server/services/crawling/crawling_service.py (12 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (8 hunks)
  • python/tests/test_url_handler.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/tests/test_url_handler.py
  • python/src/server/services/crawling/helpers/url_handler.py
🔇 Additional comments (5)
python/src/server/services/crawling/crawling_service.py (5)

20-20: LGTM: DiscoveryService import

Import placement and scope look good.


87-87: LGTM: DiscoveryService lifecycle

Single instance on the service is appropriate.


137-149: Progress mapping hookup looks correct

Mapped progress prevents regressions; logging is clear.


481-513: Progress wiring for code extraction looks correct

Mapped progress and selective kwargs forwarding are consistent with earlier stages.


681-704: Introduce a “follow_links_from_discovery” flag to enforce single-file crawling

  • When request["is_discovery_target"] is True and request.get("follow_links_from_discovery", False) is False, clear extracted_links to honor the “crawl only that file” policy.
  • If follow_links_from_discovery is True, apply the existing same-domain filter.
  • Document the new flag (defaulting to False) and clarify strict vs. curated-links behavior.

- Fix event loop blocking: Use asyncio.to_thread for discovery service calls
  - Prevents blocking main event loop during HTTP discovery requests
  - Maintains responsive async behavior throughout crawl process

- Add max_depth respect for same-domain link following
  - Use recursive crawling strategy instead of batch for depth control
  - Pass max_depth-1 to respect user's crawl depth setting
  - Fallback to batch crawling when depth limit reached

- Improve domain comparison for port tolerance
  - Use hostname instead of netloc to ignore port differences
  - Allows same-host different-port links (e.g., dev:3000 and api:8080)
  - More accurate same-domain detection

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279 (Collaborator, Author) commented Sep 8, 2025

@CodeRabbit review

@coderabbitai (Bot) commented Sep 8, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai (Bot) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
python/src/server/services/crawling/crawling_service.py (2)

338-347: Good: Discovery offloaded to a worker thread to avoid blocking the event loop.
This addresses prior concern about sync I/O in discovery throttling concurrent crawls.


388-393: Use discovered domain for downstream filtering, not the original URL.
Currently sets original_domain from the original URL, which breaks same-domain filtering when discovery points to a different host/subdomain.

-                discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+                discovery_request["original_domain"] = self.url_handler.get_base_url(discovered_url)
🧹 Nitpick comments (6)
python/src/server/services/crawling/crawling_service.py (6)

138-148: Progress mapping callback changes look good; consider making the factory sync.
The factory returns an async closure and does no awaits; making it def removes unnecessary scheduling overhead.

Apply this diff to make the factory synchronous:

-    async def _create_crawl_progress_callback(
+    def _create_crawl_progress_callback(
         self, base_status: str
     ) -> Callable[[str, int, str], Awaitable[None]]:
@@
-        async def callback(status: str, progress: int, message: str, **kwargs):
+        async def callback(status: str, progress: int, message: str, **kwargs):
             if self.progress_tracker:
@@
         return callback

Then drop await at call sites when passing the callback (e.g., lines 670, 727, 737, 777, 802).


360-365: Nit: logging grammar.
“selected {file_count} best file” reads oddly. Consider “selected {file_count} file(s) to crawl”.

- safe_logfire_info(f"Discovery selected {file_count} best file to crawl")
+ safe_logfire_info(f"Discovery selected {file_count} file(s) to crawl")

451-460: Confirm source_id/display_name consistency with effective source URL.
If UI/queries rely on these matching the stored documents’ source_url, consider deriving them from effective_source_url as well.

I can update source_id/display_name generation post-discovery selection if desired.


592-612: Domain check may be too strict; consider eTLD+1 or subdomain policy.
Equality on hostname rejects legitimate same-site links across subdomains (e.g., docs.example.com vs www.example.com). Depending on policy, compare eTLD+1 (tldextract) or allow subdomains of the base.

Example allowing subdomains without new deps:

-            url_host = (u.hostname or "").lower()
-            base_host = (b.hostname or "").lower()
-            return bool(url_host) and url_host == base_host
+            url_host = (u.hostname or "").lower()
+            base_host = (b.hostname or "").lower()
+            return bool(url_host) and (url_host == base_host or url_host.endswith("." + base_host))

If you prefer strict eTLD+1, use tldextract and compare registered_domain.
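
For the strict eTLD+1 option, tldextract's registered_domain already normalizes subdomains (minimal usage, assuming the new tldextract>=5.0.0 dependency):

import tldextract

# Both resolve to the same registered domain, so subdomains compare equal:
assert tldextract.extract("https://docs.example.com").registered_domain == "example.com"
assert tldextract.extract("https://example.com").registered_domain == "example.com"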


717-741: If link following is enabled, consider clamping depth for discovery-driven links.
Even when opt-in, default to depth=1 for discovery targets to prevent explosion.

-                        max_depth = request.get('max_depth', 2)  # Default depth 2
+                        max_depth = 1 if request.get('is_discovery_target') and not request.get('allow_recursive_from_discovery') else request.get('max_depth', 2)

775-781: Nit: Progress range consistency for sitemap crawl.
Elsewhere crawling maps to a 3–8 range; here start/end are 15/20. Align or let ProgressMapper handle, but be consistent to avoid UI jumps.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2adc15 and 77b0470.

📒 Files selected for processing (1)
  • python/src/server/services/crawling/crawling_service.py (11 hunks)
🔇 Additional comments (6)
python/src/server/services/crawling/crawling_service.py (6)

20-20: LGTM: DiscoveryService import added.


87-87: LGTM: Instantiate DiscoveryService in constructor.


482-497: LGTM: Progress mapped for code extraction and tracker updates.


512-513: LGTM: Heartbeat after code extraction.


657-676: Text-file detection and link-collection handling: good.
The llms/markdown path is clear and guarded.


384-406: Propagate effective_source_url for accurate metadata
Set effective_source_url to the actual crawled URL (discovered_url in the discovery branch, fallback to url), and use it for source_url when storing results to avoid mislabeling analytics/UX.

Comment on lines +692 to +704
                    # For discovery targets, only follow same-domain links
                    if extracted_links and request.get("is_discovery_target"):
                        original_domain = request.get("original_domain")
                        if original_domain:
                            original_count = len(extracted_links)
                            extracted_links = [
                                link for link in extracted_links
                                if self._is_same_domain(link, original_domain)
                            ]
                            domain_filtered_count = original_count - len(extracted_links)
                            if domain_filtered_count > 0:
                                safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")


💡 Verification agent

🧩 Analysis chain

Honor “crawl only the discovered file” policy by default; gate link following behind an explicit flag.
Current logic still follows same-domain links from discovery targets, which conflicts with the PR objective (“crawls only that file”). Make following links opt-in.

Minimal change to disable follow by default:

-                    if extracted_links and request.get("is_discovery_target"):
+                    if extracted_links and request.get("is_discovery_target"):
                         original_domain = request.get("original_domain")
                         if original_domain:
                             original_count = len(extracted_links)
                             extracted_links = [
                                 link for link in extracted_links 
                                 if self._is_same_domain(link, original_domain)
                             ]
                             domain_filtered_count = original_count - len(extracted_links)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")
+                        # Do not follow links from discovery files unless explicitly allowed
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info("Discovery mode: not following links from discovery file (set follow_links_from_discovery=True to enable).")
+                            extracted_links = []

🏁 Script executed:

#!/bin/bash
set -euxo pipefail

# Print context around the discovery snippet
sed -n '650,840p' python/src/server/services/crawling/crawling_service.py

# Search for extracted_links usage
rg -n 'extracted_links' -C3 python/src/server/services/crawling/crawling_service.py

# Check for existing flags controlling link following
rg -n 'follow_links' -C3 python/src/server/services/crawling/crawling_service.py || echo "No follow_links flag found"

# Locate all usage of is_discovery_target
rg -n 'is_discovery_target' -C3 python/src/server/services/crawling/crawling_service.py

Length of output: 14145


Disable link crawling by default for discovery files
Insert a guard after the same-domain filter in crawling_service.py so that links are only followed when follow_links_from_discovery=True:

                     # For discovery targets, only follow same-domain links
                     if extracted_links and request.get("is_discovery_target"):
                         original_domain = request.get("original_domain")
                         if original_domain:
                             original_count = len(extracted_links)
                             extracted_links = [
                                 link for link in extracted_links 
                                 if self._is_same_domain(link, original_domain)
                             ]
                             domain_filtered_count = original_count - len(extracted_links)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")
+                        # Do not follow links from discovery files unless explicitly enabled
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info(
+                                "Discovery mode: not following links from discovery file "
+                                "(set follow_links_from_discovery=True to enable)."
+                            )
+                            extracted_links = []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

                    # For discovery targets, only follow same-domain links
                    if extracted_links and request.get("is_discovery_target"):
                        original_domain = request.get("original_domain")
                        if original_domain:
                            original_count = len(extracted_links)
                            extracted_links = [
                                link for link in extracted_links
                                if self._is_same_domain(link, original_domain)
                            ]
                            domain_filtered_count = original_count - len(extracted_links)
                            if domain_filtered_count > 0:
                                safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")

After:

                    # For discovery targets, only follow same-domain links
                    if extracted_links and request.get("is_discovery_target"):
                        original_domain = request.get("original_domain")
                        if original_domain:
                            original_count = len(extracted_links)
                            extracted_links = [
                                link for link in extracted_links
                                if self._is_same_domain(link, original_domain)
                            ]
                            domain_filtered_count = original_count - len(extracted_links)
                            if domain_filtered_count > 0:
                                safe_logfire_info(
                                    f"Discovery mode: filtered out {domain_filtered_count} external links, "
                                    f"keeping {len(extracted_links)} same-domain links"
                                )
                        # Do not follow links from discovery files unless explicitly enabled
                        if not request.get("follow_links_from_discovery", False):
                            logger.info(
                                "Discovery mode: not following links from discovery file "
                                "(set follow_links_from_discovery=True to enable)."
                            )
                            extracted_links = []
🤖 Prompt for AI Agents
In python/src/server/services/crawling/crawling_service.py around lines 692 to
704, after the same-domain filtering for discovery targets, add a guard that
disables following links unless request.get("follow_links_from_discovery") is
True; if the flag is missing or False, set extracted_links to an empty list (and
log that link crawling was skipped for discovery files) so downstream logic
won't follow any discovery links, otherwise leave extracted_links as filtered
and continue normally.

@TimothiousAI (Collaborator) commented

🧪 PR #622 Testing Results & Analysis

✅ What's Working Excellently

  1. Priority-based Discovery Logic: ✅ CONFIRMED
    - ai.pydantic.dev correctly returns llms-full.txt (highest priority)
    - Single-file selection works perfectly - no more crawling 29 files
    - Priority order respected: llms-full.txt > llms.txt > llms.md > sitemap.xml > robots.txt
  2. Unit Tests: ✅ ALL PASSING (11/11)
    - Priority order validation
    - Subdirectory fallback detection
    - Robots.txt sitemap parsing
    - Error handling scenarios
  3. Integration Tests: ✅ ALL PASSING (23/23)
    - Full crawl orchestration with discovery
    - Progress tracking integration
    - Cancellation handling

🎯 Key Issues Discovered

Issue #1: Supabase URL Construction Problem

Problem: Discovery doesn't find https://supabase.com/docs/llms.txt when base URL is provided as https://supabase.com/docs

Root Cause:

When base_url = "https://supabase.com/docs"

Discovery tries: https://supabase.com/llms.txt (wrong!)

But file exists at: https://supabase.com/docs/llms.txt

Impact: Discovery finds sitemap.xml instead of the llms.txt that actually exists.

Issue #2: Link Collection vs Single Content Confusion

Problem: Tim's concern about supabase.com/docs/llms.txt is valid:

Content of supabase llms.txt:

  • Supabase Guides

  • Supabase Reference (JavaScript)

  • [8 total links to other .txt files]

  • Detected as link collection: ✅ TRUE (8 links, high density)

  • Expected behavior: Should crawl each of the 8 linked .txt files

  • Possible issue: May not crawl the individual links due to domain restrictions or depth limits

🔧 Recommendations for Thomas

  1. URL Construction Fix (Priority: HIGH)

The discovery service needs to handle URL bases that already include paths better:

Current behavior:

urljoin("https://supabase.com/docs", "/llms.txt")
→ "https://supabase.com/llms.txt" (wrong!)

It should try both:

  1. urljoin(base_url, "/llms.txt") for the root level
  2. urljoin(base_url, "llms.txt") for the same-directory level
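
One caveat on option 2: urljoin drops the last path segment when the base lacks a trailing slash, so the same-directory probe only works after normalizing the base (standard-library behavior):

from urllib.parse import urljoin

urljoin("https://supabase.com/docs", "/llms.txt")   # 'https://supabase.com/llms.txt'
urljoin("https://supabase.com/docs", "llms.txt")    # 'https://supabase.com/llms.txt' (docs/ dropped!)
urljoin("https://supabase.com/docs/", "llms.txt")   # 'https://supabase.com/docs/llms.txt'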

  2. Same-Domain Link Crawling Verification (Priority: MEDIUM)

Confirm that when discovery finds a link collection file like Supabase's llms.txt, it actually crawls all 8 linked .txt files. The current domain filtering might be preventing this.

  3. Discovery Base URL Intelligence (Priority: LOW)

Consider detecting when a base URL already points to a docs section and adjusting discovery patterns accordingly.

🏆 Overall Assessment

PR Status: READY WITH MINOR FIXES

The core functionality is excellent and significantly improves crawling efficiency. The priority-based single-file selection is working perfectly. The main issue is a URL construction edge case that affects specific sites like Supabase.

Before/After Improvement:

  • Before: ai.pydantic.dev crawled 50+ pages + multiple discovery files
  • After: ai.pydantic.dev crawls only llms-full.txt ✅

The PR delivers exactly what it promises - intelligent single-file discovery that prevents redundant crawling.

● Recommendation: Ship this PR with the current functionality - it's a significant improvement. The Supabase URL construction issue can be addressed in a follow-up PR along with the advanced domain filtering features mentioned in issue #546.

- Resolved conflicts in progress_mapper.py to include discovery stage (3-4%)
- Resolved conflicts in crawling_service.py to maintain both discovery feature and main improvements
- Resolved conflicts in test_progress_mapper.py to include tests for discovery stage
- Kept all optimizations and improvements from main
- Maintained discovery feature functionality with proper integration
- Fixed duplicate else statement in crawling_service.py
- Corrected indentation for extracted links processing logic
- Syntax now validates successfully
@leex279 (Collaborator, Author) commented Sep 20, 2025

/coderabbit review

…mapper

The progress mapper uses Python's round() function, which rounds halves to the nearest even number (banker's rounding). Updated test assertions to match the actual rounding behavior:
- 3.5 rounds to 4 (not 3)
- 7.63 rounds to 8 (not 7)
- 9.5 rounds to 10 (not 9)

All tests now pass successfully.
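
Python's round-half-to-even behavior in brief:

round(3.5)   # 4  (tie: rounds to the even neighbor)
round(2.5)   # 2  (tie: also rounds to the even neighbor)
round(9.5)   # 10
round(7.63)  # 8  (no tie; ordinary nearest-integer rounding)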
leex279 marked this pull request as ready for review September 20, 2025 11:00
…ismatches

Two critical fixes for the automatic discovery feature:

1. Discovery Service path handling:
   - Changed from always using root domain (/) to respecting given URL path
   - e.g., for 'supabase.com/docs', now checks 'supabase.com/docs/robots.txt'
   - Previously incorrectly checked 'supabase.com/robots.txt'
   - Fixed all urljoin calls to use relative paths instead of absolute paths

2. Method signature mismatches:
   - Removed start_progress and end_progress parameters from crawl_batch_with_progress
   - Removed same parameters from crawl_recursive_with_progress
   - Fixed all calls to these methods to match the strategy implementations

These fixes ensure discovery works correctly for subdirectory URLs and prevents TypeError crashes during crawling.
@coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/helpers/url_handler.py (1)

281-309: Fix markdown link extraction: handle titles and avoid capturing them in the URL.

Current regex captures the entire "(url "title")" blob for text, producing invalid URLs (space + title). Tighten the md group and strip optional titles; also expand rstrip to drop stray quotes.

Apply this diff:

-            combined_pattern = re.compile(
-                r'\[(?P<text>[^\]]*)\]\((?P<md>[^)]+)\)'      # named: md
+            combined_pattern = re.compile(
+                r'\[(?P<text>[^\]]*)\]\(\s*(?P<md>[^)\s]+)(?:\s+["\'][^)]*["\'])?\s*\)'  # md URL + optional title
                 r'|<\s*(?P<auto>https?://[^>\s]+)\s*>'        # named: auto
                 r'|(?P<bare>https?://[^\s<>()\[\]"]+)'        # named: bare
                 r'|(?P<proto>//[^\s<>()\[\]"]+)'              # named: protocol-relative
                 r'|(?P<www>www\.[^\s<>()\[\]"]+)'             # named: www.* without scheme
             )
-                cleaned = u.strip().rstrip('.,;:)]>')
+                cleaned = u.strip().rstrip('.,;:)]>\'"')

Also applies to: 310-318

♻️ Duplicate comments (2)
python/src/server/services/crawling/crawling_service.py (2)

399-403: Use discovered file’s domain for filtering, not the original URL.

Setting original_domain from the original url can misfilter links when the discovered file is on a different host/subdomain. Use the discovered URL’s base.

-                discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+                discovery_request["original_domain"] = self.url_handler.get_base_url(discovered_url)

766-778: Do not follow links from discovery files by default; gate behind an explicit flag.

Current logic still follows same‑domain links from a discovered link‑collection, contradicting the PR goal (“crawl only the discovered file”). Add an opt‑in flag.

                     # For discovery targets, only follow same-domain links
                     if extracted_links and request.get("is_discovery_target"):
                         original_domain = request.get("original_domain")
                         if original_domain:
                             original_count = len(extracted_links)
                             extracted_links = [
                                 link for link in extracted_links 
                                 if self._is_same_domain(link, original_domain)
                             ]
                             domain_filtered_count = original_count - len(extracted_links)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")
+                        # Honor single-file policy unless explicitly enabled
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info("Discovery mode: not following links from discovery file "
+                                        "(set follow_links_from_discovery=True to enable).")
+                            extracted_links = []

@@
-                        # For discovery targets, respect max_depth for same-domain links
-                        max_depth = request.get('max_depth', 2) if request.get("is_discovery_target") else request.get('max_depth', 1)
+                        # For discovery targets, default to no further depth unless explicitly enabled
+                        max_depth = request.get('max_depth', 1)

Also applies to: 790-815

🧹 Nitpick comments (5)
python/src/server/services/crawling/helpers/url_handler.py (2)

355-364: Lower log level for high-volume extractor message.

“Extracted N unique links” is noisy at info level in large crawls. Prefer debug.

-            logger.info(f"Extracted {len(unique_urls)} unique links from content")
+            logger.debug(f"Extracted {len(unique_urls)} unique links from content")

386-414: Remove redundant local imports; rely on module-level imports.

Two inner-scoped import re statements are unnecessary.

-                import re
                 if not re.search(r'(^|[._-])full([._-]|$)', filename):
                     ...
-                import re
                 if re.search(r'(^|[._-])full([._-]|$)', filename):
                     ...

Also applies to: 415-435

python/tests/progress_tracking/test_progress_mapper.py (1)

5-5: Remove unused import to satisfy Ruff (F401).

pytest is not used in this file.

-import pytest
python/src/server/services/crawling/crawling_service.py (2)

814-826: Avoid scheduling batch crawl with an empty URL list.

If no links remain after filtering, skip the batch call.

-                    else:
-                        # Use normal batch crawling for non-discovery targets
-                        logger.info(f"Crawling {len(extracted_links)} extracted links from {url}")
-                        batch_results = await self.crawl_batch_with_progress(
-                            extracted_links,
-                            max_concurrent=request.get('max_concurrent'),  # None -> use DB settings
-                            progress_callback=await self._create_crawl_progress_callback("crawling"),
-                        )
+                    else:
+                        batch_results = []

659-676: Normalize hosts to treat “www.” subdomains as same-domain.

Exact host equality will drop legitimate links like www.example.com vs example.com.

-            url_host = (u.hostname or "").lower()
-            base_host = (b.hostname or "").lower()
-            return bool(url_host) and url_host == base_host
+            def norm(h: str) -> str:
+                h = (h or "").lower()
+                return h[4:] if h.startswith("www.") else h
+            url_host = norm(u.hostname)
+            base_host = norm(b.hostname)
+            return bool(url_host) and url_host == base_host
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77b0470 and 0a2c43f.

📒 Files selected for processing (4)
  • python/src/server/services/crawling/crawling_service.py (11 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (8 hunks)
  • python/src/server/services/crawling/progress_mapper.py (2 hunks)
  • python/tests/progress_tracking/test_progress_mapper.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting (errors, warnings, unused imports)

Files:

  • python/tests/progress_tracking/test_progress_mapper.py
  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/progress_mapper.py
python/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Write backend tests with Pytest (including async support)

Files:

  • python/tests/progress_tracking/test_progress_mapper.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: Never accept corrupted data in batch/continuation paths: skip failed items entirely (e.g., do not store zero embeddings, null FKs, or malformed JSON)
Use specific exception types; avoid catching bare Exception
Preserve full stack traces in logs (use exc_info=True with Python logging)
Never return None to indicate failure; raise an exception with details
Authentication/authorization failures must halt the operation and be clearly surfaced
Service startup, missing configuration, database connection, or critical dependency failures should crash fast with clear errors
During crawling/batch/background tasks and WebSocket events, continue processing other items but log failures with context
Include context (operation intent, relevant IDs/URLs/data) in error messages
Pydantic should raise on data corruption or validation errors; do not accept invalid inputs

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/progress_mapper.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch operations, report both success count and a detailed failure list
On external API calls, retry with exponential backoff and fail with a clear message after retries
Place business logic in service layer modules (API Route → Service → Database pattern)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/progress_mapper.py
python/src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use MyPy for type checking to ensure type safety

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
  • python/src/server/services/crawling/progress_mapper.py
🧬 Code graph analysis (2)
python/tests/progress_tracking/test_progress_mapper.py (1)
python/src/server/services/crawling/progress_mapper.py (9)
  • ProgressMapper (9-170)
  • map_progress (49-101)
  • calculate_stage_progress (107-121)
  • map_batch_progress (123-141)
  • map_with_substage (143-157)
  • reset (159-162)
  • get_current_stage (164-166)
  • get_current_progress (168-170)
  • get_stage_range (103-105)
python/src/server/services/crawling/crawling_service.py (6)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-441)
  • discover_files (44-101)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (210-276)
python/src/server/services/crawling/helpers/url_handler.py (3)
  • transform_github_url (179-208)
  • is_binary_file (79-176)
  • get_base_url (658-678)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (224-236)
  • safe_logfire_error (239-251)
python/src/server/services/crawling/strategies/recursive.py (1)
  • crawl_recursive_with_progress (36-331)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (33-273)
🔇 Additional comments (10)
python/src/server/services/crawling/helpers/url_handler.py (4)

591-611: LGTM: robots.txt detection is correct and root-scoped.


612-635: LGTM: llms variant detection matches stated priority set.


636-656: LGTM: .well-known guard limited to root path.


657-679: LGTM: base URL extraction behaves as expected (scheme + netloc).

python/src/server/services/crawling/progress_mapper.py (3)

21-21: LGTM: “complete” alias mirrors “completed”.


25-33: LGTM: discovery stage added with sensible 3–4% range; aliases for code_extraction are fine.


72-75: LGTM: completion handling updated to include “complete”.

python/tests/progress_tracking/test_progress_mapper.py (1)

32-39: LGTM: tests reflect discovery stage and adjusted ranges/rounding and alias coverage.

Also applies to: 44-55, 68-72, 113-136, 228-299

python/src/server/services/crawling/crawling_service.py (2)

348-376: Good: discovery offloaded via asyncio.to_thread to avoid blocking the loop.
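
The pattern being praised, as a self-contained sketch (blocking_discovery stands in for the service's synchronous discover_files; it is not the actual function):

import asyncio
import time

def blocking_discovery(url: str) -> str:
    # Stand-in for the synchronous discovery call (blocking network I/O).
    time.sleep(1)
    return f"{url.rstrip('/')}/llms-full.txt"

async def run_discovery(url: str) -> str:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # keeping the event loop responsive for other tasks.
    return await asyncio.to_thread(blocking_discovery, url)

print(asyncio.run(run_discovery("https://ai.pydantic.dev")))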


394-404: Propagate the opt-in flag for link following from discovery.

Pass through follow_links_from_discovery so downstream logic can honor it.

                 discovery_request = request.copy()
                 discovery_request["is_discovery_target"] = True
                 discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+                if "follow_links_from_discovery" in request:
+                    discovery_request["follow_links_from_discovery"] = bool(request["follow_links_from_discovery"])


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (7)
python/src/server/services/crawling/discovery_service.py (5)

20-42: Define DISCOVERY_TARGETS or delete helpers referencing it (runtime AttributeError).

Multiple methods index self.DISCOVERY_TARGETS, but the constant isn’t defined anywhere in this class/file. Any call into those helpers will crash.

Apply this diff adjacent to DISCOVERY_PRIORITY:

@@
-    DISCOVERY_PRIORITY = [
+    DISCOVERY_PRIORITY = [
         # LLMs files (highest priority - most comprehensive AI guidance)
         "llms-full.txt",
         "llms.txt", 
         "llms.md",
         "llms.mdx",
         "llms.markdown",
@@
         ".well-known/sitemap.xml"
     ]
+
+    # Per-category targets used by helper methods
+    DISCOVERY_TARGETS = {
+        "llms_files": ["llms-full.txt", "llms.txt", "llms.md", "llms.mdx", "llms.markdown"],
+        "sitemap_files": ["sitemap_index.xml", "sitemap-index.xml", "sitemap.xml"],
+        "robots_files": ["robots.txt"],
+        "well_known_files": [".well-known/ai.txt", ".well-known/llms.txt", ".well-known/sitemap.xml"],
+    }

58-64: Don’t short‑circuit to robots.txt sitemap without verifying reachability.

Iterate declared sitemaps, return the first that exists; otherwise continue discovery.

-            if robots_sitemaps:
-                best_file = robots_sitemaps[0]  # Use first sitemap from robots.txt
-                logger.info(f"Discovery found best file from robots.txt: {best_file}")
-                return best_file
+            if robots_sitemaps:
+                for sm in robots_sitemaps:
+                    if self._check_url_exists(sm):
+                        logger.info(f"Discovery found best file from robots.txt: {sm}")
+                        return sm
+                    logger.info(f"Declared sitemap not reachable: {sm}; continuing discovery.")

65-88: Fix Supabase/docs path edge case: also try same-directory candidates when base_url has a path.

Currently urljoin(base_url, filename) resolves against the origin root when base_url lacks a trailing "/" (the last path segment is treated as a file and dropped). Try dirname-based joins first to catch https://supabase.com/docs/llms.txt.

-            # Check files in global priority order
-            for filename in self.DISCOVERY_PRIORITY:
+            # Derive a same-directory base if base_url has a path (e.g., https://site.com/docs)
+            p = urlparse(base_url)
+            base_dir_url = None
+            if p.path and p.path != "/" and not p.path.endswith("/"):
+                dirname = p.path.rsplit("/", 1)[0]
+                base_dir_url = f"{p.scheme}://{p.netloc}{dirname}/"
+
+            # Check files in global priority order
+            for filename in self.DISCOVERY_PRIORITY:
                 # Try location relative to the given URL
                 file_url = urljoin(base_url, filename)
                 if self._check_url_exists(file_url):
                     logger.info(f"Discovery found best file: {file_url}")
                     return file_url
+                # Also try same-directory when applicable
+                if base_dir_url:
+                    same_dir_url = urljoin(base_dir_url, filename)
+                    if same_dir_url != file_url and self._check_url_exists(same_dir_url):
+                        logger.info(f"Discovery found best file in same directory: {same_dir_url}")
+                        return same_dir_url
@@
-                if filename.startswith('llms'):
+                if filename.startswith('llms'):
                     for subdir in ["static", "public", "docs", "assets", "doc", "api"]:
                         subdir_url = urljoin(base_url, f"{subdir}/{filename}")
                         if self._check_url_exists(subdir_url):
                             logger.info(f"Discovery found best file in subdirectory: {subdir_url}")
                             return subdir_url
+                        if base_dir_url:
+                            same_dir_sub = urljoin(base_dir_url, f"{subdir}/{filename}")
+                            if self._check_url_exists(same_dir_sub):
+                                logger.info(f"Discovery found best file in same-dir subdirectory: {same_dir_sub}")
+                                return same_dir_sub
@@
-                if filename.endswith('.xml') and not filename.startswith('.well-known'):
+                if filename.endswith('.xml') and not filename.startswith('.well-known'):
                     for subdir in ["sitemaps", "sitemap", "xml", "feed"]:
                         subdir_url = urljoin(base_url, f"{subdir}/{filename}")
                         if self._check_url_exists(subdir_url):
                             logger.info(f"Discovery found best file in subdirectory: {subdir_url}")
                             return subdir_url
+                        if base_dir_url:
+                            same_dir_sub = urljoin(base_dir_url, f"{subdir}/{filename}")
+                            if self._check_url_exists(same_dir_sub):
+                                logger.info(f"Discovery found best file in same-dir subdirectory: {same_dir_sub}")
+                                return same_dir_sub

233-241: Robots parsing: don’t lowercase URLs; support relative Sitemap paths.

Lowercasing corrupts path case; relative values are common.

-            # Parse robots.txt content for sitemap directives
-            for line in resp.text.splitlines():
-                line = line.strip().lower()
-                if line.startswith("sitemap:"):
-                    sitemap_url = line.split(":", 1)[1].strip()
-                    # Validate URL format before adding
-                    if sitemap_url and (sitemap_url.startswith('http://') or sitemap_url.startswith('https://')):
-                        sitemaps.append(sitemap_url)
-                        logger.info(f"Found sitemap in robots.txt: {sitemap_url}")
+            # Parse robots.txt content for sitemap directives
+            for raw_line in resp.text.splitlines():
+                line = raw_line.strip()
+                if line.lower().startswith("sitemap:"):
+                    sitemap_value = line.split(":", 1)[1].strip()
+                    if sitemap_value:
+                        sitemap_url = (
+                            sitemap_value if sitemap_value.lower().startswith(("http://", "https://"))
+                            else urljoin(resp.url, sitemap_value)
+                        )
+                        sitemaps.append(sitemap_url)
+                        logger.info(f"Found sitemap in robots.txt: {sitemap_url}")

320-341: HTML parsing: avoid lowercasing; match case‑insensitively and preserve href case.

Lowercasing HTML mangles URLs. Use re.IGNORECASE and validate scheme.

-            content = resp.text.lower()
-
-            # Look for sitemap meta tags or link elements
-            import re
-
-            # Check for <link rel="sitemap" href="...">
-            sitemap_link_pattern = r'<link[^>]*rel=["\']sitemap["\'][^>]*href=["\']([^"\']+)["\']'
-            matches = re.findall(sitemap_link_pattern, content)
+            content = resp.text
+            import re
+            from urllib.parse import urlparse
+
+            # <link rel="sitemap" href="..."> (case-insensitive)
+            sitemap_link_pattern = re.compile(r'<link[^>]*rel=["\']sitemap["\'][^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
+            matches = sitemap_link_pattern.findall(content)
             for match in matches:
-                sitemap_url = urljoin(base_url, match)
-                sitemaps.append(sitemap_url)
-                logger.info(f"Found sitemap in HTML link tag: {sitemap_url}")
+                sitemap_url = urljoin(resp.url, match)
+                if urlparse(sitemap_url).scheme in ("http", "https"):
+                    sitemaps.append(sitemap_url)
+                    logger.info(f"Found sitemap in HTML link tag: {sitemap_url}")
 
-            # Check for <meta name="sitemap" content="...">
-            sitemap_meta_pattern = r'<meta[^>]*name=["\']sitemap["\'][^>]*content=["\']([^"\']+)["\']'
-            matches = re.findall(sitemap_meta_pattern, content)
+            # <meta name="sitemap" content="..."> (case-insensitive)
+            sitemap_meta_pattern = re.compile(r'<meta[^>]*name=["\']sitemap["\'][^>]*content=["\']([^"\']+)["\']', re.IGNORECASE)
+            matches = sitemap_meta_pattern.findall(content)
             for match in matches:
-                sitemap_url = urljoin(base_url, match)
-                sitemaps.append(sitemap_url)
-                logger.info(f"Found sitemap in HTML meta tag: {sitemap_url}")
+                sitemap_url = urljoin(resp.url, match)
+                if urlparse(sitemap_url).scheme in ("http", "https"):
+                    sitemaps.append(sitemap_url)
+                    logger.info(f"Found sitemap in HTML meta tag: {sitemap_url}")
python/src/server/services/crawling/crawling_service.py (2)

390-395: Use discovered file’s domain for downstream filtering.

Setting original_domain from the original URL can mis-filter links when the discovery target is on a different host/subdomain.

-                discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+                discovery_request["original_domain"] = self.url_handler.get_base_url(discovered_url)

758-770: Honor “crawl only the discovered file” by default; gate link-following behind a flag.

Single-file policy shouldn’t auto-follow links from discovery files unless explicitly enabled.

                     # For discovery targets, only follow same-domain links
                     if extracted_links and request.get("is_discovery_target"):
                         original_domain = request.get("original_domain")
                         if original_domain:
                             original_count = len(extracted_links)
                             extracted_links = [
                                 link for link in extracted_links 
                                 if self._is_same_domain(link, original_domain)
                             ]
                             domain_filtered_count = original_count - len(extracted_links)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")
+                        # Do not follow links from discovery files unless explicitly enabled
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info(
+                                "Discovery mode: not following links from discovery file "
+                                "(set follow_links_from_discovery=True to enable)."
+                            )
+                            extracted_links = []
🧹 Nitpick comments (8)
python/src/server/services/crawling/discovery_service.py (6)

8-11: Import urlparse for path-aware joining.

Needed for same-directory discovery fix below.

-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse

201-209: Use HEAD with GET fallback, and tighter timeouts, for existence checks.

Reduces bandwidth and latency; fall back when HEAD is unsupported.

-        try:
-            resp = requests.get(url, timeout=5, allow_redirects=True)
-            success = resp.status_code == 200
-            logger.debug(f"URL check: {url} -> {resp.status_code} ({'exists' if success else 'not found'})")
-            return success
+        try:
+            # Try a lightweight HEAD first
+            resp = requests.head(url, timeout=(3, 5), allow_redirects=True)
+            if resp.status_code == 200:
+                logger.debug(f"URL check (HEAD): {url} -> {resp.status_code} (exists)")
+                return True
+            # Some servers disallow HEAD; fall back to GET
+            if resp.status_code in (405, 501):
+                g = requests.get(url, timeout=(5, 10), allow_redirects=True)
+                success = g.status_code == 200
+                logger.debug(f"URL check (GET fallback): {url} -> {g.status_code} ({'exists' if success else 'not found'})")
+                return success
+            logger.debug(f"URL check (HEAD): {url} -> {resp.status_code} (not found)")
+            return False

103-147: Also verify robots-declared sitemap reachability in _discover_best_sitemap.

Mirror the short‑circuit fix here to avoid dead entries.

-            robots_sitemaps = self._parse_robots_txt(base_url)
-            if robots_sitemaps:
-                return robots_sitemaps[0]  # Use first sitemap from robots.txt
+            robots_sitemaps = self._parse_robots_txt(base_url)
+            if robots_sitemaps:
+                for sm in robots_sitemaps:
+                    if self._check_url_exists(sm):
+                        return sm

159-177: Apply same-directory attempts in llms helper for consistency.

Keep helpers aligned with main discover_files behavior.

-            for filename in self.DISCOVERY_TARGETS["llms_files"]:
-                llms_url = urljoin(base_url, filename)
-                if self._check_url_exists(llms_url):
-                    return llms_url
+            for filename in self.DISCOVERY_TARGETS["llms_files"]:
+                llms_url = urljoin(base_url, filename)
+                if self._check_url_exists(llms_url):
+                    return llms_url
+                p = urlparse(base_url)
+                if p.path and p.path != "/" and not p.path.endswith("/"):
+                    dirname = p.path.rsplit("/", 1)[0]
+                    same_dir = f"{p.scheme}://{p.netloc}{dirname}/"
+                    llms_same = urljoin(same_dir, filename)
+                    if self._check_url_exists(llms_same):
+                        return llms_same

121-133: Apply same-directory attempts in sitemap helper for consistency.

-            for filename in self.DISCOVERY_TARGETS["sitemap_files"]:
-                sitemap_url = urljoin(base_url, filename)
-                if self._check_url_exists(sitemap_url):
-                    return sitemap_url
+            for filename in self.DISCOVERY_TARGETS["sitemap_files"]:
+                sitemap_url = urljoin(base_url, filename)
+                if self._check_url_exists(sitemap_url):
+                    return sitemap_url
+                p = urlparse(base_url)
+                if p.path and p.path != "/" and not p.path.endswith("/"):
+                    dirname = p.path.rsplit("/", 1)[0]
+                    same_dir = f"{p.scheme}://{p.netloc}{dirname}/"
+                    sitemap_same = urljoin(same_dir, filename)
+                    if self._check_url_exists(sitemap_same):
+                        return sitemap_same

55-101: Avoid catching bare Exception; narrow excepts where feasible.

These top-level catches hide programming errors. Keep RequestException handling, but reduce generic catches or re-raise after logging when not a network error.
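
A sketch of what a narrowed handler could look like for the URL existence check; the function name and timeout are placeholders, not the service's actual values:

import logging

import requests

logger = logging.getLogger(__name__)

def url_exists(url: str) -> bool:
    try:
        resp = requests.get(url, timeout=5, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException as exc:
        # Network failures are expected during discovery: log and move on.
        logger.debug(f"URL check failed for {url}: {exc}")
        return False

# Programming errors (TypeError, KeyError, ...) now propagate instead of
# being swallowed by a bare `except Exception`.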

python/src/server/services/crawling/crawling_service.py (2)

728-817: Optional: cap extracted links to prevent explosion.

Limit batch size from link-collection files (e.g., 200) to avoid OOM/DoS from huge lists.

-                    if extracted_links:
+                    if extracted_links:
+                        # Cap the number of links to crawl from a single file
+                        MAX_LINKS = int(request.get("max_links_per_file", 200))
+                        if len(extracted_links) > MAX_LINKS:
+                            logger.info(f"Truncating extracted links from {len(extracted_links)} to {MAX_LINKS}")
+                            extracted_links = extracted_links[:MAX_LINKS]

706-865: _crawl_by_url_type return typing (nit).

Consider annotating return type as tuple[list[dict[str, Any]], str | None] for MyPy clarity.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a2c43f and 7f74aea.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (7 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
python/src/server/services/crawling/crawling_service.py (6)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-442)
  • discover_files (44-101)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (210-276)
python/src/server/services/crawling/helpers/url_handler.py (3)
  • transform_github_url (179-208)
  • is_binary_file (79-176)
  • get_base_url (658-678)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (224-236)
  • safe_logfire_error (239-251)
python/src/server/services/crawling/strategies/recursive.py (1)
  • crawl_recursive_with_progress (36-331)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (33-273)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
🔇 Additional comments (4)
python/src/server/services/crawling/crawling_service.py (4)

87-88: Good: initialize DiscoveryService once.


183-194: LGTM: propagated new crawl_markdown_file signature.

Matches strategy’s API with start/end progress.


340-376: Good: offload sync discovery to a worker thread.

Prevents blocking the event loop.


340-409: Verification — repo scan shows no conflicting discovery flag or extra callers

  • No occurrences of "follow_links_from_discovery" found.
  • DISCOVERY_TARGETS is defined/used only in python/src/server/services/crawling/discovery_service.py.
  • discover_files is defined in discovery_service.py and direct invocations only appear in python/tests/test_discovery_service.py.

Conclusion: no other call sites appear to assume automatic link-follow from discovery; the new flag is not used elsewhere.

When a user directly provides a URL to a discovery file (sitemap.xml, llms.txt, robots.txt, etc.),
the system now skips the discovery phase and uses the provided file directly.

This prevents unnecessary discovery attempts and respects the user's explicit choice.

Changes:
- Check if the URL is already a discovery target before running discovery
- Skip discovery for: sitemap files, llms variants, robots.txt, well-known files, and any .txt files
- Add logging to indicate when discovery is skipped

Example: When crawling 'xyz.com/sitemap.xml' directly, the system will now use that sitemap
instead of trying to discover a different file like llms.txt
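
A hedged sketch of that guard, built from the url_handler helpers listed in this review's code graph (the surrounding control flow is an assumption):

def should_skip_discovery(url_handler, url: str) -> bool:
    # The URL already points at a discovery file, so discovery adds nothing.
    return (
        url_handler.is_sitemap(url)
        or url_handler.is_llms_variant(url)
        or url_handler.is_robots_txt(url)
        or url_handler.is_well_known_file(url)
        or url_handler.is_txt(url)
    )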

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/crawling_service.py (1)

770-831: Discovery target link filtering needs completion.

The same-domain filtering for discovery targets (lines 771-782) is correctly implemented, but the code is missing the guard to disable link crawling by default for discovery files. The past review comment identified this gap but it remains unaddressed.

Apply this fix after line 782:

                            domain_filtered_count = original_count - len(extracted_links)
                            if domain_filtered_count > 0:
                                safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")
+                        
+                        # Do not follow links from discovery files unless explicitly enabled
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info(
+                                "Discovery mode: not following links from discovery file "
+                                "(set follow_links_from_discovery=True to enable)."
+                            )
+                            extracted_links = []
♻️ Duplicate comments (1)
python/src/server/services/crawling/crawling_service.py (1)

87-88: Resolved: Async wrapper properly addresses event loop blocking.

The past review comment about blocking the event loop has been properly addressed with the asyncio.to_thread wrapper at line 360, which offloads the synchronous discovery calls to a worker thread.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f74aea and c1677a9.

📒 Files selected for processing (1)
  • python/src/server/services/crawling/crawling_service.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-442)
  • discover_files (44-101)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (210-276)
python/src/server/services/crawling/helpers/url_handler.py (8)
  • transform_github_url (179-208)
  • is_sitemap (20-37)
  • is_llms_variant (613-634)
  • is_robots_txt (592-610)
  • is_well_known_file (637-655)
  • is_txt (60-76)
  • is_binary_file (79-176)
  • get_base_url (658-678)
🔇 Additional comments (5)
python/src/server/services/crawling/crawling_service.py (5)

184-194: LGTM: Progress parameter additions enhance orchestration flexibility.

The start_progress and end_progress parameters are correctly passed through to the underlying strategy, allowing for better progress mapping during discovery-driven workflows.


340-387: Discovery orchestration implements single-file priority selection correctly.

The discovery phase properly:

  • Skips discovery for URLs that are already discovery targets
  • Uses asyncio.to_thread to avoid blocking the event loop
  • Selects only the single best discovered file
  • Filters binary files appropriately
  • Handles discovery failures gracefully by continuing with regular crawl

388-421: Discovery-target crawling properly marks context and uses correct domain.

The implementation correctly:

  • Marks the request with is_discovery_target: True for downstream filtering
  • Uses the original URL's domain for filtering (line 405) as previously recommended
  • Crawls only the discovered file instead of the main URL

663-683: Helper method implementation is robust.

The _is_same_domain method properly handles hostname parsing, normalization, and error cases with conservative fallback behavior.


795-827: Discovery-aware depth handling is correctly implemented.

The depth-aware crawling logic properly:

  • Respects max_depth for discovery targets with same-domain links
  • Uses recursive crawling when max_depth > 1
  • Falls back to batch crawling when depth limit is reached
  • Combines results appropriately

Comment thread: python/src/server/services/crawling/crawling_service.py
When a file is selected through discovery, it should be crawled as a single file without
following any links contained within it. This preserves the efficiency gains of the
discovery feature.

Changes:
- Skip link extraction when is_discovery_target is true for link collection files
- Return sitemap metadata without crawling URLs when is_discovery_target is true
- Add clear logging to indicate single-file mode is active

This ensures discovered files (llms.txt, sitemap.xml, etc.) are processed as single
authoritative sources rather than starting recursive crawls, which aligns with the
PR's objective of efficient single-file discovery and crawling.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
python/src/server/services/crawling/crawling_service.py (3)

355-361: Addressed: Discovery offloaded to a worker thread.
Avoids blocking the event loop while calling sync discover_files.


777-789: Make discovery link-following opt‑in at the filter too.
Ensure the same‑domain filter only runs when link following is explicitly enabled; otherwise the branch is effectively dead (due to the early return above).

Apply this diff:

-                    # For discovery targets, only follow same-domain links
-                    if extracted_links and request.get("is_discovery_target"):
+                    # For discovery targets, only follow same-domain links when explicitly enabled
+                    if extracted_links and request.get("is_discovery_target") and request.get("follow_links_from_discovery", False):
                         original_domain = request.get("original_domain")
                         if original_domain:
                             original_count = len(extracted_links)
                             extracted_links = [
                                 link for link in extracted_links 
                                 if self._is_same_domain(link, original_domain)
                             ]
                             domain_filtered_count = original_count - len(extracted_links)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links)} same-domain links")

398-407: Fix domain used for discovery-mode filtering to the discovered URL’s domain.
Using the original URL’s domain can incorrectly filter out legitimate same-host links when the discovered file is on a different host/subdomain (e.g., docs.example.com). Use the discovered URL instead.

Apply this diff:

-                discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+                discovery_request["original_domain"] = self.url_handler.get_base_url(discovered_url)
+                # Optional: keep the discovered URL for downstream telemetry/debug
+                discovery_request["discovered_url"] = discovered_url
🧹 Nitpick comments (2)
python/src/server/services/crawling/crawling_service.py (2)

756-762: Gate “single-file” short‑circuit behind an explicit flag.
Right now discovery unconditionally skips link extraction for link‑collection files. Add follow_links_from_discovery=True to allow opting into crawling those links when desired (e.g., llms.txt that lists multiple .txt files).

Apply this diff:

-                    # If this file was selected by discovery, skip link extraction (single-file mode)
-                    if request.get("is_discovery_target"):
+                    # If this file was selected by discovery, skip link extraction unless explicitly enabled
+                    if request.get("is_discovery_target") and not request.get("follow_links_from_discovery", False):
                         logger.info(f"Discovery single-file mode: skipping link extraction for {url}")
                         crawl_type = "discovery_single_file"
                         logger.info(f"Discovery file crawling completed: {len(crawl_results)} result")
                         return crawl_results, crawl_type

849-861: Also gate sitemap “single-file” behavior behind an explicit flag.
Allow opting in to follow URLs from a discovered sitemap when requested; default remains single-file.

Apply this diff:

-            # If this sitemap was selected by discovery, just return the sitemap itself (single-file mode)
-            if request.get("is_discovery_target"):
+            # If this sitemap was selected by discovery, return the sitemap itself unless explicitly enabled to follow links
+            if request.get("is_discovery_target") and not request.get("follow_links_from_discovery", False):
                 logger.info(f"Discovery single-file mode: returning sitemap itself without crawling URLs from {url}")
                 crawl_type = "discovery_sitemap"
                 # Return the sitemap file as the result
                 crawl_results = [{
                     'url': url,
                     'markdown': f"# Sitemap: {url}\n\nThis is a sitemap file discovered and returned in single-file mode.",
                     'title': f"Sitemap - {self.url_handler.extract_display_name(url)}",
                     'crawl_type': crawl_type
                 }]
                 return crawl_results, crawl_type
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c1677a9 and 597fc86.

📒 Files selected for processing (1)
  • python/src/server/services/crawling/crawling_service.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/crawling_service.py (4)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-442)
  • discover_files (44-101)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (210-276)
python/src/server/services/crawling/helpers/url_handler.py (9)
  • transform_github_url (179-208)
  • is_sitemap (20-37)
  • is_llms_variant (613-634)
  • is_robots_txt (592-610)
  • is_well_known_file (637-655)
  • is_txt (60-76)
  • is_binary_file (79-176)
  • get_base_url (658-678)
  • extract_display_name (443-589)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (224-236)
  • safe_logfire_error (239-251)
🔇 Additional comments (7)
python/src/server/services/crawling/crawling_service.py (7)

20-20: Good: DiscoveryService is properly wired.
Import looks correct and localized to the service layer.


87-87: Good: DiscoveryService instance on the service.
Instantiation enables re-use and testability.


184-193: Good: Propagating start/end progress to single_page strategy.
Signature + forwarding align with the progress mapper changes.


340-349: Reasonable: Skip discovery for explicit discovery targets.
Covers sitemaps, robots, llms variants, .well-known, and generic .txt.


798-800: Good: Dedup extracted links.
Linear-time dedup via dict preserves order; adequate here.
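
Presumably the standard dict-keys idiom:

links = ["https://a.dev/1", "https://a.dev/2", "https://a.dev/1"]
# dict keys are insertion-ordered, so this dedups in O(n) while preserving
# first-seen order:
deduped = list(dict.fromkeys(links))
assert deduped == ["https://a.dev/1", "https://a.dev/2"]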


802-822: Sound defaults for depth in discovery mode.
Depth gating is clear; will work once link-follow opt‑in is added.


663-683: Good: Minimal same‑host check for domain equality.
Hostname equality is acceptable given discovery scope; keep behavior strict to avoid cross‑subdomain drift.

leex279 and others added 2 commits September 22, 2025 22:35
Resolved merge conflicts by integrating features from both branches:
- Added page_storage_ops service initialization from main
- Merged link text extraction with discovery mode features
- Preserved discovery single-file mode and domain filtering
- Maintained link text fallbacks for title extraction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Enable SSL certificate verification (verify=True) for all HTTP requests
- Implement streaming with size limits (10MB default) to prevent memory exhaustion
- Add _read_response_with_limit() helper for secure response reading
- Update all test mocks to support streaming API with iter_content()
- Fix test assertions to expect new security parameters
- Enforce deterministic rounding in progress mapper tests

Security improvements:
- Prevents MITM attacks through SSL verification
- Guards against DoS via oversized responses
- Ensures proper resource cleanup with response.close()

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
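
A minimal sketch of the size-limited streaming read described in this commit; the constant, chunk size, and function name are assumptions, not necessarily the PR's exact values:

import requests

MAX_RESPONSE_BYTES = 10 * 1024 * 1024  # 10MB default, per the commit message

def read_response_with_limit(resp: requests.Response, url: str) -> str:
    """Read a streamed response, aborting if it exceeds the size limit."""
    chunks: list[bytes] = []
    total = 0
    try:
        for chunk in resp.iter_content(chunk_size=8192):
            total += len(chunk)
            if total > MAX_RESPONSE_BYTES:
                raise ValueError(f"Response from {url} exceeds {MAX_RESPONSE_BYTES} bytes")
            chunks.append(chunk)
    finally:
        resp.close()  # ensure the connection is released even on failure
    return b"".join(chunks).decode(resp.encoding or "utf-8", errors="replace")

# Callers would pass verify=True (the requests default) and stream=True:
# resp = requests.get(url, stream=True, timeout=10, verify=True)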

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
python/src/server/services/crawling/crawling_service.py (2)

822-868: Disable link crawling by default for discovery targets.

The code filters discovery-target links to same-domain but still follows them, conflicting with the PR goal of "crawls only that file." Add a guard to skip link crawling unless explicitly enabled.

As previously noted, gate link following behind an explicit flag:

                 # For discovery targets, only follow same-domain links
                 if extracted_links_with_text and request.get("is_discovery_target"):
                     original_domain = request.get("original_domain")
                     if original_domain:
                         original_count = len(extracted_links_with_text)
                         extracted_links_with_text = [
                             (link, text) for link, text in extracted_links_with_text
                             if self._is_same_domain(link, original_domain)
                         ]
                         domain_filtered_count = original_count - len(extracted_links_with_text)
                         if domain_filtered_count > 0:
                             safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links_with_text)} same-domain links")
+                    
+                    # Do not follow links from discovery files unless explicitly enabled
+                    if not request.get("follow_links_from_discovery", False):
+                        logger.info(
+                            "Discovery mode: not following links from discovery file "
+                            "(set follow_links_from_discovery=True to enable)."
+                        )
+                        extracted_links_with_text = []

422-424: Use discovered file's domain for downstream filtering.

Setting original_domain from the main URL instead of the discovered file causes same-domain filtering to use the wrong domain. If discovery finds a file on a subdomain or different path, legitimate same-domain links from that file will be incorrectly filtered out.

As previously noted, use the discovered file's domain:

             discovery_request = request.copy()
             discovery_request["is_discovery_target"] = True
-            discovery_request["original_domain"] = self.url_handler.get_base_url(url)
+            discovery_request["original_domain"] = self.url_handler.get_base_url(discovered_url)
python/src/server/services/crawling/discovery_service.py (3)

138-142: Verify declared sitemap exists before returning.

The code returns the first sitemap from robots.txt without verifying it's reachable. If robots.txt declares a dead sitemap URL, discovery short-circuits with an invalid target.

As previously noted, check the sitemap URL exists before returning:

 robots_sitemaps = self._parse_robots_txt(base_url)
 if robots_sitemaps:
-    best_file = robots_sitemaps[0]  # Use first sitemap from robots.txt
-    logger.info(f"Discovery found best file from robots.txt: {best_file}")
-    return best_file
+    # Verify the first sitemap is reachable before returning
+    for sitemap_url in robots_sitemaps:
+        if self._check_url_exists(sitemap_url):
+            logger.info(f"Discovery found best file from robots.txt: {sitemap_url}")
+            return sitemap_url
+    logger.info("No reachable sitemaps found in robots.txt; continuing discovery.")

318-326: Preserve URL case and resolve relative sitemap paths.

Lowercasing the entire line corrupts URL case, and the code ignores relative sitemap values commonly found in robots.txt files.

As previously noted, preserve case and resolve relatives:

-            # Parse robots.txt content for sitemap directives
-            for line in content.splitlines():
-                line = line.strip().lower()
-                if line.startswith("sitemap:"):
-                    sitemap_url = line.split(":", 1)[1].strip()
-                    # Validate URL format before adding
-                    if sitemap_url and (sitemap_url.startswith('http://') or sitemap_url.startswith('https://')):
-                        sitemaps.append(sitemap_url)
-                        logger.info(f"Found sitemap in robots.txt: {sitemap_url}")
+            # Parse robots.txt content for sitemap directives
+            for raw_line in content.splitlines():
+                line = raw_line.strip()
+                if line.lower().startswith("sitemap:"):
+                    sitemap_value = line.split(":", 1)[1].strip()
+                    if sitemap_value:
+                        # Allow absolute and relative sitemap values
+                        if sitemap_value.lower().startswith(("http://", "https://")):
+                            sitemap_url = sitemap_value
+                        else:
+                            sitemap_url = urljoin(robots_url, sitemap_value)
+                        sitemaps.append(sitemap_url)
+                        logger.info(f"Found sitemap in robots.txt: {sitemap_url}")

416-440: Match case-insensitively without lowercasing HTML content.

Lowercasing the entire HTML corrupts extracted href/content URLs. Use case-insensitive regex on original content instead.

As previously noted, preserve href case:

             # Read response with size limit
             content = self._read_response_with_limit(resp, base_url)
-            content = content.lower()
             
             # Look for sitemap meta tags or link elements
             import re
+            from urllib.parse import urlparse
             
-            # Check for <link rel="sitemap" href="...">
-            sitemap_link_pattern = r'<link[^>]*rel=["\']sitemap["\'][^>]*href=["\']([^"\']+)["\']'
-            matches = re.findall(sitemap_link_pattern, content)
+            # Check for <link rel="sitemap" href="..."> (case-insensitive)
+            sitemap_link_pattern = re.compile(r'<link[^>]*rel=["\']sitemap["\'][^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
+            matches = sitemap_link_pattern.findall(content)
             
             for match in matches:
                 sitemap_url = urljoin(base_url, match)
-                sitemaps.append(sitemap_url)
-                logger.info(f"Found sitemap in HTML link tag: {sitemap_url}")
+                # Validate scheme before accepting the URL
+                if urlparse(sitemap_url).scheme in ("http", "https"):
+                    sitemaps.append(sitemap_url)
+                    logger.info(f"Found sitemap in HTML link tag: {sitemap_url}")
             
-            # Check for <meta name="sitemap" content="...">
-            sitemap_meta_pattern = r'<meta[^>]*name=["\']sitemap["\'][^>]*content=["\']([^"\']+)["\']'
-            matches = re.findall(sitemap_meta_pattern, content)
+            # Check for <meta name="sitemap" content="..."> (case-insensitive)
+            sitemap_meta_pattern = re.compile(r'<meta[^>]*name=["\']sitemap["\'][^>]*content=["\']([^"\']+)["\']', re.IGNORECASE)
+            matches = sitemap_meta_pattern.findall(content)
             
             for match in matches:
                 sitemap_url = urljoin(base_url, match)
-                sitemaps.append(sitemap_url)
-                logger.info(f"Found sitemap in HTML meta tag: {sitemap_url}")
+                if urlparse(sitemap_url).scheme in ("http", "https"):
+                    sitemaps.append(sitemap_url)
+                    logger.info(f"Found sitemap in HTML meta tag: {sitemap_url}")
🧹 Nitpick comments (2)
python/src/server/services/crawling/discovery_service.py (2)

182-275: Remove unused helper methods or integrate them into the orchestrator.

The methods _discover_best_sitemap, _discover_best_llms_file, and _discover_robots_file are defined but never called. The main discover_files orchestrator inlines this logic instead. Either remove these methods to reduce surface area or refactor discover_files to use them.

If you intend to keep these helpers, apply this refactor to use them in discover_files:

 def discover_files(self, base_url: str) -> str | None:
     try:
         logger.info(f"Starting single-file discovery for {base_url}")
         
-        # First check robots.txt for explicit sitemap declarations (special case)
-        robots_sitemaps = self._parse_robots_txt(base_url)
-        if robots_sitemaps:
-            best_file = robots_sitemaps[0]
-            logger.info(f"Discovery found best file from robots.txt: {best_file}")
-            return best_file
-        
-        # Check files in global priority order
-        for filename in self.DISCOVERY_PRIORITY:
-            # Try location relative to the given URL
-            file_url = urljoin(base_url, filename)
-            if self._check_url_exists(file_url):
-                logger.info(f"Discovery found best file: {file_url}")
-                return file_url
-            
-            # For llms files, also try common subdirectories
-            if filename.startswith('llms'):
-                for subdir in ["static", "public", "docs", "assets", "doc", "api"]:
-                    subdir_url = urljoin(base_url, f"{subdir}/{filename}")
-                    if self._check_url_exists(subdir_url):
-                        logger.info(f"Discovery found best file in subdirectory: {subdir_url}")
-                        return subdir_url
-            
-            # For sitemap files, also try common subdirectories
-            if filename.endswith('.xml') and not filename.startswith('.well-known'):
-                for subdir in ["sitemaps", "sitemap", "xml", "feed"]:
-                    subdir_url = urljoin(base_url, f"{subdir}/{filename}")
-                    if self._check_url_exists(subdir_url):
-                        logger.info(f"Discovery found best file in subdirectory: {subdir_url}")
-                        return subdir_url
-        
-        # Check HTML meta tags for sitemap references as final fallback
-        html_sitemaps = self._parse_html_meta_tags(base_url)
-        if html_sitemaps:
-            best_file = html_sitemaps[0]
-            logger.info(f"Discovery found best file from HTML meta tags: {best_file}")
-            return best_file
+        # Try llms files first (highest priority)
+        llms_file = self._discover_best_llms_file(base_url)
+        if llms_file:
+            return llms_file
+        
+        # Try sitemap files
+        sitemap_file = self._discover_best_sitemap(base_url)
+        if sitemap_file:
+            return sitemap_file
+        
+        # Try robots.txt
+        robots_file = self._discover_robots_file(base_url)
+        if robots_file:
+            return robots_file
         
         logger.info(f"Discovery completed for {base_url}: no files found")
         return None

Otherwise, delete the unused methods (lines 182-275).


276-289: Use HEAD request for existence checks to reduce bandwidth.

_check_url_exists uses GET, which fetches the entire response body. For large files like sitemap.xml or llms-full.txt, this is wasteful. Use HEAD instead.

Apply this diff:

 def _check_url_exists(self, url: str) -> bool:
     try:
-        resp = requests.get(url, timeout=5, allow_redirects=True, verify=True)
+        resp = requests.head(url, timeout=5, allow_redirects=True, verify=True)
         success = resp.status_code == 200
         logger.debug(f"URL check: {url} -> {resp.status_code} ({'exists' if success else 'not found'})")
         resp.close()
         return success
     except Exception as e:
         logger.debug(f"URL check failed: {url} -> {e}")
         return False

Note: Some servers may not support HEAD; if this causes issues in testing, fall back to GET.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 597fc86 and 968e5b7.

📒 Files selected for processing (5)
  • python/src/server/services/crawling/crawling_service.py (9 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (7 hunks)
  • python/tests/progress_tracking/test_progress_mapper.py (1 hunks)
  • python/tests/test_discovery_service.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/tests/test_discovery_service.py
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
🧬 Code graph analysis (3)
python/src/server/services/crawling/crawling_service.py (6)
python/src/server/services/crawling/discovery_service.py (1)
  • discover_files (123-180)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (222-288)
python/src/server/services/crawling/helpers/url_handler.py (7)
  • is_sitemap (21-38)
  • is_llms_variant (633-654)
  • is_robots_txt (612-630)
  • is_well_known_file (657-675)
  • is_txt (61-77)
  • is_binary_file (80-177)
  • get_base_url (678-698)
python/src/server/config/logfire_config.py (2)
  • safe_logfire_info (224-236)
  • safe_logfire_error (239-251)
python/src/server/services/crawling/strategies/recursive.py (1)
  • crawl_recursive_with_progress (36-344)
python/src/server/services/crawling/strategies/batch.py (1)
  • crawl_batch_with_progress (33-295)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
python/tests/progress_tracking/test_progress_mapper.py (1)
python/src/server/services/crawling/progress_mapper.py (9)
  • ProgressMapper (9-170)
  • map_progress (49-101)
  • calculate_stage_progress (107-121)
  • map_batch_progress (123-141)
  • map_with_substage (143-157)
  • reset (159-162)
  • get_current_stage (164-166)
  • get_current_progress (168-170)
  • get_stage_range (103-105)

- Preserve URL case in robots.txt parsing by only lowercasing the sitemap: prefix check
- Add support for relative sitemap paths in robots.txt using urljoin()
- Fix HTML meta tag parsing to use case-insensitive regex instead of lowercasing content
- Add URL scheme validation for discovered sitemaps (http/https only)
- Fix discovery target domain filtering to use discovered URL's domain instead of input URL
- Clean up whitespace and improve dict comprehension usage

These changes improve discovery reliability and prevent URL corruption while maintaining
backward compatibility with existing discovery behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/crawling_service.py (2)

822-868: Discovery link-following behavior contradicts "single-file" PR goal.

The same-domain filter (lines 822-834) limits extracted links to the same domain, but the code still follows those links (lines 848-868), either via recursive crawling (max_depth > 1) or batch crawling. This contradicts the PR objective: "when a discovery file is found the crawler fetches only that file instead of recursively crawling the site."

A past review recommended adding a guard to disable link following unless follow_links_from_discovery=True is explicitly set. This was not implemented.

Apply this diff after the same-domain filter to honor single-file mode:

                             domain_filtered_count = original_count - len(extracted_links_with_text)
                             if domain_filtered_count > 0:
                                 safe_logfire_info(f"Discovery mode: filtered out {domain_filtered_count} external links, keeping {len(extracted_links_with_text)} same-domain links")
+                        
+                        # Do not follow links from discovery files unless explicitly enabled
+                        if not request.get("follow_links_from_discovery", False):
+                            logger.info(
+                                "Discovery mode: not following links from discovery file "
+                                "(set follow_links_from_discovery=True to enable)."
+                            )
+                            extracted_links_with_text = []

Based on past review comments and PR objectives.


801-834: Remove unreachable dead code at lines 822-834.

Lines 801-806 return early when is_discovery_target=True for link collection files. The subsequent check at line 823 (if extracted_links_with_text and request.get("is_discovery_target")) is unreachable dead code because:

  • If the early return at line 806 executes, lines 808+ never run
  • If the early return doesn't execute, is_discovery_target is always False

Delete lines 822-834 (the entire same-domain filter block for discovery targets).
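
For reference, a minimal sketch of the control flow at issue (simplified; names taken from the review, not the PR's exact code):

# Simplified control-flow sketch -- not the actual implementation.
def handle_links(request: dict, is_link_collection_file: bool, extracted_links_with_text: list):
    if not is_link_collection_file:
        return None  # the whole block (~lines 800-840) is skipped

    if request.get("is_discovery_target"):
        return "early return"  # discovery targets exit here (line 806)

    # Past this point is_discovery_target is always falsy, so this guard
    # can never be true -- the branch below is unreachable:
    if extracted_links_with_text and request.get("is_discovery_target"):
        pass  # dead code (lines 822-834)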

♻️ Duplicate comments (1)
python/src/server/services/crawling/discovery_service.py (1)

137-142: Verify robots.txt sitemap exists before returning it.

The first sitemap from robots.txt is returned without checking if it's reachable. A dead link or typo will cause discovery to return a non-existent URL, failing downstream crawling.

Apply this diff to verify before returning:

             # First check robots.txt for explicit sitemap declarations (special case)
             robots_sitemaps = self._parse_robots_txt(base_url)
             if robots_sitemaps:
                 best_file = robots_sitemaps[0]  # Use first sitemap from robots.txt
-                logger.info(f"Discovery found best file from robots.txt: {best_file}")
-                return best_file
+                if self._check_url_exists(best_file):
+                    logger.info(f"Discovery found best file from robots.txt: {best_file}")
+                    return best_file
+                logger.info(f"Declared sitemap not reachable: {best_file}; continuing discovery.")

Based on past review comments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 968e5b7 and e5160dd.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/crawling_service.py (10 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-571)
  • discover_files (123-180)
python/src/server/services/crawling/helpers/url_handler.py (8)
  • is_sitemap (21-38)
  • is_llms_variant (633-654)
  • is_robots_txt (612-630)
  • is_well_known_file (657-675)
  • is_txt (61-77)
  • is_binary_file (80-177)
  • get_base_url (678-698)
  • extract_display_name (463-609)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (222-288)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test Results Summary
🔇 Additional comments (3)
python/src/server/services/crawling/crawling_service.py (3)

359-406: LGTM! Discovery phase is well-integrated.

The implementation correctly:

  • Skips discovery for URLs that are already discovery targets (sitemaps, llms files, etc.)
  • Offloads blocking I/O to a thread pool via asyncio.to_thread (addressing past review concern)
  • Filters binary files from discovered results
  • Gracefully falls back to regular crawling on discovery errors

408-440: LGTM! Discovery routing correctly prioritizes discovered files.

The orchestration properly:

  • Crawls ONLY the discovered file (not the main URL) when discovery succeeds
  • Sets original_domain from the discovered URL for accurate same-domain filtering (addressing past review concern)
  • Falls back to the main URL when no discovery file is found

708-728: LGTM! Domain comparison is correct and conservative.

The _is_same_domain helper properly:

  • Extracts and compares hostnames (not full URLs)
  • Normalizes case for consistent comparison
  • Returns False on parse errors (conservative, prevents accidental inclusion)
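
As a rough sketch of that behavior (assumed names; the actual helper may differ in detail):

from urllib.parse import urlparse

def _is_same_domain(url: str, original_domain: str) -> bool:
    # Compare hostnames only, case-insensitively; treat parse failures as "different"
    try:
        host = (urlparse(url).hostname or "").lower()
        ref = (urlparse(original_domain).hostname or original_domain).lower()
        return host == ref
    except ValueError:
        return False  # conservative: never include a link we cannot parse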

@leex279
Collaborator Author

leex279 commented Oct 17, 2025

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Oct 17, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Improve discovery logic to check the same directory as the base URL first before
falling back to root-level and subdirectories. This ensures files like
https://supabase.com/docs/llms.txt are found when crawling
https://supabase.com/docs.

Changes:
- Check same directory as base_url first (e.g., /docs/llms.txt for /docs URL)
- Fall back to root-level urljoin behavior
- Include base directory name in subdirectory checks (e.g., /docs subdirectory)
- Maintain priority order: same-dir > root > subdirectories
- Log discovery location for better debugging

This addresses cases where documentation directories contain their own llms.txt
or sitemap files that should take precedence over root-level files.
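
Sketch of the intended probe order (illustrative only; the subdirectory names are examples):

from urllib.parse import urljoin, urlparse

def candidate_urls(base_url: str, filename: str) -> list[str]:
    # Probe order: same directory > root > common subdirectories
    parsed = urlparse(base_url)
    root = f"{parsed.scheme}://{parsed.netloc}/"
    base_dir = base_url if base_url.endswith("/") else base_url + "/"
    candidates = [urljoin(base_dir, filename), urljoin(root, filename)]
    for subdir in ("static", "public", "docs"):
        candidates.append(urljoin(root, f"{subdir}/{filename}"))
    return candidates

candidate_urls("https://supabase.com/docs", "llms.txt")
# -> ['https://supabase.com/docs/llms.txt', 'https://supabase.com/llms.txt', ...]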

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
python/src/server/services/crawling/discovery_service.py (1)

137-142: Verify robots.txt sitemap exists before returning it.

This returns the first declared sitemap without checking if it's reachable, which can cause the crawler to fail on dead links. This concern was flagged in a previous review but remains unresolved.

Apply this diff to verify the sitemap before short-circuiting:

             if robots_sitemaps:
-                best_file = robots_sitemaps[0]  # Use first sitemap from robots.txt
-                logger.info(f"Discovery found best file from robots.txt: {best_file}")
-                return best_file
+                # Verify the declared sitemap is reachable
+                for candidate in robots_sitemaps:
+                    if self._check_url_exists(candidate):
+                        logger.info(f"Discovery found best file from robots.txt: {candidate}")
+                        return candidate
+                logger.info(f"Declared sitemap(s) not reachable; continuing discovery.")
🧹 Nitpick comments (2)
python/src/server/services/crawling/discovery_service.py (2)

170-200: Consider refactoring redundant subdirectory checks.

Lines 173-178 and 189-193 extract the base directory name from base_url and prepend it to the subdirectory list. However, if the same-directory check (lines 157-162) is working correctly, this becomes redundant:

  • If base_url = "https://supabase.com/docs", same-directory already tries /docs/llms.txt
  • Then extracting "docs" and checking urljoin(base_url, "docs/llms.txt") attempts the same URL

This doesn't break functionality but adds unnecessary HTTP checks.

Consider simplifying by removing the base directory name extraction and relying solely on the same-directory check (after fixing line 157-162 logic per previous comment) plus the standard subdirectory list.
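
The duplication is easy to see in isolation (hypothetical URL):

from urllib.parse import urljoin

base_url = "https://supabase.com/docs"
# The same-directory check already probes this URL:
print(urljoin(base_url + "/", "llms.txt"))                # https://supabase.com/docs/llms.txt
# Prepending the extracted directory name probes it again:
print(urljoin("https://supabase.com/", "docs/llms.txt"))  # https://supabase.com/docs/llms.txt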


380-605: Remove unused helper methods at lines 380, 501, and 537.

The methods _check_standard_patterns, _check_well_known_directory, and _try_common_variations are never invoked. The active discovery flow uses only _parse_robots_txt, _check_url_exists, and _parse_html_meta_tags; the three unused methods have no call sites in the codebase or tests. Remove them to reduce maintenance overhead.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5160dd and 8777e94.

📒 Files selected for processing (1)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
🔇 Additional comments (7)
python/src/server/services/crawling/discovery_service.py (7)

1-14: LGTM!

Clean imports and logger setup following project conventions.


17-70: LGTM!

Priority-based constants are well-organized and DISCOVERY_TARGETS is now properly defined, addressing the previous review concern.


72-121: LGTM!

Solid streaming implementation with proper size limits, encoding fallbacks, and resource cleanup.
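
For readers outside the diff, a size-limited streaming read along these lines might look like this (a minimal sketch; the cap, chunk size, and fallback encoding are assumptions, not the PR's actual values):

import requests

MAX_BYTES = 1_000_000  # assumed cap

def read_with_limit(resp: requests.Response) -> str:
    chunks: list[bytes] = []
    total = 0
    for chunk in resp.iter_content(chunk_size=8192):
        total += len(chunk)
        if total > MAX_BYTES:
            break  # stop reading rather than buffer unbounded content
        chunks.append(chunk)
    data = b"".join(chunks)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")  # lossless single-byte fallback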


202-214: LGTM!

Appropriate fallback to HTML meta tags and proper error handling for a discovery service (returning None for "not found" is semantically correct).


310-322: LGTM!

Clean URL existence check with appropriate timeout and resource cleanup.


324-378: LGTM!

Robust robots.txt parsing with proper handling of relative paths, case-insensitive matching, and resource cleanup. This correctly addresses previous review feedback.


434-499: LGTM!

HTML meta tag parsing correctly uses case-insensitive regex and validates schemes, addressing previous review feedback. Resource cleanup and error handling are solid.

Remove the special case that gave robots.txt sitemap declarations highest
priority, which incorrectly overrode the global priority order. Now properly
respects the intended priority: llms-full.txt > llms.txt > llms.md > llms.mdx >
sitemap.xml > robots.txt.

This fixes the issue where supabase.com/docs would return sitemap.xml instead
of llms.txt even though both files exist at /docs/ and llms.txt should have
higher priority.

Changes:
- Removed robots.txt early return that bypassed priority order
- Updated test to verify llms files take precedence over robots.txt sitemaps
- All discovery now follows consistent DISCOVERY_PRIORITY order
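
The resulting order, expressed as the single constant this commit describes (a sketch; the module may define it differently):

# Global priority, highest first:
DISCOVERY_PRIORITY = [
    "llms-full.txt",
    "llms.txt",
    "llms.md",
    "llms.mdx",
    "sitemap.xml",
    "robots.txt",
]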

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
python/src/server/services/crawling/discovery_service.py (2)

223-226: Validate robots-declared sitemap before short-circuiting in helper.

Return the first reachable sitemap instead of the first listed.

             if robots_sitemaps:
-                return robots_sitemaps[0]  # Use first sitemap from robots.txt
+                for sm in robots_sitemaps:
+                    if self._check_url_exists(sm):
+                        return sm

137-178: Fix same-directory detection and enforce robots.txt-declared sitemap priority (still outstanding).

  • Same-dir logic is fragile (dot heuristic, manual f-string concatenation). Use urljoin with a trailing slash and parse once.
  • Per PR goals, prefer sitemaps declared in robots.txt over other sitemap candidates (but still below llms.*). Add a one-time pre-check before evaluating sitemap filenames.

Apply this diff:

-            # Check files in global priority order
-            # Note: robots.txt sitemaps are not given special priority as llms files should be preferred
-            for filename in self.DISCOVERY_PRIORITY:
-                from urllib.parse import urlparse
-
-                # Get the directory path of the base URL
-                parsed = urlparse(base_url)
-                base_path = parsed.path.rstrip('/')
-                # Extract directory (remove filename if present)
-                if '.' in base_path.split('/')[-1]:
-                    base_dir = '/'.join(base_path.split('/')[:-1])
-                else:
-                    base_dir = base_path
-
-                # Priority 1: Check same directory as base_url (e.g., /docs/llms.txt for /docs URL)
-                if base_dir and base_dir != '/':
-                    same_dir_url = f"{parsed.scheme}://{parsed.netloc}{base_dir}/{filename}"
-                    if self._check_url_exists(same_dir_url):
-                        logger.info(f"Discovery found best file in same directory: {same_dir_url}")
-                        return same_dir_url
+            # Check files in global priority order
+            # Sitemaps declared in robots.txt are preferred over other sitemap candidates (but still below llms.*)
+            from urllib.parse import urlparse
+            parsed = urlparse(base_url)
+            base_dir = parsed.path.rstrip('/')
+            base_with_slash = base_url if base_url.endswith('/') else base_url + '/'
+            checked_robots_sitemaps = False
+
+            for filename in self.DISCOVERY_PRIORITY:
+                # Priority 1: same directory (treat base_url as a directory context)
+                if base_dir and base_dir != '/':
+                    same_dir_url = urljoin(base_with_slash, filename)
+                    if self._check_url_exists(same_dir_url):
+                        logger.info(f"Discovery found best file in same directory: {same_dir_url}")
+                        return same_dir_url
+
+                # Prefer robots.txt-declared sitemaps before any explicit sitemap filenames
+                if (filename in self.DISCOVERY_TARGETS["sitemap_files"]) and not checked_robots_sitemaps:
+                    for sm in self._parse_robots_txt(base_url):
+                        if self._check_url_exists(sm):
+                            logger.info(f"Discovery found best file from robots.txt: {sm}")
+                            return sm
+                    checked_robots_sitemaps = True

Please confirm this aligns with the PR’s stated priority: llms.* > robots-declared sitemaps > other sitemap candidates > robots.txt. If different, clarify the intended order.

🧹 Nitpick comments (6)
python/src/server/services/crawling/discovery_service.py (6)

304-317: Use HEAD first and accept any 2xx; avoid downloading bodies for existence checks.

Current GET + status==200 wastes bandwidth and misses valid 2xx responses.

     def _check_url_exists(self, url: str) -> bool:
         """
         Check if a URL exists and returns a successful response.
         """
         try:
-            resp = requests.get(url, timeout=5, allow_redirects=True, verify=True)
-            success = resp.status_code == 200
-            logger.debug(f"URL check: {url} -> {resp.status_code} ({'exists' if success else 'not found'})")
-            resp.close()
-            return success
+            # Try HEAD first
+            resp = requests.head(url, timeout=5, allow_redirects=True, verify=True)
+            status = resp.status_code
+            resp.close()
+            if 200 <= status < 300:
+                logger.debug(f"URL check: {url} -> {status} (exists)")
+                return True
+            # Fallback to GET with streaming (HEAD often blocked)
+            resp = requests.get(url, timeout=5, allow_redirects=True, stream=True, verify=True)
+            status = resp.status_code
+            resp.close()
+            logger.debug(f"URL check: {url} -> {status} ({'exists' if 200 <= status < 300 else 'not found'})")
+            return 200 <= status < 300
         except Exception as e:
             logger.debug(f"URL check failed: {url} -> {e}")
             return False

197-201: Verify HTML meta/link-declared sitemaps exist before returning.

Avoid returning dead links from meta tags.

             if html_sitemaps:
-                best_file = html_sitemaps[0]
-                logger.info(f"Discovery found best file from HTML meta tags: {best_file}")
-                return best_file
+                for candidate in html_sitemaps:
+                    if self._check_url_exists(candidate):
+                        logger.info(f"Discovery found best file from HTML meta tags: {candidate}")
+                        return candidate

124-136: Minor: validate base_url early and normalize.

If base_url lacks scheme/host, raise ValueError to fail fast; this avoids malformed urljoin outcomes.

         try:
-            logger.info(f"Starting single-file discovery for {base_url}")
+            logger.info(f"Starting single-file discovery for {base_url}")
+            from urllib.parse import urlparse
+            p = urlparse(base_url)
+            if not (p.scheme and p.netloc):
+                raise ValueError(f"Invalid base_url: {base_url}")

390-417: Prefer HEAD with stream for pattern/variant probes to reduce bandwidth.

These probes only need existence; use HEAD and avoid downloading bodies.

Also applies to: 507-520, 554-564, 580-590


10-15: Optional: reuse a requests.Session for connection pooling.

Create a per-service Session to cut handshake overhead across multiple checks.

Happy to provide a small wrapper to inject a Session.
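
Such a wrapper could be as small as this (a sketch; constructor injection is one option among several):

import requests

class DiscoveryService:
    def __init__(self, session: requests.Session | None = None):
        # One Session per service instance pools connections across all probes
        self._session = session or requests.Session()

    def _check_url_exists(self, url: str) -> bool:
        try:
            resp = self._session.head(url, timeout=5, allow_redirects=True)
            return 200 <= resp.status_code < 300
        except requests.RequestException:
            return False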


210-303: Remove unused helper methods to eliminate dead code.

Verification confirms that _discover_best_sitemap, _discover_best_llms_file, and _discover_robots_file (lines 210, 256, 291) are never called anywhere in the codebase. The discover_files method implements its own discovery logic and does not reuse these helpers. Delete these three methods to reduce code surface area.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8777e94 and a03ce1e.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/tests/test_discovery_service.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/tests/test_discovery_service.py
🧰 Additional context used
🧬 Code graph analysis (1)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)

Comment on lines +355 to +357
# Resolve relative path against base_url
sitemap_url = urljoin(base_url, sitemap_value)
sitemaps.append(sitemap_url)


⚠️ Potential issue | 🟠 Major

Resolve robots.txt relative sitemap paths against the robots URL, not base_url.

Using base_url can mis-resolve when robots.txt is under a subpath or after redirects. Use resp.url.

-                                # Resolve relative path against base_url
-                                sitemap_url = urljoin(base_url, sitemap_value)
+                                # Resolve relative path against the actual robots.txt URL (handles redirects/paths)
+                                sitemap_url = urljoin(resp.url, sitemap_value)
🤖 Prompt for AI Agents
In python/src/server/services/crawling/discovery_service.py around lines
355-357, the code resolves relative sitemap paths against base_url, which can
mis-resolve when robots.txt lives on a subpath or was reached via redirects;
change the join to use the robots.txt response URL (resp.url) instead of
base_url so relative sitemap_value paths are resolved against the actual
robots.txt location, i.e. compute sitemap_url = urljoin(resp.url, sitemap_value)
and append that result to sitemaps.

Implements complete llms.txt link following functionality that crawls
linked llms.txt files on the same domain/subdomain, along with critical
bug fixes for discovery priority and variant detection.

Backend Core Functionality:
- Add _is_same_domain_or_subdomain method for subdomain matching
- Fix is_llms_variant to detect .txt files in /llms/ directories
- Implement llms.txt link extraction and following logic
- Add two-phase discovery: prioritize ALL llms.txt before sitemaps
- Enhanced progress reporting with discovery metadata

Critical Bug Fixes:
- Discovery priority: Fixed sitemap.xml being found before llms.txt
- is_llms_variant: Now matches /llms/guides.txt, /llms/swift.txt, etc.
- These were blocking bugs preventing link following from working

Frontend UI:
- Add discovery and linked files display to CrawlingProgress component
- Update progress types to include discoveredFile, linkedFiles fields
- Add new crawl types: llms_txt_with_linked_files, discovery_*
- Add "discovery" to ProgressStatus enum and active statuses

Testing:
- 8 subdomain matching unit tests (test_crawling_service_subdomain.py)
- 7 integration tests for link following (test_llms_txt_link_following.py)
- All 15 tests passing
- Validated against real Supabase llms.txt structure (1 main + 8 linked)

Files Modified:
Backend:
- crawling_service.py: Core link following logic (lines 744-788, 862-920)
- url_handler.py: Fixed variant detection (lines 633-665)
- discovery_service.py: Two-phase discovery (lines 137-214)
- 2 new comprehensive test files

Frontend:
- progress/types/progress.ts: Updated types with new fields
- progress/components/CrawlingProgress.tsx: Added UI sections

Real-world testing: Crawling supabase.com/docs now discovers
/docs/llms.txt and automatically follows 8 linked llms.txt files,
indexing complete documentation from all files.
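
The subdomain matching described above reduces to a hostname suffix check; roughly (a sketch, not the method's actual body):

from urllib.parse import urlparse

def _is_same_domain_or_subdomain(url: str, base_domain: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    base = base_domain.lower()
    # docs.supabase.com matches supabase.com; evil-supabase.com does not
    return host == base or host.endswith("." + base)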

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
python/src/server/services/crawling/discovery_service.py (2)

230-322: Remove unused helper methods to reduce maintenance burden.

The methods _discover_best_sitemap (lines 230-274), _discover_best_llms_file (lines 276-309), and _discover_robots_file (lines 311-322) are not called by the main discover_files flow and represent dead code.

Past review verification confirmed these are never invoked. The main discover_files method (lines 123-228) uses its own inline logic instead.

Delete these three methods to reduce code surface:

-    def _discover_best_sitemap(self, base_url: str) -> str | None:
-        """
-        Discover the best available sitemap using priority-based selection.
-        ...
-        """
-        # ... (lines 230-274)
-
-    def _discover_best_llms_file(self, base_url: str) -> str | None:
-        """
-        Discover the best available llms file using priority-based selection.
-        ...
-        """
-        # ... (lines 276-309)
-
-    def _discover_robots_file(self, base_url: str) -> str | None:
-        """
-        Discover robots.txt file (always single file at root).
-        """
-        # ... (lines 311-322)

Based on learnings: Past review comments flagged these as unused; no calls found in codebase.


394-619: Remove dead utility functions or document their purpose.

The functions _check_standard_patterns (lines 394-446), _check_well_known_directory (lines 515-549), and _try_common_variations (lines 551-619) are never called by discover_files or any other method.

Past review verification confirmed these are unused. They represent ~225 lines of dead code.

Either remove these functions or document why they're kept (e.g., for future use, testing, public API):

-    def _check_standard_patterns(self, base_url: str) -> dict[str, list[str]]:
-        # ... (lines 394-446)
-
-    def _check_well_known_directory(self, base_url: str) -> list[str]:
-        # ... (lines 515-549)
-
-    def _try_common_variations(self, base_url: str) -> dict[str, list[str]]:
-        # ... (lines 551-619)

If these are intended as public helpers, add docstrings explaining when they should be called and consider adding tests.

Based on learnings: Past review comments flagged these as dead code; no usage found.

🧹 Nitpick comments (1)
archon-ui-main/src/features/progress/components/CrawlingProgress.tsx (1)

249-292: Replace unsafe type casts with proper type narrowing.

The code uses (operation as any) multiple times to access discovery-related fields. This bypasses TypeScript's type safety and could mask type errors.

Apply this diff to use proper type narrowing:

-                  {/* Discovery Information */}
-                  {(operation as any).discovered_file && (
+                  {/* Discovery Information */}
+                  {"discovered_file" in operation && operation.discovered_file && (
                     <div className="pt-2 border-t border-white/10">
                       <div className="flex items-center gap-2 mb-2">
                         <span className="text-xs font-semibold text-cyan-400">Discovery Result</span>
-                        {(operation as any).discovered_file_type && (
+                        {"discovered_file_type" in operation && operation.discovered_file_type && (
                           <span className="px-2 py-0.5 text-xs rounded bg-cyan-500/10 border border-cyan-500/20 text-cyan-300">
-                            {(operation as any).discovered_file_type}
+                            {operation.discovered_file_type}
                           </span>
                         )}
                       </div>
                       <a
-                        href={(operation as any).discovered_file}
+                        href={operation.discovered_file}
                         target="_blank"
                         rel="noopener noreferrer"
                         className="text-sm text-gray-400 hover:text-cyan-400 transition-colors truncate block"
                       >
-                        {(operation as any).discovered_file}
+                        {operation.discovered_file}
                       </a>
                     </div>
                   )}
 
                   {/* Linked Files */}
-                  {(operation as any).linked_files && (operation as any).linked_files.length > 0 && (
+                  {"linked_files" in operation && operation.linked_files && Array.isArray(operation.linked_files) && operation.linked_files.length > 0 && (
                     <div className="pt-2 border-t border-white/10">
                       <div className="text-xs font-semibold text-cyan-400 mb-2">
-                        Following {(operation as any).linked_files.length} Linked File
-                        {(operation as any).linked_files.length > 1 ? "s" : ""}
+                        Following {operation.linked_files.length} Linked File
+                        {operation.linked_files.length > 1 ? "s" : ""}
                       </div>
                       <div className="space-y-1 max-h-32 overflow-y-auto">
-                        {(operation as any).linked_files.map((file: string, idx: number) => (
+                        {operation.linked_files.map((file, idx) => (
                           <a
                             key={idx}
                             href={file}

Alternatively, update the ActiveOperation type in progress.ts to include these optional fields, which would be the preferred solution.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a03ce1e and cdf4323.

📒 Files selected for processing (8)
  • PRPs/llms-txt-link-following.md (1 hunks)
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx (2 hunks)
  • archon-ui-main/src/features/progress/types/progress.ts (4 hunks)
  • python/src/server/services/crawling/crawling_service.py (10 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (7 hunks)
  • python/tests/test_crawling_service_subdomain.py (1 hunks)
  • python/tests/test_llms_txt_link_following.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (12)
archon-ui-main/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

archon-ui-main/src/**/*.{ts,tsx}: Frontend TypeScript must use strict mode with no implicit any
Use TanStack Query for all data fetching; avoid prop drilling
Use database values directly in the frontend; avoid mapping layers between BE and FE types

Never return null to indicate failure in frontend TypeScript; throw errors with details instead

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/src/features/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Biome in features: 120 character line length, double quotes, and trailing commas

Use Biome formatting/conventions in features (120 char lines, double quotes, trailing commas)

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/src/features/*/types/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Define shared types under src/features/[feature]/types

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
archon-ui-main/src/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use ESLint rules for legacy frontend code outside /src/features

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

TypeScript strict mode with no implicit any in the frontend

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/src/features/**/types/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Define feature types under src/features/[feature]/types/

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
  • python/src/server/services/crawling/helpers/url_handler.py
archon-ui-main/src/features/*/components/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Place new UI components under src/features/[feature]/components

Files:

  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/src/features/**/components/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

archon-ui-main/src/features/**/components/**/*.tsx: Create UI components under src/features/[feature]/components/ using Radix UI primitives from src/features/ui/primitives/
Apply Tron-inspired glassmorphism styling with Tailwind and follow responsive (mobile-first) patterns

Files:

  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
archon-ui-main/src/features/**/*.tsx

📄 CodeRabbit inference engine (CLAUDE.md)

Ensure no dynamic Tailwind class construction (avoid runtime-generated class strings)

Files:

  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/features/progress/types/progress.ts
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
🧬 Code graph analysis (4)
python/tests/test_crawling_service_subdomain.py (2)
python/src/server/services/crawling/crawling_service.py (3)
  • CrawlingService (69-1049)
  • _is_same_domain_or_subdomain (744-788)
  • _is_same_domain (723-742)
python/tests/test_llms_txt_link_following.py (1)
  • service (11-13)
python/tests/test_llms_txt_link_following.py (2)
python/src/server/services/crawling/crawling_service.py (5)
  • CrawlingService (69-1049)
  • _is_same_domain_or_subdomain (744-788)
  • crawl_batch_with_progress (217-233)
  • crawl_markdown_file (200-211)
  • _crawl_by_url_type (824-1049)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • extract_markdown_links_with_text (298-387)
  • is_llms_variant (633-665)
python/src/server/services/crawling/crawling_service.py (2)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (17-619)
  • discover_files (123-228)
python/src/server/services/crawling/helpers/url_handler.py (9)
  • URLHandler (17-709)
  • is_sitemap (21-38)
  • is_llms_variant (633-665)
  • is_robots_txt (612-630)
  • is_well_known_file (668-686)
  • is_txt (61-77)
  • is_binary_file (80-177)
  • get_base_url (689-709)
  • extract_markdown_links_with_text (298-387)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test Results Summary
🔇 Additional comments (18)
python/src/server/services/crawling/helpers/url_handler.py (7)

39-40: LGTM! Decorator placement fix is correct.

The @staticmethod decorator is now properly placed adjacent to the function definition with no behavioral change.


282-282: LGTM! Modern type hints are consistent and correct.

The signature updates use modern Python type union syntax (str | None instead of Optional[str], list[str] instead of List[str]), which is cleaner and aligns with Python 3.10+ conventions.

Also applies to: 390-390


611-630: LGTM! Robots.txt detection is precise.

The implementation correctly checks for root-level /robots.txt only, which is the standard location per RFC 9309.


632-665: LGTM! Critical bug fix for llms.txt variant detection.

The implementation now correctly detects:

  • Exact filename matches (llms-full.txt, llms.txt, llms.md, etc.)
  • Files in /llms/ directories (e.g., /llms/guides.txt)

This was the blocking bug preventing link-following from working, as documented in the PRP. The fix enables discovery of files like https://supabase.com/llms/guides.txt.
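As a rough illustration, the detection described above boils down to something like this sketch (the variant set and exact method shape are assumptions; the real implementation lives in URLHandler, and a later commit trims the .md variants):

from urllib.parse import urlparse

LLMS_FILENAMES = {"llms-full.txt", "llms.txt", "llms.md"}  # assumed set; later commits trim .md

def is_llms_variant(url: str) -> bool:
    path = urlparse(url).path.lower()
    filename = path.rsplit("/", 1)[-1]
    if filename in LLMS_FILENAMES:  # exact filename match, e.g. /docs/llms-full.txt
        return True
    # files inside an /llms/ directory, e.g. /llms/guides.txt
    return "/llms/" in path and filename.endswith(".txt")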


667-686: LGTM! Well-known file detection is correct.

The implementation correctly identifies .well-known/* files at root level only (single occurrence check) per RFC 8615.


688-708: LGTM! Base URL extraction with proper fallback.

The implementation safely extracts scheme://netloc with appropriate error handling and logging when extraction fails.


389-459: The "full" variant exclusion regex is correct and handles all edge cases properly.

The verification confirms that the pattern r'(^|[._-])full([._-]|$)' correctly:

  • Matches intended "full" variants: llms-full.txt, full-llms.txt, llms.full.txt, llms_full.txt
  • Avoids false positives: helpful.md, fullstack.txt, meaningful.txt are not matched
  • Treats "full" as a standalone token delimited by start/end of string or special characters

All 9 test cases passed. No action needed.
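For anyone who wants to re-run that check, the verification fits in a few lines of standard re; the cases below are the ones listed above:

import re

FULL_TOKEN = re.compile(r'(^|[._-])full([._-]|$)')

should_match = ["llms-full.txt", "full-llms.txt", "llms.full.txt", "llms_full.txt"]
should_not_match = ["helpful.md", "fullstack.txt", "meaningful.txt"]

# "full" must be delimited by start/end of string or one of . _ -
assert all(FULL_TOKEN.search(name) for name in should_match)
assert not any(FULL_TOKEN.search(name) for name in should_not_match)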

python/tests/test_crawling_service_subdomain.py (1)

1-152: LGTM! Comprehensive test coverage for subdomain matching.

The test suite provides excellent coverage with 8 test cases:

  • Exact domain matches
  • Subdomain matching (single and multiple levels)
  • Cross-domain rejection
  • Protocol-agnostic matching
  • Port handling
  • Edge cases (empty/malformed URLs)
  • Real-world examples (GitHub, Python docs, Stripe)
  • Backward compatibility for exact-match helper

All assertions appear correct and the test design is solid.

archon-ui-main/src/features/progress/components/CrawlingProgress.tsx (1)

132-132: LGTM! Discovery status correctly added to active statuses.

Adding "discovery" to the active status array ensures the progress UI shows proper activity indicators during the discovery phase.

python/tests/test_llms_txt_link_following.py (1)

1-217: LGTM! Comprehensive integration tests for llms.txt link following.

The test suite provides excellent coverage with 7 integration tests:

  1. Link extraction from real Supabase llms.txt content (8 links)
  2. All links recognized as llms.txt variants
  3. All links validated as same domain
  4. Complete filtering logic (llms variants + same domain)
  5. Full integration test with mocked async calls
  6. External domain filtering
  7. Non-llms link filtering

The use of real Supabase content as test data is excellent for validation. Mocking strategy with AsyncMock and MagicMock is appropriate for integration testing. All assertions validate the expected behavior correctly.

archon-ui-main/src/features/progress/types/progress.ts (1)

9-9: LGTM! Type definitions properly extended for discovery integration.

The type updates correctly add:

  • "discovery" to ProgressStatus
  • New crawl types: llms_txt_with_linked_files, llms_txt_linked_files, discovery_single_file, discovery_sitemap
  • Discovery fields in CrawlProgressData: discoveredFile, discoveredFileType, linkedFiles
  • Discovery fields in ProgressResponse with both camelCase and snake_case variants for backend compatibility

This enables the UI to display discovery progress and results while maintaining backward compatibility.

Also applies to: 28-37, 61-64, 144-150

python/src/server/services/crawling/discovery_service.py (1)

72-121: LGTM! Size-limited response reading prevents memory exhaustion.

The _read_response_with_limit method correctly:

  • Reads responses in 8KB chunks
  • Enforces a 10MB default limit
  • Handles encoding fallback (response.encoding → utf-8 → utf-8 with error replacement)
  • Closes response on any error
  • Provides clear error messages

This prevents memory exhaustion from maliciously large responses.
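A minimal sketch of that pattern, assuming the 8KB chunk size and 10MB default described above (the actual method carries richer error context):

import requests

MAX_DISCOVERY_BYTES = 10 * 1024 * 1024  # assumed 10MB default

def read_response_with_limit(response: requests.Response, max_bytes: int = MAX_DISCOVERY_BYTES) -> str:
    """Read a streamed response in 8KB chunks, failing once max_bytes is exceeded (sketch)."""
    chunks: list[bytes] = []
    total = 0
    try:
        for chunk in response.iter_content(chunk_size=8192):
            total += len(chunk)
            if total > max_bytes:
                raise ValueError(f"Response exceeded {max_bytes} bytes: {response.url}")
            chunks.append(chunk)
        body = b"".join(chunks)
        try:
            # declared encoding first, utf-8 as the default
            return body.decode(response.encoding or "utf-8")
        except (LookupError, UnicodeDecodeError):
            # last resort: utf-8 with replacement characters
            return body.decode("utf-8", errors="replace")
    finally:
        response.close()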

PRPs/llms-txt-link-following.md (1)

1-506: LGTM! Comprehensive implementation documentation.

The PRP document provides excellent coverage of the feature:

  • Clear problem statement and goals
  • Detailed implementation plan with code examples
  • Comprehensive test strategy
  • Edge case analysis
  • Success metrics
  • Implementation summary with actual results

The document correctly reflects the completed implementation across all sprints and provides valuable context for understanding the feature.

python/src/server/services/crawling/crawling_service.py (5)

103-103: LGTM! Discovery integration properly offloads blocking I/O.

The implementation correctly:

  • Initializes DiscoveryService in __init__ (line 103)
  • Offloads blocking discovery calls to a worker thread using asyncio.to_thread (line 379)
  • Handles discovery failures gracefully with try/except (lines 415-420)
  • Skips discovery for URLs that are already discovery targets (lines 362-368)
  • Reports discovery results with file type metadata (lines 389-413)

This prevents blocking the async event loop during network I/O operations.

Also applies to: 359-421


723-788: LGTM! Domain filtering helpers are correctly implemented.

Both helpers are properly implemented:

  • _is_same_domain (lines 723-742): Exact hostname match (case-insensitive)
  • _is_same_domain_or_subdomain (lines 744-788): Root domain match for subdomains

The root domain extraction correctly uses the last 2 parts (e.g., example.com from docs.example.com), which works for common TLDs. Edge cases like .co.uk would need special handling, but this is acceptable for the current scope.

Conservative fallback behavior (return False on parsing errors) is appropriate for security.
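Condensed, the two checks look roughly like this sketch of the naive last-two-labels approach (not the exact code):

from urllib.parse import urlparse

def _naive_root(url: str) -> str:
    # Last two labels: docs.example.com -> example.com
    # (multi-part suffixes like .co.uk misbehave here; a later commit adds tldextract)
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:]) if host else ""

def is_same_domain(url_a: str, url_b: str) -> bool:
    host_a = (urlparse(url_a).hostname or "").lower()
    host_b = (urlparse(url_b).hostname or "").lower()
    return bool(host_a) and host_a == host_b

def is_same_domain_or_subdomain(url: str, original: str) -> bool:
    root_a, root_b = _naive_root(url), _naive_root(original)
    return bool(root_a) and root_a == root_b  # empty/unparseable -> False (conservative)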


862-919: LGTM! llms.txt link following correctly implements single-level traversal.

The implementation correctly:

  • Checks if the file is a discovery target AND an llms variant (lines 863-865)
  • Extracts links with text for title fallback (line 871)
  • Filters for llms.txt variants only (line 880)
  • Filters for same domain/subdomain only (line 882)
  • Crawls linked files in batch (no recursion) (lines 902-907)
  • Combines results and returns appropriate crawl type (lines 910-913)
  • Falls back to single-file mode for non-llms discovery targets (lines 916-919)

This matches the design goal of "only follow one level" to avoid deep nesting.
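The filtering step reduces to roughly the following sketch; the helper names come from the review, while the (url, text) return shape of extract_markdown_links_with_text is an assumption:

def select_linked_llms_urls(content: str, origin_url: str, url_handler, same_domain) -> list[str]:
    """Filter extracted markdown links down to same-domain llms variants (sketch)."""
    # extract_markdown_links_with_text is assumed to yield (url, link_text) pairs;
    # the real code keeps link_text as a title fallback.
    links = url_handler.extract_markdown_links_with_text(content)
    return [
        link_url
        for link_url, _text in links
        if url_handler.is_llms_variant(link_url) and same_domain(link_url, origin_url)
    ]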


422-454: LGTM! Discovery-target crawling correctly sets domain context.

The implementation correctly:

  • Crawls only the discovered file, not the main URL (lines 424-434)
  • Marks the request as a discovery target (line 438)
  • Sets original_domain using the discovered file's domain (line 439)
  • Passes the modified request to _crawl_by_url_type (line 441)

This ensures downstream domain filtering works correctly for discovered files that may be on different hosts/subdomains than the original URL (e.g., docs.example.com discovered from example.com).


935-947: Remove dead code at line 936: the condition request.get("is_discovery_target") can never be true at that location.

The code flow shows that all is_discovery_target paths return earlier:

  • Line 863–919: When is_discovery_target is true, the code either returns at line 913 (llms.txt case) or line 919 (non-llms discovery targets like sitemaps/robots.txt).
  • Line 936: Located outside the if request.get("is_discovery_target"): block, this check is unreachable. By the time execution reaches line 936, is_discovery_target must be false, making the condition at line 936 impossible to satisfy.

Delete the unreachable condition and its body (lines 936–946), keeping only the link extraction at lines 925–933 and the binary file filtering that follows.

Likely an incorrect or invalid review comment.

- Replace dot-based file detection with explicit extension checking
  in discovery service to correctly handle versioned directories
  like /docs.v2
- Add comprehensive validation for start_progress and end_progress
  parameters in crawl_markdown_file to ensure they are valid
  numeric values in range [0, 100] with start < end
- Validation runs before any async work or progress reporting begins
- Clear error messages indicate which parameter is invalid and why

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/src/server/services/crawling/strategies/single_page.py (1)

227-228: Type hint mismatch: declared as int but validated as int | float.

The type hints specify int, but lines 247-251 accept both int and float. Align the type hints with validation or restrict validation to int only.

Either update type hints to allow float:

-        start_progress: int = 10,
-        end_progress: int = 20
+        start_progress: int | float = 10,
+        end_progress: int | float = 20

Or restrict validation to int only:

-        if not isinstance(start_progress, (int, float)) or not isinstance(end_progress, (int, float)):
+        if not isinstance(start_progress, int) or not isinstance(end_progress, int):
♻️ Duplicate comments (6)
python/src/server/services/crawling/discovery_service.py (6)

243-335: Remove unused helper methods (duplicate reminder).

These methods are not called by discover_files and duplicate its logic. Previous review (commit cdf4323) flagged these for removal, but they remain in the code.


407-459: Dead code: _check_standard_patterns not called (duplicate reminder).

Not invoked by discover_files. Previous review (commit cdf4323) flagged for removal.


528-632: Dead code: _check_well_known_directory and _try_common_variations not called (duplicate reminder).

Neither method is invoked. Previous review (commit cdf4323) marked these for removal.


385-390: Resolve relative sitemap paths against resp.url, not base_url (previous review unaddressed).

Line 389 resolves relative sitemap paths against base_url, which fails when robots.txt is served from a redirected location or subpath. Use resp.url (the actual robots.txt URL) instead.

                             if sitemap_value.lower().startswith(("http://", "https://")):
                                 sitemap_url = sitemap_value
                             else:
-                                # Resolve relative path against base_url
-                                sitemap_url = urljoin(base_url, sitemap_value)
+                                # Resolve relative path against actual robots.txt URL (handles redirects/subpaths)
+                                sitemap_url = urljoin(resp.url, sitemap_value)
                             sitemaps.append(sitemap_url)

Based on previous review comment.


123-241: Missing robots.txt sitemap priority check contradicts PR objectives.

The PR objectives state: "sitemap declarations in robots.txt are treated as highest priority when explicitly declared." However, discover_files never calls _parse_robots_txt or checks robots.txt for sitemap declarations. The current implementation only iterates through DISCOVERY_PRIORITY files, missing the highest-priority source.

Add robots.txt check before Phase 1:

         try:
             logger.info(f"Starting single-file discovery for {base_url}")
+
+            # Priority 0: Check robots.txt for explicit sitemap declarations (highest priority)
+            robots_sitemaps = self._parse_robots_txt(base_url)
+            if robots_sitemaps:
+                # Verify the first declared sitemap exists before returning
+                for sitemap_url in robots_sitemaps:
+                    if self._check_url_exists(sitemap_url):
+                        logger.info(f"Discovery found best file from robots.txt: {sitemap_url}")
+                        return sitemap_url
+                logger.info("Declared sitemaps in robots.txt not reachable; continuing discovery")
 
             # Check files in global priority order

169-174: Prefer urljoin over manual URL construction for robustness.

Lines 171 and 203 manually concatenate URL components, bypassing urljoin's normalization (e.g., handling trailing slashes, double slashes, special characters).

Replace manual construction with urljoin:

                 # Priority 1a: Check same directory for llms files
                 if base_dir and base_dir != '/':
-                    same_dir_url = f"{parsed.scheme}://{parsed.netloc}{base_dir}/{filename}"
+                    base_with_slash = f"{parsed.scheme}://{parsed.netloc}{base_dir}/"
+                    same_dir_url = urljoin(base_with_slash, filename)
                     if self._check_url_exists(same_dir_url):

Apply the same pattern at line 203.

🧹 Nitpick comments (1)
python/src/server/services/crawling/discovery_service.py (1)

337-349: Consider using requests.Session for connection pooling.

The service makes multiple HTTP requests but doesn't reuse connections. A Session object would enable connection pooling and improve performance.

Initialize a session in __init__:

def __init__(self):
    self.session = requests.Session()
    self.session.headers.update({"User-Agent": "Archon-Discovery/1.0"})

Then replace requests.get(...) calls with self.session.get(...) in _check_url_exists, _parse_robots_txt, _parse_html_meta_tags, etc.

Based on learnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cdf4323 and 8ab6c75.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/src/server/services/crawling/strategies/single_page.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/crawling/strategies/single_page.py
  • python/src/server/services/crawling/discovery_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/crawling/strategies/single_page.py
  • python/src/server/services/crawling/discovery_service.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/crawling/strategies/single_page.py
  • python/src/server/services/crawling/discovery_service.py
🧬 Code graph analysis (1)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (5)
python/src/server/services/crawling/discovery_service.py (4)

17-70: LGTM - Constants and structure well-defined.

The DISCOVERY_PRIORITY and DISCOVERY_TARGETS constants provide clear priority ordering and category mapping. The 10MB size limit is reasonable for discovery files.
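In spirit, the priority walk over those constants reduces to a loop like this sketch (the list shown reflects the order at the time of this review; a later commit moves llms.txt first, and the subdirectory fallback is omitted here):

DISCOVERY_PRIORITY = [
    "llms-full.txt",  # order as of this review; a later commit reorders llms.txt first
    "llms.txt",
    "sitemap.xml",
    "robots.txt",
]

def discover_best_file(base_url: str, url_exists) -> str | None:
    """Return the first (highest-priority) candidate that exists, else None (sketch)."""
    for filename in DISCOVERY_PRIORITY:
        candidate = f"{base_url.rstrip('/')}/{filename}"
        if url_exists(candidate):
            return candidate
    return None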


72-122: LGTM - Robust size-limited reading with proper error handling.

The chunked reading with size enforcement prevents memory exhaustion, encoding fallback handles edge cases, and response cleanup is guaranteed.


337-349: LGTM - Clean URL existence check with proper timeout.


461-526: LGTM - HTML meta tag parsing handles case-insensitivity and scheme validation.

The case-insensitive regex patterns and urlparse scheme checks are correct.
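Illustratively, case-insensitive extraction in that style can look like the sketch below; the regexes are stand-ins, not the exact patterns in the service:

import re
from urllib.parse import urljoin, urlparse

LINK_SITEMAP = re.compile(r'<link[^>]*rel=["\']sitemap["\'][^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)
META_SITEMAP = re.compile(r'<meta[^>]*name=["\']sitemap["\'][^>]*content=["\']([^"\']+)["\']', re.IGNORECASE)

def extract_sitemap_refs(html: str, page_url: str) -> list[str]:
    refs = LINK_SITEMAP.findall(html) + META_SITEMAP.findall(html)
    resolved = [urljoin(page_url, ref) for ref in refs]
    # keep only http(s) URLs, mirroring the scheme validation noted above
    return [u for u in resolved if urlparse(u).scheme in ("http", "https")]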

python/src/server/services/crawling/strategies/single_page.py (1)

246-267: LGTM - Comprehensive progress validation follows fail-fast principle.

The validation checks type, range, and ordering before any async work, raising clear ValueError messages. This aligns with the coding guidelines' fail-fast directive.
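The validation reads roughly as follows (a sketch consistent with the diff shown earlier; the real check lives inline in crawl_markdown_file rather than in a helper):

def validate_progress_range(start_progress: int | float, end_progress: int | float) -> None:
    """Fail-fast progress validation before any async work begins (sketch)."""
    if not isinstance(start_progress, (int, float)) or not isinstance(end_progress, (int, float)):
        raise ValueError(
            f"start_progress and end_progress must be numeric, got "
            f"{type(start_progress).__name__} and {type(end_progress).__name__}"
        )
    if not (0 <= start_progress <= 100) or not (0 <= end_progress <= 100):
        raise ValueError(f"Progress values must be within [0, 100]: start={start_progress}, end={end_progress}")
    if start_progress >= end_progress:
        raise ValueError(f"start_progress ({start_progress}) must be less than end_progress ({end_progress})")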

leex279 and others added 2 commits October 19, 2025 11:30
… detection

## Backend Improvements

### Discovery Service
- Fix SSRF protection: Use requests.Session() for max_redirects parameter
- Add comprehensive IP validation (_is_safe_ip, _resolve_and_validate_hostname); see the sketch after this list
- Add hostname DNS resolution validation before requests
- Fix llms.txt link following to crawl ALL same-domain pages (not just llms.txt files)
- Remove unused file variants: llms.md, llms.markdown, sitemap_index.xml, sitemap-index.xml
- Optimize DISCOVERY_PRIORITY based on real-world usage research
- Update priority: llms.txt > llms-full.txt > sitemap.xml > robots.txt
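A rough sketch of what such SSRF validation can look like, as referenced above (the function names follow the commit message; the bodies are assumptions built on the standard ipaddress and socket modules):

import ipaddress
import socket

def is_safe_ip(ip_str: str) -> bool:
    """Reject private, loopback, link-local, and cloud-metadata addresses (sketch)."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return False
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast:
        return False
    if str(ip) == "169.254.169.254":  # AWS/GCP metadata service
        return False
    return True

def resolve_and_validate_hostname(hostname: str) -> bool:
    """Resolve DNS and require every returned address to be safe (sketch)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return all(is_safe_ip(info[4][0]) for info in infos)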

### URL Handler
- Fix .well-known path to be case-sensitive per RFC 8615
- Remove llms.md, llms.markdown, llms.mdx from variant detection
- Simplify link collection patterns to only .txt files (most common)
- Update llms_variants list to only include spec-compliant files

### Crawling Service
- Add tldextract for proper root domain extraction (handles .co.uk, .com.au, etc.); see the sketch after this list
- Replace naive domain extraction with robust get_root_domain() function
- Add tldextract>=5.0.0 to dependencies
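A minimal sketch of root-domain extraction with tldextract, as referenced above (the function name comes from this commit; the fallback detail is an assumption):

import tldextract

def get_root_domain(host: str) -> str:
    """Extract the registrable domain, handling multi-part suffixes (sketch)."""
    ext = tldextract.extract(host)
    if ext.domain and ext.suffix:
        return f"{ext.domain}.{ext.suffix}"  # docs.example.co.uk -> example.co.uk
    return host  # fall back to the original host if extraction fails

# get_root_domain("docs.example.co.uk") == "example.co.uk"
# get_root_domain("api.example.com")    == "example.com"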

## Frontend Improvements

### Type Safety
- Extend ActiveOperation type with discovery fields (discovered_file, discovered_file_type, linked_files)
- Remove all type casting (operation as any) from CrawlingProgress component
- Add proper TypeScript types for discovery information

### Security
- Create URL validation utility (urlValidation.ts)
- Only render clickable links for validated HTTP/HTTPS URLs
- Reject unsafe protocols (javascript:, data:, vbscript:, file:)
- Display invalid URLs as plain text instead of links

## Testing
- Update test mocks to include history and url attributes for redirect checking
- Fix .well-known case sensitivity tests (must be lowercase per RFC 8615)
- Update discovery priority tests to match new order
- Remove tests for deprecated file variants

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
python/src/server/services/crawling/discovery_service.py (1)

423-424: Resolve relative sitemap paths against robots.txt response URL.

Line 424 resolves relative sitemap paths against base_url, which can mis-resolve when robots.txt is redirected or lives on a subpath. Use resp.url to resolve against the actual robots.txt location.

Apply this diff:

                             else:
-                                # Resolve relative path against base_url
-                                sitemap_url = urljoin(base_url, sitemap_value)
+                                # Resolve relative path against robots.txt response URL (handles redirects/paths)
+                                sitemap_url = urljoin(resp.url, sitemap_value)
🧹 Nitpick comments (2)
archon-ui-main/src/features/progress/types/progress.ts (1)

28-37: Consider standardizing CrawlType naming conventions.

The new discovery-related types use snake_case ("llms_txt_with_linked_files"), but the existing "llms-txt" (line 31) uses kebab-case. While "text_file" (line 32) shows snake_case was already in use, the inconsistency between "llms-txt" and the new "llms_txt_*" variants specifically creates confusion since they're closely related concepts.

Consider standardizing on snake_case across all multi-word types for consistency.

Example refactor:

 export type CrawlType =
   | "normal"
   | "sitemap"
-  | "llms-txt"
+  | "llms_txt"
   | "text_file"
   | "refresh"
   | "llms_txt_with_linked_files"
   | "llms_txt_linked_files"
   | "discovery_single_file"
   | "discovery_sitemap";
python/src/server/services/crawling/discovery_service.py (1)

246-255: Remove duplicate cloud metadata service check.

Lines 248-250 and 253-255 both check for 169.254.169.254. The second check is redundant since both AWS and GCP use the same IP address.

Apply this diff:

             # Additional explicit checks for cloud metadata services
             # AWS metadata service
             if str(ip) == "169.254.169.254":
-                logger.warning(f"Blocked AWS metadata service IP: {ip_str}")
+                logger.warning(f"Blocked cloud metadata service IP (AWS/GCP): {ip_str}")
                 return False
 
-            # GCP metadata service
-            if str(ip) == "169.254.169.254":
-                logger.warning(f"Blocked GCP metadata service IP: {ip_str}")
-                return False
-
             return True
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab6c75 and 13796ab.

⛔ Files ignored due to path filters (1)
  • python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx (3 hunks)
  • archon-ui-main/src/features/progress/types/progress.ts (5 hunks)
  • archon-ui-main/src/features/progress/utils/urlValidation.ts (1 hunks)
  • python/pyproject.toml (1 hunks)
  • python/src/server/services/crawling/crawling_service.py (11 hunks)
  • python/src/server/services/crawling/discovery_service.py (1 hunks)
  • python/src/server/services/crawling/helpers/url_handler.py (6 hunks)
  • python/tests/test_discovery_service.py (1 hunks)
  • python/tests/test_url_handler.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/src/server/services/crawling/helpers/url_handler.py
  • archon-ui-main/src/features/progress/components/CrawlingProgress.tsx
🧰 Additional context used
📓 Path-based instructions (9)
archon-ui-main/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

archon-ui-main/src/**/*.{ts,tsx}: Frontend TypeScript must use strict mode with no implicit any
Use TanStack Query for all data fetching; avoid prop drilling
Use database values directly in the frontend; avoid mapping layers between BE and FE types

Never return null to indicate failure in frontend TypeScript; throw errors with details instead

Files:

  • archon-ui-main/src/features/progress/utils/urlValidation.ts
  • archon-ui-main/src/features/progress/types/progress.ts
archon-ui-main/src/features/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Biome in features: 120 character line length, double quotes, and trailing commas

Use Biome formatting/conventions in features (120 char lines, double quotes, trailing commas)

Files:

  • archon-ui-main/src/features/progress/utils/urlValidation.ts
  • archon-ui-main/src/features/progress/types/progress.ts
archon-ui-main/src/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use ESLint rules for legacy frontend code outside /src/features

Files:

  • archon-ui-main/src/features/progress/utils/urlValidation.ts
  • archon-ui-main/src/features/progress/types/progress.ts
archon-ui-main/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

TypeScript strict mode with no implicit any in the frontend

Files:

  • archon-ui-main/src/features/progress/utils/urlValidation.ts
  • archon-ui-main/src/features/progress/types/progress.ts
python/src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

python/src/**/*.py: On service startup, missing configuration, DB connection failures, auth/authorization failures, critical dependency outages, or invalid/corrupting data: fail fast and bubble errors
For batch processing, background tasks, WebSocket events, optional features, and external API calls: continue processing but log errors (with retries/backoff for APIs)
Never accept or persist corrupted data; skip failed items entirely (e.g., zero embeddings, null FKs, malformed JSON)
Error messages must include operation context, IDs/URLs, use specific exception types, preserve full stack traces (logging with exc_info=True), and avoid returning None/null—raise exceptions instead; for batches report success counts and detailed failures
Backend code targets Python 3.12 and adheres to a 120 character line length
Use Ruff for linting (errors, warnings, unused imports) in backend code
Use Mypy for static type checking in backend code

python/src/**/*.py: Fail fast and loud on startup/config/auth/database/critical dependency failures and validation errors in backend services
Preserve full stack traces with logging.exc_info=True in Python logging
Use specific exception types and avoid catching generic Exception
Use Pydantic to raise on data validation errors; never silently accept bad data
Python 3.12 code should conform to a 120-character line length
Never return None to indicate failure in backend Python; raise exceptions with details instead

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept corrupted data in backend flows; skip failed items entirely instead of storing bad data (e.g., zero embeddings, null FKs, malformed JSON)

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
python/src/server/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/services/**/*.py: For batch/background/optional operations, continue processing but log detailed per-item failures; report success counts and failure lists
Implement business logic in service modules, keeping database concerns separate
Place service logic under python/src/server/services/

Files:

  • python/src/server/services/crawling/crawling_service.py
  • python/src/server/services/crawling/discovery_service.py
archon-ui-main/src/features/*/types/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Define shared types under src/features/[feature]/types

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
archon-ui-main/src/features/**/types/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Define feature types under src/features/[feature]/types/

Files:

  • archon-ui-main/src/features/progress/types/progress.ts
🧠 Learnings (1)
📚 Learning: 2025-08-28T13:07:24.810Z
Learnt from: Wirasm
PR: coleam00/Archon#514
File: archon-ui-main/src/services/crawlProgressService.ts:35-39
Timestamp: 2025-08-28T13:07:24.810Z
Learning: The crawlProgressService.ts in the Archon codebase should be deprecated in favor of the existing useCrawlProgressPolling hook from usePolling.ts, which already includes ETag support, 304 handling, tab visibility detection, and proper React lifecycle integration. This consolidation reduces code duplication and improves performance.

Applied to files:

  • archon-ui-main/src/features/progress/types/progress.ts
🧬 Code graph analysis (4)
python/tests/test_url_handler.py (1)
python/src/server/services/crawling/helpers/url_handler.py (4)
  • is_robots_txt (609-627)
  • is_llms_variant (630-662)
  • is_well_known_file (665-684)
  • get_base_url (687-707)
python/tests/test_discovery_service.py (1)
python/src/server/services/crawling/discovery_service.py (5)
  • DiscoveryService (50-558)
  • discover_files (78-114)
  • _check_url_exists (293-380)
  • _parse_robots_txt (382-445)
  • _parse_html_meta_tags (447-509)
python/src/server/services/crawling/crawling_service.py (3)
python/src/server/services/crawling/discovery_service.py (2)
  • DiscoveryService (50-558)
  • discover_files (78-114)
python/src/server/services/crawling/helpers/url_handler.py (9)
  • is_sitemap (21-38)
  • is_llms_variant (630-662)
  • is_robots_txt (609-627)
  • is_well_known_file (665-684)
  • is_txt (61-77)
  • is_binary_file (80-177)
  • get_base_url (687-707)
  • extract_markdown_links_with_text (298-387)
  • extract_display_name (460-606)
python/src/server/services/crawling/strategies/single_page.py (1)
  • crawl_markdown_file (222-314)
python/src/server/services/crawling/discovery_service.py (1)
python/src/server/config/logfire_config.py (1)
  • get_logger (137-147)
🔇 Additional comments (15)
archon-ui-main/src/features/progress/types/progress.ts (4)

9-9: LGTM: Discovery status addition is appropriate.

The new "discovery" status integrates cleanly with the existing progress stages and aligns with the PR's automatic discovery feature.


61-64: LGTM: Discovery fields properly typed.

The camelCase naming, optional markers, and type choices (string for file paths/types, string[] for linked files) are consistent with the interface conventions and appropriate for the discovery feature.


117-120: LGTM: Backend response fields properly defined.

The snake_case naming matches the backend API response format and follows the established pattern in this interface for mapping backend data (as seen with operation_id, operation_type, etc.).


148-154: LGTM: Dual-convention fields support both backend and frontend patterns.

Providing both camelCase and snake_case variants follows the established pattern in this interface and aligns with the coding guideline to use database values directly without transformation layers. The comment clarifying the snake_case origin is helpful.

archon-ui-main/src/features/progress/utils/urlValidation.ts (2)

6-7: Constants are well-defined.

The safe and unsafe protocol lists are clear and appropriate for URL validation. However, note that UNSAFE_PROTOCOLS is currently used in a redundant check (see comment below on lines 35-38).


40-43: No issues found. Hostname validation strictness is appropriate.

The isValidHttpUrl function validates discovered files from web crawling operations. In this context, single-label hostnames are not legitimate: web crawling discovers URLs from internet resources (sitemap.xml, llms.txt files), which naturally produce fully qualified domain names. The strictness is intentional and appropriate—it aligns with the use case of rendering discovered URLs as safe clickable links, with graceful fallback to plain text for invalid URLs.

python/pyproject.toml (1)

62-62: LGTM! Dependency addition is appropriate.

The tldextract>=5.0.0 constraint is correctly placed in the server dependency group and aligns with the new domain extraction functionality used by discovery-driven crawling.

python/tests/test_url_handler.py (1)

127-241: LGTM! Comprehensive test coverage for URL detection helpers.

The tests thoroughly cover positive cases, edge cases, RFC compliance, and error handling for all new URL detection methods.

python/tests/test_discovery_service.py (1)

1-329: LGTM! Excellent test coverage for DiscoveryService.

The test suite comprehensively covers:

  • Priority-based file selection
  • Subdirectory fallback logic
  • Robots.txt and HTML meta tag parsing
  • Error handling and edge cases
  • Mock setup correctly simulates streaming HTTP responses
python/src/server/services/crawling/crawling_service.py (6)

14-14: LGTM! Clean imports for discovery functionality.

The tldextract and DiscoveryService imports support the new discovery-driven crawling features.

Also applies to: 23-23


43-69: LGTM! Robust domain extraction with fallback.

The get_root_domain function correctly uses tldextract to handle multi-part public suffixes (e.g., .co.uk) and falls back to the original host on extraction failure.


389-450: LGTM! Discovery integration correctly offloads blocking I/O.

The discovery phase correctly:

  • Uses asyncio.to_thread to prevent blocking the event loop (line 409); see the sketch after this list
  • Skips discovery for URLs that are already discovery targets
  • Provides detailed progress updates
  • Handles errors gracefully without aborting the crawl
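The offload pattern itself is small; a sketch assuming discover_files is the blocking, requests-based entry point:

import asyncio

async def run_discovery(discovery_service, url: str) -> str | None:
    """Offload blocking, requests-based discovery to a worker thread (sketch)."""
    # asyncio.to_thread keeps the event loop responsive during network I/O;
    # discover_files already returns None on failure, so no wrapping is needed here.
    return await asyncio.to_thread(discovery_service.discover_files, url)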

452-484: LGTM! Discovery-target crawling implements single-file priority.

When discovery succeeds, the orchestrator correctly:

  • Crawls ONLY the discovered file (not the main URL)
  • Marks it as a discovery target for downstream domain filtering
  • Sets original_domain from the discovered URL for accurate filtering

753-811: LGTM! Domain filtering correctly handles multi-part suffixes.

The domain helpers correctly use get_root_domain with tldextract to handle multi-part public suffixes (e.g., example.co.uk → root domain "example.co.uk"), ensuring accurate same-domain and subdomain checks.


869-1011: LGTM! Discovery-aware crawling aligns with PR objectives.

The implementation correctly handles:

  • llms.txt link following: Extracts and follows all same-domain links from llms.txt files (lines 890-934), which aligns with the PR objective that "llms.txt files act as link collections"
  • Single-file mode for non-llms discovery targets: Sitemaps and robots.txt return single-file representations without following links (lines 936-940, 1022-1033)
  • Domain filtering: Filters links to same-domain using _is_same_domain_or_subdomain (line 903)
  • Depth awareness: Respects max_depth for recursive crawling of discovery targets (lines 982-1002)

The behavior matches the reported testing results from TimothiousAI.

- Fix test mocks to use requests.Session for _check_url_exists
- Add url parameter to create_mock_response to prevent MagicMock issues
- Update all test scenarios to mock both requests.get and session.get
- Remove redundant UNSAFE_PROTOCOLS check in URL validation
- Fix test assertions to match new priority order (llms.txt > llms-full.txt)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
python/tests/test_discovery_service.py (1)

256-260: Consider tightening the assertion for better test specificity.

The current assertion uses >= and any() which only verifies that at least one sitemap URL containing 'sitemap' was found. Given the specific HTML content with two distinct sitemap references, the test could be more precise.

Consider this more specific assertion:

-        # Should find sitemaps from both link and meta tags
-        assert len(result) >= 1
-        assert any('sitemap' in url.lower() for url in result)
+        # Should find sitemaps from both link and meta tags
+        assert len(result) == 2
+        assert "https://example.com/sitemap.xml" in result
+        assert "https://example.com/sitemap-meta.xml" in result

This makes test failures more informative and catches regressions in parsing logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13796ab and 957d8b9.

📒 Files selected for processing (2)
  • archon-ui-main/src/features/progress/utils/urlValidation.ts (1 hunks)
  • python/tests/test_discovery_service.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • archon-ui-main/src/features/progress/utils/urlValidation.ts
🧰 Additional context used
🧬 Code graph analysis (1)
python/tests/test_discovery_service.py (1)
python/src/server/services/crawling/discovery_service.py (5)
  • DiscoveryService (50-558)
  • discover_files (78-114)
  • _check_url_exists (293-380)
  • _parse_robots_txt (382-445)
  • _parse_html_meta_tags (447-509)
🔇 Additional comments (2)
python/tests/test_discovery_service.py (2)

8-36: Well-designed test helpers.

The mock factory functions properly simulate DNS resolution and HTTP responses with streaming support. The inclusion of iter_content, history, and close attributes accurately mirrors the real requests.Response API used by the DiscoveryService.


262-353: Comprehensive priority and error handling tests.

The multi-scenario priority behavior test (lines 262-332) effectively validates that discovery selects the highest-priority file under varying availability conditions. The network error handling test (lines 333-353) properly verifies graceful degradation without exceptions, which is critical for production resilience.

Comment on lines +1 to +353
"""Unit tests for DiscoveryService class."""
import socket
from unittest.mock import Mock, patch

from src.server.services.crawling.discovery_service import DiscoveryService


def create_mock_dns_response():
"""Create mock DNS response for safe public IPs."""
# Return a safe public IP for testing
return [
(socket.AF_INET, socket.SOCK_STREAM, 6, '', ('93.184.216.34', 0)) # example.com's actual IP
]


def create_mock_response(status_code: int, text: str = "", url: str = "https://example.com") -> Mock:
"""Create a mock response object that supports streaming API."""
response = Mock()
response.status_code = status_code
response.text = text
response.encoding = 'utf-8'
response.history = [] # Empty list for no redirects
response.url = url # Mock URL for redirect checks (must be string, not Mock)

# Mock iter_content to yield text in chunks as bytes
text_bytes = text.encode('utf-8')
chunk_size = 8192
chunks = [text_bytes[i:i+chunk_size] for i in range(0, len(text_bytes), chunk_size)]
if not chunks:
chunks = [b''] # Ensure at least one empty chunk
response.iter_content = Mock(return_value=iter(chunks))

# Mock close method
response.close = Mock()

return response


class TestDiscoveryService:
"""Test suite for DiscoveryService class."""

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_basic(self, mock_get, mock_session, mock_dns):
"""Test main discovery method returns single best file."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock robots.txt response (no sitemaps)
robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

# Mock file existence - llms-full.txt doesn't exist, but llms.txt does
def mock_get_side_effect(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif url.endswith('llms-full.txt'):
return create_mock_response(404) # Highest priority doesn't exist
elif url.endswith('llms.txt'):
return create_mock_response(200) # Second priority exists
else:
return create_mock_response(404)

mock_get.side_effect = mock_get_side_effect
mock_session.return_value.get.side_effect = mock_get_side_effect

result = service.discover_files(base_url)

# Should return single URL string (not dict, not list)
assert isinstance(result, str)
assert result == 'https://example.com/llms.txt'

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_no_files_found(self, mock_get, mock_session, mock_dns):
"""Test discovery when no files are found."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock all HTTP requests to return 404
mock_get.return_value = create_mock_response(404)
mock_session.return_value.get.return_value = create_mock_response(404)

result = service.discover_files(base_url)

# Should return None when no files found
assert result is None

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_priority_order(self, mock_get, mock_session, mock_dns):
"""Test that discovery follows the correct priority order."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock robots.txt response (no sitemaps declared)
robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

# Mock file existence - both sitemap.xml and llms.txt exist, but llms.txt has higher priority
def mock_get_side_effect(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif url.endswith('llms.txt') or url.endswith('sitemap.xml'):
return create_mock_response(200) # Both exist
else:
return create_mock_response(404)

mock_get.side_effect = mock_get_side_effect
mock_session.return_value.get.side_effect = mock_get_side_effect

result = service.discover_files(base_url)

# Should return llms.txt since it has higher priority than sitemap.xml
assert result == 'https://example.com/llms.txt'

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_robots_sitemap_priority(self, mock_get, mock_session, mock_dns):
"""Test that llms files have priority over robots.txt sitemap declarations."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock robots.txt response WITH sitemap declaration
robots_response = create_mock_response(200, "User-agent: *\nSitemap: https://example.com/declared-sitemap.xml")

# Mock other files also exist (both llms and sitemap files)
def mock_get_side_effect(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif 'llms' in url or 'sitemap' in url:
return create_mock_response(200)
else:
return create_mock_response(404)

mock_get.side_effect = mock_get_side_effect
mock_session.return_value.get.side_effect = mock_get_side_effect

result = service.discover_files(base_url)

# Should return llms.txt (highest priority llms file) since llms files have priority over sitemaps
# even when sitemaps are declared in robots.txt
assert result == 'https://example.com/llms.txt'

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_subdirectory_fallback(self, mock_get, mock_session, mock_dns):
"""Test discovery falls back to subdirectories for llms files."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock robots.txt response (no sitemaps declared)
robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

# Mock file existence - no root llms files, but static/llms.txt exists
def mock_get_side_effect(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif '/static/llms.txt' in url:
return create_mock_response(200) # Found in subdirectory
else:
return create_mock_response(404)

mock_get.side_effect = mock_get_side_effect
mock_session.return_value.get.side_effect = mock_get_side_effect

result = service.discover_files(base_url)

# Should find the file in static subdirectory
assert result == 'https://example.com/static/llms.txt'

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_check_url_exists(self, mock_get, mock_session, mock_dns):
"""Test URL existence checking."""
service = DiscoveryService()

# Test successful response
mock_get.return_value = create_mock_response(200)
mock_session.return_value.get.return_value = create_mock_response(200)
assert service._check_url_exists("https://example.com/exists") is True

# Test 404 response
mock_get.return_value = create_mock_response(404)
mock_session.return_value.get.return_value = create_mock_response(404)
assert service._check_url_exists("https://example.com/not-found") is False

# Test network error
mock_get.side_effect = Exception
mock_session.return_value.get.side_effect = Exception("Network error")
assert service._check_url_exists("https://example.com/error") is False

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_parse_robots_txt_with_sitemap(self, mock_get, mock_session, mock_dns):
"""Test robots.txt parsing with sitemap directives."""
service = DiscoveryService()

# Mock successful robots.txt response
robots_text = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml"""
mock_get.return_value = create_mock_response(200, robots_text)

result = service._parse_robots_txt("https://example.com")

assert len(result) == 2
assert "https://example.com/sitemap.xml" in result
assert "https://example.com/sitemap-news.xml" in result
mock_get.assert_called_once_with("https://example.com/robots.txt", timeout=30, stream=True, verify=True, headers={'User-Agent': 'Archon-Discovery/1.0 (SSRF-Protected)'})

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_parse_robots_txt_no_sitemap(self, mock_get, mock_session, mock_dns):
"""Test robots.txt parsing without sitemap directives."""
service = DiscoveryService()

# Mock robots.txt without sitemaps
robots_text = """User-agent: *
Disallow: /admin/
Allow: /public/"""
mock_get.return_value = create_mock_response(200, robots_text)

result = service._parse_robots_txt("https://example.com")

assert len(result) == 0
mock_get.assert_called_once_with("https://example.com/robots.txt", timeout=30, stream=True, verify=True, headers={'User-Agent': 'Archon-Discovery/1.0 (SSRF-Protected)'})

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_parse_html_meta_tags(self, mock_get, mock_session, mock_dns):
"""Test HTML meta tag parsing for sitemaps."""
service = DiscoveryService()

# Mock HTML with sitemap references
html_content = """
<html>
<head>
<link rel="sitemap" href="/sitemap.xml">
<meta name="sitemap" content="https://example.com/sitemap-meta.xml">
</head>
<body>Content here</body>
</html>
"""
mock_get.return_value = create_mock_response(200, html_content)

result = service._parse_html_meta_tags("https://example.com")

# Should find sitemaps from both link and meta tags
assert len(result) >= 1
assert any('sitemap' in url.lower() for url in result)
mock_get.assert_called_once_with("https://example.com", timeout=30, stream=True, verify=True, headers={'User-Agent': 'Archon-Discovery/1.0 (SSRF-Protected)'})

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discovery_priority_behavior(self, mock_get, mock_session, mock_dns):
"""Test that discovery returns highest-priority file when multiple files exist."""
service = DiscoveryService()
base_url = "https://example.com"

# Mock robots.txt response (no sitemaps declared)
robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

# Scenario 1: All files exist - should return llms.txt (highest priority)
def mock_all_exist(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif any(file in url for file in ['llms.txt', 'llms-full.txt', 'sitemap.xml']):
return create_mock_response(200)
else:
return create_mock_response(404)

mock_get.side_effect = mock_all_exist
mock_session.return_value.get.side_effect = mock_all_exist
result = service.discover_files(base_url)
assert result == 'https://example.com/llms.txt', "Should return llms.txt when all files exist (highest priority)"

# Scenario 2: llms.txt missing, others exist - should return llms-full.txt
def mock_without_txt(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif url.endswith('llms.txt'):
return create_mock_response(404)
elif any(file in url for file in ['llms-full.txt', 'sitemap.xml']):
return create_mock_response(200)
else:
return create_mock_response(404)

mock_get.side_effect = mock_without_txt
mock_session.return_value.get.side_effect = mock_without_txt
result = service.discover_files(base_url)
assert result == 'https://example.com/llms-full.txt', "Should return llms-full.txt when llms.txt is missing"

# Scenario 3: Only sitemap files exist - should return sitemap.xml
def mock_only_sitemaps(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif any(file in url for file in ['llms.txt', 'llms-full.txt']):
return create_mock_response(404)
elif url.endswith('sitemap.xml'):
return create_mock_response(200)
else:
return create_mock_response(404)

mock_get.side_effect = mock_only_sitemaps
mock_session.return_value.get.side_effect = mock_only_sitemaps
result = service.discover_files(base_url)
assert result == 'https://example.com/sitemap.xml', "Should return sitemap.xml when llms files are missing"

# Scenario 4: llms files have priority over sitemap files
def mock_llms_and_sitemap(url, **kwargs):
if url.endswith('robots.txt'):
return robots_response
elif url.endswith('llms.txt') or url.endswith('sitemap.xml'):
return create_mock_response(200)
else:
return create_mock_response(404)

mock_get.side_effect = mock_llms_and_sitemap
mock_session.return_value.get.side_effect = mock_llms_and_sitemap
result = service.discover_files(base_url)
assert result == 'https://example.com/llms.txt', "Should prefer llms.txt over sitemap.xml"

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_network_error_handling(self, mock_get, mock_session, mock_dns):
"""Test error scenarios with network failures."""
service = DiscoveryService()

# Mock network error
mock_get.side_effect = Exception("Network error")
mock_session.return_value.get.side_effect = Exception("Network error")

# Should not raise exception, but return None
result = service.discover_files("https://example.com")
assert result is None

# Individual methods should also handle errors gracefully
result = service._parse_robots_txt("https://example.com")
assert result == []

result = service._parse_html_meta_tags("https://example.com")
assert result == []


🛠️ Refactor suggestion | 🟠 Major

Comprehensive test suite with minor improvements needed.

This test file provides excellent coverage of the DiscoveryService functionality with 11 well-structured test methods. The use of mocking is appropriate, and error handling is thoroughly tested. The main improvements needed are:

  1. Clarifying the misleading comment about priority (line 58)
  2. Adding test coverage for base URLs with path components (the Supabase edge case mentioned in PR objectives)
  3. Optional: Tightening assertions in HTML meta tag parsing test

The test structure and overall approach are sound.

🤖 Prompt for AI Agents
python/tests/test_discovery_service.py around lines 1-353, update the misleading
comment at line 58 to accurately state that llms.txt has higher priority than
sitemap.xml, add a new unit test that covers base URLs containing path
components (e.g., "https://example.com/base/path") by patching
socket.getaddrinfo and requests similar to other tests, mocking robots.txt and
file responses to ensure discover_files correctly constructs and checks files
under the path component and returns the expected URL, and tighten the
assertions in test_parse_html_meta_tags by replacing the loose any(...) check
with explicit expected sitemap URLs (including resolution of relative link
hrefs) and asserting their presence in the result; implement these changes by
editing the comment text, adding a new test method with the same patch
decorators and mock patterns used elsewhere, and changing the html meta test
assertions to assert equality/contains for the exact normalized sitemap URLs.

Comment on lines +54 to +62
        def mock_get_side_effect(url, **kwargs):
            if url.endswith('robots.txt'):
                return robots_response
            elif url.endswith('llms-full.txt'):
                return create_mock_response(404)  # Highest priority doesn't exist
            elif url.endswith('llms.txt'):
                return create_mock_response(200)  # Second priority exists
            else:
                return create_mock_response(404)

⚠️ Potential issue | 🟡 Minor

Clarify the misleading comment about priority.

The comment on line 58 states "Highest priority doesn't exist" referring to llms-full.txt, but according to DISCOVERY_PRIORITY in the implementation, llms.txt has higher priority than llms-full.txt. The test logic is correct, but the comment is misleading.

Apply this diff to fix the comment:

         # Mock file existence - llms-full.txt doesn't exist, but llms.txt does
         def mock_get_side_effect(url, **kwargs):
             if url.endswith('robots.txt'):
                 return robots_response
             elif url.endswith('llms-full.txt'):
-                return create_mock_response(404)  # Highest priority doesn't exist
+                return create_mock_response(404)  # Lower priority file doesn't exist
             elif url.endswith('llms.txt'):
-                return create_mock_response(200)  # Second priority exists
+                return create_mock_response(200)  # Highest priority file exists
             else:
                 return create_mock_response(404)
🤖 Prompt for AI Agents
In python/tests/test_discovery_service.py around lines 54 to 62, the inline
comment saying "Highest priority doesn't exist" for the llms-full.txt branch is
misleading because DISCOVERY_PRIORITY gives llms.txt higher priority than
llms-full.txt; update the comment to reflect that llms-full.txt is the
lower-priority file (so it doesn't exist) or reword to "Lower priority doesn't
exist" so it accurately describes the test scenario.

Comment on lines +147 to +174
    @patch('socket.getaddrinfo', return_value=create_mock_dns_response())
    @patch('requests.Session')
    @patch('requests.get')
    def test_discover_files_subdirectory_fallback(self, mock_get, mock_session, mock_dns):
        """Test discovery falls back to subdirectories for llms files."""
        service = DiscoveryService()
        base_url = "https://example.com"

        # Mock robots.txt response (no sitemaps declared)
        robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

        # Mock file existence - no root llms files, but static/llms.txt exists
        def mock_get_side_effect(url, **kwargs):
            if url.endswith('robots.txt'):
                return robots_response
            elif '/static/llms.txt' in url:
                return create_mock_response(200)  # Found in subdirectory
            else:
                return create_mock_response(404)

        mock_get.side_effect = mock_get_side_effect
        mock_session.return_value.get.side_effect = mock_get_side_effect

        result = service.discover_files(base_url)

        # Should find the file in static subdirectory
        assert result == 'https://example.com/static/llms.txt'
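
The fallback exercised here amounts to expanding each llms filename across a set of well-known subdirectories before giving up. A sketch, assuming a small hard-coded list; the service's actual subdirectory set may differ:

# SUBDIRECTORIES is an assumed constant for the sketch.
SUBDIRECTORIES = ["", "static/", "public/"]

def candidate_urls(base_url: str, filename: str) -> list[str]:
    """List root-level and subdirectory locations to probe for one filename."""
    root = base_url.rstrip("/")
    return [f"{root}/{subdir}{filename}" for subdir in SUBDIRECTORIES]

# candidate_urls("https://example.com", "llms.txt")
# -> ['https://example.com/llms.txt',
#     'https://example.com/static/llms.txt',
#     'https://example.com/public/llms.txt']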

🛠️ Refactor suggestion | 🟠 Major

Add test coverage for base URLs with path components.

The current test uses base_url = "https://example.com" (root domain only). Per the PR objectives, there's a known edge case where base URLs with path components (e.g., https://supabase.com/docs) should check both same-directory and root-level URLs. This scenario is not currently tested.

Add a test case like this:

@patch('socket.getaddrinfo', return_value=create_mock_dns_response())
@patch('requests.Session')
@patch('requests.get')
def test_discover_files_path_based_url(self, mock_get, mock_session, mock_dns):
    """Test discovery with base URL containing a path component."""
    service = DiscoveryService()
    base_url = "https://example.com/docs"  # URL with path component

    # Mock robots.txt response
    robots_response = create_mock_response(200, "User-agent: *\nDisallow: /admin/")

    # Mock file existence - llms.txt exists in same directory as base URL path
    def mock_get_side_effect(url, **kwargs):
        if url.endswith('robots.txt'):
            return robots_response
        elif url == 'https://example.com/docs/llms.txt':
            return create_mock_response(200)  # Same directory
        else:
            return create_mock_response(404)

    mock_get.side_effect = mock_get_side_effect
    mock_session.return_value.get.side_effect = mock_get_side_effect

    result = service.discover_files(base_url)

    # Should find the file in same directory as base URL path
    assert result == 'https://example.com/docs/llms.txt'

Based on learnings from PR objectives.

🤖 Prompt for AI Agents
In python/tests/test_discovery_service.py around lines 147 to 174, add a new
unit test that covers base URLs containing path components (e.g.,
https://example.com/docs) so discovery checks both same-directory and root-level
file locations: create a patched test similar to
test_discover_files_subdirectory_fallback that sets base_url to a path-based
URL, mocks socket.getaddrinfo, requests.Session and requests.get, returns a
robots.txt response, and returns a 200 for the same-directory llms.txt
(https://example.com/docs/llms.txt) while 404 elsewhere; assert that
service.discover_files(base_url) returns the same-directory URL.
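
One way to implement the behavior the suggested test targets is to probe the base URL's own directory before falling back to the site root. A hypothetical helper, not the service's actual API:

from urllib.parse import urlparse

def path_aware_candidates(base_url: str, filename: str) -> list[str]:
    """Check the base URL's own directory first, then the site root."""
    parsed = urlparse(base_url)
    root = f"{parsed.scheme}://{parsed.netloc}"
    candidates = []
    path = parsed.path.rstrip("/")
    if path:  # base URL has a path component, e.g. /docs
        candidates.append(f"{root}{path}/{filename}")
    candidates.append(f"{root}/{filename}")
    return candidates

# path_aware_candidates("https://example.com/docs", "llms.txt")
# -> ['https://example.com/docs/llms.txt', 'https://example.com/llms.txt']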

leex279 and others added 2 commits October 19, 2025 15:52
The tldextract package was missing from the 'all' dependency group,
causing CI test failures. It was already in the 'server' group but
needed in 'all' for running unit tests in CI/CD.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Aligns test expectations with the llms.txt specification, which uses
'pages' rather than 'files' terminology. The implementation correctly
uses "llms_txt_with_linked_pages"; this updates the test to match.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@leex279 leex279 merged commit 68fb4a8 into main Oct 19, 2025
8 checks passed
@Wirasm Wirasm deleted the feature/automatic-discovery-llms-sitemap-430 branch April 6, 2026 07:37