Fix crawler timeout for JavaScript-heavy documentation sites #395

Merged
Wirasm merged 3 commits into main from fix/crawler-js-doc-sites on Aug 22, 2025

Conversation

@Wirasm (Collaborator) commented Aug 20, 2025

Pull Request

Summary

Fixes a crawler timeout issue when crawling JavaScript-heavy documentation sites such as the Stripe.js docs. The crawler was waiting for a 'body' selector that already exists in the HTML DOM, causing unnecessary 30-second timeouts.

Changes Made

  • Remove wait_for='body' selector from documentation site crawler configurations
  • Fix exception chaining to preserve stack traces for better debugging
  • Remove unused asyncio imports from crawling strategies

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

# Ready for testing with:
# - https://docs.stripe.com/js
# - Other JavaScript-heavy documentation sites

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

None - This fix maintains backward compatibility while enabling crawling of sites that previously timed out.

Additional Notes

Root Cause

The wait_for='body' selector was problematic because:

  1. The body element exists immediately when HTML loads
  2. For JavaScript-rendered content, we need to wait for JS execution, not the body element
  3. This caused 30-second timeouts on valid documentation sites

Solution

Now relies on:

  • domcontentloaded event for DOM readiness
  • delay_before_return_html (1.0s for doc sites) for JavaScript rendering time

This fix allows crawling of JavaScript-heavy documentation that previously failed while maintaining compatibility with all existing crawlable sites.
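For illustration, a minimal sketch of what the documentation-site run config looks like after this change. Field names follow the crawl4ai CrawlerRunConfig usage shown elsewhere in this PR; the concrete values are assumptions, not a copy of the merged code.

from crawl4ai import CacheMode, CrawlerRunConfig

# Sketch of a doc-site config after the fix: no wait_for="body".
# Readiness comes from wait_until plus a short render delay for client-side JS.
doc_site_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    stream=True,
    wait_until="domcontentloaded",   # from the CRAWL_WAIT_STRATEGY setting (assumed default)
    page_timeout=30000,              # 30s budget for documentation sites
    delay_before_return_html=1.0,    # give JavaScript time to render before capturing HTML
    wait_for_images=False,
    scan_full_page=True,
    process_iframes=True,
)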

Summary by CodeRabbit

  • New Features

    • Progress reporting enhanced: callbacks receive step number, message, and extra progress data; results preserve and map back to original URLs.
  • Bug Fixes

    • Smarter page-readiness wait strategy for documentation sites to reduce timeouts/partial loads.
    • Error handling improved to preserve original causes and provide a default fallback for missing settings.
    • Binary files are now explicitly skipped during crawling.
  • Chores

    • Removed unused imports and minor logging/formatting tidy-ups.

coderabbitai Bot commented Aug 20, 2025

Walkthrough

Updates two crawling strategies: remove unused asyncio imports, preserve original errors with exception chaining on config load failures, switch documentation-site waits from an explicit wait_for="body" to wait_until/CRAWL_WAIT_STRATEGY, enrich progress callbacks with a step_info payload, and minor logging/formatting tweaks.

Changes

Cohort / File(s): Batch crawling strategy (python/src/server/services/crawling/strategies/batch.py)
Summary of changes: Removed unused asyncio import; re-raised config load errors using raise ... from e; added default settings = {} on non-critical load failure; removed explicit wait_for="body" from CrawlerRunConfig for docs batches; progress reporter now accepts extra kwargs and sends step_info; minor log formatting.

Cohort / File(s): Recursive crawling strategy (python/src/server/services/crawling/strategies/recursive.py)
Summary of changes: Removed unused asyncio import; chained config load exceptions; changed docs wait from wait_for="body" to wait_until driven by CRAWL_WAIT_STRATEGY; transformed URLs crawled with url_mapping to report original URLs; skip binary files; progress callbacks receive step_info with currentStep and stepMessage; mapped streaming results back to originals; logging and formatting adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Strategy as Crawl Strategy (batch/recursive)
  participant Config as CrawlerRunConfig
  participant Engine as Crawl Engine
  participant Progress as Progress Callback

  Caller->>Strategy: start_crawl(params)
  Strategy->>Strategy: load crawler settings
  alt load error (ValueError/KeyError/TypeError)
    Strategy-->>Caller: raise ValueError(...) from e
  else settings available (or default {})
    Strategy->>Config: build config (use wait_until from CRAWL_WAIT_STRATEGY)
    Note right of Config: removed explicit wait_for="body"
    Strategy->>Engine: arun_many(transformed_urls, config)
    Engine-->>Strategy: streaming progress/events
    Strategy->>Progress: on_progress(step_info={currentStep, stepMessage}, **extra)
    Engine-->>Strategy: streaming/results (mapped to original URLs)
    Strategy-->>Caller: aggregated results
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through code and thread,
Chains exceptions, clears the dread.
Waits now listen, not just look,
Step_info scribbles in the book.
Logs aligned, results restored — hooray, the crawl’s unboxed today. 🐇✨

@coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
python/src/server/services/crawling/strategies/batch.py (2)

165-194: Remove unused loop counter j

j is initialized and incremented but never used. Safe to remove.

-            j = 0
...
-                j += 1

121-126: Unify progress callback shape with recursive strategy

Batch still calls progress_callback without step_info, while recursive adds step metadata. Consider aligning for consumers that expect the richer payload.

-        async def report_progress(percentage: int, message: str):
+        async def report_progress(percentage: int, message: str, **kwargs):
             """Helper to report progress if callback is available"""
             if progress_callback:
-                await progress_callback("crawling", percentage, message)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, **step_info)
python/src/server/services/crawling/strategies/recursive.py (4)

128-134: Progress callback enrichment is helpful; consider aligning batch, too

Adding step_info improves UI/telemetry. Recommend aligning the batch strategy to the same shape for consistency across flows.


145-147: Avoid double-normalization work per URL

current_urls is already normalized; this comprehension normalizes twice. Minor, but easy to streamline.

-            urls_to_crawl = [
-                normalize_url(url) for url in current_urls if normalize_url(url) not in visited
-            ]
+            urls_to_crawl = [url for url in current_urls if url not in visited]

211-214: Guard against missing/None links in results

Some crawler results may not populate links or may set it to None. Add a safe default to avoid attribute errors.

-                        for link in result.links.get("internal", []):
+                        links = getattr(result, "links", {}) or {}
+                        for link in links.get("internal", []):

215-221: Avoid duplicate is_binary_file checks per link

Compute once to reduce overhead and clarify intent.

-                            if next_url not in visited and not self.url_handler.is_binary_file(
-                                next_url
-                            ):
-                                next_level_urls.add(next_url)
-                            elif self.url_handler.is_binary_file(next_url):
+                            is_binary = self.url_handler.is_binary_file(next_url)
+                            if next_url not in visited and not is_binary:
+                                next_level_urls.add(next_url)
+                            elif is_binary:
                                 logger.debug(f"Skipping binary file from crawl queue: {next_url}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b29d20 and f077820.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/strategies/batch.py (1 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
python/src/server/services/crawling/strategies/recursive.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • URLHandler (14-127)
  • is_binary_file (52-97)
python/src/server/services/crawling/strategies/batch.py (1)
  • report_progress (122-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Docker Build Tests (server)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (4)
python/src/server/services/crawling/strategies/batch.py (3)

72-72: Good fix: exception chaining preserves traceback

Raising from the original exception keeps the full stack. This will help debugging config issues.
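For reference, a minimal illustration of the pattern (hypothetical helper, not the actual batch.py code):

def _parse_crawl_settings(raw: dict) -> dict:
    try:
        return {"batch_size": int(raw["CRAWL_BATCH_SIZE"])}
    except (ValueError, KeyError, TypeError) as e:
        # "from e" stores the original error as __cause__, so logs show the full chain
        raise ValueError(f"Failed to load crawler configuration: {e}") from e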


75-77: Robust logging on settings fallback

Logging with exc_info while safely falling back to defaults is the right tradeoff for availability.


91-103: Doc-site wait strategy change is correct (prevents 30s timeouts)

Dropping wait_for="body" and relying on wait_until + delay_before_return_html for JS-heavy docs is the right approach. This avoids waiting on a selector that appears immediately while still allowing the page to render.

python/src/server/services/crawling/strategies/recursive.py (1)

94-109: Doc-site wait strategy: LGTM

Switching to wait_until with a short delay and skipping image waits addresses JS-rendered docs without the false-positive wait on the body element.

Comment thread python/src/server/services/crawling/strategies/recursive.py
coderabbitai Bot commented Aug 20, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.

coderabbitai Bot commented Aug 20, 2025

Here are the copyable unit test edits:

Copyable Edits

python/tests/server/services/crawling/strategies/test_batch.py

This is a new file.

# Tests for BatchCrawlStrategy.crawl_batch_with_progress
# Framework: pytest (with pytest-asyncio)
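# NOTE: helpers referenced below (_ASYNC_MARK, DummyResult, DummyCrawler,
# DummyMarkdownGenerator, _patch_dependencies) are presumably defined earlier
# in the generated file; only one test is excerpted here.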

@_ASYNC_MARK
async def test_explicit_max_concurrent_overrides_settings(monkeypatch, strategy_module):
    urls = [f"https://override.com/{i}" for i in range(3)]

    def transform(u: str) -> str:
        return u

    def is_doc(u: str) -> bool:
        return False

    # One batch of 3 all success
    batch = [DummyResult(url=u) for u in urls]
    crawler = DummyCrawler(batches=[batch])

    # Settings specify 2, but we will pass 5 explicitly to the function
    creds = {
        "CRAWL_BATCH_SIZE": "10",
        "CRAWL_MAX_CONCURRENT": "2",
        "MEMORY_THRESHOLD_PERCENT": "70",
        "DISPATCHER_CHECK_INTERVAL": "0.2",
    }

    _patch_dependencies(monkeypatch, strategy_module, creds=creds)

    strategy = strategy_module.BatchCrawlStrategy(crawler, DummyMarkdownGenerator())
    results = await strategy.crawl_batch_with_progress(
        urls=urls,
        transform_url_func=transform,
        is_documentation_site_func=is_doc,
        max_concurrent=5,  # should override settings
    )

    assert len(results) == 3
    # Verify dispatcher used explicit max_concurrent rather than settings
    call = crawler.calls[0]
    disp = call["dispatcher"]
    assert disp.kwargs["max_session_permit"] == 5
    assert disp.kwargs["memory_threshold_percent"] == 70.0
    assert disp.kwargs["check_interval"] == 0.2

python/tests/server/services/crawling/strategies/test_recursive.py

This is a new file.

import asyncio
import sys
import types
import builtins
import inspect
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import pytest

# NOTE ON TESTING LIBRARY AND FRAMEWORK:
# These tests use pytest with pytest-asyncio for async test support, following the repository's Python testing conventions.
# If pytest-asyncio is not auto-discovered in this environment, we structure tests to use asyncio.run where possible.

# --- Fake crawl4ai module injection to avoid external dependency coupling ---
# The production code imports: CrawlerRunConfig, CacheMode, MemoryAdaptiveDispatcher from crawl4ai.
# We'll inject minimal fakes into sys.modules before importing the target module.

class _FakeCacheMode:
    BYPASS = "BYPASS"

@dataclass
class _FakeCrawlerRunConfig:
    cache_mode: Any = None
    stream: bool = False
    markdown_generator: Any = None
    wait_until: str = "domcontentloaded"
    page_timeout: int = 45000
    delay_before_return_html: float = 0.5
    wait_for_images: bool = True
    scan_full_page: bool = False
    exclude_all_images: bool = False
    remove_overlay_elements: bool = False
    process_iframes: bool = False

class _FakeMemoryAdaptiveDispatcher:
    def __init__(self, memory_threshold_percent: float, check_interval: float, max_session_permit: int):
        self.memory_threshold_percent = memory_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit

_fake_crawl4ai = types.ModuleType("crawl4ai")
setattr(_fake_crawl4ai, "CacheMode", _FakeCacheMode)
setattr(_fake_crawl4ai, "CrawlerRunConfig", _FakeCrawlerRunConfig)
setattr(_fake_crawl4ai, "MemoryAdaptiveDispatcher", _FakeMemoryAdaptiveDispatcher)
sys.modules["crawl4ai"] = _fake_crawl4ai

# --- Import target after faking external dependencies ---
# The target module path may be server/services/crawling/strategies/recursive.py or similar via package layout.
# Given the snippet, this test file is intended to validate the RecursiveCrawlStrategy class code.
# To avoid import path ambiguity, we embed a local shim that imports via the package structure used in the project.
# We try multiple likely imports and skip tests if import cannot be resolved.

RecursiveCrawlStrategy = None
URLHandlerClass = None

_import_errors = []

CANDIDATE_IMPORTS = [
    # Common layout: python/server/services/crawling/strategies/recursive.py
    "server.services.crawling.strategies.recursive",
    # Alternative: app/server/services/crawling/strategies/recursive.py
    "app.server.services.crawling.strategies.recursive",
    # Alternative testing layout; in some repos modules live under python/ package root
    "python.server.services.crawling.strategies.recursive",
]

for modname in CANDIDATE_IMPORTS:
    try:
        mod = __import__(modname, fromlist=["RecursiveCrawlStrategy"])
        if hasattr(mod, "RecursiveCrawlStrategy"):
            RecursiveCrawlStrategy = getattr(mod, "RecursiveCrawlStrategy")
        if hasattr(mod, "URLHandler"):
            URLHandlerClass = getattr(mod, "URLHandler")
        break
    except Exception as e:
        _import_errors.append((modname, e))

# If still unresolved, define a minimal local class by copying the provided logic (last resort).
# But prefer failing fast with a helpful message to maintain alignment with actual module.
if RecursiveCrawlStrategy is None:
    pytest.skip(f"Could not import RecursiveCrawlStrategy from expected locations: {_import_errors}")

# --- Test Utilities ---

class FakeResult:
    def __init__(self, url: str, success: bool, markdown: Optional[str] = None, html: Optional[str] = None, links: Optional[Dict[str, List[Dict[str, str]]]] = None, error_message: Optional[str] = None):
        self.url = url
        self.success = success
        self.markdown = markdown
        self.html = html
        self.links = links or {"internal": []}
        self.error_message = error_message

class FakeAsyncGen:
    def __init__(self, items):
        self._items = items

    def __aiter__(self):
        self._iter = iter(self._items)
        return self

    async def __anext__(self):
        try:
            await asyncio.sleep(0)  # simulate async boundary
            return next(self._iter)
        except StopIteration:
            raise StopAsyncIteration

class FakeCrawler:
    def __init__(self, batches: List[List[FakeResult]]):
        """
        batches: A list of batches; each batch is a list of FakeResult that will be yielded.
        Each call to arun_many returns an async generator over the next batch in sequence.
        """
        self._batches = list(batches)
        self.calls = []

    async def arun_many(self, urls, config, dispatcher):
        # record call data to assert run configuration characteristics indirectly
        self.calls.append({
            "urls": list(urls),
            "config": config,
            "dispatcher": dispatcher,
        })
        if self._batches:
            batch = self._batches.pop(0)
        else:
            batch = []
        return FakeAsyncGen(batch)

class FakeURLHandler:
    def __init__(self, binary_predicate=None):
        self.binary_predicate = binary_predicate or (lambda url: False)
        self.calls = []

    def is_binary_file(self, url: str) -> bool:
        self.calls.append(url)
        return self.binary_predicate(url)

class DummyMarkdownGen:
    pass

async def no_op_progress(event_type, percentage, message, **kwargs):
    return  # aid for tests not needing capture

class ProgressCapture:
    def __init__(self):
        self.calls = []

    async def __call__(self, event_type, percentage, message, **kwargs):
        self.calls.append((event_type, percentage, message, kwargs))

# Helpers to override URLHandler inside the strategy instance
def set_url_handler(strategy, handler):
    # Strategy constructs URLHandler() in __init__, replace it for testing determinism
    if hasattr(strategy, "url_handler"):
        strategy.url_handler = handler

# Monkeypatch helper to control credential_service in target module
def patch_credential_service(monkeypatch, return_value=None, side_effect=None):
    # Import the module where credential_service is referenced
    target_mod = sys.modules[RecursiveCrawlStrategy.__module__]
    async def _fake_get_creds(_category):
        if side_effect:
            raise side_effect
        return return_value if return_value is not None else {}
    monkeypatch.setattr(target_mod.credential_service, "get_credentials_by_category", _fake_get_creds, raising=True)

# --- Tests ---

@pytest.mark.asyncio
async def test_returns_empty_and_reports_error_when_no_crawler(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={})  # Not used but avoid accidental calls
    strategy = RecursiveCrawlStrategy(crawler=None, markdown_generator=DummyMarkdownGen())
    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert results == []
    assert any(call[0] == "error" and "Crawler not available" in call[2] for call in progress.calls)

@pytest.mark.asyncio
async def test_configuration_valueerror_raises(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, side_effect=ValueError("bad int"))
    strategy = RecursiveCrawlStrategy(crawler=FakeCrawler([]), markdown_generator=DummyMarkdownGen())

    # Act / Assert
    with pytest.raises(ValueError) as ei:
        await strategy.crawl_recursive_with_progress(
            start_urls=["https://example.com"],
            transform_url_func=lambda u: u,
            is_documentation_site_func=lambda u: False,
            max_depth=1,
            progress_callback=no_op_progress,
        )
    assert "Failed to load crawler configuration" in str(ei.value)

@pytest.mark.asyncio
async def test_generic_exception_from_credentials_uses_defaults_and_crawls(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, side_effect=RuntimeError("db down"))
    # One batch with one successful result and one failed
    success = FakeResult(
        url="https://example.com/page",
        success=True,
        markdown="# Hello",
        html="<h1>Hello</h1>",
        links={"internal": [{"href": "https://example.com/next"}]},
    )
    failure = FakeResult(
        url="https://example.com/bad",
        success=False,
        error_message="404",
    )
    crawler = FakeCrawler([[success, failure]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())  # default non-binary

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com/start"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    assert results[0]["url"] == "https://example.com/page"
    assert results[0]["markdown"] == "# Hello"
    assert results[0]["html"] == "<h1>Hello</h1>"

    # Dispatcher default concurrency is 10 when None
    assert len(crawler.calls) == 1
    call = crawler.calls[0]
    assert isinstance(call["config"], _FakeCrawlerRunConfig)
    assert call["config"].stream is True
    # For non-doc site path, default timeout expected 45000
    assert call["config"].page_timeout == 45000
    assert isinstance(call["dispatcher"], _FakeMemoryAdaptiveDispatcher)
    assert call["dispatcher"].max_session_permit == 10

    # Progress reported start, batch progress, and completion
    assert any("Crawling depth 1/1" in m for (_, _, m, _) in progress.calls)
    assert any("completed" in m for (_, _, m, _) in progress.calls)

@pytest.mark.asyncio
async def test_doc_site_configuration_branch(monkeypatch):
    # Arrange: valid settings that tweak values
    patch_credential_service(monkeypatch, return_value={
        "CRAWL_PAGE_TIMEOUT": "30000",
        "CRAWL_DELAY_BEFORE_HTML": "1.0",
        "CRAWL_WAIT_STRATEGY": "domcontentloaded",
        "CRAWL_MAX_CONCURRENT": "7",
        "CRAWL_BATCH_SIZE": "2",
        "MEMORY_THRESHOLD_PERCENT": "75",
        "DISPATCHER_CHECK_INTERVAL": "0.2",
    })
    res = FakeResult(
        url="https://docs.example.com/index",
        success=True,
        markdown="mk",
        html="<p/>",
        links={"internal": []},
    )
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://docs.example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: "docs." in u,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    call = crawler.calls[0]
    cfg = call["config"]
    assert isinstance(cfg, _FakeCrawlerRunConfig)
    # For doc site branch, timeout is 30000 and some flags set specially
    assert cfg.page_timeout == 30000
    assert cfg.stream is True
    assert cfg.wait_for_images is False
    assert cfg.scan_full_page is True
    assert cfg.process_iframes is True
    disp = call["dispatcher"]
    assert isinstance(disp, _FakeMemoryAdaptiveDispatcher)
    assert disp.max_session_permit == 7
    assert disp.memory_threshold_percent == 75
    assert disp.check_interval == 0.2

@pytest.mark.asyncio
async def test_transform_url_mapping_original_url_recovery(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={})
    orig = "https://example.com/a"
    transformed = "https://static.examplecdn.com/a"
    # Fake result uses transformed url; strategy should map back to original
    res = FakeResult(
        url=transformed,
        success=True,
        markdown="md",
        html="<p></p>",
        links={"internal": []},
    )
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=[orig],
        transform_url_func=lambda u: transformed if u == orig else u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    # Assert: original URL is preserved in output
    assert len(results) == 1
    assert results[0]["url"] == orig

@pytest.mark.asyncio
async def test_skips_binary_internal_links(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={"CRAWL_BATCH_SIZE": "10"})
    res = FakeResult(
        url="https://example.com/page",
        success=True,
        markdown="md",
        html="<p/>",
        links={"internal": [
            {"href": "https://example.com/file.pdf"},  # considered binary
            {"href": "https://example.com/next"},      # not binary
        ]},
    )
    # First depth processes one batch; second depth will enqueue only non-binary
    crawler = FakeCrawler([[res], []])  # second call yields nothing (no results for next urls)
    # Mark URL handler to treat .pdf as binary
    handler = FakeURLHandler(binary_predicate=lambda u: u.endswith(".pdf"))

    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, handler)

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com/start"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=2,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    # Verify binary URL checked and skipped, non-binary enqueued
    assert any(u.endswith(".pdf") for u in handler.calls)
    # next depth triggers a second arun_many call with the non-binary next url
    assert len(crawler.calls) == 2
    assert crawler.calls[1]["urls"] == ["https://example.com/next"]

@pytest.mark.asyncio
async def test_progress_reporting_intervals(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={"CRAWL_BATCH_SIZE": "3"})
    # 6 URLs total to force multiple progress updates and modulo reporting
    results_batch = [
        FakeResult(url=f"https://example.com/u{i}", success=True, markdown="m", html="<p/>")
        for i in range(6)
    ]
    crawler = FakeCrawler([results_batch])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())
    progress = ProgressCapture()

    # Act
    await strategy.crawl_recursive_with_progress(
        start_urls=[f"https://example.com/u{i}" for i in range(6)],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
        start_progress=10,
        end_progress=60,
    )

    # Assert: expect at least a start, mid (5th URL), and completion progress messages
    msgs = [m for (_, _, m, _) in progress.calls]
    assert any("Crawling depth 1/1" in m for m in msgs)
    assert any("processed 5/6 URLs" in m for m in msgs)
    assert any("completed" in m for m in msgs)

@pytest.mark.asyncio
async def test_settings_parsing_and_fallbacks(monkeypatch):
    # Provide mixed settings and rely on defaults for unspecified ones
    patch_credential_service(monkeypatch, return_value={
        "CRAWL_BATCH_SIZE": "4",
        # no explicit CRAWL_MAX_CONCURRENT -> should use default 10 if not overridden
        "MEMORY_THRESHOLD_PERCENT": "81.5",
        "DISPATCHER_CHECK_INTERVAL": "0.33",
    })
    res = FakeResult(url="https://example.com", success=True, markdown="m", html="<p/>")
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    call = crawler.calls[0]
    disp = call["dispatcher"]
    assert isinstance(disp, _FakeMemoryAdaptiveDispatcher)
    assert disp.max_session_permit == 10
    assert disp.memory_threshold_percent == 81.5
    assert abs(disp.check_interval - 0.33) < 1e-9

# Edge case: duplicate URLs and fragments should be normalized and not double-crawled
@pytest.mark.asyncio
async def test_url_normalization_removes_fragments_and_dedupes(monkeypatch):
    patch_credential_service(monkeypatch, return_value={})
    # Provide two start URLs that normalize to same base
    start_urls = [
        "https://example.com/page#section1",
        "https://example.com/page#section2",
    ]
    res = FakeResult(url="https://example.com/page", success=True, markdown="m", html="<p/>")
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    await strategy.crawl_recursive_with_progress(
        start_urls=start_urls,
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    # Only one crawl call should occur for deduped URL
    assert len(crawler.calls) == 1
    assert crawler.calls[0]["urls"] == ["https://example.com/page"]

@Wirasm (Collaborator, Author) commented Aug 21, 2025

@claude-review

claude Bot commented Aug 21, 2025

Claude finished @Wirasm's task.


Summary

This PR fixes a critical crawler timeout issue affecting JavaScript-heavy documentation sites by removing the problematic wait_for='body' selector that caused unnecessary 30-second timeouts. The changes improve error handling with proper exception chaining and remove unused imports across both batch and recursive crawling strategies.

Previous Review Comments

  • First review - no previous comments to address

Issues Found

Total: 1 critical, 2 important, 3 minor

🔴 Critical (Must Fix)

  • URL Transformation Bug in Recursive Strategy - python/src/server/services/crawling/strategies/recursive.py:169

    Problem: URLs aren't transformed before crawling, breaking GitHub raw-content extraction and original URL mapping. The code passes batch_urls (untransformed) to the crawler but then tries to map results back by comparing transform_url_func(orig_url) to result.url, which will rarely match.

    Fix: Transform URLs before crawling and maintain a mapping dictionary:

    batch_original_urls = urls_to_crawl[batch_idx : batch_idx + batch_size]
    batch_urls = [transform_url_func(u) for u in batch_original_urls]
    url_mapping = dict(zip(batch_urls, batch_original_urls))
    
    # After crawling, use:
    original_url = url_mapping.get(result.url, result.url)
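    For context, a self-contained sketch of this transform-then-map-back pattern; the blob-to-raw transform and the loop below are hypothetical stand-ins for the real crawler results:

    def transform_url(url: str) -> str:
        # Hypothetical example: crawl GitHub files as raw content instead of HTML pages
        return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

    batch_original_urls = ["https://github.com/org/repo/blob/main/README.md"]
    batch_urls = [transform_url(u) for u in batch_original_urls]
    url_mapping = dict(zip(batch_urls, batch_original_urls))

    # Each crawl result carries the transformed URL; map it back so storage and
    # dedupe keep working on the original (canonical) URL.
    for result_url in batch_urls:  # stand-in for iterating streamed crawler results
        original_url = url_mapping.get(result_url, result_url)
        print(original_url)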

🟡 Important (Should Fix)

  • Progress Callback Inconsistency - python/src/server/services/crawling/strategies/batch.py:122-125

    Problem: Batch strategy calls progress_callback("crawling", percentage, message) while recursive strategy adds step_info metadata. This inconsistency could break consumers expecting the richer payload format.

    Fix: Align batch strategy with recursive by adding step_info parameter to progress callbacks.

  • Unused Loop Counter - python/src/server/services/crawling/strategies/batch.py:165,193

    Problem: Variable j is initialized and incremented but never used, indicating potential logic issues or cleanup needed.

    Fix: Remove the unused counter or implement the intended functionality.

🟢 Minor (Consider)

  • Double URL Normalization - python/src/server/services/crawling/strategies/recursive.py:145-147

    normalize_url() is called twice per URL in the comprehension. Could optimize to avoid redundant work.

  • Missing Link Safety Guard - python/src/server/services/crawling/strategies/recursive.py:212

    Should add safe default for missing or None links attribute: links = getattr(result, "links", {}) or {}

  • Duplicate Binary File Checks - python/src/server/services/crawling/strategies/recursive.py:215-221

    is_binary_file() is called twice for the same URL. Could cache the result for efficiency.

Security Assessment

No security issues found. The changes are focused on crawler configuration and don't introduce new attack vectors. Input validation remains intact through the existing URL handling patterns.

Performance Considerations

Positive impacts:

  • Removing wait_for='body' eliminates 30-second timeouts on JS-heavy sites
  • Better suited timeout values (30s for docs, 45s for regular sites)
  • Streaming results processing maintains memory efficiency

Areas for optimization:

  • URL transformation bug fix will improve GitHub raw content extraction performance
  • Reducing duplicate binary file checks could save I/O operations

Good Practices Observed

  • Excellent error handling alignment with CLAUDE.md guidelines:

    • Configuration errors fail fast with ValueError and proper exception chaining (raise ... from e)
    • Non-critical errors (network issues) fall back to defaults while logging prominently
    • Proper use of exc_info=True for stack trace preservation
  • Alpha project principles followed:

    • Removes dead code (unused imports) immediately
    • Detailed error messages with context
    • Focus on functionality over backward compatibility
  • Smart crawler configuration:

    • Documentation sites get specialized settings (shorter timeout, iframe processing, overlay removal)
    • Streaming enabled for better parallel processing performance
    • Memory-adaptive dispatcher for resource management

Questionable Practices

  • Inconsistent progress reporting: The different callback signatures between strategies could cause integration issues
  • URL transformation logic: The current approach in recursive strategy suggests the feature isn't fully tested with GitHub URL transformations

Test Coverage

Current Coverage: Moderate - Core crawling logic is present but edge cases need attention

Missing Tests:

  1. URL Transformation Mapping

    • What to test: GitHub URL transformation and result mapping back to originals
    • Why important: Critical for GitHub content extraction functionality
    • Suggested test: Mock crawler results with transformed URLs and verify original URL preservation
  2. Documentation Site Detection

    • What to test: is_documentation_site_func triggering correct configuration branch
    • Why important: Different timeout/config behavior for different site types
    • Suggested test: Verify doc sites get 30s timeout while regular sites get 45s
  3. Progress Callback Consistency

    • What to test: Both strategies calling progress callbacks with expected parameters
    • Why important: UI/telemetry systems depend on consistent callback signatures
    • Suggested test: Mock progress callback and verify parameter structure matches
  4. Exception Chaining

    • What to test: Configuration errors properly chain original exceptions
    • Why important: Debugging configuration issues requires full stack traces
    • Suggested test: Mock credential service failures and verify exception chain preservation
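As a concrete starting point for item 4, a small pytest sketch of asserting the exception chain; the helper here is hypothetical and stands in for the strategy's config-loading path:

import pytest

def parse_timeout(raw: str) -> int:
    # Hypothetical loader used only for this sketch
    try:
        return int(raw)
    except ValueError as e:
        raise ValueError(f"Failed to load crawler configuration: bad timeout {raw!r}") from e

def test_config_error_chains_original_cause():
    with pytest.raises(ValueError) as excinfo:
        parse_timeout("thirty")
    # "raise ... from e" stores the original error on __cause__, preserving the full trace
    assert isinstance(excinfo.value.__cause__, ValueError)
    assert "Failed to load crawler configuration" in str(excinfo.value)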

Recommendations

Merge Decision:

  • Requires fixes before merging

Priority Actions:

  1. Fix URL transformation bug in recursive strategy - Critical for GitHub content extraction
  2. Align progress callback signatures - Prevents integration issues with UI components
  3. Add tests for URL transformation - Ensure the core feature works correctly

Rationale:
While the timeout fix addresses a real problem and the error handling improvements align well with alpha project principles, the URL transformation bug is a critical regression that would break GitHub repository crawling functionality. The fix is straightforward but essential before merging.


Review based on Archon V2 Alpha guidelines and CLAUDE.md principles


Wirasm added 3 commits on August 21, 2025 at 21:24

Remove wait_for='body' selector from documentation site crawling config.
The body element exists immediately in HTML, causing unnecessary timeouts
for JavaScript-rendered content. Now relies on domcontentloaded event
and delay_before_return_html for proper JavaScript execution.

- Remove unused asyncio imports from batch.py and recursive.py
- Add proper exception chaining with 'from e' to preserve stack traces
- Transform URLs to raw content (e.g., GitHub blob -> raw) before sending to crawler
- Maintain mapping dictionary to preserve original URLs in results
- Align progress callback signatures between batch and recursive strategies
- Add safety guards for missing links attribute
- Remove unused loop counter in batch strategy
- Optimize binary file checks to avoid duplicate calls

This ensures GitHub files are crawled as raw content instead of HTML pages,
fixing the issue where content extraction was degraded due to HTML wrapping.
@Wirasm force-pushed the fix/crawler-js-doc-sites branch from f077820 to 329a65e on August 21, 2025 at 19:05
@coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/strategies/batch.py (2)

55-60: Treat missing crawler as a hard service error, not “empty results.”

Returning [] hides a critical dependency outage. Prefer raising a descriptive exception after notifying progress_callback so upstream can fail fast.

Apply this diff:

         if not self.crawler:
             logger.error("No crawler instance available for batch crawling")
             if progress_callback:
                 await progress_callback("error", 0, "Crawler not available")
-            return []
+            raise RuntimeError("BatchCrawlStrategy: crawler instance is not initialized")

161-163: Add explicit max_session_permit to MemoryAdaptiveDispatcher instantiation

We verified that in both batch and recursive crawling strategies the dispatcher is constructed once and passed directly into arun_many, with no subsequent overrides. However, to ensure your intended concurrency limit takes effect (rather than relying on the dispatcher’s default), you should explicitly set max_session_permit (often referred to as “max_concurrent”) in the constructor:

• File: python/src/server/services/crawling/strategies/batch.py
Lines 116–118

• File: python/src/server/services/crawling/strategies/recursive.py
Lines 122–124

Replace this pattern:

- dispatcher = MemoryAdaptiveDispatcher(
-     memory_threshold_percent=memory_threshold,
-     check_interval=check_interval,
- )

With an explicit concurrency override:

  dispatcher = MemoryAdaptiveDispatcher(
      memory_threshold_percent=memory_threshold,
      check_interval=check_interval,
+     max_session_permit=crawl_config.max_concurrent,  # explicitly cap concurrent sessions
  )

If crawl_config doesn’t expose a max_concurrent field, substitute in the appropriate constant or configuration value.

Relevant dispatcher API docs confirm support for max_session_permit to control concurrency: (crawl4ai.com)
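A minimal sketch of the suggested construction; the parameter names match the MemoryAdaptiveDispatcher usage elsewhere in this review, while the concrete values and the max_concurrent variable are assumptions:

from crawl4ai import MemoryAdaptiveDispatcher

max_concurrent = 10  # assumed CRAWL_MAX_CONCURRENT setting / default

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,
    check_interval=0.5,
    max_session_permit=max_concurrent,  # explicit cap instead of the dispatcher default
)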

♻️ Duplicate comments (1)
python/src/server/services/crawling/strategies/recursive.py (1)

169-179: Fixed: transform before crawl and map results back to originals.

This resolves the previously flagged mapping issue for GitHub/raw content.

🧹 Nitpick comments (4)
python/src/server/services/crawling/strategies/batch.py (1)

75-84: Fallback path is fine; also surface the concrete defaults being applied.

You already log the failure and proceed with safe defaults. Consider explicitly logging the resolved defaults to aid ops/debugging.

Apply this diff to include the effective values in the log:

-            logger.error(
-                f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-            )
+            logger.error(
+                f"Failed to load crawl settings from database: {e}, using defaults",
+                exc_info=True,
+            )
+            logger.info(
+                "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+                "memory_threshold=80.0, check_interval=0.5",
+                10 if max_concurrent is None else max_concurrent,
+            )
python/src/server/services/crawling/strategies/recursive.py (3)

81-89: Fallback defaults: also log concretely which values are in effect.

Same suggestion as batch for observability of defaulted config.

-            logger.error(
-                f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-            )
+            logger.error(
+                f"Failed to load crawl settings from database: {e}, using defaults",
+                exc_info=True,
+            )
+            logger.info(
+                "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+                "memory_threshold=80.0, check_interval=0.5",
+                10 if max_concurrent is None else max_concurrent,
+            )

207-212: Result shape consistent (url, markdown, html).

Keeping raw HTML is useful for code/DOM post-processing. Consider adding content_length guards later if memory becomes a concern.


204-205: Optional: only mark visited on success to allow retrying failures within the same session.

Currently, failed pages are marked visited and won’t be retried at later depths. If you want opportunistic retries (e.g., transient 429), defer adding to visited until success.

If interested, I can wire a simple per-URL retry with exponential backoff around the crawl call, gated by a max_retries setting.

Also applies to: 226-229
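If that offer is taken up, a rough sketch of what such a wrapper could look like; everything here is hypothetical, with crawl_one standing in for the real per-URL crawl call:

import asyncio
import random

async def crawl_with_retries(crawl_one, url: str, max_retries: int = 3):
    """Retry a single-URL crawl with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_retries + 1):
        try:
            return await crawl_one(url)
        except Exception:  # in real code, narrow this to transient errors (e.g. HTTP 429)
            if attempt == max_retries:
                raise
            delay = (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)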

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f077820 and 329a65e.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/strategies/batch.py (2 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
🔇 Additional comments (13)
python/src/server/services/crawling/strategies/batch.py (2)

72-72: Good fix: preserve original stack via exception chaining.

Raising with "from e" keeps the cause and full traceback. Matches our guidelines on preserving stack traces.


91-103: Right direction: remove wait_for='body' for JS-heavy docs; rely on wait_until + delay.

This should eliminate the 30s timeouts on pages where the body element appears before JS content. Ensure ops config documents CRAWL_WAIT_STRATEGY accepted values (e.g., 'domcontentloaded', 'load', 'networkidle') so deployments don't misconfigure it.

Would you like me to add a short section to the settings docs describing CRAWL_WAIT_STRATEGY and recommended values for documentation sites?
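In the meantime, a hedged sketch of how a deployment could validate the setting before handing it to CrawlerRunConfig; the accepted values are the ones listed above, and the helper itself is hypothetical:

VALID_WAIT_STRATEGIES = {"domcontentloaded", "load", "networkidle"}

def resolve_wait_strategy(settings: dict) -> str:
    # Fall back to the documented default rather than passing junk to the crawler
    value = settings.get("CRAWL_WAIT_STRATEGY", "domcontentloaded")
    if value not in VALID_WAIT_STRATEGIES:
        raise ValueError(
            f"Invalid CRAWL_WAIT_STRATEGY {value!r}; expected one of {sorted(VALID_WAIT_STRATEGIES)}"
        )
    return value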

python/src/server/services/crawling/strategies/recursive.py (11)

6-6: Trivial whitespace/comment-only change.

Nothing to review here.


42-43: Progress allocation: confirm the 10%→60% window matches UI expectations.

end_progress=60 reserves 40% for post-recursive phases. Verify callers/UX expect this distribution to avoid >100% aggregate progress.

If needed, I can adjust the depth weighting factor (currently 0.8 of the window) to maintain smoothness.


77-77: Good: exception chaining on configuration errors.

Preserves root cause and stack trace.


93-109: Doc-site run_config: sensible defaults for JS-heavy pages.

Switching to wait_until with a 1.0s delay and disabling image waits should reduce timeouts. Confirm process_iframes=True is supported by your crawl4ai version in prod.

If process_iframes is not universally available, we can gate it via a settings flag.


126-126: Dispatcher params look correct.

max_session_permit wires to max_concurrent; matches tests the bot suggested.


181-189: Progress payload keys sent via report_progress will now live under step_info.

Once step_info=... is used (see earlier fix), totalPages/processedPages will be found at step_info.totalPages, etc. Confirm the UI expects them there.

I can adapt the progress emitter to also surface these at top-level if your UI relies on that.


193-195: Correct: crawl transformed URLs with dispatcher.

Matches batch behavior and uses arun_many for native parallel streaming.


200-202: Correct URL mapping back to originals.

Ensures downstream storage and dedupe work on canonical URLs.


216-224: Good: robust link extraction and binary-skip guard.

Using getattr(..., {}) avoids attribute errors; binary filter prevents enqueuing blobs.


226-229: Useful warning includes original URL and message.

Keeps logs actionable without crashing the crawl.


247-256: Progress reporting at depth end and crawl end is clear.

Messages are specific and will help users understand progress across depths.

Comment on lines +122 to 127
 async def report_progress(percentage: int, message: str, **kwargs):
     """Helper to report progress if callback is available"""
     if progress_callback:
-        await progress_callback("crawling", percentage, message)
+        step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+        await progress_callback("crawling", percentage, message, step_info=step_info)


⚠️ Potential issue

Progress callback contract differs from recursive strategy. Align on step_info kw param.

Here you pass step_info as a single kwarg. In recursive.py, step_info’s contents are expanded as top-level kwargs, which can cause runtime TypeError if the callback doesn’t accept those names. Let’s standardize on step_info=... in both places.

I’ve proposed the corresponding fix in recursive.py. No change needed here; this comment flags the cross-file inconsistency.

🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/batch.py around lines 122 to
127, ensure the progress callback is called with a single step_info kwarg (as
currently implemented: step_info = {...} and await progress_callback("crawling",
percentage, message, step_info=step_info)); leave this implementation unchanged
and confirm the recursive strategy is updated to pass step_info as a single
kwarg as well so both strategies match the same callback contract.
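For clarity, a small sketch of the contract being standardized on, using the callback shape and step_info keys described in this thread; the callback body itself is illustrative:

async def progress_callback(event_type: str, percentage: int, message: str, step_info: dict | None = None):
    # Consumers read richer metadata from step_info instead of arbitrary top-level kwargs
    step_info = step_info or {}
    print(event_type, percentage, message, step_info.get("currentStep"), step_info.get("stepMessage"))

async def report_progress(percentage: int, message: str, **kwargs):
    step_info = {"currentStep": message, "stepMessage": message, **kwargs}
    await progress_callback("crawling", percentage, message, step_info=step_info)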

Comment on lines 60 to +65
 if not self.crawler:
     logger.error("No crawler instance available for recursive crawling")
     if progress_callback:
-        await progress_callback('error', 0, 'Crawler not available')
+        await progress_callback("error", 0, "Crawler not available")
     return []


🛠️ Refactor suggestion

Fail fast when crawler is unavailable.

Mirror batch strategy: treat this as a hard error instead of returning []. Keep the progress notification, then raise.

         if not self.crawler:
             logger.error("No crawler instance available for recursive crawling")
             if progress_callback:
                 await progress_callback("error", 0, "Crawler not available")
-            return []
+            raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
        if not self.crawler:
            logger.error("No crawler instance available for recursive crawling")
            if progress_callback:
                await progress_callback("error", 0, "Crawler not available")
            return []

After:
        if not self.crawler:
            logger.error("No crawler instance available for recursive crawling")
            if progress_callback:
                await progress_callback("error", 0, "Crawler not available")
            raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 60
to 65, the code currently logs and returns an empty list when self.crawler is
missing; instead, mirror the batch strategy by treating this as a hard error:
keep the existing logger.error and await the progress_callback("error", 0,
"Crawler not available") if provided, then raise an appropriate exception (e.g.,
RuntimeError or a CrawlerUnavailableError) to fail fast rather than returning
[].

Comment on lines +132 to +134
step_info = {"currentStep": message, "stepMessage": message, **kwargs}
await progress_callback("crawling", percentage, message, **step_info)


⚠️ Potential issue

Bug: progress_callback invoked with expanded kwargs; use step_info=... instead.

Expanding step_info as top-level kwargs risks TypeError in callbacks not expecting those names and is inconsistent with batch.py.

Apply this diff:

-                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
-                await progress_callback("crawling", percentage, message, **step_info)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, step_info=step_info)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
-                await progress_callback("crawling", percentage, message, **step_info)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, step_info=step_info)
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 132
to 134, the code expands step_info into top-level kwargs when calling
progress_callback which can cause TypeError and is inconsistent with batch.py;
instead remove the **step_info/**kwargs expansion and pass the dict as a single
named argument (e.g., step_info=step_info) so the callback receives the info
consistently; update the await call to await progress_callback("crawling",
percentage, message, step_info=step_info) and ensure any leftover **kwargs are
not expanded into the call.

Comment on lines +233 to +241
current_progress = depth_start + int(
    (current_idx / len(urls_to_crawl)) * (depth_end - depth_start)
)
await report_progress(
    current_progress,
    f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
    totalPages=total_processed,
    processedPages=len(results_all),
)

⚠️ Potential issue

Same progress callback bug here; align to step_info=...

Apply the same fix to avoid expanding unknown kwargs.

-                        await report_progress(
-                            current_progress,
-                            f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
-                            totalPages=total_processed,
-                            processedPages=len(results_all),
-                        )
+                        await report_progress(
+                            current_progress,
+                            f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
+                            totalPages=total_processed,
+                            processedPages=len(results_all),
+                        )

Note: The call site stays the same; the earlier change in report_progress will ensure these land under step_info.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 233
to 241, the report_progress call is passing unknown kwargs (totalPages and
processedPages) which will be dropped; update the call to pass those values
inside the step_info dict (e.g., step_info={"totalPages": total_processed,
"processedPages": len(results_all)}) and keep the same message and
current_progress argument so the progress payload lands under step_info as the
other fixes did.
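The note above assumes report_progress itself folds any extra keyword arguments into step_info before delegating to the callback. A minimal sketch of that shape; the factory form and type alias are assumptions for illustration (the real helper is presumably a local closure inside the strategy):

from typing import Any, Awaitable, Callable

# Assumed callback shape: (status, percentage, message, step_info=...)
ProgressCallback = Callable[..., Awaitable[None]]


def make_report_progress(progress_callback: ProgressCallback | None):
    async def report_progress(percentage: int, message: str, **kwargs: Any) -> None:
        if progress_callback is None:
            return
        # totalPages, processedPages, etc. all land under step_info rather
        # than being expanded as unknown top-level kwargs.
        step_info = {"currentStep": message, "stepMessage": message, **kwargs}
        await progress_callback("crawling", percentage, message, step_info=step_info)

    return report_progress

With a helper like this, call sites such as the one in the diff above can keep passing totalPages=... and processedPages=..., and those values reach the callback inside step_info.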

@coleam00 coleam00 self-requested a review August 21, 2025 21:08
@Wirasm Wirasm merged commit cb4dba1 into main Aug 22, 2025
24 checks passed
@Wirasm Wirasm deleted the fix/crawler-js-doc-sites branch August 22, 2025 05:56
POWERFULMOVES added a commit to POWERFULMOVES/PMOVES-Archon that referenced this pull request Feb 12, 2026
…ation

Merge PR coleam00#395 - NATS infrastructure and A2UI bridge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…395) (#406)

* Investigate handler bugs #392-#395: commands JSON parsing, dotenv worktree, error messages

* Fix handler bugs: JSON parsing, dotenv worktree, error messages (#392-#395)

Four related fixes for handler/startup issues:

- Fix /commands showing character indices instead of command names when SQLite
  stores commands as JSON string (#392)
- Remove dotenv/config import, rely on Bun's built-in .env loading which
  searches up the directory tree — fixes worktree startup failures (#393)
- Add actionable guidance to no_ai_credentials fatal error (#394)
- Improve /setcwd rejection message with explanation and alternative (#395)

Fixes #392, Fixes #393, Fixes #394, Fixes #395

* Add try-catch for JSON.parse in getCodebaseCommands()

Prevents crash on malformed JSON in database. Logs the error and returns
empty commands object as graceful fallback.
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…am00#392-coleam00#395) (coleam00#406)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…am00#392-coleam00#395) (coleam00#406)
