Fix crawler timeout for JavaScript-heavy documentation sites #395

Merged
Wirasm merged 3 commits into main from fix/crawler-js-doc-sites on Aug 22, 2025

Conversation

@Wirasm (Collaborator) commented Aug 20, 2025

Pull Request

Summary

Fixes a crawler timeout issue when crawling JavaScript-heavy documentation sites such as the Stripe.js docs. The crawler was waiting for a 'body' selector that already exists in the HTML DOM, causing unnecessary 30-second timeouts.

Changes Made

  • Remove wait_for='body' selector from documentation site crawler configurations
  • Fix exception chaining to preserve stack traces for better debugging
  • Remove unused asyncio imports from crawling strategies

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Affected Services

  • Frontend (React UI)
  • Server (FastAPI backend)
  • MCP Server (Model Context Protocol)
  • Agents (PydanticAI service)
  • Database (migrations/schema)
  • Docker/Infrastructure
  • Documentation site

Testing

  • All existing tests pass
  • Added new tests for new functionality
  • Manually tested affected user flows
  • Docker builds succeed for all services

Test Evidence

# Ready for testing with:
# - https://docs.stripe.com/js
# - Other JavaScript-heavy documentation sites

Checklist

  • My code follows the service architecture patterns
  • If using an AI coding assistant, I used the CLAUDE.md rules
  • I have added tests that prove my fix/feature works
  • All new and existing tests pass locally
  • My changes generate no new warnings
  • I have updated relevant documentation
  • I have verified no regressions in existing features

Breaking Changes

None - This fix maintains backward compatibility while enabling crawling of sites that previously timed out.

Additional Notes

Root Cause

The wait_for='body' selector was problematic because:

  1. The body element exists immediately when HTML loads
  2. For JavaScript-rendered content, we need to wait for JS execution, not the body element
  3. This caused 30-second timeouts on valid documentation sites

Solution

Now relies on:

  • domcontentloaded event for DOM readiness
  • delay_before_return_html (1.0s for doc sites) for JavaScript rendering time

This fix allows crawling of JavaScript-heavy documentation that previously failed while maintaining compatibility with all existing crawlable sites.
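For illustration, a minimal sketch of what the documentation-site run config looks like after this change. Field names follow the crawl4ai CrawlerRunConfig usage shown elsewhere in this PR; the concrete values are assumptions, not a copy of the merged code.

from crawl4ai import CacheMode, CrawlerRunConfig

# Sketch of a doc-site config after the fix: no wait_for="body".
# Readiness comes from wait_until plus a short render delay for client-side JS.
doc_site_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    stream=True,
    wait_until="domcontentloaded",   # from the CRAWL_WAIT_STRATEGY setting (assumed default)
    page_timeout=30000,              # 30s budget for documentation sites
    delay_before_return_html=1.0,    # give JavaScript time to render before capturing HTML
    wait_for_images=False,
    scan_full_page=True,
    process_iframes=True,
)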

Summary by CodeRabbit

  • New Features

    • Progress reporting enhanced: callbacks receive step number, message, and extra progress data; results preserve and map back to original URLs.
  • Bug Fixes

    • Smarter page-readiness wait strategy for documentation sites to reduce timeouts/partial loads.
    • Error handling improved to preserve original causes and provide a default fallback for missing settings.
    • Binary files are now explicitly skipped during crawling.
  • Chores

    • Removed unused imports and minor logging/formatting tidy-ups.

coderabbitai Bot commented Aug 20, 2025

Walkthrough

Updates two crawling strategies: remove unused asyncio imports, preserve original errors with exception chaining on config load failures, switch documentation-site waits from an explicit wait_for="body" to wait_until/CRAWL_WAIT_STRATEGY, enrich progress callbacks with a step_info payload, and minor logging/formatting tweaks.

Changes

Cohort / File(s): Batch crawling strategy (python/src/server/services/crawling/strategies/batch.py)
Summary of changes: Removed unused asyncio import; re-raised config load errors using raise ... from e; added default settings = {} on non-critical load failure; removed explicit wait_for="body" from CrawlerRunConfig for docs batches; progress reporter now accepts extra kwargs and sends step_info; minor log formatting.

Cohort / File(s): Recursive crawling strategy (python/src/server/services/crawling/strategies/recursive.py)
Summary of changes: Removed unused asyncio import; chained config load exceptions; changed docs wait from wait_for="body" to wait_until driven by CRAWL_WAIT_STRATEGY; transformed URLs crawled with url_mapping to report original URLs; skip binary files; progress callbacks receive step_info with currentStep and stepMessage; mapped streaming results back to originals; logging and formatting adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Strategy as Crawl Strategy (batch/recursive)
  participant Config as CrawlerRunConfig
  participant Engine as Crawl Engine
  participant Progress as Progress Callback

  Caller->>Strategy: start_crawl(params)
  Strategy->>Strategy: load crawler settings
  alt load error (ValueError/KeyError/TypeError)
    Strategy-->>Caller: raise ValueError(...) from e
  else settings available (or default {})
    Strategy->>Config: build config (use wait_until from CRAWL_WAIT_STRATEGY)
    Note right of Config: removed explicit wait_for="body"
    Strategy->>Engine: arun_many(transformed_urls, config)
    Engine-->>Strategy: streaming progress/events
    Strategy->>Progress: on_progress(step_info={currentStep, stepMessage}, **extra)
    Engine-->>Strategy: streaming/results (mapped to original URLs)
    Strategy-->>Caller: aggregated results
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through code and thread,
Chains exceptions, clears the dread.
Waits now listen, not just look,
Step_info scribbles in the book.
Logs aligned, results restored — hooray, the crawl’s unboxed today. 🐇✨

@coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
python/src/server/services/crawling/strategies/batch.py (2)

165-194: Remove unused loop counter j

j is initialized and incremented but never used. Safe to remove.

-            j = 0
...
-                j += 1

121-126: Unify progress callback shape with recursive strategy

Batch still calls progress_callback without step_info, while recursive adds step metadata. Consider aligning for consumers that expect the richer payload.

-        async def report_progress(percentage: int, message: str):
+        async def report_progress(percentage: int, message: str, **kwargs):
             """Helper to report progress if callback is available"""
             if progress_callback:
-                await progress_callback("crawling", percentage, message)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, **step_info)
python/src/server/services/crawling/strategies/recursive.py (4)

128-134: Progress callback enrichment is helpful; consider aligning batch, too

Adding step_info improves UI/telemetry. Recommend aligning the batch strategy to the same shape for consistency across flows.


145-147: Avoid double-normalization work per URL

current_urls is already normalized; this comprehension normalizes twice. Minor, but easy to streamline.

-            urls_to_crawl = [
-                normalize_url(url) for url in current_urls if normalize_url(url) not in visited
-            ]
+            urls_to_crawl = [url for url in current_urls if url not in visited]

211-214: Guard against missing/None links in results

Some crawler results may not populate links or may set it to None. Add a safe default to avoid attribute errors.

-                        for link in result.links.get("internal", []):
+                        links = getattr(result, "links", {}) or {}
+                        for link in links.get("internal", []):

215-221: Avoid duplicate is_binary_file checks per link

Compute once to reduce overhead and clarify intent.

-                            if next_url not in visited and not self.url_handler.is_binary_file(
-                                next_url
-                            ):
-                                next_level_urls.add(next_url)
-                            elif self.url_handler.is_binary_file(next_url):
+                            is_binary = self.url_handler.is_binary_file(next_url)
+                            if next_url not in visited and not is_binary:
+                                next_level_urls.add(next_url)
+                            elif is_binary:
                                 logger.debug(f"Skipping binary file from crawl queue: {next_url}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8b29d20 and f077820.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/strategies/batch.py (1 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
python/src/server/services/crawling/strategies/recursive.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
  • URLHandler (14-127)
  • is_binary_file (52-97)
python/src/server/services/crawling/strategies/batch.py (1)
  • report_progress (122-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Docker Build Tests (server)
  • GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (4)
python/src/server/services/crawling/strategies/batch.py (3)

72-72: Good fix: exception chaining preserves traceback

Raising from the original exception keeps the full stack. This will help debugging config issues.
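For reference, a minimal illustration of the pattern (hypothetical helper, not the actual batch.py code):

def _parse_crawl_settings(raw: dict) -> dict:
    try:
        return {"batch_size": int(raw["CRAWL_BATCH_SIZE"])}
    except (ValueError, KeyError, TypeError) as e:
        # "from e" stores the original error as __cause__, so logs show the full chain
        raise ValueError(f"Failed to load crawler configuration: {e}") from e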


75-77: Robust logging on settings fallback

Logging with exc_info while safely falling back to defaults is the right tradeoff for availability.


91-103: Doc-site wait strategy change is correct (prevents 30s timeouts)

Dropping wait_for="body" and relying on wait_until + delay_before_return_html for JS-heavy docs is the right approach. This avoids waiting on a selector that appears immediately while still allowing the page to render.

python/src/server/services/crawling/strategies/recursive.py (1)

94-109: Doc-site wait strategy: LGTM

Switching to wait_until with a short delay and skipping image waits addresses JS-rendered docs without the false-positive wait on the body element.

Comment thread python/src/server/services/crawling/strategies/recursive.py
coderabbitai Bot commented Aug 20, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.

coderabbitai Bot commented Aug 20, 2025

Here are the copyable unit test edits:

Copyable Edits

python/tests/server/services/crawling/strategies/test_batch.py

This is a new file.

# Tests for BatchCrawlStrategy.crawl_batch_with_progress
# Framework: pytest (with pytest-asyncio)
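# NOTE: helpers referenced below (_ASYNC_MARK, DummyResult, DummyCrawler,
# DummyMarkdownGenerator, _patch_dependencies) are presumably defined earlier
# in the generated file; only one test is excerpted here.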

@_ASYNC_MARK
async def test_explicit_max_concurrent_overrides_settings(monkeypatch, strategy_module):
    urls = [f"https://override.com/{i}" for i in range(3)]

    def transform(u: str) -> str:
        return u

    def is_doc(u: str) -> bool:
        return False

    # One batch of 3 all success
    batch = [DummyResult(url=u) for u in urls]
    crawler = DummyCrawler(batches=[batch])

    # Settings specify 2, but we will pass 5 explicitly to the function
    creds = {
        "CRAWL_BATCH_SIZE": "10",
        "CRAWL_MAX_CONCURRENT": "2",
        "MEMORY_THRESHOLD_PERCENT": "70",
        "DISPATCHER_CHECK_INTERVAL": "0.2",
    }

    _patch_dependencies(monkeypatch, strategy_module, creds=creds)

    strategy = strategy_module.BatchCrawlStrategy(crawler, DummyMarkdownGenerator())
    results = await strategy.crawl_batch_with_progress(
        urls=urls,
        transform_url_func=transform,
        is_documentation_site_func=is_doc,
        max_concurrent=5,  # should override settings
    )

    assert len(results) == 3
    # Verify dispatcher used explicit max_concurrent rather than settings
    call = crawler.calls[0]
    disp = call["dispatcher"]
    assert disp.kwargs["max_session_permit"] == 5
    assert disp.kwargs["memory_threshold_percent"] == 70.0
    assert disp.kwargs["check_interval"] == 0.2

python/tests/server/services/crawling/strategies/test_recursive.py

This is a new file.

import asyncio
import sys
import types
import builtins
import inspect
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import pytest

# NOTE ON TESTING LIBRARY AND FRAMEWORK:
# These tests use pytest with pytest-asyncio for async test support, following the repository's Python testing conventions.
# If pytest-asyncio is not auto-discovered in this environment, we structure tests to use asyncio.run where possible.

# --- Fake crawl4ai module injection to avoid external dependency coupling ---
# The production code imports: CrawlerRunConfig, CacheMode, MemoryAdaptiveDispatcher from crawl4ai.
# We'll inject minimal fakes into sys.modules before importing the target module.

class _FakeCacheMode:
    BYPASS = "BYPASS"

@dataclass
class _FakeCrawlerRunConfig:
    cache_mode: Any = None
    stream: bool = False
    markdown_generator: Any = None
    wait_until: str = "domcontentloaded"
    page_timeout: int = 45000
    delay_before_return_html: float = 0.5
    wait_for_images: bool = True
    scan_full_page: bool = False
    exclude_all_images: bool = False
    remove_overlay_elements: bool = False
    process_iframes: bool = False

class _FakeMemoryAdaptiveDispatcher:
    def __init__(self, memory_threshold_percent: float, check_interval: float, max_session_permit: int):
        self.memory_threshold_percent = memory_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit

_fake_crawl4ai = types.ModuleType("crawl4ai")
setattr(_fake_crawl4ai, "CacheMode", _FakeCacheMode)
setattr(_fake_crawl4ai, "CrawlerRunConfig", _FakeCrawlerRunConfig)
setattr(_fake_crawl4ai, "MemoryAdaptiveDispatcher", _FakeMemoryAdaptiveDispatcher)
sys.modules["crawl4ai"] = _fake_crawl4ai

# --- Import target after faking external dependencies ---
# The target module path may be server/services/crawling/strategies/recursive.py or similar via package layout.
# Given the snippet, this test file is intended to validate the RecursiveCrawlStrategy class code.
# To avoid import path ambiguity, we embed a local shim that imports via the package structure used in the project.
# We try multiple likely imports and skip tests if import cannot be resolved.

RecursiveCrawlStrategy = None
URLHandlerClass = None

_import_errors = []

CANDIDATE_IMPORTS = [
    # Common layout: python/server/services/crawling/strategies/recursive.py
    "server.services.crawling.strategies.recursive",
    # Alternative: app/server/services/crawling/strategies/recursive.py
    "app.server.services.crawling.strategies.recursive",
    # Alternative testing layout; in some repos modules live under python/ package root
    "python.server.services.crawling.strategies.recursive",
]

for modname in CANDIDATE_IMPORTS:
    try:
        mod = __import__(modname, fromlist=["RecursiveCrawlStrategy"])
        if hasattr(mod, "RecursiveCrawlStrategy"):
            RecursiveCrawlStrategy = getattr(mod, "RecursiveCrawlStrategy")
        if hasattr(mod, "URLHandler"):
            URLHandlerClass = getattr(mod, "URLHandler")
        break
    except Exception as e:
        _import_errors.append((modname, e))

# If still unresolved, define a minimal local class by copying the provided logic (last resort).
# But prefer failing fast with a helpful message to maintain alignment with actual module.
if RecursiveCrawlStrategy is None:
    pytest.skip(f"Could not import RecursiveCrawlStrategy from expected locations: {_import_errors}")

# --- Test Utilities ---

class FakeResult:
    def __init__(self, url: str, success: bool, markdown: Optional[str] = None, html: Optional[str] = None, links: Optional[Dict[str, List[Dict[str, str]]]] = None, error_message: Optional[str] = None):
        self.url = url
        self.success = success
        self.markdown = markdown
        self.html = html
        self.links = links or {"internal": []}
        self.error_message = error_message

class FakeAsyncGen:
    def __init__(self, items):
        self._items = items

    def __aiter__(self):
        self._iter = iter(self._items)
        return self

    async def __anext__(self):
        try:
            await asyncio.sleep(0)  # simulate async boundary
            return next(self._iter)
        except StopIteration:
            raise StopAsyncIteration

class FakeCrawler:
    def __init__(self, batches: List[List[FakeResult]]):
        """
        batches: A list of batches; each batch is a list of FakeResult that will be yielded.
        Each call to arun_many returns an async generator over the next batch in sequence.
        """
        self._batches = list(batches)
        self.calls = []

    async def arun_many(self, urls, config, dispatcher):
        # record call data to assert run configuration characteristics indirectly
        self.calls.append({
            "urls": list(urls),
            "config": config,
            "dispatcher": dispatcher,
        })
        if self._batches:
            batch = self._batches.pop(0)
        else:
            batch = []
        return FakeAsyncGen(batch)

class FakeURLHandler:
    def __init__(self, binary_predicate=None):
        self.binary_predicate = binary_predicate or (lambda url: False)
        self.calls = []

    def is_binary_file(self, url: str) -> bool:
        self.calls.append(url)
        return self.binary_predicate(url)

class DummyMarkdownGen:
    pass

async def no_op_progress(event_type, percentage, message, **kwargs):
    return  # aid for tests not needing capture

class ProgressCapture:
    def __init__(self):
        self.calls = []

    async def __call__(self, event_type, percentage, message, **kwargs):
        self.calls.append((event_type, percentage, message, kwargs))

# Helpers to override URLHandler inside the strategy instance
def set_url_handler(strategy, handler):
    # Strategy constructs URLHandler() in __init__, replace it for testing determinism
    if hasattr(strategy, "url_handler"):
        strategy.url_handler = handler

# Monkeypatch helper to control credential_service in target module
def patch_credential_service(monkeypatch, return_value=None, side_effect=None):
    # Import the module where credential_service is referenced
    target_mod = sys.modules[RecursiveCrawlStrategy.__module__]
    async def _fake_get_creds(_category):
        if side_effect:
            raise side_effect
        return return_value if return_value is not None else {}
    monkeypatch.setattr(target_mod.credential_service, "get_credentials_by_category", _fake_get_creds, raising=True)

# --- Tests ---

@pytest.mark.asyncio
async def test_returns_empty_and_reports_error_when_no_crawler(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={})  # Not used but avoid accidental calls
    strategy = RecursiveCrawlStrategy(crawler=None, markdown_generator=DummyMarkdownGen())
    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert results == []
    assert any(call[0] == "error" and "Crawler not available" in call[2] for call in progress.calls)

@pytest.mark.asyncio
async def test_configuration_valueerror_raises(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, side_effect=ValueError("bad int"))
    strategy = RecursiveCrawlStrategy(crawler=FakeCrawler([]), markdown_generator=DummyMarkdownGen())

    # Act / Assert
    with pytest.raises(ValueError) as ei:
        await strategy.crawl_recursive_with_progress(
            start_urls=["https://example.com"],
            transform_url_func=lambda u: u,
            is_documentation_site_func=lambda u: False,
            max_depth=1,
            progress_callback=no_op_progress,
        )
    assert "Failed to load crawler configuration" in str(ei.value)

@pytest.mark.asyncio
async def test_generic_exception_from_credentials_uses_defaults_and_crawls(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, side_effect=RuntimeError("db down"))
    # One batch with one successful result and one failed
    success = FakeResult(
        url="https://example.com/page",
        success=True,
        markdown="# Hello",
        html="<h1>Hello</h1>",
        links={"internal": [{"href": "https://example.com/next"}]},
    )
    failure = FakeResult(
        url="https://example.com/bad",
        success=False,
        error_message="404",
    )
    crawler = FakeCrawler([[success, failure]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())  # default non-binary

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com/start"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    assert results[0]["url"] == "https://example.com/page"
    assert results[0]["markdown"] == "# Hello"
    assert results[0]["html"] == "<h1>Hello</h1>"

    # Dispatcher default concurrency is 10 when None
    assert len(crawler.calls) == 1
    call = crawler.calls[0]
    assert isinstance(call["config"], _FakeCrawlerRunConfig)
    assert call["config"].stream is True
    # For non-doc site path, default timeout expected 45000
    assert call["config"].page_timeout == 45000
    assert isinstance(call["dispatcher"], _FakeMemoryAdaptiveDispatcher)
    assert call["dispatcher"].max_session_permit == 10

    # Progress reported start, batch progress, and completion
    assert any("Crawling depth 1/1" in m for (_, _, m, _) in progress.calls)
    assert any("completed" in m for (_, _, m, _) in progress.calls)

@pytest.mark.asyncio
async def test_doc_site_configuration_branch(monkeypatch):
    # Arrange: valid settings that tweak values
    patch_credential_service(monkeypatch, return_value={
        "CRAWL_PAGE_TIMEOUT": "30000",
        "CRAWL_DELAY_BEFORE_HTML": "1.0",
        "CRAWL_WAIT_STRATEGY": "domcontentloaded",
        "CRAWL_MAX_CONCURRENT": "7",
        "CRAWL_BATCH_SIZE": "2",
        "MEMORY_THRESHOLD_PERCENT": "75",
        "DISPATCHER_CHECK_INTERVAL": "0.2",
    })
    res = FakeResult(
        url="https://docs.example.com/index",
        success=True,
        markdown="mk",
        html="<p/>",
        links={"internal": []},
    )
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://docs.example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: "docs." in u,
        max_depth=1,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    call = crawler.calls[0]
    cfg = call["config"]
    assert isinstance(cfg, _FakeCrawlerRunConfig)
    # For doc site branch, timeout is 30000 and some flags set specially
    assert cfg.page_timeout == 30000
    assert cfg.stream is True
    assert cfg.wait_for_images is False
    assert cfg.scan_full_page is True
    assert cfg.process_iframes is True
    disp = call["dispatcher"]
    assert isinstance(disp, _FakeMemoryAdaptiveDispatcher)
    assert disp.max_session_permit == 7
    assert disp.memory_threshold_percent == 75
    assert disp.check_interval == 0.2

@pytest.mark.asyncio
async def test_transform_url_mapping_original_url_recovery(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={})
    orig = "https://example.com/a"
    transformed = "https://static.examplecdn.com/a"
    # Fake result uses transformed url; strategy should map back to original
    res = FakeResult(
        url=transformed,
        success=True,
        markdown="md",
        html="<p></p>",
        links={"internal": []},
    )
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=[orig],
        transform_url_func=lambda u: transformed if u == orig else u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    # Assert: original URL is preserved in output
    assert len(results) == 1
    assert results[0]["url"] == orig

@pytest.mark.asyncio
async def test_skips_binary_internal_links(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={"CRAWL_BATCH_SIZE": "10"})
    res = FakeResult(
        url="https://example.com/page",
        success=True,
        markdown="md",
        html="<p/>",
        links={"internal": [
            {"href": "https://example.com/file.pdf"},  # considered binary
            {"href": "https://example.com/next"},      # not binary
        ]},
    )
    # First depth processes one batch; second depth will enqueue only non-binary
    crawler = FakeCrawler([[res], []])  # second call yields nothing (no results for next urls)
    # Mark URL handler to treat .pdf as binary
    handler = FakeURLHandler(binary_predicate=lambda u: u.endswith(".pdf"))

    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, handler)

    progress = ProgressCapture()

    # Act
    results = await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com/start"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=2,
        progress_callback=progress,
    )

    # Assert
    assert len(results) == 1
    # Verify binary URL checked and skipped, non-binary enqueued
    assert any(u.endswith(".pdf") for u in handler.calls)
    # next depth triggers a second arun_many call with the non-binary next url
    assert len(crawler.calls) == 2
    assert crawler.calls[1]["urls"] == ["https://example.com/next"]

@pytest.mark.asyncio
async def test_progress_reporting_intervals(monkeypatch):
    # Arrange
    patch_credential_service(monkeypatch, return_value={"CRAWL_BATCH_SIZE": "3"})
    # 6 URLs total to force multiple progress updates and modulo reporting
    results_batch = [
        FakeResult(url=f"https://example.com/u{i}", success=True, markdown="m", html="<p/>")
        for i in range(6)
    ]
    crawler = FakeCrawler([results_batch])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())
    progress = ProgressCapture()

    # Act
    await strategy.crawl_recursive_with_progress(
        start_urls=[f"https://example.com/u{i}" for i in range(6)],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=progress,
        start_progress=10,
        end_progress=60,
    )

    # Assert: expect at least a start, mid (5th URL), and completion progress messages
    msgs = [m for (_, _, m, _) in progress.calls]
    assert any("Crawling depth 1/1" in m for m in msgs)
    assert any("processed 5/6 URLs" in m for m in msgs)
    assert any("completed" in m for m in msgs)

@pytest.mark.asyncio
async def test_settings_parsing_and_fallbacks(monkeypatch):
    # Provide mixed settings and rely on defaults for unspecified ones
    patch_credential_service(monkeypatch, return_value={
        "CRAWL_BATCH_SIZE": "4",
        # no explicit CRAWL_MAX_CONCURRENT -> should use default 10 if not overridden
        "MEMORY_THRESHOLD_PERCENT": "81.5",
        "DISPATCHER_CHECK_INTERVAL": "0.33",
    })
    res = FakeResult(url="https://example.com", success=True, markdown="m", html="<p/>")
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    await strategy.crawl_recursive_with_progress(
        start_urls=["https://example.com"],
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    call = crawler.calls[0]
    disp = call["dispatcher"]
    assert isinstance(disp, _FakeMemoryAdaptiveDispatcher)
    assert disp.max_session_permit == 10
    assert disp.memory_threshold_percent == 81.5
    assert abs(disp.check_interval - 0.33) < 1e-9

# Edge case: duplicate URLs and fragments should be normalized and not double-crawled
@pytest.mark.asyncio
async def test_url_normalization_removes_fragments_and_dedupes(monkeypatch):
    patch_credential_service(monkeypatch, return_value={})
    # Provide two start URLs that normalize to same base
    start_urls = [
        "https://example.com/page#section1",
        "https://example.com/page#section2",
    ]
    res = FakeResult(url="https://example.com/page", success=True, markdown="m", html="<p/>")
    crawler = FakeCrawler([[res]])
    strategy = RecursiveCrawlStrategy(crawler=crawler, markdown_generator=DummyMarkdownGen())
    set_url_handler(strategy, FakeURLHandler())

    await strategy.crawl_recursive_with_progress(
        start_urls=start_urls,
        transform_url_func=lambda u: u,
        is_documentation_site_func=lambda u: False,
        max_depth=1,
        progress_callback=no_op_progress,
    )

    # Only one crawl call should occur for deduped URL
    assert len(crawler.calls) == 1
    assert crawler.calls[0]["urls"] == ["https://example.com/page"]

@Wirasm (Collaborator, Author) commented Aug 21, 2025

@claude-review

claude Bot commented Aug 21, 2025

Claude finished @Wirasm's task.


Summary

This PR fixes a critical crawler timeout issue affecting JavaScript-heavy documentation sites by removing the problematic wait_for='body' selector that caused unnecessary 30-second timeouts. The changes improve error handling with proper exception chaining and remove unused imports across both batch and recursive crawling strategies.

Previous Review Comments

  • First review - no previous comments to address

Issues Found

Total: 1 critical, 2 important, 3 minor

🔴 Critical (Must Fix)

  • URL Transformation Bug in Recursive Strategy - python/src/server/services/crawling/strategies/recursive.py:169

    Problem: URLs aren't transformed before crawling, breaking GitHub raw-content extraction and original URL mapping. The code passes batch_urls (untransformed) to the crawler but then tries to map results back by comparing transform_url_func(orig_url) to result.url, which will rarely match.

    Fix: Transform URLs before crawling and maintain a mapping dictionary:

    batch_original_urls = urls_to_crawl[batch_idx : batch_idx + batch_size]
    batch_urls = [transform_url_func(u) for u in batch_original_urls]
    url_mapping = dict(zip(batch_urls, batch_original_urls))
    
    # After crawling, use:
    original_url = url_mapping.get(result.url, result.url)
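    For context, a self-contained sketch of this transform-then-map-back pattern; the blob-to-raw transform and the loop below are hypothetical stand-ins for the real crawler results:

    def transform_url(url: str) -> str:
        # Hypothetical example: crawl GitHub files as raw content instead of HTML pages
        return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

    batch_original_urls = ["https://github.com/org/repo/blob/main/README.md"]
    batch_urls = [transform_url(u) for u in batch_original_urls]
    url_mapping = dict(zip(batch_urls, batch_original_urls))

    # Each crawl result carries the transformed URL; map it back so storage and
    # dedupe keep working on the original (canonical) URL.
    for result_url in batch_urls:  # stand-in for iterating streamed crawler results
        original_url = url_mapping.get(result_url, result_url)
        print(original_url)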

🟡 Important (Should Fix)

  • Progress Callback Inconsistency - python/src/server/services/crawling/strategies/batch.py:122-125

    Problem: Batch strategy calls progress_callback("crawling", percentage, message) while recursive strategy adds step_info metadata. This inconsistency could break consumers expecting the richer payload format.

    Fix: Align batch strategy with recursive by adding step_info parameter to progress callbacks.

  • Unused Loop Counter - python/src/server/services/crawling/strategies/batch.py:165,193

    Problem: Variable j is initialized and incremented but never used, indicating potential logic issues or cleanup needed.

    Fix: Remove the unused counter or implement the intended functionality.

🟢 Minor (Consider)

  • Double URL Normalization - python/src/server/services/crawling/strategies/recursive.py:145-147

    normalize_url() is called twice per URL in the comprehension. Could optimize to avoid redundant work.

  • Missing Link Safety Guard - python/src/server/services/crawling/strategies/recursive.py:212

    Should add safe default for missing or None links attribute: links = getattr(result, "links", {}) or {}

  • Duplicate Binary File Checks - python/src/server/services/crawling/strategies/recursive.py:215-221

    is_binary_file() is called twice for the same URL. Could cache the result for efficiency.

Security Assessment

No security issues found. The changes are focused on crawler configuration and don't introduce new attack vectors. Input validation remains intact through the existing URL handling patterns.

Performance Considerations

Positive impacts:

  • Removing wait_for='body' eliminates 30-second timeouts on JS-heavy sites
  • Better suited timeout values (30s for docs, 45s for regular sites)
  • Streaming results processing maintains memory efficiency

Areas for optimization:

  • URL transformation bug fix will improve GitHub raw content extraction performance
  • Reducing duplicate binary file checks could save I/O operations

Good Practices Observed

  • Excellent error handling alignment with CLAUDE.md guidelines:

    • Configuration errors fail fast with ValueError and proper exception chaining (raise ... from e)
    • Non-critical errors (network issues) fall back to defaults while logging prominently
    • Proper use of exc_info=True for stack trace preservation
  • Alpha project principles followed:

    • Removes dead code (unused imports) immediately
    • Detailed error messages with context
    • Focus on functionality over backward compatibility
  • Smart crawler configuration:

    • Documentation sites get specialized settings (shorter timeout, iframe processing, overlay removal)
    • Streaming enabled for better parallel processing performance
    • Memory-adaptive dispatcher for resource management

Questionable Practices

  • Inconsistent progress reporting: The different callback signatures between strategies could cause integration issues
  • URL transformation logic: The current approach in recursive strategy suggests the feature isn't fully tested with GitHub URL transformations

Test Coverage

Current Coverage: Moderate - Core crawling logic is present but edge cases need attention

Missing Tests:

  1. URL Transformation Mapping

    • What to test: GitHub URL transformation and result mapping back to originals
    • Why important: Critical for GitHub content extraction functionality
    • Suggested test: Mock crawler results with transformed URLs and verify original URL preservation
  2. Documentation Site Detection

    • What to test: is_documentation_site_func triggering correct configuration branch
    • Why important: Different timeout/config behavior for different site types
    • Suggested test: Verify doc sites get 30s timeout while regular sites get 45s
  3. Progress Callback Consistency

    • What to test: Both strategies calling progress callbacks with expected parameters
    • Why important: UI/telemetry systems depend on consistent callback signatures
    • Suggested test: Mock progress callback and verify parameter structure matches
  4. Exception Chaining

    • What to test: Configuration errors properly chain original exceptions
    • Why important: Debugging configuration issues requires full stack traces
    • Suggested test: Mock credential service failures and verify exception chain preservation
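As a concrete starting point for item 4, a small pytest sketch of asserting the exception chain; the helper here is hypothetical and stands in for the strategy's config-loading path:

import pytest

def parse_timeout(raw: str) -> int:
    # Hypothetical loader used only for this sketch
    try:
        return int(raw)
    except ValueError as e:
        raise ValueError(f"Failed to load crawler configuration: bad timeout {raw!r}") from e

def test_config_error_chains_original_cause():
    with pytest.raises(ValueError) as excinfo:
        parse_timeout("thirty")
    # "raise ... from e" stores the original error on __cause__, preserving the full trace
    assert isinstance(excinfo.value.__cause__, ValueError)
    assert "Failed to load crawler configuration" in str(excinfo.value)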

Recommendations

Merge Decision:

  • Requires fixes before merging

Priority Actions:

  1. Fix URL transformation bug in recursive strategy - Critical for GitHub content extraction
  2. Align progress callback signatures - Prevents integration issues with UI components
  3. Add tests for URL transformation - Ensure the core feature works correctly

Rationale:
While the timeout fix addresses a real problem and the error handling improvements align well with alpha project principles, the URL transformation bug is a critical regression that would break GitHub repository crawling functionality. The fix is straightforward but essential before merging.


Review based on Archon V2 Alpha guidelines and CLAUDE.md principles


Wirasm added 3 commits on August 21, 2025 at 21:24

Remove wait_for='body' selector from documentation site crawling config.
The body element exists immediately in HTML, causing unnecessary timeouts
for JavaScript-rendered content. Now relies on domcontentloaded event
and delay_before_return_html for proper JavaScript execution.

- Remove unused asyncio imports from batch.py and recursive.py
- Add proper exception chaining with 'from e' to preserve stack traces
- Transform URLs to raw content (e.g., GitHub blob -> raw) before sending to crawler
- Maintain mapping dictionary to preserve original URLs in results
- Align progress callback signatures between batch and recursive strategies
- Add safety guards for missing links attribute
- Remove unused loop counter in batch strategy
- Optimize binary file checks to avoid duplicate calls

This ensures GitHub files are crawled as raw content instead of HTML pages,
fixing the issue where content extraction was degraded due to HTML wrapping.
@Wirasm force-pushed the fix/crawler-js-doc-sites branch from f077820 to 329a65e on August 21, 2025 at 19:05
@coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/src/server/services/crawling/strategies/batch.py (2)

55-60: Treat missing crawler as a hard service error, not “empty results.”

Returning [] hides a critical dependency outage. Prefer raising a descriptive exception after notifying progress_callback so upstream can fail fast.

Apply this diff:

         if not self.crawler:
             logger.error("No crawler instance available for batch crawling")
             if progress_callback:
                 await progress_callback("error", 0, "Crawler not available")
-            return []
+            raise RuntimeError("BatchCrawlStrategy: crawler instance is not initialized")

161-163: Add explicit max_session_permit to MemoryAdaptiveDispatcher instantiation

We verified that in both batch and recursive crawling strategies the dispatcher is constructed once and passed directly into arun_many, with no subsequent overrides. However, to ensure your intended concurrency limit takes effect (rather than relying on the dispatcher’s default), you should explicitly set max_session_permit (often referred to as “max_concurrent”) in the constructor:

• File: python/src/server/services/crawling/strategies/batch.py
Lines 116–118

• File: python/src/server/services/crawling/strategies/recursive.py
Lines 122–124

Replace this pattern:

- dispatcher = MemoryAdaptiveDispatcher(
-     memory_threshold_percent=memory_threshold,
-     check_interval=check_interval,
- )

With an explicit concurrency override:

  dispatcher = MemoryAdaptiveDispatcher(
      memory_threshold_percent=memory_threshold,
      check_interval=check_interval,
+     max_session_permit=crawl_config.max_concurrent,  # explicitly cap concurrent sessions
  )

If crawl_config doesn’t expose a max_concurrent field, substitute in the appropriate constant or configuration value.

Relevant dispatcher API docs confirm support for max_session_permit to control concurrency: (crawl4ai.com)
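A minimal sketch of the suggested construction; the parameter names match the MemoryAdaptiveDispatcher usage elsewhere in this review, while the concrete values and the max_concurrent variable are assumptions:

from crawl4ai import MemoryAdaptiveDispatcher

max_concurrent = 10  # assumed CRAWL_MAX_CONCURRENT setting / default

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,
    check_interval=0.5,
    max_session_permit=max_concurrent,  # explicit cap instead of the dispatcher default
)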

♻️ Duplicate comments (1)
python/src/server/services/crawling/strategies/recursive.py (1)

169-179: Fixed: transform before crawl and map results back to originals.

This resolves the previously flagged mapping issue for GitHub/raw content.

🧹 Nitpick comments (4)
python/src/server/services/crawling/strategies/batch.py (1)

75-84: Fallback path is fine; also surface the concrete defaults being applied.

You already log the failure and proceed with safe defaults. Consider explicitly logging the resolved defaults to aid ops/debugging.

Apply this diff to include the effective values in the log:

-            logger.error(
-                f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-            )
+            logger.error(
+                f"Failed to load crawl settings from database: {e}, using defaults",
+                exc_info=True,
+            )
+            logger.info(
+                "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+                "memory_threshold=80.0, check_interval=0.5",
+                10 if max_concurrent is None else max_concurrent,
+            )
python/src/server/services/crawling/strategies/recursive.py (3)

81-89: Fallback defaults: also log concretely which values are in effect.

Same suggestion as batch for observability of defaulted config.

-            logger.error(
-                f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-            )
+            logger.error(
+                f"Failed to load crawl settings from database: {e}, using defaults",
+                exc_info=True,
+            )
+            logger.info(
+                "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+                "memory_threshold=80.0, check_interval=0.5",
+                10 if max_concurrent is None else max_concurrent,
+            )

207-212: Result shape consistent (url, markdown, html).

Keeping raw HTML is useful for code/DOM post-processing. Consider adding content_length guards later if memory becomes a concern.


204-205: Optional: only mark visited on success to allow retrying failures within the same session.

Currently, failed pages are marked visited and won’t be retried at later depths. If you want opportunistic retries (e.g., transient 429), defer adding to visited until success.

If interested, I can wire a simple per-URL retry with exponential backoff around the crawl call, gated by a max_retries setting.

Also applies to: 226-229
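If that offer is taken up, a rough sketch of what such a wrapper could look like; everything here is hypothetical, with crawl_one standing in for the real per-URL crawl call:

import asyncio
import random

async def crawl_with_retries(crawl_one, url: str, max_retries: int = 3):
    """Retry a single-URL crawl with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_retries + 1):
        try:
            return await crawl_one(url)
        except Exception:  # in real code, narrow this to transient errors (e.g. HTTP 429)
            if attempt == max_retries:
                raise
            delay = (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)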

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f077820 and 329a65e.

📒 Files selected for processing (2)
  • python/src/server/services/crawling/strategies/batch.py (2 hunks)
  • python/src/server/services/crawling/strategies/recursive.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/{server/services,agents}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/server/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}

📄 CodeRabbit inference engine (CLAUDE.md)

{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
python/src/server/**

📄 CodeRabbit inference engine (CLAUDE.md)

Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)

Files:

  • python/src/server/services/crawling/strategies/recursive.py
  • python/src/server/services/crawling/strategies/batch.py
🔇 Additional comments (13)
python/src/server/services/crawling/strategies/batch.py (2)

72-72: Good fix: preserve original stack via exception chaining.

Raising with "from e" keeps the cause and full traceback. Matches our guidelines on preserving stack traces.


91-103: Right direction: remove wait_for='body' for JS-heavy docs; rely on wait_until + delay.

This should eliminate the 30s timeouts on pages where the body element appears before JS content. Ensure ops config documents CRAWL_WAIT_STRATEGY accepted values (e.g., 'domcontentloaded', 'load', 'networkidle') so deployments don't misconfigure it.

Would you like me to add a short section to the settings docs describing CRAWL_WAIT_STRATEGY and recommended values for documentation sites?
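In the meantime, a hedged sketch of how a deployment could validate the setting before handing it to CrawlerRunConfig; the accepted values are the ones listed above, and the helper itself is hypothetical:

VALID_WAIT_STRATEGIES = {"domcontentloaded", "load", "networkidle"}

def resolve_wait_strategy(settings: dict) -> str:
    # Fall back to the documented default rather than passing junk to the crawler
    value = settings.get("CRAWL_WAIT_STRATEGY", "domcontentloaded")
    if value not in VALID_WAIT_STRATEGIES:
        raise ValueError(
            f"Invalid CRAWL_WAIT_STRATEGY {value!r}; expected one of {sorted(VALID_WAIT_STRATEGIES)}"
        )
    return value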

python/src/server/services/crawling/strategies/recursive.py (11)

6-6: Trivial whitespace/comment-only change.

Nothing to review here.


42-43: Progress allocation: confirm the 10%→60% window matches UI expectations.

end_progress=60 reserves 40% for post-recursive phases. Verify callers/UX expect this distribution to avoid >100% aggregate progress.

If needed, I can adjust the depth weighting factor (currently 0.8 of the window) to maintain smoothness.


77-77: Good: exception chaining on configuration errors.

Preserves root cause and stack trace.


93-109: Doc-site run_config: sensible defaults for JS-heavy pages.

Switching to wait_until with a 1.0s delay and disabling image waits should reduce timeouts. Confirm process_iframes=True is supported by your crawl4ai version in prod.

If process_iframes is not universally available, we can gate it via a settings flag.


126-126: Dispatcher params look correct.

max_session_permit wires to max_concurrent; matches tests the bot suggested.


181-189: Progress payload keys sent via report_progress will now live under step_info.

Once step_info=... is used (see earlier fix), totalPages/processedPages will be found at step_info.totalPages, etc. Confirm the UI expects them there.

I can adapt the progress emitter to also surface these at top-level if your UI relies on that.


193-195: Correct: crawl transformed URLs with dispatcher.

Matches batch behavior and uses arun_many for native parallel streaming.


200-202: Correct URL mapping back to originals.

Ensures downstream storage and dedupe work on canonical URLs.


216-224: Good: robust link extraction and binary-skip guard.

Using getattr(..., {}) avoids attribute errors; binary filter prevents enqueuing blobs.


226-229: Useful warning includes original URL and message.

Keeps logs actionable without crashing the crawl.


247-256: Progress reporting at depth end and crawl end is clear.

Messages are specific and will help users understand progress across depths.

Comment on lines +122 to 127
 async def report_progress(percentage: int, message: str, **kwargs):
     """Helper to report progress if callback is available"""
     if progress_callback:
-        await progress_callback("crawling", percentage, message)
+        step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+        await progress_callback("crawling", percentage, message, step_info=step_info)


⚠️ Potential issue

Progress callback contract differs from recursive strategy. Align on step_info kw param.

Here you pass step_info as a single kwarg. In recursive.py, step_info’s contents are expanded as top-level kwargs, which can cause runtime TypeError if the callback doesn’t accept those names. Let’s standardize on step_info=... in both places.

I’ve proposed the corresponding fix in recursive.py. No change needed here; this comment flags the cross-file inconsistency.

🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/batch.py around lines 122 to
127, ensure the progress callback is called with a single step_info kwarg (as
currently implemented: step_info = {...} and await progress_callback("crawling",
percentage, message, step_info=step_info)); leave this implementation unchanged
and confirm the recursive strategy is updated to pass step_info as a single
kwarg as well so both strategies match the same callback contract.
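For clarity, a small sketch of the contract being standardized on, using the callback shape and step_info keys described in this thread; the callback body itself is illustrative:

async def progress_callback(event_type: str, percentage: int, message: str, step_info: dict | None = None):
    # Consumers read richer metadata from step_info instead of arbitrary top-level kwargs
    step_info = step_info or {}
    print(event_type, percentage, message, step_info.get("currentStep"), step_info.get("stepMessage"))

async def report_progress(percentage: int, message: str, **kwargs):
    step_info = {"currentStep": message, "stepMessage": message, **kwargs}
    await progress_callback("crawling", percentage, message, step_info=step_info)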

Comment on lines 60 to +65
 if not self.crawler:
     logger.error("No crawler instance available for recursive crawling")
     if progress_callback:
-        await progress_callback('error', 0, 'Crawler not available')
+        await progress_callback("error", 0, "Crawler not available")
     return []


🛠️ Refactor suggestion

Fail fast when crawler is unavailable.

Mirror batch strategy: treat this as a hard error instead of returning []. Keep the progress notification, then raise.

         if not self.crawler:
             logger.error("No crawler instance available for recursive crawling")
             if progress_callback:
                 await progress_callback("error", 0, "Crawler not available")
-            return []
+            raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:
        if not self.crawler:
            logger.error("No crawler instance available for recursive crawling")
            if progress_callback:
                await progress_callback("error", 0, "Crawler not available")
            return []

After:
        if not self.crawler:
            logger.error("No crawler instance available for recursive crawling")
            if progress_callback:
                await progress_callback("error", 0, "Crawler not available")
            raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 60
to 65, the code currently logs and returns an empty list when self.crawler is
missing; instead, mirror the batch strategy by treating this as a hard error:
keep the existing logger.error and await the progress_callback("error", 0,
"Crawler not available") if provided, then raise an appropriate exception (e.g.,
RuntimeError or a CrawlerUnavailableError) to fail fast rather than returning
[].

Comment on lines +132 to +134
step_info = {"currentStep": message, "stepMessage": message, **kwargs}
await progress_callback("crawling", percentage, message, **step_info)


⚠️ Potential issue

Bug: progress_callback invoked with expanded kwargs; use step_info=... instead.

Expanding step_info as top-level kwargs risks TypeError in callbacks not expecting those names and is inconsistent with batch.py.

Apply this diff:

-                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
-                await progress_callback("crawling", percentage, message, **step_info)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, step_info=step_info)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
-                await progress_callback("crawling", percentage, message, **step_info)
+                step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+                await progress_callback("crawling", percentage, message, step_info=step_info)
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 132
to 134, the code expands step_info into top-level kwargs when calling
progress_callback which can cause TypeError and is inconsistent with batch.py;
instead remove the **step_info/**kwargs expansion and pass the dict as a single
named argument (e.g., step_info=step_info) so the callback receives the info
consistently; update the await call to await progress_callback("crawling",
percentage, message, step_info=step_info) and ensure any leftover **kwargs are
not expanded into the call.

Comment on lines +233 to +241
current_progress = depth_start + int(
    (current_idx / len(urls_to_crawl)) * (depth_end - depth_start)
)
await report_progress(
    current_progress,
    f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
    totalPages=total_processed,
    processedPages=len(results_all),
)

⚠️ Potential issue

Same progress callback bug here; align to step_info=...

Apply the same fix to avoid expanding unknown kwargs.

-                        await report_progress(
-                            current_progress,
-                            f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
-                            totalPages=total_processed,
-                            processedPages=len(results_all),
-                        )
+                        await report_progress(
+                            current_progress,
+                            f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
+                            totalPages=total_processed,
+                            processedPages=len(results_all),
+                        )

Note: The call site stays the same; the earlier change in report_progress will ensure these land under step_info.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 233
to 241, the report_progress call is passing unknown kwargs (totalPages and
processedPages) which will be dropped; update the call to pass those values
inside the step_info dict (e.g., step_info={"totalPages": total_processed,
"processedPages": len(results_all)}) and keep the same message and
current_progress argument so the progress payload lands under step_info as the
other fixes did.
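The note above assumes report_progress itself folds any extra keyword arguments into step_info before delegating to the callback. A minimal sketch of that shape; the factory form and type alias are assumptions for illustration (the real helper is presumably a local closure inside the strategy):

from typing import Any, Awaitable, Callable

# Assumed callback shape: (status, percentage, message, step_info=...)
ProgressCallback = Callable[..., Awaitable[None]]


def make_report_progress(progress_callback: ProgressCallback | None):
    async def report_progress(percentage: int, message: str, **kwargs: Any) -> None:
        if progress_callback is None:
            return
        # totalPages, processedPages, etc. all land under step_info rather
        # than being expanded as unknown top-level kwargs.
        step_info = {"currentStep": message, "stepMessage": message, **kwargs}
        await progress_callback("crawling", percentage, message, step_info=step_info)

    return report_progress

With a helper like this, call sites such as the one in the diff above can keep passing totalPages=... and processedPages=..., and those values reach the callback inside step_info.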

@coleam00 coleam00 self-requested a review August 21, 2025 21:08
@Wirasm Wirasm merged commit cb4dba1 into main Aug 22, 2025
24 checks passed
@Wirasm Wirasm deleted the fix/crawler-js-doc-sites branch August 22, 2025 05:56
POWERFULMOVES added a commit to POWERFULMOVES/PMOVES-Archon that referenced this pull request Feb 12, 2026
…ation

Merge PR coleam00#395 - NATS infrastructure and A2UI bridge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
…395) (#406)

* Investigate handler bugs #392-#395: commands JSON parsing, dotenv worktree, error messages

* Fix handler bugs: JSON parsing, dotenv worktree, error messages (#392-#395)

Four related fixes for handler/startup issues:

- Fix /commands showing character indices instead of command names when SQLite
  stores commands as JSON string (#392)
- Remove dotenv/config import, rely on Bun's built-in .env loading which
  searches up the directory tree — fixes worktree startup failures (#393)
- Add actionable guidance to no_ai_credentials fatal error (#394)
- Improve /setcwd rejection message with explanation and alternative (#395)

Fixes #392, Fixes #393, Fixes #394, Fixes #395

* Add try-catch for JSON.parse in getCodebaseCommands()

Prevents crash on malformed JSON in database. Logs the error and returns
empty commands object as graceful fallback.
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…am00#392-coleam00#395) (coleam00#406)

joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…am00#392-coleam00#395) (coleam00#406)
