Fix crawler timeout for JavaScript-heavy documentation sites#395
Conversation
Walkthrough

Updates two crawling strategies: remove unused asyncio imports, preserve original errors with exception chaining on config load failures, switch documentation-site waits from an explicit wait_for="body" to wait_until/CRAWL_WAIT_STRATEGY, enrich progress callbacks with a step_info payload, and make minor logging/formatting tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Strategy as Crawl Strategy (batch/recursive)
    participant Config as CrawlerRunConfig
    participant Engine as Crawl Engine
    participant Progress as Progress Callback
    Caller->>Strategy: start_crawl(params)
    Strategy->>Strategy: load crawler settings
    alt load error (ValueError/KeyError/TypeError)
        Strategy-->>Caller: raise ValueError(...) from e
    else settings available (or default {})
        Strategy->>Config: build config (use wait_until from CRAWL_WAIT_STRATEGY)
        Note right of Config: removed explicit wait_for="body"
        Strategy->>Engine: arun_many(transformed_urls, config)
        Engine-->>Strategy: streaming progress/events
        Strategy->>Progress: on_progress(step_info={currentStep, stepMessage}, **extra)
        Engine-->>Strategy: streaming/results (mapped to original URLs)
        Strategy-->>Caller: aggregated results
    end
```
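The exception-chaining behavior shown in the diagram follows the standard Python `raise ... from e` pattern. A minimal, self-contained sketch, where load_crawl_settings is a hypothetical stand-in for the strategies' real settings loader:

```python
def load_crawl_settings() -> dict:
    # Hypothetical stand-in for the strategies' actual settings loader.
    raise KeyError("CRAWL_WAIT_STRATEGY")

def build_settings() -> dict:
    try:
        return load_crawl_settings()
    except (ValueError, KeyError, TypeError) as e:
        # `raise ... from e` preserves the original exception as __cause__,
        # so logs show both the config error and its root cause.
        raise ValueError(f"Failed to load crawl settings: {e}") from e
```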
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (6)
python/src/server/services/crawling/strategies/batch.py (2)
165-194: Remove unused loop counter `j`

`j` is initialized and incremented but never used. Safe to remove.

```diff
- j = 0
  ...
- j += 1
```
121-126: Unify progress callback shape with recursive strategy

Batch still calls progress_callback without step_info, while recursive adds step metadata. Consider aligning for consumers that expect the richer payload.

```diff
-    async def report_progress(percentage: int, message: str):
+    async def report_progress(percentage: int, message: str, **kwargs):
         """Helper to report progress if callback is available"""
         if progress_callback:
-            await progress_callback("crawling", percentage, message)
+            step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+            await progress_callback("crawling", percentage, message, **step_info)
```

python/src/server/services/crawling/strategies/recursive.py (4)
128-134: Progress callback enrichment is helpful; consider aligning batch, too

Adding step_info improves UI/telemetry. Recommend aligning the batch strategy to the same shape for consistency across flows.
145-147: Avoid double-normalization work per URL

current_urls is already normalized; this comprehension normalizes twice. Minor, but easy to streamline.

```diff
- urls_to_crawl = [
-     normalize_url(url) for url in current_urls if normalize_url(url) not in visited
- ]
+ urls_to_crawl = [url for url in current_urls if url not in visited]
```
211-214: Guard against missing/None links in results

Some crawler results may not populate links or may set it to None. Add a safe default to avoid attribute errors.

```diff
- for link in result.links.get("internal", []):
+ links = getattr(result, "links", {}) or {}
+ for link in links.get("internal", []):
```
215-221: Avoid duplicate is_binary_file checks per link

Compute once to reduce overhead and clarify intent.

```diff
- if next_url not in visited and not self.url_handler.is_binary_file(
-     next_url
- ):
-     next_level_urls.add(next_url)
- elif self.url_handler.is_binary_file(next_url):
+ is_binary = self.url_handler.is_binary_file(next_url)
+ if next_url not in visited and not is_binary:
+     next_level_urls.add(next_url)
+ elif is_binary:
      logger.debug(f"Skipping binary file from crawl queue: {next_url}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/strategies/batch.py (1 hunks)
- python/src/server/services/crawling/strategies/recursive.py (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
python/src/server/services/crawling/strategies/recursive.py (2)
python/src/server/services/crawling/helpers/url_handler.py (2)
- URLHandler (14-127)
- is_binary_file (52-97)

python/src/server/services/crawling/strategies/batch.py (1)
- report_progress (122-125)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Docker Build Tests (server)
- GitHub Check: Backend Tests (Python + pytest)
🔇 Additional comments (4)
python/src/server/services/crawling/strategies/batch.py (3)
72-72: Good fix: exception chaining preserves traceback

Raising from the original exception keeps the full stack. This will help debugging config issues.
75-77: Robust logging on settings fallback

Logging with exc_info while safely falling back to defaults is the right tradeoff for availability.
91-103: Doc-site wait strategy change is correct (prevents 30s timeouts)

Dropping wait_for="body" and relying on wait_until + delay_before_return_html for JS-heavy docs is the right approach. This avoids waiting on a selector that appears immediately while still allowing the page to render.
python/src/server/services/crawling/strategies/recursive.py (1)
94-109: Doc-site wait strategy: LGTM

Switching to wait_until with a short delay and skipping images addresses JS-rendered docs without the false-positive wait on `body`.
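To make the before/after concrete, here is a minimal sketch assuming crawl4ai's CrawlerRunConfig parameters as discussed in this review; the exact values are illustrative, not the project's real settings:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Before (problematic): "body" exists in the raw HTML immediately, so this
# wait adds nothing for JS-rendered content and pages could still hit the
# 30s timeout before the real content appeared.
old_config = CrawlerRunConfig(
    wait_for="body",
    cache_mode=CacheMode.BYPASS,
)

# After: wait on a page lifecycle event, then give JavaScript a fixed
# render window before returning HTML.
new_config = CrawlerRunConfig(
    wait_until="domcontentloaded",   # value taken from CRAWL_WAIT_STRATEGY
    delay_before_return_html=1.0,    # ~1.0s render budget for doc sites
    cache_mode=CacheMode.BYPASS,
)
```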
Here are the copyable unit test edits:

Copyable Edits

python/tests/server/services/crawling/strategies/test_batch.py: This is a new file.

python/tests/server/services/crawling/strategies/test_recursive.py: This is a new file.
Claude finished @Wirasm's task

Summary

This PR fixes a critical crawler timeout issue affecting JavaScript-heavy documentation sites by removing the problematic wait_for="body" selector.

Previous Review Comments
Issues Found

Total: 1 critical, 2 important, 3 minor

🔴 Critical (Must Fix)
🟡 Important (Should Fix)
🟢 Minor (Consider)
Security Assessment

No security issues found. The changes are focused on crawler configuration and don't introduce new attack vectors. Input validation remains intact through the existing URL handling patterns.

Performance Considerations

Positive impacts:
Areas for optimization:
Good Practices Observed
Questionable Practices
Test Coverage

Current Coverage: Moderate. Core crawling logic is present but edge cases need attention.

Missing Tests:

Recommendations

Merge Decision:
Priority Actions:
Rationale: Review based on Archon V2 Alpha guidelines and CLAUDE.md principles.
Remove wait_for='body' selector from documentation site crawling config. The body element exists immediately in HTML, causing unnecessary timeouts for JavaScript-rendered content. Now relies on domcontentloaded event and delay_before_return_html for proper JavaScript execution.
- Remove unused asyncio imports from batch.py and recursive.py
- Add proper exception chaining with 'from e' to preserve stack traces
- Transform URLs to raw content (e.g., GitHub blob -> raw) before sending to crawler
- Maintain mapping dictionary to preserve original URLs in results
- Align progress callback signatures between batch and recursive strategies
- Add safety guards for missing links attribute
- Remove unused loop counter in batch strategy
- Optimize binary file checks to avoid duplicate calls

This ensures GitHub files are crawled as raw content instead of HTML pages, fixing the issue where content extraction was degraded due to HTML wrapping.
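To illustrate the blob-to-raw transformation this commit describes, here is a hedged sketch; the real helper lives in the URL handler, so treat this function name and its exact rules as hypothetical:

```python
from urllib.parse import urlparse

def github_blob_to_raw(url: str) -> str:
    """Hypothetical sketch: rewrite a GitHub blob URL to its raw-content URL.

    https://github.com/owner/repo/blob/main/README.md
      -> https://raw.githubusercontent.com/owner/repo/main/README.md
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: owner / repo / "blob" / ref / path...
    if parsed.netloc == "github.com" and len(parts) >= 4 and parts[2] == "blob":
        owner, repo, _, ref, *path = parts
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"
    return url  # leave non-blob URLs untouched

# The strategy would keep a {transformed: original} mapping so results are
# reported against the original URLs, as the commit message notes.
```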
Force-pushed from f077820 to 329a65e
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/src/server/services/crawling/strategies/batch.py (2)
55-60: Treat missing crawler as a hard service error, not "empty results."

Returning [] hides a critical dependency outage. Prefer raising a descriptive exception after notifying progress_callback so upstream can fail fast.
Apply this diff:
```diff
     if not self.crawler:
         logger.error("No crawler instance available for batch crawling")
         if progress_callback:
             await progress_callback("error", 0, "Crawler not available")
-        return []
+        raise RuntimeError("BatchCrawlStrategy: crawler instance is not initialized")
```
161-163: Add explicit max_session_permit to MemoryAdaptiveDispatcher instantiation

We verified that in both batch and recursive crawling strategies the dispatcher is constructed once and passed directly into arun_many, with no subsequent overrides. However, to ensure your intended concurrency limit takes effect (rather than relying on the dispatcher's default), you should explicitly set max_session_permit (often referred to as "max_concurrent") in the constructor:

- File: python/src/server/services/crawling/strategies/batch.py, lines 116-118
- File: python/src/server/services/crawling/strategies/recursive.py, lines 122-124

Replace this pattern:

```diff
- dispatcher = MemoryAdaptiveDispatcher(
-     memory_threshold_percent=memory_threshold,
-     check_interval=check_interval,
- )
```

With an explicit concurrency override:

```diff
  dispatcher = MemoryAdaptiveDispatcher(
      memory_threshold_percent=memory_threshold,
      check_interval=check_interval,
+     max_session_permit=crawl_config.max_concurrent,  # explicitly cap concurrent sessions
  )
```

If crawl_config doesn't expose a max_concurrent field, substitute the appropriate constant or configuration value.

Relevant dispatcher API docs confirm support for max_session_permit to control concurrency: (crawl4ai.com)
♻️ Duplicate comments (1)
python/src/server/services/crawling/strategies/recursive.py (1)
169-179: Fixed: transform before crawl and map results back to originals.

This resolves the previously flagged mapping issue for GitHub/raw content.
🧹 Nitpick comments (4)
python/src/server/services/crawling/strategies/batch.py (1)
75-84: Fallback path is fine; also surface the concrete defaults being applied.

You already log the failure and proceed with safe defaults. Consider explicitly logging the resolved defaults to aid ops/debugging.
Apply this diff to include the effective values in the log:
```diff
-        logger.error(
-            f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-        )
+        logger.error(
+            f"Failed to load crawl settings from database: {e}, using defaults",
+            exc_info=True,
+        )
+        logger.info(
+            "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+            "memory_threshold=80.0, check_interval=0.5",
+            10 if max_concurrent is None else max_concurrent,
+        )
```

python/src/server/services/crawling/strategies/recursive.py (3)
81-89: Fallback defaults: also log concretely which values are in effect.

Same suggestion as batch for observability of defaulted config.

```diff
-        logger.error(
-            f"Failed to load crawl settings from database: {e}, using defaults", exc_info=True
-        )
+        logger.error(
+            f"Failed to load crawl settings from database: {e}, using defaults",
+            exc_info=True,
+        )
+        logger.info(
+            "Crawl defaults in use: batch_size=50, max_concurrent=%s, "
+            "memory_threshold=80.0, check_interval=0.5",
+            10 if max_concurrent is None else max_concurrent,
+        )
```
207-212: Result shape consistent (url, markdown, html).

Keeping raw HTML is useful for code/DOM post-processing. Consider adding content_length guards later if memory becomes a concern.
204-205: Optional: only mark visited on success to allow retrying failures within the same session.

Currently, failed pages are marked visited and won't be retried at later depths. If you want opportunistic retries (e.g., transient 429), defer adding to visited until success.
If interested, I can wire a simple per-URL retry with exponential backoff around the crawl call, gated by a max_retries setting.
Also applies to: 226-229
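For reference, such a wrapper might look like the following hedged sketch; max_retries and the backoff constants are placeholders, and only crawler.arun and result.success are assumed from the crawl4ai API:

```python
import asyncio
import random

async def crawl_with_retry(crawler, url: str, config, max_retries: int = 3):
    """Hypothetical per-URL retry with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        result = await crawler.arun(url=url, config=config)
        if result.success:
            return result
        if attempt == max_retries:
            return result  # give up; caller logs and skips marking visited
        # 1s, 2s, 4s, ... plus jitter to avoid synchronized retry storms
        await asyncio.sleep(2 ** attempt + random.uniform(0, 0.5))
```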
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- python/src/server/services/crawling/strategies/batch.py (2 hunks)
- python/src/server/services/crawling/strategies/recursive.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
python/src/{server,mcp,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/{server,mcp,agents}/**/*.py: Fail fast on service startup failures, missing configuration, database connection issues, auth failures, critical dependency outages, and invalid data that would corrupt state
External API calls should use retry with exponential backoff and ultimately fail with a clear, contextual error message
Error messages must include context (operation being attempted) and relevant IDs/URLs/data for debugging
Preserve full stack traces in logs (e.g., Python logging with exc_info=True)
Use specific exception types; avoid catching broad Exception unless re-raising with context
Never signal failure by returning None/null; raise a descriptive exception instead
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
python/src/{server/services,agents}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never accept or store corrupted data (e.g., zero embeddings, null foreign keys, malformed JSON); skip failed items entirely instead of persisting bad data
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: For batch processing and background tasks, continue processing but log detailed per-item failures and return both successes and failures
Do not crash the server on a single WebSocket event failure; log the error and continue serving other clients
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
python/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/**/*.py: Target Python 3.12 with a 120-character line length
Use Ruff for linting and Mypy for type checking before commit
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}
📄 CodeRabbit inference engine (CLAUDE.md)
{python/**/*.py,archon-ui-main/src/**/*.{ts,tsx,js,jsx}}: Remove dead code immediately; do not keep legacy/unused functions
Avoid comments that reference change history (e.g., LEGACY, CHANGED, REMOVED); keep comments focused on current functionality
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
python/src/server/**
📄 CodeRabbit inference engine (CLAUDE.md)
Keep FastAPI application code under python/src/server/ (routes in api_routes/, services in services/, main in main.py)
Files:
- python/src/server/services/crawling/strategies/recursive.py
- python/src/server/services/crawling/strategies/batch.py
🔇 Additional comments (13)
python/src/server/services/crawling/strategies/batch.py (2)
72-72: Good fix: preserve original stack via exception chaining.

Raising with "from e" keeps the cause and full traceback. Matches our guidelines on preserving stack traces.

91-103: Right direction: remove wait_for='body' for JS-heavy docs; rely on wait_until + delay.

This should eliminate the 30s timeouts on pages where `body` appears before JS content. Ensure ops config documents CRAWL_WAIT_STRATEGY accepted values (e.g., 'domcontentloaded', 'load', 'networkidle') so deployments don't misconfigure it.
Would you like me to add a short section to the settings docs describing CRAWL_WAIT_STRATEGY and recommended values for documentation sites?
python/src/server/services/crawling/strategies/recursive.py (11)
6-6: Trivial whitespace/comment-only change.

Nothing to review here.

42-43: Progress allocation: confirm the 10%→60% window matches UI expectations.

end_progress=60 reserves 40% for post-recursive phases. Verify callers/UX expect this distribution to avoid >100% aggregate progress.
If needed, I can adjust the depth weighting factor (currently 0.8 of the window) to maintain smoothness.
77-77: Good: exception chaining on configuration errors.

Preserves root cause and stack trace.

93-109: Doc-site run_config: sensible defaults for JS-heavy pages.

Switching to wait_until with a 1.0s delay and disabling image waits should reduce timeouts. Confirm process_iframes=True is supported by your crawl4ai version in prod.
If process_iframes is not universally available, we can gate it via a settings flag.
126-126: Dispatcher params look correct.

max_session_permit wires to max_concurrent; matches tests the bot suggested.

181-189: Progress payload keys sent via report_progress will now live under step_info.

Once step_info=... is used (see earlier fix), totalPages/processedPages will be found at step_info.totalPages, etc. Confirm the UI expects them there.
I can adapt the progress emitter to also surface these at top-level if your UI relies on that.
193-195: Correct: crawl transformed URLs with dispatcher.

Matches batch behavior and uses arun_many for native parallel streaming.

200-202: Correct URL mapping back to originals.

Ensures downstream storage and dedupe work on canonical URLs.

216-224: Good: robust link extraction and binary-skip guard.

Using getattr(..., {}) avoids attribute errors; binary filter prevents enqueuing blobs.

226-229: Useful warning includes original URL and message.

Keeps logs actionable without crashing the crawl.

247-256: Progress reporting at depth end and crawl end is clear.

Messages are specific and will help users understand progress across depths.
```diff
-    async def report_progress(percentage: int, message: str):
+    async def report_progress(percentage: int, message: str, **kwargs):
         """Helper to report progress if callback is available"""
         if progress_callback:
-            await progress_callback("crawling", percentage, message)
+            step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+            await progress_callback("crawling", percentage, message, step_info=step_info)
```
Progress callback contract differs from recursive strategy. Align on step_info kw param.
Here you pass step_info as a single kwarg. In recursive.py, step_info’s contents are expanded as top-level kwargs, which can cause runtime TypeError if the callback doesn’t accept those names. Let’s standardize on step_info=... in both places.
I’ve proposed the corresponding fix in recursive.py. No change needed here; this comment flags the cross-file inconsistency.
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/batch.py around lines 122 to
127, ensure the progress callback is called with a single step_info kwarg (as
currently implemented: step_info = {...} and await progress_callback("crawling",
percentage, message, step_info=step_info)); leave this implementation unchanged
and confirm the recursive strategy is updated to pass step_info as a single
kwarg as well so both strategies match the same callback contract.
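To make the standardized contract concrete, here is a hedged sketch of a consumer-side callback that accepts the single step_info kwarg; the signature is inferred from the calls shown in this review, not taken from the actual UI consumer:

```python
from typing import Any

async def progress_callback(
    status: str,
    percentage: int,
    message: str,
    step_info: dict[str, Any] | None = None,
) -> None:
    # The consumer only needs to know about one optional kwarg; it never
    # breaks when callers add new keys inside step_info.
    payload = {"status": status, "percentage": percentage, "message": message}
    if step_info:
        payload["step_info"] = step_info
    print(payload)  # stand-in for emitting to the UI/telemetry channel
```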
```diff
         if not self.crawler:
             logger.error("No crawler instance available for recursive crawling")
             if progress_callback:
-                await progress_callback('error', 0, 'Crawler not available')
+                await progress_callback("error", 0, "Crawler not available")
             return []
```
🛠️ Refactor suggestion
Fail fast when crawler is unavailable.
Mirror batch strategy: treat this as a hard error instead of returning []. Keep the progress notification, then raise.
```diff
     if not self.crawler:
         logger.error("No crawler instance available for recursive crawling")
         if progress_callback:
             await progress_callback("error", 0, "Crawler not available")
-        return []
+        raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not self.crawler:
    logger.error("No crawler instance available for recursive crawling")
    if progress_callback:
        await progress_callback("error", 0, "Crawler not available")
    raise RuntimeError("RecursiveCrawlStrategy: crawler instance is not initialized")
```
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 60
to 65, the code currently logs and returns an empty list when self.crawler is
missing; instead, mirror the batch strategy by treating this as a hard error:
keep the existing logger.error and await the progress_callback("error", 0,
"Crawler not available") if provided, then raise an appropriate exception (e.g.,
RuntimeError or a CrawlerUnavailableError) to fail fast rather than returning
[].
```python
step_info = {"currentStep": message, "stepMessage": message, **kwargs}
await progress_callback("crawling", percentage, message, **step_info)
```
Bug: progress_callback invoked with expanded kwargs; use step_info=... instead.
Expanding step_info as top-level kwargs risks TypeError in callbacks not expecting those names and is inconsistent with batch.py.
Apply this diff:
```diff
- step_info = {"currentStep": message, "stepMessage": message, **kwargs}
- await progress_callback("crawling", percentage, message, **step_info)
+ step_info = {"currentStep": message, "stepMessage": message, **kwargs}
+ await progress_callback("crawling", percentage, message, step_info=step_info)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
step_info = {"currentStep": message, "stepMessage": message, **kwargs}
await progress_callback("crawling", percentage, message, step_info=step_info)
```
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 132
to 134, the code expands step_info into top-level kwargs when calling
progress_callback which can cause TypeError and is inconsistent with batch.py;
instead remove the **step_info/**kwargs expansion and pass the dict as a single
named argument (e.g., step_info=step_info) so the callback receives the info
consistently; update the await call to await progress_callback("crawling",
percentage, message, step_info=step_info) and ensure any leftover **kwargs are
not expanded into the call.
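For completeness, a tiny runnable repro of the TypeError described above, using a hypothetical ui_callback as the consumer:

```python
import asyncio

async def ui_callback(status: str, percentage: int, message: str,
                      step_info: dict | None = None) -> None:
    print(status, percentage, message, step_info)

async def main() -> None:
    step_info = {"currentStep": "crawling", "stepMessage": "crawling"}
    # Expanded kwargs: fails, the callback has no 'currentStep' parameter.
    try:
        await ui_callback("crawling", 50, "msg", **step_info)
    except TypeError as e:
        print(f"expanded kwargs failed: {e}")
    # Single kwarg: matches the signature regardless of step_info contents.
    await ui_callback("crawling", 50, "msg", step_info=step_info)

asyncio.run(main())
```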
```python
current_progress = depth_start + int(
    (current_idx / len(urls_to_crawl)) * (depth_end - depth_start)
)
await report_progress(
    current_progress,
    f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
    totalPages=total_processed,
    processedPages=len(results_all),
)
```
Same progress callback bug here; align to step_info=...
Apply the same fix to avoid expanding unknown kwargs.
```diff
-    await report_progress(
-        current_progress,
-        f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
-        totalPages=total_processed,
-        processedPages=len(results_all),
-    )
+    await report_progress(
+        current_progress,
+        f"Depth {depth + 1}: processed {current_idx}/{len(urls_to_crawl)} URLs ({depth_successful} successful)",
+        totalPages=total_processed,
+        processedPages=len(results_all),
+    )
```

Note: The call site stays the same; the earlier change in report_progress will ensure these land under step_info.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In python/src/server/services/crawling/strategies/recursive.py around lines 233
to 241, the report_progress call is passing unknown kwargs (totalPages and
processedPages) which will be dropped; update the call to pass those values
inside the step_info dict (e.g., step_info={"totalPages": total_processed,
"processedPages": len(results_all)}) and keep the same message and
current_progress argument so the progress payload lands under step_info as the
other fixes did.
…ation Merge PR coleam00#395 - NATS infrastructure and A2UI bridge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…395) (#406)

* Investigate handler bugs #392-#395: commands JSON parsing, dotenv worktree, error messages

* Fix handler bugs: JSON parsing, dotenv worktree, error messages (#392-#395)

Four related fixes for handler/startup issues:
- Fix /commands showing character indices instead of command names when SQLite stores commands as JSON string (#392)
- Remove dotenv/config import, rely on Bun's built-in .env loading which searches up the directory tree — fixes worktree startup failures (#393)
- Add actionable guidance to no_ai_credentials fatal error (#394)
- Improve /setcwd rejection message with explanation and alternative (#395)

Fixes #392, Fixes #393, Fixes #394, Fixes #395

* Add try-catch for JSON.parse in getCodebaseCommands()

Prevents crash on malformed JSON in database. Logs the error and returns empty commands object as graceful fallback.
Pull Request
Summary
Fixes crawler timeout issue when crawling JavaScript-heavy documentation sites like Stripe.js docs. The crawler was waiting for a 'body' selector that already exists in the HTML DOM, causing unnecessary 30-second timeouts.
Changes Made
- Remove wait_for='body' selector from documentation site crawler configurations

Type of Change
Affected Services
Testing
Test Evidence
Checklist
Breaking Changes
None - This fix maintains backward compatibility while enabling crawling of sites that previously timed out.
Additional Notes
Root Cause
The wait_for='body' selector was problematic because:

- The body element exists immediately in the HTML DOM, so the wait added nothing for JavaScript-rendered content while still causing unnecessary 30-second timeouts

Solution
Now relies on:

- domcontentloaded event for DOM readiness
- delay_before_return_html (1.0s for doc sites) for JavaScript rendering time

This fix allows crawling of JavaScript-heavy documentation that previously failed while maintaining compatibility with all existing crawlable sites.
Summary by CodeRabbit
New Features
Bug Fixes
Chores