feat: add WaterCrawl integration with self-hosted Docker deployment #192
Conversation
- Add watercrawl-helper.sh with Docker setup, start/stop, Coolify deploy
- Add watercrawl.md documentation prioritizing self-hosted deployment
- Update browser-automation.md decision tree to include WaterCrawl
- Add WaterCrawl to subagent-index.toon

WaterCrawl provides:

- Smart crawling with depth/domain/path controls
- Web search engine integration
- Sitemap generation and analysis
- JavaScript rendering with screenshots
- AI-powered content processing (OpenAI)
- Extensible plugin system

Self-hosted deployment via Docker at ~/.aidevops/watercrawl/
Cloud API fallback at https://app.watercrawl.dev
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces WaterCrawl, a web crawling and data extraction tool, into the aidevops framework. The primary focus is on enabling self-hosted Docker deployments so users keep full control over their crawling operations, while also providing a cloud API option. The integration expands the framework's capabilities for LLM-ready data extraction, web search, and sitemap generation.

Highlights
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Sun Jan 25 02:28:42 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces integration for WaterCrawl, a web crawling tool. It adds a comprehensive helper script for managing self-hosted Docker deployments and interacting with the API, along with detailed documentation. The changes are well-structured, with a clear focus on a self-hosted-first approach. My review focuses on improving the robustness and maintainability of the new bash helper script. I've identified a bug where progress messages could corrupt JSON output and suggested a fix. I've also pointed out several areas where code repetition can be reduced to improve maintainability.
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result"
fi
```
The output of the node script includes a progress message on stderr ('Scraping URL...'). By redirecting stderr to stdout with 2>&1, this message is captured in the result variable along with the JSON output, which can corrupt the JSON and cause issues for downstream parsing. This should be filtered out.
Additionally, the logic can be refactored to avoid repeating the filtering command, which improves maintainability. This same refactoring pattern can be applied to the crawl_website, search_web, and generate_sitemap functions.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Scraping URL")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)"
fi
```
The grep -v command to filter out progress messages is repeated for both cases (with and without an output file). This code can be made more maintainable and less repetitive by filtering the result once, storing it in a variable, and then using that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^Searching" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result" | grep -v "^Searching"
fi
```
The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Searching")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^Creating sitemap" > "$output_file"
    print_success "Sitemap saved to: $output_file"
else
    echo "$result" | grep -v "^Creating sitemap"
fi
```
The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Creating sitemap")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Sitemap saved to: $output_file"
else
    echo "$clean_result"
fi
```
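Since the same filter-and-save pattern recurs in `scrape`, `crawl_website`, `search_web`, and `generate_sitemap`, the repetition could also be collapsed into a single shared helper. The function below is a hypothetical sketch, not code from the PR; it assumes `print_success` is available (it is used in the existing snippets) and takes the progress-message pattern from each call site.

```bash
# Hypothetical shared helper: filter progress messages once, then save or print.
output_filtered_result() {
  local result="$1" filter_pattern="$2" output_file="$3" label="${4:-Results}"
  local clean_result
  clean_result=$(echo "$result" | grep -v "$filter_pattern")
  if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "$label saved to: $output_file"
  else
    echo "$clean_result"
  fi
}

# Example call from generate_sitemap (illustrative):
# output_filtered_result "$result" "^Creating sitemap" "$output_file" "Sitemap"
```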
🤖 Augment PR Summary

Summary: Adds WaterCrawl as a new browser automation / crawling option in aidevops, with a self-hosted-first Docker deployment path and cloud API fallback.

Changes:

Technical Notes: Stores API configuration in …
```bash
# Create Node.js script for scraping
local temp_script
temp_script=$(mktemp /tmp/watercrawl_scrape_XXXXXX.mjs)
```
The temp Node script is written under /tmp, so Node will resolve @watercrawl/nodejs relative to /tmp/node_modules; this will typically fail even after a global npm install -g unless module resolution is explicitly configured.
Other Locations:

- .agent/scripts/watercrawl-helper.sh:660
- .agent/scripts/watercrawl-helper.sh:772
- .agent/scripts/watercrawl-helper.sh:860
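One possible fix (a sketch, assuming the self-hosted install directory at `~/.aidevops/watercrawl/` contains a local `node_modules/` with `@watercrawl/nodejs` installed): create the temp `.mjs` file inside that directory instead of `/tmp`, since ESM import resolution walks up from the importing file's own location rather than honouring a global install.

```bash
# Assumes ~/.aidevops/watercrawl/ (the install path from the PR description)
# has @watercrawl/nodejs available in a node_modules/ directory on its path.
local watercrawl_dir="${WATERCRAWL_DIR:-$HOME/.aidevops/watercrawl}"
local temp_script
temp_script=$(mktemp "$watercrawl_dir/watercrawl_scrape_XXXXXX.mjs")
```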
```text
coolify-helper.sh,Coolify deployment management
stagehand-helper.sh,Browser automation with Stagehand
crawl4ai-helper.sh,Web crawling and extraction
watercrawl-helper.sh,WaterCrawl cloud API for web crawling and search
```
```text
|
+-> EXTRACT data (scraping, reading)?
|   |
|   +-> Need web search + crawl? --> WaterCrawl (cloud API with search)
```
These WaterCrawl references frame it primarily as a cloud/managed API, but this PR adds a self-hosted-first Docker path; readers may miss the recommended self-hosted option.
Other Locations:

- .agent/tools/browser/browser-automation.md:38
- .agent/tools/browser/browser-automation.md:135
- .agent/tools/browser/browser-automation.md:164
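A possible rewording (illustrative only, not a confirmed change in this PR) that keeps the self-hosted-first framing visible in both the index and the decision tree:

```text
# .agent/subagent-index.toon
watercrawl-helper.sh,WaterCrawl self-hosted (Docker) or cloud API for web crawling and search

# .agent/tools/browser/browser-automation.md
|   +-> Need web search + crawl? --> WaterCrawl (self-hosted Docker, cloud API fallback)
```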
- Add WaterCrawl to browser automation tools (7 tools now)
- Add LibPDF and Unstract to document processing section
- Add Cloudron app packaging enhancement note
- Add multi-tenant credential storage documentation
- Update MCP count to 19 (added Unstract)
- Update subagent count to 560+ and scripts to 146+
- Document MCP lazy-loading optimization (12-24s startup savings)
- Add WaterCrawl to tool selection guide

Based on PRs #178-#192
Summary
Changes
New Files
- .agent/scripts/watercrawl-helper.sh - Helper script with Docker setup, start/stop, API operations
- .agent/tools/browser/watercrawl.md - Full documentation with self-hosted-first approach

Modified Files

- .agent/tools/browser/browser-automation.md - Updated decision tree and feature matrix to include WaterCrawl
- .agent/subagent-index.toon - Added WaterCrawl to browser tools and scripts list

WaterCrawl Features
Self-Hosted Deployment
Installation path: ~/.aidevops/watercrawl/

vs Other Tools
Testing
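As a smoke test, an invocation flow along these lines could be used; the subcommand names below (`setup`, `start`, `scrape`) and the `--output` flag are illustrative guesses based on the PR description, not confirmed options of watercrawl-helper.sh.

```bash
# Hypothetical smoke test -- subcommand names and flags are assumptions.
.agent/scripts/watercrawl-helper.sh setup    # provision the Docker stack under ~/.aidevops/watercrawl/
.agent/scripts/watercrawl-helper.sh start    # bring up the self-hosted services
.agent/scripts/watercrawl-helper.sh scrape https://example.com --output /tmp/example.json
```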