feat: add WaterCrawl integration with self-hosted Docker deployment #192
Conversation
- Add watercrawl-helper.sh with Docker setup, start/stop, Coolify deploy
- Add watercrawl.md documentation prioritizing self-hosted deployment
- Update browser-automation.md decision tree to include WaterCrawl
- Add WaterCrawl to subagent-index.toon

WaterCrawl provides:

- Smart crawling with depth/domain/path controls
- Web search engine integration
- Sitemap generation and analysis
- JavaScript rendering with screenshots
- AI-powered content processing (OpenAI)
- Extensible plugin system

Self-hosted deployment via Docker at ~/.aidevops/watercrawl/
Cloud API fallback at https://app.watercrawl.dev
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces WaterCrawl, a web crawling and data extraction tool, into the aidevops framework. The primary focus is on enabling self-hosted Docker deployments so users keep full control over their crawling operations, while also providing a cloud API option. The integration expands the framework's capabilities for LLM-ready data extraction, web search, and sitemap generation.

Highlights
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Sun Jan 25 02:28:42 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces integration for WaterCrawl, a web crawling tool. It adds a comprehensive helper script for managing self-hosted Docker deployments and interacting with the API, along with detailed documentation. The changes are well-structured, with a clear focus on a self-hosted-first approach. My review focuses on improving the robustness and maintainability of the new bash helper script. I've identified a bug where progress messages could corrupt JSON output and suggested a fix. I've also pointed out several areas where code repetition can be reduced to improve maintainability.
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result"
fi
```
The output of the node script includes a progress message on stderr ('Scraping URL...'). By redirecting stderr to stdout with 2>&1, this message is captured in the result variable along with the JSON output, which can corrupt the JSON and cause issues for downstream parsing. This should be filtered out.
Additionally, the logic can be refactored to avoid repeating the filtering command, which improves maintainability. This same refactoring pattern can be applied to the crawl_website, search_web, and generate_sitemap functions.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Scraping URL")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)"
fi
```
The grep -v command to filter out progress messages is repeated for both cases (with and without an output file). This code can be made more maintainable and less repetitive by filtering the result once, storing it in a variable, and then using that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^Searching" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result" | grep -v "^Searching"
fi
```
The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Searching")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$clean_result"
fi
```
```bash
if [[ -n "$output_file" ]]; then
    echo "$result" | grep -v "^Creating sitemap" > "$output_file"
    print_success "Sitemap saved to: $output_file"
else
    echo "$result" | grep -v "^Creating sitemap"
fi
```
The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.
Suggested change:

```bash
local clean_result
clean_result=$(echo "$result" | grep -v "^Creating sitemap")
if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "Sitemap saved to: $output_file"
else
    echo "$clean_result"
fi
```
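Since the same filter-and-save pattern recurs in `scrape`, `crawl_website`, `search_web`, and `generate_sitemap`, the repetition could also be collapsed into a single shared helper. The function below is a hypothetical sketch, not code from the PR; it assumes `print_success` is available (it is used in the existing snippets) and takes the progress-message pattern from each call site.

```bash
# Hypothetical shared helper: filter progress messages once, then save or print.
output_filtered_result() {
  local result="$1" filter_pattern="$2" output_file="$3" label="${4:-Results}"
  local clean_result
  clean_result=$(echo "$result" | grep -v "$filter_pattern")
  if [[ -n "$output_file" ]]; then
    echo "$clean_result" > "$output_file"
    print_success "$label saved to: $output_file"
  else
    echo "$clean_result"
  fi
}

# Example call from generate_sitemap (illustrative):
# output_filtered_result "$result" "^Creating sitemap" "$output_file" "Sitemap"
```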
🤖 Augment PR Summary

Summary: Adds WaterCrawl as a new browser automation / crawling option in aidevops, with a self-hosted-first Docker deployment path and cloud API fallback.

Changes:

Technical Notes: Stores API configuration in …
```bash
# Create Node.js script for scraping
local temp_script
temp_script=$(mktemp /tmp/watercrawl_scrape_XXXXXX.mjs)
```
The temp Node script is written under /tmp, so Node will resolve @watercrawl/nodejs relative to /tmp/node_modules; this will typically fail even after a global npm install -g unless module resolution is explicitly configured.
Other Locations:

- .agent/scripts/watercrawl-helper.sh:660
- .agent/scripts/watercrawl-helper.sh:772
- .agent/scripts/watercrawl-helper.sh:860
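One possible fix (a sketch, assuming the self-hosted install directory at `~/.aidevops/watercrawl/` contains a local `node_modules/` with `@watercrawl/nodejs` installed): create the temp `.mjs` file inside that directory instead of `/tmp`, since ESM import resolution walks up from the importing file's own location rather than honouring a global install.

```bash
# Assumes ~/.aidevops/watercrawl/ (the install path from the PR description)
# has @watercrawl/nodejs available in a node_modules/ directory on its path.
local watercrawl_dir="${WATERCRAWL_DIR:-$HOME/.aidevops/watercrawl}"
local temp_script
temp_script=$(mktemp "$watercrawl_dir/watercrawl_scrape_XXXXXX.mjs")
```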
```text
coolify-helper.sh,Coolify deployment management
stagehand-helper.sh,Browser automation with Stagehand
crawl4ai-helper.sh,Web crawling and extraction
watercrawl-helper.sh,WaterCrawl cloud API for web crawling and search
```
```text
|
+-> EXTRACT data (scraping, reading)?
|   |
|   +-> Need web search + crawl? --> WaterCrawl (cloud API with search)
```
These WaterCrawl references frame it primarily as a cloud/managed API, but this PR adds a self-hosted-first Docker path; readers may miss the recommended self-hosted option.
Other Locations:

- .agent/tools/browser/browser-automation.md:38
- .agent/tools/browser/browser-automation.md:135
- .agent/tools/browser/browser-automation.md:164
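A possible rewording (illustrative only, not a confirmed change in this PR) that keeps the self-hosted-first framing visible in both the index and the decision tree:

```text
# .agent/subagent-index.toon
watercrawl-helper.sh,WaterCrawl self-hosted (Docker) or cloud API for web crawling and search

# .agent/tools/browser/browser-automation.md
|   +-> Need web search + crawl? --> WaterCrawl (self-hosted Docker, cloud API fallback)
```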
- Add WaterCrawl to browser automation tools (7 tools now)
- Add LibPDF and Unstract to document processing section
- Add Cloudron app packaging enhancement note
- Add multi-tenant credential storage documentation
- Update MCP count to 19 (added Unstract)
- Update subagent count to 560+ and scripts to 146+
- Document MCP lazy-loading optimization (12-24s startup savings)
- Add WaterCrawl to tool selection guide

Based on PRs #178-#192
Summary
Changes
New Files
- .agent/scripts/watercrawl-helper.sh - Helper script with Docker setup, start/stop, API operations
- .agent/tools/browser/watercrawl.md - Full documentation with self-hosted-first approach

Modified Files

- .agent/tools/browser/browser-automation.md - Updated decision tree and feature matrix to include WaterCrawl
- .agent/subagent-index.toon - Added WaterCrawl to browser tools and scripts list

WaterCrawl Features
Self-Hosted Deployment
Installation path: ~/.aidevops/watercrawl/

vs Other Tools
Testing
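As a smoke test, an invocation flow along these lines could be used; the subcommand names below (`setup`, `start`, `scrape`) and the `--output` flag are illustrative guesses based on the PR description, not confirmed options of watercrawl-helper.sh.

```bash
# Hypothetical smoke test -- subcommand names and flags are assumptions.
.agent/scripts/watercrawl-helper.sh setup    # provision the Docker stack under ~/.aidevops/watercrawl/
.agent/scripts/watercrawl-helper.sh start    # bring up the self-hosted services
.agent/scripts/watercrawl-helper.sh scrape https://example.com --output /tmp/example.json
```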