
Conversation

@marcusquinn
Owner

Summary

  • Add WaterCrawl as a new browser automation/crawling tool for aidevops
  • Prioritize self-hosted Docker deployment over cloud API
  • Include Coolify deployment instructions for VPS hosting

Changes

New Files

  • .agent/scripts/watercrawl-helper.sh - Helper script with Docker setup, start/stop, API operations
  • .agent/tools/browser/watercrawl.md - Full documentation with self-hosted first approach

Modified Files

  • .agent/tools/browser/browser-automation.md - Updated decision tree and feature matrix to include WaterCrawl
  • .agent/subagent-index.toon - Added WaterCrawl to browser tools and scripts list

WaterCrawl Features

  • Smart crawling with depth/domain/path controls
  • Web search engine integration (real-time web search)
  • Sitemap generation and analysis
  • JavaScript rendering with screenshots
  • AI-powered content processing (OpenAI integration)
  • Extensible plugin system
  • Proxy support (datacenter + residential)
  • Full web dashboard for team management

Self-Hosted Deployment

# Quick start
bash .agent/scripts/watercrawl-helper.sh docker-setup
bash .agent/scripts/watercrawl-helper.sh docker-start
bash .agent/scripts/watercrawl-helper.sh docker-admin
# Access at http://localhost

Installation path: ~/.aidevops/watercrawl/
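Once the stack is running, the same helper wraps the API operations (scrape, crawl, search, sitemap). A sketch of typical invocations; the subcommand names follow the helper, but the exact flags shown are illustrative:

# Illustrative usage -- flag names may differ from the actual helper
bash .agent/scripts/watercrawl-helper.sh scrape https://example.com --output page.json
bash .agent/scripts/watercrawl-helper.sh crawl https://example.com
bash .agent/scripts/watercrawl-helper.sh search "self-hosted web crawling"
bash .agent/scripts/watercrawl-helper.sh sitemap https://example.com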

vs Other Tools

| Feature | WaterCrawl | Crawl4AI |
|---|---|---|
| Self-hosted | Yes (Docker) | Yes (Docker/pip) |
| Web Search | Yes | No |
| CAPTCHA Solving | No | Yes (CapSolver) |
| Web Dashboard | Yes | Yes |
| Open Source | Yes | Yes |

Testing

  • ShellCheck passes on watercrawl-helper.sh
  • Documentation follows existing patterns
  • Decision tree updated in browser-automation.md

- Add watercrawl-helper.sh with Docker setup, start/stop, Coolify deploy
- Add watercrawl.md documentation prioritizing self-hosted deployment
- Update browser-automation.md decision tree to include WaterCrawl
- Add WaterCrawl to subagent-index.toon

WaterCrawl provides:
- Smart crawling with depth/domain/path controls
- Web search engine integration
- Sitemap generation and analysis
- JavaScript rendering with screenshots
- AI-powered content processing (OpenAI)
- Extensible plugin system

Self-hosted deployment via Docker at ~/.aidevops/watercrawl/
Cloud API fallback at https://app.watercrawl.dev
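Cloud configuration lives in the shared MCP env file; a minimal sketch with placeholder values (the variable names match those used by the helper):

# ~/.config/aidevops/mcp-env.sh -- placeholder values
export WATERCRAWL_API_URL="https://app.watercrawl.dev"   # or http://localhost for self-hosted
export WATERCRAWL_API_KEY="<your-api-key>"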
@coderabbitai
Contributor

coderabbitai bot commented Jan 25, 2026

Warning

Rate limit exceeded

@marcusquinn has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.




@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces WaterCrawl, a powerful web crawling and data extraction tool, into the aidevops framework. The primary focus is on enabling self-hosted Docker deployments for users to gain full control over their crawling operations, while also providing options for cloud API usage. This integration expands the framework's capabilities for LLM-ready data extraction, web search, and sitemap generation, offering a robust solution for various web automation needs.

Highlights

  • WaterCrawl Integration: A new web crawling framework, WaterCrawl, has been integrated, offering smart crawling, real-time web search, sitemap generation, and AI-powered content processing capabilities.
  • Self-Hosted Docker Deployment Priority: The integration prioritizes self-hosted Docker deployment for WaterCrawl, providing a dedicated helper script (watercrawl-helper.sh) for easy setup, start/stop operations, and administration, emphasizing user control and cost efficiency.
  • Comprehensive Documentation: New, detailed documentation (watercrawl.md) has been added, covering WaterCrawl's features, quick start guides for both self-hosted and cloud API, SDK usage examples (Node.js, Python), API options, proxy integration, and comparisons with other tools.
  • Updated Browser Automation Tools: Existing browser automation documentation (browser-automation.md) and the subagent index (subagent-index.toon) have been updated to include WaterCrawl, reflecting its capabilities and use cases within the broader framework of web automation tools.


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 479 code smells

[INFO] Recent monitoring activity:
Sun Jan 25 02:27:16 UTC 2026: Code review monitoring started
Sun Jan 25 02:27:17 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 479
Sun Jan 25 02:27:17 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Sun Jan 25 02:27:19 UTC 2026: Codacy analysis completed with auto-fixes

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 479
  • VULNERABILITIES: 0

Generated on: Sun Jan 25 02:28:42 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces integration for WaterCrawl, a web crawling tool. It adds a comprehensive helper script for managing self-hosted Docker deployments and interacting with the API, along with detailed documentation. The changes are well-structured, with a clear focus on a self-hosted-first approach. My review focuses on improving the robustness and maintainability of the new bash helper script. I've identified a bug where progress messages could corrupt JSON output and suggested a fix. I've also pointed out several areas where code repetition can be reduced to improve maintainability.

Comment on lines +618 to +623
if [[ -n "$output_file" ]]; then
echo "$result" > "$output_file"
print_success "Results saved to: $output_file"
else
echo "$result"
fi


Severity: high

The output of the node script includes a progress message on stderr ('Scraping URL...'). By redirecting stderr to stdout with 2>&1, this message is captured in the result variable along with the JSON output, which can corrupt the JSON and cause issues for downstream parsing. This should be filtered out.

Additionally, the logic can be refactored to avoid repeating the filtering command, which improves maintainability. This same refactoring pattern can be applied to the crawl_website, search_web, and generate_sitemap functions.

Suggested change
-if [[ -n "$output_file" ]]; then
-echo "$result" > "$output_file"
-print_success "Results saved to: $output_file"
-else
-echo "$result"
-fi
+local clean_result
+clean_result=$(echo "$result" | grep -v "^Scraping URL")
+if [[ -n "$output_file" ]]; then
+echo "$clean_result" > "$output_file"
+print_success "Results saved to: $output_file"
+else
+echo "$clean_result"
+fi
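An alternative sketch that avoids the problem at its source is to stop merging stderr into the captured output, so progress messages never reach the JSON in the first place (variable names mirror the snippet above; this is not the helper's actual code):

# Sketch: capture stdout only; stderr progress messages go straight to the terminal
local result
if ! result=$(WATERCRAWL_API_KEY="$WATERCRAWL_API_KEY" WATERCRAWL_API_URL="$WATERCRAWL_API_URL" node "$temp_script" "$url"); then
    echo "Scrape failed" >&2
    return 1
fi
if [[ -n "$output_file" ]]; then
    echo "$result" > "$output_file"
    print_success "Results saved to: $output_file"
else
    echo "$result"
fi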

Comment on lines +730 to +735
if [[ -n "$output_file" ]]; then
echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)" > "$output_file"
print_success "Results saved to: $output_file"
else
echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)"
fi


Severity: medium

The grep -v command to filter out progress messages is repeated for both cases (with and without an output file). This code can be made more maintainable and less repetitive by filtering the result once, storing it in a variable, and then using that variable.

Suggested change
-if [[ -n "$output_file" ]]; then
-echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)" > "$output_file"
-print_success "Results saved to: $output_file"
-else
-echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)"
-fi
+local clean_result
+clean_result=$(echo "$result" | grep -v "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)")
+if [[ -n "$output_file" ]]; then
+echo "$clean_result" > "$output_file"
+print_success "Results saved to: $output_file"
+else
+echo "$clean_result"
+fi

Comment on lines +818 to +823
if [[ -n "$output_file" ]]; then
echo "$result" | grep -v "^Searching" > "$output_file"
print_success "Results saved to: $output_file"
else
echo "$result" | grep -v "^Searching"
fi


Severity: medium

The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.

Suggested change
-if [[ -n "$output_file" ]]; then
-echo "$result" | grep -v "^Searching" > "$output_file"
-print_success "Results saved to: $output_file"
-else
-echo "$result" | grep -v "^Searching"
-fi
+local clean_result
+clean_result=$(echo "$result" | grep -v "^Searching")
+if [[ -n "$output_file" ]]; then
+echo "$clean_result" > "$output_file"
+print_success "Results saved to: $output_file"
+else
+echo "$clean_result"
+fi

Comment on lines +920 to +925
if [[ -n "$output_file" ]]; then
echo "$result" | grep -v "^Creating sitemap" > "$output_file"
print_success "Sitemap saved to: $output_file"
else
echo "$result" | grep -v "^Creating sitemap"
fi


Severity: medium

The grep -v command to filter out progress messages is repeated. To improve maintainability and avoid repetition, it's better to filter the output once, store it in a variable, and then use that variable.

Suggested change
-if [[ -n "$output_file" ]]; then
-echo "$result" | grep -v "^Creating sitemap" > "$output_file"
-print_success "Sitemap saved to: $output_file"
-else
-echo "$result" | grep -v "^Creating sitemap"
-fi
+local clean_result
+clean_result=$(echo "$result" | grep -v "^Creating sitemap")
+if [[ -n "$output_file" ]]; then
+echo "$clean_result" > "$output_file"
+print_success "Sitemap saved to: $output_file"
+else
+echo "$clean_result"
+fi
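Since all four call sites share the same write-or-print logic, one possible refactor is a small shared function, sketched below (emit_result is an illustrative name, not part of the PR):

# Sketch: filter progress lines once, then write or print the cleaned result
emit_result() {
    local result="$1" filter="$2" output_file="$3" label="${4:-Results}"
    local clean_result
    clean_result=$(echo "$result" | grep -v "$filter")
    if [[ -n "$output_file" ]]; then
        echo "$clean_result" > "$output_file"
        print_success "$label saved to: $output_file"
    else
        echo "$clean_result"
    fi
}

# Example call from the crawl path:
# emit_result "$result" "^\(Status:\|Crawled:\|Creating\|Crawl started\|Monitoring\)" "$output_file"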

@marcusquinn marcusquinn merged commit 2e43226 into main Jan 25, 2026
17 checks passed
@augmentcode

augmentcode bot commented Jan 25, 2026

🤖 Augment PR Summary

Summary: Adds WaterCrawl as a new browser automation / crawling option in aidevops, with a self-hosted-first Docker deployment path and cloud API fallback.

Changes:

  • Added .agent/scripts/watercrawl-helper.sh to set up/run a self-hosted WaterCrawl stack (Docker/Coolify) and to invoke scrape/crawl/search/sitemap via the SDK.
  • Added .agent/tools/browser/watercrawl.md integration guide covering self-hosted + cloud usage, endpoints, env vars, and examples.
  • Updated .agent/tools/browser/browser-automation.md decision tree + feature matrix to include WaterCrawl (web search, sitemap generation).
  • Registered WaterCrawl in .agent/subagent-index.toon for discovery.

Technical Notes: Stores API configuration in ~/.config/aidevops/mcp-env.sh and installs the self-hosted WaterCrawl repo into ~/.aidevops/watercrawl/.



@augmentcode augmentcode bot left a comment


Review completed. 6 suggestions posted.


Comment augment review to trigger a new review at any time.


main "$@"

exit 0


exit 0 forces a success exit status even when a subcommand fails (and main also unconditionally returns 0), which can make CI/automation treat failures as successes.


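The usual fix is to let the script exit with main's status instead of a hard-coded 0; a minimal sketch (this also assumes main is changed to return its subcommand's status rather than always 0):

main "$@"
exit $?    # or drop the explicit exit entirely so the script exits with main's status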


# Create Node.js script for scraping
local temp_script
temp_script=$(mktemp /tmp/watercrawl_scrape_XXXXXX.mjs)


The temp Node script is written under /tmp, so Node will resolve @watercrawl/nodejs relative to /tmp/node_modules; this will typically fail even after a global npm install -g unless module resolution is explicitly configured.

Other Locations
  • .agent/scripts/watercrawl-helper.sh:660
  • .agent/scripts/watercrawl-helper.sh:772
  • .agent/scripts/watercrawl-helper.sh:860


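One possible workaround, sketched under the assumption that a small Node workspace under the install path is acceptable: install the SDK there once and run the temp script from that directory so ESM resolution can find node_modules (directory layout and install step are illustrative, not part of the PR):

# Sketch only -- workspace location and install step are assumptions
local work_dir="$HOME/.aidevops/watercrawl/sdk"
mkdir -p "$work_dir"
if [[ ! -d "$work_dir/node_modules/@watercrawl/nodejs" ]]; then
    (cd "$work_dir" && npm install --no-save @watercrawl/nodejs >/dev/null)
fi
local temp_script
temp_script=$(mktemp "$work_dir/watercrawl_scrape_XXXXXX.mjs")
# ...write the Node script as before, then run it with the workspace as the working directory:
(cd "$work_dir" && node "$temp_script" "$url")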

SCRIPT

local result
if result=$(WATERCRAWL_API_KEY="$WATERCRAWL_API_KEY" WATERCRAWL_API_URL="$WATERCRAWL_API_URL" node "$temp_script" "$url" 2>&1); then


Capturing node ... 2>&1 mixes progress logs (stderr) with the JSON payload (stdout), so scrape output (and saved files) may not be valid JSON and can break piping/consumers.


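A sketch of one way to keep stdout as pure JSON while still surfacing errors on failure (temp-file handling is illustrative):

# Sketch: route stderr to a temp file so the captured result stays valid JSON
local stderr_file
stderr_file=$(mktemp)
if result=$(WATERCRAWL_API_KEY="$WATERCRAWL_API_KEY" WATERCRAWL_API_URL="$WATERCRAWL_API_URL" node "$temp_script" "$url" 2>"$stderr_file"); then
    rm -f "$stderr_file"
else
    cat "$stderr_file" >&2    # surface the underlying error on failure
    rm -f "$stderr_file"
    return 1
fi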

mkdir -p "$(dirname "$MCP_ENV_FILE")"

# Check if file exists and has the key
if [[ -f "$MCP_ENV_FILE" ]]; then


When updating/appending to an existing $MCP_ENV_FILE, the script doesn’t enforce chmod 600, so API keys/URLs could remain readable by other users if the file already had broader permissions.

Other Locations
  • .agent/scripts/watercrawl-helper.sh:463


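A small sketch of enforcing owner-only permissions on every write path (not the PR's exact code):

# Sketch: assert 600 before and after updating the env file
mkdir -p "$(dirname "$MCP_ENV_FILE")"
touch "$MCP_ENV_FILE"
chmod 600 "$MCP_ENV_FILE"
# ...append or update WATERCRAWL_* entries here...
chmod 600 "$MCP_ENV_FILE"    # re-assert in case the file pre-existed with looser permissions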

coolify-helper.sh,Coolify deployment management
stagehand-helper.sh,Browser automation with Stagehand
crawl4ai-helper.sh,Web crawling and extraction
watercrawl-helper.sh,WaterCrawl cloud API for web crawling and search


This entry describes WaterCrawl as “cloud API” only, but the new helper script/docs are self-hosted-first (Docker/Coolify); this may mislead users browsing the index.



|
+-> EXTRACT data (scraping, reading)?
| |
| +-> Need web search + crawl? --> WaterCrawl (cloud API with search)


These WaterCrawl references frame it primarily as a cloud/managed API, but this PR adds a self-hosted-first Docker path; readers may miss the recommended self-hosted option.

Other Locations
  • .agent/tools/browser/browser-automation.md:38
  • .agent/tools/browser/browser-automation.md:135
  • .agent/tools/browser/browser-automation.md:164



marcusquinn added a commit that referenced this pull request Jan 25, 2026
- Add WaterCrawl to browser automation tools (7 tools now)
- Add LibPDF and Unstract to document processing section
- Add Cloudron app packaging enhancement note
- Add multi-tenant credential storage documentation
- Update MCP count to 19 (added Unstract)
- Update subagent count to 560+ and scripts to 146+
- Document MCP lazy-loading optimization (12-24s startup savings)
- Add WaterCrawl to tool selection guide

Based on PRs #178-#192