diff --git a/.agent/tools/browser/browser-automation.md b/.agent/tools/browser/browser-automation.md index 0481ad7b0..7b83e9b20 100644 --- a/.agent/tools/browser/browser-automation.md +++ b/.agent/tools/browser/browser-automation.md @@ -93,6 +93,10 @@ What do you need? | +-> Proxy per profile / geo-targeting? | --> proxy-integration.md (residential, SOCKS5, rotation) | + +-> EXPERIMENTAL (agent-native browser, VLM vision)? + | --> neural-chromium.md (Chromium fork, semantic DOM, gRPC, Windows-only) + | --> Note: Early stage, requires building Chromium from source + | +-> TEST your own app (dev server)? | +-> Need to stay logged in across restarts? --> dev-browser (profile) @@ -174,6 +178,7 @@ Tested 2026-01-24, macOS ARM64 (Apple Silicon), headless, warm daemon. Median of | **Playwriter** | Existing browser, extensions, bypass detection | Medium | Chrome extension + `npx playwriter` | | **Stagehand** | Unknown pages, natural language, self-healing | Slow | `stagehand-helper.sh setup` + API key | | **Anti-detect** | Bot evasion, multi-account, fingerprint rotation | Medium | `anti-detect-helper.sh setup` | +| **Neural-Chromium** | Semantic DOM, VLM vision, stealth (experimental) | Medium | Build from source (Windows) | ## AI Page Understanding (Visual Verification) diff --git a/.agent/tools/browser/neural-chromium.md b/.agent/tools/browser/neural-chromium.md new file mode 100644 index 000000000..3812569a0 --- /dev/null +++ b/.agent/tools/browser/neural-chromium.md @@ -0,0 +1,265 @@ +--- +description: Neural-Chromium - agent-native Chromium fork with semantic DOM, gRPC, and VLM vision +mode: subagent +tools: + read: true + write: false + edit: false + bash: true + glob: true + grep: true + webfetch: true + task: true +--- + +# Neural-Chromium - Agent-Native Browser Runtime + + + +## Quick Reference + +- **Purpose**: Chromium fork designed for AI agents with direct browser state access +- **GitHub**: https://github.com/mcpmessenger/neural-chromium +- **License**: BSD-3-Clause (same as Chromium) +- **Languages**: C++ (81%), Python (17%) +- **Status**: Experimental (Phase 3 complete, Windows-only builds currently) +- **Stars**: 4 (early stage project) + +**Key Differentiators**: + +- **Shared memory + gRPC** for direct browser state access (no CDP/WebSocket overhead) +- **Semantic DOM understanding** via accessibility tree (roles, names, not CSS selectors) +- **VLM-powered vision** via Llama 3.2 Vision (Ollama) for visual reasoning +- **Stealth capabilities** - native event dispatch, no `navigator.webdriver` flag +- **Deep iframe access** - cross-origin frame traversal without context switching + +**When to Use**: + +- Experimental agent automation requiring semantic element targeting +- CAPTCHA solving research (VLM-based, experimental) +- Dynamic SPA interaction where CSS selectors break frequently +- Privacy-first automation (local VLM, no cloud dependency) + +**When NOT to Use** (prefer established tools): + +- Production workloads (project is early stage, Windows-only) +- Cross-platform needs (Linux/Mac builds not yet available) +- Quick automation tasks (Playwright is faster and mature) +- Bulk extraction (Crawl4AI is purpose-built) + +**Maturity Warning**: Neural-Chromium is an experimental project with 4 stars and 22 commits. It requires building Chromium from source (~4 hours). For production use, prefer Playwright, agent-browser, or dev-browser. + + + +## Architecture + +Neural-Chromium modifies Chromium's rendering pipeline to expose internal state directly to AI agents: + +```text +AI Agent (Python) + │ + ├── gRPC Client ──────────────────┐ + │ │ + │ Chromium Process │ + │ ├── Blink Renderer │ + │ │ └── NeuralPageHandler │ ← Blink supplement pattern + │ │ ├── DOM Traversal │ + │ │ ├── Accessibility Tree │ + │ │ └── Layout Info │ + │ │ │ + │ ├── Viz (Compositor) │ + │ │ └── Shared Memory ─────────┤ ← Zero-copy viewport capture + │ │ │ + │ └── In-Process gRPC Server ────┘ + │ + └── VLM (Ollama) ← Llama 3.2 Vision for visual reasoning +``` + +### Key Components + +| Component | Purpose | +|-----------|---------| +| **Visual Cortex** | Zero-copy access to rendering pipeline, 60+ FPS frame processing | +| **High-Precision Action** | Coordinate transformation for mapping agent actions to browser events | +| **Deep State Awareness** | Direct DOM access, 800+ node traversal with parent-child relationships | +| **Local Intelligence** | Llama 3.2 Vision via Ollama for privacy-first visual decision-making | + +## Installation + +### Prerequisites + +- **Windows** (Linux/Mac support planned) +- **Python 3.10+** +- **Ollama** (for VLM features) +- **16GB RAM** (for full Chromium build) +- **depot_tools** (Chromium build toolchain) + +### Build from Source + +```bash +# Set up depot_tools +git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git +export PATH="/path/to/depot_tools:$PATH" + +# Clone Neural-Chromium +git clone https://github.com/mcpmessenger/neural-chromium.git +cd neural-chromium + +# Sync and build (~4 hours on first run) +cd src +gclient sync +gn gen out/Default +ninja -C out/Default chrome +``` + +### Install VLM (Optional) + +```bash +# Install Ollama +curl -fsSL https://ollama.com/install.sh | sh + +# Pull vision model +ollama pull llama3.2-vision +``` + +## Usage + +### Start the Runtime + +```bash +# Terminal 1: Start Neural-Chromium with remote debugging +out/Default/chrome.exe --remote-debugging-port=9222 + +# Terminal 2: Start gRPC agent server +python src/glazyr/nexus_agent.py + +# Terminal 3: Run automation scripts +python src/demo_saucedemo_login.py +``` + +### Python API + +```python +from nexus_scenarios import AgentClient, AgentAction +import action_pb2 + +client = AgentClient() +client.navigate("https://www.saucedemo.com") + +# Observe page state (semantic DOM snapshot) +state = client.observe() + +# Find elements by semantic role (not CSS selectors) +user_field = find(state, role="textbox", name="Username") +pass_field = find(state, role="textbox", name="Password") +login_btn = find(state, role="button", name="Login") + +# Type into fields by element ID +client.act(AgentAction(type=action_pb2.TypeAction( + element_id=user_field.id, text="standard_user" +))) +client.act(AgentAction(type=action_pb2.TypeAction( + element_id=pass_field.id, text="secret_sauce" +))) + +# Click by element ID (no coordinates needed) +client.act(AgentAction(click=action_pb2.ClickAction( + element_id=login_btn.id +))) +``` + +### Core Actions + +| Action | Method | Description | +|--------|--------|-------------| +| **observe()** | `client.observe()` | Full DOM + accessibility tree snapshot | +| **click(id)** | `AgentAction(click=ClickAction(element_id=id))` | Direct event dispatch by element ID | +| **type(id, text)** | `AgentAction(type=TypeAction(element_id=id, text=text))` | Input injection by element ID | +| **navigate(url)** | `client.navigate(url)` | Navigate to URL | + +### VLM CAPTCHA Solving (Experimental) + +```bash +# Requires Ollama with llama3.2-vision +python src/vlm_captcha_solve.py +``` + +The VLM solver captures viewport via shared memory, sends to Llama 3.2 Vision, and receives structured predictions (JSON tile indices with confidence scores). + +## Performance Benchmarks + +From the project's own benchmarks (10 runs per task, 120s timeout): + +| Task | Neural-Chromium | Playwright | Notes | +|------|----------------|------------|-------| +| **Interaction latency** | 1.32s | ~0.5s | NC trades speed for semantic robustness | +| **Auth + data extraction** | 2.3s (100%) | 1.1s (90%) | NC uses semantic selectors | +| **Dynamic SPA (TodoMVC)** | 9.4s (100%) | 3.2s (60%) | NC handles async DOM reliably | +| **Multi-step form** | 4.1s (100%) | 2.8s (95%) | NC uses native event dispatch | +| **CAPTCHA solving** | ~50s (experimental) | N/A (blocked) | VLM-based, contingent on model | + +**Key trade-off**: Neural-Chromium is slower in raw latency but claims higher reliability for dynamic SPAs and sites that break CSS selectors frequently. + +## Comparison with Existing Tools + +| Feature | Neural-Chromium | Playwright | agent-browser | Stagehand | +|---------|----------------|------------|---------------|-----------| +| **Interface** | Python + gRPC | JS/TS API | CLI (Rust) | JS/Python SDK | +| **Element targeting** | Semantic (role/name) | CSS/XPath | Refs from snapshot | Natural language | +| **Browser engine** | Custom Chromium fork | Bundled Chromium | Bundled Chromium | Bundled Chromium | +| **Stealth** | Native (no webdriver) | Detectable | Detectable | Detectable | +| **VLM vision** | Built-in (Ollama) | No | No | No | +| **CAPTCHA handling** | Experimental (VLM) | Blocked | Blocked | Blocked | +| **Iframe access** | Deep traversal | Context switching | Context switching | Context switching | +| **Platform** | Windows only | Cross-platform | Cross-platform | Cross-platform | +| **Maturity** | Experimental | Production | Production | Production | +| **Setup complexity** | Build Chromium (~4h) | `npm install` | `npm install` | `npm install` | + +## Roadmap + +### Phase 4: Production Hardening (Next) + +- Delta updates (only changed DOM nodes, target <500ms latency) +- Push-based events (replace polling with `wait_for_signal`) +- Shadow DOM piercing for modern SPAs +- Multi-tab support for parallel agent execution +- Linux/Mac builds + +### Phase 5: Advanced Vision + +- OCR integration for text extraction from images +- Visual grounding (click coordinates from natural language) +- Screen diffing for visual change detection + +### Phase 6: Ecosystem + +- Python SDK (`neural_chromium.Agent()`) +- Docker images for containerized runtime +- Kubernetes operator for cloud deployment + +## Repository Structure + +```text +neural-chromium/ +├── src/ +│ ├── glazyr/ +│ │ ├── nexus_agent.py # gRPC server + VisualCortex +│ │ ├── proto/ # Protocol Buffer definitions +│ │ └── neural_page_handler.* # Blink C++ integration +│ ├── nexus_scenarios.py # High-level agent client +│ ├── vlm_solver.py # Llama Vision integration +│ └── demo_*.py # Example flows +├── docs/ +│ └── NEURAL_CHROMIUM_ARCHITECTURE.md +├── deployment/ # Docker/deployment configs +├── tests/ # Test suite +└── Makefile # Build and benchmark commands +``` + +## Resources + +- **GitHub**: https://github.com/mcpmessenger/neural-chromium +- **Live Demo**: https://neuralchrom-dtcvjx99.manus.space +- **Demo Video**: https://youtube.com/shorts/8nOlID7izjQ +- **Twitter**: https://x.com/MCPMessenger +- **License**: BSD-3-Clause diff --git a/TODO.md b/TODO.md index 1ce2bf6db..1ed652d86 100644 --- a/TODO.md +++ b/TODO.md @@ -201,7 +201,7 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - Notes: MCP server for iOS simulator interaction (1.5k stars, MIT). Featured in Anthropic's Claude Code Best Practices. Tools: tap, swipe, type, screenshot, record_video, describe-ui (accessibility), install_app, launch_app. Install: `npx -y ios-simulator-mcp`. Requires macOS, Xcode, Facebook IDB. Enables AI-assisted QA testing - verify UI elements, confirm text input, validate gestures. Complements XcodeBuildMCP (build) and Maestro (E2E flows). Add to tools/mobile/ or tools/testing/. - [ ] t098 Add Playwright device emulation subagent #tools #browser #testing #mobile ~30m (ai:20m test:5m read:5m) logged:2026-01-30 related:t096 ref:https://playwright.dev/docs/emulation - Notes: Document Playwright's device emulation capabilities for mobile/tablet testing. Features: device registry (iPhone, iPad, Pixel, Galaxy), viewport/screen size, userAgent, touch events, geolocation, locale/timezone, permissions, colorScheme, offline mode. Config via playwright.config.ts or per-test. Complements native mobile testing (Maestro, iOS Simulator MCP) for web-based mobile testing. Add to tools/browser/playwright-emulation.md or extend existing playwright.md. -- [ ] t099 Add Neural-Chromium for agent-native browser automation #tools #browser #ai #automation ~2h (ai:1.5h test:20m read:10m) logged:2026-01-30 ref:https://github.com/mcpmessenger/neural-chromium +- [ ] t099 Add Neural-Chromium for agent-native browser automation #tools #browser #ai #automation ~2h (ai:1.5h test:20m read:10m) logged:2026-01-30 started:2026-02-05T00:00Z ref:https://github.com/mcpmessenger/neural-chromium - Notes: Neural-Chromium (BSD-3) - Chromium fork designed for AI agents. Features: 1.3s interaction latency (4.7x faster than Playwright), semantic DOM understanding via accessibility tree, VLM-powered vision (Llama 3.2 via Ollama), stealth capabilities (no navigator.webdriver), deep iframe access. Uses shared memory + gRPC for direct browser state access. Tools: click(element_id), type(element_id, text), observe() for DOM snapshots. Early stage but promising for agent automation. Evaluate for: CAPTCHA solving, dynamic SPA interaction, form filling. Add to tools/browser/ as experimental option. - [ ] t100 Add AXe CLI for iOS simulator accessibility automation #tools #ios #testing #accessibility ~45m (ai:30m test:10m read:5m) logged:2026-01-30 related:t095,t097 ref:https://github.com/cameroncooke/AXe - Notes: AXe (1.1k stars, MIT) - CLI tool for iOS Simulator automation using Apple's Accessibility APIs and HID. By same author as XcodeBuildMCP. Features: tap (coordinates or accessibility ID/label), swipe, type, hardware buttons (home, lock, siri), gesture presets (scroll-up/down, edge swipes), screenshot, video recording/streaming, describe-ui (accessibility tree). Install: `brew install cameroncooke/axe/axe`. Single binary, no server required. Timing controls (pre/post delays). Complements XcodeBuildMCP for build+test workflow. Add to tools/mobile/ or tools/testing/.