5 changes: 5 additions & 0 deletions .agent/tools/browser/browser-automation.md
@@ -93,6 +93,10 @@ What do you need?
| +-> Proxy per profile / geo-targeting?
| --> proxy-integration.md (residential, SOCKS5, rotation)
|
+-> EXPERIMENTAL (agent-native browser, VLM vision)?
| --> neural-chromium.md (Chromium fork, semantic DOM, gRPC, Windows-only)
| --> Note: Early stage, requires building Chromium from source
|
+-> TEST your own app (dev server)?
|
+-> Need to stay logged in across restarts? --> dev-browser (profile)
@@ -174,6 +178,7 @@ Tested 2026-01-24, macOS ARM64 (Apple Silicon), headless, warm daemon. Median of
| **Playwriter** | Existing browser, extensions, bypass detection | Medium | Chrome extension + `npx playwriter` |
| **Stagehand** | Unknown pages, natural language, self-healing | Slow | `stagehand-helper.sh setup` + API key |
| **Anti-detect** | Bot evasion, multi-account, fingerprint rotation | Medium | `anti-detect-helper.sh setup` |
| **Neural-Chromium** | Semantic DOM, VLM vision, stealth (experimental) | Medium | Build from source (Windows) |

## AI Page Understanding (Visual Verification)

265 changes: 265 additions & 0 deletions .agent/tools/browser/neural-chromium.md
@@ -0,0 +1,265 @@
---
description: Neural-Chromium - agent-native Chromium fork with semantic DOM, gRPC, and VLM vision
mode: subagent
tools:
read: true
write: false
edit: false
bash: true
glob: true
grep: true
webfetch: true
task: true
---

# Neural-Chromium - Agent-Native Browser Runtime

<!-- AI-CONTEXT-START -->

## Quick Reference

- **Purpose**: Chromium fork designed for AI agents with direct browser state access
- **GitHub**: https://github.com/mcpmessenger/neural-chromium
- **License**: BSD-3-Clause (same as Chromium)
- **Languages**: C++ (81%), Python (17%)
- **Status**: Experimental (Phase 3 complete, Windows-only builds currently)
- **Stars**: 4 (early-stage project, as of 2026-02-05)

**Key Differentiators**:

- **Shared memory + gRPC** for direct browser state access (no CDP/WebSocket overhead)
- **Semantic DOM understanding** via accessibility tree (roles, names, not CSS selectors)
- **VLM-powered vision** via Llama 3.2 Vision (Ollama) for visual reasoning
- **Stealth capabilities** - native event dispatch, no `navigator.webdriver` flag
- **Deep iframe access** - cross-origin frame traversal without context switching

**When to Use**:

- Experimental agent automation requiring semantic element targeting
- CAPTCHA solving research (VLM-based, experimental)
- Dynamic SPA interaction where CSS selectors break frequently
- Privacy-first automation (local VLM, no cloud dependency)

**When NOT to Use** (prefer established tools):

- Production workloads (project is early stage, Windows-only)
- Cross-platform needs (Linux/Mac builds not yet available)
- Quick automation tasks (Playwright is faster and mature)
- Bulk extraction (Crawl4AI is purpose-built)

**Maturity Warning**: Neural-Chromium is an experimental project (4 stars, 22 commits as of 2026-02-05) and requires building Chromium from source (~4 hours). For production use, prefer Playwright, agent-browser, or dev-browser.

<!-- AI-CONTEXT-END -->

## Architecture

Neural-Chromium modifies Chromium's rendering pipeline to expose internal state directly to AI agents:

```text
AI Agent (Python)
├── gRPC Client ──────────────────┐
│ │
│ Chromium Process │
│ ├── Blink Renderer │
│ │ └── NeuralPageHandler │ ← Blink supplement pattern
│ │ ├── DOM Traversal │
│ │ ├── Accessibility Tree │
│ │ └── Layout Info │
│ │ │
│ ├── Viz (Compositor) │
│ │ └── Shared Memory ─────────┤ ← Zero-copy viewport capture
│ │ │
│ └── In-Process gRPC Server ────┘
└── VLM (Ollama) ← Llama 3.2 Vision for visual reasoning
```

### Key Components

| Component | Purpose |
|-----------|---------|
| **Visual Cortex** | Zero-copy access to rendering pipeline, 60+ FPS frame processing |
| **High-Precision Action** | Coordinate transformation for mapping agent actions to browser events |
| **Deep State Awareness** | Direct DOM access, 800+ node traversal with parent-child relationships |
| **Local Intelligence** | Llama 3.2 Vision via Ollama for privacy-first visual decision-making |

## Installation

### Prerequisites

- **Windows** (Linux/Mac support planned)
- **Python 3.10+**
- **Ollama** (for VLM features)
- **16GB RAM** (for full Chromium build)
- **depot_tools** (Chromium build toolchain)

### Build from Source

```bash
# Set up depot_tools
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
export PATH="/path/to/depot_tools:$PATH"

# Clone Neural-Chromium
git clone https://github.com/mcpmessenger/neural-chromium.git
cd neural-chromium

# Sync and build (~4 hours on first run)
cd src
gclient sync
gn gen out/Default
ninja -C out/Default chrome
```

### Install VLM (Optional)

```bash
# Install Ollama (download the script first so it can be inspected before running)
curl -fsSL https://ollama.com/install.sh -o install.sh
sh install.sh

# Pull vision model
ollama pull llama3.2-vision
```
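
To make the local VLM loop concrete, here is a minimal sketch using Ollama's standard `/api/generate` endpoint on `localhost:11434`; the function and file names are illustrative, not part of Neural-Chromium:

```python
# Minimal sketch of the local VLM flow. Assumes Ollama is running locally;
# `describe_screenshot` and `viewport.png` are illustrative names.
import base64
import json
import urllib.request

def describe_screenshot(path: str, prompt: str) -> str:
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = json.dumps({
        "model": "llama3.2-vision",
        "prompt": prompt,
        "images": [image_b64],  # Ollama accepts base64-encoded images
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```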

## Usage

### Start the Runtime

```bash
# Terminal 1: Start Neural-Chromium with remote debugging
out/Default/chrome.exe --remote-debugging-port=9222

# Terminal 2: Start gRPC agent server
python src/glazyr/nexus_agent.py

# Terminal 3: Run automation scripts
python src/demo_saucedemo_login.py
```

### Python API

```python
from nexus_scenarios import AgentClient, AgentAction
import action_pb2

client = AgentClient()
client.navigate("https://www.saucedemo.com")

# Observe page state (semantic DOM snapshot)
state = client.observe()

# Find elements by semantic role (not CSS selectors).
# `find` is a helper over the observed snapshot, not part of the client API;
# see the sketch after this block.
user_field = find(state, role="textbox", name="Username")
pass_field = find(state, role="textbox", name="Password")
login_btn = find(state, role="button", name="Login")

# Type into fields by element ID
client.act(AgentAction(type=action_pb2.TypeAction(
element_id=user_field.id, text="standard_user"
)))
client.act(AgentAction(type=action_pb2.TypeAction(
element_id=pass_field.id, text="secret_sauce"
)))

# Click by element ID (no coordinates needed)
client.act(AgentAction(click=action_pb2.ClickAction(
element_id=login_btn.id
)))
```
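
`find` is not defined in the snippet above. A plausible sketch, assuming `observe()` returns a snapshot whose `nodes` each carry `role`, `name`, and `id` fields (these field names are assumptions, not confirmed API):

```python
# Hypothetical helper: linear search over the semantic snapshot.
# `state.nodes`, `node.role`, and `node.name` are assumed field names.
def find(state, role: str, name: str):
    for node in state.nodes:
        if node.role == role and node.name == name:
            return node
    raise LookupError(f"no {role!r} element named {name!r} in snapshot")
```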

### Core Actions

| Action | Method | Description |
|--------|--------|-------------|
| **observe()** | `client.observe()` | Full DOM + accessibility tree snapshot |
| **click(id)** | `AgentAction(click=ClickAction(element_id=id))` | Direct event dispatch by element ID |
| **type(id, text)** | `AgentAction(type=TypeAction(element_id=id, text=text))` | Input injection by element ID |
| **navigate(url)** | `client.navigate(url)` | Navigate to URL |
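
Until Phase 4 lands push-based events (see Roadmap), waiting for an element means re-running `observe()`. A polling sketch, reusing the hypothetical `find` helper above:

```python
import time

def wait_for(client, role: str, name: str, timeout: float = 10.0):
    # Re-snapshot the page until the element appears or the timeout elapses.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return find(client.observe(), role=role, name=name)
        except LookupError:
            time.sleep(0.25)
    raise TimeoutError(f"{role!r} {name!r} did not appear within {timeout}s")
```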

### VLM CAPTCHA Solving (Experimental)

```bash
# Requires Ollama with llama3.2-vision
python src/vlm_captcha_solve.py
```

The VLM solver captures the viewport via shared memory, sends it to Llama 3.2 Vision, and receives structured predictions (JSON tile indices with confidence scores).
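
The prediction schema is not documented upstream; assuming a shape like `{"tiles": [{"index": 3, "confidence": 0.91}, ...]}`, consuming it might look like:

```python
import json

def confident_tiles(prediction: str, threshold: float = 0.8) -> list[int]:
    # Keep only tile indices the model is reasonably sure about.
    tiles = json.loads(prediction)["tiles"]
    return [t["index"] for t in tiles if t["confidence"] >= threshold]
```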

## Performance Benchmarks

From the project's own benchmarks (10 runs per task, 120s timeout):

| Task | Neural-Chromium | Playwright | Notes |
|------|----------------|------------|-------|
| **Interaction latency** | 1.32s | ~0.5s | NC trades speed for semantic robustness |
| **Auth + data extraction** | 2.3s (100%) | 1.1s (90%) | NC uses semantic selectors |
| **Dynamic SPA (TodoMVC)** | 9.4s (100%) | 3.2s (60%) | NC handles async DOM reliably |
| **Multi-step form** | 4.1s (100%) | 2.8s (95%) | NC uses native event dispatch |
| **CAPTCHA solving** | ~50s (experimental) | N/A (blocked) | VLM-based; accuracy depends on the model |

**Key trade-off**: Neural-Chromium is slower in raw latency but claims higher reliability for dynamic SPAs and sites that break CSS selectors frequently.

## Comparison with Existing Tools

| Feature | Neural-Chromium | Playwright | agent-browser | Stagehand |
|---------|----------------|------------|---------------|-----------|
| **Interface** | Python + gRPC | JS/TS API | CLI (Rust) | JS/Python SDK |
| **Element targeting** | Semantic (role/name) | CSS/XPath | Refs from snapshot | Natural language |
| **Browser engine** | Custom Chromium fork | Bundled Chromium | Bundled Chromium | Bundled Chromium |
| **Stealth** | Native (no webdriver) | Detectable | Detectable | Detectable |
| **VLM vision** | Built-in (Ollama) | No | No | No |
| **CAPTCHA handling** | Experimental (VLM) | Blocked | Blocked | Blocked |
| **Iframe access** | Deep traversal | Context switching | Context switching | Context switching |
| **Platform** | Windows only | Cross-platform | Cross-platform | Cross-platform |
| **Maturity** | Experimental | Production | Production | Production |
| **Setup complexity** | Build Chromium (~4h) | `npm install` | `npm install` | `npm install` |

## Roadmap

### Phase 4: Production Hardening (Next)

- Delta updates (only changed DOM nodes, target <500ms latency)
- Push-based events (replace polling with `wait_for_signal`)
- Shadow DOM piercing for modern SPAs
- Multi-tab support for parallel agent execution
- Linux/Mac builds

### Phase 5: Advanced Vision

- OCR integration for text extraction from images
- Visual grounding (click coordinates from natural language)
- Screen diffing for visual change detection

### Phase 6: Ecosystem

- Python SDK (`neural_chromium.Agent()`)
- Docker images for containerized runtime
- Kubernetes operator for cloud deployment

## Repository Structure

```text
neural-chromium/
├── src/
│ ├── glazyr/
│ │ ├── nexus_agent.py # gRPC server + VisualCortex
│ │ ├── proto/ # Protocol Buffer definitions
│ │ └── neural_page_handler.* # Blink C++ integration
│ ├── nexus_scenarios.py # High-level agent client
│ ├── vlm_solver.py # Llama Vision integration
│ └── demo_*.py # Example flows
├── docs/
│ └── NEURAL_CHROMIUM_ARCHITECTURE.md
├── deployment/ # Docker/deployment configs
├── tests/ # Test suite
└── Makefile # Build and benchmark commands
```
Contributor comment on lines +15 to +257 - 🛠️ Refactor suggestion | 🟠 Major

Apply progressive disclosure and replace inline code blocks with file:line references. This subagent doc is highly inline (architecture, install, usage, code); the `.agent/**/*.md` guideline requires pointers to subagents and authoritative file:line references instead of inline snippets (e.g., src/glazyr/nexus_agent.py for the gRPC entry points, src/vlm_solver.py for the VLM CAPTCHA logic, src/demo_saucedemo_login.py for a usage example). Collapse detail into references and keep this doc as a concise entry point.

## Resources

- **GitHub**: https://github.com/mcpmessenger/neural-chromium
- **Live Demo**: https://neuralchrom-dtcvjx99.manus.space
- **Demo Video**: https://youtube.com/shorts/8nOlID7izjQ
- **Twitter**: https://x.com/MCPMessenger
- **License**: BSD-3-Clause
2 changes: 1 addition & 1 deletion TODO.md
@@ -201,7 +201,7 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis
- Notes: MCP server for iOS simulator interaction (1.5k stars, MIT). Featured in Anthropic's Claude Code Best Practices. Tools: tap, swipe, type, screenshot, record_video, describe-ui (accessibility), install_app, launch_app. Install: `npx -y ios-simulator-mcp`. Requires macOS, Xcode, Facebook IDB. Enables AI-assisted QA testing - verify UI elements, confirm text input, validate gestures. Complements XcodeBuildMCP (build) and Maestro (E2E flows). Add to tools/mobile/ or tools/testing/.
- [ ] t098 Add Playwright device emulation subagent #tools #browser #testing #mobile ~30m (ai:20m test:5m read:5m) logged:2026-01-30 related:t096 ref:https://playwright.dev/docs/emulation
- Notes: Document Playwright's device emulation capabilities for mobile/tablet testing. Features: device registry (iPhone, iPad, Pixel, Galaxy), viewport/screen size, userAgent, touch events, geolocation, locale/timezone, permissions, colorScheme, offline mode. Config via playwright.config.ts or per-test. Complements native mobile testing (Maestro, iOS Simulator MCP) for web-based mobile testing. Add to tools/browser/playwright-emulation.md or extend existing playwright.md.
- [ ] t099 Add Neural-Chromium for agent-native browser automation #tools #browser #ai #automation ~2h (ai:1.5h test:20m read:10m) logged:2026-01-30 ref:https://github.com/mcpmessenger/neural-chromium
- [ ] t099 Add Neural-Chromium for agent-native browser automation #tools #browser #ai #automation ~2h (ai:1.5h test:20m read:10m) logged:2026-01-30 started:2026-02-05T00:00Z ref:https://github.com/mcpmessenger/neural-chromium
- Notes: Neural-Chromium (BSD-3) - Chromium fork designed for AI agents. Features: ~1.3s interaction latency (slower than Playwright's ~0.5s raw latency per the project's own benchmarks, but claiming higher task reliability), semantic DOM understanding via accessibility tree, VLM-powered vision (Llama 3.2 via Ollama), stealth capabilities (no navigator.webdriver), deep iframe access. Uses shared memory + gRPC for direct browser state access. Tools: click(element_id), type(element_id, text), observe() for DOM snapshots. Early-stage but promising for agent automation. Evaluate for: CAPTCHA solving, dynamic SPA interaction, form filling. Add to tools/browser/ as experimental option.
- [ ] t100 Add AXe CLI for iOS simulator accessibility automation #tools #ios #testing #accessibility ~45m (ai:30m test:10m read:5m) logged:2026-01-30 related:t095,t097 ref:https://github.com/cameroncooke/AXe
- Notes: AXe (1.1k stars, MIT) - CLI tool for iOS Simulator automation using Apple's Accessibility APIs and HID. By same author as XcodeBuildMCP. Features: tap (coordinates or accessibility ID/label), swipe, type, hardware buttons (home, lock, siri), gesture presets (scroll-up/down, edge swipes), screenshot, video recording/streaming, describe-ui (accessibility tree). Install: `brew install cameroncooke/axe/axe`. Single binary, no server required. Timing controls (pre/post delays). Complements XcodeBuildMCP for build+test workflow. Add to tools/mobile/ or tools/testing/.