diff --git a/.agent/tools/browser/browser-automation.md b/.agent/tools/browser/browser-automation.md index 7b83e9b2..0481ad7b 100644 --- a/.agent/tools/browser/browser-automation.md +++ b/.agent/tools/browser/browser-automation.md @@ -93,10 +93,6 @@ What do you need? | +-> Proxy per profile / geo-targeting? | --> proxy-integration.md (residential, SOCKS5, rotation) | - +-> EXPERIMENTAL (agent-native browser, VLM vision)? - | --> neural-chromium.md (Chromium fork, semantic DOM, gRPC, Windows-only) - | --> Note: Early stage, requires building Chromium from source - | +-> TEST your own app (dev server)? | +-> Need to stay logged in across restarts? --> dev-browser (profile) @@ -178,7 +174,6 @@ Tested 2026-01-24, macOS ARM64 (Apple Silicon), headless, warm daemon. Median of | **Playwriter** | Existing browser, extensions, bypass detection | Medium | Chrome extension + `npx playwriter` | | **Stagehand** | Unknown pages, natural language, self-healing | Slow | `stagehand-helper.sh setup` + API key | | **Anti-detect** | Bot evasion, multi-account, fingerprint rotation | Medium | `anti-detect-helper.sh setup` | -| **Neural-Chromium** | Semantic DOM, VLM vision, stealth (experimental) | Medium | Build from source (Windows) | ## AI Page Understanding (Visual Verification) diff --git a/.agent/tools/browser/neural-chromium.md b/.agent/tools/browser/neural-chromium.md deleted file mode 100644 index 3812569a..00000000 --- a/.agent/tools/browser/neural-chromium.md +++ /dev/null @@ -1,265 +0,0 @@ ---- -description: Neural-Chromium - agent-native Chromium fork with semantic DOM, gRPC, and VLM vision -mode: subagent -tools: - read: true - write: false - edit: false - bash: true - glob: true - grep: true - webfetch: true - task: true ---- - -# Neural-Chromium - Agent-Native Browser Runtime - - - -## Quick Reference - -- **Purpose**: Chromium fork designed for AI agents with direct browser state access -- **GitHub**: https://github.com/mcpmessenger/neural-chromium -- **License**: 
BSD-3-Clause (same as Chromium) -- **Languages**: C++ (81%), Python (17%) -- **Status**: Experimental (Phase 3 complete, Windows-only builds currently) -- **Stars**: 4 (early stage project) - -**Key Differentiators**: - -- **Shared memory + gRPC** for direct browser state access (no CDP/WebSocket overhead) -- **Semantic DOM understanding** via accessibility tree (roles, names, not CSS selectors) -- **VLM-powered vision** via Llama 3.2 Vision (Ollama) for visual reasoning -- **Stealth capabilities** - native event dispatch, no `navigator.webdriver` flag -- **Deep iframe access** - cross-origin frame traversal without context switching - -**When to Use**: - -- Experimental agent automation requiring semantic element targeting -- CAPTCHA solving research (VLM-based, experimental) -- Dynamic SPA interaction where CSS selectors break frequently -- Privacy-first automation (local VLM, no cloud dependency) - -**When NOT to Use** (prefer established tools): - -- Production workloads (project is early stage, Windows-only) -- Cross-platform needs (Linux/Mac builds not yet available) -- Quick automation tasks (Playwright is faster and mature) -- Bulk extraction (Crawl4AI is purpose-built) - -**Maturity Warning**: Neural-Chromium is an experimental project with 4 stars and 22 commits. It requires building Chromium from source (~4 hours). For production use, prefer Playwright, agent-browser, or dev-browser. 
- - - -## Architecture - -Neural-Chromium modifies Chromium's rendering pipeline to expose internal state directly to AI agents: - -```text -AI Agent (Python) - │ - ├── gRPC Client ──────────────────┐ - │ │ - │ Chromium Process │ - │ ├── Blink Renderer │ - │ │ └── NeuralPageHandler │ ← Blink supplement pattern - │ │ ├── DOM Traversal │ - │ │ ├── Accessibility Tree │ - │ │ └── Layout Info │ - │ │ │ - │ ├── Viz (Compositor) │ - │ │ └── Shared Memory ─────────┤ ← Zero-copy viewport capture - │ │ │ - │ └── In-Process gRPC Server ────┘ - │ - └── VLM (Ollama) ← Llama 3.2 Vision for visual reasoning -``` - -### Key Components - -| Component | Purpose | -|-----------|---------| -| **Visual Cortex** | Zero-copy access to rendering pipeline, 60+ FPS frame processing | -| **High-Precision Action** | Coordinate transformation for mapping agent actions to browser events | -| **Deep State Awareness** | Direct DOM access, 800+ node traversal with parent-child relationships | -| **Local Intelligence** | Llama 3.2 Vision via Ollama for privacy-first visual decision-making | - -## Installation - -### Prerequisites - -- **Windows** (Linux/Mac support planned) -- **Python 3.10+** -- **Ollama** (for VLM features) -- **16GB RAM** (for full Chromium build) -- **depot_tools** (Chromium build toolchain) - -### Build from Source - -```bash -# Set up depot_tools -git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git -export PATH="/path/to/depot_tools:$PATH" - -# Clone Neural-Chromium -git clone https://github.com/mcpmessenger/neural-chromium.git -cd neural-chromium - -# Sync and build (~4 hours on first run) -cd src -gclient sync -gn gen out/Default -ninja -C out/Default chrome -``` - -### Install VLM (Optional) - -```bash -# Install Ollama -curl -fsSL https://ollama.com/install.sh | sh - -# Pull vision model -ollama pull llama3.2-vision -``` - -## Usage - -### Start the Runtime - -```bash -# Terminal 1: Start Neural-Chromium with remote debugging -out/Default/chrome.exe 
--remote-debugging-port=9222 - -# Terminal 2: Start gRPC agent server -python src/glazyr/nexus_agent.py - -# Terminal 3: Run automation scripts -python src/demo_saucedemo_login.py -``` - -### Python API - -```python -from nexus_scenarios import AgentClient, AgentAction -import action_pb2 - -client = AgentClient() -client.navigate("https://www.saucedemo.com") - -# Observe page state (semantic DOM snapshot) -state = client.observe() - -# Find elements by semantic role (not CSS selectors) -user_field = find(state, role="textbox", name="Username") -pass_field = find(state, role="textbox", name="Password") -login_btn = find(state, role="button", name="Login") - -# Type into fields by element ID -client.act(AgentAction(type=action_pb2.TypeAction( - element_id=user_field.id, text="standard_user" -))) -client.act(AgentAction(type=action_pb2.TypeAction( - element_id=pass_field.id, text="secret_sauce" -))) - -# Click by element ID (no coordinates needed) -client.act(AgentAction(click=action_pb2.ClickAction( - element_id=login_btn.id -))) -``` - -### Core Actions - -| Action | Method | Description | -|--------|--------|-------------| -| **observe()** | `client.observe()` | Full DOM + accessibility tree snapshot | -| **click(id)** | `AgentAction(click=ClickAction(element_id=id))` | Direct event dispatch by element ID | -| **type(id, text)** | `AgentAction(type=TypeAction(element_id=id, text=text))` | Input injection by element ID | -| **navigate(url)** | `client.navigate(url)` | Navigate to URL | - -### VLM CAPTCHA Solving (Experimental) - -```bash -# Requires Ollama with llama3.2-vision -python src/vlm_captcha_solve.py -``` - -The VLM solver captures viewport via shared memory, sends to Llama 3.2 Vision, and receives structured predictions (JSON tile indices with confidence scores). 
- -## Performance Benchmarks - -From the project's own benchmarks (10 runs per task, 120s timeout): - -| Task | Neural-Chromium | Playwright | Notes | -|------|----------------|------------|-------| -| **Interaction latency** | 1.32s | ~0.5s | NC trades speed for semantic robustness | -| **Auth + data extraction** | 2.3s (100%) | 1.1s (90%) | NC uses semantic selectors | -| **Dynamic SPA (TodoMVC)** | 9.4s (100%) | 3.2s (60%) | NC handles async DOM reliably | -| **Multi-step form** | 4.1s (100%) | 2.8s (95%) | NC uses native event dispatch | -| **CAPTCHA solving** | ~50s (experimental) | N/A (blocked) | VLM-based, contingent on model | - -**Key trade-off**: Neural-Chromium is slower in raw latency but claims higher reliability for dynamic SPAs and sites that break CSS selectors frequently. - -## Comparison with Existing Tools - -| Feature | Neural-Chromium | Playwright | agent-browser | Stagehand | -|---------|----------------|------------|---------------|-----------| -| **Interface** | Python + gRPC | JS/TS API | CLI (Rust) | JS/Python SDK | -| **Element targeting** | Semantic (role/name) | CSS/XPath | Refs from snapshot | Natural language | -| **Browser engine** | Custom Chromium fork | Bundled Chromium | Bundled Chromium | Bundled Chromium | -| **Stealth** | Native (no webdriver) | Detectable | Detectable | Detectable | -| **VLM vision** | Built-in (Ollama) | No | No | No | -| **CAPTCHA handling** | Experimental (VLM) | Blocked | Blocked | Blocked | -| **Iframe access** | Deep traversal | Context switching | Context switching | Context switching | -| **Platform** | Windows only | Cross-platform | Cross-platform | Cross-platform | -| **Maturity** | Experimental | Production | Production | Production | -| **Setup complexity** | Build Chromium (~4h) | `npm install` | `npm install` | `npm install` | - -## Roadmap - -### Phase 4: Production Hardening (Next) - -- Delta updates (only changed DOM nodes, target <500ms latency) -- Push-based events (replace polling with 
`wait_for_signal`) -- Shadow DOM piercing for modern SPAs -- Multi-tab support for parallel agent execution -- Linux/Mac builds - -### Phase 5: Advanced Vision - -- OCR integration for text extraction from images -- Visual grounding (click coordinates from natural language) -- Screen diffing for visual change detection - -### Phase 6: Ecosystem - -- Python SDK (`neural_chromium.Agent()`) -- Docker images for containerized runtime -- Kubernetes operator for cloud deployment - -## Repository Structure - -```text -neural-chromium/ -├── src/ -│ ├── glazyr/ -│ │ ├── nexus_agent.py # gRPC server + VisualCortex -│ │ ├── proto/ # Protocol Buffer definitions -│ │ └── neural_page_handler.* # Blink C++ integration -│ ├── nexus_scenarios.py # High-level agent client -│ ├── vlm_solver.py # Llama Vision integration -│ └── demo_*.py # Example flows -├── docs/ -│ └── NEURAL_CHROMIUM_ARCHITECTURE.md -├── deployment/ # Docker/deployment configs -├── tests/ # Test suite -└── Makefile # Build and benchmark commands -``` - -## Resources - -- **GitHub**: https://github.com/mcpmessenger/neural-chromium -- **Live Demo**: https://neuralchrom-dtcvjx99.manus.space -- **Demo Video**: https://youtube.com/shorts/8nOlID7izjQ -- **Twitter**: https://x.com/MCPMessenger -- **License**: BSD-3-Clause diff --git a/TODO.md b/TODO.md index 9b0fea02..0b9611f4 100644 --- a/TODO.md +++ b/TODO.md @@ -57,9 +57,9 @@ Tasks with no open blockers - ready to work on. 
Use `/ready` to refresh this lis - [ ] t106 Replace eval in system-cleanup.sh find command construction with safe args #security #shell ~1h (ai:45m test:15m) logged:2026-02-03 - [ ] t107 Avoid eval-based export in credential-helper.sh; use safe output/quoting #security #shell ~1h (ai:45m test:15m) logged:2026-02-03 - [ ] t108 Dashboard token storage hardening (avoid localStorage; add reset/clear flow) #security #dashboard #plan → [todo/PLANS.md#2026-02-03-dashboard-token-storage-hardening] ~3h (ai:1.5h test:1h read:30m) logged:2026-02-03 -- [ ] t109 Fix template deploy head usage error (invalid option -z) #setup #deploy #bugfix ~30m (ai:20m test:10m) logged:2026-02-03 -- [ ] t110 Resolve awk newline warnings during setup deploy (system-reminder) #setup #deploy #bugfix ~45m (ai:30m test:15m) logged:2026-02-03 -- [ ] t111 Resolve DSPy dependency conflict (gepa) in setup flow #python #dspy #deps ~45m (ai:30m test:15m) logged:2026-02-03 +- [ ] t121 Fix template deploy head usage error (invalid option -z) #setup #deploy #bugfix ~30m (ai:20m test:10m) logged:2026-02-03 +- [ ] t122 Resolve awk newline warnings during setup deploy (system-reminder) #setup #deploy #bugfix ~45m (ai:30m test:15m) logged:2026-02-03 +- [ ] t123 Resolve DSPy dependency conflict (gepa) in setup flow #python #dspy #deps ~45m (ai:30m test:15m) logged:2026-02-03 - [ ] t082 Fix version sync inconsistency (VERSION vs package.json/setup.sh/aidevops.sh) #bugfix ~15m (ai:10m test:5m) logged:2026-01-29 - Notes: Release commit bd0695c bumped VERSION to 2.92.1 but missed syncing package.json, setup.sh, aidevops.sh, sonar-project.properties, .claude-plugin/marketplace.json. Either fix manually or ensure version-manager.sh is used for all releases. 
- [ ] t068 Multi-Agent Orchestration & Token Efficiency #plan → [todo/PLANS.md#2026-01-23-multi-agent-orchestration--token-efficiency] ~5d (ai:3d test:1d read:1d) logged:2026-01-23 started:2026-01-23T00:00Z @@ -99,7 +99,8 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - [ ] t020 Git Issues Bi-directional Sync (GitHub, GitLab, Gitea) #plan #git #sync ~1d (ai:4h test:4h read:2h) logged:2025-12-21 - [x] t021 Auto-mark tasks complete from commit messages in release #workflow #automation ~30m (ai:20m test:10m) logged:2025-12-22 completed:2026-01-25 - [ ] t023 Evaluate Shannon AI pentester for security testing integration #security #tools ~30m (ai:20m read:10m) logged:2025-01-03 -- [ ] t024 Evaluate Dexter autonomous financial research agent #research #finance #agents ~30m (ai:20m read:10m) logged:2025-01-03 +- [ ] t024 Evaluate Dexter autonomous financial research agent #research #finance #agents ~30m (ai:20m read:10m) logged:2025-01-03 ref:https://github.com/virattt/dexter + - Notes: Dexter (10.6k stars, actively maintained, last commit Feb 2026). TypeScript multi-step autonomous agent for deep financial research. Evaluate: 1) Integration as optional financial-research subagent, 2) Adoption of its tool-chaining patterns for other research domains (SEO, competitive analysis), 3) Potential as imported skill via `aidevops skill add virattt/dexter`. 
- [ ] t025 Create terminal optimization /command and @subagent using Claude #tools #terminal #productivity ~1h (ai:40m test:10m read:10m) logged:2025-01-03 ref:https://x.com/deedydas/status/2007342412335927400 - [ ] t026 Create subscription audit /command and @subagent for accounts agent #accounts #subscriptions #automation ~1.5h (ai:1h test:20m read:10m) logged:2025-01-03 ref:https://x.com/frankdegods/status/2007199488776253597 - [ ] t027 Add hyprwhspr speech-to-text support (Arch/Omarchy Linux only) #tools #accessibility #linux ~30m (ai:20m read:10m) logged:2025-01-03 @@ -107,9 +108,12 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - Notes: Implemented secure OpenCode GitHub integration with max security approach. Created `.github/workflows/opencode-agent.yml` with trusted-user-only access, ai-approved label requirement, prompt injection detection, audit logging, and 15-min timeout. Documentation in `.agent/tools/git/opencode-github-security.md`. Helper script updated with `create-secure` and `create-labels` commands. - [x] t051 Loop System v2 - Fresh sessions per iteration #workflow #automation ~4h (ai:2h test:1h read:1h) logged:2025-01-11 started:2025-01-11T00:00Z completed:2025-01-11 ref:https://github.com/gmickel/gmickel-claude-marketplace/tree/main/plugins/flow-next - Notes: Implemented flow-next inspired architecture for ralph-loop and full-loop. Created loop-common.sh (~700 lines) with JSON state management, re-anchor prompt generator (reads TODO.md, git state, memories), receipt verification, memory integration hooks, and task blocking after N failures. PR #38 merged. Prevents context drift by spawning fresh AI sessions per iteration. 
-- [ ] t029 Review @penberg post for aidevops inclusion or similar approach #research #tools ~15m (ai:10m read:5m) logged:2025-01-03 ref:https://x.com/penberg/status/2007533204622770214 -- [ ] t030 Evaluate @irl_danB post for useful advantages #research #tools ~15m (ai:10m read:5m) logged:2025-01-03 ref:https://x.com/irl_danB/status/2007259356103094523 +- [ ] t029 Research Penberg's Weave project (deterministic execution for AI agents) #research #tools #agents ~30m (ai:20m read:10m) logged:2025-01-03 ref:https://x.com/penberg/status/2007533204622770214,https://github.com/penberg/weave + - Notes: Pekka Enberg (@penberg) - systems/database researcher. His "Weave" project (13 stars, Jan 2026) provides deterministic execution for reproducible debugging of AI agents. Evaluate: 1) Session reproducibility benefits for aidevops, 2) Debugging improvements for agent workflows, 3) Integration with ralph-loop and session-manager. +- [ ] t030 Research irl_danB's progressive-memory and clawdbot projects #research #tools #memory ~30m (ai:20m read:10m) logged:2025-01-03 ref:https://x.com/irl_danB/status/2007259356103094523,https://github.com/irl-dan/progressive-memory + - Notes: irl_danB's projects (active Feb 2026): progressive-memory (filesystem-based progressive disclosure for agent memory), clawdbot (multi-platform AI assistant), agentsy-live. Evaluate: 1) Progressive-memory pattern vs aidevops's SQLite FTS5 memory, 2) Multi-platform deployment lessons from clawdbot, 3) Potential collaboration or feature adoption. - [ ] t031 Company orchestration agent/workflow inspired by @DanielleMorrill #plan #agents #business ~1h (ai:40m test:10m read:10m) logged:2025-01-03 ref:https://x.com/DanielleMorrill/status/2007508036584341899
Scope: 1) Document company-level agent patterns, 2) Create example runners for common company functions (hiring-coordinator, finance-reviewer, ops-monitor), 3) Integrate with coordinator-helper.sh for cross-function task dispatch. - [x] t032 Create performance skill/subagent/command inspired by @elithrar #tools #performance ~30m actual:25m (ai:20m test:5m read:5m) logged:2025-01-03 started:2026-01-25T15:00Z completed:2026-01-25 ref:https://x.com/elithrar/status/2007455910218871067 - Notes: Created tools/performance/performance.md subagent and /performance command. Uses Chrome DevTools MCP for Core Web Vitals (FCP, LCP, CLS, FID, TTFB), network dependency analysis, and accessibility auditing. Actionable output format with file:line references. PR #209 merged. - [ ] t033 Add X/Twitter fetching via fxtwitter API (x.sh script) #tools #browser ~20m (ai:15m test:5m) logged:2025-01-03 ref:https://gist.github.com/marckohlbrugge/93bcf631c3317e793f0295e6155e6e7f @@ -201,16 +205,21 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - Notes: MCP server for iOS simulator interaction (1.5k stars, MIT). Featured in Anthropic's Claude Code Best Practices. Tools: tap, swipe, type, screenshot, record_video, describe-ui (accessibility), install_app, launch_app. Install: `npx -y ios-simulator-mcp`. Requires macOS, Xcode, Facebook IDB. Enables AI-assisted QA testing - verify UI elements, confirm text input, validate gestures. Complements XcodeBuildMCP (build) and Maestro (E2E flows). Add to tools/mobile/ or tools/testing/. - [ ] t098 Add Playwright device emulation subagent #tools #browser #testing #mobile ~30m (ai:20m test:5m read:5m) logged:2026-01-30 related:t096 ref:https://playwright.dev/docs/emulation - Notes: Document Playwright's device emulation capabilities for mobile/tablet testing. 
Features: device registry (iPhone, iPad, Pixel, Galaxy), viewport/screen size, userAgent, touch events, geolocation, locale/timezone, permissions, colorScheme, offline mode. Config via playwright.config.ts or per-test. Complements native mobile testing (Maestro, iOS Simulator MCP) for web-based mobile testing. Add to tools/browser/playwright-emulation.md or extend existing playwright.md. -- [ ] t099 Add Neural-Chromium for agent-native browser automation #tools #browser #ai #automation ~2h (ai:1.5h test:20m read:10m) logged:2026-01-30 started:2026-02-05T00:00Z ref:https://github.com/mcpmessenger/neural-chromium - - Notes: Neural-Chromium (BSD-3) - Chromium fork designed for AI agents. Features: 1.3s interaction latency (4.7x faster than Playwright), semantic DOM understanding via accessibility tree, VLM-powered vision (Llama 3.2 via Ollama), stealth capabilities (no navigator.webdriver), deep iframe access. Uses shared memory + gRPC for direct browser state access. Tools: click(element_id), type(element_id, text), observe() for DOM snapshots. Early stage but promising for agent automation. Evaluate for: CAPTCHA solving, dynamic SPA interaction, form filling. Add to tools/browser/ as experimental option. +- [-] t099 ~~Add Neural-Chromium for agent-native browser automation~~ #tools #browser #ai #automation ~2h logged:2026-01-30 declined:2026-02-05 + - Notes: DECLINED - Neural-Chromium has only 4 stars, no releases, and is Windows-only. Not viable for macOS/Linux users. Better alternatives: browser-use (77.8k stars, cross-platform), Stagehand (20.8k stars, already documented), Skyvern (20.3k stars). The existing neural-chromium.md subagent should be removed. +- [ ] t125 Add browser-use subagent for AI-native browser automation #tools #browser #ai #automation ~1.5h (ai:1h test:20m read:10m) logged:2026-02-05 ref:https://github.com/browser-use/browser-use + - Notes: browser-use (77.8k stars, MIT, Python) - the most popular AI browser automation framework. Connects LLMs to browsers for web agent tasks. Features: multi-tab management, vision + HTML extraction, custom actions, self-correcting, supports GPT-4o/Claude/Gemini/DeepSeek/Llama. Install: `pip install browser-use`. Cloud option via browser-use.com. Significantly more mature than Neural-Chromium (declined t099). Create subagent at tools/browser/browser-use.md. Evaluate: 1) Integration with existing Playwright stack, 2) Comparison with Stagehand (already documented), 3) Custom action patterns for aidevops workflows. +- [ ] t126 Add Skyvern subagent for computer vision browser automation #tools #browser #ai #automation ~1.5h (ai:1h test:20m read:10m) logged:2026-02-05 ref:https://github.com/Skyvern-AI/skyvern + - Notes: Skyvern (20.3k stars, AGPL-3.0, Python) - automates browser workflows using LLMs and computer vision. Instead of CSS/XPath selectors, it uses visual understanding to interact with elements. Features: real-time visual parsing, complex workflow chaining, proxy support, CAPTCHA solving, 2FA/TOTP handling. Cloud and self-hosted options. Install: Docker or pip. Create subagent at tools/browser/skyvern.md. Evaluate: 1) Visual approach vs Stagehand's natural language, 2) Self-hosted deployment for privacy, 3) Workflow chaining for multi-step automations.
- [ ] t100 Add AXe CLI for iOS simulator accessibility automation #tools #ios #testing #accessibility ~45m (ai:30m test:10m read:5m) logged:2026-01-30 related:t095,t097 ref:https://github.com/cameroncooke/AXe - Notes: AXe (1.1k stars, MIT) - CLI tool for iOS Simulator automation using Apple's Accessibility APIs and HID. By same author as XcodeBuildMCP. Features: tap (coordinates or accessibility ID/label), swipe, type, hardware buttons (home, lock, siri), gesture presets (scroll-up/down, edge swipes), screenshot, video recording/streaming, describe-ui (accessibility tree). Install: `brew install cameroncooke/axe/axe`. Single binary, no server required. Timing controls (pre/post delays). Complements XcodeBuildMCP for build+test workflow. Add to tools/mobile/ or tools/testing/. - [ ] t101 Create Mom Test UX/CRO agent framework #agents #ux #cro #conversion ~3h (ai:2h test:30m read:30m) logged:2026-01-30 - Notes: Apple-inspired "Mom Test" framework for UX evaluation and CRO. **6 UX Principles:** Clarity, Simplicity, Consistency, Feedback, Discoverability, Forgiveness. **Workflow:** 1) Screen-by-screen "Would this confuse my mom?" analysis. 2) Generate tables: Confusing Element | Mom's Reaction | Fix. 3) Rank biggest UX failures by severity. 4) Identify quick wins with effort/impact matrix. 5) Prioritize CRO recommendations based on proven UX patterns. **Output:** Actionable report with specific fixes, not vague suggestions. Integrate with browser automation (Playwright/Stagehand) for automated page analysis. Reference existing page-cro.md (t093) and accessibility testing. Add to seo/ or tools/ux/. 
-- [x] t103 Review Pi agent for aidevops inspiration #research #agents #architecture ~1h actual:45m (ai:45m read:15m) logged:2026-02-01 started:2026-02-05T20:00Z completed:2026-02-05 ref:https://lucumr.pocoo.org/2026/1/31/pi/,https://github.com/badlogic/pi-mono/,https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent +- [x] t103 Review Pi agent for aidevops inspiration #research #agents #architecture ~1h actual:45m (ai:45m read:15m) logged:2026-02-01 started:2026-02-05T20:00Z completed:2026-02-05 ref:https://lucumr.pocoo.org/2026/1/31/pi/,https://github.com/badlogic/pi-mono/ - Notes: Review complete. See todo/research/pi-agent-review.md. Key findings: Pi's minimal 4-tool core validates aidevops's on-demand MCP loading. Session trees (branching/rewinding) are the most interesting feature aidevops lacks but can't implement without agent-level support. Extension hot-reload is powerful but different architecture from aidevops's markdown subagents. Recommended: document "remix" skill pattern in build-agent.md, add desktop notification pattern for long tasks. Skip: removing MCP, rewriting in TypeScript. -- [ ] t104 Add Tirith terminal security guard for homograph/injection attacks #security #tools #terminal ~2h (ai:1.5h test:20m read:10m) logged:2026-02-03 ref:https://github.com/sheeki03/tirith - - Notes: Tirith (740 stars, Rust, AGPL-3.0) - terminal security tool that catches attacks browsers block but terminals don't. **30 rules across 7 categories:** 1) Homograph attacks (Cyrillic/Greek lookalikes, punycode, mixed-script). 2) Terminal injection (ANSI escapes, bidi overrides, zero-width chars).
3) Pipe-to-shell (`curl|bash`, `wget|sh`, `eval $(wget ...)`). 4) Dotfile attacks (downloads targeting ~/.bashrc, ~/.ssh/authorized_keys). 5) Insecure transport (HTTP piped to shell, `curl -k`). 6) Ecosystem threats (git clone typosquats, untrusted Docker registries, pip/npm URL installs). 7) Credential exposure (userinfo tricks, shortened URLs). **Integration options:** 1) Add to aidevops setup/onboarding as recommended install. 2) Create tirith.md subagent at tools/security/. 3) Document shell hook setup (`eval "$(tirith init)"`). 4) Consider MCP wrapper for `tirith check` command validation. **Key features:** Sub-millisecond overhead, local-only (no network calls), YAML policy config, bypass with `TIRITH=0` prefix. Install: `brew install sheeki03/tap/tirith` or `npm install -g tirith` or `cargo install tirith`. +- [ ] t124 Add Tirith terminal security guard for homograph/injection attacks #security #tools #terminal ~2h (ai:1.5h test:20m read:10m) logged:2026-02-03 ref:https://github.com/sheeki03/tirith + - Notes: Tirith (1,300 stars, Rust, AGPL-3.0) - terminal security tool that catches attacks browsers block but terminals don't. See also the t104 plan: Tirith's `tirith run` command provides verified download-then-execute as the implementation vehicle for curl|sh hardening. **30 rules across 7 categories:** 1) Homograph attacks (Cyrillic/Greek lookalikes, punycode, mixed-script). 2) Terminal injection (ANSI escapes, bidi overrides, zero-width chars). 3) Pipe-to-shell (`curl|bash`, `wget|sh`, `eval $(wget ...)`). 4) Dotfile attacks (downloads targeting ~/.bashrc, ~/.ssh/authorized_keys). 5) Insecure transport (HTTP piped to shell, `curl -k`). 6) Ecosystem threats (git clone typosquats, untrusted Docker registries, pip/npm URL installs). 7) Credential exposure (userinfo tricks, shortened URLs). **Integration options:** 1) Add to aidevops setup/onboarding as recommended install. 2) Create tirith.md subagent at tools/security/.
3) Document shell hook setup (`eval "$(tirith init)"`). 4) Consider MCP wrapper for `tirith check` command validation. **Key features:** Sub-millisecond overhead, local-only (no network calls), YAML policy config, bypass with `TIRITH=0` prefix. Install: `brew install sheeki03/tap/tirith` or `npm install -g tirith` or `cargo install tirith`. - [ ] t109 Parallel Agents & Headless Dispatch #plan → [todo/PLANS.md#2026-02-03-parallel-agents--headless-dispatch] ~3d (ai:1.5d test:1d read:0.5d) logged:2026-02-03 started:2026-02-05T00:00Z - [x] t109.1 Document headless dispatch patterns ~4h blocked-by:none completed:2026-02-05 - Notes: Created tools/ai-assistants/headless-dispatch.md. Documents `opencode run` headless flags, `opencode serve` server mode, `--attach` warm server pattern, SDK parallel dispatch, runner lifecycle, custom agents, CI/CD integration. @@ -263,15 +272,15 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - [ ] t078 Add Lumen subagent for AI-powered git diffs and commit generation #tools #git #code-review ~20m (ai:15m read:5m) logged:2026-01-23 ref:https://github.com/jnsahaj/lumen - Notes: Lumen (1.8k stars, Rust, MIT) - Beautiful git diff viewer + AI commit messages + change explanations + git command generation from CLI. Install: `brew install jnsahaj/lumen/lumen` or `cargo install lumen`. Supports OpenAI, Claude, Gemini, Groq, DeepSeek, xAI, Ollama, OpenRouter, Vercel AI Gateway. Config: `~/.config/lumen/lumen.config.json`. Key commands: `lumen diff` (visual diff), `lumen draft` (commit msg), `lumen explain` (change summary), `lumen operate` (natural language git commands). Create subagent at tools/git/lumen.md covering: API key setup (reuse existing keys from mcp-env.sh or per-provider env vars), when to use (pre-commit review, PR diffs, understanding AI-generated changes), integration with aidevops git workflow. 
- [ ] t074 Review DocStrange for document structured data extraction #research #tools #document-extraction ~30m (ai:20m read:10m) logged:2026-01-25 ref:https://github.com/NanoNets/docstrange - - Notes: NanoNets DocStrange - document structured data extraction tool. Evaluate for: 1) Integration with existing document-extraction workflow (t073) 2) Comparison with Docling/ExtractThinker/Unstract 3) Potential as alternative or complement to current tools 4) Local vs cloud processing options 5) Output format compatibility with aidevops pipelines. + - Notes: NanoNets DocStrange (1.3k stars, actively maintained). Now has built-in MCP server for Claude Desktop, supports local GPU processing, handles PDF/DOCX/PPTX/XLSX/images/URLs with structured JSON extraction via schema. Single pip install. Significantly simpler than the Docling+ExtractThinker+Presidio stack in t073. Evaluate: 1) Could replace or complement t073's stack, 2) MCP server integration with aidevops, 3) Local vs cloud processing modes, 4) Schema-based extraction quality vs ExtractThinker contracts. - [ ] t075 Content Calendar Workflow subagent #content #seo #planning ~2h (ai:1.5h test:30m) logged:2026-01-25 ref:t037 - Notes: Inspired by ALwrity (see todo/research/alwrity-review.md). Create tools/content/content-calendar.md subagent. Features: AI-powered content gap analysis, topic suggestions based on keyword research, scheduling across platforms, content lifecycle tracking. Integrate with keyword-research.md and google-search-console.md. - [ ] t076 Platform Persona Adaptations for content guidelines #content #marketing ~1h (ai:45m test:15m) logged:2026-01-25 ref:t037 - Notes: Inspired by ALwrity persona system. Extend content/guidelines.md with platform-specific sections for LinkedIn, Instagram, YouTube. Define voice, tone, structure, and best practices per platform. 
- [ ] t077 LinkedIn Content Subagent #tools #social-media ~1h (ai:45m test:15m) logged:2026-01-25 ref:t037 - Notes: Create tools/social-media/linkedin.md. Support post types: text posts, articles, carousels, documents. Follow bird.md pattern. Include LinkedIn-specific best practices (hashtags, timing, engagement). -- [ ] t080 Set up Pipecat + NVIDIA Nemotron voice agents for OpenClaw realtime calls #tools #voice #ai #agents ~6h (ai:3h test:2h read:1h) logged:2026-01-26 related:t071,t046 ref:https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/,https://github.com/pipecat-ai/nemotron-january-2026/ - - Notes: Set up and test Pipecat AI framework with NVIDIA Nemotron open models for building realtime voice agents. Two use cases: 1) Integrate with OpenClaw (t046) for realtime voice call assistance via WhatsApp/Telegram/phone - enabling hands-free AI help during development, debugging, and DevOps tasks. 2) Customer service voice agents for websites and software apps - automated phone/voice support using Pipecat pipelines with Daily.co WebRTC transport. Stack: Pipecat (Python framework for voice/multimodal agents), NVIDIA Nemotron models (open-weight LLMs optimized for agentic tasks), Daily.co (WebRTC transport layer). Steps: clone nemotron-january-2026 repo, configure NVIDIA API keys, test basic voice pipeline, integrate with OpenClaw gateway for messaging platform voice calls, build customer service agent template with configurable personas and knowledge bases. +- [ ] t080 Set up cloud voice agents and S2S models (GPT-4o-Realtime, MiniCPM-o, Nemotron) #tools #voice #ai #agents ~6h (ai:3h test:2h read:1h) logged:2026-01-26 related:t071,t046 ref:https://www.daily.co/blog/building-voice-agents-with-nvidia-open-models/,https://github.com/pipecat-ai/nemotron-january-2026/,https://github.com/OpenBMB/MiniCPM-o + - Notes: Set up and test Pipecat AI framework with multiple S2S providers for building realtime voice agents. 
**Cloud S2S providers:** GPT-4o-Realtime (OpenAI, most mature Pipecat S2S), AWS Nova Sonic, Gemini Multimodal Live, Ultravox. **NVIDIA Nemotron:** Cloud-only via NVIDIA API (requires NVIDIA GPU for local; use free cloud credits for low usage). **Local S2S:** MiniCPM-o 4.5 (23k stars, Apache-2.0, 9B params) - runs on Mac via llama.cpp-omni, full-duplex voice+vision+text, WebRTC demo available. Also MiniCPM-o 2.6 for lighter use. Two use cases: 1) Integrate with OpenClaw (t046) for realtime voice calls via WhatsApp/Telegram/phone. 2) Customer service voice agents with Daily.co WebRTC transport. Steps: configure GPT-4o-Realtime S2S first, test MiniCPM-o local pipeline on Mac, test Nemotron via NVIDIA cloud API, integrate with OpenClaw gateway, build customer service agent template with configurable personas. - [ ] t081 Set up Pipecat local voice agent with Soniox STT + Cartesia TTS + OpenAI/Anthropic LLM #tools #voice #ai #agents ~4h (ai:2h test:1.5h read:30m) logged:2026-01-26 related:t080,t071,t072 ref:https://github.com/kwindla/macos-local-voice-agents,https://github.com/pipecat-ai/pipecat,https://www.pipecat.ai/,https://soniox.com/,https://cartesia.ai/sonic,https://github.com/daily-co/nimble-pipecat,https://github.com/pipecat-ai/voice-ui-kit - Notes: Set up and test Pipecat voice agent pipeline locally on macOS using kwindla/macos-local-voice-agents as reference. **Services stack:** STT: Soniox (https://docs.pipecat.ai/server/services/stt/soniox), TTS: Cartesia Sonic (https://docs.pipecat.ai/server/services/tts/cartesia), LLM: OpenAI (https://docs.pipecat.ai/server/services/llm/openai) and Anthropic (https://docs.pipecat.ai/server/services/llm/anthropic), S2S: OpenAI speech-to-speech (https://docs.pipecat.ai/server/services/s2s/openai). **Local vs Cloud:** Configure to support both local models (whisper.cpp, llama.cpp, local TTS) and cloud APIs (Soniox, Cartesia, OpenAI, Anthropic) with easy switching. 
**Steps:** 1) Clone pipecat-ai/pipecat and kwindla/macos-local-voice-agents repos. 2) Set up Python env with pipecat dependencies. 3) Configure API keys for Soniox, Cartesia, OpenAI, Anthropic. 4) Test basic voice pipeline: mic input -> Soniox STT -> OpenAI/Anthropic LLM -> Cartesia TTS -> speaker output. 5) Test OpenAI S2S mode. 6) Configure local fallback models for offline use. 7) Document setup in tools/voice/ subagent. **Demo quality:** Soniox + OpenAI + Cartesia Sonic sounded excellent in pipecat.ai demo. - [x] t079 Consolidate Plan+ and AI-DevOps into Build+ #refactor #agents #architecture ~4h actual:2h (ai:3h test:1h) logged:2026-01-25 started:2026-01-25T19:42Z completed:2026-01-25 @@ -291,41 +300,35 @@ Tasks with no open blockers - ready to work on. Use `/ready` to refresh this lis - [x] t067 Optimise OpenCode MCP loading with on-demand activation #opencode #performance #mcp ~4h (ai:2h test:1h read:1h) logged:2026-01-21 blocked-by:t056 started:2026-01-21T06:15Z completed:2026-01-21 actual:30m - Notes: Implemented on-demand MCP loading pattern. Updated generate-opencode-agents.sh to sync MCP index on agent generation. Added MCP On-Demand Loading section to AGENTS.md. Pattern: MCPs disabled globally, enabled per-subagent via frontmatter, discoverable via mcp-index-helper.sh search. 
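The voice loop described in t080/t081 above (mic → STT → LLM → TTS → speaker, with local/cloud switching) can be sketched in plain Python with stubbed services. This is not the Pipecat API — the service bodies and the `use_local` switch are illustrative placeholders for Soniox/whisper.cpp, OpenAI/Anthropic/llama.cpp, and Cartesia Sonic/Piper:

```python
# Minimal sketch of the STT -> LLM -> TTS cascade from t080/t081.
# All three services are stubs; real implementations would call the
# cloud APIs or local models named in the task notes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    stt: Callable[[bytes], str]   # audio frames -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio frames

    def turn(self, mic_audio: bytes) -> bytes:
        """One conversational turn: mic input in, speaker audio out."""
        transcript = self.stt(mic_audio)
        reply = self.llm(transcript)
        return self.tts(reply)

def build_pipeline(use_local: bool) -> VoicePipeline:
    """Switch between a local stack and a cloud stack (t081 step 6)."""
    if use_local:
        # Offline fallback: whisper.cpp + llama.cpp + Piper (stubbed here).
        return VoicePipeline(
            stt=lambda audio: "local transcript",
            llm=lambda text: f"local reply to: {text}",
            tts=lambda text: text.encode(),
        )
    # Cloud default: Soniox + OpenAI/Anthropic + Cartesia Sonic (stubbed here).
    return VoicePipeline(
        stt=lambda audio: "cloud transcript",
        llm=lambda text: f"cloud reply to: {text}",
        tts=lambda text: text.encode(),
    )

pipeline = build_pipeline(use_local=True)
audio_out = pipeline.turn(b"\x00\x01")  # fake mic frames
```

An S2S provider (GPT-4o-Realtime, Nova Sonic, MiniCPM-o) would collapse all three callables into a single audio-in/audio-out service, which is why the notes prefer S2S when available.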
-<<<<<<< HEAD - diff --git a/todo/PLANS.md b/todo/PLANS.md index 4232f214..ab1fff92 100644 --- a/todo/PLANS.md +++ b/todo/PLANS.md @@ -892,7 +892,7 @@ disc001,p009,Implementation faster than estimated,All core functionality already p009,beads-sync-helper.sh; todo-ready.sh; beads.md subagent; blocked-by/blocks syntax; hierarchical IDs; TOON schema; setup.sh integration; AGENTS.md docs,Robust sync script; comprehensive docs; seamless integration,Add optional UI installation to setup.sh,2d,1.5d,-25,1 --> - ### [2026-02-05] MCP Auto-Installation in setup.sh @@ -2334,6 +2336,294 @@ d046,p017,Worktree isolation for all changes,Easy rollback and doesn't affect ma --- +### [2026-02-05] SEO Tool Subagents Sprint + +**Status:** Planning +**Estimate:** ~1.5d (ai:1d test:4h read:2h) + + + +#### Purpose + +Batch-create 12 SEO tool subagents (t083-t094) in a single sprint. All follow an identical pattern: create a markdown subagent with API docs, install commands, usage examples, and integration notes. The existing 16 SEO subagents in `seo/` provide perfect templates. + +**Estimated total:** ~11.5h across 12 tasks, but parallelizable to ~4-5h actual since they follow the same pattern and an AI agent can generate multiple in a single session. 
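Since all 12 subagents follow one creation pattern, the batch step can be sketched as a template loop. The template fields, slugs, and output paths below are illustrative assumptions, not the repo's actual layout:

```python
# Hypothetical batch generator for the t083-t094 subagent stubs.
# Template structure and file names are assumptions for illustration;
# real stubs would follow the existing seo/ subagent format.
from pathlib import Path

TEMPLATE = """\
# {name} Subagent

## Quick Reference
- Purpose: {purpose}
- Phase: {phase}

## Setup
(API keys, install commands - fill in from the tool's docs)
"""

TOOLS = [
    ("bing-webmaster", "Bing Webmaster Tools", 1),
    ("webpagetest", "WebPageTest API integration", 1),
    ("screaming-frog", "Screaming Frog CLI (license required)", 3),
]

def generate_stubs(out_dir: Path) -> list[Path]:
    """Write one markdown stub per tool and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, purpose, phase in TOOLS:
        path = out_dir / f"{slug}.md"
        path.write_text(TEMPLATE.format(name=purpose, purpose=purpose, phase=phase))
        written.append(path)
    return written
```

Running the loop once per phase keeps the sprint's "generate multiple in a single session" parallelization while leaving the per-tool API details to be filled in manually.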
+ +#### Context from Discussion + +**Corrections identified during audit (2026-02-05):** + +| Task | Issue | Fix | +|------|-------|-----| +| t084 Rich Results Test | Google deprecated the standalone API | Use URL-based testing only; document browser automation approach | +| t086 Screaming Frog | CLI requires paid license ($259/yr) | Document free tier limits (500 URLs); note license requirement | +| t088 Sitebulb | No public API or CLI exists | Change scope to "document manual workflow" or decline | +| t089 ContentKing | Acquired by Conductor in 2022 | Verify post-acquisition API status; may need different endpoint | +| t087 Semrush | API has pricing tiers | Document free tier (10 requests/day) and paid tiers | + +#### Progress + +- [ ] (2026-02-05) Phase 1: API-based subagents (7 tasks, ~6h) ~6h + - t083 Bing Webmaster Tools - API key from Bing portal, URL submission, indexation, analytics + - t084 Rich Results Test - URL-based testing (API deprecated), browser automation for validation + - t085 Schema Validator - schema.org validator + Google structured data testing tool + - t087 Semrush - API integration, note pricing tiers (free: 10 req/day) + - t090 WebPageTest - API integration, differentiate from existing pagespeed.md + - t092 Schema Markup - JSON-LD templates for Article, Product, FAQ, HowTo, Organization, LocalBusiness + - t094 Analytics Tracking - GA4 setup, event tracking, UTM parameters, attribution + +- [ ] (2026-02-05) Phase 2: Workflow-based subagents (3 tasks, ~4h) ~4h + - t091 Programmatic SEO - Template engine decision, keyword clustering, internal linking automation + - t093 Page CRO - A/B testing setup, CTA optimization, landing page best practices + - t089 ContentKing/Conductor - Verify API status post-acquisition, real-time SEO monitoring + +- [ ] (2026-02-05) Phase 3: Special cases + integration (2 tasks, ~2h) ~2h + - t086 Screaming Frog - Document CLI with license requirement, free tier limits (500 URLs) + - t088 Sitebulb - Document manual 
workflow only (no API/CLI exists), or decline + - Update subagent-index.toon with all new subagents + - Update seo.md main agent with references to new subagents + + + +#### Decision Log + +- **Decision:** Batch all 12 SEO tasks into a single sprint + **Rationale:** All follow identical subagent creation pattern; existing 16 SEO subagents provide templates. Parallelizable to ~4-5h actual. + **Date:** 2026-02-05 + +- **Decision:** t088 (Sitebulb) scope changed to manual workflow documentation + **Rationale:** Sitebulb has no public API or CLI. Desktop-only application. + **Date:** 2026-02-05 + + + +#### Surprises & Discoveries + +(To be populated during implementation) + + + +#### Related Tasks + +| Task | Description | Phase | +|------|-------------|-------| +| t083 | Bing Webmaster Tools | 1 | +| t084 | Rich Results Test | 1 | +| t085 | Schema Validator | 1 | +| t086 | Screaming Frog | 3 | +| t087 | Semrush | 1 | +| t088 | Sitebulb | 3 | +| t089 | ContentKing/Conductor | 2 | +| t090 | WebPageTest | 1 | +| t091 | Programmatic SEO | 2 | +| t092 | Schema Markup | 1 | +| t093 | Page CRO | 2 | +| t094 | Analytics Tracking | 1 | + +--- + +### [2026-02-05] Voice Integration Pipeline + +**Status:** Planning +**Estimate:** ~3d (ai:1.5d test:1d read:0.5d) + + + +#### Purpose + +Create a comprehensive voice integration for aidevops supporting both local and cloud-based speech capabilities. This enables hands-free AI interaction via voice-to-text, text-to-speech, and full speech-to-speech conversation loops with OpenCode. + +**Dual-track philosophy:** Every voice capability should have both a local option (privacy, offline, no cost) and an API option (higher quality, lower latency, easier setup). Users choose based on their needs. + +**Key capabilities:** +1. **Transcription** (audio/video → text) - Local: Whisper/faster-whisper. API: Groq, ElevenLabs Scribe, Deepgram, Soniox +2. **TTS** (text → speech) - Local: Qwen3-TTS, Piper. 
API: Cartesia Sonic, ElevenLabs, OpenAI TTS
+3. **STT** (realtime speech → text) - Local: Whisper.cpp. API: Soniox, Deepgram, Google
+4. **S2S** (speech → speech, no intermediate text) - API: OpenAI Realtime, AWS Nova Sonic, Gemini Multimodal Live, Ultravox
+5. **Voice agent pipeline** - Pipecat framework orchestrating STT+LLM+TTS or S2S
+6. **Dispatch shortcuts** - macOS/iOS shortcuts for voice-triggered OpenCode commands
+
+#### Context from Discussion
+
+**Pipecat ecosystem (v0.0.101, 10.2k stars, Feb 2026):**
+- Python framework for voice/multimodal AI agents
+- 50+ service integrations (STT, TTS, LLM, S2S, transport)
+- Daily.co WebRTC transport for real-time audio
+- S2S support: OpenAI Realtime, AWS Nova Sonic, Gemini Multimodal Live, Grok Voice Agent, Ultravox
+- Voice UI Kit for web-based voice interfaces
+
+**Local model options:**
+- **Qwen3-TTS** (0.6B/1.7B, Apache-2.0): 10 languages, voice clone/design, streaming, vLLM support
+- **Piper** (MIT): Fast local TTS, many voices, low resource usage
+- **Whisper Large v3 Turbo** (1.5GB): Best accuracy/speed tradeoff for local transcription
+- **faster-whisper**: CTranslate2-optimized Whisper, 4x faster than original
+
+**Task sequencing:**
+
+| Phase | Tasks | Dependency | Rationale |
+|-------|-------|------------|-----------|
+| 1 | t072 Transcription | None | Foundation - most broadly useful |
+| 2 | t071 TTS/STT Models | None (parallel with Phase 1) | Model catalog for other phases |
+| 3 | t081 Local Pipecat | t071, t072 | Local voice agent pipeline |
+| 4 | t080 Cloud voice agents and S2S | t081 | Cloud S2S providers plus local MiniCPM-o option |
+| 5 | t114 OpenCode bridge | t081 | Connect voice pipeline to AI |
+| 6 | t112, t113 Shortcuts | t114 | Quick dispatch from desktop/mobile |
+
+#### Progress
+
+- [ ] (2026-02-05) Phase 1: Transcription subagent (t072) ~6h
+ - Create `tools/voice/transcription.md` subagent
+ - Create `scripts/transcription-helper.sh` (transcribe, models, configure)
+ - Document local models: Whisper 
Large v3 Turbo (recommended), faster-whisper, NVIDIA Parakeet + - Document cloud APIs: Groq Whisper, ElevenLabs Scribe v2, Deepgram Nova, Soniox + - Support inputs: YouTube (yt-dlp), URLs, local audio/video files + - Output formats: plain text, SRT, VTT + +- [ ] (2026-02-05) Phase 2: Voice AI models catalog (t071) ~4h + - Create `tools/voice/voice-models.md` subagent + - Document TTS options: local (Qwen3-TTS, Piper) vs API (Cartesia Sonic, ElevenLabs, OpenAI) + - Document STT options: local (Whisper.cpp, faster-whisper) vs API (Soniox, Deepgram) + - Document S2S options: OpenAI Realtime, AWS Nova Sonic, Gemini Multimodal Live, Ultravox + - Include model selection guide (quality vs speed vs cost vs privacy) + - GPU requirements and benchmarks for local models + +- [ ] (2026-02-05) Phase 3: Local Pipecat voice agent (t081) ~4h + - Create `tools/voice/pipecat.md` subagent + - Create `scripts/pipecat-helper.sh` (setup, start, stop, configure) + - Document pipeline: Mic → STT → LLM → TTS → Speaker + - Support both STT+LLM+TTS pipeline and S2S mode (OpenAI Realtime) + - Configure local fallback: Whisper.cpp + llama.cpp + Piper for offline use + - Configure cloud default: Soniox + OpenAI/Anthropic + Cartesia Sonic + - Test on macOS using kwindla/macos-local-voice-agents as reference + +- [ ] (2026-02-05) Phase 4: Cloud voice agents and S2S models (t080) ~6h + - Extend pipecat.md with cloud S2S provider configurations + - **S2S providers (no separate STT/TTS needed):** GPT-4o-Realtime (OpenAI), AWS Nova Sonic, Gemini Multimodal Live, Ultravox + - **NVIDIA Nemotron:** Cloud-only via NVIDIA API (requires NVIDIA GPU for local; use cloud credits for low usage). Clone pipecat-ai/nemotron-january-2026 repo + - **Local S2S alternative:** MiniCPM-o 4.5 (23k stars, Apache-2.0, 9B params) - runs on Mac via llama.cpp-omni, supports full-duplex voice+vision+text, WebRTC demo available. 
Also MiniCPM-o 2.6 for lighter-weight local use + - Test voice pipeline with Daily.co WebRTC transport + - Build customer service agent template with configurable personas + - Document integration with OpenClaw for messaging platform voice calls + +- [ ] (2026-02-05) Phase 5: OpenCode voice bridge (t114) ~4h + - Create `tools/voice/pipecat-opencode.md` subagent + - Pipeline: Mic → Soniox STT → OpenCode API → Cartesia TTS → Speaker + - Use OpenCode server API for prompt submission and response streaming + - Support session continuity (resume voice conversation) + - Handle long responses (streaming TTS as text arrives) + +- [ ] (2026-02-05) Phase 6: Voice dispatch shortcuts (t112, t113) ~2h + - Create `tools/voice/voiceink-shortcut.md` (macOS) + - Create `tools/voice/ios-shortcut.md` (iPhone) + - macOS: VoiceInk transcription → Shortcut → HTTP POST to OpenCode → response + - iOS: Dictate → HTTP POST to OpenCode (via Tailscale) → Speak response + - Include AppleScript/Shortcuts app instructions + + + +#### Decision Log + +- **Decision:** Dual-track local + API for every capability + **Rationale:** Privacy-sensitive users need local options; quality-focused users need cloud APIs. Both must be first-class. + **Date:** 2026-02-05 + +- **Decision:** Pipecat as the orchestration framework + **Rationale:** 10.2k stars, 50+ service integrations, Python, actively maintained, S2S support. No viable alternative at this scale. + **Date:** 2026-02-05 + +- **Decision:** Whisper Large v3 Turbo as default local transcription model + **Rationale:** Best accuracy/speed tradeoff (9.7 accuracy, 7.5 speed). Half the size of Large v3 (1.5GB vs 2.9GB) with near-identical accuracy. + **Date:** 2026-02-05 + +- **Decision:** S2S as preferred mode when available + **Rationale:** OpenAI Realtime, AWS Nova Sonic, and Gemini Multimodal Live provide lower latency and more natural conversation than STT+LLM+TTS pipeline. Fall back to pipeline when S2S unavailable. 
+ **Date:** 2026-02-05 + + + +#### Surprises & Discoveries + +- **Observation:** Pipecat v0.0.101 now supports 5 S2S providers natively + **Evidence:** OpenAI Realtime, AWS Nova Sonic, Gemini Multimodal Live, Grok Voice Agent, Ultravox all documented in pipecat.ai/docs + **Impact:** Simplifies t081 significantly - S2S may replace STT+LLM+TTS for cloud use + **Date:** 2026-02-05 + +- **Observation:** MiniCPM-o 4.5 (23k stars, Apache-2.0) provides local full-duplex S2S on Mac + **Evidence:** 9B param model runs via llama.cpp-omni with WebRTC demo. Supports simultaneous vision+audio+text. Approaches Gemini 2.5 Flash quality. + **Impact:** Provides a strong local S2S alternative to cloud-only options. NVIDIA Nemotron requires NVIDIA GPU locally but MiniCPM-o runs on Mac/CPU. + **Date:** 2026-02-05 + +- **Observation:** GPT-4o-Realtime is the most mature S2S option via Pipecat + **Evidence:** First S2S provider supported by Pipecat, well-documented, lowest latency + **Impact:** Recommended as default cloud S2S provider for Phase 4 + **Date:** 2026-02-05 + + + +#### Files to Create + +| File | Purpose | Phase | +|------|---------|-------| +| `tools/voice/transcription.md` | Transcription subagent | 1 | +| `scripts/transcription-helper.sh` | Transcription CLI | 1 | +| `tools/voice/voice-models.md` | Voice AI model catalog | 2 | +| `tools/voice/pipecat.md` | Pipecat voice agent subagent | 3 | +| `scripts/pipecat-helper.sh` | Pipecat CLI | 3 | +| `tools/voice/pipecat-opencode.md` | OpenCode voice bridge | 5 | +| `tools/voice/voiceink-shortcut.md` | macOS voice shortcut | 6 | +| `tools/voice/ios-shortcut.md` | iPhone voice shortcut | 6 | + +#### Files to Modify + +| File | Changes | Phase | +|------|---------|-------| +| `subagent-index.toon` | Add voice subagents | 1-6 | +| `AGENTS.md` | Add voice integration to progressive disclosure table | 6 | +| `README.md` | Update Voice Integration section | 6 | + +#### Related Tasks + +| Task | Description | Phase | 
+|------|-------------|-------|
+| t072 | Audio/Video Transcription subagent | 1 |
+| t071 | Voice AI models catalog | 2 |
+| t081 | Local Pipecat voice agent | 3 |
+| t080 | Cloud voice agents and S2S models | 4 |
+| t114 | Pipecat-OpenCode bridge | 5 |
+| t112 | VoiceInk macOS shortcut | 6 |
+| t113 | iPhone voice shortcut | 6 |
+| t027 | hyprwhspr Linux STT (related) | - |
+
+---
+
## Completed Plans

### [2025-12-21] Beads Integration for aidevops Tasks & Plans ✓