diff --git a/.agents/subagent-index.toon b/.agents/subagent-index.toon index 4459874948..a918c53590 100644 --- a/.agents/subagent-index.toon +++ b/.agents/subagent-index.toon @@ -45,7 +45,7 @@ tools/document/,Document extraction - conversion structured data extraction and tools/ocr/,OCR and text extraction - local document processing via Ollama,glm-ocr tools/video/,Video creation and downloading - programmatic generation and YouTube downloads,remotion|higgsfield|yt-dlp|video-prompt-design tools/vision/,Vision AI - image generation understanding and editing with local and cloud models,overview|image-generation|image-understanding|image-editing -tools/voice/,Voice AI - TTS/STT model catalog voice bridge and Pipecat pipeline for talking to AI agents,voice-models|speech-to-speech|voice-bridge|hyprwhspr|pipecat-opencode|transcription|buzz +tools/voice/,Voice AI - TTS/STT/S2S model catalog voice bridge Pipecat pipeline and cloud voice agents,voice-models|voice-ai-models|speech-to-speech|cloud-voice-agents|voice-bridge|hyprwhspr|pipecat-opencode|transcription|buzz tools/data-extraction/,Data extraction - scraping business data,outscraper tools/deployment/,Deployment automation - self-hosted PaaS and orchestration,coolify|coolify-cli|vercel|cloudron-app-packaging|uncloud tools/git/,Git operations - GitHub/GitLab/Gitea CLIs and diff tools,github-cli|gitlab-cli|gitea-cli|github-actions|worktrunk|lumen|jujutsu|conflict-resolution diff --git a/.agents/tools/voice/cloud-voice-agents.md b/.agents/tools/voice/cloud-voice-agents.md new file mode 100644 index 0000000000..e4e14a3fac --- /dev/null +++ b/.agents/tools/voice/cloud-voice-agents.md @@ -0,0 +1,438 @@ +--- +description: "Cloud voice agents - deploy S2S voice agents using GPT-4o Realtime, MiniCPM-o, and NVIDIA Nemotron Speech" +mode: subagent +tools: + read: true + write: false + edit: false + bash: true + glob: true + grep: true + webfetch: true + task: true +--- + +# Cloud Voice Agents + + + +## Quick Reference + +- **Purpose**: Deploy speech-to-speech voice agents in the cloud using leading S2S models +- **Models**: GPT-4o Realtime (OpenAI), MiniCPM-o 2.6 (open weights), NVIDIA Nemotron Speech (Riva NIM) +- **Frameworks**: Pipecat (recommended), OpenAI Agents SDK, custom WebSocket/WebRTC +- **Local S2S pipeline**: `tools/voice/speech-to-speech.md` (cascaded VAD+STT+LLM+TTS) +- **Pipecat integration**: `tools/voice/pipecat-opencode.md` (real-time voice bridge) +- **Model catalog**: `tools/voice/voice-ai-models.md` (full model comparison) + +**When to use**: Building production voice agents, phone bots, customer service agents, or real-time conversational AI that runs in the cloud. For local/development voice interaction, use `voice-helper.sh talk` instead. + + + +## Architecture Overview + +Cloud voice agents use one of two approaches: + +```text +Approach 1: Native S2S (single model) + Audio In -> [S2S Model] -> Audio Out + Examples: GPT-4o Realtime, MiniCPM-o omni mode + +Approach 2: Cascaded Pipeline (composable) + Audio In -> [STT] -> [LLM] -> [TTS] -> Audio Out + Examples: Parakeet STT + Claude + Magpie TTS (via NVIDIA Riva) +``` + +Native S2S is lower latency but less controllable. Cascaded pipelines let you swap components independently and are easier to debug. 
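To make Approach 2 concrete, here is a minimal Pipecat sketch of a cascaded pipeline. Treat it as a shape sketch, not a drop-in implementation: the service classes and module paths are assumptions based on Pipecat's documented integrations and move between releases, `transport` comes from whichever Pipecat transport you configure (WebRTC, Daily, WebSocket), and LLM context aggregation is omitted for brevity.

```python
import os

# Module paths are assumptions - verify against your installed Pipecat version.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.cartesia.tts import CartesiaTTSService

stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = AnthropicLLMService(api_key=os.getenv("ANTHROPIC_API_KEY"))
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="<voice-id>",  # placeholder - pick a voice in the Cartesia console
)

# Each stage is independently swappable (e.g., replace Deepgram with a
# self-hosted Parakeet NIM); the trade-off vs native S2S is extra hops.
pipeline = Pipeline([
    transport.input(),   # audio frames in
    stt,                 # speech -> text
    llm,                 # text -> text
    tts,                 # text -> speech
    transport.output(),  # audio frames out
])
```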
## Model Comparison

| Model | Type | Latency | VRAM | License | Languages | Best For |
|-------|------|---------|------|---------|-----------|----------|
| GPT-4o Realtime | Cloud API | ~300ms | N/A | Proprietary | 50+ | Production cloud, lowest latency |
| MiniCPM-o 2.6 | Open weights | ~500ms | 8-16GB | Apache-2.0 | EN, ZH (bilingual) | Self-hosted, privacy, multimodal |
| NVIDIA Nemotron Speech | NIM API/Self-host | ~200-400ms | Varies | Mixed | 25+ (ASR), 17+ (TTS) | Enterprise, on-prem, NVIDIA GPUs |
| Gemini 2.0 Live | Cloud API | ~350ms | N/A | Proprietary | 40+ | Google ecosystem, multimodal |
| AWS Nova Sonic | Cloud API | ~600ms | N/A | Proprietary | 7 | AWS ecosystem |

## GPT-4o Realtime

OpenAI's native speech-to-speech model, generally available (GA) as of 2025. Supports WebRTC (browser), WebSocket (server), and SIP (telephony) connections.

### Key Features

- Native audio understanding and generation (no STT/TTS intermediary)
- Emotion-aware voice output with 10 voice options (see Voices below)
- Function calling during voice conversations
- Input transcription for logging/compliance
- WebRTC for browser, WebSocket for server, SIP for VoIP telephony
- Model name: `gpt-realtime` (GA) or `gpt-4o-realtime-preview` (legacy)

### Setup

```bash
# Store API key
aidevops secret set OPENAI_API_KEY
```

### Via OpenAI Agents SDK (Recommended for Browser)

```javascript
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

const agent = new RealtimeAgent({
  name: "DevOps Assistant",
  instructions: "You are an AI DevOps assistant. Keep responses brief and spoken.",
});

const session = new RealtimeSession(agent);
// Use an ephemeral client key minted server-side; never ship a standard
// API key to the browser.
await session.connect({ apiKey: "<client-api-key>" });
```

### Via Pipecat (Recommended for Server)

```python
import os

from pipecat.services.openai_realtime.llm import OpenAIRealtimeLLMService

s2s = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview",
    voice="alloy",
)

# S2S replaces STT + LLM + TTS in the pipeline; Pipeline and transport
# come from your Pipecat pipeline setup (see the cascaded sketch above)
pipeline = Pipeline([
    transport.input(),
    s2s,
    transport.output(),
])
```

### Via WebSocket (Direct)

```python
import asyncio, json, os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"

async def main():
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    # websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(URL, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "instructions": "You are a helpful voice assistant.",
                "audio": {"output": {"voice": "marin"}},
            },
        }))

asyncio.run(main())
```

### Voices

alloy, ash, ballad, coral, echo, fable, marin, sage, shimmer, verse.

### Pricing

Audio input: ~$40/1M tokens. Audio output: ~$80/1M tokens. Cached audio input: ~$2.50/1M tokens. Roughly $0.06/min for a typical conversation.

### Docs

- API reference: https://platform.openai.com/docs/guides/realtime
- Voice agents quickstart: https://openai.github.io/openai-agents-js/guides/voice-agents/quickstart/

## MiniCPM-o 2.6

Open-weight omni-modal model (8B params) by OpenBMB. Handles vision, speech, and multimodal live streaming. End-to-end architecture: SigLip-400M + Whisper-medium-300M + ChatTTS-200M + Qwen2.5-7B.
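Before the feature breakdown and setup below, a quick pre-flight sketch: confirm you have enough VRAM and pre-download the weights so the first model load does not block on a large fetch. This assumes PyTorch and `huggingface_hub` are installed; the repo ids come from the Deployment Options table below.

```python
# Pre-flight sketch: check CUDA VRAM, then pre-fetch the checkpoint.
import torch
from huggingface_hub import snapshot_download

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"GPU VRAM: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
else:
    print("No CUDA device - the full model needs an 8GB+ GPU")

# Large download; use openbmb/MiniCPM-o-2_6-int4 if tight on VRAM or disk
snapshot_download("openbmb/MiniCPM-o-2_6")
```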
### Key Features

- End-to-end speech conversation (no separate STT/TTS pipeline)
- Bilingual real-time speech (English + Chinese)
- Configurable voices via audio system prompt
- Voice cloning from short reference audio
- Emotion/speed/style control
- Multimodal live streaming (video + audio + text simultaneously)
- Outperforms GPT-4o-realtime on audio understanding benchmarks (ASR, speech-to-text translation)
- Runs on consumer GPUs (8GB+ VRAM), iPad, or cloud

### Setup

```bash
pip install torch==2.3.1 torchaudio==2.3.1 transformers==4.44.2 \
  librosa soundfile vector-quantize-pytorch vocos decord moviepy
```

### Basic Speech Conversation

```python
import torch, librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    init_vision=False,  # speech-only mode
    init_audio=True,
    init_tts=True,
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)
model.init_tts()

# Load reference voice for configurable output
ref_audio, _ = librosa.load("reference_voice.wav", sr=16000, mono=True)
sys_prompt = model.get_sys_prompt(
    ref_audio=ref_audio, mode="audio_assistant", language="en"
)

# Speech input
user_audio, _ = librosa.load("user_question.wav", sr=16000, mono=True)
msgs = [sys_prompt, {"role": "user", "content": [user_audio]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path="response.wav",
)
```

### Streaming Mode (Low Latency)

```python
model.reset_session()
session_id = "voice-agent-001"

# Prefill system prompt
model.streaming_prefill(
    session_id=session_id, msgs=[sys_prompt], tokenizer=tokenizer
)

# Stream audio chunks and generate responses incrementally.
# audio_chunks is a placeholder: an iterable of 16kHz mono numpy arrays
# from your capture loop (e.g., one-second slices of microphone input).
for chunk in audio_chunks:
    model.streaming_prefill(
        session_id=session_id,
        msgs=[{"role": "user", "content": ["", chunk]}],
        tokenizer=tokenizer,
    )

# Generate streaming response
for r in model.streaming_generate(
    session_id=session_id, tokenizer=tokenizer,
    temperature=0.5, generate_audio=True
):
    play_audio(r.audio_wav, r.sampling_rate)  # play_audio: your audio output sink
```

### Deployment Options

| Method | Notes |
|--------|-------|
| HuggingFace Transformers | Default, see code above |
| vLLM | High-throughput server deployment |
| llama.cpp | CPU inference on edge devices |
| Ollama | `ollama run openbmb/minicpm-o2.6` |
| int4 quantized | `openbmb/MiniCPM-o-2_6-int4` (reduced VRAM) |
| GGUF | 16 quantization sizes available |

### Requirements

- Python 3.10+, PyTorch 2.3+
- CUDA GPU with 8GB+ VRAM (16GB recommended for full omni mode)
- `transformers==4.44.2` (specific version required)
- Apple Silicon: via llama.cpp (MPS not directly supported for full model)

### Docs

- GitHub: https://github.com/OpenBMB/MiniCPM-o
- HuggingFace: https://huggingface.co/openbmb/MiniCPM-o-2_6
- Ollama: https://ollama.com/openbmb/minicpm-o2.6

## NVIDIA Nemotron Speech (Riva NIM)

NVIDIA's speech AI stack for enterprise voice agents. Not a single S2S model but a composable pipeline of best-in-class ASR (Parakeet), TTS (Magpie), and NMT models deployed as NIM microservices via NVIDIA Riva.
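For a first taste of the composable approach before the full breakdown below, here is a minimal offline-transcription sketch against a self-hosted Riva/NIM ASR endpoint using the official `nvidia-riva-client` package. The gRPC port and config fields are assumptions based on Riva's documented defaults; verify against your deployment.

```python
# pip install nvidia-riva-client
import riva.client

# Assumes a Riva ASR service listening on the default gRPC port.
auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    sample_rate_hertz=16000,
    max_alternatives=1,
)

with open("audio.wav", "rb") as f:
    response = asr.offline_recognize(f.read(), config)

# Each result carries ranked alternatives; print the top transcript.
for result in response.results:
    print(result.alternatives[0].transcript)
```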
### Key Features

- **ASR**: Parakeet TDT 0.6B v2 (#1 on HuggingFace ASR leaderboard, 6.05% WER)
- **ASR multilingual**: Parakeet RNNT 1.1B (25 languages)
- **TTS**: Magpie TTS Multilingual (natural voices, 17+ languages)
- **TTS zero-shot**: Magpie TTS Zero-Shot (voice cloning from short sample)
- **Speech enhancement**: StudioVoice (noise removal, studio quality)
- **Translation**: Riva Translate (36 languages)
- Deployed as GPU-accelerated NIM microservices
- Available via NVIDIA AI Enterprise or self-hosted
- Up to 50x faster inference than comparable open ASR models (NVIDIA's Parakeet v2 claim)

### ASR Models (Nemotron Speech / Parakeet)

| Model | Params | Languages | WER | Speed (RTFx) | NIM |
|-------|--------|-----------|-----|-------------|-----|
| Parakeet TDT 0.6B v2 | 600M | English | 6.05% | 3386x | HF only |
| Parakeet CTC 1.1B | 1.1B | English | ~6.5% | Fast | Yes |
| Parakeet RNNT 1.1B | 1.1B | 25 langs | ~7% | Fast | Yes |
| Parakeet CTC 0.6B | 600M | EN, ES | ~7% | Fastest | Yes |
| Canary 1B | 1B | 4 langs | ~7% | Fast | Yes |

### TTS Models (Magpie)

| Model | Languages | Voice Clone | Streaming | NIM |
|-------|-----------|-------------|-----------|-----|
| Magpie TTS Multilingual | 17+ | No (preset voices) | Yes | Yes |
| Magpie TTS Zero-Shot | EN+ | Yes (short sample) | Yes | API |
| Magpie TTS Flow | EN+ | Yes (short sample) | Yes | API |

### Setup (NIM API)

```bash
# Store NVIDIA API key
aidevops secret set NVIDIA_API_KEY

# Test ASR via NIM API (endpoint path and model id are illustrative -
# confirm current values at https://build.nvidia.com/explore/speech)
curl -X POST "https://integrate.api.nvidia.com/v1/asr" \
  -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=nvidia/parakeet-ctc-0_6b-asr"
```

### Setup (Self-Hosted NIM)

```bash
# NIM containers require NGC auth: docker login nvcr.io with an NGC API key,
# and typically an NGC_API_KEY env var passed to the container

# Pull and run Parakeet ASR NIM container
docker run --gpus all -p 8000:8000 \
  nvcr.io/nim/nvidia/parakeet-ctc-0_6b-asr:latest

# Pull and run Magpie TTS NIM container
docker run --gpus all -p 8001:8001 \
  nvcr.io/nim/nvidia/magpie-tts-multilingual:latest
```

### Composable Voice Agent Pipeline

```text
Audio In -> [Parakeet ASR NIM] -> Text -> [LLM (Claude/GPT/Nemotron)] -> Text -> [Magpie TTS NIM] -> Audio Out
    |
[StudioVoice NIM] (optional enhancement)
```

This cascaded approach gives full control over each component. Use any LLM in the middle (Claude, GPT-4o, Llama, Nemotron).

### Via Pipecat

Recent Pipecat releases include native NVIDIA Riva STT/TTS services (check your installed version); if yours predates them, wrap the Riva gRPC API as a custom service or call the NIM REST endpoints.

### Requirements

- NVIDIA GPU (A100/H100 recommended for NIM self-hosting)
- Docker with NVIDIA Container Toolkit
- NVIDIA AI Enterprise license (for production NIM)
- Or use free API endpoints at https://build.nvidia.com/explore/speech

### Docs

- NVIDIA NIM Speech: https://build.nvidia.com/explore/speech
- Parakeet v2: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
- Riva documentation: https://docs.nvidia.com/deeplearning/riva/

## Deployment Patterns

### Pattern 1: Browser Voice Agent (WebRTC)

Best for: Customer-facing web apps, support chatbots.

```text
Browser (WebRTC) <-> OpenAI Realtime API (or Pipecat + Daily.co)
```

- Use OpenAI Agents SDK or Pipecat with SmallWebRTCTransport
- Client-side ephemeral keys for security
- No server infrastructure needed for OpenAI Realtime

### Pattern 2: Phone Bot (SIP/Twilio)

Best for: Call centers, IVR replacement, appointment booking.
+ +```text +Phone (PSTN) -> Twilio -> SIP -> OpenAI Realtime API + or -> WebSocket -> Pipecat Pipeline +``` + +- OpenAI Realtime supports direct SIP connections +- Twilio Media Streams for WebSocket-based pipelines +- See `services/communications/twilio.md` for Twilio setup + +### Pattern 3: Self-Hosted (Privacy/Compliance) + +Best for: Healthcare, finance, government, air-gapped environments. + +```text +Audio In -> [MiniCPM-o 2.6 on CUDA GPU] -> Audio Out + or +Audio In -> [Parakeet NIM] -> [Local LLM] -> [Magpie NIM] -> Audio Out +``` + +- MiniCPM-o for single-model simplicity (Apache-2.0) +- NVIDIA Riva NIM for enterprise-grade composable pipeline +- No data leaves your infrastructure + +### Pattern 4: Hybrid (Cloud LLM + Local Speech) + +Best for: Balancing cost, latency, and quality. + +```text +Audio In -> [Local Parakeet STT] -> Text -> [Cloud Claude/GPT] -> Text -> [Local Magpie TTS] -> Audio Out +``` + +- Speech processing stays local (fast, private) +- Only text hits the cloud LLM (smaller payload, lower cost) +- This is what the cascaded `speech-to-speech.md` pipeline does with `--llm open_api` + +## Cost Comparison + +| Solution | Approx. Cost/Min | Notes | +|----------|-------------------|-------| +| GPT-4o Realtime | ~$0.06 | Audio token pricing | +| Gemini 2.0 Live | ~$0.04 | Audio token pricing | +| MiniCPM-o (self-hosted) | GPU cost only | ~$0.01-0.03 on cloud GPU | +| NVIDIA Riva NIM (self-hosted) | GPU cost only | Enterprise license required | +| NVIDIA NIM API | Free tier available | Rate-limited | +| Cascaded (Groq STT + Claude + EdgeTTS) | ~$0.02 | Mix of free and paid | + +## Framework Selection + +| Framework | Best For | S2S Support | Complexity | +|-----------|----------|-------------|------------| +| **OpenAI Agents SDK** | Browser voice agents with GPT-4o | GPT-4o Realtime only | Low | +| **Pipecat** | Production multi-provider pipelines | OpenAI, Gemini, Nova Sonic | Medium | +| **voice-helper.sh** | Quick local voice interaction | No (cascaded only) | Low | +| **speech-to-speech.md** | Local/cloud cascaded pipeline | No (cascaded only) | Medium | +| **Custom WebSocket** | Full control, custom protocols | Any | High | + +## Monitoring and Observability + +For production voice agents, monitor: + +- **Latency**: Time from user speech end to first audio response byte +- **Transcription accuracy**: Log STT output for quality review +- **Turn completion rate**: Percentage of turns that complete without interruption +- **Cost per conversation**: Track token/minute usage per provider + +Use Pipecat's built-in metrics (`enable_metrics=True`) or instrument with OpenTelemetry. 
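A minimal sketch of turning those metrics on, assuming the `PipelineParams`/`PipelineTask` API from current Pipecat releases (flag names may differ in older versions):

```python
from pipecat.pipeline.task import PipelineParams, PipelineTask

# pipeline is built as in the examples above
task = PipelineTask(
    pipeline,
    params=PipelineParams(
        enable_metrics=True,        # per-service TTFB and processing time
        enable_usage_metrics=True,  # token/character usage for cost tracking
    ),
)
```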
+ +## See Also + +- `tools/voice/pipecat-opencode.md` - Pipecat pipeline setup for AI coding agents +- `tools/voice/speech-to-speech.md` - HuggingFace cascaded S2S pipeline +- `tools/voice/voice-ai-models.md` - Complete model comparison (TTS, STT, S2S) +- `tools/voice/voice-models.md` - TTS engine details and implementations +- `tools/infrastructure/cloud-gpu.md` - Cloud GPU deployment for self-hosted models +- `services/communications/twilio.md` - Phone integration diff --git a/.agents/tools/voice/pipecat-opencode.md b/.agents/tools/voice/pipecat-opencode.md index 79802c5a18..eb6e219b0c 100644 --- a/.agents/tools/voice/pipecat-opencode.md +++ b/.agents/tools/voice/pipecat-opencode.md @@ -448,6 +448,7 @@ For production deployment: ## See Also +- `tools/voice/cloud-voice-agents.md` - Cloud voice agents (GPT-4o Realtime, MiniCPM-o, NVIDIA Nemotron) - `tools/voice/speech-to-speech.md` - HuggingFace S2S pipeline (alternative approach) - `scripts/voice-helper.sh` - Simple voice bridge (existing, terminal-based) - `scripts/voice-bridge.py` - Voice bridge Python implementation diff --git a/.agents/tools/voice/speech-to-speech.md b/.agents/tools/voice/speech-to-speech.md index 2f491bf7da..fcb52529e3 100644 --- a/.agents/tools/voice/speech-to-speech.md +++ b/.agents/tools/voice/speech-to-speech.md @@ -318,6 +318,9 @@ The full S2S pipeline above is for advanced use cases (custom LLMs, server/clien ## See Also +- `tools/voice/cloud-voice-agents.md` - Cloud voice agents (GPT-4o Realtime, MiniCPM-o, NVIDIA Nemotron Speech) +- `tools/voice/voice-ai-models.md` - Complete model comparison (TTS, STT, S2S) +- `tools/voice/pipecat-opencode.md` - Pipecat real-time voice pipeline - `tools/infrastructure/cloud-gpu.md` - Cloud GPU deployment guide (provider comparison, setup, cost optimization) - `services/communications/twilio.md` - Phone integration - `tools/video/remotion.md` - Video narration diff --git a/.agents/tools/voice/voice-ai-models.md b/.agents/tools/voice/voice-ai-models.md index bda5227703..8446a05c95 100644 --- a/.agents/tools/voice/voice-ai-models.md +++ b/.agents/tools/voice/voice-ai-models.md @@ -16,6 +16,7 @@ tools: - **TTS details**: `tools/voice/voice-models.md` (implemented engines, integration) - **STT details**: `tools/voice/transcription.md` (transcription workflows, cloud APIs) - **S2S pipeline**: `tools/voice/speech-to-speech.md` (full voice pipeline setup) +- **Cloud voice agents**: `tools/voice/cloud-voice-agents.md` (GPT-4o Realtime, MiniCPM-o, Nemotron) - **Offline tool**: `tools/voice/buzz.md` (Buzz GUI/CLI for Whisper) **When to use**: Choosing between voice AI models for a project. For implementation details, follow the cross-references above. @@ -31,9 +32,10 @@ tools: | ElevenLabs | ~300ms | Best | Yes | 29 | $5-330/mo | | OpenAI TTS | ~400ms | Great | No | 57 | $15/1M chars | | Cartesia Sonic 3 | ~90ms | Great | Yes (10s ref) | 17 | $8-66/mo | +| NVIDIA Magpie TTS | ~200ms | Great | Yes (zero-shot) | 17+ | NIM API (free tier) | | Google Cloud TTS | ~200ms | Good | No (custom) | 50+ | $4-16/1M chars | -**Pick**: ElevenLabs for quality/cloning, Cartesia Sonic 3 for lowest latency, Google for language breadth. +**Pick**: ElevenLabs for quality/cloning, Cartesia Sonic 3 for lowest latency, NVIDIA Magpie for enterprise/self-hosted, Google for language breadth. 
### Local Models @@ -57,10 +59,11 @@ Also implemented in the voice bridge: **EdgeTTS** (free, 300+ voices), **macOS S |----------|-------|----------|-----------|------| | Groq | Whisper Large v3 Turbo | 9.6 | No (batch) | Free tier | | ElevenLabs | Scribe v2 | 9.9 | No | Per minute | +| NVIDIA Riva | Parakeet CTC/RNNT | 9.4-9.6 | Yes (streaming) | NIM API (free tier) | | Deepgram | Nova-2 / Nova-3 | 9.5-9.6 | Yes | Per minute | | Soniox | stt-async-v3 | 9.6 | Yes | Per minute | -**Pick**: Groq for free/fast batch, ElevenLabs Scribe for accuracy, Deepgram for real-time streaming. +**Pick**: Groq for free/fast batch, ElevenLabs Scribe for accuracy, NVIDIA Parakeet for enterprise/self-hosted, Deepgram for real-time streaming. ### Local Models @@ -81,16 +84,35 @@ Backends: `faster-whisper` (4x speed, recommended), `whisper.cpp` (C++ native, A ## S2S (Speech-to-Speech) +### Native S2S Models + End-to-end models that process speech directly without text intermediary: | Model | Type | Latency | Availability | Notes | |-------|------|---------|--------------|-------| -| GPT-4o Realtime | Cloud API | ~300ms | OpenAI API | Voice mode, emotion-aware | +| GPT-4o Realtime | Cloud API | ~300ms | OpenAI API (GA) | Voice mode, emotion-aware, function calling, SIP telephony | | Gemini 2.0 Live | Cloud API | ~350ms | Google API | Multimodal, streaming | -| MiniCPM-o 4.5 | Open weights | ~500ms | Local (8GB+) | 9B params, Apache-2.0 | +| MiniCPM-o 2.6 | Open weights | ~500ms | Local (8GB+) | 8B params, Apache-2.0, vision+speech+streaming | +| AWS Nova Sonic | Cloud API | ~600ms | AWS API | AWS ecosystem, 7 languages | | Ultravox | Open weights | ~400ms | Local (6GB+) | Audio-text multimodal | -**Pick**: GPT-4o Realtime for production cloud, MiniCPM-o 4.5 for local/private. For cascaded S2S (VAD+STT+LLM+TTS), see `speech-to-speech.md`. +### Composable S2S Pipelines (NVIDIA Nemotron Speech) + +Enterprise-grade cascaded pipelines using NVIDIA Riva NIM microservices: + +| Component | Model | Role | Languages | NIM Available | +|-----------|-------|------|-----------|---------------| +| ASR | Parakeet TDT 0.6B v2 | Speech-to-text | English | HF (research) | +| ASR | Parakeet CTC 1.1B | Speech-to-text | English | Yes | +| ASR | Parakeet RNNT 1.1B | Speech-to-text | 25 languages | Yes | +| TTS | Magpie TTS Multilingual | Text-to-speech | 17+ languages | Yes | +| TTS | Magpie TTS Zero-Shot | Voice cloning TTS | English+ | API | +| Enhancement | StudioVoice | Noise removal | Any | Yes | +| Translation | Riva Translate | NMT | 36 languages | Yes | + +Compose as: `Audio -> [Parakeet ASR] -> [Any LLM] -> [Magpie TTS] -> Audio`. See `cloud-voice-agents.md` for deployment patterns. + +**Pick**: GPT-4o Realtime for production cloud (lowest latency, GA), MiniCPM-o 2.6 for self-hosted/private (Apache-2.0, multimodal), NVIDIA Riva for enterprise on-prem (composable, 25+ languages). For cascaded S2S (VAD+STT+LLM+TTS), see `speech-to-speech.md`. 
## Model Selection Guide @@ -99,10 +121,11 @@ End-to-end models that process speech directly without text intermediary: | Priority | TTS | STT | S2S | |----------|-----|-----|-----| | **Quality** | ElevenLabs / Qwen3-TTS 1.7B | ElevenLabs Scribe / Large v3 | GPT-4o Realtime | -| **Speed** | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | Cascaded pipeline | -| **Cost** | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 4.5 (local) | -| **Privacy** | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 4.5 | -| **Voice clone** | ElevenLabs / Qwen3-TTS | N/A | N/A | +| **Speed** | Cartesia Sonic 3 / EdgeTTS | Groq / Parakeet V3 | GPT-4o Realtime / Cascaded | +| **Cost** | EdgeTTS (free) / Piper | Local Whisper ($0) / Groq free | MiniCPM-o 2.6 (local) | +| **Privacy** | Piper / Qwen3-TTS | faster-whisper / whisper.cpp | MiniCPM-o 2.6 | +| **Enterprise** | NVIDIA Magpie / ElevenLabs | NVIDIA Parakeet / Scribe | NVIDIA Riva pipeline | +| **Voice clone** | ElevenLabs / Qwen3-TTS | N/A | MiniCPM-o 2.6 | ### Decision Flow @@ -119,8 +142,9 @@ Need voice AI? │ ├── Need free? → Groq free tier (cloud) or any local model │ └── Default → Whisper Large v3 Turbo (local) └── Conversational (S2S) - ├── Cloud OK? → GPT-4o Realtime - ├── Local/private? → MiniCPM-o 4.5 or cascaded pipeline + ├── Cloud OK? → GPT-4o Realtime (see cloud-voice-agents.md) + ├── Enterprise/on-prem? → NVIDIA Riva (Parakeet + LLM + Magpie) + ├── Local/private? → MiniCPM-o 2.6 or cascaded pipeline └── Default → speech-to-speech.md cascaded pipeline ``` @@ -131,7 +155,7 @@ Need voice AI? | STT only (Whisper Turbo) | 5GB | 8GB | | TTS only (Qwen3-TTS 0.6B) | 2GB | 4GB | | TTS only (Bark) | 6GB | 8GB | -| S2S (MiniCPM-o 4.5) | 8GB | 16GB | +| S2S (MiniCPM-o 2.6) | 8GB | 16GB | | Full cascaded pipeline | 4GB | 12GB | | CPU-only (Piper + whisper.cpp) | 0 | 8GB RAM | @@ -139,8 +163,10 @@ Apple Silicon: MPS acceleration works for most PyTorch models. Use `whisper-mlx` ## Related +- `tools/voice/cloud-voice-agents.md` - Cloud voice agent deployment (GPT-4o Realtime, MiniCPM-o, Nemotron) - `tools/voice/voice-models.md` - TTS engines implemented in voice bridge - `tools/voice/transcription.md` - STT workflows, cloud API examples - `tools/voice/speech-to-speech.md` - Full cascaded voice pipeline +- `tools/voice/pipecat-opencode.md` - Pipecat real-time voice pipeline - `tools/voice/buzz.md` - Buzz offline transcription tool - `voice-helper.sh` - CLI for voice operations