diff --git a/docs/design/gemma4-audio-input.md b/docs/design/gemma4-audio-input.md
new file mode 100644
index 00000000..f5a284cd
--- /dev/null
+++ b/docs/design/gemma4-audio-input.md
@@ -0,0 +1,167 @@

# Design: Audio Input for Chat Completions

## Problem

Gemma 4 is Google's omnimodal model family supporting text, image, video, and audio input. vllm-mlx already supports text, image, and video input plus tool calling; audio is the last missing input modality. This document captures the design decisions, model capability findings, and architecture tradeoffs for adding audio input to vllm-mlx chat completions.

## Scope

This feature adds **audio as an input modality to `/v1/chat/completions`** via the OpenAI `input_audio` content type and the `audio_url` extension. It is NOT:

- **STT (speech-to-text)**: vllm-mlx already has `/v1/audio/transcriptions` backed by dedicated models (Whisper, Parakeet). These are optimized for transcription accuracy, handle long audio (hours), and produce structured text with timestamps.
- **TTS (text-to-speech)**: vllm-mlx already has `/v1/audio/speech` backed by dedicated models (Kokoro, Chatterbox). These generate speech output with voice selection and prosody control.

What we built is the omnimodal case: the LLM itself "hears" the audio and responds, much as it "sees" images. The model processes audio through its native audio encoder (a Conformer) rather than delegating to a specialized transcription model.

### When to use each

| Capability | Tool | Use case |
|-----------|------|----------|
| Transcribe audio to text | Whisper STT (`/v1/audio/transcriptions`) | Meeting transcription, captioning, voice-to-text input |
| Generate speech from text | Kokoro TTS (`/v1/audio/speech`) | Reading text aloud, voice assistants, accessibility |
| Understand audio content | Gemma 4 E4B chat (`/v1/chat/completions` with `input_audio`) | "What language is this?", "Is the speaker angry?", "Summarize and suggest a reply" |

The key distinction: Whisper turns audio into text; Gemma 4 E4B turns audio into *understanding*. For a voice pipeline (e.g., OpenClaw), the architecture would chain Whisper STT (reliable transcription of user speech) -> LLM reasoning (Gemma 4 or Qwen3) -> Kokoro TTS (speak the response back). Gemma 4 E4B's audio input is for when you need the model to reason about audio content directly, not just transcribe it. A minimal request example follows.
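To make the request shape concrete, here is a minimal client-side sketch of the `input_audio` path. It assumes a vllm-mlx server listening on `localhost:8000` with an audio-capable model such as `mlx-community/gemma-4-e4b-it-8bit` loaded; the port, model identifier, and file name are placeholders, and the `audio_url` variant would pass a URL instead of base64 data.

```python
# Sketch only: send a local WAV clip to /v1/chat/completions as an input_audio
# content part. Server address, model name, and file name are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Read and base64-encode a short clip (the audio-capable model caps input
# around 30 seconds of 16 kHz audio; see the capability notes below).
with open("voicemail.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="mlx-community/gemma-4-e4b-it-8bit",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Is the speaker angry? Summarize and suggest a reply.",
                },
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```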
## Model Capabilities

### Gemma 4 31B IT (text + vision only)

- **Zero audio weights** in both `google/gemma-4-31b-it` and `mlx-community/gemma-4-31b-it-8bit`
- No `audio_config` in config.json
- `audio_tower = None`, `embed_audio = None`
- This is a Google design choice, not a quantization issue -- the original Google model also has no audio weights
- Audio weights cannot be loaded separately; the model architecture was trained without them
- Suitable for: text + vision + tool calling

### Gemma 4 E4B IT (text + vision + audio)

- **754 audio weight keys** in `mlx-community/gemma-4-e4b-it-8bit`
- Full `AudioEncoder` (Conformer, 12 layers), `MultimodalEmbedder`, `Gemma4AudioFeatureExtractor`
- Audio processing pipeline: WAV -> mel spectrogram (128 bins) -> Conformer encoder -> projection -> scatter into embeddings at `<|audio|>` token positions
- Max audio duration: 30 seconds at 16kHz (480,000 samples)
- 4B parameters -- much smaller than 31B, less capable for coding/reasoning/tool use
- Suitable for: audio understanding, multimodal reasoning about sound

### Implications for OpenClaw

Gemma 4 31B cannot replace Qwen3-Coder for OpenClaw if audio support is required: the 31B model handles text + vision + tool calling but has no audio capability. If audio understanding is needed:

- Keep Qwen3-Coder (or Gemma 4 31B) on port 8000 for primary coding/reasoning
- Run Gemma 4 E4B on a separate port as a dedicated audio understanding service
- For reliable STT, use Whisper (faster, handles long audio, structured output)

Additional blockers for Gemma 4 31B replacing Qwen3:

- Tool parser PR (#254) not merged upstream -- manual install required
- Continuous batching mode untested with Gemma 4 vision + tool calling
- Model size: 31B 8-bit (~31GB) vs Qwen3-Coder 4-bit (~15GB)

## Architecture

### Data Flow

```
HTTP Request (JSON with audio_url or input_audio content parts)
  -> server.py: validate audio capability (reject non-MLLM, reject batched)
  -> server.py: detect media, build chat_kwargs with audio
  -> SimpleEngine.chat() / stream_chat() via **kwargs
  -> MLLM.chat() / stream_chat():
       - _collect_audio_inputs(): extract URLs, decode base64 to temp files
       - Build chat_messages with {"type": "audio"} markers
       - get_chat_template() renders