diff --git a/recipes/OpenBMB/MiniCPM-o-4_5.md b/recipes/OpenBMB/MiniCPM-o-4_5.md
new file mode 100644
index 00000000000..efc1699b373
--- /dev/null
+++ b/recipes/OpenBMB/MiniCPM-o-4_5.md
@@ -0,0 +1,229 @@
+# MiniCPM-o 4.5
+
+> Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech)
+
+## Summary
+
+- Vendor: OpenBMB
+- Model: [`openbmb/MiniCPM-o-4_5`](https://huggingface.co/openbmb/MiniCPM-o-4_5)
+- Task: Omni multimodal chat — accepts text / image / audio / video input;
+  emits text and 24 kHz mono speech in the same response
+- Mode: Online serving via the OpenAI-compatible `/v1/chat/completions`
+  API, plus a bundled Gradio demo (text + speech UI)
+- Maintainer: [`@tc-mb`](https://github.com/tc-mb) (MiniCPM-V / MiniCPM-o team)
+
+## When to use this recipe
+
+Use this recipe as a known-good starting point for serving
+`openbmb/MiniCPM-o-4_5` on vLLM-Omni. MiniCPM-o 4.5 is the omni member
+of the MiniCPM-o family — it pairs a multimodal-understanding thinker
+LLM with a streaming `MiniCPMTTS + Token2Wav` talker so a single
+`/v1/chat/completions` call can return text and 24 kHz speech in one
+shot. The recipe covers three shipped GPU layouts (2 / 3 / 8 GPUs)
+selected via `--deploy-config`.
+
+## References
+
+- Default deploy configs (auto-loaded by HF `model_type=minicpmo` +
+  `hf_config.version="4.5"`):
+  - 2-GPU layout (default):
+    [`vllm_omni/deploy/minicpmo_4_5.yaml`](../../vllm_omni/deploy/minicpmo_4_5.yaml)
+  - 3-GPU layout (thinker TP=2):
+    [`vllm_omni/deploy/minicpmo_4_5_3gpu.yaml`](../../vllm_omni/deploy/minicpmo_4_5_3gpu.yaml)
+  - 8x RTX 4090 layout:
+    [`vllm_omni/deploy/minicpmo_4_5_8x4090.yaml`](../../vllm_omni/deploy/minicpmo_4_5_8x4090.yaml)
+- Online example + Gradio demo:
+  [`examples/online_serving/minicpmo/`](../../examples/online_serving/minicpmo/)
+- Pipeline / talker source:
+  [`vllm_omni/model_executor/models/minicpmo_4_5/`](../../vllm_omni/model_executor/models/minicpmo_4_5/)
+- Stage-input processor (thinker → talker bridge):
+  [`vllm_omni/model_executor/stage_input_processors/minicpmo_4_5_omni.py`](../../vllm_omni/model_executor/stage_input_processors/minicpmo_4_5_omni.py)
+- Upstream model card:
+  [`openbmb/MiniCPM-o-4_5`](https://huggingface.co/openbmb/MiniCPM-o-4_5)
+- Integration PR:
+  [vllm-project/vllm-omni#3642](https://github.com/vllm-project/vllm-omni/pull/3642)
+
+## Hardware Support
+
+Three GPU layouts ship with default deploy configs. Pick the layout that
+matches your hardware and pass it via `--deploy-config`; the talker
+(`MiniCPMTTS + Token2Wav`) always lives on its own GPU because of the
+in-process vocoder, and the thinker is the part that scales out via TP.
+
+| Layout | Thinker | Talker + Token2Wav | Typical hardware |
+| --- | --- | --- | --- |
+| 2-GPU (default) | GPU 0 | GPU 1 | 2x A100/H100 80GB or H200 141GB |
+| 3-GPU (thinker TP=2) | GPU 0,1 (TP=2) | GPU 2 | 3x mid-tier GPUs |
+| 8x RTX 4090 24GB | GPU 0–3 (TP=4) | GPU 4 | 8x RTX 4090 consumer |
+
+## GPU
+
+### 2 x GPU (default — single command)
+
+The default
+[`vllm_omni/deploy/minicpmo_4_5.yaml`](../../vllm_omni/deploy/minicpmo_4_5.yaml)
+puts the thinker on GPU 0 (`~70 %` memory, `enforce_eager: true`,
+`max_num_seqs: 1`) and the talker + Token2Wav vocoder on GPU 1
+(`~75 %` memory). This is the recommended starting layout: it works on
+any pair of GPUs with 80GB or more (A100 80GB, H100, H200) and on most
+40GB+ pairs as long as the thinker model weights fit.
+
+#### Environment
+
+- OS: Linux
+- Python: 3.10+
+- vLLM / vLLM-Omni: >= 0.21.0 (or current `main`)
+- Talker dependency (packaged as an optional extra): `stepaudio2-minicpmo`
+  (see Notes for why the talker requires it and how to install it)
+
+#### Command
+
+```bash
+vllm serve openbmb/MiniCPM-o-4_5 --omni \
+  --trust-remote-code \
+  --host 0.0.0.0 --port 8099
+```
+
+The deploy config is auto-loaded by the model registry, so no
+`--deploy-config` flag is needed for this default 2-GPU layout.
+
+#### Verification
+
+**Quick smoke test (text-only output)**:
+
+```bash
+curl http://localhost:8099/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openbmb/MiniCPM-o-4_5",
+    "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
+    "modalities": ["text"]
+  }'
+```
+
+**Text + speech in one response** (the headline 4.5 feature).
+The TTS path is gated by a Jinja flag on the chat template, passed through
+`extra_body.chat_template_kwargs.use_tts_template=true`:
+
+```bash
+curl http://localhost:8099/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openbmb/MiniCPM-o-4_5",
+    "messages": [{"role": "user", "content": "Say hello, then introduce vLLM in one sentence."}],
+    "modalities": ["text", "audio"],
+    "extra_body": {"chat_template_kwargs": {"use_tts_template": true}}
+  }'
+```
+
+The response carries text in `choices[0].message.content` and a
+base64-encoded WAV in `choices[0].message.audio.data` (24 kHz mono, see Notes).
+
+**Gradio demo (text + image + audio + video UI)**:
+
+```bash
+bash examples/online_serving/minicpmo/run_gradio_demo.sh
+# or run the python entry point directly:
+python examples/online_serving/minicpmo/gradio_demo.py \
+  --minicpmo45-api-base http://localhost:8099/v1 \
+  --minicpmo45-model openbmb/MiniCPM-o-4_5 \
+  --port 7862
+```
+
+Open `http://<host>:7862` and try a text prompt with the **"Generate
+speech output (TTS)"** checkbox on / off.
+
+#### Notes
+
+- Memory budget: thinker weights occupy GPU 0 at
+  `gpu_memory_utilization: 0.7`; talker + Token2Wav vocoder share GPU 1
+  at `0.75`.
+- `--trust-remote-code` is required — the HF repo ships a custom
+  `MiniCPMO` config / model class.
+- Pin: `enforce_eager: true` on both stages (CUDA graph capture is off
+  by design for the talker's Token2Wav path).
+- Stage 1 (talker) is hard-capped to `max_num_seqs: 1`: the talker
+  only consumes `runtime_additional_information[0]`, so any value > 1
+  makes concurrent requests share request-0's audio. This is the same
+  cap baked into the deploy config.
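
The text + speech request from the Verification section can also be issued from Python. The sketch below uses only the standard library and mirrors the curl request body, including the literal `extra_body` key. The helper names `build_payload`, `split_response`, and `chat` are illustrative and not part of vLLM-Omni; the response fields are the ones described above (`choices[0].message.content` and `choices[0].message.audio.data`).

```python
import base64
import json
import urllib.request

# Host, port, and model name are taken from the serve command in this recipe.
API_URL = "http://localhost:8099/v1/chat/completions"
MODEL = "openbmb/MiniCPM-o-4_5"


def build_payload(prompt: str, with_speech: bool) -> dict:
    """Build the same JSON body as the curl examples (illustrative helper)."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "modalities": ["text", "audio"] if with_speech else ["text"],
    }
    if with_speech:
        # Same TTS gate as in the curl example: use_tts_template=true.
        payload["extra_body"] = {
            "chat_template_kwargs": {"use_tts_template": True}
        }
    return payload


def split_response(response: dict) -> tuple[str, bytes]:
    """Return (text, wav_bytes); wav_bytes is empty for text-only replies."""
    message = response["choices"][0]["message"]
    audio = message.get("audio") or {}
    wav_bytes = base64.b64decode(audio["data"]) if audio.get("data") else b""
    return message.get("content") or "", wav_bytes


def chat(prompt: str, with_speech: bool = True, timeout: float = 300.0) -> tuple[str, bytes]:
    """POST one chat request and decode the reply (needs a running server)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt, with_speech)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return split_response(json.load(reply))
```

With the server from the Command section running, `text, wav = chat("Say hello, then introduce vLLM in one sentence.")` should return the reply text and the WAV bytes, which can then be written to a `.wav` file in binary mode.
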
+
+### 3 x GPU (thinker TP=2)
+
+Use
+[`vllm_omni/deploy/minicpmo_4_5_3gpu.yaml`](../../vllm_omni/deploy/minicpmo_4_5_3gpu.yaml)
+when you have a third GPU available and want the thinker on 2-way
+tensor parallelism for higher throughput; the talker stays on its own
+GPU (the talker has its own in-process Token2Wav vocoder, so co-locating
+it with the thinker risks OOM under load).
+
+#### Command
+
+```bash
+vllm serve openbmb/MiniCPM-o-4_5 --omni \
+  --deploy-config vllm_omni/deploy/minicpmo_4_5_3gpu.yaml \
+  --trust-remote-code \
+  --host 0.0.0.0 --port 8099
+```
+
+Verification and Notes mirror the 2-GPU section; thinker latency
+roughly halves under load thanks to TP=2.
+
+### 8 x RTX 4090 24GB (consumer-GPU layout)
+
+Use
+[`vllm_omni/deploy/minicpmo_4_5_8x4090.yaml`](../../vllm_omni/deploy/minicpmo_4_5_8x4090.yaml)
+on an 8x RTX 4090 host. The thinker uses 4-way TP across GPUs 0–3
+(`~85 %` mem each ≈ 20.4 GiB/card), and the talker + Token2Wav lives on
+GPU 4 (`~90 %` mem). GPUs 5–7 are left free.
+
+#### Command
+
+```bash
+vllm serve openbmb/MiniCPM-o-4_5 --omni \
+  --deploy-config vllm_omni/deploy/minicpmo_4_5_8x4090.yaml \
+  --trust-remote-code \
+  --host 0.0.0.0 --port 8099
+```
+
+#### Notes
+
+- `max_model_len` is capped at 4096 in this layout — 8192 still OOMs on
+  4090s. Raise it if your cards have more headroom (e.g. 4090 D /
+  custom 32 GB SKUs), but verify with a long-prompt run before
+  promoting.
+- All other knobs match the 2-GPU section; the only difference is the
+  per-card memory pressure on the thinker shards.
+
+## Notes (apply to all layouts)
+
+- **Talker dependency**: the `MiniCPM-o 4.5` talker calls
+  `from stepaudio2 import Token2wav` against the MiniCPM-o-flavored
+  vocoder (PyPI package `stepaudio2-minicpmo` — NOT the upstream
+  `stepfun-ai/Step-Audio2`, whose `Token2wav.__init__` signature
+  rejects `n_timesteps`).
+  Install via the published extra:
+
+  ```bash
+  pip install 'vllm-omni[minicpmo]'
+  ```
+
+  Equivalent direct install: `pip install stepaudio2-minicpmo`. A
+  missing dep raises `ImportError` at first request with the same
+  install hint instead of silently emitting empty audio.
+
+- **TTS trigger**: speech output is only emitted when the client passes
+  `extra_body.chat_template_kwargs.use_tts_template=true`. Without it,
+  the response is text-only (which is also faster).
+
+- **Output audio**: 24 kHz mono WAV inside the OpenAI-style
+  `message.audio.data` (base64). The Gradio demo's WAV player decodes
+  this automatically.
+
+- **Routing**: MiniCPM-o 4.5 and 2.6 both ship
+  `architectures=["MiniCPMO"]` in HF config; routing is disambiguated by
+  `hf_config.version == "4.5"` via the
+  `hf_config_predicate` on the 4.5 pipeline. A 2.6 checkpoint loaded
+  with this recipe's `--deploy-config` will be rejected at startup
+  rather than silently misrouted.
+
+- **Async chunking**: disabled in all three deploy configs
+  (`async_chunk: false`): the talker processes a single full thinker
+  output, not chunks.
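
To check that a decoded reply matches the 24 kHz mono format described in the Output audio note, the WAV header can be inspected with Python's standard `wave` module. This is a minimal sketch that assumes the server returns PCM-encoded WAV (the `wave` module raises `wave.Error` on encodings it does not support); `describe_wav` and `matches_recipe_format` are hypothetical helper names, not part of vLLM-Omni.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read sample rate, channel count, and duration from in-memory WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def matches_recipe_format(wav_bytes: bytes) -> bool:
    """True when the audio is 24 kHz mono, the format this recipe documents."""
    info = describe_wav(wav_bytes)
    return info["sample_rate_hz"] == 24_000 and info["channels"] == 1
```

Pass in the bytes obtained by base64-decoding `choices[0].message.audio.data`; a `False` result or a `wave.Error` suggests the reply is not the PCM 24 kHz mono WAV that the Notes describe.
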
diff --git a/recipes/README.md b/recipes/README.md
index 885710db1d4..48e9d0a80e7 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -37,6 +37,7 @@ recipes/
 | [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
 | [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
 | [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation | 1x H200 141GB / B300 |
+| [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
 | [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
 | [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |
 | [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay | 1x A100 80GB |