Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
229 changes: 229 additions & 0 deletions recipes/OpenBMB/MiniCPM-o-4_5.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,229 @@
# MiniCPM-o 4.5

> Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech)

## Summary

- Vendor: OpenBMB
- Model: [`openbmb/MiniCPM-o-4_5`](https://huggingface.co/openbmb/MiniCPM-o-4_5)
- Task: Omni multimodal chat — accepts text / image / audio / video input;
emits text and 24 kHz mono speech in the same response
- Mode: Online serving via the OpenAI-compatible `/v1/chat/completions`
API, plus a bundled Gradio demo (text + speech UI)
- Maintainer: [`@tc-mb`](https://github.com/tc-mb) (MiniCPM-V / MiniCPM-o team)

## When to use this recipe

Use this recipe as a known-good starting point for serving
`openbmb/MiniCPM-o-4_5` on vLLM-Omni. MiniCPM-o 4.5 is the omni member
of the MiniCPM-o family — it pairs a multimodal-understanding thinker
LLM with a streaming `MiniCPMTTS + Token2Wav` talker so a single
`/v1/chat/completions` call can return text and 24 kHz speech in one
shot. The recipe covers three shipped GPU layouts (2 / 3 / 8 GPUs)
selected via `--deploy-config`.

## References

- Default deploy configs (auto-loaded by HF `model_type=minicpmo` +
`hf_config.version="4.5"`):
- 2-GPU layout (default):
[`vllm_omni/deploy/minicpmo_4_5.yaml`](../../vllm_omni/deploy/minicpmo_4_5.yaml)
- 3-GPU layout (thinker TP=2):
[`vllm_omni/deploy/minicpmo_4_5_3gpu.yaml`](../../vllm_omni/deploy/minicpmo_4_5_3gpu.yaml)
- 8x RTX 4090 layout:
[`vllm_omni/deploy/minicpmo_4_5_8x4090.yaml`](../../vllm_omni/deploy/minicpmo_4_5_8x4090.yaml)
- Online example + Gradio demo:
[`examples/online_serving/minicpmo/`](../../examples/online_serving/minicpmo/)
- Pipeline / talker source:
[`vllm_omni/model_executor/models/minicpmo_4_5/`](../../vllm_omni/model_executor/models/minicpmo_4_5/)
- Stage-input processor (thinker → talker bridge):
[`vllm_omni/model_executor/stage_input_processors/minicpmo_4_5_omni.py`](../../vllm_omni/model_executor/stage_input_processors/minicpmo_4_5_omni.py)
- Upstream model card:
[`openbmb/MiniCPM-o-4_5`](https://huggingface.co/openbmb/MiniCPM-o-4_5)
- Integration PR:
[vllm-project/vllm-omni#3642](https://github.com/vllm-project/vllm-omni/pull/3642)

## Hardware Support

Three GPU layouts ship with default deploy configs. Pick the layout that
matches your hardware and pass it via `--deploy-config`; the talker
(`MiniCPMTTS + Token2Wav`) always lives on its own GPU because of the
in-process vocoder, and the thinker is the part that scales out via TP.

| Layout | Thinker | Talker + Token2Wav | Typical hardware |
| --- | --- | --- | --- |
| 2-GPU (default) | GPU 0 | GPU 1 | 2x A100/H100/H200 80GB |
| 3-GPU (thinker TP=2) | GPU 0,1 (TP=2) | GPU 2 | 3x mid-tier GPUs |
| 8x RTX 4090 24GB | GPU 0–3 (TP=4) | GPU 4 | 8x RTX 4090 consumer |

## GPU

### 2 x GPU (default — single command)

The default
[`vllm_omni/deploy/minicpmo_4_5.yaml`](../../vllm_omni/deploy/minicpmo_4_5.yaml)
puts the thinker on GPU 0 (`~70 %` memory, `enforce_eager: true`,
`max_num_seqs: 1`) and the talker + Token2Wav vocoder on GPU 1
(`~75 %` memory). This is the recommended starting layout — works on
any pair of 80GB-class GPUs (A100, H100, H200) and on most 40GB+
pairs as long as the thinker model weights fit.

#### Environment

- OS: Linux
- Python: 3.10+
- vLLM / vLLM-Omni: >= 0.21.0 (or current `main`)
- Optional Talker dep: `stepaudio2-minicpmo` (see Notes for why this is
required and how to install it)

#### Command

```bash
vllm serve openbmb/MiniCPM-o-4_5 --omni \
--trust-remote-code \
--host 0.0.0.0 --port 8099
```

The deploy config is auto-loaded by the model registry — no
`--deploy-config` flag needed for this default 2-GPU layout.

#### Verification

**Quick smoke test (text-only output)**:

```bash
curl http://localhost:8099/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openbmb/MiniCPM-o-4_5",
"messages": [{"role": "user", "content": "Briefly introduce yourself."}],
"modalities": ["text"]
}'
```

**Text + speech in one response** (the headline 4.5 feature). The TTS
path is gated by a Jinja flag on the chat template, passed through
`extra_body.chat_template_kwargs.use_tts_template=true`:

```bash
curl http://localhost:8099/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openbmb/MiniCPM-o-4_5",
"messages": [{"role": "user", "content": "Say hello, then introduce vLLM in one sentence."}],
"modalities": ["text", "audio"],
"extra_body": {"chat_template_kwargs": {"use_tts_template": true}}
}'
```

Response carries text in `choices[0].message.content` and base64 WAV
in `choices[0].message.audio.data` (24 kHz mono, see Notes).

**Gradio demo (text + image + audio + video UI)**:

```bash
bash examples/online_serving/minicpmo/run_gradio_demo.sh
# or run the python entry point directly:
python examples/online_serving/minicpmo/gradio_demo.py \
--minicpmo45-api-base http://localhost:8099/v1 \
--minicpmo45-model openbmb/MiniCPM-o-4_5 \
--port 7862
```

Open `http://<host>:7862` and try a text prompt with the **"Generate
speech output (TTS)"** checkbox on / off.

#### Notes

- Memory budget: thinker weights occupy GPU 0 at `gpu_memory_utilization:
0.7`; talker + Token2Wav vocoder share GPU 1 at `0.75`.
- `--trust-remote-code` is required — the HF repo ships a custom
`MiniCPMO` config / model class.
- Pin: `enforce_eager: true` on both stages (CUDA graph capture is off
by design for the talker's Token2Wav path).
- Stage 1 (talker) is hard-capped to `max_num_seqs: 1`: the talker
only consumes `runtime_additional_information[0]`, so any value > 1
makes concurrent requests share request-0's audio. This is the same
cap baked into the deploy config.

### 3 x GPU (thinker TP=2)

Use
[`vllm_omni/deploy/minicpmo_4_5_3gpu.yaml`](../../vllm_omni/deploy/minicpmo_4_5_3gpu.yaml)
when you have a third GPU available and want the thinker on 2-way
tensor parallel for higher throughput; the talker stays on its own
GPU (talker has its own in-process Token2Wav vocoder, so co-locating
it with the thinker risks OOM under load).

#### Command

```bash
vllm serve openbmb/MiniCPM-o-4_5 --omni \
--deploy-config vllm_omni/deploy/minicpmo_4_5_3gpu.yaml \
--trust-remote-code \
--host 0.0.0.0 --port 8099
```

Verification and Notes mirror the 2-GPU section; thinker latency
roughly halves under load thanks to TP=2.

### 8 x RTX 4090 24GB (consumer-GPU layout)

Use
[`vllm_omni/deploy/minicpmo_4_5_8x4090.yaml`](../../vllm_omni/deploy/minicpmo_4_5_8x4090.yaml)
on an 8x RTX 4090 host. Thinker uses 4-way TP across GPUs 0–3
(`~85 %` mem each ≈ 20.4 GiB/card), talker + Token2Wav lives on GPU 4
(`~90 %` mem). GPUs 5–7 are left free.

#### Command

```bash
vllm serve openbmb/MiniCPM-o-4_5 --omni \
--deploy-config vllm_omni/deploy/minicpmo_4_5_8x4090.yaml \
--trust-remote-code \
--host 0.0.0.0 --port 8099
```

#### Notes

- `max_model_len` is capped at 4096 in this layout — 8192 still OOMs on
4090s. Raise it if your cards have more headroom (e.g. 4090 D /
custom 32 GB SKUs), but verify with a long-prompt run before
promoting.
- All other knobs match the 2-GPU section; the only difference is the
per-card memory pressure on the thinker shards.

## Notes (applies to all layouts)

- **Talker dependency**: the `MiniCPM-o 4.5` talker calls
`from stepaudio2 import Token2wav` against the MiniCPM-o-flavored
vocoder (PyPI package `stepaudio2-minicpmo` — NOT the upstream
`stepfun-ai/Step-Audio2`, whose `Token2wav.__init__` signature
rejects `n_timesteps`). Install via the published extra:

```bash
pip install 'vllm-omni[minicpmo]'
```

Equivalent direct install: `pip install stepaudio2-minicpmo`. A
missing dep raises `ImportError` at first request with the same
install hint instead of silently emitting empty audio.

- **TTS trigger**: speech output is only emitted when the client passes
`extra_body.chat_template_kwargs.use_tts_template=true`. Without it,
the response is text-only (which is also faster).

- **Output audio**: 24 kHz mono WAV inside the OpenAI-style
`message.audio.data` (base64). The Gradio demo's WAV player decodes
this automatically.

- **Routing**: MiniCPM-o 4.5 and 2.6 both ship `architectures=
["MiniCPMO"]` in HF config; routing is disambiguated by
`hf_config.version == "4.5"` via the
`hf_config_predicate` on the 4.5 pipeline. A 2.6 checkpoint loaded
with this recipe's `--deploy-config` will be rejected at startup
rather than silently misrouted.

- **Async chunking**: disabled in all three deploy configs
(`async_chunk: false`) — the talker batches a single full thinker
output, not chunks.
1 change: 1 addition & 0 deletions recipes/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ recipes/
| [`LTX/LTX-2.3.md`](./LTX/LTX-2.3.md) | Text-to-video with audio generation (22B) | 1x GPU (96GB VRAM) |
| [`mistralai/Voxtral-TTS.md`](./mistralai/Voxtral-TTS.md) | Online serving for TTS | 1x RTX 4090 24GB |
| [`nvidia/Cosmos3-Nano.md`](./nvidia/Cosmos3-Nano.md) | Text-to-image, text-to-video, and image-to-video generation | 1x H200 141GB / B300 |
| [`OpenBMB/MiniCPM-o-4_5.md`](./OpenBMB/MiniCPM-o-4_5.md) | Online serving for omni multimodal chat (text / image / audio / video → text + 24 kHz speech) | 2x A100/H100 80GB / 3x mid-tier GPU / 8x RTX 4090 24GB |
| [`OpenBMB/VoxCPM2.md`](./OpenBMB/VoxCPM2.md) | Online + offline TTS with native AR pipeline (48 kHz, 30+ languages) | 1x RTX 4090 24GB |
| [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay and ModelOpt mixed FP8/NVFP4 | 1x A100 80GB / 2x B200 |
| [`Qwen/Qwen-Image.md`](./Qwen/Qwen-Image.md) | Text-to-image serving with step-wise continuous batching replay | 1x A100 80GB |
Expand Down
Loading