[New Model]: Add MiniCPM-o-4_5 support by Drice1999 · Pull Request #3337 · vllm-project/vllm-omni

Drice1999 · 2026-05-04T13:15:26Z

Support MiniCPM-o 4.5 integration with a three-stage pipeline (Thinker -> Talker -> Code2Wav) for text/image/video to audio generation.
This PR uses the non-async stage handoff path. Async chunk support requires Talker streaming support and is planned for a follow-up PR.

Architecture

Stage 0: Thinker (MiniCPM-o 4.5 LLM + vision/video processing)
  -> text tokens + TTS hidden states
        |
        | non-async stage handoff
        v
Stage 1: Talker
  -> audio codec token ids
        |
        | non-async stage handoff
        v
Stage 2: Code2Wav
  -> 24kHz waveform

minicpmo4_5_thinker.py: Stage-0 model. MiniCPM-o 4.5 thinker for text, image, and video understanding, and for producing the text/TTS hidden-state stream consumed by the talker.

minicpmo4_5_talker.py: Stage-1 model. MiniCPM-o 4.5 talker that consumes thinker TTS hidden states and generates audio codec token ids.

minicpmo4_5_code2wav.py: Stage-2 model. MiniCPM-o 4.5 Code2Wav decoder that converts audio codec token ids to a 24kHz waveform.

stage_input_processors/minicpmo4_5.py: Stage glue. thinker2talker extracts the TTS token span and hidden states from thinker output, and talker2code2wav forwards finished codec token ids to Code2Wav.

Installation:

pip install --no-build-isolation 'minicpmo-utils[all]==1.0.6'
pip install pillow opencv-python-headless openai

minicpmo-utils[all]==1.0.6 is required by the Code2Wav stage for MiniCPM Token2Wav dependencies such as stepaudio2, s3tokenizer, and hyperpyyaml. This pinned version was verified with the current vLLM test environment (vllm==0.20.0): vllm imports successfully, MiniCPM Token2Wav loads, and text/image/video -> audio tests can generate WAV artifacts.

Purpose

Resolve #1182

Add MiniCPM-o 4.5 recipe support:

Add offline examples for text/image/video -> audio.
Add online serving examples for text/image/video -> audio.
Add benchmark recipe with latency, RTF, output token count, and peak VRAM reporting.
Add MiniCPM-o 4.5 deploy config and pipeline registry entry.
Update supported_models.md.

Test Plan

Assume a machine with two visible GPUs:

export CUDA_VISIBLE_DEVICES=0,1

Offline examples:

bash examples/offline_inference/minicpmo4_5/run_text_to_audio.sh
bash examples/offline_inference/minicpmo4_5/run_image_to_audio.sh
bash examples/offline_inference/minicpmo4_5/run_video_to_audio.sh

Online serving examples:

bash examples/online_serving/minicpmo4_5/run_server.sh

In another shell:

bash examples/online_serving/minicpmo4_5/run_curl_text_to_audio.sh
bash examples/online_serving/minicpmo4_5/run_curl_image_to_audio.sh /path/to/image.jpg
bash examples/online_serving/minicpmo4_5/run_curl_video_to_audio.sh /path/to/video.mp4

Benchmark:

python benchmarks/minicpmo4_5/bench_minicpmo4_5.py \
  --model-path openbmb/MiniCPM-o-4_5 \
  --mode all \
  --modalities text,text+image,text+video \
  --num-repeats 1 \
  --cuda-visible-devices 0,1 \
  --temperature 0.7 \
  --max-new-tokens 2048 \
  --output-dir bench_results/minicpmo4_5

MiniCPM-o 4.5 L3 e2e:

MINICPMO45_E2E_OUTPUT_DIR=/workspace/l3_artifacts/minicpmo_full \
pytest -s \
  tests/e2e/offline_inference/test_minicpmo4_5.py \
  tests/e2e/online_serving/test_minicpmo4_5.py \
  --tb=short --disable-warnings

Test Result

MiniCPM-o 4.5 L3 e2e on Runpod 4x H100 80GB: passed.

Command: pytest -s tests/e2e/offline_inference/test_minicpmo4_5.py tests/e2e/online_serving/test_minicpmo4_5.py --tb=short --disable-warnings
Result: 11 passed, 20 warnings in 915.55s (0:15:15)
Environment included minicpmo-utils[all]==1.0.6, vllm==0.20.0, and system espeak-ng.
Offline and online text/image/video -> audio generated WAV artifacts successfully.

E2E smoke on 2x L40S: passed.

Online text/image/video -> audio: passed
Offline text/image/video -> audio: passed
External image input: passed
External video input: passed

Audio Samples

Samples from the online serving smoke run with reference audio.

Test 1 - Text -> Audio

Input prompt

"Please read this single long sentence aloud exactly once without shortening it: vLLM Omni is running a benchmark for MiniCPM speech generation, and this sentence intentionally includes enough detail about streaming text to audio generation, multimodal reasoning, stage connectors, careful benchmarking, and stable speech synthesis behavior to last well over ten seconds when spoken at a natural pace."

Output audio

text_to_audio.wav

Test 2 - Image -> Audio

Input image

Input prompt

"Describe the image in one single detailed spoken sentence of at least sixty words, mentioning every visible shape, its color, its approximate size, its position relative to the other shapes, the plain background, and the overall layout, and keep the answer natural but long enough to last more than ten seconds."

Output audio

image_to_audio.wav

Test 3 - Video -> Audio

Input video

external_input.mp4

Input prompt

"Describe the video in one single detailed spoken sentence of at least sixty words, covering the moving objects, their colors, their approximate sizes, the direction and pattern of their motion over time, the dark background, and the overall scene, and keep the answer natural but long enough to last more than ten seconds."

Output audio

video_to_audio.wav

vLLM-Omni vs HF streaming reference (2x L40S, CUDA_VISIBLE_DEVICES=0,1, seed=42, one request per modality):

Task	vLLM E2E	HF E2E	vLLM RTF	HF RTF	Text Tokens (vLLM/HF)	Speedup
text -> audio	5002ms	11063ms	0.232	0.508	60/58	2.21x
text+image -> audio	10683ms	26037ms	0.203	0.461	168/163	2.44x
text+video -> audio	8656ms	16191ms	0.228	0.462	91/88	1.87x
Overall	8114ms	17764ms	0.221	0.477	106.3/103.0	2.19x

The vLLM and HF text token counts are aligned after using the same chat template shape and aligned thinker sampling settings.

Peak VRAM profile (nvidia-smi memory.used, visible GPUs):

Mode	GPU 0 Peak	GPU 1 Peak	Total Peak VRAM
vLLM-Omni non-async	27103 MiB	15926 MiB	43029 MiB
HF reference	25335 MiB	3 MiB	25338 MiB

Docs updated:

docs/models/supported_models.md
examples/offline_inference/minicpmo4_5/README.md
examples/online_serving/minicpmo4_5/README.md
benchmarks/minicpmo4_5/README.md

mkdocs serve was not run locally.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Signed-off-by: Drice1999 <chenxh267@gmail.com>

chatgpt-codex-connector · 2026-05-04T13:15:36Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Signed-off-by: Drice1999 <chenxh267@gmail.com>

princepride · 2026-05-04T14:56:46Z

Wow, so many lines change, I will review it tmr.

hsliuustc0106 · 2026-05-04T21:07:33Z

Large PR (>10 files). Please run L3 tests locally and paste results here.

Drice1999 · 2026-05-04T23:54:45Z

Large PR (>10 files). Please run L3 tests locally and paste results here.

Do you mean all L3 tests? Some of them require higher-end GPUs such as H100, and I’m afraid I don’t have that hardware locally.

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Drice1999 · 2026-05-05T11:48:33Z

Minicpmo4_5 related L3 tests should pass now, tested on Runpod (4x H100 80GB).

Command:

mkdir -p /workspace/l3_logs

LOG=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.log
XML=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.xml

pytest -s \
  tests/e2e/offline_inference/test_minicpmo4_5.py \
  tests/e2e/online_serving/test_minicpmo4_5.py \
  --tb=short --disable-warnings --junitxml="$XML" 2>&1 | tee "$LOG"

Result:

- generated xml file: /workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.xml -
--- Running Summary
================= 11 passed, 20 warnings in 915.55s (0:15:15) ==================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Generated logs:
l3_minicpmo4_5_full_20260505T110054Z.log
l3_minicpmo4_5_full_20260505T110054Z.xml

@hsliuustc0106 PTAL, let me know if there are other changes that need to be made

hsliuustc0106 · 2026-05-16T13:09:38Z

Hi @Drice1999, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 11 days. 🕐

Could you please provide an update?

If you're still working on it, that's great — just let us know.
If you're blocked on something, feel free to ask for help.
If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏

linyueqian · 2026-06-04T23:25:01Z

closed as #3642 is merged. thanks for your willingness to contribute!

WIP: add MiniCPM-o 4.5 staged integration

ff19fd2

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Drice1999 requested a review from hsliuustc0106 as a code owner May 4, 2026 13:15

hsliuustc0106 requested review from ZeldaHuang, linyueqian, lishunyang12, princepride and tzhouam May 4, 2026 13:17

Drice1999 added 8 commits May 4, 2026 21:20

Debug MiniCPM async talker parity

de2e129

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Refactor MiniCPM async talker token streaming

9ee2370

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Fix MiniCPM async talker sampling

24fe971

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Support MiniCPM async chunk ref audio

ba9b630

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Add MiniCPM-o 4.5 examples and benchmarks

371e5c8

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Clean up MiniCPM-o 4.5 PR scope

306dd6f

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Add MiniCPM-o 4.5 examples and benchmark

d589416

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Update MiniCPM-o 4.5 examples and benchmark

c1bedf5

Signed-off-by: Drice1999 <chenxh267@gmail.com>

Drice1999 force-pushed the minicpmo4_5-wip branch from 6b20e82 to c1bedf5 Compare May 4, 2026 13:20

Fix MiniCPM-o 4.5 L3 tests

a597281

Signed-off-by: Drice1999 <chenxh267@gmail.com>

linyueqian closed this Jun 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[New Model]: Add MiniCPM-o-4_5 support#3337

[New Model]: Add MiniCPM-o-4_5 support#3337
Drice1999 wants to merge 10 commits into
vllm-project:mainfrom
Drice1999:minicpmo4_5-wip

Drice1999 commented May 4, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 4, 2026

Uh oh!

princepride commented May 4, 2026

Uh oh!

hsliuustc0106 commented May 4, 2026

Uh oh!

Drice1999 commented May 4, 2026

Uh oh!

Drice1999 commented May 5, 2026

Uh oh!

hsliuustc0106 commented May 16, 2026

Uh oh!

linyueqian commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Drice1999 commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Architecture

Purpose

Test Plan

Test Result

Audio Samples

Test 1 - Text -> Audio

Test 2 - Image -> Audio

Test 3 - Video -> Audio

Uh oh!

chatgpt-codex-connector Bot commented May 4, 2026

Uh oh!

princepride commented May 4, 2026

Uh oh!

hsliuustc0106 commented May 4, 2026

Uh oh!

Drice1999 commented May 4, 2026

Uh oh!

Drice1999 commented May 5, 2026

Uh oh!

hsliuustc0106 commented May 16, 2026

Uh oh!

linyueqian commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Drice1999 commented May 4, 2026 •

edited

Loading