Skip to content

[New Model]: Add MiniCPM-o-4_5 support#3337

Closed
Drice1999 wants to merge 10 commits into
vllm-project:mainfrom
Drice1999:minicpmo4_5-wip
Closed

[New Model]: Add MiniCPM-o-4_5 support#3337
Drice1999 wants to merge 10 commits into
vllm-project:mainfrom
Drice1999:minicpmo4_5-wip

Conversation

@Drice1999

@Drice1999 Drice1999 commented May 4, 2026

Copy link
Copy Markdown

Support MiniCPM-o 4.5 integration with a three-stage pipeline (Thinker -> Talker -> Code2Wav) for text/image/video to audio generation.
This PR uses the non-async stage handoff path. Async chunk support requires Talker streaming support and is planned for a follow-up PR.

Architecture

Stage 0: Thinker (MiniCPM-o 4.5 LLM + vision/video processing)
  -> text tokens + TTS hidden states
        |
        | non-async stage handoff
        v
Stage 1: Talker
  -> audio codec token ids
        |
        | non-async stage handoff
        v
Stage 2: Code2Wav
  -> 24kHz waveform

minicpmo4_5_thinker.py: Stage-0 model. MiniCPM-o 4.5 thinker for text, image, and video understanding, and for producing the text/TTS hidden-state stream consumed by the talker.

minicpmo4_5_talker.py: Stage-1 model. MiniCPM-o 4.5 talker that consumes thinker TTS hidden states and generates audio codec token ids.

minicpmo4_5_code2wav.py: Stage-2 model. MiniCPM-o 4.5 Code2Wav decoder that converts audio codec token ids to a 24kHz waveform.

stage_input_processors/minicpmo4_5.py: Stage glue. thinker2talker extracts the TTS token span and hidden states from thinker output, and talker2code2wav forwards finished codec token ids to Code2Wav.

Installation:

pip install --no-build-isolation 'minicpmo-utils[all]==1.0.6'
pip install pillow opencv-python-headless openai

minicpmo-utils[all]==1.0.6 is required by the Code2Wav stage for MiniCPM Token2Wav dependencies such as stepaudio2, s3tokenizer, and hyperpyyaml. This pinned version was verified with the current vLLM test environment (vllm==0.20.0): vllm imports successfully, MiniCPM Token2Wav loads, and text/image/video -> audio tests can generate WAV artifacts.

Purpose

Resolve #1182

Add MiniCPM-o 4.5 recipe support:

  • Add offline examples for text/image/video -> audio.
  • Add online serving examples for text/image/video -> audio.
  • Add benchmark recipe with latency, RTF, output token count, and peak VRAM reporting.
  • Add MiniCPM-o 4.5 deploy config and pipeline registry entry.
  • Update supported_models.md.

Test Plan

Assume a machine with two visible GPUs:

export CUDA_VISIBLE_DEVICES=0,1

Offline examples:

bash examples/offline_inference/minicpmo4_5/run_text_to_audio.sh
bash examples/offline_inference/minicpmo4_5/run_image_to_audio.sh
bash examples/offline_inference/minicpmo4_5/run_video_to_audio.sh

Online serving examples:

bash examples/online_serving/minicpmo4_5/run_server.sh

In another shell:

bash examples/online_serving/minicpmo4_5/run_curl_text_to_audio.sh
bash examples/online_serving/minicpmo4_5/run_curl_image_to_audio.sh /path/to/image.jpg
bash examples/online_serving/minicpmo4_5/run_curl_video_to_audio.sh /path/to/video.mp4

Benchmark:

python benchmarks/minicpmo4_5/bench_minicpmo4_5.py \
  --model-path openbmb/MiniCPM-o-4_5 \
  --mode all \
  --modalities text,text+image,text+video \
  --num-repeats 1 \
  --cuda-visible-devices 0,1 \
  --temperature 0.7 \
  --max-new-tokens 2048 \
  --output-dir bench_results/minicpmo4_5

MiniCPM-o 4.5 L3 e2e:

MINICPMO45_E2E_OUTPUT_DIR=/workspace/l3_artifacts/minicpmo_full \
pytest -s \
  tests/e2e/offline_inference/test_minicpmo4_5.py \
  tests/e2e/online_serving/test_minicpmo4_5.py \
  --tb=short --disable-warnings

Test Result

MiniCPM-o 4.5 L3 e2e on Runpod 4x H100 80GB: passed.

  • Command: pytest -s tests/e2e/offline_inference/test_minicpmo4_5.py tests/e2e/online_serving/test_minicpmo4_5.py --tb=short --disable-warnings
  • Result: 11 passed, 20 warnings in 915.55s (0:15:15)
  • Environment included minicpmo-utils[all]==1.0.6, vllm==0.20.0, and system espeak-ng.
  • Offline and online text/image/video -> audio generated WAV artifacts successfully.

E2E smoke on 2x L40S: passed.

  • Online text/image/video -> audio: passed
  • Offline text/image/video -> audio: passed
  • External image input: passed
  • External video input: passed

Audio Samples

Samples from the online serving smoke run with reference audio.

Test 1 - Text -> Audio

Input prompt

"Please read this single long sentence aloud exactly once without shortening it: vLLM Omni is running a benchmark for MiniCPM speech generation, and this sentence intentionally includes enough detail about streaming text to audio generation, multimodal reasoning, stage connectors, careful benchmarking, and stable speech synthesis behavior to last well over ten seconds when spoken at a natural pace."

Output audio

text_to_audio.wav

Test 2 - Image -> Audio

Input image

external_input

Input prompt

"Describe the image in one single detailed spoken sentence of at least sixty words, mentioning every visible shape, its color, its approximate size, its position relative to the other shapes, the plain background, and the overall layout, and keep the answer natural but long enough to last more than ten seconds."

Output audio

image_to_audio.wav

Test 3 - Video -> Audio

Input video

external_input.mp4

Input prompt

"Describe the video in one single detailed spoken sentence of at least sixty words, covering the moving objects, their colors, their approximate sizes, the direction and pattern of their motion over time, the dark background, and the overall scene, and keep the answer natural but long enough to last more than ten seconds."

Output audio

video_to_audio.wav

vLLM-Omni vs HF streaming reference (2x L40S, CUDA_VISIBLE_DEVICES=0,1, seed=42, one request per modality):

Task vLLM E2E HF E2E vLLM RTF HF RTF Text Tokens (vLLM/HF) Speedup
text -> audio 5002ms 11063ms 0.232 0.508 60/58 2.21x
text+image -> audio 10683ms 26037ms 0.203 0.461 168/163 2.44x
text+video -> audio 8656ms 16191ms 0.228 0.462 91/88 1.87x
Overall 8114ms 17764ms 0.221 0.477 106.3/103.0 2.19x

The vLLM and HF text token counts are aligned after using the same chat template shape and aligned thinker sampling settings.

Peak VRAM profile (nvidia-smi memory.used, visible GPUs):

Mode GPU 0 Peak GPU 1 Peak Total Peak VRAM
vLLM-Omni non-async 27103 MiB 15926 MiB 43029 MiB
HF reference 25335 MiB 3 MiB 25338 MiB

Docs updated:

  • docs/models/supported_models.md
  • examples/offline_inference/minicpmo4_5/README.md
  • examples/online_serving/minicpmo4_5/README.md
  • benchmarks/minicpmo4_5/README.md

mkdocs serve was not run locally.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Signed-off-by: Drice1999 <chenxh267@gmail.com>
@Drice1999 Drice1999 requested a review from hsliuustc0106 as a code owner May 4, 2026 13:15
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Drice1999 added 8 commits May 4, 2026 21:20
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Signed-off-by: Drice1999 <chenxh267@gmail.com>
@princepride

Copy link
Copy Markdown
Collaborator

Wow, so many lines change, I will review it tmr.

@hsliuustc0106

Copy link
Copy Markdown
Collaborator

Large PR (>10 files). Please run L3 tests locally and paste results here.

@Drice1999

Copy link
Copy Markdown
Author

Large PR (>10 files). Please run L3 tests locally and paste results here.

Do you mean all L3 tests? Some of them require higher-end GPUs such as H100, and I’m afraid I don’t have that hardware locally.

Signed-off-by: Drice1999 <chenxh267@gmail.com>
@Drice1999

Copy link
Copy Markdown
Author

Minicpmo4_5 related L3 tests should pass now, tested on Runpod (4x H100 80GB).

Command:

mkdir -p /workspace/l3_logs

LOG=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.log
XML=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.xml

pytest -s \
  tests/e2e/offline_inference/test_minicpmo4_5.py \
  tests/e2e/online_serving/test_minicpmo4_5.py \
  --tb=short --disable-warnings --junitxml="$XML" 2>&1 | tee "$LOG"

Result:

- generated xml file: /workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.xml -
--- Running Summary
================= 11 passed, 20 warnings in 915.55s (0:15:15) ==================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Generated logs:
l3_minicpmo4_5_full_20260505T110054Z.log
l3_minicpmo4_5_full_20260505T110054Z.xml

@hsliuustc0106 PTAL, let me know if there are other changes that need to be made

@hsliuustc0106

Copy link
Copy Markdown
Collaborator

Hi @Drice1999, friendly reminder — this PR hasn't had any activity (commits or reviews) in the past 11 days. 🕐

Could you please provide an update?

  • If you're still working on it, that's great — just let us know.
  • If you're blocked on something, feel free to ask for help.
  • If this PR is no longer being pursued, please consider closing it so we can keep the review queue manageable.

Thanks for your contribution! 🙏

@linyueqian linyueqian closed this Jun 4, 2026
@linyueqian

Copy link
Copy Markdown
Collaborator

closed as #3642 is merged. thanks for your willingness to contribute!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Model]: Omni model openbmb/MiniCPM-o-4_5

4 participants