[New Model]: Add MiniCPM-o-4_5 support #3337
Signed-off-by: Drice1999 <chenxh267@gmail.com>
Wow, so many lines changed. I will review it tomorrow.
Large PR (>10 files). Please run L3 tests locally and paste results here.
Do you mean all L3 tests? Some of them require higher-end GPUs such as H100, and I'm afraid I don't have that hardware locally.
Minicpmo4_5-related L3 tests should pass now, tested on Runpod.

Command:

    mkdir -p /workspace/l3_logs
    LOG=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.log
    XML=/workspace/l3_logs/l3_minicpmo4_5_full_20260505T110054Z.xml
    pytest -s \
      tests/e2e/offline_inference/test_minicpmo4_5.py \
      tests/e2e/online_serving/test_minicpmo4_5.py \
      --tb=short --disable-warnings --junitxml="$XML" 2>&1 | tee "$LOG"

Result:
Generated logs:

@hsliuustc0106 PTAL, let me know if there are other changes that need to be made.
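The `--junitxml` file written by the command above can be summarized without reading the full log. The following is a minimal sketch, assuming the usual pytest JUnit layout (a `testsuite` element carrying `tests`, `failures`, `errors`, and `skipped` attributes). The helper name `summarize_junit` and the `SAMPLE` document are hypothetical and are not taken from this PR's logs.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Add up the counters found on every <testsuite> element."""
    root = ET.fromstring(xml_text)
    # The root may be <testsuites> wrapping one or more suites,
    # or a single bare <testsuite>; handle both.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    # "passed" is derived, since JUnit XML does not store it directly.
    totals["passed"] = (
        totals["tests"] - totals["failures"] - totals["errors"] - totals["skipped"]
    )
    return totals


# Illustrative input only; these numbers are made up for the sketch.
SAMPLE = (
    '<testsuites>'
    '<testsuite name="pytest" tests="3" failures="1" errors="0" skipped="0"/>'
    '</testsuites>'
)
summary = summarize_junit(SAMPLE)
print(summary)
```

To use it on a real run, read the file referenced by `$XML` and pass its contents to `summarize_junit`.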
Hi @Drice1999, friendly reminder: this PR hasn't had any activity (commits or reviews) in the past 11 days. 🕐 Could you please provide an update?
Thanks for your contribution! 🙏
Closed as #3642 is merged. Thanks for your willingness to contribute!
Support MiniCPM-o 4.5 integration with a three-stage pipeline (Thinker -> Talker -> Code2Wav) for text/image/video to audio generation.
This PR uses the non-async stage handoff path. Async chunk support requires Talker streaming support and is planned for a follow-up PR.
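The non-async handoff can be pictured as three stages that each run to completion before the next one starts. The sketch below is only an illustration of that control flow: `ThinkerOutput`, `run_non_async_pipeline`, and the toy stage functions are hypothetical names, not the classes or processors added by this PR.

```python
from dataclasses import dataclass
from typing import Callable, List

SAMPLE_RATE_HZ = 24_000  # Code2Wav output rate stated in the PR description


@dataclass
class ThinkerOutput:
    text_token_ids: List[int]
    tts_hidden_states: List[List[float]]  # one vector per TTS token


def run_non_async_pipeline(
    prompt: str,
    thinker: Callable[[str], ThinkerOutput],
    talker: Callable[[List[List[float]]], List[int]],
    code2wav: Callable[[List[int]], List[float]],
) -> List[float]:
    # Stage 0 finishes completely before anything is handed over.
    thinker_out = thinker(prompt)
    # Stage 1 receives the whole hidden-state sequence at once (no chunking).
    codec_token_ids = talker(thinker_out.tts_hidden_states)
    # Stage 2 turns the finished codec ids into waveform samples.
    return code2wav(codec_token_ids)


# Toy stand-ins so the sketch runs without any model weights.
def toy_thinker(prompt: str) -> ThinkerOutput:
    ids = [ord(c) for c in prompt]
    return ThinkerOutput(ids, [[float(i)] for i in ids])


def toy_talker(hidden_states: List[List[float]]) -> List[int]:
    return [int(h[0]) % 16 for h in hidden_states]


def toy_code2wav(codec_token_ids: List[int]) -> List[float]:
    # Two fake samples per codec token.
    return [t / 16.0 for t in codec_token_ids for _ in range(2)]


waveform = run_non_async_pipeline("hi", toy_thinker, toy_talker, toy_code2wav)
print(len(waveform), "samples at", SAMPLE_RATE_HZ, "Hz")
```

An async-chunk variant would instead pass partial Talker output to Code2Wav as it is produced, which is why it depends on Talker streaming support.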
Architecture

    Stage 0: Thinker (MiniCPM-o 4.5 LLM + vision/video processing) -> text tokens + TTS hidden states
        |
        | non-async stage handoff
        v
    Stage 1: Talker -> audio codec token ids
        |
        | non-async stage handoff
        v
    Stage 2: Code2Wav -> 24kHz waveform

- `minicpmo4_5_thinker.py`: Stage-0 model. MiniCPM-o 4.5 thinker for text, image, and video understanding, and for producing the text/TTS hidden-state stream consumed by the talker.
- `minicpmo4_5_talker.py`: Stage-1 model. MiniCPM-o 4.5 talker that consumes thinker TTS hidden states and generates audio codec token ids.
- `minicpmo4_5_code2wav.py`: Stage-2 model. MiniCPM-o 4.5 Code2Wav decoder that converts audio codec token ids to a 24kHz waveform.
- `stage_input_processors/minicpmo4_5.py`: Stage glue. `thinker2talker` extracts the TTS token span and hidden states from thinker output, and `talker2code2wav` forwards finished codec token ids to Code2Wav.

Installation:

    pip install --no-build-isolation 'minicpmo-utils[all]==1.0.6'
    pip install pillow opencv-python-headless openai

`minicpmo-utils[all]==1.0.6` is required by the Code2Wav stage for MiniCPM Token2Wav dependencies such as `stepaudio2`, `s3tokenizer`, and `hyperpyyaml`. This pinned version was verified with the current vLLM test environment (`vllm==0.20.0`): `vllm` imports successfully, MiniCPM Token2Wav loads, and text/image/video -> audio tests can generate WAV artifacts.

Purpose
Resolve #1182
Add MiniCPM-o 4.5 recipe support:
`supported_models.md`.
Test Plan
Assume a machine with two visible GPUs:

    export CUDA_VISIBLE_DEVICES=0,1

Offline examples:
Online serving examples:
In another shell:
Benchmark:
MiniCPM-o 4.5 L3 e2e:
Test Result
- MiniCPM-o 4.5 L3 e2e on Runpod `4x H100 80GB`: passed.
  - `pytest -s tests/e2e/offline_inference/test_minicpmo4_5.py tests/e2e/online_serving/test_minicpmo4_5.py --tb=short --disable-warnings`
  - `11 passed, 20 warnings in 915.55s (0:15:15)`
  - `minicpmo-utils[all]==1.0.6`, `vllm==0.20.0`, and system `espeak-ng`.
- E2E smoke on `2x L40S`: passed.

Audio Samples
Samples from the online serving smoke run with reference audio.
Test 1 - Text -> Audio
Input prompt
Output audio
text_to_audio.wav
Test 2 - Image -> Audio
Input image
Input prompt
Output audio
image_to_audio.wav
Test 3 - Video -> Audio
Input video
external_input.mp4
Input prompt
Output audio
video_to_audio.wav
vLLM-Omni vs HF streaming reference (`2x L40S`, `CUDA_VISIBLE_DEVICES=0,1`, seed=42, one request per modality):
The vLLM and HF text token counts are aligned after using the same chat template shape and aligned thinker sampling settings.
Peak VRAM profile (`nvidia-smi` `memory.used`, visible GPUs):
Docs updated:
- `docs/models/supported_models.md`
- `examples/offline_inference/minicpmo4_5/README.md`
- `examples/online_serving/minicpmo4_5/README.md`
- `benchmarks/minicpmo4_5/README.md`

`mkdocs serve` was not run locally.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.