[Model] VoxCPM2 native AR TTS support #2658
bfa4ebd to
0599def
Compare
|
@Sy0307 @JuanPZuluaga @lishunyang12 PTAL. looking for your insights on how to improve this model. |
756edd2 to
5b8295f
Compare
Decompose VoxCPM2's autoregressive loop so each decode step runs through vllm's engine, enabling future batching and PagedAttention integration. Single-stage pipeline using native MiniCPM4 base_lm + native AudioVAE decode.

Architecture: native base_lm → FSQ → residual_lm → diffusion (LocDiT) → feat_encoder → AudioVAE V2 → 48kHz audio

Key design decisions:
- Native MiniCPM4 modules (LongRoPE mismatch blocks vllm MiniCPM)
- VAE decode in talker (single-stage, bypasses Stage 1 output pipeline)
- vllm MiniCPMModel scaffold satisfies FlashInfer warmup requirements
- nanovllm decode pattern: base_lm → FSQ → res_lm → diffusion

Performance (H20 single request):
- Short prompt RTF: 0.28
- Long prompt RTF: 0.34

Files:
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_import_utils.py
- vllm_omni/model_executor/stage_configs/voxcpm2.yaml
- vllm_omni/transformers_utils/configs/voxcpm2.py
- examples/offline_inference/voxcpm2/
- tests/e2e/offline_inference/test_voxcpm2.py
- .buildkite/test-merge.yml (CI entry)

Known limitations (Phase 2):
- No PagedAttention (uses manual KV cache)
- No streaming (VAE decodes all patches at end)
- Scaffold model double-forward overhead
- Requires voxcpm package or VLLM_OMNI_VOXCPM_CODE_PATH

Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
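The last limitation above (voxcpm package or VLLM_OMNI_VOXCPM_CODE_PATH) suggests an import fallback. The following is only a minimal sketch of such a fallback, not the contents of voxcpm2_import_utils.py: the helper name `import_with_code_path` is hypothetical, and it assumes the environment variable points at a directory that can be put on `sys.path`.

```python
import importlib
import os
import sys


def import_with_code_path(module_name: str, env_var: str):
    """Import module_name; if that fails, retry from the directory named by env_var.

    Hypothetical helper for illustration only. It assumes env_var holds a
    directory that contains the importable module or package.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        code_path = os.environ.get(env_var)
        if not code_path:
            raise ImportError(
                f"{module_name} is not installed and {env_var} is not set"
            )
        if code_path not in sys.path:
            sys.path.insert(0, code_path)
        # Drop cached "not found" results before retrying the import.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

Under these assumptions, a call such as `import_with_code_path("voxcpm", "VLLM_OMNI_VOXCPM_CODE_PATH")` would prefer the installed package and only fall back to the source checkout when the package is missing.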
5b8295f to
572190f
Compare
feat_encoder feeds INTO base_lm (feedback loop), not after LocDiT. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
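The corrected data flow can be sketched as a plain loop. This is a hedged illustration, not the talker implementation: every stage is passed in as a callable stub, each stage receives only the previous stage's output (the real conditioning between stages is not shown in this thread), and the `should_stop` predicate and `max_steps` cap are hypothetical.

```python
from typing import Any, Callable, List, Sequence


def ar_decode(
    prompt_embeds: Sequence[Any],
    base_lm: Callable[[List[Any]], Any],
    fsq: Callable[[Any], Any],
    residual_lm: Callable[[Any], Any],
    loc_dit: Callable[[Any], Any],
    feat_encoder: Callable[[Any], Any],
    audio_vae_decode: Callable[[List[Any]], Any],
    should_stop: Callable[[Any], bool],
    max_steps: int,
) -> Any:
    """Run the AR loop: base_lm -> FSQ -> residual_lm -> LocDiT per step."""
    lm_inputs = list(prompt_embeds)
    patches: List[Any] = []
    for _ in range(max_steps):
        hidden = base_lm(lm_inputs)        # one decode step of the base LM
        quantized = fsq(hidden)            # FSQ on the base LM output
        residual = residual_lm(quantized)  # residual LM refinement
        patch = loc_dit(residual)          # diffusion produces one audio patch
        patches.append(patch)
        if should_stop(hidden):
            break
        # Feedback loop: the encoded patch becomes the next base_lm input.
        lm_inputs.append(feat_encoder(patch))
    # No streaming in this phase: the VAE decodes all patches at the end.
    return audio_vae_decode(patches)
```

Keeping the stages as injected callables makes the ordering easy to check with toy stubs before wiring in real modules.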
9e0762a to
05c28b7
Compare
Import from tests.utils (where it exists), not tests.e2e.utils.conftest_utils. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
- Fix tests.e2e.utils.conftest_utils → tests.utils (CI import error)
- extract_audio falls back to model_outputs key
- Add TODO for sliding-window VAE streaming (nanovllm pattern)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
CI Docker image doesn't include voxcpm. Skip gracefully instead of crashing with ImportError during engine initialization. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
The CI Docker image doesn't include voxcpm. Install it at test time. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
|
Buildkite CI is failing. Please check the failed build and fix before this can be merged. Also: the PR body mentions the model uses native MiniCPM4 rather than vllm's PagedAttention due to hidden state mismatches. Worth adding a TODO comment in the model code tracking this limitation for future integration. |
Engine init fails when the model isn't cached in CI. Skip gracefully instead of erroring the test suite. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
a932610 to
c85698d
Compare
Document the two concrete issues blocking vllm MiniCPM4 PagedAttention: per-request residual_lm state isolation and streaming VAE decode. Reference prototype branch for future work. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
dbafccf to
8d8f4e4
Compare
Model + native VoxCPM2 loads ~8GB. With 0.3 on L4 (22GB) there's no room for KV cache. Increase to 0.9 for compatibility with smaller GPUs. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
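The arithmetic behind this change can be checked with a rough budget calculation. This sketch assumes that 0.3 and 0.9 are fractions of total GPU memory the engine may use, and that whatever is left after the ~8GB model load goes to KV cache; it ignores activations and other overhead, and the function name is hypothetical.

```python
def kv_cache_budget_gb(total_gb: float, utilization: float, model_gb: float) -> float:
    """Rough memory left for KV cache after loading the model (may be negative)."""
    return total_gb * utilization - model_gb


# L4 with 22GB and an ~8GB model load, as stated in the commit message.
for utilization in (0.3, 0.9):
    budget = kv_cache_budget_gb(22.0, utilization, 8.0)
    print(f"utilization={utilization}: {budget:.1f} GB left for KV cache")
```

At 0.3 the budget is negative (6.6GB allowed versus ~8GB needed), while 0.9 leaves roughly 11.8GB under these assumptions.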
CI should surface failures (missing voxcpm, OOM, model not cached) rather than silently skipping. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
|
I tested on H20 and got an RTF of around 0.65. Could there be some environment differences between us? Could you describe your setup in detail, especially the version of voxcpm you're using? @linyueqian |
Previous numbers (~0.28-0.34) were nanovllm reference benchmarks. Actual vllm-omni RTF on H20: ~0.72-0.81 (single request, enforce_eager). Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
|
I've updated the RTF numbers to match actual measurements. The previous numbers (~0.28/0.34) were from the nanovllm reference implementation. Our vllm-omni integration measures ~0.72-0.81 on H20 |
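For readers comparing these numbers, RTF here is taken to mean synthesis wall-clock time divided by the duration of the generated audio. A minimal sketch of that calculation follows; the function name is hypothetical, and the 48 kHz default reflects the output sample rate mentioned in this PR.

```python
def real_time_factor(
    synthesis_seconds: float, num_samples: int, sample_rate: int = 48_000
) -> float:
    """Return synthesis time divided by generated audio duration."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds
```

For example, spending 7.2 seconds to generate 10 seconds of 48 kHz audio (480,000 samples) gives an RTF of about 0.72, which is faster than real time.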
|
Is there any accuracy problem now? I think for the first stage, we can accept an implementation with an accuracy guarantee and RTF < 1.0. |
|
btw, why is PR #2467 (~2500 LOC) so much longer than this PR (~1300 LOC)? |
the accuracy should be fine |
we import a lot from voxcpm's package. |
have we decided to do so? |
|
Thanks for adding this model. I'll also add it in the: #2630. @linyueqian |
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
|
When streaming with stream=True, the output seems to be returned in full rather than incrementally, and the audio stream repeatedly returns the same data. For example, 'Hello, welcome to our store, what do you need?' plays back as 'Hello, hello, welcome; hello, welcome to our store;...'. This is a test example. |
Yes, I found this issue as well. Can you try #2758?
Okay, I'll pull the branch and test it now. Thank you. |
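The symptom described above is consistent with each streamed chunk containing all audio generated so far rather than only the new part. The sketch below illustrates that difference on the client side; it is not the fix in #2758, and it assumes every chunk is a strict extension of the previous one. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def cumulative_to_deltas(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only the bytes not already seen, assuming cumulative chunks."""
    emitted = 0
    for chunk in chunks:
        if len(chunk) > emitted:
            yield chunk[emitted:]
            emitted = len(chunk)


# Playing cumulative chunks directly repeats the prefix each time;
# playing the deltas yields each segment once.
stream = [b"Hello, ", b"Hello, welcome ", b"Hello, welcome to our store"]
print(b"".join(cumulative_to_deltas(stream)))
```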
VoxCPM2 support has been in the codebase since vllm-project#2658 but was never documented on the Speech API page. This PR adds features on top of that surface, so document the whole thing in one go:

* Add VoxCPM2 to the top-level supported-models bullet list and Quick Start serve commands.
* Add a "VoxCPM2-specific Parameters" subsection under Request Parameters that defines `cfg_value` and points to the mode table.
* Add a VoxCPM2 section under "Supported Models" covering the three synthesis modes (Voice Design / Controllable Cloning / Ultimate Cloning), how they map to request fields, reference-audio guidelines, and curl examples for each mode.

No code changes in this commit; pure docs.

Signed-off-by: gnomefin <alfian@uselevers.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
Summary
Add VoxCPM2 native AR TTS support — decomposes VoxCPM2's autoregressive loop so each decode step runs through vllm's engine, enabling future batching and PagedAttention integration.
Implementation: Single-stage pipeline with native MiniCPM4 base_lm + native AudioVAE decode. Working E2E: text → 48kHz speech (zero-shot + voice cloning).
Independent PR — does not depend on other open PRs. Only adds VoxCPM2 files, no VoxCPM v1 dependencies.
Performance (H20 80GB, voxcpm 0.0.0, PyTorch 2.10.0+cu128):
RTF < 1.0 means faster than real time.
Known limitations (tracked as TODO in talker code):
- `voxcpm` package (`pip install voxcpm`) or `VLLM_OMNI_VOXCPM_CODE_PATH` env var
Architecture (per AR step):
Test Plan
- `pytest tests/e2e/offline_inference/test_voxcpm2.py -m core_model -v`
- `python examples/offline_inference/voxcpm2/end2end.py --text "Hello, test."`
- `.buildkite/test-ready.yml` (pre-merge, 20min L4 test)
Test Result
- `generate()` output
Co-authored-by: lishunyang12 <lishunyang12@users.noreply.github.com>