[Model] VoxCPM2 native AR TTS support by linyueqian · Pull Request #2658 · vllm-project/vllm-omni

linyueqian · 2026-04-09T20:11:40Z

Summary

Add VoxCPM2 native AR TTS support. This PR decomposes VoxCPM2's autoregressive loop so that each decode step runs through vllm's engine, enabling future batching and PagedAttention integration.

Implementation: Single-stage pipeline with native MiniCPM4 base_lm + native AudioVAE decode. Working E2E: text → 48kHz speech (zero-shot + voice cloning).

Independent PR — does not depend on other open PRs. Only adds VoxCPM2 files, no VoxCPM v1 dependencies.

Performance (H20 80GB, voxcpm 0.0.0, PyTorch 2.10.0+cu128):

| Prompt           | RTF   | Audio |
|------------------|-------|-------|
| Short (~6 words) | ~0.81 | ~4s   |
| Long (~50 words) | ~0.72 | ~17s  |

RTF < 1.0 means faster than real time.
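For readers unfamiliar with the metric, the real-time factor is the wall-clock generation time divided by the duration of the generated audio. The snippet below is a minimal sketch of that arithmetic; the function name and the 3.24 s timing are illustrative assumptions, not measurements from this PR.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return RTF = wall-clock generation time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Illustrative numbers only: ~4 s of audio produced in 3.24 s of wall-clock time.
rtf = real_time_factor(3.24, 4.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.81
print("faster than real time" if rtf < 1.0 else "slower than real time")
```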

Known limitations (tracked as TODO in talker code):

- Uses native MiniCPM4 base_lm (not vllm PagedAttention): per-request side-computation state (residual_lm KV cache) prevents concurrent batching
- Single-stage VAE decode in talker, no incremental streaming yet (future: nanovllm decode-pad pattern)
- Scaffold model double-forward overhead
- Requires voxcpm package (pip install voxcpm) or VLLM_OMNI_VOXCPM_CODE_PATH env var
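The last limitation describes an "installed package or environment-variable path" dependency. A minimal sketch of that kind of fallback is shown below. The helper `import_with_path_fallback` is hypothetical and is not taken from `voxcpm2_import_utils.py`; it also assumes the environment variable points at a directory that can be placed on `sys.path`.

```python
import importlib
import os
import sys
from types import ModuleType


def import_with_path_fallback(package: str, env_var: str) -> ModuleType:
    """Import `package`; on failure, retry after adding $env_var to sys.path.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(package)
    except ImportError:
        code_path = os.environ.get(env_var)
        if not code_path:
            raise ImportError(
                f"{package!r} is not installed; run `pip install {package}` "
                f"or set {env_var} to a source checkout"
            ) from None
        if code_path not in sys.path:
            sys.path.insert(0, code_path)
        # Retry once; a second ImportError propagates to the caller.
        return importlib.import_module(package)


# Usage with the names mentioned in this PR (requires voxcpm to be available):
# voxcpm = import_with_path_fallback("voxcpm", "VLLM_OMNI_VOXCPM_CODE_PATH")
```

Failing loudly when neither source is available matches the later decision in this thread to let CI surface a missing voxcpm install rather than skip.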

Architecture (per AR step):

feat_encoder → MiniCPM4 (base LM) → FSQ → residual_lm → LocDiT → AudioVAE → 48kHz waveform
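The data flow above can be sketched as a plain Python loop. This is a toy illustration under stated assumptions: every component is a caller-supplied stub, each stage is treated as a simple chain following the arrows, the loop runs for a fixed `num_steps` rather than using a learned stop condition, and the VAE decodes all patches once at the end (as the limitations list describes). None of the names below are the real vllm-omni or voxcpm APIs.

```python
from typing import Callable, List

Vec = List[float]


def run_ar_loop(
    feat_encoder: Callable[[Vec], Vec],
    base_lm: Callable[[Vec], Vec],
    fsq: Callable[[Vec], Vec],
    residual_lm: Callable[[Vec], Vec],
    loc_dit: Callable[[Vec], Vec],
    audio_vae_decode: Callable[[List[Vec]], Vec],
    first_patch: Vec,
    num_steps: int,
) -> Vec:
    """Toy autoregressive loop: each generated patch is fed back via feat_encoder."""
    patches: List[Vec] = []
    prev_patch = first_patch
    for _ in range(num_steps):
        lm_input = feat_encoder(prev_patch)   # feedback: previous patch re-enters the LM
        hidden = base_lm(lm_input)            # MiniCPM4 base LM step
        quantized = fsq(hidden)               # FSQ stage
        residual = residual_lm(quantized)     # residual LM (per-request state in the real model)
        prev_patch = loc_dit(residual)        # LocDiT produces the next audio patch
        patches.append(prev_patch)
    # Single decode over all patches at the end, i.e. no incremental streaming.
    return audio_vae_decode(patches)


def identity(v: Vec) -> Vec:
    return v


def flatten(ps: List[Vec]) -> Vec:
    return [x for p in ps for x in p]


waveform = run_ar_loop(
    identity, identity, identity, identity, identity,
    audio_vae_decode=flatten,
    first_patch=[0.0, 0.0],
    num_steps=3,
)
print(len(waveform))  # 6 (3 patches of 2 values each)
```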

Test Plan

- pytest tests/e2e/offline_inference/test_voxcpm2.py -m core_model -v
- Manual: python examples/offline_inference/voxcpm2/end2end.py --text "Hello, test."
- Buildkite: Added to .buildkite/test-ready.yml (pre-merge, 20min L4 test)

Test Result

- Zero-shot E2E: PASSED on H20 and CI (L4)
- Voice clone: PASSED on H20 (skipped on CI: no reference audio)
- Audio quality matches native VoxCPM2 generate() output
- E2E test validates audio shape and duration (0.5s-30s range)
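The shape-and-duration check in the last item could look roughly like the sketch below. This is not the code from `test_voxcpm2.py`; the helper name is hypothetical, and it assumes a mono waveform given as a flat sequence of samples at the 48 kHz rate mentioned in the summary.

```python
from typing import Sequence


def check_audio(
    samples: Sequence[float],
    sample_rate: int = 48_000,
    min_seconds: float = 0.5,
    max_seconds: float = 30.0,
) -> float:
    """Validate a mono waveform and return its duration in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if len(samples) == 0:
        raise ValueError("audio is empty")
    duration = len(samples) / sample_rate
    if not (min_seconds <= duration <= max_seconds):
        raise ValueError(
            f"duration {duration:.2f}s outside [{min_seconds}s, {max_seconds}s]"
        )
    return duration


# One second of silence at 48 kHz passes the check.
print(check_audio([0.0] * 48_000))  # 1.0
```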

Co-authored-by: lishunyang12 <lishunyang12@users.noreply.github.com>


linyueqian · 2026-04-09T20:27:55Z

@Sy0307 @JuanPZuluaga @lishunyang12 PTAL. Looking for your insights on how to improve this model.

Decompose VoxCPM2's autoregressive loop so each decode step runs through vllm's engine, enabling future batching and PagedAttention integration. Single-stage pipeline using native MiniCPM4 base_lm + native AudioVAE decode.

Architecture: native base_lm → FSQ → residual_lm → diffusion (LocDiT) → feat_encoder → AudioVAE V2 → 48kHz audio

Key design decisions:
- Native MiniCPM4 modules (LongRoPE mismatch blocks vllm MiniCPM)
- VAE decode in talker (single-stage, bypasses Stage 1 output pipeline)
- vllm MiniCPMModel scaffold satisfies FlashInfer warmup requirements
- nanovllm decode pattern: base_lm → FSQ → res_lm → diffusion

Performance (H20 single request):
- Short prompt RTF: 0.28
- Long prompt RTF: 0.34

Files:
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_import_utils.py
- vllm_omni/model_executor/stage_configs/voxcpm2.yaml
- vllm_omni/transformers_utils/configs/voxcpm2.py
- examples/offline_inference/voxcpm2/
- tests/e2e/offline_inference/test_voxcpm2.py
- .buildkite/test-merge.yml (CI entry)

Known limitations (Phase 2):
- No PagedAttention (uses manual KV cache)
- No streaming (VAE decodes all patches at end)
- Scaffold model double-forward overhead
- Requires voxcpm package or VLLM_OMNI_VOXCPM_CODE_PATH

Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

feat_encoder feeds INTO base_lm (feedback loop), not after LocDiT. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Import from tests.utils (where it exists), not tests.e2e.utils.conftest_utils. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

- Fix tests.e2e.utils.conftest_utils → tests.utils (CI import error)
- extract_audio falls back to model_outputs key
- Add TODO for sliding-window VAE streaming (nanovllm pattern)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

CI Docker image doesn't include voxcpm. Skip gracefully instead of crashing with ImportError during engine initialization. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

The CI Docker image doesn't include voxcpm. Install it at test time. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

hsliuustc0106 · 2026-04-10T08:16:37Z

Buildkite CI is failing. Please check the failed build and fix before this can be merged.

Also: the PR body mentions the model uses native MiniCPM4 rather than vllm's PagedAttention due to hidden state mismatches. Worth adding a TODO comment in the model code tracking this limitation for future integration.

Engine init fails when the model isn't cached in CI. Skip gracefully instead of erroring the test suite. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Document the two concrete issues blocking vllm MiniCPM4 PagedAttention: per-request residual_lm state isolation and streaming VAE decode. Reference prototype branch for future work. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Model + native VoxCPM2 loads ~8GB. With 0.3 on L4 (22GB) there's no room for KV cache. Increase to 0.9 for compatibility with smaller GPUs. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
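The arithmetic behind this change can be checked quickly. The sketch below assumes the 0.3 and 0.9 values are a cap expressed as a fraction of total GPU memory (as with vllm's `gpu_memory_utilization`; the commit message does not name the setting) and treats the ~8GB load as a single lump sum, which is a simplification.

```python
def memory_left_for_kv_cache_gb(
    total_gb: float, utilization: float, weights_gb: float
) -> float:
    """GB remaining under the utilization cap after loading the weights."""
    return total_gb * utilization - weights_gb


for utilization in (0.3, 0.9):
    budget = 22.0 * utilization
    left = memory_left_for_kv_cache_gb(22.0, utilization, 8.0)
    print(f"cap={utilization}: budget={budget:.1f} GB, left for KV cache={left:.1f} GB")
# cap=0.3: budget=6.6 GB, left for KV cache=-1.4 GB
# cap=0.9: budget=19.8 GB, left for KV cache=11.8 GB
```

Under these assumptions a 0.3 cap on a 22GB L4 is smaller than the ~8GB load itself, while 0.9 leaves roughly 11.8 GB of headroom.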

CI should surface failures (missing voxcpm, OOM, model not cached) rather than silently skipping. Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Sy0307 · 2026-04-10T18:35:58Z

I tested on H20 and got an RTF of around 0.65. Could there be some environment differences between us? Could you describe your setup in detail, especially the version of voxcpm you're using? @linyueqian

Previous numbers (~0.28-0.34) were nanovllm reference benchmarks. Actual vllm-omni RTF on H20: ~0.72-0.81 (single request, enforce_eager). Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

linyueqian · 2026-04-10T20:12:58Z

I've updated the RTF numbers to match actual measurements. The previous numbers (~0.28/0.34) were from the nanovllm reference implementation. Our vllm-omni integration measures ~0.72-0.81 on H20.

hsliuustc0106 · 2026-04-10T23:22:06Z

Is there any accuracy problem now? I think for the first stage, we can accept the implementation with an accuracy guarantee and RTF < 1.0.

hsliuustc0106 · 2026-04-10T23:23:18Z

btw, why is PR #2467 (~2500 LOC) so much longer than this PR (~1300 LOC)?

linyueqian · 2026-04-10T23:32:42Z

> Is there any accuracy problem now? I think for the first stage, we can accept the implementation with an accuracy guarantee and RTF < 1.0.

The accuracy should be fine.

linyueqian · 2026-04-10T23:32:54Z

> btw, why is PR #2467 (~2500 LOC) so much longer than this PR (~1300 LOC)?

We import a lot from voxcpm's package.

hsliuustc0106 · 2026-04-11T00:21:23Z

> > btw, why is PR #2467 (~2500 LOC) so much longer than this PR (~1300 LOC)?
>
> We import a lot from voxcpm's package.

Have we decided to do so?

JuanPZuluaga · 2026-04-11T21:24:33Z

Thanks for adding this model. I'll also add it in #2630. @linyueqian

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>

gesla2024 · 2026-04-14T02:43:59Z

When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.

There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.

This is a test example
output.wav

linyueqian · 2026-04-14T02:50:51Z

> When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.
>
> There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.
>
> This is a test example output.wav

Yes, I found this issue as well. Can you try #2758?

gesla2024 · 2026-04-14T04:54:01Z

> > When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.
> > There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.
> > This is a test example output.wav
>
> yes i found out this issue as well. can you try #2758?

Okay, I'll pull the branch and test it now. Thank you.

VoxCPM2 support has been in the codebase since vllm-project#2658 but was never documented on the Speech API page. This PR adds features on top of that surface, so document the whole thing in one go:

* Add VoxCPM2 to the top-level supported-models bullet list and Quick Start serve commands.
* Add a "VoxCPM2-specific Parameters" subsection under Request Parameters that defines `cfg_value` and points to the mode table.
* Add a VoxCPM2 section under "Supported Models" covering the three synthesis modes (Voice Design / Controllable Cloning / Ultimate Cloning), how they map to request fields, reference-audio guidelines, and curl examples for each mode.

No code changes in this commit; pure docs.

Signed-off-by: gnomefin <alfian@uselevers.com>

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>

linyueqian requested a review from hsliuustc0106 as a code owner April 9, 2026 20:11

linyueqian force-pushed the feat/voxcpm2-native-ar branch 6 times, most recently from bfa4ebd to 0599def Compare April 9, 2026 20:25

linyueqian force-pushed the feat/voxcpm2-native-ar branch 3 times, most recently from 756edd2 to 5b8295f Compare April 9, 2026 20:43

linyueqian force-pushed the feat/voxcpm2-native-ar branch from 5b8295f to 572190f Compare April 9, 2026 20:53

linyueqian added the ready label to trigger buildkite CI label Apr 10, 2026

fix(docs): correct VoxCPM2 architecture diagram order

05c28b7


linyueqian force-pushed the feat/voxcpm2-native-ar branch from 9e0762a to 05c28b7 Compare April 10, 2026 03:56

linyueqian added 4 commits April 10, 2026 00:03

fix(test): correct hardware_test import path for CI

c72761a


fix(test): skip VoxCPM2 tests when voxcpm package not installed

df8f352


fix(ci): install voxcpm package before VoxCPM2 test

4e0408f


fix(test): skip VoxCPM2 test on engine init failure

c85698d


linyueqian force-pushed the feat/voxcpm2-native-ar branch from a932610 to c85698d Compare April 10, 2026 15:46

linyueqian force-pushed the feat/voxcpm2-native-ar branch from dbafccf to 8d8f4e4 Compare April 10, 2026 16:07

linyueqian added 2 commits April 10, 2026 12:28

fix(test): remove skip guards, let CI fail on real errors

6c2f832


Sy0307 mentioned this pull request Apr 10, 2026

[Perf]: Speedup VoxCPM2 TTS performance and Support PagedAttention #2690

Merged

hsliuustc0106 requested a review from ZeldaHuang April 10, 2026 23:18

hsliuustc0106 merged commit a41174e into vllm-project:main Apr 11, 2026
8 checks passed

linyueqian mentioned this pull request Apr 11, 2026

Add voxcpm model support. #2467

Merged

5 tasks

This was referenced Apr 17, 2026

[New Model]: VoxCPM2 #2594

Closed

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

gnomefin mentioned this pull request Apr 24, 2026

[Doc][Frontend][Model][VoxCPM2] Support instructions and per-request cfg_value #3118

Merged

linyueqian mentioned this pull request May 1, 2026

[TSS][Model] Kimi-Audio-7B #2941

Open

5 tasks

Shirley125 mentioned this pull request May 21, 2026

[RFC]: CI optimization and L4 model tests supplement JiusiServe/vllm-omni#177

Open

13 tasks
