
[Model] VoxCPM2 native AR TTS support #2658

Merged
hsliuustc0106 merged 11 commits into vllm-project:main from linyueqian:feat/voxcpm2-native-ar on Apr 11, 2026

@linyueqian
Collaborator

@linyueqian linyueqian commented Apr 9, 2026

Summary

Add VoxCPM2 native AR TTS support — decomposes VoxCPM2's autoregressive loop so each decode step runs through vllm's engine, enabling future batching and PagedAttention integration.

Implementation: Single-stage pipeline with native MiniCPM4 base_lm + native AudioVAE decode. Working E2E: text → 48kHz speech (zero-shot + voice cloning).

Independent PR — does not depend on other open PRs. Only adds VoxCPM2 files, no VoxCPM v1 dependencies.

Performance (H20 80GB, voxcpm 0.0.0, PyTorch 2.10.0+cu128):

| Prompt | RTF | Audio duration |
| --- | --- | --- |
| Short (~6 words) | ~0.81 | ~4s |
| Long (~50 words) | ~0.72 | ~17s |

RTF < 1.0 means faster than real time.
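
As a quick illustration of how the RTF figure is read, here is a minimal sketch that assumes the usual definition (wall-clock synthesis time divided by the duration of the generated audio). The function name and the 3.24 s timing are illustrative, derived from the short-prompt numbers above rather than measured.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: ~4 s of audio at RTF ~0.81 implies ~3.24 s of compute.
rtf = real_time_factor(synthesis_seconds=3.24, audio_seconds=4.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```

Under this definition, an RTF of 0.81 means the system needs about 0.81 s of compute for each second of speech it produces.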

Known limitations (tracked as TODO in talker code):

  • Uses native MiniCPM4 base_lm (not vllm PagedAttention) — per-request side-computation state (residual_lm KV cache) prevents concurrent batching
  • Single-stage VAE decode in talker, no incremental streaming yet (future: nanovllm decode-pad pattern)
  • Scaffold model double-forward overhead
  • Requires voxcpm package (pip install voxcpm) or VLLM_OMNI_VOXCPM_CODE_PATH env var
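
The last limitation (installed package or environment-variable fallback) could be handled along these lines. This is a hypothetical sketch, not the contents of `voxcpm2_import_utils.py`: the helper name `ensure_importable` is invented, and treating `VLLM_OMNI_VOXCPM_CODE_PATH` as a directory to put on `sys.path` is an assumption.

```python
import importlib
import os
import sys


def ensure_importable(module_name: str, env_var: str) -> bool:
    """Return True if module_name imports, optionally via a directory named in env_var.

    Hypothetical helper: assumes env_var points at a source checkout directory.
    """
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        pass

    code_path = os.environ.get(env_var)
    if not code_path or not os.path.isdir(code_path):
        return False

    # Make the checkout visible to the import system, then retry once.
    if code_path not in sys.path:
        sys.path.insert(0, code_path)
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    ok = ensure_importable("voxcpm", "VLLM_OMNI_VOXCPM_CODE_PATH")
    print("voxcpm available:", ok)
```

Returning a boolean rather than raising leaves the caller free to decide whether a missing dependency should fail loudly or be reported some other way.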

Architecture (per AR step):

feat_encoder → MiniCPM4 (base LM) → FSQ → residual_lm → LocDiT → AudioVAE → 48kHz waveform
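
To make the per-step data flow concrete, here is a toy sketch of the loop with every stage passed in as a placeholder callable. It is a simplification under stated assumptions, not the talker implementation: each stage is reduced to a single-input function, the fixed step count stands in for the real stop condition, and the final one-shot decode mirrors the "no streaming yet" limitation. The feedback of each generated patch through `feat_encoder` into the base LM follows the description in this PR.

```python
from typing import Callable, List

Vec = List[float]
Stage = Callable[[Vec], Vec]


def ar_generate(
    text_embedding: Vec,
    num_steps: int,
    feat_encoder: Stage,
    base_lm: Stage,
    fsq: Stage,
    residual_lm: Stage,
    loc_dit: Stage,
    audio_vae_decode: Callable[[List[Vec]], Vec],
) -> Vec:
    """Toy autoregressive loop; all stages are stand-ins supplied by the caller."""
    patches: List[Vec] = []
    lm_input = text_embedding
    for _ in range(num_steps):
        hidden = base_lm(lm_input)        # MiniCPM4 base LM
        quantized = fsq(hidden)           # FSQ
        refined = residual_lm(quantized)  # residual LM
        patch = loc_dit(refined)          # LocDiT produces one audio feature patch
        patches.append(patch)
        lm_input = feat_encoder(patch)    # feedback: patch is encoded for the next step
    return audio_vae_decode(patches)      # decode all patches at the end


# Identity stand-ins just to exercise the control flow.
identity: Stage = lambda v: v
flatten = lambda ps: [x for p in ps for x in p]
wave = ar_generate([0.1, 0.2], 3, identity, identity, identity, identity, identity, flatten)
print(len(wave))  # 3 steps x 2 values per patch
```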

Test Plan

  • pytest tests/e2e/offline_inference/test_voxcpm2.py -m core_model -v
  • Manual: python examples/offline_inference/voxcpm2/end2end.py --text "Hello, test."
  • Buildkite: Added to .buildkite/test-ready.yml (pre-merge, 20min L4 test)

Test Result

  • Zero-shot E2E: PASSED on H20 and CI (L4)
  • Voice clone: PASSED on H20 (skipped on CI — no reference audio)
  • Audio quality matches native VoxCPM2 generate() output
  • E2E test validates audio shape and duration (0.5s-30s range)
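
The duration check in the last bullet can be sketched as follows. This is an illustrative stand-in, not the code in `test_voxcpm2.py`: the function name is hypothetical, and the 48 kHz sample rate is taken from the PR summary.

```python
SAMPLE_RATE_HZ = 48_000  # output rate stated in the PR summary


def check_duration(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE_HZ,
    min_seconds: float = 0.5,
    max_seconds: float = 30.0,
) -> float:
    """Return the clip duration in seconds, raising if it is outside the allowed range."""
    if num_samples <= 0:
        raise ValueError("audio is empty")
    duration = num_samples / sample_rate
    if not (min_seconds <= duration <= max_seconds):
        raise AssertionError(
            f"duration {duration:.2f}s outside [{min_seconds}, {max_seconds}]s"
        )
    return duration


print(check_duration(192_000))  # 192000 samples at 48 kHz
```

A range check like this catches both empty or truncated output and runaway generation without depending on the exact waveform.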

Co-authored-by: lishunyang12 <lishunyang12@users.noreply.github.com>

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch 6 times, most recently from bfa4ebd to 0599def Compare April 9, 2026 20:25
@linyueqian
Collaborator Author

@Sy0307 @JuanPZuluaga @lishunyang12 PTAL. I'm looking for your insights on how to improve this model.

@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch 3 times, most recently from 756edd2 to 5b8295f Compare April 9, 2026 20:43
Decompose VoxCPM2's autoregressive loop so each decode step runs
through vllm's engine, enabling future batching and PagedAttention
integration. Single-stage pipeline using native MiniCPM4 base_lm +
native AudioVAE decode.

Architecture:
  native base_lm → FSQ → residual_lm → diffusion (LocDiT)
  → feat_encoder → AudioVAE V2 → 48kHz audio

Key design decisions:
- Native MiniCPM4 modules (LongRoPE mismatch blocks vllm MiniCPM)
- VAE decode in talker (single-stage, bypasses Stage 1 output pipeline)
- vllm MiniCPMModel scaffold satisfies FlashInfer warmup requirements
- nanovllm decode pattern: base_lm → FSQ → res_lm → diffusion

Performance (H20 single request):
- Short prompt RTF: 0.28
- Long prompt RTF: 0.34

Files:
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py
- vllm_omni/model_executor/models/voxcpm2/voxcpm2_import_utils.py
- vllm_omni/model_executor/stage_configs/voxcpm2.yaml
- vllm_omni/transformers_utils/configs/voxcpm2.py
- examples/offline_inference/voxcpm2/
- tests/e2e/offline_inference/test_voxcpm2.py
- .buildkite/test-merge.yml (CI entry)

Known limitations (Phase 2):
- No PagedAttention (uses manual KV cache)
- No streaming (VAE decodes all patches at end)
- Scaffold model double-forward overhead
- Requires voxcpm package or VLLM_OMNI_VOXCPM_CODE_PATH

Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch from 5b8295f to 572190f Compare April 9, 2026 20:53
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 10, 2026
feat_encoder feeds INTO base_lm (feedback loop), not after LocDiT.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch from 9e0762a to 05c28b7 Compare April 10, 2026 03:56
Import from tests.utils (where it exists), not tests.e2e.utils.conftest_utils.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
- Fix tests.e2e.utils.conftest_utils → tests.utils (CI import error)
- extract_audio falls back to model_outputs key
- Add TODO for sliding-window VAE streaming (nanovllm pattern)

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
CI Docker image doesn't include voxcpm. Skip gracefully instead of
crashing with ImportError during engine initialization.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
The CI Docker image doesn't include voxcpm. Install it at test time.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@hsliuustc0106
Collaborator

Buildkite CI is failing. Please check the failed build and fix before this can be merged.

Also: the PR body mentions the model uses native MiniCPM4 rather than vllm's PagedAttention due to hidden state mismatches. Worth adding a TODO comment in the model code tracking this limitation for future integration.

Engine init fails when the model isn't cached in CI. Skip gracefully
instead of erroring the test suite.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch from a932610 to c85698d Compare April 10, 2026 15:46
Document the two concrete issues blocking vllm MiniCPM4 PagedAttention:
per-request residual_lm state isolation and streaming VAE decode.
Reference prototype branch for future work.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian linyueqian force-pushed the feat/voxcpm2-native-ar branch from dbafccf to 8d8f4e4 Compare April 10, 2026 16:07
Model + native VoxCPM2 loads ~8GB. With 0.3 on L4 (22GB) there's no
room for KV cache. Increase to 0.9 for compatibility with smaller GPUs.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
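
The arithmetic behind the commit above can be checked with a back-of-the-envelope sketch. It assumes the 0.3 and 0.9 values are the fraction of GPU memory the engine may use (presumably vLLM's `gpu_memory_utilization`), and it ignores activations and other overhead, so the results are rough upper bounds.

```python
def kv_cache_budget_gb(total_gb: float, utilization: float, model_gb: float) -> float:
    """Memory left for KV cache after loading weights, under a simple linear model."""
    return total_gb * utilization - model_gb


# Numbers from the commit message: L4 with 22 GB, ~8 GB for model + native VoxCPM2.
for utilization in (0.3, 0.9):
    budget = kv_cache_budget_gb(total_gb=22.0, utilization=utilization, model_gb=8.0)
    print(f"utilization={utilization}: {budget:.1f} GB left for KV cache")
```

At 0.3 the budget is negative (6.6 GB allowed against ~8 GB of weights), which matches the "no room for KV cache" observation; at 0.9 roughly 11.8 GB remains.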
CI should surface failures (missing voxcpm, OOM, model not cached)
rather than silently skipping.

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@Sy0307
Contributor

Sy0307 commented Apr 10, 2026

I tested on H20 and got an RTF of around 0.65. Could there be some environment differences between us? Could you describe your setup in detail, especially the version of voxcpm you're using? @linyueqian

Previous numbers (~0.28-0.34) were nanovllm reference benchmarks.
Actual vllm-omni RTF on H20: ~0.72-0.81 (single request, enforce_eager).

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
@linyueqian
Collaborator Author

I've updated the RTF numbers to match actual measurements. The previous numbers (~0.28/0.34) were from the nanovllm reference implementation. Our vllm-omni integration measures ~0.72-0.81 on H20.

@hsliuustc0106
Collaborator

Is there any accuracy problem now? I think for the first stage, we can accept the implementation as long as accuracy is guaranteed and RTF < 1.0.

@hsliuustc0106
Collaborator

Btw, why is PR #2467 (~2500 LOC) much longer than this PR (~1300 LOC)?

@linyueqian
Collaborator Author

> Is there any accuracy problem now? I think for the first stage, we can accept the implementation as long as accuracy is guaranteed and RTF < 1.0.

The accuracy should be fine.

@linyueqian
Collaborator Author

> Btw, why is PR #2467 (~2500 LOC) much longer than this PR (~1300 LOC)?

We import a lot from the voxcpm package.

@hsliuustc0106
Collaborator

> > Btw, why is PR #2467 (~2500 LOC) much longer than this PR (~1300 LOC)?
>
> We import a lot from the voxcpm package.

Have we decided to do so?

@hsliuustc0106 hsliuustc0106 merged commit a41174e into vllm-project:main Apr 11, 2026
8 checks passed
@linyueqian linyueqian mentioned this pull request Apr 11, 2026
5 tasks
@JuanPZuluaga
Contributor

Thanks for adding this model. I'll also add it in #2630. @linyueqian

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
@gesla2024

When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.

There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.

This is a test example
output.wav
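
One pattern that would produce this kind of repetition is a stream whose chunks are cumulative (each chunk contains all audio so far) being played back as if they were incremental. Whether that is the cause here is an assumption, not something confirmed in this thread. The sketch below shows the difference using short integer lists as stand-in samples.

```python
from typing import Iterable, Iterator, List


def to_deltas(cumulative_chunks: Iterable[List[int]]) -> Iterator[List[int]]:
    """Convert cumulative chunks into incremental ones by dropping already-emitted samples."""
    emitted = 0
    for chunk in cumulative_chunks:
        new_samples = chunk[emitted:]
        emitted = len(chunk)
        if new_samples:
            yield new_samples


cumulative = [[1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5]]

# Naive playback concatenates every chunk and repeats the beginning each time.
naive = [s for chunk in cumulative for s in chunk]
# Delta playback emits each sample exactly once.
fixed = [s for chunk in to_deltas(cumulative) for s in chunk]

print(naive)
print(fixed)
```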

@linyueqian
Collaborator Author

> When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.
>
> There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.
>
> This is a test example output.wav

Yes, I found this issue as well. Can you try #2758?

@gesla2024

> > When streaming output with stream=True enabled, it seems to return everything in full, not incrementally.
> > There is an issue where the audio stream output repeatedly returns the same data, for example: 'Hello, welcome to our store, what do you need?' During playback, it comes out as 'Hello, hello, welcome; hello, welcome to our store;...' in this way.
> > This is a test example output.wav
>
> Yes, I found this issue as well. Can you try #2758?

Okay, I'll pull the branch and test it now. Thank you.

gnomefin added a commit to gnomefin/vllm-omni that referenced this pull request Apr 24, 2026
VoxCPM2 support has been in the codebase since vllm-project#2658 but was never
documented on the Speech API page. This PR adds features on top of
that surface, so document the whole thing in one go:

* Add VoxCPM2 to the top-level supported-models bullet list and
  Quick Start serve commands.
* Add a "VoxCPM2-specific Parameters" subsection under Request
  Parameters that defines `cfg_value` and points to the mode table.
* Add a VoxCPM2 section under "Supported Models" covering the three
  synthesis modes (Voice Design / Controllable Cloning / Ultimate
  Cloning), how they map to request fields, reference-audio
  guidelines, and curl examples for each mode.

No code changes in this commit; pure docs.

Signed-off-by: gnomefin <alfian@uselevers.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>
@linyueqian linyueqian mentioned this pull request May 1, 2026
5 tasks
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: SYLAR <lishunyang12@users.noreply.github.com>

Labels

ready label to trigger buildkite CI

5 participants