
[Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prompt#2894

Merged

linyueqian merged 5 commits into vllm-project:main from Sy0307:fix/voxcpm2-voice-clone on Apr 18, 2026

Conversation

Contributor

@Sy0307 Sy0307 commented Apr 17, 2026

Purpose

Fix VoxCPM2 voice cloning via /v1/audio/speech: any request carrying ref_audio produced ~136-320 s of noise and eventually tripped MAX_DECODE_STEPS=2000. Follow-up to #2832.

Root cause

Three pieces have to line up for prefill to work correctly, and today they don't:

  1. Talker expects a long prefill. voxcpm2_talker._build_prefill_inputs concatenates ref_prefix / prompt_feat in front of the target text plus an audio_start token, so the embedding fed into base_lm has shape tts_len = ref_feat_len + text_len + 1 (and more for ref_continuation).
  2. vLLM only allocates text-sized slots. The serving layer sent prompt_token_ids = self._voxcpm2_encode(input), so the scheduler reserved exactly len(text) forward slots. gpu_model_runner then clamps the talker-produced embeddings with seg_len = min(span_len, req_embeds.shape[0]) (gpu_model_runner.py:1306) and silently drops the overflow.
  3. Talker zero-pads to cover the missing slots. _prepare_residual_prefill right-pads base_lm_out back up to tts_len (voxcpm2_talker.py:797-806), which means the audio_start position - the last column of enc_outputs that stop_head reads - becomes zero.

Zeroed lm_hidden makes stop_head(lm_hidden).argmax() stick to "don't stop" on every step, so decoding runs until the hard MAX_DECODE_STEPS cap and returns garbled audio.

Observed on a clean origin/main build serving openbmb/VoxCPM2: a 9.92 s English ref clip gave scaffold_len=11 but tts_len=75; the missing 64 slots (ref_start + 62 VAE patches + ref_end) never reached base_lm.
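
In toy form, the failure chain above (a minimal sketch with the dimensions from the repro; the ``Linear`` is a stand-in for the real ``stop_head``, not the actual module):

```python
import torch

# From the repro: vLLM reserved scaffold_len = 11 slots, but the talker's
# prefill is tts_len = 75.
scaffold_len, tts_len, hidden = 11, 75, 8

# base_lm only ran over the reserved slots...
base_lm_out = torch.randn(scaffold_len, hidden)

# ...and the old _prepare_residual_prefill right-padded back to tts_len
# with zeros, burying the audio_start position (the last row) in the pad.
padded = torch.cat([base_lm_out,
                    base_lm_out.new_zeros(tts_len - scaffold_len, hidden)])

# stop_head reads that last row; a zero input means its logits collapse to
# the bias, identical on every decode step, so "stop" can never win argmax.
stop_head = torch.nn.Linear(hidden, 2)  # stand-in head
lm_hidden = padded[-1]
assert lm_hidden.abs().sum() == 0
print(torch.allclose(stop_head(lm_hidden), stop_head.bias))  # True
```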

Changes

vllm_omni/entrypoints/openai/serving_speech.py

  • New _build_voxcpm2_prompt helper, shaped like _build_fish_speech_prompt / _build_cosyvoice3_prompt: pads prompt_token_ids to the full prefill length (effective_text + audio_start + ref/prompt region) and ships the real (post-BOS-strip) text IDs through additional_information["text_token_ids"]. The placeholder [1] * prefill_len only reserves slots; its base_lm output is discarded by the talker's feat_mask.
  • ref_audio + ref_text is routed to native VoxCPM2 continuation mode (prompt_audio + prompt_text); ref_audio alone stays on reference mode. This matches what the OpenAI speech client means by "reference audio + transcript".
  • ref_feat_len is computed from hf_config.audio_vae_config (sample_rate + encoder_rates) so a VAE layout change surfaces as a startup error rather than decode-time drift.
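
For illustration, one plausible shape of that computation (a sketch: the 16 kHz sample rate and the ``encoder_rates`` values below are assumptions, chosen only so the arithmetic reproduces the 62-patch figure from the repro above):

```python
import math

def ref_feat_len_from_vae(duration_s: float, sample_rate: int,
                          encoder_rates: list[int]) -> int:
    # Assumes the VAE encoder downsamples by prod(encoder_rates) samples
    # per patch and that partial patches round up; the real code reads
    # these values from hf_config.audio_vae_config at startup.
    hop = math.prod(encoder_rates)          # samples per VAE patch
    num_samples = round(duration_s * sample_rate)
    return math.ceil(num_samples / hop)

# 9.92 s clip from the repro: 158720 samples / 2560 = 62 patches, which
# plus ref_start and ref_end gives the 64 missing slots observed above.
print(ref_feat_len_from_vae(9.92, 16000, [4, 4, 4, 5, 8]))  # -> 62
```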

vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py

  • _prepare_residual_prefill asserts scaffold_len == tts_len. Serving and talker must agree on prefill length bit-for-bit; any future drift crashes loudly instead of regressing to 300 s of noise.
  • preprocess() reads real text IDs from additional_information["text_token_ids"] when present and falls back to input_ids.tolist() for offline callers (examples/offline_inference/voxcpm2/end2end.py).
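
In sketch form (simplified, hypothetical function name), the lookup described in this bullet, shown with the tightened ``real is None`` check it acquired in the follow-up commit:

```python
def resolve_text_token_ids(additional_information: dict | None, input_ids):
    # Online serving ships the real (post-BOS-strip) text IDs out-of-band;
    # offline callers omit the key entirely and fall back to input_ids.
    real = (additional_information or {}).get("text_token_ids")
    if real is None:  # key absent -> offline path
        return input_ids.tolist()
    return real[0]    # an explicit empty list now fails loudly (IndexError)
                      # instead of silently reusing input_ids
```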

Test Plan

Verified on NVIDIA H20, openbmb/VoxCPM2, patched branch:

  1. Start server: vllm-omni serve openbmb/VoxCPM2 --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml --omni --port 8765 --trust-remote-code.
  2. POST /v1/audio/speech covering text-only, ref_audio only, and ref_audio + ref_text, in both English and Chinese (example request after this list).
  3. ASR each response with whisper-base.
  4. Repeat with a ~30 s reference clip to exercise long ref_feat_len.
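
A request sketch for step 2 (the endpoint, port, and field names come from this PR; the exact schema, e.g. whether ``ref_audio`` is base64-encoded, is an assumption):

```python
import base64
import requests

# Encode the reference clip; whether the API takes base64, a URL, or a
# multipart upload is an assumption in this sketch.
with open("ref_en.wav", "rb") as f:
    ref_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8765/v1/audio/speech",
    json={
        "model": "openbmb/VoxCPM2",
        "input": "Hello, this is a voice clone verification request.",
        "ref_audio": ref_b64,  # omit for text-only; ref_audio alone = reference mode
        "ref_text": "...",     # add the transcript to get native continuation mode
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)  # step 3: run this file through whisper-base
```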

Test Result

All seven scenarios return within a few seconds, decode stops naturally (no MAX_DECODE_STEPS warning), and the talker's scaffold_len == tts_len assert never triggers.

| Scenario | Duration | Whisper transcription | Matches target |
| --- | --- | --- | --- |
| text-only EN | 3.20 s | Hello, this is a voice clone verification request. | yes |
| ref-only EN (9.92 s ref) | 3.36 s | Hello, this is a voice clone verification request. | yes |
| ref + ref_text EN | 4.16 s | Hello, this is a voice clone verification request. | yes |
| zh, ref-only | 2.24 s | 你好,这是一个测试程序。 ("Hello, this is a test program.") | yes |
| zh, ref + ref_text | 3.04 s | 你好,这是一个测试程序。 ("Hello, this is a test program.") | yes |
| long ref-only (29.76 s ref) | 2.72 s | Testing a long reference clip for voice cloning. | yes |
| long ref + ref_text | 3.84 s | Testing a long reference clip for voice cloning. | yes |

Before the fix (same server binary minus this patch), the ref-only English request produced a 320.16 s output that Whisper failed to transcribe, and the worker logged WARNING ... MAX_DECODE_STEPS for speech-... (2000), forcing stop.

cc @linyueqian @gesla2024 - this should also unblock the Chinese / voice-clone cases you reported on #2832.

The talker's prefill embedding includes ref_audio / prompt_audio regions on
top of the target text, but vLLM only forwarded ``len(prompt_token_ids)``
slots through base_lm. The remaining ref/prompt positions were zero-padded,
so lm_hidden at the audio_start position read zero and stop_head never fired
— decode ran to MAX_DECODE_STEPS=2000 and the response was ~320s of noise.

Serving now pads ``prompt_token_ids`` to the full prefill length (text +
audio_start + ref/prompt region) using AudioVAE parameters read from
hf_config, and ships the real text tokens through
``additional_information['text_token_ids']``. ``ref_audio + ref_text``
is routed to native continuation mode; ``ref_audio`` alone keeps
reference-only mode.

To prevent silent regressions from layout drift, the talker now asserts
``scaffold_len == tts_len`` at prefill entry — any mismatch crashes
immediately instead of degrading to noisy audio.
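
The padded-prompt contract, as a minimal sketch (hypothetical helper name and simplified signature; only the ``[1]`` placeholder and the out-of-band ``text_token_ids`` come from this PR):

```python
def build_padded_prompt(text_ids: list[int], ref_region_len: int):
    # The prefill must cover everything the talker will build:
    # ref/prompt region + effective text + the trailing audio_start token.
    prefill_len = ref_region_len + len(text_ids) + 1
    # Placeholder IDs only reserve scheduler slots; their base_lm output
    # is discarded by the talker's feat_mask.
    prompt_token_ids = [1] * prefill_len
    # The real (post-BOS-strip) text IDs travel out-of-band so the talker's
    # preprocess() can rebuild the true prefill.
    additional_information = {"text_token_ids": [text_ids]}
    return prompt_token_ids, additional_information
```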

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 17, 2026 20:40

@linyueqian linyueqian self-requested a review April 17, 2026 20:40
@hsliuustc0106
Collaborator

🚫 Pre-commit check failing. Please fix before proceeding.

@linyueqian
Collaborator

fix pre-commit please

Collaborator

@linyueqian linyueqian left a comment

Verified end-to-end on H20 against openbmb/VoxCPM2.

Bug reproduced on origin/main (same model, same wav, same port): ref-only EN (9.92 s ref) returned 320.16 s of noise, server logged WARNING [voxcpm2_talker.py] MAX_DECODE_STEPS for speech-... (2000), forcing stop. Matches the description exactly.

With this PR: 5/5 cases (text-only / ref-only / ref+ref_text, EN+ZH) decode within 0.3 to 13 s, no MAX_DECODE_STEPS warning, no prefill length mismatch assertion. Whisper-small ASR matches the requested text on every well-formed input.

Root cause analysis is sound: with the old prompt_token_ids = self._voxcpm2_encode(input), the scheduler reserves text_len slots, the talker returns embeds of length tts_len = ref_t + text_len + 1, gpu_model_runner.py:1269 truncates with seg_len = min(span_len, req_embeds.shape[0]), and _prepare_residual_prefill zero-pads base_lm_out at the tail so lm_hidden at the audio_start position becomes zero. Padding prompt_token_ids to the full prefill_len is the right fix.

Two concerns inline, would treat the offline-example regression as blocking until addressed.

Comment thread vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py
Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Comment thread vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py Outdated
Sy0307 added 2 commits April 18, 2026 16:25
…ict assert

Extract ``build_voxcpm2_prompt`` into ``voxcpm2_talker.py`` so online serving
and the offline ``end2end.py`` share one tokenizer/CJK-split code path.  This
removes the length prediction drift between ``_voxcpm2_tokenizer.encode`` (used
by serving) and ``tts.text_tokenizer(prompt_text)`` (used by the talker) that a
future CJK-range change would have exposed via the prefill-length assertion.

- ``serving_speech._build_voxcpm2_prompt`` delegates to the shared helper.
- ``examples/offline_inference/voxcpm2/end2end.py`` builds the same padded
  ``prompt_token_ids`` via the helper, so ``--reference-audio`` and the new
  ``--ref-text`` flag no longer trip the talker assert.
- ``voxcpm2_talker._prepare_residual_prefill`` keeps a strict
  ``scaffold_len == tts_len`` assert with no fallback pad: zero-padding
  ``base_lm_out`` turned ``lm_hidden`` at audio_start into zeros and caused
  the original voice-clone decode loop, so a hard failure is correct.
- ``voxcpm2_talker.preprocess`` tightens ``token_ids = real[0] if real else …``
  to ``real is None`` so an explicit empty ``text_token_ids`` list surfaces a
  bug instead of silently using ``input_ids``.

Verified on H20 against openbmb/VoxCPM2 (online + offline, zero-shot / ref-only
/ ref+ref_text): all 6 cases decode within 1–13 s with no MAX_DECODE_STEPS
warning, no prefill-length assertion, and Whisper ASR matches the target text.

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307
Contributor Author

Sy0307 commented Apr 18, 2026

PTAL again @linyueqian @hsliuustc0106. Also, can we run CI for this PR?

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 18, 2026
Collaborator

@linyueqian linyueqian left a comment

Re-verified against 7dd2f47d on H20.

All three prior points resolved:

  1. [blocking] offline regression → build_voxcpm2_prompt was extracted into voxcpm2_talker.py and both the online serving path and examples/offline_inference/voxcpm2/end2end.py now go through it. I ran examples/offline_inference/voxcpm2/end2end.py --reference-audio /tmp/en_only.wav --text ... end-to-end: output is a 3.36 s WAV, whisper-small transcribes back to the requested text, no prefill length mismatch assert, no MAX_DECODE_STEPS. The dead if scaffold_len < tts_len zero-pad branch in _prepare_residual_prefill is also gone, which is the right cleanup now that the assertion guarantees equality.

  2. [important] tokenizer mismatch → The shared helper runs split_multichar_chinese(tokenizer.encode(..., add_special_tokens=True), split_map) for both the target text and the ref_text, so the two sides can never drift (see the sketch after this list).

  3. [suggestion] empty-list fallback → token_ids = input_ids.tolist() if real is None else real[0] as proposed.
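
For reference, a guess at the helper's core (only the call shape ``split_multichar_chinese(token_ids, split_map)`` appears in this thread; the per-token expansion below is inferred from the name, not read from the source):

```python
def split_multichar_chinese(token_ids: list[int],
                            split_map: dict[int, list[int]]) -> list[int]:
    # Tokens that cover several CJK characters are expanded into their
    # per-character sequences, so serving and talker count the same number
    # of prefill positions for the same string.
    out: list[int] = []
    for tid in token_ids:
        out.extend(split_map.get(tid, [tid]))
    return out
```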

Nice refactor, LGTM.

@linyueqian linyueqian enabled auto-merge (squash) April 18, 2026 13:55
@linyueqian linyueqian merged commit a683b1d into vllm-project:main Apr 18, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
@gesla2024

gesla2024 commented Apr 20, 2026

Hello, thank you. I just pulled the latest branch: both streaming and non-streaming output still work, and the Chinese and English TTS voices sound normal.

However, two issues remain: 1. With voice cloning, the generated TTS audio's timbre drifts, regardless of whether the output is streaming or non-streaming. 2. In streaming output, with either a cloned voice or the default voice, the same audio is returned twice, so it plays back twice.

8.mp4

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026

Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Bug]: VoxCPM2 voice-cloning decoder never emits stop token, output always ~5 min

4 participants