
[Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prompt#2894

Merged

linyueqian merged 5 commits into vllm-project:main from Sy0307:fix/voxcpm2-voice-clone on Apr 18, 2026

Conversation

Contributor

@Sy0307 Sy0307 commented Apr 17, 2026

Purpose

Fix VoxCPM2 voice cloning via /v1/audio/speech: any request carrying ref_audio produced ~136-320 s of noise and eventually tripped MAX_DECODE_STEPS=2000. Follow-up to #2832.

Root cause

Three pieces have to line up for prefill to work correctly, and today they don't:

  1. Talker expects a long prefill. voxcpm2_talker._build_prefill_inputs concatenates ref_prefix / prompt_feat in front of the target text plus an audio_start token, so the embedding fed into base_lm has shape tts_len = ref_feat_len + text_len + 1 (and more for ref_continuation).
  2. vLLM only allocates text-sized slots. The serving layer sent prompt_token_ids = self._voxcpm2_encode(input), so the scheduler reserved exactly len(text) forward slots. gpu_model_runner then clamps the talker-produced embeddings with seg_len = min(span_len, req_embeds.shape[0]) (gpu_model_runner.py:1306) and silently drops the overflow.
  3. Talker zero-pads to cover the missing slots. _prepare_residual_prefill right-pads base_lm_out back up to tts_len (voxcpm2_talker.py:797-806), which means the audio_start position - the last column of enc_outputs that stop_head reads - becomes zero.

Zeroed lm_hidden makes stop_head(lm_hidden).argmax() stick to "don't stop" on every step, so decoding runs until the hard MAX_DECODE_STEPS cap and returns garbled audio.

Observed on a clean origin/main build serving openbmb/VoxCPM2: a 9.92 s English ref clip gave scaffold_len=11 but tts_len=75; the missing 64 slots (ref_start + 62 VAE patches + ref_end) never reached base_lm.
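
In toy form, the failure chain above (a minimal sketch with the dimensions from the repro; the ``Linear`` is a stand-in for the real ``stop_head``, not the actual module):

```python
import torch

# From the repro: vLLM reserved scaffold_len = 11 slots, but the talker's
# prefill is tts_len = 75.
scaffold_len, tts_len, hidden = 11, 75, 8

# base_lm only ran over the reserved slots...
base_lm_out = torch.randn(scaffold_len, hidden)

# ...and the old _prepare_residual_prefill right-padded back to tts_len
# with zeros, burying the audio_start position (the last row) in the pad.
padded = torch.cat([base_lm_out,
                    base_lm_out.new_zeros(tts_len - scaffold_len, hidden)])

# stop_head reads that last row; a zero input means its logits collapse to
# the bias, identical on every decode step, so "stop" can never win argmax.
stop_head = torch.nn.Linear(hidden, 2)  # stand-in head
lm_hidden = padded[-1]
assert lm_hidden.abs().sum() == 0
print(torch.allclose(stop_head(lm_hidden), stop_head.bias))  # True
```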

Changes

vllm_omni/entrypoints/openai/serving_speech.py

  • New _build_voxcpm2_prompt helper, shaped like _build_fish_speech_prompt / _build_cosyvoice3_prompt: pads prompt_token_ids to the full prefill length (effective_text + audio_start + ref/prompt region) and ships the real (post-BOS-strip) text IDs through additional_information["text_token_ids"]. The placeholder [1] * prefill_len only reserves slots; its base_lm output is discarded by the talker's feat_mask.
  • ref_audio + ref_text is routed to native VoxCPM2 continuation mode (prompt_audio + prompt_text); ref_audio alone stays on reference mode. This matches what the OpenAI speech client means by "reference audio + transcript".
  • ref_feat_len is computed from hf_config.audio_vae_config (sample_rate + encoder_rates) so a VAE layout change surfaces as a startup error rather than decode-time drift.
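
For illustration, one plausible shape of that computation (a sketch: the 16 kHz sample rate and the ``encoder_rates`` values below are assumptions, chosen only so the arithmetic reproduces the 62-patch figure from the repro above):

```python
import math

def ref_feat_len_from_vae(duration_s: float, sample_rate: int,
                          encoder_rates: list[int]) -> int:
    # Assumes the VAE encoder downsamples by prod(encoder_rates) samples
    # per patch and that partial patches round up; the real code reads
    # these values from hf_config.audio_vae_config at startup.
    hop = math.prod(encoder_rates)          # samples per VAE patch
    num_samples = round(duration_s * sample_rate)
    return math.ceil(num_samples / hop)

# 9.92 s clip from the repro: 158720 samples / 2560 = 62 patches, which
# plus ref_start and ref_end gives the 64 missing slots observed above.
print(ref_feat_len_from_vae(9.92, 16000, [4, 4, 4, 5, 8]))  # -> 62
```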

vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py

  • _prepare_residual_prefill asserts scaffold_len == tts_len. Serving and talker must agree on prefill length bit-for-bit; any future drift crashes loudly instead of regressing to 300 s of noise.
  • preprocess() reads real text IDs from additional_information["text_token_ids"] when present and falls back to input_ids.tolist() for offline callers (examples/offline_inference/voxcpm2/end2end.py).
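
In sketch form (simplified, hypothetical function name), the lookup described in this bullet, shown with the tightened ``real is None`` check it acquired in the follow-up commit:

```python
def resolve_text_token_ids(additional_information: dict | None, input_ids):
    # Online serving ships the real (post-BOS-strip) text IDs out-of-band;
    # offline callers omit the key entirely and fall back to input_ids.
    real = (additional_information or {}).get("text_token_ids")
    if real is None:  # key absent -> offline path
        return input_ids.tolist()
    return real[0]    # an explicit empty list now fails loudly (IndexError)
                      # instead of silently reusing input_ids
```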

Test Plan

Verified on NVIDIA H20, openbmb/VoxCPM2, patched branch:

  1. Start server: vllm-omni serve openbmb/VoxCPM2 --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml --omni --port 8765 --trust-remote-code.
  2. POST /v1/audio/speech covering text-only, ref_audio only, and ref_audio + ref_text, in both English and Chinese (example request after this list).
  3. ASR each response with whisper-base.
  4. Repeat with a ~30 s reference clip to exercise long ref_feat_len.
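
A request sketch for step 2 (the endpoint, port, and field names come from this PR; the exact schema, e.g. whether ``ref_audio`` is base64-encoded, is an assumption):

```python
import base64
import requests

# Encode the reference clip; whether the API takes base64, a URL, or a
# multipart upload is an assumption in this sketch.
with open("ref_en.wav", "rb") as f:
    ref_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8765/v1/audio/speech",
    json={
        "model": "openbmb/VoxCPM2",
        "input": "Hello, this is a voice clone verification request.",
        "ref_audio": ref_b64,  # omit for text-only; ref_audio alone = reference mode
        "ref_text": "...",     # add the transcript to get native continuation mode
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)  # step 3: run this file through whisper-base
```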

Test Result

All seven scenarios return within a few seconds, decode stops naturally (no MAX_DECODE_STEPS warning), and the talker's scaffold_len == tts_len assert never triggers.

| Scenario | Duration | Whisper transcription | Matches target |
| --- | --- | --- | --- |
| text-only EN | 3.20 s | Hello, this is a voice clone verification request. | yes |
| ref-only EN (9.92 s ref) | 3.36 s | Hello, this is a voice clone verification request. | yes |
| ref + ref_text EN | 4.16 s | Hello, this is a voice clone verification request. | yes |
| zh, ref-only | 2.24 s | 你好,这是一个测试程序。 ("Hello, this is a test program.") | yes |
| zh, ref + ref_text | 3.04 s | 你好,这是一个测试程序。 ("Hello, this is a test program.") | yes |
| long ref-only (29.76 s ref) | 2.72 s | Testing a long reference clip for voice cloning. | yes |
| long ref + ref_text | 3.84 s | Testing a long reference clip for voice cloning. | yes |

Before the fix (same server binary minus this patch), the ref-only English request produced a 320.16 s output that Whisper failed to transcribe, and the worker logged WARNING ... MAX_DECODE_STEPS for speech-... (2000), forcing stop.

cc @linyueqian @gesla2024 - this should also unblock the Chinese / voice-clone cases you reported on #2832.

The talker's prefill embedding includes ref_audio / prompt_audio regions on
top of the target text, but vLLM only forwarded ``len(prompt_token_ids)``
slots through base_lm. The remaining ref/prompt positions were zero-padded,
so lm_hidden at the audio_start position read zero and stop_head never fired
— decode ran to MAX_DECODE_STEPS=2000 and the response was ~320s of noise.

Serving now pads ``prompt_token_ids`` to the full prefill length (text +
audio_start + ref/prompt region) using AudioVAE parameters read from
hf_config, and ships the real text tokens through
``additional_information['text_token_ids']``. ``ref_audio + ref_text``
is routed to native continuation mode; ``ref_audio`` alone keeps
reference-only mode.

To prevent silent regressions from layout drift, the talker now asserts
``scaffold_len == tts_len`` at prefill entry — any mismatch crashes
immediately instead of degrading to noisy audio.
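
The padded-prompt contract, as a minimal sketch (hypothetical helper name and simplified signature; only the ``[1]`` placeholder and the out-of-band ``text_token_ids`` come from this PR):

```python
def build_padded_prompt(text_ids: list[int], ref_region_len: int):
    # The prefill must cover everything the talker will build:
    # ref/prompt region + effective text + the trailing audio_start token.
    prefill_len = ref_region_len + len(text_ids) + 1
    # Placeholder IDs only reserve scheduler slots; their base_lm output
    # is discarded by the talker's feat_mask.
    prompt_token_ids = [1] * prefill_len
    # The real (post-BOS-strip) text IDs travel out-of-band so the talker's
    # preprocess() can rebuild the true prefill.
    additional_information = {"text_token_ids": [text_ids]}
    return prompt_token_ids, additional_information
```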

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 17, 2026 20:40

@linyueqian linyueqian self-requested a review April 17, 2026 20:40
@hsliuustc0106
Collaborator

🚫 Pre-commit check failing. Please fix before proceeding.

@linyueqian
Collaborator

fix pre-commit please

Collaborator

@linyueqian linyueqian left a comment

Verified end-to-end on H20 against openbmb/VoxCPM2.

Bug reproduced on origin/main (same model, same wav, same port): ref-only EN (9.92 s ref) returned 320.16 s of noise, server logged WARNING [voxcpm2_talker.py] MAX_DECODE_STEPS for speech-... (2000), forcing stop. Matches the description exactly.

With this PR: 5/5 cases (text-only / ref-only / ref+ref_text, EN+ZH) decode within 0.3 to 13 s, no MAX_DECODE_STEPS warning, no prefill length mismatch assertion. Whisper-small ASR matches the requested text on every well-formed input.

Root cause analysis is sound: with the old prompt_token_ids = self._voxcpm2_encode(input), the scheduler reserves text_len slots, the talker returns embeds of length tts_len = ref_t + text_len + 1, gpu_model_runner.py:1269 truncates with seg_len = min(span_len, req_embeds.shape[0]), and _prepare_residual_prefill zero-pads base_lm_out at the tail so lm_hidden at the audio_start position becomes zero. Padding prompt_token_ids to the full prefill_len is the right fix.

Two concerns inline, would treat the offline-example regression as blocking until addressed.

Comment thread vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py
Comment thread vllm_omni/entrypoints/openai/serving_speech.py Outdated
Comment thread vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py Outdated
Sy0307 added 2 commits April 18, 2026 16:25
…ict assert

Extract ``build_voxcpm2_prompt`` into ``voxcpm2_talker.py`` so online serving
and the offline ``end2end.py`` share one tokenizer/CJK-split code path.  This
removes the length prediction drift between ``_voxcpm2_tokenizer.encode`` (used
by serving) and ``tts.text_tokenizer(prompt_text)`` (used by the talker) that a
future CJK-range change would have exposed via the prefill-length assertion.

- ``serving_speech._build_voxcpm2_prompt`` delegates to the shared helper.
- ``examples/offline_inference/voxcpm2/end2end.py`` builds the same padded
  ``prompt_token_ids`` via the helper, so ``--reference-audio`` and the new
  ``--ref-text`` flag no longer trip the talker assert.
- ``voxcpm2_talker._prepare_residual_prefill`` keeps a strict
  ``scaffold_len == tts_len`` assert with no fallback pad: zero-padding
  ``base_lm_out`` turned ``lm_hidden`` at audio_start into zeros and caused
  the original voice-clone decode loop, so a hard failure is correct.
- ``voxcpm2_talker.preprocess`` tightens ``token_ids = real[0] if real else …``
  to ``real is None`` so an explicit empty ``text_token_ids`` list surfaces a
  bug instead of silently using ``input_ids``.

Verified on H20 against openbmb/VoxCPM2 (online + offline, zero-shot / ref-only
/ ref+ref_text): all 6 cases decode within 1–13 s with no MAX_DECODE_STEPS
warning, no prefill-length assertion, and Whisper ASR matches the target text.

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307
Contributor Author

Sy0307 commented Apr 18, 2026

PTAL again @linyueqian @hsliuustc0106. Also, can we run CI for this PR?

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 18, 2026
Collaborator

@linyueqian linyueqian left a comment

Re-verified against 7dd2f47d on H20.

All three prior points resolved:

  1. [blocking] offline regression → build_voxcpm2_prompt was extracted into voxcpm2_talker.py and both the online serving path and examples/offline_inference/voxcpm2/end2end.py now go through it. I ran examples/offline_inference/voxcpm2/end2end.py --reference-audio /tmp/en_only.wav --text ... end-to-end: output is a 3.36 s WAV, whisper-small transcribes back to the requested text, no prefill length mismatch assert, no MAX_DECODE_STEPS. The dead if scaffold_len < tts_len zero-pad branch in _prepare_residual_prefill is also gone, which is the right cleanup now that the assertion guarantees equality.

  2. [important] tokenizer mismatch → The shared helper runs split_multichar_chinese(tokenizer.encode(..., add_special_tokens=True), split_map) for both the target text and the ref_text, so the two sides can never drift (see the sketch after this list).

  3. [suggestion] empty-list fallback → token_ids = input_ids.tolist() if real is None else real[0] as proposed.
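
For reference, a guess at the helper's core (only the call shape ``split_multichar_chinese(token_ids, split_map)`` appears in this thread; the per-token expansion below is inferred from the name, not read from the source):

```python
def split_multichar_chinese(token_ids: list[int],
                            split_map: dict[int, list[int]]) -> list[int]:
    # Tokens that cover several CJK characters are expanded into their
    # per-character sequences, so serving and talker count the same number
    # of prefill positions for the same string.
    out: list[int] = []
    for tid in token_ids:
        out.extend(split_map.get(tid, [tid]))
    return out
```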

Nice refactor, LGTM.

@linyueqian linyueqian enabled auto-merge (squash) April 18, 2026 13:55
@linyueqian linyueqian merged commit a683b1d into vllm-project:main Apr 18, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
@gesla2024

gesla2024 commented Apr 20, 2026

Hello, thank you. I just pulled the latest branch: both streaming and non-streaming output still work, and the Chinese and English TTS voices sound normal.

However, two issues remain: 1. With voice cloning, the generated TTS audio's timbre drifts, regardless of whether the output is streaming or non-streaming. 2. In streaming output, with either a cloned voice or the default voice, the same audio is returned twice, so it plays back twice.

8.mp4

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026

Labels

ready label to trigger buildkite CI

Development

Successfully merging this pull request may close these issues.

[Bug]: VoxCPM2 voice-cloning decoder never emits stop token, output always ~5 min

4 participants