[Bugfix][VoxCPM2] Fix voice-clone decode loop by padding prefill prompt #2894
linyueqian merged 5 commits into vllm-project:main
Conversation
The talker's prefill embedding includes ``ref_audio`` / ``prompt_audio`` regions on top of the target text, but vLLM only forwarded ``len(prompt_token_ids)`` slots through ``base_lm``. The remaining ref/prompt positions were zero-padded, so ``lm_hidden`` at the ``audio_start`` position read zero and ``stop_head`` never fired: decode ran to ``MAX_DECODE_STEPS=2000`` and the response was ~320 s of noise. Serving now pads ``prompt_token_ids`` to the full prefill length (text + audio_start + ref/prompt region) using AudioVAE parameters read from ``hf_config``, and ships the real text tokens through ``additional_information['text_token_ids']``. ``ref_audio + ref_text`` is routed to native continuation mode; ``ref_audio`` alone keeps reference-only mode. To prevent silent regressions from layout drift, the talker now asserts ``scaffold_len == tts_len`` at prefill entry: any mismatch crashes immediately instead of degrading to noisy audio. Signed-off-by: Sy03 <1370724210@qq.com>
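A minimal sketch of the padding scheme this commit describes, assuming hypothetical names (`pad_prefill_prompt`, `real_text_ids`) and that `ref_feat_len` already counts the ref_start/ref_end markers; the actual helper is `_build_voxcpm2_prompt` in `serving_speech.py`:

```python
# Sketch only, not the shipped implementation.
def pad_prefill_prompt(real_text_ids: list[int], ref_feat_len: int) -> dict:
    # Full prefill layout the talker will build: text + audio_start + ref region.
    prefill_len = len(real_text_ids) + 1 + ref_feat_len
    return {
        # Placeholder ids only reserve scheduler slots; the talker discards
        # their base_lm output via feat_mask.
        "prompt_token_ids": [1] * prefill_len,
        # The real (post-BOS-strip) text ids travel out-of-band.
        "additional_information": {"text_token_ids": real_text_ids},
    }
```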
🚫 Pre-commit check failing. Please fix before proceeding.
fix pre-commit please
linyueqian
left a comment
Verified end-to-end on H20 against `openbmb/VoxCPM2`.
Bug reproduced on origin/main (same model, same wav, same port): ref-only EN (9.92 s ref) returned 320.16 s of noise, and the server logged `WARNING [voxcpm2_talker.py] MAX_DECODE_STEPS for speech-... (2000), forcing stop`. Matches the description exactly.
With this PR: 5/5 cases (text-only / ref-only / ref+ref_text, EN+ZH) decode within 0.3 to 13 s, no `MAX_DECODE_STEPS` warning, no prefill length mismatch assertion. Whisper-small ASR matches the requested text on every well-formed input.
Root cause analysis is sound: with the old `prompt_token_ids = self._voxcpm2_encode(input)`, the scheduler reserves `text_len` slots, the talker returns embeds of length `tts_len = ref_t + text_len + 1`, `gpu_model_runner.py:1269` truncates with `seg_len = min(span_len, req_embeds.shape[0])`, and `_prepare_residual_prefill` zero-pads `base_lm_out` at the tail, so `lm_hidden` at the `audio_start` position becomes zero. Padding `prompt_token_ids` to the full `prefill_len` is the right fix.
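To make the truncate-then-pad failure concrete, here is a standalone illustration with toy shapes; it pads the embeddings directly, where the real code pads `base_lm`'s output over the truncated slice, but the zeroed tail is the same:

```python
import torch

text_len, ref_feat_len = 10, 64          # numbers from the repro above
span_len = text_len                      # slots the scheduler reserved
req_embeds = torch.randn(ref_feat_len + text_len + 1, 8)  # tts_len rows

# gpu_model_runner-style clamp: overflow rows are silently dropped.
seg_len = min(span_len, req_embeds.shape[0])

# _prepare_residual_prefill-style right-pad back to tts_len: everything past
# seg_len, including the audio_start slot that stop_head reads, is zero.
base_lm_out = torch.zeros_like(req_embeds)
base_lm_out[:seg_len] = req_embeds[:seg_len]
assert base_lm_out[-1].abs().sum() == 0  # audio_start position zeroed
```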
Two concerns inline; I would treat the offline-example regression as blocking until addressed.
…ict assert

Extract ``build_voxcpm2_prompt`` into ``voxcpm2_talker.py`` so online serving and the offline ``end2end.py`` share one tokenizer/CJK-split code path. This removes the length-prediction drift between ``_voxcpm2_tokenizer.encode`` (used by serving) and ``tts.text_tokenizer(prompt_text)`` (used by the talker) that a future CJK-range change would have exposed via the prefill-length assertion.

- ``serving_speech._build_voxcpm2_prompt`` delegates to the shared helper.
- ``examples/offline_inference/voxcpm2/end2end.py`` builds the same padded ``prompt_token_ids`` via the helper, so ``--reference-audio`` and the new ``--ref-text`` flag no longer trip the talker assert.
- ``voxcpm2_talker._prepare_residual_prefill`` keeps a strict ``scaffold_len == tts_len`` assert with no fallback pad: zero-padding ``base_lm_out`` turned ``lm_hidden`` at audio_start into zeros and caused the original voice-clone decode loop, so a hard failure is correct.
- ``voxcpm2_talker.preprocess`` tightens ``token_ids = real[0] if real else …`` to ``real is None`` so an explicit empty ``text_token_ids`` list surfaces a bug instead of silently using ``input_ids``.

Verified on H20 against openbmb/VoxCPM2 (online + offline, zero-shot / ref-only / ref+ref_text): all 6 cases decode within 1–13 s with no ``MAX_DECODE_STEPS`` warning, no prefill-length assertion, and Whisper ASR matches the target text. Signed-off-by: Sy03 <1370724210@qq.com>
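A sketch of the shared code path this commit describes; `split_multichar_chinese` and `split_map` come from the commit text, but the import location and signatures here are assumed:

```python
# Assumed import location; the commit places the helper in voxcpm2_talker.py.
from vllm_omni.model_executor.models.voxcpm2.voxcpm2_talker import (
    split_multichar_chinese,
)

def encode_tts_text(tokenizer, text: str, split_map) -> list[int]:
    # Serving and the offline example must tokenize identically, or the
    # padded prompt length and the talker's tts_len drift apart and trip
    # the scaffold_len == tts_len assert.
    ids = tokenizer.encode(text, add_special_tokens=True)
    return split_multichar_chinese(ids, split_map)
```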
PTAL again @linyueqian @hsliuustc0106. And can we run CI for this PR?
linyueqian
left a comment
Re-verified against 7dd2f47d on H20.
All three prior points resolved:
- [blocking] offline regression → `build_voxcpm2_prompt` was extracted into `voxcpm2_talker.py` and both the online serving path and `examples/offline_inference/voxcpm2/end2end.py` now go through it. I ran `examples/offline_inference/voxcpm2/end2end.py --reference-audio /tmp/en_only.wav --text ...` end-to-end: output is a 3.36 s WAV, whisper-small transcribes back to the requested text, no `prefill length mismatch` assert, no `MAX_DECODE_STEPS`. The dead `if scaffold_len < tts_len` zero-pad branch in `_prepare_residual_prefill` is also gone, which is the right cleanup now that the assertion guarantees equality.
- [important] tokenizer mismatch → The shared helper runs `split_multichar_chinese(tokenizer.encode(..., add_special_tokens=True), split_map)` for both the target text and the ref_text, so the two sides can never drift.
- [suggestion] empty-list fallback → `token_ids = input_ids.tolist() if real is None else real[0]` as proposed.
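A two-line illustration of why the empty-list point matters (plain Python, hypothetical values):

```python
real = []                    # caller passed an explicit empty text_token_ids
input_ids_list = [1, 1, 1]   # placeholder ids

# Old check: an empty list is falsy, so it silently fell back to input_ids.
old = real[0] if real else input_ids_list            # -> [1, 1, 1], bug hidden
# New check: only None falls back; an empty list raises IndexError loudly.
# new = input_ids_list if real is None else real[0]  # -> IndexError
```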
Nice refactor, LGTM.
…pt (vllm-project#2894) Signed-off-by: Sy03 <1370724210@qq.com>
Hello, thank you. I just pulled the latest branch: both streaming and non-streaming output work, and the Chinese and English TTS voices sound normal. However, there are a few issues: 1. With voice cloning, in both streaming and non-streaming output, the generated TTS audio drifts in voice timbre. 2. In streaming output, whether using a cloned voice or the default voice, the same audio is returned twice, which causes it to be played twice. 8.mp4
…pt (vllm-project#2894) Signed-off-by: Sy03 <1370724210@qq.com>
Purpose
Fix VoxCPM2 voice cloning via `/v1/audio/speech`: any request carrying `ref_audio` produced ~136-320 s of noise and eventually tripped `MAX_DECODE_STEPS=2000`. Follow-up to #2832.

Root cause
Three facts have to line up for the bug to appear, and on main they all do:
- `voxcpm2_talker._build_prefill_inputs` concatenates `ref_prefix` / `prompt_feat` in front of the target text plus an `audio_start` token, so the embedding fed into `base_lm` has shape `tts_len = ref_feat_len + text_len + 1` (and more for `ref_continuation`).
- Serving set `prompt_token_ids = self._voxcpm2_encode(input)`, so the scheduler reserved exactly `len(text)` forward slots.
- `gpu_model_runner` then clamps the talker-produced embeddings with `seg_len = min(span_len, req_embeds.shape[0])` (`gpu_model_runner.py:1306`) and silently drops the overflow. `_prepare_residual_prefill` right-pads `base_lm_out` back up to `tts_len` (`voxcpm2_talker.py:797-806`), which means the `audio_start` position (the last column of `enc_outputs` that `stop_head` reads) becomes zero.

Zeroed `lm_hidden` makes `stop_head(lm_hidden).argmax()` stick to "don't stop" on every step, so decoding runs until the hard `MAX_DECODE_STEPS` cap and returns garbled audio (see the sketch below).

Observed on a clean origin/main build serving `openbmb/VoxCPM2`: a 9.92 s English ref clip gave `scaffold_len=11, tts_len=75`; 64 slots (ref_start + 62 VAE patches + ref_end) never reached `base_lm`.
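A toy demonstration of the stop decision on a zeroed hidden state; shapes and the two-class layout are assumptions, only the bias-argmax effect matters:

```python
import torch

hidden_size, num_classes = 8, 2           # assume class 0 = "continue"
stop_head = torch.nn.Linear(hidden_size, num_classes)
lm_hidden = torch.zeros(hidden_size)      # the zero-padded audio_start slot

# A linear head applied to a zero vector returns its bias, so the argmax
# is the same constant every step; if the bias favors "continue", only
# the MAX_DECODE_STEPS cap ever ends generation.
logits = stop_head(lm_hidden)
assert torch.allclose(logits, stop_head.bias)
```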
Changes

`vllm_omni/entrypoints/openai/serving_speech.py`:
- New `_build_voxcpm2_prompt` helper, shaped like `_build_fish_speech_prompt` / `_build_cosyvoice3_prompt`: pads `prompt_token_ids` to the full prefill length (effective_text + audio_start + ref/prompt region) and ships the real (post-BOS-strip) text IDs through `additional_information["text_token_ids"]`. The placeholder `[1] * prefill_len` only reserves slots; its `base_lm` output is discarded by the talker's `feat_mask`.
- `ref_audio + ref_text` is routed to native VoxCPM2 `continuation` mode (prompt_audio + prompt_text); `ref_audio` alone stays on `reference` mode. This matches what the OpenAI speech client means by "reference audio + transcript".
- `ref_feat_len` is computed from `hf_config.audio_vae_config` (sample_rate + encoder_rates) so a VAE layout change surfaces as a startup error rather than decode-time drift (sketched below).

`vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py`:
- `_prepare_residual_prefill` asserts `scaffold_len == tts_len`. Serving and talker must agree on prefill length bit-for-bit; any future drift crashes loudly instead of regressing to 300 s of noise.
- `preprocess()` reads real text IDs from `additional_information["text_token_ids"]` when present and falls back to `input_ids.tolist()` for offline callers (`examples/offline_inference/voxcpm2/end2end.py`).
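A hedged sketch of the startup-time `ref_feat_len` computation: the exact attribute names on `hf_config.audio_vae_config` and any extra patch-size factor are assumptions; only `sample_rate` and `encoder_rates` are named by this PR:

```python
import math

def ref_feat_len_from_config(audio_vae_config, ref_audio_seconds: float) -> int:
    sample_rate = audio_vae_config.sample_rate
    # Total encoder downsampling = product of the stride rates (a further
    # patch-size factor may apply; the PR names only these two fields).
    hop = math.prod(audio_vae_config.encoder_rates)
    num_patches = math.ceil(ref_audio_seconds * sample_rate / hop)
    return num_patches + 2  # plus the ref_start and ref_end markers
```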
Test Plan

Verified on NVIDIA H20, `openbmb/VoxCPM2`, patched branch:
- `vllm-omni serve openbmb/VoxCPM2 --stage-configs-path vllm_omni/model_executor/stage_configs/voxcpm2.yaml --omni --port 8765 --trust-remote-code`
- `/v1/audio/speech` requests covering text-only, `ref_audio` only, and `ref_audio + ref_text`, in both English and Chinese; an example request is sketched below.
- Every output transcribed back with `whisper-base`.
- Spot-checked the computed `ref_feat_len`.
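An illustrative client call for the ref-audio cases; the `ref_audio` / `ref_text` field names follow this PR's discussion, but the exact wire format (base64 vs. URL, top-level field vs. extra body) is an assumption:

```python
import base64

import requests

with open("/tmp/en_only.wav", "rb") as f:
    ref_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8765/v1/audio/speech",
    json={
        "model": "openbmb/VoxCPM2",
        "input": "Hello, this is a voice clone verification request.",
        "ref_audio": ref_b64,           # reference-only mode
        # "ref_text": "<transcript>",   # adding this selects continuation mode
    },
)
with open("out.wav", "wb") as f:
    f.write(resp.content)
```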
Test Result

All seven scenarios return within a few seconds, decode stops naturally (no `MAX_DECODE_STEPS` warning), and the talker's `scaffold_len == tts_len` assert never triggers. The seven test texts (audio attachments omitted): "Hello, this is a voice clone verification request." ×3, "你好，这是一个测试程序。" ("Hello, this is a test program.") ×2, "Testing a long reference clip for voice cloning." ×2.

Before the fix (same server binary minus this patch), the ref-only English request produced a 320.16 s output that Whisper failed to transcribe, and the worker logged `WARNING ... MAX_DECODE_STEPS for speech-... (2000), forcing stop`.

cc @linyueqian @gesla2024 - this should also unblock the Chinese / voice-clone cases you reported on #2832.