[Perf] VoxCPM2: streaming VAE + compile optimization (45% RTF reduction) #2758
Replace the O(N^2) accumulate-and-re-decode loop in _collect_audio with a nanovllm-style sliding-window stream: each VAE decode takes only the trailing pad frames plus the newly-generated latents, and we slice out just the new audio region. Total VAE work drops from O(N^2) to O(N) over a full generation. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
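The difference between the two decode loops can be sketched with a toy stand-in for the VAE. This is a minimal illustration, not the PR's `_collect_audio`: `toy_decode`, `HOP`, and the assumption that each output frame depends on at most `PAD` earlier latents are all hypothetical, chosen so that the windowed decode reproduces the full decode exactly.

```python
from typing import List

PAD = 12    # trailing context frames kept for each decode (decode_pad in the PR summary)
PATCH = 4   # newly generated latent frames per step
HOP = 2     # audio samples per latent frame in this toy decoder (assumption)


def toy_decode(latents: List[float], work: List[int]) -> List[float]:
    """Hypothetical causal decoder: frame i sees at most PAD earlier frames."""
    out: List[float] = []
    for i in range(len(latents)):
        work[0] += 1  # count every frame that gets decoded
        context = latents[max(0, i - PAD): i + 1]
        out.extend([sum(context)] * HOP)
    return out


def redecode_everything(latents: List[float], work: List[int]) -> List[float]:
    """Accumulate-and-re-decode: the whole history is decoded again each step.
    Assumes len(latents) is a multiple of PATCH."""
    audio: List[float] = []
    for end in range(PATCH, len(latents) + 1, PATCH):
        audio = toy_decode(latents[:end], work)
    return audio


def sliding_window(latents: List[float], work: List[int]) -> List[float]:
    """Decode only [trailing PAD frames + new PATCH frames], keep the new audio."""
    audio: List[float] = []
    for start in range(0, len(latents), PATCH):
        end = min(start + PATCH, len(latents))
        window = latents[max(0, start - PAD): end]
        decoded = toy_decode(window, work)
        audio.extend(decoded[-(end - start) * HOP:])  # slice out the new region
    return audio


latents = [float(i % 7) for i in range(80)]
full_work, window_work = [0], [0]
full_audio = redecode_everything(latents, full_work)
window_audio = sliding_window(latents, window_work)
print(full_audio == window_audio, full_work[0], window_work[0])
```

In this sketch the per-step window is bounded by `PAD + PATCH` frames, so total decoded frames grow linearly with sequence length, while the re-decode loop grows quadratically.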
The inner Euler integration in _optimized_solve_euler called .item() on the 0-dim GPU tensors t and dt up to 4 times per diffusion step, forcing a GPU->CPU sync every time. With n_timesteps=10 and ~4 syncs per step that is ~40 syncs per AR decode step; profiling counted ~4k aten::_local_scalar_dense calls over a long generation. Broadcast the 0-dim tensors directly via .copy_() instead, keeping the work on-device. Also gate the one-shot prefill norm log behind an isEnabledFor(DEBUG) check so it no longer syncs on every request. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
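The sync-free pattern described in this commit can be illustrated as follows. This is a hedged sketch rather than the code in `_optimized_solve_euler`: the helper names and buffer shapes are hypothetical, and it falls back to CPU when no GPU is present (where `.item()` is cheap but the pattern is the same).

```python
import torch


def broadcast_time_sync(t: torch.Tensor, batch: int) -> torch.Tensor:
    """Baseline: .item() pulls the scalar to the host, which forces a
    GPU->CPU sync when t lives on a CUDA device."""
    return torch.full((batch,), t.item(), dtype=t.dtype, device=t.device)


def broadcast_time_on_device(t: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    """copy_ broadcasts the 0-dim source into the preallocated buffer
    without reading its value on the host."""
    out.copy_(t)
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
t = torch.tensor(0.0, device=device)    # 0-dim tensor, like t in the Euler loop
dt = torch.tensor(0.1, device=device)   # 0-dim step size
t_buf = torch.empty(4, device=device)   # per-batch time buffer (hypothetical shape)

for _ in range(10):                     # n_timesteps=10, as in the commit message
    broadcast_time_on_device(t, t_buf)
    t = t + dt                          # stays a 0-dim tensor on the device
```

Both helpers produce the same values; only the first one reads the scalar on the host.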
PR vllm-project#2690 compiled `layer.mlp` and `layer.self_attn.o_proj` separately (2 compiled regions per layer, fullgraph=True). Profiling showed 1,737 per-layer compiled-region dispatches on a long prompt at ~530 us CPU self-time each (~925 ms of pure Dynamo dispatch overhead). Wrap `Model.forward` in a single `torch.compile(fullgraph=False)` so Dynamo traces the full 28-layer loop once. Graph breaks at PagedAttention produce sub-graphs that are memoised after the first step, collapsing per-step Python dispatch from 28+ calls to a handful. Same treatment for the 8-layer residual model. Benchmarked on H20: RTF dropped from 0.197 to 0.126 (36%) on the long prompt, matching or beating nanovllm-voxcpm on short prompts. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
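The overhead and speedup figures quoted in this commit message can be cross-checked with a few lines of arithmetic. The inputs are the rounded numbers from the message itself, so the results are approximate.

```python
# Figures quoted in the commit message (rounded by the author).
dispatches = 1_737            # per-layer compiled-region dispatches on a long prompt
cpu_us_per_dispatch = 530     # approximate CPU self-time per dispatch, in microseconds

overhead_ms = dispatches * cpu_us_per_dispatch / 1_000
print(f"Dynamo dispatch overhead: about {overhead_ms:.0f} ms")

rtf_before, rtf_after = 0.197, 0.126   # H20, long prompt
reduction = (rtf_before - rtf_after) / rtf_before
print(f"RTF reduction: about {reduction:.0%}")
```

With the rounded per-dispatch cost the product comes out slightly below the quoted ~925 ms, which is consistent with "~530 us" being an approximation.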
|
@JuanPZuluaga @Sy0307 ptal |
Also fix streaming: return delta audio chunks (not cumulative) from _collect_audio, and return None on steps without a VAE decode. The output processor accumulates deltas into a list; the speech streaming layer yields each new entry as a separate PCM chunk to the client. Previously, returning cumulative audio caused the client to replay the full audio from the start on every VAE decode interval. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
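Why cumulative chunks cause replay can be shown with a small simulation. The names below (`producer`, `client_playback`) are hypothetical stand-ins for the model's per-step output and the streaming layer; samples are plain integers for readability.

```python
from typing import Iterator, List, Optional


def producer(chunks: List[List[int]], cumulative: bool) -> Iterator[Optional[List[int]]]:
    """Yield one entry per step: audio on VAE-decode steps, None otherwise."""
    so_far: List[int] = []
    for chunk in chunks:
        yield None                      # a step without a VAE decode
        so_far = so_far + chunk
        yield list(so_far) if cumulative else chunk


def client_playback(steps: Iterator[Optional[List[int]]]) -> List[int]:
    """Play every non-None entry as its own PCM chunk, in arrival order."""
    played: List[int] = []
    for entry in steps:
        if entry is None:
            continue
        played.extend(entry)
    return played


chunks = [[1, 2], [3, 4], [5, 6]]
delta_playback = client_playback(producer(chunks, cumulative=False))
cumulative_playback = client_playback(producer(chunks, cumulative=True))
print(delta_playback)
print(cumulative_playback)
```

With delta chunks the client hears each sample once; with cumulative chunks every decode interval restarts from the first sample.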
|
The issue with streaming has been resolved, but there is a new problem: the input text is in Simplified Chinese, but the generated audio either speaks Cantonese or just produces noise. If streaming output is not enabled, the generated audio is always 16k in size, and the file contains nothing. Environment: ubuntu24.04 This is the command I use to start the server.
This is the request I tested after the server started
These are the generated audio files. This is the program I use for testing: `from __future__ import annotations DEFAULT_API_BASE = "http://localhost:8071" def encode_audio_to_base64(audio_path: str) -> str: def main() -> None: if __name__ == "__main__":` |
|
Thanks for the detailed report! This is a confirmed bug introduced by the sliding-window VAE change (ff7b5af). @gesla2024 Root cause: The streaming VAE refactor switched Fix: One-line change in Verified on H20:
cc @linyueqian |
|
BLOCKER scan:
OVERALL: NO BLOCKERS VERDICT: COMMENT Excellent work! The 40-44% RTF reduction is impressive. A few minor notes:
Test plan shows e2e tests are pending - would be good to verify these before merge. |
|
With @Sy0307's suggested one-line change, a benchmark of 5 requests on an L20 48G:
|
The streaming VAE change (ff7b5af) switched _collect_audio to return per-step delta chunks instead of cumulative audio. Offline consumers then received only the last chunk (~0.16s) because _consolidate_multimodal_tensors in engine/output_processor.py skipped concatenation for the 'audio' key, and the non-streaming speech server kept only the last list entry (~16 KB WAV, empty-sounding output). Fix at the structural root: have the consolidation step concatenate audio delta chunks into the full waveform (flatten each to 1-D first to tolerate inconsistent leading dims). Consolidation only runs on finished=True so streaming is unaffected. Offline extract_audio helpers add a defensive torch.cat fallback for mid-stream list snapshots; normal completed requests now see a single consolidated tensor. Reported by @gesla2024 in vllm-project#2758, root-caused by @Sy0307. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
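The consolidation idea can be sketched as a small helper. This is an assumption-labelled illustration, not the body of `_consolidate_multimodal_tensors`: the function name `consolidate_audio` and the chunk shapes are hypothetical.

```python
from typing import List, Optional, Union

import torch


def consolidate_audio(
    audio: Union[torch.Tensor, List[torch.Tensor], None],
) -> Optional[torch.Tensor]:
    """Hypothetical helper: turn a list of delta chunks into one waveform."""
    if audio is None:
        return None
    if isinstance(audio, torch.Tensor):
        return audio                      # already consolidated
    # Flatten each chunk to 1-D so inconsistent leading dims do not break cat.
    chunks = [c.reshape(-1) for c in audio if c is not None and c.numel() > 0]
    return torch.cat(chunks) if chunks else None


# Delta chunks with inconsistent leading dimensions (made-up shapes).
deltas = [torch.ones(1, 4), torch.ones(4), torch.ones(1, 1, 2)]
waveform = consolidate_audio(deltas)
last_only = deltas[-1].reshape(-1)        # what "keep only the last entry" yields
print(waveform.shape, last_only.shape)
```

Keeping only the last list entry returns a fraction of the samples, which matches the short, empty-sounding output described in the report.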
After the output_processor consolidation fix, multimodal_output["audio"]
is a Tensor rather than a list, so `dict.get("audio") or dict.get(...)`
raises "Boolean value of Tensor with more than one value is ambiguous".
Use explicit None-checks instead of `or` short-circuiting.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
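The failure mode in this commit message is easy to reproduce in isolation. In the sketch below only the `"audio"` key comes from the message; `"fallback_audio"` is a hypothetical second key standing in for the elided `dict.get(...)` argument.

```python
import torch


def pick_audio_buggy(mm: dict):
    # `or` calls bool() on the first operand, which is ambiguous for a
    # multi-element Tensor.
    return mm.get("audio") or mm.get("fallback_audio")


def pick_audio(mm: dict):
    # Explicit None-check: never evaluates the Tensor's truth value.
    audio = mm.get("audio")
    if audio is None:
        audio = mm.get("fallback_audio")
    return audio


mm = {"audio": torch.zeros(4)}

error = ""
try:
    pick_audio_buggy(mm)
except RuntimeError as exc:
    error = str(exc)

print(error)
print(pick_audio(mm) is mm["audio"])
```

Note that `or` is also wrong for single-element tensors: a one-sample tensor holding zero is falsy, so the `or` form would silently skip it instead of raising.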
|
@Sy0307 Thank you, I will download the updated repository and test it. |
|
I just pulled the latest code to test, and the noise issue has been resolved. However, the generated output, whether streamed, voice-cloned, or non-streamed, is not normal Mandarin. The test program I used is the gradio_demo.py program from this example: https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/voxcpm2/#example-materials. Below is the recording of my test. Test environment: 3.mp4 2.mp4 1.mp4 |
|
Thanks so much for testing. @gesla2024 let's move our discussion over to PR #2803. |
Thank you, I will pull this branch and test it now. |
|
After pulling #2803 and testing, I see the problem is still the same.
This was generated with the cloned voice configured and streaming output disabled. This was generated without the cloned voice configured, also with streaming output disabled. Whether I use gradio_demo.py or openai_speech_client.py, the result is the same: the TTS voice generated for Chinese content is not quite right. Another detail: with streaming output enabled, gradio_demo.py plays the audio twice. 4.mp4
I used the test Python program from https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/voxcpm2/#example-materials directly, and then also wrote a simple script of my own, but the generated Chinese speech was incorrect in both cases. Calling the model directly to generate it had no problem. CUDA: 12.6. I updated the main branch of vllm-omni and then pulled branch #2803. The command to run the vllm service is
Using the command in the vllm-omni documentation at https://docs.vllm.ai/projects/vllm-omni/en/latest/user_guide/examples/online_serving/voxcpm2/#start-the-server, a warning appears and it crashes immediately without detailed logs.
:128: RuntimeWarning: 'vllm_omni.entrypoints.openai.api_server' found in sys.modules after import of package 'vllm_omni.entrypoints.openai', but prior to execution of 'vllm_omni.entrypoints.openai.api_server'; this may result in unpredictable behaviour
This is the content returned at runtime
|
|
Cannot reproduce your issue, @gesla2024. Also cc @hsliuustc0106 @linyueqian |
`from __future__ import annotations import base64 DEFAULT_API_BASE = "http://localhost:8071" REFERENCE_AUDIO_PATH = r"D:\Users\Administrator\Downloads\200030.wav" def encode_audio_to_base64(audio_path: str) -> str: def main() -> None: if __name__ == "__main__":` This is my test code for generating audio; let's see if it helps you. |
|
Sorry, this was my mistake: I forgot to remove a temporary tokenizer config that I had left on my remote test machine for testing. That is why my Chinese tokenizer result was correct while the main branch's result is not. Sorry again. @gesla2024 |
…on) (vllm-project#2758) Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
|
Okay, thank you. I will pull it again and test it. |

Summary
Four changes for VoxCPM2 performance and streaming:
Sliding-window VAE decode — replaces the O(N^2) accumulate-and-re-decode pattern with per-step streaming: each VAE call takes [decode_pad (12 frames) + new_patch (4 frames)] and slices out the new audio region using exact decoder_chunk_size alignment. Matches the nanovllm-voxcpm reference implementation.
Eliminate GPU->CPU syncs in CFM diffusion — the Euler integration loop called .item() on 0-dim GPU tensors t/dt up to 4x per diffusion step (x10 timesteps x ~60 decode steps = ~2,400 syncs per long prompt). Replaced with on-device .copy_() broadcasts.
Compile whole Model.forward instead of per-submodule — PR [Perf]: Speedup VoxCPM2 TTS performance and Support PagedAttention #2690 compiled layer.mlp + layer.self_attn.o_proj separately (56 Dynamo dispatches per step). Wrapping Model.forward in torch.compile(fullgraph=False) lets Dynamo memoise the full 28-layer loop. Biggest single win (~36%).
Streaming Gradio demo — AudioWorklet-based gapless streaming player (adapted from Qwen3-TTS demo) with live TTFP/RTF metrics. Supports all 3 VoxCPM2 modes: Voice Design, Controllable Cloning, and Ultimate Cloning.
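The counts quoted in the list above follow from a few products. The sketch below recomputes them from the stated inputs; the per-layer region count of 2 is taken from the compile commit message (`layer.mlp` and `layer.self_attn.o_proj`), and the decode-step count is the approximate figure given in the summary.

```python
# Sliding-window VAE: frames fed to each decode call.
decode_pad_frames = 12
new_patch_frames = 4
frames_per_vae_call = decode_pad_frames + new_patch_frames

# CFM diffusion: worst-case .item() syncs on a long prompt.
item_calls_per_diffusion_step = 4     # "up to 4x"
n_timesteps = 10
decode_steps = 60                     # approximate
syncs_per_long_prompt = item_calls_per_diffusion_step * n_timesteps * decode_steps

# Per-submodule compile: Dynamo dispatches per model step.
num_layers = 28
compiled_regions_per_layer = 2
dispatches_per_step = num_layers * compiled_regions_per_layer

print(frames_per_vae_call, syncs_per_long_prompt, dispatches_per_step)
```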
Benchmark results (H20 GPU, openbmb/VoxCPM2)
Net: 40-44% RTF reduction. Audio quality verified by listening. Streaming playback verified via Gradio demo (gapless, no boundary artifacts).
Files changed
vllm_omni/model_executor/models/voxcpm2/voxcpm2_talker.py: streaming VAE decode, CFM sync fix
vllm_omni/model_executor/models/voxcpm2/minicpm4_paged.py: whole-model compile strategy
examples/online_serving/voxcpm2/gradio_demo.py: streaming Gradio demo (NEW)
Test plan