[Perf] VoxCPM2: Speedup by manual CUDA Graph capture for scaffold/residual forward #2803
Conversation
impressive! I will take a look later today.
Quick blocker scan - no issues found. A few notes:
amazing results from L20 48G:

If we compare
Capture and replay CUDA Graphs for the 28-layer scaffold and 8-layer residual model forwards during decode steps, eliminating per-step kernel launch overhead. Reduces average RTF from 0.135 to 0.106 (-21%) and improves concurrent throughput by 14-40%. Signed-off-by: Sy03 <1370724210@qq.com>
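For context, RTF (real-time factor) here is synthesis time divided by the duration of the generated audio, so lower is better and values below 1.0 mean faster-than-real-time generation. A minimal illustration of the quoted numbers; the helper name is hypothetical, not part of this PR:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / audio duration; lower is better."""
    return synthesis_seconds / audio_seconds

# Relative change quoted in the commit message:
baseline, optimized = 0.135, 0.106
print(f"{optimized / baseline - 1:+.0%}")  # -> -21%
```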
force-pushed from 4131a72 to 8cfdf66
[Perf] VoxCPM2: Speedup by manual CUDA Graph capture for scaffold/residual forward (vllm-project#2803) Signed-off-by: Sy03 <1370724210@qq.com>
Root cause of the c>=2 vectorized_gather OOB / illegal memory access reported on VoxCPM2 after PR vllm-project#2803: preprocess() used _pending_requests (per-step prefix, cleared each forward) as if it were the full active batch. When a new prefill was scheduled after cached decode requests in the same batch, the decode requests were wrongly classified as stale and removed from _active_states; the next forward then recreated empty states, silently skipped residual_model for them, and desynchronized attn metadata -- producing NaN on the prefill slice and eventually crashing inside the CFM/LocDiT kernels. State cleanup is now driven solely by on_requests_finished -> _flush_deferred_cleanup at the end of forward(). A one-shot leak warning in _get_or_create_state guards against a future regression. Signed-off-by: Sy03 <1370724210@qq.com>
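To make the fix concrete, here is a minimal sketch of the deferred-cleanup pattern described above. Apart from `on_requests_finished`, `_flush_deferred_cleanup`, and `_active_states`, the names are illustrative, not the actual VoxCPM2 code:

```python
# Sketch: finished-request cleanup is driven only by the scheduler's explicit
# finish signal, never inferred from the contents of a single step's batch.
class TalkerStateTracker:
    def __init__(self) -> None:
        self._active_states: dict[str, object] = {}  # req_id -> decode state
        self._deferred_cleanup: set[str] = set()     # req_ids marked finished

    def on_requests_finished(self, req_ids: list[str]) -> None:
        # Mark for removal; actual removal is deferred to end of forward().
        self._deferred_cleanup.update(req_ids)

    def _flush_deferred_cleanup(self) -> None:
        for req_id in self._deferred_cleanup:
            self._active_states.pop(req_id, None)
        self._deferred_cleanup.clear()

    def forward(self, batch_req_ids: list[str]) -> None:
        for req_id in batch_req_ids:
            if req_id not in self._active_states:
                # A recreated state mid-stream would indicate a leak/regression.
                print(f"warning: recreating state for {req_id}")
                self._active_states[req_id] = object()
            # ... run scaffold/residual decode step with the state ...
        self._flush_deferred_cleanup()  # cleanup only at the end of forward()
```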
WAITING FOR #2758 MERGE.
Purpose
Follow-up to #2690 and #2758. Adds manual CUDA Graph capture/replay for the 28-layer scaffold (MiniCPM4PagedForVoxCPM2) and 8-layer residual model forwards during decode steps, eliminating per-step kernel launch overhead.
Key changes
CUDA Graph capture/replay (`voxcpm2_talker.py`; a sketch of the pattern follows this list):

- `_CapturedGraph` dataclass holding the static input/output buffers and the CUDA graph
- `_capture_graph()`: 3 warmup runs + graph capture under a patched `ForwardContext` (`scheduler_metadata` nullified to avoid shape mismatches across batch sizes)
- `_replay_graph()`: copy inputs into the static buffers, replay, clone the output
- `forward()` dispatch: pure-decode batches → graph replay; mixed prefill+decode batches → eager fallback
- `torch.cuda.graph_pool_handle()` to isolate from the reduce-overhead cudagraph trees
- no `torch.compile` (graph capture already eliminates kernel launch overhead)
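A minimal, self-contained sketch of this capture/replay recipe using only public PyTorch APIs; the function names and the toy model are illustrative, not the PR's `_CapturedGraph`/`_capture_graph` implementation:

```python
import torch

@torch.no_grad()
def capture_decode_graph(model, static_in, warmup: int = 3):
    """Capture one decode-step forward into a replayable CUDA graph."""
    pool = torch.cuda.graph_pool_handle()        # private memory pool, isolated
                                                 # from torch.compile cudagraphs
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):                # warmup runs on a side stream
        for _ in range(warmup):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph, pool=pool):     # record the kernels once
        static_out = model(static_in)
    return graph, static_out

@torch.no_grad()
def replay_decode_graph(graph, static_in, static_out, new_in):
    static_in.copy_(new_in)     # stage new inputs into the captured buffers
    graph.replay()              # one launch replays all recorded kernels
    return static_out.clone()   # clone so callers never alias graph memory

if __name__ == "__main__":
    model = torch.nn.Linear(64, 64).cuda().eval()
    static_in = torch.zeros(8, 64, device="cuda")
    graph, static_out = capture_decode_graph(model, static_in)
    out = replay_decode_graph(graph, static_in, static_out,
                              torch.randn(8, 64, device="cuda"))
```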
`precompute_fused_qkv()` (`minicpm4_paged.py`):

- precomputes the fused QKV weights to avoid a `torch.cat` allocation inside the captured graph (illustrated in the sketch below)
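A sketch of why pre-fusing helps, under the assumption that the per-step `torch.cat` was building the fused QKV weight: the concatenation is done once at setup, so each captured decode step is a single matmul plus allocation-free `torch.split` views. Names here are illustrative, not the actual `minicpm4_paged.py` code:

```python
import torch

def precompute_fused_qkv(q_proj: torch.nn.Linear,
                         k_proj: torch.nn.Linear,
                         v_proj: torch.nn.Linear) -> torch.Tensor:
    # One-time concatenation at setup, outside any captured graph.
    return torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0)

def fused_qkv_forward(x: torch.Tensor, w_qkv: torch.Tensor,
                      q_dim: int, kv_dim: int):
    qkv = x @ w_qkv.t()  # one matmul; no new cat allocation per decode step
    # torch.split returns views, so no extra allocation happens here either.
    return torch.split(qkv, [q_dim, kv_dim, kv_dim], dim=-1)
```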
Test Plan

Test Result
Hardware: NVIDIA H20 (98GB), CUDA 13.0
RTF Comparison (3 runs, OpenAI speech API)
Concurrent Throughput (medium_en, 2 runs)
cc @linyueqian @hsliuustc0106