[Perf] Fish Speech S2 Pro: CUDA graph acceleration for Fast AR codebook decode #2579
Closed
linyueqian wants to merge 1 commit into vllm-project:main from
Conversation
Switch Fish Speech's Fast AR from variable-length re-prefill to a fixed-shape full-buffer forward with CUDA graph capture and replay, following the same pattern as Qwen3 TTS's CodePredictor.

Key changes:
- Always forward the full [padded_bsz, max_seq, hidden] buffer (zero-padded future positions) instead of slicing to the growing seq_len
- torch.compile with epilogue_fusion=False, dynamic=False
- Capture CUDA graphs per power-of-2 batch-size bucket
- Replay the graph each codebook step and index the relevant position
- Sampling (top_k/top_p/multinomial) stays outside the graph
- Defer compile + graph capture to the first forward() to avoid OOM during model loading (before KV cache allocation)

Benchmark on H20-3e (143GB):
- Per-step Fast AR time: 73 ms -> 50 ms (31% reduction)
- E2E latency: 1800 ms -> 1253 ms (30% reduction)
- RTF: 0.48 -> 0.35 (27% improvement)
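The fixed-shape trick can be sketched independently of the model code. The buffer is allocated once at its maximum shape, each decode step writes one position in place, and the full buffer is forwarded every step so the tensor shape never changes, which is the precondition for replaying a captured CUDA graph. A minimal NumPy sketch (shapes and names are illustrative, not the PR's actual code):

```python
import numpy as np

# Fixed-shape decode buffer: always [padded_bsz, max_seq, hidden].
# Future positions stay zero-padded, so every step forwards the same
# shape -- the precondition for replaying a captured CUDA graph.
padded_bsz, max_seq, hidden = 4, 8, 16
buf = np.zeros((padded_bsz, max_seq, hidden), dtype=np.float32)

real_bsz = 3  # actual batch; rows real_bsz..padded_bsz stay zero

def write_step(buf, step, embedding, real_bsz):
    """Write one decode step in place; the buffer shape never changes."""
    buf[:real_bsz, step, :] = embedding
    return buf  # forward the FULL buffer, not a growing slice buf[:, :step+1]

write_step(buf, 0, np.ones((real_bsz, hidden), dtype=np.float32), real_bsz)
```

Compare the variable-length approach, where step `t` forwards a `[bsz, t+1, hidden]` slice: the changing shape would force a recapture (or a fallback to eager) on every step.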
Purpose
Reduce Fish Speech S2 Pro's per-step decode latency by enabling CUDA graph capture and replay for the Fast AR (residual codebook predictor). This follows the same pattern already used by Qwen3 TTS's CodePredictor.
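At a high level, capture-and-replay means recording the decoder's fixed-shape forward once per batch-size bucket and then replaying the recorded graph on every codebook step. A toy sketch of the dispatch logic with pure-Python stand-ins (class and method names are hypothetical; the real PR captures with torch.compile plus CUDA graphs, and defers capture to the first forward() to avoid load-time OOM):

```python
class FastARGraphRunner:
    """Sketch: one 'graph' per power-of-2 batch bucket, captured lazily
    on the first forward() call (i.e. after model loading and KV-cache
    allocation), then reused for every subsequent step."""

    def __init__(self, buckets=(1, 2, 4, 8)):
        self.buckets = buckets
        self._graphs = None      # nothing captured at load time
        self.capture_count = 0   # for illustration: capture happens once

    def _capture(self):
        # Stand-in for the expensive one-time work: torch.compile and
        # per-bucket CUDA graph capture in the real implementation.
        self.capture_count += 1
        return {b: f"graph_bs{b}" for b in self.buckets}

    def forward(self, bsz: int):
        if self._graphs is None:     # deferred capture, first call only
            self._graphs = self._capture()
        # Round the runtime batch up to the nearest captured bucket.
        bucket = next(b for b in self.buckets if b >= bsz)
        return self._graphs[bucket]  # replay the graph for this bucket
```

Per-bucket capture trades a bounded amount of extra memory (one graph per bucket, plus padding waste up to 2x) for never recapturing at runtime.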
Key Changes
`fish_speech_fast_ar.py` — Switch from variable-length re-prefill to fixed-shape full-buffer forward:
- Always forward the full `[padded_bsz, max_seq, hidden]` embedding buffer (zero-padded future positions) instead of slicing to the growing `seq_len`
- `torch.compile` with `epilogue_fusion=False`, `dynamic=False`
- Sampling (`top_k`/`top_p`/`multinomial`) stays outside the graph
- Defer compile + graph capture to the first `forward()` to avoid OOM during model loading (before KV cache allocation)
- Set `self.talker = self.fast_ar` so `OmniGPUModelRunner` wraps `talker_mtp` in `CUDAGraphWrapper`

`fish_speech_slow_ar.py` — Make `talker_mtp` CUDA-graph-safe:
- Replace the `if semantic_mask.any():` branch with branchless `torch.where` (eliminates a host-device sync during graph capture)
- Set `self.talker = self.fast_ar` to trigger the outer `CUDAGraphWrapper` wrapping
- Add the `self.talker_mtp_graph_safe = True` flag
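The branchless rewrite matters because `mask.any()` must copy the result to the host to decide which branch to take, and that host-device sync is not allowed during CUDA graph capture; `where` evaluates both operands on-device with no sync. A minimal sketch of the equivalence, written in NumPy (whose `where` mirrors `torch.where` broadcasting semantics; function names are illustrative):

```python
import numpy as np

def apply_semantic_branchy(hidden, semantic_mask, semantic_embed):
    """Branchy form: .any() forces a device->host sync under CUDA,
    which is illegal while a CUDA graph is being captured."""
    if semantic_mask.any():
        hidden = hidden.copy()
        hidden[semantic_mask] = semantic_embed[semantic_mask]
    return hidden

def apply_semantic_branchless(hidden, semantic_mask, semantic_embed):
    """Branchless form: where() selects per position on-device,
    so it is safe to capture. Mask [bsz, seq] broadcasts over hidden."""
    return np.where(semantic_mask[..., None], semantic_embed, hidden)
```

Both return the same tensor for any mask, including the all-False case the `if` was short-circuiting, so the rewrite changes performance characteristics but not results.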
Test Plan

Tested on NVIDIA H20-3e (143GB) with `fishaudio/s2-pro`, single GPU, `enforce_eager` for Stage 0, using the benchmark config from #2515.

```
CUDA_VISIBLE_DEVICES=0 python -m vllm_omni.entrypoints.cli.main serve \
    "fishaudio/s2-pro" --omni --host 127.0.0.1 --port 8091 \
    --stage-configs-path benchmarks/fish-speech/config/vllm_omni/fish_speech_s2_pro.yaml \
    --trust-remote-code --enforce-eager
```

Client-side profiling with streaming PCM requests at concurrency=1.
Test Result
Server logs confirm CUDA graph capture: