
[Perf] Fish Speech S2 Pro: CUDA graph acceleration for Fast AR codebook decode#2579

Closed
linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:exp/fish-speech-fast-ar-cudagraph

Conversation

@linyueqian
Collaborator

Purpose

Reduce Fish Speech S2 Pro's per-step decode latency by enabling CUDA graph capture and replay for the Fast AR (residual codebook predictor). This follows the same pattern already used by Qwen3 TTS's CodePredictor.

Key Changes

fish_speech_fast_ar.py — Switch from variable-length re-prefill to a fixed-shape full-buffer forward (a sketch of the pattern follows this list):

  • Always forward the full [padded_bsz, max_seq, hidden] embedding buffer (zero-padded future positions) instead of slicing to growing seq_len
  • torch.compile with epilogue_fusion=False, dynamic=False
  • Capture CUDA graphs per power-of-2 batch-size bucket
  • Replay graph each codebook step, index the relevant position for logits
  • Sampling (top_k/top_p/multinomial) stays outside the graph
  • Defer compile + graph capture to first forward() to avoid OOM during model loading (before KV cache allocation)
  • Expose self.talker = self.fast_ar so OmniGPUModelRunner wraps talker_mtp in CUDAGraphWrapper
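
For concreteness, here is a minimal sketch of the fixed-shape capture-and-replay pattern described above. It is an illustration, not the actual fish_speech_fast_ar.py code: the class name and the model/max_seq/hidden parameters are assumptions, and only the bucket list and compile flags mirror the PR description.

import torch

class FastARGraphRunner:
    """Sketch: fixed-shape forward with per-bucket CUDA graph replay."""

    BUCKETS = (1, 2, 4)  # power-of-2 batch-size buckets

    def __init__(self, model, max_seq, hidden, device="cuda"):
        # dynamic=False + epilogue_fusion=False mirrors the PR's compile flags
        self.model = torch.compile(model, dynamic=False,
                                   options={"epilogue_fusion": False})
        self.max_seq, self.hidden, self.device = max_seq, hidden, device
        self.graphs, self.static_in, self.static_out = {}, {}, {}

    def _capture(self, bucket):
        # Fixed-shape input: future positions stay zero-padded, so every
        # decode step presents the exact tensor shape the graph recorded.
        inp = torch.zeros(bucket, self.max_seq, self.hidden, device=self.device)
        # Warm up on a side stream before capture (standard PyTorch recipe);
        # this also triggers torch.compile outside the capture region.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                self.model(inp)
        torch.cuda.current_stream().wait_stream(s)
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            out = self.model(inp)  # output buffer is reused on every replay
        self.graphs[bucket] = g
        self.static_in[bucket], self.static_out[bucket] = inp, out

    def step(self, emb, pos):
        """One codebook decode step; returns logits at position `pos`."""
        bsz = emb.shape[0]
        bucket = next(b for b in self.BUCKETS if b >= bsz)
        if bucket not in self.graphs:
            self._capture(bucket)  # deferred to first forward (after KV alloc)
        inp = self.static_in[bucket]
        inp.zero_()
        inp[:bsz, : emb.shape[1]].copy_(emb)  # fill the captured buffer in place
        self.graphs[bucket].replay()          # replay instead of re-launching kernels
        # Sampling (top_k/top_p/multinomial) happens outside the graph,
        # on the logits indexed here.
        return self.static_out[bucket][:bsz, pos]

Indexing a single position out of the fixed-shape output is what makes the buffer approach pay off: the graph always computes the full [padded_bsz, max_seq, hidden] forward, and only the current step's logits leave the graph boundary for sampling.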

fish_speech_slow_ar.py — Make talker_mtp CUDA-graph-safe (see the torch.where sketch after this list):

  • Replace data-dependent if semantic_mask.any(): branch with branchless torch.where (eliminates host-device sync during graph capture)
  • Add self.talker = self.fast_ar to trigger outer CUDAGraphWrapper wrapping
  • Add self.talker_mtp_graph_safe = True flag
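
A minimal sketch of the branchless rewrite, assuming a boolean semantic_mask of shape [bsz, seq] and matching hidden/semantic_emb activations (the function name and tensor shapes are illustrative):

import torch

def merge_semantic(hidden, semantic_emb, semantic_mask):
    # Graph-unsafe original: `if semantic_mask.any():` reads a device
    # boolean on the host, forcing a sync that breaks graph capture.
    # Branchless replacement: torch.where evaluates entirely on-device,
    # selecting semantic_emb where the mask is set and hidden elsewhere.
    return torch.where(semantic_mask.unsqueeze(-1), semantic_emb, hidden)

Both branches are computed unconditionally, which costs little here and removes the data-dependent control flow that CUDA graphs cannot record.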

Test Plan

Tested on NVIDIA H20-3e (143GB) with fishaudio/s2-pro, single GPU, enforce_eager for Stage 0, using the benchmark config from #2515.

CUDA_VISIBLE_DEVICES=0 python -m vllm_omni.entrypoints.cli.main serve \
    "fishaudio/s2-pro" --omni --host 127.0.0.1 --port 8091 \
    --stage-configs-path benchmarks/fish-speech/config/vllm_omni/fish_speech_s2_pro.yaml \
    --trust-remote-code --enforce-eager

Client-side profiling with streaming PCM requests at concurrency=1.

Test Result

Metric                        Before    After     Improvement
Per-step Fast AR time         73 ms     48 ms     34% faster
End-to-end latency            1800 ms   1180 ms   34% faster
RTF (real-time factor)        0.48      0.33      31% better
TTFP (time to first packet)   440 ms    400 ms    9% faster

Server logs confirm CUDA graph capture:

Fish Speech Fast AR: compile warmup done for buckets [1, 2, 4]
Fish Speech Fast AR: CUDA graphs captured for buckets [1, 2, 4]

Commit message:

Switch Fish Speech's Fast AR from variable-length re-prefill to
fixed-shape full-buffer forward with CUDA graph capture and replay.
Follows the same pattern as Qwen3 TTS's CodePredictor.

Key changes:
- Always forward the full [padded_bsz, max_seq, hidden] buffer
  (zero-padded future positions) instead of slicing to growing seq_len
- torch.compile with epilogue_fusion=False, dynamic=False
- Capture CUDA graphs per power-of-2 batch-size bucket
- Replay graph each codebook step, index the relevant position
- Sampling (top_k/top_p/multinomial) stays outside the graph
- Defer compile + graph capture to first forward() to avoid OOM
  during model loading (before KV cache allocation)

Benchmark on H20-3e (143GB):
- Per-step Fast AR time: 73ms -> 50ms (31% reduction)
- E2E latency: 1800ms -> 1253ms (30% reduction)
- RTF: 0.48 -> 0.35 (27% improvement)
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository-wide code reviews.

@linyueqian
Collaborator Author

Closing in favor of #2520, which has a cleaner approach (an explicit talker_mtp_graph_safe flag plus a dedicated capture method) and better benchmark results (52.7% E2E improvement). See #2520 for the preferred implementation.

linyueqian closed this on Apr 8, 2026
