[RFC] Qwen3 TTS optimization plan #16
Closed
marksverdhei wants to merge 8 commits into
Fall back to SDPA attention when flash-attn is not installed, enabling inference on systems without flash-attention. Includes a regression test suite (no GPU or weights required).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code predictor
- Regional torch.compile for code predictor decoder layers with reduce-overhead mode, respecting the enforce_eager config
- Manual KV-cached loop (generate_codes) replacing HF generate() for the code predictor, eliminating per-step framework overhead
- _sample_token and _apply_repetition_penalty helpers for efficient token sampling outside HF GenerationMixin
- Benchmarks for code predictor and audio quality validation
- Extended test suite with compilation and KV-cache tests
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- HTTP-level streaming for the /v1/audio/speech endpoint with SSE chunked audio responses and proper content-type handling
- Model-level streaming via generate_streaming() with a manual talker prefill/decode loop yielding audio chunks during generation
- Refactored _prepare_talker_inputs() shared by generate() and generate_streaming() to avoid code duplication
- Streaming-aware scheduler and model runner updates for chunked output handling
- Stride-0 tensor serialization fix for streaming TTS
- Stream parameter support in the OpenAI speech request protocol
- Streaming playback client (stream_tts_play.py)
- Comprehensive streaming test suite
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Bash tool (scripts/tts-stream) for real-time TTS streaming with chunked audio playback via ffplay
- Preset voice support with configurable voice names
- TTS test script (scripts/tts-test.sh) for voice clone and custom voice inference validation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- speaker_embedding parameter in the OpenAI speech API request protocol for direct voice cloning without reference audio
- Speaker embedding extraction and passing through the serving, async engine, and model layers
- Speaker embedding interpolation example script for blending multiple voice characteristics
- Updated OpenAI speech client example with speaker_embedding usage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CudaGraphDecoderWrapper for capturing and replaying the speech tokenizer decoder forward pass as a CUDA graph, reducing kernel launch overhead during audio decoding
- Integration with the qwen3_tts model to use the CUDA graph decoder when available, with automatic fallback
- Speech tokenizer v2 modifications for CUDA graph compatibility
Cherry-picked from unmerged upstream PR vllm-project#1205.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- HT-branded logo (vLLM-Omni logo with HT avatar sticker overlay)
- HT Fork Changes section in README documenting all fork-specific features with upstream status annotations
- PR template checkbox for updating the HT Fork Changes section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design document proposing multi-phase optimizations for Qwen3 TTS inference. Tracked as a PR for discussion rather than a committed file.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RFC: Qwen3 TTS Inference Optimization Plan
Design document for optimizing Qwen3 TTS inference in vLLM-Omni. Phased approach targeting the remaining performance bottlenecks — single-request processing, monolithic pipeline, and missing CUDA graph coverage.
Remaining Work
Phase 1a: CUDA Graph Capture for Code Predictor
The code predictor has a fixed computation graph per step (5-layer transformer, fixed shapes during generation). Capture CUDA graphs for the inner loop to eliminate kernel launch overhead across all 31 iterations. Expected: 15-30% latency reduction.
Files to modify:
vllm_omni/model_executor/stage_configs/qwen3_tts.yaml: set enforce_eager: false
vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py: ensure the code predictor forward is graph-capturable (static shapes, no dynamic control flow)
Phase 1c: Parallel Codebook Prediction (Speculative)
Investigate whether codebooks 2-32 can be predicted in parallel rather than sequentially. Requires analysis of:
This is research-grade work and may not be feasible without retraining.
Phase 2: Multi-Stage Decomposition
Decompose the monolithic TTS pipeline into separate stages, mirroring Qwen3 Omni's architecture.
Proposed 3-Stage Pipeline
Stage Config
Required Changes
New files:
vllm_omni/model_executor/stage_input_processors/qwen3_tts.py
vllm_omni/model_executor/stage_configs/qwen3_tts_multistage.yaml
Modified files:
modeling_qwen3_tts.py: split Qwen3TTSForConditionalGeneration so the talker and decoder load as separate stage models
qwen3_tts.py: update the wrapper to support multi-stage mode
registry.py: register the talker and decoder as separate model architectures
Phase 3a: Async Chunk Pipeline
Enable async_chunk: true for the multi-stage TTS pipeline. This allows:
Follows the pattern established by qwen3_omni_moe_async_chunk.yaml.
Phase 3c: Chunked Decode in Tokenizer
The 12Hz tokenizer decoder already supports chunked decode (300-frame chunks with 25-frame context overlap). Add a streaming variant with smaller chunks (~25 frames) for lower latency, similar to qwen3_omni_code2wav.py's chunked_decode_streaming().
Phase 4: Batching and Throughput
4a. Increase Batch Size
With multi-stage decomposition (Phase 2), the talker stage can batch multiple requests with padded text sequences and proper masking.
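As a rough illustration of what "padded text sequences and proper masking" could look like, the sketch below right-pads variable-length token lists and builds a matching attention mask. The pad id and the helper name `pad_batch` are hypothetical and do not come from the fork's code.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical pad token id; the real tokenizer's pad id may differ


def pad_batch(
    token_ids: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad variable-length sequences and build a matching attention mask."""
    max_len = max(len(seq) for seq in token_ids)
    padded, mask = [], []
    for seq in token_ids:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention must ignore.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


batch = [[11, 12, 13], [21], [31, 32]]
padded, mask = pad_batch(batch)
```

Right padding is used here only for readability. Decoder-only generation code often pads on the left instead, so that the last position of every row is a real token.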
4b. Continuous Batching
Leverage vLLM's existing continuous batching for the talker stage. Requests enter and exit the batch dynamically, avoiding the convoy effect. Requires the talker to be registered as a proper autoregressive LLM stage.
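The convoy effect can be illustrated with a toy step counter. This is a simplified model, not vLLM's scheduler: it assumes every running request costs one decode step per iteration and that each request's decode length (a positive integer) is known up front. Both function names are made up for this sketch.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Steps needed when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining decode steps per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        # Every running request decodes one token; drop the ones that finish.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2]  # one long request followed by three short ones
static_steps = static_batching_steps(lengths, max_batch=2)
continuous_steps = continuous_batching_steps(lengths, max_batch=2)
```

With these inputs the static schedule holds the second slot idle while the long request finishes, whereas the continuous schedule reuses that slot for the short requests, so it needs fewer total steps.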
Priority Matrix (Remaining Only)
Risks and Open Questions
CUDA graph compatibility: The code predictor's 31-step loop with dynamic sampling may not be fully graph-capturable. Need to verify that sampling operations (top-k, nucleus) can be captured or must remain eager.
Quality regression from chunked streaming: Smaller decode chunks may introduce boundary artifacts. Need perceptual evaluation (MOS testing) to validate chunk size choices.
Multi-stage overhead: Inter-stage communication adds latency. For a single request, multi-stage may be slower than monolithic. The benefit is throughput under load.
Model weight splitting: Decomposing Qwen3TTSForConditionalGeneration into separate stage models requires careful weight mapping to ensure pretrained weights load correctly into each stage.
Parallel codebook feasibility: Codebooks 2-32 may have strong sequential dependencies that make parallel prediction infeasible without retraining.
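To make the chunked-streaming risk listed above more concrete, the sketch below counts chunk boundaries for the two chunk sizes mentioned in Phase 3c. It assumes the 25-frame overlap is left context whose decoded output is discarded; the actual decoder may handle overlap differently. The helper `chunk_spans` and the 600-frame utterance are illustrative only.

```python
from typing import List, Tuple


def chunk_spans(
    total_frames: int, chunk_size: int, context: int
) -> List[Tuple[int, int, int]]:
    """Return (context_start, start, end) frame indices for each decode chunk.

    Frames [context_start, start) are decoded only as left context and their
    output is discarded; frames [start, end) produce the emitted audio.
    """
    spans = []
    for start in range(0, total_frames, chunk_size):
        end = min(start + chunk_size, total_frames)
        spans.append((max(0, start - context), start, end))
    return spans


offline = chunk_spans(600, chunk_size=300, context=25)
streaming = chunk_spans(600, chunk_size=25, context=25)

# Internal chunk boundaries are the places where artifacts could appear.
offline_boundaries = len(offline) - 1
streaming_boundaries = len(streaming) - 1

# Context frames are decoded again without emitting audio.
offline_redundant = sum(start - ctx for ctx, start, _ in offline)
streaming_redundant = sum(start - ctx for ctx, start, _ in streaming)
```

Under these assumptions, shrinking the chunk from 300 to 25 frames raises the number of internal boundaries from 1 to 23 and the re-decoded context from 25 to 575 frames, which is why both perceptual quality and decode cost would need to be measured for the smaller chunk size.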
Upstream References
Completed Work (HT Branch)
The following phases from the original plan have been implemented:
✅ Phase 1b: Code Predictor KV Cache
Manual KV-cached loop (generate_codes) replacing HF GenerationMixin.generate() for the code predictor. Eliminates per-step framework overhead and is a prerequisite for future CUDA graph capture. Also includes regional torch.compile for code predictor decoder layers.
Commit: feat(qwen3-tts): regional torch.compile and manual KV-cached loop for code predictor
✅ Phase 3b: Speech Endpoint Streaming
HTTP-level streaming for /v1/audio/speech with SSE chunked audio responses. Model-level streaming via generate_streaming() with a manual talker prefill/decode loop yielding audio chunks during generation. Includes a _prepare_talker_inputs() refactor shared by generate() and generate_streaming().
Commit: feat(qwen3-tts): HTTP and model-level streaming for TTS speech API
✅ Additional: CUDA Graph for Speech Tokenizer Decoder
CudaGraphDecoderWrapper for capturing and replaying the speech tokenizer decoder forward pass as a CUDA graph. Cherry-picked from unmerged upstream PR #1205.
Commit: feat(qwen3-tts): CUDA graph support for speech tokenizer decoder
✅ Additional: SDPA Attention Fallback
Fallback to PyTorch SDPA attention when flash-attn is unavailable.
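A minimal sketch of how such a fallback could be selected, assuming the backend is chosen by name at model load time. The strings "flash_attention_2" and "sdpa" are the values Hugging Face Transformers accepts for attn_implementation; the helper name `pick_attn_implementation` is hypothetical, and the fork's actual check may differ.

```python
import importlib.util


def pick_attn_implementation(requested: str = "flash_attention_2") -> str:
    """Return the requested backend, or "sdpa" if flash-attn is not importable."""
    flash_missing = importlib.util.find_spec("flash_attn") is None
    if requested == "flash_attention_2" and flash_missing:
        # flash-attn is absent, so use PyTorch scaled_dot_product_attention.
        return "sdpa"
    return requested


attn_impl = pick_attn_implementation()
```

Checking with find_spec avoids importing flash-attn just to test for its presence, which is also why a check like this can be exercised in tests without a GPU.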
Commit: feat(qwen3-tts): SDPA attention fallback when flash-attn unavailable
✅ Additional: Speaker Embedding Support
speaker_embedding parameter for direct voice cloning without reference audio.
Commit: feat(qwen3-tts): speaker embedding support for voice cloning
✅ Additional: tts-stream Tool
Bash tool for real-time TTS streaming playback with preset voice support.
Commit: feat(qwen3-tts): tts-stream tool for low-latency streaming playback
Background
Current Architecture
Qwen3 TTS runs as a single OmniLLM stage:
Model Components
Bottleneck Analysis
max_batch_size: 1 (requests are processed serially).