
[RFC] Qwen3 TTS optimization plan#16

Closed
marksverdhei wants to merge 8 commits into ht from docs/optimization-plan-rfc


marksverdhei commented Feb 20, 2026


RFC: Qwen3 TTS Inference Optimization Plan

Design document for optimizing Qwen3 TTS inference in vLLM-Omni. It takes a phased approach to the remaining performance bottlenecks: single-request processing, a monolithic pipeline, and missing CUDA graph coverage.

Note: This document was originally a standalone file and has been moved to a PR for cleaner history. Several phases have already been implemented on the ht branch; see Completed Work (HT Branch) below.


Remaining Work

Phase 1a: CUDA Graph Capture for Code Predictor

The code predictor has a fixed computation graph per step (5-layer transformer, fixed shapes during generation). Capture CUDA graphs for the inner loop to eliminate kernel launch overhead across all 31 iterations. Expected: 15-30% latency reduction.

Files to modify:

  • vllm_omni/model_executor/stage_configs/qwen3_tts.yaml — set enforce_eager: false
  • vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py — ensure code predictor forward is graph-capturable (static shapes, no dynamic control flow)

Phase 1c: Parallel Codebook Prediction (Speculative)

Investigate whether codebooks 2-32 can be predicted in parallel rather than sequentially. Requires analysis of:

  • How much quality degrades if codebooks are predicted independently
  • Whether a distilled parallel predictor can match sequential quality
  • Feasibility of predicting groups of codebooks in parallel (e.g., 4 groups of about 8; codebooks 2-32 total 31, so the groups cannot all be the same size)

This is research-grade work and may not be feasible without retraining.
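To make the grouping idea concrete, here is a minimal sketch that partitions codebooks 2-32 into contiguous groups and counts the sequential steps that would remain. It assumes every codebook inside a group can be predicted in a single parallel pass, which is exactly the open research question. The helper name and the near-equal split policy are illustrative and do not exist in the codebase.

```python
def group_codebooks(first: int = 2, last: int = 32, num_groups: int = 4) -> list[list[int]]:
    """Split codebook indices first..last into contiguous, near-equal groups."""
    indices = list(range(first, last + 1))
    base, extra = divmod(len(indices), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # The first `extra` groups take one additional codebook.
        size = base + (1 if g < extra else 0)
        groups.append(indices[start:start + size])
        start += size
    return groups


groups = group_codebooks()
# Assumption: one forward pass per group instead of one per codebook.
sequential_steps = len(groups)
speedup_upper_bound = 31 / sequential_steps
```

With four groups the split is 8, 8, 8, 7 and the step count drops from 31 to 4. That gives an upper bound of about 7.75× on code predictor speedup, before accounting for any extra cost per parallel pass or any quality loss.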

Phase 2: Multi-Stage Decomposition

Decompose the monolithic TTS pipeline into separate stages, mirroring Qwen3 Omni's architecture.

Proposed 3-Stage Pipeline

Stage 0 (Talker):     Text → RVQ codec codes (32 codebooks)
                       Type: autoregressive LLM
                       Batch: up to 32
                       GPU: device 0

Stage 1 (Tokenizer Decoder): Codec codes → waveform
                       Type: non-autoregressive generation
                       Batch: up to 4
                       GPU: device 0

[Optional] Stage 2 (Speaker Encoder): Reference audio → speaker embedding
                       Type: non-autoregressive generation
                       Batch: up to 16
                       GPU: device 0

Stage Config

# qwen3_tts_multistage.yaml
stages:
  - stage_id: 0
    model_stage: qwen3_tts_talker
    model_arch: Qwen3TTSTalkerForConditionalGeneration
    stage_type: llm
    engine_output_type: audio_codes
    max_batch_size: 32
    gpu_memory_utilization: 0.7
    devices: "0"
    enforce_eager: false

  - stage_id: 1
    model_stage: qwen3_tts_decoder
    model_arch: Qwen3TTSTokenizerV2Decoder
    stage_type: generation
    engine_output_type: audio
    max_batch_size: 4
    gpu_memory_utilization: 0.2
    devices: "0"
    enforce_eager: true
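Because both stages share device 0, their gpu_memory_utilization values have to fit on one GPU together. The sketch below is a hypothetical sanity check, not part of vLLM-Omni. It assumes `devices` is a comma-separated string and that each stage's fraction applies to every device it lists.

```python
from collections import defaultdict

# Values copied from the proposed qwen3_tts_multistage.yaml.
stages = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.7},
    {"stage_id": 1, "devices": "0", "gpu_memory_utilization": 0.2},
]


def memory_per_device(stage_list):
    """Sum gpu_memory_utilization over all stages placed on each device."""
    totals = defaultdict(float)
    for stage in stage_list:
        for device in stage["devices"].split(","):
            totals[device.strip()] += stage["gpu_memory_utilization"]
    return dict(totals)


def over_budget(stage_list, limit=1.0):
    """Return only the devices whose summed utilization exceeds the limit."""
    return {d: t for d, t in memory_per_device(stage_list).items() if t > limit}


totals = memory_per_device(stages)
violations = over_budget(stages)
```

The two stages reserve about 0.9 of device 0. Under the same assumptions, an optional speaker encoder stage on that device would have to fit in the remaining 0.1 or so.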

Required Changes

New files:

  • vllm_omni/model_executor/stage_input_processors/qwen3_tts.py
  • vllm_omni/model_executor/stage_configs/qwen3_tts_multistage.yaml

Modified files:

  • modeling_qwen3_tts.py — split Qwen3TTSForConditionalGeneration so talker and decoder load as separate stage models
  • qwen3_tts.py — update wrapper to support multi-stage mode
  • registry.py — register talker and decoder as separate model architectures
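Splitting the model means each stage must receive exactly the checkpoint tensors it owns. The sketch below shows one way to partition a state dict by key prefix. The prefixes and key names are hypothetical placeholders, since the real parameter names in Qwen3TTSForConditionalGeneration are not listed in this document.

```python
def split_state_dict(state_dict, stage_prefixes):
    """Partition checkpoint keys by stage, stripping the matched prefix."""
    per_stage = {stage: {} for stage in stage_prefixes}
    unassigned = []
    for key, value in state_dict.items():
        for stage, prefix in stage_prefixes.items():
            if key.startswith(prefix):
                per_stage[stage][key[len(prefix):]] = value
                break
        else:
            # No stage claimed this key; report it rather than drop it silently.
            unassigned.append(key)
    return per_stage, unassigned


# Hypothetical key names, for illustration only.
checkpoint = {
    "talker.model.layers.0.weight": "w0",
    "talker.code_predictor.lm_head.0.weight": "w1",
    "speech_tokenizer.decoder.conv.weight": "w2",
    "speaker_encoder.fc.weight": "w3",
}
per_stage, unassigned = split_state_dict(
    checkpoint,
    {"talker": "talker.", "decoder": "speech_tokenizer.decoder."},
)
```

Returning the unassigned keys makes a missing mapping visible at load time, which is the failure mode that weight splitting is most likely to introduce.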

Phase 3a: Async Chunk Pipeline

Enable async_chunk: true for the multi-stage TTS pipeline. This allows:

  • Talker generates codec codes incrementally
  • Decoder starts synthesizing waveform for completed chunks while talker continues
  • Audio chunks delivered to client as they become available

Follows the pattern established by qwen3_omni_moe_async_chunk.yaml.
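The overlap between talker and decoder can be pictured as a producer/consumer pair. The following is a toy asyncio sketch of that idea only. It does not reflect how async_chunk is implemented in vLLM-Omni, and the function names, chunk size, and queue depth are all assumptions.

```python
import asyncio


async def talker(queue, total_frames, chunk_frames):
    """Producer: emit codec-frame chunks as soon as each one is complete."""
    for start in range(0, total_frames, chunk_frames):
        end = min(start + chunk_frames, total_frames)
        await asyncio.sleep(0)  # stand-in for autoregressive generation
        await queue.put(list(range(start, end)))
    await queue.put(None)  # sentinel: generation finished


async def decoder(queue, delivered):
    """Consumer: decode each chunk while the talker keeps generating."""
    while (chunk := await queue.get()) is not None:
        await asyncio.sleep(0)  # stand-in for waveform synthesis
        delivered.append(len(chunk))


async def run_pipeline(total_frames=125, chunk_frames=25):
    # Bounded queue, so the talker cannot run arbitrarily far ahead of the decoder.
    queue = asyncio.Queue(maxsize=2)
    delivered = []
    await asyncio.gather(
        talker(queue, total_frames, chunk_frames),
        decoder(queue, delivered),
    )
    return delivered


delivered = asyncio.run(run_pipeline())
```

The first chunk reaches the consumer after 25 frames rather than after all 125, which is where the first-chunk latency reduction would come from.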

Phase 3c: Chunked Decode in Tokenizer

The 12Hz tokenizer decoder already supports chunked decode (300-frame chunks with 25-frame context overlap). Add a streaming variant with smaller chunks (~25 frames) for lower latency, similar to qwen3_omni_code2wav.py's chunked_decode_streaming().
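The trade-off between chunk size and decoder work can be estimated with a small helper. This sketch assumes the 25-frame overlap is left context that is decoded again and then discarded. The function name is hypothetical and the real chunking logic may differ.

```python
def chunk_windows(num_frames, chunk_size=300, left_context=25):
    """Return (context_start, start, end) frame indices for each decode chunk."""
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        context_start = max(0, start - left_context)
        windows.append((context_start, start, end))
    return windows


def decoded_frames(windows):
    """Total frames pushed through the decoder, including re-decoded context."""
    return sum(end - context_start for context_start, _, end in windows)


offline = chunk_windows(650)                   # current 300-frame chunks
streaming = chunk_windows(650, chunk_size=25)  # proposed ~25-frame chunks
```

For a 650-frame utterance, 300-frame chunks decode 700 frames in total, while 25-frame chunks with the same context decode 1,275. Under this assumption the streaming variant roughly doubles decoder compute in exchange for lower latency.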

Phase 4: Batching and Throughput

4a. Increase Batch Size

With multi-stage decomposition (Phase 2), the talker stage can batch multiple requests with padded text sequences and proper masking.
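As a minimal illustration of padding and masking, the sketch below right-pads token-id lists to a common length and builds matching 0/1 attention masks. The padding side and pad id are assumptions, and the token ids are made up. The real talker stage may pad differently or avoid padding altogether.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build 0/1 attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention must ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[11, 12, 13], [21], [31, 32]])
```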

4b. Continuous Batching

Leverage vLLM's existing continuous batching for the talker stage. Requests enter and exit the batch dynamically, avoiding the convoy effect. Requires the talker to be registered as a proper autoregressive LLM stage.
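The convoy effect can be shown with a toy step counter. This is a simplified simulation, not vLLM's scheduler. It assumes each request needs a known, positive number of decode steps and that every step costs the same regardless of batch occupancy.

```python
from collections import deque


def static_batching_steps(lengths, max_batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + max_batch_size])
        for i in range(0, len(lengths), max_batch_size)
    )


def continuous_batching_steps(lengths, max_batch_size):
    """Finished requests leave immediately and waiting requests take their slot."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step for every running request
        running = [left - 1 for left in running if left > 1]
    return steps


lengths = [8, 2, 2, 2]  # one long request followed by three short ones
static_steps = static_batching_steps(lengths, max_batch_size=2)
continuous_steps = continuous_batching_steps(lengths, max_batch_size=2)
```

With a batch size of 2, static batching takes 10 steps because the second batch waits for the 8-step request. Continuous batching finishes in 8 steps, since the short requests pass through the free slot while the long one is still running.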


Priority Matrix (Remaining Only)

| Phase | Priority | Complexity | Expected Impact |
|---|---|---|---|
| 1a CUDA graphs for code predictor | High | Low | 15-30% latency reduction |
| 2 Multi-stage decomposition | High | High | Enables phases 3a, 3c, 4; structural prerequisite |
| 3a Async chunk pipeline | Medium | Medium | First-chunk latency reduction |
| 3c Chunked decode in tokenizer | Medium | Medium | Lower streaming latency |
| 4a Batch size increase | Medium | Low | Linear throughput scaling |
| 4b Continuous batching | Low | Medium | Throughput under concurrent load |
| 1c Parallel codebook prediction | Low | Research | Potentially 5-10× code predictor speedup (speculative) |

Risks and Open Questions

  1. CUDA graph compatibility: The code predictor's 31-step loop with dynamic sampling may not be fully graph-capturable. Need to verify that sampling operations (top-k, nucleus) can be captured or must remain eager.

  2. Quality regression from chunked streaming: Smaller decode chunks may introduce boundary artifacts. Need perceptual evaluation (MOS testing) to validate chunk size choices.

  3. Multi-stage overhead: Inter-stage communication adds latency. For a single request, multi-stage may be slower than monolithic. The benefit is throughput under load.

  4. Model weight splitting: Decomposing Qwen3TTSForConditionalGeneration into separate stage models requires careful weight mapping to ensure pretrained weights load correctly into each stage.

  5. Parallel codebook feasibility: Codebooks 2-32 may have strong sequential dependencies that make parallel prediction infeasible without retraining.


Upstream References

| Ref | Title | Relevance |
|---|---|---|
| #938 | Optimization for inference deployment of Qwen3-TTS | Master tracking issue |
| #976 | Separate Qwen3-TTS to 2-stage pipeline | Directly maps to Phase 2 |
| #907 | Optimize Qwen3-TTS with vLLM native ops | Prerequisite for Phase 1a |
| #1061 | qwen3 tts slow on 5090 | Baseline latency reference |

Completed Work (HT Branch)

The following phases from the original plan have been implemented:

✅ Phase 1b: Code Predictor KV Cache

Manual KV-cached loop (generate_codes) replacing HF GenerationMixin.generate() for the code predictor. Eliminates per-step framework overhead and is a prerequisite for future CUDA graph capture. Also includes regional torch.compile for code predictor decoder layers.

Commit: feat(qwen3-tts): regional torch.compile and manual KV-cached loop for code predictor

✅ Phase 3b: Speech Endpoint Streaming

HTTP-level streaming for /v1/audio/speech with SSE chunked audio responses. Model-level streaming via generate_streaming() with manual talker prefill/decode loop yielding audio chunks during generation. Includes _prepare_talker_inputs() refactor shared by generate() and generate_streaming().

Commit: feat(qwen3-tts): HTTP and model-level streaming for TTS speech API

✅ Additional: CUDA Graph for Speech Tokenizer Decoder

CudaGraphDecoderWrapper for capturing and replaying the speech tokenizer decoder forward pass as a CUDA graph. Cherry-picked from unmerged upstream PR #1205.

Commit: feat(qwen3-tts): CUDA graph support for speech tokenizer decoder

✅ Additional: SDPA Attention Fallback

Fallback to PyTorch SDPA attention when flash-attn is unavailable.

Commit: feat(qwen3-tts): SDPA attention fallback when flash-attn unavailable

✅ Additional: Speaker Embedding Support

speaker_embedding parameter for direct voice cloning without reference audio.

Commit: feat(qwen3-tts): speaker embedding support for voice cloning

✅ Additional: tts-stream Tool

Bash tool for real-time TTS streaming playback with preset voice support.

Commit: feat(qwen3-tts): tts-stream tool for low-latency streaming playback


Background

Current Architecture

Qwen3 TTS runs as a single OmniLLM stage:

Text → [Talker (20L transformer)] → first codebook logits
                ↓
        [Code Predictor (5L transformer)] × 31 sequential passes → codebooks 2-32
                ↓
        [Speech Tokenizer Decoder] → waveform
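The sequential dependency in the middle box is easier to see as a loop. The sketch below is a toy stand-in for the control flow only. `predict_next` is a hypothetical callable, and the arithmetic inside `toy_predictor` (including the modulus 2048) is arbitrary, not the model's behavior.

```python
def generate_frame(first_code, predict_next, num_codebooks=32):
    """Build one frame of codec codes: codebook 1 from the talker, the rest in order."""
    codes = [first_code]
    for step in range(num_codebooks - 1):
        # Each step sees every code produced so far, so it cannot start early.
        codes.append(predict_next(step, codes))
    return codes


calls = []


def toy_predictor(step, previous_codes):
    """Arbitrary deterministic stand-in for one code predictor forward pass."""
    calls.append(step)
    return (sum(previous_codes) + step) % 2048


frame = generate_frame(first_code=7, predict_next=toy_predictor)
```

One frame of 32 codes costs 31 predictor calls, and each call depends on the output of the one before it.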

Model Components

| Component | Parameters | Role |
|---|---|---|
| Talker | 20 layers, 1024 hidden | Text embeddings → first codebook logits (autoregressive) |
| Code Predictor | 5 layers, 31 lm_heads | First codebook → remaining 31 codebooks (31 sequential steps) |
| Speaker Encoder | ECAPA-TDNN, 1024-dim | Reference audio → speaker embedding (Base task only) |
| Speech Tokenizer | Mimi encoder + RVQ decoder + HiFi-GAN | Codec codes → waveform (non-autoregressive) |

Bottleneck Analysis

  1. Code Predictor Sequential Loop: Each generated talker token requires 31 sequential code predictor forward passes. A 10 s utterance at 12.5 tokens per second is 125 tokens × 31 = 3,875 forward passes.
  2. No Batching: max_batch_size: 1 — requests processed serially.
  3. Monolithic Forward Pass: Text encoding, codec generation, and waveform synthesis fused into a single call.
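The forward-pass count in item 1 generalizes to any utterance length. This short helper reproduces the arithmetic, assuming the 12.5 tokens-per-second rate implied by 125 tokens for 10 s. The function name is illustrative.

```python
def code_predictor_passes(duration_s, frame_rate_hz=12.5, residual_codebooks=31):
    """Return (talker tokens, code predictor forward passes) for one utterance."""
    talker_tokens = round(duration_s * frame_rate_hz)
    return talker_tokens, talker_tokens * residual_codebooks


tokens_10s, passes_10s = code_predictor_passes(10)
tokens_30s, passes_30s = code_predictor_passes(30)
```

The cost grows linearly with duration: a 30 s utterance needs 375 tokens and 11,625 code predictor passes.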

marksverdhei and others added 8 commits February 20, 2026 14:24
Fall back to SDPA attention when flash-attn is not installed, enabling
inference on systems without flash-attention. Includes regression test
suite (no GPU/weights required).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… code predictor

- Regional torch.compile for code predictor decoder layers with
  reduce-overhead mode, respecting enforce_eager config
- Manual KV-cached loop (generate_codes) replacing HF generate() for
  the code predictor, eliminating per-step framework overhead
- _sample_token and _apply_repetition_penalty helpers for efficient
  token sampling outside HF GenerationMixin
- Benchmarks for code predictor and audio quality validation
- Extended test suite with compilation and KV-cache tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- HTTP-level streaming for /v1/audio/speech endpoint with SSE chunked
  audio responses and proper content-type handling
- Model-level streaming via generate_streaming() with manual talker
  prefill/decode loop yielding audio chunks during generation
- Refactored _prepare_talker_inputs() shared by generate() and
  generate_streaming() to avoid code duplication
- Streaming-aware scheduler and model runner updates for chunked
  output handling
- Stride-0 tensor serialization fix for streaming TTS
- Stream parameter support in OpenAI speech request protocol
- Streaming playback client (stream_tts_play.py)
- Comprehensive streaming test suite

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Bash tool (scripts/tts-stream) for real-time TTS streaming with
  chunked audio playback via ffplay
- Preset voice support with configurable voice names
- TTS test script (scripts/tts-test.sh) for voice clone and custom
  voice inference validation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- speaker_embedding parameter in OpenAI speech API request protocol
  for direct voice cloning without reference audio
- Speaker embedding extraction and passing through the serving,
  async engine, and model layers
- Speaker embedding interpolation example script for blending
  multiple voice characteristics
- Updated OpenAI speech client example with speaker_embedding usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CudaGraphDecoderWrapper for capturing and replaying the speech
  tokenizer decoder forward pass as a CUDA graph, reducing kernel
  launch overhead during audio decoding
- Integration with qwen3_tts model to use CUDA graph decoder when
  available, with automatic fallback
- Speech tokenizer v2 modifications for CUDA graph compatibility

Cherry-picked from unmerged upstream PR vllm-project#1205.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- HT-branded logo (vLLM-Omni logo with HT avatar sticker overlay)
- HT Fork Changes section in README documenting all fork-specific
  features with upstream status annotations
- PR template checkbox for updating HT Fork Changes section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Design document proposing multi-phase optimizations for Qwen3 TTS
inference. Tracked as a PR for discussion rather than a committed file.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
marksverdhei force-pushed the ht branch 4 times, most recently from 39db6bf to eb3e146, February 27, 2026 15:00
marksverdhei force-pushed the ht branch 8 times, most recently from e06ae8f to c5dc9a0, March 8, 2026 06:18
marksverdhei force-pushed the ht branch 5 times, most recently from 936dbcc to dc37d3d, March 12, 2026 10:30