[RFC] Qwen3 TTS optimization plan by marksverdhei · Pull Request #16 · heiervang-technologies/ht-vllm-omni

marksverdhei · 2026-02-20T05:50:53Z

RFC: Qwen3 TTS Inference Optimization Plan

Design document for optimizing Qwen3 TTS inference in vLLM-Omni. The plan is phased and targets the remaining performance bottlenecks: single-request processing, a monolithic pipeline, and missing CUDA graph coverage.

Note: This document was originally a standalone file and has been moved to a PR for cleaner history. Several phases have already been implemented on the ht branch; see Completed Work (HT Branch) below.

Remaining Work

Phase 1a: CUDA Graph Capture for Code Predictor

The code predictor has a fixed computation graph per step (5-layer transformer, fixed shapes during generation). Capturing CUDA graphs for the inner loop would remove most of the kernel launch overhead across all 31 iterations per generated token. Expected: 15-30% latency reduction.

Files to modify:

vllm_omni/model_executor/stage_configs/qwen3_tts.yaml — set enforce_eager: false
vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py — ensure code predictor forward is graph-capturable (static shapes, no dynamic control flow)

Phase 1c: Parallel Codebook Prediction (Speculative)

Investigate whether codebooks 2-32 can be predicted in parallel rather than sequentially. Requires analysis of:

How much quality degrades if codebooks are predicted independently
Whether a distilled parallel predictor can match sequential quality
Feasibility of predicting groups of codebooks in parallel (e.g., 4 groups of about 8, since codebooks 2-32 number 31)

This is research-grade work and may not be feasible without retraining.
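
To make the grouping option concrete, the sketch below partitions the 31 remaining codebooks (2-32) into contiguous groups and counts the resulting sequential steps. It assumes each group is predicted in a single parallel pass while groups are still processed in order; that scheduling model is an illustration only and is not specified by this RFC. The function name `partition_codebooks` is hypothetical.

```python
def partition_codebooks(first: int, last: int, num_groups: int) -> list[list[int]]:
    """Split codebooks first..last (inclusive) into contiguous, near-equal groups."""
    ids = list(range(first, last + 1))
    base, extra = divmod(len(ids), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # The first `extra` groups take one additional codebook.
        size = base + (1 if g < extra else 0)
        groups.append(ids[start:start + size])
        start += size
    return groups


groups = partition_codebooks(2, 32, 4)
group_sizes = [len(g) for g in groups]

# Assumption: one parallel pass per group instead of one pass per codebook.
sequential_steps = len(groups)
step_reduction = 31 / sequential_steps
```

Under that assumption the step count drops from 31 to 4, with group sizes 8, 8, 8, and 7. Whether quality holds is exactly the open research question listed above.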

Phase 2: Multi-Stage Decomposition

Decompose the monolithic TTS pipeline into separate stages, mirroring Qwen3 Omni's architecture.

Proposed 3-Stage Pipeline

Stage 0 (Talker): Text → RVQ codec codes (32 codebooks)
    Type: autoregressive LLM
    Batch: up to 32
    GPU: device 0

Stage 1 (Tokenizer Decoder): Codec codes → waveform
    Type: non-autoregressive generation
    Batch: up to 4
    GPU: device 0

[Optional] Stage 2 (Speaker Encoder): Reference audio → speaker embedding
    Type: non-autoregressive generation
    Batch: up to 16
    GPU: device 0

Stage Config

# qwen3_tts_multistage.yaml
stages:
  - stage_id: 0
    model_stage: qwen3_tts_talker
    model_arch: Qwen3TTSTalkerForConditionalGeneration
    stage_type: llm
    engine_output_type: audio_codes
    max_batch_size: 32
    gpu_memory_utilization: 0.7
    devices: "0"
    enforce_eager: false

  - stage_id: 1
    model_stage: qwen3_tts_decoder
    model_arch: Qwen3TTSTokenizerV2Decoder
    stage_type: generation
    engine_output_type: audio
    max_batch_size: 4
    gpu_memory_utilization: 0.2
    devices: "0"
    enforce_eager: true
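
Both stages in the config above are pinned to device "0", so their gpu_memory_utilization values share one GPU. A minimal sketch of a sanity check follows, assuming `devices` is a comma-separated string and that each listed device is charged the full fraction. The helper `memory_budget_per_device` is hypothetical and not part of vLLM-Omni.

```python
from collections import defaultdict

# Mirrors the relevant fields of qwen3_tts_multistage.yaml above.
STAGES = [
    {"stage_id": 0, "devices": "0", "gpu_memory_utilization": 0.7},
    {"stage_id": 1, "devices": "0", "gpu_memory_utilization": 0.2},
]


def memory_budget_per_device(stages):
    """Sum the requested memory fraction for every device a stage names."""
    totals = defaultdict(float)
    for stage in stages:
        # Assumption: "devices" is a comma-separated list such as "0" or "0,1".
        for device in stage["devices"].split(","):
            totals[device.strip()] += stage["gpu_memory_utilization"]
    return dict(totals)


budget = memory_budget_per_device(STAGES)
oversubscribed = [dev for dev, total in budget.items() if total > 1.0]
```

With the proposed values, device 0 is budgeted at roughly 0.9, which leaves about 10% of memory unclaimed by the two stages.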

Required Changes

New files:

vllm_omni/model_executor/stage_input_processors/qwen3_tts.py
vllm_omni/model_executor/stage_configs/qwen3_tts_multistage.yaml

Modified files:

modeling_qwen3_tts.py — split Qwen3TTSForConditionalGeneration so talker and decoder load as separate stage models
qwen3_tts.py — update wrapper to support multi-stage mode
registry.py — register talker and decoder as separate model architectures

Phase 3a: Async Chunk Pipeline

Enable async_chunk: true for the multi-stage TTS pipeline. This allows:

The talker to generate codec codes incrementally
The decoder to start synthesizing waveform for completed chunks while the talker continues
Audio chunks to be delivered to the client as they become available

Follows the pattern established by qwen3_omni_moe_async_chunk.yaml.
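
The overlap between the two stages can be sketched as a producer/consumer pair. This is a toy model using asyncio, not the vLLM-Omni implementation: the `talker` and `decoder` coroutines, the 25-frame chunk size, and the queue depth are all assumptions chosen for illustration.

```python
import asyncio

CHUNK_FRAMES = 25  # hypothetical chunk size, not taken from the RFC


async def talker(total_frames: int, queue: asyncio.Queue) -> None:
    """Stand-in for the talker: emits codec frames in fixed-size chunks."""
    for start in range(0, total_frames, CHUNK_FRAMES):
        end = min(start + CHUNK_FRAMES, total_frames)
        await asyncio.sleep(0)  # yield control, as a real decode step would
        await queue.put((start, end))
    await queue.put(None)  # sentinel: generation finished


async def decoder(queue: asyncio.Queue, delivered: list) -> None:
    """Stand-in for the decoder: consumes chunks as soon as they arrive."""
    while (chunk := await queue.get()) is not None:
        delivered.append(chunk)  # a real decoder would synthesize audio here


async def run_pipeline(total_frames: int) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded queue gives backpressure
    delivered: list = []
    await asyncio.gather(talker(total_frames, queue), decoder(queue, delivered))
    return delivered


chunks = asyncio.run(run_pipeline(125))
```

For a 125-frame utterance this delivers five chunks in order. The bounded queue keeps the talker from running arbitrarily far ahead of the decoder.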

Phase 3c: Chunked Decode in Tokenizer

The 12Hz tokenizer decoder already supports chunked decode (300-frame chunks with 25-frame context overlap). Add a streaming variant with smaller chunks (~25 frames) for lower latency, similar to qwen3_omni_code2wav.py's chunked_decode_streaming().
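
As an illustration of the windowing involved, the sketch below computes which frames each chunk would decode and which it would keep. It assumes the context overlap is taken from the left only and discarded after decoding; the actual decoder may handle context differently. `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(num_frames: int, chunk: int, context: int):
    """Yield (ctx_start, start, end).

    Frames ctx_start..end are decoded; only start..end are kept as output.
    """
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        yield max(0, start - context), start, end


# Existing batch-style setting: 300-frame chunks, 25-frame context.
batch_windows = list(chunk_windows(650, chunk=300, context=25))

# Proposed streaming setting: roughly 25-frame chunks, 25-frame context.
stream_windows = list(chunk_windows(125, chunk=25, context=25))
```

With 25-frame chunks and 25 frames of context, most windows decode about twice as many frames as they emit, which is the compute cost paid for lower latency.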

Phase 4: Batching and Throughput

4a. Increase Batch Size

With multi-stage decomposition (Phase 2), the talker stage can batch multiple requests with padded text sequences and proper masking.
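
A minimal sketch of padding and masking for a batch of tokenized texts follows. Right-padding and a pad id of 0 are arbitrary choices made for the example; the real talker stage may pad differently or rely on the engine's own batching. `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0):
    """Right-pad token sequences to a common length and build an attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21], [31, 32]])
```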

4b. Continuous Batching

Leverage vLLM's existing continuous batching for the talker stage. Requests enter and exit the batch dynamically, avoiding the convoy effect. Requires the talker to be registered as a proper autoregressive LLM stage.
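
The convoy effect can be shown with a toy step counter. This is a simplified model rather than vLLM's scheduler: every request needs a fixed number of decode steps (at least 1), each step advances all running requests by one token, and prefill cost is ignored.

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request frees its slot immediately."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Advance every running request by one token; drop the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 10, 10, 10]  # one long request, three short ones
static_steps = static_batch_steps(lengths, batch_size=2)
continuous_steps = continuous_batch_steps(lengths, batch_size=2)
```

In this example static batching takes 110 steps because the second batch waits for the long request, while continuous batching finishes in 100 steps by cycling the short requests through the free slot.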

Priority Matrix (Remaining Only)

Phase	Priority	Complexity	Expected Impact
1a CUDA graphs for code predictor	High	Low	15-30% latency reduction
2 Multi-stage decomposition	High	High	Enables phases 3a, 3c, 4; structural prerequisite
3a Async chunk pipeline	Medium	Medium	First-chunk latency reduction
3c Chunked decode in tokenizer	Medium	Medium	Lower streaming latency
4a Batch size increase	Medium	Low	Near-linear throughput scaling while the GPU has headroom
4b Continuous batching	Low	Medium	Throughput under concurrent load
1c Parallel codebook prediction	Low	Research	Potentially 5-10× code predictor speedup (speculative)

Risks and Open Questions

CUDA graph compatibility: The code predictor's 31-step loop with dynamic sampling may not be fully graph-capturable. Need to verify whether sampling operations (top-k, nucleus) can be captured or must remain eager.
Quality regression from chunked streaming: Smaller decode chunks may introduce boundary artifacts. Need perceptual evaluation (MOS testing) to validate chunk size choices.
Multi-stage overhead: Inter-stage communication adds latency. For a single request, multi-stage may be slower than monolithic. The benefit is throughput under load.
Model weight splitting: Decomposing Qwen3TTSForConditionalGeneration into separate stage models requires careful weight mapping to ensure pretrained weights load correctly into each stage.
Parallel codebook feasibility: Codebooks 2-32 may have strong sequential dependencies that make parallel prediction infeasible without retraining.

Upstream References

Ref	Title	Relevance
#938	Optimization for inference deployment of Qwen3-TTS	Master tracking issue
#976	Separate Qwen3-TTS to 2-stage pipeline	Directly maps to Phase 2
#907	Optimize Qwen3-TTS with vLLM native ops	Prerequisite for Phase 1a
#1061	qwen3 tts slow on 5090	Baseline latency reference

Completed Work (HT Branch)

The following phases from the original plan have been implemented:

✅ Phase 1b: Code Predictor KV Cache

Manual KV-cached loop (generate_codes) replacing HF GenerationMixin.generate() for the code predictor. Eliminates per-step framework overhead and is a prerequisite for future CUDA graph capture. Also includes regional torch.compile for code predictor decoder layers.

Commit: feat(qwen3-tts): regional torch.compile and manual KV-cached loop for code predictor

✅ Phase 3b: Speech Endpoint Streaming

HTTP-level streaming for /v1/audio/speech with SSE chunked audio responses. Model-level streaming via generate_streaming() with manual talker prefill/decode loop yielding audio chunks during generation. Includes _prepare_talker_inputs() refactor shared by generate() and generate_streaming().

Commit: feat(qwen3-tts): HTTP and model-level streaming for TTS speech API

✅ Additional: CUDA Graph for Speech Tokenizer Decoder

CudaGraphDecoderWrapper for capturing and replaying the speech tokenizer decoder forward pass as a CUDA graph. Cherry-picked from unmerged upstream PR #1205.

Commit: feat(qwen3-tts): CUDA graph support for speech tokenizer decoder

✅ Additional: SDPA Attention Fallback

Fallback to PyTorch SDPA attention when flash-attn is unavailable.

Commit: feat(qwen3-tts): SDPA attention fallback when flash-attn unavailable

✅ Additional: Speaker Embedding Support

speaker_embedding parameter for direct voice cloning without reference audio.

Commit: feat(qwen3-tts): speaker embedding support for voice cloning

✅ Additional: tts-stream Tool

Bash tool for real-time TTS streaming playback with preset voice support.

Commit: feat(qwen3-tts): tts-stream tool for low-latency streaming playback

Background

Current Architecture

Qwen3 TTS runs as a single OmniLLM stage:

Text → [Talker (20L transformer)] → first codebook logits
                ↓
        [Code Predictor (5L transformer)] × 31 sequential passes → codebooks 2-32
                ↓
        [Speech Tokenizer Decoder] → waveform

Model Components

Component	Parameters	Role
Talker	20 layers, 1024 hidden	Text embeddings → first codebook logits (autoregressive)
Code Predictor	5 layers, 31 lm_heads	First codebook → remaining 31 codebooks (31 sequential steps)
Speaker Encoder	ECAPA-TDNN, 1024-dim	Reference audio → speaker embedding (Base task only)
Speech Tokenizer	Mimi encoder + RVQ decoder + HiFi-GAN	Codec codes → waveform (non-autoregressive)

Bottleneck Analysis

Code Predictor Sequential Loop: Each generated token requires 31 sequential forward passes. A 10 s utterance is 125 tokens × 31 = 3,875 forward passes.
No Batching: max_batch_size: 1 — requests processed serially.
Monolithic Forward Pass: Text encoding, codec generation, and waveform synthesis fused into a single call.
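
The arithmetic behind the first bottleneck can be reproduced for other durations. The 12.5 frames-per-second rate below is inferred from the figure above (10 s = 125 tokens) rather than stated explicitly, and `code_predictor_passes` is a hypothetical helper.

```python
FRAME_RATE_HZ = 12.5    # inferred from "10 s utterance = 125 tokens"
PREDICTOR_PASSES = 31   # one sequential pass per codebook 2-32


def code_predictor_passes(duration_s: float) -> int:
    """Number of code predictor forward passes for an utterance of this length."""
    frames = round(duration_s * FRAME_RATE_HZ)
    return frames * PREDICTOR_PASSES


passes_10s = code_predictor_passes(10)
passes_60s = code_predictor_passes(60)
```

The count grows linearly with audio length, so a one-minute utterance needs 23,250 sequential passes under the same assumptions.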

Fall back to SDPA attention when flash-attn is not installed, enabling inference on systems without flash-attention. Includes regression test suite (no GPU/weights required).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

… code predictor

- Regional torch.compile for code predictor decoder layers with reduce-overhead mode, respecting enforce_eager config
- Manual KV-cached loop (generate_codes) replacing HF generate() for the code predictor, eliminating per-step framework overhead
- _sample_token and _apply_repetition_penalty helpers for efficient token sampling outside HF GenerationMixin
- Benchmarks for code predictor and audio quality validation
- Extended test suite with compilation and KV-cache tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- HTTP-level streaming for /v1/audio/speech endpoint with SSE chunked audio responses and proper content-type handling
- Model-level streaming via generate_streaming() with manual talker prefill/decode loop yielding audio chunks during generation
- Refactored _prepare_talker_inputs() shared by generate() and generate_streaming() to avoid code duplication
- Streaming-aware scheduler and model runner updates for chunked output handling
- Stride-0 tensor serialization fix for streaming TTS
- Stream parameter support in OpenAI speech request protocol
- Streaming playback client (stream_tts_play.py)
- Comprehensive streaming test suite

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Bash tool (scripts/tts-stream) for real-time TTS streaming with chunked audio playback via ffplay
- Preset voice support with configurable voice names
- TTS test script (scripts/tts-test.sh) for voice clone and custom voice inference validation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- speaker_embedding parameter in OpenAI speech API request protocol for direct voice cloning without reference audio
- Speaker embedding extraction and passing through the serving, async engine, and model layers
- Speaker embedding interpolation example script for blending multiple voice characteristics
- Updated OpenAI speech client example with speaker_embedding usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- CudaGraphDecoderWrapper for capturing and replaying the speech tokenizer decoder forward pass as a CUDA graph, reducing kernel launch overhead during audio decoding
- Integration with qwen3_tts model to use CUDA graph decoder when available, with automatic fallback
- Speech tokenizer v2 modifications for CUDA graph compatibility

Cherry-picked from unmerged upstream PR vllm-project#1205.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- HT-branded logo (vLLM-Omni logo with HT avatar sticker overlay)
- HT Fork Changes section in README documenting all fork-specific features with upstream status annotations
- PR template checkbox for updating HT Fork Changes section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Design document proposing multi-phase optimizations for Qwen3 TTS inference. Tracked as a PR for discussion rather than a committed file. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

marksverdhei and others added 8 commits February 20, 2026 14:24

docs: Qwen3 TTS optimization plan RFC

1dda0f4


marksverdhei force-pushed the ht branch 4 times, most recently from 39db6bf to eb3e146 Compare February 27, 2026 15:00

marksverdhei force-pushed the ht branch 8 times, most recently from e06ae8f to c5dc9a0 Compare March 8, 2026 06:18

marksverdhei force-pushed the ht branch 5 times, most recently from 936dbcc to dc37d3d Compare March 12, 2026 10:30

marksverdhei closed this Mar 17, 2026

marksverdhei wants to merge 8 commits into ht from docs/optimization-plan-rfc
