From d926f4fde5dbac281299bdf36a8e0a1c6fbbc9f1 Mon Sep 17 00:00:00 2001
From: ischencheng <cheng21@seas.upenn.edu>
Date: Wed, 6 May 2026 04:15:32 +0000
Subject: [PATCH] [Doc][Qwen3-TTS] Document stage-1 max_num_seqs trade-off per
 task type

The bundled qwen3_tts.yaml now ships stage 1 (Code2Wav) at
max_num_seqs: 10 (set in #2556), tuned for Base voice cloning's long
stage-1 lifetimes (~3 s/req): admitting up to 10 concurrent codec
sequences improves TTFA p95 by ~2x at concurrency c=4 / c=8 (1x H100,
1.7B-Base, seed-tts) at an 8-12% audio-throughput cost.

CustomVoice / VoiceDesign have ~50-200 ms stage-1 lifetimes and remain
TTFA-optimal at max_num_seqs: 1. Document the trade-off and the
override invocation, and add a yaml comment so the choice is visible
at the config site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: ischencheng <cheng21@seas.upenn.edu>
---
 .../examples/online_serving/text_to_speech.md          | 10 ++++++++++
 vllm_omni/deploy/qwen3_tts.yaml                        |  1 +
 2 files changed, 11 insertions(+)

diff --git a/docs/user_guide/examples/online_serving/text_to_speech.md b/docs/user_guide/examples/online_serving/text_to_speech.md
index 703b7cf7aca..3f8f3f86a52 100644
--- a/docs/user_guide/examples/online_serving/text_to_speech.md
+++ b/docs/user_guide/examples/online_serving/text_to_speech.md
@@ -201,6 +201,16 @@ Stage configs ship with the chunked-streaming default. To use the uniproc execut
 
 To opt out of chunked streaming, pass `--no-async-chunk` instead — the pipeline auto-dispatches to the end-to-end codec processor.
 
+### Tuning stage 1 `max_num_seqs` per task type
+The bundled `qwen3_tts.yaml` ships stage 1 (Code2Wav) at `max_num_seqs: 10`, tuned for Base voice cloning. Stage-1 lifetimes there are long (~3 s/req), so admitting up to 10 concurrent codec sequences lets requests progress in parallel in the scheduler: TTFA p95 improves by ~2× at concurrency c=4 / c=8 (1× H100, 1.7B-Base, seed-tts), at an 8–12% audio-throughput cost.
+
+CustomVoice / VoiceDesign have much shorter stage-1 lifetimes (~50–200 ms) and are TTFA-optimal at `max_num_seqs: 1`. Override the default when serving those task types:
+
+```bash
+vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni \
+    --stage-overrides '{"1": {"max_num_seqs": 1}}'
+```
+
 ### Sending requests
 ```bash
 # CustomVoice with a predefined speaker
diff --git a/vllm_omni/deploy/qwen3_tts.yaml b/vllm_omni/deploy/qwen3_tts.yaml
index c2f9735026b..bdad8aaaf2e 100644
--- a/vllm_omni/deploy/qwen3_tts.yaml
+++ b/vllm_omni/deploy/qwen3_tts.yaml
@@ -52,6 +52,7 @@ stages:
       top_p: 1.0
 
   - stage_id: 1
+    # Tuned for Base voice cloning; CustomVoice / VoiceDesign are TTFA-optimal at max_num_seqs: 1.
     max_num_seqs: 10
     gpu_memory_utilization: 0.3
     enforce_eager: true