vllm-project · linyueqian · May 29, 2026 · May 6, 2026 · hsliuustc0106 · May 19, 2026
@@ -201,6 +201,16 @@ Stage configs ship with the chunked-streaming default. To use the uniproc execut
 
 To opt out of chunked streaming, pass `--no-async-chunk` instead — the pipeline auto-dispatches to the end-to-end codec processor.
 
+### Tuning stage 1 `max_num_seqs` per task type
+The bundled `qwen3_tts.yaml` ships stage 1 (Code2Wav) at `max_num_seqs: 10`, a value tuned for Base voice cloning. Base stage-1 lifetimes are long (~3 s per request), so admitting up to 10 concurrent codec sequences lets requests progress in parallel in the scheduler. Measured on 1× H100 with 1.7B-Base on seed-tts, this improves TTFA p95 by roughly 2× at concurrency 4 and 8, at the cost of 8–12% audio throughput.
+
+CustomVoice / VoiceDesign have much shorter stage-1 lifetimes (~50–200 ms) and are TTFA-optimal at `max_num_seqs: 1`. Override the default when serving those task types:
+
+```bash
+vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni \
+    --stage-overrides '{"1": {"max_num_seqs": 1}}'
+```
+
 ### Sending requests
 ```bash
 # CustomVoice with a predefined speaker

@@ -52,6 +52,7 @@ stages:
       top_p: 1.0
 
   - stage_id: 1
+    # Tuned for Base voice cloning; CustomVoice / VoiceDesign are TTFA-optimal at max_num_seqs: 1.
     max_num_seqs: 10
     gpu_memory_utilization: 0.3
     enforce_eager: true
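
The per-task guidance in the README hunk could be captured in a small launcher helper so the override is not retyped by hand. The sketch below is only an illustration: `stage1_override` is a hypothetical function name, not part of vLLM, and the only values it encodes are the two stated in this change (10 for Base, 1 for CustomVoice / VoiceDesign).

```bash
# Hypothetical helper (not part of vLLM): print the stage-1 override JSON
# for a given task type, using the values documented in this change.
stage1_override() {
    case "$1" in
        Base)
            # Long stage-1 lifetimes (~3 s/req): keep the bundled default.
            printf '%s' '{"1": {"max_num_seqs": 10}}'
            ;;
        CustomVoice|VoiceDesign)
            # Short stage-1 lifetimes (~50-200 ms): TTFA-optimal at 1.
            printf '%s' '{"1": {"max_num_seqs": 1}}'
            ;;
        *)
            echo "unknown task type: $1" >&2
            return 1
            ;;
    esac
}

# Example use, assuming MODEL holds the model to serve:
#   vllm serve "$MODEL" --omni --stage-overrides "$(stage1_override CustomVoice)"
```

Keeping the mapping in one place means a deployment that later mixes task types only has one spot to revisit if the tuned values change.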