diff --git a/docs/user_guide/examples/online_serving/text_to_speech.md b/docs/user_guide/examples/online_serving/text_to_speech.md
index 703b7cf7aca..3f8f3f86a52 100644
--- a/docs/user_guide/examples/online_serving/text_to_speech.md
+++ b/docs/user_guide/examples/online_serving/text_to_speech.md
@@ -201,6 +201,16 @@ Stage configs ship with the chunked-streaming default. To use the uniproc execut
 
 To opt out of chunked streaming, pass `--no-async-chunk` instead — the pipeline auto-dispatches to the end-to-end codec processor.
 
+### Tuning stage 1 `max_num_seqs` per task type
+The bundled `qwen3_tts.yaml` ships stage 1 (Code2Wav) at `max_num_seqs: 10`, tuned for Base voice cloning: stage-1 lifetimes are long (~3 s/req), so admitting up to 10 concurrent codec sequences lets requests progress in parallel in the scheduler — ~2× TTFA p95 at c=4 / c=8 (1× H100, 1.7B-Base, seed-tts) at an 8–12 % audio-throughput cost.
+
+CustomVoice / VoiceDesign have much shorter stage-1 lifetimes (~50–200 ms) and are TTFA-optimal at `max_num_seqs: 1`. Override the default when serving those task types:
+
+```bash
+vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni \
+    --stage-overrides '{"1": {"max_num_seqs": 1}}'
+```
+
 ### Sending requests
 ```bash
 # CustomVoice with a predefined speaker
diff --git a/vllm_omni/deploy/qwen3_tts.yaml b/vllm_omni/deploy/qwen3_tts.yaml
index c2f9735026b..bdad8aaaf2e 100644
--- a/vllm_omni/deploy/qwen3_tts.yaml
+++ b/vllm_omni/deploy/qwen3_tts.yaml
@@ -52,6 +52,7 @@ stages:
       top_p: 1.0
 
   - stage_id: 1
+    # Tuned for Base voice clone; CustomVoice / VoiceDesign are TTFA-optimal at 1.
     max_num_seqs: 10
     gpu_memory_utilization: 0.3
     enforce_eager: true