From d926f4fde5dbac281299bdf36a8e0a1c6fbbc9f1 Mon Sep 17 00:00:00 2001
From: ischencheng
Date: Wed, 6 May 2026 04:15:32 +0000
Subject: [PATCH] [Doc][Qwen3-TTS] Document stage-1 max_num_seqs trade-off per task type

The bundled qwen3_tts.yaml now ships stage 1 (Code2Wav) at
max_num_seqs: 10 (set in #2556), tuned for Base voice cloning's long
stage-1 lifetimes (~3 s/req): admitting up to 10 concurrent codec
sequences improves TTFA p95 by ~2x at concurrency c=4 / c=8 on 1x H100 +
1.7B-Base + seed-tts, at an 8-12% audio-throughput cost. CustomVoice /
VoiceDesign have ~50-200 ms stage-1 lifetimes and remain TTFA-optimal at
max_num_seqs: 1.

Document the trade-off and the override invocation, and add a yaml
comment so the choice is visible at the config site.

Co-Authored-By: Claude Opus 4.7 (1M context)
Signed-off-by: ischencheng
---
 .../examples/online_serving/text_to_speech.md | 10 ++++++++++
 vllm_omni/deploy/qwen3_tts.yaml               |  1 +
 2 files changed, 11 insertions(+)

diff --git a/docs/user_guide/examples/online_serving/text_to_speech.md b/docs/user_guide/examples/online_serving/text_to_speech.md
index 703b7cf7aca..3f8f3f86a52 100644
--- a/docs/user_guide/examples/online_serving/text_to_speech.md
+++ b/docs/user_guide/examples/online_serving/text_to_speech.md
@@ -201,6 +201,16 @@ Stage configs ship with the chunked-streaming default. To use the uniproc execut
 
 To opt out of chunked streaming, pass `--no-async-chunk` instead — the pipeline auto-dispatches to the end-to-end codec processor.
 
+### Tuning stage 1 `max_num_seqs` per task type
+The bundled `qwen3_tts.yaml` ships stage 1 (Code2Wav) at `max_num_seqs: 10`, tuned for Base voice cloning: stage-1 lifetimes are long (~3 s/req), so admitting up to 10 concurrent codec sequences lets requests progress in parallel in the scheduler. This improves TTFA p95 by ~2× at concurrency c=4 / c=8 (1× H100, 1.7B-Base, seed-tts), at an 8–12% audio-throughput cost.
+
+CustomVoice / VoiceDesign have much shorter stage-1 lifetimes (~50–200 ms) and are TTFA-optimal at `max_num_seqs: 1`. Override the default when serving those task types:
+
+```bash
+vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni \
+  --stage-overrides '{"1": {"max_num_seqs": 1}}'
+```
+
 ### Sending requests
 ```bash
 # CustomVoice with a predefined speaker
diff --git a/vllm_omni/deploy/qwen3_tts.yaml b/vllm_omni/deploy/qwen3_tts.yaml
index c2f9735026b..bdad8aaaf2e 100644
--- a/vllm_omni/deploy/qwen3_tts.yaml
+++ b/vllm_omni/deploy/qwen3_tts.yaml
@@ -52,6 +52,7 @@ stages:
       top_p: 1.0
 
   - stage_id: 1
+    # Tuned for Base voice clone; CustomVoice / VoiceDesign are TTFA-optimal at 1.
     max_num_seqs: 10
     gpu_memory_utilization: 0.3
     enforce_eager: true
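
Note: the per-task-type choice documented in this patch could be wrapped in a small launcher helper. The following is a minimal sketch, not part of the patch. The names `STAGE1_MAX_NUM_SEQS`, `BUNDLED_DEFAULT`, and `build_serve_command` are hypothetical; the values 10 and 1 and the `--omni` / `--stage-overrides` flags are taken from the text above, and the model ID is the one used in the documented command.

```python
import json
import shlex

# Stage-1 (Code2Wav) max_num_seqs per task type, as stated in the patch:
# Base voice cloning is tuned at 10, CustomVoice / VoiceDesign at 1.
STAGE1_MAX_NUM_SEQS = {
    "Base": 10,
    "CustomVoice": 1,
    "VoiceDesign": 1,
}

# Value shipped in the bundled qwen3_tts.yaml, per the patch.
BUNDLED_DEFAULT = 10


def build_serve_command(model: str, task_type: str) -> list[str]:
    """Build a `vllm serve` argv, overriding stage 1 only when needed.

    Hypothetical helper: it only assembles the command line shown in the
    documentation and does not call into vLLM.
    """
    if task_type not in STAGE1_MAX_NUM_SEQS:
        raise ValueError(f"unknown task type: {task_type!r}")

    argv = ["vllm", "serve", model, "--omni"]
    wanted = STAGE1_MAX_NUM_SEQS[task_type]
    if wanted != BUNDLED_DEFAULT:
        # Stage IDs are JSON object keys, so they are strings ("1").
        overrides = {"1": {"max_num_seqs": wanted}}
        argv += ["--stage-overrides", json.dumps(overrides)]
    return argv


if __name__ == "__main__":
    cmd = build_serve_command("Qwen/Qwen3-TTS-12Hz-1.7B-Base", "CustomVoice")
    # shlex.join quotes the JSON argument so it is safe to paste into a shell.
    print(shlex.join(cmd))
```

Returning an argv list rather than a shell string keeps the JSON override intact when passed to `subprocess.run`; `shlex.join` is used only for display.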