[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance #2604
fix pre-commit please |
|
fix dco please |
|
vllm already defaults to uniproc executor when distributed_executor_backend is None and world_size=1 so this is actually just a config change. Similarly mp is used by default when world_size > 1. |
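For reference, the defaulting behaviour described in this comment can be sketched roughly as below. This is a simplified illustration, not vLLM's actual code: the function name `resolve_executor_backend` is hypothetical, and the sketch ignores any other backends vLLM may select in multi-device setups.

```python
from typing import Optional


def resolve_executor_backend(configured: Optional[str], world_size: int) -> str:
    """Simplified illustration of the default described above (hypothetical helper)."""
    if configured is not None:
        # An explicit setting in the stage config always wins.
        return configured
    # No explicit setting: one device runs in-process, several use multiprocessing.
    return "uni" if world_size == 1 else "mp"
```

Under this reading, hardcoding "mp" in the stage config forces the multiprocessing executor even when world_size=1, which is the case the PR targets.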
|
Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify the claim generalizes beyond H100. Setup: Results (H20):
On H20, This suggests the performance tradeoff is hardware-dependent. Auto-defaulting to A few observations on the PR itself:
Suggestion: keep the default as @tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff? |
|
Very interesting to see that it doesn't hold up on H200. I think it is fair to leave this up to users to flip based on their desired throughput/latency targets and hardware setup. Outside of what I shared about the setup in the related issue, what could be helpful here for getting a better sense of the discrepancy? |
|
Have you tried running the Base cloning task? This is what I actually got the results for vs. CustomVoice so arguably that could be playing a role. I'll also run CustomVoice on my setup to see if I observe the same. |
|
Follow-up: Base (voice cloning) task shows different results on the same H20 hardware.
For the Base task, This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors Given this, a blanket default change seems risky. Keeping |
|
Thanks for running this, @linyueqian! Agreed on not merging the blanket change. I think it would be worth adding this to the docs or noting it somewhere more permanent so that future deployments can take advantage of these perf gains. Potentially a task-specific stage config? |
|
Does this also apply to qwen-omni? @ZeldaHuang @amy-why-3459 |
Qwen3-Omni Benchmark:
|
| Concurrency | TTFT (ms) mp / uni | TPOT (ms) mp / uni | Audio RTF mp / uni |
|---|---|---|---|
| 1 | 816 / 649 | 10.37 / 9.35 | 0.20 / 0.18 |
| 4 | 1325 / 1135 | 25.71 / 17.50 | 0.33 / 0.38 |
| 10 | 5397 / 2665 | 71.45 / 42.25 | 0.87 / 0.66 |
| 16 | 8473 / 4947 | 103.17 / 72.21 | 1.28 / 1.23 |
For TTFT and TPOT, the default uniproc executor consistently outperforms distributed_executor_backend="mp" across all concurrency levels. The absolute gap widens as concurrency increases: at c=16, uniproc achieves ~42% lower TTFT and ~30% lower TPOT.
For Audio RTF, the results are mixed: mp is slightly better at c=4 (0.33 vs 0.38) but worse at other levels.
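The percentages quoted above can be re-derived from the table with a few lines of Python. This is only a convenience sketch: the rows are copied from the Qwen3-Omni table, and the helper names `reduction_pct` and `summarize` are made up for illustration.

```python
# (concurrency, mp, uni) rows copied from the Qwen3-Omni table above
TTFT_MS = [(1, 816, 649), (4, 1325, 1135), (10, 5397, 2665), (16, 8473, 4947)]
TPOT_MS = [(1, 10.37, 9.35), (4, 25.71, 17.50), (10, 71.45, 42.25), (16, 103.17, 72.21)]


def reduction_pct(mp: float, uni: float) -> float:
    """Relative reduction of uni versus mp, in percent."""
    return (mp - uni) / mp * 100.0


def summarize(rows):
    """Map each concurrency level to the relative reduction, rounded to 0.1%."""
    return {c: round(reduction_pct(mp, uni), 1) for c, mp, uni in rows}


if __name__ == "__main__":
    print("TTFT reduction %:", summarize(TTFT_MS))
    print("TPOT reduction %:", summarize(TPOT_MS))
```

At c=16 this gives roughly 42% for TTFT and 30% for TPOT, matching the figures above. Note that the relative TTFT reduction is largest at c=10 (about 51%), while the absolute gap in milliseconds keeps growing through c=16.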
|
@linyueqian @wtomin I think this will work for many cases, please check |
|
Nice work. I'm just curious, how did you discover this problem? |
|
I was profiling where time was being spent in Qwen3-TTS forward passes for the Base cloning task and noticed low GPU utilization, which pointed to CPU-bound work potentially causing the GPU to idle. At higher concurrency it became more apparent that the D2H copies, serialization/deserialization, msgpack encoding, and tensor detaching in 'mp' mode were taking up a considerable amount of time relative to the AR steps/decode. |
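To get a rough feel for the kind of per-message cost described here, a tiny stdlib-only micro-benchmark can time a serialize/deserialize round trip. This is a stand-in sketch, not the profiling that was actually done: it uses pickle instead of msgpack, plain bytes instead of tensors, and the function name and payload sizes are arbitrary assumptions.

```python
import pickle
import time


def roundtrip_seconds(payload: bytes, repeats: int = 20) -> float:
    """Average wall-clock time to pickle and unpickle one payload."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        restored = pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    elapsed = time.perf_counter() - start
    # Sanity check: the round trip must preserve the data.
    assert restored == payload
    return elapsed / repeats


if __name__ == "__main__":
    # Hypothetical payload sizes; real inter-process messages will differ.
    for size in (1 << 10, 1 << 20, 1 << 24):
        cost_us = roundtrip_seconds(bytes(size)) * 1e6
        print(f"{size:>10} bytes: {cost_us:.1f} us per round trip")
```

This only measures CPU-side serialization; it does not capture D2H copies or IPC transport, so it is a lower bound on the overhead an mp executor would add per message.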
|
fix ci please |
|
@iancarrasco-b10 Great investigation! Since the mp vs uni tradeoff is both hardware- and task-dependent, could you add a short section to the docs (e.g., under the Qwen3-TTS serving guide) summarizing:
This way future deployments can make an informed choice. A task-specific stage config example would also be helpful. |
|
Will go ahead and update the docs and add a config here |
Made-with: Cursor Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
…e single-GPU performance (vllm-project#2604) Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
Summary
Remove distributed_executor_backend: "mp" from the qwen3_tts.yaml stage config.
This improves single-GPU performance by avoiding unnecessary multiprocessing overhead from the mp executor when only one device is in use. It still preserves the current behavior of using mp in world_size > 1 scenarios.
Test Plan
Tested Qwen3-TTS with the uniproc and mp executors; both worked in the single-GPU case. More results can be found here: #2603