[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance #2604

Merged
hsliuustc0106 merged 10 commits into vllm-project:main from basetenlabs:uniproc-executor on Apr 14, 2026

@iancarrasco-b10
Contributor

@iancarrasco-b10 iancarrasco-b10 commented Apr 8, 2026

Summary

  • Remove hardcoded distributed_executor_backend: "mp" from qwen3_tts.yaml stage config

This improves single-GPU performance by avoiding the multiprocessing overhead of the mp executor when only one device is in use. Behavior for world_size > 1 is unchanged: mp is still used there.

Test Plan

Tested Qwen3-TTS with both the uniproc and mp executors; both worked in the single-GPU case. More results can be found here: #2603

@linyueqian
Collaborator

fix pre-commit please

@linyueqian
Collaborator

fix dco please

@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 force-pushed the uniproc-executor branch 2 times, most recently from b122235 to dab9604 Compare April 8, 2026 16:52
@iancarrasco-b10
Contributor Author

iancarrasco-b10 commented Apr 8, 2026

vLLM already defaults to the uniproc executor when distributed_executor_backend is None and world_size=1, so this is effectively just a config change. Similarly, mp is used by default when world_size > 1.
https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py#L825
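
The defaulting rule described in this comment can be sketched as a small function. This is a simplified illustration, not vLLM's actual code: `resolve_executor_backend` is a hypothetical name, and the real logic in `vllm/config/parallel.py` may consider other backends and conditions.

```python
from typing import Optional


def resolve_executor_backend(configured: Optional[str], world_size: int) -> str:
    """Hypothetical helper mirroring the defaulting rule described in the comment."""
    if configured is not None:
        # An explicit setting (such as a hardcoded "mp" in a stage config) always wins.
        return configured
    # No explicit setting: one device uses the in-process executor,
    # more than one device uses the multiprocessing executor.
    return "uni" if world_size == 1 else "mp"


print(resolve_executor_backend(None, 1))
print(resolve_executor_backend(None, 2))
print(resolve_executor_backend("mp", 1))
```

Under this reading, removing the hardcoded "mp" line is what lets the single-GPU case fall through to the uniproc default.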

@iancarrasco-b10 iancarrasco-b10 changed the title from "Default to UniProcExecutor for single-GPU stages" to "[Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance" Apr 8, 2026
@iancarrasco-b10 iancarrasco-b10 changed the title from "[Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance" to "[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance" Apr 8, 2026
@linyueqian
Collaborator

Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify the claim generalizes beyond H100.

Setup: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, bs16 config, single GPU, 50 prompts per concurrency level.

Results (H20):

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|---|---|---|---|
| 1 | 0.155 | 0.151 | ~tied |
| 4 | 0.221 | 0.308 | mp 39.6% better |
| 10 | 0.372 | 0.423 | mp 13.7% better |
| 16 | 0.430 | 0.541 | mp 25.8% better |

On H20, the mp executor is consistently faster at concurrency 4+, which is the opposite of the H100 results in #2603. The throughput gap is significant: at concurrency 4, mp delivers 18.12 audio s/s vs 12.84 for uni (uni is about 29% lower, i.e. mp is about 41% higher).

This suggests the performance tradeoff is hardware-dependent. Auto-defaulting to uni for all single-GPU stages would regress performance on H20.
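
For readers checking the numbers, here is a minimal sketch of how the deltas can be recomputed from the table. It assumes the Delta column is the uni RTF relative to the mp RTF, which matches the c=10 and c=16 rows; recomputing c=4 from the rounded table values gives a slightly different figure than the one reported, presumably because the original used unrounded inputs. `pct_higher` is a name introduced here for illustration.

```python
def pct_higher(value: float, baseline: float) -> float:
    """Percent by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# (concurrency, mean RTF with mp, mean RTF with uni), copied from the H20 table.
h20_customvoice = [
    (1, 0.155, 0.151),
    (4, 0.221, 0.308),
    (10, 0.372, 0.423),
    (16, 0.430, 0.541),
]

# RTF is lower-is-better, so a positive number here means uni is slower than mp.
for concurrency, rtf_mp, rtf_uni in h20_customvoice:
    print(f"c={concurrency}: uni RTF is {pct_higher(rtf_uni, rtf_mp):+.1f}% vs mp")

# Throughput at concurrency 4 in audio seconds per second (higher is better).
mp_tput, uni_tput = 18.12, 12.84
print(f"mp above uni:  {pct_higher(mp_tput, uni_tput):.1f}%")
print(f"uni below mp:  {-pct_higher(uni_tput, mp_tput):.1f}%")
```

The two throughput lines show why the choice of baseline matters: the same pair of numbers reads as roughly 41% when uni is the baseline and roughly 29% when mp is.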

A few observations on the PR itself:

  1. The Python code change (_default_executor_backend + setdefault) is redundant. vLLM already defaults to uni when distributed_executor_backend is None and world_size=1. Just removing the YAML line would have the same effect.
  2. The scope is narrow. Only qwen3_tts.yaml is updated, but 80+ other YAML configs also hardcode "mp".

Suggestion: keep the default as "mp" in the configs, and let users opt into "uni" explicitly if their hardware benefits from it. Or, could you share more details about the H100 setup so we can understand what drives the difference?

@tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff?

@iancarrasco-b10
Contributor Author

iancarrasco-b10 commented Apr 8, 2026

Very interesting to see that it doesn't hold up on H20. I think it is fair to leave this up to users to flip based on their desired throughput/latency targets and hardware setup. Beyond what I shared about the setup in the related issue, what else would be helpful for getting a better sense of the discrepancy?

@iancarrasco-b10
Contributor Author

iancarrasco-b10 commented Apr 8, 2026

Have you tried running the Base cloning task? My results were for Base rather than CustomVoice, so that could be playing a role. I'll also run CustomVoice on my setup to see whether I observe the same.

@linyueqian
Collaborator

Follow-up: Base (voice cloning) task shows different results on the same H20 hardware.

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|---|---|---|---|
| 1 | 0.297 | 0.240 | uni 19.3% better |
| 4 | 0.626 | 0.607 | ~tied |
| 10 | 1.193 | 1.044 | uni 12.5% better |
| 16 | 1.936 | 1.388 | uni 28.3% better |

For the Base task, uni is better or tied at every concurrency level, consistent with the original findings in #2603. However, using CustomVoice on the same H20 GPU, we saw the opposite (mp winning at concurrency 4+).

This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors uni. CustomVoice is lighter per-request, so the process-level parallelism of mp dominates.

Given this, a blanket default change seems risky. Keeping mp as the explicit default in configs and letting users opt into uni for their specific workload would be the safer path.

@iancarrasco-b10
Contributor Author

iancarrasco-b10 commented Apr 8, 2026

Thanks for running this, @linyueqian! Agreed on not merging the blanket change. I think it would be worth adding this to the docs, or documenting it somewhere more permanent, so that future deployments can take advantage of these perf gains. Perhaps a task-specific stage config?

@hsliuustc0106
Collaborator

Does this also apply to qwen-omni? @ZeldaHuang @amy-why-3459

@hsliuustc0106 hsliuustc0106 added the nightly-test label to trigger buildkite nightly test CI label Apr 9, 2026
@ZeldaHuang
Collaborator

Qwen3-Omni Benchmark: distributed_executor_backend="mp" vs default (uniproc)

stage path : qwen3_omni_moe_async_chunk.yaml

| Concurrency | TTFT (ms), mp / uni | TPOT (ms), mp / uni | Audio RTF, mp / uni |
|---|---|---|---|
| 1 | 816 / 649 | 10.37 / 9.35 | 0.20 / 0.18 |
| 4 | 1325 / 1135 | 25.71 / 17.50 | 0.33 / 0.38 |
| 10 | 5397 / 2665 | 71.45 / 42.25 | 0.87 / 0.66 |
| 16 | 8473 / 4947 | 103.17 / 72.21 | 1.28 / 1.23 |

For TTFT and TPOT, the default uniproc executor consistently outperforms distributed_executor_backend="mp" across all concurrency levels. The absolute gap widens as concurrency increases: at c=16, uniproc achieves ~42% lower TTFT and ~30% lower TPOT.

For Audio RTF, the results are mixed: mp is slightly better at c=4 (0.33 vs 0.38) but worse at other levels.
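
The percentages quoted for this benchmark can be reproduced from the table with a short script. This is only a sketch: `pct_reduction` and the `omni` dictionary are names introduced here, and all three metrics are treated as lower-is-better.

```python
def pct_reduction(mp_value: float, uni_value: float) -> float:
    """Percent by which the uni value is lower than the mp value (negative if uni is higher)."""
    return (mp_value - uni_value) / mp_value * 100.0


# concurrency -> {metric: (mp, uni)}, copied from the Qwen3-Omni table.
omni = {
    1: {"TTFT": (816, 649), "TPOT": (10.37, 9.35), "RTF": (0.20, 0.18)},
    4: {"TTFT": (1325, 1135), "TPOT": (25.71, 17.50), "RTF": (0.33, 0.38)},
    10: {"TTFT": (5397, 2665), "TPOT": (71.45, 42.25), "RTF": (0.87, 0.66)},
    16: {"TTFT": (8473, 4947), "TPOT": (103.17, 72.21), "RTF": (1.28, 1.23)},
}

# Positive values mean uni is lower (better) than mp for that metric.
for concurrency, metrics in omni.items():
    summary = ", ".join(
        f"{name} {pct_reduction(mp, uni):+.1f}%" for name, (mp, uni) in metrics.items()
    )
    print(f"c={concurrency}: {summary}")
```

The only negative entry is Audio RTF at c=4, which is the one case noted above where mp comes out ahead.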

@hsliuustc0106
Collaborator

@linyueqian @wtomin I think this will work for many cases, please check

@amy-why-3459
Contributor

Nice work. I'm just curious, how did you discover this problem?

@iancarrasco-b10
Contributor Author

iancarrasco-b10 commented Apr 9, 2026

I was profiling where time was being spent in Qwen3-TTS forward passes for the Base cloning task and noticed low GPU utilization, which pointed to CPU-bound work potentially causing the GPU to idle. Running at higher concurrency made it more apparent that, in 'mp' mode, the D2H copies, serialization/deserialization, msgpack encoding, and tensor detaching were taking up a considerable amount of time compared with the AR steps/decode.
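
The kind of overhead described here can be illustrated with a toy timing comparison. This is a sketch under loud assumptions: it uses the standard library's `pickle` as a stand-in for the msgpack encoding mentioned above, involves no GPU, no D2H copy, and no real inter-process transfer, and its timings say nothing about actual vLLM behavior. All names are hypothetical.

```python
import pickle
import time


def time_call(fn, repeats: int = 10) -> float:
    """Return the average wall-clock seconds per call of fn()."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Stand-in for one step's output: a plain Python list of 200,000 floats.
payload = [float(i) for i in range(200_000)]


def in_process_handoff():
    # Same-process handoff: the consumer simply receives a reference to the object.
    return payload


def cross_process_style_handoff():
    # Cross-process handoff modeled as serialize + deserialize (assumption: pickle, no real IPC).
    return pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))


print(f"reference handoff: {time_call(in_process_handoff) * 1e6:.1f} us/call")
print(f"pickle round trip: {time_call(cross_process_style_handoff) * 1e6:.1f} us/call")
```

The round trip produces an equal but separate copy of the data on every call, which is CPU work the in-process path never does.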

@linyueqian
Collaborator

fix ci please

@linyueqian
Collaborator

@iancarrasco-b10 Great investigation! Since the mp vs uni tradeoff is both hardware- and task-dependent, could you add a short section to the docs (e.g., under the Qwen3-TTS serving guide) summarizing:

  1. When uni wins (Base/cloning tasks, H100)
  2. When mp wins (CustomVoice, H20 at high concurrency)
  3. How users can switch between them

This way future deployments can make an informed choice. A task-specific stage config example would also be helpful.
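
As a rough illustration of the third point, switching could amount to setting or removing one key in a stage's engine arguments. The sketch below is hypothetical: only the `distributed_executor_backend` key comes from this PR, while the flat dictionary layout, the `model` key, and the `with_executor_backend` helper are assumptions and do not reflect the real structure of `qwen3_tts.yaml`.

```python
import copy
from typing import Any, Dict, Optional


def with_executor_backend(engine_args: Dict[str, Any], backend: Optional[str]) -> Dict[str, Any]:
    """Return a copy of engine_args with the executor backend set, or unset when backend is None."""
    updated = copy.deepcopy(engine_args)
    if backend is None:
        # Unset: leave the choice to the engine's own default for the given world size.
        updated.pop("distributed_executor_backend", None)
    else:
        # Explicit opt-in, for example "mp" for a workload where it benchmarked faster.
        updated["distributed_executor_backend"] = backend
    return updated


# Hypothetical stage arguments, loosely modeled on the benchmark setup in this thread.
stage_args = {
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "distributed_executor_backend": "mp",
}

print(with_executor_backend(stage_args, None))
print(with_executor_backend(stage_args, "mp"))
```

Returning a copy keeps the original arguments intact, so the same base config can be benchmarked with both settings side by side.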

@iancarrasco-b10
Contributor Author

Will go ahead and update the docs and add a config here

Made-with: Cursor
Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
@hsliuustc0106 hsliuustc0106 merged commit 48c30bc into vllm-project:main Apr 14, 2026
7 of 8 checks passed
y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026
…e single-GPU performance (vllm-project#2604)

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…e single-GPU performance (vllm-project#2604)

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…e single-GPU performance (vllm-project#2604)

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…e single-GPU performance (vllm-project#2604)

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
Labels

nightly-test (label to trigger buildkite nightly test CI)
ready (label to trigger buildkite CI)

Development

Successfully merging this pull request may close these issues.

[Performance]: Uniproc executor has much better performance at higher concurrency on Qwen3-TTS (Single GPU)

6 participants