[Qwen3-TTS] Remove hardcoded `distributed_executor_backend` to improve single-GPU performance by iancarrasco-b10 · Pull Request #2604 · vllm-project/vllm-omni

iancarrasco-b10 · 2026-04-08T16:39:58Z

Summary

Remove the hardcoded `distributed_executor_backend: "mp"` from the qwen3_tts.yaml stage config.

This improves single-GPU performance by avoiding unnecessary multiprocessing overhead from the mp executor when only one device is in use. Runs with world_size > 1 still use mp, as before.

Test Plan

Tested Qwen3-TTS with the uniproc and mp executors; both worked in the single-GPU case. More results can be found in #2603.

linyueqian · 2026-04-08T16:46:07Z

fix pre-commit please

linyueqian · 2026-04-08T16:50:13Z

fix dco please

iancarrasco-b10 · 2026-04-08T16:57:07Z

vLLM already defaults to the uniproc executor when distributed_executor_backend is None and world_size=1, so this is effectively just a config change. Similarly, mp is used by default when world_size > 1.
https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py#L825
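
As a rough illustration of the default described above, here is a minimal sketch. It is not vLLM's actual code: the function name `resolve_executor_backend` is hypothetical, and the real logic in `vllm/config/parallel.py` handles additional cases (for example Ray and external launchers) that are omitted here.

```python
from typing import Optional


def resolve_executor_backend(configured: Optional[str], world_size: int) -> str:
    """Simplified stand-in for the default selection described in the comment.

    Hypothetical helper, not part of vLLM. Only the two cases mentioned in
    this thread are modeled.
    """
    if configured is not None:
        # An explicit setting (e.g. "mp" hardcoded in a stage YAML) always wins.
        return configured
    # Nothing configured: one device runs in-process, several use multiprocessing.
    return "uni" if world_size == 1 else "mp"


for configured, world_size in [(None, 1), (None, 2), ("mp", 1)]:
    resolved = resolve_executor_backend(configured, world_size)
    print(f"configured={configured!r}, world_size={world_size} -> {resolved}")
```

The third case is the one this PR targets: with `"mp"` hardcoded in the YAML, a single-GPU stage never reaches the uniproc default.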

linyueqian · 2026-04-08T17:17:48Z

Thanks for the investigation! We ran the same benchmark on H20 (141GB) to verify the claim generalizes beyond H100.

Setup: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, bs16 config, single GPU, 50 prompts per concurrency level.

Results (H20):

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|---|---|---|---|
| 1 | 0.155 | 0.151 | ~tied |
| 4 | 0.221 | 0.308 | mp 39.6% better |
| 10 | 0.372 | 0.423 | mp 13.7% better |
| 16 | 0.430 | 0.541 | mp 25.8% better |

On H20, the mp executor is consistently faster at concurrency 4+, which is the opposite of the H100 results in #2603. The throughput gap is significant: at concurrency 4, mp delivers 18.12 audio s/s vs 12.84 for uni (uni is about 29% lower, i.e. mp is about 41% higher).

This suggests the performance tradeoff is hardware-dependent. Auto-defaulting to uni for all single-GPU stages would regress performance on H20.

A few observations on the PR itself:

- The Python code change (`_default_executor_backend` + `setdefault`) is redundant. vLLM already defaults to uni when distributed_executor_backend is None and world_size=1. Just removing the YAML line would have the same effect.
- The scope is narrow. Only qwen3_tts.yaml is updated, but 80+ other YAML configs also hardcode "mp".

Suggestion: keep the default as "mp" in the configs, and let users opt into "uni" explicitly if their hardware benefits from it. Or, could you share more details about the H100 setup so we can understand what drives the difference?

@tzhouam @hsliuustc0106 Could you help confirm these findings or share any thoughts on the mp vs uni tradeoff?

iancarrasco-b10 · 2026-04-08T17:21:54Z

Very interesting to see that it doesn't hold up on H20. I think it is fair to leave this up to users to flip based on their desired throughput/latency targets and hardware setup. Beyond what I shared about the setup in the related issue, what else would be helpful for getting a better sense of the discrepancy?

iancarrasco-b10 · 2026-04-08T17:25:44Z

Have you tried running the Base cloning task? My results were for Base rather than CustomVoice, so that could arguably be playing a role. I'll also run CustomVoice on my setup to see if I observe the same.

linyueqian · 2026-04-08T17:55:00Z

Follow-up: Base (voice cloning) task shows different results on the same H20 hardware.

| Concurrency | Mean RTF (mp) | Mean RTF (uni) | Delta |
|---|---|---|---|
| 1 | 0.297 | 0.240 | uni 19.3% better |
| 4 | 0.626 | 0.607 | ~tied |
| 10 | 1.193 | 1.044 | uni 12.5% better |
| 16 | 1.936 | 1.388 | uni 28.3% better |

For the Base task, uni is better or tied at every concurrency level, consistent with the original findings in #2603. However, using CustomVoice on the same H20 GPU, we saw the opposite (mp winning at concurrency 4+).

This suggests the tradeoff is task-dependent, not just hardware-dependent. The Base task involves heavier per-request processing (reference audio encoding), making IPC serialization overhead a larger fraction of the total cost, which favors uni. CustomVoice is lighter per-request, so the process-level parallelism of mp dominates.

Given this, a blanket default change seems risky. Keeping mp as the explicit default in configs and letting users opt into uni for their specific workload would be the safer path.
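
To make the per-task comparison concrete, the sketch below recomputes the winner at each concurrency level from the two H20 tables above. The RTF values are copied from those tables; the 5% tie threshold and the helper name `faster_backend` are assumptions chosen for illustration, not something defined in this PR. Because the inputs are rounded, recomputed percentages can differ from the posted ones in the last digit.

```python
# Mean RTF on H20, copied from the tables above: {task: {concurrency: (mp, uni)}}.
# Lower RTF is better.
H20_MEAN_RTF = {
    "CustomVoice": {1: (0.155, 0.151), 4: (0.221, 0.308), 10: (0.372, 0.423), 16: (0.430, 0.541)},
    "Base": {1: (0.297, 0.240), 4: (0.626, 0.607), 10: (1.193, 1.044), 16: (1.936, 1.388)},
}

TIE_THRESHOLD = 0.05  # assumption: differences within 5% of the mp value count as a tie


def faster_backend(mp_rtf: float, uni_rtf: float, tie: float = TIE_THRESHOLD) -> str:
    """Return "mp", "uni", or "tied" for one row, measuring the gap relative to mp."""
    if abs(uni_rtf - mp_rtf) / mp_rtf <= tie:
        return "tied"
    return "mp" if mp_rtf < uni_rtf else "uni"


for task, rows in H20_MEAN_RTF.items():
    for concurrency, (mp_rtf, uni_rtf) in rows.items():
        gap_pct = (uni_rtf - mp_rtf) / mp_rtf * 100.0
        winner = faster_backend(mp_rtf, uni_rtf)
        print(f"{task:<11} c={concurrency:<2} uni vs mp: {gap_pct:+.1f}%  winner: {winner}")
```

A table-driven check like this only describes the measured hardware and tasks; it says nothing about other GPUs or models.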

iancarrasco-b10 · 2026-04-08T18:00:03Z

Thanks for running this, @linyueqian! Agreed to not merge the blanket change. I think it would be worth adding this to the docs, or commenting somewhere more permanent, so that future deployments can take advantage of these perf gains. Potentially having a task-specific stage config?

hsliuustc0106 · 2026-04-09T07:55:01Z

Does this also apply to qwen-omni? @ZeldaHuang @amy-why-3459

ZeldaHuang · 2026-04-09T12:51:30Z

Qwen3-Omni Benchmark: `distributed_executor_backend="mp"` vs default (uniproc)

Stage path: qwen3_omni_moe_async_chunk.yaml

| Concurrency | TTFT (ms) mp / uni | TPOT (ms) mp / uni | Audio RTF mp / uni |
|---|---|---|---|
| 1 | 816 / 649 | 10.37 / 9.35 | 0.20 / 0.18 |
| 4 | 1325 / 1135 | 25.71 / 17.50 | 0.33 / 0.38 |
| 10 | 5397 / 2665 | 71.45 / 42.25 | 0.87 / 0.66 |
| 16 | 8473 / 4947 | 103.17 / 72.21 | 1.28 / 1.23 |

For TTFT and TPOT, the default uniproc executor consistently outperforms distributed_executor_backend="mp" across all concurrency levels. The absolute gap widens as concurrency increases: at c=16, uniproc achieves ~42% lower TTFT and ~30% lower TPOT.

For Audio RTF, the results are mixed: mp is slightly better at c=4 (0.33 vs 0.38) but worse at other levels.
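
The percentages quoted for c=16 can be re-derived from the table with a few lines of arithmetic. This is a minimal sketch using the TTFT and TPOT values copied from the table above; the helper name `percent_lower` is hypothetical.

```python
# (mp, uni) pairs copied from the Qwen3-Omni table above, keyed by concurrency.
RESULTS = {
    1: {"ttft_ms": (816, 649), "tpot_ms": (10.37, 9.35)},
    4: {"ttft_ms": (1325, 1135), "tpot_ms": (25.71, 17.50)},
    10: {"ttft_ms": (5397, 2665), "tpot_ms": (71.45, 42.25)},
    16: {"ttft_ms": (8473, 4947), "tpot_ms": (103.17, 72.21)},
}


def percent_lower(mp_value: float, uni_value: float) -> float:
    """How much lower the uni value is, as a percentage of the mp value."""
    return (mp_value - uni_value) / mp_value * 100.0


for concurrency, metrics in RESULTS.items():
    for name, (mp_value, uni_value) in metrics.items():
        absolute_gap = mp_value - uni_value
        relative_gap = percent_lower(mp_value, uni_value)
        print(f"c={concurrency:<2} {name}: gap {absolute_gap:.2f} ms, uni {relative_gap:.1f}% lower")
```

With these rounded inputs, the absolute mp-minus-uni gap grows at every step for both metrics, while the relative gap does not grow monotonically (for TTFT it is larger at c=10 than at c=16).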

hsliuustc0106 · 2026-04-09T12:55:30Z

@linyueqian @wtomin I think this will work for many cases, please check

amy-why-3459 · 2026-04-09T13:14:22Z

Nice work. I'm just curious, how did you discover this problem?

iancarrasco-b10 · 2026-04-09T14:06:17Z

I was profiling where time was being spent in Qwen3-TTS forward passes for the Base cloning task and noticed low GPU utilization, which pointed to CPU-bound work potentially causing the GPU to idle. Running at higher concurrency made it more apparent that the D2H copies, serialization/deserialization, msgpack encoding, and tensor detaching in 'mp' mode were taking up a considerable amount of time compared with the AR steps and decode.
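
For a rough feel of how encode/decode cost scales with payload size, here is a small standard-library sketch. It is an assumption-laden stand-in: it uses `pickle` rather than the msgpack path mentioned above, it does not involve D2H copies or tensors, and it does not reproduce vLLM's actual IPC. The helper name `round_trip_seconds` and the payload sizes are hypothetical.

```python
import pickle
import time


def round_trip_seconds(payload: bytes, repeats: int = 20) -> float:
    """Average time to pickle and unpickle `payload` (a proxy for IPC encode/decode)."""
    restored = payload
    start = time.perf_counter()
    for _ in range(repeats):
        encoded = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
        restored = pickle.loads(encoded)
    elapsed = time.perf_counter() - start
    assert restored == payload  # sanity check: the round trip is lossless
    return elapsed / repeats


for size_kib in (16, 256, 4096):
    payload = bytes(size_kib * 1024)  # zero-filled buffer standing in for audio/tensor data
    micros = round_trip_seconds(payload) * 1e6
    print(f"{size_kib:>5} KiB payload: {micros:.1f} us per round trip")
```

Absolute numbers from this sketch are machine-dependent; a profiler trace of the real serving process, as described above, is the reliable way to attribute time.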

linyueqian · 2026-04-12T00:38:30Z

fix ci please

linyueqian · 2026-04-12T00:48:09Z

@iancarrasco-b10 Great investigation! Since the mp vs uni tradeoff is both hardware- and task-dependent, could you add a short section to the docs (e.g., under the Qwen3-TTS serving guide) summarizing:

- When uni wins (Base/cloning tasks, H100)
- When mp wins (CustomVoice, H20 at high concurrency)
- How users can switch between them

This way future deployments can make an informed choice. A task-specific stage config example would also be helpful.
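
As one possible shape for the "how to switch" guidance, the sketch below toggles the backend on an already-loaded stage config. The layout is an assumption: it presumes the YAML parses to a dict with a `stage_args` list whose entries carry an `engine_args` mapping, and the helper `set_executor_backend` is hypothetical rather than part of vllm-omni.

```python
import copy
from typing import Any, Optional


def set_executor_backend(stage_config: dict[str, Any], backend: Optional[str]) -> dict[str, Any]:
    """Return a copy of the config with the executor backend set for every stage.

    Passing None removes the key so that vLLM's own default applies
    (uniproc on a single device, per the discussion above).
    The "stage_args" / "engine_args" keys are assumed, not confirmed by this PR.
    """
    updated = copy.deepcopy(stage_config)
    for stage in updated.get("stage_args", []):
        engine_args = stage.setdefault("engine_args", {})
        if backend is None:
            engine_args.pop("distributed_executor_backend", None)
        else:
            engine_args["distributed_executor_backend"] = backend
    return updated


# Hypothetical parsed config with "mp" hardcoded, as in the original qwen3_tts.yaml.
config = {"stage_args": [{"stage_id": 0, "engine_args": {"distributed_executor_backend": "mp"}}]}

print(set_executor_backend(config, None))   # key removed: fall back to vLLM's default
print(set_executor_backend(config, "uni"))  # explicit opt-in to uniproc
```

Returning a modified copy keeps the original mapping intact, which makes it easy to benchmark both settings from the same base config.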

iancarrasco-b10 · 2026-04-13T17:14:38Z

Will go ahead and update the docs and add a config here

iancarrasco-b10 requested a review from hsliuustc0106 as a code owner April 8, 2026 16:40

iancarrasco-b10 mentioned this pull request Apr 8, 2026

[Performance]: Uniproc executor has much better performance at higher concurrency on Qwen3-TTS (Single GPU) #2603

Closed

1 task

linyueqian linked an issue Apr 8, 2026 that may be closed by this pull request

[Performance]: Uniproc executor has much better performance at higher concurrency on Qwen3-TTS (Single GPU) #2603

Closed

1 task

linyueqian self-requested a review April 8, 2026 16:46

linyueqian added the ready label to trigger buildkite CI label Apr 8, 2026

iancarrasco-b10 force-pushed the uniproc-executor branch 2 times, most recently from b122235 to dab9604 Compare April 8, 2026 16:52

iancarrasco-b10 force-pushed the uniproc-executor branch from b2abace to e58e392 Compare April 8, 2026 16:58

iancarrasco-b10 changed the title ~~Default to UniProcExecutor for single-GPU stages~~ [Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026

iancarrasco-b10 changed the title ~~[Qwen3-TTS] Remove Hardcoded distributed_executor_backend to improve single-GPU performance~~ [Qwen3-TTS] Remove hardcoded distributed_executor_backend to improve single-GPU performance Apr 8, 2026

hsliuustc0106 added the nightly-test label to trigger buildkite nightly test CI label Apr 9, 2026

ZeldaHuang mentioned this pull request Apr 10, 2026

[RFC][Performance]: Worker-Level Inter-Stage Data Transfer #2671

Closed

1 task

lishunyang12 approved these changes Apr 11, 2026

iancarrasco-b10 added 9 commits April 13, 2026 18:06

Default to uniproc when engine is unset otherwise mp

8f93909

Made-with: Cursor Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

revert check

324ad29

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

cleaner call

940acea

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

ruff format

0a5eeeb

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

revert change as it is redundant

3f8c3a1

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

update readme

52c0f46

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

add a uniproc config

c21b499

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

add instructions

8755f75

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co> Made-with: Cursor Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

more readme updates

787a6e8

Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

iancarrasco-b10 force-pushed the uniproc-executor branch from 67ac3a2 to 787a6e8 Compare April 13, 2026 22:06

Merge branch 'main' into uniproc-executor

224486a

hsliuustc0106 merged commit 48c30bc into vllm-project:main Apr 14, 2026
7 of 8 checks passed

y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026

[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improv…

9c52763

…e single-GPU performance (vllm-project#2604) Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026

[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improv…

7d6263d

…e single-GPU performance (vllm-project#2604) Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improv…

3a6f1a0

…e single-GPU performance (vllm-project#2604) Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Qwen3-TTS] Remove hardcoded distributed_executor_backend to improv…

386131b

…e single-GPU performance (vllm-project#2604) Signed-off-by: Ian Carrasco <ian.carrasco@baseten.co>
