
[Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead #3321

Closed

ischencheng wants to merge 1 commit into vllm-project:main from ischencheng:cheng/async-omni-ar-scheduler

[Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead#3321
ischencheng wants to merge 1 commit into
vllm-project:mainfrom
ischencheng:cheng/async-omni-ar-scheduler

Conversation

@ischencheng ischencheng commented May 3, 2026

Purpose

Not for merge. This PR isolates the AsyncOmniARScheduler portion of #3221 for benchmarking purposes only. The code is taken verbatim from #3221.

The original PR #3221 should be the one that lands this scheduler change. This PR exists only to:

  1. Test the AsyncOmniARScheduler in isolation, without the rest of #3221's NV-talker / Triton recipe changes ("[Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec").
  2. Provide a clean diff for measuring scheduler-only impact on H100.

Test Plan

  • E2E: tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model on H100 — confirms audio output correctness.
  • Concurrent benchmark on H100, single GPU, Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, default deploy yaml, 96-req warmup + 30/60/80/128 reqs at c=1/4/8/32.

Test Result

E2E test_qwen3_tts_base.py core_model PASSES on H100.

Concurrent bench (H100, Qwen3-TTS-12Hz-1.7B-CustomVoice, default max_num_seqs = 10 / 10 for the two stages):

| Concurrency | TTFA mean (main) | TTFA mean (+Async) | rps (main) | rps (+Async) |
| ----------: | ---------------: | -----------------: | ---------: | -----------: |
|           1 |            68 ms |              68 ms |       1.49 |         1.49 |
|           4 |           135 ms |             135 ms |       4.06 |         4.06 |
|           8 |           250 ms |             240 ms |       6.71 |         6.80 |
|          32 |          2834 ms |            2300 ms |       7.45 |         8.50 |

c=32 sees the largest improvement: TTFA mean -19% and rps +14% vs main on the same workload. Lower concurrencies are unchanged (they are not codec-bound, so there is no async-scheduling gap to recover).

Action

Please review and merge #3221 (the original PR) for the actual change. This PR will be closed once #3221 lands, or sooner if the maintainers prefer.

CC List

@linyueqian @Sy0307

[Perf][Qwen3-TTS] Add AsyncOmniARScheduler to fix talker forward-pass gap

Borrows AsyncOmniARScheduler from PR vllm-project#3221 and wires the LLM_AR scheduler
selection so any stage with async_scheduling=true automatically picks the
async-bookkeeping variant.

Background:

When async_scheduling=true, vLLM's EngineCoreProc drives
step_with_batch_queue, which speculatively schedules the next batch while
the current one is still on the GPU. For the queue to stay full, the
scheduler must increment request.num_output_placeholders after each
scheduled step (so the next schedule() call knows to launch one more decode
token before the previous step's output has merged) and decrement it again
when the output arrives. Base OmniARScheduler skips this bookkeeping, so
schedule() returns 0 tokens on every other step, the engine sleeps 1 ms,
and the alternating empty-step pattern adds a ~2-3 ms gap between every
talker forward, visible in nsys profiles and confirmed by PR vllm-project#3221's
reviewer.
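
For intuition, here is a minimal self-contained sketch of that bookkeeping; every name in it (Request, AsyncBookkeepingScheduler, _has_budget, the 4096 budget) is an illustrative stand-in rather than vLLM's real API, where the logic lives in vllm.v1.core.sched's AsyncScheduler:

```python
# Illustrative stand-ins only -- not vLLM's real classes. The actual
# bookkeeping lives in vllm.v1.core.sched's AsyncScheduler.
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    output_token_ids: list = field(default_factory=list)
    # Tokens already scheduled whose outputs have not arrived yet.
    num_output_placeholders: int = 0


class AsyncBookkeepingScheduler:
    """Keeps the batch queue full when outputs lag one step behind."""

    def schedule(self, requests):
        scheduled = []
        for req in requests:
            # Count in-flight placeholders as if they were committed tokens,
            # so this call launches one more decode token even though the
            # previous step's output has not merged yet.
            effective_len = len(req.output_token_ids) + req.num_output_placeholders
            if self._has_budget(req, effective_len):
                req.num_output_placeholders += 1   # increment on schedule
                scheduled.append(req)
        return scheduled

    def update_from_output(self, req, token_id):
        req.output_token_ids.append(token_id)
        req.num_output_placeholders -= 1           # decrement when output lands

    def _has_budget(self, req, effective_len):
        return effective_len < 4096  # stand-in for the real length/KV checks
```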

AsyncOmniARScheduler injects vllm.v1.core.sched.AsyncScheduler into the
OmniARScheduler MRO so the placeholder bookkeeping takes effect while
preserving every Omni-specific behaviour (OmniNewRequestData wrapping,
KV-transfer metadata, chunk-transfer adapter, streaming-session hooks).
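
The inheritance shape can be shown with toy classes (stand-ins, not the real vllm / vllm_omni definitions; only the MRO matters):

```python
# Toy classes standing in for the real ones -- only the MRO shape matters.
class Scheduler:                          # stand-in for vLLM's base scheduler
    def schedule(self):
        return "base"

class AsyncScheduler(Scheduler):          # adds placeholder bookkeeping
    def schedule(self):
        return "async+" + super().schedule()

class OmniARScheduler(Scheduler):         # Omni-specific behaviour lives here
    def wrap_request(self):
        return "OmniNewRequestData"

class AsyncOmniARScheduler(OmniARScheduler, AsyncScheduler):
    """Omni overrides stay first in lookup order; anything OmniARScheduler
    does not override falls through to AsyncScheduler's bookkeeping."""

print([c.__name__ for c in AsyncOmniARScheduler.__mro__])
# ['AsyncOmniARScheduler', 'OmniARScheduler', 'AsyncScheduler', 'Scheduler', 'object']
print(AsyncOmniARScheduler().schedule())  # -> 'async+base'
```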

Wiring:

* New _resolve_scheduler_cls(execution_type, async_scheduling) helper in
  stage_config.py picks AsyncOmniARScheduler for LLM_AR stages whenever
  async_scheduling=true; sync stages continue to use OmniARScheduler
  (sketched after this list).
* Re-exported from vllm_omni.core.sched for downstream callers.
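
A hedged sketch of that helper (the ExecutionType enum, the dotted-path return convention, and the exact signature are assumptions about stage_config.py, not its verbatim contents):

```python
# Sketch only -- ExecutionType and the return convention are assumed,
# not copied from stage_config.py.
from enum import Enum


class ExecutionType(Enum):
    LLM_AR = "llm_ar"      # autoregressive talker stage
    OTHER = "other"        # e.g. codec / code2wav stages


def _resolve_scheduler_cls(execution_type: ExecutionType,
                           async_scheduling: bool) -> str:
    """Pick the scheduler class for a stage.

    LLM_AR stages with async_scheduling=true get the async-bookkeeping
    variant; everything else keeps the sync OmniARScheduler.
    """
    if execution_type is ExecutionType.LLM_AR and async_scheduling:
        return "vllm_omni.core.sched.AsyncOmniARScheduler"
    return "vllm_omni.core.sched.OmniARScheduler"
```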

Measured impact (single H100 80 GB, Qwen3-TTS-12Hz-0.6B-Base, default
qwen3_tts.yaml = both stages max_num_seqs=10, 30/60/80/128 reqs at
c=1/4/8/32 with 96-req warmup):

| Concurrency | TTFA mean (default) | TTFA mean (+Async) | rps default | rps +Async |
| ----------: | ------------------: | -----------------: | ----------: | ---------: |
|           1 |              259 ms |             260 ms |       0.93  |       0.94 |
|           4 |              761 ms |             728 ms |       1.26  |       1.39 |
|           8 |             1220 ms |            1129 ms |       1.75  |       1.55 |
|          32 |             7286 ms |            5775 ms |       3.24  |       3.91 |

c=32 sees TTFA mean -21% and rps +20% vs the base RFC vllm-project#3163 P0 fix; rps
also exceeds main (3.51) on the same workload. c=1 is unchanged.

Co-Authored-By: Viacheslav Klimkov (PR vllm-project#3221)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ischencheng ischencheng requested a review from hsliuustc0106 as a code owner May 3, 2026 15:47

@ischencheng ischencheng changed the title [Perf][Qwen3-TTS] Add AsyncOmniARScheduler to fix talker forward-pass gap [Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead May 3, 2026
@hsliuustc0106
Collaborator

Understood - this is a test-only PR for benchmarking AsyncOmniARScheduler in isolation. As stated in the PR description, please merge #3221 (the original PR) for the actual change.

@linyueqian
Collaborator

H20-3e numbers from a 4-way bench (main / +3321 / +3322 / +both, Qwen3-TTS-12Hz-1.7B, single H20-3e) confirm this PR is a strict win on CustomVoice and approximately neutral on Base + voice-clone:

| Workload    | Conc | main TTFA p95 (ms) / rps | +3321 TTFA p95 (ms) / rps |
| ----------- | ---: | -----------------------: | ------------------------: |
| CustomVoice |    4 |               188 / 3.82 |                168 / 4.00 |
| CustomVoice |    8 |               815 / 5.56 |                781 / 5.11 |
| CustomVoice |   16 |              2004 / 5.63 | 1728 / 6.46 (-14% p95, +15% rps) |
| CustomVoice |   32 |              3611 / 6.03 |                3814 / 5.65 |
| Base+VC     |    4 |              3293 / 0.75 |                3208 / 0.75 |
| Base+VC     |    8 |              8313 / 0.84 |                7911 / 0.87 |
| Base+VC     |   32 |             15568 / 1.93 | 17005 / 1.62 (slight regression at saturation) |

Full table and the Code2Wav-batching trade-off from #3322 are in #3322 (analysis comment).

@linyueqian
Collaborator

@ischencheng heads up, #3306 landed in main on 2026-05-05 (~20 min after your last restructure here). It delivers the async scheduler split as OmniARAsyncScheduler, routed by the async_scheduling flag. That's the same plumbing point this PR was wiring.

Could you re-bench against current main? Your "main" baseline in #3321 and #3322 no longer matches what's on main now. Same H100 + 1.7B-Base + voice-clone setup, c=1/4/8/16/32:

  1. current main, default yaml (new baseline)
  2. current main + --stage-overrides '{"1": {"max_num_seqs": 10}}' (isolates #3322, "[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0)")
  3. (optional) main + this branch cherry-picked, to confirm #3306 ("[Core] Support Async & Sync AutoRegressive Scheduling") already covers the win this PR was after

Headline question for #3322: does the c=4 / c=8 TTFA win still hold once #3306 is the baseline? If yes, #3322 rebases and lands cleanly. If #3306 absorbed it, #3322 shrinks to a docs/launcher tweak.

cc @vklimkov-nvidia; the same applies to #3221's rebase. The scheduler files there are now redundant with main.

@linyueqian linyueqian closed this May 5, 2026

Labels

ready (label to trigger Buildkite CI)
