
[Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead #3321

Closed

ischencheng wants to merge 1 commit into vllm-project:main from ischencheng:cheng/async-omni-ar-scheduler

[Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead#3321
ischencheng wants to merge 1 commit into
vllm-project:mainfrom
ischencheng:cheng/async-omni-ar-scheduler

Conversation

@ischencheng ischencheng commented May 3, 2026

Purpose

Not for merge. This PR isolates the AsyncOmniARScheduler portion of #3221 for benchmarking purposes only. The code is taken verbatim from #3221.

The original PR #3221 should be the one that lands this scheduler change. This PR exists only to:

  1. Test the AsyncOmniARScheduler in isolation, without the rest of #3221's NV-talker / Triton recipe changes ("[Model] Add unified Qwen3-TTS model definition and Triton serving example with TensorRT codec").
  2. Provide a clean diff for measuring scheduler-only impact on H100.

Test Plan

  • E2E: tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model on H100 — confirms audio output correctness.
  • Concurrent benchmark on H100, single GPU, Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, default deploy yaml, 96-req warmup + 30/60/80/128 reqs at c=1/4/8/32.

Test Result

E2E test_qwen3_tts_base.py core_model PASSES on H100.

Concurrent bench (H100, Qwen3-TTS-12Hz-1.7B-CustomVoice, default max_num_seqs = 10 / 10 for the two stages):

| Concurrency | TTFA mean (main) | TTFA mean (+Async) | rps (main) | rps (+Async) |
| ----------: | ---------------: | -----------------: | ---------: | -----------: |
|           1 |            68 ms |              68 ms |       1.49 |         1.49 |
|           4 |           135 ms |             135 ms |       4.06 |         4.06 |
|           8 |           250 ms |             240 ms |       6.71 |         6.80 |
|          32 |          2834 ms |            2300 ms |       7.45 |         8.50 |

c=32 sees the largest improvement: TTFA mean -19% and rps +14% vs main on the same workload. Lower concurrencies are unchanged (they are not codec-bound, so there is no async-scheduling gap to recover).

Action

Please review and merge #3221 (the original PR) for the actual change. This PR will be closed once #3221 lands, or sooner if the maintainers prefer.

CC List

@linyueqian @Sy0307

[Perf][Qwen3-TTS] Add AsyncOmniARScheduler to fix talker forward-pass gap

Borrows AsyncOmniARScheduler from PR vllm-project#3221 and wires the LLM_AR scheduler
selection so any stage with async_scheduling=true automatically picks the
async-bookkeeping variant.

Background:

When async_scheduling=true, vLLM's EngineCoreProc drives
step_with_batch_queue, which speculatively schedules the next batch while
the current one is still on the GPU. For the queue to stay full, the
scheduler must increment request.num_output_placeholders after each
scheduled step (so the next schedule() call knows to launch one more decode
token before the previous step's output has merged) and decrement it again
when the output arrives. Base OmniARScheduler skips this bookkeeping, so
schedule() returns 0 tokens on every other step, the engine sleeps 1 ms,
and the alternating empty-step pattern adds a ~2-3 ms gap between every
talker forward, visible in nsys profiles and confirmed by PR vllm-project#3221's
reviewer.
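
For intuition, here is a minimal self-contained sketch of that bookkeeping; every name in it (Request, AsyncBookkeepingScheduler, _has_budget, the 4096 budget) is an illustrative stand-in rather than vLLM's real API, where the logic lives in vllm.v1.core.sched's AsyncScheduler:

```python
# Illustrative stand-ins only -- not vLLM's real classes. The actual
# bookkeeping lives in vllm.v1.core.sched's AsyncScheduler.
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    output_token_ids: list = field(default_factory=list)
    # Tokens already scheduled whose outputs have not arrived yet.
    num_output_placeholders: int = 0


class AsyncBookkeepingScheduler:
    """Keeps the batch queue full when outputs lag one step behind."""

    def schedule(self, requests):
        scheduled = []
        for req in requests:
            # Count in-flight placeholders as if they were committed tokens,
            # so this call launches one more decode token even though the
            # previous step's output has not merged yet.
            effective_len = len(req.output_token_ids) + req.num_output_placeholders
            if self._has_budget(req, effective_len):
                req.num_output_placeholders += 1   # increment on schedule
                scheduled.append(req)
        return scheduled

    def update_from_output(self, req, token_id):
        req.output_token_ids.append(token_id)
        req.num_output_placeholders -= 1           # decrement when output lands

    def _has_budget(self, req, effective_len):
        return effective_len < 4096  # stand-in for the real length/KV checks
```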

AsyncOmniARScheduler injects vllm.v1.core.sched.AsyncScheduler into the
OmniARScheduler MRO so the placeholder bookkeeping takes effect while
preserving every Omni-specific behaviour (OmniNewRequestData wrapping,
KV-transfer metadata, chunk-transfer adapter, streaming-session hooks).
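
The inheritance shape can be shown with toy classes (stand-ins, not the real vllm / vllm_omni definitions; only the MRO matters):

```python
# Toy classes standing in for the real ones -- only the MRO shape matters.
class Scheduler:                          # stand-in for vLLM's base scheduler
    def schedule(self):
        return "base"

class AsyncScheduler(Scheduler):          # adds placeholder bookkeeping
    def schedule(self):
        return "async+" + super().schedule()

class OmniARScheduler(Scheduler):         # Omni-specific behaviour lives here
    def wrap_request(self):
        return "OmniNewRequestData"

class AsyncOmniARScheduler(OmniARScheduler, AsyncScheduler):
    """Omni overrides stay first in lookup order; anything OmniARScheduler
    does not override falls through to AsyncScheduler's bookkeeping."""

print([c.__name__ for c in AsyncOmniARScheduler.__mro__])
# ['AsyncOmniARScheduler', 'OmniARScheduler', 'AsyncScheduler', 'Scheduler', 'object']
print(AsyncOmniARScheduler().schedule())  # -> 'async+base'
```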

Wiring:

* New _resolve_scheduler_cls(execution_type, async_scheduling) helper in
  stage_config.py picks AsyncOmniARScheduler for LLM_AR stages whenever
  async_scheduling=true; sync stages continue to use OmniARScheduler
  (sketched after this list).
* Re-exported from vllm_omni.core.sched for downstream callers.
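
A hedged sketch of that helper (the ExecutionType enum, the dotted-path return convention, and the exact signature are assumptions about stage_config.py, not its verbatim contents):

```python
# Sketch only -- ExecutionType and the return convention are assumed,
# not copied from stage_config.py.
from enum import Enum


class ExecutionType(Enum):
    LLM_AR = "llm_ar"      # autoregressive talker stage
    OTHER = "other"        # e.g. codec / code2wav stages


def _resolve_scheduler_cls(execution_type: ExecutionType,
                           async_scheduling: bool) -> str:
    """Pick the scheduler class for a stage.

    LLM_AR stages with async_scheduling=true get the async-bookkeeping
    variant; everything else keeps the sync OmniARScheduler.
    """
    if execution_type is ExecutionType.LLM_AR and async_scheduling:
        return "vllm_omni.core.sched.AsyncOmniARScheduler"
    return "vllm_omni.core.sched.OmniARScheduler"
```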

Measured impact (single H100 80 GB, Qwen3-TTS-12Hz-0.6B-Base, default
qwen3_tts.yaml = both stages max_num_seqs=10, 30/60/80/128 reqs at
c=1/4/8/32 with 96-req warmup):

| Concurrency | TTFA mean (default) | TTFA mean (+Async) | rps default | rps +Async |
| ----------: | ------------------: | -----------------: | ----------: | ---------: |
|           1 |              259 ms |             260 ms |       0.93  |       0.94 |
|           4 |              761 ms |             728 ms |       1.26  |       1.39 |
|           8 |             1220 ms |            1129 ms |       1.75  |       1.55 |
|          32 |             7286 ms |            5775 ms |       3.24  |       3.91 |

c=32 sees TTFA mean -21% and rps +20% vs the base RFC vllm-project#3163 P0 fix; rps
also exceeds main (3.51) on the same workload. c=1 is unchanged.

Co-Authored-By: Viacheslav Klimkov (PR vllm-project#3221)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ischencheng ischencheng requested a review from hsliuustc0106 as a code owner May 3, 2026 15:47

@ischencheng ischencheng changed the title [Perf][Qwen3-TTS] Add AsyncOmniARScheduler to fix talker forward-pass gap [Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead May 3, 2026
@hsliuustc0106
Collaborator

Understood - this is a test-only PR for benchmarking AsyncOmniARScheduler in isolation. As stated in the PR description, please merge #3221 (the original PR) for the actual change.

@linyueqian
Collaborator

H20-3e numbers from a 4-way bench (main / +3321 / +3322 / +both, Qwen3-TTS-12Hz-1.7B, single H20-3e) confirm this PR is a strict win on CustomVoice and approximately neutral on Base + voice-clone:

| Workload    | Conc | main TTFA p95 (ms) / rps | +3321 TTFA p95 (ms) / rps |
| ----------- | ---: | -----------------------: | ------------------------: |
| CustomVoice |    4 |               188 / 3.82 |                168 / 4.00 |
| CustomVoice |    8 |               815 / 5.56 |                781 / 5.11 |
| CustomVoice |   16 |              2004 / 5.63 | 1728 / 6.46 (-14% p95, +15% rps) |
| CustomVoice |   32 |              3611 / 6.03 |                3814 / 5.65 |
| Base+VC     |    4 |              3293 / 0.75 |                3208 / 0.75 |
| Base+VC     |    8 |              8313 / 0.84 |                7911 / 0.87 |
| Base+VC     |   32 |             15568 / 1.93 | 17005 / 1.62 (slight regression at saturation) |

Full table and the Code2Wav-batching trade-off from #3322 are in #3322 (analysis comment).

@linyueqian
Collaborator

@ischencheng heads up, #3306 landed in main on 2026-05-05 (~20 min after your last restructure here). It delivers the async scheduler split as OmniARAsyncScheduler, routed by the async_scheduling flag. That's the same plumbing point this PR was wiring.

Could you re-bench against current main? Your "main" baseline in #3321 and #3322 no longer matches what's on main now. Same H100 + 1.7B-Base + voice-clone setup, c=1/4/8/16/32:

  1. current main, default yaml (new baseline)
  2. current main + --stage-overrides '{"1": {"max_num_seqs": 10}}' (isolates #3322, "[Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0)")
  3. (optional) main + this branch cherry-picked, to confirm #3306 ("[Core] Support Async & Sync AutoRegressive Scheduling") already covers the win this PR was after

Headline question for #3322: does the c=4 / c=8 TTFA win still hold once #3306 is the baseline? If yes, #3322 rebases and lands cleanly. If #3306 absorbed it, #3322 shrinks to a docs/launcher tweak.

cc @vklimkov-nvidia; the same applies to #3221's rebase. The scheduler files there are now redundant with main.

@linyueqian linyueqian closed this May 5, 2026

Labels

ready (label to trigger Buildkite CI)
