[Test-only][Qwen3-TTS] AsyncOmniARScheduler isolation — code from #3221, please merge #3221 instead #3321
Conversation
Borrows `AsyncOmniARScheduler` from PR vllm-project#3221 and wires the LLM_AR scheduler selection so any stage with `async_scheduling=true` automatically picks the async-bookkeeping variant.

Background: when `async_scheduling=true`, vLLM's `EngineCoreProc` drives `step_with_batch_queue`, which speculatively schedules the next batch while the current one is still on the GPU. For the queue to stay full, the scheduler must increment `request.num_output_placeholders` after each scheduled step (so the next `schedule()` call knows to launch one more decode token before the previous step's output has merged) and decrement it again when the output arrives. The base `OmniARScheduler` skips this bookkeeping, so `schedule()` returns 0 tokens on every other step, the engine sleeps 1 ms, and the alternating empty-step pattern adds a ~2-3 ms gap between every talker forward, visible in nsys profiles and confirmed by PR vllm-project#3221's reviewer.

`AsyncOmniARScheduler` injects `vllm.v1.core.sched.AsyncScheduler` into the `OmniARScheduler` MRO so the placeholder bookkeeping takes effect while preserving every Omni-specific behaviour (`OmniNewRequestData` wrapping, KV-transfer metadata, chunk-transfer adapter, streaming-session hooks).

Wiring:
* New `_resolve_scheduler_cls(execution_type, async_scheduling)` helper in `stage_config.py` picks `AsyncOmniARScheduler` for LLM_AR stages whenever `async_scheduling=true`; sync stages continue to use `OmniARScheduler`.
* Re-exported from `vllm_omni.core.sched` for downstream callers.
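To make the placeholder bookkeeping concrete, here is a toy simulation of the double-buffered step loop. This is illustrative only: the real logic lives in `vllm.v1.core.sched` and tracks far more state; `Req`, `schedule_one`, `merge_output`, and `simulate` are hypothetical names invented for this sketch.

```python
class Req:
    def __init__(self):
        self.num_tokens = 1               # prompt + merged output tokens
        self.num_computed_tokens = 0      # tokens already launched on the GPU
        self.num_output_placeholders = 0  # launched decodes whose output is pending

def schedule_one(req, async_bookkeeping):
    # vLLM-style budget: tokens the model still owes a forward pass for.
    n = req.num_tokens + req.num_output_placeholders - req.num_computed_tokens
    if n > 0:
        req.num_computed_tokens += n
        if async_bookkeeping:
            req.num_output_placeholders += n  # AsyncScheduler's extra increment
    return n

def merge_output(req, async_bookkeeping):
    # A previously launched decode finished: its token becomes real.
    req.num_tokens += 1
    if async_bookkeeping:
        req.num_output_placeholders -= 1      # matching decrement

def simulate(steps, async_bookkeeping):
    # 2-deep batch queue: step N's output merges only after step N+1 has
    # already been scheduled (the speculation in step_with_batch_queue).
    req, pending, history = Req(), [], []
    for _ in range(steps):
        n = schedule_one(req, async_bookkeeping)  # speculative schedule
        history.append(n)
        pending.append(n)
        if len(pending) > 1 and pending.pop(0) > 0:
            merge_output(req, async_bookkeeping)
    return history

print(simulate(6, async_bookkeeping=False))  # [1, 0, 1, 0, 1, 0] -> empty-step gap
print(simulate(6, async_bookkeeping=True))   # [1, 1, 1, 1, 1, 1] -> queue stays full
```

The alternating `[1, 0, 1, 0, ...]` pattern in the sync case is exactly the empty-step gap described above; with the increment/decrement pair the scheduler launches one decode per step.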
Measured impact (single H100 80 GB, Qwen3-TTS-12Hz-0.6B-Base, default `qwen3_tts.yaml` = both stages `max_num_seqs=10`, 30/60/80/128 reqs at c=1/4/8/32 with 96-req warmup):

| Concurrency | TTFA mean (default) | TTFA mean (+Async) | rps default | rps +Async |
| ----------: | ------------------: | -----------------: | ----------: | ---------: |
| 1 | 259 ms | 260 ms | 0.93 | 0.94 |
| 4 | 761 ms | 728 ms | 1.26 | 1.39 |
| 8 | 1220 ms | 1129 ms | 1.75 | 1.55 |
| 32 | 7286 ms | 5775 ms | 3.24 | 3.91 |

c=32 sees TTFA mean -21% and rps +20% vs the base RFC vllm-project#3163 P0 fix; rps also exceeds main (3.51) on the same workload. c=1 is unchanged.

Co-Authored-By: Viacheslav Klimkov (PR vllm-project#3221)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Understood - this is a test-only PR for benchmarking AsyncOmniARScheduler in isolation. As stated in the PR description, please merge #3221 (the original PR) for the actual change.
H20-3e numbers from a 4-way bench (main / +3321 / +3322 / +both, Qwen3-TTS-12Hz-1.7B, single H20-3e) confirm this PR is a strict win on CustomVoice and approximately neutral on Base + voice-clone. The full table and the Code2Wav-batching trade-off from #3322 are in #3322 (analysis comment).
@ischencheng heads up: #3306 landed in main on 2026-05-05 (~20 min after your last restructure here). It delivers the async scheduler split. Could you re-bench against current main? Your "main" baseline in #3321 and #3322 no longer matches what's on main.
Headline question for #3322: does the c=4 / c=8 TTFA win still hold once #3306 is the baseline? If yes, #3322 rebases and lands cleanly. If #3306 absorbed it, #3322 shrinks to a docs/launcher tweak. cc @vklimkov-nvidia, the same affects #3221's rebase: the scheduler files there are now redundant with main.
Purpose
Not for merge. This PR isolates the `AsyncOmniARScheduler` portion of #3221 for benchmarking purposes only; the code is taken verbatim from #3221. The original PR #3221 should be the one that lands this scheduler change. This PR exists only to benchmark that change in isolation.
Test Plan
* `tests/e2e/online_serving/test_qwen3_tts_base.py -m core_model` on H100, confirming audio output correctness.
* Concurrent bench: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`, default deploy yaml, 96-req warmup + 30/60/80/128 reqs at c=1/4/8/32.
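For reference, the flag that selects the async variant presumably lives in the stage config; a hypothetical fragment (stage layout and field placement assumed, not copied from `qwen3_tts.yaml` — only `execution_type`, `max_num_seqs`, and `async_scheduling` are named in this PR):

```yaml
stages:
  - name: talker             # LLM_AR stage
    execution_type: LLM_AR
    max_num_seqs: 10
    async_scheduling: true   # _resolve_scheduler_cls -> AsyncOmniARScheduler
  - name: code2wav           # non-AR stage keeps the sync OmniARScheduler
    max_num_seqs: 10
```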
* E2E: `test_qwen3_tts_base.py -m core_model` passes on H100.
* Concurrent bench (H100, `Qwen3-TTS-12Hz-1.7B-CustomVoice`, default `max_num_seqs` config = 10 / 10): c=32 sees the largest improvement, TTFA mean -19% and rps +14% vs main on the same workload. Lower concurrency is unchanged (codec-not-bound, no async-scheduling gap to recover).
Action
Please review and merge #3221 (the original PR) for the actual change. This PR will be closed once #3221 lands, or sooner if the maintainers prefer.
CC List
@linyueqian @Sy0307