[Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls by DomBrown · Pull Request #1985 · vllm-project/vllm-omni

DomBrown · 2026-03-18T17:30:43Z

Purpose

Keep audio_codes and last_talker_hidden on GPU in the Qwen3-TTS talker's per-step intermediate buffer, eliminating two cudaStreamSynchronize stalls per AR decode step.

Previously, both tensors were copied to CPU via .to("cpu") immediately after they were produced, which blocked the CPU for ~2ms each while it waited for all pending GPU kernels to drain. Neither tensor is consumed on CPU during the decode loop: audio_codes is only used by make_omni_output (torch.cat on GPU), and last_talker_hidden is immediately transferred back to GPU in the next step's preprocess. The D2H transfer for audio_codes is deferred to sample_tokens, where hidden_states.to("cpu") already synchronizes the stream, so the subsequent copy adds no extra sync stall.
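The reordering described above can be illustrated with a minimal PyTorch sketch. It is not the vllm-omni code: `produce_codes`, `rest_of_step`, `step_eager`, and `step_deferred` are hypothetical stand-ins, and the sketch falls back to CPU when no GPU is present (in which case the copies are no-ops and only the ordering is shown).

```python
import torch

# Use the GPU when available; on CPU the .to("cpu") calls are no-ops.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def produce_codes(step: int) -> torch.Tensor:
    """Hypothetical stand-in for the kernel that produces audio_codes."""
    return torch.full((1, 16), step, dtype=torch.long, device=DEVICE)


def rest_of_step(codes: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the remaining GPU work in a decode step."""
    return codes.float().mean(dim=-1, keepdim=True)


def step_eager(step: int):
    """Old pattern: copy the codes to CPU as soon as they exist."""
    codes = produce_codes(step)
    codes_cpu = codes.to("cpu")    # CPU blocks here before it can queue more GPU work
    hidden = rest_of_step(codes)
    hidden_cpu = hidden.to("cpu")  # second blocking copy
    return codes_cpu, hidden_cpu


def step_deferred(step: int):
    """New pattern: keep the codes on device until a sync happens anyway."""
    codes = produce_codes(step)
    hidden = rest_of_step(codes)   # queued without waiting on a D2H copy
    hidden_cpu = hidden.to("cpu")  # the sync the step needs regardless
    codes_cpu = codes.to("cpu")    # codes are already computed at this point
    return codes_cpu, hidden_cpu


if __name__ == "__main__":
    for step in range(3):
        eager, deferred = step_eager(step), step_deferred(step)
        print(step, torch.equal(eager[0], deferred[0]), torch.equal(eager[1], deferred[1]))
```

Both variants return the same values; the difference is only where the CPU waits for the GPU, which is why the change is expected to affect latency and not output.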

This uses the existing gpu_resident_buffer_keys mechanism (already used by Qwen3-Omni for last_talker_hidden) to opt specific keys into GPU-resident storage via _update_intermediate_buffer.
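As a rough illustration of the opt-in idea, the sketch below keeps tensors on their current device when their key is in a GPU-resident set and copies everything else to CPU. The function body, its signature, and the module-level constant are assumptions made for this example; the actual `_update_intermediate_buffer` in vllm-omni may be structured differently.

```python
import torch

# Hypothetical constant mirroring the idea of gpu_resident_buffer_keys;
# the real attribute lives on the model and may be defined differently.
GPU_RESIDENT_BUFFER_KEYS = {"last_talker_hidden", "audio_codes"}


def update_intermediate_buffer(
    buffer: dict,
    updates: dict,
    gpu_resident_keys: set = GPU_RESIDENT_BUFFER_KEYS,
) -> dict:
    """Store per-step outputs, copying to CPU only for keys that did not opt in."""
    for key, value in updates.items():
        if isinstance(value, torch.Tensor) and key not in gpu_resident_keys:
            # Default path: device-to-host copy, which synchronizes on CUDA.
            value = value.to("cpu")
        # Opted-in keys are stored as-is, so no copy and no sync happens here.
        buffer[key] = value
    return buffer


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    codes = torch.zeros(1, 16, dtype=torch.long, device=device)
    other = torch.ones(1, 4, device=device)
    buf = update_intermediate_buffer({}, {"audio_codes": codes, "other": other})
    print({k: v.device.type for k, v in buf.items()})
```

With this shape, opting a new tensor into GPU-resident storage is a one-line change to the key set, which matches the reviewer's description of the talker-side change.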

Test Plan

Performance benchmark

Test Result

Tested on NVIDIA RTX Pro 6000 Blackwell Server Edition

Commands (server first, then the benchmark client):

vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml --omni --port 8000 --trust-remote-code

time vllm-omni bench serve \
    --backend openai-audio-speech \
    --endpoint /v1/audio/speech \
    --model "$MODEL" \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --skip-chat-template \
    --num-prompts 10 \
    --max_concurrency 1 \
    --percentile-metrics e2el,audio_rtf,audio_duration \
    --extra-body '{"voice": "Vivian", "instructions": "Speak with great enthusiasm", "language": "English"}'

Before

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  142.50    
Request throughput (req/s):              0.07      
Peak concurrent requests:                2.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14249.69  
Median E2EL (ms):                        9697.36   
P99 E2EL (ms):                           38141.95  
================== Text Result ===================
Total input tokens:                      785       
Total generated tokens:                  0         
Output token throughput (tok/s):         0.00      
Peak output token throughput (tok/s):    1.00      
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          5.51      
================== Audio Result ==================
Total audio duration generated(s):       246.96    
Total audio frames generated:            5927040   
Audio throughput(audio duration/s):      1.73      
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.58      
Median AUDIO_RTF:                        0.58      
P99 AUDIO_RTF:                           0.59      
------------------Audio Duration------------------
Mean AUDIO_DURATION (s):                 24.70     
Median AUDIO_DURATION (s):               16.80     
P99 AUDIO_DURATION (s):                  66.04     
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  131.47    
Request throughput (req/s):              0.08      
Peak concurrent requests:                2.00      
----------------End-to-end Latency----------------
Mean E2EL (ms):                          13146.68  
Median E2EL (ms):                        8844.79   
P99 E2EL (ms):                           38730.45  
================== Text Result ===================
Total input tokens:                      785       
Total generated tokens:                  0         
Output token throughput (tok/s):         0.00      
Peak output token throughput (tok/s):    1.00      
Peak concurrent requests:                2.00      
Total Token throughput (tok/s):          5.97      
================== Audio Result ==================
Total audio duration generated(s):       267.36    
Total audio frames generated:            6416640   
Audio throughput(audio duration/s):      2.03      
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.49      
Median AUDIO_RTF:                        0.49      
P99 AUDIO_RTF:                           0.50      
------------------Audio Duration------------------
Mean AUDIO_DURATION (s):                 26.74     
Median AUDIO_DURATION (s):               18.04     
P99 AUDIO_DURATION (s):                  79.38     
==================================================
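For a quick read of the two result blocks, the relative changes can be computed directly from the reported figures. The snippet below is only arithmetic on the numbers above; the dictionary keys are names chosen for this example. Note that the two runs generated different amounts of audio (246.96 s vs 267.36 s), so the real-time factor is likely the more comparable metric than raw end-to-end latency.

```python
# Figures copied from the "Before" and "After" benchmark output above.
before = {
    "mean_e2el_ms": 14249.69,
    "median_e2el_ms": 9697.36,
    "mean_audio_rtf": 0.58,
    "audio_throughput": 1.73,
}
after = {
    "mean_e2el_ms": 13146.68,
    "median_e2el_ms": 8844.79,
    "mean_audio_rtf": 0.49,
    "audio_throughput": 2.03,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(before[name], after[name]) for name in before}

if __name__ == "__main__":
    for name, delta in changes.items():
        print(f"{name}: {delta:+.1f}%")
```

This works out to roughly -7.7% mean E2EL, -8.8% median E2EL, -15.5% mean RTF, and +17.3% audio throughput.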

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 73aed6091a

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

DomBrown · 2026-03-18T18:18:13Z

Looks like some of this overlapped with #1852

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

linyueqian

Tested locally on A100-80GB with Qwen3-TTS-12Hz-0.6B-CustomVoice, single-request sequential benchmark (5 runs after warmup):

| Metric | main (with #1852) | This PR | Delta |
|---|---|---|---|
| E2EL mean | 2037 ms | 1772 ms | -13% |
| E2EL median | 1976 ms | 1767 ms | -11% |
| RTF mean | 0.510 | 0.445 | -13% |
| RTF median | 0.508 | 0.436 | -14% |

Confirms the improvement. Already cleanly rebased on top of #1852. The talker-side change is just adding "audio_codes" to the existing gpu_resident_buffer_keys set. The bulk of the value comes from the runner-side changes (deferred D2H for code_predictor_codes in gpu_model_runner.py and hoisted multimodal copy in gpu_ar_model_runner.py).

LGTM.

lishunyang12

LGTM

…eliminate per-step sync stalls (vllm-project#1985) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

…eliminate per-step sync stalls (vllm-project#1985) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: Hui <1779066624@qq.com>

…eliminate per-step sync stalls (vllm-project#1985) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>

DomBrown requested a review from hsliuustc0106 as a code owner March 18, 2026 17:30

Reduce GPU -> CPU copies

2836bcd

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

DomBrown force-pushed the dev/remove_copy branch from 73aed60 to 2836bcd Compare March 18, 2026 17:37

chatgpt-codex-connector bot reviewed Mar 18, 2026

Comment thread vllm_omni/worker/gpu_model_runner.py

Post-rebase fix

9fcab28

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

Hoist copy as suggested in codex review

5fd26ea

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>

linyueqian self-requested a review March 18, 2026 18:38

linyueqian approved these changes Mar 18, 2026

linyueqian added the "ready" label (label to trigger buildkite CI) Mar 18, 2026

linyueqian mentioned this pull request Mar 18, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795 (Open)

Gaohan123 added this to the v0.18.0 milestone Mar 19, 2026

lishunyang12 approved these changes Mar 19, 2026

hsliuustc0106 merged commit 31c8cb2 into vllm-project:main Mar 19, 2026
6 of 7 checks passed

DomBrown deleted the dev/remove_copy branch March 19, 2026 09:33

hsliuustc0106 merged 3 commits into vllm-project:main from DomBrown:dev/remove_copy
