perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor#1614
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20d80bb3f4
```python
# calls (each .item() on a GPU tensor forces a GPU→CPU sync).
query_lens_list = query_lens_i32[:num_reqs].tolist()
seq_lens_list = seq_lens_i32[:num_reqs].tolist()
block_table_cpu = self._block_table[:num_reqs].cpu().tolist()
```
Avoid full block-table host copy in hot decode path
In _LocalPredictorKVCache.build_attn_metadata, this line materializes the entire num_reqs × blocks_per_seq block table as Python lists on every call, even though the inner loop only needs one block_idx per generated token. During decode (decode_logits sets query_lens to ones), that turns per-step work from roughly O(num_reqs) into O(num_reqs * blocks_per_seq), where blocks_per_seq is derived from the configured max_seq_len. For long-context configs this extra device→host copy plus Python-int conversion can dominate token-step latency and regress throughput versus the previous indexed access.
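For illustration, a narrower transfer along the lines Codex is suggesting might look like the sketch below. The helper name `decode_block_ids` and the assumption that `block_size` and `seq_lens_i32` are in scope are hypothetical; this is not part of the PR.

```python
import torch

def decode_block_ids(block_table: torch.Tensor,
                     seq_lens_i32: torch.Tensor,
                     num_reqs: int,
                     block_size: int) -> list[int]:
    """Copy only the one block index each decoding request needs (query_len == 1),
    instead of materializing the full num_reqs x blocks_per_seq table on the host."""
    seq_lens = seq_lens_i32[:num_reqs].to(block_table.device)
    last_block = ((seq_lens - 1) // block_size).long()            # last block per request
    picked = block_table[:num_reqs].gather(1, last_block.unsqueeze(1))
    return picked.squeeze(1).tolist()                             # one host transfer, O(num_reqs) ints
```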
Thanks for the suggestion. The block-table copy optimization is a good point, but it falls outside the scope of this PR, which focuses specifically on the per-element `.item()` syncs.
Is there any e2e test result compared with the previous baseline?
hsliuustc0106
left a comment
Code Review Report
📋 Summary
| Item | Details |
|---|---|
| PR | perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor |
| Author | @dubin555 |
| Files Changed | qwen3_tts_code_predictor_vllm.py, test_slot_mapping_fix.py |
| Changes | +224 / -5 |
🎯 Purpose
Eliminate expensive GPU→CPU synchronizations in _LocalPredictorKVCache.build_attn_metadata(), which is called on every TTS decode step for Qwen3-TTS requests.
Problem:
- Per-element `.item()` calls inside nested loops force `cudaStreamSynchronize` on every iteration
- 3 locations affected (`qwen3_tts_code_predictor_vllm.py:113-147`):
  - `int(query_lens_i32[i].item())` — one sync per request
  - `int(seq_lens_i32[i].item())` — one sync per request
  - `int(self._block_table[i, block_idx].item())` — one sync per token position per request (inner loop!)
Impact: Each .item() costs 5-20μs. With num_reqs × max_seq_len iterations → hundreds of unnecessary sync points per decode step.
Fix: Replace per-element .item() with batch .tolist() / .cpu().tolist() before loops, then use plain Python list indexing inside.
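As an aside, PyTorch's sync debug mode can surface where such hidden syncs occur; the snippet below is a debugging sketch using `torch.cuda.set_sync_debug_mode`, not something this PR adds.

```python
import torch

if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")  # warn on every implicit GPU->CPU synchronization
    t = torch.arange(8, device="cuda")
    _ = int(t[3].item())                    # per-element read: one sync (and one warning) each time
    _ = t.tolist()                          # still synchronizes, but only once for the whole tensor
    torch.cuda.set_sync_debug_mode("default")
```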
📊 Performance Improvement
| Metric | Before | After |
|---|---|---|
| GPU syncs per call | num_reqs * 4 + num_tokens | 3 batch transfers (O(1)) |
CPU Benchmark:
| num_reqs | avg_seq_len | Original | Fixed | Speedup |
|---|---|---|---|---|
| 1 | 8 | 0.021 ms | 0.016 ms | 1.33x |
| 4 | 8 | 0.055 ms | 0.027 ms | 2.02x |
| 8 | 16 | 0.091 ms | 0.040 ms | 2.28x |
| 16 | 16 | 0.175 ms | 0.068 ms | 2.56x |
| 32 | 16 | 0.341 ms | 0.124 ms | 2.74x |
| 64 | 16 | 0.673 ms | 0.242 ms | 2.79x |
GPU expected: 10-100x improvement due to eliminated cudaStreamSynchronize overhead.
🔍 Code Changes
```python
# Before (per-element .item() = GPU sync each time):
for i in range(num_reqs):
    ql = int(query_lens_i32[i].item())  # GPU sync!
    sl = int(seq_lens_i32[i].item())    # GPU sync!
    for p in range(start, sl):
        block_id = int(self._block_table[i, block_idx].item())  # GPU sync!

# After (batch conversion, no per-element syncs):
query_lens_list = query_lens_i32[:num_reqs].tolist()
seq_lens_list = seq_lens_i32[:num_reqs].tolist()
block_table_cpu = self._block_table[:num_reqs].cpu().tolist()
for i in range(num_reqs):
    ql = query_lens_list[i]  # No sync
    sl = seq_lens_list[i]    # No sync
    for p in range(start, sl):
        block_id = block_table_cpu[i][block_idx]  # No sync
```
✅ Test Coverage
6 unit tests verifying correctness against the original implementation:
| Test | Coverage |
|---|---|
| test_decode_single_request | Single request, decode mode (query_len=1) |
| test_prefill_single_request | Single request, prefill mode |
| test_batch_decode | Multiple requests in batch |
| test_cross_block_boundary | Tokens spanning multiple blocks |
| test_large_batch | 32 requests stress test |
| test_mixed_query_lens | Different query lengths |
Result: All tests produce identical outputs between original and fixed code.
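The sketch below shows the general shape of such an equivalence check; `build_metadata_original` / `build_metadata_fixed` and the slot-mapping math are simplified stand-ins for the two code paths, not the actual contents of `test_slot_mapping_fix.py`.

```python
import torch

def build_metadata_original(query_lens, seq_lens, block_table, block_size):
    slot_mapping = []
    for i in range(len(query_lens)):
        ql = int(query_lens[i].item())   # per-element read (a GPU sync on CUDA tensors)
        sl = int(seq_lens[i].item())
        for p in range(sl - ql, sl):
            block_id = int(block_table[i, p // block_size].item())
            slot_mapping.append(block_id * block_size + p % block_size)
    return slot_mapping

def build_metadata_fixed(query_lens, seq_lens, block_table, block_size):
    ql_list, sl_list = query_lens.tolist(), seq_lens.tolist()
    bt = block_table.cpu().tolist()      # three batched transfers up front
    slot_mapping = []
    for i, (ql, sl) in enumerate(zip(ql_list, sl_list)):
        for p in range(sl - ql, sl):
            block_id = bt[i][p // block_size]
            slot_mapping.append(block_id * block_size + p % block_size)
    return slot_mapping

def test_decode_single_request():
    query_lens = torch.tensor([1], dtype=torch.int32)
    seq_lens = torch.tensor([10], dtype=torch.int32)
    block_table = torch.tensor([[3, 7]], dtype=torch.int32)
    assert build_metadata_original(query_lens, seq_lens, block_table, block_size=8) == \
        build_metadata_fixed(query_lens, seq_lens, block_table, block_size=8)
```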
🔍 Review Findings
✅ Strengths
- Well-diagnosed performance issue — Clear identification of the `cudaStreamSynchronize` bottleneck
- Minimal, focused change — Only 11 lines modified in production code
- Comprehensive test coverage — Tests verify behavioral equivalence across multiple scenarios
- Good documentation — PR description clearly explains the problem, fix, and benchmarks
- Real performance data — CPU benchmarks provided, GPU improvement estimated
💡 Observations
- Test file location: The test file `test_slot_mapping_fix.py` is in the repo root. Consider moving it to the `tests/` directory for consistency with PR #1613's test placement.
- No pytest markers: Unlike PR #1613, this test file lacks `pytestmark = [pytest.mark.core_model, pytest.mark.cpu]`. Consider adding it for CI integration.
- `.cpu()` call: `block_table_cpu = self._block_table[:num_reqs].cpu().tolist()` adds an explicit `.cpu()` call. This is correct for GPU tensors but adds a no-op overhead if `_block_table` is already on CPU. Consider whether a device check is needed.
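On the last point, a minimal sketch of that device check, assuming a hypothetical helper (the merged PR keeps the unconditional `.cpu()`):

```python
import torch

def block_table_to_host(block_table: torch.Tensor, num_reqs: int) -> list[list[int]]:
    """Convert only the rows in use, skipping .cpu() when the table already lives on the CPU."""
    rows = block_table[:num_reqs]
    if rows.is_cuda:          # device check suggested in the observation above
        rows = rows.cpu()     # single device->host transfer
    return rows.tolist()
```

In practice `.cpu()` on a tensor that already resides on the CPU returns the tensor itself, so the check mostly documents intent rather than saving measurable time.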
📝 Verdict
| Rating | Notes |
|---|---|
| APPROVE ✅ | Solid performance optimization with good test coverage |
Rationale:
- Clear algorithmic improvement eliminating unnecessary GPU→CPU syncs
- Behavioral equivalence verified through comprehensive testing
- Minimal code changes with clear performance benefit
- Well-documented with benchmark data
Suggestion for follow-up: Consider adding pytest markers and moving test file to tests/ for consistency.
Reviewed by: vllm-omni-reviewer MCP tool 🤖
Benchmark results on NVIDIA H100 NVL (96GB); the micro-benchmark code simulates the `build_attn_metadata` call pattern.
Benchmark Results — NVIDIA H100 NVL
Isolated `build_attn_metadata` benchmark
| num_reqs | original (µs) | fixed (µs) | speedup |
|---|---|---|---|
| 1 | 41.6 | 42.2 | 0.99x |
| 4 | 116.6 | 69.6 | 1.67x |
| 8 | 253.3 | 111.1 | 2.28x |
| 32 | 879.2 | 329.5 | 2.67x |
| 128 | 3759.6 | 1239.1 | 3.03x |
The function-level speedup is clear: 2-3x for typical batch sizes (8-128 requests).
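For context, isolated numbers like these can be gathered with a harness along the lines below; `build_attn_metadata_original` / `build_attn_metadata_fixed` are placeholders for the two code paths and the tensor shapes are illustrative, not the exact benchmark used here.

```python
import time
import torch

def bench_us(fn, *args, iters: int = 100) -> float:
    """Mean wall-clock time per call in microseconds, with explicit syncs around the loop."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6

num_reqs, blocks_per_seq, block_size = 8, 16, 16
query_lens = torch.ones(num_reqs, dtype=torch.int32, device="cuda")
seq_lens = torch.randint(1, blocks_per_seq * block_size, (num_reqs,),
                         dtype=torch.int32, device="cuda")
block_table = torch.arange(num_reqs * blocks_per_seq, dtype=torch.int32,
                           device="cuda").reshape(num_reqs, blocks_per_seq)

# print(bench_us(build_attn_metadata_original, query_lens, seq_lens, block_table, block_size))
# print(bench_us(build_attn_metadata_fixed, query_lens, seq_lens, block_table, block_size))
```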
Full E2E TTS benchmark
Served Qwen/Qwen3-TTS-12Hz-0.6B-Base with vllm-omni 0.16.0, max_batch_size=8, concurrency=8, 24 requests per run, 3 runs each:
| Version | throughput (req/s) | p50 latency (ms) |
|---|---|---|
| Original (.item()) | 1.39 – 1.54 | 4558 – 4675 |
| Fixed (.tolist()) | 1.47 – 1.50 | 4313 – 4600 |
E2E difference is within noise — both versions overlap in throughput range.
Why micro shows 3x but E2E is flat
build_attn_metadata runs inside the code predictor sub-model, not the main AR decoder hot path. Rough estimate: the optimization saves ~142µs per call at batch_size=8, × ~200 AR decoding steps = ~28ms saved per request. Against a ~4500ms total E2E latency, that's <1% — completely masked by AR forward passes, code2wav decoding, and generation length variance.
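A back-of-envelope check of that estimate, using the figures quoted above (illustrative only):

```python
saved_per_call_us = 253.3 - 111.1     # ~142 µs saved per build_attn_metadata call at batch_size=8
steps_per_request = 200               # rough number of AR decoding steps per request
e2e_latency_ms = 4500                 # approximate p50 end-to-end latency
saved_ms = saved_per_call_us * steps_per_request / 1000
print(f"~{saved_ms:.0f} ms saved per request, {saved_ms / e2e_latency_ms:.1%} of E2E latency")
# -> ~28 ms saved per request, 0.6% of E2E latency
```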
Summary
The optimization is correct and meaningful for the function it targets (3x speedup), and follows the well-established pattern of batching GPU→CPU transfers. The E2E impact is currently small because TTS inference is dominated by autoregressive decoding, but this will matter more as other bottlenecks are optimized or batch sizes increase.
@@ -0,0 +1,213 @@
"""Unit test for the .item()-in-inner-loop fix in _LocalPredictorKVCache.build_attn_metadata.
Nit: should this live under tests/ instead of the repo root?
Good catch, moved to tests/. Thanks!
Please fix the DCO and pre-commit checks.
…TS code predictor

In _LocalPredictorKVCache.build_attn_metadata(), per-element .item() calls inside nested loops force a GPU→CPU synchronization on every iteration. This is called on every TTS decode step for Qwen3-TTS.

Replace per-element .item() on query_lens_i32, seq_lens_i32, and self._block_table with batch .tolist() / .cpu().tolist() before loops, then use plain Python list indexing.

Before: num_reqs * 4 + num_tokens GPU sync points per call
After: 3 batch transfers (O(1) syncs) per call

CPU benchmark: 1.3-2.8x speedup; GPU expected 10-100x improvement.

Signed-off-by: dubin555 <dubin555@gmail.com>
Signed-off-by: dubin555 <dubin555@gmail.com>
8d0031a to 549f9b7
linyueqian
left a comment
LGTM. Can you move the test file to tests/models/qwen3_tts/ instead of top-level tests/?
Yes, this is needed.
Signed-off-by: linyueqian <linyueqian@outlook.com>
…TS code predictor (vllm-project#1614)

Signed-off-by: dubin555 <dubin555@gmail.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Here's the PR body:
Purpose
In `_LocalPredictorKVCache.build_attn_metadata()`, per-element `.item()` calls inside nested loops force a GPU→CPU synchronization on every iteration. This method is called on every TTS decode step for every Qwen3-TTS request.
Three locations are affected (`qwen3_tts_code_predictor_vllm.py:113-147`):
- `int(query_lens_i32[i].item())` and `int(seq_lens_i32[i].item())` — one GPU sync per request per tensor
- `int(self._block_table[i, block_idx].item())` — one GPU sync per token position per request (inner loop)
Since `_block_table` is allocated on GPU (line 84), each `.item()` triggers `cudaStreamSynchronize`, which typically costs 5-20μs per call. With `num_reqs × max_seq_len` iterations, this creates hundreds of unnecessary sync points per decode step.
The fix replaces per-element `.item()` calls with batch `.tolist()` / `.cpu().tolist()` conversions before the loops, then uses plain Python list indexing inside the loops.
Before: `num_reqs * 4 + num_tokens` GPU sync points per call
After: 3 batch transfers (O(1) syncs) per call
Test Plan
Unit tests covering 6 scenarios to verify correctness of the batch conversion approach against the original per-element `.item()` implementation.
Test cases:
Test Result
Correctness
All 6 tests produce identical outputs between original and fixed code:
Performance (CPU benchmark)
Even on CPU (where `.item()` overhead is smaller than on GPU), batch conversion shows a clear speedup. On GPU, the improvement is expected to be significantly larger due to eliminated `cudaStreamSynchronize` overhead per `.item()` call.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.