perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor#1614

Merged
hsliuustc0106 merged 4 commits into vllm-project:main from dubin555:oss-scout/verify-fix-item-calls-in-inner-loop
Mar 6, 2026

Conversation

@dubin555
Contributor

@dubin555 dubin555 commented Mar 2, 2026

Purpose

In _LocalPredictorKVCache.build_attn_metadata(), per-element .item() calls inside nested loops force a GPU→CPU synchronization on every iteration. This method is called on every TTS decode step for every Qwen3-TTS request.

Three locations are affected (qwen3_tts_code_predictor_vllm.py:113-147):

  1. Lines 126-127: int(query_lens_i32[i].item()) and int(seq_lens_i32[i].item()) — one GPU sync per request per tensor
  2. Lines 137-138: Same pattern, redundantly repeated for the slot_mapping loop
  3. Line 143: int(self._block_table[i, block_idx].item()) — one GPU sync per token position per request (inner loop)

Since _block_table is allocated on GPU (line 84), each .item() triggers cudaStreamSynchronize, which typically costs 5-20μs per call. With num_reqs × max_seq_len iterations, this creates hundreds of unnecessary sync points per decode step.

The fix replaces per-element .item() calls with batch .tolist() / .cpu().tolist() conversions before the loops, then uses plain Python list indexing inside the loops.

Before: num_reqs * 4 + num_tokens GPU sync points per call
After: 3 batch transfers (O(1) syncs) per call
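
A minimal sketch of the change, using the tensor names from the description above (the slot-mapping and block-index logic are elided, so this is illustrative rather than the exact diff):

```python
import torch

# Before (illustrative): each indexed .item() on a CUDA tensor forces a
# GPU→CPU synchronization, so the cost grows with num_reqs (and num_tokens).
def lens_before(query_lens_i32: torch.Tensor, seq_lens_i32: torch.Tensor, num_reqs: int):
    out = []
    for i in range(num_reqs):
        ql = int(query_lens_i32[i].item())  # one sync per request
        sl = int(seq_lens_i32[i].item())    # one sync per request
        out.append((ql, sl))
    return out

# After (illustrative): one batched transfer per tensor, then plain Python indexing.
def lens_after(query_lens_i32: torch.Tensor, seq_lens_i32: torch.Tensor, num_reqs: int):
    query_lens_list = query_lens_i32[:num_reqs].tolist()
    seq_lens_list = seq_lens_i32[:num_reqs].tolist()
    return list(zip(query_lens_list, seq_lens_list))
```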

Test Plan

Unit tests cover 6 scenarios and verify that the batch-conversion approach produces the same results as the original per-element .item() implementation (a minimal sketch of the equivalence check appears after the test case list):

python test_slot_mapping_fix.py

Test cases:

  • Single-request decode
  • Single-request prefill
  • Batch decode (multiple requests)
  • Cross-block-boundary sequences
  • Large batch (64 requests)
  • Mixed query lengths
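
A minimal sketch of the equivalence check the tests rely on (the real test file builds the full attention metadata; the function names, block size of 16, and tensor shapes here are illustrative assumptions, not the actual test code):

```python
import torch

def blocks_original(block_table: torch.Tensor, seq_lens: torch.Tensor, block_size: int = 16) -> list:
    # Per-element .item(): one host sync per lookup when the tensors live on GPU.
    out = []
    for i in range(block_table.shape[0]):
        n_blocks = (int(seq_lens[i].item()) + block_size - 1) // block_size
        out.append([int(block_table[i, j].item()) for j in range(n_blocks)])
    return out

def blocks_fixed(block_table: torch.Tensor, seq_lens: torch.Tensor, block_size: int = 16) -> list:
    # Batched conversion up front, then plain Python list indexing.
    bt = block_table.cpu().tolist()
    sl = seq_lens.tolist()
    return [bt[i][: (sl[i] + block_size - 1) // block_size] for i in range(len(bt))]

def test_batch_decode() -> None:
    torch.manual_seed(0)
    block_table = torch.randint(0, 100, (4, 8), dtype=torch.int32)
    seq_lens = torch.tensor([5, 17, 33, 128], dtype=torch.int32)
    assert blocks_original(block_table, seq_lens) == blocks_fixed(block_table, seq_lens)

if __name__ == "__main__":
    test_batch_decode()
    print("PASS: test_batch_decode")
```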

Test Result

Correctness

All 6 tests produce identical outputs between original and fixed code:

PASS: test_decode_single_request
PASS: test_prefill_single_request
PASS: test_batch_decode
PASS: test_cross_block_boundary
PASS: test_large_batch
PASS: test_mixed_query_lens

All tests PASSED — original and fixed code produce identical results.

Performance (CPU benchmark)

Even on CPU (where .item() overhead is smaller than on GPU), batch conversion shows clear speedup:

| num_reqs | avg_seq_len | Original | Fixed | Speedup |
|---:|---:|---:|---:|---:|
| 1 | 8 | 0.021 ms | 0.016 ms | 1.33x |
| 4 | 8 | 0.055 ms | 0.027 ms | 2.02x |
| 8 | 16 | 0.091 ms | 0.040 ms | 2.28x |
| 16 | 16 | 0.175 ms | 0.068 ms | 2.56x |
| 32 | 16 | 0.341 ms | 0.124 ms | 2.74x |
| 64 | 16 | 0.673 ms | 0.242 ms | 2.79x |

On GPU, the improvement is expected to be significantly larger due to eliminated cudaStreamSynchronize overhead per .item() call.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

@dubin555 dubin555 requested a review from hsliuustc0106 as a code owner March 2, 2026 13:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20d80bb3f4


```python
# calls (each .item() on a GPU tensor forces a GPU→CPU sync).
query_lens_list = query_lens_i32[:num_reqs].tolist()
seq_lens_list = seq_lens_i32[:num_reqs].tolist()
block_table_cpu = self._block_table[:num_reqs].cpu().tolist()
```


P1 Badge Avoid full block-table host copy in hot decode path

In _LocalPredictorKVCache.build_attn_metadata, this line materializes the entire num_reqs × blocks_per_seq block table as Python lists on every call, even though the inner loop only needs one block_idx per generated token; during decode (decode_logits sets query_lens to ones), that turns per-step work from roughly O(num_reqs) into O(num_reqs * blocks_per_seq), where blocks_per_seq is derived from configured max_seq_len. For long-context configs this extra device→host copy plus Python-int conversion can dominate token-step latency and regress throughput versus the previous indexed access.
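
One possible shape of the narrower copy the review suggests, sketched for the decode case where each request appends a single token (names mirror the snippet above, but this is an illustrative assumption, not part of the PR):

```python
import torch

def gather_decode_block_ids(block_table: torch.Tensor, seq_lens_i32: torch.Tensor,
                            num_reqs: int, block_size: int) -> list:
    # In decode each request writes exactly one new token, so only the block that
    # holds that token is needed: gather those ids on-device, then do a single
    # batched device→host transfer instead of copying the whole block table.
    positions = seq_lens_i32[:num_reqs].to(block_table.device) - 1   # position of the new token
    block_idx = (positions // block_size).long()                     # block column per request
    rows = torch.arange(num_reqs, device=block_table.device)
    return block_table[rows, block_idx].tolist()
```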


@dubin555
Contributor Author

dubin555 commented Mar 2, 2026

Thanks for the suggestion. The block-table copy optimization is a good point but falls outside the scope of this PR (which focuses specifically on the per-element .item() → batch .tolist() change). Happy to look into the block-table copy in a follow-up.

@hsliuustc0106
Collaborator

Is there any e2e test result compared with the previous baseline?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Code Review Report

📋 Summary

| Item | Details |
|---|---|
| PR | perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor |
| Author | @dubin555 |
| Files Changed | qwen3_tts_code_predictor_vllm.py, test_slot_mapping_fix.py |
| Changes | +224 / -5 |

🎯 Purpose

Eliminate expensive GPU→CPU synchronizations in _LocalPredictorKVCache.build_attn_metadata(), which is called on every TTS decode step for Qwen3-TTS requests.

Problem:

  • Per-element .item() calls inside nested loops force cudaStreamSynchronize on every iteration
  • 3 locations affected (qwen3_tts_code_predictor_vllm.py:113-147):
    1. int(query_lens_i32[i].item()) — one sync per request
    2. int(seq_lens_i32[i].item()) — one sync per request
    3. int(self._block_table[i, block_idx].item()) — one sync per token position per request (inner loop!)

Impact: Each .item() costs 5-20μs. With num_reqs × max_seq_len iterations → hundreds of unnecessary sync points per decode step.

Fix: Replace per-element .item() with batch .tolist() / .cpu().tolist() before loops, then use plain Python list indexing inside.


📊 Performance Improvement

| Metric | Before | After |
|---|---|---|
| GPU syncs per call | num_reqs * 4 + num_tokens | 3 batch transfers (O(1)) |

CPU Benchmark:

| num_reqs | avg_seq_len | Original | Fixed | Speedup |
|---:|---:|---:|---:|---:|
| 1 | 8 | 0.021 ms | 0.016 ms | 1.33x |
| 4 | 8 | 0.055 ms | 0.027 ms | 2.02x |
| 8 | 16 | 0.091 ms | 0.040 ms | 2.28x |
| 16 | 16 | 0.175 ms | 0.068 ms | 2.56x |
| 32 | 16 | 0.341 ms | 0.124 ms | 2.74x |
| 64 | 16 | 0.673 ms | 0.242 ms | 2.79x |

GPU expected: 10-100x improvement due to eliminated cudaStreamSynchronize overhead.


🔍 Code Changes

```python
# Before (per-element .item() = GPU sync each time):
for i in range(num_reqs):
    ql = int(query_lens_i32[i].item())  # GPU sync!
    sl = int(seq_lens_i32[i].item())    # GPU sync!
    for p in range(start, sl):
        block_id = int(self._block_table[i, block_idx].item())  # GPU sync!

# After (batch conversion, no per-element syncs):
query_lens_list = query_lens_i32[:num_reqs].tolist()
seq_lens_list = seq_lens_i32[:num_reqs].tolist()
block_table_cpu = self._block_table[:num_reqs].cpu().tolist()

for i in range(num_reqs):
    ql = query_lens_list[i]  # No sync
    sl = seq_lens_list[i]    # No sync
    for p in range(start, sl):
        block_id = block_table_cpu[i][block_idx]  # No sync
```

✅ Test Coverage

6 unit tests verifying correctness against original implementation:

| Test | Coverage |
|---|---|
| test_decode_single_request | Single request, decode mode (query_len=1) |
| test_prefill_single_request | Single request, prefill mode |
| test_batch_decode | Multiple requests in batch |
| test_cross_block_boundary | Tokens spanning multiple blocks |
| test_large_batch | 32 requests stress test |
| test_mixed_query_lens | Different query lengths |

Result: All tests produce identical outputs between original and fixed code.


🔍 Review Findings

✅ Strengths

  1. Well-diagnosed performance issue — Clear identification of cudaStreamSynchronize bottleneck
  2. Minimal, focused change — Only 11 lines modified in production code
  3. Comprehensive test coverage — Tests verify behavioral equivalence across multiple scenarios
  4. Good documentation — PR description clearly explains the problem, fix, and benchmarks
  5. Real performance data — CPU benchmarks provided, GPU improvement estimated

💡 Observations

  1. Test file location: The test file test_slot_mapping_fix.py is in the repo root. Consider moving to tests/ directory for consistency with PR #1613's test placement.

  2. No pytest markers: Unlike PR #1613, this test file lacks pytestmark = [pytest.mark.core_model, pytest.mark.cpu]. Consider adding for CI integration.

  3. .cpu() call: The block_table_cpu = self._block_table[:num_reqs].cpu().tolist() adds an explicit .cpu() call. This is correct for GPU tensors but adds a small no-op overhead if _block_table is already on CPU. Consider whether a device check is warranted (a minimal sketch follows below).
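
A possible guard for that observation, purely as an illustration (whether the branch is worth having depends on where _block_table actually lives):

```python
import torch

def block_table_rows_as_lists(block_table: torch.Tensor, num_reqs: int) -> list:
    bt = block_table[:num_reqs]
    # Only call .cpu() when the table actually lives on an accelerator; for a
    # CPU tensor .cpu() returns the same tensor, so the guard mainly documents intent.
    return (bt.cpu() if bt.is_cuda else bt).tolist()
```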


📝 Verdict

| Rating | Notes |
|---|---|
| APPROVE | Solid performance optimization with good test coverage |

Rationale:

  • Clear algorithmic improvement eliminating unnecessary GPU→CPU syncs
  • Behavioral equivalence verified through comprehensive testing
  • Minimal code changes with clear performance benefit
  • Well-documented with benchmark data

Suggestion for follow-up: Consider adding pytest markers and moving test file to tests/ for consistency.


Reviewed by: vllm-omni-reviewer MCP tool 🤖

@dubin555
Contributor Author

dubin555 commented Mar 3, 2026

Benchmark results on NVIDIA H100 NVL (96GB):

| num_reqs | original (ms) | fixed (ms) | speedup |
|---:|---:|---:|---:|
| 8 | 1.651 | 0.052 | 31.47x |
| 16 | 2.258 | 0.066 | 34.27x |
| 32 | 5.445 | 0.099 | 54.97x |
| 64 | 12.470 | 0.171 | 72.77x |
| 128 | 21.271 | 0.287 | 74.20x |
| 256 | 42.857 | 0.571 | 75.10x |

Each .item() forces a GPU→CPU synchronization. With 256 concurrent requests, the original code performs thousands of syncs (42ms total), while .tolist() does a single bulk transfer (0.57ms). The speedup scales linearly with request count since the number of eliminated sync barriers grows proportionally.

Benchmark code simulates the _LocalPredictorKVCache.build_attn_metadata hot path with realistic tensor shapes on GPU.
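
A hedged sketch of how such a micro-benchmark can be structured (the timing loop only; the actual benchmark script is not in this PR, so the function names, shapes, and iteration count here are assumptions):

```python
import time
import torch

def bench(fn, *args, iters: int = 50) -> float:
    # Median wall-clock time in ms, with explicit synchronization so queued
    # GPU work is included in each measurement.
    times = []
    for _ in range(iters):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]

def original(seq_lens: torch.Tensor) -> list:
    return [int(seq_lens[i].item()) for i in range(seq_lens.shape[0])]  # one sync per element

def fixed(seq_lens: torch.Tensor) -> list:
    return seq_lens.tolist()  # one batched transfer

if torch.cuda.is_available():
    seq_lens = torch.randint(1, 4096, (256,), dtype=torch.int32, device="cuda")
    print("original:", bench(original, seq_lens), "ms  fixed:", bench(fixed, seq_lens), "ms")
```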

@dubin555
Contributor Author

dubin555 commented Mar 3, 2026

Benchmark Results — NVIDIA H100 NVL

Isolated build_attn_metadata benchmark

Extracted the exact method from qwen3_tts_code_predictor_vllm.py and benchmarked original vs fixed, 50 iterations median:

| num_reqs | original (µs) | fixed (µs) | speedup |
|---:|---:|---:|---:|
| 1 | 41.6 | 42.2 | 0.99x |
| 4 | 116.6 | 69.6 | 1.67x |
| 8 | 253.3 | 111.1 | 2.28x |
| 32 | 879.2 | 329.5 | 2.67x |
| 128 | 3759.6 | 1239.1 | 3.03x |

The function-level speedup is clear: 2-3x for typical batch sizes (8-128 requests).

Full E2E TTS benchmark

Served Qwen/Qwen3-TTS-12Hz-0.6B-Base with vllm-omni 0.16.0, max_batch_size=8, concurrency=8, 24 requests per run, 3 runs each:

| Version | throughput (req/s) | p50 latency (ms) |
|---|---|---|
| Original (.item()) | 1.39 – 1.54 | 4558 – 4675 |
| Fixed (.tolist()) | 1.47 – 1.50 | 4313 – 4600 |

E2E difference is within noise — both versions overlap in throughput range.

Why micro shows 3x but E2E is flat

build_attn_metadata runs inside the code predictor sub-model, not the main AR decoder hot path. Rough estimate: the optimization saves ~142µs per call at batch_size=8, × ~200 AR decoding steps = ~28ms saved per request. Against a ~4500ms total E2E latency, that's <1% — completely masked by AR forward passes, code2wav decoding, and generation length variance.

Summary

The optimization is correct and meaningful for the function it targets (3x speedup), and follows the well-established pattern of batching GPU→CPU transfers. The E2E impact is currently small because TTS inference is dominated by autoregressive decoding, but this will matter more as other bottlenecks are optimized or batch sizes increase.

Collaborator

@lishunyang12 lishunyang12 left a comment


see inline

Comment thread on tests/test_slot_mapping_fix.py (outdated), at the file docstring:
"""Unit test for the .item()-in-inner-loop fix in _LocalPredictorKVCache.build_attn_metadata.
Collaborator


Nit: should this live under tests/ instead of the repo root?

Contributor Author


Good catch, moved to tests/. Thanks!

@lishunyang12
Collaborator

Please resolve the DCO sign-off and pre-commit checks.

dubin555 added 2 commits March 5, 2026 02:48
…TS code predictor

In _LocalPredictorKVCache.build_attn_metadata(), per-element .item()
calls inside nested loops force a GPU→CPU synchronization on every
iteration. This is called on every TTS decode step for Qwen3-TTS.

Replace per-element .item() on query_lens_i32, seq_lens_i32, and
self._block_table with batch .tolist() / .cpu().tolist() before
loops, then use plain Python list indexing.

Before: num_reqs * 4 + num_tokens GPU sync points per call
After: 3 batch transfers (O(1) syncs) per call

CPU benchmark: 1.3-2.8x speedup; GPU expected 10-100x improvement.

Signed-off-by: dubin555 <dubin555@gmail.com>
Signed-off-by: dubin555 <dubin555@gmail.com>
@dubin555 dubin555 force-pushed the oss-scout/verify-fix-item-calls-in-inner-loop branch from 8d0031a to 549f9b7 on March 5, 2026 02:49
Collaborator

@linyueqian linyueqian left a comment


LGTM. Can you move the test file to tests/models/qwen3_tts/ instead of top-level tests/?

@hsliuustc0106 hsliuustc0106 added the "ready label to trigger buildkite CI" label Mar 6, 2026
@hsliuustc0106
Collaborator

> LGTM. Can you move the test file to tests/models/qwen3_tts/ instead of top-level tests/?

Yes, this is needed.

@hsliuustc0106 hsliuustc0106 merged commit fd51841 into vllm-project:main Mar 6, 2026
5 of 7 checks passed
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 7, 2026
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…TS code predictor (vllm-project#1614)

Signed-off-by: dubin555 <dubin555@gmail.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: lishunyang <lishunyang12@163.com>

Labels

ready label to trigger buildkite CI


4 participants