
[Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips#2012

Merged
hsliuustc0106 merged 4 commits into vllm-project:main from LJH-LBJ:refactor/qwen3-omni-code-predictor-v2
Mar 20, 2026
Conversation

@LJH-LBJ (Contributor) commented Mar 19, 2026


Purpose

Resolves: #1830
This PR fixes two critical accuracy bugs in the Qwen3-Omni code predictor that caused audio quality degradation, and re-applies the re-prefill + SDPA optimization from #1758 with correctness guarantees.

Performance analysis and detailed design remain the same as described in #1758; this PR does not regress performance.

Root Cause 1: _proj_buf persistent buffer cross-request pollution

PR #1758 introduced a persistent self._proj_buf tensor (allocated once in __init__) to accumulate embeddings across the autoregressive loop. When multiple concurrent requests share the same model instance, request A's embedding data written into _proj_buf gets silently overwritten by request B, causing subsequent AR steps to read corrupted history. This is the primary cause of intermittent accuracy failures under concurrent load.

Fix: Allocate proj_buf locally in every forward() call:

# Before (buggy): persistent buffer shared across requests
self._proj_buf = torch.zeros(max_batch, max_seq, hidden_size, ...)  # in __init__

# After (fixed): per-call allocation, no aliasing
def forward(self, ...):
    proj_buf = torch.zeros(bsz, max_seq, self._hidden_size, dtype=dtype, device=device)
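The hazard can be reproduced with a minimal, framework-free sketch. `PredictorShared` here is an illustrative stand-in, not the actual vLLM class: two interleaved requests on one instance clobber each other's history in the shared buffer.

```python
class PredictorShared:
    """Buggy pattern: one history buffer shared by every request
    (illustrative stand-in, not the actual vLLM code predictor)."""

    def __init__(self, max_seq=4):
        self._proj_buf = [0.0] * max_seq  # allocated once in __init__

    def step(self, pos, value):
        # Each AR step writes its embedding slot and reads the full history.
        self._proj_buf[pos] = value
        return sum(self._proj_buf[: pos + 1])


m = PredictorShared()
m.step(0, 1.0)           # request A writes its step 0
m.step(0, 5.0)           # request B writes step 0, overwriting A's data
out_a = m.step(1, 1.0)   # request A's step 1 now reads B's history
print(out_a)             # 6.0 instead of the 2.0 A would see in isolation
```

A per-call allocation gives each `forward()` its own buffer, so the interleaving above can no longer alias.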

Root Cause 2: summed_embeddings 3D + text_step 2D broadcast error

In talker_mtp(), summed_embeddings has shape [B, S, H] (3D) while text_step has shape [B*S, H] (2D). When added directly, PyTorch broadcasts silently; for B > 1 the result is completely wrong ([B, 1, H] + [B, H] → [B, B, H]).

Fix: Flatten summed_embeddings to 2D before addition:

# Before (buggy): 3D + 2D silent broadcast
inputs_embeds = summed_embeddings.clone()
inputs_embeds = (inputs_embeds + text_step).reshape(...)

# After (fixed): uniform 2D addition
inputs_embeds = summed_embeddings.reshape(-1, self.talker_config.text_config.hidden_size)
inputs_embeds = (inputs_embeds + text_step).reshape(...)
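NumPy follows the same broadcasting rules as PyTorch, so the shape bug can be verified without a GPU. A sketch with made-up sizes (B=4, H=8, S=1):

```python
import numpy as np

B, H = 4, 8
summed = np.zeros((B, 1, H))  # summed_embeddings, 3D with S = 1
text_step = np.zeros((B, H))  # text_step, 2D

# Buggy path: silent broadcast [B, 1, H] + [B, H] -> [B, B, H]
bad = summed + text_step
print(bad.shape)              # (4, 4, 8)

# Fixed path: flatten to 2D first, so the addition is element-wise
good = summed.reshape(-1, H) + text_step
print(good.shape)             # (4, 8)
```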

Test Plan

Accuracy test (run twice)

pytest -s -v tests/e2e/online_serving/test_qwen3_omni_expansion.py -m "advanced_model" --run-level "advanced_model"

benchmark

vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 46354 --stage-configs-path ./vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml

Test Result


benchmark

| Metric | Before | After |
| --- | --- | --- |
| TTFP (ms) | 13000.91 | 8851.15 |
| E2E (ms) | 86184.13 | 83231.48 |
| RTF | 0.65 | 0.58 |
| TTFT (ms) | 6877.48 | 5329.86 |
Before:
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  131.49
Request throughput (req/s):              0.08
Peak concurrent requests:                10.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          86184.13
Median E2EL (ms):                        95894.25
P99 E2EL (ms):                           130903.62
================== Text Result ===================
Total input tokens:                      25000
Total generated tokens:                  5199
Output token throughput (tok/s):         39.54
Peak output token throughput (tok/s):    306.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          229.67
---------------Time to First Token----------------
Mean TTFT (ms):                          6877.48
Median TTFT (ms):                        7451.55
P99 TTFT (ms):                           7454.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          101.27
Median TPOT (ms):                        81.24
P99 TPOT (ms):                           240.85
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.44
Median ITL (ms):                         0.01
P99 ITL (ms):                            1925.43
================== Audio Result ==================
Total audio duration generated(s):       1669.25
Total audio frames generated:            40061925
Audio throughput(audio duration/s):      12.69
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    13000.91
Median AUDIO_TTFP (ms):                  13079.55
P99 AUDIO_TTFP (ms):                     13637.70
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.65
Median AUDIO_RTF:                        0.57
P99 AUDIO_RTF:                           1.12
==================================================


After:
============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  124.16
Request throughput (req/s):              0.08
Peak concurrent requests:                10.00
----------------End-to-end Latency----------------
Mean E2EL (ms):                          83231.48
Median E2EL (ms):                        85536.25
P99 E2EL (ms):                           122818.30
================== Text Result ===================
Total input tokens:                      25000
Total generated tokens:                  5172
Output token throughput (tok/s):         41.66
Peak output token throughput (tok/s):    304.00
Peak concurrent requests:                10.00
Total Token throughput (tok/s):          243.02
---------------Time to First Token----------------
Mean TTFT (ms):                          5329.86
Median TTFT (ms):                        5882.53
P99 TTFT (ms):                           5884.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.86
Median TPOT (ms):                        92.06
P99 TPOT (ms):                           198.48
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.24
Median ITL (ms):                         0.01
P99 ITL (ms):                            2015.94
================== Audio Result ==================
Total audio duration generated(s):       1618.88
Total audio frames generated:            38853060
Audio throughput(audio duration/s):      13.04
---------------Time to First Packet---------------
Mean AUDIO_TTFP (ms):                    8851.15
Median AUDIO_TTFP (ms):                  8848.69
P99 AUDIO_TTFP (ms):                     9442.12
-----------------Real Time Factor-----------------
Mean AUDIO_RTF:                          0.58
Median AUDIO_RTF:                        0.58
P99 AUDIO_RTF:                           0.83
==================================================



LJH-LBJ added 2 commits March 19, 2026 19:21
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@LJH-LBJ LJH-LBJ requested a review from hsliuustc0106 as a code owner March 19, 2026 11:41

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7f5cdb4dc


Comment on lines +433 to +434
probs = F.softmax(logits, dim=-1)
code = torch.multinomial(probs, num_samples=1) # [bsz, 1]

P1: Avoid full CUDA graphs around the new inline multinomial path

When full cudagraphs are enabled for the Qwen3-Omni talker, OmniGPUModelRunner.load_model() still wraps talker_mtp whenever self.model.talker exists (vllm_omni/worker/gpu_model_runner.py:83-88). This refactor replaced the old custom sampling op with a plain torch.multinomial call inside that path, even though the runner comment explicitly treats torch.multinomial inside the code predictor as graph-unsafe. In that configuration, capture/replay can fail or fall back to eager on every decode step, which breaks the main optimization this commit is re-introducing.


Comment on lines +162 to +166
# Keys that should stay on GPU in model_intermediate_buffer to avoid CPU↔GPU round-trips
self.gpu_resident_buffer_keys: set[str] = {
"last_talker_hidden",
"trailing_text_hidden",
"tts_pad_embed_projected",
P2: Keep code_predictor_codes in the GPU-resident buffer set

_talker_mtp_forward() writes the decode result under the default talker_mtp_output_key (code_predictor_codes), but _update_intermediate_buffer() only skips the .to("cpu") round-trip for keys listed in gpu_resident_buffer_keys (vllm_omni/worker/gpu_model_runner.py:1349-1354). Because this set omits code_predictor_codes, every decode step still synchronizes on a device-to-host copy of the MTP output, so the advertised hot-path CPU round-trip elimination never actually applies to Qwen3-Omni's codec codes.
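The skip logic this comment describes can be sketched in plain Python. `FakeTensor` and `update_intermediate_buffer` below are illustrative stand-ins, not the real gpu_model_runner API:

```python
class FakeTensor:
    """Stand-in for a device tensor; .to() models the D2H copy."""

    def __init__(self, device="cuda"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def update_intermediate_buffer(buffer, outputs, gpu_resident_keys):
    # Only keys in gpu_resident_keys skip the synchronizing
    # device-to-host copy; everything else round-trips through CPU.
    for key, t in outputs.items():
        buffer[key] = t if key in gpu_resident_keys else t.to("cpu")
    return buffer


buf = update_intermediate_buffer(
    {},
    {"last_talker_hidden": FakeTensor(), "code_predictor_codes": FakeTensor()},
    gpu_resident_keys={"last_talker_hidden"},
)
print(buf["last_talker_hidden"].device)    # cuda  (copy skipped)
print(buf["code_predictor_codes"].device)  # cpu   (still round-trips)
```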


@LJH-LBJ (Contributor, Author) replied:

code_predictor_codes is never read back from gpu_resident_buffer_keys, so there is no need to add it to that set.

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
@hsliuustc0106 (Collaborator) commented:

any performance w/o the feature here?

@LJH-LBJ (Contributor, Author) commented Mar 19, 2026

> any performance w/o the feature here?

I didn’t make any changes to the main feature branch. The bug fix only addressed a tensor dimension mismatch and added a torch.zeros in one place; nothing else was modified. I don’t think it has any impact on performance.

@amy-why-3459 (Contributor) commented:

Please provide a performance data comparison before and after this PR.

@amy-why-3459 (Contributor) commented:

Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

@LJH-LBJ (Contributor, Author) commented Mar 19, 2026

> Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

Multiple concurrent scenarios already exist in test_qwen3_omni_expansion.py; check the request_num in the test case.

@hsliuustc0106 (Collaborator) commented:

> Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.
>
> multiple concurrent scenarios already exists in test_qwen3_omni_expansion.py you can check the request_num in the test case

I suggest adding the perf numbers w/o this PR, which are necessary to show the gain or regression.

@LJH-LBJ (Contributor, Author) commented Mar 19, 2026

> Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.
>
> multiple concurrent scenarios already exists in test_qwen3_omni_expansion.py you can check the request_num in the test case
>
> I suggest to add the perf w/o this PR which is necessary to show the gain or regression.

OK, I will test it

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Mar 19, 2026
@amy-why-3459 (Contributor) commented:

@Sy0307 PTAL

@LJH-LBJ (Contributor, Author) commented Mar 20, 2026

> any performance w/o the feature here?

@hsliuustc0106 @amy-why-3459 Already posted; I tested with PR #1982.

@Sy0307 (Contributor) commented Mar 20, 2026

LGTM. Related issue is #2019. It will wait for this PR being merged.

k = k.view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
v = v.view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

attn_out = F.scaled_dot_product_attention(

@LJH-LBJ (Contributor, Author) replied:

apply_sdpa from vllm/v1/attention/ops/vit_attn_wrappers.py cannot be used here because it omits is_causal=True.

The code predictor runs autoregressive re-prefill: each AR step forwards a growing sequence (length 2 → num_code_groups + 1). Position i must only attend to positions ≤ i; otherwise the yet-to-be-predicted slots (initialized to zero in proj_buf) leak into the attention output and corrupt the logits. is_causal=True enforces the lower-triangular mask that prevents this.

apply_sdpa is designed for ViT-style encoders where every token attends to every other token (bidirectional). Dropping the causal mask here would break codec generation quality.
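A minimal NumPy sketch of why the mask matters during re-prefill (toy single-head attention on a 3-token sequence; position 2 stands in for a zero-initialized future slot in proj_buf):

```python
import numpy as np

def sdpa(q, k, v, causal):
    # Tiny single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Lower-triangular mask: position i attends only to j <= i.
        mask = np.triu(np.ones_like(scores, bool), 1)
        scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

x = np.array([[1.0, 0.0],   # real token at position 0
              [0.0, 1.0],   # real token at position 1
              [0.0, 0.0]])  # not-yet-predicted zero slot
causal_out = sdpa(x, x, x, causal=True)
full_out = sdpa(x, x, x, causal=False)
print(np.allclose(causal_out[0], x[0]))  # True: row 0 attends only to itself
print(np.allclose(full_out[0], x[0]))    # False: the zero slot leaks in
```

Without the mask, every row's output is a softmax-weighted mix that includes the zero slot, which is exactly the history corruption described above.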

@hsliuustc0106 hsliuustc0106 merged commit 6901ba4 into vllm-project:main Mar 20, 2026
8 checks passed
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 22, 2026

Labels

ready label to trigger buildkite CI

Successfully merging this pull request may close these issues.

[CI Failure]: Qwen3-omni, occasionally, the generated audio and text are inconsistent.

5 participants