[Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by LJH-LBJ · Pull Request #2012 · vllm-project/vllm-omni

LJH-LBJ · 2026-03-19T11:41:47Z

Purpose

Resolves: #1830
This PR fixes two critical accuracy bugs in the Qwen3-Omni code predictor that caused audio quality degradation, and re-applies the re-prefill + SDPA optimization from #1758 with correctness guarantees.

Performance analysis and detailed design remain the same as described in #1758 — this PR does not regress performance.

Root Cause 1: `_proj_buf` persistent buffer cross-request pollution

PR #1758 introduced a persistent self._proj_buf tensor (allocated once in __init__) to accumulate embeddings across the autoregressive loop. When multiple concurrent requests share the same model instance, request A's embedding data written into _proj_buf gets silently overwritten by request B, causing subsequent AR steps to read corrupted history. This is the primary cause of intermittent accuracy failures under concurrent load.

Fix: Allocate proj_buf locally in every forward() call:

# Before (buggy): persistent buffer shared across requests
self._proj_buf = torch.zeros(max_batch, max_seq, hidden_size, ...)  # in __init__

# After (fixed): per-call allocation, no aliasing
def forward(self, ...):
    proj_buf = torch.zeros(bsz, max_seq, self._hidden_size, dtype=dtype, device=device)
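
The aliasing hazard described above can be reproduced with a toy model in plain Python (no PyTorch needed). This is only a sketch: `ToyPredictor`, `run_interleaved`, and the step-by-step interleaving of two requests are hypothetical stand-ins chosen for illustration, not the real code predictor or scheduler.

```python
class ToyPredictor:
    """Hypothetical stand-in for the code predictor's history buffer handling."""

    def __init__(self, num_steps, persistent):
        self.num_steps = num_steps
        self.persistent = persistent
        # Buggy variant: one buffer allocated once and reused by every call.
        self._buf = [None] * num_steps

    def forward(self, tag, history_log):
        # Fixed variant: allocate a fresh buffer inside each call.
        buf = self._buf if self.persistent else [None] * self.num_steps
        for step in range(self.num_steps):
            buf[step] = f"{tag}{step}"   # write this request's embedding
            yield                         # toy interleaving point (assumption)
            # History this request reads back before its next AR step.
            history_log.append(list(buf[: step + 1]))


def run_interleaved(predictor, tags):
    """Advance all requests one step at a time, round-robin."""
    logs = {tag: [] for tag in tags}
    gens = [predictor.forward(tag, logs[tag]) for tag in tags]
    for _ in range(predictor.num_steps + 1):
        for gen in gens:
            next(gen, None)
    return logs


shared = run_interleaved(ToyPredictor(num_steps=2, persistent=True), ["A", "B"])
local = run_interleaved(ToyPredictor(num_steps=2, persistent=False), ["A", "B"])
print("persistent buffer, request A reads:", shared["A"][-1])
print("per-call buffer,   request A reads:", local["A"][-1])
```

With the persistent buffer, request A ends up reading request B's entries; with a per-call buffer each request sees only its own history.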

Root Cause 2: `summed_embeddings` 3D + `text_step` 2D broadcast error

In talker_mtp(), summed_embeddings has shape [B, S, H] (3D) while text_step has shape [B*S, H] (2D). Adding them directly triggers silent PyTorch broadcasting instead of raising an error. For example, with S = 1 and B > 1, [B, 1, H] + [B, H] broadcasts to [B, B, H], so the result is completely wrong.

Fix: Flatten summed_embeddings to 2D before addition:

# Before (buggy): 3D + 2D silent broadcast
inputs_embeds = summed_embeddings.clone()
inputs_embeds = (inputs_embeds + text_step).reshape(...)

# After (fixed): uniform 2D addition
inputs_embeds = summed_embeddings.reshape(-1, self.talker_config.text_config.hidden_size)
inputs_embeds = (inputs_embeds + text_step).reshape(...)
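
To see why the addition misbehaves only for B > 1, the shape arithmetic can be checked with a small helper that follows the standard NumPy/PyTorch broadcasting rules (right-align the shapes; each pair of dimensions must be equal or contain a 1). The helper `broadcast_shape` and the concrete sizes below are illustrative assumptions, not code from the PR.

```python
def broadcast_shape(a, b):
    """Result shape of an elementwise op under NumPy/PyTorch broadcasting rules."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(out)


H = 8  # hidden size, arbitrary for the demo
for B in (1, 4):
    S = 1
    buggy = broadcast_shape((B, S, H), (B * S, H))      # 3D + 2D
    fixed = broadcast_shape((B * S, H), (B * S, H))     # 2D + 2D after reshape(-1, H)
    print(f"B={B}: 3D+2D -> {buggy}, flattened -> {fixed}")
```

For B = 1 the 3D + 2D result still has B*S*H elements, which is why the bug can go unnoticed in single-request tests; for B = 4 the result grows to (4, 4, 8).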

Test Plan

Accuracy test (run twice)

pytest -s -v   tests/e2e/online_serving/test_qwen3_omni_expansion.py   -m "advanced_model" --run-level "advanced_model"

benchmark

vllm serve /workspace/models/Qwen3-Omni-30B-A3B-Instruct --omni --port 46354 --stage-configs-path ./vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml

Test Result

benchmark

Metric	Before	After
TTFP (ms)	13000.91	8851.15
E2E (ms)	86184.13	83231.48
RTF	0.65	0.58
TTFT (ms)	6877.48	5329.86
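
The relative change for each metric can be derived directly from the table. The short script below is only a convenience sketch; `reduction_pct` is a hypothetical helper and the numbers are copied from the table above.

```python
# (before, after) pairs copied from the summary table.
rows = {
    "TTFP (ms)": (13000.91, 8851.15),
    "E2E (ms)": (86184.13, 83231.48),
    "RTF": (0.65, 0.58),
    "TTFT (ms)": (6877.48, 5329.86),
}


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


reductions = {name: reduction_pct(b, a) for name, (b, a) in rows.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

This works out to roughly 31.9% lower TTFP, 3.4% lower E2E latency, 10.8% lower RTF, and 22.5% lower TTFT.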

Before:
============ Serving Benchmark Result ============							
Successful requests:                     10        							
Failed requests:                         0         							
Maximum request concurrency:             10        							
Benchmark duration (s):                  131.49    							
Request throughput (req/s):              0.08      							
Peak concurrent requests:                10.00     							
----------------End-to-end Latency----------------							
Mean E2EL (ms):                          86184.13  							
Median E2EL (ms):                        95894.25  							
P99 E2EL (ms):                           130903.62 							
================== Text Result ===================							
Total input tokens:                      25000     							
Total generated tokens:                  5199      							
Output token throughput (tok/s):         39.54     							
Peak output token throughput (tok/s):    306.00    							
Peak concurrent requests:                10.00     							
Total Token throughput (tok/s):          229.67    							
---------------Time to First Token----------------							
Mean TTFT (ms):                          6877.48   							
Median TTFT (ms):                        7451.55   							
P99 TTFT (ms):                           7454.69   							
-----Time per Output Token (excl. 1st token)------							
Mean TPOT (ms):                          101.27    							
Median TPOT (ms):                        81.24     							
P99 TPOT (ms):                           240.85    							
---------------Inter-token Latency----------------							
Mean ITL (ms):                           53.44     							
Median ITL (ms):                         0.01      							
P99 ITL (ms):                            1925.43   							
================== Audio Result ==================							
Total audio duration generated(s):       1669.25   							
Total audio frames generated:            40061925  							
Audio throughput(audio duration/s):      12.69     							
---------------Time to First Packet---------------							
Mean AUDIO_TTFP (ms):                    13000.91  							
Median AUDIO_TTFP (ms):                  13079.55  							
P99 AUDIO_TTFP (ms):                     13637.70  							
-----------------Real Time Factor-----------------							
Mean AUDIO_RTF:                          0.65      							
Median AUDIO_RTF:                        0.57      							
P99 AUDIO_RTF:                           1.12      							
==================================================							


After:
============ Serving Benchmark Result ============							
Successful requests:                     10        							
Failed requests:                         0         							
Maximum request concurrency:             10        							
Benchmark duration (s):                  124.16    							
Request throughput (req/s):              0.08      							
Peak concurrent requests:                10.00     							
----------------End-to-end Latency----------------							
Mean E2EL (ms):                          83231.48  							
Median E2EL (ms):                        85536.25  							
P99 E2EL (ms):                           122818.30 							
================== Text Result ===================							
Total input tokens:                      25000     							
Total generated tokens:                  5172      							
Output token throughput (tok/s):         41.66     							
Peak output token throughput (tok/s):    304.00    							
Peak concurrent requests:                10.00     							
Total Token throughput (tok/s):          243.02    							
---------------Time to First Token----------------							
Mean TTFT (ms):                          5329.86   							
Median TTFT (ms):                        5882.53   							
P99 TTFT (ms):                           5884.73   							
-----Time per Output Token (excl. 1st token)------							
Mean TPOT (ms):                          104.86    							
Median TPOT (ms):                        92.06     							
P99 TPOT (ms):                           198.48    							
---------------Inter-token Latency----------------							
Mean ITL (ms):                           51.24     							
Median ITL (ms):                         0.01      							
P99 ITL (ms):                            2015.94   							
================== Audio Result ==================							
Total audio duration generated(s):       1618.88   							
Total audio frames generated:            38853060  							
Audio throughput(audio duration/s):      13.04     							
---------------Time to First Packet---------------							
Mean AUDIO_TTFP (ms):                    8851.15   							
Median AUDIO_TTFP (ms):                  8848.69   							
P99 AUDIO_TTFP (ms):                     9442.12   							
-----------------Real Time Factor-----------------							
Mean AUDIO_RTF:                          0.58      							
Median AUDIO_RTF:                        0.58      							
P99 AUDIO_RTF:                           0.83      							
==================================================

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7f5cdb4dc

chatgpt-codex-connector · 2026-03-19T11:51:57Z

+            probs = F.softmax(logits, dim=-1)
+            code = torch.multinomial(probs, num_samples=1)  # [bsz, 1]


Avoid full CUDA graphs around the new inline multinomial path

When full cudagraphs are enabled for the Qwen3-Omni talker, OmniGPUModelRunner.load_model() still wraps talker_mtp whenever self.model.talker exists (vllm_omni/worker/gpu_model_runner.py:83-88). This refactor replaced the old custom sampling op with a plain torch.multinomial call inside that path, even though the runner comment explicitly treats torch.multinomial inside the code predictor as graph-unsafe. In that configuration, capture/replay can fail or fall back to eager on every decode step, which breaks the main optimization this commit is re-introducing.

chatgpt-codex-connector · 2026-03-19T11:51:57Z

+            # Keys that should stay on GPU in model_intermediate_buffer to avoid CPU↔GPU round-trips
+            self.gpu_resident_buffer_keys: set[str] = {
+                "last_talker_hidden",
+                "trailing_text_hidden",
+                "tts_pad_embed_projected",


Keep code_predictor_codes in the GPU-resident buffer set

_talker_mtp_forward() writes the decode result under the default talker_mtp_output_key (code_predictor_codes), but _update_intermediate_buffer() only skips the .to("cpu") round-trip for keys listed in gpu_resident_buffer_keys (vllm_omni/worker/gpu_model_runner.py:1349-1354). Because this set omits code_predictor_codes, every decode step still synchronizes on a device-to-host copy of the MTP output, so the advertised hot-path CPU round-trip elimination never actually applies to Qwen3-Omni's codec codes.

code_predictor_codes will not be read via gpu_resident_buffer_keys, so there is no need to add it to that set.
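
The mechanism under discussion can be sketched as follows. This is a simplified illustration under stated assumptions: `FakeTensor` and `update_intermediate_buffer` are hypothetical stand-ins written for this example, not the actual `_update_intermediate_buffer()` in gpu_model_runner.py, and the real method may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only tracks which device it lives on."""
    device: str = "cuda"

    def to(self, device):
        return FakeTensor(device=device)


def update_intermediate_buffer(buffer, updates, gpu_resident_keys):
    """Store updates, copying to host unless the key is marked GPU-resident.

    Returns the number of device-to-host copies performed.
    """
    host_copies = 0
    for key, value in updates.items():
        if key in gpu_resident_keys:
            buffer[key] = value              # stays on GPU, no sync
        else:
            buffer[key] = value.to("cpu")    # device-to-host copy
            host_copies += 1
    return host_copies


gpu_resident_buffer_keys = {
    "last_talker_hidden",
    "trailing_text_hidden",
    "tts_pad_embed_projected",
}
buffer = {}
copies = update_intermediate_buffer(
    buffer,
    {"last_talker_hidden": FakeTensor(), "code_predictor_codes": FakeTensor()},
    gpu_resident_buffer_keys,
)
print(copies, {k: v.device for k, v in buffer.items()})
```

In this sketch only keys outside the set incur a host copy, which is the behavior the review comment and the reply are debating for code_predictor_codes.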

hsliuustc0106 · 2026-03-19T12:35:02Z

any performance w/o the feature here?

LJH-LBJ · 2026-03-19T12:42:25Z

any performance w/o the feature here?

I didn’t make any changes to the main feature branch. The bug fix only addressed a tensor dimension mismatch and added a torch.zeros in one place; nothing else was modified. I don’t think it has any impact on performance.

amy-why-3459 · 2026-03-19T12:45:31Z

Please provide a performance data comparison before and after this PR.

amy-why-3459 · 2026-03-19T12:51:42Z

Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

LJH-LBJ · 2026-03-19T12:56:57Z

Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

A multi-concurrency scenario already exists in test_qwen3_omni_expansion.py.
You can check request_num in the test case.

hsliuustc0106 · 2026-03-19T12:58:18Z

Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

A multi-concurrency scenario already exists in test_qwen3_omni_expansion.py. You can check request_num in the test case.

I suggest adding the performance numbers without this PR; they are necessary to show the gain or regression.

LJH-LBJ · 2026-03-19T13:08:45Z

Please provide accuracy test results under multiple concurrent scenarios. Please add an accuracy test under multiple concurrent scenarios.

A multi-concurrency scenario already exists in test_qwen3_omni_expansion.py. You can check request_num in the test case.

I suggest adding the performance numbers without this PR; they are necessary to show the gain or regression.

OK, I will test it.

amy-why-3459 · 2026-03-20T01:31:01Z

@Sy0307 PTAL

LJH-LBJ · 2026-03-20T01:49:05Z

any performance w/o the feature here?

@hsliuustc0106 @amy-why-3459 Already posted. I tested with PR #1982.

Sy0307 · 2026-03-20T02:58:26Z

LGTM. The related issue is #2019. It will wait for this PR to be merged.

ZeldaHuang · 2026-03-20T03:35:34Z

+        k = k.view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = v.view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
+
+        attn_out = F.scaled_dot_product_attention(


Can we use this?
https://github.com/vllm-project/vllm/blob/9040151fe1899aba6e2934364fb4c5edfcb5e29c/vllm/v1/attention/ops/vit_attn_wrappers.py#L190

apply_sdpa from vllm/v1/attention/ops/vit_attn_wrappers.py cannot be used here — it omits is_causal=True.

The code predictor runs autoregressive re-prefill: each AR step forwards a growing sequence (length 2 → num_code_groups + 1). Position i must only attend to positions ≤ i; otherwise the yet-to-be-predicted slots (initialized to zero in proj_buf) leak into the attention output and corrupt the logits. is_causal=True enforces the lower-triangular mask that prevents this.

apply_sdpa is designed for ViT-style encoders where every token attends to every other token (bidirectional). Dropping the causal mask here would break codec generation quality.
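
The leakage argument can be checked with a tiny single-head attention written in plain Python. This is an illustrative sketch only: the `attention` helper and the toy vectors are assumptions made for the demo, and the real code path uses F.scaled_dot_product_attention with is_causal=True.

```python
import math


def attention(q, k, v, causal):
    """Single-head scaled dot-product attention over lists of equal-length vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Causal: position i attends only to positions <= i.
        last = i + 1 if causal else len(k)
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(last)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total for c in range(d)])
    return out


filled = [[1.0, 0.0], [0.0, 1.0]]        # slots already predicted
x_zero = filled + [[0.0, 0.0]]           # future slot still zero-initialized
x_other = filled + [[5.0, 5.0]]          # same prefix, different future slot

causal_zero = attention(x_zero, x_zero, x_zero, causal=True)
causal_other = attention(x_other, x_other, x_other, causal=True)
full_zero = attention(x_zero, x_zero, x_zero, causal=False)

print("causal, prefix outputs identical:", causal_zero[:2] == causal_other[:2])
print("bidirectional output at position 0:", full_zero[0])
print("causal output at position 0:       ", causal_zero[0])
```

With the causal mask, the outputs for already-filled positions do not depend on what sits in later slots. Without it, the zero-initialized slot takes a share of the softmax weight and changes the output at earlier positions.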

LJH-LBJ added 2 commits March 19, 2026 19:21

fix bug

6721c8b

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

fix bug

e7f5cdb

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

LJH-LBJ requested a review from hsliuustc0106 as a code owner March 19, 2026 11:41

chatgpt-codex-connector bot reviewed Mar 19, 2026

fix bug

b516347

Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>

hsliuustc0106 added the ready label to trigger buildkite CI label Mar 19, 2026

Merge branch 'vllm-project:main' into refactor/qwen3-omni-code-predic…

dbe77b2

…tor-v2

ZeldaHuang reviewed Mar 20, 2026

hsliuustc0106 merged commit 6901ba4 into vllm-project:main Mar 20, 2026
8 checks passed
