Dev/rebase 0.14.0 by tzhouam · Pull Request #813 · vllm-project/vllm-omni

tzhouam · 2026-01-16T06:40:07Z


Purpose

This PR partially rebases onto vLLM v0.14.0 and adds online and offline support for Qwen3 Omni.

Test Plan Online

Follow the README.

Test Result Online

Chat completion output from text: Based on the provided audio and images, here is an analysis of your questions:

### 1. What is recited in the audio?

The speaker recites the first verse of the classic English nursery rhyme "Mary Had a Little Lamb":

> "Mary had a little lamb, its fleece was white as snow; and everywhere that Mary went, the lamb was sure to go."

The speaker also mentions that these were the "first words I spoke in the original phonograph," indicating they are likely recounting a historical or personal anecdote about early sound recording.

---

### 2. What is the content of this image?

The image shows the Tokyo Skytree, a prominent telecommunications and observation tower in Tokyo, Japan. It is viewed from a low angle through the pink blossoms of cherry trees (sakura) in full bloom against a clear blue sky. This scene captures a beautiful springtime view, often associated with the Japanese tradition of *hanami* (flower viewing).

---

### 3. Why is this video funny?

The humor in the video stems from the contrast between the subject's appearance and their actions.

*   **Appearance:** The child is wearing large, thick-rimmed glasses that look comically oversized for their face.
*   **Action:** Despite the silly appearance, the child is engaged in a very serious and focused activity—reading a book. They turn the pages carefully and appear completely absorbed in the book.

The juxtaposition of a toddler dressed like a scholarly adult, intently reading, creates a charming and humorous effect.
Audio saved to audio_0.wav

Test Plan Offline

python3 end2end.py -q use_image

Test Result Offline

output_0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.wav

INFO 01-16 05:15:09 [omni.py:295] [Orchestrator] Stage-1 reported ready
INFO 01-16 05:15:09 [omni.py:321] [Orchestrator] All stages initialized successfully
Adding requests:   0%|          | 0/1 [00:00<?, ?it/s
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(Worker pid=752775) [Stage-1] INFO 01-16 05:15:26 [mrope.py:452] Multimodal token idx changed!
(Worker pid=752775) [Stage-1] WARNING 01-16 05:15:26 [qwen3_omni_moe_code_predictor_mtp.py:228] Using sdpa attention backend (flash_attention_2 not available or failed)
(Worker pid=752772) [Stage-2] INFO 01-16 05:16:20 [mrope.py:452] Multimodal token idx changed!
INFO 01-16 05:16:21 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-16 05:16:21 [log_utils.py:550]  'request_id': '0_1a2bd30d-910c-4ad5-81ed-f19b9571836a',
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_time_ms': 71268.56875419617,
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_tpt': 23.064261732749568,
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_total_tokens': 3090,
INFO 01-16 05:16:21 [log_utils.py:550]  'transfers_total_time_ms': 159.29198265075684,
INFO 01-16 05:16:21 [log_utils.py:550]  'transfers_total_bytes': 56000530,
INFO 01-16 05:16:21 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 14935.698509216309,
INFO 01-16 05:16:21 [log_utils.py:550]                 'num_tokens_out': 223,
INFO 01-16 05:16:21 [log_utils.py:550]                 'num_tokens_in': 2093},
INFO 01-16 05:16:21 [log_utils.py:550]             1: {'stage_gen_time_ms': 54997.80225753784, 'num_tokens_out': 774},
INFO 01-16 05:16:21 [log_utils.py:550]             2: {'stage_gen_time_ms': 273.1940746307373, 'num_tokens_out': 0}}}
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [01:11<00:00, 71.27s/req, est. speed stage-2 tok/s: 43.36, avg e2e_lat: 0.0ms]
INFO 01-16 05:16:21 [omni.py:782] [Summary] {'e2e_requests': 1,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_total_time_ms': 71269.86789703369,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_sum_time_ms': 71268.56875419617,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_total_tokens': 3090,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_avg_time_per_request_ms': 71268.56875419617,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_avg_tokens_per_s': 43.3571215756745,
INFO 01-16 05:16:21 [omni.py:782]  'wall_time_ms': 71269.86789703369,
INFO 01-16 05:16:21 [omni.py:782]  'final_stage_id': {'0_1a2bd30d-910c-4ad5-81ed-f19b9571836a': 2},
INFO 01-16 05:16:21 [omni.py:782]  'stages': [{'stage_id': 0,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 2316,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 15139.074325561523,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 15139.074325561523,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 152.98161236249146},
INFO 01-16 05:16:21 [omni.py:782]             {'stage_id': 1,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 774,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 55089.155197143555,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 55089.155197143555,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 14.049952249769351},
INFO 01-16 05:16:21 [omni.py:782]             {'stage_id': 2,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 0,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 299.47686195373535,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 299.47686195373535,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 0.0}],
INFO 01-16 05:16:21 [omni.py:782]  'transfers': [{'from_stage': 0,
INFO 01-16 05:16:21 [omni.py:782]                 'to_stage': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_bytes': 49699758,
INFO 01-16 05:16:21 [omni.py:782]                 'total_time_ms': 69.85020637512207,
INFO 01-16 05:16:21 [omni.py:782]                 'tx_mbps': 5692.153031943067,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_bytes': 49699758,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_time_ms': 65.75202941894531,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_mbps': 6046.932201387521,
INFO 01-16 05:16:21 [omni.py:782]                 'total_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_transfer_time_ms': 138.01193237304688,
INFO 01-16 05:16:21 [omni.py:782]                 'total_mbps': 2880.8962903677825},
INFO 01-16 05:16:21 [omni.py:782]                {'from_stage': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'to_stage': 2,
INFO 01-16 05:16:21 [omni.py:782]                 'samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_bytes': 6300772,
INFO 01-16 05:16:21 [omni.py:782]                 'total_time_ms': 7.778406143188477,
INFO 01-16 05:16:21 [omni.py:782]                 'tx_mbps': 6480.270517134222,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_bytes': 6300772,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_time_ms': 11.675119400024414,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_mbps': 4317.401327809512,
INFO 01-16 05:16:21 [omni.py:782]                 'total_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_transfer_time_ms': 21.28005027770996,
INFO 01-16 05:16:21 [omni.py:782]                 'total_mbps': 2368.705681715355}]}
[Stage-1] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
[Stage-2] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
[Stage-0] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
Request ID: 0_1a2bd30d-910c-4ad5-81ed-f19b9571836a, Text saved to output_audio/0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.txt
Request ID: 0_1a2bd30d-910c-4ad5-81ed-f19b9571836a, Saved audio to output_audio/output_0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.wav
(Worker_TP1 pid=747414) (Worker pid=752772) [Stage-0] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker pid=752775) (Worker_TP0 pid=747413) [Stage-1] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
[Stage-2] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
[Stage-0] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker



…#722) Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 236f73e419


chatgpt-codex-connector · 2026-01-16T06:45:46Z

                        kv_transfer_params=kv_transfer_params,
                        trace_headers=request.trace_headers,
                        num_cached_tokens=request.num_cached_tokens,
+                        routed_experts=routed_experts,


Align EngineCoreOutput with routed_experts field

OmniARScheduler now always passes routed_experts into EngineCoreOutput. In vllm_omni, EngineCoreOutput is patched to OmniEngineCoreOutput (see vllm_omni/patch.py), which does not define a routed_experts field (vllm_omni/engine/__init__.py). msgspec.Struct will raise TypeError: __init__() got an unexpected keyword argument 'routed_experts', so any request that produces output will crash even when routed_experts is None. Either add the field to OmniEngineCoreOutput or guard this keyword by engine type/feature flag.
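The failure mode described above can be reproduced in miniature. The sketch below uses a standard-library dataclass as a stand-in for `msgspec.Struct` (both reject keyword arguments for fields they do not declare, although the exact error text differs). The class names `CoreOutputWithoutField` and `CoreOutputWithField` are hypothetical and are not the real `OmniEngineCoreOutput` definition.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CoreOutputWithoutField:
    """Hypothetical stand-in for an output struct that lacks routed_experts."""
    request_id: str
    num_cached_tokens: int = 0


@dataclass
class CoreOutputWithField:
    """Same struct with the field declared; None is the default."""
    request_id: str
    num_cached_tokens: int = 0
    routed_experts: Optional[Any] = None


def build(cls, **kwargs):
    """Return (instance, None) on success or (None, message) on TypeError."""
    try:
        return cls(**kwargs), None
    except TypeError as exc:
        return None, str(exc)


# Passing routed_experts=None still fails when the field is not declared.
_, err = build(CoreOutputWithoutField, request_id="r0", routed_experts=None)
out, ok_err = build(CoreOutputWithField, request_id="r0", routed_experts=None)
print(err)
```

Declaring the field with a `None` default keeps existing call sites that omit it working, which is one of the two remedies the comment suggests.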


chatgpt-codex-connector · 2026-01-16T06:45:46Z

+            if tool_parsing_unavailable and request.tool_choice not in (
+                None,
+                "none",
            ):
-                # for hf tokenizers, "auto" tools requires
-                # --enable-auto-tool-choice and --tool-call-parser
-                return self.create_error_response(
-                    '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set'
-                )
+                if request.tool_choice == "auto" and not self.enable_auto_tools:


Reject auto tool choice when no tool parser

With tool_parsing_unavailable true, the new logic only errors for tool_choice="auto" when enable_auto_tools is false; if enable_auto_tools is true but tool_parser is still None (HF tokenizers, not Harmony/Mistral), the request is accepted. Downstream parsing is gated on tool_parser being set, so auto tool calls are never extracted and the tool choice is silently ignored. This is user-visible for requests with tools but no --tool-call-parser. Consider restoring the requirement that auto needs a parser or explicitly rejecting when tool_parsing_unavailable and tool_choice is not none.
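One way to express the stricter check the comment asks for is sketched below. It is an illustration only: `validate_tool_choice` and its parameters are hypothetical names, and `has_tool_parser` is assumed to be true whenever some parser (HF, Harmony, or Mistral) is actually usable. The error string is the one removed in the diff above.

```python
from typing import Optional


def validate_tool_choice(
    tool_choice: Optional[str],
    enable_auto_tools: bool,
    has_tool_parser: bool,
) -> Optional[str]:
    """Return an error message, or None if the request can proceed."""
    # No tool use requested: nothing to validate.
    if tool_choice in (None, "none"):
        return None
    # "auto" needs both the feature flag and a parser; otherwise tool
    # calls would never be extracted and the choice would be ignored.
    if tool_choice == "auto" and not (enable_auto_tools and has_tool_parser):
        return (
            '"auto" tool choice requires --enable-auto-tool-choice '
            "and --tool-call-parser to be set"
        )
    return None


# The case flagged in the review: flag enabled, but no parser configured.
print(validate_tool_choice("auto", enable_auto_tools=True, has_tool_parser=False))
```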


Signed-off-by: David Chen <530634352@qq.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

hsliuustc0106 · 2026-01-16T10:03:07Z

Review from Gemini, based on the logs:

Based on the logs you provided, your system is running a multi-stage multimodal pipeline (likely Qwen2-VL or a similar Omni-model architecture) using an Orchestrator.

There are three main areas for optimization: Environment/Hardware, Configuration, and Logging/Observability.

1. Critical Performance Bottlenecks

Fix Attention Backend (High Impact)

Your logs show a major warning:

Using sdpa attention backend (flash_attention_2 not available or failed)

Problem: SDPA is significantly slower and more memory-intensive than Flash Attention 2, especially for high-resolution images or long sequences.
Optimization: Install flash-attn. Ensure your CUDA version and GPU (Ampere or newer, e.g., A100, H100, RTX 30/40 series) support it. This will drastically reduce the stage_gen_time_ms.
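A minimal sketch of the kind of availability check that leads to such a fallback, assuming the backend is chosen by whether the `flash_attn` package can be imported. The function name `pick_attention_backend` is hypothetical and this is not the code in `qwen3_omni_moe_code_predictor_mtp.py`. Importability alone does not guarantee that the kernels work on a given GPU.

```python
import importlib.util
from typing import Callable, Optional


def pick_attention_backend(
    find_spec: Callable[[str], Optional[object]] = importlib.util.find_spec,
) -> str:
    """Prefer flash_attention_2 when the flash_attn package is importable."""
    try:
        available = find_spec("flash_attn") is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialized modules.
        available = False
    return "flash_attention_2" if available else "sdpa"


# Reports which backend this environment would select.
print(pick_attention_backend())
```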

Eliminate RoPE Parameter Errors

Unrecognized keys in rope_parameters ... {'interleaved', 'mrope_section'}

Problem: This indicates a version mismatch between your transformers library and the model configuration. The model is falling back to default Rotary Positional Embeddings, which can degrade the quality of multimodal spatial understanding.
Optimization: Update your environment: pip install -U transformers accelerate.

Stage-1 Latency

Looking at the summary:

Stage-0: 152.9 tok/s (Fast)
Stage-1: 14.0 tok/s (Slow bottleneck)
Optimization: Stage-1 is where your heavy lifting happens. Since you are using Worker pid and multiproc_executor, consider rebalancing your GPU resources. If Stage-1 is your main LLM/Vision-Language block, it needs more compute or a more aggressive quantization (e.g., bitsandbytes 4-bit or AWQ).
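The per-stage rates quoted above can be recomputed directly from the `[Summary]` log. The sketch below copies the `tokens` and `total_time_ms` values from that log and also shows what share of the wall time Stage-1 accounts for; the helper name `tokens_per_second` is an assumption for illustration.

```python
# Values copied from the [Summary] block in the offline test log.
stage_stats = {
    0: {"tokens": 2316, "total_time_ms": 15139.074325561523},
    1: {"tokens": 774, "total_time_ms": 55089.155197143555},
}
WALL_TIME_MS = 71269.86789703369
E2E_TOTAL_TOKENS = 3090
E2E_SUM_TIME_MS = 71268.56875419617


def tokens_per_second(tokens: int, total_time_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/s."""
    return tokens / (total_time_ms / 1000.0)


rates = {
    stage_id: tokens_per_second(s["tokens"], s["total_time_ms"])
    for stage_id, s in stage_stats.items()
}
e2e_rate = tokens_per_second(E2E_TOTAL_TOKENS, E2E_SUM_TIME_MS)
stage1_share = stage_stats[1]["total_time_ms"] / WALL_TIME_MS

for stage_id, rate in rates.items():
    print(f"stage {stage_id}: {rate:.2f} tok/s")
print(f"end to end: {e2e_rate:.2f} tok/s, stage-1 share of wall time: {stage1_share:.0%}")
```

On these numbers Stage-1 takes roughly three quarters of the wall time, which is consistent with it being the bottleneck.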

2. Infrastructure & Data Transfer

Fast vs. Slow Image Processor

The log mentions: The image processor ... is now loaded as a fast processor by default.

Optimization: If you are seeing CPU spikes during the "Adding requests" phase, explicitly set use_fast=True in your AutoProcessor loading code to suppress the warning and ensure you aren't hitting the "breaking change" fallback.

Inter-Stage Transfer (NCCL/Shared Memory)

Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
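A note on units: recomputing the figures from the logged byte counts and times suggests that the `*_mbps` fields are megabits per second rather than megabytes per second. The sketch below reproduces the logged values under that assumption (decimal megabits, 10^6 bits); the helper names are hypothetical.

```python
def megabits_per_second(total_bytes: int, time_ms: float) -> float:
    """Throughput in decimal megabits per second."""
    return total_bytes * 8 / (time_ms / 1000.0) / 1e6


def megabytes_per_second(total_bytes: int, time_ms: float) -> float:
    """Throughput in decimal megabytes per second."""
    return total_bytes / (time_ms / 1000.0) / 1e6


# Stage 0 -> 1 transfer, values copied from the [Summary] log.
tx_0_1 = megabits_per_second(49699758, 69.85020637512207)
total_0_1 = megabits_per_second(49699758, 138.01193237304688)
# Stage 1 -> 2 transfer.
tx_1_2 = megabits_per_second(6300772, 7.778406143188477)

print(f"0->1 tx: {tx_0_1:.1f} Mbit/s = {megabytes_per_second(49699758, 69.85020637512207):.1f} MB/s")
print(f"0->1 total: {total_0_1:.1f} Mbit/s")
print(f"1->2 tx: {tx_1_2:.1f} Mbit/s")
```

If that reading is right, the effective stage 0 to stage 1 rate is a few hundred megabytes per second, not several gigabytes per second.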

3. Log Cleanup & Readability

The current logs are "noisy" because JSON metrics are being printed line-by-line via log_utils.py:550.

Single-Line JSON: Configure your logger to output the metrics dictionary as a single minimized JSON string. This makes it much easier for log aggregators (like ELK or Datadog) to parse.
Disable TQDM in Production: The progress bars (0%|...|) create massive amounts of "empty" line noise in file-based logs. Use disable=True in your tqdm instances when not sys.stdout.isatty().
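Both suggestions can be sketched with the standard library alone. The example below is an illustration, not the code in `log_utils.py`: `format_metrics` and `progress_disabled` are hypothetical helpers, and the metrics dictionary is a trimmed copy of the one in the log. The boolean returned by `progress_disabled` is the kind of value one could pass as tqdm's `disable` argument.

```python
import io
import json
import logging
import sys

logger = logging.getLogger("metrics_example")


def format_metrics(metrics: dict) -> str:
    """Serialize metrics as one compact JSON line (no spaces, no newlines)."""
    return json.dumps(metrics, separators=(",", ":"))


def progress_disabled(stream=None) -> bool:
    """True when the stream is not an interactive terminal."""
    stream = sys.stdout if stream is None else stream
    return not stream.isatty()


sample = {
    "type": "request_level_metrics",
    "e2e_total_tokens": 3090,
    # Integer keys become strings in JSON ("0", "1").
    "stages": {0: {"num_tokens_out": 223}, 1: {"num_tokens_out": 774}},
}

line = format_metrics(sample)
logger.info(line)  # one log record instead of one per dictionary row
print(line)
print(progress_disabled(io.StringIO()))  # a file-like, non-TTY stream
```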

Summary Table of Optimization Tasks

Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com> Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

tzhouam · 2026-01-16T14:07:11Z

> Inter-Stage Transfer (NCCL/Shared Memory)
> Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
>
> Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.

@natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?

tzhouam · 2026-01-16T14:08:33Z

> Eliminate RoPE Parameter Errors
> Unrecognized keys in rope_parameters ... {'interleaved', 'mrope_section'}
>
> Problem: This indicates a version mismatch between your transformers library and the model configuration. The model is falling back to default Rotary Positional Embeddings, which can degrade the quality of multimodal spatial understanding.
>
> Optimization: Update your environment: pip install -U transformers accelerate.

Will deal with this later.

tzhouam · 2026-01-16T14:10:56Z

> Log Cleanup & Readability
> The current logs are "noisy" because JSON metrics are being printed line-by-line via log_utils.py:550.
>
> Single-Line JSON: Configure your logger to output the metrics dictionary as a single minimized JSON string. This makes it much easier for log aggregators (like ELK or Datadog) to parse.
>
> Disable TQDM in Production: The progress bars (0%|...|) create massive amounts of "empty" line noise in file-based logs. Use disable=True in your tqdm instances when not sys.stdout.isatty().

Please pay attention to this. @Bounty-hunter
In my view, we can leave it until the rebase is finished.

natureofnature · 2026-01-16T14:16:42Z

> > Inter-Stage Transfer (NCCL/Shared Memory)
> > Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
> >
> > Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
>
> @natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?

Will implement a version.

hsliuustc0106 · 2026-01-16T21:17:16Z

> > > Inter-Stage Transfer (NCCL/Shared Memory)
> > > Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
> > >
> > > Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
> >
> > @natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?
>
> Will implement a version.

we can do it later, maybe in v0.15.0

Signed-off-by: princepride <wangzhipeng628@gmail.com>

…nd video input processing, and refining position handling for MRoPE. Adjustments made to the YAML configuration to disable async scheduling for consistency. Code cleanup and formatting improvements included. Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>

….Linear (vllm-project#825) Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>

Signed-off-by: princepride <wangzhipeng628@gmail.com>

Signed-off-by: Dinesh G <G.Dinesh@ibm.com> Signed-off-by: gDINESH13 <dinesh13g@gmail.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

…lm-project#830) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* init and registry Signed-off-by: JaredforReal <w13431838023@gmail.com>
* implement glm_image_transformer.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* update transformer Signed-off-by: JaredforReal <w13431838023@gmail.com>
* init pipeline_glm_image.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* init pipeline_glm_image.py Signed-off-by: JaredforReal <w13431838023@gmail.com>
* remove pre process Signed-off-by: JaredforReal <w13431838023@gmail.com>
* add check_input(), implement CFG parallel in diffuse(), align generate_prior_tokens Signed-off-by: JaredforReal <w13431838023@gmail.com>
* fix check_input(prompt_embed), add KVCache for Image Edit Signed-off-by: JaredforReal <w13431838023@gmail.com>
* print out vllm version Signed-off-by: root <root@hk01dgx039.cm.cluster>
* update model config Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update worker Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update one import in AsyncOmniLLM (not finish all, but can run) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update Qwen3 Omni ViT init based on updated interface (the update for Qwen3 Omni Thinker is not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Remove unnecessary override for OmniRequestState (the update for OmniRequestState is not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update model runner dummy run Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update ar scheduler Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update _preprocess, execute model and sample_tokens for AR Model Runner Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* debug AR Scheduler Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update OmniGPUModelRunner._update_states Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update the offline LLM request sorting due to changed requested id format Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update Qwen3 Omni to fit with the engine core logic Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update generation model runner Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* debug GLM-Image Model Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* remove deleted args from doc string Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* [Model][Rebase] Add GLM-Image Model and Partial Rebase to v0.14.0 (Support AR Offiline) (vllm-project#763)
  Signed-off-by: JaredforReal <w13431838023@gmail.com>
  Signed-off-by: root <root@hk01dgx039.cm.cluster>
  Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
  Co-authored-by: JaredforReal <w13431838023@gmail.com>
  Co-authored-by: root <root@hk01dgx039.cm.cluster>
* disable async scheduling for generation models, avoiding inconsistency from race condition Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Update Qwen 3 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* [Fix] GLM Image (vllm-project#799) Signed-off-by: JaredforReal <w13431838023@gmail.com>
* support online serving for Qwen3 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* fix pre-commit Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* inherit engine outputs Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* supporting audio in video(not finished) Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Update Qwen2.5 Omni model to version 0.14, adding support for image and video input processing, and refining position handling for MRoPE. Adjustments made to the YAML configuration to disable async scheduling for consistency. Code cleanup and formatting improvements included. Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
* debug qwen 2.5 Omni Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update doc Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* rebase to vllm 0.14.0 Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* unify query type Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* fix build doc Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Dev/rebase 0.14.0 (vllm-project#813)
  Signed-off-by: JaredforReal <w13431838023@gmail.com>
  Signed-off-by: root <root@hk01dgx039.cm.cluster>
  Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
  Signed-off-by: TangPeng <85704592@qq.com>
  Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
  Signed-off-by: mxuax <mxuax@connect.ust.hk>
  Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
  Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
  Signed-off-by: iwzbi <wzbi@zju.edu.cn>
  Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
  Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
  Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
  Signed-off-by: samithuang <285365963@qq.com>
  Signed-off-by: linyueqian <linyueqian@outlook.com>
  Signed-off-by: David Chen <530634352@qq.com>
  Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
  Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
  Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
  Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
  Signed-off-by: princepride <wangzhipeng628@gmail.com>
  Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
  Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
  Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
  Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
  Co-authored-by: JaredforReal <w13431838023@gmail.com>
  Co-authored-by: root <root@hk01dgx039.cm.cluster>
  Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
  Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
  Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
  Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
  Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
  Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
  Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
  Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
  Co-authored-by: Samit <285365963@qq.com>
  Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
  Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
  Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
  Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
  Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
  Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
  Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
  Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
  Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
  Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
* update test import Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update version from 0.14.0rc2 to 0.14.0 Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* set vllm config for all CI Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* update CI Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
* Fix CPU offload OOM and performance issues in GLM-Image pipeline
* Fix CPU offload OOM and performance issues in GLM-Image pipeline
  - Conditionally load vision_language_encoder, text_encoder, and vae to GPU only when CPU offload is disabled
  - Propagate cpu_offload_gb argument to enable_cpu_offload flag
  - Include vision_language_encoder in CPU offload hooks for proper AR model offloading
  - Fix device mismatch in generate_prior_tokens during CPU offload mode
* Fix shared memory broadcast hang in GLM-Image pipeline
  - Add manual encoder activation support to SequentialOffloader
  - Explicitly trigger vision_language_encoder onload before get_image_features in pipeline
  - Prevents CPU-bound stalling during AR generation when offload is active
* Fix device mismatch in generate() by triggering offload hook
* Clean up temporary patch files

---------

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: TangPeng <85704592@qq.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: iwzbi <wzbi@zju.edu.cn>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Co-authored-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: root <root@hk01dgx039.cm.cluster>
Co-authored-by: tzhouam <tzhouam@connect.ust.hk>
Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>

JaredforReal and others added 30 commits January 8, 2026 17:55

init and registry

3059e27

Signed-off-by: JaredforReal <w13431838023@gmail.com>

implement glm_image_transformer.py

c0a7684

Signed-off-by: JaredforReal <w13431838023@gmail.com>

update transformer

800cea4

Signed-off-by: JaredforReal <w13431838023@gmail.com>

init pipeline_glm_image.py

8664695

Signed-off-by: JaredforReal <w13431838023@gmail.com>

init pipeline_glm_image.py

b88b4b2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

remove pre process

b9108f4

Signed-off-by: JaredforReal <w13431838023@gmail.com>

add check_input(), implement CFG parallel in diffuse(), align generate_prior_tokens

371afd5

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix check_input(prompt_embed), add KVCache for Image Edit

3d4f5f2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

print out vllm version

0810dae

Signed-off-by: root <root@hk01dgx039.cm.cluster>

update model config

8e36c51

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update worker

7f704d5

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update one import in AsyncOmniLLM (not finish all, but can run)

4afb2ff

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update Qwen3 Omni ViT init based on updated interface (the update for Qwen3 Omni Thinker is not finished)

cb2e053

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

Remove unnecessary override for OmniRequestState (the update for OmniRequestState is not finished)

e052c4a

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update model runner dummy run

c08dcdd

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

[Misc] Enable tensor_parallel_size argument with online serving cmd (vllm-project#761)

9cdf592

Signed-off-by: TangPeng <85704592@qq.com>

update ar scheduler

166fc78

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

[Bugfix] Raise ValueError when joint_strategy='rear' and causal=True in Ring Attention (vllm-project#767)

8e20b33

Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>

update _preprocess, execute model and sample_tokens for AR Model Runner

4db8f0b

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

[Feat] add vllm-omni version collection (vllm-project#740)

f761119

Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>

debug AR Scheduler

63a69a5

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update OmniGPUModelRunner._update_states

5bcdb43

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

[Doc] refactor diffusion doc (vllm-project#753)

d7cd00e

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

[Bugfix] Fix stable diffusion3 compatibility error (vllm-project#772)

e9a1bee

Signed-off-by: iwzbi <wzbi@zju.edu.cn>

update the offline LLM request sorting due to changed requested id format

2a0f72f

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

update Qwen3 Omni to fit with the engine core logic

f7c8af9

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

Merge PR vllm-project#724

f12e0af

[Feature] Support Qwen3 Omni talker mtp batch inference (vllm-project#722)

1444e1f

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

update generation model runner

e2462d2

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

debug GLM-Image Model

d89e3c4

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

fix pre-commit

236f73e

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

tzhouam requested a review from hsliuustc0106 as a code owner January 16, 2026 06:40

chatgpt-codex-connector Bot reviewed Jan 16, 2026

david6666666 and others added 3 commits January 16, 2026 16:32

[Model] add flux2 klein (vllm-project#809)

a7f9926

Signed-off-by: David Chen <530634352@qq.com>

inherit engine outputs

14e83e7

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

supporting audio in video(not finished)

b00685c

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

linyueqian and others added 2 commits January 16, 2026 12:31

[bugfix] use unipc scheduler for Wan 2.2 (vllm-project#804)

3fb6adc

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>

[Test] Add full test for Qwen3-Omni-30B-A3B-Instruct (vllm-project#720)

4e23bff

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>

princepride and others added 7 commits January 17, 2026 15:21

[Bagel] Support Cache-Dit (vllm-project#736)

0888520

Signed-off-by: princepride <wangzhipeng628@gmail.com>

[Perf] Optimize the Qwen2.5-Omni Model thinker-to-talker-proj with nn.Linear (vllm-project#825)

bb24e07

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>

[Core]Add GPU Diffusion Runner (vllm-project#822)

36c2876

Signed-off-by: princepride <wangzhipeng628@gmail.com>

[Feature]: Add CFG param to online serving (vllm-project#824)

5e7035e

Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>

debug qwen 2.5 Omni

156cac7

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

[diffusion] add tp support for qwen-image and refactor some tests (vllm-project#830)

3fc4f98

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

david6666666 added this to the v0.14.0rc1 milestone Jan 19, 2026

david6666666 added the "high priority" label (high priority issue, needs to be done asap) Jan 19, 2026

tzhouam added 5 commits January 19, 2026 08:04

update doc

30880aa

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

rebase to vllm 0.14.0

bd22edd

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

Merge branch 'main' into dev/rebase-0.14.0

58246d2

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

unify query type

4a7d732

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

fix build doc

cfd5d32

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

tzhouam merged commit 52d20a7 into vllm-project:dev/rebase_0.14.0 Jan 19, 2026
1 of 2 checks passed

tzhouam commented Jan 16, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

hsliuustc0106 commented Jan 16, 2026

1. Critical Performance Bottlenecks

Fix Attention Backend (High Impact)

Eliminate RoPE Parameter Errors

Stage-1 Latency

2. Infrastructure & Data Transfer

Fast vs. Slow Image Processor

Inter-Stage Transfer (NCCL/Shared Memory)

3. Log Cleanup & Readability

Summary Table of Optimization Tasks

tzhouam commented Jan 16, 2026

tzhouam commented Jan 16, 2026

tzhouam commented Jan 16, 2026

natureofnature commented Jan 16, 2026

hsliuustc0106 commented Jan 16, 2026

20 participants