
Dev/rebase 0.14.0 #813

Merged
tzhouam merged 62 commits into vllm-project:dev/rebase_0.14.0 from tzhouam:dev/rebase-0.14.0
Jan 19, 2026

Collaborator

@tzhouam tzhouam commented Jan 16, 2026

Purpose

This PR partially rebases onto vLLM v0.14.0 and adds online and offline support for Qwen3 Omni.

Test Plan Online

Follow the README.

Test Result Online

audio_0.wav

Chat completion output from text: Based on the provided audio and images, here is an analysis of your questions:

### 1. What is recited in the audio?

The speaker recites the first verse of the classic English nursery rhyme "Mary Had a Little Lamb":

> "Mary had a little lamb, its fleece was white as snow; and everywhere that Mary went, the lamb was sure to go."

The speaker also mentions that these were the "first words I spoke in the original phonograph," indicating they are likely recounting a historical or personal anecdote about early sound recording.

---

### 2. What is the content of this image?

The image shows the Tokyo Skytree, a prominent telecommunications and observation tower in Tokyo, Japan. It is viewed from a low angle through the pink blossoms of cherry trees (sakura) in full bloom against a clear blue sky. This scene captures a beautiful springtime view, often associated with the Japanese tradition of *hanami* (flower viewing).

---

### 3. Why is this video funny?

The humor in the video stems from the contrast between the subject's appearance and their actions.

*   **Appearance:** The child is wearing large, thick-rimmed glasses that look comically oversized for their face.
*   **Action:** Despite the silly appearance, the child is engaged in a very serious and focused activity—reading a book. They turn the pages carefully and appear completely absorbed in the book.

The juxtaposition of a toddler dressed like a scholarly adult, intently reading, creates a charming and humorous effect.
Audio saved to audio_0.wav

Test Plan Offline

python3 end2end.py -q use_image

Test Result Offline

output_0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.wav

INFO 01-16 05:15:09 [omni.py:295] [Orchestrator] Stage-1 reported ready
INFO 01-16 05:15:09 [omni.py:321] [Orchestrator] All stages initialized successfully
Adding requests:   0%|          | 0/1 [00:00<?, ?it/s
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(Worker pid=752775) [Stage-1] INFO 01-16 05:15:26 [mrope.py:452] Multimodal token idx changed!
(Worker pid=752775) [Stage-1] WARNING 01-16 05:15:26 [qwen3_omni_moe_code_predictor_mtp.py:228] Using sdpa attention backend (flash_attention_2 not available or failed)
(Worker pid=752772) [Stage-2] INFO 01-16 05:16:20 [mrope.py:452] Multimodal token idx changed!
INFO 01-16 05:16:21 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-16 05:16:21 [log_utils.py:550]  'request_id': '0_1a2bd30d-910c-4ad5-81ed-f19b9571836a',
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_time_ms': 71268.56875419617,
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_tpt': 23.064261732749568,
INFO 01-16 05:16:21 [log_utils.py:550]  'e2e_total_tokens': 3090,
INFO 01-16 05:16:21 [log_utils.py:550]  'transfers_total_time_ms': 159.29198265075684,
INFO 01-16 05:16:21 [log_utils.py:550]  'transfers_total_bytes': 56000530,
INFO 01-16 05:16:21 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 14935.698509216309,
INFO 01-16 05:16:21 [log_utils.py:550]                 'num_tokens_out': 223,
INFO 01-16 05:16:21 [log_utils.py:550]                 'num_tokens_in': 2093},
INFO 01-16 05:16:21 [log_utils.py:550]             1: {'stage_gen_time_ms': 54997.80225753784, 'num_tokens_out': 774},
INFO 01-16 05:16:21 [log_utils.py:550]             2: {'stage_gen_time_ms': 273.1940746307373, 'num_tokens_out': 0}}}
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [01:11<00:00, 71.27s/req, est. speed stage-2 tok/s: 43.36, avg e2e_lat: 0.0ms]
INFO 01-16 05:16:21 [omni.py:782] [Summary] {'e2e_requests': 1,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_total_time_ms': 71269.86789703369,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_sum_time_ms': 71268.56875419617,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_total_tokens': 3090,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_avg_time_per_request_ms': 71268.56875419617,
INFO 01-16 05:16:21 [omni.py:782]  'e2e_avg_tokens_per_s': 43.3571215756745,
INFO 01-16 05:16:21 [omni.py:782]  'wall_time_ms': 71269.86789703369,
INFO 01-16 05:16:21 [omni.py:782]  'final_stage_id': {'0_1a2bd30d-910c-4ad5-81ed-f19b9571836a': 2},
INFO 01-16 05:16:21 [omni.py:782]  'stages': [{'stage_id': 0,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 2316,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 15139.074325561523,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 15139.074325561523,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 152.98161236249146},
INFO 01-16 05:16:21 [omni.py:782]             {'stage_id': 1,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 774,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 55089.155197143555,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 55089.155197143555,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 14.049952249769351},
INFO 01-16 05:16:21 [omni.py:782]             {'stage_id': 2,
INFO 01-16 05:16:21 [omni.py:782]              'requests': 1,
INFO 01-16 05:16:21 [omni.py:782]              'tokens': 0,
INFO 01-16 05:16:21 [omni.py:782]              'total_time_ms': 299.47686195373535,
INFO 01-16 05:16:21 [omni.py:782]              'avg_time_per_request_ms': 299.47686195373535,
INFO 01-16 05:16:21 [omni.py:782]              'avg_tokens_per_s': 0.0}],
INFO 01-16 05:16:21 [omni.py:782]  'transfers': [{'from_stage': 0,
INFO 01-16 05:16:21 [omni.py:782]                 'to_stage': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_bytes': 49699758,
INFO 01-16 05:16:21 [omni.py:782]                 'total_time_ms': 69.85020637512207,
INFO 01-16 05:16:21 [omni.py:782]                 'tx_mbps': 5692.153031943067,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_bytes': 49699758,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_time_ms': 65.75202941894531,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_mbps': 6046.932201387521,
INFO 01-16 05:16:21 [omni.py:782]                 'total_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_transfer_time_ms': 138.01193237304688,
INFO 01-16 05:16:21 [omni.py:782]                 'total_mbps': 2880.8962903677825},
INFO 01-16 05:16:21 [omni.py:782]                {'from_stage': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'to_stage': 2,
INFO 01-16 05:16:21 [omni.py:782]                 'samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_bytes': 6300772,
INFO 01-16 05:16:21 [omni.py:782]                 'total_time_ms': 7.778406143188477,
INFO 01-16 05:16:21 [omni.py:782]                 'tx_mbps': 6480.270517134222,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_bytes': 6300772,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_total_time_ms': 11.675119400024414,
INFO 01-16 05:16:21 [omni.py:782]                 'rx_mbps': 4317.401327809512,
INFO 01-16 05:16:21 [omni.py:782]                 'total_samples': 1,
INFO 01-16 05:16:21 [omni.py:782]                 'total_transfer_time_ms': 21.28005027770996,
INFO 01-16 05:16:21 [omni.py:782]                 'total_mbps': 2368.705681715355}]}
[Stage-1] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
[Stage-2] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
[Stage-0] INFO 01-16 05:16:21 [omni_stage.py:675] Received shutdown signal
Request ID: 0_1a2bd30d-910c-4ad5-81ed-f19b9571836a, Text saved to output_audio/0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.txt
Request ID: 0_1a2bd30d-910c-4ad5-81ed-f19b9571836a, Saved audio to output_audio/output_0_1a2bd30d-910c-4ad5-81ed-f19b9571836a.wav
(Worker_TP1 pid=747414) (Worker pid=752772) [Stage-0] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
(Worker pid=752775) (Worker_TP0 pid=747413) [Stage-1] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
[Stage-2] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker
[Stage-0] INFO 01-16 05:16:38 [multiproc_executor.py:707] Parent process exited, terminating worker


JaredforReal and others added 30 commits January 8, 2026 17:55
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
…e_prior_tokens

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
… Qwen3 Omni Thinker is not finished)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…RequestState is not finished)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…in Ring Attention (vllm-project#767)

Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
…rmat

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
…#722)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 236f73e419

Comment on lines 223 to +226
kv_transfer_params=kv_transfer_params,
trace_headers=request.trace_headers,
num_cached_tokens=request.num_cached_tokens,
routed_experts=routed_experts,
P1: Align EngineCoreOutput with routed_experts field

OmniARScheduler now always passes routed_experts into EngineCoreOutput. In vllm_omni, EngineCoreOutput is patched to OmniEngineCoreOutput (see vllm_omni/patch.py), which does not define a routed_experts field (vllm_omni/engine/__init__.py). msgspec.Struct will raise TypeError: __init__() got an unexpected keyword argument 'routed_experts', so any request that produces output will crash even when routed_experts is None. Either add the field to OmniEngineCoreOutput or guard this keyword by engine type/feature flag.
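
The failure mode described in this comment can be reproduced without vLLM. The sketch below is only an illustration: it uses a standard-library dataclass as a stand-in for `msgspec.Struct` (an assumption made to keep it dependency-free; both reject unknown keyword arguments with a `TypeError`), and `OmniOutputWithoutField` / `OmniOutputWithField` are hypothetical names, not the real `OmniEngineCoreOutput`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class OmniOutputWithoutField:
    """Hypothetical stand-in for an output struct that lacks routed_experts."""
    request_id: str
    num_cached_tokens: int = 0


@dataclass
class OmniOutputWithField(OmniOutputWithoutField):
    """Same struct with routed_experts added as an optional, defaulted field."""
    routed_experts: Optional[Any] = None


def build(cls, **kwargs):
    """Return (instance, None) on success or (None, error) if kwargs are rejected."""
    try:
        return cls(**kwargs), None
    except TypeError as exc:
        return None, exc


# The scheduler passes routed_experts unconditionally, even when it is None.
kwargs = {"request_id": "req-0", "num_cached_tokens": 0, "routed_experts": None}

missing_obj, missing_err = build(OmniOutputWithoutField, **kwargs)
present_obj, present_err = build(OmniOutputWithField, **kwargs)

print("without field:", missing_err)
print("with field:", present_obj)
```

Adding the field with a `None` default keeps existing call sites that never pass `routed_experts` working, which is why it is the less invasive of the two options the comment proposes.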


Comment on lines +188 to +192
if tool_parsing_unavailable and request.tool_choice not in (
    None,
    "none",
):
    # for hf tokenizers, "auto" tools requires
    # --enable-auto-tool-choice and --tool-call-parser
    return self.create_error_response(
        '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set'
    )
if request.tool_choice == "auto" and not self.enable_auto_tools:
P2: Reject auto tool choice when no tool parser

With tool_parsing_unavailable true, the new logic only errors for tool_choice="auto" when enable_auto_tools is false; if enable_auto_tools is true but tool_parser is still None (HF tokenizers, not Harmony/Mistral), the request is accepted. Downstream parsing is gated on tool_parser being set, so auto tool calls are never extracted and the tool choice is silently ignored. This is user-visible for requests with tools but no --tool-call-parser. Consider restoring the requirement that auto needs a parser or explicitly rejecting when tool_parsing_unavailable and tool_choice is not none.
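
One way to express the stricter rule suggested here is a small pure function. This is a sketch under assumptions: `validate_tool_choice` and its parameters are hypothetical, it covers only the HF-tokenizer path, and it deliberately ignores the Harmony/Mistral cases the comment mentions. The error text is copied from the snippet under review.

```python
from typing import Optional

AUTO_ERROR = (
    '"auto" tool choice requires --enable-auto-tool-choice '
    "and --tool-call-parser to be set"
)


def validate_tool_choice(
    tool_choice: Optional[str],
    enable_auto_tools: bool,
    has_tool_parser: bool,
) -> Optional[str]:
    """Return an error message if the request should be rejected, else None.

    Hypothetical helper: "auto" is accepted only when auto tool choice is
    enabled AND a tool parser is configured, so that a tool call can never
    be silently dropped downstream.
    """
    if tool_choice in (None, "none"):
        return None  # no tool calling requested, nothing to validate
    if tool_choice == "auto" and not (enable_auto_tools and has_tool_parser):
        return AUTO_ERROR
    return None


# The case the comment flags: auto tools enabled but no parser configured.
print(validate_tool_choice("auto", enable_auto_tools=True, has_tool_parser=False))
```

Returning the message instead of raising mirrors the `create_error_response` pattern in the quoted code, where validation produces a response object rather than an exception.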


david6666666 and others added 3 commits January 16, 2026 16:32
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@hsliuustc0106
Collaborator

Review from Gemini, based on the logs:

Based on the logs you provided, your system is running a multi-stage multimodal pipeline (likely Qwen2-VL or a similar Omni-model architecture) using an Orchestrator.

There are three main areas for optimization: Environment/Hardware, Configuration, and Logging/Observability.


1. Critical Performance Bottlenecks

Fix Attention Backend (High Impact)

Your logs show a major warning:

Using sdpa attention backend (flash_attention_2 not available or failed)

  • Problem: SDPA is significantly slower and more memory-intensive than Flash Attention 2, especially for high-resolution images or long sequences.

  • Optimization: Install flash-attn. Ensure your CUDA version and GPU (Ampere or newer, e.g., A100, H100, RTX 30/40 series) support it. This will drastically reduce the stage_gen_time_ms.

Eliminate RoPE Parameter Errors

Unrecognized keys in rope_parameters ... {'interleaved', 'mrope_section'}

  • Problem: This indicates a version mismatch between your transformers library and the model configuration. The model is falling back to default Rotary Positional Embeddings, which can degrade the quality of multimodal spatial understanding.

  • Optimization: Update your environment: pip install -U transformers accelerate.

Stage-1 Latency

Looking at the summary:

  • Stage-0: 152.9 tok/s (Fast)

  • Stage-1: 14.0 tok/s (Slow bottleneck)

  • Optimization: Stage-1 is where your heavy lifting happens. Since you are using Worker pid and multiproc_executor, consider rebalancing your GPU resources. If Stage-1 is your main LLM/Vision-Language block, it needs more compute or a more aggressive quantization (e.g., bitsandbytes 4-bit or AWQ).
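
The per-stage rates quoted here can be recomputed from the token counts and stage times in the `[Summary]` log of the PR description. The snippet below is a small check of that arithmetic, assuming the rate is simply tokens divided by stage time; the variable and function names are illustrative.

```python
# Values copied from the [Summary] log in the PR description.
WALL_TIME_MS = 71269.86789703369
STAGES = {
    0: {"tokens": 2316, "total_time_ms": 15139.074325561523},
    1: {"tokens": 774, "total_time_ms": 55089.155197143555},
}


def tokens_per_second(tokens: int, time_ms: float) -> float:
    """Throughput in tokens per second for one stage."""
    return tokens / (time_ms / 1000.0)


rates = {
    stage_id: tokens_per_second(s["tokens"], s["total_time_ms"])
    for stage_id, s in STAGES.items()
}

# Fraction of the end-to-end wall time spent in Stage-1.
stage1_share = STAGES[1]["total_time_ms"] / WALL_TIME_MS

for stage_id, rate in rates.items():
    print(f"stage {stage_id}: {rate:.2f} tok/s")
print(f"stage 1 share of wall time: {stage1_share:.1%}")
```

Under that reading, Stage-1 accounts for roughly three quarters of the wall time for this single request, which is consistent with treating it as the bottleneck.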


2. Infrastructure & Data Transfer

Fast vs. Slow Image Processor

The log mentions: The image processor ... is now loaded as a fast processor by default.

  • Optimization: If you are seeing CPU spikes during the "Adding requests" phase, explicitly set use_fast=True in your AutoProcessor loading code to suppress the warning and ensure you aren't hitting the "breaking change" fallback.

Inter-Stage Transfer (NCCL/Shared Memory)

  • Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.

  • Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
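
As a sanity check on the units, the logged `tx_mbps` and `total_mbps` values for the stage 0 to stage 1 transfer can be reproduced from `total_bytes` and the transfer times. The sketch below assumes the formula bytes × 8 / seconds / 10^6; that formula is an inference from the numbers, not something the log states.

```python
# Values copied from the 'transfers' entry (from_stage 0, to_stage 1) in the log.
TOTAL_BYTES = 49699758
TX_TIME_MS = 69.85020637512207
TOTAL_TRANSFER_TIME_MS = 138.01193237304688


def megabits_per_second(total_bytes: int, time_ms: float) -> float:
    """Rate in megabits per second (10**6 bits per second)."""
    return total_bytes * 8 / (time_ms / 1000.0) / 1e6


def megabytes_per_second(total_bytes: int, time_ms: float) -> float:
    """Rate in megabytes per second (10**6 bytes per second)."""
    return total_bytes / (time_ms / 1000.0) / 1e6


tx_mbit = megabits_per_second(TOTAL_BYTES, TX_TIME_MS)
total_mbit = megabits_per_second(TOTAL_BYTES, TOTAL_TRANSFER_TIME_MS)
tx_mbyte = megabytes_per_second(TOTAL_BYTES, TX_TIME_MS)

print(f"tx: {tx_mbit:.1f} Mbit/s ({tx_mbyte:.1f} MB/s)")
print(f"total: {total_mbit:.1f} Mbit/s")
```

Because the megabit formula matches the logged values, the `mbps` fields appear to be megabits per second, so the byte rate would be one eighth of the figures quoted in this section.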


3. Log Cleanup & Readability

The current logs are "noisy" because JSON metrics are being printed line-by-line via log_utils.py:550.

  • Single-Line JSON: Configure your logger to output the metrics dictionary as a single minimized JSON string. This makes it much easier for log aggregators (like ELK or Datadog) to parse.

  • Disable TQDM in Production: The progress bars (0%|...|) create massive amounts of "empty" line noise in file-based logs. Use disable=True in your tqdm instances when not sys.stdout.isatty().
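
Both suggestions can be sketched with the standard library alone. This is a minimal illustration, not the project's logging code: `format_metrics`, `progress_disabled`, and the logger name are hypothetical, and the metrics dictionary is a trimmed copy of the one in the log. The boolean from `progress_disabled()` is what would be passed as tqdm's `disable` argument.

```python
import io
import json
import logging
import sys


def format_metrics(metrics: dict) -> str:
    """Serialize a metrics dict as one compact JSON line (no newlines)."""
    return json.dumps(metrics, separators=(",", ":"))


def progress_disabled(stream=sys.stdout) -> bool:
    """True when the stream is not an interactive terminal.

    Intended use (assumption): tqdm(iterable, disable=progress_disabled()).
    """
    isatty = getattr(stream, "isatty", None)
    return not (callable(isatty) and isatty())


# Trimmed copy of the request-level metrics from the log above.
metrics = {
    "type": "request_level_metrics",
    "e2e_total_tokens": 3090,
    "stages": {0: {"num_tokens_out": 223}, 1: {"num_tokens_out": 774}},
}

# Log to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
logger = logging.getLogger("metrics_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(buffer))

logger.info("%s", format_metrics(metrics))
print(buffer.getvalue(), end="")
```

Note that `json.dumps` turns the integer stage ids into string keys, so a consumer parsing the line back would look up `"0"` rather than `0`.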


Summary Table of Optimization Tasks

| Category | Action Item | Expected Result |
| -- | -- | -- |
| Compute | Install flash-attn | 2x - 3x speedup in Stage-1 |
| Software | Update transformers | Fix RoPE errors & improve accuracy |
| Pipeline | Set use_fast=True for Processor | Avoid initialization overhead |
| IO | Move logs to single-line JSON | Reduce log volume by 80% |
| Model | Apply AWQ/GPTQ Quantization | Increase Stage-1 tok/s |

linyueqian and others added 2 commits January 16, 2026 12:31
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
@tzhouam
Collaborator Author

tzhouam commented Jan 16, 2026

> Inter-Stage Transfer (NCCL/Shared Memory)
> Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
>
> Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.

@natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?

@tzhouam
Collaborator Author

tzhouam commented Jan 16, 2026

> Eliminate RoPE Parameter Errors
> Unrecognized keys in rope_parameters ... {'interleaved', 'mrope_section'}
>
> Problem: This indicates a version mismatch between your transformers library and the model configuration. The model is falling back to default Rotary Positional Embeddings, which can degrade the quality of multimodal spatial understanding.
>
> Optimization: Update your environment: pip install -U transformers accelerate.

Will deal with this later.

@tzhouam
Collaborator Author

tzhouam commented Jan 16, 2026

> 3. Log Cleanup & Readability
> The current logs are "noisy" because JSON metrics are being printed line-by-line via log_utils.py:550.
>
> Single-Line JSON: Configure your logger to output the metrics dictionary as a single minimized JSON string. This makes it much easier for log aggregators (like ELK or Datadog) to parse.
>
> Disable TQDM in Production: The progress bars (0%|...|) create massive amounts of "empty" line noise in file-based logs. Use disable=True in your tqdm instances when not sys.stdout.isatty().

Please pay attention to this. @Bounty-hunter
In my view, we can leave it until the rebase is finished.

@natureofnature
Contributor

> > Inter-Stage Transfer (NCCL/Shared Memory)
> > Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
> >
> > Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
>
> @natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?

Will implement a version.

@hsliuustc0106
Collaborator

> > > Inter-Stage Transfer (NCCL/Shared Memory)
> > > Status: Your transfer speed is ~2800 MB/s to 6000 MB/s.
> > >
> > > Optimization: For a single-node setup, ensure you are using Shared Memory (shm) or NCCL for P2P transfers between workers. If these workers are on different GPUs, ensure NVLink is active.
> >
> > @natureofnature Are we going to develop NCCL based Connectors to further reduce the data transfer overhead?
>
> Will implement a version.

We can do it later, maybe in v0.15.0.

princepride and others added 7 commits January 17, 2026 15:21
Signed-off-by: princepride <wangzhipeng628@gmail.com>
…nd video input processing, and refining position handling for MRoPE. Adjustments made to the YAML configuration to disable async scheduling for consistency. Code cleanup and formatting improvements included.

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
….Linear (vllm-project#825)

Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@david6666666 david6666666 added this to the v0.14.0rc1 milestone Jan 19, 2026
@david6666666 david6666666 added the "high priority" label (high priority issue, needs to be done asap) Jan 19, 2026
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@tzhouam tzhouam merged commit 52d20a7 into vllm-project:dev/rebase_0.14.0 Jan 19, 2026
1 of 2 checks passed
ningzichun added a commit to unal-ai/vllm-omni that referenced this pull request Jan 20, 2026
* init and registry

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* implement glm_image_transformer.py

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* update transformer

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* init pipeline_glm_image.py

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* init pipeline_glm_image.py

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* remove pre process

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* add check_input(), implement CFG parallel in diffuse(), align generate_prior_tokens

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* fix check_input(prompt_embed), add KVCache for Image Edit

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* print out vllm version

Signed-off-by: root <root@hk01dgx039.cm.cluster>

* update model config

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update worker

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update one import in AsyncOmniLLM (not finish all, but can run)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update Qwen3 Omni ViT init based on updated interface (the update for Qwen3 Omni Thinker is not finished)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* Remove unnecessary override for OmniRequestState (the update for OmniRequestState is not finished)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update model runner dummy run

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update ar scheduler

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update _preprocess, execute model and sample_tokens for AR Model Runner

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* debug AR Scheduler

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update OmniGPUModelRunner._update_states

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update the offline LLM request sorting due to changed requested id format

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update Qwen3 Omni to fit with the engine core logic

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update generation model runner

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* debug GLM-Image Model

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* remove deleted args from doc string

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Model][Rebase] Add GLM-Image Model and Partial Rebase to v0.14.0 (Support AR Offiline) (vllm-project#763)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Co-authored-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: root <root@hk01dgx039.cm.cluster>

* disable async scheduling for generation models, avoiding inconsistency from race condition

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* Update Qwen 3 Omni

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* [Fix] GLM Image (vllm-project#799)

Signed-off-by: JaredforReal <w13431838023@gmail.com>

* support online serving for Qwen3 Omni

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* fix pre-commit

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* inherit engine outputs

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* supporting audio in video(not finished)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* Update Qwen2.5 Omni model to version 0.14, adding support for image and video input processing, and refining position handling for MRoPE. Adjustments made to the YAML configuration to disable async scheduling for consistency. Code cleanup and formatting improvements included.

Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>

* debug qwen 2.5 Omni

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update doc

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* rebase to vllm 0.14.0

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* unify query type

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* fix build doc

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* Dev/rebase 0.14.0 (vllm-project#813)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: TangPeng <85704592@qq.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: iwzbi <wzbi@zju.edu.cn>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Co-authored-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: root <root@hk01dgx039.cm.cluster>
Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>

* update test import

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update version from 0.14.0rc2 to 0.14.0

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* set vllm config for all CI

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* update CI

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>

* Fix CPU offload OOM and performance issues in GLM-Image pipeline

* Fix CPU offload OOM and performance issues in GLM-Image pipeline

- Conditionally load vision_language_encoder, text_encoder, and vae to GPU only when CPU offload is disabled
- Propagate cpu_offload_gb argument to enable_cpu_offload flag
- Include vision_language_encoder in CPU offload hooks for proper AR model offloading
- Fix device mismatch in generate_prior_tokens during CPU offload mode

* Fix shared memory broadcast hang in GLM-Image pipeline

- Add manual encoder activation support to SequentialOffloader
- Explicitly trigger vision_language_encoder onload before get_image_features in pipeline
- Prevents CPU-bound stalling during AR generation when offload is active

* Fix device mismatch in generate() by triggering offload hook

* Clean up temporary patch files

---------

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: root <root@hk01dgx039.cm.cluster>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Taichang Zhou <tzhouam@connect.ust.hk>
Signed-off-by: TangPeng <85704592@qq.com>
Signed-off-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Signed-off-by: mxuax <mxuax@connect.ust.hk>
Signed-off-by: Sihyeon Jang <sihyeon.jang@navercorp.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: iwzbi <wzbi@zju.edu.cn>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Signed-off-by: Alicia <115451386+congw729@users.noreply.github.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: David Chen <530634352@qq.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: John Liu BUAA <liukecheng97@gmail.com>
Signed-off-by: Dinesh G <G.Dinesh@ibm.com>
Signed-off-by: gDINESH13 <dinesh13g@gmail.com>
Co-authored-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: root <root@hk01dgx039.cm.cluster>
Co-authored-by: tzhouam <tzhouam@connect.ust.hk>
Co-authored-by: JustQJ <37905360+JustQJ@users.noreply.github.com>
Co-authored-by: XU Mingshi <91017482+mxuax@users.noreply.github.com>
Co-authored-by: Sihyeon Jang <uneedsihyeon@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: catcat <108673086+iwzbi@users.noreply.github.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
Co-authored-by: Yuhan Liu <30294295+liuyuhanalex@users.noreply.github.com>
Co-authored-by: Alicia <115451386+congw729@users.noreply.github.com>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com>
Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: John Liu BUAA <liukecheng97@gmail.com>
Co-authored-by: D!NE$H <67671800+gDINESH13@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>